All CategoriesBig Data Hadoop & Spark - Advanced

Data Serialization with Avro in Hive

This blog focuses on providing in-depth information of Avro in Hive. Here we have discussed the importance and necessity of Avro and how to implement it in Hive. Through this blog, you will get a clear idea about Avro and its implementation in your Hadoop projects.

What is Avro?

Avro is one of the preferred data serialization systems because of its language neutrality.
Due to lack of language portability in Hadoop writable classes, Avro becomes a natural choice because of its ability to handle multiple data formats which can be further processed by multiple languages.
Avro is also very much preferred for serializing the data in Hadoop.
It uses JSON for defining data types and protocols and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
By this, we can define Avro as a file format introduced with Hadoop to store data in a predefined format.This file format can be used in any of the Hadoop’s tools like Pig and Hive.

100% Free Course On Big Data Essentials

Subscribe to our blog and get access to this course ABSOLUTELY FREE.

Implementing Avro file format in Hive

Before we take a look at how the Avro file format is implemented, let’s have a quick introduction to the Avro schema and how to create Avro records, Hive tables and much more.

Avro Schema

Avro relies on a schema. When Avro data is read, the schema used for writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates using it with dynamic scripting languages, since data together with its schema, is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other’s full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.
Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.Using Avro, we can convert unstructured and semi-structured data into properly structured data using its schemas.

Creating a table to store the data in Avro format

This process is initiated with the creation of JSON based schema to serialize data in a format that has a schema built in.
Avro has its own parser to return the provided schema as an object.
The created object allows us to create records with that schema.
We can create our schema inside the table properties while creating a Hive table
TBLPROPERTIES (‘avro.schema.literal’='{json schema here}’);

Now, let’s create an Avro file format for olympic data.You can download the data from the following link:
Olympic Dataset

Creating a Hive Table:

create table olympic_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":"string"},
{"name":"age", "type":"int"},
{"name":"country", "type":"string"},
{"name":"year", "type":"string"},
{"name":"closing", "type":"string"},
{"name":"sport", "type":"string"},
{"name":"gold", "type":"int"},
{"name":"silver", "type":"int"},
{"name":"bronze", "type":"int"},
{"name":"total", "type":"int"}
]}');

Inside the tblproperties you can see the schema of the data. Every record inside the tblproperties will become a column. Here, ‘Name’ defines the column name and ‘type’ defines the datatype of the particular column.The above fields are for olympic data.
We can view the fields inside the table by using the command describe olympic_avro.


We have successfully created an avro table!
Here’s the in depth explanation behind the commands used for creating the table:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'

We have used the above command while creating the table. This means that the data providing is serialized and de-serialized as well.If we have our own serialized data, there is no need to use this line. Else, the above line should be present, which acts as the default serialize for the data.

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'

This line says that we are storing the file as AvroContainerInputFormat which is present in the default hive package as specified above. When using this line, the file will be stored with extension .avro, which is an extension of Avro.

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

This line says that the output of the file is also in Avro format and will have the extension .avro .
The next step is to insert the data into the table.
A default parser will act on the schema and creates an object for the schema, which in turn accommodates the type of data which suits the created schema.Now the data can be inserted into the Avro table created with the same schema.

Data Insertion into Avro Table:

There are 2 methods by which the data can be inserted into an Avro table:
1. If we have a file with extension ‘.avro’ and the schema of the file is the same as what you specified, then you can directly import the file using the command

LOAD DATA LOCAL  INPATH  'PATH OF THE FILE';

2. You can copy the contents of a previously created table into the newly created Avro table.
Let’s take a look at the second type of data insertion technique to import data into an Avro table. We will begin by creating a table which is delimited by tab space.
Text File Table Creation:

create tableolympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronzeINT,total INT) row format delimited fields terminated by '\t' stored as textfile;

Here, we are creating a table with the name ‘olympic’. The schema of the table is as specified and the data inside the input file is delimited by Tab space. At the end of the command, we have specified ‘stored as textfile’, which means that we are using a TEXTFILE format. The schema of the created table can be checked using the command describe olympic;
Now, the data can be loaded into the created table as follows:
load data local inpath ‘path of your file in local’ into table olympic;


We have successfully loaded the data from the input file data into the table in TEXTFILE format.
Now, to copy the data from the ‘olympic’ table to the newly created Avro table, the command
insert overwrite table olympic_avro select * from olympic limit 20; can be used for the same.


This olympic data has some NULL values and the Avro schema cannot handle the null values by default. To make the Avro table work with NULL values, the schema of the table needs to be changed. We will see how this is done as well.
Now, let’s perform average operation on the Avro table using the command

select AVG(age) from olympic_avro;


You can see that we have successfully performed average operation on our newly created Avro table.
Output: The average age of the athletes is 23.65
We have successfully created an Avro table, imported data into the Avro and performed a average operation on the available data.

Handling NULL Values in Avro Table:

Like mentioned earlier, the schema of the table needs to be changed to make the table accept NULL values as well. In this case,the schema will be as shown below:

create table olympic_avro1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"namespace":"com.example.avro",
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":["string","null"],"default":null},
{"name":"age", "type":["int","null"],"default":0},
{"name":"country", "type":["string","null"],"default":null},
{"name":"year", "type":["string","null"],"default":null},
{"name":"closing", "type":["string","null"],"default":null},
{"name":"sport", "type":["string","null"],"default":null},
{"name":"gold", "type":["int","null"],"default":0},
{"name":"silver", "type":["int","null"],"default":0},
{"name":"bronze", "type":["int","null"],"default":0},
{"name":"total", "type":["int","null"],"default":0}
]}');

The difference between the normal Avro schema and the above-specified schema is as follows:
In the type in above code, we have given two values, one is the data type and the other is null, which means that if the value is the specified data type, it accepts the value, else, the value is NULL. It will consider the default value which we have given with attribute default.Here we have given the default value as null for string and 0(Zero) for int.

NOTE: When we specify the default value as null or 0,we need to specify it in double quotes.
Now, let’s create a table with the above-specified schema which can accept NULL values as well.


We can see the columns created in the table using the below command:

describe olympic_avro1


This table will now accept NULL values as well.
Let’s try loading some data containing NULL values into the table. We know that the olympic data has some NULL values in it, so let us try to load the contents of the olympic table with the newly created table using the command

insert overwrite table olympic_avro1 select * from olympic


We have successfully loaded the contents of the olympic table containing NULL values into olympic_avro1 table.
Now, let’s try performing the AVERAGE operation on the newly created table using the below command:

select AVG(age) from olympic_avro1


We have successfully performed the AVERAGE operation on the entire olympic data.

Output: The average age of athletes is found to be 26.405433646812956

This is how NULL values can be handled in Avro tables.

Hadoop

Downloading the .avro File for the Created Avro Table:

We can see the Avro schema of the created table and also that a table is created with extension ‘.avro’.
Follow the below steps to download the file with this extension (‘.avro’):

Note:Make sure that all your Hadoop daemons are running.
Step 1: Open your browser and type localhost:50070

Step 2: In that page, click on ‘utilities’and then click on ‘Browse the file system’.

Step 3: You can see the list of files available in your HDFS. In this list of click on ‘user’.

Inside the user directory, Hive warehouse will be stored and within this warehouse, all the Hive tables will be stored.


Step 4: After clicking on the user,you can see a list of folders. From this list click on ‘Hive’.

Step 5: After clicking on ‘Hive’, click on ‘Warehouse’.

After clicking on ‘Warehouse’, you can see all the databases and tables created in Hive
Step 6: From the list of tables, select the Avro table you had created.

Note:
In this case, no database has been created, so the default database has been used. Consequently, all the tables will be listed on the home page itself.If the table has been created inside the database, the table should be then checked inside that database.

After clicking on the table,you can now see a table with extension ‘.avro’.


Step 7: As soon as you click on the file you will get an option to download it.

The file will be downloaded into your ‘Downloads’ folder.
Now to view the Avro schema, go to the location where your downloaded .avro extension file (using the terminal) is present and then type the below command:
cat filename(in this case file name is 000000_0.avro)

In the above image, you can see the schema of the created Avro table. It is quite hard to understand, so we need to convert it into JSON format to understand the schema.

Converting Avro to JSON

To convert the Avro file into JSON we need to download a jar file called
‘avro-tools-1.7.5 jar’, which contains the option to convert the Avro file into JSON. You can download the jar file from the following link
avro-tools-1.7.5 jar

This jar file will also be downloaded in the ‘Downloads’ folder. From the terminal move to the ‘Downloads’ folder and then type the command with below syntax.

java -jar avro-tools-1.7.5.jar tojson 'avro file name' >newfilename.json

In our case the command will be as follows:

java -jar avro-tools-1.7.5.jar tojson 000000_0.avro >olympic.json

Now, a file with name ‘olympic.json’ will be created with the json schema of the Avro file. To see the contents of this json file use the below command:

cat olympic.json

Now, you can see the contents of your table with JSON format.
As mentioned above, the Avro parser has converted the schema into an object and the object will contain the data.In the above image, you can see that inside the schema there are the contents which say that the created object is holding our content.
Loading Files with .avro Extension into an Avro Table:
Let’s see how to load the file with extension ‘.avro’ directly into the Avro table as mentioned in the section ” Data Insertion into Avro Table
Let us create another Avro table with the name ‘olympic_avro2’ and the schema as follows:

create table olympic_avro2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":"string"},
{"name":"age", "type":"int"},
{"name":"country", "type":"string"},
{"name":"year", "type":"string"},
{"name":"closing", "type":"string"},
{"name":"sport", "type":"string"},
{"name":"gold", "type":"int"},
{"name":"silver", "type":"int"},
{"name":"bronze", "type":"int"},
{"name":"total", "type":"int"}
]}');

We can see the columns we have created using the command describe olympic_avro2
Now, let’s load the .avro extension file, 000000_0.avro using the command

LOAD DATA LOCAL INPATH '/path of the file' into table olympic_avro2

We have successfully loaded the .avro file into the Avro table. Now, let’s perform the AVERAGE operation on the newly created ‘olympic_avro2’ table, using the below command:

select AVG(age) from olympic_avro2;


Output: We have got the AVERAGE age of athletes as 23.65, which is the same as what we got in the first case.

Let’s summarize:

• Created Avro table.
• Discussed the two techniques of loading data into the Avro table.
• Performed queries on the data inside the Avro table
• Downloaded the file with ‘.avro’ extension.
• Viewed the Avro schema.
• Converted the Avro schema into JSON.
• Viewed the contents of table in JSON format.

Keep visiting our website Acadgild for more updates on Big Data and other technologies. Click here to learn Big Data Hadoop Development.

Suggested Reading

Parquet File Format

Working with Spark Avro and Spark Parquet Files

Related Popular Courses:

WHAT IS HADOOP BIG DATA

APPLICATION ANDROID COURSE

KAFKA PRODUCER EXAMPLE

DATA SCIENCE ONLINE TRAINING

BIG DATA ANALYTICS COURSE

Hadoop

Tags

3 Comments

  1. thanks for this great article.
    btw, can I load avro file into orc table directly?
    I mean there are large amount of avro files on hdfs
    then I want to load that avro files into orc table with out through hive avro table.
    query like
    load data /path/data.avro into table_orc

  2. Hi Kiran,
    can you explain bit more about this two lines.
    Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close