
Loading and Storing Hive Data into Pig

In this tutorial, we will demonstrate how to load Hive data into Pig using HCatLoader and how to store data from Pig back into Hive.

To load and store Hive data in Pig, we need to use HCatalog. HCatalog is a table and storage management layer for Hadoop. It enables users of different data processing tools, such as Pig and MapReduce, to read and write data on the grid more easily. Hive includes HCatalog by default.

Now we need to download two jar files and add them to the Hive and HCatalog libraries. Below are the details:

Download the slf4j-api-*.jar file from the link below and add the jar file to the $HIVE_HOME/lib directory.

https://drive.google.com/open?id=0ByJLBTmJojjzTW9Vc2VVRGJSMlk

Download the hive-hcatalog-hbase-storage-handler jar file from the link below and add it to the HCatalog lib directory.

https://drive.google.com/open?id=0ByJLBTmJojjzUkw4NlhhdXdzRkE

The HCatalog directory will be present inside the Hive folder itself. Inside the HCatalog directory, create a folder named lib and add the above jar file to it.
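As a sketch, the jar placement steps above look like this in a shell (the download location and exact jar file names are placeholders; they depend on where Hive is installed and which versions you downloaded):

```shell
# Assumption: $HIVE_HOME points at your Hive installation directory
# and the jars were saved to ~/Downloads

# the slf4j jar goes into Hive's lib directory
cp ~/Downloads/slf4j-api-*.jar $HIVE_HOME/lib/

# the storage-handler jar goes into a lib folder inside the hcatalog directory
mkdir -p $HIVE_HOME/hcatalog/lib
cp ~/Downloads/hive-hcatalog-hbase-storage-handler-*.jar $HIVE_HOME/hcatalog/lib/
```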

Now, for Pig, we need to set a few properties involving $HADOOP_HOME, $HIVE_HOME, and $HCAT_HOME.

Open the pig.properties file, which is present in the $PIG_HOME/conf directory, and set the properties below.

export HADOOP_HOME=<Path to Hadoop installed directory>/etc/hadoop
export HCAT_HOME=<Path to Hcatalog directory>
export HIVE_HOME=<Path to Hive installed directory>
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hive-hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hive-hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar
export PIG_OPTS=-Dhive.metastore.uris=thrift://<Host_name>:<Port>

If you do not have hive.metastore.uris set, you need to configure it in your hive-site.xml by adding the property below.

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>URI for client to contact metastore server</description>
</property>

Download the dataset used in this blog from this link:

https://drive.google.com/open?id=0ByJLBTmJojjzV1czX3Nha0R3bTQ

Now everything is set for loading Hive data into Pig. We already have a few tables present in Hive; we will load this data into Pig using HCatalog. Here is the data present in Hive:

hive> show tables;
OK
olympic
pokemon
pokemon1
pokemon2
Time taken: 1.046 seconds, Fetched: 4 row(s)
hive> describe olympic;
OK
olympic_athelete string
olympic_age int
olympic_country string
olympic_year string
olympic_closing string
olympic_sport string
olympic_gold int
olympic_silver int
olympic_bronze int
olympic_total int
Time taken: 0.254 seconds, Fetched: 10 row(s)
hive> select * from olympic limit 5;
OK
Michael Phelps 23 United States 2008 8/24/2008 Swimming 8 0 0 8
Michael Phelps 19 United States 2004 8/29/2004 Swimming 6 0 2 8
Michael Phelps 27 United States 2012 8/12/2012 Swimming 4 2 0 6
Natalie Coughlin 25 United States 2008 8/24/2008 Swimming 1 2 3 6
Aleksey Nemov 24 Russia 2000 10/1/2000 Gymnastics 2 1 3 6
Time taken: 0.601 seconds, Fetched: 5 row(s)
hive>

Now, when we compute the average age of the athletes in Hive using the query select avg(olympic_age) from olympic, we get the result:

26.405433646812956

We have an olympic table with Olympics data loaded into it; now we will load the same data into Pig using HCatalog and perform the above average query in Pig.

To load the Hive data into Pig using HCatalog, we need to start Pig with the -useHCatalog flag, i.e., pig -useHCatalog

Before doing this, please make sure that you have set $HCAT_HOME in your .bashrc file. Open the file using the command gedit .bashrc and add the lines below.

#SET HCAT_HOME
export HCAT_HOME=$HIVE_HOME/hcatalog
export PATH=$PATH:$HCAT_HOME/bin

After adding the above lines, save and close the file, then reload it using the command source .bashrc
Note: Before loading Hive data into a Pig relation, make sure that the Hive metastore service has been started using the command below.
hive --service metastore
Keep the Hive metastore service running in one terminal and use Pig in another terminal.
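The two-terminal workflow described above can be sketched as follows (assuming hive and pig are on your PATH):

```shell
# Terminal 1: start the Hive metastore service and leave it running
hive --service metastore

# Terminal 2: start Pig with the HCatalog jars added to the classpath
pig -useHCatalog
```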
Now, to load the Hive data into Pig, Pig uses the HCatLoader() function, and it looks like this:

A = LOAD 'table_name' USING org.apache.hive.hcatalog.pig.HCatLoader();

Note: When loading Hive data into Pig using HCatLoader, make sure the class name is spelled HCatLoader; otherwise you will get an error like this:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

As our table name is olympic, the relation looks like this:

A = LOAD 'olympic' USING org.apache.hive.hcatalog.pig.HCatLoader();

Now the data will be loaded into the Pig relation. If you then run describe A, you can see that the schema of the relation is as follows:

grunt> describe A;
2016-10-10 17:01:58,836 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
A: {olympic_athelete: chararray,olympic_age: int,olympic_country: chararray,olympic_year: chararray,olympic_closing: chararray,olympic_sport: chararray,olympic_gold: int,olympic_silver: int,olympic_bronze: int,olympic_total: int}
B = group A all;
C = foreach B generate AVG(A.olympic_age);
dump C;

We get the result 26.405433646812956, matching the Hive query above.

Next we will filter for the athletes whose age is 30 or above:

D = filter A by olympic_age>=30;

Now we will find the average age of the athletes who are 30 or older:

E = group D all;

F = foreach E generate AVG(D.olympic_age);

dump F;

Here we get the result 33.35679374389052.

So we have successfully loaded Hive data into a Pig relation. Now we will export the data in relation D back into Hive and check the average age of the athletes who are 30 or older. How can we do that? It is very simple.

STORE D INTO 'tablename' USING org.apache.hive.hcatalog.pig.HCatStorer();

Before storing, make sure that the table name you give here already exists in Hive.

Now we will create a table named olympics in Hive using the same schema.

create table olympics(olympic_athelete STRING,olympic_age INT,olympic_country STRING,olympic_year STRING,olympic_closing STRING,olympic_sport STRING,olympic_gold INT,olympic_silver INT,olympic_bronze INT,olympic_total INT) row format delimited fields terminated by '\t' stored as textfile;

We will store the data in relation D into the olympics table as follows:

STORE D INTO 'olympics' USING org.apache.hive.hcatalog.pig.HCatStorer();
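Putting the pieces together, the whole Grunt session from load to store can be collected into a single Pig script (a sketch based on the statements above; it assumes the Hive tables olympic and olympics exist and that Pig was started with -useHCatalog):

```pig
-- load the Hive table olympic through HCatalog
A = LOAD 'olympic' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- overall average age of the athletes
B = group A all;
C = foreach B generate AVG(A.olympic_age);
dump C;

-- athletes aged 30 or older, and their average age
D = filter A by olympic_age >= 30;
E = group D all;
F = foreach E generate AVG(D.olympic_age);
dump F;

-- write the filtered rows back into the Hive table olympics
STORE D INTO 'olympics' USING org.apache.hive.hcatalog.pig.HCatStorer();
```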

All the data for the athletes whose age >= 30 will be stored in the table olympics.
Now we will open Hive and compute the average age of the athletes whose age >= 30:

select avg(olympic_age) from olympics;

We get the output 33.35679374389052, which is the same result we got in Pig.

We hope this blog helps you understand how to load Hive data into a Pig relation and how to store data from a Pig relation into a Hive table. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.


5 Comments

  1. Hi, Thanks for the post.
    I followed the above steps but while trying to load the data from hive, i got the below error
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.hive.hcatalog.pig.HCatLoader
    Request your assistance on the same.

    1. Hi Silvi,
      Along with the steps written above, you need to start "hiveserver2" in thrift mode.
      And if you have followed the above steps in the blog, then you are ready to rock!
      All the best.


  2. Pingback: Data Analysis Using Apache Hive and Apache Pig | Treselle Systems | Big Data, Technology & Integration, Quality Assurance
