Big Data Hadoop & Spark

Hive HBase Integration

A brief introduction to Apache Hive:

Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides an SQL-like language called HiveQL for querying the data, which makes it friendly and familiar to users who are used to querying data with SQL.

Hive is best suited for data warehousing applications, where data is stored, mined, and reported on. Apache Hive bridges the gap between data warehouse applications and Hadoop, since relational database models are the base of most data warehousing applications.


Hive resides on top of Hadoop to summarize Big Data and make querying and analysis easy. HiveQL also allows traditional map/reduce programmers to plug in their own custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language.
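As a minimal sketch of that plug-in capability (the script, table, and column names below are hypothetical, not from this post), Hive's TRANSFORM clause streams rows through a custom script:

-- hypothetical example: stream rows through a custom Python script
add file /path/to/salary_bands.py;
select transform (eid, salary)
using 'python salary_bands.py'
as (eid string, salary_band string)
from employees;  -- hypothetical table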

Why do we need to integrate Apache Hive with HBase?

Hive can store information about hundreds of millions of users effortlessly, but it faces difficulties when it comes to keeping the warehouse up to date with the latest information. Apache Hive uses HDFS as its underlying storage, which comes with limitations such as append-only, block-oriented files. This makes it impossible to apply individual updates directly to warehouse tables. Until now, the only practical option to overcome this limitation has been to pull snapshots from MySQL databases and dump them into new Hive partitions. This expensive operation of copying data from one location to another is performed infrequently (leading to stale data in the warehouse), and it also does not scale well as the data volume continues to grow.

To overcome this problem, Apache HBase can be used in place of MySQL, together with Hive.

What is HBase?

HBase is a scale-out table store which can support a very high rate of row-level updates over a large amount of data. HBase solves Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging data intelligently based on data distribution changes.

Since HBase is based on Hadoop, integrating it with Hive is fairly straightforward, as HBase tables can be accessed like native Hive tables. As a result, a single Hive query can perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive's INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.

How is HBase integrated with Hive?

To integrate HBase with Hive, Hive's Storage Handlers are used.

Storage Handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to treat an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage Handlers are not limited to NoSQL databases; a storage handler can be designed for many different kinds of data stores.

Here is an example of connecting Hive with HBase using a Storage Handler.


Create the HBase Table:

create 'employee','personaldetails','deptdetails'

The above statement will create the ‘employee’ table with two column families: ‘personaldetails’ and ‘deptdetails’.
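To confirm the layout, you can describe the table from the HBase shell (output omitted here):

describe 'employee'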

Insert data into the HBase table:

hbase(main):049:0> put 'employee','eid01','personaldetails:fname','Brundesh'
0 row(s) in 0.1030 seconds
hbase(main):050:0> put 'employee','eid01','personaldetails:Lname','R'
0 row(s) in 0.0160 seconds
hbase(main):051:0> put 'employee','eid01','personaldetails:salary','10000'
0 row(s) in 0.0090 seconds
hbase(main):060:0> put 'employee','eid01','deptdetails:name','R&D'
0 row(s) in 0.0680 seconds
hbase(main):061:0> put 'employee','eid01','deptdetails:location','Banglore'
0 row(s) in 0.0140 seconds
hbase(main):067:0>  put 'employee','eid02','personaldetails:fname','Abhay'
0 row(s) in 0.0080 seconds
hbase(main):068:0>  put 'employee','eid02','personaldetails:Lname','Kumar'
0 row(s) in 0.0080 seconds
hbase(main):069:0>  put 'employee','eid02','personaldetails:salary','100000'
0 row(s) in 0.0090 seconds
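To verify the inserts, scan the table from the HBase shell (output omitted here):

scan 'employee'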

Now create Hive tables pointing to the HBase table.

If there are multiple column families in HBase, we can create one Hive table for each column family. In this case, we have two column families, so we create two tables, one for each column family (the first is shown below, and a sketch of the second follows).

Table for personal details column family:

create external table employee_hbase(Eid String, f_name string, s_name string, salary int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
 with serdeproperties ("hbase.columns.mapping"=":key,personaldetails:fname,personaldetails:Lname,personaldetails:salary")
 tblproperties("hbase.table.name"="employee");

When creating a non-native Hive table using a Storage Handler, we must specify the STORED BY clause.

Note: there are different storage handler classes for different databases.

hbase.columns.mapping: maps the Hive columns to the HBase columns. The first entry must be :key, which corresponds to the HBase row key.
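For completeness, here is a matching Hive table for the deptdetails column family (the Hive table and column names below are illustrative, not from the original post; both Hive tables can map to the same HBase table):

create external table employee_dept_hbase(Eid string, dept_name string, dept_location string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,deptdetails:name,deptdetails:location")
tblproperties("hbase.table.name"="employee");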

Now we can query the HBase table with SQL in Hive using the below command.

select * from employee_hbase;
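As a sketch of the cross-table operations mentioned earlier (assuming the illustrative employee_dept_hbase table above, plus a hypothetical native Hive table employee_staging with matching columns), a single query can join the two mappings, and INSERT can move data into the HBase-backed table:

-- join the two Hive-on-HBase mappings on the shared row key
select p.Eid, p.f_name, d.dept_name, d.dept_location
from employee_hbase p
join employee_dept_hbase d on p.Eid = d.Eid;

-- copy rows from a hypothetical native Hive table into the HBase-backed table
insert into table employee_hbase
select eid, fname, lname, salary from employee_staging;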

The above query, run from the Hive terminal, will return all the data from the HBase table. We hope going through this blog helps you integrate Hive and HBase and build a useful SQL interface on top of HBase.

Keep visiting our site AcadGild for more updates on Big Data and other technologies.


