Free Shipping

Secure Payment

easy returns

24/7 support

Integrating Hive with HBase

 July 9  | 0 Comments

Brief introduction to Hive:

Apache Hive is a data warehouse software that facilitates querying and managing of large datasets residing in distributed storage. Hive provides SQL-like language called HiveQL for querying the data. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data.

Hive is best suited for Data Warehousing Applications where data is stored, mined and reporting is done based on processing. Hive bridges the gap between datawarehouse applications and hadoop as relational database models are the base of most data warehousing applications.

It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Why do we need to integrate Hive with HBase?

Hive can store information of hundreds of millions of users effortlessly, but faces some difficulties when it comes to keeping the warehouse up to date with the latest information. Hive uses HDFS as an underlying storage which come with limitations like append-only, block-oriented storage. This makes it impossible to directly apply individual updates to warehouse tables. Up till now the only practical option to overcome this limitation is to pull the snapshots from MySQL databases and dump them to new Hive partitions. This expensive operation of pulling the data from one location to another location is not frequently practiced.(leading to stale data in the warehouse), and it also does not scale well as  the data volume continues to shoot through the roof.

To overcome this problem,  Apache HBase is used in place of MySQL, with Hive.

What is HBase?

HBase is a scale-out table store which can support a very high rate of row-level updates over a large amount of data.  HBase solves Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging data intelligently based on data distribution changes.

Since HBase is based on Hadoop, Integrating it with Hive is pretty straightforward as HBase tables can be accessed like native Hive tables.  As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables.  Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within the HBase itself.

How is HBase integrated with Hive?

For integrating HBase with Hive, Storage Handlers in Hive is used.

Storage Handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to identify an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, and Amazon DynamoDB. Storage Handlers are not only limited to NoSQL databases, a storage handler could be designed for several different kinds of data stores.

Here the example for connecting Hive with HBase using HiveStorageHandler.

Create the HBase Table:

‘personaldetails’ and ‘deptdetails’The above statement will create ‘employee’ with two columns families

Insert the data into HBase table:

Now create the Hive table pointing to HBase table.

If there are multiple columns family in HBase, we can create one table for each column families. In this case we have 2 column families and we are creating two tables, one for each column families.

Table for personal details column family:

If we are creating the non-native Hive table using Storage Handler then we should specify the STORED BY clause

Note: There are different classes for different data bases

hbase.columns.mapping : It is used to map the Hive columns with the HBase columns. The first column must be the key column which would also be same as the HBase’s rowkey column.

Now we can query the HBase table with SQL queries in hiveusing the below command.

We hope going through this blog will help you in the integration of hive and hbase and help in building the useful sql interface on the top of Hbase.Above query fired from hive terminal  will yield all the data from hbase table.

>