A brief introduction to Apache Hive:
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. It provides an SQL-like language called HiveQL for querying the data, which makes Hive friendlier and more familiar to users who are accustomed to using SQL.
Hive is best suited for data warehousing applications, where data is stored, mined, and reported on. Apache Hive bridges the gap between data warehouse applications and Hadoop, since relational database models underpin most data warehousing applications.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy. HiveQL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
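As a sketch of the custom map/reduce hook mentioned above, Hive's TRANSFORM clause lets a query stream rows through an external script. The table, columns, and script below are illustrative assumptions, not from the original post:

```sql
-- Stream each row of a hypothetical 'logs' table through a custom
-- Python script acting as a mapper. All names here are illustrative.
ADD FILE /tmp/parse_log.py;

SELECT TRANSFORM (line)
  USING 'python parse_log.py'
  AS (ip STRING, url STRING, status INT)
FROM logs;
```

The script reads tab-separated rows on stdin and writes tab-separated rows to stdout, which Hive maps back onto the declared output columns.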
Why do we need to integrate Apache Hive with HBase?
Hive can store information about hundreds of millions of users effortlessly, but it faces difficulties keeping the warehouse up to date with the latest information. Apache Hive uses HDFS as its underlying storage, which comes with limitations such as append-only, block-oriented files. This makes it impossible to apply individual updates directly to warehouse tables. Until now, the only practical way around this limitation has been to pull snapshots from MySQL databases and dump them into new Hive partitions. This expensive copy operation is not practiced frequently (leading to stale data in the warehouse), and it also does not scale well as the data volume continues to shoot through the roof.
To overcome this problem, Apache HBase is used in place of MySQL, with Hive.
What is HBase?
HBase is a scale-out table store which can support a very high rate of row-level updates over a large amount of data. HBase solves Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging data intelligently based on data distribution changes.
Since HBase is based on Hadoop, integrating it with Hive is pretty straightforward, as HBase tables can be accessed like native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive's INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.
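For instance, assuming an HBase-backed table named employee_hbase and two native Hive tables named departments and employee_archive (all three names are illustrative, not from the original post), a single query can span both stores:

```sql
-- Join an HBase-backed table with a native Hive table in one query.
-- Table and column names are illustrative assumptions.
SELECT e.eid, e.f_name, d.dept_name
FROM employee_hbase e
JOIN departments d
  ON (e.eid = d.eid);

-- Likewise, INSERT can copy rows out of HBase into a native Hive table:
INSERT OVERWRITE TABLE employee_archive
SELECT * FROM employee_hbase;
```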
How is HBase integrated with Hive?
To integrate HBase with Hive, Hive's Storage Handlers are used.
Storage Handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to treat an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage Handlers are not limited to NoSQL databases; a storage handler can be designed for many different kinds of data stores.
Here is an example of connecting Hive with HBase using the HBase storage handler.
Create the HBase Table:
hbase(main):001:0> create 'employee','personaldetails','deptdetails'

The above statement creates the 'employee' table with two column families, 'personaldetails' and 'deptdetails'.
Insert the data into HBase table:
hbase(main):049:0> put 'employee','eid01','personaldetails:fname','Brundesh'
0 row(s) in 0.1030 seconds
hbase(main):050:0> put 'employee','eid01','personaldetails:Lname','R'
0 row(s) in 0.0160 seconds
hbase(main):051:0> put 'employee','eid01','personaldetails:salary','10000'
0 row(s) in 0.0090 seconds
hbase(main):060:0> put 'employee','eid01','deptdetails:name','R&D'
0 row(s) in 0.0680 seconds
hbase(main):061:0> put 'employee','eid01','deptdetails:location','Banglore'
0 row(s) in 0.0140 seconds
hbase(main):067:0> put 'employee','eid02','personaldetails:fname','Abhay'
0 row(s) in 0.0080 seconds
hbase(main):068:0> put 'employee','eid02','personaldetails:Lname','Kumar'
0 row(s) in 0.0080 seconds
hbase(main):069:0> put 'employee','eid02','personaldetails:salary','100000'
0 row(s) in 0.0090 seconds
Now create the Hive table pointing to HBase table.
If there are multiple column families in the HBase table, we can create one Hive table per column family. In this case, we have two column families, so we create two tables, one for each column family.
Table for personal details column family:
create external table employee_hbase(eid string, f_name string, s_name string, salary int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personaldetails:fname,personaldetails:Lname,personaldetails:salary")
TBLPROPERTIES ("hbase.table.name" = "employee");
When creating a non-native Hive table using a storage handler, the STORED BY clause must be specified.
Note: different storage handler classes are used for different databases.
hbase.columns.mapping: maps the Hive columns to the HBase columns. The first entry must be :key, which corresponds to HBase's row key column.
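Following the same pattern, a sketch of the second table, for the deptdetails column family, might look like the following (the Hive table name employee_dept_hbase is an assumption; the qualifiers name and location match the put commands above):

```sql
-- Hypothetical companion table for the 'deptdetails' column family;
-- it reuses the same HBase table and the same mapping pattern.
create external table employee_dept_hbase(eid string, dept_name string, location string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,deptdetails:name,deptdetails:location")
TBLPROPERTIES ("hbase.table.name" = "employee");
```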
Now we can query the HBase table with SQL queries in Hive using the command below.

select * from employee_hbase;
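Because the table is backed by the HBase storage handler, writes issued from Hive land in the underlying HBase table as well. A rough sketch, assuming a Hive version that supports INSERT ... VALUES (0.14 or later) and using an illustrative row key:

```sql
-- Inserting through Hive writes the row into the underlying HBase
-- table via the storage handler; 'eid03' and its values are made up.
INSERT INTO TABLE employee_hbase
VALUES ('eid03', 'Asha', 'Patil', 20000);
```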
The select query, when fired from the Hive terminal, will return all the data stored in the HBase table. We hope going through this blog helps you integrate Hive and HBase and build a useful SQL interface on top of HBase.