
Beginner's Guide to Apache HBase (2017)

Before understanding what Apache HBase is, we need to understand why it was introduced in the first place.
Prior to Apache HBase, we had Relational Database Management Systems (RDBMS) from the late 1970s, and they helped a lot of companies implement solutions for their problems, many of which are still in use today.
Even today there are many use cases where an RDBMS is the perfect tool, e.g., handling transactions.
Yet there are some problems, like handling big data, that cannot be solved with an RDBMS.

The Age of Big Data

We live in an era where petabytes of data are generated daily from sources like social media and e-commerce. Because of this, companies focus on delivering more targeted information, such as recommendations or online ads, which influences their success as a business. With the emergence of new machine learning algorithms, the need to collect data has increased drastically, and technologies like Hadoop can process the collected data with ease.

In the past, due to the cost of storing data, companies used to ignore historical data: they retained only the last N days of data and kept the rest as backups on tape drives.

Because analytics was performed on this limited data, the resulting models were not effective.

A few companies, like Google and Amazon, realized the importance of data and started developing solutions to big data problems. These ideas were then implemented outside of Google as part of the open source Hadoop project: HDFS and MapReduce.

Hadoop, however, was mainly introduced for batch processing, while companies also needed a database that could serve real-time responses.

So Google came up with Bigtable, a column-oriented database, to address real-time queries.

Before going deeper into Apache HBase and its operations, let's first understand what a column-oriented database is.

Column-oriented databases differ from traditional row-oriented databases, where entire rows are stored contiguously.

In a column-oriented database, data is grouped by column, and the values of each column are stored contiguously on disk.

Storing values on a per-column basis increases efficiency when not all of a row's values are needed.

In a column-oriented database, the values of one column are very similar in nature, often varying only slightly between logical rows, which makes them much better candidates for compression than the heterogeneous values of row-oriented record structures.
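As a rough illustration of this point (plain Python, not HBase code; the record values are made up), compare laying out the same records row by row versus column by column. The column layout groups similar values together, so even a naive run-length encoding compresses it into fewer runs:

```python
# Toy comparison of row-oriented vs column-oriented storage layouts.
records = [
    {"city": "Boston", "state": "MA", "amount": 15},
    {"city": "Boston", "state": "MA", "amount": 20},
    {"city": "Boston", "state": "MA", "amount": 15},
]

# Row-oriented layout: each record's fields are stored together.
row_layout = [v for rec in records for v in rec.values()]

# Column-oriented layout: all values of one column are stored together.
columns = list(records[0].keys())
col_layout = [rec[c] for c in columns for rec in records]

def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# The column layout produces long runs of identical values, so it
# encodes into fewer runs than the interleaved row layout.
print(len(run_length_encode(row_layout)))  # 9 runs
print(len(run_length_encode(col_layout)))  # 5 runs
```

Real column stores use far more sophisticated encodings, but the intuition is the same: homogeneous, adjacent values compress well.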

Introduction to Apache HBase

Apache HBase is an open source implementation of Google's Bigtable, with slight modifications. It was created in 2007, initially as a contribution to Hadoop, and later became a top-level Apache project.

Apache HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is horizontally scalable, meaning we can add new nodes to HBase as the data grows.

It is well suited for sparse data sets, which are common in many big data use cases.

An Apache HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have a row key defined, and all access to an HBase table goes through its row key.
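Conceptually, an HBase table can be pictured as a sparse, nested map: row key first, then a "family:qualifier" column name. The sketch below is plain Python (not the HBase API), with made-up customer data, just to show that all reads go through the row key and that rows need not share the same columns:

```python
# Toy model of an HBase table: a sparse map of
#   row key -> {"family:qualifier" -> value}.
table = {
    "john":  {"address:city": "Boston", "order:number": "ORD-15"},
    "Finch": {"address:city": "Newyork"},  # no order columns: absent cells cost nothing
}

def get(table, row_key, column=None):
    """All access goes through the row key, optionally narrowed to one column."""
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

print(get(table, "john", "address:city"))   # Boston
print(get(table, "Finch", "order:number"))  # None: sparse rows store no placeholder
```

Because missing cells are simply absent rather than stored as NULLs, wide tables with mostly empty columns stay cheap.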

It is part of the Hadoop ecosystem and provides random, real-time read/write access to data stored in the Hadoop file system.

HBase Vs RDBMS

  • HBase is schema-less: there is no predefined schema for its tables (only the column families are fixed). RDBMS tables have a fixed schema, which describes the whole structure of the table.
  • HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
  • HBase has no transactions. An RDBMS is transactional.
  • HBase stores de-normalized data. An RDBMS stores normalized data.
  • HBase works well for semi-structured as well as structured data. An RDBMS works well for structured data.

Before proceeding further, we will install HBase. Click here to download the installation document.

We can interact with HBase in two different ways:

  • Through the interactive HBase shell
  • Through the HBase Java client API

The HBase shell is built on JRuby (the Java implementation of Ruby). We can start it with the command below:

$HBASE_HOME/bin/hbase shell

[[email protected] Downloads]$ hbase shell
2015-12-15 10:39:46,050 INFO  [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.14-hadoop2, r4e4aabb93b52f1b0fef6b66edd06ec8923014dec, Tue Aug 25 22:35:44 PDT 2015
hbase(main):001:0>

The version command displays the version of HBase:

hbase(main):001:0> version
0.98.14-hadoop2, r4e4aabb93b52f1b0fef6b66edd06ec8923014dec, Tue Aug 25 22:35:44 PDT 2015

The list command lists all the tables present in HBase:

hbase(main):002:0> list
TABLE
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2015-12-15 10:40:35,051 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
customer
1 row(s) in 2.8550 seconds
=> ["customer"]
hbase(main):003:0>

Basic commands and structure

Column – a single field in a table.

Column family – a group of columns.

Row key – a mandatory field that serves as the unique identifier for every record.

Creating a table in HBase

Syntax: create '<table-name>','<column-family1>','<column-family2>', ...

In the HBase data model, columns are grouped into column families, which must be defined at table-creation time. Every table must have at least one column family. HBase currently does not perform well with more than three column families, so keep the number of column families in your schema low.


hbase(main):021:0> create 'customer','address','order'
0 row(s) in 0.4030 seconds
=> Hbase::Table - customer
hbase(main):022:0> list
TABLE                                                                          
customer
1 row(s) in 0.0110 seconds
=> ["customer"]

Inserting data into HBase

We can insert data using the put command.

Syntax: put '<table-name>','<row-key>','<column-family>:<column-name>','<value>'

In the example below, customer is the table name and john is the row key, followed by the column (family:qualifier) and its value.

hbase(main):026:0> put 'customer','john','address:city','Boston'
0 row(s) in 0.0290 seconds
hbase(main):027:0> put 'customer','john','address:state','Massachusetts'
0 row(s) in 0.0060 seconds
hbase(main):028:0> put 'customer','john','address:street','street1'
0 row(s) in 0.0130 seconds
hbase(main):029:0> put 'customer','john','order:number','ORD-15'
0 row(s) in 0.0260 seconds
hbase(main):030:0> put 'customer','john','order:amount','15'
0 row(s) in 0.0120 seconds

Inserting a second record

hbase(main):034:0> put 'customer','Finch','address:city','Newyork'
0 row(s) in 0.0060 seconds
hbase(main):035:0> put 'customer','Finch','address:state','Newyork'
0 row(s) in 0.0060 seconds
hbase(main):036:0> put 'customer','Finch','order:number','ORD-16'
0 row(s) in 0.0090 seconds
hbase(main):037:0> put 'customer','Finch','order:amount','15'
0 row(s) in 0.0080 seconds

Getting a single record from the table


We use the get command to retrieve a single record from an HBase table. The column family (or a specific column) is optional:

Syntax: get '<table-name>','<row-key>'[,'<column-family>']

get 'customer','john'
COLUMN                           CELL                                                                                        
address:city                    timestamp=1450143157606, value=Boston                                                      
address:state                   timestamp=1450143185560, value=Massachusetts
address:street                  timestamp=1450143246875, value=street1                                                     
order:amount                    timestamp=1450143320786, value=15                                                          
order:number                    timestamp=1450143305944, value=ORD-15                                                      
5 row(s) in 0.0180 seconds

Using the get command to retrieve John's address:

hbase(main):044:0> get 'customer','john','address'
COLUMN                           CELL                                                                                       
address:city                    timestamp=1450143157606, value=Boston                                                      
address:state                   timestamp=1450143185560, value=Massachusetts
address:street                  timestamp=1450143246875, value=street1                                                     
3 row(s) in 0.0330 seconds

Using the get command to retrieve John's city:

hbase(main):045:0> get 'customer','john','address:city'
COLUMN                           CELL                                                                                       
address:city                    timestamp=1450143157606, value=Boston                                                      
1 row(s) in 0.0060 seconds

To get all the records from a table, we use the scan command.

Syntax: scan '<table-name>'

hbase(main):041:0> scan 'customer'
ROW                              COLUMN+CELL                                                                                 
Finch                           column=address:city, timestamp=1450143461624, value=Newyork
Finch                           column=address:state, timestamp=1450143466906, value=Newyork
Finch                           column=order:amount, timestamp=1450143490833, value=15                                     
Finch                           column=order:number, timestamp=1450143479920, value=ORD-16                                 
john                            column=address:city, timestamp=1450143157606, value=Boston                                 
john                            column=address:state, timestamp=1450143185560, value=Massachusetts
john                            column=address:street, timestamp=1450143246875, value=street1                              
john                            column=order:amount, timestamp=1450143320786, value=15                                      
john                            column=order:number, timestamp=1450143305944, value=ORD-15                                 
2 row(s) in 0.0230 seconds

Deleting records

Deleting an entire record from the table. Note that the shell's delete command removes a single cell; to remove a whole row we use deleteall:

deleteall '<table-name>','<row-key>'
hbase(main):046:0> deleteall 'customer','Finch'
0 row(s) in 0.0270 seconds

Deleting a specific column from the table:

hbase(main):046:0> delete 'customer','john','address:city'
0 row(s) in 0.0270 seconds

Counting the number of rows in the table

hbase(main):047:0> count 'customer'
2 row(s) in 0.0320 seconds

Versions in Apache HBase

In a typical database, updating means replacing the previous value with a new one. In HBase, however, rewriting a column value does not overwrite the existing value; HBase stores multiple versions of each cell, distinguished by timestamp (and qualifier). Excess versions are removed during major compaction. The maximum number of versions may need to be increased or decreased depending on application needs.

The default number of versions is 1; we can change it for a column family using the alter command:

hbase(main):048:0> alter 'customer',NAME=>'address',VERSIONS=>5
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2290 seconds
hbase(main):049:0>  put 'customer','Finch','address:city','Newyork'
0 row(s) in 0.0190 seconds
hbase(main):050:0>  put 'customer','Finch','address:city','Detroit'
0 row(s) in 0.0090 seconds
hbase(main):051:0>  put 'customer','Finch','address:city','Sanfranscisco'
0 row(s) in 0.0110 seconds
hbase(main):052:0> 
hbase(main):052:0>  scan 'customer',{COLUMN=>'address:city',VERSIONS=>2}
ROW                              COLUMN+CELL                                                                                 
 Finch                           column=address:city, timestamp=1450147800933, value=Sanfranscisco                           
 Finch                           column=address:city, timestamp=1450147785900, value=Detroit                                 
 john                            column=address:city, timestamp=1450143157606, value=Boston                                  
2 row(s) in 0.0170 seconds
hbase(main):053:0>  scan 'customer',{COLUMN=>'address:city',VERSIONS=>1}
ROW                              COLUMN+CELL                                                                                 
 Finch                           column=address:city, timestamp=1450147800933, value=Sanfranscisco                           
 john                            column=address:city, timestamp=1450143157606, value=Boston                                  
2 row(s) in 0.0170 seconds
hbase(main):054:0>  scan 'customer',{COLUMN=>'address:city',VERSIONS=>3}
ROW                              COLUMN+CELL                                                                                 
 Finch                           column=address:city, timestamp=1450147800933, value=Sanfranscisco                           
 Finch                           column=address:city, timestamp=1450147785900, value=Detroit                                 
 Finch                           column=address:city, timestamp=1450147775468, value=Newyork                                 
 john                            column=address:city, timestamp=1450143157606, value=Boston                                  
2 row(s) in 0.0140 seconds
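The behavior shown above can be sketched in a few lines of plain Python (an assumed simplification, not the actual HBase implementation): each cell keeps a bounded list of timestamped versions, a put adds a new version rather than overwriting, and a read returns the newest requested versions:

```python
# Toy model of HBase cell versioning.
class Cell:
    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        # A put never overwrites: it adds a new version...
        self.versions.insert(0, (timestamp, value))
        # ...and excess versions are trimmed (in HBase this
        # actually happens later, at major compaction).
        del self.versions[self.max_versions:]

    def get(self, versions=1):
        # A read returns up to the requested number of newest versions.
        return self.versions[:versions]

city = Cell(max_versions=5)            # like: alter 'customer', VERSIONS => 5
city.put(1450147775468, "Newyork")
city.put(1450147785900, "Detroit")
city.put(1450147800933, "Sanfranscisco")

print(city.get())        # newest only: [(1450147800933, 'Sanfranscisco')]
print(len(city.get(3)))  # 3
```

This mirrors the shell session: a plain get/scan sees only the newest value, while asking for more VERSIONS walks back through the timestamped history.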

Dropping a table

Before dropping a table, we must first disable it:

disable '<table-name>'

disable 'customer'

Now we can drop the table:

drop '<table-name>'

drop 'customer'

We hope this blog helped you get a brief overview of HBase and its place in the Hadoop ecosystem. Keep visiting www.acadgild.com for more updates on Big Data and other technologies. Click here to learn Big Data Hadoop Development.
