
ImportTSV Data from HDFS into HBase

Is it really that hard to insert data into HBase by writing scripts? Loading records by hand means writing a nearly identical put statement for every single record, even when the same data is already sitting in HDFS.
But what if, by writing only a few lines, you could have the data copied into HBase? Working with HBase then becomes a lot more fun, and you can get analytical results much faster than with traditional approaches. In this blog, you will see a utility that saves us from writing multiple lines of scripts to insert data into HBase. HBase ships with a number of utilities to make our work easier; like many of them, the one we are about to see, ImportTsv, does exactly that.
ImportTsv is a utility that loads data in TSV format into HBase. It reads data from HDFS and writes it into HBase via Puts.
Find below the syntax used to load data via Puts (i.e., non-bulk loading):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
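To appreciate what this saves you, here is roughly the manual alternative that ImportTsv automates: one put per cell in the HBase shell. A minimal sketch, using an illustrative table and column names (not part of this exercise):

put 'sometable', 'row1', 'cf:a', 'value-a'
put 'sometable', 'row1', 'cf:b', 'value-b'
put 'sometable', 'row1', 'cf:c', 'value-c'

Every additional row means another batch of near-identical put statements, which is exactly the tedium the utility removes.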
In this blog, we will practice loading data from HDFS into HBase using a small sample dataset.
Steps to Practical Execution
Before starting to practice the TSV import, all the Hadoop and HBase daemons must be running.
If Hadoop is not running, start it with Hadoop-X/sbin/start-all.sh,
and then start the job history server with Hadoop-X/sbin/mr-jobhistory-daemon.sh start historyserver.
Likewise, if HMaster is not running, start HBase with Hbase/bin/start-hbase.sh.
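You can verify that everything came up using jps. The exact process list varies with your Hadoop and HBase versions (process IDs omitted here), but it should look roughly like this:

$ jps
NameNode
DataNode
ResourceManager
NodeManager
JobHistoryServer
HMaster
HRegionServer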

 
Now our system is ready.

Step 1:

Inside the HBase shell, give the following command to create a table with two column families.
create 'bulktable', 'cf1', 'cf2'
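To confirm the table was created, you can use the shell's standard list and describe commands (output omitted here):

list 'bulktable'
describe 'bulktable'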


Step 2:

Come out of the HBase shell to the terminal and make a directory for our HBase data on the local drive (any path of your own will work just as well):
mkdir -p hbase

Now move into the directory where we will keep our data:
cd hbase


Step 3:

Create a file named bulk_data.tsv with tab-separated data inside it, using the command below in the terminal (we are already inside the hbase directory, so no path prefix is needed):
vi bulk_data.tsv

Put this data in, with a tab between the fields:
1    Amit    4
2    Girija  3
3    Jatin   5
4    Swati   3

Once done, save the file by pressing Esc, then typing :wq, then pressing Enter.
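If you paste the rows rather than typing them, editors and terminals sometimes convert the tabs to spaces, and ImportTsv will then reject those lines as invalid. A safer way to generate the file is printf, which emits literal tab characters (same file name as above):

printf '1\tAmit\t4\n2\tGirija\t3\n3\tJatin\t5\n4\tSwati\t3\n' > bulk_data.tsv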

Step 4:

Our data must be present in HDFS when we perform the import into HBase.
In real-world projects, the data will usually already be inside HDFS.
Here, for learning purposes, we copy the data into HDFS using the commands below in the terminal.
Command:
hadoop fs -mkdir /hbase

Command:
hadoop fs -put bulk_data.tsv /hbase/

Command:
hadoop fs -cat /hbase/bulk_data.tsv
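If the copy succeeded, the cat command simply echoes the four rows back, exactly as they appear in the local file:

1    Amit    4
2    Girija  3
3    Jatin   5
4    Swati   3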

Step 5:

Now that the data is present in HDFS, we give the following command in the terminal, along with the arguments <tablename> and <path of data in HDFS>.

Command:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp bulktable /hbase/bulk_data.tsv
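Note how the names in -Dimporttsv.columns map positionally onto the TSV fields: the first field becomes the row key, the second lands in cf1:name, and the third in cf2:exp. If you ever need a different field as the row key, you simply reorder the list; for example, this hypothetical variant (cf1:id is an illustrative column name, not part of this exercise) would use the second field as the row key:

-Dimporttsv.columns=cf1:id,HBASE_ROW_KEY,cf2:exp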


Observe that the map phase completes 100%, even though an error message may appear afterward.
For now, ignore the error message, since our task of mapping the data into the HBase table is done.
Now let us check whether we actually got the data into HBase, using the command below in the HBase shell.
scan 'bulktable'
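The scan should print one line per cell, roughly as follows (timestamps will differ on your machine):

ROW    COLUMN+CELL
1      column=cf1:name, timestamp=..., value=Amit
1      column=cf2:exp, timestamp=..., value=4
2      column=cf1:name, timestamp=..., value=Girija
2      column=cf2:exp, timestamp=..., value=3
3      column=cf1:name, timestamp=..., value=Jatin
3      column=cf2:exp, timestamp=..., value=5
4      column=cf1:name, timestamp=..., value=Swati
4      column=cf2:exp, timestamp=..., value=3
4 row(s)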

We see that all the data is present in the table, confirming that our import of tab-separated values was successful.

Running ImportTsv with no arguments prints brief usage information:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family or a columnfamily:qualifier. Also, the special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record. You must specify exactly one column to be the row key, and consequently, you must specify a column name for every column that exists in the input data.
By default, importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output

 Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
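As a sketch of that bulk-load path (the /tmp/bulk_output directory here is just an illustrative choice), you first write HFiles and then hand them to HBase in a second step:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp -Dimporttsv.bulk.output=/tmp/bulk_output bulktable /hbase/bulk_data.tsv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/bulk_output bulktable

Note that LoadIncrementalHFiles is the class name used by HBase 1.x; newer releases expose the same step as the hbase completebulkload command.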
We hope this post helped you import tab-separated data. For any queries, feel free to comment below.

Keep visiting www.acadgild.com for more updates on the courses.


Prateek

An alumnus of the NIE Institute of Technology, Mysore, Prateek is an ardent data science enthusiast. He has been working at Acadgild as a Data Engineer for the past three years, and is a subject-matter expert in Big Data, the Hadoop ecosystem, and Spark.

Comments

  1. Hello, my name is Ozi. May I ask you something? How do you import large files with a large number of columns (e.g., 1,000 columns)? Do we have to write out all the columns? Thanks.

  2. Hi,
    If the table gets a new row (5 Jyothi 3), how do I import only this row into HBase, rather than importing the whole table again?
    I know that even if you import the whole table, HBase will not create any duplicates. But I want to import only a single new/updated row.
    Thanks

  3. Excellent demonstration. Just one problem in my case: the ImportTsv command selects the first column as the row key, which is not unique in my data, so only the last record for each key survives in HBase. How can I select the Nth column in my TSV as the row key?


