Is it hard to insert data into HBase by writing scripts? Loading records one at a time means writing the same kind of put statement again and again, even when the same data is already present in HDFS.
But what if a few lines were enough to copy that data into HBase? Working with HBase would be a lot more fun, and you would get to your analytical results much faster than with the traditional approach. In this blog, you will see a utility that saves us from writing many lines of script to insert data into HBase. HBase ships with a number of utilities that make our work easier, and the one we are about to look at is ImportTsv.
ImportTsv is a utility that loads data in TSV (tab-separated values) format into HBase. By default, it takes data from HDFS and writes it into HBase via Puts.
Find below the syntax used to load data via Puts (i.e., non-bulk loading):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
In this blog, we will use a small sample dataset to practice loading data from HDFS into HBase.
Steps to Practical Execution
Before starting on the TSV import, make sure all the Hadoop and HBase daemons are running.
If Hadoop is not running, run Hadoop-X/sbin/start-all.sh
and then start the job history server with Hadoop-X/sbin/mr-jobhistory-daemon.sh start historyserver.
If HMaster is not running, run Hbase/bin/start-hbase.sh.
Now our system is ready.
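Before moving on, it can help to confirm that the daemons are actually up. The sketch below filters the output of jps (the JDK tool that lists running Java processes) and prints any expected daemon that is missing. The daemon list and the missing_daemons function name are assumptions for a single-node Hadoop 2.x plus HBase setup, so adjust them to your own cluster.

```shell
# Daemons we expect to see; an assumed list for a single-node setup.
EXPECTED_DAEMONS="NameNode DataNode ResourceManager NodeManager JobHistoryServer HMaster"

# Reads jps-style lines ("<pid> <ClassName>") on stdin and prints the
# name of every expected daemon that does not appear in that input.
missing_daemons() {
    running=$(awk '{ print $2 }')
    for daemon in $EXPECTED_DAEMONS; do
        if ! printf '%s\n' "$running" | grep -qxF "$daemon"; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Typical use on a machine with a JDK installed:
#   jps | missing_daemons
# Demonstration with made-up jps output in which HMaster is not running:
printf '%s\n' '2301 NameNode' '2410 DataNode' '2733 ResourceManager' \
    '2840 NodeManager' '3012 JobHistoryServer' '3555 Jps' | missing_daemons
```

With the made-up input above, the only line printed is HMaster. An empty result means every daemon in the list was found.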
Inside the HBase shell, give the following command to create a table with two column families.
create 'bulktable', 'cf1', 'cf2'
Exit the HBase shell to return to the terminal and make a directory for the HBase data on the local drive.
If you already have a path of your own, you can use that instead.
mkdir -p hbase
Now move to the directory where we will keep our data.
Create a file named bulk_data.tsv inside the hbase directory using a text editor such as vi, and fill it with tab-separated data.
Put the following data in it, with a single tab between the fields:
1 Amit 4
2 Girija 3
3 Jatin 5
4 Swati 3
Once done, save the file by pressing Esc, typing :wq, and pressing Enter.
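Some editors quietly turn tabs into spaces, which would leave the lines without the separator that ImportTsv looks for by default. As an alternative to typing the file by hand, here is a small sketch that writes the same four records with explicit tab characters and then checks the field count. It assumes you are in the directory where bulk_data.tsv should live.

```shell
# Write the four sample records with an explicit tab (\t) between fields.
# printf reuses the format string for each group of three arguments.
printf '%s\t%s\t%s\n' \
    1 Amit 4 \
    2 Girija 3 \
    3 Jatin 5 \
    4 Swati 3 > bulk_data.tsv

# Confirm every line has exactly three tab-separated fields.
# Any offending line is reported and awk exits with a non-zero status.
awk -F '\t' 'NF != 3 { print "line " NR ": " NF " fields"; bad = 1 } END { exit bad }' bulk_data.tsv \
    && echo "bulk_data.tsv looks well formed"
```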
Our data must be present in HDFS when we run the import into HBase.
In real-world projects, the data will usually already be in HDFS.
Here, for learning purposes, we copy the data into HDFS using the commands below in the terminal.
Command: hadoop fs -mkdir /hbase
hadoop fs -put bulk_data.tsv /hbase/
hadoop fs -cat /hbase/bulk_data.tsv
Now that the data is in HDFS, we run the following command in the terminal, passing the arguments <tablename> and <path of data in HDFS>:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp bulktable /hbase/bulk_data.tsv
Observe that the map phase reaches 100%, although we get an error afterward.
For now, ignore the error message, since our task is only to map the data into the HBase table.
Now let us check whether the data actually landed in HBase by scanning the table from the HBase shell (scan 'bulktable').
We see that all the data is present in the table, which confirms that the mapping of our tab-separated values succeeded.
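To make the -Dimporttsv.columns mapping concrete, the sketch below prints the HBase shell put statements that correspond to each TSV line under a given columns specification. It is an illustration of the mapping only, not how ImportTsv is implemented (the real mapper is Java code running inside MapReduce). The tsv_to_puts function and the sample.tsv file name are made up for this example.

```shell
# Illustration only: print the put statements that correspond to each
# TSV line under a given importtsv.columns specification.
tsv_to_puts() {
    # $1 = table name, $2 = columns spec, $3 = TSV file
    awk -F '\t' -v table="$1" -v spec="$2" '
        BEGIN {
            n = split(spec, cols, ",")
            for (i = 1; i <= n; i++)
                if (cols[i] == "HBASE_ROW_KEY") keypos = i
        }
        NF == n {
            for (i = 1; i <= n; i++)
                if (i != keypos)
                    printf "put \047%s\047, \047%s\047, \047%s\047, \047%s\047\n", table, $keypos, cols[i], $i
        }
    ' "$3"
}

# Two of the sample records, written with explicit tabs to a scratch file.
printf '%s\t%s\t%s\n' 1 Amit 4 2 Girija 3 > sample.tsv
tsv_to_puts bulktable 'HBASE_ROW_KEY,cf1:name,cf2:exp' sample.tsv
```

For the two records above, this prints four statements: the first field becomes the row key, the second goes to cf1:name, and the third to cf2:exp. The first line of output is put 'bulktable', '1', 'cf1:name', 'Amit'.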
Running ImportTsv with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family or a columnfamily:qualifier. The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record. You must specify exactly one column to be the row key, and you must specify a column name for every column that exists in the input data.
By default, importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
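If you plan to run with -Dimporttsv.skip.bad.lines=false, or with a custom separator, it can be worth checking the input before starting the job. The sketch below counts the lines whose field count differs from the number of names in the columns specification. This is a simple pre-import check of our own, possibly stricter than ImportTsv's own validation rules, and the count_mismatched_lines function and sample_pipes.txt file are hypothetical names used only here.

```shell
# Count lines whose number of fields differs from the number of names
# in the columns spec. A sanity check, not the validation ImportTsv runs.
count_mismatched_lines() {
    # $1 = separator character, $2 = columns spec, $3 = input file
    awk -v sep="$1" -v spec="$2" '
        BEGIN { expected = split(spec, unused, ",") }
        { if (split($0, fields, sep) != expected) bad++ }
        END { print bad + 0 }
    ' "$3"
}

# Made-up pipe-separated sample in which the third line lacks a field.
printf '%s\n' '1|Amit|4' '2|Girija|3' '3|Jatin' > sample_pipes.txt
count_mismatched_lines '|' 'HBASE_ROW_KEY,cf1:name,cf2:exp' sample_pipes.txt
```

For the sample above the function prints 1. For a tab-separated file, pass '\t' as the first argument.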
We hope this post helped you import tab-separated data into HBase. For any queries, feel free to comment below.
Keep visiting www.acadgild.com for more updates on the courses.