
Data Bulk Loading into HBase Table Using MapReduce

In this blog, we will discuss the steps to bulk load file contents from an HDFS path into an HBase table using the Java MapReduce API. Before moving forward, you can follow the blogs linked below to learn more about HBase and how it works.
Beginners Guide to Apache HBase
Integrating Hive with HBase
Performing CRUD Operations on HBase Using Java API
Introduction to HBase Filters
Read and Write Operations in HBase
How to Import Table from MySQL to HBase
Apache HBase gives us random, real-time, read/write access to Big Data, but the more important question here is how we get the data loaded into HBase in the first place. The HBase Put API can be used to insert data into HBase, but inserting every record through the Put API is a lot slower than bulk loading.
Thus, it is better to load the complete file content into the HBase table in bulk using the bulk load function.
Bulk loading in HBase is the process of preparing HFiles and loading them directly into the region servers.
In our example, we will be using a sample data set, hbase_input_emp.txt, which is saved in our HDFS directory hbase_input_dir. You can download this sample data set for practice from the link below.
The data set contains four columns:
Column 1: Employee Id
Column 2: Employee name
Column 3: Employee mail id
Column 4: Employee salary
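As a rough illustration of the layout above, the following sketch parses one comma-separated line into the four fields. The class name EmployeeRecord and the sample line are made up for this example; the only thing taken from this post is the column order and the comma delimiter.

```java
// Hypothetical holder for one line of hbase_input_emp.txt (the name is illustrative).
class EmployeeRecord {
    final String id;
    final String name;
    final String mailId;
    final String salary;

    EmployeeRecord(String id, String name, String mailId, String salary) {
        this.id = id;
        this.name = name;
        this.mailId = mailId;
        this.salary = salary;
    }

    // Parses "id,name,mail,salary" and rejects lines that do not have exactly four fields.
    static EmployeeRecord parse(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException(
                "Expected 4 comma-separated fields but got " + parts.length + ": " + line);
        }
        return new EmployeeRecord(parts[0].trim(), parts[1].trim(), parts[2].trim(), parts[3].trim());
    }

    public static void main(String[] args) {
        // Sample line invented for illustration; it is not taken from the real data set.
        EmployeeRecord r = parse("101,Asha,asha@example.com,50000");
        System.out.println(r.id + " -> " + r.name + ", " + r.mailId + ", " + r.salary);
    }
}
```

Checking the field count up front is a design choice for this sketch: a short or malformed line fails with a clear message instead of an ArrayIndexOutOfBoundsException further down.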
You can follow the steps below to bulk load data from HDFS into HBase via a MapReduce job.
Extract the data from the source and load it into HDFS.
If the data is in Oracle or MySQL, you need to fetch it using Sqoop or a similar tool that can import data directly from a database into HDFS. If your raw files, such as .txt, .pst, or .xml files, are located on some server, simply pull them and load them into HDFS. HBase does not prepare HFiles by reading data directly from the source.
In our example, the data is already available in our HDFS path. We can use the cat command to see the content of the input file hbase_input_emp.txt, which is saved in the hbase_input_dir folder in HDFS.
hdfs dfs -cat /hbase_input_dir/hbase_input_emp.txt

Transform the data into HFiles via a MapReduce job.
Here we write a MapReduce job that processes our data and creates HFiles. We write only a Mapper class and no Reducer class: calling HFileOutputFormat.configureIncrementalLoad() in the driver makes HBase configure its own Reducer class for the job.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseBulkLoad {

    public static class BulkLoadMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split the record wherever it is comma (',') separated; the first column is the row key.
            String[] parts = line.split(",");
            String rowKey = parts[0];
            // HBase stores its data as bytes, so the row key is converted to bytes
            // and wrapped in an ImmutableBytesWritable to serve as the map output key.
            ImmutableBytesWritable HKey = new ImmutableBytesWritable(Bytes.toBytes(rowKey));
            // Create a Put instance with the first field as the row key.
            Put HPut = new Put(Bytes.toBytes(rowKey));
            // Add the remaining fields as columns of the column family "id".
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("name"), Bytes.toBytes(parts[1]));
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("mail_id"), Bytes.toBytes(parts[2]));
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("sal"), Bytes.toBytes(parts[3]));
            // Emit the row key and the Put so that the HBase reducer can turn them into HFile entries.
            context.write(HKey, HPut);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = HFile output directory, args[2] = HBase table name
        Configuration conf = HBaseConfiguration.create();
        String inputPath = args[0];
        HTable table = new HTable(conf, args[2]);
        conf.set("hbase.mapred.outputtable", args[2]);
        Job job = Job.getInstance(conf, "HBase_Bulk_loader");
        job.setJarByClass(HBaseBulkLoad.class);
        job.setMapperClass(BulkLoadMap.class);
        job.setInputFormatClass(TextInputFormat.class);
        // The map output types must be set before configureIncrementalLoad(),
        // which uses them to choose the matching sort reducer.
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Configures the reducer, partitioner, and output format so that the HFiles match the table's regions.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The above step finishes the MapReduce programming part.
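One detail worth keeping in mind about the job above: HBase keeps rows sorted by row key in lexicographic byte order, so the HFiles are written in that order rather than in numeric order. The following sketch uses only the Java standard library (no HBase classes) to show what that ordering looks like for string row keys; the class name RowKeyOrder and the sample keys are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowKeyOrder {
    // Compares two byte arrays lexicographically, treating each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // When one key is a prefix of the other, the shorter key sorts first.
        return a.length - b.length;
    }

    // Returns a copy of the keys sorted the way byte-ordered row keys would be.
    static List<String> sortAsRowKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((l, r) -> compareBytes(
            l.getBytes(StandardCharsets.UTF_8), r.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        // Numeric ids stored as strings do not sort numerically.
        System.out.println(sortAsRowKeys(Arrays.asList("2", "10", "1")));       // prints [1, 10, 2]
        // Zero-padding to a fixed width restores the numeric order.
        System.out.println(sortAsRowKeys(Arrays.asList("002", "010", "001"))); // prints [001, 002, 010]
    }
}
```

If the employee ids are to be scanned in numeric order later, padding them to a fixed width before using them as row keys is one simple option.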
Now we need to create a new table in HBase to receive the contents of the HDFS input directory. Follow the steps below to import the contents from the HDFS path into the HBase table.
Enter HBase shell:
Before entering the HBase shell, start the HBase services. On a standalone installation this is typically done with the start-hbase.sh script from the HBase bin directory:
start-hbase.sh

After starting the HMaster service, use the command below to enter the HBase shell.
hbase shell

Create table:
We can use the create command to create a table in HBase. Here the table is named Academp and has a single column family, id.
create 'Academp','id'

Scan table:
We can use the scan command to see the contents of a table in HBase.
scan 'Academp'

At this point the scan returns no rows, because nothing has been loaded into the table Academp yet.
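Export HADOOP_CLASSPATH:
In the next step, we need to add the HBase library files to the Hadoop classpath. This enables the Hadoop client to connect to HBase and get the number of splits. One common way to do this, assuming the hbase command is on your PATH, is:
export HADOOP_CLASSPATH=$(hbase classpath)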

MapReduce jar execution:
Now, run the MapReduce job with the command below to generate the HFiles.
hadoop jar /home/acadgild/Desktop/BKLoad.jar /hbase_input_dir/hbase_input_emp.txt /hbase_output_dir Academp

Here, the first parameter is the input path where our input file is saved, the second parameter is the output directory where the HFiles will be written, and the third parameter is the HBase table name.
Now, let us use the ls command to list the HFiles stored in our output directory hbase_output_dir. The job creates one sub-directory per column family, which is id in our case.
hadoop fs -ls /hbase_output_dir  

hadoop fs -ls /hbase_output_dir/id

We can use the command below to see the content of the output HFile saved in the sub-directory id. HFiles are binary, so only parts of the output are readable.
hadoop fs -cat /hbase_output_dir/id/5ed1f7…..

After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate RegionServer, which adopts the HFile, moving it into its storage directory and making the data available to clients.
Now, load the files into HBase by telling the RegionServers where to find them.
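To make the "determines the region the file belongs to" step more concrete, here is a toy model in plain Java. It is not HBase's implementation and uses no HBase classes: the class RegionLookup, the region names, and the split points are all hypothetical. It only illustrates the idea that each region owns a contiguous range of row keys, so the owner of a key is the region with the greatest start key that is less than or equal to it.

```java
import java.util.Map;
import java.util.TreeMap;

class RegionLookup {
    // Maps each region's start key to a (hypothetical) region name.
    // String ordering matches byte ordering for plain ASCII keys, which is all this sketch uses.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // Finds the region whose key range contains the given row key.
    String regionFor(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("No region covers row key: " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookup lookup = new RegionLookup();
        // The first region starts at the empty key, so every row key is covered.
        lookup.addRegion("", "region-A");
        lookup.addRegion("200", "region-B");
        lookup.addRegion("400", "region-C");

        System.out.println(lookup.regionFor("150")); // prints region-A
        System.out.println(lookup.regionFor("200")); // prints region-B
        System.out.println(lookup.regionFor("999")); // prints region-C
    }
}
```

The sketch looks up a single key per call; a file whose keys span more than one region is outside what this toy model covers.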
HBase hadoop jar execution:
Once the HFiles are created in the HDFS directory, we can use the command below to load their contents into the HBase table.
hadoop jar /home/acadgild/Downloads/hbase-server-0.98.14-hadoop2.jar completebulkload /hbase_output_dir/ Academp

Scan Academp table:
Now, we can use the scan command on the table Academp to see the contents that were imported from the HDFS path.
scan 'Academp'

Thus, from the above steps, we can observe that we have successfully bulk loaded data into an HBase table using the Java MapReduce API.
We hope this post has been helpful in understanding importing bulk data into HBase table. In case of any queries, feel free to comment below and we will get back to you at the earliest.
Keep visiting for more posts on Big Data and other technologies.

