
Querying HBase using Apache Spark

In this blog, we will see how to access and query HBase tables using Apache Spark.

Spark can work on data present in multiple sources, such as the local filesystem, HDFS, Cassandra, HBase, and MongoDB.

For a basic understanding of HBase, refer to our Beginners Guide to HBase.

Now, let us walk through the steps for accessing HBase tables through Spark.

First, start the HBase services so that the HMaster is running, typically with the start-hbase.sh script.

Create an HBASE_PATH environment variable to hold the HBase classpath, for example: export HBASE_PATH=$(hbase classpath)

Start the Spark shell, passing the HBASE_PATH variable so that all the HBase JARs are on the classpath, for example: spark-shell --driver-class-path $HBASE_PATH

Now that HBase and Spark are running, we will create the connection to HBase through the Spark shell.

Import the required libraries as given below:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor, HColumnDescriptor}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{Put, HTable}

// Create an HBase configuration object

val conf = HBaseConfiguration.create()
val tablename = "Acadgild_spark_Hbase"
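
The client looks for ZooKeeper on localhost by default. If your quorum runs elsewhere, point the configuration at it before connecting; a minimal sketch (the host name below is a placeholder, not part of the original setup):

// Optional: point the client at a remote ZooKeeper quorum
// ("zk-host-1" is a placeholder; replace it with your own quorum hosts)
conf.set("hbase.zookeeper.quorum", "zk-host-1")
conf.set("hbase.zookeeper.property.clientPort", "2181")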

// Set the input table and create an HBaseAdmin instance

conf.set(TableInputFormat.INPUT_TABLE, tablename)
val admin = new HBaseAdmin(conf)

// Create the table if it does not already exist

if (!admin.isTableAvailable(tablename)) {
  print("creating table:" + tablename + "\t")
  val tableDescription = new HTableDescriptor(tablename)
  tableDescription.addFamily(new HColumnDescriptor("cf".getBytes()))
  admin.createTable(tableDescription)
} else {
  print("table already exists")
}

// Check whether the table now exists

admin.isTableAvailable(tablename)

If the table exists, this call returns true.

Now we will put some data into it:

val table = new HTable(conf, tablename)
for (x <- 1 to 10) {
  val p = new Put(Bytes.toBytes("row" + x))
  // the column family must match the one the table was created with ("cf")
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("column1"), Bytes.toBytes("value" + x))
  table.put(p)
}
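
As an optional sanity check, you can scan the table and print what was written; a minimal sketch using the same client API (the scanner and result names are our own):

import org.apache.hadoop.hbase.client.Scan

// Scan the table and print each row key with the value of cf:column1
val scanner = table.getScanner(new Scan())
var result = scanner.next()
while (result != null) {
  println(Bytes.toString(result.getRow) + " => " +
    Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"))))
  result = scanner.next()
}
scanner.close()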

Now we can create an RDD from the data present in HBase using newAPIHadoopRDD, passing the HBase configuration, the input format, and the output key and value classes.
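
For example, a minimal sketch (sc is the SparkContext the Spark shell provides; hBaseRDD is our own name):

import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result

// Read the table set via TableInputFormat.INPUT_TABLE into an RDD of
// (row key, row contents) pairs
val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

Each element of the RDD is a pair of the row key (ImmutableBytesWritable) and the full row (Result).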

We can then perform all the usual transformations and actions on the created RDD.
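
For instance, a short sketch that counts the rows and extracts the value of cf:column1 written above (the rows name is our own):

// Count the rows fetched from HBase
hBaseRDD.count()

// Extract (row key, value of cf:column1) from each result
val rows = hBaseRDD.map { case (key, result) =>
  (Bytes.toString(key.get()),
   Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"))))
}
rows.collect().foreach(println)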

We hope this blog helped you understand the integration of Spark with HBase. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
