Connecting HBase with Python Application using Thrift Server

Apache HBase is the Hadoop database: a distributed, scalable Big Data store. It is a good fit when you need random, real-time read/write access to your Big Data.

HBase provides several ways to interact with it. Because HBase is built in Java, the Java API is the most widely used and offers the most functionality.

In case you would like to use HBase without Java, HBase provides you with two options:

  1. Thrift Interface – This is the more lightweight, and hence the faster, of the two options.

  2. REST Interface – This uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface.

In this post, we will look at what Thrift is, how to install it on CentOS, and a Python code sample for accessing HBase.

You need a Hadoop cluster with HBase installed to follow the steps in this post.

So, what is Apache Thrift?

Apache Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages.

When to use Thrift?

Thrift can be used when a service written in one language needs to be accessed from an application written in another. For example, HBase is built in Java, so a web application written in Python can access HBase through the Thrift API.

HBase Thrift

In the context of HBase, Java is the only language that can access HBase directly. The Thrift interface acts as a bridge that allows other languages to access HBase through a Thrift server, which in turn talks to HBase using the Java client.

For Thrift to work, another HBase daemon needs to be running to handle these requests. This daemon ships with HBase, but we need to install some additional dependencies along with it. The diagram below shows how Thrift and REST are placed in the cluster.

Note:

The Thrift and REST client hosts usually don’t run any other services (such as DataNodes or RegionServers) to keep the overhead low and responsiveness high for REST or Thrift interactions.

Make sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the application that needs access to HBase.

The downside to Thrift is that it is much more difficult to set up than REST. You will need to compile Thrift and generate the language-specific bindings. These bindings are nice because they give you code in the language you are working in, so there is no need to parse XML or JSON as with REST. Instead, the Thrift interface gives you direct access to the row data. Another nice feature is that the Thrift protocol has a native binary transport, so you will not need to base64 encode and decode data.
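To make the contrast with REST concrete, here is a minimal sketch of the decoding work a REST client would have to do. The payload shape (a "Row" list whose entries carry a base64 "key" and "Cell" entries with base64 "column" and "$" fields) is an assumption modelled on typical HBase REST JSON output, the sample data is made up, and the helper name decode_rest_row is hypothetical.

```python
import base64
import json


def _b64(text):
    """Base64-encode a string, as a REST server would before sending it."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_rest_row(payload):
    """Turn a REST-style JSON payload into {row_key: {column: value}}."""
    decoded = {}
    for row in json.loads(payload)["Row"]:
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row["Cell"]:
            column = base64.b64decode(cell["column"]).decode("utf-8")
            cells[column] = base64.b64decode(cell["$"]).decode("utf-8")
        decoded[key] = cells
    return decoded


# Made-up sample payload in the assumed shape (not real server output).
sample = json.dumps(
    {"Row": [{"key": _b64("row1"),
              "Cell": [{"column": _b64("cf:a"), "$": _b64("hello")}]}]}
)
print(decode_rest_row(sample))
```

With the Thrift bindings this step disappears, because row keys, column names and values arrive as ordinary values on the returned objects.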

To start using the Thrift interface, you need to know which port it is running on. The default port is 9090.

Now, let’s look at an example of accessing HBase with Python.

Step by Step Tutorial: Accessing HBase with Python

  1. First, we need to install all language-specific dependencies on the operating system where the Thrift server will run.

  2. Install the build and Python dependencies on CentOS:

sudo yum install boost-devel php-devel pcre-devel automake libtool flex bison pkgconfig gcc-c++ libevent-devel zlib-devel python-devel ruby-devel libtool*

  3. Download the Thrift source archive using the command below:

wget https://archive.apache.org/dist/thrift/0.6.1/thrift-0.6.1.tar.gz

Untar the Thrift archive using the commands below and then change into the thrift-0.6.1 directory:

$ tar xfz thrift-0.6.1.tar.gz
$ cd thrift-0.6.1/

We need to run the configure script from the thrift-0.6.1 directory.

$ ./configure

After running configure, its summary shows which language libraries Thrift will build:

Building C++ Library ......... : no
Building C (GLib) Library .... : no
Building Java Library ........ : no
Building C# Library .......... : no
Building Python Library ...... : yes
Building Ruby Library ........ : no
Building Haskell Library ..... : no
Building Perl Library ........ : no
Building PHP Library ......... : yes
Building Erlang Library ...... : yes

The configure script is responsible for getting ready to build the software on your specific system. It makes sure all of the dependencies for the rest of the build and install process are available, and finds out whatever it needs to know to use those dependencies.

$ sudo make

Once configure has done its job, we can invoke make to build the software. This runs a series of tasks defined in a Makefile to build the finished program from its source code.

$ sudo make install

Now that the software is built and ready to run, the files can be copied to their final destinations. The make install command will copy the built program, and its libraries and documentation, to the correct location.
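One way to confirm that the install step put the thrift compiler on your PATH is a quick lookup from Python. This is only a sketch: it assumes Python 3.3 or later (for shutil.which), and the helper name find_tool is made up for illustration.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


# "thrift" is the compiler binary installed by `make install`.
path = find_tool("thrift")
print(path if path else "thrift compiler not found on PATH")
```

If the lookup prints the not-found message, the install prefix chosen by configure is probably not on PATH for the current shell.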

After installing Thrift, the thrift compiler should be on your path. We now need to use it to generate the language-specific bindings for accessing HBase.

Generate the HBase Thrift Python module using the command below:

thrift --gen py Hbase.thrift

Note:

You can find Hbase.thrift inside the HBase source tree, so the command can also be run against its full path:

thrift -gen py /path/to/hbase/source/hbase-VERSION/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift

Alternatively, you can download Hbase.thrift from the following link:

HBase thrift

The above command will create a ‘gen-py’ folder in your current working directory.

  4. The HBase Thrift Python module needs to be added to the Python path using the command below:
export PYTHONPATH=$PYTHONPATH:/<path to gen-py>

  5. The HBase Thrift server needs to be started. This can be done by simply executing the following command:
$ hbase thrift start

This command will start the HBase Thrift server on port 9090, which is the default port.
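As an alternative to the PYTHONPATH export shown earlier, the generated module folder can be added to the import path from inside the script itself. The following is a minimal sketch; the helper name add_gen_py_to_path is made up for illustration, and it assumes the gen-py folder sits in the directory the script is run from.

```python
import os
import sys


def add_gen_py_to_path(gen_py_dir):
    """Put the generated-bindings folder on sys.path if it is not there yet."""
    gen_py_dir = os.path.abspath(gen_py_dir)
    if gen_py_dir not in sys.path:
        sys.path.insert(0, gen_py_dir)
    return gen_py_dir


# Assumption: "gen-py" is in the current working directory.
add_gen_py_to_path("gen-py")
```

This has to run before the line that imports the hbase module, otherwise the import will still fail.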

Open a new terminal and make sure that the table you want to work with already exists in HBase. You can check the list of tables by typing the list command in the HBase shell.

Note: Do not close the terminal in which the Thrift server was started.
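Before running any client code, it can help to check that something is actually listening on the Thrift port. The sketch below uses only the standard library; the helper name is_port_open is hypothetical, and localhost with port 9090 are simply the host and default port used in this post.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Prints True only if a server (ideally the Thrift server) accepts connections.
print(is_port_open("localhost", 9090))
```

A True result only shows that some process accepts connections on that port; it does not prove the process is the HBase Thrift server.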

Save the code given below in a file with the .py extension (we have saved it as table.py), and then run it, for example with python table.py, to print the names of all tables on the terminal.

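# Thrift runtime classes plus the generated HBase bindings (from gen-py)
from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase
# Buffered transport over a socket to the Thrift server (default port 9090)
transport = TBufferedTransport(TSocket('localhost', 9090))
transport.open()
# Binary protocol on top of the transport, then the HBase client
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
print(client.getTableNames())
# Release the socket once we are done
transport.close()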

Running the Python script prints the names of the tables in HBase.

Let’s explore each part of the code in brief.

A Brief Description of the code saved as table.py:

The lines below import the required Thrift and HBase modules.

from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

The line below creates a buffered transport on top of a socket to the Thrift server, which is what lets the Thrift client connect and talk to it.

transport = TBufferedTransport(TSocket('localhost', 9090))

Next we need to open the socket to the Thrift server.

transport.open()
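Since the transport holds an open socket, it is worth closing it when the work is done, even if an error occurs along the way. One possible way to do that is a small context manager, sketched below. The name opened is made up for illustration; the only assumption about the transport is that it has open() and close() methods, which Thrift transports provide.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open a transport and make sure it is closed afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        # Runs on normal exit and when an exception is raised.
        transport.close()


# Possible usage with the objects from table.py:
# with opened(TBufferedTransport(TSocket('localhost', 9090))) as transport:
#     client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
#     print(client.getTableNames())
```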

TBinaryProtocol is Thrift's binary protocol implementation. Wrapping the transport in it means requests and responses are encoded in binary form.

protocol = TBinaryProtocol.TBinaryProtocol(transport)

The below lines create the Client object which will be used to interact with HBase. From this client object, you will issue all your Gets and Puts.

client = Hbase.Client(protocol)
print(client.getTableNames())
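To give an idea of what issuing Puts and Gets through this client might look like, here is a hedged sketch. It assumes the older Hbase.thrift interface in which mutateRow(tableName, row, mutations) and getRow(tableName, row) take no extra attributes argument; newer HBase versions add one, so check the generated Hbase.py for the exact signatures. The helper names, the table name mytable and the column cf:greeting are made up for illustration, and the table with that column family must already exist.

```python
def put_cell(client, mutation_cls, table, row, column, value):
    """Write one cell; `column` is given as 'family:qualifier'."""
    client.mutateRow(table, row, [mutation_cls(column=column, value=value)])


def get_cell(client, table, row, column):
    """Return the value stored at row/column, or None if it is absent."""
    for result in client.getRow(table, row):  # list of row results
        cell = result.columns.get(column)
        if cell is not None:
            return cell.value
    return None


def demo():
    """Wire the helpers to a live Thrift server (needs gen-py on the path)."""
    from thrift.transport.TSocket import TSocket
    from thrift.transport.TTransport import TBufferedTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase
    from hbase.ttypes import Mutation

    transport = TBufferedTransport(TSocket('localhost', 9090))
    transport.open()
    try:
        client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
        # Hypothetical table and column names.
        put_cell(client, Mutation, "mytable", "row1", "cf:greeting", "hello")
        print(get_cell(client, "mytable", "row1", "cf:greeting"))
    finally:
        transport.close()
```

Calling demo() requires the Thrift server from the earlier step to be running. Depending on the Python and Thrift versions in use, values may need to be passed as bytes rather than str.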

We hope this blog helped you understand how to access HBase from Python scripts.

One Comment

  1. This is very useful and gives an high level idea on using python with HBase.
    I have a question and it would be great if you can help me with this.
    What happens if two application programs tries writing to the same hbase table ?
    Assume to same columnfamily.
    How HBase handles this scenario?
    Thanks,
    Nash
