Apache HBase is the Hadoop database: a distributed, scalable Big Data store. Use Apache HBase when you need random, real-time read/write access to your Big Data.
HBase provides several ways to interact with it. Since HBase is built in Java, the Java API is the most widely used and offers the most functionality.
In case you would like to use HBase without Java, HBase provides you with two options:
Thrift Interface – This is the more lightweight, and hence the faster, of the two options.
REST Interface – This uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface.
In this post, we will learn about the concept of Thrift, how to install it on CentOS, and how to access HBase from Python with a code sample.
You must have a Hadoop cluster with HBase installed on it to follow the steps explained in this post.
So, what is Apache Thrift?
When to use Thrift?
Thrift can be used when a service written in one language needs to access a service written in another language. For example, HBase is built in Java; if there is a web application written in Python, it can access HBase through the Thrift API.
In the context of HBase, Java is the only language which can access HBase directly. The Thrift interface acts as a bridge that allows other languages to access HBase through a Thrift server, which in turn interacts with the Java client.
For Thrift to work, another HBase daemon needs to be running to handle these requests. This daemon comes with HBase, but we need to install some additional dependencies along with it. The below diagram shows how Thrift and REST are placed in the cluster.
The Thrift and REST client hosts usually don’t run any other services (such as DataNodes or RegionServers) to keep the overhead low and responsiveness high for REST or Thrift interactions.
Make sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the application that needs access to HBase.
The downside to Thrift is that it’s much more difficult to set up than REST. You will need to compile Thrift and generate the language-specific bindings. These bindings are nice because they give you code for the language you are working in, meaning there’s no need to parse XML or JSON as in REST. Rather, the Thrift interface gives you direct access to the row data. Another nice feature is that the Thrift protocol has native binary transport; you will not need to base64 encode and decode data.
To start using the Thrift interface, you need to figure out which port it’s running on. The default port is 9090.
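To make the comparison concrete, here is a minimal sketch of the extra work a REST client has to do: parse the JSON and base64-decode every key, column, and value. The payload is built by hand for illustration, not captured from a real server, and its field names (Row, key, Cell, column, $) are our assumption about the shape of the REST gateway's JSON output, so check them against your own server's response.

```python
import base64
import json


def b64(text):
    """Base64-encode a string so we can build the sample payload below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hand-built sample payload (an assumption, not output from a real server).
sample_response = json.dumps({
    "Row": [{
        "key": b64("row1"),
        "Cell": [{"column": b64("cf:name"), "timestamp": 1, "$": b64("alice")}],
    }]
})


def decode_rows(payload):
    """Parse the JSON text and base64-decode every key, column, and value."""
    decoded = {}
    for row in json.loads(payload)["Row"]:
        key = base64.b64decode(row["key"]).decode("utf-8")
        decoded[key] = {
            base64.b64decode(cell["column"]).decode("utf-8"):
                base64.b64decode(cell["$"]).decode("utf-8")
            for cell in row["Cell"]
        }
    return decoded


print(decode_rows(sample_response))
```

With the Thrift bindings, none of this decoding step is needed, because the row data arrives as native values.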
Now, let’s look at an example of accessing HBase with Python.
Step by Step Tutorial: Accessing HBase with Python
First, we need to install all the language-specific dependencies on the operating system where the Thrift server will be started.
Install the Python dependencies for CentOS:
sudo yum install boost-devel php-devel pcre-devel automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel libtool*
- Download the Thrift source archive using the below command:
Untar the Thrift archive using the below command, and then change the directory to thrift-0.6.1:
$ tar xfz thrift-0.6.1.tar.gz
$ cd thrift-0.6.1/
We need to run the configure script from the thrift-0.6.1 directory.
After running configure, we will be able to see which language libraries Thrift will build on this system:
Building C++ Library ......... : no
Building C (GLib) Library .... : no
Building Java Library ........ : no
Building C# Library .......... : no
Building Python Library ...... : yes
Building Ruby Library ........ : no
Building Haskell Library ..... : no
Building Perl Library ........ : no
Building PHP Library ......... : yes
Building Erlang Library ...... : yes
The configure script is responsible for getting ready to build the software on your specific system. It makes sure all of the dependencies for the rest of the build and install process are available, and finds out whatever it needs to know to use those dependencies.
$ sudo make
Once configure has done its job, we can invoke make to build the software. This runs a series of tasks defined in a Makefile to build the finished program from its source code.
$ sudo make install
Now that the software is built and ready to run, the files can be copied to their final destinations. The make install command will copy the built program, and its libraries and documentation, to the correct location.
After installing Thrift, we need to generate the language-specific bindings for accessing HBase.
We can generate the HBase Thrift Python module using the below command:
thrift --gen py Hbase.thrift
Once this is done, you should have Thrift in your path.
You can find Hbase.thrift in the HBase source tree, at the path used in the command below:
thrift -gen py /path/to/hbase/source/hbase-VERSION/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
Alternatively, you can download Hbase.thrift from the following link:
The above command will create a ‘gen-py’ folder in your current working directory.
- The HBase Thrift Python module needs to be added to python path using the below command:
export PYTHONPATH=$PYTHONPATH:/<path to gen-py>
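If you would rather not change the shell environment, a script can extend its own import path instead. The sketch below is one way to do that; the helper name add_gen_py is ours, and it assumes the gen-py folder sits in the current working directory.

```python
import os
import sys


def add_gen_py(gen_py_dir):
    """Put the generated bindings directory at the front of the import path."""
    path = os.path.abspath(gen_py_dir)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Assumption: "gen-py" is in the current working directory.
add_gen_py("gen-py")
```

Call this before the line that imports the hbase module, so that Python can find the generated bindings.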
- The HBase Thrift server needs to be started. This can be done by simply executing the following command:
$ hbase thrift start
This command will start the HBase Thrift server on port 9090, which is the default port.
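Before running any client code, it can help to confirm that something is actually listening on that port. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is ours, and localhost is an assumption that only holds if you run the check on the same machine as the Thrift server.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


# Prints True only if a server is listening on localhost:9090.
print(is_port_open("localhost", 9090))
```

A True result only tells us that the port accepts connections, not that the process behind it is the HBase Thrift server.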
Open a new terminal and make sure that the table has already been created in HBase. We can cross-check the list of tables by typing the command list in the HBase shell.
Note: Do not close the terminal in which the Thrift server is running.
Write the code given below and save it with the extension .py; in our case, we have saved it in a file named table.py. Then execute it using the command shown in the screenshot below to print all the table names on the terminal.
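# Import the Thrift transport and protocol classes and the generated HBase bindings
from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

# Connect to the Thrift server on its default port
transport = TBufferedTransport(TSocket('localhost', 9090))
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Create the client and print the names of all the tables
client = Hbase.Client(protocol)
print(client.getTableNames())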
We can see that by executing the Python script we are able to fetch the names of the tables in HBase.
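Depending on the Python and Thrift library versions in use, the table names may come back as bytes rather than text. The sketch below is one way to normalize them; to_text is a hypothetical helper of ours, UTF-8 is an assumed encoding, and the sample list stands in for the value returned by client.getTableNames().

```python
def to_text(names, encoding="utf-8"):
    """Return the names as str, decoding any that arrived as bytes."""
    return [n.decode(encoding) if isinstance(n, bytes) else n for n in names]


# Stand-in for the list returned by client.getTableNames().
print(to_text([b"users", "orders"]))
```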
Let’s explore each part of the code in brief.
A Brief Description of the code saved as table.py:
The below commands will import all the required HBase Thrift modules.
from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase
The below line creates the socket transport, wrapped in a buffered transport, which the Thrift client uses to connect and talk to the Thrift server.
transport = TBufferedTransport(TSocket('localhost', 9090))
Next, we need to open the socket to the Thrift server.
transport.open()
TBinaryProtocol is the binary protocol implementation of Thrift. The below line wraps the transport in it, so that messages are sent in binary form.
protocol = TBinaryProtocol.TBinaryProtocol(transport)
The below lines create the Client object which will be used to interact with HBase. From this client object, you will issue all your Gets and Puts.
client = Hbase.Client(protocol)
print(client.getTableNames())
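In a longer script, we may want the transport to be closed even when a call fails. The following is a minimal sketch of one way to arrange that with a context manager. The names opened and FakeTransport are ours: FakeTransport is only a stand-in with the same open() and close() methods, so that the sketch runs without a Thrift server.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open the transport on entry and always close it on exit."""
    transport.open()
    try:
        yield transport
    finally:
        transport.close()


class FakeTransport:
    """Stand-in that records calls, used here instead of a real transport."""

    def __init__(self):
        self.events = []

    def open(self):
        self.events.append("open")

    def close(self):
        self.events.append("close")


fake = FakeTransport()
with opened(fake):
    fake.events.append("work")
print(fake.events)  # ['open', 'work', 'close']
```

In table.py, the same helper could wrap TBufferedTransport(TSocket('localhost', 9090)), with the protocol and client created inside the with block.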
We hope this blog helped you understand how to access HBase with Python scripts.