Running Hadoop in Hortonworks
In this blog, we will see how to run a Hadoop job in the Hortonworks Sandbox (HDP Sandbox). HDP stands for Hortonworks Data Platform. It is an enterprise-ready, open source Apache Hadoop distribution built around a central architecture (YARN). Other vendors, such as Cloudera and MapR, provide their own distributions of Hadoop. There can always be a discussion about which distribution is better and provides more features; however, we are not going to get into that here. If you are interested in exploring HDP, you have landed in the right place!
To download HDP Sandbox, you can go to this link and pick the version you want. For this blog, I have used HDP 2.1.
NOTE: Check the system requirements before downloading a specific HDP version.
Assuming that you have downloaded and installed HDP Sandbox, let us proceed. The first step is logging in to HDP Sandbox using PuTTY. As the Sandbox console offers only a command line interface (CLI), you can use PuTTY to log in to the Sandbox and play with it. When you start your HDP Sandbox, you will see a black screen as shown below:
The highlighted part shows the IP address and port number. Using these, you can connect to HDP Sandbox. Start a PuTTY session and fill in the details as shown below:
After filling in the connection details, you can connect to HDP. Below is an image of the login screen. Use the ID and password mentioned below to log in.
To open the HDP user interface (UI), enter ip-address:8000 in your web browser.
Those were the steps to log in to HDP and open the UI. Now we are ready to run Hadoop jobs. This blog will help you run Hive, Pig, and Hive-Pig (using HCatalog) jobs from end to end.
Loading the data to HDFS
There are two ways to accomplish this task.
- Load the data directly using the UI
- Load the data from the terminal (CLI) using -put
Loading data using UI
In the web UI of HDP Sandbox, there is an option called “File Browser”. You can use it to load data into HDFS from the local machine. Below are the steps:
Click “File Browser” → “upload” → “file/zip file” → “select”
A few screenshots are shown below:
You can cross-check by clicking the file.
The other technique is to use the command line.
Loading data from CLI
This is the traditional technique, which you would have used on any Hadoop platform:
hadoop fs -put /source_file_path /hdfs_location
I have a sample file named “users.txt” on my local machine, which I will load into the Hadoop Distributed File System (HDFS) using the traditional put command.
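The put step can also be wrapped in a short script. Below is a minimal sketch, assuming users.txt sits in the current directory. The target directory /user/demo and the function name upload_to_hdfs are hypothetical names picked for illustration, not the ones used in this blog.

```shell
#!/bin/sh
# Sketch: copy a local file into HDFS and list the target directory.
LOCAL_FILE="users.txt"     # sample file named in this blog
HDFS_DIR="/user/demo"      # hypothetical HDFS directory; use one you can write to

upload_to_hdfs() {
    # $1 = local file, $2 = HDFS directory
    # Create the target directory if it does not exist yet.
    hadoop fs -mkdir -p "$2" || return 1
    # Copy the local file into HDFS.
    hadoop fs -put "$1" "$2/" || return 1
    # List the directory so you can confirm that the file arrived.
    hadoop fs -ls "$2"
}

# Run only where the hadoop client is installed (for example, inside the Sandbox).
if command -v hadoop >/dev/null 2>&1; then
    upload_to_hdfs "$LOCAL_FILE" "$HDFS_DIR"
else
    echo "hadoop client not found; run this inside the HDP Sandbox"
fi
```

If the listing shows users.txt under the target directory, the file is in HDFS and should also be visible in the File Browser of the web UI.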
Now that the data is available in HDFS, we will explore a few basics using HDP Sandbox.
Using Pig in HDP Sandbox
When you open the web UI of HDP Sandbox, you will see multiple icons, one of which is for Pig.
Click it and the Pig editor will open as shown below:
I’m loading the “users.txt” file into Pig relation A and dumping it. Below is the screenshot:
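For readers who prefer the terminal, the same load-and-dump can be sketched as a script file run with the pig command. This is only an illustration: the HDFS path /user/demo/users.txt and the file name load_users.pig are hypothetical, and TextLoader is used here as an assumption because the delimiter of users.txt is not stated, so every line is read as a single field.

```shell
#!/bin/sh
# Sketch: write a small Pig Latin script to a file and run it with the pig CLI.
INPUT_PATH="/user/demo/users.txt"   # hypothetical HDFS path; use your own upload location
SCRIPT_FILE="load_users.pig"        # hypothetical script name

cat > "$SCRIPT_FILE" <<EOF
-- Load each line of the file as one chararray field into relation A.
A = LOAD '$INPUT_PATH' USING TextLoader() AS (line:chararray);
-- Print the contents of A; this is the step that makes Pig run the job.
DUMP A;
EOF

# Run only where the pig client is installed (for example, inside the Sandbox).
if command -v pig >/dev/null 2>&1; then
    pig -f "$SCRIPT_FILE"
else
    echo "pig not found; paste the contents of $SCRIPT_FILE into the Pig editor instead"
fi
```

The two Pig statements in the generated file can also be pasted into the Pig editor in the web UI as they are.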
In the “title” box, you can enter any name of your choice. When you later check “Query history”, the same name will appear against the script.
Once you click “Execute”, Pig will start a MapReduce (MR) job in the backend. After successful completion, you will get the result.
You can check your query in the history. Just click “Query history” and that’s all!
Using Hive UI (Beeswax) in HDP
The HDP Sandbox provides a neat interface called Beeswax, through which we can work with Hive.
You can start Beeswax as shown below:
You can see in the above image that once you click Beeswax, the “query editor” opens by default. This is the place where you can run your Hive analytical queries.
When you start working with Hive, the first thing to do is choose a database: you can either use the default database or create your own and then create Hive tables inside it.
We will see how to create a database and then work inside it. To create a database, click “Databases” and then “create a new database”.
Once you click “create a new database”, you will be able to create a database in the two steps shown below.
I have given the database name as “demo_database”.
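The same database can also be created with a HiveQL statement instead of the UI wizard. Here is a minimal sketch, assuming the hive command line client is available in the Sandbox; the file name create_db.hql is a hypothetical choice.

```shell
#!/bin/sh
# Sketch: create the database from the command line instead of the UI.
HQL_FILE="create_db.hql"   # hypothetical file name

cat > "$HQL_FILE" <<'EOF'
-- Create the database only if it is not already there.
CREATE DATABASE IF NOT EXISTS demo_database;
-- List databases to confirm that demo_database exists.
SHOW DATABASES;
EOF

# Run only where the hive client is installed (for example, inside the Sandbox).
if command -v hive >/dev/null 2>&1; then
    hive -f "$HQL_FILE"
else
    echo "hive not found; paste the statements from $HQL_FILE into the query editor instead"
fi
```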
That is all that is required to create a database. Next is creating a table inside our “demo_database”.
Let us see how!
Click “tables” and then, on the left-hand side, select the desired database.
Next, click “create a table manually” and proceed.
Creating a table is a six-step process in which you fill in the details.
At the end, click “create table”.
After this, you can find the respective table inside your database.
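The wizard ends up issuing a CREATE TABLE statement, which can also be written by hand. The sketch below is only an illustration: the columns user_id and user_name and the comma delimiter are hypothetical, because the layout of users.txt is not given in this blog, and the file name create_table.hql is made up as well. Adjust all of them to match your data.

```shell
#!/bin/sh
# Sketch: create demo_table inside demo_database from the command line.
HQL_FILE="create_table.hql"   # hypothetical file name

cat > "$HQL_FILE" <<'EOF'
USE demo_database;
-- Column names, types and the delimiter below are hypothetical placeholders.
CREATE TABLE IF NOT EXISTS demo_table (
    user_id INT,
    user_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Show the columns to confirm that the table was created.
DESCRIBE demo_table;
EOF

# Run only where the hive client is installed (for example, inside the Sandbox).
if command -v hive >/dev/null 2>&1; then
    hive -f "$HQL_FILE"
else
    echo "hive not found; paste the statements from $HQL_FILE into the query editor instead"
fi
```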
Once you complete the table creation process, you are good to go and can load the data into the table.
After you click “execute”, your job will run and the data will be loaded into the table.
You can go to the table and check whether the data got loaded, or you can run select * from demo_table in the query editor to cross-check.
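The load and the cross-check can be sketched as one small HiveQL script. This assumes the file was uploaded to the hypothetical HDFS path /user/demo/users.txt and that the load is done with LOAD DATA INPATH; the file name load_and_check.hql is also made up for illustration.

```shell
#!/bin/sh
# Sketch: load the HDFS file into demo_table and read it back.
INPUT_PATH="/user/demo/users.txt"   # hypothetical HDFS path; use your own upload location
HQL_FILE="load_and_check.hql"       # hypothetical file name

cat > "$HQL_FILE" <<EOF
USE demo_database;
-- Load the file from HDFS into the table.
LOAD DATA INPATH '$INPUT_PATH' INTO TABLE demo_table;
-- Cross-check that the rows are there.
SELECT * FROM demo_table;
EOF

# Run only where the hive client is installed (for example, inside the Sandbox).
if command -v hive >/dev/null 2>&1; then
    hive -f "$HQL_FILE"
else
    echo "hive not found; paste the statements from $HQL_FILE into the query editor instead"
fi
```

Note that LOAD DATA INPATH moves the file from its HDFS location into the table's directory, so it will no longer appear at the original path afterwards.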
You can run Hive queries and see the result.
I hope this blog helped you run your Hadoop job in HDP Sandbox. To run a MapReduce job from the CLI, check this
Keep visiting www.acadgild.com for more updates.