
Running Hadoop Job in HDP Sandbox

July 15

Running Hadoop in Hortonworks

In this blog, we will see how to run a Hadoop job in the Hortonworks Sandbox (HDP Sandbox). HDP stands for Hortonworks Data Platform. It is an enterprise-ready, open-source Apache Hadoop distribution built on a centralized architecture (YARN). Other vendors in the market, such as Cloudera and MapR, provide their own Hadoop distributions. There can always be a debate over which distribution is better and offers more features; however, that is not the topic here. If you are interested in exploring HDP, you have landed in the right place!
To download the HDP Sandbox, you can go to this link and choose a version of your choice. For this blog, I have used HDP 2.1.
NOTE: Check the system requirements before downloading a specific HDP version.
Assuming that you have downloaded and installed the HDP Sandbox, let us proceed. The first step is logging in to the HDP Sandbox using PuTTY. As the Sandbox exposes a Command Line Interface (CLI), you can use PuTTY to log in and play with it.

When you start your HDP Sandbox, the console shows a black screen displaying the Sandbox's IP address and port number. Using these, you can connect to the Sandbox: start a PuTTY session and fill in the connection details (the IP address and port from the console screen). After filling in the connection details, you can connect to HDP. At the login prompt, use the following credentials:
Id: root
Password: hadoop

Those were the steps to log in to HDP and reach the UI. Now we are ready to run Hadoop jobs. This blog will help you run Hive, Pig, and Hive-Pig (via HCatalog) jobs end to end.
Loading the data into HDFS
There are two ways to accomplish this task.

  • Load the data directly using the UI
  • Load the data from the terminal (CLI) using -put

Loading data using the UI
In the web UI of the HDP Sandbox, there is an option called "File Browser". You can use this to load data into HDFS from the local machine. The steps are:
Click "File Browser" → "Upload" → "File/Zip file" → "Select"
Another technique is:
Loading data from the CLI
This is the traditional technique you would use on any Hadoop platform:
hadoop fs -put /source_file_path /hdfs_location
I have a sample file named "users.txt" present locally, which I will load into the Hadoop Distributed File System (HDFS) using the traditional put command.
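As a concrete sketch, assuming users.txt sits in /root on the Sandbox (the local and HDFS paths below are illustrative, not taken from the original post), the transfer and a quick verification might look like:

```
# Copy the local file into HDFS (paths are assumptions for illustration)
hadoop fs -put /root/users.txt /user/hue/users.txt

# Verify the file landed in HDFS
hadoop fs -ls /user/hue/

# Peek at the first few lines of the uploaded file
hadoop fs -cat /user/hue/users.txt | head
```

The -ls and -cat checks are optional, but they confirm the upload before you point Pig or Hive at the file.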

Now that the data is available in HDFS, we will explore a few basics using the HDP Sandbox.

Using Pig in HDP Sandbox

When you open the web UI of the HDP Sandbox, you will see multiple icons, one of which is Pig.
Click it and the Pig editor will open. In the "Title" box you can enter any name of your choice; when you later check "Query History", the same name will appear against the script.
Once you click "Execute", Pig will start a MapReduce job in the background. After successful completion, you will get the result.
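A minimal Pig script you could paste into the editor might look like the following. The HDFS path and the two-column, tab-delimited schema for users.txt are assumptions for illustration, since the original file layout is not shown:

```
-- Load the sample file from HDFS; path and schema are assumed
users = LOAD '/user/hue/users.txt' USING PigStorage('\t')
        AS (id:int, name:chararray);

-- Keep only the first five records
top_users = LIMIT users 5;

-- Print them to the job output
DUMP top_users;
```

DUMP forces Pig to actually run the MapReduce job and show results in the editor, which makes it handy for a first smoke test.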
Using Hive in HDP Sandbox

When you click "Beeswax" in the web UI, the "Query Editor" opens by default. This is the place where you can run your Hive analytical queries.
When you start working with Hive, the first step is choosing a database: you can either use the default database or create your own, and then create Hive tables inside it.
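For example, the database-and-table setup described above could be sketched in HiveQL as follows; the database name, table schema, and HDFS path are illustrative assumptions, not taken from the original post:

```
-- Create a database and switch to it (names are assumptions)
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;

-- Define a table matching a tab-delimited users.txt
CREATE TABLE IF NOT EXISTS users (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Point the table at the data already uploaded to HDFS
LOAD DATA INPATH '/user/hue/users.txt' INTO TABLE users;

-- A simple analytical query to verify the load
SELECT COUNT(*) FROM users;
```

Note that LOAD DATA INPATH moves the file from its HDFS location into the Hive warehouse directory, so the original path will no longer hold the file afterwards.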
We will see how to create a database and then work inside it. To create a database, click "Databases" and then "Create a new database".

Hope this blog helped you in running your Hadoop job in the HDP Sandbox. To run a MapReduce job from the CLI, check this.
Keep visiting for more updates.