Scheduling Hadoop Jobs Using Rundeck
In this blog, we will discuss how to schedule Hadoop jobs using Rundeck. First, let's understand what Rundeck is.
Rundeck is open-source software used to automate ad-hoc and routine jobs in data center or cloud environments. Rundeck lets you run jobs in a distributed environment, where you can select the nodes on which your job runs. Rundeck also includes other features that make it easy to scale up your scripting efforts, including access control, workflow building, scheduling, logging, and integration with external sources for node and option data.
Installing Rundeck is very simple.
Download the Rundeck launcher jar from the official Rundeck downloads page.
Keep the jar file in the directory ~/rundeck
Note: To install Rundeck, you need to have Java installed on your system.
You can create the directory using the following command (sudo is not needed for a directory under your own home):
mkdir -p ~/rundeck
Now move the downloaded jar file into this directory:
mv path_to_jar_file ~/rundeck/
Now you can start Rundeck using the following command:
java -jar ~/rundeck/rundeck-launcher-2.7.1.jar
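The installation steps above can be consolidated into a short shell sketch. It assumes Java is installed and that the launcher jar (version 2.7.1, as used in this post) has already been downloaded to the current directory; the guards simply print a message when the jar is not present.

```shell
# Consolidated Rundeck install/start steps (assumes Java is installed and
# rundeck-launcher-2.7.1.jar was downloaded to the current directory).
mkdir -p ~/rundeck                                   # create the install directory
mv rundeck-launcher-2.7.1.jar ~/rundeck/ 2>/dev/null \
  || echo "launcher jar not found in current directory"
cd ~/rundeck
# Start Rundeck; it listens on port 4440 by default.
if [ -f rundeck-launcher-2.7.1.jar ]; then
  java -jar rundeck-launcher-2.7.1.jar
else
  echo "place rundeck-launcher-2.7.1.jar in ~/rundeck first"
fi
```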
After running this command, Rundeck starts up and prints the port number it is listening on at the end of the log output. By default, Rundeck runs on port 4440.
Now you can log in to the Rundeck GUI at http://localhost:4440
The default credentials are username: admin and password: admin
After successful login, you will get a screen as shown below.
Click on New Project. You will now see a screen as shown below.
Enter a project name and description of your choice. In the Resource Model Source section, click Edit, select the Require File Exists option as shown in the screenshot below, and click Save.
If you are executing jobs on a remote machine, you need to provide the SSH details in this section. Here we are running Rundeck on a single-node cluster. Finally, scroll to the bottom and click Create.
The next screen presents several options, as shown below.
Click on Jobs, then click Create a new job.
On the next screen, provide the details of your job, such as the job name, description, and workflow steps.
To provide the Hadoop jar command that runs your program, go to Add a Step in the workflow section.
Click on Execute a remote command. Here, enter the hadoop jar command you would normally use to run Hadoop programs on your cluster, giving the complete path to the jar file as shown below, and click Save.
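As a concrete illustration, the command entered in the workflow step looks like the sketch below. The jar path and class name here are hypothetical placeholders, while /avro_file1 is the output directory used later in this post; the guard skips execution when hadoop is not on the PATH.

```shell
# Hypothetical "hadoop jar" command for the Rundeck workflow step.
# Replace the jar path and main class with those of your own program;
# /avro_file1 is the HDFS output directory referred to in this post.
HADOOP_JAR_CMD='hadoop jar /home/user/sample-program.jar SampleDriver /input_data /avro_file1'
if command -v hadoop >/dev/null 2>&1; then
  $HADOOP_JAR_CMD
else
  echo "hadoop not found on PATH; command shown for illustration only"
fi
```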
When this job runs successfully, the output will be written to the /avro_file1 directory in HDFS.
Below you can see the many options and parameters that Rundeck allows you to set. If you want to run the job on the nodes of an existing cluster, use the Dispatch to Nodes option.
If you want to get notifications on the job status, you need a mail server installed on your system and the properties below enabled.
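For reference, Rundeck's email notifications are typically configured through SMTP settings in rundeck-config.properties; the host, port, and sender address below are placeholder values for a local mail server and should be adjusted for your environment.

```properties
# Placeholder SMTP settings for Rundeck email notifications
# (rundeck-config.properties); adjust for your own mail server.
grails.mail.host=localhost
grails.mail.port=25
grails.mail.default.from=rundeck@localhost
```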
After checking the properties, parameters, and options, click the Create button at the bottom. You will now see a screen as shown below.
On the right side, there are two options: Run Job Now and Run Job Later. If you want to schedule the job to run later, set the time by clicking Run Job Later; the job will then run at the time you scheduled. Here we are clicking Run Job Now.
After running the job, you will get several options to monitor it, as shown below.
Click on Monitor to check the status of the job. After the job completes successfully, a Report option appears, where you can see the job status and the console output, as shown in the screenshot below.
So we have successfully run a job using Rundeck. Now let us check for the output in HDFS.
You can do that from Rundeck itself. Click on Nodes at the top.
In the Command: console, you can run commands on any nodes of your choice. By default, the selection reads Run on 0 Nodes.
Go to the Nodes: filter below and click All Nodes. Select the NameNode, and you will now be able to run commands on the selected machine.
Now we check the output by running the hadoop fs -ls command from Rundeck, as shown below.
You can see that a part file has been created successfully. Let’s check for the output in this part file.
In the above screenshot, you can see the output of our sample program.
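The verification commands run from the Nodes console can be sketched as follows. The part-file name (part-r-00000) is the usual MapReduce default and is an assumption here, as the exact name is not given in the post; the guard skips execution when hadoop is not on the PATH.

```shell
# Check the job output in HDFS. The part-file name below (part-r-00000)
# is the typical MapReduce default and may differ in your run.
LIST_CMD='hadoop fs -ls /avro_file1'
CAT_CMD='hadoop fs -cat /avro_file1/part-r-00000'
if command -v hadoop >/dev/null 2>&1; then
  $LIST_CMD
  $CAT_CMD
else
  echo "hadoop not found on PATH; commands shown for illustration only"
fi
```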
There are many more options in Rundeck to explore. This has been a simple tutorial on how to schedule and monitor your Hadoop jobs using Rundeck.
We hope this blog helped you understand how to schedule a Hadoop job using Rundeck. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.