
Scheduling Hadoop Jobs using RUNDECK


In this blog, we will discuss how to schedule Hadoop jobs using Rundeck. First, let us look at what Rundeck is.

Rundeck is open source software for automating ad-hoc and routine jobs in data center or cloud environments. It lets you run jobs across a distributed environment and select the nodes on which each job runs. Rundeck also includes features that make it easier to scale up your scripting efforts, including access control, workflow building, scheduling, logging, and integration with external sources for node and option data.


Installing RUNDECK

Installing Rundeck is straightforward.

Download the Rundeck launcher jar from the link below:

https://drive.google.com/open?id=0ByJLBTmJojjzTmlxVFJJcjNUX0E

Keep the jar file in the directory ~/rundeck.

Note: Rundeck requires Java, so make sure Java is installed on your system.
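Before going further, it can help to confirm that a Java runtime is actually available on the PATH. The snippet below is a minimal sketch of such a check; the function name check_java is our own and is not part of Rundeck.

```shell
#!/bin/sh
# check_java: succeeds (exit status 0) when a `java` executable is on the PATH.
# The function name is illustrative; it is not part of Rundeck.
check_java() {
    if command -v java >/dev/null 2>&1; then
        echo "java found at: $(command -v java)"
        # `java -version` writes its version banner to stderr, so redirect it.
        java -version 2>&1
        return 0
    fi
    return 1
}

check_java || echo "java not found; install a JDK or JRE before starting Rundeck"
```

If the check fails, install Java first and run the check again before starting the launcher.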

You can create the directory using the command below:

mkdir -p ~/rundeck

Now you can move the downloaded file into this directory using the command below:

mv path_to_jar_file ~/rundeck/

Now you can start Rundeck using the command below:

java -jar ~/rundeck/rundeck-launcher-2.7.1.jar

After running this command, Rundeck starts up and prints its port number at the end of the startup output. By default, Rundeck runs on port 4440.
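Rundeck can take a little while to start, so a small polling loop is one way to find out when the web UI is ready. The following is a minimal sketch, assuming curl is installed; the helper name wait_for_url is our own and is not part of Rundeck.

```shell
#!/bin/sh
# wait_for_url URL TIMEOUT_SECONDS
# Polls URL about once per second until it answers or the attempts run out.
# Returns 0 if the URL responded, 1 otherwise. (Helper name is illustrative.)
wait_for_url() {
    url=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        # --silent hides progress output; --output /dev/null discards the page body.
        if curl --silent --output /dev/null --max-time 2 "$url"; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}

# Example usage against the default Rundeck port:
#   wait_for_url http://localhost:4440 60 && echo "Rundeck is up"
```

Any HTTP response, including a redirect to the login page, counts as success here, because the goal is only to detect that the server is listening.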

Now you can log in to the Rundeck GUI at http://localhost:4440

The default credentials are username: admin and password: admin

After successful login, you will get a screen as shown below.

Click on New Project. You will get a screen as shown below.

Enter a project name and description of your choice. Then, under Resource Model Source, click on Edit, select the option Require File Exists as shown in the screenshot below, and click on Save.

If you are executing jobs on a remote machine, you need to provide the SSH details below. Here we are running Rundeck on a single-node cluster. Now scroll to the bottom and click on Create.

The next screen shows the project page with different options, as shown below.

Click on Jobs and then click on Create a new job.


In the next screen, provide the necessary details of your job, such as the job name, job description, and workflow steps.

To provide the hadoop jar command that runs the program, go to Add Step in the workflow section.

Click on Execute a remote command. Here, enter the hadoop jar command that you normally use to run Hadoop programs on your cluster, giving the complete path to the jar file as shown below, and click on Save.
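For reference, the general shape of such a command is sketched below. The jar path, driver class, and input directory are hypothetical placeholders, not the values used in this tutorial; only the /avro_file1 output directory comes from this post.

```shell
#!/bin/sh
# Placeholder values: replace them with the paths used on your cluster.
JOB_JAR="/home/hadoop/jobs/sample-job.jar"   # hypothetical absolute path to the job jar
MAIN_CLASS="com.example.SampleJob"           # hypothetical driver class
INPUT_DIR="/input"                           # hypothetical HDFS input directory
OUTPUT_DIR="/avro_file1"                     # HDFS output directory used in this post

# General form: hadoop jar <jar> [mainClass] args...
HADOOP_CMD="hadoop jar $JOB_JAR $MAIN_CLASS $INPUT_DIR $OUTPUT_DIR"

# Print the assembled command so it can be pasted into the step's command field.
echo "$HADOOP_CMD"
```

Using the absolute path to the jar matters because Rundeck runs the step from its own working directory, not from the directory where you normally launch Hadoop jobs.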

After this job runs successfully, we should find the output in the /avro_file1 directory.

Below you can see many options and parameters that Rundeck allows you to set. If you want to run the job on an existing cluster, use the Dispatch to Nodes option.

If you want to get notifications about the job status, you need to install a mail server on your system and enable the properties below.

After checking the properties, parameters, and options, click on the Create button at the bottom. You will get a screen as shown below.

On the right side, there are two options: Run Job Now and Run Job Later. If you want to schedule this job to run after some time, you can set the time by clicking on Run Job Later. The job will run 30 minutes after your scheduled time. Here we are clicking on Run Job Now.

After running the job, you will get several options to monitor it, as shown below.

Click on Monitor to check the status of the job. After the job completes successfully, a Report option appears. Here you can see the status of the job and the console output, as shown in the screenshot below.

So we have successfully run a job using rundeck. Now let us check for the output in HDFS.

You can do that from Rundeck itself. Click on Nodes at the top.

Here, in the Command: console, you can run commands on any nodes of your choice. By default, the option reads Run on 0 Node.

Go to Nodes: below and click on All Nodes. Select the NameNode, and you will be able to run commands on your selected machine.

Now we are checking the output by running the Hadoop ls command from rundeck as shown below.

You can see that a part file has been created successfully. Let’s check for the output in this part file.

In the above screenshot, you can see the output of our sample program.
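The two checks above can be sketched as commands for the Command: console. The part-* glob is an assumption, since the exact part file name depends on the job, and the cat step assumes the part files are plain text.

```shell
#!/bin/sh
OUTPUT_DIR="/avro_file1"   # output directory from this post

# List the job's output directory in HDFS.
LS_CMD="hadoop fs -ls $OUTPUT_DIR"
# Print the contents of the part file(s); the part-* glob is an assumption.
CAT_CMD="hadoop fs -cat $OUTPUT_DIR/part-*"

# Print both commands so they can be pasted into the Rundeck command console.
echo "$LS_CMD"
echo "$CAT_CMD"
```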

There are many more options in rundeck to work on. This is a simple tutorial on how to schedule and monitor your Hadoop jobs using Rundeck.

We hope this blog helped you in understanding how to schedule a Hadoop job using rundeck. Keep visiting our site www.acadgild.com for more updates on Bigdata and other technologies.


2 Comments

  1. Can I schedule an hdfs put command?
    Can I schedule the command below using Rundeck?
    hdfs dfs -put /home/ec2-user/airline/sample/nh_1987.csv /user/hdfs/

    1. [workflow] Begin step: 1,NodeDispatch
      21:44:30 1: Workflow step executing: [email protected]
      21:44:30 preparing for sequential execution on 1 nodes
      21:44:30 Executing command on node: localhost, NodeEntryImpl{tags=[], attributes={nodename=localhost, hostname=localhost, osFamily=unix, osVersion=3.10.0-514.26.2.el7.x86_64, osArch=amd64, description=Rundeck server node, osName=Linux, username=rundeck, tags=}, project='null'}
      21:44:30 [workflow] beginExecuteNodeStep(localhost): NodeDispatch: [email protected]
      21:44:30 using charset: null
      21:44:30 Current OS is Linux
      21:44:30 Adding reference: ant.PropertyHelper
      21:44:30 Project base dir set to: /var/lib/rundeck
      21:44:30 Setting environment variable: RD_JOB_LOGLEVEL=DEBUG
      21:44:30 Setting environment variable: RD_JOB_ID=c4e4d6ac-829e-42aa-8738-459c12d0e7b2
      21:44:30 Setting environment variable: RD_NODE_OS_NAME=Linux
      21:44:30 Setting environment variable: RD_JOB_USERNAME=admin
      21:44:30 Setting environment variable: RD_NODE_HOSTNAME=localhost
      21:44:30 Setting environment variable: RD_NODE_OS_FAMILY=unix
      21:44:30 Setting environment variable: RD_JOB_EXECID=7
      21:44:30 Setting environment variable: RD_NODE_USERNAME=rundeck
      21:44:30 Setting environment variable: RD_JOB_URL=http://35.160.249.85:4440/project/MoveDatatoHDFS/execution/follow/7
      21:44:30 Setting environment variable: RD_NODE_TAGS=
      21:44:30 Setting environment variable: RD_JOB_PROJECT=MoveDatatoHDFS
      21:44:30 Setting environment variable: RD_JOB_NAME=MoveData_fs_to_hdfs
      21:44:30 Setting environment variable: RD_NODE_OS_ARCH=amd64
      21:44:30 Setting environment variable: RD_NODE_OS_VERSION=3.10.0-514.26.2.el7.x86_64
      21:44:30 Setting environment variable: RD_JOB_SERVERURL=http://35.160.249.85:4440/
      21:44:30 Setting environment variable: RD_NODE_NAME=localhost
      21:44:30 Setting environment variable: RD_JOB_EXECUTIONTYPE=user
      21:44:30 Setting environment variable: RD_JOB_WASRETRY=false
      21:44:30 Setting environment variable: RD_JOB_RETRYATTEMPT=0
      21:44:30 Setting environment variable: RD_JOB_USER_NAME=admin
      21:44:30 Setting environment variable: RD_NODE_DESCRIPTION=Rundeck server node
      21:44:30 Executing '/bin/sh' with arguments:'-c'
      'hdfs dfs -put /home/ec2-user/airline/sample/nh_1987.csv /user/hdfs/'
      The ' characters around the executable and arguments are
      not part of the command.
      21:44:30 Execute:Java13CommandLauncher: Executing '/bin/sh' with arguments:'-c'
      'hdfs dfs -put /home/ec2-user/airline/sample/nh_1987.csv /user/hdfs/'
      The ' characters around the executable and arguments are
      not part of the command.
      21:44:31 put: `/home/ec2-user/airline/sample/nh_1987.csv': No such file or directory
      21:44:32 Setting project property: 1501530270057.node.localhost.LocalNodeExecutor.result -> 1
      21:44:32 Result: 1
      21:44:32 Failed: NonZeroResultCode: Result code was 1
      21:44:32 [workflow] finishExecuteNodeStep(localhost): NodeDispatch: NonZeroResultCode: Result code was 1
      21:44:32 1: Workflow step finished, result: Dispatch failed on 1 nodes: [localhost: NonZeroResultCode: Result code was 1]
