This blog walks through setting up a Hadoop cluster in the cloud. Before we start with the configuration, we need a Linux platform in the cloud. We will set up a pseudo-distributed (single-node) Hadoop cluster on an AWS EC2 Instance.
Note: We assume that you have an active AWS account and that your Linux Instance is running. Also, make sure you have the key pair (.pem file) for that Instance.
Connecting to Linux Instance from Windows Using PuTTY
If you are using Windows, you can connect to your Instance using PuTTY. After you launch your Instance, you can connect to it and use it the way that you would use a desktop computer.
Before you connect to your Instance through PuTTY, you need to complete the following prerequisites:
Step 1: Install PuTTY for Windows
Wondering what PuTTY is? PuTTY is free, open-source software; its source code is available alongside the binaries. It is an SSH and telnet client, developed originally by Simon Tatham for the Windows platform. You can download PuTTY by visiting this link.
For more information, you can also visit the PuTTY homepage.
Step 2: Generate PuTTY Private Key ( .ppk )
PuTTY does not support the private key format (.pem) generated by Amazon EC2. To connect to your Instance with PuTTY, you need a key in PuTTY format (.ppk). For this, PuTTY has a tool called PuTTYgen, which converts the AWS key (.pem) into a PuTTY-format key (.ppk).
Here are the steps to generate the PuTTY-format key (.ppk):
(a) Download PuTTYgen. You can download PuTTYgen from this link.
(b) Launch the PuTTYgen tool, press Load, and select your Amazon-format private key (.pem). If the file is not listed, switch the file type filter to show all files.
(c) PuTTYgen displays a notice once the .pem key has been loaded successfully.
(d) Click on Save private key and save it on your Desktop.
Your private key is now in the correct format and can be used with PuTTY. You can now connect to your instance using PuTTY’s SSH client.
Step 3: Start your PuTTY Session
Start PuTTY and the PuTTY Configuration window opens.
Step 4: Enter your Host Name (or IP address) of your Instance
Enter the public DNS name or IP address of your Instance in the Host Name field. Then, in the Category panel, expand Connection, expand SSH, select Auth, and follow the instructions below:
- Click Browse
- Locate your PuTTY private key (.ppk)
- Click Open
If you want to reuse the session later, you can also save it.
Step 5: Provide permission.
The first time you connect, PuTTY asks for permission to trust the host. Click Yes. When it prompts for a login name, type ec2-user and press Enter.
Your session has now started successfully and you can use your Instance. You can now begin the single-node Hadoop installation and configuration.
If you face any problem related to the Amazon EC2 Instance, you can also refer to the AWS documentation.
Step by step Hadoop Configuration
Before proceeding, let’s look at the prerequisites.
- Java Package
- Hadoop Package
Here’s the step-by-step tutorial:
1. Download the JDK on the Windows machine using its browser.
2. The download page prompts you to select the required version. Select the Linux x64 tar.gz package; this walkthrough uses jdk-8u65-linux-x64.tar.gz.
The download will start and the file will be saved in the Downloads folder.
3. Download the Hadoop package in the same way. When the download page prompts you to select a file, choose hadoop-2.6.0.tar.gz.
4. The next step is to connect your Instance. The steps are as follows.
Open your Instance and login as: ec2-user .
5. Add a new user to install Hadoop. Adding a user requires root access, so switch to root first.
sudo su -
Now that you have root access, you can add the new user. This walkthrough uses the user name acadgild:
useradd acadgild
Next, set a password for the newly created user:
passwd acadgild
Make acadgild a sudo user: run visudo and add a new entry for acadgild below the “Allow root to run any commands anywhere” line, mirroring the root entry:
acadgild ALL=(ALL) ALL
Now, get back to your ec2-user session from root by typing the exit command.
Next, switch to the acadgild user using the command below.
su - acadgild
Then, enter the password which you set above.
6. Hadoop needs Java in order to run. You can install Java directly from the yum repository by typing:
sudo yum install java
Here, however, we are going to copy the Java and Hadoop archives from the Windows machine, since they have already been downloaded. For this you need a tool that copies files from the Windows machine to your Instance. Any file transfer tool, such as FileZilla, will do. Here I am going to use WinSCP because it works over the existing SSH access, so I don’t have to configure an FTP server and its services.
You can download the WinSCP tool from this link.
Follow the below instructions to copy your file through WinSCP.
a. Launch WinSCP.
b. Enter the host name and user name, and make sure the port number is 22.
Note: Leave the password field blank, as we are going to log in with the .ppk file.
c. Click Advanced to import your .ppk file.
d. Expand the SSH category and click Authentication. Browse to your PuTTY-format private key (.ppk) and select it.
e. Now log in to your Instance and copy the files from your PC to the Amazon Instance. Here I have already uploaded the Hadoop and Java archives to the Instance.
Once the files are uploaded, return to your Instance session and type ls to check that the files are available.
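Copy these files from the ec2-user home directory to the acadgild home directory. Both commands need sudo because ec2-user cannot write to /home/acadgild.
sudo cp jdk-8u65-linux-x64.tar.gz /home/acadgild
sudo cp hadoop-2.6.0.tar.gz /home/acadgild
Now, log in as the acadgild user and extract these files.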
tar -xvf jdk-8u65-linux-x64.tar.gz
tar -xvf hadoop-2.6.0.tar.gz
Now we will configure the Hadoop properties.
Update the .bashrc file with the required environment variables, including the Java and Hadoop paths. From the home directory /home/acadgild, open the file:
sudo vi .bashrc
Note: Use the paths that match your own system.
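The exact entries are not reproduced here, so the following is only a sketch of what the .bashrc additions might look like. It assumes both archives were extracted in the home directory, which yields the folders jdk1.8.0_65 and hadoop-2.6.0. Both paths are assumptions and must be adjusted to your own layout.

```shell
# Sketch of .bashrc additions; both paths are assumptions, adjust them to your system.
export JAVA_HOME="$HOME/jdk1.8.0_65"      # assumed: JDK archive extracted in the home directory
export HADOOP_HOME="$HOME/hadoop-2.6.0"   # assumed: Hadoop archive extracted in the home directory
# Put the Java and Hadoop commands (bin) and the Hadoop daemon scripts (sbin) on the PATH.
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

With the bin directories on the PATH, commands such as java and hadoop can be run from any directory.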
- Type the command source .bashrc to apply the environment variables to the current session.
Note: The Java path set in .bashrc varies from system to system. Use the path where Java was downloaded and extracted, i.e. /path-to-extracted-java folder.
Create two directories to store NameNode metadata and DataNode blocks as shown below:
mkdir -p $HOME/hadoop/namenode
mkdir -p $HOME/hadoop/datanode
Next, change the permissions of both directories to 755.
chmod 755 $HOME/hadoop/namenode
chmod 755 $HOME/hadoop/datanode
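To confirm that the directories exist with the intended mode, a quick check along these lines can help. It is a sketch that assumes GNU stat, as found on typical Linux instances, and it repeats the mkdir and chmod steps so that it can be run on its own.

```shell
# Create the directories if needed and set their mode, as in the steps above.
mkdir -p "$HOME/hadoop/namenode" "$HOME/hadoop/datanode"
chmod 755 "$HOME/hadoop/namenode" "$HOME/hadoop/datanode"
# Print the octal mode and path of each directory (GNU stat syntax).
stat -c '%a %n' "$HOME/hadoop/namenode" "$HOME/hadoop/datanode"
```

Each printed line should begin with 755, followed by the directory path.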
Change to the directory that holds the Hadoop configuration files (etc/hadoop under the Hadoop installation directory in Hadoop 2.x).
Open hadoop-env.sh and add the Java home and Hadoop home paths to it.
sudo vi hadoop-env.sh
Note: Use the Java version and path present on your system. In our case the version is 1.8 and the location is /usr/lib/jvm/jdk1.8.0_65.
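As a rough sketch, the lines added to hadoop-env.sh could look like the following. The Java location is the one quoted in the note above. The Hadoop location is an assumption (the archive extracted in the home directory) and should be replaced with your actual installation path.

```shell
# Sketch of the lines added to hadoop-env.sh.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65   # the Java location used in this walkthrough
export HADOOP_HOME="$HOME/hadoop-2.6.0"     # assumption: where the Hadoop archive was extracted
```

Setting JAVA_HOME here matters because the Hadoop scripts read hadoop-env.sh when they start the daemons.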
- Open core-site.xml from the same configuration directory using the command below.
sudo vi core-site.xml
Add the following property between the configuration tags of core-site.xml. Note that fs.default.name is the older name of this setting; Hadoop 2.x still accepts it but prefers fs.defaultFS.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- Open hdfs-site.xml and add the following lines between the configuration tags.
sudo vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/acadgild/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/acadgild/hadoop/datanode</value>
  </property>
</configuration>
- Open yarn-site.xml and add the following lines between the configuration tags. The class property takes its service name from the first property, so it is keyed as mapreduce_shuffle.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
- Copy the mapred-site.xml template into mapred-site.xml.
sudo cp mapred-site.xml.template mapred-site.xml
Then edit mapred-site.xml so that it contains the following property:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Generate an SSH key for the Hadoop user.
ssh-keygen -t rsa
Note: Press Enter after typing the command ssh-keygen -t rsa, press Enter again when it asks for the file in which to save the key, and press Enter for each passphrase prompt.
Append the public key to the authorized_keys file in the .ssh directory:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Then type ls ~/.ssh to check that the authorized_keys file has been created.
Restrict the permissions of the authorized_keys file:
chmod 600 .ssh/authorized_keys
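For reference, the same key setup can be run without any prompts. The sketch below is a variant under stated assumptions: it writes to a scratch directory held in the hypothetical SSH_DIR variable, so it can be tried without touching an existing key. On the Instance you would point SSH_DIR at ~/.ssh instead.

```shell
# Hypothetical scratch location, so the sketch does not overwrite a real key.
SSH_DIR="$(mktemp -d)"
# -N "" sets an empty passphrase, -f names the key file, -q suppresses the banner.
ssh-keygen -q -t rsa -N "" -f "$SSH_DIR/id_rsa"
# Authorize the new public key for logins to this account.
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
# sshd ignores keys whose directory or file is writable by other users.
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"
```

Once the real key is in place under ~/.ssh, ssh localhost should log in without asking for a password, which is what the Hadoop start scripts rely on.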
Restart the SSH service by typing the command below.
sudo service sshd restart
Format the NameNode:
hadoop namenode -format
Note: In Hadoop 2.x this form still works but prints a deprecation notice; hdfs namenode -format is the current equivalent.
Before starting the daemons, change to the sbin directory of the Hadoop installation.
To start all the daemons, follow the steps below.
Starting the NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer
- Start the NameNode using the command below.
./hadoop-daemon.sh start namenode
- Next, start the DataNode using the below command.
./hadoop-daemon.sh start datanode
- Now, Start the ResourceManager using the following command.
./yarn-daemon.sh start resourcemanager
- Next, start the NodeManager.
./yarn-daemon.sh start nodemanager
- Finally, start the JobHistoryServer.
./mr-jobhistory-daemon.sh start historyserver
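- Type the jps command to list the running daemons. With everything started, the list should include NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer.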
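If you would rather not scan the list by eye, a small helper can do the comparison. The function below is a hypothetical convenience, not part of Hadoop: it reads jps output on standard input and reports any of the five daemons started above that is missing.

```shell
# Reads `jps` output on stdin and reports any expected daemon that is not listed.
check_daemons() {
  jps_output="$(cat)"
  missing=0
  for daemon in NameNode DataNode ResourceManager NodeManager JobHistoryServer; do
    # -w matches whole words only, so SecondaryNameNode does not count as NameNode.
    if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
      echo "not running: $daemon"
      missing=1
    fi
  done
  return "$missing"
}

# On the Instance: jps | check_daemons && echo "all daemons are running"
```

The function returns a non-zero status when something is missing, so it can also be used in a script that should stop if a daemon failed to start.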