Big Data Hadoop & Spark

Setting up Hadoop Cluster on Cloud

This blog focuses on setting up a Hadoop cluster on the cloud. Before we start with the configuration, we need a Linux platform in the cloud. We will set up our pseudo-distributed Hadoop cluster on an AWS EC2 instance.
Note: Here we are assuming that you have an active AWS account and that your Linux instance is running. Also, make sure you have the public and private key pair (.pem file) for that instance.

Connecting to Linux Instance from Windows Using PuTTY

If you are using Windows, you can connect to your instance using PuTTY. After you launch your instance, you can connect to it and use it just as you would use a desktop computer.


Before you connect to your instance through PuTTY, you need to complete the following prerequisites:

Step 1: Install PuTTY for Windows

Wondering what PuTTY is? PuTTY is open-source software, available along with its source code. It is an SSH and Telnet client, developed originally by Simon Tatham for the Windows platform. You can download PuTTY by visiting the link below.
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Putty Download
For more information, you can also visit the PuTTY homepage.
http://www.putty.org/

Step 2: Generate a PuTTY Private Key (.ppk)

PuTTY does not support the AWS private key format (.pem) generated by Amazon EC2. To connect to your instance with PuTTY, you need a key in PuTTY format (.ppk). For this, PuTTY provides a tool called PuTTYgen, which converts the AWS (.pem) key pair into a PuTTY-formatted key pair (.ppk).

Here are the steps to generate a PuTTY-formatted key pair (.ppk):

(a) Download PuTTYgen. You can download PuTTYgen from this link.
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Putty Generator
(b) Launch the PuTTYgen tool and locate your Amazon-formatted (.pem) key pair by pressing Load.
Putty key load
(c) You will see a window similar to the one below when you load your .pem key.
putty key load
(d) Click on Save private key and save it on your Desktop.
Save ppk Key
Your private key is now in the correct format and can be used with PuTTY. You can now connect to your instance using PuTTY’s SSH client.

Step 3: Start Your PuTTY Session

Start PuTTY and you will see a window like the one shown below.
Putty Configuration

Step 4: Enter the Host Name (or IP Address) of Your Instance

First, on the Session page, enter the host name (or public IP address) of your instance. Then, in the Category panel, expand Connection, expand SSH, select Auth, and follow the instructions below:

  1. Click Browse
  2. Locate your PuTTY private key (.ppk)
  3. Click Open

If you want to start this session again later, you can also save the session.
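For reference, the Host Name field on the Session page usually takes the form user@public-DNS, for example (a placeholder address, not a real one):

ec2-user@ec2-203-0-113-25.compute-1.amazonaws.com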

Step 5: Provide Permission

The first time you connect, PuTTY will ask whether you trust the host; click Yes. When it prompts for a login name, type ec2-user and press Enter.
Now your session has started successfully and you can use your instance. You can begin your single-node Hadoop installation and configuration.
ec2-user login
You can also refer to the AWS documentation below if you face any problems with your Amazon EC2 instance.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstances.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html

Step-by-Step Hadoop Configuration

Before proceeding, let’s look at the prerequisites.

  1.  Java Package
  2.  Hadoop Package

Here’s the step-by-step tutorial:
1. Use the below link to download the JDK on your Windows machine using a browser.
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
2. On clicking the above link, a screen will prompt you to select the required version. Select the Linux x64 .tar.gz option (shown with the red arrow).
Java Download
On clicking that option, the download will start and the file will be saved in the Downloads folder.
3. Download the Hadoop file using the following link:
http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/
On clicking the above link, the screen below will prompt you to select a file.
hadoop download
4. The next step is to connect to your instance. The steps are as follows.
Open your instance and log in as ec2-user.
ec2-user login
5. Add a new user to install Hadoop. You need root access to add a new user, so switch to the root user.

sudo su -

sudo root login
Now that you have root access, you can easily add a new user. You can do this using the following command.

useradd acadgild

useradd Acadgild
Next, provide a password for the newly created user using the following command.

passwd acadgild

passwd acadgild
Make the acadgild user a sudo user by adding a new entry to the sudoers file, just below the "Allow root to run any commands anywhere" line, using visudo.
See the image below for reference, and the sample entry after it.

visudo

visudo
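For reference, a typical sudoers entry for the new user (assuming the user name acadgild, as created above) looks like the following; add it just below the root entry:

acadgild ALL=(ALL) ALL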
Now, get back to your ec2-user from root by typing the exit command.

exit

Next, log in to your acadgild user using the below command.

su - acadgild

Then, enter the password which you provided above.
acadgild login
6. Now we need Java to install Hadoop. You can install Java directly from the yum repository by typing the command:

sudo yum install java

Here, however, we are going to copy the Java and Hadoop archives from the Windows machine, as they have already been downloaded there. For this, you need the WinSCP tool to copy files from the Windows machine to your instance. You can use any file transfer tool, such as FileZilla; here I am going to use WinSCP because I don't want to configure an FTP server and its services.
You can download the WinSCP tool from this link
https://winscp.net/eng/download.php#download2
or
https://winscp.net/eng/download.php
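If you prefer a command-line copy instead of WinSCP, PuTTY's bundled pscp tool can also upload the files. A minimal sketch, assuming your converted key is named mykey.ppk and using a placeholder host name:

pscp -i mykey.ppk jdk-8u65-linux-x64.tar.gz hadoop-2.6.0.tar.gz ec2-user@<your-instance-public-DNS>:/home/ec2-user/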

Follow the below instructions to copy your file through WinSCP.

a. Launch WinSCP.
winscp login
b. Enter the host name and user name, and make sure the port number is 22.
Note: Leave the password field blank, as we are going to log in via the .ppk file.
c. Click on the Advanced option to import your .ppk file.
advance login
d. Expand the SSH category and click Authentication. You will see a window as shown below. Browse to your PuTTY-formatted private key and select the (.ppk) file.
authentication
e. Now log in to your instance and copy the files from your PC to the Amazon instance. Here, I have already uploaded the Hadoop and Java archives to the instance.
login
WinSCP Upload
Once these files have been uploaded to the instance, return to your PuTTY session and type ls to check whether the files are available.

ls

ls
Copy these files from the ec2-user home directory to the acadgild user's home directory.

sudo cp jdk-8u65-linux-x64.tar.gz /home/acadgild

copy jdk to /home/acadgild

sudo cp  hadoop-2.6.0.tar.gz  /home/acadgild

cp hadoop /home/acadgild
Now, log in to the acadgild user and extract these files.

tar -xvf jdk-8u65-linux-x64.tar.gz
tar -xvf hadoop-2.6.0.tar.gz

 

Now, let us configure the Hadoop properties:

  • Update the .bashrc file with the required environment variables, including the Java and Hadoop paths (sample entries are shown below).

Type the command sudo vi .bashrc from the home directory /home/acadgild.

sudo vi .bashrc

.bashrc
Note: Update the paths to match your own system.
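As a reference, the .bashrc entries typically look like the following (a sketch assuming Java is extracted to /home/acadgild/jdk1.8.0_65 and Hadoop to /home/acadgild/hadoop-2.6.0; adjust the paths to your system):

export JAVA_HOME=/home/acadgild/jdk1.8.0_65
export HADOOP_HOME=/home/acadgild/hadoop-2.6.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin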

  • Type the command source .bashrc to make the environment variables take effect.
source .bashrc

source bashrc
Note: The Java path set in .bashrc will vary for every system; you must give the path where Java has been downloaded and extracted, i.e., the path to the extracted Java folder.
Example: /home/acadgild/jdk1.8.0_65
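To confirm that the variables are picked up, you can run a quick sanity check (the exact output will vary with your versions):

java -version
hadoop version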

  • Create two directories to store NameNode metadata and DataNode blocks as shown below:

mkdir -p $HOME/hadoop/namenode
mkdir -p $HOME/hadoop/datanode

Next, change the permissions of these directories to 755.

chmod 755 $HOME/hadoop/namenode
chmod 755 $HOME/hadoop/datanode
  • Change to the directory where Hadoop's configuration files are located.

cd  /home/acadgild/hadoop-2.6.0/etc/hadoop/

cd hadoop

  • Open hadoop-env.sh and add the Java home (path) and Hadoop home (path) in it.

 sudo vi hadoop-env.sh

Note: Update the Java version and path to match your system; in our case, the version is 1.8 and the location is /usr/lib/jvm/jdk1.8.0_65. The lines typically look like the sample shown below.
hadoop env.sh
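For reference, the relevant lines in hadoop-env.sh look roughly like this (assuming the Java location mentioned in the note above; use the path on your own system):

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
export HADOOP_HOME=/home/acadgild/hadoop-2.6.0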

  • Open core-site.xml using the below command, from the path shown in the screenshot.
sudo vi core-site.xml

vi core-site.xml
Add the below properties between the configuration tags of core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

core-site.xml

  • Open hdfs-site.xml and add the following lines between the configuration tags.
sudo vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/acadgild/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/acadgild/hadoop/datanode</value>
  </property>
</configuration>

hdfs-site.xml

  • Open yarn-site.xml and add the following lines between the configuration tags.
vi yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

yarn-site.xml

  • Copy the mapred-site.xml template into mapred-site.xml 
    sudo cp mapred-site.xml.template  mapred-site.xml

    cp mapred-site.xml

Then, add the following properties to mapred-site.xml, as shown below.

vi mapred-site.xml

vi mapred-site.xml
Edit mapred-site.xml so that it contains the following property:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

  • Generate an SSH key for the Hadoop user.

ssh-keygen -t rsa

You can refer to the below screenshot for this.
ssh-keygen -t rsa
Note: Press Enter after typing the command ssh-keygen -t rsa, and press Enter again when it asks for the file in which to save the key and for the passphrase (accepting the defaults).

  • Copy the public key from the .ssh directory into the authorized_keys file.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Alternatively, change to the .ssh directory and then type the command below to append the key to the authorized_keys file. Then type ls to check whether the authorized_keys file has been created.
cat id_rsa.pub >> authorized_keys

  • Change the permissions of the authorized_keys file to 600.

chmod 600 .ssh/authorized_keys
  • Start the sshd service (if it is not already running) by typing the below command.

sudo service sshd start
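At this point, you can optionally verify that passwordless SSH works; connecting to localhost should not prompt for a password:

ssh localhost
exit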
  • Format the NameNode:

hadoop namenode -format

hadoop namenode -format
Change the directory to Hadoop's sbin folder.

cd hadoop-2.6.0/sbin

hadoop/sbin
Note: Change the directory to sbin of Hadoop before starting the daemon.

 To start all the daemons, follow the below steps:

  • Starting the NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer

Type the below command to start the NameNode.

./hadoop-daemon.sh start namenode

start namenode

  • Next, start the DataNode using the below command.
./hadoop-daemon.sh start datanode

start datanode

  • Now, start the ResourceManager using the following command.
./yarn-daemon.sh start resourcemanager

start resourcemanager

  • Next, start the NodeManager.
./yarn-daemon.sh start nodemanager

start nodemanager

  • Start the JobHistoryServer using the below command.
./mr-jobhistory-daemon.sh start historyserver

start historyserver

  • Type the jps command to see the running daemons:
jps

jps
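If everything has started correctly, the jps output should list entries similar to the following (the process IDs will differ on your instance):

2817 NameNode
2925 DataNode
3120 ResourceManager
3254 NodeManager
3377 JobHistoryServer
3410 Jps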

Here, we can see that all the daemons are running, which means we have successfully configured a pseudo-distributed Hadoop cluster on an AWS instance.
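You can also verify the cluster from a browser: in Hadoop 2.6, the NameNode web UI listens on port 50070 and the ResourceManager UI on port 8088 of the instance, for example http://<your-instance-public-DNS>:50070 (assuming those ports are open in the instance's security group).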

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
