Hadoop 3.x is the latest Hadoop release and is still in the alpha phase. Developers who are interested in Hadoop can install it and report any issues or bugs they find to Apache. Hadoop 3.x introduces many new features.
In this blog, we will discuss how to install Hadoop 3.x in pseudo-distributed mode and explore the new HDFS features.
Here is the list of changes and features introduced in Hadoop 3.x:
- Minimum required Java version increased from Java 7 to Java 8
- Support for erasure coding in HDFS
- YARN Timeline Service v.2
- Shell script rewrite
- MapReduce task-level native optimization
- Support for more than 2 NameNodes
- Default ports of multiple services have been changed
- Support for Microsoft Azure Data Lake filesystem connector
- Intra-datanode balancer
- Reworked daemon and task heap management
We also recommend our readers to go through our blog 10 Differences Between Hadoop 2.x and Hadoop 3.x.
Hadoop 3.x Installation Procedure
Let’s get started with the Hadoop 3.x installation.
Download the latest Hadoop release from here.
We have downloaded hadoop-3.0.0-alpha2.tar.gz
After downloading, move into the download folder and extract the archive using the command
tar -xzf hadoop-3.0.0-alpha2.tar.gz
Note: We assume that Java is already installed on your system. The minimum JDK required for Hadoop 3.x is JDK 8.
Setting JAVA_HOME path
Now move into the etc/hadoop/ directory of the extracted hadoop-3.0.0-alpha2 folder and set the JAVA_HOME path in the hadoop-env.sh file.
To find the JAVA_HOME path on your machine, open a terminal and type echo $JAVA_HOME
In our case, the path is /usr/lib/jvm/java-8-oracle, and we have set the same in the hadoop-env.sh file as shown in the below screenshot.
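For reference, the line we added to hadoop-env.sh looks like the following (the JDK path below is specific to our machine; substitute the path from your own system):

```shell
# In etc/hadoop/hadoop-env.sh -- point Hadoop at your JDK installation
# (this path is from our setup; adjust it to your JDK location)
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
```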
After setting the Java path, save and close the file.
Configuring core-site.xml file
Now open the core-site.xml file, which is present in the etc/hadoop/ directory, and set the below property for your distributed file system.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Configuring hdfs-site.xml file
Open the hdfs-site.xml file in the same location and set the below property for replication.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Also in hdfs-site.xml, you need to create two folders, one for the NameNode metadata and one for the DataNode blocks, and set their paths with the below properties.
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/kiran/Downloads/Hadoop/Hadoop3_data/NameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/kiran/Downloads/Hadoop/Hadoop3_data/DataNode</value>
</property>
We have created a folder called Hadoop3_data and inside we have created 2 directories with names NameNode & DataNode
The same you can see in the below screenshot.
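If you have not created these directories yet, the two commands below sketch that step; the base path matches this tutorial's setup, so adjust it for your machine:

```shell
# Create the directories referenced by dfs.namenode.name.dir and
# dfs.datanode.data.dir; -p also creates any missing parent directories
mkdir -p "$HOME/Downloads/Hadoop/Hadoop3_data/NameNode"
mkdir -p "$HOME/Downloads/Hadoop/Hadoop3_data/DataNode"
```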
Configuring ssh & pdsh
Install and setup ssh
If you are using a Debian-based OS, install ssh with the below command
sudo apt-get install ssh
If you are using a non-Debian OS, install ssh with the below command
sudo yum install openssh-server
After the installation, generate an ssh key with the below commands.
Generate an ssh key for the hadoop user using the command:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Append the public key to the authorized_keys file in the .ssh directory:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
To check whether the key has been copied, type the command:
cat ~/.ssh/authorized_keys
Restrict the permissions of the authorized_keys file:
chmod 600 ~/.ssh/authorized_keys
Install and setup pdsh
If you are using a Debian-based OS, install pdsh using the below command
sudo apt-get install pdsh
If you are using a non-Debian OS, install pdsh using the below commands
yum update
rpm -Uvh http://public-repo-1.hortonworks.com/ambari/centos6/1.x/GA/ambari-1.x-1.el6.noarch.rpm
yum install pdsh
After installing, set ssh as the default remote command for pdsh. For this, you need root privileges.
Switch to the root user using the command sudo su and then type the below command.
echo "ssh" > /etc/pdsh/rcmd_default
Now let’s configure YARN
Configuring mapred-site.xml file
Open the mapred-site.xml file in etc/hadoop/ and set the below parameters. First, you need to rename mapred-site.xml.template to mapred-site.xml.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.admin.user.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME</value>
  </property>
</configuration>
Configuring yarn-site.xml file
Open the yarn-site.xml file in the etc/hadoop/ directory and set the below parameters.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
After setting the properties in the Hadoop configuration files, save and close them. Now open the .bashrc file, which is in your home directory.
cd ~
gedit .bashrc
In the .bashrc file, set the path of hadoop-3.0.0-alpha2 as shown below
export HADOOP_HOME=/home/kiran/Downloads/Hadoop/hadoop-3.0.0-alpha2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
After setting the path, save and close the file, and then reload it using the command source ~/.bashrc
That’s it, your Hadoop 3.x setup is ready. Let’s now format the NameNode.
Use the command ./hdfs namenode -format from the $HADOOP_HOME/bin directory.
After a successful format, you will get the message shown in the below screenshot.
We have successfully formatted the NameNode; let’s now start the Hadoop daemons one by one. Move into the $HADOOP_HOME/sbin directory and type the below commands.
Starting Hadoop daemons
Starting HDFS daemons
Starting name node
./hadoop-daemon.sh start namenode
Starting data node
./hadoop-daemon.sh start datanode
Starting secondary namenode
./hadoop-daemon.sh start secondarynamenode
Starting YARN daemons
Starting Resource Manager
./yarn-daemon.sh start resourcemanager
Starting Node Manager
./yarn-daemon.sh start nodemanager
We have successfully started all the Hadoop daemons. You can check their status using the jps command.
You can also start all these daemons with a single command, i.e., start-all.sh, as shown in the below screenshot.
In Hadoop 3.x, HDFS comes with some new features for working with files; you can perform all kinds of storage operations from the web UI itself. Let’s see how to do that.
In Hadoop 2.x, the HDFS web UI port is 50070, but in Hadoop 3.x it has moved to 9870. You can access the HDFS web UI at localhost:9870 as shown in the below screenshot.
You can see all the HDFS configurations on this page. To browse the file system (served through WebHDFS), click on Utilities –> Browse the file system.
You can see a few options added to it, i.e., creating a new folder, uploading files, and cutting and pasting files from one directory to another.
Before creating a folder, make sure that the user has the correct permissions to perform operations on those directories. If not, you can change the permissions using the below command (note that this opens up all permissions on the whole file system, which is acceptable only for a test setup like this one).
hadoop fs -chmod -R 777 /
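A less permissive alternative is to give your login user its own home directory in HDFS instead of opening up the whole tree. This is a sketch, assuming the user browsing the web UI is the same user running these commands:

```shell
# Create an HDFS home directory for the current user and hand ownership
# to that user, rather than making the entire file system world-writable
hadoop fs -mkdir -p "/user/$(whoami)"
hadoop fs -chown "$(whoami)" "/user/$(whoami)"
```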
Let’s create a new folder and upload some data into it. To create a new folder, click on the folder icon and give a name to the directory as shown below.
You can see that the folder has been created successfully in the below screenshot
To upload files, click on upload symbol and browse your file system to select the file that you need to upload.
You can also delete the files by clicking on the delete symbol beside the directory or file as shown in the below screenshot.
You can also cut and paste the files from one directory to another directory.
Select the files which you want to cut, click on the Cut option, and then click on Ok as shown below.
Now move into the folder where you want to paste the file and click on the Paste option. After clicking Paste, you can see that the file has been pasted into that directory as shown in the below screenshot.
This is how you can perform operations on files using HDFS web UI in Hadoop 3.x.
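The web UI operations above are backed by the WebHDFS REST API, so the same actions can also be scripted. Below is a minimal sketch with curl, assuming the pseudo-distributed setup from this tutorial is running on localhost:9870; the directory name demo_dir is a placeholder, and user.name should be the user you run Hadoop as:

```shell
# Create a directory over WebHDFS (op=MKDIRS)
curl -X PUT "http://localhost:9870/webhdfs/v1/demo_dir?op=MKDIRS&user.name=kiran"

# List the contents of the root directory (op=LISTSTATUS)
curl "http://localhost:9870/webhdfs/v1/?op=LISTSTATUS"

# Delete the directory again (op=DELETE)
curl -X DELETE "http://localhost:9870/webhdfs/v1/demo_dir?op=DELETE&user.name=kiran"
```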
We hope this blog helped you understand how to install Hadoop 3.x in a single-node cluster and how to perform operations on HDFS files using the HDFS web UI.
Enroll for Hadoop Training conducted by Acadgild and become a successful big data developer.