Are you a Hadoop developer who wants to learn the basics of configuring a Hadoop cluster? If so, this blog will help you set up a single node cluster on your machine right away!
This blog aims to give a brief overview of the most important settings that need to be taken care of for a successful installation.
What Is The Default Configuration In Hadoop?
This blog will guide you through the right settings to set up a single node cluster step by step. The single node mode is usually used by developers to test their sample code.
When you download the Hadoop tar file and install it with default settings, you get a standalone mode.
All the XML files in Hadoop contain properties defined by Apache through which Hadoop understands its responsibilities and limits as well as its working behaviour.
The links below give us the default property settings for all types of configuration files that are needed for Hadoop:
The four files that need to be configured explicitly while setting up a single node Hadoop cluster are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Overriding The Default xml Properties In site.xml File
We can override specific properties by configuring them in the files above.
In Hadoop, the default replication factor is 3, but we can override it — for example, by setting it to 1 through an explicit property in hdfs-site.xml.
Overriding the default parameters optimizes the cluster, improves performance, and teaches you about the internal workings of the Hadoop ecosystem.
The screenshot below shows the different files which can either be overridden with explicit properties or left with the default properties in a Hadoop cluster.
How site.xml Overrides default.xml Settings
Hadoop’s jar files are available in the following path:
[here HADOOP_HOME is the path where Hadoop is installed]
Hadoop reads the default configuration details — such as the default replication factor of 3 — from DFSClient.java in one of these jar files.
The default configuration files are always loaded from a fixed classpath that Hadoop consults while running. Similarly, the site.xml files edited by the developer are loaded from the classpath and checked for additional configuration objects; these are deployed into the running Hadoop ecosystem and override the corresponding default.xml settings.
We will now look through the XML files that we specifically need to alter during a basic installation of the single node cluster.
Things Common To All The XML Files
We can specify a new value with tags like <property>, <name>, <value>, <description>, and <final> inside the predefined <configuration> tag. Since Hadoop is an open source framework, its maintainers provide the option to override features by declaring these attributes inside the various site.xml files.
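As a sketch, every override follows the same shape; the property name and value below are placeholders, not real Hadoop properties:

```xml
<configuration>
  <property>
    <name>some.property.name</name>        <!-- placeholder property name -->
    <value>some-value</value>              <!-- the overriding value -->
    <final>true</final>                    <!-- optional: jobs cannot override a final property -->
    <description>What this property controls.</description>
  </property>
</configuration>
```

Every site.xml file has exactly one <configuration> root, with one <property> block per overridden setting.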
Settings That Need To Be Done In core-site.xml
Some of the important properties are:
- Configuring the name node address
- Configuring the rack awareness factor
- Selecting the type of security
Refer to the table below for a schematic representation of the above properties:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  <final>true</final>
</property>

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
  <description>Set the authentication for the cluster. Valid values are: simple or kerberos.</description>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>
Here is a detailed description of the fs.default.name attribute, which must be configured for a Hadoop single node cluster.
A filesystem path in Hadoop has two main components:
- A URI (Uniform Resource Identifier) that identifies the file system
- A path that locates a file or directory within that file system
If only a path is given, Hadoop tries to find that path on the file system defined by fs.default.name — that is, on the HDFS instance whose namenode is running at the configured <authority>:<port>.
If a user specifies both a URI and a path in a request, the URI in the request overrides fs.default.name, and Hadoop looks for the path on the file system identified by that URI.
One of the important tasks handled by the fs.default.name file system is the delete operation in the Hadoop ecosystem.
Some of the commonly overridden name attributes are hadoop.security.authentication, fs.trash.interval, and fs.default.name. The attribute we use while setting up a single node cluster is explained above; these examples help us understand the customized configuration better.
Settings To Be Done In hdfs-site.xml
The properties inside this XML file deal with the storage procedures of HDFS. Some of the important properties are:
- Configuring port access
- Managing SSL client authentication
- Controlling the network interface
- Changing file permissions
Some of the commonly overridden name attributes are dfs.namenode.name.dir, dfs.datanode.data.dir, dfs.block.size, dfs.replication, etc.
The attributes we use while setting up a single node cluster are explained here.
Block replication can be configured using the dfs.replication property.
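As a sketch, the override in hdfs-site.xml might look like this; a value of 1 suits a single node cluster, since there is only one datanode to hold replicas:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication; 1 is enough on a single node cluster.</description>
</property>
```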
If replication is not specified at create time, the default, which is 3, is used.
Maximum block replication can be 512 and minimum can be 1.
We can change the replication factor on a per-file basis using the Hadoop FS shell:
$ hadoop fs -setrep -w 3 /my_file
To change it for all files inside a directory:
$ hadoop fs -setrep -w 3 /my_dir
Block size can be configured using the dfs.block.size property.
dfs.namenode.name.dir takes the specified path for the namenode directory on the local file system, where the name table is stored. If this is a comma-delimited list of directories, then the name table is replicated in all of them for redundancy. In case of any data loss, this redundancy helps recover the lost data. This is also where the replication factor comes in, which defines how many copies of a file are stored.
dfs.datanode.data.dir takes the specified path for the datanode directory on the local file system, where a DFS data node stores its blocks. If this is a comma-delimited list of directories, then data will be stored in all the named directories, typically on different devices.
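A sketch of these two properties in hdfs-site.xml; the /home/hadoop/... paths are placeholders, so substitute local directories of your own:

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop/hdfs/namenode</value>  <!-- placeholder local path -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/hadoop/hdfs/datanode</value>  <!-- placeholder local path -->
</property>
```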
It will change the default block size for all files placed into HDFS. In this case, we set dfs.block.size to 128 MB. Changing this setting only affects the block size of files placed into HDFS after the setting has taken effect; existing files keep their old block size.
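As a sketch, the 128 MB block size described above would be set like this in hdfs-site.xml; the value is given in bytes:

```xml
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128 MB = 128 * 1024 * 1024 bytes -->
</property>
```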
The fsck command reports the replication factor along with other important details, as shown in the figure below:
$ hdfs fsck /<path of file>/<name of file>
Settings In yarn-site.xml
Understanding yarn-site.xml is easier if I first explain some related concepts of YARN and why YARN came into existence in Hadoop v2.x.
In Hadoop v1.x, the TaskTracker and JobTracker handled the job of allocating resources to processes.
YARN's ResourceManager settings affect resource allocation together with the node manager and application manager. Some of the important properties are:
- WebAppProxy Configuration
- MapReduce Configuration
- NodeManager Configuration
- ResourceManager Configuration
- IPC Configuration
It tells the NodeManager that an auxiliary service called mapreduce.shuffle needs to be implemented. After telling the NodeManager to implement that service, we give it a class name as the means to implement it. This particular configuration tells MapReduce how to do its shuffle, because NodeManagers won't shuffle data for a non-MapReduce job. We need to configure such a service for MapReduce ourselves.
This property tells the NodeManager that MapReduce containers will have to do a shuffle from the map tasks to the reduce tasks.
Previously the shuffle step was part of the MapReduce TaskTracker.
The shuffle is an auxiliary service and must be set in the configuration file. In addition, we have yarn.nodemanager.aux-services.mapreduce.shuffle.class. Although it is possible to write your own shuffle handler by extending this class, it is recommended that the default class be used.
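A sketch of these two properties in yarn-site.xml; note that in stable Hadoop 2.x releases the service name is mapreduce_shuffle, with an underscore, since service names may not contain dots:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>  <!-- the default shuffle handler -->
</property>
```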
Shuffle handler: a process that runs inside the YARN NodeManager. The REST server and many third-party applications also use port 8080, which results in conflicts if you deploy more than one at a time without reconfiguring the default port.
Some of the overridden name attributes are yarn.resourcemanager.am.max-attempts, yarn.resourcemanager.proxy-user-privileges.enabled, yarn.nodemanager.aux-services, yarn.nodemanager.aux-services.mapreduce.shuffle.class etc.
Settings In mapred-site.xml
When Hadoop runs any analysis of a dataset, the runtime framework for MapReduce jobs is a vast set of rules for assigning jobs to slaves and maintaining job records. YARN was introduced in Hadoop 2.x to help this framework work efficiently and take over the workload of job-related assignments. This file is again a large unit of the Hadoop ecosystem, helping map and reduce tasks collaborate with YARN. Some of the important features it handles are:
- Node health script variables
- Proxy Configuration
- Job Notification Configuration
The value of the mapreduce.framework.name attribute determines whether you are running the MapReduce framework in local mode, classic (MapReduce v1) mode, or YARN (MapReduce v2) mode. Local mode means the job is run locally using the LocalJobRunner. If set to yarn, the job is submitted and executed via the YARN cluster.
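As a sketch, this setting in mapred-site.xml looks like the fragment below; yarn is the usual value for a Hadoop 2.x single node cluster:

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>  <!-- alternatives: local, classic -->
</property>
```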
Some of the commonly overridden name attributes are yarn.app.mapreduce.client.max-retries, mapreduce.shuffle.port, mapreduce.job.tags, and the I/O properties.
All the properties explained above sum up the requirements for a single node Hadoop cluster.
Follow the document in the link below to set up a pseudo-mode single node Hadoop cluster for a deeper understanding.