In this post, we will discuss how to configure the replication factor and block size in HDFS for the entire cluster, for a directory, and for an individual file.
Hadoop Distributed File System (HDFS) stores files as blocks and distributes them across the cluster. Because HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated several times to ensure high data availability.
Before going ahead, it is important to know the basics: what blocks, block size, and the replication factor are. So, let’s get a clear picture of them first.
Blocks and Block Size:
HDFS is designed to store and process very large data sets. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later, and it can be changed for a Hadoop cluster. All blocks in a file, except possibly the last one, are the same size. When you store a file in HDFS, the system splits it into individual blocks and stores these blocks on various slave nodes (DataNodes) in the Hadoop cluster.
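The way a file is split into blocks can be worked out with simple arithmetic. The sketch below is only an illustration: the function names block_count and last_block_size are made up for this post, and both sizes are assumed to be given in the same unit (MB here).

```shell
# Number of blocks needed for a file: ceil(file_size / block_size).
# Both arguments must use the same unit (MB in the examples below).
block_count() {
  local file_size=$1 block_size=$2
  echo $(( (file_size + block_size - 1) / block_size ))
}

# Size of the final block, which may be smaller than the others.
last_block_size() {
  local file_size=$1 block_size=$2
  local remainder=$(( file_size % block_size ))
  if [ "$remainder" -eq 0 ] && [ "$file_size" -gt 0 ]; then
    # The file size is an exact multiple, so the last block is full.
    echo "$block_size"
  else
    echo "$remainder"
  fi
}

block_count 200 64       # prints 4
last_block_size 200 64   # prints 8
```

So a 200 MB file with 64 MB blocks occupies three full blocks plus one 8 MB block.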
Block Size Configuration for Entire Cluster:
If you want to set a specific block size for the entire cluster, add the dfs.block.size property (renamed dfs.blocksize in Hadoop 2.x and later, where the old name is deprecated) to hdfs-site.xml.
For example, setting its value to 134217728 bytes gives a block size of 128 MB, and this becomes the default for the entire cluster.
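A minimal sketch of such a property is given here, assuming Hadoop 2.x or later, where the property is named dfs.blocksize (on Hadoop 1.x the name would be dfs.block.size). It goes inside the existing <configuration> element of hdfs-site.xml.

```xml
<property>
  <name>dfs.blocksize</name>
  <!-- 128 MB expressed in bytes: 128 * 1024 * 1024 -->
  <value>134217728</value>
</property>
```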
Changing the dfs.block.size property in hdfs-site.xml changes the default block size for all files subsequently placed into HDFS. It does not affect the block size of files already in HDFS; it applies only to files written after the setting takes effect.
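The block size can also be chosen for a single file at upload time, without touching the cluster default. One way to do this is sketched below, assuming Hadoop 2.x or later (on Hadoop 1.x the property would be dfs.block.size). The helper name put_with_block_size is made up for this post, and the paths in the example are only illustrative.

```shell
# Upload one file with a block size other than the cluster default.
#   $1 = block size in bytes
#   $2 = local source path
#   $3 = HDFS destination path
put_with_block_size() {
  local block_bytes=$1 src=$2 dst=$3
  # -D overrides a configuration property for this one command only.
  hadoop fs -D dfs.blocksize="$block_bytes" -put "$src" "$dst"
}

# Example: upload with 256 MB blocks (illustrative paths, left commented out).
# put_with_block_size $((256 * 1024 * 1024)) /home/acadgild/acadgild /test1/
```

Only the file written by this command gets the new block size; other files keep whatever block size they were written with.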
Replication Factor:
The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified when the file is created and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
The replication factor is a property set in the HDFS configuration file, hdfs-site.xml, where dfs.replication defines the default replication factor for the entire cluster (3 if it is not set). With a replication factor of n, each block is stored n times across the cluster: the original plus n - 1 copies.
If you want to set 4 as the replication factor for the entire cluster, specify it in hdfs-site.xml as follows.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value> <!-- Here you need to set replication factor for entire cluster. -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/acadgild/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/acadgild/hadoop/datanode</value>
  </property>
</configuration>
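To get a feel for what a replication factor of 4 costs, the raw storage and the number of block replicas can be estimated with a little arithmetic. This is only a rough sketch: the function names raw_storage and total_replicas are made up for this post, and sizes are assumed to be in MB.

```shell
# Raw cluster storage used by a file: file_size * replication_factor.
raw_storage() {
  local file_size=$1 replication=$2
  echo $(( file_size * replication ))
}

# Total number of block replicas stored across the cluster:
# ceil(file_size / block_size) * replication_factor.
total_replicas() {
  local file_size=$1 block_size=$2 replication=$3
  local blocks=$(( (file_size + block_size - 1) / block_size ))
  echo $(( blocks * replication ))
}

raw_storage 200 4          # prints 800
total_replicas 200 128 4   # prints 8
```

In other words, a 200 MB file stored with 128 MB blocks and a replication factor of 4 takes about 800 MB of raw disk space, spread over 8 block replicas.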
We can also change the replication factor of an individual file.
To try this, let’s first create a new directory in the HDFS root as shown below.
hadoop fs -mkdir /test1/
You can verify this using the following command:
hadoop fs -ls /
Now, let’s add a file into this directory.
hadoop fs -put /home/acadgild/acadgild /test1/
Next, let’s run the command that changes the replication factor of a file in the Hadoop cluster. The -w flag makes the command wait until the replication has completed. The command is shown below:
hadoop fs -setrep -w 5 /test1/acadgild
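To confirm that the change took effect, the replication factor and block size of a file can be read back with hadoop fs -stat, where %r is the replication factor, %o the block size in bytes, and %n the file name. The wrapper name show_replication below is made up for this post.

```shell
# Print the replication factor and block size (in bytes) of an HDFS path.
#   $1 = HDFS path of the file to inspect
show_replication() {
  hadoop fs -stat "%n: replication=%r blocksize=%o" "$1"
}

# Example, using the file from this post (left commented out):
# show_replication /test1/acadgild
```

The replication factor of each file also appears in the second column of the hadoop fs -ls output.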
We can also change the replication factor of all the files within a directory by using the command below.
hadoop fs -setrep -R -w 3 /test/
In this example, the test directory contains three files. The command works through them one after another, so it handles the first file before moving on to the others.
Note: Changing the replication of individual files and directories takes time, and how long it takes depends on factors such as:
- The replication factor requested
- The size of the files and directories
- DataNode hardware
So, it is better not to change the replication factor on a per-file or per-directory basis unless you need to.
We hope this post has been helpful in understanding the steps to configure block size and replication factor in HDFS. If you have any questions, feel free to comment below and we will get back to you at the earliest.