HDFS Interview Questions and Answers 2019

1. What is Hadoop?

A. Hadoop is a Java-based programming framework that supports distributed storage of data and parallel processing.
For distributed storage, you use HDFS, which is a reliable and scalable platform.
For parallel processing, you use MapReduce, which is also reliable and scalable.
Hadoop works on the principle of scaling out, where clusters are built mostly from commodity hardware.

2. What is the function of namenode?

A. The namenode holds two important pieces of information:
– The Hadoop file system tree and its metadata
– The in-memory mapping of blocks to the data nodes that store them
The namenode is the service that holds the metadata about HDFS, including the file system tree. For each
file, this metadata covers the permissions, replication factor, block size, creation time, owner, and the
mapping of the file's blocks to data nodes, that is, which block is present on which node.
It also holds the whole directory structure of HDFS, the modification and access times of files, the
access permissions of files and directories, and the list of blocks that make up each file.
– When an operation takes place in HDFS, the directory structure is modified.
– These modifications are stored in memory as well as in the edits files (the edits files are stored on hard disk).
All these changes are recorded in the namenode in append-only fashion.

3. Why can a secondary namenode not act like a primary node?

A. To reduce the namenode's start-up time, the fsimage and the edits file have to be merged
periodically so that the edits file doesn't keep growing. For this, the fsimage has to be loaded into memory
and the edits applied to it one by one, which is a resource-intensive operation.
Since the namenode is already busy, this work is given to another node called the secondary namenode.
The problem with the secondary namenode is that it does not contain the most recent changes, only those
up to its last checkpoint. That is the reason your secondary namenode cannot become the namenode.
If you make the secondary namenode the namenode, you will lose the changes that were present only in
the edits of the primary namenode.
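
The merge described above can be pictured with a small sketch. It is only an illustration under loud assumptions: the namespace is modelled as a plain set of paths, and the "ADD"/"DELETE" edit format is made up for this example (real fsimage and edits files are binary and much richer). It shows why a checkpoint lags behind the live namespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model: the namespace is a set of paths and each edit is
// either "ADD <path>" or "DELETE <path>". This is not the real file format.
final class CheckpointSketch {

    // Replays the edits, in order, on top of a copy of the fsimage.
    static TreeSet<String> merge(TreeSet<String> fsimage, List<String> edits) {
        TreeSet<String> merged = new TreeSet<>(fsimage);
        for (String edit : edits) {
            String[] parts = edit.split(" ", 2);
            if (parts[0].equals("ADD")) {
                merged.add(parts[1]);
            } else if (parts[0].equals("DELETE")) {
                merged.remove(parts[1]);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> fsimage = new TreeSet<>(Arrays.asList("/data", "/tmp"));
        List<String> edits = new ArrayList<>(Arrays.asList("ADD /data/a.txt", "DELETE /tmp"));

        // Checkpoint taken by the secondary namenode at this point.
        TreeSet<String> checkpoint = merge(fsimage, edits);

        // The primary keeps accepting changes after the checkpoint.
        edits.add("ADD /data/b.txt");
        TreeSet<String> live = merge(fsimage, edits);

        System.out.println("checkpoint: " + checkpoint);
        System.out.println("live:       " + live);
    }
}
```

In this sketch the path /data/b.txt exists only in the live namespace, which is the kind of change you would lose by promoting the secondary namenode.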

4. What is HDFS namenode federation?

A. In federation, no single namenode holds the metadata of the entire HDFS. Instead, multiple independent
namenodes each hold the metadata and the block mapping for the files and directories of a subset of the
HDFS namespace.
For example, if HDFS contains two directories inside a directory, there may be two namenodes that each
maintain one of the two directories, so that the load on any one namenode is lower. If one namenode fails,
the directories managed by the other namenodes remain available.
Each namenode manages a namespace volume, which consists of the metadata for its part of the namespace
and a block pool.
The block pool contains the blocks for the files belonging to that namespace.
For these reasons, the failure of a single namenode no longer takes down the whole file system.
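
To make the idea of splitting the namespace concrete, here is a minimal sketch of how a path could be routed to the namenode that owns it. The mount table, the host names, and the longest-prefix rule are assumptions made for this illustration, not the actual HDFS client code (Hadoop offers ViewFs for client-side mount tables).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mount table: directory prefix -> namenode that owns it.
final class FederationRoutingSketch {

    // Returns the namenode whose prefix is the longest match for the path,
    // or null if no namenode owns the path.
    static String namenodeFor(Map<String, String> mountTable, String path) {
        String bestPrefix = null;
        for (String prefix : mountTable.keySet()) {
            boolean matches = path.equals(prefix) || path.startsWith(prefix + "/");
            if (matches && (bestPrefix == null || prefix.length() > bestPrefix.length())) {
                bestPrefix = prefix;
            }
        }
        return bestPrefix == null ? null : mountTable.get(bestPrefix);
    }

    public static void main(String[] args) {
        Map<String, String> mountTable = new LinkedHashMap<>();
        mountTable.put("/user", "namenode-1");   // hypothetical host names
        mountTable.put("/logs", "namenode-2");

        System.out.println(namenodeFor(mountTable, "/user/acadgild/file.txt"));
        System.out.println(namenodeFor(mountTable, "/logs/2019/app.log"));
        System.out.println(namenodeFor(mountTable, "/tmp/x"));
    }
}
```

If namenode-2 goes down in this picture, paths under /user still resolve, but paths under /logs have no one to serve them.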

5. What is name node high availability?

A. The problem with federation is that if one namenode goes down, you cannot access the portion of
the data that the namenode takes care of.
In HDFS high availability, you maintain two namenodes, one active and the other standby. Each
namenode holds the file system tree and the block mapping of the entire HDFS, and the edits are
shared between both namenodes. In case of failure, the standby namenode takes charge.
Architectural changes:
- The namenodes must use highly available shared storage to share the edit log. The standby namenode
reads the edit log so that it can take over the responsibility of the active namenode.
- Data nodes must send block reports to both namenodes.
- Checkpointing is done by the standby namenode.

6. How can you control block size and replication factor at file level?

A. You can change the block size, the replication factor, and many other settings at the cluster
level by setting properties in configuration files such as core-site.xml, hdfs-site.xml, and mapred-site.xml.
If you want to upload a file into HDFS with a specific block size and a specific replication
factor, you can do that by providing the property and its value while writing the file into HDFS.
Changing block size (dfs.block.size is the older, deprecated name of dfs.blocksize)
hadoop fs -Ddfs.block.size=1048576 -put file.txt /user/acadgild
hadoop fs -Ddfs.blocksize=1048576 -put file.txt /user/acadgild
Changing replication factor (-setrep changes an existing file; -Ddfs.replication applies while writing a file)
hadoop fs -setrep -w 2 /my/file
hadoop fs -Ddfs.replication=2 -put file.txt /my/file
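
To see what these two settings mean for a single file, the arithmetic can be sketched as follows. The class and method names are made up for this example, and the 5 MB file size is an assumed input; the block size and replication factor match the commands above.

```java
// Illustrative arithmetic only; this class is not part of Hadoop.
final class BlockMathSketch {

    // Number of HDFS blocks needed for a file: ceil(fileSize / blockSize).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster. The last block only occupies
    // as much disk as it has data, so this is fileSize * replication.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 5L * 1024 * 1024;   // assumed 5 MB file
        long blockSize = 1048576L;          // 1 MB, as in -Ddfs.blocksize=1048576
        int replication = 2;                // as in -Ddfs.replication=2

        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks:         " + blocks);
        System.out.println("block replicas: " + blocks * replication);
        System.out.println("raw bytes:      " + rawStorageBytes(fileSize, replication));
    }
}
```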

7. What are the roles of Mapper, Combiner, Partitioner, Shuffle & Sort and Reducer classes?

A. MapReduce works entirely on key-value pairs.
The mapper phase takes input key-value pairs from a file or any other source and, after performing
transformations, produces output in the form of key-value pairs. The keys and values can be simple types
such as IntWritable, LongWritable, or Text, or composite types that you define as a custom Writable or
custom WritableComparable.
The key-value pairs emitted by the mappers come from different machines. The shuffle makes sure that
all the values having the same key are collected together and then passed to the reducer.
The reducer receives the group of values for a key, which MapReduce presents as an Iterable of all the
values, and can perform operations such as aggregation on it.
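
The flow of map, shuffle, and reduce can be imitated in plain Java on a single machine. This is only a sketch with made-up method names and a word-count example chosen for illustration; it does not use the Hadoop API, and a real job runs these phases across many nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine imitation of map -> shuffle and sort -> reduce.
final class ShuffleSketch {

    // Map phase: emit (word, 1) for each word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group the values by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate the iterable of values for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("hadoop spark hadoop"));   // output of a first mapper
        pairs.addAll(map("spark hdfs hadoop"));     // output of a second mapper
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```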
The combiner is like a mini reducer. If the operation is associative and commutative, such as finding the
maximum or minimum, you can perform the aggregation before the shuffle phase. This reduces the work
of the shuffle and the load on the reducer. So, depending on the requirement, you can use a
combiner in the application, but not all the time: only when the operation is associative and the
ordering of the values does not matter.
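
A quick way to check whether an operation is safe for a combiner is to compare the result of aggregating everything at once with the result of aggregating per mapper first. The sketch below does this for maximum and for a naive average; the class name and the sample numbers are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Compares "aggregate once" with "aggregate per mapper, then aggregate again".
final class CombinerSketch {

    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    static double mean(List<Integer> values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = Arrays.asList(1, 2, 3);      // values seen by one mapper
        List<Integer> mapper2 = Arrays.asList(10);           // values seen by another mapper
        List<Integer> all = Arrays.asList(1, 2, 3, 10);

        // Maximum: combining per mapper first gives the same answer.
        int combinedMax = max(Arrays.asList(max(mapper1), max(mapper2)));
        System.out.println("max, no combiner:   " + max(all));
        System.out.println("max, with combiner: " + combinedMax);

        // Naive average: averaging the per-mapper averages changes the answer.
        double meanOfMeans = (mean(mapper1) + mean(mapper2)) / 2;
        System.out.println("mean, no combiner:   " + mean(all));
        System.out.println("mean, with combiner: " + meanOfMeans);
    }
}
```

The maximum comes out the same both ways, while the naive average does not, so an average cannot simply reuse the reducer as a combiner.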
Whenever a mapper produces output, there has to be a way to decide which key goes to which
reducer. When multiple reducers run in parallel, the partitioner controls which key, along with its
values, goes to which reducer. By default, a hash partitioner makes this decision: it takes the hash
code of the key and computes its modulus with the number of reducers.
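
The hash partitioner's rule fits in one line. The sketch below uses the same formula as Hadoop's default HashPartitioner (clear the sign bit of the hash code, then take the modulus), but on plain Java objects rather than Writable keys, so the actual partition numbers in a real job will differ because Text and String compute their hash codes differently. The class and method names are made up for illustration.

```java
// Stand-alone version of the hash partitioning rule; not the Hadoop class itself.
final class HashPartitionSketch {

    // Clears the sign bit so the result is never negative, then takes the
    // modulus with the number of reducers.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;   // assumed number of reducers
        for (String key : new String[] {"apple", "banana", "cherry", "apple"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the result depends only on the key and the number of reducers, every occurrence of the same key lands on the same reducer.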
Each reducer receives its keys in sorted order. Sorting the output keys of the mappers is taken care of
by the shuffle and sort.
As a developer, you can override the Mapper, Combiner, Partitioner, and Reducer classes, but shuffle
and sort is internal to the Hadoop framework.

8. How to control the number of reducers in a map reduce program?

A. By default, a MapReduce job runs with a single reducer (the mapreduce.job.reduces property defaults
to 1). You can override this in the driver with the call below:
job.setNumReduceTasks(int n)
This call sets the number of reducers to the integer you pass as the parameter.

9. How does Hadoop know how many mappers have to be started?

A. The number of mappers equals the number of input splits.
Number of input splits (for a single file) = Ceil((Size of file) / (Size of input split))
For example, if you have 1 GB of data and the input split size is 128 MB, then 1024/128 gives you 8, so 8
mappers will be started.
By default, the input split size equals the block size, so the number of input splits equals the
number of blocks. So, you can say that the number of mappers equals the number of blocks.
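
The formula above can be checked with a few lines of Java. This sketch covers only the ceiling division from the answer; the class and method names are made up, and Hadoop's own split logic has extra details (for example, it lets the last split run slightly over the split size) that are left out here.

```java
// Ceiling division from the formula above; not Hadoop's actual split code.
final class InputSplitSketch {

    // Number of input splits for one file: ceil(fileSize / splitSize).
    static long splitCount(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long splitSize = 128 * mb;   // default case: split size equals block size

        System.out.println("1024 MB -> " + splitCount(1024 * mb, splitSize) + " mappers");
        System.out.println("1100 MB -> " + splitCount(1100 * mb, splitSize) + " mappers");
    }
}
```

With a 128 MB split size, the 1 GB file from the example needs 8 mappers, and a 1100 MB file needs 9 because the last split is only partly filled.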

Hope this post helped you know some important interview questions that are asked in the Hadoop HDFS and MapReduce topics.

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.