In this blog, we go through some important Big Data and Hadoop interview questions on Advanced MapReduce. Hadoop is a skill in constant demand, and we have compiled some of the most important questions to review so that you have the upper hand in the interview.
Below are some of the questions:
How do you process different files with different mappers?
Ans: The MultipleInputs class supports MR jobs that have multiple input paths, each with its own input format and Mapper. For every input, add the path together with its input format and mapper to the list of inputs for the MR job.
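The idea can be sketched in plain Java, without any Hadoop classes. In a real job the registration would go through Hadoop's MultipleInputs.addInputPath, which takes the job, a path, an input format class and a mapper class. Everything below (the class name MultipleInputsSketch, the two paths and the record layouts) is made up for illustration and only models the "one mapper per path" behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of "a different mapper for each input path".
// This is NOT the Hadoop API; all names here are hypothetical.
class MultipleInputsSketch {
    private final Map<String, Function<String, String>> mapperByPath = new HashMap<>();

    // Models registering one input path together with its own mapper.
    void addInputPath(String path, Function<String, String> mapper) {
        mapperByPath.put(path, mapper);
    }

    // A record is always processed by the mapper registered for its path.
    String map(String path, String record) {
        Function<String, String> mapper = mapperByPath.get(path);
        if (mapper == null) {
            throw new IllegalArgumentException("No mapper registered for " + path);
        }
        return mapper.apply(record);
    }

    // Two inputs with different layouts, both normalised to "id<TAB>name".
    static MultipleInputsSketch example() {
        MultipleInputsSketch job = new MultipleInputsSketch();
        job.addInputPath("/data/customers.csv", line -> {
            String[] f = line.split(",");      // layout assumed: id,name
            return f[0] + "\t" + f[1];
        });
        job.addInputPath("/data/legacy.psv", line -> {
            String[] f = line.split("\\|");    // layout assumed: name|id
            return f[1] + "\t" + f[0];
        });
        return job;
    }

    public static void main(String[] args) {
        MultipleInputsSketch job = example();
        System.out.println(job.map("/data/customers.csv", "7,Asha"));
        System.out.println(job.map("/data/legacy.psv", "Ravi|8"));
    }
}
```

Because both mappers emit the same key and value layout, a single reducer could then process the output of both inputs together.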
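What is the distributed cache?
Ans: It is a facility provided by the MapReduce framework to cache files (such as text files, archives and JARs) that an application needs. The framework copies these files to each worker node before the job's tasks run there, so the tasks read them from local disk instead of fetching them remotely every time.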
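What is a map-side join?
Ans: A map-side join performs the join entirely in the map phase. If one of the datasets is small enough to fit into memory, it can be added to the distributed cache and loaded by every mapper, which then matches each record of the larger dataset against it. Because no shuffle or reduce phase is required, a map-side join is generally faster than a reduce-side join.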
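A minimal sketch of the join logic in plain Java, assuming comma-separated records whose first field is the join key. The class MapSideJoinSketch and the sample data are hypothetical; in a real job the small dataset would be loaded from the distributed cache when the mapper starts, rather than passed to a constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a map-side join; no Hadoop classes are used.
class MapSideJoinSketch {
    // Stands in for the small dataset held in each mapper's memory.
    private final Map<String, String> smallTable = new HashMap<>();

    // Models the mapper's one-time setup: load the small dataset into memory.
    MapSideJoinSketch(List<String> smallDatasetLines) {
        for (String line : smallDatasetLines) {
            String[] parts = line.split(",", 2);   // key,rest
            if (parts.length == 2) {
                smallTable.put(parts[0], parts[1]);
            }
        }
    }

    // Models map(): join one record of the large dataset by hash lookup.
    // Returns null when the key has no match (inner-join behaviour).
    String map(String largeDatasetLine) {
        String[] parts = largeDatasetLine.split(",", 2);
        if (parts.length < 2) {
            return null;
        }
        String match = smallTable.get(parts[0]);
        return match == null ? null : parts[0] + "," + parts[1] + "," + match;
    }

    // Runs the "map phase" over the large dataset; there is no shuffle step.
    static List<String> join(List<String> small, List<String> large) {
        MapSideJoinSketch mapper = new MapSideJoinSketch(small);
        List<String> out = new ArrayList<>();
        for (String line : large) {
            String joined = mapper.map(line);
            if (joined != null) {
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> departments = Arrays.asList("10,Sales", "20,Engineering");
        List<String> employees = Arrays.asList("10,Asha", "20,Ravi", "30,Meera");
        join(departments, employees).forEach(System.out::println);
    }
}
```

The record with key 30 is dropped because the small table has no entry for it. Each lookup is a constant-time hash probe, which is why the whole join fits inside the map phase.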
When should you use a map-side join?
Ans: When one of the two datasets is small enough to fit in memory, we should go with a map-side join.
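What is the role of Writable and WritableComparable?
Ans: These are Hadoop's serialization interfaces. Writable allows an object to be serialized and deserialized by the Hadoop framework. WritableComparable extends Writable with comparison, so its objects can also be sorted.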
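What should be the type of the map output key (Writable/WritableComparable)?
Ans: The output key must be WritableComparable, because the framework sorts keys during the shuffle. The output value only needs to be Writable.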
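To make the two roles concrete, here is a sketch of a custom key written against the JDK only. The method names write, readFields and compareTo mirror the ones Hadoop's interfaces declare, but the class does not implement the Hadoop interfaces, so the example runs without Hadoop on the classpath. YearMonthKey and roundTrip are hypothetical names used for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical composite key. In a real job it would be declared as
// "implements WritableComparable<YearMonthKey>" instead of Comparable.
class YearMonthKey implements Comparable<YearMonthKey> {
    private int year;
    private int month;

    YearMonthKey() { }   // no-arg constructor so readFields can fill an empty object

    YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    // Serialization role: write the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }

    // Deserialization role: read the fields back in the same order.
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
    }

    // Comparison role: defines the order in which keys are sorted.
    @Override
    public int compareTo(YearMonthKey other) {
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(month, other.month);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof YearMonthKey
            && ((YearMonthKey) o).year == year
            && ((YearMonthKey) o).month == month;
    }

    @Override
    public int hashCode() {
        return 31 * year + month;
    }

    // Serializes a key to bytes and reads it back into a fresh object.
    static YearMonthKey roundTrip(YearMonthKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearMonthKey copy = new YearMonthKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A value class would need only write and readFields. A key class needs compareTo as well, and a consistent equals and hashCode are worth adding because keys are usually assigned to reducers by hash.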
What is an input split?
Ans: It is the logical representation of the data stored in file blocks. A mapper does not read the blocks directly; it reads the records of the input split assigned to it.
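The "logical" part can be illustrated with a small sketch: a split is only a byte range (start offset and length) over a file, not a copy of the data. The code below is a simplified model that cuts a file into fixed-size ranges. The names InputSplitSketch and computeSplits are hypothetical, and Hadoop's own split calculation has further rules that are not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of logical input splits; not Hadoop's actual algorithm.
class InputSplitSketch {

    // A split describes where to read, it does not hold the data itself.
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "Split[start=" + start + ", length=" + length + "]";
        }
    }

    // Cuts a file of fileLength bytes into consecutive ranges of at most splitSize bytes.
    static List<Split> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with a 128 MB split size yields ranges of 128, 128 and 44 MB.
        computeSplits(300 * mb, 128 * mb).forEach(System.out::println);
    }
}
```

Each range would be handed to one map task, which is why the number of splits determines the number of mappers.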
What is the role of RecordReader?
Ans: RecordReader reads data from an input split and converts it into key-value pairs for the mapper.
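As an illustration, the sketch below models what a line-oriented record reader does for text input: the key is the byte offset at which a line starts and the value is the line itself. It is a plain-Java approximation that assumes UTF-8 text with "\n" line endings and a split that starts and ends on line boundaries. LineRecordReaderSketch and readRecords are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of a line-oriented record reader; no Hadoop classes are used.
class LineRecordReaderSketch {

    // Turns the raw bytes of a split into (byte offset, line) pairs.
    static List<Map.Entry<Long, String>> readRecords(byte[] split) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= split.length; i++) {
            boolean atEnd = (i == split.length);
            if (atEnd || split[i] == '\n') {
                // Skip the empty remainder that follows a trailing newline.
                if (!(atEnd && lineStart == i)) {
                    String line = new String(split, lineStart, i - lineStart, StandardCharsets.UTF_8);
                    records.add(new AbstractMap.SimpleEntry<>((long) lineStart, line));
                }
                lineStart = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nbb,2\nccc,3\n".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> record : readRecords(data)) {
            System.out.println(record.getKey() + "\t" + record.getValue());
        }
    }
}
```

For the sample data the keys are 0, 4 and 9, the byte positions where each line begins. The mapper then receives one such pair per call to map().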
What are the tasks of InputFormat?
Ans: It validates the input specification of the job, splits the input files into logical input splits, and provides the RecordReader implementation used to read input records from each split.
What is the small file problem in Hadoop?
Ans: If we have many small files, they put overhead on the NameNode, which keeps the metadata for every file and block in memory. In turn, this slows down processing, because each small file typically also gets its own map task.
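A rough back-of-the-envelope calculation shows the scale of the overhead. The sketch below assumes one NameNode object per file plus one per block, and about 150 bytes of heap per object. The 150-byte figure is a commonly quoted rule of thumb, not a measured value, and the class and method names are hypothetical.

```java
// Rough estimate of NameNode metadata overhead; the constants are assumptions.
class SmallFilesEstimate {
    // Rule-of-thumb heap cost per file or block object (assumed, not measured).
    static final long ASSUMED_BYTES_PER_OBJECT = 150L;

    // One object for the file itself plus one object per block it occupies.
    static long namenodeObjects(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        long blocksPerFile = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
        return fileCount * (1 + blocksPerFile);
    }

    static long estimatedHeapBytes(long objects) {
        return objects * ASSUMED_BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;

        // The same amount of data stored two ways.
        long smallFiles = namenodeObjects(10_000_000L, 1 * mb, blockSize);   // 10 million files of 1 MB
        long largeFiles = namenodeObjects(78_125L, 128 * mb, blockSize);     // 78,125 files of 128 MB

        System.out.println("Small files: " + smallFiles + " objects, ~"
            + estimatedHeapBytes(smallFiles) / mb + " MB of heap");
        System.out.println("Large files: " + largeFiles + " objects, ~"
            + estimatedHeapBytes(largeFiles) / mb + " MB of heap");
    }
}
```

Under these assumptions, the same volume of data needs 20,000,000 NameNode objects as 1 MB files but only 156,250 objects as 128 MB files, which is why small files are usually merged before being stored.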
What are the advantages of sequence files?
Ans: Sequence files are flat files consisting of binary key-value pairs. They are extensively used in MapReduce as an input/output format. They support compression, remain splittable, and can pack many small files into one larger file.
We hope this blog helped you understand some of the important Hadoop interview questions related to Advanced MapReduce. Enroll for Big Data And Hadoop Development Training conducted by Acadgild.