
Hadoop Interview Questions – 2017

1. I have Hadoop 2.x with the block size configured to 128MB. Now, I have changed the block size to 150MB. Will this change affect the files which are already present?

  1. No, this change will not affect the existing files; all the existing files will keep a block size of 128MB. Once you restart the cluster after configuring the new block size, the change will come into effect, and all the new files you copy into HDFS will be written with a block size of 150MB.

2. HDFS works on the principle of ‘Write Once, Read Many Times.’ By this logic, can you overwrite a file which is already present in HDFS? If yes, explain how that can be done.

  1. Yes, we can overwrite a file which is already present in HDFS. Using the -f option of the put or copyFromLocal command, we can overwrite a file in HDFS:
    hadoop fs -put -f <<local path>> <<hdfs path>>

 
3. What is meant by Safe Mode and when does the NameNode go into safe mode?

  1. Safe mode is a state in which you cannot write data to HDFS; HDFS will be in read-only mode, so you can read the data, but you cannot write into HDFS. During this phase, the NameNode builds the file system state by loading the fsimage, applying the edit logs, and keeping the result in memory.

Every time you start the HDFS daemons, the NameNode goes into safe mode, checks the block reports, and ensures that all the DataNodes are up and working.
If this process is interrupted by some internal or external event, the NameNode can remain stuck in safe mode. In that case, you need to come out of safe mode explicitly by using the command: hdfs dfsadmin -safemode leave
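You can also inspect or toggle the state yourself with the dfsadmin tool; a quick sketch of the relevant subcommands:

hdfs dfsadmin -safemode get     # report whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode enter   # put the NameNode into safe mode manually
hdfs dfsadmin -safemode leave   # force the NameNode out of safe mode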
4. Currently, I am using Hadoop 2.x, but I want to upgrade to Hadoop 3.x. How can I upgrade without losing my data in HDFS?

  1. You can upgrade or downgrade to another Hadoop version without losing data, as long as you preserve the NameNode’s and DataNodes’ current directories and VERSION files. While installing the new version, you just need to give these directories as the NameNode and DataNode metadata directories.

As we are not changing the cluster, Hadoop will use the metadata present in those directories to bring back your data.
5. Where does the metadata of NameNode reside? Is it in-memory or on the disk?

  1. The answer is both. The metadata is stored on disk, but once you start the Hadoop cluster, the NameNode loads it into memory for faster access. Any updates that happen after the cluster is started are made in memory, and the changes are saved back to disk when the cluster is shut down.

6. My Hadoop cluster is running fine, but unfortunately I have deleted the NameNode metadata directory. What will happen now? Will all my data be lost and will the existing processes be disrupted?

  1. No! Everything will go on normally until you shut down your cluster, because once you start the Hadoop cluster, the NameNode’s metadata is loaded into memory, and there is no interaction with the local directory from that point. So all your data will be there until you shut down the cluster, and nothing will happen to the existing processes either.

But once you shut down and start your cluster, everything will be new and you cannot see any data in your cluster.
7. Suppose I am using Hadoop 2.x with a block size of 128MB, and I am writing a 1GB file into the cluster. Suddenly, after 200MB has been written, the process stops. What do you think will happen now? Will I be able to read the 200MB of data or not?

  1. You will be able to read only 128MB of data. A client can read only a complete block of data that has been written into HDFS. You will not be able to read the remaining 72MB, as the write to that block was interrupted in between, and the other 824MB of data was never written into HDFS at all.

While writing the data into HDFS, HDFS will simultaneously maintain replicas also. If your replication factor is 3, then the other 2 replicas will also be written simultaneously.
8. How can you troubleshoot if your NameNode or DataNodes are not running?

  1. We need to check the CLUSTER ID in the NameNode’s VERSION file and in each DataNode’s VERSION file; both CLUSTER IDs should match, or else the DataNodes will not synchronize with the NameNode. So, if the CLUSTER IDs are different, you need to make them the same, typically by copying the NameNode’s cluster ID into the DataNode’s VERSION file.
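A minimal sketch of the check, assuming the metadata directories shown below; in practice use whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to on your cluster:

grep clusterID /data/hadoop/dfs/name/current/VERSION   # on the NameNode
grep clusterID /data/hadoop/dfs/data/current/VERSION   # on each DataNode
# If the two values differ, copy the NameNode's clusterID into the DataNode's
# VERSION file (or re-initialize the DataNode directory) and restart the DataNode.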

9. When does the reducer phase take place in a MapReduce job?

  1. The reduce phase has 3 steps: Shuffle, Sort, and Reduce. The shuffle phase is where the data is collected by the Reducer from each Mapper. This can happen while the Mappers are still generating data, since it is only a data transfer. On the other hand, sort and reduce can only start once all the Mappers are done. You can tell which phase MapReduce is in by looking at the reducer completion percentage:

0-33% means it’s doing the shuffle
34-66% is sort
67-100% is reduce
This is why your reducers will sometimes seem “stuck” at 33%: they are waiting for the Mappers to finish.
Reducers start shuffling based on a threshold percentage of Mappers that have finished. You can change this parameter to make the reducers start sooner or later.
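For instance, this threshold is exposed in Hadoop 2.x as the mapreduce.job.reduce.slowstart.completedmaps property (default 0.05). A sketch of setting it per job, assuming the driver uses ToolRunner so that -D is honored; the jar name, class name, and paths are placeholders:

# Start reducers only after 80% of the mappers have completed.
hadoop jar wordcount.jar WordCountDriver \
  -D mapreduce.job.reduce.slowstart.completedmaps=0.80 \
  /data/input /data/output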
10. How can you chain MapReduce jobs?
A. Not every problem can be solved with a MapReduce program, but fewer still are those that can be solved with a single MapReduce job. Many problems can be solved with MapReduce, by writing several MapReduce steps which run in a series to accomplish a goal:
Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3
You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, call the next driver method, which creates a new JobConf object referring to different Mapper and Reducer instances, etc. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.
Many problems, which at first seem impossible in MapReduce, can be accomplished by dividing one job into two or more.
Hadoop provides another mechanism for managing batches of jobs with dependencies between them. Rather than submitting a JobConf to the JobClient’s runJob() or submitJob() methods, org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; a Job takes a JobConf object as its constructor argument. Jobs can depend on one another through the use of the addDependingJob() method. The code:

 x.addDependingJob(y)

says that Job x cannot start until Job y has successfully completed.
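Putting it together, here is a minimal sketch of a driver that wires two dependent jobs with JobControl; the mapper/reducer class names and paths are hypothetical placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf1 = new JobConf(ChainDriver.class);
    conf1.setJobName("step-1");
    conf1.setMapperClass(Step1Mapper.class);        // hypothetical mapper
    conf1.setReducerClass(Step1Reducer.class);      // hypothetical reducer
    conf1.setOutputKeyClass(Text.class);
    conf1.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf1, new Path("/data/input"));
    FileOutputFormat.setOutputPath(conf1, new Path("/data/step1-out"));

    JobConf conf2 = new JobConf(ChainDriver.class);
    conf2.setJobName("step-2");
    conf2.setMapperClass(Step2Mapper.class);        // hypothetical mapper
    conf2.setReducerClass(Step2Reducer.class);      // hypothetical reducer
    conf2.setOutputKeyClass(Text.class);
    conf2.setOutputValueClass(IntWritable.class);
    // The first job's output directory becomes the second job's input.
    FileInputFormat.setInputPaths(conf2, new Path("/data/step1-out"));
    FileOutputFormat.setOutputPath(conf2, new Path("/data/final-out"));

    Job step1 = new Job(conf1);
    Job step2 = new Job(conf2);
    step2.addDependingJob(step1);                   // step2 waits for step1 to succeed

    JobControl control = new JobControl("chained-jobs");
    control.addJob(step1);
    control.addJob(step2);

    // JobControl implements Runnable: run it in a thread and poll until all jobs finish.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(5000);
    }
    control.stop();
  }
}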

11. What are counters in MapReduce?
A. A Counter is generally used to keep track of the occurrences of any event. In the Hadoop Framework, whenever any MapReduce job gets executed, the Hadoop Framework initiates counters to keep track of the job statistics like the number of rows read, the number of rows written as output, etc.
These are the built-in counters of the Hadoop Framework. Additionally, we can also create and use our own custom counters.
Typically, some of the things Hadoop counters track are:

  • Number of Mappers and Reducers launched
  • Number of bytes that get read and written
  • The number of tasks that get launched and successfully run
  • The amount of CPU and memory consumed, which helps you judge whether the configuration is appropriate for the job and the cluster nodes

You can refer to this blog for more information on counters.
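As a sketch of a custom counter, here is a hypothetical mapper that counts empty input lines while emitting (word, 1) pairs; the class and counter names are illustrative only:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Custom counters are usually declared as an enum; Hadoop groups them by the enum's name.
  public enum RecordQuality { EMPTY_LINES }

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.trim().isEmpty()) {
      // Increment the custom counter instead of emitting anything.
      context.getCounter(RecordQuality.EMPTY_LINES).increment(1);
      return;
    }
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

The final value of the custom counter is reported along with the built-in counters when the job completes.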
12. Are there any other storage systems that can be used with MapReduce other than HDFS?

  1. Yes, Hadoop supports many other compatible file systems. With Hadoop 2.x you can use Amazon’s S3; with the more recent Hadoop 2.x and 3.x releases you can also use Microsoft’s Azure Data Lake Storage or Azure Blob Storage. MongoDB has also released a Hadoop-MongoDB connector to integrate with it.

You can refer to this blog for knowing how to process data in MongoDB using MapReduce.
13. Is this piece of code correct? If not, explain where it went wrong.

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static Text one = new Text(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

Yes, there is a fault in this code. The output value type declared in the Mapper’s generic parameters is IntWritable, but the value being written to the context is of type Text.
The data types of the output key and output value defined in the Mapper parameters must match the data types of the key and value written to the context.
So, either you should change the output value data type specified in the Mapper class parameters to Text, or you should change the data type of one to IntWritable.
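For reference, a sketch of the corrected mapper with one changed to an IntWritable, following the standard word-count example:

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    // Declare the count as an IntWritable so it matches the declared output value type.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // Text key, IntWritable value: types now match
      }
    }
  }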
14. What is the difference between an HDFS block and an InputSplit, and how is the input split prepared in Hadoop?

  1. The HDFS block is a physical representation of data, while the InputSplit is a logical representation of the data. The InputSplit will refer to the HDFS block location.

The InputFormat is responsible for providing the splits.
In general, if you have n nodes, HDFS will distribute the file over all these n nodes. If you start a job, there will be one Mapper per input split by default, and Hadoop will try to run each Mapper on a machine that stores the part of the data it processes. This is known as data locality.
To cut a long story short: upload the data to HDFS and start an MR job; Hadoop will take care of the optimized execution.
15. Can you find the top 10 records based on values using MapReduce?
A. Yes, there is a design pattern called Top K records in Hadoop, using which we can find the top 10 records.
You can refer to this blog for the implementation of Top K records in MapReduce.
16. What is meant by a combiner, and where exactly is it used in MapReduce?

  1. A combiner acts like a mini reducer in Hadoop. The combiner locally aggregates the output of each Mapper before it is sent to the Reducers.

So, because of the combiner, the burden on the Reducer is smaller and the execution happens faster.
The combiner runs after the Mapper and before the Reducer phase.
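A minimal sketch of registering a combiner in a word-count driver; TokenizerMapper and IntSumReducer are assumed to be the usual word-count classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the reducer doubles as the combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as the combiner works here because summing counts is both commutative and associative.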
17. How can I get the output of a Hive query into a .csv file?

  1. You can save the output of a Hive query into a local directory by using an INSERT OVERWRITE LOCAL DIRECTORY statement, for example:
INSERT OVERWRITE LOCAL DIRECTORY '/home/acdgild/hiveql_output' select * from table;
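Note that, by default, the columns in the output files are separated by the ^A control character rather than commas. On Hive 0.11 and later you can specify the delimiter explicitly; a sketch, with the output path and table name as placeholders:

INSERT OVERWRITE LOCAL DIRECTORY '/home/acdgild/hiveql_output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM table;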

18. Can I run a Hive query directly from the terminal without logging into the Hive shell?

  1. Yes, by using the hive -e option, we can run any kind of Hive query directly from the terminal without logging into the Hive shell.

Here is an example:

hive -e 'select * from table'

You can also save the output into a file by using the Linux ‘>’ redirection operator, as shown below:

hive -e 'select * from table' > /home/acdgild/hiveql_output.tsv

19. Explain Cluster By vs. Order By vs. Sort By in Hive.
The short version: CLUSTER BY gives you a global ordering, provided you’re willing to merge the multiple sorted output files yourself.
The longer version:

  • ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as the output.
  • SORT BY x: orders data at each of the N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
  • DISTRIBUTE BY x: ensures each of the N reducers gets non-overlapping ranges of x, but does not sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
  • CLUSTER BY x: ensures each of the N reducers gets non-overlapping ranges, then sorts by those ranges at the Reducers. This gives you a global ordering and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.

Hence, CLUSTER BY is basically the more scalable version of ORDER BY.
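A quick illustration of the equivalence above, using a hypothetical employees table:

-- These two queries are equivalent:
SELECT * FROM employees CLUSTER BY salary;
SELECT * FROM employees DISTRIBUTE BY salary SORT BY salary;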
20. Explain the differences between Hive Internal and External tables.

  1. A Managed table is also called an Internal table. This is the default table type in Hive. When we create a table in Hive without specifying it as managed or external, by default we get a Managed table, and the table is created in a specific location in HDFS.

By default, the table data will be created in the /user/hive/warehouse directory of HDFS.

If we delete a Managed table, both the table data and the metadata for that table will be deleted from the HDFS.

An external table is created for external use, i.e., when the data is also used outside Hive. Whenever we want to be able to delete a table’s metadata while keeping the table’s data as it is, we use an External table. Dropping an external table deletes only the schema (metadata) of the table; the data itself remains in HDFS.
You can refer to our blog on Managed and External Tables in Hive for more information and hands-on knowledge.
21. How can you select the current date and time using HiveQL?

SELECT from_unixtime(unix_timestamp()); -- current timestamp
SELECT CURRENT_DATE;                    -- current date
SELECT CURRENT_TIMESTAMP;               -- current timestamp

22. How can you skip the first line of the data set while loading it into a Hive table?

  1. While creating the Hive table, we can specify in the tblproperties clause to skip the first row and load the rest of the dataset. Here is an example:
create external table testtable (name string, message string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/testtable'
tblproperties ("skip.header.line.count"="1");

23. Explain the difference between COLLECT_LIST and COLLECT_SET, and say where exactly they can be used in Hive.

  1. When you want to collect an array of values for a key, you can use the COLLECT_LIST and COLLECT_SET aggregate functions.

COLLECT_LIST will include duplicate values for a key in the resulting list, whereas COLLECT_SET will keep only the unique values for a key.
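A quick illustration, using a hypothetical orders table with user_id and product columns:

-- All products per user, duplicates preserved:
SELECT user_id, COLLECT_LIST(product) FROM orders GROUP BY user_id;
-- Only the distinct products per user:
SELECT user_id, COLLECT_SET(product) FROM orders GROUP BY user_id;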
24. How can you run a Hive query in the Debug mode?

  1. Hive queries can be run in debug mode by starting the Hive console with the root logger switched to DEBUG, as follows:
hive --hiveconf hive.root.logger=DEBUG,console

Now all the queries which you run in the Hive shell will run in debug mode, and you can also see the entire stack trace for each query.
25. How can you store the output of a Pig relation directly into Hive?

  1. Using the HCatStorer function of HCatalog, you can store the output of a Pig relation directly into a Hive table.

Similarly, you can load the data of a Hive table into a Pig relation for pre-processing using the HCatLoader function of HCatalog.
You can refer to our blog on Loading and Storing Hive Data into Pig for some hands-on knowledge.
26. Can you process the data present in MongoDB using Pig?

  1. Yes, you can process the data present in MongoDB using Pig with the help of MongoDB Pig connector.

You can refer to our blog on processing data stored in MongoDB using Pig.
27. How many kinds of functions are available in Pig UDFs?
There are 3 kinds of functions available in Pig UDFs:
1. Eval function: All kinds of evaluation can be done with Eval functions. An Eval function takes one record as input, evaluates it, and returns one result.
2. Aggregate function: Aggregate functions are Eval functions that work on a group of data; they take a bag as input and return a scalar value as output.
3. Filter function: These are also a kind of Eval function, but they return a Boolean value as the result. If the record satisfies the condition, the function returns true; otherwise, it returns false.
28. How can you visualize the outcomes of a Pig relation?

  1. Zeppelin provides one of the simplest ways to visualize the outcome of a Pig relation. Support for visualizing the outcome of Pig was added in Zeppelin 0.7.0.

You can refer to our blog on Integrating Pig with Zeppelin for more information.
29. How can you load a file into HBase?

  1. One way to load a file is bulk loading using MapReduce.

Another way to load data is by using Hive with the help of the Hive-HBase storage handler: first you load the data into Hive, and in turn it gets reflected in HBase. You can refer to our blog on HBase Write Using Hive.
You can also write a shell script which runs recursively until all the lines are written into the HBase table.
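As an illustration of the bulk-load route, here is a sketch using HBase’s ImportTsv tool; the table name, column mapping, and paths are hypothetical:

# Generate HFiles from a tab-separated file with a MapReduce job.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:city \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  mytable /data/input.tsv

# Hand the generated HFiles over to the region servers.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable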
30. How can you transfer data present in MySQL to HBase?
One way you can migrate data from MySQL to HBase is by using Sqoop.
You can also migrate data from MySQL to HBase using MapReduce.
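A minimal Sqoop sketch; the connection string, credentials, table, row key, and column family below are hypothetical placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table customers \
  --hbase-table customers \
  --column-family cf \
  --hbase-row-key customer_id \
  --hbase-create-table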
We hope this blog helped you in your Hadoop interview preparation. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
