Frequently Asked Hadoop Interview Questions in 2017 Part – 2

Before working through this second part of our Hadoop interview questions, we recommend reading our previous post, Hadoop Interview Questions 2017 Part 1.
In this second part, we discuss questions related to the Big Data Hadoop ecosystem.
For most of the questions, we have linked relevant posts that you can refer to for practical implementation.

1.Can you join or transform tables/columns when importing using Sqoop?

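Yes. Sqoop supports free-form query imports through the --query option, so the imported data can be the result of an SQL statement that joins tables or transforms columns. The query must contain the $CONDITIONS placeholder in its WHERE clause, and a destination must be given with --target-dir.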

For more information, please refer to this post – Beginners Guide to Sqoop.

2.What is the importance of indexing in Hive and how does this relate to Partition and Bucketing?

The goal of Hive indexing is to improve the speed of query lookups on certain columns of a table. Without an index, a query with a predicate like ‘WHERE tab1.col1 = 10’ loads the entire table or partition and processes all the rows. However, if an index exists for col1, then only a portion of the file needs to be loaded and processed.

Indexes become even more essential when the tables grow extremely large, and as you now undoubtedly know, Hive thrives on large tables. We can index tables that are partitioned or bucketed.

Bucketing:

Bucketing is usually used for join operations, as you can optimize joins by bucketing records by a specific ‘key’ or ‘id’. When you then join on that key, records with the same ‘key’ are in the same bucket, so the join operation is faster. More generally, bucketing is a technique for decomposing data sets into more manageable parts.

Partitioning:

When you want the data contained in a Hive table to be split across multiple sections, partitioning is the recommended approach.

The entries for the various columns of the dataset are segregated and then stored in their respective partition. When we write the query to fetch the values from the table, only the required partitions of the table are queried, which reduces the time taken by the query to yield the result.

3. How many types of joins are present in Hadoop and when to use them?

In Hadoop, there are two types of joins: the map side join and the reduce side join.

Map Side Join:

Joining datasets in the map phase is called a map side join. It is preferred when you need to join one larger dataset with one smaller dataset. Map side joins are faster because the join is completed in the map phase, so the joined data does not have to go through the sort and shuffle phase. They rely on a technique called Distributed Cache, which ships the smaller dataset to every node that runs a map task so that each mapper can load it into memory. The smaller dataset must therefore be small enough to fit in the memory available to each map task.

Reduce Side Join:

Joining datasets in the reduce phase is called a reduce side join. When both datasets are large, we use a reduce side join. Reduce side joins are less efficient than map side joins because the datasets have to go through the sort and shuffle phase.
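To make the map side join idea concrete, here is a minimal sketch in plain Java. It is a model of the technique rather than Hadoop API code: the class name MapSideJoinSketch, the "id,value" line layout, and the inner-join behaviour are all assumptions made for illustration. The small dataset plays the role of the file shipped through the Distributed Cache, and the loop over the large dataset plays the role of the map() calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of a map side join; hypothetical names, not Hadoop API code. */
final class MapSideJoinSketch {

    /**
     * Joins each "id,value" line of the large dataset against an in-memory
     * lookup table built from the small dataset. Rows without a match are dropped.
     * Assumes every line contains at least one comma.
     */
    static List<String> join(List<String> smallDataset, List<String> largeDataset) {
        // Stands in for loading the cached small file once, before any map() call.
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallDataset) {
            String[] parts = line.split(",", 2);
            lookup.put(parts[0], parts[1]);
        }

        // Stands in for map(): a single pass over the large dataset, no shuffle.
        List<String> joined = new ArrayList<>();
        for (String line : largeDataset) {
            String[] parts = line.split(",", 2);
            String match = lookup.get(parts[0]);
            if (match != null) {
                joined.add(parts[0] + "," + parts[1] + "," + match);
            }
        }
        return joined;
    }
}
```

Because the lookup table lives entirely in memory, this approach only works while the small dataset fits in each mapper's heap. Beyond that size, a reduce side join is the usual fallback.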

4.How to optimize Hive queries?

Refer to the post below for tips on optimizing your Hive queries:

Optimizing Hive Queries

5.What are combiners in Hadoop?

Combiner class can summarize the map output records with the same key, and the output (key value collection) of the combiner will be sent over the network to the actual Reducer task as an input. This will help to cut down the amount of data shuffled between the mappers and the reducers.

For more information, please refer to this post – Combiners in Hadoop.

6.What is the difference between Combiner and in-mapper combiner in Hadoop?

You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it’s shuffled across the network to the various cluster Reducers.

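The in-mapper combiner takes this optimization a bit further: the aggregations are not even written to local disk. They occur in memory in the Mapper itself.

The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods in org.apache.hadoop.mapreduce.Mapper. The in-memory structure is created in setup(), updated on every map() call instead of emitting a record, and written out once in cleanup().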
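The in-mapper combining pattern can be sketched in plain Java as follows. This is a model under stated assumptions, not a real Mapper subclass: the class InMapperCombinerSketch is hypothetical, its three methods only mirror the roles of setup(), map() and cleanup(), and word count is used as the example aggregation.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of in-mapper combining for word count; hypothetical class. */
final class InMapperCombinerSketch {
    private Map<String, Integer> counts;

    /** Mirrors the role of Mapper.setup(): runs once before any input record. */
    void setup() {
        counts = new HashMap<>();
    }

    /** Mirrors the role of Mapper.map(): aggregates in memory instead of emitting (word, 1). */
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Mirrors the role of Mapper.cleanup(): hands back one total per distinct word. */
    Map<String, Integer> cleanup() {
        return new HashMap<>(counts);
    }
}
```

In this model, eight input words with five distinct values produce five output records instead of eight. The trade-off is memory: the map of partial results has to fit in the mapper's heap, so a real implementation may need to flush it periodically.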

7. Consider this scenario: if I have a folder containing n files (datasets) and I want to apply the same mapper and reducer logic to all of them, what should I do?

Pointing the job’s input path at the folder makes the job process every file in it with the same mapper and reducer. The default TextInputFormat passes each line of a file to the mapper as one record. If you instead want each whole file to reach the mapper as a single record, you need a whole-file input format, commonly implemented as a custom WholeFileInputFormat in MapReduce. It presents the entire file as one input record to the mapper.

8. Suppose you have 50 mappers and 1 reducer. How will your cluster perform? And if the job takes a lot of time, how can you reduce it?

If there are 50 mappers and 1 reducer, the whole program will take a lot of time to run, because the single reducer needs to collect the output of all the mappers before it can process it. We can do two things about this:

A. If possible, you can add a combiner so that the amount of output coming from the mapper will be reduced and the load on the reducer also will get reduced.

B. You can enable map output compression so that the size of the data going to the reducer will be less.

9. Explain some string functions in Hive

String functions perform operations on String data type columns. The various string functions are as follows:

ascii(string A) – Returns the numeric ASCII value of the first character of the string.

concat(string A, string B, ...) – Returns the string that results from concatenating the strings passed in as arguments.

substr(string A, int start) – Returns the substring of A starting from the given start position until the end of the string.

upper(string A) – Returns the string converted to upper case.

lower(string A) – Returns the string converted to lower case.

trim(String A) – Returns the string trimming the spaces from both the ends.

10.Can you create a table in Hive, which can skip the header lines from the dataset?

Yes, we can include the skip.header.line.count property inside the tblproperties while creating the table.

For example:

CREATE TABLE Employee (Emp_Number Int, Emp_Name String, Emp_sal Int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' TBLPROPERTIES ("skip.header.line.count"="1");

11.What are the binary storage formats available in Hive?

The default storage format in Hive is plain text (TextFile), but Hive also supports binary file formats such as Sequence Files, Avro data files, RCFiles, ORC files, and Parquet files.

You can refer to the posts below to learn more about these file formats.

File Formats in Hive

Introduction to Avro in Hive

Parquet File Format in Hadoop

12.Can you use multiple Hive instances at the same time? If yes, how can you do that?

By default, Hive comes with an embedded Derby database as its metastore, which allows only one active session at a time. So, you cannot use multiple Hive instances with the Derby database. However, if you configure the Hive metastore to use MySQL, then you can use multiple Hive instances at the same time.

You can refer to the post – MySQL Metastore Integration with Hive, to know how to configure the Hive metastore with MySQL.

13.Is there any testing available in Pig? If yes, how can you do it?

Yes, we can do unit testing for Pig scripts.

You can refer to the post – Unit testing Pig scripts, to know how Pig scripts can be tested.

14. Can you run Pig scripts using Java? If yes, how can you do it?

Yes, it is possible to embed Pig scripts inside Java code.

You can refer to the post – Embedding Pig in Java, to know how Pig scripts can be run using Java.

15.Can you automate a Flume job by running it up to a stipulated time? If yes, how can you do that?

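A Flume job can be run for a stipulated time using a Java program. For this, Flume provides an Application class that can be started from Java code.

import org.apache.flume.node.Application;
import org.apache.log4j.BasicConfigurator;

public class flume {
    public static void main(String[] args)
    {
        // Set before the agent starts so that Hadoop classes loaded by it can see the value.
        System.setProperty("hadoop.home.dir", "/");
        // Send log4j output to the console.
        BasicConfigurator.configure();
        // Same arguments as on the Flume command line: agent name and configuration file.
        String[] args1 = new String[] { "agent","-nTwitterAgent","-fflume.conf" };
        Application.main(args1);
   }
}

The code above runs the Flume configuration file from a Java program. To run the agent only for a stipulated time, we can keep this code inside a thread and stop it when the time is up.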
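One way to bound the running time of a piece of code from Java is to run it on its own thread and cancel it at a deadline. The sketch below shows only that timing pattern with the standard java.util.concurrent classes. TimedRunner and runFor are hypothetical names, and the sketch assumes the task responds to thread interruption. A real Flume agent may keep running in its own background threads and need a separate shutdown step, which is not shown here.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Runs a task on its own thread and interrupts it after a fixed time budget. */
final class TimedRunner {

    /**
     * Returns true if the task was cancelled before it finished,
     * false if it completed on its own within the time budget.
     */
    static boolean runFor(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false; // finished before the deadline
        } catch (TimeoutException e) {
            future.cancel(true); // delivers an interrupt to the worker thread
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return true;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow(); // release the worker thread in every case
        }
    }
}
```

A call such as TimedRunner.runFor(task, 60_000) would give the task at most one minute. Tasks that ignore interruption would keep running past the deadline, so they need their own stop signal.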

16.What is meant by Safe Mode and when does the NameNode go into safe mode?

A. Safe mode is a state in which you cannot write data to HDFS. HDFS is in read-only mode: you can read data, but you cannot write to HDFS. During this time, the NameNode loads the file system state from the fsimage and the edit logs into memory.

Every time you start the HDFS daemons, the NameNode goes into safe mode, waits for the block reports, and checks whether the data nodes are working.
If this process is interrupted at the beginning by some internal or external process, the NameNode stays in safe mode. You then need to leave safe mode explicitly by using the command:

hdfs dfsadmin -safemode leave

17.Where does the metadata of NameNode reside? Is it in-memory or on the disk?

A. The answer is both. The metadata is persisted on disk, but once you start the Hadoop cluster, the NameNode loads it into memory for faster access. Updates that happen after the cluster has started are applied to the in-memory copy and persisted back to disk.

18. My Hadoop cluster is running fine, but I have accidentally deleted the NameNode metadata directory. What will happen now? Will all my data be lost and the existing processes be disrupted?

A. No! Everything will continue normally until you shut down your cluster. Once the Hadoop cluster has started, the NameNode serves its metadata from memory rather than reading it back from the local directory. So all your data remains accessible, and the existing processes are unaffected, until you shut down the cluster.

19.When does the reducer phase take place in a MapReduce job?

A. The reduce phase has 3 steps: Shuffle, Sort, and Reduce. The shuffle step is where the Reducer collects data from each Mapper. This can happen while Mappers are still generating data, since it is only a data transfer. On the other hand, sort and reduce can only start once all the Mappers are done. You can tell which step MapReduce is in by looking at the reducer completion percentage:

0-33% means it’s doing the shuffle
34-66% is sort
67-100% is reduce

This is why your reducers will sometimes seem “stuck” at 33%: they are waiting for the Mappers to finish.
Reducers start shuffling based on a threshold, expressed as the percentage of Mappers that have finished. You can change this parameter (mapreduce.job.reduce.slowstart.completedmaps in Hadoop 2.x) to get reducers to start sooner or later.
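The percentage bands and the start threshold described above can be written down as a small plain-Java sketch. The class and method names are hypothetical, and the threshold is modelled simply as a fraction of completed map tasks, which is an assumption about how the setting behaves rather than Hadoop's own code.

```java
/** Plain-Java model of how reducer progress maps to a step; hypothetical class. */
final class ReducerProgressSketch {

    /** Maps a reducer completion percentage (0 to 100) to the step it indicates. */
    static String phaseFor(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        if (percent <= 33) {
            return "shuffle";
        }
        if (percent <= 66) {
            return "sort";
        }
        return "reduce";
    }

    /** True once the fraction of finished mappers reaches the configured threshold. */
    static boolean shuffleMayStart(int completedMaps, int totalMaps, double thresholdFraction) {
        return totalMaps > 0 && (double) completedMaps / totalMaps >= thresholdFraction;
    }
}
```

With 50 mappers and a threshold of 0.05, for example, the shuffle would be allowed to begin once 3 mappers have finished.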
20.How can you chain MapReduce jobs?
A. Not every problem can be solved with a MapReduce program, but fewer still are those that can be solved with a single MapReduce job. Many problems can be solved with MapReduce, by writing several MapReduce steps which run in a series to accomplish a goal:
Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3
You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, then call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs that are necessary to arrive at a complete solution to the problem.
Many problems, which at first seem impossible in MapReduce, can be accomplished by dividing one job into two or more.
Hadoop provides another mechanism for managing batches of jobs with dependencies between jobs. Rather than submitting a JobConf to the JobClient’s runJob() or submitJob() methods, org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; a Job takes a JobConf object as its constructor argument. Jobs can depend on one another through the use of the addDependingJob() method. The code:

 x.addDependingJob(y)

says that job x cannot start until job y has successfully completed.
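The effect of declaring dependencies between jobs can be modelled in plain Java as below. DependentJob and runOrder are hypothetical names that only borrow the addDependingJob() idea. The sketch does not use the Hadoop jobcontrol classes, and it assumes every dependency is itself in the list passed to runOrder.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model of jobs with dependencies; hypothetical, not the Hadoop API. */
final class DependentJob {
    final String name;
    private final List<DependentJob> dependencies = new ArrayList<>();

    DependentJob(String name) {
        this.name = name;
    }

    /** Declares that this job may not start until the other job has completed. */
    void addDependingJob(DependentJob other) {
        dependencies.add(other);
    }

    /** Returns job names in an order where every job comes after its dependencies. */
    static List<String> runOrder(List<DependentJob> jobs) {
        Set<DependentJob> done = new LinkedHashSet<>();
        while (done.size() < jobs.size()) {
            boolean progressed = false;
            for (DependentJob job : jobs) {
                if (!done.contains(job) && done.containsAll(job.dependencies)) {
                    done.add(job); // "run" the job once all its dependencies are done
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("Cyclic or missing dependency");
            }
        }
        List<String> order = new ArrayList<>();
        for (DependentJob job : done) {
            order.add(job.name);
        }
        return order;
    }
}
```

Declaring job3 to depend on job2, and job2 on job1, yields the order job1, job2, job3 regardless of the order in which the jobs were listed.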

21.Are there any other storage systems that can be used with MapReduce other than HDFS?

A. Yes, Hadoop supports many other compatible file systems. From Hadoop 2.x, you can use Amazon’s S3, and from Hadoop 3.x, you can use Microsoft’s Azure Data Lake Storage or Azure Blob Storage. MongoDB has also released a Hadoop-MongoDB connector so that MapReduce jobs can work with data in MongoDB.

You can refer to this blog to learn how to process data in MongoDB using MapReduce.
22.How can you find the top 10 records based on values using map reduce?
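A. There is a design pattern called Top K records in Hadoop, which we can use to find the top 10 records. Each mapper keeps only its own top K records, and a single reducer then merges these local lists into the final top K.
You can refer to this blog to learn how top K records are implemented in MapReduce.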
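The core of a top K computation can be sketched in plain Java with a bounded min-heap. TopKSketch is a hypothetical helper, not part of Hadoop. In a MapReduce job, the assumption is that each mapper would apply the same logic to its own records and a single reducer would apply it once more to the mappers' combined results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch of top K selection; hypothetical helper, not Hadoop code. */
final class TopKSketch {

    /** Returns the k largest values in descending order, using a min-heap of size k. */
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int value : values) {
            heap.offer(value);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest value seen so far
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

Because each mapper forwards at most K records, the single reducer receives at most K records per mapper, however large the input is.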
23.How can I get the output of a Hive query into a .csv file?

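A. You can save the output of a Hive query into files in a local directory by using the INSERT OVERWRITE statement. Adding ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' makes Hive write comma-separated values instead of using its default field delimiter:
INSERT OVERWRITE LOCAL DIRECTORY '/home/acdgild/hiveql_output' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select * from table;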

24.Explain Cluster By vs. Order By vs. Sort By in Hive.
CLUSTER BY x is shorthand for DISTRIBUTE BY x followed by SORT BY x. Only ORDER BY guarantees a global ordering.
The longer version:

  • ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as the output.
  • SORT BY x: orders data at each of the N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
  • DISTRIBUTE BY x: ensures that all rows with the same value of x go to the same reducer, but does not sort the output of each reducer. You end up with N or more unsorted files, and no value of x appears in more than one of them.
  • CLUSTER BY x: distributes rows as DISTRIBUTE BY x does, then sorts by x at each reducer. It is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files that share no values of x. Because rows are assigned to reducers by hashing x rather than by range, the files are not globally ordered when concatenated.

Hence, CLUSTER BY scales better than ORDER BY, but at the cost of the global ordering.
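The difference between distributing and sorting can be checked with a small plain-Java model. ClusterBySketch is hypothetical, and it assumes rows are routed to reducers by the hash of the key modulo the number of reducers. This is a stand-in for Hive's actual partitioning, which may differ in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java model of DISTRIBUTE BY and CLUSTER BY; hypothetical, not Hive code. */
final class ClusterBySketch {

    /** DISTRIBUTE BY: route each key to a reducer by hash, so equal keys stay together. */
    static List<List<Integer>> distributeBy(List<Integer> keys, int reducers) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Integer key : keys) {
            buckets.get(Math.floorMod(key.hashCode(), reducers)).add(key);
        }
        return buckets;
    }

    /** CLUSTER BY: DISTRIBUTE BY followed by a sort inside each reducer. */
    static List<List<Integer>> clusterBy(List<Integer> keys, int reducers) {
        List<List<Integer>> buckets = distributeBy(keys, reducers);
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);
        }
        return buckets;
    }
}
```

Under this assumption, clustering the keys 5, 2, 6, 1, 4, 3 across two reducers gives one output with 2, 4, 6 and another with 1, 3, 5. Each output is sorted and they share no keys, yet reading them one after the other does not produce a single ordered sequence.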
25.How can you run a Hive query in the Debug mode?

A. Hive queries can be run in debug mode by starting the Hive console with the root logger set to DEBUG, as follows:
hive --hiveconf hive.root.logger=DEBUG,console

26. How many kinds of functions are available in Pig UDFs?
There are 3 kinds of functions available in Pig UDFs:
1. Eval function: All kinds of evaluation can be done with Eval functions. An Eval function takes one record as input, evaluates it, and returns one result.
2. Aggregate function: These are Eval functions that work on a group of data; they take a bag as the input and return a scalar value as the output.
3. Filter function: These are also Eval functions, and they return a Boolean value as the result: true if the record satisfies the condition, and false otherwise.
27.How can you load a file into the HBase?

A. One way to load a file is bulk loading with MapReduce.

Another way is to use Hive with the help of the Hive-HBase storage handler. You first load the data into Hive, and it is then reflected in HBase. You can refer to our blog on HBase Write Using Hive.
You can also write a shell script that runs repeatedly until all the lines are written into the HBase table.
28. How can you transfer data present in MySQL to HBase?
One way to migrate data from MySQL to HBase is by using Sqoop.
You can also migrate data from MySQL to HBase using MapReduce.
29. Is this piece of code correct? If not, explain where it went wrong.

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static Text one = new Text(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

No, there is a fault in this code. The output value type declared in the Mapper parameters is IntWritable, but the value written to the context, one, is of type Text.
The data types of the output key and the output value declared in the Mapper parameters must match the data types of the key and the value written to the context.
So, either change the output value type in the Mapper class parameters to Text, or change one to an IntWritable, for example new IntWritable(1). Note that Text has no constructor that takes an int, so if Text is kept, new Text(1) also has to become new Text("1").
30. HDFS works on the principle of ‘Write Once, Read Many Times.’ So, by this logic, can you overwrite a file that is already present in HDFS? If yes, explain how that can be done.

A. Yes, we can overwrite a file that is already present in HDFS. Using the -f option with the put or copyFromLocal command, we can overwrite a file in HDFS. The existing file is replaced as a whole rather than modified in place:
    hadoop fs -put -f <<local path>> <<hdfs path>>

Keep visiting our site www.acadgild.com for more updates on Big data and other technologies.
