In this first part of our Hadoop interview questions series, we discuss various questions related to the Big Data Hadoop ecosystem.
We have linked relevant posts to most of the questions for 2017 interviews, which you can refer to for practical implementation.
1. What are the different types of file formats in Hive?
Ans. Hive can handle several file formats, including TEXTFILE, SEQUENCEFILE, RCFILE, ORC, Parquet, and Avro.
2. Explain Indexing in Hive.
Ans. An index acts as a reference to the records. Instead of scanning all the records, Hive can consult the index to locate a particular record, which speeds up searching with minimal overhead.
3. Explain about Avro File format in Hadoop.
Ans. Avro is one of the preferred data serialization systems because of its language neutrality.
Because Hadoop's Writable classes lack language portability, Avro becomes a natural choice: it produces data that can be further processed by multiple languages.
It is widely used for serializing data in Hadoop.
It uses JSON for defining data types and protocols, and it serializes data in a compact binary format.
Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
In short, Avro can be described as a file format used with Hadoop to store data according to a predefined schema. This file format can be used with Hadoop tools such as Pig and Hive.
4. Does Hive support transactions?
Ans. Yes, Hive supports transactions from Hive 0.13 onward, with some restrictions (for example, fully transactional tables must be stored in the ORC file format).
5. Explain about Top-k Map-Reduce design pattern.
Ans. The Top-k MapReduce design pattern is used to find the top k records in a given dataset.
This design pattern achieves this by defining a ranking function or comparison function between two records that determines whether one is higher than the other. We can apply this pattern to use MapReduce to find the records with the highest value across the entire data set.
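The pattern can be illustrated with a minimal Python sketch. It is not Hadoop code: the function names `map_top_k` and `reduce_top_k`, the sample records, and the use of `heapq` are assumptions made for illustration. Each "mapper" keeps only its local top k, and a single "reducer" merges those partial results into the global top k.

```python
import heapq

def map_top_k(records, k, rank):
    """Mapper side: keep only the k highest-ranked records of one input split."""
    return heapq.nlargest(k, records, key=rank)

def reduce_top_k(partial_results, k, rank):
    """Reducer side: merge every mapper's local top k into the global top k."""
    merged = [rec for part in partial_results for rec in part]
    return heapq.nlargest(k, merged, key=rank)

# Hypothetical data: (user, score) records spread over two input splits.
splits = [
    [("a", 10), ("b", 42), ("c", 7)],
    [("d", 99), ("e", 3), ("f", 50)],
]

def score(rec):
    """Ranking function: a record ranks higher when its score is larger."""
    return rec[1]

local = [map_top_k(split, 2, score) for split in splits]
top2 = reduce_top_k(local, 2, score)
print(top2)  # [('d', 99), ('f', 50)]
```

Because every mapper emits at most k records, the reducer only has to look at k records per split rather than the whole dataset.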
6. Explain about Hive Storage Handlers.
Ans. Storage Handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to identify an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage Handlers are not limited to NoSQL databases; a storage handler can be designed for several different kinds of data stores.
7. Explain partitioning in Hive.
Ans. Table partitioning means dividing table data into parts based on the values of particular columns, so that input records are segregated into different directories based on those column values.
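To make the directory idea concrete, here is a small Python sketch, not Hive itself. The function `partition_paths`, the table directory `/warehouse/users`, and the sample rows are hypothetical; the only thing taken from Hive is the `column=value` naming of partition directories.

```python
from collections import defaultdict

def partition_paths(rows, table_dir, column):
    """Group rows by a partition column into Hive-style 'column=value' directories."""
    layout = defaultdict(list)
    for row in rows:
        layout[f"{table_dir}/{column}={row[column]}"].append(row)
    return dict(layout)

# Hypothetical rows and table location, for illustration only.
rows = [
    {"name": "Asha", "country": "IN"},
    {"name": "Bob", "country": "US"},
    {"name": "Chen", "country": "IN"},
]
layout = partition_paths(rows, "/warehouse/users", "country")
for path, part in sorted(layout.items()):
    print(path, len(part))
```

A query that filters on the partition column only needs to read the matching directory instead of scanning the whole table.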
8. What is the use of Impala?
Ans. Cloudera’s Impala is a massively parallel processing (MPP) SQL query engine that allows users to execute low-latency SQL queries on data stored in HDFS and HBase, without any data transformation or movement.
The main goal of Impala is to make SQL-on-Hadoop operations fast and efficient, to appeal to new categories of users and open up Hadoop to new types of use cases. Impala makes SQL queries simple enough to be accessible to analysts who are familiar with SQL and to those using business intelligence tools that run on Hadoop.
9. Explain how to choose between Managed & External tables in Hive.
Ans. Hive tables can be created as EXTERNAL or INTERNAL (managed). This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
- The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
- Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Use INTERNAL tables when:
- The data is temporary.
- You want Hive to completely manage the life cycle of the table and data.
10. What are the different methods in the Mapper class, and in what order are they invoked?
Ans. There are 3 methods in Mapper that you typically override (a fourth method, run(), drives them):
*map() –> executes once for each input record (each line of the input file, in the text input format)
*setup() –> executes once per map task (that is, once per input split), before any record is processed
*cleanup() –> executes once per map task, after the last record has been processed
Order of invocation:
setup() – 1
map() – 2
cleanup() – 3
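The order of invocation can be mimicked with a toy Python class. This is only a sketch of the lifecycle, not the Hadoop API: the class name `LoggingMapper` and the sample records are hypothetical, and the `run` method is modeled on how the framework drives a map task.

```python
class LoggingMapper:
    """Toy stand-in for a Hadoop Mapper that records the order of calls."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")

    def map(self, key, value):
        self.calls.append(f"map({key})")

    def cleanup(self):
        self.calls.append("cleanup")

    def run(self, records):
        # Driver loop: setup once, map once per record, cleanup once at the end.
        self.setup()
        try:
            for key, value in records:
                self.map(key, value)
        finally:
            self.cleanup()

mapper = LoggingMapper()
# Hypothetical records: (byte offset, line) pairs from one input split.
mapper.run([(0, "first line"), (11, "second line")])
print(mapper.calls)  # ['setup', 'map(0)', 'map(11)', 'cleanup']
```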
11. What is the purpose of the RecordReader in Hadoop?
Ans. In MapReduce, data is divided into input splits. The RecordReader typically converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper & Reducer tasks to process. It thus takes responsibility for handling record boundaries and presenting the tasks with keys and values.
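As a rough illustration of turning bytes into records, the Python sketch below yields (byte offset, line) pairs, which is the key/value shape used for plain text input. The function `line_records` is hypothetical, and the sketch ignores the harder part a real reader must handle: records that straddle split boundaries.

```python
def line_records(split_bytes, start_offset=0):
    """Yield (byte offset, line text) pairs from the raw bytes of one split."""
    offset = start_offset
    for raw in split_bytes.splitlines(keepends=True):
        # Key: offset of the line within the file. Value: line without its terminator.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

records = list(line_records(b"alpha\nbeta\ngamma\n"))
print(records)  # [(0, 'alpha'), (6, 'beta'), (11, 'gamma')]
```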
12. What details are present in the FsImage?
Ans. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system.
The NameNode keeps an image of the entire file system namespace and file block map in memory. This key metadata item is designed to be compact, such that a NameNode with 4GB of RAM is sufficient to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk.
It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In early Hadoop releases, a checkpoint occurred only when the NameNode started up; in Hadoop 2.x, checkpoints are also taken periodically by the Secondary NameNode (or by the Standby NameNode in a high-availability setup).
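The checkpoint idea, replaying a log of edits onto a saved image and then starting a fresh log, can be sketched in Python as follows. The data model is a deliberate simplification and an assumption: the image is a dict from path to block IDs, and the operation names `create`, `delete`, and `rename` are hypothetical, not the real EditLog opcodes.

```python
def checkpoint(fsimage, edit_log):
    """Replay edit-log transactions onto a copy of the namespace image."""
    new_image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = args[0]          # args[0]: list of block IDs
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[args[0]] = new_image.pop(path)  # args[0]: new path
    # Return the merged image together with a truncated (empty) edit log.
    return new_image, []

# Hypothetical image and edits, for illustration only.
fsimage = {"/a": ["blk_1"]}
edits = [
    ("create", "/b", ["blk_2"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, new_log = checkpoint(fsimage, edits)
print(new_image)  # {'/c': ['blk_1']}
```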
13. Why do we need bucketing in Hive?
Ans. Bucketing is a simple idea. You create multiple buckets, read each record, and place it into one of the buckets based on some logic, usually a hashing algorithm. This allows you to organize your data by decomposing it into multiple parts. You might wonder why we should bother with bucketing if we can achieve the same thing using partitioning. There is one difference. When we partition, we create a partition for each unique value of the column, which may give rise to a situation where you need to create thousands of tiny partitions. With bucketing, you choose a fixed number of buckets and decompose your data into them. In Hive, a partition is a directory but a bucket is a file.
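A minimal Python sketch of the "hash, then take the remainder" logic follows. The helper names `bucket_for` and `bucketize` are hypothetical, and `zlib.crc32` is only a stand-in hash chosen so the result is deterministic; Hive uses its own hash function.

```python
import zlib

def bucket_for(key, num_buckets):
    """Pick a bucket by hashing the key and taking it modulo the bucket count."""
    # crc32 is a stand-in hash for illustration; it is not what Hive uses.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, column, num_buckets):
    """Distribute rows into a fixed number of buckets based on one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets

# 1000 distinct ids end up in exactly 4 buckets, not 1000 partitions.
buckets = bucketize([{"id": i} for i in range(1000)], "id", 4)
print([len(b) for b in buckets])
```

However many distinct values the column has, the number of buckets stays at the value you chose.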
14. What is a Sequence File in Hadoop?
Ans. In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, the Hadoop-specific SequenceFile format stores serialized key/value pairs.
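15. How do you copy files from one cluster to another cluster?
Ans. With the help of the DistCp command, we can copy files from one cluster to another.
The most common invocation of DistCp is an inter-cluster copy (the uppercase names are placeholders for your own NameNode addresses and paths):
bash$ hadoop distcp hdfs://SOURCE_NAMENODE/SOURCE_PATH hdfs://DEST_NAMENODE/DEST_PATH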
16. I have Hadoop 2.x with the block size configured to 128MB. Now I have changed the block size to 150MB. Will this change affect the files that are already present?
Ans. No, this change will not affect the existing files; they keep their block size of 128MB. Once you restart the cluster after configuring the new block size, the change comes into effect, and all new files that you copy into HDFS will have a block size of 150MB.
17. Currently, I am using Hadoop 2.x, but I want to upgrade to Hadoop 3.x. How can I upgrade without losing my data in HDFS?
Ans. You can upgrade or downgrade Hadoop versions without losing data as long as you keep the current directory and VERSION file of the NameNode and DataNodes. While installing, you just need to configure these directories as the NameNode’s and DataNodes’ metadata directories.
As we are not changing the cluster, Hadoop will recover your data by using the metadata present in those folders.
18. Suppose I am using Hadoop 2.x with a block size of 128MB, and I am writing a 1GB file into the cluster. After 200MB has been written, the process suddenly stops. What do you think will happen now? Will I be able to read the 200MB of data or not?
Ans. You will be able to read only 128MB of data. A client can read only a complete block of data that has been written into HDFS. You will not be able to read the remaining 72MB, as the write to that block was interrupted partway, and the other 824MB of the file was never written into HDFS at all.
While writing the data, HDFS also maintains the replicas simultaneously. If your replication factor is 3, the other 2 replicas are written at the same time.
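The numbers in this answer follow from simple block arithmetic, worked through in the Python sketch below. The function name `write_progress` is hypothetical, and the sketch follows the answer's model that only completely written blocks are readable.

```python
def write_progress(file_mb, written_mb, block_mb):
    """Split an interrupted write into readable, partial, and unwritten megabytes."""
    complete_blocks = written_mb // block_mb
    readable = complete_blocks * block_mb   # data in fully written blocks
    partial = written_mb - readable         # data in the interrupted block
    unwritten = file_mb - written_mb        # data that never reached HDFS
    return readable, partial, unwritten

# 1GB file (1024MB), 200MB written, 128MB blocks.
print(write_progress(1024, 200, 128))  # (128, 72, 824)
```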
19. How can you troubleshoot if either your NameNodes or DataNodes are not running?
Ans. Check the clusterID in the NameNode’s VERSION file and in the DataNode’s VERSION file. Both clusterIDs should match, or else there will be no synchronization between the NameNodes and DataNodes. If the clusterIDs differ, you need to make them the same.
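The comparison can be scripted. The Python sketch below assumes the VERSION file is a plain `key=value` properties file containing a `clusterID` entry; the function `read_cluster_id` and the sample file contents are made up for illustration, and in practice you would read the text from the VERSION files in your own NameNode and DataNode directories.

```python
def read_cluster_id(version_text):
    """Extract clusterID from the text of a VERSION file (key=value lines)."""
    for line in version_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and malformed lines
        key, value = line.split("=", 1)
        if key.strip() == "clusterID":
            return value.strip()
    return None

# Hypothetical file contents, for illustration only.
namenode_version = "#Sample\nnamespaceID=1\nclusterID=CID-1111\n"
datanode_version = "#Sample\nstorageID=DS-1\nclusterID=CID-2222\n"

nn_id = read_cluster_id(namenode_version)
dn_id = read_cluster_id(datanode_version)
print("match" if nn_id == dn_id else f"mismatch: {nn_id} vs {dn_id}")
```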
20. What are counters in MapReduce?
Ans. A counter is generally used to keep track of the occurrences of an event. In the Hadoop framework, whenever a MapReduce job is executed, the framework initiates counters to keep track of job statistics such as the number of rows read and the number of rows written as output.
These are the built-in counters of the Hadoop framework. Additionally, we can create and use our own custom counters.
Typically, Hadoop counters track things such as:
- The number of Mappers and Reducers launched
- The number of bytes that get read and written
- The number of tasks that get launched and successfully run
- Whether the amount of CPU and memory consumed is appropriate for the job and the cluster nodes
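To show how built-in and custom counters sit side by side, here is a small Python simulation of a word-count map phase. It is not the Hadoop counters API: `run_map_phase` and the custom counter name `EMPTY_LINES` are hypothetical, while `MAP_INPUT_RECORDS` and `MAP_OUTPUT_RECORDS` borrow the names of two built-in task counters.

```python
from collections import Counter

def run_map_phase(lines):
    """Simulate a word-count map phase that updates counters as it goes."""
    counters = Counter()
    output = []
    for line in lines:
        counters["MAP_INPUT_RECORDS"] += 1
        if not line.strip():
            counters["EMPTY_LINES"] += 1  # custom counter (hypothetical name)
            continue
        for word in line.split():
            output.append((word, 1))
            counters["MAP_OUTPUT_RECORDS"] += 1
    return output, counters

output, counters = run_map_phase(["a b", "", "c"])
print(dict(counters))
```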
You can refer to this blog for more information on counters.
21. What is the difference between an HDFS block and an InputSplit, and how is the input split prepared in Hadoop?
Ans. The HDFS block is a physical representation of data, while the InputSplit is a logical representation of the data. The InputSplit refers to the HDFS block location.
InputFormat is responsible for providing the splits.
In general, if you have n nodes, HDFS distributes the blocks of the file across these nodes. When you start a job, one mapper is launched per input split, which by default corresponds to one HDFS block. Wherever possible, the mapper on a machine processes the part of the data that is stored on that node. This is known as data locality.
To cut a long story short: upload the data to HDFS and start an MR job. Hadoop will take care of the optimized execution.
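As an illustration of how logical splits relate to physical blocks, the Python sketch below cuts a file into (offset, length) ranges. It assumes the split size equals the block size and leaves out refinements of the real implementation, such as letting the last split run slightly over. The function name `input_splits` is hypothetical.

```python
def input_splits(file_size, split_size):
    """Return (offset, length) pairs covering the file in split_size chunks."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
splits = input_splits(300 * MB, 128 * MB)
# One mapper would be launched per split.
print(len(splits), [length // MB for _, length in splits])  # 3 [128, 128, 44]
```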
22. Can I run a Hive query directly from the terminal without logging into the Hive shell?
Ans. Yes, by using the hive -e option, we can run any kind of Hive query directly from the terminal without logging into the Hive shell.
Here is an example:
hive -e 'select * from table'
You can also save the output into a file by using the shell’s ‘>’ redirection operator, as shown below:
hive -e 'select * from table' > /home/acdgild/hiveql_output.tsv
23. How can you select the current date and time using HiveQL?
Ans. Any of the following queries can be used:
SELECT from_unixtime(unix_timestamp()); -- selects the current timestamp
SELECT CURRENT_DATE; -- selects the current date
SELECT CURRENT_TIMESTAMP; -- selects the current timestamp
24. How can you skip the first line of the data set while loading it into a Hive table?
Ans. While creating the Hive table, we can specify in the tblproperties that the first row should be skipped, so that only the rest of the dataset is loaded. Here is an example:
create external table testtable (name string, message string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/testtable'
tblproperties ("skip.header.line.count"="1");
25. Explain the difference between COLLECT_LIST & COLLECT_SET and say where exactly they can be used in Hive.
Ans. When you want to collect an array of values for a key, you can use the COLLECT_LIST and COLLECT_SET functions.
COLLECT_LIST includes duplicate values for a key in the list. COLLECT_SET keeps only the unique values for a key in the list.
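The difference is easy to see on a tiny example. The Python functions below only imitate the two aggregates on hypothetical (key, value) rows; they are not Hive code, and Hive does not promise any particular order of elements in the resulting arrays.

```python
from collections import defaultdict

def collect_list(rows):
    """Imitate COLLECT_LIST: gather every value per key, duplicates included."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)

def collect_set(rows):
    """Imitate COLLECT_SET: gather only the distinct values per key."""
    groups = defaultdict(list)
    for key, value in rows:
        if value not in groups[key]:
            groups[key].append(value)
    return dict(groups)

# Hypothetical (user, item) rows, for illustration only.
rows = [("u1", "pen"), ("u1", "pen"), ("u1", "ink"), ("u2", "pad")]
print(collect_list(rows))  # {'u1': ['pen', 'pen', 'ink'], 'u2': ['pad']}
print(collect_set(rows))   # {'u1': ['pen', 'ink'], 'u2': ['pad']}
```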
26. How can you store the output of a Pig relation directly into Hive?
Ans. Using the HCatStorer function of HCatalog, you can store the output of a Pig relation directly into a Hive table.
Similarly, you can load the data of a Hive table into a Pig relation for pre-processing using the HCatLoader function of HCatalog.
You can refer to our blog on Loading and Storing Hive Data into Pig for some hands-on knowledge.
27. Can you process the data present in MongoDB using Pig?
Ans. Yes, you can process the data present in MongoDB using Pig with the help of the MongoDB Pig connector.
You can refer to our blog on processing data stored in MongoDB using Pig.
We will be coming up with more questions and detailed explanations in the next posts.
28. Can you visualize the outcomes of a Pig relation?
Ans. Yes! Zeppelin provides a simple way to visualize the outcome of a Pig relation. Support for visualizing Pig output was added in Zeppelin 0.7.0.
You can refer to our blog on Integrating Pig with Zeppelin for more information.
Keep visiting our website acadgild.com for more blogs and posts on Trending Big Data Topics.