Check out the top 12 Apache Pig interview questions that will help you clear your Hadoop interview.
Top 12 Pig Interview Questions & Answers
1. What is PigStorage?
A. PigStorage is the default load function in Pig. Whenever you want to load data from a file system into Pig, you can use PigStorage. While loading data with PigStorage, you can also specify the delimiter of the data (how the fields in a record are separated), and you can specify the schema of the data along with the type of each field.
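For example, here is a minimal sketch of loading a comma-delimited file with an explicit schema (the path, field names, and types are illustrative):
-- load a comma-delimited file; the delimiter is passed to PigStorage
employees = LOAD '/user/hadoop/employees.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:double);
DUMP employees;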
2. While writing an eval UDF, which method has to be overridden?
A. While writing a UDF in Pig, you have to override the exec() method, but the base class can differ: while writing a filter UDF you extend FilterFunc, and for an eval UDF you extend EvalFunc. EvalFunc is parameterized, so you must also provide the return type.
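Once such a UDF is compiled into a jar, it can be registered and invoked from Pig Latin. A minimal sketch, assuming a hypothetical jar myudfs.jar containing an EvalFunc subclass com.example.Upper:
-- register the hypothetical jar and alias the UDF class
REGISTER myudfs.jar;
DEFINE UPPER com.example.Upper();
names = LOAD 'names.txt' AS (name:chararray);
-- the UDF's exec() method is called once per record
upper_names = FOREACH names GENERATE UPPER(name);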
3. What are the different UDFs in Pig?
A. UDFs are classified by the number of records they process at a time and are divided into two types:
- UDFs that take one record at a time, for example, filter and eval functions.
- User Defined Aggregate Functions (UDAFs) that take multiple records at a time, for example, AVG and SUM.
Also, Pig gives you the facility to write your own UDFs for loading and storing data. A usage sketch of the built-in aggregate UDFs follows below.
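As an illustration, the built-in aggregate UDFs SUM and AVG operate on a bag of grouped records (the sales relation and its fields are hypothetical):
sales = LOAD 'sales.txt' AS (store:chararray, amount:double);
by_store = GROUP sales BY store;
-- SUM and AVG each consume the whole bag of records for a group
totals = FOREACH by_store GENERATE group, SUM(sales.amount) AS total, AVG(sales.amount) AS average;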
4. What optimizations can a developer use during joins?
A. To join a small dataset with a large dataset, use a replicated join. In a replicated join, the small dataset is copied to all the machines where the mappers run, while the large dataset is divided across the nodes. This gives you the advantage of a map-side join.
If your dataset is skewed, i.e., a particular key is repeated many times, then even with a reduce-side join the reducer handling that key will be overloaded and will take a long time. In this case you can go for a skewed join, where the skewed keys are detected by Pig itself.
And if you have datasets whose records are sorted on the join field, you can go for a sorted (merge) join; this also happens in the map phase and is very efficient and fast. All three variants are sketched below.
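Here is how the three join hints look in Pig Latin (the relation and field names are illustrative):
-- replicated join: the small relation, listed last, is shipped to every mapper
j1 = JOIN big BY key, small BY key USING 'replicated';
-- skewed join: Pig samples the left input and spreads hot keys across reducers
j2 = JOIN left_rel BY key, right_rel BY key USING 'skewed';
-- merge join: both inputs must already be sorted on the join field
j3 = JOIN sorted_a BY key, sorted_b BY key USING 'merge';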
5. What is a skewed join?
A. A skewed join is used whenever you want to join on a skewed dataset, i.e., one in which a particular value is repeated many times.
Suppose you have two datasets: the first contains details about a city and the people living in that city, and the second contains details about a city and its country.
A city name will automatically be repeated many times, in proportion to the population of that city, so if you perform the join on the city column, one particular reducer will receive a huge number of values for that city.
In a skewed join, the left input of the join predicate is divided, so even if the data is skewed it is split across different machines, while the matching input on the right side is duplicated and sent to each of those machines. This is how Pig handles skewed joins.
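For the city example above, the join would look like this (the file paths and schemas are illustrative):
persons = LOAD 'persons.txt' AS (name:chararray, city:chararray);
cities = LOAD 'cities.txt' AS (city:chararray, country:chararray);
-- 'skewed' makes Pig sample persons and spread the hot city keys across reducers
joined = JOIN persons BY city, cities BY city USING 'skewed';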
6. What is Flatten?
A. FLATTEN is an operator in Pig that removes a level of nesting. Sometimes we have data in a bag or a tuple and we want to remove the nesting so that the data becomes flat.
A FOREACH with FLATTEN produces a cross product of every record in the bag with all of the other expressions in the GENERATE statement.
For example, consider the below record: Jorge Posada, {(Catcher), (Designated_hitter)}
So when you apply FLATTEN on this bag, you receive a separate record for each tuple in the bag.
Record 1: Jorge Posada, Catcher
Record 2: Jorge Posada, Designated_hitter
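In Pig Latin, the above looks as follows (the file path and schema are illustrative):
players = LOAD 'players.txt' AS (name:chararray, positions:bag{t:(position:chararray)});
-- each tuple in the bag becomes its own output record alongside name
flat = FOREACH players GENERATE name, FLATTEN(positions);
DUMP flat;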
7. Write a word count program in Pig.
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
8. How to access a Hive table inside a Pig program?
A. Using Pig, we can access a Hive table as well as an HBase table.
To access a Hive table, you have to start the Pig shell with HCatalog enabled: pig -useHCatalog
Then, with the help of HCatLoader(), you can load any Hive table into Pig.
A = LOAD 'sample_07' USING org.apache.hive.hcatalog.pig.HCatLoader();
Here, sample_07 is a table in Hive. Below, we filter A to keep the rows having salary >= 4000.
B = FILTER A BY salary >= 4000;
Now, using HCatStorer(), you can store any Pig relation into a Hive table.
STORE B INTO 'HCatalog_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
The schema and the data types of the records are automatically preserved by HCatLoader().
9. How can we access HBase tables from Pig?
A. To access data in HBase, you have to use HBaseStorage(): just as PigStorage() loads data from a file, HBaseStorage() loads data from HBase tables. Here, you have to specify the column families and the column names that you wish to access from HBase. Optionally, you can specify the schema as well.
data = LOAD 'hbase://employee' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('personal:* professional:*', '-loadKey true') AS (id:chararray, personal:map[], professional:map[]);
DUMP data;
To access individual columns, you can use the following code:
result = FOREACH data GENERATE id, personal#'name', professional#'exp';
Similarly, you can also store a Pig relation into HBase. Suppose you want to store a file into HBase; you can use the following code:
T = LOAD '/home/acadgild/hbase/bulk_data.tsv' AS (userid, name, exp);
STORE T INTO 'bulk_pig' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:name cf1:exp');
Here, whenever you store a relation, the first field goes in as the row key and the remaining fields are written to the column family:column names you specify.
10. What are the complex data types in Pig?
A. The following are the complex data types in Pig:
Tuple: A record formed by an ordered set of fields is known as a tuple; the fields can be of any type.
Example: (Raja, 30)
The above is an example record in the form of a tuple. You can access the individual elements of a tuple by index: $0 for the first element and $1 for the second.
Bag: A bag is an unordered set of tuples. In other words, a collection of non-unique tuples is known as a bag.
Example: {(Raja, 30), (Mohammad, 45)}
Map: A map contains a set of key-value pairs, with each key separated from its value by #.
Example: {name#Raja, age#30}
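All three complex types can be declared directly in a LOAD schema; a minimal sketch with a hypothetical input file:
-- t is a tuple, friends is a bag of tuples, info is a map
data = LOAD 'complex.txt' AS (t:tuple(name:chararray, age:int), friends:bag{f:(name:chararray)}, info:map[]);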
11. What are the commands to debug Pig queries?
A. EXPLAIN and ILLUSTRATE are widely used to debug Pig scripts:
- EXPLAIN shows the logical, physical, and MapReduce execution plans that are used to compute the given relation.
Example: EXPLAIN <relation name>
- ILLUSTRATE shows the step-by-step execution of a sequence of statements on a small sample of the data.
Example: ILLUSTRATE <relation name>
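For instance, with a small illustrative script:
A = LOAD 'data.txt' AS (f1:int);
B = FILTER A BY f1 > 10;
-- print the logical, physical, and MapReduce plans for B
EXPLAIN B;
-- run the pipeline on a sampled subset and show every intermediate step
ILLUSTRATE B;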
12. What are macros in Pig?
A. Macros were introduced in later versions of Pig. The main intention behind macros is to make the Pig language modular: just as in other languages we create a function to reuse multiple times, in Pig we can create a macro and run it any number of times.
Suppose you have to count the number of tuples in a relation; you can use the following macro:
DEFINE row_count(X) RETURNS Z {
    Y = GROUP $X ALL;
    $Z = FOREACH Y GENERATE COUNT($X);
};
Z will be returned from the macro.
new_relation = row_count(existing_relation);
In this way, you can create functions in Pig and use them repeatedly in many places.
Hope this post helped you learn some of the important Pig interview questions asked in Hadoop interviews. Enroll for Big Data Hadoop training conducted by Acadgild and become a successful Big Data Developer.