In our previous blog we discussed Replicated Joins in Pig, and in this post we will be discussing merge joins.
In the case of a merge join, both input files are already totally sorted on the join key, so the join operation can be performed in the map phase of the MapReduce job.
This kind of join provides a significant improvement in performance, because the data no longer has to pass through the sort and shuffle phases.
Pig implements this join by selecting the left input as the input file for the map phase of the MapReduce job and treating the right input as a side file.
Let’s walk through the steps below to perform a merge join.
Step 1: We sort the larger data set using the sort command in the Linux terminal.
Step 2: We sort the smaller data set in the same way.
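The two sort steps above can be sketched as shell commands. The unsorted file names, the sample contents, and the choice of the first tab-separated field (the client IP) as the join key are all assumptions; only sorted_nobots_tsv.txt comes from the post.

```shell
# Tiny sample stand-ins for the real data sets (names and contents
# are assumptions made for illustration).
printf '2.2.2.2\tpage2\n1.1.1.1\tpage1\n' > nobots_tsv.txt
printf '2.2.2.2\tUS\n1.1.1.1\tIN\n' > ip_country_tsv.txt

# Sort each file on the join key -- assumed here to be the first
# whitespace-separated field. Merge join requires both inputs to be
# totally sorted on this key.
sort -k1,1 nobots_tsv.txt > sorted_nobots_tsv.txt
sort -k1,1 ip_country_tsv.txt > sorted_ip_country_tsv.txt
```

Note that both files must be sorted with the same ordering (here, the default lexicographic order), or the map-side merge will miss matches.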
Step 3: Copy the sorted data sets to HDFS.
Because we are running Pig in MapReduce mode, the sorted data sets must be copied into HDFS.
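The copy step might look like the following; the target HDFS directory is an assumption, and these commands require a running Hadoop cluster.

```shell
# Copy the sorted files from the local file system into HDFS.
# The directory /user/hadoop/mergejoin is an assumed path.
hadoop fs -mkdir -p /user/hadoop/mergejoin
hadoop fs -put sorted_nobots_tsv.txt /user/hadoop/mergejoin/
hadoop fs -put sorted_ip_country_tsv.txt /user/hadoop/mergejoin/
```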
Step 4: Load the sorted larger data set into the sorted_nobots_weblogs relation.
The sorted file, sorted_nobots_tsv.txt, was copied into HDFS in the previous step.
Step 5: Load the smaller data set into the sorted_ip_country relation.
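In the Grunt shell, steps 4 and 5 could look like this. The relation names come from the post, but the HDFS paths, the schema, and the column names are assumptions; the real files likely have more fields.

```pig
-- Load the sorted weblog data (path and schema are assumptions).
sorted_nobots_weblogs = LOAD '/user/hadoop/mergejoin/sorted_nobots_tsv.txt'
    USING PigStorage('\t') AS (ip:chararray, page:chararray);

-- Load the sorted IP-to-country mapping (path and schema are assumptions).
sorted_ip_country = LOAD '/user/hadoop/mergejoin/sorted_ip_country_tsv.txt'
    USING PigStorage('\t') AS (ip:chararray, country:chararray);
```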
Step 6: Perform the merge join on the two relations.
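The merge join itself uses Pig's USING 'merge' hint; the ip join key is an assumption based on the relation names. The larger, already-sorted relation goes on the left, since the left input drives the map phase and the right input is used as the side file.

```pig
-- Map-side merge join: both relations must already be sorted on ip.
joined = JOIN sorted_nobots_weblogs BY ip, sorted_ip_country BY ip USING 'merge';
```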
Step 7: Display the first 10 records of the result.
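A sketch of the final step, assuming the joined relation from the previous step is named joined:

```pig
-- Keep only the first 10 joined records and print them to the console.
top10 = LIMIT joined 10;
DUMP top10;
```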
We hope this blog helped you understand merge joins in Pig. In our next post, we will be discussing skewed joins.
Keep visiting our website www.acadgild.com for trending blogs and e-books on Big Data and other technologies.