In our previous posts, we discussed replicated joins and merge joins in Pig.
In this post, we continue the discussion by implementing a skewed join.
A skewed join is useful when the underlying data is sufficiently skewed and the user needs control over how reducers are allocated to counteract the skew.
Meaning of skewed data:
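Data skew is a situation in a distributed processing environment in which the data is not evenly divided among the keys emitted from the map phase, so a few keys account for a disproportionate share of the tuples.
The reducers that receive those heavy keys take much longer than the rest, which leads to inconsistent processing times.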
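A quick way to see whether a file is skewed on a key is to count how many rows share each key value. The sketch below assumes a tab-separated file; the function name key_counts is hypothetical and not part of the original post.

```shell
#!/bin/sh
# key_counts FILE [COLUMN]: list the five most frequent values found in
# one tab-separated column (default: column 1), most frequent first.
key_counts() {
    cut -f "${2:-1}" "$1" | sort | uniq -c | sort -rn | head -n 5
}
```

If one value shows a count far above the others, rows with that key will pile up on a single reducer in an ordinary join.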
In this post, we skew the apache_nobots_tsv.txt file with a shell script that appends the same row a few thousand times and writes the result to a new file named skewed_apache_nobots_tsv.txt.
We then use skewed_apache_nobots_tsv.txt to implement the skewed join.
Type the script below in the vi editor on Linux to create the skewed dataset.
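A minimal sketch of such a script follows. The file name skew_data.sh, the function name make_skewed, and the count of 5000 (standing in for "a few thousand") are assumptions, and the script expects apache_nobots_tsv.txt to be in the current directory.

```shell
#!/bin/sh
# skew_data.sh (hypothetical file name)

# make_skewed INPUT OUTPUT COPIES
# Copy INPUT to OUTPUT, then append the first row of INPUT COPIES times.
# The body runs in a subshell, so its variables do not leak out.
make_skewed() (
    input="$1"; output="$2"; copies="$3"
    cp "$input" "$output" || exit 1
    first_row="$(head -n 1 "$input")"
    i=0
    while [ "$i" -lt "$copies" ]; do
        printf '%s\n' "$first_row"
        i=$((i + 1))
    done >> "$output"
)

# Build the skewed file when the source file is in the current directory.
# 5000 is an assumed value for "a few thousand".
if [ -f apache_nobots_tsv.txt ]; then
    make_skewed apache_nobots_tsv.txt skewed_apache_nobots_tsv.txt 5000
fi
```

Repeating a single row means its key value dominates the output file, which is the kind of imbalance a skewed join is meant to handle.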
To execute the above script file, type the command below in the Linux terminal.
If you get a "permission denied" error, change the permissions on the script (or on the folder that contains it) so that it can be executed.
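One way to handle both steps is sketched here: add the execute bit for the owner if it is missing, then run the script. The helper name run_with_exec_bit and the script name in the usage comment are hypothetical; chmod u+x is a narrower alternative to opening up a whole folder.

```shell
#!/bin/sh
# run_with_exec_bit SCRIPT
# Run SCRIPT, first adding the owner execute bit if it is missing.
# Pass a path containing a slash (for example ./name.sh) so the shell
# runs that file instead of searching $PATH.
run_with_exec_bit() {
    if [ ! -x "$1" ]; then
        chmod u+x "$1" || return 1
    fi
    "$1"
}

# Usage, with an assumed script name:
#   run_with_exec_bit ./skew_data.sh
```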
After the permissions are changed, the script executes; refer to the screenshot below.
In the step below, we load the skewed dataset into the Pig relation skewed_nobots_weblogs.
In this step, we load the smaller dataset into the Pig relation ip_country_tbl.
In this step, the skewed join is performed on the two relations.
To display the first 10 records, we use the LIMIT operator and then dump the relation filtered_weblog to show the joined records.
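The four steps above can be sketched as a single Pig script, written to a file from the shell. The relation names skewed_nobots_weblogs, ip_country_tbl and filtered_weblog come from this post; the HDFS paths, the field names and types, the smaller dataset's file name, the intermediate relation weblog_country_jnd, and the script name skewed_join.pig are all assumptions for illustration.

```shell
#!/bin/sh
# Write the Pig Latin statements to a script file (hypothetical name).
cat > skewed_join.pig <<'EOF'
-- Load the skewed weblog data. Path and schema are assumptions.
skewed_nobots_weblogs = LOAD '/data/weblogs/skewed_apache_nobots_tsv.txt'
    AS (ip:chararray, timestamp:long, page:chararray,
        http_status:int, payload_size:int, useragent:chararray);

-- Load the smaller IP-to-country dataset. Path and schema are assumptions.
ip_country_tbl = LOAD '/data/weblogs/nobots_ip_country_tsv.txt'
    AS (ip:chararray, country:chararray);

-- Join on ip; USING 'skewed' asks Pig to sample the key distribution
-- and spread heavy keys across several reducers.
weblog_country_jnd = JOIN skewed_nobots_weblogs BY ip,
                          ip_country_tbl BY ip USING 'skewed';

-- Keep 10 joined records and print them.
filtered_weblog = LIMIT weblog_country_jnd 10;
DUMP filtered_weblog;
EOF
```

The generated file can then be run with pig skewed_join.pig, or the statements can be typed one at a time in the Grunt shell. The skewed relation is listed first in the JOIN, with the smaller relation second.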
We hope this post helped you understand the concept of skewed joins.
Keep visiting our website www.acadgild.com/blog for more blogs and e-books on Big Data and other technologies.