Skewed Join in Pig

In our previous blogs, we discussed Replicated Join and Merge Join in Pig.
In this post, we continue that discussion by implementing a skewed join.
A skewed join is useful when the underlying data is sufficiently skewed and the user needs control over how reducers are allocated to counteract the skew.

Meaning of skewed data:

Data skew is a situation in a distributed processing environment where the data is not evenly distributed among the keys emitted from the map phase.
This can lead to inconsistent processing times, because the reducers that receive the heavily repeated keys take much longer than the rest.
In this blog, we skew the apache_nobots_tsv.txt file with a shell script that appends the same row a few thousand times and writes the result to a new file named skewed_apache_nobots_tsv.txt.
We then use skewed_apache_nobots_tsv.txt to implement the skewed join.

Type the script below in the vi editor on Linux to create the skewed data set:
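
The original script was shown only as a screenshot, so what follows is a minimal sketch of the approach described above, not the author's exact script. The function name make_skewed and the count of 5000 (standing in for "a few thousand") are assumptions.

```shell
#!/bin/bash
# Sketch: build a skewed copy of a log file by repeating its first row.
# make_skewed is a hypothetical helper name, not taken from the original post.

# Usage: make_skewed INPUT OUTPUT COPIES
# Copies INPUT to OUTPUT, then appends INPUT's first row COPIES more times.
make_skewed() {
  input="$1"
  output="$2"
  copies="$3"

  first_row="$(head -n 1 "$input")"

  # Start from an exact copy of the original data.
  cat "$input" > "$output"

  # Append the same row repeatedly so that one key dominates the file.
  i=0
  while [ "$i" -lt "$copies" ]; do
    printf '%s\n' "$first_row"
    i=$((i + 1))
  done >> "$output"
}

# 5000 is an assumed value for "a few thousand".
if [ -f apache_nobots_tsv.txt ]; then
  make_skewed apache_nobots_tsv.txt skewed_apache_nobots_tsv.txt 5000
fi
```

The sketch assumes the input file ends with a newline; otherwise the first appended row would be glued onto the last original row.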

To execute the script file, run it from the Linux terminal.
If you get a "Permission denied" error, change the permissions on the folder that contains the script.
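
As a hedged illustration of this step: a "Permission denied" error when launching a script often means the script file itself lacks the execute bit, so a narrower fix than opening up the whole folder is to add that bit for the owner. The script name skew.sh and the helper run_script are assumptions; the original post does not name the file.

```shell
# Sketch: make a script executable if needed, then run it.
# run_script and skew.sh are hypothetical names used for illustration.

# Usage: run_script PATH_TO_SCRIPT (include ./ for the current directory)
run_script() {
  if [ ! -x "$1" ]; then
    # Grant execute permission to the file's owner only.
    chmod u+x "$1"
  fi
  "$1"
}

if [ -f ./skew.sh ]; then
  run_script ./skew.sh
fi
```

If the error persists, the folder may not be writable by the current user, in which case the script cannot create its output file there.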
After the permissions are changed, the script executes successfully.
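
One way to confirm that the generated file really is skewed is to count how many rows share each key. The sketch below assumes the join key is the first tab-separated column, which the original post does not state; top_keys is a hypothetical helper name.

```shell
# Sketch: list the most frequent values in column 1 of a tab-separated file.

# Usage: top_keys FILE [N]   (N defaults to 5)
top_keys() {
  cut -f 1 "$1" | sort | uniq -c | sort -rn | head -n "${2:-5}"
}

if [ -f skewed_apache_nobots_tsv.txt ]; then
  top_keys skewed_apache_nobots_tsv.txt
fi
```

In a skewed file, the first line of the output shows one key with a count far larger than the others.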
Next, we load the skewed dataset into the Pig relation skewed_nobots_weblogs.

We then load the smaller dataset into the Pig relation ip_country_tbl.
In this step, the skewed join is performed on the two relations.
To display the first 10 records, we use the LIMIT operator and then dump the relation filtered_weblog, which contains the joined records.
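
The load, join, and display steps above can be sketched as a single Pig script, written out here from the shell so that it can be run in Pig's local mode. The relation names skewed_nobots_weblogs, ip_country_tbl, and filtered_weblog come from this post. Everything else is an assumption, because the original statements were shown only as screenshots: the column schemas, the join key ip, the intermediate relation weblog_country_jnd, the smaller file's name nobots_ip_country_tsv.txt, and the script name skewed_join.pig.

```shell
# Sketch: write the Pig statements to a script file (names are assumptions).
cat > skewed_join.pig <<'EOF'
-- Larger, skewed relation. The schema is assumed, not taken from the post.
skewed_nobots_weblogs = LOAD 'skewed_apache_nobots_tsv.txt'
    AS (ip:chararray, ts:long, page:chararray,
        http_status:int, payload_size:int, useragent:chararray);

-- Smaller lookup relation. The file name and schema are assumed.
ip_country_tbl = LOAD 'nobots_ip_country_tsv.txt'
    AS (ip:chararray, country:chararray);

-- USING 'skewed' asks Pig to sample the keys of the first relation
-- and spread the heavily repeated ones across several reducers.
weblog_country_jnd = JOIN skewed_nobots_weblogs BY ip,
                          ip_country_tbl BY ip USING 'skewed';

-- Keep only the first 10 joined records and print them.
filtered_weblog = LIMIT weblog_country_jnd 10;
DUMP filtered_weblog;
EOF

# Run in local mode only if Pig is installed and both input files exist.
if command -v pig >/dev/null 2>&1 \
   && [ -f skewed_apache_nobots_tsv.txt ] \
   && [ -f nobots_ip_country_tsv.txt ]; then
  pig -x local skewed_join.pig
fi
```

On a cluster, the LOAD paths would point at HDFS locations instead, and the script would be run without -x local.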
We hope this blog helped you understand the concept of the skewed join.
Keep visiting our website www.acadgild.com/blog for more blogs and eBooks on Big Data and other technologies.