
Pig Use Case – The Daily Show Data Analysis Part – I

In this post, we will look at a use case based on The Daily Show. We will work through some problem statements and come up with solutions using Pig scripts.

We have historical data of The Daily Show guests from 1999 to 2015. You can download this dataset from here.


Please find the dataset description below.

Dataset Description:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – The guest's occupation or office, according to Google's Knowledge Graph; if the guest is not in the Knowledge Graph, it is how Stewart introduced them on the program. (The misspelling "Knowlege" is part of the original column name.)

Show – The air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents all fall under "politicians".

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.
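For reference, a raw row of the CSV looks like the following (an illustrative sample from the start of the dataset; note that PigStorage does not skip a header row, so if your copy of the file carries one, remove it before loading):

1999,actor,1/11/99,Acting,Michael J. Fox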

Problem Statement 1:

Find the top five GoogleKnowlege_Occupation categories among the guests who appeared on the show in a particular time period.

Source Code:

A = load '/home/kiran/daily_shows' using PigStorage(',') AS (year:chararray, occupation:chararray, date:chararray, group:chararray, guestlist:chararray);
B = foreach A generate occupation, date;
C = foreach B generate occupation, ToDate(date,'MM/dd/yy') as date;
D = filter C by ((date > ToDate('1/11/99','MM/dd/yy')) AND (date < ToDate('6/11/99','MM/dd/yy')));
-- the date range can be modified by the user
E = group D by occupation;
F = foreach E generate group, COUNT(D) as cnt;
G = order F by cnt desc;
H = limit G 5;

In relation A, we are loading the dataset using PigStorage along with the schema of the file.
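If your file resides in HDFS rather than on the local file system, the same statement works with an HDFS path. A sketch, assuming the NameNode listens on RPC port 9000 (the exact host and port come from your fs.defaultFS setting, commonly 9000 or 8020; note that 50070 is the web UI port, not the RPC port):

A = load 'hdfs://localhost:9000/user/kiran/daily_shows' using PigStorage(',') AS (year:chararray, occupation:chararray, date:chararray, group:chararray, guestlist:chararray);

When Pig runs in mapreduce mode with the cluster configuration on its classpath, a plain path such as /user/kiran/daily_shows is resolved against HDFS by default, so the scheme and authority can usually be omitted.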

In relation B, we are extracting the required columns i.e., occupation and date.

In relation C, we are converting the date from string format to a datetime using the ToDate function in Pig.
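Once the string is a datetime, standard datetime built-ins apply to it. A small illustration (the relation name Y and this step are not part of the original script):

Y = foreach C generate occupation, GetYear(date) as yr; -- GetYear extracts the year from a datetime
-- ToDate('1/11/99','MM/dd/yy') itself evaluates to the datetime 1999-01-11T00:00:00.000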

In relation D, we are filtering the dates to a specific range. Here, we have given the date range from 1/11/99 to 6/11/99, i.e., we are taking out the data for a five-month window.
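To analyse a different window, only the two date literals need to change. For instance, a sketch that covers all of 1999 (boundary dates chosen for illustration):

D = filter C by ((date >= ToDate('1/1/99','MM/dd/yy')) AND (date <= ToDate('12/31/99','MM/dd/yy')));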

In relation E, we are grouping relation D by occupation.

If you describe relation E then you can see the schema of the relation as shown below:

describe E;
E: {group: chararray,D: {(occupation: chararray,date: datetime)}}

In relation F, we are generating the group and the count of values in it. Here, we get the occupation of the guests and the number of times guests with that occupation came to the show within this five-month span.
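One caveat worth knowing: Pig's COUNT ignores tuples whose first field is null, while COUNT_STAR counts every tuple in the bag. If the occupation column may contain nulls and every row should still be counted, a variant (not part of the original script) would be:

F = foreach E generate group, COUNT_STAR(D) as cnt; -- COUNT_STAR also counts tuples with null fields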

In relation G, we are ordering relation F in descending order of the count.

In relation H, we are limiting the records of relation G to 5.

With this, we get the top five GoogleKnowlege_Occupation categories among the guests on the show in the given time period.

When we dump relation H, we will get the below result.
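For reference, the statement that actually triggers execution and prints the result is:

dump H;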

(actor,28)
(actress,20)
(comedian,4)
(television actress,3)
(singer,2)


Problem Statement 2:

Find the number of politicians who appeared as guests each year.

Source Code:

A = load '/home/kiran/daily_shows' using PigStorage(',') AS (year:chararray, occupation:chararray, date:chararray, group:chararray, guestlist:chararray);
B = foreach A generate year, group;
C = filter B by group == 'Politician';
D = group C by year;
E = foreach D generate group, COUNT(C) as cnt;
F = order E by cnt desc;

In relation A, we are loading the dataset using PigStorage along with the schema of the file.

In relation B, we are extracting the required columns i.e., year and the group.

In relation C, we are filtering relation B to keep only the rows whose group is Politician.
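Note that this string comparison is case-sensitive, so rows where the group was recorded in lower case would be missed. A variant that normalises the case first, using Pig's built-in LOWER function (a sketch, not part of the original script):

C = filter B by LOWER(group) == 'politician';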

In relation D, we are grouping the relation C by year.

If you describe relation D then you can see the schema of the relation as shown below:

describe D;
D: {group: chararray,C: {(year: chararray,group: chararray)}}

In relation E, we are generating the group (the year) and the count of values in relation C for that year.

In relation F, we are ordering the values of relation E in descending order of the count.

When we dump relation F, we will get the number of politicians who were guests on the show each year; the result is displayed below.
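The corresponding statement is:

dump F;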

(2004,32)
(2012,29)
(2008,27)
(2009,26)
(2006,25)
(2010,25)
(2011,23)
(2005,22)
(2007,21)
(2015,14)
(2003,14)
(2014,13)
(2000,13)
(2013,11)
(2002,8)
(2001,3)
(1999,2)

We hope this post has been helpful in understanding how to perform analysis using Apache Pig. In case of any queries, feel free to comment below, and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.


12 Comments

1. Hi Neeraj,
   After clicking on the link you will get an option to download the file. Click on Download; after downloading the file you will be able to check its contents.

2. Thanks a lot Kiran, it is really very good. It gave me a momentum to work on Pig. Thanks a lot, and continue spreading your knowledge.

3. Thanks for sharing the sample Pig use case. The second example does not take care of scenarios where Politician was written in lower case. It would be great if you could share the same.

4. Hi Kiran,
   Thanks for the nice intro. But I am trying to load the data present in HDFS. Can you please tell me how the Load command should look? I tried the command below:
   A = load 'hdfs://localhost:50070/user/input/daily_show_guests.txt'
   but it is not fetching the data. I have kept the Pig execution type as 'mapreduce'. Can you please help me regarding the same?

