Pig Use Case – The Daily Show Data Analysis Part – I

July 9

In this post, we look at a use case built on The Daily Show guest data. We will work through two problem statements and solve them using Pig scripts.

We have historical data on The Daily Show guests from 1999 to 2015. You can download this dataset from here.

Please find the dataset description below.

Dataset Description:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – The guest's occupation or office, according to Google’s Knowledge Graph. If the guest is not listed there, this is how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under “politicians”.

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

Problem Statement 1:

Find the five most common GoogleKnowlege_Occupation values among guests who appeared on the show in a particular time period.

Source Code:

A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,group:chararray,gusetlist:chararray);
B = foreach A generate occupation,date;
C = foreach B generate occupation,ToDate(date,'MM/dd/yy') as date;
D = filter C by ((date> ToDate('1/11/99','MM/dd/yy')) AND (date<ToDate('6/11/99','MM/dd/yy')));
-- Date range can be modified by the user
E = group D by occupation;
F = foreach E generate group, COUNT(D) as cnt;
G = order F by cnt desc;
H = limit G 5;

In relation A, we are loading the dataset using PigStorage along with the schema of the file.

In relation B, we are extracting the required columns, i.e., occupation and date.

In relation C, we are converting the date from a string to a datetime using Pig's ToDate function.

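In relation D, we are keeping only the dates that fall inside a specific range. Here, the range runs from 1/11/99 to 6/11/99, i.e., January 11 to June 11, 1999, which is about five months. Both comparisons are strict, so the two boundary dates themselves are excluded.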
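To sanity-check the date logic without running Pig, the filter in relation D can be mirrored in a few lines of Python. This is only a sketch: the sample rows are made up for illustration, `in_range` is a hypothetical helper name, and `%m/%d/%y` is assumed to be the Python counterpart of Pig's 'MM/dd/yy' pattern. It shows that both comparisons are strict, so a guest who appeared exactly on either boundary date is dropped.

```python
from datetime import datetime

FMT = "%m/%d/%y"  # assumed Python counterpart of Pig's 'MM/dd/yy'

def in_range(date_str, start="1/11/99", end="6/11/99"):
    """Return True when date_str falls strictly between start and end."""
    d = datetime.strptime(date_str, FMT)
    return datetime.strptime(start, FMT) < d < datetime.strptime(end, FMT)

# Made-up sample rows (occupation, air date); not taken from the dataset.
rows = [
    ("actor", "1/11/99"),     # on the lower bound, excluded
    ("actress", "1/12/99"),
    ("comedian", "6/10/99"),
    ("singer", "6/11/99"),    # on the upper bound, excluded
    ("actor", "7/1/99"),      # after the range
]

kept = [(occ, day) for occ, day in rows if in_range(day)]
print(kept)
```

If the boundary dates should be included, the comparisons would need to be `<=` here, and `>=` / `<=` in the Pig filter.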

In relation E, we are grouping relation D by occupation.

If you describe relation E, you can see its schema, as shown below:

 describe E;
E: {group: chararray,D: {(occupation: chararray,date: datetime)}}

In relation F, we are generating the group key and the count of rows in each group. This gives us each occupation and the number of times a guest with that occupation appeared on the show within the selected date range.

In relation G, we are sorting relation F by count in descending order.

In relation H, we are limiting relation G to its first 5 records.

With this, we get the five most common GoogleKnowlege_Occupation values among guests on the show in the chosen period.

When we dump relation H, we get the following result.

(actor,28)
(actress,20)
(comedian,4)
(television actress,3)
(singer,2)
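The group, count, order, and limit steps (relations E through H) amount to a "top N by frequency" query. As a rough cross-check outside Pig, the same idea can be sketched in Python with `collections.Counter`. The occupation list below is made up and `top_occupations` is a hypothetical helper name; it is not the dataset and will not reproduce the numbers above.

```python
from collections import Counter

def top_occupations(occupations, n=5):
    """Count each occupation and return the n most frequent as (name, count) pairs."""
    return Counter(occupations).most_common(n)

# Made-up occupations for rows that already passed the date filter.
filtered = (
    ["actor"] * 6 + ["actress"] * 5 + ["comedian"] * 4
    + ["singer"] * 3 + ["author"] * 2 + ["journalist"]
)

for name, cnt in top_occupations(filtered):
    print((name, cnt))
```

One caveat worth keeping in mind: when two occupations have the same count, which of them lands inside the top five is not something the Pig script pins down, so tied results may differ between runs or tools.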

Problem Statement 2:

Find the number of politicians who appeared on the show each year.

Source Code:

A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,group:chararray,gusetlist:chararray);
B = foreach A generate year,group;
C = filter B by group == 'Politician';
D = group C by year;
E = foreach D generate group, COUNT(C) as cnt;
F = order E by cnt desc;

In relation A, we are loading the dataset using PigStorage along with the schema of the file.

In relation B, we are extracting the required columns, i.e., year and group.

In relation C, we are keeping only the rows whose group is Politician.

In relation D, we are grouping relation C by year.

If you describe relation D, you can see its schema, as shown below:

describe D;
D: {group: chararray,C: {(year: chararray,group: chararray)}}

In relation E, we are generating the group key (the year) and the count of rows from relation C in each group.

In relation F, we are sorting relation E by count in descending order.

When we dump relation F, we get the number of politicians who were guests on the show each year, as displayed below.

(2004,32)
(2012,29)
(2008,27)
(2009,26)
(2006,25)
(2010,25)
(2011,23)
(2005,22)
(2007,21)
(2015,14)
(2003,14)
(2014,13)
(2000,13)
(2013,11)
(2002,8)
(2001,3)
(1999,2)
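For readers who want to verify the per-year counts without a Pig installation, the filter, group, and count steps can be approximated in Python. The sketch below is an illustration under stated assumptions: the CSV text is made up (the guest names and rows are placeholders, not real data), it assumes the file carries a header row with the column names from the dataset description, and `politicians_per_year` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up rows in the column layout described above; not real data.
SAMPLE = """YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List
1999,actor,1/11/99,Acting,Guest A
1999,us senator,2/3/99,Politician,Guest B
2000,us senator,3/1/00,Politician,Guest C
2000,former president,4/2/00,Politician,Guest D
2000,comedian,5/9/00,Comedy,Guest E
"""

def politicians_per_year(csv_text):
    """Return (year, count) pairs for Politician rows, highest count first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["YEAR"] for row in reader if row["Group"] == "Politician")
    return counts.most_common()

print(politicians_per_year(SAMPLE))
```

To run it against the downloaded file, the `io.StringIO` wrapper could be swapped for an open file handle, provided the header names match.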

We hope this post has helped you understand how to perform analysis using Apache Pig. If you have any queries, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
