Pig Use Case – The Daily Show Data Analysis Part – II

July 14

In this post, we continue with The Daily Show use case. We will work through two more problem statements and solve them using Pig.

In the previous post, we worked on two problem statements: the top five kinds of GoogleKnowlege_Occupation among guests on the show in a particular time period, and the number of politicians who appeared on the show as guests each year.

We have historical data on The Daily Show guests from 1999 to 2004. You can download this dataset from here.

Please find the dataset description below.

Dataset Description:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – The guest’s occupation or office, according to Google’s Knowledge Graph. If the guest is not in the Knowledge Graph, this is how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under “politicians”.

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

Problem Statement 1:

Find the number of GoogleKnowlege_Occupation entries in each group among the guests on the show, that is, how many guest rows fall under each group.

Source Code:

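-- Load the CSV with an explicit schema (every field is read as chararray)
A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,guestlist:chararray);
-- Keep only the columns needed, then group the rows by grp
B = foreach A generate occupation,grp;
C = group B by grp;
-- Count the rows in each group and sort by that count, largest first
D = foreach C generate group, COUNT(B) as cnt;
E = order D by cnt desc;
dump E;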

In relation A, we are loading the dataset using PigStorage along with the schema of the file.

In relation B, we are extracting the required columns i.e., occupation and the grp.

In relation C, we are grouping the relation by the grp.

If you describe relation C, you can see its schema, as shown below:

describe C;
C: {group: chararray,B: {(occupation: chararray,grp: chararray)}}

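In relation D, we are generating the group key and COUNT(B), the number of rows in the bag B for that group. Note that this counts every guest row, so an occupation that appears in many rows is counted once per row, not once per distinct occupation.

In relation E, we are ordering the result by the count in descending order. The result, the number of GoogleKnowlege_Occupation entries in each group among the guests on the show, is displayed below.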

(Acting,930)
(Media,751)
(Politician,308)
(Comedy,150)
(Musician,123)
(Academic,103)
(Athletics,52)
(Misc,45)
(Government,40)
(Political Aide,36)
(NA,31)
(Science,28)
(Business,25)
(Advocacy,24)
(Consultant,18)
(Military,16)
(Clergy,8)
(media,5)
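
Notice that Media and media show up as separate groups in the output above, because Pig compares chararray values case-sensitively. Below is a minimal sketch of one way to merge them, assuming the only difference between such values is letter case or stray whitespace. The path and schema are reused from the script above; normalizing with the built-in LOWER and TRIM functions is our assumption and is not part of the original script.

```pig
-- Same load as in Problem Statement 1
A = LOAD '/home/kiran/dialy_shows' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, date:chararray,
        grp:chararray, guestlist:chararray);

-- Normalize the group name before grouping: trim whitespace, lower-case it
B = FOREACH A GENERATE occupation, LOWER(TRIM(grp)) AS grp;

-- Group and count exactly as before
C = GROUP B BY grp;
D = FOREACH C GENERATE group AS grp, COUNT(B) AS cnt;
E = ORDER D BY cnt DESC;
DUMP E;
```

With this change, the rows that were previously split between Media and media would be counted together under a single lower-cased key.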

 

Problem Statement 2:

To verify the result of Problem Statement 1, we will find the combinations of group and GoogleKnowlege_Occupation type among the guests on the show, along with the count for each combination.

Source Code:

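-- Same load as before. Note: group is also a Pig keyword; if your Pig version rejects it as a field name, rename it (e.g. grp, as in Problem Statement 1)
A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,group:chararray,guestlist:chararray);
B = foreach A generate occupation,group;
-- Group by the (group, occupation) pair, so each distinct combination gets its own bag
C = group B by (group,occupation);
-- The group key is now a tuple; count the rows for each combination and sort by the key
D = foreach C generate group, COUNT(B) as cnt;
E = order D by group;
dump E;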

In relation A, we are loading the dataset using PigStorage along with the schema of the file.

In relation B, we are extracting the required columns i.e., occupation and the group.

In relation C, we are grouping the relation by the group and the occupation.

If you describe relation C, you can see its schema, as shown below:

describe C;
C: {group: (group: chararray,occupation: chararray),B: {(occupation: chararray,group: chararray)}}

In relation D, we are generating the group key, which is now a (group, occupation) tuple, and COUNT(B), the number of rows for that combination.

In relation E, we are ordering the result by the group key. The result, the count for each combination of group and GoogleKnowlege_Occupation type among the guests on the show, is displayed below for the Acting group.

((Acting,Film actor),9)
((Acting,Film actress),9)
((Acting,actor),596)
((Acting,actress),271)
((Acting,film actor),10)
((Acting,film actress),12)
((Acting,stunt perfomrer),5)
((Acting,television Actor),2)
((Acting,television actor),1)
((Acting,television actress),13)
((Acting,televison actor),1)
((Acting,telvision actor),1)

If you add up the counts of all the Acting combinations above, you get a total of 930, which matches the count displayed for Acting in Problem Statement 1.
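
The same check can be done in Pig instead of by hand. The sketch below is a variation on the script above and rests on a few assumptions: it names the field grp (as in Problem Statement 1), flattens the (grp, occupation) key back into two columns, and then regroups by grp to get both the number of distinct occupation values and the total number of rows per group. The relation names F, G, and H and the aliases occupation_types and total_rows are illustrative only.

```pig
A = LOAD '/home/kiran/dialy_shows' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, date:chararray,
        grp:chararray, guestlist:chararray);
B = FOREACH A GENERATE occupation, grp;

-- One bag per distinct (grp, occupation) combination
C = GROUP B BY (grp, occupation);

-- Flatten the tuple key into two plain columns and keep the row count
D = FOREACH C GENERATE FLATTEN(group) AS (grp, occupation), COUNT(B) AS cnt;

-- Regroup by grp: COUNT(D) is the number of distinct occupation values,
-- SUM(D.cnt) is the total number of guest rows in that group
F = GROUP D BY grp;
G = FOREACH F GENERATE group AS grp,
                       COUNT(D) AS occupation_types,
                       SUM(D.cnt) AS total_rows;
H = ORDER G BY total_rows DESC;
DUMP H;
```

If the listing above is complete for Acting, the Acting row should show 12 occupation values and 930 rows. Spelling variants such as televison actor and telvision actor count as different occupations, because the grouping is done on the raw string.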

We hope this post has been helpful in understanding how to perform analysis using Apache Pig. If you have any questions, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
