
Spark Use Case – The Daily Show

In this blog, we take a well-known TV show dataset, The Daily Show, and analyze the guests who appeared on the show.

Before going ahead, we recommend that readers go through our previous blogs on various publicly available datasets:

YouTube Data Analysis

Titanic Data Analysis

Olympic Data Analysis

We have historical data on The Daily Show guests from 1999 to 2004. The dataset can be downloaded from here.

Please find the dataset description below:

Dataset Description:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – The guest's occupation or office, according to Google’s Knowledge Graph or, if they are not listed there, how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, US senators, US presidents, and former presidents are all under “politicians”.

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation refers to only one of them in a given row.
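To make the column layout concrete, here is a minimal sketch that parses one comma-delimited line into a typed record. The GuestRecord case class, the parseLine helper, and the sample row are hypothetical and only illustrate the five columns described above; they are not part of the original source code. The sketch assumes the columns appear in the order listed and that only the last column may itself contain commas.

```scala
// Hypothetical record type mirroring the five columns described above.
final case class GuestRecord(
  year: Int,
  occupation: String,   // GoogleKnowlege_Occupation
  show: String,         // air date, kept as text here
  group: String,
  rawGuestList: String
)

// Parse one comma-delimited line. Returns None when the line has fewer than
// five fields or when the year is not a number (for example, a header row).
def parseLine(line: String): Option[GuestRecord] = {
  // limit = 5 keeps any extra commas inside the last column (the guest list)
  val cols = line.split(",", 5).map(_.trim)
  if (cols.length < 5) None
  else scala.util.Try(cols(0).toInt).toOption.map { y =>
    GuestRecord(y, cols(1), cols(2), cols(3), cols(4))
  }
}

// Illustrative row, made up to match the described layout.
val sampleRow = "1999,actor,1/11/99,Acting,Some Guest, Another Guest"
val parsed = parseLine(sampleRow)
```

Returning an Option is one way to skip malformed lines or a header row instead of failing the whole job; whether the downloaded file actually contains a header row is not stated here, so treat that check as a precaution.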

Problem Statement:

Find the top 5 occupations (GoogleKnowlege_Occupation) of the guests who appeared on the show in a particular time period.

Source Code:

val file = sc.textFile("/home/kiran/dialy_show_guests") // load the dataset as an RDD of lines
val split = file.map(line => line.split(",")) // split each record into its columns
val format = new java.text.SimpleDateFormat("MM/dd/yy") // date format used by the Show column
val pair = split.map(line => (line(1), format.parse(line(2)))) // (occupation, air date)
val fil = pair.filter(x => x._2.after(format.parse("1/11/99")) && x._2.before(format.parse("6/11/99"))) // keep dates strictly inside the range
val cnt = fil.map(x => (x._1, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(5) // top 5 (count, occupation) pairs

Walkthrough of the above code:

In line 1, we create a new RDD by loading the dataset from the local file system.

In line 2, we split each record on ‘,’ since the data is comma-delimited.

In line 3, we declare the date format using the Java class java.text.SimpleDateFormat. In the dataset, the date format is “MM/dd/yy”.
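As a small illustration of the date format, the sketch below parses one date string with SimpleDateFormat and reads back its fields. The sample strings are assumptions chosen to match the format described above. Note that the pattern letters are case-sensitive: lowercase yy is the calendar year, while uppercase YY means week year in SimpleDateFormat, so the lowercase form is used here.

```scala
import java.text.SimpleDateFormat
import java.util.Calendar
import scala.util.Try

// Same pattern as in the source code: month/day/two-digit year.
val dateFormat = new SimpleDateFormat("MM/dd/yy")

// Parse a sample air date (assumed value in the dataset's format).
val airDate = dateFormat.parse("1/11/99")

// Read the individual fields back through a Calendar.
val cal = Calendar.getInstance()
cal.setTime(airDate)
val year = cal.get(Calendar.YEAR)
val month = cal.get(Calendar.MONTH) + 1   // Calendar months are zero-based
val day = cal.get(Calendar.DAY_OF_MONTH)

// parse throws java.text.ParseException on text that is not a date;
// wrapping the call in Try turns that into a value we can inspect.
val bad = Try(dateFormat.parse("not a date"))
```

SimpleDateFormat resolves a two-digit year relative to the time the formatter is created, so "99" is read as 1999 rather than 2099.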

In line 4, we create a pair of GoogleKnowlege_Occupation and Show (the air date of the episode). The air date is read as a string, and we convert it to a java.util.Date using the parse method of java.text.SimpleDateFormat.

In line 5, we use the filter method to drop the records that fall outside the required date range. The range over which the occupations are counted is given explicitly; here it covers five months, from 1/11/99 to 6/11/99. Because after and before are strict comparisons, shows dated exactly 1/11/99 or 6/11/99 are excluded.
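The range check can be tried on its own with plain Scala collections. The following is a minimal sketch, assuming a handful of made-up (occupation, date) pairs; the inRange helper and the sample rows are hypothetical and only demonstrate how the two boundary dates are treated.

```scala
import java.text.SimpleDateFormat
import java.util.Date

val fmt = new SimpleDateFormat("MM/dd/yy")

// Parse the range bounds once instead of once per record.
val start = fmt.parse("1/11/99")
val end = fmt.parse("6/11/99")

// Strictly between start and end, like the filter in the source code.
def inRange(d: Date): Boolean = d.after(start) && d.before(end)

// Made-up sample pairs of (occupation, air date).
val rows = Seq(
  ("actor", fmt.parse("1/11/99")),     // equals start: excluded
  ("actress", fmt.parse("1/12/99")),
  ("comedian", fmt.parse("6/10/99")),
  ("actor", fmt.parse("6/11/99")),     // equals end: excluded
  ("singer", fmt.parse("7/1/99"))      // after end: excluded
)

val kept = rows.filter { case (_, date) => inRange(date) }
```

To include the boundary dates, the predicate could be written as !d.before(start) && !d.after(end) instead.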

In line 6, we take the records that fall within the specified range and map each one to a key-value pair of GoogleKnowlege_Occupation and 1. We then apply the reduceByKey transformation, which sums the values for each unique key and so gives the number of guests per occupation. Next, we swap each pair so that the count becomes the key, and sort with sortByKey(false) to get the records in descending order of count. Finally, the take(5) action returns the top five records.
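The count, sort, and take steps can be sketched with plain Scala collections, which makes the logic easy to check without a Spark context. The topOccupations helper and the sample values below are hypothetical. groupBy stands in for reduceByKey here, so this is an illustration of the same idea rather than the Spark code itself.

```scala
// Count how often each occupation occurs and return the n most frequent
// as (count, occupation) pairs, mirroring the shape of the Spark output.
def topOccupations(occupations: Seq[String], n: Int): Seq[(Int, String)] =
  occupations
    .groupBy(identity)                              // occupation -> all its occurrences
    .map { case (occ, hits) => (hits.size, occ) }   // (count, occupation)
    .toSeq
    .sortBy { case (count, occ) => (-count, occ) }  // descending count, then name for ties
    .take(n)

// Made-up sample values.
val occupations = Seq("actor", "actress", "actor", "comedian", "actor", "actress")
val top2 = topOccupations(occupations, 2)
```

Sorting on the name as a secondary key makes the result deterministic when two occupations have the same count, which the swap-and-sortByKey approach does not guarantee. In Spark itself, RDD.sortBy or RDD.top with a custom Ordering are alternatives that avoid the swap step.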

Output:

(28,actor), (20,actress), (4,comedian), (3,television actress), (2,stand-up comedian)

The same output is displayed in the screenshot below.

[Screenshot: Daily show guests]

We hope this blog helped you understand how to analyze a real-world dataset using Apache Spark in Scala. Keep visiting our site www.acadgild.com for more updates on Big Data, Spark, and other technologies.