Big Data Hadoop & Spark

Spark Use Case – Youtube Data Analysis

This post is about analyzing YouTube data using Apache Spark. The data set is publicly available and is described below under the heading Data Set Description.
Using that data set, we will perform some analysis and draw out insights such as the top 10 rated videos on YouTube and who uploaded the largest number of videos.

Before getting into the use case, let's get a brief understanding of Spark with our Beginner's Guide.


This post will help you understand how to handle data sets that do not have a proper structure and how to sort the output of a reducer.

Data Set Description

Column 1: Video id of 11 characters.

Column 2: Uploader of the video.

Column 3: Number of days between the establishment of YouTube and the upload date of the video.

Column 4: Category of the video.

Column 5: Length of the video.

Column 6: Number of views of the video.

Column 7: Rating of the video.

Column 8: Number of ratings given for the video.

Column 9: Number of comments on the video.

Column 10: Ids of videos related to the uploaded video.
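To make the layout concrete, here is a small Scala sketch that splits one tab-separated record into its fields. The video id, uploader, and all field values below are invented for illustration; they are not from the actual data set.

```scala
// One hypothetical record in the data set's tab-separated layout
// (all field values below are invented for illustration).
val line = "qR7tYx2WmL0\tsomeUploader\t744\tMusic\t212\t1000\t4.5\t20\t5\trelId1"
val fields = line.split("\t")

// Note the 0-based indices: index 3 is the category, index 6 is the rating.
println(fields(3)) // Music
println(fields(6)) // 4.5
```

Keep in mind this 0-based indexing when reading the code below: column 4 of the description is index 3, and column 7 is index 6.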

You can download the data set from here.

Problem Statement 1:

Here, we will find out the top five categories with the maximum number of videos uploaded.

Source Code:

val textFile = sc.textFile("hdfs://localhost:9000/youtubedata.txt")
val counts = textFile.map(line => {var YoutubeRecord = ""; val temp = line.split("\t"); if (temp.length >= 4) {YoutubeRecord = temp(3)}; YoutubeRecord})
val test = counts.map(x => (x, 1))
val res = test.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(5)

Walk Through of the Above Program:

  • In line 1, we are creating an RDD from the existing dataset, which is inside HDFS.

  • In line 2, we are taking each record as input using the map method and extracting the 4th column, which is the category of the video. The length check guards against lines with fewer than four fields.

  • In line 3, we are creating a pair of (category_name, 1), which is used to count how many times each category is present.

  • In line 4, we are using the reduceByKey method to aggregate all the values for each key. We then swap category_name and its count and sort the result, which gives the records of category_name and count in descending order of count. Finally, we take the top five from the list.

Output:

(908,Entertainment), (862,Music), (414,Comedy), (398,People & Blogs), (333,News & Politics)

You can see this result in the below screenshot.

top categories with videos
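Spark aside, the same map/reduce pattern can be sketched with plain Scala collections. The category names below come from the output shown above, but the toy list and its counts are invented for illustration:

```scala
// Toy list standing in for the extracted category column (values invented).
val categories = List("Music", "Comedy", "Music", "Entertainment", "Music", "Comedy")

// Equivalent of map(x => (x, 1)) followed by reduceByKey(_ + _):
val counted = categories.groupBy(identity).map { case (cat, xs) => (cat, xs.size) }

// Equivalent of swapping each pair and sortByKey(false).take(5):
// counts become keys, sorted in descending order.
val top = counted.toList.map(_.swap).sortBy(-_._1).take(5)

println(top) // List((3,Music), (2,Comedy), (1,Entertainment))
```

The swap-then-sort step is needed because sortByKey sorts by key only; to order by count, the count must first become the key.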

Problem Statement 2:

In this problem statement, we will find the top 10 rated videos in YouTube.

val textFile = sc.textFile("hdfs://localhost:9000/youtubedata.txt")
val counts = textFile.filter(x => x.split("\t").length >= 7).map(line => line.split("\t"))
val pairs = counts.map(x => (x(0), x(6).toDouble))
val res = pairs.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(10)

Walk Through of the Above Program:

  • In line 1, we are creating an RDD from the existing dataset, which is inside HDFS.

  • In line 2, we first filter out lines with fewer than seven fields to avoid an ArrayIndexOutOfBoundsException, and then use the map method to pass the split line to the next RDD.

  • In line 3, we are creating key-value pairs using the video id (the first column) as the key and the rating (the seventh column) as the value.

  • In line 4, we are using the reduceByKey method to aggregate the ratings of each video. To sort by value, we use the map method to swap the key and value so that the ratings become the keys, then apply the sortByKey method to sort the videos by rating in descending order, and finally take the top 10 videos.

Output:

(5.0,ZzuGxkWLops), (5.0,O4GzZxcKmFU), (5.0,smGcj6vohLs), (5.0,_KVr7VOTwTQ), (5.0,6yuy9DEK114), (5.0,xd1kn2bFpSM), (5.0,wEQ54SUxtiI), (5.0,lbVnhaqP8F4), (5.0,3V0SjoaPx9A), (5.0,265li8v9m1k)

This output can be seen in the below screenshot.

top 10 rated videos
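The length guard in line 2 can also be sketched with plain Scala collections. The two records below are invented; the truncated one shows why malformed lines must be filtered out before index 6 is read:

```scala
// Two hypothetical records: one complete, one truncated (both invented).
val lines = List(
  "id1\tup1\t100\tMusic\t200\t50\t4.5\t10\t2\trel1", // 10 fields
  "id2\tup2\t100"                                    // only 3 fields
)

// Keep only lines with at least 7 fields so that index 6 (the rating) is safe to read.
val ratings = lines.map(_.split("\t")).filter(_.length >= 7).map(cols => (cols(0), cols(6).toDouble))

println(ratings) // List((id1,4.5))
```

Without the filter, reading `cols(6)` on the truncated record would throw an ArrayIndexOutOfBoundsException and fail the whole job.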

We hope this post has been helpful in understanding how to perform simple data analysis using Spark and Scala.


Comments

  1. Hi,
    In problem statement 1, line 2, you have used 'var' to check the column count and extract the category column. Instead of using 'var' you can write it like this:
    val counts = textFile.map(line => line.split("\t")).filter(columns => columns.length >= 4).map(columns => columns(3))
    or
    val counts = textFile.map(_.split("\t")).filter(_.length >= 4).map(_(3))
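The commenter's filter-then-map chain can be checked on plain Scala collections as well. The sample lines below are invented; the short one is there to show the filter doing its job:

```scala
// Toy lines to exercise the suggested filter-then-map chain (values invented).
val sample = List(
  "vid01\tuser1\t10\tComedy\textra", // 5 fields: survives the filter
  "tooShort\tline"                   // 2 fields: dropped by the filter
)
val cats = sample.map(_.split("\t")).filter(_.length >= 4).map(_(3))

println(cats) // List(Comedy)
```

This filter-before-access style avoids the mutable 'var' and keeps each transformation as a separate, readable step.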
