Spark Use Case – Popular Movie Analysis

In this post, we work through a case study that finds the most popular movies in a data set.
We will apply several Spark transformations and actions to list the movie IDs that occur most often in the given data set.
Let's start with the data definition, using a sample of four records.

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923

Data Definition:
Column 1: User ID
Column 2: Movie ID
Column 3: Rating
Column 4: Timestamp
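
To make the layout concrete, here is a minimal sketch that splits one record into the four fields defined above. The names Record and parse_record are illustrative and do not come from the original post; the sketch assumes the fields are separated by whitespace (spaces or tabs).

```python
from collections import namedtuple

# Hypothetical container; the field names follow the data definition above.
Record = namedtuple("Record", ["user_id", "movie_id", "rating", "timestamp"])


def parse_record(line):
    """Split one whitespace-separated record into its four integer fields."""
    user_id, movie_id, rating, timestamp = line.split()
    return Record(int(user_id), int(movie_id), int(rating), int(timestamp))


# First sample record from the data set.
first = parse_record("196 242 3 881250949")
print(first.movie_id, first.rating)
```

Only the movie ID (the second field) is needed for the popularity count; the other fields are parsed here just to show the layout.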
The input file can be downloaded from here.
The first statement loads the data file by creating an RDD with the sc.textFile method. The file is loaded into the RDD my_lines, and textFile makes every line of text one element of that RDD.
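
The load step could be written roughly as follows. This is a sketch, not the original statement: the helper name load_lines and the file path are placeholders, and sc is assumed to be an existing SparkContext (the pyspark shell provides one under that name).

```python
def load_lines(sc, path):
    """Create an RDD in which each element is one line of the text file."""
    return sc.textFile(path)


# Usage in the pyspark shell, where the SparkContext `sc` is predefined.
# The path is a placeholder, not the file name used in the original post:
#   my_lines = load_lines(sc, "path/to/ratings_file")
```

Nothing is read at this point. Spark evaluates transformations lazily, so the file is only scanned once an action such as take or collect is called.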

Next, the map method is called on the my_lines RDD. For each line it extracts the movie ID and emits a key-value pair of (movie ID, 1). The resulting pairs are stored in the my_movies RDD.
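
A minimal sketch of this mapping step is shown here. The function names to_movie_pair and build_movie_pairs are illustrative; the original post most likely used an inline lambda, which this sketch spells out as a named function so it can be tried without Spark.

```python
def to_movie_pair(line):
    """Turn one input line into a (movie_id, 1) pair."""
    fields = line.split()          # user ID, movie ID, rating, timestamp
    return (int(fields[1]), 1)     # the movie ID is the second field


def build_movie_pairs(my_lines):
    """Apply to_movie_pair to every element of the my_lines RDD."""
    return my_lines.map(to_movie_pair)


# Usage with the RDD from the load step:
#   my_movies = build_movie_pairs(my_lines)
```

For the first sample record, to_movie_pair("196 242 3 881250949") returns (242, 1).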

To see the first 10 records of the my_lines RDD, we call the take action on it.

In this step we call the reduceByKey transformation on the my_movies RDD. It groups all the values that share the same movie ID and adds up the 1s for that ID, which gives the number of times each movie ID occurs in my_movies. The result is stored in the movie_Counts RDD.
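
The counting step can be sketched as below. count_per_movie shows the Spark call, and reduce_by_key_local is a plain-Python illustration of what reduceByKey computes; both function names are assumptions made for this example.

```python
from operator import add


def count_per_movie(my_movies):
    """Spark version: sum the 1s attached to each movie ID."""
    return my_movies.reduceByKey(add)


def reduce_by_key_local(pairs, func):
    """Plain-Python illustration of the result reduceByKey produces."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


# Toy input: movie 242 appears twice, movie 302 once.
toy_counts = reduce_by_key_local([(242, 1), (302, 1), (242, 1)], add)
# toy_counts == {242: 2, 302: 1}

# Usage with the pair RDD from the map step:
#   movie_Counts = count_per_movie(my_movies)
#   movie_Counts.take(10)
```

The function passed to reduceByKey should be associative and commutative, because Spark combines values within each partition before merging the partial results. Addition satisfies both.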

To see the first 10 records of the movie_Counts RDD, the take action has been called.

The movie_Counts RDD holds entries in key-value format, where the key is the movie ID and the value is the number of occurrences of that movie ID.
To sort the entries by count, we swap each pair so that the count becomes the key and the movie ID becomes the value. The flipped pairs are stored in the flipped_op RDD.
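
The swap could be sketched like this. The helper names flip_pair and flip_counts are illustrative and are not taken from the original post.

```python
def flip_pair(pair):
    """Turn (movie_id, count) into (count, movie_id)."""
    movie_id, count = pair
    return (count, movie_id)


def flip_counts(movie_counts):
    """Swap key and value in every element of the movie_Counts RDD."""
    return movie_counts.map(flip_pair)


# Usage with the counts RDD:
#   flipped_op = flip_counts(movie_Counts)
```

The flip is needed only because sortByKey orders by key. An alternative that avoids it is RDD.sortBy with a key function that picks the count out of each pair.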

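In this step we call sortByKey on flipped_op to sort the RDD by key, i.e. by the number of occurrences of each movie ID. By default sortByKey sorts in ascending order, so the most frequent movies end up at the end of the result.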
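
The sort step might look like the following sketch. The wrapper name sort_by_count and the variable name sorted_movies are assumptions, since the original post does not name the sorted RDD.

```python
def sort_by_count(flipped_op, ascending=True):
    """Sort (count, movie_id) pairs by count using the RDD's sortByKey."""
    return flipped_op.sortByKey(ascending=ascending)


# Usage with the flipped RDD (ascending order, as in the walkthrough):
#   sorted_movies = sort_by_count(flipped_op)
#
# Passing ascending=False would instead put the most frequent movies first:
#   sort_by_count(flipped_op, ascending=False).take(10)
```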

In this step we collect the sorted results into a Python list named final_results.

In this step we find the number of key-value pairs in the Python list final_results.
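
Collecting and counting could be sketched as follows. The helper name collect_and_count and the parameter name sorted_rdd are illustrative; sorted_rdd stands for whatever RDD the sortByKey step returned.

```python
def collect_and_count(sorted_rdd):
    """Bring the sorted pairs to the driver and report how many there are."""
    final_results = sorted_rdd.collect()   # a plain Python list of tuples
    return final_results, len(final_results)


# Usage with the sorted RDD:
#   final_results, n_pairs = collect_and_count(sorted_rdd)
```

collect copies the whole RDD into the driver's memory, which is fine here because there is only one pair per distinct movie ID. For large results, take or a descending sort followed by take is the safer choice.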

Finally, we extract the last 10 key-value pairs and find that the most popular movie, movie ID 50, has 583 occurrences in the data set.
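
Because final_results is an ordinary Python list sorted in ascending order of count, the last entries can be taken with a slice. This is a sketch with illustrative helper names (last_n, most_popular) and a toy list; apart from the (583, 50) pair reported above, the toy values are made up for the example.

```python
def last_n(final_results, n=10):
    """Return the last n (count, movie_id) pairs of an ascending-sorted list."""
    if n <= 0:
        return []
    return final_results[-n:]


def most_popular(final_results):
    """Return (movie_id, count) for the final, i.e. most frequent, entry."""
    count, movie_id = final_results[-1]
    return movie_id, count


# Toy list in ascending order of count (only the last pair is from the post).
toy_results = [(1, 7), (3, 10), (583, 50)]
print(last_n(toy_results, 2))
print(most_popular(toy_results))
```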

We hope this post has been helpful in understanding this Spark use case using Python. In case of any queries, feel free to comment below and we will get back to you at the earliest.
For more resources on Big Data and other technologies, keep visiting
