In this blog, we will work through a Spark use case: finding the most popular movies in a ratings data set.
We will apply a series of transformations and actions to produce a list of movies ranked by how often they occur in the given data set.
Let’s start with the data definition. Each record in the data set has four fields:
Column 1: User ID
Column 2: Movie ID
Column 3: Rating
Column 4: Time stamp
The input file can be downloaded from here.
The first step loads the data file into an RDD named my_lines using the sc.textFile method, which turns every line of the text file into one element of the RDD.
The map transformation is then called on the my_lines RDD: from each line we pull out the movie ID and pair it with the value 1, storing the resulting key-value pairs in the my_movies RDD.
To see the first 10 records of the my_lines RDD, the take action is called on my_lines.
In this step we call the reduceByKey transformation on the my_movies RDD. It groups all values by movie ID and sums the 1s associated with each one, so we can calculate how many times each movie ID occurs in the my_movies RDD.
To see the first 10 records of the movie_Counts RDD, the take action is called again.
The movie_Counts RDD holds key-value pairs where the key is the movie ID and the value is the number of occurrences of that movie ID.
In order to sort the entries by value, we flip each pair so that the count of occurrences becomes the key and the movie ID becomes the value, storing the result in the flipped_op RDD.
We then call sortByKey on flipped_op to sort the RDD by key, i.e. by the occurrence count of each movie ID.
In this step we collect the sorted results into a Python list named final_results.
Next, we find the number of key-value pairs present in final_results, which is the number of distinct movie IDs in the data set.
We extract the last 10 key-value pairs and find that the most popular movie, with movie ID 50, occurs 583 times in the data set.
We hope this post has been helpful in understanding this Spark use case using Python. In case of any queries, feel free to comment below and we will get back to you at the earliest.
For more resources on Big Data and other technologies, keep visiting acadgild.com.