In this blog, we will discuss a use case involving the MovieLens dataset and analyze how movies fare on a rating scale of 1 to 5.
We will start with the data definition by looking at a sample of four records.
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
Column 1: User ID
Column 2: Movie ID
Column 3: Rating
Column 4: Timestamp
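To make the record layout concrete, here is a minimal Python sketch that parses one such whitespace-separated record; the variable names are our own labels for the four columns:

```python
# Parse one whitespace-separated MovieLens record.
# Variable names are illustrative labels, not part of the file format.
record = "196 242 3 881250949"
user_id, movie_id, rating, timestamp = record.split()

print(user_id)  # column 1: User ID   -> '196'
print(rating)   # column 3: Rating    -> '3'
```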
You can download the input file (u.data from the MovieLens 100K dataset) from here.
Below is the code used to count how many ratings fall into each category on the scale of 1 to 5.
from pyspark import SparkConf, SparkContext
import collections

# Create the SparkContext that the script needs in order to build RDDs.
conf = SparkConf().setMaster('local').setAppName('RatingsHistogram')
sc = SparkContext(conf=conf)

# Load the data file; each line of text becomes one element of the RDD.
my_lines = sc.textFile('/home/acadgild/ml-100k/u.data')
# Extract the 3rd column (index 2), the rating, from every record.
ratings = my_lines.map(lambda x: x.split()[2])
# Count the occurrences of each rating value.
res = ratings.countByValue()
# Sort the results by rating and print them.
my_sortedres = collections.OrderedDict(sorted(res.items()))
for key, value in my_sortedres.items():
    print("%s %i" % (key, value))
The import statements at the top of the code bring in SparkConf and SparkContext from the pyspark library that ships with Spark.
SparkContext is the fundamental entry point in Spark: it is what enables us to create RDDs.
The next statement loads the data file by creating an RDD through the sc.textFile method. The data file is loaded into the RDD my_lines, and textFile turns every line of text into one value in the RDD.
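Outside Spark, the effect of textFile on our four sample records can be imitated with plain Python (a sketch for illustration only; Spark itself reads from the file path shown in the code):

```python
# Imitate sc.textFile in plain Python: each line of the file
# becomes one element of a list (an RDD-like collection).
data = (
    "196 242 3 881250949\n"
    "186 302 3 891717742\n"
    "22 377 1 878887116\n"
    "244 51 2 880606923"
)
my_lines = data.splitlines()
print(len(my_lines))  # 4 elements, one per record
```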
In the next statement, each line x is passed to the split function inside a lambda. The 3rd column (the rating) is extracted, and a new RDD named ratings is created with the results.
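The same map-and-split step can be sketched with plain Python lists, using the four sample records from above:

```python
# Extract the 3rd column (index 2, the rating) from each line,
# mirroring my_lines.map(lambda x: x.split()[2]) in Spark.
my_lines = [
    "196 242 3 881250949",
    "186 302 3 891717742",
    "22 377 1 878887116",
    "244 51 2 880606923",
]
ratings = list(map(lambda x: x.split()[2], my_lines))
print(ratings)  # ['3', '3', '1', '2']
```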
Next, the countByValue() action is executed on the ratings RDD to count the occurrences of each rating from 1 to 5, and the results are returned to the driver as a plain Python dictionary-like object stored in res.
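For readers without a Spark cluster at hand, countByValue() behaves much like collections.Counter on an ordinary list, which we can use as a stand-in here:

```python
from collections import Counter

# countByValue() on an RDD is analogous to Counter on a plain list:
# it maps each distinct value to the number of times it occurs.
ratings = ['3', '3', '1', '2']
res = Counter(ratings)
print(res['3'])  # the rating '3' occurs twice
```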
The next line creates an ordered dictionary to sort the results by key, which is the rating.
Finally, the loop iterates through the key-value pairs and prints the total number of ratings that fall into each rating category.
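Putting the last two steps together on the counts from our sample records (hypothetical values for illustration), the sort-and-print stage looks like this:

```python
import collections

# Sort the rating counts by key and print "rating count" pairs,
# exactly as the final loop of the Spark script does.
res = {'3': 2, '1': 1, '2': 1}
my_sortedres = collections.OrderedDict(sorted(res.items()))
for key, value in my_sortedres.items():
    print("%s %i" % (key, value))
```

Sorting the items first guarantees the ratings are printed in ascending order (1 through 5) regardless of the order in which Spark returned them.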
We hope this blog was helpful in understanding the use of Spark with Python. Keep visiting our site www.acadgild.com for more updates on Big data and other technologies.