Big Data Hadoop & Spark

Spark Use Case – Analyzing MovieLens Dataset

In this blog, we will discuss a use case involving MovieLens dataset and try to analyze how the movies fare on a rating scale of 1 to 5.

We will start our discussion with the data definition by considering a sample of four records.

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923

Data Definition

Column 1: User ID

Column 2: Movie ID

Column 3: Rating

Column 4: Timestamp

You can download the input file from here.

Below is the code that is used to calculate the number of movies that are rated on a scale of 1 to 5.

from pyspark import SparkConf,SparkContext
import collections
my_lines = sc.textFile('/home/acadgild/ml-100k/u.data')
ratings = lines.map(lambda x : x.split()[2])
res = ratings.countByValue()
my_sortedres = collections.OrderedDict(sorted(res.items()))
for key,value in sortedres.items():
print ("%s %i" %(key,value))

The first two lines of the code import SparkConf ,SparkContext from pyspark libraries that are present in Spark.

SparkContext is the fundamental starting point in Spark that enables us to create RDDs.

Once the collections are imported, the final results have to be sorted out.
Hadoop

The statement in the screenshot below, loads the data file by creating the RDD through sc.textFile method. The data file is loaded into RDD my_lines and the textFile property breaks every line of text into a value in the RDD.

In the screenshot below, expression X is passed on to split function. The 3rd column is extracted and new RDD is created with the new results.

In the screenshot below, the statement countByValue() is executed on ratings RDD to calculate the occurrence of each and every rating starting from 1 to 5 and results are stored in a new Python object res.

Below is the sample output of the above action applied on ratings RDD.

The code below creates an ordered Dictionary to sort the results based on the key which is rating.

The Python code below iterates through the pair of key and value and prints the total number of occurrence of movies falling under one particular rating category.

We hope this blog was helpful in understanding the use of Spark with Python. Keep visiting our site www.acadgild.com for more updates on Big data and other technologies.

Spark

Tags

One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close