In this Spark post, we will work through a case study that calculates the average number of friends by age for the users of a social media website.
Let’s begin by considering a sample of four records.
Column 1: User ID
Column 2: User Name
Column 3: Age of the User
Column 4: Number of friends that user has
You can download the input file from here.
The new RDD, my_lines, is created by calling the textFile function on the SparkContext with the path to our source data. Each line of the comma-separated source file becomes one entry in the RDD.
To view the first 10 records of my_lines, we call the take action on it.
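The loading step can be sketched roughly as follows. This is a minimal sketch rather than the original script: the file name fakefriends.csv, the application name, and the local master setting are assumptions, and the load_lines and main helpers are hypothetical names introduced only for illustration.

```python
def load_lines(sc, path):
    """Return an RDD holding one entry per line of the text file at `path`."""
    return sc.textFile(path)


def main():
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark import SparkConf, SparkContext

    # Assumed configuration: a local master and a made-up application name.
    conf = SparkConf().setMaster("local").setAppName("FriendsByAge")
    sc = SparkContext(conf=conf)

    # Assumed path; replace it with the location of the downloaded input file.
    my_lines = load_lines(sc, "fakefriends.csv")

    # take(10) is an action: it returns the first 10 lines as a Python list.
    for line in my_lines.take(10):
        print(line)

    sc.stop()
```

Calling main() in an environment where PySpark is installed should print the first 10 raw lines of the file.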
Next, we transform my_lines into a new RDD named my_rdd by calling map on it and passing in the parse_line function, which performs the actual mapping.
Every record in my_lines is therefore passed to parse_line, one at a time, and parsed.
Inside parse_line, each line is first split on commas. The two fields we need, the age of the user and the number of friends that user has, are extracted from the third and fourth fields respectively and returned as a pair, which is stored in the new key-value RDD my_rdd.
To view the first 10 records of my_rdd, we call the take action on it.
The results are key-value pairs, with the age of an individual user as the key and that user's number of friends as the value.
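A parsing function along these lines would produce such pairs. It is a sketch under assumptions: converting both fields to int is assumed (the later sums and averages need numbers), build_pairs is a hypothetical helper, and the sample record is made up for illustration rather than taken from the input file.

```python
def parse_line(line):
    """Turn one comma-separated record into an (age, number_of_friends) pair."""
    fields = line.split(",")
    age = int(fields[2])          # third field: age of the user
    num_friends = int(fields[3])  # fourth field: number of friends
    return (age, num_friends)


def build_pairs(my_lines):
    """Apply parse_line to every entry of the my_lines RDD."""
    return my_lines.map(parse_line)


# Made-up record in the four-column layout described earlier.
sample = "0,Will,33,385"
print(parse_line(sample))  # (33, 385)
```

With a live RDD, my_rdd = build_pairs(my_lines) followed by my_rdd.take(10) would show the first 10 pairs.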
We have simplified the complex script by breaking it into multiple statements, each producing an intermediate result.
First, we take my_rdd, whose key-value pairs hold the age of an individual user as the key and that user's number of friends as the value, and call mapValues on it.
mapValues transforms only the value of every key-value pair in the RDD; the key (the age) is left unchanged.
Every value is passed to the function given to mapValues, and the output is a new pair with the age still as the key and a tuple of (number of friends for that user, 1) as the value. The 1 lets us count the users of each age in the next step.
The first 10 records of the new RDD, x, can be displayed by calling the take action on it.
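The effect of this step can be illustrated with a short sketch. The helper names with_count and add_counts are hypothetical, and the pairs are made-up values; the plain-Python list at the end only mimics what mapValues does to each record.

```python
def with_count(num_friends):
    """Pair a friend count with 1 so that users can be counted later."""
    return (num_friends, 1)


def add_counts(my_rdd):
    # mapValues leaves the key (age) untouched and rewrites only the value.
    return my_rdd.mapValues(with_count)


# Plain-Python view of the same transformation on made-up (age, friends) pairs.
pairs = [(33, 385), (33, 2), (26, 10)]
print([(age, with_count(n)) for age, n in pairs])
# [(33, (385, 1)), (33, (2, 1)), (26, (10, 1))]
```

With a live RDD, x = add_counts(my_rdd) followed by x.take(10) would show the first 10 transformed pairs.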
This step sums, for each age (the key), the total number of friends and the number of users of that age; the two totals are held together in the value.
This is done by calling the reduceByKey transformation on the x RDD, which combines all values that share the same key.
The first 10 records of the new RDD can be displayed by calling the take action on the totals_Age RDD.
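The combining function handed to reduceByKey might look like the sketch below. The names add_pairs and total_by_age are hypothetical, and the values are made up; functools.reduce is used only to show on a plain list how the values for a single age are folded together.

```python
from functools import reduce


def add_pairs(a, b):
    """Add two (total_friends, user_count) tuples element by element."""
    return (a[0] + b[0], a[1] + b[1])


def total_by_age(x):
    # reduceByKey merges all values that share the same key using add_pairs.
    return x.reduceByKey(add_pairs)


# Made-up values that would all belong to one age.
values_for_one_age = [(385, 1), (2, 1), (100, 1)]
print(reduce(add_pairs, values_for_one_age))  # (487, 3)
```

With a live RDD, totals_Age = total_by_age(x) would hold one (age, (total_friends, user_count)) record per age.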
This step calculates the average number of friends for every age by passing a lambda function that divides the first element of each value in the previous RDD (the total number of friends for that age) by the second element (the number of users of that age).
The contents of the averages_Age RDD are then collected into my_results, which is an ordinary Python list rather than an RDD.
The final results are displayed with a Python for loop that prints each pair: the age as the key and the average number of friends for that age as the value.
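The averaging, collecting, and printing steps could be sketched as follows. The helper names average and print_averages are hypothetical, using mapValues for the division is an assumption (map would work as well), and Python 3 is assumed, where / always performs floating-point division.

```python
def average(total_and_count):
    """Divide the total number of friends by the number of users."""
    total_friends, user_count = total_and_count
    return total_friends / user_count


def print_averages(totals_Age):
    # Replace each (total_friends, user_count) value with its average.
    averages_Age = totals_Age.mapValues(average)
    # collect() is an action: it returns a Python list of (age, average) tuples.
    my_results = averages_Age.collect()
    for result in my_results:
        print(result)


# Made-up totals: 390 friends shared among 2 users of the same age.
print(average((390, 2)))  # 195.0
```

Note that collect() brings every record back to the driver, which is reasonable here because the result has at most one record per distinct age.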
We hope this post has been helpful in understanding this Spark use case using Python. In case of any queries, feel free to comment below and we will get back to you at the earliest.
For more resources on Big Data and other technologies, keep visiting acadgild.com