In this Spark post, we will work through a case study to find the minimum temperature observed at a given weather station in a particular year.
Let’s begin by considering a sample of four records.
Column 1: Weather Station
Column 2: Date (year/month/day)
Column 3: Observation Type
Column 4: Temperature
The new RDD, lines, is created by calling the textFile function on the SparkContext with our source data. Every line of that comma-separated source data becomes an individual entry in the RDD.
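As a rough sketch of this step, assuming sc is an already-created SparkContext and the file path is supplied by the caller. The helper name load_lines and the sample records are hypothetical; the records are made up only to match the four-column layout above.

```python
# Hypothetical records in the four-column layout described above
# (station, date, observation type, temperature). Not real data.
SAMPLE_TEXT = """STATION_A,1800/01/01,TMAX,-75
STATION_A,1800/01/01,TMIN,-148
STATION_B,1800/01/01,PRCP,0
STATION_B,1800/01/01,TMIN,-135"""


def load_lines(sc, path):
    """Create the lines RDD: one string element per line of the file.

    sc is assumed to be an existing SparkContext; path is wherever the
    comma-separated source data lives.
    """
    return sc.textFile(path)


# Without a cluster, the same "one entry per line" idea looks like this:
local_lines = SAMPLE_TEXT.splitlines()
print(local_lines[0])
```

On a real cluster you would call load_lines(sc, path) instead of splitting a string; the local list is only there to show what each RDD entry looks like.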
We transform the lines RDD into a new RDD named parsed_res by calling map on it and passing in the parse_Line function, which performs the actual mapping.
Hence, every record from the lines RDD is passed to the parse_Line function one by one and parsed.
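The parsing step could look roughly like the sketch below. The body of parse_Line is an assumption: it takes the field positions from the column list above and treats the temperature as a number. The wrapper name parse_lines is hypothetical.

```python
def parse_Line(line):
    """Split one comma-separated record into (station, obs_type, temperature).

    Field positions follow the column list above; converting the
    temperature to float is an assumption about the data.
    """
    fields = line.split(",")
    station_id = fields[0]
    obs_type = fields[2]
    temperature = float(fields[3])
    return (station_id, obs_type, temperature)


def parse_lines(lines):
    """Apply parse_Line to every element of the lines RDD."""
    return lines.map(parse_Line)


# Trying the parser on one made-up record, no Spark needed:
print(parse_Line("STATION_A,1800/01/01,TMIN,-148"))
```

In the flow described here, parsed_res would be the result of parse_lines(lines).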
Our raw weather data includes the minimum temperature, the maximum temperature and the amount of precipitation. Since we only need the minimum temperature at a weather station, we discard the maximum temperature and precipitation records and pass along only the records that hold minimum temperature details.
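A minimal sketch of this filter, assuming parsed records are (station, observation type, temperature) tuples and that minimum temperature records carry the label "TMIN". The label and both function names are assumptions, not taken from the post.

```python
MIN_LABEL = "TMIN"  # assumed observation-type label for minimum temperature


def is_min_temperature(record):
    """True when the parsed record is a minimum temperature observation."""
    return record[1] == MIN_LABEL


def keep_min_records(parsed_res):
    """Filter the parsed RDD down to minimum temperature records only."""
    return parsed_res.filter(is_min_temperature)


# Local illustration with made-up parsed records:
sample_parsed = [
    ("STATION_A", "TMAX", -75.0),
    ("STATION_A", "TMIN", -148.0),
    ("STATION_B", "PRCP", 0.0),
    ("STATION_B", "TMIN", -135.0),
]
min_only = [r for r in sample_parsed if is_min_temperature(r)]
print(min_only)
```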
In the step below, we strip out the observation type, which is the second field of each parsed record, and extract just the weather station ID and the minimum temperature.
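One way to write this step, again assuming three-field parsed records; the helper names to_station_temp and build_temp_station are hypothetical.

```python
def to_station_temp(record):
    """Drop the observation type and keep (station_id, temperature)."""
    station_id, _obs_type, temperature = record
    return (station_id, temperature)


def build_temp_station(min_records):
    """Map the filtered RDD to key-value pairs keyed by station ID."""
    return min_records.map(to_station_temp)


# Local check on one made-up record:
print(to_station_temp(("STATION_A", "TMIN", -148.0)))
```

Returning a two-element tuple matters here: Spark treats such tuples as key-value pairs, which is what the later by-key aggregation relies on.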
To see the first 10 records of the temp_station RDD, the take action is called on it.
The results are the key-value pairs with the weather station ID as key and minimum temperature in that weather station as value.
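For reference, a small sketch of previewing an RDD with take. The helper name preview is hypothetical, and the sample pairs are made up.

```python
def preview(rdd, n=10):
    """Return at most n elements of the RDD as a plain Python list."""
    return rdd.take(n)


# Local stand-in: slicing a list gives the same "first n elements" idea.
sample_pairs = [("STATION_A", -148.0), ("STATION_B", -135.0)]  # made-up values
first_pairs = sample_pairs[:10]
print(first_pairs)
```

Unlike collect, take only brings a handful of elements back to the driver, which makes it a safer way to peek at a large RDD.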
In the step below, the reduceByKey transformation is called. It aggregates every minimum temperature observed for each weather station ID, and the lambda function determines how that aggregation is done.
Two minimum temperature observations for a given weather station ID are combined, and the min function keeps the lower of the two. This process continues as more observations are fed in, and in the end only the lowest temperature for each weather station ID survives.
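The aggregation can be sketched as below. build_temp_min shows the Spark call; reduce_by_key_locally is a hypothetical pure-Python stand-in that folds values per key in the same pairwise way, so the idea can be tried without a cluster. The sample pairs are made up.

```python
def build_temp_min(temp_station):
    """Keep the lowest temperature per station ID using reduceByKey."""
    return temp_station.reduceByKey(lambda x, y: min(x, y))


def reduce_by_key_locally(pairs, func):
    """Pure-Python illustration of a by-key pairwise reduction."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Combine the running value with the new observation.
            result[key] = func(result[key], value)
        else:
            # First observation for this key.
            result[key] = value
    return result


sample_pairs = [  # made-up (station_id, temperature) pairs
    ("STATION_A", -148.0),
    ("STATION_B", -135.0),
    ("STATION_A", -160.0),
    ("STATION_B", -120.0),
]
print(reduce_by_key_locally(sample_pairs, min))
```

Because min is associative and commutative, Spark is free to combine observations in any order across partitions and still arrive at the same result.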
To display the first 10 records of the temp_min RDD, the take action is called on it.
The results are the key-value pairs with the weather station ID as key and minimum temperature observed in that weather station throughout the year as value.
The results of the temp_min RDD are collected in my_results, which is a Python list.
The final results are displayed by using a for loop in Python to print the weather station ID as key and the minimum temperature for that year as value.
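A minimal sketch of this last step, assuming temp_min holds (station_id, temperature) pairs. The output format and the helper names format_result and print_results are our own choices for illustration.

```python
def format_result(pair):
    """Render one (station_id, temperature) pair as a printable line."""
    station_id, temperature = pair
    return "{}\t{:.2f}".format(station_id, temperature)


def print_results(temp_min):
    """Collect the RDD into a Python list and print each entry."""
    my_results = temp_min.collect()
    for result in my_results:
        print(format_result(result))


# Local check with a made-up pair:
print(format_result(("STATION_A", -148.0)))
```

Calling collect is reasonable here because the result holds only one entry per weather station; for large RDDs it would pull everything onto the driver.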
We hope this post has been helpful in understanding this Spark use case using Python. In case of any queries, feel free to comment below and we will get back to you at the earliest.
For more resources on Big Data and other technologies, keep visiting acadgild.com