
Spark Use Case – Olympics Data Analysis

In this blog, we will discuss the analysis of an Olympics dataset using Apache Spark in Scala, as one of our Spark use cases.

The Olympics dataset is publicly available. Using it, we will work through several problem statements, such as finding the number of medals won by each country in swimming and the number of medals won by India year wise.

Data Set Description

The data set consists of the following fields:

Athlete: Name of the athlete

Age: Age of the athlete

Country: The name of the country participating in the Olympics

Year: The year in which the Olympics were held

Closing Date: Closing date of the Olympics

Sport: Name of the sport

Gold Medals: No. of gold medals

Silver Medals: No. of silver medals

Bronze Medals: No. of bronze medals

Total Medals: Total no. of medals
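Since the analysis below accesses fields by position, it may help to see how a single tab-separated record maps onto these fields. This is a minimal plain-Scala sketch; the record shown is illustrative, not an actual row from the dataset:

```scala
// Illustrative record only; the values are hypothetical, not from the real file.
// Each line of the file is tab-delimited, giving 10 fields indexed 0 through 9.
val record = "Michael Phelps\t23\tUnited States\t2008\t8/24/2008\tSwimming\t8\t0\t0\t8"
val fields = record.split("\t")

val athlete = fields(0)       // Athlete
val country = fields(2)       // Country
val sport   = fields(5)       // Sport
val total   = fields(9).toInt // Total Medals
```

The index positions used here (2 for country, 5 for sport, 9 for total medals) are exactly the ones the filters and maps below rely on.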

Dataset Link

https://drive.google.com/drive/folders/0ByJLBTmJojjzVGNsWmpUUUxTZDA

Problem Statement 1

Find the total number of medals won by each country in swimming.

Source code

val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv") // tab-delimited, despite the .csv extension
val counts = textFile.map(_.split("\t")).filter(_.length >= 10)
val fil = counts.filter(x => x(5).equalsIgnoreCase("swimming") && x(9).matches("\\d+"))
val pairs: org.apache.spark.rdd.RDD[(String, Int)] = fil.map(x => (x(2), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()

Description of the Above Code

Line 1: We create an RDD from the dataset stored in HDFS. Note that the file is tab-delimited, despite its .csv extension.

Line 2: We split each record on tabs and keep only the records that have all 10 columns. This avoids an ArrayIndexOutOfBoundsException on the index lookups that follow.

Line 3: From the records that have 10 columns, we keep only those under the sport 'swimming', since we need the number of medals won by each country in swimming. We also check that the 10th column (Total Medals) contains only digits.

Line 4: We create a pair RDD (String, Int) where the key is the country name and the value is the number of medals it won in swimming.

Line 5: We sum the medals per country using the reduceByKey method, and finally display the result using the collect() method.
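For readers new to reduceByKey, its effect can be sketched with plain Scala collections. The pairs here are made up for illustration, not taken from the dataset:

```scala
// Hypothetical (country, medals) pairs standing in for the pair RDD above.
val samplePairs = Seq(("United States", 8), ("Australia", 3), ("United States", 2))

// reduceByKey(_ + _) merges all values of the same key with +,
// which is equivalent to grouping by key and summing the values:
val totals = samplePairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
```

On an RDD, reduceByKey performs this merge in a distributed fashion, combining values on each partition before shuffling, which is why it is preferred over groupByKey for aggregations.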

Output

(Australia,163), (Hungary,9), (Brazil,8), (Canada,5), (Japan,43), (Netherlands,46), (Belarus,2), (Sweden,9), (Serbia,1), (Slovakia,2), (Norway,2), (Denmark,1), (Poland,3), (Trinidad and Tobago,1), (Great Britain,11), (Argentina,1), (Croatia,1), (Lithuania,1), (Zimbabwe,7), (China,35), (Slovenia,1), (South Korea,4), (Italy,16), (Spain,3), (Germany,32), (Costa Rica,2), (France,39), (Tunisia,3), (Ukraine,7), (United States,267), (South Africa,11), (Romania,6), (Russia,20), (Austria,3)

You can see the same in the screenshot below.

Figure 1

Problem Statement 2

Find the number of medals that India won year wise.

Source code

val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv") // tab-delimited, despite the .csv extension
val counts = textFile.map(_.split("\t")).filter(_.length >= 10)
val fil = counts.filter(x => x(2).equalsIgnoreCase("india") && x(9).matches("\\d+"))
val pairs: org.apache.spark.rdd.RDD[(String, Int)] = fil.map(x => (x(3), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()

Description of the Above Code

Line 1: We create an RDD from the dataset stored in HDFS.

Line 2: We split each record on tabs and keep only the records that have all 10 columns. This avoids an ArrayIndexOutOfBoundsException on the index lookups that follow.



Line 3: From the records that have 10 columns, we keep only those for the country 'India', since we need the number of medals won by India. We also check that the 10th column (Total Medals) contains only digits.

Line 4: We create a pair RDD (String, Int) where the key is the year and the value is the number of medals won in that year.

Line 5: We sum India's medals year wise using the reduceByKey method, and finally display the result using the collect() method.

Output

(2008,3), (2004,1), (2000,1), (2012,6)

You can see the same in the screenshot below.

Figure 2

Problem Statement 3

Find the total number of medals won by each country.

Source Code

val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv") // tab-delimited, despite the .csv extension
val counts = textFile.map(_.split("\t")).filter(_.length >= 10)
val fil = counts.filter(x => x(9).matches("\\d+"))
val pairs: org.apache.spark.rdd.RDD[(String, Int)] = fil.map(x => (x(2), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()

Description of the Above Code

Line 1: We create an RDD from the dataset stored in HDFS.

Line 2: We split each record on tabs and keep only the records that have all 10 columns. This avoids an ArrayIndexOutOfBoundsException on the index lookups that follow.

Line 3: From the records that have 10 columns, we keep only those whose 10th column (Total Medals) contains only digits.

Line 4: We create a pair RDD (String, Int) where the key is the country and the value is the number of medals won by that country.

Line 5: We sum the medals per country using the reduceByKey method, and finally display the result using the collect() method.
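The matches("\\d+") guard on the 10th column is what keeps the header row (and any malformed totals) from crashing the toInt conversion. A small plain-Scala sketch with hypothetical values:

```scala
// Hypothetical column values: real totals, the header label, and a blank.
val rawTotals = Seq("12", "Total Medals", "", "7")

// Only strings made entirely of digits survive the filter,
// so the subsequent toInt can never throw a NumberFormatException.
val valid = rawTotals.filter(_.matches("\\d+")).map(_.toInt)
```

This is the same pattern all three problem statements use before building their pair RDDs.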

Output

(Australia,609), (Great Britain,322), (Brazil,221), (Canada,370), (Uzbekistan,19), (Barbados,1), (Japan,282), (Cyprus,1), (Finland,118), (Singapore,7), (Montenegro,14), (Uruguay,1), (Moldova,5), (Colombia,13), (Sweden,181), (Vietnam,2), (Serbia,31), (Iran,24), (Slovakia,35), (Mozambique,1), (Cameroon,20), (Denmark,89), (Turkey,28), (Panama,1), (Saudi Arabia,6), (Hungary,145), (Portugal,9), (Paraguay,17), (Jamaica,80), (Georgia,23), (Dominican Republic,5), (Kyrgyzstan,3), (Netherlands,318), (Iceland,15), (Morocco,11), (Belarus,97), (Mongolia,10), (Kazakhstan,42), (Kenya,39), (Syria,1), (Indonesia,22), (Eritrea,1), (Uganda,1), (Norway,192), (Puerto Rico,2), (Poland,80), (Tajikistan,3), (Grenada,1), (Trinidad and Tobago,19), (Afghanistan,2), (Israel,4)

You can see the same in the screenshot below.

Figure 3

We hope this Spark use case helped you understand dataset analysis using Spark. Keep visiting our site www.acadgild.com for more blogs on Spark and other technologies. Click here to learn Spark from our expert mentors.
