Big Data Hadoop & Spark - Advanced
Trending

Soccer Data Analysis Using Apache Spark SQL (Use Case)

In this blog series, we will discuss a real-time industry scenario where the spark SQL will be used to analyze the soccer data. Nowadays spark is boon for technology.it is the most active open big data tool which is used to reshape the big data market. spark is 100 times faster than the Hadoop and 10 times faster than the accessing data from the disk.

Advantages of spark for data analysis:

 

1) Inbuilt machine learning libraries.

2) Efficient in interactive queries and iterative algorithm.

3) Provides highly reliable fast in-memory computation

4) Provides processing platform for streaming data using spark streaming

5) Fault tolerance capabilities because of immutable primary abstraction named RDD.

6) Highly efficient in real-time analytics using spark streaming and Spark SQL.

Now we will start our analysis and we have to do some certain steps

 

Step 1: Launch Spark shell

Code:

spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

Code Explanation:

Here we have launched spark shell to write application or code. We have used the package from data bricks which will help us to read the data from CSV easily.

Output on the console:

 

Spark Shell

Step 2: Create and Import SQL context.

Code:

import org.apache.spark.sql.SQLContext

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Code explanation:

To use the functionalities of spark SQL we create class SQLContext and we import it. By default, the spark context object is initialized with the name sc when we start spark shell.

Output on the console:

 

Creating SQL Context

 

 

 

 

3) Step 3: Download importing soccer_data_set.csv file & Creating data frame

Download the Dataset from the below link.

https://s3.amazonaws.com/acadgildsite/wordpress_images/bigdatadeveloper/Olympics_Analysis/Soccer_Data_Set.docx.csv

Code:

val data = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",",").load("/home/acadgild/Desktop/Soccer_Data_Set.docx.csv")

Code explanation:

We have created one variable called data and we have read the CSV file after reading the CSV file we have stored all the data into variable data

Output on the console:

 

Reading CSV File

 

 

 

 

4) Step 4: Create a table from the data.

Code:

data.registerTempTable("olympics")

Code explanation:

first of all, you can see in the previous we have created a data frame from the CSV file. also, we are creating a temporary table from the data frame in order to execute the queries which we want.

Output on the console:

 

Creating Temp Table

 

 

 

As a result, we have created a table and so we can fire a set of the query to get the desired output.

Now we have some realtime problems for which we have to find out the solution here are some problems mentioned below with the solution.

 

1) Find the total number of bronze medals won by each country in Football.

Code: 

val result1= sqlContext.sql("select Country,count(Medal) as Medal from olympics where
Sport=='Football' and Medal=='Bronze' group by Country").show()

Output on the console:

 

Result 1

2) Find the number of Medals won by the USA grouped by sport.

Code:

val result2= sqlContext.sql("select Count(Medal) as Medal,country,sport from olympics where
Country=='USA' group by Sport,country").show()

Output on the console:

 

Result 2

3) Find the total number of medals won by each country displayed by type of medal.

Code:

val result3= sqlContext.sql("select Country,Medal,Count(Medal) as Count from olympics group
by Medal,country").show()

Output on the console:

 

Result 3

4) Find how many Silver medals have been given to Mexico and the year of each.

Code:

val result4= sqlContext.sql("select Country,Medal,Count(Medal) as Count,year from olympics
where Medal='Silver' and Country='MEX' group by Medal,country,year").show(false)

Output on the console:

 

Result 4

Finally, we have done our analysis and showed the result on the spark console.

 

We hope the above blog helped you to understand the detailed functioning of HDFS. Keep visiting our site for more updates on Big data and other technologies. Click here to learn Scala language which is used in spark.

Ajit Khutal

Ajit Khutal has been working with AcadGild as an Associate Big Data analyst with expertise in Big Data Technologies like Hadoop, Spark, Kafka, Nifi. He has been a Python enthusiast and been associated with the implementation of many Analytics project related to various domains like E-commerce, Banking, and Education.

One Comment

  1. Everything is working fine till the first SQL query.
    when I ran this : val result1= sqlContext.sql(“select Country,count(Medal) as Medal from olympics where
    Sport==’Football’ and Medal==’Bronze’ group by Country”).show()

    It bursted out and the errors are:
    WARN Hive: Failed to access metastore. This class should not accessed in runtime.
    org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close