In this blog series, we will discuss a real-time industry scenario where the spark SQL will be used to analyze the soccer data. Nowadays spark is boon for technology.it is the most active open big data tool which is used to reshape the big data market. spark is 100 times faster than the Hadoop and 10 times faster than the accessing data from the disk.
Advantages of spark for data analysis:
1) Inbuilt machine learning libraries.
2) Efficient in interactive queries and iterative algorithm.
3) Provides highly reliable fast in-memory computation
4) Provides processing platform for streaming data using spark streaming
5) Fault tolerance capabilities because of immutable primary abstraction named RDD.
6) Highly efficient in real-time analytics using spark streaming and Spark SQL.
Now we will start our analysis and we have to do some certain steps
Step 1: Launch Spark shell
Code:
Code Explanation:
Here we have launched spark shell to write application or code. We have used the package from data bricks which will help us to read the data from CSV easily.
Output on the console:
Step 2: Create and Import SQL context.
Code:
Code explanation:
To use the functionalities of spark SQL we create class SQLContext and we import it. By default, the spark context object is initialized with the name sc when we start spark shell.
Output on the console:
3) Step 3: Download importing soccer_data_set.csv file & Creating data frame.
Download the Dataset from the below link.
https://s3.amazonaws.com/acadgildsite/wordpress_images/bigdatadeveloper/Olympics_Analysis/Soccer_Data_Set.docx.csv
Code:
Code explanation:
We have created one variable called data and we have read the CSV file after reading the CSV file we have stored all the data into variable data
Output on the console:
4) Step 4: Create a table from the data.
Code:
Code explanation:
first of all, you can see in the previous we have created a data frame from the CSV file. also, we are creating a temporary table from the data frame in order to execute the queries which we want.
Output on the console:
As a result, we have created a table and so we can fire a set of the query to get the desired output.
Now we have some realtime problems for which we have to find out the solution here are some problems mentioned below with the solution.
1) Find the total number of bronze medals won by each country in Football.
Code:
Output on the console:
2) Find the number of Medals won by the USA grouped by sport.
Code:
Output on the console:
3) Find the total number of medals won by each country displayed by type of medal.
Code:
Output on the console:
4) Find how many Silver medals have been given to Mexico and the year of each.
Code:
Output on the console:
Finally, we have done our analysis and showed the result on the spark console.
We hope the above blog helped you to understand the detailed functioning of HDFS. Keep visiting our site for more updates on Big data and other technologies. Click here to learn Scala language which is used in spark.