In this blog article, we will discuss about a real-time industry scenario where we export the data from excel sheet and perform analytics on the same data. Exporting data from excel sheet is one of the most popular data file formats which is maintained across industries. In this case, we will solve the real-time problems implementing Spark SQL, the most popular and widely used framework in the Spark ecosystem.
Before we go further, Let us understand the dataset provided in the below excel sheet:
The data in the above screenshot represents the customer’s information of an organization. Where the keywords in the excel sheet defines the following:
- Country – This defines to which country the customer belongs.
- Name – This defines the name of the customer.
- OrderId – This defines the order number placed by the customer.
- ItemId – This defines the item number purchased by the customer.
- ItemReturned – This defines whether the purchased item is returned or not.
- CustomerGroup – This group represents customers, where customers may be potential customers who can buy electronics items or home appliances.
- ItemPrice – This defines the price of all the purchased items.
We have already discussed that the dataset will be exported as excel sheet. For this, we should add dependencies in build.sbt file to import the dependent libraries. Since we are working implementing Spark SQL, even this should be included in the build.sbt file. The following snippet of code shows the build.sbt entries:
As per the above snippet of code, we need to include the dependent library for spark-excel. Once the dependent library is included, now create a SparkSession to work with DataFrame as shown in the below code:
Now, let us try to read the data:
Now, provide the following information:
- Provide the type of file format in which data is stored, which is excel in this case
- Provide the stored file location along with the complete path
- Keep the header details as input.
Once we have got this data ready, try to print and see whether data has been loaded successfully or not.
To view the results of few top rows of DataFrame, type the command show() and pass the row numbers as an argument. The below code shows you the best example for viewing top five rows:
And, its output is: –
By executing the above code, it has been validated that the dataset has been loaded correctly. While loading the dataset, we have got the data into a DataFrame using SQLContext.
Here are some of the key technical questions which makes you to understand the concept clear. Let us answer some of the questions:
- Filter out the entire customer belonging to group 400.
Answer: The below code snippet groups the entire customer belonging to group 400.
- Find the number of the customer from each country and find the total item price of all the purchased items from each country.
Answer: The below code snippet finds the number of customers from each country and also finds the total itemprice of all the purchased items for each country.
- Find out the number of customers who belong to INDIA and who have returned a product.
Answer: The below code snippet finds the number of customers from INDIA and who have returned the products.
- Find out the item description of purchased item from each customer.
Answer: In this case, we need to load an additional data file item_desc.xlsx and get the details. Once this additional file is loaded, the data should be joined on customer_sample.xlsx based on the ItemId.
Now, let us join on the common field ItemId.
Hope this blog post helped you understanding E-commerce data analysis using Spark SQL and stay tuned for more Big Data notes. Enroll for the big data and hadoop training with Acadgild and become a successfull Hadoop Developer.