In this blog, we will see how to perform analysis on E-commerce Data using Apache Spark. Using Spark, we can directly refer to the data present in MySQL using the JDBC data source and create a data frame for the whole table in MySQL.
First thing first, you need to integrate MySQL and Spark. If you have no idea how to do that, refer to our blog on Loading Data from MySQL into Spark using JDBC.
100% Free Course On Big Data Essentials
Subscribe to our blog and get access to this course ABSOLUTELY FREE.
val connection="jdbc:mysql://localhost/ecom" //connection to the database val mysql_props = new java.util.Properties //Providing mysql user credentails as properties mysql_props.setProperty("user","root") //providing user name mysql_props.setProperty("password","kiran0541") //providing password val user_account_table = sqlContext.read.jdbc(connection,"user_account",mysql_props) //creating spark dataframe for user_account table in mysql user_account_table.registerTempTable("user_account") //Creating a table in spark for a dataframe val order_details_table = sqlContext.read.jdbc(connection,"order_details",mysql_props) //creating spark dataframe for order_details table in mysql order_details_table.registerTempTable("order_details") //Creating table in saprk for a dataframe val order_item_details_table = sqlContext.read.jdbc(connection,"order_item_details",mysql_props) //creating spark dataframe for order_item_details table in mysql order_item_details_table.registerTempTable("order_item_details") //Creating table in spark for a dataframe val order_items_history_table = sqlContext.read.jdbc(connection,"order_items_history",mysql_props) //creating spark dataframe for order_items_history table in mysql order_items_history_table.registerTempTable("order_items_history") //Creating table in spark for a dataframe
This way we have all the tables in MySQL that are required for analysis as Spark data frames.
Now, let us jump straight into the analysis part.
Find details of the various items purchased by various customers in each quarter of the year separately.
In this problem statement, we need to find out the items purchased by various customers in each quarter of the year. To do this we basically need to find out the item_no from the inventory and the customer’s ac_no from the user_account and the date of the item purchased from the item history.
Here is the Spark SQL query to find out the above details:
val ps1 = sqlContex.sql("select account.ac_no as ac_no,COLLECT_SET(order_item_details.inven_item_no) as item_no_list,SUM(order_item_details.quantity) as quantity,COLLECT_SET(order_items_history.date) as order_date from user_account account JOIN order_details details ON (account.ac_no=details.ac_no) JOIN order_item_details ON (details.order_num=order_item_details.order_num) JOIN order_items_history ON (order_item_details.order_item_no=order_items_history.order_item_no) where order_items_history.date>='2015-05-01' AND order_items_history.date<='2015-08-31' group by account.ac_no")
You can see the result given in the screenshot below:
2. Find out which part of the world has more number of customers and the total purchases made by them in the past.
To evaluate this problem statement, we need to create data frames for two more tables in MySQL, and here’s how to do it:
val address_table = sqlContext.read.jdbc(connection,"address",mysql_props) //creating spark dataframe for address table in mysql address_table.registerTempTable("address") //Creating table in Spark for a dataframe val shipping_details_table = sqlContext.read.jdbc(connection,"shipping_details",mysql_props) //creating spark dataframe for shipping_details table in mysql shipping_details_table.registerTempTable("shipping_details") //Creating table in spark for a dataframe
We have now successfully created data frames for the two tables that we needed.
In this problem statement, we need to find out which part of the world has more number of customers and the total purchases made by them in the past. For this, we will analyze this data state wise and will find which state has the most number of customers and the total purchases made by them.
When analyzing state wise, we will be able to find out customers’ addresses from the address table, the number of customers can be counted from the user_account table and their transactions can be calculated from the user_account table itself.
Here is the Spark SQL query used to evaluate the above problem statement.
val ps2 = sqlContext.sql("select addres.state as state,COLLECT_SET(account.ac_no) as ac_no_list,COUNT(account.ac_no) as count_ac_no,SUM(account.trnsc_amt) as sum_of_amount from user_account account JOIN order_details details ON account.ac_no=details.ac_no JOIN shipping_details ship ON details.order_num=ship.order_num JOIN address addres ON ship.adrss_no=addres.adrss_no group by addres.state")
You can see the result in the screenshot given below:
We have now successfully evaluated the two problem statements that are required for the business development of an e-commerce firm using Spark and Spark SQL very easily with very less overhead.
We hope that this blog helped you in understanding how to perform analysis on a database in MySQL using Spark.