In our previous blog, we saw how to review data stored in MySQL using Hive, importing it incrementally with Sqoop. In this blog, let us see how to perform e-commerce data analysis using Spark. Spark can read MySQL data directly through the JDBC data source, which lets us create a DataFrame for an entire MySQL table.
So let us start with the MySQL integration in Spark. You can refer to our blog on Loading Data from MySQL into Spark using JDBC.
val connection="jdbc:mysql://localhost/ecom" //connection to the database
val mysql_props = new java.util.Properties //Providing MySQL user credentials as properties
mysql_props.setProperty("user","root") //providing user name
mysql_props.setProperty("password","kiran0541") //providing password
val user_account_table = sqlContext.read.jdbc(connection,"user_account",mysql_props) //creating spark dataframe for user_account table in mysql
user_account_table.registerTempTable("user_account") //Creating a table in spark for a dataframe
val order_details_table = sqlContext.read.jdbc(connection,"order_details",mysql_props) //creating spark dataframe for order_details table in mysql
order_details_table.registerTempTable("order_details") //Creating table in Spark for a dataframe
val order_item_details_table = sqlContext.read.jdbc(connection,"order_item_details",mysql_props)
//creating spark dataframe for order_item_details table in mysql
order_item_details_table.registerTempTable("order_item_details") //Creating table in spark for a dataframe
val order_items_history_table = sqlContext.read.jdbc(connection,"order_items_history",mysql_props)
//creating spark dataframe for order_items_history table in mysql
order_items_history_table.registerTempTable("order_items_history") //Creating table in spark for a dataframe
This way, all the MySQL tables required for the analysis are now available as Spark DataFrames.
Now, let us jump straight into the analysis part.
1. Find details of the various items purchased by different customers in each quarter of the year.
In this problem statement, we need to find the items purchased by the various customers in a given quarter of the year. For this, we essentially need the item number (inven_item_no) from the order item details, the customer's ac_no from user_account, and the purchase date from order_items_history.
Here is the Spark SQL query to find out the above details:
val ps1 = sqlContext.sql("""select account.ac_no as ac_no,
  COLLECT_SET(order_item_details.inven_item_no) as item_no_list,
  SUM(order_item_details.quantity) as quantity,
  COLLECT_SET(order_items_history.date) as order_date
from user_account account
JOIN order_details details ON (account.ac_no = details.ac_no)
JOIN order_item_details ON (details.order_num = order_item_details.order_num)
JOIN order_items_history ON (order_item_details.order_item_no = order_items_history.order_item_no)
where order_items_history.date >= '2015-05-01' AND order_items_history.date <= '2015-08-31'
group by account.ac_no""")
You can see the result given in the screenshot below:
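To make the aggregation semantics concrete, here is a plain-Scala sketch of what COLLECT_SET and SUM compute per ac_no group in the query above. The case class and values are toy, hypothetical data, not the real schema:

```scala
// Toy rows standing in for the joined result, before grouping.
// The case class and values are illustrative only.
case class OrderItem(acNo: Int, invenItemNo: Int, quantity: Int, date: String)

val rows = Seq(
  OrderItem(1, 101, 2, "2015-05-10"),
  OrderItem(1, 102, 1, "2015-06-02"),
  OrderItem(1, 101, 3, "2015-07-15"),
  OrderItem(2, 103, 4, "2015-05-20")
)

// group by ac_no, then COLLECT_SET(inven_item_no) and SUM(quantity)
val perAccount = rows.groupBy(_.acNo).map { case (ac, items) =>
  ac -> (items.map(_.invenItemNo).toSet, items.map(_.quantity).sum)
}
// account 1 -> (Set(101, 102), 6): two distinct items, six units in total
```

Note that COLLECT_SET, like a Set here, de-duplicates: item 101 appears twice for account 1 but only once in the resulting item list.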
2. Find out which part of the world has the most customers and the total purchases made by them in the past.
To evaluate this problem statement, we need to create DataFrames for two more MySQL tables, as follows:
val address_table = sqlContext.read.jdbc(connection,"address",mysql_props)
//creating spark dataframe for address table in mysql
address_table.registerTempTable("address") //Creating table in Spark for a dataframe
val shipping_details_table = sqlContext.read.jdbc(connection,"shipping_details",mysql_props)
//creating spark dataframe for shipping_details table in mysql
shipping_details_table.registerTempTable("shipping_details") //Creating table in spark for a dataframe
We have now successfully created data frames for the two tables that we needed.
In this problem statement, we need to find which part of the world has the most customers, along with the total purchases made by them. For this, we will analyze the data state-wise and find which state has the most customers and what their total purchases are.
When analyzing state-wise, the customers' addresses come from the address table, the number of customers can be counted from the user_account table, and their transaction amounts can be summed from the user_account table as well.
Here is the Spark SQL query used to evaluate the above problem statement.
val ps2 = sqlContext.sql("""select addres.state as state,
  COLLECT_SET(account.ac_no) as ac_no_list,
  COUNT(account.ac_no) as count_ac_no,
  SUM(account.trnsc_amt) as sum_of_amount
from user_account account
JOIN order_details details ON account.ac_no = details.ac_no
JOIN shipping_details ship ON details.order_num = ship.order_num
JOIN address addres ON ship.adrss_no = addres.adrss_no
group by addres.state""")
You can see the result in the screenshot given below:
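The state-wise logic of ps2 can likewise be sketched in plain Scala with hypothetical values: group by state, count the customers, sum their purchase amounts, and then pick the state with the most customers:

```scala
// Illustrative rows standing in for the joined account/address data;
// names and values here are hypothetical.
case class Sale(state: String, acNo: Int, trnscAmt: Double)

val sales = Seq(
  Sale("CA", 1, 250.0),
  Sale("CA", 2, 100.0),
  Sale("NY", 3, 300.0)
)

// per state: number of distinct customers and total purchase amount
val byState = sales.groupBy(_.state).map { case (st, rs) =>
  st -> (rs.map(_.acNo).toSet.size, rs.map(_.trnscAmt).sum)
}

// the state with the highest customer count
val top = byState.maxBy(_._2._1)._1
```

In the real query the ordering step is left to the reader; a follow-up `order by count_ac_no desc` on the ps2 result would achieve the same thing in Spark SQL.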
We have now successfully evaluated the two problem statements required for the business analysis of an e-commerce firm using Spark and Spark SQL, easily and with very little overhead.
We hope this blog helped you understand how to perform analysis on a MySQL database using Spark.