
Big Data Case Study: E-Commerce Data Analysis Using Hadoop

E-commerce, or electronic commerce, is the buying and selling of goods and services online. It draws on technologies such as mobile commerce, electronic funds transfer, supply chain management, and much more. The e-commerce field is growing rapidly across the world: the e-commerce business in India was expected to be worth 38 billion dollars by 2016 and is estimated to reach 159 billion dollars by 2020.
E-commerce firms are growing rapidly all over the world, with millions of transactions made every day, so one needs to analyze that data and draw useful insights from it.
Here, we bring you the business use case of an e-commerce company that wants to analyze its transactions and draw useful insights from them for its business development.
Here is the database design (the E-R diagram) of an e-commerce firm, taken from an IEEE research paper.

Let’s start with the creation of tables according to this E-R diagram.
First, we will create a database to store our tables. We have named our database ecom.

create database ecom;
use ecom;

We have now successfully created the database and moved into it. You can see the same in the screenshot below:

Now, we will create all the tables that are required for our e-commerce website based on the E-R diagram.
Creating a table user_account

create table user_account
(
ac_no int,
user_id varchar(30),
user_pswd varchar(30),
date datetime,
no_of_visit int,
no_of_trnsc int,
trnsc_amt int,
primary key(ac_no)
);

Inserting data

insert into user_account values(125,'a123','pp284','2011-04-14',5,6,100);
insert into user_account values(126,'a124','rr999','2012-06-13',2,3,1000);
insert into user_account values(127,'a125','ab888','2010-07-15',3,9,5000);
insert into user_account values(128,'a126','bb900','2016-05-15',1,8,7000);
insert into user_account values(129,'a127','rt007','2012-07-14',4,5,4000);
insert into user_account values(130,'a128','ss008','2013-03-15',6,2,8000);

Creating table user_session_details

create table user_session_details
(
ac_no int,
session_no int,
ip_address varchar(30),
no_of_clicks int,
date datetime,
primary key(session_no),
foreign key(ac_no) references user_account(ac_no)
);

Inserting data

insert into user_session_details values(130,11,'10.23.130.162',5,'2016-05-14');
insert into user_session_details values(130,10,'11.26.135.167',5,'2014-09-15');
insert into user_session_details values(125,9,'11.58.123.151',3,'2012-10-13');
insert into user_session_details values(127,3,'11.33.111.188',2,'2015-01-14');
insert into user_session_details values(129,15,'11.37.101.142',7,'2014-10-17');
insert into user_session_details values(128,8,'11.67.100.123',6,'2012-09-12');

Creating table shopping_cart_details

create table shopping_cart_details
(
ac_no int,
cart_num int,
active_status boolean,
expire_date datetime,
primary key(cart_num),
foreign key(ac_no) references user_account(ac_no)
);

Inserting data

insert into shopping_cart_details values(128,144,'0','2014-11-15');
insert into shopping_cart_details values(125,178,'0','2016-03-13');
insert into shopping_cart_details values(126,156,'1','2013-07-14');
insert into shopping_cart_details values(127,158,'1','2015-02-15');
insert into shopping_cart_details values(130,161,'1','2016-04-15');
insert into shopping_cart_details values(129,190,'1','2013-01-13');

Creating table inventory_item_details

create table inventory_item_details
(
inven_item_no int,
title int,
price int,
date datetime,
primary key(inven_item_no)
);

Inserting data

insert into inventory_item_details values(188,78,1500,'2013-10-15');
insert into inventory_item_details values(122,54,5000,'2014-09-14');
insert into inventory_item_details values(167,56,4500,'2015-11-11');
insert into inventory_item_details values(134,66,4000,'2014-01-16');
insert into inventory_item_details values(156,35,3000,'2015-12-07');
insert into inventory_item_details values(177,88,6000,'2013-02-17');

Creating table address

create table address
(
adrss_no int,
name varchar(30),
adrss_1 varchar(30),
adrss_2 varchar(30),
city varchar(30),
state varchar(30),
zip int,
primary key(adrss_no)
);

Inserting data

insert into address values(13,'Blossoms','KS layout','Marathalli','Bangalore','Karnataka',560037);
insert into address values(15,'Prestige','Ecity','HSR Layout','Bangalore','Karnataka',560048);
insert into address values(16,'Orchid','vimaan nagar','Whitefield','Pune','Maharashtra',460056);
insert into address values(24,'Alpino','KS layout','connaught place','Delhi','New Delhi',460008);
insert into address values(18,'Galaxy','BTM Layout','Cunningham road','Bangalore','Karnataka',560027);
insert into address values(11,'Pratiksha','TR layout','Andheri East','Mumbai','Maharashtra',760037);

Creating table credit_card_details

create table credit_card_details
(
ac_no int,
adrss_no int,
ccard_no int,
card_no varchar(30),
holder_name varchar(30),
expire_date datetime,
primary key(ccard_no),
foreign key(ac_no) references user_account(ac_no),
foreign key (adrss_no) references address(adrss_no)
);

Inserting data

insert into credit_card_details values(126,13,345,'AB24563789','Shalini','2019-05-12');
insert into credit_card_details values(125,15,679,'HD54654458','Sakshi','2020-06-11');
insert into credit_card_details values(127,24,145,'KK98735476','Kiran','2021-11-18');
insert into credit_card_details values(128,11,211,'AA54673825','Alex','2025-08-14');
insert into credit_card_details values(129,18,978,'CB64738463','Rose','2023-01-16');
insert into credit_card_details values(130,16,667,'BB37845678','Lisa','2021-10-10');

Creating table payment_details

create table payment_details
(
ccard_no int,
payment_no int,
amt int,
state varchar(30),
date datetime,
primary key(payment_no),
foreign key(ccard_no) references credit_card_details(ccard_no)
);

Inserting data

insert into payment_details values(679,80,6600,'success','2014-11-16');
insert into payment_details values(345,67,8000,'success','2014-06-21');
insert into payment_details values(211,56,9000,'success','2015-02-25');
insert into payment_details values(145,78,1000,'failed','2015-05-22');
insert into payment_details values(667,12,5000,'success','2013-12-19');
insert into payment_details values(978,24,2000,'failed','2015-07-11');

Creating table invoice_details

create table invoice_details
  (
  invoice_num int,
  creation_date datetime,
  session_no int,
  ccard_no int,
  adrss_no int,
  primary key(invoice_num),
  foreign key(session_no) references user_session_details(session_no),
  foreign key(ccard_no) references credit_card_details(ccard_no),
  foreign key(adrss_no) references address(adrss_no)
  );

Inserting data

insert into invoice_details values(188,'2014-02-05',11,345,15);
insert into invoice_details values(101,'2014-04-07',10,145,13);
insert into invoice_details values(111,'2015-03-10',03,211,16);
insert into invoice_details values(197,'2015-06-15',09,978,11);
insert into invoice_details values(125,'2014-07-21',08,679,24);
insert into invoice_details values(167,'2015-01-11',15,667,18);

Creating table order_details

create table order_details
(
order_num int,
payment_no int,
ac_no int,
invoice_num int,
order_date datetime,
total_amt int,
state varchar(30),
primary key(order_num),
foreign key(ac_no) references user_account(ac_no),
foreign key(payment_no) references payment_details(payment_no),
foreign key(invoice_num) references invoice_details(invoice_num)
);

Inserting data

insert into order_details values(16,80,125,188,'2015-08-14',6000,'placed');
insert into order_details values(18,56,127,111,'2014-05-12',8000,'cancelled');
insert into order_details values(21,67,126,101,'2015-01-10',9000,'cancelled');
insert into order_details values(28,12,129,197,'2015-02-18',4000,'placed');
insert into order_details values(39,78,128,125,'2014-01-16',7000,'placed');
insert into order_details values(38,24,130,167,'2015-06-19',5500,'not placed');

Creating table shipping_details

create table shipping_details(
adrss_no int,
order_num int,
invoice_num int,
shipping_no int,
shipping_method varchar(30),
shipping_charge int,
state varchar(30),
shipping_date datetime,
primary key(shipping_no),
foreign key(adrss_no) references address(adrss_no),
foreign key(order_num)references order_details(order_num),
foreign key(invoice_num)references invoice_details(invoice_num)
);

Inserting data

insert into shipping_details values(13,16,188,109,'EKART',1000,'delivered','2015-09-12');
insert into shipping_details values(15,18,111,105,'FIRST FLIGHT',6000,'cancelled','2014-01-11');
insert into shipping_details values(24,21,101,110,'DHL',2000,'cancelled','2015-04-09');
insert into shipping_details values(16,39,125,111,'EKART',3000,'cancelled','2014-10-15');
insert into shipping_details values(11,38,197,124,'INDIA POST',4000,'delivered','2015-08-14');
insert into shipping_details values(18,28,167,178,'BLUE DART',5000,'delivered','2015-02-18');

Creating table order_item_details

create table order_item_details
(
order_item_no int,
quantity int,
order_num int,
shipping_no int,
inven_item_no int,
primary key(order_item_no),
foreign key(order_num) references order_details(order_num),
foreign key(shipping_no) references shipping_details(shipping_no),
foreign key(inven_item_no) references inventory_item_details(inven_item_no)
);

Inserting data

insert into order_item_details values(111,13,16,109,188);
insert into order_item_details values(112,10,21,105,167);
insert into order_item_details values(189,9,18,110,122);
insert into order_item_details values(127,8,39,124,156);
insert into order_item_details values(199,14,38,178,134);
insert into order_item_details values(146,15,28,111,177);

Creating table invoice_history

create table invoice_history
(
invoice_state int,
state_desc varchar(30),
notes varchar(30),
date datetime,
invoice_num int,
foreign key(invoice_num) references invoice_details(invoice_num)
);

Inserting data

insert into invoice_history values(1,'Valid','Good','2012-09-24',188);
insert into invoice_history values(0,'Invalid','Not good','2015-01-05',101);
insert into invoice_history values(0,'Invalid','Not good','2012-09-11',111);
insert into invoice_history values(1,'valid','Good','2012-09-09',197);
insert into invoice_history values(1,'valid','Good','2012-09-14',167);
insert into invoice_history values(0,'Invalid','Not good','2014-08-24',125);

Creating table order_items_history

create table order_items_history
(
seq_no int,
amt int,
state varchar(30),
notes varchar(30),
date datetime,
order_item_no int,
primary key (seq_no),
foreign key(order_item_no) references order_item_details(order_item_no)
);

Inserting data

insert into order_items_history values(10,7000,'Delivered','Good','2014-09-12',111);
insert into order_items_history values(11,8000,'cancelled','not good','2015-05-10',112);
insert into order_items_history values(12,9000,'Cancelled','not good','2014-10-08',127);
insert into order_items_history values(13,5000,'Cancelled','Not good','2015-03-14',189);
insert into order_items_history values(14,6000,'Delivered','Good','2015-06-11',199);
insert into order_items_history values(15,3000,'Delivered','Good','2014-07-13',146);

Creating table cart_details

create table cart_details
(
c_no int,
inven_item_no int,
state varchar(30),
cart_num int,
date datetime,
quantity int,
primary key(c_no),
foreign key(inven_item_no)references inventory_item_details(inven_item_no),
foreign key(cart_num) references shopping_cart_details(cart_num)
);

Inserting data

insert into cart_details values(1,188,'0',144,'2016-11-15',13);
insert into cart_details values(2,122,'0',178,'2011-03-13',9);
insert into cart_details values(3,167,'1',156,'2015-07-14',10);
insert into cart_details values(4,134,'1',158,'2015-02-15',14);
insert into cart_details values(5,156,'1',161,'2016-06-15',8);
insert into cart_details values(6,177,'1',190,'2015-01-13',15);

This way, we have created all 14 tables in accordance with the database design, and we have inserted some data into each of them.
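The foreign-key relationships between these tables can be sanity-checked locally. Below is a minimal sketch using Python's sqlite3 instead of MySQL (two of the 14 tables, with toy rows); note that sqlite only enforces foreign keys after the pragma is enabled:

```python
import sqlite3

# Local sanity check of the schema's foreign-key relationships.
# This uses sqlite3, not the MySQL database from the article.
conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # sqlite has FK enforcement off by default
conn.execute("create table user_account (ac_no int primary key)")
conn.execute("""create table shopping_cart_details (
    cart_num int primary key,
    ac_no int,
    foreign key (ac_no) references user_account(ac_no))""")

conn.execute("insert into user_account values (128)")
conn.execute("insert into shopping_cart_details values (144, 128)")  # valid: account 128 exists

rejected = False
try:
    # account 777 does not exist, so the FK constraint should reject this row
    conn.execute("insert into shopping_cart_details values (999, 777)")
except sqlite3.IntegrityError:
    rejected = True
print("dangling ac_no rejected:", rejected)
```

The same dangling-reference insert would fail in MySQL as well, provided the tables use an engine (such as InnoDB) that enforces foreign keys.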

We have our database ready now. The only thing remaining is the Big Data analytics to be done on this data, so first we need to move it into HDFS. We do that using Sqoop.
We will use Sqoop incremental import to move this data from MySQL into HDFS incrementally.
Below are the Sqoop scripts to transfer this data into HDFS incrementally. First, we transfer the data that is already present in each MySQL table. Later, when a MySQL table is updated, we specify the last value in the Sqoop import command so that only the newly added records get imported into HDFS.
Sqoop Incremental Import
To transfer user_account data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table user_account --target-dir '/ecom/user_account' --incremental append --check-column ac_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 130, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table user_account --target-dir '/ecom/user_account' --incremental append --check-column ac_no --last-value 130 -m 1 --driver com.mysql.jdbc.Driver
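The append logic behind these commands is simple: Sqoop selects only the rows whose check column exceeds the recorded last value, and the maximum imported value becomes the --last-value for the next run. A toy sketch of that logic in Python with sqlite3 (this is not Sqoop itself, just the selection it performs):

```python
import sqlite3

# Toy sketch of what --incremental append does: pull only rows whose
# check column (ac_no) exceeds the recorded --last-value.
conn = sqlite3.connect(":memory:")
conn.execute("create table user_account (ac_no int, user_id text)")
conn.executemany("insert into user_account values (?, ?)",
                 [(129, 'a127'), (130, 'a128'), (131, 'a129'), (132, 'a130')])

last_value = 130  # noted from the previous import's job metrics
new_rows = conn.execute(
    "select ac_no, user_id from user_account where ac_no > ? order by ac_no",
    (last_value,)).fetchall()
print(new_rows)                       # only the newly added records

next_last_value = max(r[0] for r in new_rows)
print(next_last_value)                # record this --last-value for the next run
```

Rows 131 and 132 are the only ones imported, and 132 becomes the last value to pass on the next run.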

To transfer credit_card_details data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table credit_card_details --target-dir '/ecom/credit_card_details' --incremental append --check-column ccard_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 978, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table credit_card_details --target-dir '/ecom/credit_card_details' --incremental append --check-column ccard_no --last-value 978 -m 1 --driver com.mysql.jdbc.Driver

To transfer address data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table address --target-dir '/ecom/address' --incremental append --check-column adrss_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 24, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table address --target-dir '/ecom/address' --incremental append --check-column adrss_no --last-value 24 -m 1 --driver com.mysql.jdbc.Driver

To transfer payment_details data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table payment_details --target-dir '/ecom/payment_details' --incremental append --check-column payment_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 80, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table payment_details --target-dir '/ecom/payment_details' --incremental append --check-column payment_no --last-value 80 -m 1 --driver com.mysql.jdbc.Driver

To transfer shipping_details data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table shipping_details --target-dir '/ecom/shipping_details' --incremental append --check-column shipping_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 178, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table shipping_details --target-dir '/ecom/shipping_details' --incremental append --check-column shipping_no --last-value 178 -m 1 --driver com.mysql.jdbc.Driver

To transfer order_items_history data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table order_items_history --target-dir '/ecom/order_items_history' --incremental append --check-column seq_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 15, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table order_items_history --target-dir '/ecom/order_items_history' --incremental append --check-column seq_no --last-value 15 -m 1 --driver com.mysql.jdbc.Driver

To transfer order_item_details data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table order_item_details --target-dir '/ecom/order_item_details' --incremental append --check-column order_item_no -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 199, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table order_item_details --target-dir '/ecom/order_item_details' --incremental append --check-column order_item_no --last-value 199 -m 1 --driver com.mysql.jdbc.Driver

To transfer order_details data

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table order_details --target-dir '/ecom/order_details' --incremental append --check-column order_num -m 1 --driver com.mysql.jdbc.Driver

After running this job, you will get the “last value” in the job metrics. Note down the last value of the check column; the next time you want to import newly added data, you need to mention this last value. Here, the last value is 39, and we mention it in the Sqoop command as shown below:

sqoop import --connect jdbc:mysql://localhost/ecom --username 'root' -P --table order_details --target-dir '/ecom/order_details' --incremental append --check-column order_num --last-value 39 -m 1 --driver com.mysql.jdbc.Driver

We have successfully imported the data that is required for our analysis into HDFS. You can see the same in the screenshot below:

Let’s see how to analyze this data using Hive.
First, we will create a database in Hive as shown below:

create database ecom;
use ecom;

We have successfully created the database and entered it.

Now, we will create external tables in Hive that reference the locations of the data in HDFS. Even if the data is updated, we can query the updated data in Hive, because the tables only point to the data's location. Even if we drop a table, the data will still be present in that directory.
Creating user_account table

create external table user_account(
ac_no int,
user_id STRING,
user_pswd STRING,
date1 string,
no_of_visit int,
no_of_trnsc int,
trnsc_amt int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/ecom/user_account';


Creating order_details table

create external table order_details(
order_num int,
payment_no int,
ac_no int,
invoice_num int,
order_date string,
total_amt int,
state string)
ROW FORMAT delimited
FIELDS TERMINATED BY ','
LOCATION '/ecom/order_details';


Creating order_item_details table

create external table order_item_details(
order_item_no int,
quantity int,
order_num int,
shipping_no int,
inven_item_no int)
ROW FORMAT delimited
FIELDS TERMINATED BY ','
LOCATION '/ecom/order_item_details';


Creating order_items_history table

create external table order_items_history(
seq_no int,
amt int,
state string,
notes string,
date1 string,
order_item_no int)
row format delimited
fields terminated by ','
location '/ecom/order_items_history';


Creating shipping_details table

create external table shipping_details(
adrss_no int,
order_num int,
invoice_num int,
shipping_no int,
shipping_method string,
shipping_charge int,
state string,
shipping_date string)
row format delimited
fields terminated by ','
location '/ecom/shipping_details';


Creating payment_details table

create external table payment_details(
ccard_no int,
payment_no int,
amt int,
state string,
date1 string)
row format delimited
fields terminated by ','
location '/ecom/payment_details';


Creating address table

create external table address(
adrss_no int,
name string,
adrss_1 string,
adrss_2 string,
city string,
state string,
zip int)
row format delimited
fields terminated by ','
location '/ecom/address';


Creating credit_card_details table

create external table credit_card_details(
ac_no int,
adrss_no int,
ccard_no int,
card_no string,
holder_name string,
expire_date string)
row format delimited
fields terminated by ','
location '/ecom/credit_card_details';


Here, we have created the 8 Hive tables required for our analysis, and the data is populated into them from their HDFS locations. You can see the 8 tables in the screenshot below:

Now, let’s jump to the analysis part.
1. Find the details of various items purchased by various customers in each quarter of the year separately.
In this problem statement, we need to find the various items purchased by various customers in each quarter of the year. We need the item_no from the inventory, the customer's ac_no from user_account, and the date on which each item was purchased from the order items history.
Presented below is the Hive query to find these details.

select account.ac_no as ac_no,
       COLLECT_SET(order_item_details.inven_item_no) as item_no_list,
       SUM(order_item_details.quantity) as quantity,
       COLLECT_SET(order_items_history.date1) as order_date
from user_account account
JOIN order_details details ON (account.ac_no = details.ac_no)
JOIN order_item_details ON (details.order_num = order_item_details.order_num)
JOIN order_items_history ON (order_item_details.order_item_no = order_items_history.order_item_no)
where order_items_history.date1 >= '2015-05-01' AND order_items_history.date1 <= '2015-08-31'
group by account.ac_no;

You can see the results in the screenshot below:
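The join logic of this query can be checked locally on toy rows with Python's sqlite3 (a sketch, not Hive: sqlite's GROUP_CONCAT stands in for Hive's COLLECT_SET):

```python
import sqlite3

# Sketch of the quarter-wise purchase query on toy data (sqlite3, not Hive).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table user_account (ac_no int, trnsc_amt int);
create table order_details (order_num int, ac_no int);
create table order_item_details (order_item_no int, quantity int,
                                 order_num int, inven_item_no int);
create table order_items_history (order_item_no int, date1 text);
insert into user_account values (125, 100), (127, 5000);
insert into order_details values (16, 125), (18, 127);
insert into order_item_details values (111, 13, 16, 188), (189, 9, 18, 122);
-- only account 125's purchase falls inside the May-August window
insert into order_items_history values (111, '2015-06-10'), (189, '2015-03-14');
""")
rows = cur.execute("""
select a.ac_no,
       group_concat(oid.inven_item_no) as item_no_list,
       sum(oid.quantity)               as quantity
from user_account a
join order_details od       on a.ac_no = od.ac_no
join order_item_details oid on od.order_num = oid.order_num
join order_items_history h  on oid.order_item_no = h.order_item_no
where h.date1 between '2015-05-01' and '2015-08-31'
group by a.ac_no
""").fetchall()
print(rows)  # account 127's March order is filtered out by the date range
```

Only account 125's in-quarter purchase survives the date filter, which is exactly how the Hive query isolates one quarter at a time.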

  2. Find out which part of the world has the most customers and the highest total purchases.

In this problem statement, we need to find which part of the world has the largest number of customers and their total purchases. For this, we examine the details state-wise: how many customers each state has, and those customers' total purchases.
We get the states from the address table; the customers are counted from the user_account table, and their transaction amounts are also taken from the user_account table.
Given below is a Hive query that evaluates the above problem statement.

select addres.state as state,
       COLLECT_SET(account.ac_no) as ac_no_list,
       COUNT(account.ac_no) as count_ac_no,
       SUM(account.trnsc_amt) as sum_of_amount
from user_account account
JOIN order_details details ON account.ac_no = details.ac_no
JOIN shipping_details ship ON details.order_num = ship.order_num
JOIN address addres ON ship.adrss_no = addres.adrss_no
group by addres.state;

You can see the result in the screenshot below:
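As before, the state-wise aggregation can be exercised locally on toy rows with sqlite3 (a sketch, not Hive; the COLLECT_SET column is omitted for brevity):

```python
import sqlite3

# Sketch of the state-wise customer/purchase query on toy data (sqlite3, not Hive).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table user_account (ac_no int, trnsc_amt int);
create table order_details (order_num int, ac_no int);
create table shipping_details (shipping_no int, order_num int, adrss_no int);
create table address (adrss_no int, state text);
insert into user_account values (125, 100), (127, 5000), (128, 7000);
insert into order_details values (16, 125), (18, 127), (39, 128);
insert into shipping_details values (109, 16, 13), (105, 18, 15), (111, 39, 16);
insert into address values (13, 'Karnataka'), (15, 'Karnataka'), (16, 'Maharashtra');
""")
rows = cur.execute("""
select ad.state,
       count(a.ac_no)   as count_ac_no,
       sum(a.trnsc_amt) as sum_of_amount
from user_account a
join order_details od    on a.ac_no = od.ac_no
join shipping_details sh on od.order_num = sh.order_num
join address ad          on sh.adrss_no = ad.adrss_no
group by ad.state
""").fetchall()
for row in rows:
    print(row)
```

Each customer's account is traced through their order to the shipping address, so the count and amount roll up per state.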

Even when the data is updated, we can run the same queries in Hive and get updated results, because our Hive tables reference an external path in HDFS where the data resides.
Enroll for Big Data and Hadoop Training conducted by Acadgild and become a successful big data developer.
