Here is an Hadoop use case in which, we will perform IBM data analysis which is publicly available. Though the dataset consists of dummy values, fields are good enough to apply to the business logic and perform analysis accordingly. The complete analysis has been done using Apache Hive.
Let us see what the dataset looks like. Below is the link to download the dataset.
Some of the fields are:
- Business travel
Apart from above, there are many more columns.
Let us understand what the flow of our work will look like:
- Loading Raw data into HDFS.
- Data Cleansing using Apache Pig
- Loading validated data into Hive table using HCatlog
- Analysis of validated data
Note: Complete flow have been implemented in Cloudera
Loading data into HDFS:
Below is the command to load data from local file system to HDFS.
hadoop fs -put ibm.csv /
Note: You can load data to any specific location in HDFS. I have loaded into the root.
The command wc -l ibm.csv shows the number of lines dataset have. There are 100 lines including the header.
Data cleansing using Apache Pig
REGISTER /usr/lib/pig/piggybank.jar; inp = load '/ibm.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','YES_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER'); a = foreach inp generate (CHARARRAY)$4 as dept, (INT)$9 as empnum, (CHARARRAY)$11 as gender, (INT)$18 as income, (CHARARRAY)$22 as ot, (INT)$33 as lastpromo; b = filter a by (dept!= 'NULL') AND NOT(dept MATCHES '') AND (empnum is not null) AND (gender!= 'NULL') AND NOT(gender MATCHES '') AND (income is not null) AND (ot!='NULL') AND NOT(ot MATCHES '') AND(lastpromo is not null);
Let’s understand what exactly the above Pig script does.
In the first line, we are registering piggybank jar to leverage CSVExcelStorage class to handle a CSV file that consists of a header, double quotes, etc.
In relation inp, we are loading raw data using CSVExcelStorage.
In relation a, we are generating required columns that will be used to solve our problem statement. Also, we are explicitly typecasting each of them.
In relation b, we are filtering the data from relation a by removing the NULL values if any.
This is it from the cleaning part, we need to send cleansed data to Hive for further analysis.
Loading validated data into Hive table using HCatalog
Once the data is validated, we must send it to Hive. For this, we will use HCatlog. If you are unaware of the process of sending data using HCatalog, I would suggest you go through this blog.
First, you must start the Hive metastore services using below command.
hive –service metastore
Next, create a table in Hive with the same schema as it is in Pig. Below is the Hive script for creating a table.
create table ibmanalysis( dept STRING, empnum INT, gender STRING, income INT, ot STRING, lastpromo INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Above screenshot shows that a Hive table named ibmanalysis is created and as of now there is no data in it.
Now is the time when we will be loading the data from Pig to Hive using HCatalog. Let’s go back to Pig grunt shell and fire below command.
STORE b INTO ‘sumit.ibmanalysis’ USING org.apache.hive.hcatalog.pig.HCatStorer();
Using HCatStorer() , we are loading the data into Hive at ‘sumit.ibmanalysis’ location( where sumit is the database name and ibmanalysis is the Hive table)
We can go back to Hive and check whether the data got loaded in table or not.
Run the below command.
Select * from ibmanalysis
Analysis on validated data
Now, as the data is available, we can perform certain analysis and gain more insights.
Below are the some of the problem statements, however, you can come up with your own problem statements and tweak the complete flow to get results accordingly.
Q1) Find out the employee number and dept of employee who does overtime?
Below is the Hive query for same.
SELECT empnum,dept from validated where ot = 'Yes';
Q2) Find out last 5 employees based on last promotion received?
Below is the Hive query for the same.
select empnum,dept,lastpromo from ibmanalysis order by lastpromo DESC LIMIT 5;
Q3) Find out the list of employee whose income is more than the average income of all the employee’s present in the same department?
select i.empnum,i.income,i.dept from ibmanalysis i INNER JOIN (select dept,avg(income) as sal from ibmanalysis group by dept)t ON (i.dept = t.dept) WHERE (i.income >= t.sal);
Enroll for Big Data and Hadoop Training conducted by Acadgild and become a successful big data developer.