This blog will help you learn, how to perform aviation data analysis for gaining some insights on the U.S. Airline data using Apache Hive and Pig. I hope readers of this blog are aware of what Apache Hive is and various operations that can be performed using it.
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights appear in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
Aviation Data Analysis
Download the datasets from the following link:
Let us understand few columns of this dataset
There are 29 columns in this dataset. Some of them are:
- Year: 1987 – 2008
- Month: 1 – 12
- FlightNum: Flight number
- Canceled: Was the flight canceled?
- CancelleationCode: The reason for cancellation.
For complete details, refer to this link
Before we proceed, check the diagram below to understand the flow which is considered in this blog for processing data.
Step 1: Loading raw data into the root directory of HDFS.
Command: hadoop fs -put /home/cloudera/DelayedFlights.csv /
Step 2: Pre-processing using Pig.
This is also known as data cleansing phase.
In Line 1: As the file is comma separated, we will register and use piggybank jar in order to use the CSVExcelStorage class.
In relation A, we are loading the data using CSVExcelStorage because of its effective technique to handle double quotes and headers.
In relation B, we are generating the columns that are required for processing and explicitly typecasting each of them.
The question may pop up in your mind on my columns selection as why only these columns are considered? Don’t worry, you will understand the reason shortly
In relation C, we are filtering out the null values if any, from the generated columns.
The pre-processing steps might look easy, but it is one of the major steps in any of the production level projects as the data might not always be properly structured and also you may not need all the columns to perform your analysis. Considering complete columns even when it does not have any role in the analysis will only lead to increased processing time. I think now you got the answer of question (in Relation B).
Step 3: Loading pre-processed data from pig to hive using HCatalog.
Once the data is cleaned, we need to transfer it to process and gain insights. I will be using HCatalog and sending the cleansed data directly from pig to hive using it. If you perform this step for the first time, you might face few difficulties. So, I would suggest, you go through our blog on loading data from pig to hive using HCatlog for a detailed explanation.
3.a) You need to start hive metastore service before loading data using HCatalog.
Command: hive –service metastore
3.b) Next, create a hive table with the same schema as you had pre-processed in the Pig.
You need to maintain the same order as well while you create the table. You can see that we only need 8 columns out of 29 for our analysis.
3.c) Use below command in pig grunt shell to load the data to hive.
Syntax: STORE relation_name INTO ‘hive_table’ USING org.apache.hive.hcatalog.pig .HCatStorer();
This will load the data into hive table which we had already created.
You can cross check the same using SELECT * FROM aviation; in hive shell.
Step 4: Processing the data using Apache Hive
Here, we have few problem statements which we will be solving using hive.
PROBLEM STATEMENT 1:
Which month have seen the most number of cancellation due to bad weather?
SELECT month,COUNT(canceled) as t FROM aviation
WHERE canceled = 1 AND canel_code = ‘B’
GROUP BY month
ORDER BY t DESC
PROBLEM STATEMENT 2:
Top 10 route(origin and dest) that has seen maximum diversions?
SELECT origin,dest,COUNT(diversion) as t FROM aviation
WHERE diversion = 1
GROUP BY origin,dest
ORDER BY t DESC
PROBLEM STATEMENT 3:
Top 5 visited destination.
SELECT dest,COUNT(dest) as x FROM aviation
GROUP BY dest
ORDER BY x DESC
Step 5: Data Visualization using Tableau
You can connect Hive with tableau to visualize your data and generate reports. If you want to understand in deep, I would suggest you follow our blog on how to connect hive with tableau.
As I’m using Cloudera, I can easily connect Hive Cloudera Hadoop with Tableau. After this, you can use this report generating tool for visualization. Below is the image for “Destination” Vs “Count of Visitors to that destination”.
Hope this blog helped you in understanding/implementing a complete use case. Please check www.acadgild.com for enhancing hands-on expertise.