Aviation Data Analysis Using Apache Pig

This blog will help you gain insights into U.S. airline data using Apache Pig. I assume readers of this blog are already familiar with Apache Pig and the various operations that can be performed with it.
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
You can download the datasets from the following links:
Delayed_Flights.csv
Airports.csv
These are two different datasets, Delayed_Flights.csv and Airports.csv. Let us understand them one at a time.
Delayed_Flights.csv Dataset
There are 29 columns in this dataset. Some of them are listed below:

  • Year: 1987 – 2008
  • Month: 1 – 12
  • FlightNum: flight number
  • Cancelled: was the flight cancelled? (1 = yes, 0 = no)
  • CancellationCode: the reason for cancellation

For complete details, refer to this link.
Airports.csv Dataset


  • iata: the international airport abbreviation code
  • airport: the name of the airport
  • city and country: where the airport is located
  • lat and long: the latitude and longitude of the airport

Now, using Apache Pig, we will try to gain more insights from these datasets.

Problem Statement 1

Find out the top 5 most visited destinations.
Source Code

REGISTER '/home/acadgild/airline_usecase/piggybank.jar';
A = load '/home/acadgild/airline_usecase/DelayedFlights.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER');
B = foreach A generate (int)$1 as year, (int)$10 as flight_num, (chararray)$17 as origin,(chararray) $18 as dest;
C = filter B by dest is not null;
D = group C by dest;
E = foreach D generate group, COUNT(C.dest);
F = order E by $1 DESC;
Result = LIMIT F 5;
A1 = load '/home/acadgild/airline_usecase/airports.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER');
A2 = foreach A1 generate (chararray)$0 as dest, (chararray)$2 as city, (chararray)$4 as country;
joined_table = join Result by $0, A2 by dest;
dump joined_table;

In Line 1: We are registering the piggybank jar in order to use the CSVExcelStorage class.
In relation A, we are loading the dataset using CSVExcelStorage because it handles double quotes and headers effectively.
In relation B, we are generating the columns that are required for processing and explicitly typecasting each of them.
In relation C, we are filtering the null values from the “dest” column.
In relation D, we are grouping relation C by “dest.”
In relation E, we are generating the grouped column and the count of each.
Relations F and Result are used to order the counts and limit the output to the top 5.
These are the steps to find the top five most visited destinations. However, by adding a few more steps to this process, we can use another table to find the city name and country as well.
In relation A1, we are loading another table to which we will look-up and find the city as well as the country.
In relation A2, we are generating dest, city, and country from the previous relation.
In relation joined_table, we are joining Result and A2 based on a common column, i.e., “dest.”
Finally, using dump, we are printing the result.
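As a sanity check, the same filter–group–count–order–limit–join pipeline can be sketched in plain Python. The tuple layouts below are illustrative (pre-extracted columns, not the raw CSV rows), and `collections.Counter` stands in for Pig’s GROUP + COUNT:

```python
from collections import Counter

def top_destinations(flight_rows, airport_rows, n=5):
    """Mimic the Pig pipeline: filter nulls, group by dest, count,
    order descending, limit to n, then join against the airports table.

    flight_rows: iterable of (origin, dest) tuples
    airport_rows: iterable of (iata, city, country) tuples
    """
    # FILTER (dest is not null) + GROUP by dest + COUNT, in one pass
    counts = Counter(dest for _, dest in flight_rows if dest)
    # Build the lookup side of the JOIN: iata -> (city, country)
    lookup = {iata: (city, country) for iata, city, country in airport_rows}
    # ORDER ... DESC + LIMIT n, then JOIN on the iata code
    return [(d, c, *lookup.get(d, ("?", "?"))) for d, c in counts.most_common(n)]
```

Each result tuple carries the destination code, its flight count, and the looked-up city and country, just like the joined_table relation above.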

Problem Statement 2

Which month has seen the most number of cancellations due to bad weather?
Source Code

REGISTER '/home/acadgild/airline_usecase/piggybank.jar';
A = load '/home/acadgild/airline_usecase/DelayedFlights.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER');
B = foreach A generate (int)$2 as month,(int)$10 as flight_num,(int)$22 as cancelled,(chararray)$23 as cancel_code;
C = filter B by cancelled == 1 AND cancel_code =='B';
D = group C by month;
E = foreach D generate group, COUNT(C.cancelled);
F= order E by $1 DESC;
Result = limit F 1;
dump Result;

In Line 1: We are registering piggybank jar in order to use the CSVExcelStorage class.
In relation A, we are loading the dataset using CSVExcelStorage because it handles double quotes and headers effectively.
In relation B, we are generating the columns which are required for processing and explicitly typecasting each of them.
In relation C, we are filtering the data based on the cancellation flag and the cancellation code: cancelled == 1 means the flight was cancelled, and cancel_code == 'B' means the reason for cancellation was weather. So relation C contains only the flights cancelled due to bad weather.
In relation D, we are grouping relation C by month.
In relation E, we are counting the cancelled flights for each month.
Relations F and Result order the counts and pick the month with the most cancellations.
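The same logic can be sketched in a few lines of Python (the `(month, cancelled, cancel_code)` tuples are illustrative pre-extracted columns, not raw CSV rows):

```python
from collections import Counter

def worst_weather_month(rows):
    """rows: iterable of (month, cancelled, cancel_code) tuples.
    Returns (month, count) for the month with the most weather
    cancellations -- the FILTER + GROUP + COUNT + ORDER + LIMIT 1 pipeline."""
    counts = Counter(month for month, cancelled, code in rows
                     if cancelled == 1 and code == "B")  # keep only weather cancellations
    return counts.most_common(1)[0]                      # ORDER DESC + LIMIT 1
```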

Problem Statement 3

Find the top ten origins with the highest average departure delay.
Source Code

REGISTER '/home/acadgild/airline_usecase/piggybank.jar';
A = load '/home/acadgild/airline_usecase/DelayedFlights.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER');
B1 = foreach A generate (int)$16 as dep_delay, (chararray)$17 as origin;
C1 = filter B1 by (dep_delay is not null) AND (origin is not null);
D1 = group C1 by origin;
E1 = foreach D1 generate group, AVG(C1.dep_delay);
Result = order E1 by $1 DESC;
Top_ten = limit Result 10;
Lookup = load '/home/acadgild/airline_usecase/airports.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER');
Lookup1 = foreach Lookup generate (chararray)$0 as origin, (chararray)$2 as city, (chararray)$4 as country;
Joined = join Lookup1 by origin, Top_ten by $0;
Final = foreach Joined generate $0,$1,$2,$4;
Final_Result = ORDER Final by $3 DESC;
dump Final_Result;

The explanation of the first three lines is the same as in the previous two problem statements.
In relation C1, we are removing records with null values, if any.
In relation D1, we are grouping the data based on column “origin.”
In relation E1, we are finding the average departure delay for each unique origin.
The relations Result and Top_ten order the averages in descending order and keep the top ten values.
These steps are good enough to find the top ten origins with the highest average departure delay.
However, rather than generating just the code of origin, we will be following a few more steps to find some more details like country and city.
In the relation Lookup, we are loading another table to which we will look up and find the city as well as the country.
In the relation Lookup1, we are generating the origin code, city, and country from the previous relation.
In the relation Joined, we are joining relations Top_ten and Lookup1 based on a common column, i.e., “origin.”
In the relation Final, we are generating required columns from the Joined table.
Finally, we are ordering and printing the results.
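An average per group needs a running sum and a count rather than a simple counter. Here is the core of this pipeline sketched in Python (the `(dep_delay, origin)` tuples are illustrative, with `None` standing in for a null delay):

```python
from collections import defaultdict

def top_avg_dep_delay(rows, n=10):
    """rows: iterable of (dep_delay, origin) tuples.
    Returns [(origin, avg_delay), ...] sorted by average delay, descending --
    the GROUP BY origin + AVG + ORDER DESC + LIMIT n pipeline."""
    sums = defaultdict(lambda: [0.0, 0])   # origin -> [sum of delays, count]
    for delay, origin in rows:
        if delay is None or not origin:    # drop null fields, as in relation C1
            continue
        sums[origin][0] += delay
        sums[origin][1] += 1
    averages = [(origin, total / count) for origin, (total, count) in sums.items()]
    return sorted(averages, key=lambda t: t[1], reverse=True)[:n]
```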

Problem Statement 4

Which route (origin & destination) has seen the most diversions?
Source Code

REGISTER '/home/acadgild/airline_usecase/piggybank.jar';
A = load '/home/acadgild/airline_usecase/DelayedFlights.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER');
B = FOREACH A GENERATE (chararray)$17 as origin, (chararray)$18 as dest, (int)$24 as diversion;
C = FILTER B BY (origin is not null) AND (dest is not null) AND (diversion == 1);
D = GROUP C by (origin,dest);
E = FOREACH D generate group, COUNT(C.diversion);
F = ORDER E BY $1 DESC;
Result = limit F 10;
dump Result;

In Line 1: We are registering piggybank jar in order to use CSVExcelStorage class.
In relation A, we are loading the dataset using CSVExcelStorage because of its effective technique to handle double quotes and headers.
In relation B, we are generating the columns which are required for processing and explicitly type-casting each of them.
In relation C, we are filtering the data for non-null origin and destination and diversion == 1. This removes null records, if any, and keeps only the flights that were diverted.
In relation D, we are grouping the data based on origin and destination.
Relation E finds the count of diversions per unique (origin, destination) pair.
Relations F and Result order the counts and produce the top 10 routes.
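Grouping by a pair of columns maps naturally onto counting `(origin, dest)` tuples. A minimal Python sketch of this pipeline, with illustrative `(origin, dest, diverted)` tuples:

```python
from collections import Counter

def top_diverted_routes(rows, n=10):
    """rows: iterable of (origin, dest, diverted) tuples.
    Counts diversions per route, mirroring GROUP C by (origin, dest)
    + COUNT + ORDER DESC + LIMIT n."""
    counts = Counter((origin, dest) for origin, dest, diverted in rows
                     if origin and dest and diverted == 1)  # FILTER as in relation C
    return counts.most_common(n)                            # ORDER DESC + LIMIT n
```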


We hope this post helped you understand and analyze airline data using Apache Pig. For further updates, keep visiting www.acadgild.com