Data Analytics with R, Excel & Tableau
Trending

Data Manipulation using R

Data manipulation using dplyr package on Houston flights data with R.

In this article, we will be performing data manipulation operations using the dplyr package on Houston flights dataset which is available in R.

First of all, what is a data manipulation?

Data manipulation is an operation which is performed on an existing dataset in order to generate the objective of a result.  furthermore, these operations can be finding patterns, queries, sorting, filtering, removing a field with empty data column values, modify/update a given data and more.

To achieve all the above operations we will be using the Huston flight sample data set.

Here we will be using two packages i.e in our examples,

  • dplyr
  • hflights.

dplyr Package

Dplyr is a package which is used to perform data manipulation operations, It introduces,  easy-to-use functions to reveal new variables and new observations in new ways for the reason to describe data that are very handy when performing exploratory data analysis and data manipulation.

Hflights package

Hflights package contains Houston airports dataset which is available with R by default. This dataset is created by the Bureau of transportation statistics for research and innovation technology administration. This dataset with 227,496 rows and 21 columns- Variables.

Hflights dataset column(variable) description:

Year, Month, DayofMonth; DayOfWeek; DepTime, ArrTime; UniqueCarrier; FlightNum; TailNum; ActualElapsedTime; AirTime, ArrDelay, DepDelay; Origin, Dest origin and destination airport codes; Distance; TaxiIn, TaxiOut; Cancelled: cancelled indicator: (1 = Yes, 0 = No); CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security; Diverted: diverted indicator: (1 = Yes, 0 = No).

Package Installation

First, we will install dplyr package to perform data manipulation operations and hflights package to use Houston flights dataset.

In the below code we can see how to install packages.

> install.packages(“hflights”)

> install.packages(“dplyr”)

And load the library function of dplyr package and hflights as well.

install dplyr package

We can use help() function to see the description of Houston flights data in the right side on the Rstudio help section.

View Data

By using view() we can see how actual dataset looks.

dataset

Here in the above table, we can see hflights dataset.

Structure

Now by using Str(), you can see the structure of hflights.  

structure

In the above console, we can see that it is a data frame and 227496 objects of 21 variables. Also, variables having integer and character in it.

Summary

Let us see the summary of the dataset by using the summary() function.

data summary

Here in console, it is showing the mean, median, mode, class, NA’s values.

pipe operator %>%

If you want to select() from select function for any specific columns from the data frame you have to use pipe operator %>% which comes inside the dplyr package. Here Pipe operator trying pipe the operation select with the df, taking the df and select these three columns i.e; Airtime, TaxiIn, TaxiOut.

Updating Data Frames or modifying data frames.

i) Select() function

df is very large which we don’t want to print because it may take longer. Instead, we should store the query into a updated_df object.

modify data frame

In the environment section, we can see the new updated_df  has same 227469 observations but the number of the column has changed because we have selected only  3 columns.

If we want all the variables except few those 3 variables we have selected we can use -c so these column name will not be present.

So Let’s check it out by the code above to drop variables.

machine learning

In the environment section, we can see the new updated_df  has same 227469 observations but the number of the column has changed because we have dropped selected variables and a total of 18 remaining variables are showing.

ii) Sampel_frac()

Now let’s see with sample_frac() function to create a new data frame to find the small sample out of large dataset we have and we do not want to store all the data.

Following the above code, we have taken a sample size of 0.2 i.e; 20% of the dataset. So if you want to do some kind of analysis in the same subset of the data and you can do it by selecting a random data sample by using function sample_frac.

data analytics

In the environment section, we can see the new sample updated_df  has some 45499 observations that are 20% of the original dataset and the number of the column remains unchanged because by sample_frac we are selecting only random numbers of observation.

iii)  mutate() function

We are using mutate function to create a new column or new variables in existing data.

In the below code we have used pipe operator %>%  trying to pipe the operation select with the df and store into the updated_df object.

On the right-hand side of the R studio, we can see in the environment section to see updated_df with 22 variables which have been added created by using mutate function.

By clicking on updated_df we can easily view the modified dataset.

updated data frame

The above data set is showing the AvgSpeed added into the new column.

iv) group_by() function and summarise() function.

Now let us see how to group the data by using group_by() function and summarise function.

In the below code we want to see Month and avg_delay we have to use group_by(), to group the data frame. group_by() function generally used to groups the data frame by multiple columns with mean, sum or other functions.

Also, summarise() function is to reduce the data by grouping. So you can group with the help of group function and summarise it.

On the right-hand side of the R studio, we can see in the environment section to see updated_df with 22 variables which have been added created by using mutate function.

By clicking on updated_df we can easily view the modified dataset.

Here in the updated_df using group_by() function we can see Monthly and the avg_delay columns.

Likewise, if we want to see DayOfWeeks and avg_delay only. We will follow the same procedure. We have to use group_by() to group the data frame and summarise it by reducing the data by using summarise() function.

On the right-hand side of the R studio, we can see in the environment section to see updated_df which has 7 observations and 2 variables which has been created by using group_by() function.

By clicking on updated_df we can easily view the modified dataset.

Here in the updated_df using group_by() function we can see Monthly and the avg_delay columns.

If you want to find the information of flight which has high departure time, We are grouping the flight number by calculating the average departure delay for each flight number.

Next, we will follow the same steps as we have done for DayOfWeeks and avg_delay.

Here, are using group_by() function to group the data frame and summarise it by reducing the data by using summarise() function.

groupby function

On the right-hand side of the R studio, we can see in the environment section to see the updated_df which has 3740 observations and 2 variables which has been created by using group_by() function.

Rstudio

Click on the updated_df dataframe to view the modified dataset. From the below table we can see the flights (flight numbers) that have the minimum average delay and maximum delay.

minimum meanmaximum mean

Here in the updated_df dataframe using group_by() function we can see flight number and the avg_delay columns. 1817 is the flight number which has the minimum average delay of -10.0000 minutes (i.e, before the arrival time the plane has landed ) and 4493 is the flight number that has maximum departure delay of 281 minutes (that means it is delayed by this many minutes after the departure time).

Now say you want the information of departure delay average for each month and for each flight separately, for this objective we have to group two variables i.e, Month and FlightNum.

We will follow the same steps as we have done to find the FlightNum and the avg_delay in the previous scenario.

We have to use group_by() function to group the data frame and summarise it by reducing the data by using summarise() function.

summarise function

On the right-hand side of the R studio, we can see in the environment section to see updated_df dataframe which has 14872 observations and 3 variables which has been created by using group_by() function.

data frame

Click on the updated_df dataframe to view the modified dataset. From the below table we can see the flights (flight numbers) that have the minimum average delay and maximum delay.

Now we can see separate columns for departure delay for each month and for each flight in the below table.

variables

From the above data table, we can see the information of  the variables summary statistics for the columns Month and FlightNum.. using group_by() function.

You can recreate questions if you want to analyze more queries like above.

Example,  What is the maximum departure or which/Find the month which as the maximum departure delay or Find the day or date that as maximum departure delays. For these kinds of queries you need to use group_by() function first, then summarise() function to aggregate the data and then you can find the month has or the day that has the maximum departure delay.

What is summarise function?

Summarise is a function used to perform aggregate operations like to find minimum value or the maximum value or the mean value in a given dataset.

To find the basic statistics of any variable, we have arrival delay (ArrDelay) variables so, if you see the data frame df which is the original data frame. We see the arrival delay, we can see the minimum value, it means it is not delayed rather it reaches the destination 70 minutes prior to its time.

data.table

If you want to see ArrDelay and the maximum arrival delay is 978 minutes to its time.

data manipulation

The data set has some NA values i.e; missing values. The first and foremost we have to find the missing data in our original df (Here you have mentioned the original df has missing data. So explain once). we need to remove missing data so that we can find a summary of the data because these NA (Is the NA represents the missing values) values will hinder in our calculations.

NA's is Missing value in R

This is how our dataset looks like when we scroll down to dataset.

V) Filter() function.

Let’s go the first filter this original data frame. We will use df and then filter() function !is.na and arrival delay.

Filter() function

!is.na command drop all the NA values.

On the right-hand side of the R studio, we can see in the environment section to see updated_df which has 223874 observations and 21 variables which has been created by after dropping NA values.

Data set variables

By clicking on updated_df we can easily view the modified dataset.

modified data frame

Here in the above data set, we can see there is no NA values left because we have filtered those.

Now after filtering we are going to find the summary of this arrival delay column using summarise() function. So we can use filter() function and summarise() function to find min,average, max arrival delay.

Here,

#summarise

#aggregate function

#summary about ArrDelay

#min(x) - minimum value of vector x

#max(x) - maximum value of vector x

#mean(x) - mean or average value of vector x

!is.na and na.rm

So here in the console, we can see the min_delay is -70, avg_delay = 7.094 and max delay =978.

We can also modify the data frame using filter() and summarise() function.

pipe operator

On the right-hand side of the R studio, we can see in the environment section to see updated_df which has 1 observation and 3 variables which have been created by using filter() function.

R enviroment

By clicking on updated_df dataframe we can view the modified dataset.

Dplyr. sapply, lapply, Tapply, data.table

So here in the above table, we have updated_df dataframe displaying min_delay is equals to -70, avg_delay is 7.094334 and max_delay is 978 after removing NA values.

This is how we can find the summary statistics for any columns.

Now we are going to use the filter function to filter the rows based on the condition used inside the filter function on a column.

If you see the original data frame you can see the distance variable after shorting in order the minimum value is 79 and the maximum value is 3904.

First, take df on pipeline line operator i.e; %>%  then use the filter function and only take those rows where the distance variable is greater than 2000.

On the right-hand side of the R studio, we can see in the environment section to see the updated_df dataframe which has 918 observations and 21 variables which has been created using filter() function where the distance variable value is greater than 20000.

Here we want to select those rows where the distance is greater than in 2000. So it will filter all the rows which are lesser than the distance 2000 km.

Click on the updated_df to view the modified dataset.

 Houston flights data

Here in the above updated_df table, we can see the distance column/distance variable consists of values that are greater than 2000.

Now let’s say you want only DayOFWeek, the minimum DayOFWeek 1 and the maximum is 7 and it is coded based on Monday to Sunday.

We want only Saturday and Sunday i.e; 6 and 7. So how we would do, we will use the DayOfWeek variable. Here we can write it as DayOfWeek%in%c(6,7).

That means DayOfWeek should have only these values as vector 6 and 7.

filter

On the right-hand side of the R studio, we can see in the environment section to see updated_df which has 59687 observations and 21 variables which has been created by using filter() function.

So DayOfWeek is either six minimum and maximum will be 7.

hflights package

Here in the above updated_df table, we can see DayOfWeek is 6 and 7 selected by filter() function.

So we can filter based on many conditions you can use lesser, greater, equal for numeric variable or character variable and also if you just want the variable to have only a few values then can use %in% where you can mention the values which you want to have like 6,7 if you want only 3 then just write DayOfWeek == 3 so it will give DayOfWeek to be only 3.

This is how we can use the filter function to achieve the filtered datasets.

So, we have covered pipe operator %>% , select(), sample_frac(), mutate(), group_by(), filter() and summarise() function in data manipulation using dplyr package on Houston flights data with R.

From the above examples, we believe this blog helped you in understanding Data manipulation operations using the dplyr package on Houston flights data with R.

Suggested Reading

https://acadgild.com/blog/linear-model-building Using Airquality Data Set with R.

Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.

Tags

Badal Kumar

Data Analyst at Aeon Learning

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close