Getting started with data.table package using R.
In this article, we will be performing data manipulation operations with data.table package on houston flights dataset which is available in R.
Before moving ahead, we recommend our viewers to have little grasp on the process of data manipulation.
Wherefore you can follow our article Data Manipulation using R to know about data manipulation and dplyr package.
DATA.TABLE
The data.table is an Extension of data.frame which is a high-performance version build in base R. It is extensively used for rapid aggregation or getting fixed inferences of huge data sets, fast ordered joins, remove/add/modify columns by group.
The data.table is being used with large data sets offers fast and memory efficient.
example: 1GB to 100GB in RAM (currently in-memory:64bit and 100GB is routine).
The data.table R package is the fastest package for data manipulation. It is used to sort, evaluate, analyze, compare, manipulate the data sets.
So, what is a Data Frame
R provides an essential data structure known as the “data frame” that provides the user to organize, view, and access data. There’s a lot of the functions to read in external files i.e; read.csv, read.table or connect to databases ie; RMySQL.
Furthermore brief review of data.frame() concepts:
- Data frame is a two dimensional data structure in R programming.
- A data frame is a set of columns used to store data tables.
- It is a list of vectors of equal length and same data type.
- Every column is the same length but different type.
Lets see the syntax of data.table syntax.
syntex :
It is inspired by A[B] syntax in R.
Where;
A is a matrix and B is a column-2 matrix.
Therefore, a data.table is a data.frame, and it is Suitable with R functions and packages that accept only data.frames.
General form: DT[i, j, by]
Where;]
i = Row
j = Column
by = groupby
Take DT, subset rows using i, then calculate j grouped by by.
The data.table builds on base R Packages to reduce time:
1. programming time ( debug and maintain,easy to write, read).
2. computation time (memory efficient and fast).
Let’s discuss data manipulation using Hflights dataset in data.table way.
Hflights package
Hflights package contains Houston airports dataset which is Added to with base package in R 3.6 version. This dataset is created by the Bureau of transportation statistics for research and innovation technology administration. This dataset with 227,496 rows and 21 columns- Variables.
Here we will be using two packages i.e in our examples,
- data.table
- hflights.
Installation of packages
First and foremost, we will install data.table package to perform data manipulation operations and hflights package to use Houston flights dataset.
Here, In the below code we can see how to install packages.
To install the package called data.table, you can install package using install.package() function by giving the package name in quotations “data.table”.
install.packages("data.table")
And then import the library function. So, inside the library function give the package name data.table and then run, It will import the data.table package.
library(data.table)
To show how data.table functions we are using a dataset name hflights. And to get this data set which is inbuilt in R you have to install a package called hflights. So you can install package using install.package() function by giving the package name in quotations “hflights”.
install.packages("hflights")
And then import the library function. So, inside the library function give the package name hflights and then run, It will load the hflight package, Which is carrying Houston airports dataset.
library(hflights)
View Data
To see the hflights data set we can use View() function and inside the view function use the dataset names or package name. It will display all the columns inside the dataset
View(hflights)
Here in the above data table, we can see hflights dataset.
head()
If you want to check the top six rows you can use the head() function and within the function enter the dataset name hflights. It will display the first six rows of all the columns by default.
head(hflights)
The first six rows of every column is shown in the above console.
colnames()
To see the column names of the data set you can use the colnames function and give the dataset name to see all the column names in hflight dataset.
colnames(hflights)
Here in above console it displays all the columns that is available inside the data set.
class()
To see the class of the dataset just use the class function and give the dataset name inside the parentheses and it will show it is data.frame
class of the data set
class(hflights)
However you want to work with data.table package you have to convert this dataset to a data table.
To do that you have to use as.data.table() function and give data set name inside the parentheses.
Here to convert data.frame into data.table use as.data.table()
hflights <- as.data.table(hflights)
class(hflights)
So if you see now the class of hflight you can see the data.table along with data.frame.
In the above console it is showing data.table along with data.frame. Now you can apply the data.table functions using this data.table packages and also this will remain as a data frame. So, you can also use inbuilt R function such for data manipulation or extractions.
how to select data?
Lets perform both data.frame and data.table in their way.
Select some columns from the dataset for data.frame way using the ordinary syntax to select certain columns from the data “ArrDelay”, and “Dep”Delay” from data frame.
select multiple columns using data.table
Here we want to select 1st three rows indexing 1:3 and for columns name, vector to specify variable names c( “ArrDelay”,“DepDelay”) within quotations.
hflights[1:3, c("ArrDelay","DepDelay")]
#data.frame
In the above console it is showing arrival delay and departure delay from the first three rows.
So now if you want to do it in data.table way then you have to give hflights again, the row and column separated with a comma. Selecting 1:3 first three rows. And for column you can put list() function here and put column names which you want to select. Here we are taking again list(ArrDelay,DepDelay) without any quotations.
hflights[1:3, list(ArrDelay,DepDelay)]
#data.table
Here in above console we are seeing the same result in data.table way.
Select multiple columns using .() dot operator, data.table way
For indexing the first ten rows 1:10 separated by a comma with dot function inside the function give .(ArrDelay,DepDelay) without any quotations.
Hflights[1:10, .(ArrDelay,DepDelay)]
#data.table
Now you can see in the above console we are seeing the result in data.table way by using .() dot function .
Data filtration
In data filtration we can select certain rows or columns based on conditions as per our requirements.
So lets select ArrDelay greater than equal to >= 45 and based on this ArrDelay we will select origins of all the flights. For data.frame way here we are using head function so that it will show first 6 rows by default and inside hflight we are indexing hflights$ArrDelay, where $ sign means selecting variable i.e; ArrDelay from the dataset which is greater than equal to 45 and The origin saprated by a comma.
head(hflights[hflights$ArrDelay>=45, Origin])
#data.frame
Here in the above console it is showing origins based on the given condition i.e; ArrDelay >=45.
Now lets do data filtration in data.table way. So lets select ArrDelay greater than equal to >= 500 and based on this ArrDelay we will select origins of all the flights. For data.table way here we are using head function so that it will print first 6 rows by default and inside we don’t have to index hflights$ArrDelay, we can directly give the aArrDelay greater than equal to >= 500 with comma , Origin.
head(hflights[ArrDelay>=500, Origin])
#data.table
Here in the above console we can see the origin names on the given condition i.e; ArrDelay >=45.
Remove/omit Na values
Let’s first print a summary of the data set using summary() function to see total Na values present in the dataset.
summary(hflights)
To remove missing values from the dataset we will use !is.na for data.frame way and na.omit for data.table way.
For data.frame way selecting first 6 rows using head() function and checking NA values or missing present in the column ArrDelay by using the command is.na inside the hflight data set indexing the third bracket ArrDelay with $ sign so it will print the arrival delay.
head(hflights[ is.na(hflights$ArrDelay), ArrDelay])
#data.frame
Here in the above console we can see the Na values present in ArrDelay column.
Now let’s remove all the Missing values present in the column, by using !is.na. Here ! sign indicates that it drops all the values and print the result.
head(hflights[!is.na(hflights$ArrDelay), ArrDelay])
#data.frame
Here in the above console we can see the printed first six rows of arrival delay values without NA’s.
For data.table way selecting first 6 rows using head() function and checking NA values or missing present in the column ArrDelay by using the command na.omit inside the hflight data set indexing the ArrDelay. Hance it will print the arrival delay without any NA’s.
head(hflights[,na.omit(ArrDelay)]) #data.table
Here in the above console we can see the printed first six rows of arrival delay values without NA’s.
Add new column
We have already seen in our previous article “Data manipulation using R” how to add or Remove column with dplyr package. So now we will perform the add and remove column operation with data.table package.
Let add a column to the data set name total_delay in data.table way.
Name hflights_total_delay as modified dataset and The operators <- used to assign variable in the same environment using the code hflights[,total_delay := ArrDelay + DepDelay] we are specifying total_delay colon equals arrival delay and departure delay to put under the column by adding ArrDelay + depDelay.
So it will create a total delay column based on arrival delay and departure delay inside the hflight data set .
hflights_total_delay <- hflights[,total_delay := ArrDelay + DepDelay]
Using head() function to print 1st six rows of modified dataset called hflights_total_delay.
head(hflights_total_delay)
Here in the above console the hflights_total_delay is printed and also we can see the total_delay column.
Let’s find the total_delay which is greater than equal to >= 80. Here we are using head() function that will print 1st six rows but we are specifying with a coma it should print 1st five rows. Here we are using the updated data set hflights_total_delay with which new column is added “total_delay”.
head(hflights_total_delay[total_delay>=80],5)
So here in the above console we can see the first five rows printed and the total values is showing greater than equal >= 80
Similarly we are adding a new variable AvgSpeed in a new modified dataset Name hflights_AvgSpeed as modified dataset and The operators <- used to assign variable in the same environment using the code hflights[,AvgSpeed := Distance/AirTime*60].
we are specifying total_delay to put under the column by the basic definition of speed is equal to distance by time.
hflights_AvgSpeed <- hflights[,AvgSpeed := Distance/AirTime*60]
Print hflight_AvgSpeed using head() function which will show the first six values.
head(hflights_AvgSpeed)
In the above console we can see the new variable name Avg_speed and its values.
Let’s find the Avg_speed which is greater than equal to >= 500. Here we are using head() function that will print 1st six rows but we are specifying with first 5 rows with comma, it should print 1st five rows. Here we are using the updated data set hflights_AvgSpeed with which new column is added “AvgSpeed”.
head(hflights_AvgSpeed[AvgSpeed>=500],5)
So here in the above console we can see the first five rows printed and the total values is showing greater than equal >= 80
Remove column
If you want to remove certain columns then you can do it by specifying Null operator. To remove AvgSpeed column [, AvgSpeed := NULL]. Rows is separated in a coma, and column is AvgSpeed and colon equals to NULL. And then print first six rows by using head() function.
hflights[, AvgSpeed := NULL]
head(hflights)
In the above console the AvgSpeed column is removed by using NULL operator in data.table way. .
copy entire table
copy the whole table into a new modified dataset by using copy() function to new dataset copy_new_data.
copy_new_data <- copy(hflights_total_delay)
If you want to see class of the dataset in which we have saved the data by using class() function.
class(copy_new_data)
In the above console it is showing data.table along with data.frame.
Now lets see the column name of copy_new_data dataset by using colnames() function.
colnames(copy_new_data)
Here in the console you can see the printed column names which is present in the data set.
Use head() function to print frix six rows to see what our new dataset copy_new_data looks like.
head(copy_new_data)
select last flight in the dataset
If you want to print last row and inside that last row all the columns then you can use the dot .N operator in the row portion [.N,] in indexing and specifying rows and after coma can specify columns. Here we have not selected any column. [ .N ] means select or specify the last index of the row.
Select hflight dataset here and print.
hflights[.N,]
So here in the above console it print the last index of the dataset.
Data Aggregation
Now coming to data aggregation which is the main part of data.table package.
Data aggregation is the process where data or information is gathered and expressed in a summary form, for purposes such as statistical analysis.
Df[i,j,by]
Here df is a data frame and [] indexing is based on i, j ,by
#i = row
#j = column
#by = groupby
Let’s find the Average arrival delay by Month. i.e; Mean arrival delay of the flights by month.
Here in data.table way, hflight is the data set for indexing [,mean(na.omit(ArrDelay)), by = Month].
We are selecting all rows because before coma the index is empty, where after a comma we are selecting a variable ArrDelay and taking monthly mean of it and na.omit is dropping all the missing values present in the ArrDelay column. And group by Month where it calculate mean arrival delay and print for 12 months .
hflights[,mean(na.omit(ArrDelay)), by = Month]
Here in the above console you can see the 1-12 months and V1 is the mean arrival delay for each month.
Average delay of unique carrier by month
Now if you want to identify what is the mean total delay and what is the number of unique carrier that is moving from one airport to another airport by month.
In hflight we will index rows and columns, in the below code there is nothing mentioned so that it will select all the rows then we are using the dot operator and inside the dot operator we are creating mean_delay variable that will list all the mean values equal to mean inside na.omit also it will omit NA values,inside that the sum of ArrDelay and DepDelay to calculate the mean of the total delay.
Now for unique carrier we are saving it as N_uniqueCarrier using the length() function that will calculate the number of instances and Number of observations also inside unique() function that will remove the duplicate values. Here it may be same flight which is moving same airport to another airport several times. However we are selecting Unique carrier, How many unique carriers are moving with UniqueCarrier names.
So here it will automatically count the length of this unique values by equal to Month.
mean_delay_by_month <- hlights[,.(mean_delay = mean(na.omit((total_delay))), N_UniqueCarrier = length(unique(UniqueCarrier))), by = Month]
Print Mean_delay_by_month in data.table way and it should print all the data based on each month and total mean dealy and total number of unique carrier per month.
Mean_delay_by_month
So here in the above console the modified data set Mean_delay_by_month, it is showing for each month total mean_delay and N_unique carrier that is moving from one airport to another airport.
Sort as per ascending and descending order
If you want to short the data set that or newly created conditional data set based on data.table in Ascending order and Descending order based on the requirement.
To sort in Ascending order the above table mean_delay_by_month where we have seen the total mean_delay and the number of N_UniqueCarrier by month.
Lets create the new modified dataset name mean_delay_by_month_accending based on the total mean delay and we are indexing in the Mean_delay_by_month based on the order() function and here we are selecting mean_delay to arrange in ascending order.
mean_delay_by_month_accending <- mean_delay_by_month[order(-mean_delay)]
So inside the order function we wants to print total mean_delay in ascending order and it will save to the modified dataset Mean_delay_by_month_accending.
mean_delay_by_month_accending
Here in the above console it printed values in ascending order.
Similarly for descending order,
To sort in Descending order the above table mean_delay_by_month where we have seen the total mean_delay and the number of N_UniqueCarrier by month.
Lets create the new modified dataset name mean_delay_by_month_decending based on the total mean delay and we are indexing in the Mean_delay_by_month based on the order() function and here we are selecting -mean_delay with negative sign to arrange in descending order.
mean_delay_by_month_decending <- mean_delay_by_month[order(-mean_delay)]
So inside the order function we wants to print total mean_delay in descending order and it will save to the modified dataset Mean_delay_by_month_decending.
mean_delay_by_month_decending
Here in the above console it printed values in ascending order.
From the above examples, we believe this blog helped you to understand data.table package using R programming.
Suggested Reading
https://acadgild.com/blog/linear-model-building Using Airquality Data Set with R.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.