In this article, we demonstrate data cleaning and missing value imputation using Multiple Imputation by Chained Equations (mice).
So, what is data cleaning?
Data cleaning is the process of fixing data that is incomplete, incorrect, improperly formatted, duplicated, or otherwise unsuitable. It often involves improving data that has been compiled in one place.
Although data cleaning can involve deleting or updating records, it is centered more on changing, correcting, and consolidating data so that it is good enough for descriptive as well as predictive modeling.
Data cleaning is one of the vital parts of a machine learning project and plays a major role in building a model. Skilled data scientists often spend a large portion of their time on this step.
With a well-cleaned dataset, we can get the desired results even with a very simple algorithm.
In this article, we perform data cleaning using the mice and VIM packages on the vehicleMiss.csv dataset.
The data set contains 1624 records and 7 columns:
Vehicle – Vehicle number
Fm – Vehicle failure month
Mileage – Vehicle failure at the mileage
Lh – labor hour
Lc – labor cost
Mc – Material cost
State – The region of vehicle failure
Let us understand mice imputation of missing values.
First, we have to install the required packages, VIM and mice.
Then we load both packages with library().
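A minimal sketch of this setup step follows. The requireNamespace() guard is our own addition, so that a package is only installed when it is missing.

```r
# Install mice and VIM only if they are not already available
for (pkg in c("mice", "VIM")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load both packages for the current session
library(mice)
library(VIM)
```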
Next, we read the data set into R and look at the first six rows with the head() command.
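This step could look like the sketch below. The object name data and the file location are assumptions: we assume vehicleMiss.csv sits in the current working directory.

```r
# Assumption: vehicleMiss.csv is in the current working directory.
# stringsAsFactors = TRUE reads text columns such as State as factors,
# which matters later when mice picks an imputation method.
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Show the first six rows
head(data)
```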
We check for missing values with the is.na() function.
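One way to get a single TRUE/FALSE answer is to wrap is.na() in any(). Wrapping it in any() is our assumption about how the check was run; the file location and the name data are assumptions as well.

```r
# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# is.na() flags every missing cell; any() collapses the flags
# into one logical value for the whole data set
any(is.na(data))
```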
The command returns TRUE, which means there are missing values in the data set.
With the summary() command, let's see which columns have missing values and how many each one has.
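A short sketch of this step, again assuming the file is in the working directory. The colSums() line is an extra we added to get the NA counts in compact form.

```r
# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# summary() reports the number of NA's for each column that has any
summary(data)

# Compact alternative: number of missing values per column
colSums(is.na(data))
```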
The summary output shows that four columns have missing values: Mileage = 13 NA's, lh = 6 NA's, lc = 8 NA's, and State = 15 NA's.
Let us find out what percentage of data is missing in each variable of our data set.
To do this, we write a function that computes the percentage and pass it to the apply() function.
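Such a function might be written as follows. The name pct_missing is hypothetical, and the file location is an assumption.

```r
# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Percentage of missing values in a vector (hypothetical helper name)
pct_missing <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

# MARGIN = 2 applies the function column by column
apply(data, 2, pct_missing)
```

For a single vector, pct_missing(c(1, NA, 3, NA)) gives 50, since two of the four values are missing.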
The output shows that State and Mileage have more missing values than the other variables.
We can also inspect the pattern of missing data with the md.pattern() command.
This gives us a table as well as a plot showing the missing data.
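A sketch of the call, assuming the data frame is read as before. In recent versions of mice, md.pattern() draws the plot by default.

```r
library(mice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Each row of the result is one missingness pattern:
# 1 = observed, 0 = missing. The row names hold the number of
# records with that pattern; the last column counts the missing
# variables in the pattern.
md.pattern(data)
```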
In the table above, the first row shows 1586 with a 0 in the last column: there are 1586 rows with no missing data. There are 11 rows where exactly one data point is missing, and that point is in the State column, as indicated by the 0 under State. In 13 rows only the Mileage value is missing. Similarly, there are 6 rows where only the lc value is missing and 2 rows where both lc and State are missing. In total, 42 data points are missing (13 + 6 + 8 + 15).
We can also see how many data points are observed for each pair of variables with the md.pairs() command.
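The call could look like this. The name pair_counts is hypothetical; md.pairs() returns a list with the four components rr, rm, mr, and mm.

```r
library(mice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Pairwise counts of observed/missing combinations
pair_counts <- md.pairs(data)

# rr: both variables of the pair observed
pair_counts$rr
```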
In the output, $rr indicates how many data points are observed. For vehicle, all 1624 data points are observed. Variables with missing data have fewer observed points than the total number of records, for example 1611 for Mileage (1624 minus 13) and 1609 for State (1624 minus 15).
The next table, $rm, gives the counts for observed versus missing, followed by $mr for missing versus observed and $mm for missing versus missing.
Let us draw a margin plot with the marginplot() command to show the observed and missing data points.
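A sketch for the pair Mileage and lc follows. The column spellings Mileage and lc are assumptions about how the names appear in the CSV header.

```r
library(VIM)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Scatter plot of the two variables, with information about
# missing values drawn in the plot margins
marginplot(data[, c("Mileage", "lc")])
```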
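In the plot above, the blue scattered points are observed values and the red points mark records where a value is missing.
The box plots in the margins compare the distribution of one variable when the other is observed (blue) with its distribution when the other is missing (red), here for Mileage and labour cost.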
Missing value imputation with mice
Let's store the imputed data in impute. The function we use here is mice(), and we pass it only columns 2 to 7 of the data, since the first column (the vehicle number) is of no use for further analysis. We can also specify the number of imputations, m = 4; the default value is 5.
We can also set a random seed, say 123, so that the results are reproducible.
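Put together, the call might look like the sketch below. The file location and the name data are assumptions; impute, m = 4, and seed 123 come from the text.

```r
library(mice)

# Assumption: vehicleMiss.csv is in the current working directory.
# Reading text columns as factors lets mice treat State as categorical.
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Impute columns 2 to 7 (vehicle number left out),
# 4 imputations, fixed seed for reproducibility
impute <- mice(data[, 2:7], m = 4, seed = 123)
```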
Here mice has run five iterations, and in each iteration it has produced 4 imputations.
Let's print impute.
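A self-contained sketch of this step. It repeats the imputation so that the block runs on its own; the file location is an assumption.

```r
library(mice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)
impute <- mice(data[, 2:7], m = 4, seed = 123)

# Prints the number of imputations and the method chosen per variable
print(impute)

# The methods alone; an empty string means nothing had to be imputed
impute$method
```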
The output shows the total number of imputations.
It also shows the imputation method used for each variable.
fm and mc have no missing values, so no method is listed for them. Mileage, lh, and lc are numeric variables, and the default method for missing numeric values is pmm (predictive mean matching).
State is a factor (categorical) variable, so the imputation method is polyreg (multinomial logistic regression).
Let's look at some imputed values, here for Mileage.
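The imputed values are stored per variable in the imp component of the result. A sketch, assuming the column is spelled Mileage in the CSV header:

```r
library(mice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)
impute <- mice(data[, 2:7], m = 4, seed = 123)

# One row per record with a missing Mileage,
# one column per imputation (1 to 4)
impute$imp$Mileage
```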
The output shows four imputed estimates for each missing Mileage value, so we can choose which imputation suits the data best. Let's look at one record to see what mice has done.
Here is the 253rd row with all columns:
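Selecting a single row of the original data takes one line of base R indexing. The file location and the name data are assumptions.

```r
# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Row 253, all columns (empty column index keeps every column)
data[253, ]
```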
It shows that the Mileage value is missing for this record.
Note that this car failed after 1 month, so imputing with the mean value alone would give the wrong result, since it is rare that a person drives 20599 miles in one month. We need to impute a more plausible value.
Let us look at the summary for the variable Mileage.
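A sketch of this check, assuming the column is spelled Mileage in the CSV header:

```r
# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)

# Minimum, quartiles, median, mean, maximum, and number of NA's
summary(data$Mileage)
```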
We can see the minimum, maximum, mean, median, and quartile values, along with the number of missing values.
Complete data with complete() function
Here we use the values from the 2nd imputation, as we judged it to give the most plausible results of the four.
Let’s see the summary of imputed data.
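The completion step and the summary could be sketched as follows. The name completed is hypothetical, and the file location is an assumption.

```r
library(mice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)
impute <- mice(data[, 2:7], m = 4, seed = 123)

# Fill the missing cells with the values from the 2nd imputation
completed <- mice::complete(impute, 2)

# The summary should no longer report NA's
summary(completed)
```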
Observed and imputed values
Now the data is ready for classification or prediction models.
Here we can see that vehicles with a failure month of only 0 to 1 now show imputed mileages such as 863 and 11 miles, which are quite plausible.
Distribution of observed/imputed values
We use the stripplot() function to see the distribution of observed and imputed values.
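A sketch of the plotting call. The pch and cex values are our own choices for point shape and size, and the file location is an assumption.

```r
library(mice)
# lattice provides the stripplot() generic;
# mice adds the method for its imputation objects
library(lattice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)
impute <- mice(data[, 2:7], m = 4, seed = 123)

# One panel per variable; observed and imputed values
# are drawn in different colors
stripplot(impute, pch = 20, cex = 1.2)
```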
Here blue marks observed values and red marks imputed values. On the horizontal axis, 0 is the original data and 1, 2, 3, 4 are the imputed data sets. For fm and mc everything is blue, which means there are no missing data points. For Mileage, lh, and lc there are both blue and red points, i.e. observed and imputed data. We cannot see any unusual pattern.
XY plot for two variables
Let's plot labour cost “lc” against labour hours “lh” with the xyplot() command.
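A sketch of the call follows. The formula conditions on .imp, the imputation number that mice supplies, so that each imputation gets its own panel. The pch and cex values are our own choices.

```r
library(mice)
# lattice provides the xyplot() generic;
# mice adds the method for its imputation objects
library(lattice)

# Assumption: vehicleMiss.csv is in the current working directory
data <- read.csv("vehicleMiss.csv", header = TRUE, stringsAsFactors = TRUE)
impute <- mice(data[, 2:7], m = 4, seed = 123)

# lc against lh, one panel per imputation (0 = original data)
xyplot(impute, lc ~ lh | .imp, pch = 20, cex = 1.4)
```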
In the xyplot of lc versus lh above, the first panel is for the original data and the rest are for the imputed data. The lh and lc data points are shown in blue and red, i.e. observed and imputed data, with red indicating the imputed values.
We have selected the 2nd imputation for the missing values.
We hope this post has been helpful in understanding data cleaning. In the case of any queries, feel free to comment below and we will get back to you at the earliest.