PCA stands for Principal Component Analysis, a technique for reducing the dimensionality of data with minimal loss of information.
It is an unsupervised learning technique and is used in applications such as face recognition and image compression.
In this blog we will apply PCA to the famous ‘iris’ dataset in R.
The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher. The data was collected to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
The dataset contains 150 records under 5 attributes – Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species).
So let us dive into the coding part.
We will first load the dataset and display the first few records.
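The post's original code is not reproduced here, but since iris ships with base R, this step can be sketched as:

```r
# iris is bundled with base R, so no file needs to be read
data(iris)
head(iris)   # first six records
```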
Getting the structure of the dataset using the str() function.
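A minimal sketch of this step:

```r
data(iris)
# Compact display of the structure: 150 obs. of 5 variables,
# four numeric columns plus the Species factor
str(iris)
```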
Hence there are 150 rows and 5 columns.
Checking for any null value, if present.
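The check can be sketched as follows; all zeros means no missing values:

```r
data(iris)
colSums(is.na(iris))   # missing values per column
sum(is.na(iris))       # total missing values
```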
Hence there is no null value.
Summarizing the dataset using the summary() function.
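A sketch of the summary step:

```r
data(iris)
# Min, quartiles, median, mean and max for each numeric column,
# and the class counts for the Species factor
summary(iris)
```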
Partitioning the dataset into training data and test data.
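The split can be sketched as below; the seed and the 80/20 proportions are assumptions, since the post's exact partition is not shown:

```r
data(iris)
set.seed(42)   # assumed seed, for reproducibility only
# Label each row 1 (training) or 2 (test) with ~80/20 probability
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
```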
Creating Scatter plot and correlation coefficient.
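A sketch of this step, assuming the psych package is installed (the split below repeats the assumed 80/20 partition so the snippet runs on its own):

```r
library(psych)   # provides pairs.panels()

data(iris)
set.seed(42)     # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]

# Scatter plots below the diagonal, correlations above the diagonal;
# column 5 (Species) is dropped before plotting
pairs.panels(training[, -5], gap = 0,
             bg = c("red", "yellow", "blue")[training$Species], pch = 21)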
We can see that a high correlation exists between petal length and petal width. We have also dropped the last ‘Species’ column in the above code.
High correlations among independent variables lead to “multicollinearity” problems.
Here we perform PCA on the four variables using the prcomp() function. It carries out the analysis on the given data matrix and returns the results as an object of class prcomp.
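This step can be sketched as follows (note that prcomp() spells the scaling argument `scale.`, with a trailing dot; the split repeats the assumed partition):

```r
data(iris)
set.seed(42)   # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]

# PCA on the four numeric columns; centre and scale each variable
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
print(pc)   # standard deviations and loadings of PC1..PC4
```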
Next we use the ‘center’ argument, which indicates that the variables should be shifted to be zero-centered.
Taking the mean of ‘Sepal_length’ from the training data.
We scale the values using the ‘scale’ argument, which scales the columns of a numeric matrix.
‘center’ and ‘scale’ refer to the mean and standard deviation of each variable, which are used to normalize the data before PCA is applied.
Taking the standard deviation of ‘Sepal_length’ from the training data.
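These checks can be sketched as below; in base R the column is named Sepal.Length (the post writes ‘Sepal_length’), and the split again repeats the assumed partition:

```r
data(iris)
set.seed(42)   # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

pc$center["Sepal.Length"]     # mean used for centering ...
mean(training$Sepal.Length)   # ... matches the column mean
pc$scale["Sepal.Length"]      # sd used for scaling ...
sd(training$Sepal.Length)     # ... matches the column sd
```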
Computing the Principal Components
We have results for 4 principal components. Each principal component is a normalized linear combination of the original variables.
Summarizing the PCA object
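The summary step can be sketched as (split repeated so the snippet runs alone):

```r
data(iris)
set.seed(42)   # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

# Standard deviation, proportion of variance and cumulative
# proportion for each component; the proportions sum to 1
summary(pc)
```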
Orthogonality of PCA
The pairs.panels() function shows a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlations above the diagonal.
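Applied to the component scores, the same plot demonstrates orthogonality: every off-diagonal correlation is 0. A sketch, again assuming the psych package and the same split:

```r
library(psych)   # pairs.panels()

data(iris)
set.seed(42)     # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

# pc$x holds the scores; the components are uncorrelated by construction
pairs.panels(pc$x, gap = 0)
```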
Visualization using the factoextra library.
fviz_pca_biplot() is a function from the factoextra package used to draw a biplot of individuals and variables.
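A sketch of the biplot call, assuming factoextra is installed; the habillage and addEllipses options colour points by species and are an assumption, since the post's exact call is not shown:

```r
library(factoextra)   # fviz_pca_biplot()

data(iris)
set.seed(42)          # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

# Points are observations, arrows are the original variables
bp <- fviz_pca_biplot(pc, habillage = training$Species, addEllipses = TRUE)
print(bp)   # render the ggplot object
```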
The biplot shows which variables behave similarly and which contribute most. Petal Length, Petal Width and Sepal Length point in similar directions close to the data points, so they are the more significant variables, whereas Sepal Width, whose arrow lies far from the data points, is less significant.
Prediction with PCA
Performing multinomial regression with PC1 and PC2.
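This can be sketched with multinom() from the nnet package (the data-frame name `trg` is an illustrative choice, and the split repeats the assumed partition):

```r
library(nnet)   # multinom()

data(iris)
set.seed(42)    # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

# Keep only the first two component scores as predictors
trg <- data.frame(Species = training$Species, pc$x[, 1:2])
model <- multinom(Species ~ PC1 + PC2, data = trg)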
Creating the Confusion Matrix and Misclassification error on the training data.
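Both the training confusion matrix and the error can be sketched as below; note that with the assumed seed and split the exact counts may differ from those quoted in the post:

```r
library(nnet)

data(iris)
set.seed(42)   # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
trg <- data.frame(Species = training$Species, pc$x[, 1:2])
model <- multinom(Species ~ PC1 + PC2, data = trg)

# Confusion matrix: predicted class vs actual class
p_train <- predict(model, trg)
tab <- table(Predicted = p_train, Actual = trg$Species)
tab

# Misclassification error = 1 - accuracy
1 - sum(diag(tab)) / sum(tab)
```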
As we can see from the result above, 39, 38 and 35 flowers are correctly classified as setosa, versicolor and virginica, respectively, whereas 5 and 4 flowers are misclassified as versicolor and virginica.
Calculating the misclassification error
Hence, the misclassification error is about 7.4%.
Creating the Confusion Matrix and Misclassification error on the test data.
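The test-set evaluation can be sketched as below; the key step is projecting the test rows onto the components learned from the training data via predict(pc, ...). Exact counts again depend on the assumed split:

```r
library(nnet)

data(iris)
set.seed(42)   # assumed split, as before
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
pc <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
trg <- data.frame(Species = training$Species, pc$x[, 1:2])
model <- multinom(Species ~ PC1 + PC2, data = trg)

# Project the test rows onto the training PCs, then predict
tst <- data.frame(Species = testing$Species,
                  predict(pc, testing[, -5])[, 1:2])
p_test <- predict(model, tst)
tab <- table(Predicted = p_test, Actual = tst$Species)
tab

# Misclassification error on the test data
1 - sum(diag(tab)) / sum(tab)
```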
Similarly to the training outcome above, on the test data 11, 7 and 9 flowers are correctly classified as setosa, versicolor and virginica, respectively, while only 2 flowers are misclassified as virginica.
Calculating the misclassification error
Hence, the misclassification error for the test data is about 6.9%.
This brings us to the end of this article. I hope you find this blog helpful.
Keep visiting our website for more blogs on Data Science and Data Analytics.