
Principal Component Analysis with R

June 17

PCA stands for Principal Component Analysis. It is used to reduce the dimensionality of data with minimal loss of information.

It is an unsupervised learning technique and is used in applications such as face recognition and image compression.


In this blog we will apply PCA to the famous ‘iris’ dataset in R.

The Iris flower data set is a multivariate data set introduced by the British statistician Ronald Fisher. The data was collected to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

The dataset contains 150 records under 5 attributes – Petal Length, Petal Width, Sepal Length, Sepal Width, and Class (Species).

So let us dive into the coding part.

We will first load the dataset and display the first few records.
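The original post showed this step as a screenshot; a minimal sketch, using the iris data frame that ships with base R (its columns are named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species):

```r
# Load the built-in iris dataset and display the first few records
data(iris)
head(iris)
```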

Getting the structure of the dataset using the str() function.
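This step might look like:

```r
# Inspect the structure: 150 observations of 5 variables
# (four numeric measurements plus the Species factor)
data(iris)
str(iris)
```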

Hence there are 150 rows and 5 columns.

Checking for any null value, if present.
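A sketch of the null-value check:

```r
# Count missing values, overall and per column
data(iris)
sum(is.na(iris))
colSums(is.na(iris))
```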

Hence there is no null value.

Summarizing the dataset using the summary() function.
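For example:

```r
# Minimum, quartiles, mean, and maximum for each measurement;
# level counts for the Species factor
data(iris)
summary(iris)
```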

Partitioning the dataset into training data and test data.
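A sketch of the partition; the seed and the roughly 80/20 split ratio are assumptions (the counts reported later in the post suggest about 121 training and 29 test rows), so the exact rows will differ from the original screenshots:

```r
data(iris)
set.seed(123)  # hypothetical seed; the original post's seed is unknown
# Assign each row to partition 1 (training) or 2 (test) with ~80/20 probability
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
```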

Creating a scatter plot and computing the correlation coefficients.
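A base-R sketch of this step (the split setup is repeated, with an assumed seed, so the block runs on its own):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]

train_x <- training[, -5]  # drop the 'Species' column
pairs(train_x)             # scatter-plot matrix of the four measurements
cor(train_x)               # Pearson correlations; Petal.Length vs Petal.Width is high
```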

We can see that a high correlation exists between petal length and petal width. Note that we dropped the ‘Species’ column in the above code.

High correlations among independent variables lead to “multicollinearity” problems.

Performing PCA.
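A sketch of the PCA call; note that prcomp()'s scaling argument is spelled `scale.`, with a trailing dot (seed and split are assumptions repeated from the earlier step):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]

# PCA on the four numeric variables, centering and scaling each column
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
pca
```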

Here we have performed PCA on the four variables using the prcomp() function. It performs the analysis on the given data matrix and returns the result as an object of class prcomp.

We have used the ‘center’ argument, which shifts each variable so that it is zero-centered.

Taking the mean of ‘Sepal_length’ from the training data.
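With the built-in column names (Sepal.Length rather than the post's ‘Sepal_length’), this check might look like (setup repeated so the block runs on its own):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

mean(training$Sepal.Length)  # matches the centering value stored by prcomp()
pca$center["Sepal.Length"]
```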

We will scale the values using the ‘scale.’ argument, which scales the columns of a numeric matrix.

‘center’ and ‘scale’ refer to the respective mean and standard deviation of the variables, which are used for normalization prior to implementing PCA.

Taking the standard deviation of ‘Sepal_length’ from the training data.
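The analogous check for the scaling value (setup repeated, assumed seed):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

sd(training$Sepal.Length)   # matches the scaling value stored by prcomp()
pca$scale["Sepal.Length"]
```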

Computing the Principal Components
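The loadings and scores can be inspected like this (setup repeated, assumed seed):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

pca$rotation  # loadings: each PC as a linear combination of the four variables
head(pca$x)   # scores: the training data expressed in PC coordinates
```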

We have results for 4 principal components. Each principal component is a normalized linear combination of the original variables.

Summarizing the PCA objects
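For example (setup repeated, assumed seed):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

# Standard deviation, proportion of variance, and cumulative proportion per PC
summary(pca)
```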

Orthogonality of PCA

The pairs.panels() function shows a scatter-plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlations above the diagonal.
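pairs.panels() comes from the ‘psych’ package (an assumption inferred from the function name; the call is guarded in case the package is not installed). The orthogonality itself can be verified with base cor():

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(pca$x, gap = 0)
}
round(cor(pca$x), 10)  # off-diagonal entries are ~0: the components are orthogonal
```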

Visualization using the factoextra library.

fviz_pca_biplot() is a function from the factoextra package and is used to create a biplot of individuals and variables.
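A guarded sketch of the biplot call (factoextra may need to be installed first; setup repeated with an assumed seed):

```r
data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

if (requireNamespace("factoextra", quietly = TRUE)) {
  # Individuals plotted as points, variables as arrows
  factoextra::fviz_pca_biplot(pca, repel = TRUE)
}
```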

The biplot shows which variables contribute most to the components. Petal length, petal width, and sepal length load heavily and are the more significant variables, whereas sepal width, whose arrow lies farther from the data points, is less significant.

Prediction with PC

Performing multinomial regression with PC1 and PC2.
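A sketch using multinom() from the ‘nnet’ package, which ships with R as a recommended package (setup repeated with an assumed seed):

```r
library(nnet)

data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)

# Species plus the first two principal-component scores
train_pc <- data.frame(Species = training$Species, pca$x[, 1:2])
model1 <- multinom(Species ~ PC1 + PC2, data = train_pc, trace = FALSE)
summary(model1)
```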

Summarizing model1

Creating the confusion matrix and computing the misclassification error on the training data.
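A sketch of the training confusion matrix; because the original seed is unknown, the counts will differ somewhat from those quoted in the article (setup repeated so the block runs on its own):

```r
library(nnet)

data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
train_pc <- data.frame(Species = training$Species, pca$x[, 1:2])
model1 <- multinom(Species ~ PC1 + PC2, data = train_pc, trace = FALSE)

# Cross-tabulate predicted vs actual species on the training data
tab <- table(Predicted = predict(model1, train_pc), Actual = train_pc$Species)
tab
```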

As we can see from the above result, 39 setosa, 38 versicolor, and 35 virginica flowers are classified correctly, whereas 5 and 4 flowers are misclassified as versicolor and virginica, respectively.

Calculating the misclassification error
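The error is one minus the fraction of correct (diagonal) predictions (setup repeated, assumed seed):

```r
library(nnet)

data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
train_pc <- data.frame(Species = training$Species, pca$x[, 1:2])
model1 <- multinom(Species ~ PC1 + PC2, data = train_pc, trace = FALSE)
tab <- table(Predicted = predict(model1, train_pc), Actual = train_pc$Species)

1 - sum(diag(tab)) / sum(tab)  # fraction of training flowers misclassified
```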

Hence, the misclassification error is about 7.4%.

Creating the confusion matrix and computing the misclassification error on the test data.
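For the test data, the rows are first projected onto the training principal components with predict() on the prcomp object (setup repeated, assumed seed; counts will differ from the article's under a different seed):

```r
library(nnet)

data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
train_pc <- data.frame(Species = training$Species, pca$x[, 1:2])
model1 <- multinom(Species ~ PC1 + PC2, data = train_pc, trace = FALSE)

# Project test data onto the training PCs, then predict species
test_pc <- data.frame(Species = testing$Species,
                      predict(pca, testing[, -5])[, 1:2])
tab_test <- table(Predicted = predict(model1, test_pc), Actual = test_pc$Species)
tab_test
```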

Similarly, on the test data, 11 setosa, 7 versicolor, and 9 virginica flowers are classified correctly, and only 2 flowers are misclassified as virginica.

Calculating the misclassification error
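The same error calculation applied to the test confusion matrix (setup repeated, assumed seed):

```r
library(nnet)

data(iris)
set.seed(123)  # hypothetical seed
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
train_pc <- data.frame(Species = training$Species, pca$x[, 1:2])
model1 <- multinom(Species ~ PC1 + PC2, data = train_pc, trace = FALSE)
test_pc <- data.frame(Species = testing$Species,
                      predict(pca, testing[, -5])[, 1:2])
tab_test <- table(Predicted = predict(model1, test_pc), Actual = test_pc$Species)

1 - sum(diag(tab_test)) / sum(tab_test)  # fraction of test flowers misclassified
```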

Hence, the misclassification error for the testing data is about 6.9%.

This brings us to the end of this article. I hope you find this blog helpful.

Suggested reading:

Hierarchical Clustering with R

Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.

