Data Science and Artificial Intelligence

Covariance and Correlation

This entry is part 3 of 17 in the series Machine Learning Algorithms

Covariance and correlation are two closely related concepts that are commonly used in statistics. When comparing data samples, covariance is used to determine how two random variables vary together, whereas correlation is used to determine both the direction and the strength of their linear relationship on a standardized scale. Neither one tells us that a change in one variable causes a change in the other.

Both covariance and correlation measure linear relationships between variables. When the correlation coefficient is positive, the two variables tend to increase together. When the correlation coefficient is negative, the two variables tend to move in opposite directions. When it is close to zero, there is no linear relationship, although the variables may still be related in a nonlinear way.
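Because both measures only capture linear relationships, a strong nonlinear dependence can go undetected. The following is a minimal sketch of that point; the data (a symmetric grid and its square) are made up for illustration and are not from the article.

```python
import numpy as np

# Symmetric grid around zero, so positive and negative contributions cancel out.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2  # y is fully determined by x, but the dependence is not linear

r = np.corrcoef(x, y)[0, 1]
print(f"Correlation of x and x**2: {r:.2f}")
```

The correlation here is zero up to floating-point error, even though y can be computed exactly from x.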


Sample

A sample is a randomly chosen subset of elements from an underlying population. We usually calculate covariance and correlation on samples rather than on the complete population. Covariance and correlation measured on samples are known as sample covariance and sample correlation.

Sample Covariance

Covariance is a measure used to determine how much two variables change in tandem. The unit of covariance is a product of the units of the two variables. Covariance is affected by a change in scale. The value of covariance lies between -∞ and +∞.
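To illustrate the dependence on scale, here is a small sketch under assumed data: the heights and weights are hypothetical values generated only for this example, and the seed is fixed purely for reproducibility. Converting heights from metres to centimetres multiplies the covariance by 100.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical data: heights in metres and loosely related weights in kilograms.
height_m = rng.normal(1.7, 0.1, 50)
weight_kg = 60 * height_m - 30 + rng.normal(0, 5, 50)

cov_m = np.cov(height_m, weight_kg)[0, 1]         # units: metre * kilogram
cov_cm = np.cov(height_m * 100, weight_kg)[0, 1]  # units: centimetre * kilogram

print(f"Covariance with height in m:  {cov_m:.3f}")
print(f"Covariance with height in cm: {cov_cm:.3f}")
```

The relationship between the two variables has not changed, only the unit of measurement, which is why raw covariance values are hard to compare across datasets.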

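The sample covariance matrix is a K-by-K matrix, where K is the number of variables. Its entry in row j and column k is the sample covariance between variables j and k:

$$q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

Here’s what each element in this equation means:
q_{jk} = the sample covariance between variables j and k.
N = the number of elements in each sample.
i = an index that assigns a number to each sample element, ranging from 1 to N.
x_{ij} = a single element in the sample for j.
x_{ik} = a single element in the sample for k.
\bar{x}_j, \bar{x}_k = the sample means of variables j and k.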
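The definitions above can be sketched directly in code. This example assumes the usual unbiased estimator with an N - 1 denominator, which is also NumPy's default; the helper name `sample_covariance` and the random data are hypothetical and used only for illustration.

```python
import numpy as np

def sample_covariance(xj, xk):
    """Sample covariance of two equal-length samples (N - 1 denominator)."""
    xj = np.asarray(xj, dtype=float)
    xk = np.asarray(xk, dtype=float)
    n = xj.size
    # Sum of products of deviations from each sample mean.
    return np.sum((xj - xj.mean()) * (xk - xk.mean())) / (n - 1)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.random((50, 3))      # N = 50 observations (rows) of K = 3 variables (columns)

# rowvar=False tells np.cov that each column is a variable.
Q = np.cov(data, rowvar=False)  # K-by-K sample covariance matrix
manual = sample_covariance(data[:, 0], data[:, 1])

print(Q.shape)
print(np.isclose(manual, Q[0, 1]))
```

The matrix is symmetric, and its diagonal holds the sample variance of each variable.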

Sample Correlation

The sample correlation between two variables is a normalized version of the covariance. 

The value of correlation coefficient is always between -1 and 1. Once we’ve normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare correlations.
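As a rough illustration of why the normalized value is comparable across datasets, the sketch below measures the same hypothetical temperatures in Celsius and in Fahrenheit. The variable names and generated data are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility

# Hypothetical data: daily temperature and a loosely related sales figure.
temp_c = rng.normal(20, 5, 50)
sales = 3 * temp_c + rng.normal(0, 10, 50)

# The same temperatures expressed in a different unit.
temp_f = temp_c * 9 / 5 + 32

r_celsius = np.corrcoef(temp_c, sales)[0, 1]
r_fahrenheit = np.corrcoef(temp_f, sales)[0, 1]

print(f"Correlation using Celsius:    {r_celsius:.4f}")
print(f"Correlation using Fahrenheit: {r_fahrenheit:.4f}")
```

Both correlations agree, because rescaling or shifting a variable by a positive linear transformation leaves the correlation coefficient unchanged.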

To calculate the sample correlation (also known as the sample correlation coefficient) between random variables X and Y, divide the sample covariance of X and Y by the product of the sample standard deviation of X and the sample standard deviation of Y:

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{s_X \, s_Y}$$

The key terms in this formula are:
 Corr(X,Y) = sample correlation between X and Y
 Cov(X,Y) = sample covariance between X and Y
 s_X = sample standard deviation of X
 s_Y = sample standard deviation of Y
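The calculation described above can be checked numerically. This is a minimal sketch that assumes sample standard deviations with an N - 1 denominator (`ddof=1`), matching the default of `np.cov`; the seed and the variable names `s_x`, `s_y`, and `corr_manual` are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
X = rng.random(50)
Y = 2 * X + rng.normal(0, 0.1, 50)

cov_xy = np.cov(X, Y)[0, 1]  # sample covariance of X and Y
s_x = np.std(X, ddof=1)      # sample standard deviation of X
s_y = np.std(Y, ddof=1)      # sample standard deviation of Y

corr_manual = cov_xy / (s_x * s_y)
corr_numpy = np.corrcoef(X, Y)[0, 1]

print(np.isclose(corr_manual, corr_numpy))
```

Using `ddof=1` matters here: mixing a covariance with an N - 1 denominator and standard deviations with an N denominator would give a slightly different value.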

The formula used to compute the sample correlation coefficient ensures that its value ranges between –1 and 1.
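One way to see why the value is bounded is the Cauchy-Schwarz inequality. The sketch below uses notation introduced only for this argument: a_i and b_i denote the deviations of each observation from its sample mean.

```latex
% Assumed notation: centered observations of X and Y.
a_i = x_i - \bar{x}, \qquad b_i = y_i - \bar{y}

% The 1/(N-1) factors in the covariance and in the two standard deviations cancel:
\mathrm{Corr}(X,Y)
  = \frac{\sum_{i=1}^{N} a_i b_i}
         {\sqrt{\sum_{i=1}^{N} a_i^2}\,\sqrt{\sum_{i=1}^{N} b_i^2}}

% Cauchy-Schwarz inequality applied to the numerator:
\left| \sum_{i=1}^{N} a_i b_i \right|
  \le \sqrt{\sum_{i=1}^{N} a_i^2}\,\sqrt{\sum_{i=1}^{N} b_i^2}
\quad\Longrightarrow\quad
-1 \le \mathrm{Corr}(X,Y) \le 1
```

Equality holds exactly when the centered observations of Y are a constant multiple of those of X, that is, when the points lie on a straight line.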

Implementation:

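NumPy provides np.cov and np.corrcoef to calculate these two statistics. Each takes the samples as arrays and returns a matrix; for two variables, the off-diagonal entry [0, 1] holds the covariance or correlation of X and Y.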

Import libraries

import numpy as np
import matplotlib.pyplot as plt

Covariance:

# X is uniform on [0, 1); Y is a noisy linear function of X.
# No random seed is set, so the printed value varies slightly between runs.
X = np.random.rand(50)
Y = 2 * X + np.random.normal(0, 0.1, 50)
cov_matrix = np.cov(X, Y)  # 2x2 sample covariance matrix
print('Covariance of X and Y: %.2f' % cov_matrix[0, 1])

Covariance of X and Y: 0.21

Correlation:

X = np.random.rand(50)
Y = 2 * X + np.random.normal(0, 0.1, 50)

cor_matrix = np.corrcoef(X, Y)  # 2x2 correlation matrix
print('Correlation of X and Y: %.2f' % cor_matrix[0, 1])

Correlation of X and Y: 0.99

 Covariance vs. Correlation

 

Correlation is simply a normalized form of covariance. Both describe the direction of the linear relationship between two variables, and the two terms are often used loosely in everyday conversation. They are not interchangeable, though: covariance carries the units of the two variables and depends on their scale, whereas correlation is unit-free, which makes correlations comparable across datasets.

The value of the correlation coefficient ranges from -1 to 1. A value of -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
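The boundary values can be reproduced with noise-free linear data. This is a small sketch with made-up slopes and intercepts, chosen only to show that the steepness of the line does not affect the coefficient.

```python
import numpy as np

x = np.arange(10, dtype=float)

# Exact linear functions of x, with no noise added.
r_pos = np.corrcoef(x, 3 * x + 2)[0, 1]     # increasing line
r_neg = np.corrcoef(x, -0.5 * x + 7)[0, 1]  # decreasing line

print(f"Increasing line: {r_pos:.2f}")
print(f"Decreasing line: {r_neg:.2f}")
```

The sign of the coefficient follows the sign of the slope, while its magnitude reflects how tightly the points follow a line, not how steep that line is.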

To get a sense of what correlated data looks like, let us plot two correlated datasets.

Positive Relationship:

X = np.random.rand(50)
Y = X + np.random.normal(0, 0.1, 50)
plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

Correlation of X and Y: 0.94

Negative Relationship:

X = np.random.rand(50)
Y = -X + np.random.normal(0, .1, 50)

plt.scatter(X,Y)
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.show()
print('Correlation of X and Y: %.2f'%np.corrcoef(X, Y)[0, 1])

Correlation of X and Y: -0.96

Conclusion

Correlation is a normalized form of covariance and is not affected by changes in scale. Both covariance and correlation measure the linear relationship between variables, but they cannot be used interchangeably: covariance depends on the units of the variables, while correlation does not.


Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
