Covariance and correlation are two closely related concepts that are widely used in statistics. When comparing data samples, covariance measures how much two random variables vary together, whereas correlation measures how strongly, and in which direction, the two variables are linearly related on a standardized scale. Neither measure, on its own, shows that a change in one variable causes a change in the other.
Both covariance and correlation measure linear relationships between variables. When the correlation coefficient is positive, an increase in one variable tends to be accompanied by an increase in the other. When the correlation coefficient is negative, the two variables tend to move in opposite directions. When it is close to zero, there is no linear relationship: a change in one variable tells us little about the other.
A sample is a randomly chosen selection of elements from an underlying population. We usually calculate covariance and correlation on samples rather than on the complete population. Covariance and correlation measured on samples are known as sample covariance and sample correlation.
Covariance is a measure used to determine how much two variables change in tandem. The unit of covariance is a product of the units of the two variables. Covariance is affected by a change in scale. The value of covariance lies between -∞ and +∞.
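As a rough illustration of this scale dependence, the sketch below expresses the same measurements in two different units. The height and weight data are made up for the example (the variable names, coefficients, and seed are assumptions, not from a real dataset). Converting heights from metres to centimetres multiplies the covariance by 100 even though the underlying relationship has not changed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed so the run is repeatable

# Hypothetical data: heights in metres, weights in kilograms
height_m = rng.normal(1.70, 0.10, 200)
weight_kg = 60 + 50 * (height_m - 1.70) + rng.normal(0, 5, 200)

# The same heights expressed in centimetres
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]    # units: m * kg
cov_cm = np.cov(height_cm, weight_kg)[0, 1]  # units: cm * kg

print('Covariance (m, kg):  %.4f' % cov_m)
print('Covariance (cm, kg): %.4f' % cov_cm)
print('Ratio: %.1f' % (cov_cm / cov_m))
```

The ratio of the two covariances equals the unit conversion factor, which is why raw covariance values are hard to compare across datasets measured in different units.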
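The sample covariance matrix is a K-by-K matrix, where K is the number of variables. Its entries are given by

$$q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$

Here’s what each element in this equation means:
$q_{jk}$ = the sample covariance between variables j and k.
$N$ = the number of elements in each sample.
$i$ = an index that assigns a number to each sample element, ranging from 1 to N.
$x_{ij}$ = a single element in the sample for j.
$x_{ik}$ = a single element in the sample for k.
$\bar{x}_j$, $\bar{x}_k$ = the sample means of variables j and k.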
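The definitions above can be turned into a few lines of code. This is a minimal sketch that assumes the usual unbiased estimator with an N - 1 denominator, which is also the default behaviour of `np.cov`. The helper name `sample_covariance` and the small data arrays are illustrative only.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance of two equal-length samples, using N - 1 in the denominator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Sum over i of (x_i - mean(x)) * (y_i - mean(y)), divided by N - 1
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Small hand-checkable example
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

manual = sample_covariance(x, y)
library = np.cov(x, y)[0, 1]

print('Manual:  %.2f' % manual)   # 1.50
print('np.cov:  %.2f' % library)  # 1.50
```

Here the deviations from the means multiply to 4, 0, 0, 0 and 2, so the sum is 6 and dividing by N - 1 = 4 gives 1.5.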
The sample correlation between two variables is a normalized version of the covariance.
The value of the correlation coefficient is always between -1 and 1. Once we’ve normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare correlations across different pairs of variables.
To calculate the sample correlation, which is also known as the sample correlation coefficient, between random variables X and Y, you have to divide the sample covariance of X and Y by the product of the sample standard deviation of X and the sample standard deviation of Y.
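In symbols, writing $s_X$ and $s_Y$ for the two sample standard deviations:

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{s_X \, s_Y}$$

The key terms in this formula are:
Corr(X,Y) = sample correlation between X and Y
Cov(X,Y) = sample covariance between X and Y
$s_X$ = sample standard deviation of X
$s_Y$ = sample standard deviation of Y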
The formula used to compute the sample correlation coefficient ensures that its value ranges between –1 and 1.
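The division described above can be checked directly. The following sketch assumes sample standard deviations computed with `ddof=1`, so that they match the N - 1 denominator used by `np.cov`. The helper name `sample_correlation` and the data are illustrative.

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation: sample covariance divided by the product of sample std devs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y)[0, 1]  # sample covariance (N - 1 denominator)
    std_x = np.std(x, ddof=1)    # sample standard deviation of x
    std_y = np.std(y, ddof=1)    # sample standard deviation of y
    return cov_xy / (std_x * std_y)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

manual = sample_correlation(x, y)
library = np.corrcoef(x, y)[0, 1]

print('Manual:       %.4f' % manual)
print('np.corrcoef:  %.4f' % library)
```

Both values agree, and the result stays within the -1 to 1 range regardless of the units of `x` and `y`.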
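NumPy provides functions to calculate both statistics: np.cov and np.corrcoef. Each takes the sample arrays as input and returns a matrix, from which we read the off-diagonal entry for the pair of variables.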
    # Only NumPy and Matplotlib are used in the examples that follow
    import numpy as np
    import matplotlib.pyplot as plt
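    # Y depends linearly on X, plus a small amount of Gaussian noise
    X = np.random.rand(50)
    Y = 2 * X + np.random.normal(0, 0.1, 50)

    # np.cov returns the 2-by-2 covariance matrix; entry [0, 1] is Cov(X, Y)
    cov_matrix = np.cov(X, Y)
    print('Covariance of X and Y: %.2f' % cov_matrix[0, 1])

Output (the exact value varies from run to run because no random seed is set):

    Covariance of X and Y: 0.21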
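    X = np.random.rand(50)
    Y = 2 * X + np.random.normal(0, 0.1, 50)

    # np.corrcoef returns the 2-by-2 correlation matrix; entry [0, 1] is Corr(X, Y)
    cor_matrix = np.corrcoef(X, Y)
    print('Correlation of X and Y: %.2f' % cor_matrix[0, 1])

Output (the exact value varies from run to run because no random seed is set):

    Correlation of X and Y: 0.99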
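Both functions return full matrices rather than single numbers, which is why the snippets above index `[0, 1]`. The sketch below shows the structure of those matrices for K = 3 variables. The third variable `z` and the seed are assumptions added for illustration. By default, `np.cov` and `np.corrcoef` treat each row of a 2-D input as one variable.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary fixed seed

x = rng.random(100)
y = 2 * x + rng.normal(0, 0.1, 100)
z = rng.normal(0, 1, 100)        # unrelated third variable

data = np.vstack([x, y, z])      # shape (3, 100): one row per variable

cov_matrix = np.cov(data)        # 3-by-3 sample covariance matrix
cor_matrix = np.corrcoef(data)   # 3-by-3 sample correlation matrix

print(cov_matrix.shape)                        # (3, 3)
print(np.allclose(cov_matrix, cov_matrix.T))   # True: the matrix is symmetric
# The diagonal holds each variable's sample variance
print(np.allclose(np.diag(cov_matrix), data.var(axis=1, ddof=1)))  # True
# Every variable has correlation 1 with itself
print(np.allclose(np.diag(cor_matrix), 1.0))   # True
```

Entry `[j, k]` of either matrix describes the pair of variables in rows j and k, so `[0, 1]` and `[1, 0]` hold the same value.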
Covariance vs. Correlation
Correlation is simply a normalized form of covariance. Conceptually the two are almost identical, and they are often used semi-interchangeably in everyday conversation. They differ in units and scale, however, so it is important to be precise with language when discussing them.
The value of the correlation coefficient ranges from -1 to 1. A value of -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
To get a sense of what correlated data looks like, let us plot two correlated datasets, one positively and one negatively correlated:
    # Positively correlated data: Y increases with X
    X = np.random.rand(50)
    Y = X + np.random.normal(0, 0.1, 50)

    plt.scatter(X, Y)
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    plt.show()

    print('Correlation of X and Y: %.2f' % np.corrcoef(X, Y)[0, 1])

Output (varies from run to run):

    Correlation of X and Y: 0.94
    # Negatively correlated data: Y decreases as X increases
    X = np.random.rand(50)
    Y = -X + np.random.normal(0, 0.1, 50)

    plt.scatter(X, Y)
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    plt.show()

    print('Correlation of X and Y: %.2f' % np.corrcoef(X, Y)[0, 1])

Output (varies from run to run):

    Correlation of X and Y: -0.96
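For contrast with the two correlated datasets above, the sketch below looks at correlations near zero. It uses two assumed cases that are not part of the original examples: a variable generated independently of X, and Y = X² with X symmetric around zero. The second case is included because Y is then fully determined by X, yet the relationship is not linear, so the correlation coefficient is still close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed
n = 5000

x = rng.uniform(-1, 1, n)       # symmetric around zero
noise = rng.normal(0, 1, n)     # generated independently of x
y_quadratic = x ** 2            # depends on x, but not linearly

r_independent = np.corrcoef(x, noise)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print('Correlation of X and independent noise: %.2f' % r_independent)
print('Correlation of X and X squared:         %.2f' % r_quadratic)
```

Both coefficients come out close to zero, which is a reminder that a correlation near zero rules out a linear relationship, not every kind of relationship.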
Correlation is a normalized form of covariance and is not affected by scale. Both covariance and correlation measure the linear relationship between variables, but because covariance carries units and correlation does not, the two values cannot be used interchangeably.