In this blog, we discuss variance, standard deviation and how to find them using Python.
Before we Proceed we recommend you to go through our first blog in this Data Science Tutorial on understanding Mean,Median and Mode using Python.
What is Variance?
Variance is the measure of dispersion in a data set. In other words, it measures how spread out a data set is. It is calculated by first finding the deviation of each element in the data set from the mean, and then by squaring it.
Variance is the average of all squared deviations.
Steps to Calculate Variance:
1. List elements of data set.
The following are ages of students pursuing a Master’s degree:
Data set 1: 28,25,26,27,31,32,24
2. Calculate the mean.
(28 + 25 +26 +27 +31 +32 + 24) / 7 = 27.57
3. Find the deviation from the mean for each data point.
4. Square it.
5. The average of all squared differences is the variance. To find it, add all squared variances and divide the sum by a number of elements in data set (n).
(0.1849 + 6.6049 + 2.4649 + .3249 + 11.76 + 19.6249 + 12. 4609) / 7
53.4303 /7 = 7.6329
Now that we know how to calculate the variance of a data set, let us look at how to find the same using Python.
Consider a list of random integers (data set 2) – 3,3,3,5,6,1. We will now calculate the variance using numpy library.
import numpy as np
results = [3,3,3,5,6,1]
As we can see, the variance of the random data set is 2.58.
Standard deviation is the average difference of all elements in a data set from its mean. It cannot be negative. The value of standard deviation can be easily impacted by outliers as a single outlier (abnormal value) distorts the overall mean, and thereby, deviation from the mean of all elements.
If, to find variance we square the deviations of individual elements from the mean, then to calculate standard deviation, we need to calculate the square root of the variance.
To find the standard deviation in ages of students pursuing Master’s, we calculate the square root of the variance:
Standard deviation of the data set 1 is 2.76.
To find standard deviation using Python, we will use data set 2. The numbers are listed below, and we already know the variance.
results = [3,3,3,5,6,1]
To calculate standard deviation use the inbuilt function “std” from numpy as shown below:
Example of mean and variance from everyday life.
Employee 1 login times:
Day 1: 9.01AM, Day 2: 9.15AM, Day 3: 9.12AM, Day 4: 9.30AM, Day 5: 9.45AM
Employee 2 login times:
Day 1: 9.00AM, Day 2: 9.15AM, Day 3: 9.30AM, Day 4: 9.45AM, Day 5: 9.20AM
Which employee logs-in at a consistent time? Answer: The one with less variance.
In the next article, we will discuss probability density function, probability mass functions, and data distributions. To learn more about data science and make a career in this booming field, join AcadGild – the home of India’s best online courses.