This is the first article in a series of tutorials on data science. We will cover the following topics in this article:

- Types of data
- Mean
- Median
- Impact of outliers on mean
- Mode

Without delving too deep into the coding aspect, we will see what mean, median, and mode are, and how to derive them in Python. We will discuss the code in more detail in subsequent articles that focus on Python libraries. Let us begin by discussing the three different types of data:

- Numerical Data
- Categorical Data
- Ordinal Data

**1. Numerical Data**

It’s probably the most common type of data. Basically, it represents some quantifiable thing that you can measure. Some examples are heights of people, page load times, and stock prices.

Numerical data can be subdivided into two types:

**1.1) Discrete data**

Discrete data is measured in whole numbers (integers). An example is the number of purchases made by a customer in a year. Since the number of things that a person buys cannot be three and a half, or four and a third (it must be a whole number like four or five things), this kind of data falls under the discrete category.

**1.2) Continuous data**

In contrast to discrete data, continuous data includes all numbers possible between any two integers or whole numbers. For example, the height of something. It could be 9.2345 inches or 9.7219 inches, or any other fraction between the two whole numbers nine and ten. Another example could be the amount of rainfall recorded in a day. Again, the amount does not necessarily have to be a whole number. It could be 6.5 mm or 23.1 mm of rainfall, depending on the shower God’s fancy.

**2. Categorical Data**

This type of data is non-numeric. We use it to classify things into categories such as gender, ethnicity, nationality, or political party. We can assign numbers to the categories, but the numbers do not, in that case, represent a value per se. They only separate one category from another: type one from type two or three. For example, while calculating India’s population, Bangalore could be city number one, Mumbai number two, and so on. The data collected, however, would still represent the number of people in Bangalore and Mumbai, and not the population of one and two. These numbers have no value of their own in this context.
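
To make the point about category codes concrete, here is a minimal sketch in plain Python. The city codes follow the example above; the list of responses is made-up data used only for illustration.

```python
from collections import Counter

# Numeric codes for two cities; the codes are labels, not quantities.
city_codes = {"Bangalore": 1, "Mumbai": 2}

# Made-up survey responses, stored as codes (illustrative only).
responses = [1, 2, 2, 1, 2]

# Counting how often each label occurs is meaningful.
code_to_city = {code: city for city, code in city_codes.items()}
counts = Counter(code_to_city[code] for code in responses)
print(counts)

# Arithmetic on the codes is not: the result matches no city at all.
meaningless_average = sum(responses) / len(responses)
print(meaningless_average)
```

Counting labels tells us something about the data, while the average of the codes does not describe any real city.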

**3. Ordinal Data**

Ordinal data is an amalgamation of numerical and categorical data. Simply put, this data type consists of categories that are in order. The intervals between categories are not known. Good examples of this data type are movie or music ratings that use stars to denote quality. Numbers simply represent the good and bad categories. A movie with a 5-star rating is obviously very good as opposed to a movie with only 1-star, which, very likely, is terrible. Note that the numbers in this example do denote value. Mathematically speaking, 5 is greater than 1. This difference in value is used to differentiate good films from bad. Good films receive a higher rating of 4 or 5, while bad films only get a lower rating of 1 or 2.
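
A small sketch of why order matters for ordinal data. The film names and ratings below are invented for illustration, and the thresholds simply restate the rule of thumb above (4 or 5 stars is good, 1 or 2 is bad).

```python
# Made-up star ratings (1 to 5) for a handful of films.
ratings = {"Film A": 5, "Film B": 1, "Film C": 4, "Film D": 2}

# Order is meaningful, so sorting films by rating makes sense.
ranked = sorted(ratings, key=ratings.get, reverse=True)
print(ranked)

# Split films into good and bad using the star thresholds.
good = [film for film, stars in ratings.items() if stars >= 4]
bad = [film for film, stars in ratings.items() if stars <= 2]
print(good, bad)
```

Note that the code only compares ratings; it never assumes that the gap between 1 and 2 stars equals the gap between 4 and 5.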

**Mean**

Mean is simply another name for average. To calculate the mean of a data set, divide the sum of all values by the number of values.

Consider the following set of numbers: {5,2,2,7}. The mean is (5 + 2 + 2 + 7) / 4 = 16 / 4 = 4. We use the symbol “x-bar” (x̄) to represent the mean of a sample. The formula to compute the mean for a set of n values is:

x̄ = (x₁ + x₂ + … + xₙ) / n
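
As a rough sketch, the same calculation for the set above can be written in plain Python. The variable names are chosen for illustration; `statistics.mean` comes from the Python standard library and is used here only as a cross-check.

```python
import statistics

values = [5, 2, 2, 7]

# Mean = sum of all values divided by the number of values.
mean_by_hand = sum(values) / len(values)

# The standard library gives the same answer.
mean_stdlib = statistics.mean(values)

print(mean_by_hand, mean_stdlib)
```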

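To see this in Python, we will generate a sample of expenditure values drawn from a normal distribution. We will explain terms like standard deviation and normal distribution in subsequent articles. For now, all we need to keep in mind is the sample size (10,000) and the mean the values are centered on (25,000). Don’t worry about the other components, such as the numpy library or the remaining parameters.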

**Code:**

    import numpy as np

    # 10,000 values from a normal distribution (mean 25,000, standard deviation 15,000)
    expenditure = np.random.normal(25000, 15000, 10000)
    np.mean(expenditure)
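
Because the sample is random, the result changes on every run. If you want repeatable numbers, one option is a seeded generator. This is a sketch, not part of the original example: it assumes NumPy's `np.random.default_rng`, and the seed value 0 is arbitrary.

```python
import numpy as np

# A fixed seed makes the run repeatable; the value 0 is arbitrary.
rng = np.random.default_rng(0)

# Same parameters as the example above: center 25,000, spread 15,000, 10,000 values.
expenditure = rng.normal(25000, 15000, 10000)

sample_mean = np.mean(expenditure)
print(sample_mean)  # close to, but not exactly, 25,000
```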

**Median**

Median, in simple words, is the number that lies in the middle of a list of ordered numbers. The numbers may be in ascending or descending order. Let us consider the following data set:

0,2,3,4,5,1,2,0,6

After sorting these numbers in ascending order, we get the following list:

0,0,1,2,2,3,4,5,6

2 – the number in the center (fifth from either side) – is the median in this example.

The median is easy to find when there is an odd number of elements in the data set. When there is an even number of elements, you need to take the average of the two numbers that fall in the center of the ordered list. So, if we consider the following data set:

0,0,1,4,2,3

After sorting the numbers, we get the following list:

0,0,1,2,3,4

The average of 1 and 2, in this case, is the median.

Median = (1 + 2) / 2 = 1.5
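
The two cases above (odd and even number of elements) can be sketched as a short function. `median_by_hand` is a hypothetical helper name used for illustration, and it assumes a non-empty list.

```python
def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element is the median.
        return ordered[middle]
    # Even count: average the two elements around the center.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_example = [0, 2, 3, 4, 5, 1, 2, 0, 6]
even_example = [0, 0, 1, 4, 2, 3]

print(median_by_hand(odd_example))   # 2
print(median_by_hand(even_example))  # 1.5
```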

**Let us now see how to find the median in Python.**

To get the median of a data set in Python, run “np.median(expenditure)” in a Jupyter notebook.

The median of expenditures from the previous example is 25,179.05. In this case, it is not very far from the mean, which is 25,120.24.

Before we discuss mode, let us understand what outliers are, and how they impact the mean of a data set.

- Any value in a data set that lies at an abnormal distance from all other values can be termed an outlier. Outliers tend to skew the mean radically.
- An outlier can be either a very high value or a very low value.

Let us see how, by manually appending a large value (1000000000) to the expenditure data and then calculating the median and mean.

**Code:**

    # Append one extreme value, then recompute both statistics
    expenditure = np.append(expenditure, [1000000000])
    np.median(expenditure)
    np.mean(expenditure)

What we find is that the large value, or the outlier, changes the median only slightly (from 25,179.05 to 24,932.93), and the mean to a great extent (from 25,120.24 to 124,822.14). Because the sample is random, the exact figures will differ from run to run. An outlier is considered abnormal because it can skew the mean of a data set radically and thereby misrepresent the data set altogether.
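
The same effect is easier to verify by hand on a tiny data set. The figures below are made up for illustration and use only the Python standard library.

```python
import statistics

# Made-up spending figures, for illustration only.
spending = [20, 22, 25, 27, 30]
with_outlier = spending + [1000]

mean_before = statistics.mean(spending)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(spending)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # the mean jumps from 24.8 to about 187.3
print(median_before, median_after)  # the median only moves from 25 to 26.0
```

A single extreme value multiplies the mean several times over, while the median shifts by one position in the ordered list.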

**Mode**

Mode is not used as often as mean or median. It is the value that appears most often in a data set. For example, in the following data set, 0 appears more often than any other value. Therefore, it is the mode.

0,0,1,2,3,0,4,5,0
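
Finding the mode amounts to counting how often each value occurs. A minimal sketch for the data set above, using `collections.Counter` from the standard library:

```python
from collections import Counter

data = [0, 0, 1, 2, 3, 0, 4, 5, 0]

# Count the occurrences of each value, then take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 0 appears 4 times
```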

**Mode in Python:**

Let’s generate a random expenditure data set and find its mode using the script below.

    expenditure = np.random.randint(15, high=50, size=200)
    expenditure

    from scipy import stats
    stats.mode(expenditure)

In this run, 35 is the most frequently occurring value in the random data set. Therefore, it is the mode of the data set. Since the data is random, your result may differ.

**Conclusion:**

Mean is the average of a data set. Median is the value that lies at the center. If there is an even number of elements in a data set, the median is the average of the two values that lie in the center. The mode is the value with the highest frequency in a data set. Outliers are abnormal elements of a data set that lie very far from the rest of the elements in the same set.

In the next article, we’ll look at standard deviation, variance, and how to find them using Python.
