In this blog, we will discuss how to calculate the statistical mean, median, and mode using the R programming language.
In statistics, the mean, median, and mode are used to describe the central tendency of a data set.
When working with a large data set, it is handy to represent the whole data set with a single value, i.e., a measure of central tendency that characterizes the center of the data set.
Statistical analysis in R is carried out with built-in functions. Most of these functions are part of the packages that ship with R and load by default, such as base and stats.
You can apply a variety of functions to a single set of numbers.
Example:
range(x) # Returns the minimum and maximum of x
vector() # Produces a vector of the given length and mode
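As a quick illustration of the two functions above, here is a minimal sketch that applies them to the sample sequence used in the examples of this blog. The variable names r and v are chosen here purely for illustration.

```r
# Sample sequence used in the examples of this blog
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)

# range() returns a vector of length 2: the minimum and the maximum of x
r <- range(x)
print(r)  # -21 and 20

# vector() produces a vector of the given mode and length;
# a numeric vector is initialised with zeros
v <- vector(mode = "numeric", length = 3)
print(v)  # three zeros
```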
Now, let us understand how to find or calculate the mean, median and mode using R.
How to calculate the Mean of a data set?
What is the Mean?
The mean is the average of all the numbers in a data set, i.e.,
“The sum of all values in a collection of numbers divided by the count of values in the collection.”
Suppose we have a column of a data set with size n; its sample mean is then:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Here $\bar{x}$ is the sample mean, n is the size of the data set, $x_i$ are the numbers in the sequence, and ∑ denotes the summation over the entire data set.
Similarly, for a population of size N, the population mean is:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Here μ is the population mean, N is the size of the population, $x_i$ are the numbers in the sequence, and ∑ denotes the summation over the entire population.
Now, let us see how the mean works using the following data set.
Example: Find the average of the given sequence x.
x = 10,9,9,5,18,20,5,-21,5,5
Mean = (10 + 9 + 9 + 5 + 18 + 20 + 5 + (-21) + 5 + 5) / 10 = 65 / 10 = 6.5
Hence the average of the sequence is 6.5.
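The definition of the mean can be checked by hand in R with sum() and length(). This is a minimal sketch; it assumes the sequence is the one used in the other examples of this blog (10,9,9,5,18,20,5,-21,5,5), which is consistent with the stated result of 6.5. The variable names are illustrative.

```r
# Sequence assumed from the other examples in this blog
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)

total <- sum(x)     # sum of all values: 65
n <- length(x)      # number of values: 10

# Mean = sum of all values / count of values
manual_mean <- total / n
print(manual_mean)  # 6.5
```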
Basic syntax in R for Mean:
mean(x, trim = 0, na.rm = FALSE, ...)
In the above syntax, mean() is the function that computes the mean in R, x is the input vector, trim is the fraction (from 0 to 0.5) of observations to drop from each end of the sorted vector before the mean is computed, and na.rm is a logical value that indicates whether missing values should be removed from the input vector.
R represents missing values as NA. These often appear when a data set consists of variables of uneven length.
To make sure that only the observed values are used, add na.rm = TRUE.
In this example, we create the vector 10,9,9,5,18,20,5,-21,5,5, assign it to x, and then calculate the mean with the mean() function, which gives 6.5.
We also use trim = 0.3, which drops 3 values (30% of the 10 observations) from each end before the mean is calculated.
To do this, R first arranges the numbers in order of size, giving the sequence -21,5,5,5,5,9,9,10,18,20. Dropping three values from each end leaves 5,5,9,9, so the trimmed mean is 7.
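The steps described above can be sketched in R as follows. The original code listing is not reproduced in the text, so the variable names result_mean and result_trim are assumptions made for this illustration.

```r
# Create the vector and assign it to x
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)

# Plain arithmetic mean
result_mean <- mean(x)
print(result_mean)  # 6.5

# Trimmed mean: trim = 0.3 drops floor(10 * 0.3) = 3 values
# from each end of the sorted vector before averaging
result_trim <- mean(x, trim = 0.3)
print(result_trim)  # 7

# The sorted sequence, for reference
print(sort(x))
```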
Next, we create a vector with an NA value, i.e., a missing value: x = 10,9,9,5,18,20,5,-21,5,5,NA
If any missing values are present in the data, mean() returns NA unless na.rm = TRUE is set.
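A minimal sketch of this behaviour, using the vector with a missing value given above. The variable names mean_with_na and mean_without_na are illustrative.

```r
# Vector containing one missing value
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5, NA)

# With the default na.rm = FALSE, the result is NA
mean_with_na <- mean(x)
print(mean_with_na)     # NA

# With na.rm = TRUE, the NA is dropped before the mean is computed
mean_without_na <- mean(x, na.rm = TRUE)
print(mean_without_na)  # 6.5
```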
How to calculate the Median of a data set?
In the data center, the mean and median are often tracked over time to spot trends. The statistical median is the middle number in a sorted sequence of numbers.
“To find the median, arrange the numbers in order of size; the number that lies halfway is the median.”
It is a statistical measure of the central tendency of the data values. To get the median, arrange the values of the data set in increasing order and find the value that lies halfway through the data set.
Now let us understand how to calculate the median with an example.
Example: Find the Median of the given sequence x.
x = 10,9,9,5,18,20,5,-21,5,5
Sort the numbers in ascending order, from least to greatest.
The sequence you get will be -21,5,5,5,5,9,9,10,18,20
Note: If there are two middle numbers, calculate their average value.
Here the two middle numbers are 5 and 9, so the median is (5 + 9) / 2 = 7, the value halfway through the sequence.
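The manual procedure above can be written out step by step in R. This is only a sketch of the by-hand method for illustration; the variable names sorted_x and manual_median are assumptions.

```r
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)

# Step 1: sort the numbers in ascending order
sorted_x <- sort(x)
n <- length(sorted_x)

# Step 2: pick the middle value, or average the two middle values
if (n %% 2 == 0) {
  # Even count: average the values at positions n/2 and n/2 + 1
  manual_median <- (sorted_x[n / 2] + sorted_x[n / 2 + 1]) / 2
} else {
  # Odd count: take the single middle value
  manual_median <- sorted_x[(n + 1) / 2]
}
print(manual_median)  # 7
```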
Basic syntax in R for Median:
median(x, na.rm = FALSE)
Here, the function median() is used.
x is the input vector, and na.rm indicates whether missing values should be removed from the input vector.
To use it, create a vector, assign it to a variable x, and pass x to median().
median() takes care of ordering the values itself and, because this vector has an even number of values, returns the average of the two middle values as the median.
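A minimal sketch of the median() call described above. The original listing is not reproduced in the text, so the variable names here are assumptions; the second part shows the effect of na.rm on a vector with a missing value.

```r
# Create a vector and assign it to x
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)

# median() handles the ordering internally
result_median <- median(x)
print(result_median)  # 7

# With a missing value, the default result is NA
x_na <- c(x, NA)
median_with_na <- median(x_na)
print(median_with_na)  # NA

# na.rm = TRUE removes the NA before the median is computed
median_without_na <- median(x_na, na.rm = TRUE)
print(median_without_na)  # 7
```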
How to calculate the Mode of a data set?
“The mode is the number that occurs most often in a set of numbers.”
The mode helps to identify the most frequent value in the data set.
Find the Mode of the given sequence x.
x = 10,9,9,5,18,20,5,-21,5,5
The mode: Here 5 occurs four times, more often than any other value, so it is the mode of the given sequence.
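One way to verify this count is to tabulate the frequencies with table(). This is a sketch for checking the example by hand, not necessarily the approach the blog's own function takes; the variable names freq and most_frequent are illustrative.

```r
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)

# table() counts how many times each distinct value occurs
freq <- table(x)
print(freq)

# which.max() gives the position of the highest count;
# names(freq) holds the values as character strings
most_frequent <- as.numeric(names(freq)[which.max(freq)])
print(most_frequent)  # 5
```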
In R, there is no standard built-in function to calculate the statistical mode. (The built-in mode() function returns the storage mode of an object, not its most frequent value.)
Therefore, we are going to create a user-defined function to find the mode of a data set in R. The function takes a vector as input and returns the mode value as output.
To calculate the mode, we define the function getmode(x). Its output is the value that occurs most often in the sequence.
Here, unique(x) returns a vector, data frame, or array like x but with duplicate elements or rows removed.
getmode(x) is the user-defined function; when called directly, it returns the mode of the numeric vector.
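The body of getmode() is not reproduced in the text, so the following is a sketch of one common way to write such a function, assuming it is built on unique(), match(), and tabulate(). The helper names uniq_x and counts are illustrative. Note that if several values tie for the highest count, this version returns the one that appears first in x.

```r
# User-defined function: returns the most frequent value in x
getmode <- function(x) {
  uniq_x <- unique(x)                    # distinct values, in order of first appearance
  counts <- tabulate(match(x, uniq_x))   # occurrences of each distinct value
  uniq_x[which.max(counts)]              # value with the highest count
}

# Create the vector and calculate its mode
x <- c(10, 9, 9, 5, 18, 20, 5, -21, 5, 5)
result_mode <- getmode(x)
print(result_mode)  # 5
```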
We hope the examples in this blog helped you understand how to calculate the statistical mean, median, and mode with R.
If you are a Python geek, you can follow this blog to learn more: https://acadgild.com/blog/python-mean-median-mode
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.