As the name suggests, the world of Data Science is centered on statistics and graphs. Anyone who wants to pursue data science as a career must have a good command of the basics of these two topics.
In this article, let us study the cumulative frequency distribution and analyze its use in data science. To understand the concept, we first need to understand certain key terms.
Introduction to Statistics
To begin, let us first understand what ‘statistics’ is. Statistics is the science of collecting, organizing, and analyzing data, inferring results from it, and presenting those results. Since statistics is essentially the study of data, it lies at the core of data science.
Important Key Terms
1. Frequency: In statistics, frequency is a count of how often something happened. This could be how many times an event happened in total, or within a given time interval. For example, if 200 people in the age group 18-25 watch TV series, then 200 is the frequency.
2. Class: A class is a range of values into which observations are grouped. Data is divided into classes in order to compute a frequency distribution. The class width usually stays fixed and is computed by dividing the difference between the maximum and minimum values into equal intervals. For example, if 100 students score marks between 0 and 100, the classes can be 0-10, 10-20, 20-30, and so on.
3. Class Frequency: Class frequency is the number of observations or data elements in a class. A class can be a numeric range such as “70-80”, or a category such as married or manufacturing. The class frequency would then be the number of students who scored between 70% and 80%, the number of married people, or the number of people working in manufacturing.
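The three terms above can be tied together in a short Python sketch. The marks below are made-up illustrative data, and the class width of 10 and the helper name `class_label` are assumptions chosen for the example, not part of any particular library.

```python
from collections import Counter

# Hypothetical marks for ten students (illustrative data only).
marks = [12, 47, 55, 58, 63, 71, 74, 78, 85, 93]

CLASS_WIDTH = 10  # assumed class interval


def class_label(value, width=CLASS_WIDTH):
    """Return the class a value falls in, e.g. 74 -> '70-80'.

    Each class includes its lower limit and excludes its upper limit.
    """
    lower = (value // width) * width
    return f"{lower}-{lower + width}"


# Class frequency: the number of observations that fall in each class.
class_frequency = Counter(class_label(m) for m in marks)

# Print the classes in ascending order of their lower limit.
for label, count in sorted(class_frequency.items(),
                           key=lambda item: int(item[0].split("-")[0])):
    print(label, count)
```

Here each mark is an observation, each label such as "70-80" is a class, and the count stored against a label is its class frequency.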
Introduction to Cumulative Frequency Distribution
Now that we have our basics covered, let us try to understand what cumulative frequency distribution is. The cumulative frequency of a class is the sum of the frequency of that class and the frequencies of all classes below it in a given frequency distribution table. The cumulative frequency distribution lists this running total for every class.
In data analytics, which is an important part of data science, cumulative frequency is used to determine the number of observations that lie above (or below) a particular value in a data set.
All that means is that you are adding up a value and all the values that came before it. It is calculated by adding each frequency from a frequency distribution table to the sum of its predecessors.
Want to check that your math is correct? Add up all the individual frequencies and compare the result with your sample size. If both are equal, you know you have included all your data.
The last cumulative value will always be equal to the total number of observations, since all frequencies will already have been added to the running total.
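The running total and the sanity check described above can be sketched in a few lines of Python. The classes and frequencies are hypothetical values picked for illustration. `itertools.accumulate` from the standard library is one convenient way to build the running sum.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed from the lowest class to the highest.
classes = ["0-10", "10-20", "20-30", "30-40"]
frequencies = [4, 9, 12, 5]

# Each cumulative value is the class's own frequency plus the running total
# of every class before it.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{label:>6}  {freq:>3}  {cum:>3}")

# Sanity check: the last cumulative value must equal the sample size.
sample_size = sum(frequencies)
assert cumulative[-1] == sample_size
```

If the final assertion fails, a class was probably skipped or counted twice while building the table.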
Let us take an example to understand the concept in depth. Say in a class of 120 students, the frequency distribution table of ‘Percentage scored’ looks like the following:
Percentage Number of students
Here, ‘Percentage’ is our class, with a class interval of 10 (upper class limit minus lower class limit), and ‘Number of students’ is our frequency.
Looking at the table, we can easily say that the number of students who scored above 90% is 15. How many students scored above 80%? Is it 35?
Think carefully! Students who scored above 90% have definitely scored above 80% as well, haven’t they? So to answer how many students scored above 80%, you would add 35 (the number of students who scored between 80% and 90%) and 15 (the number of students who scored between 90% and 100%), which gives 50.
You just calculated the cumulative frequency for the class 80%-90%! Simple, isn’t it? Similarly, to find the cumulative frequency (CF) for the class 70%-80%, you would compute 25 + 35 + 15, which is 75. Because we added up the classes above a value rather than below it, this is known as a ‘more than’ cumulative frequency. Adding from the lowest class upward works the same way and gives the ‘less than’ version. This is how a cumulative frequency distribution is calculated.
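The arithmetic in this example can be reproduced with a small Python sketch. It uses only the three class frequencies quoted in the text (25, 35 and 15). The other classes of the table are not given, so they are left out. The variable names are chosen for illustration.

```python
from itertools import accumulate

# The three class frequencies quoted in the text, lowest class first.
# The remaining classes of the table are not given, so they are omitted.
classes = ["70-80", "80-90", "90-100"]
frequencies = [25, 35, 15]

# Accumulate from the highest class downward, then flip the result back
# into the original class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for label, cum in zip(classes, more_than):
    lower_limit = label.split("-")[0]
    print(f"scored above {lower_limit}%: {cum}")
```

The loop reports 75 students above 70%, 50 above 80% and 15 above 90%, matching the figures worked out by hand.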
Plotting the Cumulative Frequency Distribution Curve
If we plot this data on a graph, i.e. plot the cumulative frequency of each class, the curve we obtain is the cumulative frequency distribution curve, also known as an ogive. It often looks somewhat like the letter ‘S’.
The graph helps us draw conclusions and spot patterns and relationships between classes and frequencies. This is why an understanding of cumulative frequency is important for learning data science. Let us now understand its application in data science.
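As a rough sketch of how the points of such a curve could be prepared, the Python snippet below pairs each class boundary with the running total up to that boundary. The boundaries and frequencies are hypothetical, and the snippet builds a curve that accumulates from the lowest class upward. It only computes the points and does not draw anything.

```python
from itertools import accumulate

# Hypothetical class boundaries and frequencies (illustrative only).
boundaries = [0, 10, 20, 30, 40, 50]  # six boundaries define five classes
frequencies = [2, 6, 14, 6, 2]

# The curve starts at zero on the lowest boundary and then rises to the
# running total at each upper class boundary.
cumulative = [0] + list(accumulate(frequencies))
ogive_points = list(zip(boundaries, cumulative))

for x, y in ogive_points:
    print(f"below {x:>2}: {y}")
```

The resulting (x, y) pairs can be handed to any plotting tool and joined with lines. With frequencies that peak in the middle, as here, the curve rises slowly, then steeply, then slowly again, which produces the S-like shape.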
Cumulative Frequency Distribution As A Data Science Tool
Cumulative frequency distribution is a very useful tool for data scientists, as it plays a key role in extracting information that is used for multiple purposes, such as decision making, product development, trend analysis and forecasting.
Consider a scenario where data scientists want to calculate the number of people suffering from heart disease in certain weight categories. After careful examination, they arrive at the conclusion that the greater the obesity, the higher the chance of a person suffering from heart disease.
How did they arrive at this conclusion? By observing the cumulative frequency distribution, similar to what we did in the ‘Percentage’ example. They compared the weight categories (classes) against the number of people suffering from heart disease (class frequency) and deduced the result!
Tabulation of Data Helps in Data Analysis
Cumulative frequency is an important tool in statistics for tabulating data in an organized manner. It helps to build tables for specific behaviors, interests, locations, actions, and so on.
Whenever you wish to find out the popularity of a certain type of data, or the likelihood that a given event will fall within a certain range of a frequency distribution, a cumulative frequency table can be most useful.
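One way to read a likelihood off a cumulative frequency table is to divide each cumulative value by the total number of observations, which gives the relative cumulative frequency. The Python sketch below assumes a hypothetical table. The numbers are illustrative only.

```python
from itertools import accumulate

# Hypothetical frequency table (illustrative only), lowest class first.
upper_limits = [10, 20, 30, 40]
frequencies = [5, 15, 20, 10]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))

# Relative cumulative frequency: the share of observations that fall
# below each upper class limit.
relative_cumulative = [c / total for c in cumulative]

for limit, share in zip(upper_limits, relative_cumulative):
    print(f"below {limit}: {share:.0%}")
```

Each share can be read as an empirical estimate of the chance that a new observation from the same population falls below that limit, under the assumption that the sample is representative.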
Hope this helped you understand the basic concepts of cumulative frequency distribution, how to apply it, and how it is used in data science. Stay tuned for more updates about data science studies and other basic concepts of data science.
Until then, happy surfing in the world of frequencies!