Any aspiring data scientist should be aware that a basic knowledge of statistics is a must to make the learning process hassle-free. Although you don’t need a doctoral degree in the subject to wrap your brain around data-related concepts, you ought to know at least a few important ones as far as statistics for data science is concerned.
Many are under the misconception that data science is only for those who love Mathematics. Whatever your background, statistics is one of the core subjects you need to know, even if your mathematical knowledge or coding skills aren’t exemplary!
What Is Statistics For Data Science?
According to renowned statisticians Croxton and Cowden, “Statistics may be defined as the collection, presentation, analysis, and interpretation of numerical data.” As data is the foundation of the digital age, it shouldn’t be surprising that Statistics becomes relevant as well.
Statistics For Data Science: Distributions
Statistics for data science is incomplete without a knowledge of various distributions. These include:
Normal Distribution
The normal distribution is often used to describe measurements such as medical findings. Its graph is a symmetric, bell-shaped curve centered on the mean. It typically arises when a single variable is observed across a large group over a period. Applications of the normal distribution include:
- Finding the normal birth weight range of newborns worldwide.
- Modeling stock returns based on their performance over a period.
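As a rough illustration of the birth-weight example, the sketch below uses Python's standard-library `statistics.NormalDist`. The mean of 3400 g and standard deviation of 500 g are assumed values picked for illustration, not real measurements.

```python
from statistics import NormalDist

# Assumed, illustrative parameters (not real data):
# birth weight in grams, modeled as normal with mean 3400 and std dev 500.
birth_weight = NormalDist(mu=3400, sigma=500)

# Bounds of the central 95% of the distribution.
low = birth_weight.inv_cdf(0.025)
high = birth_weight.inv_cdf(0.975)

# Probability that a weight falls below 2500 g.
p_below_2500 = birth_weight.cdf(2500)

print(f"Central 95% range: {low:.0f} g to {high:.0f} g")
print(f"P(weight < 2500 g) = {p_below_2500:.3f}")
```

Under these assumed parameters, the "normal range" is simply the interval about 1.96 standard deviations either side of the mean.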
Poisson Distribution
The Poisson distribution is an essential tool in statistics, used to model the number of events likely to occur in a specified time interval. It’s widely used in industries that deal with large amounts of discrete data in which the probability of any individual event occurring is small. Below are some of the situations where the Poisson distribution can be used:
- The number of customers visiting a bank on an hourly basis.
- The number of visits on a website on an hourly basis.
- The daily number of emergency calls made in a city.
- The number of typos in a document.
- The monthly number of absentees at a large multinational company.
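To make the idea concrete, here is a minimal sketch of the Poisson probability formula, P(k) = e^(-λ) λ^k / k!, where λ is the average number of events per interval. The rate of 4 emergency calls per day is an assumed figure used only for illustration, and `poisson_pmf` is a helper name introduced here.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Assumed, illustrative rate: 4 emergency calls per day on average.
lam = 4.0

# Probability of exactly 2 calls in a day.
p_exactly_2 = poisson_pmf(2, lam)

# Probability of at most 2 calls in a day (0, 1, or 2 calls).
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))

print(f"P(exactly 2 calls) = {p_exactly_2:.4f}")
print(f"P(at most 2 calls) = {p_at_most_2:.4f}")
```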
Binomial Distribution
The binomial distribution gives the likelihood of each possible number of pass (or fail) outcomes in a survey or experiment that is repeated a fixed number of times in succession. Each trial has only two possible outcomes, such as True/False or Yes/No. It differs from the Poisson distribution in the range of the count: a binomial count can never exceed the number of trials, whereas a Poisson count has no upper limit.
Following are some applications of the binomial distribution:
- Number of heads/tails in a series of coin flips
- Vote counts for two different candidates in an election
- The number of successful lead-conversions.
- The number of defective products in a manufacturing line.
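The coin-flip case can be sketched with the binomial formula, P(k) = C(n, k) p^k (1 - p)^(n - k), where n is the number of trials and p the probability of success on each. The choice of 10 fair flips is an assumption for illustration, and `binomial_pmf` is a helper name introduced here.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed, illustrative setup: 10 flips of a fair coin.
n_flips, p_heads = 10, 0.5

# Probability of exactly 5 heads: C(10, 5) / 2**10 = 252 / 1024.
p_five_heads = binomial_pmf(5, n_flips, p_heads)

# The probabilities over all possible counts (0..10) sum to 1.
total = sum(binomial_pmf(k, n_flips, p_heads) for k in range(n_flips + 1))

print(f"P(5 heads in 10 flips) = {p_five_heads:.4f}")
```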
Statistics For Data Science: Theorems And Algorithms
While learning Statistics for data science, it’s impossible to skip the theorems and algorithms below:
Bayes Theorem
Bayes Theorem is a mathematical formula for determining conditional probability. It guides you to revise existing predictions or theories in light of new evidence. For instance, banks can use Bayes Theorem to rate the risk involved in lending money to potential borrowers based on their history of defaulted payments or account activity.
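A small worked example may help here. The formula is P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it to the lending scenario. Every probability in it is an assumed number chosen for illustration, not a real banking figure, and `bayes_posterior` is a helper name introduced here.

```python
def bayes_posterior(prior: float, likelihood: float, false_alarm_rate: float) -> float:
    """Return P(hypothesis | evidence) using Bayes Theorem.

    prior            -- P(hypothesis)
    likelihood       -- P(evidence | hypothesis)
    false_alarm_rate -- P(evidence | not hypothesis)
    """
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_alarm_rate * (1 - prior)
    return likelihood * prior / evidence

# Assumed, illustrative numbers:
p_default = 0.05               # 5% of borrowers default
p_missed_given_default = 0.60  # 60% of defaulters had a past missed payment
p_missed_given_ok = 0.10       # 10% of non-defaulters had one too

# Updated default risk for a borrower with a past missed payment.
posterior = bayes_posterior(p_default, p_missed_given_default, p_missed_given_ok)
print(f"P(default | missed payment) = {posterior:.2f}")
```

With these assumed inputs, the evidence of a missed payment raises the estimated default risk from 5% to 24%.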
K-Nearest Neighbor Algorithm
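K-nearest neighbors (KNN) is often referred to as a lazy algorithm because it does no work at training time: it simply stores the training data and defers all computation until a prediction is requested. It is very easy both to understand and to implement. It can be used for both classification and regression problems and can give highly competitive results.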
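The following is a minimal sketch of k-nearest neighbors used for classification by majority vote, written with the standard library only. The data points, labels, and the function name `knn_predict` are made up for illustration. A production project would more likely use an established library.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Rank training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2D data: three points in cluster "A", three in cluster "B".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction_near_a = knn_predict(points, labels, (1.2, 1.4))
prediction_near_b = knn_predict(points, labels, (6.4, 6.8))
print(prediction_near_a, prediction_near_b)
```

Note that there is no separate training step: all the work happens inside `knn_predict` when a query arrives.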
Bagging (Bootstrap aggregating)
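Bagging is short for “Bootstrap aggregating”. It’s an ensemble machine learning technique in which several weak models are each trained on a random sample of the data drawn with replacement (a bootstrap sample), and their individual predictions are aggregated, by voting or averaging, to get the final prediction.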
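The idea can be sketched in a few lines of plain Python. The weak model used here, a one-threshold "decision stump" on a single feature, is an assumption chosen for brevity. The toy data and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are likewise illustrative. Each stump is trained on a bootstrap sample (drawn with replacement), and the final prediction is a majority vote.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-threshold classifier: predict 1 when x > threshold, else 0."""
    best_threshold, best_errors = None, None
    for t in sorted(set(xs)):
        # Count misclassified points for this candidate threshold.
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Made-up, slightly noisy data: label tends to be 1 for larger x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

model = bagging_fit(xs, ys)
low_pred = bagging_predict(model, 1.5)
high_pred = bagging_predict(model, 9.5)
print(low_pred, high_pred)
```

Individual stumps may disagree near the noisy middle of the data. Voting across many of them smooths out that variation.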
This list is not a one-stop destination for all the fundamentals you need to know in Statistics. However, it provides an overview of what you should know about statistics for data science before taking the plunge into the field. Don’t miss out on checking the data science course that you can enroll in if you are an aspiring data scientist.