This post presents a collection of key data science terms with concise definitions. Getting familiar with data science terminology takes time, because these words are not part of everyday vocabulary. However, once you start studying and hearing these terms, you will understand their importance in data science and eventually want to know more. In this article, I present a set of key data science terms, grouped into various categories. Let's study these categories, and the terms in them, one by one.
- The Fundamentals of Data Science
- Sectors Involving Data Science
- Statistical Tools and Terminologies
- Machine Learning Tools and Terminologies
- Deep Learning Key Terms
Statistical Tools and Terminologies
Usually, much of the effort in learning any new field goes into getting acquainted with its vocabulary. Statistics is no exception. Learning the terminology is challenging at first because the explanation of one term often assumes a working knowledge of other terms, and not all of them can be explained at once. For instance, to understand what a boxplot is, one must already know about the mean, median, quartiles and outliers.
This post aims to bridge the gap between the known and unknown terms of statistics that are absolute basics.
Bayesian Statistics is a mathematical approach that uses probabilities to solve statistical problems. It provides tools to update beliefs in light of new data. It differs from the typical frequentist method in that it uses Bayesian probabilities to weigh the evidence.
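To make the idea of updating a belief concrete, here is a minimal sketch of Bayes' theorem in Python. The function name and the screening-test numbers are hypothetical, chosen only for illustration.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) using Bayes' theorem."""
    # Total probability of seeing the evidence at all.
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence


# Hypothetical numbers: 1% of people have a condition, the test detects it
# 95% of the time, and it gives a false positive 5% of the time.
posterior = bayes_update(prior=0.01, likelihood_if_true=0.95, likelihood_if_false=0.05)

# A second positive test updates the belief again, with the old posterior as the new prior.
second = bayes_update(prior=posterior, likelihood_if_true=0.95, likelihood_if_false=0.05)

print(round(posterior, 3), round(second, 3))
```

Each new piece of evidence moves the belief further, which is the core of the Bayesian approach.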
Correlation is a statistical measure of the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which the variables increase or decrease in parallel. A negative correlation indicates the extent to which one variable rises as the other declines.
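One common way to put a number on correlation is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it by hand; the study-hours data is made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: hours studied, test score, and number of mistakes.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 71, 80, 88]   # rises with hours: positive correlation
mistakes = [9, 7, 6, 4, 2]     # falls as hours rise: negative correlation

print(pearson_r(hours, score))
print(pearson_r(hours, mistakes))
```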
A confidence interval is a range of values, computed from a sample, that is likely to contain the true population value (for example, the actual percentage of the population that fits into a category) at a stated confidence level. Statistics provides precise mathematical methods for computing confidence intervals.
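As a rough sketch, a 95% confidence interval for a percentage can be computed with the normal approximation. The survey numbers are hypothetical, and the approximation is an assumption that only works well for reasonably large samples.

```python
from math import sqrt


def proportion_ci(successes, n, z=1.96):
    """Approximate confidence interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Hypothetical survey: 120 of 400 sampled people fall into the category.
low, high = proportion_ci(successes=120, n=400)
print(f"Estimated share: 30%, 95% CI: {low:.1%} to {high:.1%}")
```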
Descriptive Statistics is a set of statistical tools for quantitatively describing or summarizing a collection of data. Its aim is to summarize, which distinguishes it from inferential statistics, which is more predictive.
A distribution is the arrangement of data by the values of one variable, in ascending order. This arrangement, and features such as its shape and spread, provide information about the underlying sample.
Frequentist Statistics tests whether an event or hypothesis holds or not. It computes the probability of an event from its long-run frequency in an experiment (i.e. the experiment is repeated under identical conditions to obtain the outcome).
Generalizability is the ability to draw conclusions about the characteristics of a population from the results of data collected from a sample. How well the conclusions hold depends heavily on how the sample was selected, the sample size, and many other factors.
Inferential Statistics is one of the two main branches of Statistics. It uses random samples of data taken from a population to describe and make inferences about that population.
Interquartile Range (IQR)
The Interquartile Range, or IQR, is the difference between the scores at the 75th percentile and the 25th percentile, that is, the third and first quartiles, respectively.
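Here is a small sketch of the IQR using Python's built-in `statistics.quantiles`. The scores are made up, and note that different quartile conventions can give slightly different values on small data sets.

```python
from statistics import quantiles

# Hypothetical scores.
scores = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```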
In Statistics, latent variables are variables that are not directly observed but are inferred (via a mathematical model) from other variables that are directly measured or observed. Mathematical models that explain observed variables through latent variables are known as latent variable models.
The mean is the arithmetic average of the values of a variable in a distribution. It offers a single, brief numerical summary of the distribution, and it is probably the most common statistic you will come across in research.
The mean, median and mode are the three major measures of central tendency, which together describe an important and basic feature of a distribution.
The median is the score of a distribution at the 50th percentile, separating the upper and lower 50 percent of scores. The median is useful both for splitting a set of scores in half and for identifying the skew of a distribution.
The mode is the score that occurs most frequently in a distribution. A distribution can have one of the following four modalities:
- Unimodal: Has one peak
- Bimodal: Has two peaks
- Multimodal: Has several peaks
- Uniform: Has no distinct peak; all values occur about equally often
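The three measures of central tendency described above can be computed with Python's built-in `statistics` module. This is only a sketch on made-up numbers.

```python
from statistics import mean, median, mode, multimode

# Hypothetical scores.
scores = [2, 3, 3, 5, 7, 8, 14]

print(mean(scores))    # arithmetic average
print(median(scores))  # middle value of the sorted scores
print(mode(scores))    # most frequent value

# multimode returns every value tied for most frequent,
# which helps spot a bimodal set of scores.
print(multimode([1, 1, 2, 3, 3]))
```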
A set of data is said to be normalized when all its values fall within a common range. Data sets are usually normalized to make comparisons easier and more meaningful.
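One simple way to normalize, among several, is min-max scaling, which maps every value into the range 0 to 1. The sketch below assumes that method; the heights are made up.

```python
def min_max_normalize(values):
    """Rescale values so the smallest becomes 0.0 and the largest becomes 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170, 180, 190]
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```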
An outlier is a data point that lies extremely far away from the other points (the flock). Most often it is the result of either exceptional conditions or faults in measurement. Thus, outliers should be identified during the initial stages of the data analysis workflow.
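A widely used rule of thumb, assumed here for illustration, flags any point that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. A minimal sketch on made-up data:

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one suspicious reading.
readings = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]
outliers = find_outliers(readings)
print(outliers)
```

The threshold `k=1.5` is a convention, not a law; whether a flagged point is a fault or a genuine exceptional case still needs a human look.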
A parameter is a value that describes a characteristic of a population. For instance, if the data of all humans on Earth is taken into account, then the mean age of that population is a parameter.
A population is any complete set whose members share at least one attribute. Populations are not merely people. They may include people, animals, measurements, buildings, motors, vehicles, farms, objects or events.
Predictive Modeling is a process that employs data mining and probability to forecast outcomes. Each model is built from a number of predictors, which are variables that may influence future outcomes.
The range is one of the most basic measures of dispersion. It is the difference between the maximum and minimum values of a distribution.
The residual is a measure of how far an observed value differs from the value predicted by a statistical model fitted to the dataset. The term is often used interchangeably with "error," even though an error is a purely theoretical value.
A sample is the collection of data points under examination. Samples are collected and examined mostly to make inferences about a larger population. In Statistics, a sample is a representative selection from an entire population.
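To illustrate residuals, the sketch below fits a straight line by ordinary least squares (an assumed choice of model) and subtracts each predicted value from the observed one. The data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]

# Residual = observed value minus the value the model predicts.
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
print(residuals)
```

For a least squares line with an intercept, the residuals always add up to (almost exactly) zero, which is a handy sanity check.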
In the context of Business Intelligence (BI), Statistical Analysis is the process of collecting and examining every data sample in the set of items from which samples can be drawn.
Statistics is a collection of mathematical procedures used to analyze and present data. It finds application in designing studies and surveys, and in collecting and analyzing data.
Skew arises when the scores are unevenly distributed, in other words, when the scores cluster more toward one end of the distribution than the other. If the scores cluster toward the high end, they are sparser at the low end, which produces a tail there. This is negative skew. Positive skew occurs when a distribution shows a tail at its high end.
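The direction of skew can be checked numerically. The sketch below uses the moment-based skewness coefficient (one of several definitions, assumed here): a positive result means a tail at the high end, a negative result a tail at the low end. The scores are made up.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical scores with a long tail at the high end.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Mirror image of the same scores: long tail at the low end.
left_tailed = [21 - v for v in right_tailed]

print(skewness(right_tailed))  # positive skew
print(skewness(left_tailed))   # negative skew
```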
Standard Error = S / √n, where S = estimated Standard Deviation and n = sample size
The Standard Error is a statistical term that measures how accurately a sample represents a given population. In Statistics, a sample mean deviates from the actual mean of the population, and the typical size of this deviation is the standard error.
The Standard Error shrinks as the sample grows: it is inversely proportional to the square root of the sample size. This is because a statistic computed from a larger sample tends to be closer to the actual value.
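As a quick sketch, the standard error of the mean is the estimated standard deviation S divided by the square root of the sample size n. The helper names and numbers below are hypothetical.

```python
from math import sqrt
from statistics import stdev


def standard_error(sample):
    """Estimated standard error of the mean: S / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


def standard_error_from_s(s, n):
    """Same formula when S and n are already known."""
    return s / sqrt(n)


# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(sample))

# With the same S, a sample four times larger halves the standard error.
print(standard_error_from_s(10, 25))   # 2.0
print(standard_error_from_s(10, 100))  # 1.0
```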
Statistical significance is a way to demonstrate mathematically that a certain statistic is reliable. When decisions depend on the results of the experiments being run, it's important to confirm that the relationship observed actually exists.
The result of an experiment has statistical significance if it is unlikely to have occurred by chance at a given significance level.
Summary Statistics are measures that convey insights about data in a simple and comprehensible way.
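Here is a minimal sketch of a significance check, assuming a coin-flip experiment and the commonly used 0.05 significance level (both are illustrative choices). It computes an exact two-sided p-value: the probability that a fair coin gives a result at least as lopsided as the one observed.

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Exact binomial p-value against the hypothesis that the coin is fair."""
    distance = abs(heads - flips / 2)
    # Count every outcome at least as far from the expected flips/2 as the observed one.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2 ** flips


ALPHA = 0.05  # assumed significance level

for heads in (60, 65):
    p = two_sided_p_value(heads, 100)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{heads} heads in 100 flips: p = {p:.4f} ({verdict})")
```

A smaller p-value means the result is harder to explain by chance alone.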
A Time Series is a sequence of data points arranged in chronological order of occurrence. Time series data thus consists of measurements or observations, for instance air temperature, pressure or stock prices, together with their date and time stamps.
Variance is a statistical measure of the spread of scores in a distribution: the average squared deviation from the mean. It is rarely used on its own; however, it is a convenient step toward other descriptive statistical measures, like the Standard Deviation.
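The link between variance and standard deviation is easy to see in code. This sketch uses the population versions from Python's `statistics` module on made-up scores.

```python
from statistics import pstdev, pvariance

# Hypothetical scores with a mean of 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

variance = pvariance(scores)  # average squared deviation from the mean
std_dev = pstdev(scores)      # square root of the variance

print(variance, std_dev)
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.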
To Be Continued…
I hope this listicle about "Statistical Tools and Terminologies" will serve you as a cheat sheet whenever you need it. In my next article, I will discuss "Machine Learning Tools and Terminologies". For more information about Data Science and related courses, visit Acadgild.