This post presents a collection of key Data Science terms with brief definitions and descriptions, grouped into separate topics. It takes time to get familiar with Data Science terminology, as these words may not be part of your daily vocabulary. However, once you start reading about the topic and hearing these terms in use, you will understand their importance in Data Science and eventually develop the interest to learn more. In this article, I have collected some key Data Science terms and sorted them into the following categories.
- The Fundamentals of Data Science
- Sectors Involving Data Science
- Statistical Tools and Terminologies
- Machine Learning Tools and Terminologies
- Deep Learning Key Terms
- Procedures concerning Data Science
Statistical Tools and Terminologies
Usually, much of the effort in learning any new field of knowledge goes into acquiring and getting acquainted with its vocabulary. Statistics is no exception. Learning the terminology is a challenge in the initial stages because the explanation of one term often assumes a working knowledge of other terms, not all of which can be explained at once. For instance, to understand what a boxplot is, one must already know the definitions of mean, median, quartile, and outlier.
This post aims, to an extent, to bridge the gap between the known and unknown terms that form the absolute basics of statistics.
Bayesian Statistics is a mathematical approach that uses probabilities to solve statistical problems. It gives people the tools to update their beliefs in light of new data. It differs from the typical frequentist approach and relies on Bayesian probabilities to weigh the evidence.
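To make the idea of updating a belief concrete, here is a minimal sketch of Bayes' theorem applied to a diagnostic test. The prior, sensitivity, and false-positive rate are hypothetical numbers chosen only for illustration; they do not come from any real study.

```python
# Hypothetical numbers, chosen only for illustration.
prior = 0.01           # belief that the condition is present, before testing
sensitivity = 0.95     # P(positive test | condition present)
false_positive = 0.05  # P(positive test | condition absent)

# Total probability of seeing a positive test at all.
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem: updated belief after observing a positive test.
posterior = sensitivity * prior / evidence

print(f"Belief before the test: {prior:.3f}")
print(f"Belief after a positive test: {posterior:.3f}")
```

Under these assumed numbers, the belief rises after the new evidence but stays well below certainty, because the condition was rare to begin with.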
Correlation is a statistical measure of the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which the variables increase or decrease in parallel. A negative correlation indicates the extent to which one variable rises as the other declines.
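As a rough illustration, the sketch below computes the Pearson correlation coefficient, one common way to measure correlation, for two made-up pairs of variables. The data and the names `hours`, `score`, and `errors` are invented for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]   # tends to rise with hours
errors = [9, 8, 6, 4, 2]       # tends to fall as hours rise

print(pearson(hours, score))   # close to +1: positive correlation
print(pearson(hours, errors))  # close to -1: negative correlation
```

The coefficient always lies between -1 and +1; values near zero suggest little linear relationship.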
A confidence interval is a range of values, calculated from a sample, that is likely to contain the true population value (for example, the actual percentage of the population that fits into a category) at a stated level of confidence. Statistics provides precise mathematical methods for calculating confidence intervals.
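One simple way to compute such an interval for a percentage is the normal approximation, sketched below. This is only one of several methods, and the survey numbers (120 out of 400) are hypothetical. The value `z=1.96` corresponds to a 95% confidence level.

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n                       # proportion seen in the sample
    margin = z * sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical survey: 120 of 400 sampled people fall into the category.
low, high = proportion_ci(successes=120, n=400)
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

The approximation works best with reasonably large samples and proportions that are not too close to 0 or 1.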
Descriptive Statistics is a collection of statistical tools for quantitatively describing or summarizing a set of data. This type of statistics aims to summarize, and as such differs from inferential statistics, which is more predictive.
A distribution is the arrangement of data by the values of one variable, in ascending order. This arrangement, and its features such as shape and spread, provide information about the underlying sample.
Frequentist Statistics tests whether an event or hypothesis occurs or not. It computes the probability of an event over the long run of an experiment (i.e., the experiment is repeated under identical conditions to obtain the outcome).
Generalizability is the ability to draw conclusions about the characteristics of a population based on data collected from a sample. This ability depends heavily on how the sample was selected, the sample size, and many other aspects.
Inferential Statistics is one of the two main branches of Statistics. This form of Statistics uses random samples of data taken from a population to describe and make inferences about that population.
Interquartile Range (IQR)
The IQR, or Interquartile Range, is the difference between the score at the 75th percentile and the score at the 25th percentile, that is, between the third and first quartiles.
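Here is a small sketch of the calculation using Python's standard `statistics` module. The scores are made up, and note that several conventions exist for locating quartiles, so other tools may return slightly different values for the same data.

```python
from statistics import quantiles

# Made-up scores for illustration.
scores = [2, 4, 4, 5, 7, 8, 9, 10, 12]

# n=4 asks for the three cut points that split the data into quartiles.
# method="inclusive" treats the data as the full set of values of interest.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```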
In Statistics, latent variables are variables that are not directly observed but are inferred (via a mathematical model) from other variables that are directly measured or observed. Mathematical models that explain observed variables in terms of latent variables are known as latent variable models.
The mean is the arithmetic average of the values of a variable in a distribution. It offers a single, brief numerical summary of the distribution. The mean is probably the most common statistic encountered across a wide range of research.
The mean, along with the median and mode, is one of the three major measures of central tendency, which together describe an important and basic feature of a distribution.
The median is the score of a distribution at the 50th percentile, separating the upper and lower 50 percent of scores. The median is useful both for splitting a set of scores in half and for helping to identify the skew of a distribution.
The mode is simply the score that occurs most frequently in a distribution. The following are the four types of modality:
- Unimodal: Has one peak
- Bimodal: Has two peaks
- Multimodal: Has several peaks
- Uniform: Has no distinct peak; all values occur about equally often
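The three measures of central tendency can be compared side by side with Python's standard `statistics` module. The lists below are made up for illustration; `multimode` returns every value tied for most frequent, which makes a bimodal case easy to spot.

```python
from statistics import mean, median, multimode

# Made-up scores; the single large value pulls the mean upward.
scores = [2, 3, 3, 4, 5, 5, 5, 6, 40]

print("mean:  ", mean(scores))       # sensitive to the extreme value
print("median:", median(scores))     # middle score, barely affected
print("mode:  ", multimode(scores))  # most frequent score(s)

# Two values tie for most frequent here, so there are two modes.
bimodal = [1, 2, 2, 3, 4, 4, 5]
print("modes: ", multimode(bimodal))
```

When the mean sits well above the median, as in the first list, it hints that the distribution has a tail at its high end.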
A set of data is said to be normalized when all of its values have been rescaled to fall within a common range. Data sets are usually normalized to make comparisons easier and more meaningful.
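Min-max scaling is one common way to normalize data; the sketch below rescales each list to the range 0 to 1. The function name `min_max_normalize` and the sample values are assumptions made for this example, and other normalization schemes exist.

```python
def min_max_normalize(values):
    """Rescale values so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up data on very different scales.
heights_cm = [150, 160, 170, 180, 190]
incomes = [20000, 35000, 50000, 80000, 120000]

print(min_max_normalize(heights_cm))
print(min_max_normalize(incomes))
```

After rescaling, both variables live on the same 0 to 1 scale and can be compared directly.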
An outlier is a data point that lies extremely far from the other points (the flock). Most often it is the result of either exceptional conditions or errors in measurement. Outliers should therefore be identified during the initial stages of the data analysis workflow.
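A widely used rule of thumb flags any point lying more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below applies that rule; the helper name `find_outliers`, the multiplier `k=1.5`, and the readings are assumptions for illustration, not the only way to detect outliers.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Made-up sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10]
print(find_outliers(readings))
```

Whether a flagged point is a measurement fault or a genuine exceptional case still has to be judged by looking at how the data was collected.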
A parameter is a value that describes an entire population. For instance, if data on every human on Earth were taken into account, the mean age of that population would be a parameter.
A population is the complete set of individuals or items that belong to a certain group of interest.
Predictive Modeling is a process that uses data mining and probability to forecast outcomes. Each model is built from a number of predictors, which are variables that may influence future outcomes.
The range is one of the basic measures of dispersion. It is the difference between the maximum and minimum values of a distribution.
The residual is a measure of how far an observed value differs from the value predicted by a statistical model fitted to the dataset. The term is often used interchangeably with “error,” even though an error is a purely theoretical value.
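To show residuals in practice, the sketch below fits a straight line to a few made-up points by ordinary least squares and then subtracts each predicted value from the observed one. The helper `fit_line` and the data are assumptions for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value the fitted line predicts.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
for x, r in zip(xs, residuals):
    print(f"x = {x}: residual = {r:+.3f}")
```

For a least-squares line with an intercept, the residuals always sum to (approximately) zero, which is a handy sanity check.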
The sample is the collection of data points under examination. Samples are collected and examined mostly to make inferences about a larger population. A sample, in Statistics, is a representative selection from an entire population.
In the context of Business Intelligence (BI), Statistical Analysis is the process of collecting and inspecting data samples drawn from the set of items they belong to.
Statistics is a collection of mathematical methods used to analyze and present data. It finds application in designing studies and surveys, and in collecting and analyzing data.
Skew arises when there is an imbalance in the scores, in other words, when the scores are concentrated more towards one end of the distribution than the other. If the scores are concentrated towards the high end, the distribution is sparser at the low end, which produces a tail there. This imbalance is known as negative skew. Positive skew occurs when a distribution shows a tail at its high end.
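The direction of skew can be checked numerically. The sketch below uses the moment coefficient of skewness (the third central moment divided by the cubed standard deviation), which is one of several skewness formulas; the data sets are made up for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up data for illustration.
right_tail = [1, 2, 2, 3, 3, 3, 4, 15]   # tail at the high end
left_tail = [-v for v in right_tail]     # mirror image: tail at the low end
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tail))  # positive value: positive skew
print(skewness(left_tail))   # negative value: negative skew
print(skewness(symmetric))   # zero: no skew
```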
Standard Error = S / √n, where S = estimated Standard Deviation and n = sample size
The Standard Error is a statistical term that measures how accurately a sample represents a given population. In Statistics, a sample mean usually deviates from the actual mean of the population; the standard error measures the typical size of this deviation.
The Standard Error shrinks as the sample grows: it is inversely proportional to the square root of the sample size, so the greater the sample size, the smaller the Standard Error. This is because the statistic will be closer to the actual value.
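A quick sketch of the standard error of the mean, computed as the sample standard deviation divided by the square root of the sample size. The values are made up, and repeating the same values four times is just a crude way to imitate a larger sample with a similar spread.

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

# Made-up measurements for illustration.
small = [4, 8, 6, 5, 7]
large = small * 4  # same values repeated, to mimic a bigger sample

print(f"n = {len(small)}: SE = {standard_error(small):.3f}")
print(f"n = {len(large)}: SE = {standard_error(large):.3f}")
```

The larger sample gives a smaller standard error even though the spread of the values is about the same.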
Statistical significance is a way to mathematically demonstrate that a certain statistic is reliable. When decisions depend on the results of the experiments being run, it’s important to confirm that the observed relationship actually exists.
The result of an experiment is statistically significant if it is unlikely to have occurred by chance, judged against a given significance level.
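As an illustration, the sketch below runs an exact two-sided binomial test on a coin-flip experiment: it asks how likely a result at least as extreme as the observed one would be if the coin were fair. The helper `binomial_p_value`, the flip counts, and the 0.05 significance level are assumptions for this example.

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """Two-sided exact p-value: total probability of all outcomes that are
    no more likely than the observed one, assuming P(heads) = p."""
    probs = [comb(flips, k) * p ** k * (1 - p) ** (flips - k)
             for k in range(flips + 1)]
    observed = probs[heads]
    # Small tolerance so that equally likely outcomes are counted reliably.
    return sum(pr for pr in probs if pr <= observed * (1 + 1e-9))

alpha = 0.05  # assumed significance level

for heads in (11, 16):
    p_value = binomial_p_value(heads, flips=20)
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"{heads} heads in 20 flips: p = {p_value:.4f} ({verdict})")
```

A small p-value means the result would be surprising under chance alone; it does not by itself prove that the effect is large or important.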
Summary Statistics are measures used to communicate insights about data in a simple and comprehensible way.
A Time Series is a set of data points arranged in chronological order of occurrence. Time Series data thus provides measurements of observations (for instance, air temperature, pressure, or stock prices) together with their date and time stamps.
Variance is a statistical measure of the spread of scores in a distribution. It is rarely used on its own; however, it is a convenient step in calculating other descriptive statistics, such as the Standard Deviation.
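The link between variance and Standard Deviation is easy to see in code. This sketch uses the population versions from Python's standard `statistics` module on made-up scores; for a sample drawn from a larger population, `variance` and `stdev` would be the usual choices instead.

```python
from statistics import pstdev, pvariance

# Made-up scores for illustration; their mean is 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

variance = pvariance(scores)  # average squared distance from the mean
std_dev = pstdev(scores)      # square root of the variance

print("variance:", variance)
print("standard deviation:", std_dev)
```

Because the Standard Deviation is expressed in the same units as the scores, it is usually the easier of the two to interpret.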
To Be Continued…
I hope this listicle on “Statistical Tools and Terminologies” serves as a cheat sheet whenever you need one. In my next article, I will discuss another set of Data Science terms under the heading “Machine Learning Tools and Terminologies”. For more information about Data Science and related courses, visit Acadgild.