This post presents a collection of key data science terms with concise definitions, grouped into distinct topics. It may take some time to become familiar with this terminology; however, once you start reading and hearing about these terms, you will understand their importance and their relevance to data science.
There are multiple approaches to listing these terms. They can be grouped under relevant headings, or even a random listing would serve the purpose. In this article and the ones that follow, I will group the terms into categories. Let’s now have a look at the categories these terms belong to.
- The Fundamentals of Data Science
- Sectors Involving Data Science
- Statistical Tools and Terminologies
- Machine Learning Tools and Terminologies
- Key Terms in Deep Learning
The Fundamentals of Data Science
These are some baseline concepts that are helpful to grasp when starting to learn about data science. While you probably won’t have to work with every concept mentioned here, knowing what the terms mean will help when reading articles or discussing topics with fellow data lovers.
In mathematics, semantics, computing, and related topics, an algorithm is a sequence of well-defined instructions, often used in calculation and data processing. The steps in an algorithm may involve branching or repetition, depending on its purpose. Algorithms are usually described in human-readable language and are independent of any programming language.
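As a minimal sketch of an algorithm that uses both repetition and branching, here is binary search written in Python. The choice of binary search and the function name are illustrative assumptions; any step-by-step procedure would serve equally well.

```python
def binary_search(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent."""
    low, high = 0, len(sorted_values) - 1
    while low <= high:                      # repetition: loop until the range is empty
        mid = (low + high) // 2
        if sorted_values[mid] == target:    # branching: three possible cases
            return mid
        elif sorted_values[mid] < target:
            low = mid + 1                   # target can only be in the upper half
        else:
            high = mid - 1                  # target can only be in the lower half
    return -1


index = binary_search([2, 5, 8, 13, 21], 13)  # index is 3
```

The same steps could be written in plain English or in any other programming language, which is what makes them an algorithm rather than a program.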
Bayes’ theorem, also known as Bayes’ Rule or Bayes’ Law, is a mathematical formula for determining conditional probability. The theorem provides a way to revise existing estimates in the light of new or additional evidence. It describes the probability of an event based on prior knowledge of conditions that might be related to the event.
The formula for Bayes’ theorem is as follows:
Image Source: analyticsvidhya.com
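In standard notation, for events A and B with P(B) > 0, Bayes’ theorem reads P(A|B) = P(B|A) × P(A) / P(B). The sketch below applies it to a hypothetical screening test. The function name and the numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are illustrative assumptions, not values from this post.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # P(B) by the law of total probability
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# A = "person has the condition", B = "test is positive" (made-up numbers)
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
# posterior is about 0.161
```

Even with a fairly accurate test, the probability of having the condition after a positive result is only about 16%, because the condition is rare to begin with.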
Big data is an evolving field that deals with large volumes of structured, semi-structured, and unstructured data that can be mined for hidden information. It is commonly characterized by five Vs, usually listed as volume, velocity, variety, veracity, and value. Big data has other traits and classifications as well, but these are the major ones.
These Vs are the five pillars of big data. They describe the dynamic environment of data that must be understood for effective work in areas such as malware prevention.
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Classes are discrete and do not imply any order. Classification models are tested by comparing the predicted values with known values in a set of test data.
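Testing a classification model by comparing predicted classes with known classes can be sketched as follows. The labels and the accuracy helper are made up for illustration; accuracy is only one of several possible measures.

```python
def accuracy(predicted, actual):
    """Return the fraction of cases where the predicted class matches the known class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


actual = ["spam", "ham", "spam", "ham", "ham"]      # known classes in the test data
predicted = ["spam", "ham", "ham", "ham", "ham"]    # classes output by some model
score = accuracy(predicted, actual)                 # 4 of 5 correct, so 0.8
```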
Deep learning, also known as deep structured learning or hierarchical learning, is a machine learning technique that helps computers perform tasks that come naturally to humans. It is built on artificial neural networks made up of multiple layers. It is an artificial intelligence function that imitates the way the human brain processes data and creates patterns for decision making.
A decision tree is a graphical representation of the possible outcomes of decisions made under certain conditions. It is called a decision tree because it begins with a single node, the root, and then branches off into a number of outcomes, just like a tree. The tree is a structure that shows how and why one choice may lead to the next, with branches that represent mutually exclusive options.
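A decision tree can be sketched in code as nested dictionaries, where each inner node asks a yes/no question and each leaf is an outcome. The questions and outcomes below are hypothetical and chosen only to show the root, the branches, and the mutually exclusive choices.

```python
# Root node asks the first question; "yes" and "no" are mutually exclusive branches.
tree = {
    "question": "is_raining",
    "yes": {
        "question": "have_umbrella",
        "yes": "walk",
        "no": "take the bus",
    },
    "no": "walk",
}


def decide(node, answers):
    """Follow branches from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        branch = "yes" if answers[node["question"]] else "no"
        node = node[branch]
    return node


outcome = decide(tree, {"is_raining": True, "have_umbrella": False})  # "take the bus"
```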
A data structure is a way of organizing units of data within larger data sets. Building and maintaining well-defined data structures improves data access and value. Data structures are the concrete implementations of abstract data types.
Exploratory Data Analysis (EDA)
EDA, or exploratory data analysis, is a phase of the data science pipeline whose aim is to gain insight into the data through visualization or statistical analysis. The crux of EDA is to study data sets and summarize their key features, often through visual means.
An evaluation metric is used to assess the effectiveness of information retrieval systems and to validate theoretical and/or practical developments of these systems. It consists of a set of measures that follow a common underlying evaluation methodology. There are many metrics for evaluating the effectiveness of semi-structured text (XML) retrieval systems.
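Two widely used evaluation metrics for retrieval systems are precision (how many retrieved items are relevant) and recall (how many relevant items were retrieved). The sketch below computes both; the document IDs are made up, and these two measures are offered as common examples rather than the specific metrics the definition above has in mind.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against the relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned four documents; three documents were actually relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
# precision is 0.5 (2 of 4 retrieved are relevant); recall is 2/3 (2 of 3 relevant were found)
```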
Fuzzy logic is a form of mathematical logic that attempts to solve problems by assigning values to an imprecise spectrum of data in order to reach the most accurate conclusion possible. This kind of logic is designed to handle inherently imprecise notions.
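One way to picture fuzzy logic is a membership function that returns a degree of truth between 0 and 1 instead of a strict true or false. The sketch below uses a triangular membership function for the vague notion "warm"; the temperature thresholds are arbitrary assumptions.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0.0 to 1.0) for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)     # rising edge
    return (right - x) / (right - peak)       # falling edge


# "Warm" is assumed to start at 15 °C, peak at 25 °C, and end at 35 °C.
warm_at_22 = triangular(22, 15, 25, 35)   # about 0.7: fairly warm
warm_at_30 = triangular(30, 15, 25, 35)   # 0.5: only somewhat warm
```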
A histogram is a plot that reveals and illustrates the underlying frequency distribution (shape) of a set of continuous data. Histograms give a visual interpretation of numerical data by showing the number of data points that fall within each range of values.
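The counting step behind a histogram can be sketched without any plotting library. The heights below are made-up sample data, and the equal-width binning is one common convention.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges in [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


heights_cm = [150, 152, 158, 161, 163, 165, 167, 171, 174, 180]
counts = histogram_counts(heights_cm, low=150, high=180, bins=3)  # [3, 4, 3]

# A crude text rendering: one '#' per data point in each range.
for i, count in enumerate(counts):
    print(f"{150 + 10 * i}-{160 + 10 * i}: {'#' * count}")
```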
In statistics, imputation is the process of replacing missing data with substituted values. When the substitution is made for an entire data point, it is known as “unit imputation”; when it is made for a component of a data point, it is known as “item imputation”. Some of the well-known approaches to handling missing data are:
- Hot deck and cold deck imputation
- Listwise and pairwise deletion
- Mean imputation
- Regression imputation
- Last observation carried forward
- Stochastic imputation
- Multiple imputation
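Two of the approaches listed above, mean imputation and last observation carried forward, can be sketched as follows. Missing values are represented by None, and the readings are made up for illustration.

```python
from statistics import mean


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def locf_impute(values):
    """Replace each missing value with the last observed value before it.

    A missing value at the very start has nothing to carry forward and stays None.
    """
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)
    return result


readings = [10, None, 14, None, None, 18]
by_mean = mean_impute(readings)   # [10, 14, 14, 14, 14, 18]
by_locf = locf_impute(readings)   # [10, 10, 14, 14, 14, 18]
```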
Linear algebra is the branch of mathematics that deals with vectors and linear functions. It is also central to almost all areas of mathematics. For instance, linear algebra is fundamental in modern presentations of geometry, including for defining basic objects such as lines, planes, and rotations.
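As a small illustration of how linear algebra defines a rotation, the sketch below multiplies a 2D point by the standard rotation matrix [[cos θ, -sin θ], [sin θ, cos θ]]. The function name and the sample point are assumptions made for the example.

```python
import math


def rotate(point, angle_radians):
    """Rotate a 2D point counterclockwise about the origin."""
    x, y = point
    c, s = math.cos(angle_radians), math.sin(angle_radians)
    # Matrix-vector product of the rotation matrix with (x, y)
    return (c * x - s * y, s * x + c * y)


rotated = rotate((1.0, 0.0), math.pi / 2)  # approximately (0.0, 1.0)
```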
Machine learning is an application of artificial intelligence (AI) that gives computers the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.
MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses of MATLAB include:
- Math and Computation
- Algorithm Development
- Modeling, Simulation, and Prototyping
- Data analysis, Exploration, and Visualization
- Scientific and Engineering Graphics
- Application development, plus Graphical User Interface building
Multivariate analysis is a set of methods for analyzing data sets that contain more than one variable; the techniques are particularly valuable when working with correlated variables. Compared with running many separate single-variable tests, this form of analysis can reduce the probability of Type I errors.
The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about its mean, with most outcomes clustered around the mean. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.
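The way outcomes cluster around the mean of a normal distribution can be checked with Python’s standard statistics module. The sketch below uses a distribution with mean 0 and standard deviation 1 and computes the share of outcomes within one and two standard deviations of the mean.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Probability of landing within one and within two standard deviations of the mean
within_one_sd = standard.cdf(1) - standard.cdf(-1)   # about 0.683
within_two_sd = standard.cdf(2) - standard.cdf(-2)   # about 0.954

# Symmetry: half of the probability lies on each side of the mean
below_mean = standard.cdf(0)                         # 0.5
```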
Overfitting occurs when a machine learning model is overly complex. In such a case, the model learns the noise in the training data and performs well on that data. However, when the same model is applied to other datasets, it performs poorly and makes large errors. Overfit models usually have high variance.
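A toy illustration of overfitting, under stated assumptions: the "model" below simply memorizes the training points and answers with the value of the nearest one, so it reproduces the noise exactly. The data is made up, with a trend of y = 2x and noise of +3 or -3 added to the training values only.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Predict by returning the target of the closest memorized training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


train_x = [0, 2, 4, 6, 8]
train_y = [3, 1, 11, 9, 19]        # y = 2x with noise of +3 or -3
test_x = [0.5, 2.5, 4.5, 6.5]
test_y = [1, 5, 9, 13]             # y = 2x, no noise

train_error = mean_squared_error(
    [nearest_neighbor_predict(train_x, train_y, x) for x in train_x], train_y
)  # 0.0: the training data is reproduced perfectly
test_error = mean_squared_error(
    [nearest_neighbor_predict(train_x, train_y, x) for x in test_x], test_y
)  # 10.0: the memorized noise hurts on new data
```

The gap between a training error of zero and a much larger test error is the typical signature of an overfit model.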
The p-value is the level of marginal significance within a statistical hypothesis test. It is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. The p-value is used in hypothesis testing to support the decision to reject, or fail to reject, the null hypothesis. It measures the evidence against the null hypothesis: the lower the p-value, the stronger the evidence for rejecting the null hypothesis.
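A p-value can be worked out exactly for a simple coin-flip experiment. The sketch below assumes a null hypothesis that the coin is fair and asks how likely it is to see 9 or more heads in 10 flips; the scenario and the function name are illustrative.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: probability of at least `successes` under the null hypothesis."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_value = binomial_p_value(successes=9, trials=10)   # 11/1024, about 0.0107
reject_null = p_value < 0.05                         # True at the 5% significance level
```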
In statistics, quartiles are the values that split data into quarters. However, quartiles aren’t cut like pizza slices; instead, they split the data into four segments according to where the numbers fall on the number line. The four quarters that divide a dataset are:
- The lowest 25% of numbers.
- The next 25% of numbers up to the median.
- The next 25% of numbers above the median.
- The highest 25% of numbers.
The pictorial representation for quartiles is as follows:
Image Source: analyticsvidhya.com
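As a rough illustration, the quantiles function in Python’s standard statistics module returns the three cut points that separate the four quarters. The data set is made up, and the "inclusive" method is only one of several conventions, so other tools may give slightly different cut points.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for quartiles: three cut points dividing the data into four quarters
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
# q1 = 3.0 (first quartile), q2 = 5.0 (the median), q3 = 7.0 (third quartile)
```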
Regression is a technique for determining the statistical relationship between two or more variables, where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.
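The simplest case, one dependent and one independent variable, can be sketched with an ordinary least-squares line. The data below is made up, and the variable names are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


hours_studied = [1, 2, 3, 4, 5]      # independent variable (made-up data)
exam_score = [52, 55, 61, 64, 68]    # dependent variable (made-up data)

slope, intercept = fit_line(hours_studied, exam_score)
# slope is about 4.1: each extra hour is associated with roughly 4.1 more points
# intercept is about 47.7
```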
Standard deviation is a measure of the dispersion of a dataset around its mean. It captures the absolute variability of a distribution: the greater the dispersion or variability, the greater the standard deviation and the further values tend to lie from the mean.
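The calculation can be sketched in a few lines. This version computes the population standard deviation (dividing by the number of values); the sample data is made up.

```python
import math


def population_std(values):
    """Square root of the average squared deviation from the mean."""
    mean_value = sum(values) / len(values)
    variance = sum((v - mean_value) ** 2 for v in values) / len(values)
    return math.sqrt(variance)


tight = population_std([4, 5, 5, 5, 6])            # values close to the mean: small spread
wide = population_std([2, 4, 4, 4, 5, 5, 7, 9])    # mean 5, variance 4, so 2.0
```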
Training and Testing
Training and testing data are part of the machine learning workflow. Building a predictive model involves dividing the data into a training set and a test set. Once the model has been trained on the training set, it is given the test set, where it applies what it has learned and attempts to predict a target value.
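Dividing data into a training set and a test set can be sketched as below. The 25% test fraction, the fixed seed, and the function name are assumptions made for the example; libraries such as scikit-learn offer their own helpers for this step.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


train, test = train_test_split(range(20))
# 15 rows for training, 5 held back for testing, with no row in both sets
```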
Type I Error
The decision to reject the null hypothesis may be incorrect. Rejecting a null hypothesis that is actually true is known as a Type I error.
Image Source: analyticsvidhya.com
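The chance of a Type I error can be computed exactly for a simple test. The sketch assumes a hypothetical rule: flip a coin 10 times and reject the null hypothesis "the coin is fair" if the number of heads is 0, 1, 9, or 10. If the coin really is fair, every rejection is a Type I error.

```python
from math import comb


def type_one_error_rate(trials, rejection_counts):
    """Probability that a fair coin lands in the rejection region."""
    favorable = sum(comb(trials, k) for k in rejection_counts)
    return favorable / 2**trials


rate = type_one_error_rate(10, [0, 1, 9, 10])   # 22/1024, about 0.021
```

So with this rule, a fair coin would be wrongly declared unfair in roughly 2% of experiments.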
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting destroys the accuracy of a machine learning model. It simply means that the model or algorithm does not fit the data well enough. Underfitting usually happens when you try to build an accurate model from too little data.
TO BE CONTINUED…
If this listicle helps those who are new to data science and those who are struggling with the terminology, then this glossary of data science terms will have served its purpose. We hope it will be handy and serve as a cheat sheet whenever you need it. In my next article, I will discuss the next set of data science terms under the heading “Sectors Involving Data Science”. To learn more about data science and related courses, visit Acadgild.