scikit-learn (sklearn) is a free machine learning library for Python. It features various algorithms such as support vector machines, random forests, and k-nearest neighbours, and it builds on Python's numerical and scientific libraries NumPy and SciPy.
This blog is a must-read for beginners who want to know the everyday functions sklearn provides for preprocessing data, model building, model fitting, model prediction, evaluating a model's performance, and tuning a model.
It is a one-stop library that handles all the tasks a data scientist needs. You can look into the parameters of each function individually to see the flexibility it provides.
Standardization (μ = 0, σ = 1, unit variance)
Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance).
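A minimal sketch of standardization with sklearn's StandardScaler (the toy numbers here are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # illustrative toy data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# the scaled column now has mean ~0 and standard deviation ~1
print(X_scaled.mean(), X_scaled.std())
```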
Normalization rescales the values into a range of [0, 1]. This can be useful when all parameters need to be on the same positive scale. However, information about outliers is lost, since they are compressed into the same [0, 1] range as everything else.
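In sklearn, this min-max rescaling into [0, 1] is done by MinMaxScaler; a small sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])  # illustrative toy data

# each column is rescaled so its minimum maps to 0 and its maximum to 1
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```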
Binarizer: Threshold → {0, 1}
It plays a key role in the discretization of continuous feature values: once you set a threshold, each value is converted to either 0 or 1, depending on whether it falls below or above that threshold.
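A quick sketch with sklearn's Binarizer, using an arbitrary threshold of 0.5 on toy values:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.2, 0.7],
              [1.5, -0.3]])  # illustrative toy data

# values above the threshold become 1, the rest become 0
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)
```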
Encoding Categorical Features (001 | 010 | 100)
Label encoding refers to converting text labels into numeric form so that they become machine-readable.
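A sketch of both LabelEncoder and the one-hot style shown in the heading (001 | 010 | 100), using illustrative colour labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = ["red", "green", "blue", "green"]  # illustrative toy labels

# LabelEncoder maps each class to an integer (classes sorted alphabetically:
# blue -> 0, green -> 1, red -> 2)
y = LabelEncoder().fit_transform(labels)

# OneHotEncoder maps each class to a 0/1 indicator column instead
X = OneHotEncoder().fit_transform(np.array(labels).reshape(-1, 1)).toarray()
print(y, X)
```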
Impute Missing Values
For example, replacing 0 values with the mean of the remaining non-zero values in the same column.
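A minimal sketch of this with sklearn's SimpleImputer, treating 0 as the missing-value marker (toy data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 0.0],
              [3.0, 4.0],
              [0.0, 6.0]])  # illustrative toy data; 0 marks a missing entry

# each 0 is replaced by the mean of the non-zero values in its column
imp = SimpleImputer(missing_values=0, strategy="mean")
X_filled = imp.fit_transform(X)
print(X_filled)
```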
Train and Test data
train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result; this is expected behaviour. On the other hand, if you use random_state=some_number, the output of run 1 is guaranteed to equal the output of run 2, i.e. your split will always be the same. It does not matter what the actual random_state number is (42, 0, 21, …); the important thing is that every time you use the same value, you get the same split. This is useful when you want reproducible results.
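A small sketch of the reproducibility point, with a fixed random_state on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)  # illustrative toy data
y = np.arange(5)

# same random_state on both calls -> identical splits
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.4, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.4, random_state=42)
print((X_tr1 == X_tr2).all())
```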
Model Building & Model Fitting & Model Predicting
Machine learning consists of algorithms that can automate analytical model building. Using algorithms that iteratively learn from data, machine learning models enable computers to find hidden insights in big data without being explicitly programmed where to look.
Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. During the fitting process, you run an algorithm on data for which you know the target variable, known as "labeled" data, and produce a machine learning model.
Predictive modeling is the general concept of building a model that is capable of making predictions. Typically, such a model includes a machine learning algorithm that learns certain properties from a training dataset in order to make those predictions.
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.
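A minimal sketch of simple linear regression with sklearn (the toy points lie exactly on y = 2x, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])  # one explanatory variable
y = np.array([2.0, 4.0, 6.0])        # illustrative: y = 2x exactly

model = LinearRegression().fit(X, y)   # fit the model
pred = model.predict([[4.0]])          # predict for a new input
print(model.coef_, model.intercept_, pred)
```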
Support Vector Machine
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier
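A sketch of a linear SVM classifier on two toy categories (the points and the linear kernel choice are illustrative):

```python
from sklearn.svm import SVC

# two toy categories along the diagonal
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

clf = SVC(kernel="linear").fit(X, y)

# assign new examples to one category or the other
pred = clf.predict([[0.2, 0.2], [2.8, 2.8]])
print(pred)
```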
In machine learning, naïve Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.
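A minimal sketch with GaussianNB, one of sklearn's naïve Bayes variants, on toy one-dimensional data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# illustrative toy data: class 0 sits around -1.5, class 1 around +1.5
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)
pred = clf.predict([[-1.5], [1.5]])
print(pred)
```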
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems
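A sketch of KNN classification with k = 3 on toy data, showing the "similar things are near each other" idea:

```python
from sklearn.neighbors import KNeighborsClassifier

# illustrative toy data: class 0 near 0, class 1 near 10
X = [[0.0], [1.0], [10.0], [11.0]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# each query point takes the majority label of its 3 nearest neighbours
pred = knn.predict([[0.5], [10.5]])
print(pred)
```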
Unsupervised learning is a type of self-organized learning that helps find previously unknown patterns in a data set without pre-existing labels. It is also known as self-organization and allows the modeling of probability densities of the given inputs.
Principal Component Analysis (PCA)
Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
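A sketch of PCA on toy data where the third column is an exact linear combination of the first two, so two principal components capture essentially all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 3)          # illustrative random data
X[:, 2] = X[:, 0] + X[:, 1]    # third column is redundant by construction

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```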
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining
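A minimal sketch of KMeans on two well-separated toy blobs (the data and k = 2 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious groups, around (1, 1) and (8, 8)
X = np.array([[1.0, 1.0], [1.2, 0.8],
              [8.0, 8.0], [8.2, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_   # cluster assignment for each point
print(labels)
```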
Evaluating Model’s Performance
Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.
In a regression task, the model learns to predict numeric scores
Mean Absolute Error
Given any test data-set, Mean Absolute Error of your model refers to the mean of the absolute values of each prediction error on all instances of the test data-set
Mean Squared Error
MSE is the average of the squared error that is used as the loss function for least squares regression: It is the sum, over all the data points, of the square of the difference between the predicted and actual target variables, divided by the number of data points.
R-squared (R²) is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R² corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
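The three regression metrics above can be computed directly with sklearn.metrics; the true/predicted values below are just for illustration:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # illustrative ground truth
y_pred = [2.5, 0.0, 2.0, 8.0]    # illustrative predictions

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of squared error
r2 = r2_score(y_true, y_pred)              # proportion of variance explained
print(mae, mse, r2)
```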
The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics might require probability estimates of the positive class, confidence values, or binary decision values.
Accuracy is the ratio of the number of correct predictions to the total number of input samples. It works well only if there is a roughly equal number of samples belonging to each class.
A classification report is used to measure the quality of predictions from a classification algorithm: how many predictions are correct and how many are not. More specifically, true positives, false positives, true negatives, and false negatives are used to compute the metrics of a classification report.
A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm on a set of test data for which the true values are known.
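A sketch showing accuracy, the classification report, and the confusion matrix together on illustrative labels:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [0, 1, 1, 0, 1]   # illustrative ground truth
y_pred = [0, 1, 0, 0, 1]   # illustrative predictions (one mistake)

acc = accuracy_score(y_true, y_pred)     # 4 of 5 correct
cm = confusion_matrix(y_true, y_pred)    # rows = true class, cols = predicted
report = classification_report(y_true, y_pred)
print(acc)
print(cm)
print(report)
```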
Adjusted Rand Index
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements; this is the adjusted Rand index.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way
The V-measure is defined as the harmonic mean of homogeneity and completeness of the clustering. Both measures can be expressed in terms of the mutual information and entropy measures of information theory. Homogeneity is maximized when each cluster contains elements of as few different classes as possible.
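A sketch of the clustering metrics above; note that they are label-permutation invariant, so a perfect grouping scores 1.0 even with swapped cluster ids (the labels are illustrative):

```python
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             v_measure_score)

labels_true = [0, 0, 1, 1]   # illustrative ground-truth classes
labels_pred = [1, 1, 0, 0]   # same grouping, cluster ids permuted

ari = adjusted_rand_score(labels_true, labels_pred)
h = homogeneity_score(labels_true, labels_pred)
v = v_measure_score(labels_true, labels_pred)
print(ari, h, v)   # all three treat the permuted labels as a perfect match
```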
Cross-validation is an antidote to overfitting. It involves partitioning the data into multiple groups and then training and testing models on different group combinations. For example, in 5-fold cross-validation we would split our data set into five partitions of equal size.
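A minimal sketch of 5-fold cross-validation with cross_val_score; the iris data set and logistic regression model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# one accuracy score per fold; each fold serves once as the test partition
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```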
Tune Your Model
Fine-tuning a machine learning predictive model is a crucial step to improve the accuracy of the forecasted results.
Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid
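A sketch of GridSearchCV; the parameter grid, SVC estimator, and iris data are illustrative choices, and every combination in the grid gets cross-validated:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 combinations, each evaluated with 3-fold CV
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(grid.best_params_, grid.best_score_)
```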
Random search is a technique where random combinations of the hyperparameters are used to find the best solution for the built model. It tries random combinations from a range of values: to optimise with random search, the function is evaluated at some number of random configurations in the parameter space.
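A sketch of RandomizedSearchCV; here C is sampled from a continuous range rather than a fixed grid (the distribution, estimator, and data are illustrative):

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C is drawn from the interval [0.1, 10.1]; only n_iter random draws are tried
param_dist = {"C": uniform(0.1, 10)}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=5, cv=3,
                            random_state=0).fit(X, y)
print(search.best_params_, search.best_score_)
```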