Sklearn Python

Sklearn (scikit-learn) is a free machine learning library for Python. It provides algorithms such as support vector machines, random forests, and k-nearest neighbors, and it is built to work with Python's numerical and scientific libraries, NumPy and SciPy.

This post is a must for beginners who want to know the everyday functions in sklearn for preprocessing data, building and fitting models, making predictions, evaluating a model's performance, and tuning a model.
It is a one-stop library for most of the tasks a data scientist handles. You can look into the parameters of each function individually to see how much flexibility it provides.

Preprocessing data

Standardization (μ = 0, σ = 1, unit variance)

Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance).
$$X_{\text{changed}} = \frac{X - \mu}{\sigma}$$

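from sklearn.preprocessing import StandardScaler
# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
# Apply the same training statistics to the test features
standardized_X_test = scaler.transform(X_test)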
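
To make the formula concrete, here is a minimal sketch on a toy matrix. The values are made up for illustration; the point is that StandardScaler gives the same result as applying (X − μ) / σ column by column, and that the test data reuses the statistics learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test matrices (hypothetical values, two features each).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(X_train)   # learn mean and std from training data
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuse the training statistics

# The same result computed directly from the formula (X - mu) / sigma.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_std, manual))
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```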

Normalization [0,1]

Normalization rescales the values into the range [0, 1]. This is useful when all features need to share the same positive scale. However, the minimum and maximum are set by the most extreme values, so outliers compress the remaining data into a narrow part of the range.

$$X_{\text{changed}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

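from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler applies the min-max formula above per feature;
# Normalizer would instead rescale each row to unit norm
scaler = MinMaxScaler().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)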
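
The two scalers are easy to confuse, so here is a small sketch on made-up data. It checks that MinMaxScaler matches the min-max formula above, and shows that Normalizer does something different: it rescales each row (sample) to unit length rather than mapping each column into [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Column-wise min-max scaling into [0, 1].
minmax = MinMaxScaler().fit_transform(X)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Row-wise scaling to unit L2 norm (the default for Normalizer).
unit_rows = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(unit_rows, axis=1)

print(minmax)
print(row_norms)
```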

Binarizer Threshold (0 or 1)

Binarization plays a key role in discretizing continuous feature values. When you set a threshold, every value greater than the threshold becomes 1 and every other value becomes 0.

from sklearn.preprocessing import Binarizer
scaler = Binarizer(threshold = 1600).fit(X_train)
binarizer_X = scaler.transform(X_train)
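
A quick sketch with made-up numbers shows where the cut falls. Note that a value exactly equal to the threshold maps to 0, because only values strictly greater than the threshold become 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical feature values around the threshold of 1600.
X = np.array([[1500.0, 1600.0, 1700.0]])

binarizer = Binarizer(threshold=1600)
X_bin = binarizer.fit_transform(X)

print(X_bin)  # [[0. 0. 1.]]
```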

Encoding Categorical Features 001|010|100

Label encoding converts labels into numeric form so that they are machine-readable. LabelEncoder expects a 1-D array and is intended for the target labels.

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
# LabelEncoder works on a 1-D array, so encode the target labels
enc_y = enc.fit_transform(y_train)
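
The 001|010|100 pattern in the heading is what one-hot encoding produces, which is a different thing from label encoding. The sketch below, on a made-up color column, puts the two side by side: LabelEncoder maps each category to one integer, while OneHotEncoder produces one 0/1 column per category. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical values.
colors = np.array(["red", "green", "blue", "green"])

# Label encoding: one integer per category (categories are sorted alphabetically).
labels = LabelEncoder().fit_transform(colors)

# One-hot encoding: expects a 2-D array, returns a sparse matrix by default.
onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

print(labels)  # [2 1 0 1]
print(onehot)
```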

Impute Missing values

Here, values of 0 are treated as missing and replaced with the mean of the non-zero values in the same column.

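# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
# SimpleImputer always works column-wise, so there is no axis argument
imp = SimpleImputer(missing_values=0, strategy='mean')
imp_x = imp.fit_transform(X_train)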
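
In recent scikit-learn releases the imputer lives in sklearn.impute as SimpleImputer. A minimal sketch on made-up data, treating 0 as the missing marker as above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data where 0 marks a missing value.
X = np.array([[1.0, 0.0],
              [3.0, 4.0],
              [0.0, 8.0]])

imp = SimpleImputer(missing_values=0, strategy="mean")
X_imputed = imp.fit_transform(X)

# Column 0: mean of (1, 3) = 2. Column 1: mean of (4, 8) = 6.
print(X_imputed)
```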

Train and Test data

train_test_split splits arrays or matrices into random train and test subsets. Every time you run it without specifying random_state you get a different split, which is expected behavior. If you pass random_state=some_number, the output of run 1 is guaranteed to equal the output of run 2, so your split is always the same. The actual number does not matter (42, 0, 21, ...); what matters is that the same number always produces the same split. This is useful when you want reproducible results.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
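
The reproducibility claim is easy to check. The sketch below uses a small made-up array and splits it twice with the same random_state; it also shows the default split ratio (25% of the samples go to the test set).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(X, y, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, random_state=42)

print(np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```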

Model Building & Model Fitting & Model Predicting

Machine learning consists of algorithms that can automate analytical model building. Using algorithms that iteratively learn from data, machine learning models enable computers to find hidden insights in Big Data without being explicitly programmed where to look.

Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. During the fitting process, you run an algorithm on data for which you know the target variable, known as "labeled" data, and produce a machine learning model.

Predictive modeling is the general concept of building a model that is capable of making predictions. Typically, such a model includes a machine learning algorithm that learns certain properties from a training dataset in order to make those predictions.

Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

Linear Regression

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

from sklearn.linear_model import LinearRegression
# The normalize argument was removed in scikit-learn 1.2; scale features beforehand if needed
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
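
Here is a self-contained sketch of the same fit and predict steps on synthetic data. The data is generated from y = 3x + 2 with no noise (an assumption made purely so the result is easy to verify), so the fitted model should recover that slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following y = 3x + 2 exactly (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(lr.coef_, lr.intercept_)
```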

Support Vector Machine

In machine learning, support-vector machines (SVMs, also support-vector networks[1]) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
# Predict on held-out features; their column count must match X_train
y_pred = svc.predict(X_test)

Naive Bayes

In machine learning, naïve Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

KNN(K-Nearest neighbors)

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# predict() returns class labels; predict_proba() returns per-class probabilities
y_pred = knn.predict(X_test)
y_proba = knn.predict_proba(X_test)
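
The three classifiers above share the same fit / predict / score interface, so they can be swapped freely. The sketch below runs them on the iris dataset that ships with scikit-learn; the dataset choice and the dictionary of models are assumptions made for illustration, not part of the original snippets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same interface for every estimator: fit on train, score on test.
models = {
    "svc": SVC(kernel="linear"),
    "gnb": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy
    print(name, scores[name])

# predict_proba gives one probability per class for each test sample.
proba = models["knn"].predict_proba(X_test)
print(proba.shape)
```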

Unsupervised learning

Unsupervised learning is a type of self-organized learning that helps find previously unknown patterns in a data set without pre-existing labels. It is also known as self-organization and allows the modeling of probability densities of given inputs.

Principal Component Analysis (PCA)

Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

from sklearn.decomposition import PCA
# A float between 0 and 1 keeps enough components to explain that share of the variance
pca = PCA(n_components=0.95)
# fit_transform returns the projected data, not a model
X_train_pca = pca.fit_transform(X_train)
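
To see what n_components=0.95 does in practice, the sketch below applies it to the bundled iris data (an illustrative choice). PCA keeps the smallest number of components whose combined explained variance reaches 95%.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Number of columns kept and the share of variance they explain.
print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```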

K Means

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(X_train)
y_pred = k_means.predict(X_test)
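
A runnable sketch of the same steps, using three well-separated synthetic groups of points (made up for illustration). The cluster numbers themselves are arbitrary; what matters is that points from the same group receive the same label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three hypothetical groups of 20 points each, far apart from one another.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_train = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# n_init is set explicitly so the behavior does not depend on the library default.
k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# New points close to the first and second group.
new_points = np.array([[0.2, -0.1], [9.8, 10.3]])
y_pred = k_means.predict(new_points)

print(k_means.labels_)
print(y_pred)
```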

Evaluating Model’s Performance

Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

Regression Metrics

In a regression task, the model learns to predict numeric scores.

Mean Absolute Error

Given any test data set, the mean absolute error of your model is the mean of the absolute values of the prediction errors over all instances of that data set.

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
# Illustrative predictions, one per true value
y_pred = [2.5, 0.0, 2]
mean_absolute_error(y_true, y_pred)

Mean Squared Error

MSE is the average of the squared errors and is used as the loss function for least squares regression. It is the sum, over all the data points, of the square of the difference between the predicted and actual target values, divided by the number of data points.

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

R2 Score

R-squared (R2) is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model fits the data.

from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
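
The three regression metrics above can be checked against their definitions. The sketch below uses a small set of made-up true and predicted values and compares each sklearn function with a direct NumPy computation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities from their definitions.
errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))
mse_manual = np.mean(errors ** 2)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, r2)
```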

Classification Metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics might require probability estimates of the positive class, confidence values, or binary decision values.

Accuracy Score

It is the ratio of the number of correct predictions to the total number of input samples. It is a reliable measure only when each class has roughly the same number of samples.

# For classifiers, score() returns the mean accuracy on the given data
knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Classification Report

A classification report is used to measure the quality of predictions from a classification algorithm: how many predictions are true and how many are false. More specifically, true positives, false positives, true negatives, and false negatives are used to compute the metrics in a classification report.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm on a set of test data for which the true values are known.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
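
The classification metrics above all derive from the four cells of the confusion matrix. A minimal sketch on made-up binary labels shows how accuracy, precision, and recall fall out of it. In sklearn's layout, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical binary labels and predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Metrics computed by hand from the four cells.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)  # [[2 1]
           #  [1 2]]
print(accuracy, precision, recall)
```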

Clustering Metrics

Adjusted Rand Index

The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements is called the adjusted Rand index.

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)

Homogeneity

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

V-measure

The V-measure is defined as the harmonic mean of homogeneity and completeness of the clustering. Both measures can be expressed in terms of mutual information and entropy from information theory. Homogeneity is maximized when each cluster contains elements of as few different classes as possible.

from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)
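
The label-permutation property mentioned above is worth seeing once. In this sketch (made-up labels), the first clustering groups the points exactly like the true classes but uses different cluster numbers, so all three scores are perfect. The second clustering merges two classes into one cluster, which lowers homogeneity.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

y_true = [0, 0, 1, 1, 2, 2]

# Same grouping as y_true, only the cluster numbers are permuted.
permuted = [1, 1, 2, 2, 0, 0]
# Classes 0 and 1 merged into a single cluster.
merged = [0, 0, 0, 0, 1, 1]

ari_perm = adjusted_rand_score(y_true, permuted)
hom_perm = homogeneity_score(y_true, permuted)
v_perm = v_measure_score(y_true, permuted)

hom_merged = homogeneity_score(y_true, merged)
v_merged = v_measure_score(y_true, merged)

print(ari_perm, hom_perm, v_perm)
print(hom_merged, v_merged)
```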

Cross-Validation

Cross-validation is a safeguard against overfitting. It involves partitioning the data into multiple groups and then training and testing models on different group combinations. For example, in 5-fold cross-validation we would split our data set into five partitions of equal size.

from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))
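
A self-contained sketch of 5-fold cross-validation, using the bundled iris data and a KNN classifier (both illustrative choices). cross_val_score returns one score per fold, and the mean is the usual summary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Each of the 5 folds serves once as the held-out test set.
scores = cross_val_score(knn, X, y, cv=5)

print(scores)
print(scores.mean())
```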

Tune Your Model

Fine-tuning a machine learning model is a crucial step in improving the accuracy of its predictions.

Grid Search

Grid search is an approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.

import numpy as np
# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed
from sklearn.model_selection import GridSearchCV

params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
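
Here is a runnable sketch of the same grid on the bundled iris data (an illustrative choice, as is cv=4). With two values for n_neighbors and two metrics, the search evaluates 2 × 2 = 4 combinations, each with cross-validation, and then refits the best one on the full training set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=4)
grid.fit(X_train, y_train)

n_candidates = len(grid.cv_results_["params"])
print(n_candidates)
print(grid.best_params_)
print(grid.best_score_)
# The refitted best estimator can be scored on the held-out test set.
print(grid.score(X_test, y_test))
```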

Random search

Random search is a technique where random combinations of the hyperparameters are used to find the best solution for the built model. Instead of trying every combination, it evaluates the model at a fixed number of randomly chosen configurations in the parameter space.

# RandomizedSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed
from sklearn.model_selection import RandomizedSearchCV

params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)
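
Random search pays off when the grid is larger than the budget. In the sketch below (iris data and a wider n_neighbors range, both assumptions for illustration) there are 19 × 2 = 38 possible combinations, but n_iter=8 means only 8 of them are sampled and evaluated.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 38 possible combinations; only n_iter of them are tried.
params = {"n_neighbors": list(range(1, 20)), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=params,
    cv=4,
    n_iter=8,
    random_state=5,
)
rsearch.fit(X_train, y_train)

n_sampled = len(rsearch.cv_results_["params"])
print(n_sampled)
print(rsearch.best_params_)
print(rsearch.best_score_)
```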

Prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.
