The post KNN in Python appeared first on AcadGild.
KNN is used for both regression and classification problems. It is a non-parametric algorithm, meaning it makes no assumptions about the underlying data; instead, it makes its decision based on proximity to other data points, regardless of what the feature values represent.
In this blog, we will read about KNN and its implementation using a dataset in Python.
Working of KNN
When we have several data points that belong to some specific class or category and a new data point gets introduced, the KNN algorithm decides which class this new datapoint would belong to on the basis of some factor.
The K, in KNN, is the number of nearest neighbors that surrounds the new data point and is the core deciding factor.
We pick a value for K and take the K nearest neighbors of the new data point according to their Euclidean distance.
Suppose K = 5; we choose the 5 data points with the smallest Euclidean distance to the new point.
Among these K neighbors, we count the number of data points in each category, and the new data point is assigned to the category to which the majority of the 5 nearest neighbors belong.
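The procedure above can be sketched in a few lines of plain NumPy. This is an illustrative toy example with made-up points and function names, not the scikit-learn implementation used later in this blog:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, query, k=5):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X - query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y[nearest]).most_common(1)[0][0]

# Toy data: two clusters, labeled 0 and 1
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([1.5, 1.5]), k=3))  # -> 0
```

A query near the second cluster, e.g. `[8.5, 8.5]`, would be voted into class 1 the same way.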
As we can see in the above image the new data(denoted by +), belongs to class 1 that has the majority of neighbors.
Since we now have a basic idea of how KNN works, we will begin our coding in Python using the ‘Wine’ dataset.
The Wine dataset is a popular dataset for multi-class classification problems. This data is the result of a chemical analysis of wines grown in the same region in Italy using three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.
The dataset comprises 13 features and a target variable(a type of cultivars).
This data has three types of cultivar classes: ‘class_0’, ‘class_1’, and ‘class_2’. Here, you can build a model to classify the type of cultivar. The dataset has been imported from the Sklearn library as shown below.
Importing all the necessary libraries:
import numpy as np
import pandas as pd

# importing the dataset
from sklearn.datasets import load_wine
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
X.head()
y = pd.Categorical.from_codes(wine.target, wine.target_names)
y = pd.get_dummies(y)
y.head()
X and y are the predictors and the target variable respectively. Since the target variable is categorical, consisting of 3 classes of cultivars, we have used 'Categorical.from_codes'.
Using the get_dummies() function, we then converted the cultivar categories into dummy variables.
Checking the info of X and y.
X.info()
y.info()
Checking the shape of X and y.
print(X.shape)
print(y.shape)
Hence our dataset is free from null values.
Standardizing the Variables.
Before training our model, it is always good practice to scale the features so that all of them are evaluated uniformly.
For scaling, we will import the StandardScaler class from the Sklearn library.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fitting the scaler to the features
scaler.fit(X)
Use the .transform() method to transform the features into a scaled version.
scaled_features = scaler.transform(X)
Convert the scaled features to a dataframe and check its head to make sure the scaling worked.
df_feat = pd.DataFrame(scaled_features, columns=X.columns)
df_feat.head()
Hence it looks pretty clear that the variables have been scaled.
Splitting our data into Training and Test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features, y, test_size=0.20)
Implementing KNN algorithm
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
In the above code, the KNN class 'KNeighborsClassifier' is initialized with one parameter, n_neighbors. This is the value of K, and there is no fixed rule for choosing it. For now, we have set it to 5.
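A common way to pick K is to train the model for a range of K values and plot the error rate on the test set, then choose the K at the "elbow" of the curve. The sketch below rebuilds the scaled split for completeness and, for simplicity, uses the integer class labels rather than the dummy-encoded target; the split here uses a fixed random_state chosen for the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, wine.target, test_size=0.20, random_state=42)

error_rate = []
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # fraction of misclassified test samples for this K
    error_rate.append(np.mean(model.predict(X_te) != y_te))

plt.plot(range(1, 31), error_rate, marker='o')
plt.xlabel('K')
plt.ylabel('Error rate')
plt.title('Error rate vs. K')
plt.show()
```

The flat, low region of this curve suggests reasonable values of K for this dataset.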
Making Prediction
pred = knn.predict(X_test)
Evaluating the algorithm
For evaluating an algorithm, confusion matrix, precision, recall, and f1 score are the most commonly used metrics.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test.values.argmax(axis=1), pred.argmax(axis=1)))
The results show that our KNN algorithm was able to classify 33 records correctly.
print(classification_report(y_test,pred))
from sklearn import metrics

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, pred))
We got a classification accuracy of 91.66%, which is quite good (the exact value will vary with the random train/test split).
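A single train/test split can be lucky or unlucky, so a more reliable estimate comes from k-fold cross-validation. A sketch, rebuilding the scaled features from the raw dataset and using the integer labels for simplicity:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# 5-fold cross-validated accuracy of KNN with K = 5
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         X_scaled, wine.target, cv=5)
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The mean of the fold accuracies is a steadier summary of model quality than any single split.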
This is how we implement the KNN algorithm on a dataset. I hope you find this algorithm useful.
Keep visiting our website for more blogs on Data Science and Data Analytics.
The post Assumptions of Linear Regression appeared first on AcadGild.
In this blog, we will discuss these assumptions in brief using the 'Advertising' dataset, verify them, and look at ways to overcome violations of these assumptions using Python.
Linear Regression is one of the important algorithms in Machine Learning, used mainly for regression problems. In one of our previous blog posts, the end-to-end implementation of this algorithm has already been presented using the 'Boston' dataset. We assume our readers have some basic knowledge of Linear Regression and its implementation; if not, you can go through our previous blog to understand the implementation of Linear Regression in detail.
The dataset used here contains information about money spent on advertisements through electronic media (TV and Radio) and print media (Newspaper), and the sales those advertisements generated.
The dataset contains the below fields.
Features: TV, Radio, Newspaper
Target variable: Sales
Let us begin by loading our dataset and then verifying the assumptions one by one.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

advert = pd.read_csv(r'Aeon\Advertising.csv')
advert.head()
Assumptions:
1. Linearity: This assumption states that there should be a linear relationship between the independent and dependent variables. We can check linearity using scatter plots.
Since our dataset has 3 independent variables namely, ‘TV’, ‘Radio’, ‘Newspaper’ and the dependent variable ‘Sales’, we will verify the linearity between all the independent variables and the dependent variable using the scatter plot.
for c in advert.columns[:-1]:
    plt.title("{} vs. \nSales".format(c))
    plt.scatter(x=advert[c], y=advert['Sales'], color='blue', edgecolor='k')
    plt.grid(True)
    plt.xlabel(c, fontsize=14)
    plt.ylabel('Sales')
    plt.show()
From the above output, we can see a strong linear relationship between TV and Sales, a moderate linear relationship between Radio and Sales, and no clear linear relationship between Newspaper and Sales.
Violation in this assumption can be fixed by applying log transformation to the independent variables and then plotting the scatterplot between the two.
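One way to apply that fix is with np.log1p, which computes log(1 + x) and so handles zero spends gracefully. The sketch below uses made-up skewed data purely for illustration; the variable names and distribution are assumptions of the example, not the actual Advertising values:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical skewed advertising spend vs. sales, for illustration only
rng = np.random.default_rng(0)
newspaper = rng.lognormal(mean=3, sigma=1, size=200)
sales = 2 * np.log1p(newspaper) + rng.normal(0, 0.5, size=200)

df = pd.DataFrame({'Newspaper': newspaper, 'Sales': sales})
df['log_Newspaper'] = np.log1p(df['Newspaper'])  # log(1 + x), safe at x = 0

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(df['Newspaper'], df['Sales'], edgecolor='k')
axes[0].set(title='Raw Newspaper vs. Sales', xlabel='Newspaper', ylabel='Sales')
axes[1].scatter(df['log_Newspaper'], df['Sales'], edgecolor='k')
axes[1].set(title='log1p(Newspaper) vs. Sales', xlabel='log1p(Newspaper)', ylabel='Sales')
plt.tight_layout()
plt.show()
```

On data like this, the right-hand panel straightens out into a roughly linear cloud while the raw panel stays curved.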
2. No or Little Multicollinearity: Multicollinearity is a situation where the independent variables are highly correlated with each other. This assumption therefore says that there should be little or no correlation between the independent variables. Correlated independent variables pose a serious problem for our regression model, as the coefficients will be estimated incorrectly.
We can check for multicollinearity with the help of a correlation matrix or VIF factor.
Verifying multicollinearity using correlation matrix or heat map.
df = advert[['TV', 'Radio', 'Newspaper']]
sns.heatmap(df.corr(), annot=True)
If we find any values in which the absolute value of their correlation is >=0.8, the multicollinearity assumption is being broken.
VIF stands for Variance Inflation Factor and is the ratio of the variance in a model with multiple terms to the variance of a model with one term alone. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies strong multicollinearity.
Calculating VIF values for the independent variables
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# all rows of the independent variables (every column except Sales)
X_vars = advert.iloc[:, :-1]
for i in range(X_vars.shape[1]):
    v = vif(X_vars.values, i)
    print("Variance inflation factor for {}: {}".format(X_vars.columns[i], round(v, 2)))
The feature ‘TV’ has a VIF value greater than 10 which indicates significant multicollinearity.
The violation of this assumption can be fixed by removing the independent variable with high VIF value or which are highly correlated, however, removing a feature may eliminate necessary information from the dataset. We can also transform many variables into one by taking the average value or else we can use PCA to reduce features to a smaller set of uncorrelated components.
3. No Autocorrelation: Autocorrelation refers to the situation where there is a presence of correlation in error terms or when the residuals are dependent on each other. Hence this assumption says that there should be NO autocorrelation in error terms in our data.
This kind of scenario usually occurs in time series or paneled data where the next instant is dependent on the previous instant.
We can test for Autocorrelation with the Durbin-Watson test.
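For reference, the test is one line with statsmodels: it returns a statistic between 0 and 4, where values near 2 indicate no autocorrelation. This sketch fits a model on synthetic stand-in data (the column names and coefficients are made up for the example), since the point is only to show the call:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

# Synthetic stand-in for the Advertising data, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame({'TV': rng.uniform(0, 300, 200),
                   'Radio': rng.uniform(0, 50, 200)})
df['Sales'] = 0.05 * df['TV'] + 0.1 * df['Radio'] + rng.normal(0, 1, 200)

model = smf.ols('Sales ~ TV + Radio', data=df).fit()
dw = durbin_watson(model.resid)
print('Durbin-Watson statistic: %.2f' % dw)  # near 2 => no autocorrelation
```

Values well below 2 suggest positive autocorrelation; values well above 2 suggest negative autocorrelation.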
Since our dataset is not time-series or panel data, we will not verify this assumption and will move on to the next one.
4. Normality: This assumption states that the residuals of the regression should be normally distributed. The residuals, also known as the errors, are the differences between the predicted values and the observed values.
The test of normality applies to the model’s residuals and not the variables themselves. This can be tested visually by plotting the residuals as a histogram, and/or using a probability plot.
One of the ways to visually test this assumption is through the use of the Q-Q-Plot. Q-Q stands for the Quantile-Quantile plot and is a technique to compare two probability distributions in a visual manner.
To generate this Q-Q plot, we will use statsmodels' qqplot function, which compares our chosen variable to a normal distribution.
Before plotting the Q-Q plot, we will first fit a model using the statsmodels formula API; the diagnostics will then be run on this model.
import statsmodels.formula.api as smf
model = smf.ols("Sales ~ TV + Radio + Newspaper", data=advert).fit()
model.summary()
Histogram of Normalized residuals
plt.figure(figsize=(8,5))
plt.hist(model.resid_pearson, bins=20, edgecolor='k')
plt.ylabel('Count')
plt.xlabel('Normalized residuals')
plt.title("Histogram of normalized residuals")
plt.show()
Visualizing Q-Q plot of the residual
from statsmodels.graphics.gofplots import qqplot
plt.figure(figsize=(8,5))
fig = qqplot(model.resid_pearson, line='45', fit=True)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles", fontsize=15)
plt.ylabel("Ordered Values", fontsize=15)
plt.title("Q-Q plot of normalized residuals", fontsize=18)
plt.grid(True)
plt.show()
In the above output, the dots are the ordered, normalized residuals, and the red 45-degree line marks where they would fall if they were perfectly normal.
Our residuals are approximately normally distributed, as the blue dots fall close to the red line; the few points off the line are due to our small sample size.
Together, the Q-Q plot and the histogram above show that the normality assumption is satisfied reasonably well.
If this assumption is violated we can fix it by a nonlinear transformation of target variable or features or removing/treating potential outliers.
5. Homoscedasticity: This is a vital assumption for linear regression; if it is violated, the standard errors will be biased. The standard errors are used to conduct significance tests and calculate confidence intervals.
Homoscedasticity means that the error terms (residuals) have constant variance with respect to the independent or dependent variables. It can easily be tested with a scatter plot of the residuals.
If, looking at the scatter plot of the residuals from our linear regression analysis, we notice a pattern, this is a clear sign that the assumption is violated and the errors are heteroscedastic. Refer to the image below for a better understanding.
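Besides eyeballing the scatter plot, homoscedasticity can be checked numerically with the Breusch-Pagan test from statsmodels; a p-value below 0.05 would suggest heteroscedasticity. A sketch on synthetic data (the column names and coefficients are made up for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data with constant-variance noise, for illustration only
rng = np.random.default_rng(2)
df = pd.DataFrame({'TV': rng.uniform(0, 300, 200)})
df['Sales'] = 0.05 * df['TV'] + rng.normal(0, 1, 200)

model = smf.ols('Sales ~ TV', data=df).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid,
                                                        model.model.exog)
print('Breusch-Pagan p-value: %.3f' % lm_pvalue)
```

A large p-value here is consistent with homoscedastic residuals; a small one would call for the transformations discussed below.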
Plot to verify homoscedasticity
p = plt.scatter(x=model.fittedvalues, y=model.resid, edgecolor='k')
xmin = min(model.fittedvalues)
xmax = max(model.fittedvalues)
plt.hlines(y=0, xmin=xmin*0.9, xmax=xmax*1.1, color='red', linestyle='--', lw=3)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Fitted vs. residuals plot")
plt.grid(True)
plt.show()
From the above output, it is clear that the residuals have constant variance and homoscedasticity is not violated.
Violation of this assumption can be fixed by log transformation of the dependent variable.
As tested above, our model has passed all the basic assumptions of linear regression and hence is qualified to predict results. We also understand the influence of the independent (predictor) variables on our dependent variable.
Here is a visual recap:
If you have any queries on the above blog post, please leave a comment and we will get back to you.
Keep visiting our Acadgild blog site for more informative blogs on data science, data analysis, and big data blog posts. Thank you.
The post Connect hive with beeline | Hive installation with Mysql metastore appeared first on AcadGild.
In our previous blog, we discussed Apache Hive Architecture in detail. This blog gives you a detailed walkthrough of installing Apache Hive on Ubuntu and connecting to Hive using Beeline. We assume that Java and Hadoop are pre-installed; you can refer to our hadoop-3-x-installation-guide blog if Hadoop needs to be installed.
How To Install Mysql?
How To Configure Mysql?
How To Install Hive?
How To Configure Hive Metastore?
How is Beeline used to connect to Hive?
Prerequisites:
So let’s get started with our first step which is required for the hive installation.
Install Mysql
Step 1: Update the repositories
sudo apt-get update -y
Step 2: Install MySQL
sudo apt install mysql-server
Step 3: Configure Mysql
sudo mysql_secure_installation
In order to use a password to connect to MySQL as root, you will need to switch its authentication method from auth_socket to mysql_native_password.
To do this, open up the MySQL prompt from your terminal.
Step 4: Open MySQL Prompt
sudo mysql
You can see that the root user does, in fact, authenticate using the auth_socket plugin. To configure the root account to authenticate with a password, run the following ALTER USER command. Be sure to change the password to a strong password of your choosing, and note that this command will change the root password.
Step 5: Change the root password
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root@123';
FLUSH PRIVILEGES;
exit;
Your MySQL root password is now changed and MySQL is successfully configured.
Now we will start with the installation of Hive
Step 1: Create a directory for the hive and Download the hive tarball from the below link.
mkdir hive
cd hive
wget http://mirrors.estointernet.in/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Step 2: Extract the tarball
tar -zxvf <filename>
Step 3: Now we have to update the bashrc file so for that we need the path where the hive is installed.
cd apache-hive-3.1.2-bin
pwd
Now you will get the path where your hive is installed, Copy that path.
Step 4: Open a new terminal and update the bashrc file by entering the below export statements for installing hive.
Open the .bashrc file from your home directory
sudo vi .bashrc
Add the below statements in the .bashrc file
export HIVE_HOME=/home/hadoop/install/hive/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin
Press Esc, then type :wq! and hit Enter to save and exit the .bashrc file.
Step 5: Now run the below command to update the .bashrc file.
source .bashrc
Changing Default Metastore Of Hive
Step 6: Download the hive-site.xml file from the below link and place it to the conf directory in the hive.
Note: Assuming that your file is in the Downloads folder.
Now go to the download directory and copy the hive-site.xml to hive conf directory
cd downloads
cp hive-site.xml /home/hadoop/install/hive/apache-hive-3.1.2-bin/conf/
Step 7: Download Mysql Connector Jar file from the below link and Copy to the hive lib directory
Note: Assuming your connector file is in the Downloads folder.
cd downloads
cp mysql-connector-java-5.1.48.jar /home/hadoop/install/hive/apache-hive-3.1.2-bin/lib/
Now we have to initialize MySQL schema because we have changed the metastore database to MySQL
Step 8: Initialize Schema
schematool -dbType mysql -initSchema
From the above screenshot, we can observe that we have successfully installed Hive with a MySQL metastore.
Now, we will see what is beeline and its purpose.
Soon, the Hive CLI tool will no longer support authenticating and authorizing users directly.
Since direct access to the Hive CLI is being deprecated for security reasons (to prevent direct access to the data on HDFS or to MapReduce jobs), Beeline can be used to access Hive instead.
Beeline is a Hive client that is included on the master nodes of your cluster. Beeline uses JDBC to connect to HiveServer2, a service hosted on your cluster. You can also use Beeline to access Hive remotely over the internet.
Step 9: Add the below lines to core-site.xml, which is present in the Hadoop conf directory.
<property>
  <name>hadoop.proxyuser.ABC.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.ABC.hosts</name>
  <value>*</value>
</property>
Note: In the configuration above, replace ABC with your username.
Step 10: Start the Hive server
hiveserver2 start
Note: Once started, do not close the hiveserver2 terminal until your task is completed or you no longer need Hive.
Note: Before you go to the next step start your Hadoop services first.
Step 11: Connect using the following Beeline command.
beeline -n hadoop -u jdbc:hive2://localhost:10000
Note: In the above command, hadoop is my username; give your own username accordingly.
As you can see above in the screenshot we have successfully connected to the hive using beeline client.
We have successfully installed Hive, configured the Hive metastore using a MySQL database, and then connected to Hive using the Beeline client.
I hope this blog helps you in the future while installing hive, configuring MySQL metastore for hive as well as using beeline to connect and execute queries through HiveServer2.
In the case of any queries, feel free to comment below. Happy Learning.
The post Data cleaning using Mice Package in R appeared first on AcadGild.
Data cleaning is a method in which you update information that is incomplete, incorrect, improperly formatted, duplicated, or unsuitable. Data cleansing sometimes also involves improving information compiled in one place.
Though data cleaning involves deleting and updating information, it is centered more on updating, correcting, and consolidating data to make sure your data is good enough for descriptive as well as predictive modeling.
Data cleaning is one of the vital parts of a machine learning project and plays a major part in building a model. Skilled data scientists often spend a large portion of their time on this step.
If we have a well-cleaned dataset, we can get the desired results even with a very simple algorithm.
Here in this article, we are performing data cleaning using mice and VIM package with vehicleMiss.csv dataset.
Vehicle – Vehicle number
Fm – Vehicle failure month
Mileage – Vehicle failure at the mileage
Lh – labor hour
Lc – labor cost
Mc- Material cost
State- The region of vehicle failure
Let us understand mice imputation on missing values.
Here at first, we have to install required packages i.e.; VIM and mice.
install.packages("package_name")
Then load the package library with
library(package_name)
Reading the dataset into R and looking at the first six rows with the head() command.
Checking for missing values with the is.na() function.
The command returns TRUE, which means there are missing values in the dataset.
With the summary() command, let's see which columns have missing values and how many missing values are in each.
From the above console output, four columns have missing values: Mileage = 13 NA's, lh = 6 NA's, lc = 8 NA's, state = 15 NA's.
Let us find out what percentage of missing data is present in each variable in our data set.
Here we have written a small function to compute the percentage and passed it to the apply() function.
From the above console output, we can see that state and Mileage have more missing values than the other variables.
We can also see the pattern of the data by md.pattern() command.
This gives us a table as well as a plot showing missing data.
In the first row of the table, 1586 with value 0 means there are 1586 rows with no missing data. There are 11 rows with exactly one missing data point, and that data point is in the state column (shown as 0 under state in the table). In 13 rows the Mileage value is missing. Similarly, there are 6 rows with one lc data point missing, and 2 rows missing one lc value and one state value. In total, 42 data points are missing.
We can also see how many data points are observed with md.pair() command.
In the above console, $rr indicates how many data points have been observed. For vehicle, all 1624 data points have been observed. Variables with missing data have fewer observed values than the total number of records: 1618 for Mileage, 1609 for state, etc.
The next table, $rm, shows observed vs. missing, followed by $mr, which gives information about missing vs. observed, and then $mm, missing vs. missing.
Let us draw a margin plot using the marginplot() command to represent the observed and missing data points.
In the above plot, we can see all the blue scattered data points are observed values and the red dotted points are missing values.
And the box plot represents mileage and missing labour cost.
Let's store the imputed data in a variable called impute. The function used here is mice(), and we feed it only columns 2 to 7, since the first column (vehicle no.) has no importance for further analysis. We can also specify the number of imputations, m = 4 (the default value is 5).
We can also set a random seed, say seed = 123.
Here it has run five iterations, and in each iteration it has produced 4 imputations.
Here in the above console, it is showing total number of imputations.
And imputation methods for missing values are also shown.
Fm and mc don't have any missing values. Mileage, lh, and lc are numeric variables, and the default method for dealing with missing numeric values is pmm – predictive mean matching.
State is a factor (categorical) variable, so the method of imputation is polyreg – multinomial logistic regression.
Let’s look at some imputed values.
We are looking for Mileage here.
We can see in the above console that there are 4 imputation estimates, so we can select which one is the best imputation for the given data. Let's look at some values to see what it has done.
For the 253rd row and all columns:
It shows that Mileage has a missing value here.
Note that this car failed after 1 month, so imputing with just the mean value would give us a wrong result, as it is rare for a person to drive 20,599 miles in one month. We therefore have to impute a more sensible value.
Let us look at the summary of the variable Mileage.
We can see the mean, median, and quartile values, and the total number of missing values present.
Completing the data with the complete() function.
Here we use the values from the 2nd imputation, as it shows the best results among all the imputations.
Let’s see the summary of imputed data.
Observed and imputed values
Now the data is ready for classification or prediction models.
Here we can see that for a vehicle with a failure month of 0–1, the imputed mileage values of 863 and 11 miles are quite appropriate.
Distribution of observed/imputed values.
We use the stripplot() function to see the distribution of observed and imputed values.
Here blue points are observed values; 0 is the original data and 1, 2, 3, 4 are the imputed datasets. In fm and mc everything is blue, meaning there are no missing data points. In Mileage, lh, and lc there are both blue and red points, i.e., observed and imputed data; red indicates the estimated values to be imputed. We cannot see any unusual pattern.
Let's plot labor cost "lc" against labor hours "lh" with the xyplot() command.
In the xyplot above (lc vs. lh), the first panel is for the original data and the rest are for the imputed data. The lh and lc data points are shown in blue (observed) and red (imputed); red indicates the estimated values to be imputed.
We have selected the 2nd imputation for the missing values.
We hope this post has been helpful in understanding data cleaning. In the case of any queries, feel free to comment below and we will get back to you at the earliest.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.
The post Decision Tree in Python appeared first on AcadGild.
In our previous blog, we learned about the decision tree and its implementation in R using a dataset. If you are familiar with the R programming language, we suggest going through that blog via the link below:
https://acadgild.com/blog/decision-tree-using-r
As we know, the Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks. A decision tree is a binary-tree-like flowchart where each internal node represents a feature variable and is split so that the branches represent groups of observations based on the feature values, and the leaves represent the final outcome for the dataset.
The main objective of the decision tree is to split data in such a way that each element in one group belongs to the same category. Decision tree graphs are easily interpreted.
The splitting of data is based on measures that partition the data in the best possible manner. To decide where to split, we need a way of measuring how good a split is. The most popular measures are:
Gini Index: This is used to measure impurity, or the quality of a split of a node. The scikit-learn implementation of DecisionTreeClassifier uses gini by default.
It works with the categorical target variable “Success” and “Failure” and performs only binary splits.
The Gini index falls between 0 and 1, where 0 denotes that all the elements belong to a single class and 1 denotes that the elements are randomly distributed across various classes. When the Gini value is 0, the node is considered pure and no further split is done.
Information Gain: Information gain is derived from entropy. Entropy is a way of measuring the amount of impurity in a given set of data.
Information gain is used to determine which feature or attribute gives us the maximum information about a class.
High entropy means that we have a mixture of different classes, and low entropy means that we have predominantly one class; therefore, we are keen on splitting a node in a way that decreases the entropy.
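The two measures can be made concrete with a few lines of Python. These are illustrative helper functions for the formulas above, not the scikit-learn internals:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)). 0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini(['setosa'] * 4))              # -> 0.0 (pure node)
print(gini(['setosa', 'virginica']))     # -> 0.5 (maximally mixed, 2 classes)
print(entropy(['setosa', 'virginica']))  # -> 1.0 (1 bit of impurity)
```

A split is good when the weighted impurity of the child nodes, by either measure, is lower than that of the parent.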
In this blog, we will be using the popular Iris Data Set. This dataset is perhaps the best known database to be found in the pattern recognition literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of Iris plant namely Setosa, Versicolor and Virginica. One is linearly separable from the other 2 and the latter are not linearly separable from each other.
The predicted attribute is the class of Iris plant.
The dataset can be loaded from the sklearn library itself.
So let us begin our coding in Python.
We’ll import all the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
We will now load the dataset from the sklearn library
from sklearn.datasets import load_iris
iris = load_iris()
Initializing X and y as the predictors and the target variable respectively.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
Since the target variable is a categorical one consisting of 3 categories of flower species, we have used 'Categorical.from_codes'. This constructor is useful when we have a categorical datatype.
Checking the first few records of both the variables.
X.head()
y = pd.get_dummies(y)
y.head()
Using get_dummies() function we have converted our categories of flower species into dummy variables.
Checking the info of X and y respectively
X.info()
y.info()
We have 150 respective non-null values in our dataset.
Checking the statistical data of our predictor variables
X.describe()
Getting the count of each class of the target variable

y.sum()  # column totals of the dummy variables give the count per species
From the above output, we can see that we have 50 instances each of the three respective species of the plant.
Visualizing the independent feature
X['petal width (cm)'].plot.hist()
plt.show()
About 50 flowers in this dataset have a petal width between 0.1 and 0.5 cm.
Splitting the data into training and test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Fitting the model with the train data
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
prediction = dt.predict(X_test)
Evaluating model
Importing all the classes from sklearn library to do the evaluation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
print(classification_report(y_test, prediction))
print(confusion_matrix(y_test.values.argmax(axis=1), prediction.argmax(axis=1)))
As we can see, our decision tree classifier correctly classified 29/30 plants.
score = accuracy_score(y_test, prediction)
score
The accuracy of our model is 96%, which is pretty good.
Tree Visualization
Scikit-learn has some built-in visualization capabilities for decision trees. We might not use them often, as they require us to install graphviz.
Graphviz is a visualization library and can be installed using the below command:
conda install graphviz
or
conda install python-graphviz
and
conda install pydot
from IPython.display import Image
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz
import pydot
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(dt, out_file=None,
                                filled=True, rounded=True,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names)
graph = graphviz.Source(dot_data)
graph
The export_graphviz function converts the decision tree classifier into a dot file, and graphviz renders this dot file into a displayable form in Jupyter.
In the decision tree chart, each internal node has a decision rule that splits the data. Gini measures the impurity of the node. A node is pure when all of its records belong to the same class; such nodes are known as leaf nodes.
This is a pruned tree that is less complex, explainable, and easy to understand.
'Petal length (cm) <= 2.6' is the first question the decision tree asks: if the petal length is less than or equal to 2.6 cm, it follows the true branch, otherwise the false branch.
gini = 0.443 is the Gini score, a metric that quantifies the impurity of the node; a Gini score of 0 means that the node is pure.
Samples tells us how many examples are at that node.
Value is the vector of sample counts for each class at that node.
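To see where these Gini scores come from, the impurity of a node can be computed directly from its per-class sample counts. The sketch below is illustrative only and is not part of the original post's code:

```python
import numpy as np

def gini(counts):
    """Gini impurity for a node, given per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()               # class proportions at the node
    return 1.0 - np.sum(p ** 2)   # 0 for a pure node

# A pure node has impurity 0; an even two-class split has the maximum 0.5.
print(gini([50, 0]))   # 0.0
print(gini([25, 25]))  # 0.5
```

For example, a leaf holding only one class scores 0, which is exactly why such nodes need no further splitting.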
And this brings us to the end of our blog. I hope this helps you in understanding the decision trees classifier. Do leave us a comment for any query or suggestion.
Keep visiting our website for more blogs on Data Science and Data Analytics.
The post Decision Tree in Python appeared first on AcadGild.
The core idea behind the Random Forest algorithm is that it generates multiple small decision trees from random subsets of the original data; aggregating the results of these many predictors of varying depth gives a better prediction than the best individual predictor.
This group of decision trees or predictors is called an ensemble and this technique is called Ensemble Learning.
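This ensemble effect can be illustrated with a quick sketch. The post below works in R with the randomForest package; purely for illustration, here is the scikit-learn equivalent on synthetic data, comparing a single deep tree against a forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data stands in for a real dataset
X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

single = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# The aggregated vote of many trees is usually more accurate than one tree
print("single tree:", single.score(X_te, y_te))
print("forest     :", forest.score(X_te, y_te))
```

On most runs the forest's test accuracy meets or beats the single tree's, which is the motivation for ensemble learning.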
In our previous blog, we explained the working of decision trees with the help of the cardiotocography dataset. Before proceeding further, we recommend our readers go through that blog to understand the concept of decision trees and the dataset better.
In this blog, we will use the same cardiotocography dataset and build a model using the random forest algorithm to predict which NSP category a patient belongs to.
You can download the dataset from the below link:
https://acadgildsite.s3.amazonaws.com/wordpress_images/r/cardiography/Cardiotocographic.csv
Loading the data and fetching first few records.
Getting the structure of the dataset using the str() function
We will now use the as.factor() function to convert the target variable 'NSP' into a factor, which categorizes the data and stores it as levels. Factors can store both strings and integers.
Summarizing the statistical figures using the summary() function.
Fetching the occurrence/frequency of each class present in the Target variable.
Level 1, that is, the 'Normal' state, has occurred the maximum number of times.
Our data has now been split into training and validation data in the ratio of 70:30.
Applying Random Forest algorithm to build the model
To apply the random forest algorithm we have first imported the ‘randomForest’ library.
We will then fit the model on training data
Here we can see that the OOB (Out Of Bag) error rate is 5.84%.
OOB data consists of the observations left out of a tree's random training sample drawn from the original dataset.
These random samples are known as bootstrap samples, and the prediction error measured on the observations that are not in a tree's bootstrap sample is the OOB error rate.
Summarizing the attributes of random forest
From the above result, we can see that 1175, 144 and 115 observations have been correctly classified into classes 1, 2 and 3 respectively.
Also, class 2 (the Suspect state) has the maximum error rate of 28%, while the lowest error rate is found for class 1.
From the graph it is seen that the error lines become roughly constant from about trees = 300, therefore we will give ntree the value 300.
ntree refers to the number of trees grown in the Random Forest; by default, the value of ntree is 500.
Tuning the random forest model for better accuracy.
Here the OOB error rate is the least when the value of mtry is equal to 8.
mtry is the number of variables randomly sampled as candidates for splitting at each tree node.
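For readers following along in Python, R's mtry corresponds to scikit-learn's max_features. The sketch below (synthetic data, illustrative only) mirrors the tuning step by comparing OOB errors across a few candidate values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=1)

# max_features plays the role of mtry: the number of variables
# sampled as split candidates at each node
errors = {}
for mtry in (2, 4, 8):
    rf = RandomForestClassifier(n_estimators=200, max_features=mtry,
                                oob_score=True, random_state=1).fit(X, y)
    errors[mtry] = 1 - rf.oob_score_
    print(f"mtry={mtry}  OOB error={errors[mtry]:.4f}")
```

Whichever value yields the lowest OOB error would then be used when refitting the final model, just as the R workflow above does with mtry = 8.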
Again fitting the Random forest model on the training data after tuning the model by giving the value of ntree = 300 and mtry = 8
It is observed that after tuning the model, the OOB error rate has slightly decreased to 5.58%, which corresponds to an accuracy of about 94.42% on the out-of-bag estimate.
Checking for the number of nodes for the trees
The maximum frequency for the number of nodes could be found in the range 75-85.
Graph 1 shows the mean decrease in accuracy: how much worse the model performs when each variable is left out, i.e. how much each variable matters for accurate predictions.
Graph 2 shows the mean decrease in Gini: how much each variable contributes to the purity of the nodes at the ends of the tree.
Quantifying the values of each predictor variable against the target variable in our dataset
Creating partial plot on the variable ASTV(i.e., percentage of time with abnormal short term variability) when the class value of the target variable(NSP) = 1
Therefore the class value for NSP is 1, when the value of ASTV is less than 60.
Creating partial plot on the variable ASTV(i.e., percentage of time with abnormal short term variability) when the class value of the target variable(NSP) = 2
Therefore the class value of NSP is 2 when the value of ASTV is between 50 and 70, and it is difficult to tell whether the patient is Suspect or not at these values of ASTV.
Creating partial plot on the variable ASTV(i.e., percentage of time with abnormal short term variability) when the class value of the target variable(NSP) = 3
Therefore the class value of NSP is 3 when the value of ASTV is greater than 60.
Extracting the information of single tree from the forest.
Plotting the multidimensional scaling plot of proximity matrix for the train data of the target variable.
The data points for class value 1 of NSP, shown in red, appear more scattered compared to class value 2, shown in blue, which is much less scattered, and class value 3, shown in green, which is hardly scattered at all.
We can see that the actual and predicted values are similar.
We will now create the confusion matrix and check for accuracy based on the train data
Creating the confusion matrix and checking for accuracy based on the test data.
Hence we got an accuracy of 94.86% on our test data with a 95% confidence interval in the range of 92%-96%.
We hope this post has been helpful in understanding Random Forest. In the case of any queries, feel free to comment below and we will get back to you at the earliest.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.
The post Random Forest using R appeared first on AcadGild.
In our previous blog posts, we have discussed a brief introduction to Apache Hive along with its DDL commands, so readers of those posts will know how data is defined and should reside in a database.
Users working on Hive projects must know its architecture, the components of Hive, how Hive internally interacts with Hadoop, and its other important characteristics.
What is Hive?
Important Characteristics Of Apache Hive
Hive Architecture
Hive Components
How To Process Data With Apache Hive?
So let’s get started with what is hive?
Hive is a data warehousing tool that is built on top of the Hadoop distributed file system (HDFS).
Hive makes the job easy for performing operations like
As you can see from the above diagram it shows you the hive architecture and its components.
Hive uses the concept of MapReduce internally for job execution.
These are the main components of apache hive and we are going to discuss it in detail in the next section.
Hive Client :
Users can easily write Hive client applications in the language of their choice. Hive supports applications written in languages like Java and Python using the JDBC driver, Thrift, and the ODBC driver.
These clients are categorized into 3 types.
Thrift Client
The Apache Hive server is based on Thrift, so it can serve requests from all languages that support Thrift.
JDBC Client
Apache Hive allows Java applications to connect to it using a JDBC driver.
ODBC Client
The ODBC driver allows applications that support the ODBC protocol to connect to Hive.
Hive Services:
Hive provides different kinds of services like Web User Interface, Command-line Interface (CLI) to perform the queries on data.
Now we will discuss how a query executes in the hive.
So, this is the approach of how a hive query is executed.
Hive is an ETL and data warehousing tool built on top of Hadoop for analyzing and processing large amounts of data. It provides a simple SQL-like query language, HQL, for querying and processing the data. Now that you have learned the Apache Hive architecture and its components, learn how to install Hive on Ubuntu to get hands-on.
For any further queries please share your views through your comments. Happy Learning
The post Apache Hive Architecture | What Is Hive? appeared first on AcadGild.
In this blog, we will analyze the 'Risk Factors Associated with Low Infant Birth Weight' dataset using the decision tree algorithm. The data were collected at Baystate Medical Center, Springfield, Massachusetts, in 1986. The objective is to assess factors associated with low birth weight babies at Baystate Medical Center. Low birth weight is defined as an infant born with a weight of less than 2500 g, and it is one of the major public health problems worldwide.
Therefore, we will predict whether an infant is born weighing under 2.5 kg or not, based on the predictor (independent) variables.
Before moving further, we suggest our blog readers go through the previous post to understand the concepts better.
So let us begin our coding in R.
We will import the necessary libraries first
The packages MASS and rpart have been imported.
The MASS package is used to import the 'birthwt' dataset, and rpart is used for creating a decision tree for the same dataset.
We will now load the data and fetch the first few records.
Checking the percentage of unique values for each level in a particular variable.
Here the value under the feature 'low' shows that, across its 2 levels '0' and '1', 1.1% of the values are unique for this feature.
Likewise, for the column 'race', across its 3 levels '1', '2' and '3', 1.6% of the values are unique for this feature.
Converting all the categorical variables into factors.
Here variables with different levels have been converted into Factors.
Checking for null value if any
Getting the summary of the dataset using the summary() function.
In the target variable, we can see 59 values, which corresponds to the number of infants born with a weight of less than 2.5 kg.
Splitting the data into training and test datasets.
We have split our data into training and test data in the ratio 80:20 according to our target variable i.e., ‘birthwt$low’
We have fitted our training data using rpart function and plotted the tree.
Visualizing the same decision tree again using the rpart() function.
From the above graphs we can infer that if the value of ptl is 0, 2 or 3, we get a value of 46% corresponding to infants weighing less than 2.5 kg based on 'race', and 33% based on 'lwt'.
Also, if the value of ptl is not 0, 2 or 3, we get a value of 12% corresponding to infants weighing more than 2.5 kg.
Making predictions using the test data.
Hence our model has correctly predicted 23 observations of class 0, i.e. infants weighing 2.5 kg or more, and 9 observations of class 1, i.e. infants weighing less than 2.5 kg.
Evaluating the accuracy
Our model has an accuracy of 84%.
We use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curve whenever we want to check or visualize the performance of a classification problem such as this one.
It is one of the most important evaluation metrics for checking any classification model’s performance.
Now that our ROC curve has been built, we will calculate the area under the ROC curve (AUC).
The higher the AUC, the better the model is at predicting values. Our model has an AUC value of 77%, which is quite good.
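The same AUC computation is available outside R as well. As an illustrative sketch (scikit-learn on synthetic data, not the birthwt dataset used above), roc_auc_score takes the predicted probabilities of the positive class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]   # probability of the positive class
auc = roc_auc_score(y_te, scores)
print("AUC:", auc)
```

An AUC of 0.5 means the classifier is no better than random ranking, while 1.0 means it ranks every positive above every negative.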
And this brings us to the end of this blog. We hope you find this blog helpful.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.
The post Predicting Low Infant Birth Weight using Decision Tree appeared first on AcadGild.
In other words, we can say that the Logistic Regression model predicts P(Y=1) as a function of X.
Assumptions in Logistic Regression
Examples
Examples of Logistic Regression include:
In Logistic Regression, we use the Sigmoid function to describe the probability that a sample belongs to one of the two classes. The shape of the sigmoid functions determines the probabilities predicted by our model.
In mathematics, the equation below is known as the Sigmoid function:
P = 1 / (1 + e^(-y))
where y is the equation of a line: y = mx + c
No matter what values we have for y, a Sigmoid function ranges from 0 to 1.
The Sigmoid function looks like below:
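As a quick check of these properties, the sigmoid can be sketched in a few lines of Python (illustrative only, separate from the model code below):

```python
import numpy as np

def sigmoid(y):
    """Map any real-valued input to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

print(sigmoid(0))    # 0.5: the decision boundary
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

No matter how large or small y becomes, the output stays between 0 and 1, which is what lets us read it as a class probability.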
In this blog, we will understand the working of Logistic regression by building a model using the Advertising dataset. Please note that this is not a real dataset but a sample one that has been created for your understanding. You can download the dataset from this Link.
The dataset consists of 10 columns. The classification goal is to predict whether the user will click on an ad featuring on websites (1) or not (0) based on various features variables.
Let us begin our coding in Python.
We will begin by importing all the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Loading the CSV file by mentioning the correct path of the file.
advert = pd.read_csv(r'advertising.csv')

# Fetching the first few records of the data
advert.head(10)
The input variables include:
Target variable:
Checking the ‘info’ of the dataset
advert.info()
We can see that there is a total of 1000 entries (rows) and 10 columns.
Checking the statistical figures of the dataset.
advert.describe()
Getting the counts of unique values for the target column and the ‘Male’ columns respectively.
advert['Clicked on Ad'].value_counts()
From the above output, it is clear that 50% of the users clicked on the Ad while browsing the internet while 50% of them did not.
advert['Male'].value_counts()
Hence the number of males out of 1000 users is 481.
Checking the total number of null values in the dataset.
advert.isnull().sum()
Luckily we do not have to deal with handling the missing values as our dataset doesn’t contain any missing values.
Grouping by the target variable with the mean of other feature variables.
advert.groupby('Clicked on Ad').mean()
It can be inferred that:
The average age of the user who clicked on the Ad is higher than that of users who didn’t.
The user who clicked on the ad has less average daily time spent on site as compared to the user who didn’t.
Grouping by the target variable and the ‘Male’ column together.
advert.groupby(['Clicked on Ad','Male']).size()
It is clear that, of the people who clicked on the Ad, 231 are male and 269 are others.
Visualization
Now we will visualize the data using Matplotlib and seaborn library to see the patterns and trends in the dataset.
Creating histogram for the Age column
sns.set_style('whitegrid')
sns.distplot(advert['Age'], kde = False, bins = 40)
The above graph shows that maximum users are of the age ranged between 25-45.
Creating jointplot for the columns ‘Age’ and ‘Area Income’
sns.jointplot(x = 'Age', y = 'Area Income', data = advert)
plt.show()
No visible linear relation was found between the two variables. However, people aged between 20 and 45 were found to have higher incomes.
Creating jointplot for the columns ‘Age’ and ‘Daily Time spent on site’
sns.jointplot(x = 'Age', y = 'Daily Time Spent on Site', data = advert, kind = 'kde', color = 'red')
plt.show()
People of age between 20-45 spend more time on site daily.
Visualizing the number of males
sns.countplot(x = 'Male', data = advert, palette= 'pastel')
Hence there are fewer males (categorized as 1) as compared to others (categorized as 0).
Using countplot to visualize what numbers of Males and others have clicked on the ad
sns.countplot(x = 'Clicked on Ad', data = advert, hue = 'Male', color = 'red')
In the above graph, 1 and 0 refer to whether the user clicked on an ad or not, whereas the red and pink colors refer to males and others respectively.
Therefore, fewer males clicked on the ad as compared to 'Others', who are greater in number.
Also, the number of males who clicked on the ad is smaller than the number of males who didn't.
Creating pairplot for the whole dataset
sns.pairplot(advert, hue = 'Clicked on Ad')
plt.show()
Logistic Regression
Now since our data is prepared we will now split our data into training and test datasets.
advert.columns
X = advert[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']]
y = advert['Clicked on Ad']
Here X and y are independent and dependent features respectively. The columns ‘Ad Topic Line’, ‘City’, ‘Male’, ‘Country’, ‘Timestamp’, are not numeric and don’t have much impact on the dataset. Hence we will not consider these features.
Splitting the data into training and test datasets using sklearn library.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
Our data has now been split into training and test datasets in an 80:20 ratio. We now import the LogisticRegression class from the sklearn library and create an instance of it. We then call the fit() function to train the model with the training dataset.
from sklearn.linear_model import LogisticRegression

# Creating an instance of the Logistic Regression class
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now we’ll check how the model performs against data that it hasn’t been trained on.
prediction = logreg.predict(X_test)
Since it was a classification problem, we use a confusion matrix to measure the accuracy of our model.
from sklearn.metrics import confusion_matrix

conf_Matrix = confusion_matrix(y_test, prediction)
print(conf_Matrix)
From our confusion matrix, we conclude that:
Computing the classification report which states the precision, recall, f1-score and support.
from sklearn.metrics import classification_report

print(classification_report(y_test, prediction))
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative. Said another way, “for all instances classified positive, what percent was correct?”
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. Said another way, “for all instances that were actually positive, what percent was classified correctly?”
The F-beta score can be interpreted as a weighted harmonic mean of precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
The support is the number of occurrences of each class in y_test.
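Putting these definitions together, the report's numbers can be reproduced by hand from a confusion matrix. The matrix below is hypothetical, not the one produced by this dataset:

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = actual, columns = predicted
#              pred 0  pred 1
cm = np.array([[90,     10],    # actual 0
               [ 5,     95]])   # actual 1

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # of all predicted positives, how many were right
recall    = tp / (tp + fn)   # of all actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
support_1 = tp + fn          # occurrences of class 1 in y_test

print(precision, recall, f1, support_1)
```

Working these ratios out by hand once makes the classification_report output much easier to read.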
This brings us to the end of this blog. I hope you find this blog helpful. For any query or suggestions do drop us a comment blog. Keep visiting our website for more blogs on Data Science and Data Analytics.
Suggested Reading:
The post Logistic Regression in Python appeared first on AcadGild.
Decision trees are constructed with an approach that identifies ways to split the dataset based on different conditions, generally in the form of if-then-else statements. A decision tree is a tree-like graph with nodes representing the attributes where we ask the questions, edges representing the answers to the questions, and leaves representing the actual outcomes.
Decision trees are applicable in cases where there is uncertainty about which outcome will actually happen, or when the user has an objective to achieve, such as maximizing profit or optimizing costs.
For instance, suppose we have 5 days of data about a friend, telling whether he will come out to play or not based on the weather conditions below:
| Day | Weather | Temperature | Humidity | Wind | Play |
| --- | ------- | ----------- | -------- | ------ | --- |
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Cloudy | Hot | High | Weak | Yes |
| 3 | Sunny | Mild | Normal | Strong | Yes |
| 4 | Cloudy | Mild | High | Strong | Yes |
| 5 | Rainy | Mild | High | Strong | No |
We will form a decision tree based on the above table which will be shown something like this:
Hence in the above tree we can see that each node represents an attribute or feature, the branches represent the outcomes of that node, and the leaves are where the final decisions are made.
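To make this concrete, here is a small sketch that fits a decision tree to the 5-day toy table above. The blog itself works in R; this Python version is purely illustrative, with the categorical columns one-hot encoded by hand:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The 5-day toy dataset from the table above
data = pd.DataFrame({
    "Weather":     ["Sunny", "Cloudy", "Sunny", "Cloudy", "Rainy"],
    "Temperature": ["Hot", "Hot", "Mild", "Mild", "Mild"],
    "Humidity":    ["High", "High", "Normal", "High", "High"],
    "Wind":        ["Weak", "Weak", "Strong", "Strong", "Strong"],
    "Play":        ["No", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Play"))  # one-hot encode the categoricals
y = data["Play"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X))  # the fully grown tree reproduces all 5 training labels
```

Because the 5 rows are all distinct, an unpruned tree memorizes the table perfectly, which is exactly the if-then-else structure sketched in the diagram.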
In this blog, we will build a model using the Cardiotocography dataset. The dataset consists of measurements of fetal heart rate (FHR) and uterine contractions (UC) features on cardiotocography classified by expert obstetricians. 2126 fetal cardiotocography (CTGs) were automatically processed and the respective diagnostic features measured. CTGs are classified by three expert obstetricians and consensus classification label as Normal, Suspect or Pathologic. You can get the dataset from the below link.
So let us begin our coding in R.
Loading the dataset and fetching the first few records.
Getting the structure or information about each variable of the dataset using the str() function.
Hence all the variables are either integer or float data types.
Getting the statistical summary of the dataset using the summary() function
Hence no null value present in the dataset.
Now, since the target variable has 3 levels (1, 2 and 3), we use the factor() function, which converts data objects into categories stored as levels. Factors can store both strings and integers.
After converting the target variables into factors. We will now split our data into training and validation sets and will set the seed of R’s random number generator, which is useful for creating simulations or random objects that can be reproduced.
Now the dataset has been split 80% into training data stated by index 1 and 20% into validation data stated by index 2 respectively.
We will now import the 'party' package.
The package “party” has the function ctree() which is used to create and analyze decision tree.
Here we have given the independent variables as LB, AC, FM and dependent variables to be NSP.
Here we can see that the nodes represent the independent variables, the branches refer to the values that are to be compared, and the leaves represent the target variable with its 3 levels.
We will now make predictions using the predict() function using ‘tree’ variable taken from ‘party’ package and ‘validate’ data.
The rpart library stands for Recursive Partitioning and Regression Trees; the resulting models can be represented as binary trees.
Here the binary tree has been created using the training dataset.
Again creating the tree using rpart, this time initializing the attribute 'extra = 1', which displays the number of observations that fall in the node (per class for class objects; prefixed by the number of events for poisson and exp models), similar to text.rpart's use.n=TRUE.
We will again form a tree, this time initializing 'extra = 2'. For class models this displays the classification rate at the node, expressed as the number of correct classifications over the number of observations in the node; poisson and exp models display the number of events.
Hence the tree will look something as below
We will again make prediction using the predict() function, but this time using the variable ‘tree1’ taken from ‘rpart’ library and ‘validate’ data.
Creating confusion matrix using the table() function. Table() function is also helpful in creating Frequency tables with condition and cross tabulations.
Here the values 1, 2 and 3 depict the three levels of the target variable NSP, which represent Normal, Suspect and Pathologic.
Computing the accuracy by taking the proportion of true positive and true negative over the sum of the matrix as shown below.
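The same proportion, the diagonal of the confusion matrix over the grand total, can be sketched in Python. The 3-class matrix below is hypothetical, merely mirroring the shape of the NSP confusion matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted),
# one row/column per NSP level
cm = np.array([[320,  10,   5],
               [ 15,  50,   3],
               [  2,   4,  25]])

accuracy = np.trace(cm) / cm.sum()   # correct predictions / all predictions
misclassification = 1 - accuracy
print(accuracy, misclassification)
```

This is exactly the "sum of true positives and true negatives over the sum of the matrix" described above, generalized to three classes.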
Again creating the confusion matrix using the validation dataset.
Here you can see that the misclassification error for train data is 0.19 whereas the misclassification error for test data is 0.21.
And this brings us to the end of this blog. Hope you find this helpful.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.
The post Decision Tree using R appeared first on AcadGild.