The post Decision Tree using R appeared first on AcadGild.

Decision trees are constructed with an approach that identifies ways to split the dataset based on different conditions, generally in the form of if-then-else statements. A decision tree is a tree-like graph in which the nodes represent the attributes where we ask the questions, the edges represent the answers to those questions, and the leaves represent the actual outcomes.

Decision trees are applicable when there is uncertainty about which outcome will actually happen, or when the user has an objective to achieve, such as maximizing profit or minimizing costs.

As an example, suppose we have five days of data about a friend that tells us whether he will come out to play, based on the weather conditions below:

| Day | Weather | Temperature | Humidity | Wind | Play |
| --- | ------- | ----------- | -------- | ---- | ---- |
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Cloudy | Hot | High | Weak | Yes |
| 3 | Sunny | Mild | Normal | Strong | Yes |
| 4 | Cloudy | Mild | High | Strong | Yes |
| 5 | Rainy | Mild | High | Strong | No |

We can form a decision tree from the above table, which will look something like this:

In the above tree, each node represents an attribute or feature, the branches represent the outcomes of that node, and the leaves are where the final decisions are made.

In this blog, we will build a model using the Cardiotocography dataset. The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians. 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were classified by three expert obstetricians and a consensus classification label assigned to each: Normal, Suspect or Pathologic. You can get the dataset from the below link.

So let us begin our coding in R.

Loading the dataset and fetching the first few records.
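The loading step can be sketched in R as below, assuming the dataset was downloaded as a CSV file named `CTG.csv` (the filename is an assumption):

```r
# Load the Cardiotocography dataset from a local CSV file (hypothetical filename)
data <- read.csv("CTG.csv")
# Fetch the first few records
head(data)
```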

- LB: FHR (fetal heart rate) baseline (beats per minute)
- AC: # of accelerations per second
- FM: # of fetal movements per second
- UC: # of uterine contractions per second
- DL: # of light decelerations per second
- DS: # of severe decelerations per second
- DP: # of prolonged decelerations per second
- ASTV: percentage of time with abnormal short-term variability
- MSTV: mean value of short-term variability
- ALTV: percentage of time with abnormal long-term variability
- MLTV: mean value of long-term variability
- Width: width of FHR histogram
- Min: minimum of FHR histogram
- Max: maximum of FHR histogram
- Nmax: # of histogram peaks
- Nzeros: # of histogram zeros
- Mode: histogram mode
- Mean: histogram mean
- Median: histogram median
- Variance: histogram variance
- Tendency: histogram tendency
- CLASS: FHR pattern class code (1 to 10)

Target variable:

- NSP: fetal state class code (N = normal; S = suspect; P = pathologic)

Getting the structure or information about each variable of the dataset using the *str()* function.

Hence, all the variables are of either integer or numeric (float) data types.

Getting the statistical summary of the dataset using the *summary()* function.

Hence, there are no null values present in the dataset.

Since the target variable takes values at three levels, namely 1, 2 and 3, we use the factor() function, which converts data objects into factors that categorize the data and store it as levels. Factors can store both strings and integers.
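Assuming the data frame is named `data`, the conversion can be sketched as:

```r
# Convert the numeric target (1, 2, 3) into a categorical factor
data$NSP <- factor(data$NSP)
str(data$NSP)
```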

After converting the target variable into a factor, we will split our data into training and validation sets, and we will set the seed of R's random number generator, which is useful for creating simulations or random objects that can be reproduced.

The dataset has now been split: 80% into training data, indicated by index 1, and 20% into validation data, indicated by index 2.
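The split described above can be sketched as follows (the seed value 123 is an assumption; any fixed value gives reproducible results):

```r
# Reproducible 80/20 split: index 1 marks training rows, index 2 validation rows
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]
validate <- data[ind == 2, ]
```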

We will now import the 'party' package.

The 'party' package has the function ctree(), which is used to create and analyze decision trees.

Here we have set the independent variables to LB, AC and FM, and the dependent variable to NSP.
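With the training data from the split above, building and plotting the tree might look like this sketch (variable names `train` and `tree` are assumptions):

```r
library(party)
# Grow a conditional inference tree with NSP as the target
tree <- ctree(NSP ~ LB + AC + FM, data = train)
plot(tree)
```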

Here we can see that the nodes represent the independent variables, the branches refer to the values that are compared, and the leaves represent the target variable with its three levels.

We will now make predictions with the *predict()* function, using the 'tree' model built with the 'party' package and the 'validate' data.

The rpart library stands for Recursive Partitioning and Regression Trees; the resulting models can be represented as binary trees.

Here the binary tree has been created using the training dataset.
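A minimal sketch of the rpart step, assuming the same `train` data frame and using the rpart.plot package for drawing the tree:

```r
library(rpart)
library(rpart.plot)
# Recursive partitioning on the training data, then plot the binary tree
tree1 <- rpart(NSP ~ ., data = train)
rpart.plot(tree1)
```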

Again creating the tree using rpart, this time setting the attribute 'extra = 1', which displays the number of observations that fall in each node (per class for classification objects; prefixed by the number of events for poisson and exp models). This is similar to text.rpart's use.n=TRUE.

We will form the tree again, this time setting 'extra = 2'. For classification models, this displays the classification rate at each node, expressed as the number of correct classifications over the number of observations in the node. For poisson and exp models, it displays the number of events.

The tree will then look something like this:

We will again make predictions using the *predict()* function, but this time using the 'tree1' model built with the 'rpart' library and the 'validate' data.

Creating the confusion matrix using the *table()* function. The table() function is also helpful for creating frequency tables with conditions and cross tabulations.

Here the values 1, 2 and 3 depict the three levels of the target variable NSP, representing Normal, Suspect and Pathologic.

Computing the accuracy by taking the proportion of true positives and true negatives over the sum of the matrix, as shown below.
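The calculation described above can be sketched as (assuming the `tree1` and `train` objects from the earlier steps):

```r
# Confusion matrix: predicted class vs. actual class on the training data
tab <- table(predict(tree1, train, type = "class"), train$NSP)
# Accuracy = (true positives + true negatives) / total observations
accuracy <- sum(diag(tab)) / sum(tab)
1 - accuracy   # misclassification error
```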

Again creating the confusion matrix using the validation dataset.

Here you can see that the misclassification error for the training data is 0.19, whereas the misclassification error for the test data is 0.21.

And this brings us to the end of this blog. Hope you find this helpful.

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn about our data science course in Bangalore.*


The post What is Ansible? appeared first on AcadGild.

- What is Ansible?
- Configuration Management
- Push Based VS Pull Based
- How To Install Ansible
- Host Inventory
- Ansible Modules
- Understanding YAML
- Ansible Playbook

Now let's get started with our first topic.

Ansible is a simple open-source IT engine that automates application development, intra service orchestration, cloud provisioning, and many other IT tools.

Ansible is also very easy to deploy because it does not use any agents or custom security infrastructure.

Ansible uses playbooks to describe automation jobs, and playbooks use a very simple language: YAML.

YAML is a human-readable data serialization language which is commonly used to build configuration files and in applications where data is being stored or transmitted.

The advantage of a YAML file is that even IT infrastructure support professionals can read and understand the playbook, and debug it easily if needed.

Ansible is completely agentless, which means it works by connecting to the nodes over SSH by default; but if you want another connection method, such as Kerberos, Ansible gives you that option too.

After connecting to the nodes, Ansible pushes small programs called Ansible modules, runs those modules on the nodes, and removes them when the work is finished. It also manages your inventory with a simple text file, basically a hosts file, which you can see below.

Ansible uses the hosts file, where one can group the hosts and control the actions on a specific group in the playbooks.

As Ansible is a configuration management tool, you should also understand configuration management before you learn how to deploy it with such tools.

Configuration management, in terms of Ansible, means that it maintains the configuration of product performance by keeping a record of, and updating, detailed information that describes an enterprise's hardware and software. Such information includes the exact versions and updates applied to installed packages, and the locations and network addresses of devices.

For example, if you want to install a new version of a WebLogic server on all the machines in your enterprise, it is not feasible to manually go and update each and every machine.

With an Ansible playbook, you can install the WebLogic server on all your machines in one go.

All you have to do is list the IP addresses of your nodes in the inventory and write a playbook to install the WebLogic server. Then you run the playbook from your control machine, and it will be installed on all your nodes.

Basically, Ansible works by connecting to nodes and pushing out small programs called Ansible modules. Ansible executes these modules over SSH by default and removes them when finished.

Your library of modules can reside on any machine, and there are no servers, daemons, or databases required.

As you can see in the picture below, the control node controls the entire execution of the playbook.

It is the node from which you run the installation, and the inventory file provides the hosts where the Ansible modules need to be run.

The management node then makes an SSH connection to the other nodes listed in the inventory, executes the Ansible modules on those host machines, and installs the product.

So this is how Ansible works.

**Agentless:**

The first feature is that Ansible is agentless, which means there is no software or agent managing your nodes, unlike Puppet or Chef. With Puppet or Chef, you need to install the Puppet agent or Chef client on all your nodes.

With Ansible, you just have to install Ansible on your control machine and you are good to go.

**Built on Python**

Ansible is built on top of Python, which gives it access to a lot of Python's functionality.

**SSH**

Ansible uses SSH for secure connections. SSH is a simple, passwordless (key-based) network authentication protocol that is very secure.

Your only responsibility is to generate a key pair on your control machine and copy the public key to your node machines.

**Push-based Architecture**

Ansible is a push-based application for sending configurations. With Ansible, when you want to make configuration changes on your nodes, all you have to do is write down the configuration and push it to all your nodes at once.

In simple terms, this gives you full control over when you make changes on your nodes, and it also makes setup very easy and fast, with minimal requirements.

Since we say Ansible is a push-based application, what do you think is the difference between a push-based application and a pull-based application?

Well, tools like Puppet and Chef are pull-based applications, whereas Ansible is a push-based application for configuration management.

In the case of Puppet and Chef, agent software is present on the nodes: Puppet's is called the Puppet agent, and Chef's is known as the Chef client. What the agent does is poll the central server periodically for any kind of configuration information.

Whatever information the agent finds, it pulls those changes and applies them on your node machines.

In the case of Ansible, since there are no agents, whenever you want to make changes you can make them directly and push those configurations whenever you want, as you have full control over them.

You can see an example of push-based and pull-based application architectures in the picture below.

Before we start the installation, let me tell you that there are basically two types of machines in an Ansible deployment: the control machine and the remote machines.

Control machines are the machines from which we manage other machines, and remote machines are the machines that are handled or controlled by the control machines.

There can be multiple remote machines handled by a single control machine. In order to handle remote machines, we have to install Ansible on the control machine.

**Step 1:** Update the repositories using the below command.

sudo apt-get update

**Step 2:** Install the software-properties-common package using the below command.

sudo apt-get install software-properties-common -y

**Step 3:** Add the Ansible repositories using the below command.

sudo apt-add-repository ppa:ansible/ansible -y

**Step 4:** Update the repositories using the below command.

sudo apt-get update -y

**Step 5:** Install the Ansible tool using the below command.

sudo apt-get install ansible -y

**We are now done with the installation of Ansible on Ubuntu.**

Let's move on to our next topic: Host Inventory.

The inventory defines groups of hosts. For example, you can group web servers in one group and application servers in another. A group can have multiple servers, and a single server can be a part of multiple groups.

If you want to group your web servers together and your data servers together, all you have to do is write a group name between two square brackets [ ].

If you want to make configuration changes on the web servers but not on the data servers, you just have to specify the group name in the hosts field, and it will automatically configure only your web servers.
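A minimal sketch of such an inventory file, with hypothetical hostnames used purely for illustration:

```ini
# inventory (hosts) file: group names go between square brackets
[webservers]
web1.example.com
web2.example.com

[dataservers]
db1.example.com
```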

Modules are nothing but the executable plugins that get the real job done. Usually, modules take key-value arguments and run in a customized way depending on the arguments given.

A module can be invoked from the command line, or it can be included in an Ansible playbook.

- If you want to ping all the hosts found in the inventory, type the below command.

ansible all -m ping

- If you want to run a command (for example, ls) on the hosts in the webservers group, type the below command.

ansible webservers -m command -a "ls"

- If you want to flush the iptables rules on all the hosts in the inventory, type the below command.

ansible -i inventory all -m command -a "iptables -F" --become --ask-become-pass

- If you want to gather facts about the managed hosts, type the below command.

ansible all -m setup

- If you want to see which particular facts you can extract, consult the documentation of the setup module with the below command.

ansible-doc setup

Ansible uses YAML syntax for expressing playbooks because it is really simple to understand, read and write compared to other data formats like XML and JSON.

Every YAML document starts with three dashes (---) and can end with three dots (...). You can also use abbreviations (inline notation) in YAML to represent dictionaries.

Not only that, we can also represent lists in a YAML file.
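The dictionary and list forms described above can be sketched as follows (the keys and values are made up for illustration):

```yaml
---
# A dictionary in block form
employee:
  name: John
  role: developer
# The same dictionary in abbreviated (inline) form
employee_inline: {name: John, role: developer}
# A list in block form
skills:
  - python
  - ansible
# The same list in abbreviated form
skills_inline: [python, ansible]
...
```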

Playbooks are the files where Ansible code is written, and they are written in YAML format. (YAML originally stood for "Yet Another Markup Language" and now officially stands for "YAML Ain't Markup Language".) The playbook is one of the core features of Ansible and tells Ansible what to execute.

A playbook is like a to-do list for Ansible: it contains a list of tasks, the steps that the user wants to execute on a particular machine.

Playbooks run sequentially and are the building blocks for all of Ansible's use cases.

As we discussed earlier, a YAML document starts with "---" and ends with "...". There are multiple tags available in a playbook; let's go through each of them.

**Name:**

The name tag specifies the name of the Ansible playbook; you can give the playbook any logical name.

**Hosts:**

The hosts tag specifies the host group, or the hosts, against which we want to perform the tasks. The hosts field is mandatory.

**Vars :**

The vars tag lets you define variables that you can use in your playbook; usage is very similar to variables in any programming language.

**Tasks:**

Every playbook should contain a task or a list of tasks to be executed. A task is basically an action one needs to perform.
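Putting the tags together, a small playbook might look like the sketch below (the group name `webservers` and the Apache package are hypothetical, chosen only to illustrate the structure):

```yaml
---
- name: Install and start Apache        # Name tag
  hosts: webservers                     # Hosts tag
  become: yes
  vars:                                 # Vars tag
    pkg_name: apache2
  tasks:                                # Tasks tag
    - name: Install the package
      apt:
        name: "{{ pkg_name }}"
        state: present
    - name: Start the service
      service:
        name: "{{ pkg_name }}"
        state: started
...
```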

Hope this blog helps you understand what Ansible is and what its components are.

To know more about DevOps tools, you can go through our blogs section, where you will find each and every open-source DevOps tool covered.


The post Linear Regression on Boston Housing data appeared first on AcadGild.

Linear regression is used to find the relationship between a target and one or more predictors. Here the target is the dependent variable, and the predictors are the independent variables.

In this blog, we are using the Boston Housing dataset, which contains information about different houses. We can also access this data from the scikit-learn library. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices using the given features.

The features of the dataset can be summarized as follows:

- CRIM: This column represents per capita crime rate by town
- ZN: This column represents the proportion of residential land zoned for lots larger than 25,000 sq.ft.
- INDUS: This column represents the proportion of non-retail business acres per town.
- CHAS: This column represents the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
- NOX: This column represents the concentration of the nitric oxide (parts per 10 million)
- RM: This column represents the average number of rooms per dwelling
- AGE: This column represents the proportion of owner-occupied units built prior to 1940
- DIS: This column represents the weighted distances to five Boston employment centers
- RAD: This column represents the index of accessibility to radial highways
- TAX: This column represents the full-value property-tax rate per $10,000
- PTRATIO: This column represents the pupil-teacher ratio by town
- B: This is calculated as 1000(Bk — 0.63)², where Bk is the proportion of people of African American descent by town
- LSTAT: This is the percentage lower status of the population
- MEDV: This is the median value of owner-occupied homes in $1000s

So let’s get started with our coding in Python.

First, we will import all the important libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn

We will then load the boston dataset from the sklearn library.

from sklearn.datasets import load_boston
boston = load_boston()

Now we will load the data into a pandas dataframe and then print the first few rows of the data using the **head()** function.

bos = pd.DataFrame(boston.data)
bos.head()

We will now rename the columns as per the description of the dataset given above.

bos.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
bos.head()

The variable MEDV indicates the prices of the houses and is the target variable. The rest of the variables are the predictors based on which we will predict the value of the house.

In the above result, we can see that the target variable ‘MEDV’ is missing from the data. We will create a new column of target values and add them to the dataframe.

bos['MEDV'] = boston.target

Fetching more information about the dataset using the **info()** function.

bos.info()

From the above information, we can see that the 14 columns present in the dataset contain all non-null values with float data types.

Checking the statistical values of the dataset using the **describe()** function.

bos.describe()

We will now check for null values, if any are present in the dataset.

bos.isnull().sum()

There are no null values present in the dataset.

**EDA**

Exploratory Data Analysis is a very important step before training the model. We will use some visualizations to understand the relationship of the target variable with other variables.

We will first plot the distribution of the target variable MEDV. For this we will use the **distplot()** function from the seaborn library.

sns.distplot(bos['MEDV'])
plt.show()

From the above output we can see that the values of MEDV are normally distributed, with some outliers.

We will now visualize the pairplot which shows the relationships between all the features present in the dataset.

sns.pairplot(bos)

We will now use the heatmap function from the seaborn library to plot the correlation matrix.

corr_mat = bos.corr().round(2)
sns.heatmap(data=corr_mat, annot=True)

From the above two graphs, we can clearly see that the feature RM has a positive correlation with MEDV.

Based on the above observations we will plot an **lmplot** between RM and MEDV to see the relationship between the two more clearly.

sns.lmplot(x = 'RM', y = 'MEDV', data = bos)

**Splitting the data into Training and Test Data**

We will now split the dataset into training and test data. We do this to train our model with 80% of the samples and test with the remaining 20%.

We are using the *train_test_split* function from the sklearn library to split the data.

X = bos[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
y = bos['MEDV']

X holds the independent variables and y is the dependent variable.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

**Training the Model**

We will now train our model using the **LinearRegression** class from the sklearn library.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

**Prediction**

We will now make predictions on the test data using the trained LinearRegression model, and plot a scatterplot of the test values against the predicted values.

prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)

Building a data frame of the actual and predicted values and plotting a graph for the same.

df1 = pd.DataFrame({'Actual': y_test, 'Predicted': prediction})
df2 = df1.head(10)
df2

df2.plot(kind = 'bar')

From the above graph, we can see that there is not much difference between the actual and predicted values; hence our model seems to work pretty well.

**Model Evaluation**

We will now evaluate the model using the metrics module and the r2_score function from the sklearn library.

Here we will evaluate the Mean Absolute Error, Mean Squared Error, Root Mean Squared Error and the R-squared value.

The value of R-squared ranges from 0 to 1, where a value of 1 (or near 1) indicates that the predictors perfectly account for all the variation in y.

from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))

The R-squared value is moderately close to 1, which seems to be a good start. However, we will keep working to increase the model's performance in our upcoming blogs.

Do drop us a comment for any query or suggestion. Keep visiting our website for more blogs on Data Science and Data Analytics.


The post K-Nearest Neighbors using R appeared first on AcadGild.

In K-Nearest Neighbors (KNN), predictions are made for a new instance (x) by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances.

In KNN, K is the number of nearest neighbors. The number of neighbors is the deciding factor. If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

For finding closest similar points, we find the distance between points using distance measures such as Euclidean distance, Hamming distance, Manhattan distance and Minkowski distance. KNN has the following basic steps:

- Calculate distance
- Find closest neighbors
- Group the similar data

In this blog we will be analysing the ___ dataset using the KNN algorithm.

Let's dive into the coding part.

Loading the required packages.

Loading the dataset and getting the structure of the dataset using the str() function.
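A sketch of this step in R, assuming the caret package is used for modeling and the data was saved locally as `binary.csv` (the filename is an assumption):

```r
library(caret)
# Load the admissions data from a local CSV file (hypothetical filename)
df <- read.csv("binary.csv")
# Inspect the structure of the dataset
str(df)
```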

We can see there are 4 variables, viz. admit, gre, gpa and rank, where admit is the target variable and the remaining 3 are predictors.

Checking for null values, if present.

As we can see, there are no null values present in the whole dataset.

The target variable 'admit' has two values, 0 and 1, where 0 depicts False and 1 depicts True.

Hence we will convert the two values into the factor levels 'Yes' and 'No'.

Summarizing the dataset.

We can see that the number of admissions taken is 127 and the number not taken is 273.

Partitioning the dataset into training and test data.

Here we are using the trainControl() function, which controls the computational nuances of the train() function.

With trainControl() we are performing 10-fold cross-validation.

Fitting the training data.
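The fitting step described above might look like the following caret sketch; the object names, the repeat count, and the tuneLength are assumptions, while the ROC metric matches the selection criterion mentioned below:

```r
# 10-fold cross-validation; class probabilities are needed for the ROC metric
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       classProbs = TRUE, summaryFunction = twoClassSummary)
# Fit a KNN classifier, tuning k and selecting the model with the largest ROC
fit <- train(admit ~ ., data = training, method = "knn",
             metric = "ROC", trControl = trctrl,
             preProcess = c("center", "scale"), tuneLength = 20)
fit
```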

Here, ROC was used to select the optimal model using the largest value; hence the chosen value for k is 39.

Plotting the fitted model.

We now calculate the variable importance for the object produced by train() using the varImp() function.

The ROC-based variable importance is shown, where gpa is the most important variable, followed by rank.

Making predictions on the training and test data.

From the confusion matrix generated above, we can see that 7 observations are actually 'yes' but are predicted as 'no'. Similarly, 26 observations are actually 'no' but are predicted as 'yes'. These values have been misclassified.

The model also achieves an accuracy of 71% on the test data.

Keep visiting our website for more blogs on Data Science and Data Analytics.

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn about our data science course in Bangalore.*



The post Linear Discriminant Analysis with R appeared first on AcadGild.

The main goal of dimensionality reduction techniques is to reduce the number of dimensions by removing redundant and dependent features, transforming the features from a higher-dimensional space to a space with fewer dimensions.

The difference between PCA and LDA is that PCA ignores the class labels, whereas LDA attempts to find a feature subspace that maximizes class separability.

In this blog we will make predictions on the iris dataset. We have already analyzed the iris dataset with Principal Component Analysis in our previous blog.

So let us begin coding in R and understand the difference between PCA and LDA.

Loading the data and displaying the first few records.

Getting the structure of the dataset.

Checking for null values, if present.

Hence, no null value present in the whole dataset.

Summarizing the dataset.

Finding the correlation between the independent variables.

As we can see, the correlation between Petal.Length and Petal.Width is high.

Splitting the data into training data and test data.

Linear Discriminant Analysis can be easily computed using the lda() function from the MASS package.
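The fitting step can be sketched as below, assuming the training split is stored in a data frame named `training`:

```r
library(MASS)
# Fit LDA with Species as the class and the four measurements as predictors
model <- lda(Species ~ ., data = training)
model
```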

It gives the following output.

LDA determines group means and computes, for each individual, the probability of belonging to each group. The individual is then assigned to the group with the highest probability score.

The percentage of separation achieved by the first discriminant function is 99.14%, which is very high.

Checking the prior probabilities of class membership using the attribute ‘prior’.

Counting the number of each species using the attribute 'counts'.

Scaling the values of the two linear discriminants obtained in the earlier result.

Stacked histograms of the discriminant function values.

The above code displays histograms and density plots for the observations in each group on the linear discriminant dimension.

As we can see, group 1 (setosa) does not overlap with the other species, while versicolor and virginica overlap at some points.

We got an accuracy of 97.34% on the training data.

As we can see, we got the best accuracy on the test data; therefore we can infer that all the flowers are assigned to their respective species correctly.

And this brings us to the end. Do drop us a comment below for any query or suggestions.

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn about our data science course in Bangalore.*

Keep visiting our website for more blogs on Data Science and Data Analytics.


The post Principal Component Analysis with R appeared first on AcadGild.

Principal Component Analysis (PCA) is an unsupervised learning technique and is used in applications like face recognition and image compression.

In this blog we will be implementing the famous ‘iris’ dataset with PCA in R.

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher. The data was collected to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

The dataset contains a set of 150 records under 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species).

So let us dive into the coding part.

We will first load the dataset and display the first few records.

Getting the structure of the dataset using the str() function.

Hence there are 150 rows and 5 columns.

Checking for any null value, if present.

Hence there is no null value.

Summarizing the dataset using the summary() function.

Partitioning the dataset into training data and test data.

We can see that a high correlation exists between petal length and petal width. We have also dropped the last column, 'Species', in the above code.

High correlations among independent variables lead to “multicollinearity” problems.

Here we have performed PCA on the four variables using the prcomp() function. It performs the analysis on the given data matrix and returns the result as an object of class prcomp.

We have used the attribute 'center', which indicates that the variables should be shifted to be zero-centered.

Taking the mean of 'Sepal_length' from the training data.

We scale the values using the attribute 'scale', which scales the columns of a numeric matrix.

'center' and 'scale' refer to the respective means and standard deviations of the variables, which are used for normalization prior to implementing PCA.
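The PCA call described above might look like this sketch, assuming the training split is stored in `training` with Species as its fifth column:

```r
# PCA on the four numeric measurement columns, zero-centered and scaled
pca <- prcomp(training[, -5], center = TRUE, scale. = TRUE)
summary(pca)
```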

Taking the standard deviation of ‘Sepal_length’ from the training data.

We have results for 4 principal components. Each principal component is a normalized linear combination of the original variables.

The pairs.panels() function is used to show a scatter plot of matrices (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlations above the diagonal.

fviz_pca_biplot() is a function from the factoextra package used to create a biplot of individuals and variables.

It shows which components are more significant. Petal length, petal width and sepal length are more significant components, whereas sepal width, which lies far from the data points, is a less significant component.

**Summarizing model1**

Creating the confusion matrix and computing the misclassification error on the training data.

As we can see from the above result, 39, 38 and 35 flowers belong to the correct species for setosa, versicolor and virginica, respectively, whereas 5 and 4 flowers are misclassified as versicolor and virginica.

Hence, the misclassification error is about 7.4%.

Creating the confusion matrix and computing the misclassification error on the test data.

Similarly, on the test data, 11, 7 and 9 flowers are correctly classified as setosa, versicolor and virginica, respectively, whereas only 2 flowers are misclassified as virginica.

Hence, the misclassification error for the test data is about 6.9%.
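Both percentages follow directly from the confusion matrices: the error is the off-diagonal total divided by the grand total. A small sketch (the matrices below are reconstructed from the counts quoted in the text, so the exact placement of the off-diagonal entries is an assumption):

```python
import numpy as np

def misclassification_error(conf):
    # error = 1 - (correct predictions on the diagonal / total observations)
    conf = np.asarray(conf)
    return 1 - np.trace(conf) / conf.sum()

# training data: 39, 38 and 35 correct; 5 misclassified as versicolor, 4 as virginica
train_cm = [[39, 0, 0],
            [0, 38, 4],
            [0, 5, 35]]

# test data: 11, 7 and 9 correct; 2 flowers misclassified as virginica
test_cm = [[11, 0, 0],
           [0, 7, 2],
           [0, 0, 9]]

print(round(misclassification_error(train_cm), 3))  # 9/121, about 7.4%
print(round(misclassification_error(test_cm), 3))   # 2/29, about 6.9%
```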

This brings us to the end of this article. I hope you find this blog helpful.

*Keep visiting our site* www.acadgild.com

Keep visiting our website for more blogs on Data Science and Data Analytics.

The post Principal Component Analysis with R appeared first on AcadGild.


We will carry out this analysis on the popular USArrest dataset. We have already done the analysis on this dataset using K-means clustering in our previous blog. I suggest you go through that blog to get a better understanding of the dataset. You can refer to it from the below link: Analyzing USArrest dataset using K-means Clustering

We will load the dataset and get the first few records.

Getting the structure of the dataset using the str() function.

Checking whether any null values are present.

Hence, there are no null values present.

Summarizing the dataset using the **summary()** function.

We have now summarized the dataset and observed that there are a total of 50 rows and 4 columns.

Importing the necessary libraries.

Scaling the dataset and displaying the first few records

Based on the algorithmic structure, there are two ways of clustering the data points.

- **Agglomerative:** An agglomerative approach begins with each observation in a separate cluster of its own, and successively merges the most similar clusters together until a stopping criterion is satisfied or there is just one big cluster.
- **Divisive:** This is the inverse of agglomerative clustering: all objects start in one cluster, which is then successively split.

Performing **Agglomerative Hierarchical Clustering**

We perform the agglomerative hierarchical clustering with the hclust() function.

First we compute the dissimilarity values using the **dist()** function and then feed these values into the **hclust()** function.

After this we specify the agglomeration method to be used (i.e. “complete”, “average”, “single”, “ward.D”). Here we have used the ‘complete linkage’ method, which measures the distance between two clusters as the maximum distance between their members and, at each step, merges the pair of clusters with the smallest such distance.

We will then plot the dendrogram, which is a multilevel hierarchy where clusters at one level are joined together to form the clusters at the next level.

It gives the below graph

In the above code we have divided the tree into four groups, fetched the number of members in each cluster, and then plotted the graph.
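For comparison only (the post itself uses R's dist(), hclust() and cutree(); the random data below is a stand-in for the 50 scaled USArrests rows, which are not reproduced here), the same complete-linkage pipeline can be sketched in Python with SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # stand-in for the 50 scaled USArrests rows

d = pdist(X, metric='euclidean')  # dissimilarity values, like dist() in R
Z = linkage(d, method='complete') # complete-linkage agglomeration, like hclust()

# cut the tree into four groups, like cutree(hc, k = 4) in R
groups = fcluster(Z, t=4, criterion='maxclust')
print(np.bincount(groups)[1:])    # number of members in each cluster
```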

We will use the agnes() function, in which each observation is initially assigned to its own cluster. Then the similarity between each pair of clusters is computed and the two most similar clusters are merged into one.

Hence we have performed agglomerative hierarchical clustering and visualized the resulting clusters.

Hope you find this blog helpful. In case of any query or suggestions drop us a comment below.


The post Hierarchical Clustering with R appeared first on AcadGild.


- How to install and configure the Go environment on Windows.

Go is an open-source programming language that makes it easy to build simple, reliable, and efficient software.

For instance, the Go shell is a popular application that enables us to run Go code before running the actual job. In addition, Go is user-friendly, so in this blog we are going to show you how you can install the Go environment on Windows.

**It’s open-source at it’s best…but don’t forget: it’s case-sensitive!**

So let’s get started on the Microsoft Windows 10 operating system. You’ll see just how easy this really is; only a basic working knowledge of GitHub and the command prompt is required. Sure, there are other ways of installing and running the program, but for those with a limited coding background, this set of instructions is the easiest to understand and follow.

**Step 1: **As Go uses open-source (FREE!) repositories often, be sure to install the Git package here first.

**Step 2:** Download and install the latest 64-bit Go set for Microsoft Windows OS.

**Step 3:** Double click on the installer downloaded in step 2 to start the installation process.

**Step 4:** Accept the end-user license agreement and click on the Next button.

**Step 5: **Here you have to select the destination folder where you want to install.

**Note: We recommend you keep the default destination folder suggested by the installer.**

**Step 6: **Now, Click on the Install button.

**Step 7: **Click on the Finish button once the installation gets complete.

**Step 8:** Verify the installation by opening the Command Prompt on your computer (search for “cmd”) and typing: “go version”.

**Step 9:** Run your first hello world program.

Create one file called hello.go and put the below code in it.
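The listing itself is not reproduced in this extract; the canonical Go hello-world, which is what this step describes, looks like the following (the `greeting` helper is our addition so the output can be checked; the usual version prints the string directly from `main`):

```go
package main

import "fmt"

// greeting returns the message so the output can be checked separately from main
func greeting() string {
	return "Hello, World!"
}

func main() {
	fmt.Println(greeting())
}
```

Running `go run hello.go` in the command prompt should then print `Hello, World!`.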

**Step 10:** Now run the code using the command prompt.

As you can see above, we have successfully run our first program: hello world.

I hope this blog helps you install the Go environment on Windows.

The post How To Install Go Language On Windows appeared first on AcadGild.


Searching is a technique for finding a particular element in a list.

In our previous blog, we have learned about sorting and searching algorithms in detail along with all the sorting types and its working with examples.

You can refer to the blog by the below link:

Introduction_to_Data_Structure

In this blog, we will be implementing programs for various sorting algorithms in Python. So let us start with Bubble sorting.

**Bubble Sort**

Given an array ‘array’ of n elements with values or records x_{1}, x_{2}, x_{3}, …, x_{n}, bubble sort is applied to sort the array as follows:

- Start with the first element (index 0). Compare the first two elements x_{1} and x_{2} in the list.
- If x_{1} > x_{2}, swap those elements.
- If x_{1} < x_{2}, leave them and continue with the next two elements.
- Repeat step 1 until the whole array is sorted and no more swaps are possible.
- Return the final sorted list.

**Program :**

    def bubbleSort(array):
        # the outer loop traverses the elements from index 0 to n-1
        for i in range(0, len(array)-1):
            # the inner loop skips the last i elements, which are already sorted and in fixed positions
            for j in range(0, len(array) - 1 - i):
                # if the current element is greater than the next element
                if array[j] > array[j+1]:
                    # then swap the positions of the two elements
                    array[j], array[j+1] = array[j+1], array[j]

    # taking input from the user separated by a delimiter
    inp = input('Enter a list of numbers separated by commas: ').split(',')
    # typecasting each value of the list into an integer
    array = [int(num) for num in inp]
    bubbleSort(array)
    print('The Sorted list is :', array)

**Output:**

Worst Case Time Complexity: O(n^{2}). The worst case occurs when the array is reverse-sorted.

Best Case Time Complexity: O(n). The best case occurs when the array is already sorted; note that this requires the optimized variant that stops early when a full pass makes no swaps.

**Selection Sort**

Consider an array ‘arr’ with n elements x_{1}, x_{2}, x_{3}, …, x_{n}; selection sort is applied to sort the array as follows:

- Start with the first element (index 0), set min_elem = 0, and search for the minimum element in the list.
- If a smaller value is found, swap the first element with that minimum element.
- Increment the starting position so that it points to the next element.
- Repeat the steps on the remaining sublists until the list is sorted.

**Program**

    def selectionSort(arr):
        # the outer loop traverses the elements from index 0 to n-1
        for i in range(0, len(arr)-1):
            # min_elem is initialized to i; it tracks the minimum value in the unsorted part of the list
            min_elem = i
            # the inner loop starts from i+1 as it iterates through the unsorted part of the list
            for j in range(i+1, len(arr)):
                # compare to find the minimum element in the remaining unsorted list
                if arr[j] < arr[min_elem]:
                    # after finding a smaller value, remember its index in min_elem
                    min_elem = j
            # swap the minimum element found with the first unsorted element
            temp = arr[i]
            arr[i] = arr[min_elem]
            arr[min_elem] = temp

    # taking input from the user separated by a delimiter
    inp = input('Enter a list of numbers separated by commas: ').split(',')
    # typecasting each value of the list into an integer
    arr = [int(num) for num in inp]
    selectionSort(arr)
    print('The Sorted list is :', arr)

**Output:**

Worst-Case and Best-Case Time Complexity: O(n^{2}) as there are two nested loops.

**Insertion Sort**

Given an array with n elements with values or records x_{0}, x_{1}, x_{2}, x_{3}, …, x_{n}.

- Initially, x_{0} is the only element in the sorted sublist and the leftmost element in the array.
- We start from the element x_{1} and assign it as the key. Compare x_{1} with the elements in the sorted sublist (initially just x_{0}), and place it in the correct position (shift all the elements in the sorted sublist that are greater than the key one place to the right).
- Then we make the third element the key, compare it with all the elements to its left, and insert it at the right position.
- Repeat steps 2 and 3 until the array is sorted.

**Program**

    def insertionSort(ar):
        # the outer loop starts from index 1 so each element has at least one element to compare itself with
        for i in range(1, len(ar)):
            # make the current element the key
            key = ar[i]
            # j points to the element left of i
            j = i - 1
            # shift elements of the sorted sublist that are greater than the key one place to the right
            while j >= 0 and key < ar[j]:
                ar[j+1] = ar[j]
                j = j - 1
            ar[j+1] = key

    # taking input from the user separated by a delimiter
    inp = input('Enter a list of numbers separated by commas: ').split(',')
    # typecasting each value of the list into an integer
    ar = [int(num) for num in inp]
    insertionSort(ar)
    print('The Sorted list is :', ar)

**Output:**

Worst Case Time Complexity: O(n^{2}).

Best Case Time Complexity: **Ω**(n).

**Merge Sort**

Given an unsorted array with n elements with values x_{1}, x_{2}, x_{3}, …, x_{n}, the array is repeatedly divided into sub-arrays. We implement two main operations: divide and merge.

- Divide the given array into multiple small arrays until each contains a single atomic value.
- Merge the smaller arrays into a new list in sorted order.

**Program:**

    def mergeSort(alist):
        print("Splitting ", alist)
        if len(alist) > 1:
            mid = len(alist)//2
            lefthalf = alist[:mid]
            righthalf = alist[mid:]
            # recursively sort both halves
            mergeSort(lefthalf)
            mergeSort(righthalf)
            i = 0
            j = 0
            k = 0
            # merge the two sorted halves back into alist
            while i < len(lefthalf) and j < len(righthalf):
                if lefthalf[i] < righthalf[j]:
                    alist[k] = lefthalf[i]
                    i = i + 1
                else:
                    alist[k] = righthalf[j]
                    j = j + 1
                k = k + 1
            # copy any remaining elements of the left half
            while i < len(lefthalf):
                alist[k] = lefthalf[i]
                i = i + 1
                k = k + 1
            # copy any remaining elements of the right half
            while j < len(righthalf):
                alist[k] = righthalf[j]
                j = j + 1
                k = k + 1
        print("Merging ", alist)

    alist = input('Enter the list of numbers: ').split()
    alist = [int(x) for x in alist]
    mergeSort(alist)
    print('Sorted list: ', end='')
    print(alist)

**Output:**

Worst-Case and Best-Case Time Complexity: O(n log(n)), as merge sort always divides the array into two halves and takes linear time to merge them.

**Quick Sort**

Given an array with n elements with values x_{1}, x_{2}, x_{3}, …, x_{n}.

- Make the rightmost element of the array as the pivot.
- Partitioning: Rearranging the array such that all the elements with values less than the pivot come before it, and all the elements with values greater than the pivot come after it.

After this, the pivot comes to its correct final position.

- The elements at the left and right of the pivot are not sorted, hence we take these subarrays and repeat steps 1 and 2 until we get the sorted array.
- The approach used here is recursion at each split to get to the single-element array.

**Program:**

    # function to implement partitioning, where 'low' and 'high' are the first and last indices of 'array'
    def partition(array, low, high):
        i = low - 1
        # the pivot is the last element in the array
        pivot = array[high]
        for j in range(low, high):
            # compare each element in the array with the pivot
            if array[j] <= pivot:
                # if the condition is true, increment the value of i by 1
                # and swap the element at the current index of j with the element at the current index of i
                i = i + 1
                array[i], array[j] = array[j], array[i]
        # after the traversal is done, place the pivot at index i+1, its final sorted position
        array[i+1], array[high] = array[high], array[i+1]
        # returning the pivot index
        return i + 1

    # function to do quick sort
    def quickSort(array, low, high):
        # proceed only while low is smaller than high
        if low < high:
            # p is the partitioning index; we perform partitioning until the array is sorted
            p = partition(array, low, high)
            # separately sort the elements before and after the partition
            quickSort(array, low, p-1)
            quickSort(array, p+1, high)

    # taking input from the user separated by a delimiter
    inp = input('Enter a list of numbers separated by commas: ').split(',')
    n = len(inp)
    # typecasting each value of the list into an integer
    array = [int(num) for num in inp]
    quickSort(array, 0, n-1)
    print('The Sorted list is :', array)

**Output**

Worst Case Time Complexity: O(n^{2}).

Best Case Time Complexity: O(n log(n)).

**Linear Search**

For a given array[] with n elements, where x is the key element to be searched for, we do the linear search as follows:

- Start from the first element of the array, and one by one compare the key with each element of the array
- If the key matches any of the elements, it returns the index of the corresponding element
- If no such element is found, it returns -1.

**Program**

    def linearSearch(array, x):
        for i in range(0, len(array)):
            if array[i] == x:
                return i
        return -1

    array = input('Enter the list of elements: ').split(',')
    arr = [int(num) for num in array]
    x = int(input('Enter the element that needs to be searched: '))
    result = linearSearch(arr, x)
    if result == -1:
        print('Element was not present in the list')
    else:
        print('Element was found at the position', result)

**Output:**

Worst-Case Time Complexity: O(n).

Best-Case Time Complexity: O(1)

**Binary Search**

For a given array[] with n elements, and x is the key element that has to be searched, we do the binary search:

- Start by dividing the given array into two halves and compare the middle element with x
- If x matches the middle element, it returns the index of that middle element
- Else if x is smaller than the middle element, it must be present in the left subarray, so we recurse on the left half
- Else, x must be present in the right subarray, so we recurse on the right half.

**Program:**

    # returns the index of x in the array if present, else -1; f and l are the first and last indices
    def binarySearch(array, f, l, x):
        # checking the base case
        if f <= l:
            # getting the middle index
            mid = f + (l - f)//2
            # checking if x is present at the middle index
            if array[mid] == x:
                return mid
            # if the element is smaller than the middle element, it lies in the left subarray
            if array[mid] > x:
                return binarySearch(array, f, mid-1, x)
            # if the element is larger than the middle element, it lies in the right subarray
            else:
                return binarySearch(array, mid+1, l, x)
        # the element is not present in the list at all
        return -1

    arr = input('Enter the list of elements: ').split(',')
    array = [int(num) for num in arr]
    x = int(input('Enter the element that needs to be searched: '))
    result = binarySearch(array, 0, len(array)-1, x)
    if result == -1:
        print('Element was not present in the list')
    else:
        print('Element was found at the position', result)

**Output:**

Worst-Case Time Complexity: O(log n).

Best-Case Time Complexity: O(1)

This brings us to the end. For any queries or suggestions, drop us a comment below.


The post Sorting and Searching Program in Python appeared first on AcadGild.


We have done an analysis on the USArrest dataset using K-means clustering in our previous blog; you can refer to it from the below link:

This wine dataset is a result of chemical analysis of wines grown in a particular area. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The attributes are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavonoids, Non-Flavonoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The data set has 178 observations and no missing values.

You can download the dataset from the link.

Our goal is to try to group similar observations together and determine the number of possible clusters (it may differ from 3). This would help us make predictions and reduce dimensionality.

Loading the dataset and getting the first few records of the dataset

Getting the structure of the dataset using the str() function.

We can see the dataset has 178 rows and 14 columns

Summarizing the dataset using the **summary()** function.

Checking for missing values: there are no missing values present in the whole dataset.

Displaying the first few columns of the dataset after scaling it.

We can see that the data points have been standardized that is, it has been scaled. Scaling is done to make the variables comparable.

Standardizing consists of transforming the variables such that they have zero mean and a standard deviation of 1.

Now we will load two required R packages: cluster and factoextra.

Now we define the clusters such that the total intra-cluster variation (total within-cluster sum of squares) is minimized.

It creates the below graph

Similar to the elbow method, the fviz_nbclust() function can be used to visualize and determine the optimal number of clusters.

From the various results above, we find that 3 is the optimal number of clusters, so we can perform the final analysis and extract the results using these 3 clusters.

Determining the cluster assignment: a vector of integers (from 1 to k) indicating the cluster to which each point is allocated.

Determining the cluster sizes, that is, the number of points in each cluster.
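As a cross-check in Python (an illustrative sketch, not the post's R code, using scikit-learn's bundled copy of the same 178-row wine data), the scaling, the 3-cluster fit, the assignment vector and the cluster sizes look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data  # 178 wines x 13 chemical measurements

# scale the variables so they are comparable, as in the post
X_scaled = StandardScaler().fit_transform(X)

# fit K-means with the 3 clusters found to be optimal
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

print(km.labels_[:10])           # cluster assigned to each of the first ten wines
print(np.bincount(km.labels_))   # number of points in each cluster
```

Note that scikit-learn numbers clusters from 0 to k-1, whereas R's kmeans() numbers them from 1 to k.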

2D representation of clusters

Hence, we have computed the optimal number of clusters, which is 3, and visualized the K-means clustering.

Hope you find this blog helpful. In case of any query or suggestions drop us a comment below.


The post Analyzing Wine dataset using K-means Clustering appeared first on AcadGild.
