
We will carry out this analysis on the popular USArrest dataset. We have already analyzed this dataset using K-means clustering in our previous blog. I suggest you go through that blog to get a better understanding of the dataset. You can refer to it from the link below: Analyzing USArrest dataset using K-means Clustering

We will load the dataset and get the first few records.

Getting the structure of the dataset using the str() function.

Checking whether any null values are present.

Hence, there are no null values present in the dataset.

Summarizing the dataset using the **summary()** function.

We have summarized the dataset and observed that there are 50 rows and 4 columns in total.

Importing the necessary libraries.

Scaling the dataset and displaying the first few records

Based on the algorithmic structure, there are two ways of clustering the data points.

- **Agglomerative:** An agglomerative approach begins with each observation in a separate cluster of its own, and successively merges similar clusters until a stopping criterion is satisfied or there is just one big cluster.
- **Divisive:** This is the inverse of agglomerative clustering: all objects start in one cluster, which is recursively split into smaller clusters.

Performing **Agglomerative Hierarchical Clustering**

We perform the agglomerative hierarchical clustering with hclust.

First, we need to compute the dissimilarity values using the **dist()** function, and we will then pass these values to the **hclust()** function.

After this, we specify the agglomeration method to be used (i.e. “complete”, “average”, “single”, “ward.D”). Here we have used the method ‘complete linkage’, which means that for each pair of clusters, the algorithm computes the maximum distance between their members and merges the pair of clusters for which this maximum distance is smallest.
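
The merging rule just described can be made concrete with a small sketch. The post itself does this in R with dist() and hclust(); the following is an illustrative pure-Python toy (one-dimensional, made-up data), not the blog's code:

```python
# Minimal sketch of agglomerative clustering with complete linkage.
# (Illustrative only; the post uses R's hclust().)

def complete_linkage(points, target_clusters):
    # start with each observation in a cluster of its own
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        # for each pair of clusters, compute the maximum pairwise distance...
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # ...and merge the pair whose maximum distance is smallest
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

print(complete_linkage([1.0, 1.2, 5.0, 5.3, 9.9], 2))
```

Running the merge loop all the way down to a single cluster, and recording the distance at each merge, is exactly what the dendrogram below visualizes.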

We will then plot the dendrogram, a multilevel hierarchy where clusters at one level are joined together to form the clusters at the next level.

It gives the below graph

In the above code, we divided the tree into four groups, fetched the number of members in each cluster, and then plotted the graph.

We will use the agnes() function, in which each observation is initially assigned to its own cluster. The similarity between each pair of clusters is then computed, and the most similar clusters are merged into one.

Hence, we have performed hierarchical clustering and visualized the resulting clusters.

Hope you find this blog helpful. In case of any query or suggestions drop us a comment below.

Keep visiting our website for more blogs on Data Science and Data Analytics.

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.*


The post Hierarchical Clustering with R appeared first on AcadGild.


- How to install and configure the Go environment on Windows.

Go is an open-source programming language that makes it easy to build simple, reliable, and efficient software.

For instance, the Go shell is a popular application that enables us to run Go code before running the actual job. In addition, it is user-friendly, so in this blog we are going to show you how you can install the Go environment on Windows as well as on Linux.

**It’s open-source at it’s best…but don’t forget: it’s case-sensitive!**

So let’s get started on the Microsoft Windows 10 operating system. You’ll see just how easy this really is — only a basic working knowledge of GitHub and the command prompt is required. Sure, there are other ways of installing and running the program, but for readers with a limited coding background, I felt this set of instructions was the easiest to understand and follow.

**Step 1: **As Go uses open-source (FREE!) repositories often, be sure to install the Git package here first.

**Step 2:** Download and install the latest 64-bit Go set for Microsoft Windows OS.

**Step 3:** Double-click the installer downloaded in Step 2 to start the installation process.

**Step 4:** Accept the end-user license agreement and click on the Next button.

**Step 5: **Here you have to select the destination folder where you want to install.

**Note: We recommend keeping the default destination folder selected by the installer.**

**Step 6: **Now, Click on the Install button.

**Step 7:** Click on the Finish button once the installation is complete.

**Step 8:** Verify the installation by opening the Command Prompt on your computer (search for “cmd”), then type: “go version”.

**Step 9:** Run your first Hello World program.

Create a file called hello.go and put the below code in it.
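
The original code listing is not reproduced here, so as a sketch, a standard Go hello-world program that matches these steps would be:

```go
package main

import "fmt"

// greeting returns the message printed by the program.
func greeting() string {
	return "Hello, World!"
}

func main() {
	fmt.Println(greeting())
}
```

Saved as hello.go, `go run hello.go` compiles and runs it in one step and prints `Hello, World!`.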

**Step 10:** Now run the code using the command prompt.

As you can see above, we have successfully run our first program, Hello World.

I hope this blog helps you install the Go environment on Windows.

The post How To Install Go Language On Windows appeared first on AcadGild.


Searching is a technique for finding a particular element in a list.

In our previous blog, we learned about sorting and searching algorithms in detail, along with all the sorting types and how they work, with examples.

You can refer to the blog by the below link:

Introduction_to_Data_Structure

In this blog, we will be implementing programs for various sorting algorithms in Python. So let us start with Bubble sorting.

**Bubble Sort**

Given an array ‘array’ of n elements with values or records x1, x2, x3, …, xn, bubble sort is applied to sort the array ‘array’:

- Start with the first element (index 0). Compare the first two elements x1 and x2 in the list.
- If x1 > x2, swap those elements.
- If x1 < x2, move on and continue with the next two elements.
- Repeat step 1 until the whole array is sorted and no more swaps are possible.
- Return the final sorted list.

**Program :**

```python
def bubbleSort(array):
    #the outer loop will traverse through all the elements starting from the 0th index to n-1
    for i in range(0, len(array)-1):
        #the inner loop skips the last i elements, as they are sorted and in fixed positions
        for j in range(0, len(array) - 1 - i):
            #if the current element is found greater than the next element
            if array[j] > array[j+1]:
                #then swap the position of the two elements
                array[j], array[j+1] = array[j+1], array[j]

#taking input from user separated by delimiter
inp = input('Enter a list of numbers separated by commas: ').split(',')
#typecasting each value of the list into integer
array = [int(num) for num in inp]
bubbleSort(array)
print('The Sorted list is :', array)
```

**Output:**

Worst Case Time Complexity: O(n²). The worst case occurs when the array is reverse sorted.

Best Case Time Complexity: O(n). The best case occurs when the array is already sorted (note that reaching O(n) requires the optimized variant of bubble sort that stops early when a pass makes no swaps).

**Selection Sort**

Consider an array ‘arr’ with n elements x1, x2, x3, …, xn; selection sort is applied to sort the array ‘arr’:

- Start with the first element (index 0), set min_elem = 0, and search for the minimum element in the list.
- If the minimum value is found, swap the first element with the minimum element in the list.
- Increment the position of min_elem so that it points to the next element.
- Repeat the steps with new sublists until the list gets sorted.

**Program**

```python
def selectionSort(arr):
    #the outer loop will traverse through all the elements starting from the 0th index to n-1
    for i in range(0, len(arr)-1):
        #min_elem is initialized to i; it tracks the minimum value in the unsorted part of the list
        min_elem = i
        #the inner loop starts from i+1 as it iterates through the unsorted part of the list
        for j in range(i+1, len(arr)):
            #we compare to find the minimum element in the remaining unsorted list
            if arr[j] < arr[min_elem]:
                #after finding a smaller value we record its index in min_elem
                min_elem = j
        #swapping the minimum found element with the first unsorted element
        temp = arr[i]
        arr[i] = arr[min_elem]
        arr[min_elem] = temp

#taking input from user separated by delimiter
inp = input('Enter a list of numbers separated by commas: ').split(',')
#typecasting each value of the list into integer
arr = [int(num) for num in inp]
selectionSort(arr)
print('The Sorted list is :', arr)
```

**Output:**

Worst-Case and Best-Case Time Complexity: O(n²), as there are two nested loops.

**Insertion Sort**

Given an array with n elements with values or records x0, x1, x2, x3, …, xn:

- Initially, x0 is the only element in the sorted sublist and is the leftmost element in the array.
- We start from the element x1 and assign it as the key. Compare x1 with the elements in the sorted sublist (initially just x0), and place it in the correct position (shift all the elements in the sorted sublist that are greater than the value being sorted).
- Then we make the third element the key, compare it with all the elements to its left, and insert it at the right position.
- Repeat steps 2 and 3 until the array is sorted.

**Program**

```python
def insertionSort(ar):
    #the outer loop starts from index 1, so each element has at least 1 element to compare itself with
    for i in range(1, len(ar)):
        #making the current element the key while iterating over i
        key = ar[i]
        #j is the element left of i
        j = i - 1
        #shift elements of the sorted sublist that are greater than the key one position right
        while j >= 0 and key < ar[j]:
            ar[j+1] = ar[j]
            j = j - 1
        ar[j+1] = key

#taking input from user separated by delimiter
inp = input('Enter a list of numbers separated by commas: ').split(',')
#typecasting each value of the list into integer
ar = [int(num) for num in inp]
insertionSort(ar)
print('The Sorted list is :', ar)
```

**Output:**

Worst Case Time Complexity: O(n²).

Best Case Time Complexity: Ω(n).

**Merge Sort**

Given an unsorted array with n elements with values x1, x2, x3, …, xn, the array is repeatedly divided into sub-arrays. We implement two main functions: divide and merge.

- Divide the given array into multiple small arrays until we get single atomic values.
- Merge the smaller arrays into new lists in sorted order.

**Program:**

```python
def mergeSort(alist):
    print("Splitting ", alist)
    if len(alist) > 1:
        mid = len(alist)//2
        lefthalf = alist[:mid]
        righthalf = alist[mid:]
        #recursion
        mergeSort(lefthalf)
        mergeSort(righthalf)
        i = 0
        j = 0
        k = 0
        while i < len(lefthalf) and j < len(righthalf):
            if lefthalf[i] < righthalf[j]:
                alist[k] = lefthalf[i]
                i = i+1
            else:
                alist[k] = righthalf[j]
                j = j+1
            k = k+1
        while i < len(lefthalf):
            alist[k] = lefthalf[i]
            i = i+1
            k = k+1
        while j < len(righthalf):
            alist[k] = righthalf[j]
            j = j+1
            k = k+1
    print("Merging ", alist)

alist = input('Enter the list of numbers: ').split()
alist = [int(x) for x in alist]
mergeSort(alist)
print('Sorted list: ', end='')
print(alist)
```

**Output:**

Worst-Case and Best-Case Time Complexity: O(n log(n)), as merge sort always divides the array into two halves and takes linear time to merge the two halves.

**Quick Sort**

Given an array with n elements with values x1, x2, x3, …, xn:

- Make the rightmost element of the array as the pivot.
- Partitioning: rearranging the array in such a way that all the elements with a value less than the pivot come before the pivot, and all the elements with a value greater than the pivot come after it.

After this, the pivot comes to its correct final position.

- The elements at the left and right of the pivot are not sorted, hence we take these subarrays and repeat steps 1 and 2 until we get the sorted array.
- The approach used here is recursion at each split to get to the single-element array.

**Program:**

```python
#function to implement partitioning, where 'low' and 'high' are the first and last indices of 'array' respectively
def partition(array, low, high):
    i = low - 1
    #pivot is the last element in the array
    pivot = array[high]
    for j in range(low, high):
        #comparing each element in the array with the pivot
        if array[j] <= pivot:
            #if the condition is true, increment the value of i by 1
            #and swap the element at current index of j with the element at current index of i
            i = i+1
            array[i], array[j] = array[j], array[i]
    #after all the traversing has been done, swap the pivot with the element at index (i+1)
    array[i+1], array[high] = array[high], array[i+1]
    #returning the pivot index
    return i+1

#function to do quick sort
def quickSort(array, low, high):
    #proceed only while the value of low is smaller than high
    if low < high:
        #p is the partitioning index; we perform partitioning until the array is sorted
        p = partition(array, low, high)
        #separately sort elements before the partition and after the partition
        quickSort(array, low, p-1)
        quickSort(array, p+1, high)

#taking input from user separated by delimiter
inp = input('Enter a list of numbers separated by commas: ').split(',')
n = len(inp)
#typecasting each value of the list into integer
array = [int(num) for num in inp]
quickSort(array, 0, n-1)
print('The Sorted list is :', array)
```

**Output**

Worst Case Time Complexity: O(n²).

Best Case Time Complexity: O(n log(n)).

**Linear Search**

For a given array[] with n elements, where x is the key element that has to be searched, we do the linear search:

- Start from the first element of the array, and one by one compare the key with each element of the array.
- If the key matches any of the elements, return the index of the corresponding element.
- If no such element is found, return -1.

**Program**

```python
def linearSearch(array, x):
    for i in range(0, len(array)):
        if array[i] == x:
            return i
    return -1

array = input('Enter the list of elements: ').split(',')
arr = [int(num) for num in array]
x = int(input('Enter the element that needs to be searched: '))
result = linearSearch(arr, x)
if result == -1:
    print('Element was not present in the list')
else:
    print('Element was found at the position', result)
```

**Output:**

Worst-Case Time Complexity: O(n).

Best-Case Time Complexity: O(1)

**Binary Search**

For a given array[] with n elements, where x is the key element that has to be searched, we do the binary search:

- Start by dividing the given array into two halves and then compare the middle element with x
- If x matches with the mid element, it returns the index of that middle element
- Else if x is smaller than the middle element, it means it is present in the left subarray, we recur the function into the left half
- Else, it means x is present in the right subarray, we recur the function into the right half.

**Program:**

```python
# Returns the index of x in array if present, else -1; array is the list of elements and x is the element to be searched
def binarySearch(array, f, l, x):
    #checking the base case
    if f <= l:
        #getting the middle index
        mid = f + (l-f)//2
        #checking if x is present at the middle index
        if array[mid] == x:
            return mid
        #if the element is smaller than the middle element, it is present in the left subarray
        if array[mid] > x:
            return binarySearch(array, f, mid-1, x)
        #if the element is larger than the middle element, it is present in the right subarray
        else:
            return binarySearch(array, mid+1, l, x)
    #the element is not present in the list at all
    return -1

arr = input('Enter the list of elements: ').split(',')
array = [int(num) for num in arr]
x = int(input('Enter the element that needs to be searched: '))
result = binarySearch(array, 0, len(array)-1, x)
if result == -1:
    print('Element was not present in the list')
else:
    print('Element was found at the position', result)
```

**Output:**

Worst-Case Time Complexity: O(log n).

Best-Case Time Complexity: O(1)

This brings us to the end. For any query or suggestions drop us a comment below.

Keep visiting our website for more blogs on Data Science and Data Analytics.

The post Sorting and Searching Program in Python appeared first on AcadGild.


We have done an analysis of the USArrest dataset using K-means clustering in our previous blog; you can refer to it from the link below:

This wine dataset is a result of chemical analysis of wines grown in a particular area. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The attributes are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavonoids, Non-Flavonoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The data set has 178 observations and no missing values.

You can download the dataset from the link.

Our goal is to try to group similar observations together and determine the number of possible clusters (it may differ from 3). This would help us make predictions and reduce dimensionality.

Loading the dataset and getting the first few records of the dataset

Getting the structure of the dataset using the str() function.

We can see the dataset has 178 rows and 14 columns

Summarizing the dataset using the **summary()** function.

Checking for any missing values: there are no missing values present in the whole dataset.

Displaying the first few columns of the dataset after scaling it.

We can see that the data points have been standardized, that is, scaled. Scaling is done to make the variables comparable.

Standardizing consists of transforming the variables such that they have zero mean and a standard deviation of 1.
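
The post performs this step in R with scale(); purely to illustrate the arithmetic, here is a small sketch (Python, with made-up numbers):

```python
# Sketch of z-score standardization: subtract the mean, divide by the
# standard deviation. (Illustrative equivalent of R's scale(), not the post's code.)
import statistics

def standardize(values):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

z = standardize([12.8, 13.5, 14.1, 12.2, 13.9])
print([round(v, 2) for v in z])
```

The resulting values have mean 0 and standard deviation 1, so variables measured on very different scales (e.g. Magnesium vs. Hue) become comparable.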

Now we will load two libraries, cluster and factoextra, which are the required R packages.

Now we define clusters such that the total intra-cluster variation (total within-cluster sum of squares) is minimized.

It creates the below graph

Similar to the elbow method, there is a function fviz_nbclust() that is used to visualize and determine the optimal number of clusters.

From the above various results, we came to know that 3 is the optimal number of clusters, we can perform the final analysis and extract the results using these 3 clusters.

Determining the cluster assignment: a vector of integers (from 1 to k) indicating the cluster to which each point is allocated.

Determining cluster size that is, the number of points in each cluster.

2D representation of clusters

Hence, we have computed the optimal number of clusters, which is 3, and visualized the K-means clustering.

Hope you find this blog helpful. In case of any query or suggestions drop us a comment below.


Keep visiting our website for more blogs on Data Science and Data Analytics.

The post Analyzing Wine dataset using K-means Clustering appeared first on AcadGild.


Apart from this, we will also clarify the difference between what Kubernetes is and what it is not, as there are myths and confusion in communities that misconceive Kubernetes to be a containerization platform.

Well, it is not, so we will be discussing what it is and what it is not.

We will also discuss a use case of how Kubernetes was used at Pokemon Go and how it helped Pokemon Go become one of the best games of the year 2017.

And finally, at the end of the blog, you will get a demonstration of how to do deployment with Kubernetes.

- Need for Kubernetes
- What exactly is it, and what is it not?
- How does Kubernetes work?
- Use case: Kubernetes at Pokemon Go

Before understanding Kubernetes, we should know what a container is and how it works.

First, to understand why we need Kubernetes, let us look at the benefits and drawbacks of containers.

First of all, containers are amazingly good. Whether it is a Linux container, a Docker container, or even a rkt container, they all do one thing: they package your application and isolate it from the host.

These properties make containers fast, reliable, efficient, lightweight, and scalable.

Now hold that thought: yes, containers are scalable, but a problem comes with that, and this is what gives rise to the need for Kubernetes. Even though containers are scalable, they are not easily scalable.

So let's look at it this way: you have one container, and you might want to scale it up to two or three containers.

Well, it is possible; it takes a little manual effort, but you can scale it up.

Now consider a real-world scenario where you might want to scale up to 50 or 100 containers. In these cases, you have to manually set up, manage, and customize reports for every action.

If a container is unable to communicate or share its status with its server, then re-establishing, recovering, or scaling that container becomes complex, with manual accounting.

Scaling up the containers is pretty easy, but the problem is what happens after that. Once you scale up the containers, you will have a lot of problems:

- Containers cannot share signals or reports on their own, so there is a communication gap and the user has to manually manage reports for each container.
- Containers have to be deployed appropriately.
- Containers have to be managed carefully.
- Autoscaling is impossible.
- Distributing traffic is challenging.

Let me explain with an example. Consider that you own an e-commerce portal, say Amazon or Flipkart.

Consider that you have a decent amount of traffic on your portal on weekdays, but on weekends traffic spikes to probably 4x or 5x the usual.

In this case, your servers can respond to the requests received on weekdays because of the lower traffic.

But on weekends, due to the increased traffic, your server may take more time or may hang for some time under the unexpectedly high load.

If you want to avoid this problem, what do you do? You have to scale up. But would you ideally keep scaling up every weekend and scaling down after the weekend?

Technically, is it possible? Would you buy your own servers and set up your infrastructure anew every Friday, and then, the moment your weekdays start, destroy all the new servers you just built?

How do you avoid this problem?

A general solution is to scale up the resources every weekend and scale down the resources after the weekend.

**“This solution is not optimized or the best solution for the above scenario”**.

In truth, **it will be difficult** to set up and manage the servers and other infrastructure resources whenever there is a requirement, and to scale back the resources when there is little or no requirement.

So, in this case, for auto-scaling, we can use Kubernetes. Kubernetes keeps track of your server traffic and container resource utilization, and when the traffic is high or reaches a threshold, the auto-scaling functionality is activated accordingly.

Kubernetes autoscales the containers without any manual intervention, and this auto-scaling function is attracting organizations to adopt Kubernetes.

Kubernetes is a container management tool that automates container deployment, container scaling, and container load balancing.

The benefit of Kubernetes is that it works brilliantly with all cloud vendors, such as Amazon Web Services and Google Cloud Platform.

Kubernetes, a Google-designed product written in the Go language, is an open-source project now maintained by the CNCF (Cloud Native Computing Foundation), which aims to build sustainable ecosystems and advance communities to support the growth and health of cloud-native open-source software.

Here are the key features and selling points of Kubernetes:

- Automatic Bin Packing
- Service Discovery & Load Balancing
- Storage Orchestration
- Self-Healing
- Secret & Configuration Management
- Batch Execution
- Horizontal Scaling
- Automatic Rollbacks & Rollouts

**1) Automatic Bin Packing**

Automatic bin packing means that Kubernetes packages your application and automatically places containers based on their requirements and the availability of resources.

**2) Service Discovery And Load Balancing.**

If you are using Kubernetes, there is no need to worry about internal network address setup and management: Kubernetes automatically assigns containers their own IP addresses, and typically a single DNS name for a set of containers performing a logical operation, with load balancing across them.

**3) Storage Orchestration**

With Kubernetes, you can automatically mount the storage system of your choice. You can choose either local storage or a public cloud such as GCP or AWS.

**4) Self Healing**

Self-healing means that whenever Kubernetes realizes one of your containers has failed, it restarts that container on its own, creating a new container in place of the one that is inoperative.

In case a node itself fails, the containers that were running on the failed node are initiated and started on other healthy nodes.

**5) Batch Execution**

Along with services, Kubernetes can also manage your batch and CI workloads, which is more of a DevOps role. As part of your CI workloads, Kubernetes can replace containers that fail, and it can restart them and restore the original state.

**6) Secret & Configuration Management**

This is another big feature of Kubernetes. It is the concept whereby you deploy and update your confidential or classified application configuration without rebuilding the entire image or exposing the secrets in your stack configuration.

**7) Horizontal Scaling **

You can scale your application up and down easily with a simple command. The simple command can be run on the CLI or you can easily do it on your GUI which is your Kubernetes dashboard.

**8) Automatic Roll Backs and Roll Outs**

Whenever there is an update to your application that you want to release, Kubernetes progressively rolls out these changes and updates to the application.

If the application is not compatible with the current version, then Kubernetes will roll back to the previous version.
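
To make the promised deployment demonstration concrete, here is a minimal, illustrative Deployment manifest (all names and the image are placeholders, not from the original post):

```yaml
# deployment.yaml -- a minimal, illustrative Deployment (names and image are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3            # horizontal scaling: desired number of pod copies
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: nginx:1.25   # any container image works here
          ports:
            - containerPort: 80
```

You would apply it with `kubectl apply -f deployment.yaml`, scale it with `kubectl scale deployment web-app --replicas=10`, and revert a bad update with `kubectl rollout undo deployment/web-app`.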

The first thing to note is that Kubernetes should not be compared with Docker, because the two have different sets of parameters against which you might compare them.

Docker is a containerization platform; Kubernetes, on the other hand, is a container management tool.

It means that once you containerize your application with the help of Docker or Linux containers, and you are looking to use auto-scaling functionality, that is where Kubernetes comes in.

Kubernetes is robust and reliable. When I say robust and reliable, I am referring to the fact that a cluster created by Kubernetes is very strong and is not going to break easily.

If a container fails, Kubernetes will restart it or start a new container; in the same way, if a node fails, the containers that were running on that node will start running on a different node.

All of us know Pokemon Go very well, the famous game that was declared the best game of the year 2017. A main reason it could handle that success is Kubernetes.

Let me explain in detail.

Pokemon Go is an augmented reality game developed by Niantic for Android and iOS devices.

So let's look at the backend architecture of Pokemon Go.

They had a container with two primary components: Google Bigtable, the database where all the data is stored, and the programs running on the Java cloud. These two things run the game.

MapReduce and Cloud Dataflow are used for scaling up. So it is not just the containers scaling up; it is also about how, at the application level, the program reacts when there is an increased number of users and how it handles the increased number of requests.

Pokemon Go had capacity headroom of up to 5 times, so technically they could serve 5x the expected number of requests, but under failure conditions or heavy traffic load they could not go beyond that, because the servers would start crashing.

What happened when Pokemon Go launched in just those three geographies is that the moment they deployed, usage became so high that it was not just x times the target, which was technically their failure limit.

And it was not even 5x, the servers' capability: the traffic they got was up to 50 times more than what they expected.

So you know that with that much traffic, you are going to be brought to your knees and your application is gone for a toss.

This is the scenario where Kubernetes came into the picture, and with it they overcame all the challenges.

Because Kubernetes can perform vertical scaling and horizontal scaling with ease.

Scaling servers and containers is the biggest problem, although any application or company can fairly easily do horizontal scaling, where you just spin up a container and set up the environment.

But vertical scaling is something very specific, and it is even more challenging.

It is even more particular with this game: the augmented reality keeps changing. Whenever a person moves around or walks somewhere in their apartment or on the road, the RAM required would have to increase.

The in-memory and storage requirements all increase, so in real time the server capacity also has to grow vertically; once they deployed, it was not just horizontal scalability anymore.

Kubernetes solved all these problems effortlessly, and Niantic was also surprised that Kubernetes could do it.

Thanks to Kubernetes, Niantic was able to handle 50x the expected traffic.

Hope the above information in this blog helps you understand what Kubernetes is.

In the next blog, we will see the Architecture, Installation, and Hands-on on Kubernetes.

The post Beginner Guide For Kubernetes : A Deep Introduction appeared first on AcadGild.

The post Introduction to Data Structures appeared first on AcadGild.

This defines the relationship between data and the operations that can be performed on the data.

In this tutorial, we will be discussing Data Structures and Algorithms. As we know Data Structure is an essential and important topic to learn, therefore we have tried to keep this article as basic as we could.

So let us dive into it in detail.

The main idea behind data structures is to:

- *reduce the memory usage of a data structure operation, i.e., the space complexity*
- *reduce the execution time of operations on a data structure, i.e., the time complexity*

*Data structures and algorithms, which we will see later in this blog, are concepts independent of language. Therefore, once we master them in any one language, it is relatively easy to switch to another, though the built-in methods and APIs differ between languages.*

*In other words, data structures and algorithms as concepts are the same across languages; the implementation, however, varies greatly.*

For example, a common use of a data structure is a dictionary, where all the words are arranged in alphabetical order, that is, in a sorted manner, which makes searching for a word to find its meaning very easy.

We have two types of Data Structures:

- Linear Data Structure
- Non-Linear Data Structure

**Linear Data Structure**

A data structure is said to be linear if the elements are arranged in a linear and sequential order.

Data items can be traversed in a single run. The implementation of this type of Data structure is easy.

Common Examples of Linear Data Structure are:

- Arrays
- Queues
- Stacks
- Linked lists

**Non-Linear Data Structure**

A data structure is said to be non-linear if the elements are arranged in a non – sequential order.

Data cannot be traversed in a single run and also the implementation is difficult.

Examples of Non-linear Data Structure are:

- Trees
- Graphs

**Why do we need Data Structure?**

- Allows easier processing of data.
- Data Structures are important for designing efficient algorithms.
- Secure way of storing data.
- We can access data anytime and anywhere.

**Execution Time Cases**

There are three cases which are used to compare Data Structure’s execution time:

**Worst Case:** The scenario where a specific data structure operation takes the maximum time it can take.

**Best Case:**The scenario that takes the least execution time to perform a Data Structure’s operation.

**Average Case:**This is the scenario that depicts the average execution time of the operation of a Data Structure.

**Algorithm**

An important aspect of Data Structures is Algorithms, as Data Structures are implemented using Algorithms. It is often said that Data Structures + Algorithms = Programs.

An algorithm is a step-by-step procedure, which defines a set of instructions to be executed in a certain order to get the desired output.

Algorithms can be implemented in more than one programming language.

A data structure is a systematic way of organizing and accessing data, and an algorithm is a step-by-step procedure for performing some task in a finite amount of time.

**Operations performed on Data Structure**

There are various operations performed on Data Structures. Few of them are listed as:

- **Sorting:** arranging the data elements of a data structure in a specific order
- **Searching:** looking for a particular element in a data structure
- **Merging:** combining the elements of two similar data structures to form a new data structure of the same type
- **Insertion:** inserting or adding a new element to a data structure
- **Deletion:** removing an element from a data structure, if present
- **Traversal:** processing all the elements present in a data structure

**Asymptotic Notations**

Before writing a program, we create a blueprint in the form of an algorithm, which serves as a flowchart depicting the steps by which we can implement the program.

There can be various approaches to solving a particular problem; we call these approaches algorithms.

Among these approaches, we choose the best algorithm based on its time and space complexity.

To represent these complexities Asymptotic Notations are used. These are expressions that allow us to analyze an algorithm’s time and space complexities by identifying its behavior as the input size(n) for the algorithm increases. This is also known as Algorithm’s Growth Rate.

Using Asymptotic Notation we can very well conclude the Best Case, Average Case and Worst-Case scenario of an algorithm.

**Types of Asymptotic Notations**

There are commonly three types of Asymptotic Notation used for calculating the running time complexity of an algorithm.

1. **Big Oh (O) Notation →** it is the asymptotic notation for the **worst case** for a given expression and is represented by **O**. It gives us an asymptotic **upper bound** for the growth rate of the runtime of an algorithm.

It measures the maximum amount of time an algorithm can take to complete its operation.

Common Big O notations:

- **O(1):** represents the complexity of an algorithm whose execution time is constant, i.e., the same regardless of the size of the input data.
- **O(n):** represents the complexity of an algorithm whose running time is directly proportional to (grows linearly with) the size of the input data.
- **O(n^2):** represents the complexity of an algorithm whose running time is directly proportional to the square of the size of the input data.
- **O(log n):** means the running time grows linearly while the input size ‘n’ grows exponentially; equivalently, the running time grows logarithmically with the input size.

2. **Omega Notation →** it is the asymptotic notation for the **best case** for a given expression and is represented by the symbol **Ω**. It expresses the **lower bound** of an algorithm’s running time.

In other words, if we represent the time complexity of an algorithm in the form of **Ω**, it means that the algorithm will take at least this much time to complete its execution; it can take more than this too.

3. **Theta Notation →** it is the asymptotic notation represented by the symbol **Θ**. This is used to denote an asymptotically tight bound, that is, both the upper and the lower bound of an algorithm’s running time.

The theta notation **Θ** represents the average running time that lies between the best and the worst cases.
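As a rough illustration of these growth rates, the hypothetical step-count functions below (the names and counts are my own, not from this post) show how the number of basic operations changes when the input size n doubles:

```python
import math

def steps_constant(n):
    return 1                          # O(1): independent of n

def steps_linear(n):
    return n                          # O(n): proportional to n

def steps_quadratic(n):
    return n * n                      # O(n^2): proportional to n squared

def steps_logarithmic(n):
    return math.ceil(math.log2(n))    # O(log n): grows very slowly

for n in (1024, 2048):
    print(n, steps_constant(n), steps_linear(n),
          steps_quadratic(n), steps_logarithmic(n))
# Doubling n leaves O(1) unchanged, doubles O(n),
# quadruples O(n^2), and adds just 1 to O(log n).
```

Comparing the two printed rows makes the asymptotic behavior concrete: the constant curve stays flat, while the quadratic curve quickly dominates the others.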

**Sorting techniques**

The sorting operation mentioned above can be done in many ways. These various sorting techniques are analyzed for the best, worst and the average.

As we know sorting refers to arranging data in a particular format.

These sorting techniques are:

**Bubble sort:** this is the simplest of sorting techniques. It is a comparison-based algorithm in which, for a given array, each element is compared with the adjacent element, and the two are swapped if they are not in order.

Bubble sort has **worst-case** complexity **O(n^2)** and **best-case** complexity **Ω(n)**, where n is the number of elements being sorted.

Let us see the working of the Bubble sort:

Bubble sorting starts from the first element, at the 0th index, compares it with the adjacent element to check which one is greater, and swaps them to keep the array in ascending order; the process then continues with the next pair.

Let’s say we have an array with 5 elements

15 | 6 | 22 | 90 | 10 |

In this case, sorting starts with the first two elements; they are compared, and since 15 is greater than 6, the two elements get swapped.

6 | 15 | 22 | 90 | 10 |

Now the second element is compared with the third; since they are already in sorted order, no swapping is done.

6 | 15 | 22 | 90 | 10 |

Next, we compare the next two values, that is, 22 and 90, since they are in sorted order, no swapping is done.

6 | 15 | 22 | 90 | 10 |

Then we move on to the last two elements, compare the two values and swap if needed.

After the first iteration the sorting looks like this:

6 | 15 | 22 | 10 | 90 |

After the second iteration it looks like this:

6 | 15 | 10 | 22 | 90 |

After the third iteration it looks like this:

6 | 10 | 15 | 22 | 90 |

When a complete pass requires no swapping, the array is fully sorted and bubble sorting is done.
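The pass-by-pass walkthrough above can be written out in code. Here is a minimal Python sketch (the implementation and the early-exit check are mine, not from the post):

```python
def bubble_sort(arr):
    """Repeatedly compare adjacent elements and swap them if out of order.
    Stops early once a full pass makes no swaps (the array is sorted)."""
    arr = list(arr)                      # work on a copy
    n = len(arr)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):       # last i elements are already in place
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:                  # no swaps on this pass: we are done
            break
    return arr

print(bubble_sort([15, 6, 22, 90, 10]))  # [6, 10, 15, 22, 90]
```

Running it on the five-element example from the walkthrough reproduces the same final order.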

**Selection sort:** in this sorting algorithm, the first element of the unsorted part of the array is compared with all the other unsorted elements to find the smallest value; that smallest element is then swapped with the first element, so after each pass the minimum of the remaining elements moves to the front.

Selection sort has **worst-case** complexity **O(n^2)**.

Let us see the working of selection sort:

Taking the same array of elements as above

15 | 6 | 22 | 90 | 10 |

We select the first element, 15, and compare it with the rest of the elements. Here we find 6 to be the lowest value, and hence the two elements are swapped.

6 | 15 | 22 | 90 | 10 |

After the first iteration, the minimum value in the array, 6, is in the first position.

6 | 15 | 22 | 90 | 10 |

For the second iteration, the second element, 15, is selected and compared with the rest of the elements in the list; 10 is found to be the least, and hence the two get swapped. The array now looks like this:

6 | 10 | 22 | 90 | 15 |

The same process continues and after the third iteration the array looks like this

6 | 10 | 15 | 90 | 22 |

After the fourth iteration, the array looks like this and is finally in the sorted order.

6 | 10 | 15 | 22 | 90 |
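A minimal Python sketch of the selection-sort procedure described above (the implementation details are my own, not from the post):

```python
def selection_sort(arr):
    """Find the minimum of the unsorted remainder and swap it into place."""
    arr = list(arr)                      # work on a copy
    n = len(arr)
    for i in range(n - 1):
        min_idx = i
        for j in range(i + 1, n):        # scan the unsorted remainder
            if arr[j] < arr[min_idx]:
                min_idx = j
        arr[i], arr[min_idx] = arr[min_idx], arr[i]
    return arr

print(selection_sort([15, 6, 22, 90, 10]))  # [6, 10, 15, 22, 90]
```

Tracing this on the example array produces exactly the intermediate states shown in the walkthrough.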

**Insertion sort:** it is another comparison-based sorting algorithm. Insertion sorting starts from index 1; the element at this index is called the key. The key element is compared with the elements to its left.

If the key element is less than the element on its left, we swap the two; if the key element is greater than the element on its left, we leave it as it is.

Then we make the element at index 2 the key, compare it with the elements on its left, and swap accordingly until it is in place, repeating the same procedure until the array is sorted.

Insertion sort has **worst-case** complexity **O(n^2)**.

Let us take an array with 4 elements as below:

7 | 4 | 5 | 2 |

We make the element at index 1, that is, 4, the key and compare it with the value on its left. Since 4 is less than 7, the two get swapped.

4 | 7 | 5 | 2 |

The element at index 2, that is, 5, is the key now. Comparing it with the values on its left: 7 is greater than 5 and 4 is less than 5, so 4 stays in place and 5 moves into 7’s position.

4 | 5 | 7 | 2 |

The last element, 2, is the key now. Comparing it with all the elements on its left, we find that they are all greater than 2, so they all move one position forward and 2 is shifted to the first position.

2 | 4 | 5 | 7 |

This gives us the final sorted array.
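The walkthrough above can be sketched in Python. Note that the idiomatic implementation below shifts larger elements to the right rather than swapping pairwise, which is the usual form of the same idea (the code is mine, not from the post):

```python
def insertion_sort(arr):
    """Take each element (the 'key') and shift larger elements on its
    left one position right until the key is in its sorted place."""
    arr = list(arr)                      # work on a copy
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:   # shift larger elements right
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key                 # drop the key into the gap
    return arr

print(insertion_sort([7, 4, 5, 2]))  # [2, 4, 5, 7]
```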

**Merge sort:** this sorting technique is based on the divide-and-conquer approach. In this algorithm, sorting is done by first dividing the array into equal halves and then combining them back in a sorted manner.

Merge sort has **worst-case** complexity **O(n log(n))** and **best-case** complexity **Ω(n log(n))**.

Let us see the working of this algorithm.

We take an unsorted array of length 8 as follows:

77 | 41 | 65 | 33 | 22 | 10 | 4 | 60 |

We will divide the array into two equal halves, which will make two arrays with 4 elements each.

77 | 41 | 65 | 33 |

22 | 10 | 4 | 60 |

Now we will further divide the two arrays into their respective halves

77 | 41 |

65 | 33 |

22 | 10 |

4 | 60 |

We will further divide it until it can no more be divided and we get the atomic values

77 |

41 |

65 |

33 |

22 |

10 |

4 |

60 |

Since they are completely broken down into atomic values, we will now combine them in exactly the same manner as they were divided

While combining, we first compare the two elements being combined and sort them. Therefore, while combining 77 and 41, since 77 is greater, we swap the two elements.

Likewise, while combining 65 and 33, since 65 is greater, we swap the two. Similarly, 22 and 10 are swapped, whereas 4 and 60 remain as they are. The combined arrays look as below.

41 | 77 |

33 | 65 |

10 | 22 |

4 | 60 |

Now in the second iteration, we will compare lists of two data values, and merge them into an array of 4 each

33 | 41 | 65 | 77 |

4 | 10 | 22 | 60 |

And after final merging, the list looks like this

4 | 10 | 22 | 33 | 41 | 60 | 65 | 77 |
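The split-and-merge steps above can be sketched recursively in Python (a hypothetical implementation in my own words, not the post's code):

```python
def merge_sort(arr):
    """Split the list into halves, sort each half recursively,
    then merge the two sorted halves."""
    if len(arr) <= 1:                         # atomic value: already sorted
        return list(arr)
    mid = len(arr) // 2
    left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # pick the smaller head element
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])                   # append whichever half remains
    merged.extend(right[j:])
    return merged

print(merge_sort([77, 41, 65, 33, 22, 10, 4, 60]))
# [4, 10, 22, 33, 41, 60, 65, 77]
```

The recursion mirrors the diagrams: the array is split down to single elements, and the merge loop recombines them in sorted order.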

**Quicksort:** it is a recursive sorting method that keeps calling itself. It is based on the divide-and-conquer approach and is efficient for large data sets.

In this, an element is selected as the pivot value, and two subarrays are created: one for values less than the pivot and the other for values larger than the pivot.

The main question that arises here is which element to pick as the pivot. The pivot is generally chosen based on some logic and the implementation. There are four common possibilities:

- Selecting first element of the list as the pivot
- Selecting last element of the list as the pivot
- Selecting a random element as pivot
- Selecting median as the pivot

Here, however, we choose the last element of the list as the pivot.

Quicksort has **worst-case** complexity **O(n^2)**.

Let us see the working of the same.

We have an array with elements as follows.

29 | 60 | 44 | 19 | 30 | 59 | 57 |

Here we will take the **last element (57) as the pivot**. Now we have to partition the array in such a way that the pivot comes to its correct position in the sorted array, all the elements smaller than the pivot come to its left, and all the elements larger than the pivot come to its right. This step is known as partitioning. We continue doing this until we get the sorted list.

We will take two variables, **i and j**: i will be used to calculate the final position of the pivot, and j is used to iterate through the array. **Low** is the starting index and **high** is the ending index (the index of the pivot).

Initially, **i and j are initialized to low-1 and low** respectively.

It will start by comparing each value of j with the pivot. j is in the range from low to high.

If arr[j] <= pivot

i++

swap(arr[i], arr[j])

Every time the value at index j is less than or equal to the pivot, i is incremented by 1 and the value at index i is swapped with the value at index j. Note that no swapping is needed if i = j.

These steps are repeated as long as j <= high-1. Then the control comes out of the loop and the pivot is swapped from arr[high] into arr[i+1].

In the array we have taken to perform quicksort on, the pivot value is arr[high] = 57.

i = -1 and j = 0, 29 <= 57, i = 0, no swap (since i = j)

i = 0 and j = 1, 60 > 57, no swap

i = 0 and j = 2, 44<=57, i = 1, swap 60<>44

29 | 44 | 60 | 19 | 30 | 59 | 57 |

i = 1 and j = 3, 19<=57, i = 2, swap 60 <> 19

29 | 44 | 19 | 60 | 30 | 59 | 57 |

i = 2, j = 4, 30<=57, i = 3, swap 60<>30

29 | 44 | 19 | 30 | 60 | 59 | 57 |

i = 3, j = 5, 59>57, no swap

We come out of the loop because j is now equal to high.

Finally, we place pivot at the correct position by swapping

arr[i+1] and arr[high]

Since i = 3, we swap the pivot value 57 into position i+1, that is, index 4.

29 | 44 | 19 | 30 | 57 | 59 | 60 |

Now the pivot value 57 is at its correct position: all smaller values have been moved to its left and all larger values are placed to its right.

But we see the array is still not sorted. Hence we will again split the array into two parts. That is:

29 | 44 | 19 | 30 |

And

59 | 60 |

We will again perform partitioning on these two halves until we get the sorted list.

Finally, the sorted array would be:

19 | 29 | 30 | 44 | 57 | 59 | 60 |
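The partitioning scheme walked through above (last element as pivot, with trackers i and j) can be sketched in Python as follows (a hypothetical implementation, not the post's code):

```python
def quicksort(arr, low=0, high=None):
    """Partition around the last element (the pivot), then recursively
    sort the sublists on each side of the pivot's final position."""
    if high is None:
        high = len(arr) - 1
    if low < high:
        pivot = arr[high]
        i = low - 1                       # tracks the pivot's final position
        for j in range(low, high):        # j iterates through the array
            if arr[j] <= pivot:
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
        arr[i + 1], arr[high] = arr[high], arr[i + 1]  # place the pivot
        quicksort(arr, low, i)            # left side of the pivot
        quicksort(arr, i + 2, high)       # right side of the pivot
    return arr

print(quicksort([29, 60, 44, 19, 30, 59, 57]))
# [19, 29, 30, 44, 57, 59, 60]
```

Tracing the first partition on this array reproduces the i/j steps listed above, with 57 landing at index 4.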

**The Quicksort pictorial view:**

**Searching Techniques:** The searching operation mentioned above can be done in many ways. The two common ways of searching are:

**Linear Search:** this is the simplest method of searching. In this technique, the element to be found is searched for sequentially in the list. This operation can be performed on both sorted and unsorted lists.

Searching is done from the 0th index and is traversed through the complete list until the element is found or the end of the list is reached.

Linear search has **worst-case** complexity **O(n)** and **best-case** complexity **O(1)**.
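A minimal Python sketch of the sequential scan just described (the function name is my own):

```python
def linear_search(items, target):
    """Scan from index 0 until the target is found; return its index,
    or -1 if the end of the list is reached."""
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1

print(linear_search([15, 6, 22, 90, 10], 22))  # 2
print(linear_search([15, 6, 22, 90, 10], 7))   # -1
```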

**Binary Search:** binary search is a much faster search algorithm. It works on the principle of divide and conquer and requires the list to be in sorted order.

Given a sorted array of n elements, to search for an element we compare it with the middle element of the list. If it matches the middle element, the search is successful; otherwise, the list is divided into two halves: one from the first element up to the middle element, and the other from the middle element to the last element.

The searching mechanism proceeds from either of the two halves, if the value of the search key is less than the item in the middle of the interval then it checks on the first half of the division otherwise in the second half of the division.

Binary search has **worst-case** complexity **O(log n)** and **best-case** complexity **O(1)**.
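The halving procedure described above can be sketched in Python (an illustrative implementation, not from the post):

```python
def binary_search(sorted_items, target):
    """Repeatedly compare the target with the middle element and
    discard the half that cannot contain it. Requires a sorted list."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        elif target < sorted_items[mid]:
            high = mid - 1                # search the left half
        else:
            low = mid + 1                 # search the right half
    return -1

print(binary_search([4, 10, 22, 33, 41, 60, 65, 77], 41))  # 4
print(binary_search([4, 10, 22, 33, 41, 60, 65, 77], 50))  # -1
```

Each comparison halves the remaining search interval, which is where the O(log n) worst case comes from.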

This brings us to the end of this blog. So far we have read about Data Structures, Algorithms, and sorting and searching techniques. A lot more is still left to cover, like the types of Data Structures and their implementation. So stay tuned for our next blog.

Do leave us a comment for any query or suggestions.

Keep visiting our website for more blogs on Data Science and Data analytics.

Happy Learning

The post Introduction to Data Structures appeared first on AcadGild.

]]>The post Analyzing USArrest dataset using K-means Clustering appeared first on AcadGild.

]]>The K-means clustering algorithm is an iterative algorithm that tries to partition the dataset into distinct, non-overlapping clusters, where each data point belongs to only one group.

It assigns the data points to clusters such that the Euclidean distance between each data point and its cluster’s centroid is at a minimum.
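The blog's analysis is done in R; purely to illustrate the assignment rule just described, here is a minimal Python sketch of one K-means assignment step (all names here are my own):

```python
import math

def assign_to_clusters(points, centroids):
    """One K-means assignment step: each point goes to the centroid
    with the smallest Euclidean distance."""
    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [min(range(len(centroids)),
                key=lambda k: euclidean(p, centroids[k]))
            for p in points]

# Four 2-D points and two centroids (made-up illustrative values).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
centroids = [(1.0, 1.0), (5.0, 5.0)]
print(assign_to_clusters(points, centroids))  # [0, 0, 1, 1]
```

A full K-means run alternates this assignment step with recomputing each centroid as the mean of its assigned points, until the assignments stop changing.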

This is a systematic approach for identifying and analyzing patterns and trends in crime using the USArrest dataset. The model we build in this blog can predict regions that have a high probability of crime occurrence and can visualize crime-prone areas.

This dataset contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. The percentage of the population living in urban areas is also given. The aim is to see if there is any dependency between a state and its arrest history.

Fetching the working directory

Loading the dataset with data("USArrests"). This dataset is built into R; you can load it directly and see the first few records using the **head()** function.

Getting the structure of the dataset using the ** str() **function.

Summarizing the dataset using the ** summary() **function.

We can see there is no null value present in the dataset.

We will now check for the correlation between all the variables by using the ** corrplot() **function.

It gives the following output

We can observe from the above result screenshot that the 3 crime variables are correlated with each other, that is, Assault-Murder, Rape-Assault and Rape-Murder.

Displaying the first few columns of the dataset after scaling it.

We can see that the data points have been standardized, that is, scaled. Scaling is done to make the variables comparable.

Standardizing consists of transforming the variables such that they have zero mean and standard deviation as 1.
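The blog performs this step in R; as an illustration of the same z-score transformation, here is a minimal Python sketch (function and variable names are my own):

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean and divide by the sample
    standard deviation, so the result has mean 0 and SD 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)         # sample SD, as R's scale() uses
    return [(v - mean) / sd for v in values]

scaled = standardize([10.0, 20.0, 30.0, 40.0])
print(round(statistics.mean(scaled), 10))   # 0.0
print(round(statistics.stdev(scaled), 10))  # 1.0
```

After this transformation, variables measured on very different scales (e.g. murder rates vs. urban population percentages) contribute comparably to distance calculations.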

Now we will load two of the libraries, that is, cluster and factoextra that are the required R packages.

cluster is for computing clustering algorithms and factoextra for ggplot2-based elegant visualization of clustering results.

We’ll use only a subset of the data by taking 10 random rows among the 50 rows in the data set.

We will now compute the Euclidean distance by using the ** dist()** function.

To make it easier to see the distance information generated by the *dist*() function, we are reformatting the distance vector into a matrix using the ** as.matrix()** function.

As we can see, the Euclidean distances are placed in a matrix; only 4 states are shown, with distances rounded to 1 decimal place.

We have used fviz_dist() from the factoextra package to visualize the distance matrices.

It shows the following output.

In the above graph, the red color shows the smallest distances and the blue color shows the maximum distances.

Now we are defining clusters such that the total intra-cluster variation (total within-cluster sum of squares) is minimized.

Similar to the elbow method, there is a function ** fviz_nbclust()** that is used to visualize and determine the optimal number of clusters.

From the above results we learn that 4 is the optimal number of clusters, so we can perform the final analysis and extract the results using these 4 clusters.

The output of kmeans returns a list of components. The most important ones are listed below:

- cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
- centers: A matrix of cluster centers.
- totss: The total sum of squares.
- withinss: Vector of within-cluster sum of squares, one component per cluster.
- tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
- size: The number of points in each cluster.

These components can be accessed as follows

**Hence we have computed the optimal number of clusters and visualized K-means clustering.**

Hope you find this blog helpful. In case of any query or suggestions drop us a comment below.

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.*


The post Analyzing USArrest dataset using K-means Clustering appeared first on AcadGild.

]]>The post Predicting the Salary class using Logistic Regression in R appeared first on AcadGild.

]]>We have already performed Logistic Regression problem in one of our previous blogs which you can refer for better understanding:

Diabetes Prediction using Logistic Regression in R

In this blog we have used a dataset that contains individuals’ annual income, which results from various factors such as education level, age, gender, and occupation.

The dataset contains 15 columns, in which the target field is **Income**, divided into two classes: <=50K and >50K. We will explore the possibility of predicting income level based on an individual’s personal information.

The **“adult”** dataset can be found in the UCI Machine Learning Repository.

This project explores logistic regression using the UCI Adult Income dataset. We will try to predict a person’s salary class based upon the given information. This is based on an assigned project from Data Science and Machine Learning with R.

Let us begin with the coding part. You can download the dataset from the below link:

Setting up filepath

Loading the dataset and reading the first few records using the **head()** function.

The dataset is stored in a variable “adult” and shows 6 rows and 8 out of 15 columns.

Fetching the structure of the dataset using the **str()** function.

Summarizing the dataset using the **summary()** function.

As we can see, there are no null values present in our dataset.

Cross checking to see if there is a single null value present in the whole dataset.

Hence, no null value present.

We can see from the structure output that some of the columns have a large number of factors. We can clean these columns by combining similar factors, thus reducing the total number of factors.

As we have seen, there are 9 factors in this column; we will combine them into 6 groups as shown below.

We can reduce these factors into the following groups:

- Married
- Not-Married
- Never-Married

There are a lot of factors present in the *country* column; we can reduce them to their respective regions as shown in the output below.

Now we have to re-assign these altered columns to factors since we had to change them to characters:

During the data cleaning process we came across some missing values that were present in the form of ‘?’. We can convert these values to NA so that we can deal with them in a more efficient manner.

Converting ‘?’ to NA

Omitting the NA values

NA values have been omitted from the dataset.

First, we will plot a histogram of ages, colored by income.

Here the colored part is indicative of percentage. From this plot we can see that the percentage of people who make above 50K peaks out at roughly 35% between ages 30 and 50.

Next we will plot a histogram of hours worked per week by people.

From the above graph it is clear that the highest number of hours worked per week is 40.

Now we will depict the income class by the region people stay in. But first we need to change the name of the country column to region.

It shows the following output.

From the above output it is clear that people from North America have the highest income counts: around 11,000 people earn more than 50K and around 30,000 earn less than or equal to 50K.

The purpose of this model is to classify people into two groups, below 50k or above 50k in income. We will build the model using training data, and then predict the salary class using the test data.

**Splitting the data into Train and Test**

We will split the dataset into training data and test data in 80% and 20% respectively, using the caTools.

While training our model we have used the **glm()** function, which tells R to run a generalized linear model. **‘income ~ .’** means that we want to model *income* using every available feature. **family = binomial()** is used because we are predicting a binary outcome: below 50K or above 50K.

Making predictions on the Trained data, by applying ROC and AUC curve, as shown below.

we get the output as:

The above graph shows that the accuracy we got from the Train data is 90%

Making predictions on the Test data as shown below.

We are now converting probabilities to values as shown below

Here we have made predictions on the test data using our logistic regression model. We specified *type = “response”* above to get predicted probabilities instead of probabilities on the logit scale. The accuracy here turns out to be 85%.

Applying ROC and AUC Curve on the Test data.

It shows the below output.

We get the accuracy from the Test data to be 90%.

We will now compare our results using a confusion matrix.

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

The most basic terms used in this matrix are:

- true positives (TP): These are cases in which we predicted yes and the actual result is also true.
- true negatives (TN): We predicted no, and the actual result is also false.
- false positives (FP): We predicted yes, but the actual result is false. (Also known as a “Type I error.”)
- false negatives (FN): We predicted no, but the actual result is true. (Also known as a “Type II error.”)

Since our predictions are predicted probabilities, we specify that probabilities at or above 50% will be TRUE (above 50K) and anything below 50% will be FALSE (below 50K).
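The blog computes this in R; purely to illustrate how the four counts and the 50% threshold combine, here is a small Python sketch with made-up probabilities (all names and numbers are illustrative):

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (True = income > 50K)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

probs = [0.9, 0.2, 0.6, 0.4, 0.8]         # hypothetical model probabilities
actual = [True, False, False, False, True]
predicted = [p >= 0.5 for p in probs]     # threshold at 50%
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)                     # 2 2 1 0
print((tp + tn) / len(actual))            # accuracy: 0.8
```

Accuracy is simply the fraction of correct predictions, (TP + TN) / total, which is how the reported 90% figure is derived from the matrix.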

**Hence, our logit model is 90% accurate to predict the salary class of a person based upon the given information.**

Hope you find this blog helpful. In case of any query or suggestions drop us a comment below.

Keep visiting our website for more blogs on Data Science and Data Analytics.

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn data science course in Bangalore.*

The post Predicting the Salary class using Logistic Regression in R appeared first on AcadGild.

]]>The post Diabetes Prediction using Logistic Regression in R appeared first on AcadGild.

]]>The dataset used in this blog is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The datasets consist of several medical predictor variables and one target variable, that is the outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

We will build a machine learning model to accurately predict whether the patients have diabetes or not.

Before moving further, we should first understand what is Logistic Regression and why we use it.

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes.

Examples of classification problems are: email spam or not spam, online transactions fraudulent or not, and a person being diabetic or not.

It is a machine learning algorithm used for classification problems; it is a predictive-analysis algorithm based on the concept of probability.

When we pass the inputs, we expect our model to give us outputs based on probability, returning a probability score between 0 and 1.
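That probability score comes from the logistic (sigmoid) function, which squashes the model's linear combination of inputs into the range (0, 1). A minimal Python sketch of the function itself (illustrative, not the blog's R code):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A large positive score maps near 1, a large negative score near 0,
# and a score of 0 maps to exactly 0.5.
print(round(sigmoid(0.0), 3))   # 0.5
print(round(sigmoid(4.0), 3))   # 0.982
print(sigmoid(-4.0) < 0.5)      # True
```

The 0.5 output at z = 0 is why 50% is the natural default threshold for turning these probabilities into class labels.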

Now, since we have a brief knowledge of Logistic Regression, let us begin with the coding part.

You can download the dataset from the link: Dataset

We will first set up the filepath representing the directory of the R process.

**getwd()** returns an absolute filepath representing the current working directory of the **R** process.

Loading the required library packages.

Loading the dataset.

The **head()** function is used to return the first few records of the dataset.

We will now check if any null values are present in the dataset.

We can see from the above result that there is no null value present in the dataset.

Summarizing the dataset using the **summary()** function.

We will find the structure of the dataset using the **str()** function.

We can see from the above result that there are 9 columns present in the dataset. The variables *Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age* are responsible for the variable *Outcome, *that states whether a person has diabetes or not. Where 1 says ‘Yes’ and 0 says ‘No’.

We will now check for the range of people with respect to their age.

We have made use of the **factor()** function, which is used to represent categorical data. It can be ordered or unordered.

From the above result, we can see that the counts are highest for the ranges 21 to 30 and 31 to 40; that is, people between the ages of 21 and 30 are the most numerous, at 417.

Visualizing the above range of ages with the help of a **histogram**.

The above code will show the following output.

Visualizing the same with the help of **Barplot** for a better understanding of the dataset.

It will show the following output:

Plotting age category against BMI with the help of a **boxplot**.

It gives the following output:

The age group 21 to 30 has the maximum outliers, which are shown with red dots.

Plotting a correlation matrix against all the variables present in the dataset.

It is inferred that no strong correlation exists between the variables.

Plotting it using the **corrplot()** function, which gives a graphical representation of the above correlation matrix.

It shows the following graph.

The above graph shows that there is **no strong correlation** observed between the variables, so we can proceed with further analysis without dropping any columns.

We will now install caTools, which contains several basic utility functions including moving-window statistics, read/write for GIF and ENVI binary files, fast calculation of AUC, a LogitBoost classifier, etc. It is used here to split our data into train and test sets.

Splitting the dataset into Train and Test data into 80% and 20% respectively.

Calculating the total number of rows

Total number of Train data rows

Total number of Test data rows

Fitting model using all the independent variables.

Here we have fitted our model based on **Train** data.

The AIC here is an estimator of the relative quality of statistical models for a given dataset. AIC estimates the quality of each model. Thus, AIC provides a means for model selection. A good model is the one that has minimum AIC among all the other models.

Now we will carry out operation to find the average prediction for each of the two outcomes(0 and 1) against all other variables of the dataset.

The ROC (Receiver Operating Characteristic) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome. It shows the performance of a classification model at all classification thresholds.

This curve plots two parameters:

- True Positive Rate
- False Positive Rate

AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve. It is used in classification analysis in order to determine which of the used models predicts the classes best.

Generating ROC curve on train data.

Computing the AUC value

It gives the below graph

From the above graph, it is inferred that we get an accuracy (AUC) of **84%** on our **Train data**.

Making predictions on our Test Data

We see that the above output gives us an accuracy rate of 74%. Let's improve the performance of the model.

We get the following output

*From the above graph, it is inferred that we get an accuracy rate of 82% on our Test data. Hence, the model is 82% accurate in predicting whether a person is diabetic or not.*
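For reference, the accuracy figures above are simply the fraction of test predictions that match the true labels once each predicted probability is thresholded. A minimal sketch, with hypothetical probabilities and labels:

```python
def accuracy(probs, labels, threshold=0.5):
    """Threshold each predicted probability, then count correct predictions."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# 3 of these 4 hypothetical predictions match the true labels -> 0.75
print(accuracy([0.9, 0.3, 0.6, 0.2], [1, 0, 0, 0]))  # 0.75
```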

This brings us to the end of this blog. We hope you found this article helpful. For any queries or suggestions, do drop a comment below.

Keep visiting our website for more blogs on Data Science and Data Analytics.

Related reads:

- Linear Model Building Using the Airquality Data Set with R: https://acadgild.com/blog/linear-model-building
- Premium Insurance Policyholders Using Linear Regression with R: https://acadgild.com/blog/premium-insurance-policyholders-using-linear-regression-with-r


The post Diabetes Prediction using Logistic Regression in R appeared first on AcadGild.

The post Beginner Guide For Git & GitHub: Installation, and Commands appeared first on AcadGild.

The topics covered in this post:

- **Version Control – What & Why**
- **Version Control Tools**
- **GitHub & Git**
- **Case Study: Dominion Enterprises**
- **Git Features**
- **Git Operations & Commands**

So let us begin with our first topic.

You can think of version control as a management system that tracks the changes you make to your projects: documents, computer programs, large websites, and other collections of information.

The changes might include adding new files or modifying existing files by changing their source code.

So what does a version control system do? Every time you make a change in your project, it creates a snapshot of your entire project and saves it. These snapshots are known as versions. If the word "snapshot" troubles you, just consider it the entire state of your project at a particular point in time.

Let's look at an example to understand version control better.

Let's say I am developing my own website. At the beginning, I have only one webpage, called index.html, and after a few days I add another webpage, called about.html.

Now I make some changes in about.html, such as adding pictures and text, so the version control system detects that some new features have been added and a few features have been re-edited.

A few days later, I change the entire layout of the about.html page. Again the version control system detects that an update has happened, takes a snapshot, and stores both the old and the new snapshots.

As per the above example, we now have three versions of the website we are working with. This is how a version control system stores versions.

Below are the reasons why we need version control systems.

**Collaboration:**

Let's have a look at the image below, in which we have taken one example.

There are three developers working on a particular project, and each of them is working in isolation.

Conflicts arise when these three developers work on the same folder or the same file.

Say developer one makes some changes, call them Xyz; at the same time, developer two makes a few changes, say Abc. At the end, when they try to collaborate and merge all of the work together, the project ends up with conflicts.

You might not know which developer made which changes. A version control system, however, provides you with a shared workspace and continuously tells you who made what kind of change, so you always get notified when someone changes something in your project.

**Storing Versions**

This is one of the most important reasons why we need a version control system: saving versions of your project after you have made changes is essential, and without a version control system it can get confusing.

**Backup**

A version control system also provides us with a backup.

Now just look at the diagram below.

The diagram which is shown above is nothing but the typical layout of the version control system.

There is one central server where all the files are located; apart from that, each developer has a local copy of all the files present on the central server.

Every time developers start coding, they fetch all the files from the central server and store them on their local machines; after they are done with the work, they transfer the files back to the central server.

If your central server crashes for some reason, you don't have to worry, because every developer has a local copy.

**Analyze**

A version control system also helps us analyze our project: when you finish a project, you want to know how it actually evolved, so that you can analyze it and make it better.

**Version Control Tools**

There are four popular version control systems available in the current market:

1. Git
2. Apache Subversion
3. Concurrent Versions System (CVS)
4. Mercurial

As you can see from the image above, Git is by far the most popular nowadays. We have taken this report from Google Trends.

Let's have a quick look at the image below, which shows what exactly Git and GitHub are.

GitHub is going to be my central repository; Git, on the other hand, is going to allow me to create my local repository.

People might get confused between Git and GitHub and think they are the same thing, perhaps because of the names, but they are actually very different.

Git is a version control tool that allows you to perform operations such as fetching data from the central repository server and pushing your files back to it.

GitHub, on the other hand, is a company that hosts your central repository on a remote server.

To put it in simple words, consider GitHub a social network very similar to Facebook; the only difference is that GitHub is a social network for developers.

Dominion Enterprises is a leading marketing services and publishing company that works across several industries and has more than one hundred offices worldwide.

They have distributed technical teams that develop a range of websites, including the popular homes.com. Altogether, the Dominion Enterprises websites get tens of millions of unique visitors every month.

Each of the websites they were working on had a separate development team; all of the developers worked independently, and each team had its own goals and projects, but they wanted to share resources.

They wanted everyone to see what each of the teams was working on; basically, they wanted transparency.

**Solution**

They needed a platform flexible enough to share code securely, and for that they adopted the GitHub platform.

The Dominion Enterprises CEO, Joe Fuller, said that GitHub Enterprise has allowed them to store the company's source code in a central, corporately controlled system.

Having all of their code in one place makes it easier for them to collaborate on projects.

Git is a distributed version control tool that supports distributed non-linear workflows by providing data assurance for developing quality software.

Here are some popular features of Git:

- Distributed
- Compatible
- Non-linear
- Branching
- Lightweight
- Speed
- Open-source
- Reliable
- Secure
- Economical

**Distributed**

This feature allows distributed development of code: every developer has a local copy of the entire development history, and changes are copied from one repository to another.

**Compatible**

Git is compatible with existing systems and protocols; SVN and SVK repositories can be accessed directly using git-svn.

**Non-linear**

Git supports non-linear development of software and includes various techniques to navigate and visualize a non-linear development history.

**Branching**

This is a feature where Git stands apart from nearly every other version control system, because of how lightweight and central its branching model is.

It takes only a few seconds to create and merge branches, and the master branch always contains production-quality code.

**Lightweight**

Git uses a lossless compression technique to compress data on the client side.

**Speed**

Fetching data from a local repository is about 100 times faster than from a remote repository, and Git is an order of magnitude faster than other version control tools.

**Open Source**

Git is an open-source tool, which means you can modify its source code according to your needs.

**Reliable**

In the event of a system crash, the lost data can easily be recovered from any of the collaborators' local repositories.

**Secure**

Git uses SHA-1 to name and identify objects. Every file and commit is checksummed, and each is retrieved by its checksum at checkout time.
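To make the checksumming concrete: Git names a file's contents (a "blob") by hashing a small header plus the raw bytes, so identical content always gets the identical object name. A minimal Python sketch of this scheme:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object name Git assigns to a file's contents (a blob)."""
    header = b"blob %d\x00" % len(content)   # Git hashes "blob <size>\0" + bytes
    return hashlib.sha1(header + content).hexdigest()

# The well-known object name of an empty file in any Git repository:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

You can cross-check this with `git hash-object` on any file; change a single byte and the name changes completely, which is what makes tampering detectable.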

**Economical**

Git is released under the GPL license and is free. All the heavy lifting is done on the client side, so a lot of expense on costly servers can be saved.

A **repository** is a directory or storage space where your projects can live. It can be local, a folder on your computer, or it can be a storage space on GitHub or another online host. You can keep code files, text files, image files, you name it, inside a repository.

**There are two types of repositories:**

- Central Repository
- Local Repository

Now let us see the steps to install and configure Git, the most popular version control software.

**Step 1:** Install Git using the following command in the terminal.

sudo apt-get install git -y

As you can see, Git is already present in Ubuntu; if it is not installed, use the above command to install Git on your system.

**Step 2:** To check the git version you can use the below command:

git --version

**Step 3:** Before using a Git repository, we have to configure the email ID and the username. Use the following commands:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Once you have configured the username and email ID, this information is stored in the gitconfig file below:

~/.gitconfig

**Step 4:** Create and Initialize a GIT repository using the below commands.

mkdir -p /home/acadgild/myproject
cd /home/acadgild/myproject/
git init

**Step 5:** Add and Commit files to GIT repository by using the below commands.

vim wordcount.py
git add .
git commit -m "First Commit"

You have successfully installed and configured Git. Now you are ready to use this open-source distributed version control system.

The post Beginner Guide For Git & GitHub: Installation, and Commands appeared first on AcadGild.
