The post Basic SQL Commands appeared first on AcadGild.

In this article, we will learn how regularly used SQL commands work, with the help of a sample data set.

Before moving further, let us understand **what a database is and why SQL is important.**

**A Database** is an organized collection of tables. Its purpose is to store related information and retrieve it from one or more tables.

Oracle Database was the first database designed for enterprise grid computing, providing a cost-effective way to manage information and applications.

SQL (Structured Query Language) is the query language used to store, manipulate, and retrieve data in relational databases such as Oracle. It was designed so that ordinary users could get the data they are interested in out of a database, and its English-like syntax means a new learner can pick it up quickly. An important feature of SQL is that most database engines support largely the same core SQL, so once you learn it, working across different relational databases feels similar.

*To understand SQL operations in detail, we will be using the Student and Department tables below.*
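The original tables were shown as images. As a stand-in, here is a minimal sketch that builds a comparable students table with Python's built-in SQLite module; all names and values here are hypothetical, not taken from the original post:

```python
import sqlite3

# Hypothetical stand-in for the blog's Student table (the original was an image).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    student_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
    email TEXT, phone_number TEXT, admission_date TEXT,
    course TEXT, course_fee INTEGER, mentor_id INTEGER, department_id INTEGER)""")
conn.executemany("INSERT INTO students VALUES (?,?,?,?,?,?,?,?,?,?)", [
    (110, "Asha", "Rao", "asha.rao@example.com", "9876500011",
     "01-JUL-19", "Data Analytics", 45789, 101, 90),
    (111, "Ravi", "Kumar", "ravi.kumar@example.com", "9876500012",
     "03-JUL-19", "Big Data", 52000, 102, 91),
])
print(conn.execute("SELECT COUNT(*) FROM students").fetchone()[0])  # 2
```

SQLite is used here only because it ships with Python; the SQL statements themselves are standard.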

First, we will learn about the DML commands.

**DML** stands for Data Manipulation Language. DML is used to work with the data in a database's tables: retrieving data, inserting new records, updating existing records, and removing unwanted rows. Collectively, these common operations are known as Data Manipulation Language.

The main use of the SELECT statement is to retrieve zero or more records from one or more tables.

**NOTE 1:-** *To display only particular columns, list the column names after SELECT:*

SELECT STUDENT_ID, FIRST_NAME, PHONE_NUMBER FROM STUDENTS;

**NOTE 2:-** *To retrieve all columns from a table, use this syntax:*

SELECT * FROM table_name;


The INSERT statement is used to add new data to a particular table. You can insert new data in several ways.

INSERT INTO table_name VALUES ('value1', 'value2', 'value3', ...);

**NOTE:-** *If you are unsure about the columns or their order, name the columns explicitly:*

INSERT INTO students

(student_id, first_name, last_name, email, phone_number, admission_date, course, course_fee, mentor_id, department_id)

VALUES ('112','Mohan','Dollas','Mohan.Dollas@gmail.com','6878984532','07-AUG-19','Data Analytics','45789','101','90');
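The same column-list form can be tried hands-on with SQLite; the table and values below are a simplified, hypothetical version of the blog's example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, first_name TEXT, course TEXT)")
# Naming the columns makes the statement independent of the table's column order.
conn.execute(
    "INSERT INTO students (student_id, first_name, course) VALUES (?, ?, ?)",
    (112, "Mohan", "Data Analytics"))
print(conn.execute("SELECT first_name FROM students").fetchone()[0])  # Mohan
```

The `?` placeholders are SQLite's parameter style; they also protect against SQL injection.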

**Output:** The new record is now inserted into the students table.

**NOTE:** Before moving on to the UPDATE and DELETE statements, we should first understand SQL clauses, because both of those statements use them.

A clause adds specific conditions to a statement; with clauses you can filter records according to your requirements. We will now look at WHERE, GROUP BY, HAVING and ORDER BY.

The WHERE clause puts a specific condition (a filter) on a statement: it extracts only those records that fulfil the condition defined in the SELECT statement.

SELECT * FROM table_name

WHERE column_name = 'value';
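A small runnable sketch of WHERE filtering, using hypothetical sample rows in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, first_name TEXT, course TEXT)")
conn.executemany("INSERT INTO students VALUES (?,?,?)", [
    (110, "Asha", "Data Analytics"), (111, "Ravi", "Big Data")])
# WHERE keeps only the rows that satisfy the condition.
rows = conn.execute(
    "SELECT * FROM students WHERE course = 'Data Analytics'").fetchall()
print(rows)  # only the matching row
```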


The GROUP BY clause is used together with aggregate functions: it retrieves data as groups of rows, one result row per group.

SELECT column_name(s)

FROM table_name

WHERE condition

GROUP BY column_name(s)
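The syntax above can be exercised with a tiny hypothetical data set, counting students per course:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (first_name TEXT, course TEXT, course_fee INTEGER)")
conn.executemany("INSERT INTO students VALUES (?,?,?)", [
    ("Asha", "Data Analytics", 45000), ("Ravi", "Data Analytics", 46000),
    ("Meena", "Big Data", 52000)])
# One output row per course, with the aggregate computed per group.
groups = conn.execute(
    "SELECT course, COUNT(*) FROM students GROUP BY course").fetchall()
print(groups)
```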


The HAVING clause is similar to WHERE, but it filters groups after aggregation, so its condition can use aggregate functions such as SUM, COUNT, AVG, MIN and MAX.

SELECT column_name(s)

FROM table_name

GROUP BY column_name(s)

HAVING condition;
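A runnable sketch of HAVING, again with hypothetical rows, keeping only courses that have more than one student:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (first_name TEXT, course TEXT)")
conn.executemany("INSERT INTO students VALUES (?,?)", [
    ("Asha", "Data Analytics"), ("Ravi", "Data Analytics"), ("Meena", "Big Data")])
# HAVING filters the groups produced by GROUP BY, using an aggregate.
popular = conn.execute("""SELECT course, COUNT(*) FROM students
                          GROUP BY course HAVING COUNT(*) > 1""").fetchall()
print(popular)  # only courses with more than one student
```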


The ORDER BY clause sorts the returned rows in ascending or descending order. Use the ORDER BY keyword to sort your records according to your objective.

SELECT column1, column2, ...

FROM table_name

ORDER BY column1, column2, ... ASC|DESC;
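Here is the sorting syntax in action on hypothetical fee data, highest fee first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (first_name TEXT, course_fee INTEGER)")
conn.executemany("INSERT INTO students VALUES (?,?)", [
    ("Asha", 45000), ("Ravi", 52000), ("Meena", 38000)])
# DESC sorts high-to-low; the default (or ASC) sorts low-to-high.
rows = conn.execute(
    "SELECT first_name, course_fee FROM students ORDER BY course_fee DESC").fetchall()
print(rows)  # highest fee first
```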


Now that we understand SQL clauses, let us move on to the remaining UPDATE and DELETE statements.

The UPDATE statement is used to modify existing records in a table. For example, to update a record with a new value, use this syntax:

UPDATE table_name

SET column1 = value1, column2 = value2, ...

WHERE condition;
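A runnable UPDATE sketch with a hypothetical row, changing one student's phone number:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, phone_number TEXT)")
conn.execute("INSERT INTO students VALUES (110, '9876500011')")
# The WHERE clause restricts the update to the intended row.
conn.execute("UPDATE students SET phone_number = '9999900000' WHERE student_id = 110")
phone = conn.execute(
    "SELECT phone_number FROM students WHERE student_id = 110").fetchone()[0]
print(phone)  # 9999900000
```

Without the WHERE clause, every row in the table would be updated.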



The DELETE statement is used to delete existing records from a table.

DELETE FROM table_name;

(This syntax deletes all the records in the table.)

*NOTE:- If you want to delete specific records, add one or more conditions to your statement. The rows that match the condition are deleted.*

DELETE FROM table_name WHERE condition;
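A conditional DELETE, sketched with hypothetical rows; only the matching row is removed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO students VALUES (?,?)", [
    (110, "Data Analytics"), (111, "Big Data")])
# Only rows matching the WHERE condition are deleted.
conn.execute("DELETE FROM students WHERE course = 'Big Data'")
remaining = conn.execute("SELECT student_id FROM students").fetchall()
print(remaining)  # the matching row is gone
```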


A **comparison** (or relational) **operator** is a symbol used to **compare** two values. Comparison operators specify a condition by comparing one expression with another value or expression. Three of the most popular are IN, NOT and BETWEEN.

The IN operator is used in the WHERE clause with a set of values: you can match against a whole list of values in one condition, and you can also pass a SELECT statement inside the **IN** operator.

SELECT column_name(s)

FROM table_name

WHERE column_name IN (value1, value2, ...);

WHERE column_name IN (SELECT STATEMENT);
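The list form of IN, tried on hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO students VALUES (?,?)", [
    (110, "Data Analytics"), (111, "Big Data"), (112, "Cloud")])
# IN matches a row if the column equals any value in the list.
rows = conn.execute(
    "SELECT student_id FROM students "
    "WHERE course IN ('Data Analytics', 'Cloud')").fetchall()
print(rows)
```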


The NOT operator is the opposite of IN: it extracts only those records that do not match the given condition or list of values.

SELECT column1,column2,..

FROM table_name

WHERE NOT condition;
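Negating a list condition with NOT IN, again on hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO students VALUES (?,?)", [
    (110, "Data Analytics"), (111, "Big Data"), (112, "Cloud")])
# NOT IN keeps every row whose value is absent from the list.
rows = conn.execute(
    "SELECT student_id FROM students WHERE course NOT IN ('Big Data')").fetchall()
print(rows)  # everything except Big Data students
```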


The BETWEEN operator is also used within the WHERE clause, but it specifies a range of values: a lower limit and an upper limit, both inclusive. You can think of it as a starting point and an ending point for the data.

SELECT column_name(s)

FROM table_name

WHERE column_name BETWEEN value1 AND value2;
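A BETWEEN range filter on hypothetical fee values; note that both endpoints are included:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (first_name TEXT, course_fee INTEGER)")
conn.executemany("INSERT INTO students VALUES (?,?)", [
    ("Asha", 45000), ("Ravi", 52000), ("Meena", 38000)])
# BETWEEN lower AND upper is inclusive on both ends.
rows = conn.execute(
    "SELECT first_name FROM students "
    "WHERE course_fee BETWEEN 40000 AND 50000").fetchall()
print(rows)
```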


We hope the examples above helped you understand DML (Data Manipulation Language) commands, SQL clauses and a few of the most popular comparison operators.

**R vs Python combat**

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn about our **data science course in Bangalore**.*


The post Top 5 most in-demand skills in 2019 appeared first on AcadGild.

Skills are what enable a person to hold a position and be productive in any field. Combining core and trending skills with smart effort will always deliver the best results.

In this era of technology, an individual with the right expertise is a highly valued asset.

In this blog, we cover the top 5 most in-demand skills to have in 2019.

Let us begin, then.

Cloud computing is one of the most important technologies in the computational world and a key part of any enterprise digital transformation strategy. It covers servers, storage, networking, databases, analytics, and intelligence, and offers faster innovation, flexible resources, and economies of scale. Many companies are adopting cloud computing because it reduces operating costs and helps them run more efficiently.

- IaaS
- PaaS
- serverless and
- SaaS

Most of the cloud computing services can be divided into four categories: software as a service (SaaS), infrastructure as a service (IaaS), platform as a service (PaaS) and serverless. These categories are called the cloud computing stack because they build on top of one another.

According to Forbes, the revenue will increase to $167B by 2020 and the external cloud adoption will increase from 22% to 32% in 2020.

IT professionals with cloud-computing skills earn an average salary of $120,900. However, to take a step into this industry, you need to pick up a few skills that will increase your chances of getting the best job in the field.

- Programming skills like Java, PHP, Perl, Python, Ruby
- Framework skills like DevOps
- Database knowledge like SQL
- Linux skills
- Knowledge of cloud platforms like AWS, Microsoft AZURE, etc.

Artificial Intelligence is the technology which changes the face of the world. Artificial Intelligence has the potential to enormously change the way that humans connect with the digital world and in the near future, its impact is expected to grow further.

In the last few years – Big data, Machine learning and the development of deep learning have brought a revolution in artificial intelligence.

Today, devices store everyday data and generate huge data sets, which machine learning and deep learning algorithms analyze to find trends and make predictions.

Artificial intelligence is a growing field in today’s industry.

According to Forbes jobs requiring machine learning skills are paying an average of $114,000. Advertised data scientist jobs pay an average of $105,000 and advertised data engineering jobs pay an average of $117,000.

Without a doubt, Machine Learning (ML) and Artificial Intelligence (AI) are the two advanced technologies ruling the current marketplace; AI is expected to create close to 2.3 million jobs by 2020. Demand for professionals with deep learning and machine learning skills for AI positions has increased in recent years.

- Programming languages like Python/SAS/R
- Data science- Machine learning and deep learning Algorithms.
- Natural-Language Processing for text analysis.
- Neural networks
- Data mining, Data filtering, Speech Recognition, Virtual assistants
- Analytic skills.
- Solid Mathematical and Algorithms Knowledge
- Good Command Over Unix Tools such as awk, grep, sort, find, cut, etc.

Analytics specialists apply analytical thinking to solve problems and improve business processes. Organizations worldwide are dealing with enormous volumes of data and getting ever better at collecting it. Data analytics professionals collect, process and perform statistical analyses of data to extract business insights and drive growth.

Qualified data analytics and Business analytics professionals are in huge demand and can offer high salaries for specialized skills.

According to Forbes, Data Analyst jobs are among the most challenging to fill, taking five days longer than the market average to find qualified candidates. Employers are willing to pay premium salaries for professionals with expertise in these areas: one study found a premium of $8,700 above median bachelor's and graduate-level salaries, with successful applicants earning a starting salary of $80,200. Experienced Data Scientists and Data Engineers are negotiating salaries of over $100,000.

- Mathematics and Analytics
- Programming languages for Statistical analysis like R, Python, SAS, etc
- SQL databases and database querying languages
- Business intelligence and data visualization tools like PowerBI, Tableau, Qlikview, etc.
- Ability to differentiate between tools and methods

UX design is grounded in user research. Its main goal is customer satisfaction and loyalty, achieved by making products both useful and attractive. UX has long been recognized as a driver of business success, and it will remain a key topic in 2019. In recent years, there have been several attempts to measure the business value of good design.

**User Experience (UX) Designer Tasks**

- Collaborate with designers, executives, clients, engineers, and product managers to find a result that improves user experience.
- Design prototypes and perform user testing to handle each iteration of the design.
- Present design concepts and outputs that meet business or client requirements.
- Tools like Hadron App have started to unify Design and Dev workflows into a single UI that has two distinct “views”.

The average salary across 102 countries is $54,500, with the highest average being $102,614 in Switzerland; the national average salary for a User Experience Designer in India is ₹9,12,000.

- Be a user first and think like a user.
- Be familiar with UX tools like Adobe Creative Suite
- Understand typographical rules, color theory, and visual hierarchy
- Basic knowledge of HTML, CSS, and JavaScript

Mobile apps are growing steadily as newer devices and regularly updated platforms offer new capabilities to businesses. The current era of mobile apps is one of transformation: communication has become faster and easier, and the rise in the number of social media apps has led to higher interaction and engagement. Device-enabled products such as location-based apps, virtual reality apps, sound-based apps, and mobile games are powering tremendous growth.

Mobile app developers are continuously watching for what is new in the market from a technical perspective. That is why they keep an eye on the latest mobile app development trends and technologies that may set an imprint on the coming years.

Mobile application development is now among the most in-demand skills in the IT industry. Mobile app sales are expected to reach $99 billion (approximately £75 billion) in 2019, according to a forecast from Juniper Research.

- Cross-platform App Development
- Mobile UI Designing
- Knowledge of programming languages like C, C++ and Java
- Cybersecurity Guidelines
- Expert in Agile Methodologies
- Make yourself proficient with iOS, Android and hybrid development.

Creativity involves new creations, new ideas, smarter activity, different ways of thinking and a problem-solving attitude. With creativity, one can change one's standard of living. Before thinking about creativity, you have to understand your surroundings and the nature of the problem, and analyze the different possible outcomes.

- Play with your brain
- Challenge your brain
- Break your old lifestyle pattern
- Ask one question # WHY?
- Feed your ideas and improve them

Adaptability is directly proportional to learning, and learning comes in many ways: from experience, from experiments, from books, etc. Adaptability means embracing change.

- Always open to understand
- Enhance your Curiosity
- Open your mind to change
- Change is not easy but not impossible
- Engaging yourself with self positive talk

Collaboration is the act of working with someone to produce something. Collaboration can be difficult at times, but you have to handle the situation according to the demands of the organization.

- Encourage the team
- Understand the purpose
- Define the roles of individuals
- Identify the strengths of the team
- Relationships are key

When someone is going through a difficult time, there is often a hidden opportunity waiting alongside.

- Accepting Life’s Challenges
- Take a wise decision
- Fight for your dreams
- Decide how to Change
- Stay motivated

Time is the most valuable thing; nobody can earn more of it. The clock is always running, and every person has the same 24 hours in a day. The interesting question is how you manage that time.

- Trust me sometimes saying “NO” will save more time
- Find your to-do list
- Identify the key and make it a habit
- Start early
- Focus on completing the task before the deadline

*Learn skills for **data science** and **data analytics*

HOW ARTIFICIAL INTELLIGENCE IS IMPACTING INDUSTRIES

*Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn about our **data science course in Bangalore**.*


The post All About Hypothesis Testing appeared first on AcadGild.

Let us first understand what a hypothesis is and why we test it.

**Hypothesis**

A hypothesis is an idea, prediction or assumption that can be tested by an experiment.

If we say the petrol price in India is high, this is an assumption or a statement, but it is not testable until we have something to compare it with. If we define 'high' as any price above Rs. 73.41, however, it immediately becomes a hypothesis.

Now, what cannot be a hypothesis: suppose we compare the future progress of two students, A and B, before their assessment. Whether the two students will do better or worse is an assumption, but there is no data to test it; therefore, it cannot be the hypothesis of a statistical test.

Conversely, we may compare the progress of two students who have already passed that class, as we have data for both.

The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests.

This assumption or hypothesis made may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses.

There are two hypotheses that are made:

the **Null Hypothesis**, denoted by **H**_{0}, and the **Alternative Hypothesis**, denoted by **H**_{1} or **H**_{A}.

The null hypothesis is the one to be tested and the alternative is the converse of the null hypothesis.

**Steps Involved In Hypothesis Testing**

There are 4 steps involved in Hypothesis Testing:

- We must formulate our null and alternative hypothesis
- Once the hypotheses have been formulated, we will choose the right test for our hypothesis
- The third step is the execution of the test
- Finally, we make a decision based on the result: accept or reject the null hypothesis

The above steps are also called **Data-Driven Decision-making**.

**Example**

Explaining the above concepts and the four steps involved with the help of a simple example:

Suppose we flip a coin 50 times, and we assume that half the flips will result in heads and half in tails.

Here the **null hypothesis** would be: “**result would be half heads and half tails**“.

And the **alternate hypothesis** would be: “**number of heads and tails would be very different**“.

Now, when we actually execute the experiment and test the outcomes, suppose we see 40 heads and 10 tails, far more heads than the even split the null hypothesis predicts.

**In this case**, we would **reject the null hypothesis** and **accept the alternative hypothesis**.
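This intuition can be checked numerically. As a sketch, the code below computes the exact two-sided binomial p-value for a hypothetical outcome of 40 heads in 50 fair-coin flips (the count is illustrative, not from the original post):

```python
from math import comb

# Exact two-sided binomial test under H0: P(heads) = 0.5.
n, heads = 50, 40
# P(X >= 40) for X ~ Binomial(50, 0.5), summed exactly.
p_upper = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n
p_value = 2 * p_upper  # two-sided: an excess of tails would count equally
print(p_value < 0.05)  # True -> reject the null hypothesis
```

A lopsided outcome like this gives a p-value far below any conventional significance level, so rejecting the null hypothesis is justified.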

Hopefully you now have a clear idea of what a hypothesis is and what we mean by hypothesis testing.

If you want to understand why hypothesis testing works, you should first have an idea about the significance level and the rejection region.

So let’s jump right into the action.

**Significance Level**

Normally we aim at rejecting the null hypothesis if it is false. However, as with any test, there is a small chance that we could get it wrong and reject the null hypothesis that is true.

Hence, the significance level is the probability of rejecting the null hypothesis when it is true. It is denoted by α.

Typical values for α are 0.01, 0.05 and 0.1. It is a value that we select based on the certainty we need. 0.05 is the most commonly used value.

In other words, the significance level is a statistical way of demonstrating how confident you are in your conclusion. If you set a high alpha (0.1), then you’ll have a better chance at supporting your alternative hypothesis. However, you’ll also have a bigger chance of being wrong about your conclusion.

**Example**

Suppose we need to test whether a machine is working properly. We would expect the test to make little or no mistakes. As we want to be very precise, we should pick a low significance level such as 0.01: a low significance level means we reject the null hypothesis only when the evidence against it is very strong.

A packet of cookies from a famous brand contains 5 pieces per packet. If the machine drops 1 extra cookie into the packet, it damages the packaging. So, in situations like this where we need to be as accurate as possible, we can keep the value of α at 0.01.

However, if we are analyzing humans or an organization, we would expect more random or uncertain behavior. Hence, a higher degree of error.

*Now that we got some idea of Hypothesis Testing, we will now see the mechanics of this testing.*

**Mechanism of Hypothesis Testing**

Suppose we want to analyze that in a certain university how students are performing on an overall basis.

The dean of the university says that, on average, students score 75%. But we cannot simply accept this opinion, so we start testing.

Here H_{0 }is: The population mean percentage is 75%

And H_{1}/H_{A} : The population mean percentage is not 75%

Therefore, H_{0 }: μ_{0} = 75%

H_{1 }: μ_{0} ≠ 75%

Now we perform the Z-test. The test statistic is:

Z = (x̄ − μ) / (s / √n)

Here x̄ is the sample mean, μ is the hypothesized mean, s is the sample standard deviation (so s/√n is the standard error) and n is the sample size.

Through this, we are standardizing the sample mean we got. *So if the sample mean is close enough to hypothesized mean , then Z will be close to 0.*

**In this case, we will accept the Null Hypothesis.**

Otherwise, we will reject it.

Now, you might be thinking when we will be rejecting the Null Hypothesis.

Now we will see how big should Z be to reject the Null Hypothesis.

Here we will use a **two-sided or two-tailed test**. **A two-tailed test is used when the null hypothesis contains an equality sign (for example H_{0}: μ = 75%), so deviations in either direction count as evidence against it.**

"**A two-tailed test is a test of a statistical hypothesis in which the region of rejection lies on both sides of the sampling distribution and the region of acceptance is in the middle.**"

When we calculate Z, we will get a value. If this value falls into the middle part, then we accept the null hypothesis but If it falls outside, in the shaded region, then we reject the null hypothesis.

The shaded part in the above image is called the Rejection Region.

The cut-off value for the rejection region depends on the value of the Significance Level.

For instance, if the level of significance α is 0.05, we divide α by 2 and get 0.025 on the left side and 0.025 on the right side.

Now, these are values we can check from the z-table. When α is 0.025, Z is 1.96. So, 1.96 on the right side and -1.96 on the left side as shown in the below image.

Therefore, if the value of Z we get from the test is lower than -1.96 or higher than 1.96, we reject the null hypothesis. Otherwise, we accept it.
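The whole decision rule fits in a few lines. The sample numbers below (mean 72%, standard deviation 9%, n = 50) are hypothetical, chosen only to illustrate the mechanics:

```python
from math import sqrt

# Hypothetical sample: 50 students with mean score 72% and std dev 9%.
mu0, xbar, s, n = 75.0, 72.0, 9.0, 50
z = (xbar - mu0) / (s / sqrt(n))   # standardized distance from the claimed mean
critical = 1.96                    # two-tailed cut-off for alpha = 0.05
reject = abs(z) > critical
print(round(z, 2), reject)         # -2.36 True
```

Since |Z| exceeds 1.96, this sample would lead us to reject the dean's claim at the 5% level.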

**Here’s a summary of why we need the Significance Level and the Rejection Region**

The significance level and the rejection region are both important in hypothesis testing. The level of significance controls the accuracy of the prediction; we choose it depending on how big a difference a possible error could make. The rejection region, on the other hand, helps us decide whether or not to reject the null hypothesis.

**Statistical Error**

No hypothesis test is 100% accurate. There is always a chance of making an incorrect conclusion because the test is based on Probability. Hence while doing hypothesis testing, two types of errors are possible, **Type I and Type II Error**.

**Type I Error: **When the null hypothesis is true and you reject it, you make a type I error. This type of error is also known as **False Positive**.

**Type II Error: **When the null hypothesis is false and you accept it, you make a type II error. This type of error is also known as **False Negative**.

The probability of committing a Type I error (False positive) is equal to the significance level α.

The probability of committing a Type II error (False Negative) is denoted by β.

**Let’s understand this with the help of examples**

Suppose we have to predict whether a criminal is guilty or not.

We define our Null Hypothesis and Alternate Hypothesis as:

H_{0 }: Person is not guilty of the crime

H_{1 }: Person is guilty of the crime.

*In the above example, two cases can occur. *

**i)** The person is judged as guilty when the person actually did not commit the crime i.e., convicting an innocent person, here is when we commit Type I Error.

**ii)** The person is judged not guilty when they actually did commit the crime i.e., letting a guilty person go free, this is where we commit Type II Error.

Let understand this with the help of the below table.

|  | H_{0} is True (person is innocent) | H_{0} is False (person is guilty) |
| --- | --- | --- |
| Reject the Null Hypothesis (judge guilty) | Type I Error (False Positive): an innocent person is convicted | ✓ Correct decision |
| Accept the Null Hypothesis (judge not guilty) | ✓ Correct decision | Type II Error (False Negative): a guilty person goes free |

We can take another example of Medical Diagnosis.

H_{0 }: Medical test cures Disease A

H_{1} : Medical test doesn’t cure Disease A

For the above instance, the table would be:

|  | H_{0} is True | H_{0} is False |
| --- | --- | --- |
| Reject the Null Hypothesis | Type I Error (False Positive): the test cures disease A, yet the report says it doesn't | ✓ Correct decision |
| Accept the Null Hypothesis | ✓ Correct decision | Type II Error (False Negative): the test doesn't cure disease A, yet the report says it does |

I hope we are now pretty clear about the Statistical Error.

Now we will understand the concept of p-value. But before moving further we will understand what are **Point Estimate and Confidence Interval**.

A specific value is called an **Estimate**. There are two types of Estimates:

- Point Estimate
- Confidence Interval Estimate

**Point Estimate**

It is a single number, located exactly in the middle of the confidence interval. As we saw in our earlier blog, the sample mean **x̄** is a point estimate of the population mean **μ**. Likewise, the sample variance **S**^{2} is a point estimate of the population variance **σ**^{2}.

For example: we are interested in the mean weight of 10-year-old girls living in the United States. Since it would be impractical to weigh every 10-year-old girl in the United States, we take a sample of 16 and find that the mean weight is 25 kg. This sample mean of 25 kg is a point estimate of the population mean.

We cannot rely on this number alone, as not every 10-year-old girl weighs 25 kg; a point estimate by itself is therefore of limited use.

**Confidence Interval**

This, on the other hand, is an interval. A confidence interval provides much more information and is preferred when making inferences. The point estimate lies in the middle of the confidence interval.

*Confidence Interval is the range within which we expect the population parameter to be.*

If we say the average meal in India costs somewhere between Rs. 50 and Rs. 100, we have created a confidence interval around the point estimate. However, there is still some uncertainty left, which we measure with the **Level of Confidence**.

Taking the same example: we say we are 90% confident that the population parameter lies between Rs. 50 and Rs. 100. However, we cannot be 100% confident unless we survey the entire population.

The Confidence level is denoted by **1 – α** and is called the **Confidence Level of Interval**. **α is a value between 0 and 1**.

Example: if we say we are 90% confident that the parameter is inside the interval, α is 10%.

If we are 95% confident, α will be 5%.

The confidence interval is calculated with the following formula:

CI = x̄ ± z_{α/2} · s/√n

The common confidence levels are 90%, 95% and 99%, with respective alphas of 10%, 5% and 1% (that is, α = 0.1, 0.05 and 0.01).
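Plugging numbers into the interval formula is straightforward. This sketch reuses the weight example with a hypothetical sample standard deviation of 4 kg (a value not given in the original post):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample: mean 25 kg, sample std dev 4 kg, n = 16.
xbar, s, n = 25.0, 4.0, 16
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for a 95% interval
margin = z * s / sqrt(n)
print(f"95% CI: [{xbar - margin:.2f}, {xbar + margin:.2f}]")  # [23.04, 26.96]
```

A smaller α widens the interval; a larger sample shrinks it through the √n in the denominator.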

Let’s take one more example to make sure we have a firm grip on this concept.

I don’t know the age of the reader reading this blog, but I am 95% confident that your age lies between 18 and 55 years, based on the fact that you are looking online for a statistics article. I don’t have much information to begin with, nor any information about the age of any individual reader; hence the wide interval.

So I am 95% confident that you are between 18 and 55 years old. I’m 99% confident that you are between 10 and 70 years old, and I am 100% confident that you are between 0 and 110 years old.

Finally, I’m only 5% confident that you are exactly 25 years old, since that single value sits somewhere inside our interval and is a very arbitrary number.

The above explanation is described by the chart below.

A 100% confidence interval is completely useless, as it includes every possible age.

25 years old is a very precise estimate, but a confidence level of 5% is too small for us to make any meaningful analysis.

Alright, we will now discuss p-value.

**p-Value**

We know that the null hypothesis can be rejected at various levels of significance. To find the smallest level of significance at which we can still reject it, a new measure was introduced: the p-value.

This is the most common way of testing a hypothesis. Instead of testing at predefined levels of significance, we find the smallest level of significance at which we can still reject the null hypothesis.

**How Do We Calculate The p-Value?**

When we run the test, we get a value of Z. We look up the corresponding value in the z-table and use it to calculate **p**. If the value of p is lower than the significance level chosen for this particular test, we reject the null hypothesis; otherwise, we accept it.

We calculate p using the formula:

*1-tailed test: p = 1 − (number from the table)*

*2-tailed test: p = 2 × (1 − number from the table)*

Example

Suppose we are doing hypothesis testing with a level of significance of 0.05, and we get the value of Z as 2.81.

We look up the corresponding value of Z in the z-table and get 0.9975.

We then calculate the p-value as: p = 1 − 0.9975 = 0.0025.

Now we compare the value of p with alpha.

Since 0.0025 (p-value) < 0.05 (α), we reject the null hypothesis.

And this is how we calculate the p-value for 1-tailed test as well as 2-tailed test.
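The table lookup can be replaced by the normal CDF in Python's standard library; this sketch reproduces the arithmetic above for z = 2.81:

```python
from statistics import NormalDist

z = 2.81
table_value = NormalDist().cdf(z)        # ~0.9975, as read from the z-table
p_one_tailed = 1 - table_value           # ~0.0025
p_two_tailed = 2 * (1 - table_value)     # ~0.0050
alpha = 0.05
print(p_one_tailed < alpha)  # True -> reject the null hypothesis
```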

**The p-value is an extremely powerful measure, as it works for all distributions.**

**Difference Between Z-Test and T-Test**

We have already read about the T-test in our previous blog. You can refer to that blog from the link given at the end of this blog.

Both the z-score and the t-score are part of hypothesis testing under the normal distribution.

| Z-TEST | T-TEST |
| --- | --- |
| The z-statistic is calculated with the formula: z = (x̄ − μ) / (σ/√n) | The t-statistic is calculated with the formula: t = (x̄ − μ) / (s/√n) |
| The z-score is used when we know the population standard deviation σ. | The t-score is used when we don't know the population standard deviation σ. |
| When the sample size is 30 or above, we use the z-score. | When the sample size is below 30, we use the t-score. |

Now we are almost done with the concept of hypothesis testing we will see some practical examples on it.

*Q: State the null hypothesis, H _{0} and the alternative hypothesis, H_{a}: for the following statements*

*A. The mean number of years Indians work before retiring is 40.*

*B. At most 60% of Indians vote in presidential elections.*

*C. The mean starting salary for ABC University graduates is at least Rs 300,000 per year.*

*D. 10 percent of high school seniors fail each month.*

*E. About 70% of adults ride a vehicle to work in India.*

*F. The mean number of cars a person owns in her lifetime is not more than 5.*

*G. About half of Indians prefer to live away from cities, given the choice.*

*H. Indians have a mean paid vacation of six weeks each year.*

*I. The chance of developing breast cancer is under 11% for women.*

*J. Private universities' mean tuition fee is more than Rs 200,000 per year.*

*Ans. **A: **H0:μ = 40; Ha:μ ≠ 40*

*B: **H0:p ≤ 0.60; Ha:p > 0.60*

*C: H0:μ ≥ 300,000; Ha:μ < 300,000*

*D: H0:p = 0.1; Ha:p ≠ 0.1*

*E: H0:p = 0.7;Ha:p ≠ 0.7*

*F: H0:μ ≤ 5:Ha:μ > 5*

*G: H0:p = 0.50;Ha:p ≠ 0.50*

*H: H0:μ = 6;Ha:μ ≠ 6*

*I: H0:p ≥ 0.11;Ha:p < 0.11*

*J: H0:μ ≤ 200,000;Ha:μ > 200,000*

This brings us to the end of this blog. We hope this blog helped you understand the working of hypothesis testing. For any query or suggestion, do drop us a comment below.

You can refer to our previous blog based on statistics.

Keep visiting our website AcadGild for more blogs on Data Science and Data Analytics.

Happy Learning:)

The post All About Hypothesis Testing appeared first on AcadGild.

]]>The post Creating Your First Pipeline Using Jenkins DevOps Automation Tool appeared first on AcadGild.

]]>Before building a job or a pipeline using Jenkins, we recommend you go through our earlier blogs, which explain what DevOps is, the most popular DevOps tools, the installation steps, and what Jenkins is.

- What are the key features of these plugins?

- What is a Jenkins pipeline
- Pipeline Concepts
- Create your First Jenkins Pipeline

- Scripted Pipeline
- Declarative Pipeline

- Declarative Pipeline Demo
- Scripted Pipeline Demo

Pipelining is the process where we pass instructions through a pipeline. It allows instructions to be stored and executed in an orderly sequence.

The pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure. Instructions enter from one end and exit from the other.

**Now we will begin to build the Jenkins pipeline, but before that, we will discuss some features and plugins of Jenkins.**

Before the Jenkins pipeline was introduced, several other features existed, such as the Jenkins build flow, the Jenkins build pipeline plugin, Jenkins workflow, etc.

Represent multiple Jenkins jobs as one pipeline.

These pipelines are a collection of Jenkins jobs which trigger each other in a specified sequence.

Basically, Jenkins is a single platform that will run the entire pipeline as a code.

Now let’s say a pipeline has 10 jobs; instead of manually creating these jobs, chaining them together and assigning processes to them, you can just code these jobs and run them in a single go.

This code is stored as text in a file known as the Jenkinsfile, and this Jenkinsfile can be checked into a version control system.

Developers can easily access and edit the file at any point in time because it is locally available to them.

**Pipeline as code **

The first feature is the pipeline-as-code concept: instead of creating hundreds of jobs, you simply code them and run them as a pipeline.

**The code will be checked into a VCS**

The second feature is that the code can be checked into a version control system.

Because of this advantage, developers can easily access and edit the code at any point in time.

**Incorporates user input **

Because of this feature, the user can interact with the pipeline. Another important feature is that it runs jobs in parallel, which saves time and resources as well.

Please refer to the picture below for the features of the Jenkins pipeline.

- A text file that stores the pipeline as code.
- It can be checked into an SCM on your local system.
- Enables the developers to access, edit and check the code at all times.
- It is written using the Groovy DSL.
- Written based on two syntaxes.

Write the code in a local file and then store this file in the source control system.

**Key Features **

- Recent Feature.
- Simpler groovy syntax.
- Code is written locally in a file and is checked into an SCM.
- The code is defined within a pipeline block.

Directly type out the file on the Jenkins user interface.

**Key Features**

- The traditional way of writing code
- Stricter groovy syntax.
- Code is written on the Jenkins user interface.
- The code is defined within a node block.

This pipeline concept is nothing but the fundamentals of Groovy code. If you want to code the pipeline, you need a basic understanding of Groovy.

**Pipeline:**

A user-defined block which contains all the stages. It is a key part of the declarative pipeline syntax.

Example: pipeline { }

**Node:**

A node is a machine that executes an entire workflow. It is a key part of the scripted pipeline syntax.

Example: node { }

**Agent: **

The agent is a directive which instructs Jenkins to allocate an executor for the builds. It can be defined for the entire pipeline or for a specific stage.

It has the following parameters:

- **Any:** Runs the pipeline/stage on any available agent.
- **None:** Applied at the root of the pipeline, it indicates that there is no global agent for the entire pipeline and each stage must specify its own agent.
- **Label:** Executes the pipeline/stage on the labeled agent.
- **Docker:** Uses a docker container as an execution environment for the pipeline or a specific stage.

**Stages: **

It contains all the work, each stage performs a specific task.

Now the entire work which is written within a pipeline is executed within a stage.

You can see in the image below that within each stage I have defined a certain specific task.

**Stage 1** = Build

**Stage 2** = Test

**Stage 3** = QA

**Stage 4** = Deploy

**Stage 5** = Monitor

**Steps: **

Steps are always defined within a stage and carried out in a sequence to execute a stage.

Now in the example below, you can see that within the build stage I have defined steps which run a simple echo command.
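Putting the pipeline, agent, stages and steps concepts together, a minimal declarative Jenkinsfile might look like the sketch below. The stage names and echo messages are illustrative, not the exact demo code from this blog:

```groovy
pipeline {
    agent any                    // allocate any available executor
    stages {
        stage('Build') {
            steps {
                echo 'Building the application...'
            }
        }
        stage('Test') {
            steps {
                echo 'Running unit tests...'
            }
        }
        stage('Deploy') {
            steps {
                echo 'Deploying the build...'
            }
        }
    }
}
```

Each `stage` block here is one box in the stage view, and each `steps` block holds the commands that stage runs in sequence.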

**Step 1:** Log in to Jenkins and select ‘New Item’ from the dashboard.

**Step 2:** Next, enter the name of your pipeline and select ‘pipeline project’. Click ‘ok’ to proceed.

**Step 3:** Scroll down to the pipeline tab and choose if you want a declarative or scripted pipeline.

**Step 4a:** If you want a scripted pipeline, then choose ‘pipeline script’ and start typing your code.

**Step 4b: **If you want a declarative pipeline, select ‘Pipeline script from SCM’ and choose your SCM. In my case, I’m going to choose Git throughout this step by step guide. Enter your repository URL.

**Step 5:** The script path is the name of the jenkinsfile that is going to be accessed from your SCM (Git) to run. Finally, click on ‘apply’ and ‘save’.

Now we will write the code for both declarative and scripted pipeline and will execute the same.

Before jumping into the pipeline let me take you through the code.

Click **here** to download the code.

**Code Explanation:**

**Stage 1 **

- The echo command specified in ‘steps’ block displays the message.

**Stage 2**

- Input directive allows prompting user input in a stage.
- On receiving the user input the pipeline either proceeds with further execution or aborts

**Stage 3**

- ‘When’ executes a step depending on the conditions defined within the loop.
- The corresponding stage is executed if the conditions are met.
- In this demo, we’re using ‘not’ tag
- This tag executes a stage when the nested condition is false.

**Stage 4**

- Runs ‘Unit test’ and ‘Integration test’ stages in parallel.
- ‘Unit Test’ runs an echo command.
- In the ‘Integration test’ stage, a docker agent pulls an ‘Ubuntu’ image and runs reuseNode, which is a boolean (returns false by default).
- If true, the docker container will run on the agent specified at the top-level of the pipeline.

We have already seen how to create a Jenkins pipeline in the section above. I have created one declarative pipeline in the same way; next, we will see how to run that pipeline.

**Step 1: **Do the configuration for the declarative pipeline as shown in the below screenshot.

**Step 2:** Run the Declarative Pipeline.

**Step 3: **Check the results of each stage by clicking on logs as shown in the below screenshot

**Step 4:** Result of the declarative pipeline.

We will execute the scripted pipeline using simple code like hello world.

**Step 1:** Configuration for the scripted pipeline.

** Step 2:** Click on build now and check the results.

Here we are done with executing a declarative pipeline and a scripted pipeline. This ends our blog on how to build your first Jenkins pipeline.

We hope this post was helpful and that you now know how to build your first pipeline with Jenkins, the DevOps automation tool.

The post Creating Your First Pipeline Using Jenkins DevOps Automation Tool appeared first on AcadGild.

]]>The post 7 Probability Distributions Every Data Science Expert Should Know appeared first on AcadGild.

]]>**Inferential Statistics**

Inferential statistics refers to methods that rely on probability theory and distributions.

Inferential statistics allows you to make inferences about the population from the sample data.

It uses a random sample of data taken from a population to describe and make inferences about the population.

Statistical inference is the process of using data analysis to draw conclusions of properties of an underlying probability distribution.

*In this blog, we will know the importance and working of Probability, Probability Distribution, Types of Probability Distribution and common terms related to Distribution, from scratch.*

Let us first understand what is Probability.

**Probability**

Probability is a measure of the likelihood that an event will occur in a Random Experiment. Probability is quantified as a number between 0 and 1, where, we can say, 0 indicates uncertainty and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur.

**For Example:**

While tossing a fair (unbiased) coin, there is a possibility of occurrence of two outcomes (“heads” and “tails”), which are equally probable; i.e, the probability of “heads” equals the probability of “tails”. The probability of either “heads” or “tails” is 1/2 (which could also be written as 0.5 or 50%).

Before proceeding further we should be aware of the basic terms like:

**Random Experiment: **

A random experiment is a physical situation whose outcome cannot be predicted until it is observed.

**Sample Space**

A sample space is a set of all possible outcomes of a random experiment.

In the above example, we have:

*Random Experiment: Tossing of a fair coin*

*Sample space: {Head, Tail}*

As we got a little understanding of Probability, we will now read about Probability Distribution and its types with the help of examples and formulas wherever required.

**Distribution**

In statistics when we use the term Distribution it usually means Probability distribution.

A Distribution is a function that shows the possible values for a variable and how often they occur.

Or A Probability Distribution is a mathematical function that can be thought of as providing the probabilities of occurrence of different possible outcomes in an experiment.

**Good examples are the**

- *Normal Distribution*
- *Binomial Distribution*
- *Uniform Distribution*

The above image shows the three distributions respectively.

**Types of Probability Distribution **

There are many different types of probability distribution. The ones we will be covering in this blog are listed below:

- Normal Distribution
- Bernoulli’s Distribution
- Binomial Distribution
- Uniform Distribution
- Student’s T Distribution
- Poisson Distribution
- Exponential Distribution

Each probability distribution has a visual representation. It is a graph that describes the likelihood of occurrence of every event. The graph is just a visual representation of a distribution.

*Do Not misunderstand that the Distribution is a graph. Distribution is defined by the underlying probability and not the graph.*

**1. Normal Distribution**

The visual representation of Normal Distribution has already been seen above in the blog.

The Normal Distribution is a very common continuous probability distribution. Distributions of this type are important in statistics and are often used to represent random variables whose distribution is not known.

The statistical term for this type of distribution is Gaussian Distribution though many people call it Bell curve as it is shaped like one.

This type of distribution is symmetric and its mean, median and mode are equal.

Mathematically, Gaussian Distribution is represented as:

**N~(μ, σ ^{2 })**

Where N stands for Normal, symbol ~ stands for distribution, symbol μ stands for mean and σ^{2 }stands for variance.

In the above image, we can see the highest point is located at the mean μ and the spread of the graph is determined by the standard deviation σ.

Let us understand this with the simplest example where we have a random variable X with distribution:

X = {1, 2, 3, 4, 5}

When we take the mean and standard deviation of the above data set, we get mean (μ) = 3 and standard deviation (σ) ≈ 1.41. When we plot it, we get a distribution like this:

This Bell curve specifies the Gaussian/Normal Distribution.

**Note: Not all but more than 70% of the data distribution usually follows this pattern.**

When we talk about Gaussian Distribution or Normal Distribution we have often heard the term Empirical Formula. What exactly does this formula states, well is what we will be covering next.

**1.1 Empirical Formula**

The empirical rule states that for a Normal Distribution, nearly all of the data will fall within three standard deviations of the mean. The empirical rule can be understood when broken down into three parts:

- 68% of the data falls within the first standard deviation from the mean.
- 95% fall within two standard deviations.
- 99.7% fall within three standard deviations.

We can understand this with the help of the below image.

- Approximately 68% of the data falls within one standard deviation of the mean (i.e., between the mean minus(-) one times the standard deviation, and the mean + 1 times the standard deviation). In mathematical notation, this is represented as μ ± 1σ

- Approximately 95% of the data falls within two standard deviations of the mean (i.e., between the mean – 2 times the standard deviation, and the mean + 2 times the standard deviation). The mathematical notation for this is: μ ± 2σ

- Approximately 99.7% of the data falls within three standard deviations of the mean (i.e., between the mean – three times the standard deviation and the mean + three times the standard deviation). The following notation is used to represent this fact: μ ± 3σ

The rule is also called the **68-95-99.7** Rule or the Three Sigma Rule.

The Empirical Rule is often used in statistics for forecasting, especially when obtaining the right data is difficult or impossible to get. The rule can give you a rough estimate of what your data collection might look like.
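The 68-95-99.7 percentages above can be verified numerically. A short sketch using the standard library: for a normal variable, the probability of falling within k standard deviations of the mean is erf(k/√2):

```python
from math import erf, sqrt

def within_k_sigma(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The three printed values are exactly the 68%, 95% and 99.7% of the empirical rule.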

When a Normal Distribution is standardized, the result is called a **Standard Normal Distribution. **

**1.2 Standard Normal Distribution**

Understanding standardization in the context of statistics: every distribution can be standardized. Let’s say the mean and the variance of a variable are μ and σ^{2 }respectively.

Standardization is the process of transforming a variable to one with a mean of 0 and a standard deviation of 1.

i.e., **~(μ, σ**^{2 }**) → ~ (0, 1)**

When a Normal Distribution is standardized, the result is called a Standard Normal Distribution.

i.e., **N~(μ, σ**^{2 }**) → ~ N(0, 1)**

We use the following formula for standardization:

**z = (x – μ) / σ**

Where x is a data element, μ is the mean and σ is the standard deviation.

We use the letter Z to denote standardization. The standardized value i.e., Z is known as the z-score.

These Z scores are important because they tell you how far a value is from the mean. When you standardize a random variable, its ‘mean’ becomes 0 and its standard deviation becomes 1.

If the Z score of x is zero, then the value of x is equal to the mean.

Let us understand the steps involved in Standardization with the help of a simple example.

Suppose we have a dataset with elements

X = { 1, 2, 2, 3, 3, 3, 4, 4, 5}

and distributed as:

We get mean as 3, variance as 1.49 and std dev as 1.22 i.e., N ~ (3, 1.49).

Now we will subtract the mean from all data points, i.e., **x – μ.**

We will get a new data set as below:

X1 = {-2, -1, -1, 0, 0, 0, 1, 1, 2}

Now we get mean as 0, but variance and std dev still as 1.49 and 1.22 respectively i.e., N ~ (0, 1.49)

So far we have a new distribution, but it is still normal and needs to be standardized.

So the next step of standardization is to divide all the data points by the standard deviation, i.e., (x – μ)/σ.

Dividing each datapoint by 1.22(std dev) we get a new data set as :

X2 = {-1.63, -0.82, -0.82, 0, 0, 0, 0.82, 0.82, 1.63}

Now if we calculate the mean we get as 0 and standard deviation as 1 i.e., N ~ (0, 1)

Plotting it on a graph we get something like this

This is how we can obtain Standard Normal Distribution from any normally distributed dataset.

Using this standardized normal distribution makes inferences and predictions much easier.
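The standardization walk-through above can be reproduced in a few lines of Python. This sketch uses the standard library's `statistics` module and, like the example, the sample standard deviation (≈ 1.22):

```python
from statistics import mean, stdev

X = [1, 2, 2, 3, 3, 3, 4, 4, 5]
mu, s = mean(X), stdev(X)       # mean 3, sample std dev ~1.22

Z = [(x - mu) / s for x in X]   # z-scores: subtract the mean, divide by the std dev
print([round(z, 2) for z in Z])
# [-1.63, -0.82, -0.82, 0.0, 0.0, 0.0, 0.82, 0.82, 1.63]

print(round(mean(Z), 2), round(stdev(Z), 2))  # 0.0 1.0
```

The standardized data has mean 0 and standard deviation 1, i.e., N ~ (0, 1), matching X2 above.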

**1.3 Probability Density Function and Probability Mass Function**

Probability density function and Probability mass function is a statistical expression that defines a Probability Distribution for a random variable.

Do not get confused between the two terms. Probability density function(PDF) is used to determine the probability distribution for a Continuous Random Variable. When the PDF is graphically plotted the area under the curve indicates the interval in which the variable will fall.

Whereas the Probability Mass Function(PMF) is used to determine the probability distribution for a Discrete Random Variable.

*As we know, continuous random variables are those that take an infinite number of possible values, e.g., the weight of a person can be 50.2, 44.5, 60.7, etc., while discrete random variables take on only a countable number of distinct values, such as 0, 1, 2, 3, 4, ….*

If we know the mean and variance of our dataset we can compute the PDF and PMF. PDF and PMF tell how well our data has been distributed with respect to the mean and standard deviation within a particular curve.

**1.4 Cumulative Density Function**

The cumulative distribution function (CDF) of a random variable is another method to describe the distribution of random variables.

The cumulative frequency is the sum of the relative frequencies. It starts at the relative frequency of the first value, then we add the second, the third and so on until it reaches 100%.

The advantage of the CDF is that it can be defined for any kind of random variable (discrete, continuous, and mixed).

**1.5 Central Limit Theorem**

The Central Limit Theorem is one of the most important concepts in Statistics.

This theorem states that as the number of samples taken becomes large, the distribution of the sample means, when plotted, tends toward a normal distribution.

Don’t worry, let’s study it briefly with pictorial representation.

Suppose we have a very large dataset, *whose distribution doesn’t matter and could be normal, uniform, binomial or random.*

The first thing we do is take out the subsets from the large data sets, that means,** we will fetch smaller datasets of size 30 or more** and create different subsets.

After fetching a sufficient number of samples, we calculate the mean of each sample and plot their distribution. Surprisingly, the graph of the sample means looks much like a Normal/Gaussian Distribution.

Also, if we take the average of all sample means it will be nearly equal to the actual population mean and the standard deviation equals **σ/√n. **

Where: σ = the population standard deviation

n = the sample size(i.e., number of observations in our sample)

Let us revise the key summary that should be kept in mind while applying the Central Limit Theorem.

- The distribution of the original(population) dataset doesn’t matter. It could be normal, uniform, binomial, etc.
- The distribution of the sample means would always be Normal Distribution

- The more the number of samples extracted from the population, the closer to a Normal Distribution the sample means will be.

- The samples extracted should be bigger than 30 observations.
- The average of the sample mean extracted will be nearly equal to the mean of the population and its variance would be equal to the original variance divided by the sample size i.e., ‘n’.

**2. Binomial Distribution**

This type of distribution is used when there are exactly two outcomes of a trial. These outcomes are labeled as “Success” and “Failure”.

Here the probability of both the outcomes is the same for all the trials.

Each trial is independent since the outcome of the previous toss doesn’t determine or affect the outcome of the current toss. An experiment with only two possible outcomes repeated n number of times is called binomial. The parameters of a binomial distribution are n and p where n is the total number of trials and p is the probability of success in each trial.

We have already seen the graph representing Binomial Distribution above.

We define the Binomial Distribution with the formula below, giving the probability of exactly x successes in n trials:

**P(X = x) = ^{n}C_{x} p^{x} (1 – p)^{n–x}**
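The binomial probability mass function is straightforward to sketch with the standard library (the coin-toss numbers are illustrative):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each succeeding with prob p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of exactly 5 heads in 10 fair coin tosses
print(round(binomial_pmf(5, 10, 0.5), 4))   # 0.2461

# Bernoulli distribution is the n = 1 special case discussed next
print(binomial_pmf(1, 1, 0.3))              # 0.3
```

Summing `binomial_pmf(k, n, p)` over k = 0…n gives 1, as every probability distribution must.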

**3. Bernoulli’s Distribution**

Binomial Distribution is closely related to Bernoulli’s Distribution.

Bernoulli Distribution is a special case of Binomial Distribution with a single trial.

The Bernoulli distribution is a discrete distribution having two possible outcomes, 0 and 1, where 1 (usually called a “success”) occurs with probability p and 0 (usually called a “failure”) occurs with probability q = 1 – p, where 0 < p < 1.

Therefore the probability density function(pdf) and the graph for Bernoulli’s Distribution is shown in the figure below:

In the above graph, 1 refers to success and 0 specifies the failure.

The head and tail distribution in coin tossing is an example of Bernoulli’s Distribution with p = q = ½.

**4. Uniform Distribution**

A uniform distribution is a distribution that has a constant probability.

We have already seen the graphical representation of uniform distribution above. Let us understand this with the help of an example.

EXAMPLE: If we roll a die(numbered from 1 to 6), then the probability of getting 1 is one out of six i.e., 1/6

Similarly, the probability of getting 2, 3, 4, 5 and 6 also is ⅙. There is an equal chance of getting each of the 6 outcomes.

Now, if we check for the probability of getting 7, then it is 0, since it is impossible to get a 7 when rolling a die.

For the probability of outcomes for 1 to 6, we have an equal chance of occurrence and this is what we call a Discrete Uniform Distribution.

*Remember that the sum of their probabilities is equal to 1 or 100%.*

**5. Student’s T-Distribution**

T Distribution or Student’s T Distribution is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown.

Visually, the Student’s T distribution looks much like a Normal distribution but generally has fatter tails. Fatter tails allow for a higher dispersion of variables, as there is more uncertainty.

As the z-statistic is related to the standard Normal distribution, the t-statistic is related to the Student’s T distribution.

The formula that allows us to calculate it is:

** t = [ x̅ – μ ] / (s / √n)**

t, with n – 1 degrees of freedom, equals the sample mean minus the population mean, divided by the sample standard deviation over the square root of n, where n is the sample size.

*The degrees of freedom refers to the number of independent observations in a set of data.*

Now we will see the graph for Student’s T Distribution and will also see how it is different from Normal Distribution.

**Why use T-Distribution?**

According to the Central Limit Theorem, the distribution follows Normal Distribution when the sample size is sufficiently large. Here we know the standard deviation and can calculate the z-score and can plot the Normal Distribution.

But sometimes the sample sizes are small and we also do not know the standard deviation of the population. This is where statisticians rely on the T-Distribution (and the t-score).

**6. Poisson Distribution**

The Poisson Distribution is a discrete probability distribution which gives the probability of a number of events occurring in a fixed interval of time or space, on the condition that the average number of occurrences of the event is known.

For instance, If the average number of diners for seven days is 500, we can predict the probability of a certain day having more customers.

Another example: if a call center gets 30 calls in an hour, we can predict the probability that it receives no calls in the first 3 minutes, and so on.

The Poisson Distribution results from a Poisson experiment, in which a series of discrete events occurs such that the average time between events is known, but the exact timing of events is random.

Suppose we conduct a Poisson experiment, in which the average number of successes within a given region is μ.

Then, the Poisson probability is:

**P(x; μ) = (e ^{-μ}) (μ^{x}) / x!**

where x is the actual number of successes that result from the experiment, and e is approximately equal to 2.71828.
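The formula maps directly to code; this sketch answers the call-center question posed earlier (30 calls per hour gives an average of μ = 1.5 calls in 3 minutes):

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(exactly x events in the interval) when the average count is mu."""
    return exp(-mu) * mu**x / factorial(x)

# call center averaging 30 calls/hour -> 1.5 calls expected in 3 minutes
mu = 30 * (3 / 60)
print(round(poisson_pmf(0, mu), 4))   # P(no calls in the first 3 minutes) = 0.2231
```

So there is roughly a 22% chance of a completely silent first 3 minutes.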

The graph of Poisson Distribution is as shown below:

The mean and variance of x following a Poisson distribution:

Mean → E(x) = µ

Variance → Var(x) = µ

**7. Exponential Distribution**

Exponential Distribution is one of the most widely used continuous distributions. It models the waiting time until an event occurs.

The exponential distribution is highly used for survival analysis purposes. An example of an exponential distribution is the lifespan of a machine.

It basically answers our query as to how much time do we need to wait before a given event occurs.
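For an exponential distribution with rate λ, the probability of waiting longer than t is e^{−λt}, and the expected waiting time is 1/λ. A sketch continuing the machine-lifespan example (the once-every-5-years rate is a hypothetical number, not from the blog):

```python
from math import exp

def exponential_survival(t, rate):
    """P(waiting time > t) when events occur at the given average rate."""
    return exp(-rate * t)

# machine fails on average once every 5 years -> rate = 1/5 per year
rate = 1 / 5
print(round(exponential_survival(10, rate), 4))  # P(machine survives past 10 years) = 0.1353
print(1 / rate)                                  # expected lifespan: 5.0 years
```

This directly answers "how long do we need to wait before the event occurs": on average 1/λ, with the survival probability decaying exponentially.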

The graph of Exponential Distribution is shown below:

This brings us to the end of this blog. We hope this blog helped you in learning Probability Distribution from scratch.

Keep visiting our website AcadGild, for more blogs on Data Science and Data Analytics.

Happy Learning:)

The post 7 Probability Distributions Every Data Science Expert Should Know appeared first on AcadGild.

]]>The post Text Mining using R appeared first on AcadGild.

]]>Text Mining is generally known as Text Analytics. It is the process of collecting insight and information from a set of text-data. Text Mining is used to help businesses find relevant information in text-based content. This content can be in the form of a word document, posts on social media, email, etc. Text mining techniques allow us to highlight the most frequently used keywords in a paragraph of text. A word cloud, also referred to as a text cloud, is a visual representation of text-data. The steps for creating word clouds are quite easy in R.

The ability to deal with text-data is one of the important skills of a data scientist in today’s scenario. With the onset of review websites, social media, forums, web pages, companies now have access to enormous text-data of their customers.

These data will be messy; however, they are a source of information and insights which can help companies boost their businesses. That is the reason why text mining, as part of the broader field of **Natural Language Processing (NLP)**, is growing rapidly and being broadly used by data scientists. The text mining package (‘tm’) and the word cloud package (‘wordcloud’) are available in R for text analysis and for quickly visualizing the keywords as a word cloud.

Text mining saves time and performs more efficiently than human brains.

- Text mining can help in predictive analytics
- Text mining is used to summarize documents and helps to track opinions over time
- Text mining techniques are used to analyze problems in different areas of business.
- Also, it helps to extract concepts from the text and present them in a simpler way

Text mining can be used to filter irrelevant e-mails using certain words or phrases; such emails automatically go to spam. Text mining can also alert the email user so that mails with such offending words or content can be removed.

Text mining allows for understanding text better than anything else. Text mining techniques convert words from unstructured data into numerical values. Text mining helps to find patterns and relationships that exist in a large chunk of text, and generally uses machine learning algorithms to read and analyze text-data.

It will be difficult to understand the text easily and quickly without text-mining. The steps in the text mining process and making word clouds are listed below.

Let us see an example of how actually text mining works and how to create wonderful word clouds with R. Reasons you should use word clouds to present your text data.

In the following examples, I’ll process my AcadGild article on Artificial Intelligence in (.txt) format, available at the link https://acadgild.com/artificial-intelligence.txt. You can use any text you want:

Here in the below code line, we have loaded the data in filePath.

Type the R code below, to install and load the required packages:

install.packages("NLP")

install.packages("tm")

install.packages("RColorBrewer")

install.packages("wordcloud")

install.packages("wordcloud2")

To load the library of these packages use the library() function :

**library(package_name)**

To learn more about the above packages :

**help(package_name)**

“Text Mining is a technique that boosts the research process and helps to test the queries.”

To import the file type the following R code.

filePath <- "https://acadgild.com/artificial-intelligence.txt"

Read the lines of the file using the readLines() function and store the text-data in **text_file**.

text_file <- readLines(filePath)

Let’s see the first few lines of text_file by using the head() function.

head(text_file)

Here we can see the first few lines of our text file.

Now we use the paste() function on **text_file** to collapse all the lines into a single chunk of text within one set of quotations (“ ”), and store the result in **text_file1**.

Here is a very small example, because the real text file is too large to show here.

*Example:** “hello” “world” to “hello world”*

text_file1 <- paste(text_file, collapse = " ")

head(text_file1)

Using the head() function, we see the first few lines of the modified **text_file1** document.

As shown in the small example, here in the above console the entire text comes into one set of quotations.

The **text mining** functions are used to convert the text to **lower case** and to remove unnecessary punctuation, digits and stopwords.

Let us convert **text_file1** to lower case using the tolower() function and assign it to **clean_text**.

#clean_text-data

clean_text <- tolower(text_file1)

head(clean_text)

We can see the few lines of **clean_text **by using the head() function.

*In every step, you can modify your text-data and use it in the next step for text-manipulation. *

You can also remove Punctuation and digits with **removeNumbers** and **removePunctuation** arguments.

To remove punctuation, we use the gsub() function in the code below.

Here, pattern = "\\W" matches the non-word (punctuation) characters to remove.

replace = " ": we replace each punctuation character with a space. If we don’t do so, then removing the characters may join neighbouring words into new words.

#Remove punctuations

clean_text1 <- gsub(pattern = "\\W", replace = " " ,clean_text)

head(clean_text1)

Here in the above console, you can see that no punctuation remains, and the words are separated by spaces.

If digits are present in your text file, you can easily remove those numbers from your text by using the gsub() function (probably not required here).

Here, the pattern “\\d” removes digits.

#remove digits

clean_text2 <- gsub(pattern = "\\d", replace = " ", clean_text1)

head(clean_text2)

The information value of **‘stopwords’** is near zero due to the fact that they are so common in a language. Removing these kinds of words is helpful before further analysis.

#clean the stop words

#load the required packages

library(NLP)

library(tm)

Let’s see a preview of the stopwords using the stopwords() command.

stopwords()

In the above console, we can see the list of stopwords.

Let us remove those stopwords and other unnecessary words using the removeWords() function.

#Remove stop words

clean_text3 <- removeWords(clean_text2,words = c(stopwords(),"ai","â"))

head(clean_text3)

Now let us remove single letters with the gsub() function in the code below.

#remove single letters

clean_text4 <- gsub(pattern = "\\b[A-Za-z]\\b", replacement = " ", clean_text3)

head(clean_text4)

Here, \\b marks a word boundary, so \\b[A-Za-z]\\b matches any single letter, upper or lower case, standing alone as a word of length one.

Here in the above console, the single letters have been removed.

We can finally remove extra white space using the stripWhitespace() function, which is part of the tm package.

#remove white spaces

clean_text5 <- stripWhitespace(clean_text4)

head(clean_text5)
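For readers coming from Python, the cleaning steps above (lower-casing, replacing non-word characters and digits with spaces, collapsing white space) can be sketched with the standard re module. This is an illustrative aside with a made-up sample string, not part of the R workflow:

```python
import re

# Made-up sample text for illustration
text = "Hello, World! 123 ...and AI."

clean = text.lower()                        # lower-case, like tolower()
clean = re.sub(r"\W", " ", clean)           # non-word characters -> space, like gsub("\\W", ...)
clean = re.sub(r"\d", " ", clean)           # digits -> space, like gsub("\\d", ...)
clean = re.sub(r"\s+", " ", clean).strip()  # collapse runs of whitespace, like stripWhitespace()

print(clean)  # -> "hello world and ai"
```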

We now have one chunk of text, and we want to count words.

Since the lines were joined into a single chunk, we first split it back into individual words, using the space character as the separator with the strsplit() function.

#splitwords

clean_text6 <- strsplit(clean_text5, " ")

head(clean_text6)

By using head() function we can see the first few split words in our console.

Here in the above console, we can see the split words from our clean_text6 text-data.

Now create a frequency table from the split words using the table() function, and assign it to word_freq.

#frequency of words

word_freq <- table(unlist(clean_text6))

head(word_freq)

Here in the above console, we can see some of the words along with the number of times each is repeated in the article.

Using cbind(), we combine the word names and their counts column-wise into a two-column matrix.

word_freq1 <- cbind(names(word_freq), as.integer(word_freq))

head(word_freq1)

By using head() function we can see the first six rows by default.

In the above console, the first six rows have been printed, showing each word with the number of times it has been repeated.
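For comparison, the same word-frequency count can be sketched in Python with the standard library’s collections.Counter (an illustrative aside with a made-up sample sentence, not part of the R workflow):

```python
from collections import Counter

# Made-up sample sentence for illustration
words = "artificial intelligence will shape how humans will work".split()

freq = Counter(words)       # word -> count, playing the role of R's table()
print(freq["will"])         # -> 2
print(freq.most_common(1))  # the single most frequent word with its count
```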

- Word clouds add clarity and simplicity.
- The most used keywords stand out better in a word cloud.
- Word clouds are a dynamic tool for communication. Easy to understand, to be shared and are impressive words representation.

Load the required libraries for making the word cloud:

library(RColorBrewer)

library(wordcloud)

class(clean_text6)

word_cloud <- unlist(clean_text6)

- **words:** the words to be plotted, i.e., word_cloud, where we have saved the text data.
- **freq:** the word frequencies.
- **min.freq:** words with a frequency below min.freq will not be plotted.
- **max.words:** the maximum number of words to be plotted.
- **random.order:** plot words in random order; if FALSE, words are plotted in decreasing frequency.
- **rot.per:** the proportion of words drawn with 90-degree rotation (vertical text).
- **brewer.pal:** run **??brewer.pal** in R to see its documentation.
- **colors:** colors for words from least to most frequent; use, for example, colors = "Red" for a single color, or "random-dark" / "random-light".

Follow the below code and create wonderful word clouds:

wordcloud(word_cloud)

wordcloud(word_cloud,min.freq = 5 , random.order = FALSE, scale=c(3, 0.5))

wordcloud(word_cloud,min.freq = 3, max.words=1000, random.order=F, rot.per=0.2, colors=brewer.pal(5, "Dark2"), scale=c(4,0.2))

library(wordcloud2)

wordcloud2(word_freq)

wordcloud2(word_freq, color = "random-light", backgroundColor = "white")

wordcloud2(word_freq, color = "random-dark", backgroundColor = "white",size = 0.5, shape = "triangle")

wordcloud2(word_freq, minRotation = -pi/20, maxRotation = -pi/20, minSize = 10, rotateRatio = 1, color = "random-dark", backgroundColor = "white")

The above **word cloud** clearly shows that “will”, “artificial”, “data”, “human” and “intelligence” are the five most important words in the “**Artificial Intelligence**” article.

**R vs Python combat**

https://acadgild.com/blog/r-vs-python-combat

*Keep visiting our site* www.acadgild.com* for more updates on Data Analytics and other technologies. Click here to learn **data science course in Bangalore**.*

The post Text Mining using R appeared first on AcadGild.

The post What is Jenkins? | Jenkins for continuous integration appeared first on AcadGild.

Before moving to Jenkins, we recommend you go through the blogs that explain what DevOps is and the most popular DevOps tools, along with their installation.

- What is Jenkins?
- What is Continuous Integration (CI)?
- Why Continuous Delivery (CD).
- Why We Need CI/CD With Realtime Example.
- What is Continuous Delivery Pipeline?
- Stages Of Continuous Delivery Pipeline.
- Advantages Of Continuous Delivery

- Continuous Integration and Continuous Delivery at HP.
- Continuous Integration At AcadGild.

Jenkins is a powerful automation tool written in Java. It is a continuous integration server designed to handle any type of build or continuous integration process. It can be used by teams of any size, on projects where the team may be working with heterogeneous languages such as Java, .NET, PHP, etc.

Jenkins has a rich set of plugins that let it connect to all the software development tools used in the coding, testing, and deployment phases. That is what makes Jenkins so powerful.

From the continuous integration perspective, Jenkins can connect to various source code servers, and it also has plugins that allow it to build, test, and deploy. This makes Jenkins an ideal choice for a continuous integration server.

Let’s assume a bunch of developers are working on the same project and codebase.

If code check-ins do not happen frequently, integration problems accumulate, which is very costly for the whole project.

Early detection of such issues ensures quick delivery of the software, so as part of continuous integration, every developer is expected to check in code every day.

At the end of the day, an automated server, the Jenkins continuous integration tool, wakes up, pulls the latest code from the source control management system (such as Git), builds the code, compiles it, and tests it.

If any breakage or errors are found in the code, the Jenkins server sends a notification, via email or otherwise, to all the developers working on that code, telling them that something is broken. This kind of early error detection, together with the automated build, compile, and test steps, is very helpful for the continuous delivery of software.

You can see the complete flow of the Continuous Integration from the below picture.

Let’s look at the below picture with a simple scenario

We have a number of developers who write code and commit it to a code repository; from this repository the code is compiled and executed, and then handed over to the quality assurance team.

The QA team then tests manually, with unit testing and integration testing, to check whether there are any logical errors in the code, and once they are done fixing the errors, they deploy the application directly to the production environment.

But suppose this results in the failure of the application. Let’s look at the reasons why the application might fail.

- Different Environment (servers)
- Different Libraries and packages
- End-user load(traffic)
- App not accessible to the intended audience

Let’s look at each reason why an application can fail when deployed to production.

The environment the application is tested on differs from the production environment. Because of this, the application fails to run in the production environment.

If the two environments are different, they will support different libraries and packages, which is why code can be compatible with the testing environment but not with the production environment.

The production server is not capable of handling the end-user load, because of which the server crashes.

The production server has a threshold beyond which it cannot process any more user requests. If the production server is flooded with too many user requests and cannot process them all, it ends up crashing, and that is why the application fails here.

Now we will discuss how these problems are overcome by using a continuous delivery pipeline with Jenkins.

Stages of the Continuous Delivery Pipeline :

- Development Team
- Version Control System
- Build and Test (Jenkins)
- Test Environment ( Acceptance testing, load testing )
- Production Ready

So let’s look at each phase of the continuous delivery pipeline

In this phase, the developer is writing the code for the application as per the client’s requirements.

Once the developer writes the code they will push their code into any version control system like GIT. On the other hand, the version control system is the category of the software tool that helps the software development team to manage changes to the source code over time.

Basically, it helps you track every change made to the code over a period of time. A version control system keeps track of each and every modification made to the source code. Developers can turn back the clock and compare earlier versions of the code to help fix mistakes while minimizing disruption to all team members.

Let’s take an example: there is a team of developers who build an application and want to update it from version 1.0 to version 2.0.

In order to do that, the developers write new code or modify the existing code and then commit it to the version control system.

Now suppose the application is running as version 2.0 and the developers face a lot of issues with it, so they want to roll back to version 1.0. They can easily do this because the version control system has tracked and saved all the code changes they made. This is why a version control system is required.

Now look at this phase, where you implement continuous integration using tools like Jenkins. We have already seen what continuous integration is: the code is pulled from the version control system and then built, compiled, and tested continuously using an automation tool.

Instead of manually performing the unit tests and integration tests, you can automate them using tools such as the Selenium test engine.

The more of the process you automate, the quicker and less error-prone the delivery becomes.

This is the most important part of the continuous delivery pipeline, because it makes sure the software is always in a production-ready state; in other words, the software must be in such a state that it is ready to release.

Various kinds of testing are carried out to ensure this, such as acceptance testing and load testing.

This is performed to check the behavior of the application under a certain expected load.

We saw earlier that the application failed at the production level because it was unable to process too many user requests. By implementing continuous delivery, that failure will not occur, because we perform load testing.

In load testing, we check how much user load the application is capable of processing.

Here the user can access the application without any problem. If the user finds any problem in the application, feedback is immediately sent to the developers, who make the required changes and commit them to the version control system.

- Automate Software Releases.
- Increases The Developer Productivity
- Locate and Addresses the Bugs Quicker.

Let’s look at each advantage in detail.

The first advantage is that it automates software releases. Continuous delivery uses automation tools in each phase of the software development life cycle, whether building, testing, or deploying. Because of this, software delivery becomes faster and less prone to human error.

The next advantage is that it increases developer productivity: because continuous delivery is automated, developers can focus on building new features, and in this way their productivity increases.

The most important advantage is that we can find bugs and address them quickly.

Once we find the bugs, we can fix them using automation tools.

Continuous Delivery At **HP(Hewlett-Packard)**

Now let’s see how continuous delivery solved the problems at HP. **HP** is a tech giant that offers worldwide IT, technology, and enterprise products, solutions, and services. **HP** Inc retained HP’s computer and printer business, which ensures that there is a suitable device for every office or home, based on need.

- In 2008 they were facing problems, their product delivery cycle was slow.
- It took them 6-12 months to build new features; making them slower than all their competitors.

HP came up with a new plan whose target was to improve developer productivity by a factor of 10. In order to achieve this, they set three high-level goals.

- A single platform to support the whole workflow.
- Improved quality software release.
- Faster release.

They implemented a continuous delivery pipeline with two important features.

- Practicing Continuous Integration.
- Automation at every step.

- Overall development cost reduced by 40%.
- Programs under development increased by 140%.
- Development cost per program went down by 78%.

Continuous integration At **AcadGild**

I am pretty sure you have all enrolled in **AcadGild** technical courses at some point. In a software development project at **AcadGild**, there was a process called two-day builds. Two-day builds can be thought of as a predecessor to Continuous Integration: every two days, an automated system pulls the code that was added to the shared repository and builds that code.

The idea is quite similar to Continuous Integration, but since the code that was built in two days was quite large, locating and fixing of bugs was a real pain. Due to this, Acadgild adopted Continuous Integration (CI). As a result, every commit made to the source code in the repository was built. If the build result shows that there is a bug in the code, then the developers only need to check that particular commit. This significantly reduced the time required to release new software.

We hope this post was helpful to you to understand what Jenkins is and how it works.

In the next blog, we will see how to create a continuous integration pipeline with Jenkins.

Keep visiting our website AcadGild for further updates on the DevOps and other technologies.


The post R vs Python combat appeared first on AcadGild.

R and Python are both open-source programming languages with large communities, and both are under steady development, which is why they regularly add new libraries to their ecosystems. R is primarily a high-level language for statistical analysis and reporting, whereas **Python is a general-purpose programming language** that provides a more general approach to data science.

A beginner in programming can learn R without putting in much effort.

R was built by statisticians and has a variety of libraries for different **statistical analysis and visualization** tasks. Likewise, Python is one of the most popular programming languages for beginners, with powerful libraries for different tasks such as web development, data analytics, etc. The simplicity of the Python language is what makes it so popular.

Furthermore, let us see in detail that which language is better for you.

R is one of the most powerful programming languages and software environments for **statistical analysis** and producing graphical reports. It is an open-source platform. A main reason for using R is that it can implement statistical **approaches** such as **linear** and **non-linear** modeling.

**IDE:** The common IDE for R programming is RStudio.

RStudio is an Integrated Development Environment (IDE) which allows users to code and develop R based applications. R consists of countless libraries from data manipulation to data visualization which makes programming easier.

For more info on RStudio with basic Coding examples, we recommend our Article link below.

R is a user-friendly language that is mostly used for data analysis, statistics including graphical representation models. Even packages like ggplot2 and dplyr that extend the R features further.

For more about visualization, we recommend our Article link below.

R can be used to integrate with databases such as SQL server. It can also be used for machine learning algorithms, natural language processing.

R also supports data structures such as vectors, lists, matrices, arrays, factors and data frames.

For more understanding and the working of data structure in R, we recommend our below link blog.

- People who adopt R generally come from fields such as research, data science, and statistics.
- Statistical models can be written with a few lines of code, and a given piece of functionality can often be written in several ways.
- R makes complex statistical formulas easy to use; all common statistical test models are available and simple to apply.

**Popular Packages **

- **dplyr**, **plyr**, and **data.table** to manipulate data easily.
- **stringr** to manipulate strings, generally used for text manipulation.
- **zoo** to work with regular or irregular time series, or for trend analysis.
- **ggplot2** and **lattice** to visualize data.
- **caret** for the machine learning approach.

**Flexibility:** Easy to use the available library. i.e; ggplot2, dplyr etc.

**Database size**: R can handle the huge size (in GBs) of the dataset.

Learn to handle a huge dataset using data.table package in the below article.

Python is a high-level most popular general-purpose programming language. It is an open-source platform. The python codes are easy to write, read, debug because of its code brevity.

Similar to R, Python is an interpreted language. It is easy to give commands using the command line, and users can use the command prompt to execute Python scripts.

Learn Python step by step, the article link is given below.

**IDE:** The most popular IDEs for Python are Spyder and Jupyter Notebook, which are easily available in the Anaconda distribution.

To install Jupyter Notebook and work on Python code, we recommend our blog linked below.

Python can be used to integrate with databases such as MySQL. It can also be used for machine learning algorithms, natural language processing and many more.

Libraries like Matplotlib, Pandas, and NumPy extend Python’s features further.

Python also supports data structures such as lists, dictionaries, and tuples.
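As a minimal illustration of these three structures (the variable names and values here are made up):

```python
colors = ["Red", "Yellow"]             # list: ordered and mutable
point = (3, 4)                         # tuple: ordered but immutable
capitals = {"India": "New Delhi"}      # dictionary: key -> value pairs

colors.append("Green")                 # lists can grow in place
capitals["France"] = "Paris"           # dictionaries can gain new keys

print(colors)              # ['Red', 'Yellow', 'Green']
print(point[0])            # 3
print(capitals["France"])  # Paris
```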

For more understanding and the working of data structures in Python, we recommend our Article link below.

- People who adopt Python are developers, programmers, and data scientists.
- Code can be written easily because of Python’s clean syntax, and a given piece of functionality is generally written the same way by everyone.
- Python is flexible for doing something complex that has never been done before.

**Important libraries in Python: **

- **Pandas** to manipulate data easily.
- **SciPy/NumPy** for scientific computing.
- **Matplotlib** to make interactive graphs.
- **statsmodels** for exploratory data analysis and estimating statistical models.
- **scikit-learn** to use machine learning algorithms.
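As a quick, hypothetical taste of two of these libraries, here is a minimal sketch combining NumPy and Pandas (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

marks = np.array([78, 85, 91])  # a NumPy array of made-up marks
df = pd.DataFrame({"student": ["a", "b", "c"], "marks": marks})

df["passed"] = df["marks"] >= 80  # vectorised comparison, no explicit loop

print(df["passed"].tolist())   # [False, True, True]
print(int(df["marks"].sum()))  # 254
```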

**Flexibility:** Easy to build models from scratch. i.e., matrix computation and optimization, etc.

**Database size**: Python can handle the huge size (in GBs) of the dataset.

R was more popular among analysts and data scientists until around 2015-2016; in the last 2-3 years, Python has gained a lot of popularity.

**KDnuggets** ran a survey to figure out the top tools among data analyst and data scientist professionals.

There is a tremendous demand for R and Python data analytics and data science professionals in MNCs like Google, Facebook, Microsoft, Mu Sigma, Amazon, etc. The average annual salaries were $110,000 (R) and $95,000 (Python).

Now you have got a brief comparison on R vs Python. You can use any one for data analysis and data science that best fits your needs.

*Consequently, both the R and Python languages have their own strengths in statistical analysis and model deployment.*

These programming languages have a lot of similarities in terms of syntax. You can choose to work with any of them. Now you may come to know the strengths of these programming languages over each other and their approach.

*Learn skills for **data science** and **data analytics*

*Suggested reading:*

HOW ARTIFICIAL INTELLIGENCE IS IMPACTING INDUSTRIES


The post Spark Integration With Jupyter Notebook In 10 Minutes appeared first on AcadGild.

This blog gives you a detailed explanation of how to integrate Apache Spark with Jupyter Notebook on Windows.

Jupyter Notebook is a popular application that enables you to run PySpark code before running the actual job on the cluster. In addition, it is user-friendly, so in this blog we are going to show you how to integrate PySpark with Jupyter Notebook.

**Install and configure anaconda on windows.**

- Step by Step Guide To Install Anaconda (Jupyter notebook)

**Setup Winutils For Hadoop and Spark.**

- Download and setup winutils.exe

**Install Spark On Windows **

- Download Spark Binaries
- Create Folders For Spark
- Set Environment Variables For Spark

**Integrate Spark With Jupyter Notebook**

- Install Find Spark Module.
- Run the Spark Code In Jupyter Notebook

**System Prerequisites:**

- Installed Anaconda software
- Minimum 4 GB RAM
- Minimum 500 GB Hard Disk

Before jumping into the installation process, you have to install the Anaconda software, which is the first requirement mentioned in the prerequisites section.

“Installing Anaconda On Windows” – Ajit Khutal

“Installing Pyspark On Windows” – Ajit Khutal

Link To Download Spark: https://spark.apache.org/downloads.html

Extract the downloaded file into the pyspark folder which we created earlier in Step 1.

**Download the winutils.exe file from the link below and store it in the /hadoop/bin location created in Step 3.**

**Link:** Winutils.exe

“Set Environment Variables For PySpark” – Ajit Khutal

**Variable name:** SPARK_HOME
**Variable value:** D:\pyspark

**Here We have successfully set the user variables for pyspark.**

**Now we will set the “system variables” for spark**

**Path** = D:\pyspark\bin

So we have successfully set the user and system environment variables for pyspark.

“Set the Environment Variable For Hadoop (winutils.exe)” – Ajit Khutal

**As we did previously to set the PySpark environment variables, we have to do the same for Hadoop (winutils.exe).**

**User Variable**

**System Variable **

**After clicking the Edit button you will get a new window as shown in the image below; click the New button and type the path D:\hadoop\bin.**

We have completed setting up the environment variables for Hadoop (winutils.exe) and PySpark.
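Once the variables are set, you can sanity-check them from Python before launching a notebook. This is a minimal, hypothetical sketch: the expected paths below are the ones used in this guide and may differ on your machine.

```python
import os

# Hypothetical expected values, matching the paths used in this guide;
# adjust them to your own install locations.
expected = {
    "SPARK_HOME": r"D:\pyspark",
    "HADOOP_HOME": r"D:\hadoop",
}

def check_env(env=os.environ):
    """Return the variables that are missing or set to a different value."""
    return {k: env.get(k) for k, v in expected.items() if env.get(k) != v}

# Simulated, correctly configured environment for demonstration:
fake_env = {"SPARK_HOME": r"D:\pyspark", "HADOOP_HOME": r"D:\hadoop"}
print(check_env(fake_env))  # {} means everything is set as expected
```

An empty dictionary means the variables match; any entries returned point to variables that still need fixing.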

“Integrate Pyspark With Jupyter Notebook” – Ajit Khutal

As you can see from the above screenshot, we have successfully installed Spark and integrated it with the Jupyter notebook.

We hope the post above was helpful in showing you how to integrate Spark and PySpark with a Jupyter notebook.

Keep visiting our website AcadGild for further updates on data science and other technologies.


The post Learning List Comprehension in Python appeared first on AcadGild.

We have discussed the List and other collection types in our earlier blogs. We recommend our readers go through the collection details in the blog linked below.

In this blog, we will learn about the collection **“List”** and list comprehension.

A List is an ordered collection of data, meaning its indices are fixed, and it is mutable, i.e., items inside a list can be changed at any point in time.

The items inside a list are separated by** ‘,’ **and are enclosed within **‘[ ]’.**

Eg: list1 = [ 1, 14, 8, 3, 23]

list2 = [ “Red”, “Yellow”, “Green”, “Blue”]

We can perform numerous operations on the list. Let us understand each operation with an example.

**ACCESSING ITEMS FROM A LIST**

We can access items from a list by referring to its index number. Indexing of a list starts from 0 to (length-1).

To print the first and last item of a list:

list = ["Red", "Yellow", "Green", "White"]
print(list)
print("Printing the first item of the list: ", list[0])
print("Printing the last item of the list: ", list[-1])  # negative indexing is done from backward

**CHANGING LIST ITEM VALUE**

Since items are mutable, we can change the value of any of the items as:

list = ["Red", "Yellow", "Green", "White"]
list[1] = "Pink"  # inserting string ‘Pink’ at index 1
print("The updated list:", list)

**INSERTING ITEMS TO A LIST**

We can insert new items in an existing list with the help of** append() and insert()** method.

The difference between the two methods is that *append() adds an item at the end of the list*, while *insert() adds an item at the specified index*.

list = ["Red", "Yellow", "Green", "White"]
list.append("Purple")  # appending string ‘Purple’ at the end of the list
print(list)
list.insert(2, "Black")  # inserting string ‘Black’ at the 2nd index
print(list)

**DELETING ITEMS FROM A LIST**

We can delete item/s from a list by a number of methods like **pop(), clear(), remove() and ‘del’ keyword.**

*pop() method removes the last item of a list or at the specified index.*

** clear() method empties the list**.

*remove() method removes the specified item.*

*The del keyword removes an item at a specified index or deletes the list completely.*

list = ['Red', 'Yellow', 'Black', 'Green', 'White', 'Purple']
list.pop()  # removes the last element
print(list)
list.remove("Yellow")  # removes the specified item; note that it is case-sensitive
print(list)
del list[0]  # deletes the item at the 0th index, i.e., the first item
print(list)
list.clear()  # clears the list
print(list)

**LENGTH OF LIST**

We can determine the length of a list, i.e., the number of items in it, with the len() method.

list = ['Red', 'Yellow', 'Black', 'Green', 'White', 'Purple']
len(list)

**LIST SLICING**

List slicing is the method of splitting a list into its subset. We do this with the help of the indices of the list items.

list1 = [31, 2, 16, 80, 3, 29, 19, 43, 61, 50]
print(list1[:])     # printing all the items of the list
print(list1[:5])    # items from the start (0th index) up to, but excluding, the 5th index
print(list1[2:6])   # items from the 2nd index (16) up to, but excluding, the 6th index
print(list1[3:-1])  # items from the 3rd index (80) up to, but excluding, the last index

During slicing, when we specify two indices, the item at the last index (n) is excluded and items up to position (n-1) are taken.
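A quick check of this exclusion rule:

```python
nums = [10, 20, 30, 40, 50]
# nums[1:4] takes the items at indices 1, 2 and 3; index 4 itself is excluded
print(nums[1:4])  # [20, 30, 40]
```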

**LOOPING THROUGH A LIST**

When we have a number of items in a list, we can loop through the list with the help of ‘for’ loop as shown below:

list1 = [31, 2, 16, 80, 50]
for x in list1:  # looping through ‘list1’ and printing one item at a time
    print(x)

Another example of list looping where we have added 5 to each element in the list.

list1 = [31, 2, 16, 80, 50]
for x in list1:
    x = x + 5
    print(x)

**LIST COMPREHENSION**

List comprehensions are Python constructs used to create new lists, sets, dictionaries, etc. from sequences that have already been created.

It reduces loops and makes code easier to read.

As list comprehensions return lists, they consist of brackets containing an expression, which is executed for each element, along with a for loop that iterates over the elements.

The comprehension is usually shorter, more readable, and more efficient.
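As a small illustration of that claim, here is the same list built with an explicit loop and with a comprehension:

```python
# Explicit loop: three lines and a manual append
squares_loop = []
for i in range(5):
    squares_loop.append(i * i)

# Equivalent comprehension: one readable line
squares_comp = [i * i for i in range(5)]

print(squares_loop)  # [0, 1, 4, 9, 16]
print(squares_comp)  # [0, 1, 4, 9, 16]
```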

Types of comprehension:

- List

[ i*2 for i in range(3) ]

- Set

{ i*2 for i in range(3)}

- Dictionary

d = {key: value for item in sequence …}

{ i: i*2 for i in range(3)}

We will understand the concept of list comprehension better through the examples below.

**1. Squaring values of a list and a set**

def square(list):
    return [i ** 2 for i in list]

square(range(0, 11))

In the above program, we have created a function with name ‘square’ and passed a variable ‘list’ as the argument. Then we implemented list comprehension to square each value of the list. And then we defined the list to be in the range 0 to 10 as the last range value is excluded. Therefore the output would be like:

setvalue = {0, 1, 2, 3, 4, 5}
square = [i ** 2 for i in setvalue]
print(square)
type(square)

In the above program, a set of values is stored in a variable called ‘setvalue’ from 0 to 5. We implemented list comprehension to square each value in the setvalue and stored it in variable ‘square’. To know the collection type of ‘square’ we can check it by using type(square) in our code. Therefore the output would be something like this.

**2. Converting temperature from Centigrade to Fahrenheit.**

ctemps = [17.1, 22.3, 18.4, 19.1]  # temperature in celsius
ftemp = [((i * 9/5) + 32) for i in ctemps]  # converting each value of ctemps into fahrenheit
print(ftemp)  # printing ftemp

**3. Working with Strings**

Given is a list containing the names of people, both first and last name. We will perform two operations on the list: first, extract only the last name, and second, print each name in reverse order.

The above operations will be carried out by using the **split()** method.

*split() method returns a list of strings after breaking the given string by the specified separator (i.e., by which the list is separated).*

**syntax: str.split(delimiter)**


names = ["Isaac Newton", "Albert Einstein", "Niels Bohr", "Marie Curie", "Charles Darwin", "Louis Pasteur", "Galileo Galilei", "Margaret Mead"]
x = [i.split()[1] for i in names]  # splitting on spaces and keeping the word at index 1, i.e., the last name
x

x = [i.split()[::-1] for i in names]  # splitting on spaces and reversing the word order using [::-1]
x

**4. Printing cubes of first 10 natural numbers**

def cube(list):
    return [i ** 3 for i in list]

cube(range(0, 10))

In the above program, we have created a function named cube and passed a variable list as the argument. Then list comprehension is implemented where for each value in the list its cube is calculated. The range for the list is given from 0 to 10 where 10 being excluded.

Therefore the output would be:

**5. Finding common words in 2 lists **

lst_1 = "I love Python coding"
lst_2 = "I am learning Data Science with Python"
[s for s in lst_1.split() if s in lst_2.split()]  # keeping every word of lst_1 (separated by spaces) that is also present in lst_2

**6. List Comprehension to get the given output**

**l1=[1,2,3,4,5] **

**l2=[4,5,6,7]**

**output 1 : [1,2,3]****output 2 : [6,7]****output 3 : [4,5]**

l1 = [1, 2, 3, 4, 5]
l2 = [4, 5, 6, 7]
lc1 = [i for i in l1 if i < 4]  # checking for numbers less than 4 in list 1
lc1

lc2 = [i for i in l2 if i > 5]  # checking for numbers greater than 5 in list 2
lc2

lc3 = [i for i in l2 if i < 6]  # checking for numbers less than 6 in list 2
lc3

**7. Summing the numbers when two dice are rolled**

Note: the sum of two numbers in tuples should be more than 7.

x = [(i, j) for i in range(1, 7) for j in range(1, 7) if i + j > 7]  # i and j are the numbers shown on the two dice (1 to 6)
x

**8. For given input , produce given output**

**inp = "Hello Python World"**

**output 1 : 'dlroW nohtyP olleH'**
**output 2 : 'olleH nohtyP dlroW'**
**output 3 : 'World Python Hello'**

inp = "Hello Python World"
out1 = inp[::-1]  # reverses the whole string
out1

out2 = ' '.join([x[::-1] for x in inp.split(' ')])  # reverse each word in place, keeping the word order
out2

*The join() method is a string method that returns a string in which the elements of a sequence have been joined by the str separator.*
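A tiny example of join() on its own:

```python
joined = "-".join(["a", "b", "c"])  # place "-" between each element
print(joined)  # → 'a-b-c'
```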

out3 = ' '.join([x for x in inp.split(' ')[::-1]])  # reverse the order of the words themselves
out3

**9. Removing vowels from a string**

inp = "HellO python wOrld !"
out1 = ''.join([i for i in inp if i.lower() not in ['a', 'e', 'i', 'o', 'u']])  # keep only characters that are not vowels (case-insensitive)
out1

**10.Replacing missing spaces in a string with the least frequent character**

- Input : ‘dbc deb abed gade’
- Output: ‘dbccdebcabedcgade’

import pandas as pd  # pandas is used here to build a Series of characters

my_str = 'dbc deb abed gade'  # the given input
ser = pd.Series(list(my_str))  # a 1-D indexed array, one character per entry
ser

freq = ser.value_counts()  # frequency of each character, in descending order
print(freq)

least_freq = freq.dropna().index[-1]  # 'c' and 'g' are tied for least frequent; index[-1] takes the last one, 'c'
least_freq

out1 = "".join(ser.replace(' ', least_freq))  # replace each space with 'c' and join the characters back together
out1
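The same replacement can be sketched without pandas, using collections.Counter from the standard library (an alternative approach, not the one used above; on a tie, min() keeps the first least-frequent character it encounters):

```python
from collections import Counter

my_str = 'dbc deb abed gade'
counts = Counter(my_str.replace(' ', ''))  # character frequencies, ignoring the spaces themselves
least = min(counts, key=counts.get)        # a least frequent character ('c' here, since it precedes 'g')
out = my_str.replace(' ', least)
print(out)  # → 'dbccdebcabedcgade'
```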

**11. Create a numpy 4*4 array and get the given output**

**Input:**

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

**Output:**

array([[ 4,  6],
       [12, 14]])

import numpy as np  # numpy is used to create the array

x = np.arange(16).reshape(4, 4)  # a 4x4 matrix of the values 0 to 15
x

x[1::2, ::2]  # rows 1 and 3 (every second row starting from row 1), columns 0 and 2 (every second column)

**12. Creating a DataFrame and performing operations on it**

In this example, we will create a DataFrame and filter it on a given condition.

import pandas as pd

df = pd.DataFrame({
    'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'],
    'Name': ['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia'],
    'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
})
df

In the code above, we imported the Pandas library and created a DataFrame with three columns: DateOfBirth, Name, and State. The output will be:

Now we have to find all rows in the DataFrame above where 'Name' contains "ane" or 'State' is "TX".

df1 = df[(df['Name'].str.contains("ane")) | (df['State'].str.contains("TX"))]
df1

df['Name'] selects the Name column of the DataFrame. df['Name'].str lets us apply string methods (e.g., lower, contains) to each element of the column.

df['Name'].str.contains('ane') checks whether each element of the column contains the string 'ane' as a substring. The result is a Series of Booleans (True or False).

df[df['Name'].str.contains('ane')] applies that Boolean mask to the DataFrame and returns a view containing the matching records.
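To see the mask on its own, here is a minimal sketch with a two-element Series (the names are chosen purely for illustration):

```python
import pandas as pd

s = pd.Series(['Jane', 'Aaron'])
mask = s.str.contains('ane')  # element-wise substring check
print(mask.tolist())          # → [True, False]
print(s[mask].tolist())       # → ['Jane']
```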

This brings us to the end of the blog. We hope it helped you learn List Comprehension in Python from scratch. You can refer to our blogs on Python Libraries to understand List Comprehension in a better way.

Keep visiting our website AcadGild for blogs related to Data Science and Big Data.

The post Learning List Comprehension in Python appeared first on AcadGild.
