*Linear Regression Model Building using Air Quality data set with R.*

In this blog, we will discuss how to build a linear regression model and use it for prediction.

Here we will use the airquality data set, which is available in R, to build a linear regression prediction model.

Before going further, let us first understand what linear regression is and why it matters.

**Linear regression establishes a relationship between a dependent variable, i.e. Y, and one or more independent variables, i.e. X, using a best-fit straight line known as the regression line.** How well this line fits the data is generally measured by **R²**, the coefficient of determination.

**The equation of the regression line can be used to predict the value of Y for any given X.**

Let us look at the syntax of the linear model: lm(target ~ predictor, data = df)

When we use the *lm()* function, we specify the data frame with the data = parameter.

df = the data frame that contains the variables.

The target ~ predictor syntax tells the lm() function which variable is the “target” we want to predict and which is the “predictor”, the x variable that we use as input for the prediction.

##### In mathematics the linear equation is given as:

Y = mX + c, where m = slope of the straight line and c = Y-intercept

Or y = b0 + b1x, where y = predicted value, b0 = intercept, b1 = slope, x = predictor
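As a small illustration of this equation, the sketch below fits lm() to a made-up data frame whose points lie exactly on y = 2 + 3x, so the fitted intercept and slope should recover b0 and b1. The names toy_df, x and y are hypothetical and are not part of the airquality example.

```r
# Hypothetical toy data lying exactly on the line y = 2 + 3x
toy_df <- data.frame(x = 1:5)
toy_df$y <- 2 + 3 * toy_df$x

# target ~ predictor, with the data frame passed through data =
toy_model <- lm(y ~ x, data = toy_df)

# coef() returns the intercept (b0) followed by the slope (b1)
b <- coef(toy_model)
print(b)
```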

Here the dependent variable is the target variable, which is continuous, and the independent variable is the predictor, which can be continuous or discrete.

**For further analysis, we will use the air quality dataset that comes with R.**

The airquality data set is a standard built-in data set that makes it convenient to work on linear regression. You can access it by typing airquality in your R console. It consists of 153 observations (rows) and 6 variables (columns): Ozone, Solar.R, Wind, Temp, Month, Day.

**To build the model we need the airquality data set, which is already available in R.**

We first load the data set in R and then process it step by step.

##### What does the air quality data set look like?

The command View(airquality) displays the data set in a viewer in your R environment.

We can also inspect the airquality data set from the R console with the str() command.
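A minimal way to run this check, assuming nothing beyond a standard R installation:

```r
# airquality ships with R in the datasets package, so no import is needed
str(airquality)

# The dimensions can also be read directly: rows first, then columns
dims <- dim(airquality)
print(dims)
```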

The str() command shows that airquality is a data frame with 153 observations of 6 variables.

We can also check with the head() command, which returns the first 6 records by default.
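For example, a short sketch (the name first_rows is only illustrative):

```r
# head() returns the first 6 rows unless told otherwise
first_rows <- head(airquality)
print(first_rows)

# The n argument changes how many rows are returned
print(head(airquality, n = 10))
```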

Let’s process the data set.

Using the summary() command, we can see a summary of the data set, including the number of NA (missing) values in each column.

**Ozone has 37 missing values and Solar.R has 7 missing values in the data set.**
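These counts can be checked with a short sketch like the one below; na_counts is a hypothetical variable name, and colSums(is.na(...)) is used here as a compact alternative to reading the counts off the summary() output.

```r
# Per-column summary; the NA's row reports the missing values
print(summary(airquality))

# Count the missing values in every column directly
na_counts <- colSums(is.na(airquality))
print(na_counts)
```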

Now let’s impute the missing values in Ozone and Solar.R by replacing each of them with the monthly mean of its column.
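One way this imputation could be written is sketched below. The copy named aq and the helper variable same_month are assumptions made for illustration; the monthly means are computed from the untouched airquality data so that earlier replacements do not influence later ones.

```r
# Work on a copy (aq is a hypothetical name) so the built-in data stays untouched
aq <- airquality

for (i in 1:nrow(aq)) {
  # Rows of the original data that fall in the same month as row i
  same_month <- airquality$Month == aq$Month[i]

  if (is.na(aq$Ozone[i])) {
    aq$Ozone[i] <- mean(airquality$Ozone[same_month], na.rm = TRUE)
  }
  if (is.na(aq$Solar.R[i])) {
    aq$Solar.R[i] <- mean(airquality$Solar.R[same_month], na.rm = TRUE)
  }
}

# Count the missing values that remain in each column
print(colSums(is.na(aq)))
```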

The idea is to loop over 1:nrow(), from the first to the last row of the data set, and test each value with the is.na() command. A missing value in Ozone or Solar.R is replaced by the mean of that column for the same month, computed with the mean() command using the argument na.rm = TRUE so that NA values are ignored.

After this step there are no NA or missing values left in the data set. Handling missing values is a very important part of data cleaning.

We will discuss more data cleaning and data wrangling techniques in upcoming blogs.

**Now let’s normalize the data set.**

Normalization rescales the values of each column into the range [0,1] using (x - min) / (max - min); this is also called min-max scaling.
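A minimal sketch of min-max scaling on the cleaned data follows. It repeats the monthly-mean imputation in compact form so that it runs on its own, and it assumes that every column, including Month and Day, is scaled; the names aq, aq_norm and normalize are hypothetical.

```r
# Impute first: min() and max() would return NA if missing values remained
aq <- airquality
for (col in c("Ozone", "Solar.R")) {
  missing <- is.na(aq[[col]])
  monthly_mean <- ave(aq[[col]], aq$Month,
                      FUN = function(v) mean(v, na.rm = TRUE))
  aq[[col]][missing] <- monthly_mean[missing]
}

# Min-max scaling: (x - min) / (max - min)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply the scaling to every column and rebuild the data frame
aq_norm <- as.data.frame(lapply(aq, normalize))
print(summary(aq_norm))
```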

Normalization transforms the data into a range between 0 and 1, and there are no missing values left in the data set. Note that min-max scaling does not remove outliers; it only rescales them along with the rest of the values.

Now apply the **Linear regression algorithm** using the **Least Squares Method** on “Ozone” and “Solar.R”.

We select Ozone as the target attribute Y and Solar.R as the predictor attribute X, and build model_1 with the lm() function to examine the relationship between X and Y.
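The fit could look like the sketch below. The helper prepare_airquality() is hypothetical: it bundles the monthly-mean imputation and min-max scaling described earlier so that the block runs on its own, and the exact coefficients depend on those preprocessing assumptions.

```r
# Hypothetical helper: monthly-mean imputation followed by min-max scaling
prepare_airquality <- function() {
  aq <- airquality
  for (col in c("Ozone", "Solar.R")) {
    missing <- is.na(aq[[col]])
    monthly_mean <- ave(aq[[col]], aq$Month,
                        FUN = function(v) mean(v, na.rm = TRUE))
    aq[[col]][missing] <- monthly_mean[missing]
  }
  as.data.frame(lapply(aq, function(x) (x - min(x)) / (max(x) - min(x))))
}

aq_norm <- prepare_airquality()

# Ozone is the target (Y), Solar.R is the predictor (X)
model_1 <- lm(Ozone ~ Solar.R, data = aq_norm)
print(model_1)            # intercept and slope
print(summary(model_1))   # adds standard errors, p-values and R-squared

# Pearson correlation between the two attributes
print(cor(aq_norm$Solar.R, aq_norm$Ozone))
```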

We observe that model_1 provides the coefficients of the regression line, that is, the slope and the Y-intercept.

##### Let’s plot the graph between X and Y

The scatter plot shows Y (Ozone) against X (Solar.R).

Here we add a regression line to the scatter plot to see the relationship between X and Y.
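A sketch of the scatter plot with the regression line added is given below, using base R graphics. The helper prepare_airquality() is hypothetical and repeats the preprocessing described earlier so that the block is self-contained; the axis labels and colours are arbitrary choices.

```r
# Hypothetical helper: monthly-mean imputation followed by min-max scaling
prepare_airquality <- function() {
  aq <- airquality
  for (col in c("Ozone", "Solar.R")) {
    missing <- is.na(aq[[col]])
    monthly_mean <- ave(aq[[col]], aq$Month,
                        FUN = function(v) mean(v, na.rm = TRUE))
    aq[[col]][missing] <- monthly_mean[missing]
  }
  as.data.frame(lapply(aq, function(x) (x - min(x)) / (max(x) - min(x))))
}

aq_norm <- prepare_airquality()
model_1 <- lm(Ozone ~ Solar.R, data = aq_norm)

# Scatter plot of X (Solar.R) against Y (Ozone)
plot(aq_norm$Solar.R, aq_norm$Ozone,
     xlab = "Solar.R (normalized)", ylab = "Ozone (normalized)",
     main = "Ozone vs Solar.R")

# abline() accepts a fitted lm object and draws its regression line
abline(model_1, col = "red", lwd = 2)
```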

The line slopes upward, hence there is a positive correlation between Ozone and Solar.R.

So if we increase the value of X, the predicted value of Y also increases, and vice versa.

The graph shows the regression line between X and Y and the positive correlation between the two attributes.

*We follow the same steps for model_2 as we did for model_1.*

We select Ozone as the target attribute Y and Wind as the predictor attribute X, and build model_2 with the lm() function to examine the relationship between X and Y.

Apply the linear regression algorithm using the Least Squares Method on “Ozone” and “Wind”.
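The second model can be sketched the same way. As before, prepare_airquality() is a hypothetical helper that repeats the preprocessing described earlier, and the fitted coefficients depend on those assumptions.

```r
# Hypothetical helper: monthly-mean imputation followed by min-max scaling
prepare_airquality <- function() {
  aq <- airquality
  for (col in c("Ozone", "Solar.R")) {
    missing <- is.na(aq[[col]])
    monthly_mean <- ave(aq[[col]], aq$Month,
                        FUN = function(v) mean(v, na.rm = TRUE))
    aq[[col]][missing] <- monthly_mean[missing]
  }
  as.data.frame(lapply(aq, function(x) (x - min(x)) / (max(x) - min(x))))
}

aq_norm <- prepare_airquality()

# Ozone is the target (Y), Wind is the predictor (X)
model_2 <- lm(Ozone ~ Wind, data = aq_norm)
print(model_2)            # intercept and slope
print(summary(model_2))   # adds standard errors, p-values and R-squared

# Pearson correlation between the two attributes
print(cor(aq_norm$Wind, aq_norm$Ozone))
```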

We can see that model_2 provides the coefficients of the regression line, that is, the slope and the Y-intercept.

##### Let’s plot the graph between X and Y

The scatter plot shows Y (Ozone) against X (Wind).

Here we add a regression line to the scatter plot to see the relationship between X and Y.
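The corresponding plot for Ozone and Wind might be produced as follows. Again, prepare_airquality() is a hypothetical helper repeating the earlier preprocessing, and the labels and colours are arbitrary.

```r
# Hypothetical helper: monthly-mean imputation followed by min-max scaling
prepare_airquality <- function() {
  aq <- airquality
  for (col in c("Ozone", "Solar.R")) {
    missing <- is.na(aq[[col]])
    monthly_mean <- ave(aq[[col]], aq$Month,
                        FUN = function(v) mean(v, na.rm = TRUE))
    aq[[col]][missing] <- monthly_mean[missing]
  }
  as.data.frame(lapply(aq, function(x) (x - min(x)) / (max(x) - min(x))))
}

aq_norm <- prepare_airquality()
model_2 <- lm(Ozone ~ Wind, data = aq_norm)

# Scatter plot of X (Wind) against Y (Ozone)
plot(aq_norm$Wind, aq_norm$Ozone,
     xlab = "Wind (normalized)", ylab = "Ozone (normalized)",
     main = "Ozone vs Wind")

# Draw the fitted regression line on top of the points
abline(model_2, col = "blue", lwd = 2)
```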

The line slopes downward, hence there is a negative correlation between Ozone and Wind.

So if we increase the value of X, the predicted value of Y decreases, and vice versa.

The graph shows the regression line between X and Y and the negative correlation between the two attributes.

**Let’s perform prediction on the Ozone level with model_1 and model_2.**

**Predict the Ozone level when Solar.R radiation is 10**

Hence the predicted Ozone level is 1.049993 when solar radiation is 10.

**Predict the Ozone level when Wind is 5**

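Hence the predicted Ozone level is -21.46849 when the wind is 5. A negative Ozone level is not physically meaningful, which suggests that this input lies outside the range over which the straight-line fit can be trusted.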
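Predictions like the two above can be obtained with predict(), as in the sketch below. It assumes the models were fitted on the imputed and min-max scaled data through the hypothetical helper prepare_airquality(), so the printed values depend on that preprocessing and need not match the figures quoted above.

```r
# Hypothetical helper: monthly-mean imputation followed by min-max scaling
prepare_airquality <- function() {
  aq <- airquality
  for (col in c("Ozone", "Solar.R")) {
    missing <- is.na(aq[[col]])
    monthly_mean <- ave(aq[[col]], aq$Month,
                        FUN = function(v) mean(v, na.rm = TRUE))
    aq[[col]][missing] <- monthly_mean[missing]
  }
  as.data.frame(lapply(aq, function(x) (x - min(x)) / (max(x) - min(x))))
}

aq_norm <- prepare_airquality()
model_1 <- lm(Ozone ~ Solar.R, data = aq_norm)
model_2 <- lm(Ozone ~ Wind, data = aq_norm)

# newdata must use the same column name as the predictor in the formula
pred_1 <- predict(model_1, newdata = data.frame(Solar.R = 10))
pred_2 <- predict(model_2, newdata = data.frame(Wind = 5))

print(pred_1)
print(pred_2)
```

Because the predictors were scaled to [0,1] in this sketch, inputs such as 10 and 5 lie far outside the fitted range, so the results are extrapolations rather than reliable estimates.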

*We hope this blog helped you understand Linear Regression Model Building using the Air Quality data set with R.*

You can refer to the link *https://acadgild.com/blog/55690-2* to learn about Mean, Median and Mode using R.

*Keep visiting our site* **www.acadgild.com** *for more updates on Data Analytics and other technologies.*