In the previous blog, we discussed using histograms to examine the central tendency of a continuous variable. We also plotted the frequencies of different ordinal variables to see how observations are distributed across categories.
In this blog, we will discuss other types of data visualization, their uses, and how to interpret them.
Kernel density plots (KDPs) are an effective way to illustrate the distribution of a variable. Kernel density estimation is a useful non-parametric technique for visualizing the underlying distribution of a continuous variable. Histograms can be a poor method for determining the shape of a distribution because they are strongly affected by the number of bins used. For example, plotting the same data with fewer bins can make it appear normally distributed even when it is not.
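To make the bin-count problem concrete, here is a minimal sketch. The data is simulated for illustration only (it is not the loan data used in this blog), and the bin widths are arbitrary choices. With wide bins the two peaks merge into one block, while narrow bins and the kernel density curve can reveal them.

```r
# Simulated two-peaked data (hypothetical, for illustration only)
set.seed(42)
x <- c(rnorm(200, mean = 40, sd = 5), rnorm(200, mean = 60, sd = 5))

# Same data, two different bin widths
h_coarse <- hist(x, breaks = seq(0, 100, by = 25), plot = FALSE)  # 4 wide bins
h_fine   <- hist(x, breaks = seq(0, 100, by = 2),  plot = FALSE)  # 50 narrow bins

# Draw both histograms and the kernel density estimate side by side
op <- par(mfrow = c(1, 3))
plot(h_coarse, main = "4 bins", xlab = "x")
plot(h_fine, main = "50 bins", xlab = "x")
plot(density(x), main = "Kernel density", xlab = "x")
par(op)
```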
The R function density(x), where x is a numeric vector, computes a kernel density estimate. The user can optionally specify a kernel other than the default Gaussian. The resulting estimate can be drawn with either the plot() or lines() function.
# Kernel Density Plot
d <- density(amount_in_thousand)  # returns the density data
plot(d, main = "Kernel Density of amount", xlab = "amount (in thousand)")  # plots the results

# Filled Density Plot
d <- density(amount_in_thousand)
plot(d, main = "Kernel Density of amount", xlab = "amount (in thousand)")
polygon(d, col = "yellow", border = "red")
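The plots above use the default Gaussian kernel and the default bandwidth. The sketch below shows how the kernel and adjust arguments of density() change the estimate. The vector amount_demo is a simulated, hypothetical stand-in for amount_in_thousand, so the shapes will not match the real loan data.

```r
# Hypothetical stand-in for amount_in_thousand (simulated, right-skewed)
set.seed(1)
amount_demo <- rgamma(500, shape = 2, scale = 75)

d_gauss <- density(amount_demo)                           # default Gaussian kernel
d_epan  <- density(amount_demo, kernel = "epanechnikov")  # a different kernel
d_wide  <- density(amount_demo, adjust = 2)               # default bandwidth doubled

# Overlay the three estimates on one plot
plot(d_gauss, main = "Effect of kernel and bandwidth", xlab = "amount (in thousand)")
lines(d_epan, col = "blue", lty = 2)
lines(d_wide, col = "red", lty = 3)
legend("topright",
       legend = c("gaussian", "epanechnikov", "gaussian, adjust = 2"),
       col = c("black", "blue", "red"), lty = 1:3)
```

In practice the bandwidth usually matters more than the kernel: a larger adjust value gives a smoother curve that may hide detail, while a smaller one gives a bumpier curve.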
# Comparing densities across multiple categories
library(sm)

# Plot densities per status (status is assumed to be a factor)
sm.density.compare(amount_in_thousand, status, xlab = "amount in thousand")
title(main = "amount in thousand of all status")
# Add legend via mouse click; one fill colour per level (2, 3, ...)
colfill <- 2:(1 + length(levels(status)))
legend(locator(1), levels(status), fill = colfill, title = "loan status")

# Plot densities per loan duration
sm.density.compare(amount_in_thousand, duration, xlab = "amount in thousand")
title(main = "amount in thousand by loan duration")
# Add legend via mouse click
colfill <- 2:(1 + length(levels(as.factor(duration))))
legend(locator(1), levels(as.factor(duration)), fill = colfill, title = "loan duration")
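Because locator(1) waits for a mouse click, the code above only works in an interactive session. As an alternative, here is a rough sketch in base R that overlays one density curve per group and places the legend at a fixed position. The data and the names amount_demo and status_demo are hypothetical, and this is not how sm.density.compare() works internally.

```r
# Hypothetical data: loan amounts (in thousands) for two status groups
set.seed(7)
amount_demo <- c(rgamma(300, shape = 2, scale = 60),
                 rgamma(300, shape = 3, scale = 80))
status_demo <- factor(rep(c("A", "B"), each = 300))

# One density estimate per group
dens <- lapply(split(amount_demo, status_demo), density)

# Common axis limits so that no curve is clipped
x_lim <- range(sapply(dens, function(d) range(d$x)))
y_lim <- c(0, max(sapply(dens, function(d) max(d$y))))

# One colour per group: 2, 3, ...
cols <- seq_along(dens) + 1

# Empty plot first, then one curve per group
plot(NA, xlim = x_lim, ylim = y_lim,
     xlab = "amount in thousand", ylab = "Density",
     main = "amount in thousand by status")
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i])

# Legend at a fixed corner instead of a mouse click
legend("topright", legend = names(dens), fill = cols, title = "loan status")
```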
A scatter plot is a type of mathematical diagram that uses Cartesian coordinates to display the values of, typically, two variables for a set of data. If the points are color-coded, the number of displayed variables can be increased to three.
It is also known as a scatter chart, scattergram, scatter diagram or scatter graph. The scatter diagram is one of the seven basic tools of quality control. A scatter plot can be used in either of the following situations:
- When one continuous variable is under the control of the experimenter and the other depends on it.
- When neither continuous variable depends on the other; the plot then shows only the degree of correlation between them, not causation.
A scatter plot can suggest various kinds of correlation between variables with a certain confidence interval. Correlations may be positive (rising), negative (falling), or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation. A line of best fit (alternatively called a ‘trendline’) can be drawn in order to study the relationship between the variables.
An equation for the correlation between the variables can be determined by established best-fit procedures. For a linear correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct solution in a finite time.
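As a small illustration of the best-fit procedure, the sketch below computes the correlation coefficient and fits a linear regression with lm(). The data is simulated and the names amount_demo and installment_demo are hypothetical; the slope of 0.03 is an arbitrary assumption chosen for the example.

```r
# Hypothetical data: installment grows roughly linearly with loan amount
set.seed(123)
amount_demo <- runif(100, min = 10, max = 500)
installment_demo <- 0.03 * amount_demo + rnorm(100, sd = 1)

# Strength and direction of the linear relationship (between -1 and 1)
r <- cor(amount_demo, installment_demo)

# Least-squares line: installment = intercept + slope * amount
fit <- lm(installment_demo ~ amount_demo)
slope <- unname(coef(fit)[2])

print(r)
print(coef(fit))

# For simple linear regression, slope = r * sd(y) / sd(x)
print(r * sd(installment_demo) / sd(amount_demo))
```

The last line shows how the two ideas are linked: in a regression with a single predictor, the fitted slope is the correlation coefficient rescaled by the ratio of the standard deviations.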
# Simple Scatterplot: the larger the loan amount, the larger the installment
plot(amount_in_thousand, installaments_in_thousand, main = "Scatterplot Example",
     xlab = "loan amount taken", ylab = "installment", pch = 19)

# Add fit lines
abline(lm(installaments_in_thousand ~ amount_in_thousand), col = "red")     # regression line (y ~ x)
lines(lowess(amount_in_thousand, installaments_in_thousand), col = "blue")  # lowess line (x, y)
Here we see a linear relationship between the loan amount taken and the installment, i.e. the bigger the loan amount, the greater the installment. But the plot also suggests that a third attribute influences the relationship between these two; let’s try to see that more clearly.
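# Enhanced scatterplot of installaments_in_thousand vs amount_in_thousand, by duration
library(car)
# Note: legend.coords is kept from the original call; newer releases of car
# may expect legend = list(coords = "bottomright") instead.
scatterplot(installaments_in_thousand ~ amount_in_thousand | duration,
            data = loan_date_loan_amt_payment_duration,
            xlab = "loan amount (thousand)",   # x-axis is the loan amount
            ylab = "installment (thousand)",   # y-axis is the installment
            main = "installment vs loan amount with \nrepayment duration",
            legend.coords = "bottomright")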
The third attribute is the duration for which the loan is taken. Interpreting all three together, the plot indicates that:
- People taking smaller loans tend to repay them fairly early, i.e. in 12, 24 or 36 months.
- The higher the loan amount, the longer the repayment duration and the greater the installment.
A line chart or line graph is a type of chart that displays information as a series of data points, called ‘markers’, connected by straight line segments. It is one of the most basic chart types and is commonly used in fields like finance and sports. It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time.
Let us plot the stock prices of a product over the last 24 days.
time <- 1:24
price_of_stock <- c(102, 101, 85, 90, 115, 114, 110, 100, 101, 105, 109, 110,
                    150, 110, 115, 112, 101, 105, 109, 108, 110, 116, 120, 100)
plot(time, price_of_stock)
Show me the trend:
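One way to show the trend is to join the points with line segments by passing type = "l" to plot(). This is a minimal sketch that reuses the same 24 prices; the axis labels and title are our own choices.

```r
time <- 1:24
price_of_stock <- c(102, 101, 85, 90, 115, 114, 110, 100, 101, 105, 109, 110,
                    150, 110, 115, 112, 101, 105, 109, 108, 110, 116, 120, 100)

# type = "l" joins consecutive observations with straight line segments
plot(time, price_of_stock, type = "l",
     xlab = "day", ylab = "stock price", main = "Stock price over 24 days")

# Optionally mark each observation on top of the line
points(time, price_of_stock, pch = 19)
```

The spike on day 13 and the dip on day 3 stand out much more clearly once the points are connected.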
Now, let’s compare stock prices of two products:
plot(time, price_of_stock_of_p1, type = "b", col = "red", ylab = "stock price")
lines(time, price_of_stock_of_p2, type = "b", col = "blue")
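Note that lines() does not rescale the axes, so a second series that goes outside the range of the first gets cut off. The sketch below sets ylim from both series before plotting. The price values here are hypothetical and only serve to make the example self-contained.

```r
time <- 1:6
price_of_stock_of_p1 <- c(102, 101, 85, 90, 115, 114)   # hypothetical values
price_of_stock_of_p2 <- c(95, 99, 120, 130, 125, 140)   # hypothetical values

# Axis range that covers both series
y_range <- range(price_of_stock_of_p1, price_of_stock_of_p2)

plot(time, price_of_stock_of_p1, type = "b", col = "red",
     ylim = y_range, ylab = "stock price")
lines(time, price_of_stock_of_p2, type = "b", col = "blue")

# A legend tells the reader which line is which product
legend("topleft", legend = c("p1", "p2"), col = c("red", "blue"), lty = 1, pch = 1)
```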
A boxplot is an efficient method for graphically displaying numerical data.
It depicts the following information:
- The smallest observation (sample minimum)
- The lower quartile (25%)
- The median (50%)
- The upper quartile (75%)
- The largest observation (sample maximum)
If there are outliers, the boxplot indicates them as well. The box extends from the lower quartile at the bottom to the upper quartile at the top. The whiskers connect the box to the smallest and largest values that are not outliers, so when outliers are present the whisker ends differ from the sample minimum and maximum.
Outliers are observations that are distant from the rest of the sample.
Extreme outliers are observations that lie more than three times the inter-quartile range (IQR: the difference between the third and first quartiles) beyond the first or third quartile. Mild outliers are observations that lie more than 1.5 times the IQR beyond the first or third quartile but not as far as extreme outliers. Some plotting tools draw extreme and mild outliers with different symbols; R's boxplot() by default draws every point beyond 1.5 times the IQR the same way.
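The outlier rules above can be checked by hand. The following sketch uses a small made-up vector (not the loan data) and computes the 1.5 x IQR and 3 x IQR fences with quantile(). The last line shows what boxplot.stats() flags; it uses hinges rather than quantile(), so its fences can differ slightly on other data.

```r
# Hypothetical sample with one mild and one extreme outlier
x <- c(12, 15, 18, 20, 22, 25, 28, 30, 33, 35, 60, 150)

q1  <- quantile(x, 0.25, names = FALSE)   # lower quartile
q3  <- quantile(x, 0.75, names = FALSE)   # upper quartile
iqr <- q3 - q1                            # same value as IQR(x)

# Fences at 1.5 and 3 times the IQR beyond the quartiles
mild_fence    <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
extreme_fence <- c(q1 - 3 * iqr,   q3 + 3 * iqr)

is_extreme <- x < extreme_fence[1] | x > extreme_fence[2]
is_mild    <- (x < mild_fence[1] | x > mild_fence[2]) & !is_extreme

print(x[is_mild])      # 60
print(x[is_extreme])   # 150

# Points that boxplot() would draw individually (default coef = 1.5)
print(boxplot.stats(x)$out)
```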
To interpret a boxplot, we look at the numerical values of the three quartiles, representing 25 percent, 50 percent and 75 percent of the sample respectively. We also look at the general shape of the box and whiskers for indications of symmetry or asymmetry and outliers.
A boxplot conveys the following summaries of the data at a glance:
- Location: is displayed by the cut line at the median (as well as by the middle of the box)
- Spread: is defined by the length of the box (as well as by the distance between the ends of the whiskers and the range).
- Skewness: is defined by the deviation of the median line from the center of the box relative to the length of the box (as well as by the length of the upper whisker relative to the length of the lower one, and by the number of individual observations displayed on each side).
- Longtailedness: is the distance between the ends of the whiskers relative to the length of the box (as well as by the number of observations specifically marked).
boxplot(amount_in_thousand,main="distribution of loan amount")
The above plot gives the following information about the data:
- The range is from 0 to 600 units, i.e. the loan amounts given out by the bank range from about 0 to around 600 thousand.
- The dense area lies between 60 and 220 units, i.e. most of the loan amounts disbursed by the bank lie between 60 thousand and around 220 thousand.
- The minimum value is close to 0, i.e. the smallest loan given to anyone is negligible: a few thousand, perhaps around 10 thousand.
- The maximum value of the distribution is close to 600 units, i.e. the maximum loan amount given by the bank is close to 600 thousand.
- Very few values fall above 420 units, and they are called outliers, i.e. the bank has approved very few loans for an amount greater than 420 thousand. These might be exceptional cases.
- The median lies around 120 units, i.e. the middle value of all the loan amounts is about 120 thousand.
- The data is right-skewed, since the box lies towards the lower half of the plot: most of the distribution lies under about 200 units, i.e. most of the loans taken from the bank are under 200 thousand.
Similarly, we can plot a boxplot for loan amount for each loan repayment status.
boxplot(amount_in_thousand~status,main="Distribution of loan amount \nfor each loan repayment status")
We can compare the distribution for each category and then draw conclusions about which loan amounts and customers belong to which category. Here are the conclusions from the plot:
- People who take small loans are less likely to default.
- People taking large loans find it hard to earn a good reputation with the bank and fall into grades C and D.
- However, the data in grade D is almost symmetrical, i.e. roughly as many people take loans above about 300 thousand as below it.
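Reading values off a plot is approximate, so it can help to look at the numbers behind the boxes as well. The sketch below does this on simulated data; amount_demo and status_demo are hypothetical stand-ins for amount_in_thousand and status, so the figures will not match the plot above.

```r
# Hypothetical stand-ins for amount_in_thousand and status
set.seed(99)
status_demo <- factor(sample(c("A", "B", "C", "D"), 400, replace = TRUE))
amount_demo <- rgamma(400, shape = 2, scale = 75)

# Median loan amount per status group
med <- tapply(amount_demo, status_demo, median)
print(med)

# The statistics boxplot() draws, without producing a plot.
# Rows of $stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
# Columns follow the order given in $names.
b <- boxplot(amount_demo ~ status_demo, plot = FALSE)
print(b$names)
print(b$stats)
```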
We hope this blog was useful. Visit our website www.acadgild.com for more blogs on Big Data, R, Machine Learning and other technologies.