Exploratory Data Analysis: The First Statistical Glance of the Data
In this blog, we will learn about the basic analysis tasks that we should apply to our data before we go ahead and build complex models.
We will discuss the basic statistical properties that almost all data sets have and that can be used to extract information from the data. These steps are commonly known as Exploratory Data Analysis (EDA).
Exploratory Data Analysis
John Tukey suggested using EDA to collect and analyze data—not to confirm a hypothesis, but to form a hypothesis that could later be confirmed through other methods.
In statistics, EDA is an approach to analyzing data sets in order to summarize their main characteristics with the help of descriptive statistics and visual methods. Primarily, EDA is used to see what the data can tell us about itself without performing complex operations on it.
Before making inferences from data, it is essential to examine all of its variables.
We do this in order to listen to the data, that is:
- to maximize insight into the data set
- to detect mistakes, i.e. outliers and anomalies
- to extract important variables
- to see patterns in the data
- to find violations of statistical assumptions
- to generate hypotheses
- to test underlying assumptions
If you skip this step, you may run into trouble later.
Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data.
In this blog, we will discuss:
- numerical summaries or descriptive statistics
- checks on data density, i.e. the shape of the distribution
- graphical analysis
The components of a data set are as follows:
- A data set is often represented as a matrix
- There is a row for each unit
- There is a column for each variable
- A unit is an object that can be measured, such as a person or a thing
- A variable is a characteristic of a unit that can be assigned a number or a category
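As a rough illustration of this matrix view, here is a minimal Python sketch. The data set, the names of the units and the variable names are all made up for the example.

```python
# A tiny, made-up data set: one row per unit (a person), one column per variable.
columns = ["name", "age", "grade"]   # hypothetical variable names
rows = [
    ["Asha", 17, "A"],
    ["Ben", 19, "B"],
    ["Chen", 21, "A"],
]

n_units = len(rows)          # number of rows = number of units
n_variables = len(columns)   # number of columns = number of variables

# Pull out a single variable (one column) across all units.
age_index = columns.index("age")
ages = [row[age_index] for row in rows]

print(n_units, n_variables, ages)
```

Each inner list is one unit, and reading down a fixed position gives one variable.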
Dimensionality of Data Sets
- Univariate: Measurement made on one variable per subject
- Bivariate: Measurement made on two variables per subject
- Multivariate: Measurement made on many variables per subject
Type of variables
Qualitative: Variables that take on values that are names or labels.
Example: The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier).
- Nominal: A categorical variable whose categories have no natural order. It does not matter which way the categories are ordered in tabular or graphical displays of the data; all orderings are equally meaningful. For example, a student’s religion (Atheist, Christian, Muslim, Hindu, …) is nominal.
- Ordinal: A categorical variable whose categories can be meaningfully ordered is called ordinal. For example, a student’s grade in an exam (A, B, C or Fail) is ordinal.
Quantitative: Variables that are measured on a numeric or quantitative scale.
Example: Age, or a count of anything.
- Discrete: A discrete variable can take only certain separated values within its limits, not every value in between. For example, the number of children is a discrete numerical variable (a count); it cannot have the value 1.7.
- Continuous: If a variable can take on any value between two specified values, it is called a continuous variable. For example, the age of a human: 25 years, 10 months, 2 days, 5 hours.
Numerical Summaries of Data
Numerical measures are useful in situations that require decisions and inferences to be drawn from data. The following measures are discussed below:
• Central Tendency measures
- They are computed to give a “center” around which the measurements in the data are distributed
- To check the central tendency of the data, compute the following:
  - Mean
  - Median
  - Mode
  - Minimum and Maximum
• Variation or Variability measures
- They describe the “data spread”, or how far the measurements are from the center
- To check the variation or spread of the data, compute the following:
  - Range
  - Variance
  - Standard Deviation
  - Inter Quartile Range (IQR)
• Relative Standing measures
  - Percentiles and Quartiles
Let us now discuss the above measures in more detail.
Central Tendency measures
- The Mean: It is the average of the observations
- To calculate the average $\bar{x}$ of a set of $n$ observations $x_1, \dots, x_n$, add their values and divide by the number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- The Median: It is the value that is exactly in the middle once the observations are sorted
- If there is an odd number of observations, take the middle value
- If there is an even number of observations, take the middle two values and average them
Age of participants (sorted): 17 19 21 22 22 23 33 38
Median = (22+22)/2 = 22
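To check the mean and the median on the ages listed above, here is a small sketch using Python's standard `statistics` module. The variable names are our own choice.

```python
from statistics import mean, median

ages = [17, 19, 21, 22, 22, 33, 23, 38]  # ages of participants from the example

avg = mean(ages)     # sum of the values divided by the number of values
mid = median(ages)   # sorts the values, then averages the two middle ones

print("mean:", avg)
print("median:", mid)
```

Note that `median` sorts the values itself, so the list does not have to be ordered beforehand.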
Note: Which is the best Location Measure?
Mean is best for symmetric distributions without outliers
[Figure: data plotted on a 0 to 10 axis; Mean = 3, Median = 3]
Median is useful for skewed distributions or data with outliers
[Figure: data plotted on a 0 to 10 axis; Mean = 4, Median = 3]
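The effect of an outlier on the two measures can be sketched with two small samples. These numbers are made up; they were picked only so that the results match the means and medians quoted above.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]       # made-up symmetric sample
with_outlier = [1, 2, 3, 4, 10]   # same sample with the largest value replaced by an outlier

for label, data in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    print(label, "-> mean =", mean(data), ", median =", median(data))
```

A single extreme value pulls the mean from 3 up to 4, while the median stays at 3.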
- The Mode: The mode is the value that is repeated more often than any other
Example: 1, 1, 1, 1, 14, 14, 16, 18, 21
Mode = 1, since it is repeated most often
- The Minimum: The smallest value in the list of observations
- The Maximum: The largest value in the list of observations
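These three values can be read off the example list with a few lines of Python. This is only a sketch; the variable names are ours.

```python
from statistics import mode

values = [1, 1, 1, 1, 14, 14, 16, 18, 21]  # the example list from the text

most_common = mode(values)   # value that occurs most often
smallest = min(values)
largest = max(values)

print("mode:", most_common, "min:", smallest, "max:", largest)
```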
Variation or Variability measures
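- The Range:
- Complete spread of the data
- To calculate the range: Maximum - Minimum
- Shows the width of the window within which all recorded observations fall
- The Variance: Average of squared deviations of values from the mean: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean and $n$ is the number of observations (when estimating from a sample, $n - 1$ is commonly used in place of $n$)
Because the deviations are squared, an observation contributes more to the variance the farther it lies from the mean.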
- The Standard Deviation:
- Variance on its own is hard to interpret, because it is expressed in squared units of the data
- What does it mean to have a variance of 10.8 or 2.2 or 1459.092 or 0.000001?
- Very little by itself. But if you could “standardize” that value, you could talk about any variance (i.e. deviation) in equivalent terms.
- The standard deviation is simply the square root of the variance
- Taking the square root brings the variance back to the units of the data, so that the standard deviation can be used as a standard unit of spread
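The range, variance and standard deviation can be computed step by step as in the sketch below. It reuses the ages from the median example and assumes the "average of squared deviations" definition, which divides by n; the library functions `pvariance` and `pstdev` are printed only as a cross-check.

```python
import math
from statistics import mean, pstdev, pvariance

ages = [17, 19, 21, 22, 22, 33, 23, 38]  # ages from the median example

# Range: maximum minus minimum.
data_range = max(ages) - min(ages)

# Variance: average of the squared deviations from the mean (divides by n).
center = mean(ages)
variance = sum((x - center) ** 2 for x in ages) / len(ages)

# Standard deviation: square root of the variance, back in the data's units (years).
std_dev = math.sqrt(variance)

print("range:", data_range)
print("variance:", variance, "| library:", pvariance(ages))
print("standard deviation:", std_dev, "| library:", pstdev(ages))
```

If the data are a sample from a larger population, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would be the usual choice instead.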
For data that follow an approximately normal distribution, i.e. whose histogram is bell-shaped:
- About 68% of the observations are within 1 SD of the mean.
- About 95% of the observations are within 2 SDs of the mean.
- Nearly all (about 99.7%) of the observations are within 3 SDs of the mean.
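This rule of thumb can be checked on simulated data. The sketch below draws values from a normal distribution with a made-up mean of 50 and SD of 10, using a fixed seed so that the run is repeatable, and counts how many fall within 1, 2 and 3 SDs of the mean.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(50, 10) for _ in range(100_000)]  # made-up mean 50, SD 10

center = mean(sample)
sd = pstdev(sample)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    inside = sum(1 for x in sample if abs(x - center) <= k * sd)
    return inside / len(sample)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.3f}")
```

The printed shares should come out close to 0.68, 0.95 and 0.997, with small differences due to sampling.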
- The IQR: The “Interquartile Range” is the range from the first quartile (Q1) to the third quartile (Q3): IQR = Q3 - Q1
Quartiles: Quartiles are the values that divide a list of numbers into quarters.
- First put the list of numbers in order
- Then divide the list into four equal parts
- The Quartiles are at the “cuts”
Put them in order: 17 19 21 22 23 27 33 38 40
Divide the list into quarters (the values at the cuts are shown in brackets):
17 19 [21] 22 [23] 27 [33] 38 40
And the result is:
- Quartile 1 (Q1) = 21
- Quartile 2 (Q2), which is also the Median, = 23
- Quartile 3 (Q3) = 33
Sometimes a “cut” falls between two numbers. In that case, the Quartile is the average of the two numbers.
Example: 17 19 21 22 23 27 33 38
The numbers are already in order
Cut the list into quarters:
17 19 | 21 22 | 23 27 | 33 38
In this case Quartile 2 is halfway between 22 and 23:
Q2 = (22+23)/2 = 22.5
And the result is:
- Quartile 1 (Q1) = (19+21)/2 = 20.0
- Quartile 2 (Q2) = 22.5
- Quartile 3 (Q3) = (27+33)/2 = 30.0
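The two worked examples can be reproduced with a small helper. This is a sketch of one common convention (the median of each half, with the overall median counted in both halves when the number of observations is odd); the function name `quartiles` is our own.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, all data, and upper half.

    When the number of observations is odd, the overall median is counted
    in both halves, which matches the nine-number example in the text.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)

nine = [17, 19, 21, 22, 23, 27, 33, 38, 40]
eight = [17, 19, 21, 22, 23, 27, 33, 38]

for data in (nine, eight):
    q1, q2, q3 = quartiles(data)
    print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

Library routines such as `statistics.quantiles` use interpolation rules instead, so they may return slightly different quartiles for the same data.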
Relative Standing measures
Percentiles and Quartiles:
Measures of relative standing can be used to compare values from different data sets, or to compare values within the same data set.
To calculate quartiles and percentiles, the data must be ordered from smallest to largest.
A percentile is a measure used in statistics that indicates the value below which a given percentage of the observations in a group falls.
For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3).
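To make the "value below which a given percentage of observations fall" idea concrete, here is a minimal sketch that works in the opposite direction: given a value, it reports the percentage of observations strictly below it. The scores are made up, and `percent_below` is a hypothetical helper, not a library function.

```python
def percent_below(values, score):
    """Percentage of observations that are strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

scores = [55, 60, 62, 68, 70, 75, 80, 85, 90, 95]  # made-up exam scores

print(percent_below(scores, 62))  # 2 of the 10 scores are below 62
print(percent_below(scores, 75))  # 5 of the 10 scores are below 75
```

Under this simple definition, 62 sits at the 20th percentile of these scores and 75 at the 50th. Statistical packages use several slightly different percentile definitions, so their results can differ on small data sets.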
Checking the relationship between the available variables
The covariance of two variables x and y in a data sample measures how the two are linearly related.
A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.
The sample covariance is defined in terms of the sample means $\bar{x}$ and $\bar{y}$ as: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
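A direct translation of the sample covariance into Python might look like the sketch below. The function name and the two small data sets are made up for illustration; the divisor n - 1 is the usual sample convention.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations from the means, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

hours = [1, 2, 3, 4, 5]          # made-up data
score_up = [2, 4, 6, 8, 10]      # rises as hours rise
score_down = [10, 8, 6, 4, 2]    # falls as hours rise

print(sample_covariance(hours, score_up))    # positive: the variables move together
print(sample_covariance(hours, score_down))  # negative: the variables move in opposite directions
```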
Check the shape of the data
Skewness is a measure of the lack of symmetry in a distribution
A distribution or data set is symmetric if the left and right of the center point look exactly the same
A symmetrical data set will have a skewness equal to 0. So, a normal distribution will have a skewness of 0
Skewness essentially measures the relative size of the two tails
If the value is negative, the distribution of the data is skewed to the left, or negatively skewed
If the value is positive, the distribution of the data is skewed to the right, or positively skewed
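The sign of the skewness can be checked with a short sketch. It assumes the moment-based definition (third central moment divided by the 1.5th power of the second central moment), which is one of several definitions in use; the data sets are made up.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]                  # made-up symmetric data
right_skewed = [1, 2, 3, 4, 10]              # long tail on the right
left_skewed = [-v for v in right_skewed]     # mirror image: long tail on the left

for label, data in [("symmetric", symmetric),
                    ("right skewed", right_skewed),
                    ("left skewed", left_skewed)]:
    print(label, round(skewness(data), 3))
```

The symmetric sample gives 0, the right-skewed sample a positive value and its mirror image a negative value.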
Kurtosis is a measure of the “tailedness” of the probability distribution of a real-valued random variable
Kurtosis is a measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack outliers
The kurtosis of any univariate normal distribution is 3. It is common to compare the kurtosis of a distribution to this value.
Distributions with kurtosis less than 3 are said to be platykurtic. An example of a platykurtic distribution is the uniform distribution, which has bounded support and therefore no long tails.
Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero slowly when compared with a Gaussian.
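The three cases can be compared on simulated data. The sketch below assumes the moment-based kurtosis (fourth central moment divided by the squared second central moment) and uses a fixed seed; the Laplace sample is built as the difference of two exponential draws, since Python's `random` module has no Laplace generator.

```python
import random

def kurtosis(values):
    """Moment-based kurtosis: fourth central moment / (second central moment ** 2)."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m4 = sum((v - center) ** 4 for v in values) / n
    return m4 / m2 ** 2

rng = random.Random(0)  # fixed seed so the run is repeatable
size = 100_000

normal = [rng.gauss(0, 1) for _ in range(size)]
uniform = [rng.uniform(0, 1) for _ in range(size)]
# The difference of two independent exponential draws follows a Laplace distribution.
laplace = [rng.expovariate(1) - rng.expovariate(1) for _ in range(size)]

for label, data in [("normal", normal), ("uniform", uniform), ("laplace", laplace)]:
    print(label, round(kurtosis(data), 2))
```

The normal sample should land near 3, the uniform sample clearly below it and the Laplace sample clearly above it, up to sampling noise.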
The ‘Best’ way to summarize data sets:
The first step is to summarize each variable in the data set on its own.
The best way to summarize a variable depends on its characteristics, i.e. whether it is qualitative or quantitative.
Then summarize each variable with respect to the other variables present in the data set.
Hope this overview on Exploratory Data Analysis was useful. Keep visiting our website Acadgild for more updates on Machine Learning and other technologies. In the next blog, we will apply all these numerical summaries to a bank’s loan dataset with the help of R, a popular open source statistical tool.