Exploratory Data Analysis: The First Statistical Glance of the Data
In this blog, we will learn about the basic analysis tasks that we should apply to our data before we go ahead and build complex models.
We will discuss the basic statistical properties that almost all data have and that can be used to extract information from the data. These steps are commonly known as Exploratory Data Analysis (EDA).
Exploratory Data Analysis
John Tukey suggested using EDA to collect and analyze data—not to confirm a hypothesis, but to form a hypothesis that could later be confirmed through other methods.
In statistics, EDA is an approach to analyze data sets to summarize their main characteristics, with the help of descriptive statistics and visual methods. Primarily, EDA is used for visualizing what the data can tell us about itself without performing a complex operation on it.
Before making inferences from data, it is essential to examine all of its variables.
Why?
To listen to the data:
• to maximize insight into a data set
• to detect mistakes, i.e. outliers and anomalies
• to extract important variables
• to see patterns in the data
• to find violations of statistical assumptions
• to generate hypotheses
• to test underlying assumptions
…because if you don’t, you may have trouble later.
Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data.
In this blog, we will discuss:
• numerical summaries or descriptive statistics
• details of data density, and
• graphical analysis
Dataset
The following are the components of a data set:
• A data set is often represented as a matrix
• There is a row for each unit
• There is a column for each variable
• A unit is an object that can be measured, such as a person or a thing
• A variable is a characteristic of a unit that can be assigned a number or a category
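The matrix view described above can be sketched in a few lines of Python. The names and values below are hypothetical and only serve to show units as rows and variables as columns.

```python
# Hypothetical data set: each row is a unit (a person),
# each column is a variable measured on that unit.
columns = ["name", "age", "grade"]
rows = [
    ["Asha", 21, "A"],
    ["Ben", 19, "B"],
    ["Chen", 22, "A"],
]

n_units = len(rows)         # one row per unit
n_variables = len(columns)  # one column per variable

# Pull out a single variable (column) by its name.
age_index = columns.index("age")
ages = [row[age_index] for row in rows]

print(n_units, n_variables)
print(ages)
```

Looking at `ages` alone is a univariate view of the data; looking at `age` and `grade` together would be a bivariate one.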
Dimensionality of Data Sets
• Univariate: Measurement made on one variable per subject
• Bivariate: Measurement made on two variables per subject
• Multivariate: Measurement made on many variables per subject
Type of variables

Qualitative: Variables take on values that are names or labels.
Example: The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier).

Types:

Nominal: It does not matter which way the categories are ordered in tabular or graphical displays of the data — all orderings are equally meaningful. For example, a student’s religion (Atheist, Christian, Muslim, Hindu, …) is nominal.

Ordinal: A categorical variable whose categories can be meaningfully ordered is called ordinal. For example, a student’s grade in an exam (A, B, C or Fail) is ordinal.


Quantitative: Variables that are measured on a numeric or quantitative scale.
Example: Age, or a count of any kind of item.

Types:

Discrete: A discrete variable is one that cannot take on all values within the limits of the variable.

For example, number of children is a discrete numerical variable (a count). The variable cannot have the value 1.7

Continuous: If a variable can take on any value between two specified values, it is called a continuous variable.
For example, age of a human: 25 years, 10 months, 2 days, 5 hours
Numerical Summaries of Data
Numerical measures are useful in situations which require decision making and inferences to be drawn based on data. The following measures are discussed below:
• Central Tendency measures
   - They are computed to give a “center” around which the measurements in the data are distributed
   - To check the central tendency of the data, compute the following:
      - Mean
      - Median
      - Mode
• Variation or Variability measures
   - They describe the “data spread”, i.e. how distant the measurements are from the center
   - To check the variation or spread of the data, compute the following:
      - Range
      - Variance
      - Standard Deviation
      - Interquartile Range (IQR)
• Relative Standing measures
   - Percentiles
   - Quartiles
Let us now discuss the above measures in more detail.
Central Tendency measures
• The Mean: It is the average of the observations
   - To calculate the average $\bar{x}$ of a set of $n$ observations $x_1, \dots, x_n$, add their values and divide by the number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• The Median: It is the value that lies exactly in the middle of the ordered observations
   - Calculation:
      - If there is an odd number of observations, find the middle value
      - If there is an even number of observations, find the middle two values and average them
For example:
Age of participants: 17 19 21 22 22 33 23 38
In order: 17 19 21 22 22 23 33 38
Median = (22+22)/2 = 22
Note: Which is the best location measure?
   - The mean is best for symmetric distributions without outliers. [Figure: data plotted on a 0 to 10 axis; Mean = 3, Median = 3]
   - The median is useful for skewed distributions or data with outliers. [Figure: data plotted on a 0 to 10 axis; Mean = 4, Median = 3]
   - In the second case the extreme values pull the mean up from 3 to 4, while the median stays at 3.
• The Mode: The mode is the value that occurs more often than any other
Example: 1, 1, 1, 1, 14, 14, 16, 18, 21
Mode = 1, since it occurs most often
• The Minimum: The smallest value in the list of observations
• The Maximum: The largest value in the list of observations
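As a minimal sketch, the measures above can be computed with Python's standard `statistics` module. The two lists are the examples used in the text; the variable names are only illustrative.

```python
import statistics

# Ages from the median example (order does not matter here).
ages = [17, 19, 21, 22, 22, 33, 23, 38]

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # sorts internally, averages the middle two
youngest, oldest = min(ages), max(ages)

# Values from the mode example.
values = [1, 1, 1, 1, 14, 14, 16, 18, 21]
mode_value = statistics.mode(values)  # most frequent value

print(mean_age)     # 24.375
print(median_age)   # 22.0
print(mode_value)   # 1
print(youngest, oldest)
```

The mean (24.375) is larger than the median (22) because the two largest ages, 33 and 38, pull the mean upward, which is the effect described in the note on location measures.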
Variation or Variability measures
• The Range:
   - The complete spread of the data
   - To calculate the range: Maximum – Minimum
   - It gives the full interval within which all the observations were recorded
• The Variance: The average of the squared deviations of the values from the mean
   - Because the deviations are squared, an observation contributes more to the variance the farther it lies from the mean.
• The Standard Deviation:
   - Variance is hard to interpret on its own, because it is expressed in squared units
   - What does it mean to have a variance of 10.8 or 2.2 or 1459.092 or 0.000001?
   - Not much by itself. But if you could “standardize” that value, you could talk about any deviation in equivalent terms.
   - The standard deviation is simply the square root of the variance
   - Taking the square root brings the number back to the units of the original data, so that it can be used as a standard unit of spread
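The range, variance and standard deviation can be sketched as follows, reusing the ages from the median example. This assumes the "average of squared deviations" definition given above, which corresponds to `statistics.pvariance`; the sample versions (`statistics.variance`, `statistics.stdev`) divide by n - 1 instead of n.

```python
import statistics

ages = [17, 19, 21, 22, 22, 33, 23, 38]

# Range: maximum minus minimum.
data_range = max(ages) - min(ages)

# Variance as the plain average of squared deviations from the mean.
mean_age = statistics.mean(ages)
pop_variance = sum((a - mean_age) ** 2 for a in ages) / len(ages)

# Standard deviation: square root of the variance, in the data's own units (years).
pop_sd = pop_variance ** 0.5

print(data_range)    # 21
print(pop_variance)  # 45.984375
print(pop_sd)

# The library functions agree with the hand-written formulas.
assert abs(pop_variance - statistics.pvariance(ages)) < 1e-9
assert abs(pop_sd - statistics.pstdev(ages)) < 1e-9
```

A variance of about 46 "squared years" is hard to read, whereas a standard deviation of roughly 6.8 years can be compared directly with the ages themselves.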

Note:
Empirical Rule
For data that follow an approximately normal distribution, i.e. whose histogram is bell-shaped,
• About 68% of the observations are within 1 SD of the mean.
• About 95% of the observations are within 2 SDs of the mean.
• Nearly all (about 99.7%) of the observations are within 3 SDs of the mean.
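The percentages in the empirical rule can be checked against the theoretical normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal observation lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

Real data will only match these figures approximately, and only when the histogram is close to bell-shaped.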
• The IQR: The “Interquartile Range” is the range from the first quartile (Q1) to the third quartile (Q3): IQR = Q3 − Q1. The quartiles themselves are found as follows.
Quartiles: Quartiles are the values that divide a list of numbers into quarters.
• First put the list of numbers in order
• Then divide the list into four equal parts
• The quartiles are at the “cuts”
Example: 17, 19, 21, 22, 27, 33, 23, 38, 40
Put them in order: 17 19 21 22 23 27 33 38 40
Divide the list into quarters (the quartiles are shown in brackets):
17 19 [21] 22 [23] 27 [33] 38 40
And the result is:
• Quartile 1 (Q1) = 21
• Quartile 2 (Q2), which is also the median, = 23
• Quartile 3 (Q3) = 33
Sometimes a “cut” falls between two numbers; in that case the quartile is the average of the two numbers.
Example: 17 19 21 22 23 27 33 38
The numbers are already in order
Cut the list into quarters:
17 19 | 21 22 | 23 27 | 33 38
In this case Quartile 2 is halfway between 22 and 23:
Q2 = (22+23)/2 = 22.5
And the result is:
• Quartile 1 (Q1) = (19+21)/2 = 20.0
• Quartile 2 (Q2) = 22.5
• Quartile 3 (Q3) = (27+33)/2 = 30.0
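The two worked examples can be reproduced with a small Python sketch. It assumes the "median of each half" method, with the middle value counted in both halves when the number of observations is odd, because that is what matches the results above. The function name `quartiles` is illustrative; statistical libraries often use other interpolation rules and may return slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: for an odd number of observations the middle value
    is included in both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[: half + (n % 2)]  # lower half (plus the middle value if n is odd)
    upper = ordered[half:]             # upper half (starting at the middle value if n is odd)
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

q1, q2, q3 = quartiles([17, 19, 21, 22, 27, 33, 23, 38, 40])
print(q1, q2, q3)  # 21 23 33
print(q3 - q1)     # IQR = 12

q1, q2, q3 = quartiles([17, 19, 21, 22, 23, 27, 33, 38])
print(q1, q2, q3)  # 20.0 22.5 30.0
print(q3 - q1)     # IQR = 10.0
```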
Relative Standing measures
Percentiles and Quartiles:

Measures of relative standing can be used to compare values from different data sets, or to compare values within the same data set.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest.
Percentiles

A percentile is a measure used in statistics that indicates the value below which a given percentage of the observations in a group fall.

For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
Quartiles

The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3).
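To connect percentiles with quartiles, the following sketch counts what percentage of the observations fall below a given value. The helper `percent_below` is a hypothetical name, and the data are the eight ordered ages from the quartile example.

```python
def percent_below(values, score):
    """Percentage of observations that are strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

ages = [17, 19, 21, 22, 23, 27, 33, 38]

# The quartiles found earlier for this list were 20.0, 22.5 and 30.0.
print(percent_below(ages, 20.0))  # 25.0 -> first quartile (Q1)
print(percent_below(ages, 22.5))  # 50.0 -> median (Q2)
print(percent_below(ages, 30.0))  # 75.0 -> third quartile (Q3)
```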
Other Attributes
Check the relationship between the available variables
Covariance

The covariance of two variables x and y in a data sample measures how the two are linearly related.

A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.

The sample covariance is defined in terms of the sample means $\bar{x}$ and $\bar{y}$ as: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
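A minimal sketch of the sample covariance in Python, assuming the usual definition with n - 1 in the denominator. The data (`hours`, `score`) are hypothetical and chosen only to show a positive and a negative relationship.

```python
def sample_covariance(x, y):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

print(sample_covariance(hours, score))        # 10.25, positive linear relationship
print(sample_covariance(hours, score[::-1]))  # -10.25, negative linear relationship
```

The sign shows the direction of the relationship; the magnitude depends on the units of both variables, so it is not a standardized measure of strength.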

Check the shape of the data
Skewness

Skewness is a measure of symmetry or, more precisely, of the lack of symmetry in a distribution

A distribution or data set is symmetric if the left and right of the center point look exactly the same

A symmetrical dataset will have a skewness equal to 0. So, a normal distribution will have a skewness of 0

Skewness essentially measures the relative size of the two tails

If the value is negative, it implies that the distribution of the data is skewed to the left, or negatively skewed (the left tail is longer)

If the value is positive, it implies that the distribution of the data is skewed to the right, or positively skewed (the right tail is longer)
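The sign behaviour described above can be illustrated with a short Python sketch. The post does not give a formula, so this assumes the common moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). The data sets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (an assumed, commonly used definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]            # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]      # mirror image, long left tail

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```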
Kurtosis

Kurtosis is a measure of the “tailedness” of the probability distribution of a real-valued random variable

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack outliers

The kurtosis of any univariate normal distribution is 3. It is common to compare the kurtosis of a distribution to this value.

Distributions with kurtosis less than 3 are said to be platykurtic. An example of a platykurtic distribution is the uniform distribution, which does not have positive-valued tails.

Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero slowly when compared with a Gaussian.
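A small Python sketch makes the comparison with 3 concrete. It assumes the moment-based definition (fourth central moment divided by the squared second central moment), which is the version for which a normal distribution scores 3. Both data sets are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (not the "excess" version, which subtracts 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2

# Evenly spread values, similar to a uniform distribution: no tails.
flat = list(range(1, 101))

# Mostly identical values plus two extreme outliers: heavy tails.
heavy = [0] * 50 + [-10, 10]

print(kurtosis(flat))   # about 1.8, below 3 -> platykurtic
print(kurtosis(heavy))  # 26.0, above 3 -> leptokurtic
```

Some libraries report "excess kurtosis" (this value minus 3), so it is worth checking which convention a tool uses before comparing numbers.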
The ‘Best’ way to summarize data sets:

First step is to summarize each variable in the data set.

Then, the best way to summarize a variable depends on its characteristics, i.e. whether it is qualitative or quantitative.

Then summarize each variable with respect to other variables present in the dataset
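As a rough sketch of the first two steps, the Python snippet below picks a summary according to the variable type. The specific choices here (frequency counts for qualitative variables; mean, median and standard deviation for quantitative ones) are an assumption based on common practice, not something the post spells out. The function name and data are hypothetical.

```python
from collections import Counter
import statistics

def summarize(values):
    """Frequency counts for qualitative data, numeric summaries for quantitative data."""
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
        }
    return dict(Counter(values))

grades = ["A", "B", "A", "C", "A"]  # qualitative (ordinal)
ages = [17, 19, 21, 22, 22]         # quantitative

print(summarize(grades))  # {'A': 3, 'B': 1, 'C': 1}
print(summarize(ages))
```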
Hope this overview of Exploratory Data Analysis was useful. Keep visiting our website Acadgild for more updates on Machine Learning and other technologies. In the next blog, we will apply all these numerical summaries to a bank’s loan dataset with the help of a popular open source statistical tool called R.