In this blog, we will learn how data can be visualized with the help of two of Python's most important libraries, Matplotlib and Seaborn.
Later in this blog, we will also read about plotting 3D graphs using Matplotlib and get an introduction to Seaborn, a complement to Matplotlib. All of this is then illustrated with the help of a use case that visualizes data for different scenarios.
The concept of using pictures and graphs to understand data has been around for many years. As the amount of data grows day by day, it is a challenge to visualize this data and provide productive results in a short amount of time. Data visualization comes to the rescue here: it conveys concepts in a universal manner and lets us experiment with different scenarios by making slight adjustments.
Data visualization is the process of describing information in a graphical or pictorial format, which helps decision makers analyze the data more easily.
- Data visualization not only makes data more beautiful but also provides insight into complex data sets.
- It helps in identifying areas that need attention or improvement.
- It helps to understand which fields to place where.
- It helps to predict scenarios, and more.
Now that we have had a glimpse of data visualization, let us see how data can be visualized using Matplotlib.
INTRODUCTION TO MATPLOTLIB
Matplotlib is a Python plotting library used to create 2D graphs and plots from Python scripts. It has a module named pyplot which makes plotting easy by providing control over line styles, font properties, axis formatting, etc. Matplotlib supports several kinds of plots, such as line, bar, scatter and histogram plots.
plt is the conventional alias for the matplotlib.pyplot module (import matplotlib.pyplot as plt) and will be used in the rest of the coding examples in this blog. pyplot is the core module that contains the functions to create all sorts of charts and to control the features of a plot.
%matplotlib inline is a Jupyter notebook specific command that lets you see the plots in the notebook itself.
The following are the key plots that you need to know well for basic data visualization:
- Line Plot
- Bar Chart
- Histogram Plot
- Scatter Plot
- Stack Plot
- Pie Chart
- Box Plot
We will see the respective plotting in detail as follows.
The line plot is the simplest of all plot types, as it is the visualization of a single function.
Let us see the below program to understand how line plotting is done.
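The following is a minimal sketch of such a program. The values in x and y are made-up sample data, chosen only for illustration.

```python
import matplotlib.pyplot as plt

# Made-up sample data: y is plotted against x
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

# plot() draws the line on the current axes and returns a list of Line2D objects
(line,) = plt.plot(x, y)

# show() displays the figure (not needed in a notebook with %matplotlib inline)
plt.show()
```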
In the above program, plot() is the function used to plot the line chart. It takes two variables, the x values and the y values, to plot the line.
When we call plot(), the graph is drawn internally; to actually display it we call the function show().
Let us see one more example to understand the line chart in detail.
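A sketch of a program with two lines could look like the following. The data values, label texts, colors and title are assumptions made for illustration.

```python
import matplotlib.pyplot as plt

# Made-up sample data for two lines
x, y = [1, 2, 3], [4, 6, 5]
x2, y2 = [1, 2, 3], [9, 7, 8]

# label, color and linestyle are passed as keyword arguments
(first,) = plt.plot(x, y, label="First line", color="green", linestyle="--")
(second,) = plt.plot(x2, y2, label="Second line", color="#ff0000", linestyle=":")

plt.xlabel("x-axis")                       # name of the x-axis
plt.ylabel("y-axis")                       # name of the y-axis
plt.title("Two lines on one set of axes")  # title shown above the plot
legend = plt.legend()                      # builds the legend from the labels
plt.show()
```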
In the above program, two lines have been created, one from the variables x & y and one from x2 & y2. We can also make use of the NumPy library to create the arrays x and y.
The plt.plot() function takes additional arguments that control how each line is drawn.
In the above program we used arguments such as:
- label: to give a label to each line we used in the program.
- color: to assign different colors to the lines. We can specify these colors in several ways, such as by name, by single-letter color code, or by hex code.
- linestyle: to set the line style to dashed, dotted, solid or dashdot. We can also use the short codes '--', ':', '-' and '-.', respectively.
If you want to be extremely concise, you can combine the linestyle and color codes into a single non-keyword argument such as '-g' or '-.r'.
The plt.xlabel() and plt.ylabel() functions are used to give names to the x-axis and y-axis of the graph, respectively.
The plt.title() method is used to give a title to the graph, which usually appears at the top of the graph.
When multiple lines are shown within a single set of axes, it can be useful to create a plot legend that labels each line. Matplotlib has a built-in way of quickly creating such a legend: the plt.legend() method.
The plt.legend() function keeps track of the line style and color and matches these with the correct label.
There are many more similar methods, which you can look up on the official Matplotlib website.
Two other plotting techniques are the bar chart and the histogram. Let us see how they work in detail.
A bar graph is used to compare data among different categories. A bar chart can be drawn horizontally or vertically.
Let us see this with an example.
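A minimal sketch of a bar chart with two sets of bars might be written as follows. The data values and axis names are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up sample data: positions (x) and heights (y) for two sets of bars
x, y = [1, 3, 5, 7, 9], [5, 2, 7, 8, 2]
x2, y2 = [2, 4, 6, 8, 10], [8, 6, 2, 5, 6]

# bar() draws vertical bars; 'r' is red and 'c' is cyan
bar1 = plt.bar(x, y, label="Bar1", color="r")
bar2 = plt.bar(x2, y2, label="Bar2", color="c")

plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar graph")
plt.legend()
plt.show()
```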
The bar graph is plotted using the bar() method.
In the above program, two sets of bars are labeled Bar1 and Bar2. Bar1 is plotted using the data of x & y and Bar2 is plotted using the data of x2 & y2.
Bar1 is shown with the color code 'r', i.e., in red, and Bar2 is shown with the color code 'c', i.e., in cyan.
We can also use different parameters such as height, width, align, tick_label, etc.
We can also generate a horizontal bar graph. For this we use the method plt.barh() in place of plt.bar(). We urge you to practice it by yourself for a better understanding.
Histograms are similar to bar charts; however, histograms are used to show a distribution. They are helpful when we have data as arrays.
Let us see this with an example where the ages of a population are plotted with respect to bins.
A bin is one of a series of intervals, usually of the same size, into which the range of values is divided.
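A sketch of such a histogram is given here. The ages in popul_age are invented sample values and the bin edges in bins are an assumption (ten-year age groups).

```python
import matplotlib.pyplot as plt

# Made-up ages of 20 people
popul_age = [22, 25, 31, 33, 34, 35, 36, 38, 39, 41,
             45, 47, 52, 58, 63, 67, 71, 18, 12, 84]

# Bin edges: each pair of neighbouring values is one ten-year age group
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

# hist() returns the count per bin, the bin edges and the drawn bars
counts, edges, patches = plt.hist(popul_age, bins, histtype="bar", rwidth=0.8)

plt.xlabel("age group")
plt.ylabel("number of people")
plt.title("Histogram")
plt.show()
```

With this sample data, the 30-40 bin holds the most people.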
In the above program, popul_age holds the ages of various people. The variable bins defines the age groups (the bin edges) into which these people are counted.
Therefore, in the output we can see that people in the age group 30-40 are the most numerous.
The method hist() is used to plot histograms.
The keyword histtype selects the type of histogram, which can be bar, barstacked, step or stepfilled. rwidth sets the width of the bars relative to the bin width.
Similarly, we can use other parameters as and when required.
Let us now understand scatter plots and stack plots.
A scatter plot is much like a line graph, except that instead of the points being joined by line segments, each point is shown individually with a dot, circle or any other shape.
We can draw scatter plots using either the plt.plot() or the plt.scatter() method.
Let's first see an example that creates a scatter plot using the plt.plot() method:
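One possible sketch is shown here. The choice of a sine curve sampled at 30 points is an assumption made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two NumPy arrays: 30 evenly spaced x values and their sine
x = np.linspace(0, 10, 30)
y = np.sin(x)

# The format code 'o' draws a circle marker at each point and no connecting line
(points,) = plt.plot(x, y, "o", color="black")
plt.show()
```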
In the above program we have created 2 arrays using the NumPy library.
These 2 arrays are plotted using the plt.plot() method. The format code 'o' sets the marker shape of the points.
Now we will see an example that creates a scatter plot using the plt.scatter() method.
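A minimal sketch with plt.scatter(), again using an assumed sine curve as sample data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 30)
y = np.sin(x)

# marker sets the shape of each point, s sets its size
points = plt.scatter(x, y, marker="o", s=40)
plt.show()
```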
In the above program, the two arrays are plotted using the plt.scatter() method.
The keyword marker sets the shape in which the points are drawn and s sets the size of the points.
With plt.plot(), we can also combine these marker codes with line and color codes to plot points along with a line connecting them. Let us see the code below:
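A sketch of this combination, with an assumed sine curve as sample data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 30)
y = np.sin(x)

# '-ok' combines three codes: '-' solid line, 'o' circle marker, 'k' black
(line,) = plt.plot(x, y, "-ok")
plt.show()
```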
In the above program we can see that x and y are passed as the array variables, and that in the format string '-' is the line style, 'o' is the marker style of the points and 'k' is the color (black).
The plt.plot() method differs from plt.scatter() in that it does not provide the option to change the color and size of each point individually, whereas the latter allows us to do that.
Let's see this by creating a random scatter plot with points of many colors and sizes.
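A sketch of such a random scatter plot follows. The seed, the number of points, the size scale and the viridis colormap are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)   # fixed seed so the plot is reproducible
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)           # one value per point, mapped through the colormap
sizes = 1000 * rng.rand(100)     # one marker size per point

points = plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap="viridis")
plt.colorbar()                   # shows how values map to colors
plt.show()
```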
In the above program, two arrays have been created using the NumPy library, and a color value is generated for each of the 100 points. The size s is given in points squared. cmap stands for colormap and accepts a Colormap instance or a registered colormap name.
A stack plot shows the whole data set in a way that makes it easy to see how each part makes up the whole.
The constituents of the stack plot are stacked on top of each other.
It is similar to a pie chart, which also shows the various constituents of a data set. However, it is still different, as stack plots have axes, unlike pie charts. Pie charts basically have one numerical data set with labels.
Let us understand this with the below code:
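A sketch of such a stack plot is given here. The hours per activity are made-up values (each day adds up to 24) and the colors are arbitrary choices.

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]

# Made-up hours per activity; each day sums to 24
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]

# Empty lines drawn only to create labeled legend entries
plt.plot([], [], color="m", label="Sleeping", linewidth=5)
plt.plot([], [], color="c", label="Eating", linewidth=5)
plt.plot([], [], color="r", label="Working", linewidth=5)
plt.plot([], [], color="k", label="Playing", linewidth=5)

# stackplot() stacks the four series on top of each other
layers = plt.stackplot(days, sleeping, eating, working, playing,
                       colors=["m", "c", "r", "k"])

plt.xlabel("day")
plt.ylabel("hours")
plt.title("Stack plot")
plt.legend()
plt.show()
```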
In the above code, we have considered a situation where we have data for 5 days. Since each day consists of 24 hours, it is divided into the activities that we carry out on a daily basis, i.e., sleeping, eating, working and playing.
We have plotted these activities with different labels, giving a linewidth of 5 to each.
We have then drawn the stack plot using the plt.stackplot() method. Therefore the output will look something like this:
A pie chart is a circular statistical diagram. The area of the whole chart represents the whole of the data. The chart is divided into slices, called wedges, whose areas represent the percentage that each part contributes to the whole.
Pie charts can be drawn using the function pie() in the pyplot module.
By default, pyplot arranges the wedges in the counter-clockwise direction.
Let us now look into the code:
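A minimal sketch of a pie chart is shown here. The hours in slices, the colors, the exploded wedge and the autopct format are assumptions made for illustration.

```python
import matplotlib.pyplot as plt

# Made-up hours spent on each activity during one 24-hour day
slices = [7, 2, 2, 13]
activities = ["sleeping", "eating", "working", "playing"]
cols = ["c", "m", "r", "b"]

wedges, texts, autotexts = plt.pie(
    slices,
    labels=activities,
    colors=cols,
    startangle=90,            # the first wedge starts at 90 degrees
    shadow=True,              # draw a shadow beneath the pie
    explode=(0, 0.1, 0, 0),   # pull the second wedge out a little
    autopct="%1.1f%%",        # print each share as a percentage
)

plt.title("Pie chart")
plt.show()
```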
In the above program, we have reused the previous stack plot example, where each 24-hour day is divided into slices for the different activities, and plotted this data as a pie chart using the plt.pie() method.
Within this method, we specify the "slices," which are the relative sizes of each part. Then, we specify the list of colors for the corresponding slices. Next, we can optionally specify the start angle for the graph. This lets you choose where the first wedge starts. In our case, we chose a 90-degree angle for the pie chart.
We can optionally add a shadow to the plot for a bit of character, and we used "explode" to pull one slice out a bit.
So the output will be:
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
The code for Box plot is as follows:
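A sketch of a box plot with four elements could look like this. The four normally distributed samples, the seed and the tick label texts are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)  # fixed seed so the plot is reproducible

# Four made-up samples with different means and spreads
data = [
    np.random.normal(100, 10, 200),
    np.random.normal(80, 30, 200),
    np.random.normal(90, 20, 200),
    np.random.normal(70, 25, 200),
]

fig, ax = plt.subplots()
result = ax.boxplot(data)   # one box per sample
ax.set_xticklabels(["Sample 1", "Sample 2", "Sample 3", "Sample 4"])
plt.show()
```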
In the above code we have created a box plot with four elements. To create the box plot we use the boxplot() method, which is available both as plt.boxplot() and as ax.boxplot() on an Axes object. The data passed to it can be a list or a NumPy array.
The ax.set_xticklabels(labels) call sets the x-axis tick labels for the axes.
So the output would be:
INTRODUCTION TO 3D MATPLOTLIB
Matplotlib was initially designed with only two-dimensional plotting in mind.
The mpl_toolkits.mplot3d toolkit included with Matplotlib (imported with from mpl_toolkits.mplot3d import axes3d) provides the methods necessary to create 3D plots with Python.
We will now create 3D plots for bar charts and scatter plots.
3D BAR CHARTS
The difference between a 2D and a 3D bar chart is that a 3D bar has, in addition to a starting point, a height and a width, also a depth.
Let us understand this with the help of a basic example:
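A minimal sketch of a 3D bar chart follows. The bar positions and sizes are made-up values. On recent Matplotlib versions the axes3d import is optional, since the 3D projection is registered automatically.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d  # noqa: F401 (needed on older Matplotlib)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")   # three-dimensional axes

# Made-up bars: (x, y, z) is the starting corner, (dx, dy, dz) the size
x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 2]
z = np.zeros(5)
dx = np.ones(5)          # width of each bar
dy = np.ones(5)          # depth of each bar
dz = [1, 2, 3, 4, 5]     # height of each bar

bars = ax.bar3d(x, y, z, dx, dy, dz, color="cyan")

ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis")
ax.set_zlabel("z-axis")
plt.show()
```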
In the above program, three-dimensional plots are enabled by importing the mplot3d toolkit. The plt.figure() method is used to create the figure that holds the 3D axes.
Once this submodule is imported, three-dimensional axes can be created by passing the keyword projection='3d' to any of the normal axes creation routines.
Then we have declared different variables as lists and NumPy arrays and plotted them using the bar3d() method, giving the bars a cyan color. Therefore the output will look something like this:
3D SCATTER PLOT
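A sketch of a 3D scatter plot with two sets of points is given here. The coordinates and colors are made up for illustration.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d  # noqa: F401 (needed on older Matplotlib)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Two made-up sets of points
x, y, z = [1, 2, 3, 4, 5], [5, 6, 7, 8, 2], [1, 2, 6, 3, 2]
x2, y2, z2 = [-1, -2, -3, -4, -5], [-5, -6, -7, -8, -2], [1, 2, 6, 3, 2]

# Each call to scatter() on the 3D axes draws one set in its own color
first = ax.scatter(x, y, z, c="g", marker="o")
second = ax.scatter(x2, y2, z2, c="r", marker="o")

ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis")
ax.set_zlabel("z-axis")
plt.show()
```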
In the above program, we have taken two sets of variables and plotted each with a different color using the scatter() method of the 3D axes.
Therefore the output will look something like this:
INTRODUCTION TO SEABORN
We have already read about Matplotlib, a plotting library that allows us to create 2D and 3D graphs. A complementary package built on top of this data visualization library is Seaborn, which provides a high-level interface for drawing statistical graphics.
Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary mapping and statistical aggregation to produce informative plots.
We import seaborn, which is the only library necessary, as follows
import seaborn as sns
sns is the alias for Seaborn. Internally, seaborn uses Matplotlib to draw plots.
We are already familiar with Histogram and a ‘hist’ function already exists in Matplotlib. A histogram represents the distribution of data by forming bins along with the range of the data and then drawing bars to show the number of observations that fall in each bin.
To illustrate this let us see the code below:
In the above program, we have created an array using the NumPy library and plotted the histogram using the distplot() method (deprecated since seaborn 0.11 in favor of displot() and histplot()).
The keyword kde stands for Kernel Density Estimate, a non-parametric way to estimate the probability density function of a random variable. In distplot(), kde is True by default.
The keyword rug adds a rug plot, which draws a small vertical tick at each observation.
Therefore the output will look something like this:
Let us now set kde to True (which is its default), remove rug, and see what happens:
The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like the histogram, the KDE plots encode the density of observations on one axis with height along the other axis.
If we use the kdeplot() function in seaborn, we get the same curve. Let’s look at an example.
Until now we have seen how to plot univariate distributions, i.e., distributions of only one variable. Now we will see an example of plotting bivariate distributions.
The most familiar way to visualize a bivariate distribution is a scatter plot, where each observation is shown with a point at the x and y values.
We can draw a scatterplot with the matplotlib plt.scatter function, and it is also the default kind of plot shown by the jointplot() function in seaborn.
We have already read about the Box Plot using Matplotlib library. Let us now see how plotting of Box Plot is done using Seaborn library.
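A minimal sketch of a seaborn box plot on random sample data (six columns of 20 values each, with a shift that grows from column to column):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")   # white background with grid lines

# 20 rows x 6 columns of random values; column k is shifted up by k / 2
data = np.random.normal(size=(20, 6)) + np.arange(6) / 2

ax = sns.boxplot(data=data)  # one box per column
plt.show()
```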
In the above example, the method set_style is used to set the theme to a white background with grid lines.
Then we have created 2 NumPy arrays (the first one having 20 arrays of 6 elements each, and the other one with 6 elements from 0 to 5, each divided by 2) and added them together.
Finally, the box plot has been drawn using the boxplot() method, passing data as the argument.
So the output would be:
Let us work through a use case that will help in understanding the above concepts better.
We have taken a dataset named StudentsPerformance.csv, which consists of the marks secured by students in various subjects.
You can download the dataset from the below link:
The data set includes the following columns:
- parental level of education
- test preparation course
- math score
- reading score
- writing score
First, we will import all the important libraries and then load the CSV file.
Now we will inspect and analyze our data using a variety of functions in the pandas library.
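A sketch of the loading and inspection steps is given here. It assumes StudentsPerformance.csv sits in the working directory. If the file is not found, a small hypothetical stand-in table with invented rows is used so that the code still runs.

```python
from pathlib import Path

import pandas as pd

csv_path = Path("StudentsPerformance.csv")
if csv_path.exists():
    df = pd.read_csv(csv_path)
else:
    # Hypothetical stand-in rows (invented values), used only if the file is missing
    df = pd.DataFrame({
        "gender": ["female", "male", "female", "male"],
        "math score": [72, 69, 90, 47],
        "reading score": [72, 90, 95, 57],
        "writing score": [74, 88, 93, 44],
    })

print(df.head())          # first rows of the table
print(df.shape)           # (number of rows, number of columns)
df.info()                 # column names, dtypes and non-null counts
print(df.describe())      # summary statistics of the numeric columns
print(df.isnull().sum())  # number of missing values per column
```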
1. We will see visualization based on gender with the help of bar graphs.
2. Count of the column race/ethnicity is shown with the below graph
3. Plotting graphs showing comparison for ‘writing score’, ‘reading score’ and ‘math score’ for both the ‘genders’ based on ‘parental level of education’ respectively.
4. Plotting a graph for both the genders comparing math score for different levels of education.
5. Visualizing different groups based on percentage with the help of a pie chart.
6. Plotting a graph for math score vs writing score for both the genders using scatter plot.
7. Visualizing frequency of math score vs writing score vs reading score using kde plot.
8. Visualization for math score for both the genders using Box Plot
9. Data visualization using Pairplot.
Pairplot plots pairwise relationships in a dataset.
As shown above, we can perform numerous operations on various data and create data visualizations using several plotting techniques.
This brings us to the end of our blog on data visualization. We hope this blog helped you to understand and use plotting techniques to create various data visualizations.
Keep visiting our website for more Data Science and Big Data related blogs.
You can refer to our previous blogs on the important Python libraries NumPy and Pandas, whose links are given below: