We are frequently asked which Python libraries a data science beginner should be aware of.
In this blog, we discuss the most popular Python libraries used in data science projects.
Before moving on to these libraries, we recommend that beginners revise Python by going through the series of videos below.
What you will Learn:
Library
 What is a “library”?
Data science libraries
 Python libraries used in data science projects.
 Popular functions used in each python library for/in data science projects with examples.
Core Libraries
 NumPy
 SciPy
Statistics
 Statsmodels
Data Loading and Processing
 Pandas
 List of functions used in pandas
Visualization
Matplotlib
 General tips for matplotlib
 Importing matplotlib
 Setting styles
 How to display your plots
 How to display images
Seaborn
 General tips for seaborn
 Basic steps to create plots with seaborn
 Types of plots we can draw with seaborn
 Categorical plots and their types
 Regression Plots
 Matrix plots
 Distribution Plots
Machine Learning
 Scikit-Learn
 XGBoost
Deep Learning
 TensorFlow
 Natural Language Processing
Python is a leading programming language for solving data science tasks and challenges. So now we will show you some useful and powerful Python libraries that are used in data science, for example to do scientific calculations.
All of these libraries are open source, so if you find them helpful in your business, you can donate to them through their official websites.
What is a library?
In very simple terms, a library is a file that contains some useful code. This code could be a single function or a collection of functions, variables, and classes.
Now, let us look at the most used python libraries for data science.
Core Libraries
1) NumPy (Commits: 20000+, Contributors: 1000+)
We start our list with the libraries used in scientific applications. NumPy is one of the top libraries for processing large multidimensional arrays and matrices.
It has a collection of high-level mathematical functions and methods.
We can also use NumPy for complex mathematical operations such as Fourier transforms, linear algebra, and random number generation. Its array interface also allows the user to reshape datasets.
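As a rough illustration of the operations just mentioned, the sketch below reshapes an array, solves a small linear system, takes a Fourier transform, and draws random numbers. The matrices, the signal, and the seed are made-up values chosen only for this example.

```python
import numpy as np

# Reshape: view the same 6 values as a 2x3 matrix
a = np.arange(6)
m = a.reshape(2, 3)

# Linear algebra: solve the system A @ x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Fourier transform of a constant signal: everything lands in the first bin
spectrum = np.fft.fft(np.ones(4))

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=0)
samples = rng.random(3)

print(m.shape)
print(x)
print(spectrum.real)
print(samples.shape)
```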
List of Functions Used In NumPy:

numpy.array()
In the following example, we create a one-dimensional array using NumPy.
In the below code, we have performed the following steps:
 Imported the numpy package
 Created an array with the function np.array
 Printed the array
 Checked the data type of the array
import numpy as np        # import the NumPy package as np
a = np.array([1, 2, 3])   # create an array with the array function
print(a)                  # print the array
print(a.dtype)            # check the data type of the array

numpy.genfromtxt()
numpy.genfromtxt function can be used to read files.
In the below code, we have performed the following steps:
 Using the genfromtxt function we read the studentdata.csv file
 Specifying the keyword argument delimiter=";" so that the fields are parsed properly.
 Specifying the keyword argument skip_header=1 so that the header row is skipped.
student = np.genfromtxt("studentdata.csv", delimiter=";", skip_header=1)
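To try this without a file on disk, the sketch below feeds genfromtxt an in-memory text buffer instead. The column names and marks are invented stand-ins for the contents of studentdata.csv, which this post does not show.

```python
import io
import numpy as np

# Hypothetical file contents standing in for studentdata.csv
csv_text = "math;science;english\n90;85;88\n70;65;72\n"

# genfromtxt accepts a file-like object as well as a file name
student = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)

print(student.shape)
print(student.mean(axis=0))  # column-wise averages
```

Because the header row is skipped and every remaining field is numeric, the result is a plain two-dimensional float array.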

numpy.arange()
Syntax: arange([start,] stop[, step,], dtype=None)
This function takes four parameters:
 start: number, optional
Start of interval. The interval includes the start value. The default start value is 0.
 stop: number
End of interval. The interval does not include the stop value, except in some cases where “step” is not an integer and floating-point round-off affects the length of “out”.
 step: number, optional
Spacing between values. For any output “out”, this is the distance between two adjacent values, “out[i+1] - out[i]”. The default step size is 1. If “step” is specified as a position argument, “start” must also be given.
 dtype: dtype
dtype is the type of the output array. If “dtype” is not given, the data type is inferred from the other input arguments.
In the below code, we have performed the following steps:
 We are using the arange function to create one-dimensional arrays.
np.arange(3)        # array([0, 1, 2])
np.arange(3.0)      # array([0., 1., 2.])
np.arange(3, 7)     # array([3, 4, 5, 6])
np.arange(3, 7, 2)  # array([3, 5])

numpy.broadcast
Syntax: np.broadcast(self, /, *args, **kwargs)
The numpy.broadcast function takes the following parameters:
in1, in2, …: the input arrays to broadcast against each other
The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
In the below code, we have performed the following steps:
 Creating an array called “x”
 Creating another array called “y”
 Broadcasting “x” and “y” against each other
x = np.array([[1], [2], [3]])
y = np.array([4, 5, 6])
b = np.broadcast(x, y)
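To make the effect of broadcasting visible, the sketch below reuses the same two arrays, inspects the shape they are broadcast to, and then adds them. The variable name total is ours and is used only for illustration.

```python
import numpy as np

x = np.array([[1], [2], [3]])   # shape (3, 1)
y = np.array([4, 5, 6])         # shape (3,)

b = np.broadcast(x, y)
print(b.shape)                  # the common shape both arrays are stretched to

# Arithmetic applies the same rule: every row of x is paired with all of y
total = x + y
print(total)
```

Neither array is copied to the larger shape in memory; NumPy only behaves as if it had been.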
2) SciPy (Commits: 21000+, Contributors: 1000+)
Another important Python library for researchers, developers, and data scientists doing scientific calculations is SciPy. SciPy is based on the data structures of NumPy and therefore extends NumPy's capabilities.
SciPy contains modules for linear algebra, optimization, integration, and statistics. Because it is built on NumPy, it makes substantial use of NumPy arrays.
List Of Functions Used in SciPy:

imread()
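In the below code, we have performed the following steps:
 Imported the misc package from the scipy library
 Read the image using the misc.imread() function
Note that misc.imread() was deprecated in SciPy 1.0 and removed in SciPy 1.2, so the code below runs only on older SciPy versions. On newer versions, an image can be read with, for example, matplotlib.pyplot.imread() instead.
from scipy import misc
img = misc.imread('Image_Name.png')   # works only on SciPy versions before 1.2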

linalg.det
The module for linear algebra operations is scipy.linalg, and you need to import it before performing any operation.
To calculate the determinant of a matrix, we use scipy.linalg.det() in the following way:
from scipy import linalg
In the below code, we have performed the following steps:
 Creating the square matrix
 Computing the determinant of the matrix
mat = np.array([[2, 1], [4, 3]])   # a square matrix 'mat'
linalg.det(mat)

linalg.inv()
Another function is inv(), which computes the inverse of a square matrix.
In the below code, we have performed the following steps:
 Creating a matrix
 Computing the inverse of the matrix
mat = np.array([[2, 1], [4, 3]])   # a square matrix 'mat'
linalg.inv(mat)
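A quick way to convince yourself that det() and inv() behave as described is to multiply the matrix by its inverse and check that the result is the identity. The sketch below does this for the same matrix; the variable names det, inv, and identity are ours.

```python
import numpy as np
from scipy import linalg

mat = np.array([[2, 1], [4, 3]])

det = linalg.det(mat)      # 2*3 - 1*4
inv = linalg.inv(mat)      # exists because the determinant is non-zero
identity = mat @ inv       # should be the 2x2 identity, up to rounding

print(det)
print(inv)
print(np.allclose(identity, np.eye(2)))
```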
Special Functions in SciPy
The scipy.special module contains a collection of transcendental functions that are frequently used across various disciplines. Here is the syntax for a few of the most commonly used functions from the scipy.special module:
# To calculate the area under a Gaussian curve, we use the erf() function:
scipy.special.erf()
# Gamma function:
scipy.special.gamma()
# Log of the Gamma function:
scipy.special.gammaln()
# Jacobi elliptic functions:
scipy.special.ellipj()
# Nth-order Bessel function:
scipy.special.jn()
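The calls above are shown without arguments, so here is a small sketch with concrete inputs. The input values are arbitrary and were picked because the results are easy to check by hand.

```python
from scipy import special

erf0 = special.erf(0.0)                     # error function at 0
g5 = special.gamma(5)                       # gamma(n) equals (n-1)! for positive integers
lg5 = special.gammaln(5)                    # natural log of gamma(5)
sn, cn, dn, ph = special.ellipj(0.0, 0.5)   # Jacobi elliptic functions at u=0, m=0.5
j0 = special.jn(0, 0.0)                     # Bessel function of the first kind, order 0, at 0

print(erf0, g5, lg5)
print(sn, cn, dn)
print(j0)
```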
Statistics
3) StatsModels (Commits: 17213+, Contributors: 489+)
As the name suggests, StatsModels is a Python library used for statistical calculations. This module provides functions and classes for estimating many different statistical models.
It can also conduct statistical tests and statistical data exploration. To ensure correctness, its results are tested against existing statistical packages.
This library is open source, so if you find it helpful in your business, you can donate through its official website.
To import the statsmodels library (or module), we use the command below.
import statsmodels.api as sm
Data Loading And Processing
4) Pandas (Commits: 19144+, Contributors: 800+)
Pandas is a data science library used for loading, processing, and analyzing data. It is designed to work with “labeled” and “relational” data.
Pandas is one of the best tools for data wrangling, which is among the most important steps in data science. There have been a few new releases of the pandas library, including hundreds of new features, enhancements, bug fixes, and API changes. The improvements concern pandas' abilities for grouping and sorting data, more suitable output for the apply method, and support for operations on custom types.
List Of Functions In Pandas:

pd.read_csv()
If the input file is in CSV format, we can use the pd.read_csv function to read it.
In the below code, we have performed the following steps:
 Importing pandas as pd
 Now we are reading the CSV file using the pd.read_csv function.
# Import the pandas module
import pandas as pd
# Read a CSV file
pd.read_csv('file_name.csv')

pd.read_excel
If the data is in an Excel file, we can use the pd.read_excel function to read it.
We can use two functions to read an Excel file.
In the below code, we have performed the following steps:
 Importing pandas as pd
 Now we are reading the Excel file using the pd.read_excel and pd.ExcelFile functions.
# Import the pandas module
import pandas as pd
# Read an Excel file
pd.read_excel('file_name.xlsx')
xlsx = pd.ExcelFile('your_excel_file.xlsx')
To get basic information about a data frame in pandas, we have attributes and methods like:
 shape
 index
 columns
 info()
 count()
In the below code, we have performed the following steps:
 Imported the pandas module as pd.
 Created a data frame from the data.
 Shown the attributes and methods used for basic information about the data frame.
import pandas as pd
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df.shape     # attribute, not a method: (rows, columns)
df.index     # attribute: row labels
df.columns   # attribute: column labels
df.info()    # method: column types and non-null counts
df.count()   # method: number of non-null values per column
To get a summary of the data frame, we have several functions like:
 sum()
 cumsum()
 min() / max()
 describe()
 mean()
 median()
df.sum()
df.cumsum()
df.min()
df.max()
df.describe()
df.mean(numeric_only=True)     # numeric_only=True skips the text columns
df.median(numeric_only=True)
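Since only the Population column is numeric, it can be clearer to apply the summary functions to that single column. The sketch below does so on a cut-down version of the data frame; the variable name pop is ours.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

pop = df['Population']          # a single numeric column (a Series)

print(pop.sum())                # total population across the three rows
print(pop.min(), pop.max())     # smallest and largest values
print(pop.median())             # middle value
print(pop.cumsum().tolist())    # running total, row by row
```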
Dropping columns
Sometimes you might want to delete unwanted columns, so we have functions like:
 df.drop
df.drop('Country', axis=1)
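One detail worth knowing is that drop returns a new data frame and leaves the original untouched unless you assign the result. The sketch below shows this on a small example frame; the names without_country and without_two are ours.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India'],
                   'Capital': ['Brussels', 'New Delhi'],
                   'Population': [11190846, 1303171035]})

# drop returns a new DataFrame; df itself is left unchanged
without_country = df.drop('Country', axis=1)

# columns= is an equivalent, more explicit spelling and accepts a list
without_two = df.drop(columns=['Country', 'Capital'])

print(list(without_country.columns))
print(list(without_two.columns))
print(list(df.columns))
```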
Functions To Handle Missing Data In pandas
Missing data/values can distort the results when an analysis is performed on a dataset. To handle missing values, for example by dropping a value or column or by filling in the gaps, we can use the functions below.
 isnull(): Generate a boolean mask indicating missing values
 notnull(): Opposite of isnull(); generate a boolean mask indicating non-missing values
 dropna(): Return a filtered version of the data
 fillna(): Return a copy of the data with missing values filled or imputed
Now let us look at the different kinds of missing values that can appear and be handled in data wrangling.
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ‘object’ (i.e., arrays of Python objects):
import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
print(vals1)
The other missing data representation, NaN (an acronym for Not a Number), is different: it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
import pandas as pd
import numpy as np
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into the compiled code. You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN.
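The way NaN spreads through arithmetic can be seen in a few lines. The sketch below reuses the vals2 array and also shows np.nansum, a NaN-aware aggregate that NumPy provides for exactly this situation.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

print(1 + np.nan)         # any arithmetic with NaN gives NaN
print(0 * np.nan)         # even multiplying by zero
print(vals2.sum())        # one NaN turns the whole aggregate into NaN
print(np.nansum(vals2))   # NaN-aware version skips the missing value
```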
So pandas has functions for working with missing values. Let's look at each of them.
Detecting null values
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
data = pd.Series([1, np.nan, 'hi', None])
data.isnull()
data[data.notnull()]
The isnull() and notnull() methods produce similar Boolean results for DataFrames.
Dropping null values
In addition to the masking used before, there are the conventional methods, dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:
data.dropna()
We can fill NA entries with a single value, such as zero:
data.fillna(0)
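The same methods work on a DataFrame, where dropna needs to know whether to drop rows or columns. The sketch below uses a small made-up frame with one missing value in each column; all variable names are ours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan]})

mask = df.isnull()              # Boolean mask with the same shape as df
rows_kept = df.dropna()         # default: drop every row that contains a NaN
cols_kept = df.dropna(axis=1)   # drop every column that contains a NaN
filled = df.fillna(0)           # replace each NaN with 0

print(mask)
print(rows_kept)
print(cols_kept.shape)
print(filled)
```

Here every column contains a missing value, so dropping by column leaves no columns at all, which is why dropping by row or filling is often the safer choice.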
Visualization
5) Matplotlib (Commits: 29747+, Contributors: 850+)
Matplotlib is another popular library for creating graphs that help you make decisions based on your data. It is a low-level library for creating two-dimensional diagrams and graphs.
There have been style changes in colors, sizes, fonts, and so on.
With a bit of effort, you can make just about any visualization, for example:
 Line Plots.
 Scatter Plots.
 Pie Charts.
 Bar Charts.
 Histograms.
General Matplotlib Tips
Before we dive into the details of creating visualizations with Matplotlib, there are a few useful things you should know about using the package.

Importing Matplotlib
Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:
import matplotlib as mpl
import matplotlib.pyplot as plt

Setting Styles
We will use the plt.style directive to choose appropriate aesthetic styles for our figures. Here we will set the classic style, which ensures that the plots we create use the classic Matplotlib style:
plt.style.use('classic')

How to Display your plots?
To display your plot, you can call the following function:
 plt.show()
In the below code, we have performed the following steps:
 Importing the matplotlib library
 Importing numpy
 Using np.linspace to generate the data for the plot.
 Using the plt.show function to display the plot.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 50)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
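The example above draws line plots only. As a rough sketch of the other plot types listed earlier, the code below places one of each on a single figure. The data, labels, and figure size are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
rng = np.random.default_rng(0)   # seeded so the random data is reproducible

# One row of five panels, one per plot type
fig, axes = plt.subplots(1, 5, figsize=(20, 4))

axes[0].plot(x, np.sin(x))
axes[0].set_title("Line plot")

axes[1].scatter(x, rng.random(50))
axes[1].set_title("Scatter plot")

axes[2].pie([40, 35, 25], labels=["A", "B", "C"])
axes[2].set_title("Pie chart")

axes[3].bar(["A", "B", "C"], [3, 7, 5])
axes[3].set_title("Bar chart")

axes[4].hist(rng.normal(size=500), bins=20)
axes[4].set_title("Histogram")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure, as in the previous example.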

How to display the image?
Often you may need to display an image while coding. For this you can use the Image object from the IPython library.
from IPython.display import Image
Image('my_figure.png')
6) Seaborn (Commits: 3000+, Contributors: 150+)
Seaborn is a high-level API based on the matplotlib library. It has a rich gallery of visualizations including some complex types like time series, joint plots, and violin diagrams.
The image below shows a line plot drawn with the seaborn library.
General seaborn Tips
Before we dive into the details of creating visualizations with seaborn, there are a few useful things you should know about using the seaborn package.
Use the code below to import the seaborn library:
import matplotlib.pyplot as plt
import seaborn as sns
The basic steps for creating plots with Seaborn:
 Prepare some data
 Control figure aesthetics
 Plot with Seaborn
 Further customize your plot
Types of Plots we can draw with the seaborn library:
 Categorical Plots
 Regression Plots
 Distribution Plots
 Matrix Plots
So let’s start with the categorical Plots
Categorical Plots
Categorical plots in turn come in several types, listed below:
 Scatter Plot
 Bar Chart
 Count Plot
 Point Plot
 Box Plot
 Violin Plot
So first we will start with:
Scatter Plot
In the below code, we have performed the following steps:
 Loading Data
 Scatterplot with one categorical variable.
 Categorical scatterplot with non-overlapping points.
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset("iris")
sns.stripplot(x="species", y="petal_length", data=iris)
sns.swarmplot(x="species", y="petal_length", data=iris)
Bar Chart
In the below code, we have performed the following steps:
 Loading the built-in Titanic data.
 Show point estimates and confidence intervals as rectangular bars
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
Count Plot
In the below code, we have performed the following steps:
 Loading Titanic data
 Showing the count of observations
titanic = sns.load_dataset("titanic")
sns.countplot(x="deck", data=titanic, palette="Greens_d")
Point Plot
In the below code, we have performed the following steps:
 Loading the Titanic data set
 Show point estimates and confidence intervals using scatterplot glyphs
titanic = sns.load_dataset("titanic")
sns.pointplot(x="class", y="survived", hue="sex", data=titanic, palette={"male":"g", "female":"m"}, markers=["^","o"], linestyles=["-","--"])
Boxplot
In the below code, we have performed the following steps:
 Plotting box plot with the Titanic dataset.
 Boxplot with wide-form data
sns.boxplot(x="alive", y="age", hue="adult_male", data=titanic)
sns.boxplot(data=iris, orient="h")
Regression Plots
In the below code, we have performed the following steps:
 Plot data and a linear regression model fit.
sns.regplot(x="sepal_width", y="sepal_length", data=iris)
Distribution Plots
In the below code, we have performed the following steps:
 Plotting univariate distribution
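# Plot a univariate distribution (distplot is deprecated since seaborn 0.11; histplot replaces it)
# data.y is not defined in this post, so an iris column is used here as a stand-in
plot = sns.histplot(iris["petal_length"], kde=False, color="b")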
Matrix Plots
In the below code, we have performed the following steps:
 Plotting Heatmap
uniform_data = np.random.rand(10, 12)   # assumed example data: uniform random values in [0, 1)
sns.heatmap(uniform_data, vmin=0, vmax=1)
Machine Learning
7) Scikit-Learn (Commits: 21793+, Contributors: 950+)
Scikit-learn is a simple and effective tool for data mining and data analysis. It is accessible to everybody and reusable in various contexts.
Everyone can use this library in their organization because it is open source.
The scikit-learn library can be used to solve:
 Classification Problems
 Clustering Problems
 Regression Problems
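To give a feel for the first item, the sketch below trains a classifier on the iris data set that ships with scikit-learn. The choice of logistic regression, the 25% test split, and the random seed are our assumptions for this example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features (flower measurements) and labels (species) bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the rows to measure how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)             # learn from the training rows

preds = model.predict(X_test)           # predicted species for unseen rows
accuracy = model.score(X_test, y_test)  # fraction of correct predictions
print(accuracy)
```

Clustering and regression models in scikit-learn follow the same pattern of creating an estimator and calling fit on it.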
As you know, machine learning is a vast field, and it is not feasible to show all the machine learning methods and algorithms in this blog, which is aimed at beginners. To learn machine learning algorithms and methods, please go through the data science blog section of AcadGild.
8) XGBoost (Commits: 19693+, Contributors: 879+)
XGBoost stands for eXtreme Gradient Boosting. It is a library optimized for distributed gradient boosting and designed to be highly efficient, flexible, and portable.
We use this library to train gradient-boosted decision trees and other gradient-boosted models.
In addition, we can integrate XGBoost with big data tools such as AWS, YARN, and Spark.
Deep Learning
9) TensorFlow (Commits: 18785+, Contributors: 995+)
TensorFlow is a great and popular library that helps you develop and train your machine learning models.
Why TensorFlow?
TensorFlow is an open source library for machine learning. It has a flexible and effective ecosystem of tools, libraries, and resources that lets developers build and deploy machine learning applications.
Natural Language Processing
10) NLTK (Commits: 18785+, Contributors: 995+)
What is Natural Language Processing?
Natural Language Processing (NLP) is the manipulation or understanding of text or speech by a machine or software. In NLP, computers rather than humans have the responsibility to interact, understand, and respond with the appropriate answer.
What is NLTK?
NLTK stands for Natural Language Toolkit. It is one of the most powerful libraries, containing packages that help a machine understand human language and respond with an appropriate answer.
Note: All the operations in this blog were performed in “Jupyter Notebook”, so we request that you install Jupyter Notebook on your system to try the above code examples hands-on.
In conclusion
We hope this post helped you get to know the most used libraries in data science projects.
Keep visiting our website AcadGild for further updates on data science and other technologies.