We are frequently asked which Python libraries a data science beginner should be aware of.
In this blog, we will discuss the most popular Python libraries used in data science projects.
Before moving on to the frequently used Python libraries, we recommend that beginners get a good revision of Python by referring to the below series of videos.
Python Tutorial for Beginners
What you will Learn:
Library
- What is a “library”?
Data science libraries
- Python libraries used in data science projects.
- Popular functions used in each Python library for data science projects, with examples.
Core Libraries
- NumPy
- SciPy
Statistics
- Statsmodels
Data Loading and Processing
- Pandas
- List of functions used in pandas
Visualization
Matplotlib
- General tips for matplotlib
- Importing matplotlib
- Setting styles
- How to display your plots
- How to display images
Seaborn
- General tips for seaborn
- Basic steps to create plots with seaborn
- Types of plots we can draw with seaborn
- Categorical plots and their types
- Regression Plots
- Matrix plots
- Distribution Plots
Machine Learning
- Scikit-Learn
- XGBoost
Deep Learning
- TensorFlow
Natural Language Processing
- NLTK
Python has gained a leading position among programming languages for solving data science tasks and challenges. So, now we will show you some useful and powerful Python libraries that are used in data science, for example for scientific calculations and other tasks.
All of these libraries are open source, so if you find them helpful in your business, you can donate to them on their official websites.
What is a library?
In very simple terms, a library is a file that consists of some useful code. This code could be a simple function, or a collection of functions, variables, and classes.
Now, let us look at the most used Python libraries for data science.
Core Libraries
1) NumPy (Commits: 20000+, Contributors: 1000+)
Firstly, we start our list with the libraries used in scientific applications, and NumPy is one of the top libraries for processing large multidimensional arrays and matrices.
It has a collection of high-level mathematical functions and methods.
We can also use NumPy in complex mathematical operations like Fourier transforms, linear algebra, and random number generation. Its array interface also allows the user to reshape datasets.
List of Functions Used In NumPy:
- numpy.array()
In the following example, we will be creating a one-dimensional array using numpy.
In the below code, we have performed the following steps:
- Imported the NumPy package
- Created an array with the np.array function
- Printed the array
- Checked the data type of the array
import numpy as np     # Importing the NumPy package as np
a = np.array([1,2,3])  # Creating an array with the array function
print(a)               # Printing the array
a.dtype                # Checking the data type of the array
- numpy.genfromtxt()
numpy.genfromtxt function can be used to read files.
In the below code, we have performed the following steps:
- Using the genfromtxt function we read the student-data.csv file
- Specifying the keyword argument delimiter=";" so that the fields are parsed properly.
- Specifying the keyword argument skip_header=1 so that the header row is skipped.
student = np.genfromtxt("student-data.csv", delimiter=";", skip_header=1)
- numpy.arange()
Syntax: arange([start,] stop[, step,] dtype=None)
This function takes four parameters, as below:
- start: number, optional
Start of interval. The interval includes the start value. The default start value is 0.
- stop: number
End of interval. The interval does not include the stop value, except in some cases where step is not an integer and floating-point round-off affects the length of the output.
- step: number, optional
Spacing between values. For any output out, this is the distance between two adjacent values, out[i+1] - out[i]. The default step size is 1. If step is specified as a positional argument, start must also be given.
- dtype: dtype
The type of the output array. If dtype is not given, the data type is inferred from the other input arguments.
In the below code, we have performed the following steps:
- We are using the arange function to create arrays of evenly spaced values.
np.arange(3)      # array([0, 1, 2])
np.arange(3.0)    # array([0., 1., 2.])
np.arange(3,7)    # array([3, 4, 5, 6])
np.arange(3,7,2)  # array([3, 5])
- numpy.broadcast
Syntax: np.broadcast(in1, in2, ...)
The numpy.broadcast class takes input arrays (in1, in2, ...) as parameters.
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
In the below code, we have performed the following steps:
- Creating an array called "x"
- Creating another array called "y"
- Broadcasting "x" and "y"
x = np.array([[1], [2], [3]])
y = np.array([4, 5, 6])
b = np.broadcast(x, y)
2) SciPy (Commits: 21000+, Contributors: 1000+)
Another important Python library for researchers, developers, and data scientists that can be used in scientific calculations is SciPy. SciPy is based on the data structures of NumPy and therefore extends NumPy's capabilities.
SciPy contains modules for linear algebra, optimization, integration, and statistics. Because it is built upon NumPy, it makes substantial use of NumPy's arrays.
List Of Functions Used in SciPy:
- imread()
In the below code, we have performed the following steps:
- Imported the misc package from the scipy library
- Now, using the misc.imread() function, we are reading the image
from scipy import misc
misc.imread('Image_Name.png')
- linalg.det()
The module for linear algebra operations is known as scipy.linalg, and you need to import it before performing any operation.
To calculate the determinant of a matrix, we use scipy.linalg.det() in the following way:
from scipy import linalg
In the below code, we have performed the following steps:
- Creating the square matrix
- Computing the determinant of a matrix
mat = np.array([[2,1],[4,3]])  # A square matrix 'mat'
linalg.det(mat)
- linalg.inv()
Another function is inv(), which can be used to compute the inverse of a square matrix.
In the below code, we have performed the following steps:
- Creating a matrix
- Computing the inverse of the matrix
mat = np.array([[2,1],[4,3]])  # A square matrix 'mat'
linalg.inv(mat)
Special Functions in SciPy
The scipy.special module contains a list of transcendental functions, which are frequently used across various disciplines. Here is the syntax for a few of the most used functions from the scipy.special module:
scipy.special.erf()      # To calculate the area under a Gaussian curve
scipy.special.gamma()    # Gamma function
scipy.special.gammaln()  # Log of the Gamma function
scipy.special.ellipj()   # Jacobian elliptic functions
scipy.special.jn()       # Nth-order Bessel function
Statistics
3) Statsmodels (Commits: 17213+, Contributors: 489+)
As the name suggests, statsmodels is a Python library used for statistical calculations. This module provides functions and classes for the estimation of many different statistical models.
It can conduct statistical tests and statistical data exploration. To ensure correctness, the results are tested against existing statistical packages.
To import the statsmodels library (or module), we use the below command:
import statsmodels.api as sm
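As a minimal sketch (an assumed example, not from the original post), here is how an ordinary least squares (OLS) regression can be fit with statsmodels:
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5])  # Hypothetical predictor values
y = np.array([2, 4, 5, 4, 5])  # Hypothetical response values
X = sm.add_constant(x)         # Add an intercept column to the design matrix
model = sm.OLS(y, X).fit()     # Estimate the OLS model
print(model.summary())         # Coefficients, R-squared, and test statistics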
Data Loading And Processing
4) Pandas (Commits: 19144+, Contributors: 800+)
Pandas is a data science library used for loading, processing, and analyzing the available data. Pandas is designed to work with "labeled" and "relational" data.
Pandas is one of the best tools for data wrangling, which is one of the most important steps in data science. There have been a few new releases of the pandas library, including hundreds of new features, enhancements, bug fixes, and API changes. The improvements concern pandas' abilities for grouping and sorting data, more suitable output for the apply method, and support for performing operations on custom types.
List Of Functions In Pandas:
- pd.read_csv()
If the input file is in CSV format, then we can use the pd.read_csv function to read the .csv file.
In the below code, we have performed the following steps:
- Importing pandas as pd
- Now we are reading the CSV file using the pd.read_csv function.
# Importing pandas module
import pandas as pd
# Reading a CSV file
pd.read_csv('file_name.csv')
- pd.read_excel()
If the data is in the form of excel then we can use pd.read_excel function to read the excel data.
We can use two functions to read an excel file.
In the below code, we have performed the following steps:
- Importing pandas as pd
- Now we are reading the Excel file using the pd.read_excel & pd.ExcelFile functions.
# Importing pandas module
import pandas as pd
# Reading an Excel file
pd.read_excel('file_name.xlsx')
xlsx = pd.ExcelFile('your_excel_file.xlsx')
For basic information about the data or a DataFrame in pandas, we have attributes and functions like:
- shape
- index
- columns
- info()
- count()
In the below code, we have performed the following steps:
- Imported pandas module as pd.
- Created data frame from the data.
- Now we show the attributes and functions used for basic information about the DataFrame.
import pandas as pd
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df.shape    # (rows, columns); an attribute, not a method
df.index    # The row index
df.columns  # The column labels
df.count()  # Non-null values per column
To get the Summary of the data frame we have a couple of functions like:
- sum()
- cumsum()
- min() / max()
- describe()
- mean()
- median()
df.sum()       # Sum of values
df.cumsum()    # Cumulative sum
df.min()       # Minimum values
df.max()       # Maximum values
df.describe()  # Summary statistics
df.mean()      # Mean of values
df.median()    # Median of values
Dropping columns
Sometimes you might want to delete unwanted columns, so we have functions like:
- df.drop()
df.drop('Country', axis=1)
Functions To Handle Missing Data In pandas
Missing data/values may harm the data when an analysis is performed on a given dataset. To handle these missing data/values, to drop a value or a column/columns to clean the missing data we can use the below functions.
- isnull(): Generate a boolean mask indicating missing values
- notnull(): Opposite of isnull()
- dropna(): Return a filtered version of the data
- fillna(): Return a copy of the data with missing values filled or imputed
Now let us see the different types of missing values that can occur and be handled during data wrangling.
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ‘object’ (i.e., arrays of Python objects):
import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
print(vals1)
Secondly, we have the other missing data representation, NaN (an acronym for Not a Number), which is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
import pandas as pd
import numpy as np
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into the compiled code. You should be aware that NaN is a bit like a data virus–it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:
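For instance, here is a minimal sketch (an assumed example) continuing from the vals2 array above:
1 + np.nan        # nan
0 * np.nan        # nan
vals2.sum()       # nan, since vals2 contains a NaN
np.nansum(vals2)  # 8.0; the NaN-aware aggregation ignores the missing value
NumPy also provides NaN-aware aggregations such as np.nansum, np.nanmin, and np.nanmax that skip over missing values.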
Pandas provides functions to operate on missing values; let us look at each function.
Detecting null values
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
data = pd.Series([1, np.nan, 'hi', None])
data.isnull()
data[data.notnull()]
The isnull() and notnull() methods produce similar Boolean results for DataFrames.
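For instance, a minimal sketch (an assumed example) with a small DataFrame:
df_na = pd.DataFrame({'a': [1, np.nan], 'b': ['x', None]})
df_na.isnull()  # A Boolean DataFrame marking each missing entry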
Dropping null values
In addition to the masking used before, there are the conventional methods, dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:
data.dropna()
We can fill NA entries with a single value, such as zero:
data.fillna(0)
Visualization
5) Matplotlib (Commits: 29747+, Contributors: 850+)
Matplotlib is another library popular for creating graphs that help you make decisions about your data. It is a low-level library for creating two-dimensional diagrams and graphs.
There have been style changes in colors, sizes, fonts, etc.
With a bit of effort, you can make just about any visualization, for example:
- Line Plots.
- Scatter Plots.
- Pie Charts.
- Bar Charts.
- Histograms.
General Matplotlib Tips
Before we dive into the details of creating visualizations with Matplotlib, there are a few useful things you should know about using the package.
- Importing Matplotlib
Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:
import matplotlib as mpl
import matplotlib.pyplot as plt
- Setting Styles
We will use the plt.style directive to choose appropriate aesthetic styles for our figures. Here we set the classic style, which ensures that the plots we create use the classic Matplotlib style:
plt.style.use('classic')
- How to display your plots?
To display your plot, we have functions (or you can call them methods) like:
- show()
In the below code, we have performed the following steps:
- Importing the matplotlib library
- Importing numpy
- Using np.linspace, we generate data for the plot.
- Using the plt.show method, we display the plot.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
- How to display images?
Often you may have to import or display images while coding; for this, you can use the Image object from the IPython library.
from IPython.display import Image
Image('my_figure.png')
6) Seaborn (Commits: 3000+, Contributors: 150+)
Seaborn is a high-level API based on the matplotlib library. It has a rich gallery of visualizations including some complex types like time series, joint plots, and violin diagrams.
[Image: a line plot drawn with the Seaborn library]
General seaborn Tips
Before we dive into the details of creating visualizations with seaborn, there are a few useful things you should know about using the seaborn package.
Follow the below code to import the seaborn library:
import matplotlib.pyplot as plt
import seaborn as sns
The basic steps to creating plots with Seaborn are below; a short sketch following these steps appears after the list:
- Prepare some data
- Control figure aesthetics
- Plot with Seaborn
- Further, customize your plot
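Here is a minimal sketch (an assumed example, requiring a recent seaborn version for scatterplot) that follows these four steps with seaborn's built-in tips dataset:
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")     # 1. Prepare some data
sns.set_style("whitegrid")          # 2. Control figure aesthetics
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)  # 3. Plot with Seaborn
ax.set_title("Tip vs. total bill")  # 4. Further customize your plot
plt.show()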
Types of Plots we can draw with the seaborn library:
- Categorical Plots
- Regression Plots
- Distribution Plots
- Matrix Plots
So let's start with categorical plots.
Categorical Plots
Again in categorical plots, there are types of plots available as below:
- Scatter Plot
- Bar Chart
- Count Plot
- Point Plot
- Box Plot
- Violin Plot
So first we will start with:
Scatter Plot
In the below code, we have performed the following steps:
- Loading Data
- Scatterplot with one categorical variable.
- Categorical scatterplot with non-overlapping points
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")
sns.stripplot(x="species", y="petal_length", data=iris)
sns.swarmplot(x="species", y="petal_length", data=iris)
Bar Chart
In the below code, we have performed the following steps:
- Loading Data which is inbuilt.
- Show point estimates and confidence intervals as rectangular bars
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
Count Plot
In the below code, we have performed the following steps:
- Loading Titanic data
- Showing the count of observations
titanic = sns.load_dataset("titanic")
sns.countplot(x="deck", data=titanic, palette="Greens_d")
Point Plot
In the below code, we have performed the following steps:
- Loading the Titanic data set
- Show point estimates and confidence intervals with scatter plot glyphs
titanic = sns.load_dataset("titanic")
sns.pointplot(x="class", y="survived", hue="sex", data=titanic,
              palette={"male": "g", "female": "m"},
              markers=["^", "o"], linestyles=["-", "--"])
Boxplot
In the below code, we have performed the following steps:
- Plotting box plot with the Titanic dataset.
- Boxplot with wide-form data
sns.boxplot(x="alive", y="age", hue="adult_male", data=titanic)
sns.boxplot(data=iris, orient="h")
Regression Plots
In the below code, we have performed the following steps:
- Plot data and a linear regression model fit.
sns.regplot(x="sepal_width", y="sepal_length", data=iris)
Distribution Plots
In the below code, we have performed the following steps:
- Plotting univariate distribution
# Plot a univariate distribution ('data' is assumed to be a DataFrame with a numeric column 'y')
plot = sns.distplot(data.y, kde=False, color="b")
Matrix Plots
In the below code, we have performed the following steps:
- Plotting Heatmap
# 'uniform_data' is assumed to be a 2-D array of values in [0, 1]
sns.heatmap(uniform_data, vmin=0, vmax=1)
Machine Learning
7) Scikit-Learn (Commits: 21793+, Contributors: 950+)
Scikit-learn offers simple and efficient tools for data mining and data analysis. It is accessible to everybody and reusable in various contexts.
Everyone can use this library in their organization because it is open source.
The scikit-learn library can be used to solve the following problem types (a small sketch follows the list):
- Classification Problems
- Clustering Problems
- Regression Problems
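Here is a minimal sketch (an assumed example, not from the original post) of solving a classification problem with scikit-learn's built-in iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)  # A simple k-nearest-neighbors classifier
model.fit(X_train, y_train)                  # Train on the training split
print(model.score(X_test, y_test))           # Accuracy on the held-out split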
As you know, machine learning is very vast, and it is not feasible to show all the machine learning methods and algorithms in this blog because it is aimed at beginners. Hence, to learn more about machine learning algorithms and methods, please go through the data science blog section of AcadGild.
8) XGBoost (Commits: 19693+, Contributors: 879+)
XGBoost stands for eXtreme Gradient Boosting. This library implements distributed gradient boosting and is optimized to be highly efficient, flexible, and portable.
We use this library to train gradient-boosted decision trees and other gradient-boosted models.
In addition, we can integrate XGBoost with big data tools such as AWS, YARN, and Spark.
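As a minimal sketch (an assumed example, not from the original post), XGBoost can be used through its scikit-learn-style API:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=50)  # An ensemble of gradient-boosted trees
model.fit(X_train, y_train)                 # Train on the training split
print(model.score(X_test, y_test))          # Accuracy on the held-out split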
Deep Learning
9) TensorFlow (Commits: 18785+, Contributors: 995+)
TensorFlow is a great and popular library that will help you develop and train your machine learning models.
Why TensorFlow?
Firstly, TensorFlow is an open source library for machine learning. It has a flexible and effective ecosystem of tools, libraries, and resources that lets developers build and deploy machine learning applications.
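A minimal sketch (an assumed example) of defining and compiling a small neural network with TensorFlow's Keras API:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()  # Print the layer structure and parameter counts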
Natural Language Processing
10) NLTK (Commits: 18785+, Contributors: 995+)
What is Natural Language Processing?
Natural Language Processing is the manipulation or understanding of text or speech by a machine or software. In NLP, instead of humans, computers have the responsibility to interact with, understand, and respond with the appropriate answer.
What is NLTK?
NLTK stands for Natural Language Toolkit. It is one of the most powerful libraries, containing packages that help machines understand human language and respond with an appropriate answer.
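As a minimal sketch (an assumed example, not from the original post), NLTK can split a sentence into word tokens:
import nltk
nltk.download('punkt')  # One-time download of the tokenizer models
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Natural Language Processing with NLTK is fun.")
print(tokens)  # ['Natural', 'Language', 'Processing', ...]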
Note: All the operations above in this blog were performed on a "Jupyter Notebook", so we request that you install Jupyter Notebook on your system to try the above code examples hands-on.
In conclusion
We hope this post was helpful for getting to know the most used libraries in data science projects.
Keep visiting our website AcadGild for further updates on data science and other technologies.