Data Manipulation with Pandas in Python

In this blog, we will learn about Pandas, one of Python’s most important libraries after NumPy.

If you are new and want to know about NumPy, refer to the link below for a detailed study of NumPy.

https://acadgild.com/blog/data-manipulation

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to work with 1D and 2D data, making data manipulation and analysis easy.

Pandas works with the following two data structures:

  • The Series and
  • The DataFrame

To begin coding with Pandas, we first have to install it, for example with pip install pandas. Pandas depends on NumPy, so NumPy must be installed too; pip installs it automatically as a dependency.

Once Pandas is installed, we can import it and check the version:
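A minimal sketch of the import and version check, assuming pandas is already installed. The variable name installed_version is only illustrative, and the printed value depends on your installation:

```python
import pandas

# __version__ is a plain string; its value depends on the installed release.
installed_version = pandas.__version__
print(installed_version)
```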

 

 

We can give pandas an alias when importing it:

import pandas as pd

This import convention will be used throughout the code in this blog.

Let’s dive deep into Series, DataFrame, missing values, and filling missing values using Pandas.

INTRODUCTION TO PANDAS SERIES OBJECT

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or an array, as the code below shows:
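A minimal sketch of creating a Series from a list and from a NumPy array. The values are illustrative and the variable names are assumptions, not part of the pandas API:

```python
import numpy as np
import pandas as pd

# Illustrative values; any list of numbers would do.
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)

# The same values passed as a NumPy array give an equivalent Series.
data_from_array = pd.Series(np.array([0.25, 0.5, 0.75, 1.0]))
print(data_from_array)
```

In both cases pandas assigns the default integer index 0, 1, 2, 3.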

 

 

 

 

In the above code, we also imported NumPy so that its functions are available when needed.

Series is the class we accessed through the alias ‘pd’ to create the object.

In the output, the first column shows the index and the second column shows the corresponding values. We can access them with the ‘index’ and ‘values’ attributes respectively, as shown in the code below:
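A short sketch of the two attributes, using the same illustrative values as before:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# The index is a pandas Index object (a RangeIndex by default).
print(data.index)

# The values are returned as a NumPy array.
print(data.values)
```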

 

 

 

Unlike a NumPy array, which has an implicitly defined integer index used to access the values, a Pandas Series has an explicitly defined index associated with the values.

That means the index of a Pandas Series need not be an integer; it can be of any desired data type. Let us see this with the code below:
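A minimal sketch of a Series with string labels as its index. The labels 'a' to 'd' and the values are illustrative:

```python
import pandas as pd

# The index argument supplies explicit labels instead of 0, 1, 2, ...
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)

# Values are then looked up by label.
second = data['b']
print(second)
```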


 

 

 

 

In the output, the index labels are strings.

A Pandas Series can also be thought of as a specialized Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important because it makes a Series much more efficient than a Python dictionary for certain operations.

Let us see this with the help of an example:
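A minimal sketch of building a Series from a dictionary. The city names and area figures are made-up placeholders, chosen only to illustrate the mapping:

```python
import pandas as pd

# Hypothetical cities and areas (placeholder numbers, not real data).
area_dict = {'CityA': 100, 'CityB': 200, 'CityC': 300, 'CityD': 400}

# The dictionary keys become the index and the values become the data.
area = pd.Series(area_dict)
print(area)
```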

 

 

 

 

We can access items dictionary-style as follows:
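A short sketch of dictionary-style access, again with placeholder city names and areas:

```python
import pandas as pd

# Hypothetical cities and areas (placeholder numbers, not real data).
area = pd.Series({'CityA': 100, 'CityB': 200, 'CityC': 300, 'CityD': 400})

# Look up a single value by its key, as with a dict.
city_c_area = area['CityC']
print(city_c_area)

# Membership tests also work on the index labels.
print('CityB' in area)
```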

 

 

The series also supports array-style operations such as slicing:
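A minimal sketch of slicing, using the same placeholder data. Note that slicing by label includes the end label, while positional slicing with iloc excludes the end position:

```python
import pandas as pd

# Hypothetical cities and areas (placeholder numbers, not real data).
area = pd.Series({'CityA': 100, 'CityB': 200, 'CityC': 300, 'CityD': 400})

# Label-based slice: 'CityC' is included in the result.
by_label = area['CityA':'CityC']
print(by_label)

# Position-based slice: position 2 is excluded, as in NumPy.
by_position = area.iloc[0:2]
print(by_position)
```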

 

 

 

INTRODUCTION TO PANDAS DATAFRAME OBJECT

Pandas DataFrame is a two-dimensional data structure, where data is aligned in a tabular fashion in rows and columns.

Creating a DataFrame from a list:
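A minimal sketch, assuming illustrative data. A flat list becomes a single column, and a list of lists becomes one row per inner list; the column names are hypothetical:

```python
import pandas as pd

# A flat list gives a one-column DataFrame.
numbers = [10, 20, 30, 40]
df_single = pd.DataFrame(numbers, columns=['value'])
print(df_single)

# A list of lists gives one row per inner list.
rows = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df_rows = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df_rows)
```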

 

 

 

 

Creating a DataFrame from a dictionary:

To create a DataFrame from a dictionary of arrays or lists, all the arrays must be of the same length.
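A minimal sketch with made-up names and ages. The second part shows what happens when the lists differ in length; the flag name length_error_raised is only illustrative:

```python
import pandas as pd

# Each key becomes a column; every list has 4 entries.
people = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
          'Age': [28, 34, 29, 42]}
df = pd.DataFrame(people)
print(df)

# Lists of unequal length are rejected with a ValueError.
length_error_raised = False
try:
    pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28]})
except ValueError:
    length_error_raised = True
print(length_error_raised)
```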

 

 

 

 

Creating a DataFrame from a Series:

To the earlier program, which built a Series from a dictionary, we will add one more dictionary, as in the code shown below:
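A minimal sketch of this step. The city names, areas, and state names are placeholders rather than real data; only the column names Area and State follow the text:

```python
import pandas as pd

# Hypothetical cities with placeholder areas.
area = pd.Series({'CityA': 100, 'CityB': 200, 'CityC': 300, 'CityD': 400})

# A second Series holding the (hypothetical) state of each city.
state = pd.Series({'CityA': 'StateX', 'CityB': 'StateX',
                   'CityC': 'StateY', 'CityD': 'StateZ'})

# Each Series becomes a column; rows are aligned on the shared index.
cities = pd.DataFrame({'Area': area, 'State': state})
print(cities)
```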

 

 

 

 

 

 

 

 

 

 

 

In the above code, a new Series named state is created, which holds the states of the 4 cities. Then, using DataFrame, we combined the two Series into 2 columns, namely Area and State.

In the output shown, the first column holds the row labels, which can be accessed by using the attribute ‘index’ as shown in the example below:
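A short sketch of the index attribute on a DataFrame, with the same placeholder cities:

```python
import pandas as pd

# Hypothetical data, as in the DataFrame-from-Series example.
area = pd.Series({'CityA': 100, 'CityB': 200, 'CityC': 300, 'CityD': 400})
state = pd.Series({'CityA': 'StateX', 'CityB': 'StateX',
                   'CityC': 'StateY', 'CityD': 'StateZ'})
cities = pd.DataFrame({'Area': area, 'State': state})

# The row labels of the DataFrame.
row_labels = cities.index
print(row_labels)
```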

 

 

Likewise, the column labels can be accessed by using the attribute ‘columns’:
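A short sketch of the columns attribute, plus selecting one column by its label. The data is again hypothetical:

```python
import pandas as pd

# Hypothetical data, as in the DataFrame-from-Series example.
area = pd.Series({'CityA': 100, 'CityB': 200, 'CityC': 300, 'CityD': 400})
state = pd.Series({'CityA': 'StateX', 'CityB': 'StateX',
                   'CityC': 'StateY', 'CityD': 'StateZ'})
cities = pd.DataFrame({'Area': area, 'State': state})

# The column labels of the DataFrame.
column_labels = cities.columns
print(column_labels)

# A single column is returned as a Series.
area_column = cities['Area']
print(area_column)
```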

 

 

OPERATIONS ON DATA 

Pandas provides functions and methods for combining datasets. These include concat, merge, and join.

concat(): To concatenate DataFrames along the row axis (stacking them one below another), we use the concat() function in pandas. We pass the DataFrames in a list as the argument to concat(), as shown in the example below:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},  index = [0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},  index = [4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']}, index = [8, 9, 10, 11])

pd.concat([df1, df2, df3])

Output: a single DataFrame with 12 rows, indexed 0 to 11, and the columns A, B, C and D.

In the above code, we created 3 DataFrames, df1, df2 and df3, and concatenated them using the concat() function.

merge(): This function also combines two DataFrames, but it does so by looking for one or more matching column names between the two inputs and using them as the key.

Sometimes the key cannot be inferred this cleanly, for example when the column names differ, so the function provides some keywords to handle such cases, which we discuss below.

Let us see an example of this:
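A minimal sketch of merge() on two small DataFrames. The employee names, groups, and hire years are made up for illustration. Both frames share the column employee, so merge() uses it as the key; passing on='employee' states the same key explicitly:

```python
import pandas as pd

# Hypothetical data: both frames share the 'employee' column.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# The common column is detected and used as the key.
merged = pd.merge(df1, df2)
print(merged)

# The same merge with the key named explicitly.
merged_on = pd.merge(df1, df2, on='employee')
print(merged_on)
```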

 

 

 

 

Using the ‘on’ keyword, which takes a column name or a list of column names, we can explicitly specify the key column.

Other keywords include:

left_on/right_on: When the two datasets hold the same key data under different column names, we can use the left_on and right_on keywords to specify the two column names.

left_index/right_index: When we have to merge two datasets based on their index, we can set left_index and/or right_index to True.
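A minimal sketch of both keyword pairs, assuming made-up employee data in which the key column is called employee in one frame and name in the other:

```python
import pandas as pd

# Hypothetical data: the key is 'employee' on the left, 'name' on the right.
left = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
right = pd.DataFrame({'name': ['Sue', 'Lisa', 'Jake', 'Bob'],
                      'salary': [90000, 120000, 80000, 70000]})

# left_on/right_on name the key column on each side; both columns are kept.
by_columns = pd.merge(left, right, left_on='employee', right_on='name')
print(by_columns)

# left_index/right_index merge on the row labels instead of on columns.
by_index = pd.merge(left.set_index('employee'), right.set_index('name'),
                    left_index=True, right_index=True)
print(by_index)
```

After the first merge the redundant name column can be removed with drop('name', axis=1) if it is not needed.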

HANDLING MISSING DATA 

Let us discuss how to handle missing data, but before that, let us understand what missing data is.

Missing data occurs when no information is provided for one or more items. Missing data in Pandas is represented by NaN (Not a Number) or None.

Let us see how missing data occurs:
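A minimal sketch of how reindexing produces missing values. The row labels, column names, and the random seed are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# A 5x3 DataFrame of random numbers with labelled rows and columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindex to 8 rows; the new labels 'b', 'd' and 'g' have no data, so they get NaN.
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
```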

 

 

 

 

 

 

In the above program, we created a DataFrame of dimension 5×3 and later reindexed it to 8 rows. The newly introduced row labels have no data, so their values appear as NaN. For re-indexing, we used the reindex() method, which conforms a DataFrame to new row labels and column labels.

CHECK FOR MISSING VALUES

We can check for missing data by using isnull() and notnull() functions:
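A minimal sketch of both checks on one column. The DataFrame uses fixed illustrative numbers (0.0 to 14.0) so that the result is easy to follow; the labels and column names are assumptions:

```python
import numpy as np
import pandas as pd

# 5x3 frame reindexed to 8 rows, so rows 'b', 'd' and 'g' are all NaN.
df = pd.DataFrame(np.arange(15.0).reshape(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True where the value is missing.
missing = df['one'].isnull()
print(missing)

# True where the value is present (the exact opposite).
present = df['one'].notnull()
print(present)
```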

 

 

 

 

 

 

In the above program, the isnull() method checks for null values and returns True wherever a value is missing.

FILLING MISSING VALUES 

We can fill in the missing values by using a function such as fillna():
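A minimal sketch of fillna() with the value 1, on the same illustrative DataFrame. fillna() returns a new DataFrame and leaves the original unchanged:

```python
import numpy as np
import pandas as pd

# 5x3 frame reindexed to 8 rows, so rows 'b', 'd' and 'g' are all NaN.
df = pd.DataFrame(np.arange(15.0).reshape(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Replace every NaN with 1; existing values are kept as they are.
filled = df.fillna(1)
print(filled)
```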

 

 

 

 

 

 

 

In the above program, we called the function fillna() with 1 as the argument. Hence the value 1 is filled in place of each missing value.

DROPPING MISSING VALUES 

We can drop null values from a DataFrame using the dropna() function. By default, this function works along rows, dropping every row that contains at least one missing value.
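A minimal sketch of dropna() on the same illustrative DataFrame. The second call uses axis=1 to drop columns instead; here every column contains a NaN, so no columns remain:

```python
import numpy as np
import pandas as pd

# 5x3 frame reindexed to 8 rows, so rows 'b', 'd' and 'g' are all NaN.
df = pd.DataFrame(np.arange(15.0).reshape(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Default (axis=0): drop every row that has at least one NaN.
rows_dropped = df.dropna()
print(rows_dropped)

# axis=1: drop every column that has at least one NaN.
cols_dropped = df.dropna(axis=1)
print(cols_dropped.shape)
```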

 

 

 

 

 

 

We hope this post has been helpful in understanding how Pandas works, its various operations, and some other concepts, which have been explained with the help of code and output.

In future, you can expect more blogs on Python libraries, until then keep visiting our website Acadgild for more updates on Data Science and other technologies.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science
