Python | Introduction to Pandas

What is Pandas?

Pandas is an open-source Python library and one of the most widely used tools among data scientists and analysts today. It makes importing, analyzing, and visualizing data much simpler. Pandas is built on top of NumPy and integrates with Matplotlib, so most data analysis and visualization work can be done in one convenient place.

The name Pandas is derived from "panel data" (an econometric term) and is also read as a play on "Python data analysis". The library is fast and designed for high performance and productivity. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

Today's Agenda

In this post, we will learn how to start working with Pandas. We will begin by installing the package, then create our first Series and DataFrame and perform a few operations on them.

Prerequisites

This post is written for readers who:

Have access to a Linux-based or Windows-based system.
Have Python installed on their system. Check the Python version using: python --version
Have NumPy, pytz, python-dateutil, and setuptools installed beforehand (pip and conda normally install missing dependencies automatically).
Are eager to learn and try out such a useful module.

Let's get started

Step 1: Install and Import Pandas

Pandas can be installed on Windows and Linux in two ways:

Pandas comes with the Anaconda distribution. If it is missing from a conda environment, it can be installed using the following command:

 conda install pandas

Pandas can also be installed via pip from PyPI with the following command:

 pip install pandas

To import Pandas, write the following line:

 import pandas as pd

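A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal check; the version string you see depends on your environment.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string
# shows which release is active in the current environment.
print(pd.__version__)
```
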
Step 2: Get to Know the Data Structures

The two primary data structures of Pandas are Series and DataFrame.

A Series is essentially a single labeled column, and a DataFrame is a table made up of multiple columns, each of which is a Series.
These data structures are built on top of NumPy arrays, which makes operations on them fast.

[Figure: Data Structures in Pandas]
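The relationship between the two structures can be checked directly. The sketch below uses a small made-up table (the names table, x, and y are illustrative only) to show that a single column of a DataFrame is a Series, and that the values of a numeric Series are available as a NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table used only for illustration.
table = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Selecting a single column returns a Series.
column = table["x"]
print(type(column).__name__)

# The values of a numeric Series can be obtained as a NumPy array.
values = column.to_numpy()
print(type(values).__name__)
```
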


Step 3: Creating a Series

A Series can be created from various inputs, including lists, arrays, dictionaries, or a scalar value.

Creating a Pandas Series by passing a list of values. In this case, we let Pandas create a default integer index starting from 0.

 
 import pandas as pd
 import numpy as np

 series1 = pd.Series([2, 6, 10, np.nan, 3, 41])
 print(series1)
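
Its output will be:

 0     2.0
 1     6.0
 2    10.0
 3     NaN
 4     3.0
 5    41.0
 dtype: float64

Because NaN is a floating-point value, the integers are stored as floats and the Series has dtype float64.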
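The effect of a missing value on the dtype can be seen by comparing two Series. This is a minimal sketch; the variable names ints and with_nan are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# A list of plain integers gives an integer dtype.
ints = pd.Series([2, 6, 10])

# Adding NaN forces a floating-point dtype, because NaN is a float.
with_nan = pd.Series([2, 6, 10, np.nan])

print(ints.dtype)
print(with_nan.dtype)

# isna() marks the missing entries; sum() counts them.
missing_count = with_nan.isna().sum()
print(missing_count)
```
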

Creating a Pandas Series from an array with a custom index. The index must have the same length as the data. Index labels do not have to be unique, although unique labels are usually easier to work with.

 
 import pandas as pd
 import numpy as np

 data = np.array(['a', 'n', 's', 'h', 'i']) 
 series2 = pd.Series(data, index = [10, 11, 12, 13, 14]) 
 print(series2)

Its output will be:

 10    a
 11    n
 12    s
 13    h
 14    i
 dtype: object
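With a custom index, elements can be looked up by label as well as by position. The sketch below reuses the data from the example above and also shows what happens when the index length does not match the data; the variable length_error is introduced only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array(['a', 'n', 's', 'h', 'i'])
series2 = pd.Series(data, index=[10, 11, 12, 13, 14])

# .loc looks up by index label, .iloc by integer position.
by_label = series2.loc[12]
by_position = series2.iloc[0]
print(by_label, by_position)

# An index whose length differs from the data is rejected.
length_error = None
try:
    pd.Series(data, index=[10, 11])
except ValueError as exc:
    length_error = exc
    print("ValueError:", exc)
```
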

Creating a Pandas Series from a dictionary. If we do not specify an index, the dictionary keys are used as the index, in their insertion order. If we do specify an index, the values are looked up by key for each index label.

 
 import pandas as pd
 import numpy as np

 data = {'a': 7, 'b': 11, 'c': 3} 
 series3 = pd.Series(data, index = ['b', 'c', 'd', 'a'])
 print(series3)

Its output will be:

 b    11.0
 c     3.0
 d     NaN
 a     7.0
 dtype: float64

Note: The order of the given index is preserved, and the label with no matching key ('d') is filled with NaN.
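Both cases can be compared side by side. The following sketch assumes a current Pandas version, where dictionary keys keep their insertion order; the dictionary and variable names are illustrative.

```python
import pandas as pd

data = {'c': 3, 'a': 7, 'b': 11}

# Without an explicit index, the keys become the index in insertion order.
no_index = pd.Series(data)
key_order = list(no_index.index)
print(key_order)

# With an explicit index, labels that are not dictionary keys get NaN.
with_index = pd.Series(data, index=['b', 'c', 'd', 'a'])
missing = with_index[with_index.isna()].index.tolist()
print(missing)
```
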

Creating a Pandas Series from a scalar value. When the data is a scalar, an index should be provided: the value is repeated to match the length of the index. Without an index, the result is a Series with a single element.

 
 import pandas as pd
 import numpy as np

 series4 = pd.Series(7, index = [0, 1, 2])
 print(series4)

Its output will be:

 0    7
 1    7
 2    7
 dtype: int64
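The index does not have to be numeric. As a small sketch, with labels chosen only for illustration, the same scalar can be repeated over string labels, and omitting the index yields a one-element Series.

```python
import pandas as pd

# No index: the scalar becomes a Series with a single element.
single = pd.Series(7)

# With an index: the scalar is repeated once per label.
repeated = pd.Series(7, index=['x', 'y', 'z'])

print(len(single), len(repeated))
print(repeated.tolist())
```
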


Step 4: Creating a Data Frame

Features of DataFrame

Columns can be of different data types.
The size of a DataFrame is mutable.
It consists of labeled axes (rows and columns).
We can perform various arithmetic operations on rows and columns.
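These features can be seen together in a short sketch. The table below is made up for illustration (the column names Item, Price, Qty, and Total are not from the examples in this post).

```python
import pandas as pd

# Hypothetical data: one text column, one float column, one integer column.
df = pd.DataFrame({"Item": ["Pen", "Bag"], "Price": [10.0, 250.0], "Qty": [3, 1]})

# Each column keeps its own data type.
print(df.dtypes)

# Arithmetic works column-wise, and adding a column changes the size.
df["Total"] = df["Price"] * df["Qty"]
print(df.shape)

# Both axes carry labels: row labels (index) and column labels.
print(df.index.tolist(), df.columns.tolist())
```
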

A DataFrame can be created from various inputs, including lists, dictionaries, Series, NumPy arrays, or another DataFrame.

Creating a Pandas DataFrame using a list of lists.

 
 import pandas as pd

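 data = [['Ball', 50], ['Notebook', 120], ['Chips', 30]]
 df1 = pd.DataFrame(data, columns=['Product', 'Price'])
 # Convert only the numeric column to floating point.
 df1['Price'] = df1['Price'].astype(float)
 print(df1)

Its output will be:

     Product  Price
 0      Ball   50.0
 1  Notebook  120.0
 2     Chips   30.0

Note: astype(float) changes the data type of the "Price" column to floating point. Passing dtype=float to the DataFrame constructor would apply to every column, and recent Pandas versions raise an error in that case because "Product" holds strings.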

Creating an indexed Pandas DataFrame from a dictionary of lists.

 
 import pandas as pd

 data = {'Name': ['Rahul', 'Oscar', 'Stephen', 'Amar'], 'Age': [28, 34, 21, 25]} 
 df2 = pd.DataFrame(data, index = ['Rank1', 'Rank2', 'Rank3', 'Rank4'])   
 print(df2)

Its output will be:

           Name  Age
 Rank1    Rahul   28
 Rank2    Oscar   34
 Rank3  Stephen   21
 Rank4     Amar   25

Note: The index parameter assigns a label ("Rank1" to "Rank4") to each row.
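Once rows have labels, they can be selected by label with .loc. The sketch below rebuilds the same DataFrame and reads one row and one cell; the variable names row and age are introduced here for illustration.

```python
import pandas as pd

data = {'Name': ['Rahul', 'Oscar', 'Stephen', 'Amar'], 'Age': [28, 34, 21, 25]}
df2 = pd.DataFrame(data, index=['Rank1', 'Rank2', 'Rank3', 'Rank4'])

# Select a whole row by its label; the result is a Series.
row = df2.loc['Rank2']
print(row['Name'], row['Age'])

# Select a single cell by row label and column name.
age = df2.loc['Rank3', 'Age']
print(age)
```
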

Creating a Pandas DataFrame from a dictionary, with the default integer index.

 
 import pandas as pd

 data = {'Fruits': ['Apples', 'Mangoes', 'Oranges', 'Guavas'],
         'Price/kg': [70, 120, 30, 65]}
 # Build the DataFrame; each dictionary key becomes a column.
 df3 = pd.DataFrame(data)
 print(df3)

Its output will be:

     Fruits  Price/kg
 0   Apples        70
 1  Mangoes       120
 2  Oranges        30
 3   Guavas        65
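The list of inputs at the start of Step 4 also mentions Series and NumPy arrays, which the examples so far do not cover. A minimal sketch of both follows; the "Stock" column and the names from_series and from_array are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of Series: the Series are aligned on their index labels.
price = pd.Series([70, 120], index=['Apples', 'Mangoes'])
stock = pd.Series([15, 8], index=['Apples', 'Mangoes'])  # hypothetical values
from_series = pd.DataFrame({'Price/kg': price, 'Stock': stock})
print(from_series)

# From a 2-D NumPy array: column names are supplied separately.
array = np.array([[1, 2], [3, 4], [5, 6]])
from_array = pd.DataFrame(array, columns=['A', 'B'])
print(from_array.shape)
```
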