Python | Introduction to Pandas



What is Pandas?

Pandas is an open-source Python library and one of the most widely used tools among data scientists and analysts. It makes importing, analyzing, and visualizing data much simpler. Pandas builds on NumPy and integrates with Matplotlib for plotting, which gives you a single, convenient place to do most data analysis and visualization work.

The name Pandas is derived from "panel data" (an econometric term) and is also a play on "Python data analysis". Pandas is a fast, high-performance library that makes its users more productive. Python with Pandas is used across academic and commercial domains, including finance, economics, statistics, and analytics.

Today's Agenda

In this post, we will learn how to start working with Pandas. We will begin by installing the package, then create our first Series and DataFrame and perform a few operations on them.

Prerequisites

This post is for readers who:
  1. Have access to a Linux-based or Windows-based system.
  2. Have Python installed on their system. Check the Python version using: python --version
  3. Have NumPy, pytz, python-dateutil, and setuptools installed beforehand.
  4. Are eager to learn and try out such a useful module.

Let's get started

Step 1: Install and Import Pandas

Pandas can be installed on Windows and Linux in two ways:
  • Pandas comes with the Anaconda distribution.
    It can also be installed with conda using the following command:
 conda install pandas

  • Pandas can be installed via pip from PyPI with the following command:
 pip install pandas

To import Pandas, write the following code:
 import pandas as pd
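
A quick way to confirm that the installation worked is to import the package and print its version. This is a minimal sketch; the variable name installed_version is only illustrative, and the version string depends on your system.

```python
import pandas as pd

# If the import succeeds, Pandas is installed.
# The version string differs from system to system.
installed_version = pd.__version__
print(installed_version)
```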

Step 2: Start knowing Data Structures

The two primary data structures of Pandas are the Series and the DataFrame.

A Series is basically a column, and a DataFrame is a table with multiple columns, made up of a collection of Series.
These data structures are built on top of the NumPy array, which means they are fast.
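
The relationship between the two structures can be sketched as follows. The table contents are made-up sample values; the point is only that selecting one column of a DataFrame gives back a Series, and that a Series can hand over its values as a NumPy array.

```python
import pandas as pd

# A small table with two columns (sample values for illustration only).
df = pd.DataFrame({'Product': ['Ball', 'Chips'], 'Price': [50, 30]})

# Selecting a single column returns a Series.
price = df['Price']
print(type(price).__name__)   # Series

# The values of the Series are available as a NumPy array.
values = price.to_numpy()
print(type(values).__name__)  # ndarray
```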

Step 3: Creating a Series

A Series can be created from various inputs, including lists, arrays, dictionaries, and scalar values.

  • Creating a Pandas Series by passing a list of values. In this case, we are letting Pandas create a default integer index starting from 0.
 
import pandas as pd
import numpy as np

series1 = pd.Series([2, 6, 10, np.nan, 3, 41])
print(series1)
 
Its output will be:

0     2.0
1     6.0
2    10.0
3     NaN
4     3.0
5    41.0
dtype: float64


  • Creating a Pandas Series using an Array and adding a customized index. The index must have the same length as the data. Pandas does not require the index values to be unique, but unique labels are usually easier to work with.
 
import pandas as pd
import numpy as np

data = np.array(['a', 'n', 's', 'h', 'i'])
series2 = pd.Series(data, index = [10, 11, 12, 13, 14])
print(series2) 
 
Its output will be:

10    a
11    n
12    s
13    h
14    i
dtype: object

  • Creating a Pandas Series using a Dictionary. If we do not specify an index, the dictionary keys become the index (in insertion order in current Pandas versions; older releases sorted them). If we specify an index, the values whose keys match the index labels are taken.
 
import pandas as pd
import numpy as np

data = {'a': 7, 'b': 11, 'c': 3}
series3 = pd.Series(data, index = ['b', 'c', 'd', 'a'])
print(series3) 
 
Its output will be:

b    11.0
c     3.0
d     NaN
a     7.0
dtype: float64

Note: The index order is preserved, and the missing element is filled with NaN.
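
The effect of the missing label can be inspected with a few standard Series methods. The sketch below reuses the dictionary from the example above; the names missing and cleaned are only illustrative.

```python
import pandas as pd

data = {'a': 7, 'b': 11, 'c': 3}
series3 = pd.Series(data, index=['b', 'c', 'd', 'a'])

# isna() marks the labels that had no matching dictionary key.
missing = series3.isna()
print(missing['d'])          # True

# The integers were upcast to float64 because NaN is a float.
print(series3.dtype)         # float64

# dropna() returns a new Series without the missing entries.
cleaned = series3.dropna()
print(list(cleaned.index))   # ['b', 'c', 'a']
```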

  • Creating a Pandas Series using a Scalar Value. When the data is a scalar value, an index should be provided. The value is repeated to match the length of the index.
 
import pandas as pd
import numpy as np

series4 = pd.Series(7, index = [0, 1, 2])
print(series4) 
 
Its output will be:

0    7
1    7
2    7
dtype: int64


Step 4: Creating a Data Frame

Features of DataFrame

  • Columns can be of different data types.
  • The size of a DataFrame is mutable.
  • It consists of labeled axes (rows and columns).
  • We can perform various arithmetic operations on rows and columns.
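
The features listed above can be illustrated with a short sketch. The Quantity values and the Total column are made up for this example and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Ball', 'Notebook', 'Chips'],
                   'Price': [50.0, 120.0, 30.0]})

# Columns can hold different data types (text and floats here).
print(df.dtypes)

# The size is mutable: add a new column (hypothetical quantities).
df['Quantity'] = [2, 1, 5]

# Both axes are labeled.
print(list(df.columns))      # ['Product', 'Price', 'Quantity']
print(list(df.index))        # [0, 1, 2]

# Arithmetic works element by element across columns.
df['Total'] = df['Price'] * df['Quantity']
print(df['Total'].tolist())  # [100.0, 120.0, 150.0]
```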

A DataFrame can be created from various inputs, including lists, dictionaries, Series, NumPy arrays, or another DataFrame.

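  • Creating a Pandas DataFrame using a list of lists.
 
import pandas as pd

data = [['Ball', 50], ['Notebook', 120], ['Chips', 30]]
df1 = pd.DataFrame(data, columns = ['Product', 'Price'])
# Convert only the numeric column; dtype = float in the constructor would apply to every column
df1 = df1.astype({'Price': float})
print(df1) 
 
Its output will be:

    Product  Price
0      Ball   50.0
1  Notebook  120.0
2     Chips   30.0

Note - astype changes the data type of the "Price" column to floating point. Passing dtype = float to the constructor applies to every column, which recent Pandas versions may reject because "Product" holds text.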

  • Creating an indexed Pandas DataFrame using a dictionary of arrays (lists).
 
import pandas as pd

data = {'Name': ['Rahul', 'Oscar', 'Stephen', 'Amar'], 'Age': [28, 34, 21, 25]}
df2 = pd.DataFrame(data, index = ['Rank1', 'Rank2', 'Rank3', 'Rank4'])
print(df2) 
 
Its output will be:

          Name  Age
Rank1    Rahul   28
Rank2    Oscar   34
Rank3  Stephen   21
Rank4     Amar   25

Note - The index parameter assigns a label (here "Rank1" to "Rank4") to each row.
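
Once the rows have labels, they can be used to look up data. The sketch below reuses the same DataFrame and shows label-based access with .loc next to position-based access with .iloc. It is one common way to read values, not the only one.

```python
import pandas as pd

data = {'Name': ['Rahul', 'Oscar', 'Stephen', 'Amar'], 'Age': [28, 34, 21, 25]}
df2 = pd.DataFrame(data, index=['Rank1', 'Rank2', 'Rank3', 'Rank4'])

# .loc selects by row label and column label.
print(df2.loc['Rank2', 'Name'])   # Oscar

# .iloc selects by integer position: row 0 is 'Rank1', column 1 is 'Age'.
print(df2.iloc[0, 1])             # 28
```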

  • Creating a Pandas DataFrame using a Dictionary.
 
import pandas as pd

data = {'Fruits': ['Apples', 'Mangoes', 'Oranges', 'Guavas'],
        'Price/kg': [70, 120, 30, 65]}
df3 = pd.DataFrame(data)
print(df3)
 
Its output will be:

    Fruits  Price/kg
0   Apples        70
1  Mangoes       120
2  Oranges        30
3   Guavas        65
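
Step 4 also lists Series and NumPy arrays as possible inputs. A minimal sketch of both is given below; the index labels 'x' and 'y', the column names 'A' and 'B', and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of Series: rows are aligned on the index labels.
fruits = pd.Series(['Apples', 'Mangoes'], index=['x', 'y'])
prices = pd.Series([70, 120], index=['x', 'y'])
df_from_series = pd.DataFrame({'Fruits': fruits, 'Price/kg': prices})
print(df_from_series)

# From a 2-D NumPy array: column names are supplied separately.
array = np.array([[1, 2], [3, 4]])
df_from_array = pd.DataFrame(array, columns=['A', 'B'])
print(df_from_array)
```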



    




