Create a Pipeline in Pandas

Pipelines are useful for transforming and manipulating large amounts of data. A pipeline is a sequence of data processing steps in which each step receives the output of the previous one. The pandas pipe feature allows us to string together user-defined Python functions in order to build such a pipeline. There are two ways to create a pipeline in pandas: by calling the .pipe() method, or by using the pdpipe package.

With the pandas pipe() function we can chain several functions in a single expression. Each pipe() call passes the dataframe as the first argument of the given function and returns that function's result, so the next pipe() call works on it. Let’s understand and create a pipeline by using the pipe() function.
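To make the idea concrete, here is a minimal sketch that compares nested function calls with the same steps written using pipe(). The helper functions add_constant and keep_rows_above are hypothetical names invented for this illustration; they are not part of pandas.

```python
import pandas as pd

# Hypothetical helper functions, named only for this illustration.
def add_constant(frame, column, amount):
    # Work on a copy so the caller's dataframe is left untouched
    result = frame.copy()
    result[column] = result[column] + amount
    return result

def keep_rows_above(frame, column, threshold):
    # Keep only the rows where the column is greater than the threshold
    return frame[frame[column] > threshold]

ages = pd.DataFrame({"age": [31, 32, 19, 23, 28, 33]})

# Nested calls: the steps are read from the inside out
nested = keep_rows_above(add_constant(ages, "age", 1), "age", 30)

# The same steps with pipe(): the steps are read from left to right
piped = (
    ages.pipe(add_constant, "age", 1)
        .pipe(keep_rows_above, "age", 30)
)

# Both forms produce the same dataframe
same_result = nested.equals(piped)
```

Both versions do the same work; pipe() only changes how the chain reads, which tends to matter more as the number of steps grows.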

Below are various examples that depict how to create a pipeline using pandas.

Example 1:

Python3

# importing pandas library
import pandas as pd

# Create empty dataframe
df = pd.DataFrame()

# Creating a simple dataframe
df['name'] = ['Reema', 'Shyam', 'Jai',
              'Nimisha', 'Rohit', 'Riya']
df['gender'] = ['Female', 'Male', 'Male',
                'Female', 'Male', 'Female']
df['age'] = [31, 32, 19, 23, 28, 33]

# View dataframe
df

Output:

Now, creating functions for data processing.

Python3

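# function to find mean
def mean_age_by_group(dataframe, col):
    # groups the data by a column and
    # returns the mean age per group;
    # numeric_only=True skips text columns such as 'name',
    # which would otherwise raise an error in recent pandas versions
    return dataframe.groupby(col).mean(numeric_only=True)

# function to convert to uppercase
def uppercase_column_name(dataframe):
    # Converts all the column names into uppercase
    # (this changes the dataframe that was passed in)
    dataframe.columns = dataframe.columns.str.upper()
    # And returns the dataframe
    return dataframe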
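The uppercase_column_name function above assigns to dataframe.columns, which changes the dataframe passed in. In the pipeline that follows this is harmless because the step receives the new dataframe produced by groupby, but as a general habit it can be safer for a pipeline step to return a new object. A minimal sketch of such a variant, using the hypothetical name uppercase_column_name_copy:

```python
import pandas as pd

# Hypothetical non-mutating variant of the function above
def uppercase_column_name_copy(dataframe):
    # rename() returns a new dataframe, so the input keeps its column names
    return dataframe.rename(columns=str.upper)

sample = pd.DataFrame({"gender": ["Female", "Male"], "age": [31, 32]})
renamed = sample.pipe(uppercase_column_name_copy)

# The new dataframe has uppercase names; the original is unchanged
new_names = list(renamed.columns)
old_names = list(sample.columns)
```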

Now, creating a pipeline using .pipe() function.

Python3

# Create a pipeline that applies both the functions created above
pipeline = df.pipe(mean_age_by_group, col='gender').pipe(uppercase_column_name)

# display the result of the pipeline
pipeline

Output:
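As a self-contained check of the chain above, the sketch below rebuilds the same dataframe and runs both steps in one script. It assumes that only numeric columns should be averaged, so it passes numeric_only=True to mean(); the variable name people is chosen just for this illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai", "Nimisha", "Rohit", "Riya"],
    "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "age": [31, 32, 19, 23, 28, 33],
})

def mean_age_by_group(dataframe, col):
    # Average the numeric columns within each group (assumption: skip text columns)
    return dataframe.groupby(col).mean(numeric_only=True)

def uppercase_column_name(dataframe):
    # Convert all column names to uppercase and return the dataframe
    dataframe.columns = dataframe.columns.str.upper()
    return dataframe

result = people.pipe(mean_age_by_group, col="gender").pipe(uppercase_column_name)

# One row per gender, with a single AGE column holding the group mean
print(result)
```

The result is indexed by gender and has one column, AGE: the mean is 29.0 for the Female group (31, 23, 33) and about 26.33 for the Male group (32, 19, 28).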

Now, let’s understand and create a pipeline by importing pdpipe package.

The pdpipe Python package provides a concise interface for building pandas pipelines that have pre-conditions. It is a pre-processing pipeline package for pandas dataframes. The pdpipe API makes it easy to break down or compose complex pandas processing pipelines in a few lines of code.

We can install this package by simply writing:

pip install pdpipe

Example 2:

Python3

# importing the packages
import pdpipe as pdp
import pandas as pd

# creating an empty dataframe named dataset
dataset = pd.DataFrame()

# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai',
                   'Nimisha', 'Rohit', 'Riya']
dataset['gender'] = ['Female', 'Male', 'Male',
                     'Female', 'Male', 'Female']
dataset['age'] = [31, 32, 19, 23, 28, 33]
dataset['department'] = ['Accounts', 'Management',
                         'IT', 'IT', 'Management',
                         'Advertising']
dataset['index'] = [1, 2, 3, 4, 5, 6]

# View dataframe
dataset

Output:

Removing a column from dataframe using pdpipe.

Python3

# creating a pipeline and 
# dropping the unwanted column

dropCol = pdp.ColDrop("index").apply(dataset)
 
# display the new dataframe 
# after column drop
dropCol

Output:

There is another way to drop columns through pdpipe.

Python3

# creating a pipeline and 
# dropping the unwanted column

dropCol2 = pdp.ColDrop("index")
 
# applying the ColDrop to dataframe

df2 = dropCol2(dataset)
 
# display dataframe
df2

Output:

Here, the column is dropped in two steps. In the first step, we created a pipeline and in the second step, we applied it to the dataframe.
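The same create-then-apply pattern can be sketched with plain pandas and pipe(), which may help when pdpipe is not installed. This is only an illustration of the idea, not how pdpipe is implemented; make_column_dropper and the sample dataframe staff are hypothetical names.

```python
import pandas as pd

# Hypothetical factory: builds a reusable step that drops one column
def make_column_dropper(column):
    def drop_column(frame):
        # drop() returns a new dataframe without the given column
        return frame.drop(columns=[column])
    return drop_column

staff = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai"],
    "age": [31, 32, 19],
    "index": [1, 2, 3],
})

# Step 1: create the step
drop_index = make_column_dropper("index")

# Step 2: apply it to a dataframe
trimmed = staff.pipe(drop_index)
```

Because the step is created separately from the data, the same drop_index function can be applied to any other dataframe that has an index column.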

Example 3:

Now we are dropping rows from the dataframe based on their values using pdpipe.

Python3

# importing the packages
import pdpipe as pdp
import pandas as pd

# creating an empty dataframe named dataset
dataset = pd.DataFrame()

# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai',
                   'Nimisha', 'Rohit', 'Riya']
dataset['gender'] = ['Female', 'Male', 'Male',
                     'Female', 'Male', 'Female']
dataset['age'] = [31, 32, 19, 23, 28, 33]
dataset['department'] = ['Accounts', 'Management',
                         'IT', 'IT', 'Management',
                         'Advertising']
dataset['index'] = [1, 2, 3, 4, 5, 6]

# View dataframe
dataset

Output:

Now, dropping the rows that contain a given value in a column of the dataframe.

Python3

# dropping the rows whose 'department' is 'IT' using ValDrop
df3 = pdp.ValDrop(['IT'], 'department').apply(dataset)

# display dataframe
df3

Output:

The rows whose department value is ‘IT’ are dropped.
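For comparison, the same filtering can be written as a plain pandas step and used with pipe(). This is a sketch of an equivalent operation, not pdpipe's own code; drop_rows_with_values and the dataframe staff are hypothetical names for this example.

```python
import pandas as pd

# Hypothetical helper: remove rows whose value in `column` is listed in `values`
def drop_rows_with_values(frame, values, column):
    # isin() marks the rows to remove; ~ inverts the mask to keep the rest
    return frame[~frame[column].isin(values)]

staff = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai", "Nimisha", "Rohit", "Riya"],
    "department": ["Accounts", "Management", "IT", "IT",
                   "Management", "Advertising"],
})

without_it = staff.pipe(drop_rows_with_values, ["IT"], "department")

# Names of the people who remain after the filter
remaining_names = without_it["name"].tolist()
```

Here the two rows with the IT department (Jai and Nimisha) are removed, and the original dataframe staff still has all six rows.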
