Pipelines are useful for transforming and manipulating large amounts of data. A pipeline is a sequence of data processing steps. The pandas pipeline feature lets us string together user-defined Python functions to build a data processing pipeline. There are two ways to create a pipeline in pandas: by calling the .pipe() function and by importing the pdpipe package.
With the pandas pipe() function, we can chain several function calls in a single expression, where each function receives the output of the previous one. Let’s understand and create a pipeline by using the pipe() function.
Below are various examples that depict how to create a pipeline using pandas.
Example 1:
# importing pandas library
import pandas as pd

# Create empty dataframe
df = pd.DataFrame()

# Creating a simple dataframe
df['name'] = ['Reema', 'Shyam', 'Jai',
              'Nimisha', 'Rohit', 'Riya']
df['gender'] = ['Female', 'Male', 'Male',
                'Female', 'Male', 'Female']
df['age'] = [31, 32, 19, 23, 28, 33]

# View dataframe
df
Output:
Now, creating functions for data processing.
# function to find mean
def mean_age_by_group(dataframe, col):
    # groups the data by a column and
    # returns the mean age per group;
    # numeric_only=True skips text columns such as 'name',
    # which recent pandas versions refuse to average
    return dataframe.groupby(col).mean(numeric_only=True)

# function to convert to uppercase
def uppercase_column_name(dataframe):
    # Converts all the column names into uppercase
    dataframe.columns = dataframe.columns.str.upper()
    # And returns the dataframe
    return dataframe
Now, creating a pipeline using .pipe() function.
# Create a pipeline that applies both the functions created above
pipeline = df.pipe(mean_age_by_group, col='gender').pipe(uppercase_column_name)

# calling pipeline
pipeline
Output:
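As a minimal sketch of what the chained call does: df.pipe(f, ...) simply calls f(df, ...), so a pipe() chain is equivalent to nesting the function calls. The three-row dataframe and the helper names mean_by_group and uppercase_columns below are illustrative assumptions, not part of the example above, and numeric_only=True is used so that text columns are skipped when averaging.

```python
import pandas as pd

# Small illustrative frame (values are assumptions for this sketch)
people = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai"],
    "gender": ["Female", "Male", "Male"],
    "age": [31, 32, 19],
})

def mean_by_group(dataframe, col):
    # Average the numeric columns within each group of `col`
    return dataframe.groupby(col).mean(numeric_only=True)

def uppercase_columns(dataframe):
    # Return a copy with upper-cased column names (input is left untouched)
    return dataframe.rename(columns=str.upper)

# Chained form: reads left to right
piped = people.pipe(mean_by_group, col="gender").pipe(uppercase_columns)

# Nested form: the same calls, read inside out
nested = uppercase_columns(mean_by_group(people, col="gender"))

print(piped.equals(nested))  # True
```

The chained form becomes easier to read than the nested one as more steps are added, because the steps appear in the order in which they run.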
Now, let’s understand and create a pipeline by importing pdpipe package.
The pdpipe Python package provides a concise interface for building pandas pipelines that have pre-conditions. It is a pre-processing pipeline package for pandas dataframes. The pdpipe API makes it easy to break down or compose complex pandas processing pipelines with a few lines of code.
We can install this package by simply writing:
pip install pdpipe
Example 2:
# importing the packages
import pdpipe as pdp
import pandas as pd

# creating an empty dataframe named dataset
dataset = pd.DataFrame()

# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai',
                   'Nimisha', 'Rohit', 'Riya']
dataset['gender'] = ['Female', 'Male', 'Male',
                     'Female', 'Male', 'Female']
dataset['age'] = [31, 32, 19, 23, 28, 33]
dataset['department'] = ['Accounts', 'Management',
                         'IT', 'IT', 'Management',
                         'Advertising']
dataset['index'] = [1, 2, 3, 4, 5, 6]

# View dataframe
dataset
Output:
Removing a column from dataframe using pdpipe.
# creating a pipeline and dropping the unwanted column
dropCol = pdp.ColDrop("index").apply(dataset)

# display the new dataframe after column drop
dropCol
Output:
There is another way to drop columns through pdpipe.
# creating a pipeline for dropping the unwanted column
dropCol2 = pdp.ColDrop("index")

# applying the ColDrop to dataframe
df2 = dropCol2(dataset)

# display dataframe
df2
Output:
Here, the column is dropped in two steps. In the first step, we created a pipeline and in the second step, we applied it to the dataframe.
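The two-step idea (build the step first, apply it to data later) can be sketched in plain pandas as well. This is only an analogue under stated assumptions, not how pdpipe is implemented: make_column_dropper is a hypothetical helper, and the two small dataframes are made up for illustration. The early check for missing columns mimics the spirit of a pre-condition.

```python
import pandas as pd

def make_column_dropper(*columns):
    # Hypothetical helper: build a reusable step that drops `columns`
    def drop(dataframe):
        missing = [c for c in columns if c not in dataframe.columns]
        if missing:
            # Fail early, in the spirit of a stage pre-condition
            raise KeyError(f"columns not found: {missing}")
        return dataframe.drop(columns=list(columns))
    return drop

# Step 1: create the step (no data involved yet)
drop_index = make_column_dropper("index")

# Step 2: apply it to any dataframe that has the column
first = pd.DataFrame({"name": ["Reema", "Shyam"], "index": [1, 2]})
second = pd.DataFrame({"name": ["Jai"], "age": [19], "index": [3]})

print(first.pipe(drop_index).columns.tolist())   # ['name']
print(second.pipe(drop_index).columns.tolist())  # ['name', 'age']
```

Separating creation from application means the same step can be reused on several dataframes without repeating its configuration.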
Example 3:
Now we are dropping rows that contain a given value from a dataframe using pdpipe.
# importing the packages
import pdpipe as pdp
import pandas as pd

# creating an empty dataframe named dataset
dataset = pd.DataFrame()

# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai',
                   'Nimisha', 'Rohit', 'Riya']
dataset['gender'] = ['Female', 'Male', 'Male',
                     'Female', 'Male', 'Female']
dataset['age'] = [31, 32, 19, 23, 28, 33]
dataset['department'] = ['Accounts', 'Management',
                         'IT', 'IT', 'Management',
                         'Advertising']
dataset['index'] = [1, 2, 3, 4, 5, 6]

# View dataframe
dataset
Output:
Now, dropping the values from dataframe.
# dropping the values using ValDrop
df3 = pdp.ValDrop(['IT'], 'department').apply(dataset)

# display dataframe
df3
Output:
The rows whose department value is ‘IT’ are dropped.
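For comparison, the same kind of row filter can be sketched with plain pandas and .pipe(). The helper drop_rows_with_values is a hypothetical name introduced here for illustration, and the dataframe is a trimmed copy of the one used above; this is an analogue of the step, not pdpipe's own code.

```python
import pandas as pd

staff = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai", "Nimisha", "Rohit", "Riya"],
    "department": ["Accounts", "Management", "IT", "IT",
                   "Management", "Advertising"],
})

def drop_rows_with_values(dataframe, values, column):
    # Hypothetical helper: keep only rows whose `column` value is not in `values`
    return dataframe[~dataframe[column].isin(values)]

# Extra positional arguments after the function are forwarded by pipe()
remaining = staff.pipe(drop_rows_with_values, ["IT"], "department")

print(remaining["name"].tolist())  # ['Reema', 'Shyam', 'Rohit', 'Riya']
```

Both rows with the department 'IT' are removed, and the original dataframe is left unchanged because boolean indexing returns a new object.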