Pipelines – Python and scikit-learn

Last Updated : 13 Jul, 2021

The workflow of any machine learning project includes all the steps required to build it. A proper ML project consists of basically four main parts are given as follows:

Gathering data:
The process of gathering data depends on the project it can be real-time data or the data collected from various sources such as a file, database, survey and other sources.
Data pre-processing:
Usually, within the collected data, there is a lot of missing data, extremely large values, unorganized text data or noisy data and thus cannot be used directly within the model, therefore, the data require some pre-processing before entering the model.
Training and testing the model: Once the data is ready for algorithm application, It is then ready to put into the machine learning model. Before that, it is important to have an idea of what model is to be used which may give a nice performance output. The data set is divided into 3 basic sections i.e. The training set, validation set and test set. The main aim is to train data in the train set, to tune the parameters using ‘validation set’ and then test the performance test set.
Evaluation:
Evaluation is a part of the model development process. It helps to find the best model that represents the data and how well the chosen model works in the future. This is done after training of model in different algorithms is done. The main motto is to conclude the evaluation and choose model accordingly again.

ML Workflow in python
The execution of the workflow is in a pipe-like manner, i.e. the output of the first steps becomes the input of the second step. Scikit-learn is a powerful tool for machine learning, provides a feature for handling such pipes under the sklearn.pipeline module called Pipeline.
It takes 2 important parameters, stated as follows:

The Stepslist:
List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.
verbose:

Code:

python3

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
# import some data within sklearn for iris classification 
iris = datasets.load_iris()
X = iris.data 
y = iris.target
 
# Splitting data into train and testing part
# The 25 % of data is test size of the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# importing pipes for making the Pipe flow
from sklearn.pipeline import Pipeline
# pipe flow is :
# PCA(Dimension reduction to two) -> Scaling the data -> DecisionTreeClassification 
pipe = Pipeline([('pca', PCA(n_components = 2)), ('std', StandardScaler()), ('decision_tree', DecisionTreeClassifier())], verbose = True)
 
# fitting the data in the pipe
pipe.fit(X_train, y_train)
 
# scoring data 
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pipe.predict(X_test)))

Output:

[Pipeline] ............... (step 1 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing std, total=   0.0s
[Pipeline] ..... (step 3 of 3) Processing Decision_tree, total=   0.0s
0.9736842105263158

Important property:

pipe.named_steps: pipe.named_steps is a dictionary storing the name key linked to the individual objects in the pipe. For example:

pipe.named_steps['decision_tree'] # returns a decision tree classifier object

Hyper parameters:
There are different set of hyper parameters set within the classes passed in as a pipeline. To view them, pipe.get_params() method is used. This method returns a dictionary of the parameters and descriptions of each classes in the pipeline.
Example:

python3

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
# import some data within sklearn for iris classification 
iris = datasets.load_iris()
X = iris.data 
y = iris.target
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
 
from sklearn.pipeline import Pipeline
pipe = Pipeline([('pca', PCA(n_components = 2)), ('std', StandardScaler()), ('Decision_tree', DecisionTreeClassifier())], verbose = True)
 
pipe.fit(X_train, y_train)
 
# to see all the hyper parameters
pipe.get_params()

Output:

{'memory': None,
 'steps': [('pca',
   PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
       svd_solver='auto', tol=0.0, whiten=False)),
  ('std', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('Decision_tree',
   DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                          max_depth=None, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, presort='deprecated',
                          random_state=None, splitter='best'))],
 'verbose': True,
 'pca': PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
     svd_solver='auto', tol=0.0, whiten=False),
 'std': StandardScaler(copy=True, with_mean=True, with_std=True),
 'Decision_tree': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=None, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best'),
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': 2,
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': False,
 'std__copy': True,
 'std__with_mean': True,
 'std__with_std': True,
 'Decision_tree__ccp_alpha': 0.0,
 'Decision_tree__class_weight': None,
 'Decision_tree__criterion': 'gini',
 'Decision_tree__max_depth': None,
 'Decision_tree__max_features': None,
 'Decision_tree__max_leaf_nodes': None,
 'Decision_tree__min_impurity_decrease': 0.0,
 'Decision_tree__min_impurity_split': None,
 'Decision_tree__min_samples_leaf': 1,
 'Decision_tree__min_samples_split': 2,
 'Decision_tree__min_weight_fraction_leaf': 0.0,
 'Decision_tree__presort': 'deprecated',
 'Decision_tree__random_state': None,
 'Decision_tree__splitter': 'best'}

Suggest improvement

How to Install Scikit-Learn on Linux?

Share your thoughts in the comments