Standardize ML problem solving using sklearn Pipeline — Part 1

Onkar Patil
2 min read · Jun 18, 2022

Though for most beginners and learners this might be an optional or last-to-pick topic, those who are leveraging machine learning techniques to solve business problems might find it helpful. Now, "standardize" is a very generic word that covers many aspects of the problem-solving flow. Here, and in the following few articles, we confine our discussion to sklearn's pipeline module and its use, and later to how to use MLflow along with an ML pipeline.

Since 2008, scikit-learn has been evolving and continuously incorporating new tools and techniques to optimize and enhance the ML problem-solving workflow, which is why it is so popular among ML practitioners. The typical machine learning life cycle can be divided into three major parts:

1. Data Transformation (data preprocessing, feature engineering, and data scaling)
2. Model Training (model selection and hyperparameter tuning)
3. Inference (producing results on unseen data)

Mostly these parts are tackled individually, which increases complexity after deployment. Here scikit-learn offers a solution: sklearn.pipeline.Pipeline. It combines these three parts and treats them as one, which makes it easy to preprocess, train, and infer.
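To make the idea concrete before the full example, here is a minimal sketch of my own (not from the original article, assuming only the standard sklearn API): a scaler and a classifier chained into a single estimator, so one fit() call runs the whole flow.

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Chain the transformation and the model into one estimator;
# the step names ('scaler', 'clf') are arbitrary labels.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# fit() scales the data and then trains the model, in order;
# predict() applies the same fitted scaler before inference.
pipe.fit(X, y)
print(pipe.predict(X[:5]))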

Let's see a simple ML example using the sklearn Pipeline module.

Here I'm using the Breast Cancer dataset.

[Figure: seaborn pairplot illustration; source: https://indianaiproduction.com/seaborn-pairplot/]

As part of the Data Transformation step, I have used (1) data scaling and (2) PCA, and I define three different pipelines for three models.

Below is an example of the pipeline definitions. Note that we can also use a user-defined preprocessing function instead of these predefined sklearn modules (see the sketch after the main example).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

# Load the data and split it into train and test sets
X, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                random_state=23)

# Defining the pipelines
# Pipeline for Logistic Regression
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=29)),
                    ('clf', LogisticRegression(random_state=42))])

# Pipeline for SVM
pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=29)),
                     ('clf', svm.SVC(random_state=42))])

# Pipeline for Random Forest
pipe_rf = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=29)),
                    ('clf', RandomForestClassifier(random_state=42))])

# Combine these pipelines
pipelines = [pipe_lr, pipe_svm, pipe_rf]

# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Support Vector Machine',
             2: 'Random Forest'}

# Training the pipelines
for pipe in pipelines:
    pipe.fit(xtrain, ytrain)

# Testing them on the held-out test set
for idx, val in enumerate(pipelines):
    print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
                                               val.score(xtest, ytest)))
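As noted above, a pipeline step does not have to be a predefined sklearn module. Below is a minimal sketch (my own addition, not from the original repo) that wraps a user-defined preprocessing function with sklearn's FunctionTransformer; log1p_features is a hypothetical helper used purely for illustration, and the snippet reuses xtrain/xtest/ytrain/ytest from the example above.

import numpy as np
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

def log1p_features(X):
    # Hypothetical user-defined transformation: compress feature ranges
    # with log(1 + x); the breast cancer features are non-negative, so this is safe.
    return np.log1p(X)

# Pipeline with a custom preprocessing step followed by the usual modules
pipe_custom = Pipeline([('log', FunctionTransformer(log1p_features)),
                        ('scl', StandardScaler()),
                        ('clf', LogisticRegression(random_state=42))])

# Reuses the train/test split from the example above
pipe_custom.fit(xtrain, ytrain)
print('Custom pipeline test accuracy: %.3f' % pipe_custom.score(xtest, ytest))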

Git repo for this code

This is a simple definition and implementation example, but actual problems aren't this straightforward. In the next article we will see how to customize the pipeline based on your use case. Link to part 2


Onkar Patil

Data Science Lead at IKS Health, AI/ML Enthusiast.