In most machine learning projects, the raw data is unlikely to be in the right format to produce the best model. Preparing it usually takes several transformation steps, such as encoding categorical variables, feature scaling, and normalization. Scikit-learn's preprocessing module contains built-in functions for these common transformations.
However, in a typical machine learning workflow you need to apply these transformations at least twice: once during training, and again when you use the model to predict on new data. You can of course write a function to reuse the transformations, but you still have to run that function first and then call the model. Scikit-learn's Pipeline is a tool that simplifies this process, with the following advantages:
- Make the workflow easier to understand
- Enforce the implementation and execution order of the steps
- Make your work more reproducible
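To make the "at least twice" point concrete, here is a minimal sketch on made-up data. The arrays and the choice of LogisticRegression are assumptions for illustration only and are not part of the loan example. It compares the manual approach, where the fitted scaler has to be carried along by hand, with a pipeline that bundles both steps into one object.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features, binary target.
X_train = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0],
                    [5.0, 400.0], [6.0, 420.0], [7.0, 390.0], [8.0, 450.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_new = np.array([[2.5, 205.0], [7.5, 410.0]])

# Manual workflow: the same fitted scaler must be reused at prediction time.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_new))

# Pipeline workflow: one object, one fit call, one predict call.
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

print(manual_pred, pipe_pred)
```

Both paths perform the same computation, so the predictions should match. The difference is that the pipeline cannot forget the scaling step or apply it in the wrong order.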
In this article, I will use a loan prediction dataset to introduce how pipelines work and how to implement them.
1. Transformers
First, I import the training and test files into a Jupyter notebook. I drop the Loan_ID column because it is not needed for training or prediction. I use the pandas dtypes attribute to get a brief summary of the dataset:
    import pandas as pd

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    train = train.drop('Loan_ID', axis=1)
    train.dtypes
You can see that the data contains both categorical and numerical variables, so I need to apply at least one-hot encoding and some form of scaling. I use a scikit-learn pipeline to perform these transformations and fit the model in the same step.
Before building the pipeline, I split the training data into training and test sets so that I can verify the performance of the model:
    from sklearn.model_selection import train_test_split

    X = train.drop('Loan_Status', axis=1)
    y = train['Loan_Status']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
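Since reproducibility is one of the goals here, it may help to fix the split as well. The sketch below uses a small made-up frame (the column name and values are hypothetical) and two optional arguments of train_test_split: random_state, which fixes the shuffle, and stratify, which keeps the class ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loan data (hypothetical column and values).
X = pd.DataFrame({'ApplicantIncome': list(range(10))})
y = pd.Series(['Y', 'N'] * 5, name='Loan_Status')

# Same random_state, same split: running this twice gives identical parts.
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```

Whether stratification is appropriate depends on the data. It is most useful when the target classes are imbalanced.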
The first step in building a pipeline is to define each type of transformer. In the code below, I create a numeric transformer that uses a SimpleImputer to fill in missing values, followed by a StandardScaler. SimpleImputer is a useful scikit-learn class with several strategies for filling in missing values. I chose the median, but other strategies may work better. The categorical transformer also uses a SimpleImputer, here with a constant fill value, and then applies OneHotEncoder to convert the categorical values into binary indicator columns:
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
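To see what these two transformers actually produce, here is a small sketch that runs them on made-up columns. The column names `amount` and `area` and their values are hypothetical. The transformer definitions are repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Hypothetical toy columns, each with one missing entry.
num = pd.DataFrame({'amount': [1.0, np.nan, 3.0]})
cat = pd.DataFrame({'area': pd.Series(['urban', np.nan, 'rural'], dtype=object)})

# The NaN is replaced by the median (2.0), then the column is standardized,
# so the imputed row ends up at the column mean, i.e. 0 after scaling.
num_out = numeric_transformer.fit_transform(num)

# The NaN is replaced by the string 'missing', which gets its own indicator column.
# OneHotEncoder returns a sparse matrix by default, hence toarray().
cat_out = categorical_transformer.fit_transform(cat).toarray()

print(num_out.ravel())
print(categorical_transformer.named_steps['onehot'].categories_)
print(cat_out)
```

Each row of the one-hot output contains exactly one 1, in the column of its category.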
Next, we use ColumnTransformer to apply each transformer to the right columns of the data frame. Before that, I build the lists of numeric and categorical columns using the pandas select_dtypes method:
    from sklearn.compose import ColumnTransformer

    numeric_features = train.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = train.select_dtypes(include=['object']).drop(['Loan_Status'], axis=1).columns

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
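The routing of columns can be checked on a small made-up frame. In this sketch the column names and values are hypothetical, and plain StandardScaler and OneHotEncoder stand in for the two transformer pipelines to keep it short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with two numeric columns and one categorical column.
df = pd.DataFrame({
    'income': [2500, 4000, 3200, 5100],
    'loan_amount': [100.0, 150.0, 120.0, 200.0],
    'area': pd.Series(['urban', 'rural', 'urban', 'rural'], dtype=object),
})

# Build the column lists from the dtypes, as in the article.
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
categorical_features = df.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

out = preprocessor.fit_transform(df)

print(list(numeric_features))
print(list(categorical_features))
# 2 scaled numeric columns + 2 one-hot columns (one per category of 'area')
print(out.shape)
```

The output columns follow the order of the `transformers` list: numeric columns first, then the one-hot columns.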
2. Classifier Training
The next step is to create a pipeline that combines the preprocessor created above with a classifier. Here I use a simple RandomForestClassifier:
    from sklearn.ensemble import RandomForestClassifier

    rf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', RandomForestClassifier())])
You can simply call the fit method on the original data, and the preprocessing step will execute before training the classifier:
rf.fit(X_train, y_train)
The same applies to prediction: the pipeline preprocesses the new data before the classifier predicts:
y_pred = rf.predict(X_test)
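The sketch below runs the whole flow end to end on made-up data. All column names and values are hypothetical, and `random_state=0` is added only to make the run repeatable. It also shows two situations that the preprocessing handles at prediction time: a missing numeric value and a category that never appeared during training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data.
X_train = pd.DataFrame({
    'income': [2500.0, 4000.0, 3200.0, 5100.0, 1800.0, 6000.0],
    'area': pd.Series(['urban', 'rural', 'urban', 'rural', 'urban', 'rural'],
                      dtype=object),
})
y_train = np.array(['N', 'Y', 'N', 'Y', 'N', 'Y'])

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())]), ['income']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['area'])])

rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier(random_state=0))])
rf.fit(X_train, y_train)

# New rows: the first has a missing income, the second a category
# ('semiurban') that was not present in the training data.
X_new = pd.DataFrame({
    'income': [np.nan, 4500.0],
    'area': pd.Series(['urban', 'semiurban'], dtype=object),
})
y_pred = rf.predict(X_new)
print(y_pred)
```

The missing income is filled with the training median, and because of `handle_unknown='ignore'` the unseen category is encoded as all zeros instead of raising an error.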
3. Model Selection
Pipelines can also be used in the model selection process. The following sample code loops over a set of scikit-learn classifiers, combines each one with the preprocessor in a pipeline, trains it, and prints its score on the test set.
    from sklearn.metrics import accuracy_score, log_loss
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC, LinearSVC, NuSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    classifiers = [
        KNeighborsClassifier(3),
        SVC(kernel="rbf", C=0.025, probability=True),
        NuSVC(probability=True),
        DecisionTreeClassifier(),
        RandomForestClassifier(),
        AdaBoostClassifier(),
        GradientBoostingClassifier()
    ]

    for classifier in classifiers:
        pipe = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])
        pipe.fit(X_train, y_train)
        print(classifier)
        print("model score: %.3f" % pipe.score(X_test, y_test))
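A single train/test split can give noisy scores. As a possible variation, not part of the original workflow, the same loop can pass each pipeline to cross_val_score. The sketch below uses synthetic data from make_classification and a plain StandardScaler in place of the loan preprocessor, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the loan dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

classifiers = [KNeighborsClassifier(3), DecisionTreeClassifier(random_state=0)]

results = {}
for classifier in classifiers:
    pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('classifier', classifier)])
    # The whole pipeline is refitted on each fold, so the scaler never
    # sees the held-out part of the data.
    scores = cross_val_score(pipe, X, y, cv=5)
    name = type(classifier).__name__
    results[name] = scores.mean()
    print("%s: mean accuracy %.3f" % (name, results[name]))
```

Because the preprocessing lives inside the pipeline, it is fitted only on the training part of each fold, which avoids leaking information from the validation part.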
[Figure: model-select.png]
4. Model Parameter Search
Pipelines can also be used in a grid search to find the best parameters for a model. To do this, we first create a parameter grid for the model. It is important to prefix each parameter name with the name of the classifier step. In the code above I named that step classifier, so I add classifier__ (the step name followed by two underscores) to each parameter. Next, I create a grid search object that wraps the original pipeline. When I call the fit method, the preprocessing is fitted and applied within each cross-validation fold before the classifier is trained.
    from sklearn.model_selection import GridSearchCV

    # Each key is '<step name>__<parameter name>' for the 'classifier' step.
    param_grid = {
        'classifier__n_estimators': [200, 500],
        # Older releases also accepted 'auto' (an alias for 'sqrt');
        # recent scikit-learn releases reject it.
        'classifier__max_features': ['sqrt', 'log2'],
        'classifier__max_depth': [4, 5, 6, 7, 8],
        'classifier__criterion': ['gini', 'entropy']}

    CV = GridSearchCV(rf, param_grid, n_jobs=1)
    CV.fit(X_train, y_train)
    print(CV.best_params_)
    print(CV.best_score_)
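If you are unsure which names are valid in the grid, the pipeline can list them. The sketch below uses a simplified pipeline, with a StandardScaler standing in for the preprocessor (an assumption for brevity), and reads the names from get_params.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', RandomForestClassifier())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
classifier_params = sorted(name for name in pipe.get_params()
                           if name.startswith('classifier__'))
print(classifier_params)

# The same names are accepted by set_params.
pipe.set_params(classifier__n_estimators=200, classifier__max_depth=4)
print(pipe.named_steps['classifier'].n_estimators)
```

Any name printed here can be used as a key in `param_grid`. A key that is not in this list makes the grid search fail with an invalid parameter error.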
Before I started using pipelines, I often found that I could not follow the workflow of one of my earlier projects. Pipelines make the whole machine learning process clear and easy to understand and maintain. Hopefully this tutorial will help you get started with scikit-learn pipelines.
Original Link: Principle and Practice of Scikit-learn Pipeline - Intelligence Network