Setting Up a Machine Learning Pipeline

For this tutorial, we’ll set up a very basic pipeline that consists of the following sequence:

  1. Scaler: Pre-processes the data by transforming each feature to zero mean and unit variance, using StandardScaler().
  2. Feature selector: Uses VarianceThreshold() to discard features whose variance is below a defined threshold.
  3. Classifier: KNeighborsClassifier(), which implements the k-nearest neighbors classifier. It assigns a test example the majority class among the k training points closest to it.
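To make the first two steps concrete, here is a minimal sketch on a tiny, made-up array (X_toy is hypothetical and not part of the tutorial's dataset). It assumes the default VarianceThreshold() setting, which only drops features with zero variance, so after standardization only the constant column is removed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the third column is constant.
X_toy = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 30.0, 5.0],
    [4.0, 40.0, 5.0],
])

# Step 1: scale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_toy)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # 1 for varying columns, 0 for the constant one

# Step 2: drop features whose variance does not exceed the threshold
# (the default threshold is 0.0, so only constant features are removed).
selector = VarianceThreshold()
X_selected = selector.fit_transform(X_scaled)
print(X_selected.shape)        # (4, 2)
print(selector.get_support())  # boolean mask of the features that were kept
```

Because the scaler gives every non-constant feature unit variance, a non-default threshold placed after the scaler would treat all of those features alike; the threshold is compared against the variance of the scaled data, not the raw data.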

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier())
])

The pipe object is simple to understand. It says: scale first, select features second, and classify last. Let’s call the fit() method of the pipe object on our training data and print the training and test scores.

pipe.fit(X_train, y_train)

print('Training set score: ' + str(pipe.score(X_train, y_train)))
print('Test set score: ' + str(pipe.score(X_test, y_test)))
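As a rough check of the "scale, select, classify" ordering, the sketch below fits the same three steps by hand and compares the predictions with those of a pipeline. It uses a synthetic dataset from make_classification rather than the tutorial's data, and the names X_demo, y_demo and pipe_demo are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The pipeline runs every step in order with a single fit() call.
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier())
])
pipe_demo.fit(X_demo, y_demo)

# The same sequence written out by hand.
scaler = StandardScaler().fit(X_demo)
X_s = scaler.transform(X_demo)
selector = VarianceThreshold().fit(X_s)
X_sel = selector.transform(X_s)
knn = KNeighborsClassifier().fit(X_sel, y_demo)

# Both routes should produce identical predictions.
same = np.array_equal(pipe_demo.predict(X_demo), knn.predict(X_sel))
print(same)
```

The pipeline version is shorter and applies the fitted scaler and selector automatically whenever predict() or score() is called on new data.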

Running the example, you should see output similar to the following (the exact scores depend on the data and on the train/test split):

Training set score: 0.8794642857142857
Test set score: 0.8392857142857143
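For a classifier pipeline, score() reports mean accuracy, i.e. the fraction of examples classified correctly. The sketch below illustrates this on synthetic data; the dataset, the split parameters, and the names X_syn, y_syn and pipe_syn are assumptions made for illustration, so the numbers will not match the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and split (assumptions, not the tutorial's data).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

pipe_syn = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier())
]).fit(X_tr, y_tr)

# score() on a classifier pipeline is the accuracy of its predictions.
test_score = pipe_syn.score(X_te, y_te)
test_accuracy = accuracy_score(y_te, pipe_syn.predict(X_te))
print(test_score, test_accuracy)
```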