With Python and scikit-learn there is a great deal of flexibility in how to set up preprocessing before training and building a model. For example, if the feature set is crafted from time series data, a time-series split rather than a random split is the right choice. This is what a data scientist typically does: researches, explores, plots, and decides which steps to take.
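As a minimal sketch of the time-series case, scikit-learn's `TimeSeriesSplit` produces folds that respect temporal order, so each training fold only contains samples that come before its test fold (the tiny arrays here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Each training fold contains only samples that precede the test fold,
# unlike a random split, which would shuffle future data into training.
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Note how the training window grows while the test window always lies strictly in the future.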
However, imagine the steps end up in the wrong order for some reason: the data is scaled first and only then split into train and test sets. This leads to data leakage, because the scaler has already seen statistics from the test data. For this reason, pipelines should be in place.
This blog post is a good guide I found online on how to set up a pipeline with scikit-learn.