With Python and scikit-learn there is a great deal of flexibility in how to set up preprocessing before training and building a model. For example, if the feature set is crafted from time series data, a time-series split rather than a random split is the right choice. This is what a data scientist typically does: researches, explores, plots, and decides which steps to take.
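As a minimal sketch of the time-series case, scikit-learn's `TimeSeriesSplit` produces folds that respect temporal order, so each training fold only contains samples that come before its test fold (the tiny arrays here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Each training fold contains only samples that precede the test fold,
# unlike a random split, which would shuffle future data into training.
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Note how the training window grows while the test window always lies strictly in the future.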
However, imagine the steps end up in the wrong order for some reason: the data is scaled first and only then split into train and test sets. This leads to data leakage, because the scaler has already seen statistics from the test data. For this reason, pipelines should be in place.
This blog post is a good guide I found online on how to set up a pipeline with scikit-learn.