If you are reading about ‘k-fold analysis’ or ‘k-fold cross-validation’ for the first time, do look up online blogs that show how to implement it using sklearn in Python. However, those posts tend to gloss over how to actually make use of it, which is easy for an amateur data scientist to miss.
There are two key things to look at:
- Look at the result metrics (beyond accuracy, of course; whatever is of interest) across the different k splits. If the results are consistent, the model is indeed learning and generalizes well. This guards against a misleading conclusion drawn from a single, fortuitously good train/test split.
- Look at the feature importances. If the most important features are consistent across all the splits, those are very likely the strong predictors.
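Both checks can be sketched in a few lines of sklearn. This is a minimal illustration, assuming a random forest classifier and the bundled breast-cancer dataset purely as stand-ins for your own model and data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Illustrative dataset and model; substitute your own
X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_f1 = []
fold_top_features = []
for train_idx, test_idx in skf.split(X, y):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_f1.append(f1_score(y[test_idx], model.predict(X[test_idx])))
    # Rank features by importance within this fold, keep the top 5
    fold_top_features.append(np.argsort(model.feature_importances_)[::-1][:5])

# Check 1: metric consistency across folds (a low spread suggests
# the model generalizes rather than fitting one lucky split)
print(f"F1 per fold: {np.round(fold_f1, 3)}, std: {np.std(fold_f1):.3f}")

# Check 2: do the same features top the ranking in every fold?
for ranking in fold_top_features:
    print("Top-5 feature indices:", ranking)
```

If the per-fold F1 scores sit close together and the same feature indices keep appearing at the top, both consistency checks pass; wide swings in either signal an unstable model or a spuriously good split.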
The main purpose of this exercise is not merely to build an ML model, but to let the computer (machine) learn and to understand what it has learned.