Explain cross-validation

rohit-lotlikar-b6b3af34 · 4 February 2022 11:48

Cross-validation is a statistical method for estimating how well a machine learning model will perform on data it has not seen. It helps detect overfitting in a predictive model, particularly when the amount of data is limited. In k-fold cross-validation, you split the data into a fixed number of folds (or partitions); each fold is held out once as the test set while the model is trained on the remaining folds, and the error estimates from all the folds are then averaged.
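The fold-and-average procedure described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_splits`, `fit_line`, `mse`, `cross_validate`), the toy data, and the choice of a simple least-squares line as the model are all assumptions made for the example.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares (hypothetical stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k=5):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            mse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(fold_errors) / len(fold_errors), fold_errors


# Toy data (assumed for illustration): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

mean_error, fold_errors = cross_validate(xs, ys, k=5)
print("per-fold MSE:", fold_errors)
print("cross-validated MSE:", mean_error)
```

Every data point is used for testing exactly once and for training k-1 times, which is why this approach is attractive when data is scarce.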

When dealing with a machine learning task, you have to identify the problem properly so that you can pick the most suitable algorithm, the one that gives you the best score. But how do we compare models?

Say you have trained a model on the available dataset and now want to know how well it performs. One approach is to test the model on the same dataset it was trained on, but this is not good practice.

So what is wrong with testing the model on the training dataset? If we do so, we assume that the training data represents every scenario the model could meet in the real world, which is almost never the case. Our main objective is for the model to work well on real-world data. Although the training dataset is also real-world data, it represents only a small subset of all the possible data points (examples) out there. A model can score well on the training data simply by memorizing it, so that score says little about how it will handle new examples.
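To make the problem concrete, here is a small sketch comparing the error measured on the training data with the error on held-out data. It assumes a 1-nearest-neighbour predictor as the model, chosen for illustration only because it memorizes its training points; the function names and the toy data are likewise hypothetical.

```python
import random


def predict_1nn(train_xs, train_ys, x):
    """Return the target of the training point closest to x."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[nearest]


def mse_1nn(train_xs, train_ys, xs, ys):
    """Mean squared error of the 1-NN predictor on the points (xs, ys)."""
    return sum(
        (predict_1nn(train_xs, train_ys, x) - y) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)


# Toy data (assumed for illustration): a noisy straight line.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1.0) for x in xs]

# Even-indexed points are used for training, odd-indexed points are held out.
train_xs, train_ys = xs[0::2], ys[0::2]
held_xs, held_ys = xs[1::2], ys[1::2]

train_error = mse_1nn(train_xs, train_ys, train_xs, train_ys)
held_out_error = mse_1nn(train_xs, train_ys, held_xs, held_ys)

print("error on training data:", train_error)
print("error on held-out data:", held_out_error)
```

Because each training point is its own nearest neighbour, the error on the training data is exactly zero, while the error on the held-out points is not. Judged on the training data alone, the model would look perfect.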