What is Overfitting & Underfitting in ml?

Overfitting refers to the scenario where a machine learning model can’t generalize or fit well on unseen dataset. A clear sign of machine learning overfitting is if its error on the testing or validation dataset is much greater than the error on training dataset.

Overfitting is a term used in statistics that refers to a modeling error that occurs when a function corresponds too closely to a dataset. As a result, overfitting may fail to fit additional data, and this may affect the accuracy of predicting future observations.

Overfitting happens when a model learns the detail and noise in the training dataset to the extent that it negatively impacts the performance of the model on a new dataset. This means that the noise or random fluctuations in the training dataset is picked up and learned as concepts by the model. The problem is that these concepts do not apply to new datasets and negatively impact the model’s ability to generalize.

The opposite of overfitting is underfitting.

Underfitting refers to a model that can neither model the training dataset nor generalize to new dataset. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training dataset.

Underfitting is often not discussed as it is easy to detect given a good performance metric.

Understanding it with Examples

Example 1:

Let’s say three students have prepared for a mathematics examination.

The first student has only studied Addition mathematic operations and skipped other mathematics operations such as Subtraction, Division, Multiplication etc.

The second student has a particularly good memory. Thus, second student has memorized all the problems presented in the textbook.

And the third student has studied all mathematical operations and is well prepared for the exam.

In the exam student one will only be able to solve the questions related to Addition and will fail in problems or questions asked related to other mathematics operations.

Student two will only be able to answer questions if they happened to appear in the textbook (as he has memorized it) and will not be able to answer any other questions.

Student three will be able to solve all the exam problems reasonably well.

Machine Learning algorithms have similar behavior to our three students, sometimes the model generated by the algorithm are similar to the first student. They learn from only from a small part of the training dataset, in such cases the model is Underfitting.

Sometimes the model will memorize the entire training dataset, like the second student. They perform very well on known instances, but faulter badly on unseen data or unknown instances. In such cases the model is said to be Overfitting.

And when model does well in both the training dataset and on the unseen data or unknown instances like student three, it is a good fit.