How to handle missing values in test data?

This is a fairly common question when it comes to data preprocessing, and a valid one.

Values can be missing in both the training and the test data, but the more important question is: why is the data missing?

Is it at random? Or is there a reason behind why those values are missing?

These questions need to be answered before building the complete preprocessing pipeline during the training phase.

One way to handle missing categorical values is to add another category such as “None” or “Unknown”. Instances with missing values are assigned this category.
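Here is a minimal sketch of that idea in plain Python. The toy rows, the "city" column, and the helper name fill_missing_category are all made up for illustration; in practice you would likely do the same thing with your dataframe library of choice.

```python
# Hypothetical toy rows; the "city" column is invented for this example.
train_rows = [{"city": "Pune"}, {"city": None}, {"city": "Delhi"}]
test_rows = [{"city": None}, {"city": "Pune"}]

MISSING_LABEL = "Unknown"


def fill_missing_category(rows, column, label=MISSING_LABEL):
    """Return copies of the rows with None in `column` replaced by `label`."""
    return [
        {**row, column: label if row[column] is None else row[column]}
        for row in rows
    ]


# The same label is used for both splits, so "missing" means the same
# thing to the model during training and at test time.
train_filled = fill_missing_category(train_rows, "city")
test_filled = fill_missing_category(test_rows, "city")
```

Because the label is a fixed constant and not something computed from the data, applying it to the test set does not leak any information.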

For numerical data, a very important point to keep in mind is not to use the test data itself to compute the values that fill in its missing entries. Use statistics (for example, the mean or median) learned from the training data instead.

If the test data is used, it might result in what we call “data leakage”: your model gets information it wasn’t supposed to get.
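To make the difference concrete, here is a small sketch using only the standard library. The "age" values are invented, and the median is just one possible fill statistic, chosen here as an assumption.

```python
from statistics import median

# Hypothetical toy values for a numeric "age" column; None marks a missing entry.
train_ages = [22.0, None, 35.0, 41.0, 29.0]
test_ages = [None, 90.0, None]

# Correct: the fill value is learned from the training data only.
train_median = median(v for v in train_ages if v is not None)

# The test set is filled with the training statistic, never its own.
test_imputed = [train_median if v is None else v for v in test_ages]

# Leaky (don't do this): the fill value is computed from the test data itself.
leaky_value = median(v for v in test_ages if v is not None)
```

In this toy data the two fill values are very different (the training median versus a value driven by a single test row), which is exactly the kind of test-set information the model should not get to see.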

In essence, the same preprocessing pipeline (data cleaning, filtering, imputing, etc.) built in the training phase is applied to the test data as well.
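One way to picture "the same pipeline" is an object that learns its fill values once, from the training data, and then applies them unchanged to any other split. The sketch below is a hypothetical helper (SimplePreprocessor and the column names are made up), not any particular library's API.

```python
from statistics import median


class SimplePreprocessor:
    """Hypothetical helper: learns fill values from the rows given to fit()."""

    def __init__(self, numeric_cols, categorical_cols, missing_label="Unknown"):
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols
        self.missing_label = missing_label
        self.fill_values = {}

    def fit(self, rows):
        # Learn one median per numeric column from the rows passed in
        # (these should be the training rows only).
        for col in self.numeric_cols:
            observed = [r[col] for r in rows if r[col] is not None]
            self.fill_values[col] = median(observed)
        # Categorical columns get a fixed "missing" label.
        for col in self.categorical_cols:
            self.fill_values[col] = self.missing_label
        return self

    def transform(self, rows):
        # Apply the stored fill values; nothing is re-learned here.
        cols = self.numeric_cols + self.categorical_cols
        return [
            {**r, **{c: self.fill_values[c] for c in cols if r[c] is None}}
            for r in rows
        ]


# Invented toy data for illustration.
train = [
    {"age": 22.0, "city": "Pune"},
    {"age": None, "city": None},
    {"age": 40.0, "city": "Delhi"},
]
test = [{"age": None, "city": None}, {"age": 95.0, "city": "Pune"}]

prep = SimplePreprocessor(["age"], ["city"]).fit(train)  # fit on train only
train_clean = prep.transform(train)
test_clean = prep.transform(test)  # same fitted object, no refit
```

scikit-learn's SimpleImputer follows the same fit-then-transform pattern, so in a real project you would fit it on the training set and call transform on the test set.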

That way, your model doesn’t freak out when it sees missing values at test time.

#machinelearning #datascience