Data validation means checking the accuracy and quality of source data before training a new model version. It ensures that anomalies that are infrequent, or that manifest only in incremental data, are not silently ignored. It focuses on checking that the statistics of the new data are as expected (e.g., feature distributions, number of categories).
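As a minimal sketch of such statistics checks, the function below validates a batch of incoming rows against an expected category set and an expected range for a numeric feature's mean. The feature names (`country`, `age`) and thresholds are illustrative assumptions, not taken from any specific pipeline or library:

```python
# Minimal statistics-based validation sketch; feature names and
# thresholds are hypothetical examples.
from statistics import mean

def validate_batch(rows, expected_categories, mean_range):
    """Return a list of anomaly messages for a batch of rows."""
    anomalies = []
    # Check 1: no unseen categorical values in the new data.
    seen = {r["country"] for r in rows}
    unknown = seen - expected_categories
    if unknown:
        anomalies.append(f"unexpected categories: {sorted(unknown)}")
    # Check 2: mean of a numeric feature stays inside the expected range.
    m = mean(r["age"] for r in rows)
    lo, hi = mean_range
    if not (lo <= m <= hi):
        anomalies.append(f"mean(age)={m:.1f} outside [{lo}, {hi}]")
    return anomalies

batch = [{"country": "US", "age": 34}, {"country": "ZZ", "age": 29}]
print(validate_batch(batch, {"US", "IN", "DE"}, (18, 65)))
# → ["unexpected categories: ['ZZ']"]
```

In practice these expectations would come from a schema inferred over historical training data rather than hard-coded constants.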
Different types of validation can be performed depending on objectives and constraints. Typical objectives in a machine learning pipeline include:
- Are there any anomalies or data errors in the incremental data? If yes, raise an alert to the team for investigation.
- Are any assumptions made about the data during model training violated during serving? If yes, raise an alert to the team for investigation.
- Are there significant differences between training and serving data? Or are there differences between successive batches of data added to the training set? If yes, raise an alert to investigate differences between the training and serving code stacks.
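One common way to detect the training/serving differences described above is to compare the empirical distributions of a feature in the two datasets. The sketch below uses the L-infinity distance between category frequencies; the 0.1 alert threshold is an assumed tuning parameter, not a recommended value:

```python
# Detect training/serving skew for a categorical feature by comparing
# empirical frequency distributions (L-infinity distance).
from collections import Counter

def linf_distance(train_values, serving_values):
    """Max absolute difference between category frequencies."""
    def freqs(values):
        counts = Counter(values)
        n = len(values)
        return {k: v / n for k, v in counts.items()}
    p, q = freqs(train_values), freqs(serving_values)
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

train = ["US"] * 80 + ["IN"] * 20      # 80% / 20%
serving = ["US"] * 50 + ["IN"] * 50    # 50% / 50%
skew = linf_distance(train, serving)   # 0.3
if skew > 0.1:  # assumed threshold; tune per feature in practice
    print(f"ALERT: distribution skew {skew:.2f} exceeds threshold")
```

The same comparison applied to successive training batches catches drift in incremental data before it silently degrades a retrained model.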
Output from the data validation steps should be informative enough for a data engineer to take action. It also needs high precision: too many false alarms will quickly erode the system's credibility.
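To make alerts actionable, it helps to emit structured anomaly records rather than free-form log lines. The record layout below is a hypothetical illustration; the field names are not taken from any particular tool:

```python
# Hypothetical structure for an actionable anomaly report.
from dataclasses import dataclass, asdict

@dataclass
class Anomaly:
    feature: str    # which feature triggered the alert
    check: str      # which validation rule failed
    observed: float # value seen in the new data
    expected: str   # expected range or set, for quick triage
    severity: str   # e.g. "error" blocks the pipeline, "warn" does not

report = Anomaly("age", "mean_in_range", 71.2, "[18, 65]", "error")
print(asdict(report))
```

Including both the observed value and the expectation in every record lets the on-call engineer judge, at a glance, whether the data or the expectation is wrong.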