Machine learning models are vulnerable to poor data quality, as the old adage “garbage in, garbage out” suggests.
In production, the model is periodically retrained on a fresh increment of data (as frequently as daily), and the updated model is pushed to the serving layer. While serving, the model makes predictions on new incoming data; that same data, once actual labels arrive, is added to the training set and used for retraining. This ensures that each newly generated model adapts to changes in the characteristics of the data.
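The loop described above can be sketched as follows. This is a minimal illustration, not a real pipeline: the model is a toy that predicts the mean label, and the function names (train, predict) are hypothetical.

```python
# Minimal sketch of the periodic retraining loop: each period, a small
# labeled batch collected at serving time is appended to the training
# set, and the model is retrained on the grown dataset.

def train(data):
    """Toy 'model': predict the mean of the training labels."""
    labels = [y for _, y in data]
    return sum(labels) / len(labels)

def predict(model, x):
    """The toy model ignores features and returns the stored mean."""
    return model

# Initial training set: (feature, label) pairs.
training_data = [(i, float(i)) for i in range(10)]
model = train(training_data)

for day in range(3):
    # Fresh serving data for this period, now paired with actual labels.
    new_batch = [(10 + day, float(10 + day))]
    training_data.extend(new_batch)   # incremental accumulation
    model = train(training_data)      # retrain; push updated model to serving
```

Note that each new batch is a small fraction of the accumulated training set, which is exactly why errors introduced at serving time are hard to notice in aggregate metrics.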
However, the incoming data in the serving layer can change for various reasons, such as a code change that introduces errors in the serving data-ingestion component, or differences between the training and serving stacks. Over time, the erroneously ingested data becomes part of the training data and starts degrading model accuracy. Because the data newly added in each iteration is generally a small fraction of the overall training data, the drop in model accuracy is easily missed, and the errors keep accumulating over time.
Thus, catching data errors at an early stage is crucial: the cost of a data error is bound to grow as it propagates further down the pipeline.
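One simple way to catch such errors before they enter the training set is to validate each serving batch against statistics recorded on the trusted training data. The sketch below, under assumed names (fit_baseline, check_batch), flags a batch whose mean drifts too far from the training baseline; real systems typically check many more properties (schemas, missing-value rates, distributions), but the idea is the same.

```python
import statistics

def fit_baseline(values):
    """Record the mean and stdev of a feature on the trusted training set."""
    return statistics.mean(values), statistics.stdev(values)

def check_batch(baseline, batch, z_threshold=3.0):
    """Return True if the batch mean is within z_threshold standard
    errors of the training mean, i.e. consistent with training data."""
    mean, stdev = baseline
    stderr = stdev / len(batch) ** 0.5
    z = abs(statistics.mean(batch) - mean) / stderr
    return z <= z_threshold

# Baseline from training data: a feature uniformly spread over 0..99.
baseline = fit_baseline([float(i) for i in range(100)])

ok_batch = [48.0, 50.0, 51.0, 49.0]       # consistent with training
bad_batch = [480.0, 500.0, 510.0, 490.0]  # e.g. a unit-conversion bug (x10)
```

Running such a check at ingestion time lets the pipeline quarantine a suspicious batch instead of silently folding it into the next retraining cycle.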