How does the data validation component work?

Think of the data validation component as a guard post for the ML application that keeps poor-quality data out. It checks every new data entry that is about to be added to the training data. The data validation framework can be summarized in five steps:

  1. Calculate statistics from the training data against a set of rules
  2. Calculate statistics for the ingested data that has to be validated
  3. Compare the statistics of the data being validated with the statistics from the training data
  4. Store the validation results and take automated actions such as removing the row, or capping or flooring the values
  5. Send notifications and alerts for approval
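
The five steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names (`compute_stats`, `validate_batch`), the single numeric column, and the choice of the training minimum and maximum as the "rule" are all hypothetical and do not come from any particular framework.

```python
from statistics import mean


def compute_stats(values):
    """Summarize one numeric column (used for steps 1 and 2)."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def validate_batch(train_values, new_values):
    """Run the five steps on one column; the range rule is an assumption."""
    train_stats = compute_stats(train_values)   # step 1: training statistics
    new_stats = compute_stats(new_values)       # step 2: ingested-data statistics
    lo, hi = train_stats["min"], train_stats["max"]

    # Step 3: compare new values against the range seen in training.
    violations = [v for v in new_values if not lo <= v <= hi]

    # Step 4: store the results and apply an automated action (cap and floor).
    cleaned = [min(max(v, lo), hi) for v in new_values]
    report = {
        "train_stats": train_stats,
        "new_stats": new_stats,
        "violations": violations,
    }

    # Step 5: raise an alert so that a person can approve or reject the batch.
    alerts = []
    if violations:
        alerts.append(
            f"{len(violations)} value(s) outside [{lo}, {hi}]; approval needed"
        )
    return cleaned, report, alerts


train_ages = [23, 35, 41, 29, 52]
new_ages = [31, 47, 130, -4]
cleaned, report, alerts = validate_batch(train_ages, new_ages)
print(cleaned)
print(alerts)
```

In a real pipeline the report would be persisted and the alert sent through a notification channel; here both are simply returned to keep the sketch self-contained.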

One such example comes from Amazon Research:
**Unit-test approach for data validation by Amazon Research**

In software engineering, engineers write unit tests to test their code. Similarly, unit tests should be defined to test incoming data. The authors define a framework for this component that follows these principles:

  1. Declare constraints: The user defines what their data should look like by declaring checks that compose constraints on various columns.
  2. Compute metrics: The declared constraints are translated into measurable metrics. These metrics can be computed and compared between the existing data and the incremental data.
  3. Analyze and report: Based on the metrics collected over time, predict whether the metric on the incremental data is an anomaly. As a rule, the user can have the system issue a "warning" if the new metric is more than three standard deviations away from the previous mean, or throw an "error" if it is more than four standard deviations away. Based on this analysis, report the constraints that fail, including the value(s) that made each constraint fail.
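
The three principles can be illustrated with a small Python sketch. It is not the authors' API: the names `completeness`, `classify`, and `check` are hypothetical, completeness (the share of non-missing values) is just one possible metric, and the sample standard deviation and the handling of a zero standard deviation are assumptions made for the example.

```python
from statistics import mean, stdev


def completeness(rows, column):
    """Metric for a 'column is complete' constraint: share of non-missing values."""
    return sum(r.get(column) is not None for r in rows) / len(rows)


def classify(history, new_value, warn_at=3.0, error_at=4.0):
    """Flag the new metric by its distance from the mean of past values."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Assumption: with no past variation, any change counts as an error.
        return "ok" if new_value == mu else "error"
    distance = abs(new_value - mu) / sigma
    if distance > error_at:
        return "error"
    if distance > warn_at:
        return "warning"
    return "ok"


def check(rows, column, history):
    """Report the constraint, the value that was measured, and the outcome."""
    value = completeness(rows, column)
    return {
        "constraint": f"completeness({column})",
        "value": value,
        "status": classify(history, value),
    }


# Completeness of the "age" column measured on earlier batches (made-up numbers).
past_completeness = [0.98, 0.99, 0.97, 0.98, 0.99]
incremental_rows = [{"age": 31}, {"age": None}, {"age": 47}, {"age": 52}]
result = check(incremental_rows, "age", past_completeness)
print(result)
```

Because the report carries the measured value along with the status, a reviewer can see which constraint failed and by how much.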