Data preparation involves transforming raw data into a form that is most appropriate for the learning algorithms.
This might involve scaling values, handling missing values, and changing the probability distribution of variables.
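As a rough illustration of these three kinds of preparation, the sketch below applies them to a single made-up feature using only the Python standard library. The data values and the specific choices (median imputation, a log transform, min-max scaling) are assumptions picked for the example, not a recommended recipe.

```python
import math
import statistics

# Hypothetical feature with one missing entry and a long right tail.
raw = [2.0, None, 8.0, 4.0, 1000.0]

# Handle missing values: replace None with the median of the observed values.
observed = [v for v in raw if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in raw]

# Change the distribution: log1p compresses the long right tail.
logged = [math.log1p(v) for v in imputed]

# Scale values: min-max scaling maps the result into the range [0, 1].
lo, hi = min(logged), max(logged)
scaled = [(v - lo) / (hi - lo) for v in logged]

print(imputed)
print(scaled)
```

The order of the steps is itself a choice: scaling before the log transform would give a different result, which is part of why choosing transforms can be treated as a search.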
Transforms can change the representation of the historical data to meet the expectations or requirements of specific learning algorithms. Yet sometimes good, or even the best, results are achieved when those expectations are violated or when a seemingly unrelated transform is applied to the data.
We can think of choosing transforms to apply to the training data as a search or optimization problem of best exposing the unknown underlying structure of the data to the learning algorithm.
- Data Preparation: Function inputs are sequences of transforms; this is an optimization problem that requires an iterative global search algorithm.
This optimization is often performed manually through trial and error. Nevertheless, it is possible to automate this task using a global optimization algorithm where the inputs to the function are the types and order of transforms applied to the training data.
The number of data transforms and their permutations is typically quite limited, so it may be possible to perform an exhaustive search or a grid search of commonly used sequences.
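A minimal sketch of such an exhaustive search is shown below, using only the Python standard library. Everything in it is an assumption made for illustration: the three candidate transforms, the toy dataset, the helper names (`candidate_sequences`, `search`), and the use of leave-one-out accuracy of a 1-nearest-neighbor classifier as the function being optimized. For brevity, each transform is computed on the full dataset before evaluation; in practice transforms should be fit on the training portion only to avoid leaking information into the evaluation.

```python
import itertools
import math
import statistics

# Hypothetical column-wise transforms: each maps a list of rows to new rows.
def standardize(rows):
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def min_max(rows):
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # guard constant columns
    return [[(v - lo) / sp for v, lo, sp in zip(r, lows, spans)] for r in rows]

def log_shift(rows):
    # Shift each column so its minimum is zero, then apply log1p.
    lows = [min(c) for c in zip(*rows)]
    return [[math.log1p(v - lo) for v, lo in zip(r, lows)] for r in rows]

TRANSFORMS = {"standardize": standardize, "min_max": min_max, "log": log_shift}

def loo_accuracy_1nn(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, row in enumerate(rows):
        others = (j for j in range(len(rows)) if j != i)
        nearest = min(others, key=lambda j: math.dist(row, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def candidate_sequences(names, max_len):
    """Yield the empty sequence, then every ordered sequence up to max_len."""
    yield ()
    for k in range(1, max_len + 1):
        yield from itertools.permutations(names, k)

def search(rows, labels, max_len=2):
    """Score every candidate sequence; return the best one and all results."""
    results = []
    for seq in candidate_sequences(sorted(TRANSFORMS), max_len):
        data = rows
        for name in seq:
            data = TRANSFORMS[name](data)
        results.append((loo_accuracy_1nn(data, labels), seq))
    # Prefer a higher score, then a shorter sequence.
    best = max(results, key=lambda r: (r[0], -len(r[1])))
    return best, results

# Toy data: the first feature separates the classes, while the second is
# on a much larger scale and dominates raw distances.
rows = [[1.0, 1000.0], [1.2, 5000.0], [0.9, 9000.0],
        [3.0, 1200.0], [3.1, 5200.0], [2.9, 8800.0]]
labels = [0, 0, 0, 1, 1, 1]

best, results = search(rows, labels)
for score, seq in sorted(results, key=lambda r: -r[0]):
    print(f"{score:.2f}  {' -> '.join(seq) or '(no transform)'}")
```

With three transforms and sequences of at most two steps there are only ten candidates, so enumerating all of them is cheap. As the pool of transforms or the maximum sequence length grows, the number of candidates grows quickly, and the enumeration in `candidate_sequences` could be replaced by a stochastic global search.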