Machine learning algorithms learn from data. It is critical that you feed them the right data for the problem you want to solve.
Even if you have good data, you need to make sure that it is in a useful scale and format, and that meaningful features are included.
This step is concerned with selecting the subset of all available data that you will be working with. There is always a strong desire to include all the data that is available, in the hope that the maxim “more is better” will hold. This may or may not be true.
You need to consider what data you actually need to address the question or problem you are working on. Make some assumptions about the data you require and be careful to record those assumptions so that you can test them later if needed.
Below are some questions to help you think through this process:
- What is the extent of the data you have available? For example, consider how far it reaches through time, across database tables and across connected systems. Ensure you have a clear picture of everything that you can use.
- What data is not available that you wish you had? For example, data that is not recorded or cannot be recorded. You may be able to derive or simulate this data.
- What data don’t you need to address the problem? Excluding data is almost always easier than including data. Note down which data you excluded and why.
It is only in small problems, such as competition or toy datasets, that the data has already been selected for you.
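One way to act on the advice above is to record each selection decision, and the assumption behind it, at the moment you make it. The Python sketch below is only an illustration: the `SelectionLog` class, the `select_data` function, the field names and the sample records are all hypothetical, and it assumes every record carries a `day` field holding a date.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SelectionLog:
    """Records each selection decision together with the assumption behind it."""
    entries: list = field(default_factory=list)

    def note(self, action, target, reason):
        self.entries.append({"action": action, "target": target, "reason": reason})


def select_data(rows, excluded, start, end, window_reason, log):
    """Drop the excluded fields and keep rows whose 'day' falls in [start, end].

    `excluded` maps a field name to the reason it was left out, so every
    exclusion is written to the log along with its justification.
    """
    for name, reason in excluded.items():
        log.note("exclude field", name, reason)
    log.note("restrict rows", f"{start} to {end}", window_reason)
    return [
        {key: value for key, value in row.items() if key not in excluded}
        for row in rows
        if start <= row["day"] <= end
    ]


# Hypothetical raw records; in practice these would come from your own systems.
raw = [
    {"day": date(2019, 5, 1), "customer_id": 1, "spend": 20.0, "browser": "x"},
    {"day": date(2021, 3, 9), "customer_id": 2, "spend": 35.5, "browser": "y"},
    {"day": date(2021, 7, 2), "customer_id": 3, "spend": 12.0, "browser": "x"},
]

log = SelectionLog()
selected = select_data(
    raw,
    excluded={"browser": "assumed unrelated to spend; revisit if results are poor"},
    start=date(2021, 1, 1),
    end=date(2021, 12, 31),
    window_reason="assumed older behaviour no longer reflects current customers",
    log=log,
)

for entry in log.entries:
    print(entry)
```

Keeping the reasons next to the selection code means the assumptions can be revisited and tested later, rather than reconstructed from memory.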