Describe the problem informally, as though you were explaining it to a friend or colleague. This is a good starting point for highlighting gaps that you might need to fill. It also gives you the basis for a one-sentence description you can use to share your understanding of the problem.
For example: I need a program that will tell me which tweets will get retweets.
In a previous blog post defining machine learning you learned about Tom Mitchell’s machine learning formalism. Here it is again to refresh your memory.
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Use this formalism to define the T, P, and E for your problem.
- Task (T): Classify a tweet that has not yet been published as likely to get retweets or not.
- Experience (E): A corpus of tweets for an account where some have retweets and some do not.
- Performance (P): Classification accuracy: the number of tweets predicted correctly out of all tweets considered, expressed as a percentage.
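To make the performance measure concrete, here is a minimal sketch of classification accuracy as a percentage. The function name `classification_accuracy` and the example labels are made up for illustration (1 means the tweet got at least one retweet, 0 means it did not).

```python
def classification_accuracy(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count the positions where the predicted label equals the actual label.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


# Hypothetical labels: 1 = tweet got at least one retweet, 0 = it did not.
actual = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0]
accuracy = classification_accuracy(actual, predicted)
print(f"Accuracy: {accuracy:.1f}%")
```

Here four of the five predictions match, so the measure P comes out at 80%.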
Create a list of assumptions about the problem and its phrasing. These may be rules of thumb and domain-specific information that you think will get you to a viable solution faster.
It can be useful to highlight questions that can be tested against real data, because breakthroughs and innovation often occur when assumptions and best practices are shown to be wrong in the face of real data. It can also be useful to highlight areas of the problem specification that may need to be challenged, relaxed or tightened.
- The specific words used in the tweet matter to the model.
- The specific user that retweets does not matter to the model.
- The number of retweets may matter to the model.
- Older tweets are less predictive than more recent tweets.
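Writing assumptions down also makes them easy to encode and, later, to test. The sketch below shows one way the first three assumptions could be turned into training examples. The record fields (`text`, `retweet_count`, `retweeted_by`) and the helper `to_example` are hypothetical, not taken from any real Twitter API, and the assumption about older tweets is not covered here.

```python
import re


def to_example(tweet):
    """Turn one raw tweet record into (words, label) under the listed assumptions."""
    # Assumption: the specific words matter, so keep a lowercase set of words.
    words = set(re.findall(r"[a-z0-9#@']+", tweet["text"].lower()))
    # Assumption: who retweeted does not matter, so "retweeted_by" is ignored.
    # Assumption: the retweet count may matter; in this sketch it is only
    # collapsed into a yes/no label.
    label = 1 if tweet["retweet_count"] > 0 else 0
    return words, label


# Hypothetical records, invented for illustration.
tweets = [
    {"text": "New post: defining your ML problem",
     "retweet_count": 3, "retweeted_by": ["a", "b", "c"]},
    {"text": "Good morning!", "retweet_count": 0, "retweeted_by": []},
]
examples = [to_example(t) for t in tweets]
for words, label in examples:
    print(sorted(words), label)
```

If an assumption turns out to be wrong, for example if the retweeting user does matter, you change one line in the helper and rerun your experiments.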
What other problems have you seen or can you think of that are like the problem you are trying to solve? Other problems can inform the problem you are trying to solve by highlighting limitations in your phrasing of the problem, such as time dimensions and concept drift (where the concept being modeled changes over time). Other problems can also point to algorithms and data transformations that could be adopted to spot-check performance.
For example: A related problem would be email spam discrimination, which uses text messages as input data and needs a binary classification decision.
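As an illustration of borrowing from a similar problem, naive Bayes is a common choice for spam filtering, so it is a reasonable algorithm to spot-check on tweets. The following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. The training tweets are invented, and the choice of naive Bayes is an assumption made for this example rather than a recommendation for your problem.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """Fit a multinomial naive Bayes model on (list_of_words, label) pairs."""
    class_counts = Counter()
    word_counts = {}  # label -> Counter of word frequencies
    vocabulary = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the label with the highest log posterior for the given words."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        # Add-one (Laplace) smoothing over the training vocabulary.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            if word in vocabulary:  # skip words never seen in training
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: 1 = got retweets, 0 = did not.
training = [
    ("new tutorial on machine learning".split(), 1),
    ("free machine learning course released".split(), 1),
    ("having coffee this morning".split(), 0),
    ("stuck in traffic this morning".split(), 0),
]
model = train_naive_bayes(training)
prediction = predict(model, "new machine learning course".split())
print(prediction)
```

With real data you would hold out some tweets for testing and report accuracy on those, rather than judging the model on a single prediction.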