The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:
- Business understanding – What does the business need? The Business Understanding phase focuses on understanding the objectives and requirements of the project. It has four tasks; all but the third are foundational project management activities that are universal to most projects.
- Data understanding – What data do we have/need? Is it clean? Building on the foundation of Business Understanding, the Data Understanding phase focuses on identifying, collecting, and analyzing the data sets that can help you accomplish the project goals. This phase also has four tasks.
- Data preparation – How do we organize the data for modeling? A common rule of thumb is that 80% of the project is data preparation. This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks.
- Modeling – What modeling techniques should we apply? What is widely regarded as data science’s most exciting work is also often the shortest phase of the project. Here you’ll likely build and assess various models based on several different modeling techniques.
- Evaluation – Which model best meets the business objectives? Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business objectives and what to do next.
- Deployment – How do stakeholders access the results? A model is not particularly useful unless the customer can access its results. The complexity of this phase varies widely. This final phase has four tasks.
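
The difference between the technical Assess Model task (Modeling) and the broader Evaluation phase can be made concrete with a small sketch. The Python below is illustrative only: the `Candidate` class, the `monthly_cost` field, the two helper functions, and all numbers are hypothetical and are not part of CRISP-DM itself. It assumes the business objective can be expressed as a minimum accuracy plus a cost ceiling, which will not hold for every project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A model produced during the Modeling phase (all fields are illustrative)."""
    name: str
    accuracy: float       # technical score from the Assess Model task
    monthly_cost: float   # hypothetical running cost, in arbitrary units


def assess_models(candidates: List[Candidate]) -> List[Candidate]:
    """Modeling phase: rank candidates on the technical metric alone."""
    return sorted(candidates, key=lambda c: c.accuracy, reverse=True)


def evaluate_for_business(
    candidates: List[Candidate],
    min_accuracy: float,
    max_monthly_cost: float,
) -> Optional[Candidate]:
    """Evaluation phase: return the cheapest candidate that satisfies both
    (assumed) business constraints, or None if no candidate qualifies."""
    viable = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.monthly_cost <= max_monthly_cost
    ]
    return min(viable, key=lambda c: c.monthly_cost, default=None)


# Made-up numbers, purely for illustration.
candidates = [
    Candidate("gradient_boosting", accuracy=0.93, monthly_cost=900.0),
    Candidate("random_forest", accuracy=0.91, monthly_cost=400.0),
    Candidate("logistic_regression", accuracy=0.88, monthly_cost=50.0),
]

best_technical = assess_models(candidates)[0]
best_for_business = evaluate_for_business(
    candidates, min_accuracy=0.90, max_monthly_cost=500.0
)

print("Best on technical assessment:", best_technical.name)
print("Best for the business:", best_for_business.name)
```

With these made-up inputs, the model that wins the technical ranking is not the one selected for the business, because it exceeds the assumed cost ceiling. A `None` result from `evaluate_for_business` would correspond to the "what to do next" question: no candidate meets the objectives, so the project would likely loop back to an earlier phase.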