Think Like a Data Scientist Part 1

Hello folks,

On the rare occasion that we think about a data scientist (hah, yeah, right, that’s most likely never), we probably picture someone really good at statistics, probability, programming, and neural networks.

In this article, I want to share my thought process and decision-tree-like methodology with you, which I believe is more practical. In other words, I view being a data scientist less as a mindset and more as a daily practice.

By approaching problems with this methodology, I have helped several clients solve their business problems and have made several contributions to cancer metabolism as a computational biologist and data scientist. Now, I hope to share my general process with you. I hope you find it helpful.

How to think like a data scientist

There are 6 general and iterative steps in how I approach problems as a data scientist:

  1. Defining the problem and the assumptions I’m making about the problem
  2. Outlining how I will solve the problem and what my metrics of success are
  3. Collecting and structuring data
  4. Visualizing and analyzing the data
  5. Modeling
  6. Interpreting the data and model

* Defining the problem and assumptions I’m making about the problem

“The way that you become world-class is by asking good questions” — Tim Ferriss

The main goal of a data scientist is to be able to analyze data, measure outcomes, design experiments, and make decisions that will move an individual or organization forward.

However, before we do any of that, I think the most important step is to define the problem and the assumptions I believe to be true about it.

While this seems like such a simple step, we often don’t think too deeply about the problems we’re trying to solve. Worse, we rarely think about the assumptions and biases we impose on our models and data.

These two mistakes can be quite costly to you and your team. If we ask the wrong questions or make false assumptions, we measure the wrong variables. This wastes both time and money and introduces risks from competitors who may be thinking about the problem differently, or worse, better than you.

I suggest spending a lot of time in this phase, explicitly outlining the problem we’re trying to solve, the evidence we have so far, and the assumptions we’re making. In the same vein as Ray Dalio’s principle of radical transparency, we aim to view the problem as rationally as possible to remove the unknown unknowns and biases we carry with us.

Once we see the problem clearly, then, and only then, should we move to the next step.

* Outlining how I will solve the problem and what my metrics of success are

Once we know the problem, we need to choose the tools we’ll be using to solve the problem and the variables we’ll be measuring to evaluate success. At this point, we probably don’t have a high-resolution view of how we’re going to execute our master plan. And that’s totally okay.

But we need to outline a general course of action. Some things we should be thinking about are:

  • Do we think this is a supervised learning problem with a measurable response variable, or will we need to think of this more as an unsupervised or semi-supervised learning problem?
  • How much data do we need for our model to work?
  • How much will it cost to acquire this data?
  • And finally, what does success look like for our modeling approach? Will we maximize accuracy? Or try to make the model as interpretable as possible?

While outlining our approach may seem to suck the creativity out of the problem-solving step, it adds constraints that let us think more deeply about the problem. More importantly, it forces us to break the solution into a series of actionable steps and to develop a strategy for tackling the problem at hand.

* Collecting and structuring data

Unfortunately, most problems we’re interested in do not have data readily available in a database. It’s time to start collecting and generating data yourself, or better yet, having someone else collect and generate it for you. Think Amazon’s Mechanical Turk service (not being paid for this promo, but call me, Amazon).

The most time-consuming part of being a data scientist is collecting and generating data. If you’re trying to solve a machine learning problem, converting unstructured data into a structured dataset is an added challenge.
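As a toy illustration of that unstructured-to-structured step, here is a minimal sketch that parses raw log lines into structured records. The log format and field names are hypothetical, purely for illustration:

```python
# Sketch: turning unstructured text (raw log lines) into structured
# records with typed fields, ready for analysis.
import re

# Hypothetical raw data, standing in for real unstructured input.
raw_lines = [
    "2023-01-05 ERROR payment failed user=42",
    "2023-01-05 INFO login ok user=7",
    "2023-01-06 ERROR timeout user=42",
]

# One pattern defines the structure we expect to extract.
pattern = re.compile(
    r"^(?P<date>\S+) (?P<level>\w+) (?P<message>.*) user=(?P<user>\d+)$"
)

records = []
for line in raw_lines:
    match = pattern.match(line)
    if match:  # skip lines that don't fit the expected shape
        row = match.groupdict()
        row["user"] = int(row["user"])  # cast to a usable type
        records.append(row)
```

Each record is now a dictionary with consistent keys, so the list can be loaded straight into a table or DataFrame for the visualization and analysis steps.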

Continue to Part 2.

Link: Think Like a Data Scientist Part 2