Some types of outliers can be removed. Garbage values, or values that you know cannot be true, can be dropped. Outliers with extreme values that lie far outside the cluster formed by the rest of the data points can be removed as well. If you cannot drop outliers, you could reconsider whether you chose the right model, use algorithms (like random forests) that won’t be impacted as heavily by the outlier values, or try normalizing your data.
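As a rough illustration of the "extreme values" case, here is a minimal sketch that flags points outside the interquartile-range fences. The function name `split_outliers`, the sample readings, and the 1.5 multiplier are assumptions chosen for the example (1.5 is a common rule of thumb, not a requirement), and whether flagged points should actually be dropped still depends on what you know about the data.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (kept, flagged) using IQR fences.

    k is the fence multiplier; 1.5 is a conventional default,
    assumed here for illustration only.
    """
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    kept = [v for v in values if lower <= v <= upper]
    flagged = [v for v in values if v < lower or v > upper]
    return kept, flagged


# Hypothetical sample: seven clustered readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
kept, flagged = split_outliers(readings)
print("kept:", kept)
print("flagged:", flagged)
```

Returning the flagged values instead of silently discarding them leaves room to inspect them first, which matters when an extreme value might be real rather than garbage.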
- Have you worked on a data science project that required a substantial programming component? What did you take away from the experience?
- Describe how to effectively represent data with five dimensions.
- You need to generate a predictive model using multiple regression. What’s your process for validating this model?
- How do you ensure that the changes you’re making to an algorithm are an improvement?
- Please provide your method for handling an imbalanced data set that’s being used for prediction (e.g., vastly more negative examples than positive examples).
- What’s your approach to validating a multiple regression model that you built to predict a quantitative outcome variable?
- You have two different models of comparable computational performance and accuracy. Please explain how you decide which to choose for production and why.
- You are given a data set consisting of variables for which a substantial portion of the values are missing. What’s your approach?