How mislabeling affects choice of machine learning model

Real-world data is often not clean. Most large datasets are aggregated from granular, real-life datasets, some of which may be mislabeled. In manufacturing plants, for example, a machine that fails but is fixed shortly afterwards often isn’t recorded as a breakdown. On the flip side, machines are sometimes shut down for unrelated reasons, and the sensors then report a failure in the data. A good tracking system will take care of most of these edge cases, but say it still misses 10% of them.

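Against this backdrop, boosting algorithms shouldn’t be your first choice. Boosting concentrates each new round on the examples the previous rounds got wrong, so mislabeled points attract more and more attention and the model ends up fitting the noise. Go for bagging-based ensembles instead (random forest would be my first choice), since averaging many independently trained trees dilutes the influence of individual bad labels. As a data scientist, you can in fact even estimate which data is mislabeled with pretty good accuracy by using a mix modeling approach.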
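To make the idea of estimating mislabeled data concrete, here is a minimal sketch of one possible way to do it with a bagged ensemble. It is not necessarily the mix modeling approach meant above: it simply trains many one-split trees (decision stumps, standing in for random forest trees) on bootstrap samples, lets each one vote only on the rows it never saw, and flags rows whose recorded label loses that out-of-bag vote. The toy 1-D data, the 10% flip pattern, and the names make_noisy_data, fit_stump and flag_suspect_labels are all assumptions made up for illustration.

```python
import random

def make_noisy_data(n_per_class=100, flip_every=10):
    """Toy data: two separated 1-D classes, every `flip_every`-th label flipped."""
    xs, ys, flipped = [], [], []
    for label, offset in ((0, 0.0), (1, 0.6)):
        for i in range(n_per_class):
            xs.append(offset + 0.4 * (i + 0.5) / n_per_class)
            if i % flip_every == flip_every // 2:  # simulate a mislabeled record
                flipped.append(len(ys))
                ys.append(1 - label)
            else:
                ys.append(label)
    return xs, ys, flipped

def fit_stump(xs, ys):
    """Threshold t with the fewest training errors for 'predict 1 if x > t'."""
    pairs = sorted(zip(xs, ys))
    errors = sum(1 for _, y in pairs if y == 0)  # t = -inf predicts 1 everywhere
    best_t, best_err = float("-inf"), errors
    i = 0
    while i < len(pairs):
        x = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == x:  # move t past every copy of x
            errors += 1 if pairs[i][1] == 1 else -1
            i += 1
        if errors < best_err:
            nxt = pairs[i][0] if i < len(pairs) else x + 1.0
            best_t, best_err = (x + nxt) / 2, errors  # split halfway to next value
    return best_t

def flag_suspect_labels(xs, ys, n_trees=200, seed=0):
    """Indices whose recorded label loses the out-of-bag majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    votes_for_1 = [0] * n
    n_votes = [0] * n
    for _ in range(n_trees):
        bag = rng.choices(range(n), k=n)  # bootstrap sample, with replacement
        t = fit_stump([xs[j] for j in bag], [ys[j] for j in bag])
        in_bag = set(bag)
        for j in range(n):
            if j not in in_bag:  # a stump only votes on rows it never saw
                votes_for_1[j] += int(xs[j] > t)
                n_votes[j] += 1
    flagged = []
    for j in range(n):
        if n_votes[j] == 0:  # row was in every bag: no evidence either way
            continue
        majority = 1 if 2 * votes_for_1[j] > n_votes[j] else 0
        if majority != ys[j]:
            flagged.append(j)
    return flagged

xs, ys, flipped = make_noisy_data()
flagged = flag_suspect_labels(xs, ys)
print(f"estimated mislabel rate: {len(flagged) / len(ys):.1%}")
print(f"flipped rows that were flagged: {len(set(flagged) & set(flipped))} of {len(flipped)}")
```

On real data the classes will not be this cleanly separated, so rows near the decision boundary will also get flagged; treat the flagged fraction as a rough estimate and the flagged rows as candidates for manual review rather than confirmed errors.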

Read this very interesting article if you want to dig deeper into the topic.