How to deal with unbalanced binary classification?

The short answer is sampling, which can be done in several ways, from simple to complex:

  1. You can oversample the minority class or undersample the majority class.
  2. You can draw multiple stratified samples and combine them into an ensemble at the end.
  3. You can sample at a specific class ratio that you want to maintain.
  4. You can cluster the data first and then sample from the clusters.
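
As a rough illustration of option 1, here is a minimal sketch of random oversampling and undersampling using only the Python standard library. The function names `random_oversample` and `random_undersample` and the toy data are made up for this example; libraries such as imbalanced-learn provide more complete resamplers.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Group rows by their class label, preserving first-seen label order."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 5 positives and 95 negatives.
rows = [[i] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling should normally be applied to the training split only, so that the validation data keeps the original class distribution.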

A few other approaches, more tedious and less recommended, involve working with the evaluation measures. For example, you can change the probability cutoff and analyze sensitivity and specificity until you get closer to the desired performance, checking the result with a validation mechanism such as k-fold cross-validation.
You could also prefer tree-based models, which are often reported to cope better with imbalanced data.
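
To make the cutoff idea concrete, here is a small sketch that computes sensitivity and specificity at different probability thresholds. The helper `sensitivity_specificity` and the scores are invented for illustration; in practice the scores would come from your model's predicted probabilities on held-out folds.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are predicted positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical labels and predicted probabilities: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.70]

for threshold in (0.5, 0.4):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"cutoff={threshold}: sensitivity={sens}, specificity={spec}")
```

On this toy data, lowering the cutoff from 0.5 to 0.4 catches both positives at the price of one more false positive, which is the trade-off you would be tuning.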

When dealing with an imbalanced classification problem, I would consider the following aspects:

  1. Whether more data can be collected. Sometimes the dataset is imbalanced simply because not enough data has been collected.
  2. Use techniques to balance the data, such as downsampling the majority class, upsampling the minority class, or generating synthetic data. The main goal is to convert the imbalanced classification problem into a balanced one so that regular classification algorithms can be used.
  3. Choose an approach that is less sensitive to imbalanced data, such as cost-sensitive learning. The basic idea is to adjust the misclassification cost of each class so that errors on the minority class are penalized more heavily.
  4. Use an appropriate performance metric, such as AUC, F-score, or measures read off the confusion matrix. Choosing the right metric matters because plain accuracy can look high even when the model only ever predicts the majority class.
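
Points 3 and 4 can be sketched in a few lines of plain Python. The weighting below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is one possible way to set class costs and not the only one. The helper names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# Cost-sensitive idea: minority-class errors get a larger weight.
weights = balanced_class_weights(y_true)
print(weights)

# Metric idea: a model that always predicts the majority class
# scores high on accuracy but zero on precision, recall and F1.
always_negative = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy, precision_recall_f1(y_true, always_negative))
```

Weights like these would typically be passed to a learner that accepts per-class or per-sample weights in its loss function.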