Multi-Label Classification refers to those classification tasks that have two or more class labels, where one or more class labels may be predicted for each example.
Consider the example of photo classification, where a given photo may have multiple objects in the scene and a model may predict the presence of multiple known objects in the photo, such as “bicycle,” “apple,” “person,” etc.
This is unlike binary classification and multi-class classification, where a single class label is predicted for each example.
It is common to model multi-label classification tasks with a model that predicts multiple outputs, with each output predicted as a Bernoulli probability distribution. This is essentially a model that makes multiple binary classification predictions for each example.
Classification algorithms used for binary or multi-class classification cannot be used directly for multi-label classification. Specialized versions of standard classification algorithms can be used, so-called multi-label versions of the algorithms, including:
- Multi-label Decision Trees
- Multi-label Random Forests
- Multi-label Gradient Boosting
Another approach is to use a separate classification algorithm to predict the labels for each class.
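The two approaches above can be sketched with scikit-learn. This is a minimal illustration rather than a recommended setup: the dataset sizes, the choice of RandomForestClassifier as the algorithm that accepts a label matrix directly, and the use of MultiOutputClassifier around LogisticRegression for the one-classifier-per-label approach are all assumptions made for the example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data: 200 examples, 10 features, 3 possible labels (assumed sizes).
X, y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=1
)

# Approach 1: an algorithm that accepts a 2D binary label matrix directly.
forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X, y)
forest_pred = forest.predict(X[:5])

# Approach 2: fit one independent binary classifier per label.
per_label = MultiOutputClassifier(LogisticRegression(max_iter=1000))
per_label.fit(X, y)
per_label_pred = per_label.predict(X[:5])

# Each prediction is a row of 0/1 flags, one flag per label.
print(forest_pred.shape, per_label_pred.shape)
```

Both models return one binary flag per label for each example, which matches the view of multi-label classification as several binary predictions made at once.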
Imbalanced Classification refers to classification tasks where the number of examples in each class is unequally distributed.
Typically, imbalanced classification tasks are binary classification tasks where the majority of examples in the training dataset belong to the normal class and a minority of examples belong to the abnormal class. Examples include:
- Fraud detection.
- Outlier detection.
- Medical diagnostic tests.
These problems are modeled as binary classification tasks, although they may require specialized techniques.
Specialized techniques may be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class.
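As a rough sketch of what changing the composition of the training dataset can look like, the example below performs plain random oversampling and random undersampling with NumPy. The 95/5 class split, the random seeds, and the variable names are assumptions made for illustration; dedicated resampling libraries offer more sophisticated methods.

```python
import numpy as np
from sklearn.datasets import make_classification

# Imbalanced binary dataset: roughly 95% class 0, 5% class 1 (assumed split).
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)

rng = np.random.default_rng(1)
majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Random oversampling: resample minority examples with replacement
# until both classes have the same number of examples.
extra = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep only as many majority examples
# as there are minority examples.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Class counts before resampling, after oversampling, after undersampling.
print(np.bincount(y), np.bincount(y_over), np.bincount(y_under))
```

Resampling like this is normally applied to the training split only, so that the test data keeps the original class distribution.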
Specialized modeling algorithms may be used that pay more attention to the minority class when fitting the model on the training dataset, such as cost-sensitive machine learning algorithms. Examples include:
- Cost-sensitive Logistic Regression.
- Cost-sensitive Decision Trees.
- Cost-sensitive Support Vector Machines.
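One way to approximate cost-sensitive learning in scikit-learn is the class_weight argument, which scales each example's contribution to the loss by the weight of its class. The sketch below compares an unweighted logistic regression with one using class_weight="balanced". The dataset, the split, and the choice of recall as the reported metric are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced binary dataset: roughly 95% class 0, 5% class 1 (assumed split).
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

# Standard model: every misclassification costs the same.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cost-sensitive model: errors on the minority class are weighted more
# heavily, in inverse proportion to the class frequencies.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted.fit(X_train, y_train)

plain_pred = plain.predict(X_test)
weighted_pred = weighted.predict(X_test)

# Recall on the minority class (label 1) for each model.
plain_recall = recall_score(y_test, plain_pred)
weighted_recall = recall_score(y_test, weighted_pred)
print(f"plain recall: {plain_recall:.3f}, weighted recall: {weighted_recall:.3f}")
```

The weighted model typically flags more examples as the minority class, trading some precision for higher recall on that class.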
Finally, alternative performance metrics may be required as reporting the classification accuracy may be misleading.
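To see why accuracy can mislead, consider a baseline that always predicts the majority class. The sketch below uses scikit-learn's DummyClassifier for this; the 95/5 class split and the choice of recall and F1 as the alternative metrics are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Imbalanced binary dataset: roughly 95% class 0, 5% class 1 (assumed split).
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)

# Baseline that ignores the inputs and always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)            # fraction of minority examples found
f1 = f1_score(y, y_pred, zero_division=0)   # 0 because no positives are predicted

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The baseline scores high accuracy while detecting none of the minority class, which is why metrics that focus on the minority class are often preferred for these problems.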
We can use the make_multilabel_classification() function to generate a synthetic multi-label classification dataset, and the make_classification() function to generate a synthetic imbalanced binary classification dataset.
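A minimal sketch of both calls follows. The sample counts, feature counts, label counts, and the 99/1 class weighting are assumed values chosen for illustration, not settings prescribed by the text.

```python
from collections import Counter

from sklearn.datasets import make_classification, make_multilabel_classification

# Multi-label dataset: each row of y_ml is a binary vector with one flag per label.
X_ml, y_ml = make_multilabel_classification(
    n_samples=1000, n_features=2, n_classes=3, n_labels=2, random_state=1
)
print(X_ml.shape, y_ml.shape)
for i in range(5):
    print(X_ml[i], y_ml[i])

# Imbalanced binary dataset: the weights argument sets the approximate
# proportion of examples assigned to each class.
X_im, y_im = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.99, 0.01],
    random_state=1,
)
class_counts = Counter(y_im)
print(class_counts)
```

Printing the class counts is a quick check that the generated dataset has the intended skew before any modeling begins.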