Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of-sample evaluation metric?

Area Under ROC Curve (AUROC)

AUROC is robust to class imbalance, unlike raw accuracy.

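For example, if you want to detect a type of cancer that affects only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free. That model never ranks a sick patient above a healthy one, so its AUROC is only 0.5, no better than random guessing.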
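To make the cancer example concrete, here is a minimal sketch in plain Python. The 1,000-patient cohort, the helper names `accuracy` and `auroc`, and the constant score of 0.0 are illustrative assumptions, not part of the original answer. AUROC is computed here through its rank interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC as P(score of random positive > score of random negative).

    Ties count as 0.5. This pairwise form is quadratic in the sample
    size, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 1,000 patients, 1% of whom have the cancer.
y_true = [1] * 10 + [0] * 990

# Trivial model: everyone is cancer-free, so every patient gets the same score.
trivial_pred = [0] * len(y_true)
trivial_scores = [0.0] * len(y_true)

trivial_accuracy = accuracy(y_true, trivial_pred)
trivial_auroc = auroc(y_true, trivial_scores)

print(f"accuracy = {trivial_accuracy:.2f}")  # accuracy = 0.99
print(f"AUROC    = {trivial_auroc:.2f}")     # AUROC    = 0.50
```

The accuracy looks excellent only because the negative class dominates; the AUROC shows that the model cannot separate the two classes at all.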


I wouldn’t say AUC is always a better measure of performance, but it is perhaps the best single-number “summary” of a classifier, because it incorporates sensitivity and specificity at every threshold level. Depending on your purpose, though, AUC might not be the best measure. Often you will want to fix a single level of sensitivity or specificity that the problem requires and measure performance at that one point on the ROC curve.
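Evaluating at a single operating point can be sketched as follows. The toy labels and scores, the 0.75 sensitivity target, and the helper names `sensitivity_specificity` and `operating_point` are assumptions made for illustration only. The idea is to scan candidate thresholds from highest to lowest and stop at the first one that reaches the required sensitivity, which is also the one with the best specificity among those that qualify.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def operating_point(y_true, scores, min_sensitivity):
    """Highest threshold whose sensitivity reaches min_sensitivity.

    Specificity never decreases as the threshold rises, so the highest
    qualifying threshold also gives the best specificity.
    """
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(y_true, scores, threshold)
        if sens >= min_sensitivity:
            return threshold, sens, spec
    raise ValueError("no threshold reaches the requested sensitivity")


# Toy data (hypothetical): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.1, 0.05]

threshold, sens, spec = operating_point(y_true, scores, min_sensitivity=0.75)
print(f"threshold={threshold}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Two classifiers with the same AUC can differ noticeably at such a point, which is why the chosen sensitivity or specificity level is worth reporting alongside the summary number.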