
AUC-mixup: Deep AUC Maximization with Mixup

Jianzhi Xv
Shandong University
[email protected]
Gang Li
Texas A&M University
[email protected]
Tianbao Yang
Texas A&M University
[email protected]
Abstract

While deep AUC maximization (DAM) has shown remarkable success on imbalanced medical tasks, e.g., chest X-ray classification and skin lesion classification, it can suffer from severe overfitting when applied to small datasets due to its aggressive nature of pushing the prediction scores of positive data away from those of negative data. This paper studies how to improve the generalization of DAM with mixup data augmentation, an approach widely used to improve the generalization of deep learning methods based on the cross-entropy loss. However, AUC is defined over pairs of positive and negative examples, which makes it challenging to incorporate mixup data augmentation into DAM algorithms. To tackle this challenge, we employ the AUC margin loss and incorporate soft labels into its formulation to effectively learn from data generated by mixup augmentation, which we refer to as the AUC-mixup loss. Our experimental results demonstrate the effectiveness of the proposed AUC-mixup methods on imbalanced benchmark and medical image datasets compared to standard DAM training methods.

1 Introduction

In recent years, deep AUC maximization (DAM), which develops deep learning models that directly optimize the area under the receiver operating characteristic curve (AUC), has gained growing importance. Different DAM methods, such as optimizing the AUC margin (AUCM) loss [9] and compositional DAM [8], have been successfully applied to medical image classification to improve AUC performance. These methods demonstrate superior performance on large-scale medical image classification tasks such as CheXpert [3] and Melanoma [5] compared to optimizing traditional loss functions such as the cross-entropy (CE) loss and the focal loss.

Although DAM methods work well on large datasets, they remain vulnerable to overfitting when trained on small imbalanced datasets. DAM losses typically place more emphasis on the minority positive class, and as the overall data volume decreases, samples from this class become scarce, which leads to overfitting. In this context, mixup augmentation [11] offers an effective remedy: it introduces soft labels into training and generates many more examples as convex combinations of existing samples. With mixup augmentation, the focus of DAM shifts from the minority class alone to combinations of samples from different classes, which mitigates the overfitting issue.

However, existing DAM losses are defined with hard labels and are thus incompatible with mixup augmentation. Unlike traditional loss functions that are defined over individual examples, DAM losses are non-decomposable, making the incorporation of soft labels more complicated. To address this problem, we propose an AUC-mixup loss that replaces the conditional means in the min-max AUC margin loss with soft means computed from soft labels. Our goal is to use the AUC-mixup loss to improve DAM and compositional DAM methods on medical tasks with small amounts of data by incorporating mixup augmentation. We validate our method on imbalanced benchmark and medical image datasets, including several 3D datasets, and demonstrate superior generalization performance over two standard DAM baselines.

2 Method

Let $\mathbb{I}(\cdot)$ denote the indicator function of a predicate, let $S=\{(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{N},y_{N})\}$ denote a set of $N$ training examples, where $\mathbf{x}_{i}$ is an input (e.g., a 2D or 3D image) and $y_{i}\in\{0,1\}$ is its corresponding label. Let $\mathbf{w}\in\mathbb{R}^{d}$ denote the parameters of the deep neural network and let $h_{\mathbf{w}}(\mathbf{x})=h(\mathbf{w},\mathbf{x})$ denote the prediction of the network on input $\mathbf{x}$. Yuan et al. [9] proposed DAM by minimizing the AUC margin loss, which is equivalent to the following min-max optimization:

\begin{aligned}
\min_{\mathbf{w}\in\mathbb{R}^{d},(a,b)\in\mathbb{R}^{2}}\max_{\alpha\geq 0}\; & \frac{\sum_{i=1}^{N}\left(h_{\mathbf{w}}(\mathbf{x}_{i})-a\right)^{2}\mathbb{I}(y_{i}=1)}{N_{+}}+\frac{\sum_{i=1}^{N}\left(h_{\mathbf{w}}(\mathbf{x}_{i})-b\right)^{2}\mathbb{I}(y_{i}=0)}{N_{-}} \\
 & +2\alpha\left(m-\frac{\sum_{i=1}^{N}h_{\mathbf{w}}(\mathbf{x}_{i})\mathbb{I}(y_{i}=1)}{N_{+}}+\frac{\sum_{i=1}^{N}h_{\mathbf{w}}(\mathbf{x}_{i})\mathbb{I}(y_{i}=0)}{N_{-}}\right)-\alpha^{2},
\end{aligned}
(1)

where $m$ is a hyperparameter controlling the desired margin between the optimal $a$ (i.e., the mean score of positive data) and the optimal $b$ (i.e., the mean score of negative data), $N_{+}$ is the number of positive samples, and $N_{-}$ is the number of negative samples. The AUC margin loss has been shown to be more robust than the AUC square loss [9]. An improvement over minimizing the AUC margin loss from scratch is compositional DAM, which minimizes a compositional objective, where the outer function corresponds to the AUC margin loss and the inner function represents a gradient descent step for minimizing a cross-entropy (CE) loss. However, the mixup technique cannot be directly applied to either method. In mixup, an augmented example $\hat{\mathbf{x}}$ and its corresponding soft label $\hat{y}$ are generated from two randomly sampled training examples and their labels:

\hat{\mathbf{x}}=\lambda\mathbf{x}_{i}+(1-\lambda)\mathbf{x}_{j},\qquad \hat{y}=\lambda y_{i}+(1-\lambda)y_{j}, \quad (2)

where $\hat{y}\in[0,1]$ and $\lambda\sim\mathrm{Beta}(\alpha,\alpha)$ with $\alpha\in(0,\infty)$. Directly using the AUC margin loss would ignore all augmented samples with soft labels $\hat{y}\in(0,1)$, since neither indicator $\mathbb{I}(\hat{y}=1)$ nor $\mathbb{I}(\hat{y}=0)$ holds for them. To address this problem, we propose an AUC-mixup loss. The idea is to replace the conditional means in (1) by soft means, i.e.,

\begin{aligned}
\min_{\mathbf{w}\in\mathbb{R}^{d},(a,b)\in\mathbb{R}^{2}}\max_{\alpha\geq 0}\; & \frac{\sum_{i=1}^{\hat{N}}\left(h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})-a\right)^{2}\hat{y}_{i}}{\sum_{i=1}^{\hat{N}}\hat{y}_{i}}+\frac{\sum_{i=1}^{\hat{N}}\left(h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})-b\right)^{2}(1-\hat{y}_{i})}{\sum_{i=1}^{\hat{N}}(1-\hat{y}_{i})} \\
 & +2\alpha\left(m-\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})\hat{y}_{i}}{\sum_{i=1}^{\hat{N}}\hat{y}_{i}}+\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})(1-\hat{y}_{i})}{\sum_{i=1}^{\hat{N}}(1-\hat{y}_{i})}\right)-\alpha^{2},
\end{aligned}
(3)

where $\{(\hat{\mathbf{x}}_{i},\hat{y}_{i})\}_{i=1}^{\hat{N}}$ denotes the mixup-augmented dataset. It is not difficult to show that the optimal $a,b$ are the soft mean scores of positive and negative data, respectively, i.e., $a=\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})\hat{y}_{i}}{\sum_{i=1}^{\hat{N}}\hat{y}_{i}}$, $b=\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})(1-\hat{y}_{i})}{\sum_{i=1}^{\hat{N}}(1-\hat{y}_{i})}$. The AUC-mixup loss can also be easily integrated with compositional DAM.
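To make the formulation concrete, below is a minimal PyTorch sketch of the AUC-mixup loss in (3). The class name `AUCMixupLoss`, the small `eps` guard against a minibatch with no (soft) positives or negatives, and treating $(a, b, \alpha)$ as learnable tensors are our illustrative choices, not the authors' released implementation; in practice $\alpha$ is handled by a primal-dual optimizer such as PESG [9].

```python
import torch
import torch.nn as nn

class AUCMixupLoss(nn.Module):
    """Sketch of the AUC-mixup loss in Eq. (3).

    a and b estimate the soft mean scores of positive/negative data,
    and alpha is the dual variable of the inner maximization; all three
    are updated by the optimizer alongside the network parameters.
    """

    def __init__(self, margin: float = 1.0):
        super().__init__()
        self.margin = margin
        self.a = nn.Parameter(torch.zeros(1))
        self.b = nn.Parameter(torch.zeros(1))
        self.alpha = nn.Parameter(torch.zeros(1))  # kept >= 0 by the optimizer

    def forward(self, scores: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
        # scores: (N,) model outputs h_w(x_hat); soft_labels: (N,) values in [0, 1]
        y = soft_labels.float()
        eps = 1e-8  # guards against an all-positive or all-negative minibatch
        pos_w, neg_w = y.sum() + eps, (1 - y).sum() + eps
        # soft within-class variance terms (first two terms of Eq. (3))
        loss = ((scores - self.a) ** 2 * y).sum() / pos_w \
             + ((scores - self.b) ** 2 * (1 - y)).sum() / neg_w
        # soft mean scores of the positive and negative classes
        mean_pos = (scores * y).sum() / pos_w
        mean_neg = (scores * (1 - y)).sum() / neg_w
        # margin term with the dual variable alpha (maximized over alpha >= 0)
        return loss + 2 * self.alpha * (self.margin - mean_pos + mean_neg) - self.alpha ** 2
```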

3 Experiments

Datasets. We perform extensive experiments to evaluate the proposed AUC-mixup approaches on diverse benchmark and medical image datasets. For the benchmark datasets, we choose Cat&Dog, CIFAR-10, CIFAR-100, and STL-10, and construct binary imbalanced versions of them following Yuan et al. [9]. We adopt a DenseNet121 [2] network pre-trained on ImageNet as the backbone for the benchmark datasets, following Yuan et al. [9]. For the medical image datasets, we choose six MedMNIST [7] datasets in 2D or 3D format, namely PneumoniaMNIST (PneumoniaM), BreastMNIST (BreastM), NoduleMNIST3D (NoduleM), AdrenalMNIST3D (AdrenalM), VesselMNIST3D (VesselM), and SynapseMNIST3D (SynapseM). These datasets are naturally imbalanced. We adopt a ResNet18 [1] network for training on the 2D and 3D datasets (converted with ACS convolutions [6] for the 3D data). We split each dataset into training, validation, and test sets, tune hyperparameters on the validation set, and report the AUC score on the test set (mean and standard deviation over three runs).
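As a pointer for the 3D setup, here is a minimal sketch of wrapping a 2D ResNet18 with ACS convolutions; it assumes the `ACSConverter` interface documented in the acsconv package [6], and the single-logit head and toy input shape are our illustrative choices.

```python
import torch
from torchvision.models import resnet18
from acsconv.converters import ACSConverter  # converts 2D convs to ACS convs [6]

model_2d = resnet18(num_classes=1)   # single score h_w(x) for binary AUC maximization
model_3d = ACSConverter(model_2d)    # wrapped model accepts 5D input (B, C, D, H, W)

# toy 28x28x28 volumes; single-channel MedMNIST data would additionally
# need the first conv layer adjusted or the channel replicated to 3
x = torch.randn(2, 3, 28, 28, 28)
scores = model_3d(x).squeeze(1)      # shape (2,)
```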

Settings. In the experiments, we apply the AUC-mixup strategy to DAM for vanilla training from scratch (AUC-mixup) and for compositional training (CT-mixup). We choose four methods with different losses as baselines: cross-entropy loss (CE), focal loss (Focal), the AUCM loss [9], and the compositional AUC loss (CT-AUC) [8]. For all methods, we tune the learning rate in {0.1, 0.01, 0.001} on all datasets and decrease it by a factor of 10 at 50% and 75% of the total training time. The number of training epochs is 100 for all datasets except BreastMNIST, which is trained for 200 epochs considering its relatively small size. For the focal loss, the parameters $\hat{\alpha},\hat{\lambda}$ are fixed at 1 and 2. For AUCM and CT-AUC, the margin $m$ is set to 1.0 on all datasets. The Adam optimizer [4] is used for optimizing the CE and focal losses, while PESG [9] and PDSCA [8] are used for optimizing the AUCM and CT-AUC losses, respectively, with weight decay of 0.0001 and epoch decay of 0.001. The beta parameters of PDSCA are set to 0.9 and the number of inner gradient steps $k$ is set to 1. We use a batch size of 64 and a DualSampler [10] to guarantee positive data in each minibatch.
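For concreteness, below is a condensed sketch of the training procedure with mixup batches and the stagewise learning-rate decay described above. The `mixup_batch` helper, the plain SGD update standing in for the PESG/PDSCA primal-dual updates (we emulate the dual ascent on $\alpha$ by flipping its gradient), and the dataloader interface are our simplifications; the sketch assumes the `AUCMixupLoss` class from Section 2.

```python
import torch
from torch.distributions import Beta

def mixup_batch(x, y, beta_param: float = 1.0):
    """Eq. (2): mix a batch with a random permutation of itself."""
    lam = Beta(beta_param, beta_param).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)
    return lam * x + (1 - lam) * x[idx], lam * y.float() + (1 - lam) * y[idx].float()

def train(model, loss_fn, loader, epochs=100, lr=0.1):
    # primal variables (w, a, b) are minimized; the dual variable alpha is
    # maximized, which we emulate by negating its gradient before the step
    params = list(model.parameters()) + [loss_fn.a, loss_fn.b, loss_fn.alpha]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[epochs // 2, 3 * epochs // 4], gamma=0.1)  # decay at 50%/75%
    for _ in range(epochs):
        for x, y in loader:
            x_hat, y_hat = mixup_batch(x, y)
            loss = loss_fn(model(x_hat).squeeze(1), y_hat)
            opt.zero_grad()
            loss.backward()
            loss_fn.alpha.grad.neg_()          # ascent step on the dual variable
            opt.step()
            with torch.no_grad():
                loss_fn.alpha.clamp_(min=0.0)  # keep alpha >= 0
        sched.step()
```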

Figure 1: t-SNE visualization of the feature representations learned by different methods on the BreastMNIST training set.
Table 1: Testing AUC (%) on benchmark and medical image datasets (mean ± std over three runs).

Dataset    | CE         | Focal      | AUCM       | AUC-mixup  | CT-AUC     | CT-mixup
Cat&Dog    | 95.57±0.58 | 95.67±0.11 | 94.97±0.85 | 95.74±0.06 | 95.29±0.16 | 95.88±0.09
CIFAR10    | 78.27±5.30 | 81.98±1.64 | 86.56±0.02 | 87.95±0.39 | 87.05±0.11 | 87.96±0.06
STL-10     | 89.94±0.67 | 53.55±0.38 | 95.64±0.51 | 96.84±0.57 | 96.19±0.14 | 96.90±0.03
CIFAR100   | 63.28±1.66 | 67.35±0.39 | 67.29±0.85 | 69.38±0.49 | 68.76±0.96 | 69.25±0.15
PneumoniaM | 94.35±0.46 | 95.39±1.17 | 96.17±0.06 | 96.71±0.05 | 96.38±0.14 | 96.78±0.13
BreastM    | 90.10±1.33 | 90.64±0.37 | 91.15±0.48 | 91.85±1.35 | 89.26±1.10 | 90.02±0.43
NoduleM    | 88.62±1.11 | 89.58±0.67 | 89.52±1.89 | 90.39±1.25 | 90.11±0.47 | 90.84±1.50
AdrenalM   | 85.55±1.04 | 82.32±1.39 | 87.36±0.11 | 87.72±0.37 | 86.28±0.30 | 86.57±0.61
VesselM    | 83.32±1.99 | 79.95±1.26 | 84.67±3.80 | 88.44±2.42 | 83.71±1.18 | 85.33±0.42
SynapseM   | 77.79±4.81 | 76.15±1.47 | 80.15±4.82 | 83.49±7.75 | 72.44±1.02 | 73.84±3.03

Results. We compare the testing AUC scores of all methods in Table 1. We observe that (i) the AUC-mixup strategy (AUC-mixup or CT-mixup) achieves the highest AUC score on every dataset; (ii) it usually yields improvements of varying degrees over the corresponding DAM methods without mixup; and (iii) AUC-mixup is competitive with, if not better than, CT-mixup, which indicates that employing the AUC-mixup loss for training from scratch can eliminate the additional overhead of compositional training without sacrificing prediction performance. We further show the feature representations learned by the DAM methods on the BreastMNIST training data in Figure 1, which illustrates that employing the AUC-mixup loss yields better feature representations.
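For completeness, here is a minimal sketch of how a visualization like Figure 1 can be produced with scikit-learn's t-SNE; extracting penultimate-layer features by stripping the final fully connected layer of a torchvision-style ResNet, and the output file name, are our illustrative choices.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_tsne(model, loader, device="cpu"):
    """Embed penultimate-layer features of a trained model into 2D with t-SNE."""
    backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop final FC
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).flatten(1).cpu().numpy())
        labels.append(y.numpy())
    z = TSNE(n_components=2).fit_transform(np.concatenate(feats))
    plt.scatter(z[:, 0], z[:, 1], c=np.concatenate(labels), s=5)
    plt.savefig("tsne_breastmnist.png")
```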

References

  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang et al. [2017] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Irvin et al. [2019] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Rotemberg et al. [2021] V. Rotemberg, N. Kurtansky, B. Betz-Stablein, L. Caffery, E. Chousakos, N. Codella, M. Combalia, S. Dusza, P. Guitera, D. Gutman, et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Scientific Data, 8:34, 2021.
  • Yang et al. [2021] J. Yang, X. Huang, Y. He, J. Xu, C. Yang, G. Xu, and B. Ni. Reinventing 2d convolutions for 3d images. IEEE Journal of Biomedical and Health Informatics, 25(8):3009–3018, 2021.
  • Yang et al. [2023] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, and B. Ni. MedMNIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
  • Yuan et al. [2021a] Z. Yuan, Z. Guo, N. Chawla, and T. Yang. Compositional training for end-to-end deep AUC maximization. In International Conference on Learning Representations, 2021a.
  • Yuan et al. [2021b] Z. Yuan, Y. Yan, M. Sonka, and T. Yang. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3040–3049, 2021b.
  • Yuan et al. [2023] Z. Yuan, D. Zhu, Z.-H. Qiu, G. Li, X. Wang, and T. Yang. LibAUC: A deep learning library for x-risk optimization. arXiv preprint arXiv:2306.03065, 2023.
  • Zhang et al. [2017] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.