
AUC-mixup: Deep AUC Maximization with Mixup

Jianzhi Xv
Shandong University
[email protected]
Gang Li
Texas A&M University
[email protected]
Tianbao Yang
Texas A&M University
[email protected]
Abstract

While deep AUC maximization (DAM) has shown remarkable success on imbalanced medical tasks, e.g., chest X-ray classification and skin lesion classification, it can suffer from severe overfitting when applied to small datasets due to its aggressive nature of pushing the prediction scores of positive data away from those of negative data. This paper studies how to improve the generalization of DAM with mixup data augmentation, an approach widely used to improve the generalization of deep learning methods based on the cross-entropy loss. However, AUC is defined over pairs of positive and negative examples, which makes it challenging to incorporate mixup data augmentation into DAM algorithms. To tackle this challenge, we employ the AUC margin loss and incorporate soft labels into its formulation to effectively learn from data generated by mixup augmentation, which we refer to as the AUC-mixup loss. Our experimental results demonstrate the effectiveness of the proposed AUC-mixup methods on imbalanced benchmark and medical image datasets compared to standard DAM training methods.

1 Introduction

In recent years, deep AUC maximization (DAM), which develops deep learning models that directly optimize the area under the receiver operating characteristic curve (AUC), has gained growing importance. Different DAM methods, such as optimizing the AUC margin (AUCM) loss [9] and compositional DAM [8], have been successfully applied to medical image classification to improve AUC performance. These methods demonstrate superior performance on large-scale medical image classification tasks such as CheXpert [3] and Melanoma [5] compared to optimizing traditional loss functions such as the cross-entropy (CE) loss and the focal loss.

Although DAM methods work well on large datasets, they remain vulnerable to overfitting when trained on small imbalanced datasets. DAM losses typically place more emphasis on the minority positive class, and as the overall data volume decreases, samples from this class become scarce, which leads to overfitting. In this context, mixup augmentation [11] offers an effective remedy: it introduces soft labels into training and generates many more examples as convex combinations of existing samples. With mixup augmentation, the focus of DAM shifts from the minority class alone to combinations of samples from different classes, which mitigates the overfitting issue.

However, existing DAM losses are defined with hard labels and are thus incompatible with mixup augmentation. Unlike traditional loss functions that are defined over individual examples, DAM losses are non-decomposable, making the incorporation of soft labels more complicated. To address this problem, we propose an AUC-mixup loss that replaces the conditional means in the min-max AUC margin loss with soft means computed from soft labels. Our goal is to use the AUC-mixup loss to improve DAM and compositional DAM methods on medical tasks with small amounts of data by incorporating mixup augmentation. We validate our method on imbalanced benchmark and medical image datasets, including several 3D datasets, and demonstrate superior generalization performance over two standard DAM baselines.

2 Method

Let $\mathbb{I}(\cdot)$ denote the indicator function of a predicate, let $S=\{(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{N},y_{N})\}$ denote a set of $N$ training examples, where $\mathbf{x}_{i}$ is an input (e.g., a 2D or 3D image) and $y_{i}\in\{0,1\}$ is its corresponding label. Let $\mathbf{w}\in\mathbb{R}^{d}$ denote the parameters of the deep neural network and let $h_{\mathbf{w}}(\mathbf{x})=h(\mathbf{w},\mathbf{x})$ denote the prediction of the network on input $\mathbf{x}$. Yuan et al. [9] proposed DAM by minimizing the AUC margin loss, which is equivalent to the following min-max optimization:

\begin{aligned}
\min_{\mathbf{w}\in\mathbb{R}^{d},(a,b)\in\mathbb{R}^{2}}\max_{\alpha\geq 0}\; & \frac{\sum_{i=1}^{N}\left(h_{\mathbf{w}}(\mathbf{x}_{i})-a\right)^{2}\mathbb{I}(y_{i}=1)}{N_{+}}+\frac{\sum_{i=1}^{N}\left(h_{\mathbf{w}}(\mathbf{x}_{i})-b\right)^{2}\mathbb{I}(y_{i}=0)}{N_{-}} \\
 & +2\alpha\left(m-\frac{\sum_{i=1}^{N}h_{\mathbf{w}}(\mathbf{x}_{i})\mathbb{I}(y_{i}=1)}{N_{+}}+\frac{\sum_{i=1}^{N}h_{\mathbf{w}}(\mathbf{x}_{i})\mathbb{I}(y_{i}=0)}{N_{-}}\right)-\alpha^{2},
\end{aligned}
(1)

where $m$ is a hyperparameter controlling the desired margin between the optimal $a$ (i.e., the mean score of positive data) and the optimal $b$ (i.e., the mean score of negative data), $N_{+}$ is the number of positive samples, and $N_{-}$ is the number of negative samples. The AUC margin loss has been shown to be more robust than the AUC square loss [9]. An improvement over minimizing the AUC margin loss from scratch is compositional DAM, which minimizes a compositional objective, where the outer function corresponds to the AUC margin loss and the inner function represents a gradient descent step for minimizing a cross-entropy (CE) loss. However, the mixup technique cannot be directly applied to either method. In mixup, an augmented example $\hat{\mathbf{x}}$ and its corresponding soft label $\hat{y}$ are generated from two randomly sampled training examples and their labels:

\hat{\mathbf{x}}=\lambda\mathbf{x}_{i}+(1-\lambda)\mathbf{x}_{j},\qquad \hat{y}=\lambda y_{i}+(1-\lambda)y_{j}, \quad (2)

where $\hat{y}\in[0,1]$ and $\lambda\sim\mathrm{Beta}(\alpha,\alpha)$ with $\alpha\in(0,\infty)$. Directly using the AUC margin loss would ignore all augmented samples with soft labels $\hat{y}\in(0,1)$, since neither indicator $\mathbb{I}(\hat{y}=1)$ nor $\mathbb{I}(\hat{y}=0)$ holds for them. To address this problem, we propose an AUC-mixup loss. The idea is to replace the conditional means in (1) by soft means, i.e.,

\begin{aligned}
\min_{\mathbf{w}\in\mathbb{R}^{d},(a,b)\in\mathbb{R}^{2}}\max_{\alpha\geq 0}\; & \frac{\sum_{i=1}^{\hat{N}}\left(h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})-a\right)^{2}\hat{y}_{i}}{\sum_{i=1}^{\hat{N}}\hat{y}_{i}}+\frac{\sum_{i=1}^{\hat{N}}\left(h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})-b\right)^{2}(1-\hat{y}_{i})}{\sum_{i=1}^{\hat{N}}(1-\hat{y}_{i})} \\
 & +2\alpha\left(m-\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})\hat{y}_{i}}{\sum_{i=1}^{\hat{N}}\hat{y}_{i}}+\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})(1-\hat{y}_{i})}{\sum_{i=1}^{\hat{N}}(1-\hat{y}_{i})}\right)-\alpha^{2},
\end{aligned}
(3)

where $\{(\hat{\mathbf{x}}_{i},\hat{y}_{i})\}_{i=1}^{\hat{N}}$ denotes the mixup-augmented dataset. It is not difficult to show that the optimal $a,b$ are the soft mean scores of positive and negative data, respectively, i.e., $a=\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})\hat{y}_{i}}{\sum_{i=1}^{\hat{N}}\hat{y}_{i}}$, $b=\frac{\sum_{i=1}^{\hat{N}}h_{\mathbf{w}}(\hat{\mathbf{x}}_{i})(1-\hat{y}_{i})}{\sum_{i=1}^{\hat{N}}(1-\hat{y}_{i})}$. The AUC-mixup loss can also be easily integrated with compositional DAM.
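To make the formulation concrete, below is a minimal PyTorch sketch of the AUC-mixup loss in (3). The class name `AUCMixupLoss`, the small `eps` guard against a minibatch with no (soft) positives or negatives, and treating $(a, b, \alpha)$ as learnable tensors are our illustrative choices, not the authors' released implementation; in practice $\alpha$ is handled by a primal-dual optimizer such as PESG [9].

```python
import torch
import torch.nn as nn

class AUCMixupLoss(nn.Module):
    """Sketch of the AUC-mixup loss in Eq. (3).

    a and b estimate the soft mean scores of positive/negative data,
    and alpha is the dual variable of the inner maximization; all three
    are updated by the optimizer alongside the network parameters.
    """

    def __init__(self, margin: float = 1.0):
        super().__init__()
        self.margin = margin
        self.a = nn.Parameter(torch.zeros(1))
        self.b = nn.Parameter(torch.zeros(1))
        self.alpha = nn.Parameter(torch.zeros(1))  # kept >= 0 by the optimizer

    def forward(self, scores: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
        # scores: (N,) model outputs h_w(x_hat); soft_labels: (N,) values in [0, 1]
        y = soft_labels.float()
        eps = 1e-8  # guards against an all-positive or all-negative minibatch
        pos_w, neg_w = y.sum() + eps, (1 - y).sum() + eps
        # soft within-class variance terms (first two terms of Eq. (3))
        loss = ((scores - self.a) ** 2 * y).sum() / pos_w \
             + ((scores - self.b) ** 2 * (1 - y)).sum() / neg_w
        # soft mean scores of the positive and negative classes
        mean_pos = (scores * y).sum() / pos_w
        mean_neg = (scores * (1 - y)).sum() / neg_w
        # margin term with the dual variable alpha (maximized over alpha >= 0)
        return loss + 2 * self.alpha * (self.margin - mean_pos + mean_neg) - self.alpha ** 2
```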

3 Experiments

Datasets. We perform extensive experiments to evaluate the proposed AUC-mixup approaches on diverse benchmark and medical image datasets. For the benchmark datasets, we choose Cat&Dog, CIFAR-10, CIFAR-100, and STL-10, and construct binary imbalanced versions of them following Yuan et al. [9]. We adopt a DenseNet121 [2] network pre-trained on ImageNet as the backbone for the benchmark datasets, following Yuan et al. [9]. For the medical image datasets, we choose six MedMNIST [7] datasets in 2D or 3D format, namely PneumoniaMNIST (PneumoniaM), BreastMNIST (BreastM), NoduleMNIST3D (NoduleM), AdrenalMNIST3D (AdrenalM), VesselMNIST3D (VesselM), and SynapseMNIST3D (SynapseM). These datasets are naturally imbalanced. We adopt a ResNet18 [1] network for training on the 2D and 3D datasets (converted with ACS convolutions [6] for the 3D data). We split each dataset into training, validation, and test sets, tune hyperparameters on the validation set, and report the AUC score on the test set (mean and standard deviation over three runs).
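As a pointer for the 3D setup, here is a minimal sketch of wrapping a 2D ResNet18 with ACS convolutions; it assumes the `ACSConverter` interface documented in the acsconv package [6], and the single-logit head and toy input shape are our illustrative choices.

```python
import torch
from torchvision.models import resnet18
from acsconv.converters import ACSConverter  # converts 2D convs to ACS convs [6]

model_2d = resnet18(num_classes=1)   # single score h_w(x) for binary AUC maximization
model_3d = ACSConverter(model_2d)    # wrapped model accepts 5D input (B, C, D, H, W)

# toy 28x28x28 volumes; single-channel MedMNIST data would additionally
# need the first conv layer adjusted or the channel replicated to 3
x = torch.randn(2, 3, 28, 28, 28)
scores = model_3d(x).squeeze(1)      # shape (2,)
```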

Settings. In the experiments, we apply the AUC-mixup strategy to DAM for vanilla training from scratch (AUC-mixup) and for compositional training (CT-mixup). We choose four methods with different losses as baselines: cross-entropy loss (CE), focal loss (Focal), the AUCM loss [9], and the compositional AUC loss (CT-AUC) [8]. For all methods, we tune the learning rate in {0.1, 0.01, 0.001} on all datasets and decrease it by a factor of 10 at 50% and 75% of the total training time. The number of training epochs is 100 for all datasets except BreastMNIST, which is trained for 200 epochs considering its relatively small size. For the focal loss, the parameters $\hat{\alpha},\hat{\lambda}$ are fixed at 1 and 2. For AUCM and CT-AUC, the margin $m$ is set to 1.0 on all datasets. The Adam optimizer [4] is used for optimizing the CE and focal losses, while PESG [9] and PDSCA [8] are used for optimizing the AUCM and CT-AUC losses, respectively, with weight decay of 0.0001 and epoch decay of 0.001. The beta parameters of PDSCA are set to 0.9 and the number of inner gradient steps $k$ is set to 1. We use a batch size of 64 and a DualSampler [10] to guarantee positive data in each minibatch.
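For concreteness, below is a condensed sketch of the training procedure with mixup batches and the stagewise learning-rate decay described above. The `mixup_batch` helper, the plain SGD update standing in for the PESG/PDSCA primal-dual updates (we emulate the dual ascent on $\alpha$ by flipping its gradient), and the dataloader interface are our simplifications; the sketch assumes the `AUCMixupLoss` class from Section 2.

```python
import torch
from torch.distributions import Beta

def mixup_batch(x, y, beta_param: float = 1.0):
    """Eq. (2): mix a batch with a random permutation of itself."""
    lam = Beta(beta_param, beta_param).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)
    return lam * x + (1 - lam) * x[idx], lam * y.float() + (1 - lam) * y[idx].float()

def train(model, loss_fn, loader, epochs=100, lr=0.1):
    # primal variables (w, a, b) are minimized; the dual variable alpha is
    # maximized, which we emulate by negating its gradient before the step
    params = list(model.parameters()) + [loss_fn.a, loss_fn.b, loss_fn.alpha]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[epochs // 2, 3 * epochs // 4], gamma=0.1)  # decay at 50%/75%
    for _ in range(epochs):
        for x, y in loader:
            x_hat, y_hat = mixup_batch(x, y)
            loss = loss_fn(model(x_hat).squeeze(1), y_hat)
            opt.zero_grad()
            loss.backward()
            loss_fn.alpha.grad.neg_()          # ascent step on the dual variable
            opt.step()
            with torch.no_grad():
                loss_fn.alpha.clamp_(min=0.0)  # keep alpha >= 0
        sched.step()
```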

Figure 1: t-SNE visualization of the feature representations learned by different methods on the BreastMNIST training set.
Table 1: Testing AUC (%) on benchmark and medical image datasets (mean ± std over three runs).

Dataset    | CE         | Focal      | AUCM       | AUC-mixup  | CT-AUC     | CT-mixup
Cat&Dog    | 95.57±0.58 | 95.67±0.11 | 94.97±0.85 | 95.74±0.06 | 95.29±0.16 | 95.88±0.09
CIFAR10    | 78.27±5.30 | 81.98±1.64 | 86.56±0.02 | 87.95±0.39 | 87.05±0.11 | 87.96±0.06
STL-10     | 89.94±0.67 | 53.55±0.38 | 95.64±0.51 | 96.84±0.57 | 96.19±0.14 | 96.90±0.03
CIFAR100   | 63.28±1.66 | 67.35±0.39 | 67.29±0.85 | 69.38±0.49 | 68.76±0.96 | 69.25±0.15
PneumoniaM | 94.35±0.46 | 95.39±1.17 | 96.17±0.06 | 96.71±0.05 | 96.38±0.14 | 96.78±0.13
BreastM    | 90.10±1.33 | 90.64±0.37 | 91.15±0.48 | 91.85±1.35 | 89.26±1.10 | 90.02±0.43
NoduleM    | 88.62±1.11 | 89.58±0.67 | 89.52±1.89 | 90.39±1.25 | 90.11±0.47 | 90.84±1.50
AdrenalM   | 85.55±1.04 | 82.32±1.39 | 87.36±0.11 | 87.72±0.37 | 86.28±0.30 | 86.57±0.61
VesselM    | 83.32±1.99 | 79.95±1.26 | 84.67±3.80 | 88.44±2.42 | 83.71±1.18 | 85.33±0.42
SynapseM   | 77.79±4.81 | 76.15±1.47 | 80.15±4.82 | 83.49±7.75 | 72.44±1.02 | 73.84±3.03

Results. We compare the testing AUC scores of all methods in Table 1. We observe that (i) the AUC-mixup strategy (AUC-mixup or CT-mixup) achieves the highest AUC score on every dataset; (ii) it usually yields improvements of varying degrees over the corresponding DAM methods without mixup; and (iii) AUC-mixup is competitive with, if not better than, CT-mixup, which indicates that employing the AUC-mixup loss for training from scratch can eliminate the additional overhead of compositional training without sacrificing prediction performance. We further show the feature representations learned by the DAM methods on the BreastMNIST training data in Figure 1, which illustrates that employing the AUC-mixup loss yields better feature representations.
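For completeness, here is a minimal sketch of how a visualization like Figure 1 can be produced with scikit-learn's t-SNE; extracting penultimate-layer features by stripping the final fully connected layer of a torchvision-style ResNet, and the output file name, are our illustrative choices.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_tsne(model, loader, device="cpu"):
    """Embed penultimate-layer features of a trained model into 2D with t-SNE."""
    backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop final FC
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).flatten(1).cpu().numpy())
        labels.append(y.numpy())
    z = TSNE(n_components=2).fit_transform(np.concatenate(feats))
    plt.scatter(z[:, 0], z[:, 1], c=np.concatenate(labels), s=5)
    plt.savefig("tsne_breastmnist.png")
```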

References

  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang et al. [2017] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Irvin et al. [2019] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Rotemberg et al. [2021] V. Rotemberg, N. Kurtansky, B. Betz-Stablein, L. Caffery, E. Chousakos, N. Codella, M. Combalia, S. Dusza, P. Guitera, D. Gutman, et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Scientific Data, 8:34, 2021.
  • Yang et al. [2021] J. Yang, X. Huang, Y. He, J. Xu, C. Yang, G. Xu, and B. Ni. Reinventing 2d convolutions for 3d images. IEEE Journal of Biomedical and Health Informatics, 25(8):3009–3018, 2021.
  • Yang et al. [2023] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, and B. Ni. MedMNIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
  • Yuan et al. [2021a] Z. Yuan, Z. Guo, N. Chawla, and T. Yang. Compositional training for end-to-end deep AUC maximization. In International Conference on Learning Representations, 2021a.
  • Yuan et al. [2021b] Z. Yuan, Y. Yan, M. Sonka, and T. Yang. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3040–3049, 2021b.
  • Yuan et al. [2023] Z. Yuan, D. Zhu, Z.-H. Qiu, G. Li, X. Wang, and T. Yang. LibAUC: A deep learning library for x-risk optimization. arXiv preprint arXiv:2306.03065, 2023.
  • Zhang et al. [2017] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.