Large-scale Optimization of Partial AUC in a Range of False Positive Rates
Abstract
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs has been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allows us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we use an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We establish a complexity of Õ(1/ε^6) for finding a nearly ε-critical solution. Finally, we numerically demonstrate the effectiveness of our proposed algorithms in training both linear models and deep neural networks for partial AUC maximization and sum of ranked range loss minimization.
1 Introduction
The area under the receiver operating characteristic (ROC) curve (AUC) is one of the most widely used performance measures for classifiers in machine learning, especially when the data is imbalanced between the classes (Hanley and McNeil, 1982; Bradley, 1997). Typically, the classifier produces a score for each data point. Then a data point is classified as positive if its score is above a chosen threshold; otherwise, it is classified as negative. Varying the threshold will change the true positive rate (TPR) and the false positive rate (FPR) of the classifier. The ROC curve shows the TPR as a function of the FPR that corresponds to the same threshold. Hence, maximizing the AUC of a classifier is essentially maximizing the classifier’s average TPR over all FPRs from zero to one. However, in some applications, certain FPR regions have no practical relevance, and neither do the TPRs over those regions. For example, in clinical practice, a high FPR in diagnostic tests often results in a high monetary cost, so people may only need to maximize the TPR when the FPR is low (Dodd and Pepe, 2003; Ma et al., 2013; Yang et al., 2019). Moreover, since two models with the same AUC can still have different ROC curves, the AUC does not always reflect the true performance of a model that is needed in a particular production environment (Bradley, 2014).
As a generalization of the AUC, the partial AUC (pAUC) only measures the area under the ROC curve that is restricted between two FPRs. A probabilistic interpretation of the pAUC can be found in Dodd and Pepe (2003). In contrast to the AUC, the pAUC represents the average TPR only over a relevant range of FPRs and provides a performance measure that is more aligned with the practical needs in some applications.
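To make the definition concrete, the following NumPy sketch (our own illustration, not code from the paper; the function name and synthetic scores are assumptions) computes the empirical pAUC over an FPR range [α, β] by averaging the pairwise correct-ordering indicator only over the negatives whose rank (by descending score) places them in that range:

```python
import numpy as np

def empirical_pauc(pos_scores, neg_scores, alpha, beta):
    """Empirical partial AUC over the FPR range [alpha, beta]: only the
    negatives ranked from n1+1 to n2 (descending score) contribute, where
    n1 = alpha * n_neg and n2 = beta * n_neg (assumed to be integers)."""
    neg = np.sort(neg_scores)[::-1]          # descending: highest-scored negatives first
    n_neg = len(neg)
    n1, n2 = int(alpha * n_neg), int(beta * n_neg)
    hard_neg = neg[n1:n2]                    # negatives falling in the FPR range
    # fraction of (positive, in-range negative) pairs that are correctly ordered
    return (pos_scores[:, None] > hard_neg[None, :]).mean()

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=200)   # scores of positive examples
neg = rng.normal(0.0, 1.0, size=1000)  # scores of negative examples
print(empirical_pauc(pos, neg, 0.0, 1.0))   # alpha=0, beta=1 recovers the full AUC
print(empirical_pauc(pos, neg, 0.05, 0.5))  # pAUC over FPR in [0.05, 0.5]
```

With α = 0 and β = 1 every negative contributes, so the quantity reduces to the usual empirical AUC.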
In the literature, the existing algorithms for training a classifier by maximizing the pAUC include the boosting method (Komori and Eguchi, 2010) and the cutting plane algorithm (Narasimhan and Agarwal, 2013b; Narasimhan and Agarwal, 2013a, 2017). However, the former has no theoretical guarantee, and the latter applies only to linear models. More importantly, both methods require processing all the data in each iteration and thus become computationally inefficient for large datasets.
In this paper, we propose an approximate gradient method for maximizing the pAUC that works for nonlinear models (e.g., deep neural networks) and only needs to process randomly sampled positive and negative data points of any size in each iteration. In particular, we formulate the maximization of the pAUC as a non-smooth difference-of-convex (DC) program (Tao and An, 1997; Le Thi and Dinh, 2018). Due to non-smoothness, most existing DC optimization algorithms cannot be applied to our formulation. Motivated by Sun and Sun (2021), we approximate the two non-smooth convex components in the DC program by their Moreau envelopes and obtain a smooth approximation of the problem, which is solved using the gradient descent method. Since the gradient of the smooth problem cannot be calculated explicitly, we approximate the gradient by solving the two proximal-point subproblems defined by each convex component using the stochastic block coordinate descent (SBCD) method. Our method, besides its low per-iteration cost, has a rigorous theoretical guarantee, unlike the existing methods. In fact, we show that our method finds a nearly ε-critical point of the pAUC optimization problem in Õ(1/ε^6) iterations with only small samples of positive and negative data points processed per iteration. (Throughout the paper, Õ(·) suppresses all logarithmic factors.) This is the main contribution of this paper.
Note that, for non-convex non-smooth optimization, the existing stochastic methods (Davis and Grimmer, 2019; Davis and Drusvyatskiy, 2018) find a nearly ε-critical point in O(1/ε^4) iterations under a weak convexity assumption. Our method needs Õ(1/ε^6) iterations because our problem is a DC problem with both convex components non-smooth, which is much more challenging than a weakly convex minimization problem. In addition, our iteration number matches the best known iteration complexity for non-smooth non-convex min-max optimization (Rafique et al., 2021; Liu et al., 2021) and non-smooth non-convex constrained optimization (Ma et al., 2020).
In addition to pAUC optimization, our method can also be used to minimize the sum of ranked range (SoRR) loss, which can be viewed as a special case of pAUC optimization. Many machine learning models are trained by minimizing an objective function, which is defined as the sum of losses over all training samples (Vapnik, 1992). Since the sum of losses weights all samples equally, it is insensitive to samples from minority groups. Hence, the sum of top-k losses (Shalev-Shwartz and Wexler, 2016; Fan et al., 2017) is often used as an alternative objective function because it provides the model with robustness to non-typical instances. However, the sum of top-k losses can be very sensitive to outliers, especially when k is small. To address this issue, Hu et al. (2020) proposed the SoRR loss as a new learning objective, which is defined as the sum of a consecutive sequence of losses from any range after the losses are sorted. Compared to the sum of all losses and the sum of top-k losses, the SoRR loss maintains a model’s robustness to minority groups while also reducing the model’s sensitivity to outliers. See Fig. 1 in Hu et al. (2020) for an illustration of the benefit of using the SoRR loss over other ways of aggregating individual losses.
To minimize the SoRR loss, Hu et al. (2020) applied a difference-of-convex algorithm (DCA) (Tao and An, 1997; An and Tao, 2005), which linearizes the second convex component and solves the resulting subproblem using the stochastic subgradient method. DCA has been well studied in the literature, but when both components are non-smooth, as in our problem, only asymptotic convergence results are available. To establish the total number of iterations needed to find an ε-critical point in a non-asymptotic sense, most existing studies had to assume that at least one of the components is differentiable, which is not the case in this paper. Using the approximate gradient method presented in this paper, one can find a nearly ε-critical point of the SoRR loss optimization problem in Õ(1/ε^6) iterations.
2 Related Works
The pAUC has been studied for decades (McClish, 1989; Thompson and Zucchini, 1989; Jiang et al., 1996; Yang and Ying, 2022). However, most studies focused on its estimation (Dodd and Pepe, 2003) and its application as a performance measure, while only a few studies were devoted to numerical algorithms for optimizing the pAUC. Efficient optimization methods have been developed for maximizing the AUC and the multiclass AUC by Ying et al. (2016) and Yang et al. (2021a), but they cannot be applied to the pAUC. Besides the boosting method (Komori and Eguchi, 2010) and the cutting plane algorithm (Narasimhan and Agarwal, 2013b; Narasimhan and Agarwal, 2013a, 2017) mentioned in the previous section, Ueda and Fujino (2018); Yang et al. (2022, 2021b); Zhu et al. (2022) developed surrogate optimization techniques that directly maximize a smooth approximation of the pAUC or the two-way pAUC (Yang et al., 2019). However, their approaches can only be applied when the FPR range starts from exactly zero, whereas our method allows the FPR range to start from any value between zero and one. Wang and Chang (2011) and Ricamato and Tortorella (2011) developed algorithms that use the pAUC as a criterion for creating a linear combination of multiple existing classifiers, while we consider directly training a classifier using the pAUC.
DC optimization has been studied since the 1950s (Alexandroff, 1950; Hartman, 1959). We refer interested readers to Tuy (1995); Tao and An (1998, 1997); Pang et al. (2017); Le Thi and Dinh (2018), and the references therein. The actively studied numerical methods for solving a DC program include DCA (Tao and An, 1998, 1997; An and Tao, 2005; Souza et al., 2016), which is also known as the concave-convex procedure (Yuille and Rangarajan, 2003; Sriperumbudur and Lanckriet, 2009; Lipp and Boyd, 2016), the proximal DCA (Sun et al., 2003; Moudafi and Maingé, 2006; Moudafi, 2008; An and Nam, 2017), and direct gradient methods (Khamaru and Wainwright, 2018). However, when the two convex components are both non-smooth, the existing methods have only asymptotic convergence results, except for the method of Abbaszadehpeivasti et al. (2021), which considers a stopping criterion different from ours. When at least one component is smooth, non-asymptotic convergence rates have been established with and without the Kurdyka-Łojasiewicz (KL) condition (Souza et al., 2016; Artacho et al., 2018; Wen et al., 2018; An and Nam, 2017; Khamaru and Wainwright, 2018).
The algorithms mentioned above are deterministic and require processing the entire dataset per iteration. Stochastic algorithms that process only a small data sample per iteration have been studied (Mairal, 2013; Nitanda and Suzuki, 2017; Le Thi et al., 2017; Deng and Lan, 2020; He et al., 2021). However, they all assumed smoothness on at least one of the two convex components in the DC program. The stochastic methods of Xu et al. (2019); Thi et al. (2021); An et al. (2019) can be applied when both components are non-smooth but their methods require an unbiased stochastic estimation of the gradient and/or value of the two components, which is not available in the DC formulation of the pAUC maximization problem in this paper.
The technique most related to our work is the smoothing method based on the Moreau envelope (Ellaia, 1984; Gabay, 1982; Hiriart-Urruty, 1985, 1991; Sun and Sun, 2021; Moudafi, 2021). Our work is motivated by Sun and Sun (2021); Moudafi (2021), but the important difference is that they studied deterministic methods and assumed either that one function is smooth or that the proximal-point subproblems can be solved exactly, which we do not assume. On the other hand, Sun and Sun (2021); Moudafi (2021) consider a more general problem and study the fundamental properties of the smoothed function, such as its Lipschitz smoothness and how its stationary points correspond to those of the original problem. We mainly focus on partial AUC optimization, which has a special structure we can utilize when solving the proximal-point subproblems. Additionally, Sun and Sun (2021) developed an algorithm for the case with linear equality constraints, which we do not consider in this paper.
3 Preliminary
We consider a classical binary classification problem, where the goal is to build a predictive model that predicts a binary label based on a feature vector. The predictive model, parameterized by a vector of weights, produces a score for each feature vector. A feature vector is then classified as positive if its score is above a chosen threshold and classified as negative otherwise.
Given the sets of feature vectors of the positive and negative training data, the problem of learning the model by maximizing its empirical AUC on the training data can be formulated as
(1)
where the indicator function equals one if the inequality inside the parentheses holds and zero otherwise. As discussed in the introduction, the pAUC can be a better performance measure than the AUC. Consider two FPRs α and β with α < β. For simplicity of exposition, we assume n₁ = n₋α and n₂ = n₋β are both integers, where n₋ is the number of negative training examples. The problem of maximizing the empirical pAUC with FPR between α and β can be formulated as
(2)
where [j] denotes the index of the jth largest coordinate of the score vector, with ties broken arbitrarily. Note that the constant factor in (2) is a normalizer that makes the objective value lie between zero and one. Solving (2) is challenging due to the discontinuity of the indicator function. Let ℓ be a differentiable non-increasing loss function. Problem (2) can then be approximated by the loss minimization problem
(3)
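As a concrete illustration of the surrogate objective (3) (our own sketch; the paper's generic non-increasing loss ℓ is instantiated here as the logistic loss, and the function names are assumptions), the following code ranks, for each positive example, its pairwise losses over the negatives and averages only those ranked from n₁+1 to n₂:

```python
import numpy as np

def logistic_loss(t):
    """A differentiable non-increasing surrogate for the 0-1 indicator."""
    return np.log1p(np.exp(-t))

def pauc_surrogate(pos_scores, neg_scores, alpha, beta):
    """Surrogate pAUC objective: for each positive, average the pairwise
    losses whose per-positive rank (descending) falls between n1+1 and n2."""
    n_neg = len(neg_scores)
    n1, n2 = int(alpha * n_neg), int(beta * n_neg)
    # losses[i, j] = loss on the pair (positive i, negative j)
    losses = logistic_loss(pos_scores[:, None] - neg_scores[None, :])
    ranked = np.sort(losses, axis=1)[:, ::-1]   # per-positive, descending
    return ranked[:, n1:n2].mean()

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=50)
neg = rng.normal(0.0, 1.0, size=200)
print(pauc_surrogate(pos, neg, 0.05, 0.5))
```

Since ℓ is non-increasing, ranking pairs by loss is equivalent to ranking negatives by score, which is what connects this surrogate to the pAUC in (2).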
To facilitate the discussion, we first introduce a few notations. Given an n-dimensional vector s and an integer k with 0 ≤ k ≤ n, the sum of the top-k values in s is
(4)
where [j] denotes the index of the jth largest coordinate in s, with ties broken arbitrarily. For integers m and k with 0 ≤ m < k ≤ n, the difference of the two top sums is the sum from the (m+1)th to the kth (inclusive) largest coordinates of s, also called a sum of ranked range (SoRR). In addition, we define vectors
for , where for and . Since is non-increasing, the th largest coordinate of is . As a result, we have, for ,
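The quantities above can be sketched in a few lines (the helper names are our own); the key point is that the SoRR is a difference of two convex top-k sums, which is what makes the overall objective a DC function:

```python
import numpy as np

def topk_sum(s, k):
    """Sum of the k largest coordinates of s (a convex function of s)."""
    return float(np.sort(s)[::-1][:k].sum())

def sorr(s, m, k):
    """Sum of ranked range: the (m+1)-th through k-th largest coordinates,
    written as the difference of two top-k sums."""
    return topk_sum(s, k) - topk_sum(s, m)

s = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(topk_sum(s, 2))   # 5 + 4 = 9
print(sorr(s, 1, 3))    # 4 + 3 = 7 (skips the largest value, 5)
```

Each top-k sum is convex because it is a maximum of linear functions (sums over all k-subsets of coordinates), so the SoRR is convex minus convex, i.e., DC.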
Hence, after dropping the normalizer, (3) can be equivalently written as
(5)
where
(6)
Next, we introduce an interesting special case of (5), namely, the problem of minimizing the SoRR loss. We still consider a supervised learning problem, but the target does not need to be binary. We want to predict the target based on a feature vector using the model. With a slight abuse of notation, we measure the discrepancy between the prediction and the target by a loss function. We consider learning the model’s parameters from a training set of feature-target pairs by minimizing the SoRR loss. More specifically, we define the vector
where . Recall (4). For any integers and with , the problem of minimizing the SoRR loss with a range from to is formulated as , which is an instance of (5) with
(7)
If we view the two components only as vector-valued functions of the model parameters, ignoring how they are formulated using the data, then (7) is a special case of (6).
4 Nearly Critical Point and Moreau Envelope Smoothing
We first develop a stochastic algorithm for (5) with defined in (6). To do so, we make the following assumptions, which are satisfied by many smooth ’s and ’s.
Assumption 1
(a) is smooth and there exists such that for any , and . (b) There exists such that for any , and . (c) . (In this paper, ‖·‖ represents the Euclidean norm.)
Given , the subdifferential of is
where each element in is called a subgradient of at . We say a function is ρ-weakly convex for some ρ > 0 if adding (ρ/2)‖·‖² to it yields a convex function, and we say it is ρ-strongly convex for some ρ > 0 if subtracting (ρ/2)‖·‖² from it still leaves a convex function. It is known that, if a function is ρ-weakly convex, then adding the proximal term (1/(2μ))‖· − w‖² makes it a (1/μ − ρ)-strongly convex function when μ < 1/ρ.
Under Assumption 1, is a composite of the closed convex function and the smooth map . According to Lemma 4.2 in Drusvyatskiy and Paquette (2019), we have the following lemma.
To solve (5) numerically, we need to overcome the following challenges. (i) is non-convex even if each is convex. In fact, is a DC function because, by Lemma 1, we can represent as the difference of the convex functions and with . (ii) is non-smooth due to so that finding an approximate critical point (defined below) of is difficult. (iii) Computing the exact subgradient of for requires processing data pairs, which is computationally expensive for a large data set.
Because of challenges (i) and (ii), we have to consider a reasonable goal when solving (5). We say is a critical point of if . Given ε > 0, we say is an ε-critical point of if there exists such that . A critical point can only be achieved asymptotically in general. (A stronger notion than criticality is (directional) stationarity, which can also be achieved asymptotically (Pang et al., 2017).) Within finitely many iterations, no algorithm can find an ε-critical point unless at least one of and is smooth, e.g., Xu et al. (2019). Since and are both non-smooth, we have to consider a weaker but achievable target: a nearly ε-critical point, defined below.
Definition 2
Given ε > 0, we say is a nearly ε-critical point of if there exist , , and such that and .
Definition 2 reduces to the ε-stationary point defined by Sun and Sun (2021); Moudafi (2021) when equals or . However, obtaining their ε-stationary point requires exactly solving the proximal mapping of or , while finding a nearly ε-critical point requires only solving the proximal mapping inexactly. When is generated by a stochastic algorithm, we also call a nearly ε-critical point if it satisfies Definition 2 with each quantity replaced by its expectation.
Motivated by Sun and Sun (2021) and Moudafi (2021), we approximate the non-smooth objective by a smooth function using Moreau envelopes. Given a proper, ρ-weakly convex and closed function, its Moreau envelope with the smoothing parameter μ > 0 is defined as
(8)
and the proximal mapping of is defined as
(9)
Note that the proximal mapping is unique because the minimization above is strongly convex. Standard results show that the Moreau envelope is smooth and its gradient is Lipschitz continuous; see Proposition 13.37 in Rockafellar and Wets (2009) and Proposition 1 in Sun and Sun (2021). Hence, using the Moreau envelope, we can construct a smooth approximation of (5) as follows
(10)
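As a sanity check on definitions (8) and (9) (our own toy example, not from the paper), take g(x) = |x|: its proximal mapping is soft-thresholding, its Moreau envelope is the Huber function, and the envelope's gradient equals (w − prox(w))/μ:

```python
import numpy as np

def prox_abs(w, mu):
    """Proximal mapping (9) of g(x) = |x|: soft-thresholding."""
    return np.sign(w) * max(abs(w) - mu, 0.0)

def moreau_abs(w, mu):
    """Moreau envelope (8) of |x|, evaluated at its minimizer (the Huber function)."""
    p = prox_abs(w, mu)
    return abs(p) + (p - w) ** 2 / (2.0 * mu)

mu = 0.5
for w in (-2.0, -0.2, 0.3, 1.7):
    grad = (w - prox_abs(w, mu)) / mu   # closed-form gradient of the envelope
    num = (moreau_abs(w + 1e-6, mu) - moreau_abs(w - 1e-6, mu)) / 2e-6
    assert abs(grad - num) < 1e-5       # matches a finite-difference check
    assert moreau_abs(w, mu) <= abs(w)  # the envelope lower-bounds g
```

The envelope is differentiable everywhere, including at the kink of |x|, which is exactly the smoothing effect exploited in (10).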
Function has the following properties. The first property is shown in Sun and Sun (2021). We give the proof for the second in Appendix B.
Lemma 3
Since is smooth, we can directly apply a first-order method for smooth non-convex optimization to (10). To do so, we need to evaluate , which requires computing and , i.e., exactly solving (9) with and , respectively. Computing the subgradients of and requires processing data pairs, which is costly. Unfortunately, the standard approach of sampling over data pairs does not produce unbiased stochastic subgradients of and due to the composite structure . In the next section, we will discuss a solution to overcome this challenge and approximate and , which leads to an efficient algorithm for (10).
5 Algorithm for pAUC Optimization
Consider (10) with defined in (6) for and . To avoid processing data points, one method is to introduce dual variables for and formulate as
(11)
where . Then (10) can be reformulated as a min-max problem and solved by a primal-dual stochastic gradient method (e.g., Rafique et al. (2021)). However, the maximization in (11) involves a large number of decision variables and equality constraints, so the per-iteration cost remains high even after using stochastic gradients.
To further reduce the per-iteration cost, we take the dual form of the maximization in (11) (see Lemma 10 in Appendix B) and formulate as
(12)
where . Hence, (9) with for and can be reformulated as
(13)
Note that is jointly convex in and when (see Lemma 9 in Appendix B). Thanks to formulation (13), we can construct stochastic subgradients of and apply coordinate updates to by sampling indexes ’s and ’s, which significantly reduces the computational cost when and are both large. We present this standard stochastic block coordinate descent (SBCD) method for solving (13) in Algorithm 1 and present its convergence property as follows.
(14)
(15)
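The SBCD idea can be sketched as follows. This is a heavily simplified sketch under our own assumptions, not a reproduction of Algorithm 1: we assume a linear scorer, logistic pair losses, the variational form φ_k(s) = min_λ { kλ + Σ_j max(s_j − λ, 0) } of the top-k sum, and our own step-size and scaling choices. Each step samples a block of positives, whose dual coordinates λ_i are the only ones updated, together with a minibatch of negatives:

```python
import numpy as np

def sbcd_prox(w_bar, X_pos, X_neg, k, mu, lr=0.05, T=200, B_pos=None, B_neg=None, seed=0):
    """Inexactly solve a proximal subproblem of the form
        min_{w, lam}  mean_i [ k*lam_i + sum_j max(loss_ij(w) - lam_i, 0) ]
                      + ||w - w_bar||^2 / (2*mu),
    where loss_ij is the logistic loss on pair (positive i, negative j).
    Only the lam-coordinates of the sampled positives are updated per step."""
    rng = np.random.default_rng(seed)
    n_pos, n_neg = len(X_pos), len(X_neg)
    B_pos = B_pos or n_pos
    B_neg = B_neg or n_neg
    w, lam = w_bar.copy(), np.zeros(n_pos)
    for _ in range(T):
        I = rng.choice(n_pos, B_pos, replace=False)    # sampled positives (lam block)
        J = rng.choice(n_neg, B_neg, replace=False)    # sampled negatives
        m = (X_pos[I] @ w)[:, None] - (X_neg[J] @ w)[None, :]
        loss = np.log1p(np.exp(-m))
        active = loss > lam[I][:, None]                # pairs inside the hinge
        dldm = -1.0 / (1.0 + np.exp(m))                # d(loss)/d(margin)
        coef = active * dldm * (n_neg / B_neg) / B_pos # importance-weighted
        grad_w = coef.sum(1) @ X_pos[I] - coef.sum(0) @ X_neg[J] + (w - w_bar) / mu
        grad_lam = (k - (n_neg / B_neg) * active.sum(1)) / n_pos
        w -= lr * grad_w
        lam[I] -= lr * grad_lam                        # block coordinate update
    return w, lam

rng = np.random.default_rng(0)
Xp, Xn = rng.normal(size=(10, 5)), rng.normal(size=(20, 5))
w0 = 0.1 * rng.normal(size=5)
w_new, lam_new = sbcd_prox(w0, Xp, Xn, k=4, mu=1.0, B_pos=4, B_neg=8)
print(np.linalg.norm(w_new - w0))
```

With full batches this reduces to deterministic (sub)gradient descent on the jointly convex subproblem; the point of the block sampling is that each step touches only B_pos · B_neg pairs rather than all of them.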
Using Algorithm 1 to compute an approximation of for and , and thus an approximation of , we can apply an approximate gradient descent (AGD) method to (10) and find a nearly ε-critical point of (5) according to Lemma 3. We present the AGD method in Algorithm 2 and its convergence property as follows.
Theorem 5
According to Theorem 5, to find a nearly -critical point of (5), we need iterations in Algorithm 2 and iterations of Algorithm 1 in total across all calls.
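To see the overall scheme in miniature (a toy example of ours, not the paper's pAUC objective), take the one-dimensional DC function F(x) = |x| − |x − 1|, smooth both components with their Moreau envelopes, and run gradient descent using ∇F_μ(x) = (prox_{μg₂}(x) − prox_{μg₁}(x))/μ. Here both prox mappings are available in closed form, whereas Algorithm 2 only approximates them via Algorithm 1:

```python
def prox_shifted_abs(x, mu, c):
    """Prox of g(t) = |t - c|: soft-thresholding around c."""
    d = x - c
    return c + (1 if d > 0 else -1) * max(abs(d) - mu, 0.0)

def smoothed_dc_gd(x0, mu=0.1, lr=0.05, T=300):
    """Gradient descent on F_mu = M_mu g1 - M_mu g2 with g1 = |x|, g2 = |x - 1|,
    using grad F_mu(x) = (prox_{mu g2}(x) - prox_{mu g1}(x)) / mu."""
    x = x0
    for _ in range(T):
        grad = (prox_shifted_abs(x, mu, 1.0) - prox_shifted_abs(x, mu, 0.0)) / mu
        x -= lr * grad
    return x

x_star = smoothed_dc_gd(0.8)
F = lambda x: abs(x) - abs(x - 1.0)
print(x_star, F(x_star))   # lands in the region x <= 0 where F attains its minimum -1
```

Neither component is differentiable, yet the smoothed surrogate has a well-defined gradient everywhere, so plain gradient descent applies.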
Remark 6 (Challenges in proving Theorem 5)
Suppose we can set in lines 3 and 4 of Algorithm 2 appropriately such that the approximation errors and are both . We can then prove that Algorithm 2 finds a nearly ε-critical point within iterations and the total complexity is . This is just a standard idea. However, by Proposition 4, such a must be where and also change with . Then it is not clear what the order of is. By a novel proof technique based on the (linear) error-bound condition of with respect to , we prove that both and are (see (27) and (30) in Appendix D), which ensures that and thus the total complexity is .
Remark 7 (Analysis of sensitivity of the algorithm to )
For the interesting case where , we have . In this case, we can derive that the order of dependency on is and the optimal choice of is thus , e.g., , which leads to a complexity of . We present the convergence curves and the test performance of our method when applied to training a linear model with of different values in Appendix E.7.
6 Numerical Experiments
In this section, we demonstrate the effectiveness of our algorithm AGD-SBCD for pAUC maximization and SoRR loss minimization problems (see Appendix E.1 for details). All experiments are conducted in Python and Matlab on a computer with a 2 GHz quad-core Intel Core i5 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. All datasets we used are publicly available and contain no personally identifiable information or offensive content.
6.1 Partial AUC Maximization
For maximizing the pAUC, we focus on the large-scale imbalanced medical dataset CheXpert (Irvin et al., 2019), which is licensed under CC-BY-SA and has 224,316 images. We construct five binary classification tasks with the logistic loss for predicting five common diseases: Cardiomegaly (D1), Edema (D2), Consolidation (D3), Atelectasis (D4), and P. Effusion (D5).
For comparison of training convergence, we consider different methods for optimizing the partial AUC. We compare with three baselines: DCA (Hu et al., 2020) (see Appendix E.4 for details), proximal DCA (Wen et al., 2018) (see Appendix E.5 for details) and SVMpAUC-tight (Narasimhan and Agarwal, 2013b). Since DCA, proximal DCA and SVMpAUC-tight cannot be applied to deep neural networks, we focus on a linear model and use a pre-trained deep neural network to extract fixed-dimensional feature vectors. The deep neural network was trained by optimizing the cross-entropy loss following the same setting as in Yuan et al. (2020).
For the three baselines and our algorithm, the process used to tune the hyper-parameters is explained in Appendix E.2. In Figure 1 and Figure 3 in Appendix E.3, we show how the training loss (the objective value of (3)) and the normalized partial AUC on the training data change with the number of epochs. We observe that, for all five diseases, our algorithm converges much faster than DCA and proximal DCA and achieves a better partial AUC.
| Tasks | SVMpAUC-tight pAUC | SVMpAUC-tight CPU time | AGD-SBCD CPU time (epoch) to outperform | AGD-SBCD GPU time to outperform | AGD-SBCD pAUC | AGD-SBCD CPU time | AGD-SBCD GPU time |
|---|---|---|---|---|---|---|---|
| D1 | 0.6259 | 95.14 | 2.91 (0.23) | 1.85 | 0.7005 ± 0.0003 | 118.32 | 82.13 |
| D2 | 0.5860 | 90.83 | 3.36 (0.23) | 1.93 | 0.7214 ± 0.0024 | 415.66 | 247.29 |
| D3 | 0.3745 | 90.56 | 3.26 (0.23) | 1.84 | 0.4910 ± 0.0006 | 181.70 | 104.55 |
| D4 | 0.3895 | 89.64 | 10.09 (0.63) | 8.38 | 0.4616 ± 0.0006 | 187.36 | 158.14 |
| D5 | 0.7267 | 90.86 | 3.97 (0.23) | 1.89 | 0.8272 ± 0.0001 | 238.10 | 142.91 |
The comparison between our AGD-SBCD and SVMpAUC-tight on the training data is shown in Table 1. As shown in the second to fifth columns of Table 1, our algorithm needs only a few seconds to exceed the pAUCs that SVMpAUC-tight takes more than one minute to return. As shown in the sixth to eighth columns, our algorithm eventually improves the pAUC substantially compared with SVMpAUC-tight. DCA and proximal DCA are not included in the table because they compute deterministic subgradients, which leads to runtimes significantly longer than those of the other two methods. We plot the convergence curves of the training pAUC over GPU time for DCA and our algorithm in Figure 7 in Appendix E.8.
To compare the testing performances, we consider deep neural networks in addition to the linear model. For the linear model, we still compare with DCA and SVMpAUC-tight. For deep neural networks, we compare with the naive mini-batch based method (MB) (Kar et al., 2014) and methods based on different optimization objectives, including the cross-entropy loss (CE) and the AUC min-max margin loss (AUC-M) (Yuan et al., 2021). We learn the model DenseNet121 from scratch with the CheXpert training data split in train/val = 9:1 and the CheXpert validation dataset as the testing set, which has 234 samples. The range of FPRs in pAUC is [0.05, 0.5]. For optimizing CE, we use the standard Adam optimizer. For optimizing AUC-M, we use the PESG optimizer in Yuan et al. (2021). We run each method for 10 epochs, and the learning rate ( in AGD-SBCD) of all methods is tuned from . The mini-batch size is 32. For AGD-SBCD, is set to , is set to and is tuned from . For MB, the learning rate decays in the same way as in Kar et al. (2014). For CE and AUC-M, the learning rate decays 10-fold after every 5 epochs. For AUC-M, we tune the hyperparameter in {100, 500, 1000}. For each method, the validation set is used to tune the hyperparameters and select the best model across all iterations. The results of the pAUCs on the testing set are reported in Table 2, which shows that our algorithm performs the best for all diseases. The complete ROC curves on the testing set are shown in Appendix E.3.
| Model | Method | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|
| Linear Model | SVMpAUC-tight | 0.6538 ± 0.0042 | 0.6038 ± 0.0009 | 0.6946 ± 0.0020 | 0.6521 ± 0.0006 | 0.7994 ± 0.0004 |
| Linear Model | DCA | 0.6636 ± 0.0093 | 0.8078 ± 0.0030 | 0.7427 ± 0.0257 | 0.6169 ± 0.0208 | 0.8371 ± 0.0022 |
| Linear Model | Proximal DCA | 0.6615 ± 0.0103 | 0.8041 ± 0.0033 | 0.7064 ± 0.0253 | 0.5945 ± 0.0266 | 0.8352 ± 0.0023 |
| Linear Model | AGD-SBCD | 0.6721 ± 0.0081 | 0.8257 ± 0.0025 | 0.8016 ± 0.0075 | 0.6340 ± 0.0165 | 0.8500 ± 0.0017 |
| Deep Model | MB | 0.7510 ± 0.0248 | 0.8197 ± 0.0127 | 0.6339 ± 0.0328 | 0.5698 ± 0.0343 | 0.8461 ± 0.0188 |
| Deep Model | CE | 0.6994 ± 0.0453 | 0.8075 ± 0.0244 | 0.7673 ± 0.0266 | 0.6499 ± 0.0184 | 0.7884 ± 0.0080 |
| Deep Model | AUC-M | 0.7403 ± 0.0339 | 0.8002 ± 0.0274 | 0.8533 ± 0.0469 | 0.7420 ± 0.0277 | 0.8504 ± 0.0065 |
| Deep Model | AGD-SBCD | 0.7535 ± 0.0255 | 0.8345 ± 0.0130 | 0.8689 ± 0.0184 | 0.7520 ± 0.0079 | 0.8513 ± 0.0107 |
For deep neural networks, we also learn the model ResNet-20 from scratch on the CIFAR-10-LT and Tiny-ImageNet-200-LT datasets, which are constructed similarly to those in Yang et al. (2021b). Details about these two datasets are summarized in Appendix E.6. The range of FPRs in pAUC is [0.05, 0.5]. The process of tuning hyperparameters is the same as that for CheXpert. The results of the pAUCs on the testing set are reported in Table 3, which shows that our algorithm performs the best on these two long-tailed datasets.
| Model | Dataset | MB | CE | AUC-M | AGD-SBCD |
|---|---|---|---|---|---|
| Deep Model | CIFAR-10-LT | 0.9337 ± 0.0043 | 0.9016 ± 0.0137 | 0.9323 ± 0.0055 | 0.9408 ± 0.0084 |
| Deep Model | Tiny-ImageNet-200-LT | 0.6445 ± 0.0214 | 0.6549 ± 0.008 | 0.6497 ± 0.009 | 0.6594 ± 0.0192 |
7 Conclusion
Most existing methods for optimizing the pAUC are deterministic and only have asymptotic convergence properties. We formulate pAUC optimization as a non-smooth DC program and develop a stochastic subgradient method based on the Moreau envelope smoothing technique. We show that our method finds a nearly ε-critical point in Õ(1/ε^6) iterations and demonstrate its performance numerically. A limitation of this paper is the smoothness assumption on the predictive model, which does not hold for some models, e.g., neural networks using ReLU activation functions. Extending our results to non-smooth models is left as future work.
Acknowledgements
This work was jointly supported by the University of Iowa Jumpstarting Tomorrow Program and NSF award 2147253. T. Yang was also supported by NSF awards 2110545 and 1844403 and an Amazon research award. We thank Zhishuai Guo, Zhuoning Yuan and Qi Qi for discussions about processing the image dataset.
References
- Abbaszadehpeivasti et al. (2021) Hadi Abbaszadehpeivasti, Etienne de Klerk, and Moslem Zamani. On the rate of convergence of the difference-of-convex algorithm (DCA). arXiv preprint arXiv:2109.13566, 2021.
- Alexandroff (1950) AD Alexandroff. Surfaces represented by the difference of convex functions. In Doklady Akademii Nauk SSSR (NS), volume 72, pages 613–616, 1950.
- An and Tao (2005) Le Thi Hoai An and Pham Dinh Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1):23–46, 2005.
- An et al. (2019) Le Thi Hoai An, Huynh Van Ngai, Pham Dinh Tao, and Luu Hoang Phuc Hau. Stochastic difference-of-convex algorithms for solving nonconvex optimization problems. arXiv preprint arXiv:1911.04334, 2019.
- An and Nam (2017) Nguyen Thai An and Nguyen Mau Nam. Convergence analysis of a proximal point algorithm for minimizing differences of functions. Optimization, 66(1):129–147, 2017.
- Artacho et al. (2018) Francisco J Aragón Artacho, Ronan MT Fleming, and Phan T Vuong. Accelerating the dc algorithm for smooth functions. Mathematical Programming, 169(1):95–118, 2018.
- Bradley (1997) Andrew P Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
- Bradley (2014) Andrew P Bradley. Half-AUC for the evaluation of sensitive or specific classifiers. Pattern Recognition Letters, 38:93–98, 2014.
- Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
- Davis and Drusvyatskiy (2018) Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate O(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.
- Davis and Grimmer (2019) Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.
- Deng and Lan (2020) Qi Deng and Chenghao Lan. Efficiency of coordinate descent methods for structured nonconvex optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 74–89. Springer, 2020.
- Dodd and Pepe (2003) Lori E Dodd and Margaret S Pepe. Partial auc estimation and regression. Biometrics, 59(3):614–623, 2003.
- Drusvyatskiy and Paquette (2019) Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1):503–558, 2019.
- Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Ellaia (1984) Rachid Ellaia. Contribution à l’analyse et l’optimisation de différence de fonctions convexes. PhD thesis, Université Paul Sabatier, 1984.
- Fan et al. (2017) Yanbo Fan, Siwei Lyu, Yiming Ying, and Bao-Gang Hu. Learning with average top-k loss. arXiv preprint arXiv:1705.08826, 2017.
- Gabay (1982) D. Gabay. Minimizing the difference of two convex functions. I. Algorithms based on exact regularization. 1982.
- Hanley and McNeil (1982) James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
- Hartman (1959) Philip Hartman. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.
- He et al. (2021) Lulu He, Jimin Ye, et al. Accelerated proximal stochastic variance reduction for DC optimization. Neural Computing and Applications, 33(20):13163–13181, 2021.
- Hiriart-Urruty (1985) J-B Hiriart-Urruty. Generalized differentiability/duality and optimization for problems dealing with differences of convex functions. In Convexity and duality in optimization, pages 37–70. Springer, 1985.
- Hiriart-Urruty (1991) J-B Hiriart-Urruty. How to regularize a difference of convex functions. Journal of mathematical analysis and applications, 162(1):196–209, 1991.
- Hu et al. (2020) Shu Hu, Yiming Ying, Siwei Lyu, et al. Learning by minimizing the sum of ranked range. Advances in Neural Information Processing Systems, 33, 2020.
- Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison, 2019.
- Jiang et al. (1996) Yulei Jiang, Charles E Metz, and Robert M Nishikawa. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology, 201(3):745–750, 1996.
- Kar et al. (2014) Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/7d04bbbe5494ae9d2f5a76aa1c00fa2f-Paper.pdf.
- Khamaru and Wainwright (2018) Koulik Khamaru and Martin Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. In International Conference on Machine Learning, pages 2601–2610. PMLR, 2018.
- Komori and Eguchi (2010) Osamu Komori and Shinto Eguchi. A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics, 11(1):1–17, 2010.
- Le Thi and Dinh (2018) Hoai An Le Thi and Tao Pham Dinh. DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5–68, 2018.
- Le Thi et al. (2017) Hoai An Le Thi, Hoai Minh Le, Duy Nhat Phan, and Bach Tran. Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In International Conference on Machine Learning, pages 3394–3403. PMLR, 2017.
- Lipp and Boyd (2016) Thomas Lipp and Stephen Boyd. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
- Liu et al. (2021) Mingrui Liu, Hassan Rafique, Qihang Lin, and Tianbao Yang. First-order convergence theory for weakly-convex-weakly-concave min-max problems. Journal of Machine Learning Research, 22(169):1–34, 2021.
- Ma et al. (2013) Hua Ma, Andriy I Bandos, Howard E Rockette, and David Gur. On use of partial area under the ROC curve for evaluation of diagnostic performance. Statistics in Medicine, 32(20):3449–3458, 2013.
- Ma et al. (2020) Runchao Ma, Qihang Lin, and Tianbao Yang. Quadratically regularized subgradient methods for weakly convex optimization with weakly convex constraints. In International Conference on Machine Learning, pages 6554–6564. PMLR, 2020.
- Mairal (2013) Julien Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. arXiv preprint arXiv:1306.4650, 2013.
- McClish (1989) Donna Katzman McClish. Analyzing a portion of the ROC curve. Medical Decision Making, 9(3):190–195, 1989.
- Moudafi (2008) Abdellatif Moudafi. On the difference of two maximal monotone operators: Regularization and algorithmic approaches. Applied Mathematics and Computation, 202(2):446–452, 2008.
- Moudafi (2021) Abdellatif Moudafi. A complete smooth regularization of DC optimization problems. 2021.
- Moudafi and Maingé (2006) Abdellatif Moudafi and Paul-Emile Maingé. On the convergence of an approximate proximal method for DC functions. Journal of Computational Mathematics, pages 475–480, 2006.
- Narasimhan and Agarwal (2013a) Harikrishna Narasimhan and Shivani Agarwal. A structural SVM based approach for optimizing partial AUC. In International Conference on Machine Learning, pages 516–524. PMLR, 2013a.
- Narasimhan and Agarwal (2013b) Harikrishna Narasimhan and Shivani Agarwal. SVMpAUC-tight: a new support vector method for optimizing partial AUC based on a tight convex upper bound. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 167–175, 2013b.
- Narasimhan and Agarwal (2017) Harikrishna Narasimhan and Shivani Agarwal. Support vector algorithms for optimizing the partial area under the ROC curve. Neural Computation, 29(7):1919–1963, 2017.
- Nitanda and Suzuki (2017) Atsushi Nitanda and Taiji Suzuki. Stochastic difference of convex algorithm and its application to training deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 470–478. PMLR, 2017.
- Pang et al. (2017) Jong-Shi Pang, Meisam Razaviyayn, and Alberth Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research, 42(1):95–118, 2017.
- Rafique et al. (2021) Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Weakly-convex–concave min–max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, pages 1–35, 2021.
- Ricamato and Tortorella (2011) Maria Teresa Ricamato and Francesco Tortorella. Partial AUC maximization in a linear combination of dichotomizers. Pattern Recognition, 44(10-11):2669–2677, 2011.
- Rockafellar and Wets (2009) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
- Shalev-Shwartz and Wexler (2016) Shai Shalev-Shwartz and Yonatan Wexler. Minimizing the maximal loss: How and why. In International Conference on Machine Learning, pages 793–801. PMLR, 2016.
- Souza et al. (2016) João Carlos O Souza, Paulo Roberto Oliveira, and Antoine Soubeyran. Global convergence of a proximal linearized algorithm for difference of convex functions. Optimization Letters, 10(7):1529–1539, 2016.
- Sriperumbudur and Lanckriet (2009) Bharath K Sriperumbudur and Gert RG Lanckriet. On the convergence of the concave-convex procedure. In NIPS, volume 9, pages 1759–1767. Citeseer, 2009.
- Sun and Sun (2021) Kaizhao Sun and Xu Andy Sun. Algorithms for difference-of-convex (DC) programs based on difference-of-Moreau-envelopes smoothing. arXiv preprint arXiv:2104.01470, 2021.
- Sun et al. (2003) Wen-yu Sun, Raimundo JB Sampaio, and MAB Candido. Proximal point algorithm for minimization of DC function. Journal of Computational Mathematics, pages 451–462, 2003.
- Tao and An (1997) Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
- Tao and An (1998) Pham Dinh Tao and Le Thi Hoai An. A DC optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.
- Thi et al. (2021) Hoai An Le Thi, Hoang Phuc Hau Luu, and Tao Pham Dinh. Online stochastic DCA with applications to principal component analysis. arXiv preprint arXiv:2108.02300, 2021.
- Thompson and Zucchini (1989) Mary Lou Thompson and Walter Zucchini. On the statistical analysis of ROC curves. Statistics in Medicine, 8(10):1277–1290, 1989.
- Tuy (1995) Hoang Tuy. DC optimization: theory, methods and algorithms. In Handbook of global optimization, pages 149–216. Springer, 1995.
- Ueda and Fujino (2018) Naonori Ueda and Akinori Fujino. Partial AUC maximization via nonlinear scoring functions. arXiv preprint arXiv:1806.04838, 2018.
- Vapnik (1992) Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
- Wang and Chang (2011) Zhanfeng Wang and Yuan-Chin Ivan Chang. Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics, 12(2):369–385, 2011.
- Wen et al. (2018) Bo Wen, Xiaojun Chen, and Ting Kei Pong. A proximal difference-of-convex algorithm with extrapolation. Computational optimization and applications, 69(2):297–324, 2018.
- Xu et al. (2019) Yi Xu, Qi Qi, Qihang Lin, Rong Jin, and Tianbao Yang. Stochastic optimization for dc functions and non-smooth non-convex regularizers with non-asymptotic convergence. In International Conference on Machine Learning, pages 6942–6951. PMLR, 2019.
- Yang et al. (2019) Hanfang Yang, Kun Lu, Xiang Lyu, and Feifang Hu. Two-way partial AUC and its properties. Statistical Methods in Medical Research, 28(1):184–195, 2019.
- Yang and Ying (2022) Tianbao Yang and Yiming Ying. AUC maximization in the era of big data and AI: A survey. ACM Computing Surveys, August 2022.
- Yang et al. (2021a) Zhiyong Yang, Qianqian Xu, Shilong Bao, Xiaochun Cao, and Qingming Huang. Learning with multiclass AUC: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a.
- Yang et al. (2021b) Zhiyong Yang, Qianqian Xu, Shilong Bao, Yuan He, Xiaochun Cao, and Qingming Huang. When all we need is a piece of the pie: A generic framework for optimizing two-way partial AUC. In International Conference on Machine Learning, pages 11820–11829. PMLR, 2021b.
- Yang et al. (2022) Zhiyong Yang, Qianqian Xu, Shilong Bao, Yuan He, Xiaochun Cao, and Qingming Huang. Optimizing two-way partial AUC with an end-to-end framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Ying et al. (2016) Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online AUC maximization. Advances in neural information processing systems, 29, 2016.
- Yuan et al. (2020) Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. arXiv preprint arXiv:2012.03173, 2020.
- Yuan et al. (2021) Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 3020–3029. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00303. URL https://doi.org/10.1109/ICCV48922.2021.00303.
- Yuille and Rangarajan (2003) Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural computation, 15(4):915–936, 2003.
- Zhu et al. (2022) Dixian Zhu, Gang Li, Bokun Wang, Xiaodong Wu, and Tianbao Yang. When AUC meets DRO: Optimizing partial AUC for deep learning with non-convex convergence guarantee. arXiv preprint arXiv:2203.00176, 2022.
Appendix
Appendix A Algorithm for Sum of Range Optimization
The technique in the previous sections can be directly applied to minimize the SoRR loss, which is formulated as (5) but with defined in (7). Since (7) is a special case of (6) with and , we can again formulate subproblem (9) with as (13) with being a scalar. Since is a scalar, when solving (13), we no longer use the block coordinate update but only need to sample over indexes to construct stochastic subgradients. We present the stochastic subgradient (SGD) method for (13) in Algorithm 3. We then apply Algorithm 2 with the SBCD steps in lines 3 and 4 replaced by SGD. The convergence result in this case follows directly from Theorem 5.
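For intuition, the SoRR objective admits a natural DC form as the difference of two convex top-k sums. Below is a minimal numeric sketch; the function and variable names are ours, and the indices `k1 < k2` stand in for the ranked range in (7):

```python
import numpy as np

def sorr_loss(losses, k1, k2):
    """Sum of ranked range (SoRR): the sum of the (k1+1)-th through k2-th
    largest individual losses, written as a difference of two convex
    top-k sums (the DC decomposition exploited by the algorithms)."""
    sorted_losses = np.sort(losses)[::-1]   # descending order
    top_k2 = sorted_losses[:k2].sum()       # convex in the losses
    top_k1 = sorted_losses[:k1].sum()       # convex in the losses
    return top_k2 - top_k1

losses = np.array([4.0, 1.0, 3.0, 2.0])
print(sorr_loss(losses, 1, 3))   # 3.0 + 2.0 = 5.0
```

With `k1 = 0` the SoRR loss reduces to the usual (convex) top-k2 sum; the second term is what makes the problem a DC program rather than a convex one.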
Appendix B Proofs of Lemmas
Proof.[of Lemma 3] We will only prove the second conclusion in Lemma 3 since the first conclusion has been shown in Proposition 1 in Sun and Sun (2021).
Suppose . By the first conclusion in Lemma 3, we must have . By the optimality conditions satisfied by and , there exist and such that
which implies and . Suppose . We have and . Hence, satisfies Definition 2 with and . Suppose . The conclusion can also be proved similarly.
We first present the following lemma which is similar to Lemma 4.2 in Drusvyatskiy and Paquette (2019).
Lemma 9
Suppose Assumption 1 holds. For any , , , , and , we have
where . Moreover, is jointly convex in and for any and any .
Proof. By Assumption 1, we have
(18)
Let . We have
where the first inequality is by the convexity of , the second inequality is from (18) and the last inequality is by the definitions of and the fact that .
Combining the inequality from the first conclusion with the equality and using the fact that , we can obtain
which proves the second conclusion.
Proof. For , we introduce a Lagrangian multiplier for the constraint in (11). Let which equals . Then, for each , we have
The conclusion is thus proved by summing up the equality above for .
Appendix C Proof of Proposition 4
Let be the unique optimal solution of (8) and let be the th largest coordinate of . It is easy to show that the set of optimal solutions of (13) is where
(19)
Given any , we denote its projection onto as . By the structure of , the th coordinate of is just the projection of onto , which we denote by . Moreover, we denote the distance from to as and it satisfies
(20)
With these definitions, we can present the following lemma.
Lemma 11
Suppose Assumption 1 holds and . For any , , and , we have
Proof. It is easy to observe that , where
and . Since is piecewise linear in with an outward slope of at least one at either end of the interval , we must have
which implies
Moreover, by Assumption 1, is -Lipschitz continuous in , so is -Lipschitz continuous in . We then have , which, combined with the previous inequality, implies the conclusion.
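The coordinatewise structure used above can be illustrated with a short sketch: assuming the set is a product of intervals (a box), projecting onto it reduces to clipping each coordinate, and the squared distance decomposes across coordinates. The names here are illustrative, not the paper's notation:

```python
import numpy as np

def project_onto_box(p, lower, upper):
    """Project p onto the box {x : lower_i <= x_i <= upper_i}.
    The projection is coordinatewise clipping, and the squared distance
    to the box is the sum of per-coordinate squared distances."""
    proj = np.clip(p, lower, upper)
    sq_dist = np.sum((p - proj) ** 2)
    return proj, sq_dist

p = np.array([0.5, 2.0, -1.0])
lo = np.zeros(3)
hi = np.ones(3)
proj, d2 = project_onto_box(p, lo, hi)
print(proj, d2)   # [0.5 1. 0.] 2.0
```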
We present the proof of Proposition 4 below.
Proof.[of Proposition 4] Let us denote and . Let be the expectation conditioning on and .
By the optimality condition satisfied by and -strong convexity of the objective function in (14), we have
where the last inequality is by Young’s inequality. Since is deterministic conditioning on and , applying (21) and taking expectation on both sides of the inequality above yields
(22)
By the updating equation (15) for , we have
Since is deterministic conditioning on and , applying (21) and taking expectation on both sides of the inequality above yields
(23)
Multiplying both sides of (23) by and adding it with (22), we have
(24)
where the second inequality is because , which allows us to apply Lemma 9 with and .
Notice that and do not change with . Summing up (24) over , taking the full expectation, and reorganizing terms give us
(25)
Because is -strongly convex and the facts that and that for any optimal , we have that
(26)
where the last inequality is because is jointly convex in and . Recall that . Applying (25) to the left-hand side of the inequality above, we obtain
The conclusion is then proved given that .
Appendix D Proof of Theorem 4
We first present a technical lemma.
Lemma 12
Given two intervals and with , we have
Proof. We will prove the result in three cases: , , and . Suppose , so that . We have . The result when can be proved similarly. Suppose , so that . Note that . We define . Then .
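Lemma 12 compares projections onto two intervals; while its precise statement is elided in this extraction, bounds of this type typically take the form |P_[a1,b1](x) - P_[a2,b2](x)| <= max(|a1 - a2|, |b1 - b2|). That particular inequality is our assumption, not necessarily the lemma's exact form, and it can be checked numerically:

```python
import numpy as np

def proj_interval(x, a, b):
    """Projection of a scalar x onto the interval [a, b]."""
    return min(max(x, a), b)

# Numerically check |P_[a1,b1](x) - P_[a2,b2](x)| <= max(|a1-a2|, |b1-b2|)
# over random intervals and points.
rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    a1, b1 = np.sort(rng.uniform(-5, 5, 2))
    a2, b2 = np.sort(rng.uniform(-5, 5, 2))
    x = rng.uniform(-10, 10)
    gap = abs(proj_interval(x, a1, b1) - proj_interval(x, a2, b2))
    ok = ok and gap <= max(abs(a1 - a2), abs(b1 - b2)) + 1e-12
print(ok)   # True
```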
Under Assumption 1, we have for any so that for any and any . By the definition of , we have , which implies
(27)
We then provide the following proposition as an additional conclusion from the proof of Proposition 4.
Proof. By (25), the last inequality in (26), and Lemma 11, we have
which implies the conclusion by reorganizing terms.
Next we are ready to present the proof of Theorem 4.
Proof.[of Theorem 4] Let for . In the th iteration of Algorithm 2, Algorithm 1 is applied to the subproblem
(28)
with initial solution and runs for iterations. Let be the set of optimal for (28). We will prove by induction the following two inequalities for .
(29)
(30)
where is defined in (16) for and .
Applying Proposition 4 to (28) and using (27), we have, for ,
(31)
Moreover, applying Proposition 13 to (28) and using (27), we have
(32)
Suppose . By (31) and the choice of , we have , so (29) holds for . In addition, (30) holds trivially for . Next, we assume (29) and (30) both hold for and prove they also hold for .
From the updating equation in Algorithm 2, we know that
(34)
where the second inequality is by triangle inequality and the last inequality is by (27) and the fact that (29) holds for .
Let denote the th coordinate of for and and . Recall that is the th largest coordinate of . By Assumption 1 and an elementary argument, we can show that is -Lipschitz continuous for any and . Since is -Lipschitz continuous, we have
(35)
for and , where the last inequality is by (34). According to (19), (20), (35), and Lemma 12, we can prove that, for ,
Stating (31) for case gives
(36)
By (36), (30) for case , and the choice of , we have , which means (29) holds for case . By induction, we have proved that (29) and (30) hold for .
Because is -Lipschitz continuous, we have
Applying Young’s inequality to the first term on the right-hand side of the last inequality above gives
where the second inequality is because of (30) for and , and the last is because of .
Summing the above inequality over , we have
where the second inequality is because and (see Lemma 1 in Sun and Sun (2021)). Let be randomly sampled from , then it holds
by the definition of . On the other hand, by (29), we have for and that
where the last inequality is because of the value of . Hence, by the second claim in Lemma 3 (with and ), is a nearly -critical point of (5) with defined in (6). This completes the proof.
Appendix E Additional Materials for Numerical Experiments
In this section, we present additional details of our numerical experiments in Section 6 that we could not include in the main text due to space limitations.
E.1 Details of SoRR Loss Minimization Experiment
For SoRR loss minimization, we compare with the baseline DCA. We focus on learning a linear model on four benchmark datasets from the UCI data repository (Dua and Graff, 2017), preprocessed by LIBSVM (Chang and Lin, 2011): a9a, w8a, ijcnn1, and covtype. The statistics of these datasets are summarized in Table 4. We use the logistic loss , where the label . Below, we first describe the process of tuning the hyperparameters and then present the convergence curves.
Datasets | # Samples | # Features |
---|---|---
a9a | 32,561 | 123 |
w8a | 49,749 | 300 |
ijcnn1 | 49,990 | 22 |
covtype | 581,012 | 54 |
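For concreteness, the per-sample logistic loss mentioned above can be sketched as follows for a linear model. The names are ours, and `logaddexp` is used purely for numerical stability:

```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a linear model with
    labels y in {-1, +1}; logaddexp(0, z) computes log(1 + exp(z))
    without overflow for large z."""
    return np.logaddexp(0.0, -y * np.dot(w, x))

w = np.zeros(3)   # an untrained model scores every point at 0
print(logistic_loss(w, np.ones(3), 1))   # log(2) ≈ 0.6931
```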
For AGD-SBCD, we fix , , and . In the spirit of Corollary 8, within Algorithm 3 called in the th main iteration, and are both set to for any with tuned in the range of , and is set to with selected from . Parameter is chosen from and is tuned from .
Following the experiments in Hu et al. (2020), when solving (37), we first use the same step size and the same number of iterations for all ’s. We choose the step size from and the iteration number from . However, we find that the performance of DCA improves if we let the step size and the number of iterations vary with . Hence, we apply the same process as in AGD-SBCD to select , , and in Algorithm 4 in the th main iteration of DCA. We report the performance of DCA under both settings (named DCA.Constant.lr and DCA.Changing.lr, respectively). The changes of the SoRR loss with the number of epochs are shown in Figure 2. From the curves, we observe that our algorithm reduces the SoRR loss faster than both DCA variants on all four datasets.
E.2 Process of Tuning Hyperparameters
For AGD-SBCD, we fix , , and . In the spirit of Theorem 5, within Algorithm 3 called in the th main iteration, is set to , is set to for any with tuned in , and is set to . Parameter is chosen from and is tuned from . For DCA, we only implement the setting of DCA.Changing.lr (see Appendix E.1 for a description). In Algorithm 4 called in the th main iteration, and are selected following the same process as in AGD-SBCD, and is set to with chosen from . For the SVMpAUC-tight method, we use the MATLAB code from Narasimhan and Agarwal (2013b) and tune the hyper-parameter from .
E.3 Additional Plots of Partial AUC Maximization Experiment
The results for partial AUC maximization for diseases D3 to D5 are shown in Figure 3.
We plot the ROC curves of three algorithms with linear model on the CheXpert testing set in Figure 4. The range of false positive rate for the pAUC is set as .
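For reference, the pAUC values reported in these plots can be computed from the empirical ROC curve. Below is a minimal sketch with our own names; normalizing by the width of the FPR range, so that a perfect ranking attains 1, is one common convention:

```python
import numpy as np

def partial_auc(labels, scores, fpr_low, fpr_high):
    """Partial AUC over the FPR range [fpr_low, fpr_high]: build the
    empirical ROC curve, then integrate the TPR over the range by the
    trapezoidal rule and normalize by the width of the range."""
    labels = np.asarray(labels)[np.argsort(-np.asarray(scores))]
    tpr = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
    fpr = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
    fpr = np.concatenate([[0.0], fpr])   # start the curve at the origin
    tpr = np.concatenate([[0.0], tpr])
    grid = np.linspace(fpr_low, fpr_high, 1001)
    tpr_grid = np.interp(grid, fpr, tpr)
    area = np.sum((tpr_grid[1:] + tpr_grid[:-1]) / 2 * np.diff(grid))
    return area / (fpr_high - fpr_low)

# a perfect ranking attains pAUC = 1 on any FPR range
print(partial_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1], 0.05, 0.5))   # ≈ 1.0
```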
We plot the convergence curves of the partial AUC on the CheXpert testing set for our algorithm AGD-SBCD and the baseline MB in Figure 5.
E.4 DCA and Algorithm for its Subproblem
Although DCA was originally studied only for SoRR loss minimization in Hu et al. (2020), it can also be applied to pAUC maximization. Hence, we only describe DCA for pAUC maximization, which covers SoRR loss minimization as a special case. At the th iteration, DCA computes a deterministic subgradient of at the iterate , denoted by . Then DCA updates by approximately solving the following subproblem using an SBCD method similar to Algorithm 1, sampling indexes and (see Algorithm 4 for details).
(37)
Algorithm 4 is used to solve the subproblem (37) in the th main iteration of DCA. It is similar to the SBCD method in Algorithm 1.
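To make the outer/inner structure concrete, here is a minimal sketch of a generic DCA loop for minimizing a difference of convex functions g - h, with plain gradient steps standing in for the SBCD inner solver; all names and the toy problem are ours, not the paper's:

```python
import numpy as np

def dca(grad_g, grad_h, w0, outer_iters=50, inner_iters=100, lr=0.1):
    """Generic DCA sketch for min_w g(w) - h(w): linearize the concave
    part -h at w_t using a (sub)gradient v_t of h, then approximately
    minimize the convex surrogate g(w) - <v_t, w> by gradient steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(outer_iters):
        v = grad_h(w)                     # subgradient of h at the iterate
        for _ in range(inner_iters):      # inner solver on the surrogate
            w = w - lr * (grad_g(w) - v)
    return w

# toy DC problem: g(w) = w^2, h(w) = 2w, so w^2 - 2w is minimized at w = 1
w_star = dca(lambda w: 2 * w, lambda w: 2.0 * np.ones_like(w), np.array([5.0]))
print(w_star)   # close to [1.0]
```

The surrogate in each outer iteration plays the role of subproblem (37); the difference in the paper's method is that the inner solver is stochastic and updates blocks of coordinates.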
E.5 Details about Proximal DCA
At the th iteration, proximal DCA computes a deterministic subgradient of at the iterate , denoted by . Then proximal DCA updates by approximately solving the subproblem
(38)
using an SBCD method similar to Algorithm 1, sampling indexes and . In proximal DCA, is tuned from , and the other hyper-parameters are tuned from the same ranges as for DCA.
E.6 Details about CIFAR-10-LT and Tiny-Imagenet-200-LT Datasets
Binary CIFAR-10-LT Dataset. To construct a binary classification task, we set the labels of the category 'cats' to 1 and the labels of all other categories to 0. We split the training data with a train/validation ratio of 9:1 and use the validation set as the testing set. More details are provided in Table 5.
Binary Tiny-Imagenet-200-LT Dataset. To construct a binary classification task, we set the labels of the category 'birds' to 1 and the labels of all other categories to 0. We split the training data with a train/validation ratio of 9:1 and use the validation set as the testing set. More details are provided in Table 5.
Dataset | Pos. Class ID | Pos. Class Name | # Pos. Samples | # Neg. Samples
---|---|---|---|---
CIFAR-10-LT | 3 | cats | 1,077 | 11,329
Tiny-Imagenet-200-LT | 11,20,21,22 | birds | 1,308 | 20,241
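The label binarization and 9:1 split described above can be sketched as follows; the helper is hypothetical, and the actual positive class IDs are those listed in Table 5:

```python
import numpy as np

def binarize_and_split(labels, pos_classes, val_frac=0.1, seed=0):
    """Map multi-class labels to {1, 0} (positive classes vs. the rest)
    and split the indices into train/validation parts (9:1 by default)."""
    labels = np.asarray(labels)
    binary = np.isin(labels, pos_classes).astype(int)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_val = int(len(labels) * val_frac)
    return binary, idx[n_val:], idx[:n_val]   # labels, train idx, val idx

y = np.array([3, 0, 3, 7, 1, 3, 2, 9, 3, 5])
binary, tr, va = binarize_and_split(y, pos_classes=[3])
print(binary)   # [1 0 1 0 0 1 0 0 1 0]
```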
E.7 Convergence Curves and Testing Performance of AGD-SBCD on Different
 | D1 | D2 | D3 | D4 | D5
---|---|---|---|---|---
AGD-SBCD | 0.6721±0.0081 | 0.8257±0.0025 | 0.8016±0.0075 | 0.6340±0.0165 | 0.8500±0.0017
 | 0.6617±0.0073 | 0.8242±0.0057 | 0.8272±0.0070 | 0.6323±0.0028 | 0.8463±0.0003
 | 0.6636±0.0056 | 0.8242±0.0057 | 0.8237±0.0077 | 0.6332±0.0072 | 0.8463±0.0002
E.8 Convergence Curves of Training pAUC over GPU Time