
Large-scale Optimization of Partial AUC in a Range of False Positive Rates

\nameYao Yao \email[email protected]
\addrDepartment of Mathematics
The University of Iowa \AND\nameQihang Lin \email[email protected]
\addrTippie College of Business
The University of Iowa \AND\nameTianbao Yang \email[email protected]
\addrDepartment of Computer Science & Engineering
Texas A&M University
Abstract

The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs has been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allows us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we use a stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We establish a complexity of \tilde{O}(1/\epsilon^{6}) for finding a nearly \epsilon-critical solution. Finally, we numerically demonstrate the effectiveness of our proposed algorithms in training both linear models and deep neural networks for partial AUC maximization and sum of ranked range loss minimization.

1 Introduction

The area under the receiver operating characteristic (ROC) curve (AUC) is one of the most widely used performance measures for classifiers in machine learning, especially when the data is imbalanced between the classes (Hanley and McNeil, 1982; Bradley, 1997). Typically, the classifier produces a score for each data point. A data point is then classified as positive if its score is above a chosen threshold; otherwise, it is classified as negative. Varying the threshold changes the true positive rate (TPR) and the false positive rate (FPR) of the classifier. The ROC curve shows the TPR as a function of the FPR corresponding to the same threshold. Hence, maximizing the AUC of a classifier is essentially maximizing the classifier’s average TPR over all FPRs from zero to one. However, in some applications, certain FPR regions have no practical relevance, and neither do the TPRs over those regions. For example, in clinical practice, a high FPR in diagnostic tests often results in a high monetary cost, so people may only need to maximize the TPR when the FPR is low (Dodd and Pepe, 2003; Ma et al., 2013; Yang et al., 2019). Moreover, since two models with the same AUC can still have different ROC curves, the AUC does not always reflect the true performance of a model that is needed in a particular production environment (Bradley, 2014).

As a generalization of the AUC, the partial AUC (pAUC) only measures the area under the ROC curve that is restricted between two FPRs. A probabilistic interpretation of the pAUC can be found in Dodd and Pepe (2003). In contrast to the AUC, the pAUC represents the average TPR only over a relevant range of FPRs and provides a performance measure that is more aligned with the practical needs in some applications.

In the literature, the existing algorithms for training a classifier by maximizing the pAUC include the boosting method (Komori and Eguchi, 2010) and the cutting plane algorithm (Narasimhan and Agarwal, 2013b; Narasimhan and Agarwal, 2013a, 2017). However, the former has no theoretical guarantee, and the latter applies only to linear models. More importantly, both methods require processing all the data in each iteration and thus become computationally inefficient for large datasets.

In this paper, we propose an approximate gradient method for maximizing the pAUC that works for nonlinear models (e.g., deep neural networks) and only needs to process randomly sampled positive and negative data points of any size in each iteration. In particular, we formulate the maximization of the pAUC as a non-smooth difference-of-convex (DC) program (Tao and An, 1997; Le Thi and Dinh, 2018). Due to non-smoothness, most existing DC optimization algorithms cannot be applied to our formulation. Motivated by Sun and Sun (2021), we approximate the two non-smooth convex components in the DC program by their Moreau envelopes and obtain a smooth approximation of the problem, which is solved using the gradient descent method. Since the gradient of the smooth problem cannot be calculated explicitly, we approximate the gradient by solving the two proximal-point subproblems defined by each convex component using the stochastic block coordinate descent (SBCD) method. Our method, besides its low per-iteration cost, has a rigorous theoretical guarantee, unlike the existing methods. In fact, we show that our method finds a nearly \epsilon-critical point of the pAUC optimization problem in \tilde{O}(\epsilon^{-6}) iterations with only small samples of positive and negative data points processed per iteration. (Throughout the paper, \tilde{O}(\cdot) suppresses all logarithmic factors.) This is the main contribution of this paper.

Note that, for non-convex non-smooth optimization, the existing stochastic methods (Davis and Grimmer, 2019; Davis and Drusvyatskiy, 2018) find a nearly \epsilon-critical point in O(\epsilon^{-4}) iterations under a weak convexity assumption. Our method needs O(\epsilon^{-6}) iterations because our problem is a DC problem with both convex components non-smooth, which is much more challenging than a weakly convex minimization problem. In addition, our iteration number matches the best known iteration complexity for non-smooth non-convex min-max optimization (Rafique et al., 2021; Liu et al., 2021) and non-smooth non-convex constrained optimization (Ma et al., 2020).

In addition to pAUC optimization, our method can also be used to minimize the sum of ranked range (SoRR) loss, which can be viewed as a special case of pAUC optimization. Many machine learning models are trained by minimizing an objective function defined as the sum of losses over all training samples (Vapnik, 1992). Since the sum of losses weights all samples equally, it is insensitive to samples from minority groups. Hence, the sum of the top-k losses (Shalev-Shwartz and Wexler, 2016; Fan et al., 2017) is often used as an alternative objective function because it provides the model with robustness to non-typical instances. However, the sum of the top-k losses can be very sensitive to outliers, especially when k is small. To address this issue, Hu et al. (2020) proposed the SoRR loss as a new learning objective, which is defined as the sum of a consecutive sequence of losses from any range after the losses are sorted. Compared to the sum of all losses and the sum of the top-k losses, the SoRR loss maintains a model’s robustness to a minority group while also reducing the model’s sensitivity to outliers. See Figure 1 in Hu et al. (2020) for an illustration of the benefit of using the SoRR loss over other ways of aggregating individual losses.

To minimize the SoRR loss, Hu et al. (2020) applied a difference-of-convex algorithm (DCA) (Tao and An, 1997; An and Tao, 2005), which linearizes the second convex component and solves the resulting subproblem using the stochastic subgradient method. DCA has been well studied in the literature, but when both components are non-smooth, as in our problem, only asymptotic convergence results are available. To establish the total number of iterations needed to find an \epsilon-critical point in a non-asymptotic sense, most existing studies had to assume that at least one of the components is differentiable, which is not the case in this paper. Using the approximate gradient method presented in this paper, one can find a nearly \epsilon-critical point of the SoRR loss optimization problem in \tilde{O}(\epsilon^{-6}) iterations.

2 Related Works

The pAUC has been studied for decades (McClish, 1989; Thompson and Zucchini, 1989; Jiang et al., 1996; Yang and Ying, 2022). However, most studies focused on its estimation (Dodd and Pepe, 2003) and its application as a performance measure, while only a few studies were devoted to numerical algorithms for optimizing the pAUC. Efficient optimization methods have been developed for maximizing the AUC and the multiclass AUC by Ying et al. (2016) and Yang et al. (2021a), but they cannot be applied to the pAUC. Besides the boosting method (Komori and Eguchi, 2010) and the cutting plane algorithm (Narasimhan and Agarwal, 2013b; Narasimhan and Agarwal, 2013a, 2017) mentioned in the previous section, Ueda and Fujino (2018); Yang et al. (2022, 2021b); Zhu et al. (2022) developed surrogate optimization techniques that directly maximize a smooth approximation of the pAUC or the two-way pAUC (Yang et al., 2019). However, their approaches can only be applied when the FPR range starts from exactly zero. In contrast, our method allows the FPR range to start from any value between zero and one. Wang and Chang (2011) and Ricamato and Tortorella (2011) developed algorithms that use the pAUC as a criterion for creating a linear combination of multiple existing classifiers, whereas we directly train a classifier using the pAUC.

DC optimization has been studied since the 1950s (Alexandroff, 1950; Hartman, 1959). We refer interested readers to Tuy (1995); Tao and An (1998, 1997); Pang et al. (2017); Le Thi and Dinh (2018), and the references therein. The actively studied numerical methods for solving a DC program include DCA (Tao and An, 1998, 1997; An and Tao, 2005; Souza et al., 2016), which is also known as the concave-convex procedure (Yuille and Rangarajan, 2003; Sriperumbudur and Lanckriet, 2009; Lipp and Boyd, 2016), the proximal DCA (Sun et al., 2003; Moudafi and Maingé, 2006; Moudafi, 2008; An and Nam, 2017), and direct gradient methods (Khamaru and Wainwright, 2018). However, when the two convex components are both non-smooth, the existing methods have only asymptotic convergence results, except for the method by Abbaszadehpeivasti et al. (2021), who considered a stopping criterion different from ours. When at least one component is smooth, non-asymptotic convergence rates have been established with and without the Kurdyka-Łojasiewicz (KL) condition (Souza et al., 2016; Artacho et al., 2018; Wen et al., 2018; An and Nam, 2017; Khamaru and Wainwright, 2018).

The algorithms mentioned above are deterministic and require processing the entire dataset per iteration. Stochastic algorithms that process only a small data sample per iteration have been studied (Mairal, 2013; Nitanda and Suzuki, 2017; Le Thi et al., 2017; Deng and Lan, 2020; He et al., 2021). However, they all assume smoothness of at least one of the two convex components in the DC program. The stochastic methods of Xu et al. (2019); Thi et al. (2021); An et al. (2019) can be applied when both components are non-smooth, but they require an unbiased stochastic estimator of the gradient and/or the value of the two components, which is not available in the DC formulation of the pAUC maximization problem in this paper.

The technique most related to our work is the smoothing method based on the Moreau envelope (Ellaia, 1984; Gabay, 1982; Hiriart-Urruty, 1985, 1991; Sun and Sun, 2021; Moudafi, 2021). Our work is motivated by Sun and Sun (2021); Moudafi (2021), but an important difference is that they studied deterministic methods and assumed either that one function is smooth or that the proximal-point subproblems can be solved exactly, neither of which we assume. On the other hand, Sun and Sun (2021); Moudafi (2021) consider a more general problem and study fundamental properties of the smoothed function, such as its Lipschitz smoothness and how its stationary points correspond to those of the original problem. We mainly focus on partial AUC optimization, which has a special structure that we can exploit when solving the proximal-point subproblems. Additionally, Sun and Sun (2021) developed an algorithm for problems with linear equality constraints, which we do not consider in this paper.

3 Preliminary

We consider a classical binary classification problem, where the goal is to build a predictive model that predicts a binary label y\in\{1,-1\} based on a feature vector {\mathbf{x}}\in\mathbb{R}^{p}. Let h_{\mathbf{w}}:\mathbb{R}^{p}\rightarrow\mathbb{R} be the predictive model parameterized by a vector {\mathbf{w}}\in\mathbb{R}^{d}, which produces a score h_{\mathbf{w}}({\mathbf{x}}) for {\mathbf{x}}. Then {\mathbf{x}} is classified as positive (y=1) if h_{\mathbf{w}}({\mathbf{x}}) is above a chosen threshold and classified as negative (y=-1) otherwise.

Let {\mathcal{X}}_{+}=\{{\mathbf{x}}_{i}^{+}\}_{i=1}^{N_{+}} and {\mathcal{X}}_{-}=\{{\mathbf{x}}_{i}^{-}\}_{i=1}^{N_{-}} be the sets of feature vectors of positive and negative training data, respectively. The problem of learning h_{\mathbf{w}} through maximizing its empirical AUC on the training data can be formulated as

\max_{{\mathbf{w}}}\frac{1}{N_{+}N_{-}}\sum_{i=1}^{N_{+}}\sum_{j=1}^{N_{-}}\mathbf{1}(h_{{\mathbf{w}}}({\mathbf{x}}_{i}^{+})>h_{{\mathbf{w}}}({\mathbf{x}}_{j}^{-})), (1)

where \mathbf{1}(\cdot) is the indicator function, which equals one if the inequality inside the parentheses holds and equals zero otherwise. As discussed in the introduction, the pAUC can be a better performance measure of h_{{\mathbf{w}}} than the AUC. Consider two FPRs \alpha and \beta with 0\leq\alpha<\beta\leq 1. For simplicity of exposition, we assume N_{-}\alpha and N_{-}\beta are both integers. Let m=N_{-}\alpha and n=N_{-}\beta. The problem of maximizing the empirical pAUC with FPR between \alpha and \beta can be formulated as

\max_{{\mathbf{w}}}\frac{1}{N_{+}(n-m)}\sum_{i=1}^{N_{+}}\sum_{j=m+1}^{n}\mathbf{1}(h_{{\mathbf{w}}}({\mathbf{x}}_{i}^{+})>h_{{\mathbf{w}}}({\mathbf{x}}_{[j]}^{-})), (2)

where [j] denotes the index of the jth largest coordinate in the vector (h_{{\mathbf{w}}}({\mathbf{x}}_{j}^{-}))_{j=1}^{N_{-}}, with ties broken arbitrarily. Note that N_{+}(n-m) in (2) is a normalizer that makes the objective value lie between zero and one. Solving (2) is challenging due to discontinuity. Let \ell:\mathbb{R}\rightarrow\mathbb{R} be a differentiable non-increasing loss function. Problem (2) can be approximated by the loss minimization problem

\min_{{\mathbf{w}}}\frac{1}{N_{+}(n-m)}\sum_{i=1}^{N_{+}}\sum_{j=m+1}^{n}\ell(h_{{\mathbf{w}}}({\mathbf{x}}_{i}^{+})-h_{{\mathbf{w}}}({\mathbf{x}}_{[j]}^{-})). (3)
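To make (2) and (3) concrete, the following minimal NumPy sketch evaluates the empirical pAUC and its surrogate for a given set of scores, using the logistic loss \ell(z)=\log(1+\exp(-z)) that is also used in Section 6. The score arrays and all names are illustrative and not part of the paper's implementation.

```python
import numpy as np

def pauc_and_surrogate(scores_pos, scores_neg, alpha, beta):
    """Empirical pAUC (2) and its surrogate loss (3) for given model scores."""
    n_minus = len(scores_neg)
    m, n = int(n_minus * alpha), int(n_minus * beta)
    # negatives sorted by descending score; positions m..n-1 hold ranks m+1,...,n
    top_neg = np.sort(scores_neg)[::-1][m:n]
    diff = scores_pos[:, None] - top_neg[None, :]       # h_w(x_i^+) - h_w(x_[j]^-)
    pauc = np.mean(diff > 0)                            # objective of (2)
    surrogate = np.mean(np.log1p(np.exp(-diff)))        # objective of (3), logistic loss
    return pauc, surrogate

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=50)    # hypothetical scores of positive examples
neg = rng.normal(0.0, 1.0, size=200)   # hypothetical scores of negative examples
print(pauc_and_surrogate(pos, neg, alpha=0.05, beta=0.5))
```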

To facilitate the discussion, we first introduce a few notations. Given a vector S=\left(s_{i}\right)_{i=1}^{N}\in\mathbb{R}^{N} and an integer l with 0\leq l\leq N, the sum of the top-l values in S is

\phi_{l}(S):=\sum_{j=1}^{l}s_{[j]}, (4)

where [j] denotes the index of the jth largest coordinate in S with ties broken arbitrarily. For integers l_{1} and l_{2} with 0\leq l_{1}<l_{2}\leq N, \phi_{l_{2}}(S)-\phi_{l_{1}}(S) is the sum from the (l_{1}+1)th to the l_{2}th (inclusive) largest coordinates of S, also called a sum of ranked range (SoRR). In addition, we define vectors

S_{i}({\mathbf{w}}):=\left(s_{ij}({\mathbf{w}})\right)_{j=1}^{N_{-}}

for i=1,\dots,N_{+}, where s_{ij}({\mathbf{w}}):=\ell(h_{{\mathbf{w}}}({\mathbf{x}}_{i}^{+})-h_{{\mathbf{w}}}({\mathbf{x}}_{j}^{-})) for i=1,\dots,N_{+} and j=1,\dots,N_{-}. Since \ell is non-increasing, the jth largest coordinate of S_{i}({\mathbf{w}}) is \ell(h_{{\mathbf{w}}}({\mathbf{x}}_{i}^{+})-h_{{\mathbf{w}}}({\mathbf{x}}_{[j]}^{-})). As a result, we have, for i=1,\dots,N_{+},

\sum_{j=m+1}^{n}\ell(h_{{\mathbf{w}}}({\mathbf{x}}_{i}^{+})-h_{{\mathbf{w}}}({\mathbf{x}}_{[j]}^{-}))=\phi_{n}(S_{i}({\mathbf{w}}))-\phi_{m}(S_{i}({\mathbf{w}})).

Hence, after dropping the normalizer, (3) can be equivalently written as

F^{*}=\min_{{\mathbf{w}}}\left\{F({\mathbf{w}}):=f^{n}({\mathbf{w}})-f^{m}({\mathbf{w}})\right\}, (5)

where

f^{l}({\mathbf{w}})=\sum_{i=1}^{N_{+}}\phi_{l}(S_{i}({\mathbf{w}}))\quad\text{ for }l=m,n. (6)

Next, we introduce an interesting special case of (5), namely, the problem of minimizing the SoRR loss. We still consider a supervised learning problem, but the target y\in\mathbb{R} does not need to be binary. We want to predict y based on a feature vector {\mathbf{x}}\in\mathbb{R}^{p} using h_{\mathbf{w}}({\mathbf{x}}). With a little abuse of notation, we measure the discrepancy between h_{\mathbf{w}}({\mathbf{x}}) and y by \ell(h_{\mathbf{w}}({\mathbf{x}}),y), where \ell:\mathbb{R}^{2}\rightarrow\mathbb{R}_{+} is a loss function. We consider learning the model’s parameter {\mathbf{w}} from a training set \mathcal{D}=\{({\mathbf{x}}_{j},y_{j})\}_{j=1}^{N}, where {\mathbf{x}}_{j}\in\mathbb{R}^{p} and y_{j}\in\mathbb{R} for j=1,\dots,N, by minimizing the SoRR loss. More specifically, we define the vector

S({\mathbf{w}})=(s_{j}({\mathbf{w}}))_{j=1}^{N},

where s_{j}({\mathbf{w}}):=\ell(h_{\mathbf{w}}({\mathbf{x}}_{j}),y_{j}), j=1,\dots,N. Recall (4). For any integers m and n with 0\leq m<n\leq N, the problem of minimizing the SoRR loss with a range from m+1 to n is formulated as \min_{{\mathbf{w}}}\left\{\phi_{n}(S({\mathbf{w}}))-\phi_{m}(S({\mathbf{w}}))\right\}, which is an instance of (5) with

f^{l}({\mathbf{w}})=\phi_{l}(S({\mathbf{w}}))\text{ for }l=m,n. (7)

If we view S_{i}({\mathbf{w}}) and S({\mathbf{w}}) only as vector-valued functions of {\mathbf{w}} and ignore how they are constructed from the data, (7) is a special case of (6) with N_{+}=1 and N_{-}=N.
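As a minimal illustration of (7), the SoRR loss can be computed by sorting the per-sample losses once and summing the ranked range; the short NumPy sketch below uses hypothetical loss values and illustrative names.

```python
import numpy as np

def sorr_loss(per_sample_losses, m, n):
    """phi_n(S(w)) - phi_m(S(w)): sum of the (m+1)-th through n-th largest losses."""
    s = np.sort(per_sample_losses)[::-1]   # losses in descending order
    return s[m:n].sum()                    # equals s[:n].sum() - s[:m].sum()

losses = np.array([0.3, 2.1, 0.9, 5.0, 1.4])   # hypothetical s_j(w), j = 1,...,N
print(sorr_loss(losses, m=1, n=3))             # 3.5: drops the top loss 5.0 (an outlier)
```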

4 Nearly Critical Point and Moreau Envelope Smoothing

We first develop a stochastic algorithm for (5) with f^{l} defined in (6). To do so, we make the following assumptions, which are satisfied by many smooth h_{{\mathbf{w}}}’s and \ell’s.

Assumption 1

(a) s_{ij}({\mathbf{w}}) is smooth and there exists L\geq 0 such that \left\|\nabla s_{ij}({\mathbf{w}})-\nabla s_{ij}({\mathbf{v}})\right\|\leq L\|{\mathbf{w}}-{\mathbf{v}}\| for any {\mathbf{w}},{\mathbf{v}}\in\mathbb{R}^{d}, i=1,\dots,N_{+} and j=1,\dots,N_{-}, where \|\cdot\| represents the Euclidean norm throughout the paper. (b) There exists B\geq 0 such that \|\nabla s_{ij}({\mathbf{w}})\|\leq B for any {\mathbf{w}}\in\mathbb{R}^{d}, i=1,\dots,N_{+} and j=1,\dots,N_{-}. (c) F^{*}>-\infty.

Given f:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\}, the subdifferential of f is

\partial f({\mathbf{w}})=\left\{{\boldsymbol{\xi}}\in\mathbb{R}^{d}\,\middle|\,f({\mathbf{v}})\geq f({\mathbf{w}})+{\boldsymbol{\xi}}^{\top}({\mathbf{v}}-{\mathbf{w}})+o(\|{\mathbf{v}}-{\mathbf{w}}\|_{2}),~{\mathbf{v}}\rightarrow{\mathbf{w}}\right\},

where each element in \partial f({\mathbf{w}}) is called a subgradient of f at {\mathbf{w}}. We say f is \rho-weakly convex for some \rho\geq 0 if f({\mathbf{v}})\geq f({\mathbf{w}})+\left\langle{\boldsymbol{\xi}},{\mathbf{v}}-{\mathbf{w}}\right\rangle-\frac{\rho}{2}\|{\mathbf{v}}-{\mathbf{w}}\|^{2} for any {\mathbf{v}} and {\mathbf{w}} and {\boldsymbol{\xi}}\in\partial f({\mathbf{w}}), and we say f is \rho-strongly convex for some \rho\geq 0 if f({\mathbf{v}})\geq f({\mathbf{w}})+\left\langle{\boldsymbol{\xi}},{\mathbf{v}}-{\mathbf{w}}\right\rangle+\frac{\rho}{2}\|{\mathbf{v}}-{\mathbf{w}}\|^{2} for any {\mathbf{v}} and {\mathbf{w}} and {\boldsymbol{\xi}}\in\partial f({\mathbf{w}}). It is known that, if f is \rho-weakly convex, then f({\mathbf{w}})+\frac{1}{2\mu}\|{\mathbf{w}}\|^{2} is a (\mu^{-1}-\rho)-strongly convex function when \mu^{-1}>\rho.
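For completeness, the last claim follows directly from the two definitions. Let g({\mathbf{w}}):=f({\mathbf{w}})+\frac{1}{2\mu}\|{\mathbf{w}}\|^{2}, so that every element of \partial g({\mathbf{w}}) has the form {\boldsymbol{\xi}}+\mu^{-1}{\mathbf{w}} with {\boldsymbol{\xi}}\in\partial f({\mathbf{w}}). Combining the \rho-weak convexity of f with the identity \frac{1}{2\mu}\|{\mathbf{v}}\|^{2}=\frac{1}{2\mu}\|{\mathbf{w}}\|^{2}+\left\langle\mu^{-1}{\mathbf{w}},{\mathbf{v}}-{\mathbf{w}}\right\rangle+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2} gives

g({\mathbf{v}})\geq g({\mathbf{w}})+\left\langle{\boldsymbol{\xi}}+\mu^{-1}{\mathbf{w}},{\mathbf{v}}-{\mathbf{w}}\right\rangle+\frac{\mu^{-1}-\rho}{2}\|{\mathbf{v}}-{\mathbf{w}}\|^{2},

which is exactly the (\mu^{-1}-\rho)-strong convexity of g.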

Under Assumption 1, \phi_{l}(S_{i}({\mathbf{w}})) is a composite of the closed convex function \phi_{l} and the smooth map S_{i}({\mathbf{w}}). According to Lemma 4.2 in Drusvyatskiy and Paquette (2019), we have the following lemma.

Lemma 1

Under Assumption 1, f^{m}({\mathbf{w}}) and f^{n}({\mathbf{w}}) in (6) are \rho-weakly convex with \rho:=N_{+}N_{-}L.

To solve (5) numerically, we need to overcome the following challenges. (i) F({\mathbf{w}}) is non-convex even if each s_{ij}({\mathbf{w}}) is convex. In fact, F({\mathbf{w}}) is a DC function because, by Lemma 1, we can represent F({\mathbf{w}}) as the difference of the convex functions f^{n}({\mathbf{w}})+\frac{1}{2\mu}\|{\mathbf{w}}\|^{2} and f^{m}({\mathbf{w}})+\frac{1}{2\mu}\|{\mathbf{w}}\|^{2} with \mu^{-1}>\rho. (ii) F({\mathbf{w}}) is non-smooth due to \phi_{l}, so finding an approximate critical point (defined below) of F({\mathbf{w}}) is difficult. (iii) Computing the exact subgradient of f^{l}({\mathbf{w}}) for l=m,n requires processing N_{+}N_{-} data pairs, which is computationally expensive for a large dataset.

Because of challenges (i) and (ii), we have to consider a reasonable goal when solving (5). We say {\mathbf{w}}\in\mathbb{R}^{d} is a critical point of (5) if \mathbf{0}\in\partial f^{n}({\mathbf{w}})-\partial f^{m}({\mathbf{w}}). Given \epsilon>0, we say {\mathbf{w}}\in\mathbb{R}^{d} is an \epsilon-critical point of (5) if there exists {\boldsymbol{\xi}}\in\partial f^{n}({\mathbf{w}})-\partial f^{m}({\mathbf{w}}) such that \|{\boldsymbol{\xi}}\|\leq\epsilon. A critical point can only be achieved asymptotically in general. A stronger notion than criticality is (directional) stationarity, which can also only be achieved asymptotically (Pang et al., 2017). Moreover, no algorithm exists that can find an \epsilon-critical point within finitely many iterations unless at least one of f^{m} and f^{n} is smooth (e.g., Xu et al., 2019). Since f^{m} and f^{n} are both non-smooth, we have to consider a weaker but achievable target, namely a nearly \epsilon-critical point defined below.

Definition 2

Given \epsilon>0, we say {\mathbf{w}}\in\mathbb{R}^{d} is a nearly \epsilon-critical point of (5) if there exist {\boldsymbol{\xi}}, {\mathbf{w}}^{\prime}, and {\mathbf{w}}^{\prime\prime}\in\mathbb{R}^{d} such that {\boldsymbol{\xi}}\in\partial f^{n}({\mathbf{w}}^{\prime})-\partial f^{m}({\mathbf{w}}^{\prime\prime}) and \max\left\{\|{\boldsymbol{\xi}}\|,\|{\mathbf{w}}-{\mathbf{w}}^{\prime}\|,\|{\mathbf{w}}-{\mathbf{w}}^{\prime\prime}\|\right\}\leq\epsilon.

Definition 2 reduces to the \epsilon-stationary point defined by Sun and Sun (2021); Moudafi (2021) when {\mathbf{w}} equals {\mathbf{w}}^{\prime} or {\mathbf{w}}^{\prime\prime}. However, obtaining their \epsilon-stationary point requires exactly solving the proximal mapping of f^{m} or f^{n}, while finding a nearly \epsilon-critical point only requires solving the proximal mapping inexactly. When {\mathbf{w}} is generated by a stochastic algorithm, we also call {\mathbf{w}} a nearly \epsilon-critical point if it satisfies Definition 2 with each \|\cdot\| replaced by \mathbb{E}\|\cdot\|.

Motivated by Sun and Sun (2021) and Moudafi (2021), we approximate the non-smooth F({\mathbf{w}}) by a smooth function using Moreau envelopes. Given a proper, \rho-weakly convex and closed function f on \mathbb{R}^{d}, the Moreau envelope of f with the smoothing parameter \mu\in(0,\rho^{-1}) is defined as

f_{\mu}({\mathbf{w}}):=\min_{{\mathbf{v}}}\Big\{f({\mathbf{v}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2}\Big\} (8)

and the proximal mapping of f is defined as

{\mathbf{v}}_{\mu f}({\mathbf{w}}):=\operatorname*{arg\,min}_{{\mathbf{v}}}\Big\{f({\mathbf{v}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2}\Big\}. (9)

Note that {\mathbf{v}}_{\mu f}({\mathbf{w}}) is unique because the minimization above is strongly convex. Standard results show that f_{\mu}({\mathbf{w}}) is smooth with \nabla f_{\mu}({\mathbf{w}})=\mu^{-1}({\mathbf{w}}-{\mathbf{v}}_{\mu f}({\mathbf{w}})) and that {\mathbf{v}}_{\mu f}({\mathbf{w}}) is (1-\mu\rho)^{-1}-Lipschitz continuous. See Proposition 13.37 in Rockafellar and Wets (2009) and Proposition 1 in Sun and Sun (2021). Hence, using the Moreau envelope, we can construct a smooth approximation of (5) as follows:

\min_{{\mathbf{w}}}\left\{F^{\mu}({\mathbf{w}}):=f_{\mu}^{n}({\mathbf{w}})-f_{\mu}^{m}({\mathbf{w}})\right\}. (10)
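The gradient formula \nabla f_{\mu}({\mathbf{w}})=\mu^{-1}({\mathbf{w}}-{\mathbf{v}}_{\mu f}({\mathbf{w}})) can be sanity-checked numerically on a simple non-smooth convex function. The toy sketch below uses f(w)=|w| in one dimension, whose proximal mapping is soft-thresholding, and compares the formula against a finite-difference gradient of the Moreau envelope; it is only an illustration and not part of the proposed algorithm.

```python
import numpy as np

def prox_abs(w, mu):
    # proximal mapping (9) of f(v) = |v|: soft-thresholding
    return np.sign(w) * max(abs(w) - mu, 0.0)

def moreau_env_abs(w, mu):
    # Moreau envelope (8) of f(v) = |v|, evaluated at its minimizer
    v = prox_abs(w, mu)
    return abs(v) + (v - w) ** 2 / (2 * mu)

w, mu, eps = 0.7, 0.25, 1e-6
grad_formula = (w - prox_abs(w, mu)) / mu
grad_fd = (moreau_env_abs(w + eps, mu) - moreau_env_abs(w - eps, mu)) / (2 * eps)
print(grad_formula, grad_fd)   # both are approximately 1.0 since |w| > mu here
```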

Function F^{\mu} has the following properties. The first property is shown in Sun and Sun (2021). We give the proof of the second in Appendix B.

Lemma 3

Suppose Assumption 1 holds and \mu^{-1}>\rho with \rho defined in Lemma 1. The following claims hold:

  1. \nabla F^{\mu}({\mathbf{w}})=\mu^{-1}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})) and it is L_{\mu}-Lipschitz continuous with L_{\mu}=\frac{2}{\mu-\mu^{2}\rho}.

  2. If \bar{\mathbf{v}} and {\mathbf{w}} are two random vectors such that \mathbb{E}\|\nabla F^{\mu}({\mathbf{w}})\|^{2}\leq\min\{1,\mu^{-2}\}\epsilon^{2}/4 and \mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})\|^{2}\leq\epsilon^{2}/4 for either l=m or l=n, then \bar{\mathbf{v}} is a nearly \epsilon-critical point of (5).

Since F^{\mu} is smooth, we can directly apply a first-order method for smooth non-convex optimization to (10). To do so, we need to evaluate \nabla F^{\mu}({\mathbf{w}}), which requires computing {\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}) and {\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}), i.e., exactly solving (9) with f=f^{m} and f=f^{n}, respectively. Computing the subgradients of f^{m} and f^{n} requires processing N_{+}N_{-} data pairs, which is costly. Unfortunately, the standard approach of sampling over data pairs does not produce unbiased stochastic subgradients of f^{m} and f^{n} due to the composite structure \phi_{l}(S_{i}({\mathbf{w}})). In the next section, we discuss how to overcome this challenge and approximate {\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}) and {\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}), which leads to an efficient algorithm for (10).

5 Algorithm for pAUC Optimization

Consider (10) with f^{l} defined in (6) for l=m and n. To avoid processing N_{+}N_{-} data points, one method is to introduce dual variables {\mathbf{p}}_{i}=(p_{ij})_{j=1}^{N_{-}} for i=1,\dots,N_{+} and formulate f^{l} as

f^{l}({\mathbf{w}})=\max\limits_{{\mathbf{p}}_{i}\in\mathcal{P}^{l},i=1,\dots,N_{+}}\left\{\sum_{i=1}^{N_{+}}\sum_{j=1}^{N_{-}}p_{ij}s_{ij}({\mathbf{w}})\right\}, (11)

where \mathcal{P}^{l}=\{{\mathbf{p}}\in\mathbb{R}^{N_{-}}|\sum_{j=1}^{N_{-}}p_{j}=l,~p_{j}\in[0,1]\}. Then (10) can be reformulated as a min-max problem and solved by a primal-dual stochastic gradient method (e.g., Rafique et al. (2021)). However, the maximization in (11) involves N_{+}N_{-} decision variables and N_{+} equality constraints, so the per-iteration cost is still O(N_{+}N_{-}) even after using stochastic gradients.

To further reduce the per-iteration cost, we take the dual form of the maximization in (11) (see Lemma 10 in Appendix B) and formulate f^{l} as

f^{l}({\mathbf{w}})=\min_{{\boldsymbol{\lambda}}}\bigg\{g^{l}({\mathbf{w}},{\boldsymbol{\lambda}}):=l\mathbf{1}^{\top}{\boldsymbol{\lambda}}+\sum_{i=1}^{N_{+}}\sum^{N_{-}}_{j=1}[s_{ij}({\mathbf{w}})-\lambda_{i}]_{+}\bigg\}, (12)

where {\boldsymbol{\lambda}}=(\lambda_{1},\dots,\lambda_{N_{+}}). Hence, (9) with f=f^{l} for l=m and n can be reformulated as

\min_{{\mathbf{v}},{\boldsymbol{\lambda}}}\bigg\{g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2}\bigg\}. (13)
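Formulation (12) is the standard variational representation of a top-l sum: for each i, \phi_{l}(S_{i}({\mathbf{w}}))=\min_{\lambda_{i}}\{l\lambda_{i}+\sum_{j=1}^{N_{-}}[s_{ij}({\mathbf{w}})-\lambda_{i}]_{+}\}, with the minimum attained at \lambda_{i} equal to the l-th largest entry of S_{i}({\mathbf{w}}). A quick numerical check of this identity for a single i, with random illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.normal(size=30)                 # hypothetical s_ij(w), j = 1,...,N_-, for one fixed i
l = 5

top_l_sum = np.sort(s)[::-1][:l].sum()  # phi_l(S_i(w))
lam = np.linspace(s.min() - 1.0, s.max() + 1.0, 10001)
dual = l * lam + np.maximum(s[None, :] - lam[:, None], 0.0).sum(axis=1)
print(top_l_sum, dual.min())            # agree up to the grid resolution
```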

Note that the objective function of (13) is jointly convex in {\mathbf{v}} and {\boldsymbol{\lambda}} when \mu^{-1}>\rho=N_{+}N_{-}L (see Lemma 9 in Appendix B). Thanks to formulation (13), we can construct stochastic subgradients of g^{l} and apply coordinate updates to {\boldsymbol{\lambda}} by sampling indices i and j, which significantly reduces the computational cost when N_{+} and N_{-} are both large. We present this standard stochastic block coordinate descent (SBCD) method for solving (13) in Algorithm 1 and present its convergence property as follows.

Algorithm 1 Stochastic Block Coordinate Descent for (13): (\bar{\mathbf{v}},\bar{\boldsymbol{\lambda}})=SBCD({\mathbf{w}},{\boldsymbol{\lambda}},T,\mu,l)
1: Input: Initial solution ({\mathbf{w}},{\boldsymbol{\lambda}}), the number of iterations T, \mu>0, an integer l>0 and sample sizes I and J.
2: Set ({\mathbf{v}}^{(0)},{\boldsymbol{\lambda}}^{(0)})=({\mathbf{w}},{\boldsymbol{\lambda}}) and choose (\eta_{t},\theta_{t})_{t=0}^{T-1}.
3: for t=0 to T-1 do
4:     Sample \mathcal{I}_{t}\subset\{1,\dots,N_{+}\} with |\mathcal{I}_{t}|=I and sample \mathcal{J}_{t}\subset\{1,\dots,N_{-}\} with |\mathcal{J}_{t}|=J.
5:     Compute the stochastic subgradient w.r.t. {\mathbf{v}}:
G_{\mathbf{v}}^{(t)}=\frac{N_{+}N_{-}}{IJ}\sum\limits_{i\in\mathcal{I}_{t}}\sum\limits_{j\in\mathcal{J}_{t}}\nabla s_{ij}({\mathbf{v}}^{(t)})\mathbf{1}\left(s_{ij}({\mathbf{v}}^{(t)})>\lambda_{i}^{(t)}\right)
6:     Proximal stochastic subgradient update on {\mathbf{v}}:
{\mathbf{v}}^{(t+1)}=\operatorname*{arg\,min}_{{\mathbf{v}}}(G_{\mathbf{v}}^{(t)})^{\top}{\mathbf{v}}+\frac{\|{\mathbf{v}}-{\mathbf{w}}\|^{2}}{2\mu}+\frac{\|{\mathbf{v}}-{\mathbf{v}}^{(t)}\|^{2}}{2\eta_{t}} (14)
7:     Compute the stochastic subgradient w.r.t. \lambda_{i} for i\in\mathcal{I}_{t}:
G_{\lambda_{i}}^{(t)}=l-\frac{N_{-}}{J}\sum\limits_{j\in\mathcal{J}_{t}}\mathbf{1}\left(s_{ij}({\mathbf{v}}^{(t)})>\lambda_{i}^{(t)}\right)~\text{ for }i\in\mathcal{I}_{t}
8:     Stochastic block subgradient update on \lambda_{i} for i\in\mathcal{I}_{t}:
\lambda_{i}^{(t+1)}=\lambda_{i}^{(t)}-\theta_{t}G_{\lambda_{i}}^{(t)}~\text{ for }i\in\mathcal{I}_{t}\quad\text{ and }\quad\lambda_{i}^{(t+1)}=\lambda_{i}^{(t)}~\text{ for }i\notin\mathcal{I}_{t}. (15)
9: end for
10: Output: (\bar{\mathbf{v}},\bar{\boldsymbol{\lambda}})=\frac{1}{T}\sum_{t=0}^{T-1}({\mathbf{v}}^{(t)},{\boldsymbol{\lambda}}^{(t)}).
Proposition 4

Suppose Assumption 1 holds and \mu^{-1}>\rho=N_{+}N_{-}L, \theta_{t}=\frac{\textup{dist}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})}{\sqrt{IT}N_{-}} and \eta_{t}=\frac{\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})-{\mathbf{w}}\|}{N_{+}N_{-}B\sqrt{T}} for any t in Algorithm 1. It holds that

\left(\frac{1}{2\mu}-\frac{\rho}{2}\right)\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})\|^{2}\leq\frac{N_{+}N_{-}}{\sqrt{IT}}\textup{dist}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})+\frac{N_{+}N_{-}B}{2\sqrt{T}}\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})-{\mathbf{w}}\|+\frac{\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})-{\mathbf{w}}\|^{2}}{2\mu T},

where \Lambda^{*}=\operatorname*{arg\,min}\limits_{{\boldsymbol{\lambda}}}g^{l}({\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}),{\boldsymbol{\lambda}}).
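To make the updates concrete, below is a minimal NumPy sketch of Algorithm 1 for a linear model h_{\mathbf{w}}({\mathbf{x}})={\mathbf{w}}^{\top}{\mathbf{x}} with the logistic loss, so that s_{ij}({\mathbf{v}})=\log(1+\exp(-{\mathbf{v}}^{\top}({\mathbf{x}}_{i}^{+}-{\mathbf{x}}_{j}^{-}))). It uses the closed form of the proximal step (14), namely {\mathbf{v}}^{(t+1)}=({\mathbf{w}}/\mu+{\mathbf{v}}^{(t)}/\eta_{t}-G_{\mathbf{v}}^{(t)})/(1/\mu+1/\eta_{t}). The fixed step sizes and the array names are illustrative assumptions; Proposition 4 specifies the theoretical choices of \eta_{t} and \theta_{t}.

```python
import numpy as np

def sbcd(w, lam, T, mu, l, Xp, Xn, I=8, J=32, eta=1e-3, theta=1e-3, seed=0):
    """Sketch of Algorithm 1 (SBCD) for h_w(x) = w @ x with the logistic loss.
    Assumes I <= N_+ and J <= N_-; Xp has shape (N_+, d), Xn has shape (N_-, d)."""
    rng = np.random.default_rng(seed)
    Np, Nn = len(Xp), len(Xn)
    v, lam = w.copy(), lam.copy()
    v_bar, lam_bar = np.zeros_like(w), np.zeros_like(lam)
    for _ in range(T):
        v_bar += v / T                                   # average of v^(t), t = 0,...,T-1
        lam_bar += lam / T
        idx_i = rng.choice(Np, size=I, replace=False)    # I_t
        idx_j = rng.choice(Nn, size=J, replace=False)    # J_t
        dx = Xp[idx_i][:, None, :] - Xn[idx_j][None, :, :]   # x_i^+ - x_j^-, shape (I, J, d)
        margin = dx @ v                                      # shape (I, J)
        s = np.log1p(np.exp(-margin))                        # s_ij(v^(t))
        active = (s > lam[idx_i][:, None]).astype(float)     # 1(s_ij(v^(t)) > lambda_i^(t))
        grad_s = -dx / (1.0 + np.exp(margin))[..., None]     # nabla s_ij(v^(t))
        # stochastic subgradient w.r.t. v and closed-form proximal step (14)
        G_v = (Np * Nn / (I * J)) * (active[..., None] * grad_s).sum(axis=(0, 1))
        v = (w / mu + v / eta - G_v) / (1.0 / mu + 1.0 / eta)
        # stochastic block subgradient update (15) on the sampled coordinates of lambda
        G_lam = l - (Nn / J) * active.sum(axis=1)
        lam[idx_i] = lam[idx_i] - theta * G_lam
    return v_bar, lam_bar
```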

Using Algorithm 1 to compute an approximation of {\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}) for l=m and n, and thus an approximation of \nabla F^{\mu}({\mathbf{w}}), we can apply an approximate gradient descent (AGD) method to (10) and find a nearly \epsilon-critical point of (5) according to Lemma 3. We present the AGD method in Algorithm 2 and its convergence property as follows.

Algorithm 2 Approximate Gradient Descent (AGD) for (10)
1: Input: Initial solutions ({\mathbf{w}}^{(0)},\bar{\boldsymbol{\lambda}}_{m}^{(0)},\bar{\boldsymbol{\lambda}}_{n}^{(0)}), the number of iterations K, \mu^{-1}>\rho, \gamma>0, m=\alpha N_{-} and n=\beta N_{-}.
2: for k=0 to K-1 do
3:     (\bar{\mathbf{v}}_{m}^{(k)},\bar{\boldsymbol{\lambda}}_{m}^{(k+1)})=SBCD({\mathbf{w}}^{(k)},\bar{\boldsymbol{\lambda}}_{m}^{(k)},T_{k},\mu,m)
4:     (\bar{\mathbf{v}}_{n}^{(k)},\bar{\boldsymbol{\lambda}}_{n}^{(k+1)})=SBCD({\mathbf{w}}^{(k)},\bar{\boldsymbol{\lambda}}_{n}^{(k)},T_{k},\mu,n)
5:     {\mathbf{w}}^{(k+1)}={\mathbf{w}}^{(k)}-\gamma\mu^{-1}(\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)})
6: end for
7: Output: \bar{\mathbf{v}}_{n}^{(\bar{k})} with \bar{k} sampled from \{0,\dots,K-1\}.
Theorem 5

Suppose Assumption 1 holds and Algorithm 1 is called in iteration k of Algorithm 2 with parameters \mu^{-1}>\rho=N_{+}N_{-}L, \theta_{t}=\frac{\text{dist}(\bar{\boldsymbol{\lambda}}^{(k)},\Lambda_{k}^{*})}{\sqrt{IT_{k}}N_{-}}, \eta_{t}=\frac{\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|}{N_{+}N_{-}B\sqrt{T_{k}}} for any t, and

T_{k}=\max\left\{\frac{144N_{+}^{2}N_{-}^{2}D_{l}^{2}(k+1)^{2}}{I(\mu^{-1}-\rho)^{2}},~\frac{4N_{+}^{2}N_{-}^{2}\mu^{2}l^{2}B^{2}(k+1)^{2}}{(\mu^{-1}-\rho)^{2}},~\frac{6\mu l^{2}B^{2}(k+1)}{2(\mu^{-1}-\rho)^{2}}\right\}

where \Lambda_{k}^{*}=\operatorname*{arg\,min}\limits_{{\boldsymbol{\lambda}}}g^{l}({\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)}),{\boldsymbol{\lambda}}) and

D_{l}:=\max\left\{\text{dist}(\bar{\boldsymbol{\lambda}}_{l}^{(0)},\Lambda_{0}^{*}),\quad\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)+\frac{\mu l^{2}B^{2}}{2}+N_{+}B+\frac{N_{+}B}{1-\mu\rho}\left(\frac{2\gamma}{\mu}+\gamma nB+\gamma mB\right)\right\}. (16)

Then \bar{\mathbf{v}}_{n}^{(\bar{k})} is a nearly \epsilon-critical point of (5) with f^{l} defined in (6), with K no more than

K=\max\left\{\frac{16\mu^{2}}{\gamma\min\{1,\mu^{2}\}\epsilon^{2}}\left(F({\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(0)}))-F^{*}\right),\frac{96}{\min\{1,\mu^{2}\}\epsilon^{2}}\log\left(\frac{96}{\min\{1,\mu^{2}\}\epsilon^{2}}\right)\right\}. (17)

According to Theorem 5, to find a nearly \epsilon-critical point of (5), we need K=\tilde{O}(\epsilon^{-2}) iterations of Algorithm 2 and \sum_{k=0}^{K-1}T_{k}=O(K^{3})=\tilde{O}(\epsilon^{-6}) iterations of Algorithm 1 in total across all calls.
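To make the overall procedure concrete, below is a minimal sketch of the outer loop of Algorithm 2 that reuses the sbcd() sketch above as the inner solver. The schedule T_{k}=50(k+1)^{2} mirrors the choice used in the experiments in Section 6; the remaining names and defaults are illustrative.

```python
import numpy as np

def agd_sbcd(w0, Xp, Xn, alpha, beta, mu, gamma, K, seed=0):
    """Sketch of Algorithm 2 (AGD), calling the sbcd() sketch above twice per iteration."""
    Nn = len(Xn)
    m, n = int(alpha * Nn), int(beta * Nn)
    w = w0.copy()
    lam_m, lam_n = np.zeros(len(Xp)), np.zeros(len(Xp))
    for k in range(K):
        T_k = 50 * (k + 1) ** 2                      # T_k = Theta(k^2), as in Theorem 5
        v_m, lam_m = sbcd(w, lam_m, T_k, mu, m, Xp, Xn, seed=seed + 2 * k)
        v_n, lam_n = sbcd(w, lam_n, T_k, mu, n, Xp, Xn, seed=seed + 2 * k + 1)
        w = w - gamma * (v_m - v_n) / mu             # approximate gradient step on F^mu
    # the analysis samples k-bar uniformly from {0,...,K-1}; here we simply return the last v_n
    return w, v_n
```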

Remark 6 (Challenges in proving Theorem 5)

Suppose we can set T_{k} in lines 3 and 4 of Algorithm 2 appropriately such that the approximation errors \mathbb{E}\|\bar{\mathbf{v}}_{m}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})\|^{2} and \mathbb{E}\|\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2} are both O(1/k). We can then prove that Algorithm 2 finds a nearly \epsilon-critical point within K=\tilde{O}(\epsilon^{-2}) iterations and the total complexity is \sum_{k=0}^{K-1}T_{k}. This is a standard idea. However, by Proposition 4, such a T_{k} must be \Theta(k^{2}(\text{dist}^{2}(\bar{\boldsymbol{\lambda}}^{(k)},\Lambda_{k}^{*})+\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|^{2})), where \text{dist}^{2}(\bar{\boldsymbol{\lambda}}^{(k)},\Lambda_{k}^{*}) and \|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|^{2} also change with k, so it is not clear what the order of T_{k} is. By a novel proof technique based on the (linear) error-bound condition of g^{l}({\mathbf{w}},{\boldsymbol{\lambda}}) with respect to {\boldsymbol{\lambda}}, we prove that both \text{dist}^{2}(\bar{\boldsymbol{\lambda}}^{(k)},\Lambda_{k}^{*}) and \|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|^{2} are O(1) (see (27) and (30) in Appendix D), which ensures that T_{k}=\Theta(k^{2}) and thus the total complexity is \sum_{k=0}^{K-1}T_{k}=O(K^{3})=\tilde{O}(\epsilon^{-6}).

Remark 7 (Analysis of sensitivity of the algorithm to \mu)

For the interesting case where \rho\geq 1, we have \mu<1/\rho<1. In this case, we can derive that the order of dependency on \mu is O(\frac{1}{\epsilon^{6}\mu^{6}}) and the optimal choice of \mu is thus \Theta(\rho^{-1}), e.g., \mu=\frac{1}{2\rho}, which leads to a complexity of O(\rho^{6}/\epsilon^{6}). We present the convergence curves and the test performance of our method when applied to training a linear model with different values of \mu in Appendix E.7.

The technique in the previous sections can be directly applied to minimize the SoRR loss, which is formulated as (5) but with f^{l} defined in (7). Due to space limitations, we present the algorithm for minimizing the SoRR loss and its convergence result in Appendix A.

6 Numerical Experiments

In this section, we demonstrate the effectiveness of our algorithm AGD-SBCD for pAUC maximization and SoRR loss minimization problems (see Appendix E.1 for details). All experiments are conducted in Python and Matlab on a computer with a 2GHz quad-core Intel Core i5 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. All datasets we used are publicly available and contain no personally identifiable information or offensive content.

Figure 1: Results for Partial AUC Maximization on D1 and D2. (Results for D3, D4 and D5 are shown in Figure 3 in Appendix E.3.)

6.1 Partial AUC Maximization

For maximizing the pAUC, we focus on the large-scale imbalanced medical dataset CheXpert (Irvin et al., 2019), which is licensed under CC-BY-SA and has 224,316 images. We construct five binary classification tasks with the logistic loss \ell(z)=\log(1+\exp(-z)) for predicting five common diseases: Cardiomegaly (D1), Edema (D2), Consolidation (D3), Atelectasis (D4), and P. Effusion (D5).

For the comparison of training convergence, we consider different methods for optimizing the partial AUC. We compare with three baselines: DCA (Hu et al., 2020) (see Appendix E.4 for details), proximal DCA (Wen et al., 2018) (see Appendix E.5 for details), and SVMpAUC-tight (Narasimhan and Agarwal, 2013b). Since DCA, proximal DCA and SVMpAUC-tight cannot be applied to deep neural networks, we focus on a linear model and use a pre-trained deep neural network to extract fixed 1024-dimensional feature vectors. The deep neural network was trained by optimizing the cross-entropy loss following the same setting as in Yuan et al. (2020).

For the three baselines and our algorithm, the process of tuning the hyper-parameters is explained in Appendix E.2. In Figure 1 and Figure 3 in Appendix E.3, we show how the training loss (the objective value of (3)) and the normalized partial AUC on the training data change with the number of epochs. We observe that, for all five diseases, our algorithm converges much faster than DCA and proximal DCA and achieves a better partial AUC.

Table 1: Comparison on the CheXpert training data. From left to right, the columns are the tasks, the pAUCs returned by SVMpAUC-tight, the CPU time (in seconds) SVMpAUC-tight takes, the CPU and GPU time AGD-SBCD uses to exceed SVMpAUC-tight’s pAUCs, the final pAUCs returned by AGD-SBCD, and the CPU and GPU time (in seconds) AGD-SBCD takes to return the final pAUCs.
Tasks | SVMpAUC-tight: pAUC | SVMpAUC-tight: CPU time | AGD-SBCD: CPU time (epoch) to outperform | AGD-SBCD: GPU time to outperform | AGD-SBCD: pAUC | AGD-SBCD: CPU time | AGD-SBCD: GPU time
D1 | 0.6259 | 95.14 | 2.91 (0.23) | 1.85 | 0.7005±0.0003 | 118.32 | 82.13
D2 | 0.5860 | 90.83 | 3.36 (0.23) | 1.93 | 0.7214±0.0024 | 415.66 | 247.29
D3 | 0.3745 | 90.56 | 3.26 (0.23) | 1.84 | 0.4910±0.0006 | 181.70 | 104.55
D4 | 0.3895 | 89.64 | 10.09 (0.63) | 8.38 | 0.4616±0.0006 | 187.36 | 158.14
D5 | 0.7267 | 90.86 | 3.97 (0.23) | 1.89 | 0.8272±0.0001 | 238.10 | 142.91

The comparison between our AGD-SBCD and SVMpAUC-tight on the training data is shown in Table 1. As shown in the second to fifth columns of Table 1, our algorithm needs only a few seconds to exceed the pAUCs that SVMpAUC-tight takes more than one minute to return. As shown in the sixth to eighth columns, our algorithm eventually improves the pAUC by at least 12\% compared with SVMpAUC-tight. DCA and proximal DCA are not included in the table because they compute deterministic subgradients, which leads to runtimes significantly longer than those of the other two methods. We plot the convergence curves of the training pAUC over GPU time for DCA and our algorithm in Figure 7 in Appendix E.8.

To compare the testing performance, we consider deep neural networks in addition to the linear model. For the linear model, we still compare with DCA and SVMpAUC-tight. For deep neural networks, we compare with the naive mini-batch based method (MB) (Kar et al., 2014) and methods based on different optimization objectives, including the cross-entropy loss (CE) and the AUC min-max margin loss (AUC-M) (Yuan et al., 2021). We learn the model DenseNet121 from scratch with the CheXpert training data split into train/val=9:1 and the CheXpert validation dataset as the testing set, which has 234 samples. The range of FPRs in the pAUC is [0.05, 0.5]. For optimizing CE, we use the standard Adam optimizer. For optimizing AUC-M, we use the PESG optimizer in Yuan et al. (2021). We run each method for 10 epochs, and the learning rate (c in AGD-SBCD) of all methods is tuned in \{10^{-5}\sim 10^{0}\}. The mini-batch size is 32. For AGD-SBCD, T_{k} is set to 50(k+1)^{2}, \mu is set to \frac{10^{3}}{N_{+}N_{-}} and \gamma is tuned in \{0.1,1,2\}\times 10^{3}/(N_{+}N_{-}). For MB, the learning rate decays in the same way as in Kar et al. (2014). For CE and AUC-M, the learning rate decays 10-fold after every 5 epochs. For AUC-M, we tune the hyperparameter \gamma in \{100, 500, 1000\}. For each method, the validation set is used to tune the hyperparameters and select the best model across all iterations. The pAUCs on the testing set are reported in Table 2, which shows that our algorithm performs the best for all diseases. The complete ROC curves on the testing set are shown in Appendix E.3.

Table 2: The pAUCs with FPRs between 0.05 and 0.5 on the testing sets from the CheXpert data.
Method | D1 | D2 | D3 | D4 | D5
Linear Model: SVMpAUC-tight | 0.6538±0.0042 | 0.6038±0.0009 | 0.6946±0.0020 | 0.6521±0.0006 | 0.7994±0.0004
Linear Model: DCA | 0.6636±0.0093 | 0.8078±0.0030 | 0.7427±0.0257 | 0.6169±0.0208 | 0.8371±0.0022
Linear Model: Proximal DCA | 0.6615±0.0103 | 0.8041±0.0033 | 0.7064±0.0253 | 0.5945±0.0266 | 0.8352±0.0023
Linear Model: AGD-SBCD | 0.6721±0.0081 | 0.8257±0.0025 | 0.8016±0.0075 | 0.6340±0.0165 | 0.8500±0.0017
Deep Model: MB | 0.7510±0.0248 | 0.8197±0.0127 | 0.6339±0.0328 | 0.5698±0.0343 | 0.8461±0.0188
Deep Model: CE | 0.6994±0.0453 | 0.8075±0.0244 | 0.7673±0.0266 | 0.6499±0.0184 | 0.7884±0.0080
Deep Model: AUC-M | 0.7403±0.0339 | 0.8002±0.0274 | 0.8533±0.0469 | 0.7420±0.0277 | 0.8504±0.0065
Deep Model: AGD-SBCD | 0.7535±0.0255 | 0.8345±0.0130 | 0.8689±0.0184 | 0.7520±0.0079 | 0.8513±0.0107

For deep neural networks, we also learn the model ResNet-20 from scratch on the CIFAR-10-LT and Tiny-ImageNet-200-LT datasets, which are constructed similarly to those in Yang et al. (2021b). Details about these two datasets are summarized in Appendix E.6. The range of FPRs in the pAUC is [0.05, 0.5]. The process of tuning hyperparameters is the same as that for CheXpert. The pAUCs on the testing set are reported in Table 3, which shows that our algorithm performs the best on these two long-tailed datasets.

Table 3: The pAUCs with FPRs between 0.05 and 0.5 on the testing sets from the CIFAR-10-LT and Tiny-ImageNet-200-LT datasets (deep model).
Dataset | MB | CE | AUC-M | AGD-SBCD
CIFAR-10-LT | 0.9337±0.0043 | 0.9016±0.0137 | 0.9323±0.0055 | 0.9408±0.0084
Tiny-ImageNet-200-LT | 0.6445±0.0214 | 0.6549±0.008 | 0.6497±0.009 | 0.6594±0.0192

7 Conclusion

Most existing methods for optimizing the pAUC are deterministic and only have asymptotic convergence properties. We formulate pAUC optimization as a non-smooth DC program and develop a stochastic subgradient method based on the Moreau envelope smoothing technique. We show that our method finds a nearly \epsilon-critical point in \tilde{O}(\epsilon^{-6}) iterations and demonstrate its performance numerically. A limitation of this paper is the smoothness assumption on s_{ij}({\mathbf{w}}), which does not hold for some models, e.g., neural networks using ReLU activation functions. Extending our results to non-smooth models is left as future work.

Acknowledgements

This work was jointly supported by the University of Iowa Jumpstarting Tomorrow Program and NSF award 2147253. T. Yang was also supported by NSF awards 2110545 and 1844403 and an Amazon research award. We thank Zhishuai Guo, Zhuoning Yuan and Qi Qi for discussions about processing the image dataset.

References

  • Abbaszadehpeivasti et al. (2021) Hadi Abbaszadehpeivasti, Etienne de Klerk, and Moslem Zamani. On the rate of convergence of the difference-of-convex algorithm (dca). arXiv preprint arXiv:2109.13566, 2021.
  • Alexandroff (1950) AD Alexandroff. Surfaces represented by the difference of convex functions. In Doklady Akademii Nauk SSSR (NS), volume 72, pages 613–616, 1950.
  • An and Tao (2005) Le Thi Hoai An and Pham Dinh Tao. The dc (difference of convex functions) programming and dca revisited with dc models of real world nonconvex optimization problems. Annals of operations research, 133(1):23–46, 2005.
  • An et al. (2019) Le Thi Hoai An, Huynh Van Ngai, Pham Dinh Tao, and Luu Hoang Phuc Hau. Stochastic difference-of-convex algorithms for solving nonconvex optimization problems. arXiv preprint arXiv:1911.04334, 2019.
  • An and Nam (2017) Nguyen Thai An and Nguyen Mau Nam. Convergence analysis of a proximal point algorithm for minimizing differences of functions. Optimization, 66(1):129–147, 2017.
  • Artacho et al. (2018) Francisco J Aragón Artacho, Ronan MT Fleming, and Phan T Vuong. Accelerating the dc algorithm for smooth functions. Mathematical Programming, 169(1):95–118, 2018.
  • Bradley (1997) Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.
  • Bradley (2014) Andrew P Bradley. Half-auc for the evaluation of sensitive or specific classifiers. Pattern Recognition Letters, 38:93–98, 2014.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
  • Davis and Drusvyatskiy (2018) Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate o(k1/4)o(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.
  • Davis and Grimmer (2019) Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.
  • Deng and Lan (2020) Qi Deng and Chenghao Lan. Efficiency of coordinate descent methods for structured nonconvex optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 74–89. Springer, 2020.
  • Dodd and Pepe (2003) Lori E Dodd and Margaret S Pepe. Partial auc estimation and regression. Biometrics, 59(3):614–623, 2003.
  • Drusvyatskiy and Paquette (2019) Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1):503–558, 2019.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Ellaia (1984) Rachid Ellaia. Contribution à l’analyse et l’optimisation de différence de fonctions convexes. PhD thesis, Université Paul Sabatier, 1984.
  • Fan et al. (2017) Yanbo Fan, Siwei Lyu, Yiming Ying, and Bao-Gang Hu. Learning with average top-k loss. arXiv preprint arXiv:1705.08826, 2017.
  • Gabay (1982) D. Gabay. Minimizing the difference of two convex functions. I. Algorithms based on exact regularization. 1982.
  • Hanley and McNeil (1982) James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
  • Hartman (1959) Philip Hartman. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.
  • He et al. (2021) Lulu He, Jimin Ye, et al. Accelerated proximal stochastic variance reduction for dc optimization. Neural Computing and Applications, 33(20):13163–13181, 2021.
  • Hiriart-Urruty (1985) J-B Hiriart-Urruty. Generalized differentiability/duality and optimization for problems dealing with differences of convex functions. In Convexity and duality in optimization, pages 37–70. Springer, 1985.
  • Hiriart-Urruty (1991) J-B Hiriart-Urruty. How to regularize a difference of convex functions. Journal of mathematical analysis and applications, 162(1):196–209, 1991.
  • Hu et al. (2020) Shu Hu, Yiming Ying, Siwei Lyu, et al. Learning by minimizing the sum of ranked range. Advances in Neural Information Processing Systems, 33, 2020.
  • Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison, 2019.
  • Jiang et al. (1996) Yulei Jiang, Charles E Metz, and Robert M Nishikawa. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology, 201(3):745–750, 1996.
  • Kar et al. (2014) Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/7d04bbbe5494ae9d2f5a76aa1c00fa2f-Paper.pdf.
  • Khamaru and Wainwright (2018) Koulik Khamaru and Martin Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. In International Conference on Machine Learning, pages 2601–2610. PMLR, 2018.
  • Komori and Eguchi (2010) Osamu Komori and Shinto Eguchi. A boosting method for maximizing the partial area under the roc curve. BMC bioinformatics, 11(1):1–17, 2010.
  • Le Thi and Dinh (2018) Hoai An Le Thi and Tao Pham Dinh. Dc programming and dca: thirty years of developments. Mathematical Programming, 169(1):5–68, 2018.
  • Le Thi et al. (2017) Hoai An Le Thi, Hoai Minh Le, Duy Nhat Phan, and Bach Tran. Stochastic dca for the large-sum of non-convex functions problem and its application to group variable selection in classification. In International Conference on Machine Learning, pages 3394–3403. PMLR, 2017.
  • Lipp and Boyd (2016) Thomas Lipp and Stephen Boyd. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
  • Liu et al. (2021) Mingrui Liu, Hassan Rafique, Qihang Lin, and Tianbao Yang. First-order convergence theory for weakly-convex-weakly-concave min-max problems. Journal of Machine Learning Research, 22(169):1–34, 2021.
  • Ma et al. (2013) Hua Ma, Andriy I Bandos, Howard E Rockette, and David Gur. On use of partial area under the roc curve for evaluation of diagnostic performance. Statistics in medicine, 32(20):3449–3458, 2013.
  • Ma et al. (2020) Runchao Ma, Qihang Lin, and Tianbao Yang. Quadratically regularized subgradient methods for weakly convex optimization with weakly convex constraints. In International Conference on Machine Learning, pages 6554–6564. PMLR, 2020.
  • Mairal (2013) Julien Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. arXiv preprint arXiv:1306.4650, 2013.
  • McClish (1989) Donna Katzman McClish. Analyzing a portion of the roc curve. Medical decision making, 9(3):190–195, 1989.
  • Moudafi (2008) Abdellatif Moudafi. On the difference of two maximal monotone operators: Regularization and algorithmic approaches. Applied mathematics and computation, 202(2):446–452, 2008.
  • Moudafi (2021) Abdellatif Moudafi. A complete smooth regularization of dc optimization problems. 2021.
  • Moudafi and Maingé (2006) Abdellatif Moudafi and Paul-Emile Maingé. On the convergence of an approximate proximal method for dc functions. Journal of computational Mathematics, pages 475–480, 2006.
  • Narasimhan and Agarwal (2013a) Harikrishna Narasimhan and Shivani Agarwal. A structural svm based approach for optimizing partial auc. In International Conference on Machine Learning, pages 516–524. PMLR, 2013a.
  • Narasimhan and Agarwal (2013b) Harikrishna Narasimhan and Shivani Agarwal. SVMpAUCtight\text{SVM}_{\text{pAUC}}^{\text{tight}}: a new support vector method for optimizing partial auc based on a tight convex upper bound. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 167–175, 2013b.
  • Narasimhan and Agarwal (2017) Harikrishna Narasimhan and Shivani Agarwal. Support vector algorithms for optimizing the partial area under the roc curve. Neural computation, 29(7):1919–1963, 2017.
  • Nitanda and Suzuki (2017) Atsushi Nitanda and Taiji Suzuki. Stochastic difference of convex algorithm and its application to training deep boltzmann machines. In Artificial intelligence and statistics, pages 470–478. PMLR, 2017.
  • Pang et al. (2017) Jong-Shi Pang, Meisam Razaviyayn, and Alberth Alvarado. Computing b-stationary points of nonsmooth dc programs. Mathematics of Operations Research, 42(1):95–118, 2017.
  • Rafique et al. (2021) Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Weakly-convex–concave min–max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, pages 1–35, 2021.
  • Ricamato and Tortorella (2011) Maria Teresa Ricamato and Francesco Tortorella. Partial auc maximization in a linear combination of dichotomizers. Pattern Recognition, 44(10-11):2669–2677, 2011.
  • Rockafellar and Wets (2009) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • Shalev-Shwartz and Wexler (2016) Shai Shalev-Shwartz and Yonatan Wexler. Minimizing the maximal loss: How and why. In International Conference on Machine Learning, pages 793–801. PMLR, 2016.
  • Souza et al. (2016) João Carlos O Souza, Paulo Roberto Oliveira, and Antoine Soubeyran. Global convergence of a proximal linearized algorithm for difference of convex functions. Optimization Letters, 10(7):1529–1539, 2016.
  • Sriperumbudur and Lanckriet (2009) Bharath K Sriperumbudur and Gert RG Lanckriet. On the convergence of the concave-convex procedure. In Nips, volume 9, pages 1759–1767. Citeseer, 2009.
  • Sun and Sun (2021) Kaizhao Sun and Xu Andy Sun. Algorithms for difference-of-convex (dc) programs based on difference-of-moreau-envelopes smoothing. arXiv preprint arXiv:2104.01470, 2021.
  • Sun et al. (2003) Wen-yu Sun, Raimundo JB Sampaio, and MAB Candido. Proximal point algorithm for minimization of dc function. Journal of computational Mathematics, pages 451–462, 2003.
  • Tao and An (1997) Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to dc programming: theory, algorithms and applications. Acta mathematica vietnamica, 22(1):289–355, 1997.
  • Tao and An (1998) Pham Dinh Tao and Le Thi Hoai An. A dc optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.
  • Thi et al. (2021) Hoai An Le Thi, Hoang Phuc Hau Luu, and Tao Pham Dinh. Online stochastic dca with applications to principal component analysis. arXiv preprint arXiv:2108.02300, 2021.
  • Thompson and Zucchini (1989) Mary Lou Thompson and Walter Zucchini. On the statistical analysis of roc curves. Statistics in medicine, 8(10):1277–1290, 1989.
  • Tuy (1995) Hoang Tuy. Dc optimization: theory, methods and algorithms. In Handbook of global optimization, pages 149–216. Springer, 1995.
  • Ueda and Fujino (2018) Naonori Ueda and Akinori Fujino. Partial auc maximization via nonlinear scoring functions. arXiv preprint arXiv:1806.04838, 2018.
  • Vapnik (1992) Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
  • Wang and Chang (2011) Zhanfeng Wang and Yuan-Chin Ivan Chang. Marker selection via maximizing the partial area under the roc curve of linear risk scores. Biostatistics, 12(2):369–385, 2011.
  • Wen et al. (2018) Bo Wen, Xiaojun Chen, and Ting Kei Pong. A proximal difference-of-convex algorithm with extrapolation. Computational optimization and applications, 69(2):297–324, 2018.
  • Xu et al. (2019) Yi Xu, Qi Qi, Qihang Lin, Rong Jin, and Tianbao Yang. Stochastic optimization for dc functions and non-smooth non-convex regularizers with non-asymptotic convergence. In International Conference on Machine Learning, pages 6942–6951. PMLR, 2019.
  • Yang et al. (2019) Hanfang Yang, Kun Lu, Xiang Lyu, and Feifang Hu. Two-way partial auc and its properties. Statistical methods in medical research, 28(1):184–195, 2019.
  • Yang and Ying (2022) Tianbao Yang and Yiming Ying. Auc maximization in the era of big data and ai: A survey. ACM Comput. Surv., August 2022.
  • Yang et al. (2021a) Zhiyong Yang, Qianqian Xu, Shilong Bao, Xiaochun Cao, and Qingming Huang. Learning with multiclass auc: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a.
  • Yang et al. (2021b) Zhiyong Yang, Qianqian Xu, Shilong Bao, Yuan He, Xiaochun Cao, and Qingming Huang. When all we need is a piece of the pie: A generic framework for optimizing two-way partial auc. In International Conference on Machine Learning, pages 11820–11829. PMLR, 2021b.
  • Yang et al. (2022) Zhiyong Yang, Qianqian Xu, Shilong Bao, Yuan He, Xiaochun Cao, and Qingming Huang. Optimizing two-way partial auc with an end-to-end framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Ying et al. (2016) Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online auc maximization. Advances in neural information processing systems, 29, 2016.
  • Yuan et al. (2020) Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Robust deep auc maximization: A new surrogate loss and empirical studies on medical image classification. arXiv preprint arXiv:2012.03173, 2020.
  • Yuan et al. (2021) Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 3020–3029. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00303. URL https://doi.org/10.1109/ICCV48922.2021.00303.
  • Yuille and Rangarajan (2003) Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural computation, 15(4):915–936, 2003.
  • Zhu et al. (2022) Dixian Zhu, Gang Li, Bokun Wang, Xiaodong Wu, and Tianbao Yang. When auc meets dro: Optimizing partial auc for deep learning with non-convex convergence guarantee. arXiv preprint arXiv:2203.00176, 2022.

Appendix

Appendix A Algorithm for Sum of Range Optimization

The technique in the previous sections can be directly applied to minimize the SoRR loss, which is formulated as (5) but with f^{l} defined in (7). Since (7) is a special case of (6) with N_{+}=1 and N_{-}=N, we can again formulate subproblem (9) with f=f^{l} as (13), where {\boldsymbol{\lambda}}=\lambda is a scalar. Since \lambda is a scalar, when solving (13) we no longer need block coordinate updates and only need to sample over the indices j=1,\dots,N to construct stochastic subgradients. We present the stochastic subgradient descent (SGD) method for (13) in Algorithm 3. We then apply Algorithm 2 with SBCD in lines 3 and 4 replaced by SGD. The convergence result in this case follows directly from Theorem 5.

Algorithm 3 Stochastic Subgradient Descent for SoRR: (𝐯¯,λ¯)=(\bar{\mathbf{v}},\bar{\lambda})=SGD(𝐰,λ,T,μ,l)({\mathbf{w}},\lambda,T,\mu,l)
1:Input: Initial solution ({\mathbf{w}},\lambda), the number of iterations T, \mu with \mu^{-1}>\rho, an integer l>0 and sample size J.
2:Set (𝐯(0),λ(0))=(𝐰,λ)({\mathbf{v}}^{(0)},\lambda^{(0)})=({\mathbf{w}},\lambda) and choose (ηt,θt)t=0T1(\eta_{t},\theta_{t})_{t=0}^{T-1}.
3:for t=0t=0 to T1T-1 do
4:     Sample 𝒥t{1,,N}\mathcal{J}_{t}\subset\{1,\dots,N\} with |𝒥t|=J|\mathcal{J}_{t}|=J.
5:     Compute stochastic subgradient w.r.t. 𝐯{\mathbf{v}}:
G_{\mathbf{v}}^{(t)}=\frac{N}{J}\sum\limits_{j\in\mathcal{J}_{t}}\nabla s_{j}({\mathbf{v}}^{(t)})\mathbf{1}\left(s_{j}({\mathbf{v}}^{(t)})>\lambda^{(t)}\right)
6:     Proximal stochastic subgradient update on 𝐯{\mathbf{v}}:
𝐯(t+1)=argmin𝐯(G𝐯(t))𝐯+𝐯𝐰22μ+𝐯𝐯(t)22ηt{\mathbf{v}}^{(t+1)}=\operatorname*{arg\,min}_{{\mathbf{v}}}(G_{\mathbf{v}}^{(t)})^{\top}{\mathbf{v}}+\frac{\|{\mathbf{v}}-{\mathbf{w}}\|^{2}}{2\mu}+\frac{\|{\mathbf{v}}-{\mathbf{v}}^{(t)}\|^{2}}{2\eta_{t}}
7:     Compute stochastic subgradient w.r.t. λ\lambda:
Gλ(t)=lNJj𝒥t𝟏(sj(𝐯(t))>λ(t))G_{\lambda}^{(t)}=l-\frac{N}{J}\sum\limits_{j\in\mathcal{J}_{t}}\mathbf{1}\left(s_{j}({\mathbf{v}}^{(t)})>\lambda^{(t)}\right)
8:     Stochastic subgradient update on λ\lambda:
λ(t+1)=λ(t)ηtGλ(t)\lambda^{(t+1)}=\lambda^{(t)}-\eta_{t}G_{\lambda}^{(t)}
9:end for
10:Output: (𝐯¯,λ¯)=1Tt=0T1(𝐯(t),λ(t))(\bar{\mathbf{v}},\bar{\lambda})=\frac{1}{T}\sum_{t=0}^{T-1}({\mathbf{v}}^{(t)},\lambda^{(t)}).
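For concreteness, below is a minimal Python sketch of Algorithm 3 for a linear model, assuming per-sample logistic losses s_{j}({\mathbf{v}})=\log(1+\exp(-y_{j}{\mathbf{x}}_{j}^{\top}{\mathbf{v}})) with labels y_{j}\in\{-1,+1\} and constant step sizes; the model, loss, and step sizes are illustrative assumptions rather than the exact experimental configuration.

import numpy as np

def sgd_sorr(w, lam, T, mu, l, X, y, J=100, eta=1e-3, rng=0):
    """Sketch of Algorithm 3 (SGD for subproblem (13)) under the assumptions above."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    v, lam_t = w.astype(float).copy(), float(lam)
    v_bar, lam_bar = np.zeros_like(v), 0.0
    for t in range(T):
        v_bar += v / T                                   # running average of v^(t)
        lam_bar += lam_t / T                             # running average of lambda^(t)
        idx = rng.choice(N, size=J, replace=False)       # sample J_t
        margins = y[idx] * (X[idx] @ v)
        s = np.logaddexp(0.0, -margins)                  # s_j(v^(t))
        grad_s = (-y[idx] / (1.0 + np.exp(margins)))[:, None] * X[idx]
        active = (s > lam_t).astype(float)               # 1(s_j(v^(t)) > lambda^(t))
        G_v = (N / J) * (grad_s * active[:, None]).sum(axis=0)
        # closed-form minimizer of G_v^T v' + ||v'-w||^2/(2 mu) + ||v'-v^(t)||^2/(2 eta)
        v = (v / eta + w / mu - G_v) / (1.0 / eta + 1.0 / mu)
        G_lam = l - (N / J) * active.sum()               # stochastic subgradient w.r.t. lambda
        lam_t = lam_t - eta * G_lam
    return v_bar, lam_bar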
Corollary 8

Suppose Assumption 1 holds with N_{+}=1, N_{-}=N, and s_{ij}({\mathbf{w}})=s_{j}({\mathbf{w}}), and that SBCD in Algorithm 2 is replaced by SGD (Algorithm 3). Suppose \theta_{t}, \eta_{t}, and T_{k} are set the same as in Theorem 5 when Algorithm 3 is called in iteration k of Algorithm 2. Then \bar{\mathbf{v}}_{n}^{(\bar{k})} is a nearly \epsilon-critical point of (5) with f^{l} defined in (7), with K no more than (17).
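For completeness, the outer loop that Corollary 8 refers to can be summarized by the conceptual sketch below, where approx_prox is a placeholder for the inner solver (Algorithm 1 or Algorithm 3) and warm-starting of the dual variables is omitted; this is a sketch under these assumptions, not the exact implementation.

import numpy as np

def agd_outer_loop(w0, approx_prox, m, n, mu, gamma, K):
    """Approximate the two proximal points and step along the gradient of the
    smoothed objective F^mu, which equals (v_{mu f^m}(w) - v_{mu f^n}(w)) / mu."""
    w = np.asarray(w0, dtype=float).copy()
    for k in range(K):
        v_m = approx_prox(w, level=m, mu=mu, k=k)   # approximates v_{mu f^m}(w^(k))
        v_n = approx_prox(w, level=n, mu=mu, k=k)   # approximates v_{mu f^n}(w^(k))
        w = w - (gamma / mu) * (v_m - v_n)          # w^(k+1) = w^(k) - gamma/mu * (v_m - v_n)
    return w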

Appendix B Proofs of Lemmas

Proof.[of Lemma 3] We will only prove the second conclusion in Lemma 3 since the first conclusion has been shown in Proposition 1 in Sun and Sun (2021).

Suppose 𝔼Fμ(𝐰)2min{1,μ2}ϵ2/4\mathbb{E}\|\nabla F^{\mu}({\mathbf{w}})\|^{2}\leq\min\{1,\mu^{-2}\}\epsilon^{2}/4. By the first conclusion in Lemma 3, we must have μ2𝔼𝐯μfm(𝐰)𝐯μfn(𝐰)2min{1,μ2}ϵ2/4\mu^{-2}\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})\|^{2}\leq\min\{1,\mu^{-2}\}\epsilon^{2}/4. By the optimality conditions satisfied by 𝐯μfm(𝐰){\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}) and 𝐯μfn(𝐰){\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}), there exist 𝝃mfm(𝐯μfm(𝐰)){\boldsymbol{\xi}}_{m}\in\partial f^{m}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})) and 𝝃nfn(𝐯μfn(𝐰)){\boldsymbol{\xi}}_{n}\in\partial f^{n}({\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})) such that

𝝃m+μ1(𝐯μfm(𝐰)𝐰)=𝟎=𝝃n+μ1(𝐯μfn(𝐰)𝐰),{\boldsymbol{\xi}}_{m}+\mu^{-1}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})-{\mathbf{w}})=\mathbf{0}={\boldsymbol{\xi}}_{n}+\mu^{-1}({\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})-{\mathbf{w}}),

which implies 𝝃=𝝃n𝝃m=μ1(𝐯μfm(𝐰)𝐯μfn(𝐰))fn(𝐯μfn(𝐰))fm(𝐯μfm(𝐰)){\boldsymbol{\xi}}={\boldsymbol{\xi}}_{n}-{\boldsymbol{\xi}}_{m}=\mu^{-1}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}))\in\partial f^{n}({\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}))-\partial f^{m}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})) and 𝔼𝝃𝔼𝝃2ϵ/2\mathbb{E}\|{\boldsymbol{\xi}}\|\leq\sqrt{\mathbb{E}\|{\boldsymbol{\xi}}\|^{2}}\leq\epsilon/2. Suppose 𝔼𝐯¯𝐯μfn(𝐰)2ϵ2/4\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})\|^{2}\leq\epsilon^{2}/4. We have 𝔼𝐯¯𝐯μfn(𝐰)ϵ/2\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})\|\leq\epsilon/2 and 𝔼𝐯¯𝐯μfm(𝐰)𝔼𝐯¯𝐯μfn(𝐰)+𝔼𝐯μfn(𝐰)𝐯μfm(𝐰)ϵ\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})\|\leq\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})\|+\mathbb{E}\|{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})\|\leq\epsilon. Hence, 𝐯¯\bar{\mathbf{v}} satisfies Definition 2 with 𝐰=𝐯μfn(𝐰){\mathbf{w}}^{\prime}={\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}) and 𝐰′′=𝐯μfm(𝐰){\mathbf{w}}^{\prime\prime}={\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}). Suppose 𝔼𝐯¯𝐯μfm(𝐰)2ϵ2/4\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})\|^{2}\leq\epsilon^{2}/4. The conclusion can be also proved similarly. \Box

We first present the following lemma which is similar to Lemma 4.2 in Drusvyatskiy and Paquette (2019).

Lemma 9

Suppose Assumption 1 holds. For any 𝐯{\mathbf{v}}, 𝛌{\boldsymbol{\lambda}}, 𝐯{\mathbf{v}}^{\prime}, 𝛌{\boldsymbol{\lambda}}^{\prime}, and (𝛏𝐯,𝛏𝛌)gl(𝐯,𝛌)({\boldsymbol{\xi}}_{{\mathbf{v}}},{\boldsymbol{\xi}}_{{\boldsymbol{\lambda}}})\in\partial g^{l}({\mathbf{v}}^{\prime},{\boldsymbol{\lambda}}^{\prime}), we have

gl(𝐯,𝝀)gl(𝐯,𝝀)+𝝃𝐯(𝐯𝐯)+𝝃𝝀(𝝀𝝀)ρ2𝐯𝐯2,g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})\geq g^{l}({\mathbf{v}}^{\prime},{\boldsymbol{\lambda}}^{\prime})+{\boldsymbol{\xi}}_{{\mathbf{v}}}^{\top}({\mathbf{v}}-{\mathbf{v}}^{\prime})+{\boldsymbol{\xi}}_{{\boldsymbol{\lambda}}}^{\top}({\boldsymbol{\lambda}}-{\boldsymbol{\lambda}}^{\prime})-\frac{\rho}{2}\|{\mathbf{v}}-{\mathbf{v}}^{\prime}\|^{2},

where ρ=N+NL\rho=N_{+}N_{-}L. Moreover, gl(𝐯,𝛌)+12μ𝐯𝐰2g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2} is jointly convex in 𝛌{\boldsymbol{\lambda}} and 𝐯{\mathbf{v}} for any 𝐰{\mathbf{w}} and any μ1>ρ\mu^{-1}>\rho.

Proof. By Assumption 1, we have

sij(𝐯)sij(𝐯)sij(𝐯)(𝐯𝐯)L2𝐯𝐯2.\displaystyle s_{ij}({\mathbf{v}})-s_{ij}({\mathbf{v}}^{\prime})\geq\nabla s_{ij}({\mathbf{v}}^{\prime})^{\top}({\mathbf{v}}-{\mathbf{v}}^{\prime})-\frac{L}{2}\|{\mathbf{v}}-{\mathbf{v}}^{\prime}\|^{2}. (18)

Let ξij[sij(𝐯)λi]+\xi_{ij}\in\partial[s_{ij}({\mathbf{v}}^{\prime})-\lambda_{i}^{\prime}]_{+}. We have

gl(𝐯,𝝀)\displaystyle g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})
=\displaystyle= l𝟏𝝀+i=1N+j=1N[sij(𝐯)λi]+\displaystyle l\mathbf{1}^{\top}{\boldsymbol{\lambda}}+\sum_{i=1}^{N_{+}}\sum^{N_{-}}_{j=1}[s_{ij}({\mathbf{v}})-\lambda_{i}]_{+}
\displaystyle\geq l𝟏𝝀+l𝟏(𝝀𝝀)+i=1N+j=1N[sij(𝐯)λi]++i=1N+j=1Nξij(sij(𝐯)sij(𝐯)λi+λi)\displaystyle l\mathbf{1}^{\top}{\boldsymbol{\lambda}}^{\prime}+l\mathbf{1}^{\top}({\boldsymbol{\lambda}}-{\boldsymbol{\lambda}}^{\prime})+\sum_{i=1}^{N_{+}}\sum^{N_{-}}_{j=1}[s_{ij}({\mathbf{v}}^{\prime})-\lambda_{i}^{\prime}]_{+}+\sum_{i=1}^{N_{+}}\sum^{N_{-}}_{j=1}\xi_{ij}(s_{ij}({\mathbf{v}})-s_{ij}({\mathbf{v}}^{\prime})-\lambda_{i}+\lambda_{i}^{\prime})
\displaystyle\geq gl(𝐯,𝝀)+𝝃𝝀(𝝀𝝀)+i=1N+j=1Nξijsij(𝐯)(𝐯𝐯)i=1N+j=1NξijL2𝐯𝐯2\displaystyle g^{l}({\mathbf{v}}^{\prime},{\boldsymbol{\lambda}}^{\prime})+{\boldsymbol{\xi}}_{{\boldsymbol{\lambda}}}^{\top}({\boldsymbol{\lambda}}-{\boldsymbol{\lambda}}^{\prime})+\sum_{i=1}^{N_{+}}\sum^{N_{-}}_{j=1}\xi_{ij}\nabla s_{ij}({\mathbf{v}}^{\prime})^{\top}({\mathbf{v}}-{\mathbf{v}}^{\prime})-\sum_{i=1}^{N_{+}}\sum^{N_{-}}_{j=1}\xi_{ij}\frac{L}{2}\|{\mathbf{v}}-{\mathbf{v}}^{\prime}\|^{2}
\displaystyle\geq gl(𝐯,𝝀)+𝝃𝐯(𝐯𝐯)+𝝃𝝀(𝝀𝝀)N+NL2𝐯𝐯2,\displaystyle g^{l}({\mathbf{v}}^{\prime},{\boldsymbol{\lambda}}^{\prime})+{\boldsymbol{\xi}}_{{\mathbf{v}}}^{\top}({\mathbf{v}}-{\mathbf{v}}^{\prime})+{\boldsymbol{\xi}}_{{\boldsymbol{\lambda}}}^{\top}({\boldsymbol{\lambda}}-{\boldsymbol{\lambda}}^{\prime})-\frac{N_{+}N_{-}L}{2}\|{\mathbf{v}}-{\mathbf{v}}^{\prime}\|^{2},

where the first inequality is by the convexity of []+[\cdot]_{+}, the second inequality is from (18) and the last inequality is by the definitions of (𝝃𝐯,𝝃𝝀)({\boldsymbol{\xi}}_{{\mathbf{v}}},{\boldsymbol{\xi}}_{{\boldsymbol{\lambda}}}) and the fact that ξij[0,1]\xi_{ij}\in[0,1].

Combining the inequality from the first conclusion with the equality 12μ𝐯𝐰2=12μ𝐯𝐯2+1μ(𝐯𝐯)(𝐯𝐰)+12μ𝐯𝐰2\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2}=\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{v}}^{\prime}\|^{2}+\frac{1}{\mu}({\mathbf{v}}-{\mathbf{v}}^{\prime})^{\top}({\mathbf{v}}^{\prime}-{\mathbf{w}})+\frac{1}{2\mu}\|{\mathbf{v}}^{\prime}-{\mathbf{w}}\|^{2} and using the fact that μ1>ρ\mu^{-1}>\rho, we can obtain

gl(𝐯,𝝀)+12μ𝐯𝐰2gl(𝐯,𝝀)+12μ𝐯𝐰2+(μ1(𝐯𝐰)+𝝃𝐯)(𝐯𝐯)+𝝃𝝀(𝝀𝝀),g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2}\geq g^{l}({\mathbf{v}}^{\prime},{\boldsymbol{\lambda}}^{\prime})+\frac{1}{2\mu}\|{\mathbf{v}}^{\prime}-{\mathbf{w}}\|^{2}+(\mu^{-1}({\mathbf{v}}^{\prime}-{\mathbf{w}})+{\boldsymbol{\xi}}_{{\mathbf{v}}})^{\top}({\mathbf{v}}-{\mathbf{v}}^{\prime})+{\boldsymbol{\xi}}_{{\boldsymbol{\lambda}}}^{\top}({\boldsymbol{\lambda}}-{\boldsymbol{\lambda}}^{\prime}),

which proves the second conclusion. \Box

Lemma 10

The dual problem of the maximization problem in (11) is the minimization problem in (12).

Proof. For i=1,\dots,N_{+}, we introduce a Lagrange multiplier \lambda_{i} for the constraint \sum_{j=1}^{N_{-}}p_{j}=l in (11). Let [z]_{-}=\min\{z,0\}, which equals -[-z]_{+}. Then, for each i, we have

max𝐩i𝒫l{j=1Npijsij(𝐰)}\displaystyle\max\limits_{{\mathbf{p}}_{i}\in\mathcal{P}^{l}}\left\{\sum_{j=1}^{N_{-}}p_{ij}s_{ij}({\mathbf{w}})\right\} =\displaystyle= minpij[0,1]jmaxλi{j=1Npijsij(𝐰)+λi(j=1Npjl)}\displaystyle-\min\limits_{p_{ij}\in[0,1]~{}\forall j}\max_{\lambda_{i}}\left\{-\sum_{j=1}^{N_{-}}p_{ij}s_{ij}({\mathbf{w}})+\lambda_{i}(\sum_{j=1}^{N_{-}}p_{j}-l)\right\}
=\displaystyle= maxλiminpij[0,1]j{lλi+j=1Npij[λisij(𝐰)]}\displaystyle-\max_{\lambda_{i}}\min\limits_{p_{ij}\in[0,1]~{}\forall j}\left\{-l\lambda_{i}+\sum_{j=1}^{N_{-}}p_{ij}[\lambda_{i}-s_{ij}({\mathbf{w}})]\right\}
=\displaystyle= maxλi{lλi+j=1N[λisij(𝐰)]}\displaystyle-\max_{\lambda_{i}}\left\{-l\lambda_{i}+\sum_{j=1}^{N_{-}}[\lambda_{i}-s_{ij}({\mathbf{w}})]_{-}\right\}
=\displaystyle= minλi{lλi+j=1N[sij(𝐰)λi]+}.\displaystyle\min_{\lambda_{i}}\left\{l\lambda_{i}+\sum_{j=1}^{N_{-}}[s_{ij}({\mathbf{w}})-\lambda_{i}]_{+}\right\}.

The conclusion is thus proved by summing up the equality above for i=1,,N+i=1,\dots,N_{+}. \Box
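The equality in Lemma 10 can be verified numerically for a single index i: the primal value of (11), i.e., the sum of the l largest scores, coincides with \min_{\lambda}\{l\lambda+\sum_{j}[s_{j}-\lambda]_{+}\}. The following snippet is a small sanity check on random, purely illustrative scores.

import numpy as np

# sanity check of Lemma 10 for one index i
rng = np.random.default_rng(0)
s = rng.normal(size=200)   # illustrative scores s_j
l = 15

primal = np.sort(s)[::-1][:l].sum()          # sum of the l largest scores

# the dual minimum is attained at any lambda between the (l+1)-th and the
# l-th largest score, so evaluating at the l-th largest score suffices
lam = np.sort(s)[::-1][l - 1]
dual = l * lam + np.maximum(s - lam, 0.0).sum()

assert np.isclose(primal, dual)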

Appendix C Proof of Proposition 4

Let 𝐯=𝐯μfl(𝐰){\mathbf{v}}^{*}={\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}) be the unique optimal solution of (8) and let si[j](𝐯)s_{i[j]}({\mathbf{v}}^{*}) be the jjth largest coordinate of Si(𝐯)S_{i}({\mathbf{v}}^{*}). It is easy to show that the set of optimal solutions of (13) is {𝐯}×Λ\{{\mathbf{v}}^{*}\}\times\Lambda^{*} where

\Lambda^{*}=\operatorname*{arg\,min}\limits_{{\boldsymbol{\lambda}}}g^{l}({\mathbf{v}}^{*},{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}=\prod_{i=1}^{N_{+}}\left(\Lambda^{*}_{i}:=\left[s_{i[l+1]}({\mathbf{v}}^{*}),s_{i[l]}({\mathbf{v}}^{*})\right]\right). (19)

Given any 𝝀N+{\boldsymbol{\lambda}}\in\mathbb{R}^{N_{+}}, we denote its projection onto Λ\Lambda^{*} as ProjΛ(𝝀)\text{Proj}_{\Lambda^{*}}({\boldsymbol{\lambda}}). By the structure of Λ\Lambda^{*}, the iith coordinate of ProjΛ(𝝀)\text{Proj}_{\Lambda^{*}}({\boldsymbol{\lambda}}) is just the projection of λi\lambda_{i} onto Λi\Lambda^{*}_{i}, which we denote by ProjΛi(λi)\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}). Moreover, we denote the distance from 𝝀{\boldsymbol{\lambda}} to Λ\Lambda^{*} as dist(𝝀,Λ)\text{dist}({\boldsymbol{\lambda}},\Lambda^{*}) and it satisfies

dist2(𝝀,Λ)=i=1N+dist2(λi,Λi)=i=1N+(λiProjΛi(λi))2\text{dist}^{2}({\boldsymbol{\lambda}},\Lambda^{*})=\sum_{i=1}^{N_{+}}\text{dist}^{2}(\lambda_{i},\Lambda_{i}^{*})=\sum_{i=1}^{N_{+}}(\lambda_{i}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}))^{2} (20)
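Since \Lambda^{*} is a product of intervals, the projection and distance in (20) reduce to a coordinate-wise clip; a minimal sketch, assuming the interval endpoints have been computed elsewhere (e.g., from the sorted scores at {\mathbf{v}}^{*}), is:

import numpy as np

def dist_to_box(lam, lower, upper):
    # coordinate-wise projection onto Lambda* = prod_i [lower_i, upper_i], as in (20)
    proj = np.clip(lam, lower, upper)
    return np.linalg.norm(lam - proj)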

With these definitions, we can present the following lemma.

Lemma 11

Suppose Assumption 1 holds and μ1>ρ\mu^{-1}>\rho. For any 𝐰{\mathbf{w}}, 𝐯{\mathbf{v}}, 𝛌{\boldsymbol{\lambda}} and 𝐯=𝐯μfl(𝐰){\mathbf{v}}^{*}={\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}), we have

N+B𝐯𝐯+gl(𝐯,𝝀)gl(𝐯,ProjΛ(𝝀))i=1N+dist(λi,Λi)dist(𝝀,Λ).N_{+}B\|{\mathbf{v}}-{\mathbf{v}}^{*}\|+g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})-g^{l}({\mathbf{v}}^{*},{\mathrm{Proj}}_{\Lambda^{*}}({\boldsymbol{\lambda}}))\geq\sum_{i=1}^{N_{+}}\text{dist}(\lambda_{i},\Lambda_{i}^{*})\geq\text{dist}({\boldsymbol{\lambda}},\Lambda^{*}).

Proof. It is easy to observe that gl(𝐯,𝝀):=i=1N+gil(𝐯,λi)g^{l}({\mathbf{v}},{\boldsymbol{\lambda}}):=\sum_{i=1}^{N_{+}}g_{i}^{l}({\mathbf{v}},\lambda_{i}), where

gil(𝐯,λi):=lλi+j=1N[sij(𝐯)λi]+,g_{i}^{l}({\mathbf{v}},\lambda_{i}):=l\lambda_{i}+\sum^{N_{-}}_{j=1}[s_{ij}({\mathbf{v}})-\lambda_{i}]_{+},

and \Lambda_{i}^{*}=\operatorname*{arg\,min}_{\lambda_{i}}g_{i}^{l}({\mathbf{v}}^{*},\lambda_{i}). Since g_{i}^{l}({\mathbf{v}}^{*},\lambda_{i}) is piecewise linear in \lambda_{i} with an outward slope of at least one at either end of the interval \Lambda_{i}^{*}, we must have

gil(𝐯,λi)gil(𝐯,ProjΛi(λi))|λiProjΛi(λi)|,g_{i}^{l}({\mathbf{v}}^{*},\lambda_{i})-g_{i}^{l}({\mathbf{v}}^{*},{\mathrm{Proj}}_{\Lambda_{i}^{*}}(\lambda_{i}))\geq|\lambda_{i}-{\mathrm{Proj}}_{\Lambda_{i}^{*}}(\lambda_{i})|,

which implies

gl(𝐯,𝝀)gl(𝐯,ProjΛ(𝝀))i=1N+|λiProjΛi(λi)|=i=1N+dist(λi,Λi)dist(𝝀,Λ).g^{l}({\mathbf{v}}^{*},{\boldsymbol{\lambda}})-g^{l}({\mathbf{v}}^{*},{\mathrm{Proj}}_{\Lambda^{*}}({\boldsymbol{\lambda}}))\geq\sum_{i=1}^{N_{+}}|\lambda_{i}-{\mathrm{Proj}}_{\Lambda_{i}^{*}}(\lambda_{i})|=\sum_{i=1}^{N_{+}}\text{dist}(\lambda_{i},\Lambda_{i}^{*})\geq\text{dist}({\boldsymbol{\lambda}},\Lambda^{*}).

Moreover, by Assumption 1, gil(𝐯,λi)g_{i}^{l}({\mathbf{v}},\lambda_{i}) is BB-Lipschitz continuous in 𝐯{\mathbf{v}} so gl(𝐯,𝝀)g^{l}({\mathbf{v}},{\boldsymbol{\lambda}}) is N+BN_{+}B-Lipschitz continuous in 𝐯{\mathbf{v}}. We then have N+B𝐯𝐯+gl(𝐯,𝝀)gl(𝐯,𝝀)N_{+}B\|{\mathbf{v}}-{\mathbf{v}}^{*}\|+g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})\geq g^{l}({\mathbf{v}}^{*},{\boldsymbol{\lambda}}), which implies the conclusion together with the previous inequality. \Box

We present the proof of Proposition 4 below.

Proof.[of Proposition 4] Let us denote [t]={0,,t}\mathcal{I}_{[t]}=\{\mathcal{I}_{0},\dots,\mathcal{I}_{t}\} and 𝒥[t]={𝒥0,,𝒥t}\mathcal{J}_{[t]}=\{\mathcal{J}_{0},\dots,\mathcal{J}_{t}\}. Let 𝔼t\mathbb{E}_{t} be the expectation conditioning on [t1]\mathcal{I}_{[t-1]} and 𝒥[t1]\mathcal{J}_{[t-1]}.

By Assumption 1 and the definitions of G𝐯(t)G_{\mathbf{v}}^{(t)} and Gλi(t)G_{\lambda_{i}}^{(t)} in Algorithm 1, we have

G𝐯(t)N+NB and |Gλi(t)|N for t=0,,T1 and i=1,,N+.\displaystyle\|G_{\mathbf{v}}^{(t)}\|\leq N_{+}N_{-}B~{}\text{ and }~{}|G_{\lambda_{i}}^{(t)}|\leq N_{-}\text{ for }t=0,\dots,T-1\text{ and }i=1,\dots,N_{+}. (21)

By the optimality condition satisfied by 𝐯(t+1){\mathbf{v}}^{(t+1)} and (1μ+1ηt)\left(\frac{1}{\mu}+\frac{1}{\eta_{t}}\right)-strong convexity of the objective function in (14), we have

12μ𝐯(t+1)𝐰2+12ηt𝐯(t+1)𝐯(t)2+12(1μ+1ηt)𝐯𝐯(t+1)2\displaystyle\frac{1}{2\mu}\|{\mathbf{v}}^{(t+1)}-{\mathbf{w}}\|^{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{(t+1)}-{\mathbf{v}}^{(t)}\|^{2}+\frac{1}{2}\left(\frac{1}{\mu}+\frac{1}{\eta_{t}}\right)\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t+1)}\|^{2}
\displaystyle\leq (G𝐯(t))(𝐯𝐯(t+1))+12μ𝐯𝐰2+12ηt𝐯𝐯(t)2\displaystyle(G_{\mathbf{v}}^{(t)})^{\top}\left({\mathbf{v}}^{*}-{\mathbf{v}}^{(t+1)}\right)+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\|^{2}
=\displaystyle= (G𝐯(t))(𝐯𝐯(t))+(G𝐯(t))(𝐯(t)𝐯(t+1))+12μ𝐯𝐰2+12ηt𝐯𝐯(t)2\displaystyle(G_{\mathbf{v}}^{(t)})^{\top}\left({\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\right)+(G_{\mathbf{v}}^{(t)})^{\top}\left({\mathbf{v}}^{(t)}-{\mathbf{v}}^{(t+1)}\right)+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\|^{2}
\displaystyle\leq (G_{\mathbf{v}}^{(t)})^{\top}\left({\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\right)+\frac{\eta_{t}\|G_{\mathbf{v}}^{(t)}\|^{2}}{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{(t)}-{\mathbf{v}}^{(t+1)}\|^{2}+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\|^{2},

where the last inequality is by Young's inequality. Since {\mathbf{v}}^{*}-{\mathbf{v}}^{(t)} is deterministic conditioned on \mathcal{I}_{[t-1]} and \mathcal{J}_{[t-1]}, applying (21) and taking the expectation \mathbb{E}_{t} on both sides of the inequality above yield

12μ𝔼t𝐯(t+1)𝐰2+12(1μ+1ηt)𝔼t𝐯𝐯(t+1)2\displaystyle\frac{1}{2\mu}\mathbb{E}_{t}\|{\mathbf{v}}^{(t+1)}-{\mathbf{w}}\|^{2}+\frac{1}{2}\left(\frac{1}{\mu}+\frac{1}{\eta_{t}}\right)\mathbb{E}_{t}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t+1)}\|^{2} (22)
\displaystyle\leq (𝔼tG𝐯(t))(𝐯𝐯(t))+ηtN+2N2B22+12μ𝐯𝐰2+12ηt𝐯𝐯(t)2.\displaystyle\left(\mathbb{E}_{t}G_{\mathbf{v}}^{(t)}\right)^{\top}\left({\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\right)+\frac{\eta_{t}N_{+}^{2}N_{-}^{2}B^{2}}{2}+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\|^{2}.

By the updating equation (15) for 𝝀(t+1){\boldsymbol{\lambda}}^{(t+1)}, we have

dist2(𝝀(t+1),Λ)\displaystyle\text{dist}^{2}({\boldsymbol{\lambda}}^{(t+1)},\Lambda^{*}) =\displaystyle= itc(λi(t+1)ProjΛi(λi(t+1)))2+it(λi(t+1)ProjΛi(λi(t+1)))2\displaystyle\sum_{i\in\mathcal{I}_{t}^{c}}(\lambda_{i}^{(t+1)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t+1)}))^{2}+\sum_{i\in\mathcal{I}_{t}}(\lambda_{i}^{(t+1)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t+1)}))^{2}
\displaystyle\leq itc(λi(t)ProjΛi(λi(t)))2+it(λi(t+1)ProjΛi(λi(t)))2\displaystyle\sum_{i\in\mathcal{I}_{t}^{c}}(\lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))^{2}+\sum_{i\in\mathcal{I}_{t}}(\lambda_{i}^{(t+1)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))^{2}
=\displaystyle= itc(λi(t)ProjΛi(λi(t)))2+it(λi(t)θtGλi(t)ProjΛi(λi(t)))2\displaystyle\sum_{i\in\mathcal{I}_{t}^{c}}(\lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))^{2}+\sum_{i\in\mathcal{I}_{t}}(\lambda_{i}^{(t)}-\theta_{t}G_{\lambda_{i}}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))^{2}
=\displaystyle= itc(λi(t)ProjΛi(λi(t)))2+it(λi(t)ProjΛi(λi(t)))2\displaystyle\sum_{i\in\mathcal{I}_{t}^{c}}(\lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))^{2}+\sum_{i\in\mathcal{I}_{t}}(\lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))^{2}
2θtit(Gλi(t))(λi(t)ProjΛi(λi(t)))+θt2it(Gλi(t))2.\displaystyle-2\theta_{t}\sum_{i\in\mathcal{I}_{t}}(G_{\lambda_{i}}^{(t)})^{\top}(\lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))+\theta_{t}^{2}\sum_{i\in\mathcal{I}_{t}}(G_{\lambda_{i}}^{(t)})^{2}.

Since \lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}) is deterministic conditioned on \mathcal{I}_{[t-1]} and \mathcal{J}_{[t-1]}, applying (21) and taking the expectation \mathbb{E}_{t} on both sides of the inequality above yield

𝔼tdist2(𝝀(t+1),Λ)\displaystyle\mathbb{E}_{t}\text{dist}^{2}({\boldsymbol{\lambda}}^{(t+1)},\Lambda^{*}) \displaystyle\leq dist2(𝝀(t),Λ)2θtIN+i=1N+𝔼tGλi(t)(λi(t)ProjΛi(λi(t)))+θt2IN2\displaystyle\text{dist}^{2}({\boldsymbol{\lambda}}^{(t)},\Lambda^{*})-\frac{2\theta_{t}I}{N_{+}}\sum_{i=1}^{N_{+}}\mathbb{E}_{t}G_{\lambda_{i}}^{(t)}(\lambda_{i}^{(t)}-\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)}))+\theta_{t}^{2}IN_{-}^{2} (23)

Multiplying both sides of (23) by N+2Iθt\frac{N_{+}}{2I\theta_{t}} and adding it with (22), we have

N+2Iθt𝔼tdist2(𝝀(t+1),Λ)+12μ𝔼t𝐯(t+1)𝐰2+12(1μ+1ηt)𝔼t𝐯𝐯(t+1)2\displaystyle\frac{N_{+}}{2I\theta_{t}}\mathbb{E}_{t}\text{dist}^{2}({\boldsymbol{\lambda}}^{(t+1)},\Lambda^{*})+\frac{1}{2\mu}\mathbb{E}_{t}\|{\mathbf{v}}^{(t+1)}-{\mathbf{w}}\|^{2}+\frac{1}{2}\left(\frac{1}{\mu}+\frac{1}{\eta_{t}}\right)\mathbb{E}_{t}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t+1)}\|^{2} (24)
\displaystyle\leq N+2Iθtdist2(𝝀(t),Λ)+12μ𝐯𝐰2+12ηt𝐯𝐯(t)2+ηtN+2N2B22+θtN+N22\displaystyle\frac{N_{+}}{2I\theta_{t}}\text{dist}^{2}({\boldsymbol{\lambda}}^{(t)},\Lambda^{*})+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2\eta_{t}}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\|^{2}+\frac{\eta_{t}N_{+}^{2}N_{-}^{2}B^{2}}{2}+\frac{\theta_{t}N_{+}N_{-}^{2}}{2}
+[𝔼tG𝐯(t)](𝐯𝐯(t))+i=1N+𝔼tGλi(t)(ProjΛi(λi(t))λi(t))\displaystyle+\left[\mathbb{E}_{t}G_{\mathbf{v}}^{(t)}\right]^{\top}\left({\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\right)+\sum_{i=1}^{N_{+}}\mathbb{E}_{t}G_{\lambda_{i}}^{(t)}(\text{Proj}_{\Lambda_{i}^{*}}(\lambda_{i}^{(t)})-\lambda_{i}^{(t)})
\displaystyle\leq N+2Iθtdist2(𝝀(t),Λ)+12μ𝐯𝐰2+12(ρ+1ηt)𝐯𝐯(t)2+ηtN+2N2B22+θtN+N22\displaystyle\frac{N_{+}}{2I\theta_{t}}\text{dist}^{2}({\boldsymbol{\lambda}}^{(t)},\Lambda^{*})+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2}\left(\rho+\frac{1}{\eta_{t}}\right)\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(t)}\|^{2}+\frac{\eta_{t}N_{+}^{2}N_{-}^{2}B^{2}}{2}+\frac{\theta_{t}N_{+}N_{-}^{2}}{2}
+gl(𝐯,ProjΛ(𝝀(t)))gl(𝐯(t),𝝀(t)),\displaystyle+g^{l}({\mathbf{v}}^{*},\text{Proj}_{\Lambda^{*}}({\boldsymbol{\lambda}}^{(t)}))-g^{l}({\mathbf{v}}^{(t)},{\boldsymbol{\lambda}}^{(t)}),

where the second inequality is because (𝔼tG𝐯(t),𝔼tGλ1(t),,𝔼tGλN+(t))gl(𝐯(t),𝝀(t))(\mathbb{E}_{t}G_{\mathbf{v}}^{(t)},\mathbb{E}_{t}G_{\lambda_{1}}^{(t)},\dots,\mathbb{E}_{t}G_{\lambda_{N_{+}}}^{(t)})\in\partial g^{l}({\mathbf{v}}^{(t)},{\boldsymbol{\lambda}}^{(t)}), which allows us to apply Lemma 9 with (𝐯,𝝀)=(𝐯,ProjΛ(𝝀(t)))({\mathbf{v}},{\boldsymbol{\lambda}})=({\mathbf{v}}^{*},\text{Proj}_{\Lambda^{*}}({\boldsymbol{\lambda}}^{(t)})) and (𝐯,𝝀)=(𝐯(t),𝝀(t))({\mathbf{v}}^{\prime},{\boldsymbol{\lambda}}^{\prime})=({\mathbf{v}}^{(t)},{\boldsymbol{\lambda}}^{(t)}).

Notice that \theta_{t} and \eta_{t} do not change with t and that \rho<\mu^{-1}. Summing up (24) for t=0,\dots,T-1, taking full expectation, and organizing terms give us

t=0T1𝔼(gl(𝐯(t),𝝀(t))+12μ𝐯(t)𝐰2gl(𝐯,ProjΛ(𝝀(t)))12μ𝐯𝐰2)\displaystyle\sum_{t=0}^{T-1}\mathbb{E}\left(g^{l}({\mathbf{v}}^{(t)},{\boldsymbol{\lambda}}^{(t)})+\frac{1}{2\mu}\|{\mathbf{v}}^{(t)}-{\mathbf{w}}\|^{2}-g^{l}({\mathbf{v}}^{*},\text{Proj}_{\Lambda^{*}}({\boldsymbol{\lambda}}^{(t)}))-\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}\right) (25)
+N+2IθT1𝔼dist2(𝝀(T),Λ)+12μ𝔼𝐯(T)𝐰2+12(1μ+1ηT1)𝔼𝐯𝐯(T)2\displaystyle+\frac{N_{+}}{2I\theta_{T-1}}\mathbb{E}\text{dist}^{2}({\boldsymbol{\lambda}}^{(T)},\Lambda^{*})+\frac{1}{2\mu}\mathbb{E}\|{\mathbf{v}}^{(T)}-{\mathbf{w}}\|^{2}+\frac{1}{2}\left(\frac{1}{\mu}+\frac{1}{\eta_{T-1}}\right)\mathbb{E}\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(T)}\|^{2}
\displaystyle\leq N+2Iθ0dist2(𝝀(0),Λ)+12μ𝐯(0)𝐰2+12(1μ+1η0)𝐯𝐯(0)2\displaystyle\frac{N_{+}}{2I\theta_{0}}\text{dist}^{2}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})+\frac{1}{2\mu}\|{\mathbf{v}}^{(0)}-{\mathbf{w}}\|^{2}+\frac{1}{2}\left(\frac{1}{\mu}+\frac{1}{\eta_{0}}\right)\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(0)}\|^{2}
+t=0T1(ηtN+2N2B22+θtN+N22).\displaystyle+\sum_{t=0}^{T-1}\left(\frac{\eta_{t}N_{+}^{2}N_{-}^{2}B^{2}}{2}+\frac{\theta_{t}N_{+}N_{-}^{2}}{2}\right).

Because f^{l}(\bar{\mathbf{v}})+\frac{1}{2\mu}\|\bar{\mathbf{v}}-{\mathbf{w}}\|^{2} is (\mu^{-1}-\rho)-strongly convex, and using the facts that f^{l}(\bar{\mathbf{v}})\leq g^{l}(\bar{\mathbf{v}},\bar{\boldsymbol{\lambda}}) and that f^{l}({\mathbf{v}}^{*})=g^{l}({\mathbf{v}}^{*},{\boldsymbol{\lambda}}^{*}) for any optimal {\boldsymbol{\lambda}}^{*}, we have

12(1μρ)𝔼𝐯¯𝐯2\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}^{*}\|^{2} (26)
\displaystyle\leq 𝔼(fl(𝐯¯)+12μ𝐯¯𝐰2fl(𝐯)12μ𝐯𝐰2)\displaystyle\mathbb{E}\left(f^{l}(\bar{\mathbf{v}})+\frac{1}{2\mu}\|\bar{\mathbf{v}}-{\mathbf{w}}\|^{2}-f^{l}({\mathbf{v}}^{*})-\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}\right)
\displaystyle\leq 𝔼(gl(𝐯¯,𝝀¯)+12μ𝐯¯𝐰2gl(𝐯,𝝀)12μ𝐯𝐰2)\displaystyle\mathbb{E}\left(g^{l}(\bar{\mathbf{v}},\bar{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|\bar{\mathbf{v}}-{\mathbf{w}}\|^{2}-g^{l}({\mathbf{v}}^{*},{\boldsymbol{\lambda}}^{*})-\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}\right)
\displaystyle\leq 1Tt=0T1𝔼(gl(𝐯(t),𝝀(t))+12μ𝐯(t)𝐰2gl(𝐯,ProjΛ(𝝀(t)))12μ𝐯𝐰2),\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left(g^{l}({\mathbf{v}}^{(t)},{\boldsymbol{\lambda}}^{(t)})+\frac{1}{2\mu}\|{\mathbf{v}}^{(t)}-{\mathbf{w}}\|^{2}-g^{l}({\mathbf{v}}^{*},\text{Proj}_{\Lambda^{*}}({\boldsymbol{\lambda}}^{(t)}))-\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}\right),

where the last inequality is because g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}\|^{2} is jointly convex in {\mathbf{v}} and {\boldsymbol{\lambda}}. Recall that {\mathbf{v}}^{(0)}={\mathbf{w}}. Combining the inequality above with (25), we obtain

12(1μρ)𝔼𝐯¯𝐯2\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}^{*}\|^{2}
\displaystyle\leq N+2Iθ0Tdist2(𝝀(0),Λ)+12μT𝐯(0)𝐰2+12T(1μ+1η0)𝐯𝐯(0)2\displaystyle\frac{N_{+}}{2I\theta_{0}T}\text{dist}^{2}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})+\frac{1}{2\mu T}\|{\mathbf{v}}^{(0)}-{\mathbf{w}}\|^{2}+\frac{1}{2T}\left(\frac{1}{\mu}+\frac{1}{\eta_{0}}\right)\|{\mathbf{v}}^{*}-{\mathbf{v}}^{(0)}\|^{2}
+η0N+2N2B22+θ0N+N22\displaystyle+\frac{\eta_{0}N_{+}^{2}N_{-}^{2}B^{2}}{2}+\frac{\theta_{0}N_{+}N_{-}^{2}}{2}
\displaystyle\leq N+NITdist(𝝀(0),Λ)+N+NB2T𝐯𝐰+12μT𝐯𝐰2.\displaystyle\frac{N_{+}N_{-}}{\sqrt{IT}}\text{dist}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})+\frac{N_{+}N_{-}B}{2\sqrt{T}}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|+\frac{1}{2\mu T}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}.

The conclusion is then proved given that 𝐯μfl(𝐰)=𝐯{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})={\mathbf{v}}^{*}. \Box

Appendix D Proof of Theorem 5

We first present a technical lemma.

Lemma 12

Given two intervals [a,b] and [a',b'] with \max\{|a-a'|,|b-b'|\}\leq\delta, we have

dist(z,[a,b])dist(z,[a,b])+δ,z.\text{dist}(z,[a,b])\leq\text{dist}(z,[a^{\prime},b^{\prime}])+\delta,~{}~{}\forall z\in\mathbb{R}.

Proof. We will prove the result in three cases, z<az<a^{\prime}, z>bz>b^{\prime} and azba^{\prime}\leq z\leq b^{\prime}. Suppose z<az<a^{\prime} so that dist(z,[a,b])=|za|\text{dist}(z,[a^{\prime},b^{\prime}])=|z-a^{\prime}|. We have dist(z,[a,b])|za||za|+|aa|dist(z,[a,b])+δ\text{dist}(z,[a,b])\leq|z-a|\leq|z-a^{\prime}|+|a-a^{\prime}|\leq\text{dist}(z,[a^{\prime},b^{\prime}])+\delta. The result when z>bz>b^{\prime} can be proved similarly. Suppose azba^{\prime}\leq z\leq b^{\prime} so that dist(z,[a,b])=0\text{dist}(z,[a^{\prime},b^{\prime}])=0. Note that z=zabab+bzbaaz=\frac{z-a^{\prime}}{b^{\prime}-a^{\prime}}\cdot b^{\prime}+\frac{b^{\prime}-z}{b^{\prime}-a^{\prime}}\cdot a^{\prime}. We define z=zabab+bzbaa[a,b]z^{\prime}=\frac{z-a^{\prime}}{b^{\prime}-a^{\prime}}\cdot b+\frac{b^{\prime}-z}{b^{\prime}-a^{\prime}}\cdot a\in[a,b]. Then dist(z,[a,b])|zz|zaba|bb|+bzba|aa|dist(z,[a,b])+δ\text{dist}(z,[a,b])\leq|z-z^{\prime}|\leq\frac{z-a^{\prime}}{b^{\prime}-a^{\prime}}\cdot|b-b^{\prime}|+\frac{b^{\prime}-z}{b^{\prime}-a^{\prime}}\cdot|a-a^{\prime}|\leq\text{dist}(z,[a^{\prime},b^{\prime}])+\delta. \Box

Under Assumption 1, we have sij(𝐯)B\|\nabla s_{ij}({\mathbf{v}})\|\leq B for any 𝐯{\mathbf{v}} so that 𝝃lB\|{\boldsymbol{\xi}}\|\leq lB for any 𝝃fl(𝐯){\boldsymbol{\xi}}\in\partial f^{l}({\mathbf{v}}) and any 𝐯{\mathbf{v}}. By the definition of 𝐯μfl(𝐰){\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}), we have μ1(𝐰𝐯μfl(𝐰))fl(𝐯μfl(𝐰))\mu^{-1}({\mathbf{w}}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}))\in\partial f^{l}({\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})), which implies

𝐰𝐯μfl(𝐰)μlB for any 𝐰.\|{\mathbf{w}}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}})\|\leq\mu lB\text{ for any }{\mathbf{w}}. (27)

We then provide the following proposition as an additional conclusion from the proof of Proposition 4.

Proposition 13

Suppose Assumption 1 holds. Algorithm 1 guarantees that

𝔼dist(𝝀¯,Λ)i=1N+𝔼dist(λ¯i,Λi)\displaystyle\mathbb{E}\text{dist}(\bar{\boldsymbol{\lambda}},\Lambda^{*})\leq\sum_{i=1}^{N_{+}}\mathbb{E}\text{dist}(\bar{\lambda}_{i},\Lambda_{i}^{*})
\displaystyle\leq N+NITdist(𝝀(0),Λ)+N+NB2T𝐯𝐰+12μT𝐯𝐰2+12μ𝐯𝐰2\displaystyle\frac{N_{+}N_{-}}{\sqrt{IT}}\text{dist}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})+\frac{N_{+}N_{-}B}{2\sqrt{T}}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|+\frac{1}{2\mu T}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}+\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}
+N+B𝔼𝐯¯𝐯.\displaystyle+N_{+}B\mathbb{E}\|\bar{\mathbf{v}}-{\mathbf{v}}^{*}\|.

Proof. By (25), the last inequality in (26), and Lemma 11, we have

𝔼(i=1N+dist(λ¯i,Λi)N+B𝐯¯𝐯12μ𝐯𝐰2)\displaystyle\mathbb{E}\left(\sum_{i=1}^{N_{+}}\text{dist}(\bar{\lambda}_{i},\Lambda_{i}^{*})-N_{+}B\|\bar{\mathbf{v}}-{\mathbf{v}}^{*}\|-\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}\right)
\displaystyle\leq 𝔼(gl(𝐯¯,𝝀¯)gl(𝐯,𝝀)+12μ𝐯¯𝐰212μ𝐯𝐰2)\displaystyle\mathbb{E}\left(g^{l}(\bar{\mathbf{v}},\bar{\boldsymbol{\lambda}})-g^{l}({\mathbf{v}}^{*},{\boldsymbol{\lambda}}^{*})+\frac{1}{2\mu}\|\bar{\mathbf{v}}-{\mathbf{w}}\|^{2}-\frac{1}{2\mu}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2}\right)
\displaystyle\leq N+NITdist(𝝀(0),Λ)+N+NB2T𝐯𝐰+12μT𝐯𝐰2,\displaystyle\frac{N_{+}N_{-}}{\sqrt{IT}}\text{dist}({\boldsymbol{\lambda}}^{(0)},\Lambda^{*})+\frac{N_{+}N_{-}B}{2\sqrt{T}}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|+\frac{1}{2\mu T}\|{\mathbf{v}}^{*}-{\mathbf{w}}\|^{2},

which implies the conclusion by reorganizing terms. \Box

Next, we are ready to present the proof of Theorem 5.

Proof.[of Theorem 5] Let \epsilon_{k}=\frac{1}{\sqrt{k+1}} for k=0,1,\dots. In the kth iteration of Algorithm 2, Algorithm 1 is applied to the subproblem

min𝐯,𝝀{gl(𝐯,𝝀)+12μ𝐯𝐰(k)2}.\displaystyle\min_{{\mathbf{v}},{\boldsymbol{\lambda}}}\left\{g^{l}({\mathbf{v}},{\boldsymbol{\lambda}})+\frac{1}{2\mu}\|{\mathbf{v}}-{\mathbf{w}}^{(k)}\|^{2}\right\}. (28)

with initial solution (𝐰(k),𝝀¯l(k))({\mathbf{w}}^{(k)},\bar{\boldsymbol{\lambda}}_{l}^{(k)}) and runs for TkT_{k} iterations. Let Λk\Lambda_{k}^{*} be the set of optimal 𝝀{\boldsymbol{\lambda}} for (28). We will prove by induction the following two inequalities for k=0,1,k=0,1,\dots.

𝔼𝐯¯l(k)𝐯μfl(𝐰(k))2\displaystyle\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})\|^{2} \displaystyle\leq ϵk2/4\displaystyle\epsilon_{k}^{2}/4 (29)
𝔼dist(𝝀¯l(k),Λk)\displaystyle\mathbb{E}\text{dist}(\bar{\boldsymbol{\lambda}}_{l}^{(k)},\Lambda_{k}^{*}) \displaystyle\leq Dl,\displaystyle D_{l}, (30)

where DlD_{l} is defined in (16) for l=ml=m and nn.

Applying Proposition 4 to (28) and using (27), we have, for k=0,1,k=0,1,\dots,

12(1μρ)𝔼𝐯¯l(k)𝐯μfl(𝐰(k))2\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})\|^{2}
\displaystyle\leq N+NITk𝔼dist(𝝀¯l(k),Λk)+N+NB2Tk𝔼𝐯μfl(𝐰(k))𝐰(k)+12μTk𝔼𝐯μfl(𝐰(k))𝐰(k)2\displaystyle\frac{N_{+}N_{-}}{\sqrt{IT_{k}}}\mathbb{E}\text{dist}(\bar{\boldsymbol{\lambda}}_{l}^{(k)},\Lambda_{k}^{*})+\frac{N_{+}N_{-}B}{2\sqrt{T_{k}}}\mathbb{E}\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|+\frac{1}{2\mu T_{k}}\mathbb{E}\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|^{2}
\displaystyle\leq N+NITk𝔼dist(𝝀¯l(k),Λk)+N+NμlB22Tk+μ2l2B22μTk.\displaystyle\frac{N_{+}N_{-}}{\sqrt{IT_{k}}}\mathbb{E}\text{dist}(\bar{\boldsymbol{\lambda}}_{l}^{(k)},\Lambda_{k}^{*})+\frac{N_{+}N_{-}\mu lB^{2}}{2\sqrt{T_{k}}}+\frac{\mu^{2}l^{2}B^{2}}{2\mu T_{k}}. (31)

Moreover, applying Proposition 13 to (28) and using (27), we have

\sum_{i=1}^{N_{+}}\mathbb{E}\,\text{dist}(\bar{\lambda}_{l,i}^{(k+1)},\Lambda_{k,i}^{*})\leq\frac{N_{+}N_{-}}{\sqrt{IT_{k}}}\mathbb{E}\,\text{dist}(\bar{\boldsymbol{\lambda}}_{l}^{(k)},\Lambda_{k}^{*})+\frac{N_{+}N_{-}\mu lB^{2}}{2\sqrt{T_{k}}}+\frac{\mu^{2}l^{2}B^{2}}{2\mu T_{k}}+\frac{\mu l^{2}B^{2}}{2}+N_{+}B\,\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})\|. (32)

Suppose k=0k=0. By (31) and the choice of T0T_{0}, we have 𝔼𝐯¯l(0)𝐯μfl(𝐰(0))2ϵ02/4\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(0)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(0)})\|^{2}\leq\epsilon_{0}^{2}/4, so (29) holds for k=0k=0. In addition, (30) holds trivially for k=0k=0. Next, we assume (29) and (30) both hold for kk and prove they also hold for k+1k+1.

Since (29) and (30) hold for kk, by (32) and the choice of TkT_{k}, we have

i=1N+𝔼dist(λ¯l,i(k+1),Λk,i)\displaystyle\sum_{i=1}^{N_{+}}\mathbb{E}\text{dist}(\bar{\lambda}_{l,i}^{(k+1)},\Lambda_{k,i}^{*}) \displaystyle\leq 12(1μρ)ϵk2/4+μl2B22+N+B𝔼𝐯¯l(k)𝐯μfl(𝐰(k))\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)\epsilon_{k}^{2}/4+\frac{\mu l^{2}B^{2}}{2}+N_{+}B\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})\|
\displaystyle\leq 12(1μρ)ϵk2/4+μl2B22+N+B𝔼𝐯¯l(k)𝐯μfl(𝐰(k))2\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)\epsilon_{k}^{2}/4+\frac{\mu l^{2}B^{2}}{2}+N_{+}B\sqrt{\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})\|^{2}}
\displaystyle\leq 12(1μρ)+μl2B22+N+B.\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)+\frac{\mu l^{2}B^{2}}{2}+N_{+}B. (33)

From the updating equation 𝐰(k+1)=𝐰(k)γμ1(𝐯¯m(k)𝐯¯n(k)){\mathbf{w}}^{(k+1)}={\mathbf{w}}^{(k)}-\gamma\mu^{-1}(\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}) in Algorithm 2, we know that

𝔼𝐰(k+1)𝐰(k)\displaystyle\mathbb{E}\|{\mathbf{w}}^{(k+1)}-{\mathbf{w}}^{(k)}\| \displaystyle\leq γμ1𝔼𝐯¯m(k)𝐯¯n(k)\displaystyle\gamma\mu^{-1}\mathbb{E}\|\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}\| (34)
\displaystyle\leq γμ1(𝔼𝐯¯m(k)𝐯μfm(𝐰(k))+𝔼𝐯μfm(𝐰(k))𝐰(k)\displaystyle\gamma\mu^{-1}\bigg{(}\mathbb{E}\|\bar{\mathbf{v}}_{m}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})\|+\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|
+𝔼𝐯¯n(k)𝐯μfn(𝐰(k))+𝔼𝐯μfn(𝐰(k))𝐰(k))\displaystyle\quad\quad+\mathbb{E}\|\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|+\mathbb{E}\|{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})-{\mathbf{w}}^{(k)}\|\bigg{)}
\displaystyle\leq 2γϵkμ+γnB+γmB,\displaystyle\frac{2\gamma\epsilon_{k}}{\mu}+\gamma nB+\gamma mB,

where the second inequality is by the triangle inequality and the last inequality is by (27) and the fact that (29) holds for k.

Let \bar{\lambda}_{l,i}^{(k+1)} denote the ith coordinate of \bar{\boldsymbol{\lambda}}_{l}^{(k+1)} for i=1,\dots,N_{+} and l=m and n. Recall that s_{i[j]}({\mathbf{v}}) is the jth largest coordinate of S_{i}({\mathbf{v}}). By Assumption 1 and an elementary argument, we can show that s_{i[j]}({\mathbf{v}}) is B-Lipschitz continuous for any i and j. Since {\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}) is \frac{1}{1-\mu\rho}-Lipschitz continuous, we have

𝔼|si[j](𝐯μfl(𝐰(k+1)))si[j](𝐯μfl(𝐰(k)))|\displaystyle\mathbb{E}\left|s_{i[j]}({\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k+1)}))-s_{i[j]}({\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)}))\right| \displaystyle\leq B𝔼𝐯μfl(𝐰(k+1))𝐯μfl(𝐰(k))\displaystyle B\mathbb{E}\|{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k+1)})-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k)})\| (35)
\displaystyle\leq B1μρ𝔼𝐰(k+1)𝐰(k)\displaystyle\frac{B}{1-\mu\rho}\mathbb{E}\|{\mathbf{w}}^{(k+1)}-{\mathbf{w}}^{(k)}\|
\displaystyle\leq B1μρ(2γϵkμ+γnB+γmB)\displaystyle\frac{B}{1-\mu\rho}\left(\frac{2\gamma\epsilon_{k}}{\mu}+\gamma nB+\gamma mB\right)

for j=lj=l and j=l+1j=l+1, where the last inequality is by (34). According to (19), (20), (35), and Lemma 12, we can prove that, for i=1,,N+i=1,\dots,N_{+},

𝔼dist(λ¯l,i(k+1),Λk+1,i)𝔼dist(λ¯l,i(k+1),Λk,i)+B1μρ(2γμ+γnB+γmB).\mathbb{E}\text{dist}\left(\bar{\lambda}_{l,i}^{(k+1)},\Lambda_{k+1,i}^{*}\right)\leq\mathbb{E}\text{dist}\left(\bar{\lambda}_{l,i}^{(k+1)},\Lambda_{k,i}^{*}\right)+\frac{B}{1-\mu\rho}\left(\frac{2\gamma}{\mu}+\gamma nB+\gamma mB\right).

Combining this inequality with (33) yields (30) for case k+1k+1.

Stating (31) for case k+1k+1 gives

12(1μρ)𝔼𝐯¯l(k+1)𝐯μfl(𝐰(k+1))2N+NITk+1𝔼dist(𝝀¯l(k+1),Λk+1)+N+NμlB22Tk+1+μ2l2B22μTk+1.\displaystyle\frac{1}{2}\left(\frac{1}{\mu}-\rho\right)\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k+1)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k+1)})\|^{2}\leq\frac{N_{+}N_{-}}{\sqrt{IT_{k+1}}}\mathbb{E}\text{dist}(\bar{\boldsymbol{\lambda}}_{l}^{(k+1)},\Lambda_{k+1}^{*})+\frac{N_{+}N_{-}\mu lB^{2}}{2\sqrt{T_{k+1}}}+\frac{\mu^{2}l^{2}B^{2}}{2\mu T_{k+1}}. (36)

By (36), (30) for case k+1, and the choice of T_{k+1}, we have \mathbb{E}\|\bar{\mathbf{v}}_{l}^{(k+1)}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(k+1)})\|^{2}\leq\epsilon_{k+1}^{2}/4, which means (29) holds for k+1. By induction, we have proved that (29) and (30) hold for k=0,1,\dots.

Because Fμ(𝐰)=μ1(𝐯μfm(𝐰)𝐯μfn(𝐰))\nabla F^{\mu}({\mathbf{w}})=\mu^{-1}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}})) is LμL_{\mu}-Lipschitz continuous, we have

Fμ(𝐰(k))Fμ(𝐰(k+1))\displaystyle F^{\mu}({\mathbf{w}}^{(k)})-F^{\mu}({\mathbf{w}}^{(k+1)})
\displaystyle\geq Fμ(𝐰(k)),𝐰(k+1)𝐰(k)Lμ2𝐰(k+1)𝐰(k)2\displaystyle\langle-\nabla F^{\mu}({\mathbf{w}}^{(k)}),{\mathbf{w}}^{(k+1)}-{\mathbf{w}}^{(k)}\rangle-\frac{L_{\mu}}{2}\|{\mathbf{w}}^{(k+1)}-{\mathbf{w}}^{(k)}\|^{2}
=\displaystyle= μ1(𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))),γμ1(𝐯¯m(k)𝐯¯n(k))γ2Lμ2μ2𝐯¯m(k)𝐯¯n(k)2\displaystyle\left\langle\mu^{-1}({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})),\gamma\mu^{-1}(\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)})\right\rangle-\frac{\gamma^{2}L_{\mu}}{2\mu^{2}}\left\|\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}\right\|^{2}
=\displaystyle= γμ2𝐯μfm(𝐰(k))𝐯μfn(𝐰(k)),𝐯¯m(k)𝐯¯n(k)𝐯μfm(𝐰(k))+𝐯μfn(𝐰(k))+𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))\displaystyle\frac{\gamma}{\mu^{2}}\left\langle{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)}),\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\right\rangle
γ2Lμ2μ2𝐯¯m(k)𝐯¯n(k)𝐯μfm(𝐰(k))+𝐯μfn(𝐰(k))+𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))2\displaystyle-\frac{\gamma^{2}L_{\mu}}{2\mu^{2}}\left\|\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\right\|^{2}
\displaystyle\geq γμ2𝐯μfm(𝐰(k))𝐯μfn(𝐰(k)),𝐯¯m(k)𝐯¯n(k)𝐯μfm(𝐰(k))+𝐯μfn(𝐰(k))\displaystyle\frac{\gamma}{\mu^{2}}\left\langle{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)}),\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\right\rangle
+(γμ2γ2Lμμ2)𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))2γ2Lμμ2𝐯¯m(k)𝐯¯n(k)𝐯μfm(𝐰(k))+𝐯μfn(𝐰(k))2.\displaystyle+\left(\frac{\gamma}{\mu^{2}}-\frac{\gamma^{2}L_{\mu}}{\mu^{2}}\right)\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2}-\frac{\gamma^{2}L_{\mu}}{\mu^{2}}\left\|\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\right\|^{2}.

Applying Young’s inequality to the first term on the right-hand side of the last inequality above gives

𝔼[Fμ(𝐰(k))Fμ(𝐰(k+1))]\displaystyle\mathbb{E}\left[F^{\mu}({\mathbf{w}}^{(k)})-F^{\mu}({\mathbf{w}}^{(k+1)})\right]
\displaystyle\geq γμ2(12𝔼𝐯¯m(k)𝐯¯n(k)𝐯μfm(𝐰(k))+𝐯μfn(𝐰(k))2+12𝔼𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))2)\displaystyle-\frac{\gamma}{\mu^{2}}\left(\frac{1}{2}\mathbb{E}\|\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2}+\frac{1}{2}\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2}\right)
+(γμ2γ2Lμμ2)𝔼𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))2\displaystyle+\left(\frac{\gamma}{\mu^{2}}-\frac{\gamma^{2}L_{\mu}}{\mu^{2}}\right)\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2}
γ2Lμμ2𝔼𝐯¯m(k)𝐯¯n(k)𝐯μfm(𝐰(k))+𝐯μfn(𝐰(k))2\displaystyle-\frac{\gamma^{2}L_{\mu}}{\mu^{2}}\mathbb{E}\left\|\bar{\mathbf{v}}_{m}^{(k)}-\bar{\mathbf{v}}_{n}^{(k)}-{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})+{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\right\|^{2}
\displaystyle\geq γ2γ2Lμ2μ2𝔼𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))2(γμ2+2γ2Lμμ2)ϵk2\displaystyle\frac{\gamma-2\gamma^{2}L_{\mu}}{2\mu^{2}}\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2}-\left(\frac{\gamma}{\mu^{2}}+\frac{2\gamma^{2}L_{\mu}}{\mu^{2}}\right)\epsilon_{k}^{2}
\displaystyle\geq γ4μ2𝔼𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))23γ2μ2ϵk2,\displaystyle\frac{\gamma}{4\mu^{2}}\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2}-\frac{3\gamma}{2\mu^{2}}\epsilon_{k}^{2},

where the second inequality is because of (29) for l=m and n, and the last is because \gamma\leq\frac{1}{L_{\mu}}.

Summing the above inequality over k=0,,K1k=0,\cdots,K-1, we have

1Kk=0K1𝔼𝐯μfm(𝐰(k))𝐯μfn(𝐰(k))2\displaystyle\frac{1}{K}\sum^{K-1}_{k=0}\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(k)})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(k)})\|^{2} 4μ2γK𝔼(Fμ(𝐰(0))Fμ(𝐰(K)))+6Kk=0K1ϵk2\displaystyle\leq\frac{4\mu^{2}}{\gamma K}\mathbb{E}(F^{\mu}({\mathbf{w}}^{(0)})-F^{\mu}({\mathbf{w}}^{(K)}))+\frac{6}{K}\sum^{K-1}_{k=0}\epsilon_{k}^{2}
4μ2γK(F(𝐯μfn(𝐰(0)))F)+6logKK,\displaystyle\leq\frac{4\mu^{2}}{\gamma K}\left(F({\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(0)}))-F^{*}\right)+\frac{6\log K}{K},

where the second inequality is because FF(𝐯μfn(𝐰(K)))Fμ(𝐰(K))F^{*}\leq F({\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(K)}))\leq F^{\mu}({\mathbf{w}}^{(K)}) and Fμ(𝐰(0))F(𝐯μfm(𝐰(0)))F^{\mu}({\mathbf{w}}^{(0)})\leq F({\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(0)})) (see Lemma 1 in Sun and Sun (2021)). Let k¯\bar{k} be randomly sampled from {0,,K1}\{0,\dots,K-1\}, then it holds

𝔼Fμ(𝐰(k¯))2=μ2𝔼𝐯μfm(𝐰(k¯))𝐯μfn(𝐰(k¯))2μ2min{1,μ2}ϵ2=min{1,μ2}ϵ2\mathbb{E}\|\nabla F^{\mu}({\mathbf{w}}^{(\bar{k})})\|^{2}=\mu^{-2}\mathbb{E}\|{\mathbf{v}}_{\mu f^{m}}({\mathbf{w}}^{(\bar{k})})-{\mathbf{v}}_{\mu f^{n}}({\mathbf{w}}^{(\bar{k})})\|^{2}\leq\mu^{-2}\min\{1,\mu^{2}\}\epsilon^{2}=\min\{1,\mu^{-2}\}\epsilon^{2}

by the definition of KK. On the other hands, by (29), we have for l=ml=m and nn that

𝔼𝐯¯l(k¯)𝐯μfl(𝐰(k¯))21Kk=0K1ϵk2logKKmin{1,μ2}ϵ248ϵ24,\mathbb{E}\|\bar{\mathbf{v}}_{l}^{(\bar{k})}-{\mathbf{v}}_{\mu f^{l}}({\mathbf{w}}^{(\bar{k})})\|^{2}\leq\frac{1}{K}\sum_{k=0}^{K-1}\epsilon_{k}^{2}\leq\frac{\log K}{K}\leq\frac{\min\{1,\mu^{2}\}\epsilon^{2}}{48}\leq\frac{\epsilon^{2}}{4},

where the third inequality is because of the value of K. Hence, by the second claim in Lemma 3 (with {\mathbf{w}}={\mathbf{w}}^{(\bar{k})} and \bar{\mathbf{v}}=\bar{\mathbf{v}}_{n}^{(\bar{k})}), \bar{\mathbf{v}}_{n}^{(\bar{k})} is a nearly \epsilon-critical point of (5) with f^{l} defined in (6). This completes the proof. \Box

Appendix E Additional Materials for Numerical Experiments

In this section, we present additional details of our numerical experiments in Section 6 that we could not include in the main text due to space limitations.

E.1 Details of SoRR Loss Minimization Experiment

For SoRR loss minimization, we compare with the baseline DCA. We focus on learning a linear model h_{\mathbf{w}}({\mathbf{x}})={\mathbf{w}}^{T}{\mathbf{x}} on four benchmark datasets from the UCI data repository (Dua and Graff, 2017) preprocessed by Libsvm (Chang and Lin, 2011): a9a, w8a, ijcnn1 and covtype. The statistics of these datasets are summarized in Table 4. We use the logistic loss l(h_{\mathbf{w}}({\mathbf{x}}),y)=-y\log(h_{\mathbf{w}}({\mathbf{x}}))-(1-y)\log(1-h_{\mathbf{w}}({\mathbf{x}})), where the label y\in\{0,1\}. Below we describe the process of tuning hyperparameters and present the convergence curves (Figure 2).

Table 4: Statistics of the UCI datasets.
Datasets # Samples # Features
a9a 32,561 123
w8a 49,749 300
ijcnn1 49,990 22
covtype 581,012 54
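As a concrete reference, the per-sample losses that the SoRR objective aggregates can be computed as below; this is a sketch assuming a sigmoid link so that h_{\mathbf{w}}({\mathbf{x}})\in(0,1), written in a numerically stable form.

import numpy as np

def logistic_losses(w, X, y):
    # per-sample loss -y*log(sigmoid(w^T x)) - (1-y)*log(1-sigmoid(w^T x)),
    # with labels y in {0, 1}, computed in a numerically stable way
    z = X @ w
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))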

For AGD-SBCD, we fix J=100, m=10^{3} and n=2\times10^{4}. In the spirit of Corollary 8, within Algorithm 3 called in the kth main iteration, \eta_{t} and \theta_{t} are both set to \frac{c}{(k+1)N} for all t, with c tuned over \{0.1,1,2,5,10\}, and T_{k} is set to C(k+1)^{2} with C selected from \{30,50,100\}. Parameter \mu is chosen from \{2\times10^{2},10^{3}\}/N and \gamma is tuned from \{2\times10^{2},5\times10^{2},8\times10^{2},4\times10^{3},2\times10^{4}\}/N.

Following the experiments in Hu et al. (2020), when solving (37), we first use the same step size and the same number of iterations for all k's. We choose the step size from \{0.01,0.1,0.5,1\} and the iteration number from \{2000,3000\}. However, we find that the performance of DCA improves if we let the step size and the number of iterations vary with k. Hence, we apply the same process as in AGD-SBCD to select \theta_{t}, \eta_{t}, and T in Algorithm 4 in the kth main iteration of DCA. We report the performance of DCA under both settings (named DCA.Constant.lr and DCA.Changing.lr, respectively). The change of the SoRR loss with the number of epochs is shown in Figure 2. From the curves, we observe that our algorithm reduces the SoRR loss faster than both DCA variants on all four datasets.

Figure 2: Results for SoRR Loss Minimization

E.2 Process of Tuning Hyperparameters

For AGD-SBCD, we fix I=J=100, m=10^{4} and n=10^{5}. In the spirit of Theorem 5, within Algorithm 1 called in the kth main iteration, \eta_{t} is set to \frac{c}{(k+1)N_{+}N_{-}} and \theta_{t} is set to \frac{c}{(k+1)N_{-}} for all t, with c tuned over \{0.1,1,5,10,20,30\}, and T_{k} is set to 50(k+1)^{2}. Parameter \mu is chosen from \{10^{2},10^{3}\}/(N_{+}N_{-}) and \gamma is tuned from \{2\times10^{3},5\times10^{3}\}/(N_{+}N_{-}). For DCA, we only implement the setting DCA.Changing.lr (see Appendix E.1 for its description). In Algorithm 4 called in the kth main iteration, \eta_{t} and \theta_{t} are selected following the same process as in AGD-SBCD and T is set to C(k+1)^{2} with C chosen from \{100,200,500\}. For the SVMpAUC-tight method, we use the MATLAB code of Narasimhan and Agarwal (2013b) and tune the hyper-parameter C from \{10^{-3},10^{-2},10^{-1},10^{0}\}.
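For reference, these schedules can be written compactly as below (a sketch; c is the tuned constant and k the outer-iteration index).

def agd_sbcd_schedules(k, c, N_pos, N_neg):
    # step sizes and inner iteration count for AGD-SBCD in the pAUC experiments
    eta_t = c / ((k + 1) * N_pos * N_neg)   # constant over t within outer iteration k
    theta_t = c / ((k + 1) * N_neg)
    T_k = 50 * (k + 1) ** 2
    return eta_t, theta_t, T_k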

E.3 Additional Plots of Partial AUC Maximization Experiment

The results for partial AUC maximization of diseases D3–D5 are shown in Figure 3.

Figure 3: Results for Partial AUC Maximization of D3, D4 and D5.

We plot the ROC curves of the three algorithms with a linear model on the CheXpert testing set in Figure 4. The range of FPRs for the pAUC is set to [0.05,0.5].

Figure 4: ROC Curves of CheXpert Testing Set.

We plot the convergence curves of the partial AUC on the CheXpert testing set for our algorithm AGD-SBCD and the baseline MB in Figure 5.

Figure 5: Convergence Curves of Partial AUC on the CheXpert Testing Set.

E.4 DCA and Algorithm for its Subproblem

Although DCA was originally studied only for SoRR loss minimization in Hu et al. (2020), it can also be applied to pAUC maximization. Hence, we only describe DCA for pAUC maximization, which covers SoRR loss minimization as a special case. At the kth iteration of DCA, it computes a deterministic subgradient of f^{m}({\mathbf{w}})=\sum_{i=1}^{N_{+}}\phi_{m}(S_{i}({\mathbf{w}})) at the iterate {\mathbf{w}}^{(k)}, denoted by {\boldsymbol{\xi}}^{(k)}. Then DCA updates {\mathbf{w}}^{(k)} by approximately solving the following subproblem using an SBCD method similar to Algorithm 1 that samples indices i and j (see Algorithm 4 for details).

(𝐰(k+1),𝝀(k+1))\displaystyle({\mathbf{w}}^{(k+1)},{\boldsymbol{\lambda}}^{(k+1)}) \displaystyle\approx argmin𝐰,𝝀n𝟏𝝀+i=1N+j=1N[sij(𝐰)λi]+𝐰𝝃(k)\displaystyle\operatorname*{arg\,min}_{{\mathbf{w}},{\boldsymbol{\lambda}}}n\mathbf{1}^{\top}{\boldsymbol{\lambda}}+\sum_{i=1}^{N_{+}}\sum_{j=1}^{N_{-}}[s_{ij}({\mathbf{w}})-\lambda_{i}]_{+}-{\mathbf{w}}^{\top}{\boldsymbol{\xi}}^{(k)} (37)

Algorithm 4 is used to solve the subproblem (37) in the kkth main iteration of DCA. It is similar to the SBCD method in Algorithm 1.

Algorithm 4 Stochastic Block Coordinate Descent for (37): (𝐰¯,𝝀¯)=(\bar{\mathbf{w}},\bar{\boldsymbol{\lambda}})=SBCD(𝐰,𝝀,T,𝝃(k))({\mathbf{w}},{\boldsymbol{\lambda}},T,{\boldsymbol{\xi}}^{(k)})
1:Input: Initial solution (𝐰,𝝀)({\mathbf{w}},{\boldsymbol{\lambda}}), the number of iterations TT, sample sizes II and JJ and a deterministic subgradient 𝝃(k){\boldsymbol{\xi}}^{(k)}.
2:Set (𝐰(0),𝝀(0))=(𝐰,𝝀)({\mathbf{w}}^{(0)},{\boldsymbol{\lambda}}^{(0)})=({\mathbf{w}},{\boldsymbol{\lambda}}) and choose (ηt,θt)t=0T1(\eta_{t},\theta_{t})_{t=0}^{T-1}.
3:for t=0t=0 to T1T-1 do
4:     Sample t{1,,N+}\mathcal{I}_{t}\subset\{1,\dots,N_{+}\} with |t|=I|\mathcal{I}_{t}|=I.
5:     Sample 𝒥t{1,,N}\mathcal{J}_{t}\subset\{1,\dots,N_{-}\} with |𝒥t|=J|\mathcal{J}_{t}|=J.
6:     Compute stochastic subgradient w.r.t. 𝐰{\mathbf{w}}:
G𝐰(t)=N+NIJitj𝒥tsij(𝐰(t))𝟏(sij(𝐰(t))>λi(t))𝝃(k)G_{\mathbf{w}}^{(t)}=\frac{N_{+}N_{-}}{IJ}\sum\limits_{i\in\mathcal{I}_{t}}\sum\limits_{j\in\mathcal{J}_{t}}\nabla s_{ij}({\mathbf{w}}^{(t)})\mathbf{1}\left(s_{ij}({\mathbf{w}}^{(t)})>\lambda_{i}^{(t)}\right)-{\boldsymbol{\xi}}^{(k)}
7:     Stochastic subgradient update on 𝐰{\mathbf{w}}:
𝐰(t+1)=𝐰(t)ηtG𝐰(t){\mathbf{w}}^{(t+1)}={\mathbf{w}}^{(t)}-\eta_{t}G_{\mathbf{w}}^{(t)}
8:     Compute stochastic subgradient w.r.t. λi\lambda_{i} for iti\in\mathcal{I}_{t}:
Gλi(t)=nNJj𝒥t𝟏(sij(𝐰(t))>λi(t)) for itG_{\lambda_{i}}^{(t)}=n-\frac{N_{-}}{J}\sum\limits_{j\in\mathcal{J}_{t}}\mathbf{1}\left(s_{ij}({\mathbf{w}}^{(t)})>\lambda_{i}^{(t)}\right)~{}\text{ for }i\in\mathcal{I}_{t}
9:     Stochastic block subgradient update on λi\lambda_{i} for iti\in\mathcal{I}_{t}:
λi(t+1)={λi(t)θtGλi(t)it,λi(t)it.\lambda_{i}^{(t+1)}=\left\{\begin{array}[]{cc}\lambda_{i}^{(t)}-\theta_{t}G_{\lambda_{i}}^{(t)}&i\in\mathcal{I}_{t},\\ \lambda_{i}^{(t)}&i\notin\mathcal{I}_{t}.\end{array}\right.
10:end for
11:Output: (𝐰¯,𝝀¯)=(𝐰(T),𝝀(T))(\bar{\mathbf{w}},\bar{\boldsymbol{\lambda}})=({\mathbf{w}}^{(T)},{\boldsymbol{\lambda}}^{(T)}).
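Below is a minimal Python sketch of one call to Algorithm 4 for a linear model, assuming the smooth squared-hinge pair score s_{ij}({\mathbf{w}})=\max(0,1-({\mathbf{w}}^{\top}{\mathbf{x}}_{i}^{+}-{\mathbf{w}}^{\top}{\mathbf{x}}_{j}^{-}))^{2} and constant step sizes; the surrogate and step sizes are illustrative assumptions.

import numpy as np

def sbcd_dca(w, lam, T, xi_k, n, Xp, Xn, I=10, J=10, eta=1e-3, theta=1e-3, rng=0):
    # sketch of Algorithm 4 (SBCD for subproblem (37)) under the assumptions above;
    # Xp and Xn hold the positive and negative examples, xi_k is the outer DCA subgradient
    rng = np.random.default_rng(rng)
    Np, Nn = Xp.shape[0], Xn.shape[0]
    w, lam = w.astype(float).copy(), lam.astype(float).copy()
    for t in range(T):
        Ib = rng.choice(Np, size=I, replace=False)          # sample I_t
        Jb = rng.choice(Nn, size=J, replace=False)          # sample J_t
        margins = (Xp[Ib] @ w)[:, None] - (Xn[Jb] @ w)[None, :]
        h = np.maximum(0.0, 1.0 - margins)
        s = h ** 2                                          # s_ij(w^(t))
        active = s > lam[Ib][:, None]                       # 1(s_ij(w^(t)) > lambda_i^(t))
        diffs = Xp[Ib][:, None, :] - Xn[Jb][None, :, :]     # x_i^+ - x_j^-
        grad_s = -2.0 * h[:, :, None] * diffs               # gradient of s_ij w.r.t. w
        G_w = (Np * Nn / (I * J)) * (grad_s * active[:, :, None]).sum(axis=(0, 1)) - xi_k
        w = w - eta * G_w
        G_lam = n - (Nn / J) * active.sum(axis=1)           # subgradient w.r.t. lambda_i, i in I_t
        lam[Ib] = lam[Ib] - theta * G_lam
    return w, lam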

E.5 Details about Proximal DCA

At the kth iteration of proximal DCA, it computes a deterministic subgradient of f^{m}({\mathbf{w}})=\sum_{i=1}^{N_{+}}\phi_{m}(S_{i}({\mathbf{w}})) at the iterate {\mathbf{w}}^{(k)}, denoted by {\boldsymbol{\xi}}^{(k)}. Then proximal DCA updates {\mathbf{w}}^{(k)} by approximately solving the subproblem

(𝐰(k+1),𝝀(k+1))argmin𝐰,𝝀n𝟏𝝀+i=1N+j=1N[sij(𝐰)λi]++L2𝐰𝐰(k)2𝐰𝝃(k)\displaystyle({\mathbf{w}}^{(k+1)},{\boldsymbol{\lambda}}^{(k+1)})\approx\operatorname*{arg\,min}_{{\mathbf{w}},{\boldsymbol{\lambda}}}n\mathbf{1}^{\top}{\boldsymbol{\lambda}}+\sum_{i=1}^{N_{+}}\sum_{j=1}^{N_{-}}[s_{ij}({\mathbf{w}})-\lambda_{i}]_{+}+\frac{L}{2}\|{\mathbf{w}}-{\mathbf{w}}^{(k)}\|^{2}-{\mathbf{w}}^{\top}{\boldsymbol{\xi}}^{(k)} (38)

using an SBCD method similar to Algorithm 1 that samples indices i and j. In proximal DCA, L is tuned within the range from 10^{-5} to 10^{0} and the other hyper-parameters are tuned over the same ranges as in DCA.
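Relative to the sketch of Algorithm 4 above, the only change needed for (38) is the proximal term \frac{L}{2}\|{\mathbf{w}}-{\mathbf{w}}^{(k)}\|^{2}, which adds L({\mathbf{w}}-{\mathbf{w}}^{(k)}) to the stochastic gradient with respect to {\mathbf{w}}; a minimal sketch:

import numpy as np

def add_proximal_term(G_w, w, w_k, L):
    # gradient of (L/2)||w - w^(k)||^2 added to the stochastic gradient G_w
    return G_w + L * (np.asarray(w) - np.asarray(w_k))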

E.6 Details about CIFAR-10-LT and Tiny-Imagenet-200-LT Datasets

Binary CIFAR-10-LT Dataset. To construct a binary classification task, we set the labels of the category 'cats' to 1 and the labels of all other categories to 0. We split the training data with a train/val ratio of 9:1 and use the validation split as the testing set. More details are provided in Table 5.

Binary Tiny-Imagenet-200-LT Dataset. To construct a binary classification task, we set the labels of the category 'birds' to 1 and the labels of all other categories to 0. We split the training data with a train/val ratio of 9:1 and use the validation split as the testing set. More details are provided in Table 5.

Table 5: Statistics of the Long-Tailed Datesets.
Dataset Pos. Class ID Pos. Class Name # Pos. Samples # Neg. Samples
CIFAR-10-LT 3 cats 1077 11329
Tiny-Imagenet-200-LT 11,20,21,22 birds 1308 20241
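A minimal sketch of the relabeling and the 9:1 split described above is given below; the label array is synthetic and stands in for the long-tailed CIFAR-10 training labels, with class id 3 ('cats') as the positive class.

import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=12406)    # synthetic stand-in for CIFAR-10-LT class ids

binary = (labels == 3).astype(int)          # 'cats' -> 1, all other classes -> 0

perm = rng.permutation(len(binary))
cut = int(0.9 * len(perm))                  # train/val = 9:1
train_idx, val_idx = perm[:cut], perm[cut:] # the validation split is used for testing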

E.7 Convergence Curves and Testing Performance of AGD-SBCD on Different μ\mu

Figure 6: Convergence curves of training loss and training pAUC of AGD-SBCD on different \mu.
Table 6: The pAUCs with FPRs between 0.05 and 0.5 on the CheXpert testing sets for AGD-SBCD with different values of \mu.
μ\mu D1 D2 D3 D4 D5
AGD-SBCD 103/N+N10^{3}/N_{+}N_{-} 0.6721±\pm0.0081 0.8257±\pm0.0025 0.8016±\pm0.0075 0.6340±\pm0.0165 0.8500±\pm0.0017
108/N+N10^{8}/N_{+}N_{-} 0.6617±\pm0.0073 0.8242±\pm0.0057 0.8272±\pm0.0070 0.6323±\pm0.0028 0.8463±\pm0.0003
1010/N+N10^{10}/N_{+}N_{-} 0.6636±\pm0.0056 0.8242±\pm0.0057 0.8237±\pm0.0077 0.6332±\pm0.0072 0.8463±\pm0.0002

E.8 Convergence Curves of Training pAUC over GPU Time

Figure 7: Convergence curves of training pAUC over GPU time. (The dashed line of SVMpAUC-tight does not reflect its convergence with GPU time; it is only reported for reference since we use the authors' MATLAB implementation, which does not support GPU.)