
Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning

Ziyi Chen Electrical & Computer Engineering
University of Utah
Salt Lake City, US
[email protected]
   Shaocong Ma Electrical & Computer Engineering
University of Utah
Salt Lake City, US
[email protected]
   Yi Zhou Electrical & Computer Engineering
University of Utah
Salt Lake City, US
[email protected]
Abstract

Alternating gradient descent-ascent (AltGDA) is an optimization algorithm that has been widely used for model training in various machine learning applications, where it is applied to solve nonconvex minimax optimization problems. However, existing studies show that it suffers from a high computation complexity in nonconvex minimax optimization. In this paper, we develop a single-loop and fast AltGDA-type algorithm that leverages proximal gradient updates and momentum acceleration to solve regularized nonconvex minimax optimization problems. By leveraging the momentum acceleration technique, we prove that the algorithm converges to a critical point in nonconvex minimax optimization and achieves a computation complexity of order $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, where $\epsilon$ is the desired level of accuracy and $\kappa$ is the problem's condition number. This computation complexity improves the state-of-the-art complexities of single-loop GDA and AltGDA algorithms (see the comparison summary in Table I). We demonstrate the effectiveness of our algorithm via an experiment on adversarial deep learning.

Index Terms:
Minimax optimization, alternating gradient descent ascent, proximal gradient, momentum, complexity.

I Introduction

Minimax optimization is an emerging and important optimization framework that covers a variety of modern machine learning applications. Popular examples include generative adversarial networks (GANs) [13], adversarial machine learning [37], game theory [10], and reinforcement learning [34]. A standard minimax optimization problem is written as $\min_{x\in\mathbb{R}^{m}}\max_{y\in\mathcal{Y}}\ f(x,y)$, where $f$ is a differentiable bivariate function. A basic algorithm for solving this problem is gradient descent-ascent (GDA), which simultaneously performs a gradient descent update on $x$ and a gradient ascent update on $y$, i.e., $x_{t+1}=x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})$, $y_{t+1}=y_{t}+\eta_{y}\nabla_{2}f(x_{t},y_{t})$. Here, $\nabla_{1}$ and $\nabla_{2}$ denote the gradient operators with respect to the first and the second variable, respectively. The convergence rate of GDA has been studied under various geometries of the minimax problem, including strongly-convex-strongly-concave geometry [9], nonconvex-(strongly)-concave geometry [23] and Łojasiewicz-type geometry [5]. Recently, by leveraging the popular momentum technique [28, 38, 3, 11, 20, 21] for accelerating gradient-based algorithms, accelerated variants of GDA have been proposed for strongly-convex-strongly-concave [45] and nonconvex-strongly-concave [17] minimax optimization. There are other accelerated GDA-type algorithms that achieve a near-optimal complexity [22, 46], but they involve complex nested-loop structures and require function smoothing with many fine-tuned hyper-parameters, which are not used in practical minimax machine learning.

Another important variant of GDA that has been widely used in practical minimax machine learning is the alternating GDA (AltGDA) algorithm, which updates the two variables $x$ and $y$ alternately via $x_{t+1}=x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})$, $y_{t+1}=y_{t}+\eta_{y}\nabla_{2}f(x_{t+1},y_{t})$, as sketched below. Compared with GDA, the $y$-update of AltGDA uses the fresh iterate $x_{t+1}$ instead of the previous $x_{t}$, and AltGDA is known to converge faster than the standard GDA algorithm [4, 2, 12, 42]. Despite its popularity, AltGDA suffers from a high computation complexity in nonconvex minimax optimization, and this study aims to improve its complexity by leveraging momentum acceleration techniques. In the existing literature, the convergence of momentum-accelerated AltGDA has been established only for convex-concave minimax problems [43] and bilinear minimax problems [12]; it has not been explored in the nonconvex minimax optimization that underlies modern machine learning applications. This study fills this gap by developing a single-loop proximal-AltGDA algorithm with momentum acceleration for solving a class of regularized nonconvex minimax problems, and by analyzing its convergence and computation complexity.
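To make the distinction concrete, the following is a minimal sketch (our own illustration, not from any of the cited papers) contrasting one GDA step with one AltGDA step; `grad1` and `grad2` are assumed callables returning the partial gradients of $f$ with respect to $x$ and $y$.

```python
def gda_step(x, y, eta_x, eta_y, grad1, grad2):
    x_new = x - eta_x * grad1(x, y)
    y_new = y + eta_y * grad2(x, y)      # ascent uses the stale x_t
    return x_new, y_new

def altgda_step(x, y, eta_x, eta_y, grad1, grad2):
    x_new = x - eta_x * grad1(x, y)
    y_new = y + eta_y * grad2(x_new, y)  # ascent uses the fresh x_{t+1}
    return x_new, y_new
```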

I-A Our Contribution

We are interested in a class of regularized nonconvex-strongly-concave minimax optimization problems, where the regularizers are convex functions that can possibly be non-smooth. To solve this class of minimax problems, we propose a single-loop proximal-AltGDA with momentum algorithm (referred to as proximal-AltGDAm). The algorithm performs single-loop updates that consist of a proximal gradient descent update with heavy-ball momentum, and a proximal gradient ascent update with Nesterov's momentum. Our algorithm extends the applicability of the conventional momentum acceleration schemes (heavy-ball and Nesterov's momentum) from nonconvex minimization to nonconvex minimax optimization.

We study the convergence properties of proximal-AltGDAm. Specifically, under standard smoothness assumptions on the objective function and by viewing the accelerated alternating GDA updates as inexact accelerated gradient updates, we develop an analysis showing that every limit point of the parameter sequences generated by the algorithm is a critical point of the regularized nonconvex minimax problem. Moreover, to achieve an $\epsilon$-accurate critical point, the overall computation complexity (i.e., number of gradient and proximal mapping evaluations) is of order $\mathcal{O}\big(\kappa^{\frac{11}{6}}\epsilon^{-2}\big)$, where $\kappa>1$ is the condition number of the problem. Thanks to momentum acceleration and a tight analysis in our technical proof, this computation complexity is lower than that of the existing single-loop GDA-type algorithms. See Table I for a comparison of computation complexities and Appendix E for their derivation.

Table I: Comparison of computation complexity (number of gradient and proximal mapping evaluations) of the existing single-loop GDA-type algorithms in nonconvex-strongly-concave minimax optimization, where $\kappa\geq 1$ is the problem condition number.

| Algorithm | Alternating update | Momentum acceleration | Computation complexity |
| --- | --- | --- | --- |
| (Chen et al.) [5] | ✗ | ✗ | $\mathcal{O}(\kappa^{6}\epsilon^{-2})$ |
| (Huang et al.) [17] | ✗ | ✓ | $\mathcal{O}(\kappa^{3}\epsilon^{-2})$ |
| (Lin et al.) [23] | ✗ | ✗ | $\mathcal{O}(\kappa^{2}\epsilon^{-2})$ |
| (Xu et al.) [42] | ✓ | ✗ | $\mathcal{O}(\kappa^{5}\epsilon^{-2})$ |
| (Boţ and Böhm) [4] | ✓ | ✗ | $\mathcal{O}(\kappa^{2}\epsilon^{-2})$ |
| This work | ✓ | ✓ | $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$ |

I-B Other Related Work

GDA-type algorithms: [27] studied a slight variant of GDA that replaces gradients with subgradients for convex-concave non-smooth minimax optimization. [42] studied an AltGDA variant that adds an $\ell_{2}$ regularizer to the local objective function of GDA followed by projection onto the constraint sets, and obtained its convergence rate under nonconvex-concave and convex-nonconcave geometry. [26] studied two other variants of GDA, extra-gradient and optimistic GDA, and obtained linear convergence rates under bilinear geometry and strongly-convex-strongly-concave geometry. [6, 18] studied GDA in continuous-time dynamics using differential equations. [1] analyzed a second-order variant of the GDA algorithm.

Many other studies have developed stochastic GDA-type algorithms. [23, 44] analyzed stochastic GDA and stochastic AltGDA, respectively. Variance reduction techniques have been applied to stochastic minimax optimization, including SVRG-based [8, 44] and SARAH/SPIDER-based [41, 25] methods, as well as STORM [34] and its gradient-free version [16].

GDA-type algorithms with momentum: For strongly-convex-strongly-concave problems, [45] studied accelerated GDA with negative momentum, and [39, 22] developed nested-loop AltGDA algorithms with Nesterov's accelerated gradient descent that achieve improved complexities. For convex-concave problems, [7] analyzed stable points of both GDA and optimistic GDA with negative momentum for acceleration. Moreover, for nonconvex-(strongly)-concave problems, [40] developed a single-loop GDA with momentum and Hessian preconditioning and studied its convergence to a local minimax point, [17] developed a mirror descent ascent algorithm with momentum that includes GDA with momentum as a special case, and [30] studied nested-loop GDA in which multiple gradient ascent steps are performed, along with its momentum-accelerated version. Similarly, [34, 16] developed GDA with a momentum scheme and STORM variance reduction, and a similar algorithm has been extended to minimax optimization on Riemannian manifolds [15]. [14] developed a single-loop primal-dual stochastic momentum algorithm with ADAM-type updates.

II Problem Formulation and Preliminaries

In this section, we introduce the problem formulation and technical assumptions. We consider the following class of regularized minimax optimization problems:

$$\min_{x\in\mathbb{R}^{m}}\max_{y\in\mathcal{Y}}\ f(x,y)+g(x)-h(y),\qquad\mathrm{(P)}$$

where $f:\mathbb{R}^{m}\times\mathcal{Y}\to\mathbb{R}$ is differentiable and nonconvex-strongly-concave, $g,h$ are possibly non-smooth convex regularizers, and $\mathcal{Y}$ is a bounded set with diameter $R$. In particular, defining the envelope function $\Phi(x):=\max_{y\in\mathcal{Y}}f(x,y)-h(y)$, problem $\mathrm{(P)}$ can be rewritten as the minimization problem $\min_{x\in\mathbb{R}^{m}}\Phi(x)+g(x)$.
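As a simple sanity check (our own toy instance, not from the paper): take $g=h=0$, $f(x,y)=xy-\frac{\mu}{2}y^{2}$ and $\mathcal{Y}$ a centered interval of radius $R$. The maximizer is $y^{*}(x)=x/\mu$ whenever $|x|\leq\mu R$, so the envelope is $\Phi(x)=f(x,y^{*}(x))=\frac{x^{2}}{2\mu}$ and $\mathrm{(P)}$ reduces to $\min_{x}\frac{x^{2}}{2\mu}$.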

Throughout the paper, we adopt the following standard assumptions on problem (P), which have been considered in the existing literature [23, 5].

Assumption 1.

The minimax problem $\mathrm{(P)}$ satisfies:

  1. Function $f(\cdot,\cdot)$ is $L$-smooth and function $f(x,\cdot)$ is $\mu$-strongly concave for all $x$;

  2. Function $(\Phi+g)(x)$ is bounded below and has compact sub-level sets;

  3. Functions $g,h$ are proper and convex.

In particular, item 3 of the above assumption allows the regularizers $g,h$ to be any convex functions, possibly non-smooth. Next, consider the mapping $y^{*}(x)=\arg\max_{y\in\mathcal{Y}}f(x,y)-h(y)$, which is uniquely defined due to the strong concavity of $f(x,\cdot)$. The following proposition, proved in [23, 5], characterizes the Lipschitz continuity of the mapping $y^{*}(x)$ and the smoothness of the function $\Phi(x)$. Throughout the paper, we denote by $\kappa=L/\mu>1$ the condition number, and by $\nabla_{1}f(x,y),\nabla_{2}f(x,y)$ the gradients with respect to the first and the second input variable, respectively.

Proposition 1 ([23, 5]).

Let Assumption 1 hold. Then, the mapping $y^{*}(x)$ is $\kappa$-Lipschitz continuous and the function $\Phi(x)$ is $L(1+\kappa)$-smooth with $\nabla\Phi(x)=\nabla_{1}f(x,y^{*}(x))$.

Lastly, recall that the minimax problem (P) is equivalent to the minimization problem $\min_{x\in\mathbb{R}^{m}}\Phi(x)+g(x)$. Therefore, the optimization goal of problem (P) is to find a critical point $x^{*}$ of the nonconvex function $\Phi(x)+g(x)$, i.e., a point satisfying the optimality condition $\mathbf{0}\in\partial(\Phi+g)(x^{*})$, where $\partial$ denotes the subdifferential operator.

III Proximal Alternating GDA with Momentum

In this section, we propose a single-loop proximal alternating-GDA with momentum (proximal-AltGDAm) algorithm to solve the regularized minimax problem (P).

We first recall the update rules of the basic proximal alternating-GDA (proximal-AltGDA) algorithm [4] for solving the problem (P). Specifically, proximal-AltGDA alternates between the following two proximal-gradient updates (a.k.a. forward-backward splitting updates [24]).

(Proximal-AltGDA):

$$\begin{aligned} x_{t+1}&=\mathrm{prox}_{\eta_{x}g}\big(x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big),\\ y_{t+1}&=\mathrm{prox}_{\eta_{y}h}\big(y_{t}+\eta_{y}\nabla_{2}f(x_{t+1},y_{t})\big).\end{aligned}$$

To elaborate, the first update is a proximal gradient descent update that aims to minimize the nonconvex function $f(x,y_{t})+g(x)$ from the current point $x_{t}$, and the second update is a proximal gradient ascent update that aims to maximize the strongly-concave function $f(x_{t+1},y)-h(y)$ from the current point $y_{t}$. More specifically, the two proximal gradient mappings are formally defined as

$$\begin{aligned} \mathrm{prox}_{\eta_{x}g}\big(x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)&:=\mathop{\mathrm{argmin}}_{u\in\mathbb{R}^{m}}\Big\{g(u)+\frac{1}{2\eta_{x}}\|u-x_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\|^{2}\Big\},\\ \mathrm{prox}_{\eta_{y}h}\big(y_{t}+\eta_{y}\nabla_{2}f(x_{t+1},y_{t})\big)&:=\mathop{\mathrm{argmin}}_{v\in\mathcal{Y}}\Big\{h(v)+\frac{1}{2\eta_{y}}\|v-y_{t}-\eta_{y}\nabla_{2}f(x_{t+1},y_{t})\|^{2}\Big\}.\end{aligned}$$
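For concreteness, the following is a minimal sketch of these proximal mappings for two common regularizer choices, assuming $g(x)=\lambda\|x\|_{1}$ (elementwise soft-thresholding) and $h$ the indicator of a Euclidean ball $\mathcal{Y}$ of radius $R$ (projection); both choices are our own hypothetical instances of Assumption 1.3, not prescribed by the paper.

```python
import numpy as np

def prox_l1(v, eta, lam):
    # prox_{eta * lam * ||.||_1}(v): elementwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def prox_ball(v, radius):
    # prox of the indicator of {y : ||y|| <= radius}: Euclidean projection.
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)
```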

Compared with the standard (proximal) GDA algorithm [23, 5], the proximal ascent step of proximal-AltGDA evaluates the gradient at the freshest point $x_{t+1}$ instead of $x_{t}$. Such an alternating update is widely used in practice and usually leads to better convergence properties [4, 2, 12, 42].

Next, we introduce momentum schemes to accelerate the convergence of proximal-AltGDA. As the two proximal gradient steps of proximal-AltGDA are used to solve two different types of optimization problems, namely the nonconvex problem $f(x,y_{t})+g(x)$ and the strongly-concave problem $f(x_{t+1},y)-h(y)$, we consider applying different momentum schemes to accelerate these proximal gradient updates. Specifically, the proximal gradient descent step minimizes a composite nonconvex function, and we apply the heavy-ball momentum scheme [33] that was originally designed for accelerating nonconvex optimization. Therefore, we propose the following proximal gradient descent with heavy-ball momentum update for minimizing the nonconvex part of problem (P).

(Heavy-ball momentum):

$$\begin{aligned} \widetilde{x}_{t}&=x_{t}+\beta(x_{t}-x_{t-1}),\\ x_{t+1}&=\mathrm{prox}_{\eta_{x}g}\big(\widetilde{x}_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big).\end{aligned}$$

To explain, the first step is an extrapolation step that applies the momentum term $\beta(x_{t}-x_{t-1})$ (with momentum coefficient $\beta>0$), and the second step applies a proximal gradient update to the extrapolated point $\widetilde{x}_{t}$ using the original gradient $\nabla_{1}f(x_{t},y_{t})$. In conventional gradient-based optimization, gradient descent with such a heavy-ball momentum is guaranteed to find a critical point of smooth nonconvex functions [32, 31].

On the other hand, as the proximal gradient ascent step of proximal-AltGDA maximizes a composite strongly-concave function, we are motivated to apply the popular Nesterov’s momentum scheme [29], which was originally designed for accelerating strongly-concave (convex) optimization. Specifically, we propose the following proximal gradient ascent with Nesterov’s momentum update for maximizing the strongly-concave part of the problem (P).

(Nesterov’s momentum):

$$\begin{aligned} \widetilde{y}_{t}&=y_{t}+\gamma(y_{t}-y_{t-1}),\\ y_{t+1}&=\mathrm{prox}_{\eta_{y}h}\big(\widetilde{y}_{t}+\eta_{y}\nabla_{2}f(x_{t+1},\widetilde{y}_{t})\big).\end{aligned}\qquad(1)$$

To elaborate, the first step is a regular extrapolation step involving momentum, the same as the first step of the heavy-ball scheme. The only difference from the heavy-ball scheme is that the gradient in the second proximal gradient step is evaluated at the extrapolated point $\widetilde{y}_{t}$ rather than at $y_{t}$.

We refer to the above algorithm design as proximal-AltGDA with momentum (proximal-AltGDAm), and the algorithm updates are formally presented in Algorithm 1. Proximal-AltGDAm is a simple algorithm with a single-loop structure that adopts alternating updates with momentum acceleration. More importantly, it involves only standard hyper-parameters, such as the learning rates and momentum parameters, and is therefore easy to implement in practice. Such a practical algorithm is much simpler than the other accelerated GDA-type algorithms, which involve complex nested-loop structures and require fine-tuned function smoothing [22, 46].

Input: Initialization $x_{-1}=x_{0}$, $y_{-1}=y_{0}$, learning rates $\eta_{x},\eta_{y}$, momentum parameters $\beta,\gamma$.
for $t=0,1,2,\ldots,T-1$ do

$$\begin{aligned} \widetilde{x}_{t}&=x_{t}+\beta(x_{t}-x_{t-1}),\\ x_{t+1}&=\mathrm{prox}_{\eta_{x}g}\big(\widetilde{x}_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big),\\ \widetilde{y}_{t}&=y_{t}+\gamma(y_{t}-y_{t-1}),\\ y_{t+1}&=\mathrm{prox}_{\eta_{y}h}\big(\widetilde{y}_{t}+\eta_{y}\nabla_{2}f(x_{t+1},\widetilde{y}_{t})\big).\end{aligned}$$

end for
Output: $x_{T},y_{T}$.
Algorithm 1 Proximal Alternating GDA with Momentum (proximal-AltGDAm)
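For illustration, here is a minimal numpy sketch of Algorithm 1 on a hypothetical toy instance $f(x,y)=x^{\top}y-\frac{\mu}{2}\|y\|^{2}$ with $g=\lambda\|\cdot\|_{1}$ and $h$ the indicator of a ball, reusing the `prox_l1` and `prox_ball` helpers sketched in Section III; none of this code is from the paper.

```python
import numpy as np

def prox_altgdam(x0, y0, eta_x, eta_y, beta, gamma, mu, lam, radius, T):
    # Toy instance: f(x, y) = x @ y - (mu / 2) * ||y||^2, so
    # grad_1 f(x, y) = y and grad_2 f(x, y) = x - mu * y.
    x_prev, x = x0.copy(), x0.copy()   # x_{-1} = x_0
    y_prev, y = y0.copy(), y0.copy()   # y_{-1} = y_0
    for _ in range(T):
        x_tilde = x + beta * (x - x_prev)             # heavy-ball extrapolation
        x_next = prox_l1(x_tilde - eta_x * y, eta_x, lam)
        y_tilde = y + gamma * (y - y_prev)            # Nesterov extrapolation
        y_next = prox_ball(y_tilde + eta_y * (x_next - mu * y_tilde), radius)
        x_prev, x, y_prev, y = x, x_next, y, y_next
    return x, y
```

Note that the $y$-update evaluates its gradient at the fresh `x_next` and at the extrapolated `y_tilde`, exactly mirroring the alternating structure of Algorithm 1.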

IV Convergence and Computation Complexity of Proximal-AltGDAm

Although the proposed proximal-AltGDAm algorithm applies the popular heavy-ball momentum and Nesterov's momentum to the GDA updates in a straightforward way, its convergence analysis does not directly follow from the existing studies of momentum-accelerated gradient descent algorithms. To explain, notice that the $x$-proximal gradient update of Algorithm 1 involves the partial gradient $\nabla_{1}f(x_{t},y_{t})$, which corresponds to the gradient of the time-varying function $f(\cdot,y_{t})$ (since $y_{t}$ changes over time $t$). Similarly, the $y$-proximal gradient update involves the gradient of another time-varying function $f(x_{t+1},\cdot)$. Therefore, both momentum-accelerated proximal gradient updates are applied to time-varying functions due to the nature of GDA updates. In this sense, the existing analysis of momentum-accelerated gradient descent algorithms cannot be directly applied to this algorithm.

To analyze the convergence of Algorithm 1, we first study the $x$-proximal gradient update with heavy-ball momentum and obtain the following characterization of its per-iteration progress on the objective function value.

Proposition 2.

Let Assumption 1 hold. Then, the parameter sequences $\{x_{t},y_{t}\}_{t}$ generated by proximal-AltGDAm satisfy, for all $t=0,1,2,\ldots$,

$$\begin{aligned} \Phi(x_{t+1})+g(x_{t+1})&\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1-\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\|x_{t+1}-x_{t}\|^{2}\\ &\quad+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\|y^{*}(x_{t})-y_{t}\|^{2}.\end{aligned}\qquad(2)$$

The above proposition tracks the per-iteration optimization progress made by the $x$-proximal gradient update with heavy-ball momentum. To elaborate, the increment terms $\|x_{t+1}-x_{t}\|^{2},\|x_{t}-x_{t-1}\|^{2}$ are induced by the heavy-ball momentum scheme. Moreover, since the $x$-update uses the partial gradient $\nabla_{1}f(x_{t},y_{t})$ to approximate the exact gradient $\nabla\Phi(x_{t})=\nabla_{1}f(x_{t},y^{*}(x_{t}))$, it naturally induces a tracking error term $\|y_{t}-y^{*}(x_{t})\|^{2}$ that tracks the optimization gap of the $y$-update. Hence, we need to further bound this tracking error term by analyzing the $y$-proximal gradient update with Nesterov's momentum, which we do next.

As explained earlier, the momentum-accelerated $y$-updates in proximal-AltGDAm are applied to a time-varying strongly-concave function $f(x_{t+1},\cdot)$. Hence, the tracking error term $\|y_{t}-y^{*}(x_{t})\|^{2}$ cannot be directly bounded using the standard convergence result for Nesterov's accelerated proximal gradient algorithm [28]. Instead, we view these $y$-updates as inexact accelerated proximal gradient updates. To elaborate, note that in the $t$-th iteration, the $y$-proximal gradient update is applied to the function $f(x_{t+1},\cdot)$. Then, we can rewrite the $y$-updates in all the previous iterations $k=0,1,\ldots,t-1$ as follows.

$$y_{k+1}=\mathrm{prox}_{\eta_{y}h}\Big(\widetilde{y}_{k}+\eta_{y}\nabla_{2}f(x_{t+1},\widetilde{y}_{k})+\eta_{y}\underbrace{[\nabla_{2}f(x_{k+1},\widetilde{y}_{k})-\nabla_{2}f(x_{t+1},\widetilde{y}_{k})]}_{\mathbf{e}_{k+1}}\Big).\qquad(3)$$

That is, we can view all the previous $y$-updates as being applied to the fixed function $f(x_{t+1},\cdot)$ with an inexactness $\mathbf{e}_{k+1}$ in the computation of the partial gradient $\nabla_{2}f(x_{t+1},\widetilde{y}_{k})$. In summary, the $y$-updates of proximal-AltGDAm can be understood as inexact accelerated gradient updates applied to the function $f(x_{t+1},\cdot)$ at time $t$. In particular, under the smoothness condition in Assumption 1, the inexactness is bounded as $\|\mathbf{e}_{k+1}\|\leq L\|x_{k+1}-x_{t+1}\|$. Consequently, we can leverage the existing convergence result for inexact accelerated gradient algorithms [36] to bound the optimality gap $\|y_{t}-y^{*}(x_{t})\|^{2}$ as follows.

Proposition 3.

Let Assumption 1 hold. Choose learning rate $\eta_{y}=\frac{1}{L}$ and momentum parameter $\gamma=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$. Then, the parameter sequences $\{x_{t},y_{t}\}$ generated by proximal-AltGDAm satisfy, for all $t=0,1,2,\ldots$,

$$\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\leq\frac{2R\kappa}{\sqrt{L}}(1-\kappa^{-0.5})^{t+1}+\frac{6\kappa^{2}}{\sqrt{L}}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|.\qquad(4)$$

Intuitively, in the above bound, the first term on the right-hand side corresponds to the normal accelerated convergence rate, and the other term is induced by the inexactness $\mathbf{e}_{k}$. As both terms are scaled by the accelerated convergence factor $(1-\kappa^{-0.5})$, we expect the above bound to converge fast and to further facilitate the convergence in eq. (2). Next, substituting eq. (4) into eq. (2) and telescoping, we obtain the following asymptotic stability properties of proximal-AltGDAm.

Corollary 1.

Under the conditions of Proposition 3 and with stepsize $\eta_{x}\leq 1/(16L\kappa^{\frac{11}{6}})$, the sequences $\{x_{t},y_{t}\}_{t}$ generated by proximal-AltGDAm satisfy

$$\|x_{t+1}-x_{t}\|,\ \|y_{t}-y^{*}(x_{t})\|,\ \|y_{t+1}-y_{t}\|\overset{t}{\to}0.$$

Remark 1. In [5], the proximal-GDA algorithm (without alternating updates and momentum) requires a small learning rate $\eta_{x}\leq\mathcal{O}(L^{-2}\kappa^{-3})$ to establish convergence. In comparison, proximal-AltGDAm allows a much larger learning rate $\eta_{x}\leq\mathcal{O}(L^{-1}\kappa^{-\frac{11}{6}})$, which guarantees a faster convergence (proved later), thanks to the momentum acceleration schemes.

Therefore, if we can show that $\{x_{t}\}_{t}$ converges to a desired critical point $x^{*}$, then the above stability properties of proximal-AltGDAm imply that $\{y_{t}\}_{t}$ converges to the corresponding best-response point $y^{*}(x^{*})$.

To further characterize the global convergence properties of proximal-AltGDAm, we define the following proximal gradient mapping associated with the objective function $\Phi(x)+g(x)$:

$$G(x):=\frac{1}{\eta_{x}}\Big(x-\mathrm{prox}_{\eta_{x}g}\big(x-\eta_{x}\nabla\Phi(x)\big)\Big).\qquad(5)$$

The proximal gradient mapping is a standard metric for evaluating the optimality of nonconvex composite optimization problems [28]. It can be shown that $x$ is a critical point of $(\Phi+g)(x)$ if and only if $G(x)=\mathbf{0}$, and $G$ reduces to the normal gradient when there is no regularizer $g$. Hence, our convergence criterion is to find a near-critical point $x$ that satisfies $\|G(x)\|\leq\epsilon$ for some pre-determined accuracy $\epsilon>0$.
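As a sketch, the stopping criterion $\|G(x)\|\leq\epsilon$ can be checked as follows, assuming hypothetical callables `grad_Phi` (returning an approximation of $\nabla\Phi(x)$, e.g., $\nabla_{1}f(x,y_{t})$) and `prox_g` (implementing $\mathrm{prox}_{\eta_{x}g}$); neither name is from the paper.

```python
import numpy as np

def grad_mapping_norm(x, eta_x, grad_Phi, prox_g):
    # ||G(x)|| from eq. (5): the scaled residual of one proximal gradient step.
    G = (x - prox_g(x - eta_x * grad_Phi(x), eta_x)) / eta_x
    return np.linalg.norm(G)
```

In practice, one stops the outer loop once `grad_mapping_norm` falls below the target accuracy $\epsilon$.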

Next, by leveraging Proposition 2 and Proposition 3, we obtain the following global convergence result of proximal-AltGDAm and characterize its computational complexity (number of gradient and proximal mapping evaluations).

Theorem 1 (Global convergence).

Under the conditions of Proposition 3 and with stepsize $\eta_{x}\leq 1/(16L\kappa^{\frac{11}{6}})$, the sequence $\{x_{t}\}_{t}$ generated by proximal-AltGDAm is bounded and has a compact set of limit points, and every such limit point is a critical point of $(\Phi+g)(x)$. Moreover, the total number of iterations $T$ required to achieve $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\epsilon$ is $T=\mathcal{O}\big(\kappa^{\frac{11}{6}}\epsilon^{-2}\big)$, which is also the order of the required computation complexity.

To elaborate, the first statement of Theorem 1 shows that the sequence generated by proximal-AltGDAm converges to critical points of the minimax problem. The second statement shows that the computation complexity of proximal-AltGDAm for finding a near-critical point is of order $\mathcal{O}\big(\kappa^{\frac{11}{6}}\epsilon^{-2}\big)$, which strictly improves the complexity $\mathcal{O}\big(\kappa^{2}\epsilon^{-2}\big)$ of both GDA [23] and proximal-AltGDA [4] in nonconvex-strongly-concave minimax optimization. The improvement is substantial when the problem condition number $\kappa=L/\mu$ is large, while the dependence on $\epsilon^{-2}$ is generally unimprovable in nonconvex optimization. In addition, thanks to the momentum acceleration schemes, our choice of learning rate $\eta_{x}=\mathcal{O}(L^{-1}\kappa^{-\frac{11}{6}})$ is more flexible than the choice $\eta_{x}=\mathcal{O}(L^{-1}\kappa^{-2})$ adopted by these GDA-type algorithms. These improvements are attributed not only to momentum acceleration, but also to the careful selection of the coefficients in the proof, which aims to minimize the dependence of the complexity on $\kappa$.

V Experiments

In this section, we compare the performance of proximal-AltGDAm with that of other GDA-type algorithms via numerical experiments. Specifically, we compare proximal-AltGDAm with the standard proximal-GDA and proximal-AltGDA algorithms [5] and the single-loop accelerated AltGDA algorithm (APDA) [47]. All of these algorithms have a single-loop structure.

We consider solving the following regularized Wasserstein robustness model (WRM) [37] on the MNIST dataset [19]:

$$\min_{\theta}\max_{\{\xi_{i}\}^{n}_{i=1}}\ \frac{1}{n}\sum_{i=1}^{n}\Big[\ell(h_{\theta}(\xi_{i}),y_{i})-\lambda\|\xi_{i}-x_{i}\|^{2}\Big]-\lambda_{1}\sum_{i=1}^{n}\|\xi_{i}\|_{1}+\frac{\lambda_{2}}{2}\|\theta\|^{2},\qquad(6)$$

where $n=60$k is the number of training samples, $\theta$ is the model parameter, $(x_{i},y_{i})$ is the $i$-th data sample and its label, and $\xi_{i}$ is the adversarial sample corresponding to $x_{i}$. We choose the cross-entropy loss function $\ell$. We add an $\ell_{1}$ regularizer to impose sparsity on the adversarial examples, and an $\ell_{2}^{2}$ regularizer to prevent divergence of the model parameters.

We set $\lambda=1.0$, which suffices to make the maximization part strongly concave, and set $\lambda_{1}=\lambda_{2}=10^{-4}$. We use a convolutional network that consists of two convolution blocks followed by two fully connected layers. Specifically, each convolution block contains a convolution layer, a max-pooling layer with stride 2, and a ReLU activation layer. The convolution layers in the two blocks have 1, 10 input channels and 10, 20 output channels, respectively, and both have kernel size 5, stride 1 and no padding. The two fully connected layers have input dimensions 320, 50 and output dimensions 50, 10, respectively.
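A minimal PyTorch sketch of this classifier is given below; the layer sizes follow the description above (a 28x28 MNIST input yields the 320-dimensional flattened feature), while the module names and layout are our own illustration.

```python
import torch.nn as nn

class WRMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5, stride=1),   # block 1: 28x28 -> 24x24
            nn.MaxPool2d(2),                             # -> 12x12
            nn.ReLU(),
            nn.Conv2d(10, 20, kernel_size=5, stride=1),  # block 2: 12x12 -> 8x8
            nn.MaxPool2d(2),                             # -> 4x4, so 20 * 4 * 4 = 320
            nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(320, 50), nn.Linear(50, 10))

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```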

We implement all four algorithms using full gradients over the whole training set of 60k images. We choose the same learning rates $\eta_{x}=\eta_{y}=10^{-3}$ for all algorithms. For proximal-AltGDAm, we choose momentum parameters $\beta=0.25$ and $\gamma=0.75$. For APDA, we choose the fine-tuned $\eta=2\eta_{x}$. As the function $\Phi(x)$ cannot be evaluated exactly, we run 100 steps of stochastic gradient ascent with learning rate 0.1 to maximize $f(x_{t},y)-h(y)$ and obtain an approximate $y^{*}(x_{t})$, which is used to estimate $\Phi(x)+g(x)$.
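A minimal sketch of this evaluation routine (written here as proximal gradient ascent to handle the regularizer $h$; the helper names `grad2_f` and `prox_h` are ours, not from the paper):

```python
def approx_y_star(x_t, y, grad2_f, prox_h, steps=100, lr=0.1):
    # Approximate y*(x_t) = argmax_y f(x_t, y) - h(y) by gradient ascent,
    # so that Phi(x_t) + g(x_t) can be estimated for Figure 1.
    for _ in range(steps):
        y = prox_h(y + lr * grad2_f(x_t, y), lr)
    return y
```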

Figure 1: Left: comparison of $\Phi(x)+g(x)$ for all four algorithms. Right: comparison of the corresponding accuracy on the test dataset.

Figure 1 (left) compares the estimated objective function value achieved by all four algorithms. Proximal-AltGDAm achieves the fastest convergence among these algorithms and is significantly faster than proximal-GDA and proximal-AltGDA, which demonstrates the effectiveness of our simple momentum scheme. Figure 1 (right) further demonstrates the advantage of proximal-AltGDAm in test accuracy: the robust model trained by proximal-AltGDAm achieves a much higher test accuracy.

VI Conclusion

We develop a single-loop and fast AltGDA algorithm that leverages proximal gradient updates and momentum acceleration to solve general regularized nonconvex-strongly-concave minimax optimization problems. By viewing the GDA updates of the algorithm as inexact accelerated gradient updates, we prove that the algorithm converges to an $\epsilon$-critical point with a computation complexity of $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, which substantially improves the state-of-the-art result. In future work, it would be interesting to develop a stochastic variant of this algorithm to further improve the sample complexity.

Acknowledgment

The work of Ziyi Chen, Shaocong Ma and Yi Zhou was supported in part by U.S. National Science Foundation under the Grants CCF-2106216 and DMS-2134223.

References

  • [1] L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 486–495, 2019.
  • [2] J. P. Bailey, G. Gidel, and G. Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Proc. Conference on Learning Theory (COLT), pages 391–407, 2020.
  • [3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, Mar. 2009.
  • [4] R. I. Boţ and A. Böhm. Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax problems. ArXiv:2007.13605, 2020.
  • [5] Z. Chen, Y. Zhou, T. Xu, and Y. Liang. Proximal gradient descent-ascent: Variable convergence under KŁ geometry. In Proc. International Conference on Learning Representations (ICLR), 2021.
  • [6] A. Cherukuri, B. Gharesifard, and J. Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
  • [7] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 9236–9246, 2018.
  • [8] S. S. Du and W. Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 196–205, 2019.
  • [9] A. Fallah, A. Ozdaglar, and S. Pattathil. An optimal multistage stochastic gradient method for minimax problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 3573–3579. IEEE, 2020.
  • [10] M. A. M. Ferreira, M. Andrade, M. C. P. Matos, J. A. Filipe, and M. P. Coelho. Minimax theorem and nash equilibrium. International Journal of Latest Trends in Finance & Economic Sciences, 2012.
  • [11] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, Mar. 2016.
  • [12] G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas. Negative momentum for improved game dynamics. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1802–1811, 2019.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
  • [14] Z. Guo, Y. Xu, W. Yin, R. Jin, and T. Yang. On stochastic moving-average estimators for non-convex optimization. ArXiv:2104.14840, 2021.
  • [15] F. Huang, S. Gao, and H. Huang. Gradient descent ascent for min-max problems on Riemannian manifolds. ArXiv:2010.06097, 2020.
  • [16] F. Huang, S. Gao, J. Pei, and H. Huang. Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization. ArXiv:2008.08170, 2020.
  • [17] F. Huang, X. Wu, and H. Huang. Efficient mirror descent ascent methods for nonsmooth minimax problems. In Proc. International Conference on Neural Information Processing Systems (Neurips), volume 34, 2021.
  • [18] C. Jin, P. Netrapalli, and M. I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In Proc. International Conference on Machine Learning (ICML), pages 4880–4889, 2020.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [20] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Proc. International Conference on Neural Information Processing Systems (Neurips), pages 379–387, 2015.
  • [21] Q. Li, Y. Zhou, Y. Liang, and P. K. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 70, pages 2111–2119, Aug 2017.
  • [22] T. Lin, C. Jin, and M. I. Jordan. Near-optimal algorithms for minimax optimization. In Proc. Annual Conference on Learning Theory (COLT), pages 2738–2779, 2020.
  • [23] T. Lin, C. Jin, and M. I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In Proc. International Conference on Machine Learning (ICML), pages 6083–6093, 2020.
  • [24] P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
  • [25] L. Luo, H. Ye, Z. Huang, and T. Zhang. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.
  • [26] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1497–1507, 2020.
  • [27] A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
  • [28] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2014.
  • [29] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^{2})$. In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  • [30] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 14934–14942, 2019.
  • [31] P. Ochs. Local convergence of the heavy-ball method and iPiano for non-convex optimization. Journal of Optimization Theory and Applications, 177(1):153–180, Apr 2018.
  • [32] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.
  • [33] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [34] S. Qiu, Z. Yang, X. Wei, J. Ye, and Z. Wang. Single-timescale stochastic nonconvex-concave optimization for smooth nonlinear td learning. ArXiv:2008.10103, 2020.
  • [35] R. T. Rockafellar. Convex Analysis. Number 28. Princeton University Press, 1970.
  • [36] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2011.
  • [37] A. Sinha, H. Namkoong, and J. C. Duchi. Certifying some distributional robustness with principled adversarial training. In Proc. International Conference on Learning Representations (ICLR), 2018.
  • [38] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, Oct 2010.
  • [39] Y. Wang and J. Li. Improved algorithms for convex-concave minimax optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [40] Y. Wang, G. Zhang, and J. Ba. On solving minimax optimization locally: A follow-the-ridge approach. In Proc. International Conference on Learning Representations (ICLR), 2019.
  • [41] T. Xu, Z. Wang, Y. Liang, and H. V. Poor. Enhanced first and zeroth order variance reduced algorithms for min-max optimization. ArXiv:2006.09361, 2020.
  • [42] Z. Xu, H. Zhang, Y. Xu, and G. Lan. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems. ArXiv:2006.02032, 2020.
  • [43] A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. In Proc. International Conference on Learning Representations (ICLR), 2018.
  • [44] J. Yang, N. Kiyavash, and N. He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [45] G. Zhang and Y. Wang. On the suboptimality of negative momentum for minimax optimization. ArXiv:2008.07459, 2020.
  • [46] S. Zhang, J. Yang, C. Guzmán, N. Kiyavash, and N. He. The complexity of nonconvex-strongly-concave minimax optimization. ArXiv:2103.15888, 2021.
  • [47] Y. Zhu, D. Liu, and Q. Tran-Dinh. A New Primal-Dual Algorithm for a Class of Nonlinear Compositional Convex Optimization Problems. ArXiv:2006.09263, June 2020.

Appendix A Proof of Proposition 2

See Proposition 2.

Proof.

Consider the $t$-th iteration of proximal-AltGDAm. As the function $\Phi$ is $L(1+\kappa)$-smooth (from Proposition 1), we obtain that

$$\Phi(x_{t+1})\leq\Phi(x_{t})+\langle x_{t+1}-x_{t},\nabla\Phi(x_{t})\rangle+\frac{L(1+\kappa)}{2}\|x_{t+1}-x_{t}\|^{2}.\qquad(7)$$

On the other hand, by the definition of the proximal gradient update that produces $x_{t+1}$, we have that

$$g(x_{t+1})+\frac{1}{2\eta_{x}}\|x_{t+1}-\widetilde{x}_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\|^{2}\leq g(x_{t})+\frac{1}{2\eta_{x}}\|x_{t}-\widetilde{x}_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\|^{2},$$

which further simplifies to

$$\begin{aligned} g(x_{t+1})&\leq g(x_{t})+\frac{1}{2\eta_{x}}\|x_{t}-\widetilde{x}_{t}\|^{2}+\langle x_{t}-\widetilde{x}_{t},\nabla_{1}f(x_{t},y_{t})\rangle\\ &\quad-\frac{1}{2\eta_{x}}\|x_{t+1}-\widetilde{x}_{t}\|^{2}-\langle x_{t+1}-\widetilde{x}_{t},\nabla_{1}f(x_{t},y_{t})\rangle\\ &\overset{(i)}{=}g(x_{t})+\frac{\beta^{2}}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}-\frac{1}{2\eta_{x}}\|x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})\|^{2}\\ &\quad+\langle x_{t}-x_{t+1},\nabla_{1}f(x_{t},y_{t})\rangle\\ &=g(x_{t})+\frac{\beta^{2}}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}-\frac{1}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}-\frac{\beta^{2}}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\frac{\beta}{\eta_{x}}\langle x_{t+1}-x_{t},x_{t}-x_{t-1}\rangle+\langle x_{t}-x_{t+1},\nabla_{1}f(x_{t},y_{t})\rangle\\ &\leq g(x_{t})-\frac{1}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\langle x_{t}-x_{t+1},\nabla_{1}f(x_{t},y_{t})\rangle,\end{aligned}\qquad(8)$$

where (i) uses the fact that $x_{t}-\widetilde{x}_{t}=\beta(x_{t-1}-x_{t})$.

Adding up eq. (7) and eq. (8) yields that

$$\begin{aligned} &\Phi(x_{t+1})+g(x_{t+1})\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-\frac{L(1+\kappa)}{2}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\langle x_{t+1}-x_{t},\nabla\Phi(x_{t})-\nabla_{1}f(x_{t},y_{t})\rangle\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-\frac{L(1+\kappa)}{2}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\|x_{t+1}-x_{t}\|\|\nabla\Phi(x_{t})-\nabla_{1}f(x_{t},y_{t})\|\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-\frac{L(1+\kappa)}{2}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\|x_{t+1}-x_{t}\|\|\nabla_{1}f(x_{t},y^{*}(x_{t}))-\nabla_{1}f(x_{t},y_{t})\|\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-L\kappa\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+L\|x_{t+1}-x_{t}\|\|y^{*}(x_{t})-y_{t}\|\\ &\overset{(i)}{\leq}\Phi(x_{t})+g(x_{t})-\Big(\frac{1-\beta}{2\eta_{x}}-L\kappa-L\kappa^{\frac{11}{6}}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\|y^{*}(x_{t})-y_{t}\|^{2}\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1-\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\|y^{*}(x_{t})-y_{t}\|^{2},\end{aligned}\qquad(9)$$

where (i) uses the AM-GM inequality $ab\leq\frac{Ca^{2}}{2}+\frac{b^{2}}{2C}$ for any $a,b\geq 0$ and $C>0$, applied with $C=2\kappa^{\frac{11}{6}}$ to the term $L\|x_{t+1}-x_{t}\|\|y^{*}(x_{t})-y_{t}\|$. This proves eq. (2). ∎

Appendix B Proof of Proposition 3

See Proposition 3.

Proof.

We rewrite the inner accelerated gradient ascent steps in Algorithm 1 as the inexact proximal gradient method (3). Then, based on Theorem 4 of [36], with $\eta_{y}=\frac{1}{L}$ and $\gamma=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, this method has the following convergence rate:

$$f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\leq(1-\kappa^{-0.5})^{t+1}\Big(\sqrt{2\big(f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\big)}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}\|e_{i}\|(1-\kappa^{-0.5})^{-i/2}\Big).\qquad(10)$$

The above convergence rate can be simplified as follows.

$$\begin{aligned} &\frac{\mu}{2}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\\ &\overset{(i)}{\leq}f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\\ &\leq(1-\kappa^{-0.5})^{t+1}\Big(\sqrt{2\big(f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\big)}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}\|e_{i}\|(1-\kappa^{-0.5})^{-i/2}\Big)\\ &\overset{(ii)}{\leq}(1-\kappa^{-0.5})^{t+1}\sqrt{L\|y_{t+1}-y^{*}(x_{t+1})\|^{2}}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}\|\nabla_{2}f(x_{i},\widetilde{y}_{i-1})-\nabla_{2}f(x_{t+1},\widetilde{y}_{i-1})\|(1-\kappa^{-0.5})^{t+1-i/2}\\ &\overset{(iii)}{\leq}R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}(1-\kappa^{-0.5})^{t+1-i/2}\sum_{j=i}^{t}L\|x_{j+1}-x_{j}\|\\ &=R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+L\sqrt{\frac{2}{\mu}}\sum_{j=1}^{t}\sum_{i=1}^{j}(1-\kappa^{-0.5})^{t+1-i/2}\|x_{j+1}-x_{j}\|\\ &=R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+\sqrt{2L\kappa}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+0.5}\frac{(1-\kappa^{-0.5})^{-j/2}-1}{(1-\kappa^{-0.5})^{-0.5}-1}\|x_{j+1}-x_{j}\|\\ &\leq R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+\sqrt{2L\kappa}\sum_{j=1}^{t}\frac{(1-\kappa^{-0.5})^{t+1-j/2}}{1-(1-\kappa^{-0.5})^{0.5}}\|x_{j+1}-x_{j}\|\\ &\overset{(iv)}{\leq}R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+2\kappa\sqrt{2L}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|,\end{aligned}$$

where (i) and (ii) use $\nabla_{2}f(x_{t+1},y^{*}(x_{t+1}))=0$ and Assumption 1.1 (that $f(x,\cdot)$ is $L$-smooth and $\mu$-strongly concave), (iii) uses the fact that $\mathcal{Y}$ is bounded with diameter $R$ and Assumption 1.1 (that $f$ is $L$-smooth), and (iv) uses $\frac{1}{1-(1-\kappa^{-0.5})^{0.5}}=\frac{1+(1-\kappa^{-0.5})^{0.5}}{\kappa^{-0.5}}\leq 2\kappa^{0.5}$. Multiplying the above inequality by $2/\mu$ proves Proposition 3. ∎

Appendix C Proof of Corollary 1

See Corollary 1.

Proof.

Summing eq. (4) over $t=0,1,\ldots,T-1$ yields that

$$\begin{aligned} &\sum_{t=0}^{T-1}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\\ &\leq\frac{2R\kappa}{\sqrt{L}}\sum_{t=0}^{T-1}(1-\kappa^{-0.5})^{t+1}+\frac{6\kappa^{2}}{\sqrt{L}}\sum_{t=0}^{T-1}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{6\kappa^{2}}{\sqrt{L}}\sum_{j=1}^{T-1}\sum_{t=j}^{T-1}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{6\kappa^{2.5}}{\sqrt{L}}\sum_{j=1}^{T-1}(1-\kappa^{-0.5})^{j/2}\|x_{j+1}-x_{j}\|\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{2.5}}{\sqrt{L}}\sum_{j=1}^{T-1}\Big(\frac{1}{\kappa^{\frac{7}{6}}\sqrt{L}}(1-\kappa^{-0.5})^{j}+\kappa^{\frac{7}{6}}\sqrt{L}\|x_{j+1}-x_{j}\|^{2}\Big)\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{T-1}\|x_{j+1}-x_{j}\|^{2}.\end{aligned}\qquad(11)$$

Then, telescoping eq. (2) over $t=0,1,\ldots,T-1$ yields that

$$\begin{aligned} &\Phi(x_{T})+g(x_{T})-\Phi(x_{0})-g(x_{0})\\ &\leq-\Big(\frac{1-\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\sum_{t=0}^{T-1}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\sum_{t=0}^{T-1}\|y^{*}(x_{t})-y_{t}\|^{2}\\ &\overset{(i)}{\leq}-\Big(\frac{1-2\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\Big(R^{2}+\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{T-1}\|x_{j+1}-x_{j}\|^{2}\Big)\\ &\leq-\Big(\frac{1-2\beta}{2\eta_{x}}-3L\kappa^{\frac{11}{6}}\Big)\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1,\end{aligned}\qquad(12)$$

where (i) uses $x_{-1}=x_{0}$, $\|y^{*}(x_{0})-y_{0}\|\leq R$ and eq. (11). When $\eta_{x}\leq\frac{1}{16L\kappa^{\frac{11}{6}}}$ and $\beta\leq\frac{1}{4}$, rearranging the above inequality yields that

$$L\kappa^{\frac{11}{6}}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}\leq\Phi(x_{0})+g(x_{0})-\inf_{x\in\mathbb{R}^{m}}\big(\Phi(x)+g(x)\big)+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1<+\infty.\qquad(13)$$

Letting $T\to\infty$ in the above inequality yields that $\sum_{t=0}^{\infty}\|x_{t+1}-x_{t}\|^{2}<+\infty$, so $\|x_{t+1}-x_{t}\|\overset{t}{\to}0$. Then, letting $T\to\infty$ in eq. (11) yields that $\sum_{t=0}^{\infty}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{\infty}\|x_{j+1}-x_{j}\|^{2}<+\infty$, so $\|y_{t}-y^{*}(x_{t})\|\overset{t}{\to}0$.

The last claim, $\|y_{t+1}-y_{t}\|\overset{t}{\to}0$, can be proved as follows:

$$\begin{aligned} \|y_{t+1}-y_{t}\|&\leq\|y_{t+1}-y^{*}(x_{t+1})\|+\|y^{*}(x_{t})-y_{t}\|+\|y^{*}(x_{t+1})-y^{*}(x_{t})\|\\ &\overset{(i)}{\leq}\|y_{t+1}-y^{*}(x_{t+1})\|+\|y_{t}-y^{*}(x_{t})\|+\kappa\|x_{t+1}-x_{t}\|\overset{t}{\to}0,\end{aligned}$$

where (i) uses the fact that $y^{*}$ is $\kappa$-Lipschitz. ∎

Appendix D Proof of Theorem 1

See Theorem 1.

Proof.

We first prove the existence of limit points of $\{x_{t}\}$. Note that in eq. (12), $\frac{1-2\beta}{2\eta_{x}}-3L\kappa^{\frac{11}{6}}\geq 0$ since $\eta_{x}\leq\frac{1}{16L\kappa^{\frac{11}{6}}}$ and $\beta\leq\frac{1}{4}$. Hence, for all $T\geq 0$,

$$\Phi(x_{T})+g(x_{T})\leq\Phi(x_{0})+g(x_{0})+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1<+\infty,$$

which implies that $\{\Phi(x_{t})+g(x_{t})\}_{t}$ is upper bounded. Hence, by Assumption 1.2, the sequence $\{x_{t}\}_{t}$ is bounded and thus has a compact set of limit points.

Next, we prove that every limit point $x^{*}$ of $\{x_{t}\}_{t}$ is a critical point of $(\Phi+g)(x)$, i.e., $\mathbf{0}\in\partial(\Phi+g)(x^{*})$. Let $\{x_{t(j)}\}_{j}$ be any subsequence such that $x_{t(j)}\overset{j}{\to}x^{*}$. By the optimality condition of the proximal gradient update of $x_{t+1}$, we have

$$\begin{aligned} \mathbf{0}&\in\partial g(x_{t+1})+\frac{1}{\eta_{x}}\big(x_{t+1}-\widetilde{x}_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)\\ &=\partial g(x_{t+1})+\frac{1}{\eta_{x}}\big(x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})+\eta_{x}\nabla_{1}f(x_{t},y_{t})\big),\end{aligned}$$

which implies that $-\frac{1}{\eta_{x}}\big(x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})+\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)\in\partial g(x_{t+1})$, and thus, by convexity of $g$, we have

$$g(x)\geq g(x_{t(j)+1})-\frac{1}{\eta_{x}}\big(x_{t(j)+1}-x_{t(j)}-\beta(x_{t(j)}-x_{t(j)-1})+\eta_{x}\nabla_{1}f(x_{t(j)},y_{t(j)})\big)^{\top}(x-x_{t(j)+1}),\quad\forall x\in\mathbb{R}^{m}.\qquad(14)$$

As $x_{t(j)}\overset{j}{\to}x^{*}$ and $\|y_{t(j)}-y^{*}(x_{t(j)})\|\overset{j}{\to}0$, we have $y_{t(j)}\overset{j}{\to}y^{*}(x^{*})$ by continuity of $y^{*}(\cdot)$; moreover, $x_{t(j)+1}\overset{j}{\to}x^{*}$ as well, since $\|x_{t+1}-x_{t}\|\overset{t}{\to}0$ by Corollary 1. Also note that the convex function $g$ is continuous (see Corollary 10.1.1 of [35]). Hence, letting $j\to\infty$ in eq. (14) yields that

$$g(x)\geq g(x^{*})-\nabla_{1}f(x^{*},y^{*}(x^{*}))^{\top}(x-x^{*})=g(x^{*})-\nabla\Phi(x^{*})^{\top}(x-x^{*}),\quad\forall x\in\mathbb{R}^{m},\qquad(15)$$

which further implies that $-\nabla\Phi(x^{*})\in\partial g(x^{*})$, i.e., $\mathbf{0}\in\partial(\Phi+g)(x^{*})$. Hence, $x^{*}$ is a critical point of $(\Phi+g)(x)$.

Finally, we derive the non-asymptotic computation complexity to obtain $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\epsilon$. Note that

$$\begin{aligned} \|G(x_{t+1})\|&=\frac{1}{\eta_{x}}\big\|x_{t+1}-\mathrm{prox}_{\eta_{x}g}\big(x_{t+1}-\eta_{x}\nabla\Phi(x_{t+1})\big)\big\|\\ &\overset{(i)}{\leq}\frac{1}{\eta_{x}}\big\|x_{t+1}-\widetilde{x}_{t}+\eta_{x}\big[\nabla_{1}f(x_{t},y_{t})-\nabla_{1}f\big(x_{t+1},y^{*}(x_{t+1})\big)\big]\big\|\\ &\leq\frac{1}{\eta_{x}}\big\|x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})\big\|+L\|x_{t+1}-x_{t}\|+L\|y^{*}(x_{t+1})-y^{*}(x_{t})\|+L\|y^{*}(x_{t})-y_{t}\|\\ &\overset{(ii)}{\leq}\Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)\|x_{t+1}-x_{t}\|+\frac{\beta}{\eta_{x}}\|x_{t}-x_{t-1}\|+L\|y^{*}(x_{t})-y_{t}\|,\end{aligned}$$

where (i) uses $x_{t+1}\in\mathrm{prox}_{\eta_{x}g}\big(\widetilde{x}_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)$, the expression $\nabla\Phi(x)=\nabla_{1}f\big(x,y^{*}(x)\big)$ from Proposition 1 and the non-expansiveness of the proximal mapping, and (ii) uses the $\kappa$-Lipschitz continuity of $y^{*}$ from Proposition 1. Hence,

$$\begin{aligned} &(T-1)\min_{0\leq t\leq T}\|G(x_{t})\|^{2}\\ &\leq(T-1)\min_{1\leq t\leq T-1}\|G(x_{t+1})\|^{2}\\ &\leq\sum_{t=1}^{T-1}\|G(x_{t+1})\|^{2}\\ &\leq\sum_{t=1}^{T-1}\Big[3\Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)^{2}\|x_{t+1}-x_{t}\|^{2}+\frac{3\beta^{2}}{\eta_{x}^{2}}\|x_{t}-x_{t-1}\|^{2}+3L^{2}\|y^{*}(x_{t})-y_{t}\|^{2}\Big]\\ &\overset{(i)}{\leq}3(18L\kappa^{\frac{11}{6}})^{2}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+27L^{2}\kappa^{\frac{11}{3}}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+3L^{2}\sum_{t=0}^{T-1}\|y^{*}(x_{t})-y_{t}\|^{2}\\ &\overset{(ii)}{\leq}999L^{2}\kappa^{\frac{11}{3}}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+3L^{2}\Big(\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{T-1}\|x_{j+1}-x_{j}\|^{2}\Big)+3L^{2}\|y^{*}(x_{0})-y_{0}\|^{2}\\ &\overset{(iii)}{\leq}\frac{1008L^{2}\kappa^{\frac{11}{3}}}{L\kappa^{\frac{11}{6}}}\Big(\Phi(x_{0})+g(x_{0})-\inf_{x\in\mathbb{R}^{m}}\big(\Phi(x)+g(x)\big)+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1\Big)+6RL^{1.5}\kappa^{1.5}+9L\kappa^{\frac{11}{6}}+3L^{2}R^{2}\\ &=\mathcal{O}(\kappa^{\frac{11}{6}}),\end{aligned}$$

where (i) uses $\beta\leq\frac{1}{4}$ and the maximum possible stepsize $\eta_{x}=\frac{1}{16L\kappa^{\frac{11}{6}}}$, (ii) uses eq. (11), and (iii) uses eq. (13) and the fact that $\mathcal{Y}$ is bounded with diameter $R$. Based on the above inequality, once the number of iterations satisfies $T\geq\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, we have $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\sqrt{\mathcal{O}(\kappa^{\frac{11}{6}})/(T-1)}\leq\epsilon$. Since each iteration requires $\mathcal{O}(1)$ gradient and proximal mapping evaluations, the computational complexity is also of the order $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$. ∎
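To make the convergence measure above concrete, the proximal gradient mapping $G$ admits a closed form for simple regularizers. The following minimal Python sketch (our illustration, not part of the analysis) assumes $g(x)=\lambda\|x\|_{1}$, whose proximal mapping is soft-thresholding, and takes the gradient $\nabla\Phi(x)$ as an input vector.

import numpy as np

def prox_l1(v, tau):
    # Proximal mapping of tau*||.||_1, i.e., soft-thresholding at tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_mapping_norm(x, grad_Phi, eta_x, lam):
    # ||G(x)|| = (1/eta_x)*||x - prox_{eta_x*g}(x - eta_x*grad_Phi)||
    # with g = lam*||.||_1, so the prox soft-thresholds at eta_x*lam.
    x_plus = prox_l1(x - eta_x * grad_Phi, eta_x * lam)
    return np.linalg.norm(x - x_plus) / eta_x

Tracking $\min_{0\leq t\leq T}\|G(x_{t})\|$ along the iterates then reproduces exactly the quantity bounded in the proof above.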

Appendix E Derivation of computational complexities in Table I

In this section, we derive the computational complexities in Table I that are not directly shown in their corresponding papers. Note that all the GDA-type algorithms in Table I are single-loop, so the computational complexity (the number of gradient evaluations) has the order $\mathcal{O}(T)$, where $T$ is the number of iterations.

First, the papers in Table I use different convergence measures for the computational complexity. Specifically, [5, 17] and our work bound the computational complexity needed to achieve $\|G(x)\|\leq\epsilon$, where the proximal gradient mapping $G$ is defined in (5). [23, 4] use the measure $\min_{t}\mathrm{dist}\big(\nabla\Phi(x_{t})+\partial g(x_{t}),\mathbf{0}\big)\leq\epsilon$, where $\Phi(x):=\max_{y\in\mathcal{Y}}f(x,y)-h(y)$, $\partial g$ denotes the subdifferential of $g$ and $\mathrm{dist}(A,\mathbf{0})$ denotes the distance between $\mathbf{0}$ and the set $A$. [42] has no regularizers $g,h$ and uses the convergence measure $\min_{t}\|\nabla f(x_{t},y_{t})\|\leq\epsilon$ when $y\in\mathbb{R}^{d}$ is unconstrained, which does not necessarily yield the desired approximate critical point of $\Phi$.
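To illustrate the second measure for a concrete regularizer, the distance $\mathrm{dist}\big(\nabla\Phi(x)+\partial g(x),\mathbf{0}\big)$ can be computed coordinatewise when $g(x)=\lambda\|x\|_{1}$; this choice of $g$ and the sketch below are our illustration only, not taken from [23, 4].

import numpy as np

def dist_to_subdiff_l1(grad_Phi, x, lam):
    # dist(grad_Phi + subdiff(lam*||.||_1)(x), 0), computed per coordinate:
    # if x_i != 0, the subdifferential is the singleton {lam*sign(x_i)};
    # if x_i == 0, it is the interval [-lam, lam].
    d = np.where(x != 0.0,
                 np.abs(grad_Phi + lam * np.sign(x)),
                 np.maximum(np.abs(grad_Phi) - lam, 0.0))
    return np.linalg.norm(d)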

E-A Derivation of complexity in [5]

In [5], Proposition 2 states that the Lyapunov function $H(z_{t}):=\Phi(x_{t})+g(x_{t})+\big(1-\frac{1}{4\kappa^{2}}\big)\|y_{t}-y^{*}(x_{t})\|^{2}$, where $z_{t}:=(x_{t},y_{t})$ denotes the iterates generated by GDA, decreases at the following rate:

H(z_{t+1})\leq H(z_{t})-2\|x_{t+1}-x_{t}\|^{2}-\frac{1}{4\kappa^{2}}\big(\|y_{t+1}-y^{*}(x_{t+1})\|^{2}+\|y_{t}-y^{*}(x_{t})\|^{2}\big). \qquad (16)

Note that for GDA, the gradient mapping (5) has the following norm bound

\|G(x_{t+1})\| = \frac{1}{\eta_{x}}\big\|x_{t+1}-\mathrm{prox}_{\eta_{x}g}\big(x_{t+1}-\eta_{x}\nabla\Phi(x_{t+1})\big)\big\|
\overset{(i)}{\leq} \frac{1}{\eta_{x}}\big\|x_{t+1}-x_{t}+\eta_{x}\big[\nabla_{1}f(x_{t},y_{t})-\nabla_{1}f\big(x_{t+1},y^{*}(x_{t+1})\big)\big]\big\|
\leq \Big(\frac{1}{\eta_{x}}+L\Big)\|x_{t+1}-x_{t}\|+L\|y^{*}(x_{t+1})-y^{*}(x_{t})\|+L\|y^{*}(x_{t})-y_{t}\|
\overset{(ii)}{\leq} \Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)\|x_{t+1}-x_{t}\|+L\|y^{*}(x_{t})-y_{t}\|,

where (i) uses the GDA update rule $x_{t+1}\in\mathrm{prox}_{\eta_{x}g}\big(x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)$, the expression $\nabla\Phi(x)=\nabla_{1}f\big(x,y^{*}(x)\big)$ from Proposition 1 and the non-expansiveness of the proximal mapping, and (ii) uses the $\kappa$-Lipschitz continuity of $y^{*}$ from Proposition 1. Hence, we obtain the following convergence rate:

\min_{0\leq t\leq T}\|G(x_{t})\|^{2} \leq \frac{1}{T}\sum_{t=0}^{T-1}\|G(x_{t+1})\|^{2}
\leq \frac{1}{T}\sum_{t=0}^{T-1}\Big[2\Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)^{2}\|x_{t+1}-x_{t}\|^{2}+2L^{2}\|y^{*}(x_{t})-y_{t}\|^{2}\Big]
\overset{(i)}{\leq} \frac{1}{T}\sum_{t=0}^{T-1}\Big[\mathcal{O}(\kappa^{6})\|x_{t+1}-x_{t}\|^{2}+2L^{2}\|y^{*}(x_{t})-y_{t}\|^{2}\Big]
\leq \frac{\mathcal{O}(\kappa^{6})}{T}\sum_{t=0}^{T-1}\Big(2\|x_{t+1}-x_{t}\|^{2}+\frac{1}{4\kappa^{2}}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\Big)
\overset{(ii)}{\leq} \frac{\mathcal{O}(\kappa^{6})}{T}\sum_{t=0}^{T-1}\big(H(z_{t})-H(z_{t+1})\big)
\leq \frac{\mathcal{O}(\kappa^{6})}{T}\big(H(z_{0})-H(z_{T})\big), \qquad (17)

where (i) uses the maximum possible stepsize $\eta_{x}=\frac{1}{\kappa^{3}(L+3)^{2}}=\mathcal{O}(\kappa^{-3})$ in [5]. Therefore, to achieve $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\epsilon$, the computational complexity has the order $T=\mathcal{O}(\kappa^{6}\epsilon^{-2})$.
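As a numerical sanity check of the descent property (16), one can run simultaneous GDA on a toy objective where $y^{*}(x)$ is known in closed form and monitor $H(z_{t})$. The sketch below uses $f(x,y)=x^{\top}y-\frac{\mu}{2}\|y\|^{2}$ with $g\equiv 0$ (our toy choices, giving $y^{*}(x)=x/\mu$ and $\Phi(x)=\|x\|^{2}/(2\mu)$); whether $H$ decreases monotonically depends on the stepsize conditions of [5], so this is an illustrative monitor rather than a verification of the proposition.

import numpy as np

# Toy instance: f(x, y) = x @ y - (mu/2)*||y||^2 and g = 0, so that
# y*(x) = x/mu and Phi(x) = ||x||^2/(2*mu) are known in closed form.
mu, L = 0.5, 1.0                  # strong-concavity / smoothness parameters
kappa = L / mu
eta_x, eta_y = 1e-3, 1e-2         # conservatively small stepsizes

x, y = np.ones(5), np.zeros(5)
for t in range(1000):
    H = x @ x / (2 * mu) \
        + (1 - 1 / (4 * kappa**2)) * np.linalg.norm(y - x / mu)**2
    gx, gy = y, x - mu * y        # grad_x f(x, y) and grad_y f(x, y)
    x, y = x - eta_x * gx, y + eta_y * gy   # simultaneous GDA step
    # The Lyapunov value H computed above should trend downward in t.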

E-B Derivation of complexity in [17]

The mirror descent ascent algorithm (Algorithm 1) in [17] updates the variables $x$ and $y$ simultaneously, using a proximal mirror descent step on $x$ and a momentum-accelerated mirror ascent step on $y$, respectively. Specifically, with the Bregman functions $\psi_{t}(x):=\frac{1}{2}\|x\|^{2}$ and $\phi_{t}(y):=\frac{1}{2}\|y\|^{2}$, which are both $\rho=1$-strongly convex, this algorithm reduces to proximal GDA with momentum on the $y$ variable.

Substituting $\rho=1$ into Theorem 2, which provides the convergence rate for deterministic minimax optimization (i.e., there are no stochastic samples in the objective function), we obtain the following hyperparameter choices: $\eta=\mathcal{O}(1)$, $L=L_{f}(1+\kappa)=\mathcal{O}(L_{f}\kappa)$ ($L_{f}$ in [17] has the same meaning as our $L$, the Lipschitz parameter of $\nabla f$), $\lambda=\mathcal{O}(L_{f}^{-1})$, and $\gamma=\mathcal{O}\big[\min\big(L^{-1},\frac{\mu/L_{f}}{\kappa^{2}},\frac{\mu/L_{f}}{L_{f}^{2}}\big)\big]=\mathcal{O}\big[\min(L_{f}^{-1}\kappa^{-1},\kappa^{-3},L_{f}^{-2}\kappa^{-1})\big]=\mathcal{O}(\kappa^{-3})$ (without loss of generality, we assume $\mu\leq 1$, which implies $\kappa=L_{f}/\mu\geq L_{f}$). Then the convergence rate (25) becomes

\min_{1\leq t\leq T}\|G(x_{t})\| \leq \frac{1}{T}\sum_{t=1}^{T}\|\mathcal{G}_{t}\| \leq \mathcal{O}\Bigg(\frac{\sqrt{\widetilde{F}(x_{1})-F^{*}}+\Delta_{1}}{\sqrt{T\gamma\rho}}\Bigg)
\overset{(i)}{=} \mathcal{O}\Bigg(\frac{\sqrt{\Phi(x_{1})+g(x_{1})-\inf_{x}\big(\Phi(x)+g(x)\big)}+\|y_{1}-y^{*}(x_{1})\|}{\sqrt{T\kappa^{-3}}}\Bigg),

where (i) uses the notations in [17] that $\mathcal{G}_{t}=G(x_{t})$, $\widetilde{F}(x)=\Phi(x)+g(x)$, $\Delta_{1}=\|y_{1}-y^{*}(x_{1})\|$, together with the above hyperparameter choices. Hence, to achieve $\min_{1\leq t\leq T}\|\mathcal{G}_{t}\|\leq\epsilon$, the required computational complexity is $T\geq\mathcal{O}\Big(\kappa^{3}\epsilon^{-2}\big(\Phi(x_{1})+g(x_{1})-\inf_{x}(\Phi(x)+g(x))+\|y_{1}-y^{*}(x_{1})\|^{2}\big)\Big)$. In Table I, we only keep the dependence of $T(\epsilon)$ on $\epsilon\approx 0$ and $\kappa\gg 1$, which yields $\mathcal{O}(\kappa^{3}\epsilon^{-2})$.
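For concreteness, with the Euclidean Bregman functions above, one iteration of this scheme amounts to a proximal gradient descent step on $x$ and a heavy-ball-style ascent step on $y$. The Python sketch below is our Euclidean simplification; the exact extrapolation point and hyperparameter coupling follow [17] and may differ in detail.

import numpy as np

def prox_gda_momentum_step(x, y, y_prev, grad1_f, grad2_f, prox_g,
                           gamma, lam, beta):
    # Simultaneous update: both steps read the current pair (x, y).
    x_next = prox_g(x - gamma * grad1_f(x, y), gamma)  # proximal descent on x
    y_tilde = y + beta * (y - y_prev)                  # heavy-ball extrapolation
    y_next = y_tilde + lam * grad2_f(x, y)             # momentum ascent on y
    return x_next, y_next, y                           # y becomes the new y_prev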

E-C Derivation of complexity in [42]

[42] aims to solve the following minimax optimization problem:

\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y),

where $\mathcal{X}$ and $\mathcal{Y}$ are nonempty closed convex sets and $\mathcal{Y}$ is also compact. The following AltGDA algorithm with projection mappings $\mathcal{P}_{\mathcal{X}}$ and $\mathcal{P}_{\mathcal{Y}}$ is analyzed in the nonconvex-strongly-concave setting, where $f$ is $L$-smooth (in Assumption 2.1 of [42], we set all the Lipschitz smoothness parameters $L_{x}=L_{y}=L_{12}=L_{21}:=L$ for simplicity) and $f(x,\cdot)$ is $\mu$-strongly concave for any $x\in\mathcal{X}$.

x_{k+1}=\mathcal{P}_{\mathcal{X}}\big(x_{k}-\eta^{-1}\nabla_{x}f(x_{k},y_{k})\big)
y_{k+1}=\mathcal{P}_{\mathcal{Y}}\big(y_{k}+\rho\nabla_{y}f(x_{k+1},y_{k})\big) \qquad (22)
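In code, one iteration of update (22) could read as follows; the projection operators are left abstract, and the example projections in the final comment are ours.

import numpy as np

def altgda_step(x, y, grad_x_f, grad_y_f, proj_X, proj_Y, inv_eta, rho):
    # Projected alternating GDA, cf. (22): the x-step uses the current y,
    # while the y-step already uses the new iterate x_{k+1}.
    x_next = proj_X(x - inv_eta * grad_x_f(x, y))
    y_next = proj_Y(y + rho * grad_y_f(x_next, y))
    return x_next, y_next

# Example: proj_X = lambda v: v (X = R^m);
# proj_Y = lambda v: np.clip(v, -1.0, 1.0) (a box constraint).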

Using the largest possible stepsizes $\eta^{-1}=\mathcal{O}(L^{-1}\kappa^{-3})$ and $\rho=\mathcal{O}(\mu L^{-2})=\mathcal{O}(L^{-1}\kappa^{-1})$ that satisfy eq. (3.18), the key quantities in Theorem 3.1 can be computed as follows:

d_{1} = \mathcal{O}(L^{-1}\kappa^{-5}),
F_{1}-\underline{F} = f(x_{1},y_{1})-\min_{x\in\mathcal{X},y\in\mathcal{Y}}f(x,y)+\mathcal{O}(L\kappa^{3}\sigma_{y}^{2}),

where $\sigma_{y}$ is the diameter of the compact set $\mathcal{Y}$. Hence, the number of iterations (also the order of the computational complexity) required to achieve $\|\nabla_{x}f(x_{k},y_{k})\|\leq\epsilon$ and $\frac{1}{\rho}\big\|y_{k}-\mathcal{P}_{\mathcal{Y}}\big(y_{k}+\rho\nabla_{y}f(x_{k},y_{k})\big)\big\|\leq\epsilon$ (note that this does not necessarily yield an approximate critical point of $\Phi(x):=\max_{y\in\mathcal{Y}}f(x,y)$) is

T(\epsilon)=\mathcal{O}\Big(\frac{F_{1}-\underline{F}}{d_{1}\epsilon^{2}}\Big)=\mathcal{O}\Big(\epsilon^{-2}L\kappa^{5}\big(f_{1}-\min_{x\in\mathcal{X},y\in\mathcal{Y}}f(x,y)+32L\kappa^{3}\sigma_{y}^{2}\big)\Big).

In Table I, we only keep the dependence of $T(\epsilon)$ on $\epsilon\approx 0$ and $\kappa\gg 1$, which yields $\mathcal{O}(\kappa^{5}\epsilon^{-2})$.