
Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning

Ziyi Chen Electrical & Computer Engineering
University of Utah
Salt Lake City, US
[email protected]
   Shaocong Ma Electrical & Computer Engineering
University of Utah
Salt Lake City, US
[email protected]
   Yi Zhou Electrical & Computer Engineering
University of Utah
Salt Lake City, US
[email protected]
Abstract

Alternating gradient descent-ascent (AltGDA) is an optimization algorithm that has been widely used for model training in various machine learning applications, where it is applied to solve nonconvex minimax optimization problems. However, existing studies show that it suffers from a high computation complexity in nonconvex minimax optimization. In this paper, we develop a single-loop and fast AltGDA-type algorithm that leverages proximal gradient updates and momentum acceleration to solve regularized nonconvex minimax optimization problems. By leveraging the momentum acceleration technique, we prove that the algorithm converges to a critical point in nonconvex minimax optimization and achieves a computation complexity of order $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, where $\epsilon$ is the desired level of accuracy and $\kappa$ is the problem's condition number. This computation complexity improves the state-of-the-art complexities of single-loop GDA and AltGDA algorithms (see the comparison summary in Table I). We demonstrate the effectiveness of our algorithm via an experiment on adversarial deep learning.

Index Terms:
Minimax optimization, alternating gradient descent ascent, proximal gradient, momentum, complexity.

I Introduction

Minimax optimization is an emerging and important optimization framework that covers a variety of modern machine learning applications. Popular examples include generative adversarial networks (GANs) [13], adversarial machine learning [37], game theory [10], and reinforcement learning [34]. A standard minimax optimization problem is written as $\min_{x\in\mathbb{R}^{m}}\max_{y\in\mathcal{Y}}\ f(x,y)$, where $f$ is a differentiable bivariate function. A basic algorithm for solving this problem is gradient descent-ascent (GDA), which simultaneously performs a gradient descent update on $x$ and a gradient ascent update on $y$, i.e., $x_{t+1}=x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})$, $y_{t+1}=y_{t}+\eta_{y}\nabla_{2}f(x_{t},y_{t})$. Here, $\nabla_{1}$ and $\nabla_{2}$ denote the gradient operators with respect to the first and the second variable, respectively. The convergence rate of GDA has been studied under various geometries of the minimax problem, including strongly-convex-strongly-concave geometry [9], nonconvex-(strongly)-concave geometry [23] and Łojasiewicz-type geometry [5]. Recently, by leveraging the popular momentum technique [28, 38, 3, 11, 20, 21] for accelerating gradient-based algorithms, accelerated variants of GDA have been proposed for strongly-convex-strongly-concave [45] and nonconvex-strongly-concave [17] minimax optimization. There are other accelerated GDA-type algorithms that achieve a near-optimal complexity [22, 46], but they involve complex nested-loop structures and require function smoothing with many fine-tuned hyper-parameters, which are not used in practical minimax machine learning.

Another important variant of GDA that has been widely used in practical minimax machine learning is the alternating GDA (AltGDA) algorithm, which updates the two variables $x$ and $y$ alternately via $x_{t+1}=x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})$, $y_{t+1}=y_{t}+\eta_{y}\nabla_{2}f(x_{t+1},y_{t})$, as sketched below. Compared with GDA, the $y$-update of AltGDA uses the fresh iterate $x_{t+1}$ instead of the previous $x_{t}$, and AltGDA is known to converge faster than the standard GDA algorithm [4, 2, 12, 42]. Despite its popularity, AltGDA suffers from a high computation complexity in nonconvex minimax optimization, and this study aims to improve its complexity by leveraging momentum acceleration techniques. In the existing literature, the convergence of momentum-accelerated AltGDA has been established only for convex-concave minimax problems [43] and bilinear minimax problems [12]; it has not been explored in the nonconvex minimax optimization that underlies modern machine learning applications. This study fills this gap by developing a single-loop proximal-AltGDA algorithm with momentum acceleration for solving a class of regularized nonconvex minimax problems, and by analyzing its convergence and computation complexity.
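To make the distinction concrete, the following is a minimal sketch (our own illustration, not from any of the cited papers) contrasting one GDA step with one AltGDA step; `grad1` and `grad2` are assumed callables returning the partial gradients of $f$ with respect to $x$ and $y$.

```python
def gda_step(x, y, eta_x, eta_y, grad1, grad2):
    x_new = x - eta_x * grad1(x, y)
    y_new = y + eta_y * grad2(x, y)      # ascent uses the stale x_t
    return x_new, y_new

def altgda_step(x, y, eta_x, eta_y, grad1, grad2):
    x_new = x - eta_x * grad1(x, y)
    y_new = y + eta_y * grad2(x_new, y)  # ascent uses the fresh x_{t+1}
    return x_new, y_new
```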

I-A Our Contribution

We are interested in a class of regularized nonconvex-strongly-concave minimax optimization problems, where the regularizers are convex functions that can possibly be non-smooth. To solve this class of minimax problems, we propose a single-loop proximal-AltGDA with momentum algorithm (referred to as proximal-AltGDAm). The algorithm performs single-loop updates that consist of a proximal gradient descent update with heavy-ball momentum, and a proximal gradient ascent update with Nesterov's momentum. Our algorithm extends the applicability of the conventional momentum acceleration schemes (heavy-ball and Nesterov's momentum) from nonconvex minimization to nonconvex minimax optimization.

We study the convergence properties of proximal-AltGDAm. Specifically, under standard smoothness assumptions on the objective function and by viewing the accelerated alternating GDA updates as inexact accelerated gradient updates, we develop an analysis showing that every limit point of the parameter sequences generated by the algorithm is a critical point of the regularized nonconvex minimax problem. Moreover, to achieve an $\epsilon$-accurate critical point, the overall computation complexity (i.e., number of gradient and proximal mapping evaluations) is of order $\mathcal{O}\big(\kappa^{\frac{11}{6}}\epsilon^{-2}\big)$, where $\kappa>1$ is the condition number of the problem. Thanks to momentum acceleration and a tight analysis in our technical proof, this computation complexity is lower than that of the existing single-loop GDA-type algorithms. See Table I for a comparison of computation complexities and Appendix E for their derivation.

Table I: Comparison of computation complexity (number of gradient and proximal mapping evaluations) of the existing single-loop GDA-type algorithms in nonconvex-strongly-concave minimax optimization, where $\kappa\geq 1$ is the problem condition number.

| Algorithm | Alternating update | Momentum acceleration | Computation complexity |
| --- | --- | --- | --- |
| (Chen et al.) [5] | ✗ | ✗ | $\mathcal{O}(\kappa^{6}\epsilon^{-2})$ |
| (Huang et al.) [17] | ✗ | ✓ | $\mathcal{O}(\kappa^{3}\epsilon^{-2})$ |
| (Lin et al.) [23] | ✗ | ✗ | $\mathcal{O}(\kappa^{2}\epsilon^{-2})$ |
| (Xu et al.) [42] | ✓ | ✗ | $\mathcal{O}(\kappa^{5}\epsilon^{-2})$ |
| (Boţ and Böhm) [4] | ✓ | ✗ | $\mathcal{O}(\kappa^{2}\epsilon^{-2})$ |
| This work | ✓ | ✓ | $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$ |

I-B Other Related Work

GDA-type algorithms: [27] studied a slight variant of GDA that replaces gradients with subgradients for convex-concave non-smooth minimax optimization. [42] studied an AltGDA variant that adds an $\ell_{2}$ regularizer to the local objective function of GDA followed by projection onto the constraint sets, and obtained its convergence rate under nonconvex-concave and convex-nonconcave geometry. [26] studied two other variants of GDA, extra-gradient and optimistic GDA, and obtained linear convergence rates under bilinear geometry and strongly-convex-strongly-concave geometry. [6, 18] studied GDA in continuous-time dynamics using differential equations. [1] analyzed a second-order variant of the GDA algorithm.

Many other studies have developed stochastic GDA-type algorithms. [23, 44] analyzed stochastic GDA and stochastic AltGDA, respectively. Variance reduction techniques have been applied to stochastic minimax optimization, including SVRG-based [8, 44] and SARAH/SPIDER-based [41, 25] methods, as well as STORM [34] and its gradient-free version [16].

GDA-type algorithms with momentum: For strongly-convex-strongly-concave problems, [45] studied accelerated GDA with negative momentum, and [39, 22] developed nested-loop AltGDA algorithms with Nesterov's accelerated gradient descent that achieve improved complexities. For convex-concave problems, [7] analyzed stable points of both GDA and optimistic GDA with negative momentum for acceleration. Moreover, for nonconvex-(strongly)-concave problems, [40] developed a single-loop GDA with momentum and Hessian preconditioning and studied its convergence to a local minimax point, [17] developed a mirror descent ascent algorithm with momentum that includes GDA with momentum as a special case, and [30] studied nested-loop GDA in which multiple gradient ascent steps are performed, along with its momentum-accelerated version. Similarly, [34, 16] developed GDA with a momentum scheme and STORM variance reduction, and a similar algorithm has been extended to minimax optimization on Riemannian manifolds [15]. [14] developed a single-loop primal-dual stochastic momentum algorithm with ADAM-type updates.

II Problem Formulation and Preliminaries

In this section, we introduce the problem formulation and technical assumptions. We consider the following class of regularized minimax optimization problems:

$$\min_{x\in\mathbb{R}^{m}}\max_{y\in\mathcal{Y}}\ f(x,y)+g(x)-h(y),\qquad\mathrm{(P)}$$

where $f:\mathbb{R}^{m}\times\mathcal{Y}\to\mathbb{R}$ is differentiable and nonconvex-strongly-concave, $g,h$ are possibly non-smooth convex regularizers, and $\mathcal{Y}$ is a bounded set with diameter $R$. In particular, defining the envelope function $\Phi(x):=\max_{y\in\mathcal{Y}}f(x,y)-h(y)$, problem $\mathrm{(P)}$ can be rewritten as the minimization problem $\min_{x\in\mathbb{R}^{m}}\Phi(x)+g(x)$.
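As a simple sanity check (our own toy instance, not from the paper): take $g=h=0$, $f(x,y)=xy-\frac{\mu}{2}y^{2}$ and $\mathcal{Y}$ a centered interval of radius $R$. The maximizer is $y^{*}(x)=x/\mu$ whenever $|x|\leq\mu R$, so the envelope is $\Phi(x)=f(x,y^{*}(x))=\frac{x^{2}}{2\mu}$ and $\mathrm{(P)}$ reduces to $\min_{x}\frac{x^{2}}{2\mu}$.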

Throughout the paper, we adopt the following standard assumptions on problem (P), which have been considered in the existing literature [23, 5].

Assumption 1.

The minimax problem $\mathrm{(P)}$ satisfies:

  1. Function $f(\cdot,\cdot)$ is $L$-smooth and function $f(x,\cdot)$ is $\mu$-strongly concave for all $x$;

  2. Function $(\Phi+g)(x)$ is bounded below and has compact sub-level sets;

  3. Functions $g,h$ are proper and convex.

In particular, item 3 of the above assumption allows the regularizers $g,h$ to be any convex functions, possibly non-smooth. Next, consider the mapping $y^{*}(x)=\arg\max_{y\in\mathcal{Y}}f(x,y)-h(y)$, which is uniquely defined due to the strong concavity of $f(x,\cdot)$. The following proposition, proved in [23, 5], characterizes the Lipschitz continuity of the mapping $y^{*}(x)$ and the smoothness of the function $\Phi(x)$. Throughout the paper, we denote by $\kappa=L/\mu>1$ the condition number, and by $\nabla_{1}f(x,y),\nabla_{2}f(x,y)$ the gradients with respect to the first and the second input variable, respectively.

Proposition 1 ([23, 5]).

Let Assumption 1 hold. Then, the mapping $y^{*}(x)$ is $\kappa$-Lipschitz continuous and the function $\Phi(x)$ is $L(1+\kappa)$-smooth with $\nabla\Phi(x)=\nabla_{1}f(x,y^{*}(x))$.

Lastly, recall that the minimax problem (P) is equivalent to the minimization problem $\min_{x\in\mathbb{R}^{m}}\Phi(x)+g(x)$. Therefore, the optimization goal of problem (P) is to find a critical point $x^{*}$ of the nonconvex function $\Phi(x)+g(x)$, i.e., a point satisfying the optimality condition $\mathbf{0}\in\partial(\Phi+g)(x^{*})$, where $\partial$ denotes the subdifferential operator.

III Proximal Alternating GDA with Momentum

In this section, we propose a single-loop proximal alternating-GDA with momentum (proximal-AltGDAm) algorithm to solve the regularized minimax problem (P).

We first recall the update rules of the basic proximal alternating-GDA (proximal-AltGDA) algorithm [4] for solving the problem (P). Specifically, proximal-AltGDA alternates between the following two proximal-gradient updates (a.k.a. forward-backward splitting updates [24]).

(Proximal-AltGDA):

$$\begin{aligned} x_{t+1}&=\mathrm{prox}_{\eta_{x}g}\big(x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big),\\ y_{t+1}&=\mathrm{prox}_{\eta_{y}h}\big(y_{t}+\eta_{y}\nabla_{2}f(x_{t+1},y_{t})\big).\end{aligned}$$

To elaborate, the first update is a proximal gradient descent update that aims to minimize the nonconvex function $f(x,y_{t})+g(x)$ from the current point $x_{t}$, and the second update is a proximal gradient ascent update that aims to maximize the strongly-concave function $f(x_{t+1},y)-h(y)$ from the current point $y_{t}$. More specifically, the two proximal gradient mappings are formally defined as

$$\begin{aligned} \mathrm{prox}_{\eta_{x}g}\big(x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)&:=\mathop{\mathrm{argmin}}_{u\in\mathbb{R}^{m}}\Big\{g(u)+\frac{1}{2\eta_{x}}\|u-x_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\|^{2}\Big\},\\ \mathrm{prox}_{\eta_{y}h}\big(y_{t}+\eta_{y}\nabla_{2}f(x_{t+1},y_{t})\big)&:=\mathop{\mathrm{argmin}}_{v\in\mathcal{Y}}\Big\{h(v)+\frac{1}{2\eta_{y}}\|v-y_{t}-\eta_{y}\nabla_{2}f(x_{t+1},y_{t})\|^{2}\Big\}.\end{aligned}$$
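For concreteness, the following is a minimal sketch of these proximal mappings for two common regularizer choices, assuming $g(x)=\lambda\|x\|_{1}$ (elementwise soft-thresholding) and $h$ the indicator of a Euclidean ball $\mathcal{Y}$ of radius $R$ (projection); both choices are our own hypothetical instances of Assumption 1.3, not prescribed by the paper.

```python
import numpy as np

def prox_l1(v, eta, lam):
    # prox_{eta * lam * ||.||_1}(v): elementwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def prox_ball(v, radius):
    # prox of the indicator of {y : ||y|| <= radius}: Euclidean projection.
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)
```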

Compared with the standard (proximal) GDA algorithm [23, 5], the proximal ascent step of proximal-AltGDA evaluates the gradient at the freshest point $x_{t+1}$ instead of $x_{t}$. Such an alternating update is widely used in practice and usually leads to better convergence properties [4, 2, 12, 42].

Next, we introduce momentum schemes to accelerate the convergence of proximal-AltGDA. As the two proximal gradient steps of proximal-AltGDA are used to solve two different types of optimization problems, namely the nonconvex problem $f(x,y_{t})+g(x)$ and the strongly-concave problem $f(x_{t+1},y)-h(y)$, we consider applying different momentum schemes to accelerate these proximal gradient updates. Specifically, the proximal gradient descent step minimizes a composite nonconvex function, and we apply the heavy-ball momentum scheme [33] that was originally designed for accelerating nonconvex optimization. Therefore, we propose the following proximal gradient descent with heavy-ball momentum update for minimizing the nonconvex part of problem (P).

(Heavy-ball momentum):

$$\begin{aligned} \widetilde{x}_{t}&=x_{t}+\beta(x_{t}-x_{t-1}),\\ x_{t+1}&=\mathrm{prox}_{\eta_{x}g}\big(\widetilde{x}_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big).\end{aligned}$$

To explain, the first step is an extrapolation step that applies the momentum term $\beta(x_{t}-x_{t-1})$ (with momentum coefficient $\beta>0$), and the second step applies a proximal gradient update to the extrapolated point $\widetilde{x}_{t}$ using the original gradient $\nabla_{1}f(x_{t},y_{t})$. In conventional gradient-based optimization, gradient descent with such a heavy-ball momentum is guaranteed to find a critical point of smooth nonconvex functions [32, 31].

On the other hand, as the proximal gradient ascent step of proximal-AltGDA maximizes a composite strongly-concave function, we are motivated to apply the popular Nesterov’s momentum scheme [29], which was originally designed for accelerating strongly-concave (convex) optimization. Specifically, we propose the following proximal gradient ascent with Nesterov’s momentum update for maximizing the strongly-concave part of the problem (P).

(Nesterov’s momentum):

$$\begin{aligned} \widetilde{y}_{t}&=y_{t}+\gamma(y_{t}-y_{t-1}),\\ y_{t+1}&=\mathrm{prox}_{\eta_{y}h}\big(\widetilde{y}_{t}+\eta_{y}\nabla_{2}f(x_{t+1},\widetilde{y}_{t})\big).\end{aligned}\qquad(1)$$

To elaborate, the first step is a regular extrapolation step involving momentum, the same as the first step of the heavy-ball scheme. The only difference from the heavy-ball scheme is that the gradient in the second proximal gradient step is evaluated at the extrapolated point $\widetilde{y}_{t}$ rather than at $y_{t}$.

We refer to the above algorithm design as proximal-AltGDA with momentum (proximal-AltGDAm), and the algorithm updates are formally presented in Algorithm 1. Proximal-AltGDAm is a simple algorithm with a single-loop structure that adopts alternating updates with momentum acceleration. More importantly, it involves only standard hyper-parameters, such as the learning rates and momentum parameters, and is therefore easy to implement in practice. Such a practical algorithm is much simpler than the other accelerated GDA-type algorithms, which involve complex nested-loop structures and require fine-tuned function smoothing [22, 46].

Input: Initialization $x_{-1}=x_{0}$, $y_{-1}=y_{0}$, learning rates $\eta_{x},\eta_{y}$, momentum parameters $\beta,\gamma$.
for $t=0,1,2,\ldots,T-1$ do

$$\begin{aligned} \widetilde{x}_{t}&=x_{t}+\beta(x_{t}-x_{t-1}),\\ x_{t+1}&=\mathrm{prox}_{\eta_{x}g}\big(\widetilde{x}_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big),\\ \widetilde{y}_{t}&=y_{t}+\gamma(y_{t}-y_{t-1}),\\ y_{t+1}&=\mathrm{prox}_{\eta_{y}h}\big(\widetilde{y}_{t}+\eta_{y}\nabla_{2}f(x_{t+1},\widetilde{y}_{t})\big).\end{aligned}$$

end for
Output: $x_{T},y_{T}$.
Algorithm 1 Proximal Alternating GDA with Momentum (proximal-AltGDAm)
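For illustration, here is a minimal numpy sketch of Algorithm 1 on a hypothetical toy instance $f(x,y)=x^{\top}y-\frac{\mu}{2}\|y\|^{2}$ with $g=\lambda\|\cdot\|_{1}$ and $h$ the indicator of a ball, reusing the `prox_l1` and `prox_ball` helpers sketched in Section III; none of this code is from the paper.

```python
import numpy as np

def prox_altgdam(x0, y0, eta_x, eta_y, beta, gamma, mu, lam, radius, T):
    # Toy instance: f(x, y) = x @ y - (mu / 2) * ||y||^2, so
    # grad_1 f(x, y) = y and grad_2 f(x, y) = x - mu * y.
    x_prev, x = x0.copy(), x0.copy()   # x_{-1} = x_0
    y_prev, y = y0.copy(), y0.copy()   # y_{-1} = y_0
    for _ in range(T):
        x_tilde = x + beta * (x - x_prev)             # heavy-ball extrapolation
        x_next = prox_l1(x_tilde - eta_x * y, eta_x, lam)
        y_tilde = y + gamma * (y - y_prev)            # Nesterov extrapolation
        y_next = prox_ball(y_tilde + eta_y * (x_next - mu * y_tilde), radius)
        x_prev, x, y_prev, y = x, x_next, y, y_next
    return x, y
```

Note that the $y$-update evaluates its gradient at the fresh `x_next` and at the extrapolated `y_tilde`, exactly mirroring the alternating structure of Algorithm 1.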

IV Convergence and Computation Complexity of Proximal-AltGDAm

Although the proposed proximal-AltGDAm algorithm applies the popular heavy-ball momentum and Nesterov's momentum to the GDA updates in a straightforward way, its convergence analysis does not directly follow from the existing studies of momentum-accelerated gradient descent algorithms. To explain, notice that the $x$-proximal gradient update of Algorithm 1 involves the partial gradient $\nabla_{1}f(x_{t},y_{t})$, which corresponds to the gradient of the time-varying function $f(\cdot,y_{t})$ (since $y_{t}$ changes over time $t$). Similarly, the $y$-proximal gradient update involves the gradient of another time-varying function $f(x_{t+1},\cdot)$. Therefore, both momentum-accelerated proximal gradient updates are applied to time-varying functions due to the nature of GDA updates. In this sense, the existing analysis of momentum-accelerated gradient descent algorithms cannot be directly applied to this algorithm.

To analyze the convergence of Algorithm 1, we first study the $x$-proximal gradient update with heavy-ball momentum and obtain the following characterization of its per-iteration progress on the objective function value.

Proposition 2.

Let Assumption 1 hold. Then, the parameter sequences $\{x_{t},y_{t}\}_{t}$ generated by proximal-AltGDAm satisfy, for all $t=0,1,2,\ldots$,

$$\begin{aligned} \Phi(x_{t+1})+g(x_{t+1})&\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1-\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\|x_{t+1}-x_{t}\|^{2}\\ &\quad+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\|y^{*}(x_{t})-y_{t}\|^{2}.\end{aligned}\qquad(2)$$

The above proposition tracks the per-iteration optimization progress made by the $x$-proximal gradient update with heavy-ball momentum. To elaborate, the increment terms $\|x_{t+1}-x_{t}\|^{2},\|x_{t}-x_{t-1}\|^{2}$ are induced by the heavy-ball momentum scheme. Moreover, since the $x$-update uses the partial gradient $\nabla_{1}f(x_{t},y_{t})$ to approximate the exact gradient $\nabla\Phi(x_{t})=\nabla_{1}f(x_{t},y^{*}(x_{t}))$, it naturally induces a tracking error term $\|y_{t}-y^{*}(x_{t})\|^{2}$ that tracks the optimization gap of the $y$-update. Hence, we need to further bound this tracking error term by analyzing the $y$-proximal gradient update with Nesterov's momentum, which we do next.

As explained earlier, the momentum-accelerated $y$-updates in proximal-AltGDAm are applied to a time-varying strongly-concave function $f(x_{t+1},\cdot)$. Hence, the tracking error term $\|y_{t}-y^{*}(x_{t})\|^{2}$ cannot be directly bounded using the standard convergence result for Nesterov's accelerated proximal gradient algorithm [28]. Instead, we view these $y$-updates as inexact accelerated proximal gradient updates. To elaborate, note that in the $t$-th iteration, the $y$-proximal gradient update is applied to the function $f(x_{t+1},\cdot)$. Then, we can rewrite the $y$-updates in all the previous iterations $k=0,1,\ldots,t-1$ as follows.

$$y_{k+1}=\mathrm{prox}_{\eta_{y}h}\Big(\widetilde{y}_{k}+\eta_{y}\nabla_{2}f(x_{t+1},\widetilde{y}_{k})+\eta_{y}\underbrace{[\nabla_{2}f(x_{k+1},\widetilde{y}_{k})-\nabla_{2}f(x_{t+1},\widetilde{y}_{k})]}_{\mathbf{e}_{k+1}}\Big).\qquad(3)$$

That is, we can view all the previous $y$-updates as being applied to the fixed function $f(x_{t+1},\cdot)$ with an inexactness $\mathbf{e}_{k+1}$ in the computation of the partial gradient $\nabla_{2}f(x_{t+1},\widetilde{y}_{k})$. In summary, the $y$-updates of proximal-AltGDAm can be understood as inexact accelerated gradient updates applied to the function $f(x_{t+1},\cdot)$ at time $t$. In particular, under the smoothness condition in Assumption 1, the inexactness is bounded as $\|\mathbf{e}_{k+1}\|\leq L\|x_{k+1}-x_{t+1}\|$. Consequently, we can leverage the existing convergence result for inexact accelerated gradient algorithms [36] to bound the optimality gap $\|y_{t}-y^{*}(x_{t})\|^{2}$ as follows.

Proposition 3.

Let Assumption 1 hold. Choose learning rate $\eta_{y}=\frac{1}{L}$ and momentum parameter $\gamma=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$. Then, the parameter sequences $\{x_{t},y_{t}\}$ generated by proximal-AltGDAm satisfy, for all $t=0,1,2,\ldots$,

$$\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\leq\frac{2R\kappa}{\sqrt{L}}(1-\kappa^{-0.5})^{t+1}+\frac{6\kappa^{2}}{\sqrt{L}}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|.\qquad(4)$$

Intuitively, in the above bound, the first term on the right-hand side corresponds to the normal accelerated convergence rate, and the other term is induced by the inexactness $\mathbf{e}_{k}$. As both terms are scaled by the accelerated convergence factor $(1-\kappa^{-0.5})$, we expect the above bound to converge fast and to further facilitate the convergence in eq. (2). Next, substituting eq. (4) into eq. (2) and telescoping, we obtain the following asymptotic stability properties of proximal-AltGDAm.

Corollary 1.

Under the conditions of Proposition 3 and with stepsize $\eta_{x}\leq 1/(16L\kappa^{\frac{11}{6}})$, the sequences $\{x_{t},y_{t}\}_{t}$ generated by proximal-AltGDAm satisfy

$$\|x_{t+1}-x_{t}\|,\ \|y_{t}-y^{*}(x_{t})\|,\ \|y_{t+1}-y_{t}\|\overset{t}{\to}0.$$

Remark 1. In [5], the proximal-GDA algorithm (without alternating updates and momentum) requires a small learning rate $\eta_{x}\leq\mathcal{O}(L^{-2}\kappa^{-3})$ to establish convergence. In comparison, proximal-AltGDAm allows a much larger learning rate $\eta_{x}\leq\mathcal{O}(L^{-1}\kappa^{-\frac{11}{6}})$, which guarantees a faster convergence (proved later), thanks to the momentum acceleration schemes.

Therefore, if we can show that $\{x_{t}\}_{t}$ converges to a desired critical point $x^{*}$, then the above stability properties of proximal-AltGDAm imply that $\{y_{t}\}_{t}$ converges to the corresponding best-response point $y^{*}(x^{*})$.

To further characterize the global convergence properties of proximal-AltGDAm, we define the following proximal gradient mapping associated with the objective function $\Phi(x)+g(x)$:

$$G(x):=\frac{1}{\eta_{x}}\Big(x-\mathrm{prox}_{\eta_{x}g}\big(x-\eta_{x}\nabla\Phi(x)\big)\Big).\qquad(5)$$

The proximal gradient mapping is a standard metric for evaluating the optimality of nonconvex composite optimization problems [28]. It can be shown that $x$ is a critical point of $(\Phi+g)(x)$ if and only if $G(x)=\mathbf{0}$, and $G$ reduces to the normal gradient when there is no regularizer $g$. Hence, our convergence criterion is to find a near-critical point $x$ that satisfies $\|G(x)\|\leq\epsilon$ for some pre-determined accuracy $\epsilon>0$.
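As a sketch, the stopping criterion $\|G(x)\|\leq\epsilon$ can be checked as follows, assuming hypothetical callables `grad_Phi` (returning an approximation of $\nabla\Phi(x)$, e.g., $\nabla_{1}f(x,y_{t})$) and `prox_g` (implementing $\mathrm{prox}_{\eta_{x}g}$); neither name is from the paper.

```python
import numpy as np

def grad_mapping_norm(x, eta_x, grad_Phi, prox_g):
    # ||G(x)|| from eq. (5): the scaled residual of one proximal gradient step.
    G = (x - prox_g(x - eta_x * grad_Phi(x), eta_x)) / eta_x
    return np.linalg.norm(G)
```

In practice, one stops the outer loop once `grad_mapping_norm` falls below the target accuracy $\epsilon$.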

Next, by leveraging Proposition 2 and Proposition 3, we obtain the following global convergence result of proximal-AltGDAm and characterize its computational complexity (number of gradient and proximal mapping evaluations).

Theorem 1 (Global convergence).

Under the conditions of Proposition 3 and with stepsize $\eta_{x}\leq 1/(16L\kappa^{\frac{11}{6}})$, the sequence $\{x_{t}\}_{t}$ generated by proximal-AltGDAm is bounded and has a compact set of limit points, and every such limit point is a critical point of $(\Phi+g)(x)$. Moreover, the total number of iterations $T$ required to achieve $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\epsilon$ is $T=\mathcal{O}\big(\kappa^{\frac{11}{6}}\epsilon^{-2}\big)$, which is also the order of the required computation complexity.

To elaborate, the first statement of Theorem 1 shows that the sequence generated by proximal-AltGDAm converges to critical points of the minimax problem. The second statement shows that the computation complexity of proximal-AltGDAm for finding a near-critical point is of order $\mathcal{O}\big(\kappa^{\frac{11}{6}}\epsilon^{-2}\big)$, which strictly improves the complexity $\mathcal{O}\big(\kappa^{2}\epsilon^{-2}\big)$ of both GDA [23] and proximal-AltGDA [4] in nonconvex-strongly-concave minimax optimization. The improvement is substantial when the problem condition number $\kappa=L/\mu$ is large, while the dependence on $\epsilon^{-2}$ is generally unimprovable in nonconvex optimization. In addition, thanks to the momentum acceleration schemes, our choice of learning rate $\eta_{x}=\mathcal{O}(L^{-1}\kappa^{-\frac{11}{6}})$ is more flexible than the choice $\eta_{x}=\mathcal{O}(L^{-1}\kappa^{-2})$ adopted by these GDA-type algorithms. These improvements are attributed not only to momentum acceleration, but also to the careful selection of the coefficients in the proof, which aims to minimize the dependence of the complexity on $\kappa$.

V Experiments

In this section, we compare the performance of proximal-AltGDAm with that of other GDA-type algorithms via numerical experiments. Specifically, we compare proximal-AltGDAm with the standard proximal-GDA and proximal-AltGDA algorithms [5] and the single-loop accelerated AltGDA algorithm (APDA) [47]. All of these algorithms have a single-loop structure.

We consider solving the following regularized Wasserstein robustness model (WRM) [37] on the MNIST dataset [19]:

$$\min_{\theta}\max_{\{\xi_{i}\}^{n}_{i=1}}\ \frac{1}{n}\sum_{i=1}^{n}\Big[\ell(h_{\theta}(\xi_{i}),y_{i})-\lambda\|\xi_{i}-x_{i}\|^{2}\Big]-\lambda_{1}\sum_{i=1}^{n}\|\xi_{i}\|_{1}+\frac{\lambda_{2}}{2}\|\theta\|^{2},\qquad(6)$$

where $n=60$k is the number of training samples, $\theta$ is the model parameter, $(x_{i},y_{i})$ is the $i$-th data sample and its label, and $\xi_{i}$ is the adversarial sample corresponding to $x_{i}$. We choose the cross-entropy loss function $\ell$. We add an $\ell_{1}$ regularizer to impose sparsity on the adversarial examples, and an $\ell_{2}^{2}$ regularizer to prevent divergence of the model parameters.

We set $\lambda=1.0$, which suffices to make the maximization part strongly concave, and set $\lambda_{1}=\lambda_{2}=10^{-4}$. We use a convolutional network that consists of two convolution blocks followed by two fully connected layers. Specifically, each convolution block contains a convolution layer, a max-pooling layer with stride 2, and a ReLU activation layer. The convolution layers in the two blocks have 1, 10 input channels and 10, 20 output channels, respectively, and both have kernel size 5, stride 1 and no padding. The two fully connected layers have input dimensions 320, 50 and output dimensions 50, 10, respectively.
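A minimal PyTorch sketch of this classifier is given below; the layer sizes follow the description above (a 28x28 MNIST input yields the 320-dimensional flattened feature), while the module names and layout are our own illustration.

```python
import torch.nn as nn

class WRMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5, stride=1),   # block 1: 28x28 -> 24x24
            nn.MaxPool2d(2),                             # -> 12x12
            nn.ReLU(),
            nn.Conv2d(10, 20, kernel_size=5, stride=1),  # block 2: 12x12 -> 8x8
            nn.MaxPool2d(2),                             # -> 4x4, so 20 * 4 * 4 = 320
            nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(320, 50), nn.Linear(50, 10))

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```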

We implement all four algorithms using full gradients over the whole training set of 60k images. We choose the same learning rates $\eta_{x}=\eta_{y}=10^{-3}$ for all algorithms. For proximal-AltGDAm, we choose momentum parameters $\beta=0.25$ and $\gamma=0.75$. For APDA, we choose the fine-tuned $\eta=2\eta_{x}$. As the function $\Phi(x)$ cannot be evaluated exactly, we run 100 steps of stochastic gradient ascent with learning rate 0.1 to maximize $f(x_{t},y)-h(y)$ and obtain an approximate $y^{*}(x_{t})$, which is used to estimate $\Phi(x)+g(x)$.
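A minimal sketch of this evaluation routine (written here as proximal gradient ascent to handle the regularizer $h$; the helper names `grad2_f` and `prox_h` are ours, not from the paper):

```python
def approx_y_star(x_t, y, grad2_f, prox_h, steps=100, lr=0.1):
    # Approximate y*(x_t) = argmax_y f(x_t, y) - h(y) by gradient ascent,
    # so that Phi(x_t) + g(x_t) can be estimated for Figure 1.
    for _ in range(steps):
        y = prox_h(y + lr * grad2_f(x_t, y), lr)
    return y
```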

Figure 1: Left: comparison of $\Phi(x)+g(x)$ for all four algorithms. Right: comparison of the corresponding accuracy on the test dataset.

Figure 1 (left) compares the estimated objective function value achieved by all four algorithms. Proximal-AltGDAm achieves the fastest convergence among these algorithms and is significantly faster than proximal-GDA and proximal-AltGDA, which demonstrates the effectiveness of our simple momentum scheme. Figure 1 (right) further demonstrates the advantage of proximal-AltGDAm in test accuracy: the robust model trained by proximal-AltGDAm achieves a much higher test accuracy.

VI Conclusion

We develop a single-loop and fast AltGDA algorithm that leverages proximal gradient updates and momentum acceleration to solve general regularized nonconvex-strongly-concave minimax optimization problems. By viewing the GDA updates of the algorithm as inexact accelerated gradient updates, we prove that the algorithm converges to an $\epsilon$-critical point with a computation complexity of $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, which substantially improves the state-of-the-art result. In future work, it would be interesting to develop a stochastic variant of this algorithm to further improve the sample complexity.

Acknowledgment

The work of Ziyi Chen, Shaocong Ma and Yi Zhou was supported in part by U.S. National Science Foundation under the Grants CCF-2106216 and DMS-2134223.

References

  • [1] L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 486–495, 2019.
  • [2] J. P. Bailey, G. Gidel, and G. Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Proc. Conference on Learning Theory (COLT), pages 391–407, 2020.
  • [3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, Mar. 2009.
  • [4] R. I. Boţ and A. Böhm. Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax problems. ArXiv:2007.13605, 2020.
  • [5] Z. Chen, Y. Zhou, T. Xu, and Y. Liang. Proximal gradient descent-ascent: Variable convergence under KŁ geometry. In Proc. International Conference on Learning Representations (ICLR), 2021.
  • [6] A. Cherukuri, B. Gharesifard, and J. Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
  • [7] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 9236–9246, 2018.
  • [8] S. S. Du and W. Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 196–205, 2019.
  • [9] A. Fallah, A. Ozdaglar, and S. Pattathil. An optimal multistage stochastic gradient method for minimax problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 3573–3579. IEEE, 2020.
  • [10] M. A. M. Ferreira, M. Andrade, M. C. P. Matos, J. A. Filipe, and M. P. Coelho. Minimax theorem and nash equilibrium. International Journal of Latest Trends in Finance & Economic Sciences, 2012.
  • [11] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, Mar. 2016.
  • [12] G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas. Negative momentum for improved game dynamics. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1802–1811, 2019.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
  • [14] Z. Guo, Y. Xu, W. Yin, R. Jin, and T. Yang. On stochastic moving-average estimators for non-convex optimization. ArXiv:2104.14840, 2021.
  • [15] F. Huang, S. Gao, and H. Huang. Gradient descent ascent for min-max problems on Riemannian manifolds. ArXiv:2010.06097, 2020.
  • [16] F. Huang, S. Gao, J. Pei, and H. Huang. Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization. ArXiv:2008.08170, 2020.
  • [17] F. Huang, X. Wu, and H. Huang. Efficient mirror descent ascent methods for nonsmooth minimax problems. In Proc. International Conference on Neural Information Processing Systems (Neurips), volume 34, 2021.
  • [18] C. Jin, P. Netrapalli, and M. I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In Proc. International Conference on Machine Learning (ICML), pages 4880–4889, 2020.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [20] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Proc. International Conference on Neural Information Processing Systems (Neurips), pages 379–387, 2015.
  • [21] Q. Li, Y. Zhou, Y. Liang, and P. K. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 70, pages 2111–2119, Aug 2017.
  • [22] T. Lin, C. Jin, and M. I. Jordan. Near-optimal algorithms for minimax optimization. In Proc. Annual Conference on Learning Theory (COLT), pages 2738–2779, 2020.
  • [23] T. Lin, C. Jin, and M. I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In Proc. International Conference on Machine Learning (ICML), pages 6083–6093, 2020.
  • [24] P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
  • [25] L. Luo, H. Ye, Z. Huang, and T. Zhang. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.
  • [26] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1497–1507, 2020.
  • [27] A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
  • [28] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2014.
  • [29] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^{2})$. In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  • [30] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 14934–14942, 2019.
  • [31] P. Ochs. Local convergence of the heavy-ball method and iPiano for non-convex optimization. Journal of Optimization Theory and Applications, 177(1):153–180, Apr 2018.
  • [32] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.
  • [33] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [34] S. Qiu, Z. Yang, X. Wei, J. Ye, and Z. Wang. Single-timescale stochastic nonconvex-concave optimization for smooth nonlinear td learning. ArXiv:2008.10103, 2020.
  • [35] R. T. Rockafellar. Convex Analysis. Number 28. Princeton University Press, 1970.
  • [36] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2011.
  • [37] A. Sinha, H. Namkoong, and J. C. Duchi. Certifying some distributional robustness with principled adversarial training. In Proc. International Conference on Learning Representations (ICLR), 2018.
  • [38] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, Oct 2010.
  • [39] Y. Wang and J. Li. Improved algorithms for convex-concave minimax optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [40] Y. Wang, G. Zhang, and J. Ba. On solving minimax optimization locally: A follow-the-ridge approach. In Proc. International Conference on Learning Representations (ICLR), 2019.
  • [41] T. Xu, Z. Wang, Y. Liang, and H. V. Poor. Enhanced first and zeroth order variance reduced algorithms for min-max optimization. ArXiv:2006.09361, 2020.
  • [42] Z. Xu, H. Zhang, Y. Xu, and G. Lan. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems. ArXiv:2006.02032, 2020.
  • [43] A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. In Proc. International Conference on Learning Representations (ICLR), 2018.
  • [44] J. Yang, N. Kiyavash, and N. He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [45] G. Zhang and Y. Wang. On the suboptimality of negative momentum for minimax optimization. ArXiv:2008.07459, 2020.
  • [46] S. Zhang, J. Yang, C. Guzmán, N. Kiyavash, and N. He. The complexity of nonconvex-strongly-concave minimax optimization. ArXiv:2103.15888, 2021.
  • [47] Y. Zhu, D. Liu, and Q. Tran-Dinh. A New Primal-Dual Algorithm for a Class of Nonlinear Compositional Convex Optimization Problems. ArXiv:2006.09263, June 2020.

Appendix A Proof of Proposition 2

See Proposition 2.

Proof.

Consider the $t$-th iteration of proximal-AltGDAm. As the function $\Phi$ is $L(1+\kappa)$-smooth (from Proposition 1), we obtain that

$$\Phi(x_{t+1})\leq\Phi(x_{t})+\langle x_{t+1}-x_{t},\nabla\Phi(x_{t})\rangle+\frac{L(1+\kappa)}{2}\|x_{t+1}-x_{t}\|^{2}.\qquad(7)$$

On the other hand, by the definition of the proximal gradient update that produces $x_{t+1}$, we have that

$$g(x_{t+1})+\frac{1}{2\eta_{x}}\|x_{t+1}-\widetilde{x}_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\|^{2}\leq g(x_{t})+\frac{1}{2\eta_{x}}\|x_{t}-\widetilde{x}_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\|^{2},$$

which further simplifies to

$$\begin{aligned} g(x_{t+1})&\leq g(x_{t})+\frac{1}{2\eta_{x}}\|x_{t}-\widetilde{x}_{t}\|^{2}+\langle x_{t}-\widetilde{x}_{t},\nabla_{1}f(x_{t},y_{t})\rangle\\ &\quad-\frac{1}{2\eta_{x}}\|x_{t+1}-\widetilde{x}_{t}\|^{2}-\langle x_{t+1}-\widetilde{x}_{t},\nabla_{1}f(x_{t},y_{t})\rangle\\ &\overset{(i)}{=}g(x_{t})+\frac{\beta^{2}}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}-\frac{1}{2\eta_{x}}\|x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})\|^{2}\\ &\quad+\langle x_{t}-x_{t+1},\nabla_{1}f(x_{t},y_{t})\rangle\\ &=g(x_{t})+\frac{\beta^{2}}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}-\frac{1}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}-\frac{\beta^{2}}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\frac{\beta}{\eta_{x}}\langle x_{t+1}-x_{t},x_{t}-x_{t-1}\rangle+\langle x_{t}-x_{t+1},\nabla_{1}f(x_{t},y_{t})\rangle\\ &\leq g(x_{t})-\frac{1}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\langle x_{t}-x_{t+1},\nabla_{1}f(x_{t},y_{t})\rangle,\end{aligned}\qquad(8)$$

where (i) uses the fact that $x_{t}-\widetilde{x}_{t}=\beta(x_{t-1}-x_{t})$.

Adding up eq. (7) and eq. (8) yields that

$$\begin{aligned} &\Phi(x_{t+1})+g(x_{t+1})\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-\frac{L(1+\kappa)}{2}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\langle x_{t+1}-x_{t},\nabla\Phi(x_{t})-\nabla_{1}f(x_{t},y_{t})\rangle\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-\frac{L(1+\kappa)}{2}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\|x_{t+1}-x_{t}\|\|\nabla\Phi(x_{t})-\nabla_{1}f(x_{t},y_{t})\|\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-\frac{L(1+\kappa)}{2}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+\|x_{t+1}-x_{t}\|\|\nabla_{1}f(x_{t},y^{*}(x_{t}))-\nabla_{1}f(x_{t},y_{t})\|\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1}{2\eta_{x}}-L\kappa\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}\\ &\quad+L\|x_{t+1}-x_{t}\|\|y^{*}(x_{t})-y_{t}\|\\ &\overset{(i)}{\leq}\Phi(x_{t})+g(x_{t})-\Big(\frac{1-\beta}{2\eta_{x}}-L\kappa-L\kappa^{\frac{11}{6}}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\|y^{*}(x_{t})-y_{t}\|^{2}\\ &\leq\Phi(x_{t})+g(x_{t})-\Big(\frac{1-\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\|y^{*}(x_{t})-y_{t}\|^{2},\end{aligned}\qquad(9)$$

where (i) uses the AM-GM inequality $ab\leq\frac{Ca^{2}}{2}+\frac{b^{2}}{2C}$ for any $a,b\geq 0$ and $C>0$, applied with $C=2\kappa^{\frac{11}{6}}$ to the term $L\|x_{t+1}-x_{t}\|\|y^{*}(x_{t})-y_{t}\|$. This proves eq. (2). ∎

Appendix B Proof of Proposition 3

See Proposition 3.

Proof.

We rewrite the inner accelerated gradient ascent steps in Algorithm 1 as the inexact proximal gradient method (3). Then, based on Theorem 4 of [36], with $\eta_{y}=\frac{1}{L}$ and $\gamma=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, this method has the following convergence rate:

$$f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\leq(1-\kappa^{-0.5})^{t+1}\Big(\sqrt{2\big(f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\big)}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}\|e_{i}\|(1-\kappa^{-0.5})^{-i/2}\Big).\qquad(10)$$

The above convergence rate can be simplified as follows.

$$\begin{aligned} &\frac{\mu}{2}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\\ &\overset{(i)}{\leq}f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\\ &\leq(1-\kappa^{-0.5})^{t+1}\Big(\sqrt{2\big(f(x_{t+1},y^{*}(x_{t+1}))-f(x_{t+1},y_{t+1})\big)}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}\|e_{i}\|(1-\kappa^{-0.5})^{-i/2}\Big)\\ &\overset{(ii)}{\leq}(1-\kappa^{-0.5})^{t+1}\sqrt{L\|y_{t+1}-y^{*}(x_{t+1})\|^{2}}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}\|\nabla_{2}f(x_{i},\widetilde{y}_{i-1})-\nabla_{2}f(x_{t+1},\widetilde{y}_{i-1})\|(1-\kappa^{-0.5})^{t+1-i/2}\\ &\overset{(iii)}{\leq}R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+\sqrt{\frac{2}{\mu}}\sum_{i=1}^{t+1}(1-\kappa^{-0.5})^{t+1-i/2}\sum_{j=i}^{t}L\|x_{j+1}-x_{j}\|\\ &=R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+L\sqrt{\frac{2}{\mu}}\sum_{j=1}^{t}\sum_{i=1}^{j}(1-\kappa^{-0.5})^{t+1-i/2}\|x_{j+1}-x_{j}\|\\ &=R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+\sqrt{2L\kappa}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+0.5}\frac{(1-\kappa^{-0.5})^{-j/2}-1}{(1-\kappa^{-0.5})^{-0.5}-1}\|x_{j+1}-x_{j}\|\\ &\leq R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+\sqrt{2L\kappa}\sum_{j=1}^{t}\frac{(1-\kappa^{-0.5})^{t+1-j/2}}{1-(1-\kappa^{-0.5})^{0.5}}\|x_{j+1}-x_{j}\|\\ &\overset{(iv)}{\leq}R\sqrt{L}(1-\kappa^{-0.5})^{t+1}+2\kappa\sqrt{2L}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|,\end{aligned}$$

where (i) and (ii) use $\nabla_{2}f(x_{t+1},y^{*}(x_{t+1}))=0$ and Assumption 1.1 (that $f(x,\cdot)$ is $L$-smooth and $\mu$-strongly concave), (iii) uses the fact that $\mathcal{Y}$ is bounded with diameter $R$ and Assumption 1.1 (that $f$ is $L$-smooth), and (iv) uses $\frac{1}{1-(1-\kappa^{-0.5})^{0.5}}=\frac{1+(1-\kappa^{-0.5})^{0.5}}{\kappa^{-0.5}}\leq 2\kappa^{0.5}$. Multiplying the above inequality by $2/\mu$ proves Proposition 3. ∎

Appendix C Proof of Corollary 1

See Corollary 1.

Proof.

Summing eq. (4) over $t=0,1,\ldots,T-1$ yields that

$$\begin{aligned} &\sum_{t=0}^{T-1}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\\ &\leq\frac{2R\kappa}{\sqrt{L}}\sum_{t=0}^{T-1}(1-\kappa^{-0.5})^{t+1}+\frac{6\kappa^{2}}{\sqrt{L}}\sum_{t=0}^{T-1}\sum_{j=1}^{t}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{6\kappa^{2}}{\sqrt{L}}\sum_{j=1}^{T-1}\sum_{t=j}^{T-1}(1-\kappa^{-0.5})^{t+1-j/2}\|x_{j+1}-x_{j}\|\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{6\kappa^{2.5}}{\sqrt{L}}\sum_{j=1}^{T-1}(1-\kappa^{-0.5})^{j/2}\|x_{j+1}-x_{j}\|\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{2.5}}{\sqrt{L}}\sum_{j=1}^{T-1}\Big(\frac{1}{\kappa^{\frac{7}{6}}\sqrt{L}}(1-\kappa^{-0.5})^{j}+\kappa^{\frac{7}{6}}\sqrt{L}\|x_{j+1}-x_{j}\|^{2}\Big)\\ &\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{T-1}\|x_{j+1}-x_{j}\|^{2}.\end{aligned}\qquad(11)$$

Then, telescoping eq. (2) over $t=0,1,\ldots,T-1$ yields that

$$\begin{aligned} &\Phi(x_{T})+g(x_{T})-\Phi(x_{0})-g(x_{0})\\ &\leq-\Big(\frac{1-\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+\frac{\beta}{2\eta_{x}}\sum_{t=0}^{T-1}\|x_{t}-x_{t-1}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\sum_{t=0}^{T-1}\|y^{*}(x_{t})-y_{t}\|^{2}\\ &\overset{(i)}{\leq}-\Big(\frac{1-2\beta}{2\eta_{x}}-2L\kappa^{\frac{11}{6}}\Big)\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+\frac{L}{4\kappa^{\frac{11}{6}}}\Big(R^{2}+\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{T-1}\|x_{j+1}-x_{j}\|^{2}\Big)\\ &\leq-\Big(\frac{1-2\beta}{2\eta_{x}}-3L\kappa^{\frac{11}{6}}\Big)\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1,\end{aligned}\qquad(12)$$

where (i) uses $x_{-1}=x_{0}$, $\|y^{*}(x_{0})-y_{0}\|\leq R$ and eq. (11). When $\eta_{x}\leq\frac{1}{16L\kappa^{\frac{11}{6}}}$ and $\beta\leq\frac{1}{4}$, rearranging the above inequality yields that

$$L\kappa^{\frac{11}{6}}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}\leq\Phi(x_{0})+g(x_{0})-\inf_{x\in\mathbb{R}^{m}}\big(\Phi(x)+g(x)\big)+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1<+\infty.\qquad(13)$$

Letting $T\to\infty$ in the above inequality yields that $\sum_{t=0}^{\infty}\|x_{t+1}-x_{t}\|^{2}<+\infty$, so $\|x_{t+1}-x_{t}\|\overset{t}{\to}0$. Then, letting $T\to\infty$ in eq. (11) yields that $\sum_{t=0}^{\infty}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\leq\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{\infty}\|x_{j+1}-x_{j}\|^{2}<+\infty$, so $\|y_{t}-y^{*}(x_{t})\|\overset{t}{\to}0$.

The last claim, $\|y_{t+1}-y_{t}\|\overset{t}{\to}0$, can be proved as follows:

$$\begin{aligned} \|y_{t+1}-y_{t}\|&\leq\|y_{t+1}-y^{*}(x_{t+1})\|+\|y^{*}(x_{t})-y_{t}\|+\|y^{*}(x_{t+1})-y^{*}(x_{t})\|\\ &\overset{(i)}{\leq}\|y_{t+1}-y^{*}(x_{t+1})\|+\|y_{t}-y^{*}(x_{t})\|+\kappa\|x_{t+1}-x_{t}\|\overset{t}{\to}0,\end{aligned}$$

where (i) uses the fact that $y^{*}$ is $\kappa$-Lipschitz. ∎

Appendix D Proof of Theorem 1

See Theorem 1.

Proof.

We first prove the existence of limit points of $\{x_{t}\}$. Note that in eq. (12), $\frac{1-2\beta}{2\eta_{x}}-3L\kappa^{\frac{11}{6}}\geq 0$ since $\eta_{x}\leq\frac{1}{16L\kappa^{\frac{11}{6}}}$ and $\beta\leq\frac{1}{4}$. Hence, for all $T\geq 0$,

$$\Phi(x_{T})+g(x_{T})\leq\Phi(x_{0})+g(x_{0})+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1<+\infty,$$

which implies that $\{\Phi(x_{t})+g(x_{t})\}_{t}$ is upper bounded. Hence, by Assumption 1.2, the sequence $\{x_{t}\}_{t}$ is bounded and thus has a compact set of limit points.

Next, we prove that every limit point $x^{*}$ of $\{x_{t}\}_{t}$ is a critical point of $(\Phi+g)(x)$, i.e., $\mathbf{0}\in\partial(\Phi+g)(x^{*})$. Let $\{x_{t(j)}\}_{j}$ be any subsequence such that $x_{t(j)}\overset{j}{\to}x^{*}$. By the optimality condition of the proximal gradient update of $x_{t+1}$, we have

$$\begin{aligned} \mathbf{0}&\in\partial g(x_{t+1})+\frac{1}{\eta_{x}}\big(x_{t+1}-\widetilde{x}_{t}+\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)\\ &=\partial g(x_{t+1})+\frac{1}{\eta_{x}}\big(x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})+\eta_{x}\nabla_{1}f(x_{t},y_{t})\big),\end{aligned}$$

which implies that $-\frac{1}{\eta_{x}}\big(x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})+\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)\in\partial g(x_{t+1})$, and thus, by convexity of $g$, we have

$$g(x)\geq g(x_{t(j)+1})-\frac{1}{\eta_{x}}\big(x_{t(j)+1}-x_{t(j)}-\beta(x_{t(j)}-x_{t(j)-1})+\eta_{x}\nabla_{1}f(x_{t(j)},y_{t(j)})\big)^{\top}(x-x_{t(j)+1}),\quad\forall x\in\mathbb{R}^{m}.\qquad(14)$$

As $x_{t(j)}\overset{j}{\to}x^{*}$ and $\|y_{t(j)}-y^{*}(x_{t(j)})\|\overset{j}{\to}0$, we have $y_{t(j)}\overset{j}{\to}y^{*}(x^{*})$ by continuity of $y^{*}(\cdot)$; moreover, $x_{t(j)+1}\overset{j}{\to}x^{*}$ as well, since $\|x_{t+1}-x_{t}\|\overset{t}{\to}0$ by Corollary 1. Also note that the convex function $g$ is continuous (see Corollary 10.1.1 of [35]). Hence, letting $j\to\infty$ in eq. (14) yields that

$$g(x)\geq g(x^{*})-\nabla_{1}f(x^{*},y^{*}(x^{*}))^{\top}(x-x^{*})=g(x^{*})-\nabla\Phi(x^{*})^{\top}(x-x^{*}),\quad\forall x\in\mathbb{R}^{m},\qquad(15)$$

which further implies that $-\nabla\Phi(x^{*})\in\partial g(x^{*})$, i.e., $\mathbf{0}\in\partial(\Phi+g)(x^{*})$. Hence, $x^{*}$ is a critical point of $(\Phi+g)(x)$.

Finally, we derive the non-asymptotic computation complexity to obtain $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\epsilon$. Note that

$$\begin{aligned} \|G(x_{t+1})\|&=\frac{1}{\eta_{x}}\big\|x_{t+1}-\mathrm{prox}_{\eta_{x}g}\big(x_{t+1}-\eta_{x}\nabla\Phi(x_{t+1})\big)\big\|\\ &\overset{(i)}{\leq}\frac{1}{\eta_{x}}\big\|x_{t+1}-\widetilde{x}_{t}+\eta_{x}\big[\nabla_{1}f(x_{t},y_{t})-\nabla_{1}f\big(x_{t+1},y^{*}(x_{t+1})\big)\big]\big\|\\ &\leq\frac{1}{\eta_{x}}\big\|x_{t+1}-x_{t}-\beta(x_{t}-x_{t-1})\big\|+L\|x_{t+1}-x_{t}\|+L\|y^{*}(x_{t+1})-y^{*}(x_{t})\|+L\|y^{*}(x_{t})-y_{t}\|\\ &\overset{(ii)}{\leq}\Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)\|x_{t+1}-x_{t}\|+\frac{\beta}{\eta_{x}}\|x_{t}-x_{t-1}\|+L\|y^{*}(x_{t})-y_{t}\|,\end{aligned}$$

where (i) uses $x_{t+1}\in\mathrm{prox}_{\eta_{x}g}\big(\widetilde{x}_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)$, the expression $\nabla\Phi(x)=\nabla_{1}f\big(x,y^{*}(x)\big)$ from Proposition 1 and the non-expansiveness of the proximal mapping, and (ii) uses the $\kappa$-Lipschitz continuity of $y^{*}$ from Proposition 1. Hence,

$$\begin{aligned} &(T-1)\min_{0\leq t\leq T}\|G(x_{t})\|^{2}\\ &\leq(T-1)\min_{1\leq t\leq T-1}\|G(x_{t+1})\|^{2}\\ &\leq\sum_{t=1}^{T-1}\|G(x_{t+1})\|^{2}\\ &\leq\sum_{t=1}^{T-1}\Big[3\Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)^{2}\|x_{t+1}-x_{t}\|^{2}+\frac{3\beta^{2}}{\eta_{x}^{2}}\|x_{t}-x_{t-1}\|^{2}+3L^{2}\|y^{*}(x_{t})-y_{t}\|^{2}\Big]\\ &\overset{(i)}{\leq}3(18L\kappa^{\frac{11}{6}})^{2}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+27L^{2}\kappa^{\frac{11}{3}}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+3L^{2}\sum_{t=0}^{T-1}\|y^{*}(x_{t})-y_{t}\|^{2}\\ &\overset{(ii)}{\leq}999L^{2}\kappa^{\frac{11}{3}}\sum_{t=0}^{T-1}\|x_{t+1}-x_{t}\|^{2}+3L^{2}\Big(\frac{2R\kappa^{1.5}}{\sqrt{L}}+\frac{3\kappa^{\frac{11}{6}}}{L}+3\kappa^{\frac{11}{3}}\sum_{j=1}^{T-1}\|x_{j+1}-x_{j}\|^{2}\Big)+3L^{2}\|y^{*}(x_{0})-y_{0}\|^{2}\\ &\overset{(iii)}{\leq}\frac{1008L^{2}\kappa^{\frac{11}{3}}}{L\kappa^{\frac{11}{6}}}\Big(\Phi(x_{0})+g(x_{0})-\inf_{x\in\mathbb{R}^{m}}\big(\Phi(x)+g(x)\big)+\frac{LR^{2}}{4\kappa^{\frac{11}{6}}}+\frac{R\sqrt{L}}{2\kappa^{\frac{1}{3}}}+1\Big)+6RL^{1.5}\kappa^{1.5}+9L\kappa^{\frac{11}{6}}+3L^{2}R^{2}\\ &=\mathcal{O}(\kappa^{\frac{11}{6}}),\end{aligned}$$

where (i) uses $\beta\leq\frac{1}{4}$ and the maximum possible stepsize $\eta_{x}=\frac{1}{16L\kappa^{\frac{11}{6}}}$, (ii) uses eq. (11), and (iii) uses eq. (13) and the fact that $\mathcal{Y}$ is bounded with diameter $R$. Based on the above inequality, once the number of iterations satisfies $T\geq\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$, we have $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\sqrt{\mathcal{O}(\kappa^{\frac{11}{6}})/(T-1)}\leq\epsilon$. Since each iteration requires $\mathcal{O}(1)$ gradient and proximal mapping evaluations, the computational complexity is also of the order $\mathcal{O}(\kappa^{\frac{11}{6}}\epsilon^{-2})$. ∎
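To make the convergence measure above concrete, the proximal gradient mapping $G$ admits a closed form for simple regularizers. The following minimal Python sketch (our illustration, not part of the analysis) assumes $g(x)=\lambda\|x\|_{1}$, whose proximal mapping is soft-thresholding, and takes the gradient $\nabla\Phi(x)$ as an input vector.

import numpy as np

def prox_l1(v, tau):
    # Proximal mapping of tau*||.||_1, i.e., soft-thresholding at tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_mapping_norm(x, grad_Phi, eta_x, lam):
    # ||G(x)|| = (1/eta_x)*||x - prox_{eta_x*g}(x - eta_x*grad_Phi)||
    # with g = lam*||.||_1, so the prox soft-thresholds at eta_x*lam.
    x_plus = prox_l1(x - eta_x * grad_Phi, eta_x * lam)
    return np.linalg.norm(x - x_plus) / eta_x

Tracking $\min_{0\leq t\leq T}\|G(x_{t})\|$ along the iterates then reproduces exactly the quantity bounded in the proof above.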

Appendix E Derivation of computational complexities in Table I

In this section, we derive the computational complexities in Table I that are not directly shown in their corresponding papers. Note that all the GDA-type algorithms in Table I are single-loop, so the computational complexity (the number of gradient evaluations) has the order $\mathcal{O}(T)$, where $T$ is the number of iterations.

First, the papers in Table I use different convergence measures for the computational complexity. Specifically, [5, 17] and our work bound the computational complexity needed to achieve $\|G(x)\|\leq\epsilon$, where the proximal gradient mapping $G$ is defined in (5). [23, 4] use the measure $\min_{t}\mathrm{dist}\big(\nabla\Phi(x_{t})+\partial g(x_{t}),\mathbf{0}\big)\leq\epsilon$, where $\Phi(x):=\max_{y\in\mathcal{Y}}f(x,y)-h(y)$, $\partial g$ denotes the subdifferential of $g$ and $\mathrm{dist}(A,\mathbf{0})$ denotes the distance between $\mathbf{0}$ and the set $A$. [42] has no regularizers $g,h$ and uses the convergence measure $\min_{t}\|\nabla f(x_{t},y_{t})\|\leq\epsilon$ when $y\in\mathbb{R}^{d}$ is unconstrained, which does not necessarily yield the desired approximate critical point of $\Phi$.
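To illustrate the second measure for a concrete regularizer, the distance $\mathrm{dist}\big(\nabla\Phi(x)+\partial g(x),\mathbf{0}\big)$ can be computed coordinatewise when $g(x)=\lambda\|x\|_{1}$; this choice of $g$ and the sketch below are our illustration only, not taken from [23, 4].

import numpy as np

def dist_to_subdiff_l1(grad_Phi, x, lam):
    # dist(grad_Phi + subdiff(lam*||.||_1)(x), 0), computed per coordinate:
    # if x_i != 0, the subdifferential is the singleton {lam*sign(x_i)};
    # if x_i == 0, it is the interval [-lam, lam].
    d = np.where(x != 0.0,
                 np.abs(grad_Phi + lam * np.sign(x)),
                 np.maximum(np.abs(grad_Phi) - lam, 0.0))
    return np.linalg.norm(d)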

E-A Derivation of complexity in [5]

In [5], Proposition 2 states that the Lyapunov function $H(z_{t}):=\Phi(x_{t})+g(x_{t})+\big(1-\frac{1}{4\kappa^{2}}\big)\|y_{t}-y^{*}(x_{t})\|^{2}$, where $z_{t}:=(x_{t},y_{t})$ denotes the iterates generated by GDA, decreases at the following rate:

H(z_{t+1})\leq H(z_{t})-2\|x_{t+1}-x_{t}\|^{2}-\frac{1}{4\kappa^{2}}\big(\|y_{t+1}-y^{*}(x_{t+1})\|^{2}+\|y_{t}-y^{*}(x_{t})\|^{2}\big). \qquad (16)

Note that for GDA, the gradient mapping (5) has the following norm bound

\|G(x_{t+1})\| = \frac{1}{\eta_{x}}\big\|x_{t+1}-\mathrm{prox}_{\eta_{x}g}\big(x_{t+1}-\eta_{x}\nabla\Phi(x_{t+1})\big)\big\|
\overset{(i)}{\leq} \frac{1}{\eta_{x}}\big\|x_{t+1}-x_{t}+\eta_{x}\big[\nabla_{1}f(x_{t},y_{t})-\nabla_{1}f\big(x_{t+1},y^{*}(x_{t+1})\big)\big]\big\|
\leq \Big(\frac{1}{\eta_{x}}+L\Big)\|x_{t+1}-x_{t}\|+L\|y^{*}(x_{t+1})-y^{*}(x_{t})\|+L\|y^{*}(x_{t})-y_{t}\|
\overset{(ii)}{\leq} \Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)\|x_{t+1}-x_{t}\|+L\|y^{*}(x_{t})-y_{t}\|,

where (i) uses the GDA update rule $x_{t+1}\in\mathrm{prox}_{\eta_{x}g}\big(x_{t}-\eta_{x}\nabla_{1}f(x_{t},y_{t})\big)$, the expression $\nabla\Phi(x)=\nabla_{1}f\big(x,y^{*}(x)\big)$ from Proposition 1 and the non-expansiveness of the proximal mapping, and (ii) uses the $\kappa$-Lipschitz continuity of $y^{*}$ from Proposition 1. Hence, we obtain the following convergence rate:

\min_{0\leq t\leq T}\|G(x_{t})\|^{2} \leq \frac{1}{T}\sum_{t=0}^{T-1}\|G(x_{t+1})\|^{2}
\leq \frac{1}{T}\sum_{t=0}^{T-1}\Big[2\Big(\frac{1}{\eta_{x}}+L+L\kappa\Big)^{2}\|x_{t+1}-x_{t}\|^{2}+2L^{2}\|y^{*}(x_{t})-y_{t}\|^{2}\Big]
\overset{(i)}{\leq} \frac{1}{T}\sum_{t=0}^{T-1}\Big[\mathcal{O}(\kappa^{6})\|x_{t+1}-x_{t}\|^{2}+2L^{2}\|y^{*}(x_{t})-y_{t}\|^{2}\Big]
\leq \frac{\mathcal{O}(\kappa^{6})}{T}\sum_{t=0}^{T-1}\Big(2\|x_{t+1}-x_{t}\|^{2}+\frac{1}{4\kappa^{2}}\|y_{t+1}-y^{*}(x_{t+1})\|^{2}\Big)
\overset{(ii)}{\leq} \frac{\mathcal{O}(\kappa^{6})}{T}\sum_{t=0}^{T-1}\big(H(z_{t})-H(z_{t+1})\big)
\leq \frac{\mathcal{O}(\kappa^{6})}{T}\big(H(z_{0})-H(z_{T})\big), \qquad (17)

where (i) uses the maximum possible stepsize $\eta_{x}=\frac{1}{\kappa^{3}(L+3)^{2}}=\mathcal{O}(\kappa^{-3})$ in [5]. Therefore, to achieve $\min_{0\leq t\leq T}\|G(x_{t})\|\leq\epsilon$, the computational complexity has the order $T=\mathcal{O}(\kappa^{6}\epsilon^{-2})$.
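As a numerical sanity check of the descent property (16), one can run simultaneous GDA on a toy objective where $y^{*}(x)$ is known in closed form and monitor $H(z_{t})$. The sketch below uses $f(x,y)=x^{\top}y-\frac{\mu}{2}\|y\|^{2}$ with $g\equiv 0$ (our toy choices, giving $y^{*}(x)=x/\mu$ and $\Phi(x)=\|x\|^{2}/(2\mu)$); whether $H$ decreases monotonically depends on the stepsize conditions of [5], so this is an illustrative monitor rather than a verification of the proposition.

import numpy as np

# Toy instance: f(x, y) = x @ y - (mu/2)*||y||^2 and g = 0, so that
# y*(x) = x/mu and Phi(x) = ||x||^2/(2*mu) are known in closed form.
mu, L = 0.5, 1.0                  # strong-concavity / smoothness parameters
kappa = L / mu
eta_x, eta_y = 1e-3, 1e-2         # conservatively small stepsizes

x, y = np.ones(5), np.zeros(5)
for t in range(1000):
    H = x @ x / (2 * mu) \
        + (1 - 1 / (4 * kappa**2)) * np.linalg.norm(y - x / mu)**2
    gx, gy = y, x - mu * y        # grad_x f(x, y) and grad_y f(x, y)
    x, y = x - eta_x * gx, y + eta_y * gy   # simultaneous GDA step
    # The Lyapunov value H computed above should trend downward in t.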

E-B Derivation of complexity in [17]

The mirror descent ascent algorithm (Algorithm 1) in [17] updates the variables $x$ and $y$ simultaneously, using a proximal mirror descent step on $x$ and a momentum-accelerated mirror ascent step on $y$, respectively. Specifically, with the Bregman functions $\psi_{t}(x):=\frac{1}{2}\|x\|^{2}$ and $\phi_{t}(y):=\frac{1}{2}\|y\|^{2}$, which are both $\rho=1$-strongly convex, this algorithm reduces to proximal GDA with momentum on the $y$ variable.

Substituting $\rho=1$ into Theorem 2, which provides the convergence rate for deterministic minimax optimization (i.e., there are no stochastic samples in the objective function), we obtain the following hyperparameter choices: $\eta=\mathcal{O}(1)$, $L=L_{f}(1+\kappa)=\mathcal{O}(L_{f}\kappa)$ ($L_{f}$ in [17] has the same meaning as our $L$, the Lipschitz parameter of $\nabla f$), $\lambda=\mathcal{O}(L_{f}^{-1})$, and $\gamma=\mathcal{O}\big[\min\big(L^{-1},\frac{\mu/L_{f}}{\kappa^{2}},\frac{\mu/L_{f}}{L_{f}^{2}}\big)\big]=\mathcal{O}\big[\min(L_{f}^{-1}\kappa^{-1},\kappa^{-3},L_{f}^{-2}\kappa^{-1})\big]=\mathcal{O}(\kappa^{-3})$ (without loss of generality, we assume $\mu\leq 1$, which implies $\kappa=L_{f}/\mu\geq L_{f}$). Then the convergence rate (25) becomes

\min_{1\leq t\leq T}\|G(x_{t})\| \leq \frac{1}{T}\sum_{t=1}^{T}\|\mathcal{G}_{t}\| \leq \mathcal{O}\Bigg(\frac{\sqrt{\widetilde{F}(x_{1})-F^{*}}+\Delta_{1}}{\sqrt{T\gamma\rho}}\Bigg)
\overset{(i)}{=} \mathcal{O}\Bigg(\frac{\sqrt{\Phi(x_{1})+g(x_{1})-\inf_{x}\big(\Phi(x)+g(x)\big)}+\|y_{1}-y^{*}(x_{1})\|}{\sqrt{T\kappa^{-3}}}\Bigg),

where (i) uses the notations in [17] that $\mathcal{G}_{t}=G(x_{t})$, $\widetilde{F}(x)=\Phi(x)+g(x)$, $\Delta_{1}=\|y_{1}-y^{*}(x_{1})\|$, together with the above hyperparameter choices. Hence, to achieve $\min_{1\leq t\leq T}\|\mathcal{G}_{t}\|\leq\epsilon$, the required computational complexity is $T\geq\mathcal{O}\Big(\kappa^{3}\epsilon^{-2}\big(\Phi(x_{1})+g(x_{1})-\inf_{x}(\Phi(x)+g(x))+\|y_{1}-y^{*}(x_{1})\|^{2}\big)\Big)$. In Table I, we only keep the dependence of $T(\epsilon)$ on $\epsilon\approx 0$ and $\kappa\gg 1$, which yields $\mathcal{O}(\kappa^{3}\epsilon^{-2})$.
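For concreteness, with the Euclidean Bregman functions above, one iteration of this scheme amounts to a proximal gradient descent step on $x$ and a heavy-ball-style ascent step on $y$. The Python sketch below is our Euclidean simplification; the exact extrapolation point and hyperparameter coupling follow [17] and may differ in detail.

import numpy as np

def prox_gda_momentum_step(x, y, y_prev, grad1_f, grad2_f, prox_g,
                           gamma, lam, beta):
    # Simultaneous update: both steps read the current pair (x, y).
    x_next = prox_g(x - gamma * grad1_f(x, y), gamma)  # proximal descent on x
    y_tilde = y + beta * (y - y_prev)                  # heavy-ball extrapolation
    y_next = y_tilde + lam * grad2_f(x, y)             # momentum ascent on y
    return x_next, y_next, y                           # y becomes the new y_prev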

E-C Derivation of complexity in [42]

[42] aims to solve the following minimax optimization problem:

\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y),

where $\mathcal{X}$ and $\mathcal{Y}$ are nonempty closed convex sets and $\mathcal{Y}$ is also compact. The following AltGDA algorithm with projection mappings $\mathcal{P}_{\mathcal{X}}$ and $\mathcal{P}_{\mathcal{Y}}$ is analyzed in the nonconvex-strongly-concave setting, where $f$ is $L$-smooth (in Assumption 2.1 of [42], we set all the Lipschitz smoothness parameters $L_{x}=L_{y}=L_{12}=L_{21}:=L$ for simplicity) and $f(x,\cdot)$ is $\mu$-strongly concave for any $x\in\mathcal{X}$.

x_{k+1}=\mathcal{P}_{\mathcal{X}}\big(x_{k}-\eta^{-1}\nabla_{x}f(x_{k},y_{k})\big)
y_{k+1}=\mathcal{P}_{\mathcal{Y}}\big(y_{k}+\rho\nabla_{y}f(x_{k+1},y_{k})\big) \qquad (22)
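In code, one iteration of update (22) could read as follows; the projection operators are left abstract, and the example projections in the final comment are ours.

import numpy as np

def altgda_step(x, y, grad_x_f, grad_y_f, proj_X, proj_Y, inv_eta, rho):
    # Projected alternating GDA, cf. (22): the x-step uses the current y,
    # while the y-step already uses the new iterate x_{k+1}.
    x_next = proj_X(x - inv_eta * grad_x_f(x, y))
    y_next = proj_Y(y + rho * grad_y_f(x_next, y))
    return x_next, y_next

# Example: proj_X = lambda v: v (X = R^m);
# proj_Y = lambda v: np.clip(v, -1.0, 1.0) (a box constraint).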

Using the largest possible stepsizes $\eta^{-1}=\mathcal{O}(L^{-1}\kappa^{-3})$ and $\rho=\mathcal{O}(\mu L^{-2})=\mathcal{O}(L^{-1}\kappa^{-1})$ that satisfy eq. (3.18), the key quantities in Theorem 3.1 can be computed as follows:

d_{1} = \mathcal{O}(L^{-1}\kappa^{-5}),
F_{1}-\underline{F} = f(x_{1},y_{1})-\min_{x\in\mathcal{X},y\in\mathcal{Y}}f(x,y)+\mathcal{O}(L\kappa^{3}\sigma_{y}^{2}),

where $\sigma_{y}$ is the diameter of the compact set $\mathcal{Y}$. Hence, the number of iterations (also the order of the computational complexity) required to achieve $\|\nabla_{x}f(x_{k},y_{k})\|\leq\epsilon$ and $\frac{1}{\rho}\big\|y_{k}-\mathcal{P}_{\mathcal{Y}}\big(y_{k}+\rho\nabla_{y}f(x_{k},y_{k})\big)\big\|\leq\epsilon$ (note that this does not necessarily yield an approximate critical point of $\Phi(x):=\max_{y\in\mathcal{Y}}f(x,y)$) is

T(\epsilon)=\mathcal{O}\Big(\frac{F_{1}-\underline{F}}{d_{1}\epsilon^{2}}\Big)=\mathcal{O}\Big(\epsilon^{-2}L\kappa^{5}\big(f_{1}-\min_{x\in\mathcal{X},y\in\mathcal{Y}}f(x,y)+32L\kappa^{3}\sigma_{y}^{2}\big)\Big).

In Table I, we only keep the dependence of $T(\epsilon)$ on $\epsilon\approx 0$ and $\kappa\gg 1$, which yields $\mathcal{O}(\kappa^{5}\epsilon^{-2})$.