
An accelerated first-order regularized momentum descent ascent algorithm for stochastic nonconvex-concave minimax problems

Huiling Zhang [email protected], Zi Xu$^{1,2}$

$^{1}$ Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, P.R. China
$^{2}$ Newtouch Center for Mathematics of Shanghai University, Shanghai University, Shanghai 200444, P.R. China
Abstract

Stochastic nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose an accelerated first-order regularized momentum descent ascent algorithm (FORMDA) for solving stochastic nonconvex-concave minimax problems. The iteration complexity of the algorithm is proved to be $\tilde{\mathcal{O}}(\varepsilon^{-6.5})$ to obtain an $\varepsilon$-stationary point, which achieves the best-known complexity bound for single-loop algorithms for solving stochastic nonconvex-concave minimax problems under the stationarity of the objective function.

keywords:
stochastic nonconvex minimax problem, accelerated momentum projection gradient algorithm, iteration complexity, machine learning

1 Introduction

Consider the following stochastic nonconvex-concave minimax optimization problem:

\[
\min\limits_{x\in\mathcal{X}}\max\limits_{y\in\mathcal{Y}}\; g(x,y)=\mathbb{E}_{\zeta\sim D}[G(x,y,\zeta)], \qquad (1.1)
\]

where $\mathcal{X}\subseteq\mathbb{R}^{d_{x}}$ and $\mathcal{Y}\subseteq\mathbb{R}^{d_{y}}$ are nonempty convex and compact sets, $G(x,y,\zeta):\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$ is a smooth function, possibly nonconvex in $x$ and concave in $y$, $\zeta$ is a random variable following an unknown distribution $D$, and $\mathbb{E}$ denotes the expectation. We focus on single-loop first-order algorithms for solving problem (1.1). This problem has attracted increasing attention in machine learning, signal processing, and many other research fields in recent years, e.g., distributed nonconvex optimization Giannakis ; Liao ; Mateos , wireless systems Chen20 , statistical learning Abadeh ; Giordano , etc.

There are few existing first-order methods for solving stochastic nonconvex-concave minimax optimization problems. Rafique et al. Rafique proposed a proximally guided stochastic mirror descent method (PG-SMD), which updates $x$ and $y$ simultaneously, and its iteration complexity has been proved to be $\tilde{\mathcal{O}}(\varepsilon^{-6})$ to reach an approximate $\varepsilon$-stationary point of $\Phi(\cdot)=\max_{y\in\mathcal{Y}}f(\cdot,y)$. Zhang et al. Zhang2022 proposed SAPD+, which achieves an oracle complexity of $\mathcal{O}(\varepsilon^{-6})$ for solving deterministic nonconvex-concave minimax problems. Both algorithms are multi-loop algorithms. On the other hand, there are few existing single-loop algorithms for solving stochastic nonconvex-concave minimax problems. Lin et al. Lin2019 proposed a stochastic gradient descent ascent (SGDA) method and Bot et al. Bot proposed a stochastic alternating GDA; both require $\mathcal{O}(\varepsilon^{-8})$ stochastic gradient evaluations to obtain an approximate $\varepsilon$-stationary point of $\Phi(\cdot)=\max_{y\in\mathcal{Y}}f(\cdot,y)$. Boroun et al. Boroun proposed a novel single-loop stochastic primal-dual algorithm with momentum (SPDM) for a special class of nonconvex-concave minimax problems, i.e., those whose objective function satisfies the Polyak-Łojasiewicz (PL) condition with respect to $x$, and its iteration complexity is proved to be $\mathcal{O}(\varepsilon^{-4})$.

There are some existing first-order methods for solving stochastic nonconvex-strongly concave minimax optimization problems. Lin et al. Lin2019 proposed a stochastic gradient descent ascent (SGDA) method which requires $\mathcal{O}(\kappa^{3}\varepsilon^{-4})$ stochastic gradient evaluations to get an $\varepsilon$-stationary point. Luo et al. luo2020stochastic proposed a stochastic recursive gradient descent ascent (SREDA) algorithm, which requires a total of $\mathcal{O}(\kappa^{3}\varepsilon^{-3})$ stochastic gradient evaluations. Huang et al. Huang proposed an accelerated first-order momentum descent ascent (Acc-MDA) method whose total number of stochastic gradient evaluations is $\tilde{\mathcal{O}}(\kappa^{4.5}\varepsilon^{-3})$.

There are also some zeroth-order methods for stochastic nonconvex-strongly concave minimax optimization problems. Liu et al. Liu proposed a ZO-Min-Max algorithm which needs $\mathcal{O}(\kappa^{6}\varepsilon^{-6})$ function evaluations to obtain an $\varepsilon$-stationary point. Wang et al. Wang proposed a zeroth-order gradient descent multi-step ascent (ZO-GDMSA) algorithm with a proved function evaluation complexity of $\tilde{\mathcal{O}}(\kappa^{2}\varepsilon^{-4})$. Luo et al. luo2020stochastic proposed a stochastic recursive gradient descent ascent (SREDA) algorithm whose function evaluation complexity is $\mathcal{O}(\kappa^{3}\varepsilon^{-3})$. Xu et al. Xu20 proposed a zeroth-order variance reduced gradient descent ascent (ZO-VRGDA) algorithm with $\mathcal{O}(\kappa^{3}\varepsilon^{-3})$ function evaluation complexity. Huang et al. Huang proposed an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method with a function evaluation complexity of $\tilde{\mathcal{O}}(\kappa^{4.5}\varepsilon^{-3})$.

There are few existing methods for solving stochastic nonconvex-nonconcave minimax optimization problems. Yang et al. Yang22 proposed a smoothed GDA algorithm for nonconvex-PL minimax problems with an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$. Xu et al. xu22zeroth proposed a zeroth-order variance reduced alternating gradient descent ascent (ZO-VRAGDA) algorithm for solving nonconvex-PL minimax problems, which attains a total complexity of $\mathcal{O}(\varepsilon^{-3})$. Huang Huang23 proposed two enhanced momentum-based gradient descent ascent methods for nonconvex-PL minimax problems, which achieve an iteration complexity of $\tilde{\mathcal{O}}(\varepsilon^{-3})$.

In this paper, we propose an accelerated first-order regularized momentum descent ascent algorithm (FORMDA) for solving stochastic nonconvex-concave minimax problems. The iteration complexity of the algorithm is proved to be $\tilde{\mathcal{O}}(\varepsilon^{-6.5})$ to obtain an $\varepsilon$-stationary point.

1.1 Related Works.

We give a brief review of algorithms for solving deterministic minimax optimization problems. There are many existing works on solving convex-concave minimax optimization problems. Since we focus on nonconvex minimax problems, we do not attempt to survey this literature here and refer to Chen ; Lan2016 ; Mokhtari ; Nemirovski ; Nesterov2007 ; Ou15 ; Ou21 ; Tom ; Zhang22 for more details.

For deterministic nonconvex-concave minimax problems, there are two types of algorithms, i.e., multi-loop algorithms and single-loop algorithms. One intensively studied type is nested-loop algorithms. Various algorithms of this type have been proposed in Rafique ; nouiehed2019solving ; Thek19 ; Kong ; ostrovskii2020efficient . Lin et al. lin2020near proposed a class of accelerated algorithms for smooth nonconvex-concave minimax problems with the complexity bound of $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$, which is the best iteration complexity so far. On the other hand, fewer studies focus on single-loop algorithms for nonconvex-concave minimax problems. One typical method is the gradient descent-ascent (GDA) method, which performs a gradient descent step on $x$ and a gradient ascent step on $y$ simultaneously at each iteration. Jin et al. Lin2019 proposed a GDmax algorithm with an iteration complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$. Lu et al. Lu proposed a hybrid block successive approximation (HiBSA) algorithm, which can obtain an $\varepsilon$-stationary point of $f(x,y)$ in $\tilde{\mathcal{O}}(\varepsilon^{-4})$ iterations. Pan et al. Pan proposed a new alternating gradient projection algorithm for nonconvex-linear minimax problems with an iteration complexity of $\mathcal{O}(\varepsilon^{-3})$. Xu et al. Xu proposed a unified single-loop alternating gradient projection (AGP) algorithm for solving nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems, which can find an $\varepsilon$-stationary point with a gradient complexity of $\mathcal{O}(\varepsilon^{-4})$. Zhang et al. Zhang proposed a smoothed GDA algorithm which achieves an $\mathcal{O}(\varepsilon^{-4})$ iteration complexity for general nonconvex-concave minimax problems.

Few algorithms have been proposed for solving more general deterministic nonconvex-nonconcave minimax problems. Sanjabi et al. Sanjabi18 proposed a multi-step gradient descent-ascent algorithm which can find an $\varepsilon$-first-order Nash equilibrium in $\mathcal{O}(\varepsilon^{-2}\log\varepsilon^{-1})$ iterations when a one-sided PL condition is satisfied. Yang et al. Yang showed that the alternating GDA algorithm converges globally at a linear rate for a subclass of nonconvex-nonconcave objective functions satisfying a two-sided PL inequality. Song et al. SongOptDE proposed an optimistic dual extrapolation (OptDE) method with an iteration complexity of $\mathcal{O}(\varepsilon^{-2})$ if a weak solution exists. Hajizadeh et al. Hajizadeh showed that a damped version of the extra-gradient method converges linearly to an $\varepsilon$-stationary point of nonconvex-nonconcave minimax problems for which the nonnegative interaction dominance condition is satisfied. Xu et al. xu22zeroth proposed a zeroth-order alternating gradient descent ascent (ZO-AGDA) algorithm for solving NC-PL minimax problems with an iteration complexity of $\mathcal{O}(\varepsilon^{-2})$. For more related results, we refer to Bohm ; cai2022accelerated ; DiakonikolasEG ; Doan ; Grimmer ; Jiang ; Lee .

Notations For vectors, we use $\|\cdot\|$ to represent the Euclidean norm and its induced matrix norm; $\left\langle x,y\right\rangle$ denotes the inner product of two vectors $x$ and $y$. We use $\nabla_{x}f(x,y)$ (or $\nabla_{y}f(x,y)$) to denote the partial derivative of $f(x,y)$ with respect to $x$ (or $y$) at the point $(x,y)$, respectively. We use the notation $\mathcal{O}(\cdot)$ to hide only absolute constants which do not depend on any problem parameter, and the $\tilde{\mathcal{O}}(\cdot)$ notation to hide only absolute constants and logarithmic factors. A continuously differentiable function $f(\cdot)$ is called $L$-smooth if there exists a constant $L>0$ such that for any given $x,y\in\mathcal{X}$,

\[
\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|.
\]

A continuously differentiable function $f(\cdot)$ is called $\mu$-strongly concave if there exists a constant $\mu>0$ such that for any $x,y\in\mathcal{X}$,

\[
f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle-\frac{\mu}{2}\|y-x\|^{2}.
\]

The paper is organized as follows. In Section 2, an accelerated first-order momentum projection gradient algorithm is proposed for stochastic nonconvex-concave minimax problems, and its iteration complexity is also established. Numerical results are presented in Section 3 to show the efficiency of the proposed algorithm. Some conclusions are made in the last section.

2 An Accelerated First-order Regularized Momentum Descent Ascent Algorithm

In this section, based on the framework of the Acc-MDA algorithm Huang , we propose an accelerated first-order regularized momentum descent ascent algorithm (FORMDA) for solving problem (1.1). At the $k$th iteration of FORMDA, we consider a regularized version of $g(x,y)$, i.e.,

\[
g_{k}(x,y)=g(x,y)-\frac{\rho_{k}}{2}\|y\|^{2}, \qquad (2.1)
\]

where $\rho_{k}\geq 0$ is a regularization parameter. Compared to the Acc-MDA algorithm, the main difference in the FORMDA algorithm is that the gradient of $g_{k}(x,y)$, instead of that of $g(x,y)$, is computed and used at each iteration. More precisely, at the $k$th iteration, for a given mini-batch $I=\{\zeta_{1},\cdots,\zeta_{b}\}$ drawn i.i.d. from the unknown distribution, denoting

\[
\tilde{G}_{k}(x,y;\zeta_{j})=G(x,y;\zeta_{j})-\frac{\rho_{k}}{2}\|y\|^{2}, \qquad (2.2)
\]

we compute the gradient of the stochastic function $\tilde{G}_{k}(x,y;I)$ as follows,

\[
\nabla_{x}\tilde{G}_{k}(x,y;I)=\frac{1}{b}\sum_{j=1}^{b}\nabla_{x}\tilde{G}_{k}(x,y;\zeta_{j}), \qquad (2.3)
\]
\[
\nabla_{y}\tilde{G}_{k}(x,y;I)=\frac{1}{b}\sum_{j=1}^{b}\nabla_{y}\tilde{G}_{k}(x,y;\zeta_{j}). \qquad (2.4)
\]

Then, based on $\nabla_{x}\tilde{G}_{k}(x,y;I)$ and $\nabla_{y}\tilde{G}_{k}(x,y;I)$, we compute the variance-reduced stochastic gradients $v_{k}$ and $w_{k}$ as shown in (2.5) and (2.6), respectively, with $0<\gamma_{k}\leq 1$ and $0<\theta_{k}\leq 1$ to be specified later. We then update $x_{k}$ and $y_{k}$ through alternating stochastic gradient projection with the momentum technique shown in (2.7)-(2.10), which is similar to that in the Acc-MDA algorithm. The proposed FORMDA algorithm is formally stated in Algorithm 1.

Algorithm 1 An Accelerated First-order Regularized Momentum Descent Ascent Algorithm (FORMDA)
Step 1: Input $x_{1},y_{1},\lambda_{1},\alpha_{1},\beta$, $0<\eta_{1}\leq 1$, $b$; $\gamma_{0}=1$, $\theta_{0}=1$. Set $k=1$.
Step 2: Draw a mini-batch of samples $I_{k}=\{\zeta_{i}^{k}\}_{i=1}^{b}$. Compute
\[
v_{k}=\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};I_{k})+(1-\gamma_{k-1})[v_{k-1}-\nabla_{x}\tilde{G}_{k-1}(x_{k-1},y_{k-1};I_{k})] \qquad (2.5)
\]
and
\[
w_{k}=\nabla_{y}\tilde{G}_{k}(x_{k},y_{k};I_{k})+(1-\theta_{k-1})[w_{k-1}-\nabla_{y}\tilde{G}_{k-1}(x_{k-1},y_{k-1};I_{k})] \qquad (2.6)
\]
where $\nabla_{x}\tilde{G}_{k}(x,y;I)$, $\nabla_{x}\tilde{G}_{k-1}(x,y;I)$ and $\nabla_{y}\tilde{G}_{k}(x,y;I)$, $\nabla_{y}\tilde{G}_{k-1}(x,y;I)$ are defined as in (2.3) and (2.4).
Step 3: Perform the following updates for $x_{k}$ and $y_{k}$:
\[
\tilde{x}_{k+1}=\mathcal{P}_{\mathcal{X}}^{1/\alpha_{k}}\left(x_{k}-\alpha_{k}v_{k}\right), \qquad (2.7)
\]
\[
x_{k+1}=x_{k}+\eta_{k}(\tilde{x}_{k+1}-x_{k}), \qquad (2.8)
\]
\[
\tilde{y}_{k+1}=\mathcal{P}_{\mathcal{Y}}^{1/\beta}\left(y_{k}+\beta w_{k}\right), \qquad (2.9)
\]
\[
y_{k+1}=y_{k}+\eta_{k}(\tilde{y}_{k+1}-y_{k}). \qquad (2.10)
\]
Step 4: If some stationarity condition is satisfied, stop; otherwise, set $k=k+1$ and go to Step 2.
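
To make the update rule concrete, the following is a minimal NumPy-style sketch of one FORMDA iteration. The callables grad_fn (returning the mini-batch gradients of the unregularized $G$), proj_X and proj_Y (Euclidean projections onto $\mathcal{X}$ and $\mathcal{Y}$) are hypothetical names introduced here for illustration and are not part of the paper.

    import numpy as np

    def formda_step(x, y, x_prev, y_prev, v_prev, w_prev, grad_fn, batch,
                    rho_k, rho_prev, alpha_k, beta, eta_k, gamma_prev, theta_prev,
                    proj_X, proj_Y):
        """One FORMDA iteration following (2.5)-(2.10) (illustrative sketch)."""
        # Mini-batch gradients of the regularized function G~_k at (x_k, y_k), cf. (2.2)-(2.4);
        # the regularizer only affects the y-gradient.
        gx, gy = grad_fn(x, y, batch)
        gy = gy - rho_k * y
        # ... and of G~_{k-1} at (x_{k-1}, y_{k-1}), evaluated on the same batch I_k.
        gx_old, gy_old = grad_fn(x_prev, y_prev, batch)
        gy_old = gy_old - rho_prev * y_prev

        # Momentum-based variance-reduced estimators (2.5)-(2.6).
        v = gx + (1.0 - gamma_prev) * (v_prev - gx_old)
        w = gy + (1.0 - theta_prev) * (w_prev - gy_old)

        # Projected gradient steps followed by momentum averaging (2.7)-(2.10).
        x_tilde = proj_X(x - alpha_k * v)
        y_tilde = proj_Y(y + beta * w)
        x_new = x + eta_k * (x_tilde - x)
        y_new = y + eta_k * (y_tilde - y)
        return x_new, y_new, v, w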

Note that the FORMDA algorithm differs from the Acc-MDA algorithm proposed in Huang in two ways. On the one hand, at each iteration of FORMDA, the gradient of $\tilde{G}_{k}(x,y;\zeta)$, a regularized version of $G(x,y;\zeta)$, is computed and used, instead of the gradient of $G(x,y;\zeta)$ used in the Acc-MDA algorithm. On the other hand, the FORMDA algorithm is designed to solve general nonconvex-concave minimax problems, whereas the Acc-MDA algorithm can only solve nonconvex-strongly concave minimax problems.

Before we prove the iteration complexity of Algorithm 1, we first give some mild assumptions.

Assumption 2.1.

For any given $\zeta$, $G(x,y,\zeta)$ has Lipschitz continuous gradients, i.e., there exists a constant $l>0$ such that for any $x,x_{1},x_{2}\in\mathcal{X}$ and $y,y_{1},y_{2}\in\mathcal{Y}$, we have

\[
\begin{aligned}
\|\nabla_{x}G(x_{1},y,\zeta)-\nabla_{x}G(x_{2},y,\zeta)\| &\leq l\|x_{1}-x_{2}\|,\\
\|\nabla_{x}G(x,y_{1},\zeta)-\nabla_{x}G(x,y_{2},\zeta)\| &\leq l\|y_{1}-y_{2}\|,\\
\|\nabla_{y}G(x,y_{1},\zeta)-\nabla_{y}G(x,y_{2},\zeta)\| &\leq l\|y_{1}-y_{2}\|,\\
\|\nabla_{y}G(x_{1},y,\zeta)-\nabla_{y}G(x_{2},y,\zeta)\| &\leq l\|x_{1}-x_{2}\|.
\end{aligned}
\]
Assumption 2.2.

$\{\rho_{k}\}$ is a nonnegative monotonically decreasing sequence.

If Assumption 2.1 holds, then $g(x,y)$ has Lipschitz continuous gradients with constant $l$ by Lemma 7 in xu22zeroth . Then, by the definition of $g_{k}(x,y)$ and Assumption 2.2, we know that $g_{k}(x,y)$ has Lipschitz continuous gradients with constant $L$, where $L=l+\rho_{1}$.
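
For completeness, the constant $L=l+\rho_{1}$ can be verified directly from (2.1): the regularizer only perturbs the $y$-gradient, so for any $z_{1}=(x_{1},y_{1})$ and $z_{2}=(x_{2},y_{2})$,

\[
\|\nabla g_{k}(z_{1})-\nabla g_{k}(z_{2})\|\leq\|\nabla g(z_{1})-\nabla g(z_{2})\|+\rho_{k}\|y_{1}-y_{2}\|\leq(l+\rho_{1})\|z_{1}-z_{2}\|,
\]

where the last step uses the $l$-smoothness of $g$ and $\rho_{k}\leq\rho_{1}$ from Assumption 2.2.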

Assumption 2.3.

For any given $\zeta$, there exists a constant $\delta>0$ such that for all $x$ and $y$, we have

\[
\begin{aligned}
\mathbb{E}[\|\nabla_{x}G(x,y,\zeta)-\nabla_{x}g(x,y)\|^{2}] &\leq\delta^{2},\\
\mathbb{E}[\|\nabla_{y}G(x,y,\zeta)-\nabla_{y}g(x,y)\|^{2}] &\leq\delta^{2}.
\end{aligned}
\]

By Assumption 2.3, we can immediately get that

\[
\begin{aligned}
\mathbb{E}[\|\nabla_{x}\tilde{G}_{k}(x,y;I)-\nabla_{x}g_{k}(x,y)\|^{2}] &\leq\frac{\delta^{2}}{b}, \qquad (2.11)\\
\mathbb{E}[\|\nabla_{y}\tilde{G}_{k}(x,y;I)-\nabla_{y}g_{k}(x,y)\|^{2}] &\leq\frac{\delta^{2}}{b}. \qquad (2.12)
\end{aligned}
\]
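
These bounds follow from the standard mini-batch variance argument: the regularization term cancels in the difference, so $\nabla_{x}\tilde{G}_{k}-\nabla_{x}g_{k}=\nabla_{x}G-\nabla_{x}g$ (and the same cancellation occurs in $y$), and for $b$ i.i.d. samples,

\[
\mathbb{E}\Big[\Big\|\tfrac{1}{b}\sum_{j=1}^{b}\big(\nabla_{x}G(x,y,\zeta_{j})-\nabla_{x}g(x,y)\big)\Big\|^{2}\Big]
=\tfrac{1}{b^{2}}\sum_{j=1}^{b}\mathbb{E}\big[\|\nabla_{x}G(x,y,\zeta_{j})-\nabla_{x}g(x,y)\|^{2}\big]\leq\tfrac{\delta^{2}}{b},
\]

where the cross terms vanish because each summand has zero mean.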

We define the stationarity gap as the termination criterion as follows.

Definition 2.1.

For some given $\alpha_{k}>0$ and $\beta>0$, the stationarity gap for problem (1.1) is defined as

\[
\nabla\mathcal{G}^{\alpha_{k},\beta}(x,y):=\left(\begin{array}{c}\frac{1}{\alpha_{k}}\left(x-\mathcal{P}_{\mathcal{X}}^{\frac{1}{\alpha_{k}}}\left(x-\alpha_{k}\nabla_{x}g(x,y)\right)\right)\\ \frac{1}{\beta}\left(y-\mathcal{P}_{\mathcal{Y}}^{\frac{1}{\beta}}\left(y+\beta\nabla_{y}g(x,y)\right)\right)\end{array}\right).
\]
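
As an illustration, when $\mathcal{X}$ and $\mathcal{Y}$ are boxes the two projections reduce to componentwise clipping, and the gap can be evaluated as in the following sketch (the function names and the box assumption are ours, not part of the paper):

    import numpy as np

    def proj_box(z, lo, hi):
        # Euclidean projection onto the box [lo, hi] (componentwise clipping).
        return np.clip(z, lo, hi)

    def stationarity_gap(x, y, grad_x, grad_y, alpha_k, beta, x_lo, x_hi, y_lo, y_hi):
        # Definition 2.1 specialized to box constraints: stack the scaled
        # projected-gradient residuals in x (descent) and y (ascent).
        gap_x = (x - proj_box(x - alpha_k * grad_x, x_lo, x_hi)) / alpha_k
        gap_y = (y - proj_box(y + beta * grad_y, y_lo, y_hi)) / beta
        return np.sqrt(np.linalg.norm(gap_x) ** 2 + np.linalg.norm(gap_y) ** 2)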

2.1 Complexity analysis.

In this subsection, we prove the iteration complexity of Algorithm 1. When strong concavity of $g(x,y)$ with respect to $y$ is absent, $y^{*}(x)=\arg\max_{y\in\mathcal{Y}}g(x,y)$ is not necessarily unique, and $\Phi(x)=\max_{y\in\mathcal{Y}}g(x,y)$ is not necessarily smooth. To address these challenges, we analyze the iteration complexity of the FORMDA algorithm by utilizing $\Phi_{k}(x)$ and $y_{k}^{*}(x)$, where

\[
\Phi_{k}(x):=\max_{y\in\mathcal{Y}}g_{k}(x,y), \qquad (2.13)
\]
\[
y_{k}^{*}(x):=\arg\max_{y\in\mathcal{Y}}g_{k}(x,y). \qquad (2.14)
\]

By Lemma 24 in nouiehed2019solving , $\Phi_{k}(x)$ is $L_{\Phi_{k}}$-Lipschitz smooth with $L_{\Phi_{k}}=L+\frac{L^{2}}{\rho_{k}}$ under the $\rho_{k}$-strong concavity of $g_{k}(x,y)$. Moreover, similar to the proof of Lemma B.2 in lin2020near , we have $\nabla_{x}\Phi_{k}(x)=\nabla_{x}g_{k}(x,y_{k}^{*}(x))$.
We now analyze the iteration complexity of the FORMDA algorithm, mainly by constructing potential functions to obtain a recursion. First, we estimate an upper bound of $\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}$, which will be utilized in the subsequent construction of the potential function.

Lemma 2.1.

Suppose that Assumptions 2.1 and 2.2 hold. Then for any $x,\bar{x}\in\mathcal{X}$,

\[
\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}\leq\frac{L^{2}}{\rho_{k+1}^{2}}\|\bar{x}-x\|^{2}+\frac{\rho_{k}-\rho_{k+1}}{\rho_{k+1}}(\|y_{k+1}^{*}(\bar{x})\|^{2}-\|y_{k}^{*}(x)\|^{2}). \qquad (2.15)
\]

Proof: The optimality condition for $y_{k}^{*}(x)$ implies that $\forall y\in\mathcal{Y}$ and $\forall k\geq 1$,

\[
\langle\nabla_{y}g_{k+1}(\bar{x},y_{k+1}^{*}(\bar{x})),y-y_{k+1}^{*}(\bar{x})\rangle\leq 0, \qquad (2.16)
\]
\[
\langle\nabla_{y}g_{k}(x,y_{k}^{*}(x)),y-y_{k}^{*}(x)\rangle\leq 0. \qquad (2.17)
\]

Setting $y=y_{k}^{*}(x)$ in (2.16) and $y=y_{k+1}^{*}(\bar{x})$ in (2.17), adding these two inequalities and using the strong concavity of $g_{k}(x,y)$ with respect to $y$, we have

\[
\begin{aligned}
&\langle\nabla_{y}g_{k+1}(\bar{x},y_{k+1}^{*}(\bar{x}))-\nabla_{y}g_{k}(x,y_{k+1}^{*}(\bar{x})),y_{k}^{*}(x)-y_{k+1}^{*}(\bar{x})\rangle\\
\leq{}&\langle\nabla_{y}g_{k}(x,y_{k+1}^{*}(\bar{x}))-\nabla_{y}g_{k}(x,y_{k}^{*}(x)),y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\rangle\\
\leq{}&-\rho_{k}\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}. \qquad (2.18)
\end{aligned}
\]

By the definition of $g_{k}(x,y)$, the Cauchy-Schwarz inequality and Assumption 2.1, (2.18) implies that

\[
\begin{aligned}
&(\rho_{k}-\rho_{k+1})\langle y_{k+1}^{*}(\bar{x}),y_{k}^{*}(x)-y_{k+1}^{*}(\bar{x})\rangle\\
\leq{}&\langle\nabla_{y}g(\bar{x},y_{k+1}^{*}(\bar{x}))-\nabla_{y}g(x,y_{k+1}^{*}(\bar{x})),y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\rangle-\rho_{k}\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}\\
\leq{}&\frac{L^{2}}{2\rho_{k}}\|\bar{x}-x\|^{2}-\frac{\rho_{k}}{2}\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}. \qquad (2.19)
\end{aligned}
\]

Since $\langle y_{k+1}^{*}(\bar{x}),y_{k}^{*}(x)-y_{k+1}^{*}(\bar{x})\rangle=\frac{1}{2}(\|y_{k}^{*}(x)\|^{2}-\|y_{k+1}^{*}(\bar{x})\|^{2}-\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2})$ and by Assumption 2.2, (2.19) implies that

\[
\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}\leq\frac{L^{2}}{\rho_{k+1}^{2}}\|\bar{x}-x\|^{2}+\frac{\rho_{k}-\rho_{k+1}}{\rho_{k+1}}(\|y_{k+1}^{*}(\bar{x})\|^{2}-\|y_{k}^{*}(x)\|^{2}).
\]

The proof is completed.

Next, we provide an estimate of the difference between $\Phi_{k+1}(x_{k+1})$ and $\Phi_{k}(x_{k})$.

Lemma 2.2.

Suppose that Assumptions 2.1 and 2.2 hold. Let $\{(x_{k},y_{k})\}$ be a sequence generated by Algorithm 1. If $\rho_{k}\leq L$, then $\forall k\geq 1$,

\[
\begin{aligned}
\Phi_{k+1}(x_{k+1})-\Phi_{k}(x_{k})\leq{}& 2\eta_{k}\alpha_{k}L^{2}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+2\eta_{k}\alpha_{k}\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}\\
&-\Big(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}}\Big)\|\tilde{x}_{k+1}-x_{k}\|^{2}+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}, \qquad (2.20)
\end{aligned}
\]

where $\sigma_{y}=\max\{\|y\|\mid y\in\mathcal{Y}\}$.

Proof: Since $\Phi_{k}(x)$ is $L_{\Phi_{k}}$-smooth with respect to $x$ and $\rho_{k}\leq L$, we have that

\[
\begin{aligned}
&\Phi_{k}(x_{k+1})-\Phi_{k}(x_{k})\\
\leq{}&\langle\nabla_{x}\Phi_{k}(x_{k}),x_{k+1}-x_{k}\rangle+\frac{L_{\Phi_{k}}}{2}\|x_{k+1}-x_{k}\|^{2}\\
\leq{}&\eta_{k}\langle\nabla_{x}g_{k}(x_{k},y_{k}^{*}(x_{k}))-\nabla_{x}g_{k}(x_{k},y_{k}),\tilde{x}_{k+1}-x_{k}\rangle+\eta_{k}\langle v_{k},\tilde{x}_{k+1}-x_{k}\rangle\\
&+\eta_{k}\langle\nabla_{x}g_{k}(x_{k},y_{k})-v_{k},\tilde{x}_{k+1}-x_{k}\rangle+\frac{L^{2}\eta_{k}^{2}}{\rho_{k}}\|\tilde{x}_{k+1}-x_{k}\|^{2}. \qquad (2.21)
\end{aligned}
\]

Next, we estimate the first three terms on the right-hand side of (2.21). By the Cauchy-Schwarz inequality, we get

\[
\begin{aligned}
&\langle\nabla_{x}g_{k}(x_{k},y_{k}^{*}(x_{k}))-\nabla_{x}g_{k}(x_{k},y_{k}),\tilde{x}_{k+1}-x_{k}\rangle\\
\leq{}&2\alpha_{k}\|\nabla_{x}g_{k}(x_{k},y_{k}^{*}(x_{k}))-\nabla_{x}g_{k}(x_{k},y_{k})\|^{2}+\frac{1}{8\alpha_{k}}\|\tilde{x}_{k+1}-x_{k}\|^{2}\\
\leq{}&2\alpha_{k}L^{2}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+\frac{1}{8\alpha_{k}}\|\tilde{x}_{k+1}-x_{k}\|^{2}. \qquad (2.22)
\end{aligned}
\]

By the Cauchy-Schwarz inequality, we have

\[
\begin{aligned}
&\langle\nabla_{x}g_{k}(x_{k},y_{k})-v_{k},\tilde{x}_{k+1}-x_{k}\rangle\\
\leq{}&2\alpha_{k}\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}+\frac{1}{8\alpha_{k}}\|\tilde{x}_{k+1}-x_{k}\|^{2}. \qquad (2.23)
\end{aligned}
\]

The optimality condition for $x_{k}$ in (2.7) implies that $\forall x\in\mathcal{X}$ and $\forall k\geq 1$,

\[
\langle v_{k},\tilde{x}_{k+1}-x_{k}\rangle\leq-\frac{1}{\alpha_{k}}\|\tilde{x}_{k+1}-x_{k}\|^{2}. \qquad (2.24)
\]

Plugging (2.22), (2.23) and (2.24) into (2.21) and using $\rho_{k+1}\leq\rho_{k}$, we get

\[
\begin{aligned}
\Phi_{k}(x_{k+1})-\Phi_{k}(x_{k})\leq{}& 2\eta_{k}\alpha_{k}L^{2}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+2\eta_{k}\alpha_{k}\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}\\
&-\Big(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}}\Big)\|\tilde{x}_{k+1}-x_{k}\|^{2}. \qquad (2.25)
\end{aligned}
\]

On the other hand, we have

\[
\begin{aligned}
\Phi_{k+1}(x_{k+1})-\Phi_{k}(x_{k+1})\leq{}&\Phi_{k+1}(x_{k+1})-g_{k}(x_{k+1},y_{k+1}^{*}(x_{k+1}))\\
={}&\frac{\rho_{k}-\rho_{k+1}}{2}\|y_{k+1}^{*}(x_{k+1})\|^{2}\leq\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}. \qquad (2.26)
\end{aligned}
\]

Combining (2.25) and (2.26), we complete the proof.

We further estimate upper bounds for the first two terms on the right-hand side of (2.20) in Lemma 2.2.

Lemma 2.3.

Suppose that Assumptions 2.1 and 2.2 hold. Let $\{(x_{k},y_{k})\}$ be a sequence generated by Algorithm 1. If $0<\eta_{k}\leq 1$, $0<\beta\leq\frac{1}{6L}$ and $\rho_{k}\leq L$, then $\forall k\geq 1$,

\[
\begin{aligned}
&\|y_{k+1}-y_{k+1}^{*}(x_{k+1})\|^{2}\\
\leq{}&\Big(1-\frac{\eta_{k}\beta\rho_{k+1}}{4}\Big)\|y_{k}-y_{k}^{*}(x_{k})\|^{2}-\frac{3\eta_{k}}{4}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{5L^{2}\eta_{k}}{\rho_{k+1}^{3}\beta}\|\tilde{x}_{k+1}-x_{k}\|^{2}\\
&+\frac{5\eta_{k}\beta}{\rho_{k+1}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}+\frac{5(\rho_{k}-\rho_{k+1})}{\rho_{k+1}^{2}\eta_{k}\beta}(\|y_{k+1}^{*}(x_{k+1})\|^{2}-\|y_{k}^{*}(x_{k})\|^{2}). \qquad (2.27)
\end{aligned}
\]

Proof: $g_{k}(x,y)$ is $\rho_{k}$-strongly concave with respect to $y$, which implies that

\[
\begin{aligned}
&g_{k}(x_{k},y)-g_{k}(x_{k},y_{k})\\
\leq{}&\langle\nabla_{y}g_{k}(x_{k},y_{k}),y-y_{k}\rangle-\frac{\rho_{k}}{2}\|y-y_{k}\|^{2}\\
={}&\langle w_{k},y-\tilde{y}_{k+1}\rangle+\langle\nabla_{y}g_{k}(x_{k},y_{k})-w_{k},y-\tilde{y}_{k+1}\rangle+\langle\nabla_{y}g_{k}(x_{k},y_{k}),\tilde{y}_{k+1}-y_{k}\rangle\\
&-\frac{\rho_{k}}{2}\|y-y_{k}\|^{2}. \qquad (2.28)
\end{aligned}
\]

By Assumption 2.1, $g_{k}(x,y)$ has Lipschitz continuous gradients with respect to $y$, and then

\[
g_{k}(x_{k},\tilde{y}_{k+1})-g_{k}(x_{k},y_{k})\geq\langle\nabla_{y}g_{k}(x_{k},y_{k}),\tilde{y}_{k+1}-y_{k}\rangle-\frac{L}{2}\|\tilde{y}_{k+1}-y_{k}\|^{2}. \qquad (2.29)
\]

The optimality condition for $y_{k}$ in (2.9) implies that $\forall y\in\mathcal{Y}$ and $\forall k\geq 1$,

\[
\begin{aligned}
\langle w_{k},y-\tilde{y}_{k+1}\rangle&\leq\frac{1}{\beta}\langle\tilde{y}_{k+1}-y_{k},y-\tilde{y}_{k+1}\rangle\\
&=-\frac{1}{\beta}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{1}{\beta}\langle\tilde{y}_{k+1}-y_{k},y-y_{k}\rangle. \qquad (2.30)
\end{aligned}
\]

Plugging (2.30) into (2.28), combining with (2.29), and setting $y=y_{k}^{*}(x_{k})$, we have

\[
\begin{aligned}
&g_{k}(x_{k},y_{k}^{*}(x_{k}))-g_{k}(x_{k},\tilde{y}_{k+1})\\
\leq{}&\frac{1}{\beta}\langle\tilde{y}_{k+1}-y_{k},y_{k}^{*}(x_{k})-y_{k}\rangle+\langle\nabla_{y}g_{k}(x_{k},y_{k})-w_{k},y_{k}^{*}(x_{k})-\tilde{y}_{k+1}\rangle\\
&-\frac{\rho_{k}}{2}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}-\Big(\frac{1}{\beta}-\frac{L}{2}\Big)\|\tilde{y}_{k+1}-y_{k}\|^{2}. \qquad (2.31)
\end{aligned}
\]

Next, we estimate the first two terms on the right-hand side of (2.31). By (2.10), we get

\[
\begin{aligned}
\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}={}&\|y_{k}+\eta_{k}(\tilde{y}_{k+1}-y_{k})-y_{k}^{*}(x_{k})\|^{2}\\
={}&\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+2\eta_{k}\langle\tilde{y}_{k+1}-y_{k},y_{k}-y_{k}^{*}(x_{k})\rangle+\eta_{k}^{2}\|\tilde{y}_{k+1}-y_{k}\|^{2}. \qquad (2.32)
\end{aligned}
\]

(2.32) can be rewritten as

\[
\langle\tilde{y}_{k+1}-y_{k},y_{k}^{*}(x_{k})-y_{k}\rangle=\frac{1}{2\eta_{k}}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+\frac{\eta_{k}}{2}\|\tilde{y}_{k+1}-y_{k}\|^{2}-\frac{1}{2\eta_{k}}\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}. \qquad (2.33)
\]

By the Cauchy-Schwarz inequality, we have

\[
\begin{aligned}
&\langle\nabla_{y}g_{k}(x_{k},y_{k})-w_{k},y_{k}^{*}(x_{k})-\tilde{y}_{k+1}\rangle\\
\leq{}&\frac{2}{\rho_{k}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}+\frac{\rho_{k}}{8}\|y_{k}^{*}(x_{k})-\tilde{y}_{k+1}\|^{2}\\
\leq{}&\frac{2}{\rho_{k}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}+\frac{\rho_{k}}{4}\|y_{k}^{*}(x_{k})-y_{k}\|^{2}+\frac{\rho_{k}}{4}\|\tilde{y}_{k+1}-y_{k}\|^{2}. \qquad (2.34)
\end{aligned}
\]

Plugging (2.33) and (2.34) into (2.31), and using the fact that $g_{k}(x_{k},y_{k}^{*}(x_{k}))\geq g_{k}(x_{k},\tilde{y}_{k+1})$, we get

\[
\begin{aligned}
\frac{1}{2\eta_{k}\beta}\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}\leq{}&\Big(\frac{1}{2\eta_{k}\beta}-\frac{\rho_{k}}{4}\Big)\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+\Big(\frac{\eta_{k}}{2\beta}+\frac{\rho_{k}}{4}+\frac{L}{2}-\frac{1}{\beta}\Big)\|\tilde{y}_{k+1}-y_{k}\|^{2}\\
&+\frac{2}{\rho_{k}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}. \qquad (2.35)
\end{aligned}
\]

By the assumptions $0<\eta_{k}\leq 1$, $0<\beta\leq\frac{1}{6L}$, $\rho_{k}\leq L$ and (2.35), we obtain that

\[
\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}\leq\Big(1-\frac{\eta_{k}\beta\rho_{k}}{2}\Big)\|y_{k}-y_{k}^{*}(x_{k})\|^{2}-\frac{3\eta_{k}}{4}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{4\eta_{k}\beta}{\rho_{k}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}. \qquad (2.36)
\]

By the Cauchy-Schwarz inequality, Lemma 2.1 and (2.8), we have

\[
\begin{aligned}
&\|y_{k+1}-y_{k+1}^{*}(x_{k+1})\|^{2}\\
={}&\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}+2\langle y_{k+1}-y_{k}^{*}(x_{k}),y_{k}^{*}(x_{k})-y_{k+1}^{*}(x_{k+1})\rangle+\|y_{k}^{*}(x_{k})-y_{k+1}^{*}(x_{k+1})\|^{2}\\
\leq{}&\Big(1+\frac{\eta_{k}\beta\rho_{k}}{4}\Big)\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}+\Big(1+\frac{4}{\eta_{k}\beta\rho_{k}}\Big)\|y_{k}^{*}(x_{k})-y_{k+1}^{*}(x_{k+1})\|^{2}\\
\leq{}&\Big(1+\frac{\eta_{k}\beta\rho_{k}}{4}\Big)\|y_{k+1}-y_{k}^{*}(x_{k})\|^{2}+\Big(1+\frac{4}{\eta_{k}\beta\rho_{k}}\Big)\frac{\eta_{k}^{2}L^{2}}{\rho_{k+1}^{2}}\|\tilde{x}_{k+1}-x_{k}\|^{2}\\
&+\Big(1+\frac{4}{\eta_{k}\beta\rho_{k}}\Big)\frac{\rho_{k}-\rho_{k+1}}{\rho_{k+1}}(\|y_{k+1}^{*}(x_{k+1})\|^{2}-\|y_{k}^{*}(x_{k})\|^{2}). \qquad (2.37)
\end{aligned}
\]

Plugging (2.36) into (2.37), we obtain

\[
\begin{aligned}
&\|y_{k+1}-y_{k+1}^{*}(x_{k+1})\|^{2}\\
\leq{}&\Big(1-\frac{\eta_{k}\beta\rho_{k}}{2}\Big)\Big(1+\frac{\eta_{k}\beta\rho_{k}}{4}\Big)\|y_{k}-y_{k}^{*}(x_{k})\|^{2}-\frac{3\eta_{k}}{4}\Big(1+\frac{\eta_{k}\beta\rho_{k}}{4}\Big)\|\tilde{y}_{k+1}-y_{k}\|^{2}\\
&+\Big(1+\frac{4}{\eta_{k}\beta\rho_{k}}\Big)\frac{\eta_{k}^{2}L^{2}}{\rho_{k+1}^{2}}\|\tilde{x}_{k+1}-x_{k}\|^{2}+\frac{4\eta_{k}\beta}{\rho_{k}}\Big(1+\frac{\eta_{k}\beta\rho_{k}}{4}\Big)\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}\\
&+\Big(1+\frac{4}{\eta_{k}\beta\rho_{k}}\Big)\frac{\rho_{k}-\rho_{k+1}}{\rho_{k+1}}(\|y_{k+1}^{*}(x_{k+1})\|^{2}-\|y_{k}^{*}(x_{k})\|^{2}). \qquad (2.38)
\end{aligned}
\]

Since $0<\eta_{k}\leq 1$, $0<\beta\leq\frac{1}{6L}$ and $\rho_{k}\leq L$, we have $\eta_{k}\beta\rho_{k}<1$. Then, by Assumption 2.2, we get $(1-\frac{\eta_{k}\beta\rho_{k}}{2})(1+\frac{\eta_{k}\beta\rho_{k}}{4})\leq 1-\frac{\eta_{k}\beta\rho_{k+1}}{4}$, $-\frac{3\eta_{k}}{4}(1+\frac{\eta_{k}\beta\rho_{k}}{4})\leq-\frac{3\eta_{k}}{4}$, $\frac{4\eta_{k}\beta}{\rho_{k}}(1+\frac{\eta_{k}\beta\rho_{k}}{4})\leq\frac{5\eta_{k}\beta}{\rho_{k+1}}$, and $1+\frac{4}{\eta_{k}\beta\rho_{k}}\leq\frac{5}{\rho_{k+1}\beta\eta_{k}}$. Thus, we obtain

\[
\begin{aligned}
&\|y_{k+1}-y_{k+1}^{*}(x_{k+1})\|^{2}\\
\leq{}&\Big(1-\frac{\eta_{k}\beta\rho_{k+1}}{4}\Big)\|y_{k}-y_{k}^{*}(x_{k})\|^{2}-\frac{3\eta_{k}}{4}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{5L^{2}\eta_{k}}{\rho_{k+1}^{3}\beta}\|\tilde{x}_{k+1}-x_{k}\|^{2}\\
&+\frac{5\eta_{k}\beta}{\rho_{k+1}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}+\frac{5(\rho_{k}-\rho_{k+1})}{\rho_{k+1}^{2}\eta_{k}\beta}(\|y_{k+1}^{*}(x_{k+1})\|^{2}-\|y_{k}^{*}(x_{k})\|^{2}).
\end{aligned}
\]

The proof is completed.

Next, we provide upper bound estimates of $\mathbb{E}[\|\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-v_{k+1}\|^{2}]$ and $\mathbb{E}[\|\nabla_{y}g_{k+1}(x_{k+1},y_{k+1})-w_{k+1}\|^{2}]$ in the following lemma.

Lemma 2.4.

Suppose that Assumptions 2.1 and 2.3 hold. Let $\{(x_{k},y_{k})\}$ be a sequence generated by Algorithm 1. Then $\forall k\geq 1$,

\[
\begin{aligned}
&\mathbb{E}[\|\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-v_{k+1}\|^{2}]\\
\leq{}&(1-\gamma_{k})\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]+\frac{2\gamma_{k}^{2}\delta^{2}}{b}+\frac{2L^{2}\eta_{k}^{2}}{b}\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}+\|\tilde{y}_{k+1}-y_{k}\|^{2}], \qquad (2.39)
\end{aligned}
\]
\[
\begin{aligned}
&\mathbb{E}[\|\nabla_{y}g_{k+1}(x_{k+1},y_{k+1})-w_{k+1}\|^{2}]\\
\leq{}&(1-\theta_{k})\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]+\frac{2\theta_{k}^{2}\delta^{2}}{b}+\frac{4L^{2}\eta_{k}^{2}}{b}\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}+\|\tilde{y}_{k+1}-y_{k}\|^{2}]\\
&+\frac{4(\rho_{k}^{2}-\rho_{k+1}^{2})\sigma_{y}^{2}}{b}. \qquad (2.40)
\end{aligned}
\]

Proof: Note that $\mathbb{E}[\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};I_{k+1})]=\nabla_{x}g_{k}(x_{k},y_{k})$ and $\mathbb{E}[\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};I_{k+1})]=\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})$. By the definition of $v_{k+1}$, we have

\[
\begin{aligned}
&\mathbb{E}[\|\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-v_{k+1}\|^{2}]\\
={}&\mathbb{E}[\|\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};I_{k+1})-(1-\gamma_{k})[v_{k}-\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};I_{k+1})]\|^{2}]\\
={}&\mathbb{E}[\|(1-\gamma_{k})(\nabla_{x}g_{k}(x_{k},y_{k})-v_{k})+\gamma_{k}(\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};I_{k+1}))\\
&+(1-\gamma_{k})[\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-\nabla_{x}g_{k}(x_{k},y_{k})-\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};I_{k+1})+\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};I_{k+1})]\|^{2}]\\
={}&(1-\gamma_{k})^{2}\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]+\mathbb{E}[\|\gamma_{k}(\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};I_{k+1}))\\
&+(1-\gamma_{k})[\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-\nabla_{x}g_{k}(x_{k},y_{k})-\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};I_{k+1})+\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};I_{k+1})]\|^{2}]. \qquad (2.41)
\end{aligned}
\]

By the facts that $\mathbb{E}[\|\zeta-\mathbb{E}\zeta\|^{2}]=\mathbb{E}[\|\zeta\|^{2}]-\|\mathbb{E}[\zeta]\|^{2}\leq\mathbb{E}[\|\zeta\|^{2}]$ and $\mathbb{E}[\|\frac{1}{b}\sum_{j=1}^{b}\zeta_{j}\|^{2}]=\frac{1}{b}\mathbb{E}[\|\zeta_{j}\|^{2}]$ for i.i.d. random variables $\{\zeta_{j}\}_{j=1}^{b}$ with zero mean, together with $1-\gamma_{k}<1$ and (2.11), we have

\[
\begin{aligned}
&\mathbb{E}[\|\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-v_{k+1}\|^{2}]\\
\leq{}&(1-\gamma_{k})^{2}\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]+\frac{2\gamma_{k}^{2}\delta^{2}}{b}\\
&+\frac{2(1-\gamma_{k})^{2}}{b}\mathbb{E}[\|\nabla_{x}\tilde{G}_{k+1}(x_{k+1},y_{k+1};\zeta_{1}^{k+1})-\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};\zeta_{1}^{k+1})\|^{2}]\\
\leq{}&(1-\gamma_{k})\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]+\frac{2\gamma_{k}^{2}\delta^{2}}{b}+\frac{2L^{2}\eta_{k}^{2}}{b}\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}+\|\tilde{y}_{k+1}-y_{k}\|^{2}]. \qquad (2.42)
\end{aligned}
\]

Similarly, we get

\[
\begin{aligned}
&\mathbb{E}[\|\nabla_{y}g_{k+1}(x_{k+1},y_{k+1})-w_{k+1}\|^{2}]\\
\leq{}&(1-\theta_{k})^{2}\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]+\frac{2\theta_{k}^{2}\delta^{2}}{b}\\
&+\frac{2(1-\theta_{k})^{2}}{b}\mathbb{E}[\|\nabla_{y}\tilde{G}_{k+1}(x_{k+1},y_{k+1};\zeta_{1}^{k+1})-\nabla_{y}\tilde{G}_{k}(x_{k},y_{k};\zeta_{1}^{k+1})\|^{2}]\\
\leq{}&(1-\theta_{k})\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]+\frac{2\theta_{k}^{2}\delta^{2}}{b}+\frac{4(\rho_{k}^{2}-\rho_{k+1}^{2})\sigma_{y}^{2}}{b}\\
&+\frac{4L^{2}\eta_{k}^{2}}{b}\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}+\|\tilde{y}_{k+1}-y_{k}\|^{2}].
\end{aligned}
\]

The proof is completed.

We now establish an important recursion for the FORMDA algorithm.

Lemma 2.5.

Suppose that Assumptions 2.1, 2.2 and 2.3 hold. Let $\{(x_{k},y_{k})\}$ be a sequence generated by Algorithm 1. Denote

\[
\begin{aligned}
F_{k+1}(x_{k+1},y_{k+1})={}&\mathbb{E}[\Phi_{k+1}(x_{k+1})]+D_{k}^{(1)}\mathbb{E}[\|\nabla_{x}g_{k+1}(x_{k+1},y_{k+1})-v_{k+1}\|^{2}]\\
&+\frac{8\alpha_{k}L^{2}}{\beta\rho_{k+1}}\mathbb{E}[\|y_{k+1}-y_{k+1}^{*}(x_{k+1})\|^{2}]+D_{k}^{(2)}\mathbb{E}[\|\nabla_{y}g_{k+1}(x_{k+1},y_{k+1})-w_{k+1}\|^{2}],\\
S_{k+1}(x_{k+1},y_{k+1})={}&F_{k+1}(x_{k+1},y_{k+1})-\frac{40\alpha_{k+1}L^{2}(\rho_{k+1}-\rho_{k+2})}{\eta_{k+1}\beta^{2}\rho_{k+2}^{3}}\mathbb{E}[\|y_{k+1}^{*}(x_{k+1})\|^{2}]+\frac{4D_{k+1}^{(2)}\rho_{k+1}^{2}\sigma_{y}^{2}}{b}.
\end{aligned}
\]

Then $\forall k\geq 1$,

\[
\begin{aligned}
&S_{k+1}(x_{k+1},y_{k+1})-S_{k}(x_{k},y_{k})\\
\leq{}&\Big(\frac{8\alpha_{k}L^{2}}{\beta\rho_{k+1}}-\frac{8\alpha_{k-1}L^{2}}{\beta\rho_{k}}\Big)\mathbb{E}[\|y_{k}-y_{k}^{*}(x_{k})\|^{2}]\\
&+(2\eta_{k}\alpha_{k}-D_{k}^{(1)}\gamma_{k}+D_{k}^{(1)}-D_{k-1}^{(1)})\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]\\
&+\Big(\frac{40\eta_{k}\alpha_{k}L^{2}}{\rho_{k+1}^{2}}-D_{k}^{(2)}\theta_{k}+D_{k}^{(2)}-D_{k-1}^{(2)}\Big)\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]\\
&-\Big(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}}-\frac{40L^{4}\eta_{k}\alpha_{k}}{\rho_{k+1}^{4}\beta^{2}}-\frac{4L^{2}\eta_{k}^{2}(D_{k}^{(1)}d_{x}+D_{k}^{(2)}d_{y})}{b}\Big)\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}]\\
&-\Big(\frac{6\alpha_{k}L^{2}\eta_{k}}{\rho_{k+1}\beta}-\frac{4L^{2}\eta_{k}^{2}(D_{k}^{(1)}d_{x}+D_{k}^{(2)}d_{y})}{b}\Big)\mathbb{E}[\|\tilde{y}_{k+1}-y_{k}\|^{2}]\\
&+\Big(\frac{40\alpha_{k}L^{2}(\rho_{k}-\rho_{k+1})}{\eta_{k}\beta^{2}\rho_{k+1}^{3}}-\frac{40\alpha_{k+1}L^{2}(\rho_{k+1}-\rho_{k+2})}{\eta_{k+1}\beta^{2}\rho_{k+1}^{3}}\Big)\sigma_{y}^{2}+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}\\
&+\frac{2\delta^{2}(\gamma_{k}^{2}D_{k}^{(1)}+\theta_{k}^{2}D_{k}^{(2)})}{b}+\frac{4\rho_{k+1}^{2}(D_{k+1}^{(2)}-D_{k}^{(2)})\sigma_{y}^{2}}{b}, \qquad (2.43)
\end{aligned}
\]

where $\sigma_{y}=\max\{\|y\|\mid y\in\mathcal{Y}\}$, $D_{k}^{(1)}>0$ and $D_{k}^{(2)}>0$.

Proof: By the definition of $F_{k}(x_{k},y_{k})$ and Lemmas 2.2-2.4, we get

\[
\begin{aligned}
&F_{k+1}(x_{k+1},y_{k+1})-F_{k}(x_{k},y_{k})\\
\leq{}&\Big(\frac{8\alpha_{k}L^{2}}{\beta\rho_{k+1}}-\frac{8\alpha_{k-1}L^{2}}{\beta\rho_{k}}\Big)\mathbb{E}[\|y_{k}-y_{k}^{*}(x_{k})\|^{2}]\\
&+(2\eta_{k}\alpha_{k}-D_{k}^{(1)}\gamma_{k}+D_{k}^{(1)}-D_{k-1}^{(1)})\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]\\
&+\Big(\frac{40\eta_{k}\alpha_{k}L^{2}}{\rho_{k+1}^{2}}-D_{k}^{(2)}\theta_{k}+D_{k}^{(2)}-D_{k-1}^{(2)}\Big)\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]\\
&-\Big(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}}-\frac{40L^{4}\eta_{k}\alpha_{k}}{\rho_{k+1}^{4}\beta^{2}}-\frac{4L^{2}\eta_{k}^{2}(D_{k}^{(1)}d_{x}+D_{k}^{(2)}d_{y})}{b}\Big)\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}]\\
&-\Big(\frac{6\alpha_{k}L^{2}\eta_{k}}{\rho_{k+1}\beta}-\frac{4L^{2}\eta_{k}^{2}(D_{k}^{(1)}d_{x}+D_{k}^{(2)}d_{y})}{b}\Big)\mathbb{E}[\|\tilde{y}_{k+1}-y_{k}\|^{2}]\\
&+\frac{40\alpha_{k}L^{2}(\rho_{k}-\rho_{k+1})}{\eta_{k}\beta^{2}\rho_{k+1}^{3}}(\mathbb{E}[\|y_{k+1}^{*}(x_{k+1})\|^{2}]-\mathbb{E}[\|y_{k}^{*}(x_{k})\|^{2}])+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}\\
&+\frac{2\delta^{2}(\gamma_{k}^{2}D_{k}^{(1)}+\theta_{k}^{2}D_{k}^{(2)})}{b}+\frac{4D_{k}^{(2)}(\rho_{k}^{2}-\rho_{k+1}^{2})\sigma_{y}^{2}}{b}. \qquad (2.44)
\end{aligned}
\]

The proof is then completed by the definition of $S_{k}(x_{k},y_{k})$ and $\sigma_{y}$.

Define $T(\varepsilon):=\min\{k\mid\|\nabla\mathcal{G}^{\alpha_{k},\beta}(x_{k},y_{k})\|\leq\varepsilon\}$, where $\varepsilon>0$ is a given target accuracy. Denote

\[
\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x,y)=\left(\begin{array}{c}\frac{1}{\alpha_{k}}\left(x-\mathcal{P}_{\mathcal{X}}^{\frac{1}{\alpha_{k}}}\left(x-\alpha_{k}\nabla_{x}g_{k}(x,y)\right)\right)\\ \frac{1}{\beta}\left(y-\mathcal{P}_{\mathcal{Y}}^{\frac{1}{\beta}}\left(y+\beta\nabla_{y}g_{k}(x,y)\right)\right)\end{array}\right).
\]

It can be easily checked that

\[
\|\nabla\mathcal{G}^{\alpha_{k},\beta}(x,y)\|\leq\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x,y)\|+\rho_{k}\|y\|,
\]

and $\|\nabla\mathcal{G}^{\alpha_{k},\beta}(x,y)\|=\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x,y)\|$ if $\rho_{k}=0$. Next, we provide an upper bound estimate of $\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x,y)\|$.
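
Indeed, the $x$-blocks of $\nabla\mathcal{G}^{\alpha_{k},\beta}$ and $\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}$ coincide because $\nabla_{x}g_{k}=\nabla_{x}g$, while for the $y$-blocks the nonexpansiveness of the projection gives

\[
\frac{1}{\beta}\Big\|\mathcal{P}_{\mathcal{Y}}^{\frac{1}{\beta}}\big(y+\beta\nabla_{y}g(x,y)\big)-\mathcal{P}_{\mathcal{Y}}^{\frac{1}{\beta}}\big(y+\beta\nabla_{y}g_{k}(x,y)\big)\Big\|\leq\|\nabla_{y}g(x,y)-\nabla_{y}g_{k}(x,y)\|=\rho_{k}\|y\|,
\]

so the inequality follows from the triangle inequality.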

Lemma 2.6.

Suppose that Assumption 2.1 holds. Let $\{(x_{k},y_{k})\}$ be a sequence generated by Algorithm 1. Then $\forall k\geq 1$,

\[
\begin{aligned}
\mathbb{E}[\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x_{k},y_{k})\|^{2}]\leq{}&\frac{2}{\alpha_{k}^{2}}\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}]+\frac{2}{\beta^{2}}\mathbb{E}[\|\tilde{y}_{k+1}-y_{k}\|^{2}]+2\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]\\
&+2\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]. \qquad (2.45)
\end{aligned}
\]

Proof: By (2.9) and the nonexpansiveness of the projection operator, we immediately get

\[
\Big\|\frac{1}{\beta}\Big(y_{k}-\mathcal{P}_{\mathcal{Y}}^{\frac{1}{\beta}}\left(y_{k}+\beta\nabla_{y}g_{k}\left(x_{k},y_{k}\right)\right)\Big)\Big\|\leq\frac{1}{\beta}\|\tilde{y}_{k+1}-y_{k}\|+\|\nabla_{y}g_{k}\left(x_{k},y_{k}\right)-w_{k}\|. \qquad (2.46)
\]

On the other hand, by (2.7) and the nonexpansiveness of the projection operator, we have

\[
\Big\|\frac{1}{\alpha_{k}}\Big(x_{k}-\mathcal{P}_{\mathcal{X}}^{\frac{1}{\alpha_{k}}}\left(x_{k}-\alpha_{k}\nabla_{x}g_{k}\left(x_{k},y_{k}\right)\right)\Big)\Big\|\leq\frac{1}{\alpha_{k}}\|\tilde{x}_{k+1}-x_{k}\|+\|\nabla_{x}g_{k}\left(x_{k},y_{k}\right)-v_{k}\|. \qquad (2.47)
\]

Combining (2.46) and (2.47), using the Cauchy-Schwarz inequality and taking the expectation, we complete the proof.

Theorem 2.1.

Suppose that Assumptions 2.1, 2.2 and 2.3 hold. Let $\{(x_{k},y_{k})\}$ be a sequence generated by Algorithm 1. For any given $k\geq 1$, let

\[
\begin{aligned}
\eta_{k}&=\frac{1}{(k+2)^{5/13}},\quad\alpha_{k}=\frac{a_{4}}{(k+2)^{4/13}},\quad\rho_{k}=\frac{L}{(k+1)^{2/13}},\\
\gamma_{k}&=\frac{a_{5}}{(k+2)^{12/13}},\quad\theta_{k}=\frac{a_{6}}{(k+2)^{8/13}},\quad D_{k}^{(1)}=a_{1}(k+2)^{3/13},\\
D_{k}^{(2)}&=a_{2}(k+2)^{3/13},
\end{aligned}
\]

with

\[
\begin{aligned}
0&<a_{1}\leq\min\Big\{\frac{b}{32a_{4}L^{2}},\frac{ba_{4}}{2L\beta}\Big\},\quad 0<a_{2}\leq\min\Big\{\frac{b}{32a_{4}L^{2}},\frac{ba_{4}}{2L\beta}\Big\},\\
0&<a_{4}\leq\min\Big\{\frac{1}{8L},\frac{\beta}{8\sqrt{5}}\Big\},\quad a_{5}\geq\frac{4a_{4}}{a_{1}}+\frac{12}{13},\quad a_{6}\geq\frac{80a_{4}}{a_{2}}+\frac{12}{13}.
\end{aligned}
\]

If $0<\beta\leq\frac{1}{6L}$, then for any given $\varepsilon>0$,

\[
T(\varepsilon)\leq\max\Big\{\tilde{T}(\varepsilon),\Big(\frac{2L\sigma_{y}}{\varepsilon}\Big)^{\frac{13}{2}}-1\Big\},
\]

where $\tilde{T}(\varepsilon)$ satisfies

\[
\frac{\varepsilon^{2}}{4}\leq\frac{C_{1}+C_{2}\ln(\tilde{T}(\varepsilon)+2)}{d_{1}a_{4}\big(\frac{13}{4}(\tilde{T}(\varepsilon)+3)^{4/13}-\frac{13\cdot 3^{4/13}}{4}\big)}, \qquad (2.48)
\]

with

\[
\begin{aligned}
C_{1}&=S_{1}(x_{1},y_{1})-\underline{S}+\frac{\rho_{1}}{2}\sigma_{y}^{2}+\frac{40\alpha_{1}L^{2}\rho_{1}}{\eta_{1}\beta^{2}\rho_{2}^{3}}\sigma_{y}^{2},\quad d_{1}\leq\min\Big\{\frac{1}{8},L\beta,\frac{a_{1}a_{5}}{4a_{4}},\frac{a_{2}a_{6}}{4a_{4}}\Big\},\\
C_{2}&=\frac{2\delta^{2}(a_{5}^{2}a_{1}+a_{6}^{2}a_{2})}{b}+\frac{12a_{2}L^{2}\sigma_{y}^{2}}{13b},\quad\underline{S}=\min\limits_{x\in\mathcal{X}}\min\limits_{y\in\mathcal{Y}}S_{k}(x,y),\\
\sigma_{y}&=\max\{\|y\|\mid y\in\mathcal{Y}\}.
\end{aligned}
\]

Proof: By the definitions of $\alpha_{k}$, $\rho_{k}$ and $D_{k}^{(1)}$, we have $\frac{\alpha_{k}}{\rho_{k+1}}\leq\frac{\alpha_{k-1}}{\rho_{k}}$ and

\[
D_{k}^{(1)}-D_{k-1}^{(1)}\leq\frac{3a_{1}(k+1)^{-10/13}}{13}\leq\frac{3a_{1}\cdot 2^{10/13}(2+k)^{-10/13}}{13}\leq\frac{6a_{1}(2+k)^{-9/13}}{13}.
\]

Then, by the setting of $a_{5}$, we get

\[
2\eta_{k}\alpha_{k}-D_{k}^{(1)}\gamma_{k}+D_{k}^{(1)}-D_{k-1}^{(1)}\leq\Big(2a_{4}-a_{1}a_{5}+\frac{6a_{1}}{13}\Big)(2+k)^{-9/13}\leq-\frac{a_{1}a_{5}}{2}(2+k)^{-9/13}.
\]

Similarly, we also have

\[
\frac{40\eta_{k}\alpha_{k}L^{2}}{\rho_{k+1}^{2}}-D_{k}^{(2)}\theta_{k}+D_{k}^{(2)}-D_{k-1}^{(2)}\leq\Big(40a_{4}-a_{2}a_{6}+\frac{6a_{2}}{13}\Big)(2+k)^{-5/13}\leq-\frac{a_{2}a_{6}}{2}(2+k)^{-5/13}.
\]

By the settings of $a_{1},a_{2},a_{4}$, we obtain

\[
\begin{aligned}
&-\Big(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}}-\frac{40L^{4}\eta_{k}\alpha_{k}}{\rho_{k+1}^{4}\beta^{2}}-\frac{4L^{2}\eta_{k}^{2}(D_{k}^{(1)}d_{x}+D_{k}^{(2)}d_{y})}{b}\Big)\\
\leq{}&\Big(-\frac{3}{4a_{4}}+L+\frac{40a_{4}}{\beta^{2}}+\frac{4L^{2}(a_{1}d_{x}+a_{2}d_{y})}{b}\Big)(2+k)^{-1/13}\leq-\frac{1}{4a_{4}}(2+k)^{-1/13},
\end{aligned}
\]

and

\[
\begin{aligned}
&-\Big(\frac{6\alpha_{k}L^{2}\eta_{k}}{\rho_{k+1}\beta}-\frac{4L^{2}\eta_{k}^{2}(D_{k}^{(1)}d_{x}+D_{k}^{(2)}d_{y})}{b}\Big)\\
\leq{}&\Big(-\frac{6La_{4}}{\beta}+\frac{4L^{2}(a_{1}d_{x}+a_{2}d_{y})}{b}\Big)(2+k)^{-7/13}\leq-\frac{2La_{4}}{\beta}(2+k)^{-7/13}.
\end{aligned}
\]

Plugging these inequalities into (2.43), we get

\[
\begin{aligned}
&S_{k+1}(x_{k+1},y_{k+1})-S_{k}(x_{k},y_{k})\\
\leq{}&-\frac{a_{1}a_{5}}{2}(2+k)^{-9/13}\mathbb{E}[\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}]-\frac{a_{2}a_{6}}{2}(2+k)^{-5/13}\mathbb{E}[\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}]\\
&-\frac{1}{4a_{4}}(2+k)^{-1/13}\mathbb{E}[\|\tilde{x}_{k+1}-x_{k}\|^{2}]-\frac{2La_{4}}{\beta}(2+k)^{-7/13}\mathbb{E}[\|\tilde{y}_{k+1}-y_{k}\|^{2}]\\
&+\Big(\frac{40\alpha_{k}L^{2}(\rho_{k}-\rho_{k+1})}{\eta_{k}\beta^{2}\rho_{k+1}^{3}}-\frac{40\alpha_{k+1}L^{2}(\rho_{k+1}-\rho_{k+2})}{\eta_{k+1}\beta^{2}\rho_{k+1}^{3}}\Big)\sigma_{y}^{2}+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}\\
&+\frac{2\delta^{2}(\gamma_{k}^{2}D_{k}^{(1)}+\theta_{k}^{2}D_{k}^{(2)})}{b}+\frac{12a_{2}L^{2}\sigma_{y}^{2}}{13b}(k+2)^{-1}. \qquad (2.49)
\end{aligned}
\]

It follows from the definition of $d_{1}$, (2.45) and (2.49) that $\forall k\geq 1$,

\[
\begin{aligned}
&d_{1}\eta_{k}\alpha_{k}\mathbb{E}[\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x_{k},y_{k})\|^{2}]\\
\leq{}&S_{k}(x_{k},y_{k})-S_{k+1}(x_{k+1},y_{k+1})+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}\\
&+\Big(\frac{40\alpha_{k}L^{2}(\rho_{k}-\rho_{k+1})}{\eta_{k}\beta^{2}\rho_{k+1}^{3}}-\frac{40\alpha_{k+1}L^{2}(\rho_{k+1}-\rho_{k+2})}{\eta_{k+1}\beta^{2}\rho_{k+1}^{3}}\Big)\sigma_{y}^{2}\\
&+\frac{2\delta^{2}(\gamma_{k}^{2}D_{k}^{(1)}+\theta_{k}^{2}D_{k}^{(2)})}{b}+\frac{12a_{2}L^{2}\sigma_{y}^{2}}{13b}(k+2)^{-1}. \qquad (2.50)
\end{aligned}
\]

Denote $\underline{S}=\min\limits_{x\in\mathcal{X}}\min\limits_{y\in\mathcal{Y}}S_{k}(x,y)$ and $\tilde{T}(\varepsilon)=\min\{k\mid\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x_{k},y_{k})\|\leq\frac{\varepsilon}{2},k\geq 1\}$. Summing both sides of (2.50) from $k=1$ to $\tilde{T}(\varepsilon)$, we obtain

\[
\begin{aligned}
&\sum_{k=1}^{\tilde{T}(\varepsilon)}d_{1}\eta_{k}\alpha_{k}\mathbb{E}[\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x_{k},y_{k})\|^{2}]\\
\leq{}&S_{1}(x_{1},y_{1})-S_{\tilde{T}(\varepsilon)+1}(x_{\tilde{T}(\varepsilon)+1},y_{\tilde{T}(\varepsilon)+1})+\frac{\rho_{1}}{2}\sigma_{y}^{2}+\sum_{k=1}^{\tilde{T}(\varepsilon)}\frac{2\delta^{2}(\gamma_{k}^{2}D_{k}^{(1)}+\theta_{k}^{2}D_{k}^{(2)})}{b}\\
&+\frac{40\alpha_{1}L^{2}\rho_{1}\sigma_{y}^{2}}{\eta_{1}\beta^{2}\rho_{2}^{3}}+\sum_{k=1}^{\tilde{T}(\varepsilon)}\frac{12a_{2}d_{y}L^{2}\sigma_{y}^{2}}{13b}(k+2)^{-1}\\
\leq{}&S_{1}(x_{1},y_{1})-\underline{S}+\frac{\rho_{1}}{2}\sigma_{y}^{2}+\frac{40\alpha_{1}L^{2}\rho_{1}\sigma_{y}^{2}}{\eta_{1}\beta^{2}\rho_{2}^{3}}+\sum_{k=1}^{\tilde{T}(\varepsilon)}\frac{2\delta^{2}(a_{5}^{2}a_{1}+a_{6}^{2}a_{2})}{b}(k+2)^{-1}\\
&+\sum_{k=1}^{\tilde{T}(\varepsilon)}\frac{12a_{2}L^{2}\sigma_{y}^{2}}{13b}(k+2)^{-1}. \qquad (2.51)
\end{aligned}
\]

Since $\sum_{k=1}^{\tilde{T}(\varepsilon)}(k+2)^{-1}\leq\ln(\tilde{T}(\varepsilon)+2)$ and $\sum_{k=1}^{\tilde{T}(\varepsilon)}(k+2)^{-9/13}\geq\frac{13}{4}(\tilde{T}(\varepsilon)+3)^{4/13}-\frac{13\cdot 3^{4/13}}{4}$, by the definitions of $C_{1}$ and $C_{2}$, we get

\[
\frac{\varepsilon^{2}}{4}\leq\frac{C_{1}+C_{2}\ln(\tilde{T}(\varepsilon)+2)}{d_{1}a_{4}\big(\frac{13}{4}(\tilde{T}(\varepsilon)+3)^{4/13}-\frac{13\cdot 3^{4/13}}{4}\big)}. \qquad (2.52)
\]

On the other hand, if $k\geq(\frac{2L\sigma_{y}}{\varepsilon})^{\frac{13}{2}}-1$, then $\rho_{k}\leq\frac{\varepsilon}{2\sigma_{y}}$, which together with the definition of $\sigma_{y}$ implies that $\rho_{k}\|y_{k}\|\leq\frac{\varepsilon}{2}$. Therefore, there exists a

\[
T(\varepsilon)\leq\max\Big\{\tilde{T}(\varepsilon),\Big(\frac{2L\sigma_{y}}{\varepsilon}\Big)^{\frac{13}{2}}-1\Big\}
\]

such that $\mathbb{E}[\|\nabla\mathcal{G}^{\alpha_{k},\beta}(x_{k},y_{k})\|]\leq\mathbb{E}[\|\nabla\tilde{\mathcal{G}}_{k}^{\alpha_{k},\beta}(x_{k},y_{k})\|]+\rho_{k}\|y_{k}\|\leq\varepsilon$, which completes the proof.

Remark 2.1.

It is easily verified from (2.48) that 𝒪(ε2)=𝒪(lnT~(ε)T~(ε)4/13)=𝒪~(T~(ε)4/13)\mathcal{O}(\varepsilon^{2})=\mathcal{O}\left(\frac{\ln\tilde{T}(\varepsilon)}{\tilde{T}(\varepsilon)^{4/13}}\right)=\tilde{\mathcal{O}}(\tilde{T}(\varepsilon)^{-4/13}), which means T~(ε)=𝒪~(ε6.5)\tilde{T}(\varepsilon)=\tilde{\mathcal{O}}\left(\varepsilon^{-6.5}\right). Therefore, T(ε)=𝒪~(ε6.5)T(\varepsilon)=\tilde{\mathcal{O}}\left(\varepsilon^{-6.5}\right) by Theorem 2.1; that is, the number of iterations for Algorithm 1 to obtain an ε\varepsilon-stationary point of the stochastic nonconvex-concave minimax problem (1.1) is upper bounded by 𝒪~(ε6.5)\tilde{\mathcal{O}}\left(\varepsilon^{-6.5}\right).
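In more detail, the inversion behind Remark 2.1 can be spelled out as follows (a rough calculation, with logarithmic factors hidden in the 𝒪~\tilde{\mathcal{O}} notation):

\varepsilon^{2}=\tilde{\mathcal{O}}\big(\tilde{T}(\varepsilon)^{-4/13}\big)\quad\Longrightarrow\quad\tilde{T}(\varepsilon)^{4/13}=\tilde{\mathcal{O}}\big(\varepsilon^{-2}\big)\quad\Longrightarrow\quad\tilde{T}(\varepsilon)=\tilde{\mathcal{O}}\big(\varepsilon^{-2\cdot\frac{13}{4}}\big)=\tilde{\mathcal{O}}\big(\varepsilon^{-6.5}\big).

Moreover, the second term in the bound on T(\varepsilon) in Theorem 2.1, namely (\frac{2L\sigma_{y}}{\varepsilon})^{\frac{13}{2}}-1, is of order \mathcal{O}(\varepsilon^{-6.5}) as well, so both terms in the maximum are of the same order.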

2.2 Nonsmooth Case.

Consider the following stochastic nonsmooth nonconvex-concave minimax optimization problem:

minx𝒳maxy𝒴g(x,y)+f(x)h(y),\min\limits_{x\in\mathcal{X}}\max\limits_{y\in\mathcal{Y}}g(x,y)+f(x)-h(y), (2.53)

where g(x,y)=𝔼ζD[G(x,y,ζ)]g(x,y)=\mathbb{E}_{\zeta\sim D}[G(x,y,\zeta)], and both f(x)f(x) and h(y)h(y) are convex and continuous but possibly nonsmooth functions.

Denote the proximity operator as Proxh,𝒵1/α(υ):=argminz𝒵h(z)+12αzυ2\operatorname{Prox}_{h,\mathcal{Z}}^{1/\alpha}(\upsilon):=\arg\min\limits_{z\in\mathcal{Z}}h(z)+\frac{1}{2\alpha}\|z-\upsilon\|^{2}. Based on the FORMDA algorithm, we propose an accelerated first-order regularized momentum descent ascent algorithm for nonsmooth minimax problems (FORMDA-NS) by replacing 𝒫𝒳1/αk\mathcal{P}_{\mathcal{X}}^{1/\alpha_{k}} with Proxf,𝒳1/αk\operatorname{Prox}_{f,\mathcal{X}}^{1/\alpha_{k}} and 𝒫𝒴1/β\mathcal{P}_{\mathcal{Y}}^{1/\beta} with Proxh,𝒴1/β\operatorname{Prox}_{h,\mathcal{Y}}^{1/\beta} in the FORMDA algorithm. The proposed FORMDA-NS algorithm is formally stated in Algorithm 2.

Algorithm 2 An Accelerated First-order Regularized Momentum Descent Ascent Algorithm for Nonsmooth Minimax Problem (FORMDA-NS)
Step 1: Input x1,y1,λ1,α1,β,0<η11,bx_{1},y_{1},\lambda_{1},\alpha_{1},\beta,0<\eta_{1}\leq 1,b; γ0=1\gamma_{0}=1, θ0=1\theta_{0}=1. Set k=1k=1.
Step 2: Draw a mini-batch of samples Ik={ζik}i=1bI_{k}=\{\zeta_{i}^{k}\}_{i=1}^{b}. Compute
vk\displaystyle v_{k} =xG~k(xk,yk;Ik)+(1γk1)[vk1xG~k1(xk1,yk1;Ik)]\displaystyle=\nabla_{x}\tilde{G}_{k}(x_{k},y_{k};I_{k})+(1-\gamma_{k-1})[v_{k-1}-\nabla_{x}\tilde{G}_{k-1}(x_{k-1},y_{k-1};I_{k})] (2.54)
and
wk\displaystyle w_{k} =yG~k(xk,yk;Ik)+(1θk1)[wk1yG~k1(xk1,yk1;Ik)]\displaystyle=\nabla_{y}\tilde{G}_{k}(x_{k},y_{k};I_{k})+(1-\theta_{k-1})[w_{k-1}-\nabla_{y}\tilde{G}_{k-1}(x_{k-1},y_{k-1};I_{k})] (2.55)
Step 3: Perform the following update for xkx_{k} and yky_{k}:
x~k+1\displaystyle\tilde{x}_{k+1} =Proxf,𝒳1/αk(xkαkvk),\displaystyle=\operatorname{Prox}_{f,\mathcal{X}}^{1/\alpha_{k}}\left(x_{k}-\alpha_{k}v_{k}\right), (2.56)
xk+1\displaystyle x_{k+1} =xk+ηk(x~k+1xk),\displaystyle=x_{k}+\eta_{k}(\tilde{x}_{k+1}-x_{k}), (2.57)
y~k+1\displaystyle\tilde{y}_{k+1} =Proxh,𝒴1/β(yk+βwk),\displaystyle=\operatorname{Prox}_{h,\mathcal{Y}}^{1/\beta}\left(y_{k}+\beta w_{k}\right), (2.58)
yk+1\displaystyle y_{k+1} =yk+ηk(y~k+1yk).\displaystyle=y_{k}+\eta_{k}(\tilde{y}_{k+1}-y_{k}). (2.59)
Step 4: If some stationarity condition is satisfied, stop; otherwise, set k=k+1k=k+1 and go to Step 2.
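For concreteness, a minimal NumPy sketch of one iteration of Algorithm 2 is given below; it is illustrative only and not the implementation used in Section 3. The callables grad_x, grad_y, prox_f and prox_h are assumptions supplied by the caller: grad_x(x, y, batch, j) and grad_y(x, y, batch, j) stand for the mini-batch gradients \nabla_{x}\tilde{G}_{j} and \nabla_{y}\tilde{G}_{j} in (2.54)-(2.55), evaluated on the same sample I_{k} at both the current and the previous iterate, while prox_f and prox_h stand for the proximity operators \operatorname{Prox}_{f,\mathcal{X}}^{1/\alpha_{k}} and \operatorname{Prox}_{h,\mathcal{Y}}^{1/\beta}. The soft-thresholding helper is just one concrete choice, corresponding to f(x)=\lambda\|x\|_{1} with the constraint set taken to be the whole space for simplicity of illustration, and the constants a_{4}, a_{5}, a_{6} are placeholders within the ranges allowed by Theorem 2.2.

import numpy as np

def soft_threshold(v, lam, alpha):
    # Example prox: Prox_{f,X}^{1/alpha}(v) for f(x) = lam*||x||_1, i.e.
    # argmin_z lam*||z||_1 + (1/(2*alpha))*||z - v||^2 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - lam * alpha, 0.0)

def formda_ns_step(x, y, x_prev, y_prev, v_prev, w_prev, k, batch,
                   grad_x, grad_y, prox_f, prox_h,
                   a4=0.1, a5=2.0, a6=2.0, beta=0.01):
    # One iteration of Algorithm 2 (FORMDA-NS); an illustrative sketch only.
    # grad_x(x, y, batch, j) / grad_y(x, y, batch, j) return the mini-batch
    # gradients of G~_j on the batch I_k; prox_f(v, alpha) / prox_h(v, beta)
    # are the proximity operators. At k = 1, gamma_0 = theta_0 = 1, so the
    # previous-iterate quantities are multiplied by zero and may be arbitrary.
    eta_k = 1.0 / (k + 2) ** (5.0 / 13.0)             # eta_k   of Theorem 2.2
    alpha_k = a4 / (k + 2) ** (4.0 / 13.0)            # alpha_k of Theorem 2.2
    gamma_prev = 1.0 if k == 1 else a5 / (k + 1) ** (12.0 / 13.0)   # gamma_{k-1}
    theta_prev = 1.0 if k == 1 else a6 / (k + 1) ** (8.0 / 13.0)    # theta_{k-1}

    # (2.54)-(2.55): momentum gradient estimators; both gradient differences
    # are evaluated on the same mini-batch I_k.
    v_k = grad_x(x, y, batch, k) + (1.0 - gamma_prev) * (
        v_prev - grad_x(x_prev, y_prev, batch, k - 1))
    w_k = grad_y(x, y, batch, k) + (1.0 - theta_prev) * (
        w_prev - grad_y(x_prev, y_prev, batch, k - 1))

    # (2.56)-(2.59): proximal descent step in x and proximal ascent step in y,
    # each followed by an averaging step with weight eta_k.
    x_tilde = prox_f(x - alpha_k * v_k, alpha_k)
    y_tilde = prox_h(y + beta * w_k, beta)
    x_next = x + eta_k * (x_tilde - x)
    y_next = y + eta_k * (y_tilde - y)
    return x_next, y_next, v_k, w_k

For instance, with f(x)=\lambda\|x\|_{1} one may pass prox_f = lambda v, a: soft_threshold(v, 0.1, a), where \lambda=0.1 is an arbitrary illustrative value.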

Before analyzing the convergence of Algorithm 2, we define the stationarity gap, which will serve as the termination criterion, as follows.

Definition 2.2.

For some given αk>0\alpha_{k}>0 and β>0\beta>0, the stationarity gap for problem (2.53) is defined as

𝒢αk,β(x,y):=(1αk(xProxf,𝒳1αk(xαkxg(x,y)))1β(yProxh,𝒴1β(y+βyg(x,y)))).\nabla\mathcal{G}^{\alpha_{k},\beta}(x,y):=\left(\begin{array}[]{c}\frac{1}{\alpha_{k}}\left(x-\operatorname{Prox}_{f,\mathcal{X}}^{\frac{1}{\alpha_{k}}}\left(x-\alpha_{k}\nabla_{x}g(x,y)\right)\right)\\ \frac{1}{\beta}\left(y-\operatorname{Prox}_{h,\mathcal{Y}}^{\frac{1}{\beta}}\left(y+\beta\nabla_{y}g(x,y)\right)\right)\end{array}\right).
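As a complement to Definition 2.2, the norm of this stationarity gap can be evaluated as in the following sketch; grad_x_g and grad_y_g are assumed, user-supplied gradient oracles for g (exact or mini-batch estimates), and prox_f, prox_h are as in the sketch after Algorithm 2.

import numpy as np

def stationarity_gap_norm(x, y, alpha_k, beta, grad_x_g, grad_y_g, prox_f, prox_h):
    # Norm of the stationarity gap in Definition 2.2; prox_f(v, alpha) and
    # prox_h(v, beta) denote Prox_{f,X}^{1/alpha} and Prox_{h,Y}^{1/beta}.
    gx = (x - prox_f(x - alpha_k * grad_x_g(x, y), alpha_k)) / alpha_k
    gy = (y - prox_h(y + beta * grad_y_g(x, y), beta)) / beta
    return float(np.sqrt(np.sum(gx ** 2) + np.sum(gy ** 2)))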

Denote

Φk(x)\displaystyle\Phi_{k}(x) :=maxy𝒴(gk(x,y)h(y)),\displaystyle:=\max_{y\in\mathcal{Y}}(g_{k}(x,y)-h(y)), (2.60)
yk(x)\displaystyle y_{k}^{*}(x) :=argmaxy𝒴(gk(x,y)h(y)).\displaystyle:=\arg\max_{y\in\mathcal{Y}}(g_{k}(x,y)-h(y)). (2.61)

By Proposition B.25 in Bertsekas , we have xΦk(x)=xgk(x,yk(x))\nabla_{x}\Phi_{k}(x)=\nabla_{x}g_{k}(x,y_{k}^{*}(x)) and Φk(x)\Phi_{k}(x) is LΦkL_{\Phi_{k}}-Lipschitz smooth with LΦk=L+L2ρkL_{\Phi_{k}}=L+\frac{L^{2}}{\rho_{k}} under the ρk\rho_{k}-strong concavity of gk(x,y)g_{k}(x,y). Similar to the analysis in Section 2.1, we first estimate an upper bound of yk+1(x¯)yk(x)2\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}.
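For completeness, the constant L_{\Phi_{k}} follows from a standard estimate: y_{k}^{*}(\cdot) is (L/\rho_{k})-Lipschitz under the \rho_{k}-strong concavity of g_{k}(x,\cdot), so that

\|\nabla\Phi_{k}(x_{1})-\nabla\Phi_{k}(x_{2})\|=\|\nabla_{x}g_{k}(x_{1},y_{k}^{*}(x_{1}))-\nabla_{x}g_{k}(x_{2},y_{k}^{*}(x_{2}))\|\leq L\|x_{1}-x_{2}\|+L\|y_{k}^{*}(x_{1})-y_{k}^{*}(x_{2})\|\leq\left(L+\frac{L^{2}}{\rho_{k}}\right)\|x_{1}-x_{2}\|.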

Lemma 2.7.

Suppose that Assumptions 2.1 and 2.2 hold. Then for any x,x¯𝒳x,\bar{x}\in\mathcal{X},

yk+1(x¯)yk(x)2L2ρk+12x¯x2+ρkρk+1ρk+1(yk+1(x¯)2yk(x)2).\displaystyle\|y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\|^{2}\leq\frac{L^{2}}{\rho_{k+1}^{2}}\|\bar{x}-x\|^{2}+\frac{\rho_{k}-\rho_{k+1}}{\rho_{k+1}}(\|y_{k+1}^{*}(\bar{x})\|^{2}-\|y_{k}^{*}(x)\|^{2}). (2.62)

Proof: The optimality condition for yk(x)y_{k}^{*}(x) implies that y𝒴\forall y\in\mathcal{Y} and k1\forall k\geq 1,

ygk+1(x¯,yk+1(x¯))h(yk+1(x¯)),yk(x)yk+1(x¯)\displaystyle\langle\nabla_{y}g_{k+1}(\bar{x},y_{k+1}^{*}(\bar{x}))-\partial h(y_{k+1}^{*}(\bar{x})),y_{k}^{*}(x)-y_{k+1}^{*}(\bar{x})\rangle 0,\displaystyle\leq 0, (2.63)
ygk(x,yk(x))h(yk(x)),yk+1(x¯)yk(x)\displaystyle\langle\nabla_{y}g_{k}(x,y_{k}^{*}(x))-\partial h(y_{k}^{*}(x)),y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\rangle 0.\displaystyle\leq 0. (2.64)

By the monotonicity of the subdifferential h\partial h, i.e., h(yk+1(x¯))h(yk(x)),yk+1(x¯)yk(x)0\langle\partial h(y_{k+1}^{*}(\bar{x}))-\partial h(y_{k}^{*}(x)),y_{k+1}^{*}(\bar{x})-y_{k}^{*}(x)\rangle\geq 0, and proceeding similarly to the proof of Lemma 2.1, we complete the proof.

Next, we prove the iteration complexity of Algorithm 2.

Lemma 2.8.

Suppose that Assumptions 2.1 and 2.2 hold. Let {(xk,yk)}\{(x_{k},y_{k})\} be a sequence generated by Algorithm 2. If ρkL\rho_{k}\leq L and 0<ηk10<\eta_{k}\leq 1, then k1\forall k\geq 1,

Φk+1(xk+1)+f(xk+1)Φk(xk)f(xk)\displaystyle\Phi_{k+1}(x_{k+1})+f(x_{k+1})-\Phi_{k}(x_{k})-f(x_{k})
\displaystyle\leq 2ηkαkL2ykyk(xk)2+2ηkαkxgk(xk,yk)vk2\displaystyle 2\eta_{k}\alpha_{k}L^{2}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+2\eta_{k}\alpha_{k}\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}
(3ηk4αkL2ηk2ρk+1)x~k+1xk2+ρkρk+12σy2,\displaystyle-(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}})\|\tilde{x}_{k+1}-x_{k}\|^{2}+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}, (2.65)

where σy=max{yy𝒴}\sigma_{y}=\max\{\|y\|\mid y\in\mathcal{Y}\}.

Proof: The optimality condition for x~k+1\tilde{x}_{k+1} in (2.56) implies that k1\forall k\geq 1,

vk,x~k+1xk1αkx~k+1xk2ξk+1,x~k+1xk,\langle v_{k},\tilde{x}_{k+1}-x_{k}\rangle\leq-\frac{1}{\alpha_{k}}\|\tilde{x}_{k+1}-x_{k}\|^{2}-\langle\xi_{k+1},\tilde{x}_{k+1}-x_{k}\rangle, (2.66)

where ξk+1f(x~k+1)\xi_{k+1}\in\partial f(\tilde{x}_{k+1}). Similar to the proof of Lemma 2.2, by replacing (2.24) in Lemma 2.2 with (2.66), we have

Φk+1(xk+1)Φk(xk)\displaystyle\Phi_{k+1}(x_{k+1})-\Phi_{k}(x_{k})\leq 2ηkαkL2ykyk(xk)2+2ηkαkxgk(xk,yk)vk2\displaystyle 2\eta_{k}\alpha_{k}L^{2}\|y_{k}-y_{k}^{*}(x_{k})\|^{2}+2\eta_{k}\alpha_{k}\|\nabla_{x}g_{k}(x_{k},y_{k})-v_{k}\|^{2}
(3ηk4αkL2ηk2ρk+1)x~k+1xk2+ρkρk+12σy2\displaystyle-(\frac{3\eta_{k}}{4\alpha_{k}}-\frac{L^{2}\eta_{k}^{2}}{\rho_{k+1}})\|\tilde{x}_{k+1}-x_{k}\|^{2}+\frac{\rho_{k}-\rho_{k+1}}{2}\sigma_{y}^{2}
ηkξk+1,x~k+1xk.\displaystyle-\eta_{k}\langle\xi_{k+1},\tilde{x}_{k+1}-x_{k}\rangle. (2.67)

By the definition of the subgradient, we have f(xk)f(x~k+1)ξk+1,xkx~k+1f(x_{k})-f(\tilde{x}_{k+1})\geq\langle\xi_{k+1},x_{k}-\tilde{x}_{k+1}\rangle, and thus

ηkξk+1,x~k+1xkηk(f(xk)f(x~k+1)).\displaystyle-\eta_{k}\langle\xi_{k+1},\tilde{x}_{k+1}-x_{k}\rangle\leq\eta_{k}(f(x_{k})-f(\tilde{x}_{k+1})). (2.68)

Utilizing the convexity of f(x)f(x), xk+1=xk+ηk(x~k+1xk)x_{k+1}=x_{k}+\eta_{k}(\tilde{x}_{k+1}-x_{k}) and 0<ηk10<\eta_{k}\leq 1, we get

f(xk+1)(1ηk)f(xk)+ηkf(x~k+1).\displaystyle f(x_{k+1})\leq(1-\eta_{k})f(x_{k})+\eta_{k}f(\tilde{x}_{k+1}). (2.69)

The proof is completed by combining (2.67), (2.68) and (2.69).

Lemma 2.9.

Suppose that Assumptions 2.1 and 2.2 hold. Let {(xk,yk)}\{(x_{k},y_{k})\} be a sequence generated by Algorithm 2. If 0<ηk10<\eta_{k}\leq 1, 0<β16L0<\beta\leq\frac{1}{6L} and ρkL\rho_{k}\leq L, then k1\forall k\geq 1,

yk+1yk+1(xk+1)2\displaystyle\|y_{k+1}-y_{k+1}^{*}(x_{k+1})\|^{2}
\displaystyle\leq (1ηkβρk+14)ykyk(xk)23ηk4y~k+1yk2+5L2ηkρk+13βx~k+1xk2\displaystyle(1-\frac{\eta_{k}\beta\rho_{k+1}}{4})\|y_{k}-y_{k}^{*}(x_{k})\|^{2}-\frac{3\eta_{k}}{4}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{5L^{2}\eta_{k}}{\rho_{k+1}^{3}\beta}\|\tilde{x}_{k+1}-x_{k}\|^{2}
+5ηkβρk+1ygk(xk,yk)wk2+5(ρkρk+1)ρk+12ηkβ(yk+1(xk+1)2yk(xk)2).\displaystyle+\frac{5\eta_{k}\beta}{\rho_{k+1}}\|\nabla_{y}g_{k}(x_{k},y_{k})-w_{k}\|^{2}+\frac{5(\rho_{k}-\rho_{k+1})}{\rho_{k+1}^{2}\eta_{k}\beta}(\|y_{k+1}^{*}(x_{k+1})\|^{2}-\|y_{k}^{*}(x_{k})\|^{2}). (2.70)

Proof: The optimality condition for y~k+1\tilde{y}_{k+1} in (2.58) implies that y𝒴\forall y\in\mathcal{Y} and k1\forall k\geq 1,

wk,yy~k+1\displaystyle\langle w_{k},y-\tilde{y}_{k+1}\rangle 1βy~k+1yk,yy~k+1+ϱk+1,yy~k+1,\displaystyle\leq\frac{1}{\beta}\langle\tilde{y}_{k+1}-y_{k},y-\tilde{y}_{k+1}\rangle+\langle\varrho_{k+1},y-\tilde{y}_{k+1}\rangle,
=1βy~k+1yk2+1βy~k+1yk,yyk+ϱk+1,yy~k+1\displaystyle=-\frac{1}{\beta}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{1}{\beta}\langle\tilde{y}_{k+1}-y_{k},y-y_{k}\rangle+\langle\varrho_{k+1},y-\tilde{y}_{k+1}\rangle
1βy~k+1yk2+1βy~k+1yk,yyk+h(y)h(y~k+1),\displaystyle\leq-\frac{1}{\beta}\|\tilde{y}_{k+1}-y_{k}\|^{2}+\frac{1}{\beta}\langle\tilde{y}_{k+1}-y_{k},y-y_{k}\rangle+h(y)-h(\tilde{y}_{k+1}), (2.71)

where ϱk+1h(y~k+1)\varrho_{k+1}\in\partial h(\tilde{y}_{k+1}), and the last inequality holds by h(y)h(y~k+1)ϱk+1,yy~k+1h(y)-h(\tilde{y}_{k+1})\geq\langle\varrho_{k+1},y-\tilde{y}_{k+1}\rangle. Similar to the proof of Lemma 2.3, by replacing (2.1) in Lemma 2.3 with (2.71), we complete the proof.

Theorem 2.2.

Suppose that Assumptions 2.1, 2.2 and 2.3 hold. Let {(xk,yk)}\{\left(x_{k},y_{k}\right)\} be a sequence generated by Algorithm 2. For any given k1k\geq 1, let

ηk\displaystyle\eta_{k} =1(k+2)5/13,αk=a4(k+2)4/13,ρk=L(k+1)2/13,\displaystyle=\frac{1}{(k+2)^{5/13}},\quad\alpha_{k}=\frac{a_{4}}{(k+2)^{4/13}},\quad\rho_{k}=\frac{L}{(k+1)^{2/13}},
γk\displaystyle\gamma_{k} =a5(k+2)12/13,θk=a6(k+2)8/13,\displaystyle=\frac{a_{5}}{(k+2)^{12/13}},\quad\theta_{k}=\frac{a_{6}}{(k+2)^{8/13}},

with

0\displaystyle 0 <a1min{b32a4L2,ba42Lβ},0<a2min{b32a4L2,ba42Lβ},\displaystyle<a_{1}\leq\min\{\frac{b}{32a_{4}L^{2}},\frac{ba_{4}}{2L\beta}\},\quad 0<a_{2}\leq\min\{\frac{b}{32a_{4}L^{2}},\frac{ba_{4}}{2L\beta}\},
0\displaystyle 0 <a4min{18L,β85},a54a4a1+1213,a680a4a2+1213.\displaystyle<a_{4}\leq\min\{\frac{1}{8L},\frac{\beta}{8\sqrt{5}}\},\quad a_{5}\geq\frac{4a_{4}}{a_{1}}+\frac{12}{13},\quad a_{6}\geq\frac{80a_{4}}{a_{2}}+\frac{12}{13}.

If 0<β16L0<\beta\leq\frac{1}{6L}, then for any given ε>0\varepsilon>0,

T(ε)max{T~(ε),(2Lσyε)1321},T(\varepsilon)\leq\max\{\tilde{T}(\varepsilon),(\frac{2L\sigma_{y}}{\varepsilon})^{\frac{13}{2}}-1\},

where T~(ε)\tilde{T}(\varepsilon) satisfies that

ε24C1+C2ln(T~(ε)+2)d1a4(134(T~(ε)+3)4/131334/134),\frac{\varepsilon^{2}}{4}\leq\frac{C_{1}+C_{2}\ln(\tilde{T}(\varepsilon)+2)}{d_{1}a_{4}(\frac{13}{4}(\tilde{T}(\varepsilon)+3)^{4/13}-\frac{13\cdot 3^{4/13}}{4})}, (2.72)

with

C1\displaystyle C_{1} =S1(x1,y1)S¯+ρ12σy2+40α1L2ρ1η1β2ρ23σy2,d1min{18,Lβ,a1a54a4,a2a64a4},\displaystyle=S_{1}(x_{1},y_{1})-\underline{S}+\frac{\rho_{1}}{2}\sigma_{y}^{2}+\frac{40\alpha_{1}L^{2}\rho_{1}}{\eta_{1}\beta^{2}\rho_{2}^{3}}\sigma_{y}^{2},\quad d_{1}\leq\min\{\frac{1}{8},L\beta,\frac{a_{1}a_{5}}{4a_{4}},\frac{a_{2}a_{6}}{4a_{4}}\},
C2\displaystyle C_{2} =2δ2(a52a1+a62a2)b+12a2L2σy213b,S¯=minx𝒳miny𝒴Sk(x,y),\displaystyle=\frac{2\delta^{2}(a_{5}^{2}a_{1}+a_{6}^{2}a_{2})}{b}+\frac{12a_{2}L^{2}\sigma_{y}^{2}}{13b},\quad\underline{S}=\min\limits_{x\in\mathcal{X}}\min\limits_{y\in\mathcal{Y}}S_{k}(x,y),
σy\displaystyle\sigma_{y} =max{yy𝒴}.\displaystyle=\max\{\|y\|\mid y\in\mathcal{Y}\}.

Proof: Similar to the proofs of Lemmas 2.4, 2.5, 2.6 and Theorem 2.1, by replacing Fk+1(xk+1,yk+1)F_{k+1}(x_{k+1},y_{k+1}) with F~k+1(xk+1,yk+1):=Fk+1(xk+1,yk+1)+𝔼[f(xk+1)]\tilde{F}_{k+1}(x_{k+1},y_{k+1}):=F_{k+1}(x_{k+1},y_{k+1})+\mathbb{E}[f(x_{k+1})], 𝒫𝒳1/αk\mathcal{P}_{\mathcal{X}}^{1/\alpha_{k}} with Proxf,𝒳1/αk\operatorname{Prox}_{f,\mathcal{X}}^{1/\alpha_{k}}, and 𝒫𝒴1/β\mathcal{P}_{\mathcal{Y}}^{1/\beta} with Proxh,𝒴1/β\operatorname{Prox}_{h,\mathcal{Y}}^{1/\beta}, respectively, we complete the proof.

Theorem 2.2 shows that the iteration complexity of Algorithm 2 to obtain an ε\varepsilon-stationary point is upper bounded by 𝒪~(ε6.5)\tilde{\mathcal{O}}(\varepsilon^{-6.5}) under the nonsmooth setting.
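As an illustration of how the constants in Theorem 2.2 can be instantiated, the following sketch simply takes the endpoint of each admissible range; it is one (non-unique) feasible choice and is not claimed to be the tuning used in the experiments of the next section.

import math

def formda_ns_constants(L, b, beta=None):
    # One admissible instantiation of the constants in Theorem 2.2, obtained by
    # taking the endpoints of the stated ranges (a sketch only).
    if beta is None:
        beta = 1.0 / (6.0 * L)                                # 0 < beta <= 1/(6L)
    a4 = min(1.0 / (8.0 * L), beta / (8.0 * math.sqrt(5.0)))  # a4 <= min{1/(8L), beta/(8*sqrt(5))}
    a1 = min(b / (32.0 * a4 * L ** 2), b * a4 / (2.0 * L * beta))
    a2 = a1                                                   # a1 and a2 share the same range
    a5 = 4.0 * a4 / a1 + 12.0 / 13.0                          # a5 >= 4*a4/a1 + 12/13
    a6 = 80.0 * a4 / a2 + 12.0 / 13.0                         # a6 >= 80*a4/a2 + 12/13
    d1 = min(1.0 / 8.0, L * beta,
             a1 * a5 / (4.0 * a4), a2 * a6 / (4.0 * a4))      # d1 <= min{1/8, L*beta, ...}
    return {"beta": beta, "a4": a4, "a1": a1, "a2": a2,
            "a5": a5, "a6": a6, "d1": d1}

For example, formda_ns_constants(L=1.0, b=100) returns one feasible configuration; any smaller a_{1}, a_{2}, a_{4} (with correspondingly larger a_{5}, a_{6}) remain admissible.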

3 Numerical Results

In this section, we compare the numerical performance of the proposed FORMDA algorithm with some existing algorithms for solving two classes of problems. All the numerical tests are implemented in Python 3.9. The first test is run on a laptop with a 2.30GHz processor and the second one is carried out on an RTX 4090 GPU.

Firstly, we consider the following Wasserstein GAN (WGAN) problem Arjovsky ,

minφ1,φ2maxϕ1,ϕ2f(φ1,φ2,ϕ1,ϕ2)𝔼(xreal,z)𝒟[Dϕ(xreal)Dϕ(Gφ1,φ2(z))],\displaystyle\min_{\varphi_{1},\varphi_{2}}\max_{\phi_{1},\phi_{2}}f\left(\varphi_{1},\varphi_{2},\phi_{1},\phi_{2}\right)\triangleq\mathbb{E}_{\left(x^{real},z\right)\sim\mathcal{D}}\left[D_{\phi}\left(x^{real}\right)-D_{\phi}\left(G_{\varphi_{1},\varphi_{2}}(z)\right)\right],

where Gφ1,φ2(z)=φ1+φ2z,Dϕ(x)=ϕ1x+ϕ2x2,ϕ=(ϕ1,ϕ2),xrealG_{\varphi_{1},\varphi_{2}}(z)=\varphi_{1}+\varphi_{2}z,D_{\phi}(x)=\phi_{1}x+\phi_{2}x^{2},\phi=\left(\phi_{1},\phi_{2}\right),x^{real} is a normally distributed random variable with mean φ1=0\varphi_{1}^{*}=0 and variance φ2=0.1\varphi_{2}^{*}=0.1, and zz a normally distributed random variable with mean ϕ1=0\phi_{1}^{*}=0 and variance ϕ2=1\phi_{2}^{*}=1. The optimal solution is (φ1,φ2,ϕ1,ϕ2)=(0,0.1,0,1)(\varphi_{1}^{*},\varphi_{2}^{*},\phi_{1}^{*},\phi_{2}^{*})=(0,0.1,0,1).
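To make this test problem concrete, a minimal NumPy sketch of the stochastic objective and its mini-batch gradients might look as follows. The seed, the batch construction and the reading of 0.1 as the scale of x^{real} (consistent with the stated optimal solution (0,0.1,0,1)) are illustrative assumptions rather than the exact experimental setup.

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only

def sample_batch(b, mean=0.0, scale=0.1):
    # Draw a mini-batch of (x_real, z) pairs: x_real = mean + scale * n with
    # n ~ N(0,1), matching the reported optimal generator (0, 0.1), and z ~ N(0,1).
    x_real = mean + scale * rng.standard_normal(b)
    z = rng.standard_normal(b)
    return x_real, z

def wgan_objective(varphi, phi, x_real, z):
    # Mini-batch value of f(varphi1, varphi2, phi1, phi2) with
    # D_phi(t) = phi1*t + phi2*t^2 and G_varphi(z) = varphi1 + varphi2*z.
    def d(t):
        return phi[0] * t + phi[1] * t ** 2
    g = varphi[0] + varphi[1] * z
    return float(np.mean(d(x_real) - d(g)))

def wgan_grads(varphi, phi, x_real, z):
    # Mini-batch gradients with respect to varphi (min player) and phi (max player).
    g = varphi[0] + varphi[1] * z
    dD = phi[0] + 2.0 * phi[1] * g                # derivative of D_phi at G_varphi(z)
    grad_varphi = np.array([-np.mean(dD), -np.mean(dD * z)])
    grad_phi = np.array([np.mean(x_real - g), np.mean(x_real ** 2 - g ** 2)])
    return grad_varphi, grad_phi

These oracles only illustrate the structure of the problem; they are not the code used to produce Figure 1.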

We compare the numerical performance of the proposed FORMDA algorithm with the SGDA algorithm Lin2019 and the PG-SMD algorithm Rafique for solving the WGAN problem. The batch size is set to be b=100b=100 for all three tested algorithms. The parameters in FORMDA are chosen to be ηk=1(k+2)5/13\eta_{k}=\frac{1}{(k+2)^{5/13}}, αk=0.5(k+2)4/13\alpha_{k}=\frac{0.5}{(k+2)^{4/13}}, ρk=1(k+1)2/13\rho_{k}=\frac{1}{(k+1)^{2/13}}, γk=3(k+2)12/13\gamma_{k}=\frac{3}{(k+2)^{12/13}}, θk=2(k+2)8/13\theta_{k}=\frac{2}{(k+2)^{8/13}}, β=0.005\beta=0.005. All the parameters of the PG-SMD algorithm and the SGDA algorithm are chosen to be the same as those in Rafique and Lin2019 , respectively.

Figure 1: Performance of the PG-SMD algorithm, the SGDA algorithm and the FORMDA algorithm for the WGAN problem.

Figure 1 shows the average change in the stochastic gradient norm and the distance from the iterates to the optimal solution over 5 independent runs of the three tested algorithms, and the shaded area around each line indicates the standard deviation. It can be seen that the proposed FORMDA algorithm performs better than SGDA and is close in performance to PG-SMD.

Secondly, we consider a robust learning problem over multiple domains Qian . Let P1,,PMP_{1},\ldots,P_{M} denote MM distributions of data. Robust learning over multiple domains aims to minimize the maximum of the expected losses over the MM distributions; it can be formulated as the following stochastic nonconvex-concave minimax problem:

minx𝒳maxm𝔼ζPm[F(x;ζ)]=minxmaxyΔm=1Mymfm(x),\min_{x\in\mathcal{X}}\max_{m}\mathbb{E}_{\zeta\sim P_{m}}[F(x;\zeta)]=\min_{x}\max_{y\in\Delta}\sum_{m=1}^{M}y_{m}f_{m}(x),

where fm(x)=𝔼ζPm[F(x;ζ)]f_{m}(x)=\mathbb{E}_{\zeta\sim P_{m}}[F(x;\zeta)], F(x;ζ)F(x;\zeta) is a nonnegative loss function, xx denotes the network parameters, yΔy\in\Delta describes the weights over different tasks and Δ\Delta is the simplex, i.e., Δ={(y1,,yM)Tm=1Mym=1,ym0}\Delta=\left\{(y_{1},\cdots,y_{M})^{T}\mid\sum_{m=1}^{M}y_{m}=1,y_{m}\geq 0\right\}. We consider two image classification problems with the MNIST and CIFAR10 datasets. Our goal is to train a neural network that works on these two completely unrelated problems simultaneously. The quality of the algorithm for solving a robust learning problem over multiple domains is measured by the worst-case accuracy over all domains Qian .
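As a sketch of the max-player structure in this formulation, note that the inner maximization runs over the simplex \Delta, so an ascent step in y requires a Euclidean projection onto \Delta. The function names below and the use of mini-batch estimates of the per-domain losses f_{m}(x) are illustrative assumptions.

import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the simplex Delta (sorting-based method).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, v.size + 1) > (css - 1.0))[0][-1]
    tau = (css[idx] - 1.0) / (idx + 1.0)
    return np.maximum(v - tau, 0.0)

def weighted_loss_and_y_grad(y, domain_losses):
    # Objective sum_m y_m f_m(x) for a fixed x, given (mini-batch estimates of)
    # the per-domain losses f_m(x); its gradient in y is the loss vector itself.
    domain_losses = np.asarray(domain_losses, dtype=float)
    return float(y @ domain_losses), domain_losses

A projected ascent step in y then reads, for instance, y = project_simplex(y + beta * domain_losses).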

We compare the numerical performance of the proposed FORMDA algorithm with the SGDA algorithm Lin2019 and the PG-SMD algorithm Rafique . We use the cross-entropy loss as our loss function and set the batch size in all tasks to be 32. All algorithms were run for 50 epochs. The parameters in FORMDA are chosen to be ηk=1(k+12)5/13+1\eta_{k}=\frac{1}{(k+12)^{5/13}+1}, αk=0.5(k+12)2/13+1\alpha_{k}=\frac{0.5}{(k+12)^{2/13}+1}, ρk=8(k+11)2/13+2\rho_{k}=\frac{8}{(k+11)^{2/13}+2}, γk=3(k+12)8/13+1\gamma_{k}=\frac{3}{(k+12)^{8/13}+1}, θk=2(k+12)8/13+1\theta_{k}=\frac{2}{(k+12)^{8/13}+1}, β=0.001\beta=0.001. All the parameters of the PG-SMD algorithm and the SGDA algorithm are chosen to be the same as those in Rafique and Lin2019 , respectively.

Figure 2: Performance of three algorithms for solving the robust multi-task learning problem.

Figure 2 shows the testing accuracy on the MNIST and CIFAR10 datasets. It can be seen that the proposed FORMDA algorithm outperforms SGDA and has similar performance to PG-SMD.

4 Conclusions

In this paper, we propose an accelerated FORMDA algorithm for solving stochastic nonconvex-concave minimax problems. The iteration complexity of the algorithm is proved to be 𝒪~(ε6.5)\tilde{\mathcal{O}}(\varepsilon^{-6.5}) to obtain an ε\varepsilon-stationary point, which is currently the best-known complexity bound among single-loop algorithms for solving stochastic nonconvex-concave minimax problems. Moreover, we extend the analysis to settings with possibly nonsmooth objective functions and obtain the same 𝒪~(ε6.5)\tilde{\mathcal{O}}(\varepsilon^{-6.5}) iteration complexity. Numerical experiments demonstrate the efficiency of the proposed algorithm. Whether a single-loop algorithm with better complexity exists for this class of problems remains an open question worthy of further research.

Acknowledgments

The authors are supported by the National Natural Science Foundation of China under Grants Nos. 12071279 and 12471294.

Declarations

  • Conflict of interest The authors have no relevant financial or non-financial interests to disclose.

  • Data availability Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

  • (1) Abadeh S S, Esfahani P M M, and Kuhn D. Distributionally robust logistic regression. In NeurIPS, 2015: 1576-1584.
  • (2) Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. International conference on machine learning. PMLR, 2017: 214-223.
  • (3) Bertsekas D P. Nonlinear programming. Journal of the Operational Research Society, 1997, 48(3): 334-334.
  • (4) Boroun M, Alizadeh Z, Jalilzadeh A. Accelerated primal-dual scheme for a class of stochastic nonconvex-concave saddle point problems. arXiv preprint arXiv:2303.00211, 2023.
  • (5) Bot R I, Böhm A. Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax problem. SIAM Journal on Optimization, 2023, 33(3): 1884-1913.
  • (6) Böhm A. Solving nonconvex-nonconcave min-max problems exhibiting weak minty solutions. arXiv preprint arXiv:2201.12247, 2022.
  • (7) Chen Y, Lan G, Ouyang Y. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 2017, 165(1): 113-149.
  • (8) Cai Y, Oikonomou A, Zheng W. Accelerated algorithms for monotone inclusions and constrained nonconvex-nonconcave min-max optimization. arXiv preprint arXiv:2206.05248, 2022.
  • (9) Chen J, Lau V K N. Convergence analysis of saddle point problems in time varying wireless systems—Control theoretical approach. IEEE Transactions on Signal Processing, 2011, 60(1): 443-452.
  • (10) Diakonikolas J, Daskalakis C, Jordan M. Efficient methods for structured nonconvex-nonconcave min-max optimization. International Conference on Artificial Intelligence and Statistics. PMLR, 2021: 2746-2754.
  • (11) Doan T. Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems. Learning for Dynamics and Control Conference, PMLR, 2022:192-206.
  • (12) Grimmer B, Lu H, Worah P, Mirrokni V. The landscape of the proximal point method for nonconvex-nonconcave minimax optimization. Mathematical Programming, 2023, 201(1-2): 373-407.
  • (13) Giannakis G B, Ling Q, Mateos G, Schizas I D, and Zhu H. Decentralized learning for wireless communications and networking. Splitting Methods in Communication, Imaging, Science, and Engineering, Springer, Cham, 2016:461-497.
  • (14) Giordano R, Broderick T, Jordan M I. Covariances, robustness, and variational Bayes. Journal of Machine Learning Research, 2018, 19(51).
  • (15) Huang F, Gao S, Pei J, Huang H. Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization. Journal of Machine Learning Research, 2022, 23: 1-70.
  • (16) Huang F. Enhanced Adaptive Gradient Algorithms for Nonconvex-PL Minimax Optimization. arXiv preprint arXiv:2303.03984, 2023.
  • (17) Hamedani E Y, Aybat N S. A primal-dual algorithm with line search for general convex-concave saddle point problems. SIAM Journal on Optimization, 2021, 31(2): 1299-1329.
  • (18) Hajizadeh S, Lu H, Grimmer B. On the linear convergence of extra-gradient methods for nonconvex-nonconcave minimax problems. INFORMS Journal on Optimization, 2023.
  • (19) Jiang J, Chen X. Optimality conditions for nonsmooth nonconvex-nonconcave min-max problems and generative adversarial networks. arXiv preprint arXiv:2203.10914, 2022.
  • (20) Kong W, Monteiro R D C. An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. SIAM Journal on Optimization, 2021, 31(4): 2558-2585.
  • (21) Kingma D P, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (22) Lan G, Monteiro R D C. Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Mathematical Programming, 2016, 155(1-2): 511-547.
  • (23) Lin T, Jin C, Jordan M. Near-optimal algorithms for minimax optimization. Conference on Learning Theory, PMLR, 2020:2738-2779.
  • (24) Lin H, Mairal J, Harchaoui Z. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 2018, 18(212): 1-54.
  • (25) Liao W, Hong M, Farmanbar H, and Luo Z Q. Semi-asynchronous routing for large scale hierarchical networks. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015:2894-2898.
  • (26) Luo L, Ye H S, Huang Z C, Zhang T. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. NeurIPS, 2020, 33: 20566-20577.
  • (27) Liu S, Lu S, Chen X, et al. Min-max optimization without gradients: Convergence and applications to adversarial ml. arXiv preprint arXiv:1909.13806, 2019.
  • (28) Lin T, Jin C, Jordan M. On gradient descent ascent for nonconvex-concave minimax problems. International Conference on Machine Learning, PMLR, 2020:6083-6093.
  • (29) Lu S, Tsaknakis I, Hong M, Chen Y. Hybrid block successive approximation for one-sided nonconvex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 2020, 68: 3676-3691.
  • (30) Lee S, Kim D. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 2021, 34:22588-22600.
  • (31) Mokhtari A, Ozdaglar A, Pattathil S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. International Conference on Artificial Intelligence and Statistics. PMLR, 2020: 1497-1507.
  • (32) Mateos G, Bazerque J A, and Giannakis G B. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 2010, 58(10):5262-5276.
  • (33) Nouiehed M, Sanjabi M, Huang T, Lee J. D. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 2019: 14934-14942.
  • (34) Nemirovski A. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 2004, 15(1): 229-251.
  • (35) Nesterov Y. Smooth minimization of non-smooth functions. Mathematical programming, 2005, 103: 127-152.
  • (36) Nesterov Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 2007, 109(2): 319-344.
  • (37) Ouyang Y, Chen Y, Lan G, Pasiliao Jr E. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 2015, 8(1): 644-681.
  • (38) Ouyang Y, Xu Y. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 2021, 185(1): 1-35.
  • (39) Ostrovskii D, Lowy A, Razaviyayn M. Efficient search of first-order nash equilibria in nonconvex-concave smooth min-max problems. SIAM Journal on Optimization, 2021, 31(4): 2508-2538.
  • (40) Pan W, Shen J, Xu Z. An efficient algorithm for nonconvex-linear minimax optimization problem and its application in solving weighted maximin dispersion problem. Computational Optimization and Applications, 2021, 78(1): 287-306.
  • (41) Qian Q, Zhu S, Tang J, Jin R, Sun B, Li H. Robust optimization over multiple domains. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(01): 4739-4746.
  • (42) Rafique H, Liu M, Lin Q, et al. Weakly-convex-concave min-max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, 2022, 37(3): 1087-1121.
  • (43) Sanjabi M, Razaviyayn M, Lee J D. Solving non-convex non-concave min-max games under polyak-Łojasiewicz condition. arXiv preprint arXiv:1812.02878, 2018.
  • (44) Song C, Zhou Z, Zhou Y, Jiang Y, Ma Y. Optimistic dual extrapolation for coherent non-monotone variational inequalities. Advances in Neural Information Processing Systems, 2020, 33: 14303-14314.
  • (45) Shen J, Wang Z, Xu Z. Zeroth-order single-loop algorithms for nonconvex-linear minimax problems. Journal of Global Optimization, doi:10.1007/s10898-022-01169-5, 2022.
  • (46) Thekumparampil K K, Jain P, Netrapalli P, Oh S. Efficient algorithms for smooth minimax optimization. Advances in Neural Information Processing Systems, 2019: 12680-12691.
  • (47) Tominin V, Tominin Y, Borodich E, Kovalev D, Gasnikov A, Dvurechensky P. On accelerated methods for saddle-point problems with composite structure. COMPUTER, 2023, 15(2): 433-467.
  • (48) Tieleman T, Hinton G, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012, 4(2):26-31.
  • (49) Wang Z, Balasubramanian K, Ma S, Razaviyayn M. Zeroth-order algorithms for nonconvex minimax problems with improved complexities. arXiv preprint arXiv:2001.07819, 2020.
  • (50) Xu Z, Zhang H, Xu Y, Lan G. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems. Mathematical Programming, 2023, 201:635-706.
  • (51) Xu T, Wang Z, Liang Y, Poor H V. Gradient free minimax optimization: Variance reduction and faster convergence. arXiv preprint arXiv:2006.09361, 2020.
  • (52) Xu Z, Wang Z, Shen J, Dai Y H. Derivative-free alternating projection algorithms for general nonconvex-concave minimax problems. SIAM Journal on Optimization, 2024, 34(2): 1879-1908.
  • (53) Xu Z, Wang Z Q, Wang J L, Dai Y H. Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems. Journal of Machine Learning Research, 2023, 24(313): 1-25.
  • (54) Yang J, Kiyavash N, He N. Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 2020, 33: 1153-1165.
  • (55) Yang J, Orvieto A, Lucchi A, et al. Faster single-loop algorithms for minimax optimization without strong concavity. International Conference on Artificial Intelligence and Statistics. PMLR, 2022: 5485-5517.
  • (56) Zhang J, Xiao P, Sun R, Luo Z. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. Advances in Neural Information Processing Systems, 2020, 33:7377-7389.
  • (57) Zhang G, Wang Y, Lessard L, Grosse R B. Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. International Conference on Artificial Intelligence and Statistics, PMLR, 2022: 7659-7679.
  • (58) Zhang X, Aybat N S, Gurbuzbalaban M. Sapd+: An accelerated stochastic method for nonconvex-concave minimax problems. Advances in Neural Information Processing Systems, 2022, 35: 21668-21681.