Diffusion Stochastic Optimization for Min-Max Problems

Haoyuan Cai^∗, , Sulaiman A. Alghunaim^∙, , and Ali H. Sayed^∗,
^∙Kuwait University, Kuwait
^∗École Polytechnique Fédérale de Lausanne, Switzerland A short version of this work without derivations will appear in the conference publication [1]. Emails: [email protected], [email protected], [email protected]

Abstract

The optimistic gradient method is useful in addressing minimax optimization problems. Motivated by the observation that the conventional stochastic version suffers from the need for a large batch size on the order of $\mathcal{O}(\varepsilon^{-2})$ to achieve an $\varepsilon$ -stationary solution, we introduce and analyze a new formulation termed Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG). We prove its convergence and resolve the large batch issue by establishing a tighter upper bound, under the more general setting of nonconvex Polyak-Lojasiewicz (PL) risk functions. We also extend the applicability of the proposed method to the distributed scenario, where agents communicate with their neighbors via a left-stochastic protocol. To implement DSS-OG, we can query the stochastic gradient oracles in parallel with some extra memory overhead, resulting in a complexity comparable to its conventional counterpart. To demonstrate the efficacy of the proposed algorithm, we conduct tests by training generative adversarial networks.

Index Terms:

Minimax optimization, nonconvex Polyak-Lojasiewicz, distributed stochastic optimization, stochastic optimistic gradient, diffusion strategy

I Introduction

Minimax optimization takes center stage in many machine learning applications, including generative adversarial networks (GANs) [2], adversarial machine learning [3], and reinforcement learning [4].

Minimax problems take the form:

\displaystyle\underset{x\in\mathbb{R}^{M_{1}}}{\min}\underset{y\in\mathbb{R}^{M_{2}}}{\max}\ J(x,y)

(1)

where $x$ and $y$ are the primal and dual variables, respectively, and $J(x,y)$ is the risk (objective) function. This problem is typically approached from various perspectives. In the optimization literature, assuming certain structure for the risk function over the variables is prevalent. For instance, the risk function can be assumed (strongly) convex in the primal variable and (strongly) concave in the dual variable [5, 6] or it can be assumed one-sided nonconvex/Polyak-Lojasiewicz (PL) [7, 8, 9] in the primal variable and one-sided strongly concave/PL in the dual variable. Some works assume a two-sided PL condition[10]. It is noted that the gradient operator of a (strongly) convex or (strongly) concave objective is also (strongly) monotone; therefore, the optimization problem is generalized to a variational inequality (VI) problem in some settings. Solving VI problems heavily relies on the concept of “operators” and their associated properties. The goal of a VI problem is to determine the solution at which the gradient operator satisfies variational stability conditions, such as the Stampacchia variational inequality (SVI), which is a first-order necessary optimality condition [11, 12]. In a similar vein, the Minty variational inequality (MVI) condition and its weak version, which include a subclass of nonmonotone problems, has also gained popularity in recent years [13, 14, 15, 16, 17]. The settings mentioned above represent the tractable conditions for which convergent algorithms are known to exist. On the other hand, the nonconvex-nonconcave formulation, which is at the core of a wide class of minimax objectives, remains intractable [18]. This fact corroborates the significant difficulty encountered in training GANs, as they often involve nonconvex-nonconcave objectives.

The optimistic gradient (OG) method and its variants have attracted significant interest for their proven convergence in wide formulations of (1) [19, 15, 14, 7, 13, 20]. However, a good part of the theory from this body of literature breaks down when the stochastic setting is taken into consideration where only a noisy random estimate of the gradient is available. For instance, the work [7] showed that employing a batch size on the order of $\mathcal{O}(\varepsilon^{-2})$ is necessary to estimate the gradient and in order to control the gradient noise for stochastic OG method in a nonconvex-strongly concave (SC) setting. Here, the scalar $\varepsilon$ denotes the desired level of accuracy in the solution, which is generally a very small number. Similarly, the work [20] established similar results with a batch size $\mathcal{O}(\varepsilon^{-4})$ to achieve convergence in the stochastic MVI setting. These are not isolated cases; a similar need for using large batch sizes appears in the works [14, 13, 9, 15]. The reliance on batch sizes that are inversely proportional to $\varepsilon$ is a serious theoretical limitation since it rules out the freedom of using small batches. Meanwhile, empirical results showed that stochastic OG may not be subject to such a restrictive requirement [21, 20].

In this work, we investigate distributed stochastic minimax problems where a network of $K$ agents cooperate to solve the following problem:


$\displaystyle\underset{x\in\mathbb{R}^{M_{1}}}{\min}\underset{y\in\mathbb{R}^{M_{2}}}{\max}\ J(x,y)$	$\displaystyle=\sum_{k=1}^{K}p_{k}J_{k}(x,y)$	(2a)
$\displaystyle\text{where}\quad J_{k}(x,y)$	$\displaystyle\triangleq\mathbb{E}_{\bm{\xi}_{k}}Q_{k}(x,y;\bm{\xi}_{k})$	(2b)

Here, $x,y$ are the global primal and dual variables, the function $Q_{k}(x,y;\bm{\xi}_{k})$ is the local stochastic loss evaluated at the sample $\bm{\xi}_{k}$ . The scalars $p_{k}$ are positive weights satisfying $\sum_{k=1}^{K}p_{k}=1$ . We assume that each agent $k$ only has access to its local data set $\{\bm{\xi}_{k}\}$ . We consider a fully distributed (decentralized) setting where the agents are connected by a graph topology and each agent can only share information with its immediate neighbors [22]. We approach (2a)-(2b) from the perspective of optimization, adopting assumptions that are more general than the nonconvex-SC formulation. Instead of assuming the dual risk function to be SC in the dual variable, we consider the weaker PL-condition [23]. Specifically, we adopt the following assumption.

Assumption 1 (Risk function).

We assume each local risk function $J_{k}(x,y)$ is nonconvex in $x$ while $-J(x,y)$ is $\nu$ -PL in $y$ , i.e., for any $x$ and $y$ in the domain of $J(x,y)$ , it holds that

\|\nabla_{y}J(x,y)\|^{2}\geq 2\nu(\underset{y}{\max}\ J(x,y)-J(x,y))

(3)

where $\nu$ is some strictly positive constant. $\hfill{\square}$

Under this assumption, we are interested in answering the following questions: Q1: Can the reliance on large batch sizes be eliminated in the stochastic OG method? Q2: What are the performance guarantees of the resulting batch-flexible stochastic OG over networks?

I-A Related works

Minimax optimization problems have been largely studied under the single node setting [6, 24, 5, 9, 25, 26, 27, 28, 29, 30]. The fundamental setting assumed within these works is the strong convexity-strong concavity/strong monotonicity. For this setting, the work [6] established the tight complexity bound of the first-order algorithm in finding a near-optimal solution. The work [5] devised the first optimal algorithm that matches this bound. The work [24] showed that the alternating gradient descent-ascent (AGDA) method converges faster than its simultaneous counterpart. The work [25] proposed a stochastic variance-reduced extragradient (EG) method, which showed improved performance in training GANs. The quasi-Newton methods have also been proposed to achieve a faster convergence rate in finding the saddle point [26]. Dropping “strongly” setting from both sides of strong convexity-strong concavity brings about convergence issues where simultaneous GDA will suffer from divergent rotation while AGDA will get stuck in a limit cycle rather than converge to a limit point [24, 31]. For the convex-concave setting, the work [29] established the sublinear rate of $\mathcal{O}(\frac{1}{T})$ for the best-iterate of the stochastic OG method, namely the best iterate produced over running $T$ iterations. Achieving last-iterate convergence is more difficult; this aspect of OG was not addressed until the work [28]. Furthermore, the work [27] showed that the average-iterate of EG, namely the average of all iterates produced over iterations, has a convergence rate of $\mathcal{O}(\frac{1}{T^{2}})$ which is faster than its last-iterate. The work [30] also showed anchoring EG has a convergence rate of $\mathcal{O}(\frac{1}{T^{2}})$ without considering stochasticity. There is another line of research that investigates the structured nonmonotone problem, particularly under the premise of the (weak) Minty variational inequality— see [16, 17, 15].

In recent years, there has been an increasing interest towards the stochastic nonconvex (one-sided) minimax optimization problem. The work [9] established the convergence of stochastic GDA (SGDA) in finding the primal solution under a nonconvex-SC formulation, with the use of batch size $\mathcal{O}(\varepsilon^{-2})$ to reach $\varepsilon$ error accuracy. The work [32] proposed to incorporate a concave maximizer and variance reduction into SGDA, which is computationally costly since it performs multi-loop steps in updating the dual variable. The work [10] studied the two-sided PL setting and established the linear convergence of variance-reduced stochastic AGDA (SAGDA). Additionally, the work [8] showed that employing two iid samples for SAGDA obtains $\mathcal{O}(\frac{1}{T^{1/2}})$ convergence rate for the primal solution in nonconvex-PL setting. However, vanilla AGDA does not converge in bilinear scenarios[31], which casts doubt on its ability to solve more complex settings. Exploiting the convergence of the backbone minimax algorithms across various settings, say OG/EG, is always of independent interest. Unfortunately, the recent work [7] showed a less than ideal result under the nonconvex-SC formulation as the large batch issue persists. This fact strongly motivates us to pursue better results that theoretically address this limitation.

Over the past decade, distributed optimization has become an increasingly important field due to its flexibility in processing large volume and physically dispersed data [22, 33]. The distributed minimax optimization scenario remains an open field to be exploited since there are many diverse settings. Some distributed nonconvex minimax works appear in [34, 35, 36, 37, 14, 20]. The work [34] showed the convergence rate of $\mathcal{O}(\frac{1}{T^{2/3}})$ for the primal solution using STORM momentum-based [38] SGDA, under a nonconvex-SC setting. The work [37] extended the analysis to the nonconvex-PL setting. Other works [35, 36] introduced variance-reduced SGDA for the nonconvex-SC setting. However, variance reduction requires an excessively large batch size to estimate the gradient at the checkpoint, making it hard to tune. On the other hand, the works [34, 38, 37, 36, 35] rely on the Lipschitz continuity of the stochastic loss gradient to carry out the analysis. Relaxing this condition leads to a limited applicability of the results [39]. The work [14] proposed a distributed version of stochastic EG for solving stochastic MVI in a time-varying network. The work [20] proposed a single-call EG [40] for distributed parallel training of the GANs. However, both results face the same large batch size issue.

Works	Assumptions	Primal Rate	Dual Rate	Batch size
DPOSG [20]	Stochastic MVI, bounded gradient, D-S	$\mathcal{O}(\frac{1}{T^{1/4}}+\frac{\sigma^{2}}{\sqrt{KB}})$	$\mathcal{O}(\frac{1}{T^{1/4}}+\frac{\sigma^{2}}{\sqrt{KB}})$	$\mathcal{O}(\varepsilon^{-4})$
Stochastic EG [14]	Stochastic MVI, Lipschitz risk gradient, D-S	$\mathcal{O}(\frac{1}{T^{1/2}}+\frac{\sigma^{2}}{B})$	$\mathcal{O}(\frac{1}{T^{1/2}}+\frac{\sigma^{2}}{B})$	$\mathcal{O}(\varepsilon^{-2})$
DM-HSGD [34]	Stochastic nonconvex-SC, Lipschitz loss gradient, D-S	$\mathcal{O}(\frac{1}{T^{2/3}})$	——–	$\mathcal{O}(\frac{1}{\min\{1,K\epsilon\}})$
DM-GDA [37]	Stochastic nonconvex-PL, Lipschitz loss gradient, D-S	$\mathcal{O}(\frac{1}{T^{1/3}})$	——–	$\mathcal{O}(1)$
DSGDA [35]	Finite-sum, nonconvex-SC, Lipschitz loss gradient, D-S	$\mathcal{O}(\frac{1}{T})$	——–	$\mathcal{O}(\sqrt{n})$
DSS-OG (this work)	Stochastic nonconvex-PL, Lipschitz risk gradient, L-S	$\mathcal{O}(\frac{1}{T^{1/2}})$	$\mathcal{O}(\frac{1}{T})$	$\mathcal{O}(1)$

TABLE I: Comparison of the convergence rate of different distributed minimax algorithms. The works [34, 35] seek to find a approximate solution

\bm{x}^{\star}

of the primal objective

P(x)

to defined in (12) in the sense that

\mathbb{E}\|\nabla P(\bm{x}^{\star})\|^{2}\leq\varepsilon^{2}

while the work [37] uses the criterion

\mathbb{E}\|\nabla P(\bm{x}^{\star})\|\leq\varepsilon

. This work and [20, 14] seek to find the approximate solution

(\bm{x}^{\star},\bm{y}^{\star})

under the convergence criterion

\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star},{\bm{y}}^{\star})\|^{2}\leq\varepsilon^{2},\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star},{\bm{y}}^{\star})\|^{2}\leq\varepsilon^{2}

. The above table shows the convergence rate of the primal and dual variables after running

T

iterations, respectively. D-S: doubly stochastic, L-S: left stochastic,

T

: number of iterations,

B

: batch size employed to achieve an

\varepsilon

-stationary point,

\sigma^{2}

: gradient noise variance,

n

: number of total samples in finite-sum setting.

I-B Main Contributions

Our main contributions in this work are as follows:

•

We introduce and analyze a new variant of stochastic OG that achieves a convergence rate of $\mathcal{O}(\frac{1}{T^{1/2}})$ in finding the primal solution under the nonconvex-PL setting. Our analysis improves the existing performance bound by addressing the issue of large batch size [7]. We also provide insights into theoretical limitations that prevent the conventional stochastic OG from achieving better results.
•

We extend the applicability of stochastic OG to a fully distributed setting. We consider left-stochastic combination matrices, which are more general than the doubly-stochastic setting considered in [34, 35, 36, 37, 14, 20]. We also carry out convergence analysis without assuming the smoothness of the stochastic loss gradient.
•

We establish convergence results under more relaxed conditions, ensuring convergence for both primal and dual solutions. Previous works primarily focused on the convergence of the primal objective [34, 35, 36, 37]. We show that the dual solution converges at a rate of $\mathcal{O}(\frac{1}{T})$ for our method. Please refer to Table I for the comparison of our results with existing works.

Notations. We use lowercase normal font $x$ to represent deterministic scalars or vectors, and lowercase bold font $\bm{x}$ for stochastic scalars or vectors. We use upper case letters $A$ for matrices and calligraphic letters $\bm{\mathcal{X}}$ for network quantities. The symbol $\mathcal{N}_{k}$ denotes the set of neighbors of agent $k$ , and $a_{\ell k}$ represents the scaling weight for information flowing from agent $\ell$ to agent $k$ . $\mathbbm{1}_{M}$ denotes the $M$ -dimensional vector of all ones. The identity matrix of size $M$ is denoted by $I_{M}$ . The Kronecker product operator is denoted by $\otimes$ , $\|\cdot\|$ stands for the $\ell_{2}$ -norm, and $\langle\cdot,\cdot\rangle$ denotes the inner product operator.

II Algorithm Description

II-A Centralized Scenario

We start with a centralized scenario where the data processing and model training are carried out at a fusion center. Specifically, we consider initially the minimax problem with $J(x,y)=\mathbb{E}Q(x,y,\bm{\xi})$ formulated at a stand-alone agent or fusion center. Several variants of the stochastic OG method will be revisited here to motivate our new formulation. It is noted that these variants were originally proposed in [19, 40] in the deterministic setting, whereas we consider the stochastic version.

The extragradient (EG) method traces back to the seminal work [19], which is based on a heuristic idea, namely “extrapolation”. It proposed to utilize the gradient at the extrapolated point, rather than the gradient at the current point for carrying out the update. Doing so leads to increased stability by alleviating the rotational force. Such a scheme was also applied to other optimization problems[41]. The stochastic EG (S-EG) was studied in the works [7, 14]. This algorithm starts from iteration $i=0$ , say, with randomly initialized variables $\overline{\bm{x}}_{0},\overline{\bm{y}}_{0}$ and recursively performs the following steps:

S-EG	$\displaystyle\bm{x}_{i}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\overline{\bm{x}}_{i},\overline{\bm{y}}_{i};\overline{\bm{\xi}}_{x,i})\quad$	(4a)
S-EG	$\displaystyle\bm{y}_{i}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\overline{\bm{x}}_{i},\overline{\bm{y}}_{i};\overline{\bm{\xi}}_{y,i})$	(4b)
S-EG	$\displaystyle\overline{\bm{x}}_{i+1}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})$	(4c)
S-EG	$\displaystyle\overline{\bm{y}}_{i+1}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i})$	(4d)

Here, $(\bm{x}_{i},\bm{y}_{i})$ denotes the extrapolated point, and $(\overline{\bm{x}}_{i+1},\overline{\bm{y}}_{i+1})$ is the output. The variables $(\overline{\bm{\xi}}_{x,i},\overline{\bm{\xi}}_{y,i},\bm{\xi}_{x,i},\bm{\xi}_{y,i})$ are iid samples drawn during different steps and for updating different variables. We use the subscript and bar notation in these iid samples to highlight at which point and for which variables they are drawn. For instance, $\overline{\bm{\xi}}_{x,i}$ denotes an iid sample drawn at iteration $i$ for constructing the stochastic gradient at point $(\overline{\bm{x}}_{i},\overline{\bm{y}}_{i})$ and for updating the primal variable $\bm{x}$ . Additionally, the scalars $\mu_{x},\mu_{y}>0$ are the stepsizes or learning rates. The S-EG needs to query two stochastic gradient oracles to update each variable. It is also executed in a sequential manner, i.e., determining the extrapolated point first before performing the update.

To reduce the computational complexity, a single-call version of EG known as past EG (PEG), was introduced in the work [40]. It proposed to use the past gradient at the point $(\bm{x}_{i-1},\bm{y}_{i-1})$ . The stochastic form of PEG is characterized by the following steps [42, 20]:

S-PEG	$\displaystyle\bm{x}_{i}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i-1})$	(5a)
S-PEG	$\displaystyle\bm{y}_{i}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i-1})$	(5b)
S-PEG	$\displaystyle\overline{\bm{x}}_{i+1}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})$	(5c)
S-PEG	$\displaystyle\overline{\bm{y}}_{i+1}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i})$	(5d)

PEG can also be derived from the forward-reflected-backward framework [43]. It has attracted wide interest in theoretical aspects and obtains empirical success in practical applications [28, 21, 20].

The standard form of S-PEG, also referred to as the stochastic OG (S-OG), is often utilized for theoretical analysis [7]. The updates in Eqs. (5a)–(5d) can be simplified by rewriting them in terms of $(\bm{x}_{i+1},\bm{y}_{i+1})$ at each iteration $i$ [7, 42]:

S-OG	$\displaystyle\bm{x}_{i+1}=\bm{x}_{i}-\mu_{x}\Big{[}2\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})$
S-OG	$\displaystyle\qquad\quad+\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i-1})\Big{]}$	(6a)
S-OG	$\displaystyle\bm{y}_{i+1}=\bm{y}_{i}+\mu_{y}\Big{[}2\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i})$
S-OG	$\displaystyle\qquad\quad-\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i-1})\Big{]}$	(6b)

S-OG involves two different samples, namely $\bm{\xi}_{x,i}$ and $\bm{\xi}_{x,i-1}$ , for updating the primal variable in different steps (similar to the dual variable). We observe that $\bm{\xi}_{x,i}$ and $\bm{\xi}_{x,i-1}$ are actually approximating different losses whose landscape differs. The gradient $\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})$ constructed by $\bm{\xi}_{x,i}$ can deviate from that at the same point $(\bm{x}_{i},\bm{y}_{i})$ when it is evaluated at $\bm{\xi}_{x,i-1}$ . In other words, S-OG is not making the “optimistic” direction under the same landscape in a strict sense. When the sample variance is high, the deviation between the loss landscapes constructed from $\bm{\xi}_{x,i}$ and $\bm{\xi}_{x,i-1}$ could be high, leading to a poor approximation of the update direction [44]. On the other hand, the stochastic gradients involved in (6a)–(6b) are not martingale processes, which cause difficulty for analysis (see Section III-E for discussion).

We instead introduce a new formulation using the same-sample approximation. For each variable, it utilizes the gradients from the same loss landscape to perform “optimistic” scheme; these losses are also unbiased for the true risk. This can be verified by taking expectation over the distribution of the current samples $(\bm{\xi}_{x,i}$ , $\bm{\xi}_{y,i})$ . This approach involves recomputing the stochastic gradient at the point $(\bm{x}_{i-1},\bm{y}_{i-1})$ using the current samples $(\bm{\xi}_{x,i},\bm{\xi}_{y,i})$ :

SS-OG	$\displaystyle\bm{x}_{i+1}=\bm{x}_{i}-2\mu_{x}\Big{[}\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})$
SS-OG	$\displaystyle\qquad\quad+\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i})\Big{]}$	(7a)
SS-OG	$\displaystyle\bm{y}_{i+1}=\bm{y}_{i}+2\mu_{y}\Big{[}\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i})$
SS-OG	$\displaystyle\qquad\quad-\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i})\Big{]}$	(7b)

SS-OG requires double call of stochastic gradient oracles for each variable. However, it can be executed in parallel with some extra memory overhead, making it preferred in comparison to the sequential approaches, e.g., S-EG and S-PEG.

Algorithm 1 Stochastic Same-Sample Optimistic Gradient (SS-OG)

\bm{x}_{i},\bm{y}_{i}{\scriptstyle(i=-1,-2)}

, step sizes

\mu_{x}

\mu_{y}

1: for

i=0,1,\dots

2: Compute stochastic gradient with samples

\bm{\xi}_{x,i},\bm{\xi}_{y,i}

\displaystyle\begin{aligned} {\bm{g}}_{x,i-1}=2\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i})-\nabla_{x}Q(\bm{x}_{i-2},\bm{y}_{i-2};\bm{\xi}_{x,i})\end{aligned}

\displaystyle\begin{aligned} {\bm{g}}_{y,i-1}=2\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i})-\nabla_{y}Q(\bm{x}_{i-2},\bm{y}_{i-2};\bm{\xi}_{y,i})\end{aligned}

5: Adaptation

\displaystyle{\bm{x}}_{i}=\bm{x}_{i-1}-\mu_{x}\bm{g}_{x,i-1}

\displaystyle{\bm{y}}_{i}=\bm{y}_{i-1}+\mu_{y}\bm{g}_{y,i-1}

8: end for

II-B Distributed scenario

In this section, we extend SS-OG to the distributed setting (2a)–(2b). Under this scenario, each agent $k$ utilizes a local sample $\{\bm{\xi}_{k}\}$ to approximate the true gradient by using its loss value. Several distributed learning strategies for minimization problems have been investigated in [22, 33] and the references therein. Here, we integrate (6a)-(6b) with the adapt-then-combine (ATC) diffusion learning strategy from [22, 33]. This strategy has been shown in these references to outperform the consensus strategy, which has been applied to minimax problems as well in works such as [20, 35]. For this reason, we focus on diffusion learning and refer to the proposed algorithm as the Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG). It is listed in Algorithm 1. In this algorithm, $a_{\ell k}$ is the scalar weight used by agent $k$ to scale information received from agent $\ell\in\mathcal{N}_{k}$ and $a_{\ell k}=0$ for $\ell\notin\mathcal{N}_{k}$ . This algorithm starts from $i=0$ with randomly initialized local weights $\bm{x}_{k,i},\bm{y}_{k,i}(i=-1,-2)$ . Each agent first collect two iid samples, namely $\bm{\xi}^{x}_{k,i}$ and $\bm{\xi}^{y}_{k,i}$ , to approximate the OG direction for updating the local primal and dual variables, respectively. After then the local models of each agent are sent to its immediate neighbors for aggregation using the weights $\{a_{\ell k}\}$ .

Algorithm 2 Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG)

\bm{x}_{k,i},\bm{y}_{k,i}{\scriptstyle(i=-1,-2)}

, step sizes

\mu_{x}

\mu_{y}

1: for

i=0,1,\dots

2: for each agent

k

3: Compute stochastic gradient with sample

\bm{\xi}^{x}_{k,i},\bm{\xi}^{y}_{k,i}

\displaystyle\begin{aligned} {\bm{g}}_{x,k,i-1}=&\ 2\nabla_{x}Q_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1};\bm{\xi}^{x}_{k,i})\\ &-\nabla_{x}Q_{k}(\bm{x}_{k,i-2},\bm{y}_{k,i-2};\bm{\xi}^{x}_{k,i})\end{aligned}

\displaystyle\begin{aligned} {\bm{g}}_{y,k,i-1}=&\ 2\nabla_{y}Q_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1};\bm{\xi}^{y}_{k,i})\\ &-\nabla_{y}Q_{k}(\bm{x}_{k,i-2},\bm{y}_{k,i-2};\bm{\xi}^{y}_{k,i})\end{aligned}

6: Adaptation

\displaystyle{\bm{\phi}}_{k,i}=\bm{x}_{k,i-1}-\mu_{x}\bm{g}_{x,k,i-1}

\displaystyle{\bm{\psi}}_{k,i}=\bm{y}_{k,i-1}+\mu_{y}\bm{g}_{y,k,i-1}

9: Combination

10:

\bm{x}_{k,i}=\underset{\ell\in\mathcal{N}_{k}}{\sum}a_{\ell k}{\bm{\phi}}_{\ell,i},\quad\displaystyle\bm{y}_{k,i}=\underset{\ell\in\mathcal{N}_{k}}{\sum}a_{\ell k}{\bm{\psi}}_{\ell,i}

11: end for

12: end for

For convenience of the analysis, we introduce the following network quantities:


$\displaystyle\hskip 30.00005pt{\bm{x}}_{c,i}$	$\displaystyle\triangleq\sum_{k=1}^{K}p_{k}\bm{x}_{k,i}\in\mathbb{R}^{M_{1}\times 1}\quad\text{(centroid)}$	(8a)
$\displaystyle\bm{g}_{x,c,i}$	$\displaystyle\triangleq\sum_{k=1}^{K}p_{k}{\bm{g}}_{x,k,i}\in\mathbb{R}^{M_{1}\times 1}$	(8b)
$\displaystyle\bm{\mathcal{X}}_{i}$	$\displaystyle\triangleq\mbox{\rm col}\{\bm{x}_{1,i},\dots,\bm{x}_{K,i}\}\in\mathbb{R}^{M_{1}K\times 1}$	(8c)
$\displaystyle\bm{{\mathcal{X}}}_{c,i}$	$\displaystyle\triangleq\mbox{\rm col}\{\bm{x}_{c,i},\dots,\bm{x}_{c,i}\}\in\mathbb{R}^{M_{1}K\times 1}$	(8d)
$\displaystyle\bm{\mathcal{G}}_{x,i}$	$\displaystyle\triangleq\mbox{\rm col}\{\bm{g}_{x,1,i},\dots,\bm{g}_{x,K,i}\}\in\mathbb{R}^{M_{1}K\times 1}$	(8e)

and similarly for ${\bm{y}}_{c,i},\bm{g}_{y,c,i}\in\mathbb{R}^{M_{2}\times 1},\bm{\mathcal{Y}}_{i},\bm{\mathcal{Y}}_{c,i},\bm{\mathcal{G}}_{y,i}\in\mathbb{R}^{M_{2}K\times 1}$ . Additionally, we use ${{g}}_{x,c,i},{{g}}_{y,c,i},{\mathcal{G}}_{x,i},{\mathcal{G}}_{y,i}$ refer to deterministic realizations of the random quantities $\bm{g}_{x,c,i}$ , $\bm{g}_{y,c,i}$ , $\bm{\mathcal{G}}_{x,i}$ , $\bm{\mathcal{G}}_{y,i}$ . If we further introduce the network combination matrices:


$\displaystyle A$	$\displaystyle=[a_{\ell k}]\in\mathbb{R}^{K\times K}$	(9a)
$\displaystyle\mathcal{A}_{1}$	$\displaystyle\triangleq A\otimes I_{M_{1}}\in\mathbb{R}^{M_{1}K\times M_{1}K}$	(9b)
$\displaystyle\mathcal{A}_{2}$	$\displaystyle\triangleq A\otimes I_{M_{2}}\in\mathbb{R}^{M_{2}K\times M_{2}K}$	(9c)

then, the DSS-OG recursions be expressed globally the following recursion:


$\displaystyle\bm{\mathcal{X}}_{i}$	$\displaystyle=\mathcal{A}_{1}^{\top}\{\bm{\mathcal{X}}_{i-1}-\mu_{x}\bm{\mathcal{G}}_{x,i-1}\}$	(10a)
$\displaystyle\bm{\mathcal{Y}}_{i}$	$\displaystyle=\mathcal{A}_{2}^{\top}\{\bm{\mathcal{Y}}_{i-1}+\mu_{y}\bm{\mathcal{G}}_{y,i-1}\}$	(10b)

III Convergence Analysis

We now proceed to analyze the convergence properties of DSS-OG. We do so under the following assumptions, which are generally more relaxed than similar conditions in the prior literature.

III-A Assumptions

Assumption 2.

We assume the gradients associated with each local risk function are $L_{f}$ -Lipschitz, i.e.,

		$\displaystyle\\|\nabla_{w}J_{k}(x_{1},y_{1})-\nabla_{w}J_{k}(x_{2},y_{2})\\|\quad(w=x\text{ or }y)$		(11)
		$\displaystyle\leq L_{f}\Big{(}\\|x_{1}-x_{2}\\|+\\|y_{1}-y_{2}\\|\Big{)}$

Remark 1.

In this work, we will only adopt the above assumption for true gradients. This is in contrast to the works [34, 35, 36, 37], where smoothness of the local stochastic loss is assumed instead. It should be noted that if we only assume Assumption 2, the computational benefits of all momentum techniques used in [34, 37] are nullified, with the gradient evaluation complexity degrading to $\mathcal{O}(\varepsilon^{-4})$ . This is the tight lower bound for solving a nonconvex problems developed in [39]. Moreover, the primal objective introduced in (12) is essentially nonconvex. $\hfill{\square}$

Since the primal problem is to minimize over $x$ in (2a), we introduce the primal objective[34, 35, 36, 37]:

P(x)={\max}_{y}\ J(x,y)

(12)

Assumption 3.

We assume $P(x)$ is lower bounded, i.e., $P^{\star}=\operatorname{inf}_{x}P(x)>-\infty$ .

Assumption 4.

We denote the filtration generated by the random processes as $\bm{\mathcal{F}}_{i}=\{(\bm{x}_{k,j},\bm{y}_{k,j})\mid k=1,\dots,K,j=-2,-1,\dots,i\}$ . We assume the stochastic gradients are unbiased with bounded variance while taking expectation over the random sample $\bm{\xi}^{w}_{k,i}$ conditioned on $\bm{\mathcal{F}}_{i}$ , i.e.,

		$\displaystyle\mathbb{E}\Big{\{}\nabla_{w}Q_{k}(\bm{x}_{k,i},\bm{y}_{k,i};\bm{\xi}^{w}_{k,i})\mid\bm{\mathcal{F}}_{i}\Big{\}}=\nabla_{w}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})$		(13)
		$\displaystyle\mathbb{E}\Big{\{}\\|\nabla_{w}Q_{k}(\bm{x}_{k,i},\bm{y}_{k,i};\bm{\xi}^{w}_{k,i})-\nabla_{w}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})\\|^{2}\mid\bm{\mathcal{F}}_{i}\Big{\}}$
		$\displaystyle\leq{\sigma^{2}_{k}}\quad(w=x\text{ or }y)$

Moreover, we assume the data samples $\bm{\xi}^{w}_{k,i}$ are independent of each other for all $k,i,w$ .

Assumption 5.

We assume the disagreement between the local and global gradients is bounded, i.e.,

\displaystyle\|\nabla_{w}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})-\nabla_{w}J(\bm{x}_{k,i},\bm{y}_{k,i})\|^{2}

\displaystyle\leq G^{2}\ (w=x\text{ or }y)

(14)

The above assumptions can be relaxed by adopting more sophisticated distributed learning strategies, e.g., exact diffusion and gradient tracking [45, 46, 47]; we leave these extensions for future works.

Assumption 6.

The graph is strongly connected and the matrix $A=[a_{\ell,k}]$ is left-stochastic. As a result, it holds that $\mathbbm{1}^{\top}A=\mathbbm{1}^{\top}$ and $A^{r}$ has strictly positive entries for some integer $r$ .

This assumption ensures that $A$ has a single maximum eigenvalue at $1$ with Perron eigenvector $p$ with positive entries such that $Ap=p,\mathbbm{1}^{\top}p=1$ . Furthermore, the combination matrix $A$ admits the Jordan decomposition $A=VJV^{-1}$ where [22]:

V=\begin{bmatrix}p,V_{R}\end{bmatrix},J=\begin{bmatrix}1&0\\ 0&J_{\gamma}\end{bmatrix},V^{-1}=\begin{bmatrix}\mathbbm{1}^{\top}\\ V^{\top}_{L}\end{bmatrix}

(15)

Here, the submatrix $J_{\gamma}$ consists of Jordan blocks with arbitrarily small values on the lower diagonal, and it holds that $\mathbbm{1}^{\top}V_{R}=0$ , $V^{\top}_{L}V_{R}=I_{K-1}$ , $\|J_{\gamma}\|<1$ . For convenience of the analysis, we let $\mathcal{J}_{\gamma_{1}}\triangleq J_{\gamma}\otimes I_{M_{1}},\mathcal{J}_{\gamma_{2}}\triangleq J_{\gamma}\otimes I_{M_{2}},\mathcal{V}_{R_{1}}\triangleq V_{R}\otimes I_{M_{1}},\mathcal{V}_{R_{2}}\triangleq V_{R}\otimes I_{M_{2}},\mathcal{V}_{L_{1}}\triangleq V_{L}\otimes I_{M_{1}},\mathcal{V}_{L_{2}}\triangleq V_{L}\otimes I_{M_{2}}$ . It is not difficult to verify $\|\mathcal{J}_{\gamma_{1}}\|=\|\mathcal{J}_{\gamma_{2}}\|,\|\mathcal{V}_{R_{1}}\|=\|\mathcal{V}_{R_{2}}\|,\|V_{L_{1}}\|=\|V_{L_{2}}\|$ .

III-B Main Results

In Theorem 1, we prove the convergence rate of the best-iterate of the primal objective, which is the standard criterion for convergence analysis involving the nonconvex problem. In Theorem 2, we show the last-itearate convergence of the dual optimality gap. Given the conditions from these theorems, we establish the complexity of the DSS-OG in finding an $\varepsilon$ -stationary point.

Theorem 1.

Under Assumptions 1–6, choosing step sizes

$\mu_{y}=\min\{\frac{1}{\nu},\frac{2L^{2}_{f}}{\nu\beta_{1}},\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{T^{1/2}},1\}$

(16)

$\mu_{x}=\min\{\frac{\nu^{2}\mu_{y}}{6L^{2}_{f}},\frac{1}{2(L+L_{f})},\frac{1}{4L},\frac{\mu_{y}}{4\tau_{3}},\frac{1}{2\sqrt{\tau_{1}+\beta_{1}}},\mu_{y}\}$

(17)

the non-asymptotic convergence rate of the primal objective is given by:

\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}\leq\mathcal{O}\Big{(}\frac{1}{{T}^{1/2}}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T^{2}}\Big{)}

(18)

where $L=L_{f}+\frac{L^{2}_{f}}{\nu}$ is the Lipschitz constant associated with $P(x)$ and

$\tau_{1}\triangleq\frac{80L^{4}_{f}}{\nu^{2}},\tau_{3}\triangleq\frac{80L^{2}_{f}}{\nu^{2}},\beta_{1}\triangleq\frac{96KL_{f}^{4}}{\nu^{2}}+48KL^{2}_{f}$

(19)

are constants. Furthermore, DSS-OG outputs a primal $\varepsilon$ -stationary point after $T_{0}=\mathcal{O}(\varepsilon^{-4})$ iterations and gradient evaluation complexity, i.e.,

	$\displaystyle\mathbb{E}\\|\nabla P({\bm{x}}^{\star}_{c,T_{0}})\\|^{2}$	$\displaystyle=\inf_{i=0,\dots,T_{0}-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$		(20)
		$\displaystyle\leq\frac{1}{T_{0}}\sum_{i=0}^{T_{0}-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}\leq\mathcal{O}(\varepsilon^{2})$

See Appendix C for proof.

Theorem 1 states that by adopting a two-time-scale scheme for the step sizes $\mu_{x}$ and $\mu_{y}$ , DSS-OG is able to output an $\epsilon$ -stationary point for the primal objective with a sublinear rate dominated by $\mathcal{O}\Big{(}\frac{1}{T^{1/2}}\Big{)}$ . The two-time-scale condition reflects the unsymmetric assumption of the risk function over the primal and dual variables. Similar conditions were also established in the works [7, 9, 8] when asymmetric conditions are assumed over the primal and dual variables.

Theorem 2.

Under Assumptions 1–6, and after iteration numbers $T_{0}$ , choosing step size

$\mu_{x}=0,\quad\mu_{y}=\min\{\sqrt{\frac{\nu}{12KL^{2}_{f}}},\frac{2T+1}{\nu(T+1)^{2}}\}$

(21)

the non-asymptotic bound of the dual optimality gap regarding the last-iterate is given by:

\displaystyle\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]\leq\mathcal{O}\Big{(}\frac{1}{T}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T^{2}}\Big{)}

(22)

That is, DSS-OG outputs a dual $\varepsilon$ -stationary point after $T_{1}=\mathcal{O}(\varepsilon^{-2})$ iterations and gradient evaluation complexity, i.e.,

\displaystyle\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]\leq\mathcal{O}(\varepsilon^{2})

(23)

See Appendix D for proof.

The above theorem implies that finding a dual solution that shrinks the dual optimality gap has a faster sublinear convergence rate of $\mathcal{O}\Big{(}\frac{1}{T}\Big{)}$ than the primal convergence rate. Eqs. (18) and (22) also show the fundamental difference of the complexity in finding the solution for a nonconvex problem and for a PL problem. From this theoretical aspect, it is hard to guarantee a pair of primal and dual solutions with the same accuracy level (24a)–(24b) simultaneously. Fortunately, we can prove that the same accuracy level of the primal and dual solutions can be obtained in an asynchronous manner (See Theorem 3).

Theorem 3.

After iterations $T_{0}+T_{1}=\mathcal{O}(\varepsilon^{-4})+\mathcal{O}(\varepsilon^{-2})$ of running DSS-OG, we can find the network centroid $({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})$ that satisfies the following first-order stationary conditions:


$\displaystyle\mathbb{E}\\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\\|^{2}$	$\displaystyle\leq\varepsilon^{2}$	(24a)
$\displaystyle\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\\|^{2}$	$\displaystyle\leq\varepsilon^{2}$	(24b)

for an arbitrarily small constant $\varepsilon$ . The solutions points $\bm{x}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}}$ are found in different time instance where the subscripts in $({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})$ indicate the exact iterations when they are found. See Appendix E for proof.

III-C Comparison between S-OG and SS-OG

We point out the key differences between SS-OG and S-OG:

•

The stochastic gradients involved in S-OG are not martingale process. To see this, we observe that the backward gradients $[\nabla_{x}Q(\bm{x}_{i-1};\bm{y}_{i-1};\bm{\xi}_{x,i-1}),\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i-1})]$ used at iteration $i$ are constructed by the past samples $(\bm{\xi}_{x,i-1},\bm{\xi}_{y,i-1})$ . They are biased estimators when taking expectation over the current samples $(\bm{\xi}_{x,i},\bm{\xi}_{y,i})$ . In contrast, it is easy to verify the stochastic gradients involved in SS-OG are unbiased.
•

SS-OG strictly performs the “optimistic” scheme under the same loss landscape. The losses utilized there are unbiased estimators for the local risk. In contrast, S-OG involves gradients from two different loss landscape. The gradient direction at the same point in two different loss landscapes can be notably different when data variance is high.
•

We will bound the terms involving $\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}$ for DSS-OG rather than $\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-\bm{g}_{x,c,i}\|^{2}$ for distributed S-OG. To see this, we can repeat the arguments in (I) and use the fact that the two terms involved in the inner product $\langle\nabla{{P}}(\bm{x}_{c,i}),\bm{g}_{x,c,i}\rangle$ are correlated for distributed S-OG. Therefore, we observe that the distributed S-OG needs to bound the deviation between the deterministic and stochastic quantities. The consequence of such deviation is that a constant on the order of $\mathcal{O}(\sigma^{2})$ will appear in the final bound because the Lipschitz continuity to be used is only assumed for determistic quantities. In constrast, we only need to bound the deviation between deterministic quantities for DSS-OG. Therefore, employing a large batch size will become necessary for S-OG in order to control the magnitude of $\mathcal{O}(\sigma^{2})$ .

IV Computer simulation

IV-A Wasserstein GAN

In this example, we consider the application of an $\ell_{2}$ -regularized Wasserstein GAN (WGAN) for learning the mean and variance of a one-dimensional Gaussian [8]. The objective is to optimize the following global risk function through the cooperation of agents:

		$\displaystyle\min_{{x}}\max_{{y}}\ J({x},{y})=\sum_{k=1}^{K}p_{k}J_{k}({x},{y})$		(25)
		$\displaystyle J_{k}({{x},{y}})\triangleq\mathbb{E}_{\bm{u}_{k}\sim\mathcal{N}(\pi_{k},\sigma^{2}_{k})}[D({y};\bm{u}_{k})]$
		$\displaystyle\qquad\qquad-\mathbb{E}_{\bm{z}_{k}\sim\mathcal{N}(0,I)}[D({y};G({x};\bm{z}_{k}))]-\lambda\\|y\\|_{2}$

Here, the dual variable $y$ is $\ell_{2}$ -regularized by a constant $\lambda$ to ensure the PL condition. We consider a setting where the generator has a single hidden layer with 5 neurons, and the discriminator is parameterized by $y\in\mathbb{R}^{2}$ and is of the form $D(y;\bm{u})=y_{1}\bm{u}_{k}+y_{2}\bm{u}_{k}^{2}$ . As for the true samples, $\bm{u}_{k}$ , they are drawn from the distribution $\mathcal{N}(\pi_{k},\sigma^{2}_{k})$ .

We generate a strongly connected network with $K=8$ agents and use the averaging rule [48, 33]. In addition to the fully distributed setting, we also implement centralized SS-OG (CSS-OG). We compare DSS-OG and CSS-OG with the distributed version of SAGDA, smoothed SAGDA and the adaptive algorithms, such as RMSprop and Adam which are also evaluated in this same application [8]. For these algorithms, we follow the optimal settings reported in [8]. For DSS-OG and CSS-OG, we tune the step sizes to $\mu_{x}=\mu_{y}=0.1$ . We set the same random seed before running each algorithm to ensure the same generated training set. The code is available at the attached link¹¹1https://github.com/MulGarFed/Minimax_DSSOG.git.

The simulation results are demonstrated in Figs.LABEL:fig:example1_variance0001_grad-LABEL:fig:example1_variance01_mse. We observe both DSS-OG and CSS-OG outperform the other algorithms in terms of gradient norm and mean square error (MSE) when $\sigma^{2}_{k}=0.001$ . From Fig. LABEL:fig:example1_variance01_mse, the MSE of DSS-OG and CSS-OG is able to match the results of RMSprop when $\sigma^{2}_{k}=0.1$ . We find that the adaptive algorithm RMSprop converges more rapidly than constant step size algorithms. Another interesting observation is that the gradient norm does not necessarily imply the quality of a solution. For instance, despite RMSprop not having the lowest gradient norm in Fig. LABEL:fig:example1_variance0001_grad, it exhibits rapid convergence and good performance in terms of MSE, as seen in Fig. LABEL:fig:example1_variance0001_mse. This discrepancy suggests a promising research direction in identifying more practical metrics for evaluating minimax solutions.

IV-B Deep Convolutional GANs

In this example, we train a deep convolutional GANs (DCGANs) using the MNIST dataset. We consider a similar neural network structure described in [49].

We use the same network topology from the previous example. To enhance performance, we implement DSS-OG with a Adam optimizer through replacing its original stochastic gradient term with the SS-OG gradients from (7a)–(7b). For this experiment, we set the momentum factors to $\beta_{1}=0.2,\beta_{2}=0.999$ and vary the step sizes. We plot the results of the Fréchet Inception Distance (FID) score [50] in Fig. 2(a) The sample images generated by a single node are demonstrated in Fig. 2(b). It is seen that the simulated algorithms can attain good results of FID score with the appropriately chosen hyperparameters. Moreover, the single agent is capable of generating all classes of digits, demonstrating the generative diversity of a single node in the distributed network.

V Conclusion

In this work, we introduced and analyzed DSS-OG, a novel distributed method for stochastic nonconvex-PL minimax optimization. This method improves the analysis of the existing works by addressing the large batch size issue. We validated the performance of DSS-OG through training experiments of two neural network models: WGAN and DCGAN. Our future work is to develop distributed, communication-efficient, and faster algorithms for variational inequality setting under cooperative or competing networks.

Appendix A Basic Lemmas

In this section, we list two basic results that will be used in our analysis.

Lemma 1 (Quadratic Growth [23]).

If $g(y)$ is $\nu$ -PL function, then it also satisfies a quadratic growth, i.e.,

\displaystyle g(y)-\min_{y}g(y)\geq\frac{\nu}{2}\|y^{o}-y\|^{2},\forall y

(26)

where $y^{o}$ is any optimal point taken from the set $\{y^{o}\mid y^{o}={\operatorname{argmin}}_{y}\ g(y)\}$ .

Lemma 2 (Danskin-type Lemma [51]).

Under Assumptions 1–2, if $-J(x,y)$ is $L_{f}$ -smooth and satisfies $\nu$ -PL in $y$ for any $x$ , then:

•

The primal objective $P(x)$ is $L$ -smooth with $L=L_{f}+\frac{L^{2}_{f}}{\nu}$ .
•

The gradient of $P(x)$ is equal to the gradient of $J(x,y^{o}(x))$ with respective to $x$ , i.e.,

$\nabla P(x)=\nabla_{x}J(x,y^{o}(x))$ (27)

where $y^{o}(x)$ is a maximum point of $J(x,y)$ for a fixed $x$ , i.e, $y^{o}(x)\in{\operatorname{argmax}}_{y}\ J(x,y)$ .

Appendix B Key Lemmas

Here, we list the auxiliary lemmas leading to the main convergence result. The proofs of these results are deferred to later appendices.

Lemma 3.

Under Assumptions 2–6, choosing the step sizes

\displaystyle\mu_{x}\leq\mu_{y}\leq\min\{\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{2(\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}+160\rho^{2}L^{2}_{f})},1\}

(28)

the iteration averaged network deviation is bounded as:

	$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\\|^{2}+\\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\\|^{2}\Big{]}$
	$\displaystyle\leq\frac{40\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}$		(29)

where $\rho^{2}\triangleq\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{V}_{L_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}$ is a constant associated with the network topology, $\sigma^{2}=\sum_{k=1}^{K}\sigma^{2}_{k}$ is the sum of local gradient noise variance. See Appendix F for proof.

Choosing suitable step sizes, we observe from Lemma 3 that the bound of iteration averaged network deviation is dominated by a term on the order of $\mathcal{O}(\mu_{y})$ .

Lemma 4.

Under Assumptions 2–6, choosing the step sizes as in Lemma 3, the iteration averaged increment is bounded as:

$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}[\mathbb{E}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\mathbb{E}\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}]$
$\displaystyle\leq$	$\displaystyle\frac{240\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}+\frac{3K}{T}\sum_{i=0}^{T-1}[\mu_{x}^{2}\mathbb{E}\\|{{g}}_{x,c,i-1}\\|^{2}$
	$\displaystyle+\mu_{y}^{2}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}]+60K\sigma^{2}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}$	(30)

See Appendix G for proof.

We observe that the increment is bounded by three terms, i.e., the term associated with network deviation on the order of $\mathcal{O}(\mu_{y})$ , iteration averaged risk gradients centroid, and a bound caused by the gradient noise on the order of $\mathcal{O}(\sigma^{2}\mu^{2}_{y})$ , respectively.

Lemma 5.

Under Assumptions 2–6, choosing the step sizes as in Lemma 3, the iteration averaged gradient disagreement is bounded as:

\displaystyle\frac{1}{T}

\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\|\nabla_{w}J(\bm{x}_{c,i},\bm{y}_{c,i})-g_{w,c,i}\|^{2}\quad(w=x\text{ or }y)

(31)

	$\displaystyle{\leq}$	$\displaystyle\ \frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}$
		$\displaystyle\ +\frac{12KL^{2}_{f}}{T}\sum_{i=0}^{T-1}[\mu_{x}^{2}\mathbb{E}\\|{{g}}_{x,c,i-1}\\|^{2}+\mu_{y}^{2}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}]$

See Appendix H for proof.

Lemma 5 implies how the deviation between the global risk function at network centroid $(\bm{x}_{c,i},\bm{y}_{c,i})$ and the centroid of local risk gradient is bounded over iterations. We observe that this bound consists of three terms, namely the iteration averaged risk gradient centroid and two terms on the order of $\mathcal{O}(\mu_{y})+\mathcal{O}(\sigma^{2}\mu^{2}_{y})$ . In the following lemmas, we will establish some useful inequalities associated with the Lipschitz continuity of primal objective (12) and risk function (2a). For simplicity, we use

\displaystyle d(x,y)\triangleq P(x)-J(x,y)

(32)

to denote the dual optimality gap for simplicity.

Lemma 6.

Under Assumptions 1–6, the primal objective $P(x)$ satisfies:

$\displaystyle\mathbb{E}$	$\displaystyle[P(\bm{x}_{c,i+1})]-\mathbb{E}[{{P}}(\bm{x}_{c,i})]$
$\displaystyle\leq$	$\displaystyle-\frac{\mu_{x}}{2}\mathbb{E}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{2\mu_{x}L^{2}_{f}}{\nu}\mathbb{E}[{d}_{c}(i)]+\mu_{x}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}$
	$\displaystyle+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$	(33)

where ${d}_{c}(i)\triangleq{d}(\bm{x}_{c,i},\bm{y}_{c,i})=P(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i})$ denotes the dual optimality gap evaluated at the network centroid $(\bm{x}_{c,i},\bm{y}_{c,i})$ . See Appendix I for proof.

Lemma 7.

Under Assumptions 1–6, we have

$\displaystyle\mathbb{E}$	$\displaystyle[-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})]$	(34)
$\displaystyle\leq$	$\displaystyle\mathbb{E}[-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]-\frac{\mu_{y}}{2}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\\|^{2}$
	$\displaystyle-\frac{\mu_{y}(1-L_{f}\mu_{y})}{2}\mathbb{E}\\|{g}_{y,c,i}\\|^{2}+\mu_{y}L^{2}_{f}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}$
	$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

and

$\displaystyle\mathbb{E}$	$\displaystyle[-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]$	(35)
$\displaystyle\leq$	$\displaystyle\mathbb{E}[-J(\bm{x}_{c,i},\bm{y}_{c,i})]+\frac{\mu_{x}}{2}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{\mu_{x}(3+L_{f}\mu_{x})}{2}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}+5L_{f}\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

See Appendix J for proof.

Lemma 8.

Under Assumptions 1–6, choosing step sizes

$\mu_{x}\leq\min\{\mu_{y},\frac{\nu^{2}\mu_{y}}{4L^{2}_{f}},\sqrt{\frac{L_{f}}{L+L_{f}}}\mu_{y}\},\mu_{y}\leq\min\{1,\frac{1}{\nu},\frac{1}{L_{f}}\}$

(36)

the iteration averaged dual optimality gap is bounded as:

	$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}[{d}_{c}(i)]$
	$\displaystyle\leq$	$\displaystyle\frac{2}{\nu\mu_{y}T}{d}_{c}(0)+\frac{1}{T}\sum_{i=0}^{T-1}\Big{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{\nu\mu_{y}}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})$

	$\displaystyle-{g}_{x,c,i}\\|^{2}+(1-\nu\mu_{y})\Big{[}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{]}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{2L_{f}^{2}}{\nu}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}+\frac{2}{\nu}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}\Big{\}}$
	$\displaystyle-\frac{\mu_{y}}{2T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$
	$\displaystyle+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k}$		(37)

See Appendix K for proof.

The bound of dual optimality gap is quantified by several terms, including the initialization term which decays on the rate of $\mathcal{O}(\frac{1}{\mu_{y}T})$ , the gradient disagreement terms, the risk gradients centroid, the increment of network centroid, and a constant term on the order of $\mathcal{O}(\sigma^{2}\mu_{y})$ .

Appendix C Proof of Theorem 1

We now prove the best iterate convergence for the primal variable. Moving the gradient term $\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}$ to the left hand side in Lemma 6 and averaging the inequality over iterations, we get

$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$
$\displaystyle\leq$	$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\Big{\{}\frac{2}{\mu_{x}}\mathbb{E}[P(\bm{x}_{c,i})-P(\bm{x}_{c,i+1})]-(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{4L_{f}^{2}}{\nu}\mathbb{E}[{d}_{c}(i)]+2\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}\Big{\}}$
	$\displaystyle+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$	(38)
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Big{\{}(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle-\frac{4L_{f}^{2}}{\nu}\mathbb{E}[{d}_{c}(i)]-2\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}\Big{\}}$
	$\displaystyle+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$	(39)

where $(a)$ follows from Assumption 3 and the telescoping results of $\sum_{i=0}^{T}P(\bm{x}_{c,i})-P(\bm{x}_{c,i+1})$ in (38). Invoking Lemma 8 and regrouping the terms in (39), we obtain

$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$	(40)
$\displaystyle\leq$	$\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}$
	$\displaystyle\times\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}\Big{\}}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}-\Big{(}\frac{12\mu_{x}(1-\nu\mu_{y})L^{2}_{f}}{\nu^{2}\mu_{y}}$
	$\displaystyle+2\Big{)}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}-{\frac{8L_{f}^{4}}{\nu^{2}}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}}$
	$\displaystyle-{\frac{8L_{f}^{2}}{\nu^{2}}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}}\Bigg{\}}+\frac{8L_{f}^{2}}{\nu^{2}\mu_{y}T}{d}_{c}(0)$
	$\displaystyle-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$
	$\displaystyle+\frac{80L^{3}_{f}\mu_{y}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

	$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}$
		$\displaystyle\times\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}-\frac{8L_{f}^{4}\mu^{2}_{x}}{\nu^{2}}\Big{\}}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
		$\displaystyle-4\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}-\frac{8L_{f}^{2}}{\nu^{2}}$
		$\displaystyle{\times\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}}\Bigg{\}}+\frac{80L^{4}_{f}\mu^{2}_{x}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}$
		$\displaystyle+\frac{8L_{f}^{2}}{\nu^{2}\mu_{y}T}{d}_{c}(0)+\frac{80L^{3}_{f}\mu_{y}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
		$\displaystyle-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$

where $(a)$ follows from several steps: inserting $\bm{x}_{i+1}=\bm{x}_{i}-\mu_{x}\bm{g}_{x,c,i}$ , using (77) and choosing step size

$\mu_{y}\leq\frac{1}{\nu},\frac{12\mu_{x}(1-\nu\mu_{y})L^{2}_{f}}{\nu^{2}\mu_{y}}\leq 2\Rightarrow\mu_{x}\leq\frac{\nu^{2}\mu_{y}}{6L^{2}_{f}}$

(41)

For the sake of simplicity, we introduce the following notations for constants:

$\begin{aligned} \tau_{1}&\triangleq\frac{80L^{4}_{f}}{\nu^{2}},\tau_{2}\triangleq\frac{80L^{3}_{f}}{\nu^{2}},\tau_{3}\triangleq\frac{8L^{2}_{f}}{\nu^{2}},\tau_{4}\triangleq 10L\\ \beta_{1}&\triangleq\frac{96KL_{f}^{4}}{\nu^{2}}+48KL^{2}_{f}\end{aligned}$

(42)

Let $P_{0}=P(\bm{x}_{c,0})-P^{\star}$ be the initial primal optimality gap. Invoking Lemma 5 for inequality (40), we get

$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$	(43)
$\displaystyle\leq$	$\displaystyle\frac{2P_{0}}{\mu_{x}T}+(\tau_{2}\mu_{y}+\tau_{4}\mu_{x}+\tau_{1}\mu^{2}_{x})\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+\frac{\tau_{3}{d}_{c}(0)}{\mu_{y}T}$
	$\displaystyle-\frac{1}{T}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}$
	$\displaystyle-\tau_{1}\mu^{2}_{x}-\beta_{1}\mu^{2}_{x}\Big{\}}\sum_{i=0}^{T-1}\underbrace{\mathbb{E}\\|{g}_{x,c,i}\\|^{2}}_{\text{primal gradient}}+\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=0}^{T-1}\mathbb{E}\underbrace{\\|{{g}}_{y,c,i-1}\\|^{2}}_{\text{dual gradient}}$
	$\displaystyle+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}(\tau_{3}+4)-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})$
	$\displaystyle\times\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\underbrace{\\|{g}_{y,c,j}\\|^{2}}_{\text{dual gradient}}+\frac{1120L^{2}_{f}\rho^{2}\mu_{y}(\tau_{3}+4)}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}$
$\displaystyle z$	$\displaystyle\times(4KG^{2}+\mu_{y}\sigma^{2})+\beta_{1}\mu^{2}_{x}\\|g_{x,c,-1}\\|^{2}$

For the terms associated with primal gradient, we can bound them by choosing a suitable step size:

	$\displaystyle 1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}-(\tau_{1}+\beta_{1})$
	$\displaystyle\times\mu^{2}_{x}\overset{(a)}{\geq}1-L\mu_{x}-2\tau_{3}\frac{\mu_{x}}{\mu_{y}}-\Big{(}\tau_{1}+\beta_{1}\Big{)}\mu^{2}_{x}\overset{(b)}{\geq}0$		(44)

where $(a)$ we choose $\mu_{x}\leq 2/(L+L_{f})$ , (b) we choose $\mu_{x}\leq\frac{1}{4L},\mu_{x}\leq\frac{\mu_{y}}{4\tau_{3}}$ and $\mu^{2}_{x}\leq 1/(4(\tau_{1}+\beta_{1}))$ . By doing so, we can eliminate the primal gradient terms since now it has a negative sign. For terms associated with dual gradient, we can bound it by deducing that:

	$\displaystyle\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$
	$\displaystyle-\frac{\beta_{1}\mu^{2}_{y}}{T}\sum_{i=0}^{T-1}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}\hskip 200.0003pt$
	$\displaystyle\overset{(a)}{=}\frac{4L^{2}_{f}}{\nu^{2}T}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\sum_{i=0}^{T-2}{(1-(1-\frac{\nu\mu_{y}}{2})^{T-1-i})}$
	$\displaystyle\quad\times\mathbb{E}\\|{g}_{y,c,i}\\|^{2}-\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=0}^{T-1}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}$
	$\displaystyle\overset{(b)}{\geq}\ \frac{\tau_{3}}{2T}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\sum_{i=0}^{T-2}{(1-(1-\frac{\nu\mu_{y}}{2}))}$
	$\displaystyle\quad\times\mathbb{E}\\|{g}_{y,c,i}\\|^{2}-\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=-1}^{T-2}\mathbb{E}\\|{{g}}_{y,c,i}\\|^{2}$
	$\displaystyle\geq\ \frac{1}{T}\{\frac{2L^{2}_{f}}{\nu}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\mu_{y}-\beta_{1}\mu_{y}^{2}\}\sum_{i=-1}^{T-2}\mathbb{E}\\|{{g}}_{y,c,i}\\|^{2}$
	$\displaystyle\quad-\frac{\beta_{1}\mu_{y}^{2}\\|{{g}}_{y,c,-1}\\|^{2}}{T}\quad\overset{(c)}{\geq}-\frac{\beta_{1}\mu_{y}^{2}\\|{{g}}_{y,c,-1}\\|^{2}}{T}$		(45)

where $(a)$ we have used

	$\displaystyle\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}q^{i-j}a_{j}=q\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}q^{i-1-j}a_{j}$	$\displaystyle=q\sum_{i=0}^{T-2}a_{i}(\sum_{t=0}^{T-2-i}q^{t})$
		$\displaystyle=q(\sum_{i=0}^{T-2}a_{i}\frac{1-q^{T-1-i}}{1-q})$

$q\in(0,1)$ , $(b)$ is due to $1-q^{t}\geq 1-q$ when $t\geq 1$ , $(c)$ we choose $\mu_{y}\leq\frac{2L^{2}_{f}}{\nu\beta_{1}}$ . Replacing these gradients with the deduced upper bound, we conclude

	$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$
	$\displaystyle\leq\ \frac{2P_{0}}{\mu_{x}T}+\frac{\tau_{3}{d}_{c}(0)}{\mu_{y}T}+(\tau_{2}\mu_{y}+\tau_{4}\mu_{x}+\tau_{1}\mu^{2}_{x})\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle\quad+\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}(\tau_{3}+4)+\frac{\beta_{1}\mu_{y}^{2}\\|{{g}}_{y,c,-1}\\|^{2}}{T}$
	$\displaystyle\quad+\frac{\beta_{1}\mu^{2}_{x}\\|g_{x,c,-1}\\|^{2}}{T}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}(\tau_{3}+4)$		(46)

Considering all the step size conditions derived above and choosing

$\begin{aligned} \mu_{x}&=\min\{\frac{\nu^{2}\mu_{y}}{6L^{2}_{f}},\frac{1}{2(L+L_{f})},\frac{1}{4L},\frac{\mu_{y}}{4\tau_{3}},\frac{1}{2\sqrt{\tau_{1}+\beta_{1}}},\mu_{y}\}\\ \mu_{y}&=\min\{\frac{1}{\nu},\frac{2L^{2}_{f}}{\nu\beta_{1}},\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{T^{1/2}},1\}\end{aligned}$

(47)

we get the non-asymptotic bound as follows:

\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}\leq\mathcal{O}\Big{(}\frac{1}{{T}^{1/2}}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T^{2}}\Big{)}

(48)

It is seen that the convergence rate is dominated by $\mathcal{O}\Big{(}\frac{1}{{T}^{1/2}}\Big{)}$ , therefore DSS-OG outputs $\varepsilon$ -stationary point after $T_{0}=\mathcal{O}(\varepsilon^{-4})$ iteration and gradient evaluation complexity.

Appendix D Proof of Theorem 2

We further prove that DSS-OG can find a solution for the dual problem by halting the update of primal variable after ${\bm{x}}^{\star}_{c,T_{0}}$ is found; this can be achieved by setting $\mu_{x}=0$ after iteration $T_{0}$ . For the convenience of analysis, we use ${\bm{g}}^{\star}_{y,c,i}$ and ${{g}}^{\star}_{y,c,i}$ to denote the stochastic and true gradients when their primal variable is fixed at ${\bm{x}}^{\star}_{c,T_{0}}$ . Repeating the arguments in Lemma 7 at point $x={\bm{x}}^{\star}_{c,T_{0}}$ , we get

$\displaystyle\mathbb{E}$	$\displaystyle[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i+1})]$	(49)
$\displaystyle\leq$	$\displaystyle(1-\nu\mu_{y})\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})]+\frac{\mu_{y}}{2}\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})$
	$\displaystyle-{{g}}^{\star}_{y,c,i}\\|^{2}-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\\|{{g}}^{\star}_{y,c,i}\\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

We now study recursion (49) for sufficiently large iteration $i$ . Furthermore, we choose the step size $\mu_{y}=\frac{2T+1}{\nu(T+1)^{2}}$ . In the following, we insert this expression of $\mu_{y}$ into the coefficient associated with $\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]$ in (49), and keep “ $\mu_{y}$ ” notation in the remaining terms to determine additional step size conditions. Setting $i=T$ in (49), we get

	$\displaystyle\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})]$	(50)
$\displaystyle\leq$	$\displaystyle\frac{T^{2}}{(T+1)^{2}}\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]+\frac{\mu_{y}}{2}\Big{[}\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})$
	$\displaystyle-{{g}}^{\star}_{y,c,T}\\|^{2}-(1-L_{f}\mu_{y})\mathbb{E}\\|{{g}}^{\star}_{y,c,T}\\|^{2}\Big{]}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

Multiplying both sides by $(T+1)^{2}$ , we have

	$\displaystyle(T+1)^{2}\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})]$	(51)
$\displaystyle\leq$	$\displaystyle\ T^{2}\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]+\frac{\mu_{y}(T+1)^{2}}{2}\Big{[}\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})$
	$\displaystyle-{{g}}^{\star}_{y,c,T}\\|^{2}-(1-L_{f}\mu_{y})\mathbb{E}\\|{{g}}^{\star}_{y,c,T}\\|^{2}\Big{]}+5(T+1)^{2}L_{f}\mu^{2}_{y}\sigma^{2}$
	$\displaystyle\times\underset{k}{\max}\ p^{2}_{k}$

We further denote $\delta_{i}=i^{2}\times\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})]$ and obtain a recursion regarding $\delta_{i}$ . Iterating the recursion of $\delta_{i}$ from $i=T+1$ to $0$ , we obtain

	$\displaystyle\delta_{T+1}\leq$	$\displaystyle\delta_{0}+\frac{\mu_{y}(T+1)^{2}}{2}\sum_{i=0}^{T}\Big{[}\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})-{{g}}^{\star}_{y,c,i}\\|^{2}$
		$\displaystyle-(1-L_{f}\mu_{y})\mathbb{E}\\|{{g}}^{\star}_{y,c,i}\\|^{2}\Big{]}+5(T+1)^{3}L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$		(52)

Repeating the arguments in Lemma 4, we can bound the iteration averaged gradient deviation as follows:

		$\displaystyle\frac{1}{T+1}\sum_{i=0}^{T}\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})-{{g}}^{\star}_{y,c,i}\\|^{2}$
		$\displaystyle\leq\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{(1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2})}+\frac{12KL^{2}_{f}}{\nu(T+1)}\sum_{i=0}^{T}\mu_{y}^{2}$
		$\displaystyle\quad\times\mathbb{E}\\|{{g}}^{\star}_{y,c,i}\\|^{2}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}$		(53)

Inserting (D) into (D) and choosing $\frac{12KL^{2}_{f}\mu^{2}_{y}}{\nu}<1-L_{f}\mu_{y}\Rightarrow\mu_{y}<\sqrt{\frac{\nu}{12KL^{2}_{f}}}$ to eliminate gradient terms in $\eqref{PLboundstep2}$ , we get ( $\delta_{0}=0$ ):

	$\displaystyle\delta_{T+1}$	$\displaystyle\leq\frac{560(T+1)^{3}L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu^{2}_{y}}{(1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2})}+5(T+1)^{3}$
		$\displaystyle\quad\times L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+120(T+1)^{3}K\sigma^{2}L^{2}_{f}\mu^{3}_{y}\underset{k}{\max}\ p^{2}_{k}$		(54)

Because $\mu_{y}=\frac{2T+1}{\nu(T+1)^{2}}\leq\frac{2}{\nu(T+1)}$ , dividing both sides of (D) by $(T+1)^{2}$ , we get the following non-asymptotic bound for dual optimality gap:

$\displaystyle\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})]$	$\displaystyle=\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})]$
	$\displaystyle\leq\mathcal{O}\Big{(}(T+1)\mu^{2}_{y}\Big{)}+\mathcal{O}\Big{(}(T+1)\mu^{3}_{y}\Big{)}$
	$\displaystyle\leq\mathcal{O}\Big{(}\frac{1}{T+1}\Big{)}+\mathcal{O}\Big{(}\frac{1}{(T+1)^{2}}\Big{)}$	(55)

Appendix E Proof of Theorem 3

According to Theorem 1, after running $T_{0}=\mathcal{O}(\varepsilon^{-4})$ iterations, we can identify the point ${\bm{x}}^{\star}_{c,T_{0}}$ which satisfies the following condition

\displaystyle\mathbb{E}\|\nabla P({\bm{x}}^{\star}_{c,T_{0}})\|^{2}\leq\frac{\varepsilon^{2}}{4}=\mathcal{O}(\varepsilon^{2})

(56)

According to Lemma 2, this is equivalent to

\displaystyle\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}\leq\frac{\varepsilon^{2}}{4}=\mathcal{O}(\varepsilon^{2})

(57)

Furthermore, from Theorem 2, after iterations $T_{1}$ , we can find a point ${\bm{y}}^{\star}_{c,T_{0}+T_{1}}$ such that

\displaystyle\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]\leq\frac{1}{L^{2}_{f}}\frac{\nu}{8}\varepsilon^{2}=\mathcal{O}(\varepsilon^{2})

(58)

We keep the subscript in $({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{{c,T_{0}+T_{1}}})$ to indicate the exact iterations at which this point is obtained. Under condition (58), we can verify $({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})$ satisfies the second stationary condition (29b), i.e.,

$\displaystyle\mathbb{E}$	$\displaystyle\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\\|^{2}$
$\displaystyle=$	$\displaystyle\mathbb{E}\\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})-\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\\|^{2}$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle L^{2}_{f}\mathbb{E}\\|{\bm{y}}^{\star}_{c,T_{0}+T_{1}}-\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}})\\|^{2}$
$\displaystyle\overset{(b)}{\leq}$	$\displaystyle L^{2}_{f}{\frac{2}{\nu}\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]}$
$\displaystyle\overset{(c)}{\leq}$	$\displaystyle\frac{\varepsilon^{2}}{4}\leq\ \varepsilon^{2}=\mathcal{O}(\varepsilon^{2})$	(59)

where $(a)$ follows from the $L_{f}$ -smooth property of the local risk, $(b)$ follows from Lemma 1, $(c)$ follows from (23). On the other hand, the solution $({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})$ also satisfies the first stationary condition (29a), i.e.,

	$\displaystyle\mathbb{E}$	$\displaystyle\\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\\|^{2}$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}\\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})-\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))$
		$\displaystyle+\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\\|^{2}$
	$\displaystyle\leq$	$\displaystyle 2\mathbb{E}\\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\\|^{2}$
		$\displaystyle+2\mathbb{E}\\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})-\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\\|^{2}$

$\displaystyle\overset{(a)}{\leq}$	$\displaystyle 2\mathbb{E}\\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\\|^{2}$
	$\displaystyle+2L^{2}_{f}{\frac{2}{\nu}\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]}$
$\displaystyle\leq$	$\displaystyle\mathcal{O}(\frac{\varepsilon^{2}}{2})+\mathcal{O}(\frac{\varepsilon^{2}}{2})=\mathcal{O}(\varepsilon^{2})\hskip 100.00015pt$	(60)

where $(a)$ follows from (57) and (23). Therefore, the overall complexity for iteration and gradient evaluation is given by $T_{0}+T_{1}=\mathcal{O}(\varepsilon^{-4})+\mathcal{O}(\varepsilon^{-2})$ .

The proof is completed here.

Appendix F Proof of Lemma 3

Taking conditional expectations of the transient term $\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}$ over the distribution of random samples $\{\bm{\xi}^{x}_{k,i},\bm{\xi}^{y}_{k,i}\}_{k=1}^{K}$ , we get

$\displaystyle\mathbb{E}$	$\displaystyle\Big{[}\\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\\|^{2}+\\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\\|{\bm{\mathcal{X}}}_{i}-(\mathbbm{1}p^{\top}\otimes I_{M_{1}})\bm{\mathcal{X}}_{i}\\|^{2}+\\|{\bm{\mathcal{Y}}}_{i}-(\mathbbm{1}p^{\top}\otimes I_{M_{2}})\bm{\mathcal{Y}}_{i}\\|^{2}$
	$\displaystyle\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\\|[(V^{-1}\otimes I_{M_{1}})^{\top}(V\otimes I_{M_{1}})^{\top}-(\mathbbm{1}p^{\top}\otimes I_{M_{1}})]{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|$
	$\displaystyle[(V^{-1}\otimes I_{M_{2}})^{\top}(V\otimes I_{M_{2}})^{\top}-(\mathbbm{1}p^{\top}\otimes I_{M_{2}})]{\bm{\mathcal{Y}}}_{i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle\overset{(a)}{=}$	$\displaystyle\mathbb{E}\Big{[}\\|\mathcal{V}_{L_{1}}\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|\mathcal{V}_{L_{2}}\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle\overset{(b)}{\leq}$	$\displaystyle\\|\mathcal{V}_{L_{1}}\\|^{2}\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$	(61)

where $(a)$ follows from the Jordan decomposition of $A$ and $(b)$ follows from the sub-multiplicative property of norms and the fact that $\|\mathcal{V}_{L_{1}}\|^{2}=\|\mathcal{V}_{L_{2}}\|^{2}$ . Taking expectation over (61) again and averaging this inequality over iterations, we get:

		$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\\|^{2}+\\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\\|^{2}]$
		$\displaystyle\leq\\|\mathcal{V}_{L_{1}}\\|^{2}\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\\|^{2}]$		(62)

In the following, we bound the term $\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}]$ by Eqs. (63)-(68), $\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}]$ by (F), and deduce the final bound (F) by Eqs. (70)-(73). To bound $\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}]$ , we inspect $\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}]$ on the right hand side (RHS) of (61) by deducing that:

	$\displaystyle\mathbb{E}$	$\displaystyle\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
	$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\mathbb{E}\Big{[}\\|\mathcal{J}^{\top}_{\gamma_{1}}\mathcal{V}^{\top}_{R_{1}}[\bm{\mathcal{X}}_{i-1}-\mu_{x}\bm{\mathcal{G}}_{x,i-1}]\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}\hskip 100.00015pt$
	$\displaystyle\overset{(b)}{\leq}$	$\displaystyle{\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}-2\mu_{x}{\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\mathbb{E}\Big{[}\Big{\langle}\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1},$
		$\displaystyle\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\Big{\rangle}\mid\bm{\mathcal{F}}_{i-1}\Big{]}+\mu^{2}_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
	$\displaystyle\overset{(c)}{\leq}$	$\displaystyle{\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}-2\mu_{x}{\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\Big{\langle}\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1},\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\Big{\rangle}$
		$\displaystyle+\mu^{2}_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\mathbb{E}[\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}]$

$\displaystyle\overset{(d)}{\leq}$	$\displaystyle\ {\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}+\mu_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}$
	$\displaystyle+\\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\\|^{2}\Big{]}+\mu^{2}_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle\overset{(e)}{\leq}$	$\displaystyle{\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}+\mu_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}$
	$\displaystyle\quad+\\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\\|^{2}\Big{]}+\mu^{2}_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\\|^{2}$
	$\displaystyle\quad+10\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\sigma^{2}\mu^{2}_{x}\quad(\sigma^{2}=\sum_{k=1}^{K}\sigma^{2}_{k})$	(63)

where $(a)$ follows from (10a) and

\mathcal{V}^{\top}_{R_{1}}\mathcal{A}_{1}^{\top}=\mathcal{V}^{\top}_{R_{1}}(\mathbbm{1}p^{\top}\otimes I_{M_{1}}+\mathcal{V}_{L_{1}}\mathcal{J}^{\top}_{\gamma_{1}}\mathcal{V}^{\top}_{R_{1}})=\mathcal{J}^{\top}_{\gamma_{1}}\mathcal{V}^{\top}_{R_{1}}

Moreover, $(b)$ follows from the sub-multiplicative property of the norm and the expansion of the squared norm, $(c)$ follows from the conditional expectation over the sample $\{\bm{\xi}^{x}_{k,i}\}_{k=1}^{K}$ and the fact that $\bm{\mathcal{X}}_{i-1}$ only relies on the random samples up to iteration $i-1$ , $(d)$ follows from the inequality $-2\langle a,b\rangle\leq\|a\|^{2}+\|b\|^{2}$ , $(e)$ follows from Assumption 4 and the following inequality:

$\displaystyle\ \mathbb{E}$	$\displaystyle\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}[\mbox{\rm col}\{\bm{s}^{x}_{k,i-1}\}_{k=1}^{K}+\mathcal{G}_{x,i-1}]\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\mathcal{G}_{x,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}+2\mathcal{V}^{\top}_{R_{1}}\mathbb{E}\Big{[}\langle\mathcal{G}_{x,i-1},\mbox{\rm col}\{\bm{s}^{x}_{k,i-1}\}_{k=1}^{K}\rangle$
	$\displaystyle\mid\bm{\mathcal{F}}_{i-1}\Big{]}+\mathbb{E}[\\|\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\{\bm{s}^{x}_{k,i-1}\}_{k=1}^{K}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}]$
$\displaystyle\leq$	$\displaystyle\ \\|\mathcal{V}^{\top}_{R_{1}}\mathcal{G}_{x,i-1}\\|^{2}+10\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\sum_{k=1}^{K}\sigma^{2}_{k}$	(64)

where $\bm{s}^{x}_{k,i}$ denotes the gradient noise

	$\displaystyle\bm{s}^{x}_{k,i}$	$\displaystyle=2[\nabla_{x}Q_{k}(\bm{x}_{k,i},\bm{y}_{k,i};\bm{\xi}^{x}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})]$		(65)
		$\displaystyle\ +[\nabla_{x}Q_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1};\bm{\xi}^{x}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})]$

Choosing $\mu_{x}<1$ , we have $\mu^{2}_{x}<\mu_{x}$ and

$\displaystyle\mathbb{E}$	$\displaystyle\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}]$
$\displaystyle\leq$	$\displaystyle(1+\mu_{x})\ {\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}+2\mu_{x}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\\|^{2}$
	$\displaystyle+10\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\sigma^{2}\mu^{2}_{x}$	(66)

For the term $\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}$ , we can bound it as follows:

		$\displaystyle\\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\\|^{2}\hskip 200.0003pt$
	$\displaystyle\overset{(a)}{=}$	$\displaystyle\ \Big{\\|}\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})-\nabla_{x}J_{k}(\bm{x}_{k,i-2},\bm{y}_{k,i-2})$
		$\displaystyle\Big{\}}_{k=1}^{K}-\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}(\bm{x}_{c,i-1},\bm{y}_{c,i-1})-\nabla_{x}J_{k}(\bm{x}_{c,i-2},$
		$\displaystyle\bm{y}_{c,i-2})\Big{\}}_{k=1}^{K}+\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}(\bm{x}_{c,i-1},\bm{y}_{c,i-1})-\nabla_{x}J_{k}$
		$\displaystyle(\bm{x}_{c,i-2},\bm{y}_{c,i-2})\Big{\}}_{k=1}^{K}-\mathcal{V}^{\top}_{R_{1}}({\mathbbm{1}p^{\top}}\otimes I_{M_{1}})\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}$
		$\displaystyle(\bm{x}_{c,i-1},\bm{y}_{c,i-1})-\nabla_{x}J_{k}(\bm{x}_{c,i-2},\bm{y}_{c,i-2})\Big{\}}_{k=1}^{K}\Big{\\|}^{2}$
	$\displaystyle\overset{(b)}{\leq}$	$\displaystyle 32\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}L^{2}_{f}\Big{[}\\|\bm{\mathcal{X}}_{i-1}-\bm{\mathcal{X}}_{c,i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i-1}-\bm{\mathcal{Y}}_{c,i-1}\\|^{2}\Big{]}$
		$\displaystyle+8\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}L^{2}_{f}\Big{[}\\|\bm{\mathcal{X}}_{i-2}-\bm{\mathcal{X}}_{c,i-2}\\|^{2}+\\|\bm{\mathcal{Y}}_{i-2}-\bm{\mathcal{Y}}_{c,i-2}\\|^{2}\Big{]}$
		$\displaystyle+20K\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}G^{2}$

$\displaystyle\overset{(c)}{\leq}$	$\displaystyle 32\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{V}_{L_{1}}\\|^{2}L^{2}_{f}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-1}\\|^{2}\Big{]}$
	$\displaystyle+8\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{V}_{L_{1}}\\|^{2}L^{2}_{f}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-2}\\|^{2}\Big{]}$
	$\displaystyle+20K\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}G^{2}$	(67)

where $(a)$ follows from $\mathcal{V}^{\top}_{R_{1}}({\mathbbm{1}p^{\top}}\otimes I_{M_{1}})=0$ , $(b)$ follows from Jensen’s inequality, the $L_{f}$ -smooth property of the local risk $J_{k}(x,y)$ and Assumption 5, $(c)$ follows from (61). For convenience, we denote the network constant as $\rho^{2}\triangleq\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{V}_{L_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}=\|\mathcal{V}^{\top}_{R_{2}}\|^{2}\|\mathcal{V}_{L_{2}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{2}}\|^{2}$ . Combining the results of (F) and (F) and taking expectation again, we get

$\displaystyle\mathbb{E}$	$\displaystyle\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}$	(68)
$\displaystyle\leq$	$\displaystyle(1+\mu_{x}){\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\mathbb{E}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}+64\mu_{x}{\rho}^{2}L^{2}_{f}\Big{[}\mathbb{E}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-1}\\|^{2}$
	$\displaystyle+\mathbb{E}\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-1}\\|^{2}\Big{]}+16\mu_{x}{\rho}^{2}L^{2}_{f}\Big{[}\mathbb{E}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\\|^{2}$
	$\displaystyle+\mathbb{E}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{Y}}_{i-2}\\|^{2}\Big{]}+10\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}(4KG^{2}+\mu_{x}\sigma^{2})\mu_{x}$

A similar inequality holds for $\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}$ . Combining these results together and choosing $\mu_{x}\leq\mu_{y}$ , we have

$\displaystyle\mathbb{E}$	$\displaystyle[\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\\|^{2}]$
$\displaystyle\leq$	$\displaystyle(1+\mu_{y}){\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}\mathbb{E}[\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i-1}\\|^{2}]+128\mu_{y}\rho^{2}$
	$\displaystyle\times L^{2}_{f}\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-1}\\|^{2}\Big{]}+32\mu_{y}\rho^{2}L^{2}_{f}$
	$\displaystyle\times\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-2}\\|^{2}\Big{]}+20\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}$
	$\displaystyle\times(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}$	(69)

Averaging (F) over iterations, we get

	$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\\|^{2}]$	(70)
$\displaystyle\leq$	$\displaystyle\frac{(1+\mu_{y})\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}+128\mu_{y}\rho^{2}L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\\|^{2}$
	$\displaystyle+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i-1}\\|^{2}\Big{]}+\frac{32\mu_{y}\rho^{2}L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\\|^{2}$
	$\displaystyle+\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-2}\\|^{2}\Big{]}+20\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}(4KG^{2}+\mu_{y}\rho^{2})\mu_{y}$

	$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{(1+\mu_{y})\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}+160\mu_{y}\rho^{2}L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i}\\|^{2}$
		$\displaystyle+\\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i}\\|^{2}\Big{]}+20\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}$

where in $(a)$ we set the initial iterates as $\bm{x}_{k,-1}=\bm{x}_{k,-2}=\bm{y}_{k,-1}=\bm{x}_{k,-2}=0$ and add terms $\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{T-2}\|^{2}+\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{T-2}\|^{2}$ and $\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{T-1}\|^{2}+\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{T-1}\|^{2}$ into the first inequality to group the summation terms. Since $\|\mathcal{J}^{\top}_{\gamma_{1}}\|<1$ , we choose the step sizes $\mu_{y}$ such that

		$\displaystyle 1-\Big{(}(1+\mu_{y})\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}+160\mu_{y}\rho^{2}L^{2}_{f}\Big{)}\geq\frac{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}{2}$		(71)
		$\displaystyle\Longrightarrow\mu_{y}\leq\frac{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}{2(\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}+160\rho^{2}L^{2}_{f})}$		(71)

Solving inequality (70) gives

		$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\\|^{2}+\\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\\|^{2}]$		(72)
		$\displaystyle\leq\frac{40\\|\mathcal{V}^{\top}_{R_{1}}\\|^{2}\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}$		(72)

Combining (72) with (F), we get the final bound as follows:

		$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\\|^{2}+\\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\\|^{2}]$		(73)
		$\displaystyle\leq\frac{40\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}$		(73)

Appendix G Proof of Lemma 4

Note that ${\bm{\mathcal{X}}}_{c,i}={\bm{\mathcal{X}}}_{c,i-1}-\mu_{x}\operatorname{col}\{{\bm{g}}_{x,c,i-1}\}_{k=1}^{K}$ , we insert this expression of ${\bm{\mathcal{X}}}_{c,i}$ into $\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}$ and deduce that

		$\displaystyle\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}$		(74)
		$\displaystyle=\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}+{\bm{\mathcal{X}}}_{c,i}-\bm{\mathcal{X}}_{i-1}\\|^{2}$
		$\displaystyle=\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}+{\bm{\mathcal{X}}}_{c,i-1}-\mu_{x}\operatorname{col}\{{\bm{g}}_{x,c,i-1}\}_{k=1}^{K}-\bm{\mathcal{X}}_{i-1}\\|^{2}$
		$\displaystyle\overset{(a)}{\leq}3\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\\|^{2}+3K\mu^{2}_{x}\\|{\bm{g}}_{x,c,i-1}\\|^{2}+3\\|\bm{\mathcal{X}}_{i-1}-\bm{\mathcal{X}}_{c,i-1}\\|^{2}\hskip 170.00026pt$

where $(a)$ follows from Jensen’s inequality. The term $\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}$ can be bounded similarly, therefore

		$\displaystyle\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}$		(75)
		$\displaystyle\leq 3\Big{[}\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\\|^{2}\Big{]}+3\Big{[}\\|\bm{\mathcal{X}}_{i-1}-{\bm{\mathcal{X}}}_{c,i-1}\\|^{2}$
		$\displaystyle+\\|\bm{\mathcal{Y}}_{i-1}-{\bm{\mathcal{Y}}}_{c,i-1}\\|^{2}\Big{]}+3K\Big{[}\mu_{x}^{2}\\|\bm{g}_{x,c,i-1}\\|^{2}+\mu_{y}^{2}\\|\bm{g}_{y,c,i-1}\\|^{2}\Big{]}$

Taking the conditional expectation of (75) over the distribution of random samples $\{\bm{\xi}^{x}_{k,i},\bm{\xi}^{y}_{k,i}\}_{k=1}^{K}$ , we get

$\displaystyle\mathbb{E}$	$\displaystyle\Big{[}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$	(76)
$\displaystyle\leq$	$\displaystyle\ 3\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
	$\displaystyle+3\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i-1}-{\bm{\mathcal{X}}}_{c,i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i-1}-{\bm{\mathcal{Y}}}_{c,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
	$\displaystyle+3K\mathbb{E}\Big{[}\mu_{x}^{2}\\|\bm{g}_{x,c,i-1}\\|^{2}+\mu_{y}^{2}\\|\bm{g}_{y,c,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$

For the term $\mathbb{E}[\|{\bm{g}}_{x,c,i-1}\|^{2}|\bm{\mathcal{F}}_{i-1}]$ , we can bound it as follows:

		$\displaystyle\ \mathbb{E}[\\|{\bm{g}}_{x,c,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}]$		(77)
		$\displaystyle=\ \mathbb{E}[\\|\sum_{k=1}^{K}p_{k}(\bm{s}^{x}_{k,i-1}+{g}_{x,k,i-1})\\|^{2}\mid\bm{\mathcal{F}}_{i-1}]\hskip 100.00015pt$
		$\displaystyle\overset{(a)}{=}\ \mathbb{E}\Big{[}\\|\sum_{k=1}^{K}p_{k}\bm{s}^{x}_{k,i-1}\\|^{2}+\\|\sum_{k=1}^{K}p_{k}{g}_{x,k,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
		$\displaystyle\overset{(b)}{\leq}\ 10\sum_{k=1}^{K}p^{2}_{k}\sigma^{2}_{k}+\\|{g}_{x,c,i-1}\\|^{2}\leq 10\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+\\|{g}_{x,c,i-1}\\|^{2}$

where $(a)$ follows from Assumption 4 and the fact the local true gradient ${g}_{x,k,i-1}$ is independent of the gradient noise $\bm{s}^{x}_{k,i-1}$ , $(b)$ follows from Assumption 4. A similar argument holds for $\mathbb{E}[\|{\bm{g}}_{y,c,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}]$ . Substituting these results into (76) and choosing $\mu_{x}\leq\mu_{y}$ , we obtain

$\displaystyle\mathbb{E}$	$\displaystyle\Big{[}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
$\displaystyle\leq$	$\displaystyle 3\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$	(78)
	$\displaystyle+3\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i-1}-{\bm{\mathcal{X}}}_{c,i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i-1}-{\bm{\mathcal{Y}}}_{c,i-1}\\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}$
	$\displaystyle+3K\Big{[}\mu^{2}_{x}\\|{g}_{x,c,i-1}\\|^{2}+\mu^{2}_{y}\\|{g}_{y,c,i-1}\\|^{2}\Big{]}+60K\sigma^{2}\underset{k}{\max}\ p^{2}_{k}\mu^{2}_{y}$

Taking expectations over (G) again and averaging the inequality over iterations, we get

$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}\Big{]}$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{6}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\\|^{2}\Big{]}+\frac{3K}{T}\sum_{i=0}^{T-1}$
	$\displaystyle\Big{[}\mu_{x}^{2}\mathbb{E}\\|{{g}}_{x,c,i-1}\\|^{2}+\mu_{y}^{2}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}\Big{]}+60K\sigma^{2}\underset{k}{\max}p^{2}_{k}\mu^{2}_{y}$
$\displaystyle\overset{(b)}{\leq}$	$\displaystyle\frac{240\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}+\frac{3K}{T}\sum_{i=0}^{T-1}\Big{[}\mu_{x}^{2}\mathbb{E}\\|{{g}}_{x,c,i-1}\\|^{2}$
	$\displaystyle+\mu_{y}^{2}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}\Big{]}+60K\sigma^{2}\underset{k}{\max}\ p^{2}_{k}\mu^{2}_{y}$	(79)

where $(a)$ is obtained by adding the positive terms $\mathbb{E}\|\bm{\mathcal{X}}_{T-1}-{\bm{\mathcal{X}}}_{c,T-1}\|^{2}+\mathbb{E}\|\bm{\mathcal{Y}}_{T-1}-{\bm{\mathcal{Y}}}_{c,T-1}\|^{2}$ into the RHS of (G) and grouping the summation terms, $(b)$ follows from Lemma 3.

Appendix H Proof of Lemma 5

Inserting the expression for $g_{x,c,i}$ into the following deviation term, we get

	$\displaystyle\mathbb{E}$	$\displaystyle\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}$
	$\displaystyle=$	$\displaystyle\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-\sum_{k=1}^{K}p_{k}[2\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})$
		$\displaystyle-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})]\\|^{2}\hskip 100.00015pt$
	$\displaystyle\overset{(a)}{\leq}$	$\displaystyle 2\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-\sum_{k=1}^{K}p_{k}\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})\Big{\\|}^{2}$
		$\displaystyle+2\mathbb{E}\\|\sum_{k=1}^{K}p_{k}[\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})]\\|^{2}$
	$\displaystyle\overset{(b)}{\leq}$	$\displaystyle\ 2\sum_{k=1}^{K}p_{k}\mathbb{E}\\|\nabla_{x}J_{k}(\bm{x}_{c,i},\bm{y}_{c,i})-\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})\\|^{2}$
		$\displaystyle+2\sum_{k=1}^{K}p_{k}\mathbb{E}\\|\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})\\|^{2}$
	$\displaystyle\overset{(c)}{\leq}$	$\displaystyle{4L^{2}_{f}}\sum_{k=1}^{K}p_{k}\mathbb{E}\Big{[}\\|\bm{x}_{c,i}-\bm{x}_{k,i}\\|^{2}+\\|\bm{y}_{c,i}-\bm{y}_{k,i}\\|^{2}\Big{]}$
		$\displaystyle+{4L^{2}_{f}}\sum_{k=1}^{K}p_{k}\mathbb{E}\Big{[}\\|\bm{x}_{k,i}-\bm{x}_{k,i-1}\\|^{2}+\\|\bm{y}_{k,i}-\bm{y}_{k,i-1}\\|^{2}\Big{]}$

	$\displaystyle\overset{(d)}{\leq}$	$\displaystyle 4L^{2}_{f}\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{c,i}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{c,i}\\|^{2}\Big{]}$
		$\displaystyle+4L^{2}_{f}\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}\Big{]}\hskip 20.00003pt$		(80)

where $(a)$ and $(b)$ follow from Jensen’s inequality, $(c)$ follows from the $L_{f}$ -smooth property of the local risk function $J_{k}(x,y)$ , $(d)$ is due to $p_{k}<1$ . Averaging (80) over iterations, we get

$\displaystyle\ \frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}$	(81)
$\displaystyle\leq$	$\displaystyle\frac{4L^{2}_{f}}{T}\sum_{i=0}^{T-1}\Big{(}\mathbb{E}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{c,i}\\|^{2}+\mathbb{E}\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{c,i}\\|^{2}\Big{)}$
	$\displaystyle+\frac{4L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\\|^{2}+\\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\\|^{2}\Big{]}$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle+\frac{12KL^{2}_{f}}{T}\sum_{i=0}^{T-1}\Big{[}\mu_{x}^{2}\mathbb{E}\\|{{g}}_{x,c,i-1}\\|^{2}+\mu_{y}^{2}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}\Big{]}$

where $(a)$ follows from Lemmas 3–4. A similar argument holds for $\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}$ .

Appendix I Proof of Lemma 6

For convenience, we denote

{d}_{c}(i)\triangleq{d}(\bm{x}_{c,i},\bm{y}_{c,i})={{P}}(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i})

(82)

with subscript “c” to denote the dual optimality gap evaluated at the network centroid $(\bm{x}_{c,i},\bm{y}_{c,i})$ . From Lemma 2, $P(x)$ is $L$ -smooth, we have

		$\displaystyle\ {{P}}(\bm{x}_{c,i+1})$
		$\displaystyle\leq\ {{P}}(\bm{x}_{c,i})+\langle\nabla{{P}}(\bm{x}_{c,i}),\bm{x}_{c,i+1}-\bm{x}_{c,i}\rangle+\frac{L}{2}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}$
		$\displaystyle\leq\ {{P}}(\bm{x}_{c,i})-\mu_{x}\langle\nabla{{P}}(\bm{x}_{c,i}),\bm{g}_{x,c,i}\rangle+\frac{L\mu^{2}_{x}}{2}\\|\bm{g}_{x,c,i}\\|^{2}$		(83)

Taking conditional expectation of (I) over the distribution of random sample $\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}$ , we get

$\displaystyle\ \mathbb{E}$	$\displaystyle[{{P}}(\bm{x}_{c,i+1})\mid\bm{\mathcal{F}}_{i}]$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle{{P}}(\bm{x}_{c,i})-\mu_{x}\langle\nabla{{P}}(\bm{x}_{c,i}),{g}_{x,c,i}\rangle+\frac{L\mu^{2}_{x}}{2}\mathbb{E}[\\|\bm{g}_{x,c,i}\\|^{2}\mid\bm{\mathcal{F}}_{i}]$
$\displaystyle\leq\$	$\displaystyle{{P}}(\bm{x}_{c,i})-\frac{\mu_{x}}{2}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{\mu_{x}}{2}\\|\nabla{{P}}(\bm{x}_{c,i})-{g}_{x,c,i}\\|^{2}+\frac{L\mu^{2}_{x}}{2}\mathbb{E}[\\|\bm{g}_{x,c,i}\\|^{2}\mid\bm{\mathcal{F}}_{i}]$
$\displaystyle\overset{(b)}{\leq}$	$\displaystyle{{P}}(\bm{x}_{c,i})-\frac{\mu_{x}}{2}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}\\|{g}_{x,c,i}\\|^{2}+\frac{\mu_{x}}{2}\\|\nabla{{P}}(\bm{x}_{c,i})$
	$\displaystyle-{g}_{x,c,i}\\|^{2}+\frac{L\mu^{2}_{x}}{2}(10\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+\\|{g}_{x,c,i}\\|^{2})$	(84)

where $(a)$ follows from the fact that $\bm{x}_{c,i}$ does not rely on the random samples at iteration $i+1$ , $(b)$ follows from (77). Inserting $\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})$ into (I) and applying Jensen’s inequality, we get

$\displaystyle\mathbb{E}$	$\displaystyle[{{P}}(\bm{x}_{c,i+1})\mid\bm{\mathcal{F}}_{i}]$
$\displaystyle\leq$	$\displaystyle{{P}}(\bm{x}_{c,i})-\frac{\mu_{x}}{2}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\mu_{x}\\|\nabla{{P}}(\bm{x}_{c,i})-\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})\\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle+\mu_{x}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}$	(85)

Taking expectations again over $\eqref{startsmooth11inter}$ , we get

$\displaystyle\ \mathbb{E}$	$\displaystyle[{{P}}(\bm{x}_{c,i+1})]$
$\displaystyle\leq$	$\displaystyle\mathbb{E}[{{P}}(\bm{x}_{c,i})]-\frac{\mu_{x}}{2}\mathbb{E}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\mu_{x}\mathbb{E}\\|\nabla{{P}}(\bm{x}_{c,i})-\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})\\|^{2}$
	$\displaystyle+\mu_{x}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\mathbb{E}[{{P}}(\bm{x}_{c,i})]-\frac{\mu_{x}}{2}\mathbb{E}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\mu_{x}L^{2}_{f}\mathbb{E}\\|\bm{y}^{o}(\bm{x}_{c,i})-\bm{y}_{c,i}\\|^{2}+\mu_{x}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})$
	$\displaystyle-{g}_{x,c,i}\\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
$\displaystyle\overset{(b)}{\leq}$	$\displaystyle\mathbb{E}[{{P}}(\bm{x}_{c,i})]-\frac{\mu_{x}}{2}\mathbb{E}\\|\nabla{{P}}(\bm{x}_{c,i})\\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{2\mu_{x}L^{2}_{f}}{\nu}\mathbb{E}[{{P}}(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i})]$	(86)
	$\displaystyle+\mu_{x}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

Here, $(a)$ is due to the $L_{f}$ -smoothness property of the risk function, $(b)$ follows from Lemma 1.

Appendix J Proof of Lemma 7

Because $-J(x,y)$ is $L_{f}$ -smooth, fixing $-J(x,y)$ at $\bm{x}_{c,i+1}$ , we have

	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})$
$\displaystyle\leq$	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\langle\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i}),\bm{y}_{c,i+1}-\bm{y}_{c,i}\rangle$
	$\displaystyle+\frac{L_{f}}{2}\\|\bm{y}_{c,i+1}-\bm{y}_{c,i}\\|^{2}$
$\displaystyle\leq$	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\mu_{y}\langle\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i}),\bm{g}_{y,c,i}\rangle$
	$\displaystyle+\frac{L_{f}\mu^{2}_{y}}{2}\\|\bm{g}_{y,c,i}\\|^{2}\hskip 100.00015pt$	(87)

Taking expectation of (J) over the distribution of $\{\bm{\xi}^{y}_{k,i+1}\}_{k=1}^{K}$ conditioned on $\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}$ and $\bm{\mathcal{F}}_{i}$ , we get

	$\displaystyle\mathbb{E}$	$\displaystyle[-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})\mid\bm{\mathcal{F}}_{i},\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}]$
	$\displaystyle\leq$	$\displaystyle\ -J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\mu_{y}\langle\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i}),{g}_{y,c,i}\rangle$
		$\displaystyle+\frac{L_{f}\mu^{2}_{y}}{2}\mathbb{E}\Big{[}\\|\bm{g}_{y,c,i}\\|^{2}\mid\bm{\mathcal{F}}_{i},\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}\Big{]}$
	$\displaystyle\overset{(a)}{\leq}$	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\frac{\mu_{y}}{2}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\\|^{2}$
		$\displaystyle+\frac{\mu_{y}}{2}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+\frac{L_{f}\mu^{2}_{y}}{2}\\|{g}_{y,c,i}\\|^{2}$
		$\displaystyle+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}-\frac{\mu_{y}}{2}\\|{g}_{y,c,i}\\|^{2}$

$\displaystyle\overset{(b)}{\leq}$	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\frac{\mu_{y}}{2}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\\|^{2}$
	$\displaystyle+\mu_{y}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})\\|^{2}$
	$\displaystyle+\mu_{y}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+\frac{L_{f}\mu^{2}_{y}}{2}\\|{g}_{y,c,i}\\|^{2}$
	$\displaystyle+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}-\frac{\mu_{y}}{2}\\|{g}_{y,c,i}\\|^{2}$
$\displaystyle\overset{(c)}{\leq}$	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\frac{\mu_{y}}{2}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\\|^{2}$	(88)
	$\displaystyle-\frac{\mu_{y}}{2}\\|{g}_{y,c,i}\\|^{2}+\mu_{y}L^{2}_{f}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}+\frac{L_{f}\mu^{2}_{y}}{2}\\|{g}_{y,c,i}\\|^{2}$
	$\displaystyle+\mu_{y}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

where $(a)$ follows from (77), $(b)$ we insert the true gradient information $\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})$ and apply Jensen’s inequality, $(c)$ is due to the $L_{f}$ -smooth property of the risk function. Taking expectation again over (J), we obtain the result (34). The proof of (35) is similar to the above arguments.

Appendix K Proof of Lemma 8

Adding $\mathbb{E}[P(\bm{x}_{c,i+1})]$ on both sides of (34):

$\displaystyle\ \mathbb{E}$	$\displaystyle[P(\bm{x}_{c,i+1})-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})]$
$\displaystyle\leq$	$\displaystyle\ \mathbb{E}[P(\bm{x}_{c,i+1})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]-\frac{\mu_{y}}{2}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\\|^{2}$
	$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\\|{g}_{y,c,i}\\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle+\mu_{y}L^{2}_{f}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\ (1-\nu\mu_{y})\mathbb{E}[P(\bm{x}_{c,i+1})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]$
	$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\\|{g}_{y,c,i}\\|^{2}+\mu_{y}L^{2}_{f}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}$
$\displaystyle\overset{(b)}{\leq}$	$\displaystyle(1-\nu\mu_{y})\mathbb{E}[P(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i})+J(\bm{x}_{c,i},\bm{y}_{c,i})$	(89)
	$\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})+P(\bm{x}_{c,i+1})-P(\bm{x}_{c,i})]$
	$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\\|{g}_{y,c,i}\\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}$
	$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

where $(a)$ follows from Assumption 1, $(b)$ is obtained by adding and subtracting the terms $\mathbb{E}[P(\bm{x}_{c,i})]$ and $\mathbb{E}[J(\bm{x}_{c,i},\bm{y}_{c,i})]$ in $(a)$ . By the definition of ${d}_{c}(i)$ , we can rewrite above inequality as:

$\displaystyle\mathbb{E}$	$\displaystyle[{d}_{c}(i+1)]$
$\displaystyle\leq$	$\displaystyle(1-\nu\mu_{y})\Big{\{}\mathbb{E}[{d}_{c}(i)]+\mathbb{E}[J(\bm{x}_{c,i},\bm{y}_{c,i})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]$
	$\displaystyle+\mathbb{E}[P(\bm{x}_{c,i+1})-P(\bm{x}_{c,i})]\Big{\}}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\\|{g}_{y,c,i}\\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}$
	$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}$	(90)

By Lemma 7, we can bound $\mathbb{E}[J(\bm{x}_{c,i},\bm{y}_{c,i})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]$ as:

$\displaystyle\mathbb{E}$	$\displaystyle[J(\bm{x}_{c,i},\bm{y}_{c,i})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]$
$\displaystyle\leq$	$\displaystyle\frac{\mu_{x}}{2}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}+\frac{\mu_{x}(3+L_{f}\mu_{x})}{2}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+5L_{f}\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$	(91)

By Lemma 6, we can bound $\mathbb{E}[{{P}}(\bm{x}_{c,i+1})-{{P}}(\bm{x}_{c,i})]$ as

$\displaystyle\mathbb{E}$	$\displaystyle[{{P}}(\bm{x}_{c,i+1})-{{P}}(\bm{x}_{c,i})]$
$\displaystyle\leq$	$\displaystyle\frac{2\mu_{x}L^{2}_{f}}{\nu}{d}_{c}(i)-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$	(92)
	$\displaystyle+\mu_{x}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

Combining the results of (K) and (K) into (K), we get

$\displaystyle\mathbb{E}$	$\displaystyle[{d}_{c}(i+1)]$	(93)
$\displaystyle\leq$	$\displaystyle(1-\nu\mu_{y})(1+\frac{2\mu_{x}L_{f}^{2}}{\nu})\mathbb{E}[{d}_{c}(i)]+\frac{3\mu_{x}(1-\nu\mu_{y})}{2}$
	$\displaystyle\times\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}$
	$\displaystyle+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\\|{g}_{y,c,i}\\|^{2}$
	$\displaystyle+{\mu_{y}}L_{f}^{2}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}$
	$\displaystyle+5{(1-\nu\mu_{y})(L+L_{f})\mu^{2}_{x}\sigma^{2}}\underset{k}{\max}\ p^{2}_{k}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

We further choose $\frac{2\mu_{x}L_{f}^{2}}{\nu}\leq\frac{\nu\mu_{y}}{2},\mu_{y}\leq\frac{1}{\nu},\mu^{2}_{x}\leq\frac{L_{f}\mu^{2}_{y}}{L+L_{f}}$ . We have $(1-\nu\mu_{y})(1+\frac{2\mu_{x}L_{f}^{2}}{\nu})\leq 1-\frac{\nu\mu_{y}}{2}$ , and inequality (93) becomes

$\displaystyle\mathbb{E}$	$\displaystyle[{d}_{c}(i+1)]$	(94)
$\displaystyle\leq$	$\displaystyle(1-\frac{\nu\mu_{y}}{2})\mathbb{E}[{d}_{c}(i)]+\frac{3\mu_{x}(1-\nu\mu_{y})}{2}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})$
	$\displaystyle-{g}_{x,c,i}\\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\\|{g}_{y,c,i}\\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}$
	$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}+10L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$

Choosing $\mu_{y}<\frac{1}{L_{f}}$ and iterating (94) from $i$ to $0$ , we get

	$\displaystyle\mathbb{E}$	$\displaystyle[{d}_{c}(i)]$
	$\displaystyle\leq$	$\displaystyle(1-\frac{\nu\mu_{y}}{2})^{i}{d}_{c}(0)+\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-1-j}\Bigg{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{2}$
		$\displaystyle\times\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{x,c,j}\\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}$
		$\displaystyle+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\\|{g}_{x,c,j}\\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\\|\bm{x}_{c,j+1}-\bm{x}_{c,j}\\|^{2}$
		$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{y,c,j}\\|^{2}+10L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}\Bigg{\}}$
		$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j-1}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$

$\displaystyle\overset{(a)}{\leq}$	$\displaystyle(1-\frac{\nu\mu_{y}}{2})^{i}{d}_{c}(0)+\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-1-j}\Bigg{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{2}$
	$\displaystyle\times\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{x,c,j}\\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}$
	$\displaystyle+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\\|{g}_{x,c,j}\\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\\|\bm{x}_{c,j+1}-\bm{x}_{c,j}\\|^{2}$
	$\displaystyle+\mu_{y}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{y,c,j}\\|^{2}\Bigg{\}}+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$	(95)

where $(a)$ is due to $(1-\frac{\nu\mu_{y}}{2})^{i-j-1}\|{g}_{y,c,j}\|^{2}\geq(1-\frac{\nu\mu_{y}}{2})^{i-j}\|{g}_{y,c,j}\|^{2}$ for any $j\leq i-1$ . Averaging (K) over iterations, we get

$\displaystyle\frac{1}{T}$	$\displaystyle\sum_{i=0}^{T-1}\mathbb{E}[{d}_{c}(i)]$
$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{2}{\nu\mu_{y}T}{d}_{c}(0)+\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{\nu\mu_{y}}\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})$
	$\displaystyle-{g}_{x,c,i}\\|^{2}+(1-\nu\mu_{y})\Big{[}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{]}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}+$
	$\displaystyle\frac{2L_{f}^{2}}{\nu}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}+\frac{2}{\nu}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}\Bigg{\}}$
	$\displaystyle-\frac{\mu_{y}}{2T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$
	$\displaystyle+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k}$	(96)

where (a) follows from the following inequality

	$\displaystyle\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}q^{i-1-j}a_{j}=\sum_{i=0}^{T-2}a_{i}(\sum_{j=0}^{T-2-i}q^{j})\leq$	$\displaystyle\frac{1}{1-q}\sum_{i=0}^{T-2}a_{i}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{1-q}\sum_{i=0}^{T-1}a_{i}$		(97)

here, $0<q<1$ is a constant, $\{a_{i}\}$ is a positive sequence.

References

[1] H. Cai, S. A. Sulaiman, and A. H. Sayed, “Diffusion optimistic learning for min-max optimization,” in Proc. IEEE ICASSP, Seoul, South Korea, April 2024, pp. 1–5.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 1–9.
[3] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv:1706.06083, 2017.
[4] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “Sbeed: Convergent reinforcement learning with nonlinear function approximation,” in International Conference on Machine Learning. PMLR, 2018, pp. 1125–1134.
[5] D. Kovalev and A. Gasnikov, “The first optimal algorithm for smooth and strongly-convex-strongly-concave minimax optimization,” in Advances in Neural Information Processing Systems, 2022, pp. 14691–14703.
[6] J. Zhang, M. Hong, and S. Zhang, “On lower iteration complexity bounds for the convex concave saddle point problems,” Mathematical Programming, vol. 194, no. 1-2, pp. 901–935, 2022.
[7] P. Mahdavinia, Y. Deng, H. Li, and M. Mahdavi, “Tight analysis of extra-gradient and optimistic gradient methods for nonconvex minimax problems,” in Advances in Neural Information Processing Systems, 2022, pp. 31213–31225.
[8] J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algorithms for minimax optimization without strong concavity,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 5485–5517.
[9] T. Lin, C. Jin, and M. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” in International Conference on Machine Learning. PMLR, 2020, pp. 6083–6093.
[10] J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,” in Advances in Neural Information Processing Systems, 2020, pp. 1153–1165.
[11] D. Kinderlehrer and G. Stampacchia, An Introduction to Variational Inequalities and Their Applications, SIAM, 2000.
[12] S. Komlósi, “On the stampacchia and minty variational inequalities,” Generalized Convexity and Optimization for Economic and Financial Decisions, pp. 231–260, 1999.
[13] Z. Dou and Y. Li, “On the one-sided convergence of adam-type algorithms in non-convex non-concave min-max optimization,” arXiv:2109.14213, 2021.
[14] A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities,” in Advances in Neural Information Processing Systems, 2022, pp. 38116–38133.
[15] A. Bohm, “Solving nonconvex-nonconcave min-max problems exhibiting weak minty solutions,” Transactions on Machine Learning Research, pp. 1–21, 2023.
[16] J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min-max optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2746–2754.
[17] T. Pethick, O. Fercoq, P. Latafat, P. Patrinos, and V. Cevher, “Solving stochastic weak minty variational inequalities without increasing batch size,” arXiv:2302.09029, 2023.
[18] C. Daskalakis, S. Skoulakis, and M. Zampetakis, “The complexity of constrained min-max optimization,” in Proc. ACM SIGACT Symposium on Theory of Computing, 2021, pp. 1466–1478.
[19] G.M. Korpelevich, “The extragradient method for finding saddle points and other problems,” Matecon, vol. 12, pp. 747–756, 1976.
[20] M. Liu, W. Zhang, Y. Mroueh, X. Cui, J. Ross, T. Yang, and P. Das, “A decentralized parallel algorithm for training generative adversarial nets,” in Advances in Neural Information Processing Systems, 2020, pp. 11056–11070.
[21] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training gans with optimism.,” in International Conference on Learning Representations (ICLR 2018), 2018.
[22] A. H. Sayed, “Adaptation, learning, and optimization over networks,” Foundations and Trends® in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
[23] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Machine Learning and Knowledge Discovery in Databases: European Conference ECML PKDD, Riva del Garda, Italy, 2016, 2016, pp. 795–811.
[24] G. Zhang, Y. Wang, L. Lessard, and R. B. Grosse, “Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 7659–7679.
[25] T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien, “Reducing noise in gan training with variance reduced extragradient,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[26] C. Liu and L. Luo, “Quasi-newton methods for saddle point problems,” in Advances in Neural Information Processing Systems, 2022, pp. 3975–3987.
[27] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar, “Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems,” in Conference on Learning Theory. PMLR, 2020, pp. 1758–1784.
[28] E. Gorbunov, A. Taylor, and G. Gidel, “Last-iterate convergence of optimistic gradient method for monotone variational inequalities,” Advances in Neural Information Processing Systems, vol. 35, pp. 21858–21870, 2022.
[29] E.K. Ryu, K. Yuan, and W. Yin, “ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems,” arXiv:1905.10899, 2019.
[30] T. Yoon and E.K. Ryu, “Accelerated algorithms for smooth convex-concave minimax problems with o (1/k^2 ) rate on squared gradient norm,” in International Conference on Machine Learning. PMLR, 2021, pp. 12098–12109.
[31] G. Gidel, R. A. Hemmat, M. Pezeshki, R. L. Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas, “Negative momentum for improved game dynamics,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1802–1811.
[32] L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems,” in Advances in Neural Information Processing Systems, 2020, pp. 20566–20577.
[33] A. H. Sayed, Inference and Learning from Data, vol. 3, Cambridge University Press, 2022.
[34] W. Xian, F. Huang, Y. Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax problems,” in Advances in Neural Information Processing Systems, 2021, pp. 25865–25877.
[35] H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,” arXiv:2212.02724, 2022.
[36] L. Chen, H. Ye, and L. Luo, “A simple and efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” arXiv:2212.02387, 2022.
[37] F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-PL minimax problems,” arXiv:2304.10902, 2023.
[38] A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[39] Y. Arjevani, Y. Carmon, J. C. Duchi, J.D. Foster, N. Srebro, and B. Woodworth, “Lower bounds for non-convex stochastic optimization,” Mathematical Programming, vol. 199, no. 1-2, pp. 165–214, 2023.
[40] L. D. Popov, “A modification of the arrow-hurwicz method for search of saddle points,” Mathematical Notes of the Academy of Sciences of the USSR, pp. 845–848, 1980.
[41] K. Antonakopoulos, A. Kavis, and V. Cevher, “Extra-newton: A first approach to noise-adaptive accelerated second-order methods,” Advances in Neural Information Processing Systems, vol. 35, pp. 29859–29872, 2022.
[42] Y. Hsieh, F. Iutzeler, J. Malick, and P. Mertikopoulos, “On the convergence of single-call stochastic extra-gradient methods,” in Advances in Neural Information Processing Systems, 2019, pp. 1–11.
[43] Y. Malitsky and M. K. Tam, “A forward-backward splitting method for monotone inclusions without cocoercivity,” SIAM Journal on Optimization, pp. 1451–1472, 2020.
[44] K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik, and Y. Malitsky, “Revisiting stochastic extragradient,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 4573–4582.
[45] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—part i: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018.
[46] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022.
[47] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2787–2794, 2020.
[48] A. H. Sayed, “Diffusion adaptation over networks,” in Academic Press Library in Signal Processing, vol. 3, pp. 323–453. Elsevier, 2014.
[49] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv:1511.06434, 2015.
[50] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[51] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” in Advances in Neural Information Processing Systems, 2019, pp. 1–9.

	$\displaystyle\mathbb{E}\\|\nabla P({\bm{x}}^{\star}_{c,T_{0}})\\|^{2}$	$\displaystyle=\inf_{i=0,\dots,T_{0}-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$		(20)
		$\displaystyle\leq\frac{1}{T_{0}}\sum_{i=0}^{T_{0}-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}\leq\mathcal{O}(\varepsilon^{2})$

	$\displaystyle-{g}_{x,c,i}\\|^{2}+(1-\nu\mu_{y})\Big{[}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{]}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
	$\displaystyle+\frac{2L_{f}^{2}}{\nu}\mathbb{E}\\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\\|^{2}+\frac{2}{\nu}\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}\Big{\}}$
	$\displaystyle-\frac{\mu_{y}}{2T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$
	$\displaystyle+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k}$		(37)

	$\displaystyle\overset{(a)}{\leq}$	$\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}$
		$\displaystyle\times\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}-\frac{8L_{f}^{4}\mu^{2}_{x}}{\nu^{2}}\Big{\}}\mathbb{E}\\|{g}_{x,c,i}\\|^{2}$
		$\displaystyle-4\mathbb{E}\\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\\|^{2}-\frac{8L_{f}^{2}}{\nu^{2}}$
		$\displaystyle{\times\mathbb{E}\\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\\|^{2}}\Bigg{\}}+\frac{80L^{4}_{f}\mu^{2}_{x}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}$
		$\displaystyle+\frac{8L_{f}^{2}}{\nu^{2}\mu_{y}T}{d}_{c}(0)+\frac{80L^{3}_{f}\mu_{y}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
		$\displaystyle-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$

	$\displaystyle\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\\|{g}_{y,c,j}\\|^{2}$
	$\displaystyle-\frac{\beta_{1}\mu^{2}_{y}}{T}\sum_{i=0}^{T-1}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}\hskip 200.0003pt$
	$\displaystyle\overset{(a)}{=}\frac{4L^{2}_{f}}{\nu^{2}T}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\sum_{i=0}^{T-2}{(1-(1-\frac{\nu\mu_{y}}{2})^{T-1-i})}$
	$\displaystyle\quad\times\mathbb{E}\\|{g}_{y,c,i}\\|^{2}-\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=0}^{T-1}\mathbb{E}\\|{{g}}_{y,c,i-1}\\|^{2}$
	$\displaystyle\overset{(b)}{\geq}\ \frac{\tau_{3}}{2T}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\sum_{i=0}^{T-2}{(1-(1-\frac{\nu\mu_{y}}{2}))}$
	$\displaystyle\quad\times\mathbb{E}\\|{g}_{y,c,i}\\|^{2}-\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=-1}^{T-2}\mathbb{E}\\|{{g}}_{y,c,i}\\|^{2}$
	$\displaystyle\geq\ \frac{1}{T}\{\frac{2L^{2}_{f}}{\nu}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\mu_{y}-\beta_{1}\mu_{y}^{2}\}\sum_{i=-1}^{T-2}\mathbb{E}\\|{{g}}_{y,c,i}\\|^{2}$
	$\displaystyle\quad-\frac{\beta_{1}\mu_{y}^{2}\\|{{g}}_{y,c,-1}\\|^{2}}{T}\quad\overset{(c)}{\geq}-\frac{\beta_{1}\mu_{y}^{2}\\|{{g}}_{y,c,-1}\\|^{2}}{T}$		(45)

	$\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\\|\nabla P(\bm{x}_{c,i})\\|^{2}$
	$\displaystyle\leq\ \frac{2P_{0}}{\mu_{x}T}+\frac{\tau_{3}{d}_{c}(0)}{\mu_{y}T}+(\tau_{2}\mu_{y}+\tau_{4}\mu_{x}+\tau_{1}\mu^{2}_{x})\sigma^{2}\underset{k}{\max}\ p^{2}_{k}$
	$\displaystyle\quad+\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\\|\mathcal{J}^{\top}_{\gamma_{1}}\\|^{2}}(\tau_{3}+4)+\frac{\beta_{1}\mu_{y}^{2}\\|{{g}}_{y,c,-1}\\|^{2}}{T}$
	$\displaystyle\quad+\frac{\beta_{1}\mu^{2}_{x}\\|g_{x,c,-1}\\|^{2}}{T}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}(\tau_{3}+4)$		(46)