This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Diffusion Stochastic Optimization for Min-Max Problems

Haoyuan Cai, , Sulaiman A. Alghunaim, , and Ali H. Sayed,
Kuwait University, Kuwait
École Polytechnique Fédérale de Lausanne, Switzerland
A short version of this work without derivations will appear in the conference publication [1]. Emails: [email protected], [email protected], [email protected]
Abstract

The optimistic gradient method is useful in addressing minimax optimization problems. Motivated by the observation that the conventional stochastic version suffers from the need for a large batch size on the order of 𝒪(ε2)\mathcal{O}(\varepsilon^{-2}) to achieve an ε\varepsilon-stationary solution, we introduce and analyze a new formulation termed Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG). We prove its convergence and resolve the large batch issue by establishing a tighter upper bound, under the more general setting of nonconvex Polyak-Lojasiewicz (PL) risk functions. We also extend the applicability of the proposed method to the distributed scenario, where agents communicate with their neighbors via a left-stochastic protocol. To implement DSS-OG, we can query the stochastic gradient oracles in parallel with some extra memory overhead, resulting in a complexity comparable to its conventional counterpart. To demonstrate the efficacy of the proposed algorithm, we conduct tests by training generative adversarial networks.

Index Terms:
Minimax optimization, nonconvex Polyak-Lojasiewicz, distributed stochastic optimization, stochastic optimistic gradient, diffusion strategy

I Introduction

Minimax optimization takes center stage in many machine learning applications, including generative adversarial networks (GANs) [2], adversarial machine learning [3], and reinforcement learning [4].

Minimax problems take the form:

minxM1maxyM2J(x,y)\displaystyle\underset{x\in\mathbb{R}^{M_{1}}}{\min}\underset{y\in\mathbb{R}^{M_{2}}}{\max}\ J(x,y) (1)

where xx and yy are the primal and dual variables, respectively, and J(x,y)J(x,y) is the risk (objective) function. This problem is typically approached from various perspectives. In the optimization literature, assuming certain structure for the risk function over the variables is prevalent. For instance, the risk function can be assumed (strongly) convex in the primal variable and (strongly) concave in the dual variable [5, 6] or it can be assumed one-sided nonconvex/Polyak-Lojasiewicz (PL) [7, 8, 9] in the primal variable and one-sided strongly concave/PL in the dual variable. Some works assume a two-sided PL condition[10]. It is noted that the gradient operator of a (strongly) convex or (strongly) concave objective is also (strongly) monotone; therefore, the optimization problem is generalized to a variational inequality (VI) problem in some settings. Solving VI problems heavily relies on the concept of “operators” and their associated properties. The goal of a VI problem is to determine the solution at which the gradient operator satisfies variational stability conditions, such as the Stampacchia variational inequality (SVI), which is a first-order necessary optimality condition [11, 12]. In a similar vein, the Minty variational inequality (MVI) condition and its weak version, which include a subclass of nonmonotone problems, has also gained popularity in recent years [13, 14, 15, 16, 17]. The settings mentioned above represent the tractable conditions for which convergent algorithms are known to exist. On the other hand, the nonconvex-nonconcave formulation, which is at the core of a wide class of minimax objectives, remains intractable [18]. This fact corroborates the significant difficulty encountered in training GANs, as they often involve nonconvex-nonconcave objectives.

The optimistic gradient (OG) method and its variants have attracted significant interest for their proven convergence in wide formulations of (1) [19, 15, 14, 7, 13, 20]. However, a good part of the theory from this body of literature breaks down when the stochastic setting is taken into consideration where only a noisy random estimate of the gradient is available. For instance, the work [7] showed that employing a batch size on the order of 𝒪(ε2)\mathcal{O}(\varepsilon^{-2}) is necessary to estimate the gradient and in order to control the gradient noise for stochastic OG method in a nonconvex-strongly concave (SC) setting. Here, the scalar ε\varepsilon denotes the desired level of accuracy in the solution, which is generally a very small number. Similarly, the work [20] established similar results with a batch size 𝒪(ε4)\mathcal{O}(\varepsilon^{-4}) to achieve convergence in the stochastic MVI setting. These are not isolated cases; a similar need for using large batch sizes appears in the works [14, 13, 9, 15]. The reliance on batch sizes that are inversely proportional to ε\varepsilon is a serious theoretical limitation since it rules out the freedom of using small batches. Meanwhile, empirical results showed that stochastic OG may not be subject to such a restrictive requirement [21, 20].

In this work, we investigate distributed stochastic minimax problems where a network of KK agents cooperate to solve the following problem:

minxM1maxyM2J(x,y)\displaystyle\underset{x\in\mathbb{R}^{M_{1}}}{\min}\underset{y\in\mathbb{R}^{M_{2}}}{\max}\ J(x,y) =k=1KpkJk(x,y)\displaystyle=\sum_{k=1}^{K}p_{k}J_{k}(x,y) (2a)
whereJk(x,y)\displaystyle\text{where}\quad J_{k}(x,y) 𝔼𝝃kQk(x,y;𝝃k)\displaystyle\triangleq\mathbb{E}_{\bm{\xi}_{k}}Q_{k}(x,y;\bm{\xi}_{k}) (2b)

Here, x,yx,y are the global primal and dual variables, the function Qk(x,y;𝝃k)Q_{k}(x,y;\bm{\xi}_{k}) is the local stochastic loss evaluated at the sample 𝝃k\bm{\xi}_{k}. The scalars pkp_{k} are positive weights satisfying k=1Kpk=1\sum_{k=1}^{K}p_{k}=1. We assume that each agent kk only has access to its local data set {𝝃k}\{\bm{\xi}_{k}\}. We consider a fully distributed (decentralized) setting where the agents are connected by a graph topology and each agent can only share information with its immediate neighbors [22]. We approach (2a)-(2b) from the perspective of optimization, adopting assumptions that are more general than the nonconvex-SC formulation. Instead of assuming the dual risk function to be SC in the dual variable, we consider the weaker PL-condition [23]. Specifically, we adopt the following assumption.

Assumption 1 (Risk function).

We assume each local risk function Jk(x,y)J_{k}(x,y) is nonconvex in xx while J(x,y)-J(x,y) is ν\nu-PL in yy, i.e., for any xx and yy in the domain of J(x,y)J(x,y), it holds that

yJ(x,y)22ν(max𝑦J(x,y)J(x,y))\|\nabla_{y}J(x,y)\|^{2}\geq 2\nu(\underset{y}{\max}\ J(x,y)-J(x,y)) (3)

where ν\nu is some strictly positive constant. \hfill{\square}

Under this assumption, we are interested in answering the following questions: Q1: Can the reliance on large batch sizes be eliminated in the stochastic OG method? Q2: What are the performance guarantees of the resulting batch-flexible stochastic OG over networks?

I-A Related works

Minimax optimization problems have been largely studied under the single node setting [6, 24, 5, 9, 25, 26, 27, 28, 29, 30]. The fundamental setting assumed within these works is the strong convexity-strong concavity/strong monotonicity. For this setting, the work [6] established the tight complexity bound of the first-order algorithm in finding a near-optimal solution. The work [5] devised the first optimal algorithm that matches this bound. The work [24] showed that the alternating gradient descent-ascent (AGDA) method converges faster than its simultaneous counterpart. The work [25] proposed a stochastic variance-reduced extragradient (EG) method, which showed improved performance in training GANs. The quasi-Newton methods have also been proposed to achieve a faster convergence rate in finding the saddle point [26]. Dropping “strongly” setting from both sides of strong convexity-strong concavity brings about convergence issues where simultaneous GDA will suffer from divergent rotation while AGDA will get stuck in a limit cycle rather than converge to a limit point [24, 31]. For the convex-concave setting, the work [29] established the sublinear rate of 𝒪(1T)\mathcal{O}(\frac{1}{T}) for the best-iterate of the stochastic OG method, namely the best iterate produced over running TT iterations. Achieving last-iterate convergence is more difficult; this aspect of OG was not addressed until the work [28]. Furthermore, the work [27] showed that the average-iterate of EG, namely the average of all iterates produced over iterations, has a convergence rate of 𝒪(1T2)\mathcal{O}(\frac{1}{T^{2}}) which is faster than its last-iterate. The work [30] also showed anchoring EG has a convergence rate of 𝒪(1T2)\mathcal{O}(\frac{1}{T^{2}}) without considering stochasticity. There is another line of research that investigates the structured nonmonotone problem, particularly under the premise of the (weak) Minty variational inequality— see [16, 17, 15].

In recent years, there has been an increasing interest towards the stochastic nonconvex (one-sided) minimax optimization problem. The work [9] established the convergence of stochastic GDA (SGDA) in finding the primal solution under a nonconvex-SC formulation, with the use of batch size 𝒪(ε2)\mathcal{O}(\varepsilon^{-2}) to reach ε\varepsilon error accuracy. The work [32] proposed to incorporate a concave maximizer and variance reduction into SGDA, which is computationally costly since it performs multi-loop steps in updating the dual variable. The work [10] studied the two-sided PL setting and established the linear convergence of variance-reduced stochastic AGDA (SAGDA). Additionally, the work [8] showed that employing two iid samples for SAGDA obtains 𝒪(1T1/2)\mathcal{O}(\frac{1}{T^{1/2}}) convergence rate for the primal solution in nonconvex-PL setting. However, vanilla AGDA does not converge in bilinear scenarios[31], which casts doubt on its ability to solve more complex settings. Exploiting the convergence of the backbone minimax algorithms across various settings, say OG/EG, is always of independent interest. Unfortunately, the recent work [7] showed a less than ideal result under the nonconvex-SC formulation as the large batch issue persists. This fact strongly motivates us to pursue better results that theoretically address this limitation.

Over the past decade, distributed optimization has become an increasingly important field due to its flexibility in processing large volume and physically dispersed data [22, 33]. The distributed minimax optimization scenario remains an open field to be exploited since there are many diverse settings. Some distributed nonconvex minimax works appear in [34, 35, 36, 37, 14, 20]. The work [34] showed the convergence rate of 𝒪(1T2/3)\mathcal{O}(\frac{1}{T^{2/3}}) for the primal solution using STORM momentum-based [38] SGDA, under a nonconvex-SC setting. The work [37] extended the analysis to the nonconvex-PL setting. Other works [35, 36] introduced variance-reduced SGDA for the nonconvex-SC setting. However, variance reduction requires an excessively large batch size to estimate the gradient at the checkpoint, making it hard to tune. On the other hand, the works [34, 38, 37, 36, 35] rely on the Lipschitz continuity of the stochastic loss gradient to carry out the analysis. Relaxing this condition leads to a limited applicability of the results [39]. The work [14] proposed a distributed version of stochastic EG for solving stochastic MVI in a time-varying network. The work [20] proposed a single-call EG [40] for distributed parallel training of the GANs. However, both results face the same large batch size issue.

Works Assumptions Primal Rate Dual Rate Batch size
DPOSG [20] Stochastic MVI, bounded gradient, D-S 𝒪(1T1/4+σ2KB)\mathcal{O}(\frac{1}{T^{1/4}}+\frac{\sigma^{2}}{\sqrt{KB}}) 𝒪(1T1/4+σ2KB)\mathcal{O}(\frac{1}{T^{1/4}}+\frac{\sigma^{2}}{\sqrt{KB}}) 𝒪(ε4)\mathcal{O}(\varepsilon^{-4})
Stochastic EG [14] Stochastic MVI, Lipschitz risk gradient, D-S 𝒪(1T1/2+σ2B)\mathcal{O}(\frac{1}{T^{1/2}}+\frac{\sigma^{2}}{B}) 𝒪(1T1/2+σ2B)\mathcal{O}(\frac{1}{T^{1/2}}+\frac{\sigma^{2}}{B}) 𝒪(ε2)\mathcal{O}(\varepsilon^{-2})
DM-HSGD [34] Stochastic nonconvex-SC, Lipschitz loss gradient, D-S 𝒪(1T2/3)\mathcal{O}(\frac{1}{T^{2/3}}) ——– 𝒪(1min{1,Kϵ})\mathcal{O}(\frac{1}{\min\{1,K\epsilon\}})
DM-GDA [37] Stochastic nonconvex-PL, Lipschitz loss gradient, D-S 𝒪(1T1/3)\mathcal{O}(\frac{1}{T^{1/3}}) ——– 𝒪(1)\mathcal{O}(1)
DSGDA [35] Finite-sum, nonconvex-SC, Lipschitz loss gradient, D-S 𝒪(1T)\mathcal{O}(\frac{1}{T}) ——– 𝒪(n)\mathcal{O}(\sqrt{n})
DSS-OG (this work) Stochastic nonconvex-PL, Lipschitz risk gradient, L-S 𝒪(1T1/2)\mathcal{O}(\frac{1}{T^{1/2}}) 𝒪(1T)\mathcal{O}(\frac{1}{T}) 𝒪(1)\mathcal{O}(1)
TABLE I: Comparison of the convergence rate of different distributed minimax algorithms. The works [34, 35] seek to find a approximate solution 𝒙\bm{x}^{\star} of the primal objective P(x)P(x) to defined in (12) in the sense that 𝔼P(𝒙)2ε2\mathbb{E}\|\nabla P(\bm{x}^{\star})\|^{2}\leq\varepsilon^{2} while the work [37] uses the criterion 𝔼P(𝒙)ε\mathbb{E}\|\nabla P(\bm{x}^{\star})\|\leq\varepsilon. This work and [20, 14] seek to find the approximate solution (𝒙,𝒚)(\bm{x}^{\star},\bm{y}^{\star}) under the convergence criterion 𝔼xJ(𝒙,𝒚)2ε2,𝔼yJ(𝒙,𝒚)2ε2\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star},{\bm{y}}^{\star})\|^{2}\leq\varepsilon^{2},\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star},{\bm{y}}^{\star})\|^{2}\leq\varepsilon^{2}. The above table shows the convergence rate of the primal and dual variables after running TT iterations, respectively. D-S: doubly stochastic, L-S: left stochastic, TT: number of iterations, BB: batch size employed to achieve an ε\varepsilon-stationary point, σ2\sigma^{2}: gradient noise variance, nn: number of total samples in finite-sum setting.

I-B Main Contributions

Our main contributions in this work are as follows:

  • We introduce and analyze a new variant of stochastic OG that achieves a convergence rate of 𝒪(1T1/2)\mathcal{O}(\frac{1}{T^{1/2}}) in finding the primal solution under the nonconvex-PL setting. Our analysis improves the existing performance bound by addressing the issue of large batch size [7]. We also provide insights into theoretical limitations that prevent the conventional stochastic OG from achieving better results.

  • We extend the applicability of stochastic OG to a fully distributed setting. We consider left-stochastic combination matrices, which are more general than the doubly-stochastic setting considered in [34, 35, 36, 37, 14, 20]. We also carry out convergence analysis without assuming the smoothness of the stochastic loss gradient.

  • We establish convergence results under more relaxed conditions, ensuring convergence for both primal and dual solutions. Previous works primarily focused on the convergence of the primal objective [34, 35, 36, 37]. We show that the dual solution converges at a rate of 𝒪(1T)\mathcal{O}(\frac{1}{T}) for our method. Please refer to Table I for the comparison of our results with existing works.

Notations. We use lowercase normal font xx to represent deterministic scalars or vectors, and lowercase bold font 𝒙\bm{x} for stochastic scalars or vectors. We use upper case letters AA for matrices and calligraphic letters 𝓧\bm{\mathcal{X}} for network quantities. The symbol 𝒩k\mathcal{N}_{k} denotes the set of neighbors of agent kk, and aka_{\ell k} represents the scaling weight for information flowing from agent \ell to agent kk. 𝟙M\mathbbm{1}_{M} denotes the MM-dimensional vector of all ones. The identity matrix of size MM is denoted by IMI_{M}. The Kronecker product operator is denoted by \otimes, \|\cdot\| stands for the 2\ell_{2}-norm, and ,\langle\cdot,\cdot\rangle denotes the inner product operator.

II Algorithm Description

II-A Centralized Scenario

We start with a centralized scenario where the data processing and model training are carried out at a fusion center. Specifically, we consider initially the minimax problem with J(x,y)=𝔼Q(x,y,𝝃)J(x,y)=\mathbb{E}Q(x,y,\bm{\xi}) formulated at a stand-alone agent or fusion center. Several variants of the stochastic OG method will be revisited here to motivate our new formulation. It is noted that these variants were originally proposed in [19, 40] in the deterministic setting, whereas we consider the stochastic version.

The extragradient (EG) method traces back to the seminal work [19], which is based on a heuristic idea, namely “extrapolation”. It proposed to utilize the gradient at the extrapolated point, rather than the gradient at the current point for carrying out the update. Doing so leads to increased stability by alleviating the rotational force. Such a scheme was also applied to other optimization problems[41]. The stochastic EG (S-EG) was studied in the works [7, 14]. This algorithm starts from iteration i=0i=0, say, with randomly initialized variables 𝒙¯0,𝒚¯0\overline{\bm{x}}_{0},\overline{\bm{y}}_{0} and recursively performs the following steps:

S-EG 𝒙i=𝒙¯iμxxQ(𝒙¯i,𝒚¯i;𝝃¯x,i)\displaystyle\bm{x}_{i}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\overline{\bm{x}}_{i},\overline{\bm{y}}_{i};\overline{\bm{\xi}}_{x,i})\quad (4a)
S-EG 𝒚i=𝒚¯i+μyyQ(𝒙¯i,𝒚¯i;𝝃¯y,i)\displaystyle\bm{y}_{i}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\overline{\bm{x}}_{i},\overline{\bm{y}}_{i};\overline{\bm{\xi}}_{y,i}) (4b)
S-EG 𝒙¯i+1=𝒙¯iμxxQ(𝒙i,𝒚i;𝝃x,i)\displaystyle\overline{\bm{x}}_{i+1}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i}) (4c)
S-EG 𝒚¯i+1=𝒚¯i+μyyQ(𝒙i,𝒚i;𝝃y,i)\displaystyle\overline{\bm{y}}_{i+1}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i}) (4d)

Here, (𝒙i,𝒚i)(\bm{x}_{i},\bm{y}_{i}) denotes the extrapolated point, and (𝒙¯i+1,𝒚¯i+1)(\overline{\bm{x}}_{i+1},\overline{\bm{y}}_{i+1}) is the output. The variables (𝝃¯x,i,𝝃¯y,i,𝝃x,i,𝝃y,i)(\overline{\bm{\xi}}_{x,i},\overline{\bm{\xi}}_{y,i},\bm{\xi}_{x,i},\bm{\xi}_{y,i}) are iid samples drawn during different steps and for updating different variables. We use the subscript and bar notation in these iid samples to highlight at which point and for which variables they are drawn. For instance, 𝝃¯x,i\overline{\bm{\xi}}_{x,i} denotes an iid sample drawn at iteration ii for constructing the stochastic gradient at point (𝒙¯i,𝒚¯i)(\overline{\bm{x}}_{i},\overline{\bm{y}}_{i}) and for updating the primal variable 𝒙\bm{x}. Additionally, the scalars μx,μy>0\mu_{x},\mu_{y}>0 are the stepsizes or learning rates. The S-EG needs to query two stochastic gradient oracles to update each variable. It is also executed in a sequential manner, i.e., determining the extrapolated point first before performing the update.

To reduce the computational complexity, a single-call version of EG known as past EG (PEG), was introduced in the work [40]. It proposed to use the past gradient at the point (𝒙i1,𝒚i1)(\bm{x}_{i-1},\bm{y}_{i-1}). The stochastic form of PEG is characterized by the following steps [42, 20]:

S-PEG 𝒙i=𝒙¯iμxxQ(𝒙i1,𝒚i1;𝝃x,i1)\displaystyle\bm{x}_{i}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i-1}) (5a)
S-PEG 𝒚i=𝒚¯i+μyyQ(𝒙i1,𝒚i1;𝝃y,i1)\displaystyle\bm{y}_{i}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i-1}) (5b)
S-PEG 𝒙¯i+1=𝒙¯iμxxQ(𝒙i,𝒚i;𝝃x,i)\displaystyle\overline{\bm{x}}_{i+1}=\overline{\bm{x}}_{i}-\mu_{x}\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i}) (5c)
S-PEG 𝒚¯i+1=𝒚¯i+μyyQ(𝒙i,𝒚i;𝝃y,i)\displaystyle\overline{\bm{y}}_{i+1}=\overline{\bm{y}}_{i}+\mu_{y}\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i}) (5d)

PEG can also be derived from the forward-reflected-backward framework [43]. It has attracted wide interest in theoretical aspects and obtains empirical success in practical applications [28, 21, 20].

The standard form of S-PEG, also referred to as the stochastic OG (S-OG), is often utilized for theoretical analysis [7]. The updates in Eqs. (5a)–(5d) can be simplified by rewriting them in terms of (𝒙i+1,𝒚i+1)(\bm{x}_{i+1},\bm{y}_{i+1}) at each iteration ii [7, 42]:

S-OG 𝒙i+1=𝒙iμx[2xQ(𝒙i,𝒚i;𝝃x,i)\displaystyle\bm{x}_{i+1}=\bm{x}_{i}-\mu_{x}\Big{[}2\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})
S-OG +xQ(𝒙i1,𝒚i1;𝝃x,i1)]\displaystyle\qquad\quad+\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i-1})\Big{]} (6a)
S-OG 𝒚i+1=𝒚i+μy[2yQ(𝒙i,𝒚i;𝝃y,i)\displaystyle\bm{y}_{i+1}=\bm{y}_{i}+\mu_{y}\Big{[}2\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i})
S-OG yQ(𝒙i1,𝒚i1;𝝃y,i1)]\displaystyle\qquad\quad-\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i-1})\Big{]} (6b)

S-OG involves two different samples, namely 𝝃x,i\bm{\xi}_{x,i} and 𝝃x,i1\bm{\xi}_{x,i-1}, for updating the primal variable in different steps (similar to the dual variable). We observe that 𝝃x,i\bm{\xi}_{x,i} and 𝝃x,i1\bm{\xi}_{x,i-1} are actually approximating different losses whose landscape differs. The gradient xQ(𝒙i,𝒚i;𝝃x,i)\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i}) constructed by 𝝃x,i\bm{\xi}_{x,i} can deviate from that at the same point (𝒙i,𝒚i)(\bm{x}_{i},\bm{y}_{i}) when it is evaluated at 𝝃x,i1\bm{\xi}_{x,i-1}. In other words, S-OG is not making the “optimistic” direction under the same landscape in a strict sense. When the sample variance is high, the deviation between the loss landscapes constructed from 𝝃x,i\bm{\xi}_{x,i} and 𝝃x,i1\bm{\xi}_{x,i-1} could be high, leading to a poor approximation of the update direction [44]. On the other hand, the stochastic gradients involved in (6a)–(6b) are not martingale processes, which cause difficulty for analysis (see Section III-E for discussion).

We instead introduce a new formulation using the same-sample approximation. For each variable, it utilizes the gradients from the same loss landscape to perform “optimistic” scheme; these losses are also unbiased for the true risk. This can be verified by taking expectation over the distribution of the current samples (𝝃x,i(\bm{\xi}_{x,i}, 𝝃y,i)\bm{\xi}_{y,i}). This approach involves recomputing the stochastic gradient at the point (𝒙i1,𝒚i1)(\bm{x}_{i-1},\bm{y}_{i-1}) using the current samples (𝝃x,i,𝝃y,i)(\bm{\xi}_{x,i},\bm{\xi}_{y,i}):

SS-OG 𝒙i+1=𝒙i2μx[xQ(𝒙i,𝒚i;𝝃x,i)\displaystyle\bm{x}_{i+1}=\bm{x}_{i}-2\mu_{x}\Big{[}\nabla_{x}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{x,i})
SS-OG +xQ(𝒙i1,𝒚i1;𝝃x,i)]\displaystyle\qquad\quad+\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i})\Big{]} (7a)
SS-OG 𝒚i+1=𝒚i+2μy[yQ(𝒙i,𝒚i;𝝃y,i)\displaystyle\bm{y}_{i+1}=\bm{y}_{i}+2\mu_{y}\Big{[}\nabla_{y}Q(\bm{x}_{i},\bm{y}_{i};\bm{\xi}_{y,i})
SS-OG yQ(𝒙i1,𝒚i1;𝝃y,i)]\displaystyle\qquad\quad-\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i})\Big{]} (7b)

SS-OG requires double call of stochastic gradient oracles for each variable. However, it can be executed in parallel with some extra memory overhead, making it preferred in comparison to the sequential approaches, e.g., S-EG and S-PEG.

Algorithm 1 Stochastic Same-Sample Optimistic Gradient (SS-OG)
0:  𝒙i,𝒚i(i=1,2)\bm{x}_{i},\bm{y}_{i}{\scriptstyle(i=-1,-2)}, step sizes μx\mu_{x}, μy\mu_{y}.
1:  for  i=0,1,i=0,1,\dots do
2:     Compute stochastic gradient with samples 𝝃x,i,𝝃y,i\bm{\xi}_{x,i},\bm{\xi}_{y,i}
3:     𝒈x,i1=2xQ(𝒙i1,𝒚i1;𝝃x,i)xQ(𝒙i2,𝒚i2;𝝃x,i)\displaystyle\begin{aligned} {\bm{g}}_{x,i-1}=2\nabla_{x}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{x,i})-\nabla_{x}Q(\bm{x}_{i-2},\bm{y}_{i-2};\bm{\xi}_{x,i})\end{aligned}
4:     𝒈y,i1=2yQ(𝒙i1,𝒚i1;𝝃y,i)yQ(𝒙i2,𝒚i2;𝝃y,i)\displaystyle\begin{aligned} {\bm{g}}_{y,i-1}=2\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i})-\nabla_{y}Q(\bm{x}_{i-2},\bm{y}_{i-2};\bm{\xi}_{y,i})\end{aligned}
5:     Adaptation
6:     𝒙i=𝒙i1μx𝒈x,i1\displaystyle{\bm{x}}_{i}=\bm{x}_{i-1}-\mu_{x}\bm{g}_{x,i-1}
7:     𝒚i=𝒚i1+μy𝒈y,i1\displaystyle{\bm{y}}_{i}=\bm{y}_{i-1}+\mu_{y}\bm{g}_{y,i-1}
8:  end for

II-B Distributed scenario

In this section, we extend SS-OG to the distributed setting (2a)–(2b). Under this scenario, each agent kk utilizes a local sample {𝝃k}\{\bm{\xi}_{k}\} to approximate the true gradient by using its loss value. Several distributed learning strategies for minimization problems have been investigated in [22, 33] and the references therein. Here, we integrate (6a)-(6b) with the adapt-then-combine (ATC) diffusion learning strategy from [22, 33]. This strategy has been shown in these references to outperform the consensus strategy, which has been applied to minimax problems as well in works such as [20, 35]. For this reason, we focus on diffusion learning and refer to the proposed algorithm as the Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG). It is listed in Algorithm 1. In this algorithm, aka_{\ell k} is the scalar weight used by agent kk to scale information received from agent 𝒩k\ell\in\mathcal{N}_{k} and ak=0a_{\ell k}=0 for 𝒩k\ell\notin\mathcal{N}_{k}. This algorithm starts from i=0i=0 with randomly initialized local weights 𝒙k,i,𝒚k,i(i=1,2)\bm{x}_{k,i},\bm{y}_{k,i}(i=-1,-2). Each agent first collect two iid samples, namely 𝝃k,ix\bm{\xi}^{x}_{k,i} and 𝝃k,iy\bm{\xi}^{y}_{k,i}, to approximate the OG direction for updating the local primal and dual variables, respectively. After then the local models of each agent are sent to its immediate neighbors for aggregation using the weights {ak}\{a_{\ell k}\}.

Algorithm 2 Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG)
0:  𝒙k,i,𝒚k,i(i=1,2)\bm{x}_{k,i},\bm{y}_{k,i}{\scriptstyle(i=-1,-2)}, step sizes μx\mu_{x}, μy\mu_{y}.
1:  for  i=0,1,i=0,1,\dots do
2:     for each agent kk do
3:        Compute stochastic gradient with sample 𝝃k,ix,𝝃k,iy\bm{\xi}^{x}_{k,i},\bm{\xi}^{y}_{k,i}
4:        𝒈x,k,i1= 2xQk(𝒙k,i1,𝒚k,i1;𝝃k,ix)xQk(𝒙k,i2,𝒚k,i2;𝝃k,ix)\displaystyle\begin{aligned} {\bm{g}}_{x,k,i-1}=&\ 2\nabla_{x}Q_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1};\bm{\xi}^{x}_{k,i})\\ &-\nabla_{x}Q_{k}(\bm{x}_{k,i-2},\bm{y}_{k,i-2};\bm{\xi}^{x}_{k,i})\end{aligned}
5:        𝒈y,k,i1= 2yQk(𝒙k,i1,𝒚k,i1;𝝃k,iy)yQk(𝒙k,i2,𝒚k,i2;𝝃k,iy)\displaystyle\begin{aligned} {\bm{g}}_{y,k,i-1}=&\ 2\nabla_{y}Q_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1};\bm{\xi}^{y}_{k,i})\\ &-\nabla_{y}Q_{k}(\bm{x}_{k,i-2},\bm{y}_{k,i-2};\bm{\xi}^{y}_{k,i})\end{aligned}
6:        Adaptation
7:        ϕk,i=𝒙k,i1μx𝒈x,k,i1\displaystyle{\bm{\phi}}_{k,i}=\bm{x}_{k,i-1}-\mu_{x}\bm{g}_{x,k,i-1}
8:        𝝍k,i=𝒚k,i1+μy𝒈y,k,i1\displaystyle{\bm{\psi}}_{k,i}=\bm{y}_{k,i-1}+\mu_{y}\bm{g}_{y,k,i-1}
9:        Combination
10:        𝒙k,i=𝒩kakϕ,i,𝒚k,i=𝒩kak𝝍,i\bm{x}_{k,i}=\underset{\ell\in\mathcal{N}_{k}}{\sum}a_{\ell k}{\bm{\phi}}_{\ell,i},\quad\displaystyle\bm{y}_{k,i}=\underset{\ell\in\mathcal{N}_{k}}{\sum}a_{\ell k}{\bm{\psi}}_{\ell,i}
11:     end for
12:  end for

For convenience of the analysis, we introduce the following network quantities:

𝒙c,i\displaystyle\hskip 30.00005pt{\bm{x}}_{c,i} k=1Kpk𝒙k,iM1×1(centroid)\displaystyle\triangleq\sum_{k=1}^{K}p_{k}\bm{x}_{k,i}\in\mathbb{R}^{M_{1}\times 1}\quad\text{(centroid)} (8a)
𝒈x,c,i\displaystyle\bm{g}_{x,c,i} k=1Kpk𝒈x,k,iM1×1\displaystyle\triangleq\sum_{k=1}^{K}p_{k}{\bm{g}}_{x,k,i}\in\mathbb{R}^{M_{1}\times 1} (8b)
𝓧i\displaystyle\bm{\mathcal{X}}_{i} col{𝒙1,i,,𝒙K,i}M1K×1\displaystyle\triangleq\mbox{\rm col}\{\bm{x}_{1,i},\dots,\bm{x}_{K,i}\}\in\mathbb{R}^{M_{1}K\times 1} (8c)
𝓧c,i\displaystyle\bm{{\mathcal{X}}}_{c,i} col{𝒙c,i,,𝒙c,i}M1K×1\displaystyle\triangleq\mbox{\rm col}\{\bm{x}_{c,i},\dots,\bm{x}_{c,i}\}\in\mathbb{R}^{M_{1}K\times 1} (8d)
𝓖x,i\displaystyle\bm{\mathcal{G}}_{x,i} col{𝒈x,1,i,,𝒈x,K,i}M1K×1\displaystyle\triangleq\mbox{\rm col}\{\bm{g}_{x,1,i},\dots,\bm{g}_{x,K,i}\}\in\mathbb{R}^{M_{1}K\times 1} (8e)

and similarly for 𝒚c,i,𝒈y,c,iM2×1,𝓨i,𝓨c,i,𝓖y,iM2K×1{\bm{y}}_{c,i},\bm{g}_{y,c,i}\in\mathbb{R}^{M_{2}\times 1},\bm{\mathcal{Y}}_{i},\bm{\mathcal{Y}}_{c,i},\bm{\mathcal{G}}_{y,i}\in\mathbb{R}^{M_{2}K\times 1}. Additionally, we use gx,c,i,gy,c,i,𝒢x,i,𝒢y,i{{g}}_{x,c,i},{{g}}_{y,c,i},{\mathcal{G}}_{x,i},{\mathcal{G}}_{y,i} refer to deterministic realizations of the random quantities 𝒈x,c,i\bm{g}_{x,c,i}, 𝒈y,c,i\bm{g}_{y,c,i}, 𝓖x,i\bm{\mathcal{G}}_{x,i}, 𝓖y,i\bm{\mathcal{G}}_{y,i}. If we further introduce the network combination matrices:

A\displaystyle A =[ak]K×K\displaystyle=[a_{\ell k}]\in\mathbb{R}^{K\times K} (9a)
𝒜1\displaystyle\mathcal{A}_{1} AIM1M1K×M1K\displaystyle\triangleq A\otimes I_{M_{1}}\in\mathbb{R}^{M_{1}K\times M_{1}K} (9b)
𝒜2\displaystyle\mathcal{A}_{2} AIM2M2K×M2K\displaystyle\triangleq A\otimes I_{M_{2}}\in\mathbb{R}^{M_{2}K\times M_{2}K} (9c)

then, the DSS-OG recursions be expressed globally the following recursion:

𝓧i\displaystyle\bm{\mathcal{X}}_{i} =𝒜1{𝓧i1μx𝓖x,i1}\displaystyle=\mathcal{A}_{1}^{\top}\{\bm{\mathcal{X}}_{i-1}-\mu_{x}\bm{\mathcal{G}}_{x,i-1}\} (10a)
𝓨i\displaystyle\bm{\mathcal{Y}}_{i} =𝒜2{𝓨i1+μy𝓖y,i1}\displaystyle=\mathcal{A}_{2}^{\top}\{\bm{\mathcal{Y}}_{i-1}+\mu_{y}\bm{\mathcal{G}}_{y,i-1}\} (10b)

III Convergence Analysis

We now proceed to analyze the convergence properties of DSS-OG. We do so under the following assumptions, which are generally more relaxed than similar conditions in the prior literature.

III-A Assumptions

Assumption 2.

We assume the gradients associated with each local risk function are LfL_{f}-Lipschitz, i.e.,

wJk(x1,y1)wJk(x2,y2)(w=x or y)\displaystyle\|\nabla_{w}J_{k}(x_{1},y_{1})-\nabla_{w}J_{k}(x_{2},y_{2})\|\quad(w=x\text{ or }y) (11)
Lf(x1x2+y1y2)\displaystyle\leq L_{f}\Big{(}\|x_{1}-x_{2}\|+\|y_{1}-y_{2}\|\Big{)}
Remark 1.

In this work, we will only adopt the above assumption for true gradients. This is in contrast to the works [34, 35, 36, 37], where smoothness of the local stochastic loss is assumed instead. It should be noted that if we only assume Assumption 2, the computational benefits of all momentum techniques used in [34, 37] are nullified, with the gradient evaluation complexity degrading to 𝒪(ε4)\mathcal{O}(\varepsilon^{-4}). This is the tight lower bound for solving a nonconvex problems developed in [39]. Moreover, the primal objective introduced in (12) is essentially nonconvex. \hfill{\square}

Since the primal problem is to minimize over xx in (2a), we introduce the primal objective[34, 35, 36, 37]:

P(x)=maxyJ(x,y)P(x)={\max}_{y}\ J(x,y) (12)
Assumption 3.

We assume P(x)P(x) is lower bounded, i.e., P=infxP(x)>P^{\star}=\operatorname{inf}_{x}P(x)>-\infty.

Assumption 4.

We denote the filtration generated by the random processes as 𝓕i={(𝐱k,j,𝐲k,j)k=1,,K,j=2,1,,i}\bm{\mathcal{F}}_{i}=\{(\bm{x}_{k,j},\bm{y}_{k,j})\mid k=1,\dots,K,j=-2,-1,\dots,i\}. We assume the stochastic gradients are unbiased with bounded variance while taking expectation over the random sample 𝛏k,iw\bm{\xi}^{w}_{k,i} conditioned on 𝓕i\bm{\mathcal{F}}_{i}, i.e.,

𝔼{wQk(𝒙k,i,𝒚k,i;𝝃k,iw)𝓕i}=wJk(𝒙k,i,𝒚k,i)\displaystyle\mathbb{E}\Big{\{}\nabla_{w}Q_{k}(\bm{x}_{k,i},\bm{y}_{k,i};\bm{\xi}^{w}_{k,i})\mid\bm{\mathcal{F}}_{i}\Big{\}}=\nabla_{w}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i}) (13)
𝔼{wQk(𝒙k,i,𝒚k,i;𝝃k,iw)wJk(𝒙k,i,𝒚k,i)2𝓕i}\displaystyle\mathbb{E}\Big{\{}\|\nabla_{w}Q_{k}(\bm{x}_{k,i},\bm{y}_{k,i};\bm{\xi}^{w}_{k,i})-\nabla_{w}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})\|^{2}\mid\bm{\mathcal{F}}_{i}\Big{\}}
σk2(w=x or y)\displaystyle\leq{\sigma^{2}_{k}}\quad(w=x\text{ or }y)

Moreover, we assume the data samples 𝛏k,iw\bm{\xi}^{w}_{k,i} are independent of each other for all k,i,wk,i,w.

Assumption 5.

We assume the disagreement between the local and global gradients is bounded, i.e.,

wJk(𝒙k,i,𝒚k,i)wJ(𝒙k,i,𝒚k,i)2\displaystyle\|\nabla_{w}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})-\nabla_{w}J(\bm{x}_{k,i},\bm{y}_{k,i})\|^{2} G2(w=x or y)\displaystyle\leq G^{2}\ (w=x\text{ or }y) (14)

The above assumptions can be relaxed by adopting more sophisticated distributed learning strategies, e.g., exact diffusion and gradient tracking [45, 46, 47]; we leave these extensions for future works.

Assumption 6.

The graph is strongly connected and the matrix A=[a,k]A=[a_{\ell,k}] is left-stochastic. As a result, it holds that 𝟙A=𝟙\mathbbm{1}^{\top}A=\mathbbm{1}^{\top} and ArA^{r} has strictly positive entries for some integer rr.

This assumption ensures that AA has a single maximum eigenvalue at 11 with Perron eigenvector pp with positive entries such that Ap=p,𝟙p=1Ap=p,\mathbbm{1}^{\top}p=1. Furthermore, the combination matrix AA admits the Jordan decomposition A=VJV1A=VJV^{-1} where [22]:

V=[p,VR],J=[100Jγ],V1=[𝟙VL]V=\begin{bmatrix}p,V_{R}\end{bmatrix},J=\begin{bmatrix}1&0\\ 0&J_{\gamma}\end{bmatrix},V^{-1}=\begin{bmatrix}\mathbbm{1}^{\top}\\ V^{\top}_{L}\end{bmatrix} (15)

Here, the submatrix JγJ_{\gamma} consists of Jordan blocks with arbitrarily small values on the lower diagonal, and it holds that 𝟙VR=0\mathbbm{1}^{\top}V_{R}=0, VLVR=IK1V^{\top}_{L}V_{R}=I_{K-1}, Jγ<1\|J_{\gamma}\|<1. For convenience of the analysis, we let 𝒥γ1JγIM1,𝒥γ2JγIM2,𝒱R1VRIM1,𝒱R2VRIM2,𝒱L1VLIM1,𝒱L2VLIM2\mathcal{J}_{\gamma_{1}}\triangleq J_{\gamma}\otimes I_{M_{1}},\mathcal{J}_{\gamma_{2}}\triangleq J_{\gamma}\otimes I_{M_{2}},\mathcal{V}_{R_{1}}\triangleq V_{R}\otimes I_{M_{1}},\mathcal{V}_{R_{2}}\triangleq V_{R}\otimes I_{M_{2}},\mathcal{V}_{L_{1}}\triangleq V_{L}\otimes I_{M_{1}},\mathcal{V}_{L_{2}}\triangleq V_{L}\otimes I_{M_{2}}. It is not difficult to verify 𝒥γ1=𝒥γ2,𝒱R1=𝒱R2,VL1=VL2\|\mathcal{J}_{\gamma_{1}}\|=\|\mathcal{J}_{\gamma_{2}}\|,\|\mathcal{V}_{R_{1}}\|=\|\mathcal{V}_{R_{2}}\|,\|V_{L_{1}}\|=\|V_{L_{2}}\|.

III-B Main Results

In Theorem 1, we prove the convergence rate of the best-iterate of the primal objective, which is the standard criterion for convergence analysis involving the nonconvex problem. In Theorem 2, we show the last-itearate convergence of the dual optimality gap. Given the conditions from these theorems, we establish the complexity of the DSS-OG in finding an ε\varepsilon-stationary point.

Theorem 1.

Under Assumptions 16, choosing step sizes

μy=min{1ν,2Lf2νβ1,1𝒥γ12T1/2,1}\mu_{y}=\min\{\frac{1}{\nu},\frac{2L^{2}_{f}}{\nu\beta_{1}},\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{T^{1/2}},1\}

(16)

μx=min{ν2μy6Lf2,12(L+Lf),14L,μy4τ3,12τ1+β1,μy}\mu_{x}=\min\{\frac{\nu^{2}\mu_{y}}{6L^{2}_{f}},\frac{1}{2(L+L_{f})},\frac{1}{4L},\frac{\mu_{y}}{4\tau_{3}},\frac{1}{2\sqrt{\tau_{1}+\beta_{1}}},\mu_{y}\}

(17)

the non-asymptotic convergence rate of the primal objective is given by:

1Ti=0T1𝔼P(𝒙c,i)2𝒪(1T1/2)+𝒪(1T)+𝒪(1T2)\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}\leq\mathcal{O}\Big{(}\frac{1}{{T}^{1/2}}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T^{2}}\Big{)} (18)

where L=Lf+Lf2νL=L_{f}+\frac{L^{2}_{f}}{\nu} is the Lipschitz constant associated with P(x)P(x) and

τ180Lf4ν2,τ380Lf2ν2,β196KLf4ν2+48KLf2\tau_{1}\triangleq\frac{80L^{4}_{f}}{\nu^{2}},\tau_{3}\triangleq\frac{80L^{2}_{f}}{\nu^{2}},\beta_{1}\triangleq\frac{96KL_{f}^{4}}{\nu^{2}}+48KL^{2}_{f}

(19)

are constants. Furthermore, DSS-OG outputs a primal ε\varepsilon-stationary point after T0=𝒪(ε4)T_{0}=\mathcal{O}(\varepsilon^{-4}) iterations and gradient evaluation complexity, i.e.,

𝔼P(𝒙c,T0)2\displaystyle\mathbb{E}\|\nabla P({\bm{x}}^{\star}_{c,T_{0}})\|^{2} =infi=0,,T01𝔼P(𝒙c,i)2\displaystyle=\inf_{i=0,\dots,T_{0}-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2} (20)
1T0i=0T01𝔼P(𝒙c,i)2𝒪(ε2)\displaystyle\leq\frac{1}{T_{0}}\sum_{i=0}^{T_{0}-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}\leq\mathcal{O}(\varepsilon^{2})

See Appendix C for proof.

Theorem 1 states that by adopting a two-time-scale scheme for the step sizes μx\mu_{x} and μy\mu_{y}, DSS-OG is able to output an ϵ\epsilon-stationary point for the primal objective with a sublinear rate dominated by 𝒪(1T1/2)\mathcal{O}\Big{(}\frac{1}{T^{1/2}}\Big{)}. The two-time-scale condition reflects the unsymmetric assumption of the risk function over the primal and dual variables. Similar conditions were also established in the works [7, 9, 8] when asymmetric conditions are assumed over the primal and dual variables.

Theorem 2.

Under Assumptions 16, and after iteration numbers T0T_{0}, choosing step size

μx=0,μy=min{ν12KLf2,2T+1ν(T+1)2}\mu_{x}=0,\quad\mu_{y}=\min\{\sqrt{\frac{\nu}{12KL^{2}_{f}}},\frac{2T+1}{\nu(T+1)^{2}}\}

(21)

the non-asymptotic bound of the dual optimality gap regarding the last-iterate is given by:

𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,T)]𝒪(1T)+𝒪(1T2)\displaystyle\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]\leq\mathcal{O}\Big{(}\frac{1}{T}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T^{2}}\Big{)} (22)

That is, DSS-OG outputs a dual ε\varepsilon-stationary point after T1=𝒪(ε2)T_{1}=\mathcal{O}(\varepsilon^{-2}) iterations and gradient evaluation complexity, i.e.,

𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,T0+T1)]𝒪(ε2)\displaystyle\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]\leq\mathcal{O}(\varepsilon^{2}) (23)

See Appendix D for proof.

The above theorem implies that finding a dual solution that shrinks the dual optimality gap has a faster sublinear convergence rate of 𝒪(1T)\mathcal{O}\Big{(}\frac{1}{T}\Big{)} than the primal convergence rate. Eqs. (18) and (22) also show the fundamental difference of the complexity in finding the solution for a nonconvex problem and for a PL problem. From this theoretical aspect, it is hard to guarantee a pair of primal and dual solutions with the same accuracy level (24a)–(24b) simultaneously. Fortunately, we can prove that the same accuracy level of the primal and dual solutions can be obtained in an asynchronous manner (See Theorem 3).

Theorem 3.

After iterations T0+T1=𝒪(ε4)+𝒪(ε2)T_{0}+T_{1}=\mathcal{O}(\varepsilon^{-4})+\mathcal{O}(\varepsilon^{-2}) of running DSS-OG, we can find the network centroid (𝐱c,T0,𝐲c,T0+T1)({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}}) that satisfies the following first-order stationary conditions:

𝔼xJ(𝒙c,T0,𝒚c,T0+T1)2\displaystyle\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\|^{2} ε2\displaystyle\leq\varepsilon^{2} (24a)
𝔼yJ(𝒙c,T0,𝒚c,T0+T1)2\displaystyle\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\|^{2} ε2\displaystyle\leq\varepsilon^{2} (24b)

for an arbitrarily small constant ε\varepsilon. The solutions points 𝐱c,T0,𝐲c,T0+T1\bm{x}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}} are found in different time instance where the subscripts in (𝐱c,T0,𝐲c,T0+T1)({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}}) indicate the exact iterations when they are found. See Appendix E for proof.

III-C Comparison between S-OG and SS-OG

We point out the key differences between SS-OG and S-OG:

  • The stochastic gradients involved in S-OG are not martingale process. To see this, we observe that the backward gradients [xQ(𝒙i1;𝒚i1;𝝃x,i1),yQ(𝒙i1,𝒚i1;𝝃y,i1)][\nabla_{x}Q(\bm{x}_{i-1};\bm{y}_{i-1};\bm{\xi}_{x,i-1}),\nabla_{y}Q(\bm{x}_{i-1},\bm{y}_{i-1};\bm{\xi}_{y,i-1})] used at iteration ii are constructed by the past samples (𝝃x,i1,𝝃y,i1)(\bm{\xi}_{x,i-1},\bm{\xi}_{y,i-1}). They are biased estimators when taking expectation over the current samples (𝝃x,i,𝝃y,i)(\bm{\xi}_{x,i},\bm{\xi}_{y,i}). In contrast, it is easy to verify the stochastic gradients involved in SS-OG are unbiased.

  • SS-OG strictly performs the “optimistic” scheme under the same loss landscape. The losses utilized there are unbiased estimators for the local risk. In contrast, S-OG involves gradients from two different loss landscape. The gradient direction at the same point in two different loss landscapes can be notably different when data variance is high.

  • We will bound the terms involving 𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2} for DSS-OG rather than 𝔼xJ(𝒙c,i,𝒚c,i)𝒈x,c,i2\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-\bm{g}_{x,c,i}\|^{2} for distributed S-OG. To see this, we can repeat the arguments in (I) and use the fact that the two terms involved in the inner product P(𝒙c,i),𝒈x,c,i\langle\nabla{{P}}(\bm{x}_{c,i}),\bm{g}_{x,c,i}\rangle are correlated for distributed S-OG. Therefore, we observe that the distributed S-OG needs to bound the deviation between the deterministic and stochastic quantities. The consequence of such deviation is that a constant on the order of 𝒪(σ2)\mathcal{O}(\sigma^{2}) will appear in the final bound because the Lipschitz continuity to be used is only assumed for determistic quantities. In constrast, we only need to bound the deviation between deterministic quantities for DSS-OG. Therefore, employing a large batch size will become necessary for S-OG in order to control the magnitude of 𝒪(σ2)\mathcal{O}(\sigma^{2}).

IV Computer simulation

IV-A Wasserstein GAN

In this example, we consider the application of an 2\ell_{2}-regularized Wasserstein GAN (WGAN) for learning the mean and variance of a one-dimensional Gaussian [8]. The objective is to optimize the following global risk function through the cooperation of agents:

minxmaxyJ(x,y)=k=1KpkJk(x,y)\displaystyle\min_{{x}}\max_{{y}}\ J({x},{y})=\sum_{k=1}^{K}p_{k}J_{k}({x},{y}) (25)
Jk(x,y)𝔼𝒖k𝒩(πk,σk2)[D(y;𝒖k)]\displaystyle J_{k}({{x},{y}})\triangleq\mathbb{E}_{\bm{u}_{k}\sim\mathcal{N}(\pi_{k},\sigma^{2}_{k})}[D({y};\bm{u}_{k})]
𝔼𝒛k𝒩(0,I)[D(y;G(x;𝒛k))]λy2\displaystyle\qquad\qquad-\mathbb{E}_{\bm{z}_{k}\sim\mathcal{N}(0,I)}[D({y};G({x};\bm{z}_{k}))]-\lambda\|y\|_{2}

Here, the dual variable yy is 2\ell_{2}-regularized by a constant λ\lambda to ensure the PL condition. We consider a setting where the generator has a single hidden layer with 5 neurons, and the discriminator is parameterized by y2y\in\mathbb{R}^{2} and is of the form D(y;𝒖)=y1𝒖k+y2𝒖k2D(y;\bm{u})=y_{1}\bm{u}_{k}+y_{2}\bm{u}_{k}^{2}. As for the true samples, 𝒖k\bm{u}_{k}, they are drawn from the distribution 𝒩(πk,σk2)\mathcal{N}(\pi_{k},\sigma^{2}_{k}).

We generate a strongly connected network with K=8K=8 agents and use the averaging rule [48, 33]. In addition to the fully distributed setting, we also implement centralized SS-OG (CSS-OG). We compare DSS-OG and CSS-OG with the distributed version of SAGDA, smoothed SAGDA and the adaptive algorithms, such as RMSprop and Adam which are also evaluated in this same application [8]. For these algorithms, we follow the optimal settings reported in [8]. For DSS-OG and CSS-OG, we tune the step sizes to μx=μy=0.1\mu_{x}=\mu_{y}=0.1. We set the same random seed before running each algorithm to ensure the same generated training set. The code is available at the attached link111https://github.com/MulGarFed/Minimax_DSSOG.git.

The simulation results are demonstrated in Figs.LABEL:fig:example1_variance0001_grad-LABEL:fig:example1_variance01_mse. We observe both DSS-OG and CSS-OG outperform the other algorithms in terms of gradient norm and mean square error (MSE) when σk2=0.001\sigma^{2}_{k}=0.001. From Fig. LABEL:fig:example1_variance01_mse, the MSE of DSS-OG and CSS-OG is able to match the results of RMSprop when σk2=0.1\sigma^{2}_{k}=0.1. We find that the adaptive algorithm RMSprop converges more rapidly than constant step size algorithms. Another interesting observation is that the gradient norm does not necessarily imply the quality of a solution. For instance, despite RMSprop not having the lowest gradient norm in Fig. LABEL:fig:example1_variance0001_grad, it exhibits rapid convergence and good performance in terms of MSE, as seen in Fig. LABEL:fig:example1_variance0001_mse. This discrepancy suggests a promising research direction in identifying more practical metrics for evaluating minimax solutions.

IV-B Deep Convolutional GANs

In this example, we train a deep convolutional GANs (DCGANs) using the MNIST dataset. We consider a similar neural network structure described in [49].

We use the same network topology from the previous example. To enhance performance, we implement DSS-OG with a Adam optimizer through replacing its original stochastic gradient term with the SS-OG gradients from (7a)–(7b). For this experiment, we set the momentum factors to β1=0.2,β2=0.999\beta_{1}=0.2,\beta_{2}=0.999 and vary the step sizes. We plot the results of the Fréchet Inception Distance (FID) score [50] in Fig. 2(a) The sample images generated by a single node are demonstrated in Fig. 2(b). It is seen that the simulated algorithms can attain good results of FID score with the appropriately chosen hyperparameters. Moreover, the single agent is capable of generating all classes of digits, demonstrating the generative diversity of a single node in the distributed network.

V Conclusion

In this work, we introduced and analyzed DSS-OG, a novel distributed method for stochastic nonconvex-PL minimax optimization. This method improves the analysis of the existing works by addressing the large batch size issue. We validated the performance of DSS-OG through training experiments of two neural network models: WGAN and DCGAN. Our future work is to develop distributed, communication-efficient, and faster algorithms for variational inequality setting under cooperative or competing networks.

Appendix A Basic Lemmas

In this section, we list two basic results that will be used in our analysis.

Lemma 1 (Quadratic Growth [23]).

If g(y)g(y) is ν\nu-PL function, then it also satisfies a quadratic growth, i.e.,

g(y)minyg(y)ν2yoy2,y\displaystyle g(y)-\min_{y}g(y)\geq\frac{\nu}{2}\|y^{o}-y\|^{2},\forall y (26)

where yoy^{o} is any optimal point taken from the set {yoyo=argminyg(y)}\{y^{o}\mid y^{o}={\operatorname{argmin}}_{y}\ g(y)\}.

Lemma 2 (Danskin-type Lemma [51]).

Under Assumptions 12, if J(x,y)-J(x,y) is LfL_{f}-smooth and satisfies ν\nu-PL in yy for any xx, then:

  • The primal objective P(x)P(x) is LL-smooth with L=Lf+Lf2νL=L_{f}+\frac{L^{2}_{f}}{\nu} .

  • The gradient of P(x)P(x) is equal to the gradient of J(x,yo(x))J(x,y^{o}(x)) with respective to xx, i.e.,

    P(x)=xJ(x,yo(x))\nabla P(x)=\nabla_{x}J(x,y^{o}(x)) (27)

    where yo(x)y^{o}(x) is a maximum point of J(x,y)J(x,y) for a fixed xx, i.e, yo(x)argmaxyJ(x,y)y^{o}(x)\in{\operatorname{argmax}}_{y}\ J(x,y).

Appendix B Key Lemmas

Here, we list the auxiliary lemmas leading to the main convergence result. The proofs of these results are deferred to later appendices.

Lemma 3.

Under Assumptions 26, choosing the step sizes

μxμymin{1𝒥γ122(𝒥γ12+160ρ2Lf2),1}\displaystyle\mu_{x}\leq\mu_{y}\leq\min\{\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{2(\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}+160\rho^{2}L^{2}_{f})},1\} (28)

the iteration averaged network deviation is bounded as:

1Ti=0T1𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2]\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}\Big{]}
40ρ2(4KG2+μyσ2)μy1𝒥γ12\displaystyle\leq\frac{40\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}} (29)

where ρ2𝒱R12𝒱L12𝒥γ12\rho^{2}\triangleq\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{V}_{L_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2} is a constant associated with the network topology, σ2=k=1Kσk2\sigma^{2}=\sum_{k=1}^{K}\sigma^{2}_{k} is the sum of local gradient noise variance. See Appendix F for proof.

Choosing suitable step sizes, we observe from Lemma 3 that the bound of iteration averaged network deviation is dominated by a term on the order of 𝒪(μy)\mathcal{O}(\mu_{y}).

Lemma 4.

Under Assumptions 26, choosing the step sizes as in Lemma 3, the iteration averaged increment is bounded as:

1T\displaystyle\frac{1}{T} i=0T1[𝔼𝓧i𝓧i12+𝔼𝓨i𝓨i12]\displaystyle\sum_{i=0}^{T-1}[\mathbb{E}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\mathbb{E}\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}]
\displaystyle\leq 240ρ2(4KG2+μyσ2)μy1𝒥γ12+3KTi=0T1[μx2𝔼gx,c,i12\displaystyle\frac{240\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}+\frac{3K}{T}\sum_{i=0}^{T-1}[\mu_{x}^{2}\mathbb{E}\|{{g}}_{x,c,i-1}\|^{2}
+μy2𝔼gy,c,i12]+60Kσ2μ2ymax𝑘p2k\displaystyle+\mu_{y}^{2}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}]+60K\sigma^{2}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k} (30)

See Appendix G for proof.

We observe that the increment is bounded by three terms, i.e., the term associated with network deviation on the order of 𝒪(μy)\mathcal{O}(\mu_{y}), iteration averaged risk gradients centroid, and a bound caused by the gradient noise on the order of 𝒪(σ2μy2)\mathcal{O}(\sigma^{2}\mu^{2}_{y}), respectively.

Lemma 5.

Under Assumptions 26, choosing the step sizes as in Lemma 3, the iteration averaged gradient disagreement is bounded as:

1T\displaystyle\frac{1}{T} i=0T1𝔼wJ(𝒙c,i,𝒚c,i)gw,c,i2(w=x or y)\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\|\nabla_{w}J(\bm{x}_{c,i},\bm{y}_{c,i})-g_{w,c,i}\|^{2}\quad(w=x\text{ or }y) (31)
\displaystyle{\leq} 1120Lf2ρ2(4KG2+μyσ2)μy1𝒥γ12+240Kσ2Lf2μy2max𝑘pk2\displaystyle\ \frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}
+12KLf2Ti=0T1[μx2𝔼gx,c,i12+μy2𝔼gy,c,i12]\displaystyle\ +\frac{12KL^{2}_{f}}{T}\sum_{i=0}^{T-1}[\mu_{x}^{2}\mathbb{E}\|{{g}}_{x,c,i-1}\|^{2}+\mu_{y}^{2}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}]

See Appendix H for proof.

Lemma 5 implies how the deviation between the global risk function at network centroid (𝒙c,i,𝒚c,i)(\bm{x}_{c,i},\bm{y}_{c,i}) and the centroid of local risk gradient is bounded over iterations. We observe that this bound consists of three terms, namely the iteration averaged risk gradient centroid and two terms on the order of 𝒪(μy)+𝒪(σ2μy2)\mathcal{O}(\mu_{y})+\mathcal{O}(\sigma^{2}\mu^{2}_{y}). In the following lemmas, we will establish some useful inequalities associated with the Lipschitz continuity of primal objective (12) and risk function (2a). For simplicity, we use

d(x,y)P(x)J(x,y)\displaystyle d(x,y)\triangleq P(x)-J(x,y) (32)

to denote the dual optimality gap for simplicity.

Lemma 6.

Under Assumptions 16, the primal objective P(x)P(x) satisfies:

𝔼\displaystyle\mathbb{E} [P(𝒙c,i+1)]𝔼[P(𝒙c,i)]\displaystyle[P(\bm{x}_{c,i+1})]-\mathbb{E}[{{P}}(\bm{x}_{c,i})]
\displaystyle\leq μx2𝔼P(𝒙c,i)2μx2(1Lμx)𝔼gx,c,i2\displaystyle-\frac{\mu_{x}}{2}\mathbb{E}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2}
+2μxLf2ν𝔼[dc(i)]+μx𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2\displaystyle+\frac{2\mu_{x}L^{2}_{f}}{\nu}\mathbb{E}[{d}_{c}(i)]+\mu_{x}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}
+5Lμx2σ2max𝑘pk2\displaystyle+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k} (33)

where dc(i)d(𝐱c,i,𝐲c,i)=P(𝐱c,i)J(𝐱c,i,𝐲c,i){d}_{c}(i)\triangleq{d}(\bm{x}_{c,i},\bm{y}_{c,i})=P(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i}) denotes the dual optimality gap evaluated at the network centroid (𝐱c,i,𝐲c,i)(\bm{x}_{c,i},\bm{y}_{c,i}). See Appendix I for proof.

Lemma 7.

Under Assumptions 16, we have

𝔼\displaystyle\mathbb{E} [J(𝒙c,i+1,𝒚c,i+1)]\displaystyle[-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})] (34)
\displaystyle\leq 𝔼[J(𝒙c,i+1,𝒚c,i)]μy2𝔼yJ(𝒙c,i+1,𝒚c,i)2\displaystyle\mathbb{E}[-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]-\frac{\mu_{y}}{2}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\|^{2}
μy(1Lfμy)2𝔼gy,c,i2+μyLf2𝔼𝒙c,i+1𝒙c,i2\displaystyle-\frac{\mu_{y}(1-L_{f}\mu_{y})}{2}\mathbb{E}\|{g}_{y,c,i}\|^{2}+\mu_{y}L^{2}_{f}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}
+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2+5Lfμy2σ2max𝑘pk2\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

and

𝔼\displaystyle\mathbb{E} [J(𝒙c,i+1,𝒚c,i)]\displaystyle[-J(\bm{x}_{c,i+1},\bm{y}_{c,i})] (35)
\displaystyle\leq 𝔼[J(𝒙c,i,𝒚c,i)]+μx2𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2\displaystyle\mathbb{E}[-J(\bm{x}_{c,i},\bm{y}_{c,i})]+\frac{\mu_{x}}{2}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}
+μx(3+Lfμx)2𝔼gx,c,i2+5Lfμx2σ2max𝑘pk2\displaystyle+\frac{\mu_{x}(3+L_{f}\mu_{x})}{2}\mathbb{E}\|{g}_{x,c,i}\|^{2}+5L_{f}\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

See Appendix J for proof.

Lemma 8.

Under Assumptions 16, choosing step sizes

μxmin{μy,ν2μy4Lf2,LfL+Lfμy},μymin{1,1ν,1Lf}\mu_{x}\leq\min\{\mu_{y},\frac{\nu^{2}\mu_{y}}{4L^{2}_{f}},\sqrt{\frac{L_{f}}{L+L_{f}}}\mu_{y}\},\mu_{y}\leq\min\{1,\frac{1}{\nu},\frac{1}{L_{f}}\}

(36)

the iteration averaged dual optimality gap is bounded as:

1T\displaystyle\frac{1}{T} i=0T1𝔼[dc(i)]\displaystyle\sum_{i=0}^{T-1}\mathbb{E}[{d}_{c}(i)]
\displaystyle\leq 2νμyTdc(0)+1Ti=0T1{3μx(1νμy)νμy𝔼xJ(𝒙c,i,𝒚c,i)\displaystyle\frac{2}{\nu\mu_{y}T}{d}_{c}(0)+\frac{1}{T}\sum_{i=0}^{T-1}\Big{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{\nu\mu_{y}}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})
gx,c,i2+(1νμy)[2μxνμy+(L+Lf)μx2νμy]𝔼gx,c,i2\displaystyle-{g}_{x,c,i}\|^{2}+(1-\nu\mu_{y})\Big{[}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{]}\mathbb{E}\|{g}_{x,c,i}\|^{2}
+2Lf2ν𝔼𝒙c,i+1𝒙c,i2+2ν𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2}\displaystyle+\frac{2L_{f}^{2}}{\nu}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}+\frac{2}{\nu}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}\Big{\}}
μy2T(1Lfμy)i=1T1j=0i1(1νμy2)ij𝔼gy,c,j2\displaystyle-\frac{\mu_{y}}{2T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\|{g}_{y,c,j}\|^{2}
+20Lfμyσ2νmax𝑘pk2\displaystyle+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k} (37)

See Appendix K for proof.

The bound of dual optimality gap is quantified by several terms, including the initialization term which decays on the rate of 𝒪(1μyT)\mathcal{O}(\frac{1}{\mu_{y}T}), the gradient disagreement terms, the risk gradients centroid, the increment of network centroid, and a constant term on the order of 𝒪(σ2μy)\mathcal{O}(\sigma^{2}\mu_{y}).

Appendix C Proof of Theorem 1

We now prove the best iterate convergence for the primal variable. Moving the gradient term 𝔼P(𝒙c,i)2\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2} to the left hand side in Lemma 6 and averaging the inequality over iterations, we get

1T\displaystyle\frac{1}{T} i=0T1𝔼P(𝒙c,i)2\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}
\displaystyle\leq 1Ti=0T1{2μx𝔼[P(𝒙c,i)P(𝒙c,i+1)](1Lμx)𝔼gx,c,i2\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\Big{\{}\frac{2}{\mu_{x}}\mathbb{E}[P(\bm{x}_{c,i})-P(\bm{x}_{c,i+1})]-(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2}
+4Lf2ν𝔼[dc(i)]+2𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2}\displaystyle+\frac{4L_{f}^{2}}{\nu}\mathbb{E}[{d}_{c}(i)]+2\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}\Big{\}}
+10Lμxσ2max𝑘pk2\displaystyle+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k} (38)
(a)\displaystyle\overset{(a)}{\leq} 2μxT[P(𝒙c,0)P]1Ti=0T1{(1Lμx)𝔼gx,c,i2\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Big{\{}(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2}
4Lf2ν𝔼[dc(i)]2𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2}\displaystyle-\frac{4L_{f}^{2}}{\nu}\mathbb{E}[{d}_{c}(i)]-2\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}\Big{\}}
+10Lμxσ2max𝑘pk2\displaystyle+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k} (39)

where (a)(a) follows from Assumption 3 and the telescoping results of i=0TP(𝒙c,i)P(𝒙c,i+1)\sum_{i=0}^{T}P(\bm{x}_{c,i})-P(\bm{x}_{c,i+1}) in (38). Invoking Lemma 8 and regrouping the terms in (39), we obtain

1T\displaystyle\frac{1}{T} i=0T1𝔼P(𝒙c,i)2\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2} (40)
\displaystyle\leq 2μxT[P(𝒙c,0)P]1Ti=0T1{{1Lμx4(1νμy)Lf2ν\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}
×(2μxνμy+(L+Lf)μx2νμy)}𝔼gx,c,i2(12μx(1νμy)Lf2ν2μy\displaystyle\times\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}\Big{\}}\mathbb{E}\|{g}_{x,c,i}\|^{2}-\Big{(}\frac{12\mu_{x}(1-\nu\mu_{y})L^{2}_{f}}{\nu^{2}\mu_{y}}
+2)𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i28Lf4ν2𝔼𝒙c,i+1𝒙c,i2\displaystyle+2\Big{)}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}-{\frac{8L_{f}^{4}}{\nu^{2}}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}}
8Lf2ν2𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2}+8Lf2ν2μyTdc(0)\displaystyle-{\frac{8L_{f}^{2}}{\nu^{2}}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}}\Bigg{\}}+\frac{8L_{f}^{2}}{\nu^{2}\mu_{y}T}{d}_{c}(0)
2μyLf2νT(1Lfμy)i=1T1j=0i1(1νμy2)ij𝔼gy,c,j2\displaystyle-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\|{g}_{y,c,j}\|^{2}
+80Lf3μyσ2ν2max𝑘pk2+10Lμxσ2max𝑘pk2\displaystyle+\frac{80L^{3}_{f}\mu_{y}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
(a)\displaystyle\overset{(a)}{\leq} 2μxT[P(𝒙c,0)P]1Ti=0T1{{1Lμx4(1νμy)Lf2ν\displaystyle\frac{2}{\mu_{x}T}[P(\bm{x}_{c,0})-P^{\star}]-\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}
×(2μxνμy+(L+Lf)μx2νμy)8Lf4μx2ν2}𝔼gx,c,i2\displaystyle\times\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}-\frac{8L_{f}^{4}\mu^{2}_{x}}{\nu^{2}}\Big{\}}\mathbb{E}\|{g}_{x,c,i}\|^{2}
4𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i28Lf2ν2\displaystyle-4\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}-\frac{8L_{f}^{2}}{\nu^{2}}
×𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2}+80Lf4μx2σ2ν2max𝑘p2k\displaystyle{\times\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}}\Bigg{\}}+\frac{80L^{4}_{f}\mu^{2}_{x}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}
+8Lf2ν2μyTdc(0)+80Lf3μyσ2ν2max𝑘pk2+10Lμxσ2max𝑘pk2\displaystyle+\frac{8L_{f}^{2}}{\nu^{2}\mu_{y}T}{d}_{c}(0)+\frac{80L^{3}_{f}\mu_{y}\sigma^{2}}{\nu^{2}}\underset{k}{\max}\ p^{2}_{k}+10L\mu_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
2μyLf2νT(1Lfμy)i=1T1j=0i1(1νμy2)ij𝔼gy,c,j2\displaystyle-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\|{g}_{y,c,j}\|^{2}

where (a)(a) follows from several steps: inserting 𝒙i+1=𝒙iμx𝒈x,c,i\bm{x}_{i+1}=\bm{x}_{i}-\mu_{x}\bm{g}_{x,c,i}, using (77) and choosing step size

μy1ν,12μx(1νμy)Lf2ν2μy2μxν2μy6Lf2\mu_{y}\leq\frac{1}{\nu},\frac{12\mu_{x}(1-\nu\mu_{y})L^{2}_{f}}{\nu^{2}\mu_{y}}\leq 2\Rightarrow\mu_{x}\leq\frac{\nu^{2}\mu_{y}}{6L^{2}_{f}}

(41)

For the sake of simplicity, we introduce the following notations for constants:

τ180Lf4ν2,τ280Lf3ν2,τ38Lf2ν2,τ410Lβ196KLf4ν2+48KLf2\begin{aligned} \tau_{1}&\triangleq\frac{80L^{4}_{f}}{\nu^{2}},\tau_{2}\triangleq\frac{80L^{3}_{f}}{\nu^{2}},\tau_{3}\triangleq\frac{8L^{2}_{f}}{\nu^{2}},\tau_{4}\triangleq 10L\\ \beta_{1}&\triangleq\frac{96KL_{f}^{4}}{\nu^{2}}+48KL^{2}_{f}\end{aligned}

(42)

Let P0=P(𝒙c,0)PP_{0}=P(\bm{x}_{c,0})-P^{\star} be the initial primal optimality gap. Invoking Lemma 5 for inequality (40), we get

1T\displaystyle\frac{1}{T} i=0T1𝔼P(𝒙c,i)2\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2} (43)
\displaystyle\leq 2P0μxT+(τ2μy+τ4μx+τ1μx2)σ2max𝑘pk2+τ3dc(0)μyT\displaystyle\frac{2P_{0}}{\mu_{x}T}+(\tau_{2}\mu_{y}+\tau_{4}\mu_{x}+\tau_{1}\mu^{2}_{x})\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+\frac{\tau_{3}{d}_{c}(0)}{\mu_{y}T}
1T{1Lμx4(1νμy)Lf2ν(2μxνμy+(L+Lf)μx2νμy)\displaystyle-\frac{1}{T}\Big{\{}1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}
τ1μx2β1μx2}i=0T1𝔼gx,c,i2primal gradient+β1μy2Ti=0T1𝔼gy,c,i12dual gradient\displaystyle-\tau_{1}\mu^{2}_{x}-\beta_{1}\mu^{2}_{x}\Big{\}}\sum_{i=0}^{T-1}\underbrace{\mathbb{E}\|{g}_{x,c,i}\|^{2}}_{\text{primal gradient}}+\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=0}^{T-1}\mathbb{E}\underbrace{\|{{g}}_{y,c,i-1}\|^{2}}_{\text{dual gradient}}
+240Kσ2Lf2μy2max𝑘pk2(τ3+4)2μyLf2νT(1Lfμy)\displaystyle+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}(\tau_{3}+4)-\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})
×i=1T1j=0i1(1νμy2)ij𝔼gy,c,j2dual gradient+1120Lf2ρ2μy(τ3+4)1𝒥γ12\displaystyle\times\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\underbrace{\|{g}_{y,c,j}\|^{2}}_{\text{dual gradient}}+\frac{1120L^{2}_{f}\rho^{2}\mu_{y}(\tau_{3}+4)}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}
z\displaystyle z ×(4KG2+μyσ2)+β1μx2gx,c,12\displaystyle\times(4KG^{2}+\mu_{y}\sigma^{2})+\beta_{1}\mu^{2}_{x}\|g_{x,c,-1}\|^{2}

For the terms associated with primal gradient, we can bound them by choosing a suitable step size:

1Lμx4(1νμy)Lf2ν(2μxνμy+(L+Lf)μx2νμy)(τ1+β1)\displaystyle 1-L\mu_{x}-\frac{4(1-\nu\mu_{y})L^{2}_{f}}{\nu}\Big{(}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{)}-(\tau_{1}+\beta_{1})
×μx2(a)1Lμx2τ3μxμy(τ1+β1)μx2(b)0\displaystyle\times\mu^{2}_{x}\overset{(a)}{\geq}1-L\mu_{x}-2\tau_{3}\frac{\mu_{x}}{\mu_{y}}-\Big{(}\tau_{1}+\beta_{1}\Big{)}\mu^{2}_{x}\overset{(b)}{\geq}0 (44)

where (a)(a) we choose μx2/(L+Lf)\mu_{x}\leq 2/(L+L_{f}), (b) we choose μx14L,μxμy4τ3\mu_{x}\leq\frac{1}{4L},\mu_{x}\leq\frac{\mu_{y}}{4\tau_{3}} and μx21/(4(τ1+β1))\mu^{2}_{x}\leq 1/(4(\tau_{1}+\beta_{1})). By doing so, we can eliminate the primal gradient terms since now it has a negative sign. For terms associated with dual gradient, we can bound it by deducing that:

2μyLf2νT(1Lfμy)i=1T1j=0i1(1νμy2)ij𝔼gy,c,j2\displaystyle\frac{2\mu_{y}L^{2}_{f}}{\nu T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\|{g}_{y,c,j}\|^{2}
β1μy2Ti=0T1𝔼gy,c,i12\displaystyle-\frac{\beta_{1}\mu^{2}_{y}}{T}\sum_{i=0}^{T-1}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}\hskip 200.0003pt
=(a)4Lf2ν2T(1Lfμy)(1νμy2)i=0T2(1(1νμy2)T1i)\displaystyle\overset{(a)}{=}\frac{4L^{2}_{f}}{\nu^{2}T}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\sum_{i=0}^{T-2}{(1-(1-\frac{\nu\mu_{y}}{2})^{T-1-i})}
×𝔼gy,c,i2β1μy2Ti=0T1𝔼gy,c,i12\displaystyle\quad\times\mathbb{E}\|{g}_{y,c,i}\|^{2}-\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=0}^{T-1}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}
(b)τ32T(1Lfμy)(1νμy2)i=0T2(1(1νμy2))\displaystyle\overset{(b)}{\geq}\ \frac{\tau_{3}}{2T}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\sum_{i=0}^{T-2}{(1-(1-\frac{\nu\mu_{y}}{2}))}
×𝔼gy,c,i2β1μy2Ti=1T2𝔼gy,c,i2\displaystyle\quad\times\mathbb{E}\|{g}_{y,c,i}\|^{2}-\frac{\beta_{1}\mu_{y}^{2}}{T}\sum_{i=-1}^{T-2}\mathbb{E}\|{{g}}_{y,c,i}\|^{2}
1T{2Lf2ν(1Lfμy)(1νμy2)μyβ1μy2}i=1T2𝔼gy,c,i2\displaystyle\geq\ \frac{1}{T}\{\frac{2L^{2}_{f}}{\nu}(1-L_{f}\mu_{y})(1-\frac{\nu\mu_{y}}{2})\mu_{y}-\beta_{1}\mu_{y}^{2}\}\sum_{i=-1}^{T-2}\mathbb{E}\|{{g}}_{y,c,i}\|^{2}
β1μy2gy,c,12T(c)β1μy2gy,c,12T\displaystyle\quad-\frac{\beta_{1}\mu_{y}^{2}\|{{g}}_{y,c,-1}\|^{2}}{T}\quad\overset{(c)}{\geq}-\frac{\beta_{1}\mu_{y}^{2}\|{{g}}_{y,c,-1}\|^{2}}{T} (45)

where (a)(a) we have used

i=1T1j=0i1qijaj=qi=1T1j=0i1qi1jaj\displaystyle\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}q^{i-j}a_{j}=q\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}q^{i-1-j}a_{j} =qi=0T2ai(t=0T2iqt)\displaystyle=q\sum_{i=0}^{T-2}a_{i}(\sum_{t=0}^{T-2-i}q^{t})
=q(i=0T2ai1qT1i1q)\displaystyle=q(\sum_{i=0}^{T-2}a_{i}\frac{1-q^{T-1-i}}{1-q})

q(0,1)q\in(0,1), (b)(b) is due to 1qt1q1-q^{t}\geq 1-q when t1t\geq 1, (c)(c) we choose μy2Lf2νβ1\mu_{y}\leq\frac{2L^{2}_{f}}{\nu\beta_{1}}. Replacing these gradients with the deduced upper bound, we conclude

1Ti=0T1𝔼P(𝒙c,i)2\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}
2P0μxT+τ3dc(0)μyT+(τ2μy+τ4μx+τ1μx2)σ2max𝑘pk2\displaystyle\leq\ \frac{2P_{0}}{\mu_{x}T}+\frac{\tau_{3}{d}_{c}(0)}{\mu_{y}T}+(\tau_{2}\mu_{y}+\tau_{4}\mu_{x}+\tau_{1}\mu^{2}_{x})\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
+1120Lf2ρ2(4KG2+μyσ2)μy1𝒥γ12(τ3+4)+β1μy2gy,c,12T\displaystyle\quad+\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}(\tau_{3}+4)+\frac{\beta_{1}\mu_{y}^{2}\|{{g}}_{y,c,-1}\|^{2}}{T}
+β1μx2gx,c,12T+240Kσ2Lf2μy2max𝑘pk2(τ3+4)\displaystyle\quad+\frac{\beta_{1}\mu^{2}_{x}\|g_{x,c,-1}\|^{2}}{T}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}(\tau_{3}+4) (46)

Considering all the step size conditions derived above and choosing

μx=min{ν2μy6Lf2,12(L+Lf),14L,μy4τ3,12τ1+β1,μy}μy=min{1ν,2Lf2νβ1,1𝒥γ12T1/2,1}\begin{aligned} \mu_{x}&=\min\{\frac{\nu^{2}\mu_{y}}{6L^{2}_{f}},\frac{1}{2(L+L_{f})},\frac{1}{4L},\frac{\mu_{y}}{4\tau_{3}},\frac{1}{2\sqrt{\tau_{1}+\beta_{1}}},\mu_{y}\}\\ \mu_{y}&=\min\{\frac{1}{\nu},\frac{2L^{2}_{f}}{\nu\beta_{1}},\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{T^{1/2}},1\}\end{aligned}

(47)

we get the non-asymptotic bound as follows:

1Ti=0T1𝔼P(𝒙c,i)2𝒪(1T1/2)+𝒪(1T)+𝒪(1T2)\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla P(\bm{x}_{c,i})\|^{2}\leq\mathcal{O}\Big{(}\frac{1}{{T}^{1/2}}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T}\Big{)}+\mathcal{O}\Big{(}\frac{1}{T^{2}}\Big{)} (48)

It is seen that the convergence rate is dominated by 𝒪(1T1/2)\mathcal{O}\Big{(}\frac{1}{{T}^{1/2}}\Big{)}, therefore DSS-OG outputs ε\varepsilon-stationary point after T0=𝒪(ε4)T_{0}=\mathcal{O}(\varepsilon^{-4}) iteration and gradient evaluation complexity.

Appendix D Proof of Theorem 2

We further prove that DSS-OG can find a solution for the dual problem by halting the update of primal variable after 𝒙c,T0{\bm{x}}^{\star}_{c,T_{0}} is found; this can be achieved by setting μx=0\mu_{x}=0 after iteration T0T_{0}. For the convenience of analysis, we use 𝒈y,c,i{\bm{g}}^{\star}_{y,c,i} and gy,c,i{{g}}^{\star}_{y,c,i} to denote the stochastic and true gradients when their primal variable is fixed at 𝒙c,T0{\bm{x}}^{\star}_{c,T_{0}}. Repeating the arguments in Lemma 7 at point x=𝒙c,T0x={\bm{x}}^{\star}_{c,T_{0}}, we get

𝔼\displaystyle\mathbb{E} [d(𝒙c,T0,𝒚c,i+1)]\displaystyle[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i+1})] (49)
\displaystyle\leq (1νμy)𝔼[d(𝒙c,T0,𝒚c,i)]+μy2𝔼yJ(𝒙c,T0,𝒚c,i)\displaystyle(1-\nu\mu_{y})\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})]+\frac{\mu_{y}}{2}\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})
gy,c,i2μy2(1Lfμy)𝔼gy,c,i2+5Lfμy2σ2max𝑘pk2\displaystyle-{{g}}^{\star}_{y,c,i}\|^{2}-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\|{{g}}^{\star}_{y,c,i}\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

We now study recursion (49) for sufficiently large iteration ii. Furthermore, we choose the step size μy=2T+1ν(T+1)2\mu_{y}=\frac{2T+1}{\nu(T+1)^{2}}. In the following, we insert this expression of μy\mu_{y} into the coefficient associated with 𝔼[d(𝒙c,T0,𝒚c,T)]\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})] in (49), and keep “μy\mu_{y}” notation in the remaining terms to determine additional step size conditions. Setting i=Ti=T in (49), we get

𝔼[d(𝒙c,T0,𝒚c,T+1)]\displaystyle\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})] (50)
\displaystyle\leq T2(T+1)2𝔼[d(𝒙c,T0,𝒚c,T)]+μy2[𝔼yJ(𝒙c,T0,𝒚c,T)\displaystyle\frac{T^{2}}{(T+1)^{2}}\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]+\frac{\mu_{y}}{2}\Big{[}\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})
gy,c,T2(1Lfμy)𝔼gy,c,T2]+5Lfμ2yσ2max𝑘p2k\displaystyle-{{g}}^{\star}_{y,c,T}\|^{2}-(1-L_{f}\mu_{y})\mathbb{E}\|{{g}}^{\star}_{y,c,T}\|^{2}\Big{]}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

Multiplying both sides by (T+1)2(T+1)^{2}, we have

(T+1)2𝔼[d(𝒙c,T0,𝒚c,T+1)]\displaystyle(T+1)^{2}\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})] (51)
\displaystyle\leq T2𝔼[d(𝒙c,T0,𝒚c,T)]+μy(T+1)22[𝔼yJ(𝒙c,T0,𝒚c,T)\displaystyle\ T^{2}\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})]+\frac{\mu_{y}(T+1)^{2}}{2}\Big{[}\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T})
gy,c,T2(1Lfμy)𝔼gy,c,T2]+5(T+1)2Lfμ2yσ2\displaystyle-{{g}}^{\star}_{y,c,T}\|^{2}-(1-L_{f}\mu_{y})\mathbb{E}\|{{g}}^{\star}_{y,c,T}\|^{2}\Big{]}+5(T+1)^{2}L_{f}\mu^{2}_{y}\sigma^{2}
×max𝑘pk2\displaystyle\times\underset{k}{\max}\ p^{2}_{k}

We further denote δi=i2×𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,i)]\delta_{i}=i^{2}\times\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})] and obtain a recursion regarding δi\delta_{i}. Iterating the recursion of δi\delta_{i} from i=T+1i=T+1 to 0, we obtain

δT+1\displaystyle\delta_{T+1}\leq δ0+μy(T+1)22i=0T[𝔼yJ(𝒙c,T0,𝒚c,i)gy,c,i2\displaystyle\delta_{0}+\frac{\mu_{y}(T+1)^{2}}{2}\sum_{i=0}^{T}\Big{[}\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})-{{g}}^{\star}_{y,c,i}\|^{2}
(1Lfμy)𝔼gy,c,i2]+5(T+1)3Lfμ2yσ2max𝑘p2k\displaystyle-(1-L_{f}\mu_{y})\mathbb{E}\|{{g}}^{\star}_{y,c,i}\|^{2}\Big{]}+5(T+1)^{3}L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k} (52)

Repeating the arguments in Lemma 4, we can bound the iteration averaged gradient deviation as follows:

1T+1i=0T𝔼yJ(𝒙c,T0,𝒚c,i)gy,c,i2\displaystyle\frac{1}{T+1}\sum_{i=0}^{T}\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,i})-{{g}}^{\star}_{y,c,i}\|^{2}
1120Lf2ρ2(4KG2+μyσ2)μy(1𝒥γ12)+12KLf2ν(T+1)i=0Tμy2\displaystyle\leq\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{(1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2})}+\frac{12KL^{2}_{f}}{\nu(T+1)}\sum_{i=0}^{T}\mu_{y}^{2}
×𝔼gy,c,i2+240Kσ2Lf2μy2max𝑘pk2\displaystyle\quad\times\mathbb{E}\|{{g}}^{\star}_{y,c,i}\|^{2}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k} (53)

Inserting (D) into (D) and choosing 12KLf2μy2ν<1Lfμyμy<ν12KLf2\frac{12KL^{2}_{f}\mu^{2}_{y}}{\nu}<1-L_{f}\mu_{y}\Rightarrow\mu_{y}<\sqrt{\frac{\nu}{12KL^{2}_{f}}} to eliminate gradient terms in (D)\eqref{PLboundstep2}, we get (δ0=0\delta_{0}=0):

δT+1\displaystyle\delta_{T+1} 560(T+1)3Lf2ρ2(4KG2+μyσ2)μy2(1𝒥γ12)+5(T+1)3\displaystyle\leq\frac{560(T+1)^{3}L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu^{2}_{y}}{(1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2})}+5(T+1)^{3}
×Lfμy2σ2max𝑘pk2+120(T+1)3Kσ2Lf2μy3max𝑘pk2\displaystyle\quad\times L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+120(T+1)^{3}K\sigma^{2}L^{2}_{f}\mu^{3}_{y}\underset{k}{\max}\ p^{2}_{k} (54)

Because μy=2T+1ν(T+1)22ν(T+1)\mu_{y}=\frac{2T+1}{\nu(T+1)^{2}}\leq\frac{2}{\nu(T+1)}, dividing both sides of (D) by (T+1)2(T+1)^{2}, we get the following non-asymptotic bound for dual optimality gap:

𝔼[d(𝒙c,T0,𝒚c,T+1)]\displaystyle\mathbb{E}[d({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})] =𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,T+1)]\displaystyle=\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}_{c,T+1})]
𝒪((T+1)μy2)+𝒪((T+1)μy3)\displaystyle\leq\mathcal{O}\Big{(}(T+1)\mu^{2}_{y}\Big{)}+\mathcal{O}\Big{(}(T+1)\mu^{3}_{y}\Big{)}
𝒪(1T+1)+𝒪(1(T+1)2)\displaystyle\leq\mathcal{O}\Big{(}\frac{1}{T+1}\Big{)}+\mathcal{O}\Big{(}\frac{1}{(T+1)^{2}}\Big{)} (55)

Appendix E Proof of Theorem 3

According to Theorem 1, after running T0=𝒪(ε4)T_{0}=\mathcal{O}(\varepsilon^{-4}) iterations, we can identify the point 𝒙c,T0{\bm{x}}^{\star}_{c,T_{0}} which satisfies the following condition

𝔼P(𝒙c,T0)2ε24=𝒪(ε2)\displaystyle\mathbb{E}\|\nabla P({\bm{x}}^{\star}_{c,T_{0}})\|^{2}\leq\frac{\varepsilon^{2}}{4}=\mathcal{O}(\varepsilon^{2}) (56)

According to Lemma 2, this is equivalent to

𝔼xJ(𝒙c,T0,𝒚o(𝒙c,T0))2ε24=𝒪(ε2)\displaystyle\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}\leq\frac{\varepsilon^{2}}{4}=\mathcal{O}(\varepsilon^{2}) (57)

Furthermore, from Theorem 2, after iterations T1T_{1}, we can find a point 𝒚c,T0+T1{\bm{y}}^{\star}_{c,T_{0}+T_{1}} such that

𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,T0+T1)]1Lf2ν8ε2=𝒪(ε2)\displaystyle\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]\leq\frac{1}{L^{2}_{f}}\frac{\nu}{8}\varepsilon^{2}=\mathcal{O}(\varepsilon^{2}) (58)

We keep the subscript in (𝒙c,T0,𝒚c,T0+T1)({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{{c,T_{0}+T_{1}}}) to indicate the exact iterations at which this point is obtained. Under condition (58), we can verify (𝒙c,T0,𝒚c,T0+T1)({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}}) satisfies the second stationary condition (29b), i.e.,

𝔼\displaystyle\mathbb{E} yJ(𝒙c,T0,𝒚c,T0+T1)2\displaystyle\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\|^{2}
=\displaystyle= 𝔼yJ(𝒙c,T0,𝒚c,T0+T1)yJ(𝒙c,T0,𝒚o(𝒙c,T0))2\displaystyle\mathbb{E}\|\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})-\nabla_{y}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}
(a)\displaystyle\overset{(a)}{\leq} Lf2𝔼𝒚c,T0+T1𝒚o(𝒙c,T0)2\displaystyle L^{2}_{f}\mathbb{E}\|{\bm{y}}^{\star}_{c,T_{0}+T_{1}}-\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}})\|^{2}
(b)\displaystyle\overset{(b)}{\leq} Lf22ν𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,T0+T1)]\displaystyle L^{2}_{f}{\frac{2}{\nu}\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]}
(c)\displaystyle\overset{(c)}{\leq} ε24ε2=𝒪(ε2)\displaystyle\frac{\varepsilon^{2}}{4}\leq\ \varepsilon^{2}=\mathcal{O}(\varepsilon^{2}) (59)

where (a)(a) follows from the LfL_{f}-smooth property of the local risk, (b)(b) follows from Lemma 1, (c)(c) follows from (23). On the other hand, the solution (𝒙c,T0,𝒚c,T0+T1)({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}}) also satisfies the first stationary condition (29a), i.e.,

𝔼\displaystyle\mathbb{E} xJ(𝒙c,T0,𝒚c,T0+T1)2\displaystyle\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})\|^{2}
\displaystyle\leq 𝔼xJ(𝒙c,T0,𝒚c,T0+T1)xJ(𝒙c,T0,𝒚o(𝒙c,T0))\displaystyle\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})-\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))
+xJ(𝒙c,T0,𝒚o(𝒙c,T0))2\displaystyle+\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}
\displaystyle\leq 2𝔼xJ(𝒙c,T0,𝒚o(𝒙c,T0))2\displaystyle 2\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}
+2𝔼xJ(𝒙c,T0,𝒚c,T0+T1)xJ(𝒙c,T0,𝒚o(𝒙c,T0))2\displaystyle+2\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})-\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}
(a)\displaystyle\overset{(a)}{\leq} 2𝔼xJ(𝒙c,T0,𝒚o(𝒙c,T0))2\displaystyle 2\mathbb{E}\|\nabla_{x}J({\bm{x}}^{\star}_{c,T_{0}},\bm{y}^{o}({\bm{x}}^{\star}_{c,T_{0}}))\|^{2}
+2Lf22ν𝔼[P(𝒙c,T0)J(𝒙c,T0,𝒚c,T0+T1)]\displaystyle+2L^{2}_{f}{\frac{2}{\nu}\mathbb{E}[P({\bm{x}}^{\star}_{c,T_{0}})-J({\bm{x}}^{\star}_{c,T_{0}},{\bm{y}}^{\star}_{c,T_{0}+T_{1}})]}
\displaystyle\leq 𝒪(ε22)+𝒪(ε22)=𝒪(ε2)\displaystyle\mathcal{O}(\frac{\varepsilon^{2}}{2})+\mathcal{O}(\frac{\varepsilon^{2}}{2})=\mathcal{O}(\varepsilon^{2})\hskip 100.00015pt (60)

where (a)(a) follows from (57) and (23). Therefore, the overall complexity for iteration and gradient evaluation is given by T0+T1=𝒪(ε4)+𝒪(ε2)T_{0}+T_{1}=\mathcal{O}(\varepsilon^{-4})+\mathcal{O}(\varepsilon^{-2}).

The proof is completed here.

Appendix F Proof of Lemma 3

Taking conditional expectations of the transient term 𝓧i𝓧c,i2+𝓨i𝓨c,i2\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2} over the distribution of random samples {𝝃k,ix,𝝃k,iy}k=1K\{\bm{\xi}^{x}_{k,i},\bm{\xi}^{y}_{k,i}\}_{k=1}^{K}, we get

𝔼\displaystyle\mathbb{E} [𝓧i𝓧c,i2+𝓨i𝓨c,i2𝓕i1]\displaystyle\Big{[}\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
=\displaystyle= 𝔼[𝓧i(𝟙pIM1)𝓧i2+𝓨i(𝟙pIM2)𝓨i2\displaystyle\mathbb{E}\Big{[}\|{\bm{\mathcal{X}}}_{i}-(\mathbbm{1}p^{\top}\otimes I_{M_{1}})\bm{\mathcal{X}}_{i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-(\mathbbm{1}p^{\top}\otimes I_{M_{2}})\bm{\mathcal{Y}}_{i}\|^{2}
𝓕i1]\displaystyle\mid\bm{\mathcal{F}}_{i-1}\Big{]}
=\displaystyle= 𝔼[[(V1IM1)(VIM1)(𝟙pIM1)]𝓧i2+\displaystyle\mathbb{E}\Big{[}\|[(V^{-1}\otimes I_{M_{1}})^{\top}(V\otimes I_{M_{1}})^{\top}-(\mathbbm{1}p^{\top}\otimes I_{M_{1}})]{\bm{\mathcal{X}}}_{i}\|^{2}+\|
[(V1IM2)(VIM2)(𝟙pIM2)]𝓨i2𝓕i1]\displaystyle[(V^{-1}\otimes I_{M_{2}})^{\top}(V\otimes I_{M_{2}})^{\top}-(\mathbbm{1}p^{\top}\otimes I_{M_{2}})]{\bm{\mathcal{Y}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
=(a)\displaystyle\overset{(a)}{=} 𝔼[𝒱L1𝒱R1𝓧i2+𝒱L2𝒱R2𝓨i2𝓕i1]\displaystyle\mathbb{E}\Big{[}\|\mathcal{V}_{L_{1}}\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}_{L_{2}}\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
(b)\displaystyle\overset{(b)}{\leq} 𝒱L12𝔼[𝒱R1𝓧i2+𝒱R2𝓨i2𝓕i1]\displaystyle\|\mathcal{V}_{L_{1}}\|^{2}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]} (61)

where (a)(a) follows from the Jordan decomposition of AA and (b)(b) follows from the sub-multiplicative property of norms and the fact that 𝒱L12=𝒱L22\|\mathcal{V}_{L_{1}}\|^{2}=\|\mathcal{V}_{L_{2}}\|^{2}. Taking expectation over (61) again and averaging this inequality over iterations, we get:

1Ti=0T1𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2]\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}]
𝒱L121Ti=0T1𝔼[𝒱R1𝓧i2+𝒱R2𝓨i2]\displaystyle\leq\|\mathcal{V}_{L_{1}}\|^{2}\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}] (62)

In the following, we bound the term 𝔼[𝒱R1𝓧i2]\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}] by Eqs. (63)-(68), 𝔼[𝒱R1𝓧i2+𝒱R2𝓨i2]\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}] by (F), and deduce the final bound (F) by Eqs. (70)-(73). To bound 𝔼[𝒱R1𝓧i2]\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}], we inspect 𝔼[𝒱R1𝓧i2𝓕i1]\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}] on the right hand side (RHS) of (61) by deducing that:

𝔼\displaystyle\mathbb{E} [𝒱R1𝓧i2𝓕i1]\displaystyle\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
(a)\displaystyle\overset{(a)}{\leq} 𝔼[𝒥γ1𝒱R1[𝓧i1μx𝓖x,i1]2𝓕i1]\displaystyle\mathbb{E}\Big{[}\|\mathcal{J}^{\top}_{\gamma_{1}}\mathcal{V}^{\top}_{R_{1}}[\bm{\mathcal{X}}_{i-1}-\mu_{x}\bm{\mathcal{G}}_{x,i-1}]\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}\hskip 100.00015pt
(b)\displaystyle\overset{(b)}{\leq} 𝒥γ12𝒱R1𝓧i122μx𝒥γ12𝔼[𝒱R1𝓧i1,\displaystyle{\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}-2\mu_{x}{\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\mathbb{E}\Big{[}\Big{\langle}\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1},
𝒱R1𝓖x,i1𝓕i1]+μ2x𝒥γ12𝔼[𝒱R1𝓖x,i12𝓕i1]\displaystyle\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\Big{\rangle}\mid\bm{\mathcal{F}}_{i-1}\Big{]}+\mu^{2}_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
(c)\displaystyle\overset{(c)}{\leq} 𝒥γ12𝒱R1𝓧i122μx𝒥γ12𝒱R1𝓧i1,𝒱R1𝒢x,i1\displaystyle{\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}-2\mu_{x}{\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\Big{\langle}\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1},\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\Big{\rangle}
+μx2𝒥γ12𝔼[𝒱R1𝓖x,i12𝓕i1]\displaystyle+\mu^{2}_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}]
(d)\displaystyle\overset{(d)}{\leq} 𝒥γ12𝒱R1𝓧i12+μx𝒥γ12[𝒱R1𝓧i12\displaystyle\ {\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}+\mu_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}
+𝒱R1𝒢x,i12]+μ2x𝒥γ12𝔼[|𝒱R1𝓖x,i12𝓕i1]\displaystyle+\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}\Big{]}+\mu^{2}_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\mathbb{E}\Big{[}|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
(e)\displaystyle\overset{(e)}{\leq} 𝒥γ12𝒱R1𝓧i12+μx𝒥γ12[𝒱R1𝓧i12\displaystyle{\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}+\mu_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}
+𝒱R1𝒢x,i12]+μ2x𝒥γ12𝒱R1𝒢x,i12\displaystyle\quad+\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}\Big{]}+\mu^{2}_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}
+10𝒥γ12𝒱R12σ2μx2(σ2=k=1Kσk2)\displaystyle\quad+10\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\sigma^{2}\mu^{2}_{x}\quad(\sigma^{2}=\sum_{k=1}^{K}\sigma^{2}_{k}) (63)

where (a)(a) follows from (10a) and

𝒱R1𝒜1=𝒱R1(𝟙pIM1+𝒱L1𝒥γ1𝒱R1)=𝒥γ1𝒱R1\mathcal{V}^{\top}_{R_{1}}\mathcal{A}_{1}^{\top}=\mathcal{V}^{\top}_{R_{1}}(\mathbbm{1}p^{\top}\otimes I_{M_{1}}+\mathcal{V}_{L_{1}}\mathcal{J}^{\top}_{\gamma_{1}}\mathcal{V}^{\top}_{R_{1}})=\mathcal{J}^{\top}_{\gamma_{1}}\mathcal{V}^{\top}_{R_{1}}

Moreover, (b)(b) follows from the sub-multiplicative property of the norm and the expansion of the squared norm, (c)(c) follows from the conditional expectation over the sample {𝝃k,ix}k=1K\{\bm{\xi}^{x}_{k,i}\}_{k=1}^{K} and the fact that 𝓧i1\bm{\mathcal{X}}_{i-1} only relies on the random samples up to iteration i1i-1, (d)(d) follows from the inequality 2a,ba2+b2-2\langle a,b\rangle\leq\|a\|^{2}+\|b\|^{2}, (e)(e) follows from Assumption 4 and the following inequality:

𝔼\displaystyle\ \mathbb{E} [𝒱R1𝓖x,i12𝓕i1]\displaystyle\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{G}}}_{x,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
\displaystyle\leq 𝔼[𝒱R1[col{𝒔k,i1x}k=1K+𝒢x,i1]2𝓕i1]\displaystyle\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}[\mbox{\rm col}\{\bm{s}^{x}_{k,i-1}\}_{k=1}^{K}+\mathcal{G}_{x,i-1}]\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
\displaystyle\leq 𝔼[𝒱R1𝒢x,i12𝓕i1]+2𝒱R1𝔼[𝒢x,i1,col{𝒔k,i1x}k=1K\displaystyle\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\mathcal{G}_{x,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}+2\mathcal{V}^{\top}_{R_{1}}\mathbb{E}\Big{[}\langle\mathcal{G}_{x,i-1},\mbox{\rm col}\{\bm{s}^{x}_{k,i-1}\}_{k=1}^{K}\rangle
𝓕i1]+𝔼[𝒱R1col{𝒔k,i1x}k=1K2𝓕i1]\displaystyle\mid\bm{\mathcal{F}}_{i-1}\Big{]}+\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\{\bm{s}^{x}_{k,i-1}\}_{k=1}^{K}\|^{2}\mid\bm{\mathcal{F}}_{i-1}]
\displaystyle\leq 𝒱R1𝒢x,i12+10𝒱R12k=1Kσk2\displaystyle\ \|\mathcal{V}^{\top}_{R_{1}}\mathcal{G}_{x,i-1}\|^{2}+10\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\sum_{k=1}^{K}\sigma^{2}_{k} (64)

where 𝒔k,ix\bm{s}^{x}_{k,i} denotes the gradient noise

𝒔k,ix\displaystyle\bm{s}^{x}_{k,i} =2[xQk(𝒙k,i,𝒚k,i;𝝃k,ix)xJk(𝒙k,i,𝒚k,i)]\displaystyle=2[\nabla_{x}Q_{k}(\bm{x}_{k,i},\bm{y}_{k,i};\bm{\xi}^{x}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})] (65)
+[xQk(𝒙k,i1,𝒚k,i1;𝝃k,ix)xJk(𝒙k,i1,𝒚k,i1)]\displaystyle\ +[\nabla_{x}Q_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1};\bm{\xi}^{x}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})]

Choosing μx<1\mu_{x}<1, we have μx2<μx\mu^{2}_{x}<\mu_{x} and

𝔼\displaystyle\mathbb{E} [𝒱R1𝓧i2𝓕i1]\displaystyle\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}]
\displaystyle\leq (1+μx)𝒥γ12𝒱R1𝓧i12+2μx𝒥γ12𝒱R1𝒢x,i12\displaystyle(1+\mu_{x})\ {\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}+2\mu_{x}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}
+10𝒥γ12𝒱R12σ2μx2\displaystyle+10\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\sigma^{2}\mu^{2}_{x} (66)

For the term 𝒱R1𝒢x,i12\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}, we can bound it as follows:

𝒱R1𝒢x,i12\displaystyle\|\mathcal{V}^{\top}_{R_{1}}{{\mathcal{G}}}_{x,i-1}\|^{2}\hskip 200.0003pt
=(a)\displaystyle\overset{(a)}{=} 𝒱R1col{2xJk(𝒙k,i1,𝒚k,i1)xJk(𝒙k,i2,𝒚k,i2)\displaystyle\ \Big{\|}\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})-\nabla_{x}J_{k}(\bm{x}_{k,i-2},\bm{y}_{k,i-2})
}k=1K𝒱R1col{2xJk(𝒙c,i1,𝒚c,i1)xJk(𝒙c,i2,\displaystyle\Big{\}}_{k=1}^{K}-\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}(\bm{x}_{c,i-1},\bm{y}_{c,i-1})-\nabla_{x}J_{k}(\bm{x}_{c,i-2},
𝒚c,i2)}k=1K+𝒱R1col{2xJk(𝒙c,i1,𝒚c,i1)xJk\displaystyle\bm{y}_{c,i-2})\Big{\}}_{k=1}^{K}+\mathcal{V}^{\top}_{R_{1}}\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}(\bm{x}_{c,i-1},\bm{y}_{c,i-1})-\nabla_{x}J_{k}
(𝒙c,i2,𝒚c,i2)}k=1K𝒱R1(𝟙pIM1)col{2xJk\displaystyle(\bm{x}_{c,i-2},\bm{y}_{c,i-2})\Big{\}}_{k=1}^{K}-\mathcal{V}^{\top}_{R_{1}}({\mathbbm{1}p^{\top}}\otimes I_{M_{1}})\mbox{\rm col}\Big{\{}2\nabla_{x}J_{k}
(𝒙c,i1,𝒚c,i1)xJk(𝒙c,i2,𝒚c,i2)}k=1K2\displaystyle(\bm{x}_{c,i-1},\bm{y}_{c,i-1})-\nabla_{x}J_{k}(\bm{x}_{c,i-2},\bm{y}_{c,i-2})\Big{\}}_{k=1}^{K}\Big{\|}^{2}
(b)\displaystyle\overset{(b)}{\leq} 32𝒱R12Lf2[𝓧i1𝓧c,i12+𝓨i1𝓨c,i12]\displaystyle 32\|\mathcal{V}^{\top}_{R_{1}}\|^{2}L^{2}_{f}\Big{[}\|\bm{\mathcal{X}}_{i-1}-\bm{\mathcal{X}}_{c,i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i-1}-\bm{\mathcal{Y}}_{c,i-1}\|^{2}\Big{]}
+8𝒱R12Lf2[𝓧i2𝓧c,i22+𝓨i2𝓨c,i22]\displaystyle+8\|\mathcal{V}^{\top}_{R_{1}}\|^{2}L^{2}_{f}\Big{[}\|\bm{\mathcal{X}}_{i-2}-\bm{\mathcal{X}}_{c,i-2}\|^{2}+\|\bm{\mathcal{Y}}_{i-2}-\bm{\mathcal{Y}}_{c,i-2}\|^{2}\Big{]}
+20K𝒱R12G2\displaystyle+20K\|\mathcal{V}^{\top}_{R_{1}}\|^{2}G^{2}
(c)\displaystyle\overset{(c)}{\leq} 32𝒱R12𝒱L12Lf2[𝒱R1𝓧i12+𝒱R2𝓨i12]\displaystyle 32\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{V}_{L_{1}}\|^{2}L^{2}_{f}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-1}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-1}\|^{2}\Big{]}
+8𝒱R12𝒱L12Lf2[𝒱R1𝓧i22+𝒱R2𝓨i22]\displaystyle+8\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{V}_{L_{1}}\|^{2}L^{2}_{f}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-2}\|^{2}\Big{]}
+20K𝒱R12G2\displaystyle+20K\|\mathcal{V}^{\top}_{R_{1}}\|^{2}G^{2} (67)

where (a)(a) follows from 𝒱R1(𝟙pIM1)=0\mathcal{V}^{\top}_{R_{1}}({\mathbbm{1}p^{\top}}\otimes I_{M_{1}})=0, (b)(b) follows from Jensen’s inequality, the LfL_{f}-smooth property of the local risk Jk(x,y)J_{k}(x,y) and Assumption 5, (c)(c) follows from (61). For convenience, we denote the network constant as ρ2𝒱R12𝒱L12𝒥γ12=𝒱R22𝒱L22𝒥γ22\rho^{2}\triangleq\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{V}_{L_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}=\|\mathcal{V}^{\top}_{R_{2}}\|^{2}\|\mathcal{V}_{L_{2}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{2}}\|^{2}. Combining the results of (F) and (F) and taking expectation again, we get

𝔼\displaystyle\mathbb{E} 𝒱R1𝓧i2\displaystyle\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2} (68)
\displaystyle\leq (1+μx)𝒥γ12𝔼𝒱R1𝓧i12+64μxρ2Lf2[𝔼𝒱R1𝓧i12\displaystyle(1+\mu_{x}){\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}+64\mu_{x}{\rho}^{2}L^{2}_{f}\Big{[}\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-1}\|^{2}
+𝔼𝒱R2𝓨i12]+16μxρ2L2f[𝔼𝒱R1𝓧i22\displaystyle+\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-1}\|^{2}\Big{]}+16\mu_{x}{\rho}^{2}L^{2}_{f}\Big{[}\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\|^{2}
+𝔼𝒱R1𝓨i22]+10𝒱R12𝒥γ12(4KG2+μxσ2)μx\displaystyle+\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{Y}}_{i-2}\|^{2}\Big{]}+10\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}(4KG^{2}+\mu_{x}\sigma^{2})\mu_{x}

A similar inequality holds for 𝔼𝒱R2𝓨i2\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}. Combining these results together and choosing μxμy\mu_{x}\leq\mu_{y}, we have

𝔼\displaystyle\mathbb{E} [𝒱R1𝓧i2+𝒱R2𝓨i2]\displaystyle[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}]
\displaystyle\leq (1+μy)𝒥γ12𝔼[𝒱R1𝓧i12+𝒱R2𝓨i12]+128μyρ2\displaystyle(1+\mu_{y}){\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i-1}\|^{2}]+128\mu_{y}\rho^{2}
×Lf2𝔼[𝒱R1𝓧i12+𝒱R2𝓨i12]+32μyρ2Lf2\displaystyle\times L^{2}_{f}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-1}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-1}\|^{2}\Big{]}+32\mu_{y}\rho^{2}L^{2}_{f}
×𝔼[𝒱R1𝓧i22+𝒱R2𝓨i22]+20𝒱R12𝒥γ12\displaystyle\times\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-2}\|^{2}\Big{]}+20\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}
×(4KG2+μyσ2)μy\displaystyle\times(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y} (69)

Averaging (F) over iterations, we get

1Ti=0T1𝔼[𝒱R1𝓧i2+𝒱R2𝓨i2]\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}] (70)
\displaystyle\leq (1+μy)𝒥γ12+128μyρ2Lf2Ti=0T1𝔼[𝒱R1𝓧i12\displaystyle\frac{(1+\mu_{y})\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}+128\mu_{y}\rho^{2}L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i-1}\|^{2}
+𝒱R2𝓨i12]+32μyρ2Lf2Ti=0T1𝔼[𝒱R1𝓧i22\displaystyle+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i-1}\|^{2}\Big{]}+\frac{32\mu_{y}\rho^{2}L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i-2}\|^{2}
+𝒱R2𝓨i22]+20𝒱R12𝒥γ12(4KG2+μyρ2)μy\displaystyle+\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i-2}\|^{2}\Big{]}+20\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}(4KG^{2}+\mu_{y}\rho^{2})\mu_{y}
(a)\displaystyle\overset{(a)}{\leq} (1+μy)𝒥γ12+160μyρ2Lf2Ti=0T1𝔼[𝒱R1𝓧i2\displaystyle\frac{(1+\mu_{y})\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}+160\mu_{y}\rho^{2}L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{i}\|^{2}
+𝒱R2𝓨i2]+20𝒱R12𝒥γ12(4KG2+μyσ2)μy\displaystyle+\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{i}\|^{2}\Big{]}+20\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}

where in (a)(a) we set the initial iterates as 𝒙k,1=𝒙k,2=𝒚k,1=𝒙k,2=0\bm{x}_{k,-1}=\bm{x}_{k,-2}=\bm{y}_{k,-1}=\bm{x}_{k,-2}=0 and add terms 𝔼𝒱R1𝓧T22+𝔼𝒱R2𝓨T22\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{T-2}\|^{2}+\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{T-2}\|^{2} and 𝔼𝒱R1𝓧T12+𝔼𝒱R2𝓨T12\mathbb{E}\|\mathcal{V}^{\top}_{R_{1}}\bm{\mathcal{X}}_{T-1}\|^{2}+\mathbb{E}\|\mathcal{V}^{\top}_{R_{2}}\bm{\mathcal{Y}}_{T-1}\|^{2} into the first inequality to group the summation terms. Since 𝒥γ1<1\|\mathcal{J}^{\top}_{\gamma_{1}}\|<1, we choose the step sizes μy\mu_{y} such that

1((1+μy)𝒥γ12+160μyρ2Lf2)1𝒥γ122\displaystyle 1-\Big{(}(1+\mu_{y})\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}+160\mu_{y}\rho^{2}L^{2}_{f}\Big{)}\geq\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{2} (71)
μy1𝒥γ122(𝒥γ12+160ρ2Lf2)\displaystyle\Longrightarrow\mu_{y}\leq\frac{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}{2(\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}+160\rho^{2}L^{2}_{f})}

Solving inequality (70) gives

1Ti=0T1𝔼[𝒱R1𝓧i2+𝒱R2𝓨i2]\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\|\mathcal{V}^{\top}_{R_{1}}{\bm{\mathcal{X}}}_{i}\|^{2}+\|\mathcal{V}^{\top}_{R_{2}}{\bm{\mathcal{Y}}}_{i}\|^{2}] (72)
40𝒱R12𝒥γ12(4KG2+μyσ2)μy1𝒥γ12\displaystyle\leq\frac{40\|\mathcal{V}^{\top}_{R_{1}}\|^{2}\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}

Combining (72) with (F), we get the final bound as follows:

1Ti=0T1𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2]\displaystyle\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}[\|{\bm{\mathcal{X}}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|{\bm{\mathcal{Y}}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}] (73)
40ρ2(4KG2+μyσ2)μy1𝒥γ12\displaystyle\leq\frac{40\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}

Appendix G Proof of Lemma 4

Note that 𝓧c,i=𝓧c,i1μxcol{𝒈x,c,i1}k=1K{\bm{\mathcal{X}}}_{c,i}={\bm{\mathcal{X}}}_{c,i-1}-\mu_{x}\operatorname{col}\{{\bm{g}}_{x,c,i-1}\}_{k=1}^{K}, we insert this expression of 𝓧c,i{\bm{\mathcal{X}}}_{c,i} into 𝓧i𝓧i12\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2} and deduce that

𝓧i𝓧i12\displaystyle\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2} (74)
=𝓧i𝓧c,i+𝓧c,i𝓧i12\displaystyle=\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}+{\bm{\mathcal{X}}}_{c,i}-\bm{\mathcal{X}}_{i-1}\|^{2}
=𝓧i𝓧c,i+𝓧c,i1μxcol{𝒈x,c,i1}k=1K𝓧i12\displaystyle=\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}+{\bm{\mathcal{X}}}_{c,i-1}-\mu_{x}\operatorname{col}\{{\bm{g}}_{x,c,i-1}\}_{k=1}^{K}-\bm{\mathcal{X}}_{i-1}\|^{2}
(a)3𝓧i𝓧c,i2+3Kμx2𝒈x,c,i12+3𝓧i1𝓧c,i12\displaystyle\overset{(a)}{\leq}3\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\|^{2}+3K\mu^{2}_{x}\|{\bm{g}}_{x,c,i-1}\|^{2}+3\|\bm{\mathcal{X}}_{i-1}-\bm{\mathcal{X}}_{c,i-1}\|^{2}\hskip 170.00026pt

where (a)(a) follows from Jensen’s inequality. The term 𝓨i𝓨i12\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2} can be bounded similarly, therefore

𝓧i𝓧i12+𝓨i𝓨i12\displaystyle\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2} (75)
3[𝓧i𝓧c,i2+𝓨i𝓨c,i2]+3[𝓧i1𝓧c,i12\displaystyle\leq 3\Big{[}\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\|^{2}+\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\|^{2}\Big{]}+3\Big{[}\|\bm{\mathcal{X}}_{i-1}-{\bm{\mathcal{X}}}_{c,i-1}\|^{2}
+𝓨i1𝓨c,i12]+3K[μx2𝒈x,c,i12+μy2𝒈y,c,i12]\displaystyle+\|\bm{\mathcal{Y}}_{i-1}-{\bm{\mathcal{Y}}}_{c,i-1}\|^{2}\Big{]}+3K\Big{[}\mu_{x}^{2}\|\bm{g}_{x,c,i-1}\|^{2}+\mu_{y}^{2}\|\bm{g}_{y,c,i-1}\|^{2}\Big{]}

Taking the conditional expectation of (75) over the distribution of random samples {𝝃k,ix,𝝃k,iy}k=1K\{\bm{\xi}^{x}_{k,i},\bm{\xi}^{y}_{k,i}\}_{k=1}^{K}, we get

𝔼\displaystyle\mathbb{E} [𝓧i𝓧i12+𝓨i𝓨i12𝓕i1]\displaystyle\Big{[}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]} (76)
\displaystyle\leq 3𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2𝓕i1]\displaystyle\ 3\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\|^{2}+\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
+3𝔼[𝓧i1𝓧c,i12+𝓨i1𝓨c,i12𝓕i1]\displaystyle+3\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i-1}-{\bm{\mathcal{X}}}_{c,i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i-1}-{\bm{\mathcal{Y}}}_{c,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
+3K𝔼[μx2𝒈x,c,i12+μy2𝒈y,c,i12𝓕i1]\displaystyle+3K\mathbb{E}\Big{[}\mu_{x}^{2}\|\bm{g}_{x,c,i-1}\|^{2}+\mu_{y}^{2}\|\bm{g}_{y,c,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}

For the term 𝔼[𝒈x,c,i12|𝓕i1]\mathbb{E}[\|{\bm{g}}_{x,c,i-1}\|^{2}|\bm{\mathcal{F}}_{i-1}], we can bound it as follows:

𝔼[𝒈x,c,i12𝓕i1]\displaystyle\ \mathbb{E}[\|{\bm{g}}_{x,c,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}] (77)
=𝔼[k=1Kpk(𝒔k,i1x+gx,k,i1)2𝓕i1]\displaystyle=\ \mathbb{E}[\|\sum_{k=1}^{K}p_{k}(\bm{s}^{x}_{k,i-1}+{g}_{x,k,i-1})\|^{2}\mid\bm{\mathcal{F}}_{i-1}]\hskip 100.00015pt
=(a)𝔼[k=1Kpk𝒔k,i1x2+k=1Kpkgx,k,i12𝓕i1]\displaystyle\overset{(a)}{=}\ \mathbb{E}\Big{[}\|\sum_{k=1}^{K}p_{k}\bm{s}^{x}_{k,i-1}\|^{2}+\|\sum_{k=1}^{K}p_{k}{g}_{x,k,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
(b) 10k=1Kpk2σk2+gx,c,i1210σ2max𝑘pk2+gx,c,i12\displaystyle\overset{(b)}{\leq}\ 10\sum_{k=1}^{K}p^{2}_{k}\sigma^{2}_{k}+\|{g}_{x,c,i-1}\|^{2}\leq 10\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+\|{g}_{x,c,i-1}\|^{2}

where (a)(a) follows from Assumption 4 and the fact the local true gradient gx,k,i1{g}_{x,k,i-1} is independent of the gradient noise 𝒔k,i1x\bm{s}^{x}_{k,i-1}, (b)(b) follows from Assumption 4. A similar argument holds for 𝔼[𝒈y,c,i12𝓕i1]\mathbb{E}[\|{\bm{g}}_{y,c,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}]. Substituting these results into (76) and choosing μxμy\mu_{x}\leq\mu_{y}, we obtain

𝔼\displaystyle\mathbb{E} [𝓧i𝓧i12+𝓨i𝓨i12𝓕i1]\displaystyle\Big{[}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
\displaystyle\leq 3𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2𝓕i1]\displaystyle 3\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\|^{2}+\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]} (78)
+3𝔼[𝓧i1𝓧c,i12+𝓨i1𝓨c,i12𝓕i1]\displaystyle+3\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i-1}-{\bm{\mathcal{X}}}_{c,i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i-1}-{\bm{\mathcal{Y}}}_{c,i-1}\|^{2}\mid\bm{\mathcal{F}}_{i-1}\Big{]}
+3K[μx2gx,c,i12+μy2gy,c,i12]+60Kσ2max𝑘pk2μy2\displaystyle+3K\Big{[}\mu^{2}_{x}\|{g}_{x,c,i-1}\|^{2}+\mu^{2}_{y}\|{g}_{y,c,i-1}\|^{2}\Big{]}+60K\sigma^{2}\underset{k}{\max}\ p^{2}_{k}\mu^{2}_{y}

Taking expectations over (G) again and averaging the inequality over iterations, we get

1T\displaystyle\frac{1}{T} i=0T1𝔼[𝓧i𝓧i12+𝓨i𝓨i12]\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}\Big{]}
(a)\displaystyle\overset{(a)}{\leq} 6Ti=0T1𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2]+3KTi=0T1\displaystyle\frac{6}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-{\bm{\mathcal{X}}}_{c,i}\|^{2}+\|\bm{\mathcal{Y}}_{i}-{\bm{\mathcal{Y}}}_{c,i}\|^{2}\Big{]}+\frac{3K}{T}\sum_{i=0}^{T-1}
[μx2𝔼gx,c,i12+μy2𝔼gy,c,i12]+60Kσ2max𝑘pk2μy2\displaystyle\Big{[}\mu_{x}^{2}\mathbb{E}\|{{g}}_{x,c,i-1}\|^{2}+\mu_{y}^{2}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}\Big{]}+60K\sigma^{2}\underset{k}{\max}p^{2}_{k}\mu^{2}_{y}
(b)\displaystyle\overset{(b)}{\leq} 240ρ2(4KG2+μyσ2)μy1𝒥γ12+3KTi=0T1[μx2𝔼gx,c,i12\displaystyle\frac{240\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}+\frac{3K}{T}\sum_{i=0}^{T-1}\Big{[}\mu_{x}^{2}\mathbb{E}\|{{g}}_{x,c,i-1}\|^{2}
+μy2𝔼gy,c,i12]+60Kσ2max𝑘p2kμ2y\displaystyle+\mu_{y}^{2}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}\Big{]}+60K\sigma^{2}\underset{k}{\max}\ p^{2}_{k}\mu^{2}_{y} (79)

where (a)(a) is obtained by adding the positive terms 𝔼𝓧T1𝓧c,T12+𝔼𝓨T1𝓨c,T12\mathbb{E}\|\bm{\mathcal{X}}_{T-1}-{\bm{\mathcal{X}}}_{c,T-1}\|^{2}+\mathbb{E}\|\bm{\mathcal{Y}}_{T-1}-{\bm{\mathcal{Y}}}_{c,T-1}\|^{2} into the RHS of (G) and grouping the summation terms, (b)(b) follows from Lemma 3.

Appendix H Proof of Lemma 5

Inserting the expression for gx,c,ig_{x,c,i} into the following deviation term, we get

𝔼\displaystyle\mathbb{E} xJ(𝒙c,i,𝒚c,i)gx,c,i2\displaystyle\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}
=\displaystyle= 𝔼xJ(𝒙c,i,𝒚c,i)k=1Kpk[2xJk(𝒙k,i,𝒚k,i)\displaystyle\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-\sum_{k=1}^{K}p_{k}[2\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})
xJk(𝒙k,i1,𝒚k,i1)]2\displaystyle-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})]\|^{2}\hskip 100.00015pt
(a)\displaystyle\overset{(a)}{\leq} 2𝔼xJ(𝒙c,i,𝒚c,i)k=1KpkxJk(𝒙k,i,𝒚k,i)2\displaystyle 2\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-\sum_{k=1}^{K}p_{k}\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})\Big{\|}^{2}
+2𝔼k=1Kpk[xJk(𝒙k,i,𝒚k,i)xJk(𝒙k,i1,𝒚k,i1)]2\displaystyle+2\mathbb{E}\|\sum_{k=1}^{K}p_{k}[\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})]\|^{2}
(b)\displaystyle\overset{(b)}{\leq} 2k=1Kpk𝔼xJk(𝒙c,i,𝒚c,i)xJk(𝒙k,i,𝒚k,i)2\displaystyle\ 2\sum_{k=1}^{K}p_{k}\mathbb{E}\|\nabla_{x}J_{k}(\bm{x}_{c,i},\bm{y}_{c,i})-\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})\|^{2}
+2k=1Kpk𝔼xJk(𝒙k,i,𝒚k,i)xJk(𝒙k,i1,𝒚k,i1)2\displaystyle+2\sum_{k=1}^{K}p_{k}\mathbb{E}\|\nabla_{x}J_{k}(\bm{x}_{k,i},\bm{y}_{k,i})-\nabla_{x}J_{k}(\bm{x}_{k,i-1},\bm{y}_{k,i-1})\|^{2}
(c)\displaystyle\overset{(c)}{\leq} 4Lf2k=1Kpk𝔼[𝒙c,i𝒙k,i2+𝒚c,i𝒚k,i2]\displaystyle{4L^{2}_{f}}\sum_{k=1}^{K}p_{k}\mathbb{E}\Big{[}\|\bm{x}_{c,i}-\bm{x}_{k,i}\|^{2}+\|\bm{y}_{c,i}-\bm{y}_{k,i}\|^{2}\Big{]}
+4Lf2k=1Kpk𝔼[𝒙k,i𝒙k,i12+𝒚k,i𝒚k,i12]\displaystyle+{4L^{2}_{f}}\sum_{k=1}^{K}p_{k}\mathbb{E}\Big{[}\|\bm{x}_{k,i}-\bm{x}_{k,i-1}\|^{2}+\|\bm{y}_{k,i}-\bm{y}_{k,i-1}\|^{2}\Big{]}
(d)\displaystyle\overset{(d)}{\leq} 4Lf2𝔼[𝓧i𝓧c,i2+𝓨i𝓨c,i2]\displaystyle 4L^{2}_{f}\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}\Big{]}
+4Lf2𝔼[𝓧i𝓧i12+𝓨i𝓨i12]\displaystyle+4L^{2}_{f}\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}\Big{]}\hskip 20.00003pt (80)

where (a)(a) and (b)(b) follow from Jensen’s inequality, (c)(c) follows from the LfL_{f}-smooth property of the local risk function Jk(x,y)J_{k}(x,y), (d)(d) is due to pk<1p_{k}<1. Averaging (80) over iterations, we get

1T\displaystyle\ \frac{1}{T} i=0T1𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2\displaystyle\sum_{i=0}^{T-1}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2} (81)
\displaystyle\leq 4Lf2Ti=0T1(𝔼𝓧i𝓧c,i2+𝔼𝓨i𝓨c,i2)\displaystyle\frac{4L^{2}_{f}}{T}\sum_{i=0}^{T-1}\Big{(}\mathbb{E}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{c,i}\|^{2}+\mathbb{E}\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{c,i}\|^{2}\Big{)}
+4Lf2Ti=0T1𝔼[𝓧i𝓧i12+𝓨i𝓨i12]\displaystyle+\frac{4L^{2}_{f}}{T}\sum_{i=0}^{T-1}\mathbb{E}\Big{[}\|\bm{\mathcal{X}}_{i}-\bm{\mathcal{X}}_{i-1}\|^{2}+\|\bm{\mathcal{Y}}_{i}-\bm{\mathcal{Y}}_{i-1}\|^{2}\Big{]}
(a)\displaystyle\overset{(a)}{\leq} 1120Lf2ρ2(4KG2+μyσ2)μy1𝒥γ12+240Kσ2Lf2μy2max𝑘pk2\displaystyle\frac{1120L^{2}_{f}\rho^{2}(4KG^{2}+\mu_{y}\sigma^{2})\mu_{y}}{1-\|\mathcal{J}^{\top}_{\gamma_{1}}\|^{2}}+240K\sigma^{2}L^{2}_{f}\mu^{2}_{y}\underset{k}{\max}\ p^{2}_{k}
+12KLf2Ti=0T1[μx2𝔼gx,c,i12+μy2𝔼gy,c,i12]\displaystyle+\frac{12KL^{2}_{f}}{T}\sum_{i=0}^{T-1}\Big{[}\mu_{x}^{2}\mathbb{E}\|{{g}}_{x,c,i-1}\|^{2}+\mu_{y}^{2}\mathbb{E}\|{{g}}_{y,c,i-1}\|^{2}\Big{]}

where (a)(a) follows from Lemmas 34. A similar argument holds for 1Ti=0T1𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2\frac{1}{T}\sum_{i=0}^{T-1}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}.

Appendix I Proof of Lemma 6

For convenience, we denote

dc(i)d(𝒙c,i,𝒚c,i)=P(𝒙c,i)J(𝒙c,i,𝒚c,i){d}_{c}(i)\triangleq{d}(\bm{x}_{c,i},\bm{y}_{c,i})={{P}}(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i}) (82)

with subscript “c” to denote the dual optimality gap evaluated at the network centroid (𝒙c,i,𝒚c,i)(\bm{x}_{c,i},\bm{y}_{c,i}). From Lemma 2, P(x)P(x) is LL-smooth, we have

P(𝒙c,i+1)\displaystyle\ {{P}}(\bm{x}_{c,i+1})
P(𝒙c,i)+P(𝒙c,i),𝒙c,i+1𝒙c,i+L2𝒙c,i+1𝒙c,i2\displaystyle\leq\ {{P}}(\bm{x}_{c,i})+\langle\nabla{{P}}(\bm{x}_{c,i}),\bm{x}_{c,i+1}-\bm{x}_{c,i}\rangle+\frac{L}{2}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}
P(𝒙c,i)μxP(𝒙c,i),𝒈x,c,i+Lμx22𝒈x,c,i2\displaystyle\leq\ {{P}}(\bm{x}_{c,i})-\mu_{x}\langle\nabla{{P}}(\bm{x}_{c,i}),\bm{g}_{x,c,i}\rangle+\frac{L\mu^{2}_{x}}{2}\|\bm{g}_{x,c,i}\|^{2} (83)

Taking conditional expectation of (I) over the distribution of random sample {𝝃k,i+1x}k=1K\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}, we get

𝔼\displaystyle\ \mathbb{E} [P(𝒙c,i+1)𝓕i]\displaystyle[{{P}}(\bm{x}_{c,i+1})\mid\bm{\mathcal{F}}_{i}]
(a)\displaystyle\overset{(a)}{\leq} P(𝒙c,i)μxP(𝒙c,i),gx,c,i+Lμx22𝔼[𝒈x,c,i2𝓕i]\displaystyle{{P}}(\bm{x}_{c,i})-\mu_{x}\langle\nabla{{P}}(\bm{x}_{c,i}),{g}_{x,c,i}\rangle+\frac{L\mu^{2}_{x}}{2}\mathbb{E}[\|\bm{g}_{x,c,i}\|^{2}\mid\bm{\mathcal{F}}_{i}]
\displaystyle\leq\ P(𝒙c,i)μx2P(𝒙c,i)2μx2gx,c,i2\displaystyle{{P}}(\bm{x}_{c,i})-\frac{\mu_{x}}{2}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}\|{g}_{x,c,i}\|^{2}
+μx2P(𝒙c,i)gx,c,i2+Lμx22𝔼[𝒈x,c,i2𝓕i]\displaystyle+\frac{\mu_{x}}{2}\|\nabla{{P}}(\bm{x}_{c,i})-{g}_{x,c,i}\|^{2}+\frac{L\mu^{2}_{x}}{2}\mathbb{E}[\|\bm{g}_{x,c,i}\|^{2}\mid\bm{\mathcal{F}}_{i}]
(b)\displaystyle\overset{(b)}{\leq} P(𝒙c,i)μx2P(𝒙c,i)2μx2gx,c,i2+μx2P(𝒙c,i)\displaystyle{{P}}(\bm{x}_{c,i})-\frac{\mu_{x}}{2}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}\|{g}_{x,c,i}\|^{2}+\frac{\mu_{x}}{2}\|\nabla{{P}}(\bm{x}_{c,i})
gx,c,i2+Lμx22(10σ2max𝑘pk2+gx,c,i2)\displaystyle-{g}_{x,c,i}\|^{2}+\frac{L\mu^{2}_{x}}{2}(10\sigma^{2}\underset{k}{\max}\ p^{2}_{k}+\|{g}_{x,c,i}\|^{2}) (84)

where (a)(a) follows from the fact that 𝒙c,i\bm{x}_{c,i} does not rely on the random samples at iteration i+1i+1, (b)(b) follows from (77). Inserting xJ(𝒙c,i,𝒚c,i)\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i}) into (I) and applying Jensen’s inequality, we get

𝔼\displaystyle\mathbb{E} [P(𝒙c,i+1)𝓕i]\displaystyle[{{P}}(\bm{x}_{c,i+1})\mid\bm{\mathcal{F}}_{i}]
\displaystyle\leq P(𝒙c,i)μx2P(𝒙c,i)2μx2(1Lμx)gx,c,i2\displaystyle{{P}}(\bm{x}_{c,i})-\frac{\mu_{x}}{2}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\|{g}_{x,c,i}\|^{2}
+μxP(𝒙c,i)xJ(𝒙c,i,𝒚c,i)2+5Lμx2σ2max𝑘pk2\displaystyle+\mu_{x}\|\nabla{{P}}(\bm{x}_{c,i})-\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
+μxxJ(𝒙c,i,𝒚c,i)gx,c,i2\displaystyle+\mu_{x}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2} (85)

Taking expectations again over (I)\eqref{startsmooth11inter}, we get

𝔼\displaystyle\ \mathbb{E} [P(𝒙c,i+1)]\displaystyle[{{P}}(\bm{x}_{c,i+1})]
\displaystyle\leq 𝔼[P(𝒙c,i)]μx2𝔼P(𝒙c,i)2μx2(1Lμx)𝔼gx,c,i2\displaystyle\mathbb{E}[{{P}}(\bm{x}_{c,i})]-\frac{\mu_{x}}{2}\mathbb{E}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2}
+μx𝔼P(𝒙c,i)xJ(𝒙c,i,𝒚c,i)2\displaystyle+\mu_{x}\mathbb{E}\|\nabla{{P}}(\bm{x}_{c,i})-\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})\|^{2}
+μx𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2+5Lμx2σ2max𝑘pk2\displaystyle+\mu_{x}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
(a)\displaystyle\overset{(a)}{\leq} 𝔼[P(𝒙c,i)]μx2𝔼P(𝒙c,i)2μx2(1Lμx)𝔼gx,c,i2\displaystyle\mathbb{E}[{{P}}(\bm{x}_{c,i})]-\frac{\mu_{x}}{2}\mathbb{E}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2}
+μxLf2𝔼𝒚o(𝒙c,i)𝒚c,i2+μx𝔼xJ(𝒙c,i,𝒚c,i)\displaystyle+\mu_{x}L^{2}_{f}\mathbb{E}\|\bm{y}^{o}(\bm{x}_{c,i})-\bm{y}_{c,i}\|^{2}+\mu_{x}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})
gx,c,i2+5Lμx2σ2max𝑘pk2\displaystyle-{g}_{x,c,i}\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
(b)\displaystyle\overset{(b)}{\leq} 𝔼[P(𝒙c,i)]μx2𝔼P(𝒙c,i)2μx2(1Lμx)𝔼gx,c,i2\displaystyle\mathbb{E}[{{P}}(\bm{x}_{c,i})]-\frac{\mu_{x}}{2}\mathbb{E}\|\nabla{{P}}(\bm{x}_{c,i})\|^{2}-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2}
+2μxLf2ν𝔼[P(𝒙c,i)J(𝒙c,i,𝒚c,i)]\displaystyle+\frac{2\mu_{x}L^{2}_{f}}{\nu}\mathbb{E}[{{P}}(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i})] (86)
+μx𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2+5Lμx2σ2max𝑘pk2\displaystyle+\mu_{x}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

Here, (a)(a) is due to the LfL_{f}-smoothness property of the risk function, (b)(b) follows from Lemma 1.

Appendix J Proof of Lemma 7

Because J(x,y)-J(x,y) is LfL_{f}-smooth, fixing J(x,y)-J(x,y) at 𝒙c,i+1\bm{x}_{c,i+1}, we have

J(𝒙c,i+1,𝒚c,i+1)\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})
\displaystyle\leq J(𝒙c,i+1,𝒚c,i)yJ(𝒙c,i+1,𝒚c,i),𝒚c,i+1𝒚c,i\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\langle\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i}),\bm{y}_{c,i+1}-\bm{y}_{c,i}\rangle
+Lf2𝒚c,i+1𝒚c,i2\displaystyle+\frac{L_{f}}{2}\|\bm{y}_{c,i+1}-\bm{y}_{c,i}\|^{2}
\displaystyle\leq J(𝒙c,i+1,𝒚c,i)μyyJ(𝒙c,i+1,𝒚c,i),𝒈y,c,i\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\mu_{y}\langle\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i}),\bm{g}_{y,c,i}\rangle
+Lfμy22𝒈y,c,i2\displaystyle+\frac{L_{f}\mu^{2}_{y}}{2}\|\bm{g}_{y,c,i}\|^{2}\hskip 100.00015pt (87)

Taking expectation of (J) over the distribution of {𝝃k,i+1y}k=1K\{\bm{\xi}^{y}_{k,i+1}\}_{k=1}^{K} conditioned on {𝝃k,i+1x}k=1K\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K} and 𝓕i\bm{\mathcal{F}}_{i}, we get

𝔼\displaystyle\mathbb{E} [J(𝒙c,i+1,𝒚c,i+1)𝓕i,{𝝃k,i+1x}k=1K]\displaystyle[-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})\mid\bm{\mathcal{F}}_{i},\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}]
\displaystyle\leq J(𝒙c,i+1,𝒚c,i)μyyJ(𝒙c,i+1,𝒚c,i),gy,c,i\displaystyle\ -J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\mu_{y}\langle\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i}),{g}_{y,c,i}\rangle
+Lfμy22𝔼[𝒈y,c,i2𝓕i,{𝝃k,i+1x}k=1K]\displaystyle+\frac{L_{f}\mu^{2}_{y}}{2}\mathbb{E}\Big{[}\|\bm{g}_{y,c,i}\|^{2}\mid\bm{\mathcal{F}}_{i},\{\bm{\xi}^{x}_{k,i+1}\}_{k=1}^{K}\Big{]}
(a)\displaystyle\overset{(a)}{\leq} J(𝒙c,i+1,𝒚c,i)μy2yJ(𝒙c,i+1,𝒚c,i)2\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\frac{\mu_{y}}{2}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\|^{2}
+μy2yJ(𝒙c,i+1,𝒚c,i)gy,c,i2+Lfμy22gy,c,i2\displaystyle+\frac{\mu_{y}}{2}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+\frac{L_{f}\mu^{2}_{y}}{2}\|{g}_{y,c,i}\|^{2}
+5Lfμy2σ2max𝑘pk2μy2gy,c,i2\displaystyle+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}-\frac{\mu_{y}}{2}\|{g}_{y,c,i}\|^{2}
(b)\displaystyle\overset{(b)}{\leq} J(𝒙c,i+1,𝒚c,i)μy2yJ(𝒙c,i+1,𝒚c,i)2\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\frac{\mu_{y}}{2}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\|^{2}
+μyyJ(𝒙c,i+1,𝒚c,i)yJ(𝒙c,i,𝒚c,i)2\displaystyle+\mu_{y}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})\|^{2}
+μyyJ(𝒙c,i,𝒚c,i)gy,c,i2+Lfμy22gy,c,i2\displaystyle+\mu_{y}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+\frac{L_{f}\mu^{2}_{y}}{2}\|{g}_{y,c,i}\|^{2}
+5Lfμy2σ2max𝑘pk2μy2gy,c,i2\displaystyle+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}-\frac{\mu_{y}}{2}\|{g}_{y,c,i}\|^{2}
(c)\displaystyle\overset{(c)}{\leq} J(𝒙c,i+1,𝒚c,i)μy2yJ(𝒙c,i+1,𝒚c,i)2\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})-\frac{\mu_{y}}{2}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\|^{2} (88)
μy2gy,c,i2+μyLf2𝒙c,i+1𝒙c,i2+Lfμy22gy,c,i2\displaystyle-\frac{\mu_{y}}{2}\|{g}_{y,c,i}\|^{2}+\mu_{y}L^{2}_{f}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}+\frac{L_{f}\mu^{2}_{y}}{2}\|{g}_{y,c,i}\|^{2}
+μyyJ(𝒙c,i,𝒚c,i)gy,c,i2+5Lfμy2σ2max𝑘pk2\displaystyle+\mu_{y}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

where (a)(a) follows from (77), (b)(b) we insert the true gradient information yJ(𝒙c,i,𝒚c,i)\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i}) and apply Jensen’s inequality, (c)(c) is due to the LfL_{f}-smooth property of the risk function. Taking expectation again over (J), we obtain the result (34). The proof of (35) is similar to the above arguments.

Appendix K Proof of Lemma 8

Adding 𝔼[P(𝒙c,i+1)]\mathbb{E}[P(\bm{x}_{c,i+1})] on both sides of (34):

𝔼\displaystyle\ \mathbb{E} [P(𝒙c,i+1)J(𝒙c,i+1,𝒚c,i+1)]\displaystyle[P(\bm{x}_{c,i+1})-J(\bm{x}_{c,i+1},\bm{y}_{c,i+1})]
\displaystyle\leq 𝔼[P(𝒙c,i+1)J(𝒙c,i+1,𝒚c,i)]μy2𝔼yJ(𝒙c,i+1,𝒚c,i)2\displaystyle\ \mathbb{E}[P(\bm{x}_{c,i+1})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]-\frac{\mu_{y}}{2}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i+1},\bm{y}_{c,i})\|^{2}
μy2(1Lfμy)𝔼gy,c,i2+5Lfμy2σ2max𝑘pk2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\|{g}_{y,c,i}\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
+μyLf2𝔼𝒙c,i+1𝒙c,i2+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2\displaystyle+\mu_{y}L^{2}_{f}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}
(a)\displaystyle\overset{(a)}{\leq} (1νμy)𝔼[P(𝒙c,i+1)J(𝒙c,i+1,𝒚c,i)]\displaystyle\ (1-\nu\mu_{y})\mathbb{E}[P(\bm{x}_{c,i+1})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]
+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2+5Lfμy2σ2max𝑘pk2\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
μy2(1Lfμy)𝔼gy,c,i2+μyLf2𝔼𝒙c,i+1𝒙c,i2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\|{g}_{y,c,i}\|^{2}+\mu_{y}L^{2}_{f}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}
(b)\displaystyle\overset{(b)}{\leq} (1νμy)𝔼[P(𝒙c,i)J(𝒙c,i,𝒚c,i)+J(𝒙c,i,𝒚c,i)\displaystyle(1-\nu\mu_{y})\mathbb{E}[P(\bm{x}_{c,i})-J(\bm{x}_{c,i},\bm{y}_{c,i})+J(\bm{x}_{c,i},\bm{y}_{c,i}) (89)
J(𝒙c,i+1,𝒚c,i)+P(𝒙c,i+1)P(𝒙c,i)]\displaystyle-J(\bm{x}_{c,i+1},\bm{y}_{c,i})+P(\bm{x}_{c,i+1})-P(\bm{x}_{c,i})]
μy2(1Lfμy)𝔼gy,c,i2+μyLf2𝔼𝒙c,i+1𝒙c,i2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\|{g}_{y,c,i}\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}
+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2+5Lfμy2σ2max𝑘pk2\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

where (a)(a) follows from Assumption 1, (b)(b) is obtained by adding and subtracting the terms 𝔼[P(𝒙c,i)]\mathbb{E}[P(\bm{x}_{c,i})] and 𝔼[J(𝒙c,i,𝒚c,i)]\mathbb{E}[J(\bm{x}_{c,i},\bm{y}_{c,i})] in (a)(a). By the definition of dc(i){d}_{c}(i), we can rewrite above inequality as:

𝔼\displaystyle\mathbb{E} [dc(i+1)]\displaystyle[{d}_{c}(i+1)]
\displaystyle\leq (1νμy){𝔼[dc(i)]+𝔼[J(𝒙c,i,𝒚c,i)J(𝒙c,i+1,𝒚c,i)]\displaystyle(1-\nu\mu_{y})\Big{\{}\mathbb{E}[{d}_{c}(i)]+\mathbb{E}[J(\bm{x}_{c,i},\bm{y}_{c,i})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]
+𝔼[P(𝒙c,i+1)P(𝒙c,i)]}+5Lfμ2yσ2max𝑘p2k\displaystyle+\mathbb{E}[P(\bm{x}_{c,i+1})-P(\bm{x}_{c,i})]\Big{\}}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}
μy2(1Lfμy)𝔼gy,c,i2+μyLf2𝔼𝒙c,i+1𝒙c,i2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\|{g}_{y,c,i}\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}
+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2} (90)

By Lemma 7, we can bound 𝔼[J(𝒙c,i,𝒚c,i)J(𝒙c,i+1,𝒚c,i)]\mathbb{E}[J(\bm{x}_{c,i},\bm{y}_{c,i})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})] as:

𝔼\displaystyle\mathbb{E} [J(𝒙c,i,𝒚c,i)J(𝒙c,i+1,𝒚c,i)]\displaystyle[J(\bm{x}_{c,i},\bm{y}_{c,i})-J(\bm{x}_{c,i+1},\bm{y}_{c,i})]
\displaystyle\leq μx2𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2+μx(3+Lfμx)2𝔼gx,c,i2\displaystyle\frac{\mu_{x}}{2}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}+\frac{\mu_{x}(3+L_{f}\mu_{x})}{2}\mathbb{E}\|{g}_{x,c,i}\|^{2}
+5Lfμx2σ2max𝑘pk2\displaystyle+5L_{f}\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k} (91)

By Lemma 6, we can bound 𝔼[P(𝒙c,i+1)P(𝒙c,i)]\mathbb{E}[{{P}}(\bm{x}_{c,i+1})-{{P}}(\bm{x}_{c,i})] as

𝔼\displaystyle\mathbb{E} [P(𝒙c,i+1)P(𝒙c,i)]\displaystyle[{{P}}(\bm{x}_{c,i+1})-{{P}}(\bm{x}_{c,i})]
\displaystyle\leq 2μxLf2νdc(i)μx2(1Lμx)𝔼gx,c,i2\displaystyle\frac{2\mu_{x}L^{2}_{f}}{\nu}{d}_{c}(i)-\frac{\mu_{x}}{2}(1-L\mu_{x})\mathbb{E}\|{g}_{x,c,i}\|^{2} (92)
+μx𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2+5Lμx2σ2max𝑘pk2\displaystyle+\mu_{x}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}+5L\mu^{2}_{x}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

Combining the results of (K) and (K) into (K), we get

𝔼\displaystyle\mathbb{E} [dc(i+1)]\displaystyle[{d}_{c}(i+1)] (93)
\displaystyle\leq (1νμy)(1+2μxLf2ν)𝔼[dc(i)]+3μx(1νμy)2\displaystyle(1-\nu\mu_{y})(1+\frac{2\mu_{x}L_{f}^{2}}{\nu})\mathbb{E}[{d}_{c}(i)]+\frac{3\mu_{x}(1-\nu\mu_{y})}{2}
×𝔼xJ(𝒙c,i,𝒚c,i)gx,c,i2+(1νμy)[μx\displaystyle\times\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{x,c,i}\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}
+(L+Lf)μx22]𝔼gx,c,i2μy2(1Lfμy)𝔼gy,c,i2\displaystyle+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\|{g}_{x,c,i}\|^{2}-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\mathbb{E}\|{g}_{y,c,i}\|^{2}
+μyLf2𝔼𝒙c,i+1𝒙c,i2+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2\displaystyle+{\mu_{y}}L_{f}^{2}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}
+5(1νμy)(L+Lf)μx2σ2max𝑘pk2+5Lfμy2σ2max𝑘pk2\displaystyle+5{(1-\nu\mu_{y})(L+L_{f})\mu^{2}_{x}\sigma^{2}}\underset{k}{\max}\ p^{2}_{k}+5L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

We further choose 2μxLf2ννμy2,μy1ν,μx2Lfμy2L+Lf\frac{2\mu_{x}L_{f}^{2}}{\nu}\leq\frac{\nu\mu_{y}}{2},\mu_{y}\leq\frac{1}{\nu},\mu^{2}_{x}\leq\frac{L_{f}\mu^{2}_{y}}{L+L_{f}}. We have (1νμy)(1+2μxLf2ν)1νμy2(1-\nu\mu_{y})(1+\frac{2\mu_{x}L_{f}^{2}}{\nu})\leq 1-\frac{\nu\mu_{y}}{2}, and inequality (93) becomes

𝔼\displaystyle\mathbb{E} [dc(i+1)]\displaystyle[{d}_{c}(i+1)] (94)
\displaystyle\leq (1νμy2)𝔼[dc(i)]+3μx(1νμy)2𝔼xJ(𝒙c,i,𝒚c,i)\displaystyle(1-\frac{\nu\mu_{y}}{2})\mathbb{E}[{d}_{c}(i)]+\frac{3\mu_{x}(1-\nu\mu_{y})}{2}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})
gx,c,i2+(1νμy)[μx+(L+Lf)μx22]𝔼gx,c,i2\displaystyle-{g}_{x,c,i}\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\|{g}_{x,c,i}\|^{2}
μy2(1Lfμy)gy,c,i2+μyLf2𝔼𝒙c,i+1𝒙c,i2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\|{g}_{y,c,i}\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}
+μy𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2+10Lfμy2σ2max𝑘pk2\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}+10L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}

Choosing μy<1Lf\mu_{y}<\frac{1}{L_{f}} and iterating (94) from ii to 0, we get

𝔼\displaystyle\mathbb{E} [dc(i)]\displaystyle[{d}_{c}(i)]
\displaystyle\leq (1νμy2)idc(0)+j=0i1(1νμy2)i1j{3μx(1νμy)2\displaystyle(1-\frac{\nu\mu_{y}}{2})^{i}{d}_{c}(0)+\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-1-j}\Bigg{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{2}
×𝔼xJ(𝒙c,j,𝒚c,j)gx,c,j2+(1νμy)[μx\displaystyle\times\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{x,c,j}\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}
+(L+Lf)μx22]𝔼gx,c,j2+μyLf2𝔼𝒙c,j+1𝒙c,j2\displaystyle+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\|{g}_{x,c,j}\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\|\bm{x}_{c,j+1}-\bm{x}_{c,j}\|^{2}
+μy𝔼yJ(𝒙c,j,𝒚c,j)gy,c,j2+10Lfμy2σ2max𝑘pk2}\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{y,c,j}\|^{2}+10L_{f}\mu^{2}_{y}\sigma^{2}\underset{k}{\max}\ p^{2}_{k}\Bigg{\}}
μy2(1Lfμy)j=0i1(1νμy2)ij1𝔼gy,c,j2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j-1}\mathbb{E}\|{g}_{y,c,j}\|^{2}
(a)\displaystyle\overset{(a)}{\leq} (1νμy2)idc(0)+j=0i1(1νμy2)i1j{3μx(1νμy)2\displaystyle(1-\frac{\nu\mu_{y}}{2})^{i}{d}_{c}(0)+\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-1-j}\Bigg{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{2}
×𝔼xJ(𝒙c,j,𝒚c,j)gx,c,j2+(1νμy)[μx\displaystyle\times\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{x,c,j}\|^{2}+(1-\nu\mu_{y})\Big{[}\mu_{x}
+(L+Lf)μx22]𝔼gx,c,j2+μyLf2𝔼𝒙c,j+1𝒙c,j2\displaystyle+\frac{(L+L_{f})\mu^{2}_{x}}{2}\Big{]}\mathbb{E}\|{g}_{x,c,j}\|^{2}+{\mu_{y}}L_{f}^{2}\mathbb{E}\|\bm{x}_{c,j+1}-\bm{x}_{c,j}\|^{2}
+μy𝔼yJ(𝒙c,j,𝒚c,j)gy,c,j2}+20Lfμyσ2νmax𝑘p2k\displaystyle+\mu_{y}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,j},\bm{y}_{c,j})-{g}_{y,c,j}\|^{2}\Bigg{\}}+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k}
μy2(1Lfμy)j=0i1(1νμy2)ij𝔼gy,c,j2\displaystyle-\frac{\mu_{y}}{2}(1-L_{f}\mu_{y})\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\|{g}_{y,c,j}\|^{2} (95)

where (a)(a) is due to (1νμy2)ij1gy,c,j2(1νμy2)ijgy,c,j2(1-\frac{\nu\mu_{y}}{2})^{i-j-1}\|{g}_{y,c,j}\|^{2}\geq(1-\frac{\nu\mu_{y}}{2})^{i-j}\|{g}_{y,c,j}\|^{2} for any ji1j\leq i-1. Averaging (K) over iterations, we get

1T\displaystyle\frac{1}{T} i=0T1𝔼[dc(i)]\displaystyle\sum_{i=0}^{T-1}\mathbb{E}[{d}_{c}(i)]
(a)\displaystyle\overset{(a)}{\leq} 2νμyTdc(0)+1Ti=0T1{3μx(1νμy)νμy𝔼xJ(𝒙c,i,𝒚c,i)\displaystyle\frac{2}{\nu\mu_{y}T}{d}_{c}(0)+\frac{1}{T}\sum_{i=0}^{T-1}\Bigg{\{}\frac{3\mu_{x}(1-\nu\mu_{y})}{\nu\mu_{y}}\mathbb{E}\|\nabla_{x}J(\bm{x}_{c,i},\bm{y}_{c,i})
gx,c,i2+(1νμy)[2μxνμy+(L+Lf)μx2νμy]𝔼gx,c,i2+\displaystyle-{g}_{x,c,i}\|^{2}+(1-\nu\mu_{y})\Big{[}\frac{2\mu_{x}}{\nu\mu_{y}}+\frac{(L+L_{f})\mu^{2}_{x}}{\nu\mu_{y}}\Big{]}\mathbb{E}\|{g}_{x,c,i}\|^{2}+
2Lf2ν𝔼𝒙c,i+1𝒙c,i2+2ν𝔼yJ(𝒙c,i,𝒚c,i)gy,c,i2}\displaystyle\frac{2L_{f}^{2}}{\nu}\mathbb{E}\|\bm{x}_{c,i+1}-\bm{x}_{c,i}\|^{2}+\frac{2}{\nu}\mathbb{E}\|\nabla_{y}J(\bm{x}_{c,i},\bm{y}_{c,i})-{g}_{y,c,i}\|^{2}\Bigg{\}}
μy2T(1Lfμy)i=1T1j=0i1(1νμy2)ij𝔼gy,c,j2\displaystyle-\frac{\mu_{y}}{2T}(1-L_{f}\mu_{y})\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}(1-\frac{\nu\mu_{y}}{2})^{i-j}\mathbb{E}\|{g}_{y,c,j}\|^{2}
+20Lfμyσ2νmax𝑘pk2\displaystyle+\frac{20L_{f}\mu_{y}\sigma^{2}}{\nu}\underset{k}{\max}\ p^{2}_{k} (96)

where (a) follows from the following inequality

i=1T1j=0i1qi1jaj=i=0T2ai(j=0T2iqj)\displaystyle\sum_{i=1}^{T-1}\sum_{j=0}^{i-1}q^{i-1-j}a_{j}=\sum_{i=0}^{T-2}a_{i}(\sum_{j=0}^{T-2-i}q^{j})\leq 11qi=0T2ai\displaystyle\frac{1}{1-q}\sum_{i=0}^{T-2}a_{i}
\displaystyle\leq 11qi=0T1ai\displaystyle\frac{1}{1-q}\sum_{i=0}^{T-1}a_{i} (97)

here, 0<q<10<q<1 is a constant, {ai}\{a_{i}\} is a positive sequence.

References

  • [1] H. Cai, S. A. Sulaiman, and A. H. Sayed, “Diffusion optimistic learning for min-max optimization,” in Proc. IEEE ICASSP, Seoul, South Korea, April 2024, pp. 1–5.
  • [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 1–9.
  • [3] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv:1706.06083, 2017.
  • [4] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “Sbeed: Convergent reinforcement learning with nonlinear function approximation,” in International Conference on Machine Learning. PMLR, 2018, pp. 1125–1134.
  • [5] D. Kovalev and A. Gasnikov, “The first optimal algorithm for smooth and strongly-convex-strongly-concave minimax optimization,” in Advances in Neural Information Processing Systems, 2022, pp. 14691–14703.
  • [6] J. Zhang, M. Hong, and S. Zhang, “On lower iteration complexity bounds for the convex concave saddle point problems,” Mathematical Programming, vol. 194, no. 1-2, pp. 901–935, 2022.
  • [7] P. Mahdavinia, Y. Deng, H. Li, and M. Mahdavi, “Tight analysis of extra-gradient and optimistic gradient methods for nonconvex minimax problems,” in Advances in Neural Information Processing Systems, 2022, pp. 31213–31225.
  • [8] J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algorithms for minimax optimization without strong concavity,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 5485–5517.
  • [9] T. Lin, C. Jin, and M. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” in International Conference on Machine Learning. PMLR, 2020, pp. 6083–6093.
  • [10] J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,” in Advances in Neural Information Processing Systems, 2020, pp. 1153–1165.
  • [11] D. Kinderlehrer and G. Stampacchia, An Introduction to Variational Inequalities and Their Applications, SIAM, 2000.
  • [12] S. Komlósi, “On the stampacchia and minty variational inequalities,” Generalized Convexity and Optimization for Economic and Financial Decisions, pp. 231–260, 1999.
  • [13] Z. Dou and Y. Li, “On the one-sided convergence of adam-type algorithms in non-convex non-concave min-max optimization,” arXiv:2109.14213, 2021.
  • [14] A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities,” in Advances in Neural Information Processing Systems, 2022, pp. 38116–38133.
  • [15] A. Bohm, “Solving nonconvex-nonconcave min-max problems exhibiting weak minty solutions,” Transactions on Machine Learning Research, pp. 1–21, 2023.
  • [16] J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min-max optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2746–2754.
  • [17] T. Pethick, O. Fercoq, P. Latafat, P. Patrinos, and V. Cevher, “Solving stochastic weak minty variational inequalities without increasing batch size,” arXiv:2302.09029, 2023.
  • [18] C. Daskalakis, S. Skoulakis, and M. Zampetakis, “The complexity of constrained min-max optimization,” in Proc. ACM SIGACT Symposium on Theory of Computing, 2021, pp. 1466–1478.
  • [19] G.M. Korpelevich, “The extragradient method for finding saddle points and other problems,” Matecon, vol. 12, pp. 747–756, 1976.
  • [20] M. Liu, W. Zhang, Y. Mroueh, X. Cui, J. Ross, T. Yang, and P. Das, “A decentralized parallel algorithm for training generative adversarial nets,” in Advances in Neural Information Processing Systems, 2020, pp. 11056–11070.
  • [21] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training gans with optimism.,” in International Conference on Learning Representations (ICLR 2018), 2018.
  • [22] A. H. Sayed, “Adaptation, learning, and optimization over networks,” Foundations and Trends® in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
  • [23] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Machine Learning and Knowledge Discovery in Databases: European Conference ECML PKDD, Riva del Garda, Italy, 2016, 2016, pp. 795–811.
  • [24] G. Zhang, Y. Wang, L. Lessard, and R. B. Grosse, “Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 7659–7679.
  • [25] T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien, “Reducing noise in gan training with variance reduced extragradient,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [26] C. Liu and L. Luo, “Quasi-newton methods for saddle point problems,” in Advances in Neural Information Processing Systems, 2022, pp. 3975–3987.
  • [27] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar, “Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems,” in Conference on Learning Theory. PMLR, 2020, pp. 1758–1784.
  • [28] E. Gorbunov, A. Taylor, and G. Gidel, “Last-iterate convergence of optimistic gradient method for monotone variational inequalities,” Advances in Neural Information Processing Systems, vol. 35, pp. 21858–21870, 2022.
  • [29] E.K. Ryu, K. Yuan, and W. Yin, “ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems,” arXiv:1905.10899, 2019.
  • [30] T. Yoon and E.K. Ryu, “Accelerated algorithms for smooth convex-concave minimax problems with o (1/k^2 ) rate on squared gradient norm,” in International Conference on Machine Learning. PMLR, 2021, pp. 12098–12109.
  • [31] G. Gidel, R. A. Hemmat, M. Pezeshki, R. L. Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas, “Negative momentum for improved game dynamics,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1802–1811.
  • [32] L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems,” in Advances in Neural Information Processing Systems, 2020, pp. 20566–20577.
  • [33] A. H. Sayed, Inference and Learning from Data, vol. 3, Cambridge University Press, 2022.
  • [34] W. Xian, F. Huang, Y. Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax problems,” in Advances in Neural Information Processing Systems, 2021, pp. 25865–25877.
  • [35] H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,” arXiv:2212.02724, 2022.
  • [36] L. Chen, H. Ye, and L. Luo, “A simple and efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” arXiv:2212.02387, 2022.
  • [37] F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-PL minimax problems,” arXiv:2304.10902, 2023.
  • [38] A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [39] Y. Arjevani, Y. Carmon, J. C. Duchi, J.D. Foster, N. Srebro, and B. Woodworth, “Lower bounds for non-convex stochastic optimization,” Mathematical Programming, vol. 199, no. 1-2, pp. 165–214, 2023.
  • [40] L. D. Popov, “A modification of the arrow-hurwicz method for search of saddle points,” Mathematical Notes of the Academy of Sciences of the USSR, pp. 845–848, 1980.
  • [41] K. Antonakopoulos, A. Kavis, and V. Cevher, “Extra-newton: A first approach to noise-adaptive accelerated second-order methods,” Advances in Neural Information Processing Systems, vol. 35, pp. 29859–29872, 2022.
  • [42] Y. Hsieh, F. Iutzeler, J. Malick, and P. Mertikopoulos, “On the convergence of single-call stochastic extra-gradient methods,” in Advances in Neural Information Processing Systems, 2019, pp. 1–11.
  • [43] Y. Malitsky and M. K. Tam, “A forward-backward splitting method for monotone inclusions without cocoercivity,” SIAM Journal on Optimization, pp. 1451–1472, 2020.
  • [44] K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik, and Y. Malitsky, “Revisiting stochastic extragradient,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 4573–4582.
  • [45] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—part i: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018.
  • [46] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022.
  • [47] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2787–2794, 2020.
  • [48] A. H. Sayed, “Diffusion adaptation over networks,” in Academic Press Library in Signal Processing, vol. 3, pp. 323–453. Elsevier, 2014.
  • [49] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv:1511.06434, 2015.
  • [50] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [51] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” in Advances in Neural Information Processing Systems, 2019, pp. 1–9.