
Accelerated Multi-Time-Scale Stochastic Approximation: Optimal Complexity and Applications in Reinforcement Learning and Multi-Agent Games

Sihan Zeng ([email protected])
JPMorgan AI Research
Palo Alto, CA, United States

Thinh T. Doan ([email protected])
Department of Aerospace Engineering & Engineering Mechanics
University of Texas at Austin
Austin, TX, United States
Abstract

Multi-time-scale stochastic approximation is an iterative algorithm for finding the fixed point of a set of $N$ coupled operators given their noisy samples. It has been observed that, due to the coupling between the decision variables and the noisy samples of the operators, the performance of this method decays as $N$ increases. In this work, we develop a new accelerated variant of multi-time-scale stochastic approximation, which significantly improves the convergence rates of its standard counterpart. Our key idea is to introduce auxiliary variables that dynamically estimate the operators from their samples, which are then used to update the decision variables. These auxiliary variables help not only to control the variance of the operator estimates but also to decouple the sampling noise from the decision variables. This allows us to select more aggressive step sizes and achieve an optimal convergence rate. Specifically, under a strong monotonicity condition, we show that for any value of $N$ the $t^{\text{th}}$ iterate of the proposed algorithm converges to the desired solution at a rate $\widetilde{{\cal O}}(1/t)$ when the operator samples are generated from a single Markov process trajectory.

A second contribution of this work is to demonstrate that the objective of a range of problems in reinforcement learning and multi-agent games can be expressed as a system of fixed-point equations. As such, the proposed approach can be used to design new learning algorithms for solving these problems. We illustrate this observation with numerical simulations in a multi-agent game and show the advantage of the proposed method over the standard multi-time-scale stochastic approximation algorithm.

Keywords: Multi-time-scale stochastic approximation, reinforcement learning, game theory

1 Introduction

The objective of a range of problems in reinforcement learning and games can be expressed as a coupled system of $N$ equations, with each equation defined through a nonlinear operator that can be queried with noise. This work proposes a single-loop stochastic approximation (SA) algorithm which solves such a system using a (near) optimal number of queries. Given a compact statistical sample space ${\cal X}$ and operators $\{F_{i}:\mathbb{R}^{d_{1}}\times\cdots\times\mathbb{R}^{d_{N}}\times{\cal X}\rightarrow\mathbb{R}^{d_{i}}\}_{i=1,\cdots,N}$, our aim is to find a solution tuple $\boldsymbol{\theta}^{\star}=(\theta_{1}^{\star},\cdots,\theta_{N}^{\star})\in\mathbb{R}^{d_{1}}\times\cdots\times\mathbb{R}^{d_{N}}$ such that

$$\left\{\;\begin{array}{l}\mathbb{E}_{X\sim\mu_{\boldsymbol{\theta}^{\star}}}[F_{1}(\theta_{1}^{\star},\theta_{2}^{\star},\cdots,\theta_{N}^{\star},X)]=0,\\ \mathbb{E}_{X\sim\mu_{\boldsymbol{\theta}^{\star}}}[F_{2}(\theta_{1}^{\star},\theta_{2}^{\star},\cdots,\theta_{N}^{\star},X)]=0,\\ \qquad\qquad\vdots\\ \mathbb{E}_{X\sim\mu_{\boldsymbol{\theta}^{\star}}}[F_{N}(\theta_{1}^{\star},\theta_{2}^{\star},\cdots,\theta_{N}^{\star},X)]=0,\end{array}\right. \tag{5}$$

where ฮผ๐œฝ\mu_{\boldsymbol{\theta}} denotes a distribution over ๐’ณ{\cal X} (possibly) as a function of ๐œฝ\boldsymbol{\theta}. We do not assume having direct access to ฮผ๐œฝ\mu_{\boldsymbol{\theta}}. Rather, our sampling oracle is that we may query an operator P๐œฝ:๐’ณโ†’ฮ”๐’ณP_{\boldsymbol{\theta}}:{\cal X}\rightarrow\Delta_{{\cal X}}, which induces ฮผ๐œฝ\mu_{\boldsymbol{\theta}} as the stationary distribution. More precisely, for any ๐œฝ\boldsymbol{\theta} and XX, we can draw a sample Xโ€ฒโˆผP๐œฝโ€‹(X)X^{\prime}\sim P_{\boldsymbol{\theta}}(X). We consider this relaxed sampling oracle as it realistically captures how samples are generated in reinforcement learning (RL). We use ฮฉ=โ„d1ร—โ‹ฏร—โ„dN\Omega=\mathbb{R}^{d_{1}}\times\cdots\times\mathbb{R}^{d_{N}} to denote the space of the decision variable ๐œฝ\boldsymbol{\theta} and denote Fยฏiโ€‹(ฮธ1,โ‹ฏ,ฮธN)=๐”ผXโˆผฮผ๐œฝโ€‹[Fiโ€‹(ฮธ1,โ‹ฏ,ฮธN,X)]\widebar{F}_{i}(\theta_{1},\cdots,\theta_{N})=\mathbb{E}_{X\sim\mu_{\boldsymbol{\theta}}}[F_{i}(\theta_{1},\cdots,\theta_{N},X)] for any ๐œฝโˆˆฮฉ\boldsymbol{\theta}\in\Omega.

It may appear that the system (5) is symmetric, with every equation coupled to the others in the same way. This is not true in the context of RL and games, where problems usually have a specific hierarchical structure. This structure dictates the order in which the system can be most naturally solved and is formalized later through Assumption 3.

Exploiting the hierarchical structure, the standard multi-time-scale stochastic approximation algorithm for solving (5), hereafter referred to as MSA, maintains parameters $\theta_{1}^{[t]},\cdots,\theta_{N}^{[t]}$ as estimates of $(\theta_{1}^{\star},\cdots,\theta_{N}^{\star})$ and updates them simultaneously, each with its own step size. Given (possibly biased) stochastic samples $\{X^{[t]}\}$, MSA iteratively carries out

$$\theta_{i}^{[t+1]}=\theta_{i}^{[t]}-\alpha_{i}^{[t]}F_{i}(\theta_{1}^{[t]},\cdots,\theta_{N}^{[t]},X^{[t]}),\quad\forall i=1,\cdots,N. \tag{6}$$
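For intuition, (6) can be sketched in a few lines of code. The following is a minimal sketch, not an implementation from the paper: the callables `F(i, theta, x)` (a noisy evaluation of $F_{i}$) and `sample(theta, x)` (a draw from the kernel $P_{\boldsymbol{\theta}}$ introduced above) are hypothetical stand-ins for the sampling oracle.

```python
import numpy as np

def msa(F, sample, theta0, x0, alpha, T):
    """Standard multi-time-scale SA, cf. (6).

    F(i, theta, x):   noisy sample of operator F_i (hypothetical callable)
    sample(theta, x): draws X' ~ P_theta(X)        (hypothetical callable)
    alpha(i, t):      step size alpha_i^{[t]}, one time scale per level i
    """
    theta, x = [np.array(th, dtype=float) for th in theta0], x0
    for t in range(T):
        # raw single-sample operator evaluations at (theta^{[t]}, X^{[t]})
        f = [F(i, theta, x) for i in range(len(theta))]
        x = sample(theta, x)  # next Markovian sample, drawn under theta^{[t]}
        # each theta_i moves along its own noisy direction; this direct
        # coupling to the sample is the bottleneck discussed in Section 3.3
        theta = [theta[i] - alpha(i, t) * f[i] for i in range(len(theta))]
    return theta
```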

With properly selected step sizes, this simple method is guaranteed to converge to the solution of (5) under the strong monotonicity and Lipschitz continuity assumptions on $\widebar{F}_{i}$ introduced later (Hong et al., 2023; Zeng et al., 2024b; Shen and Chen, 2022; Han et al., 2024). However, its time and sample complexity degrades as $N$ increases and is, as far as is currently known, sub-optimal. When $N=1$, MSA reduces to the classic stochastic approximation algorithm (Robbins and Monro, 1951) and has a complexity of ${\cal O}(1/t)$ when the $X^{[t]}$ are i.i.d. samples from a fixed distribution (and $\widetilde{{\cal O}}(1/t)$ if the samples $X^{[t]}$ are consecutively generated from a Markov chain, where $\widetilde{{\cal O}}$ hides logarithmic factors). More precisely, the last iterate of (6) converges to an $\widetilde{{\cal O}}(1/t)$ neighborhood around $\boldsymbol{\theta}^{\star}$ in $t$ iterations, matching the worst-case lower bound (Chen et al., 2022). When $N=2$, the best-known complexity, as established in Hong et al. (2023); Zeng et al. (2024b), becomes $\widetilde{{\cal O}}(1/t^{2/3})$, inferior to that under $N=1$ by a factor of $t^{1/3}$. The authors in Hong et al. (2023); Zeng et al. (2024b) only analyze the case of two equations, but we show in Section 3.2 that their techniques can be extended to establish a complexity of $\widetilde{{\cal O}}(1/t^{1/2})$ for $N=3$ and progressively worse bounds for $N=4,5,\cdots$. This apparent gap in the performance of MSA for different $N$ makes us question whether the problem becomes fundamentally harder as $N$ increases or whether algorithmic/analytical improvements can be made. In this work, we answer the question in the positive direction by proposing an accelerated SA algorithm which achieves the optimal complexity $\widetilde{{\cal O}}(1/t)$ for all $N$ while maintaining the favorable properties of (6), such as its simplicity and single-loop structure. The same algorithm was introduced in our prior works (Doan, 2024; Zeng and Doan, 2024), but we make important contributions over them, as discussed in detail in Section 1.2. We now summarize our main contributions.

1.1 Main Contributions

Number of Equations | Standard Multi-Time-Scale SA Complexity | A-MSA Complexity
Two | $\widetilde{{\cal O}}(\epsilon^{-1.5})$ | $\widetilde{{\cal O}}(\epsilon^{-1})$
Three | $\widetilde{{\cal O}}(\epsilon^{-2})$ | $\widetilde{{\cal O}}(\epsilon^{-1})$
Four | $\widetilde{{\cal O}}(\epsilon^{-2.5})$ | $\widetilde{{\cal O}}(\epsilon^{-1})$
$N$ | $\widetilde{{\cal O}}(\epsilon^{-(N+1)/2})$ | $\widetilde{{\cal O}}(\epsilon^{-1})$
Table 1: Sample complexity comparison as the number of equations increases (under strongly monotone operators and Markovian samples).

The key contribution of this work is a novel Accelerated Multi-Time-Scale Stochastic Approximation (A-MSA) algorithm that solves a coupled system of $N$ equations with the optimal convergence rate $\widetilde{{\cal O}}(1/t)$ for strongly monotone operators when the samples are generated from a single trajectory of the Markov processes characterized by $P_{\boldsymbol{\theta}}$. In the case $N=2$, this rate improves over $\widetilde{{\cal O}}(1/t^{2/3})$, the best-known complexity of the MSA algorithm derived in Doan (2022); Hong et al. (2023). When $N\geq 3$, the complexity of MSA deteriorates further and remains inferior to our result (see Table 1). Our innovation is built on the observation that the main hindrance to the fast convergence of (6) lies in the direct coupling between the iterates $\theta_{i}^{[t]}$ across $i$, which means that any error/noise in $\theta_{i}^{[t]}$ for some $i$ immediately propagates to $\theta_{j}^{[t+1]}$ for all $j$ in the next iteration. This coupling effect prevents us from choosing the aggressive step sizes necessary for achieving the optimal complexity. Our solution to this challenge is to update $\theta_{i}^{[t]}$ in the direction of a properly averaged version of the operator $F_{i}$ (as opposed to the single sample used in (6)). This seemingly small modification maintains the simplicity of (6) while eliminating the direct coupling between the $\theta_{i}^{[t]}$. In Section 3.3, we compare the analyses of MSA and A-MSA and pinpoint the convergence bottleneck in MSA, which we overcome through an algorithmic modification.

A second contribution of our work is to show that the coupled equations in (5) abstract many popular problems in games and RL, including gradient-based temporal difference learning with momentum, distributionally robust RL, policy optimization in two-player zero-sum Markov games, and policy optimization in mean-field games. As a consequence, the proposed A-MSA method can be used to design new algorithms for solving these problems. Some of these problems do not satisfy the strong monotonicity condition but are specially structured in their own ways. We show how the analysis for strongly monotone operators can be adapted to the structure of each respective problem, which potentially establishes state-of-the-art or previously unknown convergence rates.

Work | Applicable to $>2$ Equations | Sample Noise | Smoothness Assumption | Sample Complexity
Deb and Bhatnagar (2021) | Yes | Bounded Martingale | Not Required | -
Hong et al. (2023) | No | i.i.d. | Not Required | ${\cal O}(\epsilon^{-1.5})$
Zeng et al. (2024b) | No | Time-Varying Markovian | Not Required | $\widetilde{{\cal O}}(\epsilon^{-1.5})$
Shen and Chen (2022) | Yes | Bounded Martingale | Required | ${\cal O}(\epsilon^{-1})$
Han et al. (2024) | No | Bounded Martingale | Relaxed Version | ${\cal O}(\epsilon^{-1})$
Doan (2024) | No | i.i.d. | Not Required | ${\cal O}(\epsilon^{-1})$
Zeng and Doan (2024) | No | i.i.d. | Not Required | ${\cal O}(\epsilon^{-1})$
This Work | Yes | Time-Varying Markovian | Not Required | $\widetilde{{\cal O}}(\epsilon^{-1})$
Table 2: Existing algorithms on nonlinear two/multi-time-scale stochastic approximation and their assumptions and complexity.

1.2 Related Work

This paper closely relates to existing works on single- and multi-time-scale SA (especially under Markovian samples) and those that analyze the finite-time and sample complexity of reinforcement learning algorithms through the lens of SA. We discuss the most relevant works in these domains and highlight our novelty.

Stochastic Approximation. The SA algorithm is a classic method for solving a single fixed-point equation given noisy operator samples (Robbins and Monro, 1951). The asymptotic convergence of SA is well understood and usually derived with the Poisson equation method (Benveniste et al., 2012; Li et al., 2023a) or by analyzing a related ODE (Meerkov, 1972; Borkar, 2008). Under i.i.d. samples, bounded martingale difference noise, or dependent samples from a Markov chain with bounded mixing time, the finite-time analysis measured by squared errors in expectation is established for linear (Lakshminarayanan and Szepesvári, 2017; Srikant and Ying, 2019) and strongly monotone operators (Moulines and Bach, 2011; Chen et al., 2022). Linear SA abstracts the temporal difference (TD) learning algorithm in RL. Various works (Bhandari et al., 2018; Mitra, 2024) study the finite-time convergence of TD learning leveraging the problem structure, but their analysis may be extended to linear SA. Recently, non-asymptotic central limit theorems have also been proved for TD learning (Srikant, 2024) and Q-learning (Li et al., 2023b), which show the error converging in distribution to a zero-mean Gaussian random variable. Such results may be extended to general linear/non-linear SA as well.

Some of the aforementioned works employ Polyak-Ruppert averaging to improve variance control (Srikant, 2024; Haque et al., 2023). We note that the averaging step in the proposed method is distinct from Polyak-Ruppert averaging: Polyak-Ruppert averaging does not modify the algorithm itself but returns a weighted average of all past iterates as the solution, while our algorithm uses an averaged sequence inside the updates.

Multi-Time-Scale Stochastic Approximation. When $N=2$, the MSA algorithm reduces to the standard two-time-scale SA, which is known to converge asymptotically under linear (Borkar, 1997, 2008; Konda and Tsitsiklis, 2004) and non-linear operators (Mokkadem and Pelletier, 2006; Fort, 2015). In the linear setting, the finite-time convergence measured by mean squared error was first established with a sub-optimal rate $\widetilde{O}(1/t^{2/3})$ in Gupta et al. (2019); Doan and Romberg (2019) and improved to the optimal $\widetilde{O}(1/t)$ in Kaledin et al. (2020); Haque et al. (2023). In addition, convergence in probability is studied in Dalal et al. (2018). In the non-linear setting under the strong monotonicity assumption, the best-known rate of two-time-scale SA remains $\widetilde{O}(1/t^{2/3})$ in general (Doan, 2022; Hong et al., 2023; Zeng et al., 2024b). Under an additional smoothness assumption, Shen and Chen (2022); Han et al. (2024) show that the optimal rate can be recovered. Our paper does not require this assumption, as it may not hold in practical applications.

The two-time-scale SA framework has wide applications in control, RL, and games, e.g., gradient-based TD learning (Sutton et al., 2009; Wang et al., 2021), actor-critic methods (Konda and Borkar, 1999; Konda and Tsitsiklis, 1999; Wu et al., 2020), policy gradient descent-ascent for two-player games (Daskalakis et al., 2020), and decentralized Q-learning (Sayin et al., 2021). By analyzing the general framework, one can apply the result to each specific problem and immediately deduce performance guarantees, without having to tailor the analysis on a case-by-case basis.

The study of SA for solving three or more equations remains limited but deserves increased attention due to its ability to model rich classes of problems, including gradient-based TD learning with momentum (Deb and Bhatnagar, 2022), distributionally robust RL (Liang et al., 2024), and policy optimization in games; these applications are discussed in Section 4. The existing works on multi-time-scale SA include Deb and Bhatnagar (2021), which only provides an asymptotic analysis, and Shen and Chen (2022), which relies on a restrictive smoothness assumption.

Finally, we discuss the differences between the current paper and our prior works (Doan, 2024; Zeng and Doan, 2024), which study the special setting $N=2$ under i.i.d. samples. First, this paper extends our prior works to the general setting with any $N$. Moving from $N=2$ to $N\geq 3$ creates a critical technical challenge: with $N\geq 3$, the mid-level variables $\{\theta_{i}\}_{i=2,\cdots,N-1}$ are coupled with faster- and slower-moving variables at the same time, making them harder to decouple through gradient averaging. We overcome the challenge through an improved bound on $x_{i}^{[t]}$ (which measures the convergence of $\theta_{i}^{[t]}$ to a proper learning target introduced later) as a function of $\{x_{j}^{[t]}\}_{j<i}$ and $\{x_{j}^{[t]}\}_{j>i}$. Second, we perform a mathematical investigation of the root cause of the complexity improvement. An explanation of why A-MSA enjoys a faster convergence rate is missing in our prior papers. This work fills the gap by contrasting an important intermediate result obtained under A-MSA and MSA and pinpointing the terms that are improved by the averaged operator estimation step. Finally, this paper considers a more general sampling oracle: we study non-i.i.d. samples generated from a time-varying Markov chain, a setting realistic for modeling applications in sequential decision making. To avoid complexity degradation due to Markovian samples, analyzing A-MSA requires carefully bounding the distance between the distribution of $X^{[t]}$ and a (time-varying) stationary distribution determined by $\boldsymbol{\theta}^{[t]}$, at the price of ${\cal O}(\log t)$ factors, whereas this distance is zero in expectation in the i.i.d. setting.

2 Accelerated Multi-Time-Scale Stochastic Approximation Method

The MSA method (6) updates each $\theta_{i}$ in the direction of the corresponding operator $F_{i}$ evaluated at the latest iterates. Our proposed method, formally presented in Algorithm 1, modifies the update by introducing auxiliary variables $f_{i}^{[t]}$ that estimate $\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})$ by forming a weighted average of the operator samples, controlled by a parameter $\lambda^{[t]}\in(0,1)$. These auxiliary variables are then used to update the decision variables. If we set $\lambda^{[t]}=1$ for all $t$, Algorithm 1 reduces to MSA in (6). However, a large $\lambda^{[t]}$ leads to a high-variance estimate, whereas if $\lambda^{[t]}$ is too small, $f_{i}^{[t]}$ cannot track the moving target $\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})$. Decaying $\lambda^{[t]}$ appropriately relative to $\alpha_{i}^{[t]}$ lets $f_{i}^{[t]}$ control the variance of our estimates, which is crucial for deriving the optimal convergence rate.

Algorithm 1 Accelerated Multi-Time-Scale Stochastic Approximation Algorithm
1: Initialization: decision variables $\boldsymbol{\theta}^{[0]}\in\Omega$, operator estimation variables $\{f_{i}^{[0]}\in\mathbb{R}^{d_{i}}\}_{i}$, initial sample $X^{[0]}$, step size sequences $\{\alpha_{i}^{[t]}\}_{i,t}$, $\{\lambda^{[t]}\}_{t}$
2: for $t=0,1,\cdots,T-1$ do
3:   Decision variable update:
$$\theta_{i}^{[t+1]}=\theta_{i}^{[t]}-\alpha_{i}^{[t]}f_{i}^{[t]},\quad\forall i=1,\cdots,N,\qquad\boldsymbol{\theta}^{[t]}=\{\theta_{1}^{[t]},\cdots,\theta_{N}^{[t]}\}. \tag{7}$$
4:   Gradient estimation variable update:
$$f_{i}^{[t+1]}=(1-\lambda^{[t]})f_{i}^{[t]}+\lambda^{[t]}F_{i}(\theta_{1}^{[t]},\cdots,\theta_{N}^{[t]},X^{[t]}). \tag{8}$$
5:   Sample draw:
$$X^{[t+1]}\sim P_{\boldsymbol{\theta}^{[t]}}(X^{[t]}). \tag{9}$$
6: end for
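To make the updates (7)–(9) concrete, here is a minimal sketch of Algorithm 1 in the same style as the MSA sketch above. Again, `F(i, theta, x)` and `sample(theta, x)` are hypothetical stand-ins for the sampling oracle, and the step-size constants are illustrative placeholders for the schedules (21)–(22) introduced below; this is a sketch, not a reference implementation.

```python
import numpy as np

def a_msa(F, sample, theta0, f0, x0, c_alpha, c_lam, h, T):
    """Accelerated multi-time-scale SA (Algorithm 1), O(1/t) schedules as in (21).

    F(i, theta, x):       noisy sample of operator F_i (hypothetical callable)
    sample(theta, x):     draws X' ~ P_theta(X)        (hypothetical callable)
    c_alpha, c_lam, h:    illustrative step-size constants, cf. (21)-(22)
    """
    theta = [np.array(th, dtype=float) for th in theta0]
    f = [np.array(fi, dtype=float) for fi in f0]  # operator estimates f_i^{[t]}
    x, N = x0, len(theta)
    for t in range(T):
        lam = c_lam / (t + h + 1)
        # (7): decision variables move along the *averaged* estimates f_i^{[t]},
        # not along raw operator samples as in MSA
        theta_new = [theta[i] - (c_alpha[i] / (t + h + 1)) * f[i]
                     for i in range(N)]
        # (8): weighted averaging of the operator sample at (theta^{[t]}, X^{[t]})
        f = [(1.0 - lam) * f[i] + lam * F(i, theta, x) for i in range(N)]
        # (9): next Markovian sample, drawn under the pre-update parameter
        x = sample(theta, x)
        theta = theta_new
    return theta
```

With $\lambda^{[t]}\equiv 1$ the estimate collapses to the most recent operator sample and the method reduces to MSA, as noted above.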

Note that in our algorithm the samples are drawn from the transition kernel $P_{\boldsymbol{\theta}}$ parameterized by $\boldsymbol{\theta}$ (see (9)). Thus, as $\boldsymbol{\theta}^{[t]}$ changes over time, the samples are temporally dependent and Markovian with time-varying stationary distributions. To ensure the stable behavior of the Markovian samples, we make the following assumptions. Our first assumption concerns the fast mixing of the Markov chain generated under $P_{\boldsymbol{\theta}}$. Given two distributions $u_{1},u_{2}\in\Delta_{{\cal X}}$, their total variation (TV) distance is defined as

$$d_{\text{TV}}(u_{1},u_{2})=\frac{1}{2}\sup_{\nu:{\cal X}\rightarrow[-1,1]}\left|\int\nu\,du_{1}-\int\nu\,du_{2}\right|. \tag{10}$$
Definition 1

Given ๐›‰โˆˆฮฉ\boldsymbol{\theta}\in\Omega, consider a Markov chain {X[t]}\{X^{[t]}\} where X[t]โˆผP๐›‰โ€‹(X[tโˆ’1])X^{[t]}\sim P_{\boldsymbol{\theta}}(X^{[t-1]}) with ฮผ๐›‰\mu_{\boldsymbol{\theta}} as the stationary distribution. For any a>0a>0, the mixing time of {X[t]}\{X^{[t]}\} at level aa is defined as

ฯ„๐œฝ(ฮฑ)=min{tโˆˆโ„•:supXโˆˆ๐’ณdTV(โ„™(X[t]=โ‹…โˆฃX[0]=X),ฮผ๐œฝ)โ‰คa}.\tau_{\boldsymbol{\theta}}(\alpha)=\min\{t\in\mathbb{N}:\sup_{X\in{\cal X}}d_{\text{TV}}(\mathbb{P}(X^{[t]}=\cdot\mid X^{[0]}=X),\mu_{\boldsymbol{\theta}})\leq a\}.
Assumption 1 (Uniform Geometric Ergodicity)

For any ๐›‰โˆˆฮฉ\boldsymbol{\theta}\in\Omega, the Markov chain {X[t]}\{X^{[t]}\} generated under P๐›‰P_{\boldsymbol{\theta}} has a unique stationary distribution ฮผ๐›‰\mu_{\boldsymbol{\theta}}. In addition, there exist constants m>0m>0 and ฯโˆˆ(0,1)\rho\in(0,1) independent of ๐›‰\boldsymbol{\theta} such that

supXโˆˆ๐’ณdTV(โ„™(X[t]=โ‹…|X[0]=X),ฮผ๐œฝ)โ‰คmฯtย for allย tโ‰ฅ0.\sup_{X\in{\cal X}}d_{\text{TV}}(\mathbb{P}(X^{[t]}=\cdot\,|X^{[0]}=X),\mu_{\boldsymbol{\theta}})\leq m\rho^{t}\text{ for all }t\geq 0.

The geometric ergodicity assumption states that the distribution of a sample from the Markov chain approaches the stationary distribution geometrically fast, and is again standard in the literature on RL and SA under Markovian noise (Srikant and Ying, 2019; Wu et al., 2020; Zeng et al., 2024b).

Denoting $\tau(a)\triangleq\sup_{\boldsymbol{\theta}\in\Omega}\tau_{\boldsymbol{\theta}}(a)$, this assumption implies that there exists a positive constant $C$, depending only on $\rho$ and $m$, such that

$$\tau(a)\leq C\log\left(1/a\right). \tag{11}$$
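For completeness, here is the one-line calculation behind (11), implied by Assumption 1: the bound $m\rho^{t}\leq a$ holds as soon as

$$t\geq\frac{\log m+\log(1/a)}{\log(1/\rho)},$$

so $\tau(a)\leq C\log(1/a)$ for all sufficiently small $a$, with $C$ depending only on $m$ and $\rho$.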

For convenience, we denote by $\tau_{t}\triangleq\tau(\lambda^{[t]})$ the mixing time corresponding to $\lambda^{[t]}$ in Algorithm 1.

Assumption 2

Given two distributions $d,\hat{d}$ over ${\cal X}$ and $\boldsymbol{\theta},\hat{\boldsymbol{\theta}}\in\Omega$, let $X\sim d$, $X^{\prime}\sim P_{\boldsymbol{\theta}}(X)$ and $\hat{X}\sim\hat{d}$, $\hat{X}^{\prime}\sim P_{\hat{\boldsymbol{\theta}}}(\hat{X})$. There exists a constant $L>0$ such that

$$d_{\text{TV}}(P(X^{\prime}=\cdot),P(\hat{X}^{\prime}=\cdot))\leq d_{\text{TV}}(d,\hat{d})+L\|\boldsymbol{\theta}-\hat{\boldsymbol{\theta}}\|. \tag{12}$$

In addition, the stationary distribution is Lipschitz in $\boldsymbol{\theta}$:

$$d_{\text{TV}}(\mu_{\boldsymbol{\theta}},\mu_{\hat{\boldsymbol{\theta}}})\leq L\|\boldsymbol{\theta}-\hat{\boldsymbol{\theta}}\|. \tag{13}$$

Assumption 2 represents a regularity condition on $P_{\boldsymbol{\theta}}$, which can be interpreted as a type of triangle inequality for the TV distance defined over transition kernels with different parameters. This assumption can be shown to hold in standard MDPs (see Wu et al. (2020, Lemma A1)).

We denote by ${\cal F}_{t}=\{X^{[0]},X^{[1]},\cdots,X^{[t]}\}$ the filtration containing all randomness generated by Algorithm 1 up to time $t$.

3 Main Results

In this section, we establish the finite-sample complexity of A-MSA for finding the solution of (5) when the operators $F_{i}$ have a nested structure. For convenience, we refer to the first and last equations in (5) as the highest- and lowest-level problems, respectively; i.e., the problem levels increase from $N$ to 1. Levels here indicate the order in which the problem can be most naturally solved.

With some abuse of notation, we denote $\boldsymbol{\theta}_{1:N-1}=\{\theta_{1},\cdots,\theta_{N-1}\}$. Given $\boldsymbol{\theta}_{1:N-1}$, we use $y_{N}(\boldsymbol{\theta}_{1:N-1})$ to denote the solution of the lowest-level problem

$$\widebar{F}_{N}\big(\boldsymbol{\theta}_{1:N-1},y_{N}(\boldsymbol{\theta}_{1:N-1})\big)=0.$$

Similarly, we denote by $\{y_{N-1}(\boldsymbol{\theta}_{1:N-2}),y_{N}(\boldsymbol{\theta}_{1:N-2})\}$ the solution of the lowest two levels corresponding to $\widebar{F}_{N-1},\widebar{F}_{N}$ given $\boldsymbol{\theta}_{1:N-2}$:

$$\begin{aligned}&\widebar{F}_{N-1}\big(\boldsymbol{\theta}_{1:N-2},y_{N-1}(\boldsymbol{\theta}_{1:N-2}),y_{N}(\boldsymbol{\theta}_{1:N-2})\big)=0,\\ &\widebar{F}_{N}\big(\boldsymbol{\theta}_{1:N-2},y_{N-1}(\boldsymbol{\theta}_{1:N-2}),y_{N}(\boldsymbol{\theta}_{1:N-2})\big)=0.\end{aligned} \tag{14}$$

Conceptually, $y_{N-1}(\boldsymbol{\theta}_{1:N-2})$ and $y_{N}(\boldsymbol{\theta}_{1:N-2})$ represent the optimal solutions of the second-to-last and last equations with respect to the second-to-last and last decision variables when $\theta_{1},\cdots,\theta_{N-2}$ are given.

Generalizing this line of discussion, we introduce $y_{j}(\boldsymbol{\theta}_{1:i-1})$ to represent the optimal solution of the $j^{\text{th}}$ equation with respect to the $j^{\text{th}}$ variable when $\theta_{1},\cdots,\theta_{i-1}$ are given, for any $i\in[2,N]$ and $j\in[i,N]$. For $i=1$ (i.e., no $\theta$ is given), we additionally define $y_{j}(\emptyset)=\theta_{j}^{\star}\in\mathbb{R}^{d_{j}}$, which is the $j^{\text{th}}$ component of the optimal solution to (5). We introduce the aggregate notation

$${\bf y}_{i:j}(\boldsymbol{\theta}_{1:i-1})\triangleq\{y_{i}(\boldsymbol{\theta}_{1:i-1}),\cdots,y_{j}(\boldsymbol{\theta}_{1:i-1})\}$$

and note that ${\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1})$ is the solution to the system

$$\widebar{F}_{j}\big(\boldsymbol{\theta}_{1:i-1},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1})\big)=\widebar{F}_{j}\big(\boldsymbol{\theta}_{1:i-1},y_{i}(\boldsymbol{\theta}_{1:i-1}),y_{i+1}(\boldsymbol{\theta}_{1:i-1}),\cdots,y_{N}(\boldsymbol{\theta}_{1:i-1})\big)=0,\quad\forall j\in[i,N].$$

We now introduce the main assumption that drives our analysis, referred to as nested strong monotonicity of the operators $\widebar{F}_{i}$ with respect to the increasing levels defined above.

Assumption 3 (Strong Monotonicity)

There exists a constant $\delta>0$ such that for all $i=1,\cdots,N$ and $\theta_{1}\in\mathbb{R}^{d_{1}},\cdots,\theta_{i-1}\in\mathbb{R}^{d_{i-1}}$, $\theta_{i},\theta_{i}^{\prime}\in\mathbb{R}^{d_{i}}$,

$$\Big\langle\widebar{F}_{i}\big(\boldsymbol{\theta}_{1:i-1},\theta_{i},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i-1},\theta_{i})\big)-\widebar{F}_{i}\big(\boldsymbol{\theta}_{1:i-1},\theta_{i}^{\prime},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i-1},\theta_{i}^{\prime})\big),\,\theta_{i}-\theta_{i}^{\prime}\Big\rangle\geq\delta\|\theta_{i}-\theta_{i}^{\prime}\|^{2}.$$

Assumption 3 represents a nested variant of strong monotonicity for the operators $\widebar{F}_{i}$, which does not require each $\widebar{F}_{i}$ to be strongly monotone with respect to all of its variables. It states that, given $\boldsymbol{\theta}_{1:i-1}$, $\widebar{F}_{i}$ is strongly monotone with respect to $\theta_{i}$ when all lower-level decision variables are at their corresponding optimal solutions. When $N=2$, the assumption reduces to the condition made in the existing literature (Doan, 2022; Shen and Chen, 2022; Han et al., 2024) and can be verified to hold in RL applications including gradient-based TD learning (Xu et al., 2019) and distributed TD learning (to model distributed TD learning in the two-time-scale SA framework, the outer-loop equation enforces consensus among the agents, whereas in the inner loop each agent learns its local value function). As an implication of the assumption, $y_{j}(\boldsymbol{\theta}_{1:i-1})$ is always unique for any $i,j$. Without loss of generality, we assume $\delta\leq 1$, which is convenient for simplifying terms.
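As a simple illustration of the assumption (an example of ours, not drawn from the applications above), consider $N=2$ with linear expected operators

$$\widebar{F}_{1}(\theta_{1},\theta_{2})=A_{11}\theta_{1}+A_{12}\theta_{2}-b_{1},\qquad\widebar{F}_{2}(\theta_{1},\theta_{2})=A_{21}\theta_{1}+A_{22}\theta_{2}-b_{2}.$$

At level 2 the assumption asks $\langle A_{22}u,u\rangle\geq\delta\|u\|^{2}$, in which case $y_{2}(\theta_{1})=A_{22}^{-1}(b_{2}-A_{21}\theta_{1})$ is well defined; at level 1, substituting $y_{2}$ gives $\widebar{F}_{1}(\theta_{1},y_{2}(\theta_{1}))=(A_{11}-A_{12}A_{22}^{-1}A_{21})\theta_{1}+\text{const}$, so the assumption amounts to $\langle(A_{11}-A_{12}A_{22}^{-1}A_{21})u,u\rangle\geq\delta\|u\|^{2}$, a condition familiar from linear two-time-scale SA (e.g., Konda and Tsitsiklis, 2004; Kaledin et al., 2020).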

Assumption 4 (Lipschitz Continuity and Boundedness)

There exist positive finite constants $L$ and $B$ such that for all $\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\in\Omega$,

โ€–Fiโ€‹(๐œฝ,X)โˆ’Fiโ€‹(๐œฝโ€ฒ,X)โ€–\displaystyle\|F_{i}(\boldsymbol{\theta},X)-F_{i}(\boldsymbol{\theta}^{\prime},X)\| โ‰คLโ€‹โˆ‘j=1Nโ€–ฮธjโˆ’ฮธjโ€ฒโ€–,โˆ€i,X,\displaystyle\leq L\sum_{j=1}^{N}\|\theta_{j}-\theta_{j}^{\prime}\|,\quad\forall i,X, (15)
โ€–yjโ€‹(๐œฝ1:iโˆ’1)โˆ’yjโ€‹(๐œฝ1:iโˆ’1โ€ฒ)โ€–\displaystyle\|y_{j}(\boldsymbol{\theta}_{1:i-1})-y_{j}(\boldsymbol{\theta}_{1:i-1}^{\prime})\| โ‰คLโ€‹โˆ‘iโ€ฒ=1iโˆ’1โ€–ฮธiโ€ฒโˆ’ฮธiโ€ฒโ€ฒโ€–,โˆ€i,j,\displaystyle\leq L\sum_{i^{\prime}=1}^{i-1}\|\theta_{i^{\prime}}-\theta_{i^{\prime}}^{\prime}\|,\quad\forall i,j, (16)
โ€–yiโ€‹(๐œฝ1:iโˆ’1)โ€–\displaystyle\|y_{i}(\boldsymbol{\theta}_{1:i-1})\| โ‰คB,โˆ€i.\displaystyle\leq B,\quad\forall i. (17)

Eqs. (15) and (16) are standard Lipschitz continuity assumptions on the operators $F_{i}$ and the learning targets (Doan, 2022; Zeng et al., 2024b; Shen and Chen, 2022). Eq. (17) assumes that the learning targets are bounded and guarantees algorithm stability in the presence of time-varying Markovian noise; this assumption is made in Kaledin et al. (2020); Zeng et al. (2024b) when $N=2$. Without loss of generality, we assume $L\geq\max_{i\in\{1,\cdots,N\},X\in{\cal X}}\|F_{i}(0,X)\|$, which is finite due to the compactness of ${\cal X}$. Together with (15) and the triangle inequality applied at $\boldsymbol{\theta}^{\prime}=0$, Assumption 4 implies that the operator $F_{i}$ is bounded affinely by the decision variable:

โ€–Fiโ€‹(๐œฝ,X)โ€–โ‰คLโ€‹(โˆ‘j=1Nโ€–ฮธjโ€–+1),โˆ€iโˆˆ{1,โ‹ฏ,N},๐œฝโˆˆฮฉ.\displaystyle\|F_{i}(\boldsymbol{\theta},X)\|\leq L(\sum_{j=1}^{N}\|\theta_{j}\|+1),\quad\forall i\in\{1,\cdots,N\},\boldsymbol{\theta}\in\Omega. (18)

We use the same constant $L$ in Assumptions 2 and 4 to simplify the notation.

Convergence Metric. We quantify the convergence of the algorithm through the residuals

$$x_{i}^{[t]}=\theta_{i}^{[t]}-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]}),\qquad\Delta f_{i}^{[t]}=f_{i}^{[t]}-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}), \tag{20}$$

where $x_{i}^{[t]}$ is the optimality residual, measuring the distance of the decision variable $\theta_{i}^{[t]}$ to the learning target introduced above, whereas $\Delta f_{i}^{[t]}$ is the estimation residual, capturing the quality of the estimate of the operator $\widebar{F}_{i}$. Note that if $x_{i}^{[t]}=0$ for all $i$, then $\boldsymbol{\theta}^{[t]}=\boldsymbol{\theta}^{\star}$, the desired solution.

Finally, we will study the convergence of Algorithm 1 under the following choice of step sizes

$$\lambda^{[t]}=\frac{c_{\lambda}}{t+h+1},\qquad\alpha_{i}^{[t]}=\frac{c_{i}}{t+h+1}, \tag{21}$$

where $\lambda^{[t]}\geq\alpha_{i}^{[t]}$, as the operator estimates are updated at a "faster" time scale than the decision variables. In particular, these step sizes satisfy the following conditions for all $t\geq 0$:

$$\begin{aligned}c_{1}&=\frac{32}{\delta},\quad\lambda^{[t]}\leq\frac{1}{4},\quad\tau_{t}^{2}\lambda^{[t-\tau_{t}]}\leq\frac{1}{8DN^{3}},\quad\alpha_{i}^{[t]}\leq\min\Big\{\frac{\delta^{2}}{80N^{8}L^{6}},\frac{\delta}{40N^{5}L^{6}},\frac{2}{5NL^{2}},\frac{1}{\delta}\Big\},\\ \frac{\alpha_{i}^{[t]}}{\lambda^{[t]}}&\leq\min\Big\{\frac{1}{8(DN^{3}+3/\delta+L)},\frac{\delta}{32DN^{3}},\frac{\delta}{32\big(\frac{9N^{4}L^{6}}{\delta}+8N^{3}L^{3}\big)},\frac{16}{\delta}\Big\},\\ \frac{\alpha_{i-1}^{[t]}}{\alpha_{i}^{[t]}}&\leq\frac{\delta}{16}\Big(\frac{9N^{3}L^{3}}{2}+\frac{4N^{6}L^{6}}{\delta}\Big)^{-1},\quad\frac{(\alpha_{i}^{[t]})^{2}}{\alpha_{1}^{[t]}}\leq\min\Big\{\frac{\delta^{3/2}}{64N^{7}},\frac{8}{5N^{3}L^{3}}\Big\},\end{aligned} \tag{22}$$

where $D$ is an absolute constant depending on $L$, $B$, $\rho$, and $m$. We note that there always exist constants $c_{\lambda},c_{i},h$ such that the step sizes in (21) satisfy (22).

For our analysis, we consider the following Lyapunov function

$$V^{[t]}=\sum_{i=1}^{N}\mathbb{E}\big[\|x_{i}^{[t]}\|^{2}+\|\Delta f_{i}^{[t]}\|^{2}\big]. \tag{23}$$

Under the choice of step sizes above, the main result of this paper is presented in the following theorem, which characterizes the convergence rate of Algorithm 1.

Theorem 1

Under Assumptions 1–4, the iterates generated by Algorithm 1 satisfy for all $t\geq\tau_{t}$

$$V^{[t+1]}\leq\frac{h^{2}V^{[\tau_{t}]}}{(t+h+1)^{2}}+\widetilde{{\cal O}}\left(\frac{1}{t+1}\right).$$

Theorem 1 shows that the last iterate of the A-MSA algorithm converges to the unique solution of (5) in the mean-squared sense at the optimal rate $\widetilde{{\cal O}}(1/t)$. As the algorithm draws exactly one sample per iteration, this rate translates to a sample complexity of $\widetilde{{\cal O}}(\epsilon^{-1})$ for finding an $\epsilon$-optimal solution. The same complexity has been shown to be achievable by the MSA algorithm under an additional smoothness assumption on $F_{i}$ (Shen and Chen, 2022); we match the rate without requiring this restrictive assumption. In the absence of the smoothness condition, our rate is order-optimal (up to a logarithmic factor) and significantly improves over the best-known rate of MSA, which is $\widetilde{{\cal O}}(1/t^{2/3})$ when $N=2$, as established in Hong et al. (2023); Zeng et al. (2024b); Han et al. (2024). In Section 3.2, we further show that the convergence rate of MSA is $\widetilde{{\cal O}}(1/t^{1/2})$ when $N=3$ and deteriorates as $N$ increases. This observation highlights the advantage of the proposed algorithm, whose convergence rate is optimal for arbitrary $N$.

3.1 Proof of Theorem 1

We introduce the following technical lemmas, which set up important intermediate steps toward the results in Theorem 1. These lemmas establish useful upper bounds on $\|\Delta f_{i}^{[t]}\|^{2}$ and $\|x_{i}^{[t]}\|^{2}$ in expectation, revealing the interdependence of these two quantities.

Lemma 1

Under Assumptions 1–4, the iterates of Algorithm 1 satisfy for all $t\geq\tau_{t}$

$$\begin{aligned}\mathbb{E}[\|\Delta f_{i}^{[t+1]}\|^{2}]&\leq(1-\lambda^{[t]})\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]-\frac{\lambda^{[t]}}{4}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+DN^{2}\tau_{t}^{2}\lambda^{[t]}\lambda^{[t-\tau_{t}]}\\ &\quad+DN^{2}\tau_{t}^{2}\Big(\lambda^{[t]}\lambda^{[t-\tau_{t}]}+\frac{(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}}\Big)\Big(\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}\Big),\end{aligned}$$

where $D$ is a constant depending only (polynomially) on the problem constants.

Lemma 2

Under Assumptions 1–4, the iterates of Algorithm 1 satisfy for all $t$

$$\begin{aligned}\|x_{i}^{[t+1]}\|^{2}&\leq\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+\sum_{j=1}^{i-1}\frac{\delta\alpha_{j}^{[t]}}{8N}\|x_{j}^{[t]}\|^{2}\\ &\quad+\Big(\frac{9N^{3}L^{6}}{\delta}+8N^{2}L^{3}\Big)\alpha_{i}^{[t]}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}+\Big(\frac{3}{\delta}+L\Big)\alpha_{i}^{[t]}\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}.\end{aligned} \tag{24}$$

In (24), the terms $\|x_{j}^{[t]}\|^{2}$ for $j<i$ and $j>i$ are scaled by different step sizes ($\alpha_{j}^{[t]}$ versus $\alpha_{i}^{[t]}$). This subtle difference is the key to establishing the optimal complexity of Algorithm 1.

It is worth noting that if the iterates $\theta_{i}^{[t]}$ are generated by the MSA algorithm, a per-iteration analysis of $x_{i}^{[t]}$ similar to Lemma 2 can be established. We carry out this analysis in Section 3.2 (Lemma 4) and point out the distinctions that allow A-MSA to converge faster.

We defer the proofs of the lemmas to the appendix, but point out that the following bound on $\|\theta_{i}^{[t+1]}-\theta_{i}^{[t]}\|$, used to prove Lemma 2, provides an important tool for controlling the stability of the iterates. The implication of Lemma 3 is that the distance $\|\theta_{i}^{[t+1]}-\theta_{i}^{[t]}\|$ converges to zero faster than ${\cal O}(\alpha_{i}^{[t]})$ if $\|\Delta f_{i}^{[t]}\|$ and $\sum_{j=i}^{N}\|x_{j}^{[t]}\|$ both converge.

Lemma 3

Under Assumptions 1–4, the iterates of Algorithm 1 satisfy for all $i,t$

$$\|\theta_{i}^{[t+1]}-\theta_{i}^{[t]}\|\leq\alpha_{i}^{[t]}\Big(\|\Delta f_{i}^{[t]}\|+NL^{2}\sum_{j=i}^{N}\|x_{j}^{[t]}\|\Big).$$
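To see where this bound comes from (a sketch; the full argument is in the appendix): by (7) and the definition of $\Delta f_{i}^{[t]}$,

$$\|\theta_{i}^{[t+1]}-\theta_{i}^{[t]}\|=\alpha_{i}^{[t]}\|f_{i}^{[t]}\|\leq\alpha_{i}^{[t]}\big(\|\Delta f_{i}^{[t]}\|+\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|\big),$$

and since $\widebar{F}_{i}$ vanishes at $(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))$, the Lipschitz conditions (15)–(16) bound $\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|$ by $NL^{2}\sum_{j=i}^{N}\|x_{j}^{[t]}\|$ after unrolling the learning targets $y_{j}$.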

Proof (of Theorem 1).

The proof follows in a straightforward manner by combining the bounds in Lemmas 1 and 2 and choosing appropriate step sizes. Recall the Lyapunov function (23):

$$\begin{aligned}V^{[t+1]}&=\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t+1]}\|^{2}+\|x_{i}^{[t+1]}\|^{2}]\\ &\leq(1-\lambda^{[t]})\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]-\frac{\lambda^{[t]}}{4}\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+DN^{2}\tau_{t}^{2}\lambda^{[t]}\lambda^{[t-\tau_{t}]}\\ &\quad+DN^{2}\tau_{t}^{2}\sum_{i=1}^{N}\Big(\lambda^{[t]}\lambda^{[t-\tau_{t}]}+\frac{(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}}\Big)\Big(\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}\Big)\\ &\quad+\sum_{i=1}^{N}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]-\sum_{i=1}^{N}\frac{\delta\alpha_{i}^{[t]}}{4}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]+\sum_{i=1}^{N}\sum_{j=1}^{i-1}\frac{\delta\alpha_{j}^{[t]}}{8N}\mathbb{E}[\|x_{j}^{[t]}\|^{2}]\\ &\quad+\sum_{i=1}^{N}\Big(\frac{9N^{3}L^{6}}{\delta}+8N^{2}L^{3}\Big)\alpha_{i}^{[t]}\sum_{j=i+1}^{N}\mathbb{E}[\|x_{j}^{[t]}\|^{2}]+\Big(\frac{3}{\delta}+L\Big)\sum_{i=1}^{N}\alpha_{i}^{[t]}\sum_{j=1}^{N}\mathbb{E}[\|\Delta f_{j}^{[t]}\|^{2}]\\ &\leq(1-\lambda^{[t]})\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]-\frac{\lambda^{[t]}}{4}\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+\widetilde{{\cal O}}\Big(\frac{1}{(t+1)^{2}}\Big)\\ &\quad+DN^{3}\alpha_{N}^{[t]}\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+DN^{3}\sum_{i=1}^{N}\frac{(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]\\ &\quad+DN^{3}\tau_{t}^{2}\lambda^{[t]}\lambda^{[t-\tau_{t}]}\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+DN^{3}\tau_{t}^{2}\lambda^{[t]}\lambda^{[t-\tau_{t}]}\sum_{i=1}^{N}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]\\ &\quad+\sum_{i=1}^{N}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]-\sum_{i=1}^{N}\frac{\delta\alpha_{i}^{[t]}}{4}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]+\sum_{i=1}^{N}\frac{\delta\alpha_{i}^{[t]}}{8}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]\\ &\quad+\Big(\frac{9N^{4}L^{6}}{\delta}+8N^{3}L^{3}\Big)\sum_{i=2}^{N}\alpha_{i-1}^{[t]}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]+\Big(\frac{3}{\delta}+L\Big)\alpha_{N}^{[t]}\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]\\ &\leq(1-\lambda^{[t]})\sum_{i=1}^{N}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+\Big(1-\frac{\delta\alpha_{1}^{[t]}}{16}\Big)\sum_{i=1}^{N}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]+\widetilde{{\cal O}}\Big(\frac{1}{(t+1)^{2}}\Big)\\ &\quad-\sum_{i=1}^{N}\Big(\frac{\lambda^{[t]}}{8}-\Big(DN^{3}+\frac{3}{\delta}+L\Big)\alpha_{N}^{[t]}\Big)\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]\\ &\quad-\sum_{i=1}^{N}\Big(\frac{\delta\alpha_{i}^{[t]}}{16}-\frac{DN^{3}(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}}-\Big(\frac{9N^{4}L^{6}}{\delta}+8N^{3}L^{3}\Big)\alpha_{i-1}^{[t]}\Big)\mathbb{E}[\|x_{i}^{[t]}\|^{2}],\end{aligned} \tag{25}$$

where the first inequality plugs in the bounds from Lemmas 1 and 2, and in the last inequality we have artificially introduced $\alpha_{0}^{[t]}=0$ and used the relation $DN^{3}\tau_{t}^{2}\lambda^{[t-\tau_{t}]}\leq 1/8$. Note that the last two terms of (25) are non-positive under the step size conditions $\alpha_{i}^{[t]}\leq\frac{\lambda^{[t]}}{8}(DN^{3}+\frac{3}{\delta}+L)^{-1}$, $\alpha_{i}^{[t]}\leq\frac{\delta\lambda^{[t]}}{32DN^{3}}$, and $\alpha_{i-1}^{[t]}\leq\frac{\delta\alpha_{i}^{[t]}}{32}(\frac{9N^{4}L^{6}}{\delta}+8N^{3}L^{3})^{-1}$. As a result, we have

$$\begin{aligned}V^{[t+1]}&\leq\max\Big\{1-\lambda^{[t]},1-\frac{\delta\alpha_{1}^{[t]}}{16}\Big\}V^{[t]}+\widetilde{{\cal O}}\Big(\frac{1}{(t+1)^{2}}\Big)\\ &\leq\Big(1-\frac{\delta\alpha_{1}^{[t]}}{16}\Big)V^{[t]}+\widetilde{{\cal O}}\Big(\frac{1}{(t+1)^{2}}\Big)\\ &\leq\Big(1-\frac{2}{t+h+1}\Big)V^{[t]}+\widetilde{{\cal O}}\Big(\frac{1}{(t+1)^{2}}\Big),\end{aligned}$$

due to $\alpha_{1}^{[t]}\leq\frac{16\lambda^{[t]}}{\delta}$ and $c_{1}=\frac{32}{\delta}$. Multiplying both sides by $(t+h+1)^{2}$, we have

$$\begin{aligned}(t+h+1)^{2}V^{[t+1]}&\leq(t+h+1)(t+h-1)V^{[t]}+\widetilde{{\cal O}}(1)\\ &\leq(t+h)^{2}V^{[t]}+\widetilde{{\cal O}}(1)\\ &\leq h^{2}V^{[\tau_{t}]}+\widetilde{{\cal O}}(t+1),\end{aligned}$$

which, after dividing by $(t+h+1)^{2}$, gives the desired result.

$\Box$

3.2 Standard Multi-Time-Scale Stochastic Approximation

In this section, we present the finite-time complexity of the MSA algorithm derived using the techniques in Hong et al. (2023); Zeng et al. (2024b), to highlight its gap from the optimal rate $\widetilde{{\cal O}}(1/t)$ established in Theorem 1. We start with the case $N=3$, where we want to solve

$$\widebar{F}_{i}(\boldsymbol{\theta})=0,\quad\forall i=1,2,3.$$

For simplicity, we assume (in this section only) that the stationary distribution $\mu_{\boldsymbol{\theta}}=\mu$ is independent of $\boldsymbol{\theta}$ and that each $X^{[t]}$ is an i.i.d. sample from $\mu$. We additionally assume that $F_{i}$ is uniformly bounded, as follows. This is not a necessary assumption, but is adopted only to simplify the analysis of MSA; we do not make this assumption in the analysis of the proposed A-MSA algorithm.

Assumption 5

There exists a constant $D>0$ such that for all $i=1,2,3$, $\theta_{i}\in\mathbb{R}^{d_{i}}$, and $X\in{\cal X}$,

$$\|F_{i}(\theta_{1},\theta_{2},\theta_{3},X)\|\leq D.$$

Recall the update rule of MSA in (6). We measure convergence by $x_{i}^{[t]}$, the same metric (20) used for analyzing A-MSA:

$$x_{i}^{[t]}=\theta_{i}^{[t]}-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]}).$$

We next establish the per-iteration bound for $x_{i}^{[t]}$; the analysis can be found in Appendix A.4.

Lemma 4

We artificially define $\alpha_{0}^{[t]}=0$ for all $t$. Under Assumptions 3–5, the iterates of the MSA algorithm in (6) satisfy for all $t\geq 0$

$$\begin{aligned}\mathbb{E}[\|x_{i}^{[t+1]}\|^{2}]&\leq\mathbb{E}[\|x_{i}^{[t]}\|^{2}]-\frac{\delta\alpha_{i}^{[t]}}{4}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\mathbb{E}[\|x_{j}^{[t]}\|^{2}]\\ &\quad+8N^{2}L^{2}D^{2}(\alpha_{i}^{[t]})^{2}+\frac{N^{2}L^{2}D^{2}(\alpha_{i-1}^{[t]})^{2}}{\delta\alpha_{i}^{[t]}}.\end{aligned}$$

It is worth noting that Lemma 4 holds for any $N$. We apply this lemma to the case $N=3$ and consider the Lyapunov function

V[t]=๐”ผโ€‹[โ€–x1[t]โ€–2+v2[t]โ€‹โ€–x2[t]โ€–2+v3[t]โ€‹โ€–x3[t]โ€–2],\displaystyle V^{[t]}=\mathbb{E}[\|x_{1}^{[t]}\|^{2}+v_{2}^{[t]}\|x_{2}^{[t]}\|^{2}+v_{3}^{[t]}\|x_{3}^{[t]}\|^{2}], (26)

where $v_{2}^{[t]}=\frac{1728L^{6}\alpha_{1}^{[t]}}{\delta^{2}\alpha_{2}^{[t]}}$ and $v_{3}^{[t]}=\frac{8\alpha_{1}^{[t]}}{\delta\alpha_{3}^{[t]}}\big(\frac{216L^{6}}{\delta}+\frac{373248L^{12}}{\delta^{3}}\big)$.

Theorem 2

Let the step sizes be

$$\alpha_{1}^{[t]}=\frac{c_{1}}{t+h+1},\quad\alpha_{2}^{[t]}=\frac{c_{2}}{(t+h+1)^{3/4}},\quad\alpha_{3}^{[t]}=\frac{c_{3}}{(t+h+1)^{1/2}}, \tag{27}$$

with proper choices of the constants $c_{1},c_{2},c_{3}$, and $h$. Under Assumptions 3–5, the iterates of the MSA algorithm in (6) with i.i.d. samples $X^{[t]}\sim\mu$ satisfy for all $t\geq 0$

V[t+1]โ‰ค๐’ชโ€‹(V[0](t+ฯ„+1)1/2).\displaystyle V^{[t+1]}\leq{\cal O}\left(\frac{V^{[0]}}{(t+\tau+1)^{1/2}}\right).

Proof.

With $N=3$, Lemma 4 implies

$$\begin{aligned}\mathbb{E}[\|x_{1}^{[t+1]}\|^{2}]&\leq\Big(1-\frac{\delta\alpha_{1}^{[t]}}{4}\Big)\mathbb{E}[\|x_{1}^{[t]}\|^{2}]+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}\mathbb{E}[\|x_{2}^{[t]}\|^{2}+\|x_{3}^{[t]}\|^{2}]+72L^{2}D^{2}(\alpha_{1}^{[t]})^{2},\\ \mathbb{E}[\|x_{2}^{[t+1]}\|^{2}]&\leq\Big(1-\frac{\delta\alpha_{2}^{[t]}}{4}\Big)\mathbb{E}[\|x_{2}^{[t]}\|^{2}]+\frac{216L^{6}\alpha_{2}^{[t]}}{\delta}\mathbb{E}[\|x_{3}^{[t]}\|^{2}]+72L^{2}D^{2}(\alpha_{2}^{[t]})^{2}+\frac{9L^{2}D^{2}(\alpha_{1}^{[t]})^{2}}{\delta\alpha_{2}^{[t]}},\\ \mathbb{E}[\|x_{3}^{[t+1]}\|^{2}]&\leq\Big(1-\frac{\delta\alpha_{3}^{[t]}}{4}\Big)\mathbb{E}[\|x_{3}^{[t]}\|^{2}]+72L^{2}D^{2}(\alpha_{3}^{[t]})^{2}+\frac{9L^{2}D^{2}(\alpha_{2}^{[t]})^{2}}{\delta\alpha_{3}^{[t]}}.\end{aligned}$$

Recall the Lyapunov function $V^{[t]}=\mathbb{E}[\|x_{1}^{[t]}\|^{2}+v_{2}^{[t]}\|x_{2}^{[t]}\|^{2}+v_{3}^{[t]}\|x_{3}^{[t]}\|^{2}]$. Note that $v_{2}^{[t+1]}\leq v_{2}^{[t]}$ and $v_{3}^{[t+1]}\leq v_{3}^{[t]}$, since $\alpha_{1}^{[t]}$ decays faster than $\alpha_{2}^{[t]},\alpha_{3}^{[t]}$. We have

$$\begin{aligned}V^{[t+1]}&=\mathbb{E}[\|x_{1}^{[t+1]}\|^{2}+v_{2}^{[t+1]}\|x_{2}^{[t+1]}\|^{2}+v_{3}^{[t+1]}\|x_{3}^{[t+1]}\|^{2}]\\ &\leq\mathbb{E}[\|x_{1}^{[t+1]}\|^{2}+v_{2}^{[t]}\|x_{2}^{[t+1]}\|^{2}+v_{3}^{[t]}\|x_{3}^{[t+1]}\|^{2}]\\ &\leq\Big(1-\frac{\delta\alpha_{1}^{[t]}}{4}\Big)\mathbb{E}[\|x_{1}^{[t]}\|^{2}]+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}\mathbb{E}[\|x_{2}^{[t]}\|^{2}+\|x_{3}^{[t]}\|^{2}]+72L^{2}D^{2}(\alpha_{1}^{[t]})^{2}\\ &\quad+v_{2}^{[t]}\Big(1-\frac{\delta\alpha_{2}^{[t]}}{4}\Big)\mathbb{E}[\|x_{2}^{[t]}\|^{2}]+\frac{216L^{6}\alpha_{2}^{[t]}v_{2}^{[t]}}{\delta}\mathbb{E}[\|x_{3}^{[t]}\|^{2}]+72L^{2}D^{2}(\alpha_{2}^{[t]})^{2}v_{2}^{[t]}\\ &\quad+\frac{9L^{2}D^{2}(\alpha_{1}^{[t]})^{2}v_{2}^{[t]}}{\delta\alpha_{2}^{[t]}}+v_{3}^{[t]}\Big(1-\frac{\delta\alpha_{3}^{[t]}}{4}\Big)\mathbb{E}[\|x_{3}^{[t]}\|^{2}]\\ &\quad+72L^{2}D^{2}(\alpha_{3}^{[t]})^{2}v_{3}^{[t]}+\frac{9L^{2}D^{2}(\alpha_{2}^{[t]})^{2}v_{3}^{[t]}}{\delta\alpha_{3}^{[t]}}\\ &=\Big(1-\frac{\delta\alpha_{1}^{[t]}}{4}\Big)\mathbb{E}[\|x_{1}^{[t]}\|^{2}+v_{2}^{[t]}\|x_{2}^{[t]}\|^{2}+v_{3}^{[t]}\|x_{3}^{[t]}\|^{2}]+72L^{2}D^{2}(\alpha_{1}^{[t]})^{2}\\ &\quad+\frac{124416L^{8}D^{2}\alpha_{1}^{[t]}\alpha_{2}^{[t]}}{\delta^{2}}+\frac{15552L^{8}D^{2}(\alpha_{1}^{[t]})^{3}}{\delta^{3}(\alpha_{2}^{[t]})^{2}}\\ &\quad+\Big(\frac{572L^{2}D^{2}\alpha_{1}^{[t]}\alpha_{3}^{[t]}}{\delta}+\frac{9L^{2}D^{2}\alpha_{1}^{[t]}(\alpha_{2}^{[t]})^{2}}{\delta^{2}(\alpha_{3}^{[t]})^{2}}\Big)\Big(\frac{216L^{6}}{\delta}+\frac{373248L^{12}}{\delta^{3}}\Big)\\ &\quad+\Big(\frac{\delta\alpha_{1}^{[t]}v_{2}^{[t]}}{4}-\frac{\delta\alpha_{2}^{[t]}v_{2}^{[t]}}{4}+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}\Big)\mathbb{E}[\|x_{2}^{[t]}\|^{2}]\\ &\quad+\Big(\frac{\delta\alpha_{1}^{[t]}v_{3}^{[t]}}{4}-\frac{\delta\alpha_{3}^{[t]}v_{3}^{[t]}}{4}+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}+\frac{216L^{6}\alpha_{2}^{[t]}v_{2}^{[t]}}{\delta}\Big)\mathbb{E}[\|x_{3}^{[t]}\|^{2}]\\ &\leq\Big(1-\frac{\delta\alpha_{1}^{[t]}}{4}\Big)V^{[t]}+72L^{2}D^{2}(\alpha_{1}^{[t]})^{2}+\frac{124416L^{8}D^{2}\alpha_{1}^{[t]}\alpha_{2}^{[t]}}{\delta^{2}}+\frac{15552L^{8}D^{2}(\alpha_{1}^{[t]})^{3}}{\delta^{3}(\alpha_{2}^{[t]})^{2}}\\ &\quad+\Big(\frac{572L^{2}D^{2}\alpha_{1}^{[t]}\alpha_{3}^{[t]}}{\delta}+\frac{9L^{2}D^{2}\alpha_{1}^{[t]}(\alpha_{2}^{[t]})^{2}}{\delta^{2}(\alpha_{3}^{[t]})^{2}}\Big)\Big(\frac{216L^{6}}{\delta}+\frac{373248L^{12}}{\delta^{3}}\Big),\end{aligned} \tag{28}$$

where the last inequality is a result of

$$\begin{aligned}&\frac{\delta\alpha_{1}^{[t]}v_{2}^{[t]}}{4}-\frac{\delta\alpha_{2}^{[t]}v_{2}^{[t]}}{4}+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}=\frac{\delta\alpha_{1}^{[t]}v_{2}^{[t]}}{4}-\frac{\delta\alpha_{2}^{[t]}v_{2}^{[t]}}{8}\leq 0,\\ &\frac{\delta\alpha_{1}^{[t]}v_{3}^{[t]}}{4}-\frac{\delta\alpha_{3}^{[t]}v_{3}^{[t]}}{4}+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}+\frac{216L^{6}\alpha_{2}^{[t]}v_{2}^{[t]}}{\delta}\\ &=\frac{\delta\alpha_{1}^{[t]}v_{3}^{[t]}}{4}-\frac{\delta\alpha_{3}^{[t]}v_{3}^{[t]}}{4}+\frac{216L^{6}\alpha_{1}^{[t]}}{\delta}+\frac{373248L^{12}\alpha_{1}^{[t]}}{\delta^{3}}=\frac{\delta\alpha_{1}^{[t]}v_{3}^{[t]}}{4}-\frac{\delta\alpha_{3}^{[t]}v_{3}^{[t]}}{8}\leq 0,\end{aligned}$$

which follow if the step sizes satisfy $\alpha_{1}^{[t]}\leq\frac{\alpha_{2}^{[t]}}{2}\leq\frac{\alpha_{3}^{[t]}}{2}$.

Under the step size rule (27) with $c_{1}\geq\frac{8}{\delta}$, it can be easily derived that

$$V^{[t+1]}\leq\Big(1-\frac{2}{t+h+1}\Big)V^{[t]}+{\cal O}(1/t^{3/2}),$$

which gives

V[t+1]โ‰ค๐’ชโ€‹(V[0](t+h+1)1/2).\displaystyle V^{[t+1]}\leq{\cal O}(\frac{V^{[0]}}{(t+h+1)^{1/2}}).

โ–ก\Box
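
As a quick numerical illustration of the final step (not part of the proof), the recursion above can be iterated directly. The minimal Python sketch below uses arbitrary illustrative constants for hh, CC, and the initial value; the printed product of V[t]V^{[t]} and (t+h+1)1/2(t+h+1)^{1/2} levels off to a constant, consistent with the claimed ๐’ช~โ€‹(tโˆ’1/2)\widetilde{{\cal O}}(t^{-1/2}) decay.

# Sanity check of the recursion V[t+1] = (1 - 2/(t+h+1)) V[t] + C / t^{3/2}.
# h, C, and the initial V are arbitrary illustrative constants, not values from the paper.
h, C, V = 10, 1.0, 5.0
for t in range(1, 100001):
    V = (1 - 2 / (t + h + 1)) * V + C / t ** 1.5
    if t in (10, 100, 1000, 10000, 100000):
        # the product V * sqrt(t+h+1) should stabilize, matching the O(t^{-1/2}) rate
        print(f"t={t:6d}  V={V:.6f}  V*sqrt(t+h+1)={V * (t + h + 1) ** 0.5:.4f}")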

We now discuss how Theoremย 2 can be extended to more than three equations and what the expected convergence rate should be. When N=2N=2, Zeng etย al. (2024b) shows that the convergence rate is on the order of (up to a logarithmic factor)

maxโก{ฮฑ1[t],ฮฑ2[t],(ฮฑ1[t]ฮฑ2[t])2}.\displaystyle\max\Big{\{}\alpha_{1}^{[t]},\alpha_{2}^{[t]},\Big{(}\frac{\alpha_{1}^{[t]}}{\alpha_{2}^{[t]}}\Big{)}^{2}\Big{\}}. (29)

To optimize (29), the step sizes should be selected as ฮฑ1[t]โˆผ1t,ฮฑ2[t]โˆผ1t2/3\alpha_{1}^{[t]}\sim\frac{1}{t},\alpha_{2}^{[t]}\sim\frac{1}{t^{2/3}}, which leads to a convergence rate of ๐’ช~โ€‹(tโˆ’2/3)\widetilde{{\cal O}}(t^{-2/3}). When N=3N=3, the convergence rate is on the order of

maxโก{ฮฑ1[t],ฮฑ2[t],ฮฑ3[t],(ฮฑ1[t]ฮฑ2[t])2,(ฮฑ2[t]ฮฑ3[t])2}.\displaystyle\max\Big{\{}\alpha_{1}^{[t]},\alpha_{2}^{[t]},\alpha_{3}^{[t]},\Big{(}\frac{\alpha_{1}^{[t]}}{\alpha_{2}^{[t]}}\Big{)}^{2},\Big{(}\frac{\alpha_{2}^{[t]}}{\alpha_{3}^{[t]}}\Big{)}^{2}\Big{\}}.

which dictates the choice of step sizes in (27).

For general NN, the convergence rate is on the order of

maxโก{ฮฑ1[t],โ‹ฏ,ฮฑN[t],(ฮฑ1[t]ฮฑ2[t])2,โ‹ฏ,(ฮฑNโˆ’1[t]ฮฑN[t])2}.\displaystyle\max\Big{\{}\alpha_{1}^{[t]},\cdots,\alpha_{N}^{[t]},\Big{(}\frac{\alpha_{1}^{[t]}}{\alpha_{2}^{[t]}}\Big{)}^{2},\cdots,\Big{(}\frac{\alpha_{N-1}^{[t]}}{\alpha_{N}^{[t]}}\Big{)}^{2}\Big{\}}. (30)

It is straightforward to see that the optimal decay rate of (30) is ๐’ช~โ€‹(tโˆ’2/(N+1))\widetilde{{\cal O}}(t^{-2/(N+1)}), which implies the sample complexity in Tableย 1.
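
As a concrete illustration of how (30) determines the step sizes, write ฮฑi[t]โˆผtโˆ’ai\alpha_{i}^{[t]}\sim t^{-a_{i}} and balance ฮฑN[t]\alpha_{N}^{[t]} against each ratio term with a1=1a_{1}=1; this gives ai=(N+2โˆ’i)/(N+1)a_{i}=(N+2-i)/(N+1). The short Python sketch below (our own illustration, not code from the paper) computes these exponents exactly and verifies that every term of (30) then decays with exponent 2/(N+1)2/(N+1).

from fractions import Fraction

def step_size_exponents(N):
    # exponents a_i with alpha_i^{[t]} ~ t^{-a_i}: setting a_1 = 1 and
    # a_N = 2*(a_i - a_{i+1}) for every i solves to a_i = (N + 2 - i) / (N + 1)
    return [Fraction(N + 2 - i, N + 1) for i in range(1, N + 1)]

for N in (2, 3, 4):
    a = step_size_exponents(N)
    ratio_exponents = [2 * (a[i] - a[i + 1]) for i in range(N - 1)]
    # the decay exponent of the max in (30) is the smallest exponent present
    print(N, [str(x) for x in a], "rate exponent:", min([a[-1]] + ratio_exponents))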

3.3 Reflecting on Faster Convergence of A-MSA

Lemmasย 2 and 4 establish critical per-iteration convergence bounds on the errors in the decision variable for A-MSA and MSA, respectively. In this section, we contrast corresponding terms in the bounds to pinpoint the advantage of Lemmaย 2 over Lemmaย 4. In the analysis of โ€–xi[t+1]โ€–2\|x_{i}^{[t+1]}\|^{2} under the two algorithms, the common/comparable terms include the following:

  • โ€ข

    The same term โ€–yiโ€‹(๐œฝ1:iโˆ’1[t+1])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t])โ€–\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})\| arises from the shift in the learning target yiโ€‹(๐œฝ1:iโˆ’1[t])y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]}) across iterations and appears in the proofs of both Lemmas 2 and 4 (in Sectionsย A.2 and A.4, respectively). Due to the Lipschitz continuity of yiy_{i} imposed in Assumptionย 4, controlling this term essentially requires handling

    โˆ‘j=1iโˆ’1โ€–ฮธj[t+1]โˆ’ฮธj[t]โ€–.\displaystyle\sum_{j=1}^{i-1}\|\theta_{j}^{[t+1]}-\theta_{j}^{[t]}\|. (31)

    Under MSA, there is currently no better bound on (31) than โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โ€–\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})\|. As โ€–Fiโ€‹(๐œฝ[t],X[t])โ€–\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})\| is on the order of a constant even when ๐œฝ[t]\boldsymbol{\theta}^{[t]} approaches the optimal solution (due to the random sample X[t]X^{[t]}), the bound eventually becomes ๐’ชโ€‹(ฮฑiโˆ’1){\cal O}(\alpha_{i-1}).

    In comparison, under A-MSA, we can control (31) with Lemmaย 3 and show

    โˆ‘j=1iโˆ’1โ€–ฮธj[t+1]โˆ’ฮธj[t]โ€–โ‰ค๐’ชโ€‹(โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–+โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–xj[t]โ€–+ฮฑiโˆ’1[t]โ€‹โˆ‘j=iNโ€–xj[t]โ€–).\displaystyle\sum_{j=1}^{i-1}\|\theta_{j}^{[t+1]}-\theta_{j}^{[t]}\|\leq{\cal O}\Big{(}\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|+\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|x_{j}^{[t]}\|+\alpha_{i-1}^{[t]}\sum_{j=i}^{N}\|x_{j}^{[t]}\|\Big{)}. (32)

    Since โ€–ฮ”โ€‹fj[t]โ€–\|\Delta f_{j}^{[t]}\| and โ€–xj[t]โ€–\|x_{j}^{[t]}\| themselves decay to zero as the iterations proceed, the convergence rate of (32) is much faster than ๐’ชโ€‹(ฮฑiโˆ’1){\cal O}(\alpha_{i-1}).

    Takeaway: Our objective is to solve a highly coupled system (5). The discussion above highlights how the A-MSA algorithm effectively decouples the decision variables, in the sense that the inaccuracy of the upper-level variables has a reduced effect on the lower-level ones. Being able to decouple the decision variable updates is crucial, as it allows us to choose ฮฑi[t]\alpha_{i}^{[t]} across ii more independently.

  • โ€ข

    โ€–ฮ”โ€‹fi[t]โ€–2=โ€–fi[t]โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2\|\Delta f_{i}^{[t]}\|^{2}=\|f_{i}^{[t]}-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2} is an error appearing in the proof of Lemmaย 2 for A-MSA, whereas โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2} is the comparable term in the proof of Lemmaย 4 for MSA. They capture the variance in the estimation of Fยฏiโ€‹(๐œฝ[t])\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}). In the case of MSA, โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2} does not decay to zero (again due to the randomness in X[t]X^{[t]}), which eventually becomes a bottleneck in achieving the optimal rate. In contrast, โ€–ฮ”โ€‹fi[t]โ€–2\|\Delta f_{i}^{[t]}\|^{2} under A-MSA can be expected to converge to zero (under a careful analysis).

    Takeaway: Compared to MSA, A-MSA provides a low-variance estimate of Fยฏiโ€‹(๐œฝ[t])\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}). The variance decaying sufficiently fast is another key driver of the overall improved complexity.

Remark 2

At the end of the section, we point out an important additional advantage of the analysis of A-MSA over that of MSA. Observe that the Lyapunov function (26) considered in Theoremย 2 weights the lower-level residuals x2[t]x_{2}^{[t]} and x3[t]x_{3}^{[t]} by v2[t]v_{2}^{[t]} and v3[t]v_{3}^{[t]}. The weights have been carefully chosen to ensure the proper cancellation of errors across different levels. Choosing such weights, however, requires non-trivial effort and becomes increasingly inconvenient as NN grows.

In addition, we note that the weights v2[t]v_{2}^{[t]} and v3[t]v_{3}^{[t]} themselves are decaying sequences. Theoremย 2 shows that V[t]V^{[t]} decays with rate ๐’ช~โ€‹(tโˆ’1/2)\widetilde{{\cal O}}(t^{-1/2}), which only implies ๐”ผโ€‹[โ€–x2[t]โ€–2]โ‰ค๐’ช~โ€‹(tโˆ’1/4)\mathbb{E}[\|x_{2}^{[t]}\|^{2}]\leq\widetilde{{\cal O}}(t^{-1/4}) and ๐”ผโ€‹[โ€–x3[t]โ€–2]โ‰ค๐’ช~โ€‹(1)\mathbb{E}[\|x_{3}^{[t]}\|^{2}]\leq\widetilde{{\cal O}}(1). In other words, the guaranteed convergence rates of the lower-level variables do not match that of the highest level variable and may even be vacuous. In contrast, the Lyapunov function (23) for the analysis of A-MSA simply combines โ€–xi[t]โ€–2\|x_{i}^{[t]}\|^{2} over all levels without additional weights. As a result, the decision variables at all levels are guaranteed to converge at ๐’ช~โ€‹(tโˆ’1)\widetilde{{\cal O}}(t^{-1}).

4 Motivating Applications

In this section we discuss how the framework (5) models the objective of a range of algorithms in RL and games. In some cases, the objective is structured such that Assumptionsย 2-3 can be verified to hold, meaning that the theoretical guarantee immediately applies if the A-MSA algorithm is used. For the other problems, the assumptions may not hold, but we discuss how the analysis can be adapted to derive the complexity of A-MSA by using the specific problem structure. All applications discussed are special cases of (5) with N=3N=3, meaning that they cannot be covered by the bilevel framework in Zeng etย al. (2024b); Han etย al. (2024); Doan (2024); Zeng and Doan (2024).

4.1 Gradient-Based Temporal Difference Learning with Momentum

Gradient-based temporal difference (TD) learning algorithms including GTD2 and TDC are a popular class of methods that stably minimize a projected Bellman error under linear function approximation and off-policy samples (Sutton etย al., 2008, 2009). Consider an infinite-horizon discounted-reward MDP characterized by (๐’ฎ,๐’œ,๐’ซ,r,ฮณ)({\cal S},{\cal A},{\cal P},r,\gamma). Here ๐’ฎ{\cal S} and ๐’œ{\cal A} are the state and action spaces. Each state ss is associated with a feature vector ฯ•โ€‹(s)โˆˆโ„d\phi(s)\in\mathbb{R}^{d}. ๐’ซ:๐’ฎร—๐’œโ†’ฮ”๐’ฎ{\cal P}:{\cal S}\times{\cal A}\rightarrow\Delta_{{\cal S}} is the transition probability kernel, with ๐’ซโ€‹(sโ€ฒโˆฃs,a){\cal P}(s^{\prime}\mid s,a) denoting the probability of the MDP transitioning from state ss to sโ€ฒs^{\prime} under action aa. r:๐’ฎร—๐’œโ†’[0,1]r:{\cal S}\times{\cal A}\rightarrow[0,1] is the reward function. ฮณโˆˆ(0,1)\gamma\in(0,1) is the discount factor. Given a fixed policy ฯ€\pi to evaluate, our objective is to learn a value function parameter ฮธโˆˆโ„d\theta\in\mathbb{R}^{d} such that ฯ•โ€‹(s)โŠคโ€‹ฮธ\phi(s)^{\top}\theta approximates the value function Vฯ€โ€‹(s)V^{\pi}(s) for every state sโˆˆ๐’ฎs\in{\cal S}. The TDC algorithm formulates the objective as a system of two equations in ฮธ\theta and an auxiliary variable ฯ‰โˆˆโ„d\omega\in\mathbb{R}^{d}, i.e., โˆ€sโˆˆ๐’ฎ\forall s\in{\cal S} we aim to solve

๐”ผaโˆผฯ€(โ‹…โˆฃs),sโ€ฒโˆผ๐’ซ(โ‹…โˆฃs,a)โ€‹[rโ€‹(s,a)โ€‹ฯ•โ€‹(s)+ฮณโ€‹ฯ•โ€‹(sโ€ฒ)โŠคโ€‹ฮธโ€‹ฯ•โ€‹(s)โˆ’ฯ•โ€‹(s)โŠคโ€‹ฮธโ€‹ฯ•โ€‹(s)โˆ’ฮณโ€‹ฯ•โ€‹(sโ€ฒ)โ€‹ฯ•โ€‹(s)โŠคโ€‹ฯ‰]=0,๐”ผaโˆผฯ€(โ‹…โˆฃs),sโ€ฒโˆผ๐’ซ(โ‹…โˆฃs,a)โ€‹[rโ€‹(s,a)โ€‹ฯ•โ€‹(s)+ฮณโ€‹ฯ•โ€‹(sโ€ฒ)โŠคโ€‹ฮธโ€‹ฯ•โ€‹(s)โˆ’ฯ•โ€‹(s)โŠคโ€‹ฮธโ€‹ฯ•โ€‹(s)โˆ’ฯ•โ€‹(s)โ€‹ฯ•โ€‹(s)โŠคโ€‹ฯ‰]=0.\displaystyle\begin{aligned} &\mathbb{E}_{a\sim\pi(\cdot\mid s),s^{\prime}\sim{\cal P}(\cdot\mid s,a)}[r(s,a)\phi(s)+\gamma\phi(s^{\prime})^{\top}\theta\phi(s)-\phi(s)^{\top}\theta\phi(s)-\gamma\phi(s^{\prime})\phi(s)^{\top}\omega]=0,\\ &\mathbb{E}_{a\sim\pi(\cdot\mid s),s^{\prime}\sim{\cal P}(\cdot\mid s,a)}[r(s,a)\phi(s)+\gamma\phi(s^{\prime})^{\top}\theta\phi(s)-\phi(s)^{\top}\theta\phi(s)-\phi(s)\phi(s)^{\top}\omega]=0.\end{aligned} (33)

A momentum term can be added to the update of the variable ฮธ\theta to accelerate the convergence of this method. Deb and Bhatnagar (2022) studies this approach and shows that the objective of TDC with momentum can be described as solving a system of three equations. Compared with (33), the additional time scale is introduced to force the momentum term to decay to zero, which becomes the equation at the lowest level. The second equation in (33) is primarily used to solve for ฯ‰\omega and lies at the middle level, whereas ฮธ\theta is the variable associated with the highest level equation. We skip the formulation details here and point interested readers to Deb and Bhatnagar (2022, Section 4.2). It is important to note that under a common assumption on the feature vectors (see, for example, Xu etย al. (2019, Assumption 1)), the strong monotonicity condition holds, allowing us to conclude that A-MSA finds the (unique) optimal solution with rate ๐’ช~โ€‹(1/t)\widetilde{{\cal O}}(1/t). While TDC with momentum has been shown to converge asymptotically in Deb and Bhatnagar (2022), its finite-time complexity is unknown in the existing literature. Our paper fills this gap.
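
To make the three-time-scale structure concrete, the sketch below gives one plausible single-sample implementation of TDC with a heavy-ball momentum term under linear function approximation. It is our own illustration under stated assumptions: the environment interface sample_transition, the feature map phi, and the step-size schedules are hypothetical, and the sketch is not the exact algorithm of Deb and Bhatnagar (2022).

import numpy as np

def tdc_momentum(phi, sample_transition, d, gamma, T, seed=0):
    # theta: value parameter (highest level); omega: correction variable
    # (middle level); m: momentum variable tracking the TDC pseudo-gradient
    # (lowest level, forced to decay as the pseudo-gradient vanishes)
    rng = np.random.default_rng(seed)
    theta, omega, m = np.zeros(d), np.zeros(d), np.zeros(d)
    s = 0  # initial state index
    for t in range(1, T + 1):
        a1, a2, a3 = 1.0 / t, 1.0 / t ** 0.75, 1.0 / t ** 0.5  # hypothetical schedules
        s2, r = sample_transition(s, rng)  # assumed interface: (next state, reward)
        f, f2 = phi(s), phi(s2)
        delta = r + gamma * f2 @ theta - f @ theta   # TD error
        g = delta * f - gamma * (f @ omega) * f2     # TDC pseudo-gradient
        m += a3 * (g - m)                            # lowest level: momentum
        theta += a1 * m                              # highest level
        omega += a2 * (delta - f @ omega) * f        # middle level
        s = s2
    return theta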

4.2 Distributionally Robust Q-Learning

Distributionally robust reinforcement learning (DRRL) broadly studies how to find a policy that remains robust under unknown environmental changes. In this paper, we introduce the subject following the specific formulation in Liang etย al. (2024). Consider again the MDP (๐’ฎ,๐’œ,๐’ซ,r,ฮณ)({\cal S},{\cal A},{\cal P},r,\gamma) introduced in Sectionย 4.1, where ๐’ซ{\cal P} is the transition kernel that we can sample from during training. The unknown transition kernel in the test environment may deviate from ๐’ซ{\cal P} but lies within an uncertainty set ๐’ฐ=โˆ(s,a)โˆˆ๐’ฎร—๐’œ๐’ฐs,a{\cal U}=\prod_{(s,a)\in{\cal S}\times{\cal A}}{\cal U}_{s,a} with

๐’ฐs,a={๐’ซโ€ฒ(โ‹…โˆฃs,a)โˆˆฮ”๐’ฎ:Dฯ‡2(๐’ซโ€ฒ(โ‹…โˆฃs,a),๐’ซ(โ‹…โˆฃs,a))โ‰คฯ}.{\cal U}_{s,a}=\{{\cal P}^{\prime}(\cdot\mid s,a)\in\Delta_{{\cal S}}:D_{\chi^{2}}({\cal P}^{\prime}(\cdot\mid s,a),{\cal P}(\cdot\mid s,a))\leq\rho\}.

Here Dฯ‡2โ€‹(u,v)D_{\chi^{2}}(u,v) denotes the ฯ‡2\chi^{2} distance between distributions u,vโˆˆฮ”๐’ฎu,v\in\Delta_{{\cal S}}, and ฯ\rho is a radius parameter of the uncertainty set. The aim of DRRL is to find the distributionally robust Q function Qrob,โ‹†โˆˆโ„|๐’ฎ|ร—|๐’œ|Q^{\text{rob},\star}\in\mathbb{R}^{|{\cal S}|\times|{\cal A}|} that performs optimally in the worst case

Qrob,โ‹†โ€‹(s,a)โ‰œsupฯ€inf๐’ซโ€ฒโˆˆ๐’ฐ๐”ผฯ€,๐’ซโ€ฒโ€‹[โˆ‘t=0โˆžฮณtโ€‹rโ€‹(st,at)โˆฃs0=s,a0=a].\displaystyle Q^{\text{rob},\star}(s,a)\triangleq\sup_{\pi}\inf_{{\cal P}^{\prime}\in{\cal U}}\mathbb{E}_{\pi,{\cal P}^{\prime}}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\mid s_{0}=s,a_{0}=a].

The distributionally robust Q function satisfies the robust Bellman optimality equation

Qrob,โ‹†โ€‹(s,a)=rโ€‹(s,a)+ฮณโ€‹inf๐’ซโ€ฒโˆˆ๐’ฐ๐”ผ๐’ซโ€ฒโ€‹[maxaโˆˆ๐’œโกQrob,โ‹†โ€‹(sโ€ฒ,a)],โˆ€s,a.\displaystyle Q^{\text{rob},\star}(s,a)=r(s,a)+\gamma\inf_{{\cal P}^{\prime}\in{\cal U}}\mathbb{E}_{{\cal P}^{\prime}}[\max_{a\in{\cal A}}Q^{\text{rob},\star}(s^{\prime},a)],\quad\forall s,a. (34)

Directly solving Qrob,โ‹†Q^{\text{rob},\star} from (34), however, is challenging as an exhaustive search in ๐’ฐ{\cal U} is infeasible and we cannot sample from ๐’ซโ€ฒ{\cal P}^{\prime}. The trick developed in the distributionally robust optimization literature (Duchi and Namkoong, 2021) is to use the dual form of the ฯ‡2\chi^{2} distance

infDฯ‡2โ€‹(๐’ซโ€ฒ,๐’ซ)โ‰คฯ๐”ผ๐’ซโ€ฒโ€‹[X]=supฮทโˆˆโ„{ฮทโˆ’1+ฯโ€‹๐”ผ๐’ซโ€‹[(ฮทโˆ’X)+2]}.\displaystyle\inf_{D_{\chi^{2}}\left({\cal P}^{\prime},{\cal P}\right)\leq\rho}\mathbb{E}_{{\cal P}^{\prime}}[X]=\sup_{\eta\in\mathbb{R}}\left\{\eta-\sqrt{1+\rho}\sqrt{\mathbb{E}_{{\cal P}}\left[(\eta-X)_{+}^{2}\right]}\right\}. (35)

We define the shorthand notation ฯƒฯ‡2โ€‹(X,ฮท)=ฮทโˆ’1+ฯโ€‹๐”ผ๐’ซโ€‹[(ฮทโˆ’X)+2]\sigma_{\chi^{2}}(X,\eta)=\eta-\sqrt{1+\rho}\sqrt{\mathbb{E}_{{\cal P}}\left[(\eta-X)_{+}^{2}\right]}. Exploiting this relation, it can be shown that a variant of the Q learning algorithm for the distributionally robust setting aims to find the solution tuple {Qrob,โ‹†โˆˆโ„|๐’ฎ|ร—|๐’œ|,ฮทโ†’โˆˆโ„|๐’ฎ|ร—|๐’œ|,Z1โˆˆโ„|๐’ฎ|ร—|๐’œ|,Z2โˆˆโ„|๐’ฎ|ร—|๐’œ|}\{Q^{\text{rob},\star}\in\mathbb{R}^{|{\cal S}|\times|{\cal A}|},\vec{\eta}\in\mathbb{R}^{|{\cal S}|\times|{\cal A}|},Z_{1}\in\mathbb{R}^{|{\cal S}|\times|{\cal A}|},Z_{2}\in\mathbb{R}^{|{\cal S}|\times|{\cal A}|}\} to the system of equations

Qrob,โ‹†โ€‹(s,a)=rโ€‹(s,a)+ฮณโ€‹(ฮทโ†’โ€‹(s,a)โˆ’1+ฯโ€‹Z2),โˆ€s,a,\displaystyle\;Q^{\text{rob},\star}(s,a)=r(s,a)+\gamma(\vec{\eta}(s,a)-\sqrt{1+\rho}\sqrt{Z_{2}}),\quad\forall s,a, (36)
โˆ‡ฮทฯƒฯ‡2โ€‹(maxaโˆˆ๐’œโกQrob,โ‹†โ€‹(s,a),ฮทโ†’โ€‹(s,a))=1โˆ’1+ฯโ€‹Z1โ€‹(s,a)Z2โ€‹(s,a)=0,โˆ€s,a,\displaystyle\;\nabla_{\eta}\sigma_{\chi^{2}}(\max_{a\in{\cal A}}Q^{\text{rob},\star}(s,a),\vec{\eta}(s,a))=1-\sqrt{1+\rho}\frac{Z_{1}(s,a)}{\sqrt{Z_{2}(s,a)}}=0,\quad\forall s,a, (37)
{Z1โ€‹(s,a)=๐”ผ๐’ซโ€‹[(ฮทโ†’โ€‹(s,a)โˆ’maxaโˆˆ๐’œโกQrob,โ‹†โ€‹(sโ€ฒ,a))+],Z2โ€‹(s,a)=๐”ผ๐’ซโ€‹[(ฮทโ†’โ€‹(s,a)โˆ’maxaโˆˆ๐’œโกQrob,โ‹†โ€‹(sโ€ฒ,a))+2],โˆ€s,a.\displaystyle\left\{\begin{aligned} Z_{1}(s,a)=\mathbb{E}_{{\cal P}}[(\vec{\eta}(s,a)-\max_{a\in{\cal A}}Q^{\text{rob},\star}(s^{\prime},a))_{+}],\\ Z_{2}(s,a)=\mathbb{E}_{{\cal P}}[(\vec{\eta}(s,a)-\max_{a\in{\cal A}}Q^{\text{rob},\star}(s^{\prime},a))_{+}^{2}],\end{aligned}\right.\quad\forall s,a. (38)

Interested readers can find in Liang etย al. (2024) the detailed derivation of this system of equations as well as how a three-time-scale algorithm is developed to solve it. Liang etย al. (2024) further establishes the asymptotic convergence of the algorithm. The highest level equation is (36), and the associated Bellman operator is strongly monotone under a contraction assumption common in Q learning (Chen etย al., 2022). The lowest level equation is (38), whose associated operator is an identity mapping with respect to Z1Z_{1} and Z2Z_{2} and is therefore obviously strongly monotone. The issue is that the middle level operator โˆ‡ฮทฯƒฯ‡2\nabla_{\eta}\sigma_{\chi^{2}} in (37) is in general only (non-strongly) monotone with respect to ฮทโ†’\vec{\eta}. The violation of this assumption means that our analysis does not immediately apply. However, the techniques developed here may help design accelerated three-time-scale algorithms with an analysis tailored to the monotone structure of โˆ‡ฮทฯƒฯ‡2\nabla_{\eta}\sigma_{\chi^{2}}. Additionally, it is possible that โˆ‡ฮทฯƒฯ‡2\nabla_{\eta}\sigma_{\chi^{2}} becomes strongly monotone under additional assumptions on ๐’ฐ{\cal U}, in which case we can invoke Theoremย 1 and conclude that the A-MSA algorithm finds the optimal distributionally robust Q function with rate ๐’ช~โ€‹(1/t)\widetilde{{\cal O}}(1/t).
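
For intuition, the sketch below shows what one noisy three-time-scale update of the system (36)-(38) might look like on finite state and action spaces. It is a hedged illustration, not the algorithm of Liang etย al. (2024): the step-size schedules, the small constant guarding the division in (37), and the function signature are our own assumptions.

import numpy as np

def robust_q_step(Q, eta, Z1, Z2, s, a, r, s_next, t, gamma, rho):
    # Q, eta, Z1, Z2 are |S| x |A| arrays updated in place
    a1, a2, a3 = 1.0 / t, 1.0 / t ** 0.75, 1.0 / t ** 0.5  # hypothetical schedules
    v_next = Q[s_next].max()                               # max_a Q(s', a)
    # lowest level (38): track the two conditional expectations
    plus = max(eta[s, a] - v_next, 0.0)
    Z1[s, a] += a3 * (plus - Z1[s, a])
    Z2[s, a] += a3 * (plus ** 2 - Z2[s, a])
    # middle level (37): ascend the dual objective in eta
    grad = 1.0 - np.sqrt(1.0 + rho) * Z1[s, a] / np.sqrt(Z2[s, a] + 1e-8)
    eta[s, a] += a2 * grad
    # highest level (36): move Q toward the robust Bellman target
    target = r + gamma * (eta[s, a] - np.sqrt((1.0 + rho) * max(Z2[s, a], 0.0)))
    Q[s, a] += a1 * (target - Q[s, a])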

4.3 Actor-Critic Algorithm for Regularized Two-Player Zero-Sum Markov Games

Consider a two-player zero-sum Markov game described by (๐’ฎ,๐’œ,โ„ฌ,๐’ซ,ฮณ,r)({\cal S},{\cal A},{\cal B},{\cal P},\gamma,r). ๐’ฎ{\cal S} is the state space, observed by both players. ๐’œ{\cal A} and โ„ฌ{\cal B} are the action spaces of the two players. The transition kernel is ๐’ซ:๐’ฎร—๐’œร—โ„ฌโ†’ฮ”๐’ฎ{\cal P}:{\cal S}\times{\cal A}\times{\cal B}\rightarrow\Delta_{{\cal S}}, with ๐’ซโ€‹(sโ€ฒโˆฃs,a,b){\cal P}(s^{\prime}\mid s,a,b) denoting the probability of the game transitioning from state ss to sโ€ฒs^{\prime} when player 1 selects action aโˆˆ๐’œa\in{\cal A} and player 2 selects bโˆˆโ„ฌb\in{\cal B}. The reward function is r:๐’ฎร—๐’œร—โ„ฌโ†’[0,1]r:{\cal S}\times{\cal A}\times{\cal B}\rightarrow[0,1] โ€“ when player 1 and 2 take actions aa and bb in state ss, player 1 receives reward rโ€‹(s,a,b)r(s,a,b) and player 2 receives โˆ’rโ€‹(s,a,b)-r(s,a,b). We denote the two playersโ€™ policies by ฯ€โˆˆฮ”๐’œ๐’ฎ\pi\in\Delta_{{\cal A}}^{{\cal S}} and ฯ•โˆˆฮ”โ„ฌ๐’ฎ\phi\in\Delta_{{\cal B}}^{{\cal S}}, with ฯ€โ€‹(aโˆฃs)\pi(a\mid s), ฯ•โ€‹(bโˆฃs)\phi(b\mid s) representing the probability of selecting action aa, bb in state ss according to ฯ€\pi, ฯ•\phi. For a fixed initial state distribution ฯโˆˆฮ”๐’ฎ\rho\in\Delta_{{\cal S}}, the expected discounted cumulative reward under entropy regularization (received by player 1) under policy pair (ฯ€,ฯ•)(\pi,\phi) is

Jwโ€‹(ฯ€,ฯ•)=๐”ผs0โˆผฯโ€‹[Vwฯ€,ฯ•โ€‹(s0)],\displaystyle J_{w}(\pi,\phi)=\mathbb{E}_{s_{0}\sim\rho}[V_{w}^{\pi,\phi}(s_{0})],

where wโˆˆโ„+w\in\mathbb{R}_{+} is a non-negative regularization weight and Vwฯ€,ฯ•โ€‹(s)V_{w}^{\pi,\phi}(s) is the regularized value function

Vwฯ€,ฯ•โ€‹(s)โ‰œ๐”ผst,at,btโ€‹[โˆ‘t=0โˆžฮณtโ€‹(rโ€‹(st,at,bt)โˆ’wโ€‹logโกฯ€โ€‹(atโˆฃst)+wโ€‹logโกฯ•โ€‹(btโˆฃst))โˆฃs0=s].\displaystyle V_{w}^{\pi,\phi}(s)\triangleq\mathbb{E}_{s_{t},a_{t},b_{t}}\Big{[}\sum\nolimits_{t=0}^{\infty}\gamma^{t}\Big{(}r\left(s_{t},a_{t},b_{t}\right)-w\log\pi(a_{t}\mid s_{t})+w\log\phi(b_{t}\mid s_{t})\Big{)}\mid s_{0}=s\Big{]}.

Solving a regularized game means that we want to find a Nash equilibrium policy pair (ฯ€wโ‹†,ฯ•wโ‹†)(\pi_{w}^{\star},\phi_{w}^{\star}) as the solution to

Jwโ€‹(ฯ€wโ‹†,ฯ•wโ‹†)=maxฯ€โˆˆฮ”๐’œ๐’ฎโกminฯ•โˆˆฮ”โ„ฌ๐’ฎโกJwโ€‹(ฯ€,ฯ•)=minฯ•โˆˆฮ”โ„ฌ๐’ฎโกmaxฯ€โˆˆฮ”๐’œ๐’ฎโกJwโ€‹(ฯ€,ฯ•).\displaystyle J_{w}(\pi_{w}^{\star},\phi_{w}^{\star})=\max_{\pi\in\Delta_{{\cal A}}^{{\cal S}}}\min_{\phi\in\Delta_{{\cal B}}^{{\cal S}}}J_{w}(\pi,\phi)=\min_{\phi\in\Delta_{{\cal B}}^{{\cal S}}}\max_{\pi\in\Delta_{{\cal A}}^{{\cal S}}}J_{w}(\pi,\phi).

It is known from Zeng etย al. (2022) that such a Nash equilibrium exists and is unique.

Here we consider the softmax parameterization โ€“ we introduce parameters ฮธโˆˆโ„|๐’ฎ|ร—|๐’œ|,ฯˆโˆˆโ„|๐’ฎ|ร—|โ„ฌ|\theta\in\mathbb{R}^{|{\cal S}|\times|{\cal A}|},\psi\in\mathbb{R}^{|{\cal S}|\times|{\cal B}|} that encode the policies ฯ€ฮธ,ฯ•ฯˆ\pi_{\theta},\phi_{\psi} according to

ฯ€ฮธโ€‹(aโˆฃs)=expโก(ฮธโ€‹(s,a))โˆ‘aโ€ฒโˆˆ๐’œexpโก(ฮธโ€‹(s,aโ€ฒ)),ฯ•ฯˆโ€‹(bโˆฃs)=expโก(ฯˆโ€‹(s,b))โˆ‘bโ€ฒโˆˆโ„ฌexpโก(ฯˆโ€‹(s,bโ€ฒ)).\displaystyle\pi_{\theta}(a\mid s)=\frac{\exp\left(\theta(s,a)\right)}{\sum_{a^{\prime}\in{\cal A}}\exp\left(\theta(s,a^{\prime})\right)},\quad\phi_{\psi}(b\mid s)=\frac{\exp\left(\psi(s,b)\right)}{\sum_{b^{\prime}\in{\cal B}}\exp\left(\psi(s,b^{\prime})\right)}. (39)

Taking a gradient-based approach to the problem, we can express our objective as finding a stationary point where โˆ‡ฮธJwโ€‹(ฯ€ฮธ,ฯ•ฯˆ)=0\nabla_{\theta}J_{w}(\pi_{\theta},\phi_{\psi})=0 and โˆ‡ฯˆJwโ€‹(ฯ€ฮธ,ฯ•ฯˆ)=0\nabla_{\psi}J_{w}(\pi_{\theta},\phi_{\psi})=0, which can be expanded as follows.

๐”ผs,a,b,sโ€ฒโ€‹[(rโ€‹(s,a,b)โˆ’wโ€‹logโกฯ€ฮธโ€‹(aโˆฃs)+ฮณโ€‹Vwฯ€ฮธ,ฯ•ฯˆโ€‹(sโ€ฒ)โˆ’Vwฯ€ฮธ,ฯ•ฯˆโ€‹(s))โ€‹โˆ‡ฮธlogโกฯ€ฮธโ€‹(aโˆฃs)]=0,\displaystyle\mathbb{E}_{s,a,b,s^{\prime}}\big{[}(r(s,a,b)-w\log\pi_{\theta}(a\mid s)+\gamma V_{w}^{\pi_{\theta},\phi_{\psi}}(s^{\prime})-V_{w}^{\pi_{\theta},\phi_{\psi}}(s))\nabla_{\theta}\log\pi_{\theta}(a\mid s)\big{]}=0, (40)
๐”ผs,a,b,sโ€ฒโ€‹[(rโ€‹(s,a,b)+wโ€‹logโกฯ•ฯˆโ€‹(bโˆฃs)+ฮณโ€‹Vwฯ€ฮธ,ฯ•ฯˆโ€‹(sโ€ฒ)โˆ’Vwฯ€ฮธ,ฯ•ฯˆโ€‹(s))โ€‹โˆ‡ฯˆlogโกฯ•ฯˆโ€‹(bโˆฃs)]=0.\displaystyle\mathbb{E}_{s,a,b,s^{\prime}}\big{[}(r(s,a,b)+w\log\phi_{\psi}(b\mid s)+\gamma V_{w}^{\pi_{\theta},\phi_{\psi}}(s^{\prime})-V_{w}^{\pi_{\theta},\phi_{\psi}}(s))\nabla_{\psi}\log\phi_{\psi}(b\mid s)\big{]}=0. (41)

The value function is not directly known but can be obtained as the solution to the Bellman equation

๐”ผa,b,sโ€ฒโ€‹[rโ€‹(s,a,b)โˆ’wโ€‹logโกฯ€ฮธโ€‹(aโˆฃs)+wโ€‹logโกฯ•ฯˆโ€‹(bโˆฃs)+ฮณโ€‹Vwฯ€ฮธ,ฯ•ฯˆโ€‹(sโ€ฒ)โˆ’Vwฯ€ฮธ,ฯ•ฯˆโ€‹(s)]=0,โˆ€sโˆˆ๐’ฎ.\displaystyle\mathbb{E}_{a,b,s^{\prime}}\big{[}r(s,a,b)-w\log\pi_{\theta}(a\mid s)+w\log\phi_{\psi}(b\mid s)+\gamma V_{w}^{\pi_{\theta},\phi_{\psi}}(s^{\prime})-V_{w}^{\pi_{\theta},\phi_{\psi}}(s)\big{]}=0,\;\forall s\in{\cal S}. (42)

We need to solve the system of three equations (40)-(42). The lowest level is (42), which corresponds to an operator that can be shown to be strongly monotone. (40) and (41) are the highest level and middle level equations, associated with the gradient operators โˆ‡ฮธJwโ€‹(ฯ€ฮธ,ฯ•ฯˆ)\nabla_{\theta}J_{w}(\pi_{\theta},\phi_{\psi}) and โˆ‡ฯˆJwโ€‹(ฯ€ฮธ,ฯ•ฯˆ)\nabla_{\psi}J_{w}(\pi_{\theta},\phi_{\psi}), which are not strongly monotone. This prevents our analysis from being immediately applicable. However, the operators exhibit an important structure โ€“ they are the gradients of functions satisfying the Polyak-ลojasiewicz (PL) condition, which makes them resemble strongly monotone operators in a transformed domain. Exploiting this structure, Zeng etย al. (2022) shows that a gradient descent ascent algorithm that solves (40)-(41) assuming exact knowledge of Vwฯ€ฮธ,ฯ•ฯˆV_{w}^{\pi_{\theta},\phi_{\psi}} converges linearly fast, which is the convergence rate to be expected when the operators โˆ‡ฮธJwโ€‹(ฯ€ฮธ,ฯ•ฯˆ)\nabla_{\theta}J_{w}(\pi_{\theta},\phi_{\psi}) and โˆ‡ฯˆJwโ€‹(ฯ€ฮธ,ฯ•ฯˆ)\nabla_{\psi}J_{w}(\pi_{\theta},\phi_{\psi}) are strongly monotone. Combining the techniques in Zeng etย al. (2022) on leveraging the PL condition and those in this work on acceleration, we believe that the A-MSA algorithm can be shown to find the Nash equilibrium of a regularized two-player zero-sum Markov game with convergence rate ๐’ช~โ€‹(1/t)\widetilde{{\cal O}}(1/t).
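
To illustrate the structure, the sketch below implements one plausible sampled update for (40)-(42): the critic is updated on the fastest time scale, the max player on the slowest, and the min player in between. The sampling interface env_sample and the step-size schedules are our own hypothetical choices, not the paper's algorithm.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def game_ac_step(theta, psi, V, s, env_sample, w, gamma, t, rng):
    # theta: |S| x |A|, psi: |S| x |B|, V: length-|S| value estimates
    a1, a2, a3 = 1.0 / t, 1.0 / t ** 0.75, 1.0 / t ** 0.5  # hypothetical schedules
    pi, phi = softmax(theta[s]), softmax(psi[s])
    a = rng.choice(len(pi), p=pi)
    b = rng.choice(len(phi), p=phi)
    s2, r = env_sample(s, a, b)  # assumed interface: (next state, reward)
    # lowest level (42): regularized TD update of the critic
    delta = r - w * np.log(pi[a]) + w * np.log(phi[b]) + gamma * V[s2] - V[s]
    V[s] += a3 * delta
    # highest level (40): stochastic gradient ascent for the max player
    g_pi = -pi; g_pi[a] += 1.0  # score function of the softmax policy
    theta[s] += a1 * (r - w * np.log(pi[a]) + gamma * V[s2] - V[s]) * g_pi
    # middle level (41): stochastic gradient descent for the min player
    g_phi = -phi; g_phi[b] += 1.0
    psi[s] -= a2 * (r + w * np.log(phi[b]) + gamma * V[s2] - V[s]) * g_phi
    return s2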

4.4 Actor-Critic Algorithm for Mean Field Games

A mean field game (MFG) provides an approximation of an NN-agent Markov game with homogeneous agents as NN approaches infinity. The goal in solving a mean field game is to find a policy optimal in the MDP determined by the induced mean field, where the induced mean field is a distribution over the (infinite) population of agents and a function of the policy itself. In essence, a representative agent in an MFG needs to perform against an infinite population of other agents that adopt the same policy.

We consider MFGs in the stationary infinite-horizon average-reward setting and follow the formulation and notations in Zeng etย al. (2024a). We denote the state and action spaces of the representative agent by ๐’ฎ{\cal S} and ๐’œ{\cal A}. The state transition depends not only on the action of the representative agent but also on the aggregate population behavior, which is described by the mean field uโˆˆฮ”๐’ฎu\in\Delta_{{\cal S}}. We use ๐’ซ:๐’ฎร—๐’œร—ฮ”๐’ฎโ†’ฮ”๐’ฎ{\cal P}:{\cal S}\times{\cal A}\times\Delta_{{\cal S}}\rightarrow\Delta_{{\cal S}} to denote the transition kernel, with ๐’ซโ€‹(sโ€ฒโˆฃs,a,u){\cal P}(s^{\prime}\mid s,a,u) representing the probability that the state transitions from ss to sโ€ฒs^{\prime} when the representative agent takes action aa and the mean field is uu. Similarly, the reward function r:๐’ฎร—๐’œร—ฮ”๐’ฎโ†’[0,1]r:{\cal S}\times{\cal A}\times\Delta_{{\cal S}}\rightarrow[0,1] depends on the mean field โ€“ the representative agent receives reward rโ€‹(s,a,u)r(s,a,u) when it takes action aa under mean field uu in state ss. The agent does not observe the mean field, and its policy ฯ€\pi is a mapping from ๐’ฎ{\cal S} to ฮ”๐’œ\Delta_{{\cal A}}. We denote by Pฯ€,uโˆˆโ„|๐’ฎ|ร—|๐’ฎ|P^{\pi,u}\in\mathbb{R}^{|{\cal S}|\times|{\cal S}|} the state transition matrix under policy ฯ€\pi and mean field uu such that

Pฯ€,uโ€‹(sโ€ฒโˆฃs)=โˆ‘aโˆˆ๐’œ๐’ซโ€‹(sโ€ฒโˆฃs,a,u)โ€‹ฯ€โ€‹(aโˆฃs).P^{\pi,u}(s^{\prime}\mid s)=\sum_{a\in{\cal A}}{\cal P}(s^{\prime}\mid s,a,u)\pi(a\mid s).

When the mean field is uu and the agent takes policy ฯ€\pi, the agent can expect to collect the cumulative reward Jโ€‹(ฯ€,u)J(\pi,u)

Jโ€‹(ฯ€,u)\displaystyle J(\pi,u) โ‰œlimTโ†’โˆž1Tโ€‹โˆ‘t=0Tโˆ’1๐”ผatโˆผฯ€(โ‹…โˆฃst),st+1โˆผ๐’ซ(โ‹…โˆฃst,at,u)โ€‹[rโ€‹(st,at,u)โˆฃs0].\displaystyle\triangleq\lim_{T\rightarrow\infty}\frac{1}{T}\textstyle\sum_{t=0}^{T-1}\mathbb{E}_{a_{t}\sim\pi(\cdot\mid s_{t}),s_{t+1}\sim{\cal P}(\cdot\mid s_{t},a_{t},u)}[r(s_{t},a_{t},u)\mid s_{0}].

The mean field induced by policy ฯ€\pi (i.e., the state visitation distribution over the population when every agent takes policy ฯ€\pi) is denoted by uโ‹†โ€‹(ฯ€)u^{\star}(\pi), which satisfies

uโ‹†โ€‹(ฯ€)=limTโ†’โˆž1Tโ€‹๐”ผatโˆผฯ€(โ‹…โˆฃst),st+1โˆผ๐’ซ(โ‹…โˆฃst,at,uโ‹†(ฯ€))โ€‹[est],\displaystyle u^{\star}(\pi)=\lim_{T\rightarrow\infty}\frac{1}{T}\mathbb{E}_{a_{t}\sim\pi(\cdot\mid s_{t}),s_{t+1}\sim{\cal P}(\cdot\mid s_{t},a_{t},u^{\star}(\pi))}[e_{s_{t}}],

where esโˆˆโ„|๐’ฎ|e_{s}\in\mathbb{R}^{|{\cal S}|} is the indicator vector whose entry sโ€ฒs^{\prime} has value 1 if sโ€ฒ=ss^{\prime}=s and 0 otherwise.

We again consider the softmax parameterization โ€“ a policy parameter ฮธ\theta encodes the policy ฯ€ฮธ\pi_{\theta} according to (39). With the derivation presented in Zeng etย al. (2024a), it can be shown that finding an equilibrium in an MFG amounts to solving the system of three equations (43)-(45)

โˆ‡ฮธJโ€‹(ฯ€ฮธ,u)=๐”ผs,a,sโ€ฒโ€‹[(rโ€‹(s,a,u)+Vฯ€ฮธ,uโ€‹(sโ€ฒ)โˆ’Vฯ€ฮธ,uโ€‹(s))โ€‹โˆ‡ฮธlogโกฯ€ฮธโ€‹(aโˆฃs)]=0,\displaystyle\nabla_{\theta}J(\pi_{\theta},u)=\mathbb{E}_{s,a,s^{\prime}}\big{[}(r(s,a,u)+V^{\pi_{\theta},u}(s^{\prime})-V^{\pi_{\theta},u}(s))\nabla_{\theta}\log\pi_{\theta}(a\mid s)\big{]}=0, (43)
u=uโ‹†โ€‹(ฯ€ฮธ)=limTโ†’โˆž1Tโ€‹๐”ผatโˆผฯ€ฮธ(โ‹…โˆฃst),st+1โˆผ๐’ซ(โ‹…โˆฃst,at,u)โ€‹[est],\displaystyle u=u^{\star}(\pi_{\theta})=\lim_{T\rightarrow\infty}\frac{1}{T}\mathbb{E}_{a_{t}\sim\pi_{\theta}(\cdot\mid s_{t}),s_{t+1}\sim{\cal P}(\cdot\mid s_{t},a_{t},u)}[e_{s_{t}}], (44)
๐”ผa,sโ€ฒโ€‹[rโ€‹(s,a,u)โˆ’Jโ€‹(ฯ€ฮธ,u)+Vฯ€ฮธ,uโ€‹(sโ€ฒ)โˆ’Vฯ€ฮธ,uโ€‹(s)]=0,โˆ€sโˆˆ๐’ฎ.\displaystyle\mathbb{E}_{a,s^{\prime}}\big{[}r(s,a,u)-J(\pi_{\theta},u)+V^{\pi_{\theta},u}(s^{\prime})-V^{\pi_{\theta},u}(s)\big{]}=0,\;\forall s\in{\cal S}. (45)

in which we introduce the auxiliary variable Vฯ€ฮธ,uV^{\pi_{\theta},u}, which tracks the (differential) value function under ฯ€ฮธ,u\pi_{\theta},u. Among the three equations, (44) and (45) are at the middle and lowest levels and have associated operators satisfying strong monotonicity. The highest level operator โˆ‡ฮธJโ€‹(ฯ€ฮธ,u)\nabla_{\theta}J(\pi_{\theta},u) is not monotone but is a gradient operator. We can extend the analysis in this work by leveraging the techniques developed in Zeng and Doan (2024) for gradient operators and show that in this case A-MSA finds a first-order stationary point (but not necessarily a globally or locally optimal solution) of JJ with rate ๐’ช~โ€‹(1/t)\widetilde{{\cal O}}(1/\sqrt{t}) (in the sense of the squared gradient norm). This recovers the rate of the state-of-the-art MFG-ASAC algorithm proposed in Zeng etย al. (2024a) with a tailored analysis.
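
As with the previous examples, a single sampled A-MSA-style update for (43)-(45) can be sketched as follows; the interface env_sample(s, a, u) and the step-size schedules are our own hypothetical choices rather than the MFG-ASAC algorithm of Zeng etย al. (2024a).

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mfg_step(theta, u, V, J, s, env_sample, t, rng):
    # theta: |S| x |A| policy parameters; u: mean field estimate in the
    # simplex; V: differential value estimates; J: average-reward estimate
    a1, a2, a3 = 1.0 / t, 1.0 / t ** 0.75, 1.0 / t ** 0.5  # hypothetical schedules
    pi = softmax(theta[s])
    a = rng.choice(len(pi), p=pi)
    s2, r = env_sample(s, a, u)  # dynamics and reward depend on the mean field
    # lowest level (45): average-reward TD update of the differential value fn
    delta = r - J + V[s2] - V[s]
    V[s] += a3 * delta
    J = J + a3 * (r - J)
    # middle level (44): track the state distribution induced by the policy
    e = np.zeros_like(u); e[s2] = 1.0
    u += a2 * (e - u)
    # highest level (43): softmax policy gradient step
    g = -pi; g[a] += 1.0
    theta[s] += a1 * delta * g
    return s2, J

Since the mean field estimate is updated by a convex combination with an indicator vector, it remains a probability distribution throughout.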

5 Numerical Simulations

We apply A-MSA to the MFG policy optimization problem discussed in Sectionย 4.4. The environment is a small-scale synthetic MFG with |๐’ฎ|=30|{\cal S}|=30, |๐’œ|=10|{\cal A}|=10, and a randomly generated transition kernel and reward function. We compare A-MSA against the standard MSA algorithm without averaging and plot the convergence of both algorithms in Figureย 1, in which ฮธt\theta_{t} and utu_{t} denote the policy and mean field iterates in the ttht^{\text{th}} iteration. The norm of the gradient with respect to ฮธt\theta_{t}, โ€–โˆ‡ฮธJโ€‹(ฯ€ฮธt,ut)โ€–\|\nabla_{\theta}J(\pi_{\theta_{t}},u_{t})\|, measures the convergence of the policy given the latest mean field iterate. To quantify the mean field convergence, we consider โ€–utโˆ’vฯ€ฮธt,utโ€–\|u_{t}-v^{\pi_{\theta_{t}},u_{t}}\|, where vฯ€,ฮผv^{\pi,\mu} for any ฯ€,ฮผ\pi,\mu denotes the stationary distribution of states induced by Pฯ€,ฮผP^{\pi,\mu}. The simulation shows that A-MSA has a clear advantage over MSA, though not by an order of magnitude.

Figure 1: Algorithm performance in synthetic MFGs averaged over 200 trials. The first column measures the sub-optimality of the policy iterate under the latest mean field estimate; the second column shows the convergence of the mean field estimate to the mean field induced by the latest policy.

6 Concluding Remarks

In this work we propose an accelerated multi-time-scale SA algorithm that solves a coupled system of NN fixed-point equations with the optimal convergence rate and sample complexity โ€“ under a strong monotonicity assumption, the last iterate of the algorithm converges to the (unique) solution with rate ๐’ช~โ€‹(1/t)\widetilde{{\cal O}}(1/t) for any NN. This is the first time such a convergence rate has been achieved without additional restrictive smoothness assumptions. We show that the objectives of a range of problems in RL and multi-agent games can be formulated as coupled systems of fixed-point equations, to which the proposed algorithm can be applied. Our analysis generalizes readily to cases where the highest level operator is not strongly monotone (but the lower level operators are). An important future direction is to investigate the stability and convergence of multi-time-scale SA when the lower level operators lose strong monotonicity.

Acknowledgement

This work was partially supported by the National Science Foundation under awards ECCS-CAREER-2339509 and CCF-2343599.

Disclaimer

This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JP Morgan Chase & Co and its affiliates (โ€œJP Morganโ€), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

  • Benveniste etย al. (2012) Albert Benveniste, Michel Mรฉtivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volumeย 22. Springer Science & Business Media, 2012.
  • Bhandari etย al. (2018) Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory, pages 1691โ€“1692. PMLR, 2018.
  • Borkar (1997) Vivekย S Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291โ€“294, 1997.
  • Borkar (2008) Vivekย S Borkar. Stochastic approximation: a dynamical systems viewpoint. Springer, 2008.
  • Chen etย al. (2022) Zaiwei Chen, Sheng Zhang, Thinhย T Doan, John-Paul Clarke, and Sivaย Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
  • Dalal etย al. (2018) Gal Dalal, Gugan Thoppe, Balรกzs Szรถrรฉnyi, and Shie Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Conference On Learning Theory, pages 1199โ€“1233. PMLR, 2018.
  • Daskalakis etย al. (2020) Constantinos Daskalakis, Dylanย J Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in neural information processing systems, 33:5527โ€“5540, 2020.
  • Deb and Bhatnagar (2021) Rohan Deb and Shalabh Bhatnagar. NN-timescale stochastic approximation: Stability and convergence. arXiv preprint arXiv:2112.03515, 2021.
  • Deb and Bhatnagar (2022) Rohan Deb and Shalabh Bhatnagar. Gradient temporal difference with momentum: Stability and convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volumeย 36, pages 6488โ€“6496, 2022.
  • Doan (2022) Thinhย T Doan. Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance. IEEE Transactions on Automatic Control, 68(8):4695โ€“4705, 2022.
  • Doan (2024) Thinhย T Doan. Fast nonlinear two-time-scale stochastic approximation: Achieving ๐’ชโ€‹(1/k)\mathcal{O}(1/k) finite-sample complexity. arXiv preprint arXiv:2401.12764, 2024.
  • Doan and Romberg (2019) Thinhย T Doan and Justin Romberg. Linear two-time-scale stochastic approximation a finite-time analysis. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 399โ€“406. IEEE, 2019.
  • Duchi and Namkoong (2021) Johnย C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378โ€“1406, 2021.
  • Fort (2015) Gersende Fort. Central limit theorems for stochastic approximation with controlled markov chain dynamics. ESAIM: Probability and Statistics, 19:60โ€“80, 2015.
  • Gupta etย al. (2019) Harsh Gupta, Rayadurgam Srikant, and Lei Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Han etย al. (2024) Yuze Han, Xiang Li, and Zhihua Zhang. Finite-time decoupled convergence in nonlinear two-time-scale stochastic approximation. arXiv preprint arXiv:2401.03893, 2024.
  • Haque etย al. (2023) Shaanย Ul Haque, Sajad Khodadadian, and Sivaย Theja Maguluri. Tight finite time bounds of two-time-scale linear stochastic approximation with markovian noise. arXiv preprint arXiv:2401.00364, 2023.
  • Hong etย al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147โ€“180, 2023.
  • Kaledin etย al. (2020) Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with markovian noise. In Conference on Learning Theory, pages 2144โ€“2203. PMLR, 2020.
  • Konda and Tsitsiklis (1999) Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
  • Konda and Tsitsiklis (2004) Vijayย R Konda and Johnย N Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796โ€“819, 2004.
  • Konda and Borkar (1999) Vijaymohanย R Konda and Vivekย S Borkar. Actor-criticโ€“type learning algorithms for markov decision processes. SIAM Journal on control and Optimization, 38(1):94โ€“123, 1999.
  • Lakshminarayanan and Szepesvรกri (2017) Chandrashekar Lakshminarayanan and Csaba Szepesvรกri. Linear stochastic approximation: Constant step-size and iterate averaging. arXiv preprint arXiv:1709.04073, 2017.
  • Li etย al. (2023a) Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with markovian data. arXiv preprint arXiv:2302.07690, 2023a.
  • Li etย al. (2023b) Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michaelย I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. In International Conference on Artificial Intelligence and Statistics, pages 2207โ€“2261. PMLR, 2023b.
  • Liang etย al. (2024) Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jun Yang, Jiheng Zhang, and Zhengyuan Zhou. Single-trajectory distributionally robust reinforcement learning. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 29644โ€“29666. PMLR, 2024.
  • Meerkov (1972) S.ย M. Meerkov. Simplified description of slow markov walks: part i. Automation and Remote Control, 33:404โ€“414, 1972.
  • Mitra (2024) Aritra Mitra. A simple finite-time analysis of td learning with linear function approximation. arXiv preprint arXiv:2403.02476, 2024.
  • Mokkadem and Pelletier (2006) Abdelkader Mokkadem and Mariane Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671โ€“1702, 2006.
  • Moulines and Bach (2011) Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400โ€“407, 1951.
  • Sayin etย al. (2021) Muhammed Sayin, Kaiqing Zhang, David Leslie, Tamer Basar, and Asuman Ozdaglar. Decentralized q-learning in zero-sum markov games. Advances in Neural Information Processing Systems, 34:18320โ€“18334, 2021.
  • Shen and Chen (2022) Han Shen and Tianyi Chen. A single-timescale analysis for stochastic approximation with multiple coupled sequences. Advances in Neural Information Processing Systems, 35:17415โ€“17429, 2022.
  • Srikant (2024) Rย Srikant. Rates of convergence in the central limit theorem for markov chains, with an application to td learning. arXiv preprint arXiv:2401.15719, 2024.
  • Srikant and Ying (2019) Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803โ€“2830. PMLR, 2019.
  • Sutton etย al. (2008) Richardย S Sutton, Hamid Maei, and Csaba Szepesvรกri. A convergent Oโ€‹(n)O(n) temporal-difference algorithm for off-policy learning with linear function approximation. Advances in neural information processing systems, 21, 2008.
  • Sutton etย al. (2009) Richardย S Sutton, Hamidย Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvรกri, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning, pages 993โ€“1000, 2009.
  • Wang etย al. (2021) Yue Wang, Shaofeng Zou, and Yiย Zhou. Non-asymptotic analysis for two time-scale tdc with general smooth function approximation. Advances in neural information processing systems, 34:9747โ€“9758, 2021.
  • Wu etย al. (2020) Yueย Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617โ€“17628, 2020.
  • Xu etย al. (2019) Tengyu Xu, Shaofeng Zou, and Yingbin Liang. Two time-scale off-policy td learning: Non-asymptotic analysis over markovian samples. Advances in neural information processing systems, 32, 2019.
  • Zeng and Doan (2024) Sihan Zeng and Thinh Doan. Fast two-time-scale stochastic gradient method with applications in reinforcement learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 5166โ€“5212. PMLR, 2024.
  • Zeng etย al. (2022) Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum markov games. Advances in Neural Information Processing Systems, 35:34546โ€“34558, 2022.
  • Zeng etย al. (2024a) Sihan Zeng, Sujay Bhatt, Alec Koppel, and Sumitra Ganesh. A single-loop finite-time convergent policy optimization algorithm for mean field games (and average-reward markov decision processes). arXiv preprint arXiv:2408.04780, 2024a.
  • Zeng etย al. (2024b) Sihan Zeng, Thinhย T Doan, and Justin Romberg. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. SIAM Journal on Optimization, 34(1):946โ€“976, 2024b.


Appendix A Proof of Technical Lemmas

A.1 Proof of Lemmaย 1

Lemma 5

For any tโ‰ฅฯ„tt\geq\tau_{t} we have

๐”ผโ€‹[โŸจฮ”โ€‹fi[t],Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉ]โ‰คN2โ€‹Cฯ„โ€‹ฯ„t2โ€‹ฮป[tโˆ’ฯ„t]โ€‹(โˆ‘j=1Nโ€–xj[t]โ€–2+โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2+1),\displaystyle\mathbb{E}[\langle\Delta f_{i}^{[t]},F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle]\leq N^{2}C_{\tau}\tau_{t}^{2}\lambda^{[t-\tau_{t}]}\Big{(}\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}+1\Big{)},

where Cฯ„C_{\tau} is a positive, bounded scalar depending only on the structural constants including LL, BB, ฯ\rho, and mm.

We derive a frequently used inequality in the proofs of Lemmasย 1 and 5 as a result of (18). For any XX, we have

โ€–Fiโ€‹(๐œฝ[t],X)โ€–\displaystyle\|F_{i}(\boldsymbol{\theta}^{[t]},X)\| โ‰คLโ€‹(โˆ‘j=1Nโ€–ฮธj[t]โ€–+1)\displaystyle\leq L(\sum_{j=1}^{N}\|\theta_{j}^{[t]}\|+1)
โ‰คL+Lโ€‹โˆ‘j=1Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:jโˆ’1[t])โ€–+Lโ€‹โˆ‘j=1Nโ€–yjโ€‹(๐œฝ1:jโˆ’1[t])โ€–\displaystyle\leq L+L\sum_{j=1}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:j-1}^{[t]})\|+L\sum_{j=1}^{N}\|y_{j}(\boldsymbol{\theta}_{1:j-1}^{[t]})\|
โ‰ค(Nโ€‹B+1)โ€‹L+Lโ€‹โˆ‘j=1Nโ€–xj[t]โ€–,\displaystyle\leq(NB+1)L+L\sum_{j=1}^{N}\|x_{j}^{[t]}\|, (46)

where the last inequality follows from the boundedness condition in Assumptionย 4 and the definition of xj[t]x_{j}^{[t]} in (20).

By the update rule in (8), we have

ฮ”โ€‹fi[t+1]\displaystyle\Delta f_{i}^{[t+1]} =fi[t+1]โˆ’Fยฏiโ€‹(๐œฝ[t+1])\displaystyle=f_{i}^{[t+1]}-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})
=(1โˆ’ฮป[t])โ€‹fi[t]+ฮป[t]โ€‹Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])\displaystyle=(1-\lambda^{[t]})f_{i}^{[t]}+\lambda^{[t]}F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})
=(1โˆ’ฮป[t])โ€‹fi[t]+ฮป[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])+ฮป[t]โ€‹(Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t]))\displaystyle=(1-\lambda^{[t]})f_{i}^{[t]}+\lambda^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})+\lambda^{[t]}\left(F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\right)
=(1โˆ’ฮป[t])โ€‹ฮ”โ€‹fi[t]+(Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1]))+ฮป[t]โ€‹(Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])).\displaystyle=(1-\lambda^{[t]})\Delta f_{i}^{[t]}+\left(\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\right)+\lambda^{[t]}\left(F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\right).

This leads to

โ€–ฮ”โ€‹fi[t+1]โ€–2\displaystyle\|\Delta f_{i}^{[t+1]}\|^{2}
=(1โˆ’ฮป[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+โ€–(Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1]))+ฮป[t]โ€‹(Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t]))โ€–2\displaystyle=(1-\lambda^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}+\|\left(\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\right)+\lambda^{[t]}\left(F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\right)\|^{2}
+2โ€‹(1โˆ’ฮป[t])โ€‹โŸจฮ”โ€‹fi[t],Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])โŸฉ\displaystyle\hskip 20.0pt+2(1-\lambda^{[t]})\langle\Delta f_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\rangle
+2โ€‹ฮป[t]โ€‹(1โˆ’ฮป[t])โ€‹โŸจฮ”โ€‹fi[t],Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉ\displaystyle\hskip 20.0pt+2\lambda^{[t]}(1-\lambda^{[t]})\langle\Delta f_{i}^{[t]},F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle
โ‰ค(1โˆ’ฮป[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])โ€–2+2โ€‹(ฮป[t])2โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle\leq(1-\lambda^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}+2\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\|^{2}+2(\lambda^{[t]})^{2}\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
+ฮป[t]2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+2ฮป[t]โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])โ€–2\displaystyle\hskip 20.0pt+\frac{\lambda^{[t]}}{2}\|\Delta f_{i}^{[t]}\|^{2}+\frac{2}{\lambda^{[t]}}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\|^{2}
+2โ€‹ฮป[t]โ€‹(1โˆ’ฮป[t])โ€‹โŸจฮ”โ€‹fi[t],Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉ\displaystyle\hskip 20.0pt+2\lambda^{[t]}(1-\lambda^{[t]})\langle\Delta f_{i}^{[t]},F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle
โ‰ค(1โˆ’ฮป[t])โ€‹โ€–ฮ”โ€‹fi[t]โ€–2โˆ’ฮป[t]4โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+2โ€‹(ฮป[t])2โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle\leq(1-\lambda^{[t]})\|\Delta f_{i}^{[t]}\|^{2}-\frac{\lambda^{[t]}}{4}\|\Delta f_{i}^{[t]}\|^{2}+2(\lambda^{[t]})^{2}\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
+4ฮป[t]โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])โ€–2+2โ€‹ฮป[t]โ€‹(1โˆ’ฮป[t])โ€‹โŸจฮ”โ€‹fi[t],Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉ,\displaystyle\hskip 20.0pt+\frac{4}{\lambda^{[t]}}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\|^{2}+2\lambda^{[t]}(1-\lambda^{[t]})\langle\Delta f_{i}^{[t]},F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle, (47)

where the last inequality follows from ฮป[t]โ‰ค1/4โ‰ค1\lambda^{[t]}\leq 1/4\leq 1.

Note that the third term on the right hand side of (47) is bounded due to (46)

โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2} โ‰ค2โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โ€–+2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle\leq 2\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})\|+2\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
โ‰ค4โ€‹((Nโ€‹B+1)โ€‹L+Lโ€‹โˆ‘j=1Nโ€–xj[t]โ€–)2\displaystyle\leq 4\Big{(}(NB+1)L+L\sum_{j=1}^{N}\|x_{j}^{[t]}\|\Big{)}^{2}
โ‰ค8โ€‹L2โ€‹(Nโ€‹โˆ‘j=1Nโ€–xj[t]โ€–2+(Nโ€‹B+1)2).\displaystyle\leq 8L^{2}\big{(}N\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+(NB+1)^{2}\big{)}. (48)

For the fourth term of (47), we have from Lemmaย 3 and the Lipschitz continuity of Fยฏi\widebar{F}_{i}

๐”ผโ€‹[โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])โ€–2]\displaystyle\mathbb{E}[\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\|^{2}] โ‰คL2โ€‹โˆ‘j=1N๐”ผโ€‹[โ€–ฮธj[t]โˆ’ฮธj[t+1]โ€–2]\displaystyle\leq L^{2}\sum_{j=1}^{N}\mathbb{E}[\|\theta_{j}^{[t]}-\theta_{j}^{[t+1]}\|^{2}]
โ‰คL2โ€‹(ฮฑi[t])2โ€‹โˆ‘j=1N๐”ผโ€‹[(โ€–ฮ”โ€‹fj[t]โ€–+Nโ€‹L2โ€‹โˆ‘k=jNโ€–xk[t]โ€–)2]\displaystyle\leq L^{2}(\alpha_{i}^{[t]})^{2}\sum_{j=1}^{N}\mathbb{E}[\left(\|\Delta f_{j}^{[t]}\|+NL^{2}\sum_{k=j}^{N}\|x_{k}^{[t]}\|\right)^{2}]
โ‰ค2โ€‹L2โ€‹(ฮฑi[t])2โ€‹โˆ‘j=1N๐”ผโ€‹[โ€–ฮ”โ€‹fj[t]โ€–2]+2โ€‹Nโ€‹L4โ€‹(ฮฑi[t])2โ€‹โˆ‘j=1Nโˆ‘k=jN๐”ผโ€‹[โ€–xk[t]โ€–2]\displaystyle\leq 2L^{2}(\alpha_{i}^{[t]})^{2}\sum_{j=1}^{N}\mathbb{E}[\|\Delta f_{j}^{[t]}\|^{2}]+2NL^{4}(\alpha_{i}^{[t]})^{2}\sum_{j=1}^{N}\sum_{k=j}^{N}\mathbb{E}[\|x_{k}^{[t]}\|^{2}]
โ‰ค2โ€‹L2โ€‹(ฮฑi[t])2โ€‹โˆ‘j=1N๐”ผโ€‹[โ€–ฮ”โ€‹fj[t]โ€–2]+2โ€‹N2โ€‹L4โ€‹(ฮฑi[t])2โ€‹โˆ‘j=1N๐”ผโ€‹[โ€–xj[t]โ€–2].\displaystyle\leq 2L^{2}(\alpha_{i}^{[t]})^{2}\sum_{j=1}^{N}\mathbb{E}[\|\Delta f_{j}^{[t]}\|^{2}]+2N^{2}L^{4}(\alpha_{i}^{[t]})^{2}\sum_{j=1}^{N}\mathbb{E}[\|x_{j}^{[t]}\|^{2}]. (49)

The last term of (47) is bounded in expectation due to Lemmaย 5

๐”ผโ€‹[โŸจฮ”โ€‹fi[t],Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉ]โ‰คN2โ€‹Cฯ„โ€‹ฯ„t2โ€‹ฮป[tโˆ’ฯ„t]โ€‹(โˆ‘j=1Nโ€–xj[t]โ€–2+โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2+1).\displaystyle\mathbb{E}[\langle\Delta f_{i}^{[t]},F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle]\leq N^{2}C_{\tau}\tau_{t}^{2}\lambda^{[t-\tau_{t}]}\Big{(}\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}+1\Big{)}. (50)

Combining (48)-(50) with (47),

๐”ผโ€‹[โ€–ฮ”โ€‹fi[t+1]โ€–2]\displaystyle\mathbb{E}[\|\Delta f_{i}^{[t+1]}\|^{2}]
โ‰ค(1โˆ’ฮป[t])โ€‹๐”ผโ€‹[โ€–ฮ”โ€‹fi[t]โ€–2]โˆ’ฮป[t]4โ€‹๐”ผโ€‹[โ€–ฮ”โ€‹fi[t]โ€–2]+2โ€‹(ฮป[t])2โ€‹๐”ผโ€‹[โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2]\displaystyle\leq(1-\lambda^{[t]})\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]-\frac{\lambda^{[t]}}{4}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+2(\lambda^{[t]})^{2}\mathbb{E}[\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}]
+4ฮป[t]โ€‹๐”ผโ€‹[โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ[t+1])โ€–2]+2โ€‹ฮป[t]โ€‹(1โˆ’ฮป[t])โ€‹๐”ผโ€‹[โŸจฮ”โ€‹fi[t],Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉ]\displaystyle\hskip 20.0pt+\frac{4}{\lambda^{[t]}}\mathbb{E}[\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t+1]})\|^{2}]+2\lambda^{[t]}(1-\lambda^{[t]})\mathbb{E}[\langle\Delta f_{i}^{[t]},F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle]
โ‰ค(1โˆ’ฮป[t])โ€‹๐”ผโ€‹[โ€–ฮ”โ€‹fi[t]โ€–2]โˆ’ฮป[t]4โ€‹๐”ผโ€‹[โ€–ฮ”โ€‹fi[t]โ€–2]+8โ€‹L2โ€‹(ฮป[t])2โ€‹(Nโ€‹โˆ‘j=1Nโ€–xj[t]โ€–2+(Nโ€‹B+1)2)\displaystyle\leq(1-\lambda^{[t]})\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]-\frac{\lambda^{[t]}}{4}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+8L^{2}(\lambda^{[t]})^{2}\big{(}N\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+(NB+1)^{2}\big{)}
+8โ€‹L2โ€‹(ฮฑi[t])2ฮป[t]โ€‹โˆ‘j=1N๐”ผโ€‹[โ€–ฮ”โ€‹fj[t]โ€–2]+8โ€‹N2โ€‹L4โ€‹(ฮฑi[t])2ฮป[t]โ€‹โˆ‘j=1N๐”ผโ€‹[โ€–xj[t]โ€–2]\displaystyle\hskip 20.0pt+\frac{8L^{2}(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}}\sum_{j=1}^{N}\mathbb{E}[\|\Delta f_{j}^{[t]}\|^{2}]+\frac{8N^{2}L^{4}(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}}\sum_{j=1}^{N}\mathbb{E}[\|x_{j}^{[t]}\|^{2}]
+N2โ€‹Cฯ„โ€‹ฯ„t2โ€‹ฮป[t]โ€‹ฮป[tโˆ’ฯ„t]โ€‹(โˆ‘j=1Nโ€–xj[t]โ€–2+โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2+1)\displaystyle\hskip 20.0pt+N^{2}C_{\tau}\tau_{t}^{2}\lambda^{[t]}\lambda^{[t-\tau_{t}]}(\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}+1)
โ‰ค(1โˆ’ฮป[t])โ€‹๐”ผโ€‹[โ€–ฮ”โ€‹fi[t]โ€–2]โˆ’ฮป[t]4โ€‹๐”ผโ€‹[โ€–ฮ”โ€‹fi[t]โ€–2]+Dโ€‹N2โ€‹ฯ„t2โ€‹ฮป[t]โ€‹ฮป[tโˆ’ฯ„t]\displaystyle\leq(1-\lambda^{[t]})\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]-\frac{\lambda^{[t]}}{4}\mathbb{E}[\|\Delta f_{i}^{[t]}\|^{2}]+DN^{2}\tau_{t}^{2}\lambda^{[t]}\lambda^{[t-\tau_{t}]}
+Dโ€‹N2โ€‹ฯ„t2โ€‹(ฮป[t]โ€‹ฮป[tโˆ’ฯ„t]+(ฮฑi[t])2ฮป[t])โ€‹(โˆ‘j=1Nโ€–xj[t]โ€–2+โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2),\displaystyle\hskip 20.0pt+DN^{2}\tau_{t}^{2}(\lambda^{[t]}\lambda^{[t-\tau_{t}]}+\frac{(\alpha_{i}^{[t]})^{2}}{\lambda^{[t]}})\big{(}\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}+\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}\big{)}, (51)

where DD is a finite constant depending only on LL, BB, and Cฯ„C_{\tau}.

โ–ก\Box

A.2 Proof of Lemmaย 2

By the update rule in (7), we have

xi[t+1]\displaystyle x_{i}^{[t+1]} =ฮธi[t+1]โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])\displaystyle=\theta_{i}^{[t+1]}-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})
=ฮธi[t]โˆ’ฮฑi[t]โ€‹fi[t]โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])\displaystyle=\theta_{i}^{[t]}-\alpha_{i}^{[t]}f_{i}^{[t]}-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})
=xi[t]โˆ’ฮฑi[t]โ€‹fi[t]+(yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1]))\displaystyle=x_{i}^{[t]}-\alpha_{i}^{[t]}f_{i}^{[t]}+\left(y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\right)
=xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โˆ’ฮฑi[t]โ€‹ฮ”โ€‹fi[t]+(yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])).\displaystyle=x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\alpha_{i}^{[t]}\Delta f_{i}^{[t]}+\left(y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\right).

Taking the norm leads to

โ€–xi[t+1]โ€–2\displaystyle\|x_{i}^{[t+1]}\|^{2}
โ‰คโ€–xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โ€–2โŸT1+(ฮฑi[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–2โŸT2\displaystyle\leq\underbrace{\|x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}}_{T_{1}}+(\alpha_{i}^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}+\underbrace{\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|^{2}}_{T_{2}}
+ฮฑi[t]โ€‹โŸจxi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t]),ฮ”โ€‹fi[t]โŸฉโŸT3+โŸจxi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t]),yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โŸฉโŸT4\displaystyle\hskip 20.0pt+\underbrace{\alpha_{i}^{[t]}\langle x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}),\Delta f_{i}^{[t]}\rangle}_{T_{3}}+\underbrace{\langle x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}),y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\rangle}_{T_{4}}
+ฮฑi[t]โ€‹โŸจฮ”โ€‹fi[t],yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โŸฉโŸT5.\displaystyle\hskip 20.0pt+\underbrace{\alpha_{i}^{[t]}\langle\Delta f_{i}^{[t]},y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\rangle}_{T_{5}}. (52)

We bound each term of (52) separately. By the definition of yky_{k}, we have

Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))=0.\displaystyle\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))=0.

This implies

T1\displaystyle T_{1}
=โ€–xi[t]โ€–2โˆ’2โ€‹ฮฑi[t]โ€‹โŸจxi[t],Fยฏiโ€‹(๐œฝ[t])โŸฉ+(ฮฑi[t])2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle=\|x_{i}^{[t]}\|^{2}-2\alpha_{i}^{[t]}\langle x_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle+(\alpha_{i}^{[t]})^{2}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
=โ€–xi[t]โ€–2โˆ’2โ€‹ฮฑi[t]โ€‹โŸจxi[t],Fยฏiโ€‹(๐œฝ1:i[t],๐ฒi+1:Nโ€‹(๐œฝ1:i[t]))โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โŸฉ\displaystyle=\|x_{i}^{[t]}\|^{2}-2\alpha_{i}^{[t]}\langle x_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i}^{[t]}))-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\rangle
โˆ’2โ€‹ฮฑi[t]โ€‹โŸจxi[t],Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:i[t],๐ฒi+1:Nโ€‹(๐œฝ1:i[t]))โŸฉ+(ฮฑi[t])2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle\hskip 20.0pt-2\alpha_{i}^{[t]}\langle x_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i}^{[t]}))\rangle+(\alpha_{i}^{[t]})^{2}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
=โˆฅxi[t]โˆฅ2โˆ’2ฮฑi[t]โŸจxi[t],Fยฏi(๐œฝ1:iโˆ’1[t],ฮธi[t],๐ฒi+1:N(๐œฝ1:iโˆ’1[t],ฮธi[t]))\displaystyle=\|x_{i}^{[t]}\|^{2}-2\alpha_{i}^{[t]}\Big{\langle}x_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},\theta_{i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i-1}^{[t]},\theta_{i}^{[t]}))
โˆ’Fยฏi(๐œฝ1:iโˆ’1[t],yi(๐œฝ1:iโˆ’1[t]),๐ฒi+1:N(๐œฝ1:iโˆ’1[t],yi(๐œฝ1:iโˆ’1[t])))โŸฉ\displaystyle\hskip 100.0pt-\widebar{F}_{i}\Big{(}\boldsymbol{\theta}_{1:i-1}^{[t]},y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]}),{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i-1}^{[t]},y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\Big{)}\Big{\rangle}
โˆ’2โ€‹ฮฑi[t]โ€‹โŸจxi[t],Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:i[t],๐ฒi+1:Nโ€‹(๐œฝ1:i[t]))โŸฉ\displaystyle\hskip 20.0pt-2\alpha_{i}^{[t]}\langle x_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i}^{[t]}))\rangle
+(ฮฑi[t])2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โ€–2\displaystyle\hskip 20.0pt+(\alpha_{i}^{[t]})^{2}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’2โ€‹ฮดโ€‹ฮฑi[t]โ€‹โ€–xi[t]โ€–2+2โ€‹ฮฑi[t]โ€‹โ€–xi[t]โ€–โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:i[t],๐ฒi+1:Nโ€‹(๐œฝ1:i[t]))โ€–\displaystyle\leq\|x_{i}^{[t]}\|^{2}-2\delta\alpha_{i}^{[t]}\|x_{i}^{[t]}\|^{2}+2\alpha_{i}^{[t]}\|x_{i}^{[t]}\|\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i}^{[t]}))\|
+(ฮฑi[t])2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โ€–2\displaystyle\hskip 20.0pt+(\alpha_{i}^{[t]})^{2}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’2โ€‹ฮดโ€‹ฮฑi[t]โ€‹โ€–xi[t]โ€–2+ฮดโ€‹ฮฑi[t]โ€‹โ€–xi[t]โ€–2+4โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2\displaystyle\leq\|x_{i}^{[t]}\|^{2}-2\delta\alpha_{i}^{[t]}\|x_{i}^{[t]}\|^{2}+\delta\alpha_{i}^{[t]}\|x_{i}^{[t]}\|^{2}+\frac{4N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}
+4โ€‹N3โ€‹L6โ€‹(ฮฑi[t])2โ€‹โˆ‘j=iNโ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+4N^{3}L^{6}(\alpha_{i}^{[t]})^{2}\sum_{j=i}^{N}\|x_{j}^{[t]}\|^{2}
=โ€–xi[t]โ€–2โˆ’(ฮดโ€‹ฮฑi[t]โˆ’4โ€‹N3โ€‹L6โ€‹(ฮฑi[t])2)โ€‹โ€–xi[t]โ€–2+(4โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮด+4โ€‹N3โ€‹L6โ€‹(ฮฑi[t])2)โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2\displaystyle=\|x_{i}^{[t]}\|^{2}-\left(\delta\alpha_{i}^{[t]}-4N^{3}L^{6}(\alpha_{i}^{[t]})^{2}\right)\|x_{i}^{[t]}\|^{2}+\left(\frac{4N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}+4N^{3}L^{6}(\alpha_{i}^{[t]})^{2}\right)\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’ฮดโ€‹ฮฑi[t]2โ€‹โ€–xi[t]โ€–2+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2.\displaystyle\leq\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{2}\|x_{i}^{[t]}\|^{2}+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}. (53)

The third equality of (53) is a result of the identity that for all iโ‰คji\leq j

yjโ€‹(๐œฝ1:iโˆ’1)=yjโ€‹(๐œฝ1:iโˆ’1,๐ฒi:jโˆ’1โ€‹(๐œฝ1:iโˆ’1)).\displaystyle y_{j}(\boldsymbol{\theta}_{1:i-1})=y_{j}(\boldsymbol{\theta}_{1:i-1},{\bf y}_{i:j-1}(\boldsymbol{\theta}_{1:i-1})). (54)

The first inequality of (53) follows from Assumptionย 3, and the final inequality follows from the step size conditions ฮฑi[t]โ‰คฮด8โ€‹N3โ€‹L6\alpha_{i}^{[t]}\leq\frac{\delta}{8N^{3}L^{6}} and ฮฑi[t]โ‰ค1ฮด\alpha_{i}^{[t]}\leq\frac{1}{\delta}. The second inequality of (53) uses the bound

โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:i[t],๐ฒi+1:Nโ€‹(๐œฝ1:i[t]))โ€–โ‰ค2โ€‹Nโ€‹L3โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–,\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i}^{[t]}))\|\leq 2NL^{3}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|,

which we now justify

โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:i[t],๐ฒi+1:Nโ€‹(๐œฝ1:i[t]))โ€–\displaystyle\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i+1:N}(\boldsymbol{\theta}_{1:i}^{[t]}))\|
โ‰คLโ€‹โˆ‘j=i+1Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:i[t])โ€–\displaystyle\leq L\sum_{j=i+1}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i}^{[t]})\|
โ‰คLโ€‹โˆ‘j=i+1Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:jโˆ’1[t])โ€–+Lโ€‹โˆ‘j=i+1Nโ€–yjโ€‹(๐œฝ1:jโˆ’1[t])โˆ’yjโ€‹(๐œฝ1:i[t])โ€–\displaystyle\leq L\sum_{j=i+1}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:j-1}^{[t]})\|+L\sum_{j=i+1}^{N}\|y_{j}(\boldsymbol{\theta}_{1:j-1}^{[t]})-y_{j}(\boldsymbol{\theta}_{1:i}^{[t]})\|
=Lโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–+Lโ€‹โˆ‘j=i+1Nโ€–yjโ€‹(๐œฝ1:jโˆ’1[t])โˆ’yjโ€‹(๐œฝ1:i[t],๐ฒi:jโˆ’1โ€‹(๐œฝ1:i[t]))โ€–\displaystyle=L\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|+L\sum_{j=i+1}^{N}\|y_{j}(\boldsymbol{\theta}_{1:j-1}^{[t]})-y_{j}(\boldsymbol{\theta}_{1:i}^{[t]},{\bf y}_{i:j-1}(\boldsymbol{\theta}_{1:i}^{[t]}))\|
โ‰คLโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–+L3โ€‹โˆ‘j=i+1Nโˆ‘k=i+1jโˆ’1โ€–xk[t]โ€–\displaystyle\leq L\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|+L^{3}\sum_{j=i+1}^{N}\sum_{k=i+1}^{j-1}\|x_{k}^{[t]}\|
โ‰ค2โ€‹Nโ€‹L3โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–.\displaystyle\leq 2NL^{3}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|. (55)
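The final step of (55), which collapses the double sum, is elementary; a quick numeric spot-check (assuming Lโ‰ฅ1L\geq 1) is given below.

```python
# Spot-check of the last inequality in (55), assuming L >= 1:
#   L * sum_j x_j + L^3 * sum_j sum_{k<j} x_k  <=  2 * N * L^3 * sum_j x_j,
# where x_j stands in for ||x_j^{[t]}||, j = i+1, ..., N.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    N = int(rng.integers(2, 10))
    L = rng.uniform(1.0, 5.0)
    x = rng.uniform(0.0, 1.0, N)
    lhs = L * x.sum() + L**3 * sum(x[:j].sum() for j in range(N))
    assert lhs <= 2 * N * L**3 * x.sum() + 1e-12
print("inequality held on all random draws")
```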

The second inequality of (53) plugs in the bound on โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โ€–\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\| below, which follows from a similar argument

โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โ€–\displaystyle\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\| โ‰คLโ€‹โ€–xi[t]โ€–+Lโ€‹โˆ‘j=i+1Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–\displaystyle\leq L\|x_{i}^{[t]}\|+L\sum_{j=i+1}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|
โ‰ค2โ€‹Nโ€‹L3โ€‹โˆ‘j=iNโ€–xj[t]โ€–.\displaystyle\leq 2NL^{3}\sum_{j=i}^{N}\|x_{j}^{[t]}\|.

The term T2T_{2} can be bounded by leveraging the Lipschitz continuity of yiy_{i}

โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\| โ‰คLโ€‹โˆ‘j=1iโˆ’1โ€–ฮธj[t+1]โˆ’ฮธj[t]โ€–\displaystyle\leq L\sum_{j=1}^{i-1}\|\theta_{j}^{[t+1]}-\theta_{j}^{[t]}\|
โ‰คLโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹(โ€–ฮ”โ€‹fj[t]โ€–+Nโ€‹L2โ€‹โˆ‘k=jNโ€–xk[t]โ€–)\displaystyle\leq L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\left(\|\Delta f_{j}^{[t]}\|+NL^{2}\sum_{k=j}^{N}\|x_{k}^{[t]}\|\right)
โ‰คLโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–+N2โ€‹L3โ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โˆ‘k=jNโ€–xk[t]โ€–,\displaystyle\leq L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|+N^{2}L^{3}\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\sum_{k=j}^{N}\|x_{k}^{[t]}\|, (56)

where the second inequality plugs in the result of Lemmaย 3.

Since ฮฑj[t]โ‰คฮฑk[t]\alpha_{j}^{[t]}\leq\alpha_{k}^{[t]} for any jโ‰คkj\leq k, ย (56) implies

T2\displaystyle T_{2} =โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–2\displaystyle=\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|^{2}
โ‰ค2โ€‹(Lโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–)2+2โ€‹(N2โ€‹L3โ€‹โˆ‘j=1Nฮฑj[t]โ€‹โ€–xj[t]โ€–)2\displaystyle\leq 2\left(L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|\right)^{2}+2\left(N^{2}L^{3}\sum_{j=1}^{N}\alpha_{j}^{[t]}\|x_{j}^{[t]}\|\right)^{2}
โ‰ค2โ€‹Nโ€‹L2โ€‹โˆ‘j=1iโˆ’1(ฮฑj[t])2โ€‹โ€–ฮ”โ€‹fj[t]โ€–2+2โ€‹N5โ€‹L6โ€‹โˆ‘j=1N(ฮฑj[t])2โ€‹โ€–xj[t]โ€–2.\displaystyle\leq 2NL^{2}\sum_{j=1}^{i-1}(\alpha_{j}^{[t]})^{2}\|\Delta f_{j}^{[t]}\|^{2}+2N^{5}L^{6}\sum_{j=1}^{N}(\alpha_{j}^{[t]})^{2}\|x_{j}^{[t]}\|^{2}. (57)

Next we treat T3T_{3}

T3\displaystyle T_{3} โ‰คฮฑi[t]โ€‹โ€–ฮ”โ€‹fi[t]โ€–โ€‹โ€–xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โ€–\displaystyle\leq\alpha_{i}^{[t]}\|\Delta f_{i}^{[t]}\|\|x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|
โ‰ค2โ€‹ฮฑi[t]ฮดโ€‹โ€–ฮ”โ€‹fi[t]โ€–2+ฮดโ€‹ฮฑi[t]8โ€‹โ€–xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle\leq\frac{2\alpha_{i}^{[t]}}{\delta}\|\Delta f_{i}^{[t]}\|^{2}+\frac{\delta\alpha_{i}^{[t]}}{8}\|x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
โ‰ค2โ€‹ฮฑi[t]ฮดโ€‹โ€–ฮ”โ€‹fi[t]โ€–2+ฮดโ€‹ฮฑi[t]8โ€‹(โ€–xi[t]โ€–2โˆ’ฮดโ€‹ฮฑi[t]2โ€‹โ€–xi[t]โ€–2+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2)\displaystyle\leq\frac{2\alpha_{i}^{[t]}}{\delta}\|\Delta f_{i}^{[t]}\|^{2}+\frac{\delta\alpha_{i}^{[t]}}{8}\left(\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{2}\|x_{i}^{[t]}\|^{2}+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}\right)
โ‰ค2โ€‹ฮฑi[t]ฮดโ€‹โ€–ฮ”โ€‹fi[t]โ€–2+ฮดโ€‹ฮฑi[t]8โ€‹โ€–xi[t]โ€–2+N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2,\displaystyle\leq\frac{2\alpha_{i}^{[t]}}{\delta}\|\Delta f_{i}^{[t]}\|^{2}+\frac{\delta\alpha_{i}^{[t]}}{8}\|x_{i}^{[t]}\|^{2}+\frac{N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}, (58)

where the third inequality uses (53) and the fourth inequality follows from the step size condition ฮฑi[t]โ‰ค1ฮด\alpha_{i}^{[t]}\leq\frac{1}{\delta}.

To bound T4T_{4}, we plug in (53) and (56)

T4\displaystyle T_{4} โ‰คโ€–xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โ€–โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\leq\|x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|
โ‰คโ€–xi[t]โ€–2โˆ’ฮดโ€‹ฮฑi[t]2โ€‹โ€–xi[t]โ€–2+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2โ‹…\displaystyle\leq\sqrt{\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{2}\|x_{i}^{[t]}\|^{2}+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}}\quad\cdot
(Lโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–+Nโ€‹L3โ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โˆ‘k=jNโ€–xk[t]โ€–)\displaystyle\hskip 50.0pt\left(L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|+NL^{3}\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\sum_{k=j}^{N}\|x_{k}^{[t]}\|\right)
โ‰ค(โˆฅxi[t]โˆฅ+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโˆ‘j=i+1Nโˆฅxj[t]โˆฅ)โ‹…\displaystyle\leq\left(\|x_{i}^{[t]}\|+\sqrt{\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|\right)\cdot
(Lโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–+N2โ€‹L3โ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–xj[t]โ€–+N2โ€‹L3โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=iNโ€–xj[t]โ€–)\displaystyle\hskip 50.0pt\left(L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|+N^{2}L^{3}\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|x_{j}^{[t]}\|+N^{2}L^{3}\alpha_{i-1}^{[t]}\sum_{j=i}^{N}\|x_{j}^{[t]}\|\right)
โ‰คLโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–โ€‹โ€–xi[t]โ€–+L4โ€‹(ฮฑi[t])1/2โ€‹ฮฑiโˆ’1[t]โ€‹8โ€‹N3ฮดโ€‹(โˆ‘j=i+1Nโ€–xj[t]โ€–)โ€‹(โˆ‘j=1iโˆ’1โ€–ฮ”โ€‹fj[t]โ€–)\displaystyle\leq L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|\|x_{i}^{[t]}\|+L^{4}(\alpha_{i}^{[t]})^{1/2}\alpha_{i-1}^{[t]}\sqrt{\frac{8N^{3}}{\delta}}\left(\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|\right)\left(\sum_{j=1}^{i-1}\|\Delta f_{j}^{[t]}\|\right)
+N2โ€‹L3โ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–xj[t]โ€–โ€‹โ€–xi[t]โ€–+N2โ€‹L3โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=iNโ€–xj[t]โ€–โ€‹โ€–xi[t]โ€–\displaystyle\hskip 20.0pt+N^{2}L^{3}\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|x_{j}^{[t]}\|\|x_{i}^{[t]}\|+N^{2}L^{3}\alpha_{i-1}^{[t]}\sum_{j=i}^{N}\|x_{j}^{[t]}\|\|x_{i}^{[t]}\|
+L6โ€‹(ฮฑi[t])1/2โ€‹ฮฑiโˆ’1[t]โ€‹8โ€‹N7ฮดโ€‹(โˆ‘j=i+1Nโ€–xj[t]โ€–)โ€‹(โˆ‘j=1Nโ€–xj[t]โ€–)\displaystyle\hskip 20.0pt+L^{6}(\alpha_{i}^{[t]})^{1/2}\alpha_{i-1}^{[t]}\sqrt{\frac{8N^{7}}{\delta}}\left(\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|\right)\left(\sum_{j=1}^{N}\|x_{j}^{[t]}\|\right)
โ‰คNโ€‹Lโ€‹ฮฑiโˆ’1[t]4โ€‹โ€–xi[t]โ€–2+Lโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–ฮ”โ€‹fj[t]โ€–2+Nโ€‹L4โ€‹(ฮฑi[t])1/2โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2\displaystyle\leq\frac{NL\alpha_{i-1}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|\Delta f_{j}^{[t]}\|^{2}+NL^{4}(\alpha_{i}^{[t]})^{1/2}\alpha_{i-1}^{[t]}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}
+2โ€‹N4โ€‹L4โ€‹(ฮฑi[t])1/2โ€‹ฮฑiโˆ’1[t]ฮดโ€‹โˆ‘j=i+1Nโ€–ฮ”โ€‹fj[t]โ€–2+4โ€‹N6โ€‹L6โ€‹ฮฑiโˆ’1[t]ฮดโ€‹โ€–xi[t]โ€–2+ฮด16โ€‹Nโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+\frac{2N^{4}L^{4}(\alpha_{i}^{[t]})^{1/2}\alpha_{i-1}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|\Delta f_{j}^{[t]}\|^{2}+\frac{4N^{6}L^{6}\alpha_{i-1}^{[t]}}{\delta}\|x_{i}^{[t]}\|^{2}+\frac{\delta}{16N}\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|x_{j}^{[t]}\|^{2}
+N3โ€‹L3โ€‹ฮฑiโˆ’1[t]4โ€‹โ€–xi[t]โ€–2+N2โ€‹L3โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=iNโ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+\frac{N^{3}L^{3}\alpha_{i-1}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+N^{2}L^{3}\alpha_{i-1}^{[t]}\sum_{j=i}^{N}\|x_{j}^{[t]}\|^{2}
+Nโ€‹L6โ€‹(ฮฑi[t])1/2โ€‹ฮฑiโˆ’1[t]โ€‹2โ€‹N7ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2+Nโ€‹L6โ€‹(ฮฑi[t])1/2โ€‹ฮฑiโˆ’1[t]โ€‹2โ€‹N7ฮดโ€‹โˆ‘j=1Nโ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+NL^{6}(\alpha_{i}^{[t]})^{1/2}\alpha_{i-1}^{[t]}\sqrt{\frac{2N^{7}}{\delta}}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}+NL^{6}(\alpha_{i}^{[t]})^{1/2}\alpha_{i-1}^{[t]}\sqrt{\frac{2N^{7}}{\delta}}\sum_{j=1}^{N}\|x_{j}^{[t]}\|^{2}
โ‰คโˆ‘j=1iโˆ’1(ฮดโ€‹ฮฑj[t]16โ€‹N+Nโ€‹L6โ€‹2โ€‹N7ฮดโ€‹(ฮฑi[t])2)โ€‹โ€–xj[t]โ€–2+4โ€‹N2โ€‹L3โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=iNโ€–xj[t]โ€–2\displaystyle\leq\sum_{j=1}^{i-1}\left(\frac{\delta\alpha_{j}^{[t]}}{16N}+NL^{6}\sqrt{\frac{2N^{7}}{\delta}}(\alpha_{i}^{[t]})^{2}\right)\|x_{j}^{[t]}\|^{2}+4N^{2}L^{3}\alpha_{i-1}^{[t]}\sum_{j=i}^{N}\|x_{j}^{[t]}\|^{2}
+(N3โ€‹L3โ€‹ฮฑiโˆ’1[t]2+4โ€‹N6โ€‹L6โ€‹ฮฑiโˆ’1[t]ฮด)โ€‹โ€–xi[t]โ€–2+Lโ€‹ฮฑi[t]โ€‹โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2\displaystyle\hskip 20.0pt+(\frac{N^{3}L^{3}\alpha_{i-1}^{[t]}}{2}+\frac{4N^{6}L^{6}\alpha_{i-1}^{[t]}}{\delta})\|x_{i}^{[t]}\|^{2}+L\alpha_{i}^{[t]}\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}
โ‰คโˆ‘j=1iโˆ’1(ฮดโ€‹ฮฑj[t]16โ€‹N+Nโ€‹L6โ€‹2โ€‹N7ฮดโ€‹(ฮฑi[t])2)โ€‹โ€–xj[t]โ€–2+4โ€‹N2โ€‹L3โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2\displaystyle\leq\sum_{j=1}^{i-1}\left(\frac{\delta\alpha_{j}^{[t]}}{16N}+NL^{6}\sqrt{\frac{2N^{7}}{\delta}}(\alpha_{i}^{[t]})^{2}\right)\|x_{j}^{[t]}\|^{2}+4N^{2}L^{3}\alpha_{i-1}^{[t]}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}
+(9โ€‹N3โ€‹L3โ€‹ฮฑiโˆ’1[t]2+4โ€‹N6โ€‹L6โ€‹ฮฑiโˆ’1[t]ฮด)โ€‹โ€–xi[t]โ€–2+Lโ€‹ฮฑi[t]โ€‹โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2,\displaystyle\hskip 20.0pt+(\frac{9N^{3}L^{3}\alpha_{i-1}^{[t]}}{2}+\frac{4N^{6}L^{6}\alpha_{i-1}^{[t]}}{\delta})\|x_{i}^{[t]}\|^{2}+L\alpha_{i}^{[t]}\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}, (59)

where the second inequality follows from the fact that a1+a2+โ‹ฏโ‰คa1+a2+โ‹ฏ\sqrt{a_{1}+a_{2}+\cdots}\leq\sqrt{a_{1}}+\sqrt{a_{2}}+\cdots for any positive scalars a1,a2,โ‹ฏa_{1},a_{2},\cdots. We have also simplified terms using the step size conditions ฮฑi[t]โ‰คฮด2โ€‹N5โ€‹L6\alpha_{i}^{[t]}\leq\frac{\delta}{2N^{5}L^{6}}, ฮฑi[t]โ‰คN2L2\alpha_{i}^{[t]}\leq\frac{N^{2}}{L^{2}}, and ฮฑi[t]โ‰คฮด24โ€‹N8โ€‹L6\alpha_{i}^{[t]}\leq\frac{\delta^{2}}{4N^{8}L^{6}}.
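Both elementary facts invoked here (and in (57)) admit a one-line numeric spot-check; a minimal sketch:

```python
# Spot-check of the two elementary inequalities used in (57) and (59):
#   (a_1 + ... + a_n)^2 <= n * (a_1^2 + ... + a_n^2)   (Cauchy-Schwarz), and
#   sqrt(a_1 + ... + a_n) <= sqrt(a_1) + ... + sqrt(a_n).
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    a = rng.uniform(0.0, 10.0, int(rng.integers(1, 20)))
    assert a.sum() ** 2 <= len(a) * (a**2).sum() + 1e-9
    assert np.sqrt(a.sum()) <= np.sqrt(a).sum() + 1e-9
print("both inequalities held on all draws")
```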

We finally bound T5T_{5} with (57),

T5\displaystyle T_{5} โ‰คฮฑi[t]โ€‹โ€–ฮ”โ€‹fi[t]โ€–โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\leq\alpha_{i}^{[t]}\|\Delta f_{i}^{[t]}\|\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|
โ‰ค(ฮฑi[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+14โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–2\displaystyle\leq(\alpha_{i}^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}+\frac{1}{4}\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|^{2}
โ‰ค(ฮฑi[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+Nโ€‹L22โ€‹โˆ‘j=1iโˆ’1(ฮฑj[t])2โ€‹โ€–ฮ”โ€‹fj[t]โ€–2+N5โ€‹L62โ€‹โˆ‘j=1N(ฮฑj[t])2โ€‹โ€–xj[t]โ€–2.\displaystyle\leq(\alpha_{i}^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}+\frac{NL^{2}}{2}\sum_{j=1}^{i-1}(\alpha_{j}^{[t]})^{2}\|\Delta f_{j}^{[t]}\|^{2}+\frac{N^{5}L^{6}}{2}\sum_{j=1}^{N}(\alpha_{j}^{[t]})^{2}\|x_{j}^{[t]}\|^{2}. (60)

Collecting the bounds on T1T_{1}-T5T_{5} and applying them to (52), we have

โ€–xi[t+1]โ€–2\displaystyle\|x_{i}^{[t+1]}\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’ฮดโ€‹ฮฑi[t]2โ€‹โ€–xi[t]โ€–2+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2+(ฮฑi[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2\displaystyle\leq\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{2}\|x_{i}^{[t]}\|^{2}+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}+(\alpha_{i}^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}
+2โ€‹Nโ€‹L2โ€‹โˆ‘j=1iโˆ’1(ฮฑj[t])2โ€‹โ€–ฮ”โ€‹fj[t]โ€–2+2โ€‹N5โ€‹L6โ€‹โˆ‘j=1N(ฮฑj[t])2โ€‹โ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+2NL^{2}\sum_{j=1}^{i-1}(\alpha_{j}^{[t]})^{2}\|\Delta f_{j}^{[t]}\|^{2}+2N^{5}L^{6}\sum_{j=1}^{N}(\alpha_{j}^{[t]})^{2}\|x_{j}^{[t]}\|^{2}
+2โ€‹ฮฑi[t]ฮดโ€‹โ€–ฮ”โ€‹fi[t]โ€–2+ฮดโ€‹ฮฑi[t]8โ€‹โ€–xi[t]โ€–2+N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+\frac{2\alpha_{i}^{[t]}}{\delta}\|\Delta f_{i}^{[t]}\|^{2}+\frac{\delta\alpha_{i}^{[t]}}{8}\|x_{i}^{[t]}\|^{2}+\frac{N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}
+โˆ‘j=1iโˆ’1(ฮดโ€‹ฮฑj[t]16โ€‹N+Nโ€‹L6โ€‹2โ€‹N7ฮดโ€‹(ฮฑi[t])2)โ€‹โ€–xj[t]โ€–2+4โ€‹N2โ€‹L3โ€‹ฮฑiโˆ’1[t]โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+\sum_{j=1}^{i-1}\left(\frac{\delta\alpha_{j}^{[t]}}{16N}+NL^{6}\sqrt{\frac{2N^{7}}{\delta}}(\alpha_{i}^{[t]})^{2}\right)\|x_{j}^{[t]}\|^{2}+4N^{2}L^{3}\alpha_{i-1}^{[t]}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}
+(9โ€‹N3โ€‹L3โ€‹ฮฑiโˆ’1[t]2+4โ€‹N6โ€‹L6โ€‹ฮฑiโˆ’1[t]ฮด)โ€‹โ€–xi[t]โ€–2+Lโ€‹ฮฑi[t]โ€‹โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2\displaystyle\hskip 20.0pt+(\frac{9N^{3}L^{3}\alpha_{i-1}^{[t]}}{2}+\frac{4N^{6}L^{6}\alpha_{i-1}^{[t]}}{\delta})\|x_{i}^{[t]}\|^{2}+L\alpha_{i}^{[t]}\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}
+(ฮฑi[t])2โ€‹โ€–ฮ”โ€‹fi[t]โ€–2+Nโ€‹L22โ€‹โˆ‘j=1iโˆ’1(ฮฑj[t])2โ€‹โ€–ฮ”โ€‹fj[t]โ€–2+N5โ€‹L62โ€‹โˆ‘j=1N(ฮฑj[t])2โ€‹โ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+(\alpha_{i}^{[t]})^{2}\|\Delta f_{i}^{[t]}\|^{2}+\frac{NL^{2}}{2}\sum_{j=1}^{i-1}(\alpha_{j}^{[t]})^{2}\|\Delta f_{j}^{[t]}\|^{2}+\frac{N^{5}L^{6}}{2}\sum_{j=1}^{N}(\alpha_{j}^{[t]})^{2}\|x_{j}^{[t]}\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’(3โ€‹ฮดโ€‹ฮฑi[t]8โˆ’9โ€‹N3โ€‹L3โ€‹ฮฑiโˆ’1[t]2โˆ’4โ€‹N6โ€‹L6โ€‹ฮฑiโˆ’1[t]ฮดโˆ’5โ€‹N5โ€‹L6โ€‹(ฮฑi[t])22)โ€‹โ€–xi[t]โ€–2\displaystyle\leq\|x_{i}^{[t]}\|^{2}-\left(\frac{3\delta\alpha_{i}^{[t]}}{8}-\frac{9N^{3}L^{3}\alpha_{i-1}^{[t]}}{2}-\frac{4N^{6}L^{6}\alpha_{i-1}^{[t]}}{\delta}-\frac{5N^{5}L^{6}(\alpha_{i}^{[t]})^{2}}{2}\right)\|x_{i}^{[t]}\|^{2}
+โˆ‘j=1iโˆ’1(5โ€‹N5โ€‹L6โ€‹(ฮฑj[t])22+ฮดโ€‹ฮฑj[t]16โ€‹N+Nโ€‹L6โ€‹2โ€‹N7ฮดโ€‹(ฮฑi[t])2)โ€‹โ€–xj[t]โ€–2\displaystyle\hskip 20.0pt+\sum_{j=1}^{i-1}\left(\frac{5N^{5}L^{6}(\alpha_{j}^{[t]})^{2}}{2}+\frac{\delta\alpha_{j}^{[t]}}{16N}+NL^{6}\sqrt{\frac{2N^{7}}{\delta}}(\alpha_{i}^{[t]})^{2}\right)\|x_{j}^{[t]}\|^{2}
+โˆ‘j=i+1N(9โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮด+5โ€‹N5โ€‹L6โ€‹(ฮฑj[t])22+4โ€‹N2โ€‹L3โ€‹ฮฑi[t])โ€‹โ€–xj[t]โ€–2+(3ฮด+L)โ€‹ฮฑi[t]โ€‹โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2\displaystyle\hskip 20.0pt+\sum_{j=i+1}^{N}\left(\frac{9N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}+\frac{5N^{5}L^{6}(\alpha_{j}^{[t]})^{2}}{2}+4N^{2}L^{3}\alpha_{i}^{[t]}\right)\|x_{j}^{[t]}\|^{2}+(\frac{3}{\delta}+L)\alpha_{i}^{[t]}\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’ฮดโ€‹ฮฑi[t]4โ€‹โ€–xi[t]โ€–2+โˆ‘j=1iโˆ’1ฮดโ€‹ฮฑj[t]8โ€‹Nโ€‹โ€–xj[t]โ€–2\displaystyle\leq\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+\sum_{j=1}^{i-1}\frac{\delta\alpha_{j}^{[t]}}{8N}\|x_{j}^{[t]}\|^{2}
+(9โ€‹N3โ€‹L6ฮด+8โ€‹N2โ€‹L3)โ€‹ฮฑi[t]โ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2+(3ฮด+L)โ€‹ฮฑi[t]โ€‹โˆ‘j=1Nโ€–ฮ”โ€‹fj[t]โ€–2,\displaystyle\hskip 20.0pt+\left(\frac{9N^{3}L^{6}}{\delta}+8N^{2}L^{3}\right)\alpha_{i}^{[t]}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}+(\frac{3}{\delta}+L)\alpha_{i}^{[t]}\sum_{j=1}^{N}\|\Delta f_{j}^{[t]}\|^{2},

where in the second inequality we have combined and simplified terms with the step size conditions ฮฑi[t]โ‰ค1ฮด\alpha_{i}^{[t]}\leq\frac{1}{\delta} and ฮฑi[t]โ‰ค25โ€‹Nโ€‹L\alpha_{i}^{[t]}\leq\frac{2}{5NL}, and the third inequality follows from ฮฑiโˆ’1[t]ฮฑi[t]โ‰คฮด16โ€‹(9โ€‹N3โ€‹L32+4โ€‹N6โ€‹L6ฮด)โˆ’1\frac{\alpha_{i-1}^{[t]}}{\alpha_{i}^{[t]}}\leq\frac{\delta}{16}(\frac{9N^{3}L^{3}}{2}+\frac{4N^{6}L^{6}}{\delta})^{-1}, ฮฑi[t]โ‰คฮด40โ€‹N5โ€‹L6\alpha_{i}^{[t]}\leq\frac{\delta}{40N^{5}L^{6}}, (ฮฑi[t])2ฮฑ1[t]โ‰คฮด3/264โ€‹N7\frac{(\alpha_{i}^{[t]})^{2}}{\alpha_{1}^{[t]}}\leq\frac{\delta^{3/2}}{64N^{7}}, ฮฑi[t]โ‰คฮด80โ€‹N6โ€‹L6\alpha_{i}^{[t]}\leq\frac{\delta}{80N^{6}L^{6}}, and (ฮฑi[t])2ฮฑ1[t]โ‰ค85โ€‹N3โ€‹L3\frac{(\alpha_{i}^{[t]})^{2}}{\alpha_{1}^{[t]}}\leq\frac{8}{5N^{3}L^{3}}.

โ–ก\Box
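The proof above invokes a number of absolute caps on the step sizes. For reference, the following small Python helper (a convenience sketch, not part of the analysis) collects them and returns the binding one; the ratio conditions between ฮฑiโˆ’1[t]\alpha_{i-1}^{[t]}, ฮฑi[t]\alpha_{i}^{[t]}, and ฮฑ1[t]\alpha_{1}^{[t]} must still be checked separately.

```python
# Convenience sketch: the absolute step-size caps invoked in the proof above.
def max_step_size(delta: float, N: int, L: float) -> float:
    caps = [
        delta / (8 * N**3 * L**6),     # used for T_1, i.e. (53)
        1 / delta,                     # used for T_1 and T_3
        delta / (2 * N**5 * L**6),     # used for T_4, i.e. (59)
        N**2 / L**2,                   # used for T_4
        delta**2 / (4 * N**8 * L**6),  # used for T_4
        2 / (5 * N * L),               # used when collecting terms
        delta / (40 * N**5 * L**6),    # used in the final inequality
        delta / (80 * N**6 * L**6),    # used in the final inequality
    ]
    return min(caps)

print(max_step_size(delta=0.5, N=3, L=1.2))
```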

A.3 Proof of Lemmaย 3

By the definition of fi[t]f_{i}^{[t]},

โ€–fi[t]โ€–\displaystyle\|f_{i}^{[t]}\| =โ€–ฮ”โ€‹fi[t]+Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โ€–\displaystyle=\|\Delta f_{i}^{[t]}+\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\|
โ‰คโ€–ฮ”โ€‹fi[t]โ€–+โ€–Fยฏiโ€‹(๐œฝ[t])โˆ’Fยฏiโ€‹(๐œฝ1:iโˆ’1[t],๐ฒi:Nโ€‹(๐œฝ1:iโˆ’1[t]))โ€–\displaystyle\leq\|\Delta f_{i}^{[t]}\|+\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]},{\bf y}_{i:N}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\|
โ‰คโ€–ฮ”โ€‹fi[t]โ€–+Lโ€‹โˆ‘j=iNโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–.\displaystyle\leq\|\Delta f_{i}^{[t]}\|+L\sum_{j=i}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|. (61)

Using the identity (54), we can bound the second term of (61) recursively

โˆ‘j=iNโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–\displaystyle\sum_{j=i}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|
โ‰คโ€–xi[t]โ€–+โˆ‘j=i+1Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–\displaystyle\leq\|x_{i}^{[t]}\|+\sum_{j=i+1}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|
โ‰คโ€–xi[t]โ€–+โ€–ฮธi+1[t]โˆ’yi+1โ€‹(๐œฝ1:iโˆ’1[t],yiโ€‹(๐œฝ1:iโˆ’1[t]))โ€–+โˆ‘j=i+2Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–\displaystyle\leq\|x_{i}^{[t]}\|+\|\theta_{i+1}^{[t]}-y_{i+1}(\boldsymbol{\theta}_{1:i-1}^{[t]},y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]}))\|+\sum_{j=i+2}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|
โ‰ค(L+1)โ€‹โ€–xi[t]โ€–+โ€–xi+1[t]โ€–+โˆ‘j=i+2Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–\displaystyle\leq(L+1)\|x_{i}^{[t]}\|+\|x_{i+1}^{[t]}\|+\sum_{j=i+2}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|
โ‰ค(2โ€‹L+1)โ€‹โ€–xi[t]โ€–+(L+1)โ€‹โ€–xi+1[t]โ€–+โ€–xi+2[t]โ€–+โˆ‘j=i+3Nโ€–ฮธj[t]โˆ’yjโ€‹(๐œฝ1:iโˆ’1[t])โ€–\displaystyle\leq(2L+1)\|x_{i}^{[t]}\|+(L+1)\|x_{i+1}^{[t]}\|+\|x_{i+2}^{[t]}\|+\sum_{j=i+3}^{N}\|\theta_{j}^{[t]}-y_{j}(\boldsymbol{\theta}_{1:i-1}^{[t]})\|
โ‰คNโ€‹Lโ€‹โˆ‘j=iNโ€–xj[t]โ€–.\displaystyle\leq NL\sum_{j=i}^{N}\|x_{j}^{[t]}\|. (62)

Plugging (62) into (61), we have

โ€–fi[t]โ€–โ‰คโ€–ฮ”โ€‹fi[t]โ€–+Nโ€‹L2โ€‹โˆ‘j=iNโ€–xj[t]โ€–.\displaystyle\|f_{i}^{[t]}\|\leq\|\Delta f_{i}^{[t]}\|+NL^{2}\sum_{j=i}^{N}\|x_{j}^{[t]}\|. (63)

The bound on โ€–ฮธi[t+1]โˆ’ฮธi[t]โ€–\|\theta_{i}^{[t+1]}-\theta_{i}^{[t]}\| follows simply from (63) and the update rule (7).

โ–ก\Box

A.4 Proof of Lemmaย 4

By the update rule (6), we have

xi[t+1]\displaystyle x_{i}^{[t+1]} =ฮธi[t+1]โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])\displaystyle=\theta_{i}^{[t+1]}-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})
=ฮธi[t]โˆ’ฮฑi[t]โ€‹Fiโ€‹(๐œฝ[t],X[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])\displaystyle=\theta_{i}^{[t]}-\alpha_{i}^{[t]}F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})
=xi[t]โˆ’ฮฑi[t]โ€‹Fiโ€‹(๐œฝ[t],X[t])+(yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1]))\displaystyle=x_{i}^{[t]}-\alpha_{i}^{[t]}F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})+\left(y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\right)
=xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โˆ’ฮฑi[t]โ€‹(Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t]))+(yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])).\displaystyle=x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})-\alpha_{i}^{[t]}\big{(}F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\big{)}+\big{(}y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\big{)}.

Taking the squared norm,

โ€–xi[t+1]โ€–2\displaystyle\|x_{i}^{[t+1]}\|^{2}
=โ€–xi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t])โ€–2โŸA1+(ฮฑi[t])2โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2โŸA2\displaystyle=\underbrace{\|x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}}_{A_{1}}+\underbrace{(\alpha_{i}^{[t]})^{2}\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}}_{A_{2}}
+โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–2โŸA3+ฮฑi[t]โ€‹โŸจxi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t]),Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โŸฉโŸA4\displaystyle\hskip 20.0pt+\underbrace{\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|^{2}}_{A_{3}}+\underbrace{\alpha_{i}^{[t]}\langle x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}),F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle}_{A_{4}}
+โŸจxi[t]โˆ’ฮฑi[t]โ€‹Fยฏiโ€‹(๐œฝ[t]),yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โŸฉโŸA5\displaystyle\hskip 20.0pt+\underbrace{\langle x_{i}^{[t]}-\alpha_{i}^{[t]}\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}),y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\rangle}_{A_{5}}
+ฮฑi[t]โ€‹โŸจFiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t]),yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โŸฉโŸA6.\displaystyle\hskip 20.0pt+\underbrace{\alpha_{i}^{[t]}\langle F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]}),y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\rangle}_{A_{6}}. (64)

A bound on A1A_{1} can be obtained using an argument identical to (53), provided that the step sizes satisfy ฮฑi[t]โ‰คฮด8โ€‹N3โ€‹L6\alpha_{i}^{[t]}\leq\frac{\delta}{8N^{3}L^{6}} and ฮฑi[t]โ‰ค1ฮด\alpha_{i}^{[t]}\leq\frac{1}{\delta}

A1\displaystyle A_{1} =โ€–xi[t]โ€–2โˆ’2โ€‹ฮฑi[t]โ€‹โŸจxi[t],Fยฏiโ€‹(๐œฝ[t])โŸฉ+(ฮฑi[t])2โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โ€–2\displaystyle=\|x_{i}^{[t]}\|^{2}-2\alpha_{i}^{[t]}\langle x_{i}^{[t]},\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\rangle+(\alpha_{i}^{[t]})^{2}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}
โ‰คโ€–xi[t]โ€–2โˆ’ฮดโ€‹ฮฑi[t]2โ€‹โ€–xi[t]โ€–2+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1Nโ€–xj[t]โ€–2.\displaystyle\leq\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{2}\|x_{i}^{[t]}\|^{2}+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}. (65)

The second term is bounded in expectation due to Assumptionย 5

A2=(ฮฑi[t])2โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–2โ‰ค4โ€‹D2โ€‹(ฮฑi[t])2.\displaystyle A_{2}=(\alpha_{i}^{[t]})^{2}\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|^{2}\leq 4D^{2}(\alpha_{i}^{[t]})^{2}. (66)

Using the Lipschitz continuity of yiy_{i}, we have

โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\| โ‰คLโ€‹โˆ‘j=1iโˆ’1โ€–ฮธj[t+1]โˆ’ฮธj[t]โ€–\displaystyle\leq L\sum_{j=1}^{i-1}\|\theta_{j}^{[t+1]}-\theta_{j}^{[t]}\|
=Lโ€‹โˆ‘j=1iโˆ’1ฮฑj[t]โ€‹โ€–Fjโ€‹(๐œฝ[t],X[t])โ€–\displaystyle=L\sum_{j=1}^{i-1}\alpha_{j}^{[t]}\|F_{j}(\boldsymbol{\theta}^{[t]},X^{[t]})\|
โ‰คNโ€‹Lโ€‹Dโ€‹ฮฑiโˆ’1[t],\displaystyle\leq NLD\alpha_{i-1}^{[t]}, (67)

which implies

A3\displaystyle A_{3} =โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–2โ‰คN2โ€‹L2โ€‹D2โ€‹(ฮฑiโˆ’1[t])2.\displaystyle=\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|^{2}\leq N^{2}L^{2}D^{2}(\alpha_{i-1}^{[t]})^{2}. (68)

The term A4A_{4} is zero in expectation, since

๐”ผโ€‹[Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])]=0.\displaystyle\mathbb{E}[F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})]=0.

The bound on A5A_{5} follows from the Cauchy-Schwarz inequality

A5\displaystyle A_{5} โ‰คโ€–xi[t]โ€–โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–+ฮฑi[t]โ€‹โ€–Fยฏiโ€‹(๐œฝ[t])โ€–โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\leq\|x_{i}^{[t]}\|\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|+\alpha_{i}^{[t]}\|\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|
โ‰คฮดโ€‹ฮฑi[t]4โ€‹โ€–xi[t]โ€–2+1ฮดโ€‹ฮฑi[t]โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–2+Dโ€‹ฮฑi[t]โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\leq\frac{\delta\alpha_{i}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+\frac{1}{\delta\alpha_{i}^{[t]}}\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|^{2}+D\alpha_{i}^{[t]}\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|
โ‰คฮดโ€‹ฮฑi[t]4โ€‹โ€–xi[t]โ€–2+(Nโ€‹Lโ€‹Dโ€‹ฮฑiโˆ’1[t])2ฮดโ€‹ฮฑi[t]+Dโ€‹ฮฑi[t]โ‹…Nโ€‹Lโ€‹Dโ€‹ฮฑiโˆ’1[t]\displaystyle\leq\frac{\delta\alpha_{i}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+\frac{(NLD\alpha_{i-1}^{[t]})^{2}}{\delta\alpha_{i}^{[t]}}+D\alpha_{i}^{[t]}\cdot NLD\alpha_{i-1}^{[t]}
โ‰คฮดโ€‹ฮฑi[t]4โ€‹โ€–xi[t]โ€–2+N2โ€‹L2โ€‹D2โ€‹(ฮฑiโˆ’1[t])2ฮดโ€‹ฮฑi[t]+Nโ€‹Lโ€‹D2โ€‹ฮฑi[t]โ€‹ฮฑiโˆ’1[t]\displaystyle\leq\frac{\delta\alpha_{i}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+\frac{N^{2}L^{2}D^{2}(\alpha_{i-1}^{[t]})^{2}}{\delta\alpha_{i}^{[t]}}+NLD^{2}\alpha_{i}^{[t]}\alpha_{i-1}^{[t]} (69)

where in the third inequality we plug in (67).

Finally, we bound A6A_{6} using a similar argument

A6\displaystyle A_{6} โ‰คฮฑi[t]โ€‹โ€–Fiโ€‹(๐œฝ[t],X[t])โˆ’Fยฏiโ€‹(๐œฝ[t])โ€–โ€‹โ€–yiโ€‹(๐œฝ1:iโˆ’1[t])โˆ’yiโ€‹(๐œฝ1:iโˆ’1[t+1])โ€–\displaystyle\leq\alpha_{i}^{[t]}\|F_{i}(\boldsymbol{\theta}^{[t]},X^{[t]})-\widebar{F}_{i}(\boldsymbol{\theta}^{[t]})\|\|y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t]})-y_{i}(\boldsymbol{\theta}_{1:i-1}^{[t+1]})\|
โ‰คฮฑi[t]โ‹…2โ€‹Dโ‹…Nโ€‹Lโ€‹Dโ€‹ฮฑiโˆ’1[t]\displaystyle\leq\alpha_{i}^{[t]}\cdot 2D\cdot NLD\alpha_{i-1}^{[t]}
โ‰ค2โ€‹Nโ€‹Lโ€‹D2โ€‹ฮฑiโˆ’1[t]โ€‹ฮฑi[t].\displaystyle\leq 2NLD^{2}\alpha_{i-1}^{[t]}\alpha_{i}^{[t]}. (70)

Collecting the bounds in (65)-(70) and putting them into (64), we get

๐”ผโ€‹[โ€–xi[t+1]โ€–2]\displaystyle\mathbb{E}[\|x_{i}^{[t+1]}\|^{2}]
โ‰ค๐”ผ[โˆฅxi[t]โˆฅ2โˆ’ฮดโ€‹ฮฑi[t]2โˆฅxi[t]โˆฅ2+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโˆ‘j=i+1Nโˆฅxj[t]โˆฅ2+4D2(ฮฑi[t])2+N2L2D2(ฮฑiโˆ’1[t])2\displaystyle\leq\mathbb{E}[\|x_{i}^{[t]}\|^{2}-\frac{\delta\alpha_{i}^{[t]}}{2}\|x_{i}^{[t]}\|^{2}+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\|x_{j}^{[t]}\|^{2}+4D^{2}(\alpha_{i}^{[t]})^{2}+N^{2}L^{2}D^{2}(\alpha_{i-1}^{[t]})^{2}
+ฮดโ€‹ฮฑi[t]4โˆฅxi[t]โˆฅ2+N2โ€‹L2โ€‹D2โ€‹(ฮฑiโˆ’1[t])2ฮดโ€‹ฮฑi[t]+NLD2ฮฑi[t]ฮฑiโˆ’1[t]+2NLD2ฮฑiโˆ’1[t]ฮฑi[t]]\displaystyle\hskip 20.0pt+\frac{\delta\alpha_{i}^{[t]}}{4}\|x_{i}^{[t]}\|^{2}+\frac{N^{2}L^{2}D^{2}(\alpha_{i-1}^{[t]})^{2}}{\delta\alpha_{i}^{[t]}}+NLD^{2}\alpha_{i}^{[t]}\alpha_{i-1}^{[t]}+2NLD^{2}\alpha_{i-1}^{[t]}\alpha_{i}^{[t]}]
โ‰ค๐”ผโ€‹[โ€–xi[t]โ€–2]โˆ’ฮดโ€‹ฮฑi[t]4โ€‹๐”ผโ€‹[โ€–xi[t]โ€–2]+8โ€‹N3โ€‹L6โ€‹ฮฑi[t]ฮดโ€‹โˆ‘j=i+1N๐”ผโ€‹[โ€–xj[t]โ€–2]\displaystyle\leq\mathbb{E}[\|x_{i}^{[t]}\|^{2}]-\frac{\delta\alpha_{i}^{[t]}}{4}\mathbb{E}[\|x_{i}^{[t]}\|^{2}]+\frac{8N^{3}L^{6}\alpha_{i}^{[t]}}{\delta}\sum_{j=i+1}^{N}\mathbb{E}[\|x_{j}^{[t]}\|^{2}]
+8โ€‹N2โ€‹L2โ€‹D2โ€‹(ฮฑi[t])2+N2โ€‹L2โ€‹D2โ€‹(ฮฑiโˆ’1[t])2ฮดโ€‹ฮฑi[t],\displaystyle\hskip 20.0pt+8N^{2}L^{2}D^{2}(\alpha_{i}^{[t]})^{2}+\frac{N^{2}L^{2}D^{2}(\alpha_{i-1}^{[t]})^{2}}{\delta\alpha_{i}^{[t]}},

where the second inequality combines and simplifies the noise terms using ฮฑiโˆ’1[t]โ‰คฮฑi[t]\alpha_{i-1}^{[t]}\leq\alpha_{i}^{[t]}.

โ–ก\Box
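To illustrate the contraction established in this lemma, the following minimal simulation sketch runs update rule (6) on a hypothetical N=2N=2 linear system (the operators, step sizes, and noise level are illustrative assumptions, not from the paper) and reports the residuals xi[t]=ฮธi[t]โˆ’yiโ€‹(โ‹…)x_{i}^{[t]}=\theta_{i}^{[t]}-y_{i}(\cdot).

```python
# Minimal simulation sketch of update rule (6) on a toy N = 2 linear system:
#   F_1(theta, X) = theta_1 - theta_2 + noise,
#   F_2(theta, X) = theta_2 - 0.5 * theta_1 + noise,
# with unique solution theta* = (0, 0) and y_2(theta_1) = 0.5 * theta_1.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, -1.0])
T = 20000
for t in range(1, T + 1):
    a1, a2 = 0.5 / t, 1.0 / t ** (2 / 3)  # alpha_1 <= alpha_2: theta_2 is the faster iterate
    noise = rng.normal(0.0, 0.1, size=2)
    F = np.array([theta[0] - theta[1], theta[1] - 0.5 * theta[0]]) + noise
    theta = theta - np.array([a1, a2]) * F
x1 = theta[0]                    # y_1 is the root theta_1* = 0
x2 = theta[1] - 0.5 * theta[0]   # y_2(theta_1) = 0.5 * theta_1
print(abs(x1), abs(x2))          # both residuals should be small after T steps
```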

A.5 Proof of Lemmaย 5

This lemma bounds the distance between the distribution of samples drawn from a time-varying Markov chain and the corresponding stationary distribution. Techniques for proving such bounds have been well studied in the literature. We therefore omit the proof, noting that it closely follows that of Lemma 1 in Zeng etย al. (2024b).
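For intuition, the quantity controlled here can be visualized on a finite chain. The following minimal sketch (a time-homogeneous simplification, with an illustrative irreducible, aperiodic transition matrix) tracks the total-variation distance to stationarity, which decays geometrically.

```python
# Minimal sketch of geometric mixing on an illustrative finite Markov chain.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])          # hypothetical transition matrix

evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()                            # stationary distribution: mu P = mu

dist = np.array([1.0, 0.0, 0.0])          # start deterministically in state 0
for t in range(1, 21):
    dist = dist @ P
    tv = 0.5 * np.abs(dist - mu).sum()    # total-variation distance at time t
    if t % 5 == 0:
        print(f"t = {t:2d}, TV distance = {tv:.2e}")
```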