Accelerated Multi-Time-Scale Stochastic Approximation: Optimal Complexity and Applications in Reinforcement Learning and Multi-Agent Games
Abstract
Multi-time-scale stochastic approximation is an iterative algorithm for finding the fixed point of a set of coupled operators given their noisy samples. It has been observed that, due to the coupling between the decision variables and the noisy samples of the operators, the performance of this method degrades as the number of coupled equations increases. In this work, we develop a new accelerated variant of multi-time-scale stochastic approximation, which significantly improves the convergence rates of its standard counterpart. Our key idea is to introduce auxiliary variables that dynamically estimate the operators from their samples; these estimates are then used to update the decision variables. The auxiliary variables help not only to control the variance of the operator estimates but also to decouple the sampling noise from the decision variables. This allows us to select more aggressive step sizes and achieve an optimal convergence rate. Specifically, under a strong monotonicity condition, we show that for any number of coupled equations the iterate of the proposed algorithm converges to the desired solution at a rate of $\widetilde{\mathcal{O}}(1/k)$, even when the operator samples are generated from a single Markov process trajectory.
A second contribution of this work is to demonstrate that the objective of a range of problems in reinforcement learning and multi-agent games can be expressed as a system of fixed-point equations. As such, the proposed approach can be used to design new learning algorithms for solving these problems. We illustrate this observation with numerical simulations in a multi-agent game and show the advantage of the proposed method over the standard multi-time-scale stochastic approximation algorithm.
Keywords: Multi-time-scale stochastic approximation, reinforcement learning, game theory
1 Introduction
The objective of a range of problems in reinforcement learning and games can be expressed as a coupled system of equations, with each equation defined through a nonlinear operator that can be queried with noise. This work proposes a single-loop stochastic approximation (SA) algorithm that solves such a system using a (near) optimal number of queries. Given a compact statistical sample space and a collection of coupled operators, our aim is to find a solution tuple such that
(5) |
where the expectation in each equation is taken with respect to a distribution over the sample space that (possibly) depends on the decision variables. We do not assume direct access to this distribution. Rather, our sampling oracle is a transition kernel that we may query, which induces the distribution above as its stationary distribution. More precisely, given the current decision variables and the previous sample, we can draw the next sample from this kernel. We consider this relaxed sampling oracle because it realistically captures how samples are generated in reinforcement learning (RL). We also introduce notation for the space of the decision variables and for aggregated tuples of decision variables, which is used throughout the paper.
It may appear that the system (5) is symmetric and each equation is identically coupled with any other equation. This is not true in the context of RL and games, where the problems usually have a specific hierarchical structure. The structure dictates the order in which the system can be most naturally solved and is formalized later through Assumptionย 3.
Exploiting the hierarchical structure, the standard multi-time-scale stochastic approximation algorithm for solving (5), hereafter referred to as MSA, maintains a parameter for each equation as an estimate of the corresponding component of the solution and simultaneously updates all of them, each with its own step size. Given (possibly biased) stochastic samples of the operators, MSA iteratively carries out
(6) |
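To make the iteration concrete, the following is a minimal Python sketch of the MSA update (6). The function names, the i.i.d. Gaussian noise model, and the concrete step-size schedules are illustrative assumptions, not the paper's setup (which allows Markovian sampling).

```python
import numpy as np

def msa(operators, x0, step_sizes, num_iters, noise_std=0.1, seed=0):
    """Minimal sketch of the MSA iteration (6): every decision variable is
    updated along a single noisy sample of its operator, each with its own
    step size. Noise is i.i.d. Gaussian here purely for illustration."""
    rng = np.random.default_rng(seed)
    x = [np.array(v, dtype=float) for v in x0]
    for k in range(num_iters):
        samples = [F(x) + noise_std * rng.standard_normal(x[i].shape)
                   for i, F in enumerate(operators)]     # noisy operator evaluations
        for i in range(len(x)):
            x[i] = x[i] - step_sizes[i](k) * samples[i]  # level-specific step size
    return x

# Toy coupled system: F1(x) = x1 - 1, F2(x) = x2 - x1 (solution x1 = x2 = 1).
ops = [lambda x: x[0] - 1.0, lambda x: x[1] - x[0]]
steps = [lambda k: 1.0 / (k + 10), lambda k: 1.0 / (k + 10) ** (2 / 3)]
print(msa(ops, [np.zeros(1), np.zeros(1)], steps, num_iters=5000))
```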
With properly selected step sizes, this simple method is guaranteed to converge to the solution of (5) under strong monotonicity and Lipschitz continuity assumptions on the operators, to be introduced later (Hong et al., 2023; Zeng et al., 2024b; Shen and Chen, 2022; Han et al., 2024). However, its time and sample complexity degrades as the number of equations increases and, as far as is currently known, is sub-optimal. For a single equation, MSA reduces to the classic stochastic approximation algorithm (Robbins and Monro, 1951) and has a complexity of $\mathcal{O}(\epsilon^{-1})$ when the samples are drawn i.i.d. from a fixed distribution (and $\widetilde{\mathcal{O}}(\epsilon^{-1})$ if the samples are consecutively generated from a Markov chain, where $\widetilde{\mathcal{O}}$ hides logarithmic factors). More precisely, the last iterate of (6) converges to an $\epsilon$-neighborhood of the solution in $\mathcal{O}(\epsilon^{-1})$ iterations, matching the worst-case lower bound (Chen et al., 2022). For two equations, the best-known complexity, as established in Hong et al. (2023); Zeng et al. (2024b), becomes $\mathcal{O}(\epsilon^{-1.5})$, inferior to the single-equation complexity by a polynomial factor. The authors in Hong et al. (2023); Zeng et al. (2024b) only analyze the case of two equations, but we show in Section 3.2 that their techniques can be extended to establish a strictly worse complexity for three equations, with further degraded bounds beyond that. This apparent gap in the performance of MSA for different numbers of equations makes us question whether the problem becomes fundamentally harder as the number of equations grows or whether an algorithmic/analytical improvement can be made. In this work, we answer the question in the positive direction by proposing an accelerated SA algorithm which achieves the optimal complexity for any number of equations while maintaining the favorable properties of (6), such as its simplicity and single-loop structure. The same algorithm has been introduced in our prior works (Doan, 2024; Zeng and Doan, 2024), but we make important contributions beyond them, as discussed in detail in Section 1.2. We now summarize our main contributions.
1.1 Main Contributions
Number of Equations | Standard Multi-Time-Scale SA Complexity | A-MSA Complexity |
Two | ||
Three | ||
Four | ||
The key contribution of this work is to propose a novel Accelerated Multi-Time-Scale Stochastic Approximation (A-MSA) algorithm that solves a coupled system of equations with the optimal convergence rate $\widetilde{\mathcal{O}}(1/k)$ for strongly monotone operators, even when the samples are generated from a single trajectory of Markov processes whose kernels are parameterized by the decision variables. In the two-equation case, this rate improves over $\mathcal{O}(k^{-2/3})$, the best-known complexity of the MSA algorithm derived in Doan (2022); Hong et al. (2023). With three or more equations, the complexity of MSA deteriorates further and falls even more short of our result (see Table 1). Our innovation is built on the observation that the main hindrance to the fast convergence of (6) lies in the direct coupling between the iterates across levels, which means that any error or noise in one iterate immediately propagates to all iterates in the next iteration. Such a coupling effect prevents us from choosing the aggressive step sizes necessary for achieving the optimal complexity. Our solution to this challenge is to update each decision variable in the direction of a properly averaged version of the corresponding operator (as opposed to the single sample used in (6)). This seemingly small modification maintains the simplicity of (6) while eliminating the direct coupling between the iterates. In Section 3.3, we compare the analyses of MSA and A-MSA and pinpoint the convergence bottleneck in MSA, which we overcome through an algorithmic modification.
A second contribution of our work is to show that the coupled equations in (5) abstract many popular problems in games and RL, including gradient-based temporal difference learning with momentum, distributionally robust RL, policy optimization in a two-player zero-sum Markov game, and policy optimization in mean-field games. As a consequence, the proposed A-MSA method can be used to design new algorithms for solving these problems. Some of these problems do not satisfy the strong monotonicity condition but are specially structured in their own ways. We show how the analysis for strongly monotone operators can be adapted to the structure of each respective problem, which can potentially establish state-of-the-art or previously unknown convergence rates.
Applicable to Equations | Sample Noise | Smoothness Assumption | Sample Complexity | |
Deb and Bhatnagar (2021) | Yes | Bounded Martingale | Not Required | - |
Hong etย al. (2023) | No | i.i.d. | Not Required | |
Zeng etย al. (2024b) | No | Time-Varying Markovian | Not Required | |
Shen and Chen (2022) | Yes | Bounded Martingale | Required | |
Han etย al. (2024) | No | Bounded Martingale | Relaxed Version | |
Doan (2024) | No | i.i.d. | Not Required | |
Zeng and Doan (2024) | No | i.i.d. | Not Required | |
This Work | Yes | Time-Varying Markovian | Not Required |
1.2 Related Work
This paper closely relates to the existing works on single- and multi-time-scale SA (especially under Markovian samples) and those that analyze the finite-time and sample complexity of reinforcement learning algorithms through the lens of SA. We discuss the most relevant works in these domains and highlight our novelty.
Stochastic Approximation. The SA algorithm is a classic method for solving a single fixed-point equation given noisy operator samples (Robbins and Monro, 1951). The asymptotic convergence of SA is well understood and usually derived with the Poisson equation method (Benveniste etย al., 2012; Li etย al., 2023a) or by analyzing a related ODE (Meerkov, 1972; Borkar, 2008). Under i.i.d. samples, bounded Martingale difference noise, or dependent samples from a Markov chain with bounded mixing time, the finite-time analysis measured by squared errors in expectation is established for linear (Lakshminarayanan and Szepesvรกri, 2017; Srikant and Ying, 2019) and strongly monotone operators (Moulines and Bach, 2011; Chen etย al., 2022). Linear SA abstracts the temporal difference (TD) learning algorithm in RL. Various works (Bhandari etย al., 2018; Mitra, 2024) study the finite-time convergence of TD learning leveraging the problem structure, but their analysis may be extended to linear SA. Recently, non-asymptotic central limit theorems have also been proved for TD learning (Srikant, 2024) and Q learning (Li etย al., 2023b), which show the error convergence to a zero-mean Gaussian random variable in distribution. Such results may be extended to general linear/non-linear SA as well.
Some of the aforementioned works employ Polyak-Ruppert averaging to improve variance control (Srikant, 2024; Haque et al., 2023). We note that the averaging step in the proposed method is distinct from Polyak-Ruppert averaging: Polyak-Ruppert averaging does not modify the algorithm itself but returns a weighted average of all past iterates as the solution, whereas our algorithm uses an averaged operator-estimate sequence inside the updates.
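To make the distinction concrete, here is a schematic comparison in Python (the names `update` and `sample_op` stand in for a generic SA update and operator oracle, and the variables are assumed to be NumPy arrays or scalars): Polyak-Ruppert averaging post-processes the iterates, whereas the averaging used in this paper modifies the update itself.

```python
def polyak_ruppert(update, x0, num_iters):
    """Polyak-Ruppert averaging: the base SA update is left untouched; only the
    reported solution is the running average of the iterates."""
    x, avg = x0, x0
    for k in range(num_iters):
        x = update(x, k)
        avg = avg + (x - avg) / (k + 2)   # running average of x_0, ..., x_{k+1}
    return avg

def averaged_sa_step(sample_op, x, y, alpha, lam):
    """A-MSA-style step: the operator estimate y is averaged first and then
    drives the update of the decision variable x."""
    y = (1.0 - lam) * y + lam * sample_op(x)   # averaged operator estimate
    x = x - alpha * y                          # x moves along the estimate, not the raw sample
    return x, y
```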
Multi-Time-Scale Stochastic Approximation. In the two-equation case, the MSA algorithm reduces to the standard two-time-scale SA, which is known to converge asymptotically under linear (Borkar, 1997, 2008; Konda and Tsitsiklis, 2004) and non-linear operators (Mokkadem and Pelletier, 2006; Fort, 2015). In the linear setting, the finite-time convergence measured by mean squared errors is first established with a sub-optimal rate in Gupta et al. (2019); Doan and Romberg (2019) and improved to the optimal rate in Kaledin et al. (2020); Haque et al. (2023). In addition, convergence in probability is studied in Dalal et al. (2018). In the non-linear setting under the strong monotonicity assumption, the best-known rate of two-time-scale SA remains $\mathcal{O}(k^{-2/3})$ in general (Doan, 2022; Hong et al., 2023; Zeng et al., 2024b). Under an additional smoothness assumption, Shen and Chen (2022); Han et al. (2024) show that the optimal rate can be recovered. Our paper does not require this assumption, as it may not hold in practical applications.
The two-time-scale SA framework has wide applications in control, RL, and games, e.g., gradient-based TD learning (Sutton et al., 2009; Wang et al., 2021), actor-critic methods (Konda and Borkar, 1999; Konda and Tsitsiklis, 1999; Wu et al., 2020), policy gradient descent-ascent for two-player games (Daskalakis et al., 2020), and decentralized Q-learning (Sayin et al., 2021). By providing an analysis for the general framework, one can apply it to each specific problem and immediately deduce the performance guarantees, without having to tailor the analysis on a case-by-case basis.
The study of SA for solving three or more equations remains limited but deserves increased attention due to its ability to model rich classes of problems, including gradient-based TD learning with momentum (Deb and Bhatnagar, 2022), distributionally robust RL (Liang et al., 2024), and policy optimization in games; these applications are discussed in Section 4. The existing works on multi-time-scale SA include Deb and Bhatnagar (2021), which only provides an asymptotic analysis, and Shen and Chen (2022), which relies on a restrictive smoothness assumption.
Finally, we discuss the differences between the current paper and our prior works (Doan, 2024; Zeng and Doan, 2024), which study the special two-equation setting under i.i.d. samples. First, this paper extends our prior works to the general setting with any number of equations. Moving from two equations to three or more creates a critical technical challenge: the mid-level variables are coupled with faster- and slower-moving variables at the same time, making it harder to decouple them through operator averaging. We overcome this challenge through an improved bound on the residual that measures the convergence of the iterates to a properly defined learning target (introduced later), expressed as a function of the step sizes and the remaining error terms. Second, we perform a mathematical investigation of the root cause of the complexity improvement. An explanation of why A-MSA enjoys a faster convergence rate is missing in our prior papers. This work fills the gap by contrasting an important intermediate result obtained under A-MSA and MSA and pinpointing the terms that are improved by the averaged operator estimation step. Finally, this paper considers a more general sampling oracle: we study non-i.i.d. samples generated from a time-varying Markov chain, a setting realistic for modeling applications in sequential decision making. To avoid complexity degradation due to Markovian samples, analyzing A-MSA requires carefully bounding the distance between the distribution of each sample and the (time-varying) stationary distribution determined by the current decision variables, whereas this distance is zero in expectation in the i.i.d. setting.
2 Accelerated Multi-Time-Scale Stochastic Approximation Method
The MSA method (6) updates each decision variable in the direction of the corresponding operator evaluated at the latest iterates. Our proposed method, formally presented in Algorithm 1, modifies this update by introducing auxiliary variables that estimate the operators through a weighted average of the operator samples, controlled by an averaging parameter. These auxiliary variables are then used to update the decision variables. If the averaging parameter is set to one for all iterations, Algorithm 1 reduces to MSA in (6). However, a large averaging weight leads to a high-variance estimate, whereas if the weight is too small the estimate cannot track the moving target operator. Decaying the averaging weight appropriately with respect to the step sizes helps to control the variance of our estimates, which is crucial for deriving the optimal convergence rate.
(7) |
(8) |
(9) |
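The following is a minimal Python sketch of the A-MSA updates, mirroring the MSA sketch above: auxiliary variables maintain a running weighted average of the operator samples (cf. (8)) and drive the decision-variable updates (cf. (7)). The i.i.d. noise model and the schedules `steps` and `avg_weight` are illustrative assumptions, not the algorithm's exact specification, which uses the Markovian sampling oracle (9).

```python
import numpy as np

def a_msa(operators, x0, step_sizes, avg_weight, num_iters, noise_std=0.1, seed=0):
    """Sketch of A-MSA: each auxiliary variable y[i] keeps a running weighted
    average of the operator samples and replaces the raw sample in the
    decision-variable update. Noise is i.i.d. here purely for illustration."""
    rng = np.random.default_rng(seed)
    x = [np.array(v, dtype=float) for v in x0]
    y = [np.zeros_like(v) for v in x]                   # averaged operator estimates
    for k in range(num_iters):
        lam = avg_weight(k)                             # averaging parameter
        for i, F in enumerate(operators):
            sample = F(x) + noise_std * rng.standard_normal(x[i].shape)
            y[i] = (1.0 - lam) * y[i] + lam * sample    # low-variance estimate, cf. (8)
        for i in range(len(x)):
            x[i] = x[i] - step_sizes[i](k) * y[i]       # update driven by the estimate, cf. (7)
    return x

# Same toy system as in the MSA sketch; setting avg_weight(k) = 1 recovers MSA.
ops = [lambda x: x[0] - 1.0, lambda x: x[1] - x[0]]
steps = [lambda k: 1.0 / (k + 10)] * 2
print(a_msa(ops, [np.zeros(1), np.zeros(1)], steps,
            avg_weight=lambda k: 1.0 / (k + 10) ** 0.9, num_iters=5000))
```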
Note that in our algorithm the samples are drawn from the transition kernel parameterized by the current decision variables (see (9)). Thus, as the decision variables change over time, the samples are temporally dependent and Markovian with time-varying stationary distributions. To ensure stable behavior of the Markovian samples, we make the following assumptions. Our first assumption concerns the fast mixing of the Markov chain generated under a fixed decision variable. Given two distributions, their total variation (TV) distance and the mixing time of the Markov chain are defined as
(10) |
Definition 1
Given a fixed decision variable, consider the Markov chain generated by the corresponding transition kernel, with the induced stationary distribution. For any accuracy level, the mixing time of the chain at that level is defined as
Assumption 1 (Uniform Geometric Ergodicity)
For any fixed decision variable, the Markov chain generated under the corresponding transition kernel has a unique stationary distribution. In addition, there exist constants, independent of the decision variable, such that the distribution of the chain converges to this stationary distribution geometrically fast in total variation distance.
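In generic notation (ours, not necessarily the paper's), writing $\theta$ for the decision variables, $P_\theta$ for the induced kernel, and $\mu_\theta$ for its stationary distribution, the definitions and the geometric ergodicity bound typically take the form
\[
d_{TV}(\mu,\nu) \;=\; \sup_{B}\,\big|\mu(B)-\nu(B)\big|,
\qquad
\tau_{\theta}(\eta) \;=\; \min\Big\{ t\ge 0 \,:\, \sup_{\xi}\, d_{TV}\big(P^{t}_{\theta}(\xi,\cdot),\,\mu_{\theta}\big) \le \eta \Big\},
\]
\[
\sup_{\xi}\, d_{TV}\big(P^{t}_{\theta}(\xi,\cdot),\,\mu_{\theta}\big) \;\le\; C\,\rho^{\,t}
\quad\text{for all } t\ge 0,\ \text{with } C>0,\ \rho\in(0,1)\ \text{independent of } \theta .
\]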
The geometric ergodicity assumption states that the distribution of a sample from the Markov chain approaches the stationary distribution geometrically fast, and is again standard in the literature on RL and SA under Markovian noise (Srikant and Ying, 2019; Wu etย al., 2020; Zeng etย al., 2024b).
With appropriate shorthand, this assumption implies that there exists a positive constant, depending only on the constants in Assumption 1, such that
(11) |
For convenience, we denote by the mixing time corresponding to in Algorithm 1.
Assumption 2
Given two decision variables and two distributions over the sample space, consider the distributions obtained by applying the corresponding transition kernels to each of them. There exists a constant such that
(12) |
In addition, the stationary distribution is Lipschitz continuous in the decision variables:
(13) |
Assumption 2 represents a regularity condition on the transition kernel, which can be interpreted as a type of triangle inequality for the TV distance defined over the transition kernels with respect to different parameters. This assumption can be shown to hold in standard MDPs (see Wu et al. (2020, Lemma A1)).
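In the same generic notation, conditions of the type in (12)-(13) can be sketched as follows (an illustrative form, not the paper's exact statement), where $\nu,\nu'$ are arbitrary distributions over the sample space:
\[
d_{TV}\big(\nu P_{\theta},\,\nu' P_{\theta'}\big) \;\le\; d_{TV}(\nu,\nu') + L\,\|\theta-\theta'\|,
\qquad
d_{TV}\big(\mu_{\theta},\,\mu_{\theta'}\big) \;\le\; L\,\|\theta-\theta'\| .
\]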
We also define the filtration that contains all the randomness generated by Algorithm 1 up to the current iteration.
3 Main Results
In this section, we establish the finite-sample complexity of A-MSA for finding the solution of (5) when the operators have a nested structure. For convenience, we refer to the first and last equations in (5) as the highest- and lowest-level problems, respectively; i.e., the problem level increases from the last equation to the first. Levels here indicate the order in which the system can be most naturally solved.
With some abuse of notation, we denote . Given , we use to denote the solution of the lowest level problem
Similarly, we denote by the solution of the last two levels corresponding to given
(14) |
Conceptually, these maps represent the optimal solutions of the last and second-to-last equations with respect to the last and second-to-last decision variables when all higher-level variables are given.
Generalizing this line of discussion, we introduce, for each level, a map representing the optimal solution of that equation with respect to its own variable when all higher-level variables are given. For the highest level (i.e., when no higher-level variable is given), this map returns the corresponding component of the optimal solution to (5). We aggregate these maps into a single notation and note that the resulting tuple solves the corresponding system of equations.
We now introduce the main assumption that drives our analysis, referred to as the nested strong monotonicity of the operators corresponding to the increasing levels defined above.
Assumption 3 (Strong Monotonicity)
There exists a constant such that for all and ,
Assumption 3 represents a nested variant of strong monotonicity for the operators, which does not require each operator to be strongly monotone with respect to all of its variables. It states that, given the higher-level variables, each operator is strongly monotone with respect to its own variable when all lower-level decision variables are at their corresponding optimal solutions. In the two-equation case, the assumption reduces to the same condition made in the existing literature (Doan, 2022; Shen and Chen, 2022; Han et al., 2024) and can be verified to hold in RL applications including gradient-based TD learning (Xu et al., 2019) and distributed TD learning (to model distributed TD learning in the two-time-scale SA framework, the outer-loop equation enforces consensus among the agents, whereas in the inner loop each agent learns its local value function). As an implication of the assumption, the lower-level solution maps are always uniquely defined. Without any loss of generality, we normalize the strong monotonicity constant, which makes it convenient for us to simplify terms.
Assumption 4 (Lipschitz Continuity and Boundedness)
There exist positive finite constants such that, for all
(15)
(16)
(17)
Eqs. (15) and (16) are standard Lipschitz continuity assumptions on the operators and the learning targets (Doan, 2022; Zeng et al., 2024b; Shen and Chen, 2022). Eq. (17) assumes that the learning targets are bounded and guarantees the stability of the algorithm in the presence of time-varying Markovian noise. This assumption is made in Kaledin et al. (2020); Zeng et al. (2024b) in the two-equation case. Without any loss of generality, we use the same bounding constant throughout, which is finite due to the compactness of the sample space. Assumption 4 implies that the operator can be bounded affinely by the norm of the decision variable
(18) |
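In generic notation, a bound of this type reads as follows (a sketch; the constant below is generic rather than exactly the one appearing in (15)-(17)):
\[
\|G_i(x,\xi)\| \;\le\; L\,\big(1+\|x\|\big) \qquad \text{for all } x,\ \xi,\ \text{and } i,
\]
which follows by combining the Lipschitz continuity of the operators with the boundedness conditions.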
We use the same constant in Assumptionsย 2 and 4 to simplify the notation.
Convergence Metric. We quantify the algorithm convergence through the residuals
(20) |
where the first quantity is the optimality residual, measuring the distance of each decision variable to the learning target introduced above, and the second is the estimation residual, capturing the quality of the auxiliary estimates of the operators. Note that if both residuals are zero at every level, then the decision variables coincide with the desired solution.
Finally, we will study the convergence of Algorithm 1 under the following choice of step sizes
(21) |
where the averaging weights are chosen to dominate the step sizes, as the operator estimates are updated at a "faster" time scale than the decision variables. In particular, these step sizes satisfy the following conditions for all iterations
(22) |
where the threshold is an absolute constant depending on the structural constants of the problem. We note that there always exist choices of the step-size constants such that the step sizes in (21) satisfy (22).
For our analysis, we consider the following Lyapunov function
(23) |
Under the choice of step sizes above, the main result of this paper is presented in the following theorem, where we characterize the convergence complexity of Algorithm 1
Theorem 1 shows that the last iterate of the A-MSA algorithm converges to the unique solution of (5) in the mean-squared sense at an optimal rate of $\widetilde{\mathcal{O}}(1/k)$. As the algorithm draws exactly one sample in each iteration, this rate translates to a sample complexity of $\widetilde{\mathcal{O}}(1/\epsilon)$ for finding an $\epsilon$-optimal solution. The same complexity has been shown to be achievable by the MSA algorithm under an additional smoothness assumption on the operators (Shen and Chen, 2022). We match the rate without requiring this restrictive assumption. In the absence of the smoothness condition, our rate is order-optimal (up to a logarithmic factor) and significantly improves over the best-known rate of MSA, which is $\mathcal{O}(k^{-2/3})$ for two equations as established in Hong et al. (2023); Zeng et al. (2024b); Han et al. (2024). In Section 3.2, we further show that the convergence rate of MSA is even slower with three equations and deteriorates as the number of equations increases. This observation highlights the advantage of our proposed algorithm, whose convergence complexity is optimal for an arbitrary number of equations.
3.1 Proof of Theoremย 1
We introduce the following technical lemmas, which set up important intermediate steps toward the results in Theorem 1. These lemmas establish useful upper bounds, in expectation, on the optimality and estimation residuals, and reveal how these two quantities depend on each other.
Lemma 1
In (24), the terms associated with different levels are scaled by different step sizes. This subtle difference is the key to establishing the optimal complexity of Algorithm 1.
It is worth noting that if the iterates are generated by the MSA algorithm, a per-iteration analysis similar to Lemma 2 can be established. We carry out this analysis and point out the distinctions that allow A-MSA to converge faster in Section 3.2 (Lemma 4).
We defer the proofs of the lemmas to the appendix but point out that, to prove Lemma 2, the following bound provides an important tool for controlling the stability of the iterates. The implication of Lemma 3 is that the distance it controls converges to zero at a faster time scale, provided that the relevant residuals converge.
Proof (of Theoremย 1).
The proof follows in a straightforward manner by combining the bounds in Lemmasย 1 and 2 and choosing the correct step sizes. Recall the Lyapunov function (23)
(25) |
where the first inequality plugs in the bounds from Lemmas 1 and 2, and in the last inequality we have artificially introduced an intermediate term and used the step-size relations. Note that the last two terms of (25) are non-positive under the step size conditions in (22). As a result, we have
due to and . Multiplying by on both sides, we have
which when dividing by gives the desired result.
3.2 Standard Multi-Time-Scale Stochastic Approximation
In this section, we present the finite-time complexity of the MSA algorithm derived using the techniques in Hong et al. (2023); Zeng et al. (2024b), to highlight its gap from the optimal rate established in Theorem 1. We start with the case of three equations, where we want to solve
For simplicity, we assume (only in this section) that the stationary distribution is independent of the decision variables and that the samples are drawn i.i.d. from it. We additionally assume that the operator samples have uniformly bounded energy, as follows. Note that this is not a necessary assumption; it is adopted only to simplify the analysis of MSA. Also note that we do not make this assumption in the analysis of the proposed A-MSA algorithm.
Assumption 5
There exists a constant such that for all , , and
Recall the update rule of MSA in (6). We measure the convergence through the optimality residuals, the same metric (20) used for analyzing A-MSA.
We next establish the per-iteration bound for , where the analysis can be found in Appendixย A.4.
Lemma 4
It is worth noting that Lemma 4 holds for any number of equations. We apply this lemma to the case of three equations and consider the Lyapunov function
(26) |
where and .
Theorem 2
Proof.
With , Lemmaย 4 implies
Recall the Lyapunov function . Note that and since decays faster than . We have
(28) |
where the last inequality is a result of
following if the step sizes satisfy .
We now discuss how Theorem 2 can be extended to more than three equations and what the expected convergence rate should be. In the two-equation case, Zeng et al. (2024b) shows that the convergence rate is on the order of (up to a logarithmic factor)
(29) |
To optimize (29), the step sizes should be selected to balance the competing terms, which leads to a convergence rate of $\mathcal{O}(k^{-2/3})$. With three equations, the convergence rate is on the order of
which dictates the choice of step sizes in (27).
3.3 Reflecting on Faster Convergence of A-MSA
Lemmas 2 and 4 establish critical per-iteration convergence bounds on the errors in the decision variables for A-MSA and MSA, respectively. In this section, we contrast corresponding terms in the two bounds to pinpoint the advantage of Lemma 2 over Lemma 4. In the analysis of the decision-variable errors under the two algorithms, the common/comparable terms include the following:
- The first comparable term arises from the shift in the learning target across iterations and appears in the proofs of both Lemmas 2 and 4 (in Sections A.2 and A.4, respectively). Due to the Lipschitz continuity of the learning targets imposed in Assumption 4, controlling this term essentially requires handling
(31) Under MSA, there is currently no better bound on (31) than a constant multiple of the step size. Since the sampled operator is on the order of a constant even when the iterate approaches the optimal solution (due to the random sample), the bound eventually scales with the step size itself and does not vanish faster.
In comparison, under A-MSA, we can control (31) with Lemmaย 3 and show
(32) Since the residuals on the right-hand side themselves decay to zero as the iterations proceed, the bound in (32) converges much faster than its counterpart under MSA.
Takeaway: Our objective is to solve the highly coupled system (5). The discussion above highlights how the A-MSA algorithm effectively decouples the decision variables, in the sense that the inaccuracy of the upper-level variables has a reduced effect on the lower-level ones. Being able to decouple the decision-variable updates is crucial, as it allows us to choose the step sizes across levels in a more independent manner.
- A second pair of comparable terms captures the variance in the estimation of the operators: one appears in the proof of Lemma 2 for A-MSA, and its counterpart appears in the proof of Lemma 4 for MSA. In the case of MSA, this variance term does not decay to zero (again due to the randomness in the samples), which eventually becomes a bottleneck to achieving the optimal rate. In contrast, the corresponding term under A-MSA can be expected to converge to zero (under a careful analysis).
Takeaway: Compared to MSA, A-MSA provides a low-variance estimate of the operators. The fact that this variance decays sufficiently fast is another key driver of the overall improved complexity.
Remark 2
At the end of this section, we point out an important additional advantage of the analysis of A-MSA over that of MSA. Observe that the Lyapunov function (26) considered in Theorem 2 weights the lower-level residuals by carefully chosen factors. These weights have been selected to ensure the proper cancellation of errors across different levels. Choosing such weights, however, requires non-trivial effort and becomes increasingly inconvenient as the number of equations grows.
In addition, we note that the weights themselves are decaying sequences. Theorem 2 therefore only yields correspondingly weaker guarantees for the lower-level residuals; in other words, the guaranteed convergence rate of the lower-level variables does not match that of the highest-level variable and may even be vacuous. In contrast, the Lyapunov function (23) used in the analysis of A-MSA simply combines the residuals over all levels without additional weights. As a result, the decision variables at all levels are guaranteed to converge at the same optimal rate.
4 Motivating Applications
In this section, we discuss how the framework (5) models the objective of a range of algorithms in RL and games. In some cases, the objective is structured such that the assumptions introduced above can be verified to hold, meaning that the theoretical guarantee immediately applies if the A-MSA algorithm is used. For the other problems, the assumptions may not hold, but we discuss how the analysis can be adapted, using the specific problem structure, to derive the complexity of A-MSA. All applications discussed are special cases of (5) with three equations, meaning that they cannot be covered by the bilevel framework in Zeng et al. (2024b); Han et al. (2024); Doan (2024); Zeng and Doan (2024).
4.1 Gradient-Based Temporal Difference learning with Momentum.
Gradient-based temporal difference (TD) learning algorithms, including GTD2 and TDC, are a popular class of methods that stably minimize a projected Bellman error under linear function approximation and off-policy samples (Sutton et al., 2008, 2009). Consider an infinite-horizon discounted MDP characterized by its action and state spaces, a transition probability kernel (whose entries give the probability of the MDP transitioning from one state to another under a given action), a reward function, and a discount factor. Each state is associated with a feature vector. Given a fixed policy to evaluate, our objective is to learn a value function parameter such that the induced linear approximation matches the value function at every state. The TDC algorithm formulates this objective as a system of two equations in the value function parameter and an auxiliary variable, i.e., we aim to solve
(33) |
A momentum term can be added to the update of the value function parameter to accelerate the convergence of this method. Deb and Bhatnagar (2022) studies this approach and shows that the objective of TDC with momentum can be described as solving a system of three equations. Compared with (33), the additional time scale is introduced to force the momentum term to decay to zero, which becomes the lowest-level equation. The second equation in (33) is primarily used to solve for the auxiliary variable and lies in the middle level, whereas the value function parameter is associated with the highest-level equation. We skip the formulation details here and point interested readers to Deb and Bhatnagar (2022, Section 4.2); a sketch of the standard two-time-scale TDC update is given below for reference. It is important to note that under a common assumption on the feature vectors (see, for example, Xu et al. (2019, Assumption 1)), the strong monotonicity condition holds, allowing us to conclude that A-MSA finds the (unique) optimal solution at the optimal rate. While TDC with momentum has been shown to converge asymptotically in Deb and Bhatnagar (2022), its finite-time complexity is unknown in the existing literature. Our paper fills this gap.
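The sketch below gives the standard two-time-scale TDC update under linear function approximation (written in our own notation and in on-policy form, so the importance-sampling ratio is omitted); the momentum variant of Deb and Bhatnagar (2022) adds a third variable, forming the lowest-level equation, which we do not reproduce exactly here.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One two-time-scale TDC update (Sutton et al., 2009), on-policy form:
    theta (slow) parameterizes the value function, w (fast) is the auxiliary
    variable that tracks a correction term."""
    delta = reward + gamma * phi_next @ theta - phi @ theta          # TD error
    theta_new = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

# Synthetic usage on a random feature stream (not a real MDP), with the fast
# step size beta decaying more slowly than the slow step size alpha.
rng = np.random.default_rng(0)
theta, w = np.zeros(4), np.zeros(4)
for k in range(1, 1001):
    phi, phi_next = rng.standard_normal(4), rng.standard_normal(4)
    theta, w = tdc_step(theta, w, phi, phi_next, rng.standard_normal(),
                        gamma=0.9, alpha=0.5 / k, beta=0.5 / k ** (2 / 3))
```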
4.2 Distributionally Robust Q-Learning
Distributionally robust reinforcement learning (DRRL) broadly studies finding a policy that is robust to unknown environmental changes. In this paper, we introduce the subject following the specific formulation in Liang et al. (2024). Consider again the MDP introduced in Section 4.1, whose transition kernel we can sample from during training. The unknown transition kernel in the test environment may deviate from the training kernel but lies within an uncertainty set of the form
Here the uncertainty set is defined through a distance between distributions and a radius parameter. The aim of DRRL is to find the distributionally robust Q function that performs optimally in the worst case
The distributionally robust Q function satisfies the robust Bellman optimality equation
(34) |
Directly solving (34), however, is challenging, as an exhaustive search over the uncertainty set is infeasible and we cannot sample from the worst-case kernel. The trick developed in the distributionally robust optimization literature (Duchi and Namkoong, 2021) is to use the dual form of the inner worst-case problem
(35) |
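As one concrete instance (a sketch; Liang et al. (2024) treat a more general family of divergences, so this is only illustrative), when the uncertainty set is a KL-divergence ball of radius $\rho$ around the nominal kernel $p^{0}$, the inner worst case over next-state values $V$ admits the classical dual representation
\[
\inf_{p\,:\,D_{\mathrm{KL}}(p\,\|\,p^{0})\,\le\,\rho}\; \mathbb{E}_{p}\big[V\big]
\;=\;
\sup_{\beta\ge 0}\;\Big\{ -\beta\,\log \mathbb{E}_{p^{0}}\big[e^{-V/\beta}\big] \;-\; \beta\rho \Big\},
\]
so the search over distributions is replaced by a scalar dual variable; solving for such dual variables alongside the Q function is, roughly, what gives rise to the additional levels in the system of equations below.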
We define a convenient shorthand for the resulting dual objective. Exploiting this relation, it can be shown that a variant of the Q-learning algorithm for the distributionally robust setting aims to find the solution tuple to the system of equations
(36)
(37)
(38)
Interested readers can find in Liang et al. (2024) the detailed derivation of this system of equations as well as how a three-time-scale algorithm is developed to solve it; Liang et al. (2024) further establishes the asymptotic convergence of the algorithm. The highest-level equation is (36), and the associated Bellman operator is strongly monotone under a common contraction assumption in Q-learning (Chen et al., 2022). The lowest-level equation is (38), whose associated operator is an identity mapping with respect to its own variables and is therefore obviously strongly monotone. The issue is that the middle-level operator in (37) is in general only (non-strongly) monotone with respect to its variable. The violation of this assumption means that our analysis is not immediately applicable. However, the techniques developed here may help design accelerated three-time-scale algorithms with an analysis tailored to the monotone structure of the middle-level operator. Additionally, it is possible that this operator becomes strongly monotone under additional assumptions on the uncertainty set, in which case we can invoke Theorem 1 and conclude that the A-MSA algorithm finds the optimal distributionally robust Q function at the optimal rate.
4.3 Actor-Critic Algorithm for Regularized Two-Player Zero-Sum Markov Games
Consider a two-player zero-sum Markov game described by a state space observed by both players, the two players' action spaces, a transition kernel, a reward function, and a discount factor. The transition kernel specifies the probability of the game transitioning from one state to another when player 1 selects its action and player 2 selects its action. The reward function is zero-sum: when players 1 and 2 take their actions in a given state, player 1 receives the reward and player 2 receives its negative. We denote the two players' policies by the distributions they place over their respective action spaces in each state. For a fixed initial state distribution, the expected discounted cumulative reward under entropy regularization (received by player 1) under a policy pair is
where the regularization weight is a non-negative scalar and the regularized value function is
Solving a regularized game means that we want to find a Nash equilibrium policy pair as the solution to
It is known from Zeng etย al. (2022) that such a Nash equilibrium exists and is unique.
Here we consider the softmax parameterization: we introduce parameters that encode the policies according to
(39) |
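In generic notation (each player carries such a parameterization over its own action space), the softmax parameterization in (39) takes the standard form
\[
\pi_{\theta}(a \mid s) \;=\; \frac{\exp\big(\theta_{s,a}\big)}{\sum_{a'} \exp\big(\theta_{s,a'}\big)} .
\]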
Taking a gradient-based approach to the problem, we can express our objective as finding a stationary point at which the gradients of the regularized objective with respect to the two players' parameters vanish, which can be expanded as below.
(40)
(41)
The value function is not directly known but solvable as the solution to the Bellman equation
(42) |
We need to solve the system of three equations (40)-(42). The lowest level is (42), which corresponds to an operator that can be shown to be strongly monotone. Equations (40) and (41) are the highest-level and middle-level equations, associated with the two players' gradient operators, which are not strongly monotone. This prevents our analysis from being immediately applicable. However, the operators exhibit an important structure: they are the gradients of functions satisfying the Polyak-Łojasiewicz (PL) condition (recalled below), which makes the operators resemble strongly monotone operators in a transformed domain. Exploiting this structure, Zeng et al. (2022) shows that a gradient descent-ascent algorithm that aims to find the solution to (40)-(41), assuming exact knowledge of the value function, converges linearly fast, which is the convergence rate to be expected when the gradient operators are strongly monotone. Combining the techniques in Zeng et al. (2022) for leveraging the PL condition and those in this work on acceleration, we believe that the A-MSA algorithm can be shown to find the Nash equilibrium of a regularized two-player zero-sum Markov game at the convergence rate established in Theorem 1.
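For reference, a differentiable function $f$ with minimum value $f^{\star}$ satisfies the Polyak-Łojasiewicz (PL) condition with constant $\mu_{\mathrm{PL}}>0$ if
\[
\tfrac{1}{2}\,\big\|\nabla f(x)\big\|^{2} \;\ge\; \mu_{\mathrm{PL}}\,\big(f(x)-f^{\star}\big) \qquad \text{for all } x,
\]
which, like strong monotonicity of the gradient operator, yields linear convergence of exact gradient descent even when $f$ is not convex.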
4.4 Actor-Critic Algorithm for Mean Field Games
A mean field game (MFG) provides an approximation of a finite-agent Markov game with homogeneous agents as the number of agents approaches infinity. The goal in solving a mean field game is to find a policy that is optimal in an MDP determined by the induced mean field, where the induced mean field is a distribution over the (infinite) population of agents and a function of the policy itself. In essence, a representative agent in an MFG needs to perform against an infinite population of other agents that all adopt the same policy.
We consider MFGs in the stationary infinite-horizon average-reward setting and follow the formulation and notation in Zeng et al. (2024a). We denote the state and action spaces of the representative agent accordingly. The state transition depends not only on the action of the representative agent but also on the aggregate population behavior, which is described by the mean field. The transition kernel gives the probability that the state transitions from one state to another when the representative agent takes a given action and the mean field takes a given value. Similarly, the reward function depends on the mean field: the representative agent receives a reward determined by its state, its action, and the current mean field. The agent does not observe the mean field, and its policy is a mapping from states to distributions over actions. We also define the state transition matrix under a given policy and mean field such that
When the mean field is and the agent takes policy , the agent can expect to collect the cumulative reward
The mean field induced by a policy (i.e., the state visitation distribution over the population when every agent follows that policy) satisfies
where is the indicator vector whose entry has value 1 if and 0 otherwise.
We again consider the softmax parameterization: a policy parameter encodes the policy according to (39). With the derivation presented in Zeng et al. (2024a), it can be shown that finding an equilibrium in an MFG can be formulated as solving the system of three equations (43)-(45)
(43)
(44)
(45)
in which we introduce an auxiliary variable that tracks the (differential) value function under the current policy and mean field. Among the three equations, (44) and (45) are at the middle and lowest levels and have associated operators satisfying strong monotonicity. The highest-level operator is not monotone but is a gradient operator. We can extend the analysis in this work, leveraging the techniques developed in Zeng and Doan (2024) for gradient operators, and show that in this case A-MSA finds a first-order stationary point (but not necessarily a globally or locally optimal solution) of the objective, in the sense of the squared gradient norm. This recovers the rate of the state-of-the-art MFG-ASAC algorithm proposed in Zeng et al. (2024a), which relies on a tailored analysis.
5 Numerical Simulations
We apply A-MSA to the MFG policy optimization problem discussed in Section 4.4. The environment is a small-scale synthetic MFG with a small number of states and actions and a randomly generated transition kernel and reward function. We compare A-MSA against the standard MSA algorithm without averaging and plot the convergence in Figure 1, tracking the policy and mean field iterates over the iterations. The norm of the policy gradient, evaluated at the latest mean field iterate, measures convergence in the policy. To quantify convergence of the mean field, we measure the distance between the current mean field iterate and the stationary state distribution induced by the current policy. The simulation shows that A-MSA has a clear advantage over MSA, though not by an apparent order of magnitude.
[Figure 1: Convergence of A-MSA and MSA on the synthetic MFG, measured by the policy gradient norm and the distance between the mean field iterate and its induced stationary distribution.]
6 Concluding Remarks
In this work we propose an accelerated multi-time-scale SA algorithm that solves a coupled system of fixed-point equations with the optimal convergence rate and sample complexity: under a strong monotonicity assumption, the last iterate of the algorithm converges to the (unique) solution at a rate of $\widetilde{\mathcal{O}}(1/k)$ for any number of equations. This is the first time such a convergence rate has been achieved without additional restrictive smoothness assumptions. We show that the objectives of a range of problems in RL and multi-agent games can be formulated as coupled systems of fixed-point equations, to which the proposed algorithm can be applied. Our analysis is easily generalizable to cases where the highest-level operator is not strongly monotone (but the lower-level operators are). An important future direction is to investigate the stability and convergence of multi-time-scale SA when the lower-level operators lose strong monotonicity.
Acknowledgement
This work was partially supported by the National Science Foundation under awards ECCS-CAREER-2339509 and CCF-2343599.
Disclaimer
This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JP Morgan Chase & Co and its affiliates ("JP Morgan"), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.
References
- Benveniste et al. (2012) Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.
- Bhandari et al. (2018) Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR, 2018.
- Borkar (1997) Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
- Borkar (2008) Vivek S. Borkar. Stochastic approximation: A dynamical systems viewpoint. Springer, 2008.
- Chen et al. (2022) Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
- Dalal et al. (2018) Gal Dalal, Gugan Thoppe, Balázs Szörényi, and Shie Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Conference on Learning Theory, pages 1199–1233. PMLR, 2018.
- Daskalakis et al. (2020) Constantinos Daskalakis, Dylan J. Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in Neural Information Processing Systems, 33:5527–5540, 2020.
- Deb and Bhatnagar (2021) Rohan Deb and Shalabh Bhatnagar. N-timescale stochastic approximation: Stability and convergence. arXiv preprint arXiv:2112.03515, 2021.
- Deb and Bhatnagar (2022) Rohan Deb and Shalabh Bhatnagar. Gradient temporal difference with momentum: Stability and convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6488–6496, 2022.
- Doan (2022) Thinh T. Doan. Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance. IEEE Transactions on Automatic Control, 68(8):4695–4705, 2022.
- Doan (2024) Thinh T. Doan. Fast nonlinear two-time-scale stochastic approximation: Achieving finite-sample complexity. arXiv preprint arXiv:2401.12764, 2024.
- Doan and Romberg (2019) Thinh T. Doan and Justin Romberg. Linear two-time-scale stochastic approximation: A finite-time analysis. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 399–406. IEEE, 2019.
- Duchi and Namkoong (2021) John C. Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
- Fort (2015) Gersende Fort. Central limit theorems for stochastic approximation with controlled Markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80, 2015.
- Gupta et al. (2019) Harsh Gupta, Rayadurgam Srikant, and Lei Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
- Han et al. (2024) Yuze Han, Xiang Li, and Zhihua Zhang. Finite-time decoupled convergence in nonlinear two-time-scale stochastic approximation. arXiv preprint arXiv:2401.03893, 2024.
- Haque et al. (2023) Shaan Ul Haque, Sajad Khodadadian, and Siva Theja Maguluri. Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise. arXiv preprint arXiv:2401.00364, 2023.
- Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
- Kaledin et al. (2020) Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. In Conference on Learning Theory, pages 2144–2203. PMLR, 2020.
- Konda and Tsitsiklis (1999) Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- Konda and Tsitsiklis (2004) Vijay R. Konda and John N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.
- Konda and Borkar (1999) Vijaymohan R. Konda and Vivek S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
- Lakshminarayanan and Szepesvári (2017) Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging. arXiv preprint arXiv:1709.04073, 2017.
- Li et al. (2023a) Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with Markovian data. arXiv preprint arXiv:2302.07690, 2023a.
- Li et al. (2023b) Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I. Jordan. A statistical analysis of Polyak-Ruppert averaged Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023b.
- Liang et al. (2024) Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jun Yang, Jiheng Zhang, and Zhengyuan Zhou. Single-trajectory distributionally robust reinforcement learning. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 29644–29666. PMLR, 2024.
- Meerkov (1972) S. M. Meerkov. Simplified description of slow Markov walks: Part I. Automation and Remote Control, 33:404–414, 1972.
- Mitra (2024) Aritra Mitra. A simple finite-time analysis of TD learning with linear function approximation. arXiv preprint arXiv:2403.02476, 2024.
- Mokkadem and Pelletier (2006) Abdelkader Mokkadem and Mariane Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.
- Moulines and Bach (2011) Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24, 2011.
- Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- Sayin et al. (2021) Muhammed Sayin, Kaiqing Zhang, David Leslie, Tamer Basar, and Asuman Ozdaglar. Decentralized Q-learning in zero-sum Markov games. Advances in Neural Information Processing Systems, 34:18320–18334, 2021.
- Shen and Chen (2022) Han Shen and Tianyi Chen. A single-timescale analysis for stochastic approximation with multiple coupled sequences. Advances in Neural Information Processing Systems, 35:17415–17429, 2022.
- Srikant (2024) R. Srikant. Rates of convergence in the central limit theorem for Markov chains, with an application to TD learning. arXiv preprint arXiv:2401.15719, 2024.
- Srikant and Ying (2019) Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR, 2019.
- Sutton et al. (2008) Richard S. Sutton, Hamid Maei, and Csaba Szepesvári. A convergent temporal-difference algorithm for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems, 21, 2008.
- Sutton et al. (2009) Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000, 2009.
- Wang et al. (2021) Yue Wang, Shaofeng Zou, and Yi Zhou. Non-asymptotic analysis for two time-scale TDC with general smooth function approximation. Advances in Neural Information Processing Systems, 34:9747–9758, 2021.
- Wu et al. (2020) Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617–17628, 2020.
- Xu et al. (2019) Tengyu Xu, Shaofeng Zou, and Yingbin Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Advances in Neural Information Processing Systems, 32, 2019.
- Zeng and Doan (2024) Sihan Zeng and Thinh Doan. Fast two-time-scale stochastic gradient method with applications in reinforcement learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 5166–5212. PMLR, 2024.
- Zeng et al. (2022) Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum Markov games. Advances in Neural Information Processing Systems, 35:34546–34558, 2022.
- Zeng et al. (2024a) Sihan Zeng, Sujay Bhatt, Alec Koppel, and Sumitra Ganesh. A single-loop finite-time convergent policy optimization algorithm for mean field games (and average-reward Markov decision processes). arXiv preprint arXiv:2408.04780, 2024a.
- Zeng et al. (2024b) Sihan Zeng, Thinh T. Doan, and Justin Romberg. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. SIAM Journal on Optimization, 34(1):946–976, 2024b.
Appendix
Appendix A Proof of Technical Lemmas
A.1 Proof of Lemmaย 1
Lemma 5
For any we have
where the constant is positive, bounded, and depends only on the structural constants of the problem.
We first derive an inequality, used frequently in the proofs of Lemmas 1 and 5, as a consequence of (18). For any iteration, we have
(46) |
where the last inequality follows from the boundedness condition in Assumptionย 4 and the definition of in (20).
By the update rule in (8), we have
This leads to
(47) |
where the last inequality follows from the conditions on the step sizes.
A.2 Proof of Lemmaย 2
By the update rule in (7), we have
Taking the norm leads to
(52) |
We bound each term of (52) separately. By the definition of , we have
This implies
(53) |
The third equation of (53) is a result of the identity that for all
(54) |
The first inequality of (53) follows from Assumption 3, and the final inequality follows from the step size conditions. The second inequality of (53) uses the bound
which we now justify
(55) |
The second inequality of (53) plugs in the bound on below, which follows from a similar argument
The term can be bounded leveraging the Lipschitz condition of
(56) |
where the second inequality plugs in the result of Lemmaย 3.
Since for any , ย (56) implies
(57) |
Next we treat
(58) |
where the third inequality uses (53) and the fourth inequality follows from the step size condition .
To bound , we plug in (53) and (56)
(59) |
where the second inequality follows from the fact that for any positive scalars . We have also simplified terms using the step size condition , , and .
We finally bound with (57),
(60) |
Collecting the bounds on - and applying them to (52), we have
where in the second inequality we have combined and simplified terms with the step size condition and , and the third inequality follows from , , , , and .
A.3 Proof of Lemmaย 3
By the definition of ,
(61) |
A.4 Proof of Lemmaย 4
By the update rule (6), we have
Taking the norm,
(64) |
A bound on can be obtained using an argument identical to (53), provided that the step sizes satisfy and
(65) |
The second term is bounded in expectation due to Assumptionย 5
(66) |
Using the Lipschitz continuity of , we have
(67) |
which implies
(68) |
The term is obviously zero in expectation as
The bound on follows from the Cauchy-Schwarz inequality
(69) |
where in the third equality we plug in (67).
Finally, we bound using a similar argument
(70) |
A.5 Proof of Lemmaย 5
This lemma essentially bounds the distance between samples from a time-varying Markov chain and the stationary distribution. Techniques for proving this lemma have been well studied in the literature. Hence, we omit the proof here but note that it is very similar to the proof of Lemma 1 in Zeng et al. (2024b).