Accelerated Multi-Time-Scale Stochastic Approximation: Optimal Complexity and Applications in Reinforcement Learning and Multi-Agent Games
Abstract
Multi-time-scale stochastic approximation is an iterative algorithm for finding the fixed point of a set of coupled operators given their noisy samples. It has been observed that, due to the coupling between the decision variables and the noisy samples of the operators, the performance of this method degrades as the number of coupled equations increases. In this work, we develop a new accelerated variant of multi-time-scale stochastic approximation, which significantly improves the convergence rates of its standard counterpart. Our key idea is to introduce auxiliary variables that dynamically estimate the operators from their samples; these estimates are then used to update the decision variables. The auxiliary variables help not only to control the variance of the operator estimates but also to decouple the sampling noise from the decision variables. This allows us to select more aggressive step sizes and achieve an optimal convergence rate. Specifically, under a strong monotonicity condition, we show that for any number of coupled equations the iterate of the proposed algorithm converges to the desired solution at a rate of $\widetilde{\mathcal{O}}(1/k)$, even when the operator samples are generated from a single Markov process trajectory.
A second contribution of this work is to demonstrate that the objective of a range of problems in reinforcement learning and multi-agent games can be expressed as a system of fixed-point equations. As such, the proposed approach can be used to design new learning algorithms for solving these problems. We illustrate this observation with numerical simulations in a multi-agent game and show the advantage of the proposed method over the standard multi-time-scale stochastic approximation algorithm.
Keywords: Multi-time-scale stochastic approximation, reinforcement learning, game theory
1 Introduction
The objective of a range of problems in reinforcement learning and games can be expressed as a coupled system of equations, with each equation defined through a nonlinear operator that can be queried with noise. This work proposes a single-loop stochastic approximation (SA) algorithm that solves such a system using a (near) optimal number of queries. Given a compact statistical sample space and a collection of coupled operators, our aim is to find a solution tuple such that
(5) |
where the expectation in each equation is taken with respect to a distribution over the sample space that (possibly) depends on the decision variables. We do not assume direct access to this distribution. Rather, our sampling oracle is a transition kernel that we may query, which induces the distribution above as its stationary distribution. More precisely, given the current decision variables and the previous sample, we can draw the next sample from this kernel. We consider this relaxed sampling oracle because it realistically captures how samples are generated in reinforcement learning (RL). We also introduce notation for the space of the decision variables and for aggregated tuples of decision variables, which is used throughout the paper.
It may appear that the system (5) is symmetric and each equation is identically coupled with any other equation. This is not true in the context of RL and games, where the problems usually have a specific hierarchical structure. The structure dictates the order in which the system can be most naturally solved and is formalized later through Assumptionย 3.
Exploiting the hierarchical structure, the standard multi-time-scale stochastic approximation algorithm for solving (5), hereafter referred to as MSA, maintains a parameter for each equation as an estimate of the corresponding component of the solution and simultaneously updates all of them, each with its own step size. Given (possibly biased) stochastic samples of the operators, MSA iteratively carries out
(6) |
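To make the iteration concrete, the following is a minimal Python sketch of the MSA update (6). The function names, the i.i.d. Gaussian noise model, and the concrete step-size schedules are illustrative assumptions, not the paper's setup (which allows Markovian sampling).

```python
import numpy as np

def msa(operators, x0, step_sizes, num_iters, noise_std=0.1, seed=0):
    """Minimal sketch of the MSA iteration (6): every decision variable is
    updated along a single noisy sample of its operator, each with its own
    step size. Noise is i.i.d. Gaussian here purely for illustration."""
    rng = np.random.default_rng(seed)
    x = [np.array(v, dtype=float) for v in x0]
    for k in range(num_iters):
        samples = [F(x) + noise_std * rng.standard_normal(x[i].shape)
                   for i, F in enumerate(operators)]     # noisy operator evaluations
        for i in range(len(x)):
            x[i] = x[i] - step_sizes[i](k) * samples[i]  # level-specific step size
    return x

# Toy coupled system: F1(x) = x1 - 1, F2(x) = x2 - x1 (solution x1 = x2 = 1).
ops = [lambda x: x[0] - 1.0, lambda x: x[1] - x[0]]
steps = [lambda k: 1.0 / (k + 10), lambda k: 1.0 / (k + 10) ** (2 / 3)]
print(msa(ops, [np.zeros(1), np.zeros(1)], steps, num_iters=5000))
```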
With properly selected step sizes, this simple method is guaranteed to converge to the solution of (5) under strong monotonicity and Lipschitz continuity assumptions on the operators, to be introduced later (Hong et al., 2023; Zeng et al., 2024b; Shen and Chen, 2022; Han et al., 2024). However, its time and sample complexity degrades as the number of equations increases and, as far as is currently known, is sub-optimal. For a single equation, MSA reduces to the classic stochastic approximation algorithm (Robbins and Monro, 1951) and has a complexity of $\mathcal{O}(\epsilon^{-1})$ when the samples are drawn i.i.d. from a fixed distribution (and $\widetilde{\mathcal{O}}(\epsilon^{-1})$ if the samples are consecutively generated from a Markov chain, where $\widetilde{\mathcal{O}}$ hides logarithmic factors). More precisely, the last iterate of (6) converges to an $\epsilon$-neighborhood of the solution in $\mathcal{O}(\epsilon^{-1})$ iterations, matching the worst-case lower bound (Chen et al., 2022). For two equations, the best-known complexity, as established in Hong et al. (2023); Zeng et al. (2024b), becomes $\mathcal{O}(\epsilon^{-1.5})$, inferior to the single-equation complexity by a polynomial factor. The authors in Hong et al. (2023); Zeng et al. (2024b) only analyze the case of two equations, but we show in Section 3.2 that their techniques can be extended to establish a strictly worse complexity for three equations, with further degraded bounds beyond that. This apparent gap in the performance of MSA for different numbers of equations makes us question whether the problem becomes fundamentally harder as the number of equations grows or whether an algorithmic/analytical improvement can be made. In this work, we answer the question in the positive direction by proposing an accelerated SA algorithm which achieves the optimal complexity for any number of equations while maintaining the favorable properties of (6), such as its simplicity and single-loop structure. The same algorithm has been introduced in our prior works (Doan, 2024; Zeng and Doan, 2024), but we make important contributions beyond them, as discussed in detail in Section 1.2. We now summarize our main contributions.
1.1 Main Contributions
Number of Equations | Standard Multi-Time-Scale SA Complexity | A-MSA Complexity |
Two | ||
Three | ||
Four | ||
The key contribution of this work is to propose a novel Accelerated Multi-Time-Scale Stochastic Approximation (A-MSA) algorithm that solves a coupled system of equations with the optimal convergence rate $\widetilde{\mathcal{O}}(1/k)$ for strongly monotone operators, even when the samples are generated from a single trajectory of Markov processes whose kernels are parameterized by the decision variables. In the two-equation case, this rate improves over $\mathcal{O}(k^{-2/3})$, the best-known complexity of the MSA algorithm derived in Doan (2022); Hong et al. (2023). With three or more equations, the complexity of MSA deteriorates further and falls even more short of our result (see Table 1). Our innovation is built on the observation that the main hindrance to the fast convergence of (6) lies in the direct coupling between the iterates across levels, which means that any error or noise in one iterate immediately propagates to all iterates in the next iteration. Such a coupling effect prevents us from choosing the aggressive step sizes necessary for achieving the optimal complexity. Our solution to this challenge is to update each decision variable in the direction of a properly averaged version of the corresponding operator (as opposed to the single sample used in (6)). This seemingly small modification maintains the simplicity of (6) while eliminating the direct coupling between the iterates. In Section 3.3, we compare the analyses of MSA and A-MSA and pinpoint the convergence bottleneck in MSA, which we overcome through an algorithmic modification.
A second contribution of our work is to show that the coupled equations in (5) abstract many popular problems in games and RL, including gradient-based temporal difference learning with momentum, distributionally robust RL, policy optimization in a two-player zero-sum Markov game, and policy optimization in mean-field games. As a consequence, the proposed A-MSA method can be used to design new algorithms for solving these problems. Some of these problems do not satisfy the strong monotonicity condition but are specially structured in their own ways. We show how the analysis for strongly monotone operators can be adapted to the structure of each respective problem, which can potentially establish state-of-the-art or previously unknown convergence rates.
Applicable to Equations | Sample Noise | Smoothness Assumption | Sample Complexity | |
Deb and Bhatnagar (2021) | Yes | Bounded Martingale | Not Required | - |
Hong etย al. (2023) | No | i.i.d. | Not Required | |
Zeng etย al. (2024b) | No | Time-Varying Markovian | Not Required | |
Shen and Chen (2022) | Yes | Bounded Martingale | Required | |
Han etย al. (2024) | No | Bounded Martingale | Relaxed Version | |
Doan (2024) | No | i.i.d. | Not Required | |
Zeng and Doan (2024) | No | i.i.d. | Not Required | |
This Work | Yes | Time-Varying Markovian | Not Required |
1.2 Related Work
This paper closely relates to the existing works on single- and multi-time-scale SA (especially under Markovian samples) and those that analyze the finite-time and sample complexity of reinforcement learning algorithms through the lens of SA. We discuss the most relevant works in these domains and highlight our novelty.
Stochastic Approximation. The SA algorithm is a classic method for solving a single fixed-point equation given noisy operator samples (Robbins and Monro, 1951). The asymptotic convergence of SA is well understood and usually derived with the Poisson equation method (Benveniste etย al., 2012; Li etย al., 2023a) or by analyzing a related ODE (Meerkov, 1972; Borkar, 2008). Under i.i.d. samples, bounded Martingale difference noise, or dependent samples from a Markov chain with bounded mixing time, the finite-time analysis measured by squared errors in expectation is established for linear (Lakshminarayanan and Szepesvรกri, 2017; Srikant and Ying, 2019) and strongly monotone operators (Moulines and Bach, 2011; Chen etย al., 2022). Linear SA abstracts the temporal difference (TD) learning algorithm in RL. Various works (Bhandari etย al., 2018; Mitra, 2024) study the finite-time convergence of TD learning leveraging the problem structure, but their analysis may be extended to linear SA. Recently, non-asymptotic central limit theorems have also been proved for TD learning (Srikant, 2024) and Q learning (Li etย al., 2023b), which show the error convergence to a zero-mean Gaussian random variable in distribution. Such results may be extended to general linear/non-linear SA as well.
Some of the aforementioned works employ Polyak-Ruppert averaging to improve variance control (Srikant, 2024; Haque et al., 2023). We note that the averaging step in the proposed method is distinct from Polyak-Ruppert averaging: Polyak-Ruppert averaging does not modify the algorithm itself but returns a weighted average of all past iterates as the solution, whereas our algorithm uses an averaged operator-estimate sequence inside the updates.
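To make the distinction concrete, here is a schematic comparison in Python (the names `update` and `sample_op` stand in for a generic SA update and operator oracle, and the variables are assumed to be NumPy arrays or scalars): Polyak-Ruppert averaging post-processes the iterates, whereas the averaging used in this paper modifies the update itself.

```python
def polyak_ruppert(update, x0, num_iters):
    """Polyak-Ruppert averaging: the base SA update is left untouched; only the
    reported solution is the running average of the iterates."""
    x, avg = x0, x0
    for k in range(num_iters):
        x = update(x, k)
        avg = avg + (x - avg) / (k + 2)   # running average of x_0, ..., x_{k+1}
    return avg

def averaged_sa_step(sample_op, x, y, alpha, lam):
    """A-MSA-style step: the operator estimate y is averaged first and then
    drives the update of the decision variable x."""
    y = (1.0 - lam) * y + lam * sample_op(x)   # averaged operator estimate
    x = x - alpha * y                          # x moves along the estimate, not the raw sample
    return x, y
```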
Multi-Time-Scale Stochastic Approximation. In the two-equation case, the MSA algorithm reduces to the standard two-time-scale SA, which is known to converge asymptotically under linear (Borkar, 1997, 2008; Konda and Tsitsiklis, 2004) and non-linear operators (Mokkadem and Pelletier, 2006; Fort, 2015). In the linear setting, the finite-time convergence measured by mean squared errors is first established with a sub-optimal rate in Gupta et al. (2019); Doan and Romberg (2019) and improved to the optimal rate in Kaledin et al. (2020); Haque et al. (2023). In addition, convergence in probability is studied in Dalal et al. (2018). In the non-linear setting under the strong monotonicity assumption, the best-known rate of two-time-scale SA remains $\mathcal{O}(k^{-2/3})$ in general (Doan, 2022; Hong et al., 2023; Zeng et al., 2024b). Under an additional smoothness assumption, Shen and Chen (2022); Han et al. (2024) show that the optimal rate can be recovered. Our paper does not require this assumption, as it may not hold in practical applications.
The two-time-scale SA framework has wide applications in control, RL, and games, e.g., gradient-based TD learning (Sutton et al., 2009; Wang et al., 2021), actor-critic methods (Konda and Borkar, 1999; Konda and Tsitsiklis, 1999; Wu et al., 2020), policy gradient descent-ascent for two-player games (Daskalakis et al., 2020), and decentralized Q-learning (Sayin et al., 2021). By providing an analysis for the general framework, one can apply it to each specific problem and immediately deduce the performance guarantees, without having to tailor the analysis on a case-by-case basis.
The study of SA for solving three or more equations remains limited but deserves increased attention due to its ability to model rich classes of problems, including gradient-based TD learning with momentum (Deb and Bhatnagar, 2022), distributionally robust RL (Liang et al., 2024), and policy optimization in games; these applications are discussed in Section 4. The existing works on multi-time-scale SA include Deb and Bhatnagar (2021), which only provides an asymptotic analysis, and Shen and Chen (2022), which relies on a restrictive smoothness assumption.
Finally, we discuss the differences between the current paper and our prior works (Doan, 2024; Zeng and Doan, 2024), which study the special two-equation setting under i.i.d. samples. First, this paper extends our prior works to the general setting with any number of equations. Moving from two equations to three or more creates a critical technical challenge: the mid-level variables are coupled with faster- and slower-moving variables at the same time, making it harder to decouple them through operator averaging. We overcome this challenge through an improved bound on the residual that measures the convergence of the iterates to a properly defined learning target (introduced later), expressed as a function of the step sizes and the remaining error terms. Second, we perform a mathematical investigation of the root cause of the complexity improvement. An explanation of why A-MSA enjoys a faster convergence rate is missing in our prior papers. This work fills the gap by contrasting an important intermediate result obtained under A-MSA and MSA and pinpointing the terms that are improved by the averaged operator estimation step. Finally, this paper considers a more general sampling oracle: we study non-i.i.d. samples generated from a time-varying Markov chain, a setting realistic for modeling applications in sequential decision making. To avoid complexity degradation due to Markovian samples, analyzing A-MSA requires carefully bounding the distance between the distribution of each sample and the (time-varying) stationary distribution determined by the current decision variables, whereas this distance is zero in expectation in the i.i.d. setting.
2 Accelerated Multi-Time-Scale Stochastic Approximation Method
The MSA method (6) updates each decision variable in the direction of the corresponding operator evaluated at the latest iterates. Our proposed method, formally presented in Algorithm 1, modifies this update by introducing auxiliary variables that estimate the operators through a weighted average of the operator samples, controlled by an averaging parameter. These auxiliary variables are then used to update the decision variables. If the averaging parameter is set to one for all iterations, Algorithm 1 reduces to MSA in (6). However, a large averaging weight leads to a high-variance estimate, whereas if the weight is too small the estimate cannot track the moving target operator. Decaying the averaging weight appropriately with respect to the step sizes helps to control the variance of our estimates, which is crucial for deriving the optimal convergence rate.
(7) |
(8) |
(9) |
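The following is a minimal Python sketch of the A-MSA updates, mirroring the MSA sketch above: auxiliary variables maintain a running weighted average of the operator samples (cf. (8)) and drive the decision-variable updates (cf. (7)). The i.i.d. noise model and the schedules `steps` and `avg_weight` are illustrative assumptions, not the algorithm's exact specification, which uses the Markovian sampling oracle (9).

```python
import numpy as np

def a_msa(operators, x0, step_sizes, avg_weight, num_iters, noise_std=0.1, seed=0):
    """Sketch of A-MSA: each auxiliary variable y[i] keeps a running weighted
    average of the operator samples and replaces the raw sample in the
    decision-variable update. Noise is i.i.d. here purely for illustration."""
    rng = np.random.default_rng(seed)
    x = [np.array(v, dtype=float) for v in x0]
    y = [np.zeros_like(v) for v in x]                   # averaged operator estimates
    for k in range(num_iters):
        lam = avg_weight(k)                             # averaging parameter
        for i, F in enumerate(operators):
            sample = F(x) + noise_std * rng.standard_normal(x[i].shape)
            y[i] = (1.0 - lam) * y[i] + lam * sample    # low-variance estimate, cf. (8)
        for i in range(len(x)):
            x[i] = x[i] - step_sizes[i](k) * y[i]       # update driven by the estimate, cf. (7)
    return x

# Same toy system as in the MSA sketch; setting avg_weight(k) = 1 recovers MSA.
ops = [lambda x: x[0] - 1.0, lambda x: x[1] - x[0]]
steps = [lambda k: 1.0 / (k + 10)] * 2
print(a_msa(ops, [np.zeros(1), np.zeros(1)], steps,
            avg_weight=lambda k: 1.0 / (k + 10) ** 0.9, num_iters=5000))
```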
Note that in our algorithm the samples are drawn from the transition kernel parameterized by the current decision variables (see (9)). Thus, as the decision variables change over time, the samples are temporally dependent and Markovian with time-varying stationary distributions. To ensure stable behavior of the Markovian samples, we make the following assumptions. Our first assumption concerns the fast mixing of the Markov chain generated under a fixed decision variable. Given two distributions, their total variation (TV) distance and the mixing time of the Markov chain are defined as
(10) |
Definition 1
Given a fixed decision variable, consider the Markov chain generated by the corresponding transition kernel, with the induced stationary distribution. For any accuracy level, the mixing time of the chain at that level is defined as
Assumption 1 (Uniform Geometric Ergodicity)
For any fixed decision variable, the Markov chain generated under the corresponding transition kernel has a unique stationary distribution. In addition, there exist constants, independent of the decision variable, such that the distribution of the chain converges to this stationary distribution geometrically fast in total variation distance.
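In generic notation (ours, not necessarily the paper's), writing $\theta$ for the decision variables, $P_\theta$ for the induced kernel, and $\mu_\theta$ for its stationary distribution, the definitions and the geometric ergodicity bound typically take the form
\[
d_{TV}(\mu,\nu) \;=\; \sup_{B}\,\big|\mu(B)-\nu(B)\big|,
\qquad
\tau_{\theta}(\eta) \;=\; \min\Big\{ t\ge 0 \,:\, \sup_{\xi}\, d_{TV}\big(P^{t}_{\theta}(\xi,\cdot),\,\mu_{\theta}\big) \le \eta \Big\},
\]
\[
\sup_{\xi}\, d_{TV}\big(P^{t}_{\theta}(\xi,\cdot),\,\mu_{\theta}\big) \;\le\; C\,\rho^{\,t}
\quad\text{for all } t\ge 0,\ \text{with } C>0,\ \rho\in(0,1)\ \text{independent of } \theta .
\]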
The geometric ergodicity assumption states that the distribution of a sample from the Markov chain approaches the stationary distribution geometrically fast, and is again standard in the literature on RL and SA under Markovian noise (Srikant and Ying, 2019; Wu etย al., 2020; Zeng etย al., 2024b).
With appropriate shorthand, this assumption implies that there exists a positive constant, depending only on the constants in Assumption 1, such that
(11) |
For convenience, we denote by the mixing time corresponding to in Algorithm 1.
Assumption 2
Given two decision variables and two distributions over the sample space, consider the distributions obtained by applying the corresponding transition kernels to each of them. There exists a constant such that
(12) |
In addition, the stationary distribution is Lipschitz continuous in the decision variables:
(13) |
Assumption 2 represents a regularity condition on the transition kernel, which can be interpreted as a type of triangle inequality for the TV distance defined over the transition kernels with respect to different parameters. This assumption can be shown to hold in standard MDPs (see Wu et al. (2020, Lemma A1)).
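In the same generic notation, conditions of the type in (12)-(13) can be sketched as follows (an illustrative form, not the paper's exact statement), where $\nu,\nu'$ are arbitrary distributions over the sample space:
\[
d_{TV}\big(\nu P_{\theta},\,\nu' P_{\theta'}\big) \;\le\; d_{TV}(\nu,\nu') + L\,\|\theta-\theta'\|,
\qquad
d_{TV}\big(\mu_{\theta},\,\mu_{\theta'}\big) \;\le\; L\,\|\theta-\theta'\| .
\]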
We also define the filtration that contains all the randomness generated by Algorithm 1 up to the current iteration.
3 Main Results
In this section, we establish the finite-sample complexity of A-MSA for finding the solution of (5) when the operators have a nested structure. For convenience, we refer to the first and last equations in (5) as the highest- and lowest-level problems, respectively; i.e., the problem level increases from the last equation to the first. Levels here indicate the order in which the system can be most naturally solved.
With some abuse of notation, we denote . Given , we use to denote the solution of the lowest level problem
Similarly, we denote by the solution of the last two levels corresponding to given
(14) |
Conceptually, these maps represent the optimal solutions of the last and second-to-last equations with respect to the last and second-to-last decision variables when all higher-level variables are given.
Generalizing this line of discussion, we introduce, for each level, a map representing the optimal solution of that equation with respect to its own variable when all higher-level variables are given. For the highest level (i.e., when no higher-level variable is given), this map returns the corresponding component of the optimal solution to (5). We aggregate these maps into a single notation and note that the resulting tuple solves the corresponding system of equations.
We now introduce the main assumption that drives our analysis, referred to as the nested strong monotonicity of the operators corresponding to the increasing levels defined above.
Assumption 3 (Strong Monotonicity)
There exists a constant such that for all and ,
Assumption 3 represents a nested variant of strong monotonicity for the operators, which does not require each operator to be strongly monotone with respect to all of its variables. It states that, given the higher-level variables, each operator is strongly monotone with respect to its own variable when all lower-level decision variables are at their corresponding optimal solutions. In the two-equation case, the assumption reduces to the same condition made in the existing literature (Doan, 2022; Shen and Chen, 2022; Han et al., 2024) and can be verified to hold in RL applications including gradient-based TD learning (Xu et al., 2019) and distributed TD learning (to model distributed TD learning in the two-time-scale SA framework, the outer-loop equation enforces consensus among the agents, whereas in the inner loop each agent learns its local value function). As an implication of the assumption, the lower-level solution maps are always uniquely defined. Without any loss of generality, we normalize the strong monotonicity constant, which makes it convenient for us to simplify terms.
Assumption 4 (Lipschitz Continuity and Boundedness)
There exist positive finite constants such that, for all
(15)
(16)
(17)
Eqs. (15) and (16) are standard Lipschitz continuity assumptions on the operators and the learning targets (Doan, 2022; Zeng et al., 2024b; Shen and Chen, 2022). Eq. (17) assumes that the learning targets are bounded and guarantees the stability of the algorithm in the presence of time-varying Markovian noise. This assumption is made in Kaledin et al. (2020); Zeng et al. (2024b) in the two-equation case. Without any loss of generality, we use the same bounding constant throughout, which is finite due to the compactness of the sample space. Assumption 4 implies that the operator can be bounded affinely by the norm of the decision variable
(18) |
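In generic notation, a bound of this type reads as follows (a sketch; the constant below is generic rather than exactly the one appearing in (15)-(17)):
\[
\|G_i(x,\xi)\| \;\le\; L\,\big(1+\|x\|\big) \qquad \text{for all } x,\ \xi,\ \text{and } i,
\]
which follows by combining the Lipschitz continuity of the operators with the boundedness conditions.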
We use the same constant in Assumptionsย 2 and 4 to simplify the notation.
Convergence Metric. We quantify the algorithm convergence through the residuals
(20) |
where the first quantity is the optimality residual, measuring the distance of each decision variable to the learning target introduced above, and the second is the estimation residual, capturing the quality of the auxiliary estimates of the operators. Note that if both residuals are zero at every level, then the decision variables coincide with the desired solution.
Finally, we will study the convergence of Algorithm 1 under the following choice of step sizes
(21) |
where the averaging weights are chosen to dominate the step sizes, as the operator estimates are updated at a "faster" time scale than the decision variables. In particular, these step sizes satisfy the following conditions for all iterations
(22) |
where the threshold is an absolute constant depending on the structural constants of the problem. We note that there always exist choices of the step-size constants such that the step sizes in (21) satisfy (22).
For our analysis, we consider the following Lyapunov function
(23) |
Under the choice of step sizes above, the main result of this paper is presented in the following theorem, where we characterize the convergence complexity of Algorithm 1
Theorem 1 shows that the last iterate of the A-MSA algorithm converges to the unique solution of (5) in the mean-squared sense at an optimal rate of $\widetilde{\mathcal{O}}(1/k)$. As the algorithm draws exactly one sample in each iteration, this rate translates to a sample complexity of $\widetilde{\mathcal{O}}(1/\epsilon)$ for finding an $\epsilon$-optimal solution. The same complexity has been shown to be achievable by the MSA algorithm under an additional smoothness assumption on the operators (Shen and Chen, 2022). We match the rate without requiring this restrictive assumption. In the absence of the smoothness condition, our rate is order-optimal (up to a logarithmic factor) and significantly improves over the best-known rate of MSA, which is $\mathcal{O}(k^{-2/3})$ for two equations as established in Hong et al. (2023); Zeng et al. (2024b); Han et al. (2024). In Section 3.2, we further show that the convergence rate of MSA is even slower with three equations and deteriorates as the number of equations increases. This observation highlights the advantage of our proposed algorithm, whose convergence complexity is optimal for an arbitrary number of equations.
3.1 Proof of Theoremย 1
We introduce the following technical lemmas, which set up important intermediate steps toward the results in Theorem 1. These lemmas establish useful upper bounds, in expectation, on the optimality and estimation residuals, and reveal how these two quantities depend on each other.
Lemma 1
In (24), the terms associated with different levels are scaled by different step sizes. This subtle difference is the key to establishing the optimal complexity of Algorithm 1.
It is worth noting that if the iterates are generated by the MSA algorithm, a per-iteration analysis similar to Lemma 2 can be established. We carry out this analysis and point out the distinctions that allow A-MSA to converge faster in Section 3.2 (Lemma 4).
We defer the proofs of the lemmas to the appendix but point out that, to prove Lemma 2, the following bound provides an important tool for controlling the stability of the iterates. The implication of Lemma 3 is that the distance it controls converges to zero at a faster time scale, provided that the relevant residuals converge.
Proof (of Theoremย 1).
The proof follows in a straightforward manner by combining the bounds in Lemmasย 1 and 2 and choosing the correct step sizes. Recall the Lyapunov function (23)
(25) |
where the first inequality plugs in the bounds from Lemmas 1 and 2, and in the last inequality we have artificially introduced an intermediate term and used the step-size relations. Note that the last two terms of (25) are non-positive under the step size conditions in (22). As a result, we have
due to and . Multiplying by on both sides, we have
which when dividing by gives the desired result.
3.2 Standard Multi-Time-Scale Stochastic Approximation
In this section, we present the finite-time complexity of the MSA algorithm derived using the techniques in Hong et al. (2023); Zeng et al. (2024b), to highlight its gap from the optimal rate established in Theorem 1. We start with the case of three equations, where we want to solve
For simplicity, we assume (only in this section) that the stationary distribution is independent of the decision variables and that the samples are drawn i.i.d. from it. We additionally assume that the operator samples have uniformly bounded energy, as follows. Note that this is not a necessary assumption; it is adopted only to simplify the analysis of MSA. Also note that we do not make this assumption in the analysis of the proposed A-MSA algorithm.
Assumption 5
There exists a constant such that for all , , and
Recall the update rule of MSA in (6). We measure the convergence through the optimality residuals, the same metric (20) used for analyzing A-MSA.
We next establish the per-iteration bound for , where the analysis can be found in Appendixย A.4.
Lemma 4
It is worth noting that Lemma 4 holds for any number of equations. We apply this lemma to the case of three equations and consider the Lyapunov function
(26) |
where and .
Theorem 2
Proof.
With , Lemmaย 4 implies
Recall the Lyapunov function . Note that and since decays faster than . We have
(28) |
where the last inequality is a result of
following if the step sizes satisfy .
We now discuss how Theorem 2 can be extended to more than three equations and what the expected convergence rate should be. In the two-equation case, Zeng et al. (2024b) shows that the convergence rate is on the order of (up to a logarithmic factor)
(29) |
To optimize (29), the step sizes should be selected to balance the competing terms, which leads to a convergence rate of $\mathcal{O}(k^{-2/3})$. With three equations, the convergence rate is on the order of
which dictates the choice of step sizes in (27).
3.3 Reflecting on Faster Convergence of A-MSA
Lemmas 2 and 4 establish critical per-iteration convergence bounds on the errors in the decision variables for A-MSA and MSA, respectively. In this section, we contrast corresponding terms in the two bounds to pinpoint the advantage of Lemma 2 over Lemma 4. In the analysis of the decision-variable errors under the two algorithms, the common/comparable terms include the following:
- The first comparable term arises from the shift in the learning target across iterations and appears in the proofs of both Lemmas 2 and 4 (in Sections A.2 and A.4, respectively). Due to the Lipschitz continuity of the learning targets imposed in Assumption 4, controlling this term essentially requires handling
(31) Under MSA, there is currently no better bound on (31) than a constant multiple of the step size. Since the sampled operator is on the order of a constant even when the iterate approaches the optimal solution (due to the random sample), the bound eventually scales with the step size itself and does not vanish faster.
In comparison, under A-MSA, we can control (31) with Lemmaย 3 and show
(32) Since the residuals on the right-hand side themselves decay to zero as the iterations proceed, the bound in (32) converges much faster than its counterpart under MSA.
Takeaway: Our objective is to solve the highly coupled system (5). The discussion above highlights how the A-MSA algorithm effectively decouples the decision variables, in the sense that the inaccuracy of the upper-level variables has a reduced effect on the lower-level ones. Being able to decouple the decision-variable updates is crucial, as it allows us to choose the step sizes across levels in a more independent manner.
- A second pair of comparable terms captures the variance in the estimation of the operators: one appears in the proof of Lemma 2 for A-MSA, and its counterpart appears in the proof of Lemma 4 for MSA. In the case of MSA, this variance term does not decay to zero (again due to the randomness in the samples), which eventually becomes a bottleneck to achieving the optimal rate. In contrast, the corresponding term under A-MSA can be expected to converge to zero (under a careful analysis).
Takeaway: Compared to MSA, A-MSA provides a low-variance estimate of the operators. The fact that this variance decays sufficiently fast is another key driver of the overall improved complexity.
Remark 2
At the end of this section, we point out an important additional advantage of the analysis of A-MSA over that of MSA. Observe that the Lyapunov function (26) considered in Theorem 2 weights the lower-level residuals by carefully chosen factors. These weights have been selected to ensure the proper cancellation of errors across different levels. Choosing such weights, however, requires non-trivial effort and becomes increasingly inconvenient as the number of equations grows.
In addition, we note that the weights themselves are decaying sequences. Theorem 2 therefore only yields correspondingly weaker guarantees for the lower-level residuals; in other words, the guaranteed convergence rate of the lower-level variables does not match that of the highest-level variable and may even be vacuous. In contrast, the Lyapunov function (23) used in the analysis of A-MSA simply combines the residuals over all levels without additional weights. As a result, the decision variables at all levels are guaranteed to converge at the same optimal rate.
4 Motivating Applications
In this section, we discuss how the framework (5) models the objective of a range of algorithms in RL and games. In some cases, the objective is structured such that the assumptions introduced above can be verified to hold, meaning that the theoretical guarantee immediately applies if the A-MSA algorithm is used. For the other problems, the assumptions may not hold, but we discuss how the analysis can be adapted, using the specific problem structure, to derive the complexity of A-MSA. All applications discussed are special cases of (5) with three equations, meaning that they cannot be covered by the bilevel framework in Zeng et al. (2024b); Han et al. (2024); Doan (2024); Zeng and Doan (2024).
4.1 Gradient-Based Temporal Difference learning with Momentum.
Gradient-based temporal difference (TD) learning algorithms, including GTD2 and TDC, are a popular class of methods that stably minimize a projected Bellman error under linear function approximation and off-policy samples (Sutton et al., 2008, 2009). Consider an infinite-horizon discounted MDP characterized by its action and state spaces, a transition probability kernel (whose entries give the probability of the MDP transitioning from one state to another under a given action), a reward function, and a discount factor. Each state is associated with a feature vector. Given a fixed policy to evaluate, our objective is to learn a value function parameter such that the induced linear approximation matches the value function at every state. The TDC algorithm formulates this objective as a system of two equations in the value function parameter and an auxiliary variable, i.e., we aim to solve
(33) |
A momentum term can be added to the update of the value function parameter to accelerate the convergence of this method. Deb and Bhatnagar (2022) studies this approach and shows that the objective of TDC with momentum can be described as solving a system of three equations. Compared with (33), the additional time scale is introduced to force the momentum term to decay to zero, which becomes the lowest-level equation. The second equation in (33) is primarily used to solve for the auxiliary variable and lies in the middle level, whereas the value function parameter is associated with the highest-level equation. We skip the formulation details here and point interested readers to Deb and Bhatnagar (2022, Section 4.2); a sketch of the standard two-time-scale TDC update is given below for reference. It is important to note that under a common assumption on the feature vectors (see, for example, Xu et al. (2019, Assumption 1)), the strong monotonicity condition holds, allowing us to conclude that A-MSA finds the (unique) optimal solution at the optimal rate. While TDC with momentum has been shown to converge asymptotically in Deb and Bhatnagar (2022), its finite-time complexity is unknown in the existing literature. Our paper fills this gap.
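The sketch below gives the standard two-time-scale TDC update under linear function approximation (written in our own notation and in on-policy form, so the importance-sampling ratio is omitted); the momentum variant of Deb and Bhatnagar (2022) adds a third variable, forming the lowest-level equation, which we do not reproduce exactly here.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One two-time-scale TDC update (Sutton et al., 2009), on-policy form:
    theta (slow) parameterizes the value function, w (fast) is the auxiliary
    variable that tracks a correction term."""
    delta = reward + gamma * phi_next @ theta - phi @ theta          # TD error
    theta_new = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

# Synthetic usage on a random feature stream (not a real MDP), with the fast
# step size beta decaying more slowly than the slow step size alpha.
rng = np.random.default_rng(0)
theta, w = np.zeros(4), np.zeros(4)
for k in range(1, 1001):
    phi, phi_next = rng.standard_normal(4), rng.standard_normal(4)
    theta, w = tdc_step(theta, w, phi, phi_next, rng.standard_normal(),
                        gamma=0.9, alpha=0.5 / k, beta=0.5 / k ** (2 / 3))
```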
4.2 Distributionally Robust Q-Learning
Distributionally robust reinforcement learning (DRRL) broadly studies finding a policy that is robust to unknown environmental changes. In this paper, we introduce the subject following the specific formulation in Liang et al. (2024). Consider again the MDP introduced in Section 4.1, whose transition kernel we can sample from during training. The unknown transition kernel in the test environment may deviate from the training kernel but lies within an uncertainty set of the form
Here the uncertainty set is defined through a distance between distributions and a radius parameter. The aim of DRRL is to find the distributionally robust Q function that performs optimally in the worst case
The distributionally robust Q function satisfies the robust Bellman optimality equation
(34) |
Directly solving (34), however, is challenging, as an exhaustive search over the uncertainty set is infeasible and we cannot sample from the worst-case kernel. The trick developed in the distributionally robust optimization literature (Duchi and Namkoong, 2021) is to use the dual form of the inner worst-case problem
(35) |
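As one concrete instance (a sketch; Liang et al. (2024) treat a more general family of divergences, so this is only illustrative), when the uncertainty set is a KL-divergence ball of radius $\rho$ around the nominal kernel $p^{0}$, the inner worst case over next-state values $V$ admits the classical dual representation
\[
\inf_{p\,:\,D_{\mathrm{KL}}(p\,\|\,p^{0})\,\le\,\rho}\; \mathbb{E}_{p}\big[V\big]
\;=\;
\sup_{\beta\ge 0}\;\Big\{ -\beta\,\log \mathbb{E}_{p^{0}}\big[e^{-V/\beta}\big] \;-\; \beta\rho \Big\},
\]
so the search over distributions is replaced by a scalar dual variable; solving for such dual variables alongside the Q function is, roughly, what gives rise to the additional levels in the system of equations below.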
We define a convenient shorthand for the resulting dual objective. Exploiting this relation, it can be shown that a variant of the Q-learning algorithm for the distributionally robust setting aims to find the solution tuple to the system of equations
(36)
(37)
(38)
Interested readers can find in Liang et al. (2024) the detailed derivation of this system of equations as well as how a three-time-scale algorithm is developed to solve it; Liang et al. (2024) further establishes the asymptotic convergence of the algorithm. The highest-level equation is (36), and the associated Bellman operator is strongly monotone under a common contraction assumption in Q-learning (Chen et al., 2022). The lowest-level equation is (38), whose associated operator is an identity mapping with respect to its own variables and is therefore obviously strongly monotone. The issue is that the middle-level operator in (37) is in general only (non-strongly) monotone with respect to its variable. The violation of this assumption means that our analysis is not immediately applicable. However, the techniques developed here may help design accelerated three-time-scale algorithms with an analysis tailored to the monotone structure of the middle-level operator. Additionally, it is possible that this operator becomes strongly monotone under additional assumptions on the uncertainty set, in which case we can invoke Theorem 1 and conclude that the A-MSA algorithm finds the optimal distributionally robust Q function at the optimal rate.
4.3 Actor-Critic Algorithm for Regularized Two-Player Zero-Sum Markov Games
Consider a two-player zero-sum Markov game described by a state space observed by both players, the two players' action spaces, a transition kernel, a reward function, and a discount factor. The transition kernel specifies the probability of the game transitioning from one state to another when player 1 selects its action and player 2 selects its action. The reward function is zero-sum: when players 1 and 2 take their actions in a given state, player 1 receives the reward and player 2 receives its negative. We denote the two players' policies by the distributions they place over their respective action spaces in each state. For a fixed initial state distribution, the expected discounted cumulative reward under entropy regularization (received by player 1) under a policy pair is
where the regularization weight is a non-negative scalar and the regularized value function is
Solving a regularized game means that we want to find a Nash equilibrium policy pair as the solution to
It is known from Zeng etย al. (2022) that such a Nash equilibrium exists and is unique.
Here we consider the softmax parameterization: we introduce parameters that encode the policies according to
(39) |
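In generic notation (each player carries such a parameterization over its own action space), the softmax parameterization in (39) takes the standard form
\[
\pi_{\theta}(a \mid s) \;=\; \frac{\exp\big(\theta_{s,a}\big)}{\sum_{a'} \exp\big(\theta_{s,a'}\big)} .
\]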
Taking a gradient-based approach to the problem, we can express our objective as finding a stationary point at which the gradients of the regularized objective with respect to the two players' parameters vanish, which can be expanded as below.
(40)
(41)
The value function is not directly known but solvable as the solution to the Bellman equation
(42) |
We need to solve the system of three equations (40)-(42). The lowest level is (42), which corresponds to an operator that can be shown to be strongly monotone. Equations (40) and (41) are the highest-level and middle-level equations, associated with the two players' gradient operators, which are not strongly monotone. This prevents our analysis from being immediately applicable. However, the operators exhibit an important structure: they are the gradients of functions satisfying the Polyak-Łojasiewicz (PL) condition (recalled below), which makes the operators resemble strongly monotone operators in a transformed domain. Exploiting this structure, Zeng et al. (2022) shows that a gradient descent-ascent algorithm that aims to find the solution to (40)-(41), assuming exact knowledge of the value function, converges linearly fast, which is the convergence rate to be expected when the gradient operators are strongly monotone. Combining the techniques in Zeng et al. (2022) for leveraging the PL condition and those in this work on acceleration, we believe that the A-MSA algorithm can be shown to find the Nash equilibrium of a regularized two-player zero-sum Markov game at the convergence rate established in Theorem 1.
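For reference, a differentiable function $f$ with minimum value $f^{\star}$ satisfies the Polyak-Łojasiewicz (PL) condition with constant $\mu_{\mathrm{PL}}>0$ if
\[
\tfrac{1}{2}\,\big\|\nabla f(x)\big\|^{2} \;\ge\; \mu_{\mathrm{PL}}\,\big(f(x)-f^{\star}\big) \qquad \text{for all } x,
\]
which, like strong monotonicity of the gradient operator, yields linear convergence of exact gradient descent even when $f$ is not convex.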
4.4 Actor-Critic Algorithm for Mean Field Games
A mean field game (MFG) provides an approximation of a finite-agent Markov game with homogeneous agents as the number of agents approaches infinity. The goal in solving a mean field game is to find a policy that is optimal in an MDP determined by the induced mean field, where the induced mean field is a distribution over the (infinite) population of agents and a function of the policy itself. In essence, a representative agent in an MFG needs to perform against an infinite population of other agents that all adopt the same policy.
We consider MFGs in the stationary infinite-horizon average-reward setting and follow the formulation and notation in Zeng et al. (2024a). We denote the state and action spaces of the representative agent accordingly. The state transition depends not only on the action of the representative agent but also on the aggregate population behavior, which is described by the mean field. The transition kernel gives the probability that the state transitions from one state to another when the representative agent takes a given action and the mean field takes a given value. Similarly, the reward function depends on the mean field: the representative agent receives a reward determined by its state, its action, and the current mean field. The agent does not observe the mean field, and its policy is a mapping from states to distributions over actions. We also define the state transition matrix under a given policy and mean field such that
When the mean field is and the agent takes policy , the agent can expect to collect the cumulative reward
The mean field induced by a policy (i.e., the state visitation distribution over the population when every agent follows that policy) satisfies
where is the indicator vector whose entry has value 1 if and 0 otherwise.
We again consider the softmax parameterization: a policy parameter encodes the policy according to (39). With the derivation presented in Zeng et al. (2024a), it can be shown that finding an equilibrium in an MFG can be formulated as solving the system of three equations (43)-(45)
(43)
(44)
(45)
in which we introduce an auxiliary variable that tracks the (differential) value function under the current policy and mean field. Among the three equations, (44) and (45) are at the middle and lowest levels and have associated operators satisfying strong monotonicity. The highest-level operator is not monotone but is a gradient operator. We can extend the analysis in this work, leveraging the techniques developed in Zeng and Doan (2024) for gradient operators, and show that in this case A-MSA finds a first-order stationary point (but not necessarily a globally or locally optimal solution) of the objective, in the sense of the squared gradient norm. This recovers the rate of the state-of-the-art MFG-ASAC algorithm proposed in Zeng et al. (2024a), which relies on a tailored analysis.
5 Numerical Simulations
We apply A-MSA to the MFG policy optimization problem discussed in Section 4.4. The environment is a small-scale synthetic MFG with a small number of states and actions and a randomly generated transition kernel and reward function. We compare A-MSA against the standard MSA algorithm without averaging and plot the convergence in Figure 1, tracking the policy and mean field iterates over the iterations. The norm of the policy gradient, evaluated at the latest mean field iterate, measures convergence in the policy. To quantify convergence of the mean field, we measure the distance between the current mean field iterate and the stationary state distribution induced by the current policy. The simulation shows that A-MSA has a clear advantage over MSA, though not by an apparent order of magnitude.
[Figure 1: Convergence of A-MSA and MSA on the synthetic MFG, measured by the policy gradient norm and the distance between the mean field iterate and its induced stationary distribution.]
6 Concluding Remarks
In this work we propose an accelerated multi-time-scale SA algorithm that solves a coupled system of fixed-point equations with the optimal convergence rate and sample complexity: under a strong monotonicity assumption, the last iterate of the algorithm converges to the (unique) solution at a rate of $\widetilde{\mathcal{O}}(1/k)$ for any number of equations. This is the first time such a convergence rate has been achieved without additional restrictive smoothness assumptions. We show that the objectives of a range of problems in RL and multi-agent games can be formulated as coupled systems of fixed-point equations, to which the proposed algorithm can be applied. Our analysis is easily generalizable to cases where the highest-level operator is not strongly monotone (but the lower-level operators are). An important future direction is to investigate the stability and convergence of multi-time-scale SA when the lower-level operators lose strong monotonicity.
Acknowledgement
This work was partially supported by the National Science Foundation under awards ECCS-CAREER-2339509 and CCF-2343599.
Disclaimer
This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JP Morgan Chase & Co and its affiliates ("JP Morgan"), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.
References
- Benveniste et al. (2012) Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.
- Bhandari et al. (2018) Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR, 2018.
- Borkar (1997) Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
- Borkar (2008) Vivek S. Borkar. Stochastic approximation: A dynamical systems viewpoint. Springer, 2008.
- Chen et al. (2022) Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
- Dalal et al. (2018) Gal Dalal, Gugan Thoppe, Balázs Szörényi, and Shie Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Conference on Learning Theory, pages 1199–1233. PMLR, 2018.
- Daskalakis et al. (2020) Constantinos Daskalakis, Dylan J. Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in Neural Information Processing Systems, 33:5527–5540, 2020.
- Deb and Bhatnagar (2021) Rohan Deb and Shalabh Bhatnagar. N-timescale stochastic approximation: Stability and convergence. arXiv preprint arXiv:2112.03515, 2021.
- Deb and Bhatnagar (2022) Rohan Deb and Shalabh Bhatnagar. Gradient temporal difference with momentum: Stability and convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6488–6496, 2022.
- Doan (2022) Thinh T. Doan. Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance. IEEE Transactions on Automatic Control, 68(8):4695–4705, 2022.
- Doan (2024) Thinh T. Doan. Fast nonlinear two-time-scale stochastic approximation: Achieving finite-sample complexity. arXiv preprint arXiv:2401.12764, 2024.
- Doan and Romberg (2019) Thinh T. Doan and Justin Romberg. Linear two-time-scale stochastic approximation: A finite-time analysis. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 399–406. IEEE, 2019.
- Duchi and Namkoong (2021) John C. Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
- Fort (2015) Gersende Fort. Central limit theorems for stochastic approximation with controlled Markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80, 2015.
- Gupta et al. (2019) Harsh Gupta, Rayadurgam Srikant, and Lei Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
- Han et al. (2024) Yuze Han, Xiang Li, and Zhihua Zhang. Finite-time decoupled convergence in nonlinear two-time-scale stochastic approximation. arXiv preprint arXiv:2401.03893, 2024.
- Haque et al. (2023) Shaan Ul Haque, Sajad Khodadadian, and Siva Theja Maguluri. Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise. arXiv preprint arXiv:2401.00364, 2023.
- Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
- Kaledin et al. (2020) Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. In Conference on Learning Theory, pages 2144–2203. PMLR, 2020.
- Konda and Tsitsiklis (1999) Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- Konda and Tsitsiklis (2004) Vijay R. Konda and John N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.
- Konda and Borkar (1999) Vijaymohan R. Konda and Vivek S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
- Lakshminarayanan and Szepesvári (2017) Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging. arXiv preprint arXiv:1709.04073, 2017.
- Li et al. (2023a) Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with Markovian data. arXiv preprint arXiv:2302.07690, 2023a.
- Li et al. (2023b) Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I. Jordan. A statistical analysis of Polyak-Ruppert averaged Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023b.
- Liang et al. (2024) Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jun Yang, Jiheng Zhang, and Zhengyuan Zhou. Single-trajectory distributionally robust reinforcement learning. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 29644–29666. PMLR, 2024.
- Meerkov (1972) S. M. Meerkov. Simplified description of slow Markov walks: Part I. Automation and Remote Control, 33:404–414, 1972.
- Mitra (2024) Aritra Mitra. A simple finite-time analysis of TD learning with linear function approximation. arXiv preprint arXiv:2403.02476, 2024.
- Mokkadem and Pelletier (2006) Abdelkader Mokkadem and Mariane Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.
- Moulines and Bach (2011) Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24, 2011.
- Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- Sayin et al. (2021) Muhammed Sayin, Kaiqing Zhang, David Leslie, Tamer Basar, and Asuman Ozdaglar. Decentralized Q-learning in zero-sum Markov games. Advances in Neural Information Processing Systems, 34:18320–18334, 2021.
- Shen and Chen (2022) Han Shen and Tianyi Chen. A single-timescale analysis for stochastic approximation with multiple coupled sequences. Advances in Neural Information Processing Systems, 35:17415–17429, 2022.
- Srikant (2024) R. Srikant. Rates of convergence in the central limit theorem for Markov chains, with an application to TD learning. arXiv preprint arXiv:2401.15719, 2024.
- Srikant and Ying (2019) Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR, 2019.
- Sutton et al. (2008) Richard S. Sutton, Hamid Maei, and Csaba Szepesvári. A convergent temporal-difference algorithm for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems, 21, 2008.
- Sutton et al. (2009) Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000, 2009.
- Wang et al. (2021) Yue Wang, Shaofeng Zou, and Yi Zhou. Non-asymptotic analysis for two time-scale TDC with general smooth function approximation. Advances in Neural Information Processing Systems, 34:9747–9758, 2021.
- Wu et al. (2020) Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617–17628, 2020.
- Xu et al. (2019) Tengyu Xu, Shaofeng Zou, and Yingbin Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Advances in Neural Information Processing Systems, 32, 2019.
- Zeng and Doan (2024) Sihan Zeng and Thinh Doan. Fast two-time-scale stochastic gradient method with applications in reinforcement learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 5166–5212. PMLR, 2024.
- Zeng et al. (2022) Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum Markov games. Advances in Neural Information Processing Systems, 35:34546–34558, 2022.
- Zeng et al. (2024a) Sihan Zeng, Sujay Bhatt, Alec Koppel, and Sumitra Ganesh. A single-loop finite-time convergent policy optimization algorithm for mean field games (and average-reward Markov decision processes). arXiv preprint arXiv:2408.04780, 2024a.
- Zeng et al. (2024b) Sihan Zeng, Thinh T. Doan, and Justin Romberg. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. SIAM Journal on Optimization, 34(1):946–976, 2024b.
Appendix
Appendix A Proof of Technical Lemmas
A.1 Proof of Lemmaย 1
Lemma 5
For any we have
where the constant is positive, bounded, and depends only on the structural constants of the problem.
We first derive an inequality, used frequently in the proofs of Lemmas 1 and 5, as a consequence of (18). For any iteration, we have
(46) |
where the last inequality follows from the boundedness condition in Assumptionย 4 and the definition of in (20).
By the update rule in (8), we have
This leads to
(47) |
where the last inequality follows from the conditions on the step sizes.
A.2 Proof of Lemmaย 2
By the update rule in (7), we have
Taking the norm leads to
(52) |
We bound each term of (52) separately. By the definition of , we have
This implies
(53) |
The third equation of (53) is a result of the identity that for all
(54) |
The first inequality of (53) follows from Assumption 3, and the final inequality follows from the step size conditions. The second inequality of (53) uses the bound
which we now justify
(55) |
The second inequality of (53) plugs in the bound on below, which follows from a similar argument
The term can be bounded leveraging the Lipschitz condition of
(56) |
where the second inequality plugs in the result of Lemmaย 3.
Since for any , ย (56) implies
(57) |
Next we treat
(58) |
where the third inequality uses (53) and the fourth inequality follows from the step size condition .
To bound , we plug in (53) and (56)
(59) |
where the second inequality follows from the fact that for any positive scalars . We have also simplified terms using the step size condition , , and .
We finally bound with (57),
(60) |
Collecting the bounds on - and applying them to (52), we have
where in the second inequality we have combined and simplified terms with the step size condition and , and the third inequality follows from , , , , and .
A.3 Proof of Lemmaย 3
By the definition of ,
(61) |
A.4 Proof of Lemmaย 4
By the update rule (6), we have
Taking the norm,
(64) |
A bound on can be obtained using an argument identical to (53), provided that the step sizes satisfy and
(65) |
The second term is bounded in expectation due to Assumptionย 5
(66) |
Using the Lipschitz continuity of , we have
(67) |
which implies
(68) |
The term is obviously zero in expectation as
The bound on follows from the Cauchy-Schwarz inequality
(69) |
where in the third equality we plug in (67).
Finally, we bound using a similar argument
(70) |
A.5 Proof of Lemmaย 5
This lemma essentially bounds the distance between samples from a time-varying Markov chain and the stationary distribution. Techniques for proving this lemma have been well studied in the literature. Hence, we omit the proof here but note that it is very similar to the proof of Lemma 1 in Zeng et al. (2024b).