Explicit Mean-Square Error Bounds
for Monte-Carlo and Linear Stochastic Approximation
Abstract
This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This motivates the focus on mean square error bounds for parameter estimates. It is shown that the mean square error achieves the optimal rate of $O(1/n)$, subject to conditions on the step-size sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm design.
Keywords: Stochastic Approximation, Markov chain Monte Carlo, Reinforcement learning
1 Introduction
Many questions in statistics and the area of reinforcement learning are concerned with computation of the root of a function in the form of an expectation: $\bar f(\theta) = \mathsf E[f(\theta, \Phi)]$, where $\Phi$ is a vector-valued random variable, and $f\colon \mathbb R^d \times \mathbb R^m \to \mathbb R^d$. The value $\theta^*$ satisfying $\bar f(\theta^*) = 0$ is most commonly approximated through some version of the stochastic approximation (SA) algorithm of Robbins and Monro [32, 5]. In its basic form, this is the recursive algorithm
$\theta_{n+1} = \theta_n + a_{n+1} f(\theta_n, \Phi_{n+1})$   (1)
in which $\{a_n\}$ is a non-negative gain sequence, and $\{\Phi_n\}$ is a sequence of random variables whose distribution converges to that of $\Phi$ as $n \to \infty$. The sequence $\{\Phi_n\}$ is a Markov chain in the applications of interest in this paper.
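To make the recursion concrete, the following is a minimal Python sketch of (1) for a toy problem: estimating the steady-state mean of a two-state Markov chain, so that $f(\theta, x) = x - \theta$. The transition matrix, state values, and run length are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of the SA recursion (1): theta_{n+1} = theta_n + a_{n+1} f(theta_n, Phi_{n+1}),
# here with f(theta, x) = x - theta, so theta_n estimates the steady-state mean.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # transition matrix of the Markov chain (illustrative)
states = np.array([0.0, 1.0])     # value of Phi in each state

theta, x = 0.0, 0
for n in range(1, 100_001):
    x = rng.choice(2, p=P[x])               # Phi_{n+1}: one Markov transition
    theta += (states[x] - theta) / n        # gain a_n = 1/n

print(theta)   # approaches the steady-state mean: pi = (2/3, 1/3), so pi(g) = 1/3
```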
There is a large body of work on conditions for convergence of this recursion, and also a Central Limit Theorem (CLT): with $\tilde\theta_n := \theta_n - \theta^*$,
$\lim_{n\to\infty} \theta_n = \theta^*$   almost surely
$\sqrt{n}\,\tilde\theta_n \to N(0, \Sigma_\theta)$   in distribution
The matrix $\Sigma_\theta$ is known as the asymptotic covariance. The CLT requires substantially stronger assumptions on the gain sequence, the function $f$, and the statistics of the "noise" sequence [2, 25].
Soon after the stochastic approximation algorithm was first introduced in [32, 4], Chung [9] identified the optimal CLT covariance and techniques to obtain the optimum for scalar recursions. This can be cast as a form of stochastic Newton-Raphson (SNR) [14, 15, 12, 11]. Gradient free methods [or stochastic quasi Newton-Raphson (SQNR)] appeared in later work: The first example was proposed by Venter in [39], which was shown to obtain the optimal variance for a one-dimensional SA recursion. The algorithm obtains estimates of the SNR gain (see (2) below), through a procedure similar to the Kiefer-Wolfowitz algorithm [23]. Ruppert proposed an extension of Venter’s algorithm for vector-valued functions [33].
The averaging technique of Ruppert and Polyak [34, 30, 31] is a two-time-scale algorithm that is also designed to achieve the optimal asymptotic covariance. More recently, a two-time-scale variant of the SNR algorithm known as “Zap-SNR” was proposed in [14, 15, 12, 11], with applications to reinforcement learning. Zap algorithms are stable and convergent under mild assumptions [14, 7].
Under the typical assumptions under which the CLT holds for the recursion (1), the asymptotic covariance has an explicit form in terms of a linearization [24, Chapter 10, Theorem 3.3]. Assume that the solution $\theta^*$ to $\bar f(\theta) = 0$ is unique, and denote
$A := \partial_\theta \bar f(\theta^*)\,, \qquad \Delta_{n+1} := f(\theta^*, \Phi_{n+1})$   (2)
The error dynamics of the SA recursion are then approximated by the linear SA recursion:
$\tilde\theta_{n+1} = \tilde\theta_n + a_{n+1}\bigl[A\tilde\theta_n + \Delta_{n+1}\bigr]$   (3)
Subject to the assumption that $\tfrac12 I + A$ is Hurwitz (i.e., $\operatorname{Re}(\lambda) < -\tfrac12$ for each eigenvalue $\lambda$ of $A$), the matrix $\Sigma_\theta$ is the unique positive semi-definite solution to the Lyapunov equation
$\bigl(\tfrac12 I + A\bigr)\Sigma_\theta + \Sigma_\theta\bigl(\tfrac12 I + A\bigr)^{\mathsf T} + \Sigma_\Delta = 0$   (4)
in which $\Sigma_\Delta$ is also an asymptotic covariance: the covariance matrix appearing in the CLT for the sequence $\{\Delta_n\}$ (which may be expressed in terms of a solution to a Poisson equation; see [24, Theorem 2.2, Chapter 10]).
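Once $A$ and $\Sigma_\Delta$ are in hand, the asymptotic covariance can be computed numerically from the Lyapunov equation (4). Below is a minimal sketch using SciPy's continuous Lyapunov solver; the matrices $A$ and $\Sigma_\Delta$ are illustrative placeholders.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Solve the Lyapunov equation (4): (I/2 + A) S + S (I/2 + A)^T + Sigma_Delta = 0.
A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])           # must satisfy Re(lambda) < -1/2
Sigma_Delta = np.array([[1.0, 0.2],
                        [0.2, 0.5]])

M = 0.5 * np.eye(2) + A
# scipy solves M X + X M^H = Q, so pass Q = -Sigma_Delta:
Sigma_theta = solve_continuous_lyapunov(M, -Sigma_Delta)
print(Sigma_theta)                    # asymptotic covariance in the CLT
```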
The goal of this paper is to demonstrate that the CLT is far less asymptotic than it may appear. For this we focus analysis on the linearization (3), along with first steps towards analysis of the nonlinear recursions. Subject to assumptions on $A$ and the Markov chain, we establish the bound
$\mathsf E\bigl[\|\tilde\theta_n\|^2\bigr] \le \frac{\sigma_\theta^2}{n} + o(1/n)$   (5)
with $\sigma_\theta^2 := \operatorname{trace}(\Sigma_\theta)$. Under further assumptions, the bound is refined to equality, $\mathsf E[\|\tilde\theta_n\|^2] = \sigma_\theta^2/n + o(1/n)$, and even finer bounds:
$\mathsf E\bigl[\|\tilde\theta_n\|^2\bigr] = \frac{\sigma_\theta^2}{n} + \frac{b_\theta}{n^2} + o(1/n^2)$   (6)
where again $\sigma_\theta^2 = \operatorname{trace}(\Sigma_\theta)$, and a formula for $b_\theta$ is obtained in the paper based on a second Lyapunov equation and a solution to a second Poisson equation.
It is hoped that these results will be helpful in the construction and performance analysis of many algorithms found in machine learning, statistics and reinforcement learning. Identification of the coefficient $b_\theta$ of the $1/n^2$ term in (6) may lead to criteria for gain design when one aims to minimize the covariance with a fixed budget on the number of iterations.
The reader may ask, why not search directly for finite-$n$ bounds of the flavor of Hoeffding's inequality:
$\mathsf P\bigl\{\|\theta_n - \theta^*\| \ge \varepsilon\bigr\} \le \bar K \exp\bigl(-n I(\varepsilon)\bigr)$   (7)
where $\bar K < \infty$ is fixed, and $I$ is a convex function that is strictly positive and finite in a region $(0, \bar\varepsilon)$. The answer is that such bounds are not always possible even for the simplest SA recursions, even when the Markov chain is geometrically ergodic. This is clarified in the first general example:
1.1 Markov Chain Monte Carlo
As a prototypical example of stochastic approximation, Markov chain Monte Carlo (MCMC) proceeds by constructing an ergodic Markov chain $\Phi$ with invariant measure $\pi$ so as to estimate $\pi(g) := \mathsf E_\pi[g(\Phi)]$ for some function $g$. One then simulates the chain to obtain the estimates
$\theta_n = \frac{1}{n}\sum_{k=1}^{n} g(\Phi_k)$   (8)
This is an instance of the SA recursion (1):
$\theta_{n+1} = \theta_n + \frac{1}{n+1}\bigl[g(\Phi_{n+1}) - \theta_n\bigr]$   (9)
Subtracting $\pi(g)$ from both sides of (9) gives, with $\tilde\theta_n := \theta_n - \pi(g)$,
$\tilde\theta_{n+1} = \tilde\theta_n + \frac{1}{n+1}\bigl[-\tilde\theta_n + \Delta_{n+1}\bigr]$
which is (3) in a special case: $A = -1$, and $\Delta_{n+1} = g(\Phi_{n+1}) - \pi(g)$.
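The algebraic equivalence of the average (8) and the recursion (9) is easy to verify numerically; in this sketch the values $g(\Phi_k)$ are stand-in random samples, since only the identity is being checked.

```python
import numpy as np

# Check: the recursion (9) reproduces the Monte-Carlo average (8) exactly.
rng = np.random.default_rng(1)
g_vals = rng.normal(size=1000)           # stands in for g(Phi_1), ..., g(Phi_n)

theta = 0.0
for n, g in enumerate(g_vals, start=1):
    theta += (g - theta) / n             # recursion (9) with gain 1/n

assert np.isclose(theta, g_vals.mean())  # equals the running average (8)
```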
A significant part of the literature on MCMC focuses on finding Markov chains whose marginals approach the invariant measure quickly. Error estimates for MCMC have been obtained only in rather restrictive settings. For instance, under the assumption of uniform ergodicity of $\Phi$ and uniform boundedness of $g$ (which rarely hold in practice outside of a finite state space), a generalized Hoeffding inequality was obtained in [19], yielding the PAC-style error bound (7). We cannot expect Hoeffding's bound if either of these assumptions is relaxed. Consider the simplest countable state space Markov chain: the M/M/1 queue with uniformization, defined with $\Phi_0 = 0$ and
$\Phi_{n+1} = \max(0, \Phi_n + D_{n+1})\,, \qquad \mathsf P\{D_n = 1\} = \alpha\,, \quad \mathsf P\{D_n = -1\} = \mu = 1 - \alpha\,,$
where $\{D_n\}$ is i.i.d.
This is a reversible, geometrically ergodic Markov chain when $\rho := \alpha/\mu < 1$, with geometric invariant distribution $\pi(x) = (1-\rho)\rho^x$, $x \ge 0$. It is shown in [28] that the error bound (7) fails for most unbounded functions $g$. The question is looked at in greater depth in [17, 16], where asymptotic bounds are obtained for the special case $g(x) = x$. An asymptotic version of (7) is obtained for the lower tail:
$\lim_{n\to\infty} \frac{1}{n}\log\mathsf P\bigl\{\theta_n \le \pi(g) - \varepsilon\bigr\} = -I_-(\varepsilon)$   (10)
in which the right hand side is strictly negative and finite valued for positive $\varepsilon$ in a neighborhood of zero. An entirely different scaling is required for the upper tail:
$\lim_{n\to\infty} \frac{1}{\sqrt{n}}\log\mathsf P\bigl\{\theta_n \ge \pi(g) + \varepsilon\bigr\} = -I_+(\varepsilon)$   (11)
where again the right hand side is strictly negative and finite valued for $\varepsilon > 0$ sufficiently small. It follows from (11) that the PAC-style bound (7) is not attainable.
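The tail asymmetry can be seen in simulation. The sketch below runs independent Monte-Carlo estimates of the steady-state mean of the uniformized M/M/1 queue; the load $\rho = 2/3$ and the run lengths are illustrative choices. The empirical distribution of the estimates is typically skewed to the right, with overestimates driven by long busy periods, consistent with the different scalings in (10) and (11).

```python
import numpy as np

# Monte-Carlo estimates of the steady-state mean of the reflected random walk
# Q_{k+1} = max(0, Q_k + D_{k+1}), P(D = 1) = alpha, P(D = -1) = mu = 1 - alpha.
rng = np.random.default_rng(2)
alpha, mu = 0.4, 0.6                          # rho = alpha/mu = 2/3
n, trials = 10_000, 200
true_mean = (alpha / mu) / (1 - alpha / mu)   # rho/(1 - rho) = 2 for geometric pi

est = np.empty(trials)
for t in range(trials):
    q, s = 0, 0.0
    for d in rng.choice([1, -1], size=n, p=[alpha, mu]):
        q = max(0, q + d)
        s += q
    est[t] = s / n                            # the estimate theta_n of (8)

print(true_mean, np.quantile(est, [0.05, 0.5, 0.95]))   # right-skewed spread
```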
1.2 Reinforcement Learning
The theory of this paper also applies to TD-learning. In this case, the Markov chain contains as one component a state process for a system to be controlled.
Consider a Markov chain $X$ evolving on a (Polish) state space $\mathsf X$. Given a cost function $c\colon \mathsf X \to \mathbb R$, and a discount factor $\beta \in (0,1)$, the goal in TD-learning is to approximate the solution $h$ to the Bellman equation:
$h(x) = c(x) + \beta\,\mathsf E\bigl[h(X_{n+1}) \mid X_n = x\bigr]$   (12)
This functional equation can be recast:
$\mathsf E\bigl[c(X_n) + \beta h(X_{n+1}) - h(X_n) \mid \mathcal F_n\bigr] = 0$   (13a)
where $\mathcal F_n := \sigma(X_0, \dots, X_n)$   (13b)
Equation (13a) may be regarded as motivation for the TD-learning algorithms of [36, 38].
Consider a linearly parameterized family of candidate approximations $h_\theta = \theta^{\mathsf T}\psi$, where $\psi\colon \mathsf X \to \mathbb R^d$ denotes the basis functions. The goal in TD-learning is to solve the Galerkin relaxation of (13a,13b):
$\mathsf E\bigl[\bigl(c(X_n) + \beta h_\theta(X_{n+1}) - h_\theta(X_n)\bigr)\,\zeta_n\bigr] = 0$   (14)
where $\zeta$ is a $d$-dimensional stochastic process, adapted to $\mathcal F_n$, and the expectation is with respect to the steady state distribution. In particular, in TD($\lambda$) learning, $\zeta_n = \sum_{k \le n}(\beta\lambda)^{n-k}\psi(X_k)$, so that the goal in this case is to find $\theta^*$ such that:
$0 = \mathsf E\bigl[\bigl(c(X_n) + \beta h_{\theta^*}(X_{n+1}) - h_{\theta^*}(X_n)\bigr)\,\zeta_n\bigr]$   (15)
The TD($\lambda$) algorithm is the SA recursion (1) applied to solve (15):
$\theta_{n+1} = \theta_n + a_{n+1}\, d_{n+1}\,\zeta_n\,, \qquad d_{n+1} := c(X_n) + \beta\,\theta_n^{\mathsf T}\psi(X_{n+1}) - \theta_n^{\mathsf T}\psi(X_n)$   (16)
$\zeta_{n+1} = \beta\lambda\,\zeta_n + \psi(X_{n+1})$
Denoting
$A_{n+1} := \zeta_n\bigl(\beta\psi(X_{n+1}) - \psi(X_n)\bigr)^{\mathsf T}\,, \qquad b_{n+1} := -\zeta_n\, c(X_n)\,,$
the algorithm (16) can be rewritten as:
$\theta_{n+1} = \theta_n + a_{n+1}\bigl[A_{n+1}\theta_n - b_{n+1}\bigr]$   (17)
Note that $\theta^*$ from (14) solves the linear equation $A\theta^* = b$, with $A := \mathsf E_\pi[A_{n+1}]$ and $b := \mathsf E_\pi[b_{n+1}]$. Subtracting $\theta^*$ from both sides of (17) gives, with $\tilde\theta_n = \theta_n - \theta^*$,
$\tilde\theta_{n+1} = \tilde\theta_n + a_{n+1}\bigl[A_{n+1}\tilde\theta_n + \Delta_{n+1}\bigr]\,, \qquad \Delta_{n+1} := A_{n+1}\theta^* - b_{n+1}$   (18)
Under mild conditions, we show through coupling that the iteration (18) can be closely approximated by the linear SA recursion (3) with matrix $A$ and noise sequence $\{\Delta_n\}$. In particular, the two recursions have the same asymptotic covariance if the matrix $\tfrac12 I + A$ is Hurwitz (see Section 2.3).
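For concreteness, the following is a minimal sketch of TD(0) (i.e., $\lambda = 0$, so $\zeta_n = \psi(X_n)$) written in the linear-SA form (17). The two-state chain, cost function, and basis functions are illustrative assumptions, not from the paper.

```python
import numpy as np

# TD(0) in the form (17): theta_{n+1} = theta_n + a_{n+1}[A_{n+1} theta_n - b_{n+1}],
# with A_{n+1} = zeta_n (beta psi(X_{n+1}) - psi(X_n))^T and b_{n+1} = -zeta_n c(X_n).
rng = np.random.default_rng(3)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
c = np.array([1.0, 0.0])                      # cost function
beta = 0.9
psi = lambda x: np.array([1.0, float(x)])     # basis functions (full rank here)

theta, x = np.zeros(2), 0
for n in range(1, 200_001):
    x_next = rng.choice(2, p=P[x])
    zeta = psi(x)                             # eligibility vector for lambda = 0
    A_n = np.outer(zeta, beta * psi(x_next) - psi(x))
    b_n = -zeta * c[x]
    theta += (A_n @ theta - b_n) / (n + 1)
    x = x_next

# Since the basis is full rank, theta* reproduces h = (I - beta P)^{-1} c exactly:
h = np.linalg.solve(np.eye(2) - beta * P, c)
print(np.array([theta @ psi(0), theta @ psi(1)]), h)
```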
Under general assumptions on the Markov chain $X$, and the basis functions $\psi$, it is known that the matrix $A$ is Hurwitz, and that the sequence of estimates converges to $\theta^*$ [38]. However, when the discount factor $\beta$ is close to $1$, it can be shown that $\operatorname{Re}(\lambda_{\max}(A)) > -\tfrac12$ (where $\lambda_{\max}(A)$ denotes the eigenvalue of $A$ with largest real part), and $\operatorname{Re}(\lambda_{\max}(A))$ is in fact close to $0$ under mild additional assumptions [14, 11, 13]. It follows that the algorithm has infinite asymptotic covariance: full details and finer results can be deduced from Theorems 2.4 and 2.7.
The SNR algorithm is defined as follows:
$\theta_{n+1} = \theta_n - a_{n+1}\,\widehat A_{n+1}^{-1}\bigl[A_{n+1}\theta_n - b_{n+1}\bigr]$   (19)
$\widehat A_{n+1} = \widehat A_n + a_{n+1}\bigl[A_{n+1} - \widehat A_n\bigr]$   (20)
Under the assumption that the matrix $\widehat A_n$ is invertible for each $n$, it is shown in [14, 11] that the sequence of estimates obtained using (19,20) is identical to the parameter estimates obtained using the LSTD($\lambda$) algorithm: $\theta_{n+1} = \widehat A_{n+1}^{-1}\,\widehat b_{n+1}$, with
$\widehat b_{n+1} = \widehat b_n + a_{n+1}\bigl[b_{n+1} - \widehat b_n\bigr]$
Consequently, the LSTD($\lambda$) algorithm achieves the optimal asymptotic covariance.
Q-learning and many other RL algorithms can also be cast as SA recursions. They are no longer linear, but it is anticipated that bounds can be obtained in future research through linearization [18].
1.3 Literature Survey
Finite time performance bounds for linear stochastic approximation were obtained in many prior papers, subject to the assumption that the noise sequence $\{\Delta_n\}$ appearing in (3) is a martingale difference sequence [10, 26]. This assumption is rarely satisfied in the applications of interest to the authors.
Much of the literature on finite time bounds for linear SA recursions with Markovian noise is recent. For constant step-size algorithms with step-size $a_n \equiv \alpha > 0$, it follows from analysis in [6] that the pair process $(\theta_n, \Phi_n)$ is a geometrically ergodic Markov chain, and the covariance of $\theta_n$ is of order $O(\alpha)$ in steady state. Finite time bounds of order $O(\alpha)$ were obtained in [37, 3, 35, 21]. Unfortunately, these bounds are not tight, and hence their value for algorithm design is limited.
Mean-square error bounds have also been obtained for diminishing step-size algorithms, to establish the optimal rate of convergence $O(1/n)$ [35, 3, 8]. The constant in these bounds is a function of the mixing time of the underlying Markov chain. These results require strong assumptions (uniform ergodicity of the Markov chain), and do not obtain the optimal constant $\sigma_\theta^2$. Rather than parameter estimation error, finite time bounds are obtained in [22] for $\mathsf E[\|\bar f(\theta_n)\|^2]$, which may be regarded as a far more relevant performance criterion. Bounds are obtained for Markovian models, subject to the existence of a Lyapunov function similar to what is assumed in the present work. It is again not clear if the resulting bounds are tight, or have value in algorithm design.
1.4 Contributions
The main contribution of this paper is a general framework for analyzing the finite time performance of linear stochastic approximation algorithms with Markovian noise, and vanishing step-size (required to achieve the optimal rate of convergence of Chung-Ruppert-Polyak). The M/M/1 queue example illustrates plainly that Markovian noise introduces challenges not seen in the "white noise" setting, and that the finite-$n$ error bound (7) cannot be obtained without substantial restrictions. Even under the assumptions of [19] (uniform ergodicity, and bounded noise), the resulting bounds are extremely loose and hence may give little insight for algorithm design. Our approach allows us to obtain explicit bounds, without the uniform boundedness assumption on the noise that is frequently imposed in the literature [3, 35, 21, 8]. Instead, it is assumed that the underlying Markov chain is $V$-uniformly ergodic; an assumption that is far weaker than geometric or uniform mixing.
Our starting point is the classical martingale approximation of the noise used in CLT analysis of Markov chains [29, Chapter 17], and used in the analysis of SA recursions since Metivier and Priouret [27]. Under mild assumptions on the Markov chain, each $\Delta_{n+1}$ can be expressed as the sum of a martingale difference term and a telescoping term. The solution of the linear recursion (3) is decomposed as a sum of the respective responses:
$\tilde\theta_n = \tilde\theta^{\mathcal M}_n + \tilde\theta^{\mathcal T}_n$   (21)
The challenge is to obtain explicit bounds on the mean square error for each term.
We say that a deterministic vector-valued sequence $\{b_n\}$ converges to zero at rate $1/n^{\varrho}$ if $\lim_{n\to\infty} n^{\varrho'}\|b_n\| = 0$ for every $\varrho' < \varrho$.
Bounds for the mean-square error are obtained in Thm. 2.4, subject to conditions on both the matrix $A$ and the noise sequence. In summary, under general assumptions on $\Phi$,
- (i) The bound (5) holds if $\tfrac12 I + A$ is Hurwitz.
- (ii) If $I + A$ is Hurwitz, then the finer bound (6) holds.
- (iii) If there is an eigenvalue $\lambda$ of $A$ satisfying $-\varrho_0 := \operatorname{Re}(\lambda) > -\tfrac12$, and a corresponding left-eigenvector $v$ that lies outside of the null-space of the asymptotic covariance $\Sigma_\Delta$ of the noise sequence, then
$\liminf_{n\to\infty}\, n^{2\varrho_0}\,\mathsf E\bigl[|v^{\mathsf T}\tilde\theta_n|^2\bigr] > 0$   (22)
with $\varrho_0 = |\operatorname{Re}(\lambda)| < \tfrac12$. The convergence of $\mathsf E[\|\tilde\theta_n\|^2]$ to zero is thus no faster than $1/n^{2\varrho_0}$.
2 Mean Square Convergence
2.1 Notation and Background
Consider the linear SA recursion (3), with the noise sequence $\{\Delta_n\}$ defined in (2). We use the following notation to explicitly represent the noise as a function of the underlying Markov state:
$\Delta_{n+1} = \Delta(\Phi_{n+1})\,, \qquad \Delta(x) := f(\theta^*, x)$   (23)
A form of geometric ergodicity is assumed throughout. To apply standard theory, it is assumed that the state space $\mathsf X$ is Polish (the standing assumption in [29]). We fix a measurable function $V\colon \mathsf X \to [1, \infty)$, and let $L_\infty^V$ denote the set of measurable functions $g\colon \mathsf X \to \mathbb R$ satisfying
$\|g\|_V := \sup_{x \in \mathsf X} \frac{|g(x)|}{V(x)} < \infty$
The Markov chain $\Phi$ is assumed to be $V$-uniformly ergodic: there exists $\delta > 0$, and $b_V < \infty$ such that for each $g \in L_\infty^V$, $x \in \mathsf X$, $n \ge 0$,
$\bigl|\mathsf E[g(\Phi_n) \mid \Phi_0 = x] - \pi(g)\bigr| \le b_V\,\|g\|_V\, V(x)\, e^{-\delta n}$   (24)
where $\pi$ is the unique invariant measure, and $\pi(g)$ is the steady state mean of $g$.
The uniform bound (24) is not a strong assumption. For example, it is satisfied for the M/M/1 queue described in Section 1.1, with $V(x) = e^{\varepsilon x}$ for sufficiently small $\varepsilon > 0$, provided $\rho < 1$ [29, Thm. 16.4.1].
The following are imposed throughout:
Assumptions:
- (A1) The Markov process $\Phi$ is $V$-uniformly ergodic, with unique invariant measure denoted $\pi$.
- (A2) The matrix $A$ is Hurwitz, and the step-size sequence is $a_n = 1/n$, $n \ge 1$.
- (A3) The function $\Delta\colon \mathsf X \to \mathbb R^d$ satisfies $\pi(\Delta_i) = 0$ and $\Delta_i^2 \in L_\infty^V$ for each $1 \le i \le d$.
For any $g \in L_\infty^V$, denote $\tilde g := g - \pi(g)$, and
$\hat g(x) := \sum_{n=0}^{\infty} \mathsf E\bigl[\tilde g(\Phi_n) \mid \Phi_0 = x\bigr]$   (25)
It is evident that $\hat g \in L_\infty^V$ under (A1). Further conclusions are summarized below. Thm. 2.1 (i) follows immediately from (A1). Part (ii) follows from (i) and [29, Lemma 15.2.9] (the chain is also $\sqrt V$-uniformly ergodic).
Theorem 2.1.
The following conclusions hold for a $V$-uniformly ergodic Markov chain:
- (i) The function $\hat g$ defined in (25) has zero mean, and solves Poisson's equation:
$\mathsf E\bigl[\hat g(\Phi_{n+1}) \mid \Phi_n = x\bigr] = \hat g(x) - \tilde g(x)$   (26)
- (ii) If $g^2 \in L_\infty^V$, then $\hat g^2 \in L_\infty^V$. ∎
Assumption (A3) implies that the sequence $\{\Delta_n\}$ appearing in (3) is zero mean for the stationary version of the Markov chain $\Phi$. Its asymptotic covariance (appearing in the Central Limit Theorem) is denoted
$\Sigma_\Delta := \sum_{k=-\infty}^{\infty} \mathsf E\bigl[\Delta_0 \Delta_k^{\mathsf T}\bigr]$   (27)
where the expectations are in steady state.
A more useful representation of is obtained through a decomposition of the noise sequence based on Poisson’s equation. This now standard technique was introduced in the SA literature in the 1980s [27].
With $\Delta$ defined in (23), denote by $\hat\Delta$ a solution to Poisson's equation:
$\mathsf E\bigl[\hat\Delta(\Phi_{n+1}) \mid \Phi_n = x\bigr] = \hat\Delta(x) - \Delta(x)$   (28)
This is in fact $d$ separate Poisson equations, since $\Delta$ takes values in $\mathbb R^d$. It is assumed for convenience that the solutions are normalized so that $\hat\Delta$ has zero steady-state mean. This is justified by the fact that $\hat\Delta + c$ also solves (28) for any constant vector $c$ under assumption (A3). The fact that $\hat\Delta_i^2 \in L_\infty^V$ for $1 \le i \le d$ follows from Thm. 2.1 (ii).
We then write, for $n \ge 1$,
$\Delta_n = \zeta_n + \mathcal T_{n-1} - \mathcal T_n$
where $\mathcal T_n := \mathsf E[\hat\Delta(\Phi_{n+1}) \mid \Phi_n]$, and $\zeta_n := \hat\Delta(\Phi_n) - \mathcal T_{n-1}$ is a martingale difference sequence. Each of the sequences is bounded in $L_2$, and the asymptotic covariance (27) is expressed
$\Sigma_\Delta = \mathsf E\bigl[\zeta_n \zeta_n^{\mathsf T}\bigr]$   (29)
where the expectation is taken in steady-state. The equivalence of (29) and (27) appears in [29, Theorem 17.5.3] for the case in which $\Delta$ is scalar valued; the generalization to vector valued processes involves only notational changes.
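For a finite-state chain, Poisson's equation (28) can be solved exactly via the fundamental matrix, giving a direct numerical check of the representation (29) against the standard one-dimensional formula $2\pi(\Delta\hat\Delta) - \pi(\Delta^2)$. The chain and function below are illustrative (scalar case, $d = 1$).

```python
import numpy as np

# Verify (29): Sigma_Delta = E_pi[zeta_n^2], with zeta_n the martingale
# difference from the Poisson decomposition, for a two-state chain.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])                    # invariant distribution: pi P = pi

g = np.array([0.0, 1.0])
Delta = g - pi @ g                           # zero-mean noise function, as in (A3)

# Fundamental-matrix solution of (28): P hatDelta = hatDelta - Delta.
Z = np.linalg.inv(np.eye(2) - P + np.outer(np.ones(2), pi))
hatDelta = Z @ Delta                         # already has zero steady-state mean

# E_pi[zeta^2] = pi(hatDelta^2) - pi((P hatDelta)^2), since
# zeta_n = hatDelta(Phi_n) - (P hatDelta)(Phi_{n-1}).
PhD = P @ hatDelta
Sigma_Delta = pi @ hatDelta**2 - pi @ PhD**2

assert np.isclose(Sigma_Delta, 2 * pi @ (Delta * hatDelta) - pi @ Delta**2)
print(Sigma_Delta)
```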
2.2 Decomposition and Scaling of the Parameter Sequence
We now explain the decomposition (21). Each of the two sequences $\{\tilde\theta^{\mathcal M}_n\}$ and $\{\tilde\theta^{\mathcal T}_n\}$ evolves as a stochastic approximation sequence, differentiated by the inputs and initial conditions:
$\tilde\theta^{\mathcal M}_{n+1} = \tilde\theta^{\mathcal M}_n + a_{n+1}\bigl[A\tilde\theta^{\mathcal M}_n + \zeta_{n+1}\bigr]\,, \quad \tilde\theta^{\mathcal M}_0 = 0$   (30a)
$\tilde\theta^{\mathcal T}_{n+1} = \tilde\theta^{\mathcal T}_n + a_{n+1}\bigl[A\tilde\theta^{\mathcal T}_n + \mathcal T_n - \mathcal T_{n+1}\bigr]\,, \quad \tilde\theta^{\mathcal T}_0 = \tilde\theta_0$   (30b)
The second recursion admits a more tractable realization through the change of variables $\vartheta_n := \tilde\theta^{\mathcal T}_n + a_n \mathcal T_n$, $n \ge 1$.
Lemma 2.2.
The sequence $\{\vartheta_n\}$ evolves as the SA recursion
(31)
∎
It is more convenient to work directly with the recursion for the scaled sequence:
Lemma 2.3.
For any $\varrho > 0$, the scaled sequence $z_n := n^{\varrho}\,\tilde\theta_n$ admits the recursion
(33)
where the gain and disturbance sequences are identified in the proof (Appendix A.1). ∎
2.3 Mean Square Error Bounds
Fix the initial condition $\tilde\theta_0 \in \mathbb R^d$, and denote $\Sigma_n := \mathsf E[\tilde\theta_n \tilde\theta_n^{\mathsf T}]$ and $\sigma_n^2 := \mathsf E[\|\tilde\theta_n\|^2] = \operatorname{trace}(\Sigma_n)$. The following summarizes bounds on the convergence rate of $\sigma_n^2$.
Theorem 2.4.
Suppose (A1)-(A3) hold. Then, for the linear recursion (3),
- (i) If $\operatorname{Re}(\lambda) < -\tfrac12$ for every eigenvalue $\lambda$ of $A$, then for some $\varepsilon > 0$,
$\Sigma_n = \frac{1}{n}\Sigma_\theta + O(n^{-1-\varepsilon})$
where $\Sigma_\theta$ is the solution to the Lyapunov equation (4). Consequently, the rate of convergence of $\sigma_n^2$ is $1/n$.
- (ii) Suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_0 := \operatorname{Re}(\lambda) > -\tfrac12$. Let $v \ne 0$ denote a corresponding left eigenvector, and suppose that $\Sigma_\Delta v \ne 0$. Then, $\mathsf E[|v^{\mathsf T}\tilde\theta_n|^2]$ converges to $0$ at rate $1/n^{2\varrho_0}$. ∎
The proof of Thm. 2.4 is contained in Section 2.5. The following negative result is a direct corollary of Thm. 2.4 (ii):
Corollary 2.5.
Suppose (A1)-(A3) hold. Moreover, suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_0 := \operatorname{Re}(\lambda) > -\tfrac12$, with corresponding left eigenvector $v$ satisfying $\Sigma_\Delta v \ne 0$. Then, $\sigma_n^2$ converges to zero at rate no faster than $1/n^{2\varrho_0}$.
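The limit in Thm. 2.4 (i) is easily tested by simulation. The sketch below uses the simplest special case of (3), with i.i.d. noise (so that $\Sigma_\Delta$ is the one-step noise covariance), and compares the Monte-Carlo estimate of $n\,\mathsf E[\|\tilde\theta_n\|^2]$ with $\operatorname{trace}(\Sigma_\theta)$ from (4). The matrices are illustrative and satisfy $\operatorname{Re}(\lambda) < -1/2$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(4)
A = np.array([[-1.0, 0.0],
              [0.5, -1.5]])                  # eigenvalues -1, -1.5
Sigma_Delta = np.eye(2)                      # i.i.d. N(0, I) noise

Sigma_theta = solve_continuous_lyapunov(0.5 * np.eye(2) + A, -Sigma_Delta)

n, trials = 2000, 1000
mse = 0.0
for _ in range(trials):
    theta = np.array([1.0, -1.0])            # initial error tilde-theta_0
    for k in range(1, n + 1):
        theta += (A @ theta + rng.standard_normal(2)) / (k + 1)   # recursion (3)
    mse += theta @ theta
mse /= trials

print(n * mse, np.trace(Sigma_theta))        # both close to trace(Sigma_theta)
```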
One challenge in extension to nonlinear recursions is that the noise sequence depends on the parameter estimates (recall (2)). This is true even for TD learning with linear function approximation (see (18) and surrounding discussion). Extension to these recursions is obtained through coupling.
Consider the error sequence $\{\tilde\theta_n\}$ for the random linear recursion
$\theta_{n+1} = \theta_n + a_{n+1}\bigl[A_{n+1}\theta_n - b_{n+1}\bigr]$   (36)
subject to the following assumptions:
- (A4) The sequences $\{A_n\}$ and $\{b_n\}$ are functions of the Markov chain:
$A_{n+1} = \mathcal A(\Phi_{n+1})\,, \qquad b_{n+1} = \mathcal B(\Phi_{n+1})\,,$
which satisfy $\mathcal A_{ij}^2 \in L_\infty^V$, $\mathcal B_i^2 \in L_\infty^V$ for each $i, j$. The steady state means are denoted $A := \pi(\mathcal A)$, $b := \pi(\mathcal B)$. Moreover, the matrix $A$ is Hurwitz, and $A\theta^* = b$.
Theorem 2.6.
Suppose (A1)-(A4) hold, and $\operatorname{Re}(\lambda) < -\tfrac12$ for every eigenvalue $\lambda$ of $A$. Then the conclusion of Thm. 2.4 (i) holds for the error sequence of the recursion (36). ∎
The proof of the theorem is via coupling with (3). For this we write (36) in the suggestive form
$\theta_{n+1} = \theta_n + a_{n+1}\bigl[A\theta_n - b + \widetilde A_{n+1}\theta_n - \tilde b_{n+1}\bigr]\,, \qquad \widetilde A_{n+1} := A_{n+1} - A\,, \quad \tilde b_{n+1} := b_{n+1} - b$   (37)
With common initial condition $\theta_0$, the sequence $\{\theta_n\}$ is compared with the sequence $\{\theta^{\rm lin}_n\}$ generated by the corresponding linear SA algorithm, whose error sequence $\tilde\theta^{\rm lin}_n := \theta^{\rm lin}_n - \theta^*$ evolves as (3):
$\theta^{\rm lin}_{n+1} = \theta^{\rm lin}_n + a_{n+1}\bigl[A\theta^{\rm lin}_n - b + \widetilde A_{n+1}\theta^* - \tilde b_{n+1}\bigr]$
The difference sequence $\varepsilon_n := \theta_n - \theta^{\rm lin}_n$ evolves according to (3), but with a vanishing noise sequence:
$\varepsilon_{n+1} = \varepsilon_n + a_{n+1}\bigl[A\varepsilon_n + \widetilde A_{n+1}(\theta_n - \theta^*)\bigr]$   (38)
By decomposing the disturbance in (38) into martingale difference and telescoping sequences based on Poisson's equation, the technique used to prove Thm. 2.4 can be used to obtain the following bound on the mean-square coupling error.
Let $\lambda_{\max}$ denote an eigenvalue of the matrix $A$ with largest real part (i.e., $\varrho_{\max} := |\operatorname{Re}(\lambda_{\max})|$ is minimal).
Theorem 2.7.
Under Assumptions (A1)-(A4), with $\varepsilon_n := \theta_n - \theta^{\rm lin}_n$:
- (i) $\mathsf E[\|\varepsilon_n\|^2] = O(1/n^2)$ if $\varrho_{\max} > \tfrac12$.
- (ii) $\mathsf E[\|\varepsilon_n\|^2] = O(1/n^{2\varrho})$ for all $\varrho < \varrho_{\max}$, provided $\varrho_{\max} \le \tfrac12$.
∎
Thm. 2.7 provides a remarkable bound when $\varrho_{\max} > \tfrac12$: it immediately implies Thm. 2.6, because the mean-square coupling error tends to zero at rate no slower than $1/n^2$, which is far faster than the rate $1/n$ for $\mathsf E[\|\tilde\theta^{\rm lin}_n\|^2]$ (implied by Thm. 2.4).
An alert reader may observe that Theorems 2.6 and 2.7 leave out a special case: consider $\varrho_{\max} < \tfrac12$, so that the rate of convergence of $\mathsf E[\|\tilde\theta^{\rm lin}_n\|^2]$ is the sub-optimal value $1/n^{2\varrho_{\max}}$. The bound obtained in Thm. 2.7 remains valuable, in the sense that combined with Thm. 2.4 (ii) it implies the rate of convergence of $\mathsf E[\|\tilde\theta_n\|^2]$ is no slower than $1/n^{2\varrho_{\max}}$. However, because $\varepsilon_n$ and $\tilde\theta^{\rm lin}_n$ tend to zero at the same rate, we cannot rule out the possibility that $\tilde\theta_n$ converges to zero much faster. In particular, it remains to prove that if there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_0 = \operatorname{Re}(\lambda) > -\tfrac12$, and a left eigenvector $v$ satisfying $\Sigma_\Delta v \ne 0$, then $\mathsf E[|v^{\mathsf T}\tilde\theta_n|^2]$ converges to $0$ at rate $1/n^{2\varrho_0}$.
2.4 Implications
Thm. 2.4 indicates that the convergence rate of $\mathsf E[\|\tilde\theta_n\|^2]$ is determined jointly by the matrix $A$, and the martingale difference component $\{\zeta_n\}$ of the noise sequence $\{\Delta_n\}$. Convergence can be slow if the matrix $A$ has eigenvalues with real part close to zero.
The result also explains the slow convergence of some reinforcement learning algorithms. For instance, the matrix $A$ in Watkins' Q-learning has at least one eigenvalue with real part greater than or equal to $-(1-\beta)$, where $\beta$ is the discount factor appearing in the Markov decision process [40, 14, 11]. Since $\beta$ is usually close to one, Thm. 2.4 implies that the convergence rate of the algorithm is much slower than $1/n$. Under the assumption that $A$ is Hurwitz, the $1/n$ convergence rate is guaranteed by the use of a modified step-size sequence $a_n = g/n$, with $g > 0$ chosen so that the matrix $\tfrac12 I + gA$ is Hurwitz. Corollary 2.8 follows directly from Thm. 2.4 (i).
Corollary 2.8.
Let $g > 0$ be a constant such that $\tfrac12 I + gA$ is Hurwitz, and let $\{\theta_n\}$ be recursively obtained as
$\theta_{n+1} = \theta_n + \frac{g}{n+1}\, f(\theta_n, \Phi_{n+1})$
Then, for some $\varepsilon > 0$,
$\mathsf E\bigl[\tilde\theta_n \tilde\theta_n^{\mathsf T}\bigr] = \frac{1}{n}\Sigma_\theta^g + O(n^{-1-\varepsilon})$
where $\Sigma_\theta^g$ solves the Lyapunov equation
$\bigl(\tfrac12 I + gA\bigr)\Sigma_\theta^g + \Sigma_\theta^g\bigl(\tfrac12 I + gA\bigr)^{\mathsf T} + g^2\Sigma_\Delta = 0$
∎
We can also ensure the optimal $1/n$ convergence rate by using a matrix gain. Provided $A$ is invertible, and if it is known beforehand, the matrix step-size sequence $a_n G^*$ with $G^* = -A^{-1}$ is optimal in terms of minimizing the asymptotic covariance [1, 25, 13]. The SQNR algorithm of [33] and the Zap-SNR algorithm [14, 11] provide general approaches to recursively estimate the optimal matrix gain.
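The effect of the gain is easy to explore numerically: for each scalar gain $g$, the limiting covariance of Cor. 2.8 is obtained from its Lyapunov equation, and can be compared with the covariance $A^{-1}\Sigma_\Delta A^{-\mathsf T}$ achieved by the matrix gain $-A^{-1}$. The matrices below are illustrative; note that small $g$ is rejected because $\tfrac12 I + gA$ fails to be Hurwitz.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-0.2, 0.0],
              [0.1, -0.4]])       # eigenvalues -0.2, -0.4: need g > 2.5 here
Sigma_Delta = np.eye(2)

for g in [2.0, 3.0, 5.0, 10.0]:
    M = 0.5 * np.eye(2) + g * A
    if np.all(np.linalg.eigvals(M).real < 0):        # 1/2 I + gA Hurwitz?
        Sg = solve_continuous_lyapunov(M, -g**2 * Sigma_Delta)
        print(g, np.trace(Sg))    # limiting value of n * E||tilde-theta_n||^2

Ainv = np.linalg.inv(A)
print("gain -A^{-1}:", np.trace(Ainv @ Sigma_Delta @ Ainv.T))   # minimal trace
```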
2.5 Proof of Thm. 2.4
The proof proceeds by establishing the convergence rate for each term in the decomposition (34). The main challenges are the first two, corresponding to the martingale difference and telescoping components of the noise; explicit bounds are obtained by studying recursions for the scaled sequences. Bounding the remaining term is trivial.
2.5.1 The martingale difference term
Proposition 2.9.
Under (A1)-(A3),
- (i) If $\operatorname{Re}(\lambda) < -\tfrac12$ for every eigenvalue $\lambda$ of $A$, then for some $\varepsilon > 0$, $\mathsf E\bigl[\tilde\theta^{\mathcal M}_n(\tilde\theta^{\mathcal M}_n)^{\mathsf T}\bigr] = \frac{1}{n}\Sigma_\theta + O(n^{-1-\varepsilon})$.
- (ii) Suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_0 = \operatorname{Re}(\lambda) > -\tfrac12$. Let $v$ denote the corresponding left eigenvector, and suppose moreover that $\Sigma_\Delta v \ne 0$. Then, $\mathsf E[|v^{\mathsf T}\tilde\theta^{\mathcal M}_n|^2]$ converges to $0$ at rate $1/n^{2\varrho_0}$.
∎
2.5.2 The telescoping sequence term
Proposition 2.10.
Under (A1)-(A3),
- (i) If $\operatorname{Re}(\lambda) < -\tfrac12$ for every eigenvalue $\lambda$ of $A$, then $\mathsf E[\|\tilde\theta^{\mathcal T}_n\|^2] = O(n^{-1-\varepsilon})$ for some $\varepsilon > 0$.
- (ii) Suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_0 = \operatorname{Re}(\lambda) > -\tfrac12$. Let $v$ denote the corresponding left eigenvector, and suppose moreover that $\Sigma_\Delta v \ne 0$. Then, $\mathsf E[|v^{\mathsf T}\tilde\theta^{\mathcal T}_n|^2] = o(1/n^{2\varrho_0})$.
∎
2.5.3 Proof of Thm. 2.4
We obtain the convergence rate of $\mathsf E[\tilde\theta_n\tilde\theta_n^{\mathsf T}]$ based on the decomposition $\tilde\theta_n = \tilde\theta^{\mathcal M}_n + \tilde\theta^{\mathcal T}_n$.
For case (i), by Prop. 2.9 (i) and Prop. 2.10 (i), there exists $\varepsilon > 0$ such that
$\mathsf E\bigl[\tilde\theta^{\mathcal M}_n(\tilde\theta^{\mathcal M}_n)^{\mathsf T}\bigr] = \frac{1}{n}\Sigma_\theta + O(n^{-1-\varepsilon})\,, \qquad \mathsf E\bigl[\|\tilde\theta^{\mathcal T}_n\|^2\bigr] = O(n^{-1-\varepsilon})$
The cross terms between $\tilde\theta^{\mathcal M}_n$ and $\tilde\theta^{\mathcal T}_n$ are of smaller order than $1/n$ by the Cauchy-Schwarz inequality. Therefore, for a possibly smaller $\varepsilon > 0$,
$\mathsf E\bigl[\tilde\theta_n\tilde\theta_n^{\mathsf T}\bigr] = \frac{1}{n}\Sigma_\theta + O(n^{-1-\varepsilon})$
For case (ii), the upper bound on the rate for $\mathsf E[|v^{\mathsf T}\tilde\theta_n|^2]$ can be obtained from Prop. 2.9 (ii) and Prop. 2.10 (ii) directly by the triangle inequality. The complementary lower bound is established independently in Lemma A.10. ∎
2.6 Finer Error Bound
2.6.1 Finer Decomposition with Second Poisson Equation
With $\hat\Delta$ as in (28), and using that $\hat\Delta_i^2 \in L_\infty^V$ for each $i$, denote by $\hat{\hat\Delta}$ the zero-mean solution to the second Poisson equation
$\mathsf E\bigl[\hat{\hat\Delta}(\Phi_{n+1}) \mid \Phi_n = x\bigr] = \hat{\hat\Delta}(x) - \hat\Delta(x)$   (39)
We then write, for $n \ge 1$,
(40)
where $\mathcal T^{(2)}_n := \mathsf E[\hat{\hat\Delta}(\Phi_{n+1}) \mid \Phi_n]$, and $\{\zeta^{(2)}_n\}$ is a martingale difference sequence.
2.6.2 Finer Mean Square Error Bound
Theorem 2.11.
Suppose Assumptions (A1)-(A3) hold, and moreover $\operatorname{Re}(\lambda) < -1$ for every eigenvalue $\lambda$ of $A$. Then, for the linear recursion (3),
$\mathsf E\bigl[\tilde\theta_n\tilde\theta_n^{\mathsf T}\bigr] = \frac{1}{n}\Sigma_\theta + \frac{1}{n^2}\Sigma^{(2)} + O(n^{-2-\varepsilon})$
where $\varepsilon > 0$, $\Sigma_\theta$ is the solution to (4), and $\Sigma^{(2)}$ is the unique solution to the Lyapunov equation:
(44)
∎
2.6.3 Proof of Thm. 2.11
Consider the correlations between the different terms in the decomposition (43). The key results that help establish Thm. 2.11 are summarized in the following proposition. The proof is in Appendix A.4.
Proposition 2.12.
Under Assumptions (A1)-(A3), if $\operatorname{Re}(\lambda) < -1$ for every eigenvalue $\lambda$ of $A$, then there is $\varepsilon > 0$ such that
- (i) , where , is the unique solution to the Lyapunov equation (4), and solves the Lyapunov equation (45).
- (ii) , where solves the Lyapunov equation (46).
- (iii) .
∎
Proof of Thm. 2.11.
With the decomposition in (43), we have
, and by Thm. 2.4 (i). By the Cauchy-Schwarz inequality, the correlation terms involving and are , and is also . Prop. 2.12 (ii) shows that . Hence the covariance can be approximated as follows:
By Prop. 2.12, there exist and such that
Putting those results together gives
for some , where solves the Lyapunov equation (44). ∎
3 Conclusions
Performance bounds for recursive algorithms are challenging to obtain outside of the special cases surveyed in the introduction. The general framework developed in this paper provides tight finite-time performance bounds for linear stochastic recursions under mild conditions on the Markovian noise, and we are confident that the techniques will extend to obtain similar bounds for nonlinear stochastic approximation, provided that the linearization (2) is meaningful.
The bound (5) implies that, for some constant $b < \infty$ and all $n \ge 1$,
$\mathsf E\bigl[\|\tilde\theta_n\|^2\bigr] \le \frac{b}{n}$
It may be argued that we have not obtained a finite-$n$ bound, because a bound on the constant $b$ is lacking. Our response is that the precision of the dominant term is most important. We have tested the bound in numerous experiments in which the empirical mean-square error is obtained from multiple independent trials, and the resulting histogram is compared to what is predicted by the Central Limit Theorem with covariance $\Sigma_\theta$. It is found that the Central Limit Theorem is highly predictive of finite-$n$ performance in most cases [14, 11, 13]. While it is hoped that further research will provide bounds on $b$, it seems likely that any bound will involve high-order statistics of the Markov chain; evidence of this is the complex coefficient of the $1/n^2$ term in (6), obtained for the special case $a_n = 1/n$.
Current research concerns these topics, as well as algorithm design for reinforcement learning in various settings.
References
- [1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
- [2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer, 2012.
- [3] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
- [4] J. R. Blum. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, pages 737–744, 1954.
- [5] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
- [6] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. (see also IEEE CDC, 1998).
- [7] S. Chen, A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning with nonlinear function approximation. Submitted for publication and arXiv e-prints, 2019.
- [8] Z. Chen, S. Zhang, T. Doan, S. Maguluri, and J. Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
- [9] K. L. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
- [10] G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. arXiv preprint arXiv:1703.05376, 2017.
- [11] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019.
- [12] A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning – a user’s guide. In Proc. of the Fifth Indian Control Conference, January 9-11 2019.
- [13] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement learning algorithms. In Handbook on Reinforcement Learning and Control. Springer, 2020.
- [14] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv e-prints, July 2017.
- [15] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- [16] K. Duffy and S. Meyn. Large deviation asymptotics for busy periods. Stochastic Systems, 4(1):300–319, 2014.
- [17] K. R. Duffy and S. P. Meyn. Most likely paths to error when estimating the mean of a reflected random walk. Performance Evaluation, 67(12):1290–1303, 2010.
- [18] L. Gerencser. Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting. IEEE Transactions on Automatic Control, 44(5):894–905, May 1999.
- [19] P. W. Glynn and D. Ormoneit. Hoeffding’s inequality for uniformly ergodic Markov chains. Statistics and Probability Letters, 56:143–146, 2002.
- [20] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
- [21] B. Hu and U. A. Syed. Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. arXiv preprint arXiv:1906.06781, 2019.
- [22] B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pages 1944–1974, 2019.
- [23] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, 09 1952.
- [24] H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
- [25] H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.
- [26] C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.
- [27] M. Metivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Transactions on Information Theory, 30(2):140–151, March 1984.
- [28] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. Pre-publication edition available online.
- [29] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.
- [30] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
- [31] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
- [32] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- [33] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
- [34] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
- [35] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. CoRR, abs/1902.00923, 2019.
- [36] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.
- [37] V. B. Tadić. Asymptotic analysis of temporal-difference learning algorithms with constant step-sizes. Machine learning, 63(2):107–133, 2006.
- [38] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
- [39] J. H. Venter. An extension of the Robbins-Monro procedure. The Annals of Mathematical Statistics, 38(1):181–190, 1967.
- [40] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989.
Appendix A Appendices
A.1 Proofs for decomposition and scaling
Proof of Lemma 2.2.
Recall the summation by parts formula: for scalar sequences $\{a_k\}$ and $\{b_k\}$,
$\sum_{k=m}^{n} a_k\,(b_k - b_{k-1}) = a_n b_n - a_m b_{m-1} - \sum_{k=m+1}^{n} (a_k - a_{k-1})\, b_{k-1}$   (47)
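As a sanity check, the identity (47), in the indexing convention written above, can be verified numerically on arbitrary test sequences:

```python
import numpy as np

# Numeric check of the summation-by-parts formula (47) on random sequences.
rng = np.random.default_rng(5)
a = rng.normal(size=12)
b = rng.normal(size=12)
m, n = 3, 10

lhs = sum(a[k] * (b[k] - b[k - 1]) for k in range(m, n + 1))
rhs = a[n] * b[n] - a[m] * b[m - 1] - sum(
    (a[k] - a[k - 1]) * b[k - 1] for k in range(m + 1, n + 1))
assert np.isclose(lhs, rhs)
```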
Proof of Lemma 2.3.
Consider the Taylor series expansion:
where the second equation uses . With , the following bound follows:
where , and for all .
Lemma A.1.
Let be fixed real numbers. Then the following holds for each and :
where .
Proof.
By the inequality ,
The remainder of the proof involves establishing the bound
(48)
For this follows from the bound , and for the bound (48) follows from . ∎
Lemma A.2.
Under Assumptions (A1)-(A3), let $\lambda_{\max}$ denote an eigenvalue of the matrix $A$ with largest real part. Then
Proof.
Recall the decomposition of in (32): , with evolving as
(49a)
(49b)
For fixed , let solve the Lyapunov equation , which exists since is Hurwitz. Define the norm of by .
First consider . Since the martingale difference is uncorrelated with , denoting , we obtain the following from (49a):
(50)
Letting denote the largest eigenvalue of , we arrive at the following simplification of the first term in (50)
(51)
where denotes the induced operator norm of with respect to the norm . We then obtain the following recursive bound from (50) and (51)
where . is finite since converges to geometrically fast.
For , we use similar arguments. We obtain the following from (49b) by the triangle inequality.
Using the same argument as in (51), along with the inequality ,
Denote .
Then by the same argument for the martingale difference term, we can show that at rate at least .
Given converges to zero at rate , the proof is completed by the triangle inequality. ∎
A.2 Proof of Prop. 2.9
(i)
Recall that $\{\zeta_n\}$ is a martingale difference sequence. It is thus an uncorrelated sequence, for which $\zeta_{n+1}$ and $\tilde\theta^{\mathcal M}_m$ are uncorrelated for $m \le n$. The following recursion is obtained from these facts and (30a)
Multiplying each side by gives
The following argument will be used repeatedly throughout this Appendix: the recursion for is a deterministic SA recursion, and is regarded as an Euler approximation to the stable linear system
(52)
Stability follows from the assumption that is Hurwitz. The standard justification of the Euler approximation is through the choice of timescale: let and let denote the solution to this ODE on with , , for any . Using standard ODE arguments [5],
Exponential convergence of to implies convergence of to zero at rate for some . ∎
(ii)
Denote and . We begin with the proof that
(53)
With , we have , with . Applying (35a) gives
Let denote the conjugate of . Consequently, with ,
$V$-uniform ergodicity implies that as at a geometric rate. Fix so that , and hence also . We also assume that for , which is possible since .
For we obtain the uniform bound
which proves that .
The proof of an upper bound for : by concavity of the logarithm,
where . Using concavity of the logarithm once more gives
which gives the uniform upper bound
This proves that . ∎
A.3 Proof for Prop. 2.10
(i)
Denote . We can rewrite (35b) as
(54)
Let solve the Lyapunov equation
As in the proof of Lemma A.2, a solution exists because is Hurwitz. Adopting the familiar notation , the triangle inequality applied to (54) gives
(55)
The first term can be simplified by the Lyapunov equation.
where is the induced operator norm of , and denotes its largest eigenvalue.
Consequently, by the inequality ,
Fix such that for ,
This is possible since .
Denote and , which is finite because converges. We obtain the following from (55)
(56)
Apply (56) repeatedly for
where for . Therefore, at rate at least .
The desired conclusion follows: letting denote the smallest eigenvalue of ,
∎
(ii)
A.4 Proof of Prop. 2.12
(i)
Since is uncorrelated with , the following recursion follows from (30a):
Take in the definition of and . Multiplying each side of the equation by gives
(60)
Recall that solves the Lyapunov equation . Denoting , the following identity holds
Subtracting from both sides of (60) gives the recursion
(61)
Similar to the decomposition in (30), we have , each evolving as
(62a)
(62b)
Since converges to zero geometrically fast, converges to zero faster than .
The recursion for is treated as in the proof of Prop. 2.9 (i). Consider the matrix ODE,
(63)
Let and let denote the solution to this ODE on with , , for any . We then obtain as previously,
Recall that is the solution to the Lyapunov equation (45). Exponential convergence of to implies convergence of at rate for . Therefore, .
Given , we have
∎
(ii)
We focus on since . Recall the update forms of and in (30a) and (42a) respectively, where is uncorrelated with the martingale difference sequence for and is uncorrelated with for . With , the following is obtained from these facts:
Denote . Multiplying both sides of the previous equation by gives
Multiplying each side of this equation by once more results in
where the error term consists of two components: that converges to zero at a geometric rate and .
As previously, this is approximated by the linear system
(64)
With the same argument used in (i), converges to in (46) at rate for . Therefore, and . ∎
(iii)
The third claim in Prop. 2.12 is established through a sequence of lemmas. Start with the representation of based on (40):
Since is uncorrelated with the sequence for , we have
(65)
Hence it suffices to consider the correlation between and . The formula for for is
(66)
converges to zero geometrically fast under $V$-uniform ergodicity of $\Phi$. Then we consider the expectation of the following:
(67)
The definition of is now based on the assumption that is Hurwitz: is the unique solution to the Lyapunov equation:
As previously, we denote for a random vector , and denote by the induced operator norm of a matrix . In the following result the vector is taken to be deterministic.
Lemma A.3.
Suppose the matrix is Hurwitz. Then there exists constant such that the following holds for any and all
Proof.
To analyze , consider the bivariate Markov chain , , with state space . An associated weighting function is defined as .
Denote function as and as the -th entry of for . Note that
Lemma A.4.
Suppose Assumptions (A1) and (A3) hold. For each ,
-
(i)
, moreover there exists constant such that
-
(ii)
Consequently, there exists constant such that
Proof.
By the definition of -norm,
Given this, and the $\sqrt V$-uniform ergodicity of $\Phi$ [29, Lemma 15.2.9], there exists a constant such that
Consequently,
(68)
By the inequality and the $V$-uniform ergodicity of $\Phi$ once more, we have
(69)
(70)
with .
For (ii), denote by the conditional expectation:
This is bounded by a constant times :
$V$-uniform ergodicity of $\Phi$ is equivalent to the following drift condition [29, Theorem 16.0.2]: for some , and some "petite set" ,
Consequently,
Therefore,
(71)
Thus . By $V$-uniform ergodicity of $\Phi$ again,
with . The proof is then completed by applying the smoothing property of conditional expectation. ∎
Lemma A.5.
Under Assumptions (A1) and (A3), there exists such that the following hold
(72a)
(72b)
Proof.
Lemma A.6.
For fixed , there exists such that for all ,
Proof.
Denote and observe that the function is increasing over . The following holds for
Now consider the integral: for any ,
Take .
where . The proof is completed by setting . ∎
Proof of Prop. 2.12 (iii).
Following (65), we have
(73)
This is bounded based on (67): Lemma A.3 and (72b) indicate that there exists some constant such that
(74)
For the second term in (67), it admits a simpler form
where and converges to its steady-state mean. For the remaining part, Lemma A.3 and (72a) together imply that
for some constant . By Lemma A.6, there exists another constant such that
This combined with (74) shows that
Following (73), we obtain the desired result:
∎
A.5 Unbounded moments
This section is devoted to the proof that for (see Thm. 2.4 (ii)). Since it suffices to show the result holds for , we assume throughout. Recall that .
Consider the update of in (33). With , we have . Multiplying each side of (33) by gives
with . Note that is strictly positive for sufficiently large .
For a fixed but arbitrary and each , we have
(75)
with .
The analysis of is mainly based on the random series appearing in (75), which requires the following three preliminary results:
Lemma A.7.
There exists some such that for each ,
Proof.
Note that , so it is sufficient to bound the second factor:
(76)
Consider the real part in (76): since , there exists such that and for . Consequently,
Given , we can increase if necessary, such that for . Then we have
For the imaginary part, observe that
The proof is completed by summing the bounds for the real and imaginary parts. ∎
Lemma A.8.
Suppose Assumptions (A1) and (A3) hold. For each , the random series converges almost surely.
Proof.
Decompose the series into the sum of a martingale difference series and a telescoping series. The martingale difference series converges almost surely given ; the telescoping series is absolutely convergent by Lemma A.7. ∎
Lemma A.9.
Suppose Assumptions (A1) and (A3) hold. Denote . There exists a deterministic constant , such that for all and each sequence ,
(77)
Proof.
First recall that , and hence by the Markov property,
where , and the last equality holds by the assumption and dominated convergence. For each , letting denote truncated at index , we have
(78)
where is the covariance matrix with each entry defined as ; is Hermitian and positive semi-definite. With denoting the largest eigenvalue of , we have
(79)
By the Gershgorin circle theorem [20], the maximal eigenvalue is upper bounded by the maximum row sum of absolute values of entries:
For any , observe that
Since $V$-uniform ergodicity of the Markov chain implies $V$-geometric mixing [29, Theorem 16.1.5] and , there exist and such that for each ,
Consequently,
(80)
Given , by (24),
The Markov chain is also -uniformly ergodic. By (24) for and once more,
Hence
The other two terms on the right hand side of (80) are bounded as follows:
where exists since .
Lemma A.10.
Suppose Assumptions (A1)-(A3) hold and . With updated via (33),
Proof.
With fixed , equation (75) gives a representation for for each . It is obvious that . Hence it suffices to show that is strictly greater than zero.
Apply once more the decomposition based on Poisson’s equation:
where and is a martingale difference. By the variance inequality , we have
(81)
By the law of total variance once more,
Note that converges to zero almost surely. With and Jensen's inequality, we have for all ,
Then by the dominated convergence theorem, . Therefore,
Hence,
(82)
where .
For the telescoping term on the right hand side of (81), we have
(83)
Given by Lemma A.7, Lemma A.9 indicates that there exists some constant independent of such that,
Since and at a geometric rate, we set sufficiently large such that Lemma A.7 holds and moreover for all ,
Then,
Therefore,
The desired conclusion then follows from (75):
∎
A.6 Coupling of Deterministic and Random Linear SA
Let denote the zero-mean solution to the following Poisson equation:
which is a matrix version of (26). Denote (a martingale difference sequence), and . Then, from (36),
The sequence from (38) can be expressed as the sum
where , and the first three sequences are solutions to the following linear systems:
(84a)
(84b)
(84c)
The second recursion arises through the arguments used in the proof of Lemma 2.2.
Recall that is an eigenvalue of the matrix with largest real part. For fixed , let denote the unique solution to the Lyapunov equation
(85)
As previously, the norm of a random vector is defined as: .
Lemma A.11.
Under Assumptions (A1)-(A4), there exist constants and such that, for all ,
- (i) The following holds for each ,
- (ii) The following holds for ,
The inequality below will be useful in proving Lemma A.11.
Lemma A.12.
For any real numbers and all ,
Proof.
With , the result follows directly from the inequality
∎
Proof of Lemma A.11.
First consider updated via (84a). Since the martingale difference sequence is uncorrelated with or , we have
Using the fact that solves the Lyapunov equation (85) gives
where (the induced operator norm). With ,
Consequently,
(86)
where is finite by the $V$-uniform ergodicity of $\Phi$ applied to (recall Thm. 2.1).
For updated by (84b), using Lemma A.12 with gives
We can find and such that for all ,
We then obtain the desired form for the sequence
(87)
The same argument applies to in (84c). Therefore, for some constants and ,
(88)
A bound on the final term is relatively easy.
Hence there exists some constant such that
∎
The results in Lemma A.11 lead to a rough bound on , presented in the following lemma. This intermediate result will be used later to establish the refined bound in Thm. 2.7.
Lemma A.13.
Under Assumptions (A1)-(A4),
Proof.
Denote . By Lemma A.11, we can find such that for and
with and , which are finite by Lemma A.2 combined with Lemma A.11. Iterating this inequality gives, for ,
By Lemma A.1,
The partial sum can be estimated by an integral: with ,
(89)
Given ,
Consequently, by the inequality . Then we have
where as goes to infinity by Lemma A.2. Hence . ∎
Proof of Thm. 2.7.
First consider updated via (84b). By the triangle inequality and the inequality ,
where and , which is finite thanks to Lemma A.13. Hence, by Lemma A.1 once more,
With , we have . Hence . Replacing with in (84c), the same argument applies to and we get . The fact that follows directly from the definition and Lemma A.13. Then we have, for each ,
(90)
Now consider the martingale difference part . The following is directly obtained from (84a):
From Lemma A.2 we have for . Combining this with (90) implies that there exists some constant such that for ,
Consequently,
where . With initial condition , iterating this inequality gives
With , the partial sum is bounded by an integral similar to (89):
Therefore,
- (i) If , then for .
- (ii) If , then .
Given that the same convergence rates hold for the other components in (90), the conclusion then follows. ∎