
Explicit Mean-Square Error Bounds
for Monte-Carlo and Linear Stochastic Approximation

Shuhang Chen Department of Mathematics at the University of Florida, Gainesville.    Adithya M. Devraj Department of ECE at the University of Florida, Gainesville.    Ana Bušić Inria and DI ENS, École Normale Supérieure, CNRS, PSL Research University, Paris, France.
Financial support from ARO grant W911NF1810334 is gratefully acknowledged. Additional support from EPCN 1609131 & CPS 1646229, and French National Research Agency grant ANR-16-CE05-0008.
   Sean Meyn Department of ECE at the University of Florida, Gainesville.
Abstract

This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This motivates the focus on mean-square error bounds for parameter estimates. It is shown that the mean-square error achieves the optimal rate of $O(1/n)$, subject to conditions on the step-size sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm design.

Keywords: Stochastic Approximation, Markov chain Monte Carlo, Reinforcement learning

1 Introduction

Many questions in statistics and the area of reinforcement learning are concerned with computation of the root of a function in the form of an expectation: $\bar{f}(\theta)={\sf E}[f(\theta,\Phi)]$, where $\Phi$ is a vector-valued random variable, and $\theta\in\mathbb{R}^{d}$. The value $\theta^{*}$ satisfying $\bar{f}(\theta^{*})=0$ is most commonly approximated through some version of the stochastic approximation (SA) algorithm of Robbins and Monro [32, 5]. In its basic form, this is the recursive algorithm

$\theta_{n+1}=\theta_{n}+\alpha_{n+1}f(\theta_{n},\Phi_{n+1})$ (1)

in which $\{\alpha_{n}\}$ is a non-negative gain sequence, and $\{\Phi_{n}\}$ is a sequence of random variables whose distribution converges to that of $\Phi$ as $n\to\infty$. The sequence is a Markov chain in the applications of interest in this paper.
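
As a point of reference for what follows, here is a minimal simulation sketch of the recursion (1) with step-size $\alpha_n=1/n$. The choice of $f$ and the i.i.d. noise are hypothetical placeholders, chosen only so that $\theta^{*}$ is known in closed form; they are not from the paper, where the noise is Markovian.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, phi):
    # Hypothetical choice: f(theta, phi) = -(theta - phi), so that
    # fbar(theta) = E[f(theta, Phi)] = -(theta - E[Phi]) and theta* = E[Phi].
    return -(theta - phi)

theta = 0.0
for n in range(1, 100_000 + 1):
    phi = rng.normal(loc=2.0, scale=1.0)   # i.i.d. stand-in for the Markovian noise
    theta += (1.0 / n) * f(theta, phi)     # recursion (1) with alpha_n = 1/n

print(theta)  # approaches theta* = 2.0, with error of order 1/sqrt(n) by the CLT
```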

There is a large body of work on conditions for convergence of this recursion, and also a Central Limit Theorem (CLT): with $\widetilde{\theta}_{n}=\theta_{n}-\theta^{*}$,

$\lim_{n\to\infty}\widetilde{\theta}_{n}=0$   almost surely
$\lim_{n\to\infty}\sqrt{n}\,\widetilde{\theta}_{n}=N(0,\Sigma_{\theta})$   in distribution

The $d\times d$ matrix $\Sigma_{\theta}$ is known as the asymptotic covariance. The CLT requires substantially stronger assumptions on the gain sequence, the function $f$, and the statistics of the “noise” sequence $\{\Phi_{n}\}$ [2, 25].

Soon after the stochastic approximation algorithm was first introduced in [32, 4], Chung [9] identified the optimal CLT covariance and techniques to obtain the optimum for scalar recursions. This can be cast as a form of stochastic Newton-Raphson (SNR) [14, 15, 12, 11]. Gradient-free methods [or stochastic quasi Newton-Raphson (SQNR)] appeared in later work: the first example was proposed by Venter in [39], and was shown to obtain the optimal variance for a one-dimensional SA recursion. The algorithm obtains estimates of the SNR gain $-A^{-1}$ (see (2) below), through a procedure similar to the Kiefer-Wolfowitz algorithm [23]. Ruppert proposed an extension of Venter’s algorithm for vector-valued functions [33].

The averaging technique of Ruppert and Polyak [34, 30, 31] is a two-time-scale algorithm that is also designed to achieve the optimal asymptotic covariance. More recently, a two-time-scale variant of the SNR algorithm known as “Zap-SNR” was proposed in [14, 15, 12, 11], with applications to reinforcement learning. Zap algorithms are stable and convergent under mild assumptions [14, 7].

Under the typical assumptions under which the CLT holds for the recursion (1), the asymptotic covariance has an explicit form in terms of a linearization [24, Chapter 10, Theorem 3.3]. Assume that the solution to $\bar{f}(\theta^{*})=0$ is unique, and denote

$A=\partial\bar{f}\,(\theta^{*})\,,\qquad\Delta_{n}=f(\theta^{*},\Phi_{n})$ (2)

The error dynamics of the SA recursion are then approximated by the linear SA recursion:

$\widetilde{\theta}_{n+1}=\widetilde{\theta}_{n}+\alpha_{n+1}[A\widetilde{\theta}_{n}+\Delta_{n+1}]\,.$ (3)

Subject to the assumption that $\tfrac{1}{2}I+A$ is Hurwitz (i.e., $\text{Real}(\lambda)<-\tfrac{1}{2}$ for each eigenvalue $\lambda$ of $A$), the $d\times d$ matrix $\Sigma_{\theta}$ is the unique positive semi-definite solution to the Lyapunov equation

$[\tfrac{1}{2}I+A]\Sigma+\Sigma[\tfrac{1}{2}I+A]^{\intercal}+\Sigma_{\Delta}=0$ (4)

in which $\Sigma_{\Delta}$ is also an asymptotic covariance: the covariance matrix appearing in the CLT for the sequence $\{\Delta_{n}\}$ (which may be expressed in terms of a solution to a Poisson equation; see [24, Theorem 2.2, Chapter 10]).
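
Given $A$ and $\Sigma_{\Delta}$, the Lyapunov equation (4) is a standard linear-algebra computation. The following sketch uses scipy with arbitrary illustrative matrices (not from the paper) to compute $\Sigma_{\theta}$ and check the eigenvalue condition:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative inputs: A is Hurwitz with both eigenvalues below -1/2.
A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])
Sigma_Delta = np.array([[1.0, 0.2],
                        [0.2, 0.5]])

# Equation (4): [I/2 + A] Sigma + Sigma [I/2 + A]^T + Sigma_Delta = 0.
# solve_continuous_lyapunov(a, q) returns X solving a X + X a^T + q = 0.
M = 0.5 * np.eye(2) + A
assert np.all(np.linalg.eigvals(M).real < 0)   # I/2 + A is Hurwitz
Sigma_theta = solve_continuous_lyapunov(M, Sigma_Delta)
print(Sigma_theta)
```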

The goal of this paper is to demonstrate that the CLT is far less asymptotic than it may appear. For this we focus analysis on the linearization (3), along with first steps towards analysis of the nonlinear recursions. Subject to assumptions on $A$ and the Markov chain, we establish the bound

$\text{\rm Cov}\,(\theta_{n})=n^{-1}\Sigma_{\theta}+O(n^{-1-\delta})$ (5)

with $\text{\rm Cov}\,(\theta_{n})={\sf E}[\widetilde{\theta}_{n}\widetilde{\theta}_{n}^{\intercal}]$. Under further assumptions, the bound is refined to obtain $\delta\geq 1$, and even finer bounds:

$\text{\rm Cov}\,(\theta_{n})=n^{-1}\Sigma_{\theta}+n^{-2}\Sigma_{\theta,2}+O(n^{-2-\delta})$ (6)

where again $\delta>0$, and a formula for $\Sigma_{\theta,2}$ is obtained in the paper, based on a second Lyapunov equation and a solution to a second Poisson equation.

It is hoped that these results will be helpful in construction and performance analysis of many algorithms found in machine learning, statistics and reinforcement learning. Identification of the coefficient for the $n^{-2}$ term in (6) may lead to criteria for gain design when one aims to minimize the covariance with a fixed budget on the number of iterations.

The reader may ask, why not search directly for finite-$n$ bounds of the flavor of Hoeffding’s inequality:

${\sf P}\{\|\widetilde{\theta}_{n}\|\geq\varepsilon\}\leq b_{0}\exp(-nI_{0}(\varepsilon))$ (7)

where $b_{0}>0$ is fixed, and $I_{0}$ is a convex function that is strictly positive and finite in a region $0<\varepsilon^{2}\leq\bar{\varepsilon}^{2}$. The answer is that such bounds are not always possible even for the simplest SA recursions, even when the Markov chain is geometrically ergodic. This is clarified in the first general example:

1.1 Markov Chain Monte Carlo

As a prototypical example of stochastic approximation, Markov chain Monte Carlo (MCMC) proceeds by constructing an ergodic Markov chain $\Phi$ with invariant measure $\pi$ so as to estimate $\pi(F)=\int F(z)\,\pi(dz)$ for some function $F:{\sf Z}\rightarrow\mathbb{R}^{d}$. One then simulates $\Phi_{1},\Phi_{2},\dots,\Phi_{n+1}$ to obtain the estimates

$\theta_{n+1}=\frac{1}{n+1}\sum_{k=1}^{n+1}F(\Phi_{k})$ (8)

This is an instance of the SA recursion (1):

$\theta_{n+1}=\theta_{n}+\frac{1}{n+1}\bigl(-\theta_{n}+F(\Phi_{n+1})\bigr)$ (9)

Subtracting $\theta^{*}=\pi(F)$ from both sides of (9) gives, with $\widetilde{\theta}_{n}=\theta_{n}-\pi(F)$,

$\widetilde{\theta}_{n+1}=\widetilde{\theta}_{n}+\frac{1}{n+1}\bigl(-\widetilde{\theta}_{n}+F(\Phi_{n+1})-\pi(F)\bigr)$

which is (3) in a special case: $A=-I$, $\Delta_{n+1}=F(\Phi_{n+1})-\pi(F)$, and $\alpha_{n}=1/n$.

A significant part of the literature on MCMC focuses on finding Markov chains whose marginals approach the invariant measure $\pi$ quickly. Error estimates for MCMC have only been studied under rather restrictive settings. For instance, under the assumption of uniform ergodicity of $\Phi$ and uniform boundedness of $F$ (which rarely hold in practice outside of a finite state space), a generalized Hoeffding inequality was obtained in [19], yielding the PAC-style error bound (7). We cannot expect Hoeffding’s bound if either of these assumptions is relaxed. Consider the simplest countable state space Markov chain: the M/M/1 queue with uniformization, defined with ${\sf Z}=\{0,1,2,\dots\}$ and

$\Phi_{n+1}=\begin{cases}\Phi_{n}+1&\text{with prob. }\alpha\\ \max(\Phi_{n}-1,0)&\text{with prob. }\mu=1-\alpha\end{cases}$

This is a reversible, geometrically ergodic Markov chain when $\rho=\alpha/\mu<1$, with geometric invariant distribution $\pi(z)=(1-\rho)\rho^{z}$, $z\geq 0$. It is shown in [28] that the error bound (7) fails for most unbounded functions $F$. The question is looked at in greater depth in [17, 16], where asymptotic bounds are obtained for the special case $F(z)\equiv z$. An asymptotic version of (7) is obtained for the lower tail:

$\lim_{n\to\infty}\frac{1}{n}\log\bigl({\sf P}\{\widetilde{\theta}_{n}\leq-\varepsilon\}\bigr)=-I_{0}(\varepsilon)$ (10)

in which the right hand side is strictly negative and finite valued for positive $\varepsilon$ in a neighborhood of zero. An entirely different scaling is required for the upper tail:

$\lim_{n\to\infty}\frac{1}{n}\log\Bigl({\sf P}\Bigl\{\frac{\widetilde{\theta}_{n}}{n}\geq\varepsilon\Bigr\}\Bigr)=-J_{0}(\varepsilon)$ (11)

where again the right hand side is strictly negative and finite valued for $\varepsilon>0$ sufficiently small. It follows from (11) that the PAC-style bound (7) is not attainable.
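
Although exponential tail bounds fail here, the estimator itself is elementary to simulate. The following minimal sketch runs the recursion (9) for the uniformized M/M/1 queue with $F(z)=z$, for which the steady-state mean is $\rho/(1-\rho)$; the numerical values of $\alpha$ and $\mu$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, mu = 0.4, 0.6               # rho = alpha/mu = 2/3 < 1
rho = alpha / mu

phi = 0
theta = 0.0                        # running-average estimate (8) of pi(F), F(z) = z
for n in range(1, 200_000 + 1):
    if rng.random() < alpha:       # uniformized M/M/1 transition
        phi += 1
    else:
        phi = max(phi - 1, 0)
    theta += (phi - theta) / n     # recursion (9)

print(theta, rho / (1.0 - rho))    # estimate vs. steady-state mean rho/(1 - rho)
```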

1.2 Reinforcement Learning

The theory of this paper also applies to TD-learning. In this case, the Markov chain $\Phi$ contains as one component a state process for a system to be controlled.

Consider a Markov chain $X$ evolving on a (Polish) state space ${\sf X}$. Given a cost function $c:{\sf X}\to\mathbb{R}$, and a discount factor $\beta\in(0,1)$, the goal in TD-learning is to approximate the solution $h:{\sf X}\to\mathbb{R}$ to the Bellman equation:

$h(x)=c(x)+\beta{\sf E}[h(X_{n+1})\mid X_{n}=x]$ (12)

This functional equation can be recast:

${\sf E}[\mathcal{D}(h,\Phi_{n+1})\mid\Phi_{0},\dots,\Phi_{n}]=0$ (13a)
where $\mathcal{D}(h,\Phi_{n+1}):=c(X_{n})+\beta h(X_{n+1})-h(X_{n})\,,\quad\Phi_{n+1}:=(X_{n+1},X_{n})\,,\quad n\geq 0$ (13b)

Equation (13a) may be regarded as motivation for the TD-learning algorithms of [36, 38].

Consider a linearly parameterized family of candidate approximations $\{h^{\theta}(x)=\theta^{\intercal}\psi(x):\theta\in\mathbb{R}^{d}\}$, where $\psi:{\sf X}\to\mathbb{R}^{d}$ denotes the $d$ basis functions. The goal in TD-learning is to solve the Galerkin relaxation of (13a,13b):

${\sf E}[\mathcal{D}(h^{\theta^{*}},\Phi_{n+1})\zeta_{n}]=0$ (14)

where $\{\zeta_{n}\}$ is a $d$-dimensional stochastic process, adapted to $\Phi$, and the expectation is with respect to the steady state distribution. In particular, $\zeta_{n}\equiv\psi(X_{n})$ in TD(0) learning, so that the goal in this case is to find $\theta^{*}\in\mathbb{R}^{d}$ such that:

${\sf E}\bigl[\mathcal{D}(h^{\theta^{*}},\Phi_{n+1})\psi(X_{n})\bigr]=0$ (15)
$\mathcal{D}(h^{\theta^{*}},\Phi_{n+1})=c(X_{n})+\beta h^{\theta^{*}}(X_{n+1})-h^{\theta^{*}}(X_{n})$

The TD(0) algorithm is the SA recursion (1) applied to solve (15):

$\theta_{n+1}=\theta_{n}+\alpha_{n+1}d_{n+1}\psi(X_{n})$ (16)
$d_{n+1}=c(X_{n})+\beta h^{\theta_{n}}(X_{n+1})-h^{\theta_{n}}(X_{n})$

Denoting

$A_{n+1}:=\psi(X_{n})\bigl(\beta\psi(X_{n+1})-\psi(X_{n})\bigr)^{\intercal}$
$b_{n+1}:=-c(X_{n})\psi(X_{n})$

the algorithm (16) can be rewritten as:

$\theta_{n+1}=\theta_{n}+\alpha_{n+1}\bigl(A_{n+1}\theta_{n}-b_{n+1}\bigr)$ (17)

Note that $\theta^{*}$ from (14) solves the linear equation ${\sf E}[A_{n+1}]\theta^{*}={\sf E}[b_{n+1}]$. Subtracting $\theta^{*}$ from both sides of (17) gives, with $\widetilde{\theta}_{n}=\theta_{n}-\theta^{*}$,

$\widetilde{\theta}_{n+1}=\widetilde{\theta}_{n}+\alpha_{n+1}[A\widetilde{\theta}_{n}+(A_{n+1}-A)\widetilde{\theta}_{n}+A_{n+1}\theta^{*}-b_{n+1}]$ (18)

Under mild conditions, we show through coupling that the iteration (18) can be closely approximated by the linear SA recursion (3) with matrix $A={\sf E}[A_{n+1}]$ and noise sequence $\Delta_{n+1}=A_{n+1}\theta^{*}-b_{n+1}$. In particular, the two recursions have the same asymptotic covariance if the matrix $\tfrac{1}{2}I+A$ is Hurwitz (see Section 2.3).
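
A minimal TD(0) sketch on a small finite chain follows; the chain, cost, basis, discount factor, and the scalar gain $g$ are all illustrative choices, not from the paper. The gain is included because the eigenvalues of $A$ for this example lie in $(-\tfrac{1}{2},0)$, so the unscaled step-size $\alpha_n=1/n$ would converge slowly (see Section 2.4).

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition matrix of X (illustrative)
c = np.array([1.0, 0.0])            # cost function c(x)
psi = np.eye(2)                     # tabular basis: psi(x) = unit vector e_x
beta, g = 0.8, 10.0                 # discount factor; scalar gain (hypothetical)

theta = np.zeros(2)
x = 0
for n in range(1, 200_000 + 1):
    x_next = rng.choice(2, p=P[x])
    # temporal difference d_{n+1} = c(X_n) + beta h(X_{n+1}) - h(X_n), h = psi^T theta
    d = c[x] + beta * psi[x_next] @ theta - psi[x] @ theta
    theta += (g / (n + 10)) * d * psi[x]   # recursion (16); the offset tames the transient
    x = x_next

# Exact solution of the Bellman equation (12), available here since the basis is tabular:
print(theta, np.linalg.solve(np.eye(2) - beta * P, c))
```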

Under general assumptions on the Markov chain $X$ and the basis functions $\psi$, it is known that the matrix $A=\partial\,{\sf E}[\mathcal{D}(h^{\theta^{*}},\Phi_{n+1})]={\sf E}[A_{n+1}]$ is Hurwitz, and that the sequence of estimates $\{\theta_{n}\}$ converges to $\theta^{*}$ [38]. However, when the discount factor $\beta$ is close to $1$, it can be shown that $\lambda_{\max}>-\tfrac{1}{2}$ (where $\lambda_{\max}$ denotes the largest eigenvalue of $A$), and is in fact close to 0 under mild additional assumptions [14, 11, 13]. It follows that the algorithm has infinite asymptotic covariance: full details and finer results can be deduced from Theorems 2.4 and 2.7.

The SNR algorithm is defined as follows:

$\theta_{n+1}=\theta_{n}-\alpha_{n+1}\widehat{A}_{n+1}^{-1}d_{n+1}\psi(X_{n})$ (19)
$d_{n+1}=c(X_{n})+\beta h^{\theta_{n}}(X_{n+1})-h^{\theta_{n}}(X_{n})$
$\widehat{A}_{n+1}=\widehat{A}_{n}+\alpha_{n+1}\bigl[A_{n+1}-\widehat{A}_{n}\bigr]$ (20)

Under the assumption that the matrix $\widehat{A}_{n}$ is invertible for each $n$, it is shown in [14, 11] that the sequence of estimates obtained using (19,20) is identical to the parameter estimates obtained using the LSTD(0) algorithm: $\theta_{n+1}=\widehat{A}_{n+1}^{-1}\,\widehat{b}_{n+1}$, with

$\widehat{A}_{n+1}=\widehat{A}_{n}+\alpha_{n+1}\bigl[A_{n+1}-\widehat{A}_{n}\bigr]$
$\widehat{b}_{n+1}=\widehat{b}_{n}+\alpha_{n+1}\bigl[b_{n+1}-\widehat{b}_{n}\bigr]$

Consequently, the LSTD(0) algorithm achieves the optimal asymptotic covariance.
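
The LSTD(0) estimates are simple running averages followed by a linear solve. A minimal sketch, continuing the illustrative chain, cost, and basis from the TD(0) sketch above (in practice a regularized inverse is used while $\widehat{A}_{n}$ may be singular; here only the final solve is performed):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
c = np.array([1.0, 0.0])
psi = np.eye(2)
beta = 0.8

A_hat = np.zeros((2, 2))
b_hat = np.zeros(2)
x = 0
for n in range(1, 100_000 + 1):
    x_next = rng.choice(2, p=P[x])
    A_n = np.outer(psi[x], beta * psi[x_next] - psi[x])
    b_n = -c[x] * psi[x]
    A_hat += (1.0 / n) * (A_n - A_hat)   # running average of A_{n+1}, as in (20)
    b_hat += (1.0 / n) * (b_n - b_hat)   # running average of b_{n+1}
    x = x_next

print(np.linalg.solve(A_hat, b_hat))     # theta_{n+1} = A_hat^{-1} b_hat
```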

Q-learning and many other RL algorithms can also be cast as SA recursions. They are no longer linear, but it is anticipated that bounds can be obtained in future research through linearization [18].

1.3 Literature Survey

Finite time performance bounds for linear stochastic approximation were obtained in many prior papers, subject to the assumption that the noise sequence $\{\Delta_{n}\}$ appearing in (3) is a martingale difference sequence [10, 26]. This assumption is rarely satisfied in the applications of interest to the authors.

Much of the literature on finite time bounds for linear SA recursions with Markovian noise has been recent. For constant step-size algorithms with step-size $\alpha$, it follows from analysis in [6] that the pair process $(\theta_{n},\Phi_{n})$ is a geometrically ergodic Markov chain, and the covariance of $\theta_{n}$ is $O(\alpha)$ in steady state. Finite time bounds of order $O(\alpha)$ were obtained in [37, 3, 35, 21]. Unfortunately, these bounds are not tight, and hence their value for algorithm design is limited.

Mean-square error bounds have also been obtained for diminishing step-size algorithms, to establish the optimal rate of convergence ${\sf E}[\|\widetilde{\theta}_{n}\|^{2}]\leq b_{\theta}/n$ [35, 3, 8]. The constant $b_{\theta}$ is a function of the mixing time of the underlying Markov chain. These results require strong assumptions (uniform ergodicity of the Markov chain), and do not obtain the optimal constant $b^{*}_{\theta}=\hbox{trace}\,(\Sigma_{\theta})$. Rather than parameter estimation error, finite time bounds are obtained in [22] for ${\sf E}[\|\bar{f}(\theta_{n})\|^{2}]$, which may be regarded as a far more relevant performance criterion. Bounds are obtained for Markovian models, subject to the existence of a Lyapunov function similar to what is assumed in the present work. It is again not clear if the resulting bounds are tight, or have value in algorithm design.

1.4 Contributions

The main contribution of this paper is a general framework for analyzing the finite time performance of linear stochastic approximation algorithms with Markovian noise and vanishing step-size (required to achieve the optimal rate of convergence of Chung-Ruppert-Polyak). The M/M/1 queue example illustrates plainly that Markovian noise introduces challenges not seen in the “white noise” setting, and that the finite-$n$ error bound (7) cannot be obtained without substantial restrictions. Even under the assumptions of [19] (uniform ergodicity, and bounded noise), the resulting bounds are extremely loose and hence may give little insight for algorithm design. Our approach allows us to obtain explicit bounds, without the uniform boundedness assumption on the noise that is frequently imposed in the literature [3, 35, 21, 8]. Instead, it is assumed that the Markovian noise is $V$-uniformly ergodic, an assumption that is far weaker than geometric or uniform mixing.

Our starting point is the classical martingale approximation of the noise used in CLT analysis of Markov chains [29, Chapter 17], and used in the analysis of SA recursions since Metivier and Priouret [27]. Under mild assumptions on the Markov chain, each $\Delta_{n}$ can be expressed as the sum of a martingale difference and a telescoping term. The solution of the linear recursion (3) is decomposed as a sum of the respective responses:

$\widetilde{\theta}_{n}=\widetilde{\theta}_{n}^{\mathcal{M}}+\widetilde{\theta}_{n}^{\mathcal{T}}$ (21)

The challenge is to obtain explicit bounds on the mean square error for each term.

We say that a deterministic vector-valued sequence $\{e_{n}\}$ converges to zero at rate $1/n^{\varrho_{0}}$ if

$\lim_{n\to\infty}n^{\varrho}\|e_{n}\|=\begin{cases}0,&\text{if }\varrho<\varrho_{0}\\ \infty,&\text{if }\varrho>\varrho_{0}\end{cases}$

Bounds for the mean-square error are obtained in Thm. 2.4, subject to conditions on both the matrix $A$ and the noise sequence. In summary, under general assumptions on $\{\Delta_{n}\}$,

  • (i) The bound (5) holds if $\tfrac{1}{2}I+A$ is Hurwitz.

  • (ii) If $I+A$ is Hurwitz, then the finer bound (6) holds.

  • (iii) If there is an eigenvalue $\lambda$ of $A$ satisfying $\text{Real}(\lambda)>-\tfrac{1}{2}$, with a corresponding left eigenvector $v$ that lies outside of the null space of the asymptotic covariance of the noise sequence, then

    $\lim_{n\to\infty}n^{2\varrho}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}|^{2}]=\begin{cases}0,&\varrho<\varrho_{0}\\ \infty,&\varrho>\varrho_{0}\end{cases}$ (22)

    with $\varrho_{0}=|\text{Real}(\lambda)|$. The convergence of ${\sf E}[\|\widetilde{\theta}_{n}\|^{2}]$ to zero is thus no faster than $n^{-2\varrho_{0}}$.

2 Mean Square Convergence

2.1 Notation and Background

Consider the linear SA recursion (3), with the noise sequence $\{\Delta_{n}\}$ defined in (2). We use the following notation to explicitly represent the noise as a function of $\Phi_{n}$:

$f^{*}(\Phi_{n}):=\Delta_{n}=f(\theta^{*},\Phi_{n})$ (23)

A form of geometric ergodicity is assumed throughout. To apply standard theory, it is assumed that the state space ${\sf Z}$ is Polish (the standing assumption in [29]). We fix a measurable function $V\colon{\sf Z}\to[1,\infty)$, and let $L_{\infty}^{V}$ denote the set of measurable functions $g\colon{\sf Z}\to\mathbb{R}$ satisfying

$\|g\|_{V}:=\sup_{z\in{\sf Z}}\frac{|g(z)|}{V(z)}<\infty$

The Markov chain $\Phi$ is assumed to be $V$-uniformly ergodic: there exist $\rho\in(0,1)$ and $B_{V}<\infty$ such that for each $g\in L_{\infty}^{V}$, $z\in{\sf Z}$,

$\bigl|{\sf E}[g(\Phi_{n})\mid\Phi_{0}=z]-\pi(g)\bigr|\leq B_{V}\|g\|_{V}\,\rho^{n}V(z)\,,\quad n\geq 0$ (24)

where $\pi$ is the unique invariant measure, and $\pi(g)=\int g(z)\,\pi(dz)$ is the steady state mean.

The uniform bound (24) is not a strong assumption. For example, it is satisfied for the M/M/1 queue described in Section 1.1 with $V(z)=\exp(\varepsilon_{0}z)$, $z\in{\sf Z}=\{0,1,\dots\}$, for $\varepsilon_{0}>0$ sufficiently small [29, Thm. 16.4.1].

The following are imposed throughout:

Assumptions:

  • (A1) The Markov process $\Phi$ is $V$-uniformly ergodic, with unique invariant measure denoted $\pi$.

  • (A2) The $d\times d$ matrix $A$ is Hurwitz, and the step-size sequence is $\alpha_{n}\equiv 1/n$, $n\geq 1$.

  • (A3) The function $f^{*}:{\sf Z}\to\mathbb{R}^{d}$ satisfies $\|(f_{i}^{*})^{2}\|_{V}<\infty$ and $\pi(f^{*}_{i})=0$ for each $i$.

For any $g\in L_{\infty}^{V}$, denote $\tilde{g}(z)=g(z)-\pi(g)$, and

${\hat{g}}(z)=\sum_{n=0}^{\infty}{\sf E}[\tilde{g}(\Phi_{n})\mid\Phi_{0}=z]$ (25)

It is evident that ${\hat{g}}\in L_{\infty}^{V}$ under (A1). Further conclusions are summarized below. Thm. 2.1 (i) follows immediately from (A1). Part (ii) follows from (i) and [29, Lemma 15.2.9] (the chain is also $\sqrt{V}$-uniformly ergodic).

Theorem 2.1.

The following conclusions hold for a $V$-uniformly ergodic Markov chain:

  • (i) The function ${\hat{g}}\in L_{\infty}^{V}$ defined in (25) has zero mean, and solves Poisson’s equation:

    ${\sf E}[{\hat{g}}(\Phi_{k+1})\mid\Phi_{k}=z]={\hat{g}}(z)-\tilde{g}(z)$ (26)

  • (ii) If $g^{2}\in L_{\infty}^{V}$, then ${\hat{g}}^{2}\in L_{\infty}^{V}$. ∎
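
On a finite state space the series (25) can be summed in closed form: with transition matrix $P$ and invariant distribution $\pi$, the zero-mean solution is ${\hat{g}}=Z\tilde{g}$ with fundamental matrix $Z=(I-P+\mathbb{1}\pi)^{-1}$. A minimal numerical sketch (the two-state chain is illustrative):

```python
import numpy as np

def solve_poisson(P, g):
    """Zero-mean solution g_hat of Poisson's equation (26) for a finite-state
    chain: (I - P) g_hat = g_tilde, computed via the fundamental matrix."""
    n = P.shape[0]
    # invariant distribution: left Perron eigenvector of P
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    g_tilde = g - pi @ g
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    return Z @ g_tilde, pi          # pi Z = pi, so Z g_tilde already has zero mean

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
g = np.array([1.0, -1.0])
g_hat, pi = solve_poisson(P, g)
print(np.allclose((np.eye(2) - P) @ g_hat, g - pi @ g))   # checks equation (26)
```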

Assumption (A3) implies that the sequence $\{\Delta_{n}\}$ appearing in (3) is zero mean for the stationary version of the Markov chain $\Phi$. Its asymptotic covariance (appearing in the Central Limit Theorem) is denoted

$\Sigma_{\Delta}=\sum_{k=-\infty}^{\infty}{\sf E}_{\pi}[\Delta_{k}\Delta_{0}^{\intercal}]$ (27)

where the expectations are in steady state.

A more useful representation of $\Sigma_{\Delta}$ is obtained through a decomposition of the noise sequence based on Poisson’s equation. This now standard technique was introduced in the SA literature in the 1980s [27].

With $f^{*}$ defined in (23), denote by ${\hat{f}}$ a solution to Poisson’s equation:

${\sf E}[{\hat{f}}\,(\Phi_{k+1})\mid\Phi_{k}=z]={\hat{f}}(z)-f^{*}(z)$ (28)

This is in fact $d$ separate Poisson equations, since $f^{*}\colon{\sf Z}\to\mathbb{R}^{d}$. It is assumed for convenience that the solutions are normalized so that ${\hat{f}}$ has zero steady-state mean. This is justified by the fact that ${\hat{f}}-\pi({\hat{f}})$ also solves (28) under assumption (A3). The fact that ${\hat{f}}_{i}^{2}\in L_{\infty}^{V}$ for $1\leq i\leq d$ follows from Thm. 2.1 (ii).

We then write, for $n\geq 1$,

$\Delta_{n}=f^{*}(\Phi_{n})=\Delta_{n+1}^{m}+Z_{n}-Z_{n+1}$

where $Z_{n}={\hat{f}}(\Phi_{n})$ and $\Delta_{n+1}^{m}=Z_{n+1}-{\sf E}[Z_{n+1}\mid\mathcal{F}_{n}]$ is a martingale difference sequence. Each of the sequences is bounded in $L_{2}$, and the asymptotic covariance (27) is expressed

$\Sigma_{\Delta}={\sf E}_{\pi}[\Delta_{n}^{m}(\Delta_{n}^{m})^{\intercal}]$ (29)

where the expectation is taken in steady-state. The equivalence of (29) and (27) appears in [29, Theorem 17.5.3] for the case in which $\Delta_{n}$ is scalar valued; the generalization to vector valued processes involves only notational changes.
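
Representation (29) yields a direct recipe for $\Sigma_{\Delta}$ on a finite state space: for a Markov chain, $\Delta_{n+1}^{m}={\hat{f}}(\Phi_{n+1})-(P{\hat{f}})(\Phi_{n})$, and the steady-state expectation is a finite sum over transitions. A sketch, reusing `solve_poisson` from the sketch above (the chain and $f^{*}$ are illustrative):

```python
import numpy as np

def asymptotic_covariance(P, f_star):
    """Sigma_Delta of (27), computed via the martingale representation (29).
    f_star: array with one row per state, values in R^d, zero mean under pi."""
    n, d = f_star.shape
    f_hat = np.column_stack([solve_poisson(P, f_star[:, i])[0] for i in range(d)])
    _, pi = solve_poisson(P, f_star[:, 0])
    Pf_hat = P @ f_hat
    Sigma = np.zeros((d, d))
    for z in range(n):                  # steady-state sum over (Phi_n, Phi_{n+1})
        for z_next in range(n):
            dm = f_hat[z_next] - Pf_hat[z]   # martingale difference Delta^m
            Sigma += pi[z] * P[z, z_next] * np.outer(dm, dm)
    return Sigma

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])               # invariant distribution of this P
F = np.array([1.0, -1.0])
f_star = (F - pi @ F).reshape(2, 1)     # centered scalar noise function
print(asymptotic_covariance(P, f_star))
```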

2.2 Decomposition and Scaling of the Parameter Sequence

We now explain the decomposition (21). Each of the two sequences $\{\widetilde{\theta}_{n}^{\mathcal{M}}\}$ and $\{\widetilde{\theta}_{n}^{\mathcal{T}}\}$ evolves as a stochastic approximation sequence, differentiated by the inputs and initial conditions:

$\widetilde{\theta}_{n+1}^{\mathcal{M}}=\widetilde{\theta}_{n}^{\mathcal{M}}+\alpha_{n+1}\bigl[A\widetilde{\theta}_{n}^{\mathcal{M}}+\Delta_{n+2}^{m}\bigr]\,,\qquad\widetilde{\theta}_{0}^{\mathcal{M}}=\widetilde{\theta}_{0}$ (30a)
$\widetilde{\theta}_{n+1}^{\mathcal{T}}=\widetilde{\theta}_{n}^{\mathcal{T}}+\alpha_{n+1}\bigl[A\widetilde{\theta}_{n}^{\mathcal{T}}+Z_{n+1}-Z_{n+2}\bigr]\,,\qquad\widetilde{\theta}_{0}^{\mathcal{T}}=0$ (30b)

The second recursion admits a more tractable realization through the change of variables $\Xi_{n}=\widetilde{\theta}_{n}^{\mathcal{T}}+\alpha_{n}Z_{n+1}$, $n\geq 1$.

Lemma 2.2.

The sequence $\{\Xi_{n}\}$ evolves as the SA recursion

$\Xi_{n+1}=\Xi_{n}+\alpha_{n+1}\bigl[A\Xi_{n}-\alpha_{n}[I+A]Z_{n+1}\bigr]\,,\qquad\Xi_{1}=Z_{1}$ (31)

Lemma 2.2 combined with (30) gives

$\widetilde{\theta}_{n}=\widetilde{\theta}_{n}^{(1)}+\widetilde{\theta}_{n}^{(2)}+\widetilde{\theta}_{n}^{(3)}$ (32)

where $\widetilde{\theta}_{n}^{(1)}=\widetilde{\theta}_{n}^{\mathcal{M}}$, $\widetilde{\theta}_{n}^{(2)}=\Xi_{n}$, and $\widetilde{\theta}_{n}^{(3)}=-\alpha_{n}Z_{n+1}$ for $n\geq 1$. Note that $\widetilde{\theta}_{n}^{\mathcal{T}}=\widetilde{\theta}_{n}^{(2)}+\widetilde{\theta}_{n}^{(3)}$.

It is more convenient to work directly with the recursion for the scaled sequence:

Lemma 2.3.

For any $\varrho\in(0,1/2]$, the scaled sequence $\widetilde{\theta}^{\varrho}_{n}=n^{\varrho}\widetilde{\theta}_{n}$ admits the recursion

$\widetilde{\theta}^{\varrho}_{n+1}=\widetilde{\theta}^{\varrho}_{n}+\alpha_{n+1}\bigl[\varrho_{n}\widetilde{\theta}^{\varrho}_{n}+A(n,\varrho)\widetilde{\theta}^{\varrho}_{n}+(n+1)^{\varrho}\Delta_{n+1}\bigr]$ (33)

where $\varrho_{n}=\varrho+\varepsilon(n,\varrho)$ with $\varepsilon(n,\varrho)=O(n^{-1})$, and $A(n,\varrho)=(1+n^{-1})^{\varrho}A$. ∎

Denote $\widetilde{\theta}_{n}^{\varrho,(i)}=n^{\varrho}\widetilde{\theta}_{n}^{(i)}$ for each $i$. Lemma 2.3 combined with (32) gives

$\widetilde{\theta}^{\varrho}_{n}=\widetilde{\theta}_{n}^{\varrho,(1)}+\widetilde{\theta}_{n}^{\varrho,(2)}+\widetilde{\theta}_{n}^{\varrho,(3)}$ (34)

The first two sequences evolve as SA recursions:

$\widetilde{\theta}_{n+1}^{\varrho,(1)}=\widetilde{\theta}_{n}^{\varrho,(1)}+\alpha_{n+1}\bigl[[\varrho_{n}I+A(n,\varrho)]\widetilde{\theta}_{n}^{\varrho,(1)}+(n+1)^{\varrho}\Delta_{n+2}^{m}\bigr]\,,\qquad\widetilde{\theta}_{0}^{\varrho,(1)}=\widetilde{\theta}^{\varrho}_{0}$ (35a)
$\widetilde{\theta}_{n+1}^{\varrho,(2)}=\widetilde{\theta}_{n}^{\varrho,(2)}+\alpha_{n+1}\bigl[[\varrho_{n}I+A(n,\varrho)]\widetilde{\theta}_{n}^{\varrho,(2)}-(n+1)^{\varrho}\alpha_{n}[I+A]Z_{n+1}\bigr]\,,\qquad\widetilde{\theta}^{\varrho,(2)}_{1}=\Xi_{1}$ (35b)

2.3 Mean Square Error Bounds

Fix the initial condition $(\Phi_{0},\widetilde{\theta}_{0})$, and denote $\text{\rm Cov}\,(\theta_{n})={\sf E}[\widetilde{\theta}_{n}\widetilde{\theta}_{n}^{\intercal}]$ and $\Sigma_{Z}={\sf E}_{\pi}[Z_{n}Z_{n}^{\intercal}]$. The following summarizes bounds on the convergence rate of ${\sf E}[\|\widetilde{\theta}_{n}\|^{2}]=\hbox{trace}\,(\text{\rm Cov}\,(\theta_{n}))$.

Theorem 2.4.

Suppose (A1)-(A3) hold. Then, for the linear recursion (3),

  • (i) If $\text{Real}(\lambda)<-\tfrac{1}{2}$ for every eigenvalue $\lambda$ of $A$, then for some $\delta=\delta(A,\Sigma_{\Delta})>0$,

    $\text{\rm Cov}\,(\theta_{n})=n^{-1}\Sigma_{\theta}+O(n^{-1-\delta})\,,\qquad n\geq 0\,,$

    where $\Sigma_{\theta}\geq 0$ is the solution to the Lyapunov equation (4). Consequently, the rate of convergence of ${\sf E}[\|\widetilde{\theta}_{n}\|^{2}]$ is $1/n$.

  • (ii) Suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_{0}=\text{Real}(\lambda)>-\tfrac{1}{2}$. Let $v\neq 0$ denote a corresponding left eigenvector, and suppose that $\Sigma_{\Delta}v\neq 0$. Then, ${\sf E}[|v^{\intercal}\widetilde{\theta}_{n}|^{2}]$ converges to 0 at rate $n^{-2\varrho_{0}}$. ∎

The proof of Thm. 2.4 is contained in Section 2.5. The following negative result is a direct corollary of Thm. 2.4 (ii):

Corollary 2.5.

Suppose (A1)-(A3) hold. Moreover, suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_{0}=\text{Real}(\lambda)>-\tfrac{1}{2}$, with corresponding left eigenvector $v$ satisfying $\Sigma_{\Delta}v\neq 0$. Then, ${\sf E}[\|\widetilde{\theta}_{n}\|^{2}]$ converges to zero at rate no faster than $1/n^{2\varrho_{0}}$.
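
The predictions of Thm. 2.4 (i) are straightforward to test numerically. The sketch below simulates the scalar linear recursion (3) with $A=-1$ and Markovian noise from the illustrative two-state chain used earlier; in this scalar case the Lyapunov equation (4) gives $\Sigma_{\theta}=\Sigma_{\Delta}$, so $n\,{\sf E}[\widetilde{\theta}_{n}^{2}]$ should approach the value returned by `asymptotic_covariance` above.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])
F = np.array([1.0, -1.0])
f_star = F - pi @ F                # zero-mean noise function, Delta_n = f_star(Phi_n)
A = -1.0                           # Real(lambda) = -1 < -1/2, so Thm. 2.4 (i) applies

n_paths, n_steps = 2000, 5000      # independent runs to estimate E[theta_tilde^2]
theta = np.zeros(n_paths)
phi = np.zeros(n_paths, dtype=int)
for n in range(1, n_steps + 1):
    phi = np.where(rng.random(n_paths) < P[phi, 1], 1, 0)   # next state, each path
    theta += (1.0 / n) * (A * theta + f_star[phi])          # recursion (3)

print(n_steps * np.mean(theta**2))  # compare with Sigma_Delta = Sigma_theta
```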

One challenge in extension to nonlinear recursions is that the noise sequence depends on the parameter estimates (recall (2)). This is true even for TD learning with linear function approximation (see (18) and surrounding discussion). Extension to these recursions is obtained through coupling.

Consider the error sequence for a random linear recursion

$\widetilde{\theta}^{\circ}_{n+1}=\widetilde{\theta}^{\circ}_{n}+\alpha_{n+1}[A_{n+1}\widetilde{\theta}^{\circ}_{n}+A_{n+1}\theta^{*}-b_{n+1}]$ (36)

subject to the following assumptions:

  • (A4) The sequences $\{A_{n},b_{n}\}$ are functions of the Markov chain:

    $A_{n}=\mathcal{A}(\Phi_{n})\,,\quad b_{n}=\mathcal{B}(\Phi_{n})\,,$

    which satisfy $\|\mathcal{A}_{i,j}^{2}\|_{V}<\infty$, $\|\mathcal{B}_{i}^{2}\|_{V}<\infty$ for each $1\leq i,j\leq d$. The steady-state means are denoted $A={\sf E}_{\pi}[A_{n}]$, $b={\sf E}_{\pi}[b_{n}]$. Moreover, the matrix $A$ is Hurwitz, and $\theta^{*}=A^{-1}b$.

Theorem 2.6.

Suppose Assumptions (A1)-(A4) hold. If the matrix $\tfrac{1}{2}I+A$ is Hurwitz, the error bound (5) holds for $\{\widetilde{\theta}^{\circ}_{n}\}$ obtained from (36), with $\Delta_{n+1}=A_{n+1}\theta^{*}-b_{n+1}$.

The proof of the theorem is via coupling with (3). For this we write (36) in the suggestive form

$\widetilde{\theta}^{\circ}_{n+1}=\widetilde{\theta}^{\circ}_{n}+\alpha_{n+1}[A\widetilde{\theta}^{\circ}_{n}+\Delta_{n+1}+(A_{n+1}-A)\widetilde{\theta}^{\circ}_{n}]\,,\qquad\Delta_{n+1}=A_{n+1}\theta^{*}-b_{n+1}$ (37)

With common initial condition $\Phi_{0}$, the sequence $\{\widetilde{\theta}^{\circ}_{n}\}$ is compared with the error sequence $\{\widetilde{\theta}^{\bullet}_{n}\}$ for the corresponding linear SA algorithm:

$\widetilde{\theta}^{\bullet}_{n+1}=\widetilde{\theta}^{\bullet}_{n}+\alpha_{n+1}[A\widetilde{\theta}^{\bullet}_{n}+\Delta_{n+1}]$

The difference sequence $\{\mathcal{E}_{n}:=\widetilde{\theta}^{\circ}_{n}-\widetilde{\theta}^{\bullet}_{n}\}$ evolves according to (3), but with a vanishing noise sequence:

$\mathcal{E}_{n+1}=\mathcal{E}_{n}+\alpha_{n+1}[A\mathcal{E}_{n}+(A_{n+1}-A)\widetilde{\theta}^{\circ}_{n}]$ (38)

By decomposing $A_{n+1}-A$ into martingale difference and telescoping sequences based on Poisson’s equation, the technique used to prove Thm. 2.4 can be used to obtain the following bound on the mean-square coupling error.

Let $\lambda=-\varrho_{0}+ui$ denote an eigenvalue of the matrix $A$ with largest real part (i.e., $\varrho_{0}$ is minimal).

Theorem 2.7.

Under Assumptions (A1)-(A4),

  • (i) $\limsup_{n\rightarrow\infty}n^{2}{\sf E}[\mathcal{E}_{n}^{\intercal}\mathcal{E}_{n}]<\infty$ if $\varrho_{0}>1$.

  • (ii) $\limsup_{n\rightarrow\infty}n^{2\varrho}{\sf E}[\mathcal{E}_{n}^{\intercal}\mathcal{E}_{n}]<\infty$ for all $\varrho<\varrho_{0}$, provided $\varrho_{0}\leq 1$.

Thm. 2.7 provides a remarkable bound when $\varrho_{0}>1$: it immediately implies Thm. 2.6, because the mean-square coupling error ${\sf E}[\|\mathcal{E}_{n}\|^{2}]$ tends to zero at rate no slower than $n^{-2}$, which is far faster than ${\sf E}[\|\widetilde{\theta}^{\bullet}_{n}\|^{2}]\approx\hbox{trace}\,(\Sigma_{\theta})\,n^{-1}$ (implied by Thm. 2.4).

An alert reader may observe that Theorems 2.6 and 2.7 leave out a special case: consider $\varrho_{0}<\tfrac{1}{2}$, so that the rate of convergence of ${\sf E}[\|\widetilde{\theta}^{\bullet}_{n}\|^{2}]$ is the sub-optimal value $n^{-2\varrho_{0}}$. The bound obtained in Thm. 2.7 remains valuable, in the sense that combined with Thm. 2.4 (ii) it implies that the rate of convergence of ${\sf E}[\|\widetilde{\theta}^{\circ}_{n}\|^{2}]$ is no slower than $n^{-2\varrho_{0}}$. However, because ${\sf E}[\|\mathcal{E}_{n}\|^{2}]$ and ${\sf E}[\|\widetilde{\theta}^{\bullet}_{n}\|^{2}]$ tend to zero at the same rate, we cannot rule out the possibility that $\widetilde{\theta}^{\circ}_{n}=\mathcal{E}_{n}+\widetilde{\theta}^{\bullet}_{n}$ converges to zero much faster. In particular, it remains to prove that if there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_{0}=\text{Real}(\lambda)>-\tfrac{1}{2}$, and an eigenvector $v\neq 0$ satisfying $\Sigma_{\Delta}v\neq 0$, then ${\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{\circ}|^{2}]$ converges to 0 at rate $n^{-2\varrho_{0}}$.

2.4 Implications

Thm. 2.4 indicates that the convergence rate of $\text{\rm Cov}\,(\theta_{n})$ is determined jointly by the matrix $A$ and the martingale difference component of the noise sequence $\{\Delta_{n}\}$. Convergence of $\{\widetilde{\theta}_{n}\}$ can be slow if the matrix $A$ has eigenvalues close to zero.

The result also explains the slow convergence of some reinforcement learning algorithms. For instance, the matrix $A$ in Watkins’ Q-learning has at least one eigenvalue with real part greater than or equal to $-(1-\beta)$, where $\beta$ is the discount factor appearing in the Markov decision process [40, 14, 11]. Since $\beta$ is usually close to one, Thm. 2.4 implies that the convergence rate of the algorithm is much slower than $n^{-1}$. Under the assumption that $A$ is Hurwitz, the $1/n$ convergence rate is guaranteed by the use of a modified step-size sequence $\alpha_{n}=g/n$, with $g>0$ chosen so that the matrix $\tfrac{1}{2}I+gA$ is Hurwitz. Corollary 2.8 follows directly from Thm. 2.4 (i).

Corollary 2.8.

Let $g$ be a constant such that $\tfrac{1}{2}I+gA$ is Hurwitz, and let $\{\widetilde{\theta}_{n}\}$ be obtained recursively as

$\widetilde{\theta}_{n+1}=\widetilde{\theta}_{n}+\frac{g}{n+1}[A\widetilde{\theta}_{n}+\Delta_{n+1}]$

Then, for some $\delta=\delta(A,g,\Sigma_{\Delta})>0$,

$\text{\rm Cov}\,(\theta_{n})={\sf E}[\widetilde{\theta}_{n}\widetilde{\theta}_{n}^{\intercal}]=n^{-1}\Sigma_{\theta}^{g}+O(n^{-1-\delta})$

where $\Sigma_{\theta}^{g}\geq 0$ solves the Lyapunov equation

$[\tfrac{1}{2}I+gA]\Sigma+\Sigma[\tfrac{1}{2}I+gA]^{\intercal}+g^{2}\Sigma_{\Delta}=0$
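
Corollary 2.8 suggests a simple numerical recipe for scalar-gain design: scan over $g$, retain those for which $\tfrac{1}{2}I+gA$ is Hurwitz, and select the $g$ minimizing $\hbox{trace}\,(\Sigma_{\theta}^{g})$. A sketch with illustrative inputs (not from the paper):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative A and Sigma_Delta; A is Hurwitz, but one eigenvalue lies in
# (-1/2, 0), so the unscaled step-size alpha_n = 1/n would be sub-optimal.
A = np.array([[-0.2, 0.0],
              [0.1, -1.0]])
Sigma_Delta = np.eye(2)

best = None
for g in np.linspace(0.1, 20.0, 400):
    M = 0.5 * np.eye(2) + g * A
    if np.all(np.linalg.eigvals(M).real < 0):          # 1/2 I + gA Hurwitz
        Sigma_g = solve_continuous_lyapunov(M, g**2 * Sigma_Delta)
        cost = np.trace(Sigma_g)
        if best is None or cost < best[1]:
            best = (g, cost)

print(best)   # gain minimizing trace(Sigma_theta^g) over the scan
```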

We can also ensure the $1/n$ convergence rate by using a matrix gain. Provided $A$ is invertible and known beforehand, $\alpha_{n}=-A^{-1}/n$ is the optimal step-size sequence (in terms of minimizing the asymptotic covariance) [1, 25, 13]. The SQNR algorithm of [33] and the Zap-SNR algorithm [14, 11] provide general approaches to recursively estimate the optimal matrix gain.

The next subsection is dedicated to the proof of Thm. 2.4. The proofs of the technical results are contained in Appendix A.

2.5 Proof of Thm. 2.4

Denote $\text{\rm Cov}\,(\theta_{n}^{(i)})={\sf E}[\widetilde{\theta}_{n}^{(i)}(\widetilde{\theta}_{n}^{(i)})^{\intercal}]$ and $\Sigma_{n}^{\varrho,(i)}={\sf E}[\widetilde{\theta}_{n}^{\varrho,(i)}(\widetilde{\theta}_{n}^{\varrho,(i)})^{\intercal}]=n^{2\varrho}\text{\rm Cov}\,(\theta_{n}^{(i)})$ for each $i$ in (34). The proof proceeds by establishing the convergence rate for each $\text{\rm Cov}\,(\theta_{n}^{(i)})$. The main challenges are the first two, $\text{\rm Cov}\,(\theta_{n}^{(1)})$ and $\text{\rm Cov}\,(\theta_{n}^{(2)})$, for which explicit bounds are obtained by studying recursions of the scaled sequences. Bounding $\widetilde{\theta}_{n}^{(3)}=-\alpha_{n}Z_{n+1}$ is trivial.

2.5.1 The martingale difference term

Proposition 2.9.

Under (A1)-(A3),

  • (i) If $\text{Real}(\lambda)<-\tfrac{1}{2}$ for every eigenvalue $\lambda$ of $A$, then

    $\text{\rm Cov}\,(\theta_{n}^{(1)})=n^{-1}\Sigma_{\theta}+O(n^{-1-\delta})$

    where $\delta=\delta(\tfrac{1}{2}I+A,\Sigma_{\Delta})>0$, and $\Sigma_{\theta}$ is the solution to the Lyapunov equation (4).

  • (ii) Suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_{0}=\text{Real}(\lambda)>-\tfrac{1}{2}$. Let $v\neq 0$ denote the corresponding left eigenvector, and suppose moreover that $\Sigma_{\Delta}v\neq 0$. Then, ${\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{(1)}|^{2}]$ converges to 0 at rate $n^{-2\varrho_{0}}$.

2.5.2 The telescoping sequence term

Proposition 2.10.

Under (A1)-(A3),

  • (i) If $\text{Real}(\lambda)<-\tfrac{1}{2}$ for every eigenvalue $\lambda$ of $A$, then $\text{\rm Cov}\,(\theta_{n}^{(2)})=O(n^{-1-\delta})$ for some $\delta=\delta(\tfrac{1}{2}I+A,\Sigma_{\Delta})>0$.

  • (ii) Suppose there is an eigenvalue $\lambda$ of $A$ that satisfies $-\varrho_{0}=\text{Real}(\lambda)>-\tfrac{1}{2}$. Let $v\neq 0$ denote the corresponding left eigenvector, and suppose moreover that $\Sigma_{\Delta}v\neq 0$. Then,

    $\limsup_{n\rightarrow\infty}n^{2\varrho_{0}}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{(2)}|^{2}]<\infty$

2.5.3 Proof of Thm. 2.4

We obtain the convergence rate of $\text{\rm Cov}\,(\theta_{n})$ based on

$\text{\rm Cov}\,(\theta_{n})=\sum_{i=1}^{3}\text{\rm Cov}\,(\theta_{n}^{(i)})+\sum_{i=1}^{3}\sum_{j=1,\,j\neq i}^{3}{\sf E}[\widetilde{\theta}_{n}^{(i)}(\widetilde{\theta}_{n}^{(j)})^{\intercal}]$

For case (i), by Prop. 2.9 (i) and Prop. 2.10 (i), there exists $\delta=\delta(\tfrac{1}{2}I+A,\Sigma_{\Delta})>0$ such that

$\text{\rm Cov}\,(\theta_{n}^{(1)})=n^{-1}\Sigma_{\theta}+O(n^{-1-\delta})$
$\text{\rm Cov}\,(\theta_{n}^{(2)})=O(n^{-1-\delta})$
$\text{\rm Cov}\,(\theta_{n}^{(3)})=n^{-2}\Sigma_{Z_{n+1}}$

The cross terms between $\widetilde{\theta}_{n}^{(i)}$ and $\widetilde{\theta}_{n}^{(j)}$ for $i\neq j$ are of smaller order than $O(1/n)$ by the Cauchy-Schwarz inequality. Therefore, for a possibly smaller $\delta>0$,

$\text{\rm Cov}\,(\theta_{n})=n^{-1}\Sigma_{\theta}+O(n^{-1-\delta})$

For case (ii), $\lim_{n\to\infty}n^{2\varrho}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}|^{2}]=0$ for each $\varrho<\varrho_{0}$ can be obtained from Prop. 2.9 (ii) and Prop. 2.10 (ii) directly by the triangle inequality. For $\varrho>\varrho_{0}$, the result $\lim_{n\to\infty}n^{2\varrho}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}|^{2}]=\infty$ is established independently in Lemma A.10. ∎

2.6 Finer Error Bound

2.6.1 Finer Decomposition with Second Poisson Equation

With ${\hat{f}}$ as in (28), and ${\hat{f}}_{i}^{2}\in L_{\infty}^{V}$ for each $1\leq i\leq d$, denote by $\hat{\hat{f}}$ the zero-mean solution to the second Poisson equation

${\sf E}[\hat{\hat{f}}\,(\Phi_{k+1})\mid\Phi_{k}=z]=\hat{\hat{f}}(z)-{\hat{f}}(z)$ (39)

We then write, for $n\geq 1$,

$Z_{n}=\widehat{\Delta}_{n+1}^{m}+\widehat{Z}_{n}-\widehat{Z}_{n+1}$ (40)

where $\widehat{Z}_{n}=\hat{\hat{f}}(\Phi_{n})$, and $\widehat{\Delta}_{n+1}^{m}=\widehat{Z}_{n+1}-{\sf E}[\widehat{Z}_{n+1}\mid\mathcal{F}_{n}]$ is a martingale difference sequence.
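
On a finite state space the second Poisson equation is solved by applying the fundamental-matrix operator twice; a short sketch, reusing `solve_poisson`, `P`, and the scalar `f_star` from the earlier sketches:

```python
import numpy as np

f_hat, pi = solve_poisson(P, f_star)    # first Poisson equation (28)
f_hat_hat, _ = solve_poisson(P, f_hat)  # second Poisson equation (39)
# check: (I - P) f_hat_hat = f_hat (f_hat already has zero mean under pi)
print(np.allclose((np.eye(len(f_hat)) - P) @ f_hat_hat, f_hat))
```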

The type of decomposition in Section 2.2 can be applied to $\widetilde{\theta}_{n}^{(2)}$ in (31) for $n\geq 2$:

$\widetilde{\theta}_{n}^{(2)}=\widetilde{\theta}_{n}^{(2,1)}+\widetilde{\theta}_{n}^{(2,2)}+\widetilde{\theta}_{n}^{(2,3)}$ (41)

The first two sequences evolve as SA recursions:

$\widetilde{\theta}_{n+1}^{(2,1)}=\widetilde{\theta}_{n}^{(2,1)}+\alpha_{n+1}\bigl[A\widetilde{\theta}_{n}^{(2,1)}-\alpha_{n}[I+A]\widehat{\Delta}_{n+2}^{m}\bigr]\,,\qquad\widetilde{\theta}_{1}^{(2,1)}=Z_{1}$ (42a)
$\widetilde{\theta}_{n+1}^{(2,2)}=\widetilde{\theta}_{n}^{(2,2)}+\alpha_{n+1}\bigl[A\widetilde{\theta}_{n}^{(2,2)}+\alpha_{n-1}\alpha_{n}[2I+A][I+A]\widehat{Z}_{n+1}\bigr]\,,\qquad\widetilde{\theta}_{2}^{(2,2)}=-\tfrac{1}{2}[I+A]\widehat{Z}_{2}$ (42b)

and $\widetilde{\theta}_{n}^{(2,3)}=\alpha_{n-1}\alpha_{n}[I+A]\widehat{Z}_{n+1}$. Therefore, $\widetilde{\theta}_{n}$ for $n\geq 2$ can be decomposed as:

$\widetilde{\theta}_{n}=\widetilde{\theta}_{n}^{(1)}+\widetilde{\theta}_{n}^{(2,1)}+\widetilde{\theta}_{n}^{(2,2)}+\widetilde{\theta}_{n}^{(2,3)}+\widetilde{\theta}_{n}^{(3)}$ (43)

2.6.2 Finer Mean Square Error Bound

The error bound (6) is obtained from (43):

Theorem 2.11.

Suppose Assumptions (A1)-(A3) hold, and moreover $\text{Real}(\lambda)<-1$ for every eigenvalue $\lambda$ of $A$. Then, for the linear recursion (3),

$\text{\rm Cov}\,(\theta_{n})=n^{-1}\Sigma_{\theta}+n^{-2}\Sigma_{\theta,2}+O(n^{-2-\delta})$

where $\delta=\delta(I+A,\Sigma_{\Delta})>0$, $\Sigma_{\theta,2}=\Sigma_{\sharp}+\Sigma_{Z}-{\sf E}_{\pi}[\Delta_{n}^{m}\widehat{Z}_{n}^{\intercal}]-{\sf E}_{\pi}[\widehat{Z}_{n}(\Delta_{n}^{m})^{\intercal}]$, and $\Sigma_{\sharp}$ is the unique solution to the Lyapunov equation:

$[I+A][\Sigma-\text{\rm Cov}_{\pi}(\widehat{\Delta}_{n}^{m},\Delta_{n}^{m})]+[\Sigma-\text{\rm Cov}_{\pi}(\Delta_{n}^{m},\widehat{\Delta}_{n}^{m})][I+A]^{\intercal}+A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}=0$ (44)

2.6.3 Proof of Thm. 2.11

Denote the correlation between $\widetilde{\theta}_{n}^{(a)}$ and $\widetilde{\theta}_{n}^{(b)}$ as $R_{n}^{(a),(b)}={\sf E}[\widetilde{\theta}_{n}^{(a)}(\widetilde{\theta}_{n}^{(b)})^{\intercal}]$, where $\widetilde{\theta}_{n}^{(a)},\widetilde{\theta}_{n}^{(b)}$ are different terms in (43). The key results that help establish Thm. 2.11 are summarized in the following proposition. The proof is in Appendix A.4.

Proposition 2.12.

Under Assumptions (A1)-(A3), if $\text{Real}(\lambda)<-1$ for every eigenvalue $\lambda$ of $A$, then there is $\delta>0$ such that

  • (i) $\text{\rm Cov}\,(\theta_{n}^{(1)})=n^{-1}\Sigma_{\theta}+n^{-2}\Sigma_{\sharp}^{(1)}+O(n^{-2-\delta})$, where $\delta=\delta(I+A,\Sigma_{\Delta})>0$, $\Sigma_{\theta}\geq 0$ is the unique solution to the Lyapunov equation (4), and $\Sigma_{\sharp}^{(1)}\geq 0$ solves the Lyapunov equation

    $[I+A]\Sigma+\Sigma[I+A]^{\intercal}+A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}=0$ (45)

  • (ii) $R_{n}^{(2,1),(1)}+R_{n}^{(1),(2,1)}=n^{-2}\Sigma_{\sharp}^{(2)}+O(n^{-2-\delta})$, where $\Sigma_{\sharp}^{(2)}$ solves the Lyapunov equation:

    $[I+A]\Sigma+\Sigma[I+A]^{\intercal}-[I+A]\text{\rm Cov}_{\pi}(\widehat{\Delta}_{n}^{m},\Delta_{n}^{m})-\text{\rm Cov}_{\pi}(\Delta_{n}^{m},\widehat{\Delta}_{n}^{m})[I+A]^{\intercal}=0$ (46)

  • (iii) $R_{n}^{(1),(3)}=-n^{-2}{\sf E}_{\pi}[\Delta_{n}^{m}\widehat{Z}_{n}^{\intercal}]+O(n^{-3})$.

Proof of Thm. 2.11.

With the decomposition in (43), we have

$\text{\rm Cov}\,(\theta_{n})=\text{\rm Cov}\,(\theta_{n}^{(1)})+\sum_{j=1}^{3}\text{\rm Cov}\,(\theta_{n}^{(2,j)})+\text{\rm Cov}\,(\theta_{n}^{(3)})+R_{n}^{(1),(3)}+R_{n}^{(3),(1)}+\sum_{i\in\{1,3\}}\sum_{j=1}^{3}[R_{n}^{(2,j),(i)}+R_{n}^{(i),(2,j)}]+\sum_{j=1}^{3}\sum_{k=1,\,k\neq j}^{3}[R_{n}^{(2,j),(2,k)}+R_{n}^{(2,k),(2,j)}]$

We have $\text{\rm Cov}\,(\theta_{n}^{(2,1)})=O(n^{-3})$, $\text{\rm Cov}\,(\theta_{n}^{(2,2)})=O(n^{-5})$, and $\text{\rm Cov}\,(\theta_{n}^{(2,3)})=O(n^{-4})$ by Thm. 2.4 (i). By the Cauchy-Schwarz inequality, the correlation terms involving $\widetilde{\theta}_{n}^{(2,2)}$ and $\widetilde{\theta}_{n}^{(2,3)}$ are $O(n^{-2.5})$, as is $R_{n}^{(2,1),(3)}$. Prop. 2.12 (ii) shows that $R_{n}^{(2,1),(3)}=O(n^{-3})$. Hence the covariance can be approximated as follows:

$\text{\rm Cov}\,(\theta_{n})=\text{\rm Cov}\,(\theta_{n}^{(1)})+\text{\rm Cov}\,(\theta_{n}^{(3)})+R_{n}^{(1),(3)}+R_{n}^{(3),(1)}+R_{n}^{(2,1),(1)}+R_{n}^{(1),(2,1)}+O(n^{-2.5})$

By Prop. 2.12, there exist $\delta(I+A,\Sigma_{\Delta})>0$ and $\delta(I+A)>0$ such that

$\text{\rm Cov}\,(\theta_{n}^{(1)})=n^{-1}\Sigma_{\theta}+n^{-2}\Sigma_{\sharp}^{(1)}+O(n^{-2-\delta})$
$\text{\rm Cov}\,(\theta_{n}^{(3)})=n^{-2}\Sigma_{Z}+O(\rho^{n})$
$R_{n}^{(1),(3)}=-n^{-2}{\sf E}_{\pi}[\Delta_{n}^{m}\widehat{Z}_{n}^{\intercal}]+O(n^{-3})$
$R_{n}^{(2,1),(1)}+R_{n}^{(1),(2,1)}=n^{-2}\Sigma_{\sharp}^{(2)}+O(n^{-2-\delta})$

Putting those results together gives

$\text{\rm Cov}\,(\theta_{n})=n^{-1}\Sigma_{\theta}+n^{-2}\bigl(\Sigma_{\sharp}^{(1)}+\Sigma_{\sharp}^{(2)}+\Sigma_{Z}-{\sf E}_{\pi}[\Delta_{n}^{m}\widehat{Z}_{n}^{\intercal}]-{\sf E}_{\pi}[\widehat{Z}_{n}(\Delta_{n}^{m})^{\intercal}]\bigr)+O(n^{-2-\delta})$

for some $\delta>0$, where $\Sigma_{\sharp}:=\Sigma_{\sharp}^{(1)}+\Sigma_{\sharp}^{(2)}$ solves the Lyapunov equation (44). ∎

3 Conclusions

Performance bounds for recursive algorithms are challenging outside of the special cases surveyed in the introduction. The general framework developed in this paper provides tight finite time performance bounds for linear stochastic recursions under mild conditions on the Markovian noise, and we are confident that the techniques will extend to obtain similar bounds for nonlinear stochastic approximation provided that the linearization (2) is meaningful.

The bound (5) implies that, for some constant $b_{\theta}$ and all $n$,

${\sf E}[\|\widetilde{\theta}_{n}\|^{2}]\leq n^{-1}\hbox{trace}\,(\Sigma_{\theta})+n^{-1-\delta}b_{\theta}\,.$

It may be argued that we have not obtained a finite-$n$ bound, because a bound on the constant $b_{\theta}$ is lacking. Our response is that the precision of the dominant term is most important. We have tested the bound in numerous experiments in which the empirical mean-square error is obtained from multiple independent trials, and the resulting histogram is compared to what is predicted by the Central Limit Theorem with covariance $\Sigma_{\theta}$. It is found that the Central Limit Theorem is highly predictive of finite-$n$ performance in most cases [14, 11, 13]. While it is hoped that further research will provide bounds on $b_{\theta}$, it seems likely that any bound will involve high-order statistics of the Markov chain; evidence of this is the complex coefficient of $n^{-2}$ in (6) for the special case $\delta=1$.

Current research concerns these topics, as well as algorithm design for reinforcement learning in various settings.

References

  • [1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
  • [2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer, 2012.
  • [3] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
  • [4] J. R. Blum. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, pages 737–744, 1954.
  • [5] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
  • [6] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. (see also IEEE CDC, 1998).
  • [7] S. Chen, A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning with nonlinear function approximation. Submitted for publication and arXiv e-prints, 2019.
  • [8] Z. Chen, S. Zhang, T. Doan, S. Maguluri, and J. Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
  • [9] K. L. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
  • [10] G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. arXiv preprint arXiv:1703.05376, 2017.
  • [11] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019.
  • [12] A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning – a user’s guide. In Proc. of the Fifth Indian Control Conference, January 9-11 2019.
  • [13] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement learning algorithms. In Handbook on Reinforcement Learning and Control. Springer, 2020.
  • [14] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv e-prints, July 2017.
  • [15] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [16] K. Duffy and S. Meyn. Large deviation asymptotics for busy periods. Stochastic Systems, 4(1):300–319, 2014.
  • [17] K. R. Duffy and S. P. Meyn. Most likely paths to error when estimating the mean of a reflected random walk. Performance Evaluation, 67(12):1290–1303, 2010.
  • [18] L. Gerencser. Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting. IEEE Transactions on Automatic Control, 44(5):894–905, May 1999.
  • [19] P. W. Glynn and D. Ormoneit. Hoeffding’s inequality for uniformly ergodic Markov chains. Statistics and Probability Letters, 56:143–146, 2002.
  • [20] G. Golub and C. Van Loan. Matrix Computations. Third edition, 1996.
  • [21] B. Hu and U. A. Syed. Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. arXiv preprint arXiv:1906.06781, 2019.
  • [22] B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pages 1944–1974, 2019.
  • [23] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, 09 1952.
  • [24] H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
  • [25] H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.
  • [26] C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.
  • [27] M. Metivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Transactions on Information Theory, 30(2):140–151, March 1984.
  • [28] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. Pre-publication edition available online.
  • [29] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.
  • [30] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i Telemekhanika, 1990 (in Russian); translated in Automat. Remote Control, 51:98–107, 1991.
  • [31] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
  • [32] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
  • [33] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
  • [34] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
  • [35] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. CoRR, abs/1902.00923, 2019.
  • [36] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.
  • [37] V. B. Tadić. Asymptotic analysis of temporal-difference learning algorithms with constant step-sizes. Machine learning, 63(2):107–133, 2006.
  • [38] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
  • [39] J. H. Venter. An extension of the Robbins-Monro procedure. The Annals of Mathematical Statistics, 38(1):181–190, 1967.
  • [40] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989.

Appendix A Appendices

A.1 Proofs for decomposition and scaling

Proof of Lemma 2.2.

Recall the summation by parts formula: for scalar sequences {ak,bk}\{a_{k},b_{k}\},

\sum_{k=0}^{N}a_{k+1}[b_{k+1}-b_{k}]=a_{N+1}b_{N+1}-a_{1}b_{0}-\sum_{k=1}^{N}[a_{k+1}-a_{k}]b_{k} (47)
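This identity is elementary but easy to misstate; a minimal numerical check (random sequences, with the names and seed purely illustrative) is:

```python
# Numerical check of the summation-by-parts identity (47).
import numpy as np

rng = np.random.default_rng(0)
N = 20
a = rng.standard_normal(N + 2)   # a[0], ..., a[N+1]
b = rng.standard_normal(N + 2)   # b[0], ..., b[N+1]

lhs = sum(a[k + 1] * (b[k + 1] - b[k]) for k in range(N + 1))
rhs = a[N + 1] * b[N + 1] - a[1] * b[0] \
      - sum((a[k + 1] - a[k]) * b[k] for k in range(1, N + 1))
assert np.isclose(lhs, rhs)
```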

This is applied to (30b), beginning with

θ~N+1𝒯=n=0Nαn+1Aθ~n𝒯+n=0Nαn+1[Zn+1Zn+2]\widetilde{\theta}_{N+1}^{\mathcal{T}}=\sum_{n=0}^{N}\alpha_{n+1}A\widetilde{\theta}_{n}^{\mathcal{T}}+\sum_{n=0}^{N}\alpha_{n+1}[Z_{n+1}-Z_{n+2}]

Hence with a_{k}=\alpha_{k} and b_{k}=Z_{k+1} (so that a_{1}b_{0}=\alpha_{1}Z_{1}=Z_{1}, using \alpha_{1}=1), the identity (47) implies

n=0Nαn+1[Zn+1Zn+2]\displaystyle\sum_{n=0}^{N}\alpha_{n+1}[Z_{n+1}-Z_{n+2}] =Z1αN+1ZN+2+n=1N[αn+1αn]Zn+1\displaystyle=Z_{1}-\alpha_{N+1}Z_{N+2}+\sum_{n=1}^{N}[\alpha_{n+1}-\alpha_{n}]Z_{n+1}
=Z1αN+1ZN+2n=1Nαn+1αnZn+1\displaystyle=Z_{1}-\alpha_{N+1}Z_{N+2}-\sum_{n=1}^{N}\alpha_{n+1}\alpha_{n}Z_{n+1}

By substitution, and using θ~0𝒯=0\widetilde{\theta}_{0}^{\mathcal{T}}=0,

θ~N+1𝒯=Z1αN+1ZN+2+n=1Nαn+1[Aθ~n𝒯αnZn+1]\widetilde{\theta}_{N+1}^{\mathcal{T}}=Z_{1}-\alpha_{N+1}Z_{N+2}+\sum_{n=1}^{N}\alpha_{n+1}\bigl{[}A\widetilde{\theta}_{n}^{\mathcal{T}}-\alpha_{n}Z_{n+1}\bigr{]}

With Ξn:=θ~n𝒯+αnZn+1\Xi_{n}\mathbin{:=}\widetilde{\theta}_{n}^{\mathcal{T}}+\alpha_{n}Z_{n+1} for n1n\geq 1 we finally obtain for N1N\geq 1,

ΞN+1=Z1+n=1Nαn+1[AΞnαn[I+A]Zn+1]\Xi_{N+1}=Z_{1}+\sum_{n=1}^{N}\alpha_{n+1}\bigl{[}A\Xi_{n}-\alpha_{n}[I+A]Z_{n+1}\bigr{]}

which is equivalent to (31). ∎

Proof of Lemma 2.3.

Consider the Taylor series expansion:

(n+1)ϱnϱ=(1+n1)ϱ\displaystyle\frac{(n+1)^{\varrho}}{n^{\varrho}}=(1+n^{-1})^{\varrho} =1+ϱn112ϱ(1ϱ)n2+O(n3)\displaystyle=1+\varrho n^{-1}-{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\varrho(1-\varrho)n^{-2}+O(n^{-3})
=1+ϱ(n+1)1+ϱn1(n+1)112ϱ(1ϱ)n2+O(n3)\displaystyle=1+\varrho(n+1)^{-1}+\varrho n^{-1}(n+1)^{-1}-{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\varrho(1-\varrho)n^{-2}+O(n^{-3})

where the second equation uses n^{-1}-(n+1)^{-1}=n^{-1}(n+1)^{-1}. With \alpha_{n}=1/n, the following representation follows:

(n+1)ϱ=nϱ[1+αn+1(ϱ+ε(n,ϱ))](n+1)^{\varrho}=n^{\varrho}\bigl{[}1+\alpha_{n+1}(\varrho+\varepsilon(n,\varrho))\bigr{]}

where ε(n,ϱ)=O(n1)\varepsilon(n,\varrho)=O(n^{-1}), and ε(n,ϱ)>0\varepsilon(n,\varrho)>0 for all nn.

Multiplying both sides of (3) by (n+1)ϱ(n+1)^{\varrho}, we obtain

θ~n+1ϱ=θ~nϱ+αn+1[ϱnθ~nϱ+A(n,ϱ)θ~nϱ+(n+1)ϱΔn+1]\widetilde{\theta}^{\varrho}_{n+1}=\widetilde{\theta}^{\varrho}_{n}+\alpha_{n+1}\bigl{[}\varrho_{n}\widetilde{\theta}^{\varrho}_{n}+A(n,\varrho)\widetilde{\theta}^{\varrho}_{n}+(n+1)^{\varrho}\Delta_{n+1}\bigr{]}

where ϱn=ϱ+ε(n,ϱ)\varrho_{n}=\varrho+\varepsilon(n,\varrho) and A(n,ϱ)=(1+n1)ϱAA(n,\varrho)=(1+n^{-1})^{\varrho}A. ∎
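Solving the display above for the error term gives \varepsilon(n,\varrho)=(n+1)[(1+n^{-1})^{\varrho}-1]-\varrho, which makes the claimed properties easy to check numerically; a minimal sketch (the value of \varrho is illustrative):

```python
# Empirical check that eps(n, rho) is positive and O(1/n), with alpha_n = 1/n.
import numpy as np

rho = 0.4
n = np.arange(1, 10**6, dtype=float)
eps = (n + 1.0) * ((1.0 + 1.0 / n) ** rho - 1.0) - rho
assert (eps > 0).all()
print((n * eps)[-1])   # n*eps stays bounded, consistent with eps = O(1/n)
```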

Lemma A.1.

Let ϱ0>0,L0\varrho_{0}>0,L\geq 0 be fixed real numbers. Then the following holds for each n1n\geq 1 and 1n0<n1\leq n_{0}<n:

k=n0n[1ϱ0αk+L2αk2]KA.1n0ϱ0(n+1)ϱ0\prod_{k=n_{0}}^{n}[1-\varrho_{0}\alpha_{k}+L^{2}\alpha_{k}^{2}]\leq K_{\ref{t:prod-n}}\frac{n_{0}^{\varrho_{0}}}{(n+1)^{\varrho_{0}}}

where KA.1=exp(ϱ0+L2k=1αk2)K_{\ref{t:prod-n}}=\exp(\varrho_{0}+L^{2}\sum_{k=1}^{\infty}\alpha_{k}^{2}).

Proof.

By the inequality 1xexp(x)1-x\leq\exp(-x),

k=n0n[1ϱ0αk+L2αk2]exp(ϱ0k=n0nαk)exp(L2k=n0nαk2)exp(ϱ0)Kexp(ϱ0k=n0nαk)\prod_{k=n_{0}}^{n}[1-\varrho_{0}\alpha_{k}+L^{2}\alpha_{k}^{2}]\leq\exp(-\varrho_{0}\sum_{k=n_{0}}^{n}\alpha_{k})\exp(L^{2}\sum_{k=n_{0}}^{n}\alpha_{k}^{2})\leq\exp(-\varrho_{0})K\exp(-\varrho_{0}\sum_{k=n_{0}}^{n}\alpha_{k})

The remainder of the proof involves establishing the bound

exp(ϱ0k=n0nαk)exp(ϱ0)n0ϱ0(n+1)ϱ0\exp(-\varrho_{0}\sum_{k=n_{0}}^{n}\alpha_{k})\leq\exp(\varrho_{0})\frac{n_{0}^{\varrho_{0}}}{(n+1)^{\varrho_{0}}} (48)

For n0=1n_{0}=1 this follows from the bound k=1nαkln(n+1)\sum_{k=1}^{n}\alpha_{k}\geq\ln(n+1), and for n02n_{0}\geq 2 the bound (48) follows from k=n0nαk>ln(n+1)ln(n01)1\sum_{k=n_{0}}^{n}\alpha_{k}>\ln(n+1)-\ln(n_{0}-1)-1. ∎
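The bound is also easy to test numerically; the sketch below uses \alpha_{k}=1/k (so that \sum_{k}\alpha_{k}^{2}=\pi^{2}/6) and illustrative values of \varrho_{0} and L:

```python
# Numerical check of the product bound of Lemma A.1 with alpha_k = 1/k.
import numpy as np

rho0, L = 0.7, 1.3
n = 5000
alpha = 1.0 / np.arange(1, n + 1)
K = np.exp(rho0 + L**2 * np.pi**2 / 6)    # the constant K of Lemma A.1

for n0 in (1, 10, 100):
    prod = np.prod(1.0 - rho0 * alpha[n0 - 1:] + L**2 * alpha[n0 - 1:]**2)
    assert prod <= K * n0**rho0 / (n + 1)**rho0
```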

Lemma A.2.

Under Assumptions A1-A3, let λ=ϱ0+ui\lambda=-\varrho_{0}+ui denote an eigenvalue of matrix AA with largest real part. Then

limnn2ϱ𝖤[θ~nθ~n]=0,ϱ<ϱ0 and ϱ12\lim_{n\rightarrow\infty}n^{2\varrho}{\sf E}[\widetilde{\theta}_{n}^{\intercal}\widetilde{\theta}_{n}]=0\,,\qquad\varrho<\varrho_{0}\text{ and }\varrho\leq{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}
Proof.

Recall the decomposition of θ~n\widetilde{\theta}_{n} in (32): θ~n=θ~n(1)+θ~n(2)+θ~n(3)\widetilde{\theta}_{n}=\widetilde{\theta}_{n}^{(1)}+\widetilde{\theta}_{n}^{(2)}+\widetilde{\theta}_{n}^{(3)}, with θ~n(1),θ~n(2)\widetilde{\theta}_{n}^{(1)},\widetilde{\theta}_{n}^{(2)} evolving as

θ~n+1(1)\displaystyle\widetilde{\theta}_{n+1}^{(1)} =θ~n(1)+αn+1[Aθ~n(1)+Δn+2m],\displaystyle=\widetilde{\theta}_{n}^{(1)}+\alpha_{n+1}\bigl{[}A\widetilde{\theta}_{n}^{(1)}+\Delta_{n+2}^{m}\bigr{]}\,, θ~0(1)=θ~0\displaystyle\widetilde{\theta}_{0}^{(1)}=\widetilde{\theta}_{0} (49a)
θ~n+1(2)\displaystyle\widetilde{\theta}_{n+1}^{(2)} =θ~n(2)+αn+1[Aθ~n(2)αn[I+A]Zn+1],\displaystyle=\widetilde{\theta}_{n}^{(2)}+\alpha_{n+1}\bigl{[}A\widetilde{\theta}_{n}^{(2)}-\alpha_{n}[I+A]Z_{n+1}\bigr{]}\,,\qquad θ~1(2)=Z1\displaystyle\widetilde{\theta}_{1}^{(2)}=Z_{1} (49b)

For fixed \varrho<\varrho_{0} with \varrho\leq{\textstyle\frac{1}{2}}, let T>0 solve the Lyapunov equation [A+\varrho I]^{\intercal}T+T[A+\varrho I]+I=0, which exists since A+\varrho I is Hurwitz. Define the norm of \widetilde{\theta}_{n} by \|\widetilde{\theta}_{n}\|_{T}\mathbin{:=}\sqrt{{\sf E}[\widetilde{\theta}_{n}^{\intercal}T\widetilde{\theta}_{n}]}.

First consider θ~n(1)\widetilde{\theta}_{n}^{(1)}. Since the martingale difference Δn+2m\Delta_{n+2}^{m} is uncorrelated with θ~n(1)\widetilde{\theta}_{n}^{(1)}, denoting en=θ~n(1)T2,bn+2=Δn+2mT2e_{n}=\|\widetilde{\theta}_{n}^{(1)}\|_{T}^{2},b_{n+2}=\|\Delta_{n+2}^{m}\|_{T}^{2}, we obtain the following from (49a):

e_{n+1}=\|[I+\alpha_{n+1}A]\widetilde{\theta}_{n}^{(1)}\|_{T}^{2}+\alpha_{n+1}^{2}b_{n+2} (50)

Letting λ>0\lambda_{\circ}>0 denote the largest eigenvalue of TT, we arrive at the following simplification of the first term in (50)

\|[I+\alpha_{n+1}A]\widetilde{\theta}_{n}^{(1)}\|_{T}^{2} ={\sf E}\bigl{[}(\widetilde{\theta}_{n}^{(1)})^{\intercal}[T-2\alpha_{n+1}\varrho T-\alpha_{n+1}I+\alpha_{n+1}^{2}A^{\intercal}TA]\widetilde{\theta}_{n}^{(1)}\bigr{]} (51)
\leq{\sf E}\bigl{[}(\widetilde{\theta}_{n}^{(1)})^{\intercal}[T-2\alpha_{n+1}\varrho T-\frac{1}{\lambda_{\circ}}\alpha_{n+1}T+\alpha_{n+1}^{2}A^{\intercal}TA]\widetilde{\theta}_{n}^{(1)}\bigr{]}
\leq[1-2\alpha_{n+1}\varrho-\alpha_{n+1}/\lambda_{\circ}+\alpha_{n+1}^{2}L^{2}]\|\widetilde{\theta}_{n}^{(1)}\|_{T}^{2}

where LL denotes the induced operator norm of AA with respect to the norm T\|\,\cdot\,\|_{T}. We then obtain the following recursive bound from (50) and (51)

en+1[1(2ϱ+1/λ)αn+1+L2αn+12]en+αn+12Ke_{n+1}\leq[1-(2\varrho+1/\lambda_{\circ})\alpha_{n+1}+L^{2}\alpha_{n+1}^{2}]e_{n}+\alpha_{n+1}^{2}K

where K=supn1bnK=\sup_{n\geq 1}b_{n}. KK is finite since bnb_{n} converges to 𝖤π[(Δnm)TΔnm]{\sf E}_{\pi}[(\Delta_{n}^{m})^{\intercal}T\Delta_{n}^{m}] geometrically fast.

Consequently, for each n1n\geq 1,

en+1e0k=1n+1[1(2ϱ+1/λ)αk+L2αk2]+Kk=1n+1αk2l=k+1n+1[1(2ϱ+1/λ)αl+L2αl2]e_{n+1}\leq e_{0}\prod_{k=1}^{n+1}[1-(2\varrho+1/\lambda_{\circ})\alpha_{k}+L^{2}\alpha_{k}^{2}]+K\sum_{k=1}^{n+1}\alpha_{k}^{2}\prod_{l=k+1}^{n+1}[1-(2\varrho+1/\lambda_{\circ})\alpha_{l}+L^{2}\alpha_{l}^{2}]

By Lemma A.1,

en+1e1KA.11(n+2)2ϱ+1/λ+KKA.1(n+2)2ϱ+1/λk=1n+1αk22ϱ1/λ\displaystyle e_{n+1}\leq e_{1}K_{\ref{t:prod-n}}\frac{1}{(n+2)^{2\varrho+1/\lambda_{\circ}}}+\frac{KK_{\ref{t:prod-n}}}{(n+2)^{2\varrho+1/\lambda_{\circ}}}\sum_{k=1}^{n+1}\alpha_{k}^{2-2\varrho-1/\lambda_{\circ}}

Therefore, en+10e_{n+1}\rightarrow 0 at rate at least n2ϱn^{-2\varrho}.

For θ~n(2)\widetilde{\theta}_{n}^{(2)}, we use similar arguments. We obtain the following from (49b) by the triangle inequality.

θ~n+1(2)T[I+αn+1A]θ~n(2)T+αnαn+1[I+A]Zn+1T\|\widetilde{\theta}_{n+1}^{(2)}\|_{T}\leq\|[I+\alpha_{n+1}A]\widetilde{\theta}_{n}^{(2)}\|_{T}+\alpha_{n}\alpha_{n+1}\|[I+A]Z_{n+1}\|_{T}

Using the same argument as in (51), along with the inequality 1+x1+12x\sqrt{1+x}\leq 1+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}x,

[I+αn+1A]θ~n(2)T\displaystyle\|[I+\alpha_{n+1}A]\widetilde{\theta}_{n}^{(2)}\|_{T} θ~n(2)T12αn+1ϱαn+1/λ+αn+12L2\displaystyle\leq\|\widetilde{\theta}_{n}^{(2)}\|_{T}\sqrt{1-2\alpha_{n+1}\varrho-\alpha_{n+1}/\lambda_{\circ}+\alpha_{n+1}^{2}L^{2}}
θ~n(2)T(1αn+1ϱαn+1/(2λ)+12αn+12L2)\displaystyle\leq\|\widetilde{\theta}_{n}^{(2)}\|_{T}(1-\alpha_{n+1}\varrho-\alpha_{n+1}/(2\lambda_{\circ})+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}L^{2})

Denote K=supn1[I+A]Zn+1TK^{\prime}=\sup_{n\geq 1}\|[I+A]Z_{n+1}\|_{T}.

θ~n+1(2)T[1(ϱ+1/(2λ))αn+1+12αn+12L2]θ~n(2)T+αnαn+1K\|\widetilde{\theta}_{n+1}^{(2)}\|_{T}\leq[1-(\varrho+1/(2\lambda_{\circ}))\alpha_{n+1}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}L^{2}]\|\widetilde{\theta}_{n}^{(2)}\|_{T}+\alpha_{n}\alpha_{n+1}K^{\prime}

Then, by the same argument as for the martingale difference term, it follows that \|\widetilde{\theta}_{n}^{(2)}\|_{T}\rightarrow 0 at rate at least n^{-\varrho}.

Since \|\widetilde{\theta}_{n}^{(3)}\|_{T}=\alpha_{n}\|Z_{n+1}\|_{T} converges to zero at rate 1/n, the proof is completed by the triangle inequality. ∎

A.2 Proof of Prop. 2.9

(i)

Recall that {Δnm}\{\Delta^{m}_{n}\} is a martingale difference sequence. It is thus an uncorrelated sequence for which θ~n(1)\widetilde{\theta}_{n}^{(1)} and Δn+km\Delta^{m}_{n+k} are uncorrelated for k2k\geq 2. The following recursion is obtained from these facts and (30a)

Cov(θn+1(1))=Cov(θn(1))+αn+1[Cov(θn(1))A+ACov(θn(1))+αn+1[ACov(θn(1))A+ΣΔn+2]]\text{\rm Cov}\,(\theta_{n+1}^{(1)})=\text{\rm Cov}\,(\theta_{n}^{(1)})+\alpha_{n+1}\Bigl{[}\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+A\text{\rm Cov}\,(\theta_{n}^{(1)})+\alpha_{n+1}[A\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+\Sigma_{\Delta_{n+2}}]\Bigr{]}

Multiplying each side by n+1n+1 gives

(n+1)Cov(θn+1(1))=\displaystyle(n+1)\text{\rm Cov}\,(\theta_{n+1}^{(1)})= nCov(θn(1))+Cov(θn(1))+Cov(θn(1))A+ACov(θn(1))\displaystyle n\text{\rm Cov}\,(\theta_{n}^{(1)})+\text{\rm Cov}\,(\theta_{n}^{(1)})+\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+A\text{\rm Cov}\,(\theta_{n}^{(1)})
+αn+1[ACov(θn(1))A+ΣΔn+2]\displaystyle+\alpha_{n+1}[A\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+\Sigma_{\Delta_{n+2}}]
=\displaystyle= nCov(θn(1))+αn+1[(1+1n)[nCov(θn(1))+nCov(θn(1))A+AnCov(θn(1))]\displaystyle n\text{\rm Cov}\,(\theta_{n}^{(1)})+\alpha_{n+1}\Bigl{[}(1+\frac{1}{n})[n\text{\rm Cov}\,(\theta_{n}^{(1)})+n\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+An\text{\rm Cov}\,(\theta_{n}^{(1)})]
+ACov(θn(1))A+ΣΔn+2]\displaystyle+A\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+\Sigma_{\Delta_{n+2}}\Bigr{]}

The following argument will be used repeatedly throughout this Appendix: the recursion for n\text{\rm Cov}\,(\theta_{n}^{(1)}) is a deterministic SA recursion, and is regarded as an Euler approximation to the stable linear system

ddt𝒳(t)=(1+et)[𝒳(t)+A𝒳(t)+𝒳(t)A]+ΣΔ+etA𝒳(t)A{\mathchoice{\genfrac{}{}{}{1}{d}{dt}}{\genfrac{}{}{}{1}{d}{dt}}{\genfrac{}{}{}{3}{d}{dt}}{\genfrac{}{}{}{3}{d}{dt}}}\mathcal{X}(t)=(1+e^{-t})[\mathcal{X}(t)+A\mathcal{X}(t)+\mathcal{X}(t)A^{\intercal}]+\Sigma_{\Delta}+e^{-t}A\mathcal{X}(t)A^{\intercal} (52)

Stability follows from the assumption that 12I+A{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A is Hurwitz. The standard justification of the Euler approximation is through the choice of timescale: let tn=k=1nαkt_{n}=\sum_{k=1}^{n}\alpha_{k} and let 𝒳n(t)\mathcal{X}^{n}(t) denote the solution to this ODE on [tn,)[t_{n},\infty) with 𝒳n(tn)=nCov(θn(1))\mathcal{X}^{n}(t_{n})=n\text{\rm Cov}\,(\theta_{n}^{(1)}), ttnt\geq t_{n}, for any n1n\geq 1. Using standard ODE arguments [5],

\sup_{k\geq n}\|\mathcal{X}^{n}(t_{k})-k\,\text{\rm Cov}\,(\theta_{k}^{(1)})\|=O(1/n)

Exponential convergence of \mathcal{X} to \Sigma_{\theta} implies convergence of \{n\text{\rm Cov}\,(\theta_{n}^{(1)})\} to \Sigma_{\theta} at rate 1/n^{\delta} for some \delta=\delta({\textstyle\frac{1}{2}}I+A,\Sigma_{\Delta})>0. ∎
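The Euler-approximation argument can be illustrated directly: iterating the deterministic recursion for n\text{\rm Cov}\,(\theta_{n}^{(1)}) reproduces \Sigma_{\theta}. A minimal sketch follows, with A and \Sigma_{\Delta} illustrative (chosen so that {\textstyle\frac{1}{2}}I+A is Hurwitz); \Sigma_{\theta} is obtained from the Lyapunov equation \Sigma+A\Sigma+\Sigma A^{\intercal}+\Sigma_{\Delta}=0:

```python
# Deterministic recursion for X_n := n Cov(theta_n^{(1)}) versus its limit.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 0.3], [0.0, -0.8]])          # 0.5*I + A is Hurwitz
Sigma_Delta = np.eye(2)
# Sigma + A Sigma + Sigma A' + Sigma_Delta = 0
#   <=>  (A + I/2) Sigma + Sigma (A + I/2)' = -Sigma_Delta
Sigma_theta = solve_continuous_lyapunov(A + 0.5 * np.eye(2), -Sigma_Delta)

X = np.zeros((2, 2))
for n in range(1, 200_000):
    drift = (1 + 1/n) * (X + A @ X + X @ A.T) + Sigma_Delta + (A @ X @ A.T) / n
    X = X + drift / (n + 1)
print(np.max(np.abs(X - Sigma_theta)))            # shrinks as n grows
```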

(ii)

Denote enϱ0=𝖤[|vθ~nϱ0|2]e_{n}^{\varrho_{0}}={\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0}}|^{2}] and λ=ϱ0+ui\lambda=-\varrho_{0}+ui. We begin with the proof that

lim infnenϱ0>0\liminf_{n\to\infty}e^{\varrho_{0}}_{n}>0 (53)

With v[IλA]=0v^{\intercal}[I\lambda-A]=0, we have v[Iϱn+A(n,ϱ)]=[εv(n,ϱ0)+ui]vv^{\intercal}[I\varrho_{n}+A(n,\varrho)]=[\varepsilon_{v}(n,\varrho_{0})+ui]v^{\intercal}, with εv(n,ϱ0)=O(n1)\varepsilon_{v}(n,\varrho_{0})=O(n^{-1}). Applying (35a) gives

vθ~n+1ϱ0,(1)=vθ~nϱ0,(1)+αn+1[[εv(n,ϱ0)+ui]vθ~nϱ0,(1)+(n+1)ϱ0vΔn+2m]v^{\intercal}\widetilde{\theta}_{n+1}^{{\varrho_{0}},(1)}=v^{\intercal}\widetilde{\theta}_{n}^{{\varrho_{0}},(1)}+\alpha_{n+1}\bigl{[}[\varepsilon_{v}(n,{\varrho_{0}})+ui]v^{\intercal}\widetilde{\theta}_{n}^{{\varrho_{0}},(1)}+(n+1)^{\varrho_{0}}v^{\intercal}\Delta_{n+2}^{m}\bigr{]}

Let v¯\overline{v} denote the conjugate of vv. Consequently, with σn2(v)=vΣΔnv¯\sigma^{2}_{n}(v)=v^{\intercal}\Sigma_{\Delta_{n}}\overline{v},

en+1ϱ0=[[1+εv(n,ϱ0)/(n+1)]2+u2/(n+1)2]enϱ0+(n+1)2ϱ02σn+22(v)e^{\varrho_{0}}_{n+1}=\bigl{[}[1+\varepsilon_{v}(n,{\varrho_{0}})/(n+1)]^{2}+u^{2}/(n+1)^{2}\bigr{]}e^{\varrho_{0}}_{n}+(n+1)^{2{\varrho_{0}}-2}\sigma^{2}_{n+2}(v)

VV-uniform ergodicity implies that σn2(v)vΣΔv¯>0\sigma^{2}_{n}(v)\to v^{\intercal}\Sigma_{\Delta}\overline{v}>0 as nn\to\infty at a geometric rate. Fix n0>0n_{0}>0 so that σn02(v)>0\sigma^{2}_{n_{0}}(v)>0, and hence also en0+1ϱ0>0e^{\varrho_{0}}_{n_{0}+1}>0. We also assume that 1+εv(n,ϱ0)/(n+1)>01+\varepsilon_{v}(n,{\varrho_{0}})/(n+1)>0 for nn0n\geq n_{0}, which is possible since εv(n,ϱ0)=O(n1)\varepsilon_{v}(n,{\varrho_{0}})=O(n^{-1}).

For N>n0N>n_{0} we obtain the uniform bound

log(eNϱ0)log(en0+1ϱ0)+2n=n0+2log[1|εv(n,ϱ0)|/(n+1)]>\log(e^{\varrho_{0}}_{N})\geq\log(e^{\varrho_{0}}_{n_{0}+1})+2\sum_{n=n_{0}+2}^{\infty}\log[1-|\varepsilon_{v}(n,{\varrho_{0}})|/(n+1)]>-\infty

which proves that lim infnenϱ0=lim infnvΣnϱ0,(1)v¯>0\liminf_{n\to\infty}e_{n}^{\varrho_{0}}=\liminf_{n\to\infty}v^{\intercal}\Sigma_{n}^{{\varrho_{0}},(1)}\overline{v}>0.

The proof of an upper bound for ϱ0<1/2{\varrho_{0}}<1/2: by concavity of the logarithm,

log(en+1ϱ0)log([[1+εv(n,ϱ0)/(n+1)]2+u2/(n+1)2]enϱ0)+K(n+1)2ϱ02\log(e^{\varrho_{0}}_{n+1})\leq\log\bigl{(}\bigl{[}[1+\varepsilon_{v}(n,{\varrho_{0}})/(n+1)]^{2}+u^{2}/(n+1)^{2}\bigr{]}e^{\varrho_{0}}_{n}\bigr{)}+K(n+1)^{2{\varrho_{0}}-2}

where K=supn>n0[[1+εv(n,ϱ0)/(n+1)]2+u2/(n+1)2]1[enϱ0]1σn+22(v)K=\sup_{n>n_{0}}\bigl{[}[1+\varepsilon_{v}(n,{\varrho_{0}})/(n+1)]^{2}+u^{2}/(n+1)^{2}\bigr{]}^{-1}[e^{\varrho_{0}}_{n}]^{-1}\sigma^{2}_{n+2}(v). Using concavity of the logarithm once more gives

log(en+1ϱ0)log(enϱ0)+2εv(n,ϱ0)/(n+1)+εv(n,ϱ0)2(n+1)2+u2(n+1)2+K(n+1)2ϱ02\log(e^{\varrho_{0}}_{n+1})\leq\log(e^{\varrho_{0}}_{n})+2\varepsilon_{v}(n,{\varrho_{0}})/(n+1)+\frac{\varepsilon_{v}(n,{\varrho_{0}})^{2}}{(n+1)^{2}}+\frac{u^{2}}{(n+1)^{2}}+K(n+1)^{2{\varrho_{0}}-2}

which gives the uniform upper bound

\log(e^{\varrho_{0}}_{N})\leq\log(e^{\varrho_{0}}_{n_{0}+1})+\sum_{n=n_{0}+2}^{\infty}\Bigl{(}2\frac{|\varepsilon_{v}(n,{\varrho_{0}})|}{n+1}+\frac{\varepsilon_{v}(n,{\varrho_{0}})^{2}}{(n+1)^{2}}+\frac{u^{2}}{(n+1)^{2}}+K(n+1)^{2{\varrho_{0}}-2}\Bigr{)}<\infty

This proves that lim supnenϱ0=lim supnvΣnϱ0,(1)v¯<\limsup_{n\to\infty}e_{n}^{\varrho_{0}}=\limsup_{n\to\infty}v^{\intercal}\Sigma_{n}^{{\varrho_{0}},(1)}\overline{v}<\infty. ∎

A.3 Proof for Prop. 2.10

(i)

Denote 𝒟n=ε(n,ϱ)I+A(n,ϱ)A\mathcal{D}_{n}=\varepsilon(n,\varrho)I+A(n,\varrho)-A. We can rewrite (35b) as

θ~n+1ϱ,(2)\displaystyle\widetilde{\theta}_{n+1}^{\varrho,(2)} =θ~nϱ,(2)+αn+1[[12I+A]θ~nϱ,(2)+𝒟nθ~nϱ,(2)αn(n+1)ϱ[I+A]Zn+1]\displaystyle=\widetilde{\theta}_{n}^{\varrho,(2)}+\alpha_{n+1}\bigl{[}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\widetilde{\theta}_{n}^{\varrho,(2)}+\mathcal{D}_{n}\widetilde{\theta}_{n}^{\varrho,(2)}-\alpha_{n}(n+1)^{\varrho}[I+A]Z_{n+1}\bigr{]} (54)
=[I+αn+1[12I+A]]θ~nϱ,(2)+αn+1𝒟nθ~nϱ,(2)αn+1αn(n+1)ϱ[I+A]Zn+1\displaystyle=\bigl{[}I+\alpha_{n+1}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\bigr{]}\widetilde{\theta}_{n}^{\varrho,(2)}+\alpha_{n+1}\mathcal{D}_{n}\widetilde{\theta}_{n}^{\varrho,(2)}-\alpha_{n+1}\alpha_{n}(n+1)^{\varrho}[I+A]Z_{n+1}

Let T>0T>0 solve the Lyapunov equation

[12I+A]T+T[12I+A]+I=0[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]^{\intercal}T+T[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]+I=0

As in the proof of Lemma A.2, a solution exists because 12I+A{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A is Hurwitz. Adopting the familiar notation θ~nϱ,(2)T:=𝖤[(θ~nϱ,(2))Tθ~nϱ,(2)]\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}\mathbin{:=}\sqrt{{\sf E}[(\widetilde{\theta}_{n}^{\varrho,(2)})^{\intercal}T\widetilde{\theta}_{n}^{\varrho,(2)}]}, the triangle inequality applied to (54) gives

θ~n+1ϱ,(2)T[I+αn+1[12I+A]]θ~nϱ,(2)T+αn+1𝒟nTθ~nϱ,(2)T+αn+1αn(n+1)ϱ[I+A]Zn+1T\|\widetilde{\theta}_{n+1}^{\varrho,(2)}\|_{T}\leq\|\bigl{[}I+\alpha_{n+1}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\bigr{]}\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}+\alpha_{n+1}\|\mathcal{D}_{n}\|_{T}\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}+\alpha_{n+1}\alpha_{n}(n+1)^{\varrho}\|[I+A]Z_{n+1}\|_{T} (55)

The first term can be simplified by the Lyapunov equation.

[I+αn+1[12I+A]]θ~nϱ,(2)T2=\displaystyle\|\bigl{[}I+\alpha_{n+1}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\bigr{]}\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}^{2}= 𝖤[(θ~nϱ,(2))[Tαn+1I+αn+12[12I+A]T[12I+A]]θ~nϱ,(2)]\displaystyle{\sf E}\bigl{[}(\widetilde{\theta}_{n}^{\varrho,(2)})^{\intercal}\bigl{[}T-\alpha_{n+1}I+\alpha_{n+1}^{2}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]^{\intercal}T[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\bigr{]}\widetilde{\theta}_{n}^{\varrho,(2)}\bigr{]}
\displaystyle\leq 𝖤[(θ~nϱ,(2))[Tαn+1λT+αn+12[12I+A]T[12I+A]]θ~nϱ,(2)]\displaystyle{\sf E}\bigl{[}(\widetilde{\theta}_{n}^{\varrho,(2)})^{\intercal}\bigl{[}T-\frac{\alpha_{n+1}}{\lambda_{\circ}}T+\alpha_{n+1}^{2}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]^{\intercal}T[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\bigr{]}\widetilde{\theta}_{n}^{\varrho,(2)}\bigr{]}
\displaystyle\leq θ~nϱ,(2)T2αn+1λθ~nϱ,(2)T2+αn+12L2θ~nϱ,(2)T2\displaystyle\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}^{2}-\frac{\alpha_{n+1}}{\lambda_{\circ}}\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}^{2}+\alpha_{n+1}^{2}L^{2}\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}^{2}

where LL is the induced operator norm of 12I+A{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A, and λ>0\lambda_{\circ}>0 denotes its largest eigenvalue.

Consequently, by the inequality 1+x1+12x\sqrt{1+x}\leq 1+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}x,

[I+αn+1[12I+A]]θ~nϱ,(2)Tθ~nϱ,(2)T1αn+1λ+αn+12L2θ~nϱ,(2)T(1αn+12λ+12αn+12L2)\displaystyle\|\bigl{[}I+\alpha_{n+1}[{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}I+A]\bigr{]}\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}\leq\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}\sqrt{1-\frac{\alpha_{n+1}}{\lambda_{\circ}}+\alpha_{n+1}^{2}L^{2}}\leq\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}(1-\frac{\alpha_{n+1}}{2\lambda_{\circ}}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}L^{2})

Fix n0>0n_{0}>0 such that for nn0n\geq n_{0},

1αn+12λ+12αn+12L2+αn+1𝒟nT1αn+14λ\displaystyle 1-\frac{\alpha_{n+1}}{2\lambda_{\circ}}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}L^{2}+\alpha_{n+1}\|\mathcal{D}_{n}\|_{T}\leq 1-\frac{\alpha_{n+1}}{4\lambda_{\circ}}

This is possible since 𝒟nT=O(n1)\|\mathcal{D}_{n}\|_{T}=O(n^{-1}).

Denote δ=min(14λ,14)\delta=\min(\frac{1}{4\lambda_{\circ}},\frac{1}{4}) and K=supnn0[I+A]Zn+1TK=\sup_{n\geq n_{0}}\|[I+A]Z_{n+1}\|_{T}, which is finite because Zn+1T\|Z_{n+1}\|_{T} converges. We obtain the following from (55)

θ~n+1ϱ,(2)T\displaystyle\|\widetilde{\theta}_{n+1}^{\varrho,(2)}\|_{T}\leq θ~nϱ,(2)T(1δαn+1)+αn+11/2αnK\displaystyle\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}(1-\delta\alpha_{n+1})+\alpha_{n+1}^{1/2}\alpha_{n}K (56)
\displaystyle\leq θ~nϱ,(2)T(1δαn+1)+αn3/2K\displaystyle\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}(1-\delta\alpha_{n+1})+\alpha_{n}^{3/2}K

Apply (56) repeatedly for nn0n\geq n_{0}

θ~n+1ϱ,(2)T\displaystyle\|\widetilde{\theta}_{n+1}^{\varrho,(2)}\|_{T}\leq θ~n0ϱ,(2)Tk=n0+1n+1(1δαk)+Kk=n0nαk3/2l=k+1n(1δαl)\displaystyle\|\widetilde{\theta}_{n_{0}}^{\varrho,(2)}\|_{T}\prod_{k=n_{0}+1}^{n+1}(1-\delta\alpha_{k})+K\sum_{k=n_{0}}^{n}\alpha_{k}^{3/2}\prod_{l=k+1}^{n}(1-\delta\alpha_{l})
\displaystyle\leq θ~n0ϱ,(2)Texp(δ)n0δ(n+2)δ+Kexp(δ)(n+1)δk=n0nk32+δ\displaystyle\|\widetilde{\theta}_{n_{0}}^{\varrho,(2)}\|_{T}\exp(\delta)\frac{n_{0}^{\delta}}{(n+2)^{\delta}}+\frac{K\exp(\delta)}{(n+1)^{\delta}}\sum_{k=n_{0}}^{n}k^{-\frac{3}{2}+\delta}

where k=1k32+δ<\sum_{k=1}^{\infty}k^{-\frac{3}{2}+\delta}<\infty for δ1/4\delta\leq 1/4. Therefore, θ~nϱ,(2)T0\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}\rightarrow 0 at rate at least nδn^{-\delta}.

The desired conclusion follows: letting λ>0\lambda_{\bullet}>0 denote the smallest eigenvalue of TT,

Σnϱ,(2)𝖤[(θ~nϱ,(2))θ~nϱ,(2)]I1λθ~nϱ,(2)T2I\Sigma_{n}^{\varrho,(2)}\leq{\sf E}[(\widetilde{\theta}_{n}^{\varrho,(2)})^{\intercal}\widetilde{\theta}_{n}^{\varrho,(2)}]I\leq\frac{1}{\lambda_{\bullet}}\|\widetilde{\theta}_{n}^{\varrho,(2)}\|_{T}^{2}I

(ii)

Multiplying both sides of (35b) by vv^{\intercal} gives

vθ~n+1ϱ0,(2)\displaystyle v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho_{0},(2)} =vθ~nϱ0,(2)+αn+1[[εv(n,ϱ0)+ui]vθ~nϱ0,(2)(1ϱ0+ui)αn(n+1)ϱ0vZn+1]\displaystyle=v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}+\alpha_{n+1}\bigl{[}[\varepsilon_{v}(n,\varrho_{0})+ui]v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}-(1-\varrho_{0}+ui)\alpha_{n}(n+1)^{\varrho_{0}}v^{\intercal}Z_{n+1}\bigr{]} (57)
=[1+αn+1[εv(n,ϱ0)+ui]]vθ~nϱ0,(2)(1ϱ0+ui)αnαn+1(n+1)ϱ0vZn+1\displaystyle=\bigl{[}1+\alpha_{n+1}[\varepsilon_{v}(n,\varrho_{0})+ui]\bigr{]}v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}-(1-\varrho_{0}+ui)\alpha_{n}\alpha_{n+1}(n+1)^{\varrho_{0}}v^{\intercal}Z_{n+1}

With vθ~nϱ0,(2)2:=𝖤[|vθ~nϱ0,(2)|2]\|v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}\|_{2}\mathbin{:=}\sqrt{{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}|^{2}]}, we obtain the following from (57) by the triangle inequality

vθ~n+1ϱ0,(2)2|1+αn+1[εv(n,ϱ0)+ui]|vθ~nϱ0,(2)2+|1ϱ0+ui|αnαn+1(n+1)ϱ0vZn+12\displaystyle\|v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho_{0},(2)}\|_{2}\leq\bigl{|}1+\alpha_{n+1}[\varepsilon_{v}(n,\varrho_{0})+ui]\bigr{|}\|v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}\|_{2}+\bigl{|}1-\varrho_{0}+ui\bigr{|}\alpha_{n}\alpha_{n+1}(n+1)^{\varrho_{0}}\|v^{\intercal}Z_{n+1}\|_{2} (58)

By the inequality 1+x1+12x\sqrt{1+x}\leq 1+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}x, we have

|1+αn+1εv(n,ϱ0)+αn+1ui|1+αn+1εv(n,ϱ0)+12αn+12εv(n,ϱ0)2+12αn+12u2\bigl{|}1+\alpha_{n+1}\varepsilon_{v}(n,\varrho_{0})+\alpha_{n+1}ui\bigr{|}\leq 1+\alpha_{n+1}\varepsilon_{v}(n,\varrho_{0})+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}\varepsilon_{v}(n,\varrho_{0})^{2}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}u^{2}

Fix n0>0n_{0}>0 such that for nn0n\geq n_{0},

1+αn+1εv(n,ϱ0)+12αn+12εv(n,ϱ0)2+12αn+12u21+αn+13/21+\alpha_{n+1}\varepsilon_{v}(n,\varrho_{0})+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}\varepsilon_{v}(n,\varrho_{0})^{2}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{n+1}^{2}u^{2}\leq 1+\alpha_{n+1}^{3/2}

which is possible since εv(n,ϱ0)=O(n1)\varepsilon_{v}(n,\varrho_{0})=O(n^{-1}). With K=supnn0|1ϱ0+ui|vZn+12K=\sup_{n\geq n_{0}}|1-\varrho_{0}+ui|\|v^{\intercal}Z_{n+1}\|_{2}, we obtain the following bound from (58):

vθ~n+1ϱ0,(2)2(1+αn+13/2)vθ~nϱ0,(2)2+αn2ϱ0K\|v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho_{0},(2)}\|_{2}\leq(1+\alpha_{n+1}^{3/2})\|v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}\|_{2}+\alpha_{n}^{2-\varrho_{0}}K (59)

Iterating (59) gives,

vθ~n+1ϱ0,(2)2\displaystyle\|v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho_{0},(2)}\|_{2} vθ~n0ϱ0,(2)2k=n0+1n+1(1+αk3/2)+Kk=n0nαk2ϱ0l=k+1n(1+αl3/2)\displaystyle\leq\|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho_{0},(2)}\|_{2}\prod_{k=n_{0}+1}^{n+1}(1+\alpha_{k}^{3/2})+K\sum_{k=n_{0}}^{n}\alpha_{k}^{2-\varrho_{0}}\prod_{l=k+1}^{n}(1+\alpha_{l}^{3/2})
\leq\|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho_{0},(2)}\|_{2}\exp\Bigl(\sum_{k=n_{0}+1}^{n+1}k^{-3/2}\Bigr)+K\sum_{k=n_{0}}^{n}k^{-2+\varrho_{0}}\exp\Bigl(\sum_{l=k+1}^{n}l^{-3/2}\Bigr)

\limsup_{n\rightarrow\infty}\|v^{\intercal}\widetilde{\theta}_{n}^{\varrho_{0},(2)}\|_{2}<\infty, since both exponentials are bounded and \sum_{k}k^{-2+\varrho_{0}}<\infty (recall \varrho_{0}<{\textstyle\frac{1}{2}} by assumption). ∎

A.4 Proof of Prop. 2.12

(i)

Since Δn+2m\Delta_{n+2}^{m} is uncorrelated with θ~n(1)\widetilde{\theta}_{n}^{(1)}, the following recursion follows from (30a):

Cov(θn+1(1))=Cov(θn(1))+αn+1[Cov(θn(1))A+ACov(θn(1))+αn+1[ACov(θn(1))A+ΣΔn+2]]\text{\rm Cov}\,(\theta_{n+1}^{(1)})=\text{\rm Cov}\,(\theta_{n}^{(1)})+\alpha_{n+1}\Bigl{[}\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+A\text{\rm Cov}\,(\theta_{n}^{(1)})+\alpha_{n+1}[A\text{\rm Cov}\,(\theta_{n}^{(1)})A^{\intercal}+\Sigma_{\Delta_{n+2}}]\Bigr{]}

Take ϱ=1/2\varrho=1/2 in the definition of θ~ϱ,(1)\widetilde{\theta}^{\varrho,(1)} and Σnϱ,(1)=𝖤[θ~ϱ,(1)(θ~ϱ,(1))]=nCov(θn(1))\Sigma_{n}^{\varrho,(1)}={\sf E}[\widetilde{\theta}^{\varrho,(1)}(\widetilde{\theta}^{\varrho,(1)})^{\intercal}]=n\text{\rm Cov}\,(\theta_{n}^{(1)}). Multiplying each side of the equation by n+1n+1 gives

Σn+1ϱ,(1)=Σnϱ,(1)+αn+1[(1+1n)[Σnϱ,(1)+Σnϱ,(1)A+AΣnϱ,(1)]+1nAΣnϱ,(1)A+ΣΔn+2]\Sigma_{n+1}^{\varrho,(1)}=\Sigma_{n}^{\varrho,(1)}+\alpha_{n+1}\Bigl{[}(1+\frac{1}{n})\bigl{[}\Sigma_{n}^{\varrho,(1)}+\Sigma_{n}^{\varrho,(1)}A^{\intercal}+A\Sigma_{n}^{\varrho,(1)}\bigr{]}+\frac{1}{n}A\Sigma_{n}^{\varrho,(1)}A^{\intercal}+\Sigma_{\Delta_{n+2}}\Bigr{]} (60)

Recall that \Sigma_{\theta} solves the Lyapunov equation \Sigma+\Sigma A^{\intercal}+A\Sigma+\Sigma_{\Delta}=0. Denoting E_{n}=\Sigma_{n}^{\varrho,(1)}-\Sigma_{\theta}, the following identity holds

Σnϱ,(1)+Σnϱ,(1)A+AΣnϱ,(1)=En+EnA+AEnΣΔ\Sigma_{n}^{\varrho,(1)}+\Sigma_{n}^{\varrho,(1)}A^{\intercal}+A\Sigma_{n}^{\varrho,(1)}=E_{n}+E_{n}A^{\intercal}+AE_{n}-\Sigma_{\Delta}

Subtracting Σθ\Sigma_{\theta} from both sides of (60) gives the recursion

En+1=En+αn+1[\displaystyle E_{n+1}=E_{n}+\alpha_{n+1}\Bigl{[} (1+1n)[En+EnA+AEn]+1nAEnA\displaystyle(1+\frac{1}{n})\bigl{[}E_{n}+E_{n}A^{\intercal}+AE_{n}\bigr{]}+\frac{1}{n}AE_{n}A^{\intercal} (61)
+1nAΣθA1nΣΔΣΔ+ΣΔn+2]\displaystyle+\frac{1}{n}A\Sigma_{\theta}A^{\intercal}-\frac{1}{n}\Sigma_{\Delta}-\Sigma_{\Delta}+\Sigma_{\Delta_{n+2}}\Bigr{]}

Similar to the decomposition in (30), we have En=En(1)+En(2)E_{n}=E_{n}^{(1)}+E_{n}^{(2)}, each evolving as

En+1(1)\displaystyle E_{n+1}^{(1)} =En(1)+αn+1[(1+1n)[En(1)+En(1)A+AEn(1)]+1nAEn(1)A+1n[AΣθAΣΔ]]\displaystyle=E_{n}^{(1)}+\alpha_{n+1}\Bigl{[}(1+\frac{1}{n})\bigl{[}E_{n}^{(1)}+E_{n}^{(1)}A^{\intercal}+AE_{n}^{(1)}\bigr{]}+\frac{1}{n}AE_{n}^{(1)}A^{\intercal}+\frac{1}{n}\bigl{[}A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}\bigr{]}\Bigr{]} (62a)
En+1(2)\displaystyle E_{n+1}^{(2)} =En(2)+αn+1[(1+1n)[En(2)+En(2)A+AEn(2)]+1nAEn(2)A+ΣΔn+2ΣΔ]\displaystyle=E_{n}^{(2)}+\alpha_{n+1}\Bigl{[}(1+\frac{1}{n})\bigl{[}E_{n}^{(2)}+E_{n}^{(2)}A^{\intercal}+AE_{n}^{(2)}\bigr{]}+\frac{1}{n}AE_{n}^{(2)}A^{\intercal}+\Sigma_{\Delta_{n+2}}-\Sigma_{\Delta}\Bigr{]} (62b)

Since \Sigma_{\Delta_{n+2}}-\Sigma_{\Delta} converges to zero geometrically fast, \{E_{n}^{(2)}\} converges to zero faster than \{E_{n}^{(1)}\}; the dominant term \{E_{n}^{(1)}\} is analyzed next.

Multiplying each side of (62a) by n+1n+1 gives

(n+1)En+1(1)\displaystyle(n+1)E_{n+1}^{(1)} =(n+1)En(1)+(1+1n)[En(1)+En(1)A+AEn(1)]+1n[AEn(1)A+AΣθAΣΔ]\displaystyle=(n+1)E_{n}^{(1)}+(1+\frac{1}{n})\bigl{[}E_{n}^{(1)}+E_{n}^{(1)}A^{\intercal}+AE_{n}^{(1)}\bigr{]}+\frac{1}{n}\bigl{[}AE_{n}^{(1)}A^{\intercal}+A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}\bigr{]}
=nEn(1)+1n[(1+1n)[2nEn(1)+nEn(1)A+AnEn(1)]+AΣθAΣΔ+n,(1)]\displaystyle=nE_{n}^{(1)}+\frac{1}{n}\Bigl{[}(1+\frac{1}{n})\bigl{[}2nE_{n}^{(1)}+nE_{n}^{(1)}A^{\intercal}+AnE_{n}^{(1)}\bigr{]}+A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}+\mathcal{E}_{n}^{\bullet,(1)}\Bigr{]}

with the error term \mathcal{E}_{n}^{\bullet,(1)}=AE_{n}^{(1)}A^{\intercal}-E_{n}^{(1)}. Note that A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}=[A+I]\Sigma_{\theta}[A+I]^{\intercal} is positive definite.

The recursion for {nEn(1)}\{nE_{n}^{(1)}\} is treated as in the proof of Prop. 2.9 (i). Consider the matrix ODE,

ddt𝒳(t)=(1+et)[2𝒳(t)+𝒳(t)A+A𝒳(t)]+AΣθAΣΔ+et[A𝒳(t)A𝒳(t)]{\mathchoice{\genfrac{}{}{}{1}{d}{dt}}{\genfrac{}{}{}{1}{d}{dt}}{\genfrac{}{}{}{3}{d}{dt}}{\genfrac{}{}{}{3}{d}{dt}}}\mathcal{X}(t)=(1+e^{-t})[2\mathcal{X}(t)+\mathcal{X}(t)A^{\intercal}+A\mathcal{X}(t)]+A\Sigma_{\theta}A^{\intercal}-\Sigma_{\Delta}+e^{-t}[A\mathcal{X}(t)A^{\intercal}-\mathcal{X}(t)] (63)

Let tn=k=1n1/kt_{n}=\sum_{k=1}^{n}1/k and let 𝒳n(t)\mathcal{X}^{n}(t) denote the solution to this ODE on [tn,)[t_{n},\infty) with 𝒳n(tn)=nEn(1)\mathcal{X}^{n}(t_{n})=nE_{n}^{(1)}, ttnt\geq t_{n}, for any n1n\geq 1. We then obtain as previously,

supkn𝒳n(tk)kEk(1)=O(1/n)\sup_{k\geq n}\|\mathcal{X}^{n}(t_{k})-kE_{k}^{(1)}\|=O(1/n)

Recall that Σ(1)0\Sigma_{\sharp}^{(1)}\geq 0 is the solution to the Lyapunov equation (45). Exponential convergence of 𝒳\mathcal{X} to Σ(1)\Sigma_{\sharp}^{(1)} implies convergence of {nEn(1)}\{nE_{n}^{(1)}\} at rate 1/nδ1/n^{\delta} for δ=δ(A+I,ΣΔ)>0\delta=\delta(A+I,\Sigma_{\Delta})>0. Therefore, nEn=Σ(1)+O(nδ)nE_{n}=\Sigma_{\sharp}^{(1)}+O(n^{-\delta}).

Given Cov(θn(1))=n1Σθ+n1En\text{\rm Cov}\,(\theta_{n}^{(1)})=n^{-1}\Sigma_{\theta}+n^{-1}E_{n}, we have

Cov(θn(1))=n1Σθ+n2Σ(1)+O(n2δ)\text{\rm Cov}\,(\theta_{n}^{(1)})=n^{-1}\Sigma_{\theta}+n^{-2}\Sigma_{\sharp}^{(1)}+O(n^{-2-\delta})
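The limit \Sigma_{\sharp}^{(1)} can be computed directly as the stationary point of the ODE (63), i.e., the solution of [I+A]\mathcal{X}+\mathcal{X}[I+A]^{\intercal}+[A+I]\Sigma_{\theta}[A+I]^{\intercal}=0. A minimal sketch with illustrative matrices (here I+A must be Hurwitz):

```python
# Sigma_sharp^{(1)} as the stationary point of the ODE (63); matrices illustrative.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-2.0, 0.5], [0.0, -1.5]])          # I + A is Hurwitz
Sigma_Delta = np.eye(2)
Sigma_theta = solve_continuous_lyapunov(A + 0.5 * np.eye(2), -Sigma_Delta)
M = np.eye(2) + A
Sigma_sharp_1 = solve_continuous_lyapunov(M, -M @ Sigma_theta @ M.T)
print(Sigma_sharp_1)
```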

(ii)

We focus on Rn(2,1),(1)R_{n}^{(2,1),(1)} since Rn(1),(2,1)=[Rn(2,1),(1)]R_{n}^{(1),(2,1)}=[R_{n}^{(2,1),(1)}]^{\intercal}. Recall the update forms of θ~n(1)\widetilde{\theta}_{n}^{(1)} and θ~n(2,1)\widetilde{\theta}_{n}^{(2,1)} in (30a) and (42a) respectively, where θ~n(1)\widetilde{\theta}_{n}^{(1)} is uncorrelated with the martingale difference sequence {Δ^n+km}\{\widehat{\Delta}_{n+k}^{m}\} for k2k\geq 2 and θ~n(2,1)\widetilde{\theta}_{n}^{(2,1)} is uncorrelated with {Δn+km}\{\Delta_{n+k}^{m}\} for k2k\geq 2. With Rn(2,1),(1)=𝖤[θ~n(2,1)(θ~n(1))]R_{n}^{(2,1),(1)}={\sf E}[\widetilde{\theta}_{n}^{(2,1)}(\widetilde{\theta}_{n}^{(1)})^{\intercal}], the following is obtained from these facts:

Rn+1(2,1),(1)=Rn(2,1),(1)+αn+1[Rn(2,1),(1)A\displaystyle R_{n+1}^{(2,1),(1)}=R_{n}^{(2,1),(1)}+\alpha_{n+1}\bigl{[}R_{n}^{(2,1),(1)}A^{\intercal} +ARn(2,1),(1)+αn+1ARn(2,1),(1)A\displaystyle+AR_{n}^{(2,1),(1)}+\alpha_{n+1}AR_{n}^{(2,1),(1)}A^{\intercal}
αnαn+1[I+A]Cov(Δ^n+2m,Δn+2m)]\displaystyle-\alpha_{n}\alpha_{n+1}[I+A]\text{\rm Cov}\,(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})\bigr{]}

Denote Cn=nRn(2,1),(1)C_{n}=nR_{n}^{(2,1),(1)}. Multiplying both sides of the previous equation by n+1n+1 gives

Cn+1=Cn+αn+1[(1+n1)[Cn+CnA+ACn]+αnACnAαn[I+A]Cov(Δ^n+2m,Δn+2m)]C_{n+1}=C_{n}+\alpha_{n+1}\bigl{[}(1+n^{-1})[C_{n}+C_{n}A^{\intercal}+AC_{n}]+\alpha_{n}AC_{n}A^{\intercal}-\alpha_{n}[I+A]\text{\rm Cov}\,(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})\bigr{]}

Multiplying each side of this equation by n+1n+1 once more results in

(n+1)Cn+1\displaystyle(n+1)C_{n+1} =(n+1)Cn+(1+n1)[Cn+CnA+ACn]+αnACnAαn[I+A]Cov(Δ^n+2m,Δn+2m)\displaystyle=(n+1)C_{n}+(1+n^{-1})[C_{n}+C_{n}A^{\intercal}+AC_{n}]+\alpha_{n}AC_{n}A^{\intercal}-\alpha_{n}[I+A]\text{\rm Cov}\,(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})
=nCn+αn[(1+n1)[2nCn+nCnA+AnCn][I+A]Covπ(Δ^n+2m,Δn+2m)+𝒟n+1(2)]\displaystyle=nC_{n}+\alpha_{n}\bigl{[}(1+n^{-1})[2nC_{n}+nC_{n}A^{\intercal}+AnC_{n}]-[I+A]\text{\rm Cov}\,_{\pi}(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})+\mathcal{D}_{n+1}^{(2)}\bigr{]}

where the error term \mathcal{D}_{n+1}^{(2)} consists of two components: [I+A][\text{\rm Cov}\,_{\pi}(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})-\text{\rm Cov}\,(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})], which converges to zero at a geometric rate, and AC_{n}A^{\intercal}-C_{n}.

As previously, this is approximated by the linear system

ddt𝒳(t)=\displaystyle{\mathchoice{\genfrac{}{}{}{1}{d}{dt}}{\genfrac{}{}{}{1}{d}{dt}}{\genfrac{}{}{}{3}{d}{dt}}{\genfrac{}{}{}{3}{d}{dt}}}\mathcal{X}(t)= (1+et)[2𝒳(t)+𝒳(t)A+A𝒳(t)]+et[A𝒳(t)A𝒳(t)]\displaystyle(1+e^{-t})[2\mathcal{X}(t)+\mathcal{X}(t)A^{\intercal}+A\mathcal{X}(t)]+e^{-t}[A\mathcal{X}(t)A^{\intercal}-\mathcal{X}(t)] (64)
\qquad-[I+A]\text{\rm Cov}\,_{\pi}(\widehat{\Delta}_{n+2}^{m},\Delta_{n+2}^{m})

With the same argument used in (i), \{nC_{n}+nC_{n}^{\intercal}\} converges to \Sigma_{\sharp}^{(2)} in (46) at rate 1/n^{\delta} for \delta=\delta(A+I)>0. Therefore, nC_{n}+nC_{n}^{\intercal}=\Sigma_{\sharp}^{(2)}+O(n^{-\delta}), and since R_{n}^{(2,1),(1)}=n^{-1}C_{n}, it follows that R_{n}^{(2,1),(1)}+R_{n}^{(1),(2,1)}=n^{-2}\Sigma_{\sharp}^{(2)}+O(n^{-2-\delta}). ∎

(iii)

The third claim in Prop. 2.12 is established through a sequence of lemmas. Start with the representation of θ~n+1(3)\widetilde{\theta}_{n+1}^{(3)} based on (40):

θ~n+1(3)=1n+1Zn+2=1n+1Δ^n+3m+1n+1(Z^n+3Z^n+2)\widetilde{\theta}_{n+1}^{(3)}=-\frac{1}{n+1}Z_{n+2}=-\frac{1}{n+1}\widehat{\Delta}_{n+3}^{m}+\frac{1}{n+1}(\widehat{Z}_{n+3}-\widehat{Z}_{n+2})

Since Δ^n+3m\widehat{\Delta}_{n+3}^{m} is uncorrelated with the sequence {θ~k(1)}\{\widetilde{\theta}_{k}^{(1)}\} for kn+1k\leq n+1, we have

𝖤[θ~n+1(1)(Δ^n+3m)]=0{\sf E}[\widetilde{\theta}_{n+1}^{(1)}(\widehat{\Delta}_{n+3}^{m})^{\intercal}]=0 (65)

Hence it suffices to consider the correlation between θ~n+1(1)\widetilde{\theta}_{n+1}^{(1)} and Z^n+3Z^n+2\widehat{Z}_{n+3}-\widehat{Z}_{n+2}. The formula for θ~n+1(1)\widetilde{\theta}_{n+1}^{(1)} for n1n\geq 1 is

θ~n+1(1)=k=1n+1[I+αkA]θ~0+k=1n+1αkl=k+1n+1[I+αlA]Δk+1m\widetilde{\theta}_{n+1}^{(1)}=\prod_{k=1}^{n+1}[I+\alpha_{k}A]\widetilde{\theta}_{0}+\sum_{k=1}^{n+1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\Delta_{k+1}^{m} (66)

The term \widetilde{\theta}_{0}{\sf E}[\widehat{Z}_{n+3}^{\intercal}-\widehat{Z}_{n+2}^{\intercal}] converges to zero geometrically fast under V-uniform ergodicity of \boldsymbol{\Phi}. It remains to consider the expectation of the following:

k=1n+1αkl=k+1n+1[I+αlA]Δk+1m[Z^n+3Z^n+2]\displaystyle\sum_{k=1}^{n+1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\Delta_{k+1}^{m}[\widehat{Z}_{n+3}^{\intercal}-\widehat{Z}_{n+2}^{\intercal}] (67)
=\displaystyle= k=1n+1αkl=k+1n+1[I+αlA][Δk+2mZ^n+3Δk+1mZ^n+2]+k=1n+1αkl=k+1n+1[I+αlA][Δk+1mΔk+2m]Z^n+3\displaystyle\sum_{k=1}^{n+1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\bigl{[}\Delta_{k+2}^{m}\widehat{Z}_{n+3}^{\intercal}-\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}\bigr{]}+\sum_{k=1}^{n+1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\bigl{[}\Delta_{k+1}^{m}-\Delta_{k+2}^{m}\bigr{]}\widehat{Z}_{n+3}^{\intercal}

The definition of TT is now based on the assumption that I+AI+A is Hurwitz: T>0T>0 is the unique solution to the Lyapunov equation:

[A+I]T+T[A+I]+I=0[A+I]T+T[A+I]^{\intercal}+I=0

As previously, we denote WT2=𝖤[WTW]\|W\|_{T}^{2}={\sf E}[W^{\intercal}TW] for a random vector WW, and denote by MT\|M\|_{T} the induced operator norm of a matrix Md×dM\in\mathbb{R}^{d\times d}. In the following result the vector WW is taken to be deterministic.

Lemma A.3.

Suppose the matrix I+AI+A is Hurwitz. Then there exists constant KK such that the following holds for any k1k\geq 1 and all nkn\geq k

l=k+1n+1[I+αlA]TKkn+2\bigl{\|}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\bigr{\|}_{T}\leq K\frac{k}{n+2}
Proof.

For any vector WdW\in\mathbb{R}^{d} and l1l\geq 1, we have

[I+αlA]WT2\displaystyle\|[I+\alpha_{l}A]W\|_{T}^{2} =W[T2αlTαlI+αl2ATA]W\displaystyle=W^{\intercal}[T-2\alpha_{l}T-\alpha_{l}I+\alpha_{l}^{2}A^{\intercal}TA]W
\leq W^{\intercal}[T-2\alpha_{l}T+\alpha_{l}^{2}A^{\intercal}TA]W
(12αl+αl2L2)WT2\displaystyle\leq(1-2\alpha_{l}+\alpha_{l}^{2}L^{2})\|W\|_{T}^{2}

where L=ATL=\|A\|_{T}. Hence

I+αlAT12αl+αl2L21αl+12αl2L2\|I+\alpha_{l}A\|_{T}\leq\sqrt{1-2\alpha_{l}+\alpha_{l}^{2}L^{2}}\leq 1-\alpha_{l}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}\alpha_{l}^{2}L^{2}

Lemma A.1 completes the proof:

l=k+1n+1[I+αlA]Tl=k+1n+1[I+αlA]Tl=k+1n+1[1αl+12L2αl2]KA.1kn+2\displaystyle\bigl{\|}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\bigr{\|}_{T}\leq\prod_{l=k+1}^{n+1}\|[I+\alpha_{l}A]\|_{T}\leq\prod_{l=k+1}^{n+1}[1-\alpha_{l}+{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}L^{2}\alpha_{l}^{2}]\leq K_{\ref{t:prod-n}}\frac{k}{n+2}
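The bound of Lemma A.3 is easy to explore numerically in the Euclidean norm, which is equivalent to the T-norm up to a constant factor. The matrix A below is illustrative (with I+A Hurwitz); note the factors I+\alpha_{l}A commute, so the order of the product is immaterial:

```python
# Numerical look at ||prod_{l=k+1}^{n+1} (I + A/l)|| versus k/(n+2).
import numpy as np

A = np.array([[-2.0, 0.5], [0.0, -1.5]])   # eigenvalues -2, -1.5: I + A Hurwitz
n = 2000
for k in (1, 10, 100):
    P = np.eye(2)
    for l in range(k + 1, n + 2):
        P = (np.eye(2) + A / l) @ P
    print(k, np.linalg.norm(P, 2) * (n + 2) / k)   # remains bounded, as claimed
```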

To analyze 𝖤[Δk+2mZ^n+3]{\sf E}[\Delta_{k+2}^{m}\widehat{Z}_{n+3}^{\intercal}], consider the bivariate Markov chain Φn=(Φn,Φn+1)\Phi_{n}^{*}=(\Phi_{n},\Phi_{n+1}), n0n\geq 0, with state space 𝖹=𝖹×𝖹{\sf Z}^{*}={\sf Z}\times{\sf Z}. An associated weighting function V:𝖹×𝖹[1,)V^{*}:{\sf Z}\times{\sf Z}\rightarrow[1,\infty) is defined as V(z,z)=V(z)+V(z)V^{*}(z,z^{\prime})=V(z)+V(z^{\prime}).

Define the function h^{k+1,n+2}:{\sf Z}^{*}\rightarrow\mathbb{R}^{d\times d} by h^{k+1,n+2}(z^{\prime},z^{\prime\prime})=({\hat{f}}(z^{\prime\prime})-{\sf E}[{\hat{f}}(\Phi_{k+1})\mid\Phi_{k}=z^{\prime}]){\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}(\Phi_{n+2})^{\intercal}\mid\Phi_{k+1}=z^{\prime\prime}], and let h_{i,j}^{k+1,n+2}:{\sf Z}^{*}\rightarrow\mathbb{R} denote the (i,j)-th entry of h^{k+1,n+2} for 1\leq i,j\leq d. Note that h^{k+1,n+2}(\Phi_{k},\Phi_{k+1})={\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}\mid\mathcal{F}_{k+1}].

Lemma A.4.

Suppose Assumptions (A1) and (A3) hold. For each 1i,jd1\leq i,j\leq d,

  • (i)

    hi,jk+1,n+2LVh_{i,j}^{k+1,n+2}\in L_{\infty}^{V^{*}}, moreover there exists constant BB such that

    hi,jk+1,n+2VBf^iVf^^jVρnk+1\|h_{i,j}^{k+1,n+2}\|_{V^{*}}\leq B\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}\rho^{n-k+1}
  • (ii)

    Consequently, there exists constant BB^{\prime} such that

    |𝖤[hi,jk+1,n+2(Φk,Φk+1)Φ0=z]π(hi,jk+1,n+2)|Bf^iVf^^jVV(z)ρn+1\bigl{|}{\sf E}[h_{i,j}^{k+1,n+2}(\Phi_{k},\Phi_{k+1})\mid\Phi_{0}=z]-\pi\bigl{(}h_{i,j}^{k+1,n+2}\bigr{)}\bigr{|}\leq B^{\prime}\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}V(z)\rho^{n+1}
Proof.

By the definition of the V^{*}-norm and the triangle inequality,

\|h_{i,j}^{k+1,n+2}\|_{V^{*}}\leq\sup_{z^{\prime},z^{\prime\prime}\in{\sf Z}}\frac{\bigl{[}|{\hat{f}}_{i}(z^{\prime\prime})|+\bigl{|}{\sf E}[{\hat{f}}_{i}(\Phi_{k+1})\mid\Phi_{k}=z^{\prime}]\bigr{|}\bigr{]}\bigl{|}{\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]\bigr{|}}{V(z^{\prime})+V(z^{\prime\prime})}
\displaystyle\leq supz′′𝖹|f^i(z′′)𝖤[f^^j(Φn+2)Φk+1=z′′]|V(z′′)\displaystyle\sup_{z^{\prime\prime}\in{\sf Z}}\frac{\bigl{|}{\hat{f}}_{i}(z^{\prime\prime}){\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]\bigr{|}}{V(z^{\prime\prime})}
+supz,z′′𝖹|𝖤[f^i(Φk+1)Φk=z]𝖤[f^^j(Φn+2)Φk+1=z′′]|V(z)+V(z′′)\displaystyle+\sup_{z^{\prime},z^{\prime\prime}\in{\sf Z}}\frac{\bigl{|}{\sf E}[{\hat{f}}_{i}(\Phi_{k+1})\mid\Phi_{k}=z^{\prime}]{\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]\bigr{|}}{V(z^{\prime})+V(z^{\prime\prime})}

Given f^^j2LV\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}^{2}\in L_{\infty}^{V} and the V\sqrt{V}-uniform ergodicity of 𝚽\Phi [29, Lemma 15.2.9], there exists constant BV<B_{\scriptscriptstyle\sqrt{V}}<\infty such that

|𝖤[f^^j(Φn+2)Φk+1=z′′]|BVf^^jVV(z′′)ρn+1k\bigl{|}{\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]\bigr{|}\leq B_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}\sqrt{V(z^{\prime\prime})}\rho^{n+1-k}

Consequently,

\sup_{z^{\prime\prime}\in{\sf Z}}\frac{|{\hat{f}}_{i}(z^{\prime\prime})\,{\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]|}{V(z^{\prime\prime})}\leq\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}B_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}\rho^{n+1-k} (68)

By the inequality V(z)+V(z′′)V(z)V(z′′)V(z^{\prime})+V(z^{\prime\prime})\geq\sqrt{V(z^{\prime})V(z^{\prime\prime})} and the V\sqrt{V}-uniform ergodicity of 𝚽\Phi once more, we have

supz,z′′𝖹|𝖤[f^i(Φk+1)Φk=z]𝖤[f^^j(Φn+2)Φk+1=z′′]|V(z)+V(z′′)\displaystyle\sup_{z^{\prime},z^{\prime\prime}\in{\sf Z}}\frac{\bigl{|}{\sf E}[{\hat{f}}_{i}(\Phi_{k+1})\mid\Phi_{k}=z^{\prime}]{\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]\bigr{|}}{V(z^{\prime})+V(z^{\prime\prime})} (69)
\displaystyle\leq supz𝖹|𝖤[f^i(Φk+1)Φk=z]|V(z)supz′′𝖹|𝖤[f^^j(Φn+2)Φk+1=z′′]|V(z′′)BV2f^iVf^^jVρn+2k\displaystyle\sup_{z^{\prime}\in{\sf Z}}\frac{\bigl{|}{\sf E}[{\hat{f}}_{i}(\Phi_{k+1})\mid\Phi_{k}=z^{\prime}]\bigr{|}}{\sqrt{V(z^{\prime})}}\sup_{z^{\prime\prime}\in{\sf Z}}\frac{\bigl{|}{\sf E}[\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}(\Phi_{n+2})\mid\Phi_{k+1}=z^{\prime\prime}]\bigr{|}}{\sqrt{V(z^{\prime\prime})}}\leq B_{\scriptscriptstyle\sqrt{V}}^{2}\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}\rho^{n+2-k}

Combining (68) and (69) gives

hi,jk+1,n+2VBf^iVf^^jVρn+1k\|h_{i,j}^{k+1,n+2}\|_{V^{*}}\leq B\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}\rho^{n+1-k} (70)

with B=BV+BV2B=B_{\scriptscriptstyle\sqrt{V}}+B_{\scriptscriptstyle\sqrt{V}}^{2}.

For (ii), denote gi,jk,n+2:𝖹g_{i,j}^{k,n+2}:{\sf Z}\rightarrow\mathbb{R} by the conditional expectation:

gi,jk,n+2(z)=𝖤[hi,jk+1,n+2(Φk,Φk+1)Φk=z]g_{i,j}^{k,n+2}(z)={\sf E}[h_{i,j}^{k+1,n+2}(\Phi_{k},\Phi_{k+1})\mid\Phi_{k}=z]

This is bounded by a constant times VV^{*}:

|gi,jk,n+2(z)|=|hi,jk+1,n+2(z,z)P(z,dz)|\displaystyle|g_{i,j}^{k,n+2}(z)|=\Bigl{|}\int h_{i,j}^{k+1,n+2}(z,z^{\prime})P(z,dz^{\prime})\Bigr{|} |hi,jk+1,n+2(z,z)V(z,z)V(z,z)P(z,dz)|\displaystyle\leq\Bigl{|}\int\frac{h_{i,j}^{k+1,n+2}(z,z^{\prime})}{V^{*}(z,z^{\prime})}V^{*}(z,z^{\prime})P(z,dz^{\prime})\Bigr{|}
hi,jk+1,n+2V[V(z)+PV(z)]\displaystyle\leq\|h_{i,j}^{k+1,n+2}\|_{V^{*}}[V(z)+PV(z)]

VV-uniform ergodicity of 𝚽\Phi is equivalent to the following drift condition [29, Theorem 16.0.2]: for some β>0,b<\beta>0,b<\infty, and some “petite set” CC,

PV(z)V(z)βV(z)+b𝕀C(z),z𝖹PV(z)-V(z)\leq-\beta V(z)+b\mathbb{I}_{C}(z)\,,\qquad z\in{\sf Z}

Consequently,

[V(z)+PV(z)][2V(z)+b][2+|b|]V(z)[V(z)+PV(z)]\leq[2V(z)+b]\leq[2+|b|]V(z)

Therefore,

gi,jk,n+2V[2+|b|]hi,jk+1,n+2V[2+|b|]Bf^iVf^^jVρn+1k\|g_{i,j}^{k,n+2}\|_{V}\leq[2+|b|]\|h_{i,j}^{k+1,n+2}\|_{V^{*}}\leq[2+|b|]B\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}\rho^{n+1-k} (71)

Thus gi,jk,n+2LVg_{i,j}^{k,n+2}\in L_{\infty}^{V}. By VV-uniform ergodicity of 𝚽\Phi again,

|𝖤[gi,jk,n+2(Φk)Φ0=z]π(gi,jk,n+2)|\displaystyle\bigl{|}{\sf E}[g_{i,j}^{k,n+2}(\Phi_{k})\mid\Phi_{0}=z]-\pi\bigl{(}g_{i,j}^{k,n+2}\bigr{)}\bigr{|} BVgi,jk,n+2VV(z)ρk\displaystyle\leq B_{V}\|g_{i,j}^{k,n+2}\|_{V}V(z)\rho^{k}
Bf^iVf^^jVV(z)ρn+1\displaystyle\leq B^{\prime}\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}V(z)\rho^{n+1}

with B=[2+|b|]BVBB^{\prime}=[2+|b|]B_{V}B. The proof is then completed by applying the smoothing property of conditional expectation. ∎

Lemma A.5.

Under Assumptions (A1) and (A3), there exists K<K<\infty such that the following hold

𝖤[Δk+1mZ^n+3]T\displaystyle\bigl{\|}{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+3}^{\intercal}]\bigr{\|}_{T} Kρn+1k\displaystyle\leq K\rho^{n+1-k} (72a)
𝖤[Δk+1mZ^n+2]𝖤[Δk+2mZ^n+3]T\displaystyle\bigl{\|}{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}]-{\sf E}[\Delta_{k+2}^{m}\widehat{Z}_{n+3}^{\intercal}]\bigr{\|}_{T} K(1+ρ)ρn+1\displaystyle\leq K(1+\rho)\rho^{n+1} (72b)
Proof.

By the triangle inequality,

𝖤[Δk+1mZ^n+2]T𝖤[Zk+1Z^n+2]T+𝖤[𝖤[Zk+1|k]Z^n+2]T\bigl{\|}{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}]\bigr{\|}_{T}\leq\bigl{\|}{\sf E}[Z_{k+1}\widehat{Z}_{n+2}^{\intercal}]\bigr{\|}_{T}+\bigl{\|}{\sf E}\bigl{[}{\sf E}[Z_{k+1}|\mathcal{F}_{k}]\widehat{Z}_{n+2}^{\intercal}\bigr{]}\bigr{\|}_{T}

where both terms admit the geometric bound in (72a), which follows directly from the V-geometric mixing of \boldsymbol{\Phi} [29, Theorem 16.1.5].

For (72b), first notice that

𝖤[Δk+1mZ^n+2]=𝖤[𝖤[Δk+1mZ^n+2k+1]]=𝖤[hk+1,n+2(Φk,Φk+1)]{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}]={\sf E}\bigl{[}{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}\mid\mathcal{F}_{k+1}]\bigr{]}={\sf E}[h^{k+1,n+2}(\Phi_{k},\Phi_{k+1})]

With Lemma A.4, we have for each (i,j)(i,j)-th entry,

|𝖤[hi,jk+1,n+2(Φk,Φk+1)Φ0=z]π(hi,jk+1,n+2)|Bf^iVf^^jVV(z)ρn+1\Bigl{|}{\sf E}[h_{i,j}^{k+1,n+2}(\Phi_{k},\Phi_{k+1})\mid\Phi_{0}=z]-\pi\bigl{(}h_{i,j}^{k+1,n+2}\bigr{)}\Bigr{|}\leq B^{\prime}\|{\hat{f}}_{i}\|_{\scriptscriptstyle\sqrt{V}}\|\hat{\vphantom{\rule{1.0pt}{6.14584pt}}\smash{\hat{f}}}_{j}\|_{\scriptscriptstyle\sqrt{V}}V(z)\rho^{n+1}

With fixed initial condition Φ0=z\Phi_{0}=z, by equivalence of matrix norms, there exists a constant KK such that

\Bigl{\|}{\sf E}[h^{k+1,n+2}(\Phi_{k},\Phi_{k+1})]-\pi\bigl{(}h^{k+1,n+2}\bigr{)}\Bigr{\|}_{T}\leq K\rho^{n+1}

(72b) then follows from the triangle inequality:

𝖤[Δk+1mZ^n+2]𝖤[Δk+2mZ^n+3]TKρn+1+Kρn+2=K(1+ρ)ρn+1\bigl{\|}{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}]-{\sf E}[\Delta_{k+2}^{m}\widehat{Z}_{n+3}^{\intercal}]\bigr{\|}_{T}\leq K\rho^{n+1}+K\rho^{n+2}=K(1+\rho)\rho^{n+1}

Lemma A.6.

For fixed ρ(0,1)\rho\in(0,1), there exists K<K<\infty such that for all n2n\geq 2,

k=1n11kρkKρnn\sum_{k=1}^{n-1}\frac{1}{k}\rho^{-k}\leq K\frac{\rho^{-n}}{n}
Proof.

Denote \gamma=-\log\rho>0 and observe that the function t^{-1}\exp(\gamma t) is increasing for t\geq 1/\gamma; the finitely many terms with k<1/\gamma are bounded by a fixed constant, which may be absorbed into K. The following holds for n\geq 2

k=1n11kρk=k=1n11kexp(γk)1nt1exp(γt)𝑑t\sum_{k=1}^{n-1}\frac{1}{k}\rho^{-k}=\sum_{k=1}^{n-1}\frac{1}{k}\exp(\gamma k)\leq\int_{1}^{n}t^{-1}\exp(\gamma t)dt

Now consider the integral: for any t0(1,n)t_{0}\in(1,n),

1nt1exp(γt)𝑑t\displaystyle\int_{1}^{n}t^{-1}\exp(\gamma t)dt 1t0exp(γt)𝑑t+t0nt01exp(γt)𝑑t\displaystyle\leq\int_{1}^{t_{0}}\exp(\gamma t)dt+\int_{t_{0}}^{n}t_{0}^{-1}\exp(\gamma t)dt
γ1[exp(γt0)exp(γ)+exp(γn)exp(γt0)t0]\displaystyle\leq\gamma^{-1}\bigl{[}\exp(\gamma t_{0})-\exp(\gamma)+\frac{\exp(\gamma n)-\exp(\gamma t_{0})}{t_{0}}\bigr{]}

Take t0=nnt_{0}=n-\sqrt{n}.

exp(γt0)exp(γ)+exp(γn)exp(γt0)t0\displaystyle\exp(\gamma t_{0})-\exp(\gamma)+\frac{\exp(\gamma n)-\exp(\gamma t_{0})}{t_{0}} =exp(γ(nn))exp(γ)+exp(γn)exp(γ(nn))nn\displaystyle=\exp(\gamma(n-\sqrt{n}))-\exp(\gamma)+\frac{\exp(\gamma n)-\exp(\gamma(n-\sqrt{n}))}{n-\sqrt{n}}
Kn1exp(γn)\displaystyle\leq K^{\prime}n^{-1}\exp(\gamma n)

where K=supt2texp(γt)texp(γγt)+[1exp(γt)]/[11/t]K^{\prime}=\sup_{t\geq 2}t\exp(-\gamma\sqrt{t})-t\exp(\gamma-\gamma t)+[1-\exp(-\gamma\sqrt{t})]/[1-1/\sqrt{t}]. The proof is completed by setting K=γ1KK=\gamma^{-1}K^{\prime}. ∎
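A numerical sanity check, with \rho illustrative; by the lemma, the printed quantity (the sum multiplied by n\rho^{n}) should remain bounded in n:

```python
# Sanity check of Lemma A.6.
import numpy as np

rho = 0.9
for n in (10, 100, 1000):
    k = np.arange(1, n, dtype=float)
    s = np.sum(rho ** (-k) / k)
    print(n, s * n * rho ** n)   # bounded in n
```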

Proof of Prop. 2.12 (iii).

Following (65), we have

Rn+1(1),(3)=𝖤[θ~n+1(1)(θ~n+1(3))]=1n+1𝖤[θ~n+1(1)[Z^n+3Z^n+2]]R_{n+1}^{(1),(3)}={\sf E}[\widetilde{\theta}_{n+1}^{(1)}(\widetilde{\theta}_{n+1}^{(3)})^{\intercal}]=\frac{1}{n+1}{\sf E}[\widetilde{\theta}_{n+1}^{(1)}[\widehat{Z}_{n+3}-\widehat{Z}_{n+2}]^{\intercal}] (73)

This is bounded based on (67): Lemma A.3 and (72b) indicate that there exists some constant KK such that

k=1n+1αkl=k+1n+1[I+αlA]T𝖤[Δk+2mZ^n+3Δk+1mZ^n+2]TKρn+1\displaystyle\sum_{k=1}^{n+1}\alpha_{k}\bigl{\|}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\bigr{\|}_{T}\bigl{\|}{\sf E}\bigl{[}\Delta_{k+2}^{m}\widehat{Z}_{n+3}^{\intercal}-\Delta_{k+1}^{m}\widehat{Z}_{n+2}^{\intercal}\bigr{]}\bigr{\|}_{T}\leq K\rho^{n+1} (74)

The second term in (67) admits a simpler form

k=1n+1αkl=k+1n+1[I+αlA][Δk+1mΔk+2m]Z^n+3=\displaystyle\sum_{k=1}^{n+1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A]\bigl{[}\Delta_{k+1}^{m}-\Delta_{k+2}^{m}\bigr{]}\widehat{Z}_{n+3}^{\intercal}= l=2n+1[I+αlA]Δ2mZ^n+31n+1Δn+3mZ^n+3\displaystyle\prod_{l=2}^{n+1}[I+\alpha_{l}A]\Delta_{2}^{m}\widehat{Z}_{n+3}^{\intercal}-\frac{1}{n+1}\Delta_{n+3}^{m}\widehat{Z}_{n+3}^{\intercal}
k=2n+1αk1αkl=k+1n+1[I+αlA][I+A]Δk+1mZ^n+3\displaystyle-\sum_{k=2}^{n+1}\alpha_{k-1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A][I+A]\Delta_{k+1}^{m}\widehat{Z}_{n+3}^{\intercal}

where l=2n+1[I+αlA]𝖤[Δ2Z^n+3]=O(ρn)\prod_{l=2}^{n+1}[I+\alpha_{l}A]{\sf E}[\Delta_{2}\widehat{Z}_{n+3}^{\intercal}]=O(\rho^{n}) and 𝖤[Δn+3mZ^n+3]{\sf E}[\Delta_{n+3}^{m}\widehat{Z}_{n+3}^{\intercal}] converges to its steady-state mean. For the remaining part, Lemma A.3 and (72a) together imply that

k=2n+1αk1αkl=k+1n+1[I+αlA][I+A]𝖤[Δk+1mZ^n+3]T\displaystyle\Bigl{\|}\sum_{k=2}^{n+1}\alpha_{k-1}\alpha_{k}\prod_{l=k+1}^{n+1}[I+\alpha_{l}A][I+A]{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+3}^{\intercal}]\Bigr{\|}_{T}
k=2n+1αk1αkl=k+1n+1I+αlATI+AT𝖤[Δk+1mZ^n+3]T\displaystyle\leq\sum_{k=2}^{n+1}\alpha_{k-1}\alpha_{k}\prod_{l=k+1}^{n+1}\|I+\alpha_{l}A\|_{T}\|I+A\|_{T}\|{\sf E}[\Delta_{k+1}^{m}\widehat{Z}_{n+3}^{\intercal}]\|_{T}
Kn+2k=2n+1αk1ρn+1k\displaystyle\leq\frac{K^{\prime}}{n+2}\sum_{k=2}^{n+1}\alpha_{k-1}\rho^{n+1-k}

for some constant KK^{\prime}. By Lemma A.6, there exists another constant K′′K^{\prime\prime} such that

Kn+2k=2n+1αk1ρnk=Kρnn+2k=1nαkρkKK′′ρ(n+1)(n+2)\frac{K^{\prime}}{n+2}\sum_{k=2}^{n+1}\alpha_{k-1}\rho^{n-k}=\frac{K^{\prime}\rho^{n}}{n+2}\sum_{k=1}^{n}\alpha_{k}\rho^{-k}\leq\frac{K^{\prime}K^{\prime\prime}\rho}{(n+1)(n+2)}

This combined with (74) shows that

{\sf E}[\widetilde{\theta}_{n+1}^{(1)}[\widehat{Z}_{n+3}-\widehat{Z}_{n+2}]^{\intercal}]=-(n+1)^{-1}{\sf E}_{\pi}[\Delta_{n}^{m}\widehat{Z}_{n}^{\intercal}]+O((n+1)^{-2})

Following (73), we obtain the desired result:

𝖤[θ~n+1(1)(θ~n+1(3))]=1(n+1)2𝖤π[ΔnmZ^n]+O((n+1)3){\sf E}[\widetilde{\theta}_{n+1}^{(1)}(\widetilde{\theta}_{n+1}^{(3)})^{\intercal}]=-\frac{1}{(n+1)^{2}}{\sf E}_{\pi}[\Delta_{n}^{m}\widehat{Z}_{n}^{\intercal}]+O((n+1)^{-3})

A.5 Unbounded moments

This section is devoted to the proof that limn𝖤[|vθ~nϱ|2]=\lim_{n\rightarrow\infty}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{\varrho}|^{2}]=\infty for ϱ>ϱ0\varrho>\varrho_{0} (see Thm. 2.4 (ii)). Since it suffices to show the result holds for ϱ0<ϱ<12\varrho_{0}<\varrho<{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}}, we assume ϱ<12\varrho<{\mathchoice{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{1}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}{\genfrac{}{}{}{3}{1}{2}}} throughout. Recall that λ=ϱ0+ui\lambda=-\varrho_{0}+ui.

Consider the update of $\widetilde{\theta}_{n}^{\varrho}$ in (33). With $v^{\intercal}[\lambda I-A]=0$, we have $v^{\intercal}[\varrho_{n}I+A_{n}]=v^{\intercal}[\varrho-\varrho_{0}+\varepsilon_{v}(n,\varrho)+ui]$. Multiplying each side of (33) by $v^{\intercal}$ gives

v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho} = v^{\intercal}\widetilde{\theta}_{n}^{\varrho} + \alpha_{n+1}\bigl[[\varrho-\varrho_{0}+\varepsilon_{v}(n,\varrho)+ui]v^{\intercal}\widetilde{\theta}_{n}^{\varrho} + (n+1)^{\varrho}v^{\intercal}\Delta_{n+1}\bigr]
    = [1+\alpha_{n+1}\tilde{\varrho}_{n+1}+\alpha_{n+1}ui]v^{\intercal}\widetilde{\theta}_{n}^{\varrho} + (n+1)^{\varrho-1}v^{\intercal}\Delta_{n+1}

with $\tilde{\varrho}_{n+1}=\varrho-\varrho_{0}+\varepsilon_{v}(n,\varrho)$. Note that $\tilde{\varrho}_{n+1}$ is strictly positive for sufficiently large $n$.

For fixed but arbitrary $n_{0}$ and each $n\geq n_{0}$, we have

v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho} = v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}\prod_{k=n_{0}+1}^{n+1}[1+\alpha_{k}\tilde{\varrho}_{k}+\alpha_{k}ui] + \sum_{k=n_{0}+1}^{n+1}k^{\varrho-1}v^{\intercal}\Delta_{k}\prod_{l=k+1}^{n+1}[1+\alpha_{l}\tilde{\varrho}_{l}+\alpha_{l}ui]    (75)
    = \Bigl[\prod_{k=n_{0}+1}^{n+1}[1+\alpha_{k}\tilde{\varrho}_{k}+\alpha_{k}ui]\Bigr]\cdot\Bigl[v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho} + \sum_{k=n_{0}+1}^{n+1}\frac{k^{\varrho-1}}{\prod_{l=n_{0}+1}^{k}[1+\alpha_{l}\tilde{\varrho}_{l}+\alpha_{l}ui]}\,v^{\intercal}\Delta_{k}\Bigr]
    = \Bigl[\prod_{k=n_{0}+1}^{n+1}[1+\alpha_{k}\tilde{\varrho}_{k}+\alpha_{k}ui]\Bigr]\cdot\Bigl[v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho} + \sum_{k=n_{0}+1}^{n+1}\beta_{k}v^{\intercal}\Delta_{k}\Bigr]

with $\beta_{n}=n^{\varrho-1}\big/\prod_{l=n_{0}+1}^{n}[1+\alpha_{l}\tilde{\varrho}_{l}+\alpha_{l}ui]$.
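The divergence mechanism behind this representation is easy to observe numerically. The following is a minimal illustrative sketch, not part of the proof: it iterates the scalar recursion above with $\varepsilon_{v}\equiv 0$ and i.i.d. Gaussian noise standing in for the Markovian sequence $\{v^{\intercal}\Delta_{n}\}$; the values of $\varrho_{0}$ and $u$ are assumptions made for illustration only.

```python
# Illustrative simulation (not from the paper): the scalar recursion for
# v' theta_n^rho with eps_v = 0, and i.i.d. N(0,1) noise replacing v' Delta_n.
# The product of multipliers grows like n^(rho - rho0), so the mean-square
# error blows up at rate roughly n^(2 (rho - rho0)) once rho > rho0.
import numpy as np

def mse(rho, rho0=0.1, u=1.0, n_steps=20000, n_paths=500, seed=0):
    rng = np.random.default_rng(seed)
    lam = (rho - rho0) + 1j * u              # rho~ + u i, with eps_v = 0
    theta = np.zeros(n_paths, dtype=complex)
    for n in range(n_steps):
        alpha = 1.0 / (n + 1)                # step size alpha_{n+1}
        noise = rng.standard_normal(n_paths) # stand-in for v' Delta_{n+1}
        theta = (1.0 + alpha * lam) * theta + (n + 1) ** (rho - 1.0) * noise
    return np.mean(np.abs(theta) ** 2)

print(mse(rho=0.05))   # rho < rho0: mean-square error remains bounded
print(mse(rho=0.45))   # rho > rho0: mean-square error grows without bound
```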

The analysis of $\{v^{\intercal}\widetilde{\theta}_{n}^{\varrho}\}$ is based mainly on the random series appearing in (75), and requires the following three preliminary results.

Lemma A.7.

There exists some $n_{0}$ such that for each $n>n_{0}$,

|\beta_{n}-\beta_{n+1}|^{2} \leq 4|\beta_{n+1}|^{2}\alpha_{n}^{2}(1+u^{2})
Proof.

Note that $|\beta_{n}-\beta_{n+1}|^{2}=|\beta_{n+1}|^{2}\,|\beta_{n}/\beta_{n+1}-1|^{2}$, so it suffices to bound the second factor:

|\beta_{n}/\beta_{n+1}-1|^{2} = |(1+n^{-1})^{1-\varrho}[1+\alpha_{n+1}\tilde{\varrho}_{n+1}+\alpha_{n+1}ui]-1|^{2}    (76)
    = |(1+n^{-1})^{1-\varrho}[1+\alpha_{n+1}\tilde{\varrho}_{n+1}]-1+(1+n^{-1})^{1-\varrho}\alpha_{n+1}ui|^{2}

Consider the real part in (76): since $\varepsilon_{v}(n,\varrho)=O(n^{-1})$, there exists $n_{0}$ such that $|\varepsilon_{v}(n,\varrho)|\leq\varrho-\varrho_{0}$ and $\tilde{\varrho}_{n+1}=\varrho-\varrho_{0}+\varepsilon_{v}(n,\varrho)>0$ for $n\geq n_{0}$. Consequently,

0 \leq (1+n^{-1})^{1-\varrho}[1+\alpha_{n+1}\tilde{\varrho}_{n+1}]-1 < (1+n^{-1})[1+\alpha_{n+1}\tilde{\varrho}_{n+1}]-1 \leq n^{-1}(1+\tilde{\varrho}_{n+1}+\alpha_{n+1}\tilde{\varrho}_{n+1})

Given $0<\varrho-\varrho_{0}<\tfrac{1}{2}$, we can increase $n_{0}$ if necessary so that $1+\tilde{\varrho}_{n+1}+\alpha_{n+1}\tilde{\varrho}_{n+1}\leq 2$ for $n\geq n_{0}$. Then we have

(1+n^{-1})^{1-\varrho}[1+\alpha_{n+1}\tilde{\varrho}_{n+1}]-1 \leq 2\alpha_{n}

For the imaginary part, observe that

(1+n^{-1})^{1-\varrho}\alpha_{n+1}u = \alpha_{n}\frac{n^{\varrho}}{(n+1)^{\varrho}}\,u \leq 2u\alpha_{n}

The proof is completed by summing the squared bounds on the real and imaginary parts. ∎
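A direct numerical spot-check of the lemma is also straightforward; the sketch below computes $\beta_{n}$ from its definition under the illustrative simplification $\varepsilon_{v}\equiv 0$ (so that $\tilde{\varrho}_{n}=\varrho-\varrho_{0}$) and verifies the stated bound:

```python
# Spot-check of Lemma A.7 (illustrative; eps_v = 0, so rho~_n = rho - rho0):
# verify |beta_n - beta_{n+1}|^2 <= 4 |beta_{n+1}|^2 alpha_n^2 (1 + u^2).
import numpy as np

rho, rho0, u, n0, N = 0.45, 0.4, 1.0, 5, 5000
n = np.arange(n0 + 1, N + 1)                 # indices n0+1, ..., N
mult = 1 + ((rho - rho0) + 1j * u) / n       # 1 + alpha_l (rho~ + u i)
beta = n ** (rho - 1.0) / np.cumprod(mult)   # beta_n for n = n0+1, ..., N
lhs = np.abs(beta[:-1] - beta[1:]) ** 2
rhs = 4 * np.abs(beta[1:]) ** 2 * (1 + u ** 2) / n[:-1] ** 2  # alpha_n = 1/n
assert np.all(lhs <= rhs)
print("bound holds; max ratio lhs/rhs =", float(np.max(lhs / rhs)))
```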

Lemma A.8.

Suppose Assumptions A1 and A3 hold. For each $n_{0}\geq 1$, the random series $\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}$ converges a.s.

Proof.

Decompose the series into the sum of a martingale-difference series and a telescoping series. The martingale part converges almost surely since $\{\beta_{n}\}\in\ell_{2}$; the telescoping series is absolutely convergent by Lemma A.7. ∎

Lemma A.9.

Suppose Assumptions A1 and A3 hold. Denote $Z_{n}^{v}=v^{\intercal}Z_{n}=v^{\intercal}\hat{f}(\Phi_{n})$. There exists a deterministic constant $K>0$ such that, for all $n_{0}$ and each sequence $\gamma\in\ell_{1}\subseteq\ell_{2}$,

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+2}^{\infty}\gamma_{k-n_{0}-1}Z_{k}^{v}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] \leq K\sum_{k=1}^{\infty}|\gamma_{k}|^{2}    (77)
Proof.

First recall that ${\sf Var}\,\bigl(\sum_{k=n_{0}+2}^{\infty}\gamma_{k-n_{0}-1}Z_{k}^{v}\mid\mathcal{F}_{n_{0}+1}\bigr)\leq{\sf E}\bigl[|\sum_{k=n_{0}+2}^{\infty}\gamma_{k-n_{0}-1}Z_{k}^{v}|^{2}\mid\mathcal{F}_{n_{0}+1}\bigr]$, and hence, by the Markov property,

{\sf E}\bigl[|\sum_{k=n_{0}+2}^{\infty}\gamma_{k-n_{0}-1}Z_{k}^{v}|^{2}\mid\mathcal{F}_{n_{0}+1}\bigr] = {\sf E}_{z^{\prime}}\bigl[|\sum_{k=1}^{\infty}\gamma_{k}Z_{k}^{v}|^{2}\bigr] = \lim_{n\rightarrow\infty}{\sf E}_{z^{\prime}}\bigl[|\sum_{k=1}^{n}\gamma_{k}Z_{k}^{v}|^{2}\bigr]

where $z^{\prime}=\Phi_{n_{0}+1}$, and the last equality holds by the assumption $\gamma\in\ell_{1}$ and dominated convergence. For each $n$, letting $\lceil\gamma\rceil^{n}=(\gamma_{1},\dots,\gamma_{n})$ denote $\gamma$ truncated at index $n$, we have

{\sf E}_{z^{\prime}}\bigl[|\sum_{k=1}^{n}\gamma_{k}Z_{k}^{v}|^{2}\bigr] = \sum_{k=1}^{n}|\gamma_{k}|^{2}{\sf E}_{z^{\prime}}\bigl[|Z_{k}^{v}|^{2}\bigr] + \sum_{i=1}^{n}\sum_{j\neq i}^{n}\gamma_{i}^{\dagger}\gamma_{j}{\sf E}_{z^{\prime}}\bigl[(Z_{i}^{v})^{\dagger}Z_{j}^{v}\bigr] = (\lceil\gamma\rceil^{n})^{\dagger}[R]_{n}\lceil\gamma\rceil^{n}    (78)

where $[R]_{n}\in\mathbb{C}^{n\times n}$ is the covariance matrix with entries $R(i,j)={\sf E}_{z^{\prime}}\bigl[(Z_{i}^{v})^{\dagger}Z_{j}^{v}\bigr]$, $1\leq i,j\leq n$; $[R]_{n}$ is Hermitian and positive semi-definite. With $\lambda_{n}\geq 0$ denoting the largest eigenvalue of $[R]_{n}$, we have

(\lceil\gamma\rceil^{n})^{\dagger}[R]_{n}\lceil\gamma\rceil^{n} \leq \lambda_{n}\sum_{k=1}^{n}|\gamma_{k}|^{2} \leq \lambda_{n}\sum_{k=1}^{\infty}|\gamma_{k}|^{2}    (79)

By the Gershgorin circle theorem [20], the maximal eigenvalue is upper bounded by the maximum row sum of absolute values of entries:

\lambda_{n} \leq \max_{i\in\{1,\dots,n\}}\sum_{j=1}^{n}|R(i,j)| \leq \sup_{i\in\mathbb{Z}_{+}}\sum_{j=1}^{\infty}|R(i,j)|
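For readers who want to see the Gershgorin step concretely, here is a small numerical aside (with a randomly generated Hermitian matrix; not part of the argument):

```python
# Gershgorin check: for a Hermitian positive semi-definite matrix R, the
# largest eigenvalue is bounded by the maximum absolute row sum, since each
# Gershgorin disc is centered at a nonnegative diagonal entry R(i,i).
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
R = B.conj().T @ B                        # Hermitian, positive semi-definite
lam_max = np.linalg.eigvalsh(R)[-1]       # largest eigenvalue (real)
gershgorin = np.abs(R).sum(axis=1).max()  # maximum absolute row sum
assert lam_max <= gershgorin + 1e-9
print(float(lam_max), float(gershgorin))
```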

For any $i$, observe that

\sum_{j=1}^{\infty}|R(i,j)| = {\sf E}_{z^{\prime}}\bigl[|Z_{i}^{v}|^{2}\bigr] + \sum_{i<j}|R(i,j)| + \sum_{i>j}|R(i,j)|

Since $V$-uniform ergodicity of the Markov chain $\Phi$ implies $V$-geometric mixing [29, Theorem 16.1.5], and $|v^{\intercal}\hat{f}|^{2}\in L_{\infty}^{V}$, there exist $B<\infty$ and $r\in(0,1)$ such that for each $i,k\in\mathbb{Z}_{+}$,

\bigl|R(i,i+k) - {\sf E}_{z^{\prime}}\bigl[(Z_{i}^{v})^{\dagger}\bigr]{\sf E}_{z^{\prime}}\bigl[Z_{i+k}^{v}\bigr]\bigr| \leq Br^{k}[1+r^{i}V(z^{\prime})]

Consequently,

\sum_{j=1}^{\infty}|R(i,j)| \leq {\sf E}_{z^{\prime}}\bigl[|Z_{i}^{v}|^{2}\bigr] + \bigl|{\sf E}_{z^{\prime}}\bigl[(Z_{i}^{v})^{\dagger}\bigr]\bigr|\sum_{j=1}^{\infty}\bigl|{\sf E}_{z^{\prime}}[Z_{j}^{v}]\bigr|    (80)
    + \sum_{i<j}Br^{j-i}[1+r^{i}V(z^{\prime})] + \sum_{i>j}Br^{i-j}[1+r^{j}V(z^{\prime})]

Given $|v^{\intercal}\hat{f}|^{2}\in L_{\infty}^{V}$, by (24),

{\sf E}_{z^{\prime}}\bigl[|Z_{n}^{v}|^{2}\bigr] \leq {\sf E}_{\pi}\bigl[|Z_{n}^{v}|^{2}\bigr] + B_{V}\bigl\||v^{\intercal}\hat{f}|^{2}\bigr\|_{V}V(z^{\prime})

The Markov chain $\Phi$ is also $\sqrt{V}$-uniformly ergodic. Applying (24) with $\sqrt{V}$, together with $|v^{\intercal}\hat{f}|^{2}\in L_{\infty}^{V}$ (so that $v^{\intercal}\hat{f}\in L_{\infty}^{\sqrt{V}}$), gives

\bigl|{\sf E}_{z^{\prime}}[Z_{n}^{v}]\bigr| \leq B_{\sqrt{V}}\|v^{\intercal}\hat{f}\|_{\sqrt{V}}\sqrt{V(z^{\prime})}\,\rho^{n}

Hence

\bigl|{\sf E}_{z^{\prime}}[(Z_{i}^{v})^{\dagger}]\bigr|\sum_{j=1}^{\infty}\bigl|{\sf E}_{z^{\prime}}[Z_{j}^{v}]\bigr| \leq B_{\sqrt{V}}^{2}\|v^{\intercal}\hat{f}\|^{2}_{\sqrt{V}}V(z^{\prime})\,\rho^{i}\sum_{j=1}^{\infty}\rho^{j} \leq B^{2}_{\sqrt{V}}\|v^{\intercal}\hat{f}\|^{2}_{\sqrt{V}}\frac{\rho}{1-\rho}V(z^{\prime})

The other two terms on the right hand side of (80) are bounded as follows:

\sum_{j>i}Br^{j-i}[1+r^{i}V(z^{\prime})] = \sum_{j>i}B[r^{j-i}+r^{j}V(z^{\prime})] \leq \frac{Br}{1-r}(1+V(z^{\prime}))
\sum_{j<i}Br^{i-j}[1+r^{j}V(z^{\prime})] = \Bigl[\sum_{j<i}Br^{i-j}\Bigr] + BV(z^{\prime})(i-1)r^{i} \leq \frac{Br}{1-r} + BV(z^{\prime})\sup_{i}ir^{i}

where $\sup_{i}ir^{i}$ is finite since $\lim_{n\rightarrow\infty}nr^{n}=0$.

Consequently, there exists a deterministic constant $K^{\prime}$, independent of $z^{\prime}$, such that the largest eigenvalues $\{\lambda_{n}\}$ are uniformly bounded:

\sup_{n}\lambda_{n} \leq K^{\prime}V(z^{\prime})

Combining this with (78) and (79) gives

{\sf E}_{z^{\prime}}\bigl[|\sum_{k=1}^{\infty}\gamma_{k}Z_{k}^{v}|^{2}\bigr] \leq K^{\prime}V(z^{\prime})\sum_{k=1}^{\infty}|\gamma_{k}|^{2}

Therefore,

{\sf E}\Bigl[{\sf E}\bigl[|\sum_{k=n_{0}+2}^{\infty}\gamma_{k-n_{0}-1}Z_{k}^{v}|^{2}\mid\mathcal{F}_{n_{0}+1}\bigr]\Bigm|\Phi_{0}=z\Bigr] \leq K^{\prime}{\sf E}\bigl[V(\Phi_{n_{0}+1})\mid\Phi_{0}=z\bigr]\sum_{k=1}^{\infty}|\gamma_{k}|^{2}

Since $V\in L_{\infty}^{V}$, (24) gives ${\sf E}[V(\Phi_{n_{0}+1})\mid\Phi_{0}=z]\leq\pi(V)+B_{V}V(z)$. The desired conclusion then follows by setting $K=K^{\prime}(B_{V}V(z)+\pi(V))$. ∎

Lemma A.10.

Suppose Assumptions A1–A3 hold and $\Sigma_{\Delta}v\neq 0$. With $\{\widetilde{\theta}_{n}^{\varrho}\}$ updated via (33),

\liminf_{n\rightarrow\infty}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n}^{\varrho}|^{2}] = \infty\,, \qquad \varrho>\varrho_{0}
Proof.

With $n_{0}$ fixed, equation (75) gives a representation of $v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho}$ for each $n\geq n_{0}$. Since $\liminf_{n\rightarrow\infty}\prod_{k=n_{0}+1}^{n}|1+\tilde{\varrho}_{k}\alpha_{k}+\alpha_{k}ui|^{2}=\infty$ (recall that $\tilde{\varrho}_{k}\rightarrow\varrho-\varrho_{0}>0$ and $\sum_{k}\alpha_{k}=\infty$), it suffices to show that $\liminf_{n\rightarrow\infty}{\sf E}[|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{n+1}\beta_{k}v^{\intercal}\Delta_{k}|^{2}]$ is strictly positive.

By Fatou’s lemma,

\liminf_{n\rightarrow\infty}{\sf E}\bigl[|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{n+1}\beta_{k}v^{\intercal}\Delta_{k}|^{2}\bigr] \geq {\sf E}\bigl[\liminf_{n\rightarrow\infty}|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{n+1}\beta_{k}v^{\intercal}\Delta_{k}|^{2}\bigr]
    = {\sf E}\bigl[|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}|^{2}\bigr]
    \geq {\sf Var}\,\bigl(v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\bigr)

where the equality holds by Lemma A.8. By the law of total variance,

{\sf Var}\,\bigl(v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\bigr) \geq {\sf E}\bigl[{\sf Var}\,\bigl(v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr]
    = {\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr]

Apply once more the decomposition based on Poisson’s equation:

v^{\intercal}\Delta_{n} = \Delta_{n+1}^{vm} + Z_{n}^{v} - Z_{n+1}^{v}\,, \qquad n\geq 1\,,

where $Z_{n}^{v}=v^{\intercal}\hat{f}(\Phi_{n})$ and $\Delta_{n+1}^{vm}=Z_{n+1}^{v}-{\sf E}[Z_{n+1}^{v}\mid\mathcal{F}_{n}]$ is a martingale difference. By the variance inequality ${\sf Var}\,(X+Y\mid\mathcal{F}_{n_{0}+1})\leq 2{\sf Var}\,(X\mid\mathcal{F}_{n_{0}+1})+2{\sf Var}\,(Y\mid\mathcal{F}_{n_{0}+1})$, we have

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr]    (81)
    \geq \tfrac{1}{2}{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] - {\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}(Z_{k}^{v}-Z_{k+1}^{v})\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr]

By the law of total variance once more,

{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\bigr) = {\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] + {\sf Var}\,\bigl({\sf E}[\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}]\bigr)

Note that ${\sf E}[\sum_{k=n_{0}+1}^{n}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}]$ converges to zero almost surely as $n\to\infty$. With $\{\beta_{n}\}\in\ell_{2}$ and Jensen's inequality, we have for all $n$,

\bigl|{\sf E}[\sum_{k=n_{0}+1}^{n}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}]\bigr|^{2} \leq \sum_{k=n_{0}+1}^{\infty}|\beta_{k}|^{2}{\sf E}[|\Delta_{k+1}^{vm}|^{2}\mid\mathcal{F}_{n_{0}+1}] < \infty

Then, by the dominated convergence theorem, ${\sf E}\bigl[\bigl|{\sf E}[\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}]\bigr|^{2}\bigr]=0$. Therefore,

{\sf Var}\,\bigl({\sf E}[\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}]\bigr) \leq {\sf E}\bigl[\bigl|{\sf E}[\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}]\bigr|^{2}\bigr] = 0

Hence,

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] = {\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}\Delta_{k+1}^{vm}\bigr) = \sum_{k=n_{0}+1}^{\infty}|\beta_{k}|^{2}\sigma_{k+1}^{2}    (82)

where $\sigma_{n}^{2}={\sf Var}\,(\Delta_{n}^{vm})$.
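The identity (82) rests on the orthogonality of martingale-difference increments. A minimal Monte-Carlo check, using i.i.d. increments (the simplest martingale difference sequence; an illustrative special case with constant $\sigma_{n}\equiv\sigma$) and square-summable complex weights:

```python
# Monte-Carlo check of (82) in the i.i.d. special case: for mean-zero i.i.d.
# increments D_k with Var(D_k) = sigma^2 and complex weights beta_k,
# Var(sum_k beta_k D_k) = sum_k |beta_k|^2 sigma^2.
import numpy as np

rng = np.random.default_rng(2)
K, paths, sigma = 50, 200_000, 1.3
k = np.arange(1, K + 1)
beta = k ** -0.55 * np.exp(0.3j * k)           # square-summable complex weights
D = sigma * rng.standard_normal((paths, K))    # i.i.d. mean-zero increments
S = D @ beta
print(float(np.mean(np.abs(S) ** 2)))          # empirical variance (E S = 0)
print(float(sigma ** 2 * np.sum(np.abs(beta) ** 2)))  # predicted by (82)
```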

For the telescoping term on the right hand side of (81), we have

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}(Z_{k}^{v}-Z_{k+1}^{v})\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] = {\sf E}\bigl[{\sf Var}\,\bigl(\beta_{n_{0}+1}Z_{n_{0}+1}^{v}-\sum_{k=n_{0}+2}^{\infty}(\beta_{k}-\beta_{k+1})Z_{k}^{v}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr]    (83)
    = {\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+2}^{\infty}(\beta_{k}-\beta_{k+1})Z_{k}^{v}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr]

Given $\{\beta_{n}-\beta_{n+1}\}\in\ell_{1}$ by Lemma A.7, Lemma A.9 implies that there exists some constant $K$, independent of $n_{0}$, such that

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+2}^{\infty}(\beta_{k}-\beta_{k+1})Z_{k}^{v}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] \leq K\sum_{k=n_{0}+2}^{\infty}|\beta_{k}-\beta_{k+1}|^{2}

Combining (82) and (83) gives

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] \geq \tfrac{1}{2}\sum_{k=n_{0}+1}^{\infty}|\beta_{k}|^{2}\sigma_{k+1}^{2} - K\sum_{k=n_{0}+2}^{\infty}|\beta_{k}-\beta_{k+1}|^{2}

Since $|v^{\intercal}\hat{f}|^{2}\in L_{\infty}^{V}$ and $\sigma_{n}^{2}\rightarrow\sigma^{2}=v^{\intercal}\Sigma_{\Delta}\overline{v}>0$ at a geometric rate, we may choose $n_{0}$ sufficiently large that Lemma A.7 holds and, moreover, for all $n\geq n_{0}$,

\sigma_{n}^{2}\geq\tfrac{1}{2}\sigma^{2}\,, \qquad \tfrac{1}{4}\sigma^{2}-4K\alpha_{n}^{2}(1+u^{2})\geq\tfrac{1}{8}\sigma^{2}

Then,

{\sf E}\bigl[{\sf Var}\,\bigl(\sum_{k=n_{0}+1}^{\infty}\beta_{k}v^{\intercal}\Delta_{k}\mid\mathcal{F}_{n_{0}+1}\bigr)\bigr] \geq \frac{1}{8}\sigma^{2}\sum_{k=n_{0}+1}^{\infty}|\beta_{k}|^{2}

Therefore,

\liminf_{n\rightarrow\infty}{\sf E}\bigl[|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{n}\beta_{k}v^{\intercal}\Delta_{k}|^{2}\bigr] \geq \frac{1}{8}\sigma^{2}\sum_{k=n_{0}+1}^{\infty}|\beta_{k}|^{2} > 0

The desired conclusion then follows from (75):

\liminf_{n\rightarrow\infty}{\sf E}\bigl[|v^{\intercal}\widetilde{\theta}_{n+1}^{\varrho}|^{2}\bigr] \geq \liminf_{n\rightarrow\infty}\prod_{k=n_{0}+1}^{n}|1+\tilde{\varrho}_{k}\alpha_{k}+\alpha_{k}ui|^{2}\cdot\liminf_{n\rightarrow\infty}{\sf E}\bigl[|v^{\intercal}\widetilde{\theta}_{n_{0}}^{\varrho}+\sum_{k=n_{0}+1}^{n}\beta_{k}v^{\intercal}\Delta_{k}|^{2}\bigr] = \infty \qquad ∎

A.6 Coupling of Deterministic and Random Linear SA

Let $\widehat{\mathcal{A}}:{\sf Z}\rightarrow\mathbb{R}^{d\times d}$ denote the zero-mean solution to the following Poisson equation:

{\sf E}[\widehat{\mathcal{A}}(\Phi_{n+1})\mid\Phi_{n}=z] = \widehat{\mathcal{A}}(z)-\mathcal{A}(z)+A\,, \qquad z\in{\sf Z}

which is a matrix version of (26). Denote $\Delta^{\mathcal{A}}_{n+1}=\widehat{\mathcal{A}}(\Phi_{n+1})-{\sf E}[\widehat{\mathcal{A}}(\Phi_{n+1})\mid\mathcal{F}_{n}]$ (a martingale difference sequence), and $\mathcal{A}_{n}=\widehat{\mathcal{A}}(\Phi_{n})$. Then, from (36),

(A_{n+1}-A)\widetilde{\theta}^{\circ}_{n} = [\Delta^{\mathcal{A}}_{n+2}+\mathcal{A}_{n+1}-\mathcal{A}_{n+2}]\widetilde{\theta}^{\circ}_{n}
    = \Delta^{\mathcal{A}}_{n+2}\widetilde{\theta}^{\circ}_{n} + \mathcal{A}_{n+1}\widetilde{\theta}^{\circ}_{n} - \mathcal{A}_{n+2}\widetilde{\theta}^{\circ}_{n+1} + \mathcal{A}_{n+2}(\widetilde{\theta}^{\circ}_{n+1}-\widetilde{\theta}^{\circ}_{n})
    = \Delta^{\mathcal{A}}_{n+2}\widetilde{\theta}^{\circ}_{n} + [\mathcal{A}_{n+1}\widetilde{\theta}^{\circ}_{n}-\mathcal{A}_{n+2}\widetilde{\theta}^{\circ}_{n+1}] + \alpha_{n+1}\mathcal{A}_{n+2}(A_{n+1}\widetilde{\theta}^{\circ}_{n}+\Delta_{n+1})

The sequence $\{\mathcal{E}_{n}\}$ from (38) can be expressed as the sum

\mathcal{E}_{n} = \mathcal{E}_{n}^{(1)}+\mathcal{E}_{n}^{(2)}+\mathcal{E}_{n}^{(3)}+\mathcal{E}_{n}^{(4)}

where $\mathcal{E}_{n}^{(4)}=-\alpha_{n}\mathcal{A}_{n+1}\widetilde{\theta}^{\circ}_{n}$, and the first three sequences are solutions to the following linear systems:

\mathcal{E}_{n+1}^{(1)} = \mathcal{E}_{n}^{(1)} + \alpha_{n+1}[A\mathcal{E}_{n}^{(1)}+\Delta^{\mathcal{A}}_{n+2}\widetilde{\theta}^{\circ}_{n}]\,, \qquad \mathcal{E}_{0}^{(1)}=0    (84a)
\mathcal{E}_{n+1}^{(2)} = \mathcal{E}_{n}^{(2)} + \alpha_{n+1}[A\mathcal{E}_{n}^{(2)}-\alpha_{n}[I+A]\mathcal{A}_{n+1}\widetilde{\theta}^{\circ}_{n}]\,, \qquad \mathcal{E}_{1}^{(2)}=\mathcal{A}_{1}\widetilde{\theta}^{\circ}_{0}    (84b)
\mathcal{E}_{n+1}^{(3)} = \mathcal{E}_{n}^{(3)} + \alpha_{n+1}[A\mathcal{E}_{n}^{(3)}+\alpha_{n+1}\mathcal{A}_{n+2}(A_{n+1}\widetilde{\theta}^{\circ}_{n}+\Delta_{n+1})]\,, \qquad \mathcal{E}_{0}^{(3)}=0    (84c)

The second recursion arises through the arguments used in the proof of Lemma 2.2.

Recall that $\lambda=-\varrho_{0}+ui$ is an eigenvalue of the matrix $A$ with largest real part. For fixed $0<\varrho<\varrho_{0}$, let $T\geq 0$ denote the unique solution to the Lyapunov equation

[\varrho I+A]T + T[\varrho I+A]^{\intercal} + I = 0    (85)

As previously, the norm of a random vector $E\in\mathbb{R}^{d}$ is defined as $\|E\|_{T}=\sqrt{{\sf E}[E^{\intercal}TE]}$.
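For concreteness, $T$ and the weighted norm are easy to compute numerically. A minimal sketch using SciPy (the matrix $A$ below is an illustrative Hurwitz example, not one from the paper):

```python
# Solve the Lyapunov equation (85), [rho I + A] T + T [rho I + A]' = -I,
# and evaluate ||x||_T = sqrt(x' T x) for a deterministic vector x.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])
rho0 = -np.max(np.linalg.eigvals(A).real)   # here rho_0 = 1
rho = 0.5 * rho0                            # any 0 < rho < rho_0 works
M = rho * np.eye(2) + A                     # Hurwitz since rho < rho_0
T = solve_continuous_lyapunov(M, -np.eye(2))
assert np.all(np.linalg.eigvalsh(T) > 0)    # T is positive definite

def T_norm(x):
    return float(np.sqrt(x @ T @ x))

print(T_norm(np.array([1.0, -1.0])))
```

For a random vector $E$, $\|E\|_{T}$ replaces $x^{\intercal}Tx$ by its expectation, as in the definition above.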

Lemma A.11.

Under Assumptions (A1)–(A4), there exist constants $L_{A.11}$ and $K_{A.11}$ such that, for all $n\geq 1$,

  • (i) For each $1\leq i\leq 3$,

    \|\mathcal{E}_{n+1}^{(i)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L_{A.11}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(i)}\|_{T}^{2} + K_{A.11}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)

  • (ii) For $\mathcal{E}_{n}^{(4)}$,

    \|\mathcal{E}_{n+1}^{(4)}\|_{T}^{2} \leq K_{A.11}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)

The inequality below will be useful in proving Lemma A.11.

Lemma A.12.

For any real numbers $a,b$ and all $c>0$,

(a+b)^{2} \leq (1+c^{-1})a^{2} + (1+c)b^{2}
Proof.

Since $(a+b)^{2}=a^{2}+b^{2}+2ab$, the result follows directly from the inequality

2ab = 2(a/\sqrt{c})(\sqrt{c}\,b) \leq a^{2}/c + cb^{2} \qquad ∎

Proof of Lemma A.11.

First consider $\{\mathcal{E}_{n}^{(1)}\}$, updated via (84a). Since the martingale difference sequence $\Delta^{\mathcal{A}}_{n+2}$ is uncorrelated with $\widetilde{\theta}^{\circ}_{n}$ or $\mathcal{E}_{n}^{(1)}$, we have

\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2} = \|[I+\alpha_{n+1}A]\mathcal{E}_{n}^{(1)}\|_{T}^{2} + \alpha_{n+1}^{2}\|\Delta^{\mathcal{A}}_{n+2}\widetilde{\theta}^{\circ}_{n}\|_{T}^{2}

Using the fact that $T\geq 0$ solves the Lyapunov equation (85) gives

\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L_{1}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(1)}\|_{T}^{2} + \alpha_{n+1}^{2}\|\Delta^{\mathcal{A}}_{n+2}\widetilde{\theta}^{\circ}_{n}\|_{T}^{2}

where $L_{1}=\|A\|_{T}$ (the induced operator norm). With $\widetilde{\theta}^{\circ}_{n}=\mathcal{E}_{n}+\widetilde{\theta}^{\bullet}_{n}$,

\|\Delta^{\mathcal{A}}_{n+2}\widetilde{\theta}^{\circ}_{n}\|_{T}^{2} \leq 2\|\Delta^{\mathcal{A}}_{n+2}\|_{T}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2})

Consequently,

\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L_{1}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(1)}\|_{T}^{2} + K_{1}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2})    (86)

where $K_{1}=\sup_{n}2\|\Delta^{\mathcal{A}}_{n+2}\|_{T}^{2}$ is finite by the $V$-uniform ergodicity of $\Phi$ applied to $\widehat{\mathcal{A}}_{i,j}^{2}$ (recall Thm. 2.1).

For $\{\mathcal{E}_{n}^{(2)}\}$, updated via (84b), using Lemma A.12 with $c=n(n+1)$ gives

\|\mathcal{E}_{n+1}^{(2)}\|_{T}^{2} \leq (1+\alpha_{n}\alpha_{n+1})(1-2\varrho\alpha_{n+1}+L_{1}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(2)}\|_{T}^{2}
    + 2(\alpha_{n}\alpha_{n+1}+\alpha_{n}^{2}\alpha_{n+1}^{2})\|[I+A]\mathcal{A}_{n+1}\|_{T}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2})

We can find $L_{2}$ and $K_{2}$ such that, for all $n\geq 1$,

\alpha_{n+1}^{2}L_{1}^{2} + \alpha_{n}\alpha_{n+1}(1-2\varrho\alpha_{n+1}+L_{1}^{2}\alpha_{n+1}^{2}) \leq L_{2}^{2}\alpha_{n+1}^{2}
2(\alpha_{n}\alpha_{n+1}+\alpha_{n}^{2}\alpha_{n+1}^{2})\|[I+A]\mathcal{A}_{n+1}\|_{T}^{2} \leq K_{2}\alpha_{n+1}^{2}

We then obtain the desired form for the sequence $\{\mathcal{E}_{n}^{(2)}\}$:

\|\mathcal{E}_{n+1}^{(2)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L_{2}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(2)}\|_{T}^{2} + K_{2}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2})    (87)

The same argument applies to $\{\mathcal{E}_{n}^{(3)}\}$ in (84c). Therefore, for some constants $L_{3}$ and $K_{3}$,

\|\mathcal{E}_{n+1}^{(3)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L_{3}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(3)}\|_{T}^{2} + K_{3}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)    (88)

A bound on the final term $\mathcal{E}_{n+1}^{(4)}=-\alpha_{n+1}\mathcal{A}_{n+2}\widetilde{\theta}^{\circ}_{n+1}$ is relatively easy:

\|\mathcal{E}_{n+1}^{(4)}\|_{T}^{2} = \|\alpha_{n+1}\mathcal{A}_{n+2}[\widetilde{\theta}^{\circ}_{n}+\alpha_{n+1}(A_{n+1}\widetilde{\theta}^{\circ}_{n}+\Delta_{n+1})]\|_{T}^{2}
    \leq 2\alpha_{n+1}^{2}\|\mathcal{A}_{n+2}\|_{T}^{2}(\|I+\alpha_{n+1}A_{n+1}\|_{T}^{2}\|\widetilde{\theta}^{\circ}_{n}\|_{T}^{2}+\alpha_{n+1}^{2}\|\Delta_{n+1}\|_{T}^{2})

Hence there exists some constant $K_{4}$ such that

\|\mathcal{E}_{n+1}^{(4)}\|_{T}^{2} \leq K_{4}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)

Taking $L_{A.11}=\max_{i}L_{i}$ and $K_{A.11}=\max_{i}K_{i}$ completes the proof. ∎

The results in Lemma A.11 lead to a rough bound on $\|\widetilde{\theta}^{\circ}_{n}\|_{T}^{2}$, presented next. This intermediate result will be used to establish the refined bound in Thm. 2.7.

Lemma A.13.

Under Assumptions (A1)-(A4),

\limsup_{n\rightarrow\infty}n^{\varrho}\|\widetilde{\theta}^{\circ}_{n}\|_{T}^{2} < \infty\,, \qquad \text{for } \varrho<\varrho_{0} \text{ and } \varrho\leq 1
Proof.

Denote $\mathcal{E}^{\text{tot}}_{n}=\sum_{i=1}^{4}\|\mathcal{E}_{n}^{(i)}\|_{T}^{2}$. By Lemma A.11, we can find $n_{0}\geq 1$ such that $1-2\varrho\alpha_{n+1}+L_{A.11}^{2}\alpha_{n+1}^{2}>0$ for $n\geq n_{0}$ and

\mathcal{E}^{\text{tot}}_{n+1} \leq (1-2\varrho\alpha_{n+1}+L_{A.11}^{2}\alpha_{n+1}^{2})\mathcal{E}^{\text{tot}}_{n} + 4K_{A.11}\alpha_{n+1}^{2}(\|\mathcal{E}_{n}\|_{T}^{2}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)
    \leq (1-2\varrho\alpha_{n+1}+L_{A.11}^{2}\alpha_{n+1}^{2})\mathcal{E}^{\text{tot}}_{n} + 4K_{A.11}\alpha_{n+1}^{2}(4\mathcal{E}^{\text{tot}}_{n}+\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)
    \leq (1-2\varrho\alpha_{n+1}+L_{\text{tot}}^{2}\alpha_{n+1}^{2})\mathcal{E}^{\text{tot}}_{n} + K_{\text{tot}}\alpha_{n+1}^{2}

with $L_{\text{tot}}^{2}=L_{A.11}^{2}+16K_{A.11}$ and $K_{\text{tot}}=\sup_{n}4K_{A.11}(\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}+1)$, which are finite by Lemma A.2 combined with Lemma A.11. Iterating this inequality gives, for $n\geq n_{0}$,

\mathcal{E}^{\text{tot}}_{n+1} \leq \mathcal{E}^{\text{tot}}_{n_{0}}\prod_{k=n_{0}+1}^{n+1}(1-2\varrho\alpha_{k}+L_{\text{tot}}^{2}\alpha_{k}^{2}) + K_{\text{tot}}\sum_{k=n_{0}+1}^{n+1}\alpha_{k}^{2}\prod_{l=k+1}^{n+1}(1-2\varrho\alpha_{l}+L_{\text{tot}}^{2}\alpha_{l}^{2})

By Lemma A.1,

\mathcal{E}^{\text{tot}}_{n+1} \leq \mathcal{E}^{\text{tot}}_{n_{0}}\frac{K_{A.1}\,n_{0}^{2\varrho}}{(n+2)^{2\varrho}} + \frac{K_{A.1}K_{\text{tot}}}{(n+2)^{2\varrho}}\sum_{k=n_{0}+1}^{n+1}k^{2\varrho-2}

The partial sum can be estimated by an integral: with $2\varrho-2\leq 0$,

\sum_{k=n_{0}}^{n+1}k^{2\varrho-2} \leq 1+\int_{n_{0}}^{n+1}r^{2\varrho-2}\,dr = \begin{cases} 1+[(n+1)^{2\varrho-1}-n_{0}^{2\varrho-1}]/(2\varrho-1)\,, & \text{if }\varrho\neq\tfrac{1}{2} \\ 1+\ln(n+1)-\ln(n_{0})\,, & \text{if }\varrho=\tfrac{1}{2} \end{cases}    (89)

Given $\varrho\leq 1$,

n^{\varrho}\mathcal{E}^{\text{tot}}_{n} \leq \mathcal{E}^{\text{tot}}_{n_{0}}\frac{K_{A.1}\,n_{0}^{2\varrho}}{(n+2)^{\varrho}} + \frac{K_{A.1}K_{\text{tot}}}{(n+2)^{\varrho}}\sum_{k=n_{0}+1}^{n+1}k^{2\varrho-2} < \infty

Consequently, $\limsup_{n\rightarrow\infty}n^{\varrho}\|\mathcal{E}_{n}\|_{T}^{2}<\infty$ by the inequality $n^{\varrho}\|\mathcal{E}_{n}\|_{T}^{2}\leq 4n^{\varrho}\mathcal{E}^{\text{tot}}_{n}$. Then we have

n^{\varrho}\|\widetilde{\theta}^{\circ}_{n}\|_{T}^{2} \leq 2n^{\varrho}\|\mathcal{E}_{n}\|_{T}^{2} + 2n^{\varrho}\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}

where $n^{\varrho}\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}\rightarrow 0$ as $n\rightarrow\infty$ by Lemma A.2. Hence $\limsup_{n\rightarrow\infty}n^{\varrho}\|\widetilde{\theta}^{\circ}_{n}\|_{T}^{2}<\infty$. ∎
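As a sanity check on the integral estimate (89) used in the proof above, the following illustrative snippet compares the partial sums of $k^{2\varrho-2}$ with the closed-form bound for representative values of $\varrho$:

```python
# Check (89): sum_{k=n0}^{n+1} k^(2 rho - 2) <= 1 + integral_{n0}^{n+1}
# r^(2 rho - 2) dr, valid because the summand is decreasing for rho <= 1.
import numpy as np

def partial_sum(rho, n0, n):
    k = np.arange(n0, n + 2)
    return float(np.sum(k ** (2 * rho - 2.0)))

def bound(rho, n0, n):
    if rho != 0.5:
        return 1 + ((n + 1) ** (2 * rho - 1) - n0 ** (2 * rho - 1)) / (2 * rho - 1)
    return 1 + np.log(n + 1) - np.log(n0)

for rho in (0.3, 0.5, 0.9):
    s, b = partial_sum(rho, 10, 100_000), bound(rho, 10, 100_000)
    assert s <= b
    print(rho, round(s, 4), round(float(b), 4))
```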

Proof of Thm. 2.7.

First consider $\{\mathcal{E}_{n}^{(2)}\}$, updated via (84b). By the triangle inequality and the inequality $\sqrt{1-x}\leq 1-\tfrac{1}{2}x$,

\|\mathcal{E}_{n+1}^{(2)}\|_{T} \leq \|[I+\alpha_{n+1}A]\mathcal{E}_{n}^{(2)}\|_{T} + \alpha_{n}\alpha_{n+1}\|[I+A]\mathcal{A}_{n+1}\widetilde{\theta}^{\circ}_{n}\|_{T}
    \leq (1-\varrho\alpha_{n+1}+\tfrac{1}{2}L^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(2)}\|_{T} + \alpha_{n+1}^{2+\varrho/2}K

where $L=\|A\|_{T}$ and $K=\sup_{n}2\|[I+A]\mathcal{A}_{n+1}\|_{T}\,\|\widetilde{\theta}^{\circ}_{n}\|_{T}\,(n+1)^{\varrho/2}$, which is finite thanks to Lemma A.13. Hence, by Lemma A.1 once more,

\|\mathcal{E}_{n+1}^{(2)}\|_{T} \leq \|\mathcal{E}_{1}^{(2)}\|_{T}\prod_{k=2}^{n+1}[1-\varrho\alpha_{k}+\tfrac{1}{2}L^{2}\alpha_{k}^{2}] + K\sum_{k=2}^{n+1}\alpha_{k}^{2+\varrho/2}\prod_{l=k+1}^{n+1}[1-\varrho\alpha_{l}+\tfrac{1}{2}L^{2}\alpha_{l}^{2}]
    \leq \|\mathcal{E}_{1}^{(2)}\|_{T}\frac{K_{A.1}}{(n+2)^{\varrho}} + \frac{KK_{A.1}}{(n+2)^{\varrho}}\sum_{k=2}^{n+1}k^{\varrho/2-2}

With $\varrho\leq 1$, we have $\sum_{k=1}^{\infty}k^{\varrho/2-2}\leq\sum_{k=1}^{\infty}k^{-3/2}<\infty$. Hence $\limsup_{n\rightarrow\infty}n^{\varrho}\|\mathcal{E}_{n}^{(2)}\|_{T}<\infty$. Replacing $A_{n+1}\widetilde{\theta}^{\circ}_{n}+\Delta_{n+1}$ with $\widetilde{\theta}^{\circ}_{n+1}-\widetilde{\theta}^{\circ}_{n}$ in (84c), the same argument applies to $\{\mathcal{E}_{n}^{(3)}\}$, giving $\limsup_{n\rightarrow\infty}n^{\varrho}\|\mathcal{E}_{n}^{(3)}\|_{T}<\infty$. The fact that $\limsup_{n\rightarrow\infty}n\|\mathcal{E}_{n+1}^{(4)}\|_{T}<\infty$ follows directly from the definition $\mathcal{E}_{n}^{(4)}=-\alpha_{n}\mathcal{A}_{n+1}\widetilde{\theta}^{\circ}_{n}$ and Lemma A.13. Then we have, for each $2\leq i\leq 4$,

\limsup_{n\rightarrow\infty}n^{\varrho}\|\mathcal{E}_{n}^{(i)}\|_{T} < \infty\,, \qquad \text{for }\varrho<\varrho_{0}\text{ and }\varrho\leq 1    (90)

Now consider the martingale difference part $\{\mathcal{E}_{n}^{(1)}\}$. The following is obtained directly from (84a):

\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(1)}\|_{T}^{2} + \alpha_{n+1}^{2}\|\Delta^{\mathcal{A}}_{n+2}\|_{T}^{2}\|\widetilde{\theta}^{\circ}_{n}\|_{T}^{2}
    \leq (1-2\varrho\alpha_{n+1}+L^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(1)}\|_{T}^{2} + \alpha_{n+1}^{2}\|\Delta^{\mathcal{A}}_{n+2}\|_{T}^{2}\bigl[8\sum_{i=1}^{4}\|\mathcal{E}_{n}^{(i)}\|_{T}^{2}+2\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}\bigr]

From Lemma A.2 we have $\sup_{n}n^{\delta}\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}<\infty$ for $\delta=\min(1,2\varrho)$. Combining this with (90), there exists some constant $K_{\mathcal{M}}$ such that

\|\Delta^{\mathcal{A}}_{n+2}\|_{T}^{2}\bigl[8\sum_{i=2}^{4}\|\mathcal{E}_{n}^{(i)}\|_{T}^{2}+2\|\widetilde{\theta}^{\bullet}_{n}\|_{T}^{2}\bigr] \leq K_{\mathcal{M}}\frac{1}{(n+1)^{\delta}}

Consequently,

\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2} \leq (1-2\varrho\alpha_{n+1}+L_{\mathcal{M}}^{2}\alpha_{n+1}^{2})\|\mathcal{E}_{n}^{(1)}\|_{T}^{2} + K_{\mathcal{M}}\alpha_{n+1}^{2+\delta}

where $L_{\mathcal{M}}^{2}=\sup_{n}\{L^{2}+8\|\Delta^{\mathcal{A}}_{n+2}\|_{T}^{2}\}$. With initial condition $\mathcal{E}_{0}^{(1)}=0$, iterating this inequality gives

\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2} \leq K_{\mathcal{M}}\sum_{k=1}^{n+1}\alpha_{k}^{2+\delta}\prod_{l=k+1}^{n+1}[1-2\varrho\alpha_{l}+L_{\mathcal{M}}^{2}\alpha_{l}^{2}] \leq \frac{K_{\mathcal{M}}K_{A.1}}{(n+2)^{2\varrho}}\sum_{k=1}^{n+1}k^{-(2+\delta-2\varrho)}

With $2+\delta-2\varrho>0$, the partial sum is bounded by an integral, as in (89):

\frac{1}{(n+2)^{2\varrho}}\sum_{k=1}^{n+1}k^{-(2+\delta-2\varrho)} = \begin{cases} O((n+1)^{-2\varrho}), & \text{if }\varrho\leq\tfrac{1}{2}\text{ and }\delta=2\varrho \\ O((n+1)^{-2\varrho}), & \text{if }\tfrac{1}{2}<\varrho<1\text{ and }\delta=1 \\ O((n+1)^{-2}), & \text{if }\varrho>1\text{ and }\delta=1 \end{cases}

Therefore,

  • (i) If $\varrho_{0}\leq 1$, then $\limsup_{n\rightarrow\infty}(n+1)^{2\varrho}\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2}<\infty$ for $\varrho<\varrho_{0}$.

  • (ii) If $\varrho_{0}>1$, then $\limsup_{n\rightarrow\infty}(n+1)^{2}\|\mathcal{E}_{n+1}^{(1)}\|_{T}^{2}<\infty$.

Given that the same convergence rates hold for the other components in (90), the conclusion then follows. ∎