
Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks

Jie Hu
Department of Electrical and Computer Engineering
North Carolina State University
[email protected]
&Vishwaraj Doshi
Data Science & Advanced Analytics
IQVIA Inc.
[email protected]
&Do Young Eun
Department of Electrical and Computer Engineering
North Carolina State University
[email protected]
Abstract

We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a non-linear Markov chain, namely the Self-Repellent Random Walk (SRRW). Defined for any given ‘base’ Markov chain, the SRRW, parameterized by a positive scalar $\alpha$, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves an $O(1/\alpha)$ decrease in the asymptotic variance for sampling. We propose the use of a ‘generalized’ version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate $O(1/\alpha^{2})$, the performance benefit of using SRRW thereby being amplified in the stochastic optimization context. Empirical results support our theoretical findings.

* Equal contributors.

1 Introduction

Stochastic optimization algorithms solve optimization problems of the form

$${\bm{\theta}}^{*}\in\operatorname*{arg\,min}_{{\bm{\theta}}\in{\mathbb{R}}^{d}}f({\bm{\theta}}),\qquad\text{where}~~f({\bm{\theta}})\triangleq\mathbb{E}_{X\sim{\bm{\mu}}}\left[F({\bm{\theta}},X)\right]=\sum_{i\in{\mathcal{N}}}\mu_{i}F({\bm{\theta}},i),\tag{1}$$

with the objective function $f:\mathbb{R}^{d}\to\mathbb{R}$ and $X$ taking values in a finite state space ${\mathcal{N}}$ with distribution ${\bm{\mu}}\triangleq[\mu_{i}]_{i\in{\mathcal{N}}}$. Leveraging partial gradient information per iteration, these algorithms have been recognized for their scalability and efficiency with large datasets (Bottou et al., 2018; Even, 2023). For any given noise sequence $\{X_{n}\}_{n\geq 0}\subset{\mathcal{N}}$ and step size sequence $\{\beta_{n}\}_{n\geq 0}\subset\mathbb{R}_{+}$, most stochastic optimization algorithms can be classified as stochastic approximations (SA) of the form

$${\bm{\theta}}_{n+1}={\bm{\theta}}_{n}+\beta_{n+1}H({\bm{\theta}}_{n},X_{n+1}),\qquad\forall~n\geq 0,\tag{2}$$

where, roughly speaking, $H({\bm{\theta}},i)$ contains gradient information $\nabla_{{\bm{\theta}}}F({\bm{\theta}},i)$, such that ${\bm{\theta}}^{*}$ solves ${\mathbf{h}}({\bm{\theta}})\triangleq\mathbb{E}_{X\sim{\bm{\mu}}}[H({\bm{\theta}},X)]=\sum_{i\in{\mathcal{N}}}\mu_{i}H({\bm{\theta}},i)={\bm{0}}$. Such SA iterations include the well-known stochastic gradient descent (SGD), stochastic heavy ball (SHB) (Gadat et al., 2018; Li et al., 2022), and some SGD-type algorithms employing additional auxiliary variables (Barakat et al., 2021); further illustrations of stochastic optimization algorithms of the form (2) are deferred to Appendix A. These algorithms typically have the stochastic noise term $X_{n}$ generated by i.i.d. random variables with probability distribution ${\bm{\mu}}$ in each iteration. In this paper, we study a stochastic optimization algorithm where the noise sequence governing access to the gradient information is generated from general stochastic processes in place of i.i.d. random variables.
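To make the form (2) concrete, the following is a minimal sketch (ours, not the paper's code) of the generic SA update, with SGD as the special case $H({\bm{\theta}},i)=-\nabla_{{\bm{\theta}}}F({\bm{\theta}},i)$; the callable `grad_F` is a hypothetical user-supplied per-state gradient.

```python
# Minimal sketch of the SA update (2); `grad_F(theta, i)` is a hypothetical callable
# returning the gradient of F(theta, i) with respect to theta.
import numpy as np

def sa_step(theta, X_next, beta, H):
    """One step of (2): theta_{n+1} = theta_n + beta_{n+1} * H(theta_n, X_{n+1})."""
    return theta + beta * H(theta, X_next)

def sgd_H(grad_F):
    """SGD as an instance of (2): H(theta, i) = -grad_theta F(theta, i),
    so that h(theta) = E_{X~mu}[H(theta, X)] = -grad f(theta) vanishes at theta*."""
    return lambda theta, i: -np.asarray(grad_F(theta, i))
```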

This is commonly the case in distributed learning, where $\{X_{n}\}$ is a (typically Markovian) random walk, which should asymptotically sample the gradients from the desired probability distribution ${\bm{\mu}}$. This is equivalent to saying that the random walker's empirical distribution converges to ${\bm{\mu}}$ almost surely (a.s.); that is, ${\mathbf{x}}_{n}\triangleq\frac{1}{n+1}\sum_{k=0}^{n}{\bm{\delta}}_{X_{k}}\xrightarrow[n\to\infty]{a.s.}{\bm{\mu}}$ for any initial $X_{0}\in{\mathcal{N}}$, where ${\bm{\delta}}_{X_{k}}$ is the delta measure whose $X_{k}$'th entry is one, the rest being zero. Such convergence is most commonly achieved by employing the Metropolis-Hastings random walk (MHRW), which can be designed to sample from any target measure ${\bm{\mu}}$ and implemented in a scalable manner (Sun et al., 2018). Unsurprisingly, convergence characteristics of the employed Markov chain affect those of the SA sequence (2), and appear in both finite-time and asymptotic analyses. Finite-time bounds typically involve the second largest eigenvalue in modulus (SLEM) of the Markov chain's transition kernel ${\mathbf{P}}$, which is critically connected to the mixing time of a Markov chain (Levin & Peres, 2017); whereas asymptotic results such as central limit theorems (CLT) involve asymptotic covariance matrices that embed information regarding the entire spectrum of ${\mathbf{P}}$, i.e., all eigenvalues as well as eigenvectors (Brémaud, 2013), which are key to understanding the sampling efficiency of a Markov chain. Thus, the choice of random walker can significantly impact the performance of (2), and simply ensuring that it samples from ${\bm{\mu}}$ asymptotically is not enough to achieve optimal algorithmic performance. In this paper, we take a closer look at the distributed stochastic optimization problem through the lens of a non-linear Markov chain, known as the Self-Repellent Random Walk (SRRW), which was shown in Doshi et al. (2023) to achieve asymptotically minimal sampling variance for large values of $\alpha$, a positive scalar controlling the strength of the random walker's self-repellence behaviour. Our proposed modification of (2) can be implemented within the settings of decentralized learning applications in a scalable manner, while also enjoying significant performance benefits over distributed stochastic optimization algorithms driven by vanilla Markov chains.
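As one illustration of such a base chain, a hedged sketch of an MHRW transition matrix is given below; it uses the standard uniform-neighbor proposal on an undirected graph, which is one common (but not the only) way to obtain a chain with stationary distribution ${\bm{\mu}}$ using only local degree information. The function and variable names are ours.

```python
# Sketch of a Metropolis-Hastings random walk (MHRW) over a graph targeting mu,
# with the uniform-neighbor proposal q(i -> j) = 1/deg(i).
import numpy as np

def mhrw_kernel(A, mu):
    """A: symmetric 0/1 adjacency matrix; mu: target distribution (positive entries)."""
    N = A.shape[0]
    deg = A.sum(axis=1)                      # node degrees
    P = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if A[i, j] and i != j:
                # propose a uniformly random neighbor, accept with the MH probability
                P[i, j] = (1.0 / deg[i]) * min(1.0, (mu[j] * deg[i]) / (mu[i] * deg[j]))
        P[i, i] = 1.0 - P[i].sum()           # self-loop absorbs rejected proposals
    return P
```

For the uniform target $\mu_{i}=1/N$ used later in Section 4, this reduces to the familiar $P_{ij}=\frac{1}{d_{i}}\min\{1,d_{i}/d_{j}\}$ for neighbors $i\neq j$.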

Token Algorithms for Decentralized Learning. In decentralized learning, agents like smartphones or IoT devices, each containing a subset of data, collaboratively train models on a graph ${\mathcal{G}}({\mathcal{N}},{\mathcal{E}})$ by sharing information locally without a central server (McMahan et al., 2017). In this setup, $N=|{\mathcal{N}}|$ agents correspond to nodes $i\in{\mathcal{N}}$, and an edge $(i,j)\in{\mathcal{E}}$ indicates direct communication between agents $i$ and $j$. This decentralized approach offers several advantages compared to the traditional centralized learning setting, promoting data privacy and security by eliminating the need for raw data to be aggregated centrally and thus reducing the risk of data breach or misuse (Bottou et al., 2018; Nedic, 2020). Additionally, decentralized approaches are more scalable and can handle vast amounts of heterogeneous data from distributed agents without overwhelming a central server, alleviating concerns about a single point of failure (Vogels et al., 2021).

Among decentralized learning approaches, the class of ‘token’ algorithms can be expressed as stochastic approximation iterations of the type (2), wherein the sequence $\{X_{n}\}$ is realized as the sample path of a token that stochastically traverses the graph ${\mathcal{G}}$, carrying with it the iterate ${\bm{\theta}}_{n}$ for any time $n\geq 0$ and allowing each visited node (agent) to incrementally update ${\bm{\theta}}_{n}$ using locally available gradient information. Token algorithms have gained popularity in recent years (Hu et al., 2022; Triastcyn et al., 2022; Hendrikx, 2023), and are provably more communication efficient (Even, 2023) when compared to consensus-based algorithms, another popular approach for solving distributed optimization problems (Boyd et al., 2006; Morral et al., 2017; Olshevsky, 2022). The construction of token algorithms means that they do not suffer from the expensive synchronization and communication costs typical of consensus-based approaches, where all agents (or a subset of agents selected by a coordinator (Boyd et al., 2006; Wang et al., 2019)) on the graph are required to take simultaneous actions, such as communicating on the graph at each iteration. While decentralized federated learning has indeed helped mitigate the communication overhead by processing multiple SGD iterations prior to each aggregation (Lalitha et al., 2018; Ye et al., 2022; Chellapandi et al., 2023), it still cannot overcome challenges such as synchronization and straggler issues.

Self-Repellent Random Walk. As mentioned earlier, sample paths $\{X_{n}\}$ of token algorithms are usually generated using Markov chains with ${\bm{\mu}}\in\text{Int}(\Sigma)$ as their limiting distribution. Here, $\Sigma$ denotes the $N$-dimensional probability simplex, with $\text{Int}(\Sigma)$ representing its interior. A recent work by Doshi et al. (2023) pioneers the use of non-linear Markov chains to, in some sense, improve upon any given time-reversible Markov chain with transition kernel ${\mathbf{P}}$ whose stationary distribution is ${\bm{\mu}}$. They show that the non-linear transition kernel ${\mathbf{K}}[\cdot]:\text{Int}(\Sigma)\to[0,1]^{N\times N}$ (non-linearity here meaning that ${\mathbf{K}}[{\mathbf{x}}]$ takes the probability distribution ${\mathbf{x}}$ as its argument (Andrieu et al., 2007), as opposed to the kernel being a linear operator ${\mathbf{K}}[{\mathbf{x}}]={\mathbf{P}}$ for a constant stochastic matrix ${\mathbf{P}}$ in a standard (linear) Markovian setting), given by

$$K_{ij}[{\mathbf{x}}]\triangleq\frac{P_{ij}(x_{j}/\mu_{j})^{-\alpha}}{\sum_{k\in{\mathcal{N}}}P_{ik}(x_{k}/\mu_{k})^{-\alpha}},\qquad\forall~i,j\in{\mathcal{N}},\tag{3}$$

for any ${\mathbf{x}}\in\text{Int}(\Sigma)$, when simulated as a self-interacting random walk (Del Moral & Miclo, 2006; Del Moral & Doucet, 2010), can achieve smaller asymptotic variance than the base Markov chain when sampling over a graph ${\mathcal{G}}$, for all $\alpha>0$. The argument ${\mathbf{x}}$ for the kernel ${\mathbf{K}}[{\mathbf{x}}]$ is taken to be the empirical distribution ${\mathbf{x}}_{n}$ at each time step $n\geq 0$. For instance, if node $j$ has been visited more often than other nodes so far, the entry $x_{j}$ becomes larger (than the target value $\mu_{j}$), resulting in a smaller transition probability from $i$ to $j$ under ${\mathbf{K}}[{\mathbf{x}}]$ in (3) compared to $P_{ij}$. This ensures that the random walker prioritizes more seldom visited nodes, and is thus ‘self-repellent’. This effect becomes more drastic as $\alpha$ increases, and leads to asymptotically near-zero variance at a rate of $O(1/\alpha)$. Moreover, the polynomial function $(x_{i}/\mu_{i})^{-\alpha}$ chosen to encode self-repellent behaviour is shown in Doshi et al. (2023) to be the only one that allows the SRRW to inherit the so-called ‘scale-invariance’ property of the underlying Markov chain, a necessary component for the scalable implementation of a random walker over a large network without requiring knowledge of any graph-related global constants. Taken together, such attributes render the SRRW especially suitable for distributed optimization. (Recently, Guo et al. (2020) introduced an optimization scheme which designs self-repellence into the perturbation of the gradient descent iterates (Jin et al., 2017; 2018; 2021) with the goal of escaping saddle points. This notion of self-repellence is distinct from the SRRW, which is a probability kernel designed specifically for a token to sample from a target distribution ${\bm{\mu}}$ over the nodes of an arbitrary graph.)
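A minimal sketch of the kernel (3) and of the corresponding draw step could look as follows; `P` is the base kernel, `mu` the target, `x` the current empirical measure in $\text{Int}(\Sigma)$, and `alpha` the self-repellence parameter (function names are ours).

```python
# Sketch of the SRRW transition probabilities in (3): row i of K[x] re-weights P_{i.}
# by (x_j / mu_j)^{-alpha} and renormalizes, so frequently visited nodes are penalized.
import numpy as np

def srrw_row(P, mu, x, i, alpha):
    w = P[i] * (x / mu) ** (-alpha)     # unnormalized transition weights from node i
    return w / w.sum()                  # K_{i, .}[x] as in (3)

def srrw_step(rng, P, mu, x, i, alpha):
    """Draw the next state X_{n+1} ~ K_{X_n, .}[x_n]."""
    return rng.choice(len(mu), p=srrw_row(P, mu, x, i, alpha))
```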

Effect of Stochastic Noise - Finite time and Asymptotic Approaches. Most contemporary token algorithms driven by Markov chains are analyzed using the finite-time bounds approach for obtaining insights into their convergence rates (Sun et al., 2018; Doan et al., 2019; 2020; Triastcyn et al., 2022; Hendrikx, 2023). However, as also explained in Even (2023), in most cases these bounds are overly dependent on mixing time properties of the specific Markov chain employed therein. This makes them largely ineffective in capturing the exact contribution of the underlying random walk in a manner which is qualitative enough to be used for algorithm design; and performance enhancements are typically achieved via application of techniques such as variance reduction (Defazio et al., 2014; Schmidt et al., 2017), momentum/Nesterov’s acceleration (Gadat et al., 2018; Li et al., 2022), adaptive step size (Kingma & Ba, 2015; Reddi et al., 2018), which work by modifying the algorithm iterations themselves, and never consider potential improvements to the stochastic input itself.

Complementary to finite-time approaches, asymptotic analysis using CLT has proven to be an excellent tool to approach the design of stochastic algorithms (Hu et al., 2022; Devraj & Meyn, 2017; Morral et al., 2017; Chen et al., 2020a; Mou et al., 2020; Devraj & Meyn, 2021). Hu et al. (2022) shows how asymptotic analysis can be used to compare the performance of SGD algorithms for various stochastic inputs using their notion of efficiency ordering, and, as mentioned in Devraj & Meyn (2017), the asymptotic benefits from minimizing the limiting covariance matrix are known to be a good predictor of finite-time algorithmic performance, also observed empirically in Section 4.

From the perspective of both finite-time and asymptotic analyses of token algorithms, it is now well established that employing ‘better’ Markov chains can enhance the performance of stochastic optimization algorithms. For instance, Markov chains with smaller SLEMs yield tighter finite-time upper bounds (Sun et al., 2018; Ayache & El Rouayheb, 2021; Even, 2023). Similarly, Markov chains with smaller asymptotic variance for MCMC sampling tasks also provide better performance, resulting in a smaller covariance matrix for SGD algorithms (Hu et al., 2022). Therefore, with these breakthrough results via SRRW achieving near-zero sampling variance, it is within reason to ask: Can we achieve near-zero variance in distributed stochastic optimization driven by SRRW-like token algorithms on any general graph? (This near-zero sampling variance implies a significantly smaller variance than even an i.i.d. sampling counterpart, while adhering to the graph topological constraints of token algorithms.) In this paper, we answer in the affirmative.

SRRW Driven Algorithm and Analysis Approach. For any ergodic time-reversible Markov chain with transition probability matrix ${\mathbf{P}}\triangleq[P_{ij}]_{i,j\in{\mathcal{N}}}$ and stationary distribution ${\bm{\mu}}\in\text{Int}(\Sigma)$, we consider a general step size version of the SRRW stochastic process analysed in Doshi et al. (2023) and use it to drive the noise sequence in (2). Our SA-SRRW algorithm is as follows:

$$\text{Draw:}\qquad X_{n+1}\sim{\mathbf{K}}_{X_{n},\cdot}[{\mathbf{x}}_{n}]\tag{4a}$$
$$\text{Update:}\qquad{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\gamma_{n+1}({\bm{\delta}}_{X_{n+1}}-{\mathbf{x}}_{n}),\tag{4b}$$
$$\phantom{\text{Update:}}\qquad{\bm{\theta}}_{n+1}={\bm{\theta}}_{n}+\beta_{n+1}H({\bm{\theta}}_{n},X_{n+1}),\tag{4c}$$

where $\{\beta_{n}\}$ and $\{\gamma_{n}\}$ are step size sequences decreasing to zero, and ${\mathbf{K}}[{\mathbf{x}}]$ is the SRRW kernel in (3). Current non-asymptotic analyses require a globally Lipschitz mean-field function (Chen et al., 2020b; Doan, 2021; Zeng et al., 2021; Even, 2023) and are thus inapplicable to SA-SRRW, since the mean-field function of the SRRW iterates (4b) is only locally Lipschitz (details deferred to Appendix B). Instead, we successfully obtain non-trivial results by taking an asymptotic CLT-based approach to analyze (4). This goes beyond just analyzing the asymptotic sampling covariance (which corresponds to only the empirical distribution ${\mathbf{x}}_{n}$ in (4b)) as in Doshi et al. (2023), the result therein forming a special case of ours by setting $\gamma_{n}=1/(n+1)$ and considering only (4a) and (4b), that is, in the absence of the optimization iteration (4c). Specifically, we capture the effect of SRRW's hyper-parameter $\alpha$ on the asymptotic speed of convergence of the optimization error term ${\bm{\theta}}_{n}-{\bm{\theta}}^{*}$ to zero via explicit deduction of its asymptotic covariance matrix. See Figure 1 for an illustration.
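Putting (4a)-(4c) together, a hedged end-to-end sketch of SA-SRRW (reusing `srrw_step` from the sketch after (3)) is shown below; `H` is the user-chosen mean-field map (e.g., $H({\bm{\theta}},i)=-\nabla_{{\bm{\theta}}}F({\bm{\theta}},i)$ for SGD-SRRW), and the exponents `a`, `b` follow Assumption A2 in Section 2.2.

```python
# Sketch of the SA-SRRW iterations (4a)-(4c); polynomial step sizes as in Assumption A2.
import numpy as np

def sa_srrw(P, mu, H, theta0, a=0.8, b=0.9, alpha=5.0, n_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    N = len(mu)
    x = np.full(N, 1.0 / N)                  # x_0 in Int(Sigma)
    X = rng.integers(N)                      # arbitrary initial node X_0
    theta = np.array(theta0, dtype=float)
    for n in range(n_iters):
        gamma = (n + 2.0) ** (-a)            # gamma_{n+1} = (n+2)^{-a}
        beta = (n + 2.0) ** (-b)             # beta_{n+1} = (n+2)^{-b}
        X = srrw_step(rng, P, mu, x, X, alpha)        # (4a): draw X_{n+1}
        e = np.zeros(N)
        e[X] = 1.0                                    # delta_{X_{n+1}}
        x = x + gamma * (e - x)                       # (4b): SRRW iterate
        theta = theta + beta * H(theta, X)            # (4c): SA iterate
    return theta, x
```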

Figure 1: Visualization of token algorithms using SRRW versus traditional MC in distributed learning. Our CLT analysis, extended from the SRRW itself to distributed stochastic approximation, leads to near-zero variance for the SA iterate ${\bm{\theta}}_{n}$. Node numbers on the left denote visit counts.

Our Contributions.

1. Given any time-reversible ‘base’ Markov chain with transition kernel ${\mathbf{P}}$ and stationary distribution ${\bm{\mu}}$, we generalize the first and second order convergence results of ${\mathbf{x}}_{n}$ to the target measure ${\bm{\mu}}$ (Theorems 4.1 and 4.2 in Doshi et al., 2023) to a class of weighted empirical measures, through the use of more general step sizes $\gamma_{n}$. This includes showing that the asymptotic sampling covariance terms decrease to zero at rate $O(1/\alpha)$, thus quantifying the effect of self-repellence on ${\mathbf{x}}_{n}$. This generalization is not merely for its own sake, and is shown in Section 3 to be crucial for the design of the step sizes $\beta_{n},\gamma_{n}$.

2. Building upon the convergence results for the iterates ${\mathbf{x}}_{n}$, we analyze the algorithm (4) driven by the SRRW kernel in (3), with step sizes $\beta_{n}$ and $\gamma_{n}$ separated into three disjoint cases:

  (i) $\beta_{n}=o(\gamma_{n})$, and we say that ${\bm{\theta}}_{n}$ is on the slower timescale compared to ${\mathbf{x}}_{n}$;

  (ii) $\beta_{n}=\gamma_{n}$, and we say that ${\bm{\theta}}_{n}$ and ${\mathbf{x}}_{n}$ are on the same timescale;

  (iii) $\gamma_{n}=o(\beta_{n})$, and we say that ${\bm{\theta}}_{n}$ is on the faster timescale compared to ${\mathbf{x}}_{n}$.

For any $\alpha\geq 0$, letting $k=1,2$, and $3$ refer to the corresponding cases (i), (ii), and (iii), we show that

$${\bm{\theta}}_{n}\xrightarrow[n\to\infty]{a.s.}{\bm{\theta}}^{*}\qquad\text{and}\qquad({\bm{\theta}}_{n}-{\bm{\theta}}^{*})/\sqrt{\beta_{n}}\xrightarrow[n\to\infty]{dist.}N\left({\bm{0}},{\mathbf{V}}^{(k)}_{{\bm{\theta}}}(\alpha)\right),$$

featuring distinct asymptotic covariance matrices ${\mathbf{V}}^{(1)}_{{\bm{\theta}}}(\alpha)$, ${\mathbf{V}}^{(2)}_{{\bm{\theta}}}(\alpha)$ and ${\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha)$, respectively. The three matrices coincide when $\alpha=0$ (the $\alpha=0$ case is equivalent to simply running the base Markov chain, since from (3) we have ${\mathbf{K}}[\cdot]={\mathbf{P}}$, thus bypassing the SRRW's effect and rendering all three cases nearly the same). Moreover, the derivation of the CLT for cases (i) and (iii), for which (4) corresponds to two-timescale SA with controlled Markov noise, is the first of its kind and thus a key technical contribution of this paper, as expanded upon in Section 3.

3. For case (i), we show that ${\mathbf{V}}^{(1)}_{{\bm{\theta}}}(\alpha)$ decreases to zero (in the sense of the Loewner ordering introduced in Section 2.1) as $\alpha$ increases, at rate $O(1/\alpha^{2})$. This is especially surprising, since the asymptotic performance benefit from using the SRRW kernel with $\alpha$ in (3) to drive the noise terms $X_{n}$ is amplified in the context of distributed learning and estimating ${\bm{\theta}}^{*}$, compared to the sampling case, for which the rate is $O(1/\alpha)$ as mentioned earlier. For case (iii), we show that ${\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha)={\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0)$ for all $\alpha\geq 0$, implying that using the SRRW in this case provides no asymptotic benefit over the original base Markov chain, and thus performs worse than case (i). In summary, we deduce that ${\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha_{2})<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha_{1})<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(0)={\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0)={\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha)$ for all $\alpha_{2}>\alpha_{1}>0$ and $\alpha>0$. (In particular, this is the reason why we advocate for a more general step size $\gamma_{n}=(n+1)^{-a}$ in the SRRW iterates with $a<1$, allowing us to choose $\beta_{n}=(n+1)^{-b}$ with $b\in(a,1]$ to satisfy $\beta_{n}=o(\gamma_{n})$ for case (i).)

4. We numerically simulate our SA-SRRW algorithm on various real-world datasets, focusing on a binary classification task, to evaluate its performance across all three cases. By carefully choosing the function $H$ in SA-SRRW, we test SGD and SHB algorithms driven by the SRRW. Our findings consistently highlight the superiority of case (i) over cases (ii) and (iii) for diverse $\alpha$ values, even in their finite-time performance. Notably, our tests validate the variance reduction at a rate of $O(1/\alpha^{2})$ for case (i), suggesting it as the best algorithmic choice among the three cases.

2 Preliminaries and Model Setup

In Section 2.1, we first standardize the notations used throughout the paper, and define key mathematical terms and quantities used in our theoretical analyses. Then, in Section 2.2, we consolidate the model assumptions of our SA-SRRW algorithm (4). We then go on to discuss our assumptions, and provide additional interpretations of our use of generalized step-sizes.

2.1 Basic Notations and Definitions

Vectors are denoted by lower-case bold letters, e.g., ${\mathbf{v}}\triangleq[v_{i}]\in\mathbb{R}^{D}$, and matrices by upper-case bold letters, e.g., ${\mathbf{M}}\triangleq[M_{ij}]\in\mathbb{R}^{D\times D}$. ${\mathbf{M}}^{-T}$ is the transpose of the matrix inverse ${\mathbf{M}}^{-1}$. The diagonal matrix ${\mathbf{D}}_{{\mathbf{v}}}$ is formed by the vector ${\mathbf{v}}$ with $v_{i}$ as the $i$'th diagonal entry. Let ${\bm{1}}$ and ${\bm{0}}$ denote vectors of all ones and zeros, respectively. The identity matrix is represented by ${\mathbf{I}}$, with subscripts indicating dimensions as needed. A matrix is Hurwitz if all its eigenvalues possess strictly negative real parts. $\mathds{1}_{\{\cdot\}}$ denotes an indicator function with the condition in parentheses. We use $\|\cdot\|$ to denote both the Euclidean norm of vectors and the spectral norm of matrices. Two symmetric matrices ${\mathbf{M}}_{1},{\mathbf{M}}_{2}$ follow the Loewner ordering ${\mathbf{M}}_{1}<_{L}{\mathbf{M}}_{2}$ if ${\mathbf{M}}_{2}-{\mathbf{M}}_{1}$ is positive semi-definite and ${\mathbf{M}}_{1}\neq{\mathbf{M}}_{2}$. This slightly differs from the conventional definition with $\leq_{L}$, which allows ${\mathbf{M}}_{1}={\mathbf{M}}_{2}$.

Throughout the paper, the matrix ${\mathbf{P}}\triangleq[P_{ij}]_{i,j\in{\mathcal{N}}}$ and the vector ${\bm{\mu}}\triangleq[\mu_{i}]_{i\in{\mathcal{N}}}$ are used exclusively to denote an $N\times N$-dimensional transition kernel of an ergodic Markov chain and its stationary distribution, respectively. Without loss of generality, we assume $P_{ij}>0$ if and only if $a_{ij}>0$. Markov chains satisfying the detailed balance equation, where $\mu_{i}P_{ij}=\mu_{j}P_{ji}$ for all $i,j\in{\mathcal{N}}$, are termed time-reversible. For such chains, we use $(\lambda_{i},{\mathbf{u}}_{i})$ (resp. $(\lambda_{i},{\mathbf{v}}_{i})$) to denote the $i$'th left (resp. right) eigenpair, where the eigenvalues are ordered $-1<\lambda_{1}\leq\cdots\leq\lambda_{N-1}<\lambda_{N}=1$, with ${\mathbf{u}}_{N}={\bm{\mu}}$ and ${\mathbf{v}}_{N}={\bm{1}}$ in ${\mathbb{R}}^{N}$. We assume the eigenvectors to be normalized such that ${\mathbf{u}}_{i}^{T}{\mathbf{v}}_{i}=1$ for all $i$, and we have ${\mathbf{u}}_{i}={\mathbf{D}}_{{\bm{\mu}}}{\mathbf{v}}_{i}$ and ${\mathbf{u}}_{i}^{T}{\mathbf{v}}_{j}=0$ for all $i\neq j\in{\mathcal{N}}$. We direct the reader to Aldous & Fill (2002, Chapter 3.4) for a detailed exposition on spectral properties of time-reversible Markov chains.
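For concreteness, these spectral quantities can be computed as in the following sketch (our own helper, not the paper's code): symmetrize ${\mathbf{D}}_{{\bm{\mu}}}^{1/2}{\mathbf{P}}{\mathbf{D}}_{{\bm{\mu}}}^{-1/2}$, take its orthonormal eigenvectors $w_{i}$, and set ${\mathbf{u}}_{i}={\mathbf{D}}_{{\bm{\mu}}}^{1/2}w_{i}$, ${\mathbf{v}}_{i}={\mathbf{D}}_{{\bm{\mu}}}^{-1/2}w_{i}$, which satisfy ${\mathbf{u}}_{i}={\mathbf{D}}_{{\bm{\mu}}}{\mathbf{v}}_{i}$ and ${\mathbf{u}}_{i}^{T}{\mathbf{v}}_{j}=\mathds{1}_{\{i=j\}}$.

```python
# Eigenpairs of a time-reversible P with stationary distribution mu, normalized as in
# Section 2.1. Columns are defined up to sign; we only fix the sign of the last pair.
import numpy as np

def reversible_eigenpairs(P, mu):
    d = np.sqrt(mu)
    S = (P * d[:, None]) / d[None, :]          # D_mu^{1/2} P D_mu^{-1/2}, symmetric
    lam, W = np.linalg.eigh(S)                 # eigenvalues in ascending order, lam[-1] = 1
    U = W * d[:, None]                         # columns are left eigenvectors u_i
    V = W / d[:, None]                         # columns are right eigenvectors v_i
    if U[:, -1].sum() < 0:                     # fix sign so that u_N = mu and v_N = 1
        U[:, -1] *= -1.0
        V[:, -1] *= -1.0
    return lam, U, V
```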

2.2 SA-SRRW: Key Assumptions and Discussions

Assumptions: All results in our paper are proved under the following assumptions.

(A1) The function $H:{\mathbb{R}}^{D}\times{\mathcal{N}}\to{\mathbb{R}}^{D}$ is continuous at every ${\bm{\theta}}\in\mathbb{R}^{D}$, and there exists a positive constant $L$ such that $\|H({\bm{\theta}},i)\|\leq L(1+\|{\bm{\theta}}\|)$ for every ${\bm{\theta}}\in{\mathbb{R}}^{D},i\in{\mathcal{N}}$.

(A2) The step sizes $\beta_{n}$ and $\gamma_{n}$ follow $\beta_{n}=(n+1)^{-b}$ and $\gamma_{n}=(n+1)^{-a}$, where $a,b\in(0.5,1]$.

(A3) The roots of the function ${\mathbf{h}}(\cdot)$ are disjoint and comprise the globally attracting set $\Theta\triangleq\left\{{\bm{\theta}}^{*}~|~{\mathbf{h}}({\bm{\theta}}^{*})={\bm{0}},~\nabla{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}~\text{is Hurwitz}\right\}\neq\emptyset$ of the associated ordinary differential equation (ODE) for iteration (4c), given by $d{\bm{\theta}}(t)/dt={\mathbf{h}}({\bm{\theta}}(t))$.

(A4) For any $({\bm{\theta}}_{0},{\mathbf{x}}_{0},X_{0})\in{\mathbb{R}}^{D}\times\text{Int}(\Sigma)\times{\mathcal{N}}$, the iterate sequence $\{{\bm{\theta}}_{n}\}_{n\geq 0}$ (resp. $\{{\mathbf{x}}_{n}\}_{n\geq 0}$) is ${\mathbb{P}}_{{\bm{\theta}}_{0},{\mathbf{x}}_{0},X_{0}}$-almost surely contained within a compact subset of ${\mathbb{R}}^{D}$ (resp. $\text{Int}(\Sigma)$).

Discussions on Assumptions: Assumption A1 requires $H$ to only be locally Lipschitz, albeit with linear growth, and is less stringent than the globally Lipschitz assumption prevalent in the optimization literature (Li & Wai, 2022; Hendrikx, 2023; Even, 2023).

Assumption A2 is the general umbrella assumption under which cases (i), (ii) and (iii) mentioned in Section 1 are extracted by setting: (i) $a<b$, (ii) $a=b$, and (iii) $a>b$. Cases (i) and (iii) place ${\bm{\theta}}_{n},{\mathbf{x}}_{n}$ on different timescales; the polynomial form of $\beta_{n},\gamma_{n}$ is widely assumed in the two-timescale SA literature (Mokkadem & Pelletier, 2006; Zeng et al., 2021; Hong et al., 2023). Case (ii) characterizes the SA-SRRW algorithm (4) as a single-timescale SA with polynomially decreasing step size, and is among the most common assumptions in the SA literature (Borkar, 2022; Fort, 2015; Li et al., 2023). In all three cases, the form of $\gamma_{n}$ ensures $\gamma_{n}\leq 1$ so that the SRRW iterates ${\mathbf{x}}_{n}$ in (4b) remain within $\text{Int}(\Sigma)$, ensuring that ${\mathbf{K}}[{\mathbf{x}}_{n}]$ is well-defined for all $n\geq 0$.

Under Assumption A3, the limiting dynamics of the SA iterates $\{{\bm{\theta}}_{n}\}_{n\geq 0}$ closely follow the trajectories $\{{\bm{\theta}}(t)\}_{t\geq 0}$ of the associated ODE, and assuming the existence of globally stable equilibria is standard (Borkar, 2022; Fort, 2015; Li et al., 2023). In optimization problems, this is equivalent to assuming the existence of at most countably many local minima.

Assumption A4 assumes almost sure boundedness of the iterates ${\bm{\theta}}_{n}$ and ${\mathbf{x}}_{n}$, which is a common assumption in SA algorithms (Kushner & Yin, 2003; Chen, 2006; Borkar, 2022; Karmakar & Bhatnagar, 2018; Li et al., 2023) for the stability of the SA iterations, ensuring the well-definedness of all quantities involved. Stability of the weighted empirical measure ${\mathbf{x}}_{n}$ of the SRRW process is practically ensured by studying (4b) with a truncation-based procedure (see Doshi et al., 2023, Remark 4.5 and Appendix E for a comprehensive explanation), while that of ${\bm{\theta}}_{n}$ is usually ensured either as a by-product of the algorithm design, or via mechanisms such as projections onto a compact subset of $\mathbb{R}^{D}$, depending on the application context.

We now provide additional discussions regarding the step-size assumptions and their implications on the SRRW iteration (4b).

SRRW with General Step Size: As shown in Benaim & Cloez (2015, Remark 1.1), albeit for a completely different non-linear Markov kernel driving the algorithm therein, the iterates ${\mathbf{x}}_{n}$ of (4b) can also be expressed as weighted empirical measures of $\{X_{n}\}_{n\geq 0}$, in the following form:

$${\mathbf{x}}_{n}=\frac{\sum_{i=1}^{n}\omega_{i}{\bm{\delta}}_{X_{i}}+\omega_{0}{\mathbf{x}}_{0}}{\sum_{i=0}^{n}\omega_{i}},\qquad\text{where}~~\omega_{0}=1,~~\text{and}~~\omega_{n}=\frac{\gamma_{n}}{\prod_{i=1}^{n}(1-\gamma_{i})},\tag{5}$$

for all $n>0$. For the special case when $\gamma_{n}=1/(n+1)$ as in Doshi et al. (2023), we have $\omega_{n}=1$ for all $n\geq 0$ and ${\mathbf{x}}_{n}$ is the typical, unweighted empirical measure. For the additional case considered in our paper, when $a<1$ for $\gamma_{n}$ as in Assumption A2, we can approximate $1-\gamma_{n}\approx e^{-\gamma_{n}}$ and $\omega_{n}\approx n^{-a}e^{n^{(1-a)}/(1-a)}$. This implies that $\omega_{n}$ increases at a sub-exponential rate, giving more weight to recent visit counts and allowing ${\mathbf{x}}_{n}$ to quickly ‘forget’ the poor initial measure ${\mathbf{x}}_{0}$ and shed the correlation with the initial choice of $X_{0}$. This ‘speed up’ effect of setting $a<1$ is guaranteed in case (i) irrespective of the choice of $b$ in Assumption A2, and in Section 3 we show how this can lead to further reduction in the covariance of the optimization error ${\bm{\theta}}_{n}-{\bm{\theta}}^{*}$ in the asymptotic regime.
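A small numerical sketch (ours) of the weights in (5) illustrates this: for $a=1$ every $\omega_{n}$ equals one (the plain empirical measure), while for $a<1$ the weights grow sub-exponentially, so recent visits dominate and ${\mathbf{x}}_{0}$ is quickly forgotten.

```python
# Illustrative computation of the weights omega_n in (5) for gamma_n = (n+1)^{-a}.
import numpy as np

def srrw_weights(n_max, a):
    n = np.arange(1, n_max + 1)
    gamma = (n + 1.0) ** (-a)                      # gamma_1, ..., gamma_{n_max}
    log_prod = np.cumsum(np.log1p(-gamma))         # log of prod_{i <= n} (1 - gamma_i)
    omega = np.concatenate(([1.0], gamma * np.exp(-log_prod)))   # omega_0 = 1
    return omega

# srrw_weights(5, 1.0) -> all ones (unweighted empirical measure);
# srrw_weights(5, 0.8) -> strictly increasing weights favouring recent samples.
```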

Additional assumption for case (iii): Before moving on to Section 3, we take another look at the case when $\gamma_{n}=o(\beta_{n})$, and replace A3 with the following, stronger assumption only for case (iii).

(A3′) For any ${\mathbf{x}}\in\text{Int}(\Sigma)$, there exists a function $\rho:\text{Int}(\Sigma)\to{\mathbb{R}}^{D}$ such that $\|\rho({\mathbf{x}})\|\leq L_{2}(1+\|{\mathbf{x}}\|)$ for some $L_{2}>0$, $\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[H(\rho({\mathbf{x}}),i)]={\bm{0}}$, and $\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[\nabla H(\rho({\mathbf{x}}),i)]+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}$ is Hurwitz.

While Assumption A3′ for case (iii) is much stronger than A3, it is not detrimental to the overall results of our paper, since case (i) is of far greater interest, as impressed upon in Section 1. This is discussed further in Appendix C.

3 Asymptotic Analysis of the SA-SRRW Algorithm

In this section, we provide the main results for the SA-SRRW algorithm (4). We first present the a.s. convergence and the CLT for the SRRW with generalized step size, extending the results in Doshi et al. (2023). Building upon this, we present the a.s. convergence and the CLT for the SA iterates ${\bm{\theta}}_{n}$ under different settings of step sizes. We then shift our focus to the analysis of the different asymptotic covariance matrices emerging out of the CLT result, and capture the effect of $\alpha$ and the step sizes, particularly in cases (i) and (iii), on ${\bm{\theta}}_{n}-{\bm{\theta}}^{*}$ via performance ordering.

Almost Sure Convergence and CLT: The following result establishes first and second order convergence of the sequence $\{{\mathbf{x}}_{n}\}_{n\geq 0}$, which represents the weighted empirical measures of the SRRW process $\{X_{n}\}_{n\geq 0}$, based on the update rule in (4b).

Lemma 3.1.

Under Assumptions A1, A2 and A4, for the SRRW iterates (4b), we have

$${\mathbf{x}}_{n}\xrightarrow[n\to\infty]{a.s.}{\bm{\mu}},\qquad\text{and}\qquad\gamma_{n}^{-1/2}({\mathbf{x}}_{n}-{\bm{\mu}})\xrightarrow[n\to\infty]{dist.}N({\bm{0}},{\mathbf{V}}_{{\mathbf{x}}}(\alpha)),$$
$$\text{where}\quad{\mathbf{V}}_{{\mathbf{x}}}(\alpha)=\sum_{i=1}^{N-1}\frac{1}{2\alpha(1+\lambda_{i})+2-\mathds{1}_{\{a=1\}}}\cdot\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}.\tag{6}$$

Moreover, for all $\alpha_{2}>\alpha_{1}>0$, we have ${\mathbf{V}}_{\mathbf{x}}(\alpha_{2})<_{L}{\mathbf{V}}_{\mathbf{x}}(\alpha_{1})<_{L}{\mathbf{V}}_{\mathbf{x}}(0)$.

Lemma 3.1 shows that the SRRW iterates ${\mathbf{x}}_{n}$ converge to the target distribution ${\bm{\mu}}$ a.s. even under the general step size $\gamma_{n}=(n+1)^{-a}$ for $a\in(0.5,1]$. We also observe that the asymptotic covariance matrix ${\mathbf{V}}_{{\mathbf{x}}}(\alpha)$ decreases at rate $O(1/\alpha)$. Lemma 3.1 aligns with Doshi et al. (2023, Theorem 4.2 and Corollary 4.3) for the special case of $a=1$, and is therefore more general. Critically, it helps us establish our next result regarding the first-order convergence of the optimization iterate sequence $\{{\bm{\theta}}_{n}\}_{n\geq 0}$ following update rule (4c), as well as its second-order convergence result, which follows shortly after. The proofs of Lemma 3.1 and our next result, Theorem 3.2, are deferred to Appendix D. In what follows, $k=1,2$, and $3$ refer to cases (i), (ii), and (iii) in Section 2.2, respectively. All subsequent results are proven under Assumptions A1 to A4, with A3′ replacing A3 only when the step sizes $\beta_{n},\gamma_{n}$ satisfy case (iii).
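For a concrete check of (6), the following sketch (our own code, reusing `reversible_eigenpairs` from the sketch in Section 2.1) assembles ${\mathbf{V}}_{{\mathbf{x}}}(\alpha)$ from the eigenpairs of the base chain; the $i=N$ term (with $\lambda_{N}=1$) is excluded as in the lemma.

```python
# Assemble the asymptotic sampling covariance V_x(alpha) in (6) from the base chain.
import numpy as np

def V_x(P, mu, alpha, a=0.8):
    lam, U, _ = reversible_eigenpairs(P, mu)
    N = len(mu)
    V = np.zeros((N, N))
    for i in range(N - 1):                                    # skip lambda_N = 1
        coeff = (1 + lam[i]) / (1 - lam[i]) \
                / (2 * alpha * (1 + lam[i]) + 2 - (1 if a == 1 else 0))
        V += coeff * np.outer(U[:, i], U[:, i])               # u_i u_i^T term
    return V
```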

Theorem 3.2.

For $k\in\{1,2,3\}$, and any initial $({\bm{\theta}}_{0},{\mathbf{x}}_{0},X_{0})\in\mathbb{R}^{D}\times\text{Int}(\Sigma)\times{\mathcal{N}}$, we have ${\bm{\theta}}_{n}\to{\bm{\theta}}^{*}$ as $n\to\infty$ for some ${\bm{\theta}}^{*}\in\Theta$, ${\mathbb{P}}_{{\bm{\theta}}_{0},{\mathbf{x}}_{0},X_{0}}$-almost surely.

In the stochastic optimization context, the above result ensures convergence of the iterates ${\bm{\theta}}_{n}$ to a local minimizer ${\bm{\theta}}^{*}$. Loosely speaking, the first-order convergence of ${\mathbf{x}}_{n}$ in Lemma 3.1, as well as that of ${\bm{\theta}}_{n}$, is closely related to the convergence of trajectories $\{{\mathbf{z}}(t)\triangleq({\bm{\theta}}(t),{\mathbf{x}}(t))\}_{t\geq 0}$ of the (coupled) mean-field ODE, written in matrix-vector form as

$$\frac{d}{dt}{\mathbf{z}}(t)={\mathbf{g}}({\mathbf{z}}(t))\triangleq\begin{bmatrix}{\mathbf{H}}({\bm{\theta}}(t))^{T}{\bm{\pi}}[{\mathbf{x}}(t)]\\ {\bm{\pi}}[{\mathbf{x}}(t)]-{\mathbf{x}}(t)\end{bmatrix}\in{\mathbb{R}}^{D+N},\tag{7}$$

where the matrix ${\mathbf{H}}({\bm{\theta}})\triangleq[H({\bm{\theta}},1),\cdots,H({\bm{\theta}},N)]^{T}\in{\mathbb{R}}^{N\times D}$ for any ${\bm{\theta}}\in\mathbb{R}^{D}$. Here, ${\bm{\pi}}[{\mathbf{x}}]\in\text{Int}(\Sigma)$ is the stationary distribution of the SRRW kernel ${\mathbf{K}}[{\mathbf{x}}]$ and is shown in Doshi et al. (2023) to be given by $\pi_{i}[{\mathbf{x}}]\propto\sum_{j\in{\mathcal{N}}}\mu_{i}P_{ij}(x_{i}/\mu_{i})^{-\alpha}(x_{j}/\mu_{j})^{-\alpha}$. The Jacobian matrix of (7), when evaluated at the equilibria ${\mathbf{z}}^{*}=({\bm{\theta}}^{*},{\bm{\mu}})$ for ${\bm{\theta}}^{*}\in\Theta$, captures the behaviour of solutions of the mean-field ODE in their vicinity, and plays an important role in the asymptotic covariance matrices arising out of our CLT results. We evaluate this Jacobian matrix ${\mathbf{J}}(\alpha)$, as a function of $\alpha\geq 0$, to be given by

$${\mathbf{J}}(\alpha)\triangleq\nabla{\mathbf{g}}({\mathbf{z}}^{*})=\begin{bmatrix}\nabla{\mathbf{h}}({\bm{\theta}}^{*})&-\alpha{\mathbf{H}}({\bm{\theta}}^{*})^{T}({\mathbf{P}}^{T}+{\mathbf{I}})\\ {\bm{0}}_{N\times D}&2\alpha\bm{\mu}{\bm{1}}^{T}-\alpha{\mathbf{P}}^{T}-(\alpha+1){\mathbf{I}}\end{bmatrix}\triangleq\begin{bmatrix}{\mathbf{J}}_{11}&{\mathbf{J}}_{12}(\alpha)\\ {\mathbf{J}}_{21}&{\mathbf{J}}_{22}(\alpha)\end{bmatrix}.\tag{8}$$

The derivation of ${\mathbf{J}}(\alpha)$ is deferred to Appendix E.1. (The Jacobian ${\mathbf{J}}(\alpha)$ is $(D+N)\times(D+N)$-dimensional, with ${\mathbf{J}}_{11}\in{\mathbb{R}}^{D\times D}$ and ${\mathbf{J}}_{22}(\alpha)\in{\mathbb{R}}^{N\times N}$. Following this, all matrices written in block form, such as the matrix ${\mathbf{U}}$ in (9), will inherit the same dimensional structure.) Here, ${\mathbf{J}}_{21}$ is a zero matrix since ${\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}}$ is devoid of ${\bm{\theta}}$. While the matrix ${\mathbf{J}}_{22}(\alpha)$ is exactly of the form in Doshi et al. (2023, Lemma 3.4) characterizing the SRRW performance, our analysis includes an additional matrix ${\mathbf{J}}_{12}(\alpha)$, which captures the effect of ${\mathbf{x}}(t)$ on ${\bm{\theta}}(t)$ in the ODE (7); this translates to the influence of our generalized SRRW empirical measure ${\mathbf{x}}_{n}$ on the SA iterates ${\bm{\theta}}_{n}$ in (4).
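A sketch (ours) assembling ${\mathbf{J}}(\alpha)$ in (8) at an equilibrium $({\bm{\theta}}^{*},{\bm{\mu}})$ is given below; `grad_h` denotes the $D\times D$ Jacobian $\nabla{\mathbf{h}}({\bm{\theta}}^{*})$ and `H_mat` the $N\times D$ matrix ${\mathbf{H}}({\bm{\theta}}^{*})$, both assumed to be supplied by the user.

```python
# Assemble the block Jacobian J(alpha) of (8) at (theta*, mu).
import numpy as np

def jacobian_J(grad_h, H_mat, P, mu, alpha):
    D, N = grad_h.shape[0], len(mu)
    J11 = grad_h                                                    # D x D
    J12 = -alpha * H_mat.T @ (P.T + np.eye(N))                      # D x N
    J21 = np.zeros((N, D))                                          # N x D (zero block)
    J22 = 2 * alpha * np.outer(mu, np.ones(N)) - alpha * P.T - (alpha + 1) * np.eye(N)
    return np.block([[J11, J12], [J21, J22]])                       # (D+N) x (D+N)
```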

For notational simplicity, and without loss of generality, all our remaining results are stated while conditioning on the event $\{{\bm{\theta}}_{n}\to{\bm{\theta}}^{*}\}$, for some ${\bm{\theta}}^{*}\in\Theta$. We also adopt the shorthand notation ${\mathbf{H}}$ to represent ${\mathbf{H}}({\bm{\theta}}^{*})$. Our main CLT result is as follows, with its proof deferred to Appendix E.

Theorem 3.3.

For any $\alpha\geq 0$, we have: (a) There exists ${\mathbf{V}}^{(k)}(\alpha)$ for all $k\in\{1,2,3\}$ such that

$$\begin{bmatrix}\beta_{n}^{-1/2}({\bm{\theta}}_{n}-{\bm{\theta}}^{*})\\ \gamma_{n}^{-1/2}({\mathbf{x}}_{n}-{\bm{\mu}})\end{bmatrix}\xrightarrow[n\to\infty]{\text{dist.}}N\left({\bm{0}},{\mathbf{V}}^{(k)}(\alpha)\right).$$

(b) For $k=2$, the matrix ${\mathbf{V}}^{(2)}(\alpha)$ solves the Lyapunov equation ${\mathbf{J}}(\alpha){\mathbf{V}}^{(2)}(\alpha)+{\mathbf{V}}^{(2)}(\alpha){\mathbf{J}}(\alpha)^{T}+\mathds{1}_{\{b=1\}}{\mathbf{V}}^{(2)}(\alpha)=-{\mathbf{U}}$, where the Jacobian matrix ${\mathbf{J}}(\alpha)$ is in (8), and

$${\mathbf{U}}\triangleq\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}\cdot\begin{bmatrix}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}&{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}\\ {\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}&{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}\end{bmatrix}\triangleq\begin{bmatrix}{\mathbf{U}}_{11}&{\mathbf{U}}_{12}\\ {\mathbf{U}}_{21}&{\mathbf{U}}_{22}\end{bmatrix}.\tag{9}$$

(c) For $k\in\{1,3\}$, ${\mathbf{V}}^{(k)}(\alpha)$ becomes block diagonal, given by

$${\mathbf{V}}^{(k)}(\alpha)=\begin{bmatrix}{\mathbf{V}}^{(k)}_{{\bm{\theta}}}(\alpha)&{\bm{0}}_{D\times N}\\ {\bm{0}}_{N\times D}&{\mathbf{V}}_{{\mathbf{x}}}(\alpha)\end{bmatrix},\tag{10}$$

where ${\mathbf{V}}_{{\mathbf{x}}}(\alpha)$ is as in (6), and ${\mathbf{V}}^{(1)}_{{\bm{\theta}}}(\alpha)$ and ${\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha)$ can be written in the following explicit form:

$${\mathbf{V}}^{(1)}_{{\bm{\theta}}}(\alpha)=\int_{0}^{\infty}e^{t(\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}})}\,{\mathbf{U}}_{{\bm{\theta}}}(\alpha)\,e^{t(\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}})^{T}}dt,$$
$${\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha)=\int_{0}^{\infty}e^{t\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})}\,{\mathbf{U}}_{11}\,e^{t\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})^{T}}dt,$$
$$\text{where}\quad{\mathbf{U}}_{{\bm{\theta}}}(\alpha)=\sum_{i=1}^{N-1}\frac{1}{(\alpha(1+\lambda_{i})+1)^{2}}\cdot\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}.\tag{11}$$

For $k\in\{1,3\}$, SA-SRRW in (4) is a two-timescale SA with controlled Markov noise. While a few works study the CLT of two-timescale SA with the stochastic input being martingale-difference (i.i.d.) noise (Konda & Tsitsiklis, 2004; Mokkadem & Pelletier, 2006), a CLT result covering the case of controlled Markov noise (e.g., $k\in\{1,3\}$), a far more general setting than martingale-difference noise, is still an open problem. Thus, we here prove our CLT for $k\in\{1,3\}$ from scratch by a series of careful decompositions of the Markovian noise, ultimately into a martingale-difference term and several non-vanishing noise terms, through repeated application of the Poisson equation (Benveniste et al., 2012; Fort, 2015). Although the form of the resulting asymptotic covariance looks similar to that for the martingale-difference case in (Konda & Tsitsiklis, 2004; Mokkadem & Pelletier, 2006) at first glance, they are not equivalent. Specifically, ${\mathbf{V}}^{(k)}_{{\bm{\theta}}}(\alpha)$ captures both the effect of the SRRW hyper-parameter $\alpha$, as well as that of the underlying base Markov chain via the eigenpairs $(\lambda_{i},{\mathbf{u}}_{i})$ of its transition probability matrix ${\mathbf{P}}$ in the matrix ${\mathbf{U}}$, whereas the latter only covers the martingale-difference noise terms as a special case.

When $k=2$, that is, $\beta_{n}=\gamma_{n}$, algorithm (4) can be regarded as a single-timescale SA algorithm. In this case, we utilize the CLT in Fort (2015, Theorem 2.1) to obtain the implicit form of ${\mathbf{V}}^{(2)}(\alpha)$ as shown in Theorem 3.3. However, ${\mathbf{J}}_{12}(\alpha)$ being non-zero for $\alpha>0$ prevents us from obtaining an explicit form for the covariance term corresponding to the SA iterate errors ${\bm{\theta}}_{n}-{\bm{\theta}}^{*}$. On the other hand, for $k\in\{1,3\}$, the two-timescale structure causes ${\bm{\theta}}_{n}$ and ${\mathbf{x}}_{n}$ to become asymptotically independent with zero correlation terms inside ${\mathbf{V}}^{(k)}(\alpha)$ in (10), and we can explicitly deduce ${\mathbf{V}}^{(k)}_{{\bm{\theta}}}(\alpha)$. We now take a deeper dive into $\alpha$ and study its effect on ${\mathbf{V}}^{(k)}_{{\bm{\theta}}}(\alpha)$.
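As an illustration of how (11) can be evaluated numerically, the following hedged sketch (ours) builds ${\mathbf{U}}_{{\bm{\theta}}}(\alpha)$ from the base chain's eigenpairs and then solves the Lyapunov equation $A{\mathbf{V}}+{\mathbf{V}}A^{T}=-{\mathbf{U}}_{{\bm{\theta}}}(\alpha)$ with $A=\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}$, which is what the matrix integral for ${\mathbf{V}}^{(1)}_{{\bm{\theta}}}(\alpha)$ evaluates to when $A$ is Hurwitz (as guaranteed by A3).

```python
# Numerically evaluate V_theta^{(1)}(alpha) from (11) via a Lyapunov solve.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def V_theta_case1(grad_h, H_mat, P, mu, alpha, b=0.9):
    lam, U, _ = reversible_eigenpairs(P, mu)          # sketch from Section 2.1
    D, N = grad_h.shape[0], len(mu)
    U_theta = np.zeros((D, D))
    for i in range(N - 1):                            # skip lambda_N = 1
        scale = (1 + lam[i]) / (1 - lam[i]) / (alpha * (1 + lam[i]) + 1) ** 2
        Hu = H_mat.T @ U[:, i]                        # H^T u_i, a D-vector
        U_theta += scale * np.outer(Hu, Hu)
    A = grad_h + (0.5 if b == 1 else 0.0) * np.eye(D) # Hurwitz under A3
    return solve_continuous_lyapunov(A, -U_theta)     # solves A V + V A^T = -U_theta
```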

Covariance Ordering of SA-SRRW: We refer the reader to Appendix F for the proofs of all remaining results. We begin by focusing on case (i) and capturing the impact of $\alpha$ on ${\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha)$.

Proposition 3.4.

For all $\alpha_{2}>\alpha_{1}>0$, we have ${\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha_{2})<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha_{1})<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(0)$. Furthermore, ${\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha)$ decreases to zero at a rate of $O(1/\alpha^{2})$.

Proposition 3.4 proves a monotonic reduction (in terms of Loewner ordering) of ${\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha)$ as $\alpha$ increases. Moreover, the decrease rate $O(1/\alpha^{2})$ surpasses the $O(1/\alpha)$ rate seen in ${\mathbf{V}}_{{\mathbf{x}}}(\alpha)$ and the sampling application in Doshi et al. (2023, Corollary 4.7), and is also empirically observed in our simulations in Section 4. (Further insight into the $O(1/\alpha^{2})$ rate is tied to the two-timescale structure, particularly $\beta_{n}=o(\gamma_{n})$ in case (i), which places ${\bm{\theta}}_{n}$ on the slow timescale so that the correlation terms ${\mathbf{J}}_{12}(\alpha),{\mathbf{J}}_{22}(\alpha)$ in the Jacobian matrix ${\mathbf{J}}(\alpha)$ in (8) come into play; technical details are referred to Appendix E.2, where we show the form of ${\mathbf{U}}_{{\bm{\theta}}}(\alpha)$.) Suppose we consider the same SA now driven by an i.i.d. sequence $\{X_{n}\}$ with the same marginal distribution ${\bm{\mu}}$. Then, our Proposition 3.4 asserts that a token algorithm employing the SRRW (a walk on a graph) with large enough $\alpha$ on a general graph can actually produce better SA iterates, with its asymptotic covariance going down to zero, than a ‘hypothetical situation’ where the walker is able to access any node $j$ with probability $\mu_{j}$ from anywhere in one step (more like a random jumper). This can be seen by noting that for large time $n$, the scaled MSE $\mathbb{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|^{2}]/\beta_{n}$ is composed of the diagonal entries of the covariance matrix ${\mathbf{V}}_{{\bm{\theta}}}$, which, as we discuss in detail in Appendix F.2, are decreasing in $\alpha$ as a consequence of the Loewner ordering in Proposition 3.4. For large enough $\alpha$, the scaled MSE for SA-SRRW becomes smaller than its i.i.d. counterpart, which is always a constant. Although Doshi et al. (2023) alluded to this for sampling applications with ${\mathbf{V}}_{{\mathbf{x}}}(\alpha)$, we broaden its horizons to the distributed optimization problem with ${\mathbf{V}}_{{\bm{\theta}}}(\alpha)$ using tokens on graphs. Our subsequent result concerns the performance comparison between cases (i) and (iii).

Corollary 3.5.

For any $\alpha>0$, we have ${\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha)<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha)={\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0)$.

We show that case (i) is asymptotically better than case (iii) for $\alpha>0$. In view of Proposition 3.4 and Corollary 3.5, the advantages of case (i) become prominent.

4 Simulation

In this section, we simulate our SA-SRRW algorithm on the wikiVote graph (Leskovec & Krevl, 2014), comprising 889 nodes and 2914 edges. We configure the SRRW's base Markov chain ${\mathbf{P}}$ as the MHRW with a uniform target distribution ${\bm{\mu}}=\frac{1}{N}{\bm{1}}$. For distributed optimization, we consider the following $L_{2}$-regularized binary classification problem:

$$\min_{{\bm{\theta}}\in{\mathbb{R}}^{D}}\left\{f({\bm{\theta}})\triangleq\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{{\bm{\theta}}^{T}{\mathbf{s}}_{i}}\right)-y_{i}\left({\bm{\theta}}^{T}{\mathbf{s}}_{i}\right)+\frac{\kappa}{2}\|{\bm{\theta}}\|^{2}\right\},\tag{12}$$

where $\{({\mathbf{s}}_{i},y_{i})\}_{i=1}^{N}$ is the ijcnn1 dataset (with 22 features, i.e., ${\mathbf{s}}_{i}\in{\mathbb{R}}^{22}$) from LIBSVM (Chang & Lin, 2011), and the penalty parameter is $\kappa=1$. Each node in the wikiVote graph is assigned one data point, thus 889 data points in total. We run the SRRW-driven SGD (SGD-SRRW) and SRRW-driven stochastic heavy ball (SHB-SRRW) algorithms (see (13) in Appendix A for its algorithm). We fix the step size $\beta_{n}=(n+1)^{-0.9}$ for the SA iterates and adjust $\gamma_{n}=(n+1)^{-a}$ in the SRRW iterates to cover all three cases discussed in this paper: (i) $a=0.8$; (ii) $a=0.9$; (iii) $a=1$. We use the mean square error (MSE), i.e., $\mathbb{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|^{2}]$, to measure the error of the SA iterates.
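For reference, a sketch (ours; assuming labels $y_{i}\in\{0,1\}$ consistent with the loss in (12)) of the per-node loss $F({\bm{\theta}},i)$ and of the map $H({\bm{\theta}},i)=-\nabla_{{\bm{\theta}}}F({\bm{\theta}},i)$ used by SGD-SRRW:

```python
# Per-node objective of (12) and its negated gradient; (s_i, y_i) is node i's data point.
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

def F(theta, s_i, y_i, kappa=1.0):
    z = theta @ s_i
    return np.logaddexp(0.0, z) - y_i * z + 0.5 * kappa * theta @ theta

def H_sgd(theta, s_i, y_i, kappa=1.0):
    """H(theta, i) = -grad F(theta, i) = -((sigmoid(theta^T s_i) - y_i) s_i + kappa*theta)."""
    z = theta @ s_i
    return -((expit(z) - y_i) * s_i + kappa * theta)
```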

Our results are presented in Figures 2 and 3, where each experiment is repeated 100 times. Figures 2(a) and 2(b), based on the wikiVote graph, highlight the consistent performance ordering across different $\alpha$ values for both algorithms over almost all time (not just asymptotically). Notably, curves for $\alpha\geq 5$ outperform that of i.i.d. sampling (in black) even under the graph constraints. Figure 2(c), on the smaller Dolphins graph (Rossi & Ahmed, 2015) with 62 nodes and 159 edges, illustrates that the $(\alpha,\text{MSE})$ pairs arising from SGD-SRRW at time $n=10^{7}$ align with a curve of the form $g(x)=\frac{c_{1}}{(x+c_{2})^{2}}+c_{3}$, showcasing the $O(1/\alpha^{2})$ rate. This smaller graph allows for longer simulations to observe the asymptotic behaviour. Additionally, among the three cases examined at identical $\alpha$ values, Figures 3(a) - 3(c) confirm that case (i) performs consistently better than the rest, underscoring its superiority in practice. Further results, including those from non-convex functions and additional datasets, are deferred to Appendix H due to space constraints.

Figure 2: Simulation results under case (i): (a) SGD-SRRW and (b) SHB-SRRW for various $\alpha$ values; (c) curve fitting result for MSE, showing that the MSE decreases at $O(1/\alpha^{2})$ speed.
Figure 3: Comparison of the performance among cases (i) - (iii) for SGD-SRRW with $\alpha\in\{1,5,10\}$: (a) $\alpha=1$; (b) $\alpha=5$; (c) $\alpha=10$.

5 Conclusion

In this paper, we show both theoretically and empirically that the SRRW as a drop-in replacement for Markov chains can provide significant performance improvements when used for token algorithms, where the acceleration comes purely from the careful analysis of the stochastic input of the algorithm, without changing the optimization iteration itself. Our paper is an instance where the asymptotic analysis approach allows the design of better algorithms despite the usage of unconventional noise sequences such as nonlinear Markov chains like the SRRW, for which traditional finite-time analytical approaches fall short, thus advocating their wider adoption.

References

  • Aldous & Fill (2002) David Aldous and James Allen Fill. Reversible markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/~aldous/RWG/book.html.
  • Andrieu et al. (2007) Christophe Andrieu, Ajay Jasra, Arnaud Doucet, and Pierre Del Moral. Non-linear markov chain monte carlo. In Esaim: Proceedings, volume 19, pp.  79–84. EDP Sciences, 2007.
  • Ayache & El Rouayheb (2021) Ghadir Ayache and Salim El Rouayheb. Private weighted random walk stochastic gradient descent. IEEE Journal on Selected Areas in Information Theory, 2(1):452–463, 2021.
  • Barakat & Bianchi (2021) Anas Barakat and Pascal Bianchi. Convergence and dynamical behavior of the adam algorithm for nonconvex stochastic optimization. SIAM Journal on Optimization, 31(1):244–274, 2021.
  • Barakat et al. (2021) Anas Barakat, Pascal Bianchi, Walid Hachem, and Sholom Schechtman. Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance. Electronic Journal of Statistics, 15(2):3892–3947, 2021.
  • Benaim & Cloez (2015) M Benaim and Bertrand Cloez. A stochastic approximation approach to quasi-stationary distributions on finite spaces. Electronic Communications in Probability, 20(37):1–14, 2015.
  • Benveniste et al. (2012) Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.
  • Borkar (2022) V.S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint: Second Edition. Texts and Readings in Mathematics. Hindustan Book Agency, 2022. ISBN 9788195196111.
  • Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.
  • Boyd et al. (2006) Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
  • Brémaud (2013) Pierre Brémaud. Markov chains: Gibbs fields, Monte Carlo simulation, and queues, volume 31. Springer Science & Business Media, 2013.
  • Chang & Lin (2011) Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
  • Chellaboina & Haddad (2008) VijaySekhar Chellaboina and Wassim M Haddad. Nonlinear dynamical systems and control: A Lyapunov-based approach. Princeton University Press, 2008.
  • Chellapandi et al. (2023) Vishnu Pandi Chellapandi, Antesh Upadhyay, Abolfazl Hashemi, and Stanislaw H Zak. On the convergence of decentralized federated learning under imperfect information sharing. arXiv preprint arXiv:2303.10695, 2023.
  • Chen (2006) Han-Fu Chen. Stochastic approximation and its applications, volume 64. Springer Science & Business Media, 2006.
  • Chen et al. (2020a) Shuhang Chen, Adithya Devraj, Ana Busic, and Sean Meyn. Explicit mean-square error bounds for monte-carlo and linear stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp.  4173–4183. PMLR, 2020a.
  • Chen et al. (2020b) Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874, 2020b.
  • Chen et al. (2022) Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
  • Davis (1970) Burgess Davis. On the integrability of the martingale square function. Israel Journal of Mathematics, 8:187–190, 1970.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, volume 1, 2014.
  • Del Moral & Doucet (2010) Pierre Del Moral and Arnaud Doucet. Interacting markov chain monte carlo methods for solving nonlinear measure-valued equations1. The Annals of Applied Probability, 20(2):593–639, 2010.
  • Del Moral & Miclo (2006) Pierre Del Moral and Laurent Miclo. Self-interacting markov chains. Stochastic Analysis and Applications, 24(3):615–660, 2006.
  • Delyon (2000) Bernard Delyon. Stochastic approximation with decreasing gain: Convergence and asymptotic theory. Technical report, Université de Rennes, 2000.
  • Delyon et al. (1999) Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the em algorithm. Annals of statistics, pp.  94–128, 1999.
  • Devraj & Meyn (2017) Adithya M Devraj and Sean P Meyn. Zap q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  2232–2241, 2017.
  • Devraj & Meyn (2021) Adithya M. Devraj and Sean P. Meyn. Q-learning with uniformly bounded variance. IEEE Transactions on Automatic Control, 2021.
  • Doan et al. (2019) Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed td (0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1626–1635. PMLR, 2019.
  • Doan (2021) Thinh T Doan. Finite-time convergence rates of nonlinear two-time-scale stochastic approximation under markovian noise. arXiv preprint arXiv:2104.01627, 2021.
  • Doan et al. (2020) Thinh T Doan, Lam M Nguyen, Nhan H Pham, and Justin Romberg. Convergence rates of accelerated markov gradient descent with applications in reinforcement learning. arXiv preprint arXiv:2002.02873, 2020.
  • Doshi et al. (2023) Vishwaraj Doshi, Jie Hu, and Do Young Eun. Self-repellent random walks on general graphs–achieving minimal sampling variance via nonlinear markov chains. In International Conference on Machine Learning. PMLR, 2023.
  • Duflo (1996) Marie Duflo. Algorithmes stochastiques, volume 23. Springer, 1996.
  • Even (2023) Mathieu Even. Stochastic gradient descent under markovian sampling schemes. In International Conference on Machine Learning, 2023.
  • Fort (2015) Gersende Fort. Central limit theorems for stochastic approximation with controlled markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80, 2015.
  • Gadat et al. (2018) Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics, 12:461–529, 2018.
  • Guo et al. (2020) Xin Guo, Jiequn Han, Mahan Tajrobehkar, and Wenpin Tang. Escaping saddle points efficiently with occupation-time-adapted perturbations. arXiv preprint arXiv:2005.04507, 2020.
  • Hall et al. (2014) P. Hall, C.C. Heyde, Z.W. Birnbaum, and E. Lukacs. Martingale Limit Theory and Its Application. Elsevier Science, 2014.
  • Hendrikx (2023) Hadrien Hendrikx. A principled framework for the design and analysis of token algorithms. In International Conference on Artificial Intelligence and Statistics, pp.  470–489. PMLR, 2023.
  • Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
  • Hu et al. (2022) Jie Hu, Vishwaraj Doshi, and Do Young Eun. Efficiency ordering of stochastic gradient descent. In Advances in Neural Information Processing Systems, 2022.
  • Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pp. 1724–1732. PMLR, 2017.
  • Jin et al. (2018) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pp.  1042–1085. PMLR, 2018.
  • Jin et al. (2021) Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM (JACM), 68(2):1–29, 2021.
  • Karimi et al. (2019) Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pp.  1944–1974. PMLR, 2019.
  • Karmakar & Bhatnagar (2018) Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with controlled markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1):130–151, 2018.
  • Khaled & Richtárik (2023) Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Konda & Tsitsiklis (2004) Vijay R Konda and John N Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.
  • Kushner & Yin (2003) Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
  • Lalitha et al. (2018) Anusha Lalitha, Shubhanshu Shekhar, Tara Javidi, and Farinaz Koushanfar. Fully decentralized federated learning. In Advances in neural information processing systems, 2018.
  • Leskovec & Krevl (2014) Jure Leskovec and Andrej Krevl. Snap datasets: Stanford large network dataset collection, 2014.
  • Levin & Peres (2017) David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • Li & Wai (2022) Qiang Li and Hoi-To Wai. State dependent performative prediction with stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp.  3164–3186. PMLR, 2022.
  • Li et al. (2022) Tiejun Li, Tiannan Xiao, and Guoguo Yang. Revisiting the central limit theorems for the sgd-type methods. arXiv preprint arXiv:2207.11755, 2022.
  • Li et al. (2023) Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with markovian data. arXiv preprint arXiv:2302.07690, 2023.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.  1273–1282. PMLR, 2017.
  • Meyn (2022) Sean Meyn. Control systems and reinforcement learning. Cambridge University Press, 2022.
  • Mokkadem & Pelletier (2005) Abdelkader Mokkadem and Mariane Pelletier. The compact law of the iterated logarithm for multivariate stochastic approximation algorithms. Stochastic analysis and applications, 23(1):181–203, 2005.
  • Mokkadem & Pelletier (2006) Abdelkader Mokkadem and Mariane Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.
  • Morral et al. (2017) Gemma Morral, Pascal Bianchi, and Gersende Fort. Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
  • Mou et al. (2020) Wenlong Mou, Chris Junchi Li, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. On linear stochastic approximation: Fine-grained polyak-ruppert and non-asymptotic concentration. In Conference on Learning Theory, pp.  2947–2997. PMLR, 2020.
  • Nedic (2020) Angelia Nedic. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020.
  • Olshevsky (2022) Alex Olshevsky. Asymptotic network independence and step-size for a distributed subgradient method. Journal of Machine Learning Research, 23(69):1–32, 2022.
  • Pelletier (1998) Mariane Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic processes and their applications, 78(2):217–244, 1998.
  • Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
  • Rossi & Ahmed (2015) Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
  • Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017.
  • Sun et al. (2018) Tao Sun, Yuejiao Sun, and Wotao Yin. On markov chain gradient descent. In Advances in neural information processing systems, volume 31, 2018.
  • Triastcyn et al. (2022) Aleksei Triastcyn, Matthias Reisser, and Christos Louizos. Decentralized learning with random walks and communication-efficient adaptive optimization. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
  • Vogels et al. (2021) Thijs Vogels, Lie He, Anastasiia Koloskova, Sai Praneeth Karimireddy, Tao Lin, Sebastian U Stich, and Martin Jaggi. Relaysum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems, volume 34, pp.  28004–28015, 2021.
  • Wang et al. (2019) Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. Matcha: Speeding up decentralized sgd via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), pp.  299–300. IEEE, 2019.
  • Yaji & Bhatnagar (2020) Vinayaka G Yaji and Shalabh Bhatnagar. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent markov noise. Mathematics of Operations Research, 45(4):1405–1444, 2020.
  • Ye et al. (2022) Hao Ye, Le Liang, and Geoffrey Ye Li. Decentralized federated learning with unreliable communications. IEEE Journal of Selected Topics in Signal Processing, 16(3):487–500, 2022.
  • Zeng et al. (2021) Sihan Zeng, Thinh T Doan, and Justin Romberg. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. arXiv preprint arXiv:2109.14756, 2021.

Appendix A Examples of Stochastic Algorithms of the form (2).

In the stochastic optimization literature, many SGD variants have been proposed that introduce an auxiliary variable to improve convergence. In what follows, we present two such variants with decreasing step size that can be written in the form of (2): SHB (Gadat et al., 2018; Li et al., 2022) and a momentum-based algorithm (Barakat et al., 2021; Barakat & Bianchi, 2021).

{𝜽n+1=𝜽nβn+1𝐦n𝐦n+1=𝐦n+βn+1(F(𝜽n,Xn+1)𝐦n),{𝐯n+1=𝐯n+βn+1(F(𝜽n,Xn+1)2𝐯n),𝐦n+1=𝐦n+βn+1(F(𝜽n,Xn+1)𝐦n),𝜽n+1=𝜽nβn+1𝐦n/𝐯n+ϵ,(a). SHB(b). Momentum-based Algorithm\begin{split}&\begin{cases}\!{\bm{\theta}}_{n+1}\!=\!{\bm{\theta}}_{n}\!-\!\beta_{n+1}{\mathbf{m}}_{n}\\ \!{\mathbf{m}}_{n+1}\!=\!{\mathbf{m}}_{n}\!+\!\beta_{n+1}\!(\nabla\!F({\bm{\theta}}_{n},X_{n+1})\!-\!{\mathbf{m}}_{n}),\end{cases}\!\!\!\!\!\begin{cases}\!{\mathbf{v}}_{n+1}\!=\!{\mathbf{v}}_{n}\!+\!\beta_{n+1}\!(\nabla\!F({\bm{\theta}}_{n},X_{n+1})^{2}\!\!-\!{\mathbf{v}}_{n}),\\ \!{\mathbf{m}}_{n+1}\!=\!{\mathbf{m}}_{n}\!+\!\beta_{n+1}\!(\nabla\!F({\bm{\theta}}_{n},X_{n+1})\!-\!{\mathbf{m}}_{n}),\\ \!{\bm{\theta}}_{n+1}\!=\!{\bm{\theta}}_{n}\!-\!\beta_{n+1}{\mathbf{m}}_{n}/\sqrt{{\mathbf{v}}_{n}+\epsilon},\end{cases}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{(a). SHB}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{(b). Momentum-based Algorithm}\end{split}\vspace{-5mm} (13)

where \epsilon>0, {\bm{\theta}}_{n},{\mathbf{m}}_{n},{\mathbf{v}}_{n},\nabla F({\bm{\theta}},X)\in{\mathbb{R}}^{d}, and the square and square root in (13)(b) are element-wise operators. (For ease of exposition, we simplify the original SHB and momentum-based algorithms from Gadat et al. (2018); Li et al. (2022); Barakat et al. (2021); Barakat & Bianchi (2021), setting all tunable parameters to 1, which results in (13).)

For SHB, we introduce an augmented variable 𝐳n{\mathbf{z}}_{n} and function H(𝐳n,Xn+1)H({\mathbf{z}}_{n},X_{n+1}) defined as follows:

𝐳n[𝜽n𝐦n]2d,H(𝐳n,Xn+1)[𝐦nF(𝜽n,Xn+1)𝐦n]2d.{\mathbf{z}}_{n}\triangleq\begin{bmatrix}{\bm{\theta}}_{n}\\ {\mathbf{m}}_{n}\end{bmatrix}\in{\mathbb{R}}^{2d},\quad H({\mathbf{z}}_{n},X_{n+1})\triangleq\begin{bmatrix}-{\mathbf{m}}_{n}\\ \nabla F({\bm{\theta}}_{n},X_{n+1})-{\mathbf{m}}_{n}\end{bmatrix}\in{\mathbb{R}}^{2d}.
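
For instance, taking the expectation over X\sim{\bm{\mu}} gives the mean field of this augmented SHB system,

\mathbb{E}_{X\sim{\bm{\mu}}}[H({\mathbf{z}},X)]=\begin{bmatrix}-{\mathbf{m}}\\ \nabla f({\bm{\theta}})-{\mathbf{m}}\end{bmatrix},

whose root satisfies {\mathbf{m}}={\bm{0}} and \nabla f({\bm{\theta}})={\bm{0}}, so the augmented SA indeed targets the stationary points of (1).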

For the general momentum-based algorithm, we define

{\mathbf{z}}_{n}\triangleq\begin{bmatrix}{\mathbf{v}}_{n}\\ {\mathbf{m}}_{n}\\ {\bm{\theta}}_{n}\end{bmatrix}\in{\mathbb{R}}^{3d},\quad H({\mathbf{z}}_{n},X_{n+1})\triangleq\begin{bmatrix}\nabla F({\bm{\theta}}_{n},X_{n+1})^{2}-{\mathbf{v}}_{n}\\ \nabla F({\bm{\theta}}_{n},X_{n+1})-{\mathbf{m}}_{n}\\ -{\mathbf{m}}_{n}/\sqrt{{\mathbf{v}}_{n}+\epsilon}\end{bmatrix}\in{\mathbb{R}}^{3d}.

Thus, we can reformulate both algorithms in (13) as {\mathbf{z}}_{n+1}={\mathbf{z}}_{n}+\beta_{n+1}H({\mathbf{z}}_{n},X_{n+1}). This augmentation approach was previously adopted in (Gadat et al., 2018; Barakat et al., 2021; Barakat & Bianchi, 2021; Li et al., 2022) to analyze the asymptotic performance of the algorithms in (13) driven by an i.i.d. sequence \{X_{n}\}_{n\geq 0}. Consequently, the general SA iteration (2) includes these SGD variants. However, we mainly focus on the CLT for the general SA driven by the SRRW in this paper; deriving explicit CLT results for these SGD variants, with their specific forms of H({\bm{\theta}},X), under the SRRW sequence \{X_{n}\} is beyond the scope of this paper.

When we numerically test the SHB algorithm in Section 4, we use the exact form of (13)(a), with the stochastic sequence \{X_{n}\} now driven by the SRRW. Specifically, we take the MHRW with transition kernel {\mathbf{P}} as the base Markov chain of the SRRW process, i.e.,

P_{ij}=\begin{cases}\min\left\{\frac{1}{d_{i}},\frac{1}{d_{j}}\right\}&\text{if node $j$ is a neighbor of node $i$},\\ 0&\text{otherwise},\end{cases} (14)
P_{ii}=1-\sum_{j\in{\mathcal{N}},j\neq i}P_{ij}.

Then, at each time step nn,

\begin{split}\text{Draw:}\quad&X_{n+1}\sim{\mathbf{K}}_{X_{n},\cdot}[{\mathbf{x}}_{n}],\qquad\text{where}~~K_{ij}[{\mathbf{x}}]\triangleq\frac{P_{ij}(x_{j}/\mu_{j})^{-\alpha}}{\sum_{k\in{\mathcal{N}}}P_{ik}(x_{k}/\mu_{k})^{-\alpha}},~~\forall~i,j\in{\mathcal{N}},\\ \text{Update:}\quad&{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\gamma_{n+1}({\bm{\delta}}_{X_{n+1}}-{\mathbf{x}}_{n}),\\ &{\bm{\theta}}_{n+1}={\bm{\theta}}_{n}-\beta_{n+1}{\mathbf{m}}_{n},\\ &{\mathbf{m}}_{n+1}={\mathbf{m}}_{n}+\beta_{n+1}(\nabla F({\bm{\theta}}_{n},X_{n+1})-{\mathbf{m}}_{n}).\end{split} (15)
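
The following is a minimal Python sketch of the procedure above on a toy undirected graph. It is a sketch only, not the implementation used for the experiments in Section 4: the adjacency matrix, the uniform target mu, the gradient oracle grad_F, and the helper names mhrw_kernel and sa_srrw_shb are our own illustrative choices.

```python
import numpy as np

def mhrw_kernel(A):
    """MHRW base kernel P in (14) on an undirected graph with adjacency matrix A."""
    deg = A.sum(axis=1)
    N = A.shape[0]
    P = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if A[i, j] and i != j:
                P[i, j] = min(1.0 / deg[i], 1.0 / deg[j])
        P[i, i] = 1.0 - P[i].sum()
    return P

def sa_srrw_shb(A, mu, grad_F, theta0, alpha=5.0, a=0.8, b=0.9, n_iter=20000, seed=0):
    """SHB iterates (13)(a) driven by the SRRW token, following the Draw/Update steps in (15)."""
    rng = np.random.default_rng(seed)
    P = mhrw_kernel(A)
    N = A.shape[0]
    x = np.ones(N) / N                      # empirical measure x_0 in Int(Sigma)
    theta, m = theta0.astype(float), np.zeros_like(theta0, dtype=float)
    X = rng.integers(N)                     # initial token position
    for n in range(n_iter):
        gamma, beta = (n + 1) ** (-a), (n + 1) ** (-b)
        w = P[X] * (x / mu) ** (-alpha)     # SRRW transition weights K_{X,.}[x] up to normalization
        X = rng.choice(N, p=w / w.sum())    # Draw X_{n+1}
        x += gamma * (np.eye(N)[X] - x)     # update empirical measure x_{n+1}
        g = grad_F(theta, X)                # gradient evaluated at theta_n
        theta -= beta * m                   # SHB update: theta_{n+1} = theta_n - beta * m_n
        m += beta * (g - m)                 # SHB update: m_{n+1}
    return theta, x

# Example usage: 6-node ring, quadratic F(theta, i) = 0.5 * (theta - c_i)^2.
A = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)
c = np.linspace(-1.0, 1.0, 6)
theta, x = sa_srrw_shb(A, mu=np.ones(6) / 6, grad_F=lambda t, i: t - c[i], theta0=np.zeros(1))
```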

Appendix B Discussion on Mean Field Function of SRRW Iterates (4b)

Non-asymptotic analyses have recently received extensive attention in both the single-timescale SA literature (Sun et al., 2018; Karimi et al., 2019; Chen et al., 2020b; 2022) and the two-timescale SA literature (Doan, 2021; Zeng et al., 2021). Specifically, a single-timescale SA takes the following form:

𝐱n+1=𝐱n+βn+1H(𝐱n,Xn+1),{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\beta_{n+1}H({\mathbf{x}}_{n},X_{n+1}),

and function h(𝐱)𝔼X𝝁[H(𝐱,X)]h({\mathbf{x}})\triangleq\mathbb{E}_{X\sim{\bm{\mu}}}[H({\mathbf{x}},X)] is the mean field of function H(𝐱,X)H({\mathbf{x}},X). Similarly, for two-timescale SA, we have the following recursions:

𝐱n+1=𝐱n+βn+1H1(𝐱n,𝐲n,Xn+1),𝐲n+1=𝐲n+γn+1H2(𝐱n,𝐲n,Xn+1),\begin{split}{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\beta_{n+1}H_{1}({\mathbf{x}}_{n},{\mathbf{y}}_{n},X_{n+1}),\\ {\mathbf{y}}_{n+1}={\mathbf{y}}_{n}+\gamma_{n+1}H_{2}({\mathbf{x}}_{n},{\mathbf{y}}_{n},X_{n+1}),\end{split}

where \{\beta_{n}\} and \{\gamma_{n}\} are on different timescales, and the function h_{i}({\mathbf{x}},{\mathbf{y}})\triangleq\mathbb{E}_{X\sim{\bm{\mu}}}[H_{i}({\mathbf{x}},{\mathbf{y}},X)] is the mean field of H_{i}({\mathbf{x}},{\mathbf{y}},X) for i\in\{1,2\}. All the aforementioned works require the mean field function h({\mathbf{x}}) in the single-timescale SA (or h_{1}({\mathbf{x}},{\mathbf{y}}),h_{2}({\mathbf{x}},{\mathbf{y}}) in the two-timescale SA) to be globally Lipschitz with some Lipschitz constant L in order to derive finite-time bounds that involve the constant L.

Here, we show that the mean field function {\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}} in the SRRW iterates (4b) is not globally Lipschitz, where {\bm{\pi}}[{\mathbf{x}}] is the stationary distribution of the SRRW kernel {\mathbf{K}}[{\mathbf{x}}] defined in (3). To this end, we show that entries of the Jacobian matrix of {\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}} become unbounded near the boundary of \Sigma; recall that a differentiable multivariate function on a convex domain is Lipschitz if and only if its partial derivatives are bounded. Note that from Doshi et al. (2023, Proposition 2.1), for the i-th entry of {\bm{\pi}}[{\mathbf{x}}], we have

𝝅i[𝐱]=j𝒩μiPij(xi/μi)α(xj/μj)αi𝒩j𝒩μiPij(xi/μi)α(xj/μj)α.{\bm{\pi}}_{i}[{\mathbf{x}}]=\frac{\sum_{j\in{\mathcal{N}}}\mu_{i}P_{ij}\left(x_{i}/\mu_{i}\right)^{-\alpha}\left(x_{j}/\mu_{j}\right)^{-\alpha}}{\sum_{i\in{\mathcal{N}}}\sum_{j\in{\mathcal{N}}}\mu_{i}P_{ij}\left(x_{i}/\mu_{i}\right)^{-\alpha}\left(x_{j}/\mu_{j}\right)^{-\alpha}}. (16)

Then, the Jacobian matrix of the mean field function {\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}}, which has been derived in Doshi et al. (2023, Proof of Lemma 3.4 in Appendix B), is given as follows:

(𝝅i[𝐱]xi)xj=2αxj(k𝒩μiPik(xi/μi)α(xk/μk)α)(k𝒩μjPjk(xj/μj)α(xk/μk)α)(l𝒩k𝒩μlPlk(xl/μl)α(xk/μk)α)2αxjμiPij(xi/μi)α(xj/μj)αl𝒩k𝒩μlPlk(xl/μl)α(xk/μk)α\begin{split}&\frac{\partial({\bm{\pi}}_{i}[{\mathbf{x}}]-x_{i})}{\partial x_{j}}\\ =&~{}\frac{2\alpha}{x_{j}}\cdot\frac{(\sum_{k\in{\mathcal{N}}}\mu_{i}P_{ik}\left(x_{i}/\mu_{i}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha})(\sum_{k\in{\mathcal{N}}}\mu_{j}P_{jk}\left(x_{j}/\mu_{j}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha})}{(\sum_{l\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\mu_{l}P_{lk}\left(x_{l}/\mu_{l}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha})^{2}}\\ &-\frac{\alpha}{x_{j}}\cdot\frac{\mu_{i}P_{ij}\left(x_{i}/\mu_{i}\right)^{-\alpha}\left(x_{j}/\mu_{j}\right)^{-\alpha}}{\sum_{l\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\mu_{l}P_{lk}\left(x_{l}/\mu_{l}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha}}\end{split} (17)

for i,j𝒩,iji,j\in{\mathcal{N}},i\neq j, and

(𝝅i[𝐱]xi)xi=2αxi(k𝒩μiPik(xi/μi)α(xk/μk)α)2(l𝒩k𝒩μlPlk(xl/μl)α(xk/μk)α)2αxik𝒩μiPik(xi/μi)α(xk/μk)α+μiPii(xi/μi)2αl𝒩k𝒩μlPlk(xl/μl)α(xk/μk)α1\begin{split}&\frac{\partial({\bm{\pi}}_{i}[{\mathbf{x}}]-x_{i})}{\partial x_{i}}\\ =&~{}\frac{2\alpha}{x_{i}}\cdot\frac{(\sum_{k\in{\mathcal{N}}}\mu_{i}P_{ik}\left(x_{i}/\mu_{i}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha})^{2}}{(\sum_{l\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\mu_{l}P_{lk}\left(x_{l}/\mu_{l}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha})^{2}}\\ &-\frac{\alpha}{x_{i}}\cdot\frac{\sum_{k\in{\mathcal{N}}}\mu_{i}P_{ik}\left(x_{i}/\mu_{i}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha}+\mu_{i}P_{ii}(x_{i}/\mu_{i})^{-2\alpha}}{\sum_{l\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\mu_{l}P_{lk}\left(x_{l}/\mu_{l}\right)^{-\alpha}\left(x_{k}/\mu_{k}\right)^{-\alpha}}-1\end{split} (18)

for i\in{\mathcal{N}}. Since the empirical distribution satisfies {\mathbf{x}}\in\text{Int}(\Sigma), we have x_{i}\in(0,1) for all i\in{\mathcal{N}}. For fixed i\neq j, set x_{i}=x_{j} and let them approach zero; then the terms (x_{i}/\mu_{i})^{-\alpha}, (x_{j}/\mu_{j})^{-\alpha} dominate the fraction in (17), and the numerator and denominator of that fraction are of the same order in x_{i},x_{j}. Thus, we have

(𝝅i[𝐱]xi)xj=O(1xj)\frac{\partial({\bm{\pi}}_{i}[{\mathbf{x}}]-x_{i})}{\partial x_{j}}=O\left(\frac{1}{x_{j}}\right)

so the (i,j)-th entry of the Jacobian matrix becomes unbounded as x_{j}\to 0. Consequently, {\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}} is not globally Lipschitz on \text{Int}(\Sigma).
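
This blow-up is easy to observe numerically. The sketch below (our own illustration, on a hypothetical 3-node complete graph with uniform \mu, a lazy base kernel, and a helper pi_of_x implementing (16)) probes the (0,1) Jacobian entry by finite differences and exhibits the O(1/x_{j}) growth predicted by (17):

```python
import numpy as np

def pi_of_x(x, P, mu, alpha=1.0):
    """Stationary distribution pi[x] of the SRRW kernel, following (16)."""
    W = mu[:, None] * P * (x / mu)[:, None] ** (-alpha) * (x / mu)[None, :] ** (-alpha)
    return W.sum(axis=1) / W.sum()

mu = np.ones(3) / 3
P = np.full((3, 3), 1.0 / 3.0)   # lazy base chain on the complete graph (reversible, uniform mu)

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    x = np.array([eps, eps, 1.0 - 2.0 * eps])     # approach the boundary of the simplex
    h = 1e-4 * eps                                # relative finite-difference step in x_1
    xp = x.copy(); xp[1] += h
    jac_01 = (pi_of_x(xp, P, mu)[0] - pi_of_x(x, P, mu)[0]) / h   # ~ d(pi_0[x] - x_0)/dx_1
    print(f"x_1 = {eps:.0e}   d(pi_0[x]-x_0)/dx_1 ~ {jac_01:.3e}")
```

The printed derivative estimate grows roughly proportionally to 1/x_{1}, consistent with (17).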

Appendix C Discussion on Assumption A3′

When \gamma_{n}=o(\beta_{n}), the iterates {\mathbf{x}}_{n} have a smaller step size than {\bm{\theta}}_{n} and thus converge ‘slower’ than {\bm{\theta}}_{n}. Under Assumption A3′, {\bm{\theta}}_{n} will intuitively converge to some point \rho({\mathbf{x}}) determined by the current value {\mathbf{x}} of the iterates {\mathbf{x}}_{n}, i.e., \mathbb{E}_{X\sim{\bm{\pi}}[{\mathbf{x}}]}[H(\rho({\mathbf{x}}),X)]=0, while the Hurwitz condition ensures stability around \rho({\mathbf{x}}). Note that Assumption A3 is less stringent than A3′ in that it only imposes such a condition at {\mathbf{x}}={\bm{\mu}}, where \rho({\bm{\mu}})={\bm{\theta}}^{*}, rather than for all {\mathbf{x}}\in\text{Int}(\Sigma).

One special instance of Assumption A3′ is linear SA, e.g., H({\bm{\theta}},i)=A_{i}{\bm{\theta}}+b_{i}. In this case, \mathbb{E}_{X\sim{\bm{\pi}}[{\mathbf{x}}]}[H(\rho({\mathbf{x}}),X)]=0 is equivalent to \mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[A_{i}]\rho({\mathbf{x}})+\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[b_{i}]=0. Under the condition that the matrix \mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[A_{i}] is invertible for every {\mathbf{x}}\in\text{Int}(\Sigma), we then have

ρ(𝐱)=(𝔼i𝝅[𝐱][Ai])1𝔼i𝝅[𝐱][bi].\rho({\mathbf{x}})=-(\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[A_{i}])^{-1}\cdot\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[b_{i}].

However, this condition is quite strict. Loosely speaking, requiring \mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[A_{i}] to be invertible for every {\mathbf{x}} is akin to requiring every convex combination of \{A_{i}\} to be invertible. For example, if we assume that \{A_{i}\}_{i\in{\mathcal{N}}} are negative definite and share the same eigenbasis \{{\mathbf{u}}_{j}\}, e.g., A_{i}=\sum_{j=1}^{D}\lambda^{i}_{j}{\mathbf{u}}_{j}{\mathbf{u}}_{j}^{T} with \lambda_{j}^{i}<0 for all i\in{\mathcal{N}},j\in[D], then \mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[A_{i}] is negative definite and hence invertible, as illustrated by the quick check below.
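
A quick numerical check of this shared-eigenbasis example (a sketch under arbitrary choices of the basis and eigenvalues, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 6
U, _ = np.linalg.qr(rng.standard_normal((D, D)))                     # shared orthonormal eigenbasis {u_j}
A = [U @ np.diag(-rng.uniform(0.5, 2.0, D)) @ U.T for _ in range(N)]  # A_i = sum_j lambda_j^i u_j u_j^T, lambda_j^i < 0

for _ in range(5):
    w = rng.dirichlet(np.ones(N))                  # a convex combination, playing the role of pi[x]
    Abar = sum(wi * Ai for wi, Ai in zip(w, A))
    assert np.linalg.eigvalsh(Abar).max() < 0      # negative definite, hence invertible (and Hurwitz)
```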

Another example satisfying Assumption A3′ is when H({\bm{\theta}},i)=H({\bm{\theta}},j) for all i,j\in{\mathcal{N}}, which corresponds to each agent in distributed learning holding the same local dataset for collaboratively training the model. In this example, \rho({\mathbf{x}})={\bm{\theta}}^{*} such that

𝔼i𝝅[𝐱][H(ρ(𝐱),i)]=h(𝜽)=0,\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[H(\rho({\mathbf{x}}),i)]=h({\bm{\theta}}^{*})=0,
𝔼i𝝅[𝐱][H(ρ(𝐱),i)]+𝟙{b=1}2𝐈=h(𝜽)+𝟙{b=1}2𝐈being Hurwitz.\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}[\nabla H(\rho({\mathbf{x}}),i)]+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}=\nabla h({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}\quad\text{being Hurwitz}.

Appendix D Proof of Lemma 3.1 and Lemma 3.2

In this section, we demonstrate the almost sure convergence of both {\bm{\theta}}_{n} and {\mathbf{x}}_{n} together. This proof naturally incorporates the almost sure convergence of the SRRW iterates in Lemma 3.1, since {\mathbf{x}}_{n} is independent of {\bm{\theta}}_{n} (as indicated in (4)), which allows us to separate out its asymptotic results. The same reasoning applies to the CLT analysis of the SRRW iterates, and we refer the reader to Section E.1 for the CLT result of {\mathbf{x}}_{n} in Lemma 3.1.

We will use different techniques for different settings of step sizes in Assumption A2. Specifically, for step sizes γn=(n+1)a,βn=(n+1)b\gamma_{n}=(n+1)^{-a},\beta_{n}=(n+1)^{-b}, we consider the following scenarios:

Scenario 1:

We consider case (ii): 1/2<a=b\leq 1, and apply the almost sure convergence result for single-timescale stochastic approximation in Theorem G.8, verifying all the conditions therein.

Scenario 2:

We consider both case (i): 1/2<a<b\leq 1 and case (iii): 1/2<b<a\leq 1. In these two cases, the step sizes \gamma_{n},\beta_{n} decrease at different rates, thereby putting the iterates {\mathbf{x}}_{n},{\bm{\theta}}_{n} on different timescales and resulting in a two-timescale structure. We apply the existing almost sure convergence result for two-timescale stochastic approximation with iterate-dependent Markov chains in Yaji & Bhatnagar (2020, Theorem 4), of which our SA-SRRW algorithm can be regarded as a special instance. (Note that Yaji & Bhatnagar (2020) only analyzed almost sure convergence; a central limit theorem for two-timescale stochastic approximation with iterate-dependent Markov chains remains unavailable in the literature. Thus, our CLT analysis in Section E for this two-timescale structure with an iterate-dependent Markov chain is novel and constitutes part of our contribution.)

D.1 Scenario 1

In Scenario 1, we have βn=γn\beta_{n}=\gamma_{n}. First, we rewrite (4) as

\begin{bmatrix}{\bm{\theta}}_{n+1}\\ {\mathbf{x}}_{n+1}\end{bmatrix}=\begin{bmatrix}{\bm{\theta}}_{n}\\ {\mathbf{x}}_{n}\end{bmatrix}+\gamma_{n+1}\begin{bmatrix}H({\bm{\theta}}_{n},X_{n+1})\\ {\bm{\delta}}_{X_{n+1}}-{\mathbf{x}}_{n}\end{bmatrix}. (19)

By augmentation, we define the variable {\mathbf{z}}_{n}\triangleq\begin{bmatrix}{\bm{\theta}}_{n}\\ {\mathbf{x}}_{n}\end{bmatrix}\in{\mathbb{R}}^{(N+D)\times 1} and the function G({\mathbf{z}}_{n},i)\triangleq\begin{bmatrix}H({\bm{\theta}}_{n},i)\\ {\bm{\delta}}_{i}-{\mathbf{x}}_{n}\end{bmatrix}\in{\mathbb{R}}^{(N+D)\times 1}. In addition, we define a new Markov chain \{Y_{n}\}_{n\geq 0} on the same state space {\mathcal{N}} as the SRRW sequence \{X_{n}\}_{n\geq 0}. With slight abuse of notation, the transition kernel of \{Y_{n}\} is denoted by {\mathbf{K}}^{\prime}[{\mathbf{z}}_{n}]\equiv{\mathbf{K}}[{\mathbf{x}}_{n}] and its stationary distribution by {\bm{\pi}}^{\prime}[{\mathbf{z}}_{n}]\equiv{\bm{\pi}}[{\mathbf{x}}_{n}], where {\mathbf{K}}[{\mathbf{x}}_{n}] and {\bm{\pi}}[{\mathbf{x}}_{n}] are the transition kernel of the SRRW and its corresponding stationary distribution, with {\bm{\pi}}[{\mathbf{x}}] of the form

πi[𝐱]j𝒩μiPij(xi/μi)α(xj/μj)α.\pi_{i}[{\mathbf{x}}]\propto\sum_{j\in{\mathcal{N}}}\mu_{i}P_{ij}(x_{i}/\mu_{i})^{-\alpha}(x_{j}/\mu_{j})^{-\alpha}. (20)

Recall that 𝝁{\bm{\mu}} is the fixed point, i.e., 𝝅[𝝁]=𝝁{\bm{\pi}}[{\bm{\mu}}]={\bm{\mu}}, and 𝐏{\mathbf{P}} is the base Markov chain inside SRRW (see (3)). Then, the mean field

g(𝐳)=𝔼Y𝝅(𝐳)[G(𝐳,Y)]=[i𝒩πi[𝐱]H(𝜽,i)𝝅[𝐱]𝐱],g({\mathbf{z}})=\mathbb{E}_{Y\sim{\bm{\pi}}^{\prime}({\mathbf{z}})}[G({\mathbf{z}},Y)]=\begin{bmatrix}\sum_{i\in{\mathcal{N}}}\pi_{i}[{\mathbf{x}}]H({\bm{\theta}},i)\\ {\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}}\end{bmatrix},

and 𝐳=(𝜽,𝝁){\mathbf{z}}^{*}=({\bm{\theta}}^{*},{\bm{\mu}}) for 𝜽Θ{\bm{\theta}}^{*}\in\Theta in Assumption A3 is the root of g(𝐳)g({\mathbf{z}}), i.e., g(𝐳)=0g({\mathbf{z}}^{*})=0. The augmented iteration (19) becomes

𝐳n+1=𝐳n+γn+1G(𝐳n,Yn+1){\mathbf{z}}_{n+1}={\mathbf{z}}_{n}+\gamma_{n+1}G({\mathbf{z}}_{n},Y_{n+1}) (21)

with the goal of solving g(𝐳)=0g({\mathbf{z}})=0. Therefore, we can treat (21) as an SA algorithm driven by a Markov chain {Yn}n0\{Y_{n}\}_{n\geq 0} with its kernel 𝐊[𝐳]{\mathbf{K}}^{\prime}[{\mathbf{z}}] and stationary distribution 𝝅[𝐳]{\bm{\pi}}^{\prime}[{\mathbf{z}}], which has been widely studied in the literature (e.g., Delyon (2000); Benveniste et al. (2012); Fort (2015); Li et al. (2023)). In what follows, we demonstrate that for any initial point 𝐳0=(𝜽0,𝐱0)D×Int(Σ){\mathbf{z}}_{0}=({\bm{\theta}}_{0},{\mathbf{x}}_{0})\in{\mathbb{R}}^{D}\times\text{Int}(\Sigma), the SRRW iteration {𝐱n}n0\{{\mathbf{x}}_{n}\}_{n\geq 0} will almost surely converge to the target distribution 𝝁{\bm{\mu}}, and the SA iteration {𝜽n}n0\{{\bm{\theta}}_{n}\}_{n\geq 0} will almost surely converge to the set Θ\Theta.

Now we verify conditions C1 - C4 in Theorem G.8. Our assumption A4 is equivalent to condition C1, and assumption A2 corresponds to condition C2. For condition C3, we set \nabla w({\mathbf{z}})\equiv-g({\mathbf{z}}) and S\equiv\{{\mathbf{z}}^{*}\,|\,{\bm{\theta}}^{*}\in\Theta,{\mathbf{x}}^{*}={\bm{\mu}}\}, a set consisting of disjoint (isolated) points. For condition C4, since {\mathbf{K}}^{\prime}[{\mathbf{z}}], or equivalently {\mathbf{K}}[{\mathbf{x}}], is ergodic and time-reversible for any given {\mathbf{z}}, as shown in the SRRW work Doshi et al. (2023), a solution to the Poisson equation is guaranteed to exist, as discussed in Chen et al. (2020a, Section 2), Benveniste et al. (2012) and Meyn (2022). To show (97) and (98) in condition C4, for each given {\mathbf{z}} and any i\in{\mathcal{N}}, we need the explicit solution m_{{\mathbf{z}}}(i) to the Poisson equation m_{{\mathbf{z}}}(i)-({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)=G({\mathbf{z}},i)-g({\mathbf{z}}) in (96), where the notation ({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i) is defined as follows.

({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)=\sum_{j\in{\mathcal{N}}}{\mathbf{K}}^{\prime}_{{\mathbf{z}}}(i,j)\,m_{{\mathbf{z}}}(j).

Let {\mathbf{G}}({\mathbf{z}})\triangleq[G({\mathbf{z}},1),\cdots,G({\mathbf{z}},N)]^{T}\in{\mathbb{R}}^{N\times(N+D)}. We use [{\mathbf{A}}]_{:,i} to denote the i-th column of matrix {\mathbf{A}}. Then, we let m_{{\mathbf{z}}}(i) be such that

m𝐳(i)=k=0([𝐆(𝐳)(𝐊[𝐳]k)T][:,i]g(𝐳))=k=0[𝐆(𝐳)((𝐊[𝐳]k)T𝝅[𝐳]𝟏T)][:,i].m_{{\mathbf{z}}}(i)=\sum_{k=0}^{\infty}\left([{\mathbf{G}}({\mathbf{z}})({\mathbf{K}}^{\prime}[{\mathbf{z}}]^{k})^{T}]_{[:,i]}-g({\mathbf{z}})\right)=\sum_{k=0}^{\infty}[{\mathbf{G}}({\mathbf{z}})(({\mathbf{K}}^{\prime}[{\mathbf{z}}]^{k})^{T}-{\bm{\pi}}^{\prime}[{\mathbf{z}}]{\bm{1}}^{T})]_{[:,i]}. (22)

In addition,

({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)=\sum_{k=1}^{\infty}[{\mathbf{G}}({\mathbf{z}})(({\mathbf{K}}^{\prime}[{\mathbf{z}}]^{k})^{T}-{\bm{\pi}}^{\prime}[{\mathbf{z}}]{\bm{1}}^{T})]_{[:,i]}. (23)
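
Indeed, reading (22) as m_{{\mathbf{z}}}(i)=\sum_{k=0}^{\infty}\big(\mathbb{E}[G({\mathbf{z}},Y_{k})\,|\,Y_{0}=i]-g({\mathbf{z}})\big) for a chain \{Y_{k}\} run with the frozen kernel {\mathbf{K}}^{\prime}[{\mathbf{z}}] (the sums converge geometrically since {\mathbf{K}}^{\prime}[{\mathbf{z}}] is ergodic), applying {\mathbf{K}}^{\prime}_{{\mathbf{z}}} shifts the sum to start at k=1 as in (23), so subtracting the two leaves only the k=0 term:

m_{{\mathbf{z}}}(i)-({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)=\mathbb{E}[G({\mathbf{z}},Y_{0})\,|\,Y_{0}=i]-g({\mathbf{z}})=G({\mathbf{z}},i)-g({\mathbf{z}}),

which is exactly the Poisson equation above.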

This confirms that m_{{\mathbf{z}}}(i) in (22) indeed solves the above Poisson equation. To obtain a closed form, note that by induction {\mathbf{K}}^{\prime}[{\mathbf{z}}]^{k}-{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T}=({\mathbf{K}}^{\prime}[{\mathbf{z}}]-{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{k} for k\geq 1, while for k=0, {\mathbf{K}}^{\prime}[{\mathbf{z}}]^{0}-{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T}=({\mathbf{K}}^{\prime}[{\mathbf{z}}]-{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{0}-{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T}. Then,

m𝐳(i)=k=0[𝐆(𝐳)(𝐊[𝐳]T𝝅[𝐳]𝟏T)k][:,i]g(𝐳)=[𝐆(𝐳)k=0(𝐊[𝐳]T𝝅[𝐳]𝟏T)k][:,i]g(𝐳)=[𝐆(𝐳)(𝐈𝐊[𝐳]T+𝝅[𝐳]𝟏T)1][:,i]g(𝐳)=j𝒩(𝐈𝐊[𝐳]+𝟏𝝅[𝐳]T)1(i,j)G(𝐳,j)g(𝐳).\begin{split}m_{{\mathbf{z}}}(i)=&\sum_{k=0}^{\infty}[{\mathbf{G}}({\mathbf{z}})({\mathbf{K}}^{\prime}[{\mathbf{z}}]^{T}-{\bm{\pi}}^{\prime}[{\mathbf{z}}]{\bm{1}}^{T})^{k}]_{[:,i]}-g({\mathbf{z}})\\ =&\left[{\mathbf{G}}({\mathbf{z}})\sum_{k=0}^{\infty}({\mathbf{K}}^{\prime}[{\mathbf{z}}]^{T}-{\bm{\pi}}^{\prime}[{\mathbf{z}}]{\bm{1}}^{T})^{k}\right]_{[:,i]}-g({\mathbf{z}})\\ =&\left[{\mathbf{G}}({\mathbf{z}})({\mathbf{I}}-{\mathbf{K}}^{\prime}[{\mathbf{z}}]^{T}+{\bm{\pi}}^{\prime}[{\mathbf{z}}]{\bm{1}}^{T})^{-1}\right]_{[:,i]}-g({\mathbf{z}})\\ =&\sum_{j\in{\mathcal{N}}}({\mathbf{I}}-{\mathbf{K}}^{\prime}[{\mathbf{z}}]+{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{-1}(i,j)G({\mathbf{z}},j)-g({\mathbf{z}}).\end{split} (24)

Here, ({\mathbf{I}}-{\mathbf{K}}^{\prime}[{\mathbf{z}}]+{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{-1} is well defined because {\mathbf{K}}^{\prime}[{\mathbf{z}}] is ergodic and time-reversible for any given {\mathbf{z}} (proved in Doshi et al. (2023, Appendix A)). Since both H({\bm{\theta}},i) and {\bm{\delta}}_{i}-{\mathbf{x}} are bounded on each compact subset of {\mathbb{R}}^{D}\times\Sigma by our assumption A1, the function G({\mathbf{z}},i) is also bounded on every compact subset of its domain. Thus, m_{{\mathbf{z}}}(i) is bounded, and (97) is verified. Moreover, for a fixed i\in{\mathcal{N}},

j𝒩(𝐈𝐊[𝐳]+𝟏𝝅[𝐳]T)1(i,j)𝜹j=(𝐈𝐊[𝐳]+𝟏𝝅[𝐳]T)[:,i]1=(𝐈𝐊[𝐱]+𝟏𝝅[𝐱]T)[:,i]1\sum_{j\in{\mathcal{N}}}({\mathbf{I}}-{\mathbf{K}}^{\prime}[{\mathbf{z}}]+{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{-1}(i,j){\bm{\delta}}_{j}=({\mathbf{I}}-{\mathbf{K}}^{\prime}[{\mathbf{z}}]+{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{-1}_{[:,i]}=({\mathbf{I}}-{\mathbf{K}}[{\mathbf{x}}]+{\bm{1}}{\bm{\pi}}[{\mathbf{x}}]^{T})^{-1}_{[:,i]}

and this vector-valued function is continuous in 𝐱{\mathbf{x}} because 𝐊[𝐱],𝝅[𝐱]{\mathbf{K}}[{\mathbf{x}}],{\bm{\pi}}[{\mathbf{x}}] are continuous. We then rewrite (24) as

m_{{\mathbf{z}}}(i)=\begin{bmatrix}\sum_{j\in{\mathcal{N}}}({\mathbf{I}}-{\mathbf{K}}[{\mathbf{x}}]+{\bm{1}}{\bm{\pi}}[{\mathbf{x}}]^{T})^{-1}(i,j)H({\bm{\theta}},j)\\ ({\mathbf{I}}-{\mathbf{K}}[{\mathbf{x}}]^{T}+{\bm{\pi}}[{\mathbf{x}}]{\bm{1}}^{T})^{-1}_{[:,i]}\end{bmatrix}-\begin{bmatrix}\sum_{j\in{\mathcal{N}}}\pi_{j}[{\mathbf{x}}]H({\bm{\theta}},j)\\ {\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}}\end{bmatrix}.

Since H({\bm{\theta}},i) is continuous and differentiable in {\bm{\theta}} and {\mathbf{K}}[{\mathbf{x}}],{\bm{\pi}}[{\mathbf{x}}] are smooth in {\mathbf{x}} on \text{Int}(\Sigma), the function m_{{\mathbf{z}}}(i) is continuously differentiable in {\mathbf{z}}, and so is ({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i). Hence, both m_{{\mathbf{z}}}(i) and ({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i) are locally Lipschitz, which satisfies (98) with \phi_{{\mathcal{C}}}(x)=C_{{\mathcal{C}}}x for some constant C_{{\mathcal{C}}} depending on the compact set {\mathcal{C}}. Therefore, condition C4 is verified, and we can apply Theorem G.8 to obtain the almost sure convergence result for (19), i.e., almost surely,

limn𝐱n=𝝁,andlim supninf𝜽Θ𝜽n𝜽=0.\lim_{n\to\infty}{\mathbf{x}}_{n}={\bm{\mu}},\quad\text{and}~{}~{}~{}~{}\limsup_{n\to\infty}\inf_{{\bm{\theta}}^{*}\in\Theta}\|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|=0.

Therefore, the almost sure convergence of 𝐱n{\mathbf{x}}_{n} in Lemma 3.1 is also proved. This finishes the proof in Scenario 1.

D.2 Scenario 2

In this subsection, we consider the step sizes \gamma_{n},\beta_{n} with either 1/2<a<b\leq 1 or 1/2<b<a\leq 1. We will frequently use assumptions (B1) - (B5) in Section G.3 and Theorem G.10 to prove the almost sure convergence.

D.2.1 Case (i): 1/2<a<b11/2<a<b\leq 1

In case (i), {\bm{\theta}}_{n} is on the slow timescale and {\mathbf{x}}_{n} is on the fast timescale because the iteration {\bm{\theta}}_{n} has a smaller step size than {\mathbf{x}}_{n}, making {\bm{\theta}}_{n} converge more slowly than {\mathbf{x}}_{n}. Here, we consider the two-timescale SA of the form:

{\bm{\theta}}_{n+1}={\bm{\theta}}_{n}+\beta_{n+1}H({\bm{\theta}}_{n},X_{n+1}), (25)
{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\gamma_{n+1}({\bm{\delta}}_{X_{n+1}}-{\mathbf{x}}_{n}).

Now, we verify assumptions (B1) - (B5) listed in Section G.3.

  • Assumptions (B1) and (B5) are satisfied by our assumptions A2 and A4.

  • Our assumption A3 shows that the function H(𝜽,X)H({\bm{\theta}},X) is continuous and differentiable w.r.t 𝜽{\bm{\theta}} and grows linearly with 𝜽\|{\bm{\theta}}\|. In addition, 𝜹X𝐱{\bm{\delta}}_{X}-{\mathbf{x}} also satisfies this property. Therefore, (B2) is satisfied.

  • Now that the function 𝝅[𝐱]𝐱{\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}} is independent of 𝜽{\bm{\theta}}, we can set ρ(𝜽)=𝝁\rho({\bm{\theta}})={\bm{\mu}} for any 𝜽D{\bm{\theta}}\in{\mathbb{R}}^{D} such that 𝝅[𝝁]𝝁=0{\bm{\pi}}[{\bm{\mu}}]-{\bm{\mu}}=0 from Doshi et al. (2023, Proposition 3.1), and

    \nabla_{{\mathbf{x}}}({\bm{\pi}}({\mathbf{x}})-{\mathbf{x}})|_{{\mathbf{x}}={\bm{\mu}}}=2\alpha{\bm{\mu}}{\bm{1}}^{T}-\alpha{\mathbf{P}}^{T}-(\alpha+1){\mathbf{I}}

    from Doshi et al. (2023, Lemma 3.4), which is Hurwitz. Furthermore, \rho({\bm{\theta}})={\bm{\mu}} inherently satisfies the condition \|\rho({\bm{\theta}})\|\leq L_{2}(1+\|{\bm{\theta}}\|) for any L_{2}\geq\|{\bm{\mu}}\|. Thus, conditions (i) - (iii) in (B3) are satisfied. Additionally, \sum_{i\in{\mathcal{N}}}\pi_{i}[\rho({\bm{\theta}})]H({\bm{\theta}},i)=\sum_{i\in{\mathcal{N}}}\mu_{i}H({\bm{\theta}},i)={\mathbf{h}}({\bm{\theta}}), so that for {\bm{\theta}}^{*}\in\Theta defined in assumption A3, \sum_{i\in{\mathcal{N}}}\pi_{i}[\rho({\bm{\theta}}^{*})]H({\bm{\theta}}^{*},i)={\mathbf{h}}({\bm{\theta}}^{*})=0, and \nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*}) is Hurwitz. Therefore, (B3) is checked.

  • Assumption (B4) is verified by the nature of SRRW, i.e., its transition kernel 𝐊[𝐱]{\mathbf{K}}[{\mathbf{x}}] and the corresponding stationary distribution 𝝅[𝐱]{\bm{\pi}}[{\mathbf{x}}] with 𝝅[𝝁]=𝝁{\bm{\pi}}[{\bm{\mu}}]={\bm{\mu}}.

Consequently, assumptions (B1) - (B5) are satisfied under our assumptions A1 - A4, and by Theorem G.10, we have \lim_{n\to\infty}{\mathbf{x}}_{n}={\bm{\mu}} and {\bm{\theta}}_{n}\to\Theta almost surely.

D.2.2 Case (iii): 1/2<b<a\leq 1

Next, we consider case (iii): 1/2<b<a\leq 1. As discussed above, (B1), (B2), (B4) and (B5) are satisfied by our assumptions A1 - A4 and the properties of the SRRW. The only difference from the previous step-size setting 1/2<a<b\leq 1 is that the roles of {\bm{\theta}}_{n},{\mathbf{x}}_{n} are now flipped, that is, {\bm{\theta}}_{n} is on the fast timescale while {\mathbf{x}}_{n} is on the slow timescale. By the much stronger Assumption A3′, for any {\mathbf{x}}\in\text{Int}(\Sigma): (i) \mathbb{E}_{X\sim{\bm{\pi}}[{\mathbf{x}}]}[H(\rho({\mathbf{x}}),X)]=0; (ii) \mathbb{E}_{X\sim{\bm{\pi}}[{\mathbf{x}}]}[\nabla H(\rho({\mathbf{x}}),X)] is Hurwitz; (iii) \|\rho({\mathbf{x}})\|\leq L_{2}(1+\|{\mathbf{x}}\|). Hence, conditions (i) - (iii) in (B3) are satisfied. Moreover, {\bm{\pi}}[{\bm{\mu}}]-{\bm{\mu}}=0 and \nabla({\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}})|_{{\mathbf{x}}={\bm{\mu}}} is Hurwitz, as mentioned in the previous part. Therefore, (B3) is verified. Accordingly, (B1) - (B5) hold under our assumptions A1, A2, A3′, A4. By Theorem G.10, we have \lim_{n\to\infty}{\mathbf{x}}_{n}={\bm{\mu}} and {\bm{\theta}}_{n}\to\Theta almost surely.

Appendix E Proof of Theorem 3.3

This section is devoted to the proof of Theorem 3.3, which also includes the proof of the CLT results for the SRRW iteration 𝐱n{\mathbf{x}}_{n} in Lemma 3.1. We will use different techniques depending on the step sizes in Assumption A2. Specifically, for step sizes γn=(n+1)a,βn=(n+1)b\gamma_{n}=(n+1)^{-a},\beta_{n}=(n+1)^{-b}, we will consider three cases: case (i): βn=o(γn)\beta_{n}=o(\gamma_{n}); case (ii): βn=γn\beta_{n}=\gamma_{n}; and case (iii): γn=o(βn)\gamma_{n}=o(\beta_{n}). For case (ii), we will use the existing CLT result for single-timescale SA in Theorem G.9. For cases (i) and (iii), we will construct our own CLT analysis for the two-timescale structure. We start with case (ii).

E.1 Case (ii): βn=γn\beta_{n}=\gamma_{n}

In this part, we stick to the notation for the single-timescale SA studied in Section D.1. To utilize Theorem G.9, apart from conditions C1 - C4, which have been checked in Section D.1, we still need to verify conditions C5 and C6 listed in Section G.2.

Assumption A3 corresponds to condition C5. For condition C6, we need the explicit form of the solution Q_{{\mathbf{z}}} to the Poisson equation defined in (96), that is,

Q𝐳(i)(𝐊𝐳Q𝐳)(i)=ψ(𝐳,i)𝔼j𝝅[𝐳][ψ(𝐳,j)]Q_{{\mathbf{z}}}(i)-({\mathbf{K}}^{\prime}_{{\mathbf{z}}}Q_{{\mathbf{z}}})(i)=\psi({\mathbf{z}},i)-\mathbb{E}_{j\sim{\bm{\pi}}[{\mathbf{z}}]}[\psi({\mathbf{z}},j)]

where

ψ(𝐳,i)j𝒩𝐊𝐳(i,j)m𝐳(j)m𝐳(j)T(𝐊𝐳m𝐳)(i)(𝐊𝐳m𝐳)(i)T.\psi({\mathbf{z}},i)\triangleq\sum_{j\in{\mathcal{N}}}{\mathbf{K}}^{\prime}_{{\mathbf{z}}}(i,j)m_{{\mathbf{z}}}(j)m_{{\mathbf{z}}}(j)^{T}-({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)^{T}.

Following the similar steps in the derivation of m𝐳(i)m_{{\mathbf{z}}}(i) from (22) to (24), we have

Q_{{\mathbf{z}}}(i)=\sum_{j\in{\mathcal{N}}}\left(({\mathbf{I}}-{\mathbf{K}}^{\prime}[{\mathbf{z}}]+{\bm{1}}{\bm{\pi}}^{\prime}[{\mathbf{z}}]^{T})^{-1}(i,j)-\pi^{\prime}_{j}[{\mathbf{z}}]\right)\psi({\mathbf{z}},j).

We also know that Q_{{\mathbf{z}}}(i) and ({\mathbf{K}}^{\prime}_{{\mathbf{z}}}Q_{{\mathbf{z}}})(i) are continuous in {\mathbf{z}} for any i\in{\mathcal{N}}. For any {\mathbf{z}} in a compact set \Omega, Q_{{\mathbf{z}}}(i) and ({\mathbf{K}}^{\prime}_{{\mathbf{z}}}Q_{{\mathbf{z}}})(i) are bounded because m_{{\mathbf{z}}}(i), and hence \psi({\mathbf{z}},i), is bounded. Therefore, C6 is checked. By Theorem G.9, assuming {\mathbf{z}}_{n}=\begin{bmatrix}{\bm{\theta}}_{n}\\ {\mathbf{x}}_{n}\end{bmatrix} converges to a point {\mathbf{z}}^{*}=\begin{bmatrix}{\bm{\theta}}^{*}\\ {\bm{\mu}}\end{bmatrix} for {\bm{\theta}}^{*}\in\Theta, we have

γn1/2(𝐳n𝐳)ndist.N(0,𝐕),\gamma_{n}^{-1/2}({\mathbf{z}}_{n}-{\mathbf{z}}^{*})\xrightarrow[n\to\infty]{dist.}N(0,{\mathbf{V}}), (26)

where 𝐕{\mathbf{V}} is the solution of the following Lyapunov equation

𝐕(𝟙{b=1}2𝐈+g(𝒛)T)+(𝟙{b=1}2𝐈+g(𝒛))𝐕+𝐔=0,{\mathbf{V}}\left(\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}+\nabla g({\bm{z}}^{*})^{T}\right)+\left(\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}+\nabla g({\bm{z}}^{*})\right){\mathbf{V}}+{\mathbf{U}}=0, (27)

and 𝐔=i𝒩μi(m𝐳(i)m𝐳(i)T(𝐊𝐳m𝐳)(i)(𝐊𝐳m𝐳)(i)T){\mathbf{U}}=\sum_{i\in{\mathcal{N}}}\mu_{i}\left(m_{{\mathbf{z}}^{*}}(i)m_{{\mathbf{z}}^{*}}(i)^{T}-({\mathbf{K}}_{{\mathbf{z}}^{*}}m_{{\mathbf{z}}^{*}})(i)({\mathbf{K}}_{{\mathbf{z}}^{*}}m_{{\mathbf{z}}^{*}})(i)^{T}\right).

By algebraic calculation of the derivative of {\bm{\pi}}[{\mathbf{x}}] with respect to {\mathbf{x}} in (20) (one may refer to Doshi et al. (2023, Appendix B, Proof of Lemma 3.4) for the computation of \frac{\partial({\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}})}{\partial{\mathbf{x}}}), we can rewrite \nabla g({\mathbf{z}}^{*}) in terms of {\mathbf{x}},{\bm{\theta}}, i.e.,

𝐉(α)g(𝐳)=[i𝒩πi[𝐱]H(𝜽,i)𝜽i𝒩πi[𝐱]H(𝜽,i)𝐱(𝝅[𝐱]𝐱)𝜽𝝅[𝐱]𝐱𝐱]𝐳=𝐳=[𝐡(𝜽)α𝐇T(𝐏T+𝐈)𝟎2α𝝁𝟏Tα𝐏T(α+1)𝐈][𝐉11𝐉12(α)𝐉21𝐉22(α)],\begin{split}{\mathbf{J}}(\alpha)\triangleq\nabla g({\mathbf{z}}^{*})&=\begin{bmatrix}\frac{\partial\sum_{i\in{\mathcal{N}}}\pi_{i}[{\mathbf{x}}]H({\bm{\theta}},i)}{\partial{\bm{\theta}}}&\frac{\partial\sum_{i\in{\mathcal{N}}}\pi_{i}[{\mathbf{x}}]H({\bm{\theta}},i)}{\partial{\mathbf{x}}}\\ \frac{\partial({\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}})}{\partial{\bm{\theta}}}&\frac{\partial{\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}}}{\partial{\mathbf{x}}}\end{bmatrix}_{{\mathbf{z}}={\mathbf{z}}^{*}}\\ &=\begin{bmatrix}\nabla{\mathbf{h}}({\bm{\theta}}^{*})&-\alpha{\mathbf{H}}^{T}({\mathbf{P}}^{T}+{\mathbf{I}})\\ {\bm{0}}&2\alpha\bm{\mu}{\bm{1}}^{T}-\alpha{\mathbf{P}}^{T}-(\alpha+1){\mathbf{I}}\end{bmatrix}\triangleq\begin{bmatrix}{\mathbf{J}}_{11}&{\mathbf{J}}_{12}(\alpha)\\ {\mathbf{J}}_{21}&{\mathbf{J}}_{22}(\alpha)\end{bmatrix},\end{split}

where the matrix {\mathbf{H}}=[H({\bm{\theta}}^{*},1),\cdots,H({\bm{\theta}}^{*},N)]^{T}. Next, we further characterize the matrix {\mathbf{U}}. Note that

m𝐳(i)=k=0[𝐆(𝐳)((𝐏k)T𝝁𝟏T)][:,i]=k=0[𝐆(𝐳)(𝐏k)T][:,i]=𝔼[k=0[G(𝐳,Xk)]|X0=i],m_{{\mathbf{z}}^{*}}(i)=\sum_{k=0}^{\infty}[{\mathbf{G}}({\mathbf{z}}^{*})(({\mathbf{P}}^{k})^{T}-{\bm{\mu}}{\bm{1}}^{T})]_{[:,i]}=\sum_{k=0}^{\infty}[{\mathbf{G}}({\mathbf{z}}^{*})({\mathbf{P}}^{k})^{T}]_{[:,i]}=\mathbb{E}\left[\left.\sum_{k=0}^{\infty}[G({\mathbf{z}}^{*},X_{k})]\right|X_{0}=i\right]\!\!, (28)

where the first equality holds because 𝐊[𝝁]=𝐏{\mathbf{K}}^{\prime}[{\bm{\mu}}]={\mathbf{P}} from the definition of SRRW kernel (3), the second equality stems from 𝐆(𝐳)𝝁=g(𝐳)=0{\mathbf{G}}({\mathbf{z}}^{*}){\bm{\mu}}=g({\mathbf{z}}^{*})=0, and the last term is a conditional expectation over the base Markov chain {Xk}k0\{X_{k}\}_{k\geq 0} (with transition kernel 𝐏{\mathbf{P}}) conditioned on X0=iX_{0}=i. Similarly, with (𝐊𝐳m𝐳)(i)({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i) in the form of (23), we have

(𝐊𝐳m𝐳)(i)=𝔼[k=1[G(𝐳,Xk)]|X0=i].({\mathbf{K}}^{\prime}_{{\mathbf{z}}}m_{{\mathbf{z}}})(i)=\mathbb{E}\left[\left.\sum_{k=1}^{\infty}[G({\mathbf{z}}^{*},X_{k})]\right|X_{0}=i\right].

Since the form ‘\sum_{i\in{\mathcal{N}}}\mu_{i}’ appears inside the matrix {\mathbf{U}}, the Markov chain \{X_{k}\}_{k\geq 0} is taken to be in its stationary regime from the beginning, i.e., X_{k}\sim{\bm{\mu}} for all k\geq 0. Hence,

𝐔=𝔼[(k=0[G(𝐳,Xk)])(k=0[G(𝐳,Xk)])T]𝔼[(k=1[G(𝐳,Xk)])(k=1[G(𝐳,Xk)])T]=𝔼[G(𝐳,X0)G(𝐳,X0)T]+𝔼[G(𝐳,X0)(k=1G(𝐳,Xk))T]+𝔼[(k=1G(𝐳,Xk))G(𝐳,X0)T]=Cov(G(𝐳,X0),G(𝐳,X0))+k=1[Cov(G(𝐳,X0),G(𝐳,Xk))+Cov(G(𝐳,Xk),G(𝐳,X0))],\begin{split}{\mathbf{U}}=&~{}\mathbb{E}\left[\left(\sum_{k=0}^{\infty}[G({\mathbf{z}}^{*},X_{k})]\right)\left(\sum_{k=0}^{\infty}[G({\mathbf{z}}^{*},X_{k})]\right)^{T}\right]\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\mathbb{E}\left[\left(\sum_{k=1}^{\infty}[G({\mathbf{z}}^{*},X_{k})]\right)\left(\sum_{k=1}^{\infty}[G({\mathbf{z}}^{*},X_{k})]\right)^{T}\right]\\ =&~{}\mathbb{E}\left[G({\mathbf{z}}^{*},X_{0})G({\mathbf{z}}^{*},X_{0})^{T}\right]\!+\!\mathbb{E}\left[G({\mathbf{z}}^{*},X_{0})\left(\sum_{k=1}^{\infty}G({\mathbf{z}}^{*},X_{k})\right)^{T}\right]\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\mathbb{E}\left[\left(\sum_{k=1}^{\infty}G({\mathbf{z}}^{*},X_{k})\right)G({\mathbf{z}}^{*},X_{0})^{T}\right]\\ =&~{}\text{Cov}(G({\mathbf{z}}^{*},X_{0}),G({\mathbf{z}}^{*},X_{0}))\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\sum_{k=1}^{\infty}\left[\text{Cov}(G({\mathbf{z}}^{*},X_{0}),G({\mathbf{z}}^{*},X_{k}))+\text{Cov}(G({\mathbf{z}}^{*},X_{k}),G({\mathbf{z}}^{*},X_{0}))\right],\\ \end{split} (29)

where \text{Cov}(G({\mathbf{z}}^{*},X_{0}),G({\mathbf{z}}^{*},X_{k})) denotes the covariance between G({\mathbf{z}}^{*},X_{0}) and G({\mathbf{z}}^{*},X_{k}) for the Markov chain \{X_{n}\} in its stationary regime. Brémaud (2013, Theorem 6.3.7) shows that {\mathbf{U}} is the asymptotic sampling covariance of the base Markov chain {\mathbf{P}} for the test function G({\mathbf{z}}^{*},\cdot). Moreover, Brémaud (2013, equation (6.34)) states that this sampling covariance {\mathbf{U}} can be rewritten in the following form:

{\mathbf{U}}=\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{G}}({\mathbf{z}}^{*})^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{G}}({\mathbf{z}}^{*})=\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}\begin{bmatrix}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}&{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}\\ {\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}&{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}\end{bmatrix}\triangleq\begin{bmatrix}{\mathbf{U}}_{11}&{\mathbf{U}}_{12}\\ {\mathbf{U}}_{21}&{\mathbf{U}}_{22}\end{bmatrix}, (30)

where \{(\lambda_{i},{\mathbf{u}}_{i})\}_{i\in{\mathcal{N}}} are the eigenpairs of the transition kernel {\mathbf{P}} of the ergodic and time-reversible base Markov chain. This completes the proof of case (ii).

Remark E.1.

For the CLT result (26), we can look further into the asymptotic covariance matrix 𝐕{\mathbf{V}} as in (27). For convenience, we denote 𝐕=[𝐕11𝐕12𝐕21𝐕22]{\mathbf{V}}=\begin{bmatrix}{\mathbf{V}}_{11}&{\mathbf{V}}_{12}\\ {\mathbf{V}}_{21}&{\mathbf{V}}_{22}\end{bmatrix} and 𝐔{\mathbf{U}} in the form of (30) such that

[𝐕11𝐕12𝐕21𝐕22](𝟙{b=1}2𝐈+𝐉(α)T)+(𝟙{b=1}2𝐈+𝐉(α))[𝐕11𝐕12𝐕21𝐕22]+𝐔=0.\begin{bmatrix}{\mathbf{V}}_{11}&{\mathbf{V}}_{12}\\ {\mathbf{V}}_{21}&{\mathbf{V}}_{22}\end{bmatrix}\left(\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}+{\mathbf{J}}(\alpha)^{T}\right)+\left(\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}+{\mathbf{J}}(\alpha)\right)\begin{bmatrix}{\mathbf{V}}_{11}&{\mathbf{V}}_{12}\\ {\mathbf{V}}_{21}&{\mathbf{V}}_{22}\end{bmatrix}+{\mathbf{U}}=0. (31)

For the SRRW iterates {\mathbf{x}}_{n}, from (26) we know that \gamma_{n}^{-1/2}({\mathbf{x}}_{n}-{\bm{\mu}})\xrightarrow[n\to\infty]{dist.}N({\bm{0}},{\mathbf{V}}_{22}). Thus, in this remark, we derive the closed form of {\mathbf{V}}_{22}. By algebraic computation of the bottom-right sub-block, we have

\begin{split}&\left(2\alpha\bm{\mu}{\bm{1}}^{T}-\alpha{\mathbf{P}}^{T}-\left(\alpha+1-\frac{\mathds{1}_{\{a=1\}}}{2}\right){\mathbf{I}}\right){\mathbf{V}}_{22}+{\mathbf{V}}_{22}\left(2\alpha\bm{\mu}{\bm{1}}^{T}-\alpha{\mathbf{P}}^{T}-\left(\alpha+1-\frac{\mathds{1}_{\{a=1\}}}{2}\right){\mathbf{I}}\right)^{T}\\ &+{\mathbf{U}}_{22}=0.\end{split}

By using the closed-form solution to the Lyapunov equation (e.g., Lemma G.1) and the eigendecomposition of {\mathbf{P}}, we have

𝐕22=i=1N112α(1+λi)+2𝟙{a=1}1+λi1λi𝐮i𝐮iT.{\mathbf{V}}_{22}=\sum_{i=1}^{N-1}\frac{1}{2\alpha(1+\lambda_{i})+2-\mathds{1}_{\{a=1\}}}\cdot\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}. (32)
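
The closed form (32) can be checked numerically against the Lyapunov equation above. The sketch below (our own illustration, using a hypothetical lazy random walk on a 5-node ring with uniform \mu, so that the u_{i} are simply the orthonormal eigenvectors of the symmetric {\mathbf{P}}) compares (32) with the output of SciPy's Lyapunov solver:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

N, alpha, a = 5, 2.0, 0.8
mu = np.ones(N) / N
P = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1))

lam, U = np.linalg.eigh(P)                       # symmetric P: orthonormal eigenvectors
keep = lam < 1 - 1e-12                           # drop the top eigenpair (lambda = 1)
lam, U = lam[keep], U[:, keep]

U22 = sum((1 + l) / (1 - l) * np.outer(u, u) for l, u in zip(lam, U.T))   # cf. (30)
J22 = 2 * alpha * np.outer(mu, np.ones(N)) - alpha * P.T - (alpha + 1 - 0.5 * (a == 1)) * np.eye(N)

V22_lyap = solve_continuous_lyapunov(J22, -U22)  # solves J22 V + V J22^T + U22 = 0
V22_closed = sum((1 + l) / (1 - l) / (2 * alpha * (1 + l) + 2 - (a == 1)) * np.outer(u, u)
                 for l, u in zip(lam, U.T))      # cf. (32)
print(np.allclose(V22_lyap, V22_closed))         # True
```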

E.2 Case (i): βn=o(γn)\beta_{n}=o(\gamma_{n})

In this part, we mainly focus on the CLT of the SA iteration 𝜽n{\bm{\theta}}_{n} because the SRRW iteration 𝐱n{\mathbf{x}}_{n} is independent of 𝜽n{\bm{\theta}}_{n} and its CLT result has been shown in Remark E.1.

E.2.1 Decomposition of SA-SRRW iteration (4)

We slightly abuse the math notation and define the function

𝐡(𝜽,𝐱)𝔼i𝝅[𝐱]H(𝜽,i)=i𝒩πi[𝐱]H(𝜽,i){\mathbf{h}}({\bm{\theta}},{\mathbf{x}})\triangleq\mathbb{E}_{i\sim{\bm{\pi}}[{\mathbf{x}}]}H({\bm{\theta}},i)=\sum_{i\in{\mathcal{N}}}\pi_{i}[{\mathbf{x}}]H({\bm{\theta}},i)

such that 𝐡(𝜽,𝝁)𝐡(𝜽){\mathbf{h}}({\bm{\theta}},{\bm{\mu}})\equiv{\mathbf{h}}({\bm{\theta}}). Then, we reformulate (25) as

𝜽n+1=𝜽n+βn+1𝐡(𝜽n,𝐱n)+βn+1(H(𝜽n,Xn+1)𝐡(𝜽n,𝐱n)).{\bm{\theta}}_{n+1}={\bm{\theta}}_{n}+\beta_{n+1}{\mathbf{h}}({\bm{\theta}}_{n},{\mathbf{x}}_{n})+\beta_{n+1}(H({\bm{\theta}}_{n},X_{n+1})-{\mathbf{h}}({\bm{\theta}}_{n},{\mathbf{x}}_{n})). (33a)
{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\gamma_{n+1}({\bm{\pi}}[{\mathbf{x}}_{n}]-{\mathbf{x}}_{n})+\gamma_{n+1}({\bm{\delta}}_{X_{n+1}}-{\bm{\pi}}[{\mathbf{x}}_{n}]). (33b)

There exist functions q𝐱:𝒩N,H~𝜽,𝐱:𝒩Dq_{{\mathbf{x}}}:{\mathcal{N}}\to\mathbb{R}^{N},\tilde{H}_{{\bm{\theta}},{\mathbf{x}}}:{\mathcal{N}}\to\mathbb{R}^{D} satisfying the following Poisson equations

𝜹i𝝅(𝐱)=q𝐱(i)(𝐊𝐱q𝐱)(i){\bm{\delta}}_{i}-{\bm{\pi}}({\mathbf{x}})=q_{{\mathbf{x}}}(i)-({\mathbf{K}}_{{\mathbf{x}}}q_{{\mathbf{x}}})(i) (34a)
H(𝜽,i)𝐡(𝜽,𝐱)=H~𝜽,𝐱(i)(𝐊𝐱H~𝜽,𝐱)(i),H({\bm{\theta}},i)-{\mathbf{h}}({\bm{\theta}},{\mathbf{x}})=\tilde{H}_{{\bm{\theta}},{\mathbf{x}}}(i)-({\mathbf{K}}_{{\mathbf{x}}}\tilde{H}_{{\bm{\theta}},{\mathbf{x}}})(i), (34b)

for any {\bm{\theta}}\in{\mathbb{R}}^{D},{\mathbf{x}}\in\text{Int}(\Sigma) and i\in{\mathcal{N}}, where ({\mathbf{K}}_{{\mathbf{x}}}q_{{\mathbf{x}}})(i)\triangleq\sum_{j\in{\mathcal{N}}}K_{ij}[{\mathbf{x}}]q_{{\mathbf{x}}}(j) and ({\mathbf{K}}_{{\mathbf{x}}}\tilde{H}_{{\bm{\theta}},{\mathbf{x}}})(i)\triangleq\sum_{j\in{\mathcal{N}}}K_{ij}[{\mathbf{x}}]\tilde{H}_{{\bm{\theta}},{\mathbf{x}}}(j). The existence and explicit forms of the solutions q_{{\mathbf{x}}},\tilde{H}_{{\bm{\theta}},{\mathbf{x}}}, which are continuous w.r.t. {\mathbf{x}} and {\bm{\theta}}, follow steps similar to those in Section D.1, from (22) to (24). Thus, we can further decompose (33) into

𝜽n+1=𝜽n+βn+1𝐡(𝜽n,𝐱n)+βn+1(H~𝜽n,𝐱n(Xn+1)(𝐊𝐱nH~𝜽n,𝐱n)(Xn))Mn+1(𝜽)+βn+1((𝐊𝐱n+1H~𝜽n+1,𝐱n+1)(Xn+1)(𝐊𝐱nH~𝜽n,𝐱n)(Xn+1))rn(𝜽,1)+βn+1((𝐊𝐱nH~𝜽n,𝐱n)(Xn)(𝐊𝐱n+1H~𝜽n+1,𝐱n+1)(Xn+1))rn(𝜽,2),\begin{split}{\bm{\theta}}_{n+1}=&{\bm{\theta}}_{n}+\beta_{n+1}{\mathbf{h}}({\bm{\theta}}_{n},{\mathbf{x}}_{n})+\beta_{n+1}\underbrace{(\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n}))}_{M_{n+1}^{({\bm{\theta}})}}\\ &+\beta_{n+1}\underbrace{(({\mathbf{K}}_{{\mathbf{x}}_{n+1}}\tilde{H}_{{\bm{\theta}}_{n+1},{\mathbf{x}}_{n+1}})(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n+1}))}_{r^{({\bm{\theta}},1)}_{n}}\\ &+\beta_{n+1}\underbrace{(({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n})-({\mathbf{K}}_{{\mathbf{x}}_{n+1}}\tilde{H}_{{\bm{\theta}}_{n+1},{\mathbf{x}}_{n+1}})(X_{n+1}))}_{r^{({\bm{\theta}},2)}_{n}},\end{split} (35a)
\begin{split}{\mathbf{x}}_{n+1}=&{\mathbf{x}}_{n}+\gamma_{n+1}({\bm{\pi}}({\mathbf{x}}_{n})-{\mathbf{x}}_{n})+\gamma_{n+1}\underbrace{(q_{{\mathbf{x}}_{n}}(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n}))}_{M_{n+1}^{({\mathbf{x}})}}\\ &+\gamma_{n+1}\underbrace{(({\mathbf{K}}_{{\mathbf{x}}_{n+1}}q_{{\mathbf{x}}_{n+1}})(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n+1}))}_{r^{({\mathbf{x}},1)}_{n}}\\ &+\gamma_{n+1}\underbrace{(({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})-({\mathbf{K}}_{{\mathbf{x}}_{n+1}}q_{{\mathbf{x}}_{n+1}})(X_{n+1}))}_{r^{({\mathbf{x}},2)}_{n}}.\end{split} (35b)

such that

𝜽n+1=𝜽n+βn+1𝐡(𝜽n,𝐱n)+βn+1Mn+1(𝜽)+βn+1rn(𝜽,1)+βn+1rn(𝜽,2),{\bm{\theta}}_{n+1}={\bm{\theta}}_{n}+\beta_{n+1}{\mathbf{h}}({\bm{\theta}}_{n},{\mathbf{x}}_{n})+\beta_{n+1}M_{n+1}^{({\bm{\theta}})}+\beta_{n+1}r^{({\bm{\theta}},1)}_{n}+\beta_{n+1}r^{({\bm{\theta}},2)}_{n}, (36a)
𝐱n+1=𝐱n+γn+1(𝝅(𝐱n)𝐱n)+γn+1Mn+1(𝐱)+γn+1rn(𝐱,1)+γn+1rn(𝐱,2).{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\gamma_{n+1}({\bm{\pi}}({\mathbf{x}}_{n})-{\mathbf{x}}_{n})+\gamma_{n+1}M_{n+1}^{({\mathbf{x}})}+\gamma_{n+1}r^{({\mathbf{x}},1)}_{n}+\gamma_{n+1}r^{({\mathbf{x}},2)}_{n}. (36b)
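
As a sanity check, adding the martingale and remainder terms and invoking the Poisson equation (34b) recovers the noise term of (33a) exactly:

M_{n+1}^{({\bm{\theta}})}+r_{n}^{({\bm{\theta}},1)}+r_{n}^{({\bm{\theta}},2)}=\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n+1})=H({\bm{\theta}}_{n},X_{n+1})-{\mathbf{h}}({\bm{\theta}}_{n},{\mathbf{x}}_{n}),

and analogously, using (34a), the terms in (36b) recombine into the noise term of (33b). Hence (36) is a rearrangement of (33), not an approximation.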

Observe that (36) differs from the expressions in Konda & Tsitsiklis (2004); Mokkadem & Pelletier (2006), which studied two-timescale SA with martingale difference noise. Here, due to the presence of the iterate-dependent Markovian noise and the application of the Poisson equation technique, we have additional non-vanishing terms r^{({\bm{\theta}},2)}_{n},r^{({\mathbf{x}},2)}_{n}, which are further examined in Lemma E.2. Additionally, when we apply the Poisson equation to the martingale difference terms M_{n+1}^{({\bm{\theta}})}, M_{n+1}^{({\mathbf{x}})}, some covariance terms are also non-vanishing, as shown in Lemma E.1; we return to this point when we compute those covariances. These extra non-zero noise terms distinguish our analysis from previous ones, since the key assumption (A4) in Mokkadem & Pelletier (2006) is not satisfied. We show that the long-run averages of these terms can be controlled so that they do not affect the final CLT result.

Analysis of Terms Mn+1(θ),Mn+1(𝐱)M_{n+1}^{({\bm{\theta}})},M_{n+1}^{({\mathbf{x}})}

Consider the filtration {\mathcal{F}}_{n}\triangleq\sigma({\bm{\theta}}_{0},{\mathbf{x}}_{0},X_{0},\cdots,{\bm{\theta}}_{n},{\mathbf{x}}_{n},X_{n}). It is evident that M_{n+1}^{({\bm{\theta}})},M_{n+1}^{({\mathbf{x}})} are martingale difference sequences adapted to {\mathcal{F}}_{n}. Then, we have

\begin{split}&\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\mathbf{x}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]\\ =&~{}\mathbb{E}[q_{{\mathbf{x}}_{n}}(X_{n+1})q_{{\mathbf{x}}_{n}}(X_{n+1})^{T}|{\mathcal{F}}_{n}]+({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\left(({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\right)^{T}\\ &-\mathbb{E}[q_{{\mathbf{x}}_{n}}(X_{n+1})|{\mathcal{F}}_{n}]\left(({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\right)^{T}-({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\mathbb{E}[q_{{\mathbf{x}}_{n}}(X_{n+1})^{T}|{\mathcal{F}}_{n}]\\ =&~{}\mathbb{E}[q_{{\mathbf{x}}_{n}}(X_{n+1})q_{{\mathbf{x}}_{n}}(X_{n+1})^{T}|{\mathcal{F}}_{n}]-({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\left(({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\right)^{T}.\end{split} (37)

Similarly, we have

𝔼[Mn+1(𝜽)(Mn+1(𝜽))T|n]=𝔼[H~𝜽n,𝐱n(Xn+1)H~𝜽n,𝐱n(Xn+1)T|n](𝐊𝐱nH~𝜽n,𝐱n)(Xn)((𝐊𝐱nH~𝜽n,𝐱n)(Xn))T,\begin{split}&\mathbb{E}\left[\left.M^{({\bm{\theta}})}_{n+1}(M^{({\bm{\theta}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]\\ =&~{}\mathbb{E}[\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(X_{n+1})\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(X_{n+1})^{T}|{\mathcal{F}}_{n}]-({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n})\left(({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n})\right)^{T},\end{split} (38)

and

𝔼[Mn+1(𝐱)(Mn+1(𝜽))T|n]=𝔼[q𝐱n(Xn+1)H~𝜽n,𝐱n(Xn+1)T|n](𝐊𝐱nq𝐱n)(Xn)((𝐊𝐱nH~𝜽n,𝐱n)(Xn))T.\begin{split}&\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\bm{\theta}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]\\ =&~{}\mathbb{E}[q_{{\mathbf{x}}_{n}}(X_{n+1})\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(X_{n+1})^{T}|{\mathcal{F}}_{n}]-({\mathbf{K}}_{{\mathbf{x}}_{n}}q_{{\mathbf{x}}_{n}})(X_{n})\left(({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n})\right)^{T}.\end{split}
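As a numerical sanity check of the conditional covariance identity (37) (and, analogously, (38)), the following Python sketch evaluates both sides on a toy finite-state kernel, taking the Martingale difference of the form q(X_{n+1}) - (Kq)(X_n) consistent with (37); the kernel K, the vector-valued function q, the dimensions, and the conditioning state are placeholders rather than quantities from our algorithm.

import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 2                              # toy state-space size and function dimension (placeholders)
K = rng.random((N, N))
K /= K.sum(axis=1, keepdims=True)        # row-stochastic kernel (placeholder for K[x])
q = rng.standard_normal((N, d))          # q(j) stacked as rows (placeholder test function)

i = 0                                    # condition on X_n = i
Kq_i = K[i] @ q                          # (K q)(i) = sum_j K_{i,j} q(j)

# LHS of (37): E[ M M^T | X_n = i ] with M = q(X_{n+1}) - (K q)(X_n)
lhs = sum(K[i, j] * np.outer(q[j] - Kq_i, q[j] - Kq_i) for j in range(N))

# RHS of (37): E[ q(X_{n+1}) q(X_{n+1})^T | X_n = i ] - (K q)(i) ((K q)(i))^T
rhs = sum(K[i, j] * np.outer(q[j], q[j]) for j in range(N)) - np.outer(Kq_i, Kq_i)

assert np.allclose(lhs, rhs)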

We now focus on 𝔼[Mn+1(𝐱)(Mn+1(𝐱))T|n]\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\mathbf{x}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]. Denote by

V1(𝐱,i)j𝒩𝐊i,j[𝐱]q𝐱(j)q𝐱(j)T(𝐊𝐱q𝐱)(i)((𝐊𝐱q𝐱)(i))T,V_{1}({\mathbf{x}},i)\triangleq\sum_{j\in{\mathcal{N}}}{\mathbf{K}}_{i,j}[{\mathbf{x}}]q_{{\mathbf{x}}}(j)q_{{\mathbf{x}}}(j)^{T}-({\mathbf{K}}_{{\mathbf{x}}}q_{{\mathbf{x}}})(i)\left(({\mathbf{K}}_{{\mathbf{x}}}q_{{\mathbf{x}}})(i)\right)^{T}, (39)

and let its expectation w.r.t. the stationary distribution 𝝅(𝐱){\bm{\pi}}({\mathbf{x}}) be v1(𝐱)𝔼i𝝅(𝐱)[V1(𝐱,i)]v_{1}({\mathbf{x}})\triangleq\mathbb{E}_{i\sim{\bm{\pi}}({\mathbf{x}})}[V_{1}({\mathbf{x}},i)]. We can then construct another Poisson equation, i.e.,

𝔼[Mn+1(𝐱)(Mn+1(𝐱))T|n]Xn𝒩πXn(𝐱n)𝔼[Mn+1(𝐱)(Mn+1(𝐱))T|n]=V1(𝐱n,Xn+1)v1(𝐱n)=φ𝐱(1)(Xn+1)(𝐊𝐱nφ𝐱n(1))(Xn+1),\begin{split}&\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\mathbf{x}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]-\sum_{X_{n}\in{\mathcal{N}}}\pi_{X_{n}}({\mathbf{x}}_{n})\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\mathbf{x}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]\\ =&~{}V_{1}({\mathbf{x}}_{n},X_{n+1})-v_{1}({\mathbf{x}}_{n})\\ =&~{}\varphi^{(1)}_{{\mathbf{x}}}(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\varphi^{(1)}_{{\mathbf{x}}_{n}})(X_{n+1}),\end{split}

for some matrix-valued function φ𝐱(1):𝒩N×N\varphi^{(1)}_{{\mathbf{x}}}:{\mathcal{N}}\to\mathbb{R}^{N\times N}. Since q𝐱q_{{\mathbf{x}}} and 𝐊[𝐱]{\mathbf{K}}[{\mathbf{x}}] are continuous in 𝐱{\mathbf{x}}, functions V1,v1V_{1},v_{1} are also continuous in 𝐱{\mathbf{x}}. Then, we can decompose (39) into

V1(𝐱n,Xn+1)=v1(𝝁)𝐔22+v1(𝐱n)v1(𝝁)𝐃n(1)+φ𝐱n(1)(Xn+1)(𝐊𝐱nφ𝐱n(1))(Xn)𝐉n(1,a)+(𝐊𝐱nφ𝐱n(1))(Xn)(𝐊𝐱nφ𝐱n(1))(Xn+1)𝐉n(1,b).\begin{split}V_{1}({\mathbf{x}}_{n},X_{n+1})=&\underbrace{v_{1}({\bm{\mu}})}_{{\mathbf{U}}_{22}}+\underbrace{v_{1}({\mathbf{x}}_{n})-v_{1}({\bm{\mu}})}_{{\mathbf{D}}^{(1)}_{n}}+\underbrace{\varphi^{(1)}_{{\mathbf{x}}_{n}}(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\varphi^{(1)}_{{\mathbf{x}}_{n}})(X_{n})}_{{\mathbf{J}}^{(1,a)}_{n}}\\ &+\underbrace{({\mathbf{K}}_{{\mathbf{x}}_{n}}\varphi^{(1)}_{{\mathbf{x}}_{n}})(X_{n})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\varphi^{(1)}_{{\mathbf{x}}_{n}})(X_{n+1})}_{{\mathbf{J}}^{(1,b)}_{n}}.\end{split} (40)

Thus, we have

𝔼[Mn+1(𝐱)(Mn+1(𝐱))T|n]=𝐔22+𝐃n(1)+𝐉n(1),\mathbb{E}[M^{({\mathbf{x}})}_{n+1}(M^{({\mathbf{x}})}_{n+1})^{T}|{\mathcal{F}}_{n}]={\mathbf{U}}_{22}+{\mathbf{D}}_{n}^{(1)}+{\mathbf{J}}_{n}^{(1)}, (41)

where 𝐉n(1)=𝐉n(1,a)+𝐉n(1,b){\mathbf{J}}_{n}^{(1)}={\mathbf{J}}_{n}^{(1,a)}+{\mathbf{J}}_{n}^{(1,b)}.

Following the similar steps above, we can decompose 𝔼[Mn+1(𝐱)(Mn+1(𝜽))T|n]\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\bm{\theta}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right] and 𝔼[Mn+1(𝜽)(Mn+1(𝜽))T|n]\mathbb{E}\left[\left.M^{({\bm{\theta}})}_{n+1}(M^{({\bm{\theta}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right] as

𝔼[Mn+1(𝐱)(Mn+1(𝜽))T|n]=𝐔21+𝐃n(2)+𝐉n(2),\mathbb{E}\left[\left.M^{({\mathbf{x}})}_{n+1}(M^{({\bm{\theta}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]={\mathbf{U}}_{21}+{\mathbf{D}}_{n}^{(2)}+{\mathbf{J}}_{n}^{(2)}, (42a)
𝔼[Mn+1(𝜽)(Mn+1(𝜽))T|n]=𝐔11+𝐃n(3)+𝐉n(3).\mathbb{E}\left[\left.M^{({\bm{\theta}})}_{n+1}(M^{({\bm{\theta}})}_{n+1})^{T}\right|{\mathcal{F}}_{n}\right]={\mathbf{U}}_{11}+{\mathbf{D}}_{n}^{(3)}+{\mathbf{J}}_{n}^{(3)}. (42b)

where 𝐉n(2)=𝐉n(2,a)+𝐉n(2,b){\mathbf{J}}_{n}^{(2)}={\mathbf{J}}_{n}^{(2,a)}+{\mathbf{J}}_{n}^{(2,b)} and 𝐉n(3)=𝐉n(3,a)+𝐉n(3,b){\mathbf{J}}_{n}^{(3)}={\mathbf{J}}_{n}^{(3,a)}+{\mathbf{J}}_{n}^{(3,b)}. Here, we note that the matrices 𝐉n(i){\mathbf{J}}_{n}^{(i)} for i=1,2,3i=1,2,3 are not present in the existing CLT analysis of the two-timescale SA with Martingale difference noise. In addition, 𝐔11{\mathbf{U}}_{11}, 𝐔12{\mathbf{U}}_{12} and 𝐔22{\mathbf{U}}_{22} inherently include the information of the underlying Markov chain (through its eigenpairs (λi,𝐮i\lambda_{i},{\mathbf{u}}_{i})), which extends the previous works (Konda & Tsitsiklis, 2004; Mokkadem & Pelletier, 2006).

Lemma E.1.

For Mn+1(𝛉),Mn+1(𝐱)M_{n+1}^{({\bm{\theta}})},M_{n+1}^{({\mathbf{x}})} defined in (35) and their decomposition in (41) and (42), we have

𝐔11=i=1N11+λi1λi𝐮i𝐮iT,𝐔21=i=1N11+λi1λi𝐮i𝐮iT𝐇,𝐔22=i=1N11+λi1λi𝐇T𝐮i𝐮iT𝐇,{\mathbf{U}}_{11}=\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T},\quad{\mathbf{U}}_{21}=\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}},\quad{\mathbf{U}}_{22}=\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}, (43a)
limn𝐃n(i)=0a.s.fori=1,2,3,\lim_{n\to\infty}{\mathbf{D}}_{n}^{(i)}=0~{}~{}\text{a.s.}~{}~{}~{}~{}\text{for}~{}~{}~{}~{}i=1,2,3, (43b)
limnγn𝔼[k=1n𝐉k(i)]=0,fori=1,2,3.\lim_{n\to\infty}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}^{(i)}_{k}\right\|\right]=0,\quad\text{for}~{}~{}~{}~{}i=1,2,3. (43c)
Proof.

We now provide the properties of the four terms inside (41) as an example. Note that

𝐔11=𝔼i𝝁[V1(𝝁,i)]=i𝒩μi[j𝒩𝐏(i,j)q𝝁(j)q𝝁(j)T(𝐏q𝝁)(i)((𝐏q𝝁)(i))T]=j𝒩μjq𝝁(j)q𝝁(j)T(𝐏q𝝁)(j)((𝐏q𝝁)(j))T.\begin{split}{\mathbf{U}}_{11}=&~{}\mathbb{E}_{i\sim{\bm{\mu}}}[V_{1}({\bm{\mu}},i)]=\sum_{i\in{\mathcal{N}}}\mu_{i}\left[\sum_{j\in{\mathcal{N}}}{\mathbf{P}}(i,j)q_{{\bm{\mu}}}(j)q_{{\bm{\mu}}}(j)^{T}-({\mathbf{P}}q_{{\bm{\mu}}})(i)\left(({\mathbf{P}}q_{{\bm{\mu}}})(i)\right)^{T}\right]\\ =&~{}\sum_{j\in{\mathcal{N}}}\mu_{j}q_{{\bm{\mu}}}(j)q_{{\bm{\mu}}}(j)^{T}-({\mathbf{P}}q_{{\bm{\mu}}})(j)\left(({\mathbf{P}}q_{{\bm{\mu}}})(j)\right)^{T}.\end{split}

We can see that it has exactly the same structure as matrix 𝑼{\bm{U}} in (27). Following the similar steps in deducing the explicit form of 𝑼{\bm{U}} from (28) to (30), we get

𝐔11=i=1N11+λi1λi𝐮i𝐮iT.{\mathbf{U}}_{11}=\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}. (44)

By the almost sure convergence result 𝐱n𝝁{\mathbf{x}}_{n}\to{\bm{\mu}} in Lemma 3.1, v1(𝐱n)v1(𝝁)v_{1}({\mathbf{x}}_{n})\to v_{1}({\bm{\mu}}) a.s. such that limn𝐃n(1)=0\lim_{n\to\infty}{\mathbf{D}}_{n}^{(1)}=0 a.s.

We next prove that limnγn𝔼[k=1n𝐉k(1,a)]=0\lim_{n\to\infty}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}^{(1,a)}_{k}\right\|\right]=0 and limnγn𝔼[k=1n𝐉k(1,b)]=0\lim_{n\to\infty}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}^{(1,b)}_{k}\right\|\right]=0.

Since {𝐉n(1,a)}\{{\mathbf{J}}^{(1,a)}_{n}\} is a Martingale difference sequence adapted to n{\mathcal{F}}_{n}, with the Burkholder inequality in Lemma G.2 and p=1p=1, we show that

𝔼[k=1n𝐉k(1,a)]C1𝔼[(k=1n𝐉k(1,a)2)].\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,a)}\right\|\right]\leq C_{1}\mathbb{E}\left[\sqrt{\left(\sum_{k=1}^{n}\left\|{\mathbf{J}}_{k}^{(1,a)}\right\|^{2}\right)}\right]. (45)

By assumption A4, 𝐱n{\mathbf{x}}_{n} is always within some compact set Ω\Omega such that supn𝐉n(1,a)CΩ<\sup_{n}\|{\mathbf{J}}_{n}^{(1,a)}\|\leq C_{\Omega}<\infty and for a given trajectory ω\omega of 𝐱n(ω){\mathbf{x}}_{n}(\omega),

γnCp(k=1n𝐉k(1,a)2)CpCΩγnn,\gamma_{n}C_{p}\sqrt{\left(\sum_{k=1}^{n}\left\|{\mathbf{J}}_{k}^{(1,a)}\right\|^{2}\right)}\leq C_{p}C_{\Omega}\gamma_{n}\sqrt{n}, (46)

and the last term decreases to zero in nn since a>1/2a>1/2.

For 𝐉n(1,b){\mathbf{J}}_{n}^{(1,b)}, we use the Abel transformation and obtain

k=1n𝐉k(1,b)=k=1n((𝐊𝐱kφ𝐱k(1))(Xk1)(𝐊𝐱k1φ𝐱k1(1))(Xk1))+(𝐊𝐱0φ𝐱0(1))(X0)(𝐊𝐱nφ𝐱n(1))(Xn).\begin{split}\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}=&\sum_{k=1}^{n}(({\mathbf{K}}_{{\mathbf{x}}_{k}}\varphi^{(1)}_{{\mathbf{x}}_{k}})(X_{k-1})-({\mathbf{K}}_{{\mathbf{x}}_{k-1}}\varphi^{(1)}_{{\mathbf{x}}_{k-1}})(X_{k-1}))\\ &+({\mathbf{K}}_{{\mathbf{x}}_{0}}\varphi^{(1)}_{{\mathbf{x}}_{0}})(X_{0})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\varphi^{(1)}_{{\mathbf{x}}_{n}})(X_{n}).\end{split}

Since (𝐊𝐱φ𝐱(1))(X)({\mathbf{K}}_{{\mathbf{x}}}\varphi^{(1)}_{{\mathbf{x}}})(X) is continuous in 𝐱{\mathbf{x}}, for 𝐱n{\mathbf{x}}_{n} within a compact set Ω\Omega (assumption A4), it is locally Lipschitz with a constant LΩL_{\Omega} such that

(𝐊𝐱kφ𝐱k(1))(Xk1)(𝐊𝐱k1φ𝐱k1(1))(Xk1)LΩ𝐱k𝐱k12LΩγk,\|({\mathbf{K}}_{{\mathbf{x}}_{k}}\varphi^{(1)}_{{\mathbf{x}}_{k}})(X_{k-1})-({\mathbf{K}}_{{\mathbf{x}}_{k-1}}\varphi^{(1)}_{{\mathbf{x}}_{k-1}})(X_{k-1})\|\leq L_{\Omega}\|{\mathbf{x}}_{k}-{\mathbf{x}}_{k-1}\|\leq 2L_{\Omega}\gamma_{k},

where the last inequality arises from (4b), i.e., 𝐱k𝐱k1=γk(𝜹Xk𝐱k1){\mathbf{x}}_{k}-{\mathbf{x}}_{k-1}=\gamma_{k}({\bm{\delta}}_{X_{k}}-{\mathbf{x}}_{k-1}) and 𝜹Xk𝐱k12\|{\bm{\delta}}_{X_{k}}-{\mathbf{x}}_{k-1}\|\leq 2 because 𝐱nInt(Σ){\mathbf{x}}_{n}\in\text{Int}(\Sigma). Also, (𝐊𝐱0φ𝐱0(1))(X0)+(𝐊𝐱nφ𝐱n(1))(Xn)\|({\mathbf{K}}_{{\mathbf{x}}_{0}}\varphi^{(1)}_{{\mathbf{x}}_{0}})(X_{0})\|+\|({\mathbf{K}}_{{\mathbf{x}}_{n}}\varphi^{(1)}_{{\mathbf{x}}_{n}})(X_{n})\| is upper-bounded by some positive constant CΩC_{\Omega}^{\prime}. This implies that

k=1n𝐉k(1,b)CΩ+2LΩk=1nγk.\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\leq C_{\Omega}^{\prime}+2L_{\Omega}\sum_{k=1}^{n}\gamma_{k}.

Note that

γnk=1n𝐉k(1,b)γnCΩ+2LΩγnk=1nγkγnCΩ+2LΩ1an12a,\gamma_{n}\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\leq\gamma_{n}C_{\Omega}^{\prime}+2L_{\Omega}\gamma_{n}\sum_{k=1}^{n}\gamma_{k}\leq\gamma_{n}C_{\Omega}^{\prime}+\frac{2L_{\Omega}}{1-a}n^{1-2a}, (47)

where the last inequality is from k=1nγk11an1a\sum_{k=1}^{n}\gamma_{k}\leq\frac{1}{1-a}n^{1-a} (note that a<1a<1 in this scenario since βn=o(γn)\beta_{n}=o(\gamma_{n})). We observe that the last term in (47) decreases to zero in nn because a>1/2a>1/2.
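As a quick numerical sanity check of this step, the sketch below (with a hypothetical exponent a = 0.8, so that γ_k = (k+1)^{-a}) confirms that γ_n ∑_{k≤n} γ_k stays below n^{1−2a}/(1−a) and vanishes as n grows.

import numpy as np

a = 0.8                                          # hypothetical step-size exponent, a in (1/2, 1)
for n in [10**2, 10**3, 10**4, 10**5, 10**6]:
    gamma = (np.arange(1, n + 1) + 1.0) ** (-a)  # gamma_k = (k+1)^{-a}, k = 1, ..., n
    lhs = gamma[-1] * gamma.sum()                # gamma_n * sum_{k<=n} gamma_k
    bound = n ** (1 - 2 * a) / (1 - a)           # the bound appearing in (47)
    print(n, lhs, bound, lhs <= bound)           # lhs <= bound, and both tend to zero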

Note that 𝐉k(1)=𝐉k(1,a)+𝐉k(1,b){\mathbf{J}}_{k}^{(1)}={\mathbf{J}}_{k}^{(1,a)}+{\mathbf{J}}_{k}^{(1,b)}; by the triangle inequality we have

γn𝔼[k=1n𝐉k(1)]γn𝔼[k=1n𝐉k(1,a)]+γn𝔼[k=1n𝐉k(1,b)]γnC1𝔼[(k=1n𝐉k(1,a)2)]+γn𝔼[k=1n𝐉k(1,b)]=𝔼[γnC1(k=1n𝐉k(1,a)2)+γnk=1n𝐉k(1,b)],\begin{split}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1)}\right\|\right]&\leq\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,a)}\right\|\right]+\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\right]\\ &\leq\gamma_{n}C_{1}\mathbb{E}\left[\sqrt{\left(\sum_{k=1}^{n}\left\|{\mathbf{J}}_{k}^{(1,a)}\right\|^{2}\right)}\right]+\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\right]\\ &=\mathbb{E}\left[\gamma_{n}C_{1}\sqrt{\left(\sum_{k=1}^{n}\left\|{\mathbf{J}}_{k}^{(1,a)}\right\|^{2}\right)}+\gamma_{n}\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\right],\end{split} (48)

where the second inequality comes from (45). By (46) and (47) we know that both terms in the last line of (48) are uniformly bounded over time nn by constants that depend on the set Ω\Omega. Therefore, by the dominated convergence theorem, taking the limit over the last line of (48) gives

limn𝔼[γnC1(k=1n𝐉k(1,a)2)+γnk=1n𝐉k(1,b)]=𝔼[limnγnC1(k=1n𝐉k(1,a)2)+γnk=1n𝐉k(1,b)]=0.\begin{split}&\lim_{n\to\infty}\mathbb{E}\left[\gamma_{n}C_{1}\sqrt{\left(\sum_{k=1}^{n}\left\|{\mathbf{J}}_{k}^{(1,a)}\right\|^{2}\right)}\!+\!\gamma_{n}\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\right]\\ =&~{}\mathbb{E}\left[\lim_{n\to\infty}\gamma_{n}C_{1}\sqrt{\left(\sum_{k=1}^{n}\left\|{\mathbf{J}}_{k}^{(1,a)}\right\|^{2}\right)}\!+\!\gamma_{n}\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1,b)}\right\|\right]\!=\!0.\end{split}

Therefore, we have

limnγn𝔼[k=1n𝐉k(1)]=0,\lim_{n\to\infty}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}_{k}^{(1)}\right\|\right]=0,

In sum, in terms of 𝔼[Mn+1(𝐱)(Mn+1(𝐱))T|n]\mathbb{E}[M^{({\mathbf{x}})}_{n+1}(M^{({\mathbf{x}})}_{n+1})^{T}|{\mathcal{F}}_{n}] in (41), we have obtained the explicit form in (44), limn𝐃n(1)=0\lim_{n\to\infty}{\mathbf{D}}_{n}^{(1)}=0 a.s., and limnγn𝔼[k=1n𝐉k(1)]=0\lim_{n\to\infty}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}{\mathbf{J}}^{(1)}_{k}\right\|\right]=0.

We can apply the same steps as above for the other two terms i=2,3i=2,3 in (42) and obtain the results. ∎

Analysis of Terms rn(θ,1),rn(θ,2),rn(𝐱,1),rn(𝐱,2)r^{({\bm{\theta}},1)}_{n},r^{({\bm{\theta}},2)}_{n},r^{({\mathbf{x}},1)}_{n},r^{({\mathbf{x}},2)}_{n}

Lemma E.2.

For rn(𝛉,1),rn(𝛉,2),rn(𝐱,1),rn(𝐱,2)r^{({\bm{\theta}},1)}_{n},r^{({\bm{\theta}},2)}_{n},r^{({\mathbf{x}},1)}_{n},r^{({\mathbf{x}},2)}_{n} defined in (35), we have the following results:

rn(𝜽,1)=O(γn)=o(βn),γnk=1nrk(𝜽,2)=O(γn)=o(1).\|r^{({\bm{\theta}},1)}_{n}\|=O(\gamma_{n})=o(\sqrt{\beta_{n}}),\quad\sqrt{\gamma_{n}}\left\|\sum_{k=1}^{n}r^{({\bm{\theta}},2)}_{k}\right\|=O(\sqrt{\gamma_{n}})=o(1). (49a)
rn(𝐱,1)=O(γn)=o(βn),γnk=1nrk(𝐱,2)=O(γn)=o(1).\|r^{({\mathbf{x}},1)}_{n}\|=O(\gamma_{n})=o(\sqrt{\beta_{n}}),\quad\sqrt{\gamma_{n}}\left\|\sum_{k=1}^{n}r^{({\mathbf{x}},2)}_{k}\right\|=O(\sqrt{\gamma_{n}})=o(1). (49b)
Proof.

For rn(𝜽,1)r^{({\bm{\theta}},1)}_{n}, note that

rn(𝜽,1)=(𝐊𝐱n+1H~𝜽n+1,𝐱n+1)(Xn+1)(𝐊𝐱nH~𝜽n,𝐱n)(Xn+1)=j𝒩(𝐊Xn,j[𝐱n+1]H~𝜽n+1,𝐱n+1(j)𝐊Xn,j[𝐱n]H~𝜽n,𝐱n(j))j𝒩L𝒞(𝜽n+1𝜽n+𝐱n+1𝐱n)NL𝒞(C𝒞βn+1+2γn+1)\begin{split}r^{({\bm{\theta}},1)}_{n}=&~{}({\mathbf{K}}_{{\mathbf{x}}_{n+1}}\tilde{H}_{{\bm{\theta}}_{n+1},{\mathbf{x}}_{n+1}})(X_{n+1})-({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n+1})\\ =&~{}\sum_{j\in{\mathcal{N}}}\left({\mathbf{K}}_{X_{n},j}[{\mathbf{x}}_{n+1}]\tilde{H}_{{\bm{\theta}}_{n+1},{\mathbf{x}}_{n+1}}(j)-{\mathbf{K}}_{X_{n},j}[{\mathbf{x}}_{n}]\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(j)\right)\\ \leq&~{}\sum_{j\in{\mathcal{N}}}L_{{\mathcal{C}}}(\|{\bm{\theta}}_{n+1}-{\bm{\theta}}_{n}\|+\|{\mathbf{x}}_{n+1}-{\mathbf{x}}_{n}\|)\\ \leq&NL_{{\mathcal{C}}}(C_{{\mathcal{C}}}\beta_{n+1}+2\gamma_{n+1})\end{split} (50)

where the second-to-last inequality is because 𝐊i,j[𝐱]H~𝜽,𝐱(j){\mathbf{K}}_{i,j}[{\mathbf{x}}]\tilde{H}_{{\bm{\theta}},{\mathbf{x}}}(j) is Lipschitz continuous in (𝜽,𝐱)({\bm{\theta}},{\mathbf{x}}) with constant L𝒞L_{{\mathcal{C}}} on the compact set, which stems from the continuous functions 𝐊[𝐱]{\mathbf{K}}[{\mathbf{x}}] and H~𝜽,𝐱\tilde{H}_{{\bm{\theta}},{\mathbf{x}}}. The last inequality is from the update rules (4) and (𝜽n,𝐱n)Ω({\bm{\theta}}_{n},{\mathbf{x}}_{n})\in\Omega for some compact subset Ω\Omega by assumption A4. Then, we have rn(𝜽,1)=O(γn)=o(βn)\|r^{({\bm{\theta}},1)}_{n}\|=O(\gamma_{n})=o(\sqrt{\beta_{n}}) because a>1/2b/2a>1/2\geq b/2 by assumption A2.

We let νn(𝐊𝐱nH~𝜽n,𝐱n)(Xn)\nu_{n}\triangleq({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(X_{n}) such that rn(𝜽,2)=νnνn+1r^{({\bm{\theta}},2)}_{n}=\nu_{n}-\nu_{n+1}. Note that k=1nrk(𝜽,2)=ν1νn+1\sum_{k=1}^{n}r^{({\bm{\theta}},2)}_{k}=\nu_{1}-\nu_{n+1}, and by assumption A4, νn\|\nu_{n}\| is upper bounded by a constant dependent on the compact set, which leads to

γnk=1nrk(𝜽,2)=γnν1νn+1=O(γn)=o(1).\sqrt{\gamma_{n}}\left\|\sum_{k=1}^{n}r^{({\bm{\theta}},2)}_{k}\right\|=\sqrt{\gamma_{n}}\|\nu_{1}-\nu_{n+1}\|=O(\sqrt{\gamma_{n}})=o(1).

Similarly, we can also obtain rn(𝐱,1)=o(βn)\|r^{({\mathbf{x}},1)}_{n}\|=o(\sqrt{\beta_{n}}) and γnk=1nrk(𝐱,2)=O(γn)=o(1)\sqrt{\gamma_{n}}\left\|\sum_{k=1}^{n}r^{({\mathbf{x}},2)}_{k}\right\|=O(\sqrt{\gamma_{n}})=o(1). ∎
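The order comparison used throughout Lemma E.2, namely γ_n = o(√β_n) whenever a > b/2, can also be checked numerically; a minimal sketch with hypothetical exponents a = 0.9 and b = 1 (so that β_n = o(γ_n) and a > b/2):

import numpy as np

a, b = 0.9, 1.0                      # hypothetical exponents with b > a > b/2
n = np.logspace(1, 7, num=7)         # n = 10, 10^2, ..., 10^7
gamma = (n + 1) ** (-a)              # gamma_n = (n+1)^{-a}
beta = (n + 1) ** (-b)               # beta_n  = (n+1)^{-b}
print(gamma / np.sqrt(beta))         # ratio gamma_n / sqrt(beta_n) -> 0, i.e., gamma_n = o(sqrt(beta_n))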

E.2.2 Effect of SRRW Iteration on SA Iteration

In view of the almost sure convergence results in Lemma 3.1 and Lemma 3.2, for large enough nn both iterates 𝜽n,𝐱n{\bm{\theta}}_{n},{\mathbf{x}}_{n} are close to the equilibrium (𝜽,𝝁)({\bm{\theta}}^{*},{\bm{\mu}}), so we can apply the Taylor expansion to the functions 𝐡(𝜽,𝐱){\mathbf{h}}({\bm{\theta}},{\mathbf{x}}) and 𝝅[𝐱]𝐱{\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}} in (36) at the point (𝜽,𝝁)({\bm{\theta}}^{*},{\bm{\mu}}), which results in

𝐡(𝜽,𝐱)=𝐡(𝜽,𝝁)+𝜽𝐡(𝜽,𝝁)(𝜽𝜽)+𝐱𝐡(𝜽,𝝁)(𝐱𝝁)+O(𝜽𝜽2+𝐱𝝁2),{\mathbf{h}}({\bm{\theta}},{\mathbf{x}})={\mathbf{h}}({\bm{\theta}}^{*},{\bm{\mu}})+\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*},{\bm{\mu}})({\bm{\theta}}-{\bm{\theta}}^{*})+\nabla_{{\mathbf{x}}}{\mathbf{h}}({\bm{\theta}}^{*},{\bm{\mu}})({\mathbf{x}}-{\bm{\mu}})+O(\|{\bm{\theta}}-{\bm{\theta}}^{*}\|^{2}+\|{\mathbf{x}}-{\bm{\mu}}\|^{2}), (51a)
𝝅[𝐱]𝐱=𝝅[𝝁]𝝁+𝐱(𝝅(𝐱)𝐱)|𝐱=𝝁(𝐱𝝁)+O(𝐱𝝁2).{\bm{\pi}}[{\mathbf{x}}]-{\mathbf{x}}={\bm{\pi}}[{\bm{\mu}}]-{\bm{\mu}}+\nabla_{{\mathbf{x}}}({\bm{\pi}}({\mathbf{x}})-{\mathbf{x}})|_{{\mathbf{x}}={\bm{\mu}}}({\mathbf{x}}-{\bm{\mu}})+O(\|{\mathbf{x}}-{\bm{\mu}}\|^{2}). (51b)

The blocks of the matrix 𝐉(α){\mathbf{J}}(\alpha) are then given by the following:

𝐉11=𝜽𝐡(𝜽,𝝁)=𝐡(𝜽),𝐉12(α)=𝐱𝐡(𝜽,𝝁)=α𝐇T(𝐏T+𝐈),𝐉22(α)=𝐱(𝝅(𝐱)𝐱)|𝐱=𝝁=2α𝝁𝟏Tα𝐏T(α+1)𝐈.\begin{split}&{\mathbf{J}}_{11}=\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*},{\bm{\mu}})=\nabla{\mathbf{h}}({\bm{\theta}}^{*}),\\ &{\mathbf{J}}_{12}(\alpha)=\nabla_{{\mathbf{x}}}{\mathbf{h}}({\bm{\theta}}^{*},{\bm{\mu}})=-\alpha{\mathbf{H}}^{T}({\mathbf{P}}^{T}+{\mathbf{I}}),\\ &{\mathbf{J}}_{22}(\alpha)=\nabla_{{\mathbf{x}}}({\bm{\pi}}({\mathbf{x}})-{\mathbf{x}})|_{{\mathbf{x}}={\bm{\mu}}}=2\alpha\bm{\mu}{\bm{1}}^{T}-\alpha{\mathbf{P}}^{T}-(\alpha+1){\mathbf{I}}.\end{split} (52)
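To illustrate the block J_22(α) in (52), the sketch below builds it for a toy reversible base chain and checks that it is Hurwitz, its spectrum being {−1} together with {−(α(1+λ_i)+1)} over the non-unit eigenvalues λ_i of P (using P^T μ = μ and P1 = 1); the chain P, its stationary distribution μ, the size N, and α are placeholders.

import numpy as np

rng = np.random.default_rng(1)
N, alpha = 5, 2.0                                # toy size and SRRW parameter (placeholders)

A = rng.random((N, N)); A = A + A.T              # symmetric positive edge weights
P = A / A.sum(axis=1, keepdims=True)             # reversible base chain
mu = A.sum(axis=1) / A.sum()                     # its stationary distribution (P^T mu = mu)

J22 = 2 * alpha * np.outer(mu, np.ones(N)) - alpha * P.T - (alpha + 1) * np.eye(N)

eig_J22 = np.sort(np.linalg.eigvals(J22).real)
assert eig_J22.max() < 0                         # J22(alpha) is Hurwitz

lam = np.sort(np.linalg.eigvals(P).real)[:-1]    # non-unit eigenvalues of P (real, by reversibility)
predicted = np.sort(np.concatenate(([-1.0], -(alpha * (1 + lam) + 1))))
assert np.allclose(eig_J22, predicted, atol=1e-8)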

Then, (36) becomes

𝜽n+1=𝜽n+βn+1(𝐉11(𝜽n𝜽)+𝐉12(α)(𝐱n𝝁)+rn(𝜽,1)+rn(𝜽,2)+Mn+1(𝜽)+ηn(𝜽)),{\bm{\theta}}_{n+1}={\bm{\theta}}_{n}+\beta_{n+1}({\mathbf{J}}_{11}({\bm{\theta}}_{n}-{\bm{\theta}}^{*})+{\mathbf{J}}_{12}(\alpha)({\mathbf{x}}_{n}-{\bm{\mu}})+r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+M^{({\bm{\theta}})}_{n+1}+\eta_{n}^{({\bm{\theta}})}), (53a)
𝐱n+1=𝐱n+γn+1(𝐉22(α)(𝐱n𝝁)+rn(𝐱,1)+rn(𝐱,2)+Mn+1(𝐱)+ηn(𝐱)),{\mathbf{x}}_{n+1}={\mathbf{x}}_{n}+\gamma_{n+1}({\mathbf{J}}_{22}(\alpha)({\mathbf{x}}_{n}-{\bm{\mu}})+r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+M^{({\mathbf{x}})}_{n+1}+\eta_{n}^{({\mathbf{x}})}), (53b)

where ηn(𝜽)=O(𝜽n𝜽2+𝐱n𝝁2)\eta_{n}^{({\bm{\theta}})}=O(\|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|^{2}+\|{\mathbf{x}}_{n}-{\bm{\mu}}\|^{2}) and ηn(𝐱)=O(𝐱n𝝁2)\eta_{n}^{({\mathbf{x}})}=O(\|{\mathbf{x}}_{n}-{\bm{\mu}}\|^{2}) collect the second-order Taylor remainders in (51).

Then, inspired by Mokkadem & Pelletier (2006), we decompose the errors 𝐱n𝝁{\mathbf{x}}_{n}-{\bm{\mu}} and 𝜽n𝜽{\bm{\theta}}_{n}-{\bm{\theta}}^{*} into 𝐱n𝝁=Ln(𝒙)+Δn(𝒙){\mathbf{x}}_{n}-{\bm{\mu}}=L^{({\bm{x}})}_{n}+\Delta^{({\bm{x}})}_{n} and 𝜽n𝜽=Ln(𝜽)+Rn(𝜽)+Δn(𝜽){\bm{\theta}}_{n}-{\bm{\theta}}^{*}=L^{({\bm{\theta}})}_{n}+R^{({\bm{\theta}})}_{n}+\Delta^{({\bm{\theta}})}_{n}. Rewriting (53b) gives

𝐱n𝝁=γn+11𝐉22(α)1(𝐱n+1𝐱n)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+Mn+1(𝐱)+ηn(𝐱)),{\mathbf{x}}_{n}-{\bm{\mu}}=\gamma_{n+1}^{-1}{\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{n+1}-{\mathbf{x}}_{n})-{\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+M^{({\mathbf{x}})}_{n+1}+\eta_{n}^{({\mathbf{x}})}),

and substituting the above equation back in (53a) gives

𝜽n+1𝜽=𝜽n𝜽+βn+1(𝐉11(𝜽n𝜽)+γn+11𝐉12(α)𝐉22(α)1(𝐱n+1𝐱n)𝐉12(α)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+Mn+1(𝐱)+ηn(𝐱))+rn(𝜽,1)+rn(𝜽,2)+Mn+1(𝜽)+ηn(𝜽))=(𝐈+βn+1𝐉11)(𝜽n𝜽)+[βn+1γn+11𝐉12(α)𝐉22(α)1(𝐱n+1𝐱n)]+βn+1(Mn+1(𝜽)𝐉12(α)𝐉22(α)1Mn+1(𝐱))+βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽)𝐉12(α)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+ηn(𝐱))),\begin{split}{\bm{\theta}}_{n+1}&-{\bm{\theta}}^{*}=~{}{\bm{\theta}}_{n}-{\bm{\theta}}^{*}+\beta_{n+1}\bigg{(}{\mathbf{J}}_{11}({\bm{\theta}}_{n}-{\bm{\theta}}^{*})+\gamma_{n+1}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{n+1}-{\mathbf{x}}_{n})\\ &-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+M^{({\mathbf{x}})}_{n+1}+\eta_{n}^{({\mathbf{x}})})+r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+M^{({\bm{\theta}})}_{n+1}+\eta_{n}^{({\bm{\theta}})}\bigg{)}\\ =&~{}({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11})({\bm{\theta}}_{n}-{\bm{\theta}}^{*})+[\beta_{n+1}\gamma_{n+1}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{n+1}-{\mathbf{x}}_{n})]\\ &+\beta_{n+1}(M^{({\bm{\theta}})}_{n+1}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}M^{({\mathbf{x}})}_{n+1})\\ &+\beta_{n+1}(r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+\eta_{n}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+\eta_{n}^{({\mathbf{x}})})),\end{split} (54)

From (54) we can see that the iteration {𝜽n}\{{\bm{\theta}}_{n}\} implicitly embeds the recursions of three sequences:

  • βn+1γn+11𝐉12(α)𝐉22(α)1(𝐱n+1𝐱n)\beta_{n+1}\gamma_{n+1}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{n+1}-{\mathbf{x}}_{n});

  • βn+1(Mn+1(𝜽)𝐉12(α)𝐉22(α)1Mn+1(𝐱))\beta_{n+1}(M^{({\bm{\theta}})}_{n+1}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}M^{({\mathbf{x}})}_{n+1});

  • βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽)𝐉12(α)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+ηn(𝐱)))\beta_{n+1}(r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+\eta_{n}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+\eta_{n}^{({\mathbf{x}})})).

Let unk=1nβku_{n}\triangleq\sum_{k=1}^{n}\beta_{k} and snk=1nγks_{n}\triangleq\sum_{k=1}^{n}\gamma_{k}. Below we define two iterations:

Ln(𝜽)=eβn𝐉11Ln1(𝜽)+βn(Mn(𝜽)𝐉12(α)𝐉22(α)1Mn(𝐱))=k=1ne(unuk)𝐉11βk(Mk(𝜽)𝐉12(α)𝐉22(α)1Mk(𝐱))\begin{split}L_{n}^{({\bm{\theta}})}=e^{\beta_{n}{\mathbf{J}}_{11}}&L_{n-1}^{({\bm{\theta}})}+\beta_{n}(M^{({\bm{\theta}})}_{n}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}M^{({\mathbf{x}})}_{n})\\ &=\sum_{k=1}^{n}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\beta_{k}(M^{({\bm{\theta}})}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}M^{({\mathbf{x}})}_{k})\end{split} (55a)
Rn(𝜽)=eβn𝐉11Rn1(𝜽)+βnγn1𝐉12(α)𝐉22(α)1(𝐱n𝐱n1)=k=1ne(unuk)𝐉11βkγk1𝐉12(α)𝐉22(α)1(𝐱k𝐱k1)\begin{split}R_{n}^{({\bm{\theta}})}=e^{\beta_{n}{\mathbf{J}}_{11}}&R_{n-1}^{({\bm{\theta}})}+\beta_{n}\gamma_{n}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{n}-{\mathbf{x}}_{n-1})\\ &=\sum_{k=1}^{n}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\beta_{k}\gamma_{k}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{k}-{\mathbf{x}}_{k-1})\end{split} (55b)

and a remaining term Δn(𝜽)𝜽n𝜽Ln(𝜽)Rn(𝜽)\Delta_{n}^{({\bm{\theta}})}\triangleq{\bm{\theta}}_{n}-{\bm{\theta}}^{*}-L_{n}^{({\bm{\theta}})}-R_{n}^{({\bm{\theta}})}.
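The recursions (55a)–(55b) are of the generic form L_n = e^{β_n Q}L_{n−1} + β_n ξ_n, whose closed form is the weighted sum displayed above. A minimal sketch verifying that the recursive and summed forms coincide, with a placeholder Hurwitz matrix Q standing in for J_11 and i.i.d. noise ξ_n standing in for the Martingale-difference input:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
d, n_steps, b = 2, 200, 0.75                      # placeholder dimension, horizon, step exponent
Q = np.array([[-1.0, 0.3], [0.0, -0.5]])          # placeholder Hurwitz matrix (stands in for J_11)
beta = (np.arange(1, n_steps + 1) + 1.0) ** (-b)  # beta_k = (k+1)^{-b}
u = np.cumsum(beta)                               # u_k = sum_{j<=k} beta_j
xi = rng.standard_normal((n_steps, d))            # placeholder noise input

# Recursive form: L_k = exp(beta_k Q) L_{k-1} + beta_k xi_k, with L_0 = 0
L = np.zeros(d)
for k in range(n_steps):
    L = expm(beta[k] * Q) @ L + beta[k] * xi[k]

# Closed form: L_n = sum_k exp((u_n - u_k) Q) beta_k xi_k
L_sum = sum(expm((u[-1] - u[k]) * Q) @ (beta[k] * xi[k]) for k in range(n_steps))

assert np.allclose(L, L_sum)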

Similarly, for iteration 𝐱n{\mathbf{x}}_{n}, define the sequence Ln(𝐱)L_{n}^{({\mathbf{x}})} such that

Ln(𝐱)=eγn𝐉22(α)Ln1(𝐱)+γnMn(𝐱)=k=1ne(snsk)𝐉22(α)γkMk(𝐱),L_{n}^{({\mathbf{x}})}=e^{\gamma_{n}{\mathbf{J}}_{22}(\alpha)}L_{n-1}^{({\mathbf{x}})}+\gamma_{n}M^{({\mathbf{x}})}_{n}=\sum_{k=1}^{n}e^{(s_{n}-s_{k}){\mathbf{J}}_{22}(\alpha)}\gamma_{k}M^{({\mathbf{x}})}_{k}, (56)

and a remaining term

Δn(𝐱)𝐱n𝝁Ln(𝐱)\Delta_{n}^{({\mathbf{x}})}\triangleq{\mathbf{x}}_{n}-{\bm{\mu}}-L_{n}^{({\mathbf{x}})} (57)

The decomposition of 𝜽n𝜽{\bm{\theta}}_{n}-{\bm{\theta}}^{*} and 𝐱n𝝁{\mathbf{x}}_{n}-{\bm{\mu}} in the above form is also standard in the single-timescale SA literature (Delyon, 2000; Fort, 2015).

Characterization of Sequences {Ln(θ)}\{L_{n}^{({\bm{\theta}})}\} and {Ln(𝐱)}\{L_{n}^{({\mathbf{x}})}\}

We set a Martingale array Z(n)={Zk(n)}k1Z^{(n)}=\{Z^{(n)}_{k}\}_{k\geq 1} such that

Zk(n)=(βn1/2eun𝐉1100γn1/2esn𝐉22(α))×j=1k(euj𝐉11βj(Mj(𝜽)𝐉12(α)𝐉22(α)1Mj(𝐱))esj𝐉22(α)γjMj(𝐱)).Z^{(n)}_{k}=\begin{pmatrix}\beta_{n}^{-1/2}e^{u_{n}{\mathbf{J}}_{11}}&0\\ 0&\gamma_{n}^{-1/2}e^{s_{n}{\mathbf{J}}_{22}(\alpha)}\end{pmatrix}\times\sum_{j=1}^{k}\begin{pmatrix}e^{-u_{j}{\mathbf{J}}_{11}}\beta_{j}(M^{({\bm{\theta}})}_{j}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}M^{({\mathbf{x}})}_{j})\\ e^{-s_{j}{\mathbf{J}}_{22}(\alpha)}\gamma_{j}M^{({\mathbf{x}})}_{j}\end{pmatrix}.

Then, the Martingale difference array Zk(n)Zk1(n)Z_{k}^{(n)}-Z_{k-1}^{(n)} becomes

Zk(n)Zk1(n)=(βn1/2e(unuk)𝐉11βk(Mk(𝜽)𝐉12(α)𝐉22(α)1Mk(𝐱))γn1/2e(snsk)𝐉22(α)γkMk(𝐱))Z_{k}^{(n)}-Z_{k-1}^{(n)}=\begin{pmatrix}\beta_{n}^{-1/2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\beta_{k}(M^{({\bm{\theta}})}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}M^{({\mathbf{x}})}_{k})\\ \gamma_{n}^{-1/2}e^{(s_{n}-s_{k}){\mathbf{J}}_{22}(\alpha)}\gamma_{k}M^{({\mathbf{x}})}_{k}\end{pmatrix}

and

k=1n𝔼[(Zk(n)Zk1(n))(Zk(n)Zk1(n))T|k1]=(A1,nA2,nA2,nTA4,n),\sum_{k=1}^{n}\mathbb{E}\left[(Z_{k}^{(n)}-Z_{k-1}^{(n)})(Z_{k}^{(n)}-Z_{k-1}^{(n)})^{T}|{\mathcal{F}}_{k-1}\right]=\begin{pmatrix}A_{1,n}&A_{2,n}\\ A_{2,n}^{T}&A_{4,n}\end{pmatrix},

where, in view of the decompositions of Mn(𝜽)M^{({\bm{\theta}})}_{n} and Mn(𝐱)M^{({\mathbf{x}})}_{n} in (41) and (42), respectively,

A1,n=βn1k=1nβk2e(unuk)𝐉11(𝐔22+𝐃k(1)+𝐉k(1)(𝐔21+𝐃k(2)+𝐉k(2))(𝐉12(α)𝐉22(α)1)T+𝐉12(α)𝐉22(α)1(𝐔11+𝐃k(3)+𝐉k(3))(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1(𝐔21+𝐃k(2)+𝐉k(2))T)e(unuk)(𝐉11)T,\begin{split}A_{1,n}=\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}&\bigg{(}{\mathbf{U}}_{22}\!+\!{\mathbf{D}}_{k}^{(1)}\!+\!{\mathbf{J}}_{k}^{(1)}\!-\!({\mathbf{U}}_{21}\!+\!{\mathbf{D}}_{k}^{(2)}\!+\!{\mathbf{J}}_{k}^{(2)})({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{U}}_{11}+{\mathbf{D}}_{k}^{(3)}+{\mathbf{J}}_{k}^{(3)})({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{U}}_{21}+{\mathbf{D}}_{k}^{(2)}+{\mathbf{J}}_{k}^{(2)})^{T}\bigg{)}e^{(u_{n}-u_{k})({\mathbf{J}}_{11})^{T}},\end{split} (58a)
A2,n=βn1/2γn1/2k=1nβkγke(unuk)𝐉11(𝐔21𝐉12(α)𝐉22(α)1𝐔11)e(snsk)𝐉22(α)T,A_{2,n}=\beta_{n}^{-1/2}\gamma_{n}^{-1/2}\sum_{k=1}^{n}\beta_{k}\gamma_{k}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}({\mathbf{U}}_{21}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{U}}_{11})e^{(s_{n}-s_{k}){\mathbf{J}}_{22}(\alpha)^{T}}, (58b)
A4,n=γn1k=1nγk2e(snsk)𝐉22(α)(𝐔11+𝐃k(3)+𝐉k(3))e(snsk)𝐉22(α)T.A_{4,n}=\gamma_{n}^{-1}\sum_{k=1}^{n}\gamma_{k}^{2}e^{(s_{n}-s_{k}){\mathbf{J}}_{22}(\alpha)}({\mathbf{U}}_{11}+{\mathbf{D}}_{k}^{(3)}+{\mathbf{J}}_{k}^{(3)})e^{(s_{n}-s_{k}){\mathbf{J}}_{22}(\alpha)^{T}}. (58c)

We further decompose A1,nA_{1,n} into three parts:

A1,n=βn1k=1n(βk2e(unuk)𝐉11(𝐔22𝐔21(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1𝐔12+𝐉12(α)𝐉22(α)1𝐔11(𝐉12(α)𝐉22(α)1)T)e(unuk)(𝐉11)T)+βn1k=1n(βk2e(unuk)𝐉11(𝐃k(1)+𝐉12(α)𝐉22(α)1𝐃k(3)(𝐉12(α)𝐉22(α)1)T𝐃k(2)(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1(𝐃k(2))T)e(unuk)(𝐉11)T)+βn1k=1n(βk2e(unuk)𝐉11(𝐉k(1)+𝐉12(α)𝐉22(α)1𝐉k(3)(𝐉12(α)𝐉22(α)1)T𝐉k(2)(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1(𝐉k(2))T)e(unuk)(𝐉11)T)A1,n(a)+A1,n(b)+A1,n(c).\begin{split}A_{1,n}=&\beta_{n}^{-1}\sum_{k=1}^{n}\bigg{(}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}({\mathbf{U}}_{22}-{\mathbf{U}}_{21}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &~{}~{}~{}~{}~{}~{}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{U}}_{12}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{U}}_{11}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T})e^{(u_{n}-u_{k})({\mathbf{J}}_{11})^{T}}\bigg{)}\\ &+\beta_{n}^{-1}\sum_{k=1}^{n}\bigg{(}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}({\mathbf{D}}_{k}^{(1)}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{D}}_{k}^{(3)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &~{}~{}~{}~{}~{}~{}-{\mathbf{D}}_{k}^{(2)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{D}}_{k}^{(2)})^{T})e^{(u_{n}-u_{k})({\mathbf{J}}_{11})^{T}}\bigg{)}\\ &+\beta_{n}^{-1}\sum_{k=1}^{n}\bigg{(}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}({\mathbf{J}}_{k}^{(1)}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{J}}_{k}^{(3)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &~{}~{}~{}~{}~{}~{}-{\mathbf{J}}_{k}^{(2)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{J}}_{k}^{(2)})^{T})e^{(u_{n}-u_{k})({\mathbf{J}}_{11})^{T}}\bigg{)}\\ \triangleq&A_{1,n}^{(a)}+A_{1,n}^{(b)}+A_{1,n}^{(c)}.\end{split} (59)

Here, we define 𝐔𝜽(α)𝐔22𝐔21(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1𝐔12+𝐉12(α)𝐉22(α)1𝐔11(𝐉12(α)𝐉22(α)1)T{\mathbf{U}}_{{\bm{\theta}}}(\alpha)\triangleq{\mathbf{U}}_{22}-{\mathbf{U}}_{21}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{U}}_{12}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{U}}_{11}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}. By (52) and (43a) in Lemma E.1, we have

𝐔𝜽(α)=i=1N11(α(1+λi)+1)21+λi1λi𝐇T𝐮i𝐮iT𝐇.{\mathbf{U}}_{{\bm{\theta}}}(\alpha)=\sum_{i=1}^{N-1}\frac{1}{(\alpha(1+\lambda_{i})+1)^{2}}\cdot\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}. (60)
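Each eigen-direction in (60) contributes the scalar factor (1+λ_i)/((1−λ_i)(α(1+λ_i)+1)²) multiplying H^T u_i u_i^T H, so U_θ(α) shrinks at rate O(1/α²) as α grows and recovers the base-chain factor (1+λ_i)/(1−λ_i) at α = 0. A small sketch of this scaling with placeholder eigenvalues:

import numpy as np

lam = np.array([-0.4, 0.1, 0.6, 0.8])     # placeholder non-unit eigenvalues of the base chain

def factor(alpha, lam):
    # per-eigendirection scalar multiplying H^T u_i u_i^T H in (60)
    return (1 + lam) / ((1 - lam) * (alpha * (1 + lam) + 1) ** 2)

for alpha in [0.0, 1.0, 5.0, 10.0, 20.0]:
    print(alpha, factor(alpha, lam))      # decays roughly like 1/alpha^2 for large alpha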

Then, we have the following lemma.

Lemma E.3.

For A1,n(a),A1,n(b),A1,n(c)A^{(a)}_{1,n},A^{(b)}_{1,n},A^{(c)}_{1,n} defined in (59), we have

limnA1,n(a)=𝐕𝜽(α),limnA1,n(b)=0,limnA1,n(c)=0,\lim_{n\to\infty}A^{(a)}_{1,n}={\mathbf{V}}_{{\bm{\theta}}}(\alpha),\quad\lim_{n\to\infty}\|A^{(b)}_{1,n}\|=0,\quad\lim_{n\to\infty}\|A^{(c)}_{1,n}\|=0, (61)

where 𝐕𝛉(α){\mathbf{V}}_{{\bm{\theta}}}(\alpha) is the solution to the Lyapunov equation

(𝐉11+𝟙{b=1}2𝑰)𝐕𝜽(α)+𝐕𝜽(α)(𝐉11+𝟙{b=1}2𝑰)T+𝐔𝜽(α)=0.\left({\mathbf{J}}_{11}+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}}\right){\mathbf{V}}_{{\bm{\theta}}}(\alpha)+{\mathbf{V}}_{{\bm{\theta}}}(\alpha)\left({\mathbf{J}}_{11}+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}}\right)^{T}+{\mathbf{U}}_{{\bm{\theta}}}(\alpha)=0.
Proof.

First, from Lemma G.4, there exist some c,T>0c,T>0 such that

A1,n(b)βn1k=1n𝐃k(1)+𝐉12(α)𝐉22(α)1𝐃k(3)(𝐉12(α)𝐉22(α)1)T𝐃k(2)(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1(𝐃k(2))Tβk2c2e2T(unuk).\begin{split}\|A^{(b)}_{1,n}\|\leq\beta_{n}^{-1}\sum_{k=1}^{n}&\bigg{\|}{\mathbf{D}}_{k}^{(1)}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{D}}_{k}^{(3)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}-{\mathbf{D}}_{k}^{(2)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{D}}_{k}^{(2)})^{T}\bigg{\|}\cdot\beta_{k}^{2}c^{2}e^{-2T(u_{n}-u_{k})}.\end{split}

Applying Lemma G.6, together with 𝐃n(i)0{\mathbf{D}}^{(i)}_{n}\to 0 a.s. in Lemma E.1, gives

lim supnA1,n(b)1C(b,p)lim supn(𝐃n(1)+𝐉12(α)𝐉22(α)1𝐃n(3)(𝐉12(α)𝐉22(α)1)T𝐃n(2)(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1(𝐃n(2))T)=0.\begin{split}&\limsup_{n}\|A^{(b)}_{1,n}\|\\ \leq&\frac{1}{C(b,p)}\limsup_{n}\|({\mathbf{D}}_{n}^{(1)}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{D}}_{n}^{(3)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-{\mathbf{D}}_{n}^{(2)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{D}}_{n}^{(2)})^{T})\|\\ &=0.\end{split}

We now consider A1,n(c)\|A^{(c)}_{1,n}\|. Set

Ξnk=1n(\displaystyle\Xi_{n}\triangleq\sum_{k=1}^{n}\big{(} 𝐉k(1)+𝐉12(α)𝐉22(α)1𝐉k(3)(𝐉12(α)𝐉22(α)1)T\displaystyle{\mathbf{J}}_{k}^{(1)}+{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}{\mathbf{J}}_{k}^{(3)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}
𝐉k(2)(𝐉12(α)𝐉22(α)1)T𝐉12(α)𝐉22(α)1(𝐉k(2))T),\displaystyle-{\mathbf{J}}_{k}^{(2)}({\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1})^{T}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{J}}_{k}^{(2)})^{T}\big{)},

we can rewrite A1,n(c)A^{(c)}_{1,n} as

A1,n(c)=βn1k=1nβk2e(unuk)𝐉11(ΞkΞk1)e(unuk)(𝐉11)T.A^{(c)}_{1,n}=\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}(\Xi_{k}-\Xi_{k-1})e^{(u_{n}-u_{k})({\mathbf{J}}_{11})^{T}}.

By the Abel transformation, we have

A1,n(c)=βnΞn+βn1k=1n1[βk2e(unuk)𝐉11Ξke(unuk)(𝐉11)Tβk+12e(unuk+1)𝐉11Ξke(unuk+1)(𝐉11)T].\begin{split}A^{(c)}_{1,n}=\beta_{n}\Xi_{n}+\beta_{n}^{-1}\sum_{k=1}^{n-1}&\Big{[}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\Xi_{k}e^{(u_{n}-u_{k})({\mathbf{J}}_{11})^{T}}\\ &-\beta_{k+1}^{2}e^{(u_{n}-u_{k+1}){\mathbf{J}}_{11}}\Xi_{k}e^{(u_{n}-u_{k+1})({\mathbf{J}}_{11})^{T}}\Big{]}.\end{split} (62)

We know from Lemma E.1 that βnΞn0\beta_{n}\Xi_{n}\to 0 a.s. because Ξn=o(γn1)\Xi_{n}=o(\gamma_{n}^{-1}). Besides,

βke(unuk)𝐉11βk+1e(unuk+1)𝐉11=(βkβk+1)e(unuk)𝐉11+βk+1e(unuk)𝐉11(𝐈eβk+1𝐉11)C1βk2e(unuk)T\begin{split}\|\beta_{k}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}&-\beta_{k+1}e^{(u_{n}-u_{k+1}){\mathbf{J}}_{11}}\|\\ =&~{}\|(\beta_{k}-\beta_{k+1})e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}+\beta_{k+1}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}({\mathbf{I}}-e^{-\beta_{k+1}{\mathbf{J}}_{11}})\|\\ \leq&~{}C_{1}\beta_{k}^{2}e^{-(u_{n}-u_{k})T}\end{split}

for some constant C1>0C_{1}>0 because βnβn+1C2βn2\beta_{n}-\beta_{n+1}\leq C_{2}\beta_{n}^{2} and 𝐈eβk+1𝐉11C3βk+1\|{\mathbf{I}}-e^{-\beta_{k+1}{\mathbf{J}}_{11}}\|\leq C_{3}\beta_{k+1}. Moreover,

βke(unuk)𝐉11+βk+1e(unuk+1)𝐉11βke(unuk)𝐉11+βke(unuk)𝐉11eβk+1𝐉11C4βke(unuk)T.\begin{split}\|\beta_{k}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\|&+\|\beta_{k+1}e^{(u_{n}-u_{k+1}){\mathbf{J}}_{11}}\|\\ \leq&\beta_{k}\|e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\|+\beta_{k}\|e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\|\cdot\|e^{-\beta_{k+1}{\mathbf{J}}_{11}}\|\\ \leq&C_{4}\beta_{k}e^{-(u_{n}-u_{k})T}.\end{split}

Using Lemma G.7 on (62) gives

A1,n(c)C1C4βn1k=1n1βk2e2(unuk)TβkΞk+βnΞn.\|A^{(c)}_{1,n}\|\leq C_{1}C_{4}\beta_{n}^{-1}\sum_{k=1}^{n-1}\beta_{k}^{2}e^{-2(u_{n}-u_{k})T}\|\beta_{k}\Xi_{k}\|+\|\beta_{n}\Xi_{n}\|.

Applying Lemma G.6 again gives

lim supnA1,n(c)C5lim supnβnΞn=0\limsup_{n}\|A^{(c)}_{1,n}\|\leq C_{5}\limsup_{n}\|\beta_{n}\Xi_{n}\|=0

for some constant C5>0C_{5}>0.

Finally, we provide an existing lemma below.

Lemma E.4 (Mokkadem & Pelletier (2005) Lemma 4).

Let βn=(n+1)b\beta_{n}=(n+1)^{-b} for b(1/2,1]b\in(1/2,1] be a decreasing step size, un=k=1nβku_{n}=\sum_{k=1}^{n}\beta_{k}, Γ\Gamma a positive semi-definite matrix, and 𝐐{\mathbf{Q}} a Hurwitz matrix. For the matrix-valued sequence given by

βn1k=1nβk2e(unuk)𝐐Γe(unuk)𝐐T,\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{Q}}}\Gamma e^{(u_{n}-u_{k}){\mathbf{Q}}^{T}},

we have

limnβn1k=1nβk2e(unuk)𝐐Γe(unuk)𝐐T=𝐕\lim_{n\to\infty}\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{Q}}}\Gamma e^{(u_{n}-u_{k}){\mathbf{Q}}^{T}}={\mathbf{V}}

where 𝐕{\mathbf{V}} is the solution of the Lyapunov equation

(𝐐+𝟙{b=1}2𝐈)𝐕+𝐕(𝐐T+𝟙{b=1}2𝐈)+Γ=0.\left({\mathbf{Q}}+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}\right){\mathbf{V}}+{\mathbf{V}}\left({\mathbf{Q}}^{T}+\frac{\mathds{1}_{\{b=1\}}}{2}{\mathbf{I}}\right)+\Gamma=0.

Then, limnA1,n(a)=𝐕𝜽(α)\lim_{n\to\infty}A^{(a)}_{1,n}={\mathbf{V}}_{{\bm{\theta}}}(\alpha) is a direct application of Lemma E.4. ∎
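Numerically, the limiting matrix in Lemma E.4 (and hence V_θ(α) in Lemma E.3) can be obtained by solving the Lyapunov equation directly via the Kronecker-product identity; in the sketch below, Q, Γ, and b are placeholders standing in for J_11 (plus the 1_{b=1}/2 shift) and U_θ(α).

import numpy as np

rng = np.random.default_rng(3)
d, b = 3, 1                                        # placeholder dimension and step exponent b
Q = -2 * np.eye(d) - 0.3 * rng.random((d, d))      # placeholder Hurwitz matrix (stands in for J_11)
G = rng.standard_normal((d, d)); Gamma = G @ G.T   # placeholder PSD matrix (stands in for U_theta(alpha))

Q_eff = Q + (0.5 if b == 1 else 0.0) * np.eye(d)   # Q + 1_{b=1}/2 * I, as in Lemma E.4

# Solve Q_eff V + V Q_eff^T + Gamma = 0 by vectorization:
# (Q_eff (x) I + I (x) Q_eff) vec(V) = -vec(Gamma)
M = np.kron(Q_eff, np.eye(d)) + np.kron(np.eye(d), Q_eff)
V = np.linalg.solve(M, -Gamma.flatten()).reshape(d, d)

assert np.allclose(Q_eff @ V + V @ Q_eff.T + Gamma, 0, atol=1e-8)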

We can follow similar steps as in Lemma E.3 to obtain

limnA4,n=𝐕𝐱(α),\lim_{n\to\infty}A_{4,n}={\mathbf{V}}_{{\mathbf{x}}}(\alpha),

where 𝐕𝐱(α){\mathbf{V}}_{{\mathbf{x}}}(\alpha) is in the form of (32).

The last step is to show limnA2,n=0\lim_{n\to\infty}A_{2,n}=0. Note that

A2,n=O(βn1/2γn1/2k=1nβkγke(unuk)𝐉11e(snsk)𝐉22(α)T)=O(βn1/2γn1/2k=1nβkγke(unuk)Te(snsk)T)=O(βn1/2γn1/2k=1nβkγke(snsk)T),\begin{split}\|A_{2,n}\|&=O\left(\beta_{n}^{-1/2}\gamma_{n}^{-1/2}\sum_{k=1}^{n}\beta_{k}\gamma_{k}\|e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\|\|e^{(s_{n}-s_{k}){\mathbf{J}}_{22}(\alpha)^{T}}\|\right)\\ &=O\left(\beta_{n}^{-1/2}\gamma_{n}^{-1/2}\sum_{k=1}^{n}\beta_{k}\gamma_{k}e^{-(u_{n}-u_{k})T}e^{-(s_{n}-s_{k})T^{\prime}}\right)\\ &=O\left(\beta_{n}^{-1/2}\gamma_{n}^{-1/2}\sum_{k=1}^{n}\beta_{k}\gamma_{k}e^{-(s_{n}-s_{k})T^{\prime}}\right),\end{split}

where the second equality is from Lemma G.4. Then, we use Lemma G.6 with p=0p=0 to obtain

k=1nβkγke(snsk)T=O(βn)\sum_{k=1}^{n}\beta_{k}\gamma_{k}e^{-(s_{n}-s_{k})T^{\prime}}=O(\beta_{n}) (63)

Additionally, since βn=o(γn)\beta_{n}=o(\gamma_{n}), we have

βn1/2γn1/2k=1nβkγk1/2γk3/2e(snsk)T=O(βn1/2γn1/2)=o(1).\beta_{n}^{-1/2}\gamma_{n}^{-1/2}\sum_{k=1}^{n}\beta_{k}\gamma_{k}^{-1/2}\gamma_{k}^{3/2}e^{-(s_{n}-s_{k})T^{\prime}}=O(\beta_{n}^{1/2}\gamma_{n}^{-1/2})=o(1).

Then, it follows that limnA2,n=0\lim_{n\to\infty}A_{2,n}=0. Therefore, we obtain

limnk=1n𝔼[(Zk(n)Zk1(n))(Zk(n)Zk1(n))T|k1]=(𝐕𝜽(α)00𝐕𝐱(α)).\lim_{n\to\infty}\sum_{k=1}^{n}\mathbb{E}\left[(Z_{k}^{(n)}-Z_{k-1}^{(n)})(Z_{k}^{(n)}-Z_{k-1}^{(n)})^{T}|{\mathcal{F}}_{k-1}\right]=\begin{pmatrix}{\mathbf{V}}_{{\bm{\theta}}}(\alpha)&0\\ 0&{\mathbf{V}}_{{\mathbf{x}}}(\alpha)\end{pmatrix}.

Now, we turn to verifying the conditions in Theorem G.3. For some τ>0\tau>0, we have

k=1n𝔼[Zk(n)Zk1(n)2+τ|k1]=O(βn(1+τ2)k=1nβk2+τ2βkτ2e(2+τ)(unuk)T+γn(1+τ2)k=1nγk2+τ2γkτ2e(2+τ)(snsk)T)=O(βnτ2+γnτ2)\begin{split}&\sum_{k=1}^{n}\mathbb{E}\left[\|Z_{k}^{(n)}-Z_{k-1}^{(n)}\|^{2+\tau}|{\mathcal{F}}_{k-1}\right]\\ &=O\left(\beta_{n}^{-(1+\frac{\tau}{2})}\sum_{k=1}^{n}\beta_{k}^{2+\frac{\tau}{2}}\beta_{k}^{\frac{\tau}{2}}e^{-(2+\tau)(u_{n}-u_{k})T}+\gamma_{n}^{-(1+\frac{\tau}{2})}\sum_{k=1}^{n}\gamma_{k}^{2+\frac{\tau}{2}}\gamma_{k}^{\frac{\tau}{2}}e^{-(2+\tau)(s_{n}-s_{k})T^{\prime}}\right)\\ &=O\left(\beta_{n}^{\frac{\tau}{2}}+\gamma_{n}^{\frac{\tau}{2}}\right)\end{split} (64)

where the last equality comes from Lemma G.6. Since (64) also holds for τ=0\tau=0, we have

k=1n𝔼[Zk(n)Zk1(n)2|k1]=O(1)<.\sum_{k=1}^{n}\mathbb{E}\left[\|Z_{k}^{(n)}-Z_{k-1}^{(n)}\|^{2}|{\mathcal{F}}_{k-1}\right]=O(1)<\infty.

Therefore, all the conditions in Theorem G.3 are satisfied and its application then gives

Z(n)=(βn1Ln(𝜽)γn1Ln(𝐱))dist.nN(0,(𝐕𝜽(α)00𝐕𝐱(α))).Z^{(n)}=\begin{pmatrix}\sqrt{\beta_{n}^{-1}}L_{n}^{({\bm{\theta}})}\\ \sqrt{\gamma_{n}^{-1}}L_{n}^{({\mathbf{x}})}\end{pmatrix}\xrightarrow[dist.]{n\to\infty}N\left(0,\begin{pmatrix}{\mathbf{V}}_{{\bm{\theta}}}(\alpha)&0\\ 0&{\mathbf{V}}_{{\mathbf{x}}}(\alpha)\end{pmatrix}\right). (65)

Furthermore, we have the following lemma about the strong convergence rate of {Ln(𝜽)}\{L^{({\bm{\theta}})}_{n}\} and {Ln(𝐱)}\{L^{({\mathbf{x}})}_{n}\}.

Lemma E.5.
Ln(𝜽)=O(βnlog(un))a.s.\|L^{({\bm{\theta}})}_{n}\|=O\left(\sqrt{\beta_{n}\log(u_{n})}\right)\quad a.s. (66a)
Ln(𝐱)=O(γnlog(sn))a.s.\|L^{({\mathbf{x}})}_{n}\|=O\left(\sqrt{\gamma_{n}\log(s_{n})}\right)\quad a.s. (66b)
Proof.

This proof follows Pelletier (1998, Lemma 1). We only need the special case of Pelletier (1998, Lemma 1) that fits our scenario, in which the two types of step sizes therein are taken to be the same. Specifically, we state the following lemma.

Lemma E.6 (Pelletier (1998) Lemma 1).

Consider a sequence

Ln+1=eun𝐇k=1neuk𝐇βkMk+1,L_{n+1}=e^{u_{n}{\mathbf{H}}}\sum_{k=1}^{n}e^{-u_{k}{\mathbf{H}}}\beta_{k}M_{k+1},

where βn=nb\beta_{n}=n^{-b}, 1/2<b11/2<b\leq 1, and {Mn}\{M_{n}\} is a Martingale difference sequence adapted to the filtration {\mathcal{F}} such that, almost surely, lim supn𝔼[Mn+12|n]M2\limsup_{n}\mathbb{E}[\|M_{n+1}\|^{2}|{\mathcal{F}}_{n}]\leq M^{2} and there exists τ(0,2)\tau\in(0,2), b(2+τ)>2b(2+\tau)>2, such that supn𝔼[Mn+12+τ|n]<\sup_{n}\mathbb{E}[\|M_{n+1}\|^{2+\tau}|{\mathcal{F}}_{n}]<\infty. Then, almost surely,

lim supnLnβnlog(un)CM,\limsup_{n}\frac{\|L_{n}\|}{\sqrt{\beta_{n}\log(u_{n})}}\leq C_{M}, (67)

where CMC_{M} is a constant dependent on MM.

By assumption A4, the iterates (𝜽n,𝐱n)({\bm{\theta}}_{n},{\mathbf{x}}_{n}) are bounded within a compact subset Ω\Omega. Recall the form of Mn+1(𝜽),Mn+1(𝐱)M^{({\bm{\theta}})}_{n+1},M^{({\mathbf{x}})}_{n+1} defined in (35); they comprise the functions H~𝜽n,𝐱n(i)\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}}(i) and (𝐊𝐱nH~𝜽n,𝐱n)(i)({\mathbf{K}}_{{\mathbf{x}}_{n}}\tilde{H}_{{\bm{\theta}}_{n},{\mathbf{x}}_{n}})(i), which in turn include the function H(𝜽,i)H({\bm{\theta}},i). We know that H(𝜽,i)H({\bm{\theta}},i) is bounded for 𝜽{\bm{\theta}} in some compact set 𝒞{\mathcal{C}}. Thus, for (𝜽n,𝐱n)Ω({\bm{\theta}}_{n},{\mathbf{x}}_{n})\in\Omega, Mn+1(𝜽),Mn+1(𝐱)M^{({\bm{\theta}})}_{n+1},M^{({\mathbf{x}})}_{n+1} are bounded, and we denote their conditional second-moment upper bounds by cΩ(𝜽)c^{({\bm{\theta}})}_{\Omega} and cΩ(𝐱)c^{({\mathbf{x}})}_{\Omega}, i.e., 𝔼[Mn+1(𝜽)2|n]cΩ(𝜽)\mathbb{E}[\|M^{({\bm{\theta}})}_{n+1}\|^{2}|{\mathcal{F}}_{n}]\leq c^{({\bm{\theta}})}_{\Omega} and 𝔼[Mn+1(𝐱)2|n]cΩ(𝐱)\mathbb{E}[\|M^{({\mathbf{x}})}_{n+1}\|^{2}|{\mathcal{F}}_{n}]\leq c^{({\mathbf{x}})}_{\Omega}. We only need to replace the upper bound M2M^{2} in Lemma E.6 by cΩ(𝜽)c^{({\bm{\theta}})}_{\Omega} for the sequence {Ln(𝜽)}\{L_{n}^{({\bm{\theta}})}\} (resp. cΩ(𝐱)c^{({\mathbf{x}})}_{\Omega} for the sequence {Ln(𝐱)}\{L_{n}^{({\mathbf{x}})}\}), i.e.,

lim supnLn(𝜽)βnlog(un)CΩ(𝜽),\limsup_{n}\frac{\|L_{n}^{({\bm{\theta}})}\|}{\sqrt{\beta_{n}\log(u_{n})}}\leq C^{({\bm{\theta}})}_{\Omega}, (68a)
lim supnLn(𝐱)γnlog(sn)CΩ(𝐱),\limsup_{n}\frac{\|L_{n}^{({\mathbf{x}})}\|}{\sqrt{\gamma_{n}\log(s_{n})}}\leq C^{({\mathbf{x}})}_{\Omega}, (68b)

such that Ln(𝜽)=O(βnlog(un))\|L_{n}^{({\bm{\theta}})}\|=O(\sqrt{\beta_{n}\log(u_{n})}) a.s. and Ln(𝐱)=O(γnlog(sn))\|L_{n}^{({\mathbf{x}})}\|=O(\sqrt{\gamma_{n}\log(s_{n})}) a.s., which completes the proof. ∎

Note that γn1/2(𝐱n𝝁)\gamma_{n}^{-1/2}({\mathbf{x}}_{n}-{\bm{\mu}}) and γn1/2Ln(𝐱)\gamma_{n}^{-1/2}L_{n}^{({\mathbf{x}})} weakly converge to the same Gaussian distribution by Remark E.1 and (65). Then, γn1/2Δn(𝐱)\gamma_{n}^{-1/2}\Delta_{n}^{({\mathbf{x}})} weakly converges to zero, implying that γn1/2Δn(𝐱)\gamma_{n}^{-1/2}\Delta_{n}^{({\mathbf{x}})} converges to zero with probability 11. Therefore, together with {γn}\{\gamma_{n}\} being strictly positive, we have

Δn(𝐱)=o(γn)a.s.\Delta_{n}^{({\mathbf{x}})}=o(\sqrt{\gamma_{n}})\quad a.s. (69)

Characterization of Sequences {Rn(θ)}\{R_{n}^{({\bm{\theta}})}\} and {Δn(θ)}\{\Delta_{n}^{({\bm{\theta}})}\}

We first consider the sequence {Rn(𝜽)}\{R_{n}^{({\bm{\theta}})}\}. We consider a positive, real-valued, bounded sequence {ωn}\{\omega_{n}\} satisfying the same conditions as in Mokkadem & Pelletier (2006, Definition 1), i.e.,

Definition E.1.

In the case b<1b<1, ωnωn+1=1+o(βn)\frac{\omega_{n}}{\omega_{n+1}}=1+o(\beta_{n}), which also implies ωnωn+1=1+o(γn)\frac{\omega_{n}}{\omega_{n+1}}=1+o(\gamma_{n}).

In the case b=1b=1, there exist ϵ0\epsilon\geq 0 and a nondecreasing slowly varying function l(n)l(n) such that ωn=nϵl(n)\omega_{n}=n^{-\epsilon}l(n). When ϵ=0\epsilon=0, we require the function l(n)l(n) to be bounded. ∎

Since 𝐱n𝝁=o(1)\|{\mathbf{x}}_{n}-{\bm{\mu}}\|=o(1) by the a.s. convergence result, we can assume that there exists such a sequence {ωn}\{\omega_{n}\} with 𝐱n𝝁=O(ωn)\|{\mathbf{x}}_{n}-{\bm{\mu}}\|=O(\omega_{n}). Then, from (55b), we can use the Abel transformation and obtain

Rn(𝜽)=βnγn1𝐉12(α)𝐉22(α)1(𝐱n𝝁)e(unu1)𝐉11β1γ11𝐉12(α)𝐉22(α)1(𝐱1𝝁)+eun𝐉11k=1n1(euk𝐉11βkγk1euk+1𝐉11βk+1γk+11)𝐉12(α)𝐉22(α)1(𝐱k+1𝝁),\begin{split}R_{n}^{({\bm{\theta}})}=&~{}\beta_{n}\gamma_{n}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{n}-{\bm{\mu}})-e^{(u_{n}-u_{1}){\mathbf{J}}_{11}}\beta_{1}\gamma_{1}^{-1}{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{1}-{\bm{\mu}})\\ &~{}+e^{u_{n}{\mathbf{J}}_{11}}\sum_{k=1}^{n-1}\left(e^{-u_{k}{\mathbf{J}}_{11}}\beta_{k}\gamma_{k}^{-1}-e^{-u_{k+1}{\mathbf{J}}_{11}}\beta_{k+1}\gamma_{k+1}^{-1}\right){\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{k+1}-{\bm{\mu}}),\end{split}

where the last term on the RHS can be rewritten as

Wn=k=1n1e(unuk+1)𝐉11βk+1γk+11(eβk+1𝐉11βkβk+11γk1γk+1𝐈)𝐉12(α)𝐉22(α)1(𝐱k+1𝝁).W_{n}=\sum_{k=1}^{n-1}e^{(u_{n}-u_{k+1}){\mathbf{J}}_{11}}\beta_{k+1}\gamma_{k+1}^{-1}\left(e^{\beta_{k+1}{\mathbf{J}}_{11}}\beta_{k}\beta_{k+1}^{-1}\gamma_{k}^{-1}\gamma_{k+1}-{\mathbf{I}}\right){\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}({\mathbf{x}}_{k+1}-{\bm{\mu}}).

Using Lemma G.6 on WnW_{n} gives Wn=O(γn1eβn𝐉11𝑰𝐱n𝝁)=O(γn1βnωn)\|W_{n}\|=O(\gamma_{n}^{-1}\|e^{\beta_{n}{\mathbf{J}}_{11}}-{\bm{I}}\|\|{\mathbf{x}}_{n}-{\bm{\mu}}\|)=O(\gamma^{-1}_{n}\beta_{n}\omega_{n}). Then, it follows that for some T>0T>0,

Rn(𝜽)=O(βnγn1ωn+eun𝐉11)=O(βnγn1ωn+eunT)\|R_{n}^{({\bm{\theta}})}\|=O\left(\beta_{n}\gamma_{n}^{-1}\omega_{n}+\|e^{u_{n}{\mathbf{J}}_{11}}\|\right)=O(\beta_{n}\gamma_{n}^{-1}\omega_{n}+e^{-u_{n}T}) (70)

with the application of Lemma G.4 to the second equality.

Then, we shift our focus to {Δn(𝜽)}\{\Delta_{n}^{({\bm{\theta}})}\}. Specifically, we substitute (54), (55a), and (55b) back into Δn(𝜽)=𝜽n𝜽Ln(𝜽)Rn(𝜽)\Delta_{n}^{({\bm{\theta}})}={\bm{\theta}}_{n}-{\bm{\theta}}^{*}-L_{n}^{({\bm{\theta}})}-R_{n}^{({\bm{\theta}})}, and obtain

Δn+1(𝜽)=(𝐈+βn+1𝐉11)(𝜽n𝜽)+βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽)𝐉12(α)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+ηn(𝐱)))eβn+1𝐉11Ln(𝜽)eβn+1𝐉11Rn(𝜽)=(𝐈+βn+1𝐉11)(𝜽n𝜽)+βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽)𝐉12(α)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+ηn(𝐱)))(𝐈+βn+1𝐉11+O(βn+12))Ln(𝜽)(𝐈+βn+1𝐉11+O(βn+12))Rn(𝜽)=(𝐈+βn+1𝐉11)Δn(𝜽)+O(βn+12)(Ln(𝜽)+Rn(𝜽)))+βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽)𝐉12(α)𝐉22(α)1(rn(𝐱,1)+rn(𝐱,2)+ηn(𝐱))),\begin{split}\Delta_{n+1}^{({\bm{\theta}})}=&({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11})({\bm{\theta}}_{n}-{\bm{\theta}}^{*})\\ &+\beta_{n+1}(r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+\eta_{n}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+\eta_{n}^{({\mathbf{x}})}))\\ &-e^{\beta_{n+1}{\mathbf{J}}_{11}}L_{n}^{({\bm{\theta}})}-e^{\beta_{n+1}{\mathbf{J}}_{11}}R_{n}^{({\bm{\theta}})}\\ =&({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11})({\bm{\theta}}_{n}-{\bm{\theta}}^{*})\\ &+\beta_{n+1}(r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+\eta_{n}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+\eta_{n}^{({\mathbf{x}})}))\\ &-({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11}+O(\beta_{n+1}^{2}))L_{n}^{({\bm{\theta}})}-({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11}+O(\beta_{n+1}^{2}))R_{n}^{({\bm{\theta}})}\\ =&({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11})\Delta_{n}^{({\bm{\theta}})}+O(\beta_{n+1}^{2})(L_{n}^{({\bm{\theta}})}+R_{n}^{({\bm{\theta}}))})\\ &+\beta_{n+1}(r^{({\bm{\theta}},1)}_{n}+r^{({\bm{\theta}},2)}_{n}+\eta_{n}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{n}+r^{({\mathbf{x}},2)}_{n}+\eta_{n}^{({\mathbf{x}})})),\end{split} (71)

where the second equality is by taking the Taylor expansion eβn+1𝐉11=𝐈+βn+1𝐉11+O(βn+12)e^{\beta_{n+1}{\mathbf{J}}_{11}}={\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11}+O(\beta_{n+1}^{2}).

Define Φk,nj=k+1n(𝐈+βj𝐉11)\Phi_{k,n}\triangleq\prod_{j={k+1}}^{n}({\mathbf{I}}+\beta_{j}{\mathbf{J}}_{11}) and by convention Φn,n=𝐈\Phi_{n,n}={\mathbf{I}}. Then, we rewrite (71) as

Δn+1(𝜽)=k=1nΦk,nβk+1(O(βk+1)Lk(𝜽)+O(βk+1)Rk(𝜽))+k=1nΦk,nβk+1(rk(𝜽,1)+rk(𝜽,2)+ηk(𝜽)𝐉12(α)𝐉22(α)1(rk(𝐱,1)+rk(𝐱,2)+ηk(𝐱)))=k=1nΦk,nβk+1(O(βk+1)Lk(𝜽)+O(βk+1)Rk(𝜽))+k=1nΦk,nβk+1(rk(𝜽,1)+ηk(𝜽)𝐉12(α)𝐉22(α)1(rk(𝐱,1)+ηk(𝐱)))+k=1nΦk,nβk+1(rk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2)).\begin{split}\Delta_{n+1}^{({\bm{\theta}})}=&\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}\left(O(\beta_{k+1})L_{k}^{({\bm{\theta}})}+O(\beta_{k+1})R_{k}^{({\bm{\theta}})}\right)\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},1)}_{k}+r^{({\bm{\theta}},2)}_{k}+\eta_{k}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{k}+r^{({\mathbf{x}},2)}_{k}+\eta_{k}^{({\mathbf{x}})}))\\ =&\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}\left(O(\beta_{k+1})L_{k}^{({\bm{\theta}})}+O(\beta_{k+1})R_{k}^{({\bm{\theta}})}\right)\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},1)}_{k}+\eta_{k}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{k}+\eta_{k}^{({\mathbf{x}})}))\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k}).\end{split} (72)

From (72), we can indeed decompose Δn+1(𝜽)\Delta_{n+1}^{({\bm{\theta}})} into two parts Δn+1(𝜽)=Δn+1(𝜽,1)+Δn+1(𝜽,2)\Delta_{n+1}^{({\bm{\theta}})}=\Delta_{n+1}^{({\bm{\theta}},1)}+\Delta_{n+1}^{({\bm{\theta}},2)}, where

Δn+1(𝜽,1)k=1nΦk,nβk+1(O(βk+1)Lk(𝜽)+O(βk+1)Rk(𝜽))+k=1nΦk,nβk+1(rk(𝜽,1)+ηk(𝜽)𝐉12(α)𝐉22(α)1(rk(𝐱,1)+ηk(𝐱))),\begin{split}\Delta_{n+1}^{({\bm{\theta}},1)}\triangleq&\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}\left(O(\beta_{k+1})L_{k}^{({\bm{\theta}})}+O(\beta_{k+1})R_{k}^{({\bm{\theta}})}\right)\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},1)}_{k}+\eta_{k}^{({\bm{\theta}})}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}(r^{({\mathbf{x}},1)}_{k}+\eta_{k}^{({\mathbf{x}})})),\end{split} (73a)
Δn+1(𝜽,2)k=1nΦk,nβk+1(rk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2)).\Delta_{n+1}^{({\bm{\theta}},2)}\triangleq\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k}). (73b)

This term Δn+1(𝜽,1)\Delta_{n+1}^{({\bm{\theta}},1)} shares the same recursive form as the sequence defined in Mokkadem & Pelletier (2006, Lemma 6), which is given below.

Lemma E.7 (Mokkadem & Pelletier (2006) Lemma 6).

For Δn+1(𝛉,1)\Delta_{n+1}^{({\bm{\theta}},1)} in the form of (73a), assume 𝐱n𝛍=O(ωn)\|{\mathbf{x}}_{n}-{\bm{\mu}}\|=O(\omega_{n}) and Δn(𝐱)=O(δn)\|\Delta_{n}^{({\mathbf{x}})}\|=O(\delta_{n}) for sequences ωn,δn\omega_{n},\delta_{n} satisfying Definition E.1. Then, we have

Δn+1(𝜽,1)=O(βn2γn2ωn2+βnγn1δn)+o(βn)a.s.\|\Delta_{n+1}^{({\bm{\theta}},1)}\|=O(\beta_{n}^{2}\gamma_{n}^{-2}\omega_{n}^{2}+\beta_{n}\gamma_{n}^{-1}\delta_{n})+o(\sqrt{\beta_{n}})\quad\text{a.s.}

Since we already have Δn(𝐱)=o(γn)\Delta_{n}^{({\mathbf{x}})}=o(\sqrt{\gamma_{n}}) in (69), together with Lemma E.7, we have

Δn+1(𝜽,1)=O(βn2γn2ωn2)+o(βnγn1/2)+o(βn)=O(βn2γn2ωn2)+o(βn)\|\Delta_{n+1}^{({\bm{\theta}},1)}\|=O(\beta_{n}^{2}\gamma_{n}^{-2}\omega_{n}^{2})+o(\beta_{n}\gamma_{n}^{-1/2})+o(\sqrt{\beta_{n}})=O(\beta_{n}^{2}\gamma_{n}^{-2}\omega_{n}^{2})+o(\sqrt{\beta_{n}})

where the second equality comes from o(βnγn1/2)=o(βn1/2(βnγn1)1/2)=o(βn1/2)o(\beta_{n}\gamma_{n}^{-1/2})=o(\beta_{n}^{1/2}(\beta_{n}\gamma_{n}^{-1})^{1/2})=o(\beta_{n}^{1/2}).

We now focus on Δn+1(𝜽,2)\Delta_{n+1}^{({\bm{\theta}},2)}. Define a sequence

Ψnk=1nrk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2),\Psi_{n}\triangleq\sum_{k=1}^{n}r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k}, (74)

and we have

βn+11/2k=1nΦk,nβk+1(rk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2))=βn+11/2k=1nΦk,nβk+1(ΨkΨk1)=βn+11/2Ψn+βn+11/2k=1n1(βkΦk,nβk+1Φk+1,n)Ψk\begin{split}&\beta_{n+1}^{-1/2}\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k})\\ =&\beta_{n+1}^{-1/2}\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(\Psi_{k}-\Psi_{k-1})\\ =&\beta_{n+1}^{1/2}\Psi_{n}+\beta_{n+1}^{-1/2}\sum_{k=1}^{n-1}(\beta_{k}\Phi_{k,n}-\beta_{k+1}\Phi_{k+1,n})\Psi_{k}\end{split}

where the last equality comes from the Abel transformation. Note that

βkΦk,nβk+1Φk+1,nβk+1Φk,nΦk+1,n+(βkβk+1)Φk,nβk+1Φk+1,nβk𝐉11+C7βk2Φk,nC8βk2e(unuk)T\begin{split}\|\beta_{k}\Phi_{k,n}-\beta_{k+1}\Phi_{k+1,n}\|&\leq\beta_{k+1}\|\Phi_{k,n}-\Phi_{k+1,n}\|+(\beta_{k}-\beta_{k+1})\|\Phi_{k,n}\|\\ &\leq\beta_{k+1}\|\Phi_{k+1,n}\|\beta_{k}\|{\mathbf{J}}_{11}\|+C_{7}\beta_{k}^{2}\|\Phi_{k,n}\|\\ &\leq C_{8}\beta_{k}^{2}e^{-(u_{n}-u_{k})T}\end{split}

for some constants C7,C8>0C_{7},C_{8}>0, where the last inequality is from Lemma G.4 and Φk+1,nC9Φk,n\|\Phi_{k+1,n}\|\leq C_{9}\|\Phi_{k,n}\| for some constant C9>0C_{9}>0 that depends on eβ0Te^{\beta_{0}T}. Then,

βn+11/2k=1nΦk,nβk+1(rk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2))βn+11/2Ψn+(βn+1βn)1/2βn1/2k=1nβkΦk,nβk+1Φk+1,nΨkβn+11/2Ψn+C8(βn+1βn)1/2βn1/2k=1nβk3/2e(unuk)Tβk1/2Ψk.\begin{split}&\beta_{n+1}^{-1/2}\left\|\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k})\right\|\\ &\leq\|\beta_{n+1}^{1/2}\Psi_{n}\|+\left(\frac{\beta_{n+1}}{\beta_{n}}\right)^{1/2}\beta_{n}^{-1/2}\sum_{k=1}^{n}\|\beta_{k}\Phi_{k,n}-\beta_{k+1}\Phi_{k+1,n}\|\|\Psi_{k}\|\\ &\leq\|\beta_{n+1}^{1/2}\Psi_{n}\|+C_{8}\left(\frac{\beta_{n+1}}{\beta_{n}}\right)^{1/2}\beta_{n}^{-1/2}\sum_{k=1}^{n}\beta_{k}^{3/2}e^{-(u_{n}-u_{k})T}\|\beta_{k}^{1/2}\Psi_{k}\|.\end{split}

By Lemma E.2, we have βn1/2Ψn0\beta_{n}^{1/2}\Psi_{n}\to 0 a.s., so that by Lemma G.6, it follows that

lim supnβn+11/2k=1nΦk,nβk+1(rk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2))lim supnβn1/2ΨnC(T,1/2)=0.\limsup_{n}\beta_{n+1}^{-1/2}\left\|\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k})\right\|\leq\frac{\limsup_{n}\|\beta_{n}^{1/2}\Psi_{n}\|}{C(T,1/2)}=0.

Therefore, we have

Δn+1(𝜽,2)=k=1nΦk,nβk+1(rk(𝜽,2)𝐉12(α)𝐉22(α)1rk(𝐱,2))=o(βn).\Delta_{n+1}^{({\bm{\theta}},2)}=\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},2)}_{k}-{\mathbf{J}}_{12}(\alpha){\mathbf{J}}_{22}(\alpha)^{-1}r^{({\mathbf{x}},2)}_{k})=o(\sqrt{\beta_{n}}). (75)

Consequently, Δn+1(𝜽)=O(βn2γn2ωn2)+o(βn)\Delta_{n+1}^{({\bm{\theta}})}=O(\beta_{n}^{2}\gamma_{n}^{-2}\omega_{n}^{2})+o(\sqrt{\beta_{n}}) almost surely.

Now we are dealing with 𝐱n𝝁{\mathbf{x}}_{n}-{\bm{\mu}} and its related sequence ωn\omega_{n}. Note that by Lemma E.5 and (69), we have almost surely,

𝐱n𝝁=O(Ln(𝐱)+Δn𝐱)=O(γnlog(sn)+o(γn))=O(γnlog(sn)).\begin{split}\|{\mathbf{x}}_{n}-{\bm{\mu}}\|&=O(\|L_{n}^{({\mathbf{x}})}\|+\|\Delta_{n}^{{\mathbf{x}}}\|)\\ &=O(\sqrt{\gamma_{n}\log(s_{n})}+o(\sqrt{\gamma_{n}}))\\ &=O(\sqrt{\gamma_{n}\log(s_{n})}).\end{split} (76)

Thus, we can take ωn=γnlog(sn)\omega_{n}=\sqrt{\gamma_{n}\log(s_{n})} so that Rn(𝜽)\|R_{n}^{({\bm{\theta}})}\| in (70) can be written as

Rn(𝜽)=O(na/2blog(sn)+eunT),\|R_{n}^{({\bm{\theta}})}\|=O(n^{a/2-b}\sqrt{log(s_{n})}+e^{-u_{n}T}),

and

Δn+1(𝜽)=O(na2blog(sn))+o(βn).\|\Delta_{n+1}^{({\bm{\theta}})}\|=O(n^{a-2b}log(s_{n}))+o(\sqrt{\beta_{n}}).

In view of assumption A2 and βn=o(γn)\beta_{n}=o(\gamma_{n}), a/2b<b/2a/2-b<-b/2 and a2b<ba-2b<-b, there exists a c>b/2c>b/2 such that almost surely,

Rn(𝜽)=O(nc),Δn+1(𝜽)=o(βn).\|R_{n}^{({\bm{\theta}})}\|=O(n^{-c}),\quad\|\Delta_{n+1}^{({\bm{\theta}})}\|=o(\sqrt{\beta_{n}}).

Therefore, βn1/2(Rn(𝜽)+Δn+1(𝜽))0\beta_{n}^{-1/2}(R_{n}^{({\bm{\theta}})}+\Delta_{n+1}^{({\bm{\theta}})})\to 0 almost surely. This completes the proof of Scenario 2.

E.3 Case (iii): γn=o(βn)\gamma_{n}=o(\beta_{n})

For γn=o(βn)\gamma_{n}=o(\beta_{n}), we can see that the roles of 𝜽n{\bm{\theta}}_{n} and 𝐱n{\mathbf{x}}_{n} are flipped, i.e., 𝜽n{\bm{\theta}}_{n} is now on the fast timescale while 𝐱n{\mathbf{x}}_{n} is on the slow timescale.

We still decompose 𝐱n{\mathbf{x}}_{n} as 𝐱n𝝁=Ln(𝐱)+Δn(𝐱){\mathbf{x}}_{n}-{\bm{\mu}}=L_{n}^{({\mathbf{x}})}+\Delta_{n}^{({\mathbf{x}})}, where Ln(𝐱),Δn(𝐱)L_{n}^{({\mathbf{x}})},\Delta_{n}^{({\mathbf{x}})} are defined in (56) and (57), respectively. Since 𝐱n{\mathbf{x}}_{n} is independent of 𝜽n{\bm{\theta}}_{n}, the results of Ln(𝐱)L_{n}^{({\mathbf{x}})} and Δn(𝐱)\Delta_{n}^{({\mathbf{x}})} remain the same, i.e., almost surely, Ln(𝐱)=O(γnlog(sn))L_{n}^{({\mathbf{x}})}=O(\sqrt{\gamma_{n}\log(s_{n})}) from Lemma E.5 and Δn(𝐱)=o(γn)\Delta_{n}^{({\mathbf{x}})}=o(\sqrt{\gamma_{n}}) from (69). Then, we define sequences L^n(𝜽)\hat{L}_{n}^{({\bm{\theta}})} and R^n(𝜽)\hat{R}_{n}^{({\bm{\theta}})} as follows.

L^n(𝜽)eβn𝐉11L^n1(𝜽)+βnMn(𝜽)=k=1ne(unuk)𝐉11βkMk(𝜽),\hat{L}_{n}^{({\bm{\theta}})}\triangleq e^{\beta_{n}{\mathbf{J}}_{11}}\hat{L}_{n-1}^{({\bm{\theta}})}+\beta_{n}M_{n}^{({\bm{\theta}})}=\sum_{k=1}^{n}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\beta_{k}M_{k}^{({\bm{\theta}})}, (77a)
R^n(𝜽)eβn𝐉11R^n1(𝜽)+βn𝐉12(α)(Ln1(𝐱)+Rn1(𝐱))=k=1ne(unuk)𝐉11βk𝐉12(α)(Lk1(𝐱)+Rk1(𝐱)).\hat{R}_{n}^{({\bm{\theta}})}\triangleq e^{\beta_{n}{\mathbf{J}}_{11}}\hat{R}_{n-1}^{({\bm{\theta}})}+\beta_{n}{\mathbf{J}}_{12}(\alpha)(L_{n-1}^{({\mathbf{x}})}+R_{n-1}^{({\mathbf{x}})})=\sum_{k=1}^{n}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\beta_{k}{\mathbf{J}}_{12}(\alpha)(L_{k-1}^{({\mathbf{x}})}+R_{k-1}^{({\mathbf{x}})}). (77b)

Moreover, we define the remaining term Δ^n(𝜽)𝜽n𝜽L^n(𝜽)R^n(𝜽)\hat{\Delta}_{n}^{({\bm{\theta}})}\triangleq{\bm{\theta}}_{n}-{\bm{\theta}}^{*}-\hat{L}_{n}^{({\bm{\theta}})}-\hat{R}_{n}^{({\bm{\theta}})}.

The proof outline is the same as in the previous scenario:

  • We first show that βn1/2L^n(𝜽)\beta_{n}^{-1/2}\hat{L}_{n}^{({\bm{\theta}})} weakly converges to the distribution N(0,𝐕𝜽(3)(α))N(0,{\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha));

  • We analyse R^n(𝜽)\hat{R}_{n}^{({\bm{\theta}})} and Δ^n(𝜽)\hat{\Delta}_{n}^{({\bm{\theta}})} to ensure that these two terms decrease faster than the CLT scale βn\sqrt{\beta_{n}}, i.e., limnβn1/2(R^n(𝜽)+Δ^n(𝜽))=0\lim_{n\to\infty}\beta_{n}^{-1/2}(\hat{R}_{n}^{({\bm{\theta}})}+\hat{\Delta}_{n}^{({\bm{\theta}})})=0;

  • With the above two steps, we can show that βn1/2(𝜽n𝜽)\beta_{n}^{-1/2}({\bm{\theta}}_{n}-{\bm{\theta}}^{*}) weakly converges to the distribution N(0,𝐕𝜽(3)(α))N(0,{\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha)).

Analysis of L^n(θ)\hat{L}_{n}^{({\bm{\theta}})}

We first focus on L^n(𝜽)\hat{L}_{n}^{({\bm{\theta}})} and follow similar steps as we did when we analysed Ln(𝜽)L_{n}^{({\bm{\theta}})} in the previous scenario. We set a Martingale Z(n)={Zk(n)}k1Z^{(n)}=\{Z_{k}^{(n)}\}_{k\geq 1} such that

Zk(n)=βn1/2j=1ke(unuj)𝐉11βjMj(𝜽).Z_{k}^{(n)}=\beta_{n}^{-1/2}\sum_{j=1}^{k}e^{(u_{n}-u_{j}){\mathbf{J}}_{11}}\beta_{j}M_{j}^{({\bm{\theta}})}.

Then,

Ank=1n𝔼[(Zk(n)Zk1(n))(Zk(n)Zk1(n))T|k1].A_{n}\triangleq\sum_{k=1}^{n}\mathbb{E}\left[\left.(Z_{k}^{(n)}-Z_{k-1}^{(n)})(Z_{k}^{(n)}-Z_{k-1}^{(n)})^{T}\right|{\mathcal{F}}_{k-1}\right].

Following steps similar to those in (59) to decompose M_{k}^{({\bm{\theta}})} using (42b), we have

An=βn1k=1nβk2e(unuk)𝐉11(𝐔11+𝐃k(3)+𝐉k(3))e(unuk)𝐉11T=βn1k=1nβk2e(unuk)𝐉11𝐔11e(unuk)𝐉11TAn(a)+βn1k=1nβk2e(unuk)𝐉11𝐃k(3)e(unuk)𝐉11TAn(b)+βn1k=1nβk2e(unuk)𝐉11𝐉k(3)e(unuk)𝐉11TAn(c)\begin{split}A_{n}=&~{}\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}\left({\mathbf{U}}_{11}+{\mathbf{D}}_{k}^{(3)}+{\mathbf{J}}_{k}^{(3)}\right)e^{(u_{n}-u_{k}){\mathbf{J}}_{11}^{T}}\\ =&~{}\underbrace{\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}{\mathbf{U}}_{11}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}^{T}}}_{A_{n}^{(a)}}+\underbrace{\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}{\mathbf{D}}_{k}^{(3)}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}^{T}}}_{A_{n}^{(b)}}\\ &+\underbrace{\beta_{n}^{-1}\sum_{k=1}^{n}\beta_{k}^{2}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}}{\mathbf{J}}_{k}^{(3)}e^{(u_{n}-u_{k}){\mathbf{J}}_{11}^{T}}}_{A_{n}^{(c)}}\\ \end{split} (78)

Since A_{n}^{(a)},A_{n}^{(b)},A_{n}^{(c)} have forms similar to those in Lemma E.3, we follow the same steps as in that proof, applying Lemma E.1. To avoid repetition, we omit the details and directly state the following lemma.

Lemma E.8.

For An(a),An(b),An(c)A^{(a)}_{n},A^{(b)}_{n},A^{(c)}_{n} defined in (78), we have

limnAn(a)=𝐕𝜽(3)(α),limnAn(b)=0,limnAn(c)=0,\lim_{n\to\infty}A^{(a)}_{n}={\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha),\quad\lim_{n\to\infty}\|A^{(b)}_{n}\|=0,\quad\lim_{n\to\infty}\|A^{(c)}_{n}\|=0, (79)

where 𝐕𝛉(3)(α){\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha) is the solution to the Lyapunov equation

𝐉11𝐕+𝐕𝐉11T+𝐔11=0.{\mathbf{J}}_{11}{\mathbf{V}}+{\mathbf{V}}{\mathbf{J}}_{11}^{T}+{\mathbf{U}}_{11}=0.

Note that, compared to Lemma E.3, the term \frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}} does not appear in the above lemma because in the case \gamma_{n}=o(\beta_{n}) we have b<1, so that \mathds{1}_{\{b=1\}}=0. Then, applying Lemma G.1 to derive the closed form of {\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha) gives

{\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha)=\textstyle\int_{0}^{\infty}e^{t\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})}{\mathbf{U}}_{11}e^{t\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})^{T}}dt.

Thus, it follows that

limnk=1n𝔼[(Zk(n)Zk1(n))(Zk(n)Zk1(n))T|k1]=𝐕𝜽(3)(α).\lim_{n\to\infty}\sum_{k=1}^{n}\mathbb{E}\left[(Z_{k}^{(n)}-Z_{k-1}^{(n)})(Z_{k}^{(n)}-Z_{k-1}^{(n)})^{T}|{\mathcal{F}}_{k-1}\right]={\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha).

Again, applying the Martingale CLT result in Theorem G.3, we obtain

Z_{n}^{(n)}=\beta_{n}^{-1/2}\hat{L}_{n}^{({\bm{\theta}})}\xrightarrow[n\to\infty]{dist.}N\left(0,{\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha)\right).

Moreover, in analogy with the tighter upper bound on L_{n}^{({\mathbf{x}})} established in Lemma E.5, applying the tighter bound of Lemma E.6 within that proof yields \hat{L}_{n}^{({\bm{\theta}})}=O(\sqrt{\beta_{n}\log(u_{n})}).
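As a numerical sanity check of the limit \lim_{n\to\infty}A_{n}^{(a)}={\mathbf{V}}^{(3)}_{{\bm{\theta}}}(\alpha) in Lemma E.8, the following sketch evaluates the weighted sum A_{n}^{(a)} from (78) with a toy diagonal Hurwitz matrix standing in for {\mathbf{J}}_{11} (diagonal only so that the matrix exponentials can be vectorized) and a toy positive definite matrix standing in for {\mathbf{U}}_{11}, and compares it with the closed-form integral above; the agreement is approximate at finite n and improves as n grows.

```python
import numpy as np

b = 0.6                                   # step-size exponent, b < 1 as in case (iii)
n = 200_000
lam = np.array([-0.7, -1.5])              # eigenvalues of a toy diagonal Hurwitz matrix (stand-in for J_11)
U = np.array([[2.0, 0.4], [0.4, 1.0]])    # toy positive definite matrix (stand-in for U_11)

beta = np.arange(1, n + 1, dtype=float) ** -b
u = np.cumsum(beta)
t = u[-1] - u                             # t_k = u_n - u_k

# A_n^{(a)} entrywise: for diagonal J, (e^{tJ} U e^{tJ^T})_{ij} = e^{t(lam_i + lam_j)} U_{ij}
A = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        A[i, j] = U[i, j] * np.sum(beta**2 * np.exp(t * (lam[i] + lam[j]))) / beta[-1]

# Limit: int_0^inf e^{tJ} U e^{tJ^T} dt, entrywise equal to -U_{ij} / (lam_i + lam_j)
V = -U / (lam[:, None] + lam[None, :])

print(A)   # close to V for large n; the gap shrinks as n grows
print(V)
```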

Analysis of R^n(θ)\hat{R}_{n}^{({\bm{\theta}})}

Next, we turn to the term \hat{R}_{n}^{({\bm{\theta}})} in (77b). Taking the norm and applying Lemma G.4 gives the following inequality for some constants C,T>0:

R^n(𝜽)Ck=1ne(unuk)Tβk(Lk1(𝐱)+Rk1(𝐱)).\|\hat{R}_{n}^{({\bm{\theta}})}\|\leq C\sum_{k=1}^{n}e^{-(u_{n}-u_{k})T}\beta_{k}(\|L_{k-1}^{({\mathbf{x}})}\|+\|R_{k-1}^{({\mathbf{x}})}\|).

Using Lemma G.6 gives

\sum_{k=1}^{n}e^{-(u_{n}-u_{k})T}\beta_{k}(\|L_{k-1}^{({\mathbf{x}})}\|+\|R_{k-1}^{({\mathbf{x}})}\|)=O(\|L_{n-1}^{({\mathbf{x}})}\|+\|R_{n-1}^{({\mathbf{x}})}\|).

Thus, βn1/2R^n(𝜽)=o(γnβn1)+O(γnβn1log(sn))\beta_{n}^{-1/2}\|\hat{R}_{n}^{({\bm{\theta}})}\|=o(\sqrt{\gamma_{n}\beta_{n}^{-1}})+O\left(\sqrt{\gamma_{n}\beta_{n}^{-1}\log(s_{n})}\right). Since γn=o(βn)\gamma_{n}=o(\beta_{n}), γnβn1=(n+1)ba\gamma_{n}\beta_{n}^{-1}=(n+1)^{b-a}, where ba<0b-a<0. Then, there exists some s>0s>0 such that ba<s<0b-a<-s<0. Together with log(sn)=O(log(n))\log(s_{n})=O(\log(n)), we have O(γnβn1log(sn))=O(nslog(n))=o(1)O\left(\sqrt{\gamma_{n}\beta_{n}^{-1}\log(s_{n})}\right)=O(\sqrt{n^{-s}\log(n)})=o(1). Therefore, we have

limnβn1/2R^n(𝜽)=0.\lim_{n\to\infty}\beta_{n}^{-1/2}\hat{R}_{n}^{({\bm{\theta}})}=0.

Analysis of Δ^n(θ)\hat{\Delta}_{n}^{({\bm{\theta}})}

Lastly, we focus on the term \hat{\Delta}_{n}^{({\bm{\theta}})}. We have

Δ^n+1(𝜽)=𝜽n+1𝜽L^n+1(𝜽)R^n+1(𝜽)=𝜽n𝜽+βn+1(𝐉11(𝜽n𝜽)+𝐉12(α)(𝐱n𝝁)+Mn+1(𝜽)+rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽))eβn+1𝐉11L^n(𝜽)βn+1Mn+1(𝜽)eβn+1𝐉11R^n(𝜽)βn+1𝐉12(α)(Ln(𝐱)+Rn(𝐱))=(𝐈+βn+1𝐉11)(𝜽n𝜽)+βn+1𝐉12(α)Δn(𝐱)+βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽))(𝐈+βn+1𝐉11+O(βn+12))(L^n(𝜽)+R^n(𝜽))=(𝐈+βn+1𝐉11)Δ^n(𝜽)+βn+1𝐉12(α)Δn(𝐱)+βn+1(rn(𝜽,1)+rn(𝜽,2)+ηn(𝜽))+O(βn+12)(L^n(𝜽)+R^n(𝜽)).\begin{split}\hat{\Delta}_{n+1}^{({\bm{\theta}})}=&~{}{\bm{\theta}}_{n+1}-{\bm{\theta}}^{*}-\hat{L}_{n+1}^{({\bm{\theta}})}-\hat{R}_{n+1}^{({\bm{\theta}})}\\ =&~{}{\bm{\theta}}_{n}-{\bm{\theta}}^{*}+\beta_{n+1}\left({\mathbf{J}}_{11}({\bm{\theta}}_{n}-{\bm{\theta}}^{*})+{\mathbf{J}}_{12}(\alpha)({\mathbf{x}}_{n}-{\bm{\mu}})+M_{n+1}^{({\bm{\theta}})}+r_{n}^{({\bm{\theta}},1)}+r_{n}^{({\bm{\theta}},2)}+\eta_{n}^{({\bm{\theta}})}\right)\\ &-e^{\beta_{n+1}{\mathbf{J}}_{11}}\hat{L}_{n}^{({\bm{\theta}})}-\beta_{n+1}M_{n+1}^{({\bm{\theta}})}-e^{\beta_{n+1}{\mathbf{J}}_{11}}\hat{R}_{n}^{({\bm{\theta}})}-\beta_{n+1}{\mathbf{J}}_{12}(\alpha)(L_{n}^{({\mathbf{x}})}+R_{n}^{({\mathbf{x}})})\\ =&~{}({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11})({\bm{\theta}}_{n}-{\bm{\theta}}^{*})+\beta_{n+1}{\mathbf{J}}_{12}(\alpha)\Delta_{n}^{({\mathbf{x}})}+\beta_{n+1}(r_{n}^{({\bm{\theta}},1)}+r_{n}^{({\bm{\theta}},2)}+\eta_{n}^{({\bm{\theta}})})\\ &-({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11}+O(\beta_{n+1}^{2}))(\hat{L}_{n}^{({\bm{\theta}})}+\hat{R}_{n}^{({\bm{\theta}})})\\ =&~{}({\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11})\hat{\Delta}_{n}^{({\bm{\theta}})}+\beta_{n+1}{\mathbf{J}}_{12}(\alpha)\Delta_{n}^{({\mathbf{x}})}+\beta_{n+1}(r_{n}^{({\bm{\theta}},1)}+r_{n}^{({\bm{\theta}},2)}+\eta_{n}^{({\bm{\theta}})})\\ &+O(\beta_{n+1}^{2})(\hat{L}_{n}^{({\bm{\theta}})}+\hat{R}_{n}^{({\bm{\theta}})}).\end{split}

where the second equality follows from (53a) and the third from the first-order expansion e^{\beta_{n+1}{\mathbf{J}}_{11}}={\mathbf{I}}+\beta_{n+1}{\mathbf{J}}_{11}+O(\beta_{n+1}^{2}). Then, using again the definition \Phi_{k,n}\triangleq\prod_{j={k+1}}^{n}({\mathbf{I}}+\beta_{j}{\mathbf{J}}_{11}) and unrolling the above recursion, we obtain

\begin{split}\hat{\Delta}_{n+1}^{({\bm{\theta}})}=&\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}\left(O(\beta_{k+1})\hat{L}_{k}^{({\bm{\theta}})}+O(\beta_{k+1})\hat{R}_{k}^{({\bm{\theta}})}\right)\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}{\mathbf{J}}_{12}(\alpha)\Delta_{k}^{({\mathbf{x}})}+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},1)}_{k}+\eta_{k}^{({\bm{\theta}})})\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}r^{({\bm{\theta}},2)}_{k}\\ \triangleq&~{}\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}+\hat{\Delta}_{n+1}^{({\bm{\theta}},2)},\end{split}

where Δ^n+1(𝜽,2)=k=1nΦk,nβk+1rk(𝜽,2)\hat{\Delta}_{n+1}^{({\bm{\theta}},2)}=\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}r^{({\bm{\theta}},2)}_{k} and

\begin{split}\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}=&\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}\left(O(\beta_{k+1})\hat{L}_{k}^{({\bm{\theta}})}+O(\beta_{k+1})\hat{R}_{k}^{({\bm{\theta}})}\right)\\ &+\sum_{k=1}^{n}\Phi_{k,n}\beta_{k+1}(r^{({\bm{\theta}},1)}_{k}+\eta_{k}^{({\bm{\theta}})}+{\mathbf{J}}_{12}(\alpha)\Delta_{k}^{({\mathbf{x}})}).\end{split} (80)

For Δ^n+1(𝜽,2)\hat{\Delta}_{n+1}^{({\bm{\theta}},2)}, we follow the same steps from (74) to (75), and obtain Δ^n+1(𝜽,2)=o(βn)\hat{\Delta}_{n+1}^{({\bm{\theta}},2)}=o(\sqrt{\beta_{n}}).

Next, we consider Δ^n+1(𝜽,1)\hat{\Delta}_{n+1}^{({\bm{\theta}},1)} and want to show that Δ^n+1(𝜽,1)=o(βn)\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}=o(\sqrt{\beta_{n}}). Again, we utilize Mokkadem & Pelletier (2006, Lemma 6) for Δ^n+1(𝜽,1)\hat{\Delta}_{n+1}^{({\bm{\theta}},1)} and adapt the notation here for the case γn=o(βn)\gamma_{n}=o(\beta_{n}).

Lemma E.9.

For \hat{\Delta}_{n+1}^{({\bm{\theta}},1)} in the form of (80), assume \|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|=O(\omega_{n}) and \|\hat{\Delta}_{n}^{({\bm{\theta}},1)}\|=O(\delta_{n}) for sequences \omega_{n},\delta_{n} satisfying Definition E.1. Then, we have

Δ^n+1(𝜽,1)=O(γn2βn2ωn2+γnβn1δn)+o(γn)a.s.\|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|=O(\gamma_{n}^{2}\beta_{n}^{-2}\omega_{n}^{2}+\gamma_{n}\beta_{n}^{-1}\delta_{n})+o(\sqrt{\gamma_{n}})\quad\text{a.s.} (81)

We now successively tighten the big-O bound \delta_{n}, starting from \delta_{n}\equiv 1, in order to obtain the final big-O bound on \|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|. The following steps follow the ideas in Mokkadem & Pelletier (2006, Section 2.3.2).

By the almost sure convergence \lim_{n\to\infty}{\bm{\theta}}_{n}={\bm{\theta}}^{*}, we have \lim_{n\to\infty}\hat{\Delta}_{n}^{({\bm{\theta}})}=0 a.s., so we may initially set \delta_{n}\equiv 1, which gives \|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|=O(\gamma_{n}^{2}\beta_{n}^{-2}\omega_{n}^{2}+\gamma_{n}\beta_{n}^{-1})+o(\sqrt{\gamma_{n}}). Then, we redefine

δnO(γn2βn2ωn2+γnβn1)+o(γn),\delta_{n}\equiv O(\gamma_{n}^{2}\beta_{n}^{-2}\omega_{n}^{2}+\gamma_{n}\beta_{n}^{-1})+o(\sqrt{\gamma_{n}}),

and note that it still satisfies Definition E.1. Reapplying this form of \delta_{n} in (81) gives

Δ^n+1(𝜽,1)=O(γn2βn2ωn2+[γnβn1]2)+o(γn)\|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|=O(\gamma_{n}^{2}\beta_{n}^{-2}\omega_{n}^{2}+[\gamma_{n}\beta_{n}^{-1}]^{2})+o(\sqrt{\gamma_{n}})

and by induction we have for all integers k1k\geq 1,

Δ^n+1(𝜽,1)=O(γn2βn2ωn2+[γnβn1]k)+o(γn).\|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|=O(\gamma_{n}^{2}\beta_{n}^{-2}\omega_{n}^{2}+[\gamma_{n}\beta_{n}^{-1}]^{k})+o(\sqrt{\gamma_{n}}).

Since [\gamma_{n}\beta_{n}^{-1}]^{k}=n^{(b-a)k}, there exists k_{0}>\frac{a}{2(a-b)} such that [\gamma_{n}\beta_{n}^{-1}]^{k_{0}}=o(\sqrt{\gamma_{n}}), and

Δ^n+1(𝜽,1)=O(γn2βn2ωn2)+o(γn).\|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|=O(\gamma_{n}^{2}\beta_{n}^{-2}\omega_{n}^{2})+o(\sqrt{\gamma_{n}}). (82)

Then, as suggested in Mokkadem & Pelletier (2006, Section 2.3.2), we can choose \omega_{n}=O(\sqrt{\beta_{n}\log(u_{n})}+[\gamma_{n}\beta_{n}^{-1}]^{k}), which also satisfies Definition E.1. Then,

𝜽n𝜽=L^n(𝜽)+R^n(𝜽)+Δ^n(𝜽)=O(βnlog(un)+γnβn1log(sn)+([γnβn1]k+1+γnβn1βnlog(un))2)+o(βn+γn)=O(βnlog(un)+[γnβn1]k+1).\begin{split}&\|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|=\|\hat{L}_{n}^{({\bm{\theta}})}+\hat{R}_{n}^{({\bm{\theta}})}+\hat{\Delta}_{n}^{({\bm{\theta}})}\|\\ =&O\left(\sqrt{\beta_{n}\log(u_{n})}\!+\!\sqrt{\gamma_{n}\beta_{n}^{-1}\log(s_{n})}\!+\!\left([\gamma_{n}\beta_{n}^{-1}]^{k+1}\!+\!\gamma_{n}\beta_{n}^{-1}\sqrt{\beta_{n}\log(u_{n})}\right)^{2}\right)\\ &+o(\sqrt{\beta_{n}}+\sqrt{\gamma_{n}})\\ =&O(\sqrt{\beta_{n}\log(u_{n})}+[\gamma_{n}\beta_{n}^{-1}]^{k+1}).\end{split}

By induction, this holds for every k\geq 1, so there exists k_{0} with [\gamma_{n}\beta_{n}^{-1}]^{k_{0}}=o(\sqrt{\beta_{n}}), giving \|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|=O(\sqrt{\beta_{n}\log(u_{n})}); equivalently, we may take \omega_{n}=\sqrt{\beta_{n}\log(u_{n})}. Therefore, from (82) we have

Δ^n+1(𝜽,1)=O(γn2βn1log(un))+o(γn)=o(γn).\|\hat{\Delta}_{n+1}^{({\bm{\theta}},1)}\|=O(\gamma_{n}^{2}\beta_{n}^{-1}\log(u_{n}))+o(\sqrt{\gamma_{n}})=o(\sqrt{\gamma_{n}}).

Together with \|\hat{\Delta}_{n+1}^{({\bm{\theta}},2)}\|=o(\sqrt{\beta_{n}}), we have \beta_{n}^{-1/2}\|\hat{\Delta}_{n+1}^{({\bm{\theta}})}\|=o(\sqrt{\gamma_{n}\beta_{n}^{-1}})+o(1)=o(1), such that

limnβn1/2Δ^n+1(𝜽)=0.\lim_{n\to\infty}\beta_{n}^{-1/2}\hat{\Delta}_{n+1}^{({\bm{\theta}})}=0.

This completes the proof of case (iii), following the proof outline given at the beginning of this subsection.

Appendix F Discussion of Covariance Ordering of SA-SRRW

F.1 Proof of Proposition 3.4

For any α>0\alpha>0 and any vector 𝐱d{\mathbf{x}}\in\mathbb{R}^{d}, we have

𝐱T𝐕𝜽(1)(α)𝐱=0𝐱Tet(𝜽𝐡(𝜽)+𝟙{b=1}2𝑰)𝐔𝜽(α)et(𝜽𝐡(𝜽)+𝟙{b=1}2𝑰)T𝐱𝑑t{\mathbf{x}}^{T}{\mathbf{V}}_{\bm{\theta}}^{(1)}(\alpha){\mathbf{x}}=\int_{0}^{\infty}{\mathbf{x}}^{T}e^{t(\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}})}{\mathbf{U}}_{{\bm{\theta}}}(\alpha)e^{t(\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}})^{T}}{\mathbf{x}}~{}dt

where the equality follows from the form of {\mathbf{V}}_{\bm{\theta}}^{(1)}(\alpha) in Theorem 3.3. Let {\mathbf{y}}\triangleq e^{t(\nabla_{{\bm{\theta}}}{\mathbf{h}}({\bm{\theta}}^{*})+\frac{\mathds{1}_{\{b=1\}}}{2}{\bm{I}})}{\mathbf{x}}, with the dependence on the variable t left implicit. The matrix {\mathbf{U}}_{\bm{\theta}}(\alpha), given explicitly in (11), is positive semi-definite, since \lambda_{i}\in(-1,1) for all i\in\{1,\cdots,N-1\}. Thus, the integrand {\mathbf{y}}^{T}{\mathbf{U}}_{{\bm{\theta}}}(\alpha){\mathbf{y}} is non-negative, and it suffices to establish an ordering of {\mathbf{y}}^{T}{\mathbf{U}}_{{\bm{\theta}}}(\alpha){\mathbf{y}} with respect to \alpha.

For any α2>α1>0\alpha_{2}>\alpha_{1}>0,

𝐲T𝐔𝜽(α2)𝐲\displaystyle{\mathbf{y}}^{T}{\mathbf{U}}_{{\bm{\theta}}}(\alpha_{2}){\mathbf{y}} =i=1N11(α2(1+λi)+1)21+λi1λi𝐲T𝐇T𝐮i𝐮iT𝐇𝐲\displaystyle=\sum_{i=1}^{N-1}\frac{1}{(\alpha_{2}(1+\lambda_{i})+1)^{2}}\cdot\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{y}}^{T}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}{\mathbf{y}}
<i=1N11(α1(1+λi)+1)21+λi1λi𝐲T𝐇T𝐮i𝐮iT𝐇𝐲=𝐲T𝐔𝜽(α1)𝐲\displaystyle<\sum_{i=1}^{N-1}\frac{1}{(\alpha_{1}(1+\lambda_{i})+1)^{2}}\cdot\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{y}}^{T}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}{\mathbf{y}}={\mathbf{y}}^{T}{\mathbf{U}}_{{\bm{\theta}}}(\alpha_{1}){\mathbf{y}}
\displaystyle<\sum_{i=1}^{N-1}\frac{1+\lambda_{i}}{1-\lambda_{i}}{\mathbf{y}}^{T}{\mathbf{H}}^{T}{\mathbf{u}}_{i}{\mathbf{u}}_{i}^{T}{\mathbf{H}}{\mathbf{y}}={\mathbf{y}}^{T}{\mathbf{U}}_{{\bm{\theta}}}(0){\mathbf{y}},

where the inequality is because \alpha(1+\lambda_{i})>0 for all i\in\{1,\cdots,N-1\} and any \alpha>0. (The inequality may not be strict when {\mathbf{H}} is low rank; however, it always holds strictly for some choice of {\mathbf{x}}, since {\mathbf{H}} is not the zero matrix, so the derived ordering still follows our definition of <_{L} in Section 1, footnote 6.) In fact, the ordering is monotone in \alpha, and {\mathbf{y}}^{T}{\mathbf{U}}_{{\bm{\theta}}}(\alpha){\mathbf{y}} decreases at rate 1/\alpha^{2}, as seen from its form in the equation above. This completes the proof.
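A minimal numerical sketch of this ordering, with arbitrary toy values for the eigenvalues \lambda_{i}\in(-1,1), the vectors {\mathbf{u}}_{i}, the matrix {\mathbf{H}} and the vector {\mathbf{y}} (illustrative stand-ins, not quantities computed from an actual graph):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 3
lam = rng.uniform(-0.9, 0.9, size=N - 1)   # toy eigenvalues lambda_i in (-1, 1)
u_vecs = rng.normal(size=(N - 1, N))       # toy vectors u_i
H = rng.normal(size=(N, d))                # toy matrix H
y = rng.normal(size=d)                     # toy vector y

def quad_form(alpha):
    """y^T U_theta(alpha) y, written as in the displayed sum above."""
    total = 0.0
    for i in range(N - 1):
        w = (1.0 / (alpha * (1 + lam[i]) + 1) ** 2) * (1 + lam[i]) / (1 - lam[i])
        total += w * (u_vecs[i] @ H @ y) ** 2
    return total

for alpha in [0.0, 1.0, 5.0, 10.0, 20.0]:
    # strictly decreasing in alpha; for large alpha each summand decays at rate 1/alpha^2
    print(alpha, quad_form(alpha))
```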

F.2 Discussion regarding Proposition 3.4 and MSE ordering

We can use Proposition 3.4 to show that the MSE of the SA iterates of (4c) driven by SRRW eventually becomes smaller than that of the SA iterates whose stochastic noise is driven by an i.i.d. sequence of random variables. The diagonal entries of {\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha) are obtained by evaluating {\mathbf{e}}_{i}^{T}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha){\mathbf{e}}_{i}, where {\mathbf{e}}_{i} is the i'th standard basis vector (the D-dimensional vector of all zeros except at the i'th position, which is 1). These diagonal entries are the asymptotic variances corresponding to the element-wise iterate errors, and for large enough n, we have {\mathbf{e}}_{i}^{T}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha){\mathbf{e}}_{i}\approx\mathbb{E}[({\bm{\theta}}_{n}-{\bm{\theta}}^{*})_{i}^{2}]/\beta_{n} for all i\in\{1,\cdots,D\}. Thus, the trace of {\mathbf{V}}_{\bm{\theta}}^{(1)}(\alpha) approximates the scaled MSE, that is, \text{Tr}({\mathbf{V}}_{\bm{\theta}}^{(1)}(\alpha))=\sum_{i}{\mathbf{e}}_{i}^{T}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha){\mathbf{e}}_{i}\approx\sum_{i}\mathbb{E}[({\bm{\theta}}_{n}-{\bm{\theta}}^{*})_{i}^{2}]/\beta_{n}=\mathbb{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}^{*}\|^{2}]/\beta_{n} for large n. Since all entries of {\mathbf{V}}_{\bm{\theta}}^{(1)}(\alpha) go to zero as \alpha increases, for large enough \alpha they become smaller than the corresponding terms for the SA algorithm with i.i.d. input, whose similarly scaled MSE approaches a constant because its asymptotic covariance does not depend on \alpha. Moreover, \alpha only needs to be moderately large, since the asymptotic covariance terms decrease at rate O(1/\alpha^{2}), as shown in Proposition 3.4.

F.3 Proof of Corollary 3.5

We see that {\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha)={\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0) for all \alpha>0, because the form of {\mathbf{V}}_{{\bm{\theta}}}^{(3)}(\alpha) in Theorem 3.3 is independent of \alpha. To prove that {\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha)<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0), it is enough to show that {\mathbf{V}}_{{\bm{\theta}}}^{(1)}(0)={\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0), since {\mathbf{V}}_{{\bm{\theta}}}^{(1)}(\alpha)<_{L}{\mathbf{V}}_{{\bm{\theta}}}^{(1)}(0) from Proposition 3.4. This is easily checked by substituting \alpha=0 in (11), which gives {\mathbf{U}}_{{\bm{\theta}}}(0)={\mathbf{U}}_{11}. Substituting this into the respective forms of {\mathbf{V}}_{{\bm{\theta}}}^{(1)}(0) and {\mathbf{V}}_{{\bm{\theta}}}^{(3)}(0) in Theorem 3.3 yields the equality. This completes the proof.

Appendix G Background Theory

G.1 Technical Lemmas

Lemma G.1 (Solution to the Lyapunov Equation).

If all the eigenvalues of matrix 𝐌{\mathbf{M}} have negative real part, then for every positive semi-definite matrix 𝐔{\mathbf{U}} there exists a unique positive semi-definite matrix 𝐕{\mathbf{V}} satisfying the Lyapunov equation 𝐔+𝐌𝐕+𝐕𝐌T=𝟎{\mathbf{U}}+{\mathbf{M}}{\mathbf{V}}+{\mathbf{V}}{\mathbf{M}}^{T}={\bm{0}}. The explicit solution 𝐕{\mathbf{V}} is given as

𝐕=0e𝐌t𝐔e(𝐌T)t𝑑t.{\mathbf{V}}=\int_{0}^{\infty}e^{{\mathbf{M}}t}{\mathbf{U}}e^{({\mathbf{M}}^{T})t}dt. (83)

Chellaboina & Haddad (2008, Theorem 3.16) state that for a positive definite matrix {\mathbf{U}} there exists a unique positive definite solution {\mathbf{V}}. They focus on positive definite {\mathbf{U}} because they require the associated autonomous ODE system to be asymptotically stable, a requirement we do not need in this paper. The same steps therein can be used to prove Lemma G.1 and to show that if {\mathbf{U}} is positive semi-definite, then {\mathbf{V}} in the form of (83) is unique and also positive semi-definite.
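A minimal sketch of Lemma G.1, comparing a truncated quadrature of the integral form (83) with the output of SciPy's continuous Lyapunov solver for a toy Hurwitz {\mathbf{M}} and positive semi-definite {\mathbf{U}} (the matrices below are illustrative):

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

M = np.array([[-1.0, 0.5], [0.0, -2.0]])   # Hurwitz: eigenvalues -1 and -2
U = np.array([[1.0, 0.2], [0.2, 0.5]])     # positive semi-definite

# solve_continuous_lyapunov(A, Q) solves A X + X A^T = Q, so pass Q = -U
V_solver = solve_continuous_lyapunov(M, -U)

# Truncated quadrature of (83): V = int_0^inf e^{Mt} U e^{M^T t} dt
ts = np.linspace(0.0, 30.0, 6001)
dt = ts[1] - ts[0]
V_quad = sum(expm(M * t) @ U @ expm(M.T * t) * dt for t in ts)

print(np.allclose(V_solver, V_quad, atol=1e-2))                        # True
print(np.allclose(M @ V_solver + V_solver @ M.T + U, 0, atol=1e-10))   # Lyapunov equation holds
```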

Lemma G.2 (Burkholder Inequality, Davis (1970), Hall et al. (2014) Theorem 2.10).

Given a Martingale difference sequence {Mi,n}i=1n\{M_{i,n}\}_{i=1}^{n}, for p1p\geq 1 and some positive constant CpC_{p}, we have

𝔼[i=1nMi,np]Cp𝔼[(i=1nMi,n2)p/2]\mathbb{E}\left[\left\|\sum_{i=1}^{n}M_{i,n}\right\|^{p}\right]\leq C_{p}\mathbb{E}\left[\left(\sum_{i=1}^{n}\left\|M_{i,n}\right\|^{2}\right)^{p/2}\right] (84)
Theorem G.3 (Martingale CLT, Delyon (2000) Theorem 30).

If a Martingale difference array \{X_{n,i}\} satisfies the following conditions for some \tau>0:

k=1n𝔼[Xn,k2+τ|k1]0,\sum_{k=1}^{n}\mathbb{E}\left[\|X_{n,k}\|^{2+\tau}|{\mathcal{F}}_{k-1}\right]\xrightarrow[]{\mathbb{P}}0, (85)
supnk=1n𝔼[Xn,k2|k1]<,\sup_{n}\sum_{k=1}^{n}\mathbb{E}\left[\|X_{n,k}\|^{2}|{\mathcal{F}}_{k-1}\right]<\infty, (86)

and

k=1n𝔼[Xn,kXn,kT|k1]𝑽,\sum_{k=1}^{n}\mathbb{E}\left[X_{n,k}X_{n,k}^{T}|{\mathcal{F}}_{k-1}\right]\xrightarrow[]{\mathbb{P}}{\bm{V}}, (87)

then

i=1nXn,idist.N(0,𝑽).\sum_{i=1}^{n}X_{n,i}\xrightarrow[]{dist.}N(0,{\bm{V}}). (88)

Lemma G.4 (Duflo (1996) Proposition 3.I.2).

For a Hurwitz matrix 𝐇{\bm{H}}, there exist some positive constants C,bC,b such that for any nn,

e𝑯nCebn.\left\|e^{{\bm{H}}n}\right\|\leq Ce^{-bn}. (89)
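A small numerical illustration of Lemma G.4 with a toy (non-normal) Hurwitz matrix: rescaling \left\|e^{{\bm{H}}t}\right\| by e^{bt} for any b below the spectral abscissa leaves a bounded quantity, which plays the role of the constant C in (89).

```python
import numpy as np
from scipy.linalg import expm

H = np.array([[-1.0, 4.0], [0.0, -1.5]])   # toy Hurwitz, non-normal; eigenvalues -1 and -1.5
b = 0.9                                    # any rate below the spectral abscissa 1
for t in [1, 5, 10, 20, 40]:
    norm = np.linalg.norm(expm(H * t), 2)
    # the rescaled norm stays bounded, exhibiting the constant C in (89)
    print(t, norm, norm * np.exp(b * t))
```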
Lemma G.5 (Fort (2015) Lemma 5.8).

For a Hurwitz matrix {\bm{A}}, denote by -r, r>0, the largest real part of its eigenvalues. Let \{\gamma_{n}\} be a positive sequence such that \lim_{n}\gamma_{n}=0. Then for any 0<r^{\prime}<r, there exists a positive constant C such that for any k<n,

j=kn(𝑰+γj𝑨)Cerj=knγj.\left\|\prod_{j=k}^{n}({\bm{I}}+\gamma_{j}{\bm{A}})\right\|\leq Ce^{-r^{\prime}\sum_{j=k}^{n}\gamma_{j}}. (90)
Lemma G.6 (Fort (2015) Lemma 5.9, Mokkadem & Pelletier (2006) Lemma 10).

Let {γn}\{\gamma_{n}\} be a positive sequence such that limnγn=0\lim_{n}\gamma_{n}=0 and nγn=\sum_{n}\gamma_{n}=\infty. Let {ϵn,n0}\{\epsilon_{n},n\geq 0\} be a nonnegative sequence. Then, for b>0b>0, p0p\geq 0,

lim supnγnpk=1nγkp+1ebj=k+1nγjϵk1C(b,p)lim supnϵn\limsup_{n}\gamma_{n}^{-p}\sum_{k=1}^{n}\gamma_{k}^{p+1}e^{-b\sum_{j=k+1}^{n}\gamma_{j}}\epsilon_{k}\leq\frac{1}{C(b,p)}\limsup_{n}\epsilon_{n} (91)

for some constant C(b,p)>0C(b,p)>0.

When p=0 and \{w_{n}\} is a positive sequence satisfying w_{n-1}/w_{n}=1+o(\gamma_{n}), we have

k=1nγkebj=k+1nγjϵk={O(wn),if ϵn=O(wn),o(wn),if ϵn=o(wn).\sum_{k=1}^{n}\gamma_{k}e^{-b\sum_{j=k+1}^{n}\gamma_{j}}\epsilon_{k}=\begin{cases}O(w_{n}),&\quad\text{if~{}}\epsilon_{n}=O(w_{n}),\\ o(w_{n}),&\quad\text{if~{}}\epsilon_{n}=o(w_{n}).\end{cases} (92)
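A quick numerical illustration of the p=0 case (92), assuming \gamma_{k}=k^{-0.7}, b=1, and \epsilon_{k}=w_{k}=\sqrt{\gamma_{k}} (all toy choices); the printed ratio of the weighted sum to w_{n} stays bounded, consistent with the O(w_{n}) claim.

```python
import numpy as np

n = 100_000
gamma = np.arange(1, n + 1, dtype=float) ** -0.7   # gamma_k = k^{-0.7}
w = np.sqrt(gamma)                                 # w_k = sqrt(gamma_k), so w_{k-1}/w_k = 1 + o(gamma_k)
eps = w                                            # epsilon_k = O(w_k)
b = 1.0

G = np.cumsum(gamma)                               # partial sums of gamma
for m in [10_000, 50_000, 100_000]:
    tail = G[m - 1] - G[:m]                        # sum_{j=k+1}^{m} gamma_j for each k
    S = np.sum(gamma[:m] * np.exp(-b * tail) * eps[:m])
    print(m, S / w[m - 1])                         # the ratio stays bounded, i.e., S = O(w_m)
```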
Lemma G.7 (Fort (2015) Lemma 5.10).

For any matrices A,B,CA,B,C,

ABATCBCTACB(A+C).\|ABA^{T}-CBC^{T}\|\leq\|A-C\|\|B\|(\|A\|+\|C\|). (93)

G.2 Asymptotic Results of Single-Timescale SA

Consider the stochastic approximation in the form of

𝒛n+1=𝒛n+γn+1G(𝒛n,Xn+1).{\bm{z}}_{n+1}={\bm{z}}_{n}+\gamma_{n+1}G({\bm{z}}_{n},X_{n+1}). (94)

Let {\mathbf{K}}_{{\bm{z}}} be the transition kernel of the underlying Markov chain \{X_{n}\}_{n\geq 0}, with stationary distribution \pi({\bm{z}}), and define g({\bm{z}})\triangleq\mathbb{E}_{X\sim\pi({\bm{z}})}[G({\bm{z}},X)] for {\bm{z}} in a domain {\mathcal{O}}\subseteq{\mathbb{R}}^{d}. Define an operator {\mathbf{K}}_{{\bm{z}}}f for any function f:{\mathcal{N}}\to{\mathbb{R}}^{D} such that

(𝐊𝒛f)(i)=j𝒩f(j)𝐊𝒛(i,j).({\mathbf{K}}_{{\bm{z}}}f)(i)=\sum_{j\in{\mathcal{N}}}f(j){\mathbf{K}}_{{\bm{z}}}(i,j). (95)
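Before stating the assumptions, a minimal sketch of the operator (95) and of the mean field g({\bm{z}}) on a small state space, with a toy kernel and a toy drift function (illustrative stand-ins only):

```python
import numpy as np

# Toy 3-state row-stochastic kernel K_z and its stationary distribution pi(z)
K = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
evals, evecs = np.linalg.eig(K.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

c = np.array([1.0, 2.0, 4.0])
def G(z, i):
    """Toy drift: G(z, i) = c_i - z, so that g(z) = sum_i pi_i c_i - z."""
    return c[i] - z

f = np.array([G(0.0, i) for i in range(3)])   # f(i) = G(z, i) evaluated at z = 0
Kf = K @ f                                    # (K_z f)(i) = sum_j f(j) K_z(i, j), as in (95)
g = pi @ f                                    # g(z) = E_{X ~ pi(z)}[G(z, X)]
print(Kf, g)
```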

Assume that

  1. C1.

    W.p.1, the closure of {𝒛n}n0\{{\bm{z}}_{n}\}_{n\geq 0} is a compact subset of 𝒪{\mathcal{O}}.

  2. C2.

    γn=γ0/na,a(1/2,1]\gamma_{n}=\gamma_{0}/n^{a},a\in(1/2,1].

  3. C3.

    Function gg is continuous on 𝒪{\mathcal{O}} and there exists a non-negative C1C^{1} function ww and a compact set 𝒦𝒪{\mathcal{K}}\subset{\mathcal{O}} such that

    • w(𝒛)Tg(𝒛)0\nabla w({\bm{z}})^{T}g({\bm{z}})\leq 0 for all 𝒛𝒪{\bm{z}}\in{\mathcal{O}} and w(𝒛)Tg(𝒛)<0\nabla w({\bm{z}})^{T}g({\bm{z}})<0 if 𝒛𝒦{\bm{z}}\notin{\mathcal{K}};

    • the set S{𝒛|w(𝒛)Tg(𝒛)=0}S\triangleq\{{\bm{z}}~{}|~{}\nabla w({\bm{z}})^{T}g({\bm{z}})=0\} is such that w(S)w(S) has an empty interior;

  4. C4.

    For every 𝒛{\bm{z}}, there exists a solution m𝒛:𝒩dm_{{\bm{z}}}:{\mathcal{N}}\to{\mathbb{R}}^{d} for the following Poisson equation

    m𝒛(i)(𝐊𝒛m𝒛)(i)=G(𝒛,i)g(𝒛)m_{{\bm{z}}}(i)-({\mathbf{K}}_{{\bm{z}}}m_{{\bm{z}}})(i)=G({\bm{z}},i)-g({\bm{z}}) (96)

    for any i𝒩i\in{\mathcal{N}}; for any compact set 𝒞𝒪{\mathcal{C}}\subset{\mathcal{O}},

    sup𝒛𝒞,i𝒩(𝐊𝒛m𝒛)(i)+m𝒛(i)<\sup_{{\bm{z}}\in{\mathcal{C}},i\in{\mathcal{N}}}\|({\mathbf{K}}_{{\bm{z}}}m_{{\bm{z}}})(i)\|+\|m_{{\bm{z}}}(i)\|<\infty (97)

    and there exist a continuous function ϕ𝒞,ϕ𝒞(0)=0\phi_{{\mathcal{C}}},\phi_{{\mathcal{C}}}(0)=0, such that for any 𝒛,𝒛𝒞{\bm{z}},{\bm{z}}^{\prime}\in{\mathcal{C}},

    supi𝒩(𝐊𝒛m𝒛)(i)(𝐊𝒛m𝒛)(i)ϕ𝒞(𝒛𝒛).\sup_{i\in{\mathcal{N}}}\|({\mathbf{K}}_{{\bm{z}}}m_{{\bm{z}}})(i)-({\mathbf{K}}_{{\bm{z}}^{\prime}}m_{{\bm{z}}^{\prime}})(i)\|\leq\phi_{{\mathcal{C}}}(\|{\bm{z}}-{\bm{z}}^{\prime}\|). (98)
  5. C5.

    Denote by r-r the largest real part of the eigenvalues of the Jacobian matrix g(𝒛)\nabla g({\bm{z}}^{*}) and assume r>𝟙{a=1}2r>\frac{\mathds{1}_{\{a=1\}}}{2}.

  6. C6.

    For every 𝒛{\bm{z}}, there exists a solution Q𝒛:𝒩d×dQ_{{\bm{z}}}:{\mathcal{N}}\to{\mathbb{R}}^{d\times d} for the following Poisson equation

    Q𝒛(i)(𝐊𝒛Q𝒛)(i)=F(𝒛,i)𝔼jπ(𝒛)[F(𝒛,j)]Q_{{\bm{z}}}(i)-({\mathbf{K}}_{{\bm{z}}}Q_{{\bm{z}}})(i)=F({\bm{z}},i)-\mathbb{E}_{j\sim\pi({\bm{z}})}[F({\bm{z}},j)] (99)

    for any i𝒩i\in{\mathcal{N}}, where

    F(𝒛,i)j𝒩m𝒛(j)m𝒛(j)T𝐊𝒛(i,j)(𝐊𝒛m𝒛)(i)(𝐊𝒛m𝒛)(i)T.F({\bm{z}},i)\triangleq\sum_{j\in{\mathcal{N}}}m_{{\bm{z}}}(j)m_{{\bm{z}}}(j)^{T}{\mathbf{K}}_{{\bm{z}}}(i,j)-({\mathbf{K}}_{{\bm{z}}}m_{{\bm{z}}})(i)({\mathbf{K}}_{{\bm{z}}}m_{{\bm{z}}})(i)^{T}. (100)

    For any compact set 𝒞𝒪{\mathcal{C}}\subset{\mathcal{O}},

    sup𝒛𝒞,i𝒩Q𝒛(i)+(𝐊𝒛Q𝒛)(i)<\sup_{{\bm{z}}\in{\mathcal{C}},i\in{\mathcal{N}}}\|Q_{{\bm{z}}}(i)\|+\|({\mathbf{K}}_{{\bm{z}}}Q_{{\bm{z}}})(i)\|<\infty (101)

    and there exist p,C𝒞>0p,C_{{\mathcal{C}}}>0, such that for any 𝒛,𝒛𝒞{\bm{z}},{\bm{z}}^{\prime}\in{\mathcal{C}},

    supi𝒩(𝐊𝒛Q𝒛)(i)(𝐊𝒛Q𝒛)(i)C𝒞𝒛𝒛p.\sup_{i\in{\mathcal{N}}}\|({\mathbf{K}}_{{\bm{z}}}Q_{{\bm{z}}})(i)-({\mathbf{K}}_{{\bm{z}}^{\prime}}Q_{{\bm{z}}^{\prime}})(i)\|\leq C_{{\mathcal{C}}}\|{\bm{z}}-{\bm{z}}^{\prime}\|^{p}. (102)
Theorem G.8 (Delyon et al. (1999) Theorem 2).

Consider (94) and assume C1 - C4. Then, w.p.1, lim supnd(𝐳n,S)=0\limsup_{n}d({\bm{z}}_{n},S)=0.

Theorem G.9 (Fort (2015) Theorem 2.1 & Proposition 4.1).

Consider (94) and assume C1 - C6. Then, given the condition that 𝐳n{\bm{z}}_{n} converges to one point 𝐳S{\bm{z}}^{*}\in S, we have

γn1/2(𝒛n𝒛)ndist.N(0,𝐕),\gamma_{n}^{-1/2}({\bm{z}}_{n}-{\bm{z}}^{*})\xrightarrow[n\to\infty]{dist.}N(0,{\mathbf{V}}), (103)

where

{\mathbf{V}}\left(\frac{\mathds{1}_{\{a=1\}}}{2}{\mathbf{I}}+\nabla g({\bm{z}}^{*})^{T}\right)+\left(\frac{\mathds{1}_{\{a=1\}}}{2}{\mathbf{I}}+\nabla g({\bm{z}}^{*})\right){\mathbf{V}}+{\mathbf{U}}=0, (104)

and

𝐔i𝒩μi(m𝐳(i)m𝐳(i)T(𝐊𝐳m𝐳)(i)(𝐊𝐳m𝐳)(i)T).{\mathbf{U}}\triangleq\sum_{i\in{\mathcal{N}}}\mu_{i}\left(m_{{\mathbf{z}}^{*}}(i)m_{{\mathbf{z}}^{*}}(i)^{T}-({\mathbf{K}}_{{\mathbf{z}}^{*}}m_{{\mathbf{z}}^{*}})(i)({\mathbf{K}}_{{\mathbf{z}}^{*}}m_{{\mathbf{z}}^{*}})(i)^{T}\right). (105)
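A minimal sketch of how the quantities in Theorem G.9 can be computed in a toy scalar example: a fixed 3-state ergodic kernel (so {\mathbf{K}}_{{\bm{z}}} does not depend on {\bm{z}}), G({\bm{z}},i)=c_{i}-{\bm{z}} (so {\bm{z}}^{*}=\sum_{i}\mu_{i}c_{i} and \nabla g({\bm{z}}^{*})=-1), the Poisson equation (96) solved via the fundamental matrix, then {\mathbf{U}} from (105) and {\mathbf{V}} from (104) with a<1. All numbers below are illustrative.

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])              # fixed ergodic kernel (independent of z)
c = np.array([1.0, 2.0, 4.0])

# stationary distribution mu
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

z_star = mu @ c                               # root of g(z) = mu @ c - z, with grad g(z*) = -1
Gbar = c - z_star                             # centered G(z*, .) (mean zero under mu)

# Poisson equation (96): m - P m = Gbar, solved via the fundamental matrix
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), mu))
m = Z @ Gbar
Pm = P @ m

U = mu @ (m**2 - Pm**2)                       # (105) in the scalar case
V = U / 2.0                                   # (104) with grad g(z*) = -1 and a < 1
print(z_star, U, V)                           # the variance of gamma_n^{-1/2}(z_n - z*) approaches V
```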

G.3 Asymptotic Results of Two-Timescale SA

For the two-timescale SA with iterate-dependent Markov chain, we have the following iterations:

{\mathbf{z}}_{n+1}={\mathbf{z}}_{n}+\beta_{n+1}G_{1}({\mathbf{z}}_{n},{\mathbf{y}}_{n},X_{n+1}), (106a)
𝐲n+1=𝐲n+γn+1G2(𝐳n,𝐲n,Xn+1),{\mathbf{y}}_{n+1}={\mathbf{y}}_{n}+\gamma_{n+1}G_{2}({\mathbf{z}}_{n},{\mathbf{y}}_{n},X_{n+1}), (106b)

with the goal of finding the root (𝐳,𝐲)({\mathbf{z}}^{*},{\mathbf{y}}^{*}) such that

g1(𝐳,𝐲)=𝔼X𝝁[G1(𝐳,𝐲,X)]=0,g2(𝐳,𝐲)=𝔼X𝝁[G2(𝐳,𝐲,X)]=0.g_{1}({\mathbf{z}}^{*},{\mathbf{y}}^{*})=\mathbb{E}_{X\sim{\bm{\mu}}}[G_{1}({\mathbf{z}}^{*},{\mathbf{y}}^{*},X)]=0,\quad g_{2}({\mathbf{z}}^{*},{\mathbf{y}}^{*})=\mathbb{E}_{X\sim{\bm{\mu}}}[G_{2}({\mathbf{z}}^{*},{\mathbf{y}}^{*},X)]=0. (107)

We present here a simplified version of the assumptions for single-valued functions G1,G2G_{1},G_{2} that are necessary for the almost sure convergence result in Yaji & Bhatnagar (2020, Theorem 4). The original assumptions are intended for more general set-valued functions G1,G2G_{1},G_{2}.

  1. (B1)

    The step sizes βnnb\beta_{n}\triangleq n^{-b} and γnna\gamma_{n}\triangleq n^{-a}, where 0.5<a<b10.5<a<b\leq 1.

  2. (B2)

    Assume the function G1(𝐳,𝐲,X)G_{1}({\mathbf{z}},{\mathbf{y}},X) is continuous and differentiable with respect to 𝐳,𝐲{\mathbf{z}},{\mathbf{y}}. There exists a positive constant L1L_{1} such that G1(𝐳,𝐲,X)L1(1+𝐳+𝐲)\|G_{1}({\mathbf{z}},{\mathbf{y}},X)\|\leq L_{1}(1+\|{\mathbf{z}}\|+\|{\mathbf{y}}\|) for every 𝐳d1,𝐲d2,X𝒩{\mathbf{z}}\in{\mathbb{R}}^{d_{1}},{\mathbf{y}}\in{\mathbb{R}}^{d_{2}},X\in{\mathcal{N}}. The same condition holds for the function G2G_{2} as well.

  3. (B3)

    Assume there exists a function \rho:{\mathbb{R}}^{d_{1}}\to{\mathbb{R}}^{d_{2}} such that the following three properties hold: (i) \|\rho({\mathbf{z}})\|\leq L_{2}(1+\|{\mathbf{z}}\|) for some positive constant L_{2}; (ii) the ODE \dot{\mathbf{y}}=g_{2}({\mathbf{z}},{\mathbf{y}}) has a globally asymptotically stable equilibrium \rho({\mathbf{z}}), i.e., g_{2}({\mathbf{z}},\rho({\mathbf{z}}))=0; (iii) letting \hat{g}_{1}({\mathbf{z}})\triangleq g_{1}({\mathbf{z}},\rho({\mathbf{z}})), there exists a set of disjoint roots \Lambda\triangleq\{{\mathbf{z}}^{*}:\hat{g}_{1}({\mathbf{z}}^{*})=0\}, which is the set of globally asymptotically stable equilibria of the ODE \dot{\mathbf{z}}=\hat{g}_{1}({\mathbf{z}}).

  4. (B4)

    {Xn}n0\{X_{n}\}_{n\geq 0} is an iterate-dependent Markov process in finite state space 𝒩{\mathcal{N}}. For every n0n\geq 0, P(Xn+1=j|𝐳m,𝐲m,Xm,0mn)=P(Xn+1=j|𝐳n,𝐲n,Xn=i)=𝐏i,j[𝐳n,𝐲n]P(X_{n+1}=j|{\mathbf{z}}_{m},{\mathbf{y}}_{m},X_{m},0\leq m\leq n)=P(X_{n+1}=j|{\mathbf{z}}_{n},{\mathbf{y}}_{n},X_{n}=i)={\mathbf{P}}_{i,j}[{\mathbf{z}}_{n},{\mathbf{y}}_{n}], where the transition kernel 𝐏[𝐳,𝐲]{\mathbf{P}}[{\mathbf{z}},{\mathbf{y}}] is continuous in 𝐳,𝐲{\mathbf{z}},{\mathbf{y}}, and the Markov chain generated by 𝐏[𝐳,𝐲]{\mathbf{P}}[{\mathbf{z}},{\mathbf{y}}] is ergodic so that it admits a stationary distribution 𝝅(𝐳,𝐲){\bm{\pi}}({\mathbf{z}},{\mathbf{y}}), and 𝝅(𝐳,ρ(𝐳))=𝝁{\bm{\pi}}({\mathbf{z}}^{*},\rho({\mathbf{z}}^{*}))={\bm{\mu}}.

  5. (B5)

    supn0(𝐳n+𝐲n)<\sup_{n\geq 0}(\|{\mathbf{z}}_{n}\|+\|{\mathbf{y}}_{n}\|)<\infty a.s.

Yaji & Bhatnagar (2020) included assumptions A1 - A9 and A11 for the following Theorem G.10. We briefly show the correspondence of our assumptions (B1) - (B5) and theirs: (B1) with A5, (B2) with A1 and A2, (B3) with A9 and A11, (B4) with A3 and A4, and (B5) with A8. Given that our two-timescale SA framework (106) excludes additional noises (setting them to zero), A6 and A7 therein are inherently met.

Theorem G.10 (Yaji & Bhatnagar (2020) Theorem 4).

Under Assumptions (B1) - (B5), iterations (𝐳n,𝐲n)({\mathbf{z}}_{n},{\mathbf{y}}_{n}) in (106) almost surely converge to a set of roots, i.e., (𝐳n,𝐲n)𝐳Λ(𝐳,ρ(𝐳))({\mathbf{z}}_{n},{\mathbf{y}}_{n})\to\bigcup_{{\mathbf{z}}^{*}\in\Lambda}({\mathbf{z}}^{*},\rho({\mathbf{z}}^{*})) a.s.
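As a simple illustration of the two-timescale recursion (106) and Theorem G.10, the following sketch uses i.i.d. sampling (a degenerate case of the ergodic chain in (B4)), G_{2}({\mathbf{z}},{\mathbf{y}},X)=c_{X}-{\mathbf{y}} so that \rho({\mathbf{z}})\equiv\sum_{i}\mu_{i}c_{i} on the fast timescale, and G_{1}({\mathbf{z}},{\mathbf{y}},X)={\mathbf{y}}-{\mathbf{z}} so that \hat{g}_{1}({\mathbf{z}})=\rho({\mathbf{z}})-{\mathbf{z}}; step sizes follow (B1). All choices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([1.0, 2.0, 4.0])
mu = np.array([0.2, 0.3, 0.5])
z_star = mu @ c                            # root: g1_hat(z) = mu @ c - z = 0

a, b = 0.6, 0.9                            # gamma_n = n^{-a} (fast), beta_n = n^{-b} (slow), 0.5 < a < b <= 1
n_iter = 200_000
X = rng.choice(3, size=n_iter, p=mu)       # i.i.d. sampling; an iterate-dependent chain would replace this
z, y = 0.0, 0.0
for n in range(1, n_iter + 1):
    beta_n, gamma_n = n ** -b, n ** -a
    z += beta_n * (y - z)                  # slow iterate (106a) with G1(z, y, X) = y - z
    y += gamma_n * (c[X[n - 1]] - y)       # fast iterate (106b) with G2(z, y, X) = c_X - y

print(z, y, z_star)                        # both z and y approach z_star = mu @ c
```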

Appendix H Additional Simulation Results

H.1 Binary Classification on Additional Datasets

In this part, we perform the binary classification task of Section 4 on additional datasets, namely a9a (with 123 features) and splice (with 60 features) from LIBSVM (Chang & Lin, 2011). Figure 4 shows the performance ordering across different \alpha values, and we empirically demonstrate that the curves with \alpha\geq 5 still outperform the i.i.d. counterpart. Additionally, Figure 5 compares cases (i) - (iii) on both the a9a and splice datasets, and case (i) consistently performs the best.

[Figure 4: Simulation results with various \alpha values in a9a and splice datasets. Panels: (a) SGD-SRRW, a9a; (b) SGD-SRRW, splice.]

[Figure 5: Performance comparison among cases (i) - (iii) for \alpha\in\{5,10,20\} in a9a and splice datasets. Panels: (a) \alpha=5, SGD-SRRW, a9a; (b) \alpha=10, SGD-SRRW, a9a; (c) \alpha=20, SGD-SRRW, a9a; (d) \alpha=5, SGD-SRRW, splice; (e) \alpha=10, SGD-SRRW, splice; (f) \alpha=20, SGD-SRRW, splice.]

H.2 Non-convex Linear Regression

We further test SGD-SRRW and SHB-SRRW algorithms with a non-convex function to demonstrate the efficiency of our SA-SRRW algorithm beyond the convex setting. In this task, we simulate the following linear regression problem in Khaled & Richtárik (2023) with non-convex regularization

min𝜽d{f(𝜽)1Ni=1Nli(𝜽)+κj=1d𝜽j2𝜽j2+1}\min_{{\bm{\theta}}\in{\mathbb{R}}^{d}}\left\{f({\bm{\theta}})\triangleq\frac{1}{N}\sum_{i=1}^{N}l_{i}({\bm{\theta}})+\kappa\sum_{j=1}^{d}\frac{{\bm{\theta}}_{j}^{2}}{{\bm{\theta}}_{j}^{2}+1}\right\} (108)

where the loss function l_{i}({\bm{\theta}})=\|{\mathbf{s}}_{i}^{T}{\bm{\theta}}-y_{i}\|^{2} and \kappa=1, with the data points \{({\mathbf{s}}_{i},y_{i})\}_{i\in{\mathcal{N}}} taken from the ijcnn1 dataset of LIBSVM (Chang & Lin, 2011). We still perform the optimization over the wikiVote graph, as done in Section 4.
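For concreteness, a minimal sketch of the objective (108) and of a plain SGD baseline on it, using synthetic data in place of the ijcnn1 samples (the dataset loading, the SHB variant, and the SRRW token walk over the wikiVote graph are omitted); only \kappa=1 is taken from the text, and all other names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 22                          # synthetic sizes; the experiments use the ijcnn1 dataset instead
S = rng.normal(size=(N, d))             # rows play the role of s_i
ytgt = rng.normal(size=N)               # targets y_i
kappa = 1.0

def objective(theta):
    """f(theta) in (108): average squared loss plus the non-convex regularizer."""
    return np.mean((S @ theta - ytgt) ** 2) + kappa * np.sum(theta**2 / (theta**2 + 1.0))

def local_grad(theta, i):
    """Gradient of l_i(theta) = (s_i^T theta - y_i)^2 plus the regularizer gradient."""
    resid = S[i] @ theta - ytgt[i]
    grad_reg = kappa * 2.0 * theta / (theta**2 + 1.0) ** 2   # d/dtheta_j of theta_j^2 / (theta_j^2 + 1)
    return 2.0 * resid * S[i] + grad_reg

# Plain SGD with uniform i.i.d. node sampling as a baseline; the SRRW-driven token walk over the
# wikiVote graph would replace the uniform choice of i below.
theta = np.zeros(d)
for n in range(1, 50_001):
    i = rng.integers(N)
    theta -= (0.01 / n**0.8) * local_grad(theta, i)
print(objective(theta))
```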

The numerical results for the non-convex linear regression task are presented in Figures 6 and 7, where each experiment is repeated 100 times. Figures 6(a) and 6(b) show that the performance ordering across different \alpha values is still preserved for both algorithms over almost all time, and the curves for \alpha\geq 5 outperform the i.i.d.-sampling curve (in black) under the graph topological constraints. Additionally, among the three cases examined at identical \alpha values, Figures 7(a) - 7(c) confirm that case (i) performs consistently better than the other two, implying that case (i) can even be the best choice for non-convex distributed optimization tasks.

[Figure 6: Simulation results for non-convex linear regression under case (i) with various \alpha values. Panels: (a) SGD-SRRW; (b) SHB-SRRW.]

[Figure 7: Performance comparison among cases (i) - (iii) for non-convex regression. Panels: (a) \alpha=5, SGD-SRRW; (b) \alpha=10, SGD-SRRW; (c) \alpha=20, SGD-SRRW.]