Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks
Abstract
We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a non-linear Markov chain, namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar $\alpha$, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves $O(1/\alpha)$ decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate $O(1/\alpha^2)$, so the performance benefit of using SRRW is amplified in the stochastic optimization context. Empirical results support our theoretical findings.
1 Introduction
Stochastic optimization algorithms solve optimization problems of the form
$$\min_{\boldsymbol{\theta} \in \mathbb{R}^d} f(\boldsymbol{\theta}) \triangleq \sum_{i \in \mathcal{N}} \mu_i F(\boldsymbol{\theta}, i), \tag{1}$$
with the objective function $F: \mathbb{R}^d \times \mathcal{N} \to \mathbb{R}$ and $X$ taking values in a finite state space $\mathcal{N}$ with distribution $\boldsymbol{\mu} \triangleq [\mu_i]_{i \in \mathcal{N}}$. Leveraging partial gradient information per iteration, these algorithms have been recognized for their scalability and efficiency with large datasets (Bottou et al., 2018; Even, 2023). For any given noise sequence $\{X_n\}_{n \ge 0}$ and step size sequence $\{\beta_n\}_{n \ge 0}$, most stochastic optimization algorithms can be classified as stochastic approximations (SA) of the form
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n + \beta_{n+1} H(\boldsymbol{\theta}_n, X_{n+1}), \tag{2}$$
where, roughly speaking, $H(\boldsymbol{\theta}_n, X_{n+1})$ contains the gradient information $\nabla F(\boldsymbol{\theta}_n, X_{n+1})$, such that the iterates converge to a point $\boldsymbol{\theta}^*$ solving $\sum_{i \in \mathcal{N}} \mu_i H(\boldsymbol{\theta}^*, i) = \mathbf{0}$. Such SA iterations include the well-known stochastic gradient descent (SGD), stochastic heavy ball (SHB) (Gadat et al., 2018; Li et al., 2022), and some SGD-type algorithms employing additional auxiliary variables (Barakat et al., 2021); further illustrations of stochastic optimization algorithms of the form (2) are deferred to Appendix A. These algorithms typically have the stochastic noise term generated by i.i.d. random variables with probability distribution $\boldsymbol{\mu}$ in each iteration. In this paper, we study a stochastic optimization algorithm where the noise sequence $\{X_n\}$ governing access to the gradient information is generated from general stochastic processes in place of i.i.d. random variables.
This is commonly the case in distributed learning, where $\{X_n\}$ is a (typically Markovian) random walk, and should asymptotically be able to sample the gradients from the desired probability distribution $\boldsymbol{\mu}$. This is equivalent to saying that the random walker's empirical distribution converges to $\boldsymbol{\mu}$ almost surely (a.s.); that is, $\frac{1}{n+1}\sum_{k=0}^{n} \boldsymbol{\delta}_{X_k} \to \boldsymbol{\mu}$ a.s. for any initial $X_0 \in \mathcal{N}$, where $\boldsymbol{\delta}_{X_k}$ is the delta measure whose $X_k$'th entry is one, the rest being zero. Such convergence is most commonly achieved by employing the Metropolis-Hastings random walk (MHRW), which can be designed to sample from any target measure $\boldsymbol{\mu}$ and implemented in a scalable manner (Sun et al., 2018). Unsurprisingly, convergence characteristics of the employed Markov chain affect those of the SA sequence (2), and appear in both finite-time and asymptotic analyses. Finite-time bounds typically involve the second largest eigenvalue in modulus (SLEM) of the Markov chain's transition kernel $\mathbf{P}$, which is critically connected to the mixing time of a Markov chain (Levin & Peres, 2017); whereas asymptotic results such as central limit theorems (CLT) involve asymptotic covariance matrices that embed information regarding the entire spectrum of $\mathbf{P}$, i.e., all eigenvalues as well as eigenvectors (Brémaud, 2013), which are key to understanding the sampling efficiency of a Markov chain. Thus, the choice of random walker can significantly impact the performance of (2), and simply ensuring that it samples from $\boldsymbol{\mu}$ asymptotically is not enough to achieve optimal algorithmic performance. In this paper, we take a closer look at the distributed stochastic optimization problem through the lens of a non-linear Markov chain, known as the Self-Repellent Random Walk (SRRW), which was shown in Doshi et al. (2023) to achieve asymptotically minimal sampling variance for large values of $\alpha$, a positive scalar controlling the strength of the random walker's self-repellence behaviour. Our proposed modification of (2) can be implemented within the settings of decentralized learning applications in a scalable manner, while also enjoying significant performance benefits over distributed stochastic optimization algorithms driven by vanilla Markov chains.
Token Algorithms for Decentralized Learning. In decentralized learning, agents like smartphones or IoT devices, each containing a subset of data, collaboratively train models on a graph by sharing information locally without a central server (McMahan et al., 2017). In this setup, agents correspond to nodes $i \in \mathcal{N}$ of a graph $\mathcal{G}(\mathcal{N}, \mathcal{E})$, and an edge $(i, j) \in \mathcal{E}$ indicates direct communication between agents $i$ and $j$. This decentralized approach offers several advantages compared to the traditional centralized learning setting, promoting data privacy and security by eliminating the need for raw data to be aggregated centrally and thus reducing the risk of data breach or misuse (Bottou et al., 2018; Nedic, 2020). Additionally, decentralized approaches are more scalable and can handle vast amounts of heterogeneous data from distributed agents without overwhelming a central server, alleviating concerns about a single point of failure (Vogels et al., 2021).
Among decentralized learning approaches, the class of 'token' algorithms can be expressed as stochastic approximation iterations of the type (2), wherein the sequence $\{X_n\}$ is realized as the sample path of a token that stochastically traverses the graph $\mathcal{G}$, carrying with it the iterate $\boldsymbol{\theta}_n$ at any time $n$ and allowing each visited node (agent) to incrementally update $\boldsymbol{\theta}_n$ using locally available gradient information. Token algorithms have gained popularity in recent years (Hu et al., 2022; Triastcyn et al., 2022; Hendrikx, 2023), and are provably more communication efficient (Even, 2023) when compared to consensus-based algorithms, another popular approach for solving distributed optimization problems (Boyd et al., 2006; Morral et al., 2017; Olshevsky, 2022). The construction of token algorithms means that they do not suffer from the expensive costs of synchronization and communication that are typical of consensus-based approaches, where all agents (or a subset of agents selected by a coordinator (Boyd et al., 2006; Wang et al., 2019)) on the graph are required to take simultaneous actions, such as communicating on the graph at each iteration. While decentralized federated learning has indeed helped mitigate the communication overhead by processing multiple SGD iterations prior to each aggregation (Lalitha et al., 2018; Ye et al., 2022; Chellapandi et al., 2023), it still cannot overcome challenges such as synchronization and straggler issues.
Self-Repellent Random Walk. As mentioned earlier, sample paths of token algorithms are usually generated using Markov chains with $\boldsymbol{\mu} \in \text{Int}(\Sigma)$ as their limiting distribution. Here, $\Sigma$ denotes the $N$-dimensional probability simplex, with $\text{Int}(\Sigma)$ representing its interior. A recent work by Doshi et al. (2023) pioneers the use of non-linear Markov chains to, in some sense, improve upon any given time-reversible Markov chain with transition kernel $\mathbf{P}$ whose stationary distribution is $\boldsymbol{\mu}$. They show that the non-linear transition kernel $\mathbf{K}[\cdot]: \text{Int}(\Sigma) \to [0, 1]^{N \times N}$ (here, non-linearity in the transition kernel implies that $\mathbf{K}[\mathbf{x}]$ takes a probability distribution $\mathbf{x}$ as the argument (Andrieu et al., 2007), as opposed to the kernel being a linear operator $\mathbf{P}$ for a constant stochastic matrix in a standard, linear Markovian setting), given by
$$K_{ij}[\mathbf{x}] \triangleq \frac{P_{ij}\left(x_j/\mu_j\right)^{-\alpha}}{\sum_{k \in \mathcal{N}} P_{ik}\left(x_k/\mu_k\right)^{-\alpha}}, \quad \forall\, i, j \in \mathcal{N}, \tag{3}$$
for any $\alpha \ge 0$, when simulated as a self-interacting random walk (Del Moral & Miclo, 2006; Del Moral & Doucet, 2010), can achieve smaller asymptotic variance than the base Markov chain when sampling over a graph $\mathcal{G}$, for all $\alpha > 0$. The argument $\mathbf{x}$ for the kernel $\mathbf{K}[\mathbf{x}]$ is taken to be the empirical distribution $\mathbf{x}_n$ at each time step $n$. For instance, if node $i$ has been visited more often than other nodes so far, the entry $x_i$ becomes larger (than the target value $\mu_i$), resulting in a smaller transition probability from $j$ to $i$ under $\mathbf{K}[\mathbf{x}]$ in (3) compared to $P_{ji}$. This ensures that a random walker prioritizes more seldom visited nodes in the process, and is thus 'self-repellent'. This effect is made more drastic by increasing $\alpha$, and leads to asymptotically near-zero sampling variance at a rate of $O(1/\alpha)$. Moreover, the polynomial function $(\cdot)^{-\alpha}$ chosen to encode self-repellent behaviour is shown in Doshi et al. (2023) to be the only one that allows the SRRW to inherit the so-called 'scale-invariance' property of the underlying Markov chain, a necessary component for the scalable implementation of a random walker over a large network without requiring knowledge of any graph-related global constants. Conclusively, such attributes render SRRW especially suitable for distributed optimization. (Recently, Guo et al. (2020) introduced an optimization scheme which designs self-repellence into the perturbation of the gradient descent iterates (Jin et al., 2017; 2018; 2021) with the goal of escaping saddle points. This notion of self-repellence is distinct from the SRRW, which is a probability kernel designed specifically for a token to sample from a target distribution over the set of nodes of an arbitrary graph.)
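To make the kernel (3) concrete, the following minimal NumPy sketch builds one step of the SRRW transition probabilities; it is our own illustration (the function name and the toy graph are ours, not from the paper):

```python
import numpy as np

def srrw_kernel(P, mu, x, alpha):
    """SRRW kernel K[x] from (3): row-normalized P_ij * (x_j / mu_j)^(-alpha)."""
    weights = (x / mu) ** (-alpha)           # over-visited nodes get down-weighted
    K = P * weights[None, :]                 # scale column j by (x_j / mu_j)^(-alpha)
    return K / K.sum(axis=1, keepdims=True)  # re-normalize each row

# Toy 3-node example with a uniform target distribution.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
mu = np.ones(3) / 3
x = np.array([0.5, 0.3, 0.2])                # node 0 over-visited so far
print(srrw_kernel(P, mu, x, alpha=2.0))      # rows place less mass on node 0
```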
Effect of Stochastic Noise - Finite time and Asymptotic Approaches. Most contemporary token algorithms driven by Markov chains are analyzed using the finite-time bounds approach for obtaining insights into their convergence rates (Sun et al., 2018; Doan et al., 2019; 2020; Triastcyn et al., 2022; Hendrikx, 2023). However, as also explained in Even (2023), in most cases these bounds are overly dependent on mixing time properties of the specific Markov chain employed therein. This makes them largely ineffective in capturing the exact contribution of the underlying random walk in a manner which is qualitative enough to be used for algorithm design; and performance enhancements are typically achieved via application of techniques such as variance reduction (Defazio et al., 2014; Schmidt et al., 2017), momentum/Nesterov’s acceleration (Gadat et al., 2018; Li et al., 2022), adaptive step size (Kingma & Ba, 2015; Reddi et al., 2018), which work by modifying the algorithm iterations themselves, and never consider potential improvements to the stochastic input itself.
Complementary to finite-time approaches, asymptotic analysis using CLT has proven to be an excellent tool to approach the design of stochastic algorithms (Hu et al., 2022; Devraj & Meyn, 2017; Morral et al., 2017; Chen et al., 2020a; Mou et al., 2020; Devraj & Meyn, 2021). Hu et al. (2022) show how asymptotic analysis can be used to compare the performance of SGD algorithms for various stochastic inputs using their notion of efficiency ordering, and, as mentioned in Devraj & Meyn (2017), the asymptotic benefits from minimizing the limiting covariance matrix are known to be a good predictor of finite-time algorithmic performance, as also observed empirically in Section 4.
From the perspective of both finite-time and asymptotic analyses of token algorithms, it is now well established that employing 'better' Markov chains can enhance the performance of stochastic optimization algorithms. For instance, Markov chains with smaller SLEMs yield tighter finite-time upper bounds (Sun et al., 2018; Ayache & El Rouayheb, 2021; Even, 2023). Similarly, Markov chains with smaller asymptotic variance for MCMC sampling tasks also provide better performance, resulting in a smaller covariance matrix of SGD algorithms (Hu et al., 2022). Therefore, with these breakthrough results via SRRW achieving near-zero sampling variance, it is within reason to ask: Can we achieve near-zero variance in distributed stochastic optimization driven by SRRW-like token algorithms on any general graph? (This near-zero sampling variance implies a significantly smaller variance than even an i.i.d. sampling counterpart, while adhering to the graph topological constraints of token algorithms.) In this paper, we answer in the affirmative.
SRRW Driven Algorithm and Analysis Approach. For any ergodic time-reversible Markov chain with transition probability matrix $\mathbf{P}$ and stationary distribution $\boldsymbol{\mu}$, we consider a general step size version of the SRRW stochastic process analysed in Doshi et al. (2023) and use it to drive the noise sequence $\{X_n\}$ in (2). Our SA-SRRW algorithm is as follows:
$$X_{n+1} \sim K_{X_n, \cdot}[\mathbf{x}_n], \tag{4a}$$
$$\mathbf{x}_{n+1} = \mathbf{x}_n + \gamma_{n+1}\left(\boldsymbol{\delta}_{X_{n+1}} - \mathbf{x}_n\right), \tag{4b}$$
$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n + \beta_{n+1} H(\boldsymbol{\theta}_n, X_{n+1}), \tag{4c}$$
where $\{\gamma_n\}$ and $\{\beta_n\}$ are step size sequences decreasing to zero, and $\mathbf{K}[\cdot]$ is the SRRW kernel in (3). Current non-asymptotic analyses require a globally Lipschitz mean field function (Chen et al., 2020b; Doan, 2021; Zeng et al., 2021; Even, 2023) and are thus inapplicable to SA-SRRW, since the mean field function of the SRRW iterates (4b) is only locally Lipschitz (details deferred to Appendix B). Instead, we successfully obtain non-trivial results by taking an asymptotic CLT-based approach to analyze (4). This goes beyond just analyzing the asymptotic sampling covariance (the sampling covariance corresponds to only the empirical distribution $\mathbf{x}_n$ in (4b)) as in Doshi et al. (2023), the result therein forming a special case of ours by setting $\gamma_n = 1/(n+1)$ and considering only (4a) and (4b), that is, in the absence of the optimization iteration (4c). Specifically, we capture the effect of SRRW's hyper-parameter $\alpha$ on the asymptotic speed of convergence of the optimization error term $\boldsymbol{\theta}_n - \boldsymbol{\theta}^*$ to zero via explicit deduction of its asymptotic covariance matrix. See Figure 1 for illustration.
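The full update loop (4a)-(4c) can be sketched as follows; this is an illustrative implementation under Assumption A2's polynomial step sizes, with `H` a user-supplied increment function (all names here are ours):

```python
import numpy as np

def sa_srrw(P, mu, H, theta0, alpha, a, b, n_steps, rng):
    """One sample path of (4a)-(4c) with gamma_n = (n+1)^(-a), beta_n = (n+1)^(-b)."""
    N = len(mu)
    x = np.ones(N) / N                         # x_0 in the interior of the simplex
    theta = np.array(theta0, dtype=float)
    X = int(rng.integers(N))                   # initial token position
    for n in range(n_steps):
        gamma, beta = (n + 2.0) ** (-a), (n + 2.0) ** (-b)  # gamma_{n+1}, beta_{n+1}
        w = P[X] * (x / mu) ** (-alpha)        # (4a): row X of the SRRW kernel K[x_n]
        X = int(rng.choice(N, p=w / w.sum()))
        e = np.zeros(N); e[X] = 1.0
        x += gamma * (e - x)                   # (4b): weighted empirical measure update
        theta += beta * H(theta, X)            # (4c): stochastic approximation update
    return theta, x
```

Here $H(\boldsymbol{\theta}, i)$ would return, e.g., the negative local gradient $-\nabla F(\boldsymbol{\theta}, i)$ for SGD, and choosing $a < b$, $a = b$, or $a > b$ selects case (i), (ii) or (iii) below.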
[Figure 1: Illustration of the SA-SRRW algorithm, where a token performing the self-repellent walk (4a)-(4b) carries and updates the optimization iterate (4c).]
Our Contributions.
1. Given any time-reversible 'base' Markov chain with transition kernel $\mathbf{P}$ and stationary distribution $\boldsymbol{\mu}$, we generalize first and second order convergence results of $\mathbf{x}_n$ to the target measure $\boldsymbol{\mu}$ (Theorems 4.1 and 4.2 in Doshi et al., 2023) to a class of weighted empirical measures, through the use of more general step sizes $\gamma_n$. This includes showing that the asymptotic sampling covariance terms decrease to zero at rate $O(1/\alpha)$, thus quantifying the effect of self-repellence on $\mathbf{x}_n$. Our generalization is not for the sake thereof and is shown in Section 3 to be crucial for the design of the step sizes $\beta_n$ and $\gamma_n$.
2. Building upon the convergence results for iterates $\mathbf{x}_n$, we analyze the algorithm (4) driven by the SRRW kernel in (3), with step sizes $\beta_n$ and $\gamma_n$ separated into three disjoint cases:

- (i) $\beta_n = o(\gamma_n)$, and we say that $\boldsymbol{\theta}_n$ is on the slower timescale compared to $\mathbf{x}_n$;
- (ii) $\beta_n = \gamma_n$, and we say that $\boldsymbol{\theta}_n$ and $\mathbf{x}_n$ are on the same timescale;
- (iii) $\gamma_n = o(\beta_n)$, and we say that $\boldsymbol{\theta}_n$ is on the faster timescale compared to $\mathbf{x}_n$.
For any $\alpha \ge 0$, let $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$, $\mathbf{V}^{(ii)}_{\boldsymbol{\theta}}(\alpha)$ and $\mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$ refer to the asymptotic covariance matrices of the optimization error in the corresponding cases (i), (ii) and (iii). We show that

$$\beta_n^{-1/2}\left(\boldsymbol{\theta}_n - \boldsymbol{\theta}^*\right) \xrightarrow[n \to \infty]{\text{dist.}} N\left(\mathbf{0}, \mathbf{V}^{(\cdot)}_{\boldsymbol{\theta}}(\alpha)\right),$$

featuring the distinct asymptotic covariance matrices $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$, $\mathbf{V}^{(ii)}_{\boldsymbol{\theta}}(\alpha)$ and $\mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$, respectively. The three matrices coincide when $\alpha = 0$ (the case $\alpha = 0$ is equivalent to simply running the base Markov chain, since from (3) we have $\mathbf{K}[\mathbf{x}] \equiv \mathbf{P}$, thus bypassing the SRRW's effect and rendering all three cases nearly the same). Moreover, the derivation of the CLT for cases (i) and (iii), for which (4) corresponds to two-timescale SA with controlled Markov noise, is the first of its kind and thus a key technical contribution in this paper, as expanded upon in Section 3.
3. For case (i), we show that $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$ decreases to zero (in the sense of Loewner ordering introduced in Section 2.1) as $\alpha$ increases, with rate $O(1/\alpha^2)$. This is especially surprising, since the asymptotic performance benefit from using the SRRW kernel with $\alpha > 0$ in (3) to drive the noise terms $\{X_n\}$ is amplified in the context of distributed learning and estimating $\boldsymbol{\theta}^*$; compared to the sampling case, for which the rate is $O(1/\alpha)$ as mentioned earlier. For case (iii), we show that $\mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$ is the same for all $\alpha \ge 0$, implying that using the SRRW in this case provides no asymptotic benefit over the original base Markov chain, and thus performs worse than case (i). In summary, we deduce that $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha) <_L \mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$ for all $\alpha > 0$. (In particular, this is the reason why we advocate for a more general step size $\gamma_n = (n+1)^{-a}$ in the SRRW iterates with $a < 1$, allowing us to choose $\beta_n = (n+1)^{-b}$ with $b > a$ to satisfy $\beta_n = o(\gamma_n)$ for case (i).)
4. We numerically simulate our SA-SRRW algorithm on various real-world datasets, focusing on a binary classification task, to evaluate its performance across all three cases. By carefully choosing the function $H$ in SA-SRRW, we test the SGD and SHB algorithms driven by SRRW. Our findings consistently highlight the superiority of case (i) over cases (ii) and (iii) for diverse $\alpha$ values, even in their finite-time performance. Notably, our tests validate the variance reduction at a rate of $O(1/\alpha^2)$ for case (i), suggesting it as the best algorithmic choice among the three cases.
2 Preliminaries and Model Setup
In Section 2.1, we first standardize the notations used throughout the paper, and define key mathematical terms and quantities used in our theoretical analyses. Then, in Section 2.2, we consolidate the model assumptions of our SA-SRRW algorithm (4). We then go on to discuss our assumptions, and provide additional interpretations of our use of generalized step-sizes.
2.1 Basic Notations and Definitions
Vectors are denoted by lower-case bold letters, e.g., $\mathbf{v} \triangleq [v_i] \in \mathbb{R}^N$, and matrices by upper-case bold, e.g., $\mathbf{M} \triangleq [M_{ij}] \in \mathbb{R}^{N \times N}$. $\mathbf{M}^{-T}$ is the transpose of the matrix inverse $\mathbf{M}^{-1}$. The diagonal matrix $\mathbf{D}_{\mathbf{v}}$ is formed by the vector $\mathbf{v}$, with $v_i$ as the $i$'th diagonal entry. Let $\mathbf{1}$ and $\mathbf{0}$ denote vectors of all ones and zeros, respectively. The identity matrix is represented by $\mathbf{I}$, with subscripts indicating dimensions as needed. A matrix is Hurwitz if all its eigenvalues possess strictly negative real parts. $\mathbb{1}_{\{\cdot\}}$ denotes an indicator function with its condition in parentheses. We use $\|\cdot\|$ to denote both the Euclidean norm of vectors and the spectral norm of matrices. Two symmetric matrices $\mathbf{A}, \mathbf{B}$ follow the Loewner ordering $\mathbf{A} <_L \mathbf{B}$ if $\mathbf{B} - \mathbf{A}$ is positive semi-definite and $\mathbf{A} \neq \mathbf{B}$. This slightly differs from the conventional definition $\mathbf{A} \le_L \mathbf{B}$, which allows $\mathbf{A} = \mathbf{B}$.
Throughout the paper, the matrix $\mathbf{P}$ and the vector $\boldsymbol{\mu}$ are used exclusively to denote an $N$-dimensional transition kernel of an ergodic Markov chain and its stationary distribution, respectively. Without loss of generality, we assume $P_{ij} > 0$ if and only if $(i, j) \in \mathcal{E}$. Markov chains satisfying the detailed balance equation, where $\mu_i P_{ij} = \mu_j P_{ji}$ for all $i, j \in \mathcal{N}$, are termed time-reversible. For such chains, we use $(\lambda_i, \mathbf{v}_i)$ (resp. $(\lambda_i, \mathbf{u}_i)$) to denote the $i$'th left (resp. right) eigenpair, where the eigenvalues are ordered: $-1 < \lambda_1 \le \cdots \le \lambda_{N-1} < \lambda_N = 1$, with $\mathbf{u}_N = \mathbf{1}$ and $\mathbf{v}_N = \boldsymbol{\mu}$. We assume the eigenvectors to be normalized such that $\mathbf{u}_i^T \mathbf{D}_{\boldsymbol{\mu}} \mathbf{u}_i = 1$ for all $i \in \mathcal{N}$, and we have $\mathbf{v}_i = \mathbf{D}_{\boldsymbol{\mu}} \mathbf{u}_i$ and $\mathbf{u}_i^T \mathbf{v}_j = 0$ for all $i \neq j$. We direct the reader to Aldous & Fill (2002, Chapter 3.4) for a detailed exposition on the spectral properties of time-reversible Markov chains.
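These spectral quantities are computable for any time-reversible chain by symmetrizing $\mathbf{P}$; a minimal sketch (our own illustration, using the normalization just described):

```python
import numpy as np

def reversible_eigenpairs(P, mu):
    """Eigenpairs of a mu-reversible P via S = D^(1/2) P D^(-1/2), D = diag(mu).
    S is symmetric iff P is reversible, so its spectrum is real."""
    d = np.sqrt(mu)
    S = P * d[None, :] / d[:, None]      # S_ij = sqrt(mu_i) P_ij / sqrt(mu_j)
    lam, W = np.linalg.eigh(S)           # real eigenvalues, orthonormal columns
    order = np.argsort(-lam)             # eigenvalue 1 comes first here
    lam, W = lam[order], W[:, order]
    right = W / d[:, None]               # right eigenvectors u_i (D_mu-orthonormal)
    left = W * d[:, None]                # left eigenvectors v_i = D_mu u_i
    return lam, left, right

# Birth-death example satisfying detailed balance with mu = (1/4, 1/2, 1/4).
P = np.array([[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]])
mu = np.array([0.25, 0.5, 0.25])
lam, left, right = reversible_eigenpairs(P, mu)
print(lam)                               # leading eigenvalue is 1
```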
2.2 SA-SRRW: Key Assumptions and Discussions
Assumptions: All results in our paper are proved under the following assumptions.
- (A1) The function $H: \mathbb{R}^d \times \mathcal{N} \to \mathbb{R}^d$ is continuous at every $\boldsymbol{\theta} \in \mathbb{R}^d$, and there exists a positive constant $L$ such that $\|H(\boldsymbol{\theta}, i)\| \le L(1 + \|\boldsymbol{\theta}\|)$ for every $\boldsymbol{\theta} \in \mathbb{R}^d$ and $i \in \mathcal{N}$.
- (A2) Step sizes $\beta_n$ and $\gamma_n$ follow $\beta_n = (n+1)^{-b}$ and $\gamma_n = (n+1)^{-a}$, where $a, b \in (1/2, 1]$.
- (A3) Roots of the function $\mathbf{h}(\boldsymbol{\theta}) \triangleq \sum_{i \in \mathcal{N}} \mu_i H(\boldsymbol{\theta}, i)$ are disjoint, and comprise the globally attracting set $\Theta \triangleq \{\boldsymbol{\theta}^* \,|\, \mathbf{h}(\boldsymbol{\theta}^*) = \mathbf{0}\}$ of the associated ordinary differential equation (ODE) for iteration (4c), given by $d\boldsymbol{\theta}(t)/dt = \mathbf{h}(\boldsymbol{\theta}(t))$.
- (A4) For any initial $(\boldsymbol{\theta}_0, \mathbf{x}_0, X_0)$, the iterate sequence $\{\boldsymbol{\theta}_n\}$ (resp. $\{\mathbf{x}_n\}$) is $\mathbb{P}$-almost surely contained within a compact subset of $\mathbb{R}^d$ (resp. $\text{Int}(\Sigma)$).
Discussions on Assumptions: Assumption A1 requires $H$ to only be locally Lipschitz, albeit with linear growth, and is less stringent than the globally Lipschitz assumption prevalent in the optimization literature (Li & Wai, 2022; Hendrikx, 2023; Even, 2023).
Assumption A2 is the general umbrella assumption under which cases (i), (ii) and (iii) mentioned in Section 1 are extracted by setting: (i) $a < b$, (ii) $a = b$, and (iii) $a > b$. Cases (i) and (iii) render $\boldsymbol{\theta}_n$ and $\mathbf{x}_n$ on different timescales; the polynomial form of $\beta_n, \gamma_n$ is widely assumed in the two-timescale SA literature (Mokkadem & Pelletier, 2006; Zeng et al., 2021; Hong et al., 2023). Case (ii) characterizes the SA-SRRW algorithm (4) as a single-timescale SA with polynomially decreasing step size, and is among the most common assumptions in the SA literature (Borkar, 2022; Fort, 2015; Li et al., 2023). In all three cases, the form of $\gamma_n$ ensures $\gamma_{n+1} \in (0, 1)$ for all $n \ge 0$, such that the SRRW iterate $\mathbf{x}_n$ in (4b) remains within $\text{Int}(\Sigma)$, ensuring that $\mathbf{K}[\mathbf{x}_n]$ is well-defined for all $n \ge 0$.
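A small sketch of the step-size schedules in Assumption A2 and how the relation between $a$ and $b$ selects the regime (the numerical exponents below are arbitrary illustrations, not the paper's experimental values):

```python
# Step-size schedules of Assumption A2: gamma_n = (n+1)^(-a), beta_n = (n+1)^(-b),
# with a, b in (0.5, 1]. The relation between a and b selects the timescale regime.
def step_sizes(n, a, b):
    gamma = (n + 1) ** (-a)   # SRRW empirical-measure step size, in (0, 1) for n >= 1
    beta = (n + 1) ** (-b)    # SA (optimization) step size
    return gamma, beta

regimes = {"case (i),   a < b": (0.8, 1.0),   # theta on the slower timescale
           "case (ii),  a = b": (1.0, 1.0),   # single timescale
           "case (iii), a > b": (1.0, 0.8)}   # theta on the faster timescale
for name, (a, b) in regimes.items():
    print(name, step_sizes(10**4, a, b))
```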
Regarding Assumption A3, the limiting dynamics of SA iterations closely follow the trajectories of their associated ODE, and assuming the existence of globally stable equilibria is standard (Borkar, 2022; Fort, 2015; Li et al., 2023). In optimization problems, this is equivalent to assuming the existence of at most countably many local minima.
Assumption A4 assumes almost sure boundedness of the iterates $\{\mathbf{x}_n\}$ and $\{\boldsymbol{\theta}_n\}$, which is a common assumption in SA algorithms (Kushner & Yin, 2003; Chen, 2006; Borkar, 2022; Karmakar & Bhatnagar, 2018; Li et al., 2023) for the stability of the SA iterations, ensuring the well-definedness of all quantities involved. Stability of the weighted empirical measure $\mathbf{x}_n$ of the SRRW process is practically ensured by studying (4b) with a truncation-based procedure (see Doshi et al., 2023, Remark 4.5 and Appendix E for a comprehensive explanation), while that of $\boldsymbol{\theta}_n$ is usually ensured either as a by-product of the algorithm design, or via mechanisms such as projections onto a compact subset of $\mathbb{R}^d$, depending on the application context.
We now provide additional discussions regarding the step-size assumptions and their implications on the SRRW iteration (4b).
SRRW with General Step Size: As shown in Benaim & Cloez (2015, Remark 1.1), albeit for a completely different non-linear Markov kernel driving the algorithm therein, iterates of (4b) can also be expressed as weighted empirical measures of , in the following form:
$$\mathbf{x}_n = \frac{\omega_0 \mathbf{x}_0 + \sum_{k=1}^{n} \omega_k \boldsymbol{\delta}_{X_k}}{\sum_{k=0}^{n} \omega_k}, \qquad \text{where } \omega_0 = 1, \;\; \omega_k = \frac{\gamma_k}{\prod_{j=1}^{k}\left(1 - \gamma_j\right)}, \tag{5}$$
for all $n \ge 0$. For the special case when $\gamma_n = 1/(n+1)$ as in Doshi et al. (2023), we have $\omega_k = 1$ for all $k \ge 0$ and $\mathbf{x}_n$ is the typical, unweighted empirical measure. For the additional case considered in our paper, when $\gamma_n = (n+1)^{-a}$ for $a \in (1/2, 1)$ as in Assumption A2, we can approximate $\prod_{j=1}^{k}(1 - \gamma_j) \approx e^{-k^{1-a}/(1-a)}$ and thus $\omega_k \approx k^{-a} e^{k^{1-a}/(1-a)}$. This implies that $\omega_k$ will increase at a sub-exponential rate, giving more weight to recent visit counts and allowing $\mathbf{x}_n$ to quickly 'forget' the poor initial measure and shed the correlation with the initial choice of $\mathbf{x}_0$. This 'speed up' effect from setting $a < 1$ is guaranteed in case (i) irrespective of the choice of $b$ in Assumption A2, and in Section 3 we show how this can lead to further reduction in the covariance of the optimization error in the asymptotic regime.
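The equivalence between the recursion (4b) and the weighted-measure form (5) can be verified numerically; a minimal sketch with an arbitrary visit sequence (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, a = 5, 200, 0.8
gamma = (np.arange(1, n + 1) + 1.0) ** (-a)    # gamma_k = (k+1)^(-a), k = 1..n
X = rng.integers(N, size=n + 1)                # arbitrary visit sequence X_1..X_n

# Recursion (4b): x_k = x_{k-1} + gamma_k * (delta_{X_k} - x_{k-1}).
x = np.ones(N) / N
x0 = x.copy()
for k in range(1, n + 1):
    e = np.zeros(N); e[X[k]] = 1.0
    x += gamma[k - 1] * (e - x)

# Weighted form (5): omega_0 = 1, omega_k = gamma_k / prod_{j<=k} (1 - gamma_j).
omega, prod = [1.0], 1.0
for k in range(1, n + 1):
    prod *= 1.0 - gamma[k - 1]
    omega.append(gamma[k - 1] / prod)
omega = np.array(omega)
x_w = omega[0] * x0 + sum(omega[k] * np.eye(N)[X[k]] for k in range(1, n + 1))
x_w /= omega.sum()

print(np.max(np.abs(x - x_w)))                 # agreement up to floating-point error
```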
Additional assumption for case (iii): Before moving on to Section 3, we take another look at the case when $\gamma_n = o(\beta_n)$, and replace A3 with the following, stronger assumption only for case (iii).
- (A3′) For any $\mathbf{x} \in \text{Int}(\Sigma)$, there exists a function $\boldsymbol{\rho}(\mathbf{x})$ such that $\sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] H(\boldsymbol{\rho}(\mathbf{x}), i) = \mathbf{0}$ with $\|\boldsymbol{\rho}(\mathbf{x})\| \le L'(1 + \|\mathbf{x}\|)$ for some $L' > 0$, and the Jacobian $\nabla_{\boldsymbol{\theta}} \left(\sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] H(\boldsymbol{\theta}, i)\right)\big|_{\boldsymbol{\theta} = \boldsymbol{\rho}(\mathbf{x})}$ is Hurwitz, where $\boldsymbol{\pi}[\mathbf{x}]$ denotes the stationary distribution of the SRRW kernel $\mathbf{K}[\mathbf{x}]$ (see Section 3).
While Assumption A3′ for case (iii) is much stronger than A3, it is not detrimental to the overall results of our paper, since case (i) is of far greater interest as impressed upon in Section 1. This is discussed further in Appendix C.
3 Asymptotic Analysis of the SA-SRRW Algorithm
In this section, we provide the main results for the SA-SRRW algorithm (4). We first present the a.s. convergence and the CLT result for the SRRW with generalized step size, extending the results in Doshi et al. (2023). Building upon this, we present the a.s. convergence and the CLT result for the SA iterate $\boldsymbol{\theta}_n$ under different settings of step sizes. We then shift our focus to the analysis of the different asymptotic covariance matrices emerging out of the CLT result, and capture the effect of $\alpha$ and the step sizes, particularly in cases (i) and (iii), on the limiting covariance via performance ordering.
Almost Sure Convergence and CLT: The following result establishes first and second order convergence of the sequence $\{\mathbf{x}_n\}$, which represents the weighted empirical measure of the SRRW process $\{X_n\}$, based on the update rule in (4b).
Lemma 3.1.
Lemma 3.1 shows that the SRRW iterate $\mathbf{x}_n$ converges to the target distribution $\boldsymbol{\mu}$ a.s. even under the general step size $\gamma_n = (n+1)^{-a}$ for $a \in (1/2, 1]$. We also observe that the asymptotic covariance matrix decreases at rate $O(1/\alpha)$. Lemma 3.1 aligns with Doshi et al. (2023, Theorem 4.2 and Corollary 4.3) for the special case $\gamma_n = 1/(n+1)$, and is therefore more general. Critically, it helps us establish our next result regarding the first-order convergence of the optimization iterate sequence $\{\boldsymbol{\theta}_n\}$ following update rule (4c), as well as its second-order convergence result, which follows shortly after. The proofs of Lemma 3.1 and our next result, Theorem 3.2, are deferred to Appendix D. In what follows, $a < b$, $a = b$ and $a > b$ refer to cases (i), (ii), and (iii) in Section 2.2, respectively. All subsequent results are proven under Assumptions A1 to A4, with A3′ replacing A3 only when the step sizes satisfy case (iii).
Theorem 3.2.
In all cases (i)-(iii), and for any initial $(\boldsymbol{\theta}_0, \mathbf{x}_0, X_0)$, we have $\boldsymbol{\theta}_n \to \boldsymbol{\theta}^*$ as $n \to \infty$ for some $\boldsymbol{\theta}^* \in \Theta$, $\mathbb{P}$-almost surely.
In the stochastic optimization context, the above result ensures convergence of the iterates $\boldsymbol{\theta}_n$ to a local minimizer $\boldsymbol{\theta}^* \in \Theta$. Loosely speaking, the first-order convergence of $\mathbf{x}_n$ in Lemma 3.1, as well as that of $\boldsymbol{\theta}_n$, is closely related to the convergence of trajectories of the (coupled) mean-field ODE, written in matrix-vector form as
$$\frac{d}{dt}\begin{pmatrix} \mathbf{x}(t) \\ \boldsymbol{\theta}(t) \end{pmatrix} = \begin{pmatrix} \boldsymbol{\pi}[\mathbf{x}(t)] - \mathbf{x}(t) \\ \mathbf{H}(\boldsymbol{\theta}(t))^T \boldsymbol{\pi}[\mathbf{x}(t)] \end{pmatrix}, \tag{7}$$
where the matrix $\mathbf{H}(\boldsymbol{\theta}) \triangleq [H(\boldsymbol{\theta}, 1), \cdots, H(\boldsymbol{\theta}, N)]^T \in \mathbb{R}^{N \times d}$ for any $\boldsymbol{\theta} \in \mathbb{R}^d$. Here, $\boldsymbol{\pi}[\mathbf{x}]$ is the stationary distribution of the SRRW kernel $\mathbf{K}[\mathbf{x}]$ and is shown in Doshi et al. (2023) to be given by $\pi_i[\mathbf{x}] \propto \mu_i (x_i/\mu_i)^{-\alpha} \sum_{j \in \mathcal{N}} P_{ij} (x_j/\mu_j)^{-\alpha}$. The Jacobian matrix of (7), when evaluated at the equilibria $(\boldsymbol{\mu}, \boldsymbol{\theta}^*)$ for $\boldsymbol{\theta}^* \in \Theta$, captures the behaviour of solutions of the mean field in their vicinity, and plays an important role in the asymptotic covariance matrices arising out of our CLT results. We evaluate this Jacobian matrix as a function of $\alpha$ to be given by
$$J(\alpha) \triangleq \begin{pmatrix} J_{11}(\alpha) & J_{12} \\ J_{21}(\alpha) & J_{22} \end{pmatrix} = \begin{pmatrix} -\alpha\left(\mathbf{P}^T + \mathbf{I} - 2\boldsymbol{\mu}\mathbf{1}^T\right) - \mathbf{I} & \mathbf{0} \\ -\alpha \mathbf{H}(\boldsymbol{\theta}^*)^T \left(\mathbf{P}^T + \mathbf{I} - 2\boldsymbol{\mu}\mathbf{1}^T\right) & \nabla_{\boldsymbol{\theta}} \mathbf{h}(\boldsymbol{\theta}^*) \end{pmatrix}. \tag{8}$$
The derivation of $J(\alpha)$ is deferred to Appendix E.1. (The Jacobian $J(\alpha)$ is $(N + d) \times (N + d)$-dimensional, with $J_{11}(\alpha) \in \mathbb{R}^{N \times N}$ and $J_{22} \in \mathbb{R}^{d \times d}$. Following this, all matrices written in a block form, such as the matrix $\mathbf{U}$ in (9), will inherit the same dimensional structure.) Here, $J_{12}$ is a zero matrix since the dynamics of $\mathbf{x}(t)$ in (7) are devoid of $\boldsymbol{\theta}$. While the matrix $J_{11}(\alpha)$ is exactly of the form in Doshi et al. (2023, Lemma 3.4) used to characterize the SRRW performance, our analysis includes an additional matrix $J_{21}(\alpha)$, which captures the effect of $\mathbf{x}$ on $\boldsymbol{\theta}$ in the ODE (7); this translates to the influence of our generalized SRRW empirical measure on the SA iterates in (4).
For notational simplicity, and without loss of generality, all our remaining results are stated while conditioning on the event $\{\boldsymbol{\theta}_n \to \boldsymbol{\theta}^*\}$, for some $\boldsymbol{\theta}^* \in \Theta$. We also adopt the shorthand $J_{11}$ to represent $J_{11}(\alpha)$, and similarly for the other blocks of (8). Our main CLT result is as follows, with its proof deferred to Appendix E.
Theorem 3.3.
For cases (i) and (iii), SA-SRRW in (4) is a two-timescale SA with controlled Markov noise. While a few works study the CLT of two-timescale SA with the stochastic input being a martingale-difference (i.i.d.) noise (Konda & Tsitsiklis, 2004; Mokkadem & Pelletier, 2006), a CLT result covering the case of controlled Markov noise (e.g., $X_n$ driven by the iterate-dependent kernel $\mathbf{K}[\mathbf{x}_n]$), a far more general setting than martingale-difference noise, is still an open problem. Thus, we here prove our CLT for $\boldsymbol{\theta}_n$ from scratch by a series of careful decompositions of the Markovian noise, ultimately into a martingale-difference term and several non-vanishing noise terms, through repeated application of the Poisson equation (Benveniste et al., 2012; Fort, 2015). Although the form of the resulting asymptotic covariance looks similar to that for the martingale-difference case in (Konda & Tsitsiklis, 2004; Mokkadem & Pelletier, 2006) at first glance, they are not equivalent. Specifically, ours captures both the effect of the SRRW hyper-parameter $\alpha$, as well as that of the underlying base Markov chain via the eigenpairs $(\lambda_i, \mathbf{u}_i)$ of its transition probability matrix $\mathbf{P}$, whereas the latter only covers the martingale-difference noise terms as a special case.
When $a = b$, that is, $\beta_n = \gamma_n$, algorithm (4) can be regarded as a single-timescale SA algorithm. In this case, we utilize the CLT in Fort (2015, Theorem 2.1) to obtain the implicit form of $\mathbf{V}^{(ii)}_{\boldsymbol{\theta}}(\alpha)$ as shown in Theorem 3.3. However, $J_{21}(\alpha)$ being non-zero for $\alpha > 0$ restricts us from obtaining an explicit form for the covariance term corresponding to the SA iterate errors $\boldsymbol{\theta}_n - \boldsymbol{\theta}^*$. On the other hand, for cases (i) and (iii), the nature of the two-timescale structure causes $\mathbf{x}_n$ and $\boldsymbol{\theta}_n$ to become asymptotically independent, with zero correlation terms inside the covariance matrix in (10), and we can explicitly deduce $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$ and $\mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$. We now take a deeper dive into $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$ and $\mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$ and study the effect of $\alpha$ on them.
Covariance Ordering of SA-SRRW: We refer the reader to Appendix F for proofs of all remaining results. We begin by focusing on case (i) and capturing the impact of $\alpha$ on $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$.
Proposition 3.4.
For all $\alpha_2 > \alpha_1 \ge 0$, we have $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha_2) <_L \mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha_1)$. Furthermore, $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$ decreases to zero at a rate of $O(1/\alpha^2)$.
Proposition 3.4 proves a monotonic reduction (in terms of Loewner ordering) of $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$ as $\alpha$ increases. Moreover, the $O(1/\alpha^2)$ decrease rate surpasses the $O(1/\alpha)$ rate seen in the sampling covariance of Lemma 3.1 and in the sampling application in Doshi et al. (2023, Corollary 4.7), and is also empirically observed in our simulations in Section 4. (Further insight into $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$ is tied to the two-timescale structure, particularly in case (i), which places $\boldsymbol{\theta}_n$ on the slow timescale so that the correlation terms in the Jacobian matrix $J(\alpha)$ in (8) come into play. Technical details are referred to Appendix E.2, where we show the form of $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$.) Suppose we consider the same SA now driven by an i.i.d. sequence with the same marginal distribution $\boldsymbol{\mu}$. Then, our Proposition 3.4 asserts that a token algorithm employing SRRW (a walk on a graph) with large enough $\alpha$ on a general graph can actually produce better SA iterates, with asymptotic covariance going down to zero, than a 'hypothetical situation' where the walker is able to access any node $i$ with probability $\mu_i$ from anywhere in one step (more like a random jumper). This can be seen by noting that for large time $n$, the scaled MSE $\mathbb{E}[\|\boldsymbol{\theta}_n - \boldsymbol{\theta}^*\|^2]/\beta_n$ is composed of the diagonal entries of the covariance matrix $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha)$, which, as we discuss in detail in Appendix F.2, are decreasing in $\alpha$ as a consequence of the Loewner ordering in Proposition 3.4. For large enough $\alpha$, the scaled MSE for SA-SRRW becomes smaller than its i.i.d. counterpart, which is always a constant. Although Doshi et al. (2023) alluded to this for sampling applications with the $O(1/\alpha)$ rate, we broaden its horizons to the distributed optimization problem with the $O(1/\alpha^2)$ rate using tokens on graphs. Our subsequent result concerns the performance comparison between cases (i) and (iii).
Corollary 3.5.
For any $\alpha > 0$, we have $\mathbf{V}^{(i)}_{\boldsymbol{\theta}}(\alpha) <_L \mathbf{V}^{(iii)}_{\boldsymbol{\theta}}(\alpha)$.
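Loewner ordering statements such as Proposition 3.4 and Corollary 3.5 can be checked numerically for concrete covariance matrices; a minimal sketch (the matrix values below are hypothetical stand-ins, not computed from the theory):

```python
import numpy as np

def loewner_less(A, B, tol=1e-12):
    """A <_L B per Section 2.1: B - A positive semi-definite and A != B."""
    eigs = np.linalg.eigvalsh(B - A)       # symmetric input -> real eigenvalues
    return eigs.min() >= -tol and not np.allclose(A, B)

V_case_i = np.diag([0.2, 0.5])             # stand-in for V_theta in case (i)
V_case_iii = np.diag([0.4, 0.5])           # stand-in for V_theta in case (iii)
print(loewner_less(V_case_i, V_case_iii))  # True: case-(i) covariance is smaller
```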
4 Simulation
In this section, we simulate our SA-SRRW algorithm on the wikiVote graph (Leskovec & Krevl, 2014), comprising $889$ nodes and $2914$ edges. We configure the SRRW's base Markov chain as the MHRW with a uniform target distribution $\boldsymbol{\mu} = \frac{1}{N}\mathbf{1}$. For distributed optimization, we consider the following regularized binary classification problem:
$$\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \; \frac{1}{N} \sum_{i \in \mathcal{N}} \log\left(1 + e^{-y_i \boldsymbol{\theta}^T \mathbf{s}_i}\right) + \frac{\kappa}{2} \|\boldsymbol{\theta}\|^2, \tag{12}$$
where $\{(\mathbf{s}_i, y_i)\}_{i \in \mathcal{N}}$ is the ijcnn1 dataset (with $22$ features, i.e., $\mathbf{s}_i \in \mathbb{R}^{22}$, $y_i \in \{-1, +1\}$) from LIBSVM (Chang & Lin, 2011), and $\kappa$ is the penalty parameter. Each node in the wikiVote graph is assigned one data point, thus $889$ data points in total. We perform SRRW driven SGD (SGD-SRRW) and SRRW driven stochastic heavy ball (SHB-SRRW) algorithms (see (13) in Appendix A for its algorithm). We fix the step size $\beta_n = (n+1)^{-b}$ for the SA iterates and adjust $\gamma_n = (n+1)^{-a}$ in the SRRW iterates to cover all three cases discussed in this paper: (i) $a < b$; (ii) $a = b$; (iii) $a > b$. We use the mean square error (MSE), i.e., $\mathbb{E}[\|\boldsymbol{\theta}_n - \boldsymbol{\theta}^*\|^2]$, to measure the error of the SA iterates.
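For concreteness, a minimal sketch of the per-node gradient used in an SGD-SRRW step for an $\ell_2$-regularized logistic loss of the form (12); the function name and the exact loss convention ($y_i \in \{-1, +1\}$) are our assumptions:

```python
import numpy as np

def local_grad(theta, s_i, y_i, kappa):
    """Gradient at one node of the l2-regularized logistic loss
    log(1 + exp(-y_i * theta^T s_i)) + (kappa / 2) * ||theta||^2."""
    z = y_i * (s_i @ theta)
    return (-y_i / (1.0 + np.exp(z))) * s_i + kappa * theta

# One SGD-SRRW step at the node X_n currently holding the token would be
# theta = theta - beta * local_grad(theta, s[X_n], y[X_n], kappa).
theta, s_i, y_i = np.zeros(3), np.array([1.0, -2.0, 0.5]), 1
print(local_grad(theta, s_i, y_i, kappa=1e-2))   # [-0.5, 1.0, -0.25]
```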
Our results are presented in Figures 2 and 3, where each experiment is repeated over multiple independent runs. Figures 2(a) and 2(b), based on the wikiVote graph, highlight the consistent performance ordering across different $\alpha$ values for both algorithms over almost all time (not just asymptotically). Notably, the curves for larger $\alpha$ outperform that of i.i.d. sampling (in black) even under the graph constraints. Figure 2(c), on the smaller Dolphins graph (Rossi & Ahmed, 2015) with $62$ nodes and $159$ edges, illustrates that the $(\alpha, \text{MSE})$ points arising from SGD-SRRW at a fixed large time align with a curve of the form $O(1/\alpha^2)$, showcasing the $O(1/\alpha^2)$ rate. This smaller graph allows for longer simulations to observe the asymptotic behaviour. Additionally, among the three cases examined at identical $\alpha$ values, Figures 3(a) - 3(c) confirm that case (i) performs consistently better than the rest, underscoring its superiority in practice. Further results, including those from non-convex functions and additional datasets, are deferred to Appendix H due to space constraints.
[Figure 2: MSE of SGD-SRRW and SHB-SRRW for various $\alpha$ on the wikiVote graph (panels a, b), and MSE versus $\alpha$ at a fixed large time on the Dolphins graph (panel c).]
[Figure 3: Comparison of cases (i) - (iii) at identical $\alpha$ values (panels a - c).]
5 Conclusion
In this paper, we show both theoretically and empirically that the SRRW as a drop-in replacement for Markov chains can provide significant performance improvements when used for token algorithms, where the acceleration comes purely from the careful analysis of the stochastic input of the algorithm, without changing the optimization iteration itself. Our paper is an instance where the asymptotic analysis approach allows the design of better algorithms despite the usage of unconventional noise sequences such as nonlinear Markov chains like the SRRW, for which traditional finite-time analytical approaches fall short, thus advocating their wider adoption.
References
- Aldous & Fill (2002) David Aldous and James Allen Fill. Reversible markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/~aldous/RWG/book.html.
- Andrieu et al. (2007) Christophe Andrieu, Ajay Jasra, Arnaud Doucet, and Pierre Del Moral. Non-linear markov chain monte carlo. In Esaim: Proceedings, volume 19, pp. 79–84. EDP Sciences, 2007.
- Ayache & El Rouayheb (2021) Ghadir Ayache and Salim El Rouayheb. Private weighted random walk stochastic gradient descent. IEEE Journal on Selected Areas in Information Theory, 2(1):452–463, 2021.
- Barakat & Bianchi (2021) Anas Barakat and Pascal Bianchi. Convergence and dynamical behavior of the adam algorithm for nonconvex stochastic optimization. SIAM Journal on Optimization, 31(1):244–274, 2021.
- Barakat et al. (2021) Anas Barakat, Pascal Bianchi, Walid Hachem, and Sholom Schechtman. Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance. Electronic Journal of Statistics, 15(2):3892–3947, 2021.
- Benaim & Cloez (2015) M. Benaim and Bertrand Cloez. A stochastic approximation approach to quasi-stationary distributions on finite spaces. Electronic Communications in Probability, 20(37):1–14, 2015.
- Benveniste et al. (2012) Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.
- Borkar (2022) V.S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint: Second Edition. Texts and Readings in Mathematics. Hindustan Book Agency, 2022. ISBN 9788195196111.
- Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.
- Boyd et al. (2006) Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
- Brémaud (2013) Pierre Brémaud. Markov chains: Gibbs fields, Monte Carlo simulation, and queues, volume 31. Springer Science & Business Media, 2013.
- Chang & Lin (2011) Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
- Chellaboina & Haddad (2008) VijaySekhar Chellaboina and Wassim M Haddad. Nonlinear dynamical systems and control: A Lyapunov-based approach. Princeton University Press, 2008.
- Chellapandi et al. (2023) Vishnu Pandi Chellapandi, Antesh Upadhyay, Abolfazl Hashemi, and Stanislaw H Zak. On the convergence of decentralized federated learning under imperfect information sharing. arXiv preprint arXiv:2303.10695, 2023.
- Chen (2006) Han-Fu Chen. Stochastic approximation and its applications, volume 64. Springer Science & Business Media, 2006.
- Chen et al. (2020a) Shuhang Chen, Adithya Devraj, Ana Busic, and Sean Meyn. Explicit mean-square error bounds for monte-carlo and linear stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp. 4173–4183. PMLR, 2020a.
- Chen et al. (2020b) Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874, 2020b.
- Chen et al. (2022) Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
- Davis (1970) Burgess Davis. On the integrability of the martingale square function. Israel Journal of Mathematics, 8:187–190, 1970.
- Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, volume 1, 2014.
- Del Moral & Doucet (2010) Pierre Del Moral and Arnaud Doucet. Interacting markov chain monte carlo methods for solving nonlinear measure-valued equations. The Annals of Applied Probability, 20(2):593–639, 2010.
- Del Moral & Miclo (2006) Pierre Del Moral and Laurent Miclo. Self-interacting markov chains. Stochastic Analysis and Applications, 24(3):615–660, 2006.
- Delyon (2000) Bernard Delyon. Stochastic approximation with decreasing gain: Convergence and asymptotic theory. Technical report, Université de Rennes, 2000.
- Delyon et al. (1999) Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the em algorithm. Annals of statistics, pp. 94–128, 1999.
- Devraj & Meyn (2017) Adithya M Devraj and Sean P Meyn. Zap q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2232–2241, 2017.
- Devraj & Meyn (2021) Adithya M. Devraj and Sean P. Meyn. Q-learning with uniformly bounded variance. IEEE Transactions on Automatic Control, 2021.
- Doan et al. (2019) Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed td (0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1626–1635. PMLR, 2019.
- Doan (2021) Thinh T Doan. Finite-time convergence rates of nonlinear two-time-scale stochastic approximation under markovian noise. arXiv preprint arXiv:2104.01627, 2021.
- Doan et al. (2020) Thinh T Doan, Lam M Nguyen, Nhan H Pham, and Justin Romberg. Convergence rates of accelerated markov gradient descent with applications in reinforcement learning. arXiv preprint arXiv:2002.02873, 2020.
- Doshi et al. (2023) Vishwaraj Doshi, Jie Hu, and Do Young Eun. Self-repellent random walks on general graphs–achieving minimal sampling variance via nonlinear markov chains. In International Conference on Machine Learning. PMLR, 2023.
- Duflo (1996) Marie Duflo. Algorithmes stochastiques, volume 23. Springer, 1996.
- Even (2023) Mathieu Even. Stochastic gradient descent under markovian sampling schemes. In International Conference on Machine Learning, 2023.
- Fort (2015) Gersende Fort. Central limit theorems for stochastic approximation with controlled markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80, 2015.
- Gadat et al. (2018) Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics, 12:461–529, 2018.
- Guo et al. (2020) Xin Guo, Jiequn Han, Mahan Tajrobehkar, and Wenpin Tang. Escaping saddle points efficiently with occupation-time-adapted perturbations. arXiv preprint arXiv:2005.04507, 2020.
- Hall et al. (2014) P. Hall, C.C. Heyde, Z.W. Birnbaum, and E. Lukacs. Martingale Limit Theory and Its Application. Communication and Behavior. Elsevier Science, 2014.
- Hendrikx (2023) Hadrien Hendrikx. A principled framework for the design and analysis of token algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 470–489. PMLR, 2023.
- Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
- Hu et al. (2022) Jie Hu, Vishwaraj Doshi, and Do Young Eun. Efficiency ordering of stochastic gradient descent. In Advances in Neural Information Processing Systems, 2022.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pp. 1724–1732. PMLR, 2017.
- Jin et al. (2018) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pp. 1042–1085. PMLR, 2018.
- Jin et al. (2021) Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM (JACM), 68(2):1–29, 2021.
- Karimi et al. (2019) Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pp. 1944–1974. PMLR, 2019.
- Karmakar & Bhatnagar (2018) Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with controlled markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1):130–151, 2018.
- Khaled & Richtárik (2023) Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
- Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Konda & Tsitsiklis (2004) Vijay R Konda and John N Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.
- Kushner & Yin (2003) Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
- Lalitha et al. (2018) Anusha Lalitha, Shubhanshu Shekhar, Tara Javidi, and Farinaz Koushanfar. Fully decentralized federated learning. In Advances in neural information processing systems, 2018.
- Leskovec & Krevl (2014) Jure Leskovec and Andrej Krevl. Snap datasets: Stanford large network dataset collection, 2014.
- Levin & Peres (2017) David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- Li & Wai (2022) Qiang Li and Hoi-To Wai. State dependent performative prediction with stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3164–3186. PMLR, 2022.
- Li et al. (2022) Tiejun Li, Tiannan Xiao, and Guoguo Yang. Revisiting the central limit theorems for the sgd-type methods. arXiv preprint arXiv:2207.11755, 2022.
- Li et al. (2023) Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with markovian data. arXiv preprint arXiv:2302.07690, 2023.
- McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR, 2017.
- Meyn (2022) Sean Meyn. Control systems and reinforcement learning. Cambridge University Press, 2022.
- Mokkadem & Pelletier (2005) Abdelkader Mokkadem and Mariane Pelletier. The compact law of the iterated logarithm for multivariate stochastic approximation algorithms. Stochastic analysis and applications, 23(1):181–203, 2005.
- Mokkadem & Pelletier (2006) Abdelkader Mokkadem and Mariane Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.
- Morral et al. (2017) Gemma Morral, Pascal Bianchi, and Gersende Fort. Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
- Mou et al. (2020) Wenlong Mou, Chris Junchi Li, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. On linear stochastic approximation: Fine-grained polyak-ruppert and non-asymptotic concentration. In Conference on Learning Theory, pp. 2947–2997. PMLR, 2020.
- Nedic (2020) Angelia Nedic. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020.
- Olshevsky (2022) Alex Olshevsky. Asymptotic network independence and step-size for a distributed subgradient method. Journal of Machine Learning Research, 23(69):1–32, 2022.
- Pelletier (1998) Mariane Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic processes and their applications, 78(2):217–244, 1998.
- Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
- Rossi & Ahmed (2015) Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
- Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017.
- Sun et al. (2018) Tao Sun, Yuejiao Sun, and Wotao Yin. On markov chain gradient descent. In Advances in neural information processing systems, volume 31, 2018.
- Triastcyn et al. (2022) Aleksei Triastcyn, Matthias Reisser, and Christos Louizos. Decentralized learning with random walks and communication-efficient adaptive optimization. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
- Vogels et al. (2021) Thijs Vogels, Lie He, Anastasiia Koloskova, Sai Praneeth Karimireddy, Tao Lin, Sebastian U Stich, and Martin Jaggi. Relaysum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems, volume 34, pp. 28004–28015, 2021.
- Wang et al. (2019) Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. Matcha: Speeding up decentralized sgd via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), pp. 299–300. IEEE, 2019.
- Yaji & Bhatnagar (2020) Vinayaka G Yaji and Shalabh Bhatnagar. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent markov noise. Mathematics of Operations Research, 45(4):1405–1444, 2020.
- Ye et al. (2022) Hao Ye, Le Liang, and Geoffrey Ye Li. Decentralized federated learning with unreliable communications. IEEE Journal of Selected Topics in Signal Processing, 16(3):487–500, 2022.
- Zeng et al. (2021) Sihan Zeng, Thinh T Doan, and Justin Romberg. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. arXiv preprint arXiv:2109.14756, 2021.
Appendix A Examples of Stochastic Algorithms of the form (2).
In the stochastic optimization literature, many SGD variants have been proposed by introducing an auxiliary variable to improve convergence. In what follows, we present two SGD variants with decreasing step size that can be presented in the form of (2): SHB (Gadat et al., 2018; Li et al., 2022) and a momentum-based algorithm (Barakat et al., 2021; Barakat & Bianchi, 2021).
$$\text{(a) SHB:} \quad \boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \beta_{n+1} \mathbf{m}_n, \qquad \mathbf{m}_{n+1} = \mathbf{m}_n + \beta_{n+1}\left(\nabla F(\boldsymbol{\theta}_n, X_{n+1}) - \mathbf{m}_n\right);$$
$$\text{(b) momentum:} \quad \boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \beta_{n+1} \frac{\mathbf{m}_n}{\sqrt{\mathbf{v}_n}}, \quad \mathbf{m}_{n+1} = \mathbf{m}_n + \beta_{n+1}\left(\nabla F(\boldsymbol{\theta}_n, X_{n+1}) - \mathbf{m}_n\right), \quad \mathbf{v}_{n+1} = \mathbf{v}_n + \beta_{n+1}\left(\nabla F(\boldsymbol{\theta}_n, X_{n+1})^2 - \mathbf{v}_n\right), \tag{13}$$
where $\mathbf{m}_n, \mathbf{v}_n \in \mathbb{R}^d$ are auxiliary variables, and the square and square root in (13)(b) are element-wise operators. (For ease of exposition, we simplify the original SHB and momentum-based algorithms from Gadat et al. (2018); Li et al. (2022); Barakat et al. (2021); Barakat & Bianchi (2021), setting all tunable parameters to $1$, resulting in (13).)
For SHB, we introduce an augmented variable $\mathbf{z}_n \triangleq (\boldsymbol{\theta}_n^T, \mathbf{m}_n^T)^T$ and a function $H$ defined as follows:

$$H(\mathbf{z}, i) \triangleq \begin{pmatrix} -\mathbf{m} \\ \nabla F(\boldsymbol{\theta}, i) - \mathbf{m} \end{pmatrix}, \quad \text{for } \mathbf{z} = (\boldsymbol{\theta}^T, \mathbf{m}^T)^T.$$

For the general momentum-based algorithm, we define $\mathbf{z}_n \triangleq (\boldsymbol{\theta}_n^T, \mathbf{m}_n^T, \mathbf{v}_n^T)^T$ and

$$H(\mathbf{z}, i) \triangleq \begin{pmatrix} -\mathbf{m}/\sqrt{\mathbf{v}} \\ \nabla F(\boldsymbol{\theta}, i) - \mathbf{m} \\ \nabla F(\boldsymbol{\theta}, i)^2 - \mathbf{v} \end{pmatrix}, \quad \text{for } \mathbf{z} = (\boldsymbol{\theta}^T, \mathbf{m}^T, \mathbf{v}^T)^T.$$
Thus, we can reformulate both algorithms in (13) as $\mathbf{z}_{n+1} = \mathbf{z}_n + \beta_{n+1} H(\mathbf{z}_n, X_{n+1})$. This augmentation approach was previously adopted in (Gadat et al., 2018; Barakat et al., 2021; Barakat & Bianchi, 2021; Li et al., 2022) to analyze the asymptotic performance of the algorithms in (13) using an i.i.d. sequence $\{X_n\}$. Consequently, the general SA iteration (2) includes these SGD variants. However, we mainly focus on the CLT for the general SA driven by SRRW in this paper. Pursuing the explicit CLT results of these SGD variants, with the specific form of the function $H$, driven by the SRRW sequence is beyond the scope of this paper.
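A minimal sketch of the SHB augmentation just described, under the simplification of the footnote above (all tunable parameters set to $1$); the function names are ours and the recursion is a plausible simplified form, not necessarily the paper's display (13) verbatim:

```python
import numpy as np

def shb_H(z, grad_i):
    """Augmented increment H(z, i) for the simplified SHB in (13)(a), with
    z = (theta, m) stacked; grad_i(theta) plays the role of grad F(theta, i)."""
    d = len(z) // 2
    theta, m = z[:d], z[d:]
    return np.concatenate([-m, grad_i(theta) - m])

# With this H, SHB is exactly z_{n+1} = z_n + beta_{n+1} H(z_n, X_{n+1}),
# i.e., an instance of the general SA form (2).
grad = lambda th: th - 1.0          # gradient of 0.5 * ||theta - 1||^2
z = np.zeros(4)                     # theta = (0, 0), m = (0, 0)
print(shb_H(z, grad))               # -> (-m, grad F - m) = (0, 0, -1, -1)
```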
Appendix B Discussion on Mean Field Function of SRRW Iterates (4b)
Non-asymptotic analyses have seen extensive attention recently in both the single-timescale SA literature (Sun et al., 2018; Karimi et al., 2019; Chen et al., 2020b; 2022) and the two-timescale SA literature (Doan, 2021; Zeng et al., 2021). Specifically, single-timescale SA has the following form:

$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n + \beta_{n+1} H(\boldsymbol{\theta}_n, X_{n+1}),$$

and the function $\mathbf{h}(\boldsymbol{\theta}) \triangleq \mathbb{E}_{X \sim \boldsymbol{\mu}}[H(\boldsymbol{\theta}, X)]$ is the mean field of the function $H$. Similarly, for two-timescale SA, we have the following recursions:

$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n + \beta_{n+1} H_1(\boldsymbol{\theta}_n, \mathbf{w}_n, X_{n+1}), \qquad \mathbf{w}_{n+1} = \mathbf{w}_n + \gamma_{n+1} H_2(\boldsymbol{\theta}_n, \mathbf{w}_n, X_{n+1}),$$

where $\boldsymbol{\theta}_n$ and $\mathbf{w}_n$ are on different timescales, and the function $\mathbf{h}_k(\boldsymbol{\theta}, \mathbf{w}) \triangleq \mathbb{E}_{X \sim \boldsymbol{\mu}}[H_k(\boldsymbol{\theta}, \mathbf{w}, X)]$ is the mean field of the function $H_k$ for $k \in \{1, 2\}$. All the aforementioned works require the mean field function $\mathbf{h}$ in the single-timescale SA (or $\mathbf{h}_k$ in the two-timescale SA) to be globally Lipschitz, with some Lipschitz constant $L$, to proceed with the derivation of finite-time bounds including the constant $L$.
Here, we show that the mean field function $\boldsymbol{\pi}[\mathbf{x}] - \mathbf{x}$ in the SRRW iterates (4b) is not globally Lipschitz, where $\boldsymbol{\pi}[\mathbf{x}]$ is the stationary distribution of the SRRW kernel defined in (3). To this end, we show that entries of the Jacobian matrix of $\boldsymbol{\pi}[\mathbf{x}]$ can grow unbounded, because a multivariate function is Lipschitz if and only if it has bounded partial derivatives. Note that from Doshi et al. (2023, Proposition 2.1), for the $i$-th entry of $\boldsymbol{\pi}[\mathbf{x}]$, we have
$$\pi_i[\mathbf{x}] = \frac{\mu_i (x_i/\mu_i)^{-\alpha} \sum_{j \in \mathcal{N}} P_{ij} (x_j/\mu_j)^{-\alpha}}{\sum_{k \in \mathcal{N}} \mu_k (x_k/\mu_k)^{-\alpha} \sum_{j \in \mathcal{N}} P_{kj} (x_j/\mu_j)^{-\alpha}}. \tag{16}$$
Then, the Jacobian matrix of the mean field function $\boldsymbol{\pi}[\mathbf{x}] - \mathbf{x}$, which has been derived in Doshi et al. (2023, Proof of Lemma 3.4 in Appendix B), has entries given as follows:

(17)

for $i \neq j$, and

(18)
for $i = j$. Since the empirical distribution $\mathbf{x} \in \text{Int}(\Sigma)$, we have $x_i \in (0, 1)$ for all $i \in \mathcal{N}$. For fixed $i, j$, as $x_i$ and $x_j$ approach zero, the terms $(x_i/\mu_i)^{-\alpha}$ and $(x_j/\mu_j)^{-\alpha}$ dominate the fraction in (17), and both the numerator and the denominator of the fraction have the same order in these dominating terms, so the fraction tends to a non-zero limit. Thus, the $(i, j)$-th entry of the Jacobian matrix can grow unbounded as $x_i, x_j \to 0$. Consequently, $\boldsymbol{\pi}[\mathbf{x}] - \mathbf{x}$ is not globally Lipschitz on $\text{Int}(\Sigma)$ for $\alpha > 0$.
Appendix C Discussion on Assumption A3′
In case (iii), where $\gamma_n = o(\beta_n)$, the iterates $\mathbf{x}_n$ have smaller step sizes compared to $\boldsymbol{\theta}_n$, and thus converge 'slower' than $\boldsymbol{\theta}_n$. From Assumption A3′, $\boldsymbol{\theta}_n$ will intuitively converge to some point $\boldsymbol{\rho}(\mathbf{x})$ given the current value $\mathbf{x}$ from the iteration $\mathbf{x}_n$, i.e., $\sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] H(\boldsymbol{\rho}(\mathbf{x}), i) = \mathbf{0}$, while the Hurwitz condition is to ensure stability around $\boldsymbol{\rho}(\mathbf{x})$. We can see that Assumption A3 is less stringent than A3′ in that it only assumes such a condition at $\mathbf{x} = \boldsymbol{\mu}$, for which $\boldsymbol{\pi}[\boldsymbol{\mu}] = \boldsymbol{\mu}$, rather than for all $\mathbf{x} \in \text{Int}(\Sigma)$.
One special instance of Assumption A3′ is the linear SA, e.g., $H(\boldsymbol{\theta}, i) = A_i \boldsymbol{\theta} + \mathbf{b}_i$. In this case, the root condition is equivalent to $\left(\sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] A_i\right) \boldsymbol{\rho}(\mathbf{x}) + \sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] \mathbf{b}_i = \mathbf{0}$. Under the condition that for every $\mathbf{x} \in \text{Int}(\Sigma)$ the matrix $\sum_i \pi_i[\mathbf{x}] A_i$ is invertible, we then have

$$\boldsymbol{\rho}(\mathbf{x}) = -\left(\sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] A_i\right)^{-1} \sum_{i \in \mathcal{N}} \pi_i[\mathbf{x}] \mathbf{b}_i.$$

However, this condition is quite strict. Loosely speaking, $\sum_i \pi_i[\mathbf{x}] A_i$ being invertible for any $\mathbf{x} \in \text{Int}(\Sigma)$ is similar to saying that any convex combination of $\{A_i\}_{i \in \mathcal{N}}$ is invertible. For example, suppose the $A_i$ are negative definite and all share the same eigenbasis, e.g., $A_i = \mathbf{Q} \boldsymbol{\Lambda}_i \mathbf{Q}^T$ with orthogonal $\mathbf{Q}$ and diagonal $\boldsymbol{\Lambda}_i$ with strictly negative entries for all $i \in \mathcal{N}$. Then, $\sum_i \pi_i[\mathbf{x}] A_i$ is invertible.

Another example for Assumption A3′ is when $H(\boldsymbol{\theta}, i) = H(\boldsymbol{\theta}, j)$ for all $i, j \in \mathcal{N}$, which implies that each agent in the distributed learning setting has the same local dataset for collaboratively training the model. In this example, the mean field is independent of $\mathbf{x}$, such that $\boldsymbol{\rho}(\mathbf{x}) \equiv \boldsymbol{\theta}^*$ for every $\mathbf{x} \in \text{Int}(\Sigma)$.
Appendix D Proof of Lemma 3.1 and Theorem 3.2
In this section, we demonstrate the almost sure convergence of both $\mathbf{x}_n$ and $\boldsymbol{\theta}_n$ together. This proof naturally incorporates the almost sure convergence of the SRRW iteration $\mathbf{x}_n$ in Lemma 3.1, since $\mathbf{x}_n$ is independent of $\boldsymbol{\theta}_n$ (as indicated in (4)), allowing us to separate out its asymptotic results. The same reason applies to the CLT analysis of the SRRW iterates, and we refer the reader to Section E.1 for the CLT result of $\mathbf{x}_n$ in Lemma 3.1.
We will use different techniques for the different settings of step sizes in Assumption A2. Specifically, for step sizes $\gamma_n = (n+1)^{-a}$ and $\beta_n = (n+1)^{-b}$, we consider the following scenarios:
- Scenario 1: We consider case (ii): $a = b$, and will apply the almost sure convergence result for single-timescale stochastic approximation in Theorem G.8, verifying all the conditions therein.
- Scenario 2: We consider both case (i): $a < b$ and case (iii): $a > b$. In these two cases, the step sizes decrease at different rates, thereby putting the iterates on different timescales and resulting in a two-timescale structure. We will apply the existing almost sure convergence result for two-timescale stochastic approximation with iterate-dependent Markov chains in Yaji & Bhatnagar (2020, Theorem 4), where our SA-SRRW algorithm can be regarded as a special instance. (However, Yaji & Bhatnagar (2020) only analysed the almost sure convergence. The central limit theorem remains unknown in the literature for two-timescale stochastic approximation with iterate-dependent Markov chains. Thus, our CLT analysis in Section E for this two-timescale structure with an iterate-dependent Markov chain is novel and recognized as our contribution.)
D.1 Scenario 1
In Scenario 1, we have $\beta_n = \gamma_n$. First, we rewrite (4) as

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \gamma_{n+1}\left(\boldsymbol{\delta}_{X_{n+1}} - \mathbf{x}_n\right), \qquad \boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n + \gamma_{n+1} H(\boldsymbol{\theta}_n, X_{n+1}). \tag{19}$$
By augmentation, we define the variable $\mathbf{z}_n \triangleq (\mathbf{x}_n^T, \boldsymbol{\theta}_n^T)^T$ and the function $G(\mathbf{z}, i) \triangleq ((\boldsymbol{\delta}_i - \mathbf{x})^T, H(\boldsymbol{\theta}, i)^T)^T$. In addition, we define a new Markov chain $\{Y_n\}_{n \ge 0}$ on the same state space $\mathcal{N}$ as the SRRW sequence $\{X_n\}_{n \ge 0}$. With slight abuse of notation, the transition kernel of $\{Y_n\}$ is denoted by $\mathbf{K}[\mathbf{z}] \triangleq \mathbf{K}[\mathbf{x}]$ and its stationary distribution by $\boldsymbol{\pi}[\mathbf{z}] \triangleq \boldsymbol{\pi}[\mathbf{x}]$, where $\mathbf{K}[\mathbf{x}]$ and $\boldsymbol{\pi}[\mathbf{x}]$ are the transition kernel of the SRRW and its corresponding stationary distribution, with $\boldsymbol{\pi}[\mathbf{x}]$ of the form
$$\pi_i[\mathbf{x}] = \frac{\mu_i (x_i/\mu_i)^{-\alpha} \sum_{j \in \mathcal{N}} P_{ij} (x_j/\mu_j)^{-\alpha}}{\sum_{k \in \mathcal{N}} \mu_k (x_k/\mu_k)^{-\alpha} \sum_{j \in \mathcal{N}} P_{kj} (x_j/\mu_j)^{-\alpha}}. \tag{20}$$
Recall that $\boldsymbol{\mu}$ is the fixed point, i.e., $\boldsymbol{\pi}[\boldsymbol{\mu}] = \boldsymbol{\mu}$, and $\mathbf{P}$ is the base Markov chain inside the SRRW (see (3)). Then, the mean field

$$g(\mathbf{z}) \triangleq \mathbb{E}_{Y \sim \boldsymbol{\pi}[\mathbf{z}]}[G(\mathbf{z}, Y)] = \begin{pmatrix} \boldsymbol{\pi}[\mathbf{x}] - \mathbf{x} \\ \mathbf{H}(\boldsymbol{\theta})^T \boldsymbol{\pi}[\mathbf{x}] \end{pmatrix},$$

and $\mathbf{z}^* \triangleq (\boldsymbol{\mu}^T, (\boldsymbol{\theta}^*)^T)^T$, for $\boldsymbol{\theta}^* \in \Theta$ in Assumption A3, is a root of $g$, i.e., $g(\mathbf{z}^*) = \mathbf{0}$. The augmented iteration (19) becomes

$$\mathbf{z}_{n+1} = \mathbf{z}_n + \gamma_{n+1} G(\mathbf{z}_n, X_{n+1}), \tag{21}$$
with the goal of solving $g(\mathbf{z}) = \mathbf{0}$. Therefore, we can treat (21) as an SA algorithm driven by a Markov chain with kernel $\mathbf{K}[\mathbf{z}]$ and stationary distribution $\boldsymbol{\pi}[\mathbf{z}]$, which has been widely studied in the literature (e.g., Delyon (2000); Benveniste et al. (2012); Fort (2015); Li et al. (2023)). In what follows, we demonstrate that for any initial point $(\mathbf{x}_0, \boldsymbol{\theta}_0, X_0)$, the SRRW iteration $\mathbf{x}_n$ will almost surely converge to the target distribution $\boldsymbol{\mu}$, and the SA iteration $\boldsymbol{\theta}_n$ will almost surely converge to the set $\Theta$.
Now we verify conditions C1 - C4 in Theorem G.8. Our Assumption A4 is equivalent to condition C1, and Assumption A2 corresponds to condition C2. For condition C3, we use the mean field $g$ defined above, whose set of roots $\{(\boldsymbol{\mu}^T, (\boldsymbol{\theta}^*)^T)^T : \boldsymbol{\theta}^* \in \Theta\}$ comprises disjoint points by Assumption A3. For condition C4, since $\mathbf{K}[\mathbf{z}]$, or equivalently $\mathbf{K}[\mathbf{x}]$, is ergodic and time-reversible for a given $\mathbf{x}$, as shown in the SRRW work Doshi et al. (2023), it automatically ensures a solution to the Poisson equation, which has been well discussed in Chen et al. (2020a, Section 2) and Benveniste et al. (2012); Meyn (2022). To show (97) and (98) in condition C4, for each given $\mathbf{z}$ and any $i \in \mathcal{N}$, we need to give the explicit solution to the Poisson equation in (96). The necessary notation is defined as follows.
Let $\mathbf{G}(\mathbf{z}) \triangleq [G(\mathbf{z}, 1), \cdots, G(\mathbf{z}, N)]$, so that $\mathbf{G}(\mathbf{z}) \boldsymbol{\delta}_i = G(\mathbf{z}, i)$ extracts the $i$-th column of the matrix $\mathbf{G}(\mathbf{z})$. Then, we let $q(\mathbf{z}, i)$ be such that

$$q(\mathbf{z}, i) \triangleq \sum_{k=0}^{\infty} \left( \mathbb{E}\left[ G(\mathbf{z}, Y_k) \,\middle|\, Y_0 = i \right] - g(\mathbf{z}) \right), \tag{22}$$

where $\{Y_k\}$ evolves according to the kernel $\mathbf{K}[\mathbf{x}]$. In addition, the Poisson equation in (96) reads

$$q(\mathbf{z}, i) - \left(\mathbf{K}[\mathbf{x}] q(\mathbf{z}, \cdot)\right)(i) = G(\mathbf{z}, i) - g(\mathbf{z}). \tag{23}$$

We can check that the form in (22) is indeed the solution of the above Poisson equation. Now, noting that $\mathbb{E}[G(\mathbf{z}, Y_k) \,|\, Y_0 = i] = \mathbf{G}(\mathbf{z}) (\mathbf{K}[\mathbf{x}]^T)^k \boldsymbol{\delta}_i$ and $g(\mathbf{z}) = \mathbf{G}(\mathbf{z}) \boldsymbol{\pi}[\mathbf{x}]$, we can write (22) compactly as

$$q(\mathbf{z}, i) = \sum_{k=0}^{\infty} \mathbf{G}(\mathbf{z}) \left( (\mathbf{K}[\mathbf{x}]^T)^k - \boldsymbol{\pi}[\mathbf{x}] \mathbf{1}^T \right) \boldsymbol{\delta}_i. \tag{24}$$
Here, $q(\mathbf{z}, i)$ is well defined because $\mathbf{K}[\mathbf{x}]$ is ergodic and time-reversible for any given $\mathbf{x} \in \text{Int}(\Sigma)$ (proved in Doshi et al. (2023, Appendix A)). Now that both functions $G$ and $g$ are bounded on each compact subset of the domain by our Assumption A1, the function $q$ is also bounded within the compact subset of its domain. Thus, the function $\mathbf{K}[\mathbf{x}] q(\mathbf{z}, \cdot)$ is bounded as well, and (97) is verified. Moreover, for a fixed $i \in \mathcal{N}$,
the series in (24) can be summed in closed form: using $\sum_{k=0}^{\infty} \left( (\mathbf{K}[\mathbf{x}]^T)^k - \boldsymbol{\pi}[\mathbf{x}] \mathbf{1}^T \right) = \left( \mathbf{I} - \mathbf{K}[\mathbf{x}]^T + \boldsymbol{\pi}[\mathbf{x}] \mathbf{1}^T \right)^{-1} - \boldsymbol{\pi}[\mathbf{x}] \mathbf{1}^T$, we rewrite (24) as

$$q(\mathbf{z}, i) = \mathbf{G}(\mathbf{z}) \left( \left( \mathbf{I} - \mathbf{K}[\mathbf{x}]^T + \boldsymbol{\pi}[\mathbf{x}] \mathbf{1}^T \right)^{-1} - \boldsymbol{\pi}[\mathbf{x}] \mathbf{1}^T \right) \boldsymbol{\delta}_i,$$

and this vector-valued function is continuous in $\mathbf{z}$ because $\mathbf{G}(\mathbf{z})$, $\mathbf{K}[\mathbf{x}]$ and $\boldsymbol{\pi}[\mathbf{x}]$ are continuous. Hence $q(\mathbf{z}, i)$ is continuous with respect to $\mathbf{z}$, and so is $\mathbf{K}[\mathbf{x}] q(\mathbf{z}, \cdot)$. This implies that the functions $q$ and $\mathbf{K} q$ are locally Lipschitz, which satisfies (98) with a constant that depends on the compact set. Therefore, condition C4 is checked, and we can apply Theorem G.8 to show the almost sure convergence result of (19), i.e., almost surely,

$$\lim_{n \to \infty} \mathbf{x}_n = \boldsymbol{\mu} \quad \text{and} \quad \lim_{n \to \infty} \boldsymbol{\theta}_n \in \Theta.$$

Therefore, the almost sure convergence of $\mathbf{x}_n$ in Lemma 3.1 is also proved. This finishes the proof in Scenario 1.
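The Poisson equation machinery used above admits a quick numerical sanity check: for an ergodic kernel $\mathbf{K}$ with stationary distribution $\boldsymbol{\pi}$, the fundamental-matrix solution satisfies $(\mathbf{I} - \mathbf{K})q = h - (\boldsymbol{\pi}^T h)\mathbf{1}$ exactly, mirroring (22)-(24). A minimal sketch with a random kernel and a scalar test function (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
K = rng.random((N, N)); K /= K.sum(axis=1, keepdims=True)    # ergodic kernel
vals, vecs = np.linalg.eig(K.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))]); pi /= pi.sum()

h = rng.random(N)
hbar = pi @ h
# Fundamental-matrix solution of the Poisson equation (I - K) q = h - hbar * 1.
Z = np.linalg.inv(np.eye(N) - K + np.outer(np.ones(N), pi))
q = Z @ (h - hbar)
print(np.max(np.abs((q - K @ q) - (h - hbar))))              # ~ 1e-15
```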
D.2 Scenario 2
Now in this subsection, we consider the step sizes with $a \neq b$ and $a, b \in (1/2, 1]$. We will frequently use assumptions (B1) - (B5) in Section G.3 and Theorem G.10 to prove the almost sure convergence.
D.2.1 Case (i): $\beta_n = o(\gamma_n)$
In case (i), $\boldsymbol{\theta}_n$ is on the slow timescale and $\mathbf{x}_n$ is on the fast timescale, because the iteration for $\boldsymbol{\theta}_n$ has a smaller step size than that for $\mathbf{x}_n$, making $\boldsymbol{\theta}_n$ converge more slowly than $\mathbf{x}_n$. Here, we consider the two-timescale SA of the form:

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \gamma_{n+1}\left(\boldsymbol{\delta}_{X_{n+1}} - \mathbf{x}_n\right), \qquad \boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n + \beta_{n+1} H(\boldsymbol{\theta}_n, X_{n+1}). \tag{25}$$
Now, we verify assumptions (B1) - (B5) listed in Section G.3.
- •
-
•
Our assumption A3 shows that the function is continuous and differentiable w.r.t and grows linearly with . In addition, also satisfies this property. Therefore, (B2) is satisfied.
-
•
Now that the function is independent of , we can set for any such that from Doshi et al. (2023, Proposition 3.1), and
from Doshi et al. (2023, Lemma 3.4), which is Hurwitz. Furthermore, inherently satisfies the condition for any . Thus, conditions (i) - (iii) in (B3) are satisfied. Additionally, such that for defined in assumption A3, , and is Hurwitz. Therefore, (B3) is checked.
• Assumption (B4) is verified by the nature of the SRRW, i.e., its transition kernel and the corresponding stationary distribution with .
Consequently, assumptions (B1) - (B5) are satisfied by our assumptions A1 - A4, and by Theorem G.10, we have and almost surely.
Next, we consider . As discussed before, (B1), (B2), (B4) and (B5) are satisfied by our assumptions A1 - A4 and the properties of SRRW. The only difference for this step size setting, compared to the previous one , is that the roles of are now flipped, that is, is now on the fast timescale while is on the slow timescale. By a much stronger Assumption A3′, for any , (i) ; (ii) is Hurwitz; (iii) . Hence, conditions (i) - (iii) in (B3) are satisfied. Moreover, we have , being Hurwitz, as mentioned in the previous part. Therefore, (B3) is verified. Accordingly, (B1) - (B5) are checked by our assumptions A1, A2, A3′, A4. By Theorem G.10, we have and almost surely.
Appendix E Proof of Theorem 3.3
This section is devoted to the proof of Theorem 3.3, which also includes the proof of the CLT results for the SRRW iteration in Lemma 3.1. We will use different techniques depending on the step sizes in Assumption A2. Specifically, for step sizes , we will consider three cases: case (i): ; case (ii): ; and case (iii): . For case (ii), we will use the existing CLT result for single-timescale SA in Theorem G.9. For cases (i) and (iii), we will construct our own CLT analysis for the two-timescale structure. We start with case (ii).
E.1 Case (ii):
In this part, we stick to the notations for single-timescale SA studied in Section D.1. To utilize Theorem G.9, apart from Conditions C1 - C4 that have been checked in Section D.1, we still need to check conditions C5 and C6 listed in Section G.2.
Assumption A3 corresponds to condition C5. For condition C6, we need to obtain the explicit form of the solution function to the Poisson equation defined in (96), that is,
where
Following steps similar to the derivation of from (22) to (24), we have
We also know that and are continuous in for any . For any in a compact set , and are bounded because the function is bounded. Therefore, C6 is checked. By Theorem G.9, assuming converges to a point for , we have
(26)
where is the solution of the following Lyapunov equation
(27)
and .
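Schematically, with $J$ denoting the Jacobian of the mean field at the equilibrium and $U$ the asymptotic noise covariance (both standing in for the elided matrices), (26)-(27) take the standard form of the Markov-chain SA CLT:
\begin{align*}
\beta_n^{-1/2} (z_n - z^*) \;\xrightarrow{\ dist.\ }\; N(0, V), \qquad
\Big(J + \tfrac{\zeta}{2} I\Big) V + V \Big(J + \tfrac{\zeta}{2} I\Big)^{T} + U \;=\; 0,
\end{align*}
where $\zeta \ge 0$ is a step-size-dependent constant that vanishes unless the step size is $\Theta(1/n)$; this is an assumed reconstruction rather than the exact display.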
By algebraic calculations of the derivative of with respect to in (20) (one may refer to Doshi et al. (2023, Appendix B, Proof of Lemma 3.4) for the computation of ), we can rewrite in terms of , i.e.,
where matrix . Then, we further clarify the matrix . Note that
(28)
where the first equality holds because from the definition of SRRW kernel (3), the second equality stems from , and the last term is a conditional expectation over the base Markov chain (with transition kernel ) conditioned on . Similarly, with in the form of (23), we have
From the form ‘’ inside the matrix , the Markov chain is in its stationary regime from the beginning, i.e., for any . Hence,
(29)
where the covariance between and for the Markov chain in the stationary regime is . By Brémaud (2013, Theorem 6.3.7), it is demonstrated that is the sampling covariance of the base Markov chain for the test function . Moreover, Brémaud (2013, equation (6.34)) states that this sampling covariance can be rewritten in the following form:
(30)
where is the eigenpair of the transition kernel of the ergodic and time-reversible base Markov chain. This completes the proof of case (ii).
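For reference, with $\{(\lambda_i, u_i)\}$ the eigenpairs of the reversible base kernel and $f$ the test function (inner products taken in the $\mu$-weighted sense), the classical eigendecomposition behind (30) reads, in our placeholder notation:
\begin{align*}
\Sigma(f) \;=\; \sum_{i=2}^{N} \frac{1+\lambda_i}{1-\lambda_i}\, \langle f, u_i \rangle_{\mu}\, \langle f, u_i \rangle_{\mu}^{T}.
\end{align*}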
Remark E.1.
For the CLT result (26), we can look further into the asymptotic covariance matrix as in (27). For convenience, we denote and in the form of (30) such that
(31)
For the SRRW iteration , from (26) we know that . Thus, in this remark, we want to obtain the closed form of . By algebraic computations of the bottom-right sub-block matrix, we have
By using the closed-form solution to the Lyapunov equation (e.g., Lemma G.1) and the eigendecomposition of , we have
(32)
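While the display in (32) is not reproduced here, its qualitative content, consistent with Doshi et al. (2023), is that each eigen-direction of the base chain's sampling covariance is damped by a factor growing linearly in the self-repellence parameter; schematically, with mode-dependent constants $c_i, d_i > 0$ left unspecified,
\begin{align*}
V_x(\alpha) \;=\; \sum_{i=2}^{N} \frac{c_i}{\,2\alpha(1+\lambda_i) + d_i\,}\; u_i u_i^{T} \;=\; O(1/\alpha) \qquad \text{as } \alpha \to \infty.
\end{align*}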
E.2 Case (i):
In this part, we mainly focus on the CLT of the SA iteration because the SRRW iteration is independent of and its CLT result has been shown in Remark E.1.
E.2.1 Decomposition of SA-SRRW iteration (4)
We slightly abuse notation and define the function
such that . Then, we reformulate (25) as
(33a)
(33b)
There exist functions satisfying the following Poisson equations
(34a)
(34b)
for any and , where , . The existence and explicit form of the solutions , which are continuous w.r.t. , follow from steps similar to those in Section D.1, (22) to (24). Thus, we can further decompose (33) into
(35a)
(35b)
such that
(36a)
(36b)
We can observe that (36) differs from the expressions in Konda & Tsitsiklis (2004) and Mokkadem & Pelletier (2006), which studied two-timescale SA with Martingale difference noise. Here, due to the presence of the iterate-dependent Markovian noise and the application of the Poisson equation technique, we have additional non-vanishing terms , which are further examined in Lemma E.2. Additionally, when we apply the Poisson equation to the Martingale difference terms , , we find that some covariances are also non-vanishing, as in Lemma E.1; we will return to this point when we obtain those covariances. These extra non-zero noise terms make our analysis distinct from the previous ones, since the key assumption (A4) in Mokkadem & Pelletier (2006) is not satisfied. We demonstrate that the long-run average of these terms can be controlled so that they do not affect the final CLT result.
Analysis of Terms
Consider the filtration ; it is evident that are Martingale difference sequences adapted to . Then, we have
(37)
Similarly, we have
(38)
and
We now focus on . Denote by
(39)
and let its expectation w.r.t. the stationary distribution be . Then, we can construct another Poisson equation, i.e.,
for some matrix-valued function . Since and are continuous in , functions are also continuous in . Then, we can decompose (39) into
(40)
Thus, we have
(41)
where .
Following similar steps, we can decompose and as
(42a)
(42b)
where and . Here, we note that the matrices for already appear in the existing CLT analysis of two-timescale SA with Martingale difference noise. In addition, , and inherently include the information of the underlying Markov chain (with its eigenpair ()), which extends the previous works (Konda & Tsitsiklis, 2004; Mokkadem & Pelletier, 2006).
Proof.
We now provide the properties of the four terms inside (41) as an example. Note that
We can see that it has exactly the same structure as the matrix in (27). Following steps similar to those used to deduce the explicit form of from (28) to (30), we get
(44)
By the almost sure convergence result in Lemma 3.1, a.s. such that a.s.
We next prove that and .
Since is a Martingale difference sequence adapted to , with the Burkholder inequality in Lemma G.2 and , we show that
(45)
By assumption A4, is always within some compact set such that and for a given trajectory of ,
(46)
and the last term decreases to zero in since .
For , we use the Abel transformation and obtain
Since is continuous in , for within a compact set (assumption A4), it is locally Lipschitz with a constant such that
where the last inequality arises from (4b), i.e., and because . Also, are upper-bounded by some positive constant . This implies that
Note that
(47)
where the last inequality is from . We observe that the last term in (47) is decreasing to zero in because .
Note that ; by the triangle inequality, we have
(48)
where the second inequality comes from (45). By (46) and (47), we know that both terms in the last line of (48) are uniformly bounded over time by constants that depend on the set . Therefore, by the dominated convergence theorem, taking the limit over the last line of (48) gives
Therefore, we have
We can apply the same steps as above for the other two terms in (42) and obtain the results. ∎
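Since the Abel transformation (summation by parts) is used here and repeatedly below, we record its generic form, with $a_k$, $b_k$ placeholder sequences and $A_k \triangleq \sum_{j=1}^{k} a_j$ (so $A_0 = 0$):
\begin{align*}
\sum_{k=1}^{n} a_k b_k \;=\; A_n b_n \;-\; \sum_{k=1}^{n-1} A_k \big( b_{k+1} - b_k \big).
\end{align*}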
Analysis of Terms
Lemma E.2.
For defined in (35), we have the following results:
(49a)
(49b)
Proof.
For , note that
(50)
where the second-to-last inequality is because is continuous in , which stems from the continuous functions and . The last inequality is from the update rules (4) and for some compact subset by assumption A4. Then, we have because by assumption A2.
We let such that . Note that , and by assumption A4, is upper bounded by a constant dependent on the compact set, which leads to
Similarly, we can also obtain and . ∎
E.2.2 Effect of SRRW Iteration on SA Iteration
In view of the almost sure convergence results in Lemma 3.1 and Lemma 3.2, for large enough so that both iterations are close to the equilibrium , we can apply the Taylor expansion to functions and in (36) at the point , which results in
(51a)
(51b)
With matrix , we have the following:
(52)
Then, (36) becomes
(53a)
(53b)
where and .
Then, inspired by Mokkadem & Pelletier (2006), we decompose iterates and into and . Rewriting (53b) gives
and substituting the above equation back in (53a) gives
(54)
From (54) we can see the iteration implicitly embeds the recursions of three sequences
• ;
• ;
• .
Let and . Below we define two iterations:
(55a)
(55b)
and a remaining term .
Similarly, for iteration , define the sequence such that
(56)
and a remaining term
(57)
The decomposition of and in the above form is also standard in the single-timescale SA literature (Delyon, 2000; Fort, 2015).
Characterization of Sequences and
We set a Martingale such that
Then, the Martingale difference array becomes
and
where, in view of the decomposition of and in (41) and (42), respectively,
(58a)
(58b)
(58c)
We further decompose into three parts:
(59)
Here, we define . By (52) and (43a) in Lemma E.1, we have
(60)
Then, we have the following lemma.
Lemma E.3.
Proof.
First, from Lemma G.4, we have for some such that
Applying Lemma G.6, together with a.s. in Lemma E.1, gives
We now consider . Set
we can rewrite as
By the Abel transformation, we have
(62)
We know from Lemma E.1 that a.s. because . Besides,
for some constant because and . Moreover,
Applying Lemma G.6 again gives
for some constant .
Finally, we provide an existing lemma below.
Lemma E.4 (Mokkadem & Pelletier (2005) Lemma 4).
For a sequence with decreasing step size for , , a positive semi-definite matrix and a Hurwitz matrix , which is given by
we have
where is the solution of the Lyapunov equation
Then, is a direct application of Lemma E.4. ∎
The last step is to show . Note that
where the second equality is from Lemma G.4. Then, we use Lemma G.6 with to obtain
(63)
Additionally, since , we have
Then, it follows that . Therefore, we obtain
Now, we turn to verifying the conditions in Theorem G.3. For some , we have
(64)
where the last equality comes from Lemma G.6. Since (64) also holds for , we have
Therefore, all the conditions in Theorem G.3 are satisfied and its application then gives
(65)
Furthermore, we have the following lemma about the strong convergence rate of and .
Lemma E.5.
(66a)
(66b)
Proof.
This proof follows Pelletier (1998, Lemma 1). We only need the special case of Pelletier (1998, Lemma 1) that fits our scenario; e.g., we let the two types of step sizes therein be the same. Specifically, we state the following lemma.
Lemma E.6 (Pelletier (1998) Lemma 1).
Consider a sequence
where , , and is a Martingale difference sequence adapted to the filtration such that, almost surely, and there exists , , such that . Then, almost surely,
(67)
where is a constant dependent on .
By assumption A4, the iterates are bounded within a compact subset . Recall the form of defined in (35); it comprises the functions and , which in turn include the function . We know that is bounded for in some compact set . Thus, for any for some compact set , are bounded, and we denote their upper bounds by and , i.e., and . We only need to replace the upper bound in Lemma E.6 by for the sequence (resp. for the sequence ), i.e.,
(68a)
(68b)
such that a.s. and a.s., which completes the proof. ∎
Note that and weakly converge to the same Gaussian distribution from Remark E.1 and (65). Then, weakly converges to zero, implying that converges to zero in probability. Therefore, together with being strictly positive, we have
(69)
Characterization of Sequences and
We first consider the sequence . We assume a positive real-valued bounded sequence under the same conditions as in Mokkadem & Pelletier (2006, Definition 1), i.e.,
Definition E.1.
In the case , , which also implies .
In the case , there exist and a nondecreasing slowly varying function such that . When , we require function to be bounded. ∎
Since by a.s. convergence result, we can assume that there exists such that . Then, from (55b), we can use the Abel transformation and obtain
where the last term on the RHS can be rewritten as
Using Lemma G.6 on gives . Then, it follows that for some ,
(70)
with the application of Lemma G.4 to the second equality.
Then, we shift our focus to . Specifically, we substitute (54), (55a), and (56) back into , and obtain
(71)
where the second equality is by taking the Taylor expansion .
Define and by convention . Then, we rewrite (71) as
(72)
From (72), we can indeed decompose into two parts , where
(73a)
(73b)
This term shares the same recursive form as in the sequence defined in Mokkadem & Pelletier (2006, Lemma 6), which is given below.
Lemma E.7 (Mokkadem & Pelletier (2006) Lemma 6).
Since we already have in (69), together with Lemma E.7, we have
where the second equality comes from .
We now focus on . Define a sequence
(74)
and we have
where the last equality comes from the Abel transformation. Note that
for some constant , where the last inequality is from Lemma G.4 and for some constant that depends on . Then,
By Lemma E.1, we have a.s. such that by Lemma G.6, it follows that
Therefore, we have
(75)
Consequently, almost surely.
We now deal with and its related sequence . Note that by Lemma E.5 and (69), we have almost surely,
(76)
Thus, we can set such that in (70) can be written as
and
In view of assumption A2 and , and , there exists a such that almost surely,
Therefore, almost surely. This completes the proof of case (i).
E.3 Case (iii):
For , the roles of and are flipped, i.e., is now on the fast timescale while is on the slow timescale.
We still decompose as , where are defined in (56) and (57), respectively. Since is independent of , the results of and remain the same, i.e., almost surely, from Lemma E.5 and from (69). Then, we define sequences and as follows.
(77a)
(77b)
Moreover, the remaining term .
The proof outline is the same as in the previous scenario:
• We first show weakly converges to the distribution ;
• We analyse and to ensure that these two terms decrease faster than the CLT scale , i.e., ;
• With the above two steps, we can show that weakly converges to the distribution .
Analysis of
We first focus on and follow similar steps as we did when we analysed in the previous scenario. We set a Martingale such that
Then,
Following similar steps as in (59) to decompose with (42b), we have
(78)
Since share forms similar to those in Lemma E.3, we follow the same steps as in the proof therein, with the application of Lemma E.1. To avoid repetition, we omit the proof and directly state the following lemma.
Lemma E.8.
Note that here we don’t have the term in above lemma, compared to Lemma E.3, because in the case of , such that . Then, applying Lemma G.1 to derive the closed form of gives
Thus, it follows that
Again, we use the Martingale CLT result in Theorem G.3 and have the following result.
Moreover, similar to the tighter upper bound on proved in Lemma E.5, we utilize the tighter upper bound of Lemma E.6 in the proof thereof, and obtain .
Analysis of
Next, we turn to the term in (77b). Taking the norm gives the following inequality for some constant by applying Lemma G.4,
Using Lemma G.6 gives
Thus, . Since , , where . Then, there exists some such that . Together with , we have . Therefore, we have
Analysis of
Lastly, let’s focus on the term . We have
where the second equality is from (53a), the third equality stems from the approximation of . Then, we again use the definition and reiterate the above equation as
where and
(80)
For , we follow the same steps from (74) to (75), and obtain .
Next, we consider and want to show that . Again, we utilize Mokkadem & Pelletier (2006, Lemma 6) for and adapt the notation here for the case .
Lemma E.9.
Now we need to further analyse and tighten its big-O form, starting from , so that we can finally obtain the big-O form of . The following steps borrow ideas from Mokkadem & Pelletier (2006, Section 2.3.2).
By the almost sure convergence result , we have a.s., so we can first set , and . Then, we redefine
and notice that it still satisfies Definition E.1. Then, reapplying this form to (81) gives
and by induction we have for all integers ,
Since , there exists such that , and
(82)
Then, as suggested in Mokkadem & Pelletier (2006, Section 2.3.2), we can choose , which also satisfies Definition E.1. Then,
By induction, this holds for all such that there exists , and . Equivalently, . Therefore, from (82) we have
Together with , we have such that
Thus, we have finished the proof, following the outline given at the beginning of this part.
Appendix F Discussion of Covariance Ordering of SA-SRRW
F.1 Proof of Proposition 3.4
For any and any vector , we have
where the first equality is from the form of in Theorem 3.3. Let , with the dependence on variable left implicit. The matrix , given explicitly in (11), is positive semi-definite, since for all . Thus, the terms inside the integral are non-negative, and it is enough to provide an ordering on with respect to .
For any ,
where the inequality is because for all and any (the inequality may not be strict when is low rank; however, it will always hold for some choice of , since is not a zero matrix, so the ordering derived still follows our definition of in Section 1, footnote 6). In fact, the ordering is monotone in , and decreases at rate as seen from its form in the equation above. This completes the proof.
F.2 Discussion regarding Proposition 3.4 and MSE ordering
We can use Proposition 3.4 to show that the MSE of the SA iterates of (4c) driven by SRRW eventually becomes smaller than that of SA iterates whose stochastic noise is driven by an i.i.d. sequence of random variables. The diagonal entries of are obtained by evaluating , where is the 'th standard basis vector (the -dimensional vector of all zeros except at the 'th position, which is ). These diagonal entries are the asymptotic variances corresponding to the element-wise iterate errors, and for large enough , we have for all . Thus, the trace of matrix approximates the scaled MSE, that is, for large . Since all entries of go to zero as increases, they become smaller than the corresponding terms for the SA algorithm with i.i.d. input for large enough , which achieves a constant MSE in the similarly scaled limit, since its asymptotic covariance is not a function of . Moreover, the value of only needs to be moderately large, since the asymptotic covariance terms decrease at rate as shown in Proposition 3.4.
F.3 Proof of Corollary 3.5
We see that for all , because the form of in Theorem 3.3 is independent of . To prove that , it is enough to show that , since from Proposition 3.4. This is easily checked by substituting in (11), for which . Substituting in the respective forms of and in Theorem 3.3, we get equivalence. This completes the proof.
Appendix G Background Theory
G.1 Technical Lemmas
Lemma G.1 (Solution to the Lyapunov Equation).
If all the eigenvalues of matrix have negative real part, then for every positive semi-definite matrix there exists a unique positive semi-definite matrix satisfying the Lyapunov equation . The explicit solution is given as
(83)
Chellaboina & Haddad (2008, Theorem 3.16) states that for a positive definite matrix , there exists a positive definite matrix . The reason they focus on positive definite matrices is that they require the related autonomous ODE system to be asymptotically stable. However, in this paper we do not need this requirement. The same steps therein can be used to prove Lemma G.1 and show that if is positive semi-definite, then in the form of (83) is unique and also positive semi-definite.
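For reference, with $A$ Hurwitz and $U$ positive semi-definite (symbols ours, standing in for the elided notation), the explicit solution referenced in (83) is the standard integral form
\begin{align*}
V \;=\; \int_0^{\infty} e^{A t}\, U\, e^{A^{T} t}\, dt, \qquad A V + V A^{T} + U \;=\; 0,
\end{align*}
which can be verified by integrating $\frac{d}{dt}\big( e^{A t} U e^{A^{T} t} \big)$ from $0$ to $\infty$ and using $e^{A t} \to 0$ as $t \to \infty$ for Hurwitz $A$.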
Lemma G.2 (Burkholder Inequality, Davis (1970), Hall et al. (2014) Theorem 2.10).
Given a Martingale difference sequence , for and some positive constant , we have
(84)
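In the scalar form used later, with $\{D_k\}$ a Martingale difference sequence, $p \ge 2$, and $C_p$ a generic constant, the inequality reads (a textbook restatement under our placeholder notation):
\begin{align*}
\mathbb{E}\Big[ \Big| \sum_{k=1}^{n} D_k \Big|^{p} \Big] \;\le\; C_p\, \mathbb{E}\Big[ \Big( \sum_{k=1}^{n} D_k^{2} \Big)^{p/2} \Big].
\end{align*}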
Theorem G.3 (Martingale CLT, Delyon (2000) Theorem 30).
If a Martingale difference array satisfies the following conditions: for some ,
(85)
(86)
and
(87)
then
(88)
∎
Lemma G.4 (Duflo (1996) Proposition 3.I.2).
For a Hurwitz matrix , there exist some positive constants such that for any ,
(89)
Lemma G.5 (Fort (2015) Lemma 5.8).
For a Hurwitz matrix , denote by , , the largest real part of its eigenvalues. Let a positive sequence such that . Then for any , there exists a positive constant such that for any ,
(90)
Lemma G.6 (Fort (2015) Lemma 5.9, Mokkadem & Pelletier (2006) Lemma 10).
Let be a positive sequence such that and . Let be a nonnegative sequence. Then, for , ,
(91)
for some constant .
When and define a positive sequence satisfying , we have
(92)
Lemma G.7 (Fort (2015) Lemma 5.10).
For any matrices ,
(93)
G.2 Asymptotic Results of Single-Timescale SA
Consider the stochastic approximation in the form of
(94)
Let be the transition kernel of the underlying Markov chain with stationary distribution such that with domain . Define an operator for any function such that
(95)
Assume that
C1. W.p.1, the closure of is a compact subset of .
C2. .
C3. Function is continuous on and there exists a non-negative function and a compact set such that
• for all and if ;
• the set is such that has an empty interior.
C4. For every , there exists a solution for the following Poisson equation
(96)
for any ; for any compact set ,
(97)
and there exists a continuous function , such that for any ,
(98)
C5. Denote by the largest real part of the eigenvalues of the Jacobian matrix and assume .
C6. For every , there exists a solution for the following Poisson equation
(99)
for any , where
(100)
For any compact set ,
(101)
and there exist , such that for any ,
(102)
G.3 Asymptotic Results of Two-Timescale SA
For the two-timescale SA with iterate-dependent Markov chain, we have the following iterations:
(106a)
(106b)
with the goal of finding the root such that
(107)
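In our placeholder notation (function names and noise placement are assumptions about the elided display), (106)-(107) can be read as
\begin{align*}
\theta_{n+1} \;=\; \theta_n + \beta_{n+1}\, g(\theta_n, w_n, Y_{n+1}), \qquad
w_{n+1} \;=\; w_n + \gamma_{n+1}\, h(\theta_n, w_n, Y_{n+1}),
\end{align*}
with the goal of finding the pair $(\theta^*, w^*)$ at which both mean fields, averaged over the stationary distribution of $\{Y_n\}$, vanish.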
We present here a simplified version of the assumptions for single-valued functions that are necessary for the almost sure convergence result in Yaji & Bhatnagar (2020, Theorem 4). The original assumptions are intended for more general set-valued functions .
(B1) The step sizes and , where .
(B2) Assume the function is continuous and differentiable with respect to . There exists a positive constant such that for every . The same condition holds for the function as well.
(B3) Assume there exists a function such that the following three properties hold: (i) for some positive constant ; (ii) the ODE has a globally asymptotically stable equilibrium such that . Additionally, letting , there exists a set of disjoint roots , which is the set of globally asymptotically stable equilibria of the ODE .
(B4) is an iterate-dependent Markov process in finite state space . For every , , where the transition kernel is continuous in , and the Markov chain generated by is ergodic so that it admits a stationary distribution , and .
(B5) a.s.
Yaji & Bhatnagar (2020) included assumptions A1 - A9 and A11 for the following Theorem G.10. We briefly show the correspondence between our assumptions (B1) - (B5) and theirs: (B1) with A5, (B2) with A1 and A2, (B3) with A9 and A11, (B4) with A3 and A4, and (B5) with A8. Given that our two-timescale SA framework (106) excludes additional noise terms (setting them to zero), A6 and A7 therein are inherently met.
Appendix H Additional Simulation Results
H.1 Binary Classification on Additional Datasets
In this part, we perform the binary classification task from Section 4 on additional datasets, i.e., a9a (with features) and splice (with features) from LIBSVM (Chang & Lin, 2011). Figure 4 provides the performance ordering of different values, and we empirically demonstrate that the curves with still outperform the i.i.d. counterpart. Additionally, Figure 5 compares cases (i) - (iii) under both the a9a and splice datasets, and case (i) consistently performs the best.
[Figure 4: performance ordering of different values on the a9a and splice datasets. Figure 5: comparison of cases (i) - (iii) on the a9a and splice datasets.]
H.2 Non-convex Linear Regression
We further test SGD-SRRW and SHB-SRRW algorithms with a non-convex function to demonstrate the efficiency of our SA-SRRW algorithm beyond the convex setting. In this task, we simulate the following linear regression problem in Khaled & Richtárik (2023) with non-convex regularization
(108)
where the loss function and , with the data points from the ijcnn1 dataset of LIBSVM (Chang & Lin, 2011). We still perform the optimization over the wikiVote graph, as done in Section 4.
The numerical results for the non-convex linear regression task are presented in Figures 6 and 7, where each experiment is repeated times. Figures 6(a) and 6(b) show that the performance ordering across different values is still preserved for both algorithms over almost all time, and the curves for outperform that of i.i.d. sampling (in black) under the graph topological constraints. Additionally, among the three cases examined at identical values, Figures 7(a) - 7(c) confirm that case (i) performs consistently better than the other two cases, implying that case (i) can even become the best choice for non-convex distributed optimization tasks.
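To make the simulation pipeline concrete, the following is a minimal Python sketch of one way to drive SGD with an SRRW token, assuming the polynomial self-repellence kernel of (3) on top of a given time-reversible base chain; the function names, step-size exponents, and defaults below are illustrative placeholders rather than the exact experimental configuration.

import numpy as np

def srrw_sgd(P, mu, grad, theta0, alpha=5.0, a=0.8, b=0.9, T=10000, seed=0):
    """Sketch of SA-SRRW: a self-repellent token on a graph drives SGD.

    P     : base Markov chain transition matrix (ergodic, time-reversible)
    mu    : its stationary distribution (the target sampling distribution)
    grad  : grad(theta, i) -> stochastic gradient evaluated at node i
    alpha : self-repellence strength (alpha = 0 recovers the base chain)
    a, b  : step-size exponents gamma_n = n^{-a}, beta_n = n^{-b};
            b > a puts theta on the slow timescale, as in case (i)
    """
    rng = np.random.default_rng(seed)
    N = len(mu)
    x = np.ones(N) / N            # empirical measure x_n, kept strictly positive
    theta = np.array(theta0, dtype=float)
    i = rng.integers(N)           # initial token position
    for n in range(1, T + 1):
        # SRRW transition as in (3): reweight the base row by (x_j / mu_j)^(-alpha),
        # so nodes visited more often than their target share are avoided
        w = P[i] * (x / mu) ** (-alpha)
        w = w / w.sum()
        i = rng.choice(N, p=w)
        gamma, beta = n ** (-a), n ** (-b)
        # empirical measure update toward the visited node, as in (4b)
        e = np.zeros(N)
        e[i] = 1.0
        x += gamma * (e - x)
        # SGD step using the gradient sampled at the token's node, as in (4c)
        theta -= beta * np.asarray(grad(theta, i))
    return theta, x

For instance, with a Metropolis-Hastings base chain targeting the uniform distribution over the nodes of wikiVote, the returned empirical measure x should approach uniform while theta tracks the SGD trajectory; larger alpha makes the token spread its visits more evenly, which is the source of the variance reduction studied above.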
[Figures 6 and 7: non-convex linear regression results; performance ordering over values and comparison of cases (i) - (iii).]