
Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD

Jie Hu         Yi-Ting Ma        Do Young Eun
Department of Electrical and Computer Engineering
North Carolina State University
{jhu29, yma42, dyeun}@ncsu.edu
Abstract

Distributed learning is essential to train machine learning algorithms across heterogeneous agents while maintaining data privacy. We conduct an asymptotic analysis of Unified Distributed SGD (UD-SGD), exploring a variety of communication patterns, including decentralized SGD and local SGD within Federated Learning (FL), as well as the increasing communication interval in the FL setting. In this study, we assess how different sampling strategies, such as i.i.d. sampling, shuffling, and Markovian sampling, affect the convergence speed of UD-SGD by considering the impact of agent dynamics on the limiting covariance matrix as described in the Central Limit Theorem (CLT). Our findings not only support existing theories on linear speedup and asymptotic network independence, but also theoretically and empirically show how efficient sampling strategies employed by individual agents contribute to overall convergence in UD-SGD. Simulations reveal that a few agents using highly efficient sampling can achieve or surpass the performance of the majority employing moderately improved strategies, providing new insights beyond traditional analyses focusing on the worst-performing agent.

1 Introduction

Distributed learning deals with the training of models across multiple agents over a communication network in a distributed manner, while addressing the challenges of privacy, scalability, and high-dimensional data [11, 55]. Each agent $i\in[N]$ holds a private dataset $\mathcal{X}_i$ and an agent-specified loss function $F_i:\mathbb{R}^d\times\mathcal{X}_i\to\mathbb{R}$ that depends on the model parameter $\theta\in\mathbb{R}^d$ and a data point $X\in\mathcal{X}_i$. The goal is then to find a local minimum $\theta^*$ in the set $\mathcal{L}$ of local minima of the objective function $f(\theta)\triangleq\frac{1}{N}\sum_{i=1}^N f_i(\theta)$, where agent $i$'s loss function is $f_i(\theta)\triangleq\mathbb{E}_{X\sim\mathcal{D}_i}[F_i(\theta,X)]$ and $\mathcal{D}_i$ represents the target distribution of data for agent $i$.^1 Each agent $i$ can locally compute the gradient $\nabla F_i(\theta,X)\in\mathbb{R}^d$ w.r.t. $\theta$ for every sampled data point $X\in\mathcal{X}_i$. Due to the distributed nature, $\{\mathcal{D}_i\}_{i\in[N]}$ and $\{\mathcal{X}_i\}_{i\in[N]}$ are not necessarily identically distributed over $[N]$, so the minimum of each local function $f_i(\theta)$ can be far away from $\mathcal{L}$. This is particularly relevant to decentralized training data, e.g., Federated Learning (FL) with heterogeneous data across data centers or devices [81, 31].

^1 Throughout the paper we do not impose a convexity assumption on $f(\theta)$. For convex $f(\theta)$, $\mathcal{L}$ is the global minimum. For non-convex $f(\theta)$, $\mathcal{L}$ represents the collection of local minima, which is of great interest in neural network training for sufficiently good performance [20, 19]. With an additional condition such as the Polyak-Lojasiewicz inequality, a non-convex $f(\theta)$ is ensured to have a unique minimum [1, 75, 79].

In this paper, we focus on Unified Distributed SGD (UD-SGD), where each agent $i\in[N]$ updates its model parameter $\theta_{n+1}^i$ in a two-step process:

Local update: $\theta_{n+1/2}^i = \theta_n^i - \gamma_{n+1}\nabla F_i(\theta_n^i, X_n^i)$,  (1a)
Aggregation: $\theta_{n+1}^i = \sum_{j=1}^N w_n(i,j)\,\theta_{n+1/2}^j$,  (1b)

where $\gamma_n$ denotes the step size, $X_n^i$ is the data sampled by agent $i$ at time $n$ (i.e., agent dynamics), and $\mathbf{W}_n=[w_n(i,j)]_{i,j\in[N]}$ represents the doubly-stochastic communication matrix satisfying $w_n(i,j)\geq 0$, $\mathbf{1}^T\mathbf{W}_n=\mathbf{1}^T$, and $\mathbf{W}_n\mathbf{1}=\mathbf{1}$. In the special case of $N=1$, (1) simplifies to vanilla SGD with $\mathbf{W}_n=1$ for all $n$. UD-SGD covers a wide range of distributed algorithms, e.g., decentralized SGD (DSGD) [71, 80, 61, 69], distributed SGD with changing topology (DSGD-CT) [24, 43], local SGD (LSGD) in FL [55, 76], and its variant aimed at reducing communication costs (LSGD-RC) [51].
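To make the two-step update concrete, the following is a minimal NumPy sketch of one UD-SGD iteration of (1); the function and variable names (`ud_sgd_step`, `grads`) are our own illustration, and the per-agent gradients are assumed to be computed elsewhere from each agent's sampled data point $X_n^i$.

```python
import numpy as np

def ud_sgd_step(thetas, W, gamma, grads):
    """One UD-SGD iteration of (1).

    thetas: (N, d) array; row i is agent i's parameter theta_n^i.
    W:      (N, N) doubly-stochastic communication matrix W_n.
    gamma:  step size gamma_{n+1}.
    grads:  (N, d) array; row i is grad F_i(theta_n^i, X_n^i), evaluated at
            the data point X_n^i sampled by agent i (its agent dynamics).
    """
    half = thetas - gamma * grads   # (1a) local update
    return W @ half                 # (1b) aggregation: sum_j w_n(i,j) theta_{n+1/2}^j
```

For vanilla SGD ($N=1$), `W` is the $1\times 1$ matrix $[1]$; for DSGD, `W` is a fixed topology-dependent matrix applied at every step.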

Figure 1: UD-SGD algorithm with a communication network of $N=5$ agents, each holding potentially distinct datasets; e.g., agent $j$ (in blue) samples $\mathcal{X}_j$ i.i.d., and agent $i$ (in red) samples $\mathcal{X}_i$ via a Markovian trajectory.

Versatile Communication Patterns $\{\mathbf{W}_n\}$: For visualization, we depict the scenarios of UD-SGD (1) in Figure 1. In DSGD, each agent (node) in the graph communicates with its neighbors after each SGD computation via $\mathbf{W}_n$, representing the underlying network topology. As a special case, central server-based aggregation, forming a fully connected network, translates $\mathbf{W}_n$ into a rank-1 matrix $\mathbf{W}_n=\mathbf{1}\mathbf{1}^T/N$. To minimize communication expenses, FL variants allow each agent to perform multiple SGD steps before aggregation [55, 67, 76], resulting in a communication interval of length $K$ and a consistent pattern $\mathbf{W}_n=\mathbf{W}$ for $n=mK$, $\forall m\in\mathbb{N}$, and $\mathbf{W}_n=\mathbf{I}_N$ otherwise. In particular, i) $\mathbf{W}=\mathbf{1}\mathbf{1}^T/N$ corresponds to LSGD with full agent participation (LSGD-FP) [76, 42, 51]; ii) $\mathbf{W}$ is a random matrix generated by partial agent participation (LSGD-PP) [55, 18, 74]; iii) $\mathbf{W}$ is generated by the Metropolis-Hastings algorithm in the decentralized setting, e.g., hybrid LSGD (HLSGD) [37, 32] and decentralized FL (DFL) [46, 77, 16]. We defer further discussion of $\mathbf{W}$ to Section F.1.
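As an illustration of these patterns, here is a small sketch (helper names of our own choosing) that builds the three kinds of $\mathbf{W}$ mentioned above: the rank-1 averaging matrix, a doubly-stochastic matrix for a given graph via the common max-degree Metropolis-Hastings weight variant, and the FL schedule that applies $\mathbf{W}$ only every $K$ steps.

```python
import numpy as np

def W_full_average(N):
    # Rank-1 matrix 1 1^T / N: central-server aggregation (LSGD-FP).
    return np.ones((N, N)) / N

def W_metropolis_hastings(A):
    # Doubly-stochastic weights on an undirected graph with 0/1
    # adjacency matrix A (max-degree Metropolis-Hastings variant).
    deg = A.sum(axis=1)
    N = A.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if A[i, j] and i != j:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # fill the diagonal so rows sum to 1
    return W

def W_at_step(n, K, W):
    # FL pattern: aggregate with W every K steps, identity otherwise.
    return W if n % K == 0 and n > 0 else np.eye(W.shape[0])
```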

Markovian vs. i.i.d. Sampling: Agents typically employ i.i.d. or Markovian sampling, as illustrated in the bottom brown box of Figure 1. In cases where agents have full access to their data, DSGD with i.i.d. sampling has been extensively studied [60, 43, 61, 47]. In FL, many application-oriented LSGD variants have been investigated [51, 18, 77, 32, 37, 53]. However, these works focus solely on i.i.d. sampling, restricting their applicability to Markovian sampling scenarios.

Markovian sampling, which has received increased attention in limited settings (see Table 1), is vital in settings where agents lack independent data access. For instance, in statistical applications, agents facing an unknown a priori distribution often use Markovian sampling over i.i.d. sampling [40, 63]. In HLSGD across device-to-device (D2D) networks [32, 37], random walks reduce communication costs compared to the frequent aggregations required by Gossip algorithms [38, 28, 4]. For single-agent scenarios, vanilla SGD with Markovian noise, as applied in a D2D network, has shown improved communication efficiency and privacy [68, 28, 35]. In contrast, for agents with full data access, Markov Chain Monte Carlo (MCMC) methods can be more efficient than i.i.d. sampling, especially in high-dimensional spaces with constraints [27, 40], where acceptance-rejection methods [12] lead to computational inefficiency (e.g., wasted samples) due to multiple rejections before obtaining a sample that satisfies the constraints [26, 68]. In addition, shuffling methods can be considered as high-order Markov chains [38], which achieve faster convergence than i.i.d. sampling [1, 78, 79].
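For concreteness, the sketch below (illustrative, with hypothetical helper names) contrasts three of the sampling strategies above as Python generators: i.i.d. sampling, random reshuffling (a high-order Markov chain), and a simple random walk on a graph-structured dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def iid_sampler(data):
    # i.i.d. sampling with replacement from the local dataset.
    while True:
        yield data[rng.integers(len(data))]

def reshuffling_sampler(data):
    # Random reshuffling: a fresh permutation each epoch; viewed as a
    # high-order Markov chain, more efficient than i.i.d. sampling.
    idx = np.arange(len(data))
    while True:
        rng.shuffle(idx)
        for k in idx:
            yield data[k]

def srw_sampler(neighbors, start):
    # Simple random walk: the next sample is a uniformly random
    # neighbor of the current data point (Markovian sampling).
    x = start
    while True:
        yield x
        x = neighbors[x][rng.integers(len(neighbors[x]))]
```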

Table 1: Comparison of recent works in distributed learning. We classify the communication patterns into seven categories: DSGD, DSGD-CT, LSGD-FP, LSGD-PP, LSGD-RC, HLSGD and DFL. We mark 'UD-SGD' when all aforementioned patterns are included; the detailed discussion on $\{\mathbf{W}_n\}$ is deferred to Section F.1. Abbreviations: 'Asym.' = 'Asymptotic', 'D.A.B.' = 'Differentiating Agent Behavior', 'L.S.' = 'Linear Speedup', 'A.N.I.' = 'Asymptotic Network Independence'.
Reference     Analysis   Sampling  Communication Pattern  D.A.B.  L.S.  A.N.I.
[58]          Asym.      i.i.d.    DSGD                   ✓       ✓     ✓
[51]          Asym.      i.i.d.    LSGD-RC                ✓       ✓     N/A
[43, 47]      Non-Asym.  i.i.d.    DSGD-CT                ×       ✓     ×
[61]          Non-Asym.  i.i.d.    DSGD                   ×       ×     ✓
[18, 53]      Non-Asym.  i.i.d.    LSGD-PP                ×       ✓     N/A
[37, 32]      Non-Asym.  i.i.d.    HLSGD                  ×       ✓     ×
[77, 16]      Non-Asym.  i.i.d.    DFL                    ×       ✓     ×
[71, 80, 69]  Non-Asym.  Markov    DSGD                   ×       ×     ×
[42, 72]      Non-Asym.  Markov    LSGD-FP                ×       ✓     N/A
[68, 4, 28]   Non-Asym.  Markov    N/A (single agent)     N/A     N/A   N/A
[38, 52]      Asym.      Markov    N/A (single agent)     N/A     N/A   N/A
Our Work      Asym.      Markov    UD-SGD                 ✓       ✓     ✓

Limitations of Non-Asymptotic Analysis on Agent's Sampling Strategy: Recent studies on the non-asymptotic behavior of DSGD and LSGD variants under Markovian sampling, as summarized in Table 1, have made significant strides. However, these works often fall short in accurately revealing the statistical influence of each agent's dynamics $\{X_n^i\}$ on the performance of UD-SGD. For instance, [71, 69] proposed the error bound $O\big(\frac{1/\log^2(1/\rho)}{n^{1-a}}\big)$, where $a\in(0.5,1]$ and $\rho$ denotes an identical mixing rate for all agents, overlooking agent heterogeneity in sampling strategy. A similar assumption on $\rho$ is also evident in [42]. More recent contributions [80, 72] have attempted to relax these constraints by considering a finite-time bound of $O(\tau_{mix}^2/(n+1))$, where $\tau_{mix}$ is the mixing time of the slowest agent. This approach, however, inherently focuses on the worst-performing agent, neglecting how other agents with faster mixing rates might positively influence the system.^2 Such an analysis fails to capture the collective impact of other agents on overall system performance, a crucial aspect in large-scale applications where identifying and managing the worst-performing agent is challenging due to privacy concerns or sporadic unreachability. Since agents in distributed learning have the freedom to choose their sampling strategies, it is vital to understand how each agent's improved sampling approach contributes to the overall convergence speed of UD-SGD. This understanding is key to enhancing system performance, particularly in large-scale machine learning scenarios where agent heterogeneity is a defining feature.

^2 Although improving the finite-time upper bound to distinguish each agent may not be the focus of the aforementioned works, their analyses require every Markov chain to be close to some neighborhood of its stationary distribution. This naturally incurs a maximum operator, and thus convergence is strongly influenced by the slowest mixing rate, i.e., the worst-performing agent.

Rationale for Asymptotic Analysis: Recent trends in convergence analysis have leaned towards non-asymptotic methods, yet it is crucial to recognize the complementary role of asymptotic analysis for a better understanding of convergence behaviors, as highlighted in [9, 56, 25, 39]. For vanilla SGD, [59, 17] emphasized that the central limit theorem (CLT) is far less asymptotic than it may appear under both i.i.d. and Markovian sampling. Notably, the limiting covariance matrix, a key statistical feature in vanilla SGD's CLT, also features prominently in the high-probability bound of [59], the explicit finite-time bound of [17], and the 1-Wasserstein distance in the non-asymptotic CLT of [66]. [38] further underscored this by numerically showing that the limiting covariance matrix provides a more precise depiction of convergence than the mixing rates often used in finite-time upper bounds [26, 68]. Moreover, they argued that finite-time analysis may not suitably apply to certain efficient high-order Markov chains, due to the lack of comparable mixing-rate metrics.

Our Contributions: We present an asymptotic analysis of the UD-SGD algorithm (1) under heterogeneous agent dynamics $\{X_n^i\}$ and a large family of communication patterns $\{\mathbf{W}_n\}$. Specifically,

• Under appropriate assumptions, all agents performing (1) asymptotically reach consensus and find $\theta^*$: letting $\theta_n\triangleq\frac{1}{N}\sum_{i=1}^N\theta_n^i$ denote the average model parameter among all agents, for all $i\in[N]$ we have

$\lim_{n\to\infty}\|\theta_n^i-\theta_n\|=0, \qquad \lim_{n\to\infty}\|\theta_n-\theta^*\|=0$  a.s.  (2)

Moreover, we derive the CLT of UD-SGD in the form

$\gamma_n^{-1/2}(\theta_n-\theta^*) \xrightarrow[n\to\infty]{dist.} \mathcal{N}(\mathbf{0},\mathbf{V})$.  (3)

Our framework addresses technical challenges in quantifying the consensus error under various communication patterns and a slowly increasing communication interval. This is a substantial extension of previous studies [58, 43, 51], particularly in regulating the growth of communication intervals (Assumption 2.3-ii) and in proving the boundedness of the scaled consensus error (Lemma B.1). Furthermore, we reformulate UD-SGD as a stochastic-approximation-like iteration and tackle the Markovian noise term using the Poisson equation, a technique previously confined to vanilla SGD with Markovian sampling [17, 38, 52]. The key here is to devise a noise decomposition that separates the consensus error among all agents from the error caused by the bias of the Markov chain, which aligns with the target distribution only asymptotically, not at finite times.

• In analyzing (3), we derive the exact form of $\mathbf{V}$ as $\frac{1}{N^2}\sum_{i=1}^N\mathbf{V}_i$. Here, $\mathbf{V}_i$ is the limiting covariance matrix of agent $i$, which depends mainly on its sampling strategy $\{X_n^i\}$. This allows us to show that improving individual agents' sampling strategies can reduce the covariance in the CLT, which in turn implies a smaller mean-square error (MSE) for large time $n$. This is a significant advancement over previous finite-sample bounds that only account for the worst-performing agent and do not fully capture the effect of individual agent dynamics on overall system performance. Our CLT result (3) also covers the recent findings in [38] as a very special case with $N=1$, so that the relationship therein between the sampling efficiency of the Markov chain and the limiting covariance matrix in the CLT of vanilla SGD carries over to our UD-SGD.

• We demonstrate that our analysis supports recent findings from studies such as [42], which exhibited linear speedup scaling with the number of agents under LSGD-FP with Markovian sampling; and [62, 61], which examined the notion of 'asymptotic network independence' for DSGD with i.i.d. sampling, where the convergence of the algorithm (1) at large time $n$ depends solely on the left eigenvector of $\mathbf{W}_n$ ($\frac{1}{N}\mathbf{1}$ considered in this paper) rather than the specific communication network topology encoded in $\mathbf{W}_n$, but now under Markovian sampling. We extend these findings, in view of the CLT, to a broader range of communication patterns $\{\mathbf{W}_n\}$ and general sampling strategies $\{X_n^i\}$.

• We conduct numerical experiments on logistic regression and neural network training with several choices of agents' sampling strategies, including a recently proposed strategy based on a nonlinear Markov chain [25]. Our results uncover a key phenomenon: a handful of compliant agents adopting highly efficient sampling strategies can match or exceed the performance of the majority using moderately improved strategies. This finding is crucial for practical optimization in large-scale learning systems, moving beyond the current literature that only considers the worst-performing agent in more restrictive settings.

2 Preliminaries

Basic Notations: We use $\|\mathbf{v}\|$ to denote the Euclidean norm of a vector $\mathbf{v}\in\mathbb{R}^d$ and $\|\mathbf{M}\|$ to denote the spectral norm of a matrix $\mathbf{M}\in\mathbb{R}^{d\times d}$. The identity matrix of dimension $d$ is denoted by $\mathbf{I}_d$, and the all-one (resp. all-zero) vector of dimension $N$ is denoted by $\mathbf{1}$ (resp. $\mathbf{0}$). Let $\mathbf{J}\triangleq\mathbf{1}\mathbf{1}^T/N$. The diagonal matrix with the entries of $\mathbf{v}$ on the main diagonal is written as $\text{diag}(\mathbf{v})$. We also use '$\succeq$' for the Loewner ordering, such that $\mathbf{A}\succeq\mathbf{B}$ is equivalent to $\mathbf{x}^T(\mathbf{A}-\mathbf{B})\mathbf{x}\geq 0$ for any $\mathbf{x}\in\mathbb{R}^d$.

Asymptotic Covariance Matrix: Asymptotic variance is a widely used metric for evaluating the second-order properties of Markov chains associated with a scalar-valued test function in the MCMC literature, e.g., Chapter 6.3 of [12], and the asymptotic covariance matrix is its multivariate version for a vector-valued function. Specifically, we consider a finite, irreducible, aperiodic and positive recurrent (ergodic) Markov chain $\{X_n\}_{n\geq 0}$ with transition matrix $\mathbf{P}$ and stationary distribution $\boldsymbol{\pi}$, and the estimator $\hat{\mu}_n(\mathbf{g})\triangleq\frac{1}{n}\sum_{s=0}^{n-1}\mathbf{g}(X_s)$ for any vector-valued function $\mathbf{g}:[N]\to\mathbb{R}^d$. According to the ergodic theorem [12, 13], we have $\lim_{n\to\infty}\hat{\mu}_n(\mathbf{g})=\mathbb{E}_{\boldsymbol{\pi}}(\mathbf{g})$ a.s. As defined in [13, 38], the asymptotic covariance matrix $\boldsymbol{\Sigma}_X(\mathbf{g})$ for a vector-valued function $\mathbf{g}(\cdot)$ is given by

$\boldsymbol{\Sigma}_X(\mathbf{g}) \triangleq \lim_{n\to\infty} n\cdot\text{Var}(\hat{\mu}_n(\mathbf{g})) = \lim_{n\to\infty}\frac{1}{n}\cdot\mathbb{E}\{\Delta_n\Delta_n^T\}$,  (4)

where $\Delta_n\triangleq\sum_{s=0}^{n-1}(\mathbf{g}(X_s)-\mathbb{E}_{\boldsymbol{\pi}}(\mathbf{g}))$. By following the algebraic manipulations in [12, Theorem 6.3.7] for the asymptotic variance (the univariate version), we can rewrite (4) in matrix form as

$\boldsymbol{\Sigma}_X(\mathbf{g}) = \mathbf{G}^T\text{diag}(\boldsymbol{\pi})\left(2\mathbf{Z}-\mathbf{I}_N-\mathbf{1}\boldsymbol{\pi}^T\right)\mathbf{G}$,  (5)

where $\mathbf{G}\triangleq[\mathbf{g}(1),\cdots,\mathbf{g}(N)]^T\in\mathbb{R}^{N\times d}$ and $\mathbf{Z}\triangleq[\mathbf{I}_N-\mathbf{P}+\mathbf{1}\boldsymbol{\pi}^T]^{-1}$ is the fundamental matrix. This matrix form explicitly shows the dependence on the transition matrix $\mathbf{P}$ and its stationary distribution $\boldsymbol{\pi}$, and will be utilized in our Theorem 3.3.
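Equation (5) is directly computable for small chains. Below is a short sketch (the helper name is our own) that evaluates $\boldsymbol{\Sigma}_X(\mathbf{g})$ from $\mathbf{P}$, $\boldsymbol{\pi}$ and $\mathbf{G}$; as a sanity check, for i.i.d. sampling ($\mathbf{P}=\mathbf{1}\boldsymbol{\pi}^T$, hence $\mathbf{Z}=\mathbf{I}_N$) it reduces to the stationary covariance of $\mathbf{g}$.

```python
import numpy as np

def asymptotic_covariance(P, pi, G):
    """Asymptotic covariance matrix Sigma_X(g) of eq. (5).

    P:  (N, N) ergodic transition matrix with stationary distribution pi.
    pi: (N,) stationary distribution.
    G:  (N, d) matrix whose i-th row is g(i).
    """
    N = len(pi)
    Pi = np.outer(np.ones(N), pi)           # 1 pi^T
    Z = np.linalg.inv(np.eye(N) - P + Pi)   # fundamental matrix
    return G.T @ np.diag(pi) @ (2 * Z - np.eye(N) - Pi) @ G
```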

Model Description: The UD-SGD in (1) can be expressed in the compact iterative form

$\theta_{n+1}^i = \sum_{j=1}^N w_n(i,j)\left(\theta_n^j - \gamma_{n+1}\nabla F_j(\theta_n^j, X_n^j)\right)$,  (6)

at each time $n$, where each agent $i$ samples according to its own Markovian trajectory $\{X_n^i\}_{n\geq 0}$ with stationary distribution $\boldsymbol{\pi}_i$ such that $\mathbb{E}_{X\sim\boldsymbol{\pi}_i}[F_i(\theta,X)]=f_i(\theta)$. Let $K_l$ denote the communication interval between the $(l-1)$-th and $l$-th aggregation among the $N$ agents, and let $n_l\triangleq\sum_{m=1}^l K_m$ be the time instance of the $l$-th aggregation. We also define $\tau_n\triangleq\min\{l: n_l\geq n\}$ as the index of the upcoming aggregation at time $n$, so that $K_{\tau_n}$ indicates the communication interval for the $\tau_n$-th aggregation, or more precisely, the length of the communication interval that includes the time index $n$. The communication pattern is $\mathbf{W}_n=\mathbf{I}_N$ if $n\neq n_l$ and $\mathbf{W}_n=\mathbf{W}$ otherwise for $l\geq 1$, where examples of $\mathbf{W}$ are discussed in Section F.1. Note that i) when $K_l=1$, (6) reduces to DSGD; ii) when $K_l=K>1$, (6) becomes local SGD in FL; iii) when $K_l$ increases with $l$, we recover some choices of $K_l$ studied in [51] beyond LSGD-RC with i.i.d. sampling. This increasing communication interval aims to further reduce the frequency of aggregation among agents for lower communication costs, but now under a Markovian sampling setting and a wider range of communication patterns. We next state the assumptions needed for the main theoretical results.

Assumption 2.1 (Regularity of the gradient).

For each $i\in[N]$ and $X\in\mathcal{X}_i$, the function $F_i(\theta,X)$ is $L$-smooth in $\theta$, i.e., for any $\theta_1,\theta_2\in\mathbb{R}^d$,

$\|\nabla F_i(\theta_1,X)-\nabla F_i(\theta_2,X)\| \leq L\|\theta_1-\theta_2\|$.  (7)

In addition, we assume that the objective function $f$ is twice continuously differentiable and $\mu$-strongly convex around the local minimum $\theta^*\in\mathcal{L}$, i.e.,

$\mathbf{H}\triangleq\nabla^2 f(\theta^*)\succeq\mu\mathbf{I}_d$.  (8)

Assumption 2.1 imposes regularity conditions on the gradient $\nabla F_i(\cdot,X)$ and on the Hessian matrix of the objective function $f(\cdot)$, as is commonly assumed in [10, 45, 29, 38]. Note that (7) requires per-sample Lipschitzness of $\nabla F_i$ and is stronger than Lipschitzness of its expected version $\nabla f_i$, which is commonly assumed in the i.i.d. sampling setting [73, 50, 30]. However, we remark that this is in line with previous work on DSGD and LSGD-FP under Markovian sampling as well [71, 42, 80], because $\nabla F_i(\theta,X)$ is no longer an unbiased stochastic version of $\nabla f_i(\theta)$ and the effect of $\{X_n^i\}$ has to be taken into account in the analysis. Local strong convexity at the minimizer is commonly assumed to analyze the convergence of the algorithm in both asymptotic and non-asymptotic analyses [10, 29, 38, 45, 52, 80].

Assumption 2.2 (Ergodicity of Markovian sampling).

$\{X_n^i\}_{n\geq 0}$ is an ergodic Markov chain with stationary distribution $\boldsymbol{\pi}_i$ such that $\mathbb{E}_{X\sim\boldsymbol{\pi}_i}[F_i(\theta,X)]=f_i(\theta)$, and is independent of $\{X_n^j\}_{n\geq 0}$ for $j\neq i$.

The ergodicity of the underlying Markov chains, as stated in Assumption 2.2, is commonly assumed in the literature [26, 68, 80, 42, 38]. This assumption ensures the asymptotic unbiasedness of the loss function $F_i(\theta,\cdot)$ and takes i.i.d. sampling as a special case.

Assumption 2.3 (Decreasing step size and slowly increasing communication interval).

i) For a bounded communication interval $K_{\tau_n}\leq K$ for all $n$, we assume the polynomial step size $\gamma_n=1/n^a$ with $a\in(0.5,1]$; or ii) if $K_{\tau_n}\to\infty$ as $n\to\infty$, we assume $\gamma_n=1/n$ and define $\eta_n=\gamma_n K_{\tau_n}^{L+1}$, where the sequence $\{K_l\}_{l\geq 0}$ satisfies $\sum_n\eta_n^2<\infty$, $K_{\tau_n}=o(\gamma_n^{-1/(2(L+1))})$, and $\lim_{l\to\infty}\eta_{n_l+1}/\eta_{n_{l+1}+1}=1$.

In Assumption 2.3, the polynomial step size $\gamma_n$ is standard in the literature and satisfies $\sum_n\gamma_n=\infty$, $\sum_n\gamma_n^2<\infty$ [17, 38]. Inspired by [51], we introduce $\eta_n$ to control the step size within the $l$-th communication interval of length $K_l$ and to restrict the growth of $K_l$. Specifically, $\sum_n\eta_n^2<\infty$ and $K_{\tau_n}=o(\gamma_n^{-1/(2(L+1))})$ ensure that $\eta_n\to 0$ and that $K_{\tau_n}$ does not increase too fast in $n$, while $\lim_{l\to\infty}\eta_{n_l+1}/\eta_{n_{l+1}+1}=1$ restricts the increment from $n_l$ to $n_{l+1}$. Several practical forms of $K_l$ suggested by [51], including $K_l\sim\log(l)$ and $K_l\sim\log\log(l)$, also satisfy Assumption 2.3-ii). We defer to Appendix A the mathematical verification of these two types of $K_l$, together with the practical implications of an increasing communication interval $K_l$.
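The conditions of Assumption 2.3-ii) can also be checked numerically for a given schedule. Below is an illustrative sketch (boundary indices handled loosely, parameters of our own choosing) for $K_l\sim\log(l)$ with $\gamma_n=1/n$: the partial sums of $\eta_n^2$ stay bounded and the ratio $\eta_{n_l+1}/\eta_{n_{l+1}+1}$ approaches 1.

```python
import numpy as np

L_smooth = 1                 # L from Assumption 2.1 (illustrative value)
num_aggs = 3000              # number of aggregations simulated

# Communication intervals K_l ~ log(l) and aggregation times n_l.
K = np.maximum(1, np.ceil(np.log(np.arange(1, num_aggs + 1)))).astype(int)
n_agg = np.cumsum(K)         # n_l = K_1 + ... + K_l
horizon = n_agg[-1]

n = np.arange(1, horizon + 1)
tau = np.searchsorted(n_agg, n)         # tau_n: index of upcoming aggregation
eta = K[tau] ** (L_smooth + 1) / n      # eta_n = gamma_n * K_{tau_n}^{L+1}

print("partial sum of eta_n^2:", (eta ** 2).sum())    # stays bounded
print("eta ratio across aggregations:",
      eta[n_agg[-3]] / eta[n_agg[-2]])                # -> 1
```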

Remark 1.

In Assumption 2.3, we incorporate an increasing communication interval along with the step size $\gamma_n=1/n$. This complements the choice of step size in [51, Assumption 3.3], where $\gamma_n=1/n^a$ for $a\in(0.5,1)$. It is important to note, however, that the increasing communication interval specified in [51, Assumption 3.2] is applicable only under i.i.d. sampling. Under the Markovian sampling framework, the expression $\nabla F_i(\theta,X)-\nabla f_i(\theta)$ loses its unbiasedness and Martingale-difference property. Consequently, the Martingale CLT as utilized by [51] does not directly extend to Markovian sampling. To address this, we adapt techniques from [29, 58] to accommodate the increasing communication interval within the Markovian sampling setting and various communication patterns. This adaptation necessitates $\gamma_n=1/n$, a specification not covered in [51]. Exploring more general forms of $K_l$ that could relax this assumption is outside the scope of our current study.

Assumption 2.4 (Stability on model parameter).

We assume $\sup_n\|\theta_n^i\|<\infty$ almost surely for all $i\in[N]$.

Assumption 2.4 states that the sequence $\{\theta_n^i\}$ always remains in a path-dependent compact set. It ensures the stability of the algorithm for the purpose of convergence analysis and is often assumed in the asymptotic analysis of vanilla SGD with Markovian noise [23, 29, 52]. As mentioned in [58, 70], checking Assumption 2.4 is challenging and requires case-by-case analysis, even under i.i.d. sampling. Only recently has the stability of SGD under Markovian sampling been studied in [9], and the corresponding result for UD-SGD remains unknown in the literature. Thus, we analyze each agent's sampling strategy in the asymptotic regime under this stability condition.

Assumption 2.5 (Contraction property of communication matrix).

i) $\{\mathbf{W}_n\}_{n\geq 0}$ is independent of the sampling strategies $\{X_n^i\}_{n\geq 0}$ for all $i\in[N]$ and is doubly stochastic for all $n$; ii) at each aggregation step $n_l$, $\mathbf{W}_{n_l}$ is independently generated from some distribution $\mathcal{P}_{n_l}$ such that $\|\mathbb{E}_{\mathbf{W}\sim\mathcal{P}_{n_l}}[\mathbf{W}^T\mathbf{W}]-\mathbf{J}\|\leq C_1<1$ for some constant $C_1$.

The double stochasticity of $\mathbf{W}_n$ in Assumption 2.5-i) is widely assumed in the literature [54, 24, 43, 80]. Assumption 2.5-ii) is a contraction property ensuring that agents employing UD-SGD asymptotically achieve consensus, which is also common in [7, 24, 80]. Examples of $\mathbf{W}$ that satisfy Assumption 2.5-ii), e.g., the Metropolis-Hastings matrix and partial agent participation in FL, are deferred to Appendix F.1 due to space constraints.

3 Asymptotic Analysis of UD-SGD

Almost Sure Convergence: Let $\theta_n\triangleq\frac{1}{N}\sum_{i=1}^N\theta_n^i$ represent the consensus (average) among all agents at time $n$. We establish the asymptotic consensus of the local parameters $\theta_n^i$, as stated in Lemma 3.1.

Lemma 3.1.

Under Assumptions 2.1, 2.3, 2.4 and 2.5, the consensus error $\theta_n^i-\theta_n$ diminishes to zero at the rate specified below: almost surely, for every agent $i\in[N]$,

$\|\theta_n^i-\theta_n\| = \begin{cases} O(\gamma_n) & \text{under Assumption 2.3-i)}, \\ O(\eta_n) & \text{under Assumption 2.3-ii)}. \end{cases}$  (9)

Lemma 3.1 indicates that all agents asymptotically reach consensus at a rate of $O(\gamma_n)$ (or $O(\eta_n)$). This finding extends the scope of [58, Proposition 1] by incorporating Markovian sampling, FL settings, and increasing communication intervals $K_l$. The proof, detailed in Appendix B, primarily tackles the challenge of establishing the boundedness of the sequences $\{\gamma_n^{-1}(\theta_n^i-\theta_n)\}$ (or $\{\eta_n^{-1}(\theta_n^i-\theta_n)\}$) almost surely for all $i\in[N]$; this is specifically analyzed in Lemma B.1. Next, with the additional Assumption 2.2, we obtain the almost sure convergence of $\theta_n$ to $\theta^*\in\mathcal{L}$.

Theorem 3.2.

Under Assumptions 2.1 - 2.5, the consensus $\theta_n$ converges to $\mathcal{L}$ almost surely, i.e.,

$\limsup_n \inf_{\theta^*\in\mathcal{L}} \|\theta_n-\theta^*\| = 0$  a.s.  (10)

Theorem 3.2 is achieved by decomposing the Markovian noise term $\nabla F_i(\theta_n^i,X_n^i)-\nabla f_i(\theta_n^i)$, using the Poisson equation technique as discussed in [6, 29, 17], into a Martingale difference noise term along with additional noise terms. We then reformulate (6) into an iteration akin to stochastic approximation, as depicted in (56). The subsequent step involves verifying the conditions on these noise terms under our stated assumptions. Crucially, this theorem also establishes that UD-SGD ensures almost sure convergence of each agent to a local minimum $\theta^*\in\mathcal{L}$, even in scenarios where the communication interval $K_l$ gradually increases, in accordance with Assumption 2.3-ii). The detailed proof of this theorem is provided in Appendix C.

Central Limit Theorem: Let $\mathbf{U}_i\triangleq\boldsymbol{\Sigma}_{X^i}(\nabla F_i(\theta^*,\cdot))$ represent the asymptotic covariance matrix (defined in (5)) associated with each agent $i\in[N]$, given its sampling strategy $\{X_n^i\}$ and the function $\nabla F_i(\theta^*,\cdot)$. Define $\mathbf{U}\triangleq\frac{1}{N^2}\sum_{i=1}^N\mathbf{U}_i$. We assume the polynomial step size $\gamma_n\sim\gamma_\star/n^a$ with $a\in(0.5,1]$ and $\gamma_\star>0$. In the case of $a=1$, we further assume $\gamma_\star>1/(2\mu)$, where $\mu$ is defined in (8). For notational simplicity, and without loss of generality, our remaining CLT result is stated conditioning on the event $\{\theta_n\to\theta^*\}$ for some $\theta^*\in\mathcal{L}$.

Theorem 3.3.

Let Assumptions 2.1 - 2.5 hold. Then,

$\gamma_n^{-1/2}(\theta_n-\theta^*) \xrightarrow[n\to\infty]{dist.} \mathcal{N}(\mathbf{0},\mathbf{V})$,  (11)

where the limiting covariance matrix $\mathbf{V}$ is of the form

$\mathbf{V} = \int_0^\infty e^{\mathbf{M}t}\,\mathbf{U}\,e^{\mathbf{M}t}\,dt$.  (12)

Here, $\mathbf{M}=-\mathbf{H}$ if $a\in(0.5,1)$, or $\mathbf{M}=\mathbf{I}_d/(2\gamma_\star)-\mathbf{H}$ if $a=1$, where $\mathbf{H}$ is defined in (8).

Moreover, let $\bar{\theta}_n=\frac{1}{n}\sum_{s=0}^{n-1}\theta_s$ and $\mathbf{V}'=\mathbf{H}^{-1}\mathbf{U}\mathbf{H}^{-1}$. For $a\in(0.5,1)$, we have

$\sqrt{n}(\bar{\theta}_n-\theta^*) \xrightarrow[n\to\infty]{dist.} \mathcal{N}(\mathbf{0},\mathbf{V}')$.  (13)

The proof, presented in Appendix D, addresses the technical challenges in deriving the CLT for UD-SGD, specifically the second-order conditions in decomposing the Markovian noise term, which are not present in the i.i.d. sampling case [58, 43, 51]. We decompose $\nabla F_i(\theta_n,X_n^i)-\nabla f_i(\theta_n)$ into three parts in (48) using the Poisson equation: $e_{n+1}^i$, $\nu_{n+1}^i$, $\xi_{n+1}^i$. The consensus error $\theta_n^i-\theta_n$ embedded in the noise terms $e_{n+1}^i$ and $\xi_{n+1}^i$ is a new factor, whose characteristics have been quantified in our Lemma 3.1 but are not present in the single-agent scenario analyzed as an application of stochastic approximation in [22, 29]. The specifics of this analysis are expanded upon in Appendices D.1 to D.3. We require $\gamma_\star>1/(2\mu)$ for $a=1$ to ensure that the largest eigenvalue of $\mathbf{M}$ is negative, a necessary condition for the existence of $\mathbf{V}$ in (12) (otherwise the integral diverges). In the case of a single agent ($N=1$), $\mathbf{V}$ and $\mathbf{V}'$ reduce to the matrices specified in the CLT results for vanilla SGD [29, 38, 52]. In addition, for the special case of a constant communication interval in Assumption 2.3-i) and i.i.d. sampling as shown in Table 1, we recover the CLT of LSGD-RC in [51]. See Appendix E for detailed discussions.

Theorem 3.3 has significant implications for the MSE of $\{\theta_n\}$ for large time $n$: $\mathbb{E}[\|\theta_n-\theta^*\|^2] = \sum_{i=1}^d\mathbf{e}_i^T\mathbb{E}[(\theta_n-\theta^*)(\theta_n-\theta^*)^T]\mathbf{e}_i \approx \gamma_n\sum_{i=1}^d\mathbf{e}_i^T\mathbf{V}\mathbf{e}_i = \gamma_n\text{Tr}(\mathbf{V})$, where $\mathbf{e}_i$ is the $d$-dimensional vector of all zeros except a $1$ at the $i$-th entry. This indicates that a smaller limiting covariance matrix $\mathbf{V}$, in the Loewner order, results in a smaller trace of $\mathbf{V}$ and consequently a reduced MSE for large $n$. Considerations for obtaining a smaller $\mathbf{V}$ are presented in the next section, where agents have the opportunity to improve their sampling strategies.
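Since $\mathbf{M}$ is Hurwitz, the integral in (12) equals the solution of the Lyapunov equation $\mathbf{M}\mathbf{V}+\mathbf{V}\mathbf{M}^T=-\mathbf{U}$, which gives a simple way to evaluate $\mathbf{V}$ and the resulting MSE level $\gamma_n\text{Tr}(\mathbf{V})$ numerically. The sketch below uses toy matrices of our own choosing, not values from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def limiting_covariance(H, U, a, gamma_star=1.0):
    """V = int_0^inf e^{Mt} U e^{Mt} dt of eq. (12), computed by solving
    the Lyapunov equation M V + V M^T = -U (M is symmetric Hurwitz here)."""
    d = H.shape[0]
    M = -H if a < 1 else np.eye(d) / (2 * gamma_star) - H
    return solve_continuous_lyapunov(M, -U)

# Toy 2-d example: H = Hessian at theta* (mu = 1), U = (1/N^2) sum_i U_i.
H = np.diag([2.0, 1.0])
U = np.array([[1.0, 0.2],
              [0.2, 0.5]])
V = limiting_covariance(H, U, a=1.0, gamma_star=1.0)  # needs gamma_* > 1/(2 mu)
n = 10_000
print("MSE approx at n = 10_000:", np.trace(V) / n)   # gamma_n Tr(V), gamma_n = 1/n
```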

Remark 2.

Studies by [62, 61] have shown that in DSGD with a fixed doubly-stochastic matrix $\mathbf{W}$, the influence of the communication topology diminishes after a transient period. Our Theorem 3.3 extends these findings to Markovian sampling and the broader spectrum of communication patterns in Table 1. This extension is based on the fact that the consensus error, impacted primarily by the communication pattern, decreases faster than the CLT scale $O(\sqrt{\gamma_n})$ and is thus not the dominant factor in the asymptotic regime, as suggested by Lemma 3.1.

Remark 3.

Recent studies have highlighted linear speedup with an increasing number of agents $N$ in the dominant term of their finite-sample error bounds under DSGD-CT with i.i.d. sampling [43] and LSGD-FP with Markovian sampling [42]. However, our Theorem 3.3 demonstrates this phenomenon under the more diverse communication patterns and Markovian sampling in Table 1 via the leading term $\mathbf{V}$ in our CLT. Specifically, it scales with $1/N$, i.e., $\mathbf{V}=\bar{\mathbf{V}}/N$, where $\bar{\mathbf{V}}=\frac{1}{N}\sum_{i=1}^N\mathbf{V}_i$ denotes the average of the limiting covariance matrices across all $N$ agents and $\mathbf{V}_i=\int_0^\infty e^{\mathbf{M}t}\mathbf{U}_i e^{\mathbf{M}t}dt$, suggesting that the MSE $\mathbb{E}[\|\theta_n-\theta^*\|^2]$ improves by a factor of $1/N$. A similar argument also applies to $\mathbf{V}'$ in (13), i.e., $\mathbf{V}'=\bar{\mathbf{V}}'/N$, where $\bar{\mathbf{V}}'=\frac{1}{N}\sum_{i=1}^N\mathbf{V}'_i$ and $\mathbf{V}'_i=\mathbf{H}^{-1}\mathbf{U}_i\mathbf{H}^{-1}$.

Impact of Agent’s Sampling Strategy: In the literature, the mixing time-based technique has been widely used in the non-asymptotic analysis in SGD, DSGD and various LSGD variants in FL [26, 68, 69, 80, 42], i.e., for each agent i[N]i\in[N] and some constant CC,

Fi(θ,Xni)fi(θ)Cθρin,\|\nabla F_{i}(\theta,X^{i}_{n})-\nabla f_{i}(\theta)\|\leq C\|\theta\|\rho_{i}^{n}, (14)

where ρi\rho_{i} is the mixing rate of the underlying Markov chain. However, typical non-asymptotic analyses often rely on ρmaxiρi\rho\triangleq\max_{i}\rho_{i} among NN agents, i.e., the worst-performing agent in their finite-time bounds [80, 72], or assume an identical mixing rate across all NN agents [42, 69].

In contrast, Remark 3 highlights that each agent holds its own limiting covariance matrices $\mathbf{V}_i$ and $\mathbf{V}'_i$, which are predominantly governed by the matrix $\mathbf{U}_i$ capturing the agent's sampling strategy $\{X_n^i\}$, and contributes equally to the overall performance of UD-SGD. For each agent $i$, denote by $\mathbf{U}_i^X$ and $\mathbf{U}_i^Y$ the asymptotic covariance matrices associated with two candidate sampling strategies $\{X_n^i\}$ and $\{Y_n^i\}$, respectively. Let $\mathbf{V}^X$ and $\mathbf{V}^Y$ be the limiting covariance matrices of the distributed system in (12) when agent $i$ employs $\{X_n^i\}$ and $\{Y_n^i\}$, respectively, while keeping the other agents' sampling strategies unchanged. Then, we have the following result.

Corollary 3.4.

For agent $i$, if there exist two sampling strategies $\{X_n^i\}_{n\geq 0}$ and $\{Y_n^i\}_{n\geq 0}$ such that $\mathbf{U}_i^X\succeq\mathbf{U}_i^Y$, then $\mathbf{V}^X\succeq\mathbf{V}^Y$.

Corollary 3.4 follows directly from the definition of the Loewner ordering and from the Loewner ordering being closed under addition (i.e., $\mathbf{A}\succeq\mathbf{B}$ implies $\mathbf{A}+\mathbf{C}\succeq\mathbf{B}+\mathbf{C}$). It demonstrates that even when a single agent improves its sampling strategy from $\{X_n^i\}$ to $\{Y_n^i\}$, this leads to an overall reduction in $\mathbf{V}$ (in terms of the Loewner ordering), thereby decreasing the MSE and benefiting the entire group of $N$ agents. The subsequent question arises: how do we identify an improved sampling strategy $\{Y_n^i\}$ over the baseline $\{X_n^i\}$?

This question has been partially addressed by [57, 48, 38], which qualitatively investigate the 'efficiency ordering' of two sampling strategies. In particular, [38, Theorem 3.6(i)] shows that sampling strategy $\{Y_n\}$ is more efficient than $\{X_n\}$ if and only if $\boldsymbol{\Sigma}_X(\mathbf{g})\succeq\boldsymbol{\Sigma}_Y(\mathbf{g})$ for any vector-valued function $\mathbf{g}(\cdot)\in\mathbb{R}^d$. Consequently, in the UD-SGD framework, employing a more efficient sampling strategy $\{Y_n^i\}$ over the baseline $\{X_n^i\}$ by agent $i$ leads to $\boldsymbol{\Sigma}_{X^i}(\nabla F_i(\theta^*,\cdot))\succeq\boldsymbol{\Sigma}_{Y^i}(\nabla F_i(\theta^*,\cdot))$, thus satisfying $\mathbf{U}_i^X\succeq\mathbf{U}_i^Y$. This, as per Corollary 3.4, implies an overall improvement in UD-SGD.
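Such an ordering can also be verified numerically. The sketch below (our own construction) reuses `asymptotic_covariance` from the snippet after (5): it compares a reversible random walk $\mathbf{P}$ with its 'lazy' version $(\mathbf{P}+\mathbf{I})/2$, which keeps the same stationary distribution but is Peskun-dominated by $\mathbf{P}$ and hence less efficient, so the gap $\boldsymbol{\Sigma}_{\text{lazy}}-\boldsymbol{\Sigma}_P$ should be positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)

# A reversible chain: random walk on a weighted complete graph.
Wt = rng.random((5, 5))
Wt = Wt + Wt.T                           # symmetric edge weights
P = Wt / Wt.sum(axis=1, keepdims=True)   # P_{ij} proportional to Wt_{ij}
pi = Wt.sum(axis=1) / Wt.sum()           # stationary distribution

P_lazy = 0.5 * (P + np.eye(5))           # same pi, less efficient chain

G = rng.standard_normal((5, 3))          # arbitrary test function g: [5] -> R^3
gap = asymptotic_covariance(P_lazy, pi, G) - asymptotic_covariance(P, pi, G)
print("eigenvalues of Sigma_lazy - Sigma_P:", np.linalg.eigvalsh(gap))  # all >= 0
```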

For illustration purposes, we list a few examples where two competing sampling strategies follow the efficiency ordering: i) when an agent has complete access to the entire dataset (e.g., deep learning), shuffling techniques like single shuffling and random reshuffling are more efficient than i.i.d. sampling [38, 79]; ii) when an agent works with a graph-like data structure and employs a random walk, e.g., agent $i$ in Figure 1, a non-backtracking random walk (NBRW) is more efficient than a simple random walk (SRW) [48]; iii) the recently proposed self-repellent random walk (SRRW) achieves near-zero sampling variance, indicating even higher sampling efficiency than NBRW and SRW [25].^3 This random-walk-based sampling finds a particular application in large-scale FL within D2D networks (e.g., mobile networks, wireless sensor networks), where each agent acts as an edge server or access point gathering information from its local D2D network [37, 32]. Employing a random walk over the local D2D network for each agent constitutes the sampling strategy.

^3 Note that SRRW is a nonlinear Markov chain that depends on the relative visit counts of each node in the graph. While its application in single-agent optimization has been studied in [39], expanding the theoretical examination of SRRW to multi-agent scenarios is beyond the scope of this paper. However, we can still numerically evaluate the performance of UD-SGD with multiple agents on general communication matrices using SRRW as a highly efficient sampling strategy in Section 4.

Theorem 3.3 and Corollary 3.4 not only qualitatively compare these sampling strategies but also allow for a quantitative assessment of the overall system enhancement. Since every agent contributes equally to the limiting covariance matrix 𝐕\mathbf{V} of the distributed system as in Remark 3, a key application scenario is to encourage a subset of compliant agents to adopt highly efficient strategies like SRRW, potentially yielding better performance than universally upgrading to slightly improved strategies like NBRW. This approach, more feasible and impactful in large-scale machine learning scenarios where some agents cannot freely modify their sampling strategies, is a unique aspect of our framework not addressed in previous works focusing on the worst-performing agent [80, 42, 69, 72].

4 Experiments

In this section, we empirically evaluate the effect of agents' sampling strategies under various communication patterns in UD-SGD. We consider the $L_2$-regularized binary classification problem

$\min_\theta f(\theta) \triangleq \frac{1}{N}\sum_{i=1}^N f_i(\theta), \quad\text{with}\quad f_i(\theta) = \frac{1}{B}\sum_{j=1}^B\left[\log\left(1+e^{\theta^T\mathbf{x}_{i,j}}\right)-y_{i,j}\,\theta^T\mathbf{x}_{i,j}\right]+\frac{\kappa}{2}\|\theta\|^2$,  (15)

where the feature vector $\mathbf{x}_{i,j}$ and its corresponding label $y_{i,j}$ are held by agent $i$, with the penalty parameter $\kappa$ set to $1$. We use the ijcnn1 dataset [14] with $22$ features per data point and $50$k data points in total, evenly distributed to two groups of $50$ agents each ($N=100$ agents in total), where each agent holds $B=500$ distinct data points. Each agent in the first group has full access to its entire dataset, and thus can employ i.i.d. sampling (baseline) or single shuffling. On the other hand, each agent in the other group has a graph-like structure and uses SRW (baseline), NBRW or SRRW with reweighting to sample its local dataset with uniform weight. In this simulation, we assume that agents can only communicate through a communication network using the DSGD algorithm. This scenario with heterogeneous agents, as depicted in Figure 1, is of great interest in large-scale machine learning [37, 32]. In addition, we employ a decreasing step size $\gamma_n=1/n$ in our UD-SGD framework (1), as it is typically used for strongly convex objective functions and was found to give the fastest convergence in this simulation setup. Due to space constraints, we defer the detailed simulation setup, including the introduction of SRW, NBRW, and SRRW, to Section G.1.
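For reference, the per-sample stochastic gradient of (15) that each agent would feed into the UD-SGD update (1a) has the closed form sketched below (labels taken as $y_{i,j}\in\{0,1\}$ and the regularizer applied per sample; both are conventions assumed for this sketch).

```python
import numpy as np

def grad_F(theta, x, y, kappa=1.0):
    """Per-sample gradient of the regularized logistic loss in (15):
    (sigmoid(theta^T x) - y) * x + kappa * theta."""
    sig = 1.0 / (1.0 + np.exp(-theta @ x))
    return (sig - y) * x + kappa * theta
```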

Figure 2: Binary classification problem. From left to right: (a) Impact of efficient sampling strategies on convergence. (b) Performance gains from partial adoption of efficient sampling. (c) Comparative advantage of SRRW over NBRW in a small subset of agents. (d) Asymptotic network independence of four algorithms under UD-SGD framework with fixed sampling strategy (shuffling, SRRW). (e) Different sampling strategies in the DSGD algorithm with time-varying topology (DSGD-VT). (f) Different sampling strategies in the DFL algorithm with increasing communication interval.

The simulation results are obtained over $120$ independent trials. In Figure 2(a), we assume that the first group of agents performs either i.i.d. sampling or the shuffling method, while the agents in the other group all change their sampling strategies from the baseline SRW to NBRW and SRRW, as shown in the legend. This plot shows that an improved sampling strategy leads to an overall convergence speedup, since NBRW and SRRW are more efficient than SRW [38, 25]. Furthermore, it illustrates that SRRW is significantly more efficient than NBRW in this simulation setup, i.e., SRRW $\gg$ NBRW $>$ SRW in terms of sampling efficiency. While keeping the second group of agents unchanged, we can see that the shuffling method outperforms i.i.d. sampling with a smaller asymptotic MSE. However, the shuffling method may not perform well for small time $n$ due to slow mixing in the initial period, which is also observed in the single-agent scenario in [65, 1, 38]. The error bars therein also indicate that the random-walk sampling strategy has a significant impact on the overall system performance, and that SRRW has smaller variance than NBRW and SRW.

In Figure 2(b), we let the first group of agents perform i.i.d. sampling while upgrading only a portion of the agents in the second group from SRW to SRRW; e.g., '30 SRW, 20 SRRW' in the legend means that 30 agents use SRW while the remaining 20 agents in the second group upgrade to SRRW. We observe that having more agents upgrade from SRW to SRRW leads to a smaller asymptotic MSE, as predicted by Theorem 3.3 and Remark 3. This MSE reduction does not scale linearly with the number of agents adopting SRRW, because each agent holds its own dataset, which need not be identical across agents, resulting in different individual limiting covariance matrices $\mathbf{V}_i\neq\mathbf{V}_j$.

While maintaining i.i.d. sampling for the first group of agents, we compare the performance when the second group of agents in Figure 2(c) employ NBRW or SRRW. Remarkably, the case with only 10 out of the 50 agents in the second group adopting the far more efficient sampling strategy (40 SRW, 10 SRRW), through incentives or compliance, already produces a smaller MSE than all 50 agents using the slightly better strategy (50 NBRW). The performance gap becomes even more pronounced when 20 agents upgrade from SRW to SRRW (30 SRW, 20 SRRW). This shows that the performance of a distributed system can be improved significantly when even a small proportion of agents adopt highly efficient sampling strategies.

Figure 2(d) empirically illustrates the asymptotic network independence property via four algorithms under our UD-SGD framework: centralized SGD (communication interval $K=1$, communication matrix $\mathbf{W}=\mathbf{1}\mathbf{1}^T/N$); LSGD-FP (FL with full client participation, $K=5$, $\mathbf{W}=\mathbf{1}\mathbf{1}^T/N$); DSGD-VT (DSGD with time-varying topologies, randomly chosen from $5$ doubly stochastic matrices); and DFL (decentralized FL with a fixed MH-generated $\mathbf{W}$ and increasing communication interval $K_l=\max\{1,\log(l)\}$ after the $l$-th aggregation). We fix the sampling strategy (shuffling, SRRW) throughout this plot. All four algorithms overlap around $1000$ steps, implying that they have entered the asymptotic regime with similar performance where the CLT result dominates, confirming asymptotic network independence in the long run.

Figures 2(e) and 2(f) show the performance of different sampling strategies in the DSGD-VT and DFL algorithms in terms of MSE. Both plots consistently demonstrate that improving agents' sampling strategies (e.g., shuffling > i.i.d. sampling, and SRRW > NBRW > SRW) leads to faster convergence with smaller MSE, supporting our theory.

Furthermore, in Section G.2, we simulate an image classification task on the CIFAR-10 dataset [44] by training a $5$-layer CNN and a ResNet-18 model collaboratively over a $10$-agent network. The result is illustrated in Figure 3, where SRRW outperforms NBRW and SRW as expected. In summary, we find that upgrading even a small portion of agents to efficient sampling strategies (e.g., the shuffling method, NBRW, or SRRW under different dataset structures) improves system performance in UD-SGD. These results are consistent across binary and image classification tasks, underscoring that every agent matters in distributed learning.

5 Conclusion

In this work, we develop the UD-SGD framework, which establishes the CLT of various distributed algorithms with Markovian sampling. We overcome technical challenges such as quantifying the consensus error under very general communication patterns and decomposing the Markovian noise through the Poisson equation, which extends the analysis beyond the single-agent scenario. We demonstrate that even if only a few agents optimize their sampling strategies, the entire distributed system benefits through a smaller limiting covariance in the CLT, implying a reduced MSE. This finding challenges currently established upper bounds in which the worst-performing agent leads the pack. Future studies could pivot towards developing fine-grained finite-time bounds that individually characterize each agent's behavior, and towards theoretically analyzing the effect of SRRW in UD-SGD.

6 Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their constructive comments. This work was supported in part by National Science Foundation under Grant Nos. CNS-2007423, IIS-1910749, and IIS-2421484.

References

  • Ahn et al. [2020] Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 17526–17535, 2020.
  • Alon et al. [2007] Noga Alon, Itai Benjamini, Eyal Lubetzky, and Sasha Sodin. Non-backtracking random walks mix faster. Communications in Contemporary Mathematics, 9(04):585–603, 2007.
  • Andrieu et al. [2005] Christophe Andrieu, Éric Moulines, and Pierre Priouret. Stability of stochastic approximation under verifiable conditions. SIAM Journal on control and optimization, 44(1):283–312, 2005.
  • Ayache et al. [2023] Ghadir Ayache, Venkat Dassari, and Salim El Rouayheb. Walk for learning: A random walk approach for federated learning from heterogeneous data. IEEE Journal on Selected Areas in Communications, 41(4):929–940, 2023.
  • Ben-Hamou et al. [2018] Anna Ben-Hamou, Eyal Lubetzky, and Yuval Peres. Comparing mixing times on sparse random graphs. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1734–1740. SIAM, 2018.
  • Benveniste et al. [2012] Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.
  • Bianchi et al. [2013] Pascal Bianchi, Gersende Fort, and Walid Hachem. Performance of a distributed stochastic approximation algorithm. IEEE Transactions on Information Theory, 59(11):7405–7418, 2013.
  • Billingsley [2013] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
  • Borkar et al. [2021] Vivek Borkar, Shuhang Chen, Adithya Devraj, Ioannis Kontoyiannis, and Sean Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. arXiv preprint arXiv:2110.14427, 2021.
  • Borkar [2022] V.S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint: Second Edition. Texts and Readings in Mathematics. Hindustan Book Agency, 2022.
  • Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • Brémaud [2013] Pierre Brémaud. Markov chains: Gibbs fields, Monte Carlo simulation, and queues, volume 31. Springer Science & Business Media, 2013.
  • Brooks et al. [2011] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov chain monte carlo. CRC press, 2011.
  • Chang and Lin [2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
  • Chellaboina and Haddad [2008] VijaySekhar Chellaboina and Wassim M Haddad. Nonlinear dynamical systems and control: A Lyapunov-based approach. Princeton University Press, 2008.
  • Chellapandi et al. [2023] Vishnu Pandi Chellapandi, Antesh Upadhyay, Abolfazl Hashemi, and Stanislaw H Zak. On the convergence of decentralized federated learning under imperfect information sharing. arXiv preprint arXiv:2303.10695, 2023.
  • Chen et al. [2020] Shuhang Chen, Adithya Devraj, Ana Busic, and Sean Meyn. Explicit mean-square error bounds for monte-carlo and linear stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pages 4173–4183. PMLR, 2020.
  • Chen et al. [2022] Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning. Transactions on Machine Learning Research, 2022.
  • Choromanska et al. [2015] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
  • Dauphin et al. [2014] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, volume 27, 2014.
  • Davis [1970] Burgess Davis. On the integrability of the martingale square function. Israel Journal of Mathematics, 8:187–190, 1970.
  • Delyon [2000] Bernard Delyon. Stochastic approximation with decreasing gain: Convergence and asymptotic theory. Technical report, 2000.
  • Delyon et al. [1999] Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the em algorithm. Annals of statistics, pages 94–128, 1999.
  • Doan et al. [2019] Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1626–1635. PMLR, 2019.
  • Doshi et al. [2023] Vishwaraj Doshi, Jie Hu, and Do Young Eun. Self-repellent random walks on general graphs–achieving minimal sampling variance via nonlinear markov chains. In International Conference on Machine Learning. PMLR, 2023.
  • Duchi et al. [2012] John C Duchi, Alekh Agarwal, Mikael Johansson, and Michael I Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.
  • Dyer et al. [1993] Martin Dyer, Alan Frieze, Ravi Kannan, Ajai Kapoor, Ljubomir Perkovic, and Umesh Vazirani. A mildly exponential time algorithm for approximating the number of solutions to a multidimensional knapsack problem. Combinatorics, Probability and Computing, 2(3):271–284, 1993.
  • Even [2023] Mathieu Even. Stochastic gradient descent under markovian sampling schemes. In International Conference on Machine Learning, 2023.
  • Fort [2015] Gersende Fort. Central limit theorems for stochastic approximation with controlled markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80, 2015.
  • Fraboni et al. [2022] Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. A general theory for client sampling in federated learning. In International Workshop on Trustworthy Federated Learning in conjunction with IJCAI, 2022.
  • Ghosh et al. [2020] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 19586–19597, 2020.
  • Guo et al. [2022] Yuanxiong Guo, Ying Sun, Rui Hu, and Yanmin Gong. Hybrid local SGD for federated learning with heterogeneous communications. In International Conference on Learning Representations, 2022.
  • Hall et al. [2014] P. Hall, C.C. Heyde, Z.W. Birnbauam, and E. Lukacs. Martingale Limit Theory and Its Application. Communication and Behavior. Elsevier Science, 2014.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hendrikx [2023] Hadrien Hendrikx. A principled framework for the design and analysis of token algorithms. In International Conference on Artificial Intelligence and Statistics, pages 470–489. PMLR, 2023.
  • Horn and Johnson [1991] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991. doi: 10.1017/CBO9780511840371.
  • Hosseinalipour et al. [2022] Seyyedali Hosseinalipour, Sheikh Shams Azam, Christopher G Brinton, Nicolo Michelusi, Vaneet Aggarwal, David J Love, and Huaiyu Dai. Multi-stage hybrid federated learning over large-scale d2d-enabled fog networks. IEEE/ACM Transactions on Networking, 2022.
  • Hu et al. [2022] Jie Hu, Vishwaraj Doshi, and Do Young Eun. Efficiency ordering of stochastic gradient descent. In Advances in Neural Information Processing Systems, 2022.
  • Hu et al. [2024] Jie Hu, Vishwaraj Doshi, and Do Young Eun. Accelerating distributed stochastic optimization via self-repellent random walks. In The Twelfth International Conference on Learning Representations, 2024.
  • Jerrum and Sinclair [1996] Mark Jerrum and Alistair Sinclair. The markov chain monte carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard problems, PWS Publishing, 1996.
  • Kailath et al. [2000] Thomas Kailath, Ali H Sayed, and Babak Hassibi. Linear estimation. Prentice Hall, 2000.
  • Khodadadian et al. [2022] Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linear speedup under markovian sampling. In International Conference on Machine Learning, pages 10997–11057, 2022.
  • Koloskova et al. [2020] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393, 2020.
  • Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Kushner and Yin [2003] Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
  • Lalitha et al. [2018] Anusha Lalitha, Shubhanshu Shekhar, Tara Javidi, and Farinaz Koushanfar. Fully decentralized federated learning. In Advances in neural information processing systems, 2018.
  • Le Bars et al. [2023] Batiste Le Bars, Aurélien Bellet, Marc Tommasi, Erick Lavoie, and Anne-Marie Kermarrec. Refined convergence and topology learning for decentralized sgd with heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 1672–1702. PMLR, 2023.
  • Lee et al. [2012] Chul-Ho Lee, Xin Xu, and Do Young Eun. Beyond random walk and metropolis-hastings samplers: why you should not backtrack for unbiased graph sampling. ACM SIGMETRICS Performance evaluation review, 40(1):319–330, 2012.
  • Leskovec and Krevl [2014] Jure Leskovec and Andrej Krevl. Snap datasets: Stanford large network dataset collection, 2014.
  • Li et al. [2020] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2020.
  • Li et al. [2022] Xiang Li, Jiadong Liang, Xiangyu Chang, and Zhihua Zhang. Statistical estimation and online inference via local sgd. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1613–1661, 02–05 Jul 2022.
  • Li et al. [2023] Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with markovian data. arXiv preprint arXiv:2302.07690, 2023.
  • Liao et al. [2023] Yunming Liao, Yang Xu, Hongli Xu, Lun Wang, and Chen Qian. Adaptive configuration for heterogeneous participants in decentralized federated learning. In IEEE INFOCOM 2023. IEEE, 2023.
  • Mathkar and Borkar [2016] Adwaitvedant S Mathkar and Vivek S Borkar. Nonlinear gossip. SIAM Journal on Control and Optimization, 54(3):1535–1557, 2016.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Meyn [2022] Sean Meyn. Control systems and reinforcement learning. Cambridge University Press, 2022.
  • Mira [2001] Antonietta Mira. Ordering and improving the performance of monte carlo markov chains. Statistical Science, pages 340–350, 2001.
  • Morral et al. [2017] Gemma Morral, Pascal Bianchi, and Gersende Fort. Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
  • Mou et al. [2020] Wenlong Mou, Chris Junchi Li, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. On linear stochastic approximation: Fine-grained polyak-ruppert and non-asymptotic concentration. In Conference on Learning Theory, pages 2947–2997. PMLR, 2020.
  • Neglia et al. [2020] Giovanni Neglia, Chuan Xu, Don Towsley, and Gianmarco Calbi. Decentralized gradient methods: does topology matter? In International Conference on Artificial Intelligence and Statistics, pages 2348–2358. PMLR, 2020.
  • Olshevsky [2022] Alex Olshevsky. Asymptotic network independence and step-size for a distributed subgradient method. Journal of Machine Learning Research, 23(69):1–32, 2022.
  • Pu et al. [2020] Shi Pu, Alex Olshevsky, and Ioannis Ch Paschalidis. Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent. IEEE signal processing magazine, 37(3):114–122, 2020.
  • Robert and Casella [1999] Christian P Robert and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
  • Ross et al. [1996] Sheldon M Ross, John J Kelly, Roger J Sullivan, William James Perry, Donald Mercer, Ruth M Davis, Thomas Dell Washburn, Earl V Sager, Joseph B Boyce, and Vincent L Bristow. Stochastic processes, volume 2. Wiley New York, 1996.
  • Safran and Shamir [2020] Itay Safran and Ohad Shamir. How good is sgd with random shuffling? In Conference on Learning Theory, pages 3250–3284. PMLR, 2020.
  • Srikant [2024] R. Srikant. Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD Learning. arXiv e-prints, art. arXiv:2401.15719, January 2024. doi: 10.48550/arXiv.2401.15719.
  • Stich [2018] Sebastian U Stich. Local sgd converges fast and communicates little. In International Conference on Learning Representations, 2018.
  • Sun et al. [2018] Tao Sun, Yuejiao Sun, and Wotao Yin. On markov chain gradient descent. In Advances in neural information processing systems, volume 31, 2018.
  • Sun et al. [2023] Tao Sun, Dongsheng Li, and Bao Wang. On the decentralized stochastic gradient descent with markov chain sampling. IEEE Transactions on Signal Processing, 2023. doi: 10.1109/TSP.2023.3297053.
  • Vidyasagar [2022] M Vidyasagar. Convergence of stochastic approximation via martingale and converse lyapunov methods. arXiv preprint arXiv:2205.01303, 2022.
  • Wai [2020] Hoi-To Wai. On the convergence of consensus algorithms with markovian noise and gradient bias. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 4897–4902. IEEE, 2020.
  • Wang et al. [2023] Han Wang, Aritra Mitra, Hamed Hassani, George J Pappas, and James Anderson. Federated temporal difference learning with linear function approximation under environmental heterogeneity. arXiv preprint arXiv:2302.02212, 2023.
  • Wang et al. [2020] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems, volume 33, pages 7611–7623, 2020.
  • Wang and Ji [2022] Shiqiang Wang and Mingyue Ji. A unified analysis of federated learning with arbitrary client participation. In Advances in neural information processing systems, 2022.
  • Wojtowytsch [2023] Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type part i: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023.
  • Woodworth et al. [2020] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pages 10334–10343. PMLR, 2020.
  • Ye et al. [2022] Hao Ye, Le Liang, and Geoffrey Ye Li. Decentralized federated learning with unreliable communications. IEEE Journal of Selected Topics in Signal Processing, 16(3):487–500, 2022.
  • Yun et al. [2021] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Open problem: Can single-shuffle sgd be better than reshuffling sgd and gd? In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134, pages 4653–4658. PMLR, Aug 2021.
  • Yun et al. [2022] Chulhee Yun, Shashank Rajput, and Suvrit Sra. Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond. In International Conference on Learning Representations, 2022.
  • Zeng et al. [2022] Sihan Zeng, Thinh T Doan, and Justin Romberg. Finite-time convergence rates of decentralized stochastic approximation with applications in multi-agent and multi-task learning. IEEE Transactions on Automatic Control, 2022.
  • Zhao et al. [2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.

Appendix A Discussion of 2.3-ii)

A.1 Suitable choices of KlK_{l}

When we let Kllog(l)K_{l}\sim\log(l) (resp. Klloglog(l)K_{l}\sim\log\log(l)), as suggested by [51], it trivially satisfies Kτn=o(γn1/2(L+1))=o(n1/2(L+1))K_{\tau_{n}}=o(\gamma_{n}^{-1/2(L+1)})=o(n^{1/2(L+1)}) since by definition Kτn<Knlog(n)K_{\tau_{n}}<K_{n}\sim\log(n) (resp. loglog(n)\log\log(n)), and log(n)=o(nϵ)\log(n)=o(n^{\epsilon}) (resp. loglog(n)=o(nϵ)\log\log(n)=o(n^{\epsilon})) for any ϵ>0\epsilon>0. Besides, nηn2=nγn2Kτn2(L+1)nn2n2(L+1)ϵ=nn2(L+1)ϵ2\sum_{n}\eta_{n}^{2}=\sum_{n}\gamma_{n}^{2}K_{\tau_{n}}^{2(L+1)}\lesssim\sum_{n}n^{-2}n^{2(L+1)\epsilon}=\sum_{n}n^{2(L+1)\epsilon-2}. To ensure nηn2<\sum_{n}\eta_{n}^{2}<\infty, it suffices to have 2(L+1)ϵ2<12(L+1)\epsilon-2<-1, or equivalently, ϵ<1/2(L+1)\epsilon<1/2(L+1). Since ϵ\epsilon can be chosen arbitrarily small, nηn2<\sum_{n}\eta_{n}^{2}<\infty indeed holds. When Kllog(l)K_{l}\sim\log(l), we can rewrite the last condition as

ηnl+1ηnl+1+1=γnl+1γnl+1+1KlL+1Kl+1L+1=(nl+1+1nl+1)(log(l+1)+1log(l)+1)L+1=(1+Kl+1nl+1)(log(l+1)+1log(l)+1)L+1,\begin{split}\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}=\frac{\gamma_{n_{l}+1}}{\gamma_{n_{l+1}+1}}\frac{K_{l}^{L+1}}{K_{l+1}^{L+1}}&=\left(\frac{n_{l+1}+1}{n_{l}+1}\right)\left(\frac{\log(l+1)+1}{\log(l)+1}\right)^{L+1}\\ &=\left(1+\frac{K_{l+1}}{n_{l}+1}\right)\left(\frac{\log(l+1)+1}{\log(l)+1}\right)^{L+1},\end{split} (16)

where we have nllog(l!)n_{l}\sim\log(l!) such that Kl+1/nl=log(l+1)/log(l!)0K_{l+1}/n_{l}=\log(l+1)/\log(l!)\to 0 and log(l+1)/log(l)1\log(l+1)/\log(l)\to 1 as ll\to\infty, which leads to limlηnl+1/ηnl+1+1=1\lim_{l\to\infty}\eta_{n_{l}+1}/\eta_{n_{l+1}+1}=1. Similarly, for Klloglog(l)K_{l}\sim\log\log(l), we have nllog(s=1llog(s))n_{l}\sim\log(\prod_{s=1}^{l}\log(s)) such that Kl+1/nlloglog(l+1)/loglog(s=1llog(s))0K_{l+1}/n_{l}\sim\log\log(l+1)/\log\log(\prod_{s=1}^{l}\log(s))\to 0 and loglog(l+1)/loglog(l)1\log\log(l+1)/\log\log(l)\to 1 as ll\to\infty, which also leads to limlηnl+1/ηnl+1+1=1\lim_{l\to\infty}\eta_{n_{l}+1}/\eta_{n_{l+1}+1}=1.
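As a quick numerical sanity check of this limit, the following sketch simulates the communication schedule under the illustrative concrete choices gamma_n = 1/n and K_l = floor(log(l+2)) + 1 (our own stand-ins for K_l ~ log(l), chosen only so that K_l >= 1; none of these constants are prescribed by the analysis) and confirms that the ratio eta_{n_l+1}/eta_{n_{l+1}+1} approaches 1.

```python
import math

# Illustrative schedule: gamma_n = 1/n and K_l = floor(log(l+2)) + 1 (so K_l ~ log l, K_l >= 1).
L = 2  # the constant L from Assumption 2.3-ii)

def K(l):
    return int(math.log(l + 2)) + 1

n_l, prev_eta, ratio = 0, None, None
for l in range(1, 200_000):
    eta = K(l + 1) ** (L + 1) / (n_l + 1)  # eta_{n_l+1} = gamma_{n_l+1} * K_{l+1}^{L+1}
    if prev_eta is not None:
        ratio = prev_eta / eta             # eta_{n_{l-1}+1} / eta_{n_l+1}
    prev_eta = eta
    n_l += K(l + 1)                        # n_{l+1} = n_l + K_{l+1}

print(ratio)  # -> approximately 1, consistent with the limit above
```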

A.2 Practical implications of 2.3-ii)

In this assumption, we allow the number of local iterations to grow without bound asymptotically. In distributed learning environments such as mobile, IoT, and wireless sensor networks, where nodes are often constrained by battery life, the increasing communication interval in 2.3-ii) plays a crucial role in balancing energy costs with communication effectiveness. It allows agents to communicate more frequently early on, leading to faster initial convergence to a neighborhood of θ\theta^{*}. The communication frequency between agents is then reduced to conserve energy, leveraging the diminishing returns on accuracy improvements from additional communication.

Consider the scenario where devices across multiple clusters collaborate on a distributed optimization task, utilizing local datasets. Devices within each cluster form a communication network over which a virtual agent traverses a heterogeneous Markov chain trajectory via a random walk, or an i.i.d. sequence in a complete graph with self-loops, depending on the application context. Each cluster features an edge server that supports the exchange of model estimates with neighboring clusters. By performing KK local updates before uploading these to the cluster's edge server, the model benefits from reduced communication overhead. As the frequency of updates between devices and edge servers decreases (optimized by gradually increasing KK), we effectively lower communication costs, particularly once the model estimate θn\theta_{n} is close to θ\theta^{*}.

Appendix B Proof of Lemma 3.1

Let 𝐉𝐈N𝐉N×N\mathbf{J}_{\bot}\triangleq\mathbf{I}_{N}-\mathbf{J}\in\mathbb{R}^{N\times N} and 𝒥𝐉𝐈dNd×Nd\mathcal{J}_{\bot}\triangleq\mathbf{J}_{\bot}\otimes\mathbf{I}_{d}\in\mathbb{R}^{Nd\times Nd}, where \otimes is the Kronecker product. Let Θn=[(θn1)T,,(θnN)T]TNd\Theta_{n}=[(\theta^{1}_{n})^{T},\cdots,(\theta^{N}_{n})^{T}]^{T}\in\mathbb{R}^{Nd}. Then, motivated by [58], we define a sequence ϕnηn+11𝒥ΘnNd\phi_{n}\triangleq\eta_{n+1}^{-1}\mathcal{J}_{\bot}\Theta_{n}\in\mathbb{R}^{Nd} in the increasing communication interval case (resp. ϕnγn+11𝒥Θn\phi_{n}\triangleq\gamma_{n+1}^{-1}\mathcal{J}_{\bot}\Theta_{n} in the bounded communication interval case), where ηn+1\eta_{n+1} is defined in 2.3-ii). 𝒥Θn=Θn1N(𝟏𝟏T𝐈d)Θn\mathcal{J}_{\bot}\Theta_{n}=\Theta_{n}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n} represents the consensus error of the model.

We first give the following lemma that shows the pathwise boundedness of ϕn\phi_{n}.

Lemma B.1.

Let Assumptions 2.1, 2.3, 2.4 and 2.5 hold. For any compact set ΩNd\Omega\subset\mathbb{R}^{Nd}, the sequence ϕn\phi_{n} satisfies supn𝔼[ϕn2𝟙jn1{ΘjΩ}]<\sup_{n}\mathbb{E}[\|\phi_{n}\|^{2}\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega\}}]<\infty.

Lemma B.1 and 2.3-ii) imply that for any n0n\geq 0, 𝔼[𝒥Θn2𝟙jn1{ΘjΩ}]=ηn+12𝔼[ϕn2𝟙jn1{ΘjΩ}]Cηn+12\mathbb{E}[\|\mathcal{J}_{\bot}\Theta_{n}\|^{2}\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega\}}]=\eta_{n+1}^{2}\mathbb{E}[\|\phi_{n}\|^{2}\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega\}}]\leq C\eta_{n+1}^{2} for some constant CC that depends on C1C_{1} and Ω\Omega. Together with Assumption 2.4, which ensures that 𝒥Θn\|\mathcal{J}_{\bot}\Theta_{n}\| is bounded along each trajectory, this implies

𝒥Θn𝟙jn1{ΘjΩ}=O(ηn)a.s.\|\mathcal{J}_{\bot}\Theta_{n}\|\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega\}}=O(\eta_{n})\quad a.s.

Let {Ωm}m0\{\Omega_{m}\}_{m\geq 0} be an increasing sequence of compact subsets of Nd\mathbb{R}^{Nd} such that mΩm=Nd\bigcup_{m}\Omega_{m}=\mathbb{R}^{Nd}. Then, for any m0m\geq 0,

𝒥Θn𝟙jn1{ΘjΩm}=O(ηn)a.s.\|\mathcal{J}_{\bot}\Theta_{n}\|\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega_{m}\}}=O(\eta_{n})\quad a.s. (17)

(17) implies that one of the following two cases occurs:

  • there exists some trajectory-dependent index mm^{\prime} such that each trajectory {Θn}n0\{\Theta_{n}\}_{n\geq 0} is always within the compact set Ωm\Omega_{m^{\prime}}, i.e., 𝟙jn{ΘjΩm}=1\mathbbm{1}_{\cap_{j\leq n}\{\Theta_{j}\in\Omega_{m^{\prime}}\}}=1 (satisfied by the construction of increasing compact sets {Ωm}m0\{\Omega_{m}\}_{m\geq 0} and 2.4), and we have 𝒥Θn=O(ηn)\|\mathcal{J}_{\bot}\Theta_{n}\|=O(\eta_{n}) such that limn𝒥Θn=𝟎\lim_{n\to\infty}\mathcal{J}_{\bot}\Theta_{n}=\mathbf{0};

  • Θn\Theta_{n} eventually escapes every compact set Ωm\Omega_{m}, m0m\geq 0, in finite time, so that 𝟙jn1{ΘjΩm}=0\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega_{m}\}}=0 when nn is large enough.

The second case contradicts 2.4, which assumes that every trajectory {Θn}n0\{\Theta_{n}\}_{n\geq 0} stays within some compact set. Therefore, (17) holding for every m0m\geq 0 yields 𝒥Θn=O(ηn)\|\mathcal{J}_{\bot}\Theta_{n}\|=O(\eta_{n}) and limn𝒥Θn=𝟎\lim_{n\to\infty}\mathcal{J}_{\bot}\Theta_{n}=\mathbf{0}. Under Assumption 2.3-i), the analogous result 𝒥Θn=O(γn)\|\mathcal{J}_{\bot}\Theta_{n}\|=O(\gamma_{n}) follows from the same steps, which completes the proof of Lemma 3.1.

Proof of Lemma B.1.

We begin by rewriting (6) in the matrix form,

Θn+1=𝒲n(Θnγn+1𝐅(Θn,𝐗n)),\Theta_{n+1}=\mathcal{W}_{n}\left(\Theta_{n}-\gamma_{n+1}\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\right), (18)

where 𝐗n(Xn1,Xn2,,XnN)\mathbf{X}_{n}\triangleq(X_{n}^{1},X^{2}_{n},\cdots,X^{N}_{n}) and 𝐅(Θn,𝐗n)[F1(θn1,Xn1)T,,FN(θnN,XnN)T]TNd\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\triangleq[\nabla F_{1}(\theta^{1}_{n},X^{1}_{n})^{T},\cdots,\nabla F_{N}(\theta^{N}_{n},X^{N}_{n})^{T}]^{T}\in\mathbb{R}^{Nd}. Recall θn1Ni=1Nθnid\theta_{n}\triangleq\frac{1}{N}\sum_{i=1}^{N}\theta_{n}^{i}\in\mathbb{R}^{d} and we have [θnT,,θnT]T=1N(𝟏𝟏T𝐈d)ΘnNd[\theta_{n}^{T},\cdots,\theta_{n}^{T}]^{T}=\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n}\in\mathbb{R}^{Nd}.
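To make the stacked notation concrete, the following sketch performs one step of (18) in a toy setting of our own choosing (quadratic per-agent losses F_i(θ, X) = ½‖θ − X‖², so ∇F_i(θ, X) = θ − X, and a lazy-ring gossip matrix standing in for 𝐖_n); it then reports the consensus average and the consensus error ‖𝒥⊥Θ_{n+1}‖.

```python
import numpy as np

N, d, gamma = 4, 3, 0.1
rng = np.random.default_rng(0)

# Doubly stochastic mixing matrix W_n: lazy random walk on a ring (illustrative choice).
W = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, axis=0) + np.roll(np.eye(N), -1, axis=0))

Theta = rng.normal(size=(N, d))        # row i is theta^i_n
X = rng.normal(size=(N, d))            # one sampled data point per agent

grads = Theta - X                      # stacked nabla F(Theta_n, X_n) for the toy loss
Theta = W @ (Theta - gamma * grads)    # (18): Theta_{n+1} = W_n (Theta_n - gamma_{n+1} grad)

theta_bar = Theta.mean(axis=0)                        # (1/N)(1 1^T (x) I_d) Theta_{n+1}
print(theta_bar, np.linalg.norm(Theta - theta_bar))   # consensus average and ||J_perp Theta||
```

Note that applying 𝐖_n ⊗ 𝐈_d to the stacked vector Θ is exactly the row-wise matrix product W @ Theta when Θ is stored as an N×d array.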

Case 1 (Increasing communication interval KτnK_{\tau_{n}}): By left multiplying (18) with 1N(𝟏𝟏T𝐈d)\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d}), along with γn+1=ηn+1/Kτn+1L+1\gamma_{n+1}=\eta_{n+1}/K^{L+1}_{\tau_{n+1}} in 2.3-ii), we have the following iteration

1N(𝟏𝟏T𝐈d)Θn+1=1N(𝟏𝟏T𝐈d)Θnηn+11N(𝟏𝟏T𝐈d)𝐅(Θn,𝐗n)Kτn+1L+1,\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n+1}=\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n}-\eta_{n+1}\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\frac{\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})}{K^{L+1}_{\tau_{n+1}}}, (19)

where the equality comes from 1N(𝟏𝟏T𝐈d)𝒲n=1N(𝟏𝟏T𝐖n𝐈d)=1N(𝟏𝟏T𝐈d)\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\mathcal{W}_{n}=\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\mathbf{W}_{n}\otimes\mathbf{I}_{d})=\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d}). With (18) and (19), we have

Θn+11N(𝟏𝟏T𝐈d)Θn+1=(𝒲n1N(𝟏𝟏T𝐈d))Θnηn+1(𝒲n1N(𝟏𝟏T𝐈d))𝐅(Θn,𝐗n)Kτn+1L+1=(𝐉𝐖n𝐈d)𝒥Θnηn+1(𝐉𝐖n𝐈d)𝐅(Θn,Xn)Kτn+1L+1=ηn+1(𝐉𝐖n𝐈d)(ηn+11𝒥Θn𝐅(Θn,𝐗n)Kτn+1L+1),\begin{split}&\Theta_{n+1}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n+1}\\ =&\left(\mathcal{W}_{n}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\right)\Theta_{n}-\eta_{n+1}\left(\mathcal{W}_{n}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\right)\frac{\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})}{K^{L+1}_{\tau_{n+1}}}\\ =&(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\mathcal{J}_{\bot}\Theta_{n}-\eta_{n+1}(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\frac{\nabla\mathbf{F}(\Theta_{n},X_{n})}{K^{L+1}_{\tau_{n+1}}}\\ =&\eta_{n+1}(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\left(\eta_{n+1}^{-1}\mathcal{J}_{\bot}\Theta_{n}-\frac{\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})}{K^{L+1}_{\tau_{n+1}}}\right),\end{split} (20)

where the second equality comes from 𝒲n1N(𝟏𝟏T𝐈d)=(𝐖n1N𝟏𝟏T)𝐈d=𝐉𝐖n𝐈d\mathcal{W}_{n}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})=(\mathbf{W}_{n}-\frac{1}{N}\mathbf{1}\mathbf{1}^{T})\otimes\mathbf{I}_{d}=\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d} and (𝐉𝐖n𝐈d)𝒥=𝐉𝐖n𝐉𝐈d=𝐉𝐖n𝐈d(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\mathcal{J}_{\bot}=\mathbf{J}_{\bot}\mathbf{W}_{n}\mathbf{J}_{\bot}\otimes\mathbf{I}_{d}=\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d}. Let anηn/ηn+1a_{n}\triangleq\eta_{n}/\eta_{n+1}, dividing both sides of (20) by ηn+2\eta_{n+2} gives

ϕn+1=an+1(𝐉𝐖n𝐈d)(ϕn𝐅(Θn,𝐗n)Kτn+1L+1).\phi_{n+1}=a_{n+1}(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\left(\phi_{n}-\frac{\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})}{K^{L+1}_{\tau_{n+1}}}\right). (21)

Define the filtration {n}n0\{\mathcal{F}_{n}\}_{n\geq 0} as nσ{Θ0,𝐗0,𝐖0,Θ1,𝐗1,𝐖1,,𝐗n1,𝐖n1,Θn,𝐗n}\mathcal{F}_{n}\triangleq\sigma\{\Theta_{0},\mathbf{X}_{0},\mathbf{W}_{0},\Theta_{1},\mathbf{X}_{1},\mathbf{W}_{1},\cdots,\mathbf{X}_{n-1},\mathbf{W}_{n-1},\Theta_{n},\mathbf{X}_{n}\}. Recursively computing (21) over the time interval [nl,nl+1][n_{l},n_{l+1}] gives

ϕnl+1=[k=nl+1nl+1ak]([𝐉k=nlnl+11𝐖k]𝐈d)ϕnlk=nlnl+11[i=k+1nl+1ai]([𝐉i=knl+11𝐖i]𝐈d)𝐅(Θk,𝐗k)Kl+1L+1=ηnl+1ηnl+1+1(𝐉𝐖nl𝐈d)ϕnlk=nlnl+11ηnl+1ηk+2(𝐉𝐖nl𝐈d)𝐅(Θk,𝐗k)Kl+1L+1,\begin{split}\phi_{n_{l+1}}&=\left[\prod_{k=n_{l}+1}^{n_{l+1}}a_{k}\right]\left(\left[\mathbf{J}_{\bot}\prod_{k=n_{l}}^{n_{l+1}-1}\mathbf{W}_{k}\right]\otimes\mathbf{I}_{d}\right)\phi_{n_{l}}\\ &~{}~{}~{}~{}~{}-\sum_{k={n_{l}}}^{n_{l+1}-1}\left[\prod_{i=k+1}^{n_{l+1}}a_{i}\right]\left(\left[\mathbf{J}_{\bot}\prod_{i=k}^{n_{l+1}-1}\mathbf{W}_{i}\right]\otimes\mathbf{I}_{d}\right)\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\\ &=\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\phi_{n_{l}}-\sum_{k={n_{l}}}^{n_{l+1}-1}\frac{\eta_{n_{l}+1}}{\eta_{k+2}}\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}},\end{split} (22)

where \prod denotes the backward (time-ordered) matrix product, and the second equality comes from 𝐉𝐖n𝐉=𝐉𝐖n\mathbf{J}_{\bot}\mathbf{W}_{n}\mathbf{J}_{\bot}=\mathbf{J}_{\bot}\mathbf{W}_{n} and 𝐖k=𝐈N\mathbf{W}_{k}=\mathbf{I}_{N} for k{nl}k\notin\{n_{l}\}. In 2.5, we have 𝔼𝐖𝒫nl[𝐖T𝐉𝐖]=𝔼𝐖𝒫nl[𝐖T𝐖𝐉]C1<1\|\mathbb{E}_{\mathbf{W}\sim\mathcal{P}_{n_{l}}}[\mathbf{W}^{T}\mathbf{J}_{\bot}\mathbf{W}]\|=\|\mathbb{E}_{\mathbf{W}\sim\mathcal{P}_{n_{l}}}[\mathbf{W}^{T}\mathbf{W}-\mathbf{J}]\|\leq C_{1}<1. Then,

𝔼[ϕnl+12|nl]=(ηnl+1ηnl+1+1)2ϕnlT𝔼𝐖nl𝒫nl[(𝐉𝐖nl𝐈d)T(𝐉𝐖nl𝐈d)]ϕnl2𝔼[k=nlnl+11ηnl+12ηnl+1+1ηk+2ϕnlT(𝐉𝐖nl𝐈d)T(𝐉𝐖nl𝐈d)𝐅(Θk,𝐗k)Kl+1L+1|nl]+𝔼[k=nlnl+11ηnl+1ηk+2(𝐉𝐖nl𝐈d)𝐅(Θk,𝐗k)Kl+1L+12|nl](ηnl+1ηnl+1+1)2ϕnlT𝔼𝐖nl𝒫nl[(𝐖nlT𝐉𝐖nl𝐈d)]ϕnl2(ηnl+1ηnl+1+1)2𝔼[k=nlnl+11ϕnlT(𝐖nlT𝐉𝐖nl𝐈d)𝐅(Θk,𝐗k)Kl+1L+1|nl]+(ηnl+1ηnl+1+1)2𝔼[(𝐉𝐖nl𝐈d)k=nlnl+11𝐅(Θk,𝐗k)Kl+1L+12|nl](ηnl+1ηnl+1+1)2C1ϕnl2+2(ηnl+1ηnl+1+1)2C1ϕnl𝔼[k=nlnl+11𝐅(Θk,𝐗k)Kl+1L+1|nl]+(ηnl+1ηnl+1+1)2C1𝔼[k=nlnl+11𝐅(Θk,𝐗k)Kl+1L+12|nl],\begin{split}&\mathbb{E}[\|\phi_{n_{l+1}}\|^{2}|\mathcal{F}_{n_{l}}]\\ =&\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}\phi_{n_{l}}^{T}\mathbb{E}_{\mathbf{W}_{n_{l}\sim\mathcal{P}_{n_{l}}}}\left[\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)^{T}\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\right]\phi_{n_{l}}\\ &-2\mathbb{E}\left[\left.\sum_{k=n_{l}}^{n_{l+1}-1}\frac{\eta^{2}_{n_{l}+1}}{\eta_{n_{l+1}+1}\eta_{k+2}}\phi_{n_{l}}^{T}\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)^{T}\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right|\mathcal{F}_{n_{l}}\right]\\ &+\mathbb{E}\left[\left.\left\|\sum_{k={n_{l}}}^{n_{l+1}-1}\frac{\eta_{n_{l}+1}}{\eta_{k+2}}\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|^{2}\right|\mathcal{F}_{n_{l}}\right]\\ \leq&\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}\phi_{n_{l}}^{T}\mathbb{E}_{\mathbf{W}_{n_{l}\sim\mathcal{P}_{n_{l}}}}\left[\left(\mathbf{W}_{n_{l}}^{T}\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\right]\phi_{n_{l}}\\ &-2\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}\mathbb{E}\left[\left.\sum_{k=n_{l}}^{n_{l+1}-1}\phi_{n_{l}}^{T}\left(\mathbf{W}_{n_{l}}^{T}\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right|\mathcal{F}_{n_{l}}\right]\\ &+\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}\mathbb{E}\left[\left.\left\|\left(\mathbf{J}_{\bot}\mathbf{W}_{n_{l}}\otimes\mathbf{I}_{d}\right)\sum_{k={n_{l}}}^{n_{l+1}-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|^{2}\right|\mathcal{F}_{n_{l}}\right]\\ \leq&\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}C_{1}\|\phi_{n_{l}}\|^{2}+2\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}C_{1}\|\phi_{n_{l}}\|\mathbb{E}\left[\left.\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|\right|\mathcal{F}_{n_{l}}\right]\\ &+\left(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}}\right)^{2}C_{1}\mathbb{E}\left[\left.\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|^{2}\right|\mathcal{F}_{n_{l}}\right],\end{split} (23)

where the first inequality comes from 𝐉T𝐉=𝐉\mathbf{J}_{\bot}^{T}\mathbf{J}_{\bot}=\mathbf{J}_{\bot} and ηk+2ηnl+1+1\eta_{k+2}\geq\eta_{n_{l+1}+1} for k[nl,nl+11]k\in[n_{l},n_{l+1}-1]. Then, we analyze the norm of the gradient 𝐅(Θk,𝐗k)\|\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})\| in the second term on the RHS of (23) conditioned on nl\mathcal{F}_{n_{l}}. By 2.4, we assume Θnl\Theta_{n_{l}} is within some compact set Ω\Omega at time nln_{l} such that supi[N],Xi𝒳iFi(θnli,Xi)CΩ\sup_{i\in[N],X^{i}\in\mathcal{X}_{i}}\|\nabla F_{i}(\theta_{n_{l}}^{i},X^{i})\|\leq C_{\Omega} for some constant CΩC_{\Omega}. For n=nl+1n=n_{l}+1 and any 𝐗𝒳1×𝒳2××𝒳N\mathbf{X}\in\mathcal{X}_{1}\times\mathcal{X}_{2}\times\cdots\times\mathcal{X}_{N}, we have

𝐅(Θnl+1,𝐗)𝐅(Θnl+1,𝐗)𝐅(Θnl,𝐗)+𝐅(Θnl,𝐗).\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})\|\leq\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})-\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|+\|\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|.

Considering 𝐅(Θnl,𝐗)\|\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|, we have sup𝐗𝐅(Θnl,𝐗)2i=1NsupXi𝒳iFi(θnli,Xi)2NCΩ2\sup_{\mathbf{X}}\|\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|^{2}\leq\sum_{i=1}^{N}\sup_{X^{i}\in\mathcal{X}_{i}}\|\nabla F_{i}(\theta^{i}_{n_{l}},X^{i})\|^{2}\leq NC_{\Omega}^{2} such that 𝐅(Θnl,𝐗)NCΩ\|\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|\leq\sqrt{N}C_{\Omega}. In addition, we have

𝐅(Θnl+1,𝐗)𝐅(Θnl,𝐗)2=i=1NFi(θnl+1i,Xi)Fi(θnli,Xi)2i=1NL2θnl+1iθnli2i=1Nγnl+12L2Fi(θnli,Xnli)2γnl+12CΩ2NL2\begin{split}\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})-\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|^{2}=&\sum_{i=1}^{N}\|\nabla F_{i}(\theta_{n_{l}+1}^{i},X^{i})-\nabla F_{i}(\theta_{n_{l}}^{i},X^{i})\|^{2}\\ \leq&\sum_{i=1}^{N}L^{2}\|\theta^{i}_{n_{l}+1}-\theta^{i}_{n_{l}}\|^{2}\\ \leq&\sum_{i=1}^{N}\gamma_{n_{l}+1}^{2}L^{2}\|\nabla F_{i}(\theta^{i}_{n_{l}},X^{i}_{n_{l}})\|^{2}\\ \leq&\gamma_{n_{l}+1}^{2}C_{\Omega}^{2}NL^{2}\end{split} (24)

such that 𝐅(Θnl+1,𝐗)𝐅(Θnl,𝐗)γnl+1CΩNL\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})-\nabla\mathbf{F}(\Theta_{n_{l}},\mathbf{X})\|\leq\gamma_{n_{l}+1}C_{\Omega}\sqrt{N}L. Thus, for any 𝐗\mathbf{X},

𝐅(Θnl+1,𝐗)(1+γnl+1L)NCΩ.\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})\|\leq\left(1+\gamma_{n_{l}+1}L\right)\sqrt{N}C_{\Omega}. (25)

For n=nl+2n=n_{l}+2 and any 𝐗\mathbf{X}, we have

𝐅(Θnl+2,𝐗)𝐅(Θnl+2,𝐗)𝐅(Θnl+1,𝐗)+𝐅(Θnl+1,𝐗).\|\nabla\mathbf{F}(\Theta_{n_{l}+2},\mathbf{X})\|\leq\|\nabla\mathbf{F}(\Theta_{n_{l}+2},\mathbf{X})-\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})\|+\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})\|.

Similar to the steps in (24), we have

𝐅(Θnl+2,𝐗)𝐅(Θnl+1,𝐗)2i=1Nγnl+22L2Fi(θnl+1i,Xnl+1i)2=γnl+22L2𝐅(Θnl+1,𝐗nl+1)2.\begin{split}\|\nabla\mathbf{F}(\Theta_{n_{l}+2},\mathbf{X})-\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})\|^{2}\leq&\sum_{i=1}^{N}\gamma_{n_{l}+2}^{2}L^{2}\|\nabla F_{i}(\theta^{i}_{n_{l}+1},X^{i}_{n_{l}+1})\|^{2}\\ =&\gamma_{n_{l}+2}^{2}L^{2}\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X}_{n_{l}+1})\|^{2}.\end{split} (26)

Then, 𝐅(Θnl+2,𝐗)(1+γnl+2L)sup𝐗𝐅(Θnl+1,𝐗)\|\nabla\mathbf{F}(\Theta_{n_{l}+2},\mathbf{X})\|\leq(1+\gamma_{n_{l}+2}L)\sup_{\mathbf{X}}\|\nabla\mathbf{F}(\Theta_{n_{l}+1},\mathbf{X})\| and, together with (25), we have

𝐅(Θnl+2,𝐗)(1+γnl+2L)(1+γnl+1L)NCΩ.\|\nabla\mathbf{F}(\Theta_{n_{l}+2},\mathbf{X})\|\leq(1+\gamma_{n_{l}+2}L)(1+\gamma_{n_{l}+1}L)\sqrt{N}C_{\Omega}. (27)

By induction, 𝐅(Θnl+m,𝐗)s=1m(1+γnl+sL)NCΩ\|\nabla\mathbf{F}(\Theta_{n_{l}+m},\mathbf{X})\|\leq\prod_{s=1}^{m}(1+\gamma_{n_{l}+s}L)\sqrt{N}C_{\Omega} for m[1,Kl+11]m\in[1,K_{l+1}-1].

The next step is to analyze the growth rate of s=1m(1+γnl+sL)\prod_{s=1}^{m}(1+\gamma_{n_{l}+s}L). By 1+xex1+x\leq e^{x} for x0x\geq 0, we have

s=1m(1+γnl+sL)eLs=1mγnl+s.\prod_{s=1}^{m}(1+\gamma_{n_{l}+s}L)\leq e^{L\sum_{s=1}^{m}\gamma_{n_{l}+s}}.

For step size γn=1/n\gamma_{n}=1/n, we have Ls=1mγnl+s=Ls=1m1/(nl+s)<Ls=1m1/s<L(log(m)+1)L\sum_{s=1}^{m}\gamma_{n_{l}+s}=L\sum_{s=1}^{m}1/(n_{l}+s)<L\sum_{s=1}^{m}1/s<L(\log(m)+1) such that s=1m(1+γnl+sL)<(em)L\prod_{s=1}^{m}(1+\gamma_{n_{l}+s}L)<(em)^{L}. Then,

k=nlnl+11𝐅(Θk,𝐗k)Kl+1L+11Kl+1L+1k=nlnl+11𝐅(Θk,𝐗k)1Kl+1L+1NeLCΩm=0Kl+11mLNeLCΩ,\begin{split}\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|&\leq\frac{1}{K^{L+1}_{l+1}}\sum_{k=n_{l}}^{n_{l+1}-1}\|\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})\|\leq\frac{1}{K^{L+1}_{l+1}}\sqrt{N}e^{L}C_{\Omega}\sum_{m=0}^{K_{l+1}-1}m^{L}\\ &\leq\sqrt{N}e^{L}C_{\Omega},\end{split} (28)

where the last inequality comes from m=0Kl+11mL<Kl+1(Kl+11)L<Kl+1L+1\sum_{m=0}^{K_{l+1}-1}m^{L}<K_{l+1}(K_{l+1}-1)^{L}<K_{l+1}^{L+1}. We can see that the sum of the gradient norms is bounded by NeLCΩ\sqrt{N}e^{L}C_{\Omega}, which depends only on the compact set Ω\Omega at time n=nln=n_{l}.
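Both elementary bounds used here, namely ∏_{s=1}^m (1 + γ_{n_l+s} L) < (em)^L for γ_n = 1/n and ∑_{m=0}^{K−1} m^L < K^{L+1}, are easy to spot-check numerically; the constants in the sketch below are arbitrary test values.

```python
import math

L, n_l = 3, 10  # arbitrary test constants
for m in (5, 50, 500):
    prod = math.prod(1 + L / (n_l + s) for s in range(1, m + 1))
    assert prod < (math.e * m) ** L          # prod_{s=1}^m (1 + L/(n_l+s)) < (e m)^L

for K in (2, 10, 100):
    assert sum(m ** L for m in range(K)) < K ** (L + 1)  # sum_{m=0}^{K-1} m^L < K^{L+1}

print("both bounds hold for the tested values")
```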

Let δ1(C1,1)\delta_{1}\in(C_{1},1). Since limlηnl+1/ηnl+1+1=1\lim_{l\to\infty}\eta_{n_{l}+1}/\eta_{n_{l+1}+1}=1 from 2.3-ii), there exists some large enough l0l_{0} such that (ηnl+1ηnl+1+1)2C1<δ1<δ2:=(δ1+1)/2<1(\frac{\eta_{n_{l}+1}}{\eta_{n_{l+1}+1}})^{2}C_{1}<\delta_{1}<\delta_{2}:=(\delta_{1}+1)/2<1 for any l>l0l>l_{0}. Note that δ1\delta_{1} depends only on C1C_{1} and is independent of n\mathcal{F}_{n}. Then, letting C~Ω:=NeLCΩ\tilde{C}_{\Omega}:=\sqrt{N}e^{L}C_{\Omega}, we can rewrite (23) as

𝔼[ϕnl+12|nl]δ1ϕnl2+2δ1C~Ωϕnl+δ1C~Ω2δ2ϕnl2+MΩ,\begin{split}\mathbb{E}[\|\phi_{n_{l+1}}\|^{2}|\mathcal{F}_{n_{l}}]\leq&\delta_{1}\|\phi_{n_{l}}\|^{2}+2\delta_{1}\tilde{C}_{\Omega}\|\phi_{n_{l}}\|+\delta_{1}\tilde{C}_{\Omega}^{2}\\ \leq&\delta_{2}\|\phi_{n_{l}}\|^{2}+M_{\Omega},\end{split} (29)

where MΩM_{\Omega} satisfies MΩ>8C~Ω2/(1δ1)+δ1C~Ω2M_{\Omega}>8\tilde{C}_{\Omega}^{2}/(1-\delta_{1})+\delta_{1}\tilde{C}_{\Omega}^{2}, which is derived from rearranging (29) as MΩ(δ1δ2)ϕnl2+2δ1C~Ωϕnl+δ1C~Ω2M_{\Omega}\geq(\delta_{1}-\delta_{2})\|\phi_{n_{l}}\|^{2}+2\delta_{1}\tilde{C}_{\Omega}\|\phi_{n_{l}}\|+\delta_{1}\tilde{C}_{\Omega}^{2} and upper bounding the RHS. Upon noting that 𝟙jnl{ΘjΩ}𝟙jnl1{ΘjΩ}\mathbbm{1}_{\cap_{j\leq n_{l}}\{\Theta_{j}\in\Omega\}}\leq\mathbbm{1}_{\cap_{j\leq n_{l-1}}\{\Theta_{j}\in\Omega\}}, we obtain

𝔼[ϕnl+12𝟙jnl{ΘjΩ}]δ2𝔼[ϕnl2𝟙jnl1{ΘjΩ}]+MΩ.\mathbb{E}\left[\|\phi_{n_{l+1}}\|^{2}\mathbbm{1}_{\cap_{j\leq n_{l}}\{\Theta_{j}\in\Omega\}}\right]\leq\delta_{2}\mathbb{E}\left[\|\phi_{n_{l}}\|^{2}\mathbbm{1}_{\cap_{j\leq n_{l-1}}\{\Theta_{j}\in\Omega\}}\right]+M_{\Omega}. (30)

The induction leads to 𝔼[ϕnl+12𝟙jnl{ΘjΩ}]δ2l+1l0𝔼[ϕnl02𝟙jnl01{ΘjΩ}]+MΩ/(1δ2)<\mathbb{E}[\|\phi_{n_{l+1}}\|^{2}\mathbbm{1}_{\cap_{j\leq n_{l}}\{\Theta_{j}\in\Omega\}}]\leq\delta_{2}^{l+1-l_{0}}\mathbb{E}[\|\phi_{n_{l_{0}}}\|^{2}\mathbbm{1}_{\cap_{j\leq n_{l_{0}-1}}\{\Theta_{j}\in\Omega\}}]+M_{\Omega}/(1-\delta_{2})<\infty for any ll0l\geq l_{0}. Besides, for m(nl,nl+1)m\in(n_{l},n_{l+1}), applying the steps of (23) to (21) yields

𝔼[ϕm2|nl](ηnl+1ηm+1)2ϕnl2+2(ηnl+1ηm+1)2ϕnl𝔼[k=nlm1𝐅(Θk,𝐗k)Kl+1L+1|nl]+(ηnl+1ηm+1)2𝔼[k=nlm1𝐅(Θk,𝐗k)Kl+1L+12|nl].\begin{split}\mathbb{E}[\|\phi_{m}\|^{2}|\mathcal{F}_{n_{l}}]\leq&\left(\frac{\eta_{n_{l}+1}}{\eta_{m+1}}\right)^{2}\|\phi_{n_{l}}\|^{2}+2\left(\frac{\eta_{n_{l}+1}}{\eta_{m+1}}\right)^{2}\|\phi_{n_{l}}\|\mathbb{E}\left[\left.\left\|\sum_{k=n_{l}}^{m-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|\right|\mathcal{F}_{n_{l}}\right]\\ +&\left(\frac{\eta_{n_{l}+1}}{\eta_{m+1}}\right)^{2}\mathbb{E}\left[\left.\left\|\sum_{k=n_{l}}^{m-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\right\|^{2}\right|\mathcal{F}_{n_{l}}\right].\end{split} (31)

By (28), we have already shown that k=nlnl+11𝐅(Θk,𝐗k)Kl+1L+1<\|\sum_{k=n_{l}}^{n_{l+1}-1}\frac{\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})}{K^{L+1}_{l+1}}\|<\infty conditioned on nl\mathcal{F}_{n_{l}}. Therefore, 𝔼[ϕm2𝟙jnl{ΘjΩ}]<\mathbb{E}[\|\phi_{m}\|^{2}\mathbbm{1}_{\cap_{j\leq n_{l}}\{\Theta_{j}\in\Omega\}}]<\infty for m(nl,nl+1)m\in(n_{l},n_{l+1}). This completes the proof of the boundedness of 𝔼[ϕn2𝟙jn1{ΘjΩ}]\mathbb{E}[\|\phi_{n}\|^{2}\mathbbm{1}_{\cap_{j\leq n-1}\{\Theta_{j}\in\Omega\}}].

Case 2 (Bounded communication interval KτnKK_{\tau_{n}}\leq K): In this case, we do not need the auxiliary step size ηn\eta_{n} and can directly work with γn=1/na\gamma_{n}=1/n^{a} for a(0.5,1]a\in(0.5,1]. Similar to (20), we have

Θn+11N(𝟏𝟏T𝐈d)Θn+1=γn+1(𝐉𝐖n𝐈d)(γn+11𝒥Θn𝐅(Θn,𝐗n)),\Theta_{n+1}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n+1}=\gamma_{n+1}(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\left(\gamma_{n+1}^{-1}\mathcal{J}_{\bot}\Theta_{n}-\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\right), (32)

and letting bnγn/γn+1b_{n}\triangleq\gamma_{n}/\gamma_{n+1}, dividing both sides of the above equation by γn+2\gamma_{n+2} gives

ϕn+1=bn+1(𝐉𝐖n𝐈d)(ϕn𝐅(Θn,𝐗n)).\phi_{n+1}=b_{n+1}(\mathbf{J}_{\bot}\mathbf{W}_{n}\otimes\mathbf{I}_{d})\left(\phi_{n}-\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\right). (33)

Then, by following similar steps to those in (22) and (23), we obtain

𝔼[ϕnl+12|nl](γnl+1γnl+1+1)2C1(ϕnl2+2ϕnl𝔼[k=nlnl+11𝐅(Θk,𝐗k)|nl]+𝔼[k=nlnl+11𝐅(Θk,𝐗k)2|nl]).\begin{split}\mathbb{E}[\|\phi_{n_{l+1}}\|^{2}|\mathcal{F}_{n_{l}}]\leq&\left(\frac{\gamma_{n_{l}+1}}{\gamma_{n_{l+1}+1}}\right)^{2}C_{1}\Bigg{(}\|\phi_{n_{l}}\|^{2}+2\|\phi_{n_{l}}\|\mathbb{E}\left[\left.\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})\right\|\right|\mathcal{F}_{n_{l}}\right]\\ &+\mathbb{E}\left[\left.\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})\right\|^{2}\right|\mathcal{F}_{n_{l}}\right]\Bigg{)}.\end{split} (34)

Similarly to (25)–(28), we can bound the sum of the gradient norms as

k=nlnl+11𝐅(Θk,𝐗k)k=nlnl+11[s=nlk(1+γs+1L)]NCΩ.\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})\right\|\leq\sum_{k=n_{l}}^{n_{l+1}-1}\left[\prod_{s=n_{l}}^{k}(1+\gamma_{s+1}L)\right]\sqrt{N}C_{\Omega}. (35)

Since KlK_{l} is bounded above by KK, we have s=nlk(1+γs+1L)eLs=nlkγs+1<eLs=0K1γs+1:=CK\prod_{s=n_{l}}^{k}(1+\gamma_{s+1}L)\leq e^{L\sum_{s=n_{l}}^{k}\gamma_{s+1}}<e^{L\sum_{s=0}^{K-1}\gamma_{s+1}}:=C_{K}. Then, we further bound (35) as

k=nlnl+11𝐅(Θk,𝐗k)NKCKCΩ.\left\|\sum_{k=n_{l}}^{n_{l+1}-1}\nabla\mathbf{F}(\Theta_{k},\mathbf{X}_{k})\right\|\leq\sqrt{N}KC_{K}C_{\Omega}. (36)

The subsequent proof essentially replicates (29)–(31) and is therefore omitted. ∎

Appendix C Proof of Theorem 3.2

We focus on the convergence of the averaged iterate θn\theta_{n}, whose recursion is obtained by left multiplying (18) with 1N(𝟏T𝐈d)\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d}), i.e.,

θn+1=1N(𝟏T𝐈d)Θn+1=θnγn+11N(𝟏T𝐈d)𝐅(Θn,𝐗n).\begin{split}\theta_{n+1}&=\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n+1}\\ &=\theta_{n}-\gamma_{n+1}\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d})\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n}).\end{split} (37)

where the second equality comes from 𝐖n\mathbf{W}_{n} being doubly stochastic, so that 1N(𝟏T𝐈d)𝒲n=1N(𝟏T𝐖n𝐈d)=1N(𝟏T𝐈d)\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d})\mathcal{W}_{n}=\frac{1}{N}(\mathbf{1}^{T}\mathbf{W}_{n}\otimes\mathbf{I}_{d})=\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d}).

To keep the paper self-contained, we first state the almost sure convergence result for stochastic approximation that will be used in our proof.

Theorem C.1 (Theorem 2 [23]).

Consider the stochastic approximation in the form of

θn+1=θn+γn+1h(θn)+γn+1en+1+γn+1rn+1.\theta_{n+1}=\theta_{n}+\gamma_{n+1}h(\theta_{n})+\gamma_{n+1}e_{n+1}+\gamma_{n+1}r_{n+1}. (38)

Assume that

  1. C1.

    w.p.1, the closure of {θn}n0\{\theta_{n}\}_{n\geq 0} is a compact subset of d\mathbb{R}^{d};

  2. C2.

{γn}\{\gamma_{n}\} is a decreasing sequence of positive numbers such that nγn=\sum_{n}\gamma_{n}=\infty;

  3. C3.

    w.p.1, limpn=1pγnen\lim_{p\to\infty}\sum_{n=1}^{p}\gamma_{n}e_{n} exists and is finite. Moreover, limnrn=0\lim_{n\to\infty}r_{n}=0.

  4. C4.

the vector-valued function hh is continuous on d\mathbb{R}^{d} and there exists a continuously differentiable function V:dV:\mathbb{R}^{d}\to\mathbb{R} such that V(θ),h(θ)0\langle\nabla V(\theta),h(\theta)\rangle\leq 0 for all θd\theta\in\mathbb{R}^{d}. Besides, the set V()V(\mathcal{L}) has empty interior, where {θd:V(θ),h(θ)=0}\mathcal{L}\triangleq\{\theta\in\mathbb{R}^{d}:\langle\nabla V(\theta),h(\theta)\rangle=0\}.

Then, w.p.1, lim supnd(θn,)=0\limsup_{n}d(\theta_{n},\mathcal{L})=0. ∎

We can rewrite (37) as

θn+1=θnγn+11N(𝟏T𝐈d)𝐅(Θn,𝐗n)=θnγn+1f(θn)γn+1(1Ni=1Nfi(θni)f(θn))γn+1(1Ni=1NFi(θni,Xni)1Ni=1Nfi(θni)),\begin{split}\theta_{n+1}=&\theta_{n}-\gamma_{n+1}\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d})\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\\ =&\theta_{n}-\gamma_{n+1}\nabla f(\theta_{n})-\gamma_{n+1}\left(\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\theta_{n}^{i})-\nabla f(\theta_{n})\right)\\ &-\gamma_{n+1}\left(\frac{1}{N}\sum_{i=1}^{N}\nabla F_{i}(\theta_{n}^{i},X_{n}^{i})-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\theta_{n}^{i})\right),\end{split} (39)

and analyze the convergence behavior of the third and fourth terms. By the definition of the function f()\nabla f(\cdot), we have

rn1Ni=1Nfi(θni)f(θn)=1Ni=1N[fi(θni)fi(θn)].r_{n}\triangleq\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\theta_{n}^{i})-\nabla f(\theta_{n})=\frac{1}{N}\sum_{i=1}^{N}\left[\nabla f_{i}(\theta_{n}^{i})-\nabla f_{i}(\theta_{n})\right]. (40)

By the Lipschitz continuity of function Fi(,X)\nabla F_{i}(\cdot,X) in (7), we have

rn1Ni=1NLθniθnLNΘn1N(𝟏𝟏T𝐈d)Θn=LN𝒥Θn,\left\|r_{n}\right\|\leq\frac{1}{N}\sum_{i=1}^{N}L\|\theta_{n}^{i}-\theta_{n}\|\leq\frac{L}{\sqrt{N}}\left\|\Theta_{n}-\frac{1}{N}(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d})\Theta_{n}\right\|=\frac{L}{\sqrt{N}}\|\mathcal{J}_{\bot}\Theta_{n}\|, (41)

where the second inequality comes from the Cauchy–Schwarz inequality. In Appendix B, we have shown that limn𝒥Θn=𝟎\lim_{n}\mathcal{J}_{\bot}\Theta_{n}=\mathbf{0} almost surely, so that limnrn=0\lim_{n\to\infty}r_{n}=0 almost surely.

Next, we further decompose the fourth term in (39). For an ergodic transition matrix 𝐏\mathbf{P} and a function vv associated with the same state space 𝒳\mathcal{X}, define the operator 𝐏kv(x)y𝒳𝐏k(x,y)v(y)\mathbf{P}^{k}v(x)\triangleq\sum_{y\in\mathcal{X}}\mathbf{P}^{k}(x,y)v(y) for the kk-step transition probability 𝐏k(x,y)\mathbf{P}^{k}(x,y). Denote by 𝐏1,,𝐏N\mathbf{P}_{1},\cdots,\mathbf{P}_{N} the underlying transition matrices of all NN agents with corresponding stationary distributions 𝝅1,,𝝅N\boldsymbol{\pi}_{1},\cdots,\boldsymbol{\pi}_{N}. Then, for every function Fi(θi,):𝒳id\nabla F_{i}(\theta^{i},\cdot):\mathcal{X}_{i}\to\mathbb{R}^{d}, there exists a corresponding function mθi():𝒳idm_{\theta^{i}}(\cdot):\mathcal{X}_{i}\to\mathbb{R}^{d} such that

mθi(x)𝐏imθi(x)=Fi(θi,x)fi(θi).m_{\theta^{i}}(x)-\mathbf{P}_{i}m_{\theta^{i}}(x)=\nabla F_{i}(\theta^{i},x)-\nabla f_{i}(\theta^{i}). (42)

The solution of the Poisson equation (42) has been studied in the literature, e.g., [17, 38]. For completeness, we derive the closed-form mθi(x)m_{\theta^{i}}(x) from scratch. First, we write mθi(x)m_{\theta^{i}}(x) in the recursive form as follows,

mθi(x)=Fi(θi,x)fi(θi)+𝐏i[Fi(θi,)fi(θi)](x)+𝐏i2[Fi(θi,)fi(θi)](x)+.m_{\theta^{i}}(x)=\nabla F_{i}(\theta^{i},x)-\nabla f_{i}(\theta^{i})+\mathbf{P}_{i}[\nabla F_{i}(\theta^{i},\cdot)-\nabla f_{i}(\theta^{i})](x)+\mathbf{P}_{i}^{2}[\nabla F_{i}(\theta^{i},\cdot)-\nabla f_{i}(\theta^{i})](x)+\cdots. (43)

It is not hard to check that (43) satisfies (42). Note that by induction we get

𝐏ik𝟏(𝝅i)T=(𝐏i𝟏(𝝅i)T)k,k,k1.\mathbf{P}^{k}_{i}-\mathbf{1}(\boldsymbol{\pi}_{i})^{T}=\left(\mathbf{P}_{i}-\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right)^{k},\forall k\in\mathbb{N},k\geq 1. (44)

Then, we can further simplify (43), and the closed-form expression of mθi(x)m_{\theta^{i}}(x) is given as

mθi(x)=y𝒳i[𝐏i𝟏(𝝅i)T]0(x,y)(Fi(θi,y)fi(θi))+y𝒳i[𝐏i1𝟏(𝝅i)T](x,y)(Fi(θi,y)fi(θi))+=y𝒳i[k=0[𝐏i𝟏(𝝅i)T]k](x,y)(Fi(θi,y)fi(θi))=y𝒳i(𝐈𝐏i+𝟏(𝝅i)T)1(x,y)(Fi(θi,y)fi(θi)),\begin{split}m_{\theta^{i}}(x)&=\sum_{y\in\mathcal{X}_{i}}\left[\mathbf{P}_{i}-\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right]^{0}(x,y)(\nabla F_{i}(\theta^{i},y)-\nabla f_{i}(\theta^{i}))\\ &\quad+\sum_{y\in\mathcal{X}^{i}}\left[\mathbf{P}_{i}^{1}-\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right](x,y)(\nabla F_{i}(\theta^{i},y)-\nabla f_{i}(\theta^{i}))+\cdots\\ &=\sum_{y\in\mathcal{X}_{i}}\left[\sum_{k=0}^{\infty}\left[\mathbf{P}_{i}-\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right]^{k}\right](x,y)(\nabla F_{i}(\theta^{i},y)-\nabla f_{i}(\theta^{i}))\\ &=\sum_{y\in\mathcal{X}_{i}}\left(\mathbf{I}-\mathbf{P}_{i}+\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right)^{-1}(x,y)(\nabla F_{i}(\theta^{i},y)-\nabla f_{i}(\theta^{i})),\\ \end{split} (45)

where the last equality comes from (44). Note that the so-called ‘fundamental matrix’ (𝐈𝐏i+𝟏(𝝅i)T)1(\mathbf{I}-\mathbf{P}_{i}+\mathbf{1}(\boldsymbol{\pi}_{i})^{T})^{-1} exists for every ergodic Markov chain XiX^{i} by 2.2.
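The closed form (45) can be verified directly on a small example. The following sketch is our own construction (a random 5-state ergodic chain and a scalar stand-in for ∇F_i(θ^i, ·), i.e., d = 1): it builds the fundamental matrix and checks that the resulting m solves the Poisson equation (42).

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)          # random ergodic transition matrix

evals, evecs = np.linalg.eig(P.T)          # stationary distribution: left eigenvector at 1
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

F = rng.normal(size=5)                     # stand-in for the gradient, one value per state
f = pi @ F                                 # stationary mean, playing the role of grad f_i
Z = np.linalg.inv(np.eye(5) - P + np.outer(np.ones(5), pi))   # fundamental matrix in (45)
m = Z @ (F - f)

print(np.allclose(m - P @ m, F - f))       # -> True: the Poisson equation (42) holds
```

Since function Fi\nabla F_{i} is Lipschitz continuous, we have the following lemma.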

Lemma C.2.

Under the Lipschitz condition in (7), the functions mθi(x)m_{\theta^{i}}(x) and 𝐏imθi(x)\mathbf{P}_{i}m_{\theta^{i}}(x) are both Lipschitz continuous in θi\theta^{i} for any x𝒳ix\in\mathcal{X}_{i}.

Proof.

By (45), for any θ1i,θ2id\theta^{i}_{1},\theta^{i}_{2}\in\mathbb{R}^{d} and x𝒳ix\in\mathcal{X}_{i}, we have

mθ1i(x)mθ2i(x)y𝒳i(𝐈𝐏i+𝟏(𝝅i)T)1(x,y)[Fi(θ1i,y)Fi(θ2i,y)]+fi(θ1i)fi(θ2i)Cimaxy𝒳iFi(θ1i,y)Fi(θ2i,y)+fi(θ1i)fi(θ2i)(CiL+1)θ1iθ2i,\begin{split}\left\|m_{\theta^{i}_{1}}(x)-m_{\theta^{i}_{2}}(x)\right\|\leq&\left\|\sum_{y\in\mathcal{X}_{i}}\left(\mathbf{I}-\mathbf{P}_{i}+\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right)^{-1}(x,y)\left[\nabla F_{i}(\theta^{i}_{1},y)-\nabla F_{i}(\theta^{i}_{2},y)\right]\right\|\\ &+\left\|\nabla f_{i}(\theta^{i}_{1})-\nabla f_{i}(\theta^{i}_{2})\right\|\\ \leq&C_{i}\max_{y\in\mathcal{X}_{i}}\left\|\nabla F_{i}(\theta^{i}_{1},y)-\nabla F_{i}(\theta^{i}_{2},y)\right\|+\left\|\nabla f_{i}(\theta^{i}_{1})-\nabla f_{i}(\theta^{i}_{2})\right\|\\ \leq&(C_{i}L+1)\|\theta^{i}_{1}-\theta^{i}_{2}\|,\end{split} (46)

where the second inequality holds for a constant CiC_{i} determined by the matrix (𝐈𝐏i+𝟏(𝝅i)T)1(\mathbf{I}-\mathbf{P}_{i}+\mathbf{1}(\boldsymbol{\pi}_{i})^{T})^{-1}, e.g., |𝒳i||\mathcal{X}_{i}| times its largest absolute entry. Therefore, mθi(x)m_{\theta^{i}}(x) is Lipschitz continuous in θi\theta^{i}. Moreover, following similar steps as above, we have

𝐏imθ1i(x)𝐏imθ2i(x)=y𝒳i𝐏i(x,y)mθ1i(y)y𝒳i𝐏i(x,y)mθ2i(y)=y𝒳i𝐏i(x,y)(mθ1i(y)mθ2i(y))y𝒳i𝐏i(x,y)mθ1i(y)mθ2i(y)|𝒳i|maxy𝒳imθ1i(y)mθ2i(y)|𝒳i|(CiL+1)θ1iθ2i\begin{split}\left\|\mathbf{P}_{i}m_{\theta^{i}_{1}}(x)-\mathbf{P}_{i}m_{\theta^{i}_{2}}(x)\right\|&=\left\|\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)m_{\theta^{i}_{1}}(y)-\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)m_{\theta^{i}_{2}}(y)\right\|\\ &=\left\|\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)\left(m_{\theta^{i}_{1}}(y)-m_{\theta^{i}_{2}}(y)\right)\right\|\\ &\leq\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)\left\|m_{\theta^{i}_{1}}(y)-m_{\theta^{i}_{2}}(y)\right\|\\ &\leq|\mathcal{X}_{i}|\max_{y\in\mathcal{X}_{i}}\left\|m_{\theta^{i}_{1}}(y)-m_{\theta^{i}_{2}}(y)\right\|\\ &\leq|\mathcal{X}_{i}|(C_{i}L+1)\|\theta^{i}_{1}-\theta^{i}_{2}\|\end{split} (47)

such that 𝐏imθi(x)\mathbf{P}_{i}m_{\theta^{i}}(x) is also Lipschitz continuous in θi\theta^{i}, which completes the proof. ∎

Now with (42) we can decompose Fi(θni,Xni)fi(θni)\nabla F_{i}(\theta_{n}^{i},X_{n}^{i})-\nabla f_{i}(\theta_{n}^{i}) as

Fi(θni,Xni)fi(θni)=mθni(Xni)𝐏imθni(Xni)=mθni(Xni)𝐏imθni(Xn1i)en+1i+𝐏imθni(Xn1i)νni𝐏imθn+1i(Xni)νn+1i+𝐏imθn+1i(Xni)𝐏imθni(Xni)ξn+1i.\begin{split}\nabla F_{i}(\theta_{n}^{i},X_{n}^{i})-\nabla f_{i}(\theta_{n}^{i})=&m_{\theta_{n}^{i}}(X_{n}^{i})-\mathbf{P}_{i}m_{\theta_{n}^{i}}(X_{n}^{i})\\ =&\underbrace{m_{\theta_{n}^{i}}(X_{n}^{i})-\mathbf{P}_{i}m_{\theta_{n}^{i}}(X_{n-1}^{i})}_{e^{i}_{n+1}}\\ &+\underbrace{\mathbf{P}_{i}m_{\theta_{n}^{i}}(X_{n-1}^{i})}_{\nu_{n}^{i}}-\underbrace{\mathbf{P}_{i}m_{\theta_{n+1}^{i}}(X_{n}^{i})}_{\nu_{n+1}^{i}}\\ &+\underbrace{\mathbf{P}_{i}m_{\theta_{n+1}^{i}}(X_{n}^{i})-\mathbf{P}_{i}m_{\theta_{n}^{i}}(X_{n}^{i})}_{\xi^{i}_{n+1}}.\end{split} (48)

Here, {γneni}\{\gamma_{n}e^{i}_{n}\} is a martingale difference sequence, and we will need the martingale convergence theorem stated in Theorem C.3.

Theorem C.3 (Theorem 6.4.6 [64]).

For an n\mathcal{F}_{n}-Martingale SnS_{n}, set Xn1=SnSn1X_{n-1}=S_{n}-S_{n-1}. If for some 1p21\leq p\leq 2,

n=1𝔼[Xn1p|n1]<a.s.\sum_{n=1}^{\infty}\mathbb{E}[\|X_{n-1}\|^{p}|\mathcal{F}_{n-1}]<\infty\quad a.s. (49)

then SnS_{n} converges almost surely. ∎

We want to show that nγn+12𝔼[en+1i2|n]<\sum_{n}\gamma_{n+1}^{2}\mathbb{E}[\|e^{i}_{n+1}\|^{2}|\mathcal{F}_{n}]<\infty so that nγneni\sum_{n}\gamma_{n}e^{i}_{n} converges almost surely by Theorem C.3. As seen in (45), with Lemma C.2 and 2.4, along each sample path (Θn\Theta_{n} within a compact set Ω\Omega), we have supnmθni(x)<\sup_{n}\|m_{\theta^{i}_{n}}(x)\|<\infty and supn𝐏imθni(x)<\sup_{n}\|\mathbf{P}_{i}m_{\theta^{i}_{n}}(x)\|<\infty almost surely for all x𝒳ix\in\mathcal{X}_{i}. This ensures that en+1ie^{i}_{n+1} is an L2L_{2}-bounded martingale difference sequence, i.e., supnen+1isupn(mθni(Xn+1i)+𝐏imθni(Xni))DΩ<\sup_{n}\|e^{i}_{n+1}\|\leq\sup_{n}(\|m_{\theta^{i}_{n}}(X^{i}_{n+1})\|+\|\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n})\|)\leq D_{\Omega}<\infty. Together with 2.3, we get

nγn+12𝔼[en+1i2|n]DΩ2nγn+12<a.s.\sum_{n}\gamma_{n+1}^{2}\mathbb{E}[\|e^{i}_{n+1}\|^{2}|\mathcal{F}_{n}]\leq D_{\Omega}^{2}\sum_{n}\gamma_{n+1}^{2}<\infty\quad a.s. (50)

and thus nγneni\sum_{n}\gamma_{n}e^{i}_{n} converges almost surely.

Next, for the term νni\nu^{i}_{n} we have

k=0pγk+1(νkiνk+1i)=k=0p(γk+1γk)νki+γ0ν0iγp+1νp+1i.\sum_{k=0}^{p}\gamma_{k+1}(\nu^{i}_{k}-\nu_{k+1}^{i})=\sum_{k=0}^{p}(\gamma_{k+1}-\gamma_{k})\nu_{k}^{i}+\gamma_{0}\nu_{0}^{i}-\gamma_{p+1}\nu^{i}_{p+1}. (51)

As shown before, for a given sample path, 𝐏imθni(x)\|\mathbf{P}_{i}m_{\theta^{i}_{n}}(x)\| is bounded almost surely for all nn and x𝒳ix\in\mathcal{X}_{i}, so that supnνni<\sup_{n}\|\nu_{n}^{i}\|<\infty almost surely. Since limn(γn+1γn)=0\lim_{n\to\infty}(\gamma_{n+1}-\gamma_{n})=0, we have limn(γn+1γn)νni=0\lim_{n\to\infty}(\gamma_{n+1}-\gamma_{n})\nu_{n}^{i}=0. Note that there exists a path-dependent constant CC (that bounds νni\|\nu_{n}^{i}\|) such that for any nmn\geq m,

k=mn(γk+1γk)νkiCk=mn(γkγk+1)=C(γmγn+1)<Cγm.\left\|\sum_{k=m}^{n}(\gamma_{k+1}-\gamma_{k})\nu_{k}^{i}\right\|\leq C\sum_{k=m}^{n}(\gamma_{k}-\gamma_{k+1})=C(\gamma_{m}-\gamma_{n+1})<C\gamma_{m}. (52)

Since limnγn=0\lim_{n\to\infty}\gamma_{n}=0, for every ϵ>0\epsilon>0 there exists a positive integer MM such that for all nmMn\geq m\geq M, γm<ϵ/C\gamma_{m}<\epsilon/C and hence k=mn(γk+1γk)νki<ϵ\|\sum_{k=m}^{n}(\gamma_{k+1}-\gamma_{k})\nu_{k}^{i}\|<\epsilon. Therefore, {k=0p(γk+1γk)νki}p0\{\sum_{k=0}^{p}(\gamma_{k+1}-\gamma_{k})\nu_{k}^{i}\}_{p\geq 0} is a Cauchy sequence and k=0(γk+1γk)νki\sum_{k=0}^{\infty}(\gamma_{k+1}-\gamma_{k})\nu_{k}^{i} converges by the Cauchy criterion. The last term of (51) tends to zero. Therefore, k=0γk+1(νkiνk+1i)\sum_{k=0}^{\infty}\gamma_{k+1}(\nu^{i}_{k}-\nu_{k+1}^{i}) converges and is finite.
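The summation-by-parts identity (51) used above can also be checked numerically with arbitrary sequences, as in the following sketch (the specific gamma_k and nu_k are placeholders).

```python
import numpy as np

rng = np.random.default_rng(4)
p = 50
gamma = 1.0 / np.arange(1, p + 3)   # gamma_0, ..., gamma_{p+1}
nu = rng.normal(size=p + 2)         # nu_0, ..., nu_{p+1}

lhs = sum(gamma[k + 1] * (nu[k] - nu[k + 1]) for k in range(p + 1))
rhs = (sum((gamma[k + 1] - gamma[k]) * nu[k] for k in range(p + 1))
       + gamma[0] * nu[0] - gamma[p + 1] * nu[p + 1])
print(np.isclose(lhs, rhs))         # -> True, matching (51)
```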

For the last term ξni\xi^{i}_{n}, Lemma C.2 leads to

1Ni=1Nξn+1iCNi=1Nθn+1iθniCNΘn+1Θn.\frac{1}{N}\sum_{i=1}^{N}\left\|\xi^{i}_{n+1}\right\|\leq\frac{C^{\prime}}{N}\sum_{i=1}^{N}\|\theta^{i}_{n+1}-\theta^{i}_{n}\|\leq\frac{C^{\prime}}{\sqrt{N}}\|\Theta_{n+1}-\Theta_{n}\|. (53)

where CC^{\prime} is the Lipschitz constant of 𝐏imθi(x)\mathbf{P}_{i}m_{\theta^{i}}(x) from Lemma C.2. However, the relationship between θn\theta_{n} and θn+1\theta_{n+1} is not obvious in the D-SGD and FL settings due to the update rule (18) with communication matrix 𝒲n\mathcal{W}_{n}, unlike the classical stochastic approximation in (38). To resolve this issue, we introduce a novel decomposition of ξni\xi^{i}_{n} that takes the consensus error into account, i.e.,

ξn+1i=[𝐏imθn+1i(Xni)𝐏imθn+1(Xni)]+[𝐏imθn(Xni)𝐏imθni(Xni)]+[𝐏imθn+1(Xni)𝐏imθn(Xni)].\begin{split}\xi^{i}_{n+1}=&\left[\mathbf{P}_{i}m_{\theta_{n+1}^{i}}(X_{n}^{i})-\mathbf{P}_{i}m_{\theta_{n+1}}(X_{n}^{i})\right]+\left[\mathbf{P}_{i}m_{\theta_{n}}(X_{n}^{i})-\mathbf{P}_{i}m_{\theta_{n}^{i}}(X_{n}^{i})\right]\\ &+\left[\mathbf{P}_{i}m_{\theta_{n+1}}(X^{i}_{n})-\mathbf{P}_{i}m_{\theta_{n}}(X^{i}_{n})\right].\end{split} (54)

Using the Lipschitz property of 𝐏imθ(X)\mathbf{P}_{i}m_{\theta}(X) in Lemma C.2, we have

1Ni=1Nξn+1iCNi=1N(θn+1iθn+1+θn+1θn+θnθni)CNΘn+11N(𝟏𝟏T𝐈d)Θn+1+CNΘn1N(𝟏𝟏T𝐈d)Θn+Cθn+1θn=CN(𝒥Θn+1+𝒥Θn)+Cθn+1θn=CN(𝒥Θn+1+𝒥Θn)+Cγn+11N(𝟏T𝐈d)𝐅(Θn,𝐗n).\begin{split}\frac{1}{N}\sum_{i=1}^{N}\left\|\xi^{i}_{n+1}\right\|\leq&\frac{C^{\prime}}{N}\sum_{i=1}^{N}\left(\left\|\theta^{i}_{n+1}-\theta_{n+1}\right\|+\left\|\theta_{n+1}-\theta_{n}\right\|+\left\|\theta_{n}-\theta_{n}^{i}\right\|\right)\\ \leq&\frac{C^{\prime}}{\sqrt{N}}\left\|\Theta_{n+1}-\frac{1}{N}\left(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d}\right)\Theta_{n+1}\right\|+\frac{C^{\prime}}{\sqrt{N}}\left\|\Theta_{n}-\frac{1}{N}\left(\mathbf{1}\mathbf{1}^{T}\otimes\mathbf{I}_{d}\right)\Theta_{n}\right\|\\ &+C^{\prime}\left\|\theta_{n+1}-\theta_{n}\right\|\\ =&\frac{C^{\prime}}{\sqrt{N}}\left(\left\|\mathcal{J}_{\bot}\Theta_{n+1}\right\|+\left\|\mathcal{J}_{\bot}\Theta_{n}\right\|\right)+C^{\prime}\left\|\theta_{n+1}-\theta_{n}\right\|\\ =&\frac{C^{\prime}}{\sqrt{N}}\left(\left\|\mathcal{J}_{\bot}\Theta_{n+1}\right\|+\left\|\mathcal{J}_{\bot}\Theta_{n}\right\|\right)+C^{\prime}\gamma_{n+1}\left\|\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d})\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\right\|.\end{split} (55)

In Appendix B, we have shown that limn𝒥Θn=𝟎\lim_{n\to\infty}\mathcal{J}_{\bot}\Theta_{n}=\mathbf{0} almost surely. Moreover, 1N(𝟏T𝐈d)𝐅(Θn,𝐗n)\|\frac{1}{N}(\mathbf{1}^{T}\otimes\mathbf{I}_{d})\nabla\mathbf{F}(\Theta_{n},\mathbf{X}_{n})\| is bounded along each sample path. Therefore, limn1Ni=1Nξn+1i=0\lim_{n\to\infty}\frac{1}{N}\sum_{i=1}^{N}\|\xi^{i}_{n+1}\|=0, and hence limn1Ni=1Nξn+1i=0\lim_{n\to\infty}\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n+1}=0 almost surely.

To sum up, we decompose (39) into

θn+1=θnγn+1f(θn)γn+1rnγn+11Ni=1N(en+1i+νniνn+1i+ξn+1i).\theta_{n+1}=\theta_{n}-\gamma_{n+1}\nabla f(\theta_{n})-\gamma_{n+1}r_{n}-\gamma_{n+1}\frac{1}{N}\sum_{i=1}^{N}\left(e^{i}_{n+1}+\nu^{i}_{n}-\nu^{i}_{n+1}+\xi^{i}_{n+1}\right). (56)

Now that limpn=1p1Ni=1Nγneni\lim_{p\to\infty}\sum_{n=1}^{p}\frac{1}{N}\sum_{i=1}^{N}\gamma_{n}e^{i}_{n} and limpn=0p1Ni=1Nγn+1(νniνn+1i)\lim_{p\to\infty}\sum_{n=0}^{p}\frac{1}{N}\sum_{i=1}^{N}\gamma_{n+1}(\nu^{i}_{n}-\nu_{n+1}^{i}) exist and are finite, limnrn=0\lim_{n\to\infty}r_{n}=0, and limn1Ni=1Nξni=0\lim_{n\to\infty}\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}=0, all the conditions of C3 in Theorem C.1 are satisfied. Additionally, 2.4 corresponds to C1, 2.3 meets C2, and C4 is automatically satisfied by choosing the Lyapunov function V(θ)=f(θ)V(\theta)=f(\theta). Therefore, lim supninfθθnθ=0\limsup_{n}\inf_{\theta^{*}\in\mathcal{L}}\|\theta_{n}-\theta^{*}\|=0.

Appendix D Proof of Theorem 3.3

To prove Theorem 3.3, we utilize the existing CLT result for general stochastic approximation in Theorem D.1 and verify all the necessary conditions therein.

Theorem D.1 (Theorem 2.1 [29]).

Consider the stochastic approximation iteration (38) and assume the following:

  1. C1.

Let θ\theta^{*} be a root of the function hh, i.e., h(θ)=0h(\theta^{*})=0, and assume limnθn=θ\lim_{n\to\infty}\theta_{n}=\theta^{*}. Moreover, assume the mean field hh is twice continuously differentiable in a neighborhood of θ\theta^{*}, and the Jacobian 𝐇h(θ)\mathbf{H}\triangleq\nabla h(\theta^{*}) is Hurwitz, i.e., the largest real part BB of its eigenvalues satisfies B<0B<0;

  2. C2.

The step sizes satisfy nγn=\sum_{n}\gamma_{n}=\infty and nγn2<\sum_{n}\gamma^{2}_{n}<\infty, and either (i) log(γn1/γn)=o(γn)\log(\gamma_{n-1}/\gamma_{n})=o(\gamma_{n}), or (ii) log(γn1/γn)γn/γ\log(\gamma_{n-1}/\gamma_{n})\sim\gamma_{n}/\gamma_{\star} for some γ>1/(2|B|)\gamma_{\star}>1/(2|B|);

  3. C3.

    supnθni<\sup_{n}\|\theta^{i}_{n}\|<\infty almost surely for any i[N]i\in[N];

  4. C4.
    1. (a)

      {en}n0\{e_{n}\}_{n\geq 0} is an n\mathcal{F}_{n}-Martingale difference sequence, i.e., 𝔼[en|n1]=0\mathbb{E}[e_{n}|\mathcal{F}_{n-1}]=0, and there exists τ>0\tau>0 such that supn0𝔼[en2+τ|n1]<\sup_{n\geq 0}\mathbb{E}[\|e_{n}\|^{2+\tau}|\mathcal{F}_{n-1}]<\infty;

    2. (b)

      𝔼[en+1en+1T|n]=𝐔+𝐃n(A)+𝐃n(B)\mathbb{E}[e_{n+1}e_{n+1}^{T}|\mathcal{F}_{n}]=\mathbf{U}+\mathbf{D}^{(A)}_{n}+\mathbf{D}^{(B)}_{n}, where 𝐔\mathbf{U} is a symmetric positive semi-definite matrix and

      {𝐃n(A)0almost surely,limnγn𝔼[k=1n𝐃k(B)]=0.\begin{cases}&\mathbf{D}^{(A)}_{n}\to 0~{}~{}\text{almost surely},\\ &\lim_{n}\gamma_{n}\mathbb{E}\left[\left\|\sum_{k=1}^{n}\mathbf{D}^{(B)}_{k}\right\|\right]=0.\end{cases} (57)
  5. C5.

    Let rn=rn(1)+rn(2)r_{n}=r^{(1)}_{n}+r^{(2)}_{n}, rnr_{n} is n\mathcal{F}_{n}-adapted, and

    {rn(1)=o(γn)a.s.γnk=1nrk(2)=o(1)a.s.\begin{cases}&\left\|r^{(1)}_{n}\right\|=o(\sqrt{\gamma_{n}})\quad a.s.\\ &\sqrt{\gamma_{n}}\left\|\sum_{k=1}^{n}r^{(2)}_{k}\right\|=o(1)\quad a.s.\\ \end{cases} (58)

Then,

1γn(θnθ)ndist.𝒩(0,𝐕),\frac{1}{\sqrt{\gamma_{n}}}(\theta_{n}-\theta^{*})\xrightarrow[n\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}), (59)

where

{𝐕𝐇T+𝐇𝐕=𝐔in case C2 (i),𝐕(𝐈d+2γ𝐇T)+(𝐈d+2γ𝐇)𝐕=2γ𝐔in case C2 (ii).\begin{cases}\mathbf{V}\mathbf{H}^{T}+\mathbf{H}\mathbf{V}=-\mathbf{U}&\text{in case C2 (i)},\\ \mathbf{V}(\mathbf{I}_{d}+2\gamma_{\star}\mathbf{H}^{T})+(\mathbf{I}_{d}+2\gamma_{\star}\mathbf{H})\mathbf{V}=-2\gamma_{\star}\mathbf{U}&\text{in case C2 (ii)}.\end{cases} (60)

Note that the matrix 𝐔\mathbf{U} in condition C4(b) of Theorem D.1 was assumed to be positive definite in the original Theorem 2.1 [29]. That assumption only served to make the solution 𝐕\mathbf{V} of the Lyapunov equation (60) positive definite, which in turn was used solely for the stability of the related autonomous linear ODE (e.g., Theorem 3.16 [15] or Theorem 2.2.3 [36]). However, in this paper we do not need 𝐕\mathbf{V} to be strictly positive definite. Therefore, we relax 𝐔\mathbf{U} to be positive semi-definite, so that 𝐕\mathbf{V} is also positive semi-definite (see Lemma D.2 for the closed form of 𝐕\mathbf{V}). This relaxation does not change any of the proof steps in [29].
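For case C2 (i), the Lyapunov equation in (60) can be solved with standard routines. The sketch below uses an illustrative Hurwitz H and PSD U (placeholders, not quantities derived in this paper) and checks that the SciPy solution satisfies the equation and is positive semi-definite, consistent with the discussion above.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
H = -(A @ A.T) - np.eye(3)      # symmetric negative definite, hence Hurwitz (illustrative)
B = rng.normal(size=(3, 3))
U = B @ B.T                     # symmetric positive semi-definite (illustrative)

V = solve_continuous_lyapunov(H, -U)         # solves H V + V H^T = -U, i.e., case C2 (i)
print(np.allclose(H @ V + V @ H.T, -U))      # -> True
print(np.all(np.linalg.eigvalsh((V + V.T) / 2) >= -1e-10))  # V is PSD, as argued above
```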

D.1 Discussion about C1-C3

Our Assumption 2.1 corresponds to C1 by letting h(θ)=f(θ)h(\theta)=-\nabla f(\theta) therein. We can also choose γ\gamma_{\star} in Theorem 3.3 large enough to satisfy C2. The typical form of step size, as also indicated in [29], is the polynomial step size γnγ/na\gamma_{n}\sim\gamma_{\star}/n^{a} for a(0.5,1]a\in(0.5,1]. Note that a(0.5,1)a\in(0.5,1) satisfies C2 (i) and a=1a=1 satisfies C2 (ii). Assumption 2.4 corresponds to C3. (Theorem D.1 is slightly modified in terms of condition C3, which is mentioned as a special case in Section 2.2 [29]; for the sake of mathematical simplicity, we stick to condition C3 in the proof.)

D.2 Analysis of C4

To check condition C4, we analyze the martingale difference sequence {eni}\{e^{i}_{n}\}. Recall en+1i=mθni(Xni)𝐏imθni(Xn1i)e^{i}_{n+1}=m_{\theta^{i}_{n}}(X^{i}_{n})-\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1}), so that there exists a constant CC with

𝔼[en+1i2+τ|n]C𝔼[mθni(Xni)2+τ+𝐏imθni(Xn1i)2+τ|n]=CY𝒳i𝐏i(Xn1i,Y)mθni(Y)2+τ+C𝐏imθni(Xn1i)2+τ.\begin{split}\mathbb{E}\left[\left\|e^{i}_{n+1}\right\|^{2+\tau}|\mathcal{F}_{n}\right]&\leq C\mathbb{E}\left[\left.\left\|m_{\theta^{i}_{n}}(X^{i}_{n})\right\|^{2+\tau}+\left\|\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\right\|^{2+\tau}\right|\mathcal{F}_{n}\right]\\ &=C\sum_{Y\in\mathcal{X}^{i}}\mathbf{P}_{i}(X^{i}_{n-1},Y)\left\|m_{\theta^{i}_{n}}(Y)\right\|^{2+\tau}+C\left\|\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\right\|^{2+\tau}.\end{split} (61)

Since mθni(Y)<\|m_{\theta^{i}_{n}}(Y)\|<\infty almost surely by 2.4 and 𝒳i\mathcal{X}_{i} is a finite state space, for all time nn we have

Y𝒳i𝐏i(Xn1i,Y)mθni(Y)2+τ<a.s.\sum_{Y\in\mathcal{X}^{i}}\mathbf{P}_{i}(X^{i}_{n-1},Y)\left\|m_{\theta^{i}_{n}}(Y)\right\|^{2+\tau}<\infty\quad a.s. (62)

and, by the definition of \mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1}), there exists another constant C' such that

𝐏imθni(Xn1i)2+τCY𝒳i𝐏i(Xn1i,Y)mθni(Y)2+τ<a.s.\left\|\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\right\|^{2+\tau}\leq C^{\prime}\sum_{Y\in\mathcal{X}^{i}}\mathbf{P}_{i}(X^{i}_{n-1},Y)\left\|m_{\theta^{i}_{n}}(Y)\right\|^{2+\tau}<\infty\quad a.s. (63)

Therefore, 𝔼[en+1i2+τ|n]<\mathbb{E}[\|e^{i}_{n+1}\|^{2+\tau}|\mathcal{F}_{n}]<\infty a.s. for all nn and C4.(a) is satisfied.

We now turn to C4.(b). Note that for any i\neq j, we have \mathbb{E}[e^{i}_{n+1}(e^{j}_{n+1})^{T}|\mathcal{F}_{n}]=\mathbb{E}[e^{i}_{n+1}|\mathcal{F}_{n}]\cdot\mathbb{E}[(e^{j}_{n+1})^{T}|\mathcal{F}_{n}]=0, due to the independence between agents i and j and the fact that \mathbb{E}[e^{i}_{n+1}|\mathcal{F}_{n}]=0. Then, we have

𝔼[(1Ni=1Nen+1i)(1Ni=1Nen+1i)T|n]=1N2i=1N𝔼[en+1i(en+1i)T|n].\mathbb{E}\left[\left.\left(\frac{1}{N}\sum_{i=1}^{N}e^{i}_{n+1}\right)\left(\frac{1}{N}\sum_{i=1}^{N}e^{i}_{n+1}\right)^{T}\right|\mathcal{F}_{n}\right]=\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbb{E}\left[\left.e^{i}_{n+1}(e^{i}_{n+1})^{T}\right|\mathcal{F}_{n}\right]. (64)

The analysis of \mathbb{E}[e^{i}_{n+1}(e^{i}_{n+1})^{T}|\mathcal{F}_{n}] is inspired by Section 4 of [29] and Section 4.3.3 of [22], which constructed another Poisson equation to further decompose the noise terms therein.555However, we note that [29, 22] assumed the Lipschitz continuity of the function F^{i}_{\theta^{i}}(x) defined in (68), whereas we provide a detailed proof of it. We also obtain the matrix \mathbf{U}_{i} in an explicit form, which coincides with the definition of the asymptotic covariance matrix and was not simplified in [29]. The discussion on the improvement of \mathbf{U}_{i} is outlined in Section 3.2, which was not the focus of [29, 22] and was not covered therein. Here, expanding \mathbb{E}[e^{i}_{n+1}(e^{i}_{n+1})^{T}|\mathcal{F}_{n}] gives

\begin{split}\mathbb{E}\left[\left.e^{i}_{n+1}(e^{i}_{n+1})^{T}\right|\mathcal{F}_{n}\right]=&~\mathbb{E}[m_{\theta^{i}_{n}}(X^{i}_{n})m_{\theta^{i}_{n}}(X^{i}_{n})^{T}|\mathcal{F}_{n}]+\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\left(\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\right)^{T}\\ &-\mathbb{E}[m_{\theta^{i}_{n}}(X^{i}_{n})|\mathcal{F}_{n}]\left(\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\right)^{T}-\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\mathbb{E}[m_{\theta^{i}_{n}}(X^{i}_{n})^{T}|\mathcal{F}_{n}]\\ =&~\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(X^{i}_{n-1},y)m_{\theta^{i}_{n}}(y)m_{\theta^{i}_{n}}(y)^{T}-\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\left(\mathbf{P}_{i}m_{\theta^{i}_{n}}(X^{i}_{n-1})\right)^{T}.\end{split} (65)

Denote by

Gi(θi,x)y𝒳i𝐏i(x,y)mθi(y)mθi(y)T𝐏imθi(x)(𝐏imθi(x))T,G_{i}(\theta^{i},x)\triangleq\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)m_{\theta^{i}}(y)m_{\theta^{i}}(y)^{T}-\mathbf{P}_{i}m_{\theta^{i}}(x)\left(\mathbf{P}_{i}m_{\theta^{i}}(x)\right)^{T}, (66)

and let its expectation w.r.t. the stationary distribution \boldsymbol{\pi}_{i} be g_{i}(\theta^{i})\triangleq\mathbb{E}_{x\sim\boldsymbol{\pi}_{i}}[G_{i}(\theta^{i},x)]. Then we can construct another Poisson equation, i.e.,

\begin{split}&\mathbb{E}\left[\left.e^{i}_{n+1}(e^{i}_{n+1})^{T}\right|\mathcal{F}_{n}\right]-\sum_{X^{i}_{n}\in\mathcal{X}_{i}}\boldsymbol{\pi}_{i}(X^{i}_{n})\mathbb{E}\left[\left.e^{i}_{n+1}(e^{i}_{n+1})^{T}\right|\mathcal{F}_{n}\right]\\ =&~G_{i}(\theta^{i}_{n},X^{i}_{n-1})-g_{i}(\theta^{i}_{n})\\ =&~\varphi^{i}_{\theta_{n}^{i}}(X^{i}_{n-1})-\mathbf{P}_{i}\varphi^{i}_{\theta_{n}^{i}}(X^{i}_{n-1}),\end{split} (67)

for some matrix-valued function \varphi^{i}:\mathbb{R}^{d}\times\mathcal{X}_{i}\to\mathbb{R}^{d\times d}. Following steps similar to (42) - (45), we obtain the closed-form expression

\varphi^{i}_{\theta^{i}}(x)=\sum_{y\in\mathcal{X}_{i}}\left(\mathbf{I}-\mathbf{P}_{i}+\mathbf{1}(\boldsymbol{\pi}_{i})^{T}\right)^{-1}(x,y)\left(G_{i}(\theta^{i},y)-g_{i}(\theta^{i})\right). (68)

Then, we can decompose (65) into

Gi(θni,Xn1i)=gi(θ)𝐔i+gi(θni)gi(θ)𝐃i,n(1)+φθnii(Xni)𝐏iφθnii(Xn1i)𝐃i,n(2,a)+φθnii(Xn1i)φθnii(Xni)𝐃i,n(2,b).G_{i}(\theta^{i}_{n},X^{i}_{n-1})=\underbrace{g_{i}(\theta^{*})}_{\mathbf{U}_{i}}+\underbrace{g_{i}(\theta^{i}_{n})-g_{i}(\theta^{*})}_{\mathbf{D}^{(1)}_{i,n}}+\underbrace{\varphi^{i}_{\theta_{n}^{i}}(X^{i}_{n})-\mathbf{P}_{i}\varphi^{i}_{\theta_{n}^{i}}(X^{i}_{n-1})}_{\mathbf{D}^{(2,a)}_{i,n}}+\underbrace{\varphi^{i}_{\theta_{n}^{i}}(X^{i}_{n-1})-\varphi^{i}_{\theta_{n}^{i}}(X^{i}_{n})}_{\mathbf{D}^{(2,b)}_{i,n}}. (69)

Let \mathbf{U}\triangleq\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbf{U}_{i}, \mathbf{D}^{(1)}_{n}\triangleq\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbf{D}^{(1)}_{i,n}, \mathbf{D}^{(2,a)}_{n}\triangleq\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbf{D}^{(2,a)}_{i,n}, and \mathbf{D}^{(2,b)}_{n}\triangleq\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbf{D}^{(2,b)}_{i,n}. We aim to show that \mathbf{D}^{(1)}_{n} satisfies the first condition in C4 and that \mathbf{D}^{(2,a)}_{n},\mathbf{D}^{(2,b)}_{n} meet the second condition in C4.

We now show that for all ii, Gi(θi,x)G_{i}(\theta^{i},x) is Lipschitz continuous in θiΩ\theta^{i}\in\Omega for some compact subset Ωd\Omega\subset\mathbb{R}^{d}. For any x𝒳ix\in\mathcal{X}_{i} and θ1i,θ2iΩ\theta^{i}_{1},\theta_{2}^{i}\in\Omega, we can get

\begin{split}&\|m_{\theta^{i}_{1}}(x)m_{\theta^{i}_{1}}(x)^{T}-m_{\theta^{i}_{2}}(x)m_{\theta^{i}_{2}}(x)^{T}\|\\ =&~\|m_{\theta^{i}_{1}}(x)(m_{\theta^{i}_{1}}(x)-m_{\theta^{i}_{2}}(x))^{T}+(m_{\theta^{i}_{1}}(x)-m_{\theta^{i}_{2}}(x))m_{\theta^{i}_{2}}(x)^{T}\|\\ \leq&~\|m_{\theta^{i}_{1}}(x)-m_{\theta^{i}_{2}}(x)\|\left(\|m_{\theta^{i}_{1}}(x)\|+\|m_{\theta^{i}_{2}}(x)\|\right)\\ \leq&~C\|\theta^{i}_{1}-\theta^{i}_{2}\|,\end{split} (70)

for some constant C, where the last inequality follows because \|m_{\theta^{i}_{1}}(x)\|<\infty for \theta^{i}_{1}\in\Omega and because m_{\theta^{i}}(x) is Lipschitz continuous in \theta^{i}. Similarly, we get \|\mathbf{P}_{i}m_{\theta^{i}_{1}}(x)-\mathbf{P}_{i}m_{\theta^{i}_{2}}(x)\|\leq C\|\theta^{i}_{1}-\theta^{i}_{2}\|. Therefore, G_{i}(\theta^{i},x) and g_{i}(\theta^{i}) are Lipschitz continuous in \theta^{i}\in\Omega for any x\in\mathcal{X}_{i}.

For the sequence \{\mathbf{D}^{(1)}_{i,n}\}_{n\geq 0}, applying Theorem 3.2 and conditioning on \lim_{n\to\infty}\theta_{n}=\theta^{*} for an optimal point \theta^{*}\in\mathcal{L}, we have \lim_{n\to\infty}\|g_{i}(\theta^{i}_{n})-g_{i}(\theta^{*})\|\leq\lim_{n\to\infty}C\|\theta^{i}_{n}-\theta^{*}\|=0. This implies \mathbf{D}^{(1)}_{i,n}\to 0 for every i\in[N], and thus \mathbf{D}^{(1)}_{n}\to 0 as n\to\infty almost surely, which satisfies the first condition in (57).

For the Martingale difference sequence \{\mathbf{D}^{(2,a)}_{i,n}\}_{n\geq 0}, we use the Burkholder inequality (e.g., Theorem 2.10 [33], [21]): for p\geq 1 and some constant C_{p},

\mathbb{E}\left[\left\|\sum_{k=1}^{n}\mathbf{D}^{(2,a)}_{i,k}\right\|^{p}\right]\leq C_{p}\mathbb{E}\left[\left(\sum_{k=1}^{n}\left\|\mathbf{D}^{(2,a)}_{i,k}\right\|^{2}\right)^{p/2}\right]. (71)

By the definition (66) and Assumption 2.4, for a given sample path, \sup_{n}\|G_{i}(\theta^{i}_{n},x)\|<\infty for any x\in\mathcal{X}_{i}, as well as \sup_{n}\|g_{i}(\theta^{i}_{n})\|<\infty, which leads to \sup_{n}\|\varphi^{i}_{\theta^{i}_{n}}(x)\|<\infty for any x\in\mathcal{X}_{i} because of (68). Then, we have \sup_{k}\|\mathbf{D}^{(2,a)}_{i,k}\|\leq C<\infty for some path-dependent constant C. Taking p=1, we have

\lim_{n\to\infty}\gamma_{n}C_{p}\sqrt{\sum_{k=1}^{n}\left\|\mathbf{D}^{(2,a)}_{i,k}\right\|^{2}}\leq\lim_{n\to\infty}C_{p}C\gamma_{n}\sqrt{n}=0\quad a.s. (72)

Thus, the Lebesgue dominated convergence theorem gives

\lim_{n\to\infty}\gamma_{n}C_{p}\mathbb{E}\left[\sqrt{\sum_{k=1}^{n}\|\mathbf{D}^{(2,a)}_{i,k}\|^{2}}\right]=\mathbb{E}\left[\lim_{n\to\infty}\gamma_{n}C_{p}\sqrt{\sum_{k=1}^{n}\|\mathbf{D}^{(2,a)}_{i,k}\|^{2}}\right]=0

and hence \lim_{n\to\infty}\gamma_{n}\mathbb{E}[\|\sum_{k=1}^{n}\mathbf{D}^{(2,a)}_{i,k}\|]=0.

For the sequence {𝐃i,n(2,b)}n0\{\mathbf{D}^{(2,b)}_{i,n}\}_{n\geq 0}, we have

k=1n𝐃i,k(2,b)=k=1n(φθkii(Xk1i)φθk1ii(Xk1i))+φθ0ii(X0i)φθnii(Xni)=k=1n(φθkii(Xk1i)φθki(Xk1i)+φθki(Xk1i)φθk1i(Xk1i)+φθk1i(Xk1i)φθk1ii(Xk1i))+φθ0ii(X0i)φθnii(Xni).\begin{split}\sum_{k=1}^{n}\mathbf{D}^{(2,b)}_{i,k}=&\sum_{k=1}^{n}\left(\varphi^{i}_{\theta^{i}_{k}}(X^{i}_{k-1})-\varphi^{i}_{\theta^{i}_{k-1}}(X^{i}_{k-1})\right)+\varphi^{i}_{\theta^{i}_{0}}(X^{i}_{0})-\varphi^{i}_{\theta^{i}_{n}}(X^{i}_{n})\\ =&\sum_{k=1}^{n}\left(\varphi^{i}_{\theta^{i}_{k}}(X^{i}_{k\!-\!1})\!-\!\varphi^{i}_{\theta_{k}}\!(X^{i}_{k\!-\!1})\!+\!\varphi^{i}_{\theta_{k}}\!(X^{i}_{k\!-\!1})\!-\!\varphi^{i}_{\theta_{k\!-\!1}}\!(X^{i}_{k\!-\!1})\!+\!\varphi^{i}_{\theta_{k\!-\!1}}\!(X^{i}_{k\!-\!1})\!-\!\varphi^{i}_{\theta^{i}_{k\!-\!1}}\!(X^{i}_{k\!-\!1})\!\right)\\ &+\varphi^{i}_{\theta^{i}_{0}}(X^{i}_{0})-\varphi^{i}_{\theta^{i}_{n}}(X^{i}_{n}).\end{split} (73)

Since Gi(θi,x)G_{i}(\theta^{i},x) and gi(θi)g_{i}(\theta^{i}) are Lipschitz continuous in θiΩ\theta^{i}\in\Omega, φθii(x)\varphi^{i}_{\theta^{i}}(x) is also Lipschitz continuous in θiΩ\theta^{i}\in\Omega and is bounded. We have

\begin{split}\left\|\sum_{k=1}^{n}\mathbf{D}^{(2,b)}_{i,k}\right\|\leq&~\left\|\sum_{k=1}^{n}\left(\varphi^{i}_{\theta^{i}_{k}}(X^{i}_{k-1})-\varphi^{i}_{\theta^{i}_{k-1}}(X^{i}_{k-1})\right)\right\|+\left\|\varphi^{i}_{\theta^{i}_{0}}(X^{i}_{0})\right\|+\left\|\varphi^{i}_{\theta^{i}_{n}}(X^{i}_{n})\right\|\\ \leq&~\sum_{k=1}^{n}\left\|\varphi^{i}_{\theta^{i}_{k}}(X^{i}_{k-1})-\varphi^{i}_{\theta^{i}_{k-1}}(X^{i}_{k-1})\right\|+D_{1}\\ \leq&~\sum_{k=1}^{n}D_{2}D_{\Omega}\gamma_{k}+D_{1},\end{split} (74)

where \|\varphi^{i}_{\theta^{i}_{0}}(X^{i}_{0})\|+\|\varphi^{i}_{\theta^{i}_{n}}(X^{i}_{n})\|\leq D_{1} for a given sample path, D_{2} is the Lipschitz constant of \varphi^{i}_{\theta^{i}}(x), and \|\nabla F_{i}(x^{i},X^{i})\|\leq D_{\Omega} for any x^{i}\in\Omega and X^{i}\in\mathcal{X}_{i}. Then,

γnk=1n𝐃i,k(2,b)D2DΩγnk=1nγk+γnD10asn\gamma_{n}\left\|\sum_{k=1}^{n}\mathbf{D}^{(2,b)}_{i,k}\right\|\leq D_{2}D_{\Omega}\gamma_{n}\sum_{k=1}^{n}\gamma_{k}+\gamma_{n}D_{1}\to 0\quad\text{as}~{}n\to\infty (75)

because \gamma_{n}\sum_{k=1}^{n}\gamma_{k}=O(n^{1-2a}) by Assumption 2.3. Therefore, the second condition of C4 is satisfied.

D.3 Analysis of C5

We now analyze condition C5. The decreasing rate of each term in (56) has been established in Appendix C. Specifically, by Assumption 2.4, the iterates remain in a compact subset for a given sample path, and

  • we have shown that rn(A)=O(ηn)\left\|r_{n}^{(A)}\right\|=O(\eta_{n}) a.s., which implies rn(A)=o(γn)\left\|r_{n}^{(A)}\right\|=o(\sqrt{\gamma_{n}}) a.s.

  • For \frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}, in the case of an increasing communication interval, \frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}=O(\gamma_{n}+\eta_{n}). By Assumption 2.3-ii), (\gamma_{n}+\eta_{n})/\sqrt{\gamma_{n}}=\sqrt{\gamma_{n}}+\sqrt{\gamma_{n}}K^{L+1}_{\tau_{n}}=o(1), so that \|\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}\|=o(\sqrt{\gamma_{n}}) almost surely. In the case of a bounded communication interval, on the other hand, \frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}=O(\gamma_{n}), so that \|\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}\|=o(\sqrt{\gamma_{n}}) a.s.

  • Since supnνni<\sup_{n}\|\nu^{i}_{n}\|<\infty almost surely, we have supp1Ni=1Nk=0p(νkiνk+1i)=supp1Ni=1N(ν0iνp+1i)<\sup_{p}\|\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{p}(\nu^{i}_{k}-\nu^{i}_{k+1})\|=\sup_{p}\|\frac{1}{N}\sum_{i=1}^{N}(\nu^{i}_{0}-\nu^{i}_{p+1})\|<\infty almost surely. Then, γp1Ni=1Nk=0p(νkiνk+1i)=O(γp)\sqrt{\gamma_{p}}\|\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{p}(\nu^{i}_{k}-\nu^{i}_{k+1})\|=O(\sqrt{\gamma_{p}}) leads to γp1Ni=1Nk=0p(νkiνk+1i)=o(1)\sqrt{\gamma_{p}}\|\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{p}(\nu^{i}_{k}-\nu^{i}_{k+1})\|=o(1) a.s.

Let r^{(1)}_{n}\triangleq r^{(A)}_{n}+\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n} and r^{(2)}_{n}\triangleq\frac{1}{N}\sum_{i=1}^{N}(\nu^{i}_{n}-\nu^{i}_{n+1}). From the above, C5 in Theorem D.1 is satisfied, and hence all the conditions of Theorem D.1 hold.

D.4 Closed Form of Limiting Covariance Matrix

Lastly, we need to analyze the closed-form expression of 𝐔\mathbf{U} as in C4 (b) of Theorem D.1. Recall that 𝐔=1N2i=1N𝐔i\mathbf{U}=\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbf{U}_{i} and 𝐔i=gi(θ)\mathbf{U}_{i}=g_{i}(\theta^{*}) in (69). We now give the exact form of function gi(θ)g_{i}(\theta^{*}) as follows:

\begin{split}g_{i}(\theta^{*})=&~\sum_{x\in\mathcal{X}_{i}}\boldsymbol{\pi}_{i}(x)\left[m_{\theta^{*}}(x)m_{\theta^{*}}(x)^{T}-\left(\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)m_{\theta^{*}}(y)\right)\left(\sum_{y\in\mathcal{X}_{i}}\mathbf{P}_{i}(x,y)m_{\theta^{*}}(y)\right)^{T}\right]\\ =&~\mathbb{E}\left[\left(\sum_{s=0}^{\infty}[\nabla F_{i}(\theta^{*},X^{i}_{s})-\nabla f_{i}(\theta^{*})]\right)\left(\sum_{s=0}^{\infty}[\nabla F_{i}(\theta^{*},X^{i}_{s})-\nabla f_{i}(\theta^{*})]\right)^{T}\right]\\ &-\mathbb{E}\left[\left(\sum_{s=1}^{\infty}[\nabla F_{i}(\theta^{*},X^{i}_{s})-\nabla f_{i}(\theta^{*})]\right)\left(\sum_{s=1}^{\infty}[\nabla F_{i}(\theta^{*},X^{i}_{s})-\nabla f_{i}(\theta^{*})]\right)^{T}\right]\\ =&~\mathbb{E}\left[\left(\nabla F_{i}(\theta^{*},X^{i}_{0})-\nabla f_{i}(\theta^{*})\right)\left(\nabla F_{i}(\theta^{*},X^{i}_{0})-\nabla f_{i}(\theta^{*})\right)^{T}\right]\\ &+\mathbb{E}\left[\left(\nabla F_{i}(\theta^{*},X^{i}_{0})-\nabla f_{i}(\theta^{*})\right)\left(\sum_{s=1}^{\infty}[\nabla F_{i}(\theta^{*},X^{i}_{s})-\nabla f_{i}(\theta^{*})]\right)^{T}\right]\\ &+\mathbb{E}\left[\left(\sum_{s=1}^{\infty}[\nabla F_{i}(\theta^{*},X^{i}_{s})-\nabla f_{i}(\theta^{*})]\right)\left(\nabla F_{i}(\theta^{*},X^{i}_{0})-\nabla f_{i}(\theta^{*})\right)^{T}\right]\\ =&~\text{Cov}\left(\nabla F_{i}(\theta^{*},X^{i}_{0}),\nabla F_{i}(\theta^{*},X^{i}_{0})\right)\\ &+\sum_{s=1}^{\infty}\left[\text{Cov}\left(\nabla F_{i}(\theta^{*},X^{i}_{0}),\nabla F_{i}(\theta^{*},X^{i}_{s})\right)+\text{Cov}\left(\nabla F_{i}(\theta^{*},X^{i}_{s}),\nabla F_{i}(\theta^{*},X^{i}_{0})\right)\right]\\ =&~\boldsymbol{\Sigma}_{i}(\nabla F_{i}(\theta^{*},\cdot)),\end{split} (76)

where the second equality comes from the recursive form of m_{\theta^{i}}(x) in (45) and from the process \{X^{i}_{n}\}_{n\geq 0} being in its stationary regime, i.e., X^{i}_{0}\sim\boldsymbol{\pi}_{i} from the beginning. The last equality comes from rewriting the covariance terms \text{Cov}(\nabla F_{i}(\theta^{*},X^{i}_{s}),\nabla F_{i}(\theta^{*},X^{i}_{t})) in matrix form. Note that g_{i}(\theta^{*}) is exactly the asymptotic covariance matrix of the underlying Markov chain \{X^{i}_{n}\}_{n\geq 0} associated with the test function \nabla F_{i}(\theta^{*},\cdot). By utilizing the following lemma, we obtain the explicit form of \mathbf{V} as defined in (60).

Lemma D.2 (Lemma D.2.2 [41]).

If all the eigenvalues of matrix 𝐌\mathbf{M} have negative real part, then for every positive semi-definite matrix 𝐔\mathbf{U} there exists a unique positive semi-definite matrix 𝐕\mathbf{V} satisfying 𝐔+𝐌𝐕+𝐕𝐌T=𝟎\mathbf{U}+\mathbf{M}\mathbf{V}+\mathbf{V}\mathbf{M}^{T}=\mathbf{0}. The explicit solution 𝐕\mathbf{V} is given as

𝐕=0e𝐌t𝐔e(𝐌T)t𝑑t.\mathbf{V}=\int_{0}^{\infty}e^{\mathbf{M}t}\mathbf{U}e^{(\mathbf{M}^{T})t}dt. (77)

D.5 CLT of Polyak-Ruppert Averaging

We now consider the CLT for the Polyak-Ruppert averaging \bar{\theta}_{n}=\frac{1}{n}\sum_{k=0}^{n-1}\theta_{k}. The proof proceeds similarly, by verifying that the conditions of the related CLT for Polyak-Ruppert averaging in stochastic approximation are satisfied. The additional assumption is given below.

  1. C6.

    For the sequence {rn}\{r_{n}\} in (38), n1/2k=0nrk(1)0n^{-1/2}\sum_{k=0}^{n}r_{k}^{(1)}\to 0 with probability 11.

Then, the CLT of Polyak-Ruppert averaging is as follows.

Theorem D.3 (Theorem 3.2 of [29]).

Consider the iteration (38), and assume that C1, C3, C4, and C5 in Theorem D.1, as well as C6, are satisfied. Then, with step size \gamma_{n}\sim\gamma_{\star}/n^{a} for a\in(0.5,1), we have

n(θ¯nθ)ndist.𝒩(0,𝐕),\sqrt{n}(\bar{\theta}_{n}-\theta^{*})\xrightarrow[n\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}^{\prime}), (78)

where 𝐕=𝐇1𝐔𝐇T\mathbf{V}^{\prime}=\mathbf{H}^{-1}\mathbf{U}\mathbf{H}^{-T}.

Discussion about C1 and C3 can be found in Section D.1. Condition C4 has been analyzed in Section D.2 and condition C5 has been examined in Section D.3. The only condition left to analyze is C6, which is based on the results obtained in Section D.3. In view of (56), rn(1)=rn(A)+1Ni=1Nξn+1ir^{(1)}_{n}=r_{n}^{(A)}+\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n+1}, so C6 is equivalent to

n1/2k=1n[rk(A)+1Ni=1N(ξk+1i)]0w.p.1.n^{-1/2}\sum_{k=1}^{n}\left[r_{k}^{(A)}+\frac{1}{N}\sum_{i=1}^{N}\left(\xi^{i}_{k+1}\right)\right]\to 0\quad w.p.1. (79)

In Section D.3, we have shown that \|r_{n}^{(A)}\|=O(\eta_{n}) and \|\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{n}\|=O(\gamma_{n}). Note that by Assumption 2.3, we consider a bounded communication interval with step size \gamma_{n}\sim\gamma_{\star}/n^{a} for a\in(0.5,1); hence \eta_{n}=O(\gamma_{n}), so that \|r_{n}^{(A)}\|=O(\gamma_{n}). We then know that

\sum_{k=1}^{n}\left\|r_{k}^{(A)}\right\|=O(n^{1-a}),\quad\sum_{k=1}^{n}\left\|\frac{1}{N}\sum_{i=1}^{N}\xi^{i}_{k}\right\|=O(n^{1-a}), (80)

such that

n1/2k=1nrk(A)+1Ni=1N(ξk+1i)=O(n1/2a)=o(1),n^{-1/2}\sum_{k=1}^{n}\left\|r_{k}^{(A)}+\frac{1}{N}\sum_{i=1}^{N}\left(\xi^{i}_{k+1}\right)\right\|=O(n^{1/2-a})=o(1), (81)

which proves (79); hence C6 is verified. Therefore, Theorem D.3 is proved under our Assumptions 2.1 - 2.5.

Appendix E Discussion on the comparison of Theorem 3.3 to the CLT result in [51]

As a byproduct of our Theorem 3.3, we have the following corollary.

Corollary E.1.

Under Assumptions 2.1 - 2.5, for the sub-sequence {nl}l0\{n_{l}\}_{l\geq 0} where Kl=KK_{l}=K for all ll, we have

1nlk=1l(θ¯nkθ)ldist.𝒩(0,𝐕)\frac{1}{\sqrt{n_{l}}}\sum_{k=1}^{l}(\bar{\theta}_{n_{k}}-\theta^{*})\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}^{\prime}) (82)
Proof.

Since K_{l}=K for all l, we have n_{l}=Kl. An existing result shows that the CLT for the normalized partial sum along a sub-sequence yields the same normal distribution as that of the original sequence.

Theorem E.2 (Theorem 14.4 of [8]).

Given a sequence of random variables \theta_{1},\theta_{2},\cdots with partial sum S_{n}\triangleq\sum_{k=1}^{n}\theta_{k} such that \frac{1}{\sqrt{n}}S_{n}\xrightarrow[n\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}). Let n_{l} be a positive integer-valued random variable such that \theta_{n_{l}} is defined on the same space as \theta_{n}. In addition, for some sequence \{b_{l}\}_{l\geq 0} going to infinity, n_{l}/b_{l}\to c for a positive constant c. Then, \frac{1}{\sqrt{n_{l}}}S_{n_{l}}\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}).

From Theorem E.2 and our Theorem 3.3, we have 1nlk=1l(θ¯nkθ)ldist.𝒩(0,𝐕)\frac{1}{\sqrt{n_{l}}}\sum_{k=1}^{l}(\bar{\theta}_{n_{k}}-\theta^{*})\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}^{\prime}). ∎

Recently, [51] studied the CLT under the LSGD-FP algorithm with i.i.d. sampling (with a slightly different setting of the step size). We are able to recover Theorem 3.1 of [51] under a constant communication interval while adjusting their step size to make a fair comparison. We state their algorithm below to keep the discussion self-contained. During each communication interval n\in(n_{l},n_{l+1}],

θn+1i={θniγlFi(θni,Xni)ifn(nl,nl+1),1Ni=1N(θniγlFi(θni,Xni))ifn=nl+1.\theta_{n+1}^{i}=\begin{cases}\theta_{n}^{i}-\gamma_{l}\nabla F_{i}(\theta_{n}^{i},X^{i}_{n})&~{}~{}~{}~{}\text{if}~{}~{}n\in(n_{l},n_{l+1}),\\ \frac{1}{N}\sum_{i=1}^{N}(\theta_{n}^{i}-\gamma_{l}\nabla F_{i}(\theta_{n}^{i},X^{i}_{n}))&~{}~{}~{}~{}\text{if}~{}~{}n=n_{l+1}.\end{cases} (83)

The CLT result associated with (83) is given below.

Theorem E.3 (Theorem 3.1 of [51]).

Under the LSGD-FP algorithm with i.i.d. sampling, we have

nllk=1l(θ¯nkθ)ldist.𝒩(0,ν𝐕),\frac{\sqrt{n_{l}}}{l}\sum_{k=1}^{l}\left(\bar{\theta}_{n_{k}}-\theta^{*}\right)\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\nu\mathbf{V}^{\prime}), (84)

where \nu\triangleq\lim_{l\to\infty}\frac{1}{l^{2}}(\sum_{k=1}^{l}K_{k})(\sum_{k=1}^{l}K_{k}^{-1}).

Note that ν=1\nu=1 for constant KK. We can rewrite (84) as

nllk=1l(θ¯nkθ)=nll1lk=1l(θ¯nkθ)=K1lk=1l(θ¯nkθ)\frac{\sqrt{n_{l}}}{l}\sum_{k=1}^{l}\left(\bar{\theta}_{n_{k}}-\theta^{*}\right)=\frac{\sqrt{n_{l}}}{\sqrt{l}}\frac{1}{\sqrt{l}}\sum_{k=1}^{l}(\bar{\theta}_{n_{k}}-\theta^{*})=\sqrt{K}\frac{1}{\sqrt{l}}\sum_{k=1}^{l}(\bar{\theta}_{n_{k}}-\theta^{*}) (85)

such that

1lk=1l(θ¯nkθ)ldist.𝒩(0,1K𝐕).\frac{1}{\sqrt{l}}\sum_{k=1}^{l}(\bar{\theta}_{n_{k}}-\theta^{*})\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\frac{1}{K}\mathbf{V}^{\prime}). (86)

Note that the step size in (83) remains unchanged during each communication interval, while ours in (1) keeps decreasing even within the same communication interval, so our step size decreases faster than theirs. To make a fair comparison, we choose the sub-sequence \{n_{Kl}\}_{l\geq 0} in (86), which effectively aligns the two schedules so that the step sizes agree at each aggregation step. In this case, we again use Theorem E.2 to obtain

1Kls=1l(θ¯nKsθ)ldist.𝒩(0,1K𝐕),\frac{1}{\sqrt{Kl}}\sum_{s=1}^{l}(\bar{\theta}_{n_{Ks}}-\theta^{*})\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\frac{1}{K}\mathbf{V}^{\prime}), (87)

such that

1ls=1l(θ¯nKsθ)=K1Kls=1l(θ¯nKsθ)ldist.𝒩(0,𝐕).\frac{1}{\sqrt{l}}\sum_{s=1}^{l}(\bar{\theta}_{n_{Ks}}-\theta^{*})=\sqrt{K}\frac{1}{\sqrt{Kl}}\sum_{s=1}^{l}(\bar{\theta}_{n_{Ks}}-\theta^{*})\xrightarrow[l\to\infty]{dist.}\mathcal{N}(0,\mathbf{V}^{\prime}). (88)

Therefore, our Corollary E.1 also recovers Theorem 3.1 of [51] under the constant communication interval KK, but with more general communication patterns and Markovian sampling.

Appendix F Discussion on Communication Patterns

F.1 Examples of Communication Matrix 𝐖\mathbf{W}

Metropolis-Hastings Algorithm: In decentralized learning such as D-SGD, HLSGD and DFL, \mathbf{W} at the aggregation step can be generated locally using the Metropolis-Hastings algorithm based on the underlying communication topology, and is deterministic [62, 43, 80]. Specifically, each agent i exchanges its degree d_{i} with its neighbors j\in N(i), forming the weights \mathbf{W}(i,j)=\min\{1/d_{i},1/d_{j}\} for j\in N(i) and \mathbf{W}(i,i)=1-\sum_{j\in N(i)}\mathbf{W}(i,j). In this case, \mathbf{W} is doubly stochastic and symmetric, and by the Perron-Frobenius theorem its SLEM satisfies \lambda_{2}(\mathbf{W})<1. Then, \|\mathbf{W}^{T}\mathbf{W}-\mathbf{J}\|=\|\mathbf{W}^{2}-\mathbf{J}\|=\lambda_{2}^{2}(\mathbf{W})<1, which satisfies 2.5-ii). It is worth noting that this algorithm is robust to time-varying communication topologies.
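A minimal sketch of this construction in Python (our own illustration, not code from our experiments); each agent only needs its own degree and those of its neighbors:

```python
import numpy as np

def metropolis_hastings_weights(adj):
    """Metropolis-Hastings weight matrix: W(i,j) = min{1/d_i, 1/d_j} for
    neighbors j of i, zero for non-neighbors, and the self-weight absorbs
    the remaining mass so that each row sums to one."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and adj[i, j]:
                W[i, j] = min(1.0 / deg[i], 1.0 / deg[j])
        W[i, i] = 1.0 - W[i].sum()
    return W

# Example: a triangle with a pendant node (non-bipartite, connected);
# the resulting W is symmetric and doubly stochastic.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
W = metropolis_hastings_weights(adj)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
```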

Client Sampling in FL: For LSGD-FP studied in [67, 76, 42], \mathbf{W}=\mathbf{1}\mathbf{1}^{T}/N trivially satisfies 2.5-ii). For LSGD-PP, on the other hand, only a small fraction of agents participate in each aggregation step for consensus [50, 30]. Denote by \mathcal{S} a randomly selected set of agents (without replacement) of fixed size |\mathcal{S}|\in\{1,2,\cdots,N\} at time n, and let \mathbf{W}_{\mathcal{S}} play the role of aggregating \theta^{i}_{n} for agents i\in\mathcal{S}. Additionally, the central server needs to broadcast the updated parameter \theta_{n+1} to the newly selected set \mathcal{S}' of the same size, which results in a bijective mapping \sigma (for \mathcal{S}\to\mathcal{S}' and [N]\setminus\mathcal{S}\to[N]\setminus\mathcal{S}') and a corresponding permutation matrix \mathbf{T}_{\mathcal{S}\to\mathcal{S}'}. Then, the communication matrix becomes \mathbf{W}=\mathbf{T}_{\mathcal{S}\to\mathcal{S}'}\mathbf{W}_{\mathcal{S}}.666In Section F.2, we will discuss the mathematical equivalence between our client sampling strategy and the commonly used one in the FL literature [50, 30]. Specifically, \mathbf{T}_{\mathcal{S}\to\mathcal{S}'}(i,j)=1 if j=\sigma(i) and \mathbf{T}_{\mathcal{S}\to\mathcal{S}'}(i,j)=0 otherwise. Besides, \mathbf{W}_{\mathcal{S}}(i,j)=1/|\mathcal{S}| for i,j\in\mathcal{S}, \mathbf{W}_{\mathcal{S}}(i,i)=1 for i\notin\mathcal{S}, and \mathbf{W}_{\mathcal{S}}(i,j)=0 otherwise. Note that \mathbf{W}_{\mathcal{S}} is now a random matrix, since \mathcal{S} is a randomly chosen subset of size |\mathcal{S}|. Clearly, for each choice of \mathcal{S}, \mathbf{W}_{\mathcal{S}} is doubly stochastic, symmetric, and satisfies \mathbf{W}_{\mathcal{S}}^{2}=\mathbf{W}_{\mathcal{S}}. Taking the expectation of \mathbf{W}_{\mathcal{S}} w.r.t. the randomly selected set \mathcal{S} gives \mathbb{E}_{\mathcal{S}}[\mathbf{W}_{\mathcal{S}}](i,i)=1-(|\mathcal{S}|-1)/N for i\in[N] and \mathbb{E}_{\mathcal{S}}[\mathbf{W}_{\mathcal{S}}](i,j)=(|\mathcal{S}|-1)/(N(N-1)) for i\neq j, so \mathbb{E}_{\mathcal{S}}[\mathbf{W}_{\mathcal{S}}] has all positive entries. Therefore, using the fact that \mathbf{T}^{T}\mathbf{T}=\mathbf{I} for a permutation matrix \mathbf{T}, we obtain \|\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]-\mathbf{J}\|=\|\mathbb{E}_{\mathcal{S},\mathcal{S}'}[\mathbf{W}_{\mathcal{S}}^{T}\mathbf{T}_{\mathcal{S},\mathcal{S}'}^{T}\mathbf{T}_{\mathcal{S},\mathcal{S}'}\mathbf{W}_{\mathcal{S}}]-\mathbf{J}\|=\|\mathbb{E}_{\mathcal{S}}[\mathbf{W}_{\mathcal{S}}^{T}\mathbf{W}_{\mathcal{S}}]-\mathbf{J}\|=\|\mathbb{E}_{\mathcal{S}}[\mathbf{W}_{\mathcal{S}}]-\mathbf{J}\|<1 by the Perron-Frobenius theorem and eigendecomposition, which satisfies 2.5-ii). A numerical sanity check of these properties is sketched below.
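The following minimal sketch verifies the stated properties of \mathbf{W}_{\mathcal{S}} (the helper build_W_S is ours, introduced only for illustration):

```python
import numpy as np
from itertools import combinations

def build_W_S(N, S):
    """W_S as defined above: uniform averaging over the selected set S,
    identity rows for the unselected agents."""
    W = np.zeros((N, N))
    for i in range(N):
        if i in S:
            W[i, list(S)] = 1.0 / len(S)
        else:
            W[i, i] = 1.0
    return W

N, s = 5, 3
all_WS = [build_W_S(N, set(S)) for S in combinations(range(N), s)]
W = all_WS[0]
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
assert np.allclose(W @ W, W)          # W_S^2 = W_S (projection)
EW = np.mean(all_WS, axis=0)          # expectation over a uniform S
assert np.isclose(EW[0, 0], 1 - (s - 1) / N)
assert np.isclose(EW[0, 1], (s - 1) / (N * (N - 1)))
```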

F.2 Discussion on partial client sampling

The commonly used partial client sampling algorithm in the FL literature [50, 30] is FedAvg as follows:

  1. 1.

    At time nn, the central server updates its global parameter θn=1|𝒮|i𝒮θni\theta_{n}=\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\theta^{i}_{n} from the agents in the previous set 𝒮\mathcal{S}. Then, the central server selects a new subset of agents 𝒮\mathcal{S}^{\prime} and broadcasts θn\theta_{n} to agent i𝒮i\in\mathcal{S}^{\prime}, i.e., θni=θn\theta_{n}^{i}=\theta_{n};

  2. 2.

    Each selected agent ii computes KK steps of SGD locally and consecutively updates its local parameter θn+1i,,θn+Ki\theta^{i}_{n+1},\cdots,\theta^{i}_{n+K} according to (1a);

  3. 3.

    Each selected agent i𝒮i\in\mathcal{S}^{\prime} uploads θn+Ki\theta^{i}_{n+K} to the central server.

Then, the central server repeats the above three steps with θn+K\theta_{n+K} and a new set of selected agents.
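For concreteness, one round of the above three steps can be sketched as follows (our own illustration; grad is a placeholder stochastic gradient oracle, not an interface defined in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def fedavg_round(theta, N, num_selected, K, gamma, grad):
    """One FedAvg round with partial participation: select S', broadcast
    theta, run K local SGD steps per selected agent, then average the
    selected agents' parameters at the server."""
    S = rng.choice(N, size=num_selected, replace=False)  # step 1: select S'
    local_params = []
    for i in S:
        theta_i = theta.copy()                # broadcast global parameter
        for _ in range(K):                    # step 2: K local SGD steps
            theta_i -= gamma * grad(theta_i, i)
        local_params.append(theta_i)          # step 3: upload to server
    return np.mean(local_params, axis=0)      # server-side averaging
```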

In our client sampling scheme, at the aggregation step nn, the design of 𝐖𝒮\mathbf{W}_{\mathcal{S}} results in θ~ni=1|𝒮|j𝒮θnj\tilde{\theta}^{i}_{n}=\frac{1}{|\mathcal{S}|}\sum_{j\in\mathcal{S}}\theta_{n}^{j} for a selected agent i𝒮i\in\mathcal{S}, and θ~ni=θni\tilde{\theta}_{n}^{i}=\theta_{n}^{i} for an unselected agent i𝒮i\notin\mathcal{S}. Meanwhile, the central server updates the global parameter θ~n=θ~ni\tilde{\theta}_{n}=\tilde{\theta}^{i}_{n} for i𝒮i\in\mathcal{S}. Then, the permutation matrix 𝐓𝒮𝒮\mathbf{T}_{\mathcal{S}\to\mathcal{S}^{\prime}} ensures that the newly selected agent i𝒮i\in\mathcal{S}^{\prime} will use θ~n\tilde{\theta}_{n} as the initial point for its subsequent SGD iterations. Consequently, from the selected agents’ perspective, the communication matrix 𝐖=𝐓𝒮𝒮𝐖𝒮\mathbf{W}=\mathbf{T}_{\mathcal{S}\to\mathcal{S}^{\prime}}\mathbf{W}_{\mathcal{S}} corresponds to step 1 in FedAvg. As we can observe, both algorithms update the global parameter identically from the central server’s viewpoint, rendering them mathematically equivalent regarding the global parameter update.

We acknowledge that under the i.i.d. sampling strategy, the behavior of unselected agents in our algorithm differs from FedAvg. Specifically, unselected agents are idle in FedAvg, while they continue the SGD computation in our algorithm (despite not contributing to the global parameter update). Importantly, when an unselected agent is later selected, the central server overwrites its local parameter during the broadcasting process. This ensures that the activities of agents while they are unselected have no impact on the global parameter update.

To our knowledge, the FedAvg algorithm under the Markovian sampling strategy remains unexplored in the FL literature. Extrapolating the behavior of unselected agents in FedAvg from i.i.d. sampling to Markovian sampling suggests that unselected agents would remain idle. In contrast, our algorithm enables unselected agents to continue evolving X^{i}_{n}. These additional transitions contribute to faster mixing of the Markov chain for each unselected agent and a smaller bias of F_{i}(\theta,X^{i}_{n}) relative to its mean field f_{i}(\theta), potentially accelerating the convergence.

Appendix G Additional Simulation

G.1 Simulation Setup in Section 4

This simulation is performed on a PC with an AMD R9 5950X, an RTX 3080, and 128 GB RAM. In this simulation, we assume that agents follow the DSGD algorithm (1). In Figure 2 - 2, each agent holds a disjoint local dataset (non-overlapping data points for every agent), while in Figure 2 - 2 we distribute the ijcnn1 dataset [14] with a more varied distribution among 100 agents by leveraging a Dirichlet distribution with the default alpha value of 0.5.

Moreover, we assume that all agents are distributed over a communication network. To create this network among 100 agents, along with the graph-like dataset structure held by each agent, we utilize connected sub-graphs from the real-world Facebook graph in SNAP [49]. All 100 agents collaborate to generate a deterministic communication matrix \mathbf{W}=[W_{ij}] via the Metropolis-Hastings algorithm of the following form: for i,j\in[N], we have

W_{ij}=\begin{cases}\min\left\{\frac{1}{d_{i}},\frac{1}{d_{j}}\right\}&\text{if agent $j$ is a neighbor of agent $i$},\\ 0&\text{otherwise},\end{cases}
W_{ii}=1-\sum_{j\neq i}W_{ij},

where d_{i} represents the degree of agent i in the graph. The communication interval K is set to 1, the usual choice in DSGD [71, 80, 61, 69].

For the first group of agents, we assume they have full access to their datasets, thus performing i.i.d. sampling or single shuffling. In particular,

  • i.i.d. sampling employed by agent ii means that the data point XniX_{n}^{i} is independently and uniformly sampled from its dataset 𝒳i\mathcal{X}_{i} at each time nn.

  • Single shuffling, by its name, only shuffles the dataset once and adheres to that specific order throughout the training process.

On the other hand, within the second group of agents, we assume that they hold graph-like datasets. Now, we introduce simple random walk (SRW), non-backtracking random walk (NBRW), and self-repellent random walk (SRRW) in order:

  • SRW refers to the walker that chooses one of the neighboring nodes uniformly at random.

  • NBRW, as studied in [2, 48, 5], is a variation of SRW, which selects one of the neighbors uniformly at random, with the exception of the one visited in the last step.

  • SRRW, recently proposed by [25], is designed with a nonlinear transition kernel 𝐊[𝐱][0,1]N×N\mathbf{K}[\mathbf{x}]\in[0,1]^{N\times N} of the following form:

    Kij[𝐱]Pij(xj/μj)αk[N]Pik(xk/μk)α,i,j[N],K_{ij}[\mathbf{x}]\triangleq\frac{P_{ij}(x_{j}/\mu_{j})^{-\alpha}}{\sum_{k\in[N]}P_{ik}(x_{k}/\mu_{k})^{-\alpha}},\quad\forall i,j\in[N], (89)

where the matrix \mathbf{P}=[P_{ij}] is the transition kernel of the baseline Markov chain and \boldsymbol{\mu}=[\mu_{i}] is its corresponding stationary distribution. Additionally, \alpha denotes the strength of the self-repellence: a larger \alpha leads to stronger self-repellence and thus higher sampling efficiency [25, Corollary 4.3]. Moreover, the vector \mathbf{x}\in\mathbb{R}^{N} lies in the interior of the probability simplex and represents the empirical distribution, in other words, the visit frequency of each node in the graph. The update rule of this empirical distribution is of the following form:

    𝐱n+1=𝐱n+βn+1(𝜹Xn+1𝐱n),\mathbf{x}_{n+1}=\mathbf{x}_{n}+\beta_{n+1}(\boldsymbol{\delta}_{X_{n+1}}-\mathbf{x}_{n}), (90)

where \beta_{n}\triangleq(n+1)^{-b} is the step size of the SRRW iterates. b=1 was originally proposed in [25] and has recently been extended to b\in(0.5,1) in [39]. In this simulation, we use SRW as the baseline Markov chain of SRRW, so \boldsymbol{\mu} is proportional to the degree distribution. We also assume \mathbf{x}_{0}=\mathbf{1}/N, i.e., each node has been visited once, and choose the step size \beta_{n}=(n+1)^{-0.8} and force of self-repellence \alpha=20, following the suggestion in [39, Section 4]. A sketch of one SRRW step is given after this list.
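To make (89) and (90) concrete, a single SRRW transition plus the empirical-measure update can be sketched as follows (a simplified illustration with our own variable names; P is the baseline kernel and mu its stationary distribution):

```python
import numpy as np

def srrw_step(i, x, P, mu, alpha, n, b, rng):
    """One SRRW move from node i via the kernel (89), followed by the
    empirical-measure update (90) with step size beta_n = (n+1)^{-b}."""
    # Kernel (89): nodes visited more often than mu are penalized.
    weights = P[i] * (x / mu) ** (-alpha)
    weights /= weights.sum()
    j = rng.choice(len(x), p=weights)
    # Update (90): x tracks the visit frequency of each node.
    beta = (n + 1) ** (-b)
    e_j = np.zeros_like(x)
    e_j[j] = 1.0
    x = x + beta * (e_j - x)
    return j, x
```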

Since SRW, NBRW, and SRRW all admit a stationary distribution proportional to the degree distribution, in order to obtain the uniform target in (15), we need to reweight the gradient computed by each agent i so as to maintain an asymptotically unbiased gradient. Thus, agent i modifies its SGD update from (1a) to the following:

θn+1/2i=θniγn+1Fi(θni,Xni)/d(Xni).\theta_{n+\nicefrac{{1}}{{2}}}^{i}=\theta_{n}^{i}-\gamma_{n+1}\nabla F_{i}(\theta_{n}^{i},X^{i}_{n})/d(X_{n}^{i}). (91)
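In code, the reweighted update (91) is a one-line change to the local SGD step; in the sketch below, grad_F and deg are placeholders for agent i's stochastic gradient oracle and the node degrees:

```python
def reweighted_sgd_step(theta, gamma, grad_F, X, deg):
    """SGD update (91): dividing by the degree d(X) corrects the
    degree-proportional stationary distribution of SRW/NBRW/SRRW back
    to the uniform target, keeping the gradient asymptotically unbiased."""
    return theta - gamma * grad_F(theta, X) / deg[X]
```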

G.2 Image Classification Task

In this part, we perform the image classification task using a 5-layer neural network, where the CIFAR10 dataset [44] with 50k images is evenly distributed to 10 agents. Each agent possesses 5k images, which are further divided into 200 batches, each batch with 25 images.

The Convolutional Neural Network (CNN) model used in this simulation encompasses the following (a sketch is given after this list):

  • Two convolutional layers (i.e., nn.Conv2d(3, 6, 5) and nn.Conv2d(6, 16, 5)), each followed by ReLU activation functions to introduce non-linearity and max pooling (i.e., nn.MaxPool2d(2, 2)) to reduce spatial dimensions.

  • Three fully connected (linear) layers, concluding with a softmax output to handle the multi-class classification problem.
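A sketch of this architecture in PyTorch is given below; the 32x32 CIFAR10 input size is implied by the dataset, while the fully connected widths (120 and 84) are our assumption, since the text specifies only the convolution and pooling layers:

```python
import torch.nn as nn
import torch.nn.functional as F

class FiveLayerCNN(nn.Module):
    """Two conv layers with ReLU + max pooling, then three linear layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # assumed hidden width
        self.fc2 = nn.Linear(120, 84)          # assumed hidden width
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):                      # x: (B, 3, 32, 32)
        x = self.pool(F.relu(self.conv1(x)))   # -> (B, 6, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))   # -> (B, 16, 5, 5)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # softmax is folded into the cross-entropy loss
```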

Similar to the simulation setup in Section 4, among the 10 participating agents, five have unrestricted access to their respective data allocations, enabling them to use the shuffling method to iterate through their batches. The other five agents are designed to simulate limited data-access scenarios. Their data access is structured using five distinct graph topologies extracted from the SNAP dataset collection [49], each graph simulating a unique communication pattern among the batches (nodes) of data. Within these topologies, agents adopt one of three random walk strategies (SRW, NBRW, and SRRW, all with importance reweighting) to sample the batches for training.

Figure 3: Image classification experiment. From left to right: (a) Comparison of various sampling strategies in image classification problem using 55-layer neural network. (b) Train a 55-layer CNN model with different number of total agents (clients) to show the linear speedup effect. (c) Train ResNet-1818 model with different sampling strategies among 1010 agents with participation ratio 0.40.4.

Local model training is conducted for five epochs at each agent before the updates are aggregated at a central server; this process is repeated for a total of 200 communication rounds. Each epoch consists of a full traversal of the local dataset for agents in the first group, or 200 sampled batches for agents in the second group. To mimic realistic conditions, we also introduce partial agent participation, where only 40% of the agents are randomly selected to transmit their updates in each round, reflecting the intermittent communication in real-world FL deployments. Lastly, the selection of the step size \beta_{n} for the SRRW iterates (90) is a critical aspect of our experiments. We experiment with various values of b\in\{0.501,0.6,0.7,0.8,0.9\} to determine the most advantageous setting for maximizing the benefits of the SRRW strategy. Based on our findings, the best choice of the SRRW step size is b=0.501, in other words, \beta_{n}=(n+1)^{-0.501}.

The simulation result is quantified by averaging the training loss across ten repeated trials for each configuration. As depicted in Figure 3, the training results are consistent with our previous findings in Figure 2, now in the context of the FL framework and neural network training: using a more effective sampling approach, even for only a portion of the agents, yields significant improvements in the overall training of the model, and this improvement is further amplified by the highly efficient SRRW sampling strategy.

In Figure 3 and 3, we perform image classification experiments in the FL setting with partial client participation, where only 4 randomly selected agents participate in the training process at each aggregation phase. In Figure 3, we fix the sampling strategy (shuffling, SRRW with \alpha=10) and test the linear speedup effect for the 5-layer CNN model by duplicating the 10 agents to N agents with N\in\{10,20,30,40\}, keeping the same participation ratio 0.4. As the plot shows, the training loss is inversely proportional to the number of agents: at 200 rounds, the training loss is 0.52 for 10 agents, 0.23 for 20 agents, 0.18 for 30 agents, and 0.12 for 40 agents. In Figure 3, we extend the simulation from the 5-layer CNN model to the ResNet-18 model [34] in order to numerically test the performance of different sampling strategies in a more complex neural network training task. Fixing shuffling for the first group of agents, we observe that improving the Markovian sampling from SRW to NBRW, and then to SRRW, accelerates the training process and yields smaller training loss.

Appendix H Limitations

Our study provides crucial insights into the nuanced sampling behaviors of agents in UD-SGD, where improving each agent's sampling strategy speeds up overall system performance without additional computational burden, except for the additional storage of the visit counts used for sampling their datasets. UD-SGD is scalable to larger datasets because the sampling strategy (i.e., a random walk) employed by each agent leverages only local information about its dataset. However, our work has two limitations that should be acknowledged.

  • 1.

Assumption 2.4 posits that the parameter trajectory \{\theta_{n}\} is almost surely bounded, which is a strong assumption. It is crucial for guaranteeing that all related quantities are well defined. Mechanisms such as projections onto a compact subset [45, Chapter 5.1], or truncation-based methods with expanding compact subsets, can ensure that the iterates remain bounded [3]. As mentioned in our discussion after Assumption 2.4, only recently has the stability of SGD under Markovian sampling been guaranteed for some class of objective functions f [9], while the stability issue in the multi-agent scenario with Markovian sampling remains an open problem, and we do not pursue removing this stability assumption in this paper.

  • 2.

Asymptotic analysis: The main results of our work, i.e., almost sure convergence and the central limit theorem in distributed optimization, are based on asymptotic analysis and might not accurately represent the finite-sample performance of each contributing agent in the system. The state-of-the-art finite-sample analysis in the literature focuses only on the worst-performing agent and cannot capture the nuanced differences between agents' sampling strategies, as explained in Footnote 2. Thus, a finite-sample error bound that distinguishes every agent's dynamics is still unknown and is regarded as a future direction.

Checklist

  1. 1.

    Claims

  2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

  3. Answer: [Yes]

  4. Justification: The abstract and the introduction clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. The claims made match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

  5. 2.

    Limitations

  6. Question: Does the paper discuss the limitations of the work performed by the authors?

  7. Answer: [Yes]

  8. Justification: The ‘Limitations’ section is provided in Appendix H.

  9. 3.

    Theory Assumptions and Proofs

  10. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

  11. Answer: [Yes]

  12. Justification: All the theorems, formulas, and proofs in the paper have been numbered and cross-referenced. All assumptions have been clearly stated or referenced in the statement of any theorems. The complete and correct proofs have been provided in appendices.

  13. 4.

    Experimental Result Reproducibility

  14. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

  15. Answer: [Yes]

  16. Justification: The detailed simulation setups have been provided in Appendix G.

  17. 5.

    Open access to data and code

  18. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

  19. Answer: [No]

  20. Justification: We do not provide the source code, but detailed instructions for reproducing the experiments are provided in Appendix G.

  21. 6.

    Experimental Setting/Details

  22. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

  23. Answer: [Yes]

  24. Justification: The full details have been provided in Appendix G.

  25. 7.

    Experiment Statistical Significance

  26. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

  27. Answer: [Yes]

  28. Justification: We plot the error bars in Figure 2. However, error bars are intentionally omitted in Figure 2 in the main body to avoid a cluttered plot and deliver the main message.

  29. 8.

    Experiments Compute Resources

  30. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

  31. Answer: [Yes]

  32. Justification: We specify the platform that runs the simulation in Appendix G.1.

  33. 9.

    Code Of Ethics

  34. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

  35. Answer: [Yes]

  36. Justification: We fully obey the NeurIPS Code of Ethics.

  37. 10.

    Broader Impacts

  38. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

  39. Answer: [N/A]

  40. Justification: This work mainly aims to demonstrate how improving a subset of agents' sampling strategies can accelerate the overall system's performance. It advocates improving the common good and has no negative societal impact.

  41. 11.

    Safeguards

  42. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  43. Answer: [N/A]

  44. Justification: The paper poses no such risks.

  45. 12.

    Licenses for existing assets

  46. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  47. Answer: [Yes]

  48. Justification: We have cited correctly the datasets ijcnn1 and CIFAR-10 used for experiments in our paper.

  49. 13.

    New Assets

  50. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  51. Answer: [N/A]

  52. Justification: The paper does not release new assets.

  53. 14.

    Crowdsourcing and Research with Human Subjects

  54. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  55. Answer: [N/A]

  56. Justification: The paper does not involve crowdsourcing nor research with human subjects.

  57. 15.

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

  58. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  59. Answer: [N/A]

  60. Justification: The paper does not involve crowdsourcing nor research with human subjects.