
DESTRESS: Computation-Optimal and Communication-Efficient Decentralized Nonconvex Finite-Sum Optimization

Boyue Li
CMU
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Emails: {boyuel,yuejiec}@andrew.cmu.edu.
   Zhize Li
KAUST
Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia; Email: [email protected].
Yuejie Chi
CMU
(October 2021; Revised November 2021)
Abstract

Emerging applications in multi-agent environments such as internet-of-things, networked sensing, autonomous systems and federated learning, call for decentralized algorithms for finite-sum optimization that are resource-efficient in terms of both computation and communication. In this paper, we consider the prototypical setting where the agents work collaboratively to minimize the sum of local loss functions by only communicating with their neighbors over a predetermined network topology. We develop a new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS), for nonconvex finite-sum optimization, which matches the optimal incremental first-order oracle (IFO) complexity of centralized algorithms for finding first-order stationary points, while maintaining communication efficiency. Detailed theoretical and numerical comparisons corroborate that the resource efficiencies of DESTRESS improve upon prior decentralized algorithms over a wide range of parameter regimes. DESTRESS leverages several key algorithm design ideas including randomly activated stochastic recursive gradient updates with mini-batches for local computation, gradient tracking with extra mixing (i.e., multiple gossiping rounds) for per-iteration communication, together with careful choices of hyper-parameters and new analysis frameworks to provably achieve a desirable computation-communication trade-off.

Keywords: decentralized optimization, nonconvex finite-sum optimization, stochastic recursive gradient methods

1 Introduction

The proliferation of multi-agent environments in emerging applications such as internet-of-things (IoT), networked sensing and autonomous systems, together with the necessity of training machine learning models using distributed systems in federated learning, leads to a growing need for decentralized algorithms that optimize finite-sum problems. Specifically, the goal is to minimize the global objective function:

$$\underset{\bm{x}\,\in\,\mathbb{R}^{d}}{\text{minimize}}\quad f(\bm{x}):=\frac{1}{N}\sum_{\bm{z}\in\mathcal{M}}\ell(\bm{x};\bm{z}), \qquad (1)$$

where $\bm{x}\in\mathbb{R}^{d}$ denotes the parameter of interest, $\ell(\bm{x};\bm{z})$ denotes the sample loss of the sample $\bm{z}$, $\mathcal{M}$ denotes the entire dataset, and $N=|\mathcal{M}|$ denotes the number of data samples in the entire dataset. Of particular interest to this paper is the nonconvex setting, where $\ell(\bm{x};\bm{z})$ is nonconvex with respect to $\bm{x}$, due to its ubiquity across problems in machine learning and signal processing, including but not limited to nonlinear estimation, neural network training, and so on.

In a prototypical decentralized environment, however, each agent only has access to a disjoint subset of the data samples, and the agents aim to work collaboratively to optimize $f(\bm{x})$ by only exchanging information with their neighbors over a predetermined network topology. Assuming the data are distributed equally among all agents (it is straightforward to generalize to the unequal splitting case with a proper reweighting), each agent possesses $m:=N/n$ samples, and $f(\bm{x})$ can be rewritten as

$$f(\bm{x})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(\bm{x}),$$

where

$$f_{i}(\bm{x}):=\frac{1}{m}\sum_{\bm{z}\in\mathcal{M}_{i}}\ell(\bm{x};\bm{z})$$

denotes the local objective function averaged over the local dataset $\mathcal{M}_{i}$ at the $i$th agent ($1\leq i\leq n$), and $\mathcal{M}=\cup_{i=1}^{n}\mathcal{M}_{i}$. The communication pattern of the agents is specified via an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ denotes the set of all agents, and two agents can exchange information if and only if there is an edge in $\mathcal{E}$ connecting them. Unlike the server/client setting, the decentralized setting, sometimes also called the network setting, does not admit a parameter server to facilitate global information sharing, and it is therefore much more challenging to understand and delineate the impact of the network graph.
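To make the finite-sum decomposition concrete, the following is a minimal NumPy sketch (not from the paper; the sample loss and all names are illustrative) of how the global objective $f(\bm{x})$ splits into local objectives $f_i(\bm{x})$ over an even partition of the data.

```python
import numpy as np

def sample_loss(x, z):
    """Illustrative nonconvex sample loss ell(x; z): a least-squares term plus
    a smooth nonconvex regularizer; any smooth sample loss could be used."""
    a, y = z  # z = (feature vector, scalar target)
    return 0.5 * (a @ x - y) ** 2 + 0.01 * np.sum(x**2 / (1.0 + x**2))

def local_objective(x, local_data):
    """f_i(x): average of the sample loss over agent i's local dataset M_i."""
    return np.mean([sample_loss(x, z) for z in local_data])

def global_objective(x, partitions):
    """f(x) = (1/n) sum_i f_i(x) under an even split with m = N / n."""
    return np.mean([local_objective(x, part) for part in partitions])

# Toy example: N = 40 samples split evenly across n = 4 agents (m = 10 each).
rng = np.random.default_rng(0)
d, n, m = 5, 4, 10
data = [(rng.normal(size=d), rng.normal()) for _ in range(n * m)]
partitions = [data[i * m:(i + 1) * m] for i in range(n)]
print(global_objective(rng.normal(size=d), partitions))
```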

Roughly speaking, in a typical decentralized algorithm, the agents alternate between (1) communication, which propagates local information and enforces consensus, and (2) computation, which updates individual parameter estimates and improves convergence using information received from the neighbors. The resource efficiency of a decentralized algorithm can often be measured in terms of its computation complexity and communication complexity. For example, communication can be extremely time-consuming and become the top priority when the bandwidth is limited. On the other hand, minimizing computation, especially at resource-constrained agents (e.g., power-hungry IoT or mobile devices), is also critical to ensure the overall efficiency. Achieving a desired level of resource efficiency for a decentralized algorithm often requires careful and delicate trade-offs between computation and communication, as these objectives are often conflicting in nature.

1.1 Our contributions

The central contribution of this paper lies in the development of a new resource-efficient algorithm for nonconvex finite-sum optimization problems in a decentralized environment, dubbed DEcentralized STochastic REcurSive gradient methodS (DESTRESS). DESTRESS provably finds first-order stationary points of the global objective function $f(\bm{x})$ with the optimal incremental first-order oracle (IFO) complexity, i.e., the complexity measured in the number of sample gradient evaluations, matching state-of-the-art centralized algorithms, but at a much lower communication complexity compared to existing decentralized algorithms over a wide range of parameter regimes.

To achieve resource efficiency, DESTRESS leverages several key ideas in the algorithm design. To reduce local computation, DESTRESS harnesses the finite-sum structure of the empirical risk function by performing stochastic variance-reduced recursive gradient updates [NvDP+19, FLLZ18, WJZ+19, Li19, LR21, LBZR21, ZXG18]—an approach that is shown to be optimal in terms of IFO complexity in the centralized setting—in a randomly activated manner to further improve computational efficiency when the local sample size is limited. To reduce communication, DESTRESS employs gradient tracking [ZM10] with a few mixing rounds per iteration, which helps accelerate convergence through better information sharing [LCCC20]; the extra mixing scheme can be implemented using Chebyshev acceleration [AS14] to further improve the communication efficiency. In a nutshell, to find an $\epsilon$-approximate first-order stationary point, i.e., $\mathbb{E}\big\|\nabla f(\bm{x}^{\mathsf{output}})\big\|^{2}_{2}\leq\epsilon$, where $\bm{x}^{\mathsf{output}}$ is the output of DESTRESS and the expectation is taken with respect to the randomness of the algorithm, DESTRESS requires:

  • $O\big(m+(m/n)^{1/2}L/\epsilon\big)$ per-agent IFO calls (the big-$O$ notation is defined in Section 1.3), which is network-independent; and

  • $O\Big(\frac{\log\big((n/m)^{1/2}+2\big)}{(1-\alpha)^{1/2}}\cdot\big((mn)^{1/2}+\frac{L}{\epsilon}\big)\Big)$ rounds of communication,

where $L$ is the smoothness parameter of the sample loss, $\alpha\in[0,1)$ is the mixing rate of the network topology, $n$ is the number of agents, and $m=N/n$ is the local sample size.

| Algorithm | Setting | Per-agent IFO complexity | Communication rounds |
| --- | --- | --- | --- |
| SVRG [AZH16, RHS+16] | centralized | $N+\frac{N^{2/3}L}{\epsilon}$ | n/a |
| SCSG/SVRG+ [LJCJ17, LL18] | centralized | $N+\frac{N^{2/3}L}{\epsilon}$ | n/a |
| SNVRG [ZXG18] | centralized | $N+\frac{N^{1/2}L}{\epsilon}$ | n/a |
| SARAH/SPIDER/SpiderBoost [NvDP+19, FLLZ18, WJZ+19] | centralized | $N+\frac{N^{1/2}L}{\epsilon}$ | n/a |
| SSRGD/ZeroSARAH/PAGE [Li19, LR21, LBZR21] | centralized | $N+\frac{N^{1/2}L}{\epsilon}$ | n/a |
| D-GET [SLH20] | decentralized | $m+\frac{1}{(1-\alpha)^{2}}\cdot\frac{m^{1/2}L}{\epsilon}$ | same as IFO |
| GT-SARAH [XKK20b] | decentralized | $m+\max\Big(\frac{1}{(1-\alpha)^{2}},\big(\frac{m}{n}\big)^{1/2},\frac{(m/n+1)^{1/3}}{1-\alpha}\Big)\cdot\frac{L}{\epsilon}$ | same as IFO |
| DESTRESS (this paper) | decentralized | $m+\frac{(m/n)^{1/2}L}{\epsilon}$ | $\frac{1}{(1-\alpha)^{1/2}}\cdot\Big((mn)^{1/2}+\frac{L}{\epsilon}\Big)$ |

Table 1: The per-agent IFO complexities and communication complexities to find $\epsilon$-approximate first-order stationary points by stochastic variance-reduced algorithms for nonconvex finite-sum problems. The algorithms listed in the first five rows are designed for the centralized setting, and the remaining D-GET, GT-SARAH and our DESTRESS are in the decentralized setting. Here, $n$ is the number of agents, $m=N/n$ is the local sample size, $L$ is the smoothness parameter of the sample loss, and $\alpha\in[0,1)$ is the mixing rate of the network topology. The big-$O$ notations and logarithmic terms are omitted for simplicity.

Comparisons with existing algorithms.

Table 1 summarizes the convergence guarantees of representative stochastic variance-reduced algorithms for finding first-order stationary points across centralized and decentralized communication settings.

  • In terms of the computation complexity, the overall IFO complexity of DESTRESS—when summed over all agents—becomes

    $$n\cdot O\big(m+(m/n)^{1/2}L/\epsilon\big)=O\big(mn+(mn)^{1/2}L/\epsilon\big)=O\big(N+N^{1/2}L/\epsilon\big),$$

    matching the optimal IFO complexity of centralized algorithms (e.g., SPIDER [FLLZ18], PAGE [LBZR21]) and distributed server/client algorithms (e.g., D-ZeroSARAH [LR21]). In contrast, the state-of-the-art decentralized algorithm GT-SARAH [XKK20b] does not achieve this optimal IFO complexity in most regimes (see Table 1). To the best of our knowledge, DESTRESS is the first decentralized algorithm to achieve the optimal IFO complexity regardless of the network topology and sample size.

  • When it comes to the communication complexity, it is observed that the communication rounds of DESTRESS can be decomposed into the sum of an $\epsilon$-independent term and an $\epsilon$-dependent term (up to a logarithmic factor), i.e.,

    $$\underbrace{\frac{1}{(1-\alpha)^{1/2}}\cdot(mn)^{1/2}}_{\epsilon\text{-}\mathsf{independent}}+\underbrace{\frac{1}{(1-\alpha)^{1/2}}\cdot\frac{L}{\epsilon}}_{\epsilon\text{-}\mathsf{dependent}};$$

    similar decompositions also apply to competing decentralized algorithms. DESTRESS significantly improves the $\epsilon$-dependent term of D-GET and GT-SARAH by at least a factor of $\frac{1}{(1-\alpha)^{3/2}}$, and therefore saves more communication over poorly-connected networks. Further, the $\epsilon$-independent term of DESTRESS is also smaller than that of D-GET/GT-SARAH as long as the local sample size is sufficiently large, i.e., $m=\Omega\big(\frac{n}{1-\alpha}\big)$, which holds for a wide variety of application scenarios. To gain further insight into the communication savings of DESTRESS, Table 2 further compares the communication complexities of decentralized algorithms for finding first-order stationary points under three common network settings.

| | Erdős-Rényi graph | 2-D grid graph | Path graph |
| --- | --- | --- | --- |
| $1-\alpha$ (spectral gap) | $1$ | $\frac{1}{n\log n}$ | $\frac{1}{n^{2}}$ |
| D-GET [SLH20] | $m+\frac{m^{1/2}L}{\epsilon}$ | $m+\frac{m^{1/2}n^{2}L}{\epsilon}$ | $m+\frac{m^{1/2}n^{4}L}{\epsilon}$ |
| GT-SARAH [XKK20b] | $m+\max\big\{1,\big(\frac{m}{n}\big)^{1/3},\big(\frac{m}{n}\big)^{1/2}\big\}\cdot\frac{L}{\epsilon}$ | $m+\max\big\{n^{2},m^{1/3}n^{2/3},\big(\frac{m}{n}\big)^{1/2}\big\}\cdot\frac{L}{\epsilon}$ | $m+\max\big\{n^{4},m^{1/3}n^{5/3},\big(\frac{m}{n}\big)^{1/2}\big\}\cdot\frac{L}{\epsilon}$ |
| DESTRESS (this paper) | $(mn)^{1/2}+\frac{L}{\epsilon}$ | $m^{1/2}n+\frac{n^{1/2}L}{\epsilon}$ | $(mn^{3})^{1/2}+\frac{nL}{\epsilon}$ |
| Improvement factor ($\epsilon$-independent term) | $\big(\frac{m}{n}\big)^{1/2}$ | $\frac{m^{1/2}}{n}$ | $\frac{m^{1/2}}{n^{3/2}}$ |
| Improvement factor ($\epsilon$-dependent term) | $\max\big\{1,\big(\frac{m}{n}\big)^{1/3},\big(\frac{m}{n}\big)^{1/2}\big\}$ | $\max\big\{n^{3/2},m^{1/3}n^{1/6},\frac{m^{1/2}}{n}\big\}$ | $\max\big\{n^{3},m^{1/3}n^{2/3},\frac{m^{1/2}}{n^{3/2}}\big\}$ |
Table 2: Detailed comparisons of the communication complexities of D-GET, GT-SARAH and DESTRESS under three graph topologies, where the last two rows delineate the improvement factors of DESTRESS over existing algorithms. The communication savings become significant especially when $m=\Omega\big(\frac{n}{1-\alpha}\big)$. The complexities are simplified by plugging in the bound on the spectral gap $1-\alpha$ from [NOR18, Proposition 5]. Here, $n$ is the number of agents, $m=N/n$ is the local sample size, $L$ is the smoothness parameter of the sample loss, and $\alpha\in[0,1)$ is the mixing rate of the network topology. The big-$O$ notations and logarithmic terms are omitted for simplicity.

In sum, DESTRESS harnesses the ideas of variance reduction, gradient tracking and extra mixing in a sophisticated manner to achieve a scalable decentralized algorithm for nonconvex empirical risk minimization that is competitive in both computation and communication over existing approaches.

1.2 Additional related works

Decentralized optimization and learning have been studied extensively, with contemporary emphasis on the ability to scale gracefully to large-scale problems, both in terms of the size of the data and the size of the network. For conciseness, we focus our discussion on the most relevant literature and refer interested readers to recent overviews [NRB20, XPNK20, XKK20a] for further references.

Stochastic variance-reduced methods.

Many variants of stochastic variance-reduced gradient methods have been proposed for finding first-order stationary points in finite-sum optimization, including but not limited to SVRG [JZ13, AZH16, RHS+16], SCSG [LJCJ17], SVRG+ [LL18], SAGA [DBLJ14], SARAH [NLST17, NvDP+19], SPIDER [FLLZ18], SpiderBoost [WJZ+19], SSRGD [Li19], ZeroSARAH [LR21] and PAGE [LBZR21, Li21]. SVRG/SVRG+/SCSG/SAGA utilize stochastic variance-reduced gradients as a corrected estimator of the full gradient, but can only achieve a sub-optimal IFO complexity of $O(N+N^{2/3}L/\epsilon)$. Other algorithms such as SARAH, SPIDER, SpiderBoost, SSRGD and PAGE adopt stochastic recursive gradients to improve the IFO complexity to $O(N+N^{1/2}L/\epsilon)$, which is optimal as indicated by the lower bound provided in [FLLZ18, LBZR21]. DESTRESS also utilizes stochastic recursive gradients to perform variance reduction, which results in the optimal IFO complexity for finding first-order stationary points.

Decentralized stochastic nonconvex optimization.

There has been a flurry of recent activity in decentralized nonconvex optimization in both the server/client setting and the network setting. In the server/client setting, [CZC+20] simplifies the approaches in [LLMY17] for distributing stochastic variance-reduced algorithms without requiring sampling extra data. In particular, D-SARAH [CZC+20] extends SARAH to the server/client setting but with a slightly worse IFO complexity and a sample-independent communication complexity. D-ZeroSARAH [LR21] obtains the optimal IFO complexity in the server/client setting. In the network setting, D-PSGD [LZZ+17] and SGP [ALBR19] extend stochastic gradient descent (SGD) to solve nonconvex decentralized expectation minimization problems with sub-optimal rates. However, due to the noisy stochastic gradients, D-PSGD can only use a diminishing step size to ensure convergence, and SGP uses a small step size on the order of $1/K$, where $K$ denotes the total number of iterations. $D^{2}$ [TLY+18] introduces a variance-reduced correction term to D-PSGD, which allows a constant step size and hence reaches a better convergence rate.

Gradient tracking [ZM10, QL18] provides a systematic approach to estimate the global gradient at each agent, which allows one to easily design decentralized optimization algorithms based on existing centralized algorithms. This idea is applied in [ZY19] to extend SGD to the decentralized setting, and in [LCCC20] to extend quasi-Newton algorithms as well as stochastic variance-reduced algorithms, with performance guarantees for optimizing strongly convex functions. GT-SAGA [XKK20c] further uses SAGA-style updates and reaches a convergence rate that matches SAGA [DBLJ14, RSPS16]. However, GT-SAGA requires storing a variable table, which leads to a high memory complexity. D-GET [SLH20] and GT-SARAH [XKK20b] adopt equivalent recursive local gradient estimators to enable the use of constant step sizes without extra memory usage. The IFO complexity of GT-SARAH is optimal only in the restrictive regime $m\gtrsim\frac{n}{(1-\alpha)^{6}}$, while DESTRESS achieves the optimal IFO complexity over all parameter regimes.

In addition to variance reduction techniques, performing multiple mixing steps between local updates can greatly improve the dependence of the convergence rate on the network, since it is equivalent to letting the agents communicate over a better-connected communication graph, which in turn leads to faster convergence (and better overall efficiency) due to better information mixing. This technique has been applied in a number of recent works including [BBKW19, PLW20, BBW21, LCCC20, HAD+20, IW21], and its effectiveness is verified both in theory and in experiments. Our algorithm also adopts extra mixing steps, which leads to better IFO and communication complexities.

1.3 Paper organization and notation

Section 2 introduces preliminary concepts and the algorithm development, Section 3 shows the theoretical performance guarantees for DESTRESS, Section 4 provides numerical evidence to support the analysis, and Section 5 concludes the paper. Proofs and experiment settings are postponed to appendices.

Throughout this paper, we use boldface letters to represent matrices and vectors. We use $\|\cdot\|_{\mathsf{op}}$ for the matrix operator norm, $\otimes$ for the Kronecker product, $\bm{I}_{n}$ for the $n$-dimensional identity matrix, and $\bm{1}_{n}$ for the $n$-dimensional all-one vector. For two real functions $f(\cdot)$ and $g(\cdot)$ defined on $\mathbb{R}^{+}$, we say $f(x)=O\big(g(x)\big)$ or $f(x)\lesssim g(x)$ if there exists some universal constant $M>0$ such that $f(x)\leq Mg(x)$. The notation $f(x)=\Omega\big(g(x)\big)$ or $f(x)\gtrsim g(x)$ means $g(x)=O\big(f(x)\big)$.

2 Preliminaries and Proposed Algorithm

We start by describing a few useful preliminary concepts and definitions in Section 2.1, then present the proposed algorithm in Section 2.2.

2.1 Preliminaries

Mixing.

Information mixing between agents is conducted by updating the local information via a weighted sum of the information received from neighbors, which is characterized by a mixing (gossiping) matrix. An important quantity associated with this matrix is the mixing rate, defined in Definition 1.

Definition 1 (Mixing matrix and mixing rate).

The mixing matrix is a matrix $\bm{W}=[w_{ij}]\in\mathbb{R}^{n\times n}$ such that $w_{ij}=0$ if agents $i$ and $j$ are not connected according to the communication graph $\mathcal{G}$. Furthermore, $\bm{W}\bm{1}_{n}=\bm{1}_{n}$ and $\bm{W}^{\top}\bm{1}_{n}=\bm{1}_{n}$. The mixing rate of a mixing matrix $\bm{W}$ is defined as

$$\alpha:=\big\|\bm{W}-\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top}\big\|_{\mathsf{op}}. \qquad (2)$$

The mixing rate indicates the speed at which information is shared across the network. For example, for a fully-connected network, choosing $\bm{W}=\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top}$ leads to $\alpha=0$. For general networks and mixing matrices, [NOR18, Proposition 5] provides comprehensive bounds on $1-\alpha$—also known as the spectral gap—for various graphs. In practice, FDLA matrices [XB04] are more favorable because they can achieve a much smaller mixing rate, but they usually contain negative elements and are not symmetric. Different from other algorithms that require the mixing matrix to be doubly stochastic, our analysis can handle arbitrary mixing matrices as long as their row/column sums equal one.
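As a concrete illustration, the following is a minimal sketch (not from the paper; the Metropolis-weight construction is one simple choice satisfying Definition 1, rather than the FDLA matrices used in the experiments) that builds a mixing matrix for a path graph and evaluates its mixing rate according to (2).

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly-stochastic mixing matrix from an adjacency matrix
    (Metropolis-Hastings rule); one simple way to satisfy Definition 1."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

def mixing_rate(W):
    """alpha = || W - (1/n) 1 1^T ||_op, as in (2)."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, ord=2)

# Path graph on n = 10 agents: poorly connected, so alpha is close to 1.
n = 10
adj = np.zeros((n, n), dtype=int)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1
W = metropolis_weights(adj)
print(f"alpha = {mixing_rate(W):.4f}, spectral gap = {1 - mixing_rate(W):.4f}")
```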

Dynamic average consensus.

It is by now well understood that merely using a naive mixing of local information, e.g., the local gradients of neighboring agents, does not lead to fast convergence of decentralized extensions of centralized methods [NO09, SLWY15]. This is due to the fact that the quantity of interest in solving decentralized optimization problems is often iteration-varying, which naive mixing is unable to track; consequently, an accumulation of errors leads to either slow convergence or poor accuracy. Fortunately, the general scheme of dynamic average consensus [ZM10] proves to be extremely effective in tracking the dynamic average of local variables over the course of iterative algorithms, and has been applied to extend many centralized algorithms to decentralized settings, e.g., [NOS17, QL18, DLS16, LCCC20]. This idea, also known as "gradient tracking" in the literature, essentially adds a correction term to the naive information mixing, which we will employ in the communication stage of the proposed algorithm to track the dynamic average of local gradients.
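The following is a minimal sketch (illustrative names; not the paper's implementation) of the dynamic average consensus recursion, verifying numerically that the per-agent average of the trackers always equals the current average of the tracked local quantities.

```python
import numpy as np

def track_dynamic_average(W, local_grads):
    """Dynamic average consensus ("gradient tracking"):
        s^(t) = W (s^(t-1) + g^(t) - g^(t-1)),
    where g^(t) stacks the local gradients (one row per agent) at iteration t.
    The average of the trackers always equals the average of the current g^(t)."""
    T, n, d = local_grads.shape
    s = local_grads[0].copy()          # initialize with the first local gradients
    for t in range(1, T):
        s = W @ (s + local_grads[t] - local_grads[t - 1])
        # Invariant: mean of the trackers equals mean of the current gradients.
        assert np.allclose(s.mean(axis=0), local_grads[t].mean(axis=0))
    return s

# Toy check with a fully-connected network (W = (1/n) 1 1^T) and random "gradients".
rng = np.random.default_rng(1)
n, d, T = 5, 3, 20
W = np.ones((n, n)) / n
grads = rng.normal(size=(T, n, d))
s = track_dynamic_average(W, grads)
print(np.allclose(s.mean(axis=0), grads[-1].mean(axis=0)))
```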

Stochastic recursive gradient methods.

Stochastic recursive gradient methods [NvDP+19, FLLZ18, WJZ+19, Li19] achieve the optimal IFO complexity in the centralized setting for nonconvex finite-sum optimization, which makes it natural to adapt them to the decentralized setting with the hope of maintaining the appealing IFO complexity. Roughly speaking, these methods use a nested loop structure to iteratively refine the parameter, where 1) a global gradient evaluation is performed at each outer loop, and 2) a stochastic recursive gradient estimator is used to calculate the gradient and update the parameter at each inner loop. In the proposed DESTRESS algorithm, this nested loop structure lends itself to a natural decentralized scheme, as will be seen momentarily.
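As a point of reference, here is a minimal centralized sketch of one outer iteration of a SARAH-style method (a hypothetical toy, not the paper's algorithm), showing the recursive gradient estimator that DESTRESS builds on.

```python
import numpy as np

def sarah_inner_loop(grad_full, grad_sample, x0, data, eta, S, b, rng):
    """One outer iteration of a SARAH-style method: start from a full gradient,
    then recursively update the estimator
        v_s = (1/b) sum_{z in batch} [grad(x_s; z) - grad(x_{s-1}; z)] + v_{s-1},
    taking gradient steps x_{s+1} = x_s - eta * v_s along the way."""
    x_prev, v = x0.copy(), grad_full(x0)
    x = x_prev - eta * v
    for _ in range(S):
        batch = rng.choice(len(data), size=b, replace=False)
        correction = np.mean(
            [grad_sample(x, data[j]) - grad_sample(x_prev, data[j]) for j in batch],
            axis=0,
        )
        v = correction + v
        x_prev, x = x, x - eta * v
    return x

# Toy quadratic example: ell(x; (a, y)) = 0.5 * (a^T x - y)^2.
rng = np.random.default_rng(2)
d, N = 4, 100
A, y = rng.normal(size=(N, d)), rng.normal(size=N)
data = list(zip(A, y))
grad_sample = lambda x, z: (z[0] @ x - z[1]) * z[0]
grad_full = lambda x: A.T @ (A @ x - y) / N
x = sarah_inner_loop(grad_full, grad_sample, np.zeros(d), data, eta=0.05, S=50, b=10, rng=rng)
print(np.linalg.norm(grad_full(x)))
```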

Additional notation.

For convenience of presentation, define the stacked vector $\bm{x}\in\mathbb{R}^{nd}$ and its average over all agents $\overline{\bm{x}}\in\mathbb{R}^{d}$ as

$$\bm{x}:=\big[\bm{x}_{1}^{\top},\cdots,\bm{x}_{n}^{\top}\big]^{\top},\quad\overline{\bm{x}}=\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i}. \qquad (3)$$

The vectors $\bm{s}$, $\overline{\bm{s}}$, $\bm{u}$, $\overline{\bm{u}}$, $\bm{v}$ and $\overline{\bm{v}}$ are defined in the same fashion. In addition, for a stacked vector $\bm{x}\in\mathbb{R}^{nd}$, we introduce the distributed gradient $\nabla F(\bm{x})\in\mathbb{R}^{nd}$ as

$$\nabla F(\bm{x}):=[\nabla f_{1}(\bm{x}_{1})^{\top},\cdots,\nabla f_{n}(\bm{x}_{n})^{\top}]^{\top}. \qquad (4)$$

2.2 The DESTRESS Algorithm

Detailed in Algorithm 1, we propose a novel decentralized stochastic optimization algorithm, dubbed DESTRESS, for finding first-order stationary points of nonconvex finite-sum problems. Motivated by stochastic recursive gradient methods in the centralized setting, DESTRESS has a nested loop structure:

  1. The outer loop adopts dynamic average consensus to estimate and track the global gradient $\nabla F(\bm{x}^{(t)})$ at each agent in (5), where $\bm{x}^{(t)}$ is the stacked parameter estimate (cf. (4)). This helps to "reset" the stochastic gradient to a less noisy starting gradient $\bm{v}^{(t),0}=\bm{s}^{(t)}$ of the inner loop. A key property of (5)—which is a direct consequence of dynamic average consensus—is that the average of $\bm{s}^{(t)}$ equals the dynamic average of local gradients, i.e., $\overline{\bm{s}}^{(t)}=\tfrac{1}{n}\sum_{i\in[n]}\bm{s}_{i}^{(t)}=\tfrac{1}{n}\sum_{i\in[n]}\nabla f_{i}(\bm{x}_{i}^{(t)})$.

  2. The inner loop refines the parameter estimate $\bm{u}^{(t),0}=\bm{x}^{(t)}$ by performing randomly activated stochastic recursive gradient updates (6), where the stochastic recursive gradient $\bm{g}^{(t),s}$ is updated in (6b) by sampling mini-batches from the activated agents' local datasets.

To complete the last mile, inspired by [LCCC20], we allow DESTRESS to perform a few rounds of mixing or gossiping whenever communication takes place, to enable better information sharing and faster convergence. Specifically, DESTRESS performs $K_{\mathsf{out}}$ and $K_{\mathsf{in}}$ mixing steps per iteration for the outer and inner loops, respectively, which is equivalent to using

$$\bm{W}_{\mathsf{out}}=\bm{W}^{K_{\mathsf{out}}}\qquad\text{and}\qquad\bm{W}_{\mathsf{in}}=\bm{W}^{K_{\mathsf{in}}}$$

as mixing matrices, and correspondingly a network with better connectivity; see (5), (6a) and (6c). Note that Algorithm 1 is written in matrix notation, where the mixing steps are described by $\bm{W}_{\mathsf{in}}\otimes\bm{I}_{d}$ or $\bm{W}_{\mathsf{out}}\otimes\bm{I}_{d}$ and applied to all agents simultaneously. The extra mixing steps can be implemented by Chebyshev acceleration [AS14] with improved communication efficiency.
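For intuition, the following is a minimal sketch of one standard way to realize the extra mixing via a Chebyshev polynomial of the mixing matrix (an illustrative construction in the spirit of [AS14]; it assumes a symmetric mixing matrix and is not the paper's exact implementation). It damps the disagreement among agents faster than applying $\bm{W}$ the same number of times, while preserving the per-agent average.

```python
import numpy as np

def chebyshev_mixing(W, X, K, alpha):
    """Apply K rounds of Chebyshev-accelerated mixing to the stacked local
    vectors X (one row per agent). For a symmetric mixing matrix W with mixing
    rate alpha, this evaluates p_K(W) X with
        p_K(lambda) = T_K(lambda / alpha) / T_K(1 / alpha),
    where T_K is the degree-K Chebyshev polynomial: the consensus component
    (eigenvalue 1) is preserved and disagreement is damped by 1 / T_K(1/alpha)."""
    if K == 0:
        return X.copy()
    Z_prev, Z = X.copy(), (W @ X) / alpha          # T_0(W/alpha) X and T_1(W/alpha) X
    t_prev, t = 1.0, 1.0 / alpha                   # scalar recurrence for T_k(1/alpha)
    for _ in range(K - 1):
        Z_prev, Z = Z, 2.0 * (W @ Z) / alpha - Z_prev
        t_prev, t = t, 2.0 * t / alpha - t_prev
    return Z / t

# Compare plain repeated mixing and Chebyshev-accelerated mixing on a ring graph.
rng = np.random.default_rng(3)
n, d, K = 20, 3, 5
W = np.zeros((n, n))
for i in range(n):                                  # symmetric ring: 1/3 on self and neighbors
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0
alpha = np.linalg.norm(W - np.ones((n, n)) / n, ord=2)
X = rng.normal(size=(n, d))
disagree = lambda Y: np.linalg.norm(Y - Y.mean(axis=0))
print(disagree(np.linalg.matrix_power(W, K) @ X), disagree(chebyshev_mixing(W, X, K, alpha)))
```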

Algorithm 1 DESTRESS for decentralized nonconvex finite-sum optimization
1:  input: initial parameter $\overline{\bm{x}}^{(0)}$, step size $\eta$, activation probability $p$, batch size $b$, number of outer loops $T$, number of inner loops $S$, and numbers of communication (extra mixing) steps $K_{\mathsf{in}}$ and $K_{\mathsf{out}}$.
2:  initialization: set $\bm{x}_{i}^{(0)}=\overline{\bm{x}}^{(0)}$ and $\bm{s}_{i}^{(0)}=\nabla f(\overline{\bm{x}}^{(0)})$ for all agents $1\leq i\leq n$.
3:  for $t=1,\ldots,T$ do
4:     Set the new parameter estimate $\bm{x}^{(t)}=\bm{u}^{(t-1),S}$.
5:     Update the global gradient estimate by aggregating local information with gradient tracking:
$$\bm{s}^{(t)}=(\bm{W}_{\mathsf{out}}\otimes\bm{I}_{d})\Big(\bm{s}^{(t-1)}+\nabla F\big(\bm{x}^{(t)}\big)-\nabla F\big(\bm{x}^{(t-1)}\big)\Big) \qquad (5)$$
6:     Set $\bm{u}^{(t),0}=\bm{x}^{(t)}$ and $\bm{v}^{(t),0}=\bm{s}^{(t)}$.
7:     for $s=1,\ldots,S$ do
8:        Each agent $i$ samples a mini-batch $\mathcal{Z}_{i}^{(t),s}$ of size $b$ from $\mathcal{M}_{i}$ uniformly at random and draws $\lambda_{i}^{(t),s}\sim\mathcal{B}(p)$, where $\mathcal{B}(p)$ denotes the Bernoulli distribution with parameter $p$ (no stochastic gradients are computed if $\lambda_{i}^{(t),s}=0$), and then performs the following updates:
$$\bm{u}^{(t),s}=(\bm{W}_{\mathsf{in}}\otimes\bm{I}_{d})(\bm{u}^{(t),s-1}-\eta\bm{v}^{(t),s-1}), \qquad (6a)$$
$$\bm{g}_{i}^{(t),s}=\frac{\lambda_{i}^{(t),s}}{pb}\sum_{\bm{z}_{i}\in\mathcal{Z}_{i}^{(t),s}}\Big(\nabla\ell(\bm{u}_{i}^{(t),s};\bm{z}_{i})-\nabla\ell(\bm{u}_{i}^{(t),s-1};\bm{z}_{i})\Big)+\bm{v}_{i}^{(t),s-1}, \qquad (6b)$$
$$\bm{v}^{(t),s}=(\bm{W}_{\mathsf{in}}\otimes\bm{I}_{d})\bm{g}^{(t),s}. \qquad (6c)$$
9:     end for
10:  end for
11:  output: $\bm{x}^{\mathsf{output}}\sim\text{Uniform}\big(\{\bm{u}_{i}^{(t),s-1}\,|\,i\in[n],t\in[T],s\in[S]\}\big)$.

Compared with existing decentralized algorithms based on stochastic variance-reduced methods such as D-GET [SLH20] and GT-SARAH [XKK20b], DESTRESS utilizes different gradient estimators and communication protocols. First, DESTRESS produces a sequence of reference points $\bm{x}^{(t)}$—which converge to a global first-order stationary point—to "restart" the inner loops periodically using fresher information; second, the communication and computation in DESTRESS are paced differently due to the introduction of extra mixing, which allows more flexible trade-offs between different types of resources; last but not least, the random activation of stochastic recursive gradient updates further saves local computation, especially when the local sample size is small compared to the number of agents.
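To make the overall procedure concrete, here is a compact NumPy sketch of Algorithm 1 (a toy transcription with illustrative names: it uses plain repeated mixing in place of Chebyshev acceleration and returns the average of the final iterates rather than a uniformly sampled one).

```python
import numpy as np

def destress(grad_sample, local_data, x0, W, eta, p, b, T, S, K_in, K_out, rng):
    """Toy sketch of Algorithm 1. local_data[i] is agent i's dataset;
    grad_sample(x, z) returns the sample gradient of ell(x; z); rows of the
    stacked arrays index agents."""
    n, d = len(local_data), x0.shape[0]
    W_out = np.linalg.matrix_power(W, K_out)          # extra mixing (plain powers here)
    W_in = np.linalg.matrix_power(W, K_in)
    local_grad = lambda i, x: np.mean([grad_sample(x, z) for z in local_data[i]], axis=0)
    full_grads = lambda X: np.stack([local_grad(i, X[i]) for i in range(n)])

    U = np.tile(x0, (n, 1))                           # x_i^(0) = x^(0) for all agents
    S_track = np.tile(np.mean(full_grads(U), axis=0), (n, 1))   # s_i^(0) = grad f(x^(0))
    G_prev = full_grads(U)
    for t in range(T):
        X = U                                         # x^(t) = u^(t-1),S
        G = full_grads(X)
        S_track = W_out @ (S_track + G - G_prev)      # gradient tracking, eq. (5)
        G_prev = G
        U, V = X, S_track                             # u^(t),0 and v^(t),0
        for s in range(S):
            U_prev = U
            U = W_in @ (U - eta * V)                  # eq. (6a)
            Gs = np.zeros((n, d))
            for i in range(n):
                if rng.random() < p:                  # random activation, Bernoulli(p)
                    batch = rng.choice(len(local_data[i]), size=b, replace=False)
                    diff = np.mean([grad_sample(U[i], local_data[i][j])
                                    - grad_sample(U_prev[i], local_data[i][j])
                                    for j in batch], axis=0)
                    Gs[i] = diff / p + V[i]           # eq. (6b)
                else:
                    Gs[i] = V[i]
            V = W_in @ Gs                             # eq. (6c)
    return U.mean(axis=0)                             # average of final iterates (sketch only)

# Toy run: n = 4 agents, m = 10 quadratic samples each, fully-connected network.
rng = np.random.default_rng(4)
n, m, d = 4, 10, 3
local_data = [[(rng.normal(size=d), rng.normal()) for _ in range(m)] for _ in range(n)]
grad_sample = lambda x, z: (z[0] @ x - z[1]) * z[0]
W = np.ones((n, n)) / n
print(destress(grad_sample, local_data, np.zeros(d), W, eta=0.05, p=1.0, b=3,
               T=5, S=5, K_in=1, K_out=1, rng=rng))
```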

3 Performance Guarantees

This section presents the performance guarantees of DESTRESS for finding first-order stationary points of the global objective function $f(\cdot)$.

3.1 Assumptions

We first introduce Assumption 1 and Assumption 2, which are standard assumptions imposed on the loss function. Assumption 1 implies that all local objective functions $f_{i}(\cdot)$ and the global objective function $f(\cdot)$ also have Lipschitz gradients, and Assumption 2 guarantees the absence of trivial solutions.

Assumption 1 (Lipschitz gradient).

The sample loss function $\ell(\bm{x};\bm{z})$ has $L$-Lipschitz gradients, namely, $\big\|\nabla\ell(\bm{x};\bm{z})-\nabla\ell(\bm{x}^{\prime};\bm{z})\big\|_{2}\leq L\|\bm{x}-\bm{x}^{\prime}\|_{2}$ for all $\bm{x},\bm{x}^{\prime}\in\mathbb{R}^{d}$ and $\bm{z}\in\mathcal{M}$.

Assumption 2 (Function boundedness).

The global objective function $f(\cdot)$ is bounded below, i.e., $f^{*}=\inf_{\bm{x}\in\mathbb{R}^{d}}f(\bm{x})>-\infty$.

Due to the nonconvexity, first-order algorithms are generally only guaranteed to converge to first-order stationary points of the global loss function $f(\cdot)$, defined below in Definition 2.

Definition 2 (First-order stationary point).

A point $\bm{x}\in\mathbb{R}^{d}$ is called an $\epsilon$-approximate first-order stationary point of a differentiable function $f(\cdot)$ if

$$\|\nabla f(\bm{x})\|_{2}^{2}\leq\epsilon.$$

3.2 Main theorem

Theorem 1, whose proof is deferred to Appendix B, shows that DESTRESS converges in expectation to an approximate first-order stationary point, under suitable parameter choices.

Theorem 1 (First-order optimality).

Assume Assumptions 1 and 2 hold. Set $p\in(0,1]$, and let $K_{\mathsf{in}}$, $K_{\mathsf{out}}$, $S$, $b$ and $\eta$ be positive and satisfy

$$\alpha^{K_{\mathsf{in}}}\leq p\quad\text{and}\quad\eta L\leq\frac{(1-\alpha^{K_{\mathsf{in}}})^{3}(1-\alpha^{K_{\mathsf{out}}})}{10\big(1+\alpha^{K_{\mathsf{in}}}\alpha^{K_{\mathsf{out}}}\sqrt{npb}\big)\big(\sqrt{S/(npb)}+1\big)}. \qquad (7)$$

The output produced by Algorithm 1 satisfies

$$\mathbb{E}\big\|\nabla f(\bm{x}^{\mathsf{output}})\big\|^{2}_{2}<\frac{4}{\eta TS}\Big(\mathbb{E}[f(\overline{\bm{x}}^{(0)})]-f^{*}\Big). \qquad (8)$$

When there is only one agent, i.e., $n=1$, the mixing rate is $\alpha=0$; we can then choose $K_{\mathsf{in}}=K_{\mathsf{out}}=p=1$, and Theorem 1 reduces to [NvDP+19, Theorem 1], its counterpart in the centralized setting. For general decentralized settings with arbitrary mixing schedules, Theorem 1 provides a comprehensive characterization of the convergence rate, where an $\epsilon$-approximate first-order stationary point can be found in expectation in a total of

$$TS=O\left(\frac{\mathbb{E}[f(\overline{\bm{x}}^{(0)})]-f^{*}}{\eta\epsilon}\right)$$

iterations; here, $T$ is the number of outer iterations and $S$ is the number of inner iterations. Clearly, a larger step size $\eta$, as allowed by (7), implies a smaller iteration complexity, and hence a smaller IFO complexity.

There are two conditions in (7). On the one hand, $K_{\mathsf{in}}$ needs to be large enough (i.e., more rounds of extra mixing are performed) to counter the effect of a small $p$ (i.e., fewer stochastic gradients are computed per iteration) or of $\alpha$ close to $1$ (i.e., a poorly connected network). On the other hand, the step size $\eta$ needs to be small enough to account for the step size requirement in the centralized setting, as well as for the effect of imperfect communication due to decentralization. For well-connected networks where $\alpha\ll 1$, the terms introduced by the decentralized setting diminish, indicating that the iteration complexity is close to that of the centralized setting. For poorly-connected networks, carefully designing the mixing matrix and other parameters can ensure a desirable trade-off between convergence speed and communication cost. The following corollary provides specific parameter choices for DESTRESS to achieve the optimal per-agent IFO complexity. The proof is deferred to Appendix C.

Corollary 1 (Complexity for finding first-order stationary points).

Under the conditions of Theorem 1, set $S=\big\lceil\sqrt{mn}\big\rceil$, $b=\big\lceil\sqrt{m/n}\big\rceil$, $p=\frac{\sqrt{m/n}}{\lceil\sqrt{m/n}\rceil}$, $K_{\mathsf{out}}=\Big\lceil\frac{\log(\sqrt{npb}+1)}{(1-\alpha)^{1/2}}\Big\rceil$, $K_{\mathsf{in}}=\Big\lceil\frac{\log(2/p)}{(1-\alpha)^{1/2}}\Big\rceil$, $\eta=\frac{1}{640L}$, and implement the mixing steps using Chebyshev acceleration [AS14]. To reach an $\epsilon$-approximate first-order stationary point in expectation, DESTRESS takes $O\Big(m+\frac{(m/n)^{1/2}L}{\epsilon}\Big)$ IFO calls per agent and $O\Big(\frac{\log\big((n/m)^{1/2}+2\big)}{(1-\alpha)^{1/2}}\cdot\big((mn)^{1/2}+\frac{L}{\epsilon}\big)\Big)$ rounds of communication.

As elaborated in Section 1.1, DESTRESS achieves a network-independent IFO complexity that matches the optimal complexity in the centralized setting. In addition, when the target accuracy satisfies $\epsilon\lesssim L/(mn)^{1/2}$, DESTRESS reaches a communication complexity of $O\big(\frac{1}{(1-\alpha)^{1/2}}\cdot\frac{L}{\epsilon}\big)$, which is independent of the sample size.

It is worthwhile to further highlight the role of the random activation probability $p$ in achieving the optimal IFO complexity by allowing a "fractional" batch size. Note that the batch size is set as $b=\lceil\sqrt{m/n}\rceil$, where $m$ is the local sample size and $n$ is the number of agents.

  1. When the local sample size is large, i.e., $m\geq n$, we can approximate $b\approx\sqrt{m/n}$ and $p\approx 1$. In fact, Corollary 1 continues to hold with $p=1$ in this regime.

  2. However, when the number of agents is large, i.e., $n>m$, the batch size is $b=1$ and $p=\sqrt{m/n}<1$, which mitigates the potential waste of computation by only selecting a subset of agents to perform local computation, compared to naively setting $p=1$.

Therefore, by introducing random activation, we can view $pb=\sqrt{m/n}$ as the effective batch size at each agent, which allows fractional values and leads to the optimal IFO complexity in all scenarios.
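For reference, the following helper (a hypothetical utility, not part of the paper's code; natural logarithms are assumed) transcribes the parameter choices of Corollary 1 for given $m$, $n$, $\alpha$ and $L$, and evaluates them in the two regimes discussed above.

```python
import numpy as np

def destress_parameters(m, n, alpha, L):
    """Parameter choices from Corollary 1, assuming eta = 1 / (640 L) and
    Chebyshev-accelerated mixing steps; log is the natural logarithm here."""
    S = int(np.ceil(np.sqrt(m * n)))                  # inner-loop length
    b = int(np.ceil(np.sqrt(m / n)))                  # mini-batch size
    p = np.sqrt(m / n) / b                            # activation probability
    K_out = int(np.ceil(np.log(np.sqrt(n * p * b) + 1) / np.sqrt(1 - alpha)))
    K_in = int(np.ceil(np.log(2 / p) / np.sqrt(1 - alpha)))
    return dict(S=S, b=b, p=p, K_out=K_out, K_in=K_in, eta=1 / (640 * L),
                effective_batch=p * b)                # pb = sqrt(m/n)

# Two regimes: large local sample size (m >= n) and many agents (n > m).
print(destress_parameters(m=10000, n=20, alpha=0.9, L=1.0))
print(destress_parameters(m=10, n=1000, alpha=0.9, L=1.0))
```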

4 Numerical Experiments

This section provides numerical experiments on real datasets to evaluate our proposed algorithm DESTRESS, with comparisons against two existing baselines: DSGD [NO09, LZZ+17] and GT-SARAH [XKK20b]. To allow for reproducibility, all code can be found at

https://github.com/liboyue/Network-Distributed-Algorithm.

For all experiments, we set the number of agents $n=20$, and split the dataset uniformly at random across the agents. In addition, since $m\gg n$ in all experiments, we set $p=1$ for simplicity. We run each experiment on three communication graphs with the same data assignment and starting point: the Erdős-Rényi graph (with connectivity probability $0.3$), the grid graph, and the path graph. The mixing matrices are chosen as the symmetric fastest distributed linear averaging (FDLA) matrices [XB04] generated according to the different graph topologies, and the extra mixing steps are implemented by Chebyshev acceleration [AS14] to save communication as described earlier. To ensure convergence, DSGD adopts a diminishing step size schedule. All parameters are tuned manually for the best performance. We defer a detailed account of the baseline algorithms as well as the parameter choices to Appendix A.

4.1 Regularized logistic regression

To begin with, we employ logistic regression with nonconvex regularization to solve a binary classification problem on the Gisette dataset (available at https://archive.ics.uci.edu/ml/datasets/Gisette). We split the Gisette dataset across $n=20$ agents, where each agent receives $m=300$ training samples of dimension $d=5000$. The sample loss function is given as

$$\ell(\bm{x};\{\bm{f},l\})=-l\log\Big(\frac{1}{1+\exp(\bm{x}^{\top}\bm{f})}\Big)+(1-l)\log\Big(\frac{\exp(\bm{x}^{\top}\bm{f})}{1+\exp(\bm{x}^{\top}\bm{f})}\Big)+\lambda\sum_{i=1}^{d}\frac{x_{i}^{2}}{1+x_{i}^{2}},$$

where $\{\bm{f},l\}$ represents a training tuple, $\bm{f}\in\mathbb{R}^{d}$ is the feature vector, $l\in\{0,1\}$ is the label, and $\lambda$ is the regularization parameter. For this experiment, we set $\lambda=0.01$.
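A minimal sketch of this sample loss (a direct transcription of the expression above, using a numerically stable log-sum-exp; names are illustrative) is given below; the gradient of the nonconvex regularizer is also included, since its smoothness is what keeps Assumption 1 intact.

```python
import numpy as np

def regularized_logistic_loss(x, f, l, lam=0.01):
    """Sample loss from Section 4.1, transcribed as stated: two logistic terms
    in z = x^T f plus the smooth nonconvex regularizer lam * sum x_i^2 / (1 + x_i^2)."""
    z = f @ x
    log1pexp = np.logaddexp(0.0, z)  # stable log(1 + e^z)
    # -l*log(1/(1+e^z)) = l*log(1+e^z);  (1-l)*log(e^z/(1+e^z)) = (1-l)*(z - log(1+e^z))
    loss = l * log1pexp + (1 - l) * (z - log1pexp)
    return loss + lam * np.sum(x**2 / (1.0 + x**2))

def regularizer_grad(x, lam=0.01):
    """Gradient of the nonconvex regularizer: lam * 2 x_i / (1 + x_i^2)^2."""
    return lam * 2.0 * x / (1.0 + x**2) ** 2

# Toy evaluation at the Gisette dimension d = 5000.
rng = np.random.default_rng(5)
d = 5000
x, f, l = 0.01 * rng.normal(size=d), rng.normal(size=d), 1
print(regularized_logistic_loss(x, f, l), np.linalg.norm(regularizer_grad(x)))
```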

[Figure 1: three panels, (a) Erdős-Rényi graph, (b) Grid graph, (c) Path graph.]

Figure 1: The training loss and testing accuracy with respect to the number of communication rounds (left two panels) and gradient evaluations (right two panels) for DSGD, GT-SARAH and DESTRESS when training a regularized logistic regression model on the Gisette dataset. Due to the initial full-gradient computation, the gradient evaluations of DESTRESS and GT-SARAH do not start from 0.

Figure 1 shows the training loss and testing accuracy for all algorithms. DESTRESS significantly outperforms the other algorithms in terms of both communication and computation. It is worth noting that DSGD converges very fast at the beginning of training, but cannot sustain the progress due to its diminishing step size schedule. On the contrary, the variance-reduced algorithms converge with a constant step size, and hence achieve better overall convergence. Moreover, thanks to the refined gradient estimation and information mixing designs, DESTRESS can afford a larger step size than GT-SARAH, which leads to the fastest convergence and the best overall performance. In addition, a larger number of extra mixing steps leads to better performance as the graph topology becomes less connected.

[Figure 2: three panels, (a) Erdős-Rényi graph, (b) Grid graph, (c) Path graph.]

Figure 2: The training loss and testing accuracy with respect to the number of communication rounds (left two panels) and gradient evaluations (right two panels) for DSGD, GT-SARAH and DESTRESS when training a one-hidden-layer neural network on the MNIST dataset. Due to the initial full-gradient computation, the gradient evaluations of DESTRESS and GT-SARAH do not start from 0.

4.2 Neural network training

Next, we compare the performance of DESTRESS against DSGD and GT-SARAH for training a one-hidden-layer neural network with 64 hidden neurons and sigmoid activations to classify the MNIST dataset [Den12]. We evenly split the 60,000 training samples across the 20 agents at random. Figure 2 plots the training loss and testing accuracy against the number of communication rounds and gradient evaluations for all algorithms. Again, DESTRESS significantly outperforms the other algorithms in terms of computation and communication costs due to the larger step size and extra mixing, which validates our theoretical analysis.

5 Conclusions

In this paper, we proposed DESTRESS for decentralized nonconvex finite-sum optimization, and presented both its theoretical convergence guarantees and its empirical performance on real-world datasets. In sum, DESTRESS matches the optimal IFO complexity of centralized SARAH-type methods for finding first-order stationary points, and improves both computation and communication complexities over a broad range of parameter regimes compared with existing approaches. A natural and important extension of this paper is to generalize and develop convergence guarantees of DESTRESS for finding second-order stationary points, which we leave to future work.

Acknowledgements

This work is supported in part by ONR N00014-19-1-2404, by AFRL under FA8750-20-2-0504, and by NSF under CCF-1901199 and CCF-2007911. The authors thank Ran Xin for helpful discussions.

References

  • [ALBR19] M. Assran, N. Loizou, N. Ballas, and M. Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pages 344–353. PMLR, 2019.
  • [AS14] M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
  • [AZH16] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In International conference on machine learning, pages 699–707. PMLR, 2016.
  • [BBKW19] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei. Balancing communication and computation in distributed optimization. IEEE Transactions on Automatic Control, 64(8):3141–3155, 2019.
  • [BBW21] A. S. Berahas, R. Bollapragada, and E. Wei. On the convergence of nested decentralized gradient methods with multiple consensus and gradient steps. IEEE Transactions on Signal Processing, 2021.
  • [CZC+20] S. Cen, H. Zhang, Y. Chi, W. Chen, and T.-Y. Liu. Convergence of distributed stochastic variance reduced methods without sampling extra data. IEEE Transactions on Signal Processing, 68:3976–3989, 2020.
  • [DBLJ14] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
  • [Den12] L. Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [DLS16] P. Di Lorenzo and G. Scutari. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.
  • [FLLZ18] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
  • [HAD+20] A. Hashemi, A. Acharya, R. Das, H. Vikalo, S. Sanghavi, and I. Dhillon. On the benefits of multiple gossip steps in communication-constrained decentralized optimization. arXiv preprint arXiv:2011.10643, 2020.
  • [IW21] C. Iakovidou and E. Wei. S-NEAR-DGD: A flexible distributed stochastic gradient method for inexact communication. arXiv preprint arXiv:2102.00121, 2021.
  • [JZ13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • [LBZR21] Z. Li, H. Bao, X. Zhang, and P. Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pages 6286–6295. PMLR, 2021. arXiv:2008.10898.
  • [LCCC20] B. Li, S. Cen, Y. Chen, and Y. Chi. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. Journal of Machine Learning Research, 21(180):1–51, 2020.
  • [Li19] Z. Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523–1533, 2019.
  • [Li21] Z. Li. A short note of PAGE: Optimal convergence rates for nonconvex optimization. arXiv preprint arXiv:2106.09663, 2021.
  • [LJCJ17] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [LL18] Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, pages 5569–5579, 2018.
  • [LLMY17] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.
  • [LR21] Z. Li and P. Richtárik. ZeroSARAH: Efficient nonconvex finite-sum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447, 2021.
  • [LZZ+17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • [NLST17] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
  • [NO09] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [NOR18] A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
  • [NOS17] A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
  • [NRB20] M. Nokleby, H. Raja, and W. U. Bajwa. Scaling-up distributed processing of data streams for machine learning. Proceedings of the IEEE, 108(11):1984–2012, 2020.
  • [NvDP+19] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sum smooth optimization with SARAH. arXiv preprint arXiv:1901.07648, 2019.
  • [PLW20] T. Pan, J. Liu, and J. Wang. D-SPIDER-SFO: A decentralized optimization algorithm with faster convergence rate for nonconvex problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1619–1626, 2020.
  • [QL18] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2018.
  • [RHS+16] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
  • [RSPS16] S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016.
  • [SLH20] H. Sun, S. Lu, and M. Hong. Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In International Conference on Machine Learning, pages 9217–9228. PMLR, 2020.
  • [SLWY15] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • [TLY+18] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. $D^{2}$: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4848–4856. PMLR, 2018.
  • [WJZ+19] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pages 2406–2416, 2019.
  • [XB04] L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Systems and Control Letters, 53(1):65–78, 2004.
  • [XKK20a] R. Xin, S. Kar, and U. A. Khan. Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence. IEEE Signal Processing Magazine, 37(3):102–113, 2020.
  • [XKK20b] R. Xin, U. A. Khan, and S. Kar. Fast decentralized non-convex finite-sum optimization with recursive variance reduction. arXiv preprint arXiv:2008.07428, 2020.
  • [XKK20c] R. Xin, U. A. Khan, and S. Kar. A fast randomized incremental gradient method for decentralized non-convex optimization. arXiv preprint arXiv:2011.03853, 2020.
  • [XPNK20] R. Xin, S. Pu, A. Nedić, and U. A. Khan. A general framework for decentralized optimization with first-order methods. Proceedings of the IEEE, 108(11):1869–1889, 2020.
  • [ZM10] M. Zhu and S. Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.
  • [ZXG18] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduction for nonconvex optimization. Advances in Neural Information Processing Systems, 31:3921–3932, 2018.
  • [ZY19] J. Zhang and K. You. Decentralized stochastic gradient tracking for non-convex empirical risk minimization. arXiv preprint arXiv:1909.02712, 2019.

Appendix A Experiment details

For completeness, we list two baseline algorithms, DSGD [NO09, LZZ+17] (cf. Algorithm 2) and GT-SARAH [XKK20b] (cf. Algorithm 3), which are compared numerically against the proposed DESTRESS algorithm in Section 4. Furthermore, the detailed hyperparameter settings for the experiments in Section 4.1 and Section 4.2 are listed in Table 3 and Table 4, respectively.

Algorithm 2 Decentralized stochastic gradient descent (DSGD)
1:  input: initial parameter $\overline{\bm{x}}^{(0)}$, step size schedule $\eta_{t}$, number of iterations $T$.
2:  initialization: set $\bm{x}_{i}^{(0)}=\overline{\bm{x}}^{(0)}$.
3:  for $t=1,\ldots,T$ do
4:     Each agent $i$ samples a data point $\bm{z}_{i}^{(t)}$ from $\mathcal{M}_{i}$ uniformly at random and computes the stochastic gradient
$$\bm{g}_{i}^{(t)}=\nabla\ell(\bm{x}_{i}^{(t)};\bm{z}_{i}^{(t)}).$$
5:     Update via local communication: $\bm{x}^{(t+1)}=(\bm{W}\otimes\bm{I}_{d})(\bm{x}^{(t)}-\eta_{t}\bm{g}^{(t)})$.
6:  end for
7:  output: $\bm{x}^{\mathsf{output}}=\overline{\bm{x}}^{(T)}$.
Algorithm 3 GT-SARAH
1:  input: initial parameter $\overline{\bm{x}}^{(0)}$, step size $\eta$, batch size $b$, number of iterations $T$, inner loop length $q$.
2:  initialization: set $\bm{v}^{(0)}=\bm{y}^{(0)}=\nabla F(\bm{x}^{(0)})$.
3:  for $t=1,\ldots,T$ do
4:     Update via local communication: $\bm{x}^{(t)}=(\bm{W}\otimes\bm{I}_{d})\bm{x}^{(t-1)}-\eta\bm{y}^{(t-1)}$.
5:     if $\mathrm{mod}(t,q)=0$ then
6:        $\bm{v}^{(t)}=\nabla F(\bm{x}^{(t)})$.
7:     else
8:        Each agent $i$ samples a mini-batch $\mathcal{Z}_{i}^{(t)}$ of size $b$ from $\mathcal{M}_{i}$ uniformly at random, and then performs the following update:
$$\bm{v}_{i}^{(t)}=\frac{1}{b}\sum_{\bm{z}_{i}\in\mathcal{Z}_{i}^{(t)}}\big(\nabla\ell(\bm{x}_{i}^{(t)};\bm{z}_{i})-\nabla\ell(\bm{x}_{i}^{(t-1)};\bm{z}_{i})\big)+\bm{v}_{i}^{(t-1)}.$$
9:     end if
10:     Update via local communication: $\bm{y}^{(t)}=(\bm{W}\otimes\bm{I}_{d})\bm{y}^{(t-1)}+\bm{v}^{(t)}-\bm{v}^{(t-1)}$.
11:  end for
12:  output: $\bm{x}^{\mathsf{output}}=\overline{\bm{x}}^{(T)}$.
| Graph | DSGD $\eta_{0}$ | DESTRESS $\eta$ | $p$ | $K_{\mathsf{in}}$ | $K_{\mathsf{out}}$ | $b$ | $S$ | GT-SARAH $\eta$ | $b$ | $S$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Erdős-Rényi | 1 | 1 | 1 | 2 | 2 | 10 | 10 | 0.1 | 10 | 10 |
| Grid | 1 | 1 | 1 | 2 | 2 | 10 | 10 | 0.01 | 10 | 10 |
| Path | 1 | 1 | 1 | 8 | 8 | 10 | 10 | 0.001 | 10 | 10 |

Table 3: Parameter settings for the experiments on regularized logistic regression in Section 4.1.
| Graph | DSGD $\eta_{0}$ | DESTRESS $\eta$ | $p$ | $K_{\mathsf{in}}$ | $K_{\mathsf{out}}$ | $b$ | $S$ | GT-SARAH $\eta$ | $b$ | $S$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Erdős-Rényi | 1 | 0.1 | 1 | 2 | 2 | 100 | 10 | 0.01 | 100 | 10 |
| Grid | 1 | 0.1 | 1 | 2 | 2 | 100 | 10 | 0.001 | 100 | 10 |
| Path | 1 | 0.1 | 1 | 8 | 8 | 100 | 10 | 0.001 | 100 | 10 |

Table 4: Parameter settings for the experiments on neural network training in Section 4.2.

Appendix B Proof of Theorem 1

For notational simplicity, let

$$\alpha_{\mathsf{in}}=\alpha^{K_{\mathsf{in}}},\qquad\alpha_{\mathsf{out}}=\alpha^{K_{\mathsf{out}}}$$

throughout the proof. In addition, with a slight abuse of notation, we define the global gradient $\nabla f(\bm{x})\in\mathbb{R}^{nd}$ of an $(nd)$-dimensional vector $\bm{x}=\big[\bm{x}_{1}^{\top},\cdots,\bm{x}_{n}^{\top}\big]^{\top}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$, as

$$\nabla f(\bm{x}):=[\nabla f(\bm{x}_{1})^{\top},\cdots,\nabla f(\bm{x}_{n})^{\top}]^{\top}. \qquad (9)$$

The following fact is a straightforward consequence of our assumption on the mixing matrix $\bm{W}$ in Definition 1.

Fact 1.

Let $\bm{x}=\big[\bm{x}_{1}^{\top},\cdots,\bm{x}_{n}^{\top}\big]^{\top}$ and $\overline{\bm{x}}=\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$. For a mixing matrix $\bm{W}\in\mathbb{R}^{n\times n}$ satisfying Definition 1, we have

  1. $\Big(\frac{1}{n}\bm{1}_{n}^{\top}\otimes\bm{I}_{d}\Big)(\bm{W}\otimes\bm{I}_{d})\bm{x}=\Big(\frac{1}{n}\bm{1}_{n}^{\top}\otimes\bm{I}_{d}\Big)\bm{x}=\overline{\bm{x}}$;

  2. $\big(\bm{I}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes\bm{I}_{d}\big)(\bm{W}\otimes\bm{I}_{d})=\big(\bm{W}\otimes\bm{I}_{d}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes\bm{I}_{d}\big)\big(\bm{I}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes\bm{I}_{d}\big)$.

To begin with, we introduce a key lemma that upper bounds the norm of the gradient of the global loss function evaluated at the average of the local estimates over the $n$ agents, in terms of the function value difference at the beginning and the end of the inner loop, the gradient estimation error, and the norm of the gradient estimates.

Lemma 1 (Inner loop induction).

Assume Assumption 1 holds. After $S\geq 1$ inner loops, one has

$$\sum_{s=0}^{S-1}\|\nabla f(\overline{\bm{u}}^{(t),s})\|^{2}_{2}\leq\frac{2}{\eta}\Big(f(\overline{\bm{u}}^{(t),0})-f(\overline{\bm{u}}^{(t),S})\Big)+\sum_{s=0}^{S-1}\big\|\nabla f(\overline{\bm{u}}^{(t),s})-\overline{\bm{v}}^{(t),s}\big\|^{2}_{2}-(1-\eta L)\sum_{s=0}^{S-1}\big\|\overline{\bm{v}}^{(t),s}\big\|^{2}_{2}.$$
Proof of Lemma 1.

The local update rule (6a), combined with Lemma 1, yields

𝒖¯(t),s+1=𝒖¯(t),sη𝒗¯(t),s.{\overline{\bm{u}}}^{(t),s+1}={\overline{\bm{u}}}^{(t),s}-\eta{\overline{\bm{v}}}^{(t),s}.

By Assumption 1, we have

f(𝒖¯(t),s+1)\displaystyle f({\overline{\bm{u}}}^{(t),s+1}) =f(𝒖¯(t),sη𝒗¯(t),s)\displaystyle=f({\overline{\bm{u}}}^{(t),s}-\eta{\overline{\bm{v}}}^{(t),s})
f(𝒖¯(t),s)f(𝒖¯(t),s),η𝒗¯(t),s+L2η𝒗¯(t),s22\displaystyle\leq f({\overline{\bm{u}}}^{(t),s})-\big{\langle}\nabla f({\overline{\bm{u}}}^{(t),s}),\eta{\overline{\bm{v}}}^{(t),s}\big{\rangle}+\frac{L}{2}\big{\|}\eta{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}
=f(𝒖¯(t),s)η2f(𝒖¯(t),s)22+η2f(𝒖¯(t),s)𝒗¯(t),s22(η2η2L2)𝒗¯(t),s22,\displaystyle=f({\overline{\bm{u}}}^{(t),s})-\frac{\eta}{2}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})\big{\|}^{2}_{2}+\frac{\eta}{2}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}-\Big{(}\frac{\eta}{2}-\frac{\eta^{2}L}{2}\Big{)}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}, (10)

where the last equality is obtained by applying 𝒂,𝒃=12(𝒂𝒃22𝒂22𝒃22)-\langle{\bm{{a}}},{\bm{{b}}}\rangle=\frac{1}{2}\big{(}\|{\bm{{a}}}-{\bm{{b}}}\|^{2}_{2}-\|{\bm{{a}}}\|^{2}_{2}-\|{\bm{{b}}}\|^{2}_{2}\big{)}. Summing over s=0,,S1s=0,\ldots,S-1 finishes the proof. ∎
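Explicitly, the identity is applied with {\bm{a}}=\nabla f({\overline{\bm{u}}}^{(t),s}) and {\bm{b}}=\eta{\overline{\bm{v}}}^{(t),s}; rearranging (10) and multiplying by 2/\eta gives the one-step bound

\big\|\nabla f({\overline{\bm{u}}}^{(t),s})\big\|_{2}^{2}\leq\frac{2}{\eta}\Big(f({\overline{\bm{u}}}^{(t),s})-f({\overline{\bm{u}}}^{(t),s+1})\Big)+\big\|\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big\|_{2}^{2}-(1-\eta L)\big\|{\overline{\bm{v}}}^{(t),s}\big\|_{2}^{2},

whose sum over s=0,\ldots,S-1 telescopes the function values to f({\overline{\bm{u}}}^{(t),0})-f({\overline{\bm{u}}}^{(t),S}) and yields the claimed bound.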

Because the output 𝒙𝗈𝗎𝗍𝗉𝗎𝗍{\bm{{x}}}^{\mathsf{output}} is chosen from {𝒖i(t),s1|i[n],t[T],s[S]}\big{\{}{\bm{{u}}}_{i}^{(t),s-1}|i\in[n],t\in[T],s\in[S]\big{\}} uniformly at random, we can compute the expected squared gradient norm at the output as follows:

nTS𝔼f(𝒙𝗈𝗎𝗍𝗉𝗎𝗍)22\displaystyle nTS{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{x}}}^{\mathsf{output}})\big{\|}_{2}^{2} =i=1nt=1Ts=0S1𝔼f(𝒖i(t),s)22\displaystyle=\sum_{i=1}^{n}\sum_{t=1}^{T}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{u}}}_{i}^{(t),s})\big{\|}_{2}^{2}
=(i)t=1Ts=0S1𝔼f(𝒖(t),s)22\displaystyle\overset{\text{(i)}}{=}\sum_{t=1}^{T}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{u}}}^{(t),s})\big{\|}_{2}^{2}
=t=1Ts=0S1𝔼f(𝒖(t),s)f(𝟏n𝒖¯(t),s)+f(𝟏n𝒖¯(t),s)22\displaystyle=\sum_{t=1}^{T}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{u}}}^{(t),s})-\nabla f(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})+\nabla f(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})\big{\|}_{2}^{2}
(ii)2t=1Ts=0S1(𝔼f(𝒖(t),s)f(𝟏n𝒖¯(t),s)22+𝔼f(𝟏n𝒖¯(t),s)22)\displaystyle\overset{\text{(ii)}}{\leq}2\sum_{t=1}^{T}\sum_{s=0}^{S-1}\Big{(}{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{u}}}^{(t),s})-\nabla f(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})\big{\|}_{2}^{2}+{\mathbb{\bm{E}}}\big{\|}\nabla f(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})\big{\|}_{2}^{2}\Big{)}
(iii)2t=0Ts=0S1(L2𝔼𝒖(t),s𝟏n𝒖¯(t),s22+n𝔼f(𝒖¯(t),s)22),\displaystyle\overset{\text{(iii)}}{\leq}2\sum_{t=0}^{T}\sum_{s=0}^{S-1}\Big{(}L^{2}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+n{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})\big{\|}_{2}^{2}\Big{)}, (11)

where (i) follows from the change of notation in (9), (ii) follows from the Cauchy-Schwarz inequality, and (iii) follows from Assumption 1 together with extending the summation to t=0,,Tt=0,\ldots,T. Then, in view of Lemma 1, (11) can be further bounded by

nTS𝔼f(𝒙𝗈𝗎𝗍𝗉𝗎𝗍)22\displaystyle nTS{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{x}}}^{\mathsf{output}})\big{\|}_{2}^{2} 4nη(𝔼[f(𝒙¯(0))]f)+2L2t=0Ts=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22\displaystyle\leq\frac{4n}{\eta}\Big{(}{\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{*}\Big{)}+2L^{2}\sum_{t=0}^{T}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}
+2nt=0Ts=0S1(𝔼f(𝒖¯(t),s)𝒗¯(t),s22(1ηL)𝔼𝒗¯(t),s22),\displaystyle\qquad+2n\sum_{t=0}^{T}\sum_{s=0}^{S-1}\Big{(}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}-(1-\eta L){\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}\Big{)}, (12)

where we use 𝒖¯(t),0=𝒙¯(t){\overline{\bm{u}}}^{(t),0}={\overline{\bm{x}}}^{(t)} and f(𝒖¯(t),S)ff({\overline{\bm{u}}}^{(t),S})\geq f^{*}, so that the function value differences from Lemma 1 telescope across the outer iterations, as made explicit below.
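In more detail, since each inner loop starts from the endpoint of the previous one, i.e., {\overline{\bm{u}}}^{(t),0}={\overline{\bm{x}}}^{(t)}={\overline{\bm{u}}}^{(t-1),S}, the function value differences produced by Lemma 1 telescope over the outer index range t=0,\ldots,T used in (12) (with the convention {\overline{\bm{u}}}^{(0),0}={\overline{\bm{x}}}^{(0)}):

\sum_{t=0}^{T}\Big(f({\overline{\bm{u}}}^{(t),0})-f({\overline{\bm{u}}}^{(t),S})\Big)=f({\overline{\bm{x}}}^{(0)})-f({\overline{\bm{u}}}^{(T),S})\leq f({\overline{\bm{x}}}^{(0)})-f^{*},

which, multiplied by the prefactor 2n\cdot\frac{2}{\eta} coming from (11) and Lemma 1, gives the first term of (12).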

Next, we present Lemmas 2 and 3 to bound the double sums in (12); their proofs can be found in Appendix D and Appendix E, respectively.

Lemma 2 (Sum of inner loop errors).

Assume that all conditions in Theorem 1 hold. For all t>0t>0, the summation of inner loop errors can be bounded as

2L2s=0S1𝔼\displaystyle 2L^{2}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|} 𝒖(t),s𝟏n𝒖¯(t),s22+2ns=0S1𝔼f(𝒖¯(t),s)𝒗¯(t),s22\displaystyle{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+2n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}
64L21α𝗂𝗇(Snpb+1)𝔼𝒙(t)𝟏n𝒙¯(t)22+2α𝗂𝗇2𝔼𝒔(t)𝟏n𝒔¯(t)22+2n25s=1S𝔼𝒗¯(t),s122.\displaystyle\leq\frac{64L^{2}}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}+2\alpha_{\mathsf{in}}^{2}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}+\frac{2n}{25}\sum_{s=1}^{S}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}^{2}_{2}.
Lemma 3 (Sum of outer loop gradient estimation error and consensus error).

Assume that all conditions in Theorem 1 hold. Then

64L21α𝗂𝗇(Snpb+1)t=0T𝔼𝒙(t)𝟏n𝒙¯(t)22\displaystyle\frac{64L^{2}}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}\sum_{t=0}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2} +2α𝗂𝗇2t=0T𝔼𝒔(t)𝟏n𝒔¯(t)2211n25t=1Ts=0S1𝔼𝒗¯(t),s22.\displaystyle+2\alpha_{\mathsf{in}}^{2}\sum_{t=0}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}\leq\frac{11n}{25}\sum_{t=1}^{T}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}.

Using Lemma 2, (12) can be bounded as follows:

nTS𝔼\displaystyle nTS{\mathbb{\bm{E}}}\big{\|} f(𝒙𝗈𝗎𝗍𝗉𝗎𝗍)22<4nη(𝔼[f(𝒙¯(0))]f)2n(2425ηL)t=0Ts=1S𝔼𝒗¯(t),s122\displaystyle\nabla f({\bm{{x}}}^{\mathsf{output}})\big{\|}_{2}^{2}<\frac{4n}{\eta}\Big{(}{\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{*}\Big{)}-2n\Big{(}\frac{24}{25}-\eta L\Big{)}\sum_{t=0}^{T}\sum_{s=1}^{S}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}^{2}_{2}
+64L21α𝗂𝗇(Snpb+1)t=0T𝔼𝒙(t)𝟏n𝒙¯(t)22+2α𝗂𝗇2t=0T𝔼𝒔(t)𝟏n𝒔¯(t)22,\displaystyle+\frac{64L^{2}}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}\sum_{t=0}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}+2\alpha_{\mathsf{in}}^{2}\sum_{t=0}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}, (13)

where we bound the sums of the inner loop errors L2s=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22L^{2}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2} and ns=0S1𝔼f(𝒖¯(t),s)𝒗¯(t),s22n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2} by the consensus errors at the start of each inner loop, 𝔼𝒙(t)𝟏n𝒙¯(t)22{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2} and 𝔼𝒔(t)𝟏n𝒔¯(t)22{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}, together with the sum of the squared norms of the average inner loop gradient estimators, ns=1S𝔼𝒗¯(t),s122n\sum_{s=1}^{S}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}^{2}_{2}.

By Lemma 3, (13) can be further bounded as

nTS𝔼f(𝒙𝗈𝗎𝗍𝗉𝗎𝗍)22\displaystyle nTS{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{x}}}^{\mathsf{output}})\big{\|}_{2}^{2} 4nη(𝔼[f(𝒙¯(0))]f)2n(3750ηL)t=1Ts=0S1𝔼𝒗¯(t),s22\displaystyle\leq\frac{4n}{\eta}\Big{(}{\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{*}\Big{)}-2n\Big{(}\frac{37}{50}-\eta L\Big{)}\sum_{t=1}^{T}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}
<4nη(𝔼[f(𝒙¯(0))]f),\displaystyle<\frac{4n}{\eta}\Big{(}{\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{*}\Big{)},

where the last step drops the second term, which is nonpositive since the step size condition (7) ensures ηL1/10<37/50\eta L\leq 1/10<37/50; this concludes the proof.

Appendix C Proof of Corollary 1

Without loss of generality, we assume n2n\geq 2; otherwise, the problem reduces to the centralized setting with a single agent n=1n=1, for which the bound holds trivially. In the following paragraphs, we first verify that the parameter choices in Corollary 1 satisfy the requirements of Theorem 1, and then derive the resulting IFO and communication complexities.

Step size η\eta.

We first assume α𝗂𝗇p212\alpha_{\mathsf{in}}\leq\frac{p}{2}\leq\frac{1}{2} and α𝗈𝗎𝗍1npb+112\alpha_{\mathsf{out}}\leq\frac{1}{\sqrt{npb}+1}\leq\frac{1}{2}, which will be verified shortly. Under these assumptions, the step size choice meets the requirement in (7), since

(1α𝗂𝗇)3(1α𝗈𝗎𝗍)1+αK𝗂𝗇αK𝗈𝗎𝗍pnb110L(S/(npb)+1)\displaystyle\frac{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}{1+\alpha^{K_{\mathsf{in}}}\alpha^{K_{\mathsf{out}}}\sqrt{pnb}}\cdot\frac{1}{10L\big{(}\sqrt{S/(npb)}+1\big{)}} (1/2)42120L=1640L.\displaystyle\geq\frac{(1/2)^{4}}{2}\cdot\frac{1}{20L}=\frac{1}{640L}.

Mixing steps K𝗂𝗇K_{\mathsf{in}} and K𝗈𝗎𝗍K_{\mathsf{out}}.

Using Chebyshev’s acceleration [AS14] to implement the mixing steps yields an improved mixing rate of α𝖼𝗁𝖾𝖻12(1α)\alpha_{\mathsf{cheb}}\asymp 1-\sqrt{2(1-\alpha)} when the original mixing rate α\alpha is close to 11. Set K𝗂𝗇=log(2/p)1αK_{\mathsf{in}}=\left\lceil\frac{\log(2/p)}{\sqrt{1-\alpha}}\right\rceil and K𝗈𝗎𝗍=log(npb+1)1αK_{\mathsf{out}}=\left\lceil\frac{\log(\sqrt{npb}+1)}{\sqrt{1-\alpha}}\right\rceil. We are now in a position to examine the effective mixing rates α𝗂𝗇=α𝖼𝗁𝖾𝖻K𝗂𝗇\alpha_{\mathsf{in}}=\alpha_{\mathsf{cheb}}^{K_{\mathsf{in}}} and α𝗈𝗎𝗍=α𝖼𝗁𝖾𝖻K𝗈𝗎𝗍\alpha_{\mathsf{out}}=\alpha_{\mathsf{cheb}}^{K_{\mathsf{out}}} as follows:

α𝗈𝗎𝗍=α𝖼𝗁𝖾𝖻K𝗈𝗎𝗍(i)α𝖼𝗁𝖾𝖻log(npb+1)1αα𝖼𝗁𝖾𝖻2log(npb+1)1α𝖼𝗁𝖾𝖻(ii)α𝖼𝗁𝖾𝖻2log(npb+1)logα𝖼𝗁𝖾𝖻<1npb+1(iii)12,\alpha_{\mathsf{out}}=\alpha_{\mathsf{cheb}}^{K_{\mathsf{out}}}\overset{\mathrm{(i)}}{\leq}\alpha_{\mathsf{cheb}}^{\frac{\log(\sqrt{npb}+1)}{\sqrt{1-\alpha}}}\asymp\alpha_{\mathsf{cheb}}^{\frac{\sqrt{2}\log(\sqrt{npb}+1)}{1-\alpha_{\mathsf{cheb}}}}\overset{\mathrm{(ii)}}{\leq}\alpha_{\mathsf{cheb}}^{\frac{\sqrt{2}\log(\sqrt{npb}+1)}{-\log\alpha_{\mathsf{cheb}}}}<\frac{1}{\sqrt{npb}+1}\overset{\mathrm{(iii)}}{\leq}\frac{1}{2},

where (i) follows from K𝗈𝗎𝗍=log(npb+1)1αK_{\mathsf{out}}=\left\lceil\frac{\log(\sqrt{npb}+1)}{\sqrt{1-\alpha}}\right\rceil, (ii) follows from logxx1\log x\leq x-1, x>0\forall x>0, and (iii) follows from n1n\geq 1 and b1b\geq 1. By a similar argument, we have α𝗂𝗇=α𝖼𝗁𝖾𝖻K𝗂𝗇p2\alpha_{\mathsf{in}}=\alpha_{\mathsf{cheb}}^{K_{\mathsf{in}}}\leq\frac{p}{2}.

Complexity.

Plugging the selected parameters into (8) in Theorem 1, we have

𝔼f(𝒙𝗈𝗎𝗍𝗉𝗎𝗍)22\displaystyle{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{x}}}^{\mathsf{output}})\big{\|}_{2}^{2}\leq 4ηTS(𝔼[f(𝒙¯(0))]f)=O(LTmn).\displaystyle\frac{4}{\eta TS}\Big{(}{\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{*}\Big{)}=O\Big{(}\frac{L}{T\sqrt{mn}}\Big{)}.

Consequently, the outer iteration complexity is T=O(1+L(mn)1/2ϵ).T=O\Big{(}1+\frac{L}{(mn)^{1/2}\epsilon}\Big{)}. With this in place, we summarize the communication and IFO complexities as follows:

  • The communication complexity is T(SK𝗂𝗇+K𝗈𝗎𝗍)=O((mn)1/2log(2(n/m)1/2+2)+log((mn)1/4+1)1α(1+L(mn)1/2ϵ))=O(log((n/m)1/2+2)1α((mn)1/2+Lϵ))T\cdot(SK_{\mathsf{in}}+K_{\mathsf{out}})=O\Big{(}\frac{(mn)^{1/2}\log\big{(}2(n/m)^{1/2}+2\big{)}+\log\big{(}(mn)^{1/4}+1\big{)}}{\sqrt{1-\alpha}}\cdot\Big{(}1+\frac{L}{(mn)^{1/2}\epsilon}\Big{)}\Big{)}=O\Big{(}\frac{\log\big{(}(n/m)^{1/2}+2\big{)}}{\sqrt{1-\alpha}}\cdot\big{(}(mn)^{1/2}+\frac{L}{\epsilon}\big{)}\Big{)}, where we use 2/p=2m/nm/n2(m/n+1)m/n=2(n/m+1)2/p=\frac{2\left\lceil\sqrt{m/n}\right\rceil}{\sqrt{m/n}}\leq\frac{2(\sqrt{m/n}+1)}{\sqrt{m/n}}=2(\sqrt{n/m}+1) to bound K𝗂𝗇K_{\mathsf{in}}.

  • The IFO complexity is T(Spb+2m)=O(m+(m/n)1/2Lϵ)T\cdot(Spb+2m)=O\Big{(}m+\frac{(m/n)^{1/2}L}{\epsilon}\Big{)}.

Appendix D Proof of Lemma 2

This section proves Lemma 2. Sections D.1 and D.2 bound the expected inner loop gradient estimation errors and consensus errors by their previous values and the sum of the squared norms of the inner loop gradient estimators; Section D.3 then sets up a linear system to bound the summation of the inner loop errors by their values at the start of each inner loop, which concludes the proof.

D.1 Sum of inner loop gradient estimation errors

To begin with, note that the gradient estimation error at the ss-th inner loop iteration can be written as

𝔼f(𝒖¯(t),s)𝒗¯(t),s22\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}
=𝔼(1n𝟏n𝑰d)(F(𝟏n𝒖¯(t),s)𝒗(t),s)22\displaystyle={\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\nabla F(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\Big{)}\Big{\|}_{2}^{2}
=𝔼(1n𝟏n𝑰d)(F(𝟏n𝒖¯(t),s)F(𝒖(t),s))+(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)22\displaystyle={\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})-\nabla F({\bm{{u}}}^{(t),s})\big{)}+\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}\Big{\|}_{2}^{2}
2𝔼(1n𝟏n𝑰d)(F(𝟏n𝒖¯(t),s)F(𝒖(t),s))22+2𝔼(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)22\displaystyle\leq 2{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F(\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s})-\nabla F({\bm{{u}}}^{(t),s})\big{)}\Big{\|}_{2}^{2}+2{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}\Big{\|}_{2}^{2}
2L2n𝔼𝒖(t),s𝟏n𝒖¯(t),s22+2𝔼(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)22,\displaystyle\leq\frac{2L^{2}}{n}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+2{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}\Big{\|}_{2}^{2}, (14)

where the first equality follows from (4), and the last inequality is due to Assumption 1. To continue, the expectation of the second term in (14) can be bounded as

𝔼(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)22\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\Big{)}\Big{\|}_{2}^{2}
=𝔼(1n𝟏n𝑰d)((F(𝒖(t),s)𝒗(t),s)(F(𝒖(t),s1)𝒗(t),s1)+(F(𝒖(t),s1)𝒗(t),s1))22\displaystyle={\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}-\big{(}\nabla F({\bm{{u}}}^{(t),s-1})-{\bm{{v}}}^{(t),s-1}\big{)}+\big{(}\nabla F({\bm{{u}}}^{(t),s-1})-{\bm{{v}}}^{(t),s-1}\big{)}\Big{)}\Big{\|}_{2}^{2}
=(i)𝔼(1n𝟏n𝑰d)((F(𝒖(t),s)𝒗(t),s)(F(𝒖(t),s1)𝒗(t),s1))22\displaystyle\overset{\text{(i)}}{=}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}-\big{(}\nabla F({\bm{{u}}}^{(t),s-1})-{\bm{{v}}}^{(t),s-1}\big{)}\Big{)}\Big{\|}_{2}^{2}
+𝔼(1n𝟏n𝑰d)(F(𝒖(t),s1)𝒗(t),s1)22\displaystyle\qquad+{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s-1})-{\bm{{v}}}^{(t),s-1}\big{)}\Big{\|}_{2}^{2}
=(ii)k=1s𝔼(1n𝟏n𝑰d)((F(𝒖(t),k)𝒗(t),k)(F(𝒖(t),k1)𝒗(t),k1))22\displaystyle\overset{\text{(ii)}}{=}\sum_{k=1}^{s}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\big{(}\nabla F({\bm{{u}}}^{(t),k})-{\bm{{v}}}^{(t),k}\big{)}-\big{(}\nabla F({\bm{{u}}}^{(t),k-1})-{\bm{{v}}}^{(t),k-1}\big{)}\Big{)}\Big{\|}_{2}^{2}
+𝔼(1n𝟏n𝑰d)(F(𝒖(t),0)𝒗(t),0)22\displaystyle\qquad+{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),0})-{\bm{{v}}}^{(t),0}\big{)}\Big{\|}_{2}^{2}
=(iii)k=1s𝔼(1n𝟏n𝑰d)((F(𝒖(t),k)𝒗(t),k)(F(𝒖(t),k1)𝒗(t),k1))22.\displaystyle\overset{\text{(iii)}}{=}\sum_{k=1}^{s}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\big{(}\nabla F({\bm{{u}}}^{(t),k})-{\bm{{v}}}^{(t),k}\big{)}-\big{(}\nabla F({\bm{{u}}}^{(t),k-1})-{\bm{{v}}}^{(t),k-1}\big{)}\Big{)}\Big{\|}_{2}^{2}. (15)

Here, (i) follows from taking the expectation with respect to the activation indicators λi(t),s\lambda_{i}^{(t),s} and the random samples 𝒵(t),s\mathcal{Z}^{(t),s}, conditioned on 𝒖(t),s1{\bm{{u}}}^{(t),s-1} and 𝒗(t),s1{\bm{{v}}}^{(t),s-1}:

𝔼[(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)|𝒖(t),s1,𝒗(t),s1]\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}\left[\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}\Big{|}{\bm{{u}}}^{(t),s-1},{\bm{{v}}}^{(t),s-1}\right]
=1ni=1nfi(𝒖i(t),s)𝔼[1npbiλi(t),s𝒛i𝒵i(t),s((𝒖i(t),s;𝒛i)(𝒖i(t),s1;𝒛i))|𝒖(t),s1,𝒗(t),s1]𝒗¯(t),s1\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{u}}}_{i}^{(t),s})-{\mathbb{\bm{E}}}\Bigg{[}\frac{1}{npb}\sum_{i}\lambda_{i}^{(t),s}\sum_{{\bm{{z}}}_{i}\in{\mathcal{{Z}}}_{i}^{(t),s}}\Big{(}\nabla\ell({\bm{{u}}}_{i}^{(t),s};{\bm{{z}}}_{i})-\nabla\ell({\bm{{u}}}_{i}^{(t),s-1};{\bm{{z}}}_{i})\Big{)}\Big{|}{\bm{{u}}}^{(t),s-1},{\bm{{v}}}^{(t),s-1}\Bigg{]}-{\overline{\bm{v}}}^{(t),s-1}
=1ni=1nfi(𝒖i(t),s)1npi𝔼[λi(t),s](fi(𝒖i(t),s)fi(𝒖i(t),s1))𝒗¯(t),s1\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{u}}}_{i}^{(t),s})-\frac{1}{np}\sum_{i}{\mathbb{\bm{E}}}\big{[}\lambda_{i}^{(t),s}\big{]}\Big{(}\nabla f_{i}({\bm{{u}}}_{i}^{(t),s})-\nabla f_{i}({\bm{{u}}}_{i}^{(t),s-1})\Big{)}-{\overline{\bm{v}}}^{(t),s-1}
=1ni=1nfi(𝒖i(t),s1)𝒗¯(t),s1\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{u}}}_{i}^{(t),s-1})-{\overline{\bm{v}}}^{(t),s-1}
=(1n𝟏n𝑰d)(F(𝒖(t),s1)𝒗(t),s1),\displaystyle=\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s-1})-{\bm{{v}}}^{(t),s-1}\big{)}, (16)

(ii) follows by recursively applying the relation obtained from (i); and (iii) follows from the property of gradient tracking, i.e.

(1n𝟏n𝑰d)F(𝒖(t),0)\displaystyle\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\nabla F({\bm{{u}}}^{(t),0}) =1ni=1nfi(𝒖i(t),0)=1ni=1nfi(𝒙i(t))=𝒔¯(t)=𝒗¯(t),0,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{u}}}_{i}^{(t),0})=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})={\overline{\bm{s}}}^{(t)}={\overline{\bm{v}}}^{(t),0}, (17)

which leads to (1n𝟏n𝑰d)(F(𝒖(t),0)𝒗(t),0)=𝟎\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),0})-{\bm{{v}}}^{(t),0}\big{)}=\bm{0}.
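The middle equality in (17), namely {\overline{\bm{s}}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(t)}), is the usual gradient tracking invariant. As a brief sketch, assuming the standard initialization {\overline{\bm{s}}}^{(0)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(0)}), averaging the update rule (5) via Fact 1 gives

{\overline{\bm{s}}}^{(t)}={\overline{\bm{s}}}^{(t-1)}+\frac{1}{n}\sum_{i=1}^{n}\Big(\nabla f_{i}({\bm{{x}}}_{i}^{(t)})-\nabla f_{i}({\bm{{x}}}_{i}^{(t-1)})\Big)=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(t)}),

where the second equality applies the induction hypothesis {\overline{\bm{s}}}^{(t-1)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(t-1)}).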

We now continue to bound each term in (15), which can be viewed as the variance of the stochastic gradient, as

𝔼(1n𝟏n𝑰d)((F(𝒖(t),s)𝒗(t),s)(F(𝒖(t),s1)𝒗(t),s1))22\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}-\big{(}\nabla F({\bm{{u}}}^{(t),s-1})-{\bm{{v}}}^{(t),s-1}\big{)}\Big{)}\Big{\|}_{2}^{2}
=(i)𝔼1nbi=1n𝒛i𝒵i(t),s((fi(𝒖i(t),s)fi(𝒖i(t),s1))λi(t),sp((𝒖i(t),s;𝒛i)(𝒖i(t),s1;𝒛i)))22\displaystyle\overset{\text{(i)}}{=}{\mathbb{\bm{E}}}\Bigg{\|}\frac{1}{nb}\sum_{i=1}^{n}\sum_{{\bm{{z}}}_{i}\in{\mathcal{{Z}}}_{i}^{(t),s}}\Big{(}\big{(}\nabla f_{i}({\bm{{u}}}_{i}^{(t),s})-\nabla f_{i}({\bm{{u}}}_{i}^{(t),s-1})\big{)}-\frac{\lambda_{i}^{(t),s}}{p}\big{(}\nabla\ell({\bm{{u}}}_{i}^{(t),s};{\bm{{z}}}_{i})-\nabla\ell({\bm{{u}}}_{i}^{(t),s-1};{\bm{{z}}}_{i})\big{)}\Big{)}\Bigg{\|}_{2}^{2}
=(ii)1n2b2i=1n𝒛i𝒵i(t),s𝔼(fi(𝒖i(t),s)fi(𝒖i(t),s1))λi(t),sp((𝒖i(t),s;𝒛i)(𝒖i(t),s1;𝒛i))22\displaystyle\overset{\text{(ii)}}{=}\frac{1}{n^{2}b^{2}}\sum_{i=1}^{n}\sum_{{\bm{{z}}}_{i}\in{\mathcal{{Z}}}_{i}^{(t),s}}{\mathbb{\bm{E}}}\Big{\|}\big{(}\nabla f_{i}({\bm{{u}}}_{i}^{(t),s})-\nabla f_{i}({\bm{{u}}}_{i}^{(t),s-1})\big{)}-\frac{\lambda_{i}^{(t),s}}{p}\big{(}\nabla\ell({\bm{{u}}}_{i}^{(t),s};{\bm{{z}}}_{i})-\nabla\ell({\bm{{u}}}_{i}^{(t),s-1};{\bm{{z}}}_{i})\big{)}\Big{\|}_{2}^{2}
=(iii)1n2p2b2i=1n𝒛i𝒵i(t),s𝔼[(λi(t),s)2]𝔼(𝒖i(t),s;𝒛i)(𝒖i(t),s1;𝒛i)221n2b𝔼F(𝒖(t),s)F(𝒖(t),s1)22\displaystyle\overset{\text{(iii)}}{=}\frac{1}{n^{2}p^{2}b^{2}}\sum_{i=1}^{n}\sum_{{\bm{{z}}}_{i}\in{\mathcal{{Z}}}_{i}^{(t),s}}{\mathbb{\bm{E}}}\Big{[}\big{(}\lambda_{i}^{(t),s}\big{)}^{2}\Big{]}{\mathbb{\bm{E}}}\big{\|}\nabla\ell({\bm{{u}}}_{i}^{(t),s};{\bm{{z}}}_{i})-\nabla\ell({\bm{{u}}}_{i}^{(t),s-1};{\bm{{z}}}_{i})\big{\|}_{2}^{2}-\frac{1}{n^{2}b}{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{u}}}^{(t),s})-\nabla F({\bm{{u}}}^{(t),s-1})\big{\|}_{2}^{2}
L2n2pb𝔼𝒖(t),s𝒖(t),s122,\displaystyle\leq\frac{L^{2}}{n^{2}pb}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-{\bm{{u}}}^{(t),s-1}\big{\|}_{2}^{2}, (18)

where (i) follows from the update rules (6b) and (6c), (ii) follows from the independence of the samples and 𝔼[λi(t),s]=p{\mathbb{\bm{E}}}\big{[}\lambda_{i}^{(t),s}\big{]}=p, (iii) follows from an argument similar to (16), and the last inequality follows from Assumption 1 and 𝔼[(λi(t),s)2]=p{\mathbb{\bm{E}}}\Big{[}\big{(}\lambda_{i}^{(t),s}\big{)}^{2}\Big{]}=p.
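To spell out the last step: the second (subtracted) term of (iii) is nonpositive, while {\mathbb{\bm{E}}}\big[(\lambda_{i}^{(t),s})^{2}\big]=p and the per-sample smoothness implied by Assumption 1 bound the first term of (iii) by

\frac{1}{n^{2}p^{2}b^{2}}\sum_{i=1}^{n}\sum_{{\bm{z}}_{i}\in{\mathcal{Z}}_{i}^{(t),s}}p\cdot L^{2}\,{\mathbb{\bm{E}}}\big\|{\bm{u}}_{i}^{(t),s}-{\bm{u}}_{i}^{(t),s-1}\big\|_{2}^{2}=\frac{L^{2}}{n^{2}pb}\sum_{i=1}^{n}{\mathbb{\bm{E}}}\big\|{\bm{u}}_{i}^{(t),s}-{\bm{u}}_{i}^{(t),s-1}\big\|_{2}^{2}=\frac{L^{2}}{n^{2}pb}{\mathbb{\bm{E}}}\big\|{\bm{u}}^{(t),s}-{\bm{u}}^{(t),s-1}\big\|_{2}^{2},

where the first equality uses |{\mathcal{Z}}_{i}^{(t),s}|=b.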

In view of (6a), the difference between inner loop variables in (18) can be bounded deterministically as

𝒖(t),s𝒖(t),s122\displaystyle~{}~{}~{}~{}\big{\|}{\bm{{u}}}^{(t),s}-{\bm{{u}}}^{(t),s-1}\big{\|}_{2}^{2}
=(𝑾𝗂𝗇𝑰d)(𝒖(t),s1η𝒗(t),s1)𝒖(t),s122\displaystyle=\big{\|}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})({\bm{{u}}}^{(t),s-1}-\eta{\bm{{v}}}^{(t),s-1})-{\bm{{u}}}^{(t),s-1}\big{\|}_{2}^{2}
=(i)((𝑾𝗂𝗇𝑰d)𝑰nd)(𝒖(t),s1𝟏n𝒖¯(t),s1)\displaystyle\overset{\text{(i)}}{=}\Big{\|}\Big{(}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-{\bm{{I}}}_{nd}\Big{)}({\bm{{u}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1})
η((𝑾𝗂𝗇𝑰d)(1n𝟏n𝟏n)𝑰d)(𝒗(t),s1𝟏n𝒗¯(t),s1)η𝟏n𝒗¯(t),s122\displaystyle\qquad-\eta\Big{(}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1})-\eta\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\Big{\|}_{2}^{2}
=(ii)((𝑾𝗂𝗇𝑰d)𝑰nd)(𝒖(t),s1𝟏n𝒖¯(t),s1)\displaystyle\overset{\text{(ii)}}{=}\Big{\|}\Big{(}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-{\bm{{I}}}_{nd}\Big{)}({\bm{{u}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1})
η((𝑾𝗂𝗇𝑰d)(1n𝟏n𝟏n)𝑰d)(𝒗(t),s1𝟏n𝒗¯(t),s1)22+η2n𝒗¯(t),s122\displaystyle\qquad-\eta\Big{(}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1})\Big{\|}_{2}^{2}+\eta^{2}n\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}
8𝒖(t),s1𝟏n𝒖¯(t),s122+2α𝗂𝗇2η2𝒗(t),s1𝟏n𝒗¯(t),s122+η2n𝒗¯(t),s122,\displaystyle\leq 8\big{\|}{\bm{{u}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1}\|_{2}^{2}+2\alpha_{\mathsf{in}}^{2}\eta^{2}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}+\eta^{2}n\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}, (19)

where (i) and (ii) follow from ((𝑾𝗂𝗇𝑰d)𝑰nd)(𝟏n𝒙¯)=𝟎\big{(}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-{\bm{{I}}}_{nd}\big{)}(\bm{1}_{n}\otimes{\overline{\bm{x}}})=\bm{0} and ((𝑾𝗂𝗇𝑰d)(1n𝟏n𝟏n)𝑰d)(𝟏n𝒙¯)=𝟎\big{(}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\big{)}(\bm{1}_{n}\otimes{\overline{\bm{x}}})=\bm{0} for any mean vector 𝒙¯{\overline{\bm{x}}}; and the last inequality follows from the property of the mixing matrix (𝑾𝗂𝗇𝑰d)𝑰nd𝗈𝗉2\big{\|}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-{\bm{{I}}}_{nd}\big{\|}_{\mathsf{op}}\leq 2 and (𝑾𝗂𝗇𝑰d)(1n𝟏n𝟏n)𝑰d𝗈𝗉α𝗂𝗇\big{\|}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\big{\|}_{\mathsf{op}}\leq\alpha_{\mathsf{in}}.
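In particular, the squared norm in step (ii) is expanded via \|{\bm{a}}-{\bm{b}}\|_{2}^{2}\leq 2\|{\bm{a}}\|_{2}^{2}+2\|{\bm{b}}\|_{2}^{2} together with the two operator norm bounds above, which produces the constants 8=2\cdot 2^{2} and 2\alpha_{\mathsf{in}}^{2}\eta^{2}:

\Big\|\big(({\bm{W}}_{\mathsf{in}}\otimes{\bm{I}}_{d})-{\bm{I}}_{nd}\big)({\bm{u}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1})-\eta\big(({\bm{W}}_{\mathsf{in}}\otimes{\bm{I}}_{d})-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{I}}_{d}\big)({\bm{v}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1})\Big\|_{2}^{2}\leq 2\cdot 2^{2}\big\|{\bm{u}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1}\big\|_{2}^{2}+2\alpha_{\mathsf{in}}^{2}\eta^{2}\big\|{\bm{v}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big\|_{2}^{2},

and adding the separated term \eta^{2}n\|{\overline{\bm{v}}}^{(t),s-1}\|_{2}^{2} recovers the last line of (19).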

Plugging (18) and (19) into (15), we can further obtain

𝔼(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)22\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\Big{)}\Big{\|}_{2}^{2}
L2n2pbk=1s𝔼𝒖(t),k𝒖(t),k122\displaystyle\leq\frac{L^{2}}{n^{2}pb}\sum_{k=1}^{s}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),k}-{\bm{{u}}}^{(t),k-1}\big{\|}^{2}_{2}
8L2n2pbk=0s1𝔼𝒖(t),k𝟏n𝒖¯(t),k22+2α𝗂𝗇2η2L2n2pbk=0s1𝔼𝒗(t),k𝟏n𝒗¯(t),k22+η2L2npbk=0s1𝔼𝒗¯(t),k22.\displaystyle\leq\frac{8L^{2}}{n^{2}pb}\sum_{k=0}^{s-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),k}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),k}\|_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{n^{2}pb}\sum_{k=0}^{s-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t),k}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),k}\big{\|}_{2}^{2}+\frac{\eta^{2}L^{2}}{npb}\sum_{k=0}^{s-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),k}\big{\|}_{2}^{2}.

Using (14) and the previous inequality, we can bound the summation of inner loop gradient estimation errors as

s=0S1𝔼f(𝒖¯(t),s)𝒗¯(t),s22\displaystyle~{}~{}~{}~{}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}
2L2ns=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22+2s=0S1𝔼(1n𝟏n𝑰d)(F(𝒖(t),s)𝒗(t),s)22\displaystyle\leq\frac{2L^{2}}{n}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+2\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\Big{\|}\Big{(}\frac{1}{n}\bm{1}_{n}^{\top}\otimes{\bm{{I}}}_{d}\Big{)}\big{(}\nabla F({\bm{{u}}}^{(t),s})-{\bm{{v}}}^{(t),s}\big{)}\Big{\|}_{2}^{2}
2L2ns=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22+16L2n2pbs=0S1k=0s1𝔼𝒖(t),k𝟏n𝒖¯(t),k22\displaystyle\leq\frac{2L^{2}}{n}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+\frac{16L^{2}}{n^{2}pb}\sum_{s=0}^{S-1}\sum_{k=0}^{s-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),k}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),k}\big{\|}_{2}^{2}
+4α𝗂𝗇2η2L2n2pbs=0S1k=0s1𝔼𝒗(t),k𝟏n𝒗¯(t),k22+2η2L2npbs=0S1k=0s1𝔼𝒗¯(t),k22\displaystyle\qquad+\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{n^{2}pb}\sum_{s=0}^{S-1}\sum_{k=0}^{s-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t),k}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),k}\big{\|}_{2}^{2}+\frac{2\eta^{2}L^{2}}{npb}\sum_{s=0}^{S-1}\sum_{k=0}^{s-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),k}\big{\|}_{2}^{2}
(8Snpb+1)2L2ns=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22\displaystyle\leq\Big{(}\frac{8S}{npb}+1\Big{)}\cdot\frac{2L^{2}}{n}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}
+4Sα𝗂𝗇2η2L2n2pbs=0S1𝔼𝒗(t),s𝟏n𝒗¯(t),s22+2Sη2L2npbs=0S1𝔼𝒗¯(t),s22,\displaystyle\qquad+\frac{4S\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{n^{2}pb}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}+\frac{2S\eta^{2}L^{2}}{npb}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2},

where the last inequality is obtained by relaxing the upper bound of the summation w.r.t. kk from s1s-1 to S1S-1.

The quantity of interest can now be bounded as

2L2s=0S1𝔼𝒖(t),s\displaystyle 2L^{2}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}- 𝟏n𝒖¯(t),s22+2ns=0S1𝔼f(𝒖¯(t),s)𝒗¯(t),s22\displaystyle\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+2n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}
(4Snpb+1)8L2s=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22\displaystyle\leq\Big{(}\frac{4S}{npb}+1\Big{)}\cdot 8L^{2}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}
+8Sα𝗂𝗇2η2L2npbs=0S1𝔼𝒗(t),s𝟏n𝒗¯(t),s22+4Sη2L2pbs=0S1𝔼𝒗¯(t),s22\displaystyle\qquad+\frac{8S\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{npb}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}+\frac{4S\eta^{2}L^{2}}{pb}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2} (20)

D.2 Sum of inner loop consensus errors

Using the update rule (6a), the variable consensus error can be expanded deterministically as follows:

𝒖(t),s𝟏n𝒖¯(t),s22\displaystyle\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2} =(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝑾𝗂𝗇𝑰d)(𝒖(t),s1η𝒗(t),s1)22\displaystyle=\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})({\bm{{u}}}^{(t),s-1}-\eta{\bm{{v}}}^{(t),s-1})\Big{\|}_{2}^{2}
(i)α𝗂𝗇2(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝒖(t),s1η𝒗(t),s1)22\displaystyle\overset{\text{(i)}}{\leq}\alpha_{\mathsf{in}}^{2}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{u}}}^{(t),s-1}-\eta{\bm{{v}}}^{(t),s-1})\Big{\|}_{2}^{2}
2α𝗂𝗇21+α𝗂𝗇2𝒖(t),s1𝟏n𝒖¯(t),s122+2α𝗂𝗇2η21α𝗂𝗇2𝒗(t),s1𝟏n𝒗¯(t),s122,\displaystyle\leq\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{u}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}}{1-\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}, (21)

where (i) follows from the fact

(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝑾𝑰d)=(𝑾𝑰d(1n𝟏n𝟏n)𝑰d)(𝑰nd(1n𝟏n𝟏n)𝑰d)\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{W}}}\otimes{\bm{{I}}}_{d})=\Big{(}{\bm{{W}}}\otimes{\bm{{I}}}_{d}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}

and the definition of the mixing rate. The last inequality follows from the elementary inequality 2𝒂,𝒃1α𝗂𝗇21+α𝗂𝗇2𝒂22+1+α𝗂𝗇21α𝗂𝗇2𝒃222\langle\bm{a},\bm{b}\rangle\leq\frac{1-\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\|\bm{a}\|_{2}^{2}+\frac{1+\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{in}}^{2}}\|\bm{b}\|_{2}^{2}, so that 𝒂+𝒃2221+α𝗂𝗇2𝒂22+21α𝗂𝗇2𝒃22\|{\bm{{a}}}+{\bm{{b}}}\|_{2}^{2}\leq\frac{2}{1+\alpha_{\mathsf{in}}^{2}}\|{\bm{{a}}}\|_{2}^{2}+\frac{2}{1-\alpha_{\mathsf{in}}^{2}}\|{\bm{{b}}}\|_{2}^{2}.
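Spelled out, for any vectors {\bm{a}} and {\bm{b}} of matching dimension,

\|{\bm{a}}+{\bm{b}}\|_{2}^{2}=\|{\bm{a}}\|_{2}^{2}+\|{\bm{b}}\|_{2}^{2}+2\langle{\bm{a}},{\bm{b}}\rangle\leq\Big(1+\frac{1-\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\Big)\|{\bm{a}}\|_{2}^{2}+\Big(1+\frac{1+\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{in}}^{2}}\Big)\|{\bm{b}}\|_{2}^{2}=\frac{2}{1+\alpha_{\mathsf{in}}^{2}}\|{\bm{a}}\|_{2}^{2}+\frac{2}{1-\alpha_{\mathsf{in}}^{2}}\|{\bm{b}}\|_{2}^{2},

which is the form used to obtain the last line of (21).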

Furthermore, using the update rules (6b) and (6c) and defining 𝚲(t),s=1pdiag(λ1(t),s,λ2(t),s,,λn(t),s)𝑰d{\bm{{\Lambda}}}^{(t),s}=\frac{1}{p}\text{diag}(\lambda_{1}^{(t),s},\lambda_{2}^{(t),s},\ldots,\lambda_{n}^{(t),s})\otimes{\bm{{I}}}_{d}, the gradient consensus error can be similarly expanded as follows:

𝒗(t),s𝟏n𝒗¯(t),s22\displaystyle~{}~{}~{}~{}\big{\|}{\bm{{v}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}
=(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝑾𝗂𝗇𝑰d)𝒈(t),s22\displaystyle=\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d}){\bm{{g}}}^{(t),s}\Big{\|}_{2}^{2}
α𝗂𝗇2(𝑰nd(1n𝟏n𝟏n)𝑰d)𝒈(t),s22\displaystyle\leq\alpha_{\mathsf{in}}^{2}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}{\bm{{g}}}^{(t),s}\Big{\|}_{2}^{2}
(i)2α𝗂𝗇21+α𝗂𝗇2𝒗(t),s1𝟏n𝒗¯(t),s122\displaystyle\overset{\text{(i)}}{\leq}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}
+2α𝗂𝗇21α𝗂𝗇21b𝒛𝒵(t),s(𝑰nd(1n𝟏n𝟏n)𝑰d)𝚲(t),s((𝒖(t),s;𝒛)(𝒖(t),s1;𝒛))22\displaystyle\qquad+\frac{2\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{in}}^{2}}\cdot\frac{1}{b}\sum_{{\bm{{z}}}\in{\mathcal{{Z}}}^{(t),s}}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}{\bm{{\Lambda}}}^{(t),s}\big{(}\nabla\ell({\bm{{u}}}^{(t),s};{\bm{{z}}})-\nabla\ell({\bm{{u}}}^{(t),s-1};{\bm{{z}}})\big{)}\Big{\|}_{2}^{2}
(ii)2α𝗂𝗇21+α𝗂𝗇2𝒗(t),s1𝟏n𝒗¯(t),s122+2α𝗂𝗇2L2(1α𝗂𝗇2)p2𝒖(t),s𝒖(t),s122\displaystyle\overset{\text{(ii)}}{\leq}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\big{\|}{\bm{{u}}}^{(t),s}-{\bm{{u}}}^{(t),s-1}\big{\|}_{2}^{2}
(iii)2α𝗂𝗇21+α𝗂𝗇2𝒗(t),s1𝟏n𝒗¯(t),s122+2α𝗂𝗇2L2(1α𝗂𝗇2)p2(8𝒖(t),s1𝟏n𝒖¯(t),s122\displaystyle\overset{\text{(iii)}}{\leq}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\Big{(}8\big{\|}{\bm{{u}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1}\|_{2}^{2}
+2α𝗂𝗇2η2𝒗(t),s1𝟏n𝒗¯(t),s122+η2n𝒗¯(t),s122)\displaystyle\qquad+2\alpha_{\mathsf{in}}^{2}\eta^{2}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}+\eta^{2}n\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}\Big{)}
=(2α𝗂𝗇21+α𝗂𝗇2+4α𝗂𝗇4η2L2(1α𝗂𝗇2)p2)𝒗(t),s1𝟏n𝒗¯(t),s122\displaystyle=\Big{(}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}+\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\Big{)}\big{\|}{\bm{{v}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}
+16α𝗂𝗇2L2(1α𝗂𝗇2)p2𝒖(t),s1𝟏n𝒖¯(t),s122+2α𝗂𝗇2η2L2(1α𝗂𝗇2)p2n𝒗¯(t),s122,\displaystyle\qquad+\frac{16\alpha_{\mathsf{in}}^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\big{\|}{\bm{{u}}}^{(t),s-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s-1}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\cdot n\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}_{2}^{2}, (22)

where the second term in (i) is obtained by Jensen’s inequality, (ii) follows from Assumption 1 and 𝚲(t),s𝗈𝗉1p\big{\|}{\bm{{\Lambda}}}^{(t),s}\big{\|}_{\mathsf{op}}\leq\frac{1}{p}, and (iii) follows from (19).

D.3 Linear system

Let 𝒆(t),s=[L2𝔼𝒖(t),s𝟏n𝒖¯(t),s22𝔼𝒗(t),s𝟏n𝒗¯(t),s22]{\bm{{e}}}^{(t),s}=\begin{bmatrix}L^{2}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}\\ {\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}\end{bmatrix}, and 𝒃(t),s=2α𝗂𝗇2η2L2(1α𝗂𝗇2)p2[0n𝔼𝒗¯(t),s22]{\bm{{b}}}^{(t),s}=\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\begin{bmatrix}0\\ n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}\end{bmatrix}. By taking expectation of (21) and (22), we can construct the following linear system

𝒆(t),s\displaystyle{\bm{{e}}}^{(t),s} [2α𝗂𝗇21+α𝗂𝗇22α𝗂𝗇2η2L21α𝗂𝗇216α𝗂𝗇2(1α𝗂𝗇2)p22α𝗂𝗇21+α𝗂𝗇2+4α𝗂𝗇4η2L2(1α𝗂𝗇2)p2]𝒆(t),s1+𝒃(t),s1\displaystyle\leq\begin{bmatrix}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}&\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{1-\alpha_{\mathsf{in}}^{2}}\\[5.0pt] \frac{16\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}&\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}+\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\end{bmatrix}{\bm{{e}}}^{(t),s-1}+{\bm{{b}}}^{(t),s-1}
[α𝗂𝗇2α𝗂𝗇2η2L21α𝗂𝗇16α𝗂𝗇2(1α𝗂𝗇)p2α𝗂𝗇+4α𝗂𝗇4η2L2(1α𝗂𝗇)p2]=:𝑮𝗂𝗇𝒆(t),s1+𝒃(t),s1=𝑮𝗂𝗇𝒆(t),s1+𝒃(t),s1,\displaystyle\leq\underbrace{\begin{bmatrix}\alpha_{\mathsf{in}}&\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{1-\alpha_{\mathsf{in}}}\\[5.0pt] \frac{16\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}&\alpha_{\mathsf{in}}+\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}\end{bmatrix}}_{=:{\bm{{G}}}_{\mathsf{in}}}{\bm{{e}}}^{(t),s-1}+{\bm{{b}}}^{(t),s-1}={\bm{{G}}}_{\mathsf{in}}{\bm{{e}}}^{(t),s-1}+{\bm{{b}}}^{(t),s-1}, (23)

where the second inequality is due to 2α𝗂𝗇<1+α𝗂𝗇22\alpha_{\mathsf{in}}<1+\alpha_{\mathsf{in}}^{2} and 1+α𝗂𝗇11+\alpha_{\mathsf{in}}\geq 1. Telescoping the above inequality, we obtain

𝒆(t),s\displaystyle{\bm{{e}}}^{(t),s}\leq 𝑮𝗂𝗇s𝒆(t),0+k=1s𝑮𝗂𝗇sk𝒃(t),k1.\displaystyle{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{e}}}^{(t),0}+\sum_{k=1}^{s}{\bm{{G}}}_{\mathsf{in}}^{s-k}{\bm{{b}}}^{(t),k-1}. (24)

Thus, the sum of the consensus errors can be bounded by

s=0S1𝒆(t),s\displaystyle\sum_{s=0}^{S-1}{\bm{{e}}}^{(t),s} 𝒆(t),0+s=1S1(𝑮𝗂𝗇s𝒆(t),0+k=1s𝑮𝗂𝗇sk𝒃(t),k1)\displaystyle\leq{\bm{{e}}}^{(t),0}+\sum_{s=1}^{S-1}\Big{(}{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{e}}}^{(t),0}+\sum_{k=1}^{s}{\bm{{G}}}_{\mathsf{in}}^{s-k}{\bm{{b}}}^{(t),k-1}\Big{)}
=s=0S1𝑮𝗂𝗇s𝒆(t),0+s=1S1k=1s𝑮𝗂𝗇sk𝒃(t),k1\displaystyle=\sum_{s=0}^{S-1}{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{e}}}^{(t),0}+\sum_{s=1}^{S-1}\sum_{k=1}^{s}{\bm{{G}}}_{\mathsf{in}}^{s-k}{\bm{{b}}}^{(t),k-1}
=(i)s=0S1𝑮𝗂𝗇s𝒆(t),0+k=1S1s=0S1k𝑮𝗂𝗇s𝒃(t),k1\displaystyle\overset{\text{(i)}}{=}\sum_{s=0}^{S-1}{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{e}}}^{(t),0}+\sum_{k=1}^{S-1}\sum_{s=0}^{S-1-k}{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{b}}}^{(t),k-1}
(ii)s=0S1𝑮𝗂𝗇s𝒆(t),0+k=1S1s=0S1𝑮𝗂𝗇s𝒃(t),k1\displaystyle\overset{\text{(ii)}}{\leq}\sum_{s=0}^{S-1}{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{e}}}^{(t),0}+\sum_{k=1}^{S-1}\sum_{s=0}^{S-1}{\bm{{G}}}_{\mathsf{in}}^{s}{\bm{{b}}}^{(t),k-1}
(iii)s=0𝑮𝗂𝗇s(𝒆(t),0+s=0S1𝒃(t),s)\displaystyle\overset{\text{(iii)}}{\leq}\sum_{s=0}^{\infty}{\bm{{G}}}_{\mathsf{in}}^{s}\Big{(}{\bm{{e}}}^{(t),0}+\sum_{s=0}^{S-1}{\bm{{b}}}^{(t),s}\Big{)} (25)

where (i) follows by changing the order of summation, and (ii) and (iii) follow from the entrywise nonnegativity of 𝑮𝗂𝗇{\bm{{G}}}_{\mathsf{in}} and 𝒃(t),s{\bm{{b}}}^{(t),s}, respectively. To continue, we begin with the following claim about 𝑮𝗂𝗇{\bm{{G}}}_{\mathsf{in}}, which will be proved momentarily.

Claim 1.

Under the choice of η\eta in Theorem 1, the eigenvalues of 𝐆𝗂𝗇{\bm{{G}}}_{\mathsf{in}} are in (1,1)(-1,1), and the Neumann series converges,

s=0𝑮𝗂𝗇s=(𝑰2𝑮𝗂𝗇)1\displaystyle\sum_{s=0}^{\infty}{\bm{{G}}}_{\mathsf{in}}^{s}=({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{in}})^{-1} [21α𝗂𝗇4α𝗂𝗇2η2L2(1α𝗂𝗇)332α𝗂𝗇2(1α𝗂𝗇)3p221α𝗂𝗇].\displaystyle\leq\begin{bmatrix}\frac{2}{1-\alpha_{\mathsf{in}}}&\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}}\\[5.0pt] \frac{32\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}&\frac{2}{1-\alpha_{\mathsf{in}}}\end{bmatrix}. (26)

Let 𝝇𝗂𝗇=[8(4Spnb+1)8Sα𝗂𝗇2η2L2pnb]\bm{\varsigma}_{\mathsf{in}}^{\top}=\begin{bmatrix}8\left(\frac{4S}{pnb}+1\right)&\frac{8S\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{pnb}\end{bmatrix}. In view of Claim 1, the summation of consensus errors in (20) can be bounded as

(4Snpb+1)8L2s=0S1𝔼𝒖(t),s𝟏n𝒖¯(t),s22+8Sα𝗂𝗇2η2L2npbs=0S1𝔼𝒗(t),s𝟏n𝒗¯(t),s22\displaystyle~{}~{}~{}~{}\Big{(}\frac{4S}{npb}+1\Big{)}\cdot 8L^{2}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+\frac{8S\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{npb}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}
=𝝇𝗂𝗇s=0S1𝒆(t),s\displaystyle=\bm{\varsigma}_{\mathsf{in}}^{\top}\sum_{s=0}^{S-1}{\bm{{e}}}^{(t),s}
𝝇𝗂𝗇(s=0𝑮𝗂𝗇s)(𝒆(t),0+k=0S1𝒃(t),k)\displaystyle\leq\bm{\varsigma}_{\mathsf{in}}^{\top}\Big{(}\sum_{s=0}^{\infty}{\bm{{G}}}_{\mathsf{in}}^{s}\Big{)}\Big{(}{\bm{{e}}}^{(t),0}+\sum_{k=0}^{S-1}{\bm{{b}}}^{(t),k}\Big{)}
𝝇𝗂𝗇(𝑰2𝑮𝗂𝗇)1(𝒆(t),0+s=0S1𝒃(t),s),\displaystyle\leq\bm{\varsigma}_{\mathsf{in}}^{\top}\big{(}{\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{in}}\big{)}^{-1}\Big{(}{\bm{{e}}}^{(t),0}+\sum_{s=0}^{S-1}{\bm{{b}}}^{(t),s}\Big{)},

and

𝝇𝗂𝗇(𝑰2𝑮𝗂𝗇)1\displaystyle\bm{\varsigma}_{\mathsf{in}}^{\top}\big{(}{\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{in}}\big{)}^{-1} [161α𝗂𝗇(4Snpb+1)+32α𝗂𝗇2(1α𝗂𝗇)3p28Sα𝗂𝗇2η2L2pnb32α𝗂𝗇2η2L2(1α𝗂𝗇)3(4Snpb+1)+21α𝗂𝗇8Sα𝗂𝗇2η2L2npb]\displaystyle\leq\begin{bmatrix}\frac{16}{1-\alpha_{\mathsf{in}}}\Big{(}\frac{4S}{npb}+1\Big{)}+\frac{32\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}\cdot\frac{8S\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{pnb}\,&\,\frac{32\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}}\cdot\Big{(}\frac{4S}{npb}+1\Big{)}+\frac{2}{1-\alpha_{\mathsf{in}}}\cdot\frac{8S\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{npb}\end{bmatrix}
[161α𝗂𝗇(4Snpb+1)+3α𝗂𝗇2(1α𝗂𝗇)p2128α𝗂𝗇2η2L2(1α𝗂𝗇)3(Snpb+1)+16α𝗂𝗇2(1α𝗂𝗇)100]\displaystyle\leq\begin{bmatrix}\frac{16}{1-\alpha_{\mathsf{in}}}\Big{(}\frac{4S}{npb}+1\Big{)}+\frac{3\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}\,&\,\frac{128\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}+\frac{16\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{in}})}{100}\end{bmatrix}
[641α𝗂𝗇(Snpb+1) 2α𝗂𝗇2],\displaystyle\leq\begin{bmatrix}\frac{64}{1-\alpha_{\mathsf{in}}}\Big{(}\frac{S}{npb}+1\Big{)}\,&\,2\alpha_{\mathsf{in}}^{2}\end{bmatrix},

where we use (7), ηL(1α𝗂𝗇)3(1α𝗈𝗎𝗍)10(1+α𝗂𝗇α𝗈𝗎𝗍npb)(S/(npb)+1)(1α𝗂𝗇)3(1α𝗈𝗎𝗍)10(S/(pnb)+1)\eta L\leq\frac{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}{10\big{(}1+\alpha_{\mathsf{in}}\alpha_{\mathsf{out}}\sqrt{npb}\big{)}\big{(}\sqrt{S/(npb)}+1\big{)}}\leq\frac{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}{10\big{(}\sqrt{S/(pnb)}+1\big{)}}, to prove the last two inequalities.

Therefore, (20) can be bounded as

2L2s=0S1𝔼\displaystyle 2L^{2}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|} 𝒖(t),s𝟏n𝒖¯(t),s22+2ns=0S1𝔼f(𝒖¯(t),s)𝒗¯(t),s22\displaystyle{\bm{{u}}}^{(t),s}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t),s}\big{\|}_{2}^{2}+2n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{u}}}^{(t),s})-{\overline{\bm{v}}}^{(t),s}\big{\|}^{2}_{2}
𝝇𝗂𝗇(𝑰2𝑮𝗂𝗇)1(𝒆(t),0+s=0S1𝒃(t),s)+4Sη2L2pbs=0S1𝔼𝒗¯(t),s22\displaystyle\leq\bm{\varsigma}_{\mathsf{in}}^{\top}\big{(}{\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{in}}\big{)}^{-1}\Big{(}{\bm{{e}}}^{(t),0}+\sum_{s=0}^{S-1}{\bm{{b}}}^{(t),s}\Big{)}+\frac{4S\eta^{2}L^{2}}{pb}\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}
64L21α𝗂𝗇(Snpb+1)𝔼𝒙(t)𝟏n𝒙¯(t)22+2α𝗂𝗇2𝔼𝒔(t)𝟏n𝒔¯(t)22\displaystyle\leq\frac{64L^{2}}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}+2\alpha_{\mathsf{in}}^{2}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}
+(4α𝗂𝗇4η2L2(1α𝗂𝗇)2p2+4Sη2L2npb)ns=1S𝔼𝒗¯(t),s122\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}+\Big{(}\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}+\frac{4S\eta^{2}L^{2}}{npb}\Big{)}\cdot n\sum_{s=1}^{S}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}^{2}_{2}
<64L21α𝗂𝗇(Snpb+1)𝔼𝒙(t)𝟏n𝒙¯(t)22+2α𝗂𝗇2𝔼𝒔(t)𝟏n𝒔¯(t)22+2n25s=1S𝔼𝒗¯(t),s122,\displaystyle<\frac{64L^{2}}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}+2\alpha_{\mathsf{in}}^{2}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}+\frac{2n}{25}\sum_{s=1}^{S}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s-1}\big{\|}^{2}_{2},

where the last inequality follows from (7), which gives 4α𝗂𝗇4η2L2(1α𝗂𝗇)2p24α𝗂𝗇2η2L2(1α𝗂𝗇)2<4α𝗂𝗇2(1α𝗂𝗇)2(1α𝗂𝗇)6100125\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}\leq\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}}<\frac{4\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\cdot\frac{(1-\alpha_{\mathsf{in}})^{6}}{100}\leq\frac{1}{25} and 4Sη2L2npbSnpb4100(S/(npb)+1)2<125\frac{4S\eta^{2}L^{2}}{npb}\leq\frac{S}{npb}\cdot\frac{4}{100\big{(}\sqrt{S/(npb)}+1\big{)}^{2}}<\frac{1}{25}.

Proof of Claim 1.

By the definition of 𝑮𝗂𝗇{\bm{{G}}}_{\mathsf{in}} in (23), the characteristic polynomial of 𝑮𝗂𝗇{\bm{{G}}}_{\mathsf{in}} is

f(λ)\displaystyle f(\lambda) =(α𝗂𝗇λ)(α𝗂𝗇+4α𝗂𝗇4η2L2(1α𝗂𝗇)p2λ)32α𝗂𝗇4η2L2(1α𝗂𝗇)2p2.\displaystyle=(\alpha_{\mathsf{in}}-\lambda)\Big{(}\alpha_{\mathsf{in}}+\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}-\lambda\Big{)}-\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}.

By (7), i.e., ηL(1α𝗂𝗇)3(1α𝗈𝗎𝗍)10(1+α𝗂𝗇α𝗈𝗎𝗍npb)(S/(npb)+1)(1α𝗂𝗇)310\eta L\leq\frac{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}{10\big{(}1+\alpha_{\mathsf{in}}\alpha_{\mathsf{out}}\sqrt{npb}\big{)}\big{(}\sqrt{S/(npb)}+1\big{)}}\leq\frac{(1-\alpha_{\mathsf{in}})^{3}}{10}, and α𝗂𝗇p\alpha_{\mathsf{in}}\leq p, we have 32α𝗂𝗇4η2L2(1α𝗂𝗇)2p232α𝗂𝗇2η2L2(1α𝗂𝗇)232100α𝗂𝗇2(1α𝗂𝗇)4<1\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}\leq\frac{32\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\leq\frac{32}{100}\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{in}})^{4}<1, so that f(1)132α𝗂𝗇4η2L2(1α𝗂𝗇)2p2>0f(-1)\geq 1-\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}>0, and

f(1)\displaystyle f(1) =(1α𝗂𝗇)24α𝗂𝗇4η2L2p232α𝗂𝗇4η2L2(1α𝗂𝗇)2p2\displaystyle=(1-\alpha_{\mathsf{in}})^{2}-\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{p^{2}}-\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}
(1α𝗂𝗇)236α𝗂𝗇4η2L2(1α𝗂𝗇)2p2\displaystyle\geq(1-\alpha_{\mathsf{in}})^{2}-\frac{36\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}
>(1α𝗂𝗇)236100(1α𝗂𝗇)4>0.\displaystyle>(1-\alpha_{\mathsf{in}})^{2}-\frac{36}{100}(1-\alpha_{\mathsf{in}})^{4}>0.
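The remaining ingredient, f(\alpha_{\mathsf{in}})\leq 0, follows by evaluating the characteristic polynomial at \lambda=\alpha_{\mathsf{in}}, where the first product vanishes:

f(\alpha_{\mathsf{in}})=(\alpha_{\mathsf{in}}-\alpha_{\mathsf{in}})\Big(\alpha_{\mathsf{in}}+\frac{4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}-\alpha_{\mathsf{in}}\Big)-\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}=-\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}p^{2}}\leq 0.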

Because f(α𝗂𝗇)0f(\alpha_{\mathsf{in}})\leq 0 while f(1)>0f(-1)>0 and f(1)>0f(1)>0, all eigenvalues of 𝑮𝗂𝗇{\bm{{G}}}_{\mathsf{in}} are in (1,1)(-1,1), so the Neumann series converges, yielding

s=0𝑮𝗂𝗇s\displaystyle\sum_{s=0}^{\infty}{\bm{{G}}}_{\mathsf{in}}^{s} =(𝑰2𝑮𝗂𝗇)1\displaystyle=({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{in}})^{-1}
=1α𝗂𝗇(1α𝗂𝗇)4p24((1α𝗂𝗇)2+8)α𝗂𝗇4η2L2[(1α𝗂𝗇)2p24α𝗂𝗇4η2L22α𝗂𝗇2η2L2p216α𝗂𝗇2(1α𝗂𝗇)2p2]\displaystyle=\frac{1-\alpha_{\mathsf{in}}}{(1-\alpha_{\mathsf{in}})^{4}p^{2}-4\big{(}(1-\alpha_{\mathsf{in}})^{2}+8\big{)}\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}\begin{bmatrix}(1-\alpha_{\mathsf{in}})^{2}p^{2}-4\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}&2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}p^{2}\\[5.0pt] 16\alpha_{\mathsf{in}}^{2}&(1-\alpha_{\mathsf{in}})^{2}p^{2}\end{bmatrix}
1α𝗂𝗇(1α𝗂𝗇)4p24((1α𝗂𝗇)2+8)α𝗂𝗇4η2L2[(1α𝗂𝗇)2p22α𝗂𝗇2η2L2p216α𝗂𝗇2(1α𝗂𝗇)2p2]\displaystyle\leq\frac{1-\alpha_{\mathsf{in}}}{(1-\alpha_{\mathsf{in}})^{4}p^{2}-4\big{(}(1-\alpha_{\mathsf{in}})^{2}+8\big{)}\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}\begin{bmatrix}(1-\alpha_{\mathsf{in}})^{2}p^{2}&2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}p^{2}\\[5.0pt] 16\alpha_{\mathsf{in}}^{2}&(1-\alpha_{\mathsf{in}})^{2}p^{2}\end{bmatrix}
(i)1α𝗂𝗇(1α𝗂𝗇)4p236α𝗂𝗇4η2L2[(1α𝗂𝗇)2p22α𝗂𝗇2η2L2p216α𝗂𝗇2(1α𝗂𝗇)2p2]\displaystyle\overset{\text{(i)}}{\leq}\frac{1-\alpha_{\mathsf{in}}}{(1-\alpha_{\mathsf{in}})^{4}p^{2}-36\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}\begin{bmatrix}(1-\alpha_{\mathsf{in}})^{2}p^{2}&2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}p^{2}\\[5.0pt] 16\alpha_{\mathsf{in}}^{2}&(1-\alpha_{\mathsf{in}})^{2}p^{2}\end{bmatrix}
(ii)[21α𝗂𝗇4α𝗂𝗇2η2L2(1α𝗂𝗇)332α𝗂𝗇2(1α𝗂𝗇)3p221α𝗂𝗇],\displaystyle\overset{\text{(ii)}}{\leq}\begin{bmatrix}\frac{2}{1-\alpha_{\mathsf{in}}}&\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}}\\[5.0pt] \frac{32\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}&\frac{2}{1-\alpha_{\mathsf{in}}}\end{bmatrix},

where (i) and (ii) follow from the facts that (1α𝗂𝗇)21(1-\alpha_{\mathsf{in}})^{2}\leq 1 and that (1α𝗂𝗇)4p236α𝗂𝗇4η2L2(1α𝗂𝗇)4p236100α𝗂𝗇4(1α𝗂𝗇)6(1α𝗂𝗇)4p236100α𝗂𝗇2(1α𝗂𝗇)6p2>12(1α𝗂𝗇)4p2(1-\alpha_{\mathsf{in}})^{4}p^{2}-36\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}\geq(1-\alpha_{\mathsf{in}})^{4}p^{2}-\frac{36}{100}\alpha_{\mathsf{in}}^{4}(1-\alpha_{\mathsf{in}})^{6}\geq(1-\alpha_{\mathsf{in}})^{4}p^{2}-\frac{36}{100}\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{in}})^{6}p^{2}>\frac{1}{2}(1-\alpha_{\mathsf{in}})^{4}p^{2} due to (7).
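To illustrate (ii), consider the (1,1) entry: using the lower bound on the denominator stated above,

\frac{(1-\alpha_{\mathsf{in}})\cdot(1-\alpha_{\mathsf{in}})^{2}p^{2}}{(1-\alpha_{\mathsf{in}})^{4}p^{2}-36\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}\leq\frac{(1-\alpha_{\mathsf{in}})^{3}p^{2}}{\frac{1}{2}(1-\alpha_{\mathsf{in}})^{4}p^{2}}=\frac{2}{1-\alpha_{\mathsf{in}}},

and the remaining entries are bounded in exactly the same manner.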

Appendix E Proof of Lemma 3

This section proves Lemma 3. In the following subsections, Sections E.1 and E.2 derive induction inequalities for the consensus errors and Section E.3 creates a linear system of consensus errors to compute the summation.

E.1 Sum of outer loop variable consensus errors

The variable consensus error can be bounded deterministically as follows:

𝒙(t)𝟏n𝒙¯(t)22\displaystyle~{}~{}~{}~{}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}
=(𝑰nd(1n𝟏n𝟏n)𝑰d)𝒙(t)22\displaystyle=\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}{\bm{{x}}}^{(t)}\Big{\|}_{2}^{2}
=(i)(𝑰nd(1n𝟏n𝟏n)𝑰d)𝒖(t1),S22\displaystyle\overset{\text{(i)}}{=}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}{\bm{{u}}}^{(t-1),S}\Big{\|}_{2}^{2}
=(ii)(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝑾𝗂𝗇𝑰d)(𝒖(t1),S1η𝒗(t1),S1)22\displaystyle\overset{\text{(ii)}}{=}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{W}}}_{\mathsf{in}}\otimes{\bm{{I}}}_{d})\Big{(}{\bm{{u}}}^{(t-1),S-1}-\eta{\bm{{v}}}^{(t-1),S-1}\Big{)}\Big{\|}_{2}^{2}
α𝗂𝗇2(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝒖(t1),S1η𝒗(t1),S1)22\displaystyle\leq\alpha_{\mathsf{in}}^{2}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}{\bm{{u}}}^{(t-1),S-1}-\eta{\bm{{v}}}^{(t-1),S-1}\Big{)}\Big{\|}_{2}^{2}
2α𝗂𝗇21+α𝗂𝗇2𝒖(t1),S1𝟏n𝒖¯(t1),S122+2α𝗂𝗇2η21α𝗂𝗇2𝒗(t1),S1𝟏n𝒗¯(t1),S122,\displaystyle\leq\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{u}}}^{(t-1),S-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t-1),S-1}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}}{1-\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{v}}}^{(t-1),S-1}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t-1),S-1}\big{\|}_{2}^{2},

where (i) uses 𝒙(t)=𝒖(t1),S{\bm{{x}}}^{(t)}={\bm{{u}}}^{(t-1),S}, (ii) uses the update rule (6a), and the last two inequalities follow from reasoning similar to (21). Applying the same reasoning recursively to 2α𝗂𝗇21+α𝗂𝗇2𝒖(t1),S1𝟏n𝒖¯(t1),S122\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\big{\|}{\bm{{u}}}^{(t-1),S-1}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t-1),S-1}\big{\|}_{2}^{2} and using 2α𝗂𝗇21+α𝗂𝗇21\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\leq 1, we can show

𝒙(t)𝟏n𝒙¯(t)22\displaystyle\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2} (2α𝗂𝗇21+α𝗂𝗇2)S𝒖(t1),0𝟏n𝒖¯(t1),022+2α𝗂𝗇2η21α𝗂𝗇2s=0S1𝒗(t1),s𝟏n𝒗¯(t1),s22\displaystyle\leq\Big{(}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\Big{)}^{S}\big{\|}{\bm{{u}}}^{(t-1),0}-\bm{1}_{n}\otimes{\overline{\bm{u}}}^{(t-1),0}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}}{1-\alpha_{\mathsf{in}}^{2}}\sum_{s=0}^{S-1}\big{\|}{\bm{{v}}}^{(t-1),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}
=(2α𝗂𝗇21+α𝗂𝗇2)S𝒙(t1)𝟏n𝒙¯(t1)22+2α𝗂𝗇2η21α𝗂𝗇2s=0S1𝒗(t1),s𝟏n𝒗¯(t1),s22,\displaystyle=\Big{(}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\Big{)}^{S}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}}{1-\alpha_{\mathsf{in}}^{2}}\sum_{s=0}^{S-1}\big{\|}{\bm{{v}}}^{(t-1),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}, (27)

where the last equality identifies 𝒙(t1)=𝒖(t1),0{\bm{{x}}}^{(t-1)}={\bm{{u}}}^{(t-1),0}.

Taking expectations of the previous inequality and invoking (25), we can further bound the summation in (27) as follows

s=0S1𝔼𝒗(t1),s𝟏n𝒗¯(t1),s22\displaystyle\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\bm{{v}}}^{(t-1),s}-\bm{1}_{n}\otimes{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2} 32α𝗂𝗇2L2(1α𝗂𝗇)3p2𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle\leq\frac{32\alpha_{\mathsf{in}}^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+21α𝗂𝗇(𝔼𝒔(t1)𝟏n𝒔¯(t1)22+2α𝗂𝗇2η2L2(1α𝗂𝗇2)p2ns=0S1𝔼𝒗¯(t1),s22).\displaystyle\qquad+\frac{2}{1-\alpha_{\mathsf{in}}}\Big{(}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}\Big{)}.

Together with 𝒙(t)=𝒖(t),0{\bm{{x}}}^{(t)}={\bm{{u}}}^{(t),0} and 𝒔(t)=𝒗(t),0{\bm{{s}}}^{(t)}={\bm{{v}}}^{(t),0}, (27) can be further bounded as

𝔼𝒙(t)𝟏n𝒙¯(t)22\displaystyle{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2} ((2α𝗂𝗇21+α𝗂𝗇2)S+2α𝗂𝗇2η2L21α𝗂𝗇232α𝗂𝗇2(1α𝗂𝗇)3p2)𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle\leq\Bigg{(}\Big{(}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\Big{)}^{S}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{1-\alpha_{\mathsf{in}}^{2}}\cdot\frac{32\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}\Bigg{)}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+2α𝗂𝗇2η21α𝗂𝗇221α𝗂𝗇(𝔼𝒔(t1)𝟏n𝒔¯(t1)22+2α𝗂𝗇2η2L2(1α𝗂𝗇2)p2ns=0S1𝔼𝒗¯(t1),s22)\displaystyle\qquad+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}}{1-\alpha_{\mathsf{in}}^{2}}\cdot\frac{2}{1-\alpha_{\mathsf{in}}}\Big{(}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}}^{2})p^{2}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}\Big{)}
<α𝗂𝗇𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle<\alpha_{\mathsf{in}}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+4α𝗂𝗇2η2(1α𝗂𝗇)2(𝔼𝒔(t1)𝟏n𝒔¯(t1)22+2α𝗂𝗇2η2L2(1α𝗂𝗇)p2ns=0S1𝔼𝒗¯(t1),s22).\displaystyle\qquad+\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\Big{(}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}\Big{)}. (28)

The last inequality is obtained by using (7) and the fact that 0α𝗂𝗇<10\leq\alpha_{\mathsf{in}}<1 as follows

(2α𝗂𝗇21+α𝗂𝗇2)S+2α𝗂𝗇2η2L21α𝗂𝗇232α𝗂𝗇2(1α𝗂𝗇)3p2\displaystyle\Big{(}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\Big{)}^{S}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{1-\alpha_{\mathsf{in}}^{2}}\cdot\frac{32\alpha_{\mathsf{in}}^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}} =(2α𝗂𝗇21+α𝗂𝗇2)S+α𝗂𝗇2(1α𝗂𝗇)21+α𝗂𝗇64α𝗂𝗇2η2L2(1α𝗂𝗇)6p2\displaystyle=\Big{(}\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}\Big{)}^{S}+\frac{\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{in}})^{2}}{1+\alpha_{\mathsf{in}}}\cdot\frac{64\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{6}p^{2}}
<2α𝗂𝗇21+α𝗂𝗇2+64100α𝗂𝗇2(1α𝗂𝗇)21+α𝗂𝗇\displaystyle<\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}+\frac{64}{100}\cdot\frac{\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{in}})^{2}}{1+\alpha_{\mathsf{in}}}
2α𝗂𝗇21+α𝗂𝗇2+α𝗂𝗇(1α𝗂𝗇)21+α𝗂𝗇2=α𝗂𝗇.\displaystyle\leq\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}+\frac{\alpha_{\mathsf{in}}(1-\alpha_{\mathsf{in}})^{2}}{1+\alpha_{\mathsf{in}}^{2}}=\alpha_{\mathsf{in}}.
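where the final identity is elementary:

\frac{2\alpha_{\mathsf{in}}^{2}}{1+\alpha_{\mathsf{in}}^{2}}+\frac{\alpha_{\mathsf{in}}(1-\alpha_{\mathsf{in}})^{2}}{1+\alpha_{\mathsf{in}}^{2}}=\frac{\alpha_{\mathsf{in}}\big(2\alpha_{\mathsf{in}}+1-2\alpha_{\mathsf{in}}+\alpha_{\mathsf{in}}^{2}\big)}{1+\alpha_{\mathsf{in}}^{2}}=\frac{\alpha_{\mathsf{in}}(1+\alpha_{\mathsf{in}}^{2})}{1+\alpha_{\mathsf{in}}^{2}}=\alpha_{\mathsf{in}}.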

E.2 Sum of outer loop gradient estimation consensus errors

In view of the update rule for the gradient tracking term (5) and reorganizing terms, we have

𝒔(t)𝟏n𝒔¯(t)22\displaystyle\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2} =(𝑰nd(1n𝟏n𝟏n)𝑰d)𝒔(t)22\displaystyle=\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}{\bm{{s}}}^{(t)}\Big{\|}_{2}^{2}
=(𝑰nd(1n𝟏n𝟏n)𝑰d)(𝑾𝗈𝗎𝗍𝑰d)(𝒔(t1)+F(𝒙(t))F(𝒙(t1)))22\displaystyle=\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}({\bm{{W}}}_{\mathsf{out}}\otimes{\bm{{I}}}_{d})\Big{(}{\bm{{s}}}^{(t-1)}+\nabla F({\bm{{x}}}^{(t)})-\nabla F({\bm{{x}}}^{(t-1)})\Big{)}\Big{\|}_{2}^{2}
2α𝗈𝗎𝗍21+α𝗈𝗎𝗍2𝒔(t1)𝟏n𝒔¯(t1)22\displaystyle\leq\frac{2\alpha_{\mathsf{out}}^{2}}{1+\alpha_{\mathsf{out}}^{2}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}
+2α𝗈𝗎𝗍21α𝗈𝗎𝗍2(𝑰nd(1n𝟏n𝟏n)𝑰d)(F(𝒙(t))F(𝒙(t1)))22,\displaystyle\qquad+\frac{2\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}^{2}}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\nabla F({\bm{{x}}}^{(t)})-\nabla F({\bm{{x}}}^{(t-1)})\Big{)}\Big{\|}_{2}^{2}, (29)

which follows from reasoning similar to (21). The second term can be further decomposed as

(𝑰nd(1n𝟏n𝟏n)𝑰d)(F(𝒙(t))F(𝒙(t1)))22\displaystyle~{}~{}~{}~{}\Big{\|}\Big{(}{\bm{{I}}}_{nd}-(\frac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\otimes{\bm{{I}}}_{d}\Big{)}\Big{(}\nabla F({\bm{{x}}}^{(t)})-\nabla F({\bm{{x}}}^{(t-1)})\Big{)}\Big{\|}_{2}^{2}
F(𝒙(t))F(𝒙(t1))22\displaystyle\leq\big{\|}\nabla F({\bm{{x}}}^{(t)})-\nabla F({\bm{{x}}}^{(t-1)})\big{\|}_{2}^{2}
L2(𝒙(t)𝟏n𝒙¯(t))(𝒙(t1)𝟏n𝒙¯(t1))+(𝟏n𝒙¯(t)𝟏n𝒙¯(t1))22\displaystyle\leq L^{2}\big{\|}({\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)})-({\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)})+(\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)})\big{\|}_{2}^{2}
=L2(𝒙(t)𝟏n𝒙¯(t))(𝒙(t1)𝟏n𝒙¯(t1))22+nL2𝒙¯(t)𝒙¯(t1)22\displaystyle=L^{2}\big{\|}({\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)})-({\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)})\big{\|}_{2}^{2}+nL^{2}\big{\|}{\overline{\bm{x}}}^{(t)}-{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
2L2𝒙(t)𝟏n𝒙¯(t)22+2L2𝒙(t1)𝟏n𝒙¯(t1)22+Sη2L2ns=0S1𝒗¯(t1),s22,\displaystyle\leq 2L^{2}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}+2L^{2}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}+S\eta^{2}L^{2}\cdot n\sum_{s=0}^{S-1}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}, (30)

where the last line follows from the update rule (6a), which identifies 𝒙¯(t)𝒙¯(t1)=ηs=0S1𝒗¯(t1),s{\overline{\bm{x}}}^{(t)}-{\overline{\bm{x}}}^{(t-1)}=-\eta\sum_{s=0}^{S-1}{\overline{\bm{v}}}^{(t-1),s}, and the Cauchy-Schwarz inequality, as spelled out below.
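Concretely, the last term arises from the Cauchy-Schwarz inequality applied to the sum of S vectors:

nL^{2}\big\|{\overline{\bm{x}}}^{(t)}-{\overline{\bm{x}}}^{(t-1)}\big\|_{2}^{2}=nL^{2}\eta^{2}\Big\|\sum_{s=0}^{S-1}{\overline{\bm{v}}}^{(t-1),s}\Big\|_{2}^{2}\leq S\eta^{2}L^{2}\cdot n\sum_{s=0}^{S-1}\big\|{\overline{\bm{v}}}^{(t-1),s}\big\|_{2}^{2}.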

With (30), (29) can be further bounded as follows

𝒔(t)𝟏n𝒔¯(t)22\displaystyle\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2} 2α𝗈𝗎𝗍21+α𝗈𝗎𝗍2𝒔(t1)𝟏n𝒔¯(t1)22+2α𝗈𝗎𝗍21α𝗈𝗎𝗍2(2L2𝒙(t)𝟏n𝒙¯(t)22\displaystyle\leq\frac{2\alpha_{\mathsf{out}}^{2}}{1+\alpha_{\mathsf{out}}^{2}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}^{2}}\Big{(}2L^{2}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}
+2L2𝒙(t1)𝟏n𝒙¯(t1)22+Sη2L2ns=0S1𝒗¯(t1),s22)\displaystyle\qquad+2L^{2}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}+S\eta^{2}L^{2}\cdot n\sum_{s=0}^{S-1}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}\Big{)}
α𝗈𝗎𝗍𝒔(t1)𝟏n𝒔¯(t1)22+2α𝗈𝗎𝗍21α𝗈𝗎𝗍(2L2𝒙(t)𝟏n𝒙¯(t)22\displaystyle\leq\alpha_{\mathsf{out}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}}\Big{(}2L^{2}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}
+2L2𝒙(t1)𝟏n𝒙¯(t1)22+Sη2L2ns=0S1𝒗¯(t1),s22).\displaystyle\qquad+2L^{2}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}+S\eta^{2}L^{2}\cdot n\sum_{s=0}^{S-1}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}\Big{)}. (31)

Combining with (28) and taking expectations, (31) can be further bounded as

𝔼𝒔(t)𝟏n𝒔¯(t)22\displaystyle{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2} <α𝗈𝗎𝗍𝔼𝒔(t1)𝟏n𝒔¯(t1)22+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle<\alpha_{\mathsf{out}}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+2α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍ns=0S1𝔼𝒗¯(t1),s22+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍(α𝗂𝗇𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle\qquad+\frac{2\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\Bigg{(}\alpha_{\mathsf{in}}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+4α𝗂𝗇2η2(1α𝗂𝗇)2(𝔼𝒔(t1)𝟏n𝒔¯(t1)22+2α𝗂𝗇2η2L2(1α𝗂𝗇)p2ns=0S1𝔼𝒗¯(t1),s22))\displaystyle\qquad+\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\Big{(}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}\Big{)}\Bigg{)}
=(α𝗈𝗎𝗍+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍4α𝗂𝗇2η2(1α𝗂𝗇)2)𝔼𝒔(t1)𝟏n𝒔¯(t1)22\displaystyle=\Big{(}\alpha_{\mathsf{out}}+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\Big{)}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}
+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍(1+α𝗂𝗇)𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle\qquad+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}(1+\alpha_{\mathsf{in}}){\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+(2α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍4α𝗂𝗇2η2(1α𝗂𝗇)22α𝗂𝗇2η2L2(1α𝗂𝗇)p2)ns=0S1𝔼𝒗¯(t1),s22\displaystyle\qquad+\Big{(}\frac{2\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\cdot\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}}\Big{)}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}
<(i)(α𝗈𝗎𝗍+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍4α𝗂𝗇2η2(1α𝗂𝗇)2)𝔼𝒔(t1)𝟏n𝒔¯(t1)22\displaystyle\overset{\text{(i)}}{<}\Big{(}\alpha_{\mathsf{out}}+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\Big{)}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t-1)}\big{\|}_{2}^{2}
+4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍(1+α𝗂𝗇)𝔼𝒙(t1)𝟏n𝒙¯(t1)22\displaystyle\qquad+\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}(1+\alpha_{\mathsf{in}}){\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t-1)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t-1)}\big{\|}_{2}^{2}
+3α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍ns=0S1𝔼𝒗¯(t1),s22,\displaystyle\qquad+\frac{3\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t-1),s}\big{\|}_{2}^{2}, (32)

where (i) is obtained by applying the condition in (7) as follows

4α𝗈𝗎𝗍2L21α𝗈𝗎𝗍4α𝗂𝗇2η2(1α𝗂𝗇)22α𝗂𝗇2η2L2(1α𝗂𝗇)p2\displaystyle\frac{4\alpha_{\mathsf{out}}^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\cdot\frac{2\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})p^{2}} =α𝗈𝗎𝗍2η2L21α𝗈𝗎𝗍32α𝗂𝗇4η2L2(1α𝗂𝗇)3p2\displaystyle=\frac{\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{32\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}
α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍32α𝗂𝗇2(1α𝗂𝗇)6100(1α𝗂𝗇)3\displaystyle\leq\frac{\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{32\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{in}})^{6}}{100(1-\alpha_{\mathsf{in}})^{3}}
α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍,\displaystyle\leq\frac{\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}},

where the inequalities are obtained by using S1S\geq 1 and 0α𝗂𝗇<10\leq\alpha_{\mathsf{in}}<1.
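
To make the above chain concrete, the following minimal Python sketch evaluates both sides for illustrative parameter values. The step-size scale $\eta L=p(1-\alpha_{\mathsf{in}})^{3}\sqrt{S}/(10\alpha_{\mathsf{in}})$ used below is an assumption chosen to be consistent with the manipulation above; it is not claimed to be the exact statement of condition (7).

import numpy as np

a_in, a_out, p, S = 0.6, 0.7, 0.5, 8       # illustrative alpha_in, alpha_out, p, S
etaL = p * (1 - a_in) ** 3 * np.sqrt(S) / (10 * a_in)   # assumed step-size scale eta * L

# product of the three factors on the left-hand side (powers of eta and L grouped into etaL)
lhs = (4 * a_out ** 2 / (1 - a_out)) \
    * (4 * a_in ** 2 * etaL ** 2 / (1 - a_in) ** 2) \
    * (2 * a_in ** 2 * etaL ** 2 / ((1 - a_in) * p ** 2))
# right-hand side: alpha_out^2 * S * eta^2 * L^2 / (1 - alpha_out)
rhs = a_out ** 2 * S * etaL ** 2 / (1 - a_out)
assert lhs <= rhs
print(f"lhs = {lhs:.3e} <= rhs = {rhs:.3e}")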

E.3 Linear system

Defining 𝒆(t):=𝒆(t),0=[L2𝔼𝒙(t)𝟏n𝒙¯(t)22𝔼𝒔(t)𝟏n𝒔¯(t)22]{\bm{{e}}}^{(t)}:={\bm{{e}}}^{(t),0}=\begin{bmatrix}L^{2}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}\\[3.00003pt] {\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}\\ \end{bmatrix} and 𝒃(t)=[8α𝗂𝗇4η4L4(1α𝗂𝗇)3p2ns=0S1𝔼𝒗¯(t),s223α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍ns=0S1𝔼𝒗¯(t),s22]{\bm{{b}}}^{\prime(t)}=\begin{bmatrix}\frac{8\alpha_{\mathsf{in}}^{4}\eta^{4}L^{4}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}\\[3.00003pt] \frac{3\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\cdot n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}\end{bmatrix}, we construct a linear system by putting together (28) and (32) as

𝒆(t)\displaystyle{\bm{{e}}}^{(t)} [α𝗂𝗇4α𝗂𝗇2η2L2(1α𝗂𝗇)24α𝗈𝗎𝗍21α𝗈𝗎𝗍(1+α𝗂𝗇)α𝗈𝗎𝗍+4α𝗈𝗎𝗍21α𝗈𝗎𝗍4α𝗂𝗇2η2L2(1α𝗂𝗇)2]=:𝑮𝗈𝗎𝗍𝒆(t1)+𝒃(t1)=𝑮𝗈𝗎𝗍𝒆(t1)+𝒃(t1).\displaystyle\leq\underbrace{\begin{bmatrix}\alpha_{\mathsf{in}}&\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\\[5.0pt] \frac{4\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}}(1+\alpha_{\mathsf{in}})&\alpha_{\mathsf{out}}+\frac{4\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\end{bmatrix}}_{=:{\bm{{G}}}_{\mathsf{out}}}{\bm{{e}}}^{(t-1)}+{\bm{{b}}}^{\prime(t-1)}={\bm{{G}}}_{\mathsf{out}}{\bm{{e}}}^{(t-1)}+{\bm{{b}}}^{\prime(t-1)}. (33)

Then, following the same argument as (25), we obtain

t=0T𝒆(t)\displaystyle\sum_{t=0}^{T}{\bm{{e}}}^{(t)} t=0𝑮𝗈𝗎𝗍t(𝒆(0)+t=0T1𝒃(t)).\displaystyle\leq\sum_{t=0}^{\infty}{\bm{{G}}}_{\mathsf{out}}^{t}\Big{(}{\bm{{e}}}^{(0)}+\sum_{t=0}^{T-1}{\bm{{b}}}^{\prime(t)}\Big{)}. (34)
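
The bound (34) only uses nonnegativity: unrolling the entrywise recursion ${\bm{{e}}}^{(t)}\leq{\bm{{G}}}_{\mathsf{out}}{\bm{{e}}}^{(t-1)}+{\bm{{b}}}^{\prime(t-1)}$ and dominating the resulting partial geometric sums by the full Neumann series. The following minimal Python sketch illustrates this step on an arbitrary nonnegative example; the matrix and inputs are placeholders, not the quantities appearing in the proof.

import numpy as np

rng = np.random.default_rng(1)
G = np.array([[0.5, 0.1],
              [0.2, 0.6]])                  # nonnegative placeholder with spectral radius < 1
assert np.max(np.abs(np.linalg.eigvals(G))) < 1

T = 50
b = rng.random((T, 2))                      # nonnegative inputs b(0), ..., b(T-1)
e = np.zeros((T + 1, 2))
e[0] = rng.random(2)
for t in range(1, T + 1):                   # run the recursion with equality (the extreme case)
    e[t] = G @ e[t - 1] + b[t - 1]

lhs = e.sum(axis=0)                                          # sum_{t=0}^{T} e(t)
rhs = np.linalg.solve(np.eye(2) - G, e[0] + b.sum(axis=0))   # (I - G)^{-1} (e(0) + sum_t b(t))
assert np.all(lhs <= rhs + 1e-10)
print("unrolled bound holds entrywise:", lhs, "<=", rhs)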

Before continuing, we state the following claim about 𝑮𝗈𝗎𝗍{\bm{{G}}}_{\mathsf{out}} which will be proven momentarily.

Claim 2.

Under the choice of η\eta in Theorem 1, the eigenvalues of 𝐆𝗈𝗎𝗍{\bm{{G}}}_{\mathsf{out}} are in (1,1)(-1,1), and the Neumann series converges,

t=0𝑮𝗈𝗎𝗍t=(𝑰2𝑮𝗈𝗎𝗍)1\displaystyle\sum_{t=0}^{\infty}{\bm{{G}}}_{\mathsf{out}}^{t}=({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1} [21α𝗂𝗇8α𝗂𝗇2η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)16α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)221α𝗈𝗎𝗍].\displaystyle\leq\begin{bmatrix}\frac{2}{1-\alpha_{\mathsf{in}}}&\frac{8\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}\\[5.0pt] \frac{16\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}&\frac{2}{1-\alpha_{\mathsf{out}}}\\ \end{bmatrix}.

With Claim 2 in hand, together with the fact that ${\bm{{e}}}^{(0)}=\bm{0}$, we can bound the sum of the outer loop consensus errors as

64L21α𝗂𝗇(Snpb+1)t=0T𝔼𝒙(t)𝟏n𝒙¯(t)22+2α𝗂𝗇21α𝗂𝗇t=0T𝔼𝒔(t)𝟏n𝒔¯(t)22\displaystyle~{}~{}~{}~{}\frac{64L^{2}}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}\sum_{t=0}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{x}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{x}}}^{(t)}\big{\|}_{2}^{2}+\frac{2\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{in}}}\sum_{t=0}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{s}}}^{(t)}-\bm{1}_{n}\otimes{\overline{\bm{s}}}^{(t)}\big{\|}_{2}^{2}
=𝝇𝗈𝗎𝗍t=1T𝒆(t)\displaystyle=\bm{\varsigma}_{\mathsf{out}}^{\top}\sum_{t=1}^{T}{\bm{{e}}}^{(t)}
𝝇𝗈𝗎𝗍(𝑰2𝑮𝗈𝗎𝗍)1(𝒆(0)+t=0T1𝒃(t))\displaystyle\leq\bm{\varsigma}_{\mathsf{out}}^{\top}({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1}\Big{(}{\bm{{e}}}^{(0)}+\sum_{t=0}^{T-1}{\bm{{b}}}^{\prime(t)}\Big{)}
=𝝇𝗈𝗎𝗍(𝑰2𝑮𝗈𝗎𝗍)1t=0T1𝒃(t),\displaystyle=\bm{\varsigma}_{\mathsf{out}}^{\top}({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1}\sum_{t=0}^{T-1}{\bm{{b}}}^{\prime(t)}, (35)

where 𝝇𝗈𝗎𝗍=[641α𝗂𝗇(Snpb+1)2α𝗂𝗇21α𝗂𝗇]\bm{\varsigma}_{\mathsf{out}}^{\top}=\begin{bmatrix}\frac{64}{1-\alpha_{\mathsf{in}}}\cdot\Big{(}\frac{S}{npb}+1\Big{)}&\frac{2\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{in}}}\end{bmatrix}.

Note that by elementary calculations,

𝝇𝗈𝗎𝗍(𝑰2𝑮𝗈𝗎𝗍)1\displaystyle~{}~{}~{}~{}\bm{\varsigma}_{\mathsf{out}}^{\top}({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1}
[641α𝗂𝗇(Snpb+1)2α𝗂𝗇2][21α𝗂𝗇8α𝗂𝗇2η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)16α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)221α𝗈𝗎𝗍]\displaystyle\leq\begin{bmatrix}\frac{64}{1-\alpha_{\mathsf{in}}}\Big{(}\frac{S}{npb}+1\Big{)}&2\alpha_{\mathsf{in}}^{2}\end{bmatrix}\begin{bmatrix}\frac{2}{1-\alpha_{\mathsf{in}}}&\frac{8\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}\\[5.0pt] \frac{16\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}&\frac{2}{1-\alpha_{\mathsf{out}}}\\ \end{bmatrix}
=[641α𝗂𝗇(Snpb+1)21α𝗂𝗇+32α𝗂𝗇2α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)2641α𝗂𝗇(Snpb+1)8α𝗂𝗇2η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)+4α𝗂𝗇21α𝗈𝗎𝗍]\displaystyle=\begin{bmatrix}\frac{64}{1-\alpha_{\mathsf{in}}}\Big{(}\frac{S}{npb}+1\Big{)}\cdot\frac{2}{1-\alpha_{\mathsf{in}}}+\frac{32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}&\frac{64}{1-\alpha_{\mathsf{in}}}\Big{(}\frac{S}{npb}+1\Big{)}\cdot\frac{8\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}+\frac{4\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{out}}}\end{bmatrix}
<(i)[128(1α𝗂𝗇)2(Snpb+1)+32α𝗂𝗇2α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)26α𝗂𝗇2+4α𝗂𝗇21α𝗈𝗎𝗍]\displaystyle\overset{\text{(i)}}{<}\begin{bmatrix}\frac{128}{(1-\alpha_{\mathsf{in}})^{2}}\Big{(}\frac{S}{npb}+1\Big{)}+\frac{32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}&6\alpha_{\mathsf{in}}^{2}+\frac{4\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{out}}}\end{bmatrix}
<(ii)[128(1α𝗂𝗇)2(Snpb+1)+32α𝗂𝗇2α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)210α𝗂𝗇21α𝗈𝗎𝗍],\displaystyle\overset{\text{(ii)}}{<}\begin{bmatrix}\frac{128}{(1-\alpha_{\mathsf{in}})^{2}}\Big{(}\frac{S}{npb}+1\Big{)}+\frac{32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}&\frac{10\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{out}}}\end{bmatrix},

where we use (7) to prove (i), and 1/(1α𝗂𝗇)11/(1-\alpha_{\mathsf{in}})\geq 1 and 1/(1α𝗈𝗎𝗍)11/(1-\alpha_{\mathsf{out}})\geq 1 to prove (ii).

Thus, factoring out the common term $n\sum_{s=0}^{S-1}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t),s}\big{\|}_{2}^{2}$ from ${\bm{{b}}}^{\prime(t)}$, the remaining coefficient in (35) can be bounded using (7) as

𝝇𝗈𝗎𝗍(𝑰2𝑮𝗈𝗎𝗍)1[8α𝗂𝗇4η4L4(1α𝗂𝗇)3p23α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍]\displaystyle~{}~{}~{}~{}\bm{\varsigma}_{\mathsf{out}}^{\top}({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1}\begin{bmatrix}\frac{8\alpha_{\mathsf{in}}^{4}\eta^{4}L^{4}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}\\[3.00003pt] \frac{3\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}\end{bmatrix}
(128(1α𝗂𝗇)2(Snpb+1)+32α𝗂𝗇2α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)2)8α𝗂𝗇4η4L4(1α𝗂𝗇)3p2+10α𝗂𝗇21α𝗈𝗎𝗍3α𝗈𝗎𝗍2Sη2L21α𝗈𝗎𝗍\displaystyle\leq\Bigg{(}\frac{128}{(1-\alpha_{\mathsf{in}})^{2}}\Big{(}\frac{S}{npb}+1\Big{)}+\frac{32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}\Bigg{)}\frac{8\alpha_{\mathsf{in}}^{4}\eta^{4}L^{4}}{(1-\alpha_{\mathsf{in}})^{3}p^{2}}+\frac{10\alpha_{\mathsf{in}}^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{3\alpha_{\mathsf{out}}^{2}S\eta^{2}L^{2}}{1-\alpha_{\mathsf{out}}}
=1024α𝗂𝗇4η4L4(1α𝗂𝗇)5p2(Snpb+1)+256α𝗂𝗇6α𝗈𝗎𝗍2η4L4(1α𝗂𝗇)4(1α𝗈𝗎𝗍)2p2+30α𝗂𝗇2α𝗈𝗎𝗍2npbS/(npb)(1α𝗈𝗎𝗍)2η2L2\displaystyle=\frac{1024\alpha_{\mathsf{in}}^{4}\eta^{4}L^{4}}{(1-\alpha_{\mathsf{in}})^{5}p^{2}}\Big{(}\frac{S}{npb}+1\Big{)}+\frac{256\alpha_{\mathsf{in}}^{6}\alpha_{\mathsf{out}}^{2}\eta^{4}L^{4}}{(1-\alpha_{\mathsf{in}})^{4}(1-\alpha_{\mathsf{out}})^{2}p^{2}}+\frac{30\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}npb\cdot S/(npb)}{(1-\alpha_{\mathsf{out}})^{2}}\cdot\eta^{2}L^{2}
11α𝗂𝗇4η2L2+3α𝗂𝗇6α𝗈𝗎𝗍2η2L2+30100\displaystyle\leq 11\alpha_{\mathsf{in}}^{4}\eta^{2}L^{2}+3\alpha_{\mathsf{in}}^{6}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}+\frac{30}{100}
<1125,\displaystyle<\frac{11}{25},

which concludes the proof.

Proof of Claim 2.

For simplicity, denote c=4α𝗂𝗇2η2L2(1α𝗂𝗇)2c=\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}} and d=4α𝗈𝗎𝗍21α𝗈𝗎𝗍d=\frac{4\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}}. Then 𝑮𝗈𝗎𝗍{\bm{{G}}}_{\mathsf{out}} can be written as

𝑮𝗈𝗎𝗍=[α𝗂𝗇cd(1+α𝗂𝗇)α𝗈𝗎𝗍+cd],\displaystyle{\bm{{G}}}_{\mathsf{out}}=\begin{bmatrix}\alpha_{\mathsf{in}}&c\\ d(1+\alpha_{\mathsf{in}})&\alpha_{\mathsf{out}}+cd\end{bmatrix},

whose characteristic polynomial is

f(λ)\displaystyle f(\lambda) =(α𝗂𝗇λ)(α𝗈𝗎𝗍+cdλ)(1+α𝗂𝗇)cd.\displaystyle=(\alpha_{\mathsf{in}}-\lambda)(\alpha_{\mathsf{out}}+cd-\lambda)-(1+\alpha_{\mathsf{in}})cd.

First, note that f(1)f(1) can be bounded by

f(1)\displaystyle f(1) =(α𝗂𝗇1)(α𝗈𝗎𝗍+cd1)(1+α𝗂𝗇)cd\displaystyle=(\alpha_{\mathsf{in}}-1)(\alpha_{\mathsf{out}}+cd-1)-(1+\alpha_{\mathsf{in}})cd
=(1α𝗂𝗇)(1α𝗈𝗎𝗍)2cd>0,\displaystyle=(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})-2cd>0,

where the last inequality is due to the choice of η\eta, namely,

cd\displaystyle cd =4α𝗈𝗎𝗍21α𝗈𝗎𝗍4α𝗂𝗇2η2L2(1α𝗂𝗇)216(1α𝗂𝗇)(1α𝗈𝗎𝗍).\displaystyle=\frac{4\alpha_{\mathsf{out}}^{2}}{1-\alpha_{\mathsf{out}}}\cdot\frac{4\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{2}}\leq\frac{1}{6}(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}}).

Combined with the facts that $f(-1)=(1+\alpha_{\mathsf{in}})(1+\alpha_{\mathsf{out}})>0$ and $f(\alpha_{\mathsf{in}})=-(1+\alpha_{\mathsf{in}})cd\leq 0$, and since $f$ is a monic quadratic, both eigenvalues of ${\bm{{G}}}_{\mathsf{out}}$ lie in $(-1,1)$. Consequently, the Neumann series converges, leading to

t=0𝑮𝗈𝗎𝗍t=(𝑰2𝑮𝗈𝗎𝗍)1\displaystyle\sum_{t=0}^{\infty}{\bm{{G}}}_{\mathsf{out}}^{t}=({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1} =[(1α𝗂𝗇)2(1α𝗈𝗎𝗍)216α𝗂𝗇2α𝗈𝗎𝗍2η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)232α𝗂𝗇2α𝗈𝗎𝗍2η2L24α𝗂𝗇2(1α𝗈𝗎𝗍)η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)232α𝗂𝗇2α𝗈𝗎𝗍2η2L24(1α𝗂𝗇)2(1+α𝗂𝗇)α𝗈𝗎𝗍2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)232α𝗂𝗇2α𝗈𝗎𝗍2η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)(1α𝗂𝗇)3(1α𝗈𝗎𝗍)232α𝗂𝗇2α𝗈𝗎𝗍2η2L2]\displaystyle=\begin{bmatrix}\frac{(1-\alpha_{\mathsf{in}})^{2}(1-\alpha_{\mathsf{out}})^{2}-16\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})^{2}-32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}}&\frac{4\alpha_{\mathsf{in}}^{2}(1-\alpha_{\mathsf{out}})\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})^{2}-32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}}\\[5.0pt] \frac{4(1-\alpha_{\mathsf{in}})^{2}(1+\alpha_{\mathsf{in}})\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})^{2}-32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}}&\frac{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})^{2}-32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}}\end{bmatrix}
[21α𝗂𝗇8α𝗂𝗇2η2L2(1α𝗂𝗇)3(1α𝗈𝗎𝗍)16α𝗈𝗎𝗍2(1α𝗂𝗇)(1α𝗈𝗎𝗍)221α𝗈𝗎𝗍],\displaystyle\leq\begin{bmatrix}\frac{2}{1-\alpha_{\mathsf{in}}}&\frac{8\alpha_{\mathsf{in}}^{2}\eta^{2}L^{2}}{(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})}\\[5.0pt] \frac{16\alpha_{\mathsf{out}}^{2}}{(1-\alpha_{\mathsf{in}})(1-\alpha_{\mathsf{out}})^{2}}&\frac{2}{1-\alpha_{\mathsf{out}}}\\ \end{bmatrix},

where we use the condition in (7) to show $32\alpha_{\mathsf{in}}^{2}\alpha_{\mathsf{out}}^{2}\eta^{2}L^{2}\leq\frac{32}{100}(1-\alpha_{\mathsf{in}})^{6}(1-\alpha_{\mathsf{out}})^{2}<\frac{1}{2}(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})^{2}$, which bounds the common denominator from below by $\frac{1}{2}(1-\alpha_{\mathsf{in}})^{3}(1-\alpha_{\mathsf{out}})^{2}$.
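
As a complementary numerical illustration of Claim 2, the following minimal Python sketch forms ${\bm{{G}}}_{\mathsf{out}}$ for illustrative values of $\alpha_{\mathsf{in}}$ and $\alpha_{\mathsf{out}}$, with a step-size scale $\eta L$ chosen to satisfy the bounds used in the proof (again an assumption, not the exact statement of (7)), and checks that the eigenvalues lie in $(-1,1)$ and that $({\bm{{I}}}_{2}-{\bm{{G}}}_{\mathsf{out}})^{-1}$ is dominated entrywise by the matrix stated in the claim.

import numpy as np

a_in, a_out = 0.6, 0.7                                      # illustrative alpha_in, alpha_out
etaL = (1 - a_in) ** 3 * (1 - a_out) / (10 * a_in * a_out)  # assumed step-size scale eta * L

c = 4 * a_in ** 2 * etaL ** 2 / (1 - a_in) ** 2
d = 4 * a_out ** 2 / (1 - a_out)
G_out = np.array([[a_in, c],
                  [d * (1 + a_in), a_out + c * d]])

# eigenvalues strictly inside (-1, 1), so the Neumann series converges
assert np.max(np.abs(np.linalg.eigvals(G_out))) < 1

neumann = np.linalg.inv(np.eye(2) - G_out)   # equals sum_{t >= 0} G_out^t
bound = np.array([[2 / (1 - a_in), 8 * a_in ** 2 * etaL ** 2 / ((1 - a_in) ** 3 * (1 - a_out))],
                  [16 * a_out ** 2 / ((1 - a_in) * (1 - a_out) ** 2), 2 / (1 - a_out)]])
assert np.all(neumann <= bound + 1e-12)
print("Claim 2 entrywise bound verified for these parameters")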