
Convergence and Privacy of Decentralized Nonconvex Optimization with Gradient Clipping and Communication Compression

Boyue Li
Carnegie Mellon University

Yuejie Chi
Carnegie Mellon University

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Emails: {boyuel, yuejiec}@andrew.cmu.edu.
Abstract

Achieving communication efficiency in decentralized machine learning has been attracting significant attention, with communication compression recognized as an effective technique in algorithm design. This paper takes a first step to understand the role of gradient clipping, a popular strategy in practice, in decentralized nonconvex optimization with communication compression. We propose PORTER, which considers two variants of gradient clipping added before or after taking a mini-batch of stochastic gradients, where the former variant PORTER-DP allows local differential privacy analysis with additional Gaussian perturbation, and the latter variant PORTER-GC helps to stabilize training. We develop a novel analysis framework that establishes their convergence guarantees without assuming the stringent bounded gradient assumption. To the best of our knowledge, our work provides the first convergence analysis for decentralized nonconvex optimization with gradient clipping and communication compression, highlighting the trade-offs between convergence rate, compression ratio, network connectivity, and privacy.

Keywords: communication compression, gradient clipping, convergence rate, local differential privacy

1 Introduction

Decentralized machine learning has been attracting significant attention in recent years; it can often be modeled as a nonconvex finite-sum optimization problem:

\min_{\bm{x}\in\mathbb{R}^{d}} f(\bm{x}) := \frac{1}{n}\sum_{i=1}^{n} f_{i}(\bm{x}), \qquad \text{where } f_{i}(\bm{x}) = \frac{1}{m}\sum_{\bm{z}\in\mathcal{Z}_{i}} \ell(\bm{x};\bm{z}), \qquad (1)

where $\bm{x}\in\mathbb{R}^{d}$ and $\bm{z}$ denote the optimization parameter and one data sample, $\ell(\bm{x};\bm{z})$ denotes the sample loss function that is nonconvex in $\bm{x}$, and $f_{i}(\bm{x})$ and $f(\bm{x})$ denote the local objective function at agent $i$ and the global objective function, respectively. In addition, $\mathcal{Z}_{i}$ denotes the dataset at agent $i$, $m=|\mathcal{Z}_{i}|$ denotes the local sample size, and $n$ denotes the number of agents. An undirected communication graph $\mathcal{G}$ is used to model the connectivity between any two agents, where there is an edge between agents $i$ and $j$ only if they can communicate. The goal is to efficiently optimize the global objective function $f(\bm{x})$ in a decentralized manner, subject to the network connectivity constraints specified by $\mathcal{G}$.

Communication efficiency is critical to decentralized optimization algorithms, as communication can quickly become the bottleneck of the system as the number of agents and the size of the model increase. This has led to the development of communication compression (or quantization) techniques, which can significantly reduce the communication burden per round by transferring compressed information, especially when the communication bandwidth is limited. Consequently, a number of recent works have focused on designing decentralized nonconvex optimization algorithms with communication compression, including but not limited to [KSJ19, SDGD21, ZLL+22, TLQ+19, HP23, YCC+23].

Building upon this line of work, this paper aims to understand the role of gradient clipping in decentralized nonconvex optimization algorithms with communication compression. On the one hand, gradient clipping has been widely used in privacy-preserving algorithms [ACG+16] to enable (local) differential privacy guarantees [Dwo08]. On the other hand, gradient clipping is also observed to be beneficial in stabilizing neural network training [ZHSJ20a]. However, since gradient clipping necessarily introduces bias, characterizing the convergence becomes much more challenging compared to the unclipped counterparts. As a result, most of the existing theoretical analyses for stochastic gradient algorithms with clipping — in the context of centralized and server/client settings — make strong assumptions such as the bounded gradient assumption [ACG+16, WJEG19, LZLC22] and the uniformly bounded gradient assumption [YZCL22, ZJFW20, ZHSJ20a]. To the best of our knowledge, the convergence of stochastic gradient algorithms with clipping in the decentralized setting has not been investigated before.

1.1 Our contributions

This paper proposes PORTER (cf. Algorithm 1),(1) a communication-efficient decentralized algorithm for nonconvex finite-sum optimization with gradient clipping and communication compression. PORTER is built on BEER [ZLL+22] — a fast decentralized algorithm with communication compression proposed recently — by introducing gradient clipping to the local stochastic gradient computation at agents, while inheriting desirable designs such as error feedback and stochastic gradient tracking that are crucial in enabling the fast convergence of BEER. PORTER considers two variants of gradient clipping, corresponding to adding it before or after taking a mini-batch of stochastic gradients. In particular, the former variant PORTER-DP allows local differential privacy (LDP) analysis with additional Gaussian perturbation, and the latter variant PORTER-GC helps to stabilize training. Assuming a smooth clipping operator (Definition 2) and general compression operators (Definition 3), the highlights of our contributions are as follows.

(1) The name is coined for two reasons: 1) PORTER has a strong connection to the prior algorithm BEER (porter is a kind of dark beer), and 2) the authors developed this algorithm in Porter Hall at Carnegie Mellon University.

  1.

    We establish that PORTER-DP (cf. Algorithm 1) achieves $(\epsilon,\delta)$-LDP under appropriate Gaussian perturbation. Under the bounded gradient assumption (when gradient clipping can be ignored), PORTER-DP converges in average squared $\ell_2$ gradient norm as $\frac{1}{T}\sum_{t\in[T]}\mathbb{E}\big\|\nabla f(\overline{\bm{x}}^{(t)})\big\|^2_2 \lesssim \rho^{-\frac{4}{3}}(1-\alpha)^{-\frac{8}{3}}\phi_{m}$ in $T=\phi_m^{-2}$ iterations, where $\overline{\bm{x}}^{(t)}$ is the average parameter, $\phi_m=\frac{\sqrt{d\log(1/\delta)}}{m\epsilon}$ is the baseline utility for a centralized stochastic algorithm to achieve $(\epsilon,\delta)$-DP with $m$ data samples [ACG+16], $\rho\in(0,1]$ is the compression ratio, and $\alpha\in[0,1)$ is the mixing rate of the topology.

  2.

    However, the bounded gradient assumption might be too stringent to hold in practice. Instead, we further establish that under the local variance and bounded dissimilarity assumptions, PORTER-DP converges in minimum $\ell_2$ gradient norm as $\min_{t\in[T]}\mathbb{E}\big\|\nabla f(\overline{\bm{x}}^{(t)})\big\|_2 \lesssim \rho^{-\frac{2}{3}}(1-\alpha)^{-\frac{4}{3}}\phi_{m}^{1/2}$ in $T=\phi_m^{-2}$ iterations.

  3.

    We establish that under the local variance and bounded dissimilarity assumptions, by properly choosing the mini-batch size, PORTER-GC converges in minimum $\ell_2$ gradient norm as $\min_{t\in[T]}\mathbb{E}\big\|\nabla f(\overline{\bm{x}}^{(t)})\big\|_2 \lesssim \rho^{-\frac{2}{3}}(1-\alpha)^{-\frac{4}{3}}T^{-1/2}$, which matches the convergence rate of classical centralized stochastic algorithms.

Our work develops a novel analysis framework that establishes these convergence guarantees without assuming the stringent bounded gradient assumption, highlighting comprehensive trade-offs between convergence rate, compression ratio, network connectivity, and privacy. To the best of our knowledge, our work provides the first private decentralized optimization algorithm with communication compression, and a systematic investigation of gradient clipping in the fully decentralized setting. Table 1 provides a detailed comparison of PORTER-DP with prior art on private server/client algorithms, where the bounded gradient assumption is in effect for all works except ours.

Table 1: Comparison of final utility upper bounds and communication complexities of different stochastic algorithms that achieve $(\epsilon,\delta)$-DP/LDP. The Big-$O$ notation (defined in Section 1.3) is omitted for simplicity. DP-SGD is a single-server optimization algorithm that serves as a baseline to show the overhead brought in by the distributed setting. DDP-SRM and Soteria-SGD are server/client distributed algorithms, but DDP-SRM doesn't use communication compression.

| Algorithm | Privacy | Compression operator | Bounded gradient | Utility | Communication rounds |
| DP-SGD [ACG+16] | DP | - | yes | $\phi_m$ | - |
| DDP-SRM [WJEG19] | DP | - | yes | $\frac{1}{n}\phi_m$ | $n^2 d\phi_m^{-1}$ |
| Soteria-SGD [LZLC22] (1) | LDP | Unbiased | yes | $(1+\theta^{1/2})\big(\frac{1+\omega}{n}\big)^{1/2}\phi_m$ | $(1+\theta^{1/2})\big(\frac{n}{1+\omega}\big)^{2/3}d\phi_m^{-1}$ |
| PORTER-DP (Algorithm 1) | LDP | General | yes | $\frac{1}{(1-\alpha)^{8/3}\rho^{4/3}}\phi_m$ | $\phi_m^{-2}$ |
| PORTER-DP (Algorithm 1) | LDP | General | no | $\frac{1}{(1-\alpha)^{16/3}\rho^{8/3}}\phi_m$ | $\phi_m^{-2}$ |

(1) $\theta=(1-\omega)^{-3/2}n^{-1/2}$, where $\omega$ is the parameter for unbiased compression.

1.2 Related works

Decentralized optimization algorithms have been extensively studied to solve large-scale optimization problems. We review the most closely related works in this section, and refer readers to the more comprehensive reviews in [NRB20, XPNK20].

Decentralized stochastic nonconvex optimization.

Decentralized stochastic algorithms have been an actively researched area in recent years. Various algorithms have been proposed by directly adapting existing centralized algorithms, e.g., [KDG03, XB04, Sha07, BJ13, LZZ+17, ALBR19, WJ21]. However, these simple adaptations usually fail to achieve better convergence rates. Gradient tracking [ZM10], originally proposed by the control theory community, can be used to track the global gradient at each agent, which leads to a simple systematic framework for extending existing centralized algorithms to the decentralized setting. Gradient tracking can be used for both deterministic optimization algorithms, e.g., [DLS16, NOS17, QL18, LCCC20], and stochastic algorithms, e.g., [SLH20, XKK22b, XKK22a, LLC22, HSZ+22, LY22].

Communication compression.

In [DSZOR15, AGL+17], gradient compression was adopted to create server/client distributed SGD algorithms; however, the large variance of compressed gradients leads to a sub-optimal convergence rate. [SFD+14] first proposed the use of error feedback to compensate for the variance induced by compression. [SCJ18, AHJ+18, MGTR19, LKQR20, GBLR21, LR21] all adopted similar mechanisms to improve convergence for server/client distributed optimization algorithms, and [RSF21, FSG+21] formalized the error feedback mechanism, reaching an $O(1/T)$ convergence rate for smooth nonconvex objective functions. [TGZ+18, KSJ19, SDGD21, TMHP20, ZLL+22, YCC+23, LLP23, ZBLR21] further extended communication compression schemes to the decentralized setting.

Private optimization algorithms.

The concern of leaking sensitive data has been increasing with the rapid development of large-scale machine learning systems. To address this concern, the concept of differential privacy is widely adopted [DMNS06, Dwo08], where a popular approach to protect privacy is adding noise to the model or gradients. This approach was first adopted in the single-server setting to design differentially private optimization algorithms [ACG+16, WYX17, INS+19, FKT20, CWH20, WXDX20], while [HDJ+20, ACCÖ21, NBD22, DLC+22, LZLC22, MS23, ZCH+22] considered differential privacy for the server/client distributed setting.

Gradient clipping.

Understanding gradient clipping has gained significant attention in recent years. Earlier works, e.g. [PMB13, BBP13, KLL16, KFI17, YGG17], used gradient clipping as a pure heuristic to solve gradient vanishing/exploding problems without theoretical understanding. Then, [ZHSJ20b, ZKV+20, ZJFW20, RLDJ23] introduced theoretical analyses to understand its impact on the convergence rate and training performance. This question is also investigated in [CWH20, ZCH+22, DKX+22, FLFL23], which apply gradient clipping to limit the magnitude of each sample gradient, so that the variance of privacy perturbation can be decided without the bounded gradient assumption. While finishing up this paper, we became aware of [KHS23], which also develops convergence guarantees on the minimum $\ell_2$ gradient norm of clipped stochastic gradient algorithms in the centralized setting with a piece-wise linear clipping operator. In contrast, our focus is on the decentralized setting with a smooth clipping operator, where extra care is taken to deal with the discrepancy between the local and global objective functions.

1.3 Paper organization and notation

Section 2 introduces preliminary concepts, Section 3 describes the algorithm development, Section 4 shows the theoretical performance guarantees for PORTER, Section 5 provides numerical evidence to support the analysis, and Section 6 concludes the paper. Proofs are postponed to appendices.

Throughout this paper, we use uppercase and lowercase boldface letters to represent matrices and vectors, respectively. We use $\|\cdot\|_{\mathsf{op}}$ for the matrix operator norm, $\|\cdot\|_{\mathsf{F}}$ for the Frobenius norm, $\bm{I}_n$ for the $n$-dimensional identity matrix, $\bm{1}_n$ for the $n$-dimensional all-one vector, and $\bm{0}_{d\times n}$ for the $(d\times n)$-dimensional zero matrix. For two real functions $f(\cdot)$ and $g(\cdot)$ defined on $\mathbb{R}^+$, we say $f(x)=O\big(g(x)\big)$ or $f(x)\lesssim g(x)$ if there exists some universal constant $M>0$ such that $f(x)\leq Mg(x)$. The notation $f(x)=\Omega\big(g(x)\big)$ or $f(x)\gtrsim g(x)$ means $g(x)=O\big(f(x)\big)$.

2 Preliminaries

Mixing.

The information mixing between agents is conducted by updating the local information via a weighted sum of information from neighbors, which is characterized by a mixing (gossiping) matrix. An important quantity associated with this matrix is the mixing rate, defined in Definition 1.

Definition 1 (Mixing matrix and mixing rate).

The mixing matrix is a matrix $\bm{W}=[w_{ij}]\in\mathbb{R}^{n\times n}$ such that $w_{ij}=0$ if agents $i$ and $j$ are not connected according to the communication graph $\mathcal{G}$. Furthermore, $\bm{W}\bm{1}_n=\bm{1}_n$ and $\bm{W}^\top\bm{1}_n=\bm{1}_n$. The mixing rate of a mixing matrix $\bm{W}$ is defined as

\alpha := \big\|\bm{W}-\tfrac{1}{n}\bm{1}_n\bm{1}_n^\top\big\|_{\mathsf{op}}. \qquad (2)

The mixing rate describes the connectivity of a communication graph and the speed of information sharing. Generally, a better-connected graph leads to a smaller mixing rate; for example, $\bm{W}$ can be the averaging matrix for a fully connected communication network, which results in $\alpha=0$. A comprehensive list of bounds on $1-\alpha$ is provided by [NOR18, Proposition 5]. Our analysis does not require the mixing matrix to be doubly stochastic, which allows us to use a non-symmetric matrix with negative entries as the mixing matrix, such as the FDLA matrix [XB04], which has a smaller mixing rate under the same connectivity pattern.
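As an illustration of Definition 1, the following sketch computes the mixing rate (2) for two standard topologies; the ring weights of $1/2$ on the diagonal and $1/4$ for each neighbor are a common choice used for illustration, not one prescribed by the paper.

```python
import numpy as np

def mixing_rate(W):
    """Mixing rate alpha = ||W - (1/n) * 1 1^T||_op from eq. (2)."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, ord=2)  # spectral norm

def ring_mixing_matrix(n):
    """Mixing matrix for a ring graph: self-weight 1/2, neighbor weights 1/4."""
    W = np.eye(n) / 2
    for i in range(n):
        W[i, (i - 1) % n] += 0.25
        W[i, (i + 1) % n] += 0.25
    return W

# the averaging matrix of a fully connected network gives alpha = 0,
# while a ring becomes less connected (alpha closer to 1) as n grows
print(mixing_rate(np.ones((5, 5)) / 5))     # 0 up to round-off
print(mixing_rate(ring_mixing_matrix(10)))  # ≈ 0.9045
```

Both matrices here are doubly stochastic for simplicity; as noted above, the analysis also permits non-symmetric matrices with negative entries.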

Gradient clipping.

In practice, gradient clipping is frequently adopted to ensure the gradients are within a predetermined region, so that the variance of privacy perturbation can be decided accordingly. The clipping operator we adopt is a smooth clipping operator [YZCL22] defined in Definition 2, which scales a vector into a ball of radius τ\tau centered at the origin.

Definition 2 (Smooth clipping operator).

For $\bm{x}\in\mathbb{R}^d$, the clipping operator is defined as

\mathsf{Clip}_\tau(\bm{x}) = \frac{\tau}{\tau+\|\bm{x}\|_2}\,\bm{x}.

For $\bm{X}=[\bm{x}_1,\ldots,\bm{x}_n]\in\mathbb{R}^{d\times n}$, the distributed clipping operator is defined as

\mathsf{Clip}_\tau(\bm{X}) = [\mathsf{Clip}_\tau(\bm{x}_1),\ldots,\mathsf{Clip}_\tau(\bm{x}_n)].
Remark 1.

Another widely used clipping operator is the piece-wise linear clipping operator, which scales the input whenever its norm is larger than $\tau$ and leaves it unchanged otherwise, defined by

\mathsf{Clip}_\tau(\bm{x}) = \bm{x}\min\big\{1, \tau/\|\bm{x}\|_2\big\}.

Figure 1 plots the norm of a vector before and after clipping for these two clipping operators, which shows that they behave quite similarly.

Figure 1: Illustration of input norm and clipped norm for the smooth clipping operator (Definition 2) and the piece-wise linear clipping operator, where $\tau$ is the clipping parameter.
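The similarity between the two operators can also be checked numerically; the sketch below is a minimal illustration (NumPy is used for convenience), showing that the smooth operator always shrinks its input strictly below $\tau$, while the piece-wise linear one projects exactly onto the $\tau$-ball.

```python
import numpy as np

def smooth_clip(x, tau):
    """Smooth clipping (Definition 2): always shrinks; output norm < tau."""
    return tau / (tau + np.linalg.norm(x)) * x

def piecewise_clip(x, tau):
    """Piece-wise linear clipping (Remark 1): identity inside the tau-ball."""
    return x * min(1.0, tau / np.linalg.norm(x))

x = np.array([3.0, 4.0])  # norm 5
print(np.linalg.norm(smooth_clip(x, 1.0)))     # 5/6, strictly below tau = 1
print(np.linalg.norm(piecewise_clip(x, 1.0)))  # exactly 1
```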
Compression operators.

Following [RSF21, FSG+21], Definition 3 defines a randomized general compression operator that only guarantees the expected compression error $\mathbb{E}\|\mathcal{C}(\bm{x})-\bm{x}\|_2^2$ is less than the magnitude of the original message $\|\bm{x}\|_2^2$.

Definition 3 (General compression operator).

A randomized map $\mathcal{C}:\mathbb{R}^d\to\mathbb{R}^d$ is a $\rho$-compression operator if $\forall\bm{x}\in\mathbb{R}^d$ and some $\rho\in[0,1]$, the following inequality holds:

\mathbb{E}\big\|\mathcal{C}(\bm{x})-\bm{x}\big\|_2^2 \leq (1-\rho)\|\bm{x}\|_2^2.

Many widely used compression schemes can be modeled as special cases, for example, random sparsification and top-$k$ compression.

Example 1 (Random sparsification).

Random sparsification keeps each element of a $d$-dimensional vector independently with probability $\frac{k}{d}$. Let $\bm{u}\in\mathbb{R}^d$ where $u_i\sim B\big(\frac{k}{d}\big)$; then random sparsification is defined as $\mathrm{random}_k(\bm{x})=\bm{u}\odot\bm{x}$, which satisfies Definition 3 with $\rho=\frac{k}{d}$.

Example 2 ($\mathrm{top}_k$).

$\mathrm{top}_k$ [AHJ+18, SCJ18] keeps the $k$ elements that have the largest absolute values and sets other elements to $0$, which is defined as $\mathrm{top}_k(\bm{x}):=\bm{x}\odot\bm{u}(\bm{x})$, where $[\bm{u}(\bm{x})]_i=1$ if the absolute value of the $i$-th element is one of the $k$ largest absolute values, and $[\bm{u}(\bm{x})]_i=0$ otherwise. It follows that $\mathrm{top}_k$ satisfies Definition 3 with $\rho=k/d$.
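Both examples take only a few lines to implement. The sketch below checks the error bound of Definition 3 for $\mathrm{top}_k$, where it holds deterministically; for random sparsification the bound holds only in expectation, so it is not asserted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_k(x, k):
    """Random sparsification: keep each coordinate independently w.p. k/d.
    Satisfies Definition 3 with rho = k/d, but only in expectation."""
    return x * (rng.random(x.shape[0]) < k / x.shape[0])

def top_k(x, k):
    """top-k: keep the k largest-magnitude entries, zero out the rest;
    the error bound of Definition 3 holds deterministically with rho = k/d."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

d, k = 10, 3
x = np.arange(1.0, d + 1.0)
c = top_k(x, k)                   # keeps the entries 8, 9, 10
err = np.linalg.norm(c - x) ** 2  # 1^2 + ... + 7^2 = 140
assert err <= (1 - k / d) * np.linalg.norm(x) ** 2  # 0.7 * 385 = 269.5
```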

Local differential privacy.

In decentralized learning systems, all agents share potentially sensitive information with their neighbors. If some agents are exploited by adversaries, the system will face a risk of privacy leakage even when the system-level privacy is protected. Therefore, we introduce local differential privacy (LDP) — defined in Definition 4 — following [DJW13, CABP13, XYD19], which protects each agent's privacy from leaking to other agents.

Definition 4 (Local differential privacy (LDP)).

A randomized mechanism $\mathcal{M}:\mathcal{Z}\to\mathcal{R}$ with domain $\mathcal{Z}$ and range $\mathcal{R}$ satisfies $(\epsilon,\delta)$-local differential privacy for client $i$ if, for any two neighboring datasets $\bm{Z}_i,\bm{Z}_i'\in\mathcal{Z}$ at client $i$ and for any subset of outputs $\bm{R}\subseteq\mathcal{R}$, it holds that

\mathbb{P}\big(\mathcal{M}(\bm{Z}_i)\in\bm{R}\big) \leq e^\epsilon\,\mathbb{P}\big(\mathcal{M}(\bm{Z}_i')\in\bm{R}\big) + \delta. \qquad (3)

The two datasets $\bm{Z}_i$ and $\bm{Z}_i'$ are called neighboring if they differ by only one data point at client $i$.

Definition 4 is a stricter privacy guarantee because it implies general differential privacy (DP). Consequently, LDP requires a larger perturbation variance than general DP. To identify the impact of the decentralized LDP setting compared to the centralized DP setting, we define the baseline utility

\phi_m = \frac{\sqrt{d\log(1/\delta)}}{m\epsilon}, \qquad (4)

which can be understood as the final utility of a centralized system with $m$ data samples that guarantees $(\epsilon,\delta)$-DP. For typical private problems, the local sample size $m$ has to be large enough for the privacy perturbation to deliver meaningful guarantees, so we impose a mild assumption that $\phi_m<1$ to simplify the presentation. For example, the problem defined in (1) has $mn$ data samples in total; running an $(\epsilon,\delta)$-DP algorithm on one server that can access all data will achieve $\frac{1}{n}\phi_m$ utility in $n\phi_m^{-1}$ iterations.

3 Proposed algorithm

We propose PORTER, a novel stochastic decentralized optimization algorithm for finding first-order stationary points of nonconvex finite-sum problems with gradient clipping and communication compression; the details are described in Algorithm 1. On a high level, PORTER is composed of local stochastic gradient updates and neighboring information sharing, following a similar structure as BEER [ZLL+22] in terms of the use of error feedback [RSF21], which accelerates the convergence with biased compression operators, and stochastic gradient tracking, which tracks the global gradient locally at each agent. A key difference, however, is the use of gradient clipping. Motivated by efficient training and privacy preservation, we consider two variants, corresponding to clipping before taking the mini-batch, with privacy perturbation (PORTER-DP), and clipping after taking the mini-batch (PORTER-GC), respectively.

Algorithm 1 PORTER
1: input: $\overline{\bm{x}}^{(0)}$, gradient stepsize $\eta$, consensus stepsize $\gamma$, clipping threshold $\tau$, mini-batch size $b$, perturbation noise $\sigma_p$, number of iterations $T$
2: initialize: $\bm{V}^{(0)}=\bm{Q}_v^{(0)}=\bm{G}_p^{(0)}=\bm{0}_{d\times n}$, $\bm{Q}_x^{(0)}=\bm{X}^{(0)}=\overline{\bm{x}}^{(0)}\bm{1}_n^\top$
3: for $t=1,\ldots,T$ do
4:     Draw the local mini-batches of size $b$ uniformly at random: $\mathcal{Z}^{(t)}=\{\mathcal{Z}_i^{(t)}\}_{i=1}^n$
5:     Option I: PORTER-DP (differentially-private SGD)
6:         $\bm{G}_\tau^{(t)} = \frac{1}{b}\sum_{\bm{Z}\in\mathcal{Z}^{(t)}}\mathsf{Clip}_\tau(\nabla\ell(\bm{X}^{(t-1)};\bm{Z}))$
7:         $\bm{G}_p^{(t)} = \bm{G}_\tau^{(t)} + \bm{E}^{(t)}$, where $\bm{e}_i^{(t)}\sim\mathcal{N}(\bm{0}_d,\sigma_p^2\bm{I}_d)$
8:     Option II: PORTER-GC (SGD with gradient clipping)
9:         $\bm{G}^{(t)} = \frac{1}{b}\sum_{\bm{Z}\in\mathcal{Z}^{(t)}}\nabla\ell(\bm{X}^{(t-1)};\bm{Z})$
10:         $\bm{G}_p^{(t)} = \bm{G}_\tau^{(t)} = \mathsf{Clip}_\tau(\bm{G}^{(t)})$
11:     $\bm{Q}_v^{(t)} = \bm{Q}_v^{(t-1)} + \mathcal{C}(\bm{V}^{(t-1)}-\bm{Q}_v^{(t-1)})$ ▷ Communication
12:     $\bm{V}^{(t)} = \bm{V}^{(t-1)} + \gamma\bm{Q}_v^{(t)}(\bm{W}-\bm{I}_n) + \bm{G}_p^{(t)} - \bm{G}_p^{(t-1)}$
13:     $\bm{Q}_x^{(t)} = \bm{Q}_x^{(t-1)} + \mathcal{C}(\bm{X}^{(t-1)}-\bm{Q}_x^{(t-1)})$ ▷ Communication
14:     $\bm{X}^{(t)} = \bm{X}^{(t-1)} + \gamma\bm{Q}_x^{(t)}(\bm{W}-\bm{I}_n) - \eta\bm{V}^{(t)}$
15: end for
16: output: $\bm{x}_{\mathrm{out}}\sim\mathrm{Uniform}(\{\bm{x}_i^{(t)} \mid i\in[n], t\in[T]\})$

Before proceeding, we introduce some notation convenient for describing decentralized algorithms. Let $\bm{x}_i\in\mathbb{R}^d$ be the optimization variable at agent $i$; we define the collection of all optimization variables as a matrix $\bm{X}=[\bm{x}_1,\bm{x}_2,\ldots,\bm{x}_n]\in\mathbb{R}^{d\times n}$, and the average as $\overline{\bm{x}}=\frac{1}{n}\sum_{i=1}^n\bm{x}_i$. The gradient estimates $\bm{V}$, stochastic gradients $\bm{G}$, perturbation noise $\bm{E}$, compressed surrogates $\bm{Q}_x$ and $\bm{Q}_v$, and their corresponding agent-wise values are defined analogously. The distributed gradient is defined as $\nabla F(\bm{X}) = \big[\nabla f_1(\bm{x}_1), \nabla f_2(\bm{x}_2), \ldots, \nabla f_n(\bm{x}_n)\big]\in\mathbb{R}^{d\times n}$.

To provide more detail, PORTER initializes gradient-related variables to $\bm{0}_d$ and other variables to the same random value $\overline{\bm{x}}^{(0)}$, which improves stability in early iterations and simplifies analysis, but has no impact on the convergence rates. Within each iteration, PORTER consists of 3 major steps.

  (1)

    Computing clipped stochastic gradients. We consider two options. The first option PORTER-DP corresponds to differentially-private SGD (lines 6-7), where line 6 computes a batch of clipped stochastic gradients on each agent, and then line 7 adds Gaussian noise to ensure privacy. The second option PORTER-GC corresponds to SGD with gradient clipping (lines 9-10), where line 9 computes a batch stochastic gradient on each agent, and line 10 applies clipping to each batch stochastic gradient.

  (2)

    Updating gradient estimates. Line 11 updates the auxiliary variable $\bm{Q}_v^{(t)}$, which is a compressed surrogate of $\bm{V}^{(t-1)}$, by adding the compressed difference $\mathcal{C}(\bm{V}^{(t-1)}-\bm{Q}_v^{(t-1)})$ to itself. Meanwhile, each agent $i$ sends its compressed difference $\mathcal{C}(\bm{v}_i^{(t-1)}-\bm{q}_{v,i}^{(t-1)})$ to all of its neighbors, so that every neighbor can reconstruct the auxiliary variable $\bm{q}_{v,i}^{(t)}$ by accumulating this difference. Line 12 then adds a correction term $\gamma\bm{Q}_v^{(t)}(\bm{W}-\bm{I}_n)$ and applies stochastic gradient tracking to update the gradient estimates.

  (3)

    Updating variable estimates. Similar to updating the gradient estimates, line 13 updates the auxiliary variable $\bm{Q}_x^{(t)}$, which is a compressed surrogate of the variable estimates $\bm{X}^{(t-1)}$, and communicates with neighbors. Line 14 applies the correction and updates the variable estimates by a gradient-style update.
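The three steps above can be sketched in NumPy as follows. This is a simplified single-iteration sketch of Option II (PORTER-GC), not the authors' implementation: it uses $\mathrm{top}_k$ as the compressor, the smooth clipping operator of Definition 2, and a deterministic full local gradient `grad_fn` standing in for the mini-batch stochastic gradient; the toy quadratic at the end and all constants are illustrative assumptions.

```python
import numpy as np

def top_k_cols(X, k):
    """Apply top-k compression (Definition 3, rho = k/d) to each column of X."""
    out = np.zeros_like(X)
    for j in range(X.shape[1]):
        idx = np.argsort(np.abs(X[:, j]))[-k:]
        out[idx, j] = X[idx, j]
    return out

def smooth_clip_cols(X, tau):
    """Column-wise smooth clipping operator (Definition 2)."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return tau / (tau + norms) * X

def porter_gc_step(X, V, Qx, Qv, Gp_prev, grad_fn, W, eta, gamma, tau, k):
    """One iteration of PORTER-GC (Algorithm 1, Option II), with the mini-batch
    stochastic gradient replaced by the deterministic local gradient grad_fn(X)."""
    n = W.shape[0]
    Gp = smooth_clip_cols(grad_fn(X), tau)               # lines 9-10
    Qv = Qv + top_k_cols(V - Qv, k)                      # line 11 (communication)
    V = V + gamma * Qv @ (W - np.eye(n)) + Gp - Gp_prev  # line 12
    Qx = Qx + top_k_cols(X - Qx, k)                      # line 13 (communication)
    X = X + gamma * Qx @ (W - np.eye(n)) - eta * V       # line 14
    return X, V, Qx, Qv, Gp

# toy run: f_i(x) = ||x - b_i||^2 / 2, so the local gradient is x_i - b_i and
# the global minimizer is (up to a small clipping bias) the average of the b_i
d, n, k = 20, 8, 10
B = np.random.default_rng(1).standard_normal((d, n))
W = np.full((n, n), 1.0 / n)  # fully connected topology (alpha = 0)
X = np.zeros((d, n)); V = np.zeros((d, n))
Qx = np.zeros((d, n)); Qv = np.zeros((d, n)); Gp = np.zeros((d, n))
for _ in range(2000):
    X, V, Qx, Qv, Gp = porter_gc_step(X, V, Qx, Qv, Gp, lambda Z: Z - B,
                                      W, eta=0.02, gamma=0.2, tau=100.0, k=k)
print(np.linalg.norm(X.mean(axis=1) - B.mean(axis=1)))  # small residual
```

Note that the column average of $\bm{Q}_v^{(t)}(\bm{W}-\bm{I}_n)$ vanishes whenever $\bm{W}^\top\bm{1}_n=\bm{1}_n$, so the average gradient estimate telescopes exactly even though the per-agent messages are compressed; this is what makes the gradient-tracking step compatible with compression.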

4 Theoretical guarantees

This section theoretically analyzes the privacy and convergence properties of PORTER under various assumptions. Section 4.1 lists all assumptions required for the convergence analysis, Section 4.2 shows the privacy and convergence of PORTER-DP using a specific perturbation variance, and Section 4.3 shows the convergence of PORTER-GC, which corresponds to clipped SGD without privacy.

4.1 Assumptions

We start with the smoothness assumption in Assumption 1, which is standard and required for all of our analysis.

Assumption 1 (LL-smoothness).

For any $\bm{x},\bm{y}\in\mathbb{R}^d$ and any datum $\bm{z}$ in dataset $\mathcal{Z}$,

\|\nabla\ell(\bm{x};\bm{z})-\nabla\ell(\bm{y};\bm{z})\|_2 \leq L\|\bm{x}-\bm{y}\|_2.

Note that the gradient clipping operator $\mathsf{Clip}_\tau(\cdot)$ is utilized to ensure gradients are bounded. In addition, the boundedness of the gradient enables the application of differentially-private mechanisms. However, stochastic gradients at different agents lose correct scaling after clipping, which breaks the stationary-point property at local minima and introduces bias. To simplify analysis, one assumption that has been widely adopted in theoretical analyses [LZLC22] is the following bounded gradient assumption.

Assumption 2 (Bounded gradient).

For any $\bm{x}\in\mathbb{R}^d$ and any datum $\bm{z}$ in dataset $\mathcal{Z}$, $\|\nabla\ell(\bm{x};\bm{z})\|_2\leq\tau$.

Under Assumption 2, PORTER can skip the clipping operator, and $\bm{g}_{\tau_i}^{(t)}$ becomes an unbiased estimator of the local gradient $\nabla f_i(\bm{x}_i^{(t)})$, while still allowing privacy analysis. However, Assumption 2 is rather strong and seldom met in practice. For example, the gradient of a quadratic loss function is not bounded. In addition, it may result in an overly pessimistic clipping operation when there are possibly adversarial gradients with large norms in the samples. Going beyond the strong bounded gradient assumption, we consider a much milder assumption that bounds the local variance as follows, which is more standard in the analysis of unclipped stochastic algorithms.

Assumption 3 (Bounded local variance).

For any $\bm{x}\in\mathbb{R}^d$ and $i\in[n]$,

\mathbb{E}_{\bm{z}\sim\mathcal{Z}_i}\|\nabla\ell(\bm{x};\bm{z})-\nabla f_i(\bm{x})\|_2^2 \leq \sigma_g^2.

An additional challenge arises from the decentralized environment, where the local objective functions can be rather distinct from the global one. Our analysis identifies the following assumption, called bounded gradient dissimilarity, which says that the difference between the local gradient and the global gradient should be small relative to the global one.

Assumption 4 (Bounded gradient dissimilarity).

For any $\bm{x}\in\mathbb{R}^d$ and $i\in[n]$,

\big\|\nabla f(\bm{x})-\nabla f_i(\bm{x})\big\|_2 \leq \frac{1}{12}\big\|\nabla f(\bm{x})\big\|_2.

4.2 Privacy and convergence guarantees of PORTER-DP

We start by analyzing the privacy and convergence guarantees of PORTER-DP, assuming the batch size $b=1$.

Privacy guarantee.

Theorem 1 proves that PORTER-DP is $(\epsilon,\delta)$-LDP when the variance of the Gaussian perturbation is set properly.

Theorem 1 (Local differential privacy).

Let $\phi_m=\frac{\sqrt{d\log(1/\delta)}}{m\epsilon}$ and $b=1$. For any $\epsilon\leq T/m^2$ and $\delta\in(0,1)$, PORTER-DP is $(\epsilon,\delta)$-LDP after $T$ iterations if we set

\sigma_p^2 = \frac{T\tau^2\log(1/\delta)}{m^2\epsilon^2} = T\tau^2\phi_m^2/d. \qquad (5)

Theorem 1 shows that PORTER-DP can achieve $(\epsilon,\delta)$-LDP regardless of whether the bounded gradient assumption holds, because the clipping operator $\mathsf{Clip}_\tau(\cdot)$ guarantees that the $\ell_2$ norms of all stochastic gradients are bounded by $\tau$, so that the perturbation variance can be set accordingly.
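For concreteness, the calibration in (5) and the baseline utility (4) can be evaluated directly; the sketch below only computes the closed-form expressions, and the numerical values of $d$, $m$, $\epsilon$, $\delta$, $\tau$ are illustrative assumptions rather than recommended settings.

```python
import math

def baseline_utility(d, m, eps, delta):
    """Baseline utility phi_m = sqrt(d * log(1/delta)) / (m * eps), eq. (4)."""
    return math.sqrt(d * math.log(1.0 / delta)) / (m * eps)

def porter_dp_sigma(T, tau, m, eps, delta, d):
    """Perturbation std from eq. (5):
    sigma_p^2 = T tau^2 log(1/delta) / (m eps)^2 = T tau^2 phi_m^2 / d."""
    sigma_sq = T * tau ** 2 * math.log(1.0 / delta) / (m * eps) ** 2
    phi_m = baseline_utility(d, m, eps, delta)
    # the two forms in eq. (5) agree
    assert abs(sigma_sq - T * tau ** 2 * phi_m ** 2 / d) < 1e-9 * sigma_sq
    return math.sqrt(sigma_sq)

# illustrative constants: d parameters, m local samples, (eps, delta) budget
d, m, eps, delta, tau = 10_000, 10_000, 1.0, 1e-5, 1.0
phi_m = baseline_utility(d, m, eps, delta)
T = round(phi_m ** -2)  # iteration budget used by Theorems 2 and 3
print(phi_m, T, porter_dp_sigma(T, tau, m, eps, delta, d))
```

Note how the noise standard deviation grows as $\sqrt{T}$: running more iterations under a fixed $(\epsilon,\delta)$ budget requires proportionally stronger per-iteration perturbation.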

Convergence with bounded gradient assumption.

We start by analyzing the convergence when the gradients are bounded under Assumption 2, in which case PORTER-DP can omit the clipping operator. Theorem 2 presents the convergence result of PORTER-DP using general compression operators (Definition 3).

Theorem 2 (Convergence of PORTER-DP with bounded gradient assumption).

Assume Assumptions 1 and 2 hold, and use general compression operators (Definition 3). Let Δ=𝔼[f(𝐱¯(0))]f\Delta={\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{\star}. Set γ=O((1α)ρ)\gamma=O\big{(}(1-\alpha)\rho\big{)}, η=O(γ4/3ρ4/3ϕm)\eta=O\big{(}\gamma^{4/3}\rho^{4/3}\phi_{m}\big{)}, T=ϕm2T=\phi_{m}^{-2}, b=1b=1 and σp2=Tτ2ϕm2/d\sigma_{p}^{2}=T\tau^{2}\phi_{m}^{2}/d. PORTER-DP converges in average squared 2\ell_{2} gradient norm as

1Tt=1T𝔼f(𝒙¯(t))22\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}^{2} ϕmρ43(1α)83max{τ2,LΔ}.\displaystyle\lesssim\frac{\phi_{m}}{\rho^{\frac{4}{3}}(1-\alpha)^{\frac{8}{3}}}\cdot\max\big{\{}\tau^{2},L\Delta\big{\}}. (6)

Theorem 2 shows the convergence error of the squared 2\ell_{2} gradient norm with explicit dependency on the compression ratio ρ\rho and mixing rate α\alpha. When we fix ρ\rho and α\alpha, Theorem 2 reaches an O(ϕm)O(\phi_{m}) final average squared 2\ell_{2} gradient norm, which matches the result of SoteriaFL-SGD [LZLC22], the state-of-the-art stochastic algorithm with local differential privacy guarantees and unbiased communication compression in the server-client setting. However, due to extra complexities induced by the decentralized setting and biased compression, PORTER-DP takes O(ϕm2)O(\phi_{m}^{-2}) iterations to converge while SoteriaFL-SGD only takes O(ϕm1)O(\phi_{m}^{-1}) iterations; in addition, PORTER-DP has a slightly worse dependency on the compression ratio ρ\rho.

Convergence without bounded gradient assumption.

A more interesting and challenging scenario arises when Assumption 2 does not hold, in which case PORTER-DP applies gradient clipping to ensure the gradients are bounded as required by the privacy mechanism. Fortunately, Theorem 3 characterizes the convergence behavior of Algorithm 1 in this case under the much weaker bounded local variance and bounded gradient dissimilarity assumptions.

Theorem 3 (Convergence of PORTER-DP without bounded gradient assumption).

Assume Assumptions 1, 3 and 4 hold, and use general compression operators (Definition 3). Let Δ=𝔼[f(𝐱¯(0))]f\Delta={\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{\star}. Set τ=max{365ρ43(1α)83ϕm1/2,24σg}\tau=\max\big{\{}365\rho^{-\frac{4}{3}}(1-\alpha)^{-\frac{8}{3}}\phi_{m}^{1/2},24\sigma_{g}\big{\}}, γ=O((1α)ρ)\gamma=O((1-\alpha)\rho), η=O(L1)\eta=O\big{(}L^{-1}\big{)}, b=1b=1, T=ϕm2T=\phi_{m}^{-2} and σp2=Tτ2ϕm2/d\sigma_{p}^{2}=T\tau^{2}\phi_{m}^{2}/d. Algorithm 1 converges in minimum 2\ell_{2} gradient norm as

mint[T]𝔼f(𝒙¯(t))2max{ρ43(1α)83(LΔϕm)1/2,σg}.\displaystyle\min_{t\in[T]}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\lesssim\max\big{\{}\rho^{-\frac{4}{3}}(1-\alpha)^{-\frac{8}{3}}(L\Delta\phi_{m})^{1/2},~{}\sigma_{g}\big{\}}. (7)

Theorem 3 shows PORTER-DP converges in minimum 2\ell_{2} gradient norm with explicit dependency on compression ratio ρ\rho, mixing rate α\alpha and gradient variance σg\sigma_{g}, under much weaker assumptions. To compare with Theorem 2, we can take the square root of (6), which translates to minimum 2\ell_{2} gradient norm convergence on the order of O(ρ2/3(1α)4/3ϕm1/2max{τ,(LΔ)1/2}).O\big{(}\rho^{-2/3}(1-\alpha)^{-4/3}\phi_{m}^{1/2}\cdot\max\{\tau,(L\Delta)^{1/2}\}\big{)}. In comparison, although Theorem 3 has worse dependency on compression ratio ρ\rho and mixing rate α\alpha, it matches the dependency on the baseline privacy loss ϕm\phi_{m}.

4.3 Convergence guarantees of PORTER-GC

Theorem 4 further establishes the convergence of PORTER-GC without the bounded gradient assumption, which applies the clipping operator to mini-batch stochastic gradients without privacy perturbation.

Theorem 4 (Convergence of PORTER-GC without bounded gradient assumption).

Assume Assumptions 1, 3 and 4 hold, and use general compression operators (Definition 3). Let Δ=𝔼[f(𝐱¯(0))]f\Delta={\mathbb{\bm{E}}}[f({\overline{\bm{x}}}^{(0)})]-f^{\star}. Set τ=O(ρ23(1α)43T1/2)\tau=O\big{(}\rho^{-\frac{2}{3}}(1-\alpha)^{-\frac{4}{3}}T^{-1/2}\big{)}, γ=O((1α)ρ)\gamma=O\big{(}(1-\alpha)\rho\big{)}, η=O(L1)\eta=O(L^{-1}) and b=O(σg2ν2)b=O(\sigma_{g}^{2}\nu^{-2}). PORTER-GC converges in minimum 2\ell_{2} gradient norm as

mint[T]𝔼f(𝒙¯(t))21ρ23(1α)431T1/2.\displaystyle\min_{t\in[T]}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\lesssim\frac{1}{\rho^{\frac{2}{3}}(1-\alpha)^{\frac{4}{3}}}\cdot\frac{1}{T^{1/2}}.

Theorem 4 suggests that by picking the clipping threshold τ\tau and batch size bb properly, PORTER-GC converges at an O(1/T1/2)O(1/T^{1/2}) rate. In comparison, standard centralized SGD converges in average squared 2\ell_{2} gradient norm at an O(1/T)O(1/T) rate, which also translates to a minimum 2\ell_{2} gradient norm convergence in the form of mint[T]𝔼f(𝒙(t))21/T1/2\min_{t\in[T]}{\mathbb{\bm{E}}}\big{\|}\nabla f({\bm{{x}}}^{(t)})\big{\|}_{2}\lesssim 1/T^{1/2}. Therefore, using gradient clipping for decentralized SGD does not affect the convergence rate, provided the hyperparameters are chosen properly.
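The conversion used here, from an average squared norm bound to a minimum norm bound, is simply Jensen's inequality: min_t E‖∇f(x_t)‖₂ ≤ ((1/T)Σ_t E‖∇f(x_t)‖₂²)^{1/2}. A quick numeric check with illustrative values:

```python
import math

norms = [0.9, 0.4, 0.7, 0.2]        # stand-ins for E||grad f(x_t)||_2, t = 1..T
rms = math.sqrt(sum(v * v for v in norms) / len(norms))
# The minimum over iterations never exceeds the root-mean-square, so an
# O(1/T) bound on the average squared norm yields an O(1/sqrt(T)) bound
# on the minimum norm.
assert min(norms) <= rms
```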

When the gradients are bounded, we can omit the clipping operator in PORTER-GC, which becomes identical to BEER [ZLL+22]. Recall that BEER guarantees a minimum 2\ell_{2} gradient norm convergence at the rate 1ρ1/2(1α)3/21T1/2\frac{1}{\rho^{1/2}(1-\alpha)^{3/2}}\cdot\frac{1}{T^{1/2}}. In comparison, Theorem 4 has a better dependency on the mixing rate α\alpha, but a slightly worse dependency on the compression ratio ρ\rho, which again emphasizes that gradient clipping does not harm convergence.

5 Numerical experiments

This section presents numerical experiments to examine the performance of PORTER-DP, with comparison to the state-of-the-art server/client private stochastic algorithm SoteriaFL-SGD, which also utilizes communication compression and guarantees local differential privacy. More specifically, we evaluate the convergence of utility and accuracy in terms of communication rounds and communication bits, to analyze the privacy-utility-communication trade-offs of different algorithms.

For all experiments, we split shuffled datasets evenly across 1010 agents that are connected by an Erdős-Rényi random graph with connection probability p=0.8p=0.8. We use the FDLA matrix [XB04] as the mixing matrix to perform weighted information aggregation and accelerate convergence. We use biased random sparsification (cf. Example 1) for all algorithms with k=d20k=\lfloor\frac{d}{20}\rfloor, i.e., the compressor randomly selects 5% of the elements of each vector. We also apply gradient clipping with τ=1\tau=1 to all algorithms for simplicity. For each experiment, all algorithms are initialized at the same starting point, and use best-tuned learning rates, batch size 11 and σp=τTlog(1/δ)mϵ\sigma_{p}=\frac{\tau\sqrt{T\log(1/\delta)}}{m\epsilon}.
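The biased random sparsification compressor used in the experiments (random_k, cf. Example 1) can be sketched as follows, where k = ⌊d/20⌋ keeps 5% of the coordinates; the function name is ours.

```python
import random

def random_k(x, k, rng=random):
    """Biased random-k sparsification: keep k randomly chosen coordinates,
    zero out the rest. No rescaling is applied, so the compressor is biased."""
    keep = set(rng.sample(range(len(x)), k))
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

d = 40
x = [float(i + 1) for i in range(d)]
y = random_k(x, k=d // 20)          # keeps 5% of the entries (2 of 40)
assert sum(1 for v in y if v != 0.0) == d // 20
```

For this compressor, a standard calculation gives E‖C(x) − x‖₂² = (1 − k/d)‖x‖₂², so one can take the compression ratio ρ = k/d in Definition 3.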

5.1 Logistic regression with nonconvex regularization

We run experiments on logistic regression with nonconvex regularization on the a9a dataset [CL11]. Following [WJZ+19], the objective function is defined as

(𝒙;{𝒇,l})=log(1+exp(l𝒙𝒇))+λi=1dxi21+xi2,\displaystyle\ell({\bm{{x}}};\{\bm{f},l\})=\log\left(1+\exp(-l{\bm{{x}}}^{\top}\bm{f})\right)+\lambda\sum\limits_{i=1}^{d}\frac{x_{i}^{2}}{1+x_{i}^{2}},

where {𝒇,l}\{\bm{f},l\} represents a training tuple, 𝒇d\bm{f}\in{\mathbb{\bm{R}}}^{d} is the feature vector and l{1,1}l\in\{-1,1\} is the label, and λ\lambda is the regularization parameter, which is set to λ=0.2\lambda=0.2.
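A direct implementation of this objective (assuming the ±1 label convention of [WJZ+19]; the function name is ours):

```python
import math

def loss(x, f, l, lam=0.2):
    """Logistic loss with the nonconvex regularizer lam * sum_i x_i^2 / (1 + x_i^2)."""
    margin = l * sum(xi * fi for xi, fi in zip(x, f))
    reg = lam * sum(xi * xi / (1.0 + xi * xi) for xi in x)
    return math.log(1.0 + math.exp(-margin)) + reg

# At x = 0 the regularizer vanishes and the logistic term equals log 2.
assert math.isclose(loss([0.0, 0.0], [1.0, -2.0], l=1), math.log(2.0))
```

The regularizer is bounded by λd and smooth but nonconvex, which is what makes the overall problem a standard nonconvex test case.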

Figure 2: The train utility and test accuracy vs. communication rounds for logistic regression with nonconvex regularization on the a9a dataset when guaranteeing (a) (102,103)(10^{-2},10^{-3})-LDP and (b) (101,103)(10^{-1},10^{-3})-LDP, respectively. Both PORTER-DP and SoteriaFL-SGD employ random162\text{random}_{162} compression (cf. Example 1).

Figure 2 shows the convergence results of PORTER-DP and SoteriaFL-SGD for logistic regression with nonconvex regularization on the a9a dataset to reach (102,103)(10^{-2},10^{-3})-LDP and (101,103)(10^{-1},10^{-3})-LDP, respectively. Under (102,103)(10^{-2},10^{-3})-LDP, which is the stricter privacy setting, PORTER-DP converges faster than SoteriaFL-SGD in test accuracy, while converging slightly slower in train utility. Under (101,103)(10^{-1},10^{-3})-LDP, PORTER-DP performs slightly worse than SoteriaFL-SGD. Given that PORTER-DP operates under a decentralized topology with much weaker information exchange, the results highlight PORTER-DP's communication efficiency: it achieves similar performance as its server/client counterpart, i.e., SoteriaFL-SGD, especially under strict privacy constraints.

5.2 One-hidden-layer neural network training

We evaluate PORTER-DP by training a one-hidden-layer neural network on the MNIST dataset [LJB+95]. The network uses 6464 hidden neurons, sigmoid activation functions and the cross-entropy loss, where the loss over a training pair {𝒇,l}\{\bm{f},l\} is defined as

(𝒙;(𝒇,l))=𝖢𝗋𝗈𝗌𝗌𝖤𝗇𝗍𝗋𝗈𝗉𝗒(𝗌𝗈𝖿𝗍𝗆𝖺𝗑(𝑾2𝗌𝗂𝗀𝗆𝗈𝗂𝖽(𝑾1𝒇+𝒄1)+𝒄2),l).\displaystyle\ell({\bm{{x}}};(\bm{f},l))=\mathsf{CrossEntropy}(\mathsf{softmax}(\bm{W}_{2}~{}\mathsf{sigmoid}(\bm{W}_{1}\bm{f}+\bm{c}_{1})+\bm{c}_{2}),l).

Here, the model parameter is defined by 𝒙=vec(𝑾1,𝒄1,𝑾2,𝒄2){\bm{{x}}}=\text{vec}(\bm{W}_{1},\bm{c}_{1},\bm{W}_{2},\bm{c}_{2}), where the dimensions of the network parameters 𝑾1\bm{W}_{1}, 𝒄1\bm{c}_{1}, 𝑾2\bm{W}_{2}, 𝒄2\bm{c}_{2} are 64×78464\times 784, 64×164\times 1, 10×6410\times 64, and 10×110\times 1, respectively.
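For reference, the total dimension of the vectorized parameter x implied by these shapes can be tallied directly (a quick check, not code from the paper):

```python
# Shapes of (W1, c1, W2, c2) as stated above.
shapes = {"W1": (64, 784), "c1": (64, 1), "W2": (10, 64), "c2": (10, 1)}

# Dimension of x = vec(W1, c1, W2, c2).
d = sum(rows * cols for rows, cols in shapes.values())
print(d)  # -> 50890
```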

Figure 3: The train utility and test accuracy vs. communication rounds for training a one-hidden-layer neural network on the MNIST dataset when guaranteeing (a) (102,103)(10^{-2},10^{-3})-LDP and (b) (101,103)(10^{-1},10^{-3})-LDP, respectively. Both PORTER-DP and SoteriaFL-SGD employ random2583\text{random}_{2583} compression (cf. Example 1).

Figure 3 shows the convergence results of PORTER-DP and SoteriaFL-SGD for training a one-hidden-layer neural network on the MNIST dataset to reach (102,103)(10^{-2},10^{-3})-LDP and (101,103)(10^{-1},10^{-3})-LDP, respectively. Under both privacy settings, PORTER-DP converges at a similar rate as SoteriaFL-SGD in train utility. However, in terms of convergence in test accuracy, PORTER-DP outperforms SoteriaFL-SGD under the stricter (102,103)(10^{-2},10^{-3})-LDP, while the two algorithms have similar performance under the other setting. This experiment again emphasizes PORTER-DP’s communication efficiency in comparison to the server/client algorithm SoteriaFL-SGD.

6 Conclusions

In this paper, we propose an algorithmic framework called PORTER, which simultaneously incorporates the practically-relevant techniques of gradient clipping and communication compression in the design of decentralized nonconvex optimization algorithms. We propose two variants: PORTER-DP and PORTER-GC. While they share a similar structure that makes use of gradient tracking, communication compression, and error feedback, they focus on different perspectives motivated by applications in privacy preservation and neural network training, respectively. PORTER-DP adds privacy perturbation to clipped gradients to protect the local differential privacy of each agent, with explicit utility and communication complexities. PORTER-GC applies gradient clipping to mini-batch stochastic gradients, and converges in minimum 2\ell_{2} gradient norm at a similar rate as centralized SGD without clipping under proper choices of hyperparameters. The development of PORTER offers a simple analysis framework to understand gradient clipping in decentralized nonconvex optimization without bounded gradient assumptions, highlighting the potential of achieving both communication efficiency and privacy preservation in the decentralized framework.

Acknowledgements

The work of B. Li and Y. Chi is supported in part by ONR under the grant N00014-19-1-2404, by AFRL under FA8750-20-2-0504, and by NSF under the grants CCF-1901199, CCF-2007911 and CNS-2148212. B. Li is also gratefully supported by the Wei Shen and Xuehong Zhang Presidential Fellowship at Carnegie Mellon University.

References

  • [ACCÖ21] S. Asoodeh, W.-N. Chen, F. P. Calmon, and A. Özgür. Differentially private federated learning: An information-theoretic perspective. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 344–349, July 2021.
  • [ACG+16] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, October 2016.
  • [AGL+17] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [AHJ+18] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [ALBR19] M. Assran, N. Loizou, N. Ballas, and M. Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pages 344–353. PMLR, 2019.
  • [BBP13] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8624–8628, May 2013.
  • [BJ13] P. Bianchi and J. Jakubowicz. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control, 58(2):391–405, February 2013.
  • [CABP13] K. Chatzikokolakis, M. E. Andrés, N. E. Bordenabe, and C. Palamidessi. Broadening the scope of differential privacy using metrics. In E. De Cristofaro and M. Wright, editors, Privacy Enhancing Technologies, Lecture Notes in Computer Science, pages 82–102, Berlin, Heidelberg, 2013. Springer.
  • [CL11] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, May 2011.
  • [CWH20] X. Chen, S. Z. Wu, and M. Hong. Understanding gradient clipping in private SGD: A geometric perspective. In Advances in Neural Information Processing Systems, volume 33, pages 13773–13782. Curran Associates, Inc., 2020.
  • [DJW13] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438, October 2013.
  • [DKX+22] R. Das, S. Kale, Z. Xu, T. Zhang, and S. Sanghavi. Beyond uniform Lipschitz condition in differentially private optimization. arXiv preprint arXiv:2206.10713, 2022.
  • [DLC+22] J. Du, S. Li, X. Chen, S. Chen, and M. Hong. Dynamic differential-privacy preserving sgd. arXiv preprint arXiv:2111.00173, 2022.
  • [DLS16] P. Di Lorenzo and G. Scutari. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.
  • [DMNS06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In S. Halevi and T. Rabin, editors, Theory of Cryptography, Lecture Notes in Computer Science, pages 265–284, Berlin, Heidelberg, 2006. Springer.
  • [DSZOR15] C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. Advances in Neural Information Processing Systems, 28, 2015.
  • [Dwo08] C. Dwork. Differential privacy: A survey of results. In M. Agrawal, D. Du, Z. Duan, and A. Li, editors, Theory and Applications of Models of Computation, Lecture Notes in Computer Science, pages 1–19, Berlin, Heidelberg, 2008. Springer.
  • [FKT20] V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 439–449, New York, NY, USA, June 2020. Association for Computing Machinery.
  • [FLFL23] H. Fang, X. Li, C. Fan, and P. Li. Improved convergence of differential private SGD with gradient clipping. In The Eleventh International Conference on Learning Representations, 2023.
  • [FSG+21] I. Fatkhullin, I. Sokolov, E. Gorbunov, Z. Li, and P. Richtárik. EF21 with bells & whistles: Practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294, 2021.
  • [GBLR21] E. Gorbunov, K. P. Burlachenko, Z. Li, and P. Richtarik. MARINA: Faster non-convex distributed learning with compression. In Proceedings of the 38th International Conference on Machine Learning, pages 3788–3798. PMLR, July 2021.
  • [HDJ+20] X. Huang, Y. Ding, Z. L. Jiang, S. Qi, X. Wang, and Q. Liao. DP-FL: A novel differentially private federated learning framework for the unbalanced data. World Wide Web, 23(4):2529–2545, July 2020.
  • [HP23] K. Huang and S. Pu. CEDAS: A compressed decentralized stochastic gradient method with improved convergence. arXiv preprint arXiv:2301.05872, 2023.
  • [HSZ+22] Y. Huang, Y. Sun, Z. Zhu, C. Yan, and J. Xu. Tackling data heterogeneity: A new unified framework for decentralized sgd with sample-induced topology. In Proceedings of the 39th International Conference on Machine Learning, pages 9310–9345. PMLR, June 2022.
  • [INS+19] R. Iyengar, J. P. Near, D. Song, O. Thakkar, A. Thakurta, and L. Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 299–316, May 2019.
  • [KDG03] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 482–491, October 2003.
  • [KFI17] S. Kanai, Y. Fujiwara, and S. Iwamura. Preventing gradient explosions in gated recurrent units. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [KHS23] A. Koloskova, H. Hendrikx, and S. U. Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 23–29 Jul 2023.
  • [KLL16] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
  • [KSJ19] A. Koloskova, S. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3478–3487. PMLR, 09–15 Jun 2019.
  • [LCCC20] B. Li, S. Cen, Y. Chen, and Y. Chi. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. Journal of Machine Learning Research, 21(180):1–51, 2020.
  • [LJB+95] Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, and P. Simard. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective, 261(276):2, 1995.
  • [LKQR20] Z. Li, D. Kovalev, X. Qian, and P. Richtarik. Acceleration for compressed gradient descent in distributed and federated optimization. In Proceedings of the 37th International Conference on Machine Learning, pages 5895–5904. PMLR, November 2020.
  • [LLC22] B. Li, Z. Li, and Y. Chi. DESTRESS: Computation-optimal and communication-efficient decentralized nonconvex finite-sum optimization. SIAM Journal on Mathematics of Data Science, 4(3):1031–1051, September 2022.
  • [LLP23] Y. Liao, Z. Li, and S. Pu. A linearly convergent robust compressed Push-Pull method for decentralized optimization. arXiv preprint arXiv:2303.07091, 2023.
  • [LR21] Z. Li and P. Richtarik. CANITA: Faster rates for distributed convex optimization with communication compression. In Advances in Neural Information Processing Systems, volume 34, pages 13770–13781. Curran Associates, Inc., 2021.
  • [LY22] L. Luo and H. Ye. An optimal stochastic algorithm for decentralized nonconvex finite-sum optimization. arXiv preprint arXiv:2210.13931, 2022.
  • [LZLC22] Z. Li, H. Zhao, B. Li, and Y. Chi. SoteriaFL: A unified framework for private federated learning with communication compression. In Advances in Neural Information Processing Systems, volume 35, pages 4285–4300. Curran Associates, Inc., 2022.
  • [LZZ+17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • [MGTR19] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
  • [MS23] T. Murata and T. Suzuki. DIFF2: Differential private optimization via gradient differences for nonconvex distributed learning. arXiv preprint arXiv:2302.03884, 2023.
  • [NBD22] M. Noble, A. Bellet, and A. Dieuleveut. Differentially private federated learning on heterogeneous data. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 10110–10145. PMLR, May 2022.
  • [NOR18] A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
  • [NOS17] A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
  • [NRB20] M. Nokleby, H. Raja, and W. U. Bajwa. Scaling-Up Distributed Processing of Data Streams for Machine Learning. Proceedings of the IEEE, 108(11):1984–2012, November 2020.
  • [PMB13] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1310–1318. PMLR, May 2013.
  • [QL18] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2018.
  • [RLDJ23] A. Reisizadeh, H. Li, S. Das, and A. Jadbabaie. Variance-reduced clipping for non-convex optimization. arXiv preprint arXiv:2303.00883, 2023.
  • [RSF21] P. Richtarik, I. Sokolov, and I. Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. In Advances in Neural Information Processing Systems, volume 34, pages 4384–4396. Curran Associates, Inc., 2021.
  • [SCJ18] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [SDGD21] N. Singh, D. Data, J. George, and S. Diggavi. SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization. IEEE Journal on Selected Areas in Information Theory, 2(3):954–969, September 2021.
  • [SFD+14] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech 2014, pages 1058–1062. ISCA, September 2014.
  • [Sha07] D. Shah. Gossip algorithms. Foundations and Trends® in Networking, 3(1):1–125, 2007.
  • [SLH20] H. Sun, S. Lu, and M. Hong. Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In International Conference on Machine Learning, pages 9217–9228. PMLR, 2020.
  • [TGZ+18] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication compression for decentralized training. Advances in Neural Information Processing Systems, 31:7652–7662, 2018.
  • [TLQ+19] H. Tang, X. Lian, S. Qiu, L. Yuan, C. Zhang, T. Zhang, and J. Liu. Deepsqueeze: Decentralization meets error-compensated compression. arXiv preprint arXiv:1907.07346, 2019.
  • [TMHP20] H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani. Quantized decentralized stochastic learning over directed graphs. In Proceedings of the 37th International Conference on Machine Learning, pages 9324–9333. PMLR, November 2020.
  • [WJ21] J. Wang and G. Joshi. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. The Journal of Machine Learning Research, 22(1):213:9709–213:9758, January 2021.
  • [WJEG19] L. Wang, B. Jayaraman, D. Evans, and Q. Gu. Efficient privacy-preserving stochastic nonconvex optimization. arXiv preprint arXiv:1910.13659, 2019.
  • [WJZ+19] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pages 2406–2416, 2019.
  • [WXDX20] D. Wang, H. Xiao, S. Devadas, and J. Xu. On differentially private stochastic convex optimization with heavy-tailed data. In Proceedings of the 37th International Conference on Machine Learning, pages 10081–10091. PMLR, November 2020.
  • [WYX17] D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. Advances in Neural Information Processing Systems, 30, 2017.
  • [XB04] L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, September 2004.
  • [XKK22a] R. Xin, U. A. Khan, and S. Kar. Fast decentralized nonconvex finite-sum optimization with recursive variance reduction. SIAM Journal on Optimization, 32(1):1–28, March 2022.
  • [XKK22b] R. Xin, U. A. Khan, and S. Kar. A Fast Randomized Incremental Gradient Method for Decentralized Nonconvex Optimization. IEEE Transactions on Automatic Control, 67(10):5150–5165, October 2022.
  • [XPNK20] R. Xin, S. Pu, A. Nedić, and U. A. Khan. A general framework for decentralized optimization with first-order methods. Proceedings of the IEEE, 108(11):1869–1889, 2020.
  • [XYD19] H. Xiao, Y. Ye, and S. Devadas. Local differential privacy in decentralized optimization. arXiv preprint arXiv:1902.06101, 2019.
  • [YCC+23] Y. Yan, J. Chen, P.-Y. Chen, X. Cui, S. Lu, and Y. Xu. Compressed decentralized proximal stochastic gradient method for nonconvex composite problems with heterogeneous data. arXiv preprint arXiv:2302.14252, 2023.
  • [YGG17] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32k for ImageNet training. Technical Report UCB/EECS-2017-156, EECS Department, University of California, Berkeley, Sep 2017.
  • [YZCL22] X. Yang, H. Zhang, W. Chen, and T.-Y. Liu. Normalized/clipped SGD with perturbation for differentially private non-convex optimization. arXiv preprint arXiv:2206.13033, 2022.
  • [ZBLR21] H. Zhao, K. Burlachenko, Z. Li, and P. Richtárik. Faster rates for compressed federated learning with client-variance reduction. arXiv preprint arXiv:2112.13097, 2021.
  • [ZCH+22] X. Zhang, X. Chen, M. Hong, S. Wu, and J. Yi. Understanding clipping for federated learning: Convergence and client-level differential privacy. In Proceedings of the 39th International Conference on Machine Learning, pages 26048–26067. PMLR, June 2022.
  • [ZHSJ20a] J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020.
  • [ZHSJ20b] J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, March 2020.
  • [ZJFW20] B. Zhang, J. Jin, C. Fang, and L. Wang. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33:15511–15521, 2020.
  • [ZKV+20] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. J. Reddi, S. Kumar, and S. Sra. Why are adaptive methods good for attention models? arXiv preprint arXiv:1912.03194, October 2020.
  • [ZLL+22] H. Zhao, B. Li, Z. Li, P. Richtárik, and Y. Chi. Beer: Fast O (1/T) rate for decentralized nonconvex optimization with communication compression. Advances in Neural Information Processing Systems, 35, 2022.
  • [ZM10] M. Zhu and S. Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.

Appendix A Proof of Theorem 1

This section proves Theorem 1 in the following steps: 1) define the privacy loss and its moment generating function, 2) define the mechanisms and sub-mechanisms, and 3) bound the overall moment generating function and show that the choice of perturbation variance satisfies all conditions.

Moment generating function.

Let oo and aux denote an outcome and an auxiliary input, respectively. Then, we can define the privacy loss of an outcome oo on neighboring dataset 𝒁{\bm{{Z}}} and 𝒁(i){\bm{{Z}}}^{(i)} as

c(o;,aux,𝒁,𝒁(i))=log((aux,𝒁)=o)((aux,𝒁(i))=o),\displaystyle c(o;{\mathcal{{M}}},\text{aux},{\bm{{Z}}},{\bm{{Z}}}^{(i)})=\log\frac{{\mathbb{\bm{P}}}\big{(}{\mathcal{{M}}}(\text{aux},{\bm{{Z}}})=o\big{)}}{{\mathbb{\bm{P}}}\big{(}{\mathcal{{M}}}(\text{aux},{\bm{{Z}}}^{(i)})=o\big{)}},

and its log moment generating functions as

αi(λ;aux,𝒁,𝒁(i))\displaystyle\alpha_{i}^{{\mathcal{{M}}}}(\lambda;\text{aux},{\bm{{Z}}},{\bm{{Z}}}^{(i)}) =log𝔼o(aux,𝒁)[exp(λc(o;,aux,𝒁,𝒁(i)))].\displaystyle=\log{\mathbb{\bm{E}}}_{o\sim{\mathcal{{M}}}(\text{aux},{\bm{{Z}}})}\big{[}\exp\big{(}\lambda c(o;{\mathcal{{M}}},\text{aux},{\bm{{Z}}},{\bm{{Z}}}^{(i)})\big{)}\big{]}.

Taking the maximum over the conditioning variables, the unconditional log moment generating function is

α^i(λ)\displaystyle{\hat{\alpha}}_{i}^{{\mathcal{{M}}}}(\lambda) =maxaux,𝒁,𝒁(i)αi(λ;aux,𝒁,𝒁(i)).\displaystyle=\max_{\text{aux},{\bm{{Z}}},{\bm{{Z}}}^{(i)}}\alpha_{i}^{{\mathcal{{M}}}}(\lambda;\text{aux},{\bm{{Z}}},{\bm{{Z}}}^{(i)}).
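For intuition (a standard fact about the Gaussian mechanism, not a step of the paper's proof): for a one-dimensional Gaussian mechanism with sensitivity Δ and noise level σ, the log moment generating function equals λ(λ+1)Δ²/(2σ²) exactly, which a Monte-Carlo estimate confirms. The helper name below is ours.

```python
import math
import random

def gaussian_mgf_mc(delta_s, sigma, lam, n=200_000, seed=0):
    """Monte-Carlo estimate of log E_{o~N(mu, sigma^2)}[exp(lam * c(o))] for the
    Gaussian mechanism, where c(o) = log N(o; mu, sigma) - log N(o; mu', sigma)
    and delta_s = mu - mu' is the sensitivity. Writing o = mu + sigma * z gives
    c = delta_s * (delta_s + 2 * sigma * z) / (2 * sigma^2)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        c = delta_s * (delta_s + 2.0 * sigma * z) / (2.0 * sigma**2)
        acc += math.exp(lam * c)
    return math.log(acc / n)

lam, delta_s, sigma = 2.0, 1.0, 2.0
closed_form = lam * (lam + 1.0) * delta_s**2 / (2.0 * sigma**2)   # = 0.75
assert abs(gaussian_mgf_mc(delta_s, sigma, lam) - closed_form) < 0.05
```

Composing T such sub-mechanisms simply adds their log moment generating functions, which is the structure exploited by the bound below.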
Sub-mechanisms.

Definition 4 defines the LDP mechanism, but it is not enough to model decentralized algorithms. To model the perturbation operation performed on agent ii at time tt, we define a sub-mechanism i(t):𝒟{\mathcal{{M}}}_{i}^{(t)}:{\mathcal{{D}}}\to{\mathcal{{R}}}, where i[n],t[T]i\in[n],t\in[T], which can be understood as the perturbation added on agent ii at time tt. In addition, we define another mechanism 𝒞:{\mathcal{{C}}}:{\mathcal{{R}}}\to{\mathcal{{R}}} to model the compression operator, use 𝒞i(t){\mathcal{{C}}}\circ{\mathcal{{M}}}_{i}^{(t)} to represent the full update at an agent, and use {\mathcal{{M}}} to represent the full algorithm.

Proof of LDP.

The overall log moment generating function for agent ii can be bounded using [LZLC22, Lemma 2] as

α^i(λ)t=1Tα^i𝒞i(t)(λ)t=1Tα^ii(t)(λ).\displaystyle\hat{\alpha}_{i}^{{\mathcal{{M}}}}(\lambda)\leq\sum_{t=1}^{T}\hat{\alpha}_{i}^{{\mathcal{{C}}}\circ{\mathcal{{M}}}_{i}^{(t)}}(\lambda)\leq\sum_{t=1}^{T}\hat{\alpha}_{i}^{{\mathcal{{M}}}_{i}^{(t)}}(\lambda).

Let q=1mq=\frac{1}{m} denote the probability each data sample is chosen. For agent ii and λ>0\lambda>0, assume qτ16σpq\leq\frac{\tau}{16\sigma_{p}} and λσp2τ2logτqσp\lambda\leq\frac{\sigma_{p}^{2}}{\tau^{2}}\log\frac{\tau}{q\sigma_{p}}. We can apply [ACG+16, Lemma 3] to bound each α^ii(t)(λ)\hat{\alpha}_{i}^{{\mathcal{{M}}}_{i}^{(t)}}(\lambda) as

α^i(t)(λ)q2λ(λ+1)τ2(1q)σp2+O(q3λ3τ3σp3)=O(q2λ2τ2σp2).\displaystyle\hat{\alpha}^{{\mathcal{{M}}}_{i}^{(t)}}(\lambda)\leq\frac{q^{2}\lambda(\lambda+1)\tau^{2}}{(1-q)\sigma_{p}^{2}}+O\Big{(}\frac{q^{3}\lambda^{3}\tau^{3}}{\sigma_{p}^{3}}\Big{)}=O\Big{(}\frac{q^{2}\lambda^{2}\tau^{2}}{\sigma_{p}^{2}}\Big{)}.

To conclude the proof, we can verify there exists some λ\lambda that satisfies the following inequalities when choosing σp=τqTlog(1/δ)ϵ\sigma_{p}=\frac{\tau q\sqrt{T\log(1/\delta)}}{\epsilon} and q=1mq=\frac{1}{m}, namely,

(Tqτλσp)2λϵ2,\displaystyle\Big{(}\frac{Tq\tau\lambda}{\sigma_{p}}\Big{)}^{2}\leq\frac{\lambda\epsilon}{2},
exp(λϵ/2)δ,\displaystyle\exp(-\lambda\epsilon/2)\leq\delta,
λσp2τ2logτqσp.\displaystyle\lambda\leq\frac{\sigma_{p}^{2}}{\tau^{2}}\log\frac{\tau}{q\sigma_{p}}.

Appendix B Proof of Theorem 2

This section proves Theorem 2 in the following four subsections: Section B.1 derives the descent inequality, Sections B.2 and B.3 create two linear systems to bound the sum of consensus errors in the descent inequality, and finally Section B.4 specifies hyperparameters to obtain the convergence rate.

With Assumption 2, we can skip the compression operator in PORTER-DP. To reuse this section’s results in later analysis, we assume Assumption 3 when deriving the descent lemma and the linear systems, and lift this assumption when computing the convergence rate in Section B.4 using σg2τ\sigma_{g}\leq 2\tau.

Denote

𝑮(t)=1b𝒁𝒵(t)(𝑿(t1);𝒁).{\bm{{G}}}^{(t)}=\frac{1}{b}\sum_{{\bm{{Z}}}\in{\mathcal{{Z}}}^{(t)}}\nabla\ell({\bm{{X}}}^{(t-1)};{\bm{{Z}}}).

B.1 Function value descent

Using a Taylor expansion and taking expectation conditioned on time tt,

𝔼t[f(𝒙¯(t+1))f(𝒙¯(t))]\displaystyle{\mathbb{\bm{E}}}_{t}\big{[}f({\overline{\bm{x}}}^{(t+1)})-f({\overline{\bm{x}}}^{(t)})\big{]} 𝔼tf(𝒙¯(t)),η𝒗¯(t+1)+L2𝔼tη𝒗¯(t+1)22\displaystyle\leq{\mathbb{\bm{E}}}_{t}\langle\nabla f({\overline{\bm{x}}}^{(t)}),-\eta{\overline{\bm{v}}}^{(t+1)}\rangle+\frac{L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}\eta{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=ηf(𝒙¯(t)),𝔼t[𝒗¯(t+1)]+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\eta\langle\nabla f({\overline{\bm{x}}}^{(t)}),{\mathbb{\bm{E}}}_{t}[{\overline{\bm{v}}}^{(t+1)}]\rangle+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=ηf(𝒙¯(t)),𝔼t[𝒈¯τ(t+1)+𝒆¯(t+1)]+η2L2𝔼t𝒗¯(t+1)22,\displaystyle=-\eta\langle\nabla f({\overline{\bm{x}}}^{(t)}),{\mathbb{\bm{E}}}_{t}[{\overline{\bm{g}}}_{\tau}^{(t+1)}+{\overline{\bm{e}}}^{(t+1)}]\rangle+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2},

where the last equality is due to 𝒗¯(t)=𝒈¯p(t){\overline{\bm{v}}}^{(t)}={\overline{\bm{g}}}_{p}^{(t)}, which can be proved by induction.

Because 𝔼t[𝒆i(t)]=𝟎d{\mathbb{\bm{E}}}_{t}[{\bm{{e}}}_{i}^{(t)}]=\bm{0}_{d} and stochastic gradients are unbiased,

𝔼t[f(𝒙¯(t+1))f(𝒙¯(t))]\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}_{t}\big{[}f({\overline{\bm{x}}}^{(t+1)})-f({\overline{\bm{x}}}^{(t)})\big{]}
=ηf(𝒙¯(t)),F(𝑿(t))(1n𝟏n)+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\eta\langle\nabla f({\overline{\bm{x}}}^{(t)}),\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\rangle+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=η2(f(𝒙¯(t))F(𝑿(t))(1n𝟏n)22f(𝒙¯(t))22F(𝑿(t))(1n𝟏n)22)+η2L2𝔼t𝒗¯(t+1)22\displaystyle=\frac{\eta}{2}\Big{(}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})-\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}-\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}-\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}\Big{)}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
η2f(𝒙¯(t))22+ηL22n𝑿(t)𝒙¯(t)𝟏n𝖥2+η2L2𝔼t𝒗¯(t+1)22η2F(𝑿(t))(1n𝟏n)22,\displaystyle\leq-\frac{\eta}{2}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\frac{\eta L^{2}}{2n}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}-\frac{\eta}{2}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}, (8)

where the last inequality is due to Assumption 1.

Let Δ=𝔼[f(𝒙¯(0))]f\Delta={\mathbb{\bm{E}}}\big{[}f({\overline{\bm{x}}}^{(0)})\big{]}-f^{\star}. Taking full expectation and averaging (8) over t=1,,Tt=1,\ldots,T, the expected utility can be bounded by

1Tt=1T𝔼f(𝒙¯(t))22\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2} 2ΔηT+1TL2nt=1T𝔼𝑿(t)𝒙¯(t)𝟏n𝖥2\displaystyle\leq\frac{2\Delta}{\eta T}+\frac{1}{T}\cdot\frac{L^{2}}{n}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
+1TηLt=1T𝔼𝒗¯(t+1)221Tt=1T𝔼F(𝑿(t))(1n𝟏n)22.\displaystyle~{}~{}~{}~{}+\frac{1}{T}\cdot\eta L\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}-\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}. (9)

B.2 Sum of variable consensus errors

This subsection creates a linear system to bound t=1T𝔼𝑿(t)𝒙¯(t)𝟏n𝖥2\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2} by t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2} and t=1T𝔼𝒗¯(t)22\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{2}^{2}. To simplify notation, let 𝑾^=𝑰n+γ(𝑾𝑰n)\widehat{\bm{{W}}}={\bm{{I}}}_{n}+\gamma({\bm{{W}}}-{\bm{{I}}}_{n}), and denote the mixing rate of 𝑾^\widehat{\bm{{W}}} by α^=𝑾^(1n𝟏n𝟏n)op\hat{\alpha}=\big{\|}\widehat{\bm{{W}}}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{\|}_{\mathrm{op}}. Lemma 1 analyzes the mixing rate of the regularized mixing matrix.

Lemma 1 (Mixing rate of regularized mixing matrix).

Assume 0<γ10<\gamma\leq 1. Then the mixing rate of 𝐖^\widehat{\bm{{W}}} can be bounded as

α^1+γ(α1).\displaystyle\hat{\alpha}\leq 1+\gamma(\alpha-1). (10)
Proof of Lemma 1.

Let λ1=1>λ2λn>1\lambda_{1}=1>\lambda_{2}\geq\ldots\geq\lambda_{n}>-1 denote the eigenvalues of 𝑾{\bm{{W}}}. The corresponding eigenvalues of 𝑾^\widehat{\bm{{W}}} are 1+γ(λi1)1+\gamma(\lambda_{i}-1), i=1,,ni=1,\ldots,n.

The mixing rate of 𝑾^\widehat{\bm{{W}}} is

α^\displaystyle\hat{\alpha} =max{|1+γ(λ21)|,|1+γ(λn1)|}\displaystyle=\max\big{\{}\big{|}1+\gamma(\lambda_{2}-1)\big{|},\big{|}1+\gamma(\lambda_{n}-1)\big{|}\big{\}}
max{|1γ|+γ|λ2|,|1γ|+γ|λn|}\displaystyle\leq\max\big{\{}|1-\gamma|+\gamma|\lambda_{2}|,|1-\gamma|+\gamma|\lambda_{n}|\big{\}}
=1+γ(α1).\displaystyle=1+\gamma(\alpha-1).
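Lemma 1 can be sanity-checked numerically. The sketch below (our own toy setup) uses a ring-averaging matrix with self-weight 1/2 and neighbor weights 1/4, whose eigenvalues are known in closed form, and verifies the bound (10) via the shared eigenbasis of W and the regularized matrix:

```python
import math

# Eigenvalues of the ring-averaging matrix W (self-weight 1/2, neighbors 1/4):
# lambda_k = 1/2 + 1/2*cos(2*pi*k/n) for k = 0, ..., n-1, with lambda_0 = 1.
n, gamma = 8, 0.3
lams = [0.5 + 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]
alpha = max(abs(lam) for lam in lams[1:])            # mixing rate of W
# W_hat = I + gamma*(W - I) shares eigenvectors with W, so its eigenvalues are
# 1 + gamma*(lambda_k - 1); the mixing rate is the max magnitude over k >= 1.
alpha_hat = max(abs(1 + gamma * (lam - 1)) for lam in lams[1:])
assert alpha_hat <= 1 + gamma * (alpha - 1) + 1e-12  # the bound (10)
```

Here the bound holds with equality because all eigenvalues of this W are nonnegative.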

B.2.1 Variable consensus error

Taking expectation conditioned on time tt and using Young’s inequality, the variable consensus error can be bounded as

𝔼t𝑿(t+1)𝒙¯(t+1)𝟏n𝖥2\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{X}}}^{(t+1)}-{\overline{\bm{x}}}^{(t+1)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
=𝔼t(𝑿(t)+γ𝑸x(t+1)(𝑾𝑰n)η𝑽(t+1))(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle={\mathbb{\bm{E}}}_{t}\Big{\|}\Big{(}{\bm{{X}}}^{(t)}+\gamma{\bm{{Q}}}_{x}^{(t+1)}({\bm{{W}}}-{\bm{{I}}}_{n})-\eta{\bm{{V}}}^{(t+1)}\Big{)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}
=𝔼t(𝑿(t)𝒙¯(t)𝟏n)(𝑾^(1n𝟏n𝟏n))+γ(𝑸x(t+1)𝑿(t))(𝑾𝑰n)η𝑽(t+1)(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle={\mathbb{\bm{E}}}_{t}\Big{\|}\big{(}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{)}\big{(}\widehat{\bm{{W}}}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}+\gamma({\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t)})({\bm{{W}}}-{\bm{{I}}}_{n})-\eta{\bm{{V}}}^{(t+1)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}
21+α^2(𝑿(t)𝒙¯(t)𝟏n)(𝑾^(1n𝟏n𝟏n))𝖥2\displaystyle\leq\frac{2}{1+\hat{\alpha}^{2}}\big{\|}\big{(}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{)}\big{(}\widehat{\bm{{W}}}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\big{\|}_{\mathsf{F}}^{2}
+21α^2𝔼tγ(𝑸x(t+1)𝑿(t))(𝑾𝑰n)η𝑽(t+1)(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle~{}~{}~{}~{}+\frac{2}{1-\hat{\alpha}^{2}}{\mathbb{\bm{E}}}_{t}\Big{\|}\gamma({\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t)})({\bm{{W}}}-{\bm{{I}}}_{n})-\eta{\bm{{V}}}^{(t+1)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}
21+α^2(𝑿(t)𝒙¯(t)𝟏n)(𝑾^(1n𝟏n𝟏n))𝖥2\displaystyle\leq\frac{2}{1+\hat{\alpha}^{2}}\big{\|}\big{(}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{)}\big{(}\widehat{\bm{{W}}}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\big{\|}_{\mathsf{F}}^{2}
+41α^2𝔼tγ(𝑸x(t+1)𝑿(t))(𝑾𝑰n)𝖥2+41α^2𝔼tη𝑽(t+1)(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle~{}~{}~{}~{}+\frac{4}{1-\hat{\alpha}^{2}}{\mathbb{\bm{E}}}_{t}\Big{\|}\gamma({\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t)})({\bm{{W}}}-{\bm{{I}}}_{n})\Big{\|}_{\mathsf{F}}^{2}+\frac{4}{1-\hat{\alpha}^{2}}{\mathbb{\bm{E}}}_{t}\Big{\|}\eta{\bm{{V}}}^{(t+1)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}
(i)2α^21+α^2𝑿(t)𝒙¯(t)𝟏n𝖥2+16γ21α^2𝔼t𝑸x(t+1)𝑿(t)𝖥2+4η21α^2𝔼t𝑽(t+1)𝒗¯(t+1)𝟏n𝖥2\displaystyle\overset{(\text{i})}{\leq}\frac{2\hat{\alpha}^{2}}{1+\hat{\alpha}^{2}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{16\gamma^{2}}{1-\hat{\alpha}^{2}}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{4\eta^{2}}{1-\hat{\alpha}^{2}}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}\ -{\overline{\bm{v}}}^{(t+1)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
(ii)α^𝑿(t)𝒙¯(t)𝟏n𝖥2+16(1ρ)γ21α^𝑸x(t)𝑿(t)𝖥2+4η21α^𝔼t𝑽(t+1)𝒗¯(t+1)𝟏n𝖥2,\displaystyle\overset{(\text{ii})}{\leq}\hat{\alpha}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\big{\|}{\bm{{Q}}}_{x}^{(t)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{4\eta^{2}}{1-\hat{\alpha}}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}\ -{\overline{\bm{v}}}^{(t+1)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}, (11)

where (i) is obtained by 𝑾𝑰nop2\|{\bm{{W}}}-{\bm{{I}}}_{n}\|_{\mathrm{op}}\leq 2, (ii) uses 2α^1+α^22\hat{\alpha}\leq 1+\hat{\alpha}^{2}, 1α^1α^21-\hat{\alpha}\leq 1-\hat{\alpha}^{2} and Definition 3.

B.2.2 Variable quantization error

Assume γ\gamma satisfies the following inequality (which will be verified in Section B.4)

γ2ρ296(1ρ).\displaystyle\gamma^{2}\leq\frac{\rho^{2}}{96(1-\rho)}. (12)

Taking expectation conditioned on time tt, the variable quantization error can be decomposed and bounded as

𝔼t𝑸x(t+1)𝑿(t+1)𝖥2\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}
=𝔼t𝑸x(t)+𝒞(𝑿(t)𝑸x(t))𝑿(t+1)𝖥2\displaystyle={\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{Q}}}_{x}^{(t)}+{\mathcal{{C}}}({\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)})-{\bm{{X}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}
=𝔼t𝒞(𝑿(t)𝑸x(t))(𝑿(t)𝑸x(t))(𝑿(t+1)𝑿(t))𝖥2\displaystyle={\mathbb{\bm{E}}}_{t}\big{\|}{\mathcal{{C}}}({\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)})-({\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)})-({\bm{{X}}}^{(t+1)}-{\bm{{X}}}^{(t)})\big{\|}_{\mathsf{F}}^{2}
(i)21+(1ρ)𝔼t𝒞(𝑿(t)𝑸x(t))(𝑿(t)𝑸x(t))𝖥2+21(1ρ)𝔼t𝑿(t+1)𝑿(t)𝖥2\displaystyle\overset{(\text{i})}{\leq}\frac{2}{1+(1-\rho)}{\mathbb{\bm{E}}}_{t}\big{\|}{\mathcal{{C}}}({\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)})-({\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)})\big{\|}_{\mathsf{F}}^{2}+\frac{2}{1-(1-\rho)}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{X}}}^{(t+1)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}
(ii)2(1ρ)1+(1ρ)𝑿(t)𝑸x(t)𝖥2+2ρ𝔼t𝑿(t+1)𝑿(t)𝖥2\displaystyle\overset{(\text{ii})}{\leq}\frac{2(1-\rho)}{1+(1-\rho)}\big{\|}{\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{2}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{X}}}^{(t+1)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}
=2(1ρ)1+(1ρ)𝑿(t)𝑸x(t)𝖥2\displaystyle=\frac{2(1-\rho)}{1+(1-\rho)}\big{\|}{\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)}\big{\|}_{\mathsf{F}}^{2}
+2ρ𝔼tγ(𝑸x(t+1)𝑿(t))(𝑾𝑰n)+γ(𝑿(t)𝒙¯(t)𝟏n)(𝑾𝑰n)η𝑽(t+1)𝖥2\displaystyle~{}~{}~{}~{}+\frac{2}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}\gamma({\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t)})({\bm{{W}}}-{\bm{{I}}}_{n})+\gamma({\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top})({\bm{{W}}}-{\bm{{I}}}_{n})-\eta{\bm{{V}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}
(iii)(1ρ2)𝑿(t)𝑸x(t)𝖥2+24γ2ρ𝔼t𝑸x(t+1)𝑿(t)𝖥2+24γ2ρ𝑿(t)𝒙¯(t)𝟏n𝖥2+6η2ρ𝔼t𝑽(t+1)𝖥2\displaystyle\overset{(\text{iii})}{\leq}\Big{(}1-\frac{\rho}{2}\Big{)}\big{\|}{\bm{{X}}}^{(t)}-{\bm{{Q}}}_{x}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{24\gamma^{2}}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{Q}}}_{x}^{(t+1)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{24\gamma^{2}}{\rho}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{6\eta^{2}}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}
(1ρ2+24(1ρ)γ2ρ)𝑸x(t)𝑿(t)𝖥2+24γ2ρ𝑿(t)𝒙¯(t)𝟏n𝖥2+6η2ρ𝔼t𝑽(t+1)𝖥2\displaystyle\leq\Big{(}1-\frac{\rho}{2}+\frac{24(1-\rho)\gamma^{2}}{\rho}\Big{)}\big{\|}{\bm{{Q}}}_{x}^{(t)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{24\gamma^{2}}{\rho}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{6\eta^{2}}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}
(iv)(1ρ4)𝑸x(t)𝑿(t)𝖥2+24γ2ρ𝑿(t)𝒙¯(t)𝟏n𝖥2+6η2ρ𝔼t𝑽(t+1)𝖥2,\displaystyle\overset{(\text{iv})}{\leq}\Big{(}1-\frac{\rho}{4}\Big{)}\big{\|}{\bm{{Q}}}_{x}^{(t)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{24\gamma^{2}}{\rho}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{6\eta^{2}}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}, (13)

where (i) is obtained by applying Young’s inequality, (ii) uses Definition 3, (iii) uses the fact 𝑾𝑰nop2\big{\|}{\bm{{W}}}-{\bm{{I}}}_{n}\big{\|}_{\text{op}}\leq 2, and (iv) uses (12).
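The error-feedback recursion behind (13), which updates the compression memory by adding the compressed residual, can be illustrated with a top-kk compressor, which satisfies the contraction property of Definition 3 with ρ=k/d\rho=k/d (a minimal sketch under our own toy setup; the target vector is held fixed here, whereas in PORTER it moves across iterations):

```python
def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest."""
    idx = sorted(range(len(v)), key=lambda j: -abs(v[j]))[:k]
    out = [0.0] * len(v)
    for j in idx:
        out[j] = v[j]
    return out

d, k = 10, 3                               # compression ratio rho = k/d
x = [float(j + 1) for j in range(d)]       # fixed target vector
q = [0.0] * d                              # compression memory Q_x
err = sum((a - b) ** 2 for a, b in zip(x, q))
for _ in range(50):
    # Q_x <- Q_x + C(X - Q_x): add the compressed residual to the memory
    q = [b + c for b, c in zip(q, top_k([a - b for a, b in zip(x, q)], k))]
    new_err = sum((a - b) ** 2 for a, b in zip(x, q))
    assert new_err <= (1 - k / d) * err + 1e-12   # contraction of Definition 3
    err = new_err
assert err < 1e-6                          # the memory converges to x
```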

B.2.3 Linear system

Let 𝒆1(t)=[𝑿(t)𝒙¯(t)𝟏n𝖥2𝑸x(t)𝑿(t)𝖥2]{\bm{{e}}}_{1}^{(t)}=\begin{bmatrix}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}\\ \big{\|}{\bm{{Q}}}_{x}^{(t)}-{\bm{{X}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}\end{bmatrix}. Taking full expectation, we can rewrite (11) and (13) in matrix form as

𝔼[𝒆1(t+1)]\displaystyle{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t+1)}] [α^16(1ρ)γ21α^24γ2ρ1ρ4]𝔼[𝒆1(t)]+[4η21α^𝔼𝑽(t+1)𝒗¯(t+1)𝟏n𝖥26η2ρ𝔼𝑽(t+1)𝖥2]\displaystyle\leq\begin{bmatrix}\hat{\alpha}&\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\\ \frac{24\gamma^{2}}{\rho}&1-\frac{\rho}{4}\end{bmatrix}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t)}]+\begin{bmatrix}\frac{4\eta^{2}}{1-\hat{\alpha}}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t+1)}\ -{\overline{\bm{v}}}^{(t+1)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}\\ \frac{6\eta^{2}}{\rho}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}\end{bmatrix}
:=𝑮1𝔼[𝒆1(t)]+𝒃1(t).\displaystyle:={\bm{{G}}}_{1}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t)}]+{\bm{{b}}}_{1}^{(t)}. (14)

We can compute (𝑰n𝑮1)1({\bm{{I}}}_{n}-{\bm{{G}}}_{1})^{-1} and verify all its entries are positive:

(𝑰n𝑮1)1\displaystyle({\bm{{I}}}_{n}-{\bm{{G}}}_{1})^{-1} =1(1α^)ρ416(1ρ)γ21α^24γ2ρ[ρ416(1ρ)γ21α^24γ2ρ1α^]\displaystyle=\frac{1}{(1-\hat{\alpha})\cdot\frac{\rho}{4}-\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\cdot\frac{24\gamma^{2}}{\rho}}\begin{bmatrix}\frac{\rho}{4}&\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\\ \frac{24\gamma^{2}}{\rho}&1-\hat{\alpha}\end{bmatrix}
118(1α^)ρ[ρ416(1ρ)γ21α^24γ2ρ1α^],\displaystyle\leq\frac{1}{\frac{1}{8}(1-\hat{\alpha})\rho}\begin{bmatrix}\frac{\rho}{4}&\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\\ \frac{24\gamma^{2}}{\rho}&1-\hat{\alpha}\end{bmatrix}, (15)

where we assume the following inequality to prove (15), which will be validated in Section B.4:

(1α^)ρ416(1ρ)γ21α^24γ2ρ18(1α^)ρ.\displaystyle(1-\hat{\alpha})\cdot\frac{\rho}{4}-\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\cdot\frac{24\gamma^{2}}{\rho}\geq\frac{1}{8}(1-\hat{\alpha})\rho. (16)
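Conditions (12) and (16), as well as the entrywise positivity of the inverse in (15), can be checked numerically for concrete parameters. The sketch below uses the choice γ=(1α)ρ/100\gamma=(1-\alpha)\rho/100 made later in Section B.4 (the specific values of ρ\rho and α\alpha are our own examples):

```python
# Check conditions (12) and (16) and the entrywise positivity of (I - G_1)^{-1}
# for concrete example parameters (our own choices).
rho, alpha = 0.1, 0.9                      # compression ratio, mixing rate of W
gamma = (1 - alpha) * rho / 100            # the choice made in Section B.4
assert gamma**2 <= rho**2 / (96 * (1 - rho))       # condition (12)

alpha_hat = 1 + gamma * (alpha - 1)        # Lemma 1 bound on the mixing rate of W_hat
a = (1 - alpha_hat) * rho / 4
b = (16 * (1 - rho) * gamma**2 / (1 - alpha_hat)) * (24 * gamma**2 / rho)
assert a - b >= (1 - alpha_hat) * rho / 8          # condition (16); a - b = det(I - G_1)

det = a - b
inv = [[(rho / 4) / det, (16 * (1 - rho) * gamma**2 / (1 - alpha_hat)) / det],
       [(24 * gamma**2 / rho) / det, (1 - alpha_hat) / det]]
assert all(entry > 0 for row in inv for entry in row)
```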

Summing the expected error vectors 𝔼[𝒆1(t)]{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t)}] over t=1,,Tt=1,\ldots,T,

t=1T𝔼[𝒆1(t)]\displaystyle\sum_{t=1}^{T}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t)}] t=1T(𝑮1𝔼[𝒆1(t1)]+𝒃1(t1))\displaystyle\leq\sum_{t=1}^{T}({\bm{{G}}}_{1}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t-1)}]+{\bm{{b}}}_{1}^{(t-1)})
𝑮1t=1T𝔼[𝒆1(t)]+𝑮1𝔼[𝒆1(0)]+t=1T𝒃1(t1).\displaystyle\leq{\bm{{G}}}_{1}\sum_{t=1}^{T}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t)}]+{\bm{{G}}}_{1}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(0)}]+\sum_{t=1}^{T}{\bm{{b}}}_{1}^{(t-1)}.

Reorganizing terms, multiplying (𝑰n𝑮1)1({\bm{{I}}}_{n}-{\bm{{G}}}_{1})^{-1} on both sides, and using 𝒆1(0)=𝟎2{\bm{{e}}}_{1}^{(0)}=\bm{0}_{2}, the sum of error vectors can be bounded as

t=1T𝔼[𝒆1(t)]\displaystyle\sum_{t=1}^{T}{\mathbb{\bm{E}}}[{\bm{{e}}}_{1}^{(t)}] (𝑰n𝑮1)1t=0T1𝒃1(t).\displaystyle\leq({\bm{{I}}}_{n}-{\bm{{G}}}_{1})^{-1}\sum_{t=0}^{T-1}{\bm{{b}}}_{1}^{(t)}. (17)

The sum of consensus errors can then be bounded as

t=1T𝔼𝑿(t)𝒙¯(t)𝟏n𝖥2\displaystyle~{}~{}~{}~{}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
[10](𝑰n𝑮1)1t=0T1𝒃1(t)\displaystyle\leq\begin{bmatrix}1&0\end{bmatrix}({\bm{{I}}}_{n}-{\bm{{G}}}_{1})^{-1}\sum_{t=0}^{T-1}{\bm{{b}}}_{1}^{(t)}
=118(1α^)ρ[ρ416(1ρ)γ21α^][4η21α^t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥26η2ρt=1T𝔼𝑽(t)𝖥2]\displaystyle=\frac{1}{\frac{1}{8}(1-\hat{\alpha})\rho}\begin{bmatrix}\frac{\rho}{4}&\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\end{bmatrix}\begin{bmatrix}\frac{4\eta^{2}}{1-\hat{\alpha}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}\ -{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}\\ \frac{6\eta^{2}}{\rho}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}\end{bmatrix}
=η218(1α^)ρ(ρ1α^t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2+16(1ρ)γ21α^6ρt=1T𝔼𝑽(t)𝖥2)\displaystyle=\frac{\eta^{2}}{\frac{1}{8}(1-\hat{\alpha})\rho}\Big{(}\frac{\rho}{1-\hat{\alpha}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\cdot\frac{6}{\rho}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}\Big{)}
=(i)8η2(1α^)ρ(ρ1α^+96(1ρ)γ2(1α^)ρ)t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2+768(1ρ)γ2η2(1α^)2ρ2t=1Tn𝔼𝒗¯(t)𝖥2\displaystyle\overset{(\text{i})}{=}\frac{8\eta^{2}}{(1-\hat{\alpha})\rho}\Big{(}\frac{\rho}{1-\hat{\alpha}}+\frac{96(1-\rho)\gamma^{2}}{(1-\hat{\alpha})\rho}\Big{)}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}\ -{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{768(1-\rho)\gamma^{2}\eta^{2}}{(1-\hat{\alpha})^{2}\rho^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}
(ii)16η2(1α^)2t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2+768(1ρ)γ2η2(1α^)2ρ2t=1Tn𝔼𝒗¯(t)𝖥2,\displaystyle\overset{(\text{ii})}{\leq}\frac{16\eta^{2}}{(1-\hat{\alpha})^{2}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}\ -{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{768(1-\rho)\gamma^{2}\eta^{2}}{(1-\hat{\alpha})^{2}\rho^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}, (18)

where we use the equality 𝑽(t)𝖥2=𝑽(t)𝒗¯(t)𝟏n𝖥2+n𝒗¯(t)22\big{\|}{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}=\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+n\|{\overline{\bm{v}}}^{(t)}\|_{2}^{2} for (i) and use (12) for (ii).

B.3 Sum of gradient consensus errors

This subsection creates a linear system to bound the sum of gradient consensus errors t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2} by t=1T𝔼𝒗¯(t)22\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{2}^{2} and constant terms.

B.3.1 Gradient consensus error

Taking expectation conditioned on time tt and reorganizing terms, the gradient consensus error can be expanded as

𝔼t𝑽(t+1)𝒗¯(t+1)𝟏n𝖥2\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}-{\overline{\bm{v}}}^{(t+1)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
=𝔼t(𝑽(t)+γ𝑸v(t+1)(𝑾𝑰n)+𝑮p(t+1)𝑮p(t))(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle={\mathbb{\bm{E}}}_{t}\Big{\|}\Big{(}{\bm{{V}}}^{(t)}+\gamma{\bm{{Q}}}_{v}^{(t+1)}({\bm{{W}}}-{\bm{{I}}}_{n})+{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\Big{)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}
=𝔼t(𝑽(t)𝒗¯(t)𝟏n)(𝑾^(1n𝟏n𝟏n))+γ(𝑸v(t+1)𝑽(t))(𝑾𝑰n)+(𝑮p(t+1)𝑮p(t))(𝑰n(1n𝟏n𝟏n))𝖥2.\displaystyle={\mathbb{\bm{E}}}_{t}\Big{\|}\big{(}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{)}\big{(}\widehat{\bm{{W}}}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}+\gamma\big{(}{\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)}\big{)}({\bm{{W}}}-{\bm{{I}}}_{n})+\big{(}{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}.

Then, taking full expectation and using the update formula and Young’s inequality as in (11),

𝔼𝑽(t+1)𝒗¯(t+1)𝟏n𝖥2\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t+1)}-{\overline{\bm{v}}}^{(t+1)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
2α^21+α^2𝑽(t)𝒗¯(t)𝟏n𝖥2\displaystyle\leq\frac{2\hat{\alpha}^{2}}{1+\hat{\alpha}^{2}}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
+21α^2𝔼γ(𝑸v(t+1)𝑽(t))(𝑾𝑰n)+(𝑮p(t+1)𝑮p(t))(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle~{}+\frac{2}{1-\hat{\alpha}^{2}}{\mathbb{\bm{E}}}\Big{\|}\gamma\big{(}{\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)}\big{)}({\bm{{W}}}-{\bm{{I}}}_{n})+\big{(}{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\Big{\|}_{\mathsf{F}}^{2}
α^𝑽(t)𝒗¯(t)𝟏n𝖥2+41α^𝔼γ(𝑸v(t+1)𝑽(t))(𝑾𝑰n)𝖥2\displaystyle\leq\hat{\alpha}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{4}{1-\hat{\alpha}}{\mathbb{\bm{E}}}\big{\|}\gamma\big{(}{\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)}\big{)}({\bm{{W}}}-{\bm{{I}}}_{n})\big{\|}_{\mathsf{F}}^{2}
+41α^𝔼(𝑮p(t+1)𝑮p(t))(𝑰n(1n𝟏n𝟏n))𝖥2\displaystyle~{}~{}~{}~{}+\frac{4}{1-\hat{\alpha}}{\mathbb{\bm{E}}}\big{\|}\big{(}{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{)}\big{(}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{)}\big{\|}_{\mathsf{F}}^{2}
(i)α^𝑽(t)𝒗¯(t)𝟏n𝖥2+16γ21α^𝔼𝑸v(t+1)𝑽(t)𝖥2+41α^𝔼𝑮p(t+1)𝑮p(t)𝖥2\displaystyle\overset{(\text{i})}{\leq}\hat{\alpha}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{16\gamma^{2}}{1-\hat{\alpha}}{\mathbb{\bm{E}}}\big{\|}{\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{4}{1-\hat{\alpha}}{\mathbb{\bm{E}}}\big{\|}{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{\|}_{\mathsf{F}}^{2}
(ii)α^𝑽(t)𝒗¯(t)𝟏n𝖥2+16(1ρ)γ21α^𝔼𝑸v(t)𝑽(t)𝖥2+16n(τ2+σp2d)1α^,\displaystyle\overset{(\text{ii})}{\leq}\hat{\alpha}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}{\mathbb{\bm{E}}}\big{\|}{\bm{{Q}}}_{v}^{(t)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{16n(\tau^{2}+\sigma_{p}^{2}d)}{1-\hat{\alpha}}, (19)

where (i) is proved using the facts 𝑾𝑰n𝗈𝗉2\big{\|}{\bm{{W}}}-{\bm{{I}}}_{n}\big{\|}_{\mathsf{op}}\leq 2 and 𝑰n(1n𝟏n𝟏n)𝗈𝗉1\big{\|}{\bm{{I}}}_{n}-(\tfrac{1}{n}\bm{1}_{n}\bm{1}_{n}^{\top})\big{\|}_{\mathsf{op}}\leq 1, (ii) is due to Definition 3 and

𝔼𝑮p(t+1)𝑮p(t)𝖥2\displaystyle{\mathbb{\bm{E}}}\big{\|}{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{\|}_{\mathsf{F}}^{2} 2𝔼𝑮p(t+1)𝖥2+2𝔼𝑮p(t)𝖥2\displaystyle\leq 2{\mathbb{\bm{E}}}\big{\|}{\bm{{G}}}_{p}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}+2{\mathbb{\bm{E}}}\big{\|}{\bm{{G}}}_{p}^{(t)}\big{\|}_{\mathsf{F}}^{2}
=2(𝔼𝑮τ(t+1)𝖥2+nσp2d)+2(𝔼𝑮τ(t)𝖥2+nσp2d)\displaystyle=2\big{(}{\mathbb{\bm{E}}}\big{\|}{\bm{{G}}}_{\tau}^{(t+1)}\big{\|}_{\mathsf{F}}^{2}+n\sigma_{p}^{2}d\big{)}+2\big{(}{\mathbb{\bm{E}}}\big{\|}{\bm{{G}}}_{\tau}^{(t)}\big{\|}_{\mathsf{F}}^{2}+n\sigma_{p}^{2}d\big{)}
4n(τ2+σp2d).\displaystyle\leq 4n(\tau^{2}+\sigma_{p}^{2}d). (20)

B.3.2 Gradient quantization error

𝔼t𝑸v(t+1)𝑽(t+1)𝖥2\displaystyle{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t+1)}\big{\|}_{\mathsf{F}}^{2} =𝔼t(𝑸v(t+1)𝑽(t))(𝑽(t+1)𝑽(t))𝖥2\displaystyle={\mathbb{\bm{E}}}_{t}\big{\|}({\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)})-({\bm{{V}}}^{(t+1)}-{\bm{{V}}}^{(t)})\big{\|}_{\mathsf{F}}^{2}
21+(1ρ)𝔼t𝑸v(t+1)𝑽(t)𝖥2+21(1ρ)𝔼t𝑽(t+1)𝑽(t)𝖥2\displaystyle\leq\frac{2}{1+(1-\rho)}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{2}{1-(1-\rho)}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{V}}}^{(t+1)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}
2(1ρ)2ρ𝑸v(t)𝑽(t)𝖥2+2ρ𝔼tγ𝑸v(t+1)(𝑾𝑰n)+𝑮p(t+1)𝑮p(t)𝖥2\displaystyle\leq\frac{2(1-\rho)}{2-\rho}\big{\|}{\bm{{Q}}}_{v}^{(t)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{2}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}\gamma{\bm{{Q}}}_{v}^{(t+1)}({\bm{{W}}}-{\bm{{I}}}_{n})+{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{\|}_{\mathsf{F}}^{2}
2(1ρ)2ρ𝑸v(t)𝑽(t)𝖥2+6γ2ρ𝔼t(𝑸v(t+1)𝑽(t))(𝑾𝑰n)𝖥2\displaystyle\leq\frac{2(1-\rho)}{2-\rho}\big{\|}{\bm{{Q}}}_{v}^{(t)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{6\gamma^{2}}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}({\bm{{Q}}}_{v}^{(t+1)}-{\bm{{V}}}^{(t)})({\bm{{W}}}-{\bm{{I}}}_{n})\big{\|}_{\mathsf{F}}^{2}
+6γ2ρ𝑽(t)(𝑾𝑰n)𝖥2+6ρ𝔼t𝑮p(t+1)𝑮p(t)𝖥2\displaystyle~{}~{}~{}~{}+\frac{6\gamma^{2}}{\rho}\big{\|}{\bm{{V}}}^{(t)}({\bm{{W}}}-{\bm{{I}}}_{n})\big{\|}_{\mathsf{F}}^{2}+\frac{6}{\rho}{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{G}}}_{p}^{(t+1)}-{\bm{{G}}}_{p}^{(t)}\big{\|}_{\mathsf{F}}^{2}
(i)(1ρ2+24γ2(1ρ)ρ)𝑸v(t)𝑽(t)𝖥2+24γ2ρ𝑽(t)𝒗¯(t)𝟏n𝖥2+24n(τ2+σp2d)ρ\displaystyle\overset{\text{(i)}}{\leq}\Big{(}1-\frac{\rho}{2}+\frac{24\gamma^{2}(1-\rho)}{\rho}\Big{)}\big{\|}{\bm{{Q}}}_{v}^{(t)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{24\gamma^{2}}{\rho}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{24n(\tau^{2}+\sigma_{p}^{2}d)}{\rho}
(ii)(1ρ4)𝑸v(t)𝑽(t)𝖥2+24γ2ρ𝑽(t)𝒗¯(t)𝟏n𝖥2+24n(τ2+σp2d)ρ,\displaystyle\overset{\text{(ii)}}{\leq}\Big{(}1-\frac{\rho}{4}\Big{)}\big{\|}{\bm{{Q}}}_{v}^{(t)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}+\frac{24\gamma^{2}}{\rho}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{24n(\tau^{2}+\sigma_{p}^{2}d)}{\rho}, (21)

where we use (20) and the fact 2(1ρ)2ρ=1ρ2ρ1ρ2\frac{2(1-\rho)}{2-\rho}=1-\frac{\rho}{2-\rho}\leq 1-\frac{\rho}{2} for 0ρ10\leq\rho\leq 1 to reach (i) and use (12) to reach (ii).

B.3.3 Linear system

Let 𝒆2(t)=[𝑽(t)𝒗¯(t)𝟏n𝖥2𝑸v(t)𝑽(t)𝖥2]{\bm{{e}}}_{2}^{(t)}=\begin{bmatrix}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}\\ \big{\|}{\bm{{Q}}}_{v}^{(t)}-{\bm{{V}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}\end{bmatrix}. We can write (19) and (21) in matrix form as

𝔼[𝒆2(t+1)]\displaystyle{\mathbb{\bm{E}}}[{\bm{{e}}}_{2}^{(t+1)}] [α^16(1ρ)γ21α^24γ2ρ1ρ4]𝔼[𝒆2(t)]+[16n(τ2+σp2d)1α^24n(τ2+σp2d)ρ]\displaystyle\leq\begin{bmatrix}\hat{\alpha}&\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\\ \frac{24\gamma^{2}}{\rho}&1-\frac{\rho}{4}\end{bmatrix}{\mathbb{\bm{E}}}[{\bm{{e}}}_{2}^{(t)}]+\begin{bmatrix}\frac{16n(\tau^{2}+\sigma_{p}^{2}d)}{1-\hat{\alpha}}\\ \frac{24n(\tau^{2}+\sigma_{p}^{2}d)}{\rho}\\ \end{bmatrix}
:=𝑮2𝔼[𝒆2(t)]+𝒃2(t).\displaystyle:={\bm{{G}}}_{2}{\mathbb{\bm{E}}}[{\bm{{e}}}_{2}^{(t)}]+{\bm{{b}}}_{2}^{(t)}.

Because 𝑮2=𝑮1{\bm{{G}}}_{2}={\bm{{G}}}_{1}, we can use the same argument as in Section B.2.3, and use (15) to prove

t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2\displaystyle\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}-{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2} [10](𝑰n𝑮2)1(𝔼[𝒆2(0)]+t=0T1𝒃2(t))\displaystyle\leq\begin{bmatrix}1&0\end{bmatrix}({\bm{{I}}}_{n}-{\bm{{G}}}_{2})^{-1}\Big{(}{\mathbb{\bm{E}}}[{\bm{{e}}}_{2}^{(0)}]+\sum_{t=0}^{T-1}{\bm{{b}}}_{2}^{(t)}\Big{)}
118(1α^)ρ[ρ416(1ρ)γ21α^][16Tn(τ2+σp2d)1α^24Tn(τ2+σp2d)ρ]\displaystyle\leq\frac{1}{\frac{1}{8}(1-\hat{\alpha})\rho}\begin{bmatrix}\frac{\rho}{4}&\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\end{bmatrix}\begin{bmatrix}\frac{16Tn(\tau^{2}+\sigma_{p}^{2}d)}{1-\hat{\alpha}}\\ \frac{24Tn(\tau^{2}+\sigma_{p}^{2}d)}{\rho}\\ \end{bmatrix}
=Tn(τ2+σp2d)18(1α^)ρ(ρ4161α^+16(1ρ)γ21α^24ρ)\displaystyle=\frac{Tn(\tau^{2}+\sigma_{p}^{2}d)}{\frac{1}{8}(1-\hat{\alpha})\rho}\cdot\Big{(}\frac{\rho}{4}\cdot\frac{16}{1-\hat{\alpha}}+\frac{16(1-\rho)\gamma^{2}}{1-\hat{\alpha}}\cdot\frac{24}{\rho}\Big{)}
Tn(τ2+σp2d)18(1α^)ρ(4ρ1α^+4ρ1α^)\displaystyle\leq\frac{Tn(\tau^{2}+\sigma_{p}^{2}d)}{\frac{1}{8}(1-\hat{\alpha})\rho}\cdot\Big{(}\frac{4\rho}{1-\hat{\alpha}}+\frac{4\rho}{1-\hat{\alpha}}\Big{)}
=64(1α^)2Tn(τ2+σp2d),\displaystyle=\frac{64}{(1-\hat{\alpha})^{2}}Tn(\tau^{2}+\sigma_{p}^{2}d), (22)

where we use (12) to prove the last inequality.
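The bound (22) can be sanity-checked by iterating the linear system directly from a zero initial error vector and comparing the accumulated first coordinate against the closed-form constant. A minimal sketch (the parameter values are our own examples, and c stands in for n(τ²+σp²d)):

```python
# Iterate e2(t+1) <= G2 e2(t) + b2 from e2(0) = 0 and check the summed first
# coordinate against the closed-form bound (22).
rho, alpha = 0.5, 0.6
gamma = (1 - alpha) * rho / 100            # choice from Section B.4
alpha_hat = 1 + gamma * (alpha - 1)        # Lemma 1 bound
c = 1.0                                    # stands in for n*(tau^2 + sigma_p^2*d)
G = [[alpha_hat, 16 * (1 - rho) * gamma**2 / (1 - alpha_hat)],
     [24 * gamma**2 / rho, 1 - rho / 4]]
b = [16 * c / (1 - alpha_hat), 24 * c / rho]
T = 2000
e, total = [0.0, 0.0], 0.0
for _ in range(T):
    e = [G[0][0] * e[0] + G[0][1] * e[1] + b[0],
         G[1][0] * e[0] + G[1][1] * e[1] + b[1]]
    total += e[0]                          # accumulates the gradient consensus errors
assert total <= 64 / (1 - alpha_hat)**2 * T * c    # bound (22)
```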

With (22), we can bound (18) by

t=1T𝔼𝑿(t)𝒙¯(t)𝟏n𝖥2\displaystyle~{}~{}~{}~{}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
16η2(1α^)2t=1T𝔼𝑽(t)𝒗¯(t)𝟏n𝖥2+768(1ρ)γ2η2(1α^)2ρ2t=1Tn𝔼𝒗¯(t)𝖥2\displaystyle\leq\frac{16\eta^{2}}{(1-\hat{\alpha})^{2}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{V}}}^{(t)}\ -{\overline{\bm{v}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{768(1-\rho)\gamma^{2}\eta^{2}}{(1-\hat{\alpha})^{2}\rho^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}
16η2(1α^)264(1α^)2Tn(τ2+σp2d)+768(1ρ)γ2η2(1α^)2ρ2t=1Tn𝔼𝒗¯(t)𝖥2\displaystyle\leq\frac{16\eta^{2}}{(1-\hat{\alpha})^{2}}\cdot\frac{64}{(1-\hat{\alpha})^{2}}Tn(\tau^{2}+\sigma_{p}^{2}d)+\frac{768(1-\rho)\gamma^{2}\eta^{2}}{(1-\hat{\alpha})^{2}\rho^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}
1024η2(1α^)4Tn(τ2+σp2d)+8η2(1α^)2t=1Tn𝔼𝒗¯(t)𝖥2,\displaystyle\leq\frac{1024\eta^{2}}{(1-\hat{\alpha})^{4}}Tn(\tau^{2}+\sigma_{p}^{2}d)+\frac{8\eta^{2}}{(1-\hat{\alpha})^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{\mathsf{F}}^{2}, (23)

where we use (12) again to prove the last inequality.

B.4 Convergence rate

Note that the bounded gradient assumption implies Assumption 3 with some σg2τ\sigma_{g}\leq 2\tau, so we can bound the expected norm of the average gradient estimate as

𝔼𝒗¯(t)22\displaystyle{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{2}^{2} =𝔼𝒈¯p(t)22\displaystyle={\mathbb{\bm{E}}}\big{\|}{\overline{\bm{g}}}_{p}^{(t)}\big{\|}_{2}^{2}
=𝔼𝒈¯τ(t)22+σp2dn\displaystyle={\mathbb{\bm{E}}}\big{\|}{\overline{\bm{g}}}_{\tau}^{(t)}\big{\|}_{2}^{2}+\frac{\sigma_{p}^{2}d}{n}
𝔼F(𝑿(t))(1n𝟏n)22+σg2b+σp2dn\displaystyle\leq{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}+\frac{\sigma_{g}^{2}}{b}+\frac{\sigma_{p}^{2}d}{n}
𝔼F(𝑿(t))(1n𝟏n)22+4τ2b+σp2dn.\displaystyle\leq{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}+\frac{4\tau^{2}}{b}+\frac{\sigma_{p}^{2}d}{n}. (24)

We assume

ηL18(1α^)43.\displaystyle\eta L\leq\frac{1}{8}(1-\hat{\alpha})^{\frac{4}{3}}. (25)

Using (23) and (24), the expected utility (9) can be bounded by

1Tt=1T𝔼f(𝒙¯(t))22\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}^{2} 2ΔηT+1TL2nt=1T𝔼𝑿(t)𝒙¯(t)𝟏n𝖥2\displaystyle\leq\frac{2\Delta}{\eta T}+\frac{1}{T}\cdot\frac{L^{2}}{n}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}
+1Tt=1TηL𝔼𝒗¯(t)221Tt=1T𝔼F(𝑿(t))(1n𝟏n)22\displaystyle~{}~{}~{}~{}+\frac{1}{T}\sum_{t=1}^{T}\eta L{\mathbb{\bm{E}}}\|{\overline{\bm{v}}}^{(t)}\|_{2}^{2}-\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}
2ΔηT+1TL2n(1024η2(1α^)4Tn(τ2+σp2d)+8η2(1α^)2t=1Tn𝔼𝒗¯(t)22)\displaystyle\leq\frac{2\Delta}{\eta T}+\frac{1}{T}\cdot\frac{L^{2}}{n}\Big{(}\frac{1024\eta^{2}}{(1-\hat{\alpha})^{4}}Tn(\tau^{2}+\sigma_{p}^{2}d)+\frac{8\eta^{2}}{(1-\hat{\alpha})^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{2}^{2}\Big{)}
+1Tt=1TηL𝔼𝒗¯(t)221Tt=1T𝔼F(𝑿(t))(1n𝟏n)22\displaystyle~{}~{}~{}~{}+\frac{1}{T}\sum_{t=1}^{T}\eta L{\mathbb{\bm{E}}}\|{\overline{\bm{v}}}^{(t)}\|_{2}^{2}-\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}
(i)2ΔηT+1024η2L2(1α^)4(τ2+σp2d)+2ηL(1α^)43Tt=1T𝔼𝒗¯(t)221Tt=1T𝔼F(𝑿(t))(1n𝟏n)22\displaystyle\overset{(\text{i})}{\leq}\frac{2\Delta}{\eta T}+\frac{1024\eta^{2}L^{2}}{(1-\hat{\alpha})^{4}}(\tau^{2}+\sigma_{p}^{2}d)+\frac{2\eta L}{(1-\hat{\alpha})^{\frac{4}{3}}T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\|{\overline{\bm{v}}}^{(t)}\|_{2}^{2}-\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})\big{\|}_{2}^{2}
2ΔηT+1024η2L2(1α^)4(τ2+σp2d)+2ηL(1α^)43(4τ2b+σp2dn)\displaystyle\leq\frac{2\Delta}{\eta T}+\frac{1024\eta^{2}L^{2}}{(1-\hat{\alpha})^{4}}(\tau^{2}+\sigma_{p}^{2}d)+\frac{2\eta L}{(1-\hat{\alpha})^{\frac{4}{3}}}\Big{(}\frac{4\tau^{2}}{b}+\frac{\sigma_{p}^{2}d}{n}\Big{)}
=(ii)2ΔηT+1024η2L2τ2(1α^)4(1+Tϕm2)+8ηLτ2(1α^)43(1+Tϕm2)\displaystyle\overset{(\text{ii})}{=}\frac{2\Delta}{\eta T}+\frac{1024\eta^{2}L^{2}\tau^{2}}{(1-\hat{\alpha})^{4}}(1+T\phi_{m}^{2})+\frac{8\eta L\tau^{2}}{(1-\hat{\alpha})^{\frac{4}{3}}}(1+T\phi_{m}^{2})
=(iii)2ΔηT+2048η2L2τ2(1α^)4+16ηLτ2(1α^)43\displaystyle\overset{(\text{iii})}{=}\frac{2\Delta}{\eta T}+\frac{2048\eta^{2}L^{2}\tau^{2}}{(1-\hat{\alpha})^{4}}+\frac{16\eta L\tau^{2}}{(1-\hat{\alpha})^{\frac{4}{3}}} (26)

where we use (25) for (i), substitute b=1b=1 and σp2d=(τTlog(1/δ)mϵ)2=Tτ2ϕm2\sigma_{p}^{2}d=\Big{(}\frac{\tau\sqrt{T\log(1/\delta)}}{m\epsilon}\Big{)}^{2}=T\tau^{2}\phi_{m}^{2} for (ii), and substitute T=ϕm2T=\phi_{m}^{-2} for (iii).
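As a quick sanity check on the substitutions in (ii) and (iii), the snippet below verifies in exact arithmetic that setting T=ϕm2T=\phi_{m}^{-2} collapses the factor (1+Tϕm2)(1+T\phi_{m}^{2}) to 22, which is why the constants 10241024 and 88 in (26) double to 20482048 and 1616; the numeric value of ϕm\phi_{m} is arbitrary.

```python
from fractions import Fraction

# Exact-arithmetic check of substitutions (ii)-(iii): with T = phi_m^{-2},
# the factor (1 + T*phi_m^2) equals 2, doubling the constants in (26).
phi_m = Fraction(1, 10)      # arbitrary value; only T = phi_m^{-2} matters
T = phi_m ** -2
assert 1 + T * phi_m**2 == 2
assert 1024 * (1 + T * phi_m**2) == 2048
assert 8 * (1 + T * phi_m**2) == 16
```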

We set the step size as

η\displaystyle\eta =γ43(1α)4332ϕmL,\displaystyle=\frac{\gamma^{\frac{4}{3}}(1-\alpha)^{\frac{4}{3}}}{32}\cdot\frac{\phi_{m}}{L},

Then (26) can be further bounded as

1Tt=1T𝔼f(𝒙¯(t))22\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}^{2} 64LΔϕmγ43(1α)43+2τ2ϕm2(1α^)43+τ2ϕm2\displaystyle\leq\frac{64L\Delta\phi_{m}}{\gamma^{\frac{4}{3}}(1-\alpha)^{\frac{4}{3}}}+\frac{2\tau^{2}\phi_{m}^{2}}{(1-\hat{\alpha})^{\frac{4}{3}}}+\frac{\tau^{2}\phi_{m}}{2}
64LΔϕmγ43(1α)43+3τ2ϕm(1α^)43\displaystyle\leq\frac{64L\Delta\phi_{m}}{\gamma^{\frac{4}{3}}(1-\alpha)^{\frac{4}{3}}}+\frac{3\tau^{2}\phi_{m}}{(1-\hat{\alpha})^{\frac{4}{3}}}
67ϕmγ43(1α)43max{τ2,LΔ},\displaystyle\leq\frac{67\phi_{m}}{\gamma^{\frac{4}{3}}(1-\alpha)^{\frac{4}{3}}}\max\big{\{}\tau^{2},L\Delta\big{\}},

where we use Lemma 1 to reach the last inequality.

Lastly, we set the hyperparameter γ\gamma as

γ\displaystyle\gamma =1100(1α)ρ.\displaystyle=\frac{1}{100}(1-\alpha)\rho.

We can now verify that conditions (12), (16), and the step-size condition (25) are all met, which concludes the proof:

γ2ρ210000\displaystyle\gamma^{2}\leq\frac{\rho^{2}}{10000} \displaystyle\Rightarrow (12)
γ4=γ2(1α)2ρ210000(1α^)2ρ210000\displaystyle\gamma^{4}=\gamma^{2}\cdot\frac{(1-\alpha)^{2}\rho^{2}}{10000}\leq\frac{(1-\hat{\alpha})^{2}\rho^{2}}{10000} \displaystyle\Rightarrow (16)
ηL(1α^)4332\displaystyle\eta L\leq\frac{(1-\hat{\alpha})^{\frac{4}{3}}}{32} \displaystyle\Rightarrow (25)
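These implications can also be spot-checked numerically. The sketch below uses arbitrary values of α\alpha and ρ\rho and, since 1α^1-\hat{\alpha} is not pinned down here, plugs in the hypothetical value 1α^=2γ(1α)1-\hat{\alpha}=2\gamma(1-\alpha), consistent with the lower bound γ(1α)1α^\gamma(1-\alpha)\leq 1-\hat{\alpha} that the chain in (16) relies on.

```python
# Illustrative numeric check of the final parameter verification, assuming
# (as the chain in (16) suggests) the bound 1 - alpha_hat >= gamma*(1 - alpha);
# we plug in the hypothetical value 1 - alpha_hat = 2*gamma*(1 - alpha).
alpha, rho = 0.5, 0.8                       # arbitrary values in (0, 1)
gamma = (1 - alpha) * rho / 100             # gamma = (1 - alpha)*rho/100
one_minus_alpha_hat = 2 * gamma * (1 - alpha)

# condition (12): gamma^2 <= rho^2 / 10000
assert gamma**2 <= rho**2 / 10000
# condition (16): gamma^4 <= (1 - alpha_hat)^2 * rho^2 / 10000
assert gamma**4 <= one_minus_alpha_hat**2 * rho**2 / 10000
# step-size condition: eta*L = gamma^{4/3}(1-alpha)^{4/3}/32 satisfies (25)
eta_L = gamma**(4/3) * (1 - alpha)**(4/3) / 32
assert eta_L <= one_minus_alpha_hat**(4/3) / 8
```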

Appendix C Proof of Theorem 3

This section proves Theorem 3 in two subsections. Section C.1 derives the descent inequality using results from Sections B.2 and B.3. Section C.2 first assumes that all expected gradient norms 𝔼f(𝒙¯(t))2{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2} are greater than a threshold ν\nu (i.e., 𝔼f(𝒙¯(t))2ν{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\geq\nu for all t=1,,Tt=1,\ldots,T), then specifies the parameters and proves that the average expected gradient norm satisfies 1Tt=1T𝔼f(𝒙¯(t))2ν\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\leq\nu, which contradicts the assumption and hence proves that the algorithm reaches 𝔼f(𝒙¯(t))2ν{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\leq\nu within TT steps.

C.1 Function value descent

Let δi(t)=ττ+𝒈i(t)2\delta_{i}^{(t)}=\frac{\tau}{\tau+\|{\bm{{g}}}_{i}^{(t)}\|_{2}} and δ(t)=ττ+f(𝒙¯(t))2\delta^{(t)}=\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}. Similarly to Section B.1, using the Taylor expansion and taking expectation conditioned on step tt, we can expand the function value descent as

𝔼t[f(𝒙¯(t+1))f(𝒙¯(t))]\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}_{t}\big{[}f({\overline{\bm{x}}}^{(t+1)})-f({\overline{\bm{x}}}^{(t)})\big{]}
𝔼tf(𝒙¯(t)),η𝒗¯(t+1)+L2𝔼tη𝒗¯(t+1)22\displaystyle\leq{\mathbb{\bm{E}}}_{t}\langle\nabla f({\overline{\bm{x}}}^{(t)}),-\eta{\overline{\bm{v}}}^{(t+1)}\rangle+\frac{L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}\eta{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=η𝔼tf(𝒙¯(t)),𝒈¯p(t+1)+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\eta{\mathbb{\bm{E}}}_{t}\big{\langle}\nabla f({\overline{\bm{x}}}^{(t)}),{\overline{\bm{g}}}_{p}^{(t+1)}\big{\rangle}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=η𝔼tf(𝒙¯(t)),𝒈¯τ(t+1)+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\eta{\mathbb{\bm{E}}}_{t}\big{\langle}\nabla f({\overline{\bm{x}}}^{(t)}),{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\rangle}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=η𝔼tf(𝒙¯(t)),𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))+η𝔼tf(𝒙¯(t)),𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))𝒈¯τ(t+1)+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\eta{\mathbb{\bm{E}}}_{t}\big{\langle}\nabla f({\overline{\bm{x}}}^{(t)}),\mathsf{Clip}_{\tau}(\nabla f({\overline{\bm{x}}}^{(t)}))\big{\rangle}+\eta{\mathbb{\bm{E}}}_{t}\big{\langle}\nabla f({\overline{\bm{x}}}^{(t)}),\mathsf{Clip}_{\tau}(\nabla f({\overline{\bm{x}}}^{(t)}))-{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\rangle}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=ηδ(t)f(𝒙¯(t))22+η𝔼tf(𝒙¯(t)),𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))𝒈¯τ(t+1)+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\eta{\mathbb{\bm{E}}}_{t}\big{\langle}\nabla f({\overline{\bm{x}}}^{(t)}),\mathsf{Clip}_{\tau}(\nabla f({\overline{\bm{x}}}^{(t)}))-{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\rangle}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
ηδ(t)f(𝒙¯(t))22+ηf(𝒙¯(t))2𝔼t𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))𝒈¯τ(t+1)2+η2L2𝔼t𝒗¯(t+1)22.\displaystyle\leq-\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\eta\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}{\mathbb{\bm{E}}}_{t}\big{\|}\mathsf{Clip}_{\tau}(\nabla f({\overline{\bm{x}}}^{(t)}))-{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\|}_{2}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}. (27)
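For concreteness, the smoothed clipping operator 𝖢𝗅𝗂𝗉τ(𝒈)=τ/(τ+𝒈2)𝒈\mathsf{Clip}_{\tau}({\bm{{g}}})=\tau/(\tau+\|{\bm{{g}}}\|_{2})\cdot{\bm{{g}}} appearing above can be sketched in Python; it is exactly a rescaling by the weight δ\delta defined at the start of this subsection, and its output norm never exceeds τ\tau. The vector and threshold below are arbitrary.

```python
import numpy as np

def clip(g, tau):
    """Smoothed clipping: Clip_tau(g) = tau/(tau + ||g||_2) * g."""
    return tau / (tau + np.linalg.norm(g)) * g

rng = np.random.default_rng(0)
tau = 1.0
g = rng.normal(size=5)

c = clip(g, tau)
# the output norm tau*||g||/(tau + ||g||) is always strictly below tau
assert np.linalg.norm(c) < tau
# clip(g, tau) is exactly delta * g with delta = tau/(tau + ||g||_2)
delta = tau / (tau + np.linalg.norm(g))
assert np.allclose(c, delta * g)
```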

The 𝔼t𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))𝒈¯τ(t+1)2{\mathbb{\bm{E}}}_{t}\big{\|}\mathsf{Clip}_{\tau}(\nabla f({\overline{\bm{x}}}^{(t)}))-{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\|}_{2} term in (27) is the error introduced by gradient clipping, which can be analyzed by splitting it into the following four terms

𝔼t𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))𝒈¯τ(t+1)2\displaystyle~{}~{}~{}~{}{\mathbb{\bm{E}}}_{t}\big{\|}\mathsf{Clip}_{\tau}(\nabla f({\overline{\bm{x}}}^{(t)}))-{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\|}_{2}
=𝔼t1ni=1nττ+𝒈i(t)2𝒈i(t)ττ+f(𝒙¯(t))2f(𝒙¯(t))2\displaystyle={\mathbb{\bm{E}}}_{t}\Big{\|}\frac{1}{n}\sum_{i=1}^{n}\frac{\tau}{\tau+\|{\bm{{g}}}_{i}^{(t)}\|_{2}}{\bm{{g}}}_{i}^{(t)}-\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\nabla f({\overline{\bm{x}}}^{(t)})\Big{\|}_{2}
=𝔼t1ni=1n(ττ+𝒈i(t)2𝒈i(t)ττ+fi(𝒙i(t))2𝒈i(t))\displaystyle={\mathbb{\bm{E}}}_{t}\Bigg{\|}\frac{1}{n}\sum_{i=1}^{n}\Big{(}\frac{\tau}{\tau+\|{\bm{{g}}}_{i}^{(t)}\|_{2}}{\bm{{g}}}_{i}^{(t)}-\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}{\bm{{g}}}_{i}^{(t)}\Big{)}
+1ni=1n(ττ+fi(𝒙i(t))2𝒈i(t)ττ+fi(𝒙i(t))2fi(𝒙i(t)))\displaystyle~{}~{}~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\Big{(}\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}{\bm{{g}}}_{i}^{(t)}-\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\Big{)}
+1ni=1n(ττ+fi(𝒙i(t))2fi(𝒙i(t))ττ+f(𝒙¯(t))2fi(𝒙i(t)))\displaystyle~{}~{}~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\Big{(}\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})-\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\Big{)}
+1ni=1n(ττ+f(𝒙¯(t))2fi(𝒙i(t))ττ+f(𝒙¯(t))2f(𝒙¯(t)))2\displaystyle~{}~{}~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\Big{(}\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})-\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\nabla f({\overline{\bm{x}}}^{(t)})\Big{)}\Bigg{\|}_{2}
1ni=1n𝔼t(ττ+𝒈i(t)2ττ+fi(𝒙i(t))2)𝒈i(t)2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Bigg{\|}\Big{(}\frac{\tau}{\tau+\|{\bm{{g}}}_{i}^{(t)}\|_{2}}-\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}\Big{)}{\bm{{g}}}_{i}^{(t)}\Bigg{\|}_{2} (28)
+1ni=1n𝔼tττ+fi(𝒙i(t))2(𝒈i(t)fi(𝒙i(t)))2\displaystyle\qquad\qquad+\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Bigg{\|}\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}\big{(}{\bm{{g}}}_{i}^{(t)}-\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{)}\Bigg{\|}_{2} (29)
+1ni=1n(ττ+fi(𝒙i(t))2ττ+f(𝒙¯(t))2)fi(𝒙i(t))2\displaystyle\qquad\qquad+\frac{1}{n}\sum_{i=1}^{n}\Bigg{\|}\Big{(}\frac{\tau}{\tau+\|\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\|_{2}}-\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\Big{)}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\Bigg{\|}_{2} (30)
+ττ+f(𝒙¯(t))2(1ni=1nfi(𝒙i(t))f(𝒙¯(t)))2.\displaystyle\qquad\qquad+\Bigg{\|}\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})-\nabla f({\overline{\bm{x}}}^{(t)})\Big{)}\Bigg{\|}_{2}. (31)

Next, we bound each term separately using the triangle inequality and Assumptions 3 and 4.

Bound the first term (28) as

1ni=1n𝔼t(ττ+𝒈i(t)2ττ+fi(𝒙i(t))2)𝒈i(t)2\displaystyle~{}~{}~{}~{}\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Bigg{\|}\Big{(}\frac{\tau}{\tau+\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}}-\frac{\tau}{\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}}\Big{)}{\bm{{g}}}_{i}^{(t)}\Bigg{\|}_{2}
=1ni=1n𝔼tτ(𝒈i(t)2fi(𝒙i(t))2)(τ+𝒈i(t)2)(τ+fi(𝒙i(t))2)𝒈i(t)2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Bigg{\|}\frac{\tau(\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}-\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2})}{(\tau+\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2})(\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2})}{\bm{{g}}}_{i}^{(t)}\Bigg{\|}_{2}
=1ni=1n𝔼t(|𝒈i(t)2fi(𝒙i(t))2|ττ+fi(𝒙i(t))2𝒈i(t)2τ+𝒈i(t)2)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Bigg{(}\Big{|}\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}-\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}\Big{|}\cdot\frac{\tau}{\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}}\cdot\frac{\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}}{\tau+\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}}\Bigg{)}
1ni=1n𝔼t|𝒈i(t)2fi(𝒙i(t))2|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Big{|}\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}-\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}\Big{|}
1ni=1n𝔼t(𝒈i(t)2fi(𝒙i(t))2)2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{{\mathbb{\bm{E}}}_{t}\big{(}\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}-\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}\big{)}^{2}}
=1ni=1n𝔼t(𝒈i(t)22+fi(𝒙i(t))222𝒈i(t)2fi(𝒙i(t))2)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\sqrt{{\mathbb{\bm{E}}}_{t}\big{(}\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}^{2}+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}^{2}-2\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}\big{)}}
1ni=1n𝔼t(𝒈i(t)22+fi(𝒙i(t))222𝒈i(t),fi(𝒙i(t)))\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{{\mathbb{\bm{E}}}_{t}\big{(}\big{\|}{\bm{{g}}}_{i}^{(t)}\big{\|}_{2}^{2}+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}^{2}-2\langle{\bm{{g}}}_{i}^{(t)},\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\rangle\big{)}}
=1ni=1n𝔼t𝒈i(t)fi(𝒙i(t))22\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\sqrt{{\mathbb{\bm{E}}}_{t}\big{\|}{\bm{{g}}}_{i}^{(t)}-\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}^{2}}
σgb.\displaystyle\leq\frac{\sigma_{g}}{\sqrt{b}}. (32)
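The chain behind (32), the reverse triangle inequality followed by Jensen's inequality and the mini-batch variance bound σg2/b\sigma_{g}^{2}/b, can be spot-checked by Monte Carlo simulation. The gradient and noise model below are hypothetical stand-ins for fi(𝒙i(t))\nabla f_{i}({\bm{{x}}}_{i}^{(t)}) and 𝒈i(t){\bm{{g}}}_{i}^{(t)}.

```python
import numpy as np

# Hypothetical stand-ins: grad plays nabla f_i(x_i), g the mini-batch estimate
rng = np.random.default_rng(1)
d, b, sigma_g, trials = 10, 4, 0.5, 200_000
grad = rng.normal(size=d)
# zero-mean noise with E||noise||^2 = sigma_g^2 / b (batch of size b)
noise = rng.normal(scale=sigma_g / np.sqrt(d * b), size=(trials, d))
g = grad + noise

lhs = np.abs(np.linalg.norm(g, axis=1) - np.linalg.norm(grad)).mean()
mid = np.sqrt((np.linalg.norm(g - grad, axis=1) ** 2).mean())
assert lhs <= mid + 1e-9                     # reverse triangle inequality + Jensen
assert mid <= sigma_g / np.sqrt(b) + 1e-3    # variance bound, up to Monte Carlo error
```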

Bound the second term (29) as

1ni=1n𝔼tττ+fi(𝒙i(t))2(𝒈i(t)fi(𝒙i(t)))21ni=1nτσg/bτ+fi(𝒙i(t))2σgb.\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\mathbb{\bm{E}}}_{t}\Bigg{\|}\frac{\tau}{\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}}\big{(}{\bm{{g}}}_{i}^{(t)}-\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{)}\Bigg{\|}_{2}\leq\frac{1}{n}\sum_{i=1}^{n}\frac{\tau\sigma_{g}/\sqrt{b}}{\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}}\leq\frac{\sigma_{g}}{\sqrt{b}}. (33)

Bound the third term (30) as

1ni=1n(ττ+fi(𝒙i(t))2ττ+f(𝒙¯(t))2)fi(𝒙i(t))2\displaystyle~{}~{}~{}~{}\frac{1}{n}\sum_{i=1}^{n}\Bigg{\|}\Big{(}\frac{\tau}{\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}}-\frac{\tau}{\tau+\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}}\Big{)}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\Bigg{\|}_{2}
=1ni=1nτ(fi(𝒙i(t))2f(𝒙¯(t))2)(τ+fi(𝒙i(t))2)(τ+f(𝒙¯(t))2)fi(𝒙i(t))2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\Bigg{\|}\frac{\tau\big{(}\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}-\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\big{)}}{\big{(}\tau+\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}\big{)}\big{(}\tau+\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\big{)}}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\Bigg{\|}_{2}
1ni=1nτ|fi(𝒙i(t))2f(𝒙¯(t))2|τ+f(𝒙¯(t))2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\frac{\tau\Big{|}\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}-\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\Big{|}}{\tau+\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}}
1ni=1nδ(t)|fi(𝒙i(t))2fi(𝒙¯(t))2|+1ni=1nδ(t)|fi(𝒙¯(t))2f(𝒙¯(t))2|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\delta^{(t)}\Big{|}\big{\|}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})\big{\|}_{2}-\big{\|}\nabla f_{i}({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\Big{|}+\frac{1}{n}\sum_{i=1}^{n}\delta^{(t)}\Big{|}\big{\|}\nabla f_{i}({\overline{\bm{x}}}^{(t)})\big{\|}_{2}-\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\Big{|}
1ni=1nδ(t)L𝒙i(t)𝒙¯(t)2+1ni=1nδ(t)112f(𝒙¯(t))2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\delta^{(t)}L\big{\|}{\bm{{x}}}_{i}^{(t)}-{\overline{\bm{x}}}^{(t)}\big{\|}_{2}+\frac{1}{n}\sum_{i=1}^{n}\delta^{(t)}\cdot\frac{1}{12}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}
δ(t)Ln𝑿(t)𝒙¯(t)𝟏n𝖥+112f(𝒙¯(t))2,\displaystyle\leq\frac{\delta^{(t)}L}{\sqrt{n}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}+\frac{1}{12}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}, (34)

where we use δ(t)1\delta^{(t)}\leq 1 to reach the last inequality.

Bound the fourth term (31) as

ττ+f(𝒙¯(t))2(1ni=1nfi(𝒙i(t))f(𝒙¯(t)))2\displaystyle~{}~{}~{}~{}\Bigg{\|}\frac{\tau}{\tau+\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bm{{x}}}_{i}^{(t)})-\nabla f({\overline{\bm{x}}}^{(t)})\Big{)}\Bigg{\|}_{2}
=ττ+f(𝒙¯(t))2F(𝑿(t))(1n𝟏n)f(𝒙¯(t))2\displaystyle=\frac{\tau}{\tau+\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}}\big{\|}\nabla F({\bm{{X}}}^{(t)})(\tfrac{1}{n}\bm{1}_{n})-\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}
δ(t)Ln𝑿(t)𝒙¯(t)𝟏n𝖥.\displaystyle\leq\frac{\delta^{(t)}L}{\sqrt{n}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}. (35)

Using (32), (33), (34) and (35), the function value descent inequality (27) becomes

𝔼t[f(𝒙¯(t+1))f(𝒙¯(t))]\displaystyle{\mathbb{\bm{E}}}_{t}\big{[}f({\overline{\bm{x}}}^{(t+1)})-f({\overline{\bm{x}}}^{(t)})\big{]} ηδ(t)f(𝒙¯(t))22+η2L2𝔼t𝒗¯(t+1)22\displaystyle\leq-\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
+ηf(𝒙¯(t))2𝔼t𝖢𝗅𝗂𝗉τ(f(𝒙¯(t)))𝒈¯τ(t+1)2\displaystyle~{}~{}~{}~{}+\eta\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}{\mathbb{\bm{E}}}_{t}\big{\|}\mathsf{Clip}_{\tau}\big{(}\nabla f({\overline{\bm{x}}}^{(t)})\big{)}-{\overline{\bm{g}}}_{\tau}^{(t+1)}\big{\|}_{2}
ηδ(t)f(𝒙¯(t))22+η2L2𝔼t𝒗¯(t+1)22\displaystyle\leq-\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
+ηf(𝒙¯(t))2(2σgb+112δ(t)f(𝒙¯(t))2+2δ(t)Ln𝑿(t)𝒙¯(t)𝟏n𝖥)\displaystyle~{}~{}~{}~{}+\eta\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\Big{(}\frac{2\sigma_{g}}{\sqrt{b}}+\frac{1}{12}\delta^{(t)}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}+\frac{2\delta^{(t)}L}{\sqrt{n}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}\Big{)}
=1112ηδ(t)f(𝒙¯(t))22+η2L2𝔼t𝒗¯(t+1)22\displaystyle=-\frac{11}{12}\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
+ηf(𝒙¯(t))2(2σgb+2δ(t)Ln𝑿(t)𝒙¯(t)𝟏n𝖥)\displaystyle~{}~{}~{}~{}+\eta\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\Big{(}\frac{2\sigma_{g}}{\sqrt{b}}+\frac{2\delta^{(t)}L}{\sqrt{n}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}\Big{)}
512ηδ(t)f(𝒙¯(t))22+η2L2𝔼t𝒗¯(t+1)22\displaystyle\leq-\frac{5}{12}\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\frac{\eta^{2}L}{2}{\mathbb{\bm{E}}}_{t}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
+2ησgbf(𝒙¯(t))2+2δ(t)ηL2n𝑿(t)𝒙¯(t)𝟏n𝖥2,\displaystyle~{}~{}~{}~{}+\frac{2\eta\sigma_{g}}{\sqrt{b}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}+\frac{2\delta^{(t)}\eta L^{2}}{n}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}, (36)

where the last inequality is due to

ηf(𝒙¯(t))22δ(t)Ln𝑿(t)𝒙¯(t)𝟏n𝖥\displaystyle~{}~{}~{}~{}\eta\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\cdot\frac{2\delta^{(t)}L}{\sqrt{n}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}
ηδ(t)212f(𝒙¯(t))222L2n𝑿(t)𝒙¯(t)𝟏n𝖥2\displaystyle\leq\eta\delta^{(t)}\cdot 2\cdot\sqrt{\frac{1}{2}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}}\cdot\sqrt{\frac{2L^{2}}{n}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}}
12ηδ(t)f(𝒙¯(t))22+2δ(t)ηL2n𝑿(t)𝒙¯(t)𝟏n𝖥2.\displaystyle\leq\frac{1}{2}\eta\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}+\frac{2\delta^{(t)}\eta L^{2}}{n}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}.

C.2 Convergence rate

Different from (24), with the use of the gradient clipping operator, we can only bound the expected norm of the average gradient estimate as

𝔼𝒗¯(t)22\displaystyle{\mathbb{\bm{E}}}\|{\overline{\bm{v}}}^{(t)}\|_{2}^{2} =𝔼𝒈¯p(t)22\displaystyle={\mathbb{\bm{E}}}\|{\overline{\bm{g}}}_{p}^{(t)}\|_{2}^{2}
=𝔼𝒈¯τ(t)+𝒆¯(t)22\displaystyle={\mathbb{\bm{E}}}\|{\overline{\bm{g}}}_{\tau}^{(t)}+{\overline{\bm{e}}}^{(t)}\|_{2}^{2}
=𝔼𝒈¯τ(t)22+𝔼𝒆¯(t)22\displaystyle={\mathbb{\bm{E}}}\|{\overline{\bm{g}}}_{\tau}^{(t)}\|_{2}^{2}+{\mathbb{\bm{E}}}\|{\overline{\bm{e}}}^{(t)}\|_{2}^{2}
τ2+σp2dn.\displaystyle\leq\tau^{2}+\frac{\sigma_{p}^{2}d}{n}. (37)
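The cross term in the expansion of (37) vanishes because the injected Gaussian perturbation is zero-mean and independent; this orthogonality, together with the value 𝔼𝒆¯(t)22=σp2d/n{\mathbb{\bm{E}}}\|{\overline{\bm{e}}}^{(t)}\|_{2}^{2}=\sigma_{p}^{2}d/n for an average of nn perturbations, can be verified empirically. The dimensions and noise level below are arbitrary.

```python
import numpy as np

# Empirical check that E||a + e_bar||^2 = ||a||^2 + sigma_p^2*d/n when e_bar
# averages n independent N(0, sigma_p^2 I_d) perturbations (arbitrary sizes).
rng = np.random.default_rng(2)
n, d, sigma_p, trials = 8, 6, 0.3, 100_000
e_bar = rng.normal(scale=sigma_p, size=(trials, n, d)).mean(axis=1)
a = rng.normal(size=d)

emp_noise = (np.linalg.norm(e_bar, axis=1) ** 2).mean()
emp_total = (np.linalg.norm(a + e_bar, axis=1) ** 2).mean()
assert abs(emp_noise - sigma_p**2 * d / n) < 1e-3
assert abs(emp_total - (np.linalg.norm(a)**2 + sigma_p**2 * d / n)) < 1e-2
```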

Let Δ=𝔼[f(𝒙¯(0))]f\Delta={\mathbb{\bm{E}}}\big{[}f({\overline{\bm{x}}}^{(0)})\big{]}-f^{\star}. The techniques are similar to those used in Appendix B, so we can reuse results from Sections B.2 and B.3, namely (23), in the following proof. Taking full expectation, using (37), and summing (36) over t=1,,Tt=1,\ldots,T, we obtain

Δ\displaystyle-\Delta 5η12t=1T𝔼(δ(t)f(𝒙¯(t))22)+2ησgbt=1T𝔼f(𝒙¯(t))2\displaystyle\leq-\frac{5\eta}{12}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\Big{(}\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}\Big{)}+\frac{2\eta\sigma_{g}}{\sqrt{b}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}
+2ηL2nt=1T𝔼𝑿(t)𝒙¯(t)𝟏n𝖥2+η2L2t=1T𝔼𝒗¯(t+1)22\displaystyle\qquad\qquad+\frac{2\eta L^{2}}{n}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\bm{{X}}}^{(t)}-{\overline{\bm{x}}}^{(t)}{\bm{1}}_{n}^{\top}\big{\|}_{\mathsf{F}}^{2}+\frac{\eta^{2}L}{2}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
5η12t=1T𝔼(δ(t)f(𝒙¯(t))22)+2ησgbt=1T𝔼f(𝒙¯(t))2\displaystyle\leq-\frac{5\eta}{12}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\Big{(}\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}\Big{)}+\frac{2\eta\sigma_{g}}{\sqrt{b}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}
+2ηL2n(8η2(1α^)2t=1Tn𝔼𝒗¯(t)22+1024η2(1α^)4Tn(τ2+σp2d))+η2L2t=1T𝔼𝒗¯(t+1)22\displaystyle\qquad\qquad+\frac{2\eta L^{2}}{n}\Big{(}\frac{8\eta^{2}}{(1-\hat{\alpha})^{2}}\sum_{t=1}^{T}n{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t)}\big{\|}_{2}^{2}+\frac{1024\eta^{2}}{(1-\hat{\alpha})^{4}}Tn(\tau^{2}+\sigma_{p}^{2}d)\Big{)}+\frac{\eta^{2}L}{2}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}{\overline{\bm{v}}}^{(t+1)}\big{\|}_{2}^{2}
=5η12t=1T𝔼(δ(t)f(𝒙¯(t))22)+2ησgbt=1T𝔼f(𝒙¯(t))2\displaystyle=-\frac{5\eta}{12}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\Big{(}\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}\Big{)}+\frac{2\eta\sigma_{g}}{\sqrt{b}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}
+16η3L2(1α^)2(τ2+σp2dn)+2048η3L2(1α^)4T(τ2+σp2d)+η2LT2(τ2+σp2dn).\displaystyle\qquad\qquad+\frac{16\eta^{3}L^{2}}{(1-\hat{\alpha})^{2}}\Big{(}\tau^{2}+\frac{\sigma_{p}^{2}d}{n}\Big{)}+\frac{2048\eta^{3}L^{2}}{(1-\hat{\alpha})^{4}}T(\tau^{2}+\sigma_{p}^{2}d)+\frac{\eta^{2}LT}{2}\Big{(}\tau^{2}+\frac{\sigma_{p}^{2}d}{n}\Big{)}. (38)

To be able to analyze the expected clipped gradient norm 𝔼(δ(t)f(𝒙¯(t))22){\mathbb{\bm{E}}}\big{(}\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}\big{)}, we need the monotonicity and convexity properties established in Lemma 2.

Lemma 2.

Let g(x)=xc+xg(x)=\frac{x}{c+x} and h(x)=xg(x)=x2c+xh(x)=xg(x)=\frac{x^{2}}{c+x}. When x0x\geq 0, g(x)g(x) and h(x)h(x) increase monotonically, while g(x)g(x) is concave and h(x)h(x) is convex.

Proof of Lemma 2.

It suffices to evaluate the first-order and second-order derivatives of g(x)g(x) and h(x)h(x).

Because g(x)=(c+x)x(c+x)2=c(c+x)2>0g^{\prime}(x)=\frac{(c+x)-x}{(c+x)^{2}}=\frac{c}{(c+x)^{2}}>0 and h(x)=g(x)+xg(x)0h^{\prime}(x)=g(x)+xg^{\prime}(x)\geq 0, g(x)g(x) and h(x)h(x) increase monotonically.

g(x)g(x) is concave because g′′(x)=2c(c+x)(c+x)4=2c(c+x)3<0g^{\prime\prime}(x)=-\frac{2c(c+x)}{(c+x)^{4}}=-\frac{2c}{(c+x)^{3}}<0.

h(x)h(x) is convex because h′′(x)=2g(x)+xg′′(x)=2c(c+x)22cx(c+x)3=2c2(c+x)3>0h^{\prime\prime}(x)=2g^{\prime}(x)+xg^{\prime\prime}(x)=\frac{2c}{(c+x)^{2}}-\frac{2cx}{(c+x)^{3}}=\frac{2c^{2}}{(c+x)^{3}}>0. ∎
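The claims of Lemma 2 can be double-checked numerically, with monotonicity tested on a grid and concavity/convexity tested via midpoint (Jensen-type) inequalities; the constant cc below is an arbitrary positive value.

```python
import numpy as np

c = 2.0                      # arbitrary positive constant
g = lambda x: x / (c + x)
h = lambda x: x**2 / (c + x)

xs = np.linspace(0.0, 10.0, 1001)
# monotonicity of g and h on [0, inf)
assert np.all(np.diff(g(xs)) >= 0) and np.all(np.diff(h(xs)) >= 0)

# midpoint (Jensen-type) checks: g concave, h convex
rng = np.random.default_rng(3)
x, y = rng.uniform(0, 10, size=(2, 1000))
assert np.all(g((x + y) / 2) >= (g(x) + g(y)) / 2 - 1e-12)
assert np.all(h((x + y) / 2) <= (h(x) + h(y)) / 2 + 1e-12)
```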

Next, we substitute τ=ν\tau=\nu (cf. Theorem 2), and assume that the following inequality holds for all t=1,,Tt=1,\ldots,T:

𝔼f(𝒙¯(t))2ν.\displaystyle{\mathbb{\bm{E}}}\|\nabla f({\overline{\bm{x}}}^{(t)})\|_{2}\geq\nu. (39)

By Lemma 2 and Jensen's inequality, the expectation of the clipped gradients can be lower bounded as

𝔼(δ(t)f(𝒙¯(t))22)\displaystyle{\mathbb{\bm{E}}}\Big{(}\delta^{(t)}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}\Big{)} =𝔼(τf(𝒙¯(t))22τ+f(𝒙¯(t))2)\displaystyle={\mathbb{\bm{E}}}\Big{(}\frac{\tau\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}^{2}}{\tau+\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}}\Big{)}
τ(𝔼f(𝒙¯(t))2)2τ+𝔼f(𝒙¯(t))2\displaystyle\geq\frac{\tau\big{(}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\big{)}^{2}}{\tau+{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}}
τντ+ν𝔼f(𝒙¯(t))2\displaystyle\geq\frac{\tau\nu}{\tau+\nu}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}
=ν2𝔼f(𝒙¯(t))2.\displaystyle=\frac{\nu}{2}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}. (40)
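The two steps in (40), Jensen's inequality for the convex map xτx2/(τ+x)x\mapsto\tau x^{2}/(\tau+x) followed by the monotonicity of xτx/(τ+x)x\mapsto\tau x/(\tau+x), can be spot-checked by simulation. The gamma distribution below is a hypothetical stand-in for the gradient-norm distribution, chosen so its mean exceeds ν\nu.

```python
import numpy as np

rng = np.random.default_rng(4)
nu = tau = 1.0
# hypothetical nonnegative gradient-norm samples with mean 2.0 >= nu
X = rng.gamma(shape=2.0, scale=1.0, size=200_000)
h = lambda x: tau * x**2 / (tau + x)     # convex by Lemma 2

# step 1 (Jensen): E[h(X)] >= h(E[X]), up to Monte Carlo error
assert h(X).mean() >= h(X.mean()) - 1e-3
# step 2 (monotonicity of x -> tau*x/(tau+x)): h(m) >= tau*nu/(tau+nu)*m for m >= nu
assert h(X.mean()) >= tau * nu / (tau + nu) * X.mean() - 1e-12
# with tau = nu, the constant is exactly nu/2
assert tau * nu / (tau + nu) == nu / 2
```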

Using (37), (40), and Assumptions 3 and 4, and setting b=1b=1, we can further bound the RHS of (38) as

Δ\displaystyle-\Delta 5ην24t=1T𝔼f(𝒙¯(t))2+2ησgbt=1T𝔼f(𝒙¯(t))2\displaystyle\leq-\frac{5\eta\nu}{24}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}+\frac{2\eta\sigma_{g}}{\sqrt{b}}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}
+16η3L2(1α^)2(τ2+σp2dn)+2048η3L2(1α^)4T(τ2+σp2d)+η2LT2(τ2+σp2dn)\displaystyle\qquad\qquad+\frac{16\eta^{3}L^{2}}{(1-\hat{\alpha})^{2}}\Big{(}\tau^{2}+\frac{\sigma_{p}^{2}d}{n}\Big{)}+\frac{2048\eta^{3}L^{2}}{(1-\hat{\alpha})^{4}}T(\tau^{2}+\sigma_{p}^{2}d)+\frac{\eta^{2}LT}{2}\Big{(}\tau^{2}+\frac{\sigma_{p}^{2}d}{n}\Big{)} (41)
(i)ην8t=1T𝔼f(𝒙¯(t))2+2048Tη3L2(1α^)4(τ2+σp2d)+3Tη2L1α^(τ2+σp2dn)\displaystyle\overset{\text{(i)}}{\leq}-\frac{\eta\nu}{8}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}+\frac{2048T\eta^{3}L^{2}}{(1-\hat{\alpha})^{4}}(\tau^{2}+\sigma_{p}^{2}d)+\frac{3T\eta^{2}L}{1-\hat{\alpha}}\Big{(}\tau^{2}+\frac{\sigma_{p}^{2}d}{n}\Big{)}
=ην8t=1T𝔼f(𝒙¯(t))2+2048Tη3L2τ2(1α^)4(1+Tϕm2)+3Tη2Lτ21α^(1+Tϕm2),\displaystyle=-\frac{\eta\nu}{8}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}+\frac{2048T\eta^{3}L^{2}\tau^{2}}{(1-\hat{\alpha})^{4}}(1+T\phi_{m}^{2})+\frac{3T\eta^{2}L\tau^{2}}{1-\hat{\alpha}}(1+T\phi_{m}^{2}), (42)

where we use the condition ν24σg\nu\geq 24\sigma_{g} to prove (i) and substitute σp2=Tτ2ϕm2/d\sigma_{p}^{2}=T\tau^{2}\phi_{m}^{2}/d to prove (42).

Reorganizing terms, (42) can be further bounded as

1Tt=1T𝔼f(𝒙¯(t))2\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2} 8ΔηνT+16384η2L2ν(1α^)4(1+Tϕm2)+24ηLν1α^(1+Tϕm2)\displaystyle\leq\frac{8\Delta}{\eta\nu T}+\frac{16384\eta^{2}L^{2}\nu}{(1-\hat{\alpha})^{4}}(1+T\phi_{m}^{2})+\frac{24\eta L\nu}{1-\hat{\alpha}}(1+T\phi_{m}^{2})
(i)8Δϕmην+32768η2L2ν(1α^)4+48ηLν1α^\displaystyle\overset{\text{(i)}}{\leq}\frac{8\Delta\phi_{m}}{\eta\nu}+\frac{32768\eta^{2}L^{2}\nu}{(1-\hat{\alpha})^{4}}+\frac{48\eta L\nu}{1-\hat{\alpha}}
(ii)8Δϕmην+4096ηLν(1α^)83+48ηLν1α^,\displaystyle\overset{\text{(ii)}}{\leq}\frac{8\Delta\phi_{m}}{\eta\nu}+\frac{4096\eta L\nu}{(1-\hat{\alpha})^{\frac{8}{3}}}+\frac{48\eta L\nu}{1-\hat{\alpha}},
8Δϕmην+4144ηLν(1α^)83,\displaystyle\leq\frac{8\Delta\phi_{m}}{\eta\nu}+\frac{4144\eta L\nu}{(1-\hat{\alpha})^{\frac{8}{3}}}, (43)

where we substitute T=ϕm2T=\phi_{m}^{-2} to prove (i), and use (25) for (ii).

Setting η=γ83(1α)838288L\eta=\frac{\gamma^{\frac{8}{3}}(1-\alpha)^{\frac{8}{3}}}{8288L} and γ=1100(1α)ρ\gamma=\frac{1}{100}(1-\alpha)\rho, (43) can be further bounded as

1Tt=1T𝔼f(𝒙¯(t))2\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2} <8288γ83(1α)83192LΔϕmν+γ83(1α)8382884144ν(1α^)83\displaystyle<\frac{8288}{\gamma^{\frac{8}{3}}(1-\alpha)^{\frac{8}{3}}}\cdot\frac{192L\Delta\phi_{m}}{\nu}+\frac{\gamma^{\frac{8}{3}}(1-\alpha)^{\frac{8}{3}}}{8288}\cdot\frac{4144\nu}{(1-\hat{\alpha})^{\frac{8}{3}}}
=8288γ83(1α)83192LΔϕmν+ν2\displaystyle=\frac{8288}{\gamma^{\frac{8}{3}}(1-\alpha)^{\frac{8}{3}}}\cdot\frac{192L\Delta\phi_{m}}{\nu}+\frac{\nu}{2}
<1784LΔϕmγ43(1α)43.\displaystyle<\frac{1784\sqrt{L\Delta\phi_{m}}}{\gamma^{\frac{4}{3}}(1-\alpha)^{\frac{4}{3}}}. (44)

Choosing ν=1784LΔϕmγ43(1α)43\nu=\frac{1784\sqrt{L\Delta\phi_{m}}}{\gamma^{\frac{4}{3}}(1-\alpha)^{\frac{4}{3}}}, (44) simplifies to 1Tt=1T𝔼f(𝒙¯(t))2<ν\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}<\nu, which further implies that there exists some t[T]t\in[T] such that 𝔼f(𝒙¯(t))2<ν{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}<\nu. This contradicts the assumption (39) and establishes the convergence result in the theorem.
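The constants in (44) can be confirmed in exact arithmetic: with the stated choices of η\eta and ν\nu, the first term of (43) carries the coefficient 8288192/17848288\cdot 192/1784, which falls strictly below 892=1784/2892=1784/2, so the two terms sum to strictly less than ν\nu.

```python
from fractions import Fraction

# With eta = gamma^{8/3}(1-alpha)^{8/3}/(8288 L) and the stated nu, the first
# term of (43) carries the coefficient 8288*192/1784, which is strictly below
# 892 = 1784/2, so the two terms of (44) sum to strictly less than nu.
first_coeff = Fraction(8288 * 192, 1784)
assert first_coeff < 892
assert first_coeff + Fraction(1784, 2) < 1784
```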

Lastly, we can verify that conditions (12), (16), and (25) are all met, which concludes the proof.

Appendix D Proof of Theorem 4

This section proves Theorem 4 based on results from Appendix C. We first assume that all expected gradient norms 𝔼f(𝒙¯(t))2{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2} are greater than a threshold ν\nu (i.e., 𝔼f(𝒙¯(t))2ν{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\geq\nu for all t=1,,Tt=1,\ldots,T). Setting b=(24σgν)2b=\big{(}\frac{24\sigma_{g}}{\nu}\big{)}^{2} and σp=0\sigma_{p}=0 in (41), we reach the following descent inequality

Δ\displaystyle-\Delta ην8t=1T𝔼f(𝒙¯(t))2+2048Tη3L2τ2(1α^)4+3Tη2Lτ21α^.\displaystyle\leq-\frac{\eta\nu}{8}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}+\frac{2048T\eta^{3}L^{2}\tau^{2}}{(1-\hat{\alpha})^{4}}+\frac{3T\eta^{2}L\tau^{2}}{1-\hat{\alpha}}. (45)

Reorganizing terms,

1Tt=1T𝔼f(𝒙¯(t))2\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2} 8ΔηνT+16384η2L2ν(1α^)4+24ηLν1α^\displaystyle\leq\frac{8\Delta}{\eta\nu T}+\frac{16384\eta^{2}L^{2}\nu}{(1-\hat{\alpha})^{4}}+\frac{24\eta L\nu}{1-\hat{\alpha}}
<8ΔηνT+2048ηLν(1α^)83+24ηLν1α^,\displaystyle<\frac{8\Delta}{\eta\nu T}+\frac{2048\eta L\nu}{(1-\hat{\alpha})^{\frac{8}{3}}}+\frac{24\eta L\nu}{1-\hat{\alpha}},
8ΔηνT+2072ηLν(1α^)83,\displaystyle\leq\frac{8\Delta}{\eta\nu T}+\frac{2072\eta L\nu}{(1-\hat{\alpha})^{\frac{8}{3}}},

where we use (25) for the second inequality.

Setting η=1L82072\eta=\frac{1}{L}\cdot\sqrt{\frac{8}{2072}} and ν=LΔ(1α^)83T\nu=\sqrt{\frac{L\Delta}{(1-\hat{\alpha})^{\frac{8}{3}}T}}, the average 2\ell_{2} norm of the gradients can be bounded as

1Tt=1T𝔼f(𝒙¯(t))2\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2} <LΔ(1α^)83T=ν,\displaystyle<\sqrt{\frac{L\Delta}{(1-\hat{\alpha})^{\frac{8}{3}}T}}=\nu,

which implies that there exists some t[T]t\in[T] such that 𝔼f(𝒙¯(t))2<ν{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}<\nu. This contradicts the assumption and proves that PORTER-GC reaches 𝔼f(𝒙¯(t))2ν{\mathbb{\bm{E}}}\big{\|}\nabla f({\overline{\bm{x}}}^{(t)})\big{\|}_{2}\leq\nu within TT iterations.