This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Variance-Reduced Gradient Estimator for Nonconvex Zeroth-Order Distributed Optimization

Huaiyi Mu1, Yujie Tang1, and Zhongkui Li1 1The authors are with the College of Engineering, Peking University, China (e-mail: [email protected], [email protected], [email protected]).
Abstract

This paper investigates distributed zeroth-order optimization for smooth nonconvex problems. We propose a novel variance-reduced gradient estimator, which randomly renovates one orthogonal direction of the true gradient in each iteration while leveraging historical snapshots for variance correction. By integrating this estimator with gradient tracking mechanism, we address the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation that exists in current zeroth-order distributed optimization algorithms, which rely on either the 2-point or 2d2d-point gradient estimators. We derive a convergence rate of 𝒪(d52/m)\mathcal{O}(d^{\frac{5}{2}}/m) for smooth nonconvex functions in terms of sampling number mm and problem dimension dd. Numerical simulations comparing our algorithm with existing methods confirm the effectiveness and efficiency of the proposed gradient estimator.

I INTRODUCTION

We consider a multi-agent system with NN agents, where the agents are connected by a communication network that allows them to exchange information for decentralized decision-making. The goal of this group of agents is to collaboratively solve the following consensus optimization problem in a decentralized manner:

minxdf(x)1Ni=1Nfi(x).\displaystyle\min_{x\in\mathbb{R}^{d}}\ f(x)\coloneqq\frac{1}{N}\sum_{i=1}^{N}f_{i}(x). (1)

Here xdx\in\mathbb{R}^{d} is the global decision variable. Each function fi:df_{i}:\mathbb{R}^{d}\to\mathbb{R} represents the local objective function for agent ii, known only to the agent itself; fif_{i} is assumed to be smooth but may be nonconvex. Each agent can only use zeroth-order information of fif_{i} during the optimization procedure.

Decentralized optimization has gained significant attention due to its wide range of applications including multi-agent system coordination [1], power systems [2], communication networks [3], etc. For smooth and convex objective functions, the decentralized gradient descent (DGD) algorithm with a decreasing step-size achieved a convergence rate of 𝒪(logtt)\mathcal{O}(\frac{\log t}{\sqrt{t}}) [4], while [5] and subsequent works proposed gradient tracking (GT) mechanisms with a fixed step-size, achieving a sublinear convergence rate of 𝒪(1t)\mathcal{O}(\frac{1}{t}), which is comparable to centralized gradient descent method. In many real-world applications, the objective functions can be nonconvex, and distributed nonconvex optimization has applications in areas such as machine learning [6] and robotics control [7]. For smooth nonconvex functions, DGD achieves convergence to a stationary point with a rate of 𝒪(1t)\mathcal{O}(\frac{1}{\sqrt{t}}) [8, 9], while various GT methods can achieve convergence to a stationary point with a rate of 𝒪(1t)\mathcal{O}(\frac{1}{t}) [10].

The aforementioned optimization algorithms rely on first-order information. However, in some scenarios, the gradient is unavailable or expensive to obtain, and agents can only access zeroth-order information of the objective functions. Such scenarios arise in optimization with black-box models [11], optimization with bandit feedback [12], fine-tuning language models [13], etc. To address this issue, various gradient-free methods, particularly algorithms based on zeroth-order gradient estimators, have attracted considerable attention [14]. The work [15] investigated the 2-point zeroth-order gradient estimator, which produces a biased stochastic gradient by using the function values of two randomly sampled points. In [16], the 2d2d-point gradient estimator was proposed, where dd is the dimension of the state variable for each agent. [17] combined the 2-point gradient estimator with DGD and the 2d2d-point gradient estimator with GT for nonconvex multi-agent optimization, which lead to convergence rates that are comparable with their first-order counterparts. However, [17] also argued that there appears to be a trade-off between the convergence rate and the sampling cost per zeroth-order gradient estimation when combining zeroth-order gradient estimation techniques with distributed optimization frameworks. This trade-off arises from the high sampling burden of the 2d2d-point estimator and the inherent variance of the 2-point estimator in distributed settings.

To address this trade-off, we aim to design a variance-reduced zeroth-order gradient estimator with a scalable sampling number of function values that is independent of the dimension dd. Variance reduction is widely applied in machine learning [18] and stochastic optimization [19]. In [20], variance reduction was used for centralized stochastic gradient descent with strongly convex objectives, achieving a linear convergence rate. [21] applied a 2-point gradient estimator and employed variance reduction for zeroth-order nonconvex centralized stochastic optimization, achieving sublinear convergence. Note that these works only focused on centralized problems. Decentralized finite-sum minimization problems were considered in [22], where variance reduction was used to accelerate convergence; we note that in this work, the variance reduction technique was employed to reduce the variance caused by the finite-sum structure, rather than to address the inherent variance of the 2-point zeroth-order gradient estimator.

In this paper, we propose a new distributed zeroth-order optimization method that incorporates variance reduction techniques as well as the gradient tracking framework, to address the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation in existing zeroth-order distributed optimization algorithms. Specifically, We employ the variance reduction (VR) mechanism to design a novel variance-reduced gradient estimator for distributed nonconvex zeroth-order optimization problems, as formulated in (1). We then combine this new zeroth-order gradient estimation method with the gradient tracking framework, and the resulting algorithm is able to achieve both fast convergence and low sampling cost per zeroth-order gradient estimation. To the best of the authors’ knowledge, this is the first work that attempts to address the aforementioned trade-off for general zeroth-order distributed optimization problems. We also provide rigorous convergence analysis of our proposed algorithm under the smoothness assumption. Although the derived oracle complexities have a higher dependence on the dimension dd compared to the algorithms in [17], numerical experiments demonstrate that our proposed algorithm enjoys superior convergence speed and accuracy, reaching lower optimization errors with the same number of samples compared to existing zeroth-order distributed optimization algorithms [17, 23].

Notations: The set of positive integers up to mm is denoted as [m]={1,2,,m}[m]=\{1,2,\cdots,m\}. The ii-th component of a vector xx is denoted as [x]i[x]_{i}. The spectral norm and spectral radius of a matrix AA are represented by σ(A)\sigma(A) and ρ(A)\rho(A), respectively. For a vector xdx\in\mathbb{R}^{d}, x\|x\| refers to the Euclidean norm, and xπ=maxi|[x]i|/[π]i\|x\|_{\infty}^{\pi}=\max_{i}\left|[x]_{i}\right|/[\pi]_{i} refers to the weighted infinity norm, where πd\pi\in\mathbb{R}^{d} has all positive components. For a matrix AA, A2\|A\|_{2} represents the matrix norm induced by \|\cdot\|, and Aπ\interleave A\interleave_{\infty}^{\pi} represents the matrix norm induced by π\|\cdot\|_{\infty}^{\pi}. For two matrices MM and NN, MNM\otimes N denotes the Kronecker product. We denote 𝔹d\mathbb{B}_{d} as the closed unit ball in d\mathbb{R}^{d}, and 𝕊d1={xd:x=1}\mathbb{S}_{d-1}=\{x\in\mathbb{R}^{d}:\|x\|=1\} as the unit sphere. 𝒰()\mathcal{U}(\cdot) denotes the uniform distribution.

II Formulation And Preliminaries

II-A Problem Formulation

We consider a network consisting of NN agents connected via an undirected communication network. The topology of the network is represented by the graph 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), where 𝒩\mathcal{N} and \mathcal{E} represent the set of agents and communication links, respectively. The distributed consensus optimization problem (1) can be equivalently reformulated as follows:

minx1,,xNd\displaystyle\min_{x_{1},\ldots,x_{N}\in\mathbb{R}^{d}} 1Ni=1Nfi(xi)\displaystyle\frac{1}{N}\sum_{i=1}^{N}f_{i}(x_{i}) (2)
s.t. x1=x2==xN,\displaystyle x_{1}=x_{2}=\cdots=x_{N},

where xidx_{i}\in\mathbb{R}^{d} now represents the local decision variable of agent ii, and the constraint x1==xNx_{1}=\cdots=x_{N} requires the agents to achieve global consensus for the final decision. During the optimization procedure, each agent may obtain other agents’ information only via exchanging messages with their neighbors in the communication network. We further impose the restriction that only zeroth-order information of the local objective function is available to each agent. In other words, in each iteration, agent ii can query the function values of fif_{i} at finitely many points.

The following assumption will be employed later in this paper.

Assumption 1.

Each fi:df_{i}:\mathbb{R}^{d}\to\mathbb{R} is LL-smooth, i.e., we have

fi(x)fi(y)Lxy\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\| (3)

for all x,ydx,y\in\mathbb{R}^{d} and i=1,,Ni=1,\ldots,N. Furthermore, f=infxdf(x)>f^{*}=\inf_{x\in\mathbb{R}^{d}}f(x)>-\infty.

II-B Preliminaries on Distributed Zeroth-Order Optimization

When gradient information of the objective function is unavailable, one may construct gradient estimators by sampling the function values at a finite number of points, which has been shown to be a very effective approach by existing literature. We first briefly introduce two types of gradient estimators [17] that are commonly used in noiseless distributed optimization.

Let h:dh:\mathbb{R}^{d}\rightarrow\mathbb{R} be a continuously differentiable function. One version of the 2-point zeroth-order gradient estimator for h(x)\nabla h(x) has the following form:

Gh(2)(x,u,z)=dh(x+uz)h(xuz)2uz,\displaystyle G_{h}^{(2)}(x,u,z)=d\cdot\frac{h(x+uz)-h(x-uz)}{2u}z, (4)

where uu is a positive scalar called the smoothing radius and zz is a random vector sampled from the distribution 𝒰(𝕊d1)\mathcal{U}(\mathbb{S}_{d-1}). One can show that the expectation of the 2-point gradient estimator is the gradient of a smoothed version of the original function [24, 25], i.e.,

𝔼z𝒰(𝕊d1)[Gh(2)(x,u,z)]=hu(x),\mathbb{E}_{z\thicksim\mathcal{U}(\mathbb{S}_{d-1})}[G_{h}^{(2)}(x,u,z)]=\nabla h^{u}(x),

where hu(x)=𝔼y𝒰(𝔹d)[h(x+uy)]h^{u}(x)=\mathbb{E}_{y\thicksim\mathcal{U}(\mathbb{B}_{d})}[h(x+uy)]. As the smoothing radius uu tends to zero, the expectation of the 2-point gradient estimator approaches to the true gradient h(x)\nabla h(x).

By combining the simple 2-point gradient estimator (4) with the decentralized gradient descent framework, one obtains the following algorithm for distributed zeroth-order consensus optimization (2):

xik+1=j=1NWij(xjkηtGfj(2)(xjk,uk,zjk)),x_{i}^{k+1}=\sum_{j=1}^{N}W_{ij}\!\left(x^{k}_{j}-\eta_{t}\,G_{f_{j}}^{(2)}(x^{k}_{j},u^{k},z^{k}_{j})\right), (5)

which we shall call DGD-2p in this paper. Here xikx^{k}_{i} denotes the local decision variable of agent ii at the kk-th iteration, WN×NW\in\mathbb{R}^{N\times N} is a weight matrix that is taken to be doubly stochastic, and ηk\eta_{k} is the step-size at iteration kk. Since each construction of the 2-point gradient estimator (4) requires sampling only two function values, we can see that DGD-2p can achieves low sampling cost per zeroth-order gradient estimation. However, as shown by [17], DGD-2p achieves a relatively slow convergence rate O(d/mlogm)O(\sqrt{d/m}\log m), where mm denotes the number of function value queries. [17] argued that this slow convergence rate is mainly due to the inherent variance of the 2-point gradient estimator, bounded by

𝔼z𝒰(𝕊d1)[Gh(2)(x,u,z)2]dh(x)2+u2L2d2\mathbb{E}_{z\thicksim\mathcal{U}(\mathbb{S}_{d-1})}\!\left[\|G_{h}^{(2)}(x,u,z)\|^{2}\right]\lesssim d\|\nabla h(x)\|^{2}+u^{2}L^{2}d^{2}

under the assumption that function hh is LL-smooth. In a distributed optimization algorithm, each agent’s local gradient fi(xik)\nabla f_{i}(x^{k}_{i}) does not vanish to zero even if the system has reached consensus and optimality. Consequently, the inherent variance of 2-point gradient estimator is inevitable and will considerably slow down the convergence rate.

To achieve higher accuracy for zeroth-order gradient estimation, existing literature has also proposed the 2d2d-point gradient estimator:

Gh(2d)(x,u)=l=1dh(x+uel)h(xuel)2uel.\displaystyle G_{h}^{(2d)}(x,u)=\sum_{l=1}^{d}\frac{h(x+ue_{l})-h(x-ue_{l})}{2u}e_{l}. (6)

Here elde_{l}\in\mathbb{R}^{d} is the ll-th standard basis vector such that [el]j=1[e_{l}]_{j}=1 when j=lj=l and [el]j=0[e_{l}]_{j}=0 otherwise. It can be shown that Gh(2d)(x,u)h(x)12uLd\|G_{h}^{(2d)}(x,u)-\nabla h(x)\|\leq\frac{1}{2}uL\sqrt{d} when hh is LL-smooth (see, e.g., [17]). Consequently, if we assume the function values of hh can be obtained accurately and machine precision issues in numerical computations are ignored, then the 2d2d-point gradient estimator can achieve arbitrarily high accuracy when approximating the true gradient. By combining (6) with the distributed gradient tracking method, one obtains the following algorithm:

xik+1=\displaystyle x^{k+1}_{i}={} j=1NWij(xjkηsjk),\displaystyle\sum_{j=1}^{N}W_{ij}\!\left(x^{k}_{j}-\eta\,s^{k}_{j}\right), (7)
sik+1=\displaystyle s^{k+1}_{i}={} j=1NWij(sjk+Gfj(2d)(xjk+1,uk+1)Gfj(2d)(xjk,uk)),\displaystyle\sum_{j=1}^{N}W_{ij}\!\left(s^{k}_{j}\!+\!G_{f_{j}}^{(2d)}(x^{k+1}_{j},u^{k+1})\!-\!G_{f_{j}}^{(2d)}(x^{k}_{j},u^{k})\right),

which we shall call GT-2d2d. Here the auxiliary state variable sjks^{k}_{j} in (7) tracks the global gradient across iterations. Distributed zeroth-order optimization algorithms that utilize the 2d2d-point gradient estimator, such as GT-2d2d, can achieve faster convergence due to precise estimation of the true gradients that allows further incorporation of gradient tracking techniques. However, GT-2d2d has higher sampling cost per gradient estimation compared to DGD-2p: As shown in (6), 2d2d points need to be sampled for each construction of the gradient estimator. This high sampling cost may lead to poor scalability when the dimension dd is large.

We remark that the 2d2d-point gradient estimator (6) can also be interpreted as the expectation of the following coordinate-wise gradient estimator:

Gh(c)(x,u,l)=dh(x+uel)h(xuel)2uel,l[d],\displaystyle G_{h}^{(c)}(x,u,l)=d\cdot\frac{h(x+ue_{l})-h(x-ue_{l})}{2u}e_{l},\ \ l\in[d], (8)

and we have

Gh(2d)(x,u)=𝔼l𝒰[d][Gf(c)(x,u,l)],\displaystyle G_{h}^{(2d)}(x,u)=\mathbb{E}_{l\sim\mathcal{U}[d]}\!\left[G_{f}^{(c)}(x,u,l)\right], (9)

where 𝒰[d]\mathcal{U}[d] denotes the discrete uniform distribution over the set {1,,d}\{1,\ldots,d\}. The coordinate-wise gradient estimator in (8) shares the similar structure with the 2-point gradient estimator in (4). The key difference is that in (8), we restrict the perturbation direction uelue_{l} to lie in the dd orthogonal directions associated with the standard basis, instead of uniformly sampled from the unit sphere.

III Our Algorithm

To address the trade-off between convergence rate and sampling cost per gradient estimation in zeroth-order distributed optimization, we employ a variance reduction mechanism [20] to design an improved gradient estimator. The intuition is to combine the best of both worlds, i.e., the precise approximation feature of the 2d2d-point gradient estimator and the low-sampling feature of the 2-point gradient estimator.

Let kk denote the iteration number. For each agent, we keep a snapshot point x~ik\tilde{x}_{i}^{k} together with its associate smoothing radius u~ik\tilde{u}_{i}^{k} at which we have conducted relatively accurate gradient estimation. We then use a random variable ζik\zeta_{i}^{k} generated from the Bernoulli distribution Ber(p)\mathrm{Ber}(p) as an activation indicator for the update of the snapshot point x~ik\tilde{x}_{i}^{k}: When ζik=1\zeta_{i}^{k}=1, agent ii updates the snapshot point to be the current iterate, i.e., x~ik=xik\tilde{x}_{i}^{k}=x_{i}^{k} and u~ik=uik\tilde{u}_{i}^{k}=u_{i}^{k}, and takes a snapshot of the 2d2d-point gradient estimator Gfi(2d)(x~ik,u~ik)G_{f_{i}}^{(2d)}(\tilde{x}_{i}^{k},\tilde{u}_{i}^{k}) which provides a more accurate gradient estimation at iteration kk. When ζik=0\zeta_{i}^{k}=0, the snapshot point x~ik\tilde{x}_{i}^{k} together with the smoothing radius u~ki\tilde{u}^{i}_{k} remain unchanged from the previous iteration, i.e., x~ik=x~ik1,u~ik=u~ik1\tilde{x}_{i}^{k}=\tilde{x}_{i}^{k-1},\tilde{u}_{i}^{k}=\tilde{u}_{i}^{k-1}.

In each iteration, agent ii constructs the coordinate-wise gradient estimators Gfi(c)(xik,uik,lik)G_{f_{i}}^{(c)}(x_{i}^{k},u_{i}^{k},l_{i}^{k}) and Gfi(c)(x~ik,u~ik,lik)G_{f_{i}}^{(c)}(\tilde{x}_{i}^{k},\tilde{u}_{i}^{k},l_{i}^{k}) at the two points xikx^{k}_{i} and x~ik\tilde{x}^{k}_{i}. We then propose the following variance-reduced gradient estimator (VR-GE):

gfik=\displaystyle g_{f_{i}}^{k}={} Gfi(c)(xik,uik,lik)Gfi(c)(x~ik,u~ik,lik)\displaystyle G_{f_{i}}^{(c)}(x_{i}^{k},u_{i}^{k},l_{i}^{k})-G_{f_{i}}^{(c)}(\tilde{x}_{i}^{k},\tilde{u}_{i}^{k},l_{i}^{k}) (10)
+Gfi(2d)(x~ik,u~ik),\displaystyle+G_{f_{i}}^{(2d)}(\tilde{x}_{i}^{k},\tilde{u}_{i}^{k}),

where lik[d]l_{i}^{k}\in[d] is randomly selected by each agent ii at each iteration kk. By (9), we can see that

𝔼[gfik|xik,x~ik]\displaystyle\mathbb{E}\left[g_{f_{i}}^{k}\mkern 2.0mu\middle|\mkern 2.0mux^{k}_{i},\tilde{x}^{k}_{i}\right]
=\displaystyle={} 𝔼[Gfi(c)(xik,uik,lik)|xik,x~ik]𝔼[Gfi(c)(x~ik,u~ik,lik)|xik,x~ik]\displaystyle\mathbb{E}\left[G_{f_{i}}^{(c)}(x_{i}^{k},u_{i}^{k},l_{i}^{k})\mkern 2.0mu\middle|\mkern 2.0mux^{k}_{i},\tilde{x}^{k}_{i}\right]-\mathbb{E}\left[G_{f_{i}}^{(c)}(\tilde{x}_{i}^{k},\tilde{u}_{i}^{k},l_{i}^{k})\mkern 2.0mu\middle|\mkern 2.0mux^{k}_{i},\tilde{x}^{k}_{i}\right]
+Gfi(2d)(x~ik,u~ik)\displaystyle+G_{f_{i}}^{(2d)}(\tilde{x}_{i}^{k},\tilde{u}_{i}^{k})
=\displaystyle={} Gfi(2d)(xik,uik),\displaystyle G_{f_{i}}^{(2d)}(x^{k}_{i},u^{k}_{i}), (11)

i.e., VR-GE provides a stochastic gradient of fif_{i} with a bias Gfi(2d)(xik,uik)fi(xik)12uikLd\|G_{f_{i}}^{(2d)}(x^{k}_{i},u^{k}_{i})-\nabla f_{i}(x^{k}_{i})\|\leq\frac{1}{2}u^{k}_{i}L\sqrt{d}, assuming fif_{i} is LL-smooth.

The expected number of function value samples required per construction of VR-GE is 4+2dp4+2dp. When p=C2dp=\frac{C}{2d} for some absolute constant CC, this becomes 4+C4+C which is independent of the dimension dd. This gives VR-GE the potential to decrease the sampling cost in large-dimensional zeroth-order distributed optimization by appropriately adjusting the probability pp. In the following section, specifically in Lemma 1, we will rigorously analyze the variance of VR-GE and demonstrate its variance reduction property.

In designing our distributed zeroth-order optimization algorithm, we further leverage the gradient tracking mechanisms. Existing literature (including [5, 26, 27], etc.) has demonstrated that gradient tracking mechanisms help mitigate the gap in the convergence rates between distributed optimization and centralized optimization when the objective function is smooth. Drawing inspiration from this advantage, we incorporate the variance-reduced gradient estimator with gradient tracking mechanism to design our algorithm.

The details of the proposed algorithm are outlined in Algorithm 1. Here α>0\alpha>0 is the step-size; Steps 1 and 6 implement the gradient tracking mechanism, while Steps 2–5 implement our proposed variance-reduced gradient estimator (10). The convergence guarantees of Algorithm 1 will be provided and discussed in the next section.

Algorithm 1 Distributed Zeroth-Order Optimization Algorithm with Variance Reduced Gradient Tracking Estimator
  Initialization : x~i0=xi0d,si0=gfi0=0\tilde{x}_{i}^{0}=x_{i}^{0}\in\mathbb{R}^{d},s_{i}^{0}=g_{f_{i}}^{0}=0.
  for k=0,1,2,k=0,1,2,\cdots do
   for each i[N]i\in[N] do
   1. Update xik+1x_{i}^{k+1} by
xik+1=j=1NWij(xjkαsjk).x_{i}^{k+1}=\sum_{j=1}^{N}W_{ij}(x_{j}^{k}-\alpha s_{j}^{k}).
   2. Select lik+1l_{i}^{k+1} uniformly at random from [d][d].
   3. Generate ξik+1Ber(p).\xi_{i}^{k+1}\sim\mathrm{Ber}(p).
   4. If ζik+1=1\zeta_{i}^{k+1}=1, update x~ik+1=xik+1,u~ik+1=uik+1\tilde{x}_{i}^{k+1}=x_{i}^{k+1},\,\tilde{u}_{i}^{k+1}=u_{i}^{k+1};  4. If ζik+1=0\zeta_{i}^{k+1}=0, update x~ik+1=x~ik,u~ik+1=u~ik\tilde{x}_{i}^{k+1}=\tilde{x}_{i}^{k},\,\tilde{u}_{i}^{k+1}=\tilde{u}_{i}^{k}.
   5. Construct the VR-GE gfik+1g_{f_{i}}^{k+1} by
gfik+1=\displaystyle g_{f_{i}}^{k+1}={} Gfi(c)(xik+1,uik+1,lik+1)\displaystyle G_{f_{i}}^{(c)}(x_{i}^{k+1},u_{i}^{k+1},l_{i}^{k+1})
Gfi(c)(x~ik+1,u~ik+1,lik+1)\displaystyle-G_{f_{i}}^{(c)}(\tilde{x}_{i}^{k+1},\tilde{u}_{i}^{k+1},l_{i}^{k+1})
+Gfi(2d)(x~ik+1,u~ik+1).\displaystyle+G_{f_{i}}^{(2d)}(\tilde{x}_{i}^{k+1},\tilde{u}_{i}^{k+1}).
   6. Update sik+1s^{k+1}_{i} by
sik+1=\displaystyle s_{i}^{k+1}={} j=1NWij(sjk+gfjk+1gfjk).\displaystyle\sum_{j=1}^{N}W_{ij}(s_{j}^{k}+g_{f_{j}}^{k+1}-g_{f_{j}}^{k}).
   end
  end

IV Main Results

In this section, we present the convergence result of Algorithm 1 under Assumption 1.

For the subsequent analysis, we denote

xk=[x1kxNk],x~k=[x~1kx~Nk],sk=[s1ksNk],x^{k}=\begin{bmatrix}x^{k}_{1}\\ \vdots\\ x^{k}_{N}\end{bmatrix},\quad\tilde{x}^{k}=\begin{bmatrix}\tilde{x}^{k}_{1}\\ \vdots\\ \tilde{x}^{k}_{N}\end{bmatrix},\quad s^{k}=\begin{bmatrix}s^{k}_{1}\\ \vdots\\ s^{k}_{N}\end{bmatrix},

and define the following quantities:

δk\displaystyle\delta^{k} =𝔼[f(x¯k)]f,\displaystyle=\mathbb{E}[f(\bar{x}^{k})]-f^{*}, Exk\displaystyle E_{x}^{k} =𝔼[xk𝟏Nx¯k2],\displaystyle=\mathbb{E}[\|x^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k}\|^{2}],
Ex~k\displaystyle E_{\tilde{x}}^{k} =𝔼[x~k𝟏Nx¯k2],\displaystyle=\mathbb{E}[\|\tilde{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k}\|^{2}], Esk\displaystyle E_{s}^{k} =𝔼[sk𝟏Ng¯k2],\displaystyle=\mathbb{E}[\|s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k}\|^{2}],

where x¯k=1Ni=1Nxik\bar{x}^{k}=\frac{1}{N}\sum_{i=1}^{N}x_{i}^{k} and g¯k=1Ni=1Ngfik\bar{g}^{k}=\frac{1}{N}\sum_{i=1}^{N}g_{f_{i}}^{k}. Here, δk\delta^{k} quantifies the optimality gap in terms of the objective value, ExkE_{x}^{k} and Ex~kE_{\tilde{x}}^{k} characterize the consensus errors, and EskE_{s}^{k} characterizes the tracking error. We also denote σ=W1N𝟏N𝟏NT2\sigma=\|W-\frac{1}{N}\mathbf{1}_{N}\mathbf{1}_{N}^{T}\|_{2}.

Theorem 1.

Under Assumption 1, suppose the parameters of Algorithm 1 satisfy p(1σ2d,1]p\in\big{(}\frac{1-\sigma^{2}}{d},1\big{]}, τ=0(duiτ)2<\sum_{\tau=0}^{\infty}(du_{i}^{\tau})^{2}<\infty, uiku_{i}^{k} is non-increasing, and

αLmin{\mfrac16d,\mfrac112d(\mfrac2+σ21p3),\mfrac(1σ2)321629d52}\displaystyle\alpha L\leq\min\!\left\{\mfrac{1}{6\sqrt{d}},\mfrac{1}{12\sqrt{d}}\bigg{(}\!\sqrt{\mfrac{2\!+\!\sigma^{2}}{1\!-\!p}}-\sqrt{3}\bigg{)},\mfrac{(1\!-\!\sigma^{2})^{3}}{216\sqrt{29}d^{\frac{5}{2}}}\right\}

Then we have

1kτ=0k1𝔼[f(x¯τ)2]1k(6αδ0+6LR029dNα+36L2Ru),\displaystyle\frac{1}{k}\sum_{\tau=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]\leq\frac{1}{k}\!\left(\frac{6}{\alpha}\delta^{0}+\frac{6LR_{0}}{\sqrt{29d}N\alpha}+36L^{2}R_{u}\right)\!,

and

1kτ=0k11Ni=1N𝔼[xiτx¯τ2]\displaystyle\frac{1}{k}\sum_{\tau=0}^{k-1}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[\|x_{i}^{\tau}-\bar{x}^{\tau}\|^{2}]
\displaystyle\leq{} 1k(429δ0(1σ2)Ld+8R0(1σ2)Nd+7Ru54d2),\displaystyle\frac{1}{k}\left(\frac{4\sqrt{29}\delta^{0}}{(1-\sigma^{2})L\sqrt{d}}+\frac{8R_{0}}{(1-\sigma^{2})Nd}+\frac{7R_{u}}{54d^{2}}\right),
1kτ=0k11Ni=1N𝔼[siτf(x¯τ)2]\displaystyle\frac{1}{k}\sum_{\tau=0}^{k-1}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[\|s_{i}^{\tau}-\nabla f(\bar{x}^{\tau})\|^{2}]
\displaystyle\leq{} 1k(580δ0(1σ2)2α+4029LR0(1σ2)Nαd+29LRu10αd32),\displaystyle\frac{1}{k}\left(\frac{580\delta^{0}}{(1-\sigma^{2})^{2}\alpha}+\frac{40\sqrt{29}LR_{0}}{(1-\sigma^{2})N\alpha\sqrt{d}}+\frac{\sqrt{29}LR_{u}}{10\alpha d^{\frac{3}{2}}}\right),

where R0=d1σ2Ex0R_{0}=\frac{d}{1-\sigma^{2}}E_{x}^{0}, Ru=1p(dui0)2+τ=1(duiτ)2R_{u}=\frac{1}{p}(du_{i}^{0})^{2}+\sum_{\tau=1}^{\infty}(du_{i}^{\tau})^{2}.

Proof.

See Section V. ∎

Remark 1.

The convergence rate of Algorithm 1 under Assumption 1 is 𝒪(1k)\mathcal{O}(\frac{1}{k}), which aligns with the rate achieved for distributed nonconvex optimization with gradient tracking using first-order information [10]. In addition, each iteration of VR-GE requires 4+2dp4+2dp function value queries on average. As long as p<12dp<1-\frac{2}{d}, the averaged sampling number for VR-GE is less then that for the 2d2d-point gradient estimator.

Remark 2.

Assuming p1dp\propto\frac{1}{d} and that all agents start from the same initial point, the convergence rate becomes 𝒪(d52/m)\mathcal{O}\big{(}d^{\frac{5}{2}}/m\big{)} with respect to the number of function value queries mm, which can be justified by simple algebraic calculation. Although this rate is not as favorable as the 𝒪(d/m)\mathcal{O}(d/m) rate of GT-2d2d in terms of the dependence on the problem dimension dd, we suspect that this discrepancy is primarily due to the complicated analysis procedure; the conditions in Theorem 1 serve as sufficient but not necessary conditions for convergence, meaning that the 𝒪(d52/m)\mathcal{O}\big{(}d^{\frac{5}{2}}/m\big{)} rate may not be a theoretically tight bound. As demonstrated in the simulation section, Algorithm 1 converges faster than GT-2d2d and achieves higher accuracy with the same number of samples. Deriving a tighter convergence result remains a topic for future work.

Next, we present the convergence rate when the probability pp is fixed as 1/d1/d in the following corollary.

Corollary 1.

Under Assumption 1, suppose the parameters of Algorithm 1 satisfy p=1/dp=1/d, τ=0(duiτ)2<\sum_{\tau=0}^{\infty}(du_{i}^{\tau})^{2}<\infty, uiku_{i}^{k} is non-increasing, and

αLmin{\mfrac16d,\mfrac312d(1+1+\mfracσ2d1),\mfrac(1σ2)326429d32}.\alpha L\leq\min\left\{\mfrac{1}{6\sqrt{d}},\mfrac{\sqrt{3}}{12\sqrt{d}}\bigg{(}\!{-1}+\sqrt{1\!+\!\mfrac{\sigma^{2}}{d\!-\!1}}\bigg{)},\mfrac{(1-\sigma^{2})^{3}}{264\sqrt{29}d^{\frac{3}{2}}}\right\}.

Then we have

1kτ=0k1𝔼[f(x¯τ)2]1k(6αδ0+6dLR~029Nα+36L2Ru),\displaystyle\frac{1}{k}\sum_{\tau=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]\leq{}\frac{1}{k}\!\left(\frac{6}{\alpha}\delta^{0}+\frac{6\sqrt{d}L\tilde{R}_{0}}{\sqrt{29}N\alpha}+36L^{2}R_{u}\right)\!, (12)

where R~0=11σ2Ex0\tilde{R}_{0}=\frac{1}{1-\sigma^{2}}E_{x}^{0}.

Proof.

See Appendix F. ∎

Remark 3.

With some standard algebraic calculation, it can be shown that the convergence rate in (12) is 𝒪(d32/m)\mathcal{O}\big{(}d^{\frac{3}{2}}/m\big{)} in terms of the number of function value queries mm, which improves upon the 𝒪(d52/m)\mathcal{O}\big{(}d^{\frac{5}{2}}/m\big{)} rate in Theorem 1. Roughly speaking, this improvement is due to that, by fixing p=1/dp=1/d, the step-size’s dependency on the dimension can be decreased from 1/d521/d^{\frac{5}{2}} to 1/d321/d^{\frac{3}{2}}. In light of this improvement, setting p=1/dp=1/d offers practical guidance for selecting the probability pp.

V Outline of Convergence Analysis

V-A Bounding the Variance of VR-GE

The variance of VR-GE is essential for convergence proof of Algorithm 1 and we provide analysis details in this subsection. We first rewrite Algorithm 1 as follows:

xk+1=(WId)(xkαsk),\displaystyle x^{k+1}=(W\otimes I_{d})(x^{k}-\alpha s^{k}), (13a)
sk+1=(WId)(sk+gk+1gk),\displaystyle s^{k+1}=(W\otimes I_{d})(s^{k}+g^{k+1}-g^{k}), (13b)

where gk=[(gf1k)T,(gf2k)T,,(gfNk)T]Tg^{k}=[(g_{f_{1}}^{k})^{T},(g_{f_{2}}^{k})^{T},\cdots,(g_{f_{N}}^{k})^{T}]^{T}.

We now derive a bound on the expected difference between variance-reduced gradient estimator and the true gradient in the following lemma.

Lemma 1.

Let h:dh:\mathbb{R}^{d}\to\mathbb{R} be LL-smooth. Then for any x,x~,ydx,\tilde{x},y\in\mathbb{R}^{d} and 0<uu~0<u\leq\tilde{u}, it holds that

𝔼l𝒰[d][gh(x)2]\displaystyle\mathbb{E}_{l\sim\mathcal{U}[d]}\!\left[\|g-\nabla h(x)\|^{2}\right] (14)
\displaystyle\leq{} 12dL2xy2+12dL2x~y2+72u~2L2d2,\displaystyle 12dL^{2}\|x-y\|^{2}+12dL^{2}\|\tilde{x}-y\|^{2}+\frac{7}{2}\tilde{u}^{2}L^{2}d^{2},

where g=Gh(c)(x,u,l)Gh(c)(x~,u~,l)+Gh(2d)(x~,u~)g=G_{h}^{(c)}(x,u,l)-G_{h}^{(c)}(\tilde{x},\tilde{u},l)+G_{h}^{(2d)}(\tilde{x},\tilde{u}).

Proof.

Using ab22a2+2b2\|a-b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, we derive that

𝔼l𝒰[d][gh(x)2]\displaystyle\mathbb{E}_{l\sim\mathcal{U}[d]}\!\left[\|g-\nabla h(x)\|^{2}\right]
=\displaystyle={} 𝔼l𝒰[d][(Gh(c)(x,u,l)Gh(c)(x~,u~,l))\displaystyle\mathbb{E}_{l\sim\mathcal{U}[d]}\!\Big{[}\Big{\|}\left(G_{h}^{(c)}(x,u,l)-G_{h}^{(c)}(\tilde{x},\tilde{u},l)\right)
(Gh(2d)(x,u)Gh(2d)(x~,u~))(h(x)Gh(2d)(x,u))2]\displaystyle\!-\!\Big{(}G_{h}^{(2d)}\!(x,u)\!-\!G_{h}^{(2d)}\!(\tilde{x},\tilde{u})\Big{)}\!-\!\left(\nabla h(x)\!-\!G_{h}^{(2d)}\!(x,u)\!\right)\!\!\Big{\|}^{2}\Big{]}
\displaystyle\leq{} 2𝔼l𝒰[d][(Gh(c)(x,u,l)Gh(c)(x~,u~,l))\displaystyle 2\mathbb{E}_{l\sim\mathcal{U}[d]}\!\Big{[}\Big{\|}\left(G_{h}^{(c)}(x,u,l)-G_{h}^{(c)}(\tilde{x},\tilde{u},l)\right)
(Gh(2d)(x,u)Gh(2d)(x~,u~))2]\displaystyle-\left(G_{h}^{(2d)}(x,u)-G_{h}^{(2d)}(\tilde{x},\tilde{u})\right)\Big{\|}^{2}\Big{]}
+2𝔼[h(x)Gh(2d)(x,u)2].\displaystyle+2\mathbb{E}[\|\nabla h(x)-G_{h}^{(2d)}(x,u)\|^{2}].

Then, by 𝔼[ζ𝔼[ζ]2]𝔼[ζ2]\mathbb{E}[\|\zeta-\mathbb{E}[\zeta]\|^{2}]\leq\mathbb{E}[\|\zeta\|^{2}] for an arbitrary random vector ζ\zeta, we can derive from the above inequality that

𝔼l𝒰[d][gh(x)2]\displaystyle\mathbb{E}_{l\sim\mathcal{U}[d]}\!\left[\|g-\nabla h(x)\|^{2}\right]
\displaystyle\leq{} 2𝔼l𝒰[d][Gh(c)(x,u,l)Gh(c)(x~,u~,l)2]+12u2L2d\displaystyle 2\mathbb{E}_{l\sim\mathcal{U}[d]}\!\left[\|G_{h}^{(c)}(x,u,l)-G_{h}^{(c)}(\tilde{x},\tilde{u},l)\|^{2}\right]+\frac{1}{2}u^{2}L^{2}d
=\displaystyle={} 2dGh(2d)(x,u)Gh(2d)(x~,u~)2+12u2L2d\displaystyle 2d\|G_{h}^{(2d)}(x,u)-G_{h}^{(2d)}(\tilde{x},\tilde{u})\|^{2}+\frac{1}{2}u^{2}L^{2}d
=\displaystyle={} 2d(Gh(2d)(x,u)h(x))(Gh(2d)(x~,u~)h(x~))\displaystyle 2d\|(G_{h}^{(2d)}(x,u)-\nabla h(x))-(G_{h}^{(2d)}(\tilde{x},\tilde{u})-\nabla h(\tilde{x}))
+(h(x)h(x~))2+12u2L2d\displaystyle+(\nabla h(x)-\nabla h(\tilde{x}))\|^{2}+\frac{1}{2}u^{2}L^{2}d
\displaystyle\leq{} 6dL2xx~2+32u2L2d2+32u~2L2d2+12u2L2d\displaystyle 6dL^{2}\|x-\tilde{x}\|^{2}+\frac{3}{2}u^{2}L^{2}d^{2}+\frac{3}{2}\tilde{u}^{2}L^{2}d^{2}+\frac{1}{2}u^{2}L^{2}d
\displaystyle\leq{} 12dL2xy2+12dL2x~y2+72u~2L2d2,\displaystyle 12dL^{2}\|x-y\|^{2}+12dL^{2}\|\tilde{x}-y\|^{2}+\frac{7}{2}\tilde{u}^{2}L^{2}d^{2},

where we have used (3) and inequality Gh(2d)(x,u)h(x)12uLd\|G_{h}^{(2d)}(x,u)-\nabla h(x)\|\leq\frac{1}{2}uL\sqrt{d} in the second inequality, and the fact that uu~u\leq\tilde{u} in the last inequality. ∎

Remark 4.

The estimation error of VR-GE (i.e., the left-hand side of (14)) comprises two parts: consensus error and the smoothing radius effect. Consensus error is inherent due to the distributed nature of the algorithm, while the smoothing radius can be tuned properly. As Algorithm 1 approaches consensus and the smoothing radius approaches zero, the estimation error between the VR-GE and the true gradient diminishes. Thus, VR-GE offers reduced variance compared to the 2-point gradient estimator while requiring fewer samples than the 2d2d-point gradient estimator.

V-B Proof Sketch of Theorem 1

The proof relies on four lemmas. The first lemma analyzes the evolution of function value f(x¯k)f(\bar{x}^{k}) using LL-smooth property. We denote u~k=maxi{𝔼[u~ik]},i[N]\tilde{u}^{k}=\max_{i}\{\mathbb{E}[\tilde{u}_{i}^{k}]\},i\in[N] and assume d3d\geq 3 in the following analysis.

Lemma 2.

Suppose αL16d\alpha L\leq\frac{1}{6\sqrt{d}}, we have

δk+1\displaystyle\delta^{k+1}\leq{} δkα3𝔼[f(x¯k)2]+18αdL2NExk\displaystyle\delta^{k}-\frac{\alpha}{3}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+\frac{18\alpha dL^{2}}{N}E_{x}^{k}
+16αdL2NEx~k+5αL2d2(u~k)2.\displaystyle+\frac{16\alpha dL^{2}}{N}E_{\tilde{x}}^{k}+5\alpha L^{2}d^{2}(\tilde{u}^{k})^{2}. (15)
Proof.

See Appendix B. ∎

Lemma 2 derives a bound for optimization error δk\delta^{k}. We need to further bound consensus errors: ExkE_{x}^{k} and Ex~kE_{\tilde{x}}^{k}. This is tackled by the following lemma.

Lemma 3.

Suppose we choose p(1σ2d,1]p\in\big{(}\frac{1-\sigma^{2}}{d},1\big{]} and αLmin{(1σ2)212σ2(1+2σ2)d(1+p),16d,312d(d+d1+σ21p)}\alpha L\leq\min\left\{\frac{(1-\sigma^{2})^{2}}{12\sigma^{2}(1+2\sigma^{2})\sqrt{d(1+p)}},\\ \frac{1}{6\sqrt{d}},\frac{\sqrt{3}}{12d}\!\left(-\sqrt{d}\!+\!\sqrt{\frac{d-1+\sigma^{2}}{1-p}}\right)\!\right\}. Then we have the following inequality:

vk+1Avk+Bk,v^{k+1}\leq Av^{k}+B_{k}, (16)

where

vk=\displaystyle v^{k}={} [Exk,Ex~k,\mfracα329LdEsk]T,\displaystyle\!\left[E_{x}^{k},E_{\tilde{x}}^{k},\mfrac{\alpha}{3\sqrt{29}L\sqrt{d}}E_{s}^{k}\right]^{T},
A=\displaystyle A={} [1+2σ230929d1σ2αL17dαL+1+2σ2311σ2d929d1σ2αL929d1σ2αL329d1σ2αL2+σ23],\displaystyle\!\begin{bmatrix}\frac{1+2\sigma^{2}}{3}&0&\frac{9\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L\\ 17\sqrt{d}\alpha L+\frac{1+2\sigma^{2}}{3}&1-\frac{1-\sigma^{2}}{d}&\frac{9\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L\\ \frac{9\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L&\frac{3\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L&\frac{2+\sigma^{2}}{3}\end{bmatrix},
and Bk=[0NαLd𝔼[f(x¯k)2]+5Nd32αL(u~k)229Nα3(1σ2)Ld𝔼[f(x¯k)2]+229Nd32αL1σ2(u~k)2].\displaystyle\!\!\!B_{k}\!=\!\begin{bmatrix}0\\ \frac{N\alpha}{L\sqrt{d}}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+5Nd^{\frac{3}{2}}\alpha L(\tilde{u}^{k})^{2}\\ \frac{\sqrt{29}N\alpha}{3(1-\sigma^{2})L\sqrt{d}}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]\!+\!\frac{2\sqrt{29}Nd^{\frac{3}{2}}\alpha L}{1-\sigma^{2}}(\tilde{u}^{k})^{2}\\ \end{bmatrix}\!\!.
Proof.

See Appendix C. ∎

For subsequent analysis, we need to bound the induced weighted matrix norm of the matrix AA in the inequality (16).

Lemma 4.

Consider the matrix AA defined in Lemma 3, and suppose that αL(1σ2)31229d52\alpha L\leq\frac{(1-\sigma^{2})^{3}}{12\sqrt{29}d^{\frac{5}{2}}}. Then

ρ(A)Aπ11σ22d,\rho(A)\leq\interleave A\interleave_{\infty}^{\pi}\leq 1-\frac{1-\sigma^{2}}{2d}, (17)

where π=[1σ2d,3,1]T\pi=[\frac{1-\sigma^{2}}{d},3,1]^{T}.

Proof.

See Appendix D. ∎

Next, we derive a bound on the accumulated consensus errors over iterations using Lemma 3.

Lemma 5.

Suppose we choose p(1σ2d,1]p\in(\frac{1-\sigma^{2}}{d},1] and αLmin{16d,(1σ2)31229d52,312d(d+d1+σ21p)}\alpha L\leq\min\!\left\{\frac{1}{6\sqrt{d}},\frac{(1-\sigma^{2})^{3}}{12\sqrt{29}d^{\frac{5}{2}}},\frac{\sqrt{3}}{12d}\left(-\sqrt{d}+\sqrt{\frac{d-1+\sigma^{2}}{1-p}}\right)\!\right\}. We have

max{d1σ2τ=0kExτ,13τ=0kEx~τ,α329Ldτ=0kEsτ}\displaystyle\max\left\{\frac{d}{1-\sigma^{2}}\sum_{\tau=0}^{k}E_{x}^{\tau},\frac{1}{3}\sum_{\tau=0}^{k}E_{\tilde{x}}^{\tau},\frac{\alpha}{3\sqrt{29}L\sqrt{d}}\sum_{\tau=0}^{k}E_{s}^{\tau}\right\}
\displaystyle\leq{} 4d1σ2R0+229dNα3(1σ2)2Lm=0k1𝔼[f(x¯m)2]\displaystyle\frac{4d}{1-\sigma^{2}}R_{0}+\frac{2\sqrt{29}\sqrt{d}N\alpha}{3(1-\sigma^{2})^{2}L}\sum_{m=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{m})\|^{2}]
+429Nd52αL(1σ2)2m=0k1(u~m)2,\displaystyle+\frac{4\sqrt{29}Nd^{\frac{5}{2}}\alpha L}{(1-\sigma^{2})^{2}}\sum_{m=0}^{k-1}(\tilde{u}^{m})^{2}, (18)

where R0=d1σ2Ex0R_{0}=\frac{d}{1-\sigma^{2}}E_{x}^{0}. We also derive that τ=0(u~τ)21p(ui0)2+τ=1(uiτ)2\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}\leq\frac{1}{p}(u_{i}^{0})^{2}+\sum_{\tau=1}^{\infty}(u_{i}^{\tau})^{2}.

Proof.

See Appendix E. ∎

Based on Lemmas 2 and 5, we are now ready to prove Theorem 1. Plugging the inequality (18) into (15) leads to

0\displaystyle 0\leq{} δ0α3τ=0k𝔼[f(x¯τ)2]+5αL2d2τ=0k(u~τ)2\displaystyle\delta^{0}-\frac{\alpha}{3}\sum_{\tau=0}^{k}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]+5\alpha L^{2}d^{2}\sum_{\tau=0}^{k}(\tilde{u}^{\tau})^{2}
+54αdL2N(4d1σ2R0+429Nd52αL(1σ2)2m=0k1(u~m)2\displaystyle+\frac{54\alpha dL^{2}}{N}(\frac{4d}{1-\sigma^{2}}R_{0}+\frac{4\sqrt{29}Nd^{\frac{5}{2}}\alpha L}{(1-\sigma^{2})^{2}}\sum_{m=0}^{k-1}(\tilde{u}^{m})^{2}
+229dNα3(1σ2)2Lm=0k1𝔼[f(x¯m)2])\displaystyle+\frac{2\sqrt{29}\sqrt{d}N\alpha}{3(1-\sigma^{2})^{2}L}\sum_{m=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{m})\|^{2}])
\displaystyle\leq{} δ0+L29NdR0α6τ=0k𝔼[f(x¯τ)2]\displaystyle\delta^{0}+\frac{L}{\sqrt{29}N\sqrt{d}}R_{0}-\frac{\alpha}{6}\sum_{\tau=0}^{k}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]
+6αL2d2τ=0k(u~τ)2,\displaystyle+6\alpha L^{2}d^{2}\sum_{\tau=0}^{k}(\tilde{u}^{\tau})^{2}, (19)

where we have used 18αdL2N1σ2d+16αdL2N3<54αdL2N\frac{18\alpha dL^{2}}{N}\cdot\frac{1-\sigma^{2}}{d}+\frac{16\alpha dL^{2}}{N}\cdot 3<\frac{54\alpha dL^{2}}{N} in the first inequality and αL(1σ2)321629d521σ221629d32\alpha L\leq\frac{(1-\sigma^{2})^{3}}{216\sqrt{29}d^{\frac{5}{2}}}\leq\frac{1-\sigma^{2}}{216\sqrt{29}d^{\frac{3}{2}}} in the last inequality.

Using the bound τ=0(u~τ)2\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2} in Lemma 5 and (uik)2(u_{i}^{k})^{2} is summable, we see that (u~k)2(\tilde{u}^{k})^{2} is also summable. Consequently, we have 𝔼[f(x¯k)2]0\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]\rightarrow 0 as kk\rightarrow\infty by (19). We can further derive from the inequality (19) that

1kτ=0k1𝔼[f(x¯τ)2]\displaystyle\frac{1}{k}\sum_{\tau=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}] (20)
\displaystyle\leq{} 1k[6αδ0+6L29dNαR0+36L2d2τ=0(u~τ)2].\displaystyle\frac{1}{k}\left[\frac{6}{\alpha}\delta^{0}+\frac{6L}{\sqrt{29}\sqrt{d}N\alpha}R_{0}+36L^{2}d^{2}\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}\right].

Combining (20) with (18), we have

1Nτ=0Exτ429δ0(1σ2)Ld+8R0(1σ2)Nd+754τ=0(u~τ)2,\frac{1}{N}\sum_{\tau=0}^{\infty}E_{x}^{\tau}\leq\frac{4\sqrt{29}\delta^{0}}{(1\!-\!\sigma^{2})L\sqrt{d}}+\frac{8R_{0}}{(1\!-\!\sigma^{2})Nd}+\frac{7}{54}\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}, (21)

where we have used αL(1σ2)321629d52\alpha L\leq\frac{(1-\sigma^{2})^{3}}{216\sqrt{29}d^{\frac{5}{2}}}. Similarly,

1Nτ=0𝔼[sτ𝟏Nf(x¯τ)2]\displaystyle\frac{1}{N}\sum_{\tau=0}^{\infty}\mathbb{E}[\|s^{\tau}-\mathbf{1}_{N}\otimes\nabla f(\bar{x}^{\tau})\|^{2}]
\displaystyle\leq{} 32Nτ=0𝔼[sτ𝟏Ng¯τ2]+3τ=0𝔼[g¯τf(x¯τ)2]\displaystyle\frac{3}{2N}\sum_{\tau=0}^{\infty}\mathbb{E}[\|s^{\tau}\!-\!\mathbf{1}_{N}\otimes\bar{g}^{\tau}\|^{2}]+3\sum_{\tau=0}^{\infty}\mathbb{E}[\|\bar{g}^{\tau}\!-\!\nabla f(\bar{x}^{\tau})\|^{2}]
\displaystyle\leq{} 32Nτ=0Esτ+3τ=0[26dL2NExτ+24dL2NEx~τ+7(u~τ)2L2d2]\displaystyle\frac{3}{2N}\!\sum_{\tau=0}^{\infty}\!E_{s}^{\tau}\!+\!3\sum_{\tau=0}^{\infty}\!\left[\frac{26dL^{2}}{N}E_{x}^{\tau}\!+\!\frac{24dL^{2}}{N}E_{\tilde{x}}^{\tau}\!+\!7(\tilde{u}^{\tau})^{2}L^{2}d^{2}\right]
\displaystyle\leq{} 529dLNα(4d1σ2R0+429Nd52αL(1σ2)2m=0k1(u~m)2\displaystyle\frac{5\sqrt{29}\sqrt{d}L}{N\alpha}\bigg{(}\frac{4d}{1-\sigma^{2}}R_{0}+\frac{4\sqrt{29}Nd^{\frac{5}{2}}\alpha L}{(1-\sigma^{2})^{2}}\sum_{m=0}^{k-1}(\tilde{u}^{m})^{2}
+229dNα3(1σ2)2Ldm=0k1𝔼[f(x¯m)2])+21d2L2τ=0(u~τ)2\displaystyle\!\!\!+\!\frac{2\sqrt{29}dN\alpha}{3(1\!-\!\sigma^{2})^{2}L\sqrt{d}}\!\sum_{m=0}^{k-1}\!\mathbb{E}[\|\nabla f(\bar{x}^{m})\|^{2}]\bigg{)}\!+\!21d^{2}L^{2}\!\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}
\displaystyle\leq{} 580δ0(1σ2)2α+4029LR0(1σ2)Nαd+29dL10ατ=0(u~τ)2,\displaystyle\frac{580\delta^{0}}{(1-\sigma^{2})^{2}\alpha}+\frac{40\sqrt{29}LR_{0}}{(1\!-\!\sigma^{2})N\alpha\sqrt{d}}+\frac{\sqrt{29}\sqrt{d}L}{10\alpha}\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2},

where we have used inequality (18) and 32N329Ldα+78dL2N1σ2d+72dL2N3529dLNα\frac{3}{2N}\cdot\frac{3\sqrt{29}L\sqrt{d}}{\alpha}+\frac{78dL^{2}}{N}\cdot\frac{1-\sigma^{2}}{d}+\frac{72dL^{2}}{N}\cdot 3\leq\frac{5\sqrt{29}\sqrt{d}L}{N\alpha} in the third inequality.

The proof of Theorem 1 can now be concluded.

VI Simulation

We consider a multi-agent nonconvex optimization problem adapted from [17] with N=50N=50 agents in the network, and the objective function of each agent is given as follows:

fi(x)=αi1+eζiTxvi+βiln(1+x2),\displaystyle f_{i}(x)=\frac{\alpha_{i}}{1+e^{-\zeta_{i}^{T}x-v_{i}}}+\beta_{i}\ln(1+\|x\|^{2}), (22)

where αi,βi,vi\alpha_{i},\beta_{i},v_{i}\in\mathbb{R} are randomly generated parameters satisfying 1Niβi=1\frac{1}{N}\sum_{i}\beta_{i}=1, each ζid\zeta_{i}\in\mathbb{R}^{d} is also randomly generated, and the dimension dd is set to 6464.

For the following numerical simulation of Algorithm 1, we set the step-size η=0.02\eta=0.02 and the smoothing radius uik=3/k34u_{i}^{k}=3/k^{\frac{3}{4}}. All agents start from the same initial points to ensure consistency in the initial conditions across the network.

VI-A Comparison with Other Algorithms

Fig. 1 compares Algorithm 1 with DGD-2p, GT-2d2d, and ZONE-M (with JJ=100). In the figure, the probability used for Algorithm 1 is p=0.1p=0.1. The horizontal axis is normalized and represents the sampling number mm (i.e., the number of zeroth-order queries). The two sub-figures illustrate the stationarity gap f(x¯k)2\|\nabla f(\bar{x}^{k})\|^{2} and the consensus error 1Nixikx¯k2\frac{1}{N}\sum_{i}\|x_{i}^{k}-\bar{x}^{k}\|^{2}, respectively.

By inspecting Fig. 1, we first see that DGD-2p converges faster than ZONE-M with J=100J=100. When comparing DGD-2p and GT-2d2d, we can see a clear difference between their convergence behavior: DGD-2p achieves fast convergence initially but slows down afterwards due to the inherent variance of the 2-point gradient estimator, whereas GT-2d2d achieves higher eventual accuracy but slower initial convergence before approximately 1.5×1041.5\times 10^{4} zeroth-order queries due to the higher sampling burden of the 2d2d-point gradient estimator.

As demonstrated in Fig. 1, Algorithm 1 offers both high eventual accuracy and a fast convergence rate in terms of stationarity gap and consensus error. This improvement is attributed to the variance reduction mechanism employed in designing VR-GE, which effectively balances the sampling number and expected variance, thereby addressing the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation that exists in current zeroth-order distributed optimization algorithms.

Refer to caption

Figure 1: Convergence of Algorithm 1, ZONE-M with J =100, DGD-2p, GT-2d2d.

VI-B Comparison of Algorithm 1 under Different Probabilities

Fig. 2 compares the convergence of Algorithm 1 under different choices of the probability pp, which reflects the frequency with which each agent takes snapshots. The three sub-figures illustrate the stationarity gap f(x¯k)2\|\nabla f(\bar{x}^{k})\|^{2}, the consensus error 1Nixikx¯k2\frac{1}{N}\sum_{i}\|x_{i}^{k}-\bar{x}^{k}\|^{2}, and the tracking error 1Nisikf(x¯k)2\frac{1}{N}\sum_{i}\|s_{i}^{k}-\nabla f(\bar{x}^{k})\|^{2}, respectively.

The results demonstrate that Algorithm 1 with a lower probability achieves better accuracy with fewer sampling numbers. However, a lower probability also results in more fluctuation during convergence. This is expected because, with a lower probability, the snapshot variables are updated less frequently, leading to a greater deviation from the true gradient as iterations progress.

Two notable cases are p=0p=0 and p=1p=1. When p=1p=1, VR-GE behaves similarly to GT-2d2d, taking snapshots at every iteration, which results in poorer convergence performance compared to cases where p(0,1)p\in(0,1). Conversely, when p=0p=0, agents do not take any snapshots, and x~ik\tilde{x}_{i}^{k} remains its initial value xi0x_{i}^{0}. In this scenario, VR-GE uses solely the initial iterate for gradient correction, which introduces variance and reduces convergence accuracy, as illustrated in Fig. 2.

Refer to caption

Figure 2: Convergence of Algorithm 1 under probability pp = 0.2, 0.5, 0.8, and 1.

VI-C Comparison of Algorithm 1 under Different Dimensions

Fig. 3 compares the convergence of Algorithm 1 across different agent dimensions, alongside varying probabilities for taking snapshots within the algorithm. The results show that Algorithm 1 can effectively handle different scenarios, such as when d=300d=300, achieving stationarity gaps that are below 10610^{-6}.

As the dimension increases, VR-GE requires more samples to accurately estimate the gradient. To maintain similar convergence performance across higher dimensions, the probability pp for taking snapshots can be adjusted to lower values. As shown in Fig. 3, decreasing the probability as the dimension grows allows Algorithm 1 to achieve a convergence rate and optimization accuracy that are comparable to cases with lower dimensions. However, this adjustment also leads to increased fluctuation during the convergence process. This fluctuation is a result of the randomness introduced by the snapshot mechanism.

Refer to caption

Figure 3: Convergence of Algorithm 1 with different dimension dd = 30, 100, 200, and 300.

VII Conclusion

In this paper, we proposed an improved variance-reduced gradient estimator and integrated it with gradient tracking mechanism for nonconvex distributed zeroth-order optimization problems. Through rigorous analysis, we demonstrated that our algorithm achieves sublinear convergence for smooth nonconvex functions that is comparable with first-order gradient tracking algorithms, while maintaining relatively low sampling cost per gradient estimation. Comparative evaluations with existing distributed zeroth-order optimization algorithms verified the effectiveness of the proposed gradient estimator. Future work will focus on refining the step-size bounds of the algorithm and reducing the dependence of the convergence rate on the dimension dd.

Appendix A. Auxiliary Lemmas

This section summarizes some auxiliary lemmas for convergence analysis.

Lemma 6.

Suppose f:df:\mathbb{R}^{d}\to\mathbb{R} is LL-smooth and infxdf(x)=f>\inf_{x\in\mathbb{R}^{d}}f(x)=f^{*}>-\infty. Then

f(x)22L(f(x)f(x)).\|\nabla f(x)\|^{2}\leq 2L(f(x)-f(x^{*})). (23)
Proof.

This lemma is standard in optimization theory, and can be shown via

ff(x\mfrac1Lf(x))f(x)\mfrac12Lf(x)2,f^{\ast}\leq f\!\left(x-\mfrac{1}{L}\nabla f(x)\right)\leq f(x)-\mfrac{1}{2L}\|\nabla f(x)\|^{2},

where the last step follows from the LL-smoothness of ff. ∎

Lemma 7 ([17]).

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be LL-smooth. Then for any xdx\in\mathbb{R}^{d},

Gf(2d)(x,u)f(x)12uLd.\|G_{f}^{(2d)}(x,u)-\nabla f(x)\|\leq\frac{1}{2}uL\sqrt{d}. (24)
Lemma 8 ([26]).

Let σW1N𝟏N𝟏NT2<1\sigma\triangleq\|W-\frac{1}{N}\mathbf{1}_{N}\mathbf{1}_{N}^{T}\|_{2}<1. For any x1,,xNdx_{1},\cdots,x_{N}\in\mathbb{R}^{d}, we have

(WId)(x𝟏Nx¯)σx𝟏Nx¯,\|(W\otimes I_{d})(x-\mathbf{1}_{N}\otimes\bar{x})\|\leq\sigma\|x-\mathbf{1}_{N}\otimes\bar{x}\|,

where we denote x=[x1xN],x¯=1Ni=1Nxix=\begin{bmatrix}x_{1}\\ \vdots\\ x_{N}\end{bmatrix},\bar{x}=\dfrac{1}{N}\sum_{i=1}^{N}x_{i}.

Lemma 9 ([28]).

Let An×nA\in\mathbb{R}^{n\times n} be non-negative and πn\pi\in\mathbb{R}^{n} be positive. If AπβπA\pi\leq\beta\pi for β>0\beta>0, then ρ(A)Aπβ\rho(A)\leq\interleave A\interleave_{\infty}^{\pi}\leq\beta.

Appendix B. Proof of Lemma 2

First, by left multiplying 𝟏NTTd\mathbf{1}_{N}^{T}\otimes T_{d} on both sides of equation (13b), and using the initialization s0=g0s^{0}=g^{0}, we derive that

s¯k=g¯k,\bar{s}^{k}=\bar{g}^{k},

where s¯k=1Ni=1Nsik\bar{s}^{k}=\frac{1}{N}\sum_{i=1}^{N}s_{i}^{k} and g¯k=1Ni=1Ngfik\bar{g}^{k}=\frac{1}{N}\sum_{i=1}^{N}g_{f_{i}}^{k}.

Then, from (13a), we have

x¯k+1=x¯kαg¯k.\bar{x}_{k+1}=\bar{x}^{k}-\alpha\bar{g}^{k}. (25)

Leveraging the LL-smoothness of the function ff, we have

𝔼[f(x¯k+1)]\displaystyle\mathbb{E}[f(\bar{x}_{k+1})]\leq{} 𝔼[f(x¯k)]α𝔼[f(x¯k),g¯k]+α2L2𝔼[g¯k2]\displaystyle\mathbb{E}[f(\bar{x}^{k})]\!-\!\alpha\mathbb{E}[\langle\nabla f(\bar{x}^{k}),\bar{g}^{k}\rangle]\!+\!\frac{\alpha^{2}\!L}{2}\mathbb{E}[\|\bar{g}^{k}\|^{2}] (26)
=\displaystyle={} 𝔼[f(x¯k)]α𝔼[f(x¯k)2]+α2L2𝔼[g¯k2]\displaystyle\mathbb{E}[f(\bar{x}^{k})]-\alpha\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+\frac{\alpha^{2}L}{2}\mathbb{E}[\|\bar{g}^{k}\|^{2}]
α𝔼[f(x¯k),1Ni=1N(gfikfi(x¯k))]\displaystyle-\alpha\mathbb{E}[\langle\nabla f(\bar{x}^{k}),\frac{1}{N}\sum\limits_{i=1}^{N}(g_{f_{i}}^{k}-\nabla f_{i}(\bar{x}^{k}))\rangle]
\displaystyle\leq{} 𝔼[f(x¯k)]α2𝔼[f(x¯k)2]+α2L2𝔼[g¯k2]\displaystyle\mathbb{E}[f(\bar{x}^{k})]\!-\!\frac{\alpha}{2}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]\!+\!\frac{\alpha^{2}L}{2}\mathbb{E}[\|\bar{g}^{k}\|^{2}]
+α2𝔼[1Ni=1N(gfikfi(x¯k))2].\displaystyle+\frac{\alpha}{2}\mathbb{E}\bigg{[}\bigg{\|}\frac{1}{N}\sum\limits_{i=1}^{N}(g_{f_{i}}^{k}-\nabla f_{i}(\bar{x}^{k}))\bigg{\|}^{2}\bigg{]}.

For the last term on the right-hand side, we can derive by using Lemma 1 that

𝔼[1Ni=1N(gfikfi(x¯k))2]\displaystyle\mathbb{E}\bigg{[}\bigg{\|}\frac{1}{N}\sum\limits_{i=1}^{N}(g_{f_{i}}^{k}-\nabla f_{i}(\bar{x}^{k}))\bigg{\|}^{2}\bigg{]} (27)
\displaystyle\leq{} 2𝔼[1Ni=1N(fi(xik)fi(x¯k))2]\displaystyle 2\mathbb{E}\bigg{[}\bigg{\|}\frac{1}{N}\sum_{i=1}^{N}\left(\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(\bar{x}^{k})\right)\bigg{\|}^{2}\bigg{]}
+2N𝔼[i=1Ngfikfi(xik)2]\displaystyle+\frac{2}{N}\mathbb{E}\bigg{[}\sum_{i=1}^{N}\|g_{f_{i}}^{k}-\nabla f_{i}(x_{i}^{k})\|^{2}\bigg{]}
\displaystyle\leq{} 2L2NExk+2N(12dL2Exk+12dL2Ex~k+72N(u~k)2L2d2)\displaystyle\frac{2L^{2}}{N}E_{x}^{k}+\frac{2}{N}(12dL^{2}E_{x}^{k}+12dL^{2}E_{\tilde{x}}^{k}+\frac{7}{2}N(\tilde{u}^{k})^{2}L^{2}d^{2})
=\displaystyle={} 2L2(1+12d)NExk+24dL2NEx~k+7(u~k)2L2d2.\displaystyle\frac{2L^{2}(1+12d)}{N}E_{x}^{k}+\frac{24dL^{2}}{N}E_{\tilde{x}}^{k}+7(\tilde{u}^{k})^{2}L^{2}d^{2}.

Next, we bound the term 𝔼[g¯k2]\mathbb{E}[\|\bar{g}^{k}\|^{2}] as follows:

𝔼[g¯k2]\displaystyle\mathbb{E}[\|\bar{g}^{k}\|^{2}]
=\displaystyle={} 𝔼[1Ni=1Ngfik2]\displaystyle\mathbb{E}[\|\frac{1}{N}\sum_{i=1}^{N}g_{f_{i}}^{k}\|^{2}]
\displaystyle\leq{} 2𝔼[f(x¯k)2]+2𝔼[1Ni=1N(gfikfi(x¯k))2]\displaystyle 2\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+2\mathbb{E}[\|\frac{1}{N}\sum_{i=1}^{N}\left(g_{f_{i}}^{k}-\nabla f_{i}(\bar{x}^{k})\right)\|^{2}]
\displaystyle\leq{} 2𝔼[f(x¯k)2]+4L2(1+12d)NExk+48dL2NEx~k\displaystyle 2\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+\frac{4L^{2}(1+12d)}{N}E_{x}^{k}+\frac{48dL^{2}}{N}E_{\tilde{x}}^{k}
+14(u~k)2L2d2,\displaystyle+14(\tilde{u}^{k})^{2}L^{2}d^{2}, (28)

where we have used (27) in the second inequality.

Hence, plugging inequalities (27) and (28) into (26), we have

𝔼[f(x¯k+1)]\displaystyle\mathbb{E}[f(\bar{x}_{k+1})]
\displaystyle\leq{} 𝔼[f(x¯k)]α2(12αL)𝔼[f(x¯k)2]\displaystyle\mathbb{E}[f(\bar{x}^{k})]-\frac{\alpha}{2}(1-2\alpha L)\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]
+7α(u~k)2L2d22(1+2αL)+αL2(1+12d)N(1+2αL)Exk\displaystyle+\frac{7\alpha(\tilde{u}^{k})^{2}L^{2}d^{2}}{2}(1\!+\!2\alpha L)+\frac{\alpha L^{2}(1\!+\!12d)}{N}(1\!+\!2\alpha L)E_{x}^{k}
+12αdL2N(1+2αL)Ex~k\displaystyle+\frac{12\alpha dL^{2}}{N}(1+2\alpha L)E_{\tilde{x}}^{k}
\displaystyle\leq{} 𝔼[f(x¯k)]α3𝔼[f(x¯k)2]+18αdL2NExk\displaystyle\mathbb{E}[f(\bar{x}^{k})]-\frac{\alpha}{3}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+\frac{18\alpha dL^{2}}{N}E_{x}^{k}
+16αdL2NEx~k+5α(u~k)2L2d2,\displaystyle+\frac{16\alpha dL^{2}}{N}E_{\tilde{x}}^{k}+5\alpha(\tilde{u}^{k})^{2}L^{2}d^{2},

where we have used αL16\alpha L\leq\frac{1}{6} in the second inequality.

Appendix C. Proof of Lemma 3

First, following from (13) and (25), we derive that

Exk+1=\displaystyle E_{x}^{k+1}= 𝔼[xk+1𝟏Nx¯k+12]\displaystyle\mathbb{E}[\|x^{k+1}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}] (29)
=\displaystyle={} 𝔼[(WId)[xk𝟏Nx¯kα(sk𝟏Ng¯k)]2]\displaystyle\mathbb{E}[\|(W\otimes I_{d})[x^{k}\!-\!\mathbf{1}_{N}\otimes\bar{x}^{k}\!-\!\alpha(s^{k}\!-\!\mathbf{1}_{N}\otimes\bar{g}^{k})]\|^{2}]
\displaystyle\leq{} σ2(1+1σ23σ2)𝔼[xk𝟏Nx¯k2]\displaystyle\sigma^{2}\left(1+\frac{1-\sigma^{2}}{3\sigma^{2}}\right)\mathbb{E}[\|x^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k}\|^{2}]
+σ2(1+3σ21σ2)α2𝔼[sk𝟏Ng¯k2]\displaystyle+\sigma^{2}\left(1+\frac{3\sigma^{2}}{1-\sigma^{2}}\right)\alpha^{2}\mathbb{E}[\|s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k}\|^{2}]
=\displaystyle={} 1+2σ23Exk+σ2(1+2σ2)1σ2α2Esk,\displaystyle\frac{1+2\sigma^{2}}{3}E_{x}^{k}+\frac{\sigma^{2}(1+2\sigma^{2})}{1-\sigma^{2}}\alpha^{2}E_{s}^{k},

where we have used Lemma 8 in the inequality.

Second, we bound the consensus error Ex~kE_{\tilde{x}}^{k}. We have

Ex~k+1=\displaystyle E_{\tilde{x}}^{k+1}={} 𝔼[x~k+1𝟏Nx¯k+12]\displaystyle\mathbb{E}[\|\tilde{x}_{k+1}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}]
=\displaystyle={} p𝔼[xk+1𝟏Nx¯k+12]\displaystyle p\mathbb{E}[\|x^{k+1}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}]
+(1p)𝔼[x~k𝟏Nx¯k+12]\displaystyle+(1-p)\mathbb{E}[\|\tilde{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}]
=\displaystyle={} pExk+1+(1p)𝔼[x~k𝟏Nx¯k+12].\displaystyle pE_{x}^{k+1}+(1-p)\mathbb{E}[\|\tilde{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}]. (30)

For the second term of (30), we have

𝔼[x~k𝟏Nx¯k+12]\displaystyle\mathbb{E}[\|\tilde{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}] (31)
=\displaystyle={} 𝔼[x~k𝟏Nx¯k+𝟏Nx¯k𝟏Nx¯k+12]\displaystyle\mathbb{E}[\|\tilde{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k}+\mathbf{1}_{N}\otimes\bar{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}_{k+1}\|^{2}]
=\displaystyle={} 𝔼[x~k𝟏Nx¯k+α𝟏Ng¯k2]\displaystyle\mathbb{E}[\|\tilde{x}^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k}+\alpha\mathbf{1}_{N}\otimes\bar{g}^{k}\|^{2}]
\displaystyle\leq{} (1+β)Ex~k+(1+\mfrac1β)Nα2𝔼[g¯k2]\displaystyle(1+\beta)E_{\tilde{x}}^{k}+\left(1+\mfrac{1}{\beta}\right)N\alpha^{2}\mathbb{E}[\|\bar{g}^{k}\|^{2}]
\displaystyle\leq{} [1+β+(1+\mfrac1β)48α2L2d]Ex~k+(1+\mfrac1β)4α2L2(1+12d)Exk\displaystyle\!\!\!\left[1\!+\!\beta\!+\!\left(\!1\!+\!\mfrac{1}{\beta}\!\right)\!48\alpha^{2}L^{2}d\right]\!E_{\tilde{x}}^{k}\!+\!\left(\!1\!+\!\mfrac{1}{\beta}\!\right)\!4\alpha^{2}L^{2}(1\!+\!12d)E_{x}^{k}
+(1+\mfrac1β)2Nα2𝔼[f(x¯k)2]+(1+\mfrac1β)14Nα2(u~k)2L2d2,\displaystyle\!\!\!+\!\left(\!1\!+\!\mfrac{1}{\beta}\!\right)\!2N\alpha^{2}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]\!+\!\left(\!1\!+\!\mfrac{1}{\beta}\!\right)\!14N\alpha^{2}(\tilde{u}^{k})^{2}L^{2}d^{2}\!,

where β>0\beta>0 is an auxiliary parameter that will be determined later.

Now, we can plug (29) and (31) into (30) to obtain

Ex~k+1\displaystyle E_{\tilde{x}}^{k+1}\leq{} (1p)[1+β+(1+\mfrac1β)48α2L2d]Ex~k\displaystyle(1-p)\left[1+\beta+\left(1+\mfrac{1}{\beta}\right)48\alpha^{2}L^{2}d\right]E_{\tilde{x}}^{k} (32)
+[(1p)(1+\mfrac1β)4α2L2(1+12d)+p1+2σ23]Exk\displaystyle+\!\left[(1\!-\!p)\!\left(1\!+\!\mfrac{1}{\beta}\right)4\alpha^{2}L^{2}(1\!+\!12d)+p\frac{1\!+\!2\sigma^{2}}{3}\right]\!E_{x}^{k}
+σ2(1+2σ2)1σ2α2pEsk\displaystyle+\frac{\sigma^{2}(1+2\sigma^{2})}{1-\sigma^{2}}\alpha^{2}pE_{s}^{k}
+(1p)(1+\mfrac1β)2Nα2𝔼[f(x¯k)2]\displaystyle+(1-p)\left(1+\mfrac{1}{\beta}\right)2N\alpha^{2}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]
+(1p)(1+\mfrac1β)14Nα2(u~k)2L2d2.\displaystyle+(1-p)\left(1+\mfrac{1}{\beta}\right)14N\alpha^{2}(\tilde{u}^{k})^{2}L^{2}d^{2}.

Letting β=43αLd\beta=4\sqrt{3}\alpha L\sqrt{d} and using αL16d\alpha L\leq\frac{1}{6\sqrt{d}} leads to

Ex~k+1\displaystyle E_{\tilde{x}}^{k+1}\leq{} (1p)(1+48α2L2d+83αLd)Ex~k\displaystyle(1-p)(1+48\alpha^{2}L^{2}d+8\sqrt{3}\alpha L\sqrt{d})E_{\tilde{x}}^{k}
+(17αLd+1+2σ23)Exk+σ2(1+2σ2)1σ2α2Esk\displaystyle+\left(17\alpha L\sqrt{d}+\frac{1\!+\!2\sigma^{2}}{3}\right)E_{x}^{k}+\frac{\sigma^{2}(1\!+\!2\sigma^{2})}{1\!-\!\sigma^{2}}\alpha^{2}E_{s}^{k}
+NαLd𝔼[f(x¯k)2]+5NαLd32(u~k)2.\displaystyle+\frac{N\alpha}{L\sqrt{d}}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+5N\alpha Ld^{\frac{3}{2}}(\tilde{u}^{k})^{2}.

Using αL312d(d+d1+σ21p)\alpha L\leq\frac{\sqrt{3}}{12d}\left(-\sqrt{d}+\sqrt{\frac{d-1+\sigma^{2}}{1-p}}\right) and under the condition that p(1σ2d,1]p\in(\frac{1-\sigma^{2}}{d},1], we get

(1p)(1+48α2L2d+83αLd)11σ2d.(1-p)(1+48\alpha^{2}L^{2}d+8\sqrt{3}\alpha L\sqrt{d})\leq 1-\frac{1-\sigma^{2}}{d}.

Consequently, we derive that

Ex~k+1\displaystyle E_{\tilde{x}}^{k+1}\leq{} (11σ2d)Ex~k+(17αLd+1+2σ23)Exk\displaystyle\left(1-\frac{1\!-\!\sigma^{2}}{d}\right)E_{\tilde{x}}^{k}+\left(17\alpha L\sqrt{d}+\frac{1\!+\!2\sigma^{2}}{3}\right)E_{x}^{k} (33)
+σ2(1+2σ2)1σ2α2Esk+NαLd𝔼[f(x¯k)2]\displaystyle+\frac{\sigma^{2}(1+2\sigma^{2})}{1-\sigma^{2}}\alpha^{2}E_{s}^{k}+\frac{N\alpha}{L\sqrt{d}}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]
+5NαLd32(u~k)2.\displaystyle+5N\alpha Ld^{\frac{3}{2}}(\tilde{u}^{k})^{2}.

Third, we bound the consensus error EskE_{s}^{k}. Note that

Esk+1=\displaystyle E_{s}^{k+1}={} 𝔼[sk+1𝟏Ng¯k+12]\displaystyle\mathbb{E}[\|s^{k+1}-\mathbf{1}_{N}\otimes\bar{g}^{k+1}\|^{2}] (34)
=\displaystyle={} 𝔼[(WId)(sk𝟏Ng¯k+gk+1gk\displaystyle\mathbb{E}[\|(W\otimes I_{d})(s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k}+g^{k+1}-g^{k}
𝟏Ng¯k+1+𝟏Ng¯k)2].\displaystyle-\mathbf{1}_{N}\otimes\bar{g}^{k+1}+\mathbf{1}_{N}\otimes\bar{g}^{k})\|^{2}].

For the term gk+1gk𝟏Ng¯k+1+𝟏Ng¯kg^{k+1}-g^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k+1}+\mathbf{1}_{N}\otimes\bar{g}^{k} in (34), we have

gk+1gk𝟏Ng¯k+1+𝟏Ng¯k2\displaystyle\|g^{k+1}-g^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k+1}+\mathbf{1}_{N}\otimes\bar{g}^{k}\|^{2}
=\displaystyle={} gk+1gk2+Ng¯k+1g¯k2\displaystyle\|g^{k+1}-g^{k}\|^{2}+N\|\bar{g}^{k+1}-\bar{g}^{k}\|^{2}
2i=1Ngfik+1gfik,g¯k+1g¯k\displaystyle-2\sum_{i=1}^{N}\langle g_{f_{i}}^{k+1}-g_{f_{i}}^{k},\bar{g}^{k+1}-\bar{g}^{k}\rangle
=\displaystyle={} gk+1gk2Ng¯k+1g¯k2\displaystyle\|g^{k+1}-g^{k}\|^{2}-N\|\bar{g}^{k+1}-\bar{g}^{k}\|^{2}
\displaystyle\leq{} gk+1gk2.\displaystyle\|g^{k+1}-g^{k}\|^{2}. (35)

Combining (35) with (34), we derive that

Esk+1=𝔼[sk+1𝟏Ng¯k+12]\displaystyle E_{s}^{k+1}=\mathbb{E}[\|s^{k+1}-\mathbf{1}_{N}\otimes\bar{g}^{k+1}\|^{2}]
\displaystyle\leq{} σ2𝔼[(sk𝟏Ng¯k+gk+1gk)2]\displaystyle\sigma^{2}\mathbb{E}[(\|s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k}\|+\|g^{k+1}-g^{k}\|)^{2}]
\displaystyle\leq{} 1+2σ23Esk+σ2(1+2σ2)1σ2i=1N𝔼[gfik+1gfik2],\displaystyle\frac{1+2\sigma^{2}}{3}E_{s}^{k}+\frac{\sigma^{2}(1+2\sigma^{2})}{1-\sigma^{2}}\sum_{i=1}^{N}\mathbb{E}[\|g_{f_{i}}^{k+1}-g_{f_{i}}^{k}\|^{2}], (36)

where we have used inequality (a+b)2(1+ϖ)a2+(1+1ϖ)b2(a+b)^{2}\leq(1+\varpi)a^{2}+(1+\frac{1}{\varpi})b^{2} for all a,ba,b\in\mathbb{R} and ϖ>0\varpi>0 in the second inequality.

Now we consider the second item in (36):

i=1N𝔼[gfik+1gfik2]\displaystyle\sum_{i=1}^{N}\mathbb{E}[\|g_{f_{i}}^{k+1}-g_{f_{i}}^{k}\|^{2}] (37)
=\displaystyle={} 𝔼[i=1N(gfik+1fi(xik+1))(gfikfi(xik))\displaystyle\mathbb{E}\bigg{[}\sum_{i=1}^{N}\|(g_{f_{i}}^{k+1}-\nabla f_{i}(x_{i}^{k+1}))-(g_{f_{i}}^{k}-\nabla f_{i}(x_{i}^{k}))
+(fi(xik+1)fi(xik))2]\displaystyle+(\nabla f_{i}(x_{i}^{k+1})-\nabla f_{i}(x_{i}^{k}))\|^{2}\bigg{]}
\displaystyle\leq{} 3i=1N𝔼[gfik+1fi(xik+1)2]+3i=1N𝔼[gfikfi(xik)2]\displaystyle 3\sum_{i=1}^{N}\!\mathbb{E}[\|g_{f_{i}}^{k+1}\!-\!\nabla f_{i}(x_{i}^{k+1})\|^{2}]\!+\!3\sum_{i=1}^{N}\!\mathbb{E}[\|g_{f_{i}}^{k}\!-\!\nabla f_{i}(x_{i}^{k})\|^{2}]
+3i=1N𝔼[fi(xik+1)fi(xik)2]\displaystyle+3\sum_{i=1}^{N}\mathbb{E}[\|\nabla f_{i}(x_{i}^{k+1})-\nabla f_{i}(x_{i}^{k})\|^{2}]
\displaystyle\leq{} 3L2𝔼[xk+1xk2]+36dL2Exk+36dL2Ex~k\displaystyle 3L^{2}\mathbb{E}[\|x^{k+1}-x^{k}\|^{2}]+36dL^{2}E_{x}^{k}+36dL^{2}E_{\tilde{x}}^{k}
+36dL2Exk+1+36dL2Ex~k+1+21N(u~k)2L2d2,\displaystyle+36dL^{2}E_{x}^{k+1}+36dL^{2}E_{\tilde{x}}^{k+1}+21N(\tilde{u}^{k})^{2}L^{2}d^{2},

where we have used Lemma 1 in the second inequality.

According to (13), we have

xk+1xk=\displaystyle x^{k+1}-x^{k}={} (WIdINd)xkα(WId)sk\displaystyle(W\otimes I_{d}-I_{Nd})x^{k}-\alpha(W\otimes I_{d})s^{k}
=\displaystyle={} (WIdINd)(xk𝟏Nx¯k)\displaystyle(W\otimes I_{d}-I_{Nd})(x^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k})
α(WId)(sk𝟏Ng¯k)α𝟏Ng¯k.\displaystyle-\alpha(W\otimes I_{d})(s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k})-\alpha\mathbf{1}_{N}\otimes\bar{g}^{k}.

Therefore, we can get

𝔼[xk+1xk2]\displaystyle\mathbb{E}[\|x^{k+1}-x^{k}\|^{2}]
=\displaystyle={} 𝔼[(WIdINd)(xk𝟏Nx¯k)\displaystyle\mathbb{E}[\|(W\otimes I_{d}-I_{Nd})(x^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k})
α(WId)(sk𝟏Ng¯k)α𝟏Ng¯k2]\displaystyle-\alpha(W\otimes I_{d})(s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k})-\alpha\mathbf{1}_{N}\otimes\bar{g}^{k}\|^{2}]
\displaystyle\leq{} 3𝔼[α(WId)(sk𝟏Ng¯k)2]+3𝔼[α𝟏Ng¯k2]\displaystyle 3\mathbb{E}[\|\alpha(W\otimes I_{d})(s^{k}-\mathbf{1}_{N}\otimes\bar{g}^{k})\|^{2}]+3\mathbb{E}[\|\alpha\mathbf{1}_{N}\otimes\bar{g}^{k}\|^{2}]
+3𝔼[(WIdINd)(xk𝟏Nx¯k)2]\displaystyle+3\mathbb{E}[\|(W\otimes I_{d}-I_{Nd})(x^{k}-\mathbf{1}_{N}\otimes\bar{x}^{k})\|^{2}]
\displaystyle\leq{} 3α2σ2Esk+3α2N𝔼[g¯k2]+12Exk\displaystyle 3\alpha^{2}\sigma^{2}E_{s}^{k}+3\alpha^{2}N\mathbb{E}[\|\bar{g}^{k}\|^{2}]+12E_{x}^{k}
\displaystyle\leq{} (12+12α2L2(1+12d))Exk+144α2L2dEx~k+3α2σ2Esk\displaystyle(12\!+\!12\alpha^{2}L^{2}(1\!+\!12d))E_{x}^{k}+144\alpha^{2}L^{2}dE_{\tilde{x}}^{k}+3\alpha^{2}\sigma^{2}E_{s}^{k}
+6α2N𝔼[f(x¯k)2]+42Nα2(u~k)2L2d2\displaystyle+6\alpha^{2}N\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+42N\alpha^{2}(\tilde{u}^{k})^{2}L^{2}d^{2}
\displaystyle\leq{} 17Exk+4Ex~k+3α2σ2Esk\displaystyle 17E_{x}^{k}+4E_{\tilde{x}}^{k}+3\alpha^{2}\sigma^{2}E_{s}^{k}
+6α2N𝔼[f(x¯k)2]+2N(u~k)2d,\displaystyle+6\alpha^{2}N\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+2N(\tilde{u}^{k})^{2}d, (38)

where we used WIdINd2\|W\otimes I_{d}-I_{Nd}\|\leq 2 in the second inequality, used (28) in the third inequality, and used αL16d\alpha L\leq\frac{1}{6\sqrt{d}} in the last inequality.

Combine (29), (33), (37), and (38) with (36), and we get

Esk+1\displaystyle E_{s}^{k+1}\leq{} [1+2σ23+36dL2α2(1+p)σ4(1+2σ2)2(1σ2)2\displaystyle\bigg{[}\frac{1+2\sigma^{2}}{3}+36dL^{2}\alpha^{2}(1+p)\frac{\sigma^{4}(1+2\sigma^{2})^{2}}{(1-\sigma^{2})^{2}} (39)
+9α2L2σ4(1+2σ2)1σ2]Esk+9L21σ2(70d+17)Exk\displaystyle+9\alpha^{2}L^{2}\frac{\sigma^{4}(1+2\sigma^{2})}{1-\sigma^{2}}\bigg{]}E_{s}^{k}+\frac{9L^{2}}{1-\sigma^{2}}(70d\!+\!17)E_{x}^{k}
+252dL21σ2Ex~k+20N1σ2𝔼[f(x¯k)2]\displaystyle+\frac{252dL^{2}}{1-\sigma^{2}}E_{\tilde{x}}^{k}+\frac{20N}{1-\sigma^{2}}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]
+171NL2d21σ2(u~k)2.\displaystyle+\frac{171NL^{2}d^{2}}{1-\sigma^{2}}(\tilde{u}^{k})^{2}.

Using αL(1σ2)212σ2(1+2σ2)d(1+p)\alpha L\leq\frac{(1-\sigma^{2})^{2}}{12\sigma^{2}(1+2\sigma^{2})\sqrt{d(1+p)}}, we have

36dL2α2(1+p)σ4(1+2σ2)2(1σ2)2+9α2L2σ4(1+2σ2)1σ2\displaystyle 36dL^{2}\alpha^{2}(1+p)\frac{\sigma^{4}(1+2\sigma^{2})^{2}}{(1-\sigma^{2})^{2}}+9\alpha^{2}L^{2}\frac{\sigma^{4}(1+2\sigma^{2})}{1-\sigma^{2}}
<\displaystyle<{} 45dL2α2(1+p)σ4(1+2σ2)2(1σ2)21σ23.\displaystyle 45dL^{2}\alpha^{2}(1+p)\frac{\sigma^{4}(1+2\sigma^{2})^{2}}{(1-\sigma^{2})^{2}}\leq\frac{1-\sigma^{2}}{3}.

We can further derive from inequality (39) that

Esk+1\displaystyle E_{s}^{k+1}\leq{} 2+σ23Esk+783dL21σ2Exk+252dL21σ2Ex~k\displaystyle\frac{2+\sigma^{2}}{3}E_{s}^{k}+\frac{783dL^{2}}{1-\sigma^{2}}E_{x}^{k}+\frac{252dL^{2}}{1-\sigma^{2}}E_{\tilde{x}}^{k} (40)
+20N1σ2𝔼[f(x¯k)2]+171NL2d21σ2(u~k)2.\displaystyle+\frac{20N}{1-\sigma^{2}}\mathbb{E}[\|\nabla f(\bar{x}^{k})\|^{2}]+\frac{171NL^{2}d^{2}}{1-\sigma^{2}}(\tilde{u}^{k})^{2}.

Combining (29), (33) and (40), we complete the proof.

Appendix D. Proof of Lemma 4

Using Lemma 9, we derive a range of α\alpha and a positive vector π=[π1,π2,π3]\pi=[\pi_{1},\pi_{2},\pi_{3}] such that the following inequality holds:

Aπ(11σ22d)π,A\pi\leq(1-\frac{1-\sigma^{2}}{2d})\pi,

which can be equivalently written as

c1αLπ3π1(2312d)(1σ2),\displaystyle c_{1}\alpha L\frac{\pi_{3}}{\pi_{1}}\leq\left(\frac{2}{3}-\frac{1}{2d}\right)(1-\sigma^{2}), (41a)
(c2αL+1+2σ23)π1π2+c1αLπ3π2(1d12d)(1σ2),\displaystyle\!\!\left(\!c_{2}\alpha L\!+\!\frac{1\!+\!2\sigma^{2}}{3}\!\right)\!\frac{\pi_{1}}{\pi_{2}}+c_{1}\alpha L\frac{\pi_{3}}{\pi_{2}}\leq\left(\frac{1}{d}\!-\!\frac{1}{2d}\right)\!(1\!-\!\sigma^{2}), (41b)
c1αLπ1π3+13c1αLπ2π3(1312d)(1σ2),\displaystyle c_{1}\alpha L\frac{\pi_{1}}{\pi_{3}}+\frac{1}{3}c_{1}\alpha L\frac{\pi_{2}}{\pi_{3}}\leq\left(\frac{1}{3}-\frac{1}{2d}\right)(1-\sigma^{2}), (41c)

where c1=929d1σ2c_{1}=\frac{9\sqrt{29}\sqrt{d}}{1-\sigma^{2}} and c2=17dc_{2}=17\sqrt{d}. Under the condition d3d\geq 3, to show the inequalities (41), it is sufficient to check the following inequalities hold:

13c1αLπ3π11σ22d,\displaystyle\frac{1}{3}c_{1}\alpha L\frac{\pi_{3}}{\pi_{1}}\leq\frac{1-\sigma^{2}}{2d},
(c2αL+1+2σ23)π1π2+c1αLπ3π21σ22d,\displaystyle(c_{2}\alpha L+\frac{1+2\sigma^{2}}{3})\frac{\pi_{1}}{\pi_{2}}+c_{1}\alpha L\frac{\pi_{3}}{\pi_{2}}\leq\frac{1-\sigma^{2}}{2d},
c1αLπ1π3+13c1αLπ2π31σ22d,\displaystyle c_{1}\alpha L\frac{\pi_{1}}{\pi_{3}}+\frac{1}{3}c_{1}\alpha L\frac{\pi_{2}}{\pi_{3}}\leq\frac{1-\sigma^{2}}{2d},

and by further letting π1π2=1σ23d\frac{\pi_{1}}{\pi_{2}}=\frac{1-\sigma^{2}}{3d} and π2π3=3\frac{\pi_{2}}{\pi_{3}}=3, it is sufficient to show the following inequalities:

13c1αLd1σ21σ22d,\displaystyle\frac{1}{3}c_{1}\alpha L\frac{d}{1-\sigma^{2}}\leq\frac{1-\sigma^{2}}{2d},
c2αL1σ2d+c1αL1σ22d,\displaystyle c_{2}\alpha L\frac{1-\sigma^{2}}{d}+c_{1}\alpha L\leq\frac{1-\sigma^{2}}{2d},
c1αL1σ2d+c1αL1σ22d.\displaystyle c_{1}\alpha L\frac{1-\sigma^{2}}{d}+c_{1}\alpha L\leq\frac{1-\sigma^{2}}{2d}.

Using inequality c2<c1c_{2}<c_{1}, we have the following inequality:

max{\displaystyle\max\Bigg{\{} c2αL1σ2d+c1αL,c1αL1σ2d+c1αL,\displaystyle c_{2}\alpha L\frac{1\!-\!\sigma^{2}}{d}+c_{1}\alpha L,c_{1}\alpha L\frac{1\!-\!\sigma^{2}}{d}+c_{1}\alpha L,
13c1αLd1σ2}<αL(c11σ2d+c1d3(1σ2)).\displaystyle\frac{1}{3}c_{1}\alpha L\frac{d}{1\!-\!\sigma^{2}}\Bigg{\}}<\alpha L\left(c_{1}\frac{1\!-\!\sigma^{2}}{d}+c_{1}\frac{d}{3(1\!-\!\sigma^{2})}\right).

Consequently, it suffices to ensure that

αL(c11σ2d+c1d3(1σ2))1σ22d,\alpha L\left(c_{1}\frac{1-\sigma^{2}}{d}+c_{1}\frac{d}{3(1-\sigma^{2})}\right)\leq\frac{1-\sigma^{2}}{2d},

which can now be justified by the condition αL(1σ2)31229d52\alpha L\leq\frac{(1-\sigma^{2})^{3}}{12\sqrt{29}d^{\frac{5}{2}}}. It is straightforward to verify that π=[1σ2d,3,1]T\pi=[\frac{1-\sigma^{2}}{d},3,1]^{T} satisfies π1π2=1σ23d\frac{\pi_{1}}{\pi_{2}}=\frac{1-\sigma^{2}}{3d} and π2π3=3\frac{\pi_{2}}{\pi_{3}}=3, which completes the proof.

Appendix E. Proof of Lemma 5

Based on Lemmas 3 and 4 , we are ready to prove Lemma 5.

It is straightforward to verify that

αLmin{16d,312d(d+\mfracd1+σ21p),(1σ2)31229d52}\alpha L\leq\min\!\left\{\frac{1}{6\sqrt{d}},\frac{\sqrt{3}}{12d}\!\left(\!-\sqrt{d}\!+\!\sqrt{\mfrac{d\!-\!1\!+\!\sigma^{2}}{1-p}}\right)\!,\frac{(1-\sigma^{2})^{3}}{12\sqrt{29}d^{\frac{5}{2}}}\right\}

is a sufficient (and unnecessary) condition for

αLmin{\displaystyle\alpha L\leq\min\Bigg{\{} 16d,(1σ2)212σ2(1+2σ2)d(1+p),\displaystyle\frac{1}{6\sqrt{d}},\frac{(1-\sigma^{2})^{2}}{12\sigma^{2}(1+2\sigma^{2})\sqrt{d(1+p)}},
312d(d+\mfracd1+σ21p)}.\displaystyle\frac{\sqrt{3}}{12d}\left(-\sqrt{d}+\sqrt{\mfrac{d-1+\sigma^{2}}{1-p}}\right)\Bigg{\}}.

Thus, Lemma 3 holds.

Note that the matrix norm π\interleave\cdot\interleave_{\infty}^{\pi} is induced by the vector norm π\|\cdot\|_{\infty}^{\pi}. From the inequality (16), we derive that

vk+1π\displaystyle\|v^{k+1}\|_{\infty}^{\pi} Avkπ+Bkπ\displaystyle\leq\|Av^{k}\|_{\infty}^{\pi}+\|B_{k}\|_{\infty}^{\pi} (42)
Aπvkπ+Bkπ,\displaystyle\leq\interleave A\interleave_{\infty}^{\pi}\|v^{k}\|_{\infty}^{\pi}+\|B_{k}\|_{\infty}^{\pi},

and by induction on (42), we get

vkπ(Aπ)kv0π+m=0k1(Aπ)mBkm1π.\displaystyle\|v^{k}\|_{\infty}^{\pi}\leq(\interleave A\interleave_{\infty}^{\pi})^{k}\|v_{0}\|_{\infty}^{\pi}+\sum_{m=0}^{k-1}(\interleave A\interleave_{\infty}^{\pi})^{m}\|B_{k-m-1}\|_{\infty}^{\pi}.

Using the definition of the vector norm xw=maxi|xi|/wi\|x\|_{\infty}^{w}=\max_{i}|x_{i}|/w_{i}, we derive that

max{d1σ2Exk,13Ex~k,α329LdEsk}\displaystyle\max\left\{\frac{d}{1-\sigma^{2}}E_{x}^{k},\frac{1}{3}E_{\tilde{x}}^{k},\frac{\alpha}{3\sqrt{29}L\sqrt{d}}E_{s}^{k}\right\} (43)
\displaystyle\leq{} m=0k1(Aπ)m(29Nα3(1σ2)Ld𝔼[f(x¯km1)2]\displaystyle\sum_{m=0}^{k-1}(\interleave A\interleave_{\infty}^{\pi})^{m}\bigg{(}\frac{\sqrt{29}N\alpha}{3(1-\sigma^{2})L\sqrt{d}}\mathbb{E}[\|\nabla f(\bar{x}^{k-m-1})\|^{2}]
+2291σ2Nd32αL(u~km1)2)+(Aπ)kv0π.\displaystyle+\frac{2\sqrt{29}}{1-\sigma^{2}}Nd^{\frac{3}{2}}\alpha L(\tilde{u}^{k-m-1})^{2}\bigg{)}+(\interleave A\interleave_{\infty}^{\pi})^{k}\|v_{0}\|_{\infty}^{\pi}.

Note that for any nonnegative sequence (am)m(a_{m})_{m\in\mathbb{N}} and ρ(0,1)\rho\in(0,1), we have

τ=1km=0τ1ρmaτm1\displaystyle\sum_{\tau=1}^{k}\sum_{m=0}^{\tau-1}\rho^{m}a_{\tau-m-1} =τ=1km=0τ1ρτm1am\displaystyle=\sum_{\tau=1}^{k}\sum_{m=0}^{\tau-1}\rho^{\tau-m-1}a_{m} (44)
=m=0k1ρm1amτ=m+1kρτ\displaystyle=\sum_{m=0}^{k-1}\rho^{-m-1}a_{m}\sum_{\tau=m+1}^{k}\rho^{\tau}
11ρm=0k1am.\displaystyle\leq\frac{1}{1-\rho}\sum_{m=0}^{k-1}a_{m}.

Consequently, we obtain

max{τ=0kd1σ2Exτ,τ=0k13Ex~τ,τ=0kα329LdEsτ}\displaystyle\max\left\{\sum_{\tau=0}^{k}\frac{d}{1-\sigma^{2}}E_{x}^{\tau},\sum_{\tau=0}^{k}\frac{1}{3}E_{\tilde{x}}^{\tau},\sum_{\tau=0}^{k}\frac{\alpha}{3\sqrt{29}L\sqrt{d}}E_{s}^{\tau}\right\}
\displaystyle\leq{} 4d1σ2R0+229dNα3(1σ2)2Lm=0k1𝔼[f(x¯m)2]\displaystyle\frac{4d}{1-\sigma^{2}}R_{0}+\frac{2\sqrt{29}\sqrt{d}N\alpha}{3(1-\sigma^{2})^{2}L}\sum_{m=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{m})\|^{2}]
+429Nd52αL(1σ2)2m=0k1(u~m)2,\displaystyle+\frac{4\sqrt{29}Nd^{\frac{5}{2}}\alpha L}{(1-\sigma^{2})^{2}}\sum_{m=0}^{k-1}(\tilde{u}^{m})^{2},

where R0=v0π=d1σ2Ex0R_{0}=\|v_{0}\|_{\infty}^{\pi}=\frac{d}{1-\sigma^{2}}E_{x}^{0} and we have used Aπ11σ22d\interleave A\interleave_{\infty}^{\pi}\leq 1-\frac{1-\sigma^{2}}{2d} in the inequality.

For the bound on τ=0(u~τ)2\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}, we first derive that

(u~k)2\displaystyle(\tilde{u}^{k})^{2} =maxi{𝔼[(u~ik)2]}\displaystyle=\max_{i}\{\mathbb{E}[(\tilde{u}_{i}^{k})^{2}]\}
=(1p)maxi{𝔼[(u~ik1)2]}+pmaxi{(uik)2}\displaystyle=(1-p)\max_{i}\{\mathbb{E}[(\tilde{u}_{i}^{k-1})^{2}]\}+p\max_{i}\{(u_{i}^{k})^{2}\}
=(1p)k(ui0)2+pm=0k1(1p)m(uikm)2,\displaystyle=(1-p)^{k}(u_{i}^{0})^{2}+p\sum_{m=0}^{k-1}(1-p)^{m}(u_{i}^{k-m})^{2},

where we have used the condition that all the uiku_{i}^{k} follow from the same iterative equation in the third equality. Then, using inequality (44), we further get

τ=0(u~τ)2=\displaystyle\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}={} τ=0(1p)τ(ui0)2+τ=1m=0τ1p(1p)m(uiτm)2\displaystyle\sum_{\tau=0}^{\infty}(1\!-\!p)^{\tau}(u_{i}^{0})^{2}+\sum_{\tau=1}^{\infty}\sum_{m=0}^{\tau-1}p(1\!-\!p)^{m}(u_{i}^{\tau-m})^{2}
\displaystyle\leq{} 1p(ui0)2+m=1(uim)2.\displaystyle\frac{1}{p}(u_{i}^{0})^{2}+\sum_{m=1}^{\infty}(u_{i}^{m})^{2}.

Since (uik)2(u_{i}^{k})^{2} is summable, we derive that (u~k)2(\tilde{u}^{k})^{2} is also summable, which completes the proof.

Appendix F. Proof of Corollary 1

Let p=1dp=\frac{1}{d} and β=43αLd\beta=4\sqrt{3}\alpha L\sqrt{d}. Using αL312d(1+1+σ2d1)\alpha L\leq\frac{\sqrt{3}}{12\sqrt{d}}\left(-1+\sqrt{1+\frac{\sigma^{2}}{d-1}}\right), we derive from (32) that

Ex~k+1\displaystyle E_{\tilde{x}}^{k+1}\leq{} (11σ2d)Ex~k+(17αLd+1+2σ23d)Exk\displaystyle(1-\frac{1-\sigma^{2}}{d})E_{\tilde{x}}^{k}+(17\alpha L\sqrt{d}+\frac{1+2\sigma^{2}}{3d})E_{x}^{k} (45)
+σ2(1+2σ2)(1σ2)dα2Esk+NαLdf(x¯k)2\displaystyle+\frac{\sigma^{2}(1+2\sigma^{2})}{(1-\sigma^{2})d}\alpha^{2}E_{s}^{k}+\frac{N\alpha}{L\sqrt{d}}\|\nabla f(\bar{x}^{k})\|^{2}
+5NαLd32(u~k)2.\displaystyle+5N\alpha Ld^{\frac{3}{2}}(\tilde{u}^{k})^{2}.

Combining (29), (40) and (45), we get

vk+1Avk+Bk,\displaystyle v^{k+1}\leq A^{\prime}v^{k}+B_{k},

where

A=[1+2σ230929d1σ2αL17dαL+1+2σ23d11σ2d929(1σ2)dαL929d1σ2αL329d1σ2αL2+σ23].A^{\prime}=\begin{bmatrix}\frac{1+2\sigma^{2}}{3}&0&\frac{9\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L\\ 17\sqrt{d}\alpha L+\frac{1+2\sigma^{2}}{3d}&1-\frac{1-\sigma^{2}}{d}&\frac{9\sqrt{29}}{(1-\sigma^{2})\sqrt{d}}\alpha L\\ \frac{9\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L&\frac{3\sqrt{29}\sqrt{d}}{1-\sigma^{2}}\alpha L&\frac{2+\sigma^{2}}{3}\end{bmatrix}.

We then demonstrate that the vector π=[1σ2,3,1]\pi^{\prime}=[1-\sigma^{2},3,1] and the step-size condition αL(1σ2)34529d32\alpha L\leq\frac{(1-\sigma^{2})^{3}}{45\sqrt{29}d^{\frac{3}{2}}} together ensure the inequality

Aπ(11σ22d)π.A^{\prime}\pi^{\prime}\leq(1-\frac{1-\sigma^{2}}{2d})\pi^{\prime}.

Similar to the proof of Lemma 4 , it suffices to verify

13(1σ2)c1αL1σ22d,\displaystyle\frac{1}{3(1-\sigma^{2})}c_{1}\alpha L\leq\frac{1-\sigma^{2}}{2d}, (46)
(1σ2)c2αL+c1αLd1σ22d,\displaystyle(1-\sigma^{2})c_{2}\alpha L+\frac{c_{1}\alpha L}{d}\leq\frac{1-\sigma^{2}}{2d},
(1σ2)c1αL+c1αL1σ22d.\displaystyle(1-\sigma^{2})c_{1}\alpha L+c_{1}\alpha L\leq\frac{1-\sigma^{2}}{2d}.

It is straightforward to verify that αL(1σ2)34529d32\alpha L\leq\frac{(1-\sigma^{2})^{3}}{45\sqrt{29}d^{\frac{3}{2}}} is a sufficient condition of inequality (46). Using Lemma 9, we then obtain

ρ(A)Aπ11σ22d.\rho(A^{\prime})\leq\interleave A^{\prime}\interleave_{\infty}^{\pi^{\prime}}\leq 1-\frac{1-\sigma^{2}}{2d}. (47)

Next, similar to Lemma 5, we can show that when

αLmin{16d,312d(1+1+\mfracσ2d1),(1σ2)34529d32},\alpha L\leq\min\!\left\{\frac{1}{6\sqrt{d}},\frac{\sqrt{3}}{12\sqrt{d}}\!\left(-1\!+\!\sqrt{1\!+\!\mfrac{\sigma^{2}}{d\!-\!1}}\right),\frac{(1-\sigma^{2})^{3}}{45\sqrt{29}d^{\frac{3}{2}}}\right\},

the following inequality holds:

max{11σ2τ=0kExτ,13τ=0kEx~τ,α329Ldτ=0kEsτ}\displaystyle\max\left\{\frac{1}{1-\sigma^{2}}\sum_{\tau=0}^{k}E_{x}^{\tau},\frac{1}{3}\sum_{\tau=0}^{k}E_{\tilde{x}}^{\tau},\frac{\alpha}{3\sqrt{29}L\sqrt{d}}\sum_{\tau=0}^{k}E_{s}^{\tau}\right\} (48)
\displaystyle\leq{} 4d1σ2R~0+229dNα3(1σ2)2Lm=0k1𝔼[f(x¯m)2]\displaystyle\frac{4d}{1-\sigma^{2}}\tilde{R}_{0}+\frac{2\sqrt{29}\sqrt{d}N\alpha}{3(1-\sigma^{2})^{2}L}\sum_{m=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{m})\|^{2}]
+429Nd52αL(1σ2)2m=0k1(u~m)2,\displaystyle+\frac{4\sqrt{29}Nd^{\frac{5}{2}}\alpha L}{(1-\sigma^{2})^{2}}\sum_{m=0}^{k-1}(\tilde{u}^{m})^{2},

where R~0=11σ2Ex0\tilde{R}_{0}=\frac{1}{1-\sigma^{2}}E_{x}^{0}.

Now, we combine (48) with (15) to derive

0\displaystyle 0\leq{} δ0α3τ=0k𝔼[f(x¯τ)2]+5αL2d2τ=0k(u~τ)2\displaystyle\delta^{0}-\frac{\alpha}{3}\sum_{\tau=0}^{k}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]+5\alpha L^{2}d^{2}\sum_{\tau=0}^{k}(\tilde{u}^{\tau})^{2} (49)
+66αdL2N(4d1σ2R~0+429Nd52αL(1σ2)2m=0k1(u~m)2\displaystyle+\frac{66\alpha dL^{2}}{N}(\frac{4d}{1-\sigma^{2}}\tilde{R}_{0}+\frac{4\sqrt{29}Nd^{\frac{5}{2}}\alpha L}{(1-\sigma^{2})^{2}}\sum_{m=0}^{k-1}(\tilde{u}^{m})^{2}
+229dNα3(1σ2)2Lm=0k1𝔼[f(x¯m)2])\displaystyle+\frac{2\sqrt{29}\sqrt{d}N\alpha}{3(1-\sigma^{2})^{2}L}\sum_{m=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{m})\|^{2}])
\displaystyle\leq{} δ0+dL29NR~0α6τ=0k𝔼[f(x¯τ)2]\displaystyle\delta^{0}+\frac{\sqrt{d}L}{\sqrt{29}N}\tilde{R}_{0}-\frac{\alpha}{6}\sum_{\tau=0}^{k}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]
+6αL2d2τ=0k(u~τ)2,\displaystyle+6\alpha L^{2}d^{2}\sum_{\tau=0}^{k}(\tilde{u}^{\tau})^{2},

where we have used 18αdL2N(1σ2)+16αdL2N3<66αdL2N\frac{18\alpha dL^{2}}{N}\cdot(1-\sigma^{2})+\frac{16\alpha dL^{2}}{N}\cdot 3<\frac{66\alpha dL^{2}}{N} in the first inequality and αL(1σ2)326429d321σ226429d32\alpha L\leq\frac{(1-\sigma^{2})^{3}}{264\sqrt{29}d^{\frac{3}{2}}}\leq\frac{1-\sigma^{2}}{264\sqrt{29}d^{\frac{3}{2}}} in the last inequality. We then derive from (49) that

1kτ=0k1𝔼[f(x¯τ)2]\displaystyle\frac{1}{k}\sum_{\tau=0}^{k-1}\mathbb{E}[\|\nabla f(\bar{x}^{\tau})\|^{2}]
\displaystyle\leq{} 1k[6αδ0+6dL29NαR~0+36L2d2τ=0(u~τ)2].\displaystyle\frac{1}{k}\left[\frac{6}{\alpha}\delta^{0}+\frac{6\sqrt{d}L}{\sqrt{29}N\alpha}\tilde{R}_{0}+36L^{2}d^{2}\sum_{\tau=0}^{\infty}(\tilde{u}^{\tau})^{2}\right].

The proof of Corollary 1 can now be concluded.

References

  • [1] Shu Liang, Le Yi Wang, and George Yin. Distributed smooth convex optimization with coupled constraints. IEEE Transactions on Automatic Control, 65(1):347–353, 2020.
  • [2] Xuan Zhang, Antonis Papachristodoulou, and Na Li. Distributed control for reaching optimal steady state in network systems: An optimization approach. IEEE Transactions on Automatic Control, 63(3):864–871, 2018.
  • [3] Jie Liu, Daniel W. C. Ho, and Lulu Li. A generic algorithm framework for distributed optimization over the time-varying network with communication delays. IEEE Transactions on Automatic Control, 69(1):371–378, 2024.
  • [4] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [5] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • [6] Angelia Nedić. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020.
  • [7] Guido Carnevale, Nicola Mimmo, and Giuseppe Notarstefano. Nonconvex distributed feedback optimization for aggregative cooperative robotics. Automatica, 167:111767, 2024.
  • [8] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [9] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on signal processing, 66(11):2834–2848, 2018.
  • [10] Gesualdo Scutari and Ying Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, 176:497–544, 2019.
  • [11] Zhongguo Li, Zhen Dong, Zhongchao Liang, and Zhengtao Ding. Surrogate-based distributed optimisation for expensive black-box functions. Automatica, 125:109407, 2021.
  • [12] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [13] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems, volume 36, pages 53038–53075, 2023.
  • [14] Anit Kumar Sahu, Dusan Jakovetic, Dragana Bajovic, and Soummya Kar. Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. arXiv preprint arXiv:1809.02920, 2018.
  • [15] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
  • [16] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462–466, 1952.
  • [17] Yujie Tang, Junshan Zhang, and Na Li. Distributed zero-order algorithms for nonconvex multiagent optimization. IEEE Transactions on Control of Network Systems, 8(1):269–281, 2021.
  • [18] Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. Machine learning for variance reduction in online experiments. In Advances in Neural Information Processing Systems, volume 34, pages 8637–8648, 2021.
  • [19] Chong Wang, Xi Chen, Alexander J. Smola, and Eric P. Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, volume 26, 2013.
  • [20] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, 2013.
  • [21] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • [22] Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing, 68:6255–6271, 2020.
  • [23] Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. ZONE: Zeroth-order nonconvex multiagent optimization over networks. IEEE Transactions on Automatic Control, 64(10):3995–4010, 2019.
  • [24] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
  • [25] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
  • [26] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.
  • [27] Angelia Nedić, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
  • [28] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.