This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Distributed Stochastic Optimization under Heavy-Tailed Noises

Chao Sun C. Sun is with the Institute of Artificial Intelligence, Beihang University, China. (Email: [email protected])
Abstract

This paper studies the distributed optimization problem under the influence of heavy-tailed gradient noises. Here, a heavy-tailed noise ξ\xi means that the noise ξ\xi does not necessarily satisfy the bounded variance assumption, i.e., E[ξ2]ν2E[\|\xi\|^{2}]\leq\nu^{2} for some positive constant ν\nu. Instead, it satisfies a more general assumption E[ξδ]νδE[\|\xi\|^{\delta}]\leq\nu^{\delta} for some 1<δ21<\delta\leq 2. The commonly-used bounded variance assumption is a special case of the considered noise assumption. A typical example of this kind of noise is a Pareto distribution noise with tail index within (1,2], which has infinite variance. Despite that there has been several distributed optimization algorithms proposed for the heavy-tailed noise scenario, these algorithms need a centralized server in the network which collects the information of all clients. Different from these algorithms, this paper considers that there is no centralized server and the agents can only exchange information with neighbors in a communication graph. A distributed method combining gradient clipping and distributed stochastic subgradient projection is proposed. It is proven that when the gradient descent step-size and the gradient clipping step-size meet certain conditions, the state of each agent converges to the optimal solution of the distributed optimization problem with probability 1. The simulation results validate the algorithm.

Index Terms:
Multi-agent system; Heavy-tailed noise; Distributed stochastic optimization.

I Introduction

I-A Distributed Optimization

Distributed optimization is a method of achieving optimization through cooperation among multiple agents, which can be used to solve large-scale optimization problems that are hard for centralized algorithms to cope with. The core idea of distributed optimization is to decompose a large problem into small problems and distribute the local objectives to multiple intelligent agents for solution. Compared with centralized optimization, distributed optimization overcomes the disadvantages of single point of failure, high computation and communication burden, and restrictions on scalability and flexibility [1]. Distributed optimization has a wide range of applications, such as machine learning [2], distributed economic dispatch problem of energy systems[3], and formation control problem in robotic systems [4].

I-B Distributed Stochastic Optimization

Different from deterministic optimization, stochastic optimization employs random factors to reach a solution to optimization problems. Stochastic optimization has wild applications in various fields. For example, the stochastic gradient descent (SGD) method has greatly participated in the progress of machine learning.

As an extension of distributed deterministic optimization to stochastic scenarios, distributed stochastic optimization has also attracted widespread attention. For instance, the authors in [5] presented a distributed stochastic subgradient projection algorithm tailored for convex optimization challenges. [6] introduced a distributed algorithm that relies on asynchronous step-sizes. [7] proposed a proximal primal-dual approach geared towards nonconvex and nonsmooth problems. [8] examined the high-probability convergence of distributed stochastic optimization.

I-C Heavy-Tailed Noise

Most of the above literature on stochastic optimization assume that the variance of the random variable is bounded. Despite that this is a mild assumption which covers many noises such as Gaussian noises, there are still many noises that do not satisfy this condition. For example, the Pareto distribution and the α\alpha-stable Levy distribution [9]. Heavy-tailed noises has been recently observed in many machine learning systems [9, 10, 11, 12, 13]. For example, in [9], it was found that an attention training model on Wikipedia can generate heavy-tailed noises. Moreover, it is shown that the traditional SGD method may diverge when the noise is heavy-tailed [9, 10].

I-D Contributions

In this work, we study the distributed stochastic optimization algorithm under heavy-tailed gradient noises using neighboring agents' information exchange only. The main contributions of this paper are listed as follows:

1) We design a distributed updating law to deal with the stochastic optimization problem under heavy-tailed noises. Strict proof is given to show that under some mild parameter conditions, the state of each agent converges to the optimal solution with probability 1.

2) Compared with the existing distributed stochastic optimization works, e.g., [5, 6, 14, 15, 16, 7], our algorithm allows heavy-tailed noises that satisfy E[ξδ]νδE[\|\xi\|^{\delta}]\leq\nu^{\delta} for some 1<δ21<\delta\leq 2 and positive constant ν\nu, which generalizes the commonly used assumption E[ξ2]ν2E[\|\xi\|^{2}]\leq\nu^{2}.

3) Compared with the existing literature on heavy-tailed noises, e.g., [9, 17, 18, 19, 10, 20, 21, 22], our algorithm is distributed in the sense that there is no central server and the agents can only exchange information via a strongly connected communication graph. Note that despite that [10, 20, 21, 22] studied heavy-tailed noises in a distributed setting, these algorithms require a central server [10, 21, 22] or the agents require the states of all the other agents at some time instant [20]. These works are not related to multi-agent consensus while our work is related to consensus and graph theory.

I-E Organization

The rest of this paper is organized as follows: In Section II, notations and preliminary knowledge are given. In Section III, the studied distributed optimization problem is formulated. In Section IV, a distributed updating law is proposed for each agent based on gradient clipping and distributed stochastic subgradient projection method. The almost sure convergence to the optimal solution is proven. In Section V, a numerical example is given to illustrate the effectiveness and efficiency of the proposed algorithm. Finally, in Section VI, we conclude the paper.

II Notations and Preliminaries

Throughout this paper, 0 represents a zero vector with an appropriate dimension or scalar zero. \mathbb{R} and N\mathbb{R}^{N} represent the real number set and the NN-dimensional real vector set, respectively. ||\left|\cdot\right| is the absolute value of a scalar, and \left\|\cdot\right\| is the 2-norm of a vector. The ``inf" represent the infimum of a sequence. PΩ[]P_{\Omega}[\cdot] is the Euclidean projection of a vector to the space Ω\Omega. [e]i[e]_{i} is the ii-th element of a vector ee. [A]i,j[A]_{i,j} is the element in the ii-th row and the jj-th column of a matrix AA. min{a,b}\min\{a,b\} represents the minimum of aa and bb. E[]E[\cdot] is the expectation of a variable.

Lemma 1

[5] For scalars β1,,βN0\beta_{1},\ldots,\beta_{N}\geq 0 where i=1Nβi=1\sum_{i=1}^{N}\beta_{i}=1 and vectors v1,,vNnv_{1},\ldots,v_{N}\in\mathbb{R}^{n}, the following conclusion holds:

i=1Nβivi\displaystyle\left\|\sum_{i=1}^{N}\beta_{i}v_{i}\right\| i=1Nβivi,\displaystyle\leq\sum_{i=1}^{N}\beta_{i}\left\|v_{i}\right\|,
i=1Nβivi2\displaystyle\left\|\sum_{i=1}^{N}\beta_{i}v_{i}\right\|^{2} i=1Nβivi2.\displaystyle\leq\sum_{i=1}^{N}\beta_{i}\|v_{i}\|^{2}. (1)
Lemma 2

(Lemma 3.1 of [5]) Let γk\gamma_{k} be a scalar sequence. (1) If limk+γk=γ\lim_{k\rightarrow+\infty}\gamma_{k}=\gamma and 0<β<10<\beta<1, then limk+=0kβkγ=γ1β.\lim_{k\rightarrow+\infty}\sum_{\ell=0}^{k}\beta^{k-\ell}\gamma_{\ell}=\frac{\gamma}{1-\beta}. (2) If γk0\gamma_{k}\geq 0 for all kk, k=0+γk<\sum_{k=0}^{+\infty}\gamma_{k}<\infty and 0<β<10<\beta<1, then k=0+(=0kβkγ)<\sum_{k=0}^{+\infty}\left(\sum_{\ell=0}^{k}\beta^{k-\ell}\gamma_{\ell}\right)<\infty.

III Problem Formulation

Consider a multi-agent system comprised of N>1N>1 agents. The agents cooperate to solve the following optimization problem

minθΩf(θ)=i=1Nfi(θ),\displaystyle\min_{\theta\in\Omega}f(\theta)=\sum_{i=1}^{N}f_{i}(\theta), (2)

where θΩn\theta\in\Omega\subset\mathbb{R}^{n} is the decision variable, f(θ)f(\theta) is the global objective function, fi(θ)f_{i}(\theta) is the local objective function in agent ii, and Ω\Omega is the local constraint set.

Suppose that each agent can only get information from its neighbors via a communication graph 𝒢N:={𝒱N,N}\mathcal{G}_{N}:=\{\mathcal{V}_{N},\mathcal{E}_{N}\} where 𝒱N={1,,N}\mathcal{V}_{N}=\{1,\cdots,N\} is the node set, and N𝒱N×𝒱N\mathcal{E}_{N}\subset\mathcal{V}_{N}\times\mathcal{V}_{N} is the edge set. Let AA be the adjacency matrix of 𝒢N\mathcal{G}_{N}.

Assumption 1

[5] The graph 𝒢N\mathcal{G}_{N} is strongly connected. The adjacency matrix AA is doubly stochastic, i.e., i=1Nai,j=j=1Nai,j=1\sum_{i=1}^{N}a_{i,j}=\sum_{j=1}^{N}a_{i,j}=1. There exists a scalar 0<η<10<\eta<1 such that ai,jηa_{i,j}\geq\eta when j𝒩ij\in\mathcal{N}_{i}.

Assumption 2

The constraint set Ω\Omega is nonempty, convex and compact.

Assumption 3

fi(θ)f_{i}(\theta) is strongly convex in θ\theta, i.e., there exists a modulus μ>0\mu>0 such that f(𝐳𝟐)f(𝐳𝟏)(f(𝐳𝟏))(𝐳𝟐𝐳𝟏)μ𝐳𝟏𝐳𝟐2f(\mathbf{z_{2}})-f(\mathbf{z_{1}})-(\nabla f(\mathbf{z_{1}}))^{\top}(\mathbf{z_{2}}-\mathbf{z_{1}})\geq\mu\|\mathbf{z_{1}}-\mathbf{z_{2}}\|^{2} for any 𝐳𝟏,𝐳𝟐Ω\mathbf{z_{1}},\mathbf{z_{2}}\in\Omega. Moreover, fi(θ)f_{i}(\theta) is continuously differentiable.

Under Assumptions 2 and 3, there exists a unique optimal solution θ\theta^{*}.

Remark 1

Assumptions 1, 2, 3 are all commonly-used assumptions in the literature. For example, Assumption 1 is used in [5, 16], Assumption 2 is used in [23, 9], and Assumption 3 is used in [24, 16].

IV Algorithm Design

The distributed updating law for agent ii is designed as

xi,k+1=PΩ[vi,kαkg^i,k(vi,k)],xi,0Ω\displaystyle x_{i,k+1}=P_{\Omega}\left[v_{i,k}-\alpha_{k}\hat{g}_{i,k}(v_{i,k})\right],x_{i,0}\in\Omega (3)

where

vi,k=j=1Nai,jxj,k,\displaystyle v_{i,k}=\sum_{j=1}^{N}a_{i,j}x_{j,k}, (4)

and

g^i,k(vi,k)\displaystyle\hat{g}_{i,k}(v_{i,k}) =min{1,τkgi,k(vi,k)}gi,k(vi,k),\displaystyle=\min\left\{1,\frac{\tau_{k}}{\|g_{i,k}(v_{i,k})\|}\right\}g_{i,k}(v_{i,k}),
gi,k(vi,k)\displaystyle g_{i,k}(v_{i,k}) =fi(vi,k)+ξi,k.\displaystyle=\nabla f_{i}(v_{i,k})+\xi_{i,k}. (5)

In (3)-(5), αk\alpha_{k} and τk\tau_{k} are two positive sequences determined later, ξi,k\xi_{i,k} is the random variable of agent ii at step kk.

Let k\mathcal{F}_{k} be the δ\delta-algebra generated by the random variables from 0 to k1k-1.

The following assumptions on the noises are made to facilitate the subsequent analysis.

Assumption 4

[9] (Bounded α\alpha-Moment) There exists two positive constants δ\delta and ν\nu with δ(1,2]\delta\in(1,2] and ν>0\nu>0 such that E[ξi,kδ|k]νδE[\|\xi_{i,k}\|^{\delta}|\mathcal{F}_{k}]\leq\nu^{\delta} with probability 1.

Remark 2

When δ=2\delta=2, Assumption 4 reduces to a standard assumption E[ξi,k2|k]ν2E[\|\xi_{i,k}\|^{2}|\mathcal{F}_{k}]\leq\nu^{2}. Note that a heavy-tailed noise satisfying Assumption 4 may violate the bounded variance assumption. See Section V for an example.

Assumption 5

(Unbiased Local Gradient Estimator) E[ξi,k|k]=0E[\xi_{i,k}|\mathcal{F}_{k}]=0 with probability 1.

Remark 3

The algorithm proposed in (3)-(5) is motivated by the GClip algorithm for a centralized convex optimization problem with heavy-tailed noises proposed in [9] and the distributed stochastic subgradient projection algorithm proposed for a distributed convex optimization problem with standard noises in [5].

Remark 4

In (3)-(5), all agents use the same gradient decent step-size and gradient clipping step-size. Thus, the algorithm is not ``fully distributed". In fact, this hypothesis is common in the existing distributed optimization literature. This work is still novel compared with the existing algorithms on heavy-tailed noises since the state exchange is distributed.

Let bi,k=g^i,kfi(vi,k)b_{i,k}=\hat{g}_{i,k}-\nabla f_{i}(v_{i,k}) and Bi,k=E[bi,k|k]=E[g^i,k|k]fi(vi,k)B_{i,k}=E[b_{i,k}|\mathcal{F}_{k}]=E[\hat{g}_{i,k}|\mathcal{F}_{k}]-\nabla f_{i}(v_{i,k}) .

Based on Assumption 2, there exists a positive scalar C0C_{0} such that fi(θ)C0\|\nabla f_{i}(\theta)\|\leq C_{0} for all θΩ\theta\in\Omega.

Based on the definition of vi,kv_{i,k} and Assumption 1, vi,kΩv_{i,k}\in\Omega for all k0k\geq 0.

Thus, fi(vi,k)C0\|\nabla f_{i}(v_{i,k})\|\leq C_{0} for all k0k\geq 0.

Based on the above analysis, we can arrive at the following conclusion.

Lemma 3

Suppose that Assumptions 4 and 5 hold and τk2C0\tau_{k}\geq 2C_{0}. Then, for all ii, we have Bi,k(2ν)δτk1δ\|B_{i,k}\|\leq(2\nu)^{\delta}\tau_{k}^{1-\delta} with probability 1.

Proof:

The proof is similar to Lemma 5.1 of [25]. For the completeness of this work, we put it here. Define

χ=𝕀{gi,k>τk}={1, if gi,k>τk,0, otherwise,\chi=\mathbb{I}_{\{\|g_{i,k}\|>\tau_{k}\}}=\bigg{\{}\begin{array}[]{ll}1,&\text{ if }\|g_{i,k}\|>\tau_{k},\\ 0,&\text{ otherwise,}\end{array}\ \quad (6)

and

ε=𝕀{ξi,k>τk2}={1, if ξi,k>τk2,0, otherwise.\varepsilon=\mathbb{I}_{\{\|\xi_{i,k}\|>\frac{\tau_{k}}{2}\}}=\bigg{\{}\begin{array}[]{ll}1,&\text{ if }\|\xi_{i,k}\|>\frac{\tau_{k}}{2},\\ 0,&\text{ otherwise.}\end{array}\ \quad (7)

Since τk2C0\tau_{k}\geq 2C_{0}, gi,kτk2+ξi,k\|g_{i,k}\|\leq\frac{\tau_{k}}{2}+\|\xi_{i,k}\|. Thus, εχ\varepsilon\geq\chi. Since g^i,k=min{1,τkgi,k}gi,k=χτkgi,kgi,k+(1χ)gi,k\hat{g}_{i,k}=\min\{1,\frac{\tau_{k}}{\|g_{i,k}\|}\}g_{i,k}=\chi\frac{\tau_{k}}{\|g_{i,k}\|}g_{i,k}+(1-\chi)g_{i,k} and based on Assumption 5,

Bi,k=\displaystyle\|B_{i,k}\|= E[(gi,k+χ(τkgi,k1)gi,k)|k]\displaystyle\bigg{\|}E\bigg{[}\left(g_{i,k}+\chi\left(\frac{\tau_{k}}{\|g_{i,k}\|}-1\right)g_{i,k}\right)|\mathcal{F}_{k}\bigg{]}
fi(vi,k)\displaystyle-\nabla f_{i}(v_{i,k})\bigg{\|}
=\displaystyle= E[(χ(τkgi,k1)gi,k)|k]\displaystyle\left\|E\left[\left(\chi\left(\frac{\tau_{k}}{\|g_{i,k}\|}-1\right)g_{i,k}\right)|\mathcal{F}_{k}\right]\right\|
\displaystyle\leq E[(χ|τkgi,k1|gi,k)|k]\displaystyle E\left[\left(\chi\left|\frac{\tau_{k}}{\|g_{i,k}\|}-1\right|\left\|g_{i,k}\right\|\right)|\mathcal{F}_{k}\right]
=\displaystyle= E[χ(1τkgi,k)gi,k|k]\displaystyle E\left[\chi\left(1-\frac{\tau_{k}}{\|g_{i,k}\|}\right)\|g_{i,k}\||\mathcal{F}_{k}\right]
\displaystyle\leq E[χgi,k|k]\displaystyle E\left[\chi\|g_{i,k}\||\mathcal{F}_{k}\right]
\displaystyle\leq E[εgi,k|k]\displaystyle E\left[\varepsilon\|g_{i,k}\||\mathcal{F}_{k}\right]
\displaystyle\leq E[(εξi,k+εfi(vi,k))|k]\displaystyle E\left[\left(\varepsilon\|\xi_{i,k}\|+\varepsilon\|\nabla f_{i}(v_{i,k})\|\right)|\mathcal{F}_{k}\right] (8)

with probability 1, where we used Jensen's inequality and the convexity of the norm.

Based on Holder's inequality for conditional expectation,

Bi,k\displaystyle\|B_{i,k}\|\leq (E[ξi,kδ|k])1δ(E[εδδ1|k])δ1δ\displaystyle\left(E\left[\|\xi_{i,k}\|^{\delta}|\mathcal{F}_{k}\right]\right)^{\frac{1}{\delta}}\left(E\left[\varepsilon^{\frac{\delta}{\delta-1}}|\mathcal{F}_{k}\right]\right)^{\frac{\delta-1}{\delta}}
+τk2E[ε|k]\displaystyle+\frac{\tau_{k}}{2}E[\varepsilon|\mathcal{F}_{k}]
\displaystyle\leq ν(E[εδδ1|k])δ1δ+τk2E[ε|k]\displaystyle\nu\left(E\left[\varepsilon^{\frac{\delta}{\delta-1}}|\mathcal{F}_{k}\right]\right)^{\frac{\delta-1}{\delta}}+\frac{\tau_{k}}{2}E[\varepsilon|\mathcal{F}_{k}]
=\displaystyle= ν(E[ε|k])δ1δ+τk2E[ε|k]\displaystyle\nu\left(E[\varepsilon|\mathcal{F}_{k}]\right)^{\frac{\delta-1}{\delta}}+\frac{\tau_{k}}{2}E[\varepsilon|\mathcal{F}_{k}] (9)

with probability 1.

Based on the Markov's inequality for conditional expectation,

E[ε|k]\displaystyle E[\varepsilon|\mathcal{F}_{k}] =P(ξi,k>τk2|k)\displaystyle=P(\|\xi_{i,k}\|>\frac{\tau_{k}}{2}|\mathcal{F}_{k})
=P(ξi,kδ>τkδ2δ|k)\displaystyle=P(\|\xi_{i,k}\|^{\delta}>\frac{\tau_{k}^{\delta}}{2^{\delta}}|\mathcal{F}_{k})
E[ξi,kδ|k]τkδ2δ\displaystyle\leq\frac{E\left[\|\xi_{i,k}\|^{\delta}|\mathcal{F}_{k}\right]}{\frac{\tau_{k}^{\delta}}{2^{\delta}}}
(2ν)δτkδ\displaystyle\leq(2\nu)^{\delta}\tau_{k}^{-\delta} (10)

with probability 1.

Then, Bi,kν(2ν)δ1τk1δ+12(2ν)δτk1δ=(2ν)δτk1δ\|B_{i,k}\|\leq\nu(2\nu)^{\delta-1}\tau_{k}^{1-\delta}+\frac{1}{2}(2\nu)^{\delta}\tau_{k}^{1-\delta}=(2\nu)^{\delta}\tau_{k}^{1-\delta} with probability 1. ∎

The following lemma can be directly obtained from Lemma 3.2 of [5].

Lemma 4

Under Assumption 1, for any k>0k>0, the following inequality holds:

|[Ak]i,j1N|θβk,\displaystyle\left|[A^{k}]_{i,j}-\frac{1}{N}\right|\leq\theta\beta^{k}, (11)

where θ=(1η4N2)2\theta=\left(1-\frac{\eta}{4N^{2}}\right)^{-2}, and β=(1η4N2)1Q\beta=\left(1-\frac{\eta}{4N^{2}}\right)^{\frac{1}{Q}}. Here, η\eta was defined in Assumption 1 and QQ is the number of edges in graph 𝒢N\mathcal{G}_{N}. [Ak]i,j[A^{k}]_{i,j} is the element of the ii-th row and jj-th column of AkA^{k}.

Based on the above lemmas, the following conclusion holds.

Theorem 1

Suppose that Assumptions 1-5 hold, and the parameters αk\alpha_{k} and τk\tau_{k} satisfy the following condition:

k=0+αk=,k=0+αk2<+,\displaystyle\sum_{k=0}^{+\infty}\alpha_{k}=\infty,\sum_{k=0}^{+\infty}\alpha_{k}^{2}<+\infty, (12)
τk2C0,limk+αkτk=0,\displaystyle\tau_{k}\geq 2C_{0},\lim_{k\rightarrow+\infty}\alpha_{k}\tau_{k}=0, (13)
k=0+αk2τk2<+,k=0+αkτk22δ<+,\displaystyle\sum_{k=0}^{+\infty}\alpha_{k}^{2}\tau_{k}^{2}<+\infty,\sum_{k=0}^{+\infty}\alpha_{k}\tau_{k}^{2-2\delta}<+\infty, (14)

where C0C_{0} is a constant satisfying fi(θ)C0\|\nabla f_{i}(\theta)\|\leq C_{0} for all θΩ\theta\in\Omega. Then, xi,kx_{i,k} converges to the unique optimal solution θ\theta^{*} with probability 1.

Proof:

Define

yk=1Ni=1Nxi,k for all ky_{k}=\frac{1}{N}\sum_{i=1}^{N}x_{i,k}\quad\text{ for all }k (15)

and

pi,k+1=xi,k+1vi,k=xi,k+1j=1Nai,jxj,k.p_{i,k+1}=x_{i,k+1}-v_{i,k}=x_{i,k+1}-\sum_{j=1}^{N}a_{i,j}x_{j,k}. (16)

By (15) and (16), we have

yk+1\displaystyle y_{k+1} =1Ni=1N(j=1Nai,jxj,k+pi,k+1)\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left(\sum_{j=1}^{N}a_{i,j}x_{j,k}+p_{i,k+1}\right)
=1N(j=1N(i=1Nai,j)xj,k+i=1Npi,k+1)\displaystyle=\frac{1}{N}\left(\sum_{j=1}^{N}\left(\sum_{i=1}^{N}a_{i,j}\right)x_{j,k}+\sum_{i=1}^{N}p_{i,k+1}\right)
=1N(j=1Nxj,k+i=1Npi,k+1)\displaystyle=\frac{1}{N}\left(\sum_{j=1}^{N}x_{j,k}+\sum_{i=1}^{N}p_{i,k+1}\right)
=yk+1Ni=1Npi,k+1,\displaystyle=y_{k}+\frac{1}{N}\sum_{i=1}^{N}p_{i,k+1}, (17)

where we used Assumption 1.

Then,

yk+1\displaystyle y_{k+1} =y0+1N=1k+1i=1Npi,\displaystyle=y_{0}+\frac{1}{N}\sum_{\ell=1}^{k+1}\sum_{i=1}^{N}p_{i,\ell}
=1Ni=1Nxi,0+1N=1k+1i=1Npi,\displaystyle=\frac{1}{N}\sum_{i=1}^{N}x_{i,0}+\frac{1}{N}\sum_{\ell=1}^{k+1}\sum_{i=1}^{N}p_{i,\ell}
=1Nj=1Nxj,0+1N=1k+1j=1Npj,.\displaystyle=\frac{1}{N}\sum_{j=1}^{N}x_{j,0}+\frac{1}{N}\sum_{\ell=1}^{k+1}\sum_{j=1}^{N}p_{j,\ell}. (18)

Let 𝐱k=[x1,k,,xN,k]\mathbf{x}_{k}=[x_{1,k}^{\top},\cdots,x_{N,k}^{\top}]^{\top}, 𝐩k=[p1,k,,pN,k]\mathbf{p}_{k}=[p_{1,k}^{\top},\cdots,p_{N,k}^{\top}]^{\top}, A~=AIn\tilde{A}=A\otimes I_{n}, then (16) can be written as

𝐩k+1=𝐱k+1A~𝐱k,\mathbf{p}_{k+1}=\mathbf{x}_{k+1}-\tilde{A}\mathbf{x}_{k}, (19)

which implies that

𝐱k+1\displaystyle\mathbf{x}_{k+1} =𝐩k+1+A~𝐱k\displaystyle=\mathbf{p}_{k+1}+\tilde{A}\mathbf{x}_{k}
=𝐩k+1+A~(𝐩k+A~𝐱k1)\displaystyle=\mathbf{p}_{k+1}+\tilde{A}(\mathbf{p}_{k}+\tilde{A}\mathbf{x}_{k-1})
==0kA~𝐩k+1+A~k+1𝐱0\displaystyle=\sum_{\ell=0}^{k}\tilde{A}^{\ell}\mathbf{p}_{k+1-\ell}+\tilde{A}^{k+1}\mathbf{x}_{0}
==1k+1A~k+1𝐩+A~k+1𝐱0,\displaystyle=\sum_{\ell=1}^{k+1}\tilde{A}^{k+1-\ell}\mathbf{p}_{\ell}+\tilde{A}^{k+1}\mathbf{x}_{0}, (20)

where in the last step we use k+1k+1-\ell to replace \ell, and A~0:=INn\tilde{A}^{0}:=I_{Nn}.

Thus,

xi,k+1\displaystyle x_{i,k+1} ==1k+1j=1N[Ak+1]i,jpj,+j=1N[Ak+1]i,jxj,0,\displaystyle=\sum_{\ell=1}^{k+1}\sum_{j=1}^{N}[A^{k+1-\ell}]_{i,j}p_{j,\ell}+\sum_{j=1}^{N}[A^{k+1}]_{i,j}x_{j,0}, (21)

where A0INA^{0}\triangleq I_{N}.

Based on (3),(16) and the non-expansive property of the projection operator,

pi,k+1\displaystyle\|p_{i,k+1}\| =PΩ[vi,kαkg^i,k]vi,k\displaystyle=\|P_{\Omega}\left[v_{i,k}-\alpha_{k}\hat{g}_{i,k}\right]-v_{i,k}\|
αkg^i,k\displaystyle\leq\alpha_{k}\|\hat{g}_{i,k}\|
αkτk.\displaystyle\leq\alpha_{k}\tau_{k}. (22)

By (18) and (21), we have

yk+1xi,k+1\displaystyle\|y_{k+1}-x_{i,k+1}\| =j=1N(1N[Ak+1]i,j)xj,0\displaystyle=\bigg{\|}\sum_{j=1}^{N}\left(\frac{1}{N}-[A^{k+1}]_{i,j}\right)x_{j,0}
+1N=1k+1j=1Npj,=1k+1j=1N[Ak+1]i,jpj,\displaystyle+\frac{1}{N}\sum_{\ell=1}^{k+1}\sum_{j=1}^{N}p_{j,\ell}-\sum_{\ell=1}^{k+1}\sum_{j=1}^{N}[A^{k+1-\ell}]_{i,j}p_{j,\ell}\bigg{\|}
=j=1N(1N[Ak+1]i,j)xj,0\displaystyle=\bigg{\|}\sum_{j=1}^{N}\left(\frac{1}{N}-[A^{k+1}]_{i,j}\right)x_{j,0}
+1N=1kj=1Npj,=1kj=1N[Ak+1]i,jpj,\displaystyle+\frac{1}{N}\sum_{\ell=1}^{k}\sum_{j=1}^{N}p_{j,\ell}-\sum_{\ell=1}^{k}\sum_{j=1}^{N}[A^{k+1-\ell}]_{i,j}p_{j,\ell}
+1Nj=1Npj,k+1pi,k+1\displaystyle+\frac{1}{N}\sum_{j=1}^{N}p_{j,k+1}-p_{i,k+1}\bigg{\|}
j=1N|1N[Ak+1]i,j|xj,0\displaystyle\leq\sum_{j=1}^{N}\left|\frac{1}{N}-[A^{k+1}]_{i,j}\right|\|x_{j,0}\|
+=1kj=1N|1N[Ak+1]i,j|pj,\displaystyle+\sum_{\ell=1}^{k}\sum_{j=1}^{N}\left|\frac{1}{N}-[A^{k+1-\ell}]_{i,j}\right|\|p_{j,\ell}\|
+1Nj=1Npj,k+1+pi,k+1\displaystyle+\frac{1}{N}\sum_{j=1}^{N}\|p_{j,k+1}\|+\|p_{i,k+1}\|
Nθβk+1C1+N=1kθβk+1α1τ1\displaystyle\leq N\theta\beta^{k+1}C_{1}+N\sum_{\ell=1}^{k}\theta\beta^{k+1-\ell}\alpha_{\ell-1}\tau_{\ell-1}
+2αkτk,\displaystyle+2\alpha_{k}\tau_{k}, (23)

where in the last step we used Lemma 4 and (22), and C1C_{1} is a constant satisfying xj,0C1\|x_{j,0}\|\leq C_{1} for all j𝒱Nj\in\mathcal{V}_{N}.

For any zΩz\in\Omega, based on the non-expansive property and the definition of bi,kb_{i,k},

xi,k+1z2=\displaystyle\|x_{i,k+1}-z\|^{2}= PΩ[vi,kαkg^i,k]z2\displaystyle\|P_{\Omega}\left[v_{i,k}-\alpha_{k}\hat{g}_{i,k}\right]-z\|^{2}
\displaystyle\leq vi,kαkg^i,kz2\displaystyle\|v_{i,k}-\alpha_{k}\hat{g}_{i,k}-z\|^{2}
=\displaystyle= vi,kz22αkg^i,k(vi,kz)\displaystyle\|v_{i,k}-z\|^{2}-2\alpha_{k}\hat{g}_{i,k}^{\top}\left(v_{i,k}-z\right)
+αk2g^i,k2\displaystyle+\alpha_{k}^{2}\|\hat{g}_{i,k}\|^{2}
=\displaystyle= vi,kz22αkbi,k(vi,kz)\displaystyle\|v_{i,k}-z\|^{2}-2\alpha_{k}b_{i,k}^{\top}\left(v_{i,k}-z\right)
2αkfi(vi,k)(vi,kz)\displaystyle-2\alpha_{k}\nabla f_{i}(v_{i,k})^{\top}\left(v_{i,k}-z\right)
+αk2g^i,k2\displaystyle+\alpha_{k}^{2}\|\hat{g}_{i,k}\|^{2}
\displaystyle\leq (1μαk)vi,kz2\displaystyle(1-\mu\alpha_{k})\|v_{i,k}-z\|^{2}
2αk(fi(vi,k)fi(z))\displaystyle-2\alpha_{k}(f_{i}(v_{i,k})-f_{i}(z))
2αkbi,k(vi,kz)\displaystyle-2\alpha_{k}b_{i,k}^{\top}\left(v_{i,k}-z\right)
+αk2τk2,\displaystyle+\alpha_{k}^{2}\tau_{k}^{2}, (24)

where in the last step we used the strong convexity of fif_{i} in Assumption 3.

Based on Lemma 1 and Assumption 1, we have

i=1Nvi,k+1z2\displaystyle\sum_{i=1}^{N}\|v_{i,k+1}-z\|^{2} =i=1Nj=1Nai,jxj,k+1z2\displaystyle=\sum_{i=1}^{N}\bigg{\|}\sum_{j=1}^{N}a_{i,j}x_{j,k+1}-z\bigg{\|}^{2}
i=1Nj=1Nai,jxj,k+1z2\displaystyle\leq\sum_{i=1}^{N}\sum_{j=1}^{N}a_{i,j}\|x_{j,k+1}-z\|^{2}
=j=1Ni=1Nai,jxj,k+1z2\displaystyle=\sum_{j=1}^{N}\sum_{i=1}^{N}a_{i,j}\|x_{j,k+1}-z\|^{2}
=j=1Nxj,k+1z2.\displaystyle=\sum_{j=1}^{N}\|x_{j,k+1}-z\|^{2}. (25)

Then, based on (24) and (24),

i=1Nvi,k+1z2\displaystyle\sum_{i=1}^{N}\|v_{i,k+1}-z\|^{2}\leq (1μαk)i=1Nvi,kz2\displaystyle(1-\mu\alpha_{k})\sum_{i=1}^{N}\|v_{i,k}-z\|^{2}
2αki=1N(fi(vi,k)fi(z))\displaystyle-2\alpha_{k}\sum_{i=1}^{N}(f_{i}(v_{i,k})-f_{i}(z))
2αki=1Nbi,k(vi,kz)\displaystyle-2\alpha_{k}\sum_{i=1}^{N}b_{i,k}^{\top}\left(v_{i,k}-z\right)
+Nαk2τk2.\displaystyle+N\alpha_{k}^{2}\tau_{k}^{2}. (26)

Based on Lemma 1 and Assumption 3,

fi(vi,k)fi(z)=\displaystyle f_{i}\left(v_{i,k}\right)-f_{i}(z)= (fi(vi,k)fi(yk))\displaystyle\left(f_{i}\left(v_{i,k}\right)-f_{i}\left(y_{k}\right)\right)
+(fi(yk)fi(z))\displaystyle+\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right)
\displaystyle\geq fi(vi,k)ykvi,k\displaystyle-\|\nabla f_{i}\left(v_{i,k}\right)\|\|y_{k}-v_{i,k}\|
+(fi(yk)fi(z))\displaystyle+\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right)
=\displaystyle= fi(vi,k)ykj=1Nai,jxj,k\displaystyle-\|\nabla f_{i}\left(v_{i,k}\right)\|\bigg{\|}y_{k}-\sum_{j=1}^{N}a_{i,j}x_{j,k}\bigg{\|}
+(fi(yk)fi(z))\displaystyle+\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right)
\displaystyle\geq fi(vi,k)j=1Nai,jykxj,k\displaystyle-\|\nabla f_{i}\left(v_{i,k}\right)\|\sum_{j=1}^{N}a_{i,j}\|y_{k}-x_{j,k}\|
+(fi(yk)fi(z)).\displaystyle+\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right). (27)

Thus, based on Assumption 1,

i=1N(fi(vi,k)fi(z))\displaystyle\sum_{i=1}^{N}(f_{i}\left(v_{i,k}\right)-f_{i}(z))
\displaystyle\geq i=1Nfi(vi,k)j=1Nai,jykxj,k\displaystyle-\sum_{i=1}^{N}\|\nabla f_{i}\left(v_{i,k}\right)\|\sum_{j=1}^{N}a_{i,j}\|y_{k}-x_{j,k}\|
+i=1N(fi(yk)fi(z))\displaystyle+\sum_{i=1}^{N}\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right)
\displaystyle\geq C0i=1Nj=1Nai,jykxj,k\displaystyle-C_{0}\sum_{i=1}^{N}\sum_{j=1}^{N}a_{i,j}\|y_{k}-x_{j,k}\|
+i=1N(fi(yk)fi(z))\displaystyle+\sum_{i=1}^{N}\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right)
=\displaystyle= C0j=1Nykxj,k+i=1N(fi(yk)fi(z)).\displaystyle-C_{0}\sum_{j=1}^{N}\|y_{k}-x_{j,k}\|+\sum_{i=1}^{N}\left(f_{i}\left(y_{k}\right)-f_{i}(z)\right). (28)

Then, based on (28), the inequality (26) can be written as

i=1Nvi,k+1z2\displaystyle\sum_{i=1}^{N}\|v_{i,k+1}-z\|^{2}\leq (1μαk)i=1Nvi,kz2\displaystyle(1-\mu\alpha_{k})\sum_{i=1}^{N}\|v_{i,k}-z\|^{2}
+2C0αkj=1Nykxj,k\displaystyle+2C_{0}\alpha_{k}\sum_{j=1}^{N}\|y_{k}-x_{j,k}\|
2αk(f(yk)f(z))\displaystyle-2\alpha_{k}\left(f\left(y_{k}\right)-f(z)\right)
2αki=1Nbi,k(vi,kz)\displaystyle-2\alpha_{k}\sum_{i=1}^{N}b_{i,k}^{\top}\left(v_{i,k}-z\right)
+Nαk2τk2.\displaystyle+N\alpha_{k}^{2}\tau_{k}^{2}. (29)

Taking the σ\sigma-algebra of (29) and letting z=θz=\theta^{*} gives

i=1NE[vi,k+1θ2|k]\displaystyle\sum_{i=1}^{N}E[\|v_{i,k+1}-\theta^{*}\|^{2}|\mathcal{F}_{k}]\leq (1μαk)i=1Nvi,kθ2\displaystyle(1-\mu\alpha_{k})\sum_{i=1}^{N}\|v_{i,k}-\theta^{*}\|^{2}
+2C0αkj=1Nykxj,k\displaystyle+2C_{0}\alpha_{k}\sum_{j=1}^{N}\|y_{k}-x_{j,k}\|
2αk(f(yk)f(θ))\displaystyle-2\alpha_{k}\left(f\left(y_{k}\right)-f(\theta^{*})\right)
2αki=1NBi,kT(vi,kθ)\displaystyle-2\alpha_{k}\sum_{i=1}^{N}B_{i,k}^{T}\left(v_{i,k}-\theta^{*}\right)
+Nαk2τk2.\displaystyle+N\alpha_{k}^{2}\tau_{k}^{2}. (30)

Based on Lemma 3 and utilizing Young's inequality to the last but one term, we have

i=1NE[vi,k+1θ2|k]\displaystyle\sum_{i=1}^{N}E[\|v_{i,k+1}-\theta^{*}\|^{2}|\mathcal{F}_{k}]\leq i=1Nvi,kθ2\displaystyle\sum_{i=1}^{N}\|v_{i,k}-\theta^{*}\|^{2}
+2C0αkj=1Nykxj,k\displaystyle+2C_{0}\alpha_{k}\sum_{j=1}^{N}\|y_{k}-x_{j,k}\|
2αk(f(yk)f(θ))\displaystyle-2\alpha_{k}\left(f\left(y_{k}\right)-f(\theta^{*})\right)
+Nμαk(2ν)2δτk22δ\displaystyle+\frac{N}{\mu}\alpha_{k}(2\nu)^{2\delta}\tau_{k}^{2-2\delta}
+Nαk2τk2\displaystyle+N\alpha_{k}^{2}\tau_{k}^{2} (31)

with probability 1.

According to (23), it can be obtained that

k=0+αkj=1Nykxj,k\displaystyle\sum_{k=0}^{+\infty}\alpha_{k}\sum_{j=1}^{N}\|y_{k}-x_{j,k}\| N2θC1k=0+αkβk+1\displaystyle\leq N^{2}\theta C_{1}\sum_{k=0}^{+\infty}\alpha_{k}\beta^{k+1}
+N2θk=0+αk=1kβk+1α1τ1\displaystyle+N^{2}\theta\sum_{k=0}^{+\infty}\alpha_{k}\sum_{\ell=1}^{k}\beta^{k+1-\ell}\alpha_{\ell-1}\tau_{\ell-1}
+2Nk=0+αk2τk.\displaystyle+2N\sum_{k=0}^{+\infty}\alpha_{k}^{2}\tau_{k}. (32)

Since k=0+αk2<+\sum_{k=0}^{+\infty}\alpha_{k}^{2}<+\infty, αk\alpha_{k} is bounded. According to the definition of β\beta in Lemma 4, 0<β<10<\beta<1. Then, N2θC1k=0+αkβk+1N^{2}\theta C_{1}\sum_{k=0}^{+\infty}\alpha_{k}\beta^{k+1} is bounded.

Since k=0+αk2τk2<+\sum_{k=0}^{+\infty}\alpha_{k}^{2}\tau_{k}^{2}<+\infty, based on Lemma 2 (2), we have

N2θk=0+αk=1kβk+1α1τ1\displaystyle N^{2}\theta\sum_{k=0}^{+\infty}\alpha_{k}\sum_{\ell=1}^{k}\beta^{k+1-\ell}\alpha_{\ell-1}\tau_{\ell-1}
\displaystyle\leq N2θ2k=0+=1kβk+1(αk2+α12τ12)\displaystyle\frac{N^{2}\theta}{2}\sum_{k=0}^{+\infty}\sum_{\ell=1}^{k}\beta^{k+1-\ell}(\alpha_{k}^{2}+\alpha_{\ell-1}^{2}\tau_{\ell-1}^{2})
\displaystyle\leq N2θ2k=0+αk2β1β+k=0+=1kβk+1α12τ12\displaystyle\frac{N^{2}\theta}{2}\sum_{k=0}^{+\infty}\alpha_{k}^{2}\frac{\beta}{1-\beta}+\sum_{k=0}^{+\infty}\sum_{\ell=1}^{k}\beta^{k+1-\ell}\alpha_{\ell-1}^{2}\tau_{\ell-1}^{2}
<\displaystyle< +.\displaystyle+\infty. (33)

Based on the conditions k=0+αk2τk2<+\sum_{k=0}^{+\infty}\alpha_{k}^{2}\tau_{k}^{2}<+\infty and τk2C0\tau_{k}\geq 2C_{0}, we can get that k=0+αk2τk<+\sum_{k=0}^{+\infty}\alpha_{k}^{2}\tau_{k}<+\infty and k=0+αk2τk2δ<+\sum_{k=0}^{+\infty}\alpha_{k}^{2}\tau_{k}^{2-\delta}<+\infty. Thus, by (32), k=0+αkj=1Nykxj,k<+\sum_{k=0}^{+\infty}\alpha_{k}\sum_{j=1}^{N}\|y_{k}-x_{j,k}\|<+\infty.

Then, based on (31), condition (14) and Theorem 3.4 of [5], vi,kθ\|v_{i,k}-\theta^{*}\| converges and k=0+αk(f(yk)f(θ))<+\sum_{k=0}^{+\infty}\alpha_{k}\left(f\left(y_{k}\right)-f(\theta^{*})\right)<+\infty with probability 1.

Based on (23), condition (13) and Lemma 2 (1), limk+ykxi,k=0\lim_{k\rightarrow+\infty}\|y_{k}-x_{i,k}\|=0. Thus, limk+ykvi,k=0\lim_{k\rightarrow+\infty}\|y_{k}-v_{i,k}\|=0, which implies that ykθ\|y_{k}-\theta^{*}\| converges with probability 1. Since k=0+αk=\sum_{k=0}^{+\infty}\alpha_{k}=\infty and k=0+αk(f(yk)f(θ))<+\sum_{k=0}^{+\infty}\alpha_{k}\left(f\left(y_{k}\right)-f(\theta^{*})\right)<+\infty with probability 1, we have limk+inff(yk)=f(θ)\lim_{k\rightarrow+\infty}\text{inf}f(y_{k})=f(\theta^{*}) with probability 1. Based on the continuity of ff, we can obtain that yky_{k} converges to θ\theta^{*} with probability 1. Then, xi,kx_{i,k} converges to θ\theta^{*} with probability 1. ∎

Remark 5

The sequences satisfying conditions (12)-(14) exist. For example, αk=(k+1)1\alpha_{k}=(k+1)^{-1} and τk=2C0(k+1)0.4\tau_{k}=2C_{0}(k+1)^{0.4}.

Remark 6

Comparing the assumptions in [5] with those in this work, the main difference is that the conditions on noises in Assumption 4 are different where we consider heavy-tailed noises. In addition, in Assumption 3, we require the local objective functions to be strongly convex while [5] only requires convexity. The reason is that to guarantee the boundedness of the terms in the right-hand side of (31), with strong convexity, we need k=0+αkτk22δ<+\sum_{k=0}^{+\infty}\alpha_{k}\tau_{k}^{2-2\delta}<+\infty to be bounded, while with convexity we need k=0+τk22δ<+\sum_{k=0}^{+\infty}\tau_{k}^{2-2\delta}<+\infty to be bounded, which is hard to be guaranteed together with the other conditions given in the theorem.

V Simulation

In this section, we consider N=6N=6 agents collaborate to address an l22l_{2}^{2}-regularized Logistic Regression Problem [26], where

fi(θ)=16log(1+e(aiqiTθ))+μ12θTθ,\displaystyle f_{i}(\theta)=\frac{1}{6}\text{log}(1+e^{(-a_{i}q_{i}^{T}\theta)})+\frac{\mu}{12}\theta^{T}\theta, (34)

with μ>0\mu>0.

The constraint set is given by Ω={θ|θi1,i=1,,N}\Omega=\{\theta|\|\theta_{i}\|\leq 1,i=1,\cdots,N\}.

Let a1=a3=a5=1a_{1}=a_{3}=a_{5}=1, a2=a4=a6=1a_{2}=a_{4}=a_{6}=-1, μ=1\mu=1 and let qiq_{i} be

q1\displaystyle q_{1} =[0.462,0.798,0,0.335,0.163,0.102],\displaystyle=[0.462,0.798,0,0.335,0.163,0.102]^{\top},
q2\displaystyle q_{2} =[0.167,0.309,0.355,0.482,0.375,0.614],\displaystyle=[0.167,0.309,0.355,0.482,0.375,0.614]^{\top},
q3\displaystyle q_{3} =[0.155,0.664,0.021,0.507,0.316,0.422],\displaystyle=[0.155,0.664,0.021,0.507,0.316,0.422]^{\top},
q4\displaystyle q_{4} =[0.094,0.133,0.538,0.651,0.211,0.465],\displaystyle=[0.094,0.133,0.538,0.651,0.211,0.465]^{\top},
q5\displaystyle q_{5} =[0.568,0.58,0.055,0.025,0.110.57],\displaystyle=[0.568,0.58,0.055,0.025,0.110.57]^{\top},
q6\displaystyle q_{6} =[0.07,0.3,0.683,0.38,0.4930.225].\displaystyle=[0.07,0.3,0.683,0.38,0.4930.225]^{\top}. (35)

It can be verified that the optimization problem satisfies Assumptions 2 and 3.

The communication graph can be described in Fig. 1, where the weights are defined as: a1,2=a2,3=a2,5=a5,6=a5,4=1/3a_{1,2}=a_{2,3}=a_{2,5}=a_{5,6}=a_{5,4}=1/3 and a1,6=a3,4=2/3a_{1,6}=a_{3,4}=2/3. The graph satisfies Assumption 1.

Refer to caption
Figure 1: The communication graph.

The initial values xi(0)=0x_{i}(0)=0. Let (3)-(5) be the updating law with αk=10k+1\alpha_{k}=\frac{10}{k+1} and τk=10(k+1)0.4\tau_{k}=10(k+1)^{0.4}.

Let ξ\xi be a random variable with ξ=ω2wmin\xi=\omega-2w_{\min} where ω\omega satisfies Pareto distribution with tail index γ=2\gamma=2 and ωmin=1\omega_{\min}=1. The probability density function of ω\omega can be described as p(w)={0, if wwmin,2wmin2w3, if w>wmin.p(w)=\left\{\begin{array}[]{cc}0,&\text{ if }w\leq w_{\min},\\ \frac{2w_{\min}^{2}}{w^{3}},&\text{ if }w>w_{\min}.\end{array}\right. The expectation of ww is 2wmin{2w_{\min}} and the variance is infinity. Thus, E[ξ]=E[ω]2wmin=0E[\xi]=E[\omega]-2w_{\min}=0. Moreover, since p(ξ)={0, if ξwmin,2wmin2(ξ+2wmin)3, if ξ>wmin,p(\xi)=\left\{\begin{array}[]{cc}0,&\text{ if }\xi\leq-w_{\min},\\ \frac{2w_{\min}^{2}}{(\xi+2w_{\min})^{3}},&\text{ if }\xi>-w_{\min},\end{array}\right. we have for δ=32\delta=\frac{3}{2},

E[|ξ|δ]\displaystyle E[|\xi|^{\delta}] =+|ξ|δp(ξ)𝑑ξ\displaystyle=\int_{-\infty}^{+\infty}|\xi|^{\delta}p(\xi)d\xi
=wmin+|ξ|δ2wmin2(ξ+2wmin)3𝑑ξ\displaystyle=\int_{-w_{\min}}^{+\infty}|\xi|^{\delta}\frac{2w_{\min}^{2}}{(\xi+2w_{\min})^{3}}d\xi
wmin0|ξ|δ2wmin2wmin3𝑑ξ+01ξδ2wmin2(2wmin)3𝑑ξ\displaystyle\leq\int_{-w_{\min}}^{0}|\xi|^{\delta}\frac{2w_{\min}^{2}}{w_{\min}^{3}}d\xi+\int_{0}^{1}\xi^{\delta}\frac{2w_{\min}^{2}}{(2w_{\min})^{3}}d\xi
+1ξδ2wmin2ξ3𝑑ξ\displaystyle+\int_{1}^{\infty}\xi^{\delta}\frac{2w_{\min}^{2}}{\xi^{3}}d\xi
=2ωminδ(δ+1)+14ωmin(δ+1)+2wmin22δ\displaystyle=\frac{2\omega_{\min}^{\delta}}{(\delta+1)}+\frac{1}{4\omega_{\min}(\delta+1)}+\frac{2w_{\min}^{2}}{2-\delta}
4.9<3δ.\displaystyle\approx 4.9<3^{\delta}. (36)

Let ξi,k=ωi,k2\xi_{i,k}=\omega_{i,k}-2 where each element of ωi,k\omega_{i,k} satisfies Pareto distribution with tail index 2 and the minimum parameter 1. Based on the above analysis, E[ξi,k]=0E[\xi_{i,k}]=0 and E[ξi,kδ]=E[(ξi,k2)δ2]E[(6max|[ξi,k]j|2)δ2]<8δE[\|\xi_{i,k}\|^{\delta}]=E[(\|\xi_{i,k}\|^{2})^{\frac{\delta}{2}}]\leq E[(6\max|{[\xi_{i,k}]_{j}}|^{2})^{\frac{\delta}{2}}]<8^{\delta}. Thus, ξi,k\xi_{i,k} satisfies Assumptions 4 and 5.

Fig. 2 shows the evolution of i=16xi,kθ\sum_{i=1}^{6}\|x_{i,k}-\theta^{*}\|. As can be seen, the algorithm converges to the optimal solution.

Refer to caption
Figure 2: The evolution of i=16xi,kθ\sum_{i=1}^{6}\|x_{i,k}-\theta^{*}\|.

Simulation with Different Tail Index: It can be verified that noises with tail index γ>1\gamma>1 satisfy Assumptions 4 and 5. When γ>2\gamma>2, the variance is bounded. The evolution of i=16xi,kθ\sum_{i=1}^{6}\|x_{i,k}-\theta^{*}\| with tail index is shown in Fig. 3. It can be seen that a larger tail index can lead to faster convergence.

Refer to caption
Figure 3: The evolution of i=16xi,kθ\sum_{i=1}^{6}\|x_{i,k}-\theta^{*}\| for γ=1.5\gamma=1.5, 2 and 3.

Comparison with [5]: We compare our algorithm with the distributed stochastic subgradient projection algorithm proposed in [5] despite that the Patero random noise does not satisfy the bounded variance condition (Assumption 4 of [5]). When μ\mu is small, it is found that the performance of our algorithm is similar to that of [5]. Let μ=10\mu=10. The simulation result is shown in Fig. 4. The performance of our algorithm is better than that of [5].

Refer to caption
Figure 4: The evolution of i=16xi,kθ\sum_{i=1}^{6}\|x_{i,k}-\theta^{*}\| under our algorithm and the distributed stochastic subgradient projection algorithm proposed in [5].

VI Conclusions

In this paper, we considered the distributed optimization problem under the influence of heavy-tailed noises. Unlike previous studies on heavy-tailed noises, we proposed a distributed update law that only uses neighboring information and the network does not have a central node. Unlike previous studies on distributed stochastic optimization, this paper considers the assumption of heavy-tailed distributions instead of the commonly considered bounded variance noises, which is a broader assumption. The simulation results demonstrate the effectiveness of the algorithm. In future, we will investigate the algorithm design for distributed optimization problems with inequality constraints.

References

  • [1] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, ``A survey of distributed optimization,'' Annual Reviews in Control, vol. 47, pp. 278–305, 2019.
  • [2] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, ``Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning,'' in 2012 50th annual allerton conference on communication, control, and computing (allerton).   IEEE, 2012, pp. 1543–1550.
  • [3] P. Yi, Y. Hong, and F. Liu, ``Initialization-free distributed algorithms for optimal resource allocation with feasibility constraints and application to economic dispatch of power systems,'' Automatica, vol. 74, pp. 259–269, 2016.
  • [4] C. Sun, Z. Feng, and G. Hu, ``Time-varying optimization-based approach for distributed formation of uncertain euler–lagrange systems,'' IEEE Transactions on Cybernetics, vol. 52, no. 7, pp. 5984–5998, 2021.
  • [5] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, ``Distributed stochastic subgradient projection algorithms for convex optimization,'' Journal of optimization theory and applications, vol. 147, pp. 516–545, 2010.
  • [6] K. Srivastava and A. Nedic, ``Distributed asynchronous constrained stochastic optimization,'' IEEE journal of selected topics in signal processing, vol. 5, no. 4, pp. 772–790, 2011.
  • [7] Z. Wang, J. Zhang, T.-H. Chang, J. Li, and Z.-Q. Luo, ``Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems,'' IEEE Transactions on Signal Processing, vol. 69, pp. 4486–4501, 2021.
  • [8] K. Lu, H. Wang, H. Zhang, and L. Wang, ``Convergence in high probability of distributed stochastic gradient descent algorithms,'' IEEE Transactions on Automatic Control, 2023.
  • [9] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra, ``Why are adaptive methods good for attention models?'' Advances in Neural Information Processing Systems, vol. 33, pp. 15 383–15 393, 2020.
  • [10] H. Yang, P. Qiu, and J. Liu, ``Taming fat-tailed (“heavier-tailed” with potentially infinite variance) noise in federated learning,'' Advances in Neural Information Processing Systems, vol. 35, pp. 17 017–17 029, 2022.
  • [11] U. Simsekli, L. Sagun, and M. Gurbuzbalaban, ``A tail-index analysis of stochastic gradient noise in deep neural networks,'' in International Conference on Machine Learning.   PMLR, 2019, pp. 5827–5837.
  • [12] M. Gurbuzbalaban, U. Simsekli, and L. Zhu, ``The heavy-tail phenomenon in sgd,'' in International Conference on Machine Learning.   PMLR, 2021, pp. 3964–3975.
  • [13] H. Wang, M. Gurbuzbalaban, L. Zhu, U. Simsekli, and M. A. Erdogdu, ``Convergence rates of stochastic gradient descent under infinite noise variance,'' Advances in Neural Information Processing Systems, vol. 34, pp. 18 866–18 877, 2021.
  • [14] S. A. Alghunaim and A. H. Sayed, ``Distributed coupled multiagent stochastic optimization,'' IEEE Transactions on Automatic Control, vol. 65, no. 1, pp. 175–190, 2019.
  • [15] S. Pu and A. Nedić, ``Distributed stochastic gradient tracking methods,'' Mathematical Programming, vol. 187, pp. 409–457, 2021.
  • [16] D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, ``Optimal distributed stochastic mirror descent for strongly convex optimization,'' Automatica, vol. 90, pp. 196–203, 2018.
  • [17] E. Gorbunov, M. Danilova, and A. Gasnikov, ``Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,'' Advances in Neural Information Processing Systems, vol. 33, pp. 15 042–15 053, 2020.
  • [18] Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, ``High probability convergence of stochastic gradient methods,'' in International Conference on Machine Learning.   PMLR, 2023, pp. 21 884–21 914.
  • [19] A. Cutkosky and H. Mehta, ``High-probability bounds for non-convex stochastic optimization with heavy tails,'' Advances in Neural Information Processing Systems, vol. 34, pp. 4883–4895, 2021.
  • [20] M. Liu, Z. Zhuang, Y. Lei, and C. Liao, ``A communication-efficient distributed gradient clipping algorithm for training deep neural networks,'' Advances in Neural Information Processing Systems, vol. 35, pp. 26 204–26 217, 2022.
  • [21] S. Yu, D. Jakovetic, and S. Kar, ``Smoothed gradient clipping and error feedback for distributed optimization under heavy-tailed noise,'' arXiv preprint arXiv:2310.16920, 2023.
  • [22] E. Gorbunov, A. Sadiev, M. Danilova, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, ``High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise,'' arXiv preprint arXiv:2310.01860, 2023.
  • [23] S. S. Ram, A. Nedic, and V. V. Veeravalli, ``Distributed subgradient projection algorithm for convex optimization,'' in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2009, pp. 3653–3656.
  • [24] M. O. Sayin, N. D. Vanli, S. S. Kozat, and T. Başar, ``Stochastic subgradient algorithms for strongly convex optimization over distributed networks,'' IEEE Transactions on network science and engineering, vol. 4, no. 4, pp. 248–260, 2017.
  • [25] A. Sadiev, M. Danilova, E. Gorbunov, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, ``High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance,'' arXiv preprint arXiv:2302.00999, 2023.
  • [26] A. Y. Ng, ``Feature selection, l 1 vs. l 2 regularization, and rotational invariance,'' in Proceedings of the twenty-first international conference on Machine learning.   ACM, 2004, p. 78.
[Uncaptioned image]

Chao Sun received his B.Eng degree from University of Science and Technology of China in 2013. Then he obtained the Ph.D. degree from School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore in 2018. After graduation, he worked as a research fellow at Nanyang Technological University till May 2022. Currently, he is an associate professor at Beihang University, China.