
Resilient Distributed Optimization

Jingxuan Zhu         Yixuan Lin         Alvaro Velasquez         Ji Liu J. Zhu and Y. Lin are with the Department of Applied Mathematics and Statistics at Stony Brook University ({jingxuan.zhu,yixuan.lin.1}@stonybrook.edu). A. Velasquez is with the Department of Computer Science at University of Colorado Boulder ([email protected]). J. Liu is with the Department of Electrical and Computer Engineering at Stony Brook University ([email protected]).
Abstract

This paper considers a distributed optimization problem in the presence of Byzantine agents capable of introducing untrustworthy information into the communication network. A resilient distributed subgradient algorithm is proposed based on graph redundancy and objective redundancy. It is shown that the algorithm causes all non-Byzantine agents’ states to asymptotically converge to the same optimal point under appropriate assumptions. A partial convergence rate result is also provided.

1 Introduction

Distributed optimization has attracted considerable attention and achieved remarkable success in both theory and practice. The distributed convex optimization problem was first studied in [1], where a distributed subgradient algorithm was proposed. Since then, various distributed optimization algorithms have been devised and studied; see the survey papers [2, 3, 4]. Distributed optimization techniques are also widely applied to decentralized deep learning [5].

Information exchange between neighboring agents is necessary in a multi-agent network for distributed optimization. However, agents' states may be corrupted, and agents may not adhere to the designed algorithm, due to faulty processes or external attacks. An agent is called Byzantine if it updates its state in an arbitrary, unknown manner and can send conflicting values to different neighbors [6]. Such attacking agents may know global information about the network, act arbitrarily and strategically, and even be coordinated with one another. Consider a network of agents in which Byzantine agents exist. An ideal resilient algorithm is one that leads the non-Byzantine (or normal) agents to cooperatively solve the corresponding distributed optimization problem as if the Byzantine agents did not exist. Such a resilient algorithm is highly desirable for the safety and security of multi-agent systems, as faulty processes and external attacks are inevitable.

Resilient distributed optimization has recently received increasing attention, probably originating from the work of [7]. Almost all existing works cannot guarantee full resilience; what they can guarantee is that all normal agents' states converge to a bounded neighborhood of the desired optimal point whose bound is not controllable [8, 9, 10, 11, 12], or to an optimal point of an unspecified convex combination of all normal agents' objective functions [7, 13, 14], or to a convex combination of all normal agents' local optimal points [15]. The only exceptions are [16, 17, 18], in which the underlying communication graph is assumed to be a complete graph, namely, each agent is allowed to communicate with all other agents. All of [16, 17, 18] rely on the idea of "objective function redundancy". The idea has also been applied to the federated setting and achieved full resilience there [19, 20]. In the federated setting, a central coordinator agent is able to communicate with all worker agents, which is more or less equivalent to a complete graph in the distributed setting (sometimes called the decentralized setting). It is worth noting that [7, 13, 14, 15, 18] only consider the special one-dimensional optimization problem.

Resilient distributed optimization is also related to resilient federated optimization/learning in the coordinator-workers setting (e.g., [20, 21, 22]), which has attracted increasing attention recently. The key problem there is how the central coordinator aggregates the received information so as to eliminate or attenuate the effects of Byzantine worker agents. Various Byzantine-resilient information aggregation methods have been proposed for high-dimensional optimization/learning, focusing on stochastic gradient descent (SGD); notable among them are [23, 24, 25, 26, 27], just to name a few, and an overview of recent developments in this area can be found in [28]. It is doubtful whether these methods can be applied to achieve full resilience in the distributed setting.

From the preceding discussion, and to the best of our knowledge, a fully resilient distributed optimization algorithm for general non-complete communication graphs does not exist, even for one-dimensional optimization problems. This gap is precisely what we study in this paper. We consider a distributed convex optimization problem in the presence of Byzantine agents and propose a fully resilient distributed subgradient algorithm based on the ideas of objective redundancy (cf. Definition 1) and graph redundancy (cf. Definition 2). The algorithm is shown to cause all non-Byzantine agents' states to asymptotically converge to the same desired optimal point under appropriate assumptions. The proposed algorithm works in theory for multi-dimensional optimization, but is impractical for high-dimensional optimization, as will be explained and discussed in the concluding remarks.

This work is motivated by two recent ideas. The first is the quantified notion of objective function redundancy proposed in [29], where a couple of different definitions of objective redundancy are studied and based on which fully resilient distributed optimization algorithms have been crafted, either for a federated setting [29, 19, 20] or for a distributed setting over complete graphs [16, 17, 18]; such redundancy has been shown to be necessary for achieving full resilience in multi-agent optimization [19]. We borrow one notion from [29] and develop it further. It is worth emphasizing that the results in [16, 17, 18] rely on objective redundancy among non-Byzantine agents, whereas ours depend on objective redundancy among all agents. This subtle difference is important for equipping a multi-agent network with a certain level of redundancy at the network design stage, since which agents are non-Byzantine cannot be assumed a priori.

The second idea is so-called "Byzantine vector consensus" [30, 31], whose goal is, given a set of both Byzantine and non-Byzantine vectors, to pick a vector lying in the convex hull of the non-Byzantine vectors, based on Tverberg's theorem [32, 33]. The idea has very recently been improved in [34], and the improvement can be used to achieve resilient multi-dimensional consensus exponentially fast. Exponential consensus is critical in the presence of diminishing disturbance [35]. Prompted by this improved idea, we make use of a resilient vector-picking process simplified from [34, Algorithm 1]. There are other recent approaches appealing to the idea of the centerpoint [36, 37]. We expect that these approaches can also be applied to resilient optimization, provided that exponential consensus is guaranteed, e.g., in [37].

2 Problem Formulation

Consider a network consisting of $n$ agents, labeled $1$ through $n$ for the purpose of presentation. The agents are not aware of such global labeling, but can differentiate between their neighbors. The neighbor relations among the $n$ agents are characterized by a directed graph $\mathbb{G}=({\cal V},{\cal E})$ whose vertices correspond to agents and whose directed edges (or arcs) depict neighbor relations, where ${\cal V}=\{1,\ldots,n\}$ is the vertex set and ${\cal E}\subset{\cal V}\times{\cal V}$ is the directed edge set. (We use ${\cal A}\subset{\cal B}$ to denote that ${\cal A}$ is a subset of ${\cal B}$.) Specifically, agent $j$ is a neighbor of agent $i$ if $(j,i)\in{\cal E}$. Each agent can receive information from its neighbors, so the directions of edges represent the directions of information flow. We use ${\cal N}_i$ to denote the neighbor set of agent $i$, excluding $i$, i.e., ${\cal N}_i=\{j\in{\cal V}:(j,i)\in{\cal E}\}$.

Each agent $i\in{\cal V}$ has a "private" convex (not necessarily differentiable) objective function, $f_i:{\rm I\!R}^d\rightarrow{\rm I\!R}$, known only to agent $i$. There exist Byzantine agents in the network which are able to transmit arbitrary values to others and capable of sending conflicting values to different neighbors at any time. The set of Byzantine agents is denoted by ${\cal F}$ and the set of normal (non-Byzantine) agents is denoted by ${\cal H}$. Which agents are Byzantine is unknown to the normal agents. It is assumed that each agent may have at most $\beta$ Byzantine neighbors.

The goal of the normal agents is to cooperatively minimize the objective functions

$$f_{\cal H}(x)=\sum_{i\in{\cal H}}f_i(x)\;\;\;\;\;{\rm and}\;\;\;\;\;f(x)=\sum_{i\in{\cal V}}f_i(x).$$

We will show that minimizing the above two objective functions can be achieved simultaneously with appropriate redundancy in objective functions (cf. Definition 1 and Corollary 1). It is assumed that the set of optimal solutions to $f$, denoted by ${\cal X}^*$, is nonempty and bounded.

Since each $f_i$ is not necessarily differentiable, the gradient descent method may not be applicable. Instead, the subgradient method [38] can be applied. For a convex function $h:{\rm I\!R}^d\rightarrow{\rm I\!R}$, a vector $g\in{\rm I\!R}^d$ is called a subgradient of $h$ at point $x$ if

$$h(y)\geq h(x)+g^\top(y-x)\;\;{\rm for\ all}\;\;y\in{\rm I\!R}^d.\qquad(1)$$

Such a vector $g$ always exists and may not be unique. In the case when $h$ is differentiable at point $x$, the subgradient $g$ is unique and equals $\nabla h(x)$, the gradient of $h$ at $x$. Thus, the subgradient can be viewed as a generalization of the notion of the gradient. From (1) and the Cauchy-Schwarz inequality, $h(y)-h(x)\geq -G\|y-x\|$, where $G$ is an upper bound for the 2-norm of the subgradients of $h$ at both $x$ and $y$. We will use this fact without special mention in the sequel. Throughout this paper, we use $\|\cdot\|$ for the 2-norm.
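As a quick numeric sanity check, the following Python snippet (our own illustration, not part of the paper's development) verifies inequality (1) for the convex, non-differentiable function $h(x)=|x|$, using the subgradient $g={\rm sign}(x)$ (with $g=0$ chosen at $x=0$).

```python
# A minimal numeric check of the subgradient inequality (1) for
# h(x) = |x|; the function and sampling range are illustrative choices.
import numpy as np

def subgradient_abs(x):
    """One subgradient of h(x) = |x|; g = 0 is chosen at x = 0."""
    return np.sign(x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-5, 5, size=2)
    g = subgradient_abs(x)
    # Inequality (1): h(y) >= h(x) + g^T (y - x), here in one dimension
    assert abs(y) >= abs(x) + g * (y - x) - 1e-12
print("subgradient inequality (1) holds on all sampled pairs")
```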

The subgradient method was first proposed in [38] and the first distributed subgradient method was proposed in [1] for undirected graphs. Its extension to directed graphs has been studied in [39] and recently further analyzed in [40].

2.1 Redundancy

To make the resilient distributed optimization problem solvable, certain redundancy is necessary. We begin with objective redundancy.

Definition 1.

An $n$-agent network is called $k$-redundant, $k\in\{0,1,\ldots,n-1\}$, if for any subsets ${\cal S}_1,{\cal S}_2\subset{\cal V}$ with $|{\cal S}_1|=|{\cal S}_2|=n-k$, there holds (we use $|{\cal S}|$ to denote the cardinality of a set ${\cal S}$)

$$\operatorname*{arg\,min}_x\sum_{i\in{\cal S}_1}f_i(x)=\operatorname*{arg\,min}_x\sum_{i\in{\cal S}_2}f_i(x).$$

The above definition of objective redundancy originated in [29, Definition 2]. It has the following properties.

Lemma 1.

If an $n$-agent network is $k$-redundant, then for any subsets ${\cal S},{\cal L}\subset{\cal V}$ with $|{\cal S}|=n-k$ and $|{\cal L}|\geq n-k$, there holds

$$\operatorname*{arg\,min}_x\sum_{i\in{\cal S}}f_i(x)=\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x).$$

Proof of Lemma 1: Let ${\cal Z}=\operatorname*{arg\,min}_x\sum_{i\in{\cal S}}f_i(x)$ and ${\cal Q}=\{{\cal P}:{\cal P}\subset{\cal L},\;|{\cal P}|=n-k\}$. From Definition 1, $\operatorname*{arg\,min}_x\sum_{i\in{\cal P}}f_i(x)={\cal Z}$ for any ${\cal P}\in{\cal Q}$. For each $i\in{\cal L}$, let ${\cal Q}_i=\{{\cal P}:{\cal P}\subset{\cal L},\;|{\cal P}|=n-k,\;i\in{\cal P}\}$. It is easy to see that for each $i\in{\cal L}$ (we use $\binom{n}{k}$ to denote the number of $k$-combinations from a set of $n$ elements),

$$|{\cal Q}_i|=q\stackrel{\Delta}{=}\binom{|{\cal L}|-1}{n-k-1}.$$

Then,

$$\sum_{{\cal P}\in{\cal Q}}\sum_{i\in{\cal P}}f_i(x)=q\sum_{i\in{\cal L}}f_i(x).\qquad(2)$$

Pick any $z\in{\cal Z}$. From (2),

$$\min_x q\sum_{i\in{\cal L}}f_i(x)=\min_x\sum_{{\cal P}\in{\cal Q}}\sum_{i\in{\cal P}}f_i(x)\geq\sum_{{\cal P}\in{\cal Q}}\min_x\sum_{i\in{\cal P}}f_i(x)=\sum_{{\cal P}\in{\cal Q}}\sum_{i\in{\cal P}}f_i(z)=q\sum_{i\in{\cal L}}f_i(z),$$

which implies that $z\in\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x)$, and thus ${\cal Z}\subset\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x)$.

To prove the lemma, it is sufficient to prove that $\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x)\subset{\cal Z}$. Suppose, to the contrary, that there exists a $y$ such that $y\in\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x)$ and $y\notin{\cal Z}$. Since $y,z\in\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x)$, from (2),

$$\sum_{i\in{\cal L}}f_i(y)=\sum_{i\in{\cal L}}f_i(z)=\frac{1}{q}\sum_{{\cal P}\in{\cal Q}}\sum_{i\in{\cal P}}f_i(z)<\frac{1}{q}\sum_{{\cal P}\in{\cal Q}}\sum_{i\in{\cal P}}f_i(y)=\sum_{i\in{\cal L}}f_i(y),$$

which is impossible. Therefore, $\operatorname*{arg\,min}_x\sum_{i\in{\cal L}}f_i(x)\subset{\cal Z}$.

The following corollaries are immediate consequences of Lemma 1.

Corollary 1.

If an $n$-agent network is $k$-redundant, then for any subset ${\cal S}\subset{\cal V}$ with $|{\cal S}|\geq n-k$, there holds

$$\operatorname*{arg\,min}_x\sum_{i\in{\cal S}}f_i(x)={\cal X}^*.$$
Corollary 2.

If an $n$-agent network is $(k+1)$-redundant with $k\geq 0$, then it is $k$-redundant.
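To make Definition 1 concrete, the following Python sketch brute-forces the $k$-redundancy condition on a toy one-dimensional network; the piecewise-linear objectives, the grid, and the tolerance are all our own illustrative assumptions.

```python
# Brute-force check of Definition 1 (k-redundancy) on a toy 1-D example.
from itertools import combinations
import numpy as np

# f_i is zero on the interval [lo_i, hi_i] and grows linearly outside it,
# so argmin over any agent subset is the intersection of its intervals.
intervals = [(-1, 1), (-1, 1), (-1, 1), (-2, 2)]   # 4 hypothetical agents

def f(i, x):
    lo, hi = intervals[i]
    return np.maximum(0.0, np.maximum(lo - x, x - hi))

n, k = len(intervals), 1
grid = np.linspace(-3, 3, 6001)

def argmin_set(S):
    total = sum(f(i, grid) for i in S)
    return frozenset(np.round(grid[np.isclose(total, total.min())], 6))

# Definition 1: every (n-k)-subset must yield the same minimizer set.
minimizer_sets = {argmin_set(S) for S in combinations(range(n), n - k)}
print("1-redundant:", len(minimizer_sets) == 1)    # expect True
```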

We also need redundancy in graph connectivity.

A vertex $i$ in a directed graph $\mathbb{G}$ is called a root of $\mathbb{G}$ if for each other vertex $j$ of $\mathbb{G}$, there is a directed path from $i$ to $j$. Thus, $i$ is a root of $\mathbb{G}$ if it is the root of a directed spanning tree of $\mathbb{G}$. We will say that $\mathbb{G}$ is rooted at $i$ if $i$ is in fact a root. It is easy to see that a rooted graph $\mathbb{G}$ has a unique strongly connected component whose vertices are all roots of $\mathbb{G}$.

Definition 2.

An $(r,s)$-reduced graph of a directed graph $\mathbb{G}$ with $n$ vertices, where $r,s\geq 0$ and $r+s\leq n-1$, is a subgraph of $\mathbb{G}$ obtained by first picking any vertex subset ${\cal S}\subset{\cal V}$ with $|{\cal S}|=n-r$ and then removing from each vertex of the subgraph induced by ${\cal S}$, denoted $\mathbb{G}_{\cal S}$, arbitrary $s$ incoming edges in $\mathbb{G}_{\cal S}$. A directed graph $\mathbb{G}$ is called $(r,s)$-resilient if all its $(r,s)$-reduced graphs are rooted.

It is easy to see that if a directed graph is $(r_1,s_1)$-resilient, then for any nonnegative $r_2\leq r_1$ and $s_2\leq s_1$, the graph is also $(r_2,s_2)$-resilient.

In the case when $r=s=\beta$, an $(r,s)$-resilient graph is equivalent to the rooted "reduced graph" in [41], which was used to guarantee resilient one-dimensional consensus; see Definition 4 and Theorem 2 in [41]. Thus, the definition here can be viewed as a simple generalization of the rooted "reduced graph".
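Since Definition 2 is purely combinatorial, it can be checked directly on small graphs. The sketch below (an illustration under our own assumptions, feasible only for tiny graphs) enumerates all $(r,s)$-reduced graphs and tests each for rootedness; removing as many incoming edges as allowed is the worst case, because deleting edges can only destroy directed paths.

```python
# Brute-force check of (r,s)-resilience (Definition 2) for tiny digraphs.
from itertools import combinations, product

def is_rooted(vertices, edges):
    """True if some vertex can reach every other vertex (directed search)."""
    adj = {v: [] for v in vertices}
    for (u, w) in edges:
        adj[u].append(w)
    for root in vertices:
        seen, stack = {root}, [root]
        while stack:
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if seen == set(vertices):
            return True
    return False

def is_resilient(vertices, edges, r, s):
    n = len(vertices)
    for S in combinations(vertices, n - r):        # drop any r vertices
        sub = [(u, w) for (u, w) in edges if u in S and w in S]
        incoming = {v: [e for e in sub if e[1] == v] for v in S}
        # all ways of removing min(s, indegree) incoming edges per vertex
        choices = [combinations(incoming[v], min(s, len(incoming[v])))
                   for v in S]
        for removal in product(*choices):
            removed = {e for group in removal for e in group}
            if not is_rooted(S, [e for e in sub if e not in removed]):
                return False
    return True

V = [1, 2, 3, 4]
E = [(u, w) for u in V for w in V if u != w]       # 4-vertex complete digraph
print(is_resilient(V, E, 1, 1))                    # expect True
```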

Definition 2 implicitly requires that each vertex of an $(r,s)$-resilient graph have at least $r+s$ neighbors. More can be said.

Lemma 2.

If a directed graph is $(r,s)$-resilient, then each of its vertices has at least $r+s+1$ neighbors.

Proof of Lemma 2: Suppose, to the contrary, that there exists a vertex $i$ in $\mathbb{G}$ with $|{\cal N}_i|\leq r+s$. If $|{\cal N}_i|<r+s$, it is easy to see that $\mathbb{G}$ does not satisfy Definition 2. We thus consider the case when $|{\cal N}_i|=r+s$. Let ${\cal R}$ be a set of arbitrary $r$ neighbors of vertex $i$, and let ${\cal S}={\cal V}\setminus{\cal R}$, where ${\cal V}$ is the vertex set of $\mathbb{G}$ (we use ${\cal A}\setminus{\cal B}$ to denote the set of elements that are in ${\cal A}$ but not in ${\cal B}$). It is clear that $|{\cal S}|=n-r$, and in the subgraph induced by ${\cal S}$, $\mathbb{G}_{\cal S}$, vertex $i$ has exactly $s$ neighbors. Then, after vertex $i$ removes its $s$ incoming edges in $\mathbb{G}_{\cal S}$, and each out-neighbor of vertex $i$ in $\mathbb{G}_{\cal S}$ (a vertex $i$ is called an out-neighbor of vertex $j$ if the latter is a neighbor of the former), if any, removes its incoming edge from $i$, vertex $i$ becomes isolated. But this is impossible for an $(r,s)$-resilient graph.

3 Algorithm

To describe our algorithm, we need the following notation.

Let ${\cal A}_i$ denote the collection of all those subsets of ${\cal N}_i$ whose cardinality is $(d+1)\beta+1$. It is obvious that the number of all such subsets is

$$a_i\stackrel{\Delta}{=}\binom{|{\cal N}_i|}{(d+1)\beta+1},\qquad(3)$$

and we label them ${\cal A}_{i1},\ldots,{\cal A}_{ia_i}$. For each $j\in\{1,\ldots,a_i\}$, let ${\cal B}_{ij}$ denote the collection of all those subsets of ${\cal A}_{ij}$ whose cardinality is $d\beta+1$. For any subset of agents ${\cal S}\subset{\cal V}$, let ${\cal C}_{\cal S}(t)$ denote the convex hull of all $x_i(t)$, $i\in{\cal S}$.
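The sketch below makes this notation concrete for a hypothetical neighbor set with $d=1$ and $\beta=1$; it simply enumerates the subsets defining ${\cal A}_i$ and ${\cal B}_{ij}$ and confirms the count $a_i$ in (3).

```python
# Enumerating the subset collections A_i and B_ij from the text above.
from itertools import combinations
from math import comb

d, beta = 1, 1
N_i = [2, 3, 4, 5]                   # hypothetical neighbor set of agent i

A_i = list(combinations(N_i, (d + 1) * beta + 1))      # subsets of size 3
assert len(A_i) == comb(len(N_i), (d + 1) * beta + 1)  # a_i in (3)

B = {A_ij: list(combinations(A_ij, d * beta + 1)) for A_ij in A_i}
print(f"a_i = {len(A_i)}, |B_ij| = {len(B[A_i[0]])}")  # a_i = 4, |B_ij| = 3
```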

Algorithm: At each discrete time $t\in\{0,1,2,\ldots\}$, each agent $i$ first picks an arbitrary point

$$y_{ij}(t)\in\bigcap_{{\cal S}\in{\cal B}_{ij}}{\cal C}_{\cal S}(t)\qquad(4)$$

for each $j\in\{1,\ldots,a_i\}$, and then updates its state by setting

$$v_i(t)=\frac{1}{1+a_i}\Big(x_i(t)+\sum_{j=1}^{a_i}y_{ij}(t)\Big),\qquad(5)$$
$$x_i(t+1)=v_i(t)-\alpha(t)g_i(v_i(t)),\qquad(6)$$

where $\alpha(t)$ is the stepsize and $g_i(\cdot)$ is a subgradient of $f_i(\cdot)$. $\Box$

In the special one-dimensional case with $d=1$, it is not hard to check that steps (4) and (5) simplify to the resilient scalar consensus algorithm in [41], which is essentially equivalent to the trimmed-mean method and has been improved in [42].
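The following Python sketch simulates this scalar special case under our own illustrative assumptions: a simplified trimmed-mean rule that always discards the $\beta$ largest and $\beta$ smallest received values (a slight variant of the removal rule in [41]), objectives $f_i(x)=|x|$ with a common minimizer at $0$, and one Byzantine neighbor broadcasting a wild signal.

```python
# A minimal sketch of the d = 1 case: trimmed-mean consensus step (4)-(5)
# followed by the subgradient step (6).  All modeling choices here
# (weights, objectives, attack, horizon) are illustrative assumptions.
import numpy as np

def trimmed_mean_update(x_i, received, beta):
    kept = np.sort(received)[beta:len(received) - beta]   # drop extremes
    return np.mean(np.concatenate(([x_i], kept)))         # v_i(t) analogue

def subgradient(x):                                       # f_i(x) = |x|
    return np.sign(x)

beta, T = 1, 2000
x = np.array([3.0, -2.0, 1.0, -4.0])        # normal agents' initial states
for t in range(T):
    alpha = 1.0 / (t + 1)                   # satisfies Assumption 3
    byz = 100.0 * np.sin(t)                 # Byzantine agent's broadcast
    new_x = np.empty_like(x)
    for i in range(len(x)):
        received = np.concatenate((np.delete(x, i), [byz]))
        v = trimmed_mean_update(x[i], received, beta)
        new_x[i] = v - alpha * subgradient(v)
    x = new_x
print(x)   # states should cluster near the common minimizer 0
```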

The convergence and correctness of the proposed algorithm rely on the following assumptions.

Assumption 1.

${\cal X}^*$ has a nonempty interior.

It is easy to see that Assumption 1 implies that $f(x)$ is differentiable at any $x\in{\rm int}({\cal X}^*)$, where ${\rm int}(\cdot)$ denotes the interior of a set. More can be said.

Lemma 3.

Under Assumption 1, if the $n$-agent network is $k$-redundant with $k\geq 1$, then $f_i(x)$ is differentiable at $x$ with $\nabla f_i(x)=0$ for all $i\in{\cal V}$ and $x\in{\rm int}({\cal X}^*)$.

Proof of Lemma 3: Since ${\rm int}({\cal X}^*)$ is nonempty, for any $x^*\in{\rm int}({\cal X}^*)$, there exist a positive number $r$ and an open ball in ${\rm int}({\cal X}^*)$ centered at $x^*$ with radius $r$, denoted ${\cal B}(x^*,r)\subset{\rm int}({\cal X}^*)$. Let $h_j$ be a vector in ${\rm I\!R}^d$ whose $j$th entry is $\epsilon$ and whose remaining entries all equal zero. Since $x^*+h_j\in{\cal B}(x^*,r)\subset{\rm int}({\cal X}^*)$ for sufficiently small $\epsilon$,

$$\frac{\partial}{\partial x_j}f(x^*)=\lim_{\epsilon\rightarrow 0}\frac{f(x^*+h_j)-f(x^*)}{\epsilon}=\lim_{\epsilon\rightarrow 0}\frac{\sum_{i\in{\cal V}}(f_i(x^*+h_j)-f_i(x^*))}{\epsilon}=0.\qquad(7)$$

For each $i\in{\cal V}$, since $f_i(x)$ is convex, both $\lim_{\epsilon\rightarrow 0^-}\frac{f_i(x^*+h_j)-f_i(x^*)}{\epsilon}$ and $\lim_{\epsilon\rightarrow 0^+}\frac{f_i(x^*+h_j)-f_i(x^*)}{\epsilon}$ exist, and $\lim_{\epsilon\rightarrow 0^-}\frac{f_i(x^*+h_j)-f_i(x^*)}{\epsilon}\leq\lim_{\epsilon\rightarrow 0^+}\frac{f_i(x^*+h_j)-f_i(x^*)}{\epsilon}$ for all $j\in\{1,\ldots,d\}$ [43, Theorem 24.1]. It follows that

$$\sum_{k\in{\cal V}}\lim_{\epsilon\rightarrow 0^-}\frac{f_k(x^*+h_j)-f_k(x^*)}{\epsilon}\leq\sum_{k\in{\cal V}}\lim_{\epsilon\rightarrow 0^+}\frac{f_k(x^*+h_j)-f_k(x^*)}{\epsilon}.$$

Note that from (7),

$$\sum_{k\in{\cal V}}\lim_{\epsilon\rightarrow 0^-}\frac{f_k(x^*+h_j)-f_k(x^*)}{\epsilon}=\sum_{k\in{\cal V}}\lim_{\epsilon\rightarrow 0^+}\frac{f_k(x^*+h_j)-f_k(x^*)}{\epsilon}.$$

Thus, $\lim_{\epsilon\rightarrow 0^-}\frac{f_i(x^*+h_j)-f_i(x^*)}{\epsilon}=\lim_{\epsilon\rightarrow 0^+}\frac{f_i(x^*+h_j)-f_i(x^*)}{\epsilon}$, i.e., $\partial f_i(x^*)/\partial x_j$ exists for all $i\in{\cal V}$ and $j\in\{1,\ldots,d\}$.

To proceed, let $h_i(x)=\sum_{k\in{\cal V},\,k\neq i}f_k(x)$ for all $i\in{\cal V}$. From Corollary 1, $\operatorname*{arg\,min}_x h_i(x)={\cal X}^*$. Since $x^*\in{\rm int}({\cal X}^*)$, both $f(x)$ and $h_i(x)$ are differentiable at $x^*$, implying that $\frac{\partial f}{\partial x_j}(x^*)=\frac{\partial h_i}{\partial x_j}(x^*)=0$ for all $i\in{\cal V}$ and $j\in\{1,\ldots,d\}$. Since $f_i(x)=f(x)-h_i(x)$, $\frac{\partial f_i}{\partial x_j}(x^*)=0$ for all $i\in{\cal V}$ and $j\in\{1,\ldots,d\}$. Note that this holds for all $x^*\in{\rm int}({\cal X}^*)$. From [44, Section 8.4.2], $f_i(x)$ is differentiable at $x^*$ with $\nabla f_i(x^*)=0$ for all $i\in{\cal V}$.

Lemma 3 has the following important implication.

Corollary 3.

Under Assumption 1, if the $n$-agent network is $k$-redundant with $k\geq 1$, then for all $i\in{\cal V}$,

$${\cal X}^*\subset\operatorname*{arg\,min}_x f_i(x).$$

Corollary 3 immediately implies that

$$\bigcap_{i\in{\cal V}}\operatorname*{arg\,min}_x f_i(x)={\cal X}^*.$$

Proof of Corollary 3: Suppose, to the contrary, that there exist $x^*\in{\cal X}^*$ and $i\in{\cal V}$ such that $x^*\notin\operatorname*{arg\,min}_x f_i(x)$. Pick a $z\in{\rm int}({\cal X}^*)$. From Lemma 3, $z\in\operatorname*{arg\,min}_x f_i(x)$. It is then clear that $f_i(x^*)>f_i(z)$. Let $h_i(x)=\sum_{k\in{\cal V},\,k\neq i}f_k(x)$. From Corollary 1, $\operatorname*{arg\,min}_x h_i(x)={\cal X}^*$, and thus $h_i(x^*)=h_i(z)$. It follows that $f(x^*)=f_i(x^*)+h_i(x^*)>f_i(z)+h_i(z)=f(z)$, which contradicts the fact that $x^*\in{\cal X}^*$.

Assumption 2.

The subgradients of all $f_i$, $i\in{\cal V}$, are uniformly bounded, i.e., there exists a positive number $D$ such that $\|g_i(x)\|\leq D$ for all $i\in{\cal V}$ and $x\in{\rm I\!R}^d$.

Assumption 3.

The step-size sequence $\{\alpha(t)\}$ is positive, non-increasing, and satisfies $\sum_{t=0}^\infty\alpha(t)=\infty$ and $\sum_{t=0}^\infty\alpha^2(t)<\infty$.

The above two assumptions are standard for subgradient methods; for example, $\alpha(t)=1/(t+1)$ satisfies Assumption 3.

To state our main results, we need the following concepts. For a directed graph $\mathbb{G}$, we use ${\cal R}_{r,s}(\mathbb{G})$ to denote the set of all $(r,s)$-reduced graphs of $\mathbb{G}$. For a rooted graph $\mathbb{G}$, we use $\kappa(\mathbb{G})$ to denote the size of the unique strongly connected component whose vertices are all roots of $\mathbb{G}$; in other words, $\kappa(\mathbb{G})$ equals the number of roots of $\mathbb{G}$. For any $(r,s)$-resilient graph $\mathbb{G}$, let

$$\kappa_{r,s}(\mathbb{G})\stackrel{\Delta}{=}\min_{\mathbb{H}\in{\cal R}_{r,s}(\mathbb{G})}\kappa(\mathbb{H}),$$

which is well defined and denotes the smallest possible number of roots in any $(r,s)$-reduced graph of $\mathbb{G}$.

Theorem 1.

Under Assumptions 1–3, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient and the $n$-agent network is $(n-\kappa_{\beta,d\beta}(\mathbb{G}))$-redundant, then all $x_i(t)$, $i\in{\cal H}$, will asymptotically reach a consensus at a point in ${\cal X}^*$.

The following example shows that $(n-\kappa_{\beta,d\beta}(\mathbb{G}))$-redundancy is necessary. For simplicity, set $d=1$. Consider a 4-agent network whose neighbor graph is the 4-vertex complete graph $\mathbb{C}$, which is $(1,1)$-resilient. Suppose that agent 4 is Byzantine and the other three are normal. It is possible that, with a carefully crafted attack strategy of the Byzantine agent, the three normal agents update their states in a manner mathematically equivalent to the case in which their neighbor graph is the 3-vertex $(1,1)$-reduced graph with arc set $\{(1,2),(1,3),(2,3)\}$, which is rooted (cf. Lemma 6). In this case, since vertex 1 is the only root and agent 1 does not have any neighbor, agent 1 follows the single-agent subgradient algorithm, and thus its state will converge to a minimum point of $f_1(x)$, denoted $x^*$. Since all normal agents will eventually reach a consensus (cf. Lemma 9), the states of agents 2 and 3 will also converge to $x^*$. To guarantee that the resilient distributed optimization problem is solvable in this case, there must hold $x^*\in\operatorname*{arg\,min}_x f_i(x)$ for $i\in\{1,2,3\}$, which implies that the network needs to be 3-redundant. It is easy to see that $\kappa_{1,1}(\mathbb{C})=1$, and thus $n-\kappa_{1,1}(\mathbb{C})=3$.

Theorem 1 shows that the proposed algorithm achieves full resilience. We next partially characterize the convergence rate of the algorithm.

Theorem 2.

Under Assumptions 1 and 2, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient, the $n$-agent network is $(n-\kappa_{\beta,d\beta}(\mathbb{G}))$-redundant, and $\alpha(t)=1/\sqrt{T}$ for $T>0$ steps, i.e., $t\in\{0,1,\ldots,T-1\}$, then there exist a subset of normal agents ${\cal S}\subset{\cal H}$ with $|{\cal S}|\geq\kappa_{\beta,d\beta}(\mathbb{G})$, a positive constant $C\geq 1$, and a time subsequence ${\cal T}\subset\{0,1,\ldots,T-1\}$ with $|{\cal T}|\geq T/C$ such that for any $j\in{\cal H}$ and $x^*\in{\cal X}^*$,

$$\sum_{i\in{\cal S}}f_i\bigg(\frac{\sum_{t\in{\cal T}}x_j(t)}{|{\cal T}|}\bigg)-\sum_{i\in{\cal S}}f_i(x^*)\leq O\Big(\frac{1}{\sqrt{T}}\Big).\qquad(8)$$

The existing distributed optimization literature (without Byzantine agents) typically characterizes convergence rates by bounding the difference between $\sum_{i\in{\cal V}}f_i(\frac{1}{T}\sum_{t=0}^{T-1}x_i(t))$ and $\sum_{i\in{\cal V}}f_i(x^*)$. The above theorem can be viewed as a "partial" convergence rate result in that it only concerns a subset ${\cal S}$ of normal agents and a subsequence ${\cal T}$ of a finite time horizon. Notwithstanding this, it is worth noting that $\min\sum_{i\in{\cal S}}f_i(x)$ is equivalent to $\min\sum_{i\in{\cal V}}f_i(x)$ in the present setting with Byzantine agents (cf. Corollary 1) and that $|{\cal T}|=O(T)$. Therefore, the theorem still to some extent evaluates the convergence rate of the resilient distributed subgradient algorithm under consideration. It is well known that the optimal convergence rate of subgradient methods for convex optimization is $O(1/\sqrt{t})$. Whether $f_{\cal H}(\cdot)=\sum_{i\in{\cal H}}f_i(\cdot)$ converges at this optimal rate has so far eluded us.

Theorem 2 is an immediate consequence of the following proposition.

Proposition 1.

Under Assumptions 1 and 2, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient, the $n$-agent network is $(n-\kappa_{\beta,d\beta}(\mathbb{G}))$-redundant, and $\alpha(t)=1/\sqrt{T}$ for $t\in\{0,1,\ldots,T-1\}$, then for any integer $b\in[\kappa_{\beta,d\beta}(\mathbb{G}),\,n-|{\cal F}|]$, there exist a subset of normal agents ${\cal S}\subset{\cal H}$ with $\kappa_{\beta,d\beta}(\mathbb{G})\leq|{\cal S}|\leq b$ and a time subsequence ${\cal T}\subset\{0,1,\ldots,T-1\}$ with $|{\cal T}|\geq T/\sum_{k=\kappa_{\beta,d\beta}(\mathbb{G})}^{b}\binom{n-|{\cal F}|}{k}$ such that (8) holds for any $j\in{\cal H}$ and $x^*\in{\cal X}^*$.

The proposition further quantifies a trade-off between the number of normal agents in ${\cal S}$ and the length of the time subsequence ${\cal T}$: roughly speaking, the fewer normal agents involved in (8), the denser the time subsequence can be. In the special case when $b=\kappa_{\beta,d\beta}(\mathbb{G})$, the proposition simplifies to the following corollary.

Corollary 4.

Under Assumptions 1 and 2, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient, the $n$-agent network is $(n-\kappa_{\beta,d\beta}(\mathbb{G}))$-redundant, and $\alpha(t)=1/\sqrt{T}$ for $t\in\{0,1,\ldots,T-1\}$, then there exist a subset of normal agents ${\cal S}\subset{\cal H}$ with $|{\cal S}|=\kappa_{\beta,d\beta}(\mathbb{G})$ and a time subsequence ${\cal T}\subset\{0,1,\ldots,T-1\}$ with $|{\cal T}|\geq T/\binom{n-|{\cal F}|}{\kappa_{\beta,d\beta}(\mathbb{G})}$ such that (8) holds for any $j\in{\cal H}$ and $x^*\in{\cal X}^*$.

4 Analysis

This section provides the analysis of the algorithm and proofs of the main results.

4.1 Algorithm Feasibility

From Lemma 2, a $(\beta,d\beta)$-resilient $\mathbb{G}$ guarantees that each agent has at least $(d+1)\beta+1$ neighbors at each time $t$. Thus, each ${\cal A}_{ij}$ in the algorithm is always nonempty.

We next show that $y_{ij}(t)$ in (4) always exists. To this end, we need the following well-known theorem of Helly.

Lemma 4.

(Helly's Theorem [45]) Let $\{{\cal C}_1,{\cal C}_2,\ldots,{\cal C}_m\}$ be a finite collection of convex sets in ${\rm I\!R}^d$ with $m\geq d+1$. If the intersection of every $d+1$ of these sets is nonempty, then the whole collection has a nonempty intersection, i.e., $\bigcap_{i=1}^m{\cal C}_i\neq\emptyset$.
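For $d=1$, Helly's theorem says that pairwise-intersecting intervals on the line share a common point; the tiny check below (with arbitrary example intervals of our choosing) illustrates this.

```python
# Helly's theorem in d = 1: pairwise intersection of intervals implies a
# common point; the intervals below are arbitrary examples.
intervals = [(0, 5), (2, 8), (4, 9), (3, 6)]

pairwise = all(max(a, c) <= min(b, e)
               for i, (a, b) in enumerate(intervals)
               for (c, e) in intervals[i + 1:])
lo = max(a for a, _ in intervals)    # largest left endpoint
hi = min(b for _, b in intervals)    # smallest right endpoint
print(pairwise, (lo, hi))            # pairwise True implies lo <= hi
```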

Lemma 5.

For any $i\in{\cal V}$ and $j\in\{1,\ldots,a_i\}$, there holds $\bigcap_{{\cal S}\in{\cal B}_{ij}}{\cal C}_{\cal S}(t)\neq\emptyset$.

Proof of Lemma 5: From Lemma 4, it is sufficient to prove that the intersection of every $d+1$ sets in ${\cal B}_{ij}$ is nonempty. Pick any ${\cal P}\subset{\cal B}_{ij}$ with $|{\cal P}|=d+1$. For each ${\cal S}\in{\cal P}$, from the definition of ${\cal B}_{ij}$, ${\cal C}_{\cal S}(t)$ is the convex hull of $d\beta+1$ distinct points. Since $|{\cal P}|=d+1$, the intersection $\bigcap_{{\cal S}\in{\cal P}}{\cal C}_{\cal S}(t)$ involves in total $(d+1)(d\beta+1)$ points (with repetition). Recall that all these points are agents' states at time $t$. Thus, each of them can be written as $x_h(t)$ with $h$ being an index in ${\cal V}$. From the definition of ${\cal B}_{ij}$, it is easy to see that $h\in{\cal A}_{ij}$. Note that

$$(d+1)(d\beta+1)-d|{\cal A}_{ij}|=(d+1)(d\beta+1)-d((d+1)\beta+1)=1.$$

Then, the pigeonhole principle (or Dirichlet's box principle) guarantees that among these $(d+1)(d\beta+1)$ indices, at least one index in ${\cal A}_{ij}$, say $k$, repeats at least $d+1$ times. Since for each ${\cal S}\in{\cal P}$ there is no repetition of indices in forming ${\cal C}_{\cal S}(t)$, there must exist $d+1$ different sets ${\cal S}_1,\ldots,{\cal S}_{d+1}\in{\cal P}$ for which $x_k(t)$ is involved in the formation of ${\cal C}_{{\cal S}_p}(t)$, and thus $x_k(t)\in{\cal C}_{{\cal S}_p}(t)$, for all $p\in\{1,\ldots,d+1\}$. Since $|{\cal P}|=d+1$, ${\cal P}=\{{\cal S}_1,\ldots,{\cal S}_{d+1}\}$. It follows that $x_k(t)\in\bigcap_{p=1}^{d+1}{\cal C}_{{\cal S}_p}(t)=\bigcap_{{\cal S}\in{\cal P}}{\cal C}_{\cal S}(t)$.

4.2 Dynamics of Normal Agents

To analyze the performance of the algorithm, it is important to understand the dynamics of normal agents and “decouple” the influence of Byzantine agents. The following lemma serves this purpose.

Lemma 6.

$v_i(t)$ in (5) can be expressed as a convex combination of $x_i(t)$ and $x_k(t)$, $k\in{\cal N}_i\cap{\cal H}$:

$$v_i(t)=w_{ii}(t)x_i(t)+\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)x_k(t),$$

where $w_{ii}(t)$ and $w_{ik}(t)$ are nonnegative numbers satisfying $w_{ii}(t)+\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)=1$, and there exists a positive constant $\eta$ such that for all $i\in{\cal H}$ and $t$, $w_{ii}(t)\geq\eta$ and, among all $w_{ik}(t)$, $k\in{\cal N}_i\cap{\cal H}$, at least $|{\cal N}_i\cap{\cal H}|-d\beta$ of them are bounded below by $\eta$.

Since each agent is assumed to have at most $\beta$ Byzantine neighbors, the lemma immediately implies that at least $|{\cal N}_i|-(d+1)\beta$ among all $w_{ik}(t)$, $k\in{\cal N}_i\cap{\cal H}$, are bounded below by $\eta$, which has been reported in [34, Theorem 1]. In the special case when $d=1$, the lemma simplifies to Claim 2 in [46], which directly implies Proposition 5.1 in [13]. Thus, the lemma can be regarded as a generalization of [34, Theorem 1], [46, Claim 2], and [13, Proposition 5.1].

Proof of Lemma 6: Recall that there are at most $\beta$ Byzantine neighbors in ${\cal A}_{ij}$, whose cardinality is $|{\cal A}_{ij}|=(d+1)\beta+1$, and that ${\cal B}_{ij}$ is the collection of those subsets of ${\cal A}_{ij}$ whose cardinality is $d\beta+1=|{\cal A}_{ij}|-\beta$. Thus, there must exist an index set ${\cal P}\in{\cal B}_{ij}$ such that ${\cal P}\subset{\cal N}_i\cap{\cal H}$. For any such index set ${\cal P}$, since $y_{ij}(t)\in\bigcap_{{\cal S}\in{\cal B}_{ij}}{\cal C}_{\cal S}(t)$, it follows that $y_{ij}(t)\in{\cal C}_{\cal P}(t)$. Let ${\cal Q}_{ij}=\{{\cal P}:{\cal P}\in{\cal B}_{ij},\;{\cal P}\subset{\cal N}_i\cap{\cal H}\}$ for each $i\in{\cal V}$ and $j\in\{1,\ldots,a_i\}$. From the preceding, ${\cal Q}_{ij}$ is always nonempty, and for every ${\cal P}\in{\cal Q}_{ij}$,

$$y_{ij}(t)=\sum_{p\in{\cal P}}c_p({\cal P})x_p(t),\qquad(9)$$

where the $c_p({\cal P})$ are nonnegative weights satisfying $\sum_{p\in{\cal P}}c_p({\cal P})=1$. It is clear that at least one of the $c_p({\cal P})$ is positive and at least $1/|{\cal P}|=1/(d\beta+1)$.

It is easy to see that $y_{ij}(t)$ can be rewritten as

$$y_{ij}(t)=\frac{1}{|{\cal Q}_{ij}|}\sum_{{\cal P}\in{\cal Q}_{ij}}\sum_{p\in{\cal P}}c_p({\cal P})x_p(t).\qquad(10)$$

Our reason for rewriting $y_{ij}(t)$ in this way will become clear shortly. Since each ${\cal P}\subset{\cal N}_i\cap{\cal H}$, the above expression is a convex combination of all $x_k(t)$, $k\in{\cal N}_i\cap{\cal H}$, with some weights allowed to be zero. Specifically, defining ${\cal S}_{ijk}=\{{\cal P}:{\cal P}\in{\cal Q}_{ij},\;k\in{\cal P}\}$ for each ${\cal Q}_{ij}$ and $k\in{\cal N}_i\cap{\cal H}$,

$$y_{ij}(t)=\sum_{k\in{\cal N}_i\cap{\cal H}}\Big(\sum_{{\cal P}\in{\cal S}_{ijk}}\frac{c_k({\cal P})}{|{\cal Q}_{ij}|}\Big)x_k(t).$$

Then,

$$\begin{aligned}
v_i(t)&=\frac{1}{1+a_i}\Big(x_i(t)+\sum_{j=1}^{a_i}y_{ij}(t)\Big)\\
&=\frac{1}{1+a_i}\Big(x_i(t)+\sum_{j=1}^{a_i}\sum_{k\in{\cal N}_i\cap{\cal H}}\sum_{{\cal P}\in{\cal S}_{ijk}}\frac{c_k({\cal P})}{|{\cal Q}_{ij}|}x_k(t)\Big)\\
&=\frac{x_i(t)}{1+a_i}+\sum_{k\in{\cal N}_i\cap{\cal H}}\Big(\sum_{j=1}^{a_i}\sum_{{\cal P}\in{\cal S}_{ijk}}\frac{c_k({\cal P})}{(1+a_i)|{\cal Q}_{ij}|}\Big)x_k(t)\\
&=w_{ii}(t)x_i(t)+\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)x_k(t),
\end{aligned}$$

in which

$$w_{ii}(t)\stackrel{\Delta}{=}\frac{1}{1+a_i},\;\;\;\;\;w_{ik}(t)\stackrel{\Delta}{=}\sum_{j=1}^{a_i}\sum_{{\cal P}\in{\cal S}_{ijk}}\frac{c_k({\cal P})}{(1+a_i)|{\cal Q}_{ij}|},\;\;\;k\in{\cal N}_i\cap{\cal H}.\qquad(11)$$

It is clear that $w_{ii}(t)>0$ for all $i$ and $t$. Since

$$\frac{1}{1+a_i}\sum_{j=1}^{a_i}y_{ij}(t)=\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)x_k(t),$$

from (10),

$$\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)x_k(t)=\sum_{j=1}^{a_i}\sum_{{\cal P}\in{\cal Q}_{ij}}\sum_{p\in{\cal P}}\frac{c_p({\cal P})x_p(t)}{(1+a_i)|{\cal Q}_{ij}|}.\qquad(12)$$

Note that $\bigcup_{j=1}^{a_i}\bigcup_{{\cal P}\in{\cal B}_{ij}}{\cal P}$ is the collection of all subsets of ${\cal N}_i$ with cardinality $d\beta+1$. It follows that $\bigcup_{j=1}^{a_i}\bigcup_{{\cal P}\in{\cal Q}_{ij}}{\cal P}$ is the collection of all subsets of ${\cal N}_i\cap{\cal H}$ with cardinality $d\beta+1$. Thus, (12) implies that $\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)x_k(t)$ is a convex combination of all possible expressions of $y_{ij}(t)$ in terms of (9). Since both $1+a_i$ and $|{\cal Q}_{ij}|$ are positive, as long as a $c_p({\cal P})$ in (12) is positive, $w_{ip}(t)$ is positive with $p\in{\cal N}_i\cap{\cal H}$.

We claim that the number of indices $p\in{\cal N}_i\cap{\cal H}$ such that $c_p({\cal P})\geq 1/|{\cal P}|$ for at least one ${\cal P}\in{\cal Q}_{ij}$ is at least $|{\cal N}_i\cap{\cal H}|-d\beta$. To prove this, suppose that, to the contrary, the number of such indices is no larger than $|{\cal N}_i\cap{\cal H}|-d\beta-1$; that is, there are at least $d\beta+1$ indices in ${\cal N}_i\cap{\cal H}$ whose corresponding $c_p({\cal P})<1/|{\cal P}|$ for all ${\cal P}\in{\cal Q}_{ij}$, $j\in\{1,\ldots,a_i\}$. Pick exactly $d\beta+1$ of them and form an index set ${\cal P}_0$. It is clear that ${\cal P}_0\in{\cal Q}_{ij}$ for some $j\in\{1,\ldots,a_i\}$. So all $c_p({\cal P}_0)$, $p\in{\cal P}_0$, must be included in the right-hand side of (12) and be strictly less than $1/|{\cal P}_0|$. But this is impossible because (9) asserts that $\sum_{p\in{\cal P}_0}c_p({\cal P}_0)=1$.

From the preceding, there exist at least $|{\cal N}_i\cap{\cal H}|-d\beta$ indices $k\in{\cal N}_i\cap{\cal H}$ for which $c_k({\cal P})$ in (11) is no less than $1/|{\cal P}|$ for at least one ${\cal P}\in{\cal S}_{ijk}$. For each such $k$, from (11),

$$w_{ik}(t)\geq\frac{1}{|{\cal P}|(1+a_i)|{\cal Q}_{ij}|}\geq\frac{1}{(d\beta+1)(1+a_i)\binom{(d+1)\beta+1}{d\beta+1}}.$$

Set

$$\eta\stackrel{\Delta}{=}\min_{i\in{\cal V}}\;\frac{1}{(d\beta+1)(1+a_i)\binom{(d+1)\beta+1}{d\beta+1}}.\qquad(13)$$

Since $a_i\leq\binom{n}{(d+1)\beta+1}$ due to (3), $\eta$ must be positive and independent of $i$ and $t$. The statement of the lemma then immediately follows.

From (6) and Lemma 6, the updates of all normal agents can be written as

$$v_i(t)=w_{ii}(t)x_i(t)+\sum_{k\in{\cal N}_i\cap{\cal H}}w_{ik}(t)x_k(t),\qquad(14)$$
$$x_i(t+1)=v_i(t)-\alpha(t)g_i(v_i(t)),\qquad(15)$$

for all $i\in{\cal H}$.

Without loss of generality, we label all normal agents from $1$ to $|{\cal H}|$ in the sequel.

Let

$$x(t)\stackrel{\Delta}{=}\begin{bmatrix}x_1^\top(t)\\ \vdots\\ x_{|{\cal H}|}^\top(t)\end{bmatrix},\;\;\;\;\;v(t)\stackrel{\Delta}{=}\begin{bmatrix}v_1^\top(t)\\ \vdots\\ v_{|{\cal H}|}^\top(t)\end{bmatrix},\;\;\;\;\;g(v(t))\stackrel{\Delta}{=}\begin{bmatrix}g_1^\top(v_1(t))\\ \vdots\\ g_{|{\cal H}|}^\top(v_{|{\cal H}|}(t))\end{bmatrix}.$$

Then, the updates in (14) and (15) can be written in the form of state equations:

$$v(t)=W(t)x(t),\qquad(16)$$
$$x(t+1)=v(t)-\alpha(t)g(v(t)),\qquad(17)$$

where each $W(t)$ is an $|{\cal H}|\times|{\cal H}|$ stochastic matrix with positive diagonal entries. (A square nonnegative matrix is called a stochastic matrix if its row sums all equal one.)

4.3 Consensus

We first study the infinite product of the stochastic matrices $W(t)$.

The graph of an $n\times n$ matrix $M$, denoted $\gamma(M)$, is a directed graph with $n$ vertices and a directed edge from vertex $i$ to vertex $j$ whenever the $ji$th entry of the matrix is nonzero.

Lemma 7.

If $\mathbb{G}$ is $(\beta,d\beta)$-resilient, the graph of each $W(t)$ in (16) has a rooted spanning subgraph, and all the diagonal entries and those off-diagonal entries of $W(t)$ corresponding to the rooted spanning subgraph are uniformly bounded below by the positive number $\eta$ given in (13).

Proof of Lemma 7: From Lemma 6, for all $i\in{\cal H}$, $w_{ii}(t)$ and at least $|{\cal N}_i\cap{\cal H}|-d\beta$ among all $w_{ik}(t)$, $k\in{\cal N}_i\cap{\cal H}$, are positive and uniformly bounded below by $\eta$ given in (13). The expression "at least $|{\cal N}_i\cap{\cal H}|-d\beta$ among all $w_{ik}(t)$" implies that the graph of $W(t)$ contains a spanning subgraph obtained by removing from each vertex of the subgraph of $\mathbb{G}$ induced by ${\cal H}$, $\mathbb{G}_{\cal H}$, at most $d\beta$ unspecified incoming edges in $\mathbb{G}_{\cal H}$, which is an $(|{\cal F}|,d\beta)$-reduced graph of $\mathbb{G}$. Since $|{\cal F}|\leq\beta$, $\mathbb{G}$ must be $(|{\cal F}|,d\beta)$-resilient. Thus, any $(|{\cal F}|,d\beta)$-reduced graph of $\mathbb{G}$ is rooted, and so is the graph of $W(t)$.

For any infinite sequence of stochastic matrices with the property in Lemma 7, the following result on their products holds.

Lemma 8.

Let $S_1,S_2,\ldots$ be an infinite sequence of $n\times n$ stochastic matrices, each of whose graphs has a rooted spanning subgraph. If all the diagonal entries and those off-diagonal entries of $S_1,S_2,\ldots$ corresponding to the rooted spanning subgraphs are uniformly bounded below by a positive number $p$, then the product $S_k\cdots S_2S_1$ converges to a rank-one matrix of the form $\mathbf{1}v^\top$ exponentially fast as $k\rightarrow\infty$, where $v$ is a column vector. (We use $\mathbf{1}$ to denote the vector whose entries all equal $1$, with dimension to be understood from context.)

To prove the lemma, we need the concept of the "composition" of directed graphs. The composition of two directed graphs $\mathbb{G}_p$, $\mathbb{G}_q$ with the same vertex set, denoted by $\mathbb{G}_q\circ\mathbb{G}_p$, is the directed graph with the same vertex set and arc set defined so that $(i,j)$ is an arc in the composition whenever there is a vertex $k$ such that $(i,k)$ is an arc in $\mathbb{G}_p$ and $(k,j)$ is an arc in $\mathbb{G}_q$. Since this composition is an associative binary operation, the definition extends unambiguously to any finite sequence of directed graphs with the same vertex set. Composition and matrix multiplication are closely related. In particular, the graph of the product of two nonnegative matrices $M_1,M_2\in{\rm I\!R}^{n\times n}$ is equal to the composition of the graphs of the two matrices comprising the product. In other words, $\gamma(M_2M_1)=\gamma(M_2)\circ\gamma(M_1)$. If we focus exclusively on graphs with self-arcs at all vertices, the definition of composition implies that the arcs of both $\mathbb{G}_p$ and $\mathbb{G}_q$ are arcs of $\mathbb{G}_q\circ\mathbb{G}_p$; the converse is false.

To proceed, for any $n\times n$ nonnegative matrix $S$, let

$$\mu(S)\stackrel{\Delta}{=}\max_{i,j}\Big(1-\sum_{k=1}^n\min\{s_{ik},s_{jk}\}\Big).$$

It is easy to see that if $S$ is a substochastic matrix (a square nonnegative matrix whose row sums are all equal to or less than one), then $\mu(S)\in[0,1]$. In the case when $S$ is a stochastic matrix, $\mu(S)$ is called the coefficient of ergodicity [47, page 137]. A stochastic matrix $S$ is a scrambling matrix if and only if $\mu(S)<1$ [48]; the graph of such a matrix is sometimes called "neighbor-shared" [49]. It is natural to call a vertex $i$ a neighbor of vertex $j$ in a directed graph $\mathbb{G}$ if $(i,j)$ is an arc in $\mathbb{G}$. A directed graph $\mathbb{G}$ is called neighbor-shared if every two distinct vertices share a common neighbor. The composition of any set of $n-1$ rooted graphs with $n$ vertices and self-arcs at all vertices is neighbor-shared [50, Proposition 8]. It is easy to check that for any nonnegative square matrix $S$, $\mu(S)<1$ if and only if its graph is neighbor-shared. Moreover, for any two nonnegative square matrices $S_1$ and $S_2$ with $S_1\geq S_2$ (for any two real matrices $A$ and $B$ of the same size, we write $A\geq B$ if $a_{ij}\geq b_{ij}$ for all $i$ and $j$), $\mu(S_1)\leq\mu(S_2)$.
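The quantity $\mu(\cdot)$ is easy to evaluate numerically. The sketch below (with an example stochastic matrix of our own choosing) computes $\mu(S)$; a value below one certifies that $S$ is scrambling.

```python
# Computing the coefficient of ergodicity mu(S) defined above.
import numpy as np

def mu(S):
    n = S.shape[0]
    return max(1.0 - np.minimum(S[i], S[j]).sum()
               for i in range(n) for j in range(n))

S = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
print(mu(S))   # 0.5 < 1: every two rows share a positive column (scrambling)
```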

Proof of Lemma 8: Since the graph of each $S_t$ is rooted with self-arcs, by Proposition 8 in [50], the graph of the product of any subsequence of $n-1$ consecutive matrices $S_t$ is neighbor-shared, which implies that the product is a scrambling matrix. Thus, letting $V_t=\prod_{\tau=t}^{t+n-2}S_\tau$ for each $t$, each $V_t$ is scrambling and its graph $\gamma(V_t)$ is neighbor-shared. The graph of each $S_t$ has a rooted spanning subgraph with self-arcs whose corresponding entries in $S_t$ are bounded below by the positive number $p$. It follows that the graph of each $V_t$ has a neighbor-shared spanning subgraph $\mathbb{S}$ with self-arcs whose corresponding entries in $V_t$ are bounded below by $p^{n-1}$. Let $U_t$ be the $n\times n$ matrix whose $ij$th entry equals the $ij$th entry of $V_t$ if $(j,i)$ is an arc in $\mathbb{S}$ and zero otherwise. Then, each $U_t$ is a substochastic matrix whose graph is neighbor-shared. Since all positive entries of $U_t$ are bounded below by $p^{n-1}$, $\mu(U_t)\leq 1-p^{n-1}$. Since $V_t\geq U_t$, $\mu(V_t)\leq\mu(U_t)\leq 1-p^{n-1}$. With this uniform upper bound on all $\mu(V_t)$, the lemma is thus an immediate consequence of Lemma 3 in [48].

An infinite sequence of stochastic matrices $\{S(t)\}$ is called ergodic if $\lim_{t\rightarrow\infty}S(t)\cdots S(\tau+1)S(\tau)=\mathbf{1}v^\top(\tau)$ for all $\tau$, where each $v(\tau)$ is a stochastic vector. (A vector is called a stochastic vector if its entries are nonnegative and sum to one.) From Lemmas 7 and 8, the sequence of stochastic matrices $\{W(t)\}$ is ergodic, and any infinite product of the $W(t)$ matrices converges to a rank-one matrix exponentially fast. Using the same argument as in the proof of Lemma 8,

$$\mu\big(W(t+n-2)\cdots W(t+1)W(t)\big)\leq 1-\eta^{n-1}$$

for all $t$, where $\eta$ is given in (13). Following the same argument as on pages 610–611 of [49], the product $W(t)\cdots W(\tau+1)W(\tau)$ converges to a rank-one matrix as $t\rightarrow\infty$ exponentially fast, at a rate no slower than

$$\lambda\stackrel{\Delta}{=}\left(1-\eta^{n-1}\right)^{\frac{1}{n-1}}.\qquad(18)$$
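As a numeric instance of the bounds (13) and (18), assume (purely for illustration) $d=1$, $\beta=1$, $n=5$, and every agent having $|{\cal N}_i|=4$ neighbors, so that all $a_i$ coincide:

```python
# Evaluating eta in (13) and the rate lambda in (18) for illustrative
# parameters d = 1, beta = 1, n = 5, |N_i| = 4 (so a_i = C(4, 3) = 4).
from math import comb

d, beta, n = 1, 1, 5
a_i = comb(4, (d + 1) * beta + 1)                       # (3)
eta = 1 / ((d * beta + 1) * (1 + a_i)
           * comb((d + 1) * beta + 1, d * beta + 1))    # (13): 1/30
lam = (1 - eta ** (n - 1)) ** (1 / (n - 1))             # (18): close to 1
print(eta, lam)
```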

Lemmas 7 and 8 also have the following important implication.

Proposition 2.

Under Assumptions 2 and 3, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient, all the normal agents will asymptotically reach a consensus.

To prove the proposition, we need the following concept.

Definition 3.

Let $\{S(t)\}$ be a sequence of stochastic matrices. A sequence of stochastic vectors $\{\pi(t)\}$ is an absolute probability sequence for $\{S(t)\}$ if $\pi^\top(t)=\pi^\top(t+1)S(t)$ for all $t\geq 0$.

This definition was first introduced by Kolmogorov [51]. It was shown by Blackwell [52] that every sequence of stochastic matrices has an absolute probability sequence. In general, a sequence of stochastic matrices may have more than one absolute probability sequence; when the sequence of stochastic matrices is ergodic, it has a unique absolute probability sequence [53, Lemma 1]. It is easy to see that when $S(t)$ is a fixed irreducible stochastic matrix $S$, $\pi(t)$ is simply the normalized left eigenvector of $S$ for eigenvalue one, and when $\{S(t)\}$ is an ergodic sequence of doubly stochastic matrices (a square nonnegative matrix is called doubly stochastic if its row sums and column sums all equal one), $\pi(t)=(1/n)\mathbf{1}$.

From the preceding, the sequence of stochastic matrices $\{W(t)\}$ in (16) is ergodic. Thus, $\{W(t)\}$ has a unique absolute probability sequence $\{\pi(t)\}$. From Lemma 1 in [53],

$$\lim_{t\rightarrow\infty}W(t)\cdots W(\tau+1)W(\tau)=\mathbf{1}\pi^\top(\tau).\qquad(19)$$

Let $\Phi(t,\tau)\stackrel{\Delta}{=}W(t)\cdots W(\tau)$ with $t\geq\tau$. Then, there exists a positive constant $c$ such that for all $i,j\in{\cal H}$ and $t\geq\tau\geq 0$,

$$\big|[\Phi(t,\tau)]_{ij}-\pi_j(\tau)\big|\leq c\lambda^{t-\tau},\qquad(20)$$

where $[\cdot]_{ij}$ denotes the $ij$th entry of a matrix and $\lambda$ is given in (18). Using the same argument as in the proof of Lemma 2 in [39], one can take $c=2$.

To proceed, define

$$y(t)\stackrel{\Delta}{=}\pi^\top(t)x(t)=\sum_{i=1}^{|{\cal H}|}\pi_i(t)x_i(t).$$

From (15),

$$\begin{aligned}
y(t+1)&=\sum_{i=1}^{|{\cal H}|}\pi_i(t+1)x_i(t+1)\\
&=\sum_{i=1}^{|{\cal H}|}\pi_i(t+1)v_i(t)-\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_i(t+1)g_i(v_i(t))\\
&=\pi^\top(t+1)W(t)x(t)-\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_i(t+1)g_i(v_i(t))\\
&=y(t)-\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_i(t+1)g_i(v_i(t)).\qquad(21)
\end{aligned}$$
Lemma 9.

Under Assumptions 2 and 3, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient, then $\lim_{t\rightarrow\infty}(x_i(t)-y(t))=0$ for all $i\in{\cal H}$.

Proof of Lemma 9: For all $t>s$,

$$x_i(t+1)=\sum_{j=1}^{|{\cal H}|}[\Phi(t,s)]_{ij}x_j(s)-\sum_{r=s}^{t-1}\bigg(\sum_{j=1}^{|{\cal H}|}[\Phi(t,r+1)]_{ij}\alpha(r)g_j(v_j(r))\bigg)-\alpha(t)g_i(v_i(t)).$$

From (21), for all $t>s$,

$$y(t+1)=y(s)-\sum_{r=s}^{t-1}\bigg(\alpha(r)\sum_{j=1}^{|{\cal H}|}\pi_j(r+1)g_j(v_j(r))\bigg)-\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_i(t+1)g_i(v_i(t)).$$

Set $s=0$. Then, using (20) and Assumption 2, for $t>0$,

$$\begin{aligned}
\|x_i(t)-y(t)\|\leq\;&\sum_{j=1}^{|{\cal H}|}\big|[\Phi(t-1,0)]_{ij}-\pi_j(0)\big|\,\|x_j(0)\|\\
&+\sum_{r=0}^{t-2}\sum_{j=1}^{|{\cal H}|}\big|[\Phi(t-1,r+1)]_{ij}-\pi_j(r+1)\big|\,\alpha(r)\|g_j(v_j(r))\|\\
&+\alpha(t-1)\|g_i(v_i(t-1))\|+\alpha(t-1)\sum_{i=1}^{|{\cal H}|}\pi_i(t)\|g_i(v_i(t-1))\|\\
\leq\;&2\lambda^t\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+2D|{\cal H}|\sum_{r=0}^{t-2}\lambda^{t-r-2}\alpha(r)+2D\alpha(t-1)\qquad(22)\\
\leq\;&2\lambda^t\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+2D|{\cal H}|\bigg(\sum_{r=0}^{\lfloor\frac{t}{2}\rfloor-1}\lambda^{t-r-2}\alpha(r)+\sum_{r=\lceil\frac{t}{2}\rceil-1}^{t-2}\lambda^{t-r-2}\alpha(r)\bigg)+2D\alpha(t-1)\\
\leq\;&2\lambda^t\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+\frac{2D|{\cal H}|}{1-\lambda}\Big(\lambda^{\lceil\frac{t}{2}\rceil-1}\alpha(0)+\alpha\big(\lceil\tfrac{t}{2}\rceil-1\big)\Big)+2D\alpha(t-1),\qquad(23)
\end{aligned}$$

where $\lfloor\cdot\rfloor$ and $\lceil\cdot\rceil$ denote the floor and ceiling functions, respectively. It follows that $\lim_{t\rightarrow\infty}\|x_i(t)-y(t)\|=0$.

The above proof essentially follows the proof of Lemma 8(a) in [54], generalizing the straight average $y(t)=\frac{1}{|{\cal H}|}\sum_{i=1}^{|{\cal H}|}x_i(t)$ to the time-varying weighted average $y(t)=\sum_{i=1}^{|{\cal H}|}\pi_i(t)x_i(t)$. It can also be proved using the idea of "input-output consensus stability" based on a suitably defined semi-norm; see Corollary 1 in [35].

Proposition 2 is an immediate consequence of Lemma 9. More can be said.

4.4 Convergence

Lemma 10.

Under Assumptions 2 and 3, if $\mathbb{G}$ is $(\beta,d\beta)$-resilient, then $\sum_{t=0}^\infty\alpha(t)\|x_i(t)-y(t)\|<\infty$ for all $i\in{\cal H}$.

Proof of Lemma 10: From (22), for any $t>0$,

$$\alpha(t)\|x_i(t)-y(t)\|\leq 2\alpha(t)\lambda^t\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+2D|{\cal H}|\sum_{r=0}^{t-2}\lambda^{t-r-2}\alpha(t)\alpha(r)+2D\alpha(t)\alpha(t-1).$$

Since $\alpha(t)\lambda^t\leq\alpha^2(t)+\lambda^{2t}$ and $2\alpha(t)\alpha(r)\leq\alpha^2(t)+\alpha^2(r)$ for any $t$ and $r$, for any $t>0$,

$$\begin{aligned}
\alpha(t)\|x_i(t)-y(t)\|\leq\;&2\alpha^2(t)\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+2\lambda^{2t}\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|\\
&+2D|{\cal H}|\alpha^2(t)\sum_{r=0}^{t-2}\lambda^{t-r-2}+2D|{\cal H}|\sum_{r=0}^{t-2}\lambda^{t-r-2}\alpha^2(r)+D\big(\alpha^2(t)+\alpha^2(t-1)\big)\\
\leq\;&2\alpha^2(t)\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+2\lambda^{2t}\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|\\
&+\Big(\frac{2D|{\cal H}|}{1-\lambda}\Big)\alpha^2(t)+2D|{\cal H}|\sum_{r=0}^{t-2}\lambda^{t-r-2}\alpha^2(r)+D\big(\alpha^2(t)+\alpha^2(t-1)\big).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\sum_{t=0}^\infty\alpha(t)\|x_i(t)-y(t)\|\leq\;&2\bigg(\sum_{t=1}^\infty\alpha^2(t)\bigg)\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|+2\bigg(\sum_{t=1}^\infty\lambda^{2t}\bigg)\sum_{j=1}^{|{\cal H}|}\|x_j(0)\|\\
&+\Big(\frac{2D|{\cal H}|}{1-\lambda}\Big)\sum_{t=1}^\infty\alpha^2(t)+2D|{\cal H}|\sum_{t=1}^\infty\sum_{r=0}^{t-2}\lambda^{t-r-2}\alpha^2(r)+2D\sum_{t=0}^\infty\alpha^2(t)\\
&+\alpha(0)\|x_i(0)-y(0)\|\\
<\;&\infty,
\end{aligned}$$

which completes the proof.  

It is known that, under some "standard" assumptions, if the graphs of an ergodic sequence of stochastic matrices are "uniformly strongly connected", then all the entries of its unique absolute probability sequence are uniformly bounded below by a positive constant [55, Theorem 4.8]. This is not the case here, as the sequence of graphs of the $W(t)$ matrices may not be "uniformly strongly connected".

Proposition 3.

If \mathbb{G} is (\beta,d\beta)-resilient, then for any t\geq 0, there exists a subset {\cal S}(t)\subset{\cal H} with |{\cal S}(t)|\geq\kappa_{\beta,d\beta}(\mathbb{G}) for which all \pi_{i}(t), i\in{\cal S}(t), t\geq 0, are bounded below by a positive number.

To prove the proposition, we need the following concept and results.

We say that a directed graph \mathbb{G} is strongly rooted at vertex i if each other vertex of \mathbb{G} is reachable from vertex i along a directed path of length 1; that is, \mathbb{G} is strongly rooted at i if i is a neighbor of every other vertex in the graph. A directed graph is called strongly rooted if it has at least one vertex at which it is strongly rooted. For a square nonnegative matrix, its graph is strongly rooted at i if and only if its ith column is strictly positive. Moreover, for any n-1 directed graphs with n vertices and self-arcs which are all rooted at the same vertex i, their composition is strongly rooted at i [50, Proposition 3].
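Both notions are mechanical to check from a nonnegative matrix. The helper functions below are illustrative (the names are ours), using the column-positivity characterization just stated; here an entry A[j, i] > 0 encodes an arc from vertex i to vertex j, matching the convention that w_{ij}>0 means j is a neighbor of i.

```python
import numpy as np

def is_strongly_rooted_at(A, i):
    """The graph of the nonnegative matrix A is strongly rooted at i iff
    the i-th column of A is strictly positive (an arc from i reaches
    every vertex in one step)."""
    return bool(np.all(A[:, i] > 0))

def roots(A):
    """Roots of the graph of A: vertices from which every vertex is
    reachable. Entry (j, i) of (I + A)^(n-1) is positive iff there is a
    directed walk from i to j of length at most n - 1."""
    n = A.shape[0]
    R = np.linalg.matrix_power(np.eye(n) + (A > 0), n - 1)
    return [i for i in range(n) if np.all(R[:, i] > 0)]
```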

Proof of Proposition 3: From Lemma 7, the graph of each W(t) is rooted. It is clear that the number of roots in the graph of each W(t) is at least \kappa_{\beta,d\beta}(\mathbb{G}); that is, \kappa(\gamma(W(t)))\geq\kappa_{\beta,d\beta}(\mathbb{G}) for all time t. Let

l\stackrel{\Delta}{=}(|{\cal H}|-\kappa_{\beta,d\beta}(\mathbb{G})+1)(|{\cal H}|-2)+1. (24)

We claim that for any finite sequence of W(t) matrices of length l, there exists a subset {\cal S}\subset{\cal H}, depending on the sequence, with |{\cal S}|\geq\kappa_{\beta,d\beta}(\mathbb{G}) such that each i\in{\cal S} is a root of the graphs of at least |{\cal H}|-1 of the W(t) matrices. To prove the claim, suppose, to the contrary, that such a subset does not exist; that is, at most \kappa_{\beta,d\beta}(\mathbb{G})-1 vertices in {\cal H} are roots of the graphs of at least |{\cal H}|-1 of the W(t) matrices, and each remaining vertex is a root of the graphs of at most |{\cal H}|-2 of them. Then, the total number of roots, counted over all the graphs of the length-l sequence of W(t) matrices, is no larger than (\kappa_{\beta,d\beta}(\mathbb{G})-1)l+(|{\cal H}|-\kappa_{\beta,d\beta}(\mathbb{G})+1)(|{\cal H}|-2). Meanwhile, since \kappa(\gamma(W(t)))\geq\kappa_{\beta,d\beta}(\mathbb{G}) for all time t, this total number of roots is at least \kappa_{\beta,d\beta}(\mathbb{G})l, which implies that (\kappa_{\beta,d\beta}(\mathbb{G})-1)l+(|{\cal H}|-\kappa_{\beta,d\beta}(\mathbb{G})+1)(|{\cal H}|-2)\geq\kappa_{\beta,d\beta}(\mathbb{G})l, and thus l\leq(|{\cal H}|-\kappa_{\beta,d\beta}(\mathbb{G})+1)(|{\cal H}|-2). But this contradicts (24). Therefore, the claim is true.

The above claim immediately implies that for any time t\geq 0 and the corresponding l-length stochastic matrix sequence starting at time t, \{W(t),W(t+1),\ldots,W(t+l-1)\}, there exists a subset {\cal S}(t)\subset{\cal H} with |{\cal S}(t)|\geq\kappa_{\beta,d\beta}(\mathbb{G}) such that each i\in{\cal S}(t) is a root of at least |{\cal H}|-1 graphs among \gamma(W(t)),\gamma(W(t+1)),\ldots,\gamma(W(t+l-1)). From Lemma 7, the graph of each W(t) is rooted with self-arcs. Using two facts about directed graphs with self-arcs, namely that the arcs of each graph in a graph sequence are arcs of their composition, and that the composition of any |{\cal H}|-1 graphs with |{\cal H}| vertices all rooted at the same vertex i is strongly rooted at i [50, Proposition 3], it follows that for any time t\geq 0, the graph of W(t+l-1)\cdots W(t+1)W(t) is strongly rooted at each vertex i\in{\cal S}(t). Then, each product W(t+l-1)\cdots W(t+1)W(t) has |{\cal S}(t)|\geq\kappa_{\beta,d\beta}(\mathbb{G}) strictly positive columns whose entries are uniformly bounded below by the positive number \eta^{l}, where \eta is defined in (13). Since each W(t) is a stochastic matrix, it is easy to see that for any m\geq l, each product W(t+m-1)\cdots W(t+1)W(t) has |{\cal S}(t)|\geq\kappa_{\beta,d\beta}(\mathbb{G}) strictly positive columns whose entries are uniformly bounded below by \eta^{l}. Then, the statement of the proposition follows from (19).
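The composition step of this proof can be sanity-checked numerically. The sketch below builds a toy family of stochastic matrices, each of whose graphs is a directed path with self-arcs and hence is rooted only at vertex 0, with every positive entry at least \eta (both are assumptions of the example, not of the proof); the product of |{\cal H}|-1 of them then has a strictly positive 0th column bounded below by \eta^{|{\cal H}|-1}.

```python
import numpy as np

n, eta = 5, 0.2
def rooted_at_zero():
    """Row-stochastic matrix whose graph is the directed path
    0 -> 1 -> ... -> n-1 plus self-arcs: rooted only at vertex 0,
    with every positive entry at least eta."""
    W = np.eye(n)
    for j in range(1, n):
        W[j, j - 1] = eta
        W[j, j] = 1.0 - eta
    return W

P = np.eye(n)
for _ in range(n - 1):                     # compose n - 1 rooted graphs
    P = rooted_at_zero() @ P
print(P[:, 0])                             # strictly positive 0th column
print(np.all(P[:, 0] >= eta ** (n - 1)))   # lower bound, as in the proof
```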

We are now in a position to prove Theorem 1 and Proposition 1.

Proof of Theorem 1: From (15) and (1), for any z\in{\cal X}^{*},

\displaystyle\|x_{i}(t+1)-z\|^{2}=\|v_{i}(t)-z-\alpha(t)g_{i}(v_{i}(t))\|^{2}
\displaystyle\leq\|v_{i}(t)-z\|^{2}+\alpha^{2}(t)\|g_{i}(v_{i}(t))\|^{2}-2\alpha(t)(f_{i}(v_{i}(t))-f_{i}(z)).

Since each W(t) is a stochastic matrix and the squared 2-norm is convex, \|v_{i}(t)-z\|^{2}\leq\sum_{j=1}^{|{\cal H}|}w_{ij}(t)\|x_{j}(t)-z\|^{2}. Then, with Assumption 2 and the fact that \pi_{j}(t)=\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)w_{ij}(t) for all j\in{\cal H} and t,

\displaystyle\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|x_{i}(t+1)-z\|^{2}
\displaystyle\leq\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\sum_{j=1}^{|{\cal H}|}w_{ij}(t)\|x_{j}(t)-z\|^{2}+\alpha^{2}(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|g_{i}(v_{i}(t))\|^{2}
\displaystyle-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(v_{i}(t))-f_{i}(z))
\displaystyle=\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}+\alpha^{2}(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|g_{i}(v_{i}(t))\|^{2}-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(v_{i}(t))-f_{i}(z))
\displaystyle\leq\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}+\alpha^{2}(t)D^{2}-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(v_{i}(t))-f_{i}(y(t)))
\displaystyle-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(y(t))-f_{i}(z)). (25)

Note that

|f_{i}(v_{i}(t))-f_{i}(y(t))|\leq D\|v_{i}(t)-y(t)\|\leq D\sum_{j=1}^{|{\cal H}|}w_{ij}(t)\|x_{j}(t)-y(t)\|,

which implies that

\displaystyle\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)|f_{i}(v_{i}(t))-f_{i}(y(t))|\;\leq\;D\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\sum_{j=1}^{|{\cal H}|}w_{ij}(t)\|x_{j}(t)-y(t)\|
\displaystyle=\;D\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-y(t)\|
\displaystyle\leq\;D\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|.

From (25),

\displaystyle\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|x_{i}(t+1)-z\|^{2}\;\leq\;\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}+\alpha^{2}(t)D^{2}+2\alpha(t)D\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\| (26)
\displaystyle-\;2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(y(t))-f_{i}(z)).

Thus,

\displaystyle 2\sum_{t=0}^{\infty}\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(y(t))-f_{i}(z))\;\leq\;\sum_{i=1}^{|{\cal H}|}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}\sum_{t=0}^{\infty}\alpha^{2}(t)
\displaystyle+\;2D\sum_{t=0}^{\infty}\alpha(t)\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|
\displaystyle<\;\infty

because of Assumption 3 and Lemma 10.

From Corollary 3, z\in\operatorname*{arg\,min}_{x}f_{i}(x) for all i\in{\cal H}. Then, f_{i}(y(t))-f_{i}(z)\geq 0 for all i\in{\cal H} and t\geq 0. From Proposition 3 and its proof, for any t\geq 0, there exists a subset {\cal S}(t)\subset{\cal H} with |{\cal S}(t)|\geq\kappa_{\beta,d\beta}(\mathbb{G}) for which all \pi_{i}(t), i\in{\cal S}(t), t\geq 0, are uniformly bounded below by \eta^{l}>0. It follows that

\eta^{l}\sum_{t=0}^{\infty}\alpha(t)\sum_{i\in{\cal S}(t)}(f_{i}(y(t))-f_{i}(z))\leq\sum_{t=0}^{\infty}\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(y(t))-f_{i}(z))<\infty. (27)

Since \sum_{t=0}^{\infty}\alpha(t)=\infty and f_{i}(y(t))-f_{i}(z)\geq 0 for all i\in{\cal H} and t\geq 0,

\liminf_{t\rightarrow\infty}\sum_{i\in{\cal S}(t)}\big{(}f_{i}(y(t))-f_{i}(z)\big{)}=0,

which implies that (here {\rm dist}(x,{\cal S})\stackrel{\Delta}{=}\inf\{\|x-y\|:y\in{\cal S}\} denotes the Euclidean distance between a point x and a set {\cal S} in {\rm I\!R}^{d})

\liminf_{t\rightarrow\infty}\;{\rm dist}\Big{(}y(t),\;\operatorname*{arg\,min}_{x}\sum_{i\in{\cal S}(t)}f_{i}(x)\Big{)}=0.

From Corollary 1, \operatorname*{arg\,min}_{x}\sum_{i\in{\cal S}(t)}f_{i}(x)={\cal X}^{*} for all time t\geq 0. Then,

\displaystyle\operatorname*{arg\,min}_{x}\Big{(}\liminf_{t\rightarrow\infty}\sum_{i\in{\cal S}(t)}f_{i}(x)\Big{)}=\operatorname*{arg\,min}_{x}f(x),

which implies that \liminf_{t\rightarrow\infty}{\rm dist}(y(t),{\cal X}^{*})=0, and thus

\displaystyle\liminf_{t\rightarrow\infty}\big{(}f(y(t))-f(z)\big{)}=0. (28)

We next show that all the sequences \{x_{i}(t)\}, i\in{\cal H}, converge to the same optimal point.

From (26), rearranging the terms and fixing an arbitrary period from time t_{1} to t_{2} with t_{1}<t_{2},

\displaystyle\sum_{i=1}^{|{\cal H}|}\pi_{i}(t_{2}+1)\|x_{i}(t_{2}+1)-z\|^{2}\;\leq\;\sum_{i=1}^{|{\cal H}|}\pi_{i}(t_{1})\|x_{i}(t_{1})-z\|^{2}
\displaystyle+\;D^{2}\sum_{t=t_{1}}^{t_{2}}\alpha^{2}(t)+2D\sum_{t=t_{1}}^{t_{2}}\alpha(t)\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|.

From Assumption 3 and Lemma 10,

\limsup_{\tau_{2}\rightarrow\infty}\;\sum_{i=1}^{|{\cal H}|}\pi_{i}(\tau_{2}+1)\|x_{i}(\tau_{2}+1)-z\|^{2}\leq\liminf_{\tau_{1}\rightarrow\infty}\;\sum_{i=1}^{|{\cal H}|}\pi_{i}(\tau_{1})\|x_{i}(\tau_{1})-z\|^{2}.

Thus, the sequence \{\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}\} is convergent for each z\in{\cal X}^{*}. From Proposition 3 and its proof,

\eta^{l}\sum_{i\in{\cal S}(t)}\|x_{i}(t)-z\|^{2}\leq\sum_{i\in{\cal S}(t)}\pi_{i}(t)\|x_{i}(t)-z\|^{2}\leq\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2},

which implies that the sequence \{\sum_{i\in{\cal S}(t)}\|x_{i}(t)-z\|^{2}\} is bounded, and so is each sequence \{x_{i}(t)\}, i\in{\cal S}(t). From Proposition 2, all the sequences \{x_{i}(t)\}, i\in{\cal H}, are bounded. From Lemma 9, the sequence \{y(t)\} is bounded, and since each \pi(t) is a stochastic vector, the sequence \{\|y(t)-z\|^{2}\} is convergent for each z\in{\cal X}^{*}. Since y(t) is bounded, from (28), there exists a subsequence of \{y(t)\} converging to a point x^{*}\in{\cal X}^{*}. Since \{\|y(t)-x^{*}\|^{2}\} is convergent, the whole sequence \{y(t)\} converges to x^{*}. From Lemma 9, all the sequences \{x_{i}(t)\}, i\in{\cal H}, converge to the same optimal point x^{*}.

Proof of Proposition 1: From Proposition 3, for any t\geq 0, there exists a subset {\cal S}(t)\subset{\cal H} with |{\cal S}(t)|\geq\kappa_{\beta,d\beta}(\mathbb{G}) for which all \pi_{i}(t), i\in{\cal S}(t), t\geq 0, are bounded below by a positive number. It is clear that |{\cal S}(t)|\leq|{\cal H}|=n-|{\cal F}| for all t\geq 0. Fix any integer b\in[\kappa_{\beta,d\beta}(\mathbb{G}),n-|{\cal F}|]. Define

\displaystyle{\cal S}=\operatorname*{arg\,max}_{{\cal K}\subset{\cal H},\;b\geq|{\cal K}|\geq\kappa_{\beta,d\beta}(\mathbb{G})}\big{|}\big{\{}t:t\in\{0,1,\ldots,T-1\},\;{\cal K}\subset{\cal S}(t)\big{\}}\big{|},

which is a subset of normal agents, of appropriate cardinality, that appears the most times within the T time steps (with ties broken arbitrarily). It is easy to see that {\cal S} is well-defined and nonempty.

We next count the number of times such a set {\cal S} appears within the T time steps. To this end, define

\displaystyle{\cal T}=\big{\{}t:t\in\{0,1,\ldots,T-1\},\;{\cal S}\subset{\cal S}(t)\big{\}},

which represents this number. Consider the set {\cal M}=\{{\cal K}\subset{\cal H}:b\geq|{\cal K}|\geq\kappa_{\beta,d\beta}(\mathbb{G})\}, whose cardinality is

C=\sum_{k=\kappa_{\beta,d\beta}(\mathbb{G})}^{b}\binom{n-|{\cal F}|}{k}.

Note that {\cal S}\in{\cal M} and that, for each t\geq 0, at least one set in {\cal M} is a subset of {\cal S}(t). From the pigeonhole principle, at least one set in {\cal M} appears \lceil T/C\rceil times within the T time steps. Since {\cal S} is a set in {\cal M} that appears most often during the total T time steps, |{\cal T}|\geq T/C.
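For concreteness, C and the resulting pigeonhole guarantee can be computed directly; the sizes below are hypothetical and serve only to illustrate the counting.

```python
from math import ceil, comb

# Hypothetical sizes: n agents, F Byzantine, kappa = kappa_{beta,d beta}(G).
n, F, kappa, b, T = 10, 2, 3, 4, 10_000
C = sum(comb(n - F, k) for k in range(kappa, b + 1))  # |M| in the proof
print(C)            # number of candidate subsets K with kappa <= |K| <= b
print(ceil(T / C))  # some set in M appears at least this many times
```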

To proceed, similar to (27), for any z\in{\cal X}^{*},

\displaystyle\eta^{l}\sum_{t=0}^{T-1}\alpha(t)\sum_{i\in{\cal S}(t)}(f_{i}(y(t))-f_{i}(z))\leq\sum_{t=0}^{T-1}\alpha(t)\sum_{i\in{\cal H}}\pi_{i}(t+1)(f_{i}(y(t))-f_{i}(z))<\infty. (29)

Substituting \alpha(t)=1/\sqrt{T} into (26) and (29), it follows that for any z\in{\cal X}^{*},

\displaystyle 2\eta^{l}\sum_{t=0}^{T-1}\sum_{i\in{\cal S}(t)}\frac{f_{i}(y(t))-f_{i}(z)}{\sqrt{T}}\leq\sum_{i\in{\cal H}}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}+2D\sum_{t=0}^{T-1}\sum_{i\in{\cal H}}\frac{\|x_{i}(t)-y(t)\|}{\sqrt{T}}. (30)

Since all f_{i}(x), i\in{\cal H}, are convex functions,

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}y(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}(z)\leq\sum_{i\in{\cal S}}\sum_{t\in{\cal T}}\frac{f_{i}(y(t))-f_{i}(z)}{|{\cal T}|}=\sum_{t\in{\cal T}}\sum_{i\in{\cal S}}\frac{f_{i}(y(t))-f_{i}(z)}{|{\cal T}|}. (31)

From Corollary 3, z\in\operatorname*{arg\,min}_{x}f_{i}(x) for all i\in{\cal H}. Then, f_{i}(y(t))-f_{i}(z)\geq 0 for all i\in{\cal H} and t\geq 0. Since {\cal S}\subset{\cal S}(t) for all t\in{\cal T} and {\cal S}(t)\subset{\cal H} for all t\geq 0, it follows from (31), (30), and |{\cal T}|\geq T/C that

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}y(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}(z)\leq\sum_{t=0}^{T-1}\sum_{i\in{\cal S}(t)}\frac{f_{i}(y(t))-f_{i}(z)}{|{\cal T}|}
\displaystyle\leq\;\frac{C}{2\eta^{l}}\bigg{(}\frac{\sum_{i\in{\cal H}}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}}{\sqrt{T}}+\frac{2D\sum_{t=0}^{T-1}\sum_{i\in{\cal H}}\|x_{i}(t)-y(t)\|}{T}\bigg{)}. (32)

Note that for all i\in{\cal H}, there holds

\|x_{i}(0)-y(0)\|=\Big{\|}x_{i}(0)-\sum_{j\in{\cal H}}\pi_{j}(0)x_{j}(0)\Big{\|}\leq\sum_{j\in{\cal H}}\|x_{j}(0)\|.

With this simple fact and \alpha(t)=1/\sqrt{T}, it follows from (23) that for all i\in{\cal H},

\displaystyle\sum_{t=0}^{T-1}\|x_{i}(t)-y(t)\|\leq\;\sum_{t=1}^{T-1}2\lambda^{t}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{2D|{\cal H}|}{1-\lambda}\sum_{t=1}^{T-1}\Big{(}\lambda^{\lceil\frac{t}{2}\rceil-1}\alpha(0)+\alpha\big{(}\lceil\frac{t}{2}\rceil-1\big{)}\Big{)}+2D\sum_{t=1}^{T-1}\alpha(t-1)
\displaystyle+\|x_{i}(0)-y(0)\|
\displaystyle\leq\;\sum_{t=0}^{T-1}2\lambda^{t}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{2D|{\cal H}|}{1-\lambda}\sum_{t=1}^{T-1}\Big{(}\lambda^{\lceil\frac{t}{2}\rceil-1}\alpha(0)+\alpha\big{(}\lceil\frac{t}{2}\rceil-1\big{)}\Big{)}+2D\sum_{t=1}^{T-1}\alpha(t-1)
\displaystyle\leq\;\frac{2}{1-\lambda}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{2D|{\cal H}|}{1-\lambda}\Big{(}\frac{2}{\lambda(1-\lambda)\sqrt{T}}+\sqrt{T}\Big{)}+2D\sqrt{T}, (33)

which implies that

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}y(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}(z)\leq\;\frac{C}{2\eta^{l}}\bigg{(}\frac{\sum_{i\in{\cal H}}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}}{\sqrt{T}}+\frac{4D|{\cal H}|}{(1-\lambda)T}\sum_{j=1}^{|{\cal H}|}\|x_{j}(0)\|
\displaystyle+\frac{8|{\cal H}|^{2}D^{2}}{\lambda(1-\lambda)^{2}T\sqrt{T}}+\frac{4|{\cal H}|^{2}D^{2}}{(1-\lambda)\sqrt{T}}+\frac{4D^{2}|{\cal H}|}{\sqrt{T}}\bigg{)}. (34)
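The elementary series estimates behind the last step of (33) can be double-checked numerically for sample values of \lambda and T (chosen arbitrarily here); with \alpha(t)=1/\sqrt{T}, the two step-size sums reduce to simple geometric and linear expressions.

```python
import numpy as np

lam, T = 0.6, 10_000
t = np.arange(1, T)
# sum over t of lambda^(ceil(t/2) - 1), times alpha(0) = 1/sqrt(T):
lhs = (lam ** (np.ceil(t / 2) - 1)).sum() / np.sqrt(T)
print(lhs <= 2.0 / (lam * (1 - lam) * np.sqrt(T)))   # bound used in (33)
# sum over t of alpha(ceil(t/2) - 1) = (T - 1)/sqrt(T) <= sqrt(T):
print((T - 1) / np.sqrt(T) <= np.sqrt(T))
```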

From (1) and Assumption 2, for all j\in{\cal H}, it holds that

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}x_{j}(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}y(t)}{|{\cal T}|}\bigg{)}
\displaystyle\leq\;D\sum_{i\in{\cal S}}\frac{\sum_{t\in{\cal T}}\|x_{j}(t)-y(t)\|}{|{\cal T}|}\leq D\sum_{i\in{\cal S}}\frac{\sum_{t=0}^{T-1}\|x_{j}(t)-y(t)\|}{|{\cal T}|}
\displaystyle\leq\;DC|{\cal S}|\bigg{(}\frac{2}{(1-\lambda)T}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{4D|{\cal H}|}{\lambda(1-\lambda)^{2}T\sqrt{T}}+\frac{2D|{\cal H}|}{(1-\lambda)\sqrt{T}}+\frac{2D}{\sqrt{T}}\bigg{)},

which, with (34), implies that

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}x_{j}(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}(z)
\displaystyle\leq\;\frac{C}{2\eta^{l}}\bigg{(}\frac{\sum_{i\in{\cal H}}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}}{\sqrt{T}}+\frac{4D|{\cal H}|}{(1-\lambda)T}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{8|{\cal H}|^{2}D^{2}}{\lambda(1-\lambda)^{2}T\sqrt{T}}+\frac{4|{\cal H}|^{2}D^{2}}{(1-\lambda)\sqrt{T}}
\displaystyle+\frac{4D^{2}|{\cal H}|}{\sqrt{T}}\bigg{)}+DC|{\cal S}|\bigg{(}\frac{2}{(1-\lambda)T}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{4D|{\cal H}|}{\lambda(1-\lambda)^{2}T\sqrt{T}}+\frac{2D|{\cal H}|}{(1-\lambda)\sqrt{T}}+\frac{2D}{\sqrt{T}}\bigg{)}
\displaystyle=\;\frac{C}{2\eta^{l}}\bigg{(}\frac{\sum_{i\in{\cal H}}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}}{\sqrt{T}}+\frac{4D\big{(}|{\cal H}|+|{\cal S}|\eta^{l}\big{)}}{(1-\lambda)T}\sum_{j\in{\cal H}}\|x_{j}(0)\|+\frac{8|{\cal H}|D^{2}\big{(}|{\cal H}|+|{\cal S}|\eta^{l}\big{)}}{\lambda(1-\lambda)^{2}T\sqrt{T}}
\displaystyle+\frac{4|{\cal H}|D^{2}\big{(}|{\cal H}|+|{\cal S}|\eta^{l}\big{)}}{(1-\lambda)\sqrt{T}}+\frac{4D^{2}\big{(}|{\cal H}|+|{\cal S}|\eta^{l}\big{)}}{\sqrt{T}}\bigg{)} (35)
\displaystyle=\;O\Big{(}\frac{1}{\sqrt{T}}\Big{)}.

There is an alternative way to prove this, as follows.

From (26), for any j\in{\cal H},

\displaystyle\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|x_{i}(t+1)-z\|^{2}
\displaystyle\leq\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}+\alpha^{2}(t)D^{2}+2\alpha(t)D\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|
\displaystyle-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z))-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(y(t))-f_{i}(x_{j}(t)))
\displaystyle\leq\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}+\alpha^{2}(t)D^{2}+2\alpha(t)D\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|
\displaystyle-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z))+2D\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|y(t)-x_{j}(t)\|
\displaystyle=\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}+\alpha^{2}(t)D^{2}+2\alpha(t)D\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|
\displaystyle+2D\alpha(t)\|y(t)-x_{j}(t)\|-2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z)),

which implies that

\displaystyle 2\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z))\leq-\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)\|x_{i}(t+1)-z\|^{2}+\sum_{i=1}^{|{\cal H}|}\pi_{i}(t)\|x_{i}(t)-z\|^{2}
\displaystyle+\alpha^{2}(t)D^{2}+2\alpha(t)D\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|
\displaystyle+2D\alpha(t)\|y(t)-x_{j}(t)\|.

Thus,

\displaystyle 2\sum_{t=0}^{T-1}\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z))
\displaystyle\leq\sum_{i=1}^{|{\cal H}|}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}\sum_{t=0}^{T-1}\alpha^{2}(t)+2D\sum_{t=0}^{T-1}\alpha(t)\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|+2D\sum_{t=0}^{T-1}\alpha(t)\|x_{j}(t)-y(t)\|. (36)

Similar to (27), for any z\in{\cal X}^{*},

\displaystyle\eta^{l}\sum_{t=0}^{T-1}\alpha(t)\sum_{i\in{\cal S}(t)}(f_{i}(x_{j}(t))-f_{i}(z))\leq\sum_{t=0}^{T-1}\alpha(t)\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z)). (37)

Substituting \alpha(t)=1/\sqrt{T} into (36) and (37), it holds that for any z\in{\cal X}^{*},

\displaystyle\sum_{t=0}^{T-1}\sum_{i\in{\cal S}(t)}(f_{i}(x_{j}(t))-f_{i}(z))\leq\eta^{-l}\sum_{t=0}^{T-1}\sum_{i=1}^{|{\cal H}|}\pi_{i}(t+1)(f_{i}(x_{j}(t))-f_{i}(z))
\displaystyle\leq\;\frac{\sqrt{T}}{2\eta^{l}}\bigg{(}\sum_{i=1}^{|{\cal H}|}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}+\frac{2D}{\sqrt{T}}\sum_{t=0}^{T-1}\Big{(}\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|+\|x_{j}(t)-y(t)\|\Big{)}\bigg{)}.

Using the same argument as in deriving (32),

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}x_{j}(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}(z)
\displaystyle\leq\;\frac{C}{2\eta^{l}}\bigg{(}\frac{\sum_{i=1}^{|{\cal H}|}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}}{\sqrt{T}}+\frac{2D\sum_{t=0}^{T-1}\big{(}\sum_{i=1}^{|{\cal H}|}\|x_{i}(t)-y(t)\|+\|x_{j}(t)-y(t)\|\big{)}}{T}\bigg{)},

which, with (33), yields

\displaystyle\sum_{i\in{\cal S}}f_{i}\bigg{(}\frac{\sum_{t\in{\cal T}}x_{j}(t)}{|{\cal T}|}\bigg{)}-\sum_{i\in{\cal S}}f_{i}(z)
\displaystyle\leq\;\frac{C}{2\eta^{l}}\bigg{(}\frac{\sum_{i=1}^{|{\cal H}|}\pi_{i}(0)\|x_{i}(0)-z\|^{2}+D^{2}}{\sqrt{T}}+\frac{4D(|{\cal H}|+1)}{(1-\lambda)T}\sum_{j=1}^{|{\cal H}|}\|x_{j}(0)\|+\frac{8D^{2}|{\cal H}|(|{\cal H}|+1)}{\lambda(1-\lambda)^{2}T\sqrt{T}}
\displaystyle+\frac{4D^{2}|{\cal H}|(|{\cal H}|+1)}{(1-\lambda)\sqrt{T}}+\frac{4D^{2}(|{\cal H}|+1)}{\sqrt{T}}\bigg{)} (38)
\displaystyle=\;O\Big{(}\frac{1}{\sqrt{T}}\Big{)}.

This completes the proof.  

The above proof provides two bounds of the same order. It is not hard to check that the first bound (35) is better than the second one (38) if and only if \eta^{l}|{\cal S}|<1, and that they are equal if and only if \eta^{l}|{\cal S}|=1.
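As a sanity check of the O(1/\sqrt{T}) order (not of the constants), the following toy simulation runs the averaging-plus-subgradient skeleton with a fixed symmetric doubly stochastic W and the common scalar objective f_{i}(x)=|x|, a crude stand-in for objective redundancy; the error of the running average of the iterates should shrink roughly like 1/\sqrt{T}.

```python
import numpy as np

W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])   # symmetric, doubly stochastic
for T in [100, 1_000, 10_000]:
    x = np.array([5.0, -3.0, 8.0])   # scalar states of three agents
    avg = np.zeros(3)
    for _ in range(T):
        v = W @ x                    # consensus step v_i(t)
        x = v - (1.0 / np.sqrt(T)) * np.sign(v)   # subgradient of |.|
        avg += x / T                 # running average of the iterates
    print(T, np.abs(avg).max())      # decays roughly like 1/sqrt(T)
```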

5 Concluding Remarks

This paper has proposed a distributed subgradient algorithm which achieves full resilience in the presence of Byzantine agents, given appropriate redundancy in both graph connectivity and objective functions. The algorithm and its convergence results can be easily extended to time-varying neighbor graphs, provided that the neighbor graph is (\beta,d\beta)-resilient at all times. One immediate next step is to relax Assumption 1, possibly appealing to gradient descent for differentiable convex functions. The concepts and tools developed in this paper are expected to be applicable to other consensus-based distributed optimization and computation problems.

Although the algorithm theoretically works for multi-dimensional convex optimization, it has the following limitations which preclude its applicability to high-dimensional optimization. First, from Lemma 2, the algorithm implicitly requires that each agent have at least (d+1)\beta+1 neighbors, which is infeasible when d is large. Second, picking a point in the intersection of multiple convex hulls (cf. step (4) of the algorithm) can be computationally expensive in high dimensions, although this issue has been attenuated in [34, Algorithm 2] and [37, Section 5.1]; a sketch of one standard formulation is given after this paragraph. Last, building (\beta,d\beta)-resilient graphs is not easy, especially when d or \beta is large. Another practical issue of the algorithm, independent of the dimension, is how to measure and establish objective function redundancy. Studies of (r,s)-resilient graphs and k-redundant multi-agent networks are of independent interest.
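To illustrate the second limitation, one standard way to pick a point in the intersection of several convex hulls, when it is nonempty, is to solve a linear-programming feasibility problem; the sketch below is such a formulation (using scipy), not the paper's step (4) itself, and its cost indeed grows with the dimension and the number of points.

```python
import numpy as np
from scipy.optimize import linprog

def point_in_hull_intersection(point_sets):
    """LP feasibility: find x with x in conv(P_k) for every k, by solving
    for convex weights w_k with sum_i w_k[i] P_k[i] = x, sum_i w_k[i] = 1,
    w_k >= 0. Each P_k is an (n_k, d) array; returns x or None."""
    d = point_sets[0].shape[1]
    nw = sum(P.shape[0] for P in point_sets)   # total weight variables
    A_eq, b_eq, off = [], [], 0
    for P in point_sets:
        nk = P.shape[0]
        for r in range(d):                     # sum_i w[i] P[i, r] - x[r] = 0
            row = np.zeros(nw + d)
            row[off:off + nk] = P[:, r]
            row[nw + r] = -1.0
            A_eq.append(row); b_eq.append(0.0)
        row = np.zeros(nw + d)                 # sum_i w[i] = 1
        row[off:off + nk] = 1.0
        A_eq.append(row); b_eq.append(1.0)
        off += nk
    bounds = [(0, None)] * nw + [(None, None)] * d   # w >= 0, x free
    res = linprog(np.zeros(nw + d), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=bounds, method="highs")
    return res.x[nw:] if res.success else None

# Two overlapping triangles in the plane; a point near the origin works.
P1 = np.array([[-1.0, -1.0], [2.0, 0.0], [0.0, 2.0]])
P2 = np.array([[1.0, 1.0], [-2.0, 0.0], [0.0, -2.0]])
print(point_in_hull_intersection([P1, P2]))
```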

Considering that distributed optimization algorithms in machine learning are nowadays frequently high-dimensional, there is ample motivation to design fully resilient high-dimensional distributed optimization algorithms. A future direction of this work is to tackle this challenging problem by combining the proposed algorithm with communication-efficient schemes in which each agent transmits only low-dimensional signals. Possible approaches include entry-wise or block-wise updating [56, 57], limited information fusion [58], and dimension-independent filtering [16, 59].

References

  • [1] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [2] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K.H. Johansson. A survey of distributed optimization. Annual Reviews in Control, 47:278–305, 2019.
  • [3] A. Nedić and J. Liu. Distributed optimization for control. Annual Review of Control, Robotics, and Autonomous Systems, 1:77–103, 2018.
  • [4] D.K. Molzahn, F. Dörfler, H. Sandberg, S.H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei. A survey of distributed optimization and control algorithms for electric power systems. IEEE Transactions on Smart Grid, 8(6):2941–2962, 2017.
  • [5] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [6] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
  • [7] L. Su and N.H. Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434, 2016.
  • [8] K. Kuwaranancharoen, L. Xin, and S. Sundaram. Byzantine-resilient distributed optimization of multi-dimensional functions. In Proceedings of the 2020 American Control Conference, pages 4399–4404, 2020.
  • [9] Z. Yang and W.U. Bajwa. ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning. IEEE Transactions on Signal and Information Processing over Networks, 5(4):611–627, 2019.
  • [10] C. Fang, Z. Yang, and W.U. Bajwa. BRIDGE: Byzantine-resilient decentralized gradient descent. IEEE Transactions on Signal and Information Processing over Networks, 8:610–626, 2022.
  • [11] L. He, S.P. Karimireddy, and M. Jaggi. Byzantine-robust decentralized learning via self-centered clipping. 2022. arXiv:2202.01545 [cs.LG].
  • [12] Z. Wu, T. Chen, and Q. Ling. Byzantine-resilient decentralized stochastic optimization with robust aggregation rules. 2022. arXiv:2206.04568 [cs.DC].
  • [13] S. Sundaram and B. Gharesifard. Distributed optimization under adversarial nodes. IEEE Transactions on Automatic Control, 64(3):1063–1076, 2018.
  • [14] L. Su and N.H. Vaidya. Byzantine-resilient multiagent optimization. IEEE Transactions on Automatic Control, 66(5):2227–2233, 2020.
  • [15] C. Zhao, J. He, and Q.-G. Wang. Resilient distributed optimization algorithm against adversarial attacks. IEEE Transactions on Automatic Control, 65(10):4308–4315, 2020.
  • [16] N. Gupta, T.T. Doan, and N.H. Vaidya. Byzantine fault-tolerance in decentralized optimization under 2f2f-redundancy. In Proceedings of the 2021 American Control Conference, pages 3632–3637, 2021.
  • [17] N. Gupta and N.H. Vaidya. Byzantine fault-tolerance in peer-to-peer distributed gradient-descent. 2021. arXiv:2101.12316 [cs.DC].
  • [18] M. Kaheni, E. Usai, and M. Franceschelli. Resilient constrained optimization in multi-agent systems with improved guarantee on approximation bounds. IEEE Control Systems Letters, 6:2659–2664, 2022.
  • [19] S. Liu, N. Gupta, and N.H. Vaidya. Approximate Byzantine fault-tolerance in distributed optimization. In Proceedings of the 2021 ACM Symposium on Principles of Distributed Computing, pages 379–389, 2021.
  • [20] N. Gupta, T.T. Doan, and N. Vaidya. Byzantine fault-tolerance in federated local SGD under 2f2f-redundancy. 2021. arXiv:2108.11769 [cs.DC].
  • [21] D. Data, L. Song, and S.N. Diggavi. Data encoding for Byzantine-resilient distributed optimization. IEEE Transactions on Information Theory, 67(2):1117–1140, 2021.
  • [22] C.A. Uribe, H.-T. Wai, and M. Alizadeh. Resilient distributed optimization algorithms for resource allocation. In Proceedings of the 58th IEEE Conference on Decision and Control, pages 8341–8346, 2019.
  • [23] Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):Article No. 44, 2017.
  • [24] P. Blanchard, E.M. El Mhamdi, R. Guerraoui, and J. Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [25] D. Alistarh, Z. Allen-Zhu, and J. Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • [26] L. Su and J. Xu. Securing distributed gradient descent in high dimensional statistical learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):Article No. 12, 2019.
  • [27] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos. DRACO: Byzantine-resilient distributed training via redundant gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 903–912, 2018.
  • [28] Z. Yang, A. Gang, and W.U. Bajwa. Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model. IEEE Signal Processing Magazine, 37(3):146–159, 2020.
  • [29] N. Gupta and N. H. Vaidya. Resilience in collaborative optimization: Redundant and independent cost functions. 2020. arXiv:2003.09675v2 [cs.DC].
  • [30] N.H. Vaidya and V.K. Garg. Byzantine vector consensus in complete graphs. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 65–73, 2013.
  • [31] N.H. Vaidya. Iterative Byzantine vector consensus in incomplete graphs. In Proceedings of the 15th International Conference on Distributed Computing and Networking, pages 14–28, 2014.
  • [32] H. Tverberg. A generalization of Radon’s theorem. Journal of the London Mathematical Society, 41:123–128, 1966.
  • [33] P.K. Agarwal, M. Sharir, and E. Welzl. Algorithms for center and Tverberg points. In Proceedings of the 20th Annual Symposium on Computational Geometry, pages 61–67, 2004.
  • [34] X. Wang, S. Mou, and S. Sundaram. Resilience for distributed consensus with constraints. 2022. arXiv:2206.05662 [eess.SY].
  • [35] J. Liu, T. Başar, and A. Nedić. Input-output stability of linear consensus processes. In Proceedings of the 55th IEEE Conference on Decision and Control, pages 6978–6983, 2016.
  • [36] W. Abbas, M. Shabbir, J. Li, and X. Koutsoukos. Resilient distributed vector consensus using centerpoint. Automatica, 136:110046, 2022.
  • [37] J. Yan, X. Li, Y. Mo, and C. Wen. Resilient multi-dimensional consensus in adversarial environment. Automatica, 145:110530, 2022.
  • [38] B. Polyak. A general method for solving extremum problems. Doklady Akademii Nauk, 8(3):593–597, 1967.
  • [39] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
  • [40] Y. Lin and J. Liu. Subgradient-push is of the optimal convergence rate. In Proceedings of the 61st IEEE Conference on Decision and Control, pages 5849–5856, 2022.
  • [41] N.H. Vaidya, L. Tseng, and G. Liang. Iterative approximate Byzantine consensus in arbitrary directed graphs. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 365–374, 2012.
  • [42] H.J. LeBlanc, H. Zhang, X. Koutsoukos, and S. Sundaram. Resilient asymptotic consensus in robust networks. IEEE Journal on Selected Areas in Communications, 31(4):766–781, 2013.
  • [43] R.T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
  • [44] V.A. Zorich and R. Cooke. Mathematical Analysis I. Mathematical Analysis. Springer, 2004.
  • [45] E. Helly. Über Mengen konvexer Körper mit gemeinschaftlichen Punkten. Jahresbericht der Deutschen Mathematiker-Vereinigung, 32:175–176, 1923.
  • [46] N. Vaidya. Matrix representation of iterative approximate Byzantine consensus in directed graphs. Technical report, University of Illinois at Urbana-Champaign, 2012. arXiv:1203.1888 [cs.DC].
  • [47] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, 2006.
  • [48] J. Hajnal and M.S. Bartlett. Weak ergodicity in non-homogeneous Markov chains. Mathematical Proceedings of the Cambridge Philosophical Society, 54:233–246, 1958.
  • [49] M. Cao, A.S. Morse, and B.D.O. Anderson. Reaching a consensus in a dynamically changing environment: Convergence rates, measurement delays, and asynchronous events. SIAM Journal on Control and Optimization, 47(2):601–623, 2008.
  • [50] M. Cao, A.S. Morse, and B.D.O. Anderson. Reaching a consensus in a dynamically changing environment: A graphical approach. SIAM Journal on Control and Optimization, 47(2):575–600, 2008.
  • [51] A. Kolmogoroff. Zur Theorie der Markoffschen Ketten. Mathematische Annalen, 112(1):155–160, 1936.
  • [52] D. Blackwell. Finite non-homogeneous chains. Annals of Mathematics, 46(4):594–599, 1945.
  • [53] A. Nedić and J. Liu. On convergence rate of weighted-averaging dynamics for consensus problems. IEEE Transactions on Automatic Control, 62(2):766–781, 2017.
  • [54] A. Nedić, A. Ozdaglar, and P.A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938, 2010.
  • [55] B. Touri. Product of Random Stochastic Matrices and Distributed Averaging. Springer Science & Business Media, 2012.
  • [56] J. Liu and B.D.O. Anderson. Communication-efficient distributed algorithms for solving linear algebraic equations over directed graphs. In Proceedings of the 59th IEEE Conference on Decision and Control, pages 5360–5365, 2020.
  • [57] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano. Distributed big-data optimization via blockwise gradient tracking. IEEE Transactions on Automatic Control, 66(5):2045–2060, 2021.
  • [58] J. Zhu, Y. Lin, J. Liu, and A.S. Morse. Reaching a consensus with limited information. In Proceedings of the 61st IEEE Conference on Decision and Control, pages 4579–4584, 2022.
  • [59] J. Zhu, Y. Lin, A. Velasquez, and J. Liu. Resilient constrained consensus over complete graphs via feasibility redundancy. In Proceedings of the American Control Conference, pages 3418–3422, 2022.