Resilient Distributed Optimization
Abstract
This paper considers a distributed optimization problem in the presence of Byzantine agents capable of introducing untrustworthy information into the communication network. A resilient distributed subgradient algorithm is proposed based on graph redundancy and objective redundancy. It is shown that the algorithm causes all non-Byzantine agents’ states to asymptotically converge to the same optimal point under appropriate assumptions. A partial convergence rate result is also provided.
1 Introduction
Distributed optimization has attracted considerable attention and achieved remarkable success in both theory and practice. The distributed convex optimization problem was first studied in [1] where a distributed subgradient algorithm was proposed. After this, various distributed optimization algorithms have been crafted and studied; see survey papers [2, 3, 4]. Distributed optimization techniques are also widely applied to decentralized deep learning [5].
Information exchange between neighboring agents is necessary for a multi-agent network for distributed optimization. However, agents’ states may be corrupted and they may not adhere to the designed algorithm due to faulty processes or external attacks. An agent is called Byzantine if it updates its state in an arbitrary, unknown manner, and can send conflicting values to different neighbors [6]. Such attacking agents can know global information of the network, play arbitrarily and strategically, and even be coordinated. Consider a network of agents in which Byzantine agents exist. An ideal resilient algorithm is one that leads the non-Byzantine (or normal) agents to cooperatively solve the corresponding distributed optimization problem as if the Byzantine agents did not exist. Such a resilient algorithm is highly desirable for the safety and security of multi-agent systems, as faulty processes and external attacks are inevitable.
Resilient distributed optimization has recently received increasing attention, probably originating from the work of [7]. Almost all the existing works cannot guarantee full resilience; what they can guarantee is that all normal agents’ states converge to a bounded neighborhood of the desired optimal point whose bound is not controllable [8, 9, 10, 11, 12], or to an optimal point of an unspecified convex combination of all normal agents’ objective functions [7, 13, 14], or to a convex combination of all normal agents’ local optimal points [15]. The only exceptions are [16, 17, 18], in which the underlying communication graph is assumed to be a complete graph, namely, each agent is allowed to communicate with all other agents. All of [16, 17, 18] rely on the idea of “objective function redundancy”. The idea has also been applied to the federated setting and achieved full resilience [19, 20]. In the federated setting, a central coordinator agent is able to communicate with all worker agents, which is more or less equivalent to a complete graph in the distributed setting (sometimes called the decentralized setting). It is worth noting that [7, 13, 14, 15, 18] only consider special one-dimensional optimization.
Resilient distributed optimization is also related to resilient federated optimization/learning in the coordinator-workers setting (e.g., [20, 21, 22]), which has attracted increasing attention recently. The key problem is how the central coordinator aggregates the received information to eliminate or attenuate the effects of Byzantine worker agents. Various Byzantine-resilient information aggregation methods have been proposed for high-dimensional optimization/learning, focusing on stochastic gradient descent (SGD). Notable among them are [23, 24, 25, 26, 27], just to name a few; see an overview of recent developments in this area in [28]. It is doubtful whether these methods can be applied to achieve full resilience in the distributed setting.
From the preceding discussion, and to the best of our knowledge, a fully resilient distributed optimization algorithm for general non-complete communication graphs does not exist, even for one-dimensional optimization problems. This gap is precisely what we study in this paper. We consider a distributed convex optimization problem in the presence of Byzantine agents and propose a fully resilient distributed subgradient algorithm based on the ideas of objective redundancy (cf. Definition 1) and graph redundancy (cf. Definition 2). The algorithm is shown to cause all non-Byzantine agents’ states to asymptotically converge to the same desired optimal point under appropriate assumptions. The proposed algorithm works theoretically for multi-dimensional optimization but practically not for high-dimensional optimization, as will be explained and discussed in the concluding remarks.
This work is motivated by two recent ideas. The first is the quantified notion of objective function redundancy proposed in [29], where a couple of different definitions of objective redundancy are studied, based on which fully resilient distributed optimization algorithms have been crafted either for a federated setting [29, 19, 20] or a distributed setting over complete graphs [16, 17, 18]; such redundancy has been shown necessary for achieving full resilience in multi-agent optimization [19]. We borrow one notion from [29] and develop it further. It is worth emphasizing that the results in [16, 17, 18] rely on objective redundancy among non-Byzantine agents, whereas ours depend on objective redundancy among all agents. This subtle difference is important for equipping a multi-agent network with a certain level of redundancy at the network design stage, since which agents are non-Byzantine cannot be known a priori.
The second idea is so-called “Byzantine vector consensus” [30, 31] whose goal is, given a set of both Byzantine and non-Byzantine vectors, to pick a vector lying in the convex hull of the non-Byzantine vectors, based on Tverberg’s theorem [32, 33]. The idea has been very recently improved in [34] which can be used to achieve resilient multi-dimensional consensus exponentially fast. Exponential consensus is critical in the presence of diminishing disturbance [35]. We are prompted by this improved idea and make use of a resilient vector picking process, simplified from that of [34, Algorithm 1]. There are other recent approaches appealing to the idea of centerpoint [36, 37]. We expect that these approaches can also be applied to resilient optimization, provided that exponential consensus is guaranteed, e.g., in [37].
2 Problem Formulation
Consider a network consisting of agents, labeled through for the purpose of presentation. The agents are not aware of such global labeling, but can differentiate between their neighbors. The neighbor relations among the agents are characterized by a directed graph whose vertices correspond to agents and whose directed edges (or arcs) depict neighbor relations, where is the vertex set and is the directed edge set. (We use to denote that is a subset of .) Specifically, agent is a neighbor of agent if . Each agent can receive information from its neighbors. Thus, the directions of edges represent the directions of information flow. We use to denote the neighbor set of agent , excluding , i.e., .
Each agent has a “private” convex (not necessarily differentiable) objective function, , only known to agent . There exist Byzantine agents in the network which are able to transmit arbitrary values to others and capable of sending conflicting values to different neighbors at any time. The set of Byzantine agents is denoted by and the set of normal (non-Byzantine) agents is denoted by . Which agents are Byzantine is unknown to normal agents. It is assumed that each agent may have at most Byzantine neighbors.
The goal of the normal agents is to cooperatively minimize the objective functions
We will show that minimizing the above two objective functions can be achieved simultaneously with appropriate redundancy in objective functions (cf. Definition 1 and Corollary 1). It is assumed that the set of optimal solutions to , denoted by , is nonempty and bounded.
Since each is not necessarily differentiable, the gradient descent method may not be applicable. Instead, the subgradient method [38] can be applied. For a convex function , a vector is called a subgradient of at point if
(1) |
Such a vector always exists and may not be unique. In the case when is differentiable at point , the subgradient is unique and equals , the gradient of at . Thus, the subgradient can be viewed as a generalization of the notion of the gradient. From (1) and the Cauchy-Schwarz inequality, , where is an upper bound for the 2-norm of the subgradients of at both and . We will use this fact without special mention in the sequel. Throughout this paper, we use for the 2-norm.
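As a concrete illustration of the subgradient inequality (1), the following sketch verifies it numerically for the hypothetical one-dimensional nondifferentiable convex function f(x) = |x| (an example of ours, not part of the paper's setup); the function names are likewise our own.

```python
# Illustrating the subgradient inequality f(y) >= f(x) + g*(y - x)
# for the convex, nondifferentiable function f(x) = |x|.

def subgradient_abs(x):
    """Return one subgradient of f(x) = |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # at x = 0 any value in [-1, 1] is a valid subgradient


def check_subgradient(f, g, x, ys):
    """Verify f(y) >= f(x) + g*(y - x) for each test point y."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)


x = 0.0
g = subgradient_abs(x)
assert check_subgradient(abs, g, x, [-2.0, -0.5, 0.0, 0.5, 2.0])
```

At the kink x = 0 the subgradient is not unique; the sketch simply picks 0, and the inequality holds for every choice in [-1, 1].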
The subgradient method was first proposed in [38] and the first distributed subgradient method was proposed in [1] for undirected graphs. Its extension to directed graphs has been studied in [39] and recently further analyzed in [40].
2.1 Redundancy
To make the resilient distributed optimization problem solvable, certain redundancy is necessary. We begin with objective redundancy.
Definition 1.
An -agent network is called -redundant, , if for any subsets with , there holds (we use to denote the cardinality of a set )
The above definition of objective redundancy originated in [29, Definition 2]. It has the following properties.
Lemma 1.
If an -agent network is -redundant, then for any subsets with and , there holds
Proof of Lemma 1: Let and . From Definition 1, for any . For each , let . It is easy to see that for each (we use to denote the number of -combinations from a set of elements),
Then,
(2) |
Pick any . From (2),
which implies that , and thus .
To prove the lemma, it is sufficient to prove that . Suppose that, to the contrary, there exists a such that and . Since , from (2),
which is impossible. Therefore, .
The following corollaries are immediate consequences of Lemma 1.
Corollary 1.
If an -agent network is -redundant, then for any subsets with , there holds
Corollary 2.
If an -agent network is -redundant with , then it is -redundant.
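To make the redundancy idea concrete, the following toy sketch uses hypothetical scalar quadratics whose minimizers all coincide, so the minimizer of every subset sum agrees with that of the full sum, whatever the redundancy level; the function and variable names are our own, and the precise quantified definition is the one from [29].

```python
# Numeric sketch of objective redundancy with scalar quadratics
# f_i(x) = (x - c_i)^2: identical minimizers make every subset of
# objectives share the same argmin with the full objective sum.
from itertools import combinations


def argmin_quadratic_sum(centers):
    # argmin of sum_i (x - c_i)^2 is the mean of the centers
    return sum(centers) / len(centers)


N, k = 6, 1
centers = [2.0] * N  # identical minimizers: a fully redundant network
full = argmin_quadratic_sum(centers)

# every subset of N - 2k objectives has the same minimizer as the full sum
for S in combinations(range(N), N - 2 * k):
    sub = argmin_quadratic_sum([centers[i] for i in S])
    assert abs(sub - full) < 1e-12
```

With distinct centers the subset minimizers would generally differ, which is exactly why redundancy must be engineered into the network at the design stage.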
We also need redundancy in graph connectivity.
A vertex in a directed graph is called a root of if for each other vertex of , there is a directed path from to . Thus, is a root of if it is the root of a directed spanning tree of . We will say that is rooted at if is in fact a root. It is easy to see that a rooted graph has a unique strongly connected component whose vertices are all roots of .
Definition 2.
An -reduced graph of a directed graph with vertices, with and , is a subgraph of obtained by first picking any vertex subset with and then removing from each vertex of the subgraph induced by , , arbitrary incoming edges in . A directed graph is called -resilient if all its -reduced graphs are rooted.
It is easy to see that if a directed graph is -resilient, then for any nonnegative and , the graph is also -resilient.
In the case when , the -resilient graph is equivalent to the rooted “reduced graph” in [41], which was used to guarantee resilient one-dimensional consensus; see Definition 4 and Theorem 2 in [41]. Thus, the definition here can be viewed as a simple generalization of the rooted “reduced graph”.
Definition 2 implicitly requires that each vertex of an -resilient graph have at least neighbors. More can be said.
Lemma 2.
If a directed graph is -resilient, then each of its vertices has at least neighbors.
Proof of Lemma 2: Suppose that, to the contrary, there exists a vertex in whose . If , it is easy to see that does not satisfy Definition 2. We thus consider the case when . Let be the set of arbitrary neighbors of vertex , and , where is the vertex set of . (We use to denote the set of elements that are in but not in .) It is clear that , and in the subgraph induced by , , vertex has exactly neighbors. Then, after vertex removes incoming edges in , and each out-neighbor of vertex in (a vertex is called an out-neighbor of vertex if the latter is a neighbor of the former), if any, removes its incoming edge from , vertex becomes isolated. But this is impossible for an -resilient graph.
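Checking -resilience amounts to checking rootedness of every reduced graph. The rootedness test at the core of that check can be sketched as follows (stdlib Python on hypothetical small graphs; the helper names are ours):

```python
# Sketch: test whether a directed graph is rooted, i.e. whether some
# vertex r has a directed path to every other vertex.
from collections import deque


def reachable_from(adj, r):
    """Vertices reachable from r along directed paths (BFS)."""
    seen, queue = {r}, deque([r])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_rooted(adj, vertices):
    return any(reachable_from(adj, r) == set(vertices) for r in vertices)


# directed cycle 1 -> 2 -> 3 -> 1 plus an extra sink vertex 4
adj = {1: [2], 2: [3], 3: [1, 4], 4: []}
assert is_rooted(adj, [1, 2, 3, 4])            # every cycle vertex is a root
assert not is_rooted({1: [], 2: []}, [1, 2])   # two isolated vertices
```

A full resilience check would repeat this test over all reduced graphs obtained per Definition 2, which is combinatorial in the graph size.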
3 Algorithm
To describe our algorithm, we need the following notation.
Let denote the collection of all those subsets of whose cardinality is . It is obvious that the number of all such subsets is
(3) |
and label them . For each , let denote the collection of all those subsets of whose cardinality is . For any subset of agents , let denote the convex hull of all , .
Algorithm: At each discrete time , each agent first picks an arbitrary point
(4) |
for each , and then updates its state by setting
(5) | ||||
(6) |
where is the stepsize, and is a subgradient of .
In the special one-dimensional case with , it is not hard to check that steps (4) and (5) simplify to the resilient scalar consensus algorithm in [41], which is essentially equivalent to the trimmed mean method and has been improved in [42].
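A minimal sketch of this one-dimensional special case, in the spirit of the trimmed-mean method just mentioned: each normal agent discards the b largest and b smallest received neighbor values before averaging them with its own state. The uniform averaging weights are our own simplifying assumption for illustration; the algorithm itself only requires convex-combination weights.

```python
# One-dimensional resilient consensus step via trimmed mean: drop the
# b extreme values on each side, then average the rest with own state.

def trimmed_mean_update(own_state, neighbor_values, b):
    vals = sorted(neighbor_values)
    kept = vals[b:len(vals) - b]       # requires len(vals) > 2*b
    pts = kept + [own_state]
    return sum(pts) / len(pts)


# an agent with state 1.0, four honest-looking values and one outlier
x_next = trimmed_mean_update(1.0, [0.9, 1.1, 1.0, 1.2, 100.0], b=1)
assert 0.9 <= x_next <= 1.2  # the outlier cannot drag the state outside
```

The key resilience property visible here is that the new state stays inside the range of the honest values whenever at most b neighbors are Byzantine.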
The convergence and correctness of the proposed algorithm rely on the following assumptions.
Assumption 1.
has a nonempty interior.
It is easy to see that Assumption 1 implies that is differentiable at any , where denotes the interior of a set. More can be said.
Lemma 3.
Under Assumption 1, if the -agent network is -redundant with , then is differentiable at with for all and .
Proof of Lemma 3: Since is nonempty, for any , there exist a positive number and an open ball in centered at with radius , denoted as . Let be a vector in whose th entry is and the remaining entries all equal zero. Since for sufficiently small ,
(7) |
For each , since is convex, both and exist and for all [43, Theorem 24.1]. It follows that
Note that from (7),
Thus, , i.e., exists for all and .
To proceed, let for all . From Corollary 1, . Since , both and are differentiable at , implying that for all and . Since , for all and . Note that this holds for all . From [44, Section 8.4.2], is differentiable at with for all .
Lemma 3 has the following important implication.
Corollary 3.
Under Assumption 1, if the -agent network is -redundant with , then for all ,
Corollary 3 immediately implies that
Proof of Corollary 3: Suppose that, to the contrary, there exist and such that . Pick a . From Lemma 3, . It is then clear that . Let . From Corollary 1, , and thus . It follows that , which contradicts the fact that .
Assumption 2.
The subgradients of all , , are uniformly bounded, i.e., there exists a positive number such that for all and .
Assumption 3.
The step-size sequence is positive, non-increasing, and satisfies and .
The above two assumptions are standard for subgradient methods.
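For instance, the classic diminishing step size 1/(t+1) satisfies Assumption 3: it is positive and non-increasing, its sum diverges (the harmonic series), and the sum of its squares converges. A finite-horizon numerical check (illustrative only; the horizon and thresholds are our own choices):

```python
# Finite-horizon check of the standard step-size conditions for
# alpha_t = 1/(t+1): positivity, monotonicity, divergent sum,
# convergent sum of squares (bounded by pi^2/6).
import math

T = 10**5
alphas = [1.0 / (t + 1) for t in range(T)]

assert all(a > 0 for a in alphas)
assert all(alphas[t] >= alphas[t + 1] for t in range(T - 1))
assert sum(alphas) > 11.0                        # ~ln(T): grows without bound
assert sum(a * a for a in alphas) < math.pi**2 / 6
```

Of course a finite check cannot prove divergence; the partial sum growing like ln(T) is the standard argument.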
To state our main results, we need the following concepts. For a directed graph , we use to denote the set of all -reduced graphs of . For a rooted graph , we use to denote the size of the unique strongly connected component whose vertices are all roots of ; in other words, equals the number of roots of . For any -resilient graph , let
which is well-defined and denotes the smallest possible number of roots in any -reduced graphs of .
Theorem 1.
The following example shows that -redundancy is necessary. For simplicity, set . Consider a 4-agent network whose neighbor graph is the 4-vertex complete graph , which is -resilient. Suppose that agent 4 is Byzantine and the other three are normal. It is possible that, with a carefully crafted attack strategy of the Byzantine agent, the three normal agents update their states in a manner mathematically equivalent to the case in which their neighbor graph is the 3-vertex -reduced graph with the arc set , which is rooted (cf. Lemma 6). In this case, since vertex 1 is the only root and agent 1 does not have any neighbor, it follows the single-agent subgradient algorithm, and thus its state will converge to a minimum point of , denoted . Since all normal agents will eventually reach a consensus (cf. Lemma 9), the states of agents 2 and 3 will also converge to . To guarantee that the resilient distributed optimization problem is solvable in this case, there must hold , , which implies that the network needs to be 3-redundant. It is easy to see that , and thus .
Theorem 1 shows that the proposed algorithm achieves full resilience. We next partially characterize the convergence rate of the algorithm.
Theorem 2.
The existing distributed optimization literature (without Byzantine agents) typically characterizes convergence rates by bounding the difference between and . The above theorem can be viewed as a “partial” convergence rate result in that it only accounts for a subset of normal agents and a time subsequence over a finite horizon. Notwithstanding this, it is worth noting that is equivalent to in the setting here with Byzantine agents (cf. Corollary 1) and that . Therefore, the theorem still to some extent evaluates the convergence rate of the resilient distributed subgradient algorithm under consideration. It is well known that the optimal convergence rate of subgradient methods for convex optimization is . Whether converges at this optimal rate has so far eluded us.
Theorem 2 is an immediate consequence of the following proposition.
Proposition 1.
The proposition further quantifies a trade-off between the number of normal agents in and the length of the time subsequence . Roughly speaking, the fewer the normal agents involved in (8), the denser the time subsequence would be. In the special case when , the proposition simplifies to the following corollary.
4 Analysis
This section provides the analysis of the algorithm and proofs of the main results.
4.1 Algorithm Feasibility
From Lemma 2, -resilience guarantees that each agent has at least neighbors at each time . Thus, each in the algorithm is always nonempty.
We next show that in (4) always exists. To this end, we need the following well-known theorem by Helly.
Lemma 4.
(Helly’s Theorem [45]) Let be a finite collection of convex sets in with . If the intersection of every of these sets is nonempty, then the whole collection has a nonempty intersection, i.e., .
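Helly's theorem is easiest to see in dimension one, where convex sets are intervals and the theorem says that pairwise-intersecting intervals share a common point. A quick check on a hypothetical family (data and names are ours):

```python
# Helly's theorem for d = 1: if every 2 of a family of intervals in R
# intersect, then the whole family has a nonempty intersection.
from itertools import combinations

intervals = [(0, 5), (2, 7), (4, 9), (3, 6)]


def intersect(ivs):
    """Nonempty common intersection of closed intervals (lo, hi)."""
    lo = max(a for a, _ in ivs)
    hi = min(b for _, b in ivs)
    return lo <= hi


# pairwise intersections are all nonempty ...
assert all(intersect(pair) for pair in combinations(intervals, 2))
# ... so Helly guarantees a common point; here the interval [4, 5] works
assert intersect(intervals)
```

In dimension d the hypothesis requires every d+1 sets to intersect, which is exactly how Lemma 5 below applies the theorem.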
Lemma 5.
For any and , there holds .
Proof of Lemma 5: From Lemma 4, it is sufficient to prove that the intersection of every sets in is nonempty. Pick any with . For each , from the definition of , is the convex hull of distinct points. Since , the intersection involves in total points (with repetition). Recall that all these points are agents’ states at time . Thus, each of them can be written as with being an index in . From the definition of , it is easy to see that . Note that
Then, the pigeonhole principle (or Dirichlet’s box principle) guarantees that among the total indices, at least one index in , say , repeats at least times. Since for each , there is no repetition of indices when computing , there must exist different sets for which involves the computation of and thus for all . Since , . It follows that .
4.2 Dynamics of Normal Agents
To analyze the performance of the algorithm, it is important to understand the dynamics of normal agents and “decouple” the influence of Byzantine agents. The following lemma serves this purpose.
Lemma 6.
in (5) can be expressed as a convex combination of and , ,
where and are nonnegative numbers satisfying , and there exists a positive constant such that for all and , and among all , , at least of them are bounded below by .
Since each agent is assumed to have at most Byzantine neighbors, the lemma immediately implies that at least among all , , are bounded below by , which has been reported in [34, Theorem 1]. In the special case when , the lemma simplifies to Claim 2 in [46], which directly implies Proposition 5.1 in [13]. Thus, the lemma can be regarded as a generalization of [34, Theorem 1], [46, Claim 2], and [13, Proposition 5.1].
Proof of Lemma 6: Recall that there are at most Byzantine neighbors in , whose cardinality is , and that is the collection of those subsets of whose cardinality is ; hence there must exist an index set such that . For any such index set , since , it follows that . Let for each and . From the preceding, is always nonempty, and for every ,
(9) |
where are nonnegative weights satisfying . It is clear that at least one of is positive and at least .
It is easy to see that can be rewritten as
(10) |
Our reason for rewriting in this way will become clear shortly. Since each , the above expression is a convex combination of all , , allowing some weights to be zero. Specifically, defining for each and ,
Then,
in which
(11) |
It is clear that for all and . Since
from (10),
(12) |
Note that is the collection of all subsets of with cardinality being . It follows that is the collection of all subsets of with cardinality being . Thus, (12) implies that is a convex combination of all possible expressions of in terms of (9). Since both and are positive, as long as a in (12) is positive, is positive with .
We claim that the number of those indices such that for at least one is at least . To prove this, suppose that, to the contrary, the number of such indices is no larger than . That is, there are at least indices in whose corresponding for all , . Pick exactly of them and form an index set . It is clear that for some . So all , , must be included in the right-hand side of (12) and are strictly less than . But this is impossible because (9) asserts that .
From the preceding, there exist at least indices for which in (11) is no less than for at least one . For each of such , from (11),
Set
(13) |
Since due to (3), must be positive and independent of and . The statement of the lemma then immediately follows.
Without loss of generality, we label all normal agents from to in the sequel.
4.3 Consensus
We first study the infinite product of stochastic matrices .
The graph of an matrix , denoted , is a directed graph with vertices and a directed edge from vertex to vertex whenever the -th entry of the matrix is nonzero.
Lemma 7.
Proof of Lemma 7: From Lemma 6, for all , and at least among all , , are positive and uniformly bounded below by given in (13). The expression “at least among all ” implies that the graph of can be obtained by removing from each vertex of the subgraph of induced by , , at most unspecified incoming edges in , which is an -reduced graph of . Since , must be -resilient. Thus, any -reduced graph of is rooted, and so is .
For any infinite sequence of stochastic matrices with the property in Lemma 7, their product has the following result.
Lemma 8.
Let be an infinite sequence of stochastic matrices, each of whose graphs has a rooted spanning subgraph. If all the diagonal entries and those off-diagonal entries of corresponding to the rooted spanning subgraphs are uniformly bounded below by a positive number , then the product converges to a rank one matrix of the form exponentially fast, where is a column vector. (We use to denote the vector whose entries all equal , with the dimension understood from the context.)
To prove the lemma, we need the concept of the “composition” of directed graphs. The composition of two directed graphs , with the same vertex set, denoted by , is the directed graph with the same vertex set and arc set defined so that is an arc in the composition whenever there is a vertex such that is an arc in and is an arc in . Since this composition is an associative binary operation, the definition extends unambiguously to any finite sequence of directed graphs with the same vertex set. Composition and matrix multiplication are closely related. In particular, the graph of the product of two nonnegative matrices is equal to the composition of the graphs of the two matrices comprising the product. In other words, . If we focus exclusively on graphs with self-arcs at all vertices, the definition of composition implies that the arcs of both and are arcs of ; the converse is false.
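The correspondence between graph composition and matrix multiplication can be sketched as a boolean matrix product over arc sets. The arc-orientation convention below is one possible choice, and the helper names are ours:

```python
# Composition of two directed graphs on the same vertex set {0,...,n-1},
# realized as a boolean matrix product over arc sets.

def compose(g1, g2, n):
    """(i, j) is an arc of the composition iff (i, k) is in g1 and
    (k, j) is in g2 for some intermediate vertex k."""
    return {(i, j)
            for i in range(n) for j in range(n)
            if any((i, k) in g1 and (k, j) in g2 for k in range(n))}


n = 3
# both graphs carry self-arcs at every vertex, as the text assumes
self_arcs = {(v, v) for v in range(n)}
g1 = self_arcs | {(0, 1)}
g2 = self_arcs | {(1, 2)}

comp = compose(g1, g2, n)
# with self-arcs, the arcs of both graphs survive in the composition ...
assert g1 <= comp and g2 <= comp
# ... and a new two-hop arc appears: 0 -> 1 in g1, then 1 -> 2 in g2
assert (0, 2) in comp
```

The last two assertions mirror the text's remark: with self-arcs, arcs of both factors are arcs of the composition, while the composition may contain strictly more arcs.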
To proceed, for any nonnegative matrix , let
It is easy to see that if is a substochastic matrix (a square nonnegative matrix is called substochastic if its row sums are all equal to or less than one), . In the case when is a stochastic matrix, is called the coefficient of ergodicity [47, page 137]. A stochastic matrix is called a scrambling matrix if and only if [48], and its graph is sometimes called “neighbor-shared” [49]. It is natural to call a vertex a neighbor of vertex in a directed graph if is an arc in . A directed graph is called neighbor-shared if each set of two distinct vertices share a common neighbor. The composition of any set of rooted graphs with vertices and self-arcs at all vertices is neighbor-shared [50, Proposition 8]. It is easy to check that for any nonnegative square matrix , if and only if its graph is neighbor-shared. Moreover, for any two nonnegative square matrices and , if (for any two real matrices and of the same size, we write if for all and ), then .
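The coefficient of ergodicity admits the standard closed form tau(A) = 1 - min over row pairs (i, j) of the overlap sum_k min(a_ik, a_jk); A is scrambling exactly when tau(A) < 1, i.e. every two rows put weight on some common column. A small sketch with hypothetical matrices:

```python
# Coefficient of ergodicity for a stochastic matrix, and the scrambling
# test tau(A) < 1 (every pair of rows shares a positively weighted column).
from itertools import combinations


def tau(A):
    n = len(A)
    overlap = min(sum(min(A[i][k], A[j][k]) for k in range(len(A[0])))
                  for i, j in combinations(range(n), 2))
    return 1.0 - overlap


scrambling = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]]   # every two rows share a column
not_scrambling = [[1.0, 0.0],
                  [0.0, 1.0]]    # rows of the identity share no column

assert tau(scrambling) < 1.0
assert tau(not_scrambling) == 1.0
```

Products of scrambling matrices contract disagreement at rate tau, which is the mechanism behind the exponential convergence in Lemma 8.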
Proof of Lemma 8: Since the graph of each is rooted with self-arcs, by Proposition 8 in [50], the graph of the product of any finite sequence of matrices of length is neighbor-shared, which implies that the product is a scrambling matrix. Thus, letting for each , each is scrambling and its graph is neighbor-shared. Since the graph of each has a rooted spanning subgraph with self-arcs whose corresponding entries in are bounded below by a positive number , the graph of each has a neighbor-shared spanning subgraph with self-arcs whose corresponding entries in are bounded below by a positive number . Let be the matrix whose th entry is the th entry of if is an arc in and zero otherwise. Then, each is a substochastic matrix whose graph is neighbor-shared. Since all positive entries of are bounded below by , . Since , . With this uniform upper bound on all , the lemma is thus an immediate consequence of Lemma 3 in [48].
An infinite sequence of stochastic matrices is called ergodic if for all , where each is a stochastic vector (a vector is called a stochastic vector if its entries are nonnegative and sum to one). From Lemmas 7 and 8, the sequence of stochastic matrices is ergodic, and any infinite product of matrices converges to a rank one matrix exponentially fast. Using the same argument as in the proof of Lemma 8,
for all , where is given in (13). Following the same argument as pages 610–611 in [49], the product converges to a rank one matrix as exponentially fast at a rate no slower than
(18) |
Proposition 2.
To prove the proposition, we need the following concept.
Definition 3.
Let be a sequence of stochastic matrices. A sequence of stochastic vectors is an absolute probability sequence for if for all .
This definition was first introduced by Kolmogorov [51]. It was shown by Blackwell [52] that every sequence of stochastic matrices has an absolute probability sequence. In general, a sequence of stochastic matrices may have more than one absolute probability sequence; when the sequence of stochastic matrices is ergodic, it has a unique absolute probability sequence [53, Lemma 1]. It is easy to see that when is a fixed irreducible stochastic matrix , is simply the normalized left eigenvector of for eigenvalue one, and when is an ergodic sequence of doubly stochastic matrices (a square nonnegative matrix is called doubly stochastic if its row sums and column sums all equal one), .
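For the fixed-matrix special case just mentioned, the normalized left eigenvector for eigenvalue one can be computed by power iteration. A sketch on a hypothetical 2-by-2 irreducible, aperiodic stochastic matrix (names and data are ours):

```python
# Left eigenvector pi of a fixed irreducible stochastic matrix A for
# eigenvalue one (pi^T A = pi^T), via power iteration from the uniform
# stochastic vector. Convergence holds for ergodic chains.

def left_eigenvector(A, iters=200):
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi


A = [[0.5, 0.5],
     [0.25, 0.75]]
pi = left_eigenvector(A)

# pi is stochastic and satisfies pi^T A = pi^T
assert abs(sum(pi) - 1.0) < 1e-9
out = [sum(pi[i] * A[i][j] for i in range(2)) for j in range(2)]
assert all(abs(out[j] - pi[j]) < 1e-9 for j in range(2))
# for this particular A the stationary vector is (1/3, 2/3)
assert abs(pi[0] - 1 / 3) < 1e-6 and abs(pi[1] - 2 / 3) < 1e-6
```

For a genuinely time-varying ergodic sequence, the absolute probability sequence is not computable this way; the sketch only covers the constant-matrix case noted in the text.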
From the preceding, the sequence of stochastic matrices in (16) is ergodic. Thus, has a unique absolute probability sequence . From Lemma 1 in [53],
(19) |
Let with . Then, there exists a positive constant such that for all and ,
(20) |
where denotes the th entry of a matrix and is given in (18). Using the same argument as in the proof of Lemma 2 in [39], .
Proof of Lemma 9: For all ,
From (21), for all ,
Set . Then, using (20) and Assumption 2, for ,
(22) | ||||
(23) |
where and denote the floor and ceiling functions, respectively. It follows that .
4.4 Convergence
Proof of Lemma 10: From (22), for any ,
Since and for any and , then for any ,
Therefore,
which completes the proof.
It is known that under some “standard” assumptions, if the graphs of an ergodic sequence of stochastic matrices are “uniformly strongly connected”, all the entries of its unique absolute probability sequence are uniformly bounded below by a positive constant [55, Theorem 4.8]. This is not the case here, as the sequence of graphs of the matrices may not be “uniformly strongly connected”.
Proposition 3.
If is -resilient, then for any , there exists a subset with for which all , , are bounded below by a positive number.
To prove the proposition, we need the following concept and results.
We say that a directed graph is strongly rooted at vertex if each other vertex of is reachable from vertex along a directed path of length 1; that is, is strongly rooted at if is a neighbor of every other vertex in the graph. A directed graph is called strongly rooted if it has at least one vertex at which it is strongly rooted. For a square nonnegative matrix, its graph is strongly rooted at if and only if its th column is strictly positive. Moreover, for any directed graphs with vertices and self-arcs which are all rooted at the same vertex , their composition is strongly rooted at [50, Proposition 3].
Proof of Proposition 3: From Lemma 7, the graph of each is rooted. It is clear that the number of roots in the graph of each is at least ; that is, for all time . Let
(24) |
We claim that for any finite sequence of matrices of length , there exists a subset , depending on the sequence, with such that each is a root of the graph of some for at least times. To prove the claim, suppose that, to the contrary, such a subset does not exist; that is, at most vertices in are a root of the graph of some for at least times. For the remaining vertices, each is a root of the graph of some for at most times. Then, the total number of roots of all the graphs of the sequence of matrices of length is no larger than . Meanwhile, since for all time , this total number of roots is at least , which implies that , and thus . But this contradicts (24). Therefore, the claim is true.
The above claim immediately implies that for any time and the corresponding -length stochastic matrix sequence starting at time , , there exists a subset with such that each is a root of at least graphs among . From Lemma 7, the graph of each is rooted with self-arcs. Using two facts about directed graphs with self-arcs, namely that the arcs of each graph in a graph sequence are arcs of their composition, and that the composition of any graphs with vertices all rooted at the same vertex is strongly rooted at [50, Proposition 3], it follows that for any time , the graph of is strongly rooted at each vertex . Then, each product has strictly positive columns whose entries are uniformly bounded below by a positive number , where is defined in (13). Since each is a stochastic matrix, it is easy to see that for any , each product has strictly positive columns whose entries are uniformly bounded below by . Then, the statement of the proposition follows from (19).
Proof of Theorem 1: From (15) and (1), for any ,
Since each is a stochastic matrix and the 2-norm is convex, . Then, with Assumption 2 and the fact that for all and ,
(25) |
Note that
which implies that
From (25),
(26) | |||||
Thus,
From Corollary 3, for all . Then, for all and . From Proposition 3 and its proof, for any , there exists a subset with for which all , , are uniformly bounded below by . It follows that
(27) |
Since and for all and ,
which implies that (we use to denote the Euclidean distance between a point and a set in )
From Corollary 1, for all time . Then,
which implies that , and thus
(28) |
We next show that all the sequences , , converge to the same optimal point.
From (26), rearranging the terms and fixing an arbitrary period from time to with ,
From Assumption 2 and Lemma 10,
Thus, the sequence is convergent for each . From Proposition 3 and its proof,
which implies that the sequence is bounded, so is each sequence , . From Proposition 2, all the sequences , , are bounded. From Lemma 9, the sequence is bounded, and with each being a stochastic vector, the sequence is convergent for each . Since is bounded, from (28), there exists a subsequence of converging to a point . Since is convergent, the sequence converges to . From Lemma 9, all the sequences , converge to the same optimal point .
Proof of Proposition 1: From Proposition 3, for any , there exists a subset with for which all , , are bounded below by a positive number. It is clear that for all . Fix any integer . Define
which represents a subset of normal agents, with an appropriate cardinality, that appears most often within the time steps. It is easy to see that is well-defined and nonempty.
We next count the number of times such a set appears within the time steps. To this end, define
which represents this number. Consider the set whose cardinality is
Note that and, for each , there exists at least one set in that is a subset of . By the pigeonhole principle, at least one set in appears at least times within the time steps. Since is a set in that appears most often during the total time steps, .
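The counting step is the standard pigeonhole argument: if each of T time steps selects one set out of at most C candidates, some candidate must be selected at least ceil(T/C) times. A minimal sketch (the candidate sets, T, and the selection sequence below are hypothetical, not the paper's):

```python
# Pigeonhole sketch: among T selections from C candidate sets, the most
# frequent candidate appears at least ceil(T/C) times. The selection
# sequence below is arbitrary/hypothetical.
import math
from collections import Counter

candidates = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]  # C = 3
T = 10
# An arbitrary sequence of T selections (standing in for the sets
# produced at each time step):
selections = [candidates[k % 3] for k in range(T)]

counts = Counter(selections)
most_common_set, times = counts.most_common(1)[0]
print(times, math.ceil(T / len(candidates)))
assert times >= math.ceil(T / len(candidates))
```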
To proceed, similarly to (27), for any ,
(29)
Substituting into (26) and (29), it follows that for any ,
(30)
Since all , are convex functions,
(31)
From Corollary 3, for all . Then, for all and . Since for all , it follows from (31), (30), and that
(32)
Note that for all , there holds
With this simple fact and , it follows from (23) that for all ,
(33)
which implies that
(34)
From (1) and Assumption 2, for all , it holds that
which, with (34), implies that
(35)
5 Concluding Remarks
This paper has proposed a distributed subgradient algorithm which achieves full resilience in the presence of Byzantine agents, given appropriate redundancy in both graph connectivity and objective functions. The algorithm and convergence results can easily be extended to time-varying neighbor graphs, provided that the neighbor graph is -resilient at all times. One immediate next step is to relax Assumption 1, possibly by appealing to gradient descent for differentiable convex functions. The concepts and tools developed in this paper are expected to be applicable to other consensus-based distributed optimization and computation problems.
Although the algorithm theoretically works for multi-dimensional convex optimization, it has the following limitations which preclude its applicability to high-dimensional optimization. First, from Lemma 2, the algorithm implicitly requires that each agent have at least neighbors, which is impossible in high dimensions. Second, picking a point in the intersection of multiple convex hulls (cf. step (4) of the algorithm) can be computationally expensive in high dimensions, although this issue has been attenuated in [34, Algorithm 2] and [37, Section 5.1]. Last, building -resilient graphs is not an easy task, especially when or is large. Another practical issue of the algorithm, independent of dimension, is how to measure and establish objective function redundancy. Studies of -resilient graphs and -redundant multi-agent networks are of independent interest.
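In one dimension the hull-intersection step becomes cheap: the convex hull of a finite set of received values is an interval, and a classical fact is that after discarding the f largest and f smallest received values, every surviving value lies in the convex hull of the values sent by normal neighbors, provided at most f neighbors are Byzantine. The sketch below illustrates this 1-D stand-in; the values and f are made up, and this is not the paper's multi-dimensional step (4).

```python
# 1-D illustration of trimming as a cheap stand-in for intersecting
# convex hulls: drop the f largest and f smallest received values.
# With at most f Byzantine neighbors, every surviving value lies in the
# interval spanned by the normal neighbors' values. All numbers are made up.

def trim(values, f):
    """Discard the f smallest and f largest entries."""
    s = sorted(values)
    return s[f:len(s) - f]

f = 1
normal_values = [2.0, 2.5, 3.0, 4.0]
byzantine_values = [100.0]          # one attacker sends an outlier
received = normal_values + byzantine_values

kept = trim(received, f)
print(kept)  # prints [2.5, 3.0, 4.0]
lo, hi = min(normal_values), max(normal_values)
assert all(lo <= v <= hi for v in kept)  # survivors stay in the normal hull
```

The reason this works: any surviving value above the largest normal value would have to be among the (at most f) Byzantine outliers on that side, all of which are removed by trimming the f largest; the symmetric argument handles the lower end.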
Considering that distributed optimization algorithms in machine learning are nowadays frequently high-dimensional, there is ample motivation to design fully resilient high-dimensional distributed optimization algorithms. A future direction is to tackle this challenging problem by combining the proposed algorithm with communication-efficient schemes in which each agent transmits only low-dimensional signals. Possible approaches include entry-wise or block-wise updating [56, 57], limited information fusion [58], and dimension-independent filtering [16, 59].
References
- [1] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
- [2] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K.H. Johansson. A survey of distributed optimization. Annual Reviews in Control, 47:278–305, 2019.
- [3] A. Nedić and J. Liu. Distributed optimization for control. Annual Review of Control, Robotics, and Autonomous Systems, 1:77–103, 2018.
- [4] D.K. Molzahn, F. Dörfler, H. Sandberg, S.H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei. A survey of distributed optimization and control algorithms for electric power systems. IEEE Transactions on Smart Grid, 8(6):2941–2962, 2017.
- [5] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [6] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
- [7] L. Su and N.H. Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434, 2016.
- [8] K. Kuwaranancharoen, L. Xin, and S. Sundaram. Byzantine-resilient distributed optimization of multi-dimensional functions. In Proceedings of the 2020 American Control Conference, pages 4399–4404, 2020.
- [9] Z. Yang and W.U. Bajwa. ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning. IEEE Transactions on Signal and Information Processing over Networks, 5(4):611–627, 2019.
- [10] C. Fang, Z. Yang, and W.U. Bajwa. BRIDGE: Byzantine-resilient decentralized gradient descent. IEEE Transactions on Signal and Information Processing over Networks, 8:610–626, 2022.
- [11] L. He, S.P. Karimireddy, and M. Jaggi. Byzantine-robust decentralized learning via self-centered clipping. 2022. arXiv:2202.01545 [cs.LG].
- [12] Z. Wu, T. Chen, and Q. Ling. Byzantine-resilient decentralized stochastic optimization with robust aggregation rules. 2022. arXiv:2206.04568 [cs.DC].
- [13] S. Sundaram and B. Gharesifard. Distributed optimization under adversarial nodes. IEEE Transactions on Automatic Control, 64(3):1063–1076, 2018.
- [14] L. Su and N.H. Vaidya. Byzantine-resilient multiagent optimization. IEEE Transactions on Automatic Control, 66(5):2227–2233, 2020.
- [15] C. Zhao, J. He, and Q.-G. Wang. Resilient distributed optimization algorithm against adversarial attacks. IEEE Transactions on Automatic Control, 65(10):4308–4315, 2020.
- [16] N. Gupta, T.T. Doan, and N.H. Vaidya. Byzantine fault-tolerance in decentralized optimization under -redundancy. In Proceedings of the 2021 American Control Conference, pages 3632–3637, 2021.
- [17] N. Gupta and N.H. Vaidya. Byzantine fault-tolerance in peer-to-peer distributed gradient-descent. 2021. arXiv:2101.12316 [cs.DC].
- [18] M. Kaheni, E. Usai, and M. Franceschelli. Resilient constrained optimization in multi-agent systems with improved guarantee on approximation bounds. IEEE Control Systems Letters, 6:2659–2664, 2022.
- [19] S. Liu, N. Gupta, and N.H. Vaidya. Approximate Byzantine fault-tolerance in distributed optimization. In Proceedings of the 2021 ACM Symposium on Principles of Distributed Computing, pages 379–389, 2021.
- [20] N. Gupta, T.T. Doan, and N. Vaidya. Byzantine fault-tolerance in federated local SGD under -redundancy. 2021. arXiv:2108.11769 [cs.DC].
- [21] D. Data, L. Song, and S.N. Diggavi. Data encoding for Byzantine-resilient distributed optimization. IEEE Transactions on Information Theory, 67(2):1117–1140, 2021.
- [22] C.A. Uribe, H.-T. Wai, and M. Alizadeh. Resilient distributed optimization algorithms for resource allocation. In Proceedings of the 58th IEEE Conference on Decision and Control, pages 8341–8346, 2019.
- [23] Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):Article No. 44, 2017.
- [24] P. Blanchard, E.M. El Mhamdi, R. Guerraoui, and J. Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [25] D. Alistarh, Z. Allen-Zhu, and J. Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [26] L. Su and J. Xu. Securing distributed gradient descent in high dimensional statistical learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):Article No. 12, 2019.
- [27] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos. DRACO: Byzantine-resilient distributed training via redundant gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 903–912, 2018.
- [28] Z. Yang, A. Gang, and W.U. Bajwa. Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model. IEEE Signal Processing Magazine, 37(3):146–159, 2020.
- [29] N. Gupta and N. H. Vaidya. Resilience in collaborative optimization: Redundant and independent cost functions. 2020. arXiv:2003.09675v2 [cs.DC].
- [30] N.H. Vaidya and V.K. Garg. Byzantine vector consensus in complete graphs. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 65–73, 2013.
- [31] N.H. Vaidya. Iterative Byzantine vector consensus in incomplete graphs. In Proceedings of the 15th International Conference on Distributed Computing and Networking, pages 14–28, 2014.
- [32] H. Tverberg. A generalization of Radon’s theorem. Journal of the London Mathematical Society, 41:123–128, 1966.
- [33] P.K. Agarwal, M. Sharir, and E. Welzl. Algorithms for center and Tverberg points. In Proceedings of the 20th Annual Symposium on Computational Geometry, pages 61–67, 2004.
- [34] X. Wang, S. Mou, and S. Sundaram. Resilience for distributed consensus with constraints. 2022. arXiv:2206.05662 [eess.SY].
- [35] J. Liu, T. Başar, and A. Nedić. Input-output stability of linear consensus processes. In Proceedings of the 55th IEEE Conference on Decision and Control, pages 6978–6983, 2016.
- [36] W. Abbas, M. Shabbir, J. Li, and X. Koutsoukos. Resilient distributed vector consensus using centerpoint. Automatica, 136:110046, 2022.
- [37] J. Yan, X. Li, Y. Mo, and C. Wen. Resilient multi-dimensional consensus in adversarial environment. Automatica, 145:110530, 2022.
- [38] B. Polyak. A general method for solving extremum problems. Doklady Akademii Nauk, 8(3):593–597, 1967.
- [39] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
- [40] Y. Lin and J. Liu. Subgradient-push is of the optimal convergence rate. In Proceedings of the 61st IEEE Conference on Decision and Control, pages 5849–5856, 2022.
- [41] N.H. Vaidya, L. Tseng, and G. Liang. Iterative approximate Byzantine consensus in arbitrary directed graphs. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 365–374, 2012.
- [42] H.J. LeBlanc, H. Zhang, X. Koutsoukos, and S. Sundaram. Resilient asymptotic consensus in robust networks. IEEE Journal on Selected Areas in Communications, 31(4):766–781, 2013.
- [43] R.T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
- [44] V.A. Zorich and R. Cooke. Mathematical Analysis I. Mathematical Analysis. Springer, 2004.
- [45] E. Helly. Über Mengen konvexer Körper mit gemeinschaftlichen Punkten. Jahresbericht der Deutschen Mathematiker-Vereinigung, 32:175–176, 1923.
- [46] N. Vaidya. Matrix representation of iterative approximate Byzantine consensus in directed graphs. Technical report, University of Illinois at Urbana-Champaign, 2012. arXiv:1203.1888 [cs.DC].
- [47] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, 2006.
- [48] J. Hajnal and M.S. Bartlett. Weak ergodicity in non-homogeneous Markov chains. Mathematical Proceedings of the Cambridge Philosophical Society, 54:233–246, 1958.
- [49] M. Cao, A.S. Morse, and B.D.O. Anderson. Reaching a consensus in a dynamically changing environment: Convergence rates, measurement delays, and asynchronous events. SIAM Journal on Control and Optimization, 47(2):601–623, 2008.
- [50] M. Cao, A.S. Morse, and B.D.O. Anderson. Reaching a consensus in a dynamically changing environment: A graphical approach. SIAM Journal on Control and Optimization, 47(2):575–600, 2008.
- [51] A. Kolmogoroff. Zur Theorie der Markoffschen Ketten. Mathematische Annalen, 112(1):155–160, 1936.
- [52] D. Blackwell. Finite non-homogeneous chains. Annals of Mathematics, 46(4):594–599, 1945.
- [53] A. Nedić and J. Liu. On convergence rate of weighted-averaging dynamics for consensus problems. IEEE Transactions on Automatic Control, 62(2):766–781, 2017.
- [54] A. Nedić, A. Ozdaglar, and P.A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938, 2010.
- [55] B. Touri. Product of Random Stochastic Matrices and Distributed Averaging. Springer Science & Business Media, 2012.
- [56] J. Liu and B.D.O. Anderson. Communication-efficient distributed algorithms for solving linear algebraic equations over directed graphs. In Proceedings of the 59th IEEE Conference on Decision and Control, pages 5360–5365, 2020.
- [57] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano. Distributed big-data optimization via blockwise gradient tracking. IEEE Transactions on Automatic Control, 66(5):2045–2060, 2021.
- [58] J. Zhu, Y. Lin, J. Liu, and A.S. Morse. Reaching a consensus with limited information. In Proceedings of the 61st IEEE Conference on Decision and Control, pages 4579–4584, 2022.
- [59] J. Zhu, Y. Lin, A. Velasquez, and J. Liu. Resilient constrained consensus over complete graphs via feasibility redundancy. In Proceedings of the American Control Conference, pages 3418–3422, 2022.