Resilient Distributed Optimization
Abstract
This paper considers a distributed optimization problem in the presence of Byzantine agents capable of introducing untrustworthy information into the communication network. A resilient distributed subgradient algorithm is proposed based on graph redundancy and objective redundancy. It is shown that the algorithm causes all non-Byzantine agents’ states to asymptotically converge to the same optimal point under appropriate assumptions. A partial convergence rate result is also provided.
1 Introduction
Distributed optimization has attracted considerable attention and achieved remarkable success in both theory and practice. The distributed convex optimization problem was first studied in [1] where a distributed subgradient algorithm was proposed. After this, various distributed optimization algorithms have been crafted and studied; see survey papers [2, 3, 4]. Distributed optimization techniques are also widely applied to decentralized deep learning [5].
Information exchange between neighboring agents is necessary for a multi-agent network for distributed optimization. However, agents’ states may be corrupted and they may not adhere to the designed algorithm due to faulty processes or external attacks. An agent is called Byzantine if it updates its state in an arbitrary, unknown manner, and can send conflicting values to different neighbors [6]. Such attacking agents can know global information of the network, play arbitrarily and strategically, and even be coordinated. Consider a network of agents in which Byzantine agents exist. An ideal resilient algorithm is one that leads the non-Byzantine (or normal) agents to cooperatively solve the corresponding distributed optimization problem as if the Byzantine agents did not exist. Such a resilient algorithm is highly desirable for the safety and security of multi-agent systems, as faulty processes and external attacks are inevitable.
Resilient distributed optimization has recently received increasing attention, probably originating from the work of [7]. Almost all the existing works cannot guarantee full resilience; what they can guarantee is that all normal agents’ states converge to a bounded neighborhood of the desired optimal point whose bound is not controllable [8, 9, 10, 11, 12], or to an optimal point of an unspecified convex combination of all normal agents’ objective functions [7, 13, 14], or to a convex combination of all normal agents’ local optimal points [15]. The only exceptions are [16, 17, 18], in which the underlying communication graph is assumed to be a complete graph, namely, each agent is allowed to communicate with all other agents. All of [16, 17, 18] rely on the idea of “objective function redundancy”. The idea has also been applied to the federated setting and achieved full resilience [19, 20]. In the federated setting, a central coordinator agent is able to communicate with all worker agents, which is more or less equivalent to a complete graph in the distributed setting (sometimes called the decentralized setting). It is worth noting that [7, 13, 14, 15, 18] only consider special one-dimensional optimization.
Resilient distributed optimization is also related to resilient federated optimization/learning in the coordinator-workers setting (e.g., [20, 21, 22]), which has attracted increasing attention recently. The key problem is how the central coordinator aggregates the received information to eliminate or attenuate the effects of Byzantine worker agents. Various Byzantine-resilient information aggregation methods have been proposed for high-dimensional optimization/learning, focusing on stochastic gradient descent (SGD). Notable among them are [23, 24, 25, 26, 27], just to name a few; see an overview of recent developments in this area in [28]. It is doubtful whether these methods can be applied to achieve full resilience in the distributed setting.
From the preceding discussion, and to the best of our knowledge, a fully resilient distributed optimization algorithm for general non-complete communication graphs does not exist, even for one-dimensional optimization problems. This gap is precisely what we study in this paper. We consider a distributed convex optimization problem in the presence of Byzantine agents and propose a fully resilient distributed subgradient algorithm based on the ideas of objective redundancy (cf. Definition 1) and graph redundancy (cf. Definition 2). The algorithm is shown to cause all non-Byzantine agents’ states to asymptotically converge to the same desired optimal point under appropriate assumptions. The proposed algorithm works theoretically for multi-dimensional optimization but practically not for high-dimensional optimization, as will be explained and discussed in the concluding remarks.
This work is motivated by two recent ideas. The first is the quantified notion of objective function redundancy proposed in [29], where a couple of different definitions of objective redundancy are studied, based on which fully resilient distributed optimization algorithms have been crafted either for a federated setting [29, 19, 20] or a distributed setting over complete graphs [16, 17, 18]; such redundancy has been shown necessary for achieving full resilience in multi-agent optimization [19]. We borrow one notion from [29] and develop it further. It is worth emphasizing that the results in [16, 17, 18] rely on objective redundancy among non-Byzantine agents, whereas ours depend on objective redundancy among all agents. This subtle difference is important for equipping a multi-agent network with a certain level of redundancy at the network design stage, since which agents are non-Byzantine cannot be known a priori.
The second idea is so-called “Byzantine vector consensus” [30, 31] whose goal is, given a set of both Byzantine and non-Byzantine vectors, to pick a vector lying in the convex hull of the non-Byzantine vectors, based on Tverberg’s theorem [32, 33]. The idea has been very recently improved in [34] which can be used to achieve resilient multi-dimensional consensus exponentially fast. Exponential consensus is critical in the presence of diminishing disturbance [35]. We are prompted by this improved idea and make use of a resilient vector picking process, simplified from that of [34, Algorithm 1]. There are other recent approaches appealing to the idea of centerpoint [36, 37]. We expect that these approaches can also be applied to resilient optimization, provided that exponential consensus is guaranteed, e.g., in [37].
2 Problem Formulation
Consider a network consisting of agents, labeled through for the purpose of presentation. The agents are not aware of such global labeling, but can differentiate between their neighbors. The neighbor relations among the agents are characterized by a directed graph whose vertices correspond to agents and whose directed edges (or arcs) depict neighbor relations, where is the vertex set and is the directed edge set. (We use to denote that is a subset of .) Specifically, agent is a neighbor of agent if . Each agent can receive information from its neighbors. Thus, the directions of edges represent the directions of information flow. We use to denote the neighbor set of agent , excluding , i.e., .
Each agent has a “private” convex (not necessarily differentiable) objective function, , only known to agent . There exist Byzantine agents in the network which are able to transmit arbitrary values to others and capable of sending conflicting values to different neighbors at any time. The set of Byzantine agents is denoted by and the set of normal (non-Byzantine) agents is denoted by . Which agents are Byzantine is unknown to normal agents. It is assumed that each agent may have at most Byzantine neighbors.
The goal of the normal agents is to cooperatively minimize the objective functions
We will show that minimizing the above two objective functions can be achieved simultaneously with appropriate redundancy in objective functions (cf. Definition 1 and Corollary 1). It is assumed that the set of optimal solutions to , denoted by , is nonempty and bounded.
Since each is not necessarily differentiable, the gradient descent method may not be applicable. Instead, the subgradient method [38] can be applied. For a convex function , a vector is called a subgradient of at point if
(1) |
Such a vector always exists and may not be unique. In the case when is differentiable at point , the subgradient is unique and equals , the gradient of at . Thus, the subgradient can be viewed as a generalization of the notion of the gradient. From (1) and the Cauchy-Schwarz inequality, , where is an upper bound for the 2-norm of the subgradients of at both and . We will use this fact without special mention in the sequel. Throughout this paper, we use for the 2-norm.
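As a concrete illustration of the subgradient inequality (1), the following sketch verifies it numerically for the hypothetical one-dimensional nondifferentiable convex function f(x) = |x| (an example of ours, not part of the paper's setup); the function names are likewise our own.

```python
# Illustrating the subgradient inequality f(y) >= f(x) + g*(y - x)
# for the convex, nondifferentiable function f(x) = |x|.

def subgradient_abs(x):
    """Return one subgradient of f(x) = |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # at x = 0 any value in [-1, 1] is a valid subgradient


def check_subgradient(f, g, x, ys):
    """Verify f(y) >= f(x) + g*(y - x) for each test point y."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)


x = 0.0
g = subgradient_abs(x)
assert check_subgradient(abs, g, x, [-2.0, -0.5, 0.0, 0.5, 2.0])
```

At the kink x = 0 the subgradient is not unique; the sketch simply picks 0, and the inequality holds for every choice in [-1, 1].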
The subgradient method was first proposed in [38] and the first distributed subgradient method was proposed in [1] for undirected graphs. Its extension to directed graphs has been studied in [39] and recently further analyzed in [40].
2.1 Redundancy
To make the resilient distributed optimization problem solvable, certain redundancy is necessary. We begin with objective redundancy.
Definition 1.
An -agent network is called -redundant, , if for any subsets with , there holds (we use to denote the cardinality of a set )
The above definition of objective redundancy originated in [29, Definition 2]. It has the following properties.
Lemma 1.
If an -agent network is -redundant, then for any subsets with and , there holds
Proof of Lemma 1: Let and . From Definition 1, for any . For each , let . It is easy to see that for each (we use to denote the number of -combinations from a set of elements),
Then,
(2) |
Pick any . From (2),
which implies that , and thus .
To prove the lemma, it is sufficient to prove that . Suppose that, to the contrary, there exists a such that and . Since , from (2),
which is impossible. Therefore, .
The following corollaries are immediate consequences of Lemma 1.
Corollary 1.
If an -agent network is -redundant, then for any subsets with , there holds
Corollary 2.
If an -agent network is -redundant with , then it is -redundant.
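To make the redundancy idea concrete, the following toy sketch uses hypothetical scalar quadratics whose minimizers all coincide, so the minimizer of every subset sum agrees with that of the full sum, whatever the redundancy level; the function and variable names are our own, and the precise quantified definition is the one from [29].

```python
# Numeric sketch of objective redundancy with scalar quadratics
# f_i(x) = (x - c_i)^2: identical minimizers make every subset of
# objectives share the same argmin with the full objective sum.
from itertools import combinations


def argmin_quadratic_sum(centers):
    # argmin of sum_i (x - c_i)^2 is the mean of the centers
    return sum(centers) / len(centers)


N, k = 6, 1
centers = [2.0] * N  # identical minimizers: a fully redundant network
full = argmin_quadratic_sum(centers)

# every subset of N - 2k objectives has the same minimizer as the full sum
for S in combinations(range(N), N - 2 * k):
    sub = argmin_quadratic_sum([centers[i] for i in S])
    assert abs(sub - full) < 1e-12
```

With distinct centers the subset minimizers would generally differ, which is exactly why redundancy must be engineered into the network at the design stage.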
We also need redundancy in graph connectivity.
A vertex in a directed graph is called a root of if for each other vertex of , there is a directed path from to . Thus, is a root of if it is the root of a directed spanning tree of . We will say that is rooted at if is in fact a root. It is easy to see that a rooted graph has a unique strongly connected component whose vertices are all roots of .
Definition 2.
An -reduced graph of a directed graph with vertices, with and , is a subgraph of obtained by first picking any vertex subset with and then removing from each vertex of the subgraph induced by , , arbitrary incoming edges in . A directed graph is called -resilient if all its -reduced graphs are rooted.
It is easy to see that if a directed graph is -resilient, then for any nonnegative and , the graph is also -resilient.
In the case when , the -resilient graph is equivalent to the rooted “reduced graph” in [41], which was used to guarantee resilient one-dimensional consensus; see Definition 4 and Theorem 2 in [41]. Thus, the definition here can be viewed as a simple generalization of the rooted “reduced graph”.
Definition 2 implicitly requires that each vertex of an -resilient graph have at least neighbors. More can be said.
Lemma 2.
If a directed graph is -resilient, then each of its vertices has at least neighbors.
Proof of Lemma 2: Suppose that, to the contrary, there exists a vertex in whose . If , it is easy to see that does not satisfy Definition 2. We thus consider the case when . Let be the set of arbitrary neighbors of vertex , and , where is the vertex set of . (We use to denote the set of elements that are in but not in .) It is clear that , and in the subgraph induced by , , vertex has exactly neighbors. Then, after vertex removes incoming edges in , and each out-neighbor of vertex in (a vertex is called an out-neighbor of vertex if the latter is a neighbor of the former), if any, removes its incoming edge from , vertex becomes isolated. But this is impossible for an -resilient graph.
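Checking -resilience amounts to checking rootedness of every reduced graph. The rootedness test at the core of that check can be sketched as follows (stdlib Python on hypothetical small graphs; the helper names are ours):

```python
# Sketch: test whether a directed graph is rooted, i.e. whether some
# vertex r has a directed path to every other vertex.
from collections import deque


def reachable_from(adj, r):
    """Vertices reachable from r along directed paths (BFS)."""
    seen, queue = {r}, deque([r])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_rooted(adj, vertices):
    return any(reachable_from(adj, r) == set(vertices) for r in vertices)


# directed cycle 1 -> 2 -> 3 -> 1 plus an extra sink vertex 4
adj = {1: [2], 2: [3], 3: [1, 4], 4: []}
assert is_rooted(adj, [1, 2, 3, 4])            # every cycle vertex is a root
assert not is_rooted({1: [], 2: []}, [1, 2])   # two isolated vertices
```

A full resilience check would repeat this test over all reduced graphs obtained per Definition 2, which is combinatorial in the graph size.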
3 Algorithm
To describe our algorithm, we need the following notation.
Let denote the collection of all those subsets of whose cardinality is . It is obvious that the number of all such subsets is
(3) |
and label them . For each , let denote the collection of all those subsets of whose cardinality is . For any subset of agents , let denote the convex hull of all , .
Algorithm: At each discrete time , each agent first picks an arbitrary point
(4) |
for each , and then updates its state by setting
(5) | ||||
(6) |
where is the stepsize, and is a subgradient of .
In the special one-dimensional case with , it is not hard to check that steps (4) and (5) simplify to the resilient scalar consensus algorithm in [41], which is essentially equivalent to the trimmed mean method and has been improved in [42].
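A minimal sketch of this one-dimensional special case, in the spirit of the trimmed-mean method just mentioned: each normal agent discards the b largest and b smallest received neighbor values before averaging them with its own state. The uniform averaging weights are our own simplifying assumption for illustration; the algorithm itself only requires convex-combination weights.

```python
# One-dimensional resilient consensus step via trimmed mean: drop the
# b extreme values on each side, then average the rest with own state.

def trimmed_mean_update(own_state, neighbor_values, b):
    vals = sorted(neighbor_values)
    kept = vals[b:len(vals) - b]       # requires len(vals) > 2*b
    pts = kept + [own_state]
    return sum(pts) / len(pts)


# an agent with state 1.0, four honest-looking values and one outlier
x_next = trimmed_mean_update(1.0, [0.9, 1.1, 1.0, 1.2, 100.0], b=1)
assert 0.9 <= x_next <= 1.2  # the outlier cannot drag the state outside
```

The key resilience property visible here is that the new state stays inside the range of the honest values whenever at most b neighbors are Byzantine.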
The convergence and correctness of the proposed algorithm rely on the following assumptions.
Assumption 1.
has a nonempty interior.
It is easy to see that Assumption 1 implies that is differentiable at any , where denotes the interior of a set. More can be said.
Lemma 3.
Under Assumption 1, if the -agent network is -redundant with , then is differentiable at with for all and .
Proof of Lemma 3: Since is nonempty, for any , there exist a positive number and an open ball in centered at with radius , denoted as . Let be a vector in whose th entry is and the remaining entries all equal zero. Since for sufficiently small ,
(7) |
For each , since is convex, both and exist and for all [43, Theorem 24.1]. It follows that
Note that from (7),
Thus, , i.e., exists for all and .
To proceed, let for all . From Corollary 1, . Since , both and are differentiable at , implying that for all and . Since , for all and . Note that this holds for all . From [44, Section 8.4.2], is differentiable at with for all .
Lemma 3 has the following important implication.
Corollary 3.
Under Assumption 1, if the -agent network is -redundant with , then for all ,
Corollary 3 immediately implies that
Proof of Corollary 3: Suppose that, to the contrary, there exist and such that . Pick a . From Lemma 3, . It is then clear that . Let . From Corollary 1, , and thus . It follows that , which contradicts the fact that .
Assumption 2.
The subgradients of all , , are uniformly bounded, i.e., there exists a positive number such that for all and .
Assumption 3.
The step-size sequence is positive, non-increasing, and satisfies and .
The above two assumptions are standard for subgradient methods.
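For instance, the classic diminishing step size 1/(t+1) satisfies Assumption 3: it is positive and non-increasing, its sum diverges (the harmonic series), and the sum of its squares converges. A finite-horizon numerical check (illustrative only; the horizon and thresholds are our own choices):

```python
# Finite-horizon check of the standard step-size conditions for
# alpha_t = 1/(t+1): positivity, monotonicity, divergent sum,
# convergent sum of squares (bounded by pi^2/6).
import math

T = 10**5
alphas = [1.0 / (t + 1) for t in range(T)]

assert all(a > 0 for a in alphas)
assert all(alphas[t] >= alphas[t + 1] for t in range(T - 1))
assert sum(alphas) > 11.0                        # ~ln(T): grows without bound
assert sum(a * a for a in alphas) < math.pi**2 / 6
```

Of course a finite check cannot prove divergence; the partial sum growing like ln(T) is the standard argument.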
To state our main results, we need the following concepts. For a directed graph , we use to denote the set of all -reduced graphs of . For a rooted graph , we use to denote the size of the unique strongly connected component whose vertices are all roots of ; in other words, equals the number of roots of . For any -resilient graph , let
which is well-defined and denotes the smallest possible number of roots in any -reduced graphs of .
Theorem 1.
The following example shows that -redundancy is necessary. For simplicity, set . Consider a 4-agent network whose neighbor graph is the 4-vertex complete graph , which is -resilient. Suppose that agent 4 is Byzantine and the other three are normal. It is possible that, with a carefully crafted attack strategy of the Byzantine agent, the three normal agents update their states in a manner mathematically equivalent to the case in which their neighbor graph is the 3-vertex -reduced graph with the arc set , which is rooted (cf. Lemma 6). In this case, since vertex 1 is the only root and agent 1 does not have any neighbor, it follows the single-agent subgradient algorithm, and thus its state will converge to a minimum point of , denoted . Since all normal agents will eventually reach a consensus (cf. Lemma 9), the states of agents 2 and 3 will also converge to . To guarantee that the resilient distributed optimization problem is solvable in this case, there must hold , , which implies that the network needs to be 3-redundant. It is easy to see that , and thus .
Theorem 1 shows that the proposed algorithm achieves full resilience. We next partially characterize the convergence rate of the algorithm.
Theorem 2.
The existing distributed optimization literature (without Byzantine agents) typically characterizes convergence rates by bounding the difference between and . The above theorem can be viewed as a “partial” convergence rate result in that it only accounts for a subset of normal agents and a time subsequence over a finite horizon. Notwithstanding this, it is worth noting that is equivalent to in the setting here with Byzantine agents (cf. Corollary 1) and that . Therefore, the theorem still to some extent evaluates the convergence rate of the resilient distributed subgradient algorithm under consideration. It is well known that the optimal convergence rate of subgradient methods for convex optimization is . Whether converges at this optimal rate has so far eluded us.
Theorem 2 is an immediate consequence of the following proposition.
Proposition 1.
The proposition further quantifies a trade-off between the number of normal agents in and the length of the time subsequence . Roughly speaking, the fewer the normal agents involved in (8), the denser the time subsequence would be. In the special case when , the proposition simplifies to the following corollary.
4 Analysis
This section provides the analysis of the algorithm and proofs of the main results.
4.1 Algorithm Feasibility
From Lemma 2, -resilience guarantees that each agent has at least neighbors at each time . Thus, each in the algorithm is always nonempty.
We next show that in (4) always exists. To this end, we need the following well-known theorem by Helly.
Lemma 4.
(Helly’s Theorem [45]) Let be a finite collection of convex sets in with . If the intersection of every of these sets is nonempty, then the whole collection has a nonempty intersection, i.e., .
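Helly's theorem is easiest to see in dimension one, where convex sets are intervals and the theorem says that pairwise-intersecting intervals share a common point. A quick check on a hypothetical family (data and names are ours):

```python
# Helly's theorem for d = 1: if every 2 of a family of intervals in R
# intersect, then the whole family has a nonempty intersection.
from itertools import combinations

intervals = [(0, 5), (2, 7), (4, 9), (3, 6)]


def intersect(ivs):
    """Nonempty common intersection of closed intervals (lo, hi)."""
    lo = max(a for a, _ in ivs)
    hi = min(b for _, b in ivs)
    return lo <= hi


# pairwise intersections are all nonempty ...
assert all(intersect(pair) for pair in combinations(intervals, 2))
# ... so Helly guarantees a common point; here the interval [4, 5] works
assert intersect(intervals)
```

In dimension d the hypothesis requires every d+1 sets to intersect, which is exactly how Lemma 5 below applies the theorem.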
Lemma 5.
For any and , there holds .
Proof of Lemma 5: From Lemma 4, it is sufficient to prove that the intersection of every sets in is nonempty. Pick any with . For each , from the definition of , is the convex hull of distinct points. Since , the intersection involves in total points (with repetition). Recall that all these points are agents’ states at time . Thus, each of them can be written as with being an index in . From the definition of , it is easy to see that . Note that
Then, the pigeonhole principle (or Dirichlet’s box principle) guarantees that among the total indices, at least one index in , say , repeats at least times. Since for each , there is no repetition of indices when computing , there must exist different sets for which involves the computation of and thus for all . Since , . It follows that .
4.2 Dynamics of Normal Agents
To analyze the performance of the algorithm, it is important to understand the dynamics of normal agents and “decouple” the influence of Byzantine agents. The following lemma serves this purpose.
Lemma 6.
in (5) can be expressed as a convex combination of and , ,
where and are nonnegative numbers satisfying , and there exists a positive constant such that for all and , and among all , , at least of them are bounded below by .
Since each agent is assumed to have at most Byzantine neighbors, the lemma immediately implies that at least among all , , are bounded below by , which has been reported in [34, Theorem 1]. In the special case when , the lemma simplifies to Claim 2 in [46], which directly implies Proposition 5.1 in [13]. Thus, the lemma can be regarded as a generalization of [34, Theorem 1], [46, Claim 2], and [13, Proposition 5.1].
Proof of Lemma 6: Recall that there are at most Byzantine neighbors in , whose cardinality is , and that is the collection of those subsets of whose cardinality is ; hence there must exist an index set such that . For any such index set , since , it follows that . Let for each and . From the preceding, is always nonempty, and for every ,
(9) |
where are nonnegative weights satisfying . It is clear that at least one of is positive and at least .
It is easy to see that can be rewritten as
(10) |
Our reason for rewriting in this way will become clear shortly. Since each , the above expression is a convex combination of all , , allowing some weights to be zero. Specifically, defining for each and ,
Then,
in which
(11) |
It is clear that for all and . Since
from (10),
(12) |
Note that is the collection of all subsets of with cardinality being . It follows that is the collection of all subsets of with cardinality being . Thus, (12) implies that is a convex combination of all possible expressions of in terms of (9). Since both and are positive, as long as a in (12) is positive, is positive with .
We claim that the number of those indices such that for at least one is at least . To prove this, suppose that, to the contrary, the number of such indices is no larger than . That is, there are at least indices in whose corresponding for all , . Pick exactly of them and form an index set . It is clear that for some . So all , , must be included in the right-hand side of (12) and are strictly less than . But this is impossible because (9) asserts that .
From the preceding, there exist at least indices for which in (11) is no less than for at least one . For each of such , from (11),
Set
(13) |
Since due to (3), must be positive and independent of and . The statement of the lemma then immediately follows.
Without loss of generality, we label all normal agents from to in the sequel.
4.3 Consensus
We first study the infinite product of stochastic matrices .
The graph of an matrix , denoted , is a directed graph with vertices and a directed edge from vertex to vertex whenever the -th entry of the matrix is nonzero.
Lemma 7.
Proof of Lemma 7: From Lemma 6, for all , and at least among all , , are positive and uniformly bounded below by given in (13). The expression “at least among all ” implies that the graph of can be obtained by removing from each vertex of the subgraph of induced by , , at most unspecified incoming edges in , which is an -reduced graph of . Since , must be -resilient. Thus, any -reduced graph of is rooted, and so is .
For any infinite sequence of stochastic matrices with the property in Lemma 7, their product has the following result.
Lemma 8.
Let be an infinite sequence of stochastic matrices, each of whose graphs has a rooted spanning subgraph. If all the diagonal entries and those off-diagonal entries of corresponding to the rooted spanning subgraphs are uniformly bounded below by a positive number , then the product converges to a rank one matrix of the form exponentially fast, where is a column vector. (We use to denote the vector whose entries all equal , with the dimension understood from the context.)
To prove the lemma, we need the concept of the “composition” of directed graphs. The composition of two directed graphs , with the same vertex set, denoted by , is the directed graph with the same vertex set and arc set defined so that is an arc in the composition whenever there is a vertex such that is an arc in and is an arc in . Since this composition is an associative binary operation, the definition extends unambiguously to any finite sequence of directed graphs with the same vertex set. Composition and matrix multiplication are closely related. In particular, the graph of the product of two nonnegative matrices is equal to the composition of the graphs of the two matrices comprising the product. In other words, . If we focus exclusively on graphs with self-arcs at all vertices, the definition of composition implies that the arcs of both and are arcs of ; the converse is false.
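The correspondence between graph composition and matrix multiplication can be sketched as a boolean matrix product over arc sets. The arc-orientation convention below is one possible choice, and the helper names are ours:

```python
# Composition of two directed graphs on the same vertex set {0,...,n-1},
# realized as a boolean matrix product over arc sets.

def compose(g1, g2, n):
    """(i, j) is an arc of the composition iff (i, k) is in g1 and
    (k, j) is in g2 for some intermediate vertex k."""
    return {(i, j)
            for i in range(n) for j in range(n)
            if any((i, k) in g1 and (k, j) in g2 for k in range(n))}


n = 3
# both graphs carry self-arcs at every vertex, as the text assumes
self_arcs = {(v, v) for v in range(n)}
g1 = self_arcs | {(0, 1)}
g2 = self_arcs | {(1, 2)}

comp = compose(g1, g2, n)
# with self-arcs, the arcs of both graphs survive in the composition ...
assert g1 <= comp and g2 <= comp
# ... and a new two-hop arc appears: 0 -> 1 in g1, then 1 -> 2 in g2
assert (0, 2) in comp
```

The last two assertions mirror the text's remark: with self-arcs, arcs of both factors are arcs of the composition, while the composition may contain strictly more arcs.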
To proceed, for any nonnegative matrix , let
It is easy to see that if is a substochastic matrix (a square nonnegative matrix is called substochastic if its row sums are all equal to or less than one), . In the case when is a stochastic matrix, is called the coefficient of ergodicity [47, page 137]. A stochastic matrix is called a scrambling matrix if and only if [48], and its graph is sometimes called “neighbor-shared” [49]. It is natural to call a vertex a neighbor of vertex in a directed graph if is an arc in . A directed graph is called neighbor-shared if each set of two distinct vertices share a common neighbor. The composition of any set of rooted graphs with vertices and self-arcs at all vertices is neighbor-shared [50, Proposition 8]. It is easy to check that for any nonnegative square matrix , if and only if its graph is neighbor-shared. Moreover, for any two nonnegative square matrices and , if (for any two real matrices and of the same size, we write if for all and ), then .
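The coefficient of ergodicity admits the standard closed form tau(A) = 1 - min over row pairs (i, j) of the overlap sum_k min(a_ik, a_jk); A is scrambling exactly when tau(A) < 1, i.e. every two rows put weight on some common column. A small sketch with hypothetical matrices:

```python
# Coefficient of ergodicity for a stochastic matrix, and the scrambling
# test tau(A) < 1 (every pair of rows shares a positively weighted column).
from itertools import combinations


def tau(A):
    n = len(A)
    overlap = min(sum(min(A[i][k], A[j][k]) for k in range(len(A[0])))
                  for i, j in combinations(range(n), 2))
    return 1.0 - overlap


scrambling = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]]   # every two rows share a column
not_scrambling = [[1.0, 0.0],
                  [0.0, 1.0]]    # rows of the identity share no column

assert tau(scrambling) < 1.0
assert tau(not_scrambling) == 1.0
```

Products of scrambling matrices contract disagreement at rate tau, which is the mechanism behind the exponential convergence in Lemma 8.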
Proof of Lemma 8: Since the graph of each is rooted with self-arcs, by Proposition 8 in [50], the graph of the product of any finite sequence of matrices of length is neighbor-shared, which implies that the product is a scrambling matrix. Thus, letting for each , each is scrambling and its graph is neighbor-shared. Since the graph of each has a rooted spanning subgraph with self-arcs whose corresponding entries in are bounded below by a positive number , the graph of each has a neighbor-shared spanning subgraph with self-arcs whose corresponding entries in are bounded below by a positive number . Let be the matrix whose th entry is the th entry of if is an arc in and zero otherwise. Then, each is a substochastic matrix whose graph is neighbor-shared. Since all positive entries of are bounded below by , . Since , . With this uniform upper bound on all , the lemma is thus an immediate consequence of Lemma 3 in [48].
An infinite sequence of stochastic matrices is called ergodic if for all , where each is a stochastic vector (a vector is called a stochastic vector if its entries are nonnegative and sum to one). From Lemmas 7 and 8, the sequence of stochastic matrices is ergodic, and any infinite product of matrices converges to a rank one matrix exponentially fast. Using the same argument as in the proof of Lemma 8,
for all , where is given in (13). Following the same argument as pages 610–611 in [49], the product converges to a rank one matrix as exponentially fast at a rate no slower than
(18) |
Proposition 2.
To prove the proposition, we need the following concept.
Definition 3.
Let be a sequence of stochastic matrices. A sequence of stochastic vectors is an absolute probability sequence for if for all .
This definition was first introduced by Kolmogorov [51]. It was shown by Blackwell [52] that every sequence of stochastic matrices has an absolute probability sequence. In general, a sequence of stochastic matrices may have more than one absolute probability sequence; when the sequence of stochastic matrices is ergodic, it has a unique absolute probability sequence [53, Lemma 1]. It is easy to see that when is a fixed irreducible stochastic matrix , is simply the normalized left eigenvector of for eigenvalue one, and when is an ergodic sequence of doubly stochastic matrices (a square nonnegative matrix is called doubly stochastic if its row sums and column sums all equal one), .
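For the fixed-matrix special case just mentioned, the normalized left eigenvector for eigenvalue one can be computed by power iteration. A sketch on a hypothetical 2-by-2 irreducible, aperiodic stochastic matrix (names and data are ours):

```python
# Left eigenvector pi of a fixed irreducible stochastic matrix A for
# eigenvalue one (pi^T A = pi^T), via power iteration from the uniform
# stochastic vector. Convergence holds for ergodic chains.

def left_eigenvector(A, iters=200):
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi


A = [[0.5, 0.5],
     [0.25, 0.75]]
pi = left_eigenvector(A)

# pi is stochastic and satisfies pi^T A = pi^T
assert abs(sum(pi) - 1.0) < 1e-9
out = [sum(pi[i] * A[i][j] for i in range(2)) for j in range(2)]
assert all(abs(out[j] - pi[j]) < 1e-9 for j in range(2))
# for this particular A the stationary vector is (1/3, 2/3)
assert abs(pi[0] - 1 / 3) < 1e-6 and abs(pi[1] - 2 / 3) < 1e-6
```

For a genuinely time-varying ergodic sequence, the absolute probability sequence is not computable this way; the sketch only covers the constant-matrix case noted in the text.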
From the preceding, the sequence of stochastic matrices in (16) is ergodic. Thus, has a unique absolute probability sequence . From Lemma 1 in [53],
(19) |
Let with . Then, there exists a positive constant such that for all and ,
(20) |
where denotes the th entry of a matrix and is given in (18). Using the same argument as in the proof of Lemma 2 in [39], .
Proof of Lemma 9: For all ,
From (21), for all ,
Set . Then, using (20) and Assumption 2, for ,
(22) | ||||
(23) |
where and denote the floor and ceiling functions, respectively. It follows that .
4.4 Convergence
Proof of Lemma 10: From (22), for any ,
Since and for any and , then for any ,
Therefore,
which completes the proof.
It is known that under some “standard” assumptions, if the graphs of an ergodic sequence of stochastic matrices are “uniformly strongly connected”, all the entries of its unique absolute probability sequence are uniformly bounded below by a positive constant [55, Theorem 4.8]. This is not the case here, as the sequence of graphs of the matrices may not be “uniformly strongly connected”.
Proposition 3.
If is -resilient, then for any , there exists a subset with for which all , , are bounded below by a positive number.
To prove the proposition, we need the following concept and results.
We say that a directed graph is strongly rooted at vertex if each other vertex of is reachable from vertex along a directed path of length 1; that is, is strongly rooted at if is a neighbor of every other vertex in the graph. A directed graph is called strongly rooted if it has at least one vertex at which it is strongly rooted. For a square nonnegative matrix, its graph is strongly rooted at if and only if its th column is strictly positive. Moreover, for any directed graphs with vertices and self-arcs which are all rooted at the same vertex , their composition is strongly rooted at [50, Proposition 3].
Proof of Proposition 3: From Lemma 7, the graph of each is rooted. It is clear that the number of roots in the graph of each is at least ; that is, for all time . Let
(24) |
We claim that for any finite sequence of matrices of length , there exists a subset , depending on the sequence, with such that each is a root of the graph of some for at least times. To prove the claim, suppose that, to the contrary, such a subset does not exist; that is, at most vertices in are a root of the graph of some for at least times. For the remaining vertices, each is a root of the graph of some for at most times. Then, the total number of roots of all the graphs of the sequence of matrices of length is no larger than . Meanwhile, since for all time , this total number of roots is at least , which implies that , and thus . But this contradicts (24). Therefore, the claim is true.
The above claim immediately implies that for any time and the corresponding -length stochastic matrix sequence starting at time , , there exists a subset with such that each is a root of at least graphs among . From Lemma 7, the graph of each is rooted with self-arcs. Using two facts about directed graphs with self-arcs, namely that the arcs of each graph in a graph sequence are arcs of their composition, and that the composition of any graphs with vertices all rooted at the same vertex is strongly rooted at [50, Proposition 3], it follows that for any time , the graph of is strongly rooted at each vertex . Then, each product has strictly positive columns whose entries are uniformly bounded below by a positive number , where is defined in (13). Since each is a stochastic matrix, it is easy to see that for any , each product has strictly positive columns whose entries are uniformly bounded below by . Then, the statement of the proposition follows from (19).
Proof of Theorem 1: From (15) and (1), for any ,
Since each is a stochastic matrix and the 2-norm is convex, . Then, with Assumption 2 and the fact that for all and ,
(25) |
Note that
which implies that
From (25),
(26) | |||||
Thus,
From Corollary 3, for all . Then, for all and . From Proposition 3 and its proof, for any , there exists a subset with for which all , , are uniformly bounded below by . It follows that
(27) |
Since and for all and ,
which implies that (we use to denote the Euclidean distance between a point and a set in )
From Corollary 1, for all time . Then,
which implies that , and thus
(28) |
We next show that all the sequences , , converge to the same optimal point.
From (26), rearranging the terms and fixing an arbitrary period from time to with ,
From Assumption 2 and Lemma 10,
Thus, the sequence is convergent for each . From Proposition 3 and its proof,
which implies that the sequence is bounded, so is each sequence , . From Proposition 2, all the sequences , , are bounded. From Lemma 9, the sequence is bounded, and with each being a stochastic vector, the sequence is convergent for each . Since is bounded, from (28), there exists a subsequence of converging to a point . Since is convergent, the sequence converges to . From Lemma 9, all the sequences , converge to the same optimal point .
Proof of Proposition 1: From Proposition 3, for any , there exists a subset with for which all , , are bounded below by a positive number. It is clear that for all . Fix any integer . Define
which represents a subset of normal agents, with an appropriate cardinality, that appears most often within the time steps. It is easy to see that is well-defined and nonempty.
We next count the number of times such a set appears within the time steps. To this end, define
which represents this number. Consider the set whose cardinality is
Note that and, for each , there exists at least one set in that is a subset of . By the pigeonhole principle, at least one set in appears at least times within the time steps. Since is a set in that appears most often during the total time steps, .
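The counting step is the standard pigeonhole argument: if each of T time steps selects one set out of at most C candidates, some candidate must be selected at least ceil(T/C) times. A minimal sketch (the candidate sets, T, and the selection sequence below are hypothetical, not the paper's):

```python
# Pigeonhole sketch: among T selections from C candidate sets, the most
# frequent candidate appears at least ceil(T/C) times. The selection
# sequence below is arbitrary/hypothetical.
import math
from collections import Counter

candidates = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]  # C = 3
T = 10
# An arbitrary sequence of T selections (standing in for the sets
# produced at each time step):
selections = [candidates[k % 3] for k in range(T)]

counts = Counter(selections)
most_common_set, times = counts.most_common(1)[0]
print(times, math.ceil(T / len(candidates)))
assert times >= math.ceil(T / len(candidates))
```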
To proceed, similarly to (27), for any ,
(29)
Substituting into (26) and (29), it follows that for any ,
(30)
Since all , are convex functions,
(31)
From Corollary 3, for all . Then, for all and . Since for all , it follows from (31), (30), and that
(32)
Note that for all , there holds
With this simple fact and , it follows from (23) that for all ,
(33)
which implies that
(34)
From (1) and Assumption 2, for all , it holds that
which, with (34), implies that
(35)
5 Concluding Remarks
This paper has proposed a distributed subgradient algorithm which achieves full resilience in the presence of Byzantine agents, given appropriate redundancy in both graph connectivity and objective functions. The algorithm and convergence results can easily be extended to time-varying neighbor graphs, provided that the neighbor graph is -resilient at all times. One immediate next step is to relax Assumption 1, possibly by appealing to gradient descent for differentiable convex functions. The concepts and tools developed in this paper are expected to be applicable to other consensus-based distributed optimization and computation problems.
Although the algorithm theoretically works for multi-dimensional convex optimization, it has the following limitations which preclude its applicability to high-dimensional optimization. First, from Lemma 2, the algorithm implicitly requires that each agent have at least neighbors, which is impossible in high dimensions. Second, picking a point in the intersection of multiple convex hulls (cf. step (4) of the algorithm) can be computationally expensive in high dimensions, although this issue has been attenuated in [34, Algorithm 2] and [37, Section 5.1]. Last, building -resilient graphs is not an easy task, especially when or is large. Another practical issue of the algorithm, independent of dimension, is how to measure and establish objective function redundancy. Studies of -resilient graphs and -redundant multi-agent networks are of independent interest.
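In one dimension the hull-intersection step becomes cheap: the convex hull of a finite set of received values is an interval, and a classical fact is that after discarding the f largest and f smallest received values, every surviving value lies in the convex hull of the values sent by normal neighbors, provided at most f neighbors are Byzantine. The sketch below illustrates this 1-D stand-in; the values and f are made up, and this is not the paper's multi-dimensional step (4).

```python
# 1-D illustration of trimming as a cheap stand-in for intersecting
# convex hulls: drop the f largest and f smallest received values.
# With at most f Byzantine neighbors, every surviving value lies in the
# interval spanned by the normal neighbors' values. All numbers are made up.

def trim(values, f):
    """Discard the f smallest and f largest entries."""
    s = sorted(values)
    return s[f:len(s) - f]

f = 1
normal_values = [2.0, 2.5, 3.0, 4.0]
byzantine_values = [100.0]          # one attacker sends an outlier
received = normal_values + byzantine_values

kept = trim(received, f)
print(kept)  # prints [2.5, 3.0, 4.0]
lo, hi = min(normal_values), max(normal_values)
assert all(lo <= v <= hi for v in kept)  # survivors stay in the normal hull
```

The reason this works: any surviving value above the largest normal value would have to be among the (at most f) Byzantine outliers on that side, all of which are removed by trimming the f largest; the symmetric argument handles the lower end.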
Considering that distributed optimization algorithms in machine learning are nowadays frequently high-dimensional, there is ample motivation to design fully resilient high-dimensional distributed optimization algorithms. A future direction is to tackle this challenging problem by combining the proposed algorithm with communication-efficient schemes in which each agent transmits only low-dimensional signals. Possible approaches include entry-wise or block-wise updating [56, 57], limited information fusion [58], and dimension-independent filtering [16, 59].
References
- [1] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
- [2] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K.H. Johansson. A survey of distributed optimization. Annual Reviews in Control, 47:278–305, 2019.
- [3] A. Nedić and J. Liu. Distributed optimization for control. Annual Review of Control, Robotics, and Autonomous Systems, 1:77–103, 2018.
- [4] D.K. Molzahn, F. Dörfler, H. Sandberg, S.H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei. A survey of distributed optimization and control algorithms for electric power systems. IEEE Transactions on Smart Grid, 8(6):2941–2962, 2017.
- [5] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [6] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
- [7] L. Su and N.H. Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434, 2016.
- [8] K. Kuwaranancharoen, L. Xin, and S. Sundaram. Byzantine-resilient distributed optimization of multi-dimensional functions. In Proceedings of the 2020 American Control Conference, pages 4399–4404, 2020.
- [9] Z. Yang and W.U. Bajwa. ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning. IEEE Transactions on Signal and Information Processing over Networks, 5(4):611–627, 2019.
- [10] C. Fang, Z. Yang, and W.U. Bajwa. BRIDGE: Byzantine-resilient decentralized gradient descent. IEEE Transactions on Signal and Information Processing over Networks, 8:610–626, 2022.
- [11] L. He, S.P. Karimireddy, and M. Jaggi. Byzantine-robust decentralized learning via self-centered clipping. 2022. arXiv:2202.01545 [cs.LG].
- [12] Z. Wu, T. Chen, and Q. Ling. Byzantine-resilient decentralized stochastic optimization with robust aggregation rules. 2022. arXiv:2206.04568 [cs.DC].
- [13] S. Sundaram and B. Gharesifard. Distributed optimization under adversarial nodes. IEEE Transactions on Automatic Control, 64(3):1063–1076, 2018.
- [14] L. Su and N.H. Vaidya. Byzantine-resilient multiagent optimization. IEEE Transactions on Automatic Control, 66(5):2227–2233, 2020.
- [15] C. Zhao, J. He, and Q.-G. Wang. Resilient distributed optimization algorithm against adversarial attacks. IEEE Transactions on Automatic Control, 65(10):4308–4315, 2020.
- [16] N. Gupta, T.T. Doan, and N.H. Vaidya. Byzantine fault-tolerance in decentralized optimization under -redundancy. In Proceedings of the 2021 American Control Conference, pages 3632–3637, 2021.
- [17] N. Gupta and N.H. Vaidya. Byzantine fault-tolerance in peer-to-peer distributed gradient-descent. 2021. arXiv:2101.12316 [cs.DC].
- [18] M. Kaheni, E. Usai, and M. Franceschelli. Resilient constrained optimization in multi-agent systems with improved guarantee on approximation bounds. IEEE Control Systems Letters, 6:2659–2664, 2022.
- [19] S. Liu, N. Gupta, and N.H. Vaidya. Approximate Byzantine fault-tolerance in distributed optimization. In Proceedings of the 2021 ACM Symposium on Principles of Distributed Computing, pages 379–389, 2021.
- [20] N. Gupta, T.T. Doan, and N. Vaidya. Byzantine fault-tolerance in federated local SGD under -redundancy. 2021. arXiv:2108.11769 [cs.DC].
- [21] D. Data, L. Song, and S.N. Diggavi. Data encoding for Byzantine-resilient distributed optimization. IEEE Transactions on Information Theory, 67(2):1117–1140, 2021.
- [22] C.A. Uribe, H.-T. Wai, and M. Alizadeh. Resilient distributed optimization algorithms for resource allocation. In Proceedings of the 58th IEEE Conference on Decision and Control, pages 8341–8346, 2019.
- [23] Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):Article No. 44, 2017.
- [24] P. Blanchard, E.M. El Mhamdi, R. Guerraoui, and J. Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [25] D. Alistarh, Z. Allen-Zhu, and J. Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [26] L. Su and J. Xu. Securing distributed gradient descent in high dimensional statistical learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):Article No. 12, 2019.
- [27] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos. DRACO: Byzantine-resilient distributed training via redundant gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 903–912, 2018.
- [28] Z. Yang, A. Gang, and W.U. Bajwa. Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model. IEEE Signal Processing Magazine, 37(3):146–159, 2020.
- [29] N. Gupta and N. H. Vaidya. Resilience in collaborative optimization: Redundant and independent cost functions. 2020. arXiv:2003.09675v2 [cs.DC].
- [30] N.H. Vaidya and V.K. Garg. Byzantine vector consensus in complete graphs. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 65–73, 2013.
- [31] N.H. Vaidya. Iterative Byzantine vector consensus in incomplete graphs. In Proceedings of the 15th International Conference on Distributed Computing and Networking, pages 14–28, 2014.
- [32] H. Tverberg. A generalization of Radon’s theorem. Journal of the London Mathematical Society, 41:123–128, 1966.
- [33] P.K. Agarwal, M. Sharir, and E. Welzl. Algorithms for center and Tverberg points. In Proceedings of the 20th Annual Symposium on Computational Geometry, pages 61–67, 2004.
- [34] X. Wang, S. Mou, and S. Sundaram. Resilience for distributed consensus with constraints. 2022. arXiv:2206.05662 [eess.SY].
- [35] J. Liu, T. Başar, and A. Nedić. Input-output stability of linear consensus processes. In Proceedings of the 55th IEEE Conference on Decision and Control, pages 6978–6983, 2016.
- [36] W. Abbas, M. Shabbir, J. Li, and X. Koutsoukos. Resilient distributed vector consensus using centerpoint. Automatica, 136:110046, 2022.
- [37] J. Yan, X. Li, Y. Mo, and C. Wen. Resilient multi-dimensional consensus in adversarial environment. Automatica, 145:110530, 2022.
- [38] B. Polyak. A general method for solving extremum problems. Doklady Akademii Nauk, 8(3):593–597, 1967.
- [39] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
- [40] Y. Lin and J. Liu. Subgradient-push is of the optimal convergence rate. In Proceedings of the 61st IEEE Conference on Decision and Control, pages 5849–5856, 2022.
- [41] N.H. Vaidya, L. Tseng, and G. Liang. Iterative approximate Byzantine consensus in arbitrary directed graphs. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 365–374, 2012.
- [42] H.J. LeBlanc, H. Zhang, X. Koutsoukos, and S. Sundaram. Resilient asymptotic consensus in robust networks. IEEE Journal on Selected Areas in Communications, 31(4):766–781, 2013.
- [43] R.T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
- [44] V.A. Zorich and R. Cooke. Mathematical Analysis I. Mathematical Analysis. Springer, 2004.
- [45] E. Helly. Über Mengen konvexer Körper mit gemeinschaftlichen Punkten. Jahresbericht der Deutschen Mathematiker-Vereinigung, 32:175–176, 1923.
- [46] N. Vaidya. Matrix representation of iterative approximate Byzantine consensus in directed graphs. Technical report, University of Illinois at Urbana-Champaign, 2012. arXiv:1203.1888 [cs.DC].
- [47] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, 2006.
- [48] J. Hajnal and M.S. Bartlett. Weak ergodicity in non-homogeneous Markov chains. Mathematical Proceedings of the Cambridge Philosophical Society, 54:233–246, 1958.
- [49] M. Cao, A.S. Morse, and B.D.O. Anderson. Reaching a consensus in a dynamically changing environment: Convergence rates, measurement delays, and asynchronous events. SIAM Journal on Control and Optimization, 47(2):601–623, 2008.
- [50] M. Cao, A.S. Morse, and B.D.O. Anderson. Reaching a consensus in a dynamically changing environment: A graphical approach. SIAM Journal on Control and Optimization, 47(2):575–600, 2008.
- [51] A. Kolmogoroff. Zur Theorie der Markoffschen Ketten. Mathematische Annalen, 112(1):155–160, 1936.
- [52] D. Blackwell. Finite non-homogeneous chains. Annals of Mathematics, 46(4):594–599, 1945.
- [53] A. Nedić and J. Liu. On convergence rate of weighted-averaging dynamics for consensus problems. IEEE Transactions on Automatic Control, 62(2):766–781, 2017.
- [54] A. Nedić, A. Ozdaglar, and P.A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938, 2010.
- [55] B. Touri. Product of Random Stochastic Matrices and Distributed Averaging. Springer Science & Business Media, 2012.
- [56] J. Liu and B.D.O. Anderson. Communication-efficient distributed algorithms for solving linear algebraic equations over directed graphs. In Proceedings of the 59th IEEE Conference on Decision and Control, pages 5360–5365, 2020.
- [57] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano. Distributed big-data optimization via blockwise gradient tracking. IEEE Transactions on Automatic Control, 66(5):2045–2060, 2021.
- [58] J. Zhu, Y. Lin, J. Liu, and A.S. Morse. Reaching a consensus with limited information. In Proceedings of the 61st IEEE Conference on Decision and Control, pages 4579–4584, 2022.
- [59] J. Zhu, Y. Lin, A. Velasquez, and J. Liu. Resilient constrained consensus over complete graphs via feasibility redundancy. In Proceedings of the American Control Conference, pages 3418–3422, 2022.