
Towards Practical Overlay Networks for Decentralized Federated Learning

Yifan Hua1, Jinlong Pang1, Xiaoxue Zhang1, Yi Liu1, Xiaofeng Shi1, Bao Wang2, Yang Liu1, Chen Qian1
1University of California Santa Cruz, 2University of Utah
{yhua294, pang14, xzhan330, yliu634, xshi24, yangliu, cqian12}@ucsc.edu [email protected]
Abstract

Decentralized federated learning (DFL) uses peer-to-peer communication to avoid the single point of failure problem in federated learning and has been considered an attractive solution for machine learning tasks on distributed devices. We provide the first solution to a fundamental network problem of DFL: what overlay network should DFL use to achieve fast training of highly accurate models, low communication, and decentralized construction and maintenance? Overlay topologies of DFL have been investigated, but no existing DFL topology includes decentralized protocols for network construction and topology maintenance. Without these protocols, DFL cannot run in practice. This work presents an overlay network, called FedLay, which provides fast training and low communication cost for practical DFL. FedLay is the first solution for constructing near-random regular topologies in a decentralized manner and maintaining the topologies under node joins and failures. Experiments based on prototype implementation and simulations show that FedLay achieves the fastest model convergence and highest accuracy on real datasets compared to existing DFL solutions while incurring small communication costs and being resilient to node joins and failures.

979-8-3503-5171-2/24/$31.00 ©2024 IEEE

I Introduction

Training machine learning (ML) models using data collected by distributed devices, such as mobile and IoT devices, is crucial for modern ML. Federated learning (FL) [23, 14, 20, 26, 27, 43, 1] has become a popular ML paradigm that allows a large number of clients (end systems, edge nodes, etc.) to train ML models collaboratively without directly sharing training data. FL uses a central server or cloud to orchestrate clients for training ML models and iterates the following procedure: The server creates a global model by aggregating the local ML models collected from the clients and then sends it to clients for edge applications; the ML models are updated at the clients.

An abstraction of FL is shown in Fig. 1(a). Compared to collecting raw data from distributed devices and performing centralized ML, FL has several main advantages, including saving communication costs on limited-bandwidth devices, preserving data privacy, and being compatible with country or organization regulations that prohibit direct data sharing.

Figure 1: Federated learning vs. decentralized federated learning

However, the drawbacks of FL are also prominent and have been studied and widely mentioned in the literature [20, 11, 35]. For example, the central orchestration server that frequently exchanges models with clients clearly presents a bottleneck and becomes a typical single point of failure [20, 4]. In addition, the server is also a single point of attack: adversaries can make all clients use tampered ML models by attacking the server. There is even a risk that the server itself is malicious, which might distribute incorrect global models or collect sensitive information from the clients.

Overlay network | Decentralized construction | Node degree | Model convergence | Resilience to churn | Other comments
Ring [11] | Not discussed | 2 | Slow | Not discussed |
2D grid [11] | Not discussed | 4 | Slow | Not discussed |
Complete graph [11] | Not discussed | N-1 | Fast | Not discussed |
Dynamic chain [10] | Not discussed | 2 | Faster than ring | Not discussed |
D-Cliques [3] | Unknown | |C|-1 | Fast | Not discussed | Assumes global knowledge
Clustering [2] | Not discussed | |C|-1 | Fast | Not discussed | Bottlenecks exist
Hypercube [34] | Not discussed | O(log N) | Fast | Not discussed |
Torus [34] | Not discussed | d | Fast | Not discussed |
Ramanujan [8, 13] | Not discussed | d | Fast | Not discussed |
Random d-graph [8, 13] | Unknown | d | Fast | Not discussed | Assumes global knowledge
FedLay (this work) | Yes | d | Fast | Yes | Addresses device/data heterogeneity
TABLE I: List of overlay network topologies for DFL. N is the number of clients. d is a small constant node degree, usually around 10. |C| is the size of each cluster/clique, usually larger than d. Other than FedLay, only D-Cliques and [13] discuss construction algorithms, but both still assume global knowledge.

Decentralized federated learning (DFL) emerged recently [11, 33, 4] to resolve the above problems of FL, by removing the involvement of the central server. As shown in Fig. 1(b), in a DFL system, DFL clients form a peer-to-peer (P2P) network and keep exchanging their models using P2P communication. In most cases, the data on different clients are not identically and independently distributed (non-iid) [11, 33, 4], hence the trained local ML models are substantially different from each other. After sufficient model exchanges, the local models on the clients may converge to a model that correctly reflects the features of data from all clients.

This work focuses on a fundamental network problem of DFL: what overlay network is ideal for DFL in practice? An overlay network of DFL is a logical network on top of the physical networks. It specifies which pairs of clients should exchange their local models: two clients exchange models if they are overlay neighbors. An ideal overlay network for DFL needs to satisfy a few requirements [4], including 1) a decentralized construction protocol that can build the overlay topology; 2) fast convergence of local models to high accuracy; 3) a small node degree, so that each client keeps a low bandwidth cost by exchanging models with only a limited number of neighbors; and 4) resilience to client dynamics such as joins, leaves, and failures, collectively called churn.

Table I shows a list of overlay network topologies that have been studied for DFL. We find that most of these existing studies pay no attention to whether the proposed topologies can be constructed by decentralized protocols and remain resilient to churn. These two requirements are common networking/distributed-system problems and might not be the focus of ML researchers.

DFL cannot work as a practical system without a decentralized construction protocol for its overlay network. For example, recent work suggests that Ramanujan graphs provide fast convergence and accurate models for DFL [32, 13]. However, no decentralized construction of Ramanujan graphs is known. Centralized construction and maintenance contradict the main purpose of DFL: avoiding the single point of failure/attack.

We propose a fully decentralized overlay network for DFL, called FedLay, which achieves all four requirements discussed above, namely decentralized construction, fast convergence to accurate models, small communication cost, and resilience to churn. FedLay does not need a centralized server at any stage and all clients run the same suite of distributed protocols. The FedLay protocol suite includes two sets of protocols: 1) a set of Neighbor Discovery and Maintenance Protocols (NDMP) to build the overlay network and recover it from churn; and 2) a Model Exchange Protocol (MEP) to achieve fast model convergence for heterogeneous clients and asynchronous communication. The FedLay topology is motivated by the near-random regular topologies that have been proposed for data center networks [29, 38]. However, [29, 38] are centralized protocols for data centers and cannot be applied to DFL. To our knowledge, FedLay is the first solution for constructing near-random regular topologies in a decentralized manner and maintaining the topologies under node joins and failures. FedLay also considers other practical issues, including non-iid data and asynchronous communication with heterogeneous clients.

The contributions of this work are summarized as follows.

  • We identify three topology metrics related to DFL convergence and evaluate various overlay topologies of DFL. We find that FedLay outperforms all other topologies: as a decentralized network, FedLay has almost identical results on all three metrics to the best result among 100 regular graphs randomly generated in a centralized way.

  • We design and implement the FedLay protocol suite. To our knowledge, FedLay is the first DFL overlay network that provides decentralized protocols for construction, churn recovery, and model aggregation.

  • We evaluate FedLay using both prototype implementation and simulations on real ML datasets. We find that FedLay achieves the highest average model accuracy and fastest convergence compared to other DFL methods. It also has small communication cost and strong resilience to churn.

The rest of this paper is organized as follows. Section II presents the metrics for selecting DFL overlay topologies and the details of the topology of FedLay. The design of FedLay protocol suite, including the topology construction and maintenance protocols and model aggregation protocol, is presented in Section III. Section IV shows the evaluation results of FedLay as well as existing DFL overlay networks on real ML datasets. We present related work in Section V and conclude this work in Section VI.

II Overlay Topology of FedLay

This section presents the topology of FedLay and the intuition behind using this topology. We first explore which topology metrics can be used to evaluate the convergence speed under small node degrees. Then we design the FedLay topology and use numerical results to show its advantages. The decentralized construction and maintenance under churn will be presented in the next section.

II-A Assumptions

This work is based on the following assumptions: All the devices in FedLay are already connected to the Internet where they can directly access each other using TCP/IP. All clients train the same neural network models. Clients are honest and benign. The security problems under dishonest clients will be considered in future work.

II-B Three metrics for DFL topologies

A DFL topology can be modeled as an undirected graph G = (V, E), where each node v ∈ V represents a client in the DFL system and each link e = (u, v) ∈ E indicates that two clients u and v will exchange local ML models – u and v are thus called neighbors. We assume clients have equal roles in the overlay and similar numbers of neighbors.

II-B1 Expander property and convergence factor.

An important notion in DFL, and in general decentralized optimization algorithms, is the mixing matrix M of the graph G. The i-th row of M denotes the weights used for aggregating the local models of the neighboring nodes to update the model of the i-th client. Hence the adjacency matrix of an overlay network and its Metropolis-Hastings matrix are both mixing matrices [5]. The symmetry of M implies that its eigenvalues are real and can be sorted in non-increasing order. Let λ_i(M) denote the i-th largest eigenvalue of M; then λ_1(M) = 1 > λ_2(M) ≥ ⋯ ≥ λ_N(M) > −1 based on the spectral property of the mixing matrix [5]. The constant λ = λ(M) := max{|λ_2(M)|, |λ_N(M)|} has been used to characterize the optimization error (a measure of training loss) and the generalization gap (a measure of test accuracy) of DFL. In particular, for a typical DFL framework, Decentralized Federated Averaging (DFedAvg), the optimization error and generalization gap are bounded, respectively, by O(1/(1−λ)²) and O(2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ)) in terms of λ [33, 13]. Notice that both 1/(1−λ)² and 2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ) are increasing functions of λ ∈ (0, 1).

Per the above discussion, to achieve good convergence and generalization, a topology needs a λ sufficiently smaller than 1 and hence small values of 1/(1−λ)² and 2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ). Thus we define the first topology metric, called the convergence factor of G: c_G = 1/(1−λ)².

Note that when 1/(1−λ)² is minimized, 2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ) is also minimized. Hence, for the sake of simplicity, we do not need another factor.
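To make the convergence factor concrete, the following minimal sketch (our illustration, not the authors' implementation) computes c_G from the Metropolis-Hastings mixing matrix of a given topology; numpy and networkx are assumed dependencies.

```python
# Illustrative sketch: compute the convergence factor c_G = 1/(1-lambda)^2
# of a topology from its Metropolis-Hastings mixing matrix (an assumption;
# any valid symmetric mixing matrix could be used instead).
import numpy as np
import networkx as nx

def metropolis_hastings_matrix(G: nx.Graph) -> np.ndarray:
    """Symmetric, doubly stochastic mixing matrix of graph G."""
    nodes = list(G.nodes())
    index = {v: i for i, v in enumerate(nodes)}
    M = np.zeros((len(nodes), len(nodes)))
    for u, v in G.edges():
        i, j = index[u], index[v]
        M[i, j] = M[j, i] = 1.0 / (1 + max(G.degree(u), G.degree(v)))
    for i in range(len(nodes)):          # self-weights make each row sum to 1
        M[i, i] = 1.0 - M[i].sum()
    return M

def convergence_factor(G: nx.Graph) -> float:
    """c_G with lambda = max(|lambda_2|, |lambda_N|) of the mixing matrix."""
    eig = np.sort(np.linalg.eigvalsh(metropolis_hastings_matrix(G)))[::-1]
    lam = max(abs(eig[1]), abs(eig[-1]))
    return 1.0 / (1.0 - lam) ** 2

# Example: one of the random 10-regular graphs used for "Best" in Sec. II-C.
print(convergence_factor(nx.random_regular_graph(10, 300, seed=0)))
```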

II-B2 Network diameter.

The diameter of a network is the maximum length over all shortest paths in the network. It reflects the network distance between the two most distant nodes. The intuition of considering this metric is that the diameter represents the maximum latency for a local model, trained on the data of one client, to propagate to all clients in the network.

II-B3 Average length of shortest paths.

The third metric is the average length of all shortest paths in the network. The intuition of considering this metric is that the average length represents the average latency for a local model to propagate to a random client.
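Both distance metrics can be computed directly on a graph object; a short sketch, again assuming networkx:

```python
# Illustrative sketch of the two distance metrics on the same graph object.
import networkx as nx

G = nx.random_regular_graph(10, 300, seed=0)
print(nx.diameter(G))                      # longest of all shortest paths
print(nx.average_shortest_path_length(G))  # average over all node pairs
```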

II-C FedLay topology

The FedLay topology is motivated by the research on near-random regular graphs for data center networks [29, 38, 39] and DFL [8, 13]. Recent theoretical studies show that Ramanujan graphs provide small values of the spectral expander property λ and hence achieve ‘optimal’ convergence with a constant node degree d [13]. However, a large Ramanujan graph cannot be generated even by centralized construction. Hence, random regular graphs (RRGs) can be used instead, which are approximately Ramanujan for a large network size n [8]. In addition, prior research on data center networks [29] shows that near-RRGs achieve the smallest average length of shortest paths among known graphs with a fixed node degree d. Note that RRGs cannot be generated by any deterministic algorithm either. Hence, near-RRGs, which are considered close enough to RRGs, are usually used in practice [29, 38]. Near-RRGs achieve near-ideal values on both the convergence factor and shortest path lengths.

Figure 2: An example of FedLay topology

A practical problem is that all existing near-RRGs are constructed by centralized methods [29, 38, 39, 13]. The main challenge of a decentralized construction of near-RRGs is to allow a node to select a neighbor from all other nodes with equal likelihood while this node does not know the entire network. We propose to use a virtual coordinate system to solve this problem, inspired by [38]. Note that [38] is a centralized protocol and our main contribution is a decentralized construction for DFL.

In FedLay, each node computes a set of virtual coordinates C, which is an L-dimensional vector ⟨x_1, x_2, ..., x_L⟩ where each element x_i is a random real number in the range [0, 1). In practice, x_i can be computed as H(IP_x | i), where H is a publicly known hash function and IP_x is x's IP address.

We define L virtual ring spaces. In the i-th ring space, a node is virtually placed on a ring based on the value of its i-th coordinate x_i. The coordinates of each space are circular, with 0 and 1 superposed at the top-most point of the ring and 0.5 at the bottom-most point. If the coordinates of two different nodes are identical in one space, their order on the ring is determined by the values of their IP addresses. For ease of presentation, we assume all coordinates on a ring are different. As shown in the example in Fig. 2, there are 8 nodes and each of them computes a set of two-dimensional random coordinates ⟨x_1, x_2⟩. There are two virtual ring spaces, shown in Figs. 2(a) and (b), and every node is placed on a position of the i-th ring based on its random coordinate x_i. Note that all spaces are virtual and have no relationship to the geographic locations of the nodes.

In each virtual space, every node u has two adjacent nodes on the ring, based on the order of their coordinate values. u will use the adjacent nodes from all spaces as its overlay neighbors (found by a decentralized protocol described later) for model exchange. In the example of Fig. 2(c), every node finds its adjacent nodes in the two spaces and they form the FedLay overlay. Most nodes have four neighbors in the overlay, but a few, like node B, have only three neighbors because D is adjacent to B on both rings. Hence, for L spaces, every node has at most 2L neighbors. L can thus be considered a parameter that trades off communication for convergence: with a bigger L, nodes have more neighbors for model exchanges but a higher communication cost.
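For illustration only, the sketch below builds the FedLay topology in a centralized way from the coordinates described above (the decentralized construction appears in Sec. III); the concrete hash scheme for H(IP_x | i) is our assumption.

```python
# Centralized sketch of the FedLay topology: each node hashes its IP into L
# ring coordinates and its overlay neighbors are its adjacent nodes on every ring.
import hashlib

def coordinates(ip: str, L: int) -> list[float]:
    """L virtual coordinates in [0, 1), one per ring space (assumed hash scheme)."""
    return [int(hashlib.sha256(f"{ip}|{i}".encode()).hexdigest(), 16) / 2**256
            for i in range(L)]

def fedlay_neighbors(ips: list[str], L: int) -> dict[str, set[str]]:
    """Neighbor sets of a correct FedLay overlay (Definition 1)."""
    coords = {ip: coordinates(ip, L) for ip in ips}
    neighbors = {ip: set() for ip in ips}
    for i in range(L):
        # Order nodes on the i-th ring by coordinate, breaking ties by IP address.
        ring = sorted(ips, key=lambda ip: (coords[ip][i], ip))
        for k, ip in enumerate(ring):
            neighbors[ip].update({ring[(k - 1) % len(ring)], ring[(k + 1) % len(ring)]})
    return neighbors

nbrs = fedlay_neighbors([f"10.0.0.{k}" for k in range(1, 9)], L=2)
print(max(len(s) for s in nbrs.values()))   # every node has at most 2L = 4 neighbors
```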

Figure 3: Comparisons of network topologies on the three metrics discussed in Sec. II-B: (a) convergence factor, (b) network diameter, (c) average shortest path length.

The detailed decentralized construction and maintenance will be discussed in the next section. We now show that the FedLay topology is close to the optimal choice for a given node degree d, measured by the three metrics discussed in Section II-B. We evaluate the FedLay topology by comparing it with the following existing topologies.

  1. Best of 100 randomly generated d-regular graphs (“Best”). We randomly generate 100 d-regular graphs and measure them on the three metrics. For each metric, we take the optimal value among the 100 graphs and consider it the optimal value of practical topologies for this metric. Of course, there is no decentralized protocol to construct such a “Best” topology.

  2. Chord [30]. Chord is a well-known peer-to-peer overlay network serving the function of a distributed hash table (DHT). It has an O(log n) degree and can be constructed and maintained by decentralized protocols.

  3. Viceroy [21]. Viceroy is a peer-to-peer overlay network with a constant degree, inspired by the classic Butterfly network used for super-computing. Its main objective is to minimize congestion by overlay routing. It can be constructed and maintained by decentralized protocols.

  4. Distributed Delaunay Triangulation (DT) [19, 17]. DT is an overlay network with a constant degree that supports greedy routing. It can be constructed and maintained by decentralized protocols.

  5. Waxman network [36]. Waxman is a network model that simulates connections with physical proximity: nodes with a close geographic distance are likely to connect. There is no known decentralized construction of Waxman.

  6. Social network [22]. We use a social network topology of Facebook users collected by [22]. This is a typical example of overlay networks that rely on information from other application channels.

Fig. 3 shows the empirical results comparing all the above-mentioned topologies on the three metrics: the convergence factor, the diameter, and the average length of shortest paths. For all metrics, smaller values are more desirable. For a fair comparison, each network includes 300 nodes; for the social network, we sample 300 nodes from the original topology. We vary the node degree from 4 to 14 for both “Best” and FedLay. The other networks do not support flexible node degrees, so the result of each of those topologies is shown as a single dot in each figure.

We summarize our findings from these results as follows. “Best” always provides the optimal result for every metric. The results of FedLay are extremely close to “Best”: most points of FedLay are superposed with those of “Best”, with only a few exceptions. All other topologies are much less optimal than FedLay.

The results of both DT and Waxman are much less optimal than those of FedLay. The main reason is that both topologies connect neighbors with short distances; hence, it may take a long latency for information (local models) to propagate between two remote nodes. The convergence factors of both Viceroy and the social network are close to that of FedLay, but their diameters and average shortest path lengths are much longer than those of FedLay. The node degree of Chord is larger than that of most other constant-degree networks, because it needs a node degree of 2 log n. The convergence factor of Chord is very high, but its diameter and average shortest path length are low, due to its high node degree.

Since “Best” represents simulated optimal values rather than a realistic network topology, FedLay achieves the best results on all three metrics among existing practical overlay topologies.

Definition 1 (A correct FedLay overlay).

We define that a FedLay network is correct if every node u has a neighbor set N_u such that N_u includes the adjacent nodes of u in all L virtual ring spaces and does not include any other nodes. Each node also knows the virtual coordinates of all its neighbors.

FedLay uses random coordinates to achieve near-random sampling of other nodes and hence generates a near-random-regular graph.

Understanding the FedLay topology. Since all coordinates in the FedLay topology are randomly computed, in each virtual ring space the two adjacent nodes of a node u are effectively random samples from the set of all nodes. Therefore, all neighbors of u in FedLay are randomly selected and all other nodes have approximately equal likelihood of being selected. Hence, FedLay approximates an RRG, which, as shown in past research [29, 38, 39, 13], provides optimal results on both the convergence factor and shortest paths. So why are the coordinates in virtual spaces necessary? The main difficulty of generating an RRG in a decentralized manner is that a node cannot find neighbors by randomly sampling all other nodes with equal likelihood if it does not know the whole network. The key idea for overcoming this difficulty is to use the random coordinates to order all nodes in each virtual ring space. The ring coordinates allow every newly joining node to use greedy routing to find its two adjacent nodes in every virtual space within a small number of routing hops. This adjacent-node-discovery process achieves near-random sampling of other nodes. In addition, the coordinates also help the network recover from node failures.

III Design of FedLay Protocols

III-A Overview

Figure 4: FedLay protocol suite includes two sets of protocols: 1) Neighbor Discovery and Maintenance Protocols; 2) Model Exchange Protocol.

The key features of FedLay are that it is completely decentralized and all nodes are self-organized: the FedLay protocol suite allows nodes to join, leave, and fail in the network while still maintaining a correct topology and every node only stores its neighbor information.

As shown in Fig. 4, the FedLay protocol suite running on a client consists of two sets: 1) Neighbor Discovery and Maintenance Protocols (NDMP); and 2) Model Exchange Protocol (MEP). Both sets of protocols exchange messages with other nodes in the network. The objective of NDMP running on node uu is to allow uu to find its correct neighbors during the join of uu and maintain correct neighbors under network dynamics. Hence, NDMP can be considered as control protocols to construct the correct FedLay network. The objective of MEP is to decide when to exchange the local models with the neighbors and how to process the received models. Hence, MEP can be considered as an application protocol to optimize the model convergence of DFL. Both NDMP and MEP use TCP with reliable delivery.

III-B Neighbor Discovery and Maintenance Protocols (NDMP)

NDMP includes join, leave, and maintenance protocols. The join protocol is run by each new node joining the FedLay network. It ensures all nodes will find the correct neighbors after the node joins the network. The leave protocol is run by each node that is preparing to leave the network. It ensures all remaining nodes keep the correct neighbors after the leave. The maintenance protocol is run by every node periodically to detect potential failed or wrong neighbors and fix these errors.

III-B1 NDMP join protocol

The join protocol is designed to achieve the following correctness property: given an existing correct FedLay overlay network, a new joining node runs the join protocol; when the join protocol finishes, the joining node is guaranteed to find its correct neighbors, i.e., its adjacent nodes in all virtual spaces. This property will be proved later, and it ensures that a correct FedLay overlay is maintained after every node join: if a FedLay network with n nodes is correct, then after a new node joins, the (n+1)-node network is also correct. This provides a way to recursively construct a large overlay network correctly from a small-scale network with even 2 or 3 nodes. Note that such a recursive property is a key module for ensuring the correctness of many P2P overlays such as Chord [30] and the distributed DT graph [19, 17].

FedLay builds upon the concept of random virtual coordinates and circular distance introduced by SpaceShuffle [38], extending it to achieve a fully decentralized topology construction. Unlike [38], where an administrator is required to access each node and switch, FedLay allows any client to join the network through any existing node, enhancing scalability and ease of use. Additionally, we optimize the NDMP leave protocol to minimize overhead by ensuring that maintenance operations are only triggered when necessary, thus reducing the resource consumption during network changes.

When a new node u joins the existing correct FedLay network, we assume u knows one existing node v in the overlay, which can be an arbitrary node – this is the minimum assumption for any overlay network: if u knows no existing node, it has no way to join any overlay. u first computes a random coordinate as its position in the first virtual space, say x_1^u. u then asks v to send a Neighbor_discovery message into the current FedLay network using greedy routing toward the destination location x_1^u. Neighbor_discovery also includes u's IP address.

We first define the concept of circular distance, which is a metric used in greedy routing of FedLay.

Definition 2 (Circular distance).

The circular distance of two coordinates x and y in the same ring space, 0 ≤ x, y < 1, is:

CD(x, y) = min{|x − y|, 1 − |x − y|}.

For two coordinates x and y on a ring, the circular distance is the length of the smaller arc between x and y, normalized by the perimeter of the ring, which is 1. We say x is closer to y than w in a ring space if CD(x, y) < CD(w, y). If x and w have the same circular distance to y, we break the tie toward the one of x and w with the smaller IP address (considering each IP address as a 32-bit value). Hence, there is only one node that is closest to a given coordinate x.

Upon receiving Neighbor_discovery, a node v executes the greedy routing protocol toward the destination location x_i^u in Space i as follows:

  1. Node v finds a neighbor w such that w's coordinate in Space i, x_i^w, has the smallest circular distance to x_i^u among all neighbors of v.

  2. If CD(x_i^u, x_i^v) > CD(x_i^u, x_i^w), v forwards the Neighbor_discovery message to w.

  3. If CD(x_i^u, x_i^v) ≤ CD(x_i^u, x_i^w), Neighbor_discovery stops at v. From v's two adjacent nodes, v finds the adjacent node p such that x_i^u is on the smaller arc between v and p. Then v sends a message to u telling u that v and p are u's adjacent nodes on this virtual ring, and u adds v and p to its neighbor set.

The greedy routing presented above makes each node forward Neighbor_discovery to its neighbor that has the shortest circular distance to the destination location x_i^u. When a node v cannot find a neighbor that is closer to x_i^u than v itself, v must be the node that has the shortest circular distance to x_i^u among all nodes in FedLay (formally proved later). Hence, v and one of its adjacent nodes p will be u's neighbors.
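The per-hop decision of this greedy routing can be sketched as follows; the function and variable names are illustrative, and each node is assumed to know only its own neighbors' coordinates.

```python
# Illustrative sketch of one greedy routing hop of Neighbor_discovery.
def circular_distance(x: float, y: float) -> float:
    """Definition 2: length of the smaller arc between x and y on a unit-perimeter ring."""
    d = abs(x - y)
    return min(d, 1.0 - d)

def greedy_next_hop(v: str, dest: float, space: int,
                    coords: dict[str, list[float]],
                    neighbors: dict[str, set[str]]):
    """Return the neighbor to forward to, or None if the message stops at v."""
    # Step 1: the neighbor whose coordinate in this space is closest to the destination
    # (ties broken by node identifier, mirroring the IP-address tie-break).
    w = min(neighbors[v], key=lambda n: (circular_distance(coords[n][space], dest), n))
    # Steps 2-3: forward only if w is strictly closer than v itself.
    if circular_distance(coords[v][space], dest) > circular_distance(coords[w][space], dest):
        return w
    return None  # v is closest: v and one of its adjacent nodes become u's neighbors
```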

Figure 5: An example of the FedLay join protocol.

We show an example of a FedLay join in Fig. 5. u joins FedLay and knows an existing node H in the network. u computes a random coordinate 0.15 and asks H to run greedy routing of Neighbor_discovery toward 0.15. H forwards Neighbor_discovery to B, which is closest to 0.15 in the space among all of H's neighbors. Eventually, the message arrives at G, the node closest to 0.15 in the space, and G tells u to add G and D as neighbors. Note that greedy routing has a much smaller hop count than traversing nodes one after another around the ring, because there are many shortcuts such as the link H-B.

We have the following property.

Lemma 1.

In a ring space of a correct FedLay network with a given coordinate x, if a node v is not the node that has the smallest circular distance to x in the space, then v must have an adjacent node w on the ring such that CD(x, x^v) > CD(x, x^w), where x^v denotes v's coordinate.

Proof.

Let p be the node with the smallest circular distance to x in the space. Consider the two arcs between x and x^v. At least one of them has a length no longer than 0.5; let that arc be arc(x^v, x), with length L(arc(x^v, x)). Then CD(x^v, x) = L(arc(x^v, x)) ≤ 0.5.

If p is on arc(x^v, x), consider the arc between v and p, arc(x^v, x^p), as a part of arc(x^v, x). If v has an adjacent node q whose coordinate is on arc(x^v, x^p), then L(arc(x, x^q)) < L(arc(x^v, x)) ≤ 0.5. Hence, CD(x^v, x) > CD(x^q, x). The lemma holds in this situation. If v has no adjacent node on arc(x^v, x^p), then there is no node on arc(x^v, x^p) at all; hence, v and p are adjacent nodes. Since p is the node closest to x, v does have an adjacent node, p, that is closer to x. The lemma also holds in this situation.

If p is not on arc(x^v, x), consider the arc arc(x^v, x, x^p). The arc arc(x, x^p) is a part of arc(x^v, x, x^p), and we have L(arc(x, x^p)) < L(arc(x^v, x)) because p is the closest node to x. If v has an adjacent node q whose coordinate is on arc(x^v, x, x^p), then L(arc(x, x^q)) < L(arc(x^v, x)) ≤ 0.5. Hence, CD(x^v, x) > CD(x^q, x). The lemma holds in this situation. If v has no adjacent node on arc(x^v, x, x^p), then v and p are adjacent nodes. Since p is the node closest to x, v does have an adjacent node, p, that is closer to x. The lemma also holds in this situation.

Therefore in all cases, the lemma holds. ∎

Note that every adjacent node of v is its neighbor in a FedLay network. This lemma shows that if v is not the node closest to the destination coordinate x_i^u, then the greedy routing algorithm must execute Step 2 and forward the message to a neighbor. Hence, if the greedy routing algorithm reaches Step 3 and stops at v, v must be the node that has the smallest circular distance to x_i^u in the space. So we have,

Theorem 1.

In a FedLay network, when a Neighbor_discovery message to the destination coordinate x stops at a node v, v must be the node that has the smallest circular distance to x.

The Neighbor_discovery message will stop at the node v closest to x_i^u, and v must be an adjacent node of the joining node u, because no other node is closer to u's coordinate x_i^u. v also knows the other adjacent node w of u, because w is a current adjacent node of v. v will send a message to u by TCP including the information of v and w. u will then add v and w to its neighbor set.

The joining node u can find all its neighbors by running the above join protocol in all spaces. Therefore, if u joins a correct FedLay network, the new FedLay network after this join is also correct.

In some extreme cases, multiple nodes could join the network simultaneously. This situation is handled by both the join and maintenance protocols.

III-B2 NDMP leave protocol

Figure 6: An example of the FedLay leave protocol.

The leave protocol of NDMP is quite straightforward. When a user wants to leave and closes the FedLay client program, for every virtual space, the leaving node sends messages to its two adjacent nodes and tells them to add each other to their neighbor sets. As shown in Fig. 6, node G wants to leave the network and tells its two adjacent nodes A and D about its leaving. A and D then consider each other adjacent nodes and neighbors. In the other space, G's two adjacent nodes F and C also add each other to their neighbor sets. Hence, the new FedLay network after a node leave is also correct.

Figure 7: An example of the maintenance protocol.

III-B3 NDMP maintenance protocol

In addition to planned leaves, an overlay network may also experience node failures due to various reasons such as Internet service outages and end-system failures. A failed node disappears without notice. To detect and fix these situations, each node in FedLay also runs the maintenance protocol.

The maintenance protocol requires every node to send each of its neighbors a heartbeat message periodically. Suppose the time period between two heartbeat messages is T. If a node p has not received any heartbeat message from a neighbor u for 3T time, it considers that u has failed. p then sends a Neighbor_repair message by greedy routing in the opposite direction of u in the virtual space i where u and p are adjacent nodes. We use the example in Fig. 7 to explain this protocol. In Fig. 7(a), node A detects the failure of G. Since G is the adjacent node on A's clockwise side, A sends a Neighbor_repair message by greedy routing in the counterclockwise direction (the opposite direction of G). By “counterclockwise direction”, we require that all hops of this greedy routing, A-E-B-D in this example, follow the counterclockwise order.

We give a formal description of routing in the counterclockwise direction. Upon receiving a Neighbor_repair message to x_i^u in Space i, a node v runs the following algorithm:

  1. Node v considers the subset of its neighbors such that for every neighbor w in the subset, w's coordinate in Space i, x_i^w, satisfies x_i^w < x_i^v or x_i^w > x_i^u. For each such neighbor w, consider the length of the arc from x_i^w to x_i^u in the counterclockwise direction, L(arc(x_i^w, x_i^u)). From this subset, node v finds the neighbor w′ whose coordinate x_i^{w′} has the smallest arc length L(arc(x_i^{w′}, x_i^u)) among all neighbors in the subset.

  2. If L(arc(x_i^v, x_i^u)) > L(arc(x_i^{w′}, x_i^u)), v forwards Neighbor_repair to w′.

  3. If L(arc(x_i^v, x_i^u)) < L(arc(x_i^{w′}, x_i^u)), Neighbor_repair stops at v. v then tells p that v is p's adjacent node on this virtual ring, and p adds v to its neighbor set.

When Neighbor_repair stops, it has arrived at the other adjacent node of G before G's failure (stated in a later theorem). Then G's previous two adjacent nodes can connect to each other.

In Fig. 7(a), node D also detects G's failure, independently of A's detection. D then sends a Neighbor_repair message in the clockwise direction, and the message travels along the path D-F-A. The algorithm to forward Neighbor_repair in the clockwise direction is specified in a similar way, by changing the subset selection condition in Step 1 to “x_i^w > x_i^v or x_i^w < x_i^u”.

We can prove the following theorem:

Theorem 2.

Consider a correct FedLay network in which a node u fails. When a node p detects the failure of its adjacent node u in Space i, it sends Neighbor_repair to the destination coordinate x_i^u in the opposite direction of u. When Neighbor_repair stops at a node q, q is the other adjacent node of u in Space i before u's failure.

Proof.

We first discuss the case in which Neighbor_repair is sent in the counterclockwise direction. The clockwise case can be proved in the same way.

Let q be the other adjacent node of u in Space i before u's failure.

Let v be the current node that receives Neighbor_repair. If v ≠ q, v has at least one neighbor w such that L(arc(x_i^w, x_i^u)) < L(arc(x_i^v, x_i^u)). Hence, Neighbor_repair will not stop if v ≠ q; v will forward Neighbor_repair to the neighbor w′ with L(arc(x_i^{w′}, x_i^u)) ≤ L(arc(x_i^w, x_i^u)) < L(arc(x_i^v, x_i^u)).

Therefore, after each hop, the arc length L(arc(x_i^v, x_i^u)) strictly decreases. Hence, Neighbor_repair never loops. Since there is a finite number of nodes, Neighbor_repair will stop at q. ∎

Based on the theorem, in every virtual space, the two adjacent nodes of the failed node u can find and connect to each other. Hence, if a node fails in a correct FedLay network, FedLay is still correct after the node failure.

Neighbor repair for concurrent joins and failures. Note that the above property cannot be proved for multiple failures that happen at the same time, called concurrent failures. For concurrent joins and failures, we allow each node u to periodically send two Neighbor_repair messages with destination x_i^u in both the counterclockwise and clockwise directions in every virtual space i, even without detecting any neighbor failure. For each node v in u's neighbor set, if u and v are indeed adjacent in a virtual space, v will receive a Neighbor_repair message that stops at v. If a Neighbor_repair message stops at a node w that is not in u's neighbor set, w and u will add each other to their neighbor sets. In fact, there is no way to prove such a property for concurrent joins and failures in any structured P2P network [30, 17]. Hence, we conduct experiments with extreme concurrent joins and failures, and the above method always allows the network to recover to a correct FedLay.

III-C Model Exchange Protocol (MEP)

One key challenge of DFL is that there is no central server to evaluate the quality of models from different clients. In P2P model exchanges, a client with low-quality local models can ‘infect’ neighbors that hold high-quality models, and these errors may propagate further in the overlay. MEP is designed to limit the impact of low-quality models and amplify the impact of high-quality models in a decentralized way. We consider two practical issues in DFL systems. 1) Data heterogeneity [11, 4, 13]. It is well known that the local data of different clients are usually non-iid due to geographic and environmental diversity; hence, their models have different accuracy. 2) Client heterogeneity [40, 7, 24]. Clients of DFL may have different bandwidth and computing capacities, and thus different model exchange frequencies. Unlike previous work on DFL that usually assumes homogeneous clients [11, 10, 3, 34, 8, 13], MEP allows each client to set different parameters based on these two dimensions of heterogeneity and guides clients in exchanging models with their neighbors through multiple design components. We present three main components in detail: 1) setting confidence parameters; 2) setting the model exchange period; 3) model de-duplication.

Dataset | Task | Model | Model size per client | Comm. period for medium-cap. clients
MNIST | Image classification | MLP | 247 KB | 5 min
CIFAR-10 | Image classification | CNN | 1.1 MB | 10 min
Shakespeare | Next-character prediction | LSTM | 23.4 MB | 40 min
TABLE II: Datasets used in evaluation

III-C1 Asynchronous model exchange

Previous work assumes synchronous, round-based communication, in which all clients use the same time period to exchange models with neighbors [11, 10, 3, 34, 8, 13]. However, due to client heterogeneity, some low-resource clients may become “stragglers” that fail to finish model exchanges within the given time period, while powerful or newly joined clients prefer a shorter time period (i.e., a higher frequency) of model exchanges.

MEP uses asynchronous communication and allows each client u to use a different communication time period T_u. T_u can be set in two ways. 1) Coarse-grained settings: each client may configure a period based on its device and communication type, for example, Server-LAN, PC-LAN, Laptop-WLAN, Phone-LTE, and IoT-WLAN. These values are pre-specified in the client program. 2) Fine-grained settings: based on monitoring of its available bandwidth and computing resources, client u estimates the minimum time duration T_{u,min} needed to produce an updated ML model and transmit it to all neighbors. Its communication period is then T_u = η · T_{u,min} for a constant η > 1.

For two neighbors with periods T_u and T_v, their model exchange period is set to max(T_u, T_v). Hence, a client may have different exchange periods with different neighbors.
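A small sketch of how these periods could be set and combined is shown below; the numeric defaults and the value of η are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of setting the per-client period T_u and the per-neighbor
# exchange period; the numeric defaults and eta below are assumed values.
COARSE_PERIODS_MIN = {
    "Server-LAN": 5, "PC-LAN": 5, "Laptop-WLAN": 10, "Phone-LTE": 20, "IoT-WLAN": 40,
}

def fine_grained_period(t_u_min: float, eta: float = 1.5) -> float:
    """T_u = eta * T_u,min with eta > 1, leaving headroom over the measured minimum."""
    return eta * t_u_min

def exchange_period(t_u: float, t_v: float) -> float:
    """Two neighbors exchange models at the slower of their two periods."""
    return max(t_u, t_v)

# Example: a PC on a LAN paired with an IoT device exchanges every 40 minutes.
print(exchange_period(COARSE_PERIODS_MIN["PC-LAN"], COARSE_PERIODS_MIN["IoT-WLAN"]))
```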

Figure 8: Topology correctness under churn and message cost: (a) correctness for concurrent joins, (b) correctness for concurrent failures, (c) message cost.

III-C2 Set confidence parameters

One key innovation in MEP is the introduction of confidence parameters. Each node has a set of confidence parameters that represent its self-evaluation of its local model accuracy.

We define the data divergence confidence c_d on a client u:

c_d^u = 1 / exp(DKL(D_loc || D_std))

where DKL(·) is the Kullback-Leibler divergence [16] that evaluates the statistical distance between two probability distributions, D_loc denotes the local data distribution, and D_std denotes the estimated iid distribution of the dataset. The uniform distribution is widely used [42, 28] to estimate the iid distribution because the majority of publicly available classification datasets, such as MNIST [18] and CIFAR-10 [15], follow uniform label distributions. The Kullback-Leibler divergence effectively represents the richness of a local dataset. c_d ∈ (0, 1], and a higher value represents a higher quality of local data and local models.

In addition, we define the communication confidence c_c on a client u: c_c^u = 1 / T_u. The intuition of using c_c is that when a client exchanges models with its neighbors more frequently, its models are more likely to have higher quality.

Hence, the overall confidence of client u is

c^u = α_d · c_d^u / max(c_d) + α_c · c_c^u / max(c_c)

where max(c_d) and max(c_c) are the maximum values of c_d and c_c, respectively, among all of u's neighbors. α_d and α_c are two constants that balance the weights of the two confidence parameters; they can simply be set to 0.5 and 0.5. We tried a variety of combinations of α_d and α_c, and in all cases FedLay achieves fast model convergence on different nodes.

The models from u's neighbors are aggregated as follows:

ω^u = ( Σ_{j ∈ N ∪ {u}} c^j ω^j ) / ( Σ_{j ∈ N ∪ {u}} c^j )

The above aggregation is computed once every local time period T_u, and the models from each neighbor are always the most recent ones received from that neighbor. In this way, clients with low confidence in their model accuracy have less impact on other clients. In this work, we do not consider the situation where a client intentionally sets a large confidence value to mislead other clients.
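A minimal sketch of the confidence computation and the aggregation above, assuming a classification task with a known label set and local models flattened into numpy vectors; variable names mirror the equations, and the code is illustrative rather than the authors' implementation.

```python
# Illustrative sketch of MEP's confidence-weighted aggregation.
import numpy as np

def data_confidence(label_counts: np.ndarray) -> float:
    """c_d = 1 / exp(DKL(D_loc || D_std)), with D_std taken as the uniform distribution."""
    p = label_counts / label_counts.sum()          # local label distribution D_loc
    q = np.full(len(p), 1.0 / len(p))              # assumed iid distribution D_std
    kl = sum(pi * np.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return 1.0 / np.exp(kl)

def overall_confidence(c_d: float, c_c: float, max_cd: float, max_cc: float,
                       alpha_d: float = 0.5, alpha_c: float = 0.5) -> float:
    """c^u = alpha_d * c_d/max(c_d) + alpha_c * c_c/max(c_c), maxima over u's neighbors."""
    return alpha_d * c_d / max_cd + alpha_c * c_c / max_cc

def aggregate(models: dict[str, np.ndarray], confidence: dict[str, float]) -> np.ndarray:
    """omega^u = sum_j c^j * omega^j / sum_j c^j over u and its neighbors."""
    total = sum(confidence.values())
    return sum(confidence[j] * models[j] for j in models) / total
```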

III-C3 Model fingerprinting

In the neighbor set of each client, in addition to the IP addresses, coordinates, and confidence values of the neighbors, the client also stores the fingerprint f of the most recent models received from each neighbor, computed by hashing the models with a public hash function. Before starting a model exchange, the client first sends the fingerprint to its neighbor. If the neighbor finds that the fingerprint matches the models about to be sent, the neighbor considers the models a duplicate of a copy sent earlier and does not send them. This method reduces unnecessary traffic from exchanging duplicated models.
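One possible way to compute such a fingerprint for a PyTorch model is sketched below; the serialization order and the choice of hash function are our assumptions.

```python
# Illustrative sketch of model fingerprinting for de-duplication.
import hashlib
import torch

def model_fingerprint(model: torch.nn.Module) -> str:
    """Hash the serialized parameters so a neighbor can detect a duplicate model."""
    h = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        h.update(name.encode())
        h.update(tensor.detach().cpu().numpy().tobytes())
    return h.hexdigest()

# A client sends only the fingerprint first; the neighbor skips the transfer when it
# matches the fingerprint of the last model it already received from this client.
```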

IV Performance Evaluation

IV-A Evaluation methodology

IV-A1 Three types of evaluation

We conduct three types of evaluation of FedLay for different scales of DFL networks.

1) Real experiments. We conduct experiments with real packet exchanges and data training. We deploy 16 instances to public clouds (we used both Oracle OCI and Amazon EC2), each with a 2GHz CPU and 2GB RAM. Each instance is connected to the Internet and runs a FedLay client. Each client sends and receives NDMP and MEP messages using TCP. Clients train ML models on their local datasets with Pytorch [25] and exchange the models with active neighbors. There is no central server for any purpose, and the system is completely decentralized. The purpose of this type of experiment is to present a prototype and demonstrate that FedLay can run with real ML data training in practice.

2) Medium-scale emulation with real data training. In this type of experiment, we use real data training and simulated packet exchanges as discrete events to evaluate networks with up to 100 clients. The simulation and the real data training and testing run on a machine with an NVIDIA GeForce RTX 3080 graphics card for training acceleration. The purpose of this type of experiment is to evaluate the overlay construction/maintenance, model accuracy, convergence speed, and message cost of FedLay and other DFL methods.

3) Large-scale simulation with trained models. For more than 100 clients, conducting all data training on a few machines takes a very long time. For networks with 200 to 1000 nodes, we re-use the models trained from the above two types of experiments and assign them to the simulated clients. Packet exchanges are simulated as discrete events.

Figure 9: Model accuracy from 16-client real experiments in Amazon EC2: (a)-(c) accuracy vs. time (minutes) for MNIST, CIFAR-10, and Shakespeare; (d)-(f) CDF of accuracy for MNIST, CIFAR-10, and Shakespeare.

IV-A2 ML datasets and models

We evaluate the performance of FedLay on three ML tasks: 1) a Multilayer Perceptron (MLP) for digit classification on the MNIST dataset [9]; 2) Convolutional Neural Networks (CNN) for image classification on the CIFAR-10 dataset [15]; 3) Long Short-Term Memory (LSTM) for next-character prediction on the Shakespeare dataset [6] built from The Complete Works of William Shakespeare. Details are described in Table II. All three are standard datasets for FL benchmarks [6].

Learning with non-iid data. We generate non-iid MNIST and CIFAR-10 datasets by selecting limited labels for the local training sets using the sharding method. Each shard contains only one label, and each local dataset includes a limited number of shards, resulting in a non-iid distribution and heterogeneity among clients’ local datasets. For large-scale simulations, data among clients may overlap, but in real experiments and medium-scale simulations, clients have unique data. In the original Shakespeare dataset, each speaking role in a play is considered a unique shard.
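A minimal sketch of the label-sharding procedure described above for MNIST and CIFAR-10; shard sizes and the random assignment are illustrative choices.

```python
# Illustrative sketch: split a labeled dataset into single-label shards and
# give each client a few shards, producing non-iid local datasets.
import random
from collections import defaultdict

def make_shards(labels: list[int], shards_per_label: int) -> list[list[int]]:
    """Each shard contains sample indices of exactly one label."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    shards = []
    for idxs in by_label.values():
        random.shuffle(idxs)
        size = len(idxs) // shards_per_label
        shards += [idxs[k * size:(k + 1) * size] for k in range(shards_per_label)]
    return shards

def assign_shards(shards: list[list[int]], num_clients: int, shards_per_client: int):
    """Each client's local dataset is the union of a few single-label shards."""
    random.shuffle(shards)
    return [sum(shards[c * shards_per_client:(c + 1) * shards_per_client], [])
            for c in range(num_clients)]
```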

Client heterogeneity. We also assume that clients have different computation and communication resources, and we set three tiers of clients. For the 16-client real-world experiments, we use 10 medium-capacity, 3 high-capacity, and 3 low-capacity clients. For simulations, each experiment includes 60% medium-, 20% high-, and 20% low-capacity clients. The training time and communication period of a high-capacity client are 2/3 of those of a medium-capacity client, and those of a low-capacity client are twice those of a medium-capacity client. The communication period for medium-capacity clients of each dataset is shown in Table II.

IV-A3 Performance metrics

Besides the topology metrics discussed in Sec. II-B, we use the following metrics to evaluate FedLay and other methods.

Model accuracy: We evaluate the individual accuracy and average accuracy of local ML models based on separate test datasets that are different from the training datasets.

Topology correctness: It is defined as the number of correct neighbors of all nodes over the total number of neighbors. Hence correctness equal to 1 means a correct FedLay.

Communication Cost: We evaluate the average communication cost per client, by counting the number of NDMP messages sent by each client and the total size of the models sent by each client in bytes.

IV-A4 Methods for comparison

There is no existing DFL topology that allows decentralized construction and maintenance. Hence we compare FedLay with the following methods: 1) Gaia [12] is an ML method for geo-distributed clouds and still uses central servers. Hence it is not DFL. It runs server-based ML in each region and lets servers from different regions connect as a complete graph. It includes no aggregation method to handle non-iid data. 2) DFL-DDS [31] is a DFL method without a fixed topology. Instead, it simulates mobile nodes in a road network and considers two geographically close nodes as neighbors. 3) Chord [30]. 4) FedAvg [23] is a standard centralized FL method. We use its accuracy as the upper bound of DFL model accuracy because the central server knows all models from the clients.

Figure 10: Model accuracy from 100-client medium-scale experiments: accuracy vs. time on (a) MNIST, (b) CIFAR-10, and (c) Shakespeare. FedAvg is centralized and considered the upper bound.

Figure 11: Accuracy under different non-iid levels for CIFAR-10: (a) 4 shards per client, (b) 12 shards per client, (c) accuracy distribution.

IV-B Evaluation of FedLay topology

We show the convergence factor, diameter, and average shortest path length while varying the number of nodes in Fig. LABEL:fig:expander3 for the different topologies discussed in Sec. II-C. We evaluate FedLay with degrees of 6, 8, and 10, as well as Viceroy, Waxman, and Chord. We find that the diameter and average shortest path length of Viceroy and Waxman increase with the size of the network, and Chord has a large convergence factor when the number of nodes is large. The results show that FedLay has the best results on all topology metrics.

Fig. 8(a) shows the topology correctness under an extreme situation in which 100 new clients join a 400-client FedLay network at the same time (at 10 ms on the timeline). The average network latency is set to 350 ms. We find that correctness quickly converges to 1 within 8 seconds in FedLay with degree d = 6, 8, 10, 12. Fig. 8(b) shows the topology correctness under another extreme situation in which 100 clients fail simultaneously in a 400-client FedLay network (at 10 ms on the timeline). The correctness quickly drops to 64.3%. The remaining clients run NDMP and recover a correct 300-client FedLay network within 8 seconds. In Fig. 8(c), we plot the number of messages sent per client to construct FedLay networks of different sizes. With as many as 500 clients, each client sends only around 30 messages on average.

IV-C DFL model accuracy

Fig. 9 shows the model accuracy of different methods in six subfigures, based on real experiments in Amazon EC2. In Figs. 9(a)-9(c), we find that FedLay achieves higher accuracy and faster convergence than Gaia and DFL-DDS, even with d = 4. Note that one communication period for medium-capacity clients is set to 40 minutes for Shakespeare; hence, there are not many rounds of model exchanges before convergence. Figs. 9(d)-9(f) show the cumulative distribution of the accuracy of all clients at convergence time (150 minutes for MNIST and 1500 minutes for the others). We can see that nodes reach similar accuracy levels without any ‘stragglers’.

We evaluate the accuracy of FedLay (d = 10), FedAvg, Gaia, and DFL-DDS using medium-scale experiments with 100 clients and show the results in Fig. 10. FedAvg achieves the best accuracy as a centralized FL method, which we consider the upper bound for DFL. The accuracy of FedLay with degree 10 is only 1.2%, 2.5%, and 0.9% lower than FedAvg on MNIST, CIFAR-10, and Shakespeare, respectively. The other methods have lower accuracy, but the differences are not significant. We list the accuracy at convergence time in Table III, along with the centralized method FedAvg as the baseline.

Task | FedLay | FedAvg | Gaia | Chord | DFL-DDS
MNIST | 90.2% | 92.1% | 89.2% | 88.9% | 87.4%
CIFAR | 50.3% | 52.8% | 48.6% | 49.2% | 49.4%
Shakespeare | 45.9% | 46.9% | 44.0% | 44.5% | 44.2%
TABLE III: Accuracy comparison at convergence. We regard the accuracy of FedAvg as the centralized baseline.

We evaluate FedLay under different non-iid levels. Each client has a limited number of shards; when each client has fewer shards, the non-iid level becomes more significant. The default setting is 8 shards per client, as used in previous results such as Fig. 10(b). Figs. 11(a) and 11(b) show the accuracy comparison on CIFAR-10 for 4 shards per client and 12 shards per client, respectively. We find that all DFL methods converge more slowly under more non-iid data (Fig. 11(a)), but eventually FedLay still achieves accuracy similar to FedAvg, while Gaia and DFL-DDS have lower accuracy. We also show the distribution of the accuracy of all clients at time 2000 in Fig. 11(c). When there are 4 shards per client, the distribution is more uneven.

Figure 12: Model accuracy with synchronous and asynchronous communication: (a) MNIST, (b) CIFAR-10, (c) Shakespeare.

Evaluation of data with biased distribution and locality. In this set of experiments, 100 clients are divided evenly into 10 groups, and each group possesses 6 out of the 10 labels in the CIFAR-10 dataset. Each group differs from its neighboring groups by only one label; for example, group 1 has labels 1 to 6, group 2 has labels 2 to 7, and so on, and the last group has labels 10, 1, 2, 3, 4, 5. For each client, we sample 2000 images per label evenly from the original CIFAR-10 dataset. In Fig. 13, we show that FedLay has an average improvement of 37.01% over Chord across varying degrees; it is also only 2.0% lower than the theoretical upper bound, a fully connected network. In Fig. 14, we compare the accuracy of FedLay and Chord versus time. Again, FedLay shows much better convergence than Chord.

Figure 13: Converged accuracy.

Figure 14: Accuracy vs. time (d=8).

Figure 15: Relative computational cost.

IV-D Evaluation of other considerations

Asynchronous communication. We compare FedLay with synchronous and asynchronous communication in Fig. 12. We find that, for all three datasets, asynchronous communication improves both accuracy and convergence speed, because high-capacity clients do not need to wait for low-capacity ones.

Confidence parameters. Fig. 16 and Fig. 17 show the accuracy of FedLay with and without confidence parameters for MNIST, compared to simple averaging. We set α_d=0.5 and α_c=0.5. The results show that the confidence parameters slightly improve accuracy over the simple average.
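As a rough illustration (the precise confidence computation is part of MEP and is not reproduced here), confidence-weighted aggregation replaces the equal weights of the simple average with weights derived from the two models' confidence values. The functions below are a hedged sketch under that assumption, not FedLay's exact rule.

```python
# Illustrative sketch only: simple averaging vs. confidence-weighted averaging of
# two models represented as dicts of parameter tensors/arrays.
def simple_average(w_local, w_neighbor):
    return {k: 0.5 * (w_local[k] + w_neighbor[k]) for k in w_local}

def confidence_weighted_average(w_local, w_neighbor, conf_local, conf_neighbor):
    a = conf_local / (conf_local + conf_neighbor)   # weight derived from confidences
    return {k: a * w_local[k] + (1 - a) * w_neighbor[k] for k in w_local}
```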

Accuracy under churn. We show the model accuracy under extreme churn: 50 new clients join a 50-client FedLay network. In Fig. 18, the curves with triangle markers show the accuracy of the 50 initial nodes and the curves with square markers show the accuracy of the 50 newly joined nodes. We find that the accuracy of the new nodes quickly converges to a high level due to the high-confidence models from existing nodes. Fig. 19 shows that the newly joined nodes have very low accuracy at join time, but all clients eventually achieve high accuracy.

Computation cost. We show the relative computational cost in Fig. 15. In the experiment with 100 nodes training on the MNIST dataset, to reach 88% accuracy, the relative computation cost of FedLay is 1.33, compared to 1.53 for Gaia, 2.47 for Chord, and 2.76 for DFL-DDS, with the baseline FedAvg normalized to 1. FedLay incurs only 33% overhead, the smallest among the DFL methods.
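The relative cost here can be read as the total client compute spent before first reaching the target accuracy, normalized to FedAvg; the snippet below is a back-of-envelope sketch with hypothetical compute units chosen only to reproduce the reported ratios.

```python
# Back-of-envelope sketch (assumed definition): relative computational cost as
# compute-to-target-accuracy normalized by the FedAvg baseline.
def relative_cost(compute_to_target, baseline="FedAvg"):
    base = compute_to_target[baseline]
    return {m: c / base for m, c in compute_to_target.items()}

# Hypothetical compute units matching the ratios reported above.
costs = {"FedAvg": 100, "FedLay": 133, "Gaia": 153, "Chord": 247, "DFL-DDS": 276}
print(relative_cost(costs))  # FedLay 1.33, Gaia 1.53, Chord 2.47, DFL-DDS 2.76
```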

IV-E Scalability

We use large-scale simulations to evaluate the scalability of FedLay, as shown in Fig. 20. We find that even with up to 1000 clients, FedLay has stable performance on all datasets (Figs. 20(a)-20(c)). In Fig. 20(d), we compare the communication cost per client (in MB) to reach convergence for FedLay, FedAvg, DFL-DDS, and Gaia. Gaia has poor scalability in communication cost.
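To see why a small fixed degree matters for communication scalability, consider the back-of-envelope estimate below. It assumes, purely for illustration, that every model exchange sends the full model to each neighbor: per-client traffic then grows linearly with degree, so a constant degree d keeps it independent of network size, whereas a topology whose effective degree grows with the number of clients does not. The numbers are illustrative, not measurements.

```python
# Back-of-envelope sketch (illustrative numbers, not measured data).
def comm_cost_mb(model_mb, rounds, degree):
    # Full model sent to every neighbor in every communication round.
    return model_mb * rounds * degree

model_mb, rounds, n = 20, 100, 1000
print(comm_cost_mb(model_mb, rounds, degree=10))     # fixed degree (FedLay-style): 20000 MB
print(comm_cost_mb(model_mb, rounds, degree=n - 1))  # degree growing with n: 1998000 MB
```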

Figure 16: Accuracy for MNIST, α_d=0.5, α_c=0.5.

Figure 17: Accuracy distribution for MNIST.

Figure 18: Accuracy under churn.

Figure 19: Accuracy distribution under churn.

V Related Works

Decentralized Federated Learning (DFL). Federated learning (FL) [23] is an attractive solution for large-scale ML that allows many clients to train ML models collaboratively without directly sharing training data. However, the central server in FL is a single point of failure and attack. Decentralized federated learning (DFL) has been proposed to remove the central server [11, 33, 4]. The overlay network of DFL is a fundamental problem. He et al. [11] suggest a few overlay topologies, including ring, 2D grid, and complete graph, which either cannot be constructed in a decentralized way or incur excessive communication. GADMM [10] uses a dynamic chain topology, and other methods apply clustering-based topologies [2, 3]. Vogels et al. [34] theoretically analyze model convergence on different topologies, including hypercube, torus, binary tree, and ring. Recently, Hua et al. suggest applying Ramanujan graphs to DFL [13]. None of the above studies discusses decentralized construction and maintenance of the suggested topologies. Recently, [41, 37] propose to utilize blockchains to enhance the security and verifiability of DFL.

Other overlay topologies. Overlay topologies have been extensively studied for P2P networks. A well-known category of overlay networks is distributed hash tables (DHTs), such as Chord [30]. DHTs are designed to support data lookup in a decentralized network. Distributed Delaunay triangulation (DT) [19, 17] is designed to provide greedy routing guarantees, and Viceroy [21] is designed to minimize congestion in overlay routing. Near-random regular topologies have been studied for data center networks with centralized construction [29, 38].

Figure 20: Results of simulations for large-scale networks. (a) MNIST; (b) CIFAR-10; (c) Shakespeare; (d) Shakespeare: communication cost.

VI Conclusion

This work presents FedLay, the first overlay network for DFL that achieves all the following properties: 1) decentralized construction, 2) fast convergence to accurate models, 3) small node degree, and 4) resilience to churn. We present the detailed designs of the FedLay topology, neighbor discovery and maintenance protocols (NDMP), and model exchange protocol (MEP). We prove that NDMP can guarantee the correctness of a decentralized overlay for node joins and failures. Evaluation results show that FedLay provides the highest model accuracy among existing DFL methods, small communication costs, and strong resilience to churn. In particular, it provides significant model accuracy advantages compared to other decentralized protocols such as Chord, when data are distributed with locality and bias.

References

  • [1] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In Proc. of ICLR, 2021.
  • [2] Mohammed S. Al-Abiad, Mohanad Obeed, Md. Jahangir Hossain, and Anas Chaaban. Decentralized aggregation for energy-efficient federated learning via overlapped clustering and d2d communications. arXiv preprint arXiv:2206.02981, 2022.
  • [3] Aurélien Bellet, Anne-Marie Kermarrec, and Erick Lavoie. D-cliques: Compensating for data heterogeneity with topology in decentralized federated learning. In Proc. of IEEE SRDS, 2022.
  • [4] Enrique Tomás Martínez Beltrán et al. Decentralized federated learning: Fundamentals, state-of-the-art, frameworks, trends, and challenges. IEEE Communications Surveys and Tutorials, 2022.
  • [5] Stephen Boyd, Persi Diaconis, and Lin Xiao. Fastest mixing markov chain on a graph. SIAM review, 46(4):667–689, 2004.
  • [6] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings, 2019.
  • [7] Jing Cao, Zirui Lian, Weihong Liu, Zongwei Zhu, and Cheng Ji. Hadfl: heterogeneity-aware decentralized federated learning framework. In Proc. of IEEE DAC, 2021.
  • [8] Yat-Tin Chow, Wei Shi, Tianyu Wu, and Wotao Yin. Expander graph and communication-efficient decentralized optimization. In Proc. of IEEE ASILOMAR, 2016.
  • [9] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [10] Anis Elgabli, Jihong Park, Amrit S Bedi, Mehdi Bennis, and Vaneet Aggarwal. Communication efficient framework for decentralized machine learning. In Proc. of IEEE CISS, 2020.
  • [11] Lie He, An Bian, and Martin Jaggi. Cola: Decentralized linear learning. Proc. of NIPS, 2018.
  • [12] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed machine learning approaching LAN speeds. In Proc. of USENIX NSDI, 2017.
  • [13] Yifan Hua, Kevin Miller, Andrea L Bertozzi, Chen Qian, and Bao Wang. Efficient and reliable overlay networks for decentralized federated learning. SIAM Journal on Applied Mathematics, 82(4):1558–1586, 2022.
  • [14] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In Proc. of ICML, 2020.
  • [15] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [16] S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 1951.
  • [17] Simon S. Lam and Chen Qian. Geographic Routing in dd-dimensional Spaces with Guaranteed Delivery and Low Stretch. In Proceedings of ACM SIGMETRICS, 2011.
  • [18] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • [19] Dong-Young Lee and Simon S. Lam. Efficient and accurate protocols for distributed delaunay triangulation under churn. In Proc. of IEEE ICNP, 2008.
  • [20] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • [21] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: A scalable and dynamic emulation of the butterfly. In Proc. of ACM PODC, 2002.
  • [22] J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In Proc. of NIPS, 2012.
  • [23] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. of PMLR AISTATS, 2017.
  • [24] Jinlong Pang, Jieling Yu, Ruiting Zhou, and John CS Lui. An incentive auction for heterogeneous client selection in federated learning. IEEE Transactions on Mobile Computing, 2022.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. Tech Report, 2017.
  • [26] Reese Pathak and Martin J Wainwright. Fedsplit: An algorithmic framework for fast federated optimization. arXiv preprint arXiv:2005.05238, 2020.
  • [27] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [28] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 31(9):3400–3413, 2019.
  • [29] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking data centers randomly. In Proc. of USENIX NSDI, 2012.
  • [30] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of ACM SIGCOMM, 2001.
  • [31] Dongyuan Su, Yipeng Zhou, and Laizhong Cui. Boost decentralized federated learning in vehicular networks by diversifying data sources. In Proc. of IEEE ICNP, 2022.
  • [32] Tao Sun, Dongsheng Li, and Bao Wang. Stability and generalization of decentralized stochastic gradient descent. In Proc. of AAAI, 2021.
  • [33] Tao Sun, Dongsheng Li, and Bao Wang. Decentralized federated averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [34] Thijs Vogels, Hadrien Hendrikx, and Martin Jaggi. Beyond spectral gap: The role of the topology in decentralized learning. arXiv preprint arXiv:2206.03093, 2022.
  • [35] Tian Wang, Yan Liu, Xi Zheng, Hong-Ning Dai, Weijia Jia, and Mande Xie. Edge-based communication optimization for distributed federated learning. IEEE Transactions on Network Science and Engineering, 9(4):2015–2024, 2021.
  • [36] Bernard M Waxman. Routing of multipoint connections. IEEE journal on selected areas in communications, 6(9):1617–1622, 1988.
  • [37] Minghui Xu, Zongrui Zou, Ye Cheng, Qin Hu, Dongxiao Yu, and Xiuzhen Cheng. Spdl: A blockchain-enabled secure and privacy-preserving decentralized learning system. IEEE Transactions on Computers, 72(2):548–558, 2022.
  • [38] Ye Yu and Chen Qian. Space Shuffle: A Scalable, Flexible, and High-Bandwidth Data Center Network. In Proc. of IEEE ICNP, 2014.
  • [39] Ye Yu and Chen Qian. Space Shuffle: A Scalable, Flexible, and High-Bandwidth Data Center Network. IEEE Transactions on Parallel and Distributed Systems, 2016.
  • [40] Shahryar Zehtabi, Seyyedali Hosseinalipour, and Christopher G Brinton. Decentralized event-triggered federated learning with heterogeneous communication thresholds. In Proc. of IEEE CDC, 2022.
  • [41] Xiaoxue Zhang, Yifan Hua, and Chen Qian. Secure decentralized learning with blockchain. In 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems (MASS), pages 116–124. IEEE, 2023.
  • [42] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • [43] Ruiting Zhou, Jinlong Pang, Zhibo Wang, John CS Lui, and Zongpeng Li. A truthful procurement auction for incentivizing heterogeneous clients in federated learning. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pages 183–193. IEEE, 2021.