
Towards Practical Overlay Networks for Decentralized Federated Learning

Yifan Hua1, Jinlong Pang1, Xiaoxue Zhang1, Yi Liu1, Xiaofeng Shi1, Bao Wang2, Yang Liu1, Chen Qian1
1University of California Santa Cruz, 2University of Utah
{yhua294, pang14, xzhan330, yliu634, xshi24, yangliu, cqian12}@ucsc.edu [email protected]
Abstract

Decentralized federated learning (DFL) uses peer-to-peer communication to avoid the single point of failure problem in federated learning and has been considered an attractive solution for machine learning tasks on distributed devices. We provide the first solution to a fundamental network problem of DFL: what overlay network should DFL use to achieve fast training of highly accurate models, low communication, and decentralized construction and maintenance? Overlay topologies of DFL have been investigated, but no existing DFL topology includes decentralized protocols for network construction and topology maintenance. Without these protocols, DFL cannot run in practice. This work presents an overlay network, called FedLay, which provides fast training and low communication cost for practical DFL. FedLay is the first solution for constructing near-random regular topologies in a decentralized manner and maintaining the topologies under node joins and failures. Experiments based on prototype implementation and simulations show that FedLay achieves the fastest model convergence and highest accuracy on real datasets compared to existing DFL solutions while incurring small communication costs and being resilient to node joins and failures.

979-8-3503-5171-2/24/$31.00 ©2024 IEEE

I Introduction

Training machine learning (ML) models using data collected by distributed devices, such as mobile and IoT devices, is crucial for modern ML. Federated learning (FL) [23, 14, 20, 26, 27, 43, 1] has become a popular ML paradigm that allows a large number of clients (end systems, edge nodes, etc.) to train ML models collaboratively without directly sharing training data. FL uses a central server or cloud to orchestrate clients for training ML models and iterates the following procedure: The server creates a global model by aggregating the local ML models collected from the clients and then sends it to clients for edge applications; the ML models are updated at the clients.

An abstraction of FL is shown in Fig. 1(a). Compared to collecting raw data from distributed devices and performing centralized ML, FL has several main advantages, including saving communication costs on limited-bandwidth devices, preserving data privacy, and being compatible with country or organization regulations that prohibit direct data sharing.

Figure 1: Federated learning vs. decentralized federated learning

However, the drawbacks of FL are also prominent and have been studied and widely mentioned in the literature [20, 11, 35]. For example, the central orchestration server that frequently exchanges models with clients clearly presents a bottleneck and becomes a typical single point of failure [20, 4]. In addition, the server is also a single point of attack: adversaries can make all clients use tampered ML models by attacking the server. There is even a risk that the server itself is malicious, which might distribute incorrect global models or collect sensitive information from the clients.

Overlay network | Decentralized construction | Node degree | Model convergence | Resilience to churn | Other comments
Ring [11] | Not discussed | 2 | Slow | Not discussed |
2D grid [11] | Not discussed | 4 | Slow | Not discussed |
Complete graph [11] | Not discussed | N-1 | Fast | Not discussed |
Dynamic chain [10] | Not discussed | 2 | Faster than ring | Not discussed |
D-Cliques [3] | Unknown | |C|-1 | Fast | Not discussed | Assumes global knowledge
Clustering [2] | Not discussed | |C|-1 | Fast | Not discussed | Bottlenecks exist
Hypercube [34] | Not discussed | O(log N) | Fast | Not discussed |
Torus [34] | Not discussed | d | Fast | Not discussed |
Ramanujan [8, 13] | Not discussed | d | Fast | Not discussed |
Random d-graph [8, 13] | Unknown | d | Fast | Not discussed | Assumes global knowledge
FedLay (this work) | Yes | d | Fast | Yes | Addresses device/data heterogeneity
TABLE I: List of overlay network topologies for DFL. N is the number of clients. d is a small constant node degree, usually around 10. |C| is the size of each cluster/clique, usually larger than d. Other than FedLay, only D-Cliques and [13] discuss construction algorithms, but both still assume global knowledge.

Decentralized federated learning (DFL) emerged recently [11, 33, 4] to resolve the above problems of FL, by removing the involvement of the central server. As shown in Fig. 1(b), in a DFL system, DFL clients form a peer-to-peer (P2P) network and keep exchanging their models using P2P communication. In most cases, the data on different clients are not identically and independently distributed (non-iid) [11, 33, 4], hence the trained local ML models are substantially different from each other. After sufficient model exchanges, the local models on the clients may converge to a model that correctly reflects the features of data from all clients.

This work focuses on a fundamental network problem of DFL: what overlay network is ideal for DFL in practice? An overlay network of DFL is a logical network on top of the physical networks. It specifies which pairs of clients should exchange their local models: two clients exchange models if they are overlay neighbors. An ideal overlay network for DFL needs to satisfy a few requirements [4], including 1) a decentralized construction protocol that can build the overlay topology; 2) fast convergence of local models to high accuracy; 3) a small node degree, so that each client keeps a low bandwidth cost by exchanging models with only a limited number of neighbors; and 4) resilience to client dynamics such as joins, leaves, and failures, collectively called churn.

Table I shows a list of overlay network topologies that have been studied for DFL. We find that most of these existing studies pay no attention to whether the proposed topologies can be constructed by decentralized protocols and remain resilient to churn. These two requirements are common networking/distributed-system problems and might not be the focus of ML researchers.

DFL cannot work as a practical system without a decentralized construction protocol for its overlay network. For example, recent work suggests that Ramanujan graphs provide fast convergence and accurate models for DFL [32, 13]. However, no decentralized construction of Ramanujan graphs is known. Centralized construction and maintenance contradict the main purpose of DFL: avoiding the single point of failure/attack.

We propose a fully decentralized overlay network for DFL, called FedLay, which achieves all four requirements discussed above, namely decentralized construction, fast convergence to accurate models, small communication cost, and resilience to churn. FedLay does not need a centralized server at any stage and all clients run the same suite of distributed protocols. The FedLay protocol suite includes two sets of protocols: 1) a set of Neighbor Discovery and Maintenance Protocols (NDMP) to build the overlay network and recover it from churn; and 2) a Model Exchange Protocol (MEP) to achieve fast model convergence for heterogeneous clients and asynchronous communication. The FedLay topology is motivated by the near-random regular topologies that have been proposed for data center networks [29, 38]. However, [29, 38] are centralized protocols for data centers and cannot be applied to DFL. To our knowledge, FedLay is the first solution for constructing near-random regular topologies in a decentralized manner and maintaining the topologies under node joins and failures. FedLay also considers other practical issues, including non-iid data and asynchronous communication with heterogeneous clients.

The contributions of this work are summarized as follows.

  • We identify three topology metrics related to DFL convergence and evaluate various overlay topologies of DFL. We find that FedLay outperforms all other topologies: as a decentralized network, FedLay has almost identical results on all three metrics to the best result among 100 regular graphs randomly generated in a centralized way.

  • We design and implement the FedLay protocol suite. To our knowledge, FedLay is the first DFL overlay network that provides decentralized protocols for construction, churn recovery, and model aggregation.

  • We evaluate FedLay using both prototype implementation and simulations on real ML datasets. We find that FedLay achieves the highest average model accuracy and fastest convergence compared to other DFL methods. It also has small communication cost and strong resilience to churn.

The rest of this paper is organized as follows. Section II presents the metrics for selecting DFL overlay topologies and the details of the topology of FedLay. The design of FedLay protocol suite, including the topology construction and maintenance protocols and model aggregation protocol, is presented in Section III. Section IV shows the evaluation results of FedLay as well as existing DFL overlay networks on real ML datasets. We present related work in Section V and conclude this work in Section VI.

II Overlay Topology of FedLay

This section presents the topology of FedLay and the intuition behind using this topology. We first explore which topology metrics can be used to evaluate the convergence speed under small node degrees. Then we design the FedLay topology and use numerical results to show its advantages. The decentralized construction and maintenance under churn will be presented in the next section.

II-A Assumptions

This work is based on the following assumptions: All the devices in FedLay are already connected to the Internet where they can directly access each other using TCP/IP. All clients train the same neural network models. Clients are honest and benign. The security problems under dishonest clients will be considered in future work.

II-B Three metrics for DFL topologies

A DFL topology can be modeled as an undirected graph G = (V, E), where each node v ∈ V represents a client in the DFL system and each link e = (u, v) ∈ E indicates that two clients u and v will exchange local ML models – u and v are thus called neighbors. We assume clients have equal roles in the overlay and similar numbers of neighbors.

II-B1 Expander property and convergence factor.

An important notion in DFL, and in general decentralized optimization algorithms, is the mixing matrix M of the graph G. The i-th row of M denotes the weights used for aggregating the local models of the neighboring nodes to update the model of the i-th client. Hence the adjacency matrix of an overlay network and its Metropolis-Hastings matrix are both mixing matrices [5]. The symmetry of M implies that its eigenvalues are real and can be sorted in non-increasing order. Let λ_i(M) denote the i-th largest eigenvalue of M; then λ_1(M) = 1 > λ_2(M) ≥ ⋯ ≥ λ_N(M) > −1 based on the spectral property of the mixing matrix [5]. The constant λ = λ(M) := max{|λ_2(M)|, |λ_N(M)|} has been used to characterize the optimization error (a measure of training loss) and the generalization gap (a measure of test accuracy) of DFL. In particular, for a typical DFL framework, Decentralized Federated Averaging (DFedAvg), the optimization error and generalization gap are bounded, respectively, by O(1/(1−λ)²) and O(2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ)) in terms of λ [33, 13]. Notice that both 1/(1−λ)² and 2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ) are increasing functions of λ ∈ (0, 1).

Per the above discussion, to achieve good convergence and generalization, a topology needs a λ sufficiently smaller than 1 and hence small values of 1/(1−λ)² and 2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ). Thus we define the first topology metric, called the convergence factor of G: c_G = 1/(1−λ)².

Note that when 1/(1−λ)² is minimized, 2λ² + 4λ² ln(1/λ) + 2λ + 2/ln(1/λ) is also minimized. Hence, for the sake of simplicity, we do not need another factor.
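To make the convergence factor concrete, the following minimal sketch (our illustration, not the authors' implementation) computes c_G from the Metropolis-Hastings mixing matrix of a given topology; numpy and networkx are assumed dependencies.

```python
# Illustrative sketch: compute the convergence factor c_G = 1/(1-lambda)^2
# of a topology from its Metropolis-Hastings mixing matrix (an assumption;
# any valid symmetric mixing matrix could be used instead).
import numpy as np
import networkx as nx

def metropolis_hastings_matrix(G: nx.Graph) -> np.ndarray:
    """Symmetric, doubly stochastic mixing matrix of graph G."""
    nodes = list(G.nodes())
    index = {v: i for i, v in enumerate(nodes)}
    M = np.zeros((len(nodes), len(nodes)))
    for u, v in G.edges():
        i, j = index[u], index[v]
        M[i, j] = M[j, i] = 1.0 / (1 + max(G.degree(u), G.degree(v)))
    for i in range(len(nodes)):          # self-weights make each row sum to 1
        M[i, i] = 1.0 - M[i].sum()
    return M

def convergence_factor(G: nx.Graph) -> float:
    """c_G with lambda = max(|lambda_2|, |lambda_N|) of the mixing matrix."""
    eig = np.sort(np.linalg.eigvalsh(metropolis_hastings_matrix(G)))[::-1]
    lam = max(abs(eig[1]), abs(eig[-1]))
    return 1.0 / (1.0 - lam) ** 2

# Example: one of the random 10-regular graphs used for "Best" in Sec. II-C.
print(convergence_factor(nx.random_regular_graph(10, 300, seed=0)))
```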

II-B2 Network diameter.

The diameter of a network is the maximum length over all shortest paths in the network. It reflects the network distance between the two most distant nodes. The intuition of considering this metric is that the diameter represents the maximum latency for a local model, trained on the data of one client, to propagate to all clients in the network.

II-B3 Average length of shortest paths.

The third metric is the average length of all shortest paths in the network. The intuition of considering this metric is that the average length represents the average latency for a local model to propagate to a random client.
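Both distance metrics can be computed directly on a graph object; a short sketch, again assuming networkx:

```python
# Illustrative sketch of the two distance metrics on the same graph object.
import networkx as nx

G = nx.random_regular_graph(10, 300, seed=0)
print(nx.diameter(G))                      # longest of all shortest paths
print(nx.average_shortest_path_length(G))  # average over all node pairs
```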

II-C FedLay topology

The FedLay topology is motivated by the research on near-random regular graphs for data center networks [29, 38, 39] and DFL [8, 13]. Recent theoretical studies show that Ramanujan graphs provide small values of the spectral expander property λ and hence achieve ‘optimal’ convergence with a constant node degree d [13]. However, a large Ramanujan graph cannot be generated even by centralized construction. Hence, random regular graphs (RRGs) can be used instead, which are approximately Ramanujan for a large network size n [8]. In addition, prior research on data center networks [29] shows that near-RRGs achieve the smallest average length of shortest paths among known graphs with a fixed node degree d. Note that RRGs cannot be generated by any deterministic algorithm either. Hence, near-RRGs, which are considered close enough to RRGs, are usually used in practice [29, 38]. Near-RRGs achieve near-ideal values on both the convergence factor and shortest path lengths.

Figure 2: An example of FedLay topology

A practical problem is that all existing near-RRGs are constructed by centralized methods [29, 38, 39, 13]. The main challenge of a decentralized construction of near-RRGs is to allow a node to select a neighbor from all other nodes with equal likelihood while this node does not know the entire network. We propose to use a virtual coordinate system to solve this problem, inspired by [38]. Note that [38] is a centralized protocol and our main contribution is a decentralized construction for DFL.

In FedLay, each node computes a set of virtual coordinates C, which is an L-dimensional vector ⟨x_1, x_2, ..., x_L⟩ where each element x_i is a random real number in the range [0, 1). In practice, x_i can be computed as H(IP_x | i), where H is a publicly known hash function and IP_x is x's IP address.

We define L virtual ring spaces. In the i-th ring space, a node is virtually placed on a ring based on the value of its i-th coordinate x_i. The coordinates of each space are circular, with 0 and 1 superposed at the top-most point of the ring and 0.5 at the bottom-most point. If the coordinates of two different nodes are identical in one space, their order on the ring is determined by the values of their IP addresses. For ease of presentation, we assume all coordinates on a ring are different. As shown in the example in Fig. 2, there are 8 nodes and each of them computes a set of two-dimensional random coordinates ⟨x_1, x_2⟩. There are two virtual ring spaces, shown in Figs. 2(a) and (b), and every node is placed on a position of the i-th ring based on its random coordinate x_i. Note that all spaces are virtual and have no relationship to the geographic locations of the nodes.

In each virtual space, every node u has two adjacent nodes on the ring, based on the order of their coordinate values. u will use the adjacent nodes from all spaces as its overlay neighbors (found by a decentralized protocol described later) for model exchange. In the example of Fig. 2(c), every node finds its adjacent nodes in the two spaces and they form the FedLay overlay. Most nodes have four neighbors in the overlay, but a few, like node B, have only three neighbors because D is adjacent to B on both rings. Hence, for L spaces, every node has at most 2L neighbors. L can thus be considered a parameter that trades off communication for convergence: with a bigger L, nodes have more neighbors for model exchanges but a higher communication cost.
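For illustration only, the sketch below builds the FedLay topology in a centralized way from the coordinates described above (the decentralized construction appears in Sec. III); the concrete hash scheme for H(IP_x | i) is our assumption.

```python
# Centralized sketch of the FedLay topology: each node hashes its IP into L
# ring coordinates and its overlay neighbors are its adjacent nodes on every ring.
import hashlib

def coordinates(ip: str, L: int) -> list[float]:
    """L virtual coordinates in [0, 1), one per ring space (assumed hash scheme)."""
    return [int(hashlib.sha256(f"{ip}|{i}".encode()).hexdigest(), 16) / 2**256
            for i in range(L)]

def fedlay_neighbors(ips: list[str], L: int) -> dict[str, set[str]]:
    """Neighbor sets of a correct FedLay overlay (Definition 1)."""
    coords = {ip: coordinates(ip, L) for ip in ips}
    neighbors = {ip: set() for ip in ips}
    for i in range(L):
        # Order nodes on the i-th ring by coordinate, breaking ties by IP address.
        ring = sorted(ips, key=lambda ip: (coords[ip][i], ip))
        for k, ip in enumerate(ring):
            neighbors[ip].update({ring[(k - 1) % len(ring)], ring[(k + 1) % len(ring)]})
    return neighbors

nbrs = fedlay_neighbors([f"10.0.0.{k}" for k in range(1, 9)], L=2)
print(max(len(s) for s in nbrs.values()))   # every node has at most 2L = 4 neighbors
```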

Figure 3: Comparisons of network topologies on the three metrics discussed in Sec. II-B: (a) convergence factor, (b) network diameter, (c) average shortest path length.

The detailed decentralized construction and maintenance will be discussed in the next section. We now show that the FedLay topology is close to the optimal choice for a given node degree d, measured by the three metrics discussed in Section II-B. We evaluate the FedLay topology by comparing it with the following existing topologies.

  1. Best of 100 randomly generated d-regular graphs (“Best”). We randomly generate 100 d-regular graphs and measure them on the three metrics. For each metric, we take the optimal value among the 100 graphs and consider it the optimal value of practical topologies for this metric. Of course, there is no decentralized protocol to construct such a “Best” topology.

  2. Chord [30]. Chord is a well-known peer-to-peer overlay network serving the function of a distributed hash table (DHT). It has an O(log n) degree and can be constructed and maintained by decentralized protocols.

  3. Viceroy [21]. Viceroy is a peer-to-peer overlay network with a constant degree, inspired by the classic Butterfly network used for super-computing. Its main objective is to minimize congestion by overlay routing. It can be constructed and maintained by decentralized protocols.

  4. Distributed Delaunay Triangulation (DT) [19, 17]. DT is an overlay network with a constant degree that supports greedy routing. It can be constructed and maintained by decentralized protocols.

  5. Waxman network [36]. Waxman is a network model that simulates connections with physical proximity: nodes with a close geographic distance are likely to connect. There is no known decentralized construction of Waxman.

  6. Social network [22]. We use a social network topology of Facebook users collected by [22]. This is a typical example of overlay networks that rely on information from other application channels.

Fig. 3 shows the empirical results comparing all the above-mentioned topologies on the three metrics: the convergence factor, the diameter, and the average length of shortest paths. For all metrics, smaller values are more desirable. For a fair comparison, each network includes 300 nodes; for the social network, we sample 300 nodes from the original topology. We vary the node degree from 4 to 14 for both “Best” and FedLay. The other networks do not support flexible node degrees, so the result of each of those topologies is shown as a single dot in each figure.

We summarize our findings from these results as follows. “Best” always provides the optimal result for every metric. The results of FedLay are extremely close to “Best”: most points of FedLay are superposed with those of “Best”, with only a few exceptions. All other topologies are much less optimal than FedLay.

The results of both DT and Waxman are much less optimal than those of FedLay. The main reason is that both topologies connect neighbors with short distances; hence, it may take a long latency for information (local models) to propagate between two remote nodes. The convergence factors of both Viceroy and the social network are close to that of FedLay, but their diameters and average shortest path lengths are much longer than those of FedLay. The node degree of Chord is larger than that of most other constant-degree networks, because it needs a node degree of 2 log n. The convergence factor of Chord is very high, but its diameter and average shortest path length are low, due to its high node degree.

Since “Best” represents simulated optimal values rather than a realistic network topology, FedLay achieves the best results on all three metrics among existing practical overlay topologies.

Definition 1 (A correct FedLay overlay).

We define that a FedLay network is correct if every node u has a neighbor set N_u such that N_u includes the adjacent nodes of u in all L virtual ring spaces and does not include any other nodes. Each node also knows the virtual coordinates of all its neighbors.

FedLay uses random coordinates to achieve near-random sampling of other nodes and hence generates a near-random-regular graph.

Understanding the FedLay topology. Since all coordinates in the FedLay topology are randomly computed, in each virtual ring space the two adjacent nodes of a node u are effectively random samples from the set of all nodes. Therefore, all neighbors of u in FedLay are randomly selected and all other nodes have approximately equal likelihood of being selected. Hence, FedLay approximates an RRG, which, as shown in past research [29, 38, 39, 13], provides optimal results on both the convergence factor and shortest paths. So why are the coordinates in virtual spaces necessary? The main difficulty of generating an RRG in a decentralized manner is that a node cannot find neighbors by randomly sampling all other nodes with equal likelihood if it does not know the whole network. The key idea for overcoming this difficulty is to use the random coordinates to order all nodes in each virtual ring space. The ring coordinates allow every newly joining node to use greedy routing to find its two adjacent nodes in every virtual space within a small number of routing hops. This adjacent-node-discovery process achieves near-random sampling of other nodes. In addition, the coordinates also help the network recover from node failures.

III Design of FedLay Protocols

III-A Overview

Figure 4: FedLay protocol suite includes two sets of protocols: 1) Neighbor Discovery and Maintenance Protocols; 2) Model Exchange Protocol.

The key features of FedLay are that it is completely decentralized and all nodes are self-organized: the FedLay protocol suite allows nodes to join, leave, and fail in the network while still maintaining a correct topology and every node only stores its neighbor information.

As shown in Fig. 4, the FedLay protocol suite running on a client consists of two sets: 1) Neighbor Discovery and Maintenance Protocols (NDMP); and 2) Model Exchange Protocol (MEP). Both sets of protocols exchange messages with other nodes in the network. The objective of NDMP running on node uu is to allow uu to find its correct neighbors during the join of uu and maintain correct neighbors under network dynamics. Hence, NDMP can be considered as control protocols to construct the correct FedLay network. The objective of MEP is to decide when to exchange the local models with the neighbors and how to process the received models. Hence, MEP can be considered as an application protocol to optimize the model convergence of DFL. Both NDMP and MEP use TCP with reliable delivery.

III-B Neighbor Discovery and Maintenance Protocols (NDMP)

NDMP includes join, leave, and maintenance protocols. The join protocol is run by each new node joining the FedLay network. It ensures all nodes will find the correct neighbors after the node joins the network. The leave protocol is run by each node that is preparing to leave the network. It ensures all remaining nodes keep the correct neighbors after the leave. The maintenance protocol is run by every node periodically to detect potential failed or wrong neighbors and fix these errors.

III-B1 NDMP join protocol

The join protocol is designed to achieve the following correctness property: given an existing correct FedLay overlay network, a new joining node runs the join protocol; when the join protocol finishes, the joining node is guaranteed to find its correct neighbors, i.e., its adjacent nodes in all virtual spaces. This property will be proved later, and it ensures that a correct FedLay overlay is maintained after every node join: if a FedLay network with n nodes is correct, then after a new node joins, the (n+1)-node network is also correct. This provides a way to recursively construct a large overlay network correctly from a small-scale network with even 2 or 3 nodes. Note that such a recursive property is a key module for ensuring the correctness of many P2P overlays such as Chord [30] and the distributed DT graph [19, 17].

FedLay builds upon the concept of random virtual coordinates and circular distance introduced by SpaceShuffle [38], extending it to achieve a fully decentralized topology construction. Unlike [38], where an administrator is required to access each node and switch, FedLay allows any client to join the network through any existing node, enhancing scalability and ease of use. Additionally, we optimize the NDMP leave protocol to minimize overhead by ensuring that maintenance operations are only triggered when necessary, thus reducing the resource consumption during network changes.

When a new node u joins the existing correct FedLay network, we assume u knows one existing node v in the overlay, which can be an arbitrary node – this is the minimum assumption for any overlay network: if u knows no existing node, it has no way to join any overlay. u first computes a random coordinate as its position in the first virtual space, say x_1^u. u then asks v to send a Neighbor_discovery message into the current FedLay network using greedy routing toward the destination location x_1^u. Neighbor_discovery also includes u's IP address.

We first define the concept of circular distance, which is a metric used in greedy routing of FedLay.

Definition 2 (Circular distance).

The circular distance of two coordinates x and y in the same ring space, 0 ≤ x, y < 1, is:

CD(x, y) = min{|x − y|, 1 − |x − y|}.

For two coordinates x and y on a ring, the circular distance is the length of the smaller arc between x and y, normalized by the perimeter of the ring, which is 1. We say x is closer to y than w in a ring space if CD(x, y) < CD(w, y). If x and w have the same circular distance to y, we break the tie toward the one of x and w with the smaller IP address (considering each IP address as a 32-bit value). Hence, there is only one node that is closest to a given coordinate x.

Upon receiving Neighbor_discovery, a node v executes the greedy routing protocol toward the destination location x_i^u in Space i as follows:

  1. Node v finds a neighbor w such that w's coordinate in Space i, x_i^w, has the smallest circular distance to x_i^u among all neighbors of v.

  2. If CD(x_i^u, x_i^v) > CD(x_i^u, x_i^w), v forwards the Neighbor_discovery message to w.

  3. If CD(x_i^u, x_i^v) ≤ CD(x_i^u, x_i^w), Neighbor_discovery stops at v. From v's two adjacent nodes, v finds the adjacent node p such that x_i^u is on the smaller arc between v and p. Then v sends a message to u telling u that v and p are u's adjacent nodes on this virtual ring, and u adds v and p to its neighbor set.

The greedy routing presented above makes each node forward Neighbor_discovery to its neighbor that has the shortest circular distance to the destination location x_i^u. When a node v cannot find a neighbor that is closer to x_i^u than v itself, v must be the node that has the shortest circular distance to x_i^u among all nodes in FedLay (formally proved later). Hence, v and one of its adjacent nodes p will be u's neighbors.
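The per-hop decision of this greedy routing can be sketched as follows; the function and variable names are illustrative, and each node is assumed to know only its own neighbors' coordinates.

```python
# Illustrative sketch of one greedy routing hop of Neighbor_discovery.
def circular_distance(x: float, y: float) -> float:
    """Definition 2: length of the smaller arc between x and y on a unit-perimeter ring."""
    d = abs(x - y)
    return min(d, 1.0 - d)

def greedy_next_hop(v: str, dest: float, space: int,
                    coords: dict[str, list[float]],
                    neighbors: dict[str, set[str]]):
    """Return the neighbor to forward to, or None if the message stops at v."""
    # Step 1: the neighbor whose coordinate in this space is closest to the destination
    # (ties broken by node identifier, mirroring the IP-address tie-break).
    w = min(neighbors[v], key=lambda n: (circular_distance(coords[n][space], dest), n))
    # Steps 2-3: forward only if w is strictly closer than v itself.
    if circular_distance(coords[v][space], dest) > circular_distance(coords[w][space], dest):
        return w
    return None  # v is closest: v and one of its adjacent nodes become u's neighbors
```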

Figure 5: An example of the FedLay join protocol.

We show an example of a FedLay join in Fig. 5. u joins FedLay and knows an existing node H in the network. u computes a random coordinate 0.15 and asks H to run greedy routing of Neighbor_discovery toward 0.15. H forwards Neighbor_discovery to B, which is closest to 0.15 in the space among all of H's neighbors. Eventually, the message arrives at G, the node closest to 0.15 in the space, and G tells u to add G and D as neighbors. Note that greedy routing has a much smaller hop count than traversing nodes one after another around the ring, because there are many shortcuts such as the link H-B.

We have the following property.

Lemma 1.

In a ring space of a correct FedLay network with a given coordinate x, if a node v is not the node that has the smallest circular distance to x in the space, then v must have an adjacent node w on the ring such that CD(x, x^v) > CD(x, x^w), where x^v denotes v's coordinate.

Proof.

Let p be the node with the smallest circular distance to x in the space. Consider the two arcs between x and x^v. At least one of them has a length no longer than 0.5; let that arc be arc(x^v, x), with length L(arc(x^v, x)). Then CD(x^v, x) = L(arc(x^v, x)) ≤ 0.5.

If p is on arc(x^v, x), consider the arc between v and p, arc(x^v, x^p), as a part of arc(x^v, x). If v has an adjacent node q whose coordinate is on arc(x^v, x^p), then L(arc(x, x^q)) < L(arc(x^v, x)) ≤ 0.5. Hence, CD(x^v, x) > CD(x^q, x). The lemma holds in this situation. If v has no adjacent node on arc(x^v, x^p), then there is no node on arc(x^v, x^p) at all; hence, v and p are adjacent nodes. Since p is the node closest to x, v does have an adjacent node, p, that is closer to x. The lemma also holds in this situation.

If p is not on arc(x^v, x), consider the arc arc(x^v, x, x^p). The arc arc(x, x^p) is a part of arc(x^v, x, x^p), and we have L(arc(x, x^p)) < L(arc(x^v, x)) because p is the closest node to x. If v has an adjacent node q whose coordinate is on arc(x^v, x, x^p), then L(arc(x, x^q)) < L(arc(x^v, x)) ≤ 0.5. Hence, CD(x^v, x) > CD(x^q, x). The lemma holds in this situation. If v has no adjacent node on arc(x^v, x, x^p), then v and p are adjacent nodes. Since p is the node closest to x, v does have an adjacent node, p, that is closer to x. The lemma also holds in this situation.

Therefore in all cases, the lemma holds. ∎

Note that every adjacent node of v is its neighbor in a FedLay network. This lemma shows that if v is not the node closest to the destination coordinate x_i^u, then the greedy routing algorithm must execute Step 2 and forward the message to a neighbor. Hence, if the greedy routing algorithm reaches Step 3 and stops at v, v must be the node that has the smallest circular distance to x_i^u in the space. So we have,

Theorem 1.

In a FedLay network, when a Neighbor_discovery message to the destination coordinate x stops at a node v, v must be the node that has the smallest circular distance to x.

The Neighbor_discovery message will stop at the node v closest to x_i^u, and v must be an adjacent node of the joining node u, because no other node is closer to u's coordinate x_i^u. v also knows the other adjacent node w of u, because w is a current adjacent node of v. v will send a message to u by TCP including the information of v and w. u will then add v and w to its neighbor set.

The joining node u can find all its neighbors by running the above join protocol in all spaces. Therefore, if u joins a correct FedLay network, the new FedLay network after this join is also correct.

In some extreme cases, multiple nodes could join the network simultaneously. This situation is handled by both the join and maintenance protocols.

III-B2 NDMP leave protocol

Figure 6: An example of the FedLay leave protocol.

The leave protocol of NDMP is quite straightforward. When a user wants to leave and closes the FedLay client program, for every virtual space, the leaving node sends messages to its two adjacent nodes and tells them to add each other to their neighbor sets. As shown in Fig. 6, node G wants to leave the network and tells its two adjacent nodes A and D about its leaving. A and D then consider each other adjacent nodes and neighbors. In the other space, G's two adjacent nodes F and C also add each other to their neighbor sets. Hence, the new FedLay network after a node leave is also correct.

Figure 7: An example of the maintenance protocol.

III-B3 NDMP maintenance protocol

In addition to planned leaves, an overlay network may also experience node failures due to various reasons such as Internet service outages and end-system failures. A failed node disappears without notice. To detect and fix these situations, each node in FedLay also runs the maintenance protocol.

The maintenance protocol requires every node to send each of its neighbors a heartbeat message periodically. Suppose the time period between two heartbeat messages is T. If a node p has not received any heartbeat message from a neighbor u for 3T time, it considers that u has failed. p then sends a Neighbor_repair message by greedy routing in the opposite direction of u in the virtual space i where u and p are adjacent nodes. We use the example in Fig. 7 to explain this protocol. In Fig. 7(a), node A detects the failure of G. Since G is the adjacent node on A's clockwise side, A sends a Neighbor_repair message by greedy routing in the counterclockwise direction (the opposite direction of G). By “counterclockwise direction”, we require that all hops of this greedy routing, A-E-B-D in this example, follow the counterclockwise order.

We give a formal description of routing in the counterclockwise direction. Upon receiving a Neighbor_repair message to x_i^u in Space i, a node v runs the following algorithm:

  1. Node v considers the subset of its neighbors such that for every neighbor w in the subset, w's coordinate in Space i, x_i^w, satisfies x_i^w < x_i^v or x_i^w > x_i^u. For each such neighbor w, consider the length of the arc from x_i^w to x_i^u in the counterclockwise direction, L(arc(x_i^w, x_i^u)). From this subset, node v finds the neighbor w′ whose coordinate x_i^{w′} has the smallest arc length L(arc(x_i^{w′}, x_i^u)) among all neighbors in the subset.

  2. If L(arc(x_i^v, x_i^u)) > L(arc(x_i^{w′}, x_i^u)), v forwards Neighbor_repair to w′.

  3. If L(arc(x_i^v, x_i^u)) < L(arc(x_i^{w′}, x_i^u)), Neighbor_repair stops at v. v then tells p that v is p's adjacent node on this virtual ring, and p adds v to its neighbor set.

When Neighbor_repair stops, it has arrived at the other adjacent node of G before G's failure (stated in a later theorem). Then G's previous two adjacent nodes can connect to each other.

In Fig. 7(a), node D also detects G's failure, independently of A's detection. D then sends a Neighbor_repair message in the clockwise direction, and the message travels along the path D-F-A. The algorithm to forward Neighbor_repair in the clockwise direction is specified in a similar way, by changing the subset selection condition in Step 1 to “x_i^w > x_i^v or x_i^w < x_i^u”.

We can prove the following theorem:

Theorem 2.

Consider a correct FedLay network in which a node u fails. When a node p detects the failure of its adjacent node u in Space i, it sends Neighbor_repair to the destination coordinate x_i^u in the opposite direction of u. When Neighbor_repair stops at a node q, q is the other adjacent node of u in Space i before u's failure.

Proof.

We first discuss the case in which Neighbor_repair is sent in the counterclockwise direction. The clockwise case can be proved in the same way.

Let q be the other adjacent node of u in Space i before u's failure.

Let v be the current node that receives Neighbor_repair. If v ≠ q, v has at least one neighbor w such that L(arc(x_i^w, x_i^u)) < L(arc(x_i^v, x_i^u)). Hence, Neighbor_repair will not stop if v ≠ q; v will forward Neighbor_repair to the neighbor w′ with L(arc(x_i^{w′}, x_i^u)) ≤ L(arc(x_i^w, x_i^u)) < L(arc(x_i^v, x_i^u)).

Therefore, after each hop, the arc length L(arc(x_i^v, x_i^u)) strictly decreases. Hence, Neighbor_repair never loops. Since there is a finite number of nodes, Neighbor_repair will stop at q. ∎

Based on the theorem, in every virtual space, the two adjacent nodes of the failed node u can find and connect to each other. Hence, if a node fails in a correct FedLay network, FedLay is still correct after the node failure.

Neighbor repair for concurrent joins and failures. Note that the above property cannot be proved for multiple failures that happen at the same time, called concurrent failures. For concurrent joins and failures, we allow each node u to periodically send two Neighbor_repair messages with destination x_i^u in both the counterclockwise and clockwise directions in every virtual space i, even without detecting any neighbor failure. For each node v in u's neighbor set, if u and v are indeed adjacent in a virtual space, v will receive a Neighbor_repair message that stops at v. If a Neighbor_repair message stops at a node w that is not in u's neighbor set, w and u will add each other to their neighbor sets. In fact, there is no way to prove such a property for concurrent joins and failures in any structured P2P network [30, 17]. Hence, we conduct experiments with extreme concurrent joins and failures, and the above method always allows the network to recover to a correct FedLay.

III-C Model Exchange Protocol (MEP)

One key challenge of DFL is that there is no central server to evaluate the quality of models from different clients. In P2P model exchanges, a client with low-quality local models can ‘infect’ neighbors that hold high-quality models, and these errors may propagate further in the overlay. MEP is designed to limit the impact of low-quality models and amplify the impact of high-quality models in a decentralized way. We consider two practical issues in DFL systems. 1) Data heterogeneity [11, 4, 13]. It is well known that the local data of different clients are usually non-iid due to geographic and environmental diversity; hence, their models have different accuracy. 2) Client heterogeneity [40, 7, 24]. Clients of DFL may have different bandwidth and computing capacities, and thus different model exchange frequencies. Unlike previous work on DFL that usually assumes homogeneous clients [11, 10, 3, 34, 8, 13], MEP allows each client to set different parameters based on these two dimensions of heterogeneity and guides clients in exchanging models with their neighbors through multiple design components. We present three main components in detail: 1) setting confidence parameters; 2) setting the model exchange period; 3) model de-duplication.

Dataset | Task | Model | Model size per client | Comm. period for medium-cap. clients
MNIST | Image classification | MLP | 247 KB | 5 min
CIFAR-10 | Image classification | CNN | 1.1 MB | 10 min
Shakespeare | Next-character prediction | LSTM | 23.4 MB | 40 min
TABLE II: Datasets used in evaluation

III-C1 Asynchronous model exchange

Previous work assumes synchronous, round-based communication, in which all clients use the same time period to exchange models with neighbors [11, 10, 3, 34, 8, 13]. However, due to client heterogeneity, some low-resource clients may become “stragglers” that fail to finish model exchanges within the given time period, while powerful or newly joined clients prefer a shorter time period (i.e., a higher frequency) of model exchanges.

MEP uses asynchronous communication and allows each client u to use a different communication time period T_u. T_u can be set in two ways. 1) Coarse-grained settings: each client may configure a period based on its device and communication type, for example, Server-LAN, PC-LAN, Laptop-WLAN, Phone-LTE, and IoT-WLAN. These values are pre-specified in the client program. 2) Fine-grained settings: based on monitoring of its available bandwidth and computing resources, client u estimates the minimum time duration T_{u,min} needed to produce an updated ML model and transmit it to all neighbors. Its communication period is then T_u = η · T_{u,min} for a constant η > 1.

For two neighbors with periods T_u and T_v, their model exchange period is set to max(T_u, T_v). Hence, a client may have different exchange periods with different neighbors.
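A small sketch of how these periods could be set and combined is shown below; the numeric defaults and the value of η are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of setting the per-client period T_u and the per-neighbor
# exchange period; the numeric defaults and eta below are assumed values.
COARSE_PERIODS_MIN = {
    "Server-LAN": 5, "PC-LAN": 5, "Laptop-WLAN": 10, "Phone-LTE": 20, "IoT-WLAN": 40,
}

def fine_grained_period(t_u_min: float, eta: float = 1.5) -> float:
    """T_u = eta * T_u,min with eta > 1, leaving headroom over the measured minimum."""
    return eta * t_u_min

def exchange_period(t_u: float, t_v: float) -> float:
    """Two neighbors exchange models at the slower of their two periods."""
    return max(t_u, t_v)

# Example: a PC on a LAN paired with an IoT device exchanges every 40 minutes.
print(exchange_period(COARSE_PERIODS_MIN["PC-LAN"], COARSE_PERIODS_MIN["IoT-WLAN"]))
```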

Figure 8: Topology correctness under churn and message cost: (a) correctness for concurrent joins, (b) correctness for concurrent failures, (c) message cost.

III-C2 Set confidence parameters

One key innovation in MEP is the introduction of confidence parameters. Each node has a set of confidence parameters that represent its self-evaluation of its local model accuracy.

We define the data divergence confidence c_d on a client u:

c_d^u = 1 / exp(DKL(D_loc || D_std))

where DKL(·) is the Kullback-Leibler divergence [16] that evaluates the statistical distance between two probability distributions, D_loc denotes the local data distribution, and D_std denotes the estimated iid distribution of the dataset. The uniform distribution is widely used [42, 28] to estimate the iid distribution because the majority of publicly available classification datasets, such as MNIST [18] and CIFAR-10 [15], follow uniform label distributions. The Kullback-Leibler divergence effectively represents the richness of a local dataset. c_d ∈ (0, 1], and a higher value represents a higher quality of local data and local models.

In addition, we define the communication confidence c_c on a client u: c_c^u = 1 / T_u. The intuition of using c_c is that when a client exchanges models with its neighbors more frequently, its models are more likely to have higher quality.

Hence, the overall confidence of client u is

c^u = α_d · c_d^u / max(c_d) + α_c · c_c^u / max(c_c)

where max(c_d) and max(c_c) are the maximum values of c_d and c_c, respectively, among all of u's neighbors. α_d and α_c are two constants that balance the weights of the two confidence parameters; they can simply be set to 0.5 and 0.5. We tried a variety of combinations of α_d and α_c, and in all cases FedLay achieves fast model convergence on different nodes.

The models from u's neighbors are aggregated as follows:

ω^u = ( Σ_{j ∈ N ∪ {u}} c^j ω^j ) / ( Σ_{j ∈ N ∪ {u}} c^j )

The above aggregation is computed once every local time period T_u, and the models from each neighbor are always the most recent ones received from that neighbor. In this way, clients with low confidence in their model accuracy have less impact on other clients. In this work, we do not consider the situation where a client intentionally sets a large confidence value to mislead other clients.
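A minimal sketch of the confidence computation and the aggregation above, assuming a classification task with a known label set and local models flattened into numpy vectors; variable names mirror the equations, and the code is illustrative rather than the authors' implementation.

```python
# Illustrative sketch of MEP's confidence-weighted aggregation.
import numpy as np

def data_confidence(label_counts: np.ndarray) -> float:
    """c_d = 1 / exp(DKL(D_loc || D_std)), with D_std taken as the uniform distribution."""
    p = label_counts / label_counts.sum()          # local label distribution D_loc
    q = np.full(len(p), 1.0 / len(p))              # assumed iid distribution D_std
    kl = sum(pi * np.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return 1.0 / np.exp(kl)

def overall_confidence(c_d: float, c_c: float, max_cd: float, max_cc: float,
                       alpha_d: float = 0.5, alpha_c: float = 0.5) -> float:
    """c^u = alpha_d * c_d/max(c_d) + alpha_c * c_c/max(c_c), maxima over u's neighbors."""
    return alpha_d * c_d / max_cd + alpha_c * c_c / max_cc

def aggregate(models: dict[str, np.ndarray], confidence: dict[str, float]) -> np.ndarray:
    """omega^u = sum_j c^j * omega^j / sum_j c^j over u and its neighbors."""
    total = sum(confidence.values())
    return sum(confidence[j] * models[j] for j in models) / total
```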

III-C3 Model fingerprinting

In the neighbor set of each client, in addition to the IP addresses, coordinates, and confidence values of the neighbors, the client also stores the fingerprint f of the most recent models received from each neighbor, computed by hashing the models with a public hash function. Before starting a model exchange, the client first sends the fingerprint to its neighbor. If the neighbor finds that the fingerprint matches the models about to be sent, the neighbor considers the models a duplicate of a copy sent earlier and does not send them. This method reduces unnecessary traffic from exchanging duplicated models.
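One possible way to compute such a fingerprint for a PyTorch model is sketched below; the serialization order and the choice of hash function are our assumptions.

```python
# Illustrative sketch of model fingerprinting for de-duplication.
import hashlib
import torch

def model_fingerprint(model: torch.nn.Module) -> str:
    """Hash the serialized parameters so a neighbor can detect a duplicate model."""
    h = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        h.update(name.encode())
        h.update(tensor.detach().cpu().numpy().tobytes())
    return h.hexdigest()

# A client sends only the fingerprint first; the neighbor skips the transfer when it
# matches the fingerprint of the last model it already received from this client.
```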

IV Performance Evaluation

IV-A Evaluation methodology

IV-A1 Three types of evaluation

We conduct three types of evaluation of FedLay for different scales of DFL networks.

1) Real experiments. We conduct experiments with real packet exchanges and data training. We deploy 16 instances to public clouds (we used both Oracle OCI and Amazon EC2), each with a 2GHz CPU and 2GB RAM. Each instance is connected to the Internet and runs a FedLay client. Each client sends and receives NDMP and MEP messages using TCP. Clients train ML models on their local datasets with Pytorch [25] and exchange the models with active neighbors. There is no central server for any purpose, and the system is completely decentralized. The purpose of this type of experiment is to present a prototype and demonstrate that FedLay can run with real ML data training in practice.

2) Medium-scale emulation with real data training. In this type of experiment, we use real data training and simulated packet exchanges as discrete events to evaluate networks with up to 100 clients. The simulation and the real data training and testing run on a machine with an NVIDIA GeForce RTX 3080 graphics card for training acceleration. The purpose of this type of experiment is to evaluate the overlay construction/maintenance, model accuracy, convergence speed, and message cost of FedLay and other DFL methods.

3) Large-scale simulation with trained models. For more than 100 clients, conducting all data training on a few machines takes a very long time. For networks with 200 to 1000 nodes, we re-use the models trained from the above two types of experiments and assign them to the simulated clients. Packet exchanges are simulated as discrete events.

Figure 9: Model accuracy from 16-client real experiments in Amazon EC2: (a)-(c) accuracy vs. time (minutes) for MNIST, CIFAR-10, and Shakespeare; (d)-(f) CDF of accuracy for MNIST, CIFAR-10, and Shakespeare.

IV-A2 ML datasets and models

We evaluate the performance of FedLay on three ML tasks: 1) a Multilayer Perceptron (MLP) for digit classification on the MNIST dataset [9]; 2) Convolutional Neural Networks (CNN) for image classification on the CIFAR-10 dataset [15]; 3) Long Short-Term Memory (LSTM) for next-character prediction on the Shakespeare dataset [6] built from The Complete Works of William Shakespeare. Details are described in Table II. All three are standard datasets for FL benchmarks [6].

Learning with non-iid data. We generate non-iid MNIST and CIFAR-10 datasets by selecting limited labels for the local training sets using the sharding method. Each shard contains only one label, and each local dataset includes a limited number of shards, resulting in a non-iid distribution and heterogeneity among clients’ local datasets. For large-scale simulations, data among clients may overlap, but in real experiments and medium-scale simulations, clients have unique data. In the original Shakespeare dataset, each speaking role in a play is considered a unique shard.
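A minimal sketch of the label-sharding procedure described above for MNIST and CIFAR-10; shard sizes and the random assignment are illustrative choices.

```python
# Illustrative sketch: split a labeled dataset into single-label shards and
# give each client a few shards, producing non-iid local datasets.
import random
from collections import defaultdict

def make_shards(labels: list[int], shards_per_label: int) -> list[list[int]]:
    """Each shard contains sample indices of exactly one label."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    shards = []
    for idxs in by_label.values():
        random.shuffle(idxs)
        size = len(idxs) // shards_per_label
        shards += [idxs[k * size:(k + 1) * size] for k in range(shards_per_label)]
    return shards

def assign_shards(shards: list[list[int]], num_clients: int, shards_per_client: int):
    """Each client's local dataset is the union of a few single-label shards."""
    random.shuffle(shards)
    return [sum(shards[c * shards_per_client:(c + 1) * shards_per_client], [])
            for c in range(num_clients)]
```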

Client heterogeneity. We also assume that clients have different computation and communication resources, and we set three tiers of clients. For the 16-client real-world experiments, we use 10 medium-capacity, 3 high-capacity, and 3 low-capacity clients. For simulations, each experiment includes 60% medium-, 20% high-, and 20% low-capacity clients. The training time and communication period of a high-capacity client are 2/3 of those of a medium-capacity client, and those of a low-capacity client are twice those of a medium-capacity client. The communication period for medium-capacity clients of each dataset is shown in Table II.

IV-A3 Performance metrics

Besides the topology metrics discussed in Sec. II-B, we use the following metrics to evaluate FedLay and other methods.

Model accuracy: We evaluate the individual accuracy and average accuracy of local ML models based on separate test datasets that are different from the training datasets.

Topology correctness: It is defined as the number of correct neighbors of all nodes over the total number of neighbors. Hence correctness equal to 1 means a correct FedLay.

Communication Cost: We evaluate the average communication cost per client, by counting the number of NDMP messages sent by each client and the total size of the models sent by each client in bytes.

IV-A4 Methods for comparison

There is no existing DFL topology that allows decentralized construction and maintenance. Hence we compare FedLay with the following methods: 1) Gaia [12] is an ML method for geo-distributed clouds and still uses central servers. Hence it is not DFL. It runs server-based ML in each region and lets servers from different regions connect as a complete graph. It includes no aggregation method to handle non-iid data. 2) DFL-DDS [31] is a DFL method without a fixed topology. Instead, it simulates mobile nodes in a road network and considers two geographically close nodes as neighbors. 3) Chord [30]. 4) FedAvg [23] is a standard centralized FL method. We use its accuracy as the upper bound of DFL model accuracy because the central server knows all models from the clients.

Figure 10: Model accuracy from 100-client medium-scale experiments: accuracy vs. time on (a) MNIST, (b) CIFAR-10, and (c) Shakespeare. FedAvg is centralized and considered the upper bound.

Figure 11: Accuracy under different non-iid levels for CIFAR-10: (a) 4 shards per client, (b) 12 shards per client, (c) accuracy distribution.

IV-B Evaluation of FedLay topology

We show the convergence factor, diameter, and average shortest path length while varying the number of nodes in Fig. LABEL:fig:expander3 for the different topologies discussed in Sec. II-C. We evaluate FedLay with degrees of 6, 8, and 10, as well as Viceroy, Waxman, and Chord. We find that the diameter and average shortest path length of Viceroy and Waxman increase with the size of the network, and Chord has a large convergence factor when the number of nodes is large. The results show that FedLay has the best results on all topology metrics.

Fig. 8(a) shows the topology correctness under an extreme situation in which 100 new clients join a 400-client FedLay network at the same time (at 10 ms on the timeline). The average network latency is set to 350 ms. We find that correctness quickly converges to 1 within 8 seconds in FedLay with degree d = 6, 8, 10, 12. Fig. 8(b) shows the topology correctness under another extreme situation in which 100 clients fail simultaneously in a 400-client FedLay network (at 10 ms on the timeline). The correctness quickly drops to 64.3%. The remaining clients run NDMP and recover a correct 300-client FedLay network within 8 seconds. In Fig. 8(c), we plot the number of messages sent per client to construct FedLay networks of different sizes. With as many as 500 clients, each client sends only around 30 messages on average.

IV-C DFL model accuracy

Fig. 9 shows the model accuracy of different methods in six subfigures, based on real experiments in Amazon EC2. In Figs. 9(a)-9(c), we find that FedLay achieves higher accuracy and faster convergence than Gaia and DFL-DDS, even with d = 4. Note that one communication period for medium-capacity clients is set to 40 minutes for Shakespeare; hence, there are not many rounds of model exchanges before convergence. Figs. 9(d)-9(f) show the cumulative distribution of the accuracy of all clients at convergence time (150 minutes for MNIST and 1500 minutes for the others). We can see that nodes reach similar accuracy levels without any ‘stragglers’.

We evaluate the accuracy of FedLay (d = 10), FedAvg, Gaia, and DFL-DDS using medium-scale experiments with 100 clients and show the results in Fig. 10. FedAvg achieves the best accuracy as a centralized FL method, which we consider the upper bound for DFL. The accuracy of FedLay with degree 10 is only 1.2%, 2.5%, and 0.9% lower than FedAvg on MNIST, CIFAR-10, and Shakespeare, respectively. The other methods have lower accuracy, but the differences are not significant. We list the accuracy at convergence time in Table III, along with the centralized method FedAvg as the baseline.

Task | FedLay | FedAvg | Gaia | Chord | DFL-DDS
MNIST | 90.2% | 92.1% | 89.2% | 88.9% | 87.4%
CIFAR | 50.3% | 52.8% | 48.6% | 49.2% | 49.4%
Shakespeare | 45.9% | 46.9% | 44.0% | 44.5% | 44.2%
TABLE III: Accuracy comparison at convergence. We regard the accuracy of FedAvg as the centralized baseline.

We evaluate FedLay under different non-iid levels. Each client has a limited number of shards; when each client has fewer shards, the non-iid level becomes more significant. The default setting is 8 shards per client, as used in previous results such as Fig. 10(b). Figs. 11(a) and 11(b) show the accuracy comparison on CIFAR-10 for 4 shards per client and 12 shards per client, respectively. We find that all DFL methods converge more slowly under more non-iid data (Fig. 11(a)), but eventually FedLay still achieves accuracy similar to FedAvg, while Gaia and DFL-DDS have lower accuracy. We also show the distribution of the accuracy of all clients at time 2000 in Fig. 11(c). When there are 4 shards per client, the distribution is more uneven.

Figure 12: Model accuracy with synchronous and asynchronous communication: (a) MNIST, (b) CIFAR-10, (c) Shakespeare.

Evaluation of data with biased distribution and locality. In this set of experiments, 100 clients are divided evenly into 10 groups, and each group possesses 6 out of the 10 labels in the CIFAR-10 dataset. Each group differs from its neighboring groups by only one label; for example, group 1 has labels 1 to 6, group 2 has labels 2 to 7, and so on, and the last group has labels 10, 1, 2, 3, 4, 5. For each client, we sample 2000 images per label evenly from the original CIFAR-10 dataset. In Fig. 13, we show that FedLay has an average improvement of 37.01% over Chord across varying degrees; it is also only 2.0% lower than the theoretical upper bound, a fully connected network. In Fig. 14, we compare the accuracy of FedLay and Chord versus time. Again, FedLay shows much better convergence than Chord.

Figure 13: Converged accuracy.

Figure 14: Accuracy vs. time (d=8).

Figure 15: Relative computational cost.

IV-D Evaluation of other considerations

Asynchronous communication. We compare FedLay with synchronous and asynchronous communication in Fig. 12. We find that, for all three datasets, asynchronous communication improves both accuracy and convergence speed, because high-capacity clients do not need to wait for low-capacity ones.

Confidence parameters. Fig. 16 and Fig. 17 show the accuracy of FedLay with and without confidence parameters for MNIST, compared to simple averaging. We set α_d=0.5 and α_c=0.5. The results show that the confidence parameters slightly improve accuracy over the simple average.
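As a rough illustration (the precise confidence computation is part of MEP and is not reproduced here), confidence-weighted aggregation replaces the equal weights of the simple average with weights derived from the two models' confidence values. The functions below are a hedged sketch under that assumption, not FedLay's exact rule.

```python
# Illustrative sketch only: simple averaging vs. confidence-weighted averaging of
# two models represented as dicts of parameter tensors/arrays.
def simple_average(w_local, w_neighbor):
    return {k: 0.5 * (w_local[k] + w_neighbor[k]) for k in w_local}

def confidence_weighted_average(w_local, w_neighbor, conf_local, conf_neighbor):
    a = conf_local / (conf_local + conf_neighbor)   # weight derived from confidences
    return {k: a * w_local[k] + (1 - a) * w_neighbor[k] for k in w_local}
```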

Accuracy under churn. We show the model accuracy under extreme churn: 50 new clients join a 50-client FedLay network. In Fig. 18, the curves with triangle markers show the accuracy of the 50 initial nodes and the curves with square markers show the accuracy of the 50 newly joined nodes. We find that the accuracy of the new nodes quickly converges to a high level due to the high-confidence models from existing nodes. Fig. 19 shows that the newly joined nodes have very low accuracy at join time, but all clients eventually achieve high accuracy.

Computation cost. We show the relative computational cost in Fig. 15. In the experiment with 100 nodes training on the MNIST dataset, to reach 88% accuracy, the relative computation cost of FedLay is 1.33, compared to 1.53 for Gaia, 2.47 for Chord, and 2.76 for DFL-DDS, with the baseline FedAvg normalized to 1. FedLay incurs only 33% overhead, the smallest among the DFL methods.
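The relative cost here can be read as the total client compute spent before first reaching the target accuracy, normalized to FedAvg; the snippet below is a back-of-envelope sketch with hypothetical compute units chosen only to reproduce the reported ratios.

```python
# Back-of-envelope sketch (assumed definition): relative computational cost as
# compute-to-target-accuracy normalized by the FedAvg baseline.
def relative_cost(compute_to_target, baseline="FedAvg"):
    base = compute_to_target[baseline]
    return {m: c / base for m, c in compute_to_target.items()}

# Hypothetical compute units matching the ratios reported above.
costs = {"FedAvg": 100, "FedLay": 133, "Gaia": 153, "Chord": 247, "DFL-DDS": 276}
print(relative_cost(costs))  # FedLay 1.33, Gaia 1.53, Chord 2.47, DFL-DDS 2.76
```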

IV-E Scalability

We use large-scale simulations to evaluate the scalability of FedLay, as shown in Fig. 20. We find that even with up to 1000 clients, FedLay has stable performance on all datasets (Figs. 20(a)-20(c)). In Fig. 20(d), we compare the communication cost per client (in MB) to reach convergence for FedLay, FedAvg, DFL-DDS, and Gaia. Gaia has poor scalability in communication cost.
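To see why a small fixed degree matters for communication scalability, consider the back-of-envelope estimate below. It assumes, purely for illustration, that every model exchange sends the full model to each neighbor: per-client traffic then grows linearly with degree, so a constant degree d keeps it independent of network size, whereas a topology whose effective degree grows with the number of clients does not. The numbers are illustrative, not measurements.

```python
# Back-of-envelope sketch (illustrative numbers, not measured data).
def comm_cost_mb(model_mb, rounds, degree):
    # Full model sent to every neighbor in every communication round.
    return model_mb * rounds * degree

model_mb, rounds, n = 20, 100, 1000
print(comm_cost_mb(model_mb, rounds, degree=10))     # fixed degree (FedLay-style): 20000 MB
print(comm_cost_mb(model_mb, rounds, degree=n - 1))  # degree growing with n: 1998000 MB
```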

Figure 16: Accuracy for MNIST, α_d=0.5, α_c=0.5.

Figure 17: Accuracy distribution for MNIST.

Figure 18: Accuracy under churn.

Figure 19: Accuracy distribution under churn.

V Related Works

Decentralized Federated Learning (DFL). Federated learning (FL) [23] is an attractive solution for large-scale ML that allows many clients to train ML models collaboratively without directly sharing training data. However, the central server in FL is a single point of failure and attack. Decentralized federated learning (DFL) has been proposed to remove the central server [11, 33, 4]. The overlay network of DFL is a fundamental problem. He et al. [11] suggest a few overlay topologies, including ring, 2D grid, and complete graph, which either cannot be constructed in a decentralized way or incur excessive communication. GADMM [10] uses a dynamic chain topology, and other methods apply clustering-based topologies [2, 3]. Vogels et al. [34] theoretically analyze model convergence on different topologies, including hypercube, torus, binary tree, and ring. Recently, Hua et al. suggest applying Ramanujan graphs to DFL [13]. None of the above studies discusses decentralized construction and maintenance of the suggested topologies. Recently, [41, 37] propose to utilize blockchains to enhance the security and verifiability of DFL.

Other overlay topologies. Overlay topologies have been extensively studied for P2P networks. A well-known category of overlay networks is distributed hash tables (DHTs), such as Chord [30]. DHTs are designed to support data lookup in a decentralized network. Distributed Delaunay triangulation (DT) [19, 17] is designed to provide greedy routing guarantees, and Viceroy [21] is designed to minimize congestion in overlay routing. Near-random regular topologies have been studied for data center networks with centralized construction [29, 38].

Figure 20: Results of simulations for large-scale networks. (a) MNIST; (b) CIFAR-10; (c) Shakespeare; (d) Shakespeare: communication cost.

VI Conclusion

This work presents FedLay, the first overlay network for DFL that achieves all the following properties: 1) decentralized construction, 2) fast convergence to accurate models, 3) small node degree, and 4) resilience to churn. We present the detailed designs of the FedLay topology, neighbor discovery and maintenance protocols (NDMP), and model exchange protocol (MEP). We prove that NDMP can guarantee the correctness of a decentralized overlay for node joins and failures. Evaluation results show that FedLay provides the highest model accuracy among existing DFL methods, small communication costs, and strong resilience to churn. In particular, it provides significant model accuracy advantages compared to other decentralized protocols such as Chord, when data are distributed with locality and bias.

References

  • [1] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In Proc. of ICLR, 2021.
  • [2] Mohammed S. Al-Abiad, Mohanad Obeed, Md. Jahangir Hossain, and Anas Chaaban. Decentralized aggregation for energy-efficient federated learning via overlapped clustering and d2d communications. arXiv preprint arXiv:2206.02981, 2022.
  • [3] Aurélien Bellet, Anne-Marie Kermarrec, and Erick Lavoie. D-cliques: Compensating for data heterogeneity with topology in decentralized federated learning. In Proc. of IEEE SRDS, 2022.
  • [4] Enrique Tomás Martínez Beltrán et al. Decentralized federated learning: Fundamentals, state-of-the-art, frameworks, trends, and challenges. IEEE Communications Surveys and Tutorials, 2022.
  • [5] Stephen Boyd, Persi Diaconis, and Lin Xiao. Fastest mixing markov chain on a graph. SIAM review, 46(4):667–689, 2004.
  • [6] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings, 2019.
  • [7] Jing Cao, Zirui Lian, Weihong Liu, Zongwei Zhu, and Cheng Ji. Hadfl: heterogeneity-aware decentralized federated learning framework. In Proc. of IEEE DAC, 2021.
  • [8] Yat-Tin Chow, Wei Shi, Tianyu Wu, and Wotao Yin. Expander graph and communication-efficient decentralized optimization. In Proc. of IEEE ASILOMAR, 2016.
  • [9] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [10] Anis Elgabli, Jihong Park, Amrit S Bedi, Mehdi Bennis, and Vaneet Aggarwal. Communication efficient framework for decentralized machine learning. In Proc. of IEEE CISS, 2020.
  • [11] Lie He, An Bian, and Martin Jaggi. Cola: Decentralized linear learning. Proc. of NIPS, 2018.
  • [12] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed machine learning approaching LAN speeds. In Proc. of USENIX NSDI, 2017.
  • [13] Yifan Hua, Kevin Miller, Andrea L Bertozzi, Chen Qian, and Bao Wang. Efficient and reliable overlay networks for decentralized federated learning. SIAM Journal on Applied Mathematics, 82(4):1558–1586, 2022.
  • [14] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In Proc. of ICML, 2020.
  • [15] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [16] S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 1951.
  • [17] Simon S. Lam and Chen Qian. Geographic Routing in dd-dimensional Spaces with Guaranteed Delivery and Low Stretch. In Proceedings of ACM SIGMETRICS, 2011.
  • [18] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • [19] Dong-Young Lee and Simon S. Lam. Efficient and accurate protocols for distributed delaunay triangulation under churn. In Proc. of IEEE ICNP, 2008.
  • [20] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • [21] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: A scalable and dynamic emulation of the butterfly. In Proc. of ACM PODC, 2002.
  • [22] J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In Proc. of NIPS, 2012.
  • [23] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. of PMLR AISTATS, 2017.
  • [24] Jinlong Pang, Jieling Yu, Ruiting Zhou, and John CS Lui. An incentive auction for heterogeneous client selection in federated learning. IEEE Transactions on Mobile Computing, 2022.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. Tech Report, 2017.
  • [26] Reese Pathak and Martin J Wainwright. Fedsplit: An algorithmic framework for fast federated optimization. arXiv preprint arXiv:2005.05238, 2020.
  • [27] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [28] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 31(9):3400–3413, 2019.
  • [29] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking data centers randomly. In Proc. of USENIX NSDI, 2012.
  • [30] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of ACM SIGCOMM, 2001.
  • [31] Dongyuan Su, Yipeng Zhou, and Laizhong Cui. Boost decentralized federated learning in vehicular networks by diversifying data sources. In Proc. of IEEE ICNP, 2022.
  • [32] Tao Sun, Dongsheng Li, and Bao Wang. Stability and generalization of decentralized stochastic gradient descent. In Proc. of AAAI, 2021.
  • [33] Tao Sun, Dongsheng Li, and Bao Wang. Decentralized federated averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [34] Thijs Vogels, Hadrien Hendrikx, and Martin Jaggi. Beyond spectral gap: The role of the topology in decentralized learning. arXiv preprint arXiv:2206.03093, 2022.
  • [35] Tian Wang, Yan Liu, Xi Zheng, Hong-Ning Dai, Weijia Jia, and Mande Xie. Edge-based communication optimization for distributed federated learning. IEEE Transactions on Network Science and Engineering, 9(4):2015–2024, 2021.
  • [36] Bernard M Waxman. Routing of multipoint connections. IEEE journal on selected areas in communications, 6(9):1617–1622, 1988.
  • [37] Minghui Xu, Zongrui Zou, Ye Cheng, Qin Hu, Dongxiao Yu, and Xiuzhen Cheng. Spdl: A blockchain-enabled secure and privacy-preserving decentralized learning system. IEEE Transactions on Computers, 72(2):548–558, 2022.
  • [38] Ye Yu and Chen Qian. Space Shuffle: A Scalable, Flexible, and High-Bandwidth Data Center Network. In Proc. of IEEE ICNP, 2014.
  • [39] Ye Yu and Chen Qian. Space Shuffle: A Scalable, Flexible, and High-Bandwidth Data Center Network. IEEE Transactions on Parallel and Distributed Systems, 2016.
  • [40] Shahryar Zehtabi, Seyyedali Hosseinalipour, and Christopher G Brinton. Decentralized event-triggered federated learning with heterogeneous communication thresholds. In Proc. of IEEE CDC, 2022.
  • [41] Xiaoxue Zhang, Yifan Hua, and Chen Qian. Secure decentralized learning with blockchain. In 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems (MASS), pages 116–124. IEEE, 2023.
  • [42] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • [43] Ruiting Zhou, Jinlong Pang, Zhibo Wang, John CS Lui, and Zongpeng Li. A truthful procurement auction for incentivizing heterogeneous clients in federated learning. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pages 183–193. IEEE, 2021.