
Heroes: Lightweight Federated Learning with Neural Composition and Adaptive Local Update in Heterogeneous Edge Networks

Jiaming Yan1,2   Jianchun Liu1,2   Shilong Wang1,2   Hongli Xu1,2   Haifeng Liu3   Jianhua Zhou3
1School of Computer Science and Technology, University of Science and Technology of China, China
2Suzhou Institute for Advanced Research, University of Science and Technology of China, China
3 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Dongguan, Guangdong, China
Abstract

Federated Learning (FL) enables distributed clients to collaboratively train models without exposing their private data. However, it is difficult to implement efficient FL due to limited resources. Most existing works compress the transmitted gradients or prune the global model to reduce the resource cost, but leave the compressed or pruned parameters under-optimized, which degrades the training performance. To address this issue, the neural composition technique constructs size-adjustable models by composing low-rank tensors, allowing every parameter in the global model to learn the knowledge from all clients. Nevertheless, some tensors can only be optimized by a small fraction of clients, thus the global model may get insufficient training, leading to a long completion time, especially in heterogeneous edge scenarios. To this end, we enhance the neural composition technique, enabling all parameters to be fully trained. Further, we propose a lightweight FL framework, called Heroes, with enhanced neural composition and adaptive local update. A greedy-based algorithm is designed to adaptively assign the proper tensors and local update frequencies for participating clients according to their heterogeneous capabilities and resource budgets. Extensive experiments demonstrate that Heroes can reduce traffic consumption by about 72.05% and provide up to 2.97× speedup compared to the baselines.

Index Terms:
Federated Learning, Heterogeneity, Neural Composition, Local Update Frequency.

I Introduction

In traditional machine learning approaches, data is gathered from various sources and transmitted to a central server for model training. However, in edge computing (EC), where data is generated and processed at the edge clients, this centralized approach becomes infeasible due to privacy concerns [liu2022enhancing]. Federated Learning (FL) [mcmahan2017communication] is an emerging paradigm that enables collaborative model training across distributed clients, bringing the power of machine learning to EC [gao2019auction]. As the field of FL continues to evolve, it carries significant potential to advance edge computing capabilities and facilitate intelligent applications across diverse domains, such as healthcare, smart cities and autonomous vehicles[kairouz2021advances].

However, it is difficult to implement efficient FL in practical edge networks due to resource limitations [liu2023finch], which primarily revolve around constrained computation resources and scarce communication resources on edge clients, such as smartphones and Internet of Things (IoT) devices. Firstly, edge clients often have limited computation capabilities, especially compared to high-performance devices like desktop GPUs [ignatov2019ai]. For example, the floating-point operations per second (FLOPS) of an iPhone 14 is only about 5% of that of the desktop GPU RTX3090. Secondly, the communication resources are also scarce in edge networks [liu2023yoga]. The bandwidth of wide area networks (WANs) between the PS and edge clients is much lower (e.g., 15×) than that of local area networks (LANs) within data centers [wang2021resource]. Unfortunately, the growing complexity (e.g., parameter size or architecture) of deep neural networks (DNNs) further intensifies the consumption of both computation and communication resources. For instance, a standard ResNet-18 [he2016deep] consists of 11.68 million parameters and requires 27.29 billion floating-point operations (FLOPs) to process a single image, making the local update process extremely slow or even infeasible on edge clients [jiang2022fedmp]. Besides, the frequent transmission of a large number of parameters will also strain the limited network bandwidth.

Some previous works have made efforts to address the challenge of resource limitation in FL by employing various techniques, such as gradient compression [li2021talk, liu2022communication] and model pruning [diao2020heterofl, horvath2021fjord]. Specifically, FlexCom [li2021talk] and AdaGQ [liu2022communication] compress the transmitted gradients by sparsification or quantization to alleviate the communication overhead. However, these approaches do not reduce the model complexity, so the computation overhead remains high. To save both computation and communication resources, HeteroFL [diao2020heterofl] and Fjord [horvath2021fjord] propose to prune the complete model into smaller sub-models for training. Nevertheless, excessive parameter pruning significantly degrades training performance. For instance, HeteroFL prunes 93.75% of the parameters from the complete model to accommodate the limitations (e.g., CPU power, RAM, energy) of weak clients like smartphones, leaving only 6.25% of the parameters to be optimized. As a result, the majority of parameters are unable to benefit from the local data on these weak clients, leading to poor training performance of the complete model.

To address the above issues, Flanc [mei2022resource] proposes the neural composition technique, which approximates each model weight as the product of two low-rank tensors, named the neural basis and the coefficient. Concretely, models of various complexities are constructed by composing (i.e., multiplying) the neural basis with coefficients of different sizes. In each round, the parameter server (PS) sends the neural basis and a proper coefficient to each client for composition and local training. The entire neural basis is trained by all clients and its learned knowledge is propagated to all parameters, which enables every parameter to access the full range of knowledge. We observe that the size of the low-rank tensors is inherently much smaller than that of the original model. For example, the size of the standard ResNet-18 is 42.8MB, while that of the approximated tensors is only 15.3MB. Thus, this approach can reduce both computation and communication consumption during training.

Despite its resource efficiency, the neural composition technique encounters some problems in the context of FL. Firstly, Flanc only aggregates the coefficients with the same shape. As a consequence, the coefficient of a specific shape is only trained by the clients with the corresponding computation power, which may be insufficient for global model convergence. This issue becomes more prominent when high-performance clients constitute only a small fraction of all clients in the EC system due to their high cost. For instance, the largest coefficient is only trained by a few powerful clients. As a result, the largest global model may struggle to converge within the given completion time, leading to poor training performance. Secondly, the edge clients are equipped with different hardware, thus their capabilities (e.g., CPU power, bandwidth) may vary significantly [li2022pyramidfl], i.e., client heterogeneity, which has a great impact on training efficiency. For example, if the PS sends the neural basis and a large coefficient to a client with strong computation power but low upload bandwidth, the completion time for model updates will be prolonged, causing long delays in the aggregation step.

In order to tackle the aforementioned challenges, we propose a lightweight FL framework, called Heroes (LigHtweight Federated Learning through Neural Composition). On the one hand, we enhance the neural composition technique to aggregate coefficients with different shapes into the largest one. Besides, by adaptively assigning the coefficients to clients, each parameter in the global coefficient can be fully trained, ensuring global model convergence. On the other hand, we adjust the local update frequencies for different clients to balance their completion time, so as to diminish the impact of client heterogeneity and improve the training efficiency. Nevertheless, the neural composition and local update frequency are interconnected. Specifically, the completion time depends on the size of the composed model, which affects the determination of the local update frequency. Meanwhile, the local update frequency also affects the training adequacy of each parameter in the coefficient. Therefore, it is necessary yet challenging to jointly assign a proper coefficient and local update frequency for each client. Our contributions are summarized as follows:

  • We propose a lightweight FL framework, called Heroes, which overcomes the challenges of resource limitation and client heterogeneity through enhanced neural composition and adaptive local update. Besides, a theoretical convergence analysis is provided for Heroes.

  • Guided by the convergence bound, we design a greedy-based algorithm to adaptively assign the proper coefficients and local update frequencies for participating clients based on both their heterogeneous capabilities and resource budgets.

  • The performance of Heroes is evaluated through extensive experiments and the results demonstrate that Heroes can reduce the traffic consumption by about 72.05% and provide up to 2.97× speedup for the training process compared to the baselines.

II Background and Motivation

II-A Federated Learning

Considering a client set $\mathcal{N}=\{1,2,\cdots,N\}$ coordinated by the PS, each client $n\in\mathcal{N}$ holds its local dataset $D_{n}=\{\zeta_{i}^{n}\}_{i=1}^{|D_{n}|}$, where $\zeta_{i}^{n}$ denotes a data sample from $D_{n}$. Further, we represent the loss function as $\mathcal{L}(\mathbf{x};\zeta)$, which measures how well the model $\mathbf{x}$ performs on data sample $\zeta$. Therefore, the local loss function of client $n$ is defined as:

$F_{n}(\mathbf{x})\triangleq\mathbb{E}_{\zeta\sim D_{n}}[\mathcal{L}(\mathbf{x};\zeta)]$ (1)

The global loss function is the average of all $N$ clients' local loss functions, and the goal of FL is to train a high-quality model $\mathbf{x}^{*}$ that minimizes the global loss function, which is defined as:

$\mathbf{x}^{*}\triangleq\mathop{\arg\min}_{\mathbf{x}}F(\mathbf{x})=\mathop{\arg\min}_{\mathbf{x}}\frac{1}{N}\sum_{n=1}^{N}F_{n}(\mathbf{x})$ (2)

Suppose there are $H$ rounds of training in total. In each round $h\in\{1,2,\cdots,H\}$, the PS randomly selects a set of clients $\mathcal{N}^{h}\subseteq\mathcal{N}$ to participate in training and sends the fresh global model $\mathbf{x}^{h}$ to the specified clients, where $|\mathcal{N}^{h}|=K$. Then, each client $n\in\mathcal{N}^{h}$ updates the global model $\mathbf{x}^{h}$ over its local dataset $D_{n}$ for $\tau$ times, where each update is regarded as one local iteration and $\tau$ represents the local update frequency. Let $\mathbf{x}^{h}_{n}(t)$ denote the local model of client $n$ at iteration $t$ in round $h$. For the mini-batch stochastic gradient descent (SGD) [yu2019parallel] algorithm, a local iteration can be expressed as follows:

$\mathbf{x}^{h}_{n}(t+1)=\mathbf{x}^{h}_{n}(t)-\eta\nabla F_{n}(\mathbf{x}^{h}_{n}(t);\xi^{n})$ (3)

where $\xi^{n}$ denotes a random data batch from the local dataset $D_{n}$ and $\eta$ is the learning rate. Finally, the PS collects the updated local models from the participating clients and aggregates them into the latest global model for further training, i.e., $\mathbf{x}^{h+1}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{h}_{n}(\tau)$.
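To make the per-round workflow concrete, the following is a minimal sketch of one FedAvg-style round in PyTorch. The helper names `local_sgd` and `fedavg_aggregate`, and the assumption that all clients share one architecture with float-valued state_dicts, are ours, not the paper's.

```python
# A minimal FedAvg round sketch (assumptions: identical model architectures and
# float-valued state_dicts across clients; data loaders yield (inputs, labels)).
import copy
import torch

def local_sgd(model, loader, loss_fn, tau, eta):
    """Run tau mini-batch SGD iterations on one client (Eq. (3))."""
    opt = torch.optim.SGD(model.parameters(), lr=eta)
    it = iter(loader)
    for _ in range(tau):
        try:
            x, y = next(it)
        except StopIteration:         # restart the loader if it is exhausted
            it = iter(loader)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model.state_dict()

def fedavg_aggregate(local_states):
    """Average the participating clients' updated models (the aggregation step)."""
    avg = copy.deepcopy(local_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in local_states]).mean(dim=0)
    return avg
```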

II-B Enhanced Neural Composition

To make full use of the limited resources, it is necessary to adjust the complexity of the local model for each client based on its resource budget [diao2020heterofl, horvath2021fjord, mei2022resource]. Herein, building on Flanc [mei2022resource], we propose an enhanced neural composition technique to construct models of different widths (i.e., the number of hidden channels in each weight) based on low-rank factorization [phan2020stable]. Concretely, since the weights in DNNs are usually over-parameterized [zou2019improved], each layer's weight (e.g., convolution, fully-connected) can be approximated as the product of two low-rank tensors, named the neural basis $\mathbf{v}$ and the coefficient $\mathbf{u}$. For example, let $\mathbf{w}\in\mathbb{R}^{k^{2}\times I\times O}$ represent a convolution weight with kernel size $k$, input channel number $I$ and output channel number $O$. Let $p\in\{1,2,\cdots,P\}$ denote the weight width, where the shape of the $p$-width weight $\mathbf{w}_{p}$ is $k^{2}\times pI\times pO$. Accordingly, $\mathbf{w}_{p}$ is approximated as follows:

$\mathbf{w}_{p}\approx\mathbf{v}\cdot\mathbf{u}_{p},\qquad\mathbf{v}\in\mathbb{R}^{k^{2}\times I\times R},\ \mathbf{u}_{p}\in\mathbb{R}^{R\times(p\times pO)}$ (4)

When $k=1$, the above formulation represents the approximation of a fully-connected layer's weight. The weight width is controlled by adjusting the size of the coefficient, while the size of the neural basis is constant. Specifically, the complete coefficient is divided into $P^{2}$ blocks, where the shape of each block is $R\times O$. We select $p^{2}$ blocks from the complete coefficient to form the reduced coefficient and compose (i.e., multiply) it with the neural basis into a $p$-width weight.

To ensure sufficient training of every parameter in the global model, we keep the different coefficient blocks trained evenly. Thus, the blocks selected for a client are currently the least trained ones. Notably, we measure each block's training adequacy by the total number of local iterations it has experienced on all clients since round 1, i.e., its total update times. In general, we balance the total update times of different coefficient blocks to ensure that each block is fully trained.

For example, as illustrated in Fig. 1, the coefficient $\mathbf{u}\in\mathbb{R}^{R\times(9\times O)}$ is divided into 9 blocks (i.e., $P$=3) and the number in each block represents its current total update times. To obtain a 2-width weight, we first extract the 4 least trained blocks (with total update times of 6, 5, 7 and 8, respectively) from the complete coefficient and combine them into the reduced coefficient $\hat{\mathbf{u}}\in\mathbb{R}^{R\times(4\times O)}$. Then, $\hat{\mathbf{u}}$ is composed with the neural basis $\mathbf{v}\in\mathbb{R}^{k^{2}\times I\times R}$ into an intermediate tensor whose shape is $k^{2}\times I\times(4\times O)$. Finally, a 2-width weight $\hat{\mathbf{x}}\in\mathbb{R}^{k^{2}\times(2\times I)\times(2\times O)}$ is obtained by reshaping this intermediate tensor.
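The worked example above can be written down compactly. Below is a hedged PyTorch sketch of this composition step; the helper name `compose_width_p` and the exact reshape order are our assumptions, since the paper only specifies the tensor shapes.

```python
# A sketch of the enhanced neural composition step in Fig. 1 (illustrative
# helper names; the reshape convention is an assumption).
import torch

def compose_width_p(v, u_full, update_counts, p, P, O):
    """
    v:             neural basis of shape (k*k, I, R)
    u_full:        complete coefficient of shape (R, P*P*O), viewed as P*P blocks of (R, O)
    update_counts: total update times per block, length P*P
    Returns the reduced coefficient, the selected block ids, and the p-width weight.
    """
    k2, I, R = v.shape
    # 1) pick the p*p least-trained blocks (balanced training of the coefficient)
    block_ids = sorted(range(P * P), key=lambda i: update_counts[i])[: p * p]
    blocks = [u_full[:, i * O:(i + 1) * O] for i in block_ids]
    u_hat = torch.cat(blocks, dim=1)                  # (R, p*p*O)
    # 2) compose: contract over the rank dimension R
    inter = torch.einsum('air,rb->aib', v, u_hat)     # (k*k, I, p*p*O)
    # 3) reshape the intermediate tensor into a p-width convolution weight
    w_p = inter.reshape(k2, p * I, p * O)
    return u_hat, block_ids, w_p
```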

Figure 1: The demonstration of constructing a $p$-width weight by the enhanced neural composition technique ($P$=3, $p$=2).
TABLE I: Training performance (test accuracy) within given resource constraints.
FL Schemes                      Traffic 30GB   Traffic 60GB   Time 20,000s   Time 40,000s
MP [diao2020heterofl]           34.89%         48.92%         39.72%         50.75%
Original NC [mei2022resource]   42.22%         51.89%         43.25%         49.37%
Enhanced NC                     59.76%         64.42%         58.69%         62.81%

Compared to the FL schemes based on model pruning [diao2020heterofl], the neural composition technique enables every parameter in the global model to benefit from the full range of knowledge through the shared neural basis, improving the training performance. Different from Flanc [mei2022resource] with the original neural composition, the enhanced neural composition allows coefficients of all sizes to be fully trained, accelerating the training process. To verify this improvement, we simulate an FL system with 100 clients to train the standard ResNet-18 model [he2016deep] over the ImageNet dataset [russakovsky2015imagenet]. The results in Table I indicate that, with a given traffic consumption or completion time, the enhanced neural composition can improve the global model's test accuracy by about 16.29% on average compared with the FL schemes based on model pruning (MP) [diao2020heterofl] and the original neural composition (NC) [mei2022resource].

II-C Adaptive Local Update

In most synchronous FL schemes [li2021talk, mei2022resource], the clients’ local update frequencies are identical and fixed in each round. However, due to client heterogeneity, strong clients have to wait for weak ones (i.e., stragglers) for global aggregation, incurring non-negligible waiting time and reducing the training efficiency significantly [liu2023adaptive, wang2020towards]. To evaluate the negative impacts of stragglers, we record the completion time for one training round of each client in the simulated FL system. As shown in Fig. 2(a), the strongest client completes one training round four times faster than the weakest client. In other words, about 70% of the strongest client’s time is idle and wasted.

To reduce clients’ idle waiting, some research [li2020federated, xu2022adaptive, li2022pyramidfl] proposes adjusting the local update frequencies for different clients to balance their completion time. On the one hand, the weaker clients perform fewer local iterations to reduce the impact of stragglers. On the other hand, the stronger clients utilize the idle time to perform more local iterations, which helps the model converge faster. We illustrate the completion time of each client with proper local update frequency in Fig. 2(b). It can be observed that almost every client’s time is fully utilized without idle waiting.

Figure 2: Ranked clients' completion time in one training round. (a) Fixed and Identical Frequency. (b) Adaptive Frequency.

However, the neural composition and the local update are interconnected, and it is difficult to jointly assign proper coefficient blocks and local update frequencies for different clients. Specifically, both the computation time and the communication time of a specific client depend on the number of coefficient blocks, which affects the determination of its local update frequency. Meanwhile, since each client trains different blocks, various local update frequencies will also affect the balance of the blocks’ total update times.

III Proposed Framework

To enhance the training efficiency and resource utilization for FL, we propose a lightweight FL framework, called Heroes, which integrates the benefits of enhanced neural composition and adaptive local update. Specifically, in Heroes, there are three main phases in each round as follows.

III-1 Tensors and Frequency Assignment

The PS first determines the model width $p_{n}^{h}$ for each client $n$ in round $h$. As studied in [diao2020heterofl, mei2022resource], the smaller the model width, the less the required computation resource. To alleviate the performance degradation, Heroes increases each client's model width as much as possible within its resource budget. Then, for each client $n$, Heroes selects the $(p_{n}^{h})^{2}$ coefficient blocks with the least total update times as the reduced coefficient $\hat{\mathbf{u}}_{n}^{h}$. Finally, Heroes assigns a proper local update frequency $\tau_{n}^{h}$ to each client $n$ by the algorithm proposed in Section V-C to minimize the idle waiting time and balance the training across different coefficient blocks.

III-2 Local Training

In round $h$, each client $n\in\mathcal{N}^{h}$ first downloads the latest neural basis $\mathbf{v}^{h}$ and the reduced coefficient $\hat{\mathbf{u}}_{n}^{h}$ from the PS, then composes them into the local model $\hat{\mathbf{x}}_{n}^{h}$. Each local model $\hat{\mathbf{x}}_{n}^{h}$ is trained over the local dataset $D_{n}$ for $\tau_{n}^{h}$ iterations. Let $\bar{\mathbf{x}}_{n}^{h}$ represent the updated local model of client $n$ in round $h$. After the local training, $\bar{\mathbf{x}}_{n}^{h}$ is decomposed into the updated neural basis $\bar{\mathbf{v}}_{n}^{h}$ and coefficient $\bar{\mathbf{u}}_{n}^{h}$, i.e., $\bar{\mathbf{x}}_{n}^{h}\approx\bar{\mathbf{v}}_{n}^{h}\cdot\bar{\mathbf{u}}_{n}^{h}$. Since the size of the low-rank tensors is inherently smaller than that of the original model, each client $n$ uploads $\bar{\mathbf{v}}_{n}^{h}$ and $\bar{\mathbf{u}}_{n}^{h}$, instead of $\bar{\mathbf{x}}_{n}^{h}$, to the PS for global aggregation, which further saves the limited bandwidth.
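The paper does not detail how the updated local model is factorized back into a basis and a coefficient; one plausible rank-$R$ realization is a truncated SVD, sketched below. This is an illustrative assumption on our part, not the authors' exact procedure.

```python
# A hedged sketch of decomposing an updated p-width weight back into a basis
# v_bar of shape (k*k, I, R) and a coefficient u_bar of shape (R, p*p*O) via
# truncated SVD (one possible choice; the actual method is unspecified).
import torch

def decompose_weight(w_bar, I, O, p, R):
    """w_bar: updated p-width weight of shape (k*k, p*I, p*O)."""
    k2 = w_bar.shape[0]
    # undo the composition reshape, then flatten the basis dimensions into rows
    mat = w_bar.reshape(k2, I, p * p * O).reshape(k2 * I, p * p * O)
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    v_bar = (U[:, :R] * S[:R]).reshape(k2, I, R)   # rank-R left factor (basis)
    u_bar = Vh[:R, :]                              # rank-R right factor (coefficient)
    return v_bar, u_bar
```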

III-3 Global Aggregation

In round $h$, upon receiving the updated neural basis and coefficient from all the participating clients in $\mathcal{N}^{h}$, the PS performs global aggregation to obtain the latest basis $\mathbf{v}^{h+1}$ and coefficient $\mathbf{u}^{h+1}$ for the next round of training. For the neural basis, Heroes directly averages the updated ones from all participating clients, i.e., $\mathbf{v}^{h+1}=\frac{1}{K}\sum_{n=1}^{K}\bar{\mathbf{v}}_{n}^{h}$. For the coefficient, Heroes performs block-wise aggregation. Specifically, let $i\in\{1,2,\cdots,P^{2}\}$ represent the block index and $\mathcal{N}_{i}^{h}$ denote the set of clients that train the $i$-th coefficient block in round $h$. The latest $i$-th coefficient block $\mathbf{u}^{h+1,i}$ is obtained as follows:

$\mathbf{u}^{h+1,i}=\frac{1}{|\mathcal{N}_{i}^{h}|}\sum_{n\in\mathcal{N}_{i}^{h}}\bar{\mathbf{u}}_{n}^{h,i}$ (5)

where $\bar{\mathbf{u}}_{n}^{h,i}$ represents the $i$-th coefficient block updated by client $n$ in round $h$. For instance, as shown in Fig. 3, the leftmost block is trained by two clients (i.e., clients 2 and 4), thus its value is $3=\frac{1}{2}\times(4+2)$.
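A small sketch of the block-wise aggregation in Eq. (5) is given below; it assumes each client reports the indices of the blocks it trained together with its updated coefficient, stacked in the same order (this reporting format is our assumption).

```python
# Block-wise coefficient aggregation from Eq. (5): only clients that trained
# block i contribute to its new value; untouched blocks are kept as-is.
import torch

def aggregate_coefficient(u_global, client_updates, O):
    """u_global: (R, P*P*O); client_updates: list of (block_ids, u_bar) pairs."""
    new_u = u_global.clone()
    P2 = u_global.shape[1] // O
    for i in range(P2):
        contribs = []
        for block_ids, u_bar in client_updates:
            if i in block_ids:
                j = block_ids.index(i)                 # position of block i in u_bar
                contribs.append(u_bar[:, j * O:(j + 1) * O])
        if contribs:                                   # average over N_i^h
            new_u[:, i * O:(i + 1) * O] = torch.stack(contribs).mean(dim=0)
    return new_u
```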

Figure 3: The demonstration of the proposed framework.

For a better explanation of Heroes, we give a demonstration in Fig. 3. Four heterogeneous clients participate in training, which are divided into three levels by their computation power, i.e., weak smartphones (clients 1 and 3), a medium laptop (client 2) and a powerful PC (client 4). The complete coefficient contains three blocks. Heroes selects one coefficient block for the weak clients 1 and 3, and two blocks for the medium client 2. The powerful client 4 utilizes all three blocks for model training. We use each block's color to denote the amount of training it has received. Specifically, the total update times of the blue blocks are ample, while those of the orange and white blocks are moderate and few, respectively. To ensure sufficient training for every block, the selected blocks are currently the least trained ones. Besides, Heroes assigns different local update frequencies to these four clients to balance their completion time. For example, the weak client 1 trains the model much more slowly than the powerful client 4. Thus, client 1 performs fewer local iterations (e.g., 10) to alleviate the straggler effect, while client 4 performs more local iterations (e.g., 30) to contribute more to the global model. In addition to balancing the clients' completion time, the balance of the coefficient blocks' total update times is also considered when determining each client's local update frequency.

IV Convergence Analysis

In this section, we provide the convergence analysis of the proposed framework. We first state the following assumptions, which are standard in non-convex optimization problems and widely used in the analysis of previous works [jiang2023computation].

Assumption 1. (Smoothness) The local objective function $F_{n}$ of each client $n$ is smooth with modulus $L$:

$F_{n}(\mathbf{y})-F_{n}(\mathbf{x})\leq\frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^{2}+\langle\nabla F_{n}(\mathbf{y}),\mathbf{y}-\mathbf{x}\rangle\quad\forall n,\mathbf{x},\mathbf{y}$ (6)

Assumption 2. (Bounded Variance) Let $\xi_{n}$ denote a random data batch sampled from client $n$'s local dataset $D_{n}$. There exists a constant $\sigma>0$ such that the variance of the stochastic gradients at each client is bounded by:

$\mathbb{E}[\|\nabla F_{n}(\mathbf{x};\xi_{n})-\nabla F_{n}(\mathbf{x})\|^{2}]\leq\sigma^{2}\quad\forall\mathbf{x},n,\xi_{n}$ (7)

Assumption 3. (Bounded Gradients) There exists a constant $G>0$ such that the stochastic gradients at each client $n$ are bounded by:

$\mathbb{E}[\|\nabla F_{n}(\mathbf{x};\xi_{n})\|^{2}]\leq G^{2}\quad\forall\mathbf{x},n,\xi_{n}$ (8)

In order to analyze the convergence bound of the global loss function after HH rounds, we present the following three important lemmas.

Lemma 1. According to Assumption 3, the deviation between the global model $\mathbf{x}^{h}$ and the local model $\hat{\mathbf{x}}^{h}_{n}(t-1)$ is bounded as:

$\sum_{t=1}^{\tau}\sum_{n=1}^{N}\mathbb{E}[\|\mathbf{x}^{h}-\hat{\mathbf{x}}_{n}^{h}(t-1)\|^{2}]\leq\frac{2\tau^{3}\eta^{2}G^{2}N}{3}+2\tau\sum_{n=1}^{N}\alpha_{n}^{h}$ (9)

where $\alpha_{n}^{h}$ represents the model error induced by reducing the coefficient for client $n$ in round $h$ (i.e., $\alpha_{n}^{h}\triangleq\|\mathbf{u}^{h}-\hat{\mathbf{u}}^{h}_{n}\|^{2}$) and $\tau=\max_{n,h}\{\tau_{n}^{h}\}$.

Lemma 2. Combining Assumptions 1-2 and Lemma 1, the difference between global models in two successive rounds can be bounded as:

$\mathbb{E}[\|\mathbf{x}^{h+1}-\mathbf{x}^{h}\|^{2}]\leq 3\eta^{2}\tau^{2}\mathbb{E}[\|\nabla F_{n}(\mathbf{x}^{h})\|^{2}]+\frac{6\eta^{2}\tau^{2}L^{2}}{N}\sum_{n=1}^{N}\alpha_{n}^{h}+2\tau^{4}\eta^{4}G^{2}L^{2}+3\eta^{2}\tau^{2}\sigma^{2}$ (10)

Lemma 3. Under Assumption 1 and Lemma 1, the proposed framework ensures that:

$\mathbb{E}[\langle\nabla F(\mathbf{x}^{h}),\mathbf{x}^{h+1}-\mathbf{x}^{h}\rangle]\leq-\frac{\tau\eta}{2}\mathbb{E}[\|\nabla F(\mathbf{x}^{h})\|^{2}]+\frac{L^{2}\tau\eta}{N}\sum_{n=1}^{N}\alpha_{n}^{h}+\frac{\tau^{3}\eta^{3}L^{2}G^{2}}{3}$ (11)

With these assumptions and lemmas, the convergence bound can be obtained as follows.

Theorem 1. If the learning rate satisfies $\eta\leq\frac{1}{6L\tau}$, the mean square gradient after $H$ rounds can be bounded as follows:

$\frac{1}{H}\sum_{h=0}^{H-1}\mathbb{E}[\|\nabla F(\mathbf{x}^{h})\|^{2}]\leq\frac{4}{H\eta\tau}(F(\mathbf{x}^{0})-F(\mathbf{x}^{*}))+\frac{6L^{2}}{HN}\sum_{h=0}^{H-1}\sum_{n=1}^{N}\alpha_{n}^{h}+\frac{L\eta\tau}{3}(G^{2}+18\sigma^{2})$ (12)

where $\mathbf{x}^{*}$ denotes the optimal global model.

Proof. With the smoothness assumption, we have:

$\mathbb{E}[F(\mathbf{x}^{h+1})]-\mathbb{E}[F(\mathbf{x}^{h})]$
$\leq\frac{L}{2}\mathbb{E}[\|\mathbf{x}^{h+1}-\mathbf{x}^{h}\|^{2}]+\mathbb{E}[\langle\nabla F(\mathbf{x}^{h}),\mathbf{x}^{h+1}-\mathbf{x}^{h}\rangle]$ (13)
$\leq\frac{\tau\eta}{2}(3\tau\eta L-1)\mathbb{E}[\|\nabla F(\mathbf{x}^{h})\|^{2}]+\frac{L^{2}\tau\eta}{N}(3L\tau\eta+1)\sum_{n=1}^{N}\alpha_{n}^{h}+\tau^{4}\eta^{4}L^{3}G^{2}+\frac{3\eta^{2}\tau^{2}\sigma^{2}L}{2}+\frac{\tau^{3}\eta^{3}L^{2}G^{2}}{3}$ (14)

where Eq. (14) is obtained by inserting Eq. (10) and Eq. (11) into Eq. (13). Summing over rounds $h\in\{0,1,\cdots,H-1\}$ on both sides of Eq. (14), we have:

$\mathbb{E}[F(\mathbf{x}^{H})]-\mathbb{E}[F(\mathbf{x}^{0})]\leq\frac{\tau\eta}{2}(3\tau\eta L-1)\sum_{h=0}^{H-1}\mathbb{E}[\|\nabla F(\mathbf{x}^{h})\|^{2}]+\frac{L^{2}\tau\eta}{N}(3L\tau\eta+1)\sum_{h=0}^{H-1}\sum_{n=1}^{N}\alpha_{n}^{h}+\tau^{4}\eta^{4}L^{3}G^{2}H+\frac{3\eta^{2}\tau^{2}\sigma^{2}HL}{2}+\frac{\tau^{3}\eta^{3}L^{2}G^{2}H}{3}$ (15)

When the learning rate satisfies $\eta\leq\frac{1}{6L\tau}$, we can rearrange the terms in Eq. (15) as follows:

$\frac{\tau\eta}{4}\sum_{h=0}^{H-1}\mathbb{E}[\|\nabla F(\mathbf{x}^{h})\|^{2}]\leq\mathbb{E}[F(\mathbf{x}^{0})]-\mathbb{E}[F(\mathbf{x}^{H})]+\frac{3L^{2}\tau\eta}{2N}\sum_{h=0}^{H-1}\sum_{n=1}^{N}\alpha_{n}^{h}+\frac{\tau^{2}\eta^{2}HL}{12}(G^{2}+18\sigma^{2})$ (16)

Finally, we divide both sides of Eq. (16) by $\frac{H\eta\tau}{4}$ to derive the convergence bound in Eq. (12) of Theorem 1.

Thus, we complete the convergence analysis of the proposed framework under the non-convex setting. The convergence bound is proportional to the coefficient reducing error. The larger the coefficient, the smaller the reducing error, leading to a tighter convergence bound. Besides, the bound also depends on the local update frequency τ\tau, indicating that we can obtain better convergence performance by determining the local update frequency τ\tau properly.

V Problem Formulation and Algorithm Design

V-A Problem Formulation

In this section, we define the joint optimization problem of neural composition and local update frequency in FL training. Let $G(\mathbf{v}\cdot\mathbf{u})$ represent the number of floating-point operations (FLOPs) required to perform one local iteration for the composed model, and $G(\mathbf{v}\cdot\mathbf{u})$ depends on the size of the coefficient $\mathbf{u}$. The more blocks the coefficient $\mathbf{u}$ includes, the wider the composed model $\mathbf{x}=\mathbf{v}\cdot\mathbf{u}$ and the more FLOPs required for model training. We formulate the time cost for one local iteration of client $n$ in round $h$ as follows:

$\mu_{n}^{h}=G(\mathbf{v}^{h}_{n}\cdot\hat{\mathbf{u}}^{h}_{n})/q_{n}^{h}$ (17)

where $q_{n}^{h}$ represents the speed at which client $n$ processes floating-point operations in round $h$.

Since the download bandwidth is usually much higher than the upload bandwidth in typical WANs [zhan2020incentive], the download time of the tensors is negligible and we mainly focus on the upload time. Let $b_{n}^{h}$ denote the upload bandwidth of client $n$ in round $h$ and $E(\cdot)$ measure the size of a tensor. We formulate the communication time of client $n$ in round $h$ as:

$\nu_{n}^{h}=[E(\bar{\mathbf{v}}_{n}^{h})+E(\bar{\mathbf{u}}_{n}^{h})]/b_{n}^{h}$ (18)

Let $T_{n}^{h}$ denote the completion time of client $n$ in round $h$, which consists of the time cost for $\tau_{n}^{h}$ local iterations and the communication time. Due to the synchronization barrier of FL, the completion time of round $h$ depends on the slowest client, which can be defined as:

$T^{h}=\max_{n\in\mathcal{N}^{h}}T_{n}^{h}=\max_{n\in\mathcal{N}^{h}}(\tau_{n}^{h}\cdot\mu_{n}^{h}+\nu_{n}^{h})$ (19)

We adopt the average waiting time of the participating clients in $\mathcal{N}^{h}$ to measure the impact of the synchronization barrier in round $h$, which is defined as follows:

$\mathcal{W}^{h}=\frac{1}{K}\sum_{n\in\mathcal{N}^{h}}(T^{h}-T_{n}^{h})$ (20)

Let $c_{i}^{h}$ denote the total update times of the $i$-th coefficient block, where $i\in\{1,2,\cdots,P^{2}\}$. The training consistency between different coefficient blocks in round $h$ can be reflected by the variance of the set $\{c_{i}^{h}|\forall i\}$, denoted as:

$\mathcal{V}^{h}=\frac{1}{P^{2}}\sum_{i=1}^{P^{2}}\big(c_{i}^{h}-\frac{1}{P^{2}}\sum_{j=1}^{P^{2}}c_{j}^{h}\big)^{2}$ (21)
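For reference, the two balance metrics in Eqs. (20) and (21) can be computed directly; the plain-Python sketch below simply mirrors the definitions (the function names are ours).

```python
# W^h and V^h from Eqs. (20) and (21), computed over one round's bookkeeping.
def avg_waiting_time(completion_times):
    """W^h: mean gap between each participating client and the slowest one."""
    t_max = max(completion_times)
    return sum(t_max - t for t in completion_times) / len(completion_times)

def block_update_variance(update_counts):
    """V^h: variance of the coefficient blocks' total update times."""
    mean = sum(update_counts) / len(update_counts)
    return sum((c - mean) ** 2 for c in update_counts) / len(update_counts)
```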

In each round $h$, we aim to select appropriate $(p_{n}^{h})^{2}$ coefficient blocks and determine the optimal local update frequency $\tau_{n}^{h}$ for each participating client $n\in\mathcal{N}^{h}$, so as to accelerate the training process of FL. Accordingly, we define the optimization problem as follows:

$\min\sum_{h=1}^{H}T^{h}$

$s.t.\begin{cases}F(\mathbf{v}^{H}\cdot\mathbf{u}^{H})\leq\epsilon\\ \mathcal{W}^{h}\leq\rho,&\forall h\\ \mathcal{V}^{h}\leq\delta,&\forall h\\ p_{n}^{h}\in\{1,2,\cdots,P\},&\forall n,h\end{cases}$ (22)

The first inequality expresses the convergence requirement, where $\epsilon$ is the convergence threshold of the training loss after $H$ rounds. The second set of inequalities indicates that the average waiting time of the participating clients in each round $h$ should not exceed the given threshold $\rho$. The third set of inequalities bounds the variance of all coefficient blocks' total update times in each round $h$, ensuring balanced training among blocks. The fourth set of constraints specifies the feasible range of the model width $p_{n}^{h}$, which determines the number of coefficient blocks. The objective of this optimization problem is to minimize the total completion time of FL training while satisfying the performance requirements (e.g., model convergence, bounded average waiting time).

Algorithm 1 Procedure at the PS

Input: completion time budget $T^{max}$, maximum time cost for one local iteration $\mu^{max}$, maximum model width $P$, waiting time bound $\rho$.
Output: converged neural basis $\mathbf{v}^{H}$ and coefficient $\mathbf{u}^{H}$.

1: Initialize $\mathbf{v}^{0}$ and $\mathbf{u}^{0}$ as random tensors;
2: Initialize $h\leftarrow 0$, $T\leftarrow 0$, $c_{i}^{h}\leftarrow 0$ ($\forall i\in[P^{2}]$);
3: while $T\leq T^{max}$ do
4:     Collect the status information of network and clients;
5:     Randomly sample $K$ participating clients, i.e., $\mathcal{N}^{h}$;
6:     for each client $n\in\mathcal{N}^{h}$ do
7:         Set $\mu_{n}^{h}\leftarrow 0$ and $p_{n}^{h}\leftarrow 1$;
8:         while $\mu_{n}^{h}<\mu^{max}$ and $p_{n}^{h}<P$ do
9:             Estimate the local iteration time $\mu_{n}^{h}$;
10:            Set $p_{n}^{h}\leftarrow p_{n}^{h}+1$;
11:        Estimate the communication time $\nu_{n}^{h}$;
12:    for each client $n\in\mathcal{N}^{h}$ do
13:        Solve Eq. (27) to obtain $T_{n}$, $\tau_{n}$ and $T_{n}^{h}$;
14:    Select the fastest client $l\leftarrow\mathop{\arg\min}_{n\in\mathcal{N}^{h}}T_{n}$;
15:    Set $\tau_{l}^{h}\leftarrow\tau_{l}$ and $T\leftarrow T+T_{l}^{h}$;
16:    for each client $n\in\mathcal{N}^{h}$ do
17:        if $n\neq l$ then
18:            Obtain the interval $[\tau_{a},\tau_{b}]$ by Eq. (24);
19:            Search $\tau_{n}^{h}\in[\tau_{a},\tau_{b}]$ to minimize $\mathcal{V}^{h}$;
20:        Select the least trained $(p_{n}^{h})^{2}$ blocks to form $\hat{\mathbf{u}}^{h}_{n}$;
21:        for each coefficient block $i$ in $\hat{\mathbf{u}}_{n}^{h}$ do
22:            Update $c_{i}^{h}\leftarrow c_{i}^{h}+\tau_{n}^{h}$;
23:        Send $\mathbf{v}^{h}$, $\hat{\mathbf{u}}_{n}^{h}$ and $\tau^{h}_{n}$ to client $n$;
24:        Receive $L_{n}$, $\sigma_{n}^{2}$, $G_{n}^{2}$, $\bar{\mathbf{v}}^{h}_{n}$ and $\bar{\mathbf{u}}_{n}^{h}$ from client $n$;
25:    Aggregate the estimated variables to $L$, $\sigma^{2}$ and $G^{2}$;
26:    Aggregate the basis and coefficient to $\mathbf{v}^{h+1}$ and $\mathbf{u}^{h+1}$;
27:    Set $h\leftarrow h+1$ and $c_{i}^{h}\leftarrow c_{i}^{h-1}$ ($\forall i\in[P^{2}]$);
28: Set $H\leftarrow h-1$, then send $\mathbf{v}^{H}$ and $\mathbf{u}^{H}$ to all clients.

V-B Preliminaries for Algorithm Design

To solve the optimization problem in Eq. (22), we first approximate the convergence bound. Specifically, we adopt an upper bound $\beta^{2}$ for the coefficient reducing error, where $\alpha_{n}^{h}\leq\beta^{2}$. Besides, the minimum loss value is approximated as zero, i.e., $F(\mathbf{x}^{*})=0$. Accordingly, the convergence bound in Eq. (12) is formulated as follows:

$G(H,\tau)=\frac{4}{H\eta\tau}F(\mathbf{x}^{0})+\frac{L\eta\tau}{3}(G^{2}+18\sigma^{2})+6L^{2}\beta^{2}$ (23)

Since $\tau$ is a positive variable, the convergence bound $G(H,\tau)$ is a convex function with respect to the local update frequency $\tau$. It can be derived that $G(H,\tau)$ decreases as $\tau$ increases when $\tau\leq\sqrt{\frac{12F(\mathbf{x}^{0})}{\eta^{2}HL(G^{2}+18\sigma^{2})}}$, and increases with $\tau$ beyond this threshold.
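The following small numerical sketch evaluates the approximated bound $G(H,\tau)$ in Eq. (23) and the threshold value of $\tau$ derived above; the function names are ours and `F0` stands for $F(\mathbf{x}^{0})$.

```python
# Evaluating the approximated convergence bound and its minimizing frequency.
import math

def conv_bound(H, tau, F0, eta, L, G2, sigma2, beta2):
    """G(H, tau) from Eq. (23)."""
    return (4.0 / (H * eta * tau)) * F0 \
        + (L * eta * tau / 3.0) * (G2 + 18.0 * sigma2) \
        + 6.0 * L ** 2 * beta2

def best_tau(H, F0, eta, L, G2, sigma2):
    """The tau at which G(H, tau) stops decreasing (the threshold in the text)."""
    return math.sqrt(12.0 * F0 / (eta ** 2 * H * L * (G2 + 18.0 * sigma2)))
```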

To balance the completion time of heterogeneous clients, we first select the fastest client $l$ in round $h$ and let the completion time of the other clients be approximately equal to that of client $l$, which can be formulated as follows:

$0\leq T_{l}^{h}-(\tau_{n}^{h}\cdot\mu_{n}^{h}+\nu_{n}^{h})\leq\rho,\quad\forall n\in\mathcal{N}^{h},\forall h$ (24)

Notably, the completion time $T_{l}^{h}$ of client $l$ is the largest among those of all participating clients in round $h$. Therefore, the total completion time can be denoted as follows:

$T(H,\tau)=\sum_{h=1}^{H}T_{l}^{h}=\sum_{h=1}^{H}(\tau_{l}^{h}\cdot\mu_{l}^{h}+\nu_{l}^{h})$ (25)

In order to minimize the convergence bound, we set the local update frequency $\tau_{l}^{h}$ of the fastest client $l$ in round $h$ as $\sqrt{\frac{12F(\mathbf{x}^{h})}{\eta^{2}HL(G^{2}+18\sigma^{2})}}$. Thus, the optimization problem in Eq. (22) can be approximated as a univariate problem of finding the number of rounds $H$.

$\min\sum_{h=1}^{H}(\tau_{l}^{h}\cdot\mu_{l}^{h}+\nu_{l}^{h})$

$s.t.\begin{cases}\tau_{l}^{h}=\sqrt{\frac{12F(\mathbf{x}^{h})}{\eta^{2}HL(G^{2}+18\sigma^{2})}},&\forall h\\ 0\leq T_{l}^{h}-(\tau_{n}^{h}\cdot\mu_{n}^{h}+\nu_{n}^{h})\leq\rho,&\forall n\in\mathcal{N}^{h},\forall h\\ \mathcal{V}^{h}\leq\delta,&\forall h\end{cases}$ (26)
Algorithm 2 Procedure at client $n$ in round $h$

Input: global neural basis $\mathbf{v}^{h}$, reduced coefficient $\hat{\mathbf{u}}_{n}^{h}$, local update frequency $\tau^{h}_{n}$.
Output: estimated variables $L_{n}$, $\sigma_{n}^{2}$ and $G_{n}^{2}$, updated neural basis $\bar{\mathbf{v}}^{h}_{n}$ and coefficient $\bar{\mathbf{u}}_{n}^{h}$.

1: Receive $\mathbf{v}^{h}$, $\hat{\mathbf{u}}_{n}^{h}$ and $\tau^{h}_{n}$ from the PS;
2: Compose the basis and coefficient $\hat{\mathbf{x}}^{h}_{n}\leftarrow\mathbf{v}^{h}\cdot\hat{\mathbf{u}}^{h}_{n}$;
3: Initialize the learning rate $\eta$ and set $\hat{\mathbf{x}}^{h}_{n}(0)\leftarrow\hat{\mathbf{x}}^{h}_{n}$;
4: for each local iteration $t\in\{1,2,\cdots,\tau_{n}^{h}\}$ do
5:     Update $\hat{\mathbf{x}}_{n}^{h}(t)\leftarrow\hat{\mathbf{x}}_{n}^{h}(t-1)-\eta\nabla F_{n}(\hat{\mathbf{x}}_{n}^{h}(t-1);\xi_{n}^{h})$;
6: Set $\bar{\mathbf{x}}_{n}^{h}\leftarrow\hat{\mathbf{x}}_{n}^{h}(\tau^{h}_{n})$;
7: Estimate $L_{n}\leftarrow\|\nabla F_{n}(\bar{\mathbf{x}}_{n}^{h})-\nabla F_{n}(\hat{\mathbf{x}}_{n}^{h})\|/\|\bar{\mathbf{x}}_{n}^{h}-\hat{\mathbf{x}}_{n}^{h}\|$;
8: Estimate $\sigma_{n}^{2}\leftarrow\mathbb{E}[\|\nabla F_{n}(\hat{\mathbf{x}}_{n}^{h};\xi^{n})-\nabla F_{n}(\hat{\mathbf{x}}_{n}^{h})\|^{2}]$;
9: Estimate $G_{n}^{2}\leftarrow\mathbb{E}[\|\nabla F_{n}(\hat{\mathbf{x}}_{n}^{h};\xi^{n})\|^{2}]$;
10: Decompose the updated local model $\bar{\mathbf{x}}^{h}_{n}\rightarrow\bar{\mathbf{v}}^{h}_{n}\cdot\bar{\mathbf{u}}^{h}_{n}$;
11: Send $L_{n}$, $\sigma_{n}^{2}$, $G_{n}^{2}$, $\bar{\mathbf{v}}^{h}_{n}$ and $\bar{\mathbf{u}}_{n}^{h}$ to the PS.

V-C Algorithm Description

Based on Eq. (26), we design a greedy-based control algorithm to adaptively assign proper coefficients and local update frequencies for heterogeneous clients. The proposed algorithm spans both the PS and client sides, which are formally described in Alg. 1 and Alg. 2, respectively. We introduce the algorithm in detail following the workflow of a training round.

Firstly, at the beginning of each round $h$, the algorithm determines the model width $p_{n}^{h}$ for each client $n\in\mathcal{N}^{h}$ (Lines 6-11 of Alg. 1). To minimize the reducing error, the algorithm greedily adds coefficient blocks for each client until either the estimated local iteration time reaches the given maximum $\mu^{max}$ or the model width $p_{n}^{h}$ reaches the maximum value $P$.

Secondly, the algorithm selects the fastest client with the least completion time (Lines 12-14 of Alg. 1). Specifically, for each participating client $n\in\mathcal{N}^{h}$, the algorithm assumes it is the fastest client and solves the approximated problem in Eq. (26), where the optimization objective is $T_{n}(H,\tau_{n})=\sum_{h^{\prime}=h}^{H}(\tau_{n}\cdot\mu_{n}^{h^{\prime}}+\nu_{n}^{h^{\prime}})$ with $\tau_{n}=\sqrt{\frac{12F(\mathbf{x}^{h})}{\eta^{2}HL(G^{2}+18\sigma^{2})}}$. Then, the number of rounds $H$ is obtained and the total completion time $T_{n}$ for the entire training process is calculated according to $H$. The algorithm then selects the fastest client $l$, i.e., $l\leftarrow\arg\min_{n\in\mathcal{N}^{h}}T_{n}$. However, solving this problem requires information about the entire training process, such as the clients' status and the network bandwidth. Unfortunately, since this information is usually time-varying in the dynamic edge system, it is impossible to obtain it in advance. To this end, we adopt the information in the current round to approximate the unavailable future information. Therefore, the optimization objective is approximated as follows:

$T_{n}(H)=H\cdot\Big(\sqrt{\frac{12F(\mathbf{x}^{h})}{\eta^{2}HL(G^{2}+18\sigma^{2})}}\cdot\mu_{n}^{h}+\nu_{n}^{h}\Big)$ (27)

Notably, the values of some variables, such as $L$, $\sigma^{2}$ and $G^{2}$, are unknown at the beginning. To address this issue, when $h=0$, we adopt an identical predefined local update frequency $\tau_{1}$ for all participating clients, without performing the algorithm. When $h\geq 1$, each participating client estimates these variables over its local loss function (Lines 7-9 of Alg. 2) and the PS aggregates them to obtain their specific values (Line 25 of Alg. 1).
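Lines 12-14 of Alg. 1 can then be sketched as follows, assuming the number of rounds $H$ has already been estimated from Eq. (26); the list and parameter names are illustrative and `F_h` denotes $F(\mathbf{x}^{h})$.

```python
# Evaluating Eq. (27) per client and picking the fastest one (a hedged sketch).
import math

def total_time(H, F_h, eta, L, G2, sigma2, mu_n, nu_n):
    """Approximated total completion time T_n(H) from Eq. (27)."""
    tau_n = math.sqrt(12.0 * F_h / (eta ** 2 * H * L * (G2 + 18.0 * sigma2)))
    return H * (tau_n * mu_n + nu_n)

def fastest_client(H, F_h, eta, L, G2, sigma2, mus, nus):
    """Return the index l = argmin_n T_n, its frequency tau_l and round time T_l^h."""
    times = [total_time(H, F_h, eta, L, G2, sigma2, mu, nu)
             for mu, nu in zip(mus, nus)]
    l = min(range(len(times)), key=times.__getitem__)
    tau_l = math.sqrt(12.0 * F_h / (eta ** 2 * H * L * (G2 + 18.0 * sigma2)))
    return l, tau_l, tau_l * mus[l] + nus[l]
```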

Thirdly, the algorithm determines the other clients' local update frequencies based on the fastest client $l$'s completion time in round $h$, i.e., $T_{l}^{h}$ (Lines 15-19 of Alg. 1). Specifically, for each client $n\in\mathcal{N}^{h}$, the algorithm first derives a frequency space $[\tau_{a},\tau_{b}]$ according to Eq. (24), ensuring that the waiting time does not exceed the threshold $\rho$. Then, the $(p_{n}^{h})^{2}$ blocks with the least total update times are selected to form the reduced coefficient $\hat{\mathbf{u}}_{n}^{h}$ for client $n$. Finally, the algorithm searches for the local update frequency $\tau_{n}^{h}$ within $[\tau_{a},\tau_{b}]$ that minimizes the variance among the total update times of all coefficient blocks.
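The interval derivation and the variance-minimizing search can be sketched as below; the helper names and the assumption of a non-empty interval are ours.

```python
# Per-client frequency search (Lines 16-19 and 22 of Alg. 1): derive [tau_a, tau_b]
# from Eq. (24), then choose the frequency that best balances the blocks.
import math
import statistics

def feasible_tau_interval(T_l, mu_n, nu_n, rho):
    """[tau_a, tau_b] such that 0 <= T_l - (tau*mu_n + nu_n) <= rho (Eq. (24))."""
    tau_b = int((T_l - nu_n) // mu_n)                     # waiting time >= 0
    tau_a = max(1, math.ceil((T_l - rho - nu_n) / mu_n))  # waiting time <= rho
    return tau_a, tau_b

def pick_tau(tau_a, tau_b, update_counts, block_ids):
    """Choose tau in [tau_a, tau_b] minimizing the blocks' update-time variance (Eq. (21))."""
    best_tau, best_var = tau_a, float('inf')
    for tau in range(tau_a, tau_b + 1):
        counts = list(update_counts)
        for i in block_ids:                               # blocks assigned to client n
            counts[i] += tau                              # Line 22 of Alg. 1
        var = statistics.pvariance(counts)
        if var < best_var:
            best_tau, best_var = tau, var
    return best_tau
```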

Fourthly, the PS sends the global basis $\mathbf{v}^{h}$, reduced coefficient $\hat{\mathbf{u}}_{n}^{h}$ and local update frequency $\tau_{n}^{h}$ to each participating client $n$ for local training (i.e., Alg. 2). Upon receiving the updated basis $\bar{\mathbf{v}}_{n}^{h}$ and coefficient $\bar{\mathbf{u}}_{n}^{h}$, as well as the estimated values $L_{n}$, $\sigma^{2}_{n}$ and $G^{2}_{n}$, the PS performs global aggregation (Lines 25-26 of Alg. 1). The whole process continues until the total time cost exceeds the budget $T^{max}$.

VI Performance Evaluation

VI-A Datasets and Models

VI-A1 Datasets

We conduct the experiments over three common datasets: CIFAR-10 [krizhevsky2009learning], ImageNet [russakovsky2015imagenet] and Shakespeare [caldas2018leaf]. Specifically, CIFAR-10 is an image dataset including 60,000 images (50,000 for training and 10,000 for testing), which are 3×32×32 dimensional and evenly drawn from 10 classes. ImageNet contains 1,281,167 training images, 50,000 validation images and 100,000 test images from 1,000 classes and is more challenging for training visual recognition models. Considering the constrained resources of edge clients, we create a subset of ImageNet, called ImageNet-100, that consists of 100 out of the 1,000 classes. Besides, each image's resolution is resized to 3×144×144. Shakespeare is a text dataset for next-character prediction built from Shakespeare Dialogues, and includes 422,615 samples with a sequence length of 80. We split the dataset into 90% for training and 10% for testing [caldas2018leaf]. CIFAR-10 and ImageNet-100 represent low-resolution and high-resolution computer vision (CV) learning tasks, respectively, while Shakespeare represents a natural language processing (NLP) learning task.

VI-A2 Data Distribution

To simulate non-independent and identically distributed (Non-IID) data, we adopt three different data partition schemes for the three datasets. Specifically, we adopt latent Dirichlet allocation (LDA) over CIFAR-10 [jiang2023heterogeneity], where $\varGamma$% ($\varGamma=$ 20, 40, 60 and 80) of the samples on each client belong to one class and the remaining samples evenly belong to the other classes. Particularly, $\varGamma=$ 10 represents the IID setting. For ImageNet-100, we ensure that each client lacks $\phi$ ($\phi=$ 20, 40, 60 and 80) classes of samples and the data volume of each class is the same [wang2022accelerating], where $\phi=$ 0 represents the IID setting. In our experiments, both $\varGamma$ and $\phi$ are set to 40 by default. Shakespeare is a naturally Non-IID dataset, where the dialogues of each speaking role in each play are regarded as the local data of a specific client [caldas2018leaf] and the Non-IID level is fixed. For a fair comparison, the full test datasets are used to evaluate the models' performance.
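An illustrative sketch of the $\varGamma$-based CIFAR-10 partition described above is given below; the parameter values, the pool-sampling details and the assumption that each class pool is large enough are ours. Note that with $\varGamma=10$ and 10 classes, every client's distribution becomes uniform, matching the IID setting mentioned in the text.

```python
# A hedged Non-IID partition sketch: each client draws Gamma% of its samples
# from one "dominant" class and the rest evenly from the remaining classes.
import numpy as np

def partition_noniid(labels, num_clients, samples_per_client, gamma,
                     num_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    by_class = {c: list(np.where(labels == c)[0]) for c in range(num_classes)}
    for idxs in by_class.values():
        rng.shuffle(idxs)                         # shuffle each class pool once
    client_indices = []
    for n in range(num_clients):
        dominant = n % num_classes                # assumed assignment of dominant class
        n_dom = int(samples_per_client * gamma / 100)
        n_rest = (samples_per_client - n_dom) // (num_classes - 1)
        picked = [by_class[dominant].pop() for _ in range(n_dom)]
        for c in range(num_classes):
            if c != dominant:
                picked += [by_class[c].pop() for _ in range(n_rest)]
        client_indices.append(picked)
    return client_indices
```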

VI-A3 Models

To validate the universality of the enhanced neural composition technique, we conduct the experiments across several different architectures. Firstly, a 4-layer CNN with three 3×3 convolutional layers and one linear output layer is adopted for the CIFAR-10 dataset. Secondly, we utilize the standard ResNet-18 for the more challenging ImageNet-100 dataset. Thirdly, for the Shakespeare dataset, we adopt an RNN model and set both the hidden channel size and embedding size to 512 [mei2022resource].
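For concreteness, a hedged PyTorch sketch of the 4-layer CNN used for CIFAR-10 follows; the channel width, pooling and activation choices are our assumptions, since the paper only fixes the layer types.

```python
# A sketch of a 4-layer CNN: three 3x3 conv layers plus one linear output layer.
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10, width=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(width, num_classes)   # linear output layer

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```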

VI-B Baselines and Metrics

VI-B1 Baselines

We choose the following four baselines for performance comparison: ① FedAvg [mcmahan2017communication] transmits and trains the entire model with a fixed (non-dynamic) and identical (non-diverse) local update frequency for all clients. ② ADP [wang2018edge] dynamically determines an identical local update frequency for all clients in each round on the basis of the constrained resources. ③ HeteroFL [diao2020heterofl] reduces the model width for each client based on its computation power via model pruning. ④ Flanc [mei2022resource] utilizes the neural composition technique to adjust the model width, where coefficients of different shapes do not share any parameters.

VI-B2 Metrics

We employ the following four metrics to evaluate the performance of Heroes and the baselines. ① Test Accuracy is measured by the proportion of correctly classified samples among all test samples. ② Average Waiting Time is calculated by averaging the time each client waits for global aggregation in a round, reflecting the impact of client heterogeneity. ③ Completion Time is defined as the total time cost to reach the target accuracy, which reflects the training speed. ④ Network Traffic is the overall size of the models (or tensors) transmitted between the PS and clients during the training process, which quantifies the communication cost.

VI-C Experimental Setup

The experimental environment is built on an AMAX deep learning workstation equipped with an Intel Xeon 5218 CPU, 8 NVIDIA GeForce RTX 3090 GPUs and 256GB RAM. We simulate an FL system with 100 virtual clients and one PS (each is implemented as a process in the system) on this workstation. In each round, we randomly activate 10 clients to participate in training. Specifically, the model training and testing are implemented based on the PyTorch framework (https://pytorch.org/docs/stable), and the MPI for Python library (https://mpi4py.readthedocs.io/en/stable/) is utilized to build up the communication between the clients and the PS.

To reflect heterogeneous and dynamic network conditions, we let each client's download speed fluctuate between 10Mb/s and 20Mb/s [wang2022accelerating]. Since the upload speed is usually lower than the download speed in typical WANs, we configure it to fluctuate between 1Mb/s and 5Mb/s [jiang2023heterogeneity] for each client. Besides, considering that the clients' computation capabilities are also heterogeneous and dynamic, the time cost of one local iteration on a simulated client follows a Gaussian distribution whose mean and variance are derived from the time records on a physical device (e.g., laptop, TX2, Xavier NX, AGX Xavier) [liao2023adaptive].
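A minimal sketch of how such a heterogeneous client could be simulated is given below; the mean and standard-deviation values are placeholders rather than the measured device profiles.

```python
# Simulating one client's per-iteration compute time with a Gaussian model.
import random

def sample_iteration_time(mean_s, std_s):
    """Draw one local-iteration time (seconds), clipped to stay positive."""
    return max(1e-3, random.gauss(mean_s, std_s))

# e.g., a "laptop-like" client vs. a slower "TX2-like" client (illustrative values):
laptop_time = sample_iteration_time(0.05, 0.01)
tx2_time = sample_iteration_time(0.20, 0.04)
```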

VI-D Evaluation Results

VI-D1 Training Performance

We implement Heroes and baselines on CIFAR-10 and ImageNet-100 to observe their training performance (e.g., test accuracy). The results in Fig. 4 show that Heroes converges much faster than the baselines while accomplishing a comparable accuracy. For instance, by Fig. 4(a), Heroes takes 1,375s to achieve an accuracy of 70% for CNN on CIFAR-10, while FedAvg, ADP, HeteroFL and Flanc take 4,508s, 3,924s, 3,015s and 3,187s, respectively. In other words, Heroes can speed up the training process by up to 2.67× compared to the baselines. Besides, the model's accuracy in Heroes also surpasses that in the baselines within a given completion time. For example, Heroes achieves an accuracy of 64.36% after training ResNet-18 over ImageNet-100 for 40,000s, while that of FedAvg, ADP, HeteroFL and Flanc is 55.22%, 56.34%, 52.11% and 51.89%, respectively. In general, within the same time budget, Heroes can improve the test accuracy by about 10.46% compared with the baselines. These results demonstrate the advantages of Heroes in accelerating model training.

VI-D2 Impact of Client Heterogeneity

To evaluate the impact of client heterogeneity on model training with different schemes, we illustrate the average waiting time per round among the participating clients in Fig. 5. The results show that Heroes incurs much less waiting time than the baselines, indicating high robustness against system heterogeneity. For example, by Fig. 5(a), the average waiting time in Heroes is 2.86s when training CNN over CIFAR-10, while that in FedAvg, ADP, HeteroFL and Flanc is 15.37s, 11.02s, 8.34s and 5.96s, respectively. Specifically, both FedAvg and ADP assign the entire model and identical local update frequencies to all clients in each round without considering the system heterogeneity, resulting in non-negligible waiting time. HeteroFL and Flanc reduce the model width for different clients according to their various computation power. However, they ignore the heterogeneity in clients' communication capabilities. For example, a client with a slow upload speed will easily become the straggler and delay the global aggregation. In addition to reducing the model width, Heroes also adjusts the local update frequencies for different clients to balance their completion time in each round. Therefore, Heroes can diminish the impact of system heterogeneity significantly.

Figure 4: Training performance of different schemes. (a) CNN on CIFAR-10. (b) ResNet-18 on ImageNet-100.
Figure 5: Average waiting time of different schemes. (a) CNN on CIFAR-10. (b) ResNet-18 on ImageNet-100.

VI-D3 Resource Consumption

We observe the resource consumption (e.g., network traffic and completion time) of the five schemes when they achieve different target accuracies on the two image datasets (e.g., 75% on CIFAR-10, 60% on ImageNet-100). The results in Fig. 6 and Fig. 8 demonstrate that Heroes can greatly mitigate both the time and traffic costs. For instance, in Fig. 8(a), to obtain the target accuracy of 50% on ImageNet-100, the traffic cost of Heroes is 17.81GB, while that of FedAvg, ADP, HeteroFL and Flanc is 82.34GB, 77.75GB, 62.87GB and 49.38GB, respectively. At the same time, by Fig. 8(b), Heroes can speed up the training process by about 3.09×, 2.86×, 3.15× and 2.81×, compared with FedAvg, ADP, HeteroFL and Flanc, respectively. In summary, Heroes achieves the target accuracy fastest, with about 2.97× speedup, while reducing the network traffic by about 72.05% compared with the baselines.

Figure 6: The resource consumption of CNN on CIFAR-10. (a) Traffic Overhead. (b) Completion Time.

VI-D4 Impact of Non-IID Data

We test these schemes' test accuracies over the CIFAR-10 and ImageNet-100 datasets under different Non-IID levels within a given completion time (800s for CIFAR-10 and 40,000s for ImageNet-100). The results in Fig. 7 indicate that the test accuracy decreases as the Non-IID level increases for all schemes. For instance, by Fig. 7(a), when training CNN on CIFAR-10 with $\varGamma=80$, Heroes achieves an accuracy of 68.9%, which is 14.72%, 13.21%, 7.48% and 24.08% higher than FedAvg, ADP, HeteroFL and Flanc, respectively. Heroes exploits the benefits of the neural composition technique and adaptive control of the local update frequency, which can accelerate the FL process while maintaining the accuracy of the complete model. Compared to FedAvg and ADP, Heroes performs more training rounds and thus achieves higher accuracy within the given time. Compared to HeteroFL and Flanc, Heroes enables every parameter in the global model to be fully trained over the full range of knowledge, thus eliminating the training bias and achieving better performance.

Figure 7: Training performance under different Non-IID levels. (a) CNN on CIFAR-10. (b) ResNet-18 on ImageNet-100.
Figure 8: The resource consumption of ResNet-18 on ImageNet-100. (a) Traffic Overhead. (b) Completion Time.
Figure 9: The performance of training RNN over Shakespeare. (a) Test Accuracy vs. Time. (b) Traffic Overhead.

VI-D5 Performance on Text Dataset

Finally, to verify the generalization of the enhanced neural composition technique, we conduct a set of experiments to train RNN over the text dataset Shakespeare. The results in Fig. 9 demonstrate that Heroes can also accomplish better training performance than the baselines on the NLP learning task. Specifically, according to Fig. 9(a), Heroes takes 1,862s to reach the target accuracy of 45%, while FedAvg, ADP, HeteroFL and Flanc take 4,183s, 4,015s, 3,182s and 2,874s, respectively. Besides, by Fig. 9(b), compared with FedAvg, ADP, HeteroFL and Flanc, Heroes saves about 60.71%, 42.57%, 38.65% and 26.72% of network traffic, respectively. Therefore, compared with the baselines, Heroes can provide up to 1.91× speedup and reduce the traffic consumption by about 45.06% when training RNN over Shakespeare.

VII Related Work

As a practical and promising approach, FL has garnered significant interest from both the research and industrial communities [kairouz2021advances]. However, the training efficiency of FL often suffers from resource limitations and client heterogeneity [imteaj2021survey]. In recent years, many works have been proposed to improve the training efficiency of FL. To tackle the challenge of client heterogeneity, some research [lai2021oort, luo2022tackling] optimizes the client sampling strategy to diminish the heterogeneity degree among participating clients. Another approach is adjusting the local update frequencies for different clients [li2020federated], so as to balance their completion time and mitigate the effect of stragglers. However, these approaches cannot effectively conserve the limited resources in FL.

Compressing the transmitted gradients is a common way to alleviate the communication overhead [jiang2023heterogeneity, li2021talk, wang2023federated, cui2022optimal, liu2022communication, nori2021fast]. To further reduce the computation overhead, a natural solution is to prune the global model into a smaller sub-model for training [diao2020heterofl, horvath2021fjord, alam2022fedrolex, li2020lotteryfl, mugunthan2022fedltn, li2021hermes]. Nevertheless, the compressed or pruned parameters are under-optimized in these approaches, degrading the training performance. To address this issue, the neural composition technique [mei2022resource] is proposed to construct size-adjustable models using more efficient low-rank tensors, while enabling every parameter to learn the knowledge from all clients. However, the global model may still get insufficient training, leading to a long completion time, especially in heterogeneous edge networks.

VIII Conclusion

In this paper, we have proposed a lightweight FL framework, called Heroes, to address the challenges of resource limitation and client heterogeneity with the enhanced neural composition and adaptive local update. We have analyzed the convergence of Heroes and designed a greedy-based algorithm to jointly assign proper tensors and local update frequency for each client, which enables every parameter in the global model to benefit from all clients’ knowledge and get fully trained. Extensive experiments demonstrate the effectiveness and advantages of our proposed framework.