
Unity is Power: Semi-Asynchronous Collaborative Training of Large-Scale Models with Structured Pruning in Resource-Limited Clients

Yan Li, Mingyi Li, Xiao Zhang, Guangwei Xu, Feng Chen, Yuan Yuan,
Yifei Zou, Mengying Zhao, Jianbo Lu, Dongxiao Yu
Shandong University
Qingdao 266237, China
Abstract

In this work, we study how to release the potential of massive heterogeneous weak computing power to collaboratively train large-scale models on dispersed datasets. To improve both efficiency and accuracy in resource-adaptive collaborative learning, we take the first step to consider the unstructured-pruning, varying-submodel-architecture, knowledge-loss, and straggler challenges simultaneously. We propose a novel semi-asynchronous collaborative training framework, namely $Co\text{-}S^2P$, with data distribution-aware structured pruning and a cross-block knowledge transfer mechanism to address these concerns. Furthermore, we provide theoretical proof that $Co\text{-}S^2P$ achieves an asymptotically optimal convergence rate of $O(1/\sqrt{N^{*}EQ})$. Finally, we conduct extensive experiments on a real-world hardware testbed, in which 16 heterogeneous Jetson devices are united to train large-scale models with up to 0.11 billion parameters. The experimental results demonstrate that $Co\text{-}S^2P$ improves accuracy by up to 8.8% and resource utilization by up to 1.2$\times$ compared to state-of-the-art methods, while reducing memory consumption by approximately 22% and training time by about 24% on all resource-limited devices.

1 Introduction

Pretrained foundation models (PFMs) with huge numbers of parameters, such as ChatGPT and GPT-4, have witnessed great success in various fields [16, 42, 23] and demonstrate superior performance on a variety of downstream tasks, such as CV [32], NLP [3], robotics [8], etc. However, training such large-scale pretrained models is becoming increasingly expensive, usually requiring scaling to thousands of high-performance GPUs.

In real-world scenarios, heterogeneous weak computing power is everywhere, e.g., mobile devices equipped with limited computational and memory resources. Hence, it would be difficult and unaffordable for resource-constrained clients to train full large-scale models. In addition, resource-limited devices produce heterogeneous data and create "data silos" due to the promulgation of rigorous data regulations such as GDPR [30]. Therefore, a practical problem arises: how to release the potential of massive resource-limited devices to train large-scale models by uniting weak computing power and heterogeneous dispersed datasets collaboratively?

Traditionally, fruitful works have explored lightweight collaborative training methods, such as quantization [25, 20, 40], compression [10, 15], distillation [26, 6], and pruning strategies [37]. In addition, adaptively splitting globally large models into submodels can facilitate the collaboration process in resource-limited federated learning (FL), e.g., RAM-Fed [31], PruneFL [14], OAP [43], and pFedGate [5]. However, when deploying the aforementioned methods to prune large-scale models

Figure 1: Unstructured pruning fails to train the split submodels due to memory constraints in real-world resource-limited clients.

in real-world resource-limited hardware, both efficiency and accuracy should be carefully considered, which raises some tricky challenges. (1) Unstructured pruning. The existing adaptive splitting methods adopt unstructured pruning [12, 13], which only zeros out unimportant parameters without reducing the original parameter size. As shown in Fig.1, unstructured pruning fails to train the split submodels due to memory constraints on real-world hardware. (2) Varying submodel architectures. Randomized or weight-based submodels might not fit the heterogeneous data distributions of each client, which limits the learning potential of the generated submodels. (3) Knowledge loss. Shallow submodels fail to learn high-level and complex knowledge due to the sparsity of the submodel architectures, causing knowledge loss on the resource-limited clients. (4) Stragglers. The limited computation abilities of clients can cause enormous disparities in submodel training time, slowing down the large model's convergence and lowering the utilization of dispersed computing power.

Along this line, in this work, we propose a novel Semi-asynchronous Collaborative training framework with Structured Pruning, named $Co\text{-}S^2P$, which can release the potential of massive heterogeneous resource-limited computing power to train a globally large model. In detail, $Co\text{-}S^2P$ designs a data distribution-aware structured pruning algorithm to ensure a balanced learning capability of submodels in both the depth and width dimensions and to accelerate the training process. The server first prunes the blocks of the large model in a rolling fashion according to the available resources of clients. Then, the clients train structured width-based segment-wise masks on their dispersed datasets to make the submodels fit the heterogeneous data distributions. Considering the knowledge loss of the pruned submodels on resource-limited clients, we adopt self-distillation to implement cross-block knowledge transfer. To improve resource utilization and mitigate the straggler challenge, we design a semi-asynchronous aggregation strategy that further accelerates the convergence of the large model. Furthermore, we give a detailed convergence analysis of $Co\text{-}S^2P$, which achieves an asymptotically optimal rate $O(1/\sqrt{N^{*}EQ})$. Finally, we deploy $Co\text{-}S^2P$ on a real-world hardware testbed, in which 16 heterogeneous Jetson devices are united to train a large-scale model with up to 0.11 billion (B) parameters, demonstrating its superiority over the state of the art while effectively reducing memory usage and training time. The main contributions are summarized as follows:

  • To release the potential of massive heterogeneous weak computing power to train large-scale models on dispersed datasets, we take the first step to improve both the efficiency and accuracy of collaborative training by addressing the unstructured-pruning, varying-submodel-architecture, knowledge-loss, and straggler challenges within a unified framework.

  • We propose a semi-asynchronous collaborative training framework $Co\text{-}S^2P$, in which data distribution-aware structured pruning is performed in both the depth and width dimensions while accelerating training. In addition, the structured-pruned submodels are trained with self-distillation to implement cross-block knowledge transfer.

  • We theoretically prove that the strategy converges at a rate of $O(1/\sqrt{N^{*}EQ})$, which shows that our semi-asynchronous aggregation strategy mitigates the straggler problem while preserving the convergence rate.

  • We conduct extensive experiments on a real-world hardware testbed, in which 16 Jetson devices are united to train large-scale models with up to 0.11B parameters. $Co\text{-}S^2P$ improves accuracy by up to 8.8% and resource utilization by up to 1.2$\times$ compared with the state-of-the-art methods, while significantly reducing memory consumption and training time.

Figure 2: Overview of $Co\text{-}S^2P$. ① The server prunes blocks in depth, top-to-bottom, according to the available resources of clients. ② The clients train structured width-based masks based on local datasets. ③ The clients train submodels with self-distillation to implement cross-block knowledge transfer. ④ We design a semi-asynchronous aggregation strategy to mitigate the problem of stragglers.

2 Related Work

Lightweight Collaborative Training. With the popularity of PFMs, it would be unaffordable for resource-limited devices to train the full model under classic collaborative learning. In recent years, fruitful works have been proposed for lightweight collaborative training such as quantization [25, 40, 20], compression [19, 10, 15] and distillation [26, 6]. Moreover, some works design pruning strategies [12, 14, 37] and mask strategies [17, 13, 5] to adapt to the limited resources, but fail to accelerate the training process due to the unstructured pruning. To ensure both efficiency and accuracy, we introduce personalized trainable structured masks based on local data distribution, so as to better adapt to local limited resources and accelerate the model training process.

Asynchronous Collaborative Training. In the collaborative training scenario, each client usually has heterogeneous resources, such as dispersed data or computational power. Stragglers can therefore slow down the whole training process, forcing the server to wait for updates from all clients. To solve the straggler problem, some fully asynchronous methods [33, 45] and semi-asynchronous collaborative training works such as [35, 4, 24, 41, 29] have been proposed. In our work, we design a novel semi-asynchronous aggregation strategy to accelerate collaborative training and mitigate the impact of stale clients simultaneously.

3 Problem Formulation

In this paper, we aim to train globally large models collaboratively while optimizing the utilization of diverse, dispersed, resource-limited computing power. Specifically, we assume there exist $N$ clients with different resources and locally heterogeneous datasets $\mathcal{D}_{n}=\{(\mathbf{X}_{i},y_{i})\}_{i=1,\dots,|\mathcal{D}_{n}|}$, where $(\mathbf{X}_{i},y_{i})$ denotes the $i$-th training data sample and its ground-truth label, and $n\in\{1,\ldots,N\}$. In the resource-limited collaborative learning scenario, we aim to solve the following optimization problem:

\min_{w\in\mathbb{R}^{d}}F(w):=\sum_{n=1}^{N}\frac{1}{N}\cdot\mathbb{E}_{(x,y)\sim\mathcal{D}_{n}}\left[f\left(w;x,y\right)\right], \qquad (1)

where $w$ and $w_{n}$ are the parameters of the global model and the pruned submodel of client $n$, respectively. $f\left(w;x,y\right)\triangleq\ell\left(h_{w}(x),y\right)$, where $\ell(\cdot)$ is the loss function.
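
For reference, Eq. (1) is simply the uniform average of the clients' expected losses. The snippet below is a minimal, hypothetical PyTorch-style sketch (the helper and the name client_loaders are our illustration, not part of the paper's code) that evaluates this objective over the dispersed datasets:

import torch
import torch.nn.functional as F

def global_objective(model, client_loaders, device="cpu"):
    """Approximate F(w) in Eq. (1): the uniform average of the
    per-client empirical losses over their local datasets."""
    model.eval()
    client_losses = []
    with torch.no_grad():
        for loader in client_loaders:          # one loader per client n
            total, count = 0.0, 0
            for x, y in loader:                # (X_i, y_i) ~ D_n
                x, y = x.to(device), y.to(device)
                logits = model(x)              # h_w(x)
                total += F.cross_entropy(logits, y, reduction="sum").item()
                count += y.numel()
            client_losses.append(total / max(count, 1))
    # uniform 1/N weighting across the N clients
    return sum(client_losses) / len(client_losses)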

4 $Co\text{-}S^2P$ Framework Design

We propose a semi-asynchronous collaborative training framework $Co\text{-}S^2P$, as shown in Fig.2. In $Co\text{-}S^2P$, we design a data distribution-aware structured pruning algorithm to ensure a balanced learning capability of submodels in both the depth and width dimensions while accelerating training (Sec.4.1). The server first prunes blocks in a rolling fashion according to the available resources of clients. Then, in order to make the submodels fit the heterogeneous data distributions, the clients train structured width-based segment-wise masks based on their local datasets. Considering the knowledge loss on resource-limited clients, we train the submodels with self-distillation to implement cross-block knowledge transfer (Sec.4.2). Finally, to mitigate the straggler problem, we design a semi-asynchronous aggregation strategy (Sec.4.3).

4.1 Data Distribution-Aware Structured Pruning

Figure 3: Details of data distribution-aware structured pruning in both depth and width dimensions.

In this section, we focus on balanced structured pruning of the globally large model in both the depth and width dimensions, as shown in Fig.3. For each heterogeneous client with weak computing power, we obtain an overall pruned rate $R_{n}$ according to its available resources, which consists of the width pruned rate $R_{n}^{width}$ and the depth pruned rate $R_{n}^{depth}$. The pruned rate is defined as the ratio of the structured-pruned model size to the full model size. For client $n$, the server first generates the depth-pruned submodel with $\lfloor L\times R_{n}^{depth}\rfloor$ consecutive trainable blocks by freezing the shallow blocks bottom-to-top and pruning the deep blocks top-to-bottom in a rolling fashion.
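
To make the rolling depth pruning concrete, the following is a minimal sketch in Python (our own illustration; the exact rolling schedule across rounds is an assumption, since the paper only specifies a window of $\lfloor L\times R_{n}^{depth}\rfloor$ consecutive trainable blocks with frozen shallow blocks below and pruned deep blocks above):

def depth_prune_plan(L, depth_rate, round_idx):
    """Rolling depth pruning: keep floor(L * depth_rate) consecutive
    trainable blocks; freeze the shallower blocks below the window and
    prune the deeper blocks above it. Here the window start is assumed
    to roll with the communication round so all blocks get trained."""
    n_train = max(1, int(L * depth_rate))      # trainable blocks for this client
    start = round_idx % (L - n_train + 1)      # assumed rolling window start
    frozen = list(range(0, start))             # shallow blocks: kept but frozen
    trainable = list(range(start, start + n_train))
    pruned = list(range(start + n_train, L))   # deep blocks: removed for this round
    return frozen, trainable, pruned

# Example: a 12-block ViT, depth rate 0.5, round 3
# -> frozen [0, 1, 2], trainable [3..8], pruned [9, 10, 11]
print(depth_prune_plan(12, 0.5, 3))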

Then, in order to make the submodels fit the heterogeneous data distributions, the clients train width-based structured masks based on the local data distributions. A Transformer-like model mainly consists of linear layers and Multi-head Self-Attention (MSA) layers. Consequently, we design trainable segment-wise masks for these two types of layers, where a segment represents part of the parameters of a linear layer or a head of an MSA layer. We denote the weight matrix and its binary mask by $w_{n}^{l_{i}}\in\mathbb{R}^{d_{out}\times d_{in}}$ and $M_{n}^{l_{i}}\in\{0,1\}^{d}$ respectively, where $l_{i}$ represents the $i$-th layer. For a linear layer, each value of $M_{n}^{l_{i}}\in\{0,1\}^{d_{out}}$ indicates whether the corresponding part of the layer is pruned or not. For an MSA layer, each value of $M_{n}^{l_{i}}\in\{0,1\}^{h}$ indicates whether the corresponding head is pruned or not, where $h$ denotes the number of heads.

However, considering that the segment-wise mask $M_{n}^{l_{i}}$ is a binary tensor, directly applying existing back-propagation methods to model training would lead to errors in the backward gradients. Inspired by [44, 13], we introduce a corresponding segment-wise importance mask $I_{n}^{l_{i}}\in\mathbb{R}^{d}$ and a segment-wise probability mask $P_{n}^{l_{i}}\in[0,1]^{d}$. To enable back-propagation, we apply sigmoid scaling to the importance mask and then use the Bernoulli distribution [7] to generate the binary mask $M_{n}^{l_{i}}\sim Bern(\frac{1}{1+e^{-I_{n}^{l_{i}}}})$. We then use the inverse of the sigmoid function and record $P_{n}^{l_{i}}$ with the straight-through estimator [2] for back-propagation.
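
A minimal sketch of this trainable mask, assuming PyTorch and a common straight-through variant (the paper's exact estimator based on the inverse sigmoid may differ; the module name SegmentMask is ours):

import torch
import torch.nn as nn

class SegmentMask(nn.Module):
    """Trainable segment-wise binary mask with a straight-through estimator."""
    def __init__(self, num_segments):
        super().__init__()
        # segment-wise importance mask I_n^{l_i}, one score per segment
        self.importance = nn.Parameter(torch.zeros(num_segments))

    def forward(self):
        # probability mask P_n^{l_i} = sigmoid(I_n^{l_i})
        prob = torch.sigmoid(self.importance)
        # binary mask M_n^{l_i} ~ Bern(P_n^{l_i})
        hard = torch.bernoulli(prob)
        # straight-through estimator: the forward pass uses the hard mask,
        # the backward pass uses the gradient of the probability mask
        return hard + prob - prob.detach()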

Furthermore, a binary matrix mask $\hat{M}_{n}^{l_{i}}\in\mathbb{R}^{d_{out}\times d_{in}}$ is introduced as an extension of $M_{n}^{l_{i}}$ to take the Hadamard product with the weight matrix of the model. To train the personalized mask $M_{n}^{l_{i}}$ efficiently, we freeze the corresponding weights of the depth-pruned submodel. Considering the resource-limited setting, we use the submodel size to reflect the computation and communication costs. The loss function is as follows:

\mathcal{L}_{n,mask}=\mathcal{L}_{CE}(\hat{y},y)+\lambda_{1}\left|\frac{\sum_{l_{i}\in S}sum(\hat{M}_{n}^{l_{i}})}{\sum_{l_{i}\in S}size(\hat{M}_{n}^{l_{i}})}-R_{n}^{width}\right|, \qquad (2)

where $\mathcal{L}_{CE}(\hat{y},y)$ represents the cross-entropy loss for target class $y$, and $S$ represents the set of linear layers and MSA layers. $\lambda_{1}$ is a hyperparameter that indicates the extent to which the submodel adapts to the limited resources. For each client, $\frac{\sum_{l_{i}\in S}sum(\hat{M}_{n}^{l_{i}})}{\sum_{l_{i}\in S}size(\hat{M}_{n}^{l_{i}})}$ stays as close as possible to $R_{n}^{width}$ to ensure a balanced learning capability of the submodel.
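
A minimal sketch of Eq. (2), assuming PyTorch (the function name and arguments are our illustration):

import torch
import torch.nn.functional as F

def mask_loss(logits, targets, segment_masks, width_rate, lam1=1.0):
    """Sketch of Eq. (2): cross-entropy plus a penalty that keeps the
    fraction of retained entries close to the width pruned rate.
    `segment_masks` is a list of the expanded binary masks \hat{M}_n^{l_i}."""
    ce = F.cross_entropy(logits, targets)
    kept = sum(m.sum() for m in segment_masks)           # sum of retained entries
    total = sum(m.numel() for m in segment_masks)        # total number of entries
    sparsity_gap = torch.abs(kept / total - width_rate)  # |ratio - R_n^width|
    return ce + lam1 * sparsity_gap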

Finally, we take the Hadamard product of the depth-pruned submodel and the binary matrix mask to obtain the structured-pruned submodel. The structured pruning strategy significantly reduces the model capacity, saves storage and computation resources, and directly accelerates training on resource-limited clients.

4.2 Self-Distillation based Knowledge Transfer

To mitigate the knowledge loss problem on resource-limited clients, $Co\text{-}S^2P$ introduces a lightweight self-distillation mechanism [38] to implement cross-block knowledge transfer. For the structured-pruned submodel, we place multiple classifiers at specific depths $h$, where $h\in\{h^{frozen}+L\times{R_{i}^{depth}}\;|\;{R_{i}^{depth}}\leq{R_{n}^{depth}},1\leq i\leq N\}$, $h^{frozen}$ is the number of frozen blocks, and $L$ is the number of blocks in the global model. We apply self-distillation by treating the deepest classifier as the teacher and the other classifiers as students. The loss function for training the submodel weights is as follows:

\mathcal{L}_{n,weight}=\sum_{i=1}^{c_{n}}\left((1-\lambda_{2})\mathcal{L}_{CE}(\hat{y}_{i},y)+\lambda_{2}\mathcal{L}_{KL}(\hat{y}_{i},\hat{y}_{c_{n}})\right), \qquad (3)

where $c_{n}$ is the number of classifiers in client $n$, and $\mathcal{L}_{CE}(\hat{y}_{i},y)$ represents the cross-entropy loss of the $i$-th classifier for target class $y$. $\mathcal{L}_{KL}(\hat{y}_{i},\hat{y}_{c_{n}})=sum(\sigma(\hat{y}_{i}/t)\ln{\frac{\sigma(\hat{y}_{i}/t)}{\sigma(\hat{y}_{c_{n}}/t)}})\,t^{2}$, where the temperature $t$ is a hyperparameter that controls the information capacity of the data distributions provided by the teacher, and $\lambda_{2}$ balances the cross-entropy loss and the self-distillation loss. The self-distillation mechanism enables the shallow blocks to acquire high-level knowledge without extra forward overhead, because the blocks share feature maps.
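
A minimal sketch of Eq. (3), assuming PyTorch (detaching the teacher logits is our assumption; the paper does not specify whether gradients flow through the teacher):

import torch
import torch.nn.functional as F

def self_distillation_loss(classifier_logits, targets, lam2=0.5, t=3.0):
    """Sketch of Eq. (3): the deepest classifier acts as the teacher for
    all classifiers (its own KL term is zero).
    `classifier_logits` is a list [y_hat_1, ..., y_hat_{c_n}] ordered by depth."""
    teacher = classifier_logits[-1].detach()   # deepest classifier y_hat_{c_n} (detach: our assumption)
    loss = 0.0
    for logits in classifier_logits:
        ce = F.cross_entropy(logits, targets)
        # KL( softmax(y_i/t) || softmax(y_{c_n}/t) ) * t^2, as in Eq. (3)
        kl = F.kl_div(F.log_softmax(teacher / t, dim=-1),
                      F.softmax(logits / t, dim=-1),
                      reduction="batchmean") * (t ** 2)
        loss = loss + (1 - lam2) * ce + lam2 * kl
    return loss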

Algorithm 1 Semi-Asynchronous Aggregation Algorithm
1:  Input: initial weight $w^{0}$, aggregation minimal ratio $\mu$, waiting interval $T_{clk}$, total communication rounds $Q$
2:  for $q=1$ to $Q$ do
3:     $C_{q}\leftarrow$ available clients
4:     for $C_{n,q}\in C_{q}$ (all workers in parallel) do
5:        $Co\text{-}S^2P$ client-training($w_{n,q}$, $q$)
6:     end for
7:     $cnt=0$
8:     while $cnt<\mu N$ do
9:        $\Delta_{q,n},M_{q,n},\tau_{q,n}\leftarrow$ update from client $n$
10:        for segment $i\in\Delta_{n}$ do
11:           compute $\gamma_{n}^{i}$ via Eq.4
12:        end for
13:        $cnt=cnt+1$
14:     end while
15:     $t_{clk}\leftarrow$ current time, $T_{timeout}=t_{clk}+T_{clk}$
16:     while $t_{clk}<T_{timeout}$ do
17:        $\Delta_{q,n},M_{q,n},\tau_{q,n}\leftarrow$ update from client $n$
18:        for segment $i\in\Delta_{n}$ do
19:           compute $\gamma_{n}^{i}$ via Eq.4
20:        end for
21:        $t_{clk}\leftarrow$ current time
22:     end while
23:     for segment $i\in w$ do
24:        $N_{q}^{i}=\{n:M_{n,q}^{i}=1\}$
25:        $w_{q+1}^{i}=w_{q}^{i}-\eta\sum_{n\in N_{q}^{i}}\frac{\gamma_{n}^{i}}{\sum_{n\in N_{q}^{i}}\gamma_{n}^{i}}\Delta_{n}^{i}$
26:     end for
27:  end for

4.3 Semi-Asynchronous Aggregation Strategy

To handle stragglers, we design a semi-asynchronous aggregation strategy with a minimum ratio $\mu$ of received clients and a waiting interval $T_{clk}$ to balance the performance of collaborative training and the resource utilization of all clients, as shown in Algorithm 1. After the server receives $\mu N$ client updates, it starts the clock and waits for $T_{clk}$ seconds before aggregation.

Considering that some updates may be too stale and deviate significantly from the latest global model, for the received gradients we compute a segment importance score on each trained segment $i$ as follows:

\gamma_{n}^{i}=\frac{\|\Delta_{q,n}^{i}\|_{1}}{\|w_{q}^{i}-w_{q-\tau_{q,n}}^{i}\|_{1}+size(w_{q}^{i})}, \qquad (4)

where $\Delta_{q,n}^{i}$ is the gradient of client $n$ on segment $i$ in the $q$-th round, which indicates the extent of the update, and $w_{q}^{i}$ represents the weights of the global model on segment $i$ in the $q$-th round. $w_{q-\tau_{q,n}}^{i}$ represents the weights of the global model on segment $i$ in the $(q-\tau_{q,n})$-th round, where $\tau_{q,n}$ is the number of delayed rounds during which client $n$ has not taken part in global aggregation. $\|w_{q}^{i}-w_{q-\tau_{q,n}}^{i}\|_{1}$ indicates the difference between the model received by client $n$ and the latest model.

Finally, the server aggregates the gradients and updates segment $i$ by normalizing the segment importance scores on segment $i$ as follows:

w_{q+1}^{i}=w_{q}^{i}-\eta\sum_{n\in N_{q}^{i}}\frac{\gamma_{n}^{i}}{\sum_{n\in N_{q}^{i}}\gamma_{n}^{i}}\Delta_{n}^{i}, \qquad (5)

where $N_{q}^{i}=\{n:M_{n,q}^{i}=1\}$ represents the set of clients training segment $i$ in the $q$-th round, and $\eta$ is the learning rate for training the masked submodel. It is worth noting that the aggregation strategy also transfers high-order knowledge from the deep blocks to the shallow blocks across all clients.
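
A minimal sketch of Eqs. (4) and (5), assuming PyTorch tensors for the segment weights and updates (the function names are ours):

import torch

def staleness_score(delta_i, w_q_i, w_stale_i):
    """Eq. (4): segment importance score. Larger updates computed on segments
    whose global weights have drifted little since the client last synced
    receive higher scores."""
    num = delta_i.abs().sum()
    den = (w_q_i - w_stale_i).abs().sum() + w_q_i.numel()
    return num / den

def aggregate_segment(w_q_i, updates, lr):
    """Eq. (5): importance-weighted aggregation of one segment.
    `updates` is a list of (delta_i, score) from the clients in N_q^i."""
    total = sum(score for _, score in updates)
    step = sum((score / total) * delta for delta, score in updates)
    return w_q_i - lr * step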

5 Convergence Analysis

In this section, we show the convergence rate of our proposed semi-asynchronous collaborative learning framework $Co\text{-}S^2P$. The detailed proof is in Appendix C. First, we give a crucial definition for the theoretical analysis:

Definition 1.

(Maximum delay). We define the delay as the number of rounds $\tau_{q,n}$ for which client $n$ has not taken part in the global aggregation, so we have:

\tau_{q}=\max\limits_{n}\tau_{q,n},\quad n\in\{1,\ldots,N\}. \qquad (6)

Then, we give assumptions for ease of convergence analysis:

Assumption 1.

($L$-smooth). Every function $F_{n}(\cdot)$ is $L$-smooth: for all $n\in[N]$ and $w,v\in\mathbb{R}^{d}$,

\|\nabla F_{n}(w)-\nabla F_{n}(v)\|\leq L\|w-v\|. \qquad (7)
Assumption 2.

(Bounded variance). There exists $\sigma>0$ such that:

\mathbb{E}_{\xi_{i}\sim D_{n}}\|\nabla F_{n}(w;\xi_{i})-\nabla F_{n}(w)\|^{2}\leq\sigma^{2}, \qquad (8)

where $\sigma$ bounds the variance of the stochastic gradients.

Assumption 3.

(Bounded data heterogeneity level). There exists $\delta>0$ such that:

\|\nabla F_{n}(w_{q})-\nabla F(w_{q})\|^{2}\leq\delta^{2}, \qquad (9)

where $\delta$ bounds the effect of heterogeneous data.

Assumption 4.

(Bounded gradient). In Algorithm 2, the expected squared norm of the stochastic gradients is uniformly bounded: there exists a constant $G>0$ such that for all $n,q,t$:

\mathbb{E}\|\nabla F_{n}(w_{q,n,t},\xi_{q,n,t})\|^{2}\leq G. \qquad (10)

With these definitions and assumptions, we can bound the average squared gradient norm for the convergence analysis of $Co\text{-}S^2P$.

Theorem 1.

Let all assumptions hold. Suppose that the step size $\eta$ satisfies the following relationships:

\left\{\begin{array}{l}8\eta^{2}L^{2}E^{2}\leq\frac{1}{2}\Rightarrow\eta\leq\frac{1}{4LE}\\ \frac{L}{2}\eta^{2}E^{2}-\frac{E\eta}{2}<0\Rightarrow\eta<\frac{1}{LE}\end{array}\right..

Therefore, the step size $\eta$ satisfies $0\leq\eta\leq\frac{1}{4LE}$.

Then, for all $Q\geq 1$, we have:

\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(w_{q})\|^{2}\leq\frac{2\mathbb{E}[F(w_{1})]}{E\eta Q}+4L\eta^{2}E^{2}\frac{N|K|}{N^{*}}G\left(\tau\frac{N|K|}{N^{*}}+16\right)+16L\eta^{2}E\frac{N|K|}{N^{*}}\sigma^{2}+64L\eta^{2}E^{2}\frac{N|K|}{N^{*}}\delta^{2},

where $\frac{1}{Q}\sum_{q=1}^{Q}(\tau_{q})^{2}=\tau$, $|K|$ is the number of segments, and $N^{*}=\min_{q,i}N_{q}^{i}$ is the minimum number of submodels training the corresponding segment $i$ over all rounds.

Theorem 1 shows the convergence of the semi-asynchronous aggregation mechanism in $Co\text{-}S^2P$ via the upper bound on the average gradient norm over all segments.

Remark 1.

Impact of the maximum delay $\tau_{q}$. As defined, $\tau_{q}=\max_{n}\tau_{q,n}$ is the maximum delay among all clients up to round $q$. The result indicates that the larger $\tau_{q}$ is, the worse the convergence rate is. Taking into account the local training of the submodels, a larger $\tau_{q}$ means that the initial local submodels deviate further from the latest global model, while a smaller $\tau_{q}$ is beneficial to convergence but is disturbed by stragglers. Therefore, our semi-asynchronous aggregation balances the convergence rate and the resource utilization.

Table 1: Performance comparisons on (ViT-Tiny/16)-ImageNet200 and (ViT-Base/16)-ImageNet300. Bold is the optimal result except for FedAvg(FA) with full model training.

Methods            (ViT-Tiny/16)-ImageNet200                      (ViT-Base/16)-ImageNet300
                   Server(%)         Client(Avg.)(%)    RU(%)     Server(%)         Client(Avg.)(%)    RU(%)
                   Top1  Top5  F1    Top1  Top5  F1               Top1  Top5  F1    Top1  Top5  F1
FA(Full)           38.7  63.6  37.3  39.2  64.1  37.1   63.5      68.6  87.2  68.1  64.4  83.2  62.8   60.0
FA+PLATON          19.3  42.7  19.1  16.5  37.5  17.4   70.4      37.6  58.2  35.5  34.5  54.2  30.1   74.3
FA+AdaViT          17.9  36.1  16.2  14.3  34.5  14.0   67.3      31.9  54.9  32.1  28.6  51.7  29.1   72.7
FA+Q-ViT           17.6  40.1  15.8  14.9  34.8  13.6   70.7      29.7  52.6  29.8  25.6  49.1  26.3   74.5
FedPM              16.2  38.7  15.5  14.2  34.0  12.9   71.8      26.4  49.3  27.2  23.7  44.8  24.6   76.2
PruneFL            21.8  42.2  19.1  20.2  39.1  17.1   69.2      42.1  65.8  41.3  37.1  61.6  36.2   79.6
RAM-Fed            23.6  46.0  21.5  19.4  41.2  19.3   69.9      43.4  66.7  42.0  36.8  59.6  35.3   80.5
FedRolex           24.3  48.0  23.1  20.7  45.8  20.5   72.5      44.2  67.5  43.4  38.4  62.1  38.6   82.4
Co-S2P             33.1  59.2  31.9  31.8  56.6  30.5   87.9      50.6  76.4  48.9  47.2  72.5  45.5   92.4
Co-S2P(Async.)     27.7  53.2  27.3  26.7  49.8  26.7  100.0      47.6  71.9  46.3  44.1  68.4  42.6  100.0

Next, by choosing an appropriate step size $\eta$, we can obtain the following corollary.

Corollary 1.

Let all assumptions hold. Suppose that the step size $\eta=O(\sqrt{\frac{N^{*}}{EQ}})$ and $\sigma$ is sufficiently small. Then there exists a constant $C>0$ such that the convergence rate can be expressed as follows:

\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(w_{q})\|^{2}\leq C\left(\frac{1}{\sqrt{N^{*}EQ}}+\frac{1}{Q}\right).
Remark 2.

Impact of the minimum participation rate $N^{*}$. Corollary 1 shows that the semi-asynchronous aggregation mechanism in $Co\text{-}S^2P$ converges at rate $O(1/\sqrt{N^{*}EQ})$, even though the aggregation is semi-asynchronous and uses personalized masks trained on local data. The result shows that the larger $N^{*}$ is, the faster the convergence is. Moreover, it is obvious that $N^{*}\leq N$. When $N^{*}=N$, all clients take part in all communication rounds, which achieves the same convergence rate $O(1/\sqrt{NEQ})$ as full-client-participation FedAvg.
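
For intuition, taking $\eta=\sqrt{N^{*}/(EQ)}$ in Theorem 1 and absorbing the constants ($L$, $E$, $N$, $|K|$, $G$, $\tau$) into $C$ gives a brief sketch of how the stated rate follows:

\frac{2\mathbb{E}[F(w_{1})]}{E\eta Q}=\frac{2\mathbb{E}[F(w_{1})]}{\sqrt{N^{*}EQ}}=O\Big(\frac{1}{\sqrt{N^{*}EQ}}\Big),\qquad L\eta^{2}E^{2}\frac{N|K|}{N^{*}}=\frac{LEN|K|}{Q}=O\Big(\frac{1}{Q}\Big),

and the $\sigma^{2}$ and $\delta^{2}$ terms scale as $O(1/Q)$ in the same way, so the bound collapses to $C(1/\sqrt{N^{*}EQ}+1/Q)$ when $\sigma$ is sufficiently small.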

6 Experiments

6.1 Experimental Settings

Figure 4: Overview of the testbed.

Testbed Implementation. To effectively demonstrate the real-world applicability of the proposed collaborative training framework, we utilize a GeForce RTX 3090 GPU as the server for aggregation and 16 Jetson developer kits of 4 types as the resource-limited clients to build a real-world testbed, as shown in Fig.4. The details are shown in Tab.5 of Appendix D.

Baselines. To fully evaluate the performance and efficiency of the proposed framework $Co\text{-}S^2P$, we choose the classical FL algorithm FedAvg [21]. Additionally, we combine FedAvg with existing standalone methods for sparsifying Transformer-like models: PLATON [39], AdaViT [22], and Q-ViT [18], denoted FedAvg+. Moreover, we evaluate $Co\text{-}S^2P$ against several FL algorithms designed for resource-limited scenarios, including FedPM [13], PruneFL [14], RAM-Fed [31], and FedRolex [1].

Backbones and datasets. We use three ViT [9] variants with different capabilities, as shown in Tab.4. We conduct experiments on ImageNet200 and ImageNet300, sampled from ImageNet1K [27], and design an evaluation metric for the real-world testbed, namely the Resource Utilization rate (RU).

We deploy $Co\text{-}S^2P$ and all baselines on the testbed based on the general FL framework NVFlare [28]. Considering that the unstructured baselines fail to work on the resource-limited clients, we replace the unstructured pruning in the baselines with structured pruning. We conduct three groups of real-world experiments; the detailed configurations are shown in Tab.6 and Tab.7. Due to limited space, please refer to Appendix D for additional details of the experimental implementation.

Table 2: Performance comparison on (ViT-Base(Ext.)/16)-ImageNet300 with parameters up to 0.11 billion. Bold is the optimal result except for FedAvg with full model training.

Methods            Server                          Client(Avg.)                    RU(%)
                   Top1(%)  Top5(%)  F1-score(%)   Top1(%)  Top5(%)  F1-score(%)
FedAvg(Full)       56.4     78.3     55.8          50.4     73.0     49.9          61.4
FedAvg+PLATON      32.3     55.3     31.6          30.7     51.8     31.6          75.1
FedAvg+AdaViT      31.4     54.3     31.9          30.8     51.4     29.3          74.4
FedAvg+Q-ViT       30.7     53.2     30.2          28.1     52.1     27.6          74.1
FedPM              27.3     52.2     26.3          25.4     49.1     24.1          73.6
PruneFL            34.3     58.4     32.8          31.0     54.4     30.4          76.9
RAM-Fed            36.4     60.8     35.5          30.4     53.1     30.1          77.5
FedRolex           38.1     63.2     37.3          35.6     58.5     34.7          78.2
Co-S2P             46.9     72.5     46.0          45.7     67.5     44.5          90.6
Co-S2P(Async.)     43.4     68.2     43.6          41.3     65.6     41.5          100.0

6.2 Overall Performance

Server's Performance. As shown in Tab.1 and Tab.2, $Co\text{-}S^2P$ outperforms all baselines in terms of global performance across the three groups of experiments: (1) $Co\text{-}S^2P$ improves the Top1 accuracy of the global model by 8.8%/6.4%/8.7% compared to the closest baseline FedRolex in the three groups of experiments respectively, demonstrating that our algorithm achieves better generalization in resource-limited scenarios. (2) For Top1 accuracy, the FedAvg+ baselines decrease significantly, by 2.0%∼14.5%, compared to the federated resource-adaptive methods, indicating that the mixed heterogeneity of collaborative training scenarios has severe implications for model training. (3) Considering Top1 accuracy, $Co\text{-}S^2P$ achieves 5.4%/3.0%/3.5% higher improvement compared to the fully asynchronous variant on the three groups of experiments respectively. This highlights that $Co\text{-}S^2P$ effectively balances resource utilization and model performance.

Figure 5: Overall performance and efficiency comparison on (ViT-Base(Ext.)/16)-ImageNet300: (a) convergence curves, (b) memory usage, (c) training time. $Co\text{-}S^2P$ outperforms all baselines in terms of overall performance, memory usage, and training time.

Clients' Average Performance. As shown in Tab.1 and Tab.2, $Co\text{-}S^2P$ outperforms all baselines in terms of the weighted average performance of the clients in the three groups of experiments: (1) For Top1 accuracy, $Co\text{-}S^2P$ achieves 11.1%/8.8%/10.1% higher personalized submodel performance compared to the closest baseline FedRolex in the three groups of experiments respectively, demonstrating that our framework obtains a better personalized mask for each client in resource-limited scenarios. (2) Comparing the gap between global and local performance, $Co\text{-}S^2P$ has a smaller gap than all baselines. This is because the trainable masks better adapt to the local data distributions and the depth-pruned submodels on resource-limited clients. (3) Considering Top1 accuracy, $Co\text{-}S^2P$ achieves 5.1%/3.1%/4.4% higher personalized model performance compared to the fully asynchronous variant on the three groups of experiments respectively, which demonstrates that $Co\text{-}S^2P$ achieves a better balance between resource utilization and personalized submodel performance.

Efficiency Comparison. As shown in Tab.1, Tab.2, and Fig.5, $Co\text{-}S^2P$ outperforms all baselines in terms of efficiency in the three groups of experiments: (1) Compared to FedRolex, $Co\text{-}S^2P$ reduces memory consumption by about 22% and training time per round by about 24% on all resource-limited devices, demonstrating the superiority of structured pruning in memory and computation efficiency. (2) For the resource utilization rate, $Co\text{-}S^2P$ achieves 1.4$\times$, 1.5$\times$, and 1.5$\times$ improvements compared to FedAvg and 1.2$\times$, 1.1$\times$, and 1.2$\times$ improvements compared to FedRolex across the three groups of experiments. This demonstrates that the proposed semi-asynchronous aggregation strategy can effectively improve resource utilization during large-scale model training.

6.3 Ablation Study

Table 3: Effectiveness of trainable mask policies and self-distillation mechanism. We use (ViT-Tiny/16)-ImageNet200 for ablation study and hyperparameter analysis.

Trainable Mask     Self-dist.   Server(%)            Client(Avg.)(%)
Linear   MSA                    Top1   Top5   F1     Top1   Top5   F1
✗        ✗         ✓            22.5   45.3   21.6   18.9   40.2   18.3
✗        ✓         ✓            26.5   51.6   26.0   24.3   47.2   25.8
✓        ✗         ✓            27.8   55.4   26.8   25.3   50.9   24.9
✓        ✓         ✗            28.2   56.1   26.9   22.3   47.5   21.6
✓        ✓         ✓            33.1   59.2   31.9   31.8   56.6   30.5

Effectiveness of trainable mask policies. We validate that the trainable mask strategy is effective in improving global performance and generating personalized masks in resource-limited scenarios. We replace the trainable masking strategies with random mask strategies and report the results in Tab.3. We observe that replacing any set of trainable masks with randomized ones leads to a performance decrease. Compared with MSA/Linear/Both random, $Co\text{-}S^2P$ improves global model Top1 accuracy by 5.3%/6.6%/10.6% and the average Top1 accuracy of the submodels on the clients by 6.5%/7.5%/12.9%. These results confirm the effectiveness of the trainable mask strategy.

Effectiveness of self-distillation mechanism. We validate that the self-distillation mechanism helps shallow submodels on resource-limited clients to learn high-level and complex knowledge. As shown in Tab.3, $Co\text{-}S^2P$ improves Top1 accuracy by 4.9% for the global model and by 9.5% for the average performance of the submodels. This demonstrates the effectiveness of self-distillation in enabling shallow models on resource-limited clients to obtain more complex and high-level knowledge.

6.4 Hyperparameter Analysis

Figure 6: Semi-asynchronous aggregation strategy with different $\mu$ and $T_{clk}$ (s).

Impact of minimum received ratio $\mu$ and waiting interval $T_{clk}$. We study the impact of $\mu$ and $T_{clk}$ on model performance and the resource utilization of the system. As shown in Fig.6, increasing $\mu$ or $T_{clk}$ improves the performance of both the global model and the submodels, but reduces the resource utilization rate because the server waits longer for the remaining clients. In addition, we provide further hyperparameter analysis of the weight term $\lambda_{1}$ of $\mathcal{L}_{n,mask}$, the balance term $\lambda_{2}$ of $\mathcal{L}_{n,weight}$, and the ratio between $R_{n}^{width}$ and $R_{n}^{depth}$ in Appendix F.2.

7 Conclusion

To release the potential of massive resource-limited nodes, we proposed a novel semi-asynchronous collaborative training framework $Co\text{-}S^2P$. We designed data distribution-aware structured pruning to ensure balanced learning capability and accelerate training. Through self-distillation, $Co\text{-}S^2P$ enables the shallow blocks to obtain high-level knowledge from the deep blocks. Furthermore, we introduced a semi-asynchronous aggregation strategy to solve the straggler problem and theoretically proved that the strategy converges at a rate of $O(1/\sqrt{N^{*}EQ})$. We conducted real-world experiments training models with up to 0.11B parameters. $Co\text{-}S^2P$ outperforms all the baselines, while reducing the memory consumption and training time of all the clients and improving the resource utilization of the system.

References

  • Alam et al. [2022] Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. Fedrolex: Model-heterogeneous federated learning with rolling sub-model extraction. Advances in Neural Information Processing Systems, 35:29677–29690, 2022.
  • Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chai et al. [2021] Zheng Chai, Yujing Chen, Ali Anwar, Liang Zhao, Yue Cheng, and Huzefa Rangwala. Fedat: A high-performance and communication-efficient federated learning system with asynchronous tiers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2021.
  • Chen et al. [2023a] Daoyuan Chen, Liuyi Yao, Dawei Gao, Bolin Ding, and Yaliang Li. Efficient personalized federated learning via sparse model-adaptation. In International Conference on Machine Learning, 2023a.
  • Chen et al. [2023b] Zheyi Chen, Pu Tian, Weixian Liao, Xuhui Chen, Guobin Xu, and Wei Yu. Resource-aware knowledge distillation for federated learning. IEEE Transactions on Emerging Topics in Computing, 2023b.
  • Coolidge [1925] Julian Lowell Coolidge. An introduction to mathematical probability. Clarendon Press, 1925.
  • Cui et al. [2022] Yuchen Cui, Scott Niekum, Abhinav Gupta, Vikash Kumar, and Aravind Rajeswaran. Can foundation models perform zero-shot task specification for robot manipulation? In Learning for Dynamics and Control Conference, pages 893–905. PMLR, 2022.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Haddadpour et al. [2021] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
  • Hsu et al. [2019] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • Ilhan et al. [2023] Fatih Ilhan, Gong Su, and Ling Liu. Scalefl: Resource-adaptive federated learning with heterogeneous clients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24532–24541, 2023.
  • Isik et al. [2022] Berivan Isik, Francesco Pase, Deniz Gunduz, Tsachy Weissman, and Michele Zorzi. Sparse random networks for communication-efficient federated learning. arXiv preprint arXiv:2209.15328, 2022.
  • Jiang et al. [2022a] Yuang Jiang, Shiqiang Wang, Victor Valls, Bong Jun Ko, Wei-Han Lee, Kin K Leung, and Leandros Tassiulas. Model pruning enables efficient federated learning on edge devices. IEEE Transactions on Neural Networks and Learning Systems, 2022a.
  • Jiang et al. [2022b] Zhida Jiang, Yang Xu, Hongli Xu, Zhiyuan Wang, and Chen Qian. Adaptive control of client selection and gradient compression for efficient federated learning. arXiv preprint arXiv:2212.09483, 2022b.
  • Jiao et al. [2023] Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745, 2023.
  • Li et al. [2021] Ang Li, Jingwei Sun, Xiao Zeng, Mi Zhang, Hai Li, and Yiran Chen. Fedmask: Joint computation and communication-efficient personalized federated learning via heterogeneous masking. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pages 42–55, 2021.
  • Li et al. [2022] Zhexin Li, Tong Yang, Peisong Wang, and Jian Cheng. Q-vit: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022.
  • Liu et al. [2020] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 133–143. PMLR, 2020.
  • Markov et al. [2023] Ilia Markov, Adrian Vladu, Qi Guo, and Dan Alistarh. Quantized distributed training of large models with convergence guarantees. arXiv preprint arXiv:2302.02390, 2023.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Meng et al. [2022] Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. Adavit: Adaptive vision transformers for efficient image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12309–12318, 2022.
  • Min et al. [2023] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40, 2023.
  • Nguyen et al. [2022] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pages 3581–3607. PMLR, 2022.
  • Ozkara et al. [2021] Kaan Ozkara, Navjot Singh, Deepesh Data, and Suhas Diggavi. Quped: Quantized personalization via distillation with applications to federated learning. Advances in Neural Information Processing Systems, 34:3622–3634, 2021.
  • Polino et al. [2018] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Sheller et al. [2019] Micah J Sheller, G Anthony Reina, Brandon Edwards, Jason Martin, and Spyridon Bakas. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part I 4, pages 92–104. Springer, 2019.
  • van Dijk et al. [2020] Marten van Dijk, Nhuong V Nguyen, Toan N Nguyen, Lam M Nguyen, Quoc Tran-Dinh, and Phuong Ha Nguyen. Asynchronous federated learning with reduced number of rounds and with differential privacy from less aggregated gaussian noise. arXiv preprint arXiv:2007.09208, 2020.
  • Voigt and Von dem Bussche [2017] Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017.
  • Wang et al. [2023] Yangyang Wang, Xiao Zhang, Mingyi Li, Tian Lan, Huashan Chen, Hui Xiong, Xiuzhen Cheng, and Dongxiao Yu. Theoretical convergence guaranteed resource-adaptive federated learning with mixed heterogeneity. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2444–2455, 2023.
  • Wu et al. [2023] Junde Wu, Rao Fu, Huihui Fang, Yuanpei Liu, Zhaowei Wang, Yanwu Xu, Yueming Jin, and Tal Arbel. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023.
  • Xie et al. [2019] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019.
  • Xie et al. [2022] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. arXiv preprint arXiv:2208.06677, 2022.
  • Xu et al. [2023a] Chenhao Xu, Youyang Qu, Yong Xiang, and Longxiang Gao. Asynchronous federated learning on heterogeneous devices: A survey. Computer Science Review, 50:100595, 2023a.
  • Xu et al. [2023b] Jian Xu, Xinyi Tong, and Shao-Lun Huang. Personalized federated learning with feature alignment and classifier collaboration. arXiv preprint arXiv:2306.11867, 2023b.
  • Yu et al. [2021] Sixing Yu, Phuong Nguyen, Ali Anwar, and Ali Jannesari. Adaptive dynamic pruning for non-iid federated learning. arXiv preprint arXiv:2106.06921, 2, 2021.
  • Zhang et al. [2019] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3713–3722, 2019.
  • Zhang et al. [2022] Qingru Zhang, Simiao Zuo, Chen Liang, Alexander Bukharin, Pengcheng He, Weizhu Chen, and Tuo Zhao. Platon: Pruning large transformer models with upper confidence bound of weight importance. In International Conference on Machine Learning, pages 26809–26823. PMLR, 2022.
  • Zhang et al. [2021] Qingsong Zhang, Bin Gu, Cheng Deng, Songxiang Gu, Liefeng Bo, Jian Pei, and Heng Huang. Asysqn: Faster vertical federated learning algorithms with better computation resource utilization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3917–3927, 2021.
  • Zhang et al. [2023] Tuo Zhang, Lei Gao, Sunwoo Lee, Mi Zhang, and Salman Avestimehr. Timelyfl: Heterogeneity-aware asynchronous federated learning with adaptive partial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5063–5072, 2023.
  • Zhou et al. [2023] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.
  • Zhou et al. [2022] Hanhan Zhou, Tian Lan, Guru Prasadh Venkataramani, and Wenbo Ding. Federated learning with online adaptive heterogeneous local models. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
  • Zhou et al. [2019] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. Advances in neural information processing systems, 32, 2019.
  • Zhu et al. [2023] Zehan Zhu, Ye Tian, Yan Huang, Jinming Xu, and Shibo He. Robust fully-asynchronous methods for distributed training over general architecture. arXiv preprint arXiv:2307.11617, 2023.

Appendix

We provide more details about our work and experiments in the appendices:

  • Appendix A: the additional details, limitations and broader impacts of the proposed framework.

  • Appendix B: the preliminary lemmas used in the theoretical analysis.

  • Appendix C: the detailed proof of the convergence analysis of our proposed semi-asynchronous aggregation strategy.

  • Appendix D: the details of the experimental settings, including baselines, datasets, backbones, and evaluation metrics, and the implementation details of the testbed.

  • Appendix E: the details used to reproduce the main experimental results.

  • Appendix F: additional experimental results, including Appendix F.1, the overall performance in the three groups of experiments; and Appendix F.2, the hyperparameter analysis of the weight term $\lambda_{1}$ of $\mathcal{L}_{n,mask}$, the balance term $\lambda_{2}$ of $\mathcal{L}_{n,weight}$, and the ratio between $R_{n}^{width}$ and $R_{n}^{depth}$.

  • Appendix G: the discussion of limitations and broader impacts of this work.

Appendix A Framework Details

Algorithm 2 $Co\text{-}S^2P$ client-training
1:  Input: dispersed dataset $\mathcal{D}_{n}$, current round $q$, the weights $w_{0,n}$ of the depth-pruned submodel, the number of rounds $\hat{Q}$, the local epochs $\hat{E}$ and learning rate $\hat{\eta}$ for training width-based masks, the local epochs $E$ and learning rate $\eta$ for training weights
2:  Initialize: the importance mask $I_{n}$
3:  if $q<\hat{Q}$ then
4:     for epoch $e=1$ to $\hat{E}$ do
5:        for segment $i\in w_{n}$ do
6:           $M_{n}^{l_{i}}\sim Bern(\frac{1}{1+e^{-I_{n}^{l_{i}}}})$
7:           Obtain $\hat{M}_{n}^{l_{i}}$ by extending $M_{n}^{l_{i}}$
8:           $\widetilde{w}_{0,n}^{l_{i}}=w_{0,n}^{l_{i}}\odot\hat{M}_{n}^{l_{i}}$ ▷ Freeze $w_{0,n}$
9:        end for
10:        $I_{n}=I_{n}-\hat{\eta}\nabla_{I_{n}}\mathcal{L}_{n,mask}(\widetilde{w}_{0,n};\mathcal{D}_{n})$
11:     end for
12:  end if
13:  $\hat{w}_{0,n}=w_{0,n}\odot\hat{M}_{n}$ ▷ Freeze $\hat{M}_{n}$
14:  for epoch $e=1$ to $E$ do
15:     $\hat{w}_{e,n}=\hat{w}_{e-1,n}-\eta\nabla_{w_{e-1,n}}\mathcal{L}_{n,weight}(\hat{w}_{e-1,n};\mathcal{D}_{n})$
16:  end for

In this section, we introduce the details of the proposed framework $Co\text{-}S^2P$. $Co\text{-}S^2P$ designs a structured mask in both the depth and width dimensions by ratio. For the heterogeneous clients with diverse limited resources and dispersed datasets, we measure the training time of a round for submodels with different capacities and obtain a proper pruned rate $R_{n}=R_{n}^{width}\times R_{n}^{depth}$ for each client according to its available resources. $R_{n}$ indicates the proportion of non-pruned parameters of the globally large model, and $R_{n}^{width}$, $R_{n}^{depth}$ denote the width pruned rate and the depth pruned rate, respectively. By constraining these two values, the submodels obtain balanced learning capacities.

However, the traditional depth pruning method has a strong assumption that some of the available clients can afford to train the full model [12], i.e., there exists at least one client $n^{*}$ with $R_{n^{*}}=1$. The models that can be trained are thus still bounded by the high-end devices. We design a novel depth pruning mechanism to break the $R_{n^{*}}=1$ constraint and ensure that all blocks can be trained evenly by all the clients. For client $n$, the server first generates the depth-pruned submodel by freezing the shallow blocks bottom-to-top and pruning the deep blocks top-to-bottom in a rolling fashion, which yields $\lfloor L\times R_{n}^{depth}\rfloor$ consecutive trainable blocks.

The resource-limited clients train the width-based structured masks based on the local data distributions and the depth-pruned submodel, as shown in Algorithm 2. Each value of the segment-wise masks indicates whether the corresponding part of the output neurons of a linear layer or a head of an MSA layer is pruned. Furthermore, the corresponding input neurons of the following layer need to be removed if the mask value is 0 in the previous layer. Considering that Transformer-like models need to perform residual connections, we utilize zero padding for the outputs of the MSA layer and the MLP without introducing extra parameters or increasing the computation. $Co\text{-}S^2P$ trains the structured-pruned submodels $\hat{\theta}_{n}$, pruned from the depth-pruned submodels based on the width-based masks. All clients train the masks and the weights of the submodels sequentially in the first $\hat{Q}$ rounds, and then freeze the masks and only train the masked submodels. Furthermore, to ensure personalization, each client maintains its classifiers and only updates them through local submodel training, rather than with updates from the server [36].
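
As a minimal sketch of this zero-padding step, assuming PyTorch (the module below is our illustration, not the authors' implementation), the outputs of the retained heads are scattered back into the full hidden dimension before the residual addition:

import torch
import torch.nn as nn

class PaddedPrunedMSAOutput(nn.Module):
    """Scatter the outputs of the retained attention heads back into the
    full hidden dimension so the residual connection keeps its shape."""
    def __init__(self, head_mask, head_dim):
        super().__init__()
        self.register_buffer("head_mask", head_mask)  # binary mask over heads, shape (h,)
        self.head_dim = head_dim

    def forward(self, kept_out, residual):
        # kept_out: (batch, tokens, n_kept_heads * head_dim) from the pruned MSA
        b, t, _ = kept_out.shape
        h = self.head_mask.numel()
        full = kept_out.new_zeros(b, t, h, self.head_dim)    # zero-padded output
        kept_idx = self.head_mask.nonzero(as_tuple=True)[0]  # indices of retained heads
        full[:, :, kept_idx, :] = kept_out.view(b, t, -1, self.head_dim)
        return residual + full.view(b, t, h * self.head_dim)  # residual connection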

Figure 7: Cross-block knowledge transfer. In structured-pruned submodel 1, the complex and high-level knowledge is transferred from block $B_{3}$ to blocks $B_{1}$ and $B_{2}$ through self-distillation.

Considering the knowledge loss on resource-limited clients, we place multiple classifiers at specific depths during masked submodel training. On clients with deep submodels, the complex and high-level knowledge is transferred from the deep blocks to the shallow blocks through the self-distillation mechanism. During the aggregation phase, the server aggregates and updates the same segments trained by different clients. Therefore, the shallow blocks of the global model obtain the complex and high-level knowledge. Subsequently, the resource-limited clients with weak computing power obtain this knowledge from the server by receiving the updated depth-pruned submodel, as shown in Fig.7. In this way, we implement cross-block knowledge transfer without extra forward overhead of distillation beyond the extra classifiers, because the blocks share feature maps.

Appendix B Preliminary Lemmas

In order to analyze the convergence rate of our proposed semi-asynchronous aggregation algorithm, we first state some preliminary lemmas:

Lemma 1.

(Jensen's inequality). For any convex function $h$ and any variables $x_{1},\dots,x_{n}$, we have

h(\frac{1}{n}\sum_{i=1}^{n}x_{i})\leq\frac{1}{n}\sum_{i=1}^{n}h(x_{i}). \qquad (11)

In particular, when $h(x)=\|x\|^{2}$, we get

\|\frac{1}{n}\sum_{i=1}^{n}x_{i}\|^{2}\leq\frac{1}{n}\sum_{i=1}^{n}\|x_{i}\|^{2}. \qquad (12)
Lemma 2.

For random variables $x_{1},\dots,x_{n}$, we have

\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]\leq n\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}]. \qquad (13)
Lemma 3.

For independent random variables $x_{1},\dots,x_{n}$ with zero mean, we have

\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]=\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}]. \qquad (14)

Appendix C Convergence Analysis of Semi-Asynchronous Aggregation Strategy

C.1 Key Lemmas

Lemma 4.

(Bounded client update). Suppose all assumptions hold. For any learning rate satisfying $\eta\leq\frac{1}{2LE}$, the client update is bounded:

\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q},n,e-1}-w_{q-\tau_{q}}\|^{2}\leq 4\eta^{2}E\sigma^{2}+16\eta^{2}E^{2}\delta^{2}+16\eta^{2}E^{2}G \qquad (15)

Proof.

\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q},n,e-1}-w_{q-\tau_{q}}\|^{2}=\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q},n,e-1}-w_{q-\tau_{q},n,0}\|^{2}
=\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|\sum_{j=0}^{e-2}-\eta\nabla F_{n}(w_{q-\tau_{q},n,j},\xi_{n,j})\odot M_{q-\tau_{q},n}\|^{2}
=\frac{\eta^{2}}{E}\sum_{e=1}^{E}\mathbb{E}\|\sum_{j=0}^{e-2}(\nabla F_{n}(w_{q-\tau_{q},n,j},\xi_{n,j})-\nabla F_{n}(w_{q-\tau_{q},n,j})+\nabla F_{n}(w_{q-\tau_{q},n,j}))\odot M_{q-\tau_{q},n}\|^{2}
\stackrel{(a)}{\leq}\frac{2\eta^{2}}{E}\sum_{e=1}^{E}\mathbb{E}\|\sum_{j=0}^{e-2}(\nabla F_{n}(w_{q-\tau_{q},n,j},\xi_{n,j})-\nabla F_{n}(w_{q-\tau_{q},n,j}))\odot M_{q-\tau_{q},n}\|^{2}
+\frac{2\eta^{2}}{E}\sum_{e=1}^{E}\mathbb{E}\|\sum_{j=0}^{e-2}\nabla F_{n}(w_{q-\tau_{q},n,j})\odot M_{q-\tau_{q},n}\|^{2}
\stackrel{(b)}{\leq}\frac{2\eta^{2}}{E}\sum_{e=1}^{E}(e-1)\sigma^{2}+\frac{2\eta^{2}}{E}\sum_{e=1}^{E}\mathbb{E}\|\sum_{j=0}^{e-2}(\nabla F_{n}(w_{q-\tau_{q},n,j})-\nabla F_{n}(w_{q-\tau_{q}})+\nabla F_{n}(w_{q-\tau_{q}}))\odot M_{q-\tau_{q},n}\|^{2}
\stackrel{(c)}{\leq}2\eta^{2}E\sigma^{2}+\frac{4\eta^{2}}{E}\sum_{e=1}^{E}(e-1)\sum_{j=0}^{e-2}\mathbb{E}\|(\nabla F_{n}(w_{q-\tau_{q},n,j})-\nabla F_{n}(w_{q-\tau_{q}}))\odot M_{q-\tau_{q},n}\|^{2}
+\frac{4\eta^{2}}{E}\sum_{e=1}^{E}(e-1)\sum_{j=0}^{e-2}\mathbb{E}\|\nabla F_{n}(w_{q-\tau_{q}})\odot M_{q-\tau_{q},n}\|^{2}
\stackrel{(d)}{\leq}2\eta^{2}E\sigma^{2}+\frac{4\eta^{2}L^{2}}{E}\sum_{e=1}^{E}(e-1)\sum_{j=0}^{e-2}\mathbb{E}\|w_{q-\tau_{q},n,j}-w_{q-\tau_{q}}\|^{2}
+4\eta^{2}E^{2}\mathbb{E}\|(\nabla F_{n}(w_{q-\tau_{q}})-\nabla F(w_{q-\tau_{q}})+\nabla F(w_{q-\tau_{q}}))\odot M_{q-\tau_{q},n}\|^{2}
\stackrel{(e)}{\leq}2\eta^{2}E\sigma^{2}+4\eta^{2}L^{2}E(E-1)\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q},n,e-1}-w_{q-\tau_{q}}\|^{2}
+8\eta^{2}E^{2}\mathbb{E}\|(\nabla F_{n}(w_{q-\tau_{q}})-\nabla F(w_{q-\tau_{q}}))\odot M_{q-\tau_{q},n}\|^{2}+8\eta^{2}E^{2}\mathbb{E}\|\nabla F(w_{q-\tau_{q}})\odot M_{q-\tau_{q},n}\|^{2}
\stackrel{(f)}{\leq}2\eta^{2}E\sigma^{2}+4\eta^{2}L^{2}E^{2}\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q}}-w_{q-\tau_{q},n,e-1}\|^{2}+8\eta^{2}E^{2}\delta^{2}+8\eta^{2}E^{2}\mathbb{E}\|\nabla F(w_{q-\tau_{q}})\odot M_{q-\tau_{q},n}\|^{2}

where (a), (c), and (e) follow from Lemma 2, (b) is from Assumption 2, (d) is from Assumption 1, and (f) is from Assumption 3.

Letting 4\eta^{2}L^{2}E^{2}\leq\frac{1}{2}, i.e., \eta\leq\frac{1}{2LE}, and rearranging, we can get:

\displaystyle\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q},n,e-1}-w_{q-\tau_{q}}\|^{2}
\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}4\eta^{2}E\sigma^{2}+16\eta^{2}E^{2}\delta^{2}+16\eta^{2}E^{2}\mathbb{E}\|\nabla F(w_{q-\tau_{q}})\odot M_{q-\tau_{q},n}\|^{2}
\displaystyle\leq 4\eta^{2}E\sigma^{2}+16\eta^{2}E^{2}\delta^{2}+16\eta^{2}E^{2}\sum_{i\in S_{q-\tau_{q}}}\mathbb{E}\|\nabla F^{i}(w_{q-\tau_{q}})\|^{2}
\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}4\eta^{2}E\sigma^{2}+16\eta^{2}E^{2}\delta^{2}+16\eta^{2}E^{2}G

where (a) is due to the fact that \eta\leq\frac{1}{2LE} and (b) is from Assumption 4.
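For readability, step (a) can be unpacked as follows. Rearranging the last display of the derivation above under 4\eta^{2}L^{2}E^{2}\leq\frac{1}{2} gives

\left(1-4\eta^{2}L^{2}E^{2}\right)\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q},n,e-1}-w_{q-\tau_{q}}\|^{2}\leq 2\eta^{2}E\sigma^{2}+8\eta^{2}E^{2}\delta^{2}+8\eta^{2}E^{2}\mathbb{E}\|\nabla F(w_{q-\tau_{q}})\odot M_{q-\tau_{q},n}\|^{2},

and since 1-4\eta^{2}L^{2}E^{2}\geq\frac{1}{2}, dividing by this factor at most doubles each constant on the right-hand side, which is exactly (a).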

Lemma 5.

(Bound on \mathcal{E}_{w_{\tau_{q}}} of the global model). Suppose all Assumptions hold; then we have:

\mathcal{E}_{w_{\tau_{q}}}=\mathbb{E}\|w_{q}-w_{q-\tau_{q}}\|^{2}\leq(\tau_{q})^{2}\eta^{2}E^{2}\frac{N}{N^{*}}|K|G (16)

Proof.

𝔼wqwqτq2=iSq𝔼wqiwqτqi2+iKSq𝔼wqiwqτqi2\displaystyle\mathbb{E}\|w_{q}-w_{q-\tau_{q}}\|^{2}=\sum_{i\in S_{q}}\mathbb{E}\|w_{q}^{i}-w_{q-\tau_{q}}^{i}\|^{2}+\sum_{i\in K-S_{q}}\mathbb{E}\|w_{q}^{i}-w_{q-\tau_{q}}^{i}\|^{2}
(a)iSqτql=0τq1𝔼wqliwq(l+1)i2+iKSqτql=0τq1𝔼wqliwq(l+1)i2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\sum_{i\in S_{q}}\tau_{q}\sum_{l=0}^{\tau_{q}-1}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}+\sum_{i\in K-S_{q}}\tau_{q}\sum_{l=0}^{\tau_{q}-1}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}
τql=0τq1iSq𝔼wqliwq(l+1)i2U1+τql=0τq1iKSq𝔼wqliwq(l+1)i2U2\displaystyle\leq\tau_{q}\sum_{l=0}^{\tau_{q}-1}\underbrace{\sum_{i\in S_{q}}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}}_{U_{1}}+\tau_{q}\sum_{l=0}^{\tau_{q}-1}\underbrace{\sum_{i\in K-S_{q}}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}}_{U_{2}}

where (a) is from the definition of \tau_{q}.

Here, we define the normalized gamma as \tilde{\gamma}_{q,n}^{i}=\frac{\gamma_{n}^{i}}{\sum_{n\in N_{q}^{i}}\gamma_{n}^{i}}.

To bound U_{1}:

iSq𝔼wqliwq(l+1)i2=iSq𝔼η(nNq(l+1)iγ~q(l+1),niΔq(l+1),ni)2\displaystyle\sum_{i\in S_{q}}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}=\sum_{i\in S_{q}}\mathbb{E}\|-\eta(\sum\limits_{n\in N_{q-(l+1)}^{i}}\tilde{\gamma}_{q-(l+1),n}^{i}\Delta_{q-(l+1),n}^{i})\|^{2}
=η2iSq𝔼nNq(l+1)iγ~q(l+1),niΔq(l+1),ni)2\displaystyle=\eta^{2}\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q-(l+1)}^{i}}\tilde{\gamma}_{q-(l+1),n}^{i}\Delta_{q-(l+1),n}^{i})\|^{2}
=η2iSq𝔼nNq(l+1)ie=1Eγ~q(l+1),niFni(wq(l+1)τq(l+1),n,e1,ξn,e1)2\displaystyle=\eta^{2}\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q-(l+1)}^{i}}\sum_{e=1}^{E}\tilde{\gamma}_{q-(l+1),n}^{i}\nabla F_{n}^{i}(w_{q-(l+1)-\tau_{q-(l+1)},n,e-1},\xi_{n,e-1})\|^{2}
(a)η2EiSqnNq(l+1)iγ~q(l+1),nie=1E𝔼Fni(wq(l+1)τq(l+1),n,e1,ξn,e1)2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\eta^{2}E\sum_{i\in S_{q}}\sum\limits_{n\in N_{q-(l+1)}^{i}}\tilde{\gamma}_{q-(l+1),n}^{i}\sum_{e=1}^{E}\mathbb{E}\|\nabla F_{n}^{i}(w_{q-(l+1)-\tau_{q-(l+1)},n,e-1},\xi_{n,e-1})\|^{2}
(c)η2EiSqnNq(l+1)i1Nq(l+1)iEG\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\eta^{2}E\sum_{i\in S_{q}}\sum\limits_{n\in N_{q-(l+1)}^{i}}\frac{1}{N_{q-(l+1)}^{i}}EG
(b)η2E2NN|Sq|G\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\eta^{2}E^{2}\frac{N}{N^{*}}|S_{q}|G

where (a) is due to Lemma 3, and (b) and (c) are from Assumption 4.

To bound U_{2}:

iKSq𝔼wqliwq(l+1)i2=iCq(l+1)𝔼wqliwq(l+1)i2+iKSqCq(l+1)𝔼wqliwq(l+1)i2\displaystyle\sum_{i\in K-S_{q}}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}=\sum_{i\in C_{q-(l+1)}}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}+\sum_{i\in K-S_{q}-C_{q-(l+1)}}\mathbb{E}\|w_{q-l}^{i}-w_{q-(l+1)}^{i}\|^{2}
=iCq(l+1)𝔼|η(nNq(l+1)iγ~q(l+1),niΔq(l+1),ni)2+0\displaystyle=\sum_{i\in C_{q-(l+1)}}\mathbb{E}|-\eta(\sum\limits_{n\in N_{q-(l+1)}^{i}}\tilde{\gamma}_{q-(l+1),n}^{i}\Delta_{q-(l+1),n}^{i})\|^{2}+\textbf{0}
=η2iCq(l+1)𝔼nNq(l+1)ie=1Eγ~q(l+1),niFni(wq(l+1)τq(l+1),n,e1,ξn,e1)2\displaystyle=\eta^{2}\sum_{i\in C_{q-(l+1)}}\mathbb{E}\|\sum\limits_{n\in N_{q-(l+1)}^{i}}\sum_{e=1}^{E}\tilde{\gamma}_{q-(l+1),n}^{i}\nabla F_{n}^{i}(w_{q-(l+1)-\tau_{q-(l+1)},n,e-1},\xi_{n,e-1})\|^{2}
(a)η2EiCq(l+1)nNq(l+1)iγ~q(l+1),nie=1E𝔼Fni(wq(l+1)τq(l+1),n,e1,ξn,e1)2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\eta^{2}E\sum_{i\in C_{q-(l+1)}}\sum\limits_{n\in N_{q-(l+1)}^{i}}\tilde{\gamma}_{q-(l+1),n}^{i}\sum_{e=1}^{E}\mathbb{E}\|\nabla F_{n}^{i}(w_{q-(l+1)-\tau_{q-(l+1)},n,e-1},\xi_{n,e-1})\|^{2}
(b)η2E2NN|Cq(l+1)|G\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\eta^{2}E^{2}\frac{N}{N^{*}}|C_{q-(l+1)}|G

where C_{q-(l+1)} is the region in K-S_{q} that has been trained in round q-(l+1), (a) is due to Lemma 3, and (b) is from Assumption 4.

Plugging U_{1} and U_{2} into \mathbb{E}\|w_{q}-w_{q-\tau_{q}}\|^{2} and using |S_{q}|+|C_{q-(l+1)}|\leq|K|, we obtain the bound on \mathcal{E}_{w_{\tau_{q}}}.
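Concretely, plugging the bounds on U_{1} and U_{2} into the decomposition above yields

\mathbb{E}\|w_{q}-w_{q-\tau_{q}}\|^{2}\leq\tau_{q}\sum_{l=0}^{\tau_{q}-1}\eta^{2}E^{2}\frac{N}{N^{*}}\left(|S_{q}|+|C_{q-(l+1)}|\right)G\leq(\tau_{q})^{2}\eta^{2}E^{2}\frac{N}{N^{*}}|K|G,

which is exactly the bound stated in (16).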

C.2 Convergence Analysis

Let us start the proof for the global model generated by the semi-asynchronous aggregation strategy from the L-Lipschitz condition:

𝔼[F(wq+1)]𝔼[F(wq)]𝔼[F(wq),wq+1wq]U1+L2𝔼wq+1wq2U2\displaystyle\mathbb{E}[F(w_{q+1})]-\mathbb{E}[F(w_{q})]\leq\underbrace{\mathbb{E}[\langle\nabla F(w_{q}),w_{q+1}-w_{q}\rangle]}_{U_{1}}+\underbrace{\frac{L}{2}\mathbb{E}\|w_{q+1}-w_{q}\|^{2}}_{U_{2}}

To bound U_{1}:

𝔼[F(wq),wq+1wq]\displaystyle\mathbb{E}[\langle\nabla F(w_{q}),w_{q+1}-w_{q}\rangle]
=iSq𝔼[Fi(wq),wq+1iwqi]+iKSq𝔼[Fi(wq),wq+1iwqi]\displaystyle=\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),w_{q+1}^{i}-w_{q}^{i}\rangle]+\sum_{i\in K-S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),w_{q+1}^{i}-w_{q}^{i}\rangle]
=iSq𝔼[Fi(wq),wq+1iwqi]+iKSq𝔼[Fi(wq),𝟎]\displaystyle=\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),w_{q+1}^{i}-w_{q}^{i}\rangle]+\sum_{i\in K-S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),\mathbf{0}\rangle]
=iSq𝔼[Fi(wq),wq+1iwqi]\displaystyle=\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),w_{q+1}^{i}-w_{q}^{i}\rangle]
=iSq𝔼[Fi(wq),η(nNqiγ~q,niΔq,ni)]\displaystyle=\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),-\eta(\sum\limits_{n\in N_{q}^{i}}\tilde{\gamma}_{q,n}^{i}\Delta_{q,n}^{i})\rangle]
=iSq𝔼[Fi(wq),η(nNqi(γ~q,ni(wqτq,n,0iwqτq,n,Ei)η))]\displaystyle=\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),-\eta(\sum\limits_{n\in N_{q}^{i}}(\tilde{\gamma}_{q,n}^{i}\frac{(w_{q-\tau_{q},n,0}^{i}-w_{q-\tau_{q},n,E}^{i})}{\eta}))\rangle]
=iSq𝔼[Fi(wq),η(nNqie=1Eγ~q,niFni(wqτq,n,e1,ξn,e1))]\displaystyle=\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),-\eta(\sum\limits_{n\in N_{q}^{i}}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1}))\rangle]
=EηiSq𝔼[Fi(wq),nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)]\displaystyle=-E\eta\sum_{i\in S_{q}}\mathbb{E}[\langle\nabla F^{i}(w_{q}),\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\rangle]
=(a)EηiSq[12𝔼Fi(wq)2+12𝔼nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}-E\eta\sum_{i\in S_{q}}[\frac{1}{2}\mathbb{E}\|\nabla F^{i}(w_{q})\|^{2}+\frac{1}{2}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}
12𝔼Fi(wq)nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2]\displaystyle-\frac{1}{2}\mathbb{E}\|\nabla F^{i}(w_{q})-\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}]
=Eη2iSq𝔼Fi(wq)2Eη2iSq𝔼nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2\displaystyle=-\frac{E\eta}{2}\sum_{i\in S_{q}}\mathbb{E}\|\nabla F^{i}(w_{q})\|^{2}-\frac{E\eta}{2}\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}
+Eη2iSq𝔼Fi(wq)nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2U3\displaystyle+\frac{E\eta}{2}\underbrace{\sum_{i\in S_{q}}\mathbb{E}\|\nabla F^{i}(w_{q})-\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}}_{U_{3}}

where (a) is from the equation \langle a,b\rangle=\frac{1}{2}(\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}) with a=\nabla F^{i}(w_{q}) and b=\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1}).

To bound U3U_{3}:

iSq𝔼Fi(wq)nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2\displaystyle\sum_{i\in S_{q}}\mathbb{E}\|\nabla F^{i}(w_{q})-\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}
=iSq𝔼nNqi1Ee=1Eγ~q,ni(Fni(wq)Fni(wqτq,n,e1,ξn,e1))2\displaystyle=\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}(\nabla F_{n}^{i}(w_{q})-\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1}))\|^{2}
=iSq𝔼nNqi1Ee=1Eγ~q,ni(Fni(wq)Fni(wqτq,n,e1)+Fni(wqτq,n,e1)Fni(wqτq,n,e1,ξn,e1))2\displaystyle=\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}(\nabla F_{n}^{i}(w_{q})-\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1})+\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1})-\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1}))\|^{2}
(a)2iSq𝔼nNqi1Ee=1Eγ~q,ni(Fni(wq)Fni(wqτq,n,e1))2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}2\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}(\nabla F_{n}^{i}(w_{q})-\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1}))\|^{2}
+2iSq𝔼nNqi1Ee=1Eγ~q,ni(Fni(wqτq,n,e1)Fni(wqτq,n,e1,ξn,e1))2\displaystyle+2\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}(\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1})-\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1}))\|^{2}
2iSqnNqi1Ee=1Eγ~q,ni𝔼Fn(wq)Fn(wqτq,n,e1)2\displaystyle\leq 2\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\mathbb{E}\|\nabla F_{n}(w_{q})-\nabla F_{n}(w_{q-\tau_{q},n,e-1})\|^{2}
+2iSqnNqi1Ee=1Eγ~q,ni𝔼Fn(wqτq,n,e1)Fn(wqτq,n,e1,ξn,e1)2\displaystyle+2\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\mathbb{E}\|\nabla F_{n}(w_{q-\tau_{q},n,e-1})-\nabla F_{n}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}
(b)2LiSqnNqi1Ee=1Eγ~q,ni𝔼wqwqτq,n,e12+2σ2\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}2L\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\mathbb{E}\|w_{q}-w_{q-\tau_{q},n,e-1}\|^{2}+2\sigma^{2}
(c)4LiSqnNqiγ~q,ni1Ee=1E𝔼wqwqτq2+4LiSqnNqiγ~q,ni1Ee=1E𝔼wqτqwqτq,n,e12+2σ2\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}4L\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\tilde{\gamma}_{q,n}^{i}\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q}-w_{q-\tau_{q}}\|^{2}+4L\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\tilde{\gamma}_{q,n}^{i}\frac{1}{E}\sum_{e=1}^{E}\mathbb{E}\|w_{q-\tau_{q}}-w_{q-\tau_{q},n,e-1}\|^{2}+2\sigma^{2}

where (a) and (c) follow from Lemma 2, and (b) is from Assumptions 1 and 2.

Using Lemma 4 and Lemma 5, we have:

iSq𝔼Fi(wq)nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2\displaystyle\sum_{i\in S_{q}}\mathbb{E}\|\nabla F^{i}(w_{q})-\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}
4LiSqnNqiγ~q,ni1Ee=1E2(τq)2η2E2NG+4LiSqnNqiγ~q,ni(4η2Eσ2+16η2E2δ2+16η2E2G)+2σ2\displaystyle\leq 4L\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\tilde{\gamma}_{q,n}^{i}\frac{1}{E}\sum_{e=1}^{E}2(\tau_{q})^{2}\eta^{2}E^{2}NG+4L\sum_{i\in S_{q}}\sum\limits_{n\in N_{q}^{i}}\tilde{\gamma}_{q,n}^{i}(4\eta^{2}E\sigma^{2}+16\eta^{2}E^{2}\delta^{2}+16\eta^{2}E^{2}G)+2\sigma^{2}
4L(τq)2η2E2(NN)2|K|2G+16Lη2ENN|K|σ2+64Lη2E2NN|K|δ2+64Lη2E2NN|K|G\displaystyle\leq 4L(\tau_{q})^{2}\eta^{2}E^{2}(\frac{N}{N^{*}})^{2}|K|^{2}G+16L\eta^{2}E\frac{N}{N^{*}}|K|\sigma^{2}+64L\eta^{2}E^{2}\frac{N}{N^{*}}|K|\delta^{2}+64L\eta^{2}E^{2}\frac{N}{N^{*}}|K|G

To bound U_{2}:

L2𝔼wq+1wq2\displaystyle\frac{L}{2}\mathbb{E}\|w_{q+1}-w_{q}\|^{2}
=L2iSq𝔼wq+1iwqi2+L2iKSq𝔼wq+1iwqi2\displaystyle=\frac{L}{2}\sum_{i\in S_{q}}\mathbb{E}\|w_{q+1}^{i}-w_{q}^{i}\|^{2}+\frac{L}{2}\sum_{i\in K-S_{q}}\mathbb{E}\|w_{q+1}^{i}-w_{q}^{i}\|^{2}
=L2iSq𝔼wq+1iwqi2\displaystyle=\frac{L}{2}\sum_{i\in S_{q}}\mathbb{E}\|w_{q+1}^{i}-w_{q}^{i}\|^{2}
=L2iSq𝔼nNqiγ~q,ni(wq,n,0wq,n,E)i2\displaystyle=\frac{L}{2}\sum_{i\in S_{q}}\mathbb{E}\|\sum_{n\in N_{q}^{i}}\tilde{\gamma}_{q,n}^{i}(w_{q,n,0}-w_{q,n,E})^{i}\|^{2}
=Lη2E22iSq𝔼nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2\displaystyle=\frac{L\eta^{2}E^{2}}{2}\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}

Combining the bounds above, we can get:

𝔼[F(wq+1)]𝔼[F(wq)]𝔼[F(wq),wq+1wq]+L2𝔼wq+1wq2\displaystyle\mathbb{E}[F(w_{q+1})]-\mathbb{E}[F(w_{q})]\leq\mathbb{E}[\langle\nabla F(w_{q}),w_{q+1}-w_{q}\rangle]+\frac{L}{2}\mathbb{E}\|w_{q+1}-w_{q}\|^{2}
=Eη2iSq𝔼Fi(wq)2+(Lη2E22Eη2)iSq𝔼nNqi1Ee=1Eγ~q,niFni(wqτq,n,e1,ξn,e1)2+Eη2(U3)\displaystyle=-\frac{E\eta}{2}\sum_{i\in S_{q}}\mathbb{E}\|\nabla F^{i}(w_{q})\|^{2}+(\frac{L\eta^{2}E^{2}}{2}-\frac{E\eta}{2})\sum_{i\in S_{q}}\mathbb{E}\|\sum\limits_{n\in N_{q}^{i}}\frac{1}{E}\sum_{e=1}^{E}\tilde{\gamma}_{q,n}^{i}\nabla F_{n}^{i}(w_{q-\tau_{q},n,e-1},\xi_{n,e-1})\|^{2}+\frac{E\eta}{2}(U_{3})
aEη2iSq𝔼Fi(wq)2+Eη2(U3)\displaystyle\stackrel{{\scriptstyle a}}{{\leq}}-\frac{E\eta}{2}\sum_{i\in S_{q}}\mathbb{E}\|\nabla F^{i}(w_{q})\|^{2}+\frac{E\eta}{2}(U_{3})

where a follows because \frac{L}{2}\eta^{2}E^{2}-\frac{E\eta}{2}<0 whenever \eta<\frac{1}{LE}.

Then, summing over q=1,\dots,Q and telescoping, we can obtain:

𝔼[F(wQ+1)]𝔼[F(w1)]=q=1Q𝔼[F(wq+1)]q=1Q𝔼[F(wq)]\displaystyle\mathbb{E}[F(w_{Q+1})]-\mathbb{E}[F(w_{1})]=\sum_{q=1}^{Q}\mathbb{E}[F(w_{q+1})]-\sum_{q=1}^{Q}\mathbb{E}[F(w_{q})]
Eη2q=1Q𝔼F(wq)2+Eη2q=1QU3\displaystyle\leq-\frac{E\eta}{2}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(w_{q})\|^{2}+\frac{E\eta}{2}\sum_{q=1}^{Q}U_{3}

Re-arranging the terms:

Eη2q=1Q𝔼F(wq)2𝔼[F(w1)]𝔼[F(wQ+1)]+Eη2q=1QU3\displaystyle\frac{E\eta}{2}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(w_{q})\|^{2}\leq\mathbb{E}[F(w_{1})]-\mathbb{E}[F(w_{Q+1})]+\frac{E\eta}{2}\sum_{q=1}^{Q}U_{3}

Letting \frac{1}{Q}\sum_{q=1}^{Q}(\tau_{q})^{2}=\tau and dividing both sides by \frac{E\eta Q}{2}, we have:

1Qq=1Q𝔼F(wq)22𝔼[F(w1)]EηQ+4Lη2E2N|K|NG(τN|K|N+16)+16Lη2ENN|K|σ2+64Lη2E2NN|K|δ2\displaystyle\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(w_{q})\|^{2}\leq\frac{2\mathbb{E}[F(w_{1})]}{E\eta Q}+4L\eta^{2}E^{2}\frac{N|K|}{N^{*}}G(\tau\frac{N|K|}{N^{*}}+16)+16L\eta^{2}E\frac{N}{N^{*}}|K|\sigma^{2}+64L\eta^{2}E^{2}\frac{N}{N^{*}}|K|\delta^{2}

Supposing that the step size \eta=O(\sqrt{\frac{N^{*}}{EQ}}) and that \sigma is sufficiently small, there exists a constant C>0 such that the convergence rate can be expressed as follows:

\displaystyle\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(w_{q})\|^{2}\leq C\left(\frac{1}{\sqrt{N^{*}EQ}}+\frac{1}{Q}\right)
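For completeness, the orders of the individual terms under this step size (treating E, N, |K|, G, \sigma, and \delta as bounded constants) are

\frac{2\mathbb{E}[F(w_{1})]}{E\eta Q}=O\left(\frac{1}{\sqrt{N^{*}EQ}}\right),\qquad\eta^{2}E^{2}=\frac{N^{*}E}{Q}=O\left(\frac{1}{Q}\right),\qquad\eta^{2}E=\frac{N^{*}}{Q}=O\left(\frac{1}{Q}\right),

so every remaining term on the right-hand side is absorbed into C\left(\frac{1}{\sqrt{N^{*}EQ}}+\frac{1}{Q}\right).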

Appendix D Experimental Details

Baselines. In this work, we consider the following baselines, including classical FL, combinations of FedAvg with state-of-the-art standalone training methods for sparsifying ViT, and resource-limited collaborative training methods:

  • FedAvg[21] is the most classical FL algorithm, but it is not applicable to resource-constrained scenarios. The devices send updated parameters to the server and download the aggregated global model for continued local training.

  • FedAvg+PLATON[39] is based on FedAvg using PLATON for model pruning on each client. PLATON captures the uncertainty of importance scores through an upper confidence bound on the importance estimation, which in turn determines the local model.

  • FedAvg+AdaViT[22] is based on FedAvg using AdaViT on each client. AdaViT is an adaptive learning framework that learns which patches, heads, and blocks to prune throughout the transformer backbone.

  • FedAvg+Q-ViT[18] is based on FedAvg using Q-ViT on each client. Q-ViT is a fully differentiable quantization method that learns the head-wise bit-width and a switchable scale.

  • PruneFL[14] is a two-stage distributed pruning approach with initial pruning and adaptive iterative pruning, which adapts the model size to minimize the overall training time.

  • RAM-Fed[31] is a general resource-adaptive FL framework with an arbitrary neuron assignment scheme and a server aggregation strategy.

  • FedPM[13] is a communication-efficient FL framework where the clients only train the binary masks and the server uses a Bayesian strategy to aggregate the masks from the clients.

  • FedRolex[1] is a structured partial training approach with a rolling submodel extraction scheme that allows different parts of the global model to be evenly trained.

Datasets. We conduct experiments based on ImageNet1K [27], which is widely used in large-scale model training and is available at https://www.image-net.org/challenges/LSVRC/2012/2012-downloads.php. We randomly sample from it to obtain ImageNet200 with 200 classes and ImageNet300 with 300 classes, respectively. For each dataset, we divide it into a test set held on the server and a dataset distributed across all clients. We further divide the latter into the dispersed client datasets using Dirichlet allocation [11] with parameter \alpha=1.5. Furthermore, different types of clients hold datasets of different sizes, as shown in Tab.7. For the dispersed dataset in each client, we randomly split it into train/test sets with a ratio of 8:2.
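As a reference, a minimal sketch of this Dirichlet-based partitioning (assuming NumPy and integer class labels; the function and variable names are ours, not from the released code) is:

import numpy as np

def dirichlet_partition(labels, num_clients, alpha=1.5, seed=0):
    """Assign sample indices to clients using a per-class Dirichlet(alpha) prior."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw per-client proportions for this class, then split its indices accordingly.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(chunk.tolist())
    return client_indices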

Backbone. We use three ViT [9] variants with different capabilities, as shown in Tab.4.

Table 4: Details of three variants of ViT model

Model Depth Hidden size MLP size Heads Params
ViT-Tiny 8 512 2048 8 26M
ViT-Base 12 768 3072 12 86M
ViT-Base(Ext.) 16 768 3072 12 0.11B
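For reference, the three variants in Tab.4 can be instantiated, for example, with the timm implementation of ViT; the patch size and image resolution below are illustrative assumptions rather than values reported here:

from timm.models.vision_transformer import VisionTransformer

VIT_VARIANTS = {
    "ViT-Tiny":       dict(embed_dim=512, depth=8,  num_heads=8),   # ~26M parameters
    "ViT-Base":       dict(embed_dim=768, depth=12, num_heads=12),  # ~86M parameters
    "ViT-Base(Ext.)": dict(embed_dim=768, depth=16, num_heads=12),  # ~0.11B parameters
}

def build_vit(name, num_classes):
    """Build one of the Tab.4 variants; the MLP size is 4x the hidden size in all cases."""
    return VisionTransformer(img_size=224, patch_size=16, mlp_ratio=4.0,
                             num_classes=num_classes, **VIT_VARIANTS[name])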

Evaluation Metrics. We evaluate the global model on the test set on the server and report the Top1 accuracy, Top5 accuracy, and F1-score. We also evaluate the personalized masked submodel on the test set of each client and report the average of the three values obtained, weighted by the local dataset sizes. Furthermore, we introduce an evaluation metric for the real-world testbed, namely the Resource Utilization Rate (RU), calculated as shown below:

RU=\frac{1}{Q}\sum_{q=1}^{Q}{\frac{\sum_{n=1}^{N_{q}}{Time(n)}}{N_{q}\times\max(\{Time(1),\cdots,Time(N_{q})\})}}, (17)

where N_{q} denotes the number of clients participating in the q-th round and Time(n) represents the sum of the training and communication time of client n in the q-th round.
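Eq. (17) translates directly into code; a minimal sketch (assuming per-round lists of per-client times, in seconds) is:

def resource_utilization(round_times):
    """round_times[q] holds the per-client training + communication times (seconds)
    of round q; returns RU as defined in Eq. (17)."""
    per_round = [sum(times) / (len(times) * max(times)) for times in round_times]
    return sum(per_round) / len(per_round)

# e.g. two rounds with three participating clients each:
# resource_utilization([[50.0, 55.0, 60.0], [40.0, 80.0, 80.0]])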

Testbed Implementation Details. To effectively demonstrate the real-world applicability of the proposed collaborative training framework, we build a real-world testbed, where the server is a GeForce RTX 3090 GPU with 24GB memory and the resource-limited clients consist of 16 Jetson developer kits of 4 types, as shown in Fig.4. The details of the server and clients are shown in Tab.5. Based on this, we conduct three groups of real-world experiments with different model variants and datasets, whose detailed configurations are shown in Tab.6 and Tab.7.

Table 5: Detailed resources of the diverse client devices

Type Memory GPU AI performance Power Mode
Jetson Xavier NX 8GB 384-core NVIDIA Volta™ architecture GPU with 48 Tensor Cores 21TOPS 15W
Jetson Xavier NX 16GB 16GB 384-core NVIDIA Volta™ architecture GPU with 48 Tensor Cores 21TOPS 20W
Jetson AGX Orin 32GB 32GB 1792-core NVIDIA Ampere architecture GPU with 56 Tensor Cores 200TOPS 50W
Jetson AGX Orin 64GB 64GB 2048-core NVIDIA Ampere architecture GPU with 64 Tensor Cores 275TOPS 60W
Table 6: Details of the three groups of experiments

Exp ID Model Datasets Clients (Jetson Xavier NX / Jetson Xavier NX 16GB / Jetson AGX 32GB / Jetson AGX 64GB) Server
E1 ViT-Tiny ImageNet200 1 0 3 4 NVIDIA GeForce RTX 3090
E2 ViT-Base ImageNet300 1 2 3 10 NVIDIA GeForce RTX 3090
E3 ViT-Base(Ext.) ImageNet300 1 2 3 10 NVIDIA GeForce RTX 3090
Table 7: Detailed implementation of the real-world experiments. ’-’ means the experiment group does not involve that type of client.

Clients Type E1 (Dataset Size Avg. / Pruned Rate / Batch Size) E2 (Dataset Size Avg. / Pruned Rate / Batch Size) E3 (Dataset Size Avg. / Pruned Rate / Batch Size)
Jetson Xavier NX 8277 0.0625 8 6285 0.0625 8 4737 0.015625 8
Jetson Xavier NX 16GB - - - 11770 0.0625 32 9476 0.0625 32
Jetson AGX 32GB 24960 0.5625 64 17502 0.0625 64 18932 0.25 64
Jetson AGX 64GB 36992 0.5625 128 28047 0.5625 128 29372 0.5625 128

Appendix E Reproduction Details

In this section, we provide the details for reproducing the three groups of experimental results.

For Experiment 1, we optimize the local submodels for 120 rounds using an early-stopping strategy and the Adan optimizer [34]; the initial learning rate \eta is 0.005 and the number of local epochs E is 5. We set the rounds, the local epochs \hat{E}, and the initial learning rate \hat{\eta} for training the width-based structured masks to 20, 5, and 0.01, respectively. For local training on the clients, we set the hyperparameters \lambda_{1}, \lambda_{2}, and t to 1.0, 0.2, and 3. In the semi-asynchronous aggregation strategy, the aggregation minimal ratio \mu and the waiting interval T_{clk} are set to 0.5 and 60 seconds, respectively.
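For convenience, these Experiment 1 settings can be collected into a single configuration object; the key names below are illustrative (our own), not identifiers from the released code:

EXP1_CONFIG = {
    "rounds": 120,                 # local submodel optimization rounds (with early stopping)
    "optimizer": "Adan",
    "lr": 5e-3,                    # initial learning rate eta
    "local_epochs": 5,             # E
    "mask_rounds": 20,             # rounds for the width-based structured masks
    "mask_local_epochs": 5,        # E_hat
    "mask_lr": 1e-2,               # eta_hat
    "lambda1": 1.0,
    "lambda2": 0.2,
    "t": 3,
    "aggregation_min_ratio": 0.5,  # mu
    "waiting_interval_s": 60,      # T_clk (seconds)
}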

For Experiment 2, we use the pretrained Vision Transformer model to initialize the backbone and optimize the local submodels for 100 rounds using an early-stopping strategy and the Adan optimizer; the initial learning rate \eta is 0.00025 and the number of local epochs E is 5. We set the rounds, the local epochs \hat{E}, and the initial learning rate \hat{\eta} for training the width-based structured masks to 10, 5, and 0.001, respectively. For local training on the clients, we set the hyperparameters \lambda_{1}, \lambda_{2}, and t to 1.0, 0.3, and 3. In the semi-asynchronous aggregation strategy, the aggregation minimal ratio \mu and the waiting interval T_{clk} are set to 0.4 and 100 seconds, respectively.

For Experiment 3, we use the pretrained ViT model to initialize the first 12 layers of the backbone, and we initialize the last 4 layers with the last 4 layers of the pretrained model. We optimize the local submodels for 100 rounds using an early-stopping strategy and the Adan optimizer; the initial learning rate \eta is 0.00025 and the number of local epochs E is 5. We set the rounds, the local epochs \hat{E}, and the initial learning rate \hat{\eta} for training the width-based structured masks to 10, 5, and 0.001, respectively. For local training on the clients, we set the hyperparameters \lambda_{1}, \lambda_{2}, and t to 1.5, 0.3, and 3. In the semi-asynchronous aggregation strategy, the aggregation minimal ratio \mu and the waiting interval T_{clk} are set to 0.4 and 150 seconds, respectively.
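Across the three experiments, the roles of \mu and T_{clk} can be illustrated with the following Python sketch of the server-side trigger. This is a simplified illustration under our reading of the two hyperparameters (aggregate once at least a \mu-fraction of client updates has arrived, or once the waiting interval has elapsed); it is not the released implementation, and the queue-based plumbing is a placeholder.

import queue
import time

def collect_round_updates(update_queue, num_clients, mu=0.4, t_clk=100.0, poll_s=0.5):
    """Block until either (i) at least mu * num_clients updates have arrived or
    (ii) t_clk seconds have passed since the round started, then return the
    updates received so far."""
    received = []
    round_start = time.time()
    while True:
        try:
            received.append(update_queue.get(timeout=poll_s))
        except queue.Empty:
            pass
        enough = len(received) >= mu * num_clients
        timed_out = (time.time() - round_start) >= t_clk
        if enough or timed_out:
            return received

# e.g. updates = collect_round_updates(update_queue, num_clients=16, mu=0.4, t_clk=100.0)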

Additionally, we set the width pruned rate R_{n}^{width} equal to the depth pruned rate R_{n}^{depth}, both equal to \sqrt{R_{n}}. This setting ensures that the submodels have a balanced learning capacity along both the width and depth dimensions. Furthermore, we provide an extensive analysis of the effect of the ratio between R_{n}^{width} and R_{n}^{depth} in Appendix F.2.
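As a small illustration (assuming the per-client rate R_n listed in Tab.7 is the overall rate being split across the two dimensions), the relation can be written as:

import math

def split_rate(r_n):
    """Split the overall rate R_n equally over width and depth:
    R_n^width = R_n^depth = sqrt(R_n)."""
    r = math.sqrt(r_n)
    return r, r  # (R_n^width, R_n^depth)

# e.g. split_rate(0.5625) -> (0.75, 0.75) and split_rate(0.0625) -> (0.25, 0.25)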

Appendix F Additional Experiments

Figure 8: Convergence curves of FedAvg, all baselines, and our algorithm in the three groups of experiments: (a) ViT-Tiny/ImageNet200, (b) ViT-Base/ImageNet300, (c) ViT-Base(Ext.)/ImageNet300. {Co\text{-}S}^{2}{P} achieves better performance and comparable or even faster convergence than the baselines.

Figure 9: Comparison of model memory usage: (a) ViT-Tiny/ImageNet200, (b) ViT-Base/ImageNet300, (c) ViT-Base(Ext.)/ImageNet300. The red horizontal line represents the memory usage of the global model; we use FedRolex, the baseline with the closest performance, for comparison.

Figure 10: Comparison of training time per round: (a) ViT-Tiny/ImageNet200, (b) ViT-Base/ImageNet300, (c) ViT-Base(Ext.)/ImageNet300. We use FedRolex, the baseline with the closest performance, for comparison.

F.1 Overall Performance

Server Performance. We make four observations from Tab.1 and Tab.2 corresponding to the three groups of experiments:

  • For Top1 accuracy, {Co\text{-}S}^{2}{P} achieves 8.8%/6.4%/8.7% higher global model performance compared to the closest baseline FedRolex on the three groups of experiments, respectively. For Top5 accuracy, {Co\text{-}S}^{2}{P} improves by 11.2%/8.9%/9.3%, respectively. For F1-score, {Co\text{-}S}^{2}{P} improves by 8.8%/5.5%/8.7%, respectively. These results demonstrate that our algorithm obtains a better-generalized global model in resource-limited scenarios.

  • Without loading the pretrained model, {Co\text{-}S}^{2}{P} gets close to the accuracy of FedAvg, which demonstrates that our method is effective in mitigating the impact of data heterogeneity.

  • For Top1 accuracy, the FedAvg+ baselines decrease significantly, by 2.0\%\sim 14.5\%, compared to the federated resource-adaptive methods. For Top5 accuracy, the FedAvg+ baselines decrease by 0\%\sim 14.9\%. For F1-score, they decrease by 0\%\sim 14.6\%. These results demonstrate that the mixed heterogeneity of federated scenarios has severe implications for large-scale model training.

  • Considering Top1 accuracy, {Co\text{-}S}^{2}{P} achieves 5.4%/3.0%/3.5% higher global model performance compared to the fully asynchronous methods on the three groups of experiments, respectively, which demonstrates that {Co\text{-}S}^{2}{P} achieves a better balance between resource utilization and global model performance.

Clients’ Average Performance. We make four observations from Tab.1 and Tab.2 corresponding to the three groups of experiments:

  • For Top1 accuracy, {Co\text{-}S}^{2}{P} achieves 11.1%/8.8%/10.1% higher personalized submodel performance compared to the closest approach on the three groups of experiments, respectively. For Top5 accuracy, {Co\text{-}S}^{2}{P} improves by 10.6%/10.4%/9.0%, respectively. For F1-score, {Co\text{-}S}^{2}{P} improves by 10.5%/6.9%/9.8%, respectively. These results demonstrate that our framework obtains a better personalized mask for each client in resource-limited scenarios.

  • Comparing the difference between global and local performance, {Co\text{-}S}^{2}{P} has a smaller performance gap than the federated resource-adaptive methods, demonstrating that the trainable masks better adapt to the local data distributions in resource-limited scenarios.

  • Comparing the difference between global and local performance, the FedAvg+ baselines have smaller performance gaps than the federated resource-adaptive methods, demonstrating that the former baselines primarily consider the local data distributions.

  • Considering Top1 accuracy, {Co\text{-}S}^{2}{P} achieves 5.1%/3.1%/4.4% higher personalized submodel performance compared to the fully asynchronous methods on the three groups of experiments, respectively, which demonstrates that {Co\text{-}S}^{2}{P} achieves a better balance between resource utilization and personalized submodel performance.

Efficiency Comparison. We make two observations from Tab.1, Tab.2, Fig.9, and Fig.10 corresponding to the three groups of experiments:

  • For the resource utilization rate, {Co\text{-}S}^{2}{P} achieves 1.4×, 1.5×, and 1.5× higher utilization than FedAvg and 1.2×, 1.1×, and 1.2× higher than FedRolex on the three groups of experiments, respectively, which demonstrates that the proposed semi-asynchronous aggregation strategy can effectively improve resource utilization during large-scale model training.

  • Comparing our approach with FedRolex, the baseline with the closest performance, reveals that we reduce memory consumption by about 22% and training time per round by about 24% on all resource-limited devices.

F.2 Hyperparameter Analysis Details

Impact of the weight term \lambda_{1} that constrains submodel capacity. We study the impact of \lambda_{1} on the personalized mask, as shown in Fig.11. We observe that a larger \lambda_{1} enforces more rigorous sparsity on the masked submodel, leading to faster convergence of the submodel. Furthermore, we observe that a smaller \lambda_{1} does not imply higher performance, because a too-small \lambda_{1} prevents the client from obtaining a personalized mask adapted to its local data distribution. Moreover, the performance differences among the clients are more significant, demonstrating that the impact of \lambda_{1} on the masked submodels is even greater.

Impact of the balance term \lambda_{2}. Here we study the impact of \lambda_{2} on the performance of the global model and the submodels. As shown in Fig.12, with increasing \lambda_{2}, the effect of self-distillation on submodel training becomes more important and the global model converges more rapidly. However, too much reliance on deep-block knowledge hurts the performance of the global model. Moreover, we observe that for client performance, \lambda_{2}=0.1 and \lambda_{2}=0.3 show opposite comparisons. This is because the resource-constrained clients are able to learn more complex and higher-level knowledge through knowledge transfer.

Impact of the ratio of R_{n}^{width} to R_{n}^{depth}. We study the impact of different ratios of R_{n}^{width} to R_{n}^{depth} on the performance of the global model and the submodels. The ratio indicates the proportion of non-pruned parameters of the submodels along the width and depth dimensions. As shown in Tab.8, R_{n}^{width}/R_{n}^{depth}=1 outperforms the other ratios, because this setting gives each submodel a balanced learning capacity.

Figure 11: Impact of \lambda_{1} on submodel training: (a) test on the server; (b) test on the clients (avg.).
Figure 12: Impact of \lambda_{2} within the weight-training stage: (a) test on the server; (b) test on the clients (avg.).
Table 8: Impact of different ratios of R_{n}^{width} to R_{n}^{depth}.

{R_{n}^{width}}/{R_{n}^{depth}} Server Client(Avg.)
Top1(%) Top5(%) F1(%) Top1(%) Top5(%) F1(%)
0.5 28.4 52.9 27.0 24.8 49.7 24.7
2.0 29.7 54.6 28.3 26.9 50.5 25.3
1.0 33.1 59.2 31.9 31.8 56.6 30.5

Appendix G Limitation and Broader Impacts

In this section, we discuss the limitations and broader impacts of our work.

Limitations. This work proposes a semi-asynchronous collaborative training framework and preliminarily validates its scalability with various experimental groups. However, due to funding and time constraints, we were unable to obtain additional Jetson devices for collaborative training, which limited the scale of large models we could experiment with.

Broader Impacts. Our work on leveraging weak computing power for collaborative learning has the potential to significantly transform the landscape of distributed learning, particularly in resource-constrained environments. This advancement can allow organizations and individuals with limited computational resources to participate in and benefit from cutting-edge AI developments. The smaller institutions and research groups can engage in large-scale model training without the need for expensive hardware investments. Furthermore, the reduction in memory consumption and training time achieved by Co-S2P{Co\text{-}S}^{2}{P} contributes to lower energy consumption and operational costs, promoting sustainable computing practices. This is particularly important in the context of environmental concerns and the growing demand for green technology solutions. Overall, our work paves the way for more inclusive and sustainable AI development, fostering innovation and collaboration across various communities with otherwise limited computational resources.