
Selective Knowledge Sharing for Personalized Federated Learning Under Capacity Heterogeneity

Zheng Wang^{1,2}  Zheng Wang^{2}  Zhaopeng Peng^{1,2}  Zihui Wang^{1,2}  Cheng Wang^{1,2}
1Fujian Key Laboratory of Sensing and Computing for Smart Cities
2School of Informatics, Xiamen University, China
{zwang, zhengwang, pengzhaopeng, wangziwei}@stu.xmu.edu.cn
[email protected]
Corresponding Author
Abstract

Federated Learning (FL) stands to gain significant advantages from collaboratively training capacity-heterogeneous models, enabling the utilization of private data and computing power from low-capacity devices. However, the focus on personalizing capacity-heterogeneous models based on client-specific data has been limited, resulting in suboptimal local model utility, particularly for low-capacity clients. The heterogeneity in both data and device capacity poses two key challenges for model personalization: 1) accurately retaining necessary knowledge embedded within reduced submodels for each client, and 2) effectively sharing knowledge through aggregating size-varying parameters. To this end, we introduce Pa3dFL, a novel framework designed to enhance local model performance by decoupling and selectively sharing knowledge among capacity-heterogeneous models. First, we decompose each layer of the model into general and personal parameters. Then, we maintain uniform sizes for the general parameters across clients and aggregate them through direct averaging. Subsequently, we employ a hyper-network to generate size-varying personal parameters for clients using learnable embeddings. Finally, we facilitate the implicit aggregation of personal parameters by aggregating client embeddings through a self-attention module. We conducted extensive experiments on three datasets to evaluate the effectiveness of Pa3dFL. Our findings indicate that Pa3dFL consistently outperforms baseline methods across various heterogeneity settings. Moreover, Pa3dFL demonstrates competitive communication and computation efficiency compared to baseline approaches, highlighting its practicality and adaptability in adverse system conditions.

1 Introduction


Figure 1: $Client_i$'s pruned submodel is smaller than $Client_j$'s because of its relatively limited capacity (e.g., a mobile phone versus a computer). Assuming that knowledge embedded in the model is intricately interwoven across channels within each layer, we focus on how model pruning affects knowledge retention at each layer. (a) Direct channel pruning potentially loses necessary general/local knowledge from all channels and preserves irrelevant knowledge from other clients. (b) Personalized channel pruning loses general knowledge from unseen channels and can preserve all necessary local knowledge, possibly together with irrelevant knowledge from overlapping channels. (c) Pruning after decomposition (ours) fully maintains general knowledge through complete sharing of general parameters, while local knowledge varies with the pruned personal parameters.

Federated learning (FL) leverages private users’ data on edge devices, e.g., mobile phones and smartwatches, to collaboratively train a global model without direct access to raw data [1]. In practice, the clients’ device capacities (i.e., hard disk or memory size) are usually heterogeneous. Clients with limited device capacity cannot directly train a large model. Therefore, traditional FL is hindered by either limited model sizes or the absence of data from low-capacity clients [2]. To mitigate this issue, recent works [3, 4, 5, 6, 7, 8, 9, 10, 11] enable collaboratively training a large model by reducing it into small trainable submodels. While these works succeed in training capacity-heterogeneous models, the inference performance on each client can still be hindered by both capacity and data heterogeneity. On the one hand, the model utility drops quickly as the number of model parameters decreases [2]. On the other hand, the pruned submodels may not suit client-specific data distributions [10]. These two issues motivate us to consider the following problem:

How to train personalized FL models simultaneously based on clients’ capacity and data?

There have been plenty of works addressing model personalization based on client-specific data [12, 13], most of which require a uniform model size to be shared across clients. Personalization methods based on knowledge distillation (KD) support the training of heterogeneous models, but they fail to leverage the consistency in model parameters and architectures. The most recent steps towards this problem are TailorFL [10] and pFedGate [11], which drop excessive channels from the global model based on clients’ data and capacity. However, they may still lead to suboptimal local performance, since knowledge necessary for clients cannot be well retained when model parameters are pruned, as shown in Fig. 1(a) and (b) (here, general knowledge refers to knowledge suitable for all clients, while local knowledge is only necessary for some clients). Such undesired knowledge retention harms FL effectiveness and efficiency in two ways: 1) the inconsistency between the knowledge embedded in submodels and the local data increases the difficulty of local model training, and 2) different channels may redundantly learn similar knowledge that cannot be aggregated within the same layer [7, 8, 11, 10, 2]. Moreover, traditional position-wise parameter averaging [11, 10] may further lower the aggregation efficiency due to unmatched fusion of knowledge independently learned by different clients (e.g., fusing global and local knowledge) [14, 15]. These issues induce two key challenges for capacity-heterogeneous model personalization:

Challenge 1.

Accurately retaining necessary knowledge when pruning the model for each client.

Challenge 2.

Effectively aggregating knowledge from capacity-heterogeneous models across clients.

Motivated by recent advancements in parameter decomposition techniques [2, 16, 17], we introduce a novel framework, termed Pa3dFL (Pruning and aggregation after decomposition FL), aimed at tackling the challenges outlined above. To address the first challenge, we decompose each layer’s model parameters into general and personal components as depicted in Fig.1(c). Subsequently, we trim the model size by only pruning personal parameters while preserving the consistency of global knowledge embedded in general parameters across all clients. In this way, only client-specific and general knowledge will be allocated to each client within the submodel. To address the second challenge, we directly average general parameters with constant shapes to foster comprehensive general knowledge sharing. Next, we adopt a hyper-network to generate personal parameters from learnable client embeddings. Finally, we facilitate the implicit aggregation of personal parameters by aggregating client embeddings via a self-attention module, which adaptively enhances personal knowledge sharing among similar clients.

Our contributions are summarized as follows:

  • We identify two key challenges (i.e., Challenges 1 and 2) in model personalization under data and capacity heterogeneity, which are commonly overlooked by previous works.

  • We propose an efficient personalized FL framework to enhance local model performance by decoupling and selectively sharing knowledge among capacity-heterogeneous models.

  • We conduct extensive experiments under various heterogeneity settings to show the advantage of our proposed methods in terms of accuracy and efficiency.

2 Problem Formulation

In personalized FL, there exist $N$ clients equipped with their private data $\{\mathcal{D}_{i}\,|\,i\in[N]\}$, and the goal is to obtain a series of models $\mathcal{M}_{\Theta}=\{\mathcal{M}_{\theta_{1}},\mathcal{M}_{\theta_{2}},\dots,\mathcal{M}_{\theta_{N}}\}$ that minimize

\min_{\Theta}f(\Theta)\coloneqq\frac{1}{N}\sum_{i\in[N]}{\mathbb{E}_{(x,y)\sim\mathcal{D}_{i}}\left[f(\mathcal{M}_{\theta_{i}};x,y)\right]}   (1)

where $f(\mathcal{M}_{\theta};x,y)=\ell(\mathcal{M}_{\theta}(x),y)$ and $\ell(\cdot)$ is the loss function. In a capacity-heterogeneous setting, the constraints on both run-time peak memory occupation and model size may differ across clients, and the local model's capability is ultimately limited by the weaker of the memory and storage terms. Following [11], we denote the full-model size by $s_{max}=|\mathcal{M}_{\theta_{all}}|$ and assume that each client can train a submodel of reduced size ratio $r_{i}\leq\frac{s_{i}}{s_{max}}$. Consequently, we formulate the capacity constraint as

|\mathcal{M}_{\theta_{i}}|\leq r_{i}s_{max},\ \forall i\in[N]   (2)

under which the PFL objective (1) should be solved.
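
As a minimal illustration of the constraint in Eq. (2), a server can check whether an assignment of submodels respects every client's capacity budget; the function name and data layout below are ours, not part of Pa3dFL.

```python
# Hedged sketch: verify Eq. (2), i.e., |M_{theta_i}| <= r_i * s_max for every client.
def satisfies_capacity(submodel_sizes, capacity_ratios, s_max):
    """submodel_sizes[i]: parameter count of client i's submodel; capacity_ratios[i]: r_i."""
    return all(size <= r * s_max for size, r in zip(submodel_sizes, capacity_ratios))
```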

3 Methodology

3.1 Channel-aware Layer Decomposition

We start from the decomposition of the convolution weight introduced by FLANC [2], which can also be extended to fully connected layers by setting the kernel size to $1$. Given a convolution weight tensor $\boldsymbol{\theta}\in\mathbb{R}^{T\times S\times k\times k}$ (where $T$ is the number of output channels, $S$ the number of input channels, and $k$ the kernel size), the decomposition of FLANC consists of a general part $\boldsymbol{\theta_{u}}\in\mathbb{R}^{k^{2}R_{1}\times R_{2}}$ and a capacity-dependent part $\boldsymbol{\theta_{v}}\in\mathbb{R}^{R_{2}\times\frac{S}{R_{1}}T}$ such that

\boldsymbol{\theta}^{\prime}=\boldsymbol{\theta_{u}}\boldsymbol{\theta_{v}}=\left[\underbrace{\boldsymbol{\theta_{u}}\boldsymbol{\theta}_{\mathbb{v},1}}_{\boldsymbol{\theta}^{\prime}_{1}},\cdots,\underbrace{\boldsymbol{\theta_{u}}\boldsymbol{\theta}_{\mathbb{v},\frac{ST}{R_{1}}}}_{\boldsymbol{\theta}^{\prime}_{\frac{ST}{R_{1}}}}\right]=\left[\cdots,\underbrace{\boldsymbol{\theta}^{\prime}_{\frac{S(i-1)}{R_{1}}+1:\frac{Si}{R_{1}}}}_{\text{Output Channel }o_{i},\,i=1\text{ to }T},\cdots\right],\quad\boldsymbol{\theta}=\text{Reshape}_{\text{FLANC}}(\boldsymbol{\theta}^{\prime})   (3)

where $R_{1},R_{2}$ are manually predefined coefficients of the decomposition and $\boldsymbol{\theta}$ is obtained by reshaping the parameters according to the channels in $\boldsymbol{\theta}^{\prime}$. FLANC reduces the model size by only reducing the columns of $\boldsymbol{\theta_{v}}$ to $S_{p}T_{p}/R_{1}$ (where $0<p\leq 1$ is the reduction ratio of the model width) while keeping $\boldsymbol{\theta_{u}}$ consistent across clients. This reveals an outstanding property: the general knowledge learned by $\boldsymbol{\theta_{u}}$ can be propagated to all the reduced models, giving all capacity-heterogeneous devices access to the full range of knowledge. Motivated by FLANC, we employ a similar decomposition scheme in which the capacity-dependent parameters are kept personal instead of being shared among devices with the same capacity. However, we notice that the general parameters $\boldsymbol{\theta_{u}}$ in FLANC can hardly encode channel-specific knowledge, since $\boldsymbol{\theta_{u}}$ contributes equally to every output channel as the basis in Eq. (3). This may limit model performance, especially when output channels sensitive to certain signals [18] (e.g., low-level circles or high-level eyes) are indispensable across all clients’ local tasks. To mitigate this issue, we propose to decompose $\boldsymbol{\theta}$ in a channel-aware manner:

\boldsymbol{\theta}^{\prime}=\underbrace{\left[\begin{matrix}\boldsymbol{u}_{1}\\ \boldsymbol{u}_{2}\\ \vdots\\ \boldsymbol{u}_{R_{1}}\end{matrix}\right]}_{\boldsymbol{\theta_{u}}\in\mathbb{R}^{k^{2}R_{1}\times R_{2}}}\underbrace{\left[\boldsymbol{v}_{1},\cdots,\boldsymbol{v}_{\frac{T}{R_{1}}}\right]}_{\boldsymbol{\theta_{v}}\in\mathbb{R}^{R_{2}\times\frac{T}{R_{1}}S}}=\underbrace{\left[\begin{matrix}\boldsymbol{u}_{1}\boldsymbol{v}_{1}&\cdots&\boldsymbol{u}_{1}\boldsymbol{v}_{\frac{T}{R_{1}}}\\ \vdots&\ddots&\vdots\\ \boldsymbol{u}_{R_{1}}\boldsymbol{v}_{1}&\cdots&\boldsymbol{u}_{R_{1}}\boldsymbol{v}_{\frac{T}{R_{1}}}\end{matrix}\right]}_{\text{Output Channel }o_{ij}=\boldsymbol{u}_{i}\boldsymbol{v}_{j}|_{i=1\text{ to }R_{1},\,j=1\text{ to }\frac{T}{R_{1}}}},\quad\boldsymbol{\theta}=\text{Reshape}_{\text{PadFL}}(\boldsymbol{\theta}^{\prime})   (4)

where $\boldsymbol{u}_{i}\in\mathbb{R}^{k^{2}\times R_{2}}$, $\boldsymbol{v}_{j}\in\mathbb{R}^{R_{2}\times S}$, and $\boldsymbol{\theta}^{\prime}$ is also reshaped by channels to obtain the final $\boldsymbol{\theta}$. Eq. (4) enables each $\boldsymbol{u}_{i}$ in the general parameters to remember different channel-specific knowledge since it is only shared by a subset of the channels. We adapt the model to clients’ smaller capacities by reducing the columns $\boldsymbol{v}_{j}$ in $\boldsymbol{\theta_{v}}$, where only the output channels $(\boldsymbol{u}_{\cdot}\boldsymbol{v}_{j})$ are pruned together. This completely preserves the general parameters $\boldsymbol{\theta_{u}}$ in the reduced model, as in FLANC. Additionally, we choose the coefficients $R_{1},R_{2}$ adaptively by

R_{1}=Tp_{min},\quad R_{2}=\left\{\begin{array}{ll}\max(\min(S,T),k^{2}),&\text{if layer is convolution}\\ R_{1},&\text{if layer is linear}\end{array}\right.   (5)

to strike a balance between computation efficiency and performance as depicted in Sec.5.
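
To make the reconstruction in Eq. (4) concrete, the NumPy sketch below rebuilds a dense convolution weight from the general part $\boldsymbol{\theta_u}$ and the personal part $\boldsymbol{\theta_v}$. The block ordering and reshape convention are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def reconstruct_conv_weight(theta_u, theta_v, T, S, k, R1, R2):
    """Rebuild a (T, S, k, k) conv weight from the channel-aware decomposition of
    Eq. (4): theta_u stacks R1 blocks u_i (k^2 x R2), theta_v concatenates T/R1
    blocks v_j (R2 x S); output channel o_ij = u_i @ v_j."""
    assert theta_u.shape == (k * k * R1, R2) and theta_v.shape == (R2, (T // R1) * S)
    u_blocks = theta_u.reshape(R1, k * k, R2)                      # u_1, ..., u_{R1}
    v_blocks = theta_v.reshape(R2, T // R1, S).transpose(1, 0, 2)  # v_1, ..., v_{T/R1}
    channels = [u_blocks[i] @ v_blocks[j]                          # o_ij of shape (k^2, S)
                for i in range(R1) for j in range(T // R1)]
    return np.stack([o.T.reshape(S, k, k) for o in channels])      # (T, S, k, k)

# Example: a Conv(64, 64, 5, 5) layer with p_min = 0.0625 gives R1 = 4 and R2 = 64.
w = reconstruct_conv_weight(np.random.randn(25 * 4, 64), np.random.randn(64, 16 * 64),
                            T=64, S=64, k=5, R1=4, R2=64)
assert w.shape == (64, 64, 5, 5)
```

Because each $\boldsymbol{u}_i$ only contributes to the channels $o_{i\cdot}$, dropping a block $\boldsymbol{v}_j$ removes whole output channels while leaving $\boldsymbol{\theta_u}$ untouched.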

Model Reduction.

Given a full-size model $\mathcal{M}_{\theta}$ with $d$ layers, we represent the model parameters $\theta$ by the set of decomposed parameters of all layers $\{(\theta_{\mathbb{u}}^{(l)},\theta_{\mathbb{v}}^{(l)})\}|_{l=1\text{ to }d}$. Then, we layer-wisely remove columns of $\theta_{\mathbb{v}}^{(l)}|_{l=1\text{ to }d}$ in descending order of column indices (i.e., Eq. (10)) to obtain the $p_{i}$-width model for each client $c_{i}$ based on the capacity constraint in Eq. (2). We finally denote the general parameters by $\theta_{i,\mathbb{u}}=\{\theta_{\mathbb{u}}^{(l)}\}|_{l=1\text{ to }d}$ and the varying-size personal parameters by $\theta_{i,\mathbb{v}}=\{\theta_{\mathbb{v}}^{(l)}\}|_{l=1\text{ to }d}$ for each client $c_{i}$.
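
The column removal described above can be sketched as follows; rounding the kept block and column counts with ceil is an assumption of this illustration.

```python
import numpy as np

def prune_personal_params(theta_v, S, p):
    """Keep the first ~p*T/R1 blocks v_j (each R2 x S) of a layer's theta_v and,
    inside each kept block, only the first ~p*S input-channel columns (cf. Eq. (10))."""
    R2, cols = theta_v.shape
    num_blocks = cols // S                                   # T / R1 blocks v_1..v_{T/R1}
    keep_blocks = max(1, int(np.ceil(p * num_blocks)))
    keep_inputs = max(1, int(np.ceil(p * S)))
    kept = theta_v.reshape(R2, num_blocks, S)[:, :keep_blocks, :keep_inputs]
    return kept.reshape(R2, keep_blocks * keep_inputs)
```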

3.2 Aggregation Mechanism

The proposed model decomposition method enables clients to collaboratively maintain general parameters $\theta_{\mathbb{u}}$ of a constant shape. As a result, $\theta_{\mathbb{u}}$ can be updated by direct averaging at each aggregation round $t$

\mathbb{\theta}_{\mathbb{u}}^{(t+1)}\leftarrow\frac{1}{|\mathcal{S}_{t}|}\sum_{i\in\mathcal{S}_{t}}\mathbb{\theta}_{i,\mathbb{u}}^{(t)}   (6)

where $\mathcal{S}_{t}$ denotes the currently selected subset of clients. However, direct position-wise averaging of personal parameters $\{\theta_{i,\mathbb{v}}\}|_{i=1\text{ to }N}$ of varying shapes may reduce the aggregation efficiency [9] and cause unmatched knowledge fusion [14]. Moreover, since the personal parameters are viewed as learners of personal knowledge, it is essential to propagate $\theta_{i,\mathbb{v}}$ among clients that have similar local data distributions [19]. To this end, we employ a hyper-network (HN) to generate personal parameters from learnable client embeddings, which are themselves aggregated to achieve the implicit aggregation of personal parameters.
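
For reference, a minimal sketch of the direct averaging in Eq. (6), assuming the general parameters are stored as equally shaped arrays keyed by layer name (the storage format is ours):

```python
import numpy as np

def aggregate_general_params(client_general_params):
    """client_general_params: one dict {layer_name: theta_{i,u}^{(l)}} per client in S_t;
    the shapes match across clients, so position-wise averaging is well defined."""
    layer_names = client_general_params[0].keys()
    return {name: np.mean([c[name] for c in client_general_params], axis=0)
            for name in layer_names}
```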

Parameter Generation via HN.

Personalizing model parameters through an HN was first introduced by pFedHN [20]. In pFedHN, at the $t$th communication round the server generates and distributes the full model parameters $\boldsymbol{\theta}_{i}^{(t)}=h(\boldsymbol{\theta}^{(t)}_{h},\mathbb{e}_{i}^{(t)})$ to each selected client $c_{i},i\in\mathcal{S}_{t}$, where $\mathbb{e}_{i}$ is a learnable embedding and $h(\boldsymbol{\theta}_{h},\cdot)$ is the HN parameterized by $\boldsymbol{\theta}_{h}$. Then, the server collects the locally trained parameters $\boldsymbol{\theta}_{i}^{(t+1)}$ to update the hyper-network $h$ with learning rate $\gamma$:

\mathcal{L}_{HN}=\frac{1}{|\mathcal{S}_{t}|}\sum_{i\in\mathcal{S}_{t}}\frac{1}{2}\|\boldsymbol{\theta}_{i}^{(t+1)}-\boldsymbol{\theta}_{i}^{(t)}\|_{2}^{2}   (7)
\mathbb{e}_{i}^{(t+1)}\leftarrow\mathbb{e}_{i}^{(t)}-\gamma\frac{\partial\mathcal{L}_{HN}}{\partial\mathbb{e}_{i}^{(t)}},\quad\boldsymbol{\theta}^{(t+1)}_{h}\leftarrow\boldsymbol{\theta}^{(t)}_{h}-\gamma\frac{\partial\mathcal{L}_{HN}}{\partial\boldsymbol{\theta}^{(t)}_{h}}   (8)

Unlike pFedHN, we use the HN only to generate the personal parameters $\boldsymbol{\theta}_{i,\mathbb{v}}|_{i=1}^{N}$ for clients. First, we input all client embeddings $\mathbb{E}=\left[\mathbb{e}_{1},\cdots,\mathbb{e}_{N}\right]$ into the encoder to obtain $\mathbb{E}^{\prime}=Encoder_{HN}(\mathbb{E})=\left[\mathbb{e}_{1}^{\prime},\cdots,\mathbb{e}_{N}^{\prime}\right]$. Next, we compute the similarity between client $c_{i}$ and the other clients by $\mathbb{s}_{i}=\mathbb{E}^{\prime\top}\mathbb{e}_{i}^{\prime}$. Subsequently, we generate the personal parameters of each layer $\tilde{\theta}_{i,\mathbb{v}}^{(l)}$ by Eq. (9) and prune them by Eq. (10):

\underbrace{\mathbb{e}_{i}^{(l)}=\mathbb{E}^{\prime}\text{softmax}(\frac{\mathbb{s}_{i}}{\tau_{l}})}_{\text{Aggregation}},\quad\tilde{\mathbb{\theta}}_{i,\mathbb{v}}^{(l)}=Decoder_{HN,l}(\mathbb{e}_{i}^{(l)})=\left[\underbrace{\mathbb{v}_{11}^{(il)},\cdots,\mathbb{v}_{1S_{l}}^{(il)}}_{\mathbb{v}_{1}^{(il)}},\cdots,\underbrace{\mathbb{v}_{\frac{T_{l}}{R_{1l}}1}^{(il)},\cdots,\mathbb{v}_{\frac{T_{l}}{R_{1l}}S_{l}}^{(il)}}_{\mathbb{v}_{\frac{T_{l}}{R_{1l}}}^{(il)}}\right]   (9)
\mathbb{\theta}_{i,\mathbb{v}}^{(l)}=\text{Prune}(\tilde{\mathbb{\theta}}_{i,\mathbb{v}}^{(l)})=\left[\mathbb{v}_{1,1:p_{i}T_{l-1}}^{(il)},\cdots,\mathbb{v}_{\frac{p_{i}T_{l}}{R_{1l}},1:p_{i}T_{l-1}}^{(il)}\right]   (10)

where $T_{l},S_{l}$ are the numbers of output and input channels of layer $l$, $\tau_{l}$ is the learnable coefficient (i.e., the softmax temperature) of layer $l$, and $p_{i}$ is the target reduction ratio of the model width. In practice, $p_{i}$ can be approximately ensured by $p_{i}^{2}\leq r_{i}$ (i.e., Eq. (2)) according to our complexity analysis in Sec. 4. We use multi-layer perceptrons as the encoder and the decoders, where the 1-D output vectors are reshaped to obtain the parameters.
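
The PyTorch sketch below shows one possible realization of the generation and aggregation in Eq. (9): learnable client embeddings, an MLP encoder, one MLP decoder per decomposed layer, and a learnable softmax temperature per layer. The module sizes and the exponential parameterization of $\tau_l$ are illustrative assumptions, and the pruning of Eq. (10) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersonalParamHN(nn.Module):
    def __init__(self, num_clients, emb_dim, hidden_dim, layer_out_dims):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_clients, emb_dim))   # e_1..e_N
        self.encoder = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, emb_dim))
        # one decoder per decomposed layer; its flat output is reshaped into theta_v^(l)
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
                           nn.Linear(hidden_dim, d)) for d in layer_out_dims])
        self.log_tau = nn.Parameter(torch.zeros(len(layer_out_dims)))        # per-layer tau_l

    def forward(self, client_idx):
        E_prime = self.encoder(self.embeddings)          # E' = Encoder_HN(E)
        s_i = E_prime @ E_prime[client_idx]              # similarities s_i = E'^T e_i'
        flat_params = []
        for l, decoder in enumerate(self.decoders):
            weights = F.softmax(s_i / self.log_tau[l].exp(), dim=0)   # attention over clients
            e_il = E_prime.t() @ weights                 # implicitly aggregated embedding
            flat_params.append(decoder(e_il))            # flat personal parameters of layer l
        return flat_params
```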

Implicit Aggregation.

Eq. (9) promotes personal knowledge sharing among similar clients in three ways. First, the HN implicitly aggregates personal parameters in the embedding space instead of position-wise averaging in the original parameter space, allowing complex non-linear knowledge fusion across varying-size parameters. Second, the self-attention module enlarges the aggregation weights between clients with larger similarities in the encoded features, which adaptively promotes personal knowledge sharing among similar clients. Finally, we use an independent learnable temperature coefficient for the softmax aggregation of each layer, since the optimal personalization degree can vary across layers [21] and tasks [19]. For example, $\tau_{l}\rightarrow\infty$ leads to no personalization of layer $l$, while a small $\tau_{l}$ encourages a high degree of layer personalization.

3.3 Implementation

In the Pa3dFL implementation, we decompose all the layers except the last one (i.e., the head). Following [22], we fix the global head without updating it during training, and we maintain a local head generated by the HN for each client [23]. We employ a regularization term slightly different from [2] during local training. Finally, we fuse the received model (with the global head) and the locally trained model (with the local head) via a linear search on local validation data to decide each client's local model for testing. For the hyper-network, we use multi-layer perceptrons (MLPs) to build the encoder and decoders, and we treat the dimensions of the client embeddings and hidden layers as hyper-parameters. The details of the implementation are in Appendix C.

4 Analysis

Complexity.

Pa3dFL reduces the global model along the width dimension. Supposing a model's width reduction ratio is $p$, the size of the model will be reduced by $\mathcal{O}(p^{2})$ [7]. Therefore, both the computation and communication costs can be significantly reduced by choosing a small reduction ratio. One concern is that Pa3dFL introduces additional costs in both computation and communication due to its parameter decomposition. We show that these additional costs are acceptable under our adaptive choices of $R_{1}$ and $R_{2}$ and thus do not significantly offset the benefits of width-level reduction. We empirically compare the efficiency of Pa3dFL with existing methods in Sec. 6.4 and detail the complexity analysis in Appendix A.

Convergence.

We follow [24] to assume a single local update per device and make the convergence analysis under commonly used assumptions of the smoothness of the objective (i.e., Assumption 1) and bounded gradient variance (i.e., Assumption 2).

Theorem 1 (Convergence of Pa3dFL).

Suppose Assumptions 1 and 2 hold, the step sizes $\eta$ and $\gamma$ in Pa3dFL are chosen as $\frac{2}{L_{\mathbb{u}}}$ and $\frac{L_{\mathbb{u}}}{L_{\boldsymbol{\varphi}}}$ respectively, $\eta$ decays with the number of rounds $t$, and all model parameters are initialized at the same point $\boldsymbol{\theta}^{(0)}$. Then, after $T$ rounds we have:

\frac{1}{T}\left(\sum_{t=0}^{T}\gamma\eta_{t}(1-\frac{\gamma\eta_{t}L_{\boldsymbol{\varphi}}}{2})\|\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|^{2}+\sum_{t=0}^{T}\eta_{t}(1-\frac{\eta_{t}L_{\boldsymbol{\varphi}}}{2})\|\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|^{2}\right)\leq\frac{1}{T}(F(\boldsymbol{\theta}^{(0)})-F^{*})+O(\eta^{2})   (11)

where $F^{*}$ is the lower bound of $F$.

Privacy.

Pa3dFL does not raise more privacy concerns than FedAvg [1] since, like FedAvg, it only transfers model parameters between the server and the clients, without local data leaving the clients' devices. Besides, the client embeddings kept on the server in Pa3dFL are randomly initialized and updated only through the deltas of model parameters, which can only reflect the similarities between clients without leaking their local data information.

5 Related Work

Personalized FL.

While traditional FL finds a global solution that suits all clients [1, 25], personalized FL (PFL) aims to obtain customized models of high local utility for different clients [26, 27]. Existing works improve the effectiveness of PFL by partial parameter sharing [28, 29, 30, 31, 32], meta-learning [13, 12, 33], latent representation alignment [34, 22], local history maintenance [19, 35], hyper-networks [20], adaptive aggregation [36, 37, 38, 39], and knowledge decoupling [40, 41]. Although achieving high accuracy, these methods commonly assume a uniform model size to be shared [10, 7]. Some model-agnostic methods [34, 42] can also mitigate the low-capacity issue; however, they lead to suboptimal results without leveraging the consistency in model parameters and architectures.

Pruning FL.

To enable collaborative training of capacity-heterogeneous models, several methods prune or mask model parameters to reduce the model size [5, 4, 8]. [3, 4, 5] mask model parameters to reduce the communicated model size, but cannot save computation costs during the model training phase. [6, 7, 8, 9] drop channels out of each layer of a neural network to reduce the model capacity, but they ignore filters' preference for particular data when pruning the model. [5, 10] allow clients with more similar data to share more identical parameters, and [11] dynamically personalizes parameters according to batch-level information. However, they suffer from undesirable knowledge retention as depicted in Sec. 1.

Decomposition FL.

Decomposition FL explicitly decouples knowledge via parameter decomposition [17, 16, 2]. [16] decomposes each weight into a global vector and a personalized vector. [17] proposes a novel decomposition to address the low-rank issue. The two methods are communication-efficient but both ignore the computational costs. [2] decomposes each weight into a common part and a capacity-specific part without pruning models. However, only clients with identical capacities can aggregate the capacity-specific parts, and personalization is not taken into consideration.

6 Experiment

6.1 Setup

Datasets.

We evaluate Pa3dFL on three widely used benchmarks in FL: FashionMNIST [43], CIFAR10, and CIFAR100 [44]. We partition each dataset into 100 parts respectively kept by the clients. We control data heterogeneity as in [1, 25]. For CIFAR10, clients' label distributions obey $Dirichlet(\alpha=1.0)$ [45]. For CIFAR100, each client owns 20 classes of data from all 100 classes. For FashionMNIST, data is i.i.d. distributed. We set the train/val/test split ratio to $0.8/0.1/0.1$ for each local dataset.

Models.

We use the ResNet18-GN from [25] for CIFAR10, where each batch normalization layer is replaced by group normalization due to the infeasibility of batch normalization on non-i.i.d. data in FL [46]. We use two-layer CNNs for both CIFAR100 [25] and FashionMNIST [47]. We detail the architectures of the models in Appendix D.2.

Capacity Setting.

We assume that each client can only afford training or testing a submodel that takes nearly $r\%$ of the computation cost of the full model. For width-reduction methods [9, 8, 7, 2, 10], the reduction ratio $p_{i}$ of each client $i$ should satisfy $p_{i}^{2}\leq r_{i}\%$, since the cost of a $p$-width model is $\mathcal{O}(p^{2}d)$ [7], where $d$ is the full model size. For methods that directly mask a ratio $s\%$ of parameters [11], $s_{i}\%\leq r_{i}\%$ should hold for each client. For capacity-unaware methods, we use the maximum model that can be computed by all the users based on the minimum capacity, which enables full participation of all local data. We conduct experiments under two settings: Ideal and Hetero. Clients in the Ideal setting have unlimited computation resources. In Hetero., we set $r_{i}$ to be uniformly distributed over $1\%\sim 100\%$ across clients. For example, clients with the lowest capacity can only afford training a submodel that is nearly $\frac{1}{100}$ of the complete model.
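
As a hedged illustration of how these constraints translate into submodel widths, one can pick the largest width $p_i$ on a candidate grid with $p_i^2\leq r_i$; the grid below is ours for illustration, not the exact one used in the experiments.

```python
import math

def width_from_capacity(r_i, widths=(0.0625, 0.125, 0.25, 0.5, 1.0)):
    """Largest grid width whose p^2 cost fits the capacity ratio r_i; fall back to sqrt(r_i)."""
    feasible = [p for p in widths if p * p <= r_i]
    return max(feasible) if feasible else math.sqrt(r_i)
```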

Table 1: Comparison of testing accuracy (%) of different methods w./w.o. heterogeneous capacity constraints on three datasets. The optimal result in each column is in bold font and the second-best results are underlined.
Method | CIFAR10 Ideal | CIFAR10 Hetero. | CIFAR100 Ideal | CIFAR100 Hetero. | FashionMNIST Ideal | FashionMNIST Hetero.
FedAvg | 75.01±0.43 | 52.42±0.56 | 38.09±0.85 | 17.11±0.19 | \underline{91.71±0.27} | 87.76±0.48
Ditto | \textbf{91.16±0.03} | 83.11±0.27 | \underline{57.07±0.30} | 34.44±0.12 | 90.91±0.35 | 86.51±0.44
pFedHN | 83.11±0.12 | 79.33±0.42 | 32.49±0.39 | 27.38±0.65 | 84.51±0.38 | 82.51±0.40
FLANC | 74.64±0.44 | 53.17±0.40 | 37.84±0.36 | 23.84±0.11 | 91.43±0.10 | 85.33±0.29
Fjord | 75.94±0.43 | 71.52±0.31 | 38.51±0.44 | 34.41±0.31 | 91.04±0.10 | \textbf{90.11±0.26}
HeteroFL | 74.17±0.47 | 68.42±0.86 | 38.97±0.83 | 26.81±0.48 | 90.56±0.20 | 85.66±0.85
FedRolex | 74.80±0.23 | 68.26±0.31 | 36.55±0.15 | 27.84±0.25 | \textbf{91.78±0.14} | 87.40±0.26
LocalOnly | 76.59±0.40 | 76.28±0.17 | 25.88±0.64 | 24.55±0.05 | 75.72±0.21 | 74.98±0.07
LG-Fedavg | 81.12±0.09 | 80.39±0.03 | 33.86±0.21 | 33.75±0.11 | 78.96±0.03 | 78.21±0.08
TailorFL | 74.96±0.15 | \underline{85.00±0.84} | 37.98±0.35 | 41.14±0.66 | 91.63±0.11 | 88.41±0.23
pFedGate | 90.58±0.10 | 67.02±3.05 | 54.32±0.62 | 4.55±0.03 | 88.33±3.90 | 76.58±1.96
Pa3dFL | \underline{91.05±0.23} | \textbf{86.42±0.49} | \textbf{57.33±1.24} | \textbf{51.48±0.99} | 91.04±0.42 | \underline{89.55±0.20}

Baselines.

We compare Pa3dFL with 3 types of methods:

  • Personalization. Model sizes are limited by the lowest capacity (e.g., pFedHN [20], and Ditto [12]).

  • Capacity-aware. Clients with different capacities train the global model without personalization (e.g., FLANC [2], Fjord [8], HeteroFL [7], and FedRolex [9]).

  • Capacity-aware + Personalization. Clients with different capacities train the global model and personalize it (e.g., LocalOnly [1], pFedGate [11], TailorFL [10], and LG-FedAvg [29]).

Details of these methods are in Appendix D.3.

Hyperparameters.

Following [25], we fix the batch size $B=50$ and local epochs $E=5$ for all datasets. We set the number of communication rounds $R=1000/2000/500$ for CIFAR10/CIFAR100/FashionMNIST respectively. We perform early stopping when there is no gain in the optimal validation metric for more than $0.2R$ rounds. We tune the learning rate on the grid $\{5e{-}3, 1e{-}2, 5e{-}2, 1e{-}1\}$ with decay rate $0.998$, and tune the algorithmic hyperparameters of each method to their optimal values. Each result is averaged over 3 random seeds. More details on hyperparameters are in Appendix D.

Implementation.

Our experiments are conducted on an Ubuntu 22.04.2 LTS server with 64 GB of memory, an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, and 4 NVIDIA(R) RTX 3090 GPUs. Our code is implemented with the FL framework FLGO [48].

6.2 Overall Performance

We first compare the overall accuracy of clients with and without heterogeneous capacity constraints (i.e., Hetero./Ideal) for different methods in Table 1. While existing methods suffer performance reduction under low-capacity constraints (e.g., the accuracy reduction ratios of FedAvg on CIFAR10/CIFAR100/Fashion are $30.1\%/55.0\%/4.3\%$), our proposed Pa3dFL achieves the optimal results on CIFAR10 and CIFAR100 (i.e., $86.42\%$ and $51.48\%$) and competitive results against Fjord on FashionMNIST (i.e., $89.55\%$ vs. $90.11\%$) under the same constraints. Compared with personalization-only methods (e.g., Ditto and pFedHN), Pa3dFL improves model performance by up to $43.3\%$ (e.g., CIFAR100-Hetero.) and achieves similar results in ideal cases (e.g., $91.14\%$ vs. $91.05\%$ in CIFAR10-Ideal). Compared with capacity-aware methods, Pa3dFL consistently dominates them across all cases on CIFAR10 and CIFAR100. We notice that our method takes no advantage in FashionMNIST-Ideal. We attribute this result to the i.i.d. distributed data, and we still find that Pa3dFL outperforms the other personalization-only methods (e.g., Ditto and pFedHN) in this case. In addition, Pa3dFL also achieves the second-best result in FashionMNIST-Hetero., which suggests its adaptability when clients' capacities vary, regardless of the data distribution. Capacity-aware methods with personalization (e.g., LG-FedAvg, TailorFL, and pFedGate) are also able to improve model performance while maintaining different model sizes for different clients. TailorFL achieves results similar to Pa3dFL on CIFAR10-Hetero. (e.g., $85.00\%$ vs. $86.42\%$), but obtains a much weaker result than Pa3dFL when the task becomes more difficult (e.g., $41.14\%$ vs. $51.48\%$ on CIFAR100-Hetero.). Besides, TailorFL fails to personalize models for clients in all Ideal cases since its personalization relies on clients having different architectures, which may limit its applicability in practice. Overall, the results in Table 1 confirm the ability of Pa3dFL to preserve model performance under heterogeneity in terms of data and device capacity. We further compare the learning curves of different methods in Fig. 3. Our method enjoys the fastest convergence and the highest performance across datasets.

Figure 2: Model testing accuracy vs. client capacity on three benchmarks

Figure 3: Test accuracy vs. communication rounds under the Hetero. setting

6.3 Client Capacity-specific Performance

We show the capacity-specific performance under the Hetero. setting in Fig. 2, where we group clients' capacities into five clusters and show the average testing accuracy of each cluster. The results suggest that Pa3dFL outperforms the other methods across different levels of client capacity on the non-i.i.d. datasets (e.g., CIFAR10 and CIFAR100). We notice that TailorFL achieves results comparable to Pa3dFL on CIFAR10; however, it suffers a quick performance reduction on CIFAR100 as the capacity decreases. On Fashion, Pa3dFL achieves competitive results against non-personalization methods (e.g., Fjord and FedAvg) and still dominates the other personalization methods. These results indicate Pa3dFL's robustness to different types of data distributions in practice.

Table 2: Comparison of efficiency for different methods on CIFAR100 with the CNN model.
Metric | FLOPs ($\times 10^{7}$) | Peak Memory (MB) | Model Size ($\times 10^{-1}$ MB)
ratio r | 1/256  1/64  1/4  1 | 1/256  1/64  1/4  1 | 1/256  1/64  1/4  1
FedAvg | -  -  -  188.0 | -  -  -  71.1 | -  -  -  30.4
$p$-width | 3.5  8.2  59.1  188.0 | 23.1  26.2  45.3  71.1 | 0.1  0.5  7.6  30.4
FLANC | 15.0  18.2  42.3  87.7 | 30.1  34.1  58.4  90.9 | 0.6  0.9  7.76  29.3
pFedGate | 5.5  10.1  61.0  190.0 | 39.2  42.3  61.4  87.2 | 36.3  36.3  36.3  36.3
Pa3dFL | 3.6  8.3  59.9  191.9 | 23.3  26.5  46.3  74.8 | 0.8  1.3  8.5  28.7

6.4 Efficiency

We evaluate the efficiency of the methods with the metrics FLOPs, peak memory, and model size [11, 7] in Table 2. The results of each method except pFedGate are independently obtained by training the model on CIFAR10 with a few batches of data over 10 iterations, with the batch size fixed to 128. For pFedGate, we evaluate the costs of its sparse model and the additional gating layer, as detailed in Appendix D.4. The term $p$-width refers to methods that reduce the model along the width (e.g., Fjord, HeteroFL, FedRolex, and TailorFL). Firstly, like other capacity-aware methods, Pa3dFL can significantly reduce both the computation and communication costs of the model. Compared with $p$-width, Pa3dFL introduces few additional costs, and it saves costs at a level much closer to $p$-width than pFedGate and FLANC do. For example, the ratios of FLOPs saved by Pa3dFL and $p$-width when shrinking the model to $1/256$ are respectively $98.08\%$ and $98.13\%$ compared to the full model used by FedAvg. We observe similar trends for Pa3dFL in peak memory and model size. FLANC achieves much smaller FLOPs than the other methods when the reduction ratio is large; however, its efficiency saving diminishes as the reduction ratio becomes small. pFedGate fails to reduce the model size since all parameters might be activated during computation, resulting in the full transmission and storage of the model.

Table 3: The testing accuracy (%) of ablations on Pa3dFL.
Variant | CIFAR10 | FashionMNIST
Pa3dFL | 86.42 | 89.55
w.o. HN-based Agg. | 81.52 (-5.66%) | 78.94 (-11.84%)
FLANC decomposition | 82.35 (-4.70%) | 84.58 (-5.54%)

6.5 Ablation Study

We conduct ablation studies on the modules of Pa3dFL to show their impact in Table 3. We notice that aggregating only the general parameters of models, without using the hyper-network to aggregate the personalized parts, severely hurts model performance (e.g., $-5.66\%$ on CIFAR10 and $-11.84\%$ on FashionMNIST) in both the i.i.d. and non-i.i.d. cases. This indicates the importance of aggregating the personalized parts for personal knowledge fusion. When replacing our model decomposition with the one used in FLANC (i.e., Eq. (3)), our method also suffers a non-negligible performance reduction. Therefore, we claim that the way parameters are decomposed is essential to model performance, since it decides how knowledge is organized in the parameters, as discussed above.

7 Conclusion

In this work, we address the issue of model performance reduction when both data heterogeneity and capacity heterogeneity exist in FL. We identify and tackle two key challenges in enhancing model performance when adapting an over-capacity model to clients by pruning. Different from previous works, we propose to prune the model after decomposition and develop Pa3dFL to promote both general and personal knowledge sharing. We theoretically verify the effectiveness of our method. Finally, we conduct extensive experiments to show the advantages of our proposed method in terms of accuracy and efficiency. We leave efficiently training models with richer operators at larger scale under heterogeneous capacity constraints as future work.

References

  • [1] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • [2] Yiqun Mei, Pengfei Guo, Mo Zhou, and Vishal Patel. Resource-adaptive federated learning with all-in-one neural composition. Advances in Neural Information Processing Systems, 35:4270–4284, 2022.
  • [3] Ang Li, Jingwei Sun, Binghui Wang, Lin Duan, Sicheng Li, Yiran Chen, and Hai Li. Lotteryfl: Personalized and communication-efficient federated learning with lottery ticket hypothesis on non-iid datasets. arXiv preprint arXiv:2008.03371, 2020.
  • [4] Ang Li, Jingwei Sun, Xiao Zeng, Mi Zhang, Hai Li, and Yiran Chen. Fedmask: Joint computation and communication-efficient personalized federated learning via heterogeneous masking. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pages 42–55, 2021.
  • [5] Ang Li, Jingwei Sun, Pengcheng Li, Yu Pu, Hai Li, and Yiran Chen. Hermes: an efficient federated learning framework for heterogeneous mobile clients. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 420–437, 2021.
  • [6] Dingzhu Wen, Ki-Jun Jeon, and Kaibin Huang. Federated dropout—a simple approach for enabling federated learning on resource constrained devices. IEEE wireless communications letters, 11(5):923–927, 2022.
  • [7] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. arXiv preprint arXiv:2010.01264, 2020.
  • [8] Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos Venieris, and Nicholas Lane. Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout. Advances in Neural Information Processing Systems, 34:12876–12889, 2021.
  • [9] Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. Fedrolex: Model-heterogeneous federated learning with rolling sub-model extraction. Advances in neural information processing systems, 35:29677–29690, 2022.
  • [10] Yongheng Deng, Weining Chen, Ju Ren, Feng Lyu, Yang Liu, Yunxin Liu, and Yaoxue Zhang. Tailorfl: Dual-personalized federated learning under system and data heterogeneity. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, pages 592–606, 2022.
  • [11] Daoyuan Chen, Liuyi Yao, Dawei Gao, Bolin Ding, and Yaliang Li. Efficient personalized federated learning via sparse model-adaptation. arXiv preprint arXiv:2305.02776, 2023.
  • [12] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
  • [13] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
  • [14] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.
  • [15] Xin-Chun Li, Yi-Chu Xu, Shaoming Song, Bingshuai Li, Yinchuan Li, Yunfeng Shao, and De-Chuan Zhan. Federated learning with position-aware neurons. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10082–10091, 2022.
  • [16] Wonyong Jeong and Sung Ju Hwang. Factorized-fl: Personalized federated learning with parameter factorization & similarity matching. Advances in Neural Information Processing Systems, 35:35684–35695, 2022.
  • [17] Nam Hyeon-Woo, Moon Ye-Bin, and Tae-Hyun Oh. Fedpara: Low-rank hadamard product for communication-efficient federated learning. arXiv preprint arXiv:2108.06098, 2021.
  • [18] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 818–833. Springer, 2014.
  • [19] Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Haibing Guan. Fedala: Adaptive local aggregation for personalized federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11237–11244, 2023.
  • [20] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, pages 9489–9502. PMLR, 2021.
  • [21] Xiaosong Ma, Jie Zhang, Song Guo, and Wenchao Xu. Layer-wised model aggregation for personalized federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10092–10101, June 2022.
  • [22] Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fedbabu: Towards enhanced representation for federated image classification. arXiv preprint arXiv:2106.06042, 2021.
  • [23] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
  • [24] Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. Federated learning with partial model personalization. In International Conference on Machine Learning, pages 17716–17758. PMLR, 2022.
  • [25] Cheng Jin, Xuandong Chen, Yi Gu, and Qun Li. Feddyn: A dynamic and efficient federated distillation approach on recommender system. In 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), pages 786–793. IEEE, 2023.
  • [26] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in neural information processing systems, 30, 2017.
  • [27] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • [28] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
  • [29] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations, 2020.
  • [30] Muhammad Akbar Husnoo, Adnan Anwar, Nasser Hosseinzadeh, Shama Naz Islam, Abdun Naser Mahmood, and Robin Doss. Fedrep: towards horizontal federated load forecasting for retail energy providers. In 2022 IEEE PES 14th Asia-Pacific Power and Energy Engineering Conference (APPEEC), pages 1–6. IEEE, 2022.
  • [31] Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. Partialfed: Cross-domain personalized federated learning via partial initialization. Advances in Neural Information Processing Systems, 34:23309–23320, 2021.
  • [32] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Federated learning on non-iid features via local batch normalization. arXiv preprint arXiv:2102.07623, 2021.
  • [33] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557–3568, 2020.
  • [34] Yue Tan, Guodong Long, Lu Liu, Tianyi Zhou, Qinghua Lu, Jing Jiang, and Chengqi Zhang. Fedproto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8432–8440, 2022.
  • [35] Xin-Chun Li, De-Chuan Zhan, Yunfeng Shao, Bingshuai Li, and Shaoming Song. Fedphp: Federated personalization with inherited private models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 587–602. Springer, 2021.
  • [36] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M Alvarez. Personalized federated learning with first order model optimization. arXiv preprint arXiv:2012.08565, 2020.
  • [37] Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. Personalized cross-silo federated learning on non-iid data. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 7865–7873, 2021.
  • [38] Jian Xu, Xinyi Tong, and Shao-Lun Huang. Personalized federated learning with feature alignment and classifier collaboration. arXiv preprint arXiv:2306.11867, 2023.
  • [39] Jun Luo and Shandong Wu. Adapt to adaptation: Learning personalization for cross-silo federated learning. In IJCAI: proceedings of the conference, volume 2022, page 2166. NIH Public Access, 2022.
  • [40] Hong-You Chen and Wei-Lun Chao. On bridging generic and personalized federated learning for image classification. arXiv preprint arXiv:2107.00778, 2021.
  • [41] Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, Jian Cao, and Haibing Guan. Gpfl: Simultaneously learning global and personalized feature information for personalized federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5041–5051, 2023.
  • [42] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
  • [43] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • [44] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [45] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification, 2019.
  • [46] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. The non-iid data quagmire of decentralized machine learning. In International Conference on Machine Learning, pages 4387–4398. PMLR, 2020.
  • [47] Zheng Wang, Xiaoliang Fan, Jianzhong Qi, Haibing Jin, Peizhen Yang, Siqi Shen, and Cheng Wang. Fedgs: Federated graph-based sampling with arbitrary client availability. 2023.
  • [48] Zheng Wang, Xiaoliang Fan, Zhaopeng Peng, Xueheng Li, Ziqi Yang, Mingkuan Feng, Zhicheng Yang, Xiao Liu, and Cheng Wang. Flgo: A fully customizable federated learning platform. arXiv preprint arXiv:2306.12079, 2023.
  • [49] Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. Fedrolex: Model-heterogeneous federated learning with rolling sub-model extraction. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 29677–29690. Curran Associates, Inc., 2022.

Appendix A Complexity Analysis

Model Size.

A standard convolution weight $\boldsymbol{\theta}_{conv}\in\mathbb{R}^{T\times S\times k\times k}$ is decomposed into $\boldsymbol{\theta_{w}}\in\mathbb{R}^{R_{1}k^{2}\times R_{2}}$ and $\boldsymbol{\theta_{v}}\in\mathbb{R}^{R_{2}\times\frac{ST}{R_{1}}}$, where $T\bmod R_{1}=0$, $k$ is the kernel size, and $S$/$T$ is the number of input/output channels. $R_{1}$ and $R_{2}$ are two hyperparameters that control the number of convolution parameters. Therefore, the total number of parameters of the decomposed convolution is $R_{2}(ST/R_{1}+R_{1}k^{2})$. When the convolution layer is reduced by ratio $p$, the corresponding parameter count becomes $R_{2}(\frac{p^{2}ST}{R_{1}}+R_{1}k^{2})$. Since $\lfloor Tp\rfloor\bmod R_{1}=0$ is required for every $p$ in Pa3dFL and a small $R_{1}$ may limit the expressiveness of $\boldsymbol{\theta_{w}}$, we always fix $R_{1}=Tp_{min}$. The corresponding parameter count is

R_{2}\left(\frac{p^{2}S}{p_{min}}+Tp_{min}k^{2}\right)=pR_{2}\left(\frac{p}{p_{min}}S+\frac{p_{min}}{p}Tk^{2}\right)   (12)

We denote $p_{min}\leq\sigma$ as the systemic constraint, since $p_{min}$ depends on the lowest capacity among clients. The reduction ratio of a $p$-width model implemented by Pa3dFL is

r_{Pa^{3}dFL}=p\left(R_{2}\frac{\frac{p}{p_{min}}S+\frac{p_{min}}{p}Tk^{2}}{STk^{2}}\right)   (13)
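
A small helper (ours, for illustration) that evaluates Eq. (13) and compares it with the plain $p$-width ratio $p^{2}$:

```python
def pa3dfl_reduction_ratio(p, S, T, k, p_min, R2):
    """Eq. (13): parameters of a p-width decomposed conv layer relative to the
    dense S*T*k^2 weight, with R1 fixed to T*p_min."""
    return p * R2 * ((p / p_min) * S + (p_min / p) * T * k * k) / (S * T * k * k)

# e.g., the Conv(64, 64, 5, 5) layer discussed below, with p_min = 0.0625 and
# R2 = max(min(S, T), k^2) = 64:
ratio = pa3dfl_reduction_ratio(p=0.25, S=64, T=64, k=5, p_min=0.0625, R2=64)
# ratio ≈ 0.1025, versus p**2 = 0.0625 for plain width reduction
```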

We first visualize in Fig. 4 how the model size reduction ratio changes with the value of $R_{2}$ on the convolution operator Conv(64, 64, 5, 5) and the linear operator (1600, 384) used in CIFAR10. We fix the lowest capacity $\sigma=0.0625$ by default. Firstly, we notice that for the convolution operator, setting $R_{2}=\max(k^{2},S)$ does not significantly diminish the benefit in the model reduction ratio; even when the width ratio is small, the convolution operator introduces a very limited number of additional parameters. For the linear operator, the model size increases quickly as $R_{2}$ increases, and the reduction ratio is almost consistent with $p$-width. Since $\boldsymbol{\theta_{w}}\in\mathbb{R}^{R_{1}\times R_{2}}$, where increasing $R_{2}$ beyond $R_{1}$ cannot increase the rank of $\boldsymbol{\theta_{w}}$, we set $R_{2}=R_{1}$ for all linear operators unless otherwise specified. We finally design the rule for choosing $R_{1},R_{2}$ as follows:

R_{1}=Tp_{min},\quad R_{2}=\left\{\begin{array}{ll}\max(\min(S,T),k^{2}),&\text{if layer is convolution}\\ R_{1},&\text{if layer is linear}\end{array}\right.   (14)
Figure 4: The model reduction ratio vs. $R_{2}$, the lowest capacity $p_{min}$, and the dimension and type of the operators

Under this rule, we study how the reduction ratio is influenced by the systemic constraint $\sigma$ and the dimensions of the operators in Fig. 4. For the convolution operator, a loose constraint (i.e., a large $\sigma$) diminishes the effectiveness of model reduction when the width ratio is small; however, a small $p_{min}$ is allowed under loose systemic constraints. Different dimensions of convolution layers enjoy different reduction speeds. Although Pa3dFL introduces additional costs compared with $p$-width, the reduced number of model parameters is already significant compared to the full model. For the linear operator, the results show that this rule is consistent with $p$-width when the systemic constraint or the dimensions of the operators vary. Further, if $\min(S,T)\geq k^{2}$, the reduction ratio under this rule becomes

r_{Pa^{3}dFL}=p\left(R_{2}\frac{\frac{p}{p_{min}}S+\frac{p_{min}}{p}Tk^{2}}{STk^{2}}\right)\leq p\left(\frac{1}{k^{2}}\frac{p}{p_{min}}+\frac{p_{min}}{p}\right)=\frac{1}{k^{2}p_{min}}p^{2}+p_{min}   (15)

If $p_{min}$ is smaller than $p^{2}$ and $k^{2}p_{min}\approx 1$, the ratio is nearly of complexity $\mathcal{O}(p^{2})$. The term $\frac{1}{k^{2}}$ in the reduction ratio also explains why the convolution operator in Pa3dFL has a larger tolerance to changes of $R_{2}$ than the linear operator in Fig. 4. Therefore, one can use a hyperparameter to scale $R_{2}$ of convolution layers properly to enhance the rank of the common parts' matrices. Unfortunately, this rule may not work with a very large kernel size $k^{2}\gg\min(S,T)$, which would produce non-negligible additional costs. In our experiments, the kernel sizes we use do not significantly exceed the channel numbers. We leave the design of parameter selection rules for convolution layers with relatively large kernel sizes as future work.

Training-Time Complexity.

During the model training phase, the additional complexity comes from the matrix multiplication in the parameter recovery process [2]. Given a batch of $B$ data samples with feature size $q\times q$ (i.e., the height and the width) and $c_{1}$ input channels, a convolution operator with kernel size $k$ and $c_{2}$ output channels produces $Bq^{2}k^{2}c_{1}c_{2}$ multiply-add floating-point operations. Since the cost of the weight multiplication is $k^{2}R_{1}R_{2}(c_{2}/R_{1})c_{1}=R_{2}k^{2}c_{1}c_{2}$, the additional cost ratio is

\frac{R_{2}k^{2}c_{1}c_{2}}{Bq^{2}k^{2}c_{1}c_{2}}=\frac{R_{2}}{Bq^{2}}   (16)

In practice, this term is negligible when $R_{2}\ll Bq^{2}$.
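
As an illustrative calculation (assuming a $32\times 32$ feature map as in CIFAR, the batch size $B=128$ used in Sec. 6.4, and $R_{2}=64$ as for the Conv(64, 64, 5, 5) layer above), the overhead is

\frac{R_{2}}{Bq^{2}}=\frac{64}{128\times 32^{2}}\approx 4.9\times 10^{-4},

i.e., well below $0.1\%$ of the convolution's own cost.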

Appendix B Convergence

At each communication round $t$, the server generates personal parameters from the clients' embeddings by the hyper-network according to Eq. (9) and Eq. (10):

\{\boldsymbol{\theta}_{i,\mathbb{v}}^{(t)}\}_{i=1}^{N}=h_{gen\&prune}(\boldsymbol{\theta}_{h}^{(t)},\mathbb{E}^{(t)})=h(\boldsymbol{\varphi}^{(t)})   (17)

where $\boldsymbol{\varphi}^{(t)}=(\boldsymbol{\theta}_{h}^{(t)},\mathbb{E}^{(t)})$. Then, client $c_{i}$ receives the model parameters $(\boldsymbol{\theta}_{\mathbb{u}}^{(t)},\boldsymbol{\theta}_{i,\mathbb{v}}^{(t)})$ and locally updates them by one-step gradient descent with step size $\eta_{t}$:

\boldsymbol{\theta}_{i,\mathbb{u}}^{(t+1)}=\boldsymbol{\theta}_{\mathbb{u}}^{(t)}-\eta_{t}\frac{\partial f_{i}}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}},\quad\boldsymbol{\theta}_{i,\mathbb{v}}^{(t+1)}=\boldsymbol{\theta}_{i,\mathbb{v}}^{(t)}-\eta_{t}\frac{\partial f_{i}}{\partial\boldsymbol{\theta}_{i,\mathbb{v}}^{(t)}}   (18)

The server aggregates the clients’ general parameters by

\boldsymbol{\theta}_{\mathbb{u}}^{(t+1)}=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{\theta}_{i,\mathbb{u}}^{(t+1)}=\boldsymbol{\theta}_{\mathbb{u}}^{(t)}-\eta_{t}\frac{1}{N}\sum_{i=1}^{N}\frac{\partial f_{i}}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}=\boldsymbol{\theta}_{\mathbb{u}}^{(t)}-\eta_{t}\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}   (19)

where $F=\frac{1}{N}\sum_{i=1}^{N}f_{i}$ denotes the global objective. The hyper-network is updated by Eq. (8):

\frac{\partial\mathcal{L}_{HN}}{\partial\boldsymbol{\varphi}^{(t)}}=\frac{1}{N}\frac{\partial\sum_{i=1}^{N}\frac{1}{2}\|h(\boldsymbol{\varphi}^{(t)})[i]-\boldsymbol{\theta}_{i,\mathbb{v}}^{(t+1)}\|_{2}^{2}}{\partial\boldsymbol{\varphi}^{(t)}}   (20)
=\frac{1}{N}\sum_{i=1}^{N}\eta_{t}\frac{\partial f_{i}}{\partial\boldsymbol{\theta}_{i,\mathbb{v}}^{(t)}}\frac{\partial h(\boldsymbol{\varphi}^{(t)})[i]}{\partial\boldsymbol{\varphi}^{(t)}}   (21)
=\frac{\eta_{t}}{N}\sum_{i=1}^{N}\frac{\partial f_{i}}{\partial\boldsymbol{\varphi}^{(t)}}   (22)
=\eta_{t}\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}   (23)

As a result, the update rule of the hyper-network is

𝝋(t+1)=𝝋(t)γηtF𝝋(t)\boldsymbol{\varphi}^{(t+1)}=\boldsymbol{\varphi}^{(t)}-\gamma\eta_{t}\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}} (24)

By viewing the general parameters and the hyper-network parameters as a whole 𝜽(𝒕)=(𝜽𝕦(t),𝝋(t))\boldsymbol{\theta^{(t)}}=(\boldsymbol{\theta}_{\mathbb{u}}^{(t)},\boldsymbol{\varphi}^{(t)}), we can directly apply the previous analysis result to our updating scheme under the following standard assumptions.
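Concretely, writing \boldsymbol{\theta}^{(t)}=(\boldsymbol{\theta}_{\mathbb{u}}^{(t)},\boldsymbol{\varphi}^{(t)}), Eqs. (19) and (24) amount to a single block-scaled gradient step (a restatement of the updates above, not a new result):

\boldsymbol{\theta}^{(t+1)}=\begin{pmatrix}\boldsymbol{\theta}_{\mathbb{u}}^{(t+1)}\\ \boldsymbol{\varphi}^{(t+1)}\end{pmatrix}=\begin{pmatrix}\boldsymbol{\theta}_{\mathbb{u}}^{(t)}\\ \boldsymbol{\varphi}^{(t)}\end{pmatrix}-\eta_{t}\begin{pmatrix}\boldsymbol{I}&\boldsymbol{0}\\ \boldsymbol{0}&\gamma\boldsymbol{I}\end{pmatrix}\begin{pmatrix}\partial F/\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}\\ \partial F/\partial\boldsymbol{\varphi}^{(t)}\end{pmatrix}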

Assumption 1 (Smoothness).

For each i=1,\ldots,N, the local objective f_{i} is continuously differentiable. There exist constants L_{\mathbb{u}},L_{\boldsymbol{\varphi}} such that \frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}} is L_{\boldsymbol{\varphi}}–Lipschitz with respect to \boldsymbol{\varphi}^{(t)}, and \frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}} is L_{\mathbb{u}}–Lipschitz with respect to \boldsymbol{\theta}_{\mathbb{u}}^{(t)}.

Assumption 2 (Bounded Variance).

The stochastic gradients in Alg. 2 and Alg. 3 are unbiased and have bounded variance. That is, for all \boldsymbol{\theta}_{\mathbb{u}} and \boldsymbol{\varphi}, the stochastic gradients \boldsymbol{g}_{\mathbb{u}}^{(t)} and \boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)} satisfy \mathbb{E}[\boldsymbol{g}_{\mathbb{u}}^{(t)}]=\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}, \mathbb{E}[\boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)}]=\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}, and there exist constants \sigma_{\mathbb{u}} and \sigma_{\boldsymbol{\varphi}} such that

\mathbb{E}\bigl[\|\boldsymbol{g}_{\mathbb{u}}^{(t)}-\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|_{2}^{2}\bigr]\leq\sigma_{\mathbb{u}}^{2},\quad\mathbb{E}\bigl[\|\boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)}-\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|_{2}^{2}\bigr]\leq\sigma_{\boldsymbol{\varphi}}^{2} (25)
Theorem 1 (Convergence of Pa3dFL).

Suppose Assumptions 1 and 2 hold, the step sizes in Pa3dFL satisfy \eta_{t}\leq\frac{2}{L_{\mathbb{u}}} and \gamma=\frac{L_{\mathbb{u}}}{L_{\boldsymbol{\varphi}}}, \eta_{t} decays with the round number t, and all model parameters are initialized at the same point \boldsymbol{\theta}^{(0)}. Then, after T rounds we have:

\frac{1}{T}\left(\sum_{t=0}^{T-1}\gamma\eta_{t}(1-\frac{\gamma\eta_{t}L_{\boldsymbol{\varphi}}}{2})\|\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|^{2}+\sum_{t=0}^{T-1}\eta_{t}(1-\frac{\eta_{t}L_{\mathbb{u}}}{2})\|\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|^{2}\right)\leq\frac{1}{T}(F(\boldsymbol{\theta}^{(0)})-F^{*})+O(\eta^{2}) (26)

where FF^{*} is the lower bound of FF.

Proof.

Based on Assumption 1, we have:

F(𝜽𝕦(t+1),𝝋(t+1))F(𝜽𝕦(t),𝝋(t))\displaystyle F(\boldsymbol{\theta}_{\mathbb{u}}^{(t+1)},\boldsymbol{\varphi}^{(t+1)})-F(\boldsymbol{\theta}_{\mathbb{u}}^{(t)},\boldsymbol{\varphi}^{(t)}) F𝝋(t),𝝋(t+1)𝝋(t)+F𝜽𝕦(t),𝜽𝕦(t+1)𝜽𝕦(t)\displaystyle\leq\braket{\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}},\boldsymbol{\varphi}^{(t+1)}-\boldsymbol{\varphi}^{(t)}}+\braket{\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}},\boldsymbol{\theta}_{\mathbb{u}}^{(t+1)}-\boldsymbol{\theta}_{\mathbb{u}}^{(t)}} (27)
+L𝝋2𝝋(t+1)𝝋(t)2+L𝕦2𝜽𝕦(t+1)𝜽𝕦(t)2\displaystyle+\frac{L_{\boldsymbol{\varphi}}}{2}\|\boldsymbol{\varphi}^{(t+1)}-\boldsymbol{\varphi}^{(t)}\|^{2}+\frac{L_{\mathbb{u}}}{2}\|\boldsymbol{\theta}_{\mathbb{u}}^{(t+1)}-\boldsymbol{\theta}_{\mathbb{u}}^{(t)}\|^{2}

Consider the stochastic gradient descent updates \boldsymbol{\theta}_{\mathbb{u}}^{(t+1)}=\boldsymbol{\theta}_{\mathbb{u}}^{(t)}-\eta_{t}\boldsymbol{g}_{\mathbb{u}}^{(t)} and \boldsymbol{\varphi}^{(t+1)}=\boldsymbol{\varphi}^{(t)}-\gamma\eta_{t}\boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)}. Based on Assumption 2, we then have:

𝔼[F(𝜽𝕦(t+1),𝝋(t+1))F(𝜽𝕦(t),𝝋(t))]\displaystyle\mathbb{E}\bigl{[}F(\boldsymbol{\theta}_{\mathbb{u}}^{(t+1)},\boldsymbol{\varphi}^{(t+1)})-F(\boldsymbol{\theta}_{\mathbb{u}}^{(t)},\boldsymbol{\varphi}^{(t)})\bigr{]} γηt(γηtL𝝋21)F𝝋(t)2\displaystyle\leq\gamma\eta_{t}(\frac{\gamma\eta_{t}L_{\boldsymbol{\varphi}}}{2}-1)\|\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|^{2} (28)
+ηt(ηtL𝕦21)F𝜽𝕦(t)2+O(η2)\displaystyle+\eta_{t}(\frac{\eta_{t}L_{\mathbb{u}}}{2}-1)\|\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|^{2}+O(\eta^{2})

If the step sizes satisfy \eta_{t}\leq\frac{2}{L_{\mathbb{u}}} and \gamma=\frac{L_{\mathbb{u}}}{L_{\boldsymbol{\varphi}}}, both coefficients on the right-hand side of Eq. (28) are non-positive and the descent condition is satisfied. Based on Eq. (27), after T rounds we have:

FF(𝜽(0))\displaystyle F^{*}-F(\boldsymbol{\theta}^{(0)}) F(𝜽𝕦(T),𝝋(T))F(𝜽𝕦(0),𝝋(0))\displaystyle\leq F(\boldsymbol{\theta}_{\mathbb{u}}^{(T)},\boldsymbol{\varphi}^{(T)})-F(\boldsymbol{\theta}_{\mathbb{u}}^{(0)},\boldsymbol{\varphi}^{(0)}) (29)
γt=0T1ηtF𝝋(t),𝒈𝝋(t)t=0T1ηtF𝜽𝕦(t),𝒈𝕦(t)\displaystyle\leq-\gamma\sum_{t=0}^{T-1}\eta_{t}\braket{\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}},\boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)}}-\sum_{t=0}^{T-1}\eta_{t}\braket{\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}},\boldsymbol{g}_{\mathbb{u}}^{(t)}}
+γ2L𝝋2t=0T1ηt2𝒈𝝋(t)2+L𝕦2t=0T1ηt2𝒈𝕦(t)2\displaystyle+\frac{\gamma^{2}L_{\boldsymbol{\varphi}}}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|\boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)}\|^{2}+\frac{L_{\mathbb{u}}}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|\boldsymbol{g}_{\mathbb{u}}^{(t)}\|^{2}

then:

𝔼[γt=0T1ηtF𝝋(t),𝒈𝝋(t)+t=0T1ηtF𝜽𝕦(t),𝒈𝕦(t)]\displaystyle\mathbb{E}\bigl{[}\gamma\sum_{t=0}^{T-1}\eta_{t}\braket{\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}},\boldsymbol{g}_{\boldsymbol{\varphi}}^{(t)}}+\sum_{t=0}^{T-1}\eta_{t}\braket{\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}},\boldsymbol{g}_{\mathbb{u}}^{(t)}}\bigr{]} =γt=0T1ηtF𝝋(t)2+t=0T1ηtF𝜽𝕦(t)2\displaystyle=\gamma\sum_{t=0}^{T-1}\eta_{t}\|\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|^{2}+\sum_{t=0}^{T-1}\eta_{t}\|\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|^{2} (30)
F(𝜽(0))F+γ2L𝝋2t=0T1ηt2(F𝝋(t)2+σ𝝋2)\displaystyle\leq F(\boldsymbol{\theta}^{(0)})-F^{*}+\frac{\gamma^{2}L_{\boldsymbol{\varphi}}}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}(\|\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|^{2}+\sigma_{\boldsymbol{\varphi}}^{2})
+L𝕦2t=0T1ηt2(F𝜽𝕦(t)2+σ𝕦2)\displaystyle+\frac{L_{\mathbb{u}}}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}(\|\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|^{2}+\sigma_{\mathbb{u}}^{2})

then we have:

\frac{1}{T}\left(\sum_{t=0}^{T-1}\gamma\eta_{t}(1-\frac{\gamma\eta_{t}L_{\boldsymbol{\varphi}}}{2})\|\frac{\partial F}{\partial\boldsymbol{\varphi}^{(t)}}\|^{2}+\sum_{t=0}^{T-1}\eta_{t}(1-\frac{\eta_{t}L_{\mathbb{u}}}{2})\|\frac{\partial F}{\partial\boldsymbol{\theta}_{\mathbb{u}}^{(t)}}\|^{2}\right)\leq\frac{1}{T}(F(\boldsymbol{\theta}^{(0)})-F^{*})+O(\eta^{2}) (31)

Appendix C Algorithmic Details

C.1 Pseudo Code of Pa3dFL

We show the main procedures of Pa3dFL on the server side and the client side in Alg. 2 and Alg. 3, respectively.

C.2 Training Tips

Testing Model Selection.

The procedure for testing-time model selection is shown in Alg. 1. Since we only perform model selection for testing purposes and the process does not leak validation-data information into training, it does not introduce a risk of overfitting on the validation dataset. Model selection can be performed at any time after each client has locally trained its model. Since no model training is involved, the cost of model fusion is almost the same as that of the inference phase. Clients can first recover the p-width model and then fuse the recovered models to further reduce the cost of the model-selection phase.

Algorithm 1 Pa3dFL - Model Selection

Input: validation data \mathcal{D}_{val}, models \mathcal{M}_{0} and \mathcal{M}_{1}, the maximum number of iterations n

1:  Given the model-performance function on the validation dataset f:\mathcal{M}\times\mathcal{D}\rightarrow met, set the objective function as g(\alpha)=f(\mathcal{M}_{0}+\alpha(\mathcal{M}_{1}-\mathcal{M}_{0})), \alpha\in[0,1]
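As a concrete illustration of Alg. 1, the sketch below evaluates g(\alpha) over a uniform grid and returns the best fusion weight; the grid search and all names are our own simplifying assumptions, since Alg. 1 only specifies the objective and an iteration budget n.

```python
# Illustrative sketch of the model-fusion selection in Alg. 1 (grid search assumed).
def select_fused_model(m0, m1, validate, num_points=11):
    """m0, m1: state_dicts of the two candidate models (same p-width architecture);
    validate: callable mapping a fused state_dict to a validation metric (higher is better)."""
    best_alpha, best_metric = 0.0, float("-inf")
    for i in range(num_points):
        alpha = i / (num_points - 1)
        # M_0 + alpha * (M_1 - M_0), i.e., linear interpolation between the two models
        fused = {k: (1 - alpha) * m0[k] + alpha * m1[k] for k in m0}
        metric = validate(fused)
        if metric > best_metric:
            best_alpha, best_metric = alpha, metric
    return best_alpha, best_metric
```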
Algorithm 2 Pa3dFL - Server

Input: clients' capacity constraints \boldsymbol{r}, number of clients N, model architecture \mathcal{M}, hyper-network \mathcal{H}, learning rate \eta

1:  procedure Server(𝒓\boldsymbol{r})
2:     Set minimal model width ratio pmin=min({pi|i[N]})p_{min}=\min(\{p_{i}|i\in[N]\})
3:     Select R1,R2R_{1},R_{2} according to Eq. (14) and specify model architecture based on \mathcal{M}
4:     Initialize model common parameters 𝜽𝒘(0)\boldsymbol{\theta_{w}}^{(0)}, the hyper-network parameters ϕ(0)\boldsymbol{\phi}^{(0)}, and clients’ embedding 𝕌(0)\mathbb{U}^{(0)}
5:     for t0 to T1t\leftarrow 0\text{ to }T-1 do
6:        Randomly sample a subset \mathcal{S}_{t} of m clients
7:        for i𝒮ti\in\mathcal{S}_{t} do
8:           Generate and prune personalized parameters 𝜽𝒗i(t)=Prune((𝕌(t),i),pi)\boldsymbol{\theta_{v}}_{i}^{(t)}=Prune(\mathcal{H}(\mathbb{U}^{(t)},i),p_{i}) by its width ratio pip_{i}
9:           Send (𝜽𝒘(t),𝜽𝒗i(t))(\boldsymbol{\theta_{w}}^{(t)},\boldsymbol{\theta_{v}}_{i}^{(t)}) to client ii
10:           \boldsymbol{\theta_{w}}_{i}^{(t+1)},\boldsymbol{\theta_{v}}_{i}^{(t+1)}=\textsc{LocalUpdate}((\boldsymbol{\theta_{w}}^{(t)},\boldsymbol{\theta_{v}}_{i}^{(t)}))
11:        end for
12:        # Aggregation
13:        𝜽𝒘(t+1)=1mi𝒮t𝜽𝒘i(t+1)\boldsymbol{\theta_{w}}^{(t+1)}=\frac{1}{m}\sum_{i\in\mathcal{S}_{t}}\boldsymbol{\theta_{w}}_{i}^{(t+1)}
14:        Compute =1mi𝒮t𝜽𝒗i(t+1)𝜽𝒗i(t)22\mathcal{L}=\frac{1}{m}\sum_{i\in\mathcal{S}_{t}}\|\boldsymbol{\theta_{v}}_{i}^{(t+1)}-\boldsymbol{\theta_{v}}_{i}^{(t)}\|_{2}^{2} and backward
15:        Set ϕ(t+1)=ϕ(t)ηϕ(t)\boldsymbol{\phi}^{(t+1)}=\boldsymbol{\phi}^{(t)}-\eta\frac{\partial\mathcal{L}}{\partial\boldsymbol{\phi}^{(t)}}
16:        Set \mathbb{U}^{(t+1)}=\mathbb{U}^{(t)}-\eta\frac{\partial\mathcal{L}}{\partial\mathbb{U}^{(t)}}
17:     end for
18:  end procedure
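The hyper-network update in lines 13–16 of Alg. 2 can be realized with standard automatic differentiation; the PyTorch-style sketch below is our own illustration, omits the pruning of the generated parameters, and uses names (hn, embeddings, returned_personal) that are assumptions rather than the released code.

```python
import torch

def update_hypernetwork(hn, embeddings, returned_personal, selected, lr):
    """Sketch of Alg. 2, lines 13-16 (illustrative names, pruning omitted).

    hn: torch.nn.Module mapping a client embedding to that client's flattened
        personal parameters; embeddings: (N, d) leaf tensor with requires_grad=True;
    returned_personal: dict client_id -> flattened personal parameters returned
        after local training; selected: sampled client ids; lr: step size eta.
    """
    opt = torch.optim.SGD(list(hn.parameters()) + [embeddings], lr=lr)
    opt.zero_grad()
    loss = 0.0
    for i in selected:
        generated = hn(embeddings[i])            # theta_v_i^(t) regenerated from e_i
        target = returned_personal[i].detach()   # theta_v_i^(t+1) from client i
        loss = loss + 0.5 * (generated - target).pow(2).sum()
    (loss / len(selected)).backward()            # gradients w.r.t. phi and the embeddings
    opt.step()                                   # update phi^(t+1) and E^(t+1)
```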
Algorithm 3 Pa3dFL - Client

Input: hyperparameter λ\lambda, model architecture \mathcal{M}, local epochs EE, batch size BB, learning rate η\eta

1:  procedure LocalUpdate(𝜽\boldsymbol{\theta})
2:     Set 𝜽𝕨,head,𝜽𝕨,enc,𝜽𝕧,head,𝜽𝕧,enc=𝜽\boldsymbol{\theta}_{\mathbb{w},head},\boldsymbol{\theta}_{\mathbb{w},enc},\boldsymbol{\theta}_{\mathbb{v},head},\boldsymbol{\theta}_{\mathbb{v},enc}=\boldsymbol{\theta}
3:     Freeze parameters of 𝜽𝕨,head\boldsymbol{\theta}_{\mathbb{w},head}
4:     for e1 to Ee\leftarrow 1\text{ to }E do
5:        for batch data (x,y)(x,y) in local data do
6:           Recover pp-width model from parameters 𝜽enc=(𝜽𝕨,enc,𝜽𝕧,enc)\boldsymbol{\theta}_{enc}=\mathcal{R}(\boldsymbol{\theta}_{\mathbb{w},enc},\boldsymbol{\theta}_{\mathbb{v},enc})
7:           Compute representation x=p,enc(x;𝜽enc)x^{\prime}=\mathcal{M}_{p,enc}(x;\boldsymbol{\theta}_{enc})
8:           Compute loss on the fixed global head g=(𝜽𝕨,head(x),y)\mathcal{L}_{g}=\ell(\boldsymbol{\theta}_{\mathbb{w},head}(x^{\prime}),y)
9:           Compute loss on the local head with detached representations \mathcal{L}_{c}=\ell(\boldsymbol{\theta}_{\mathbb{v},head}(\text{detach}(x^{\prime})),y)
10:           Compute regularization term reg\mathcal{L}_{reg} according to Eq. (32)
11:           Compute loss =g+c+λreg\mathcal{L}=\mathcal{L}_{g}+\mathcal{L}_{c}+\lambda\mathcal{L}_{reg}
12:           Backward and update model parameters
13:        end for
14:     end for
15:     return  𝜽𝕨,𝜽𝕧\boldsymbol{\theta}_{\mathbb{w}},\boldsymbol{\theta}_{\mathbb{v}}
16:  end procedure
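Similarly, the dual-head loss of lines 7–11 in Alg. 3 can be written compactly as below; encoder, global_head (kept frozen) and local_head are illustrative module names, and cross-entropy stands in for the loss \ell.

```python
import torch.nn.functional as F

# Sketch of the client objective in Alg. 3, lines 7-11 (module names are illustrative).
def client_loss(encoder, global_head, local_head, x, y, reg_term, lam):
    z = encoder(x)                                        # representation x'
    loss_g = F.cross_entropy(global_head(z), y)           # loss through the frozen global head
    loss_c = F.cross_entropy(local_head(z.detach()), y)   # local head trained on detached features
    return loss_g + loss_c + lam * reg_term               # total loss of line 11
```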

C.3 Regularization Term

Following [2], we add a similar orthogonal regularization term to enhance the expressiveness of the decomposed parameters:

\mathcal{L}_{reg}=\sum_{i\in\{\text{conv layers}\}}\|\boldsymbol{S_{i}}-diag(\boldsymbol{S_{i}})\|^{2},\quad\boldsymbol{S_{i}}=\boldsymbol{\theta_{w,i}}^{\top}\boldsymbol{\theta_{w,i}} (32)

where diag(\cdot) preserves only the diagonal elements of the input, and the term is weighted by the hyper-parameter \lambda in the total loss (Alg. 3, line 11). This regularization is applied only to convolution operators. Although our parameter recovery scheme differs from that of FLANC [2], the orthogonal regularization term they use is also suitable for our method. In FLANC, the columns of the common parts \boldsymbol{\theta_{w}} are viewed as a neural basis for the parameters of each filter, so the basis columns are regularized to be mutually orthogonal to increase expressiveness:

FLANC,reg=1Ll=1L𝜽𝒘𝜽𝒘𝑰22\mathcal{L}_{FLANC,reg}=\frac{1}{L}\sum_{l=1}^{L}\|\boldsymbol{\theta_{w}}^{\top}\boldsymbol{\theta_{w}}-\boldsymbol{I}\|_{2}^{2} (33)

where L denotes the number of decomposed layers in the model. In Pa3dFL, the columns of each basic component form the basis of the filters in the final weights derived from that component, so encouraging the columns to be mutually orthogonal likewise increases the expressiveness of filters under our parameter recovery scheme. Since the corresponding kernel size for linear operators is 1, requiring orthogonality among the scalar columns of a basic component is meaningless; we therefore apply this regularization only to convolution operators, rather than to all operators as FLANC does. In addition, we notice that the term used by FLANC forces the \ell_{2} norm of each column of \boldsymbol{\theta_{w}} toward one. We avoid this side effect by ignoring the diagonal elements of the product in Eq. (32), which further improves model performance in our empirical observations.
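A minimal PyTorch-style sketch of Eq. (32) is given below; it assumes the common part of each convolution layer is available as a 2-D matrix whose columns are the basis vectors, which is our illustrative simplification.

```python
import torch

def orthogonal_reg(conv_common_parts):
    """Orthogonality penalty of Eq. (32) over the common conv parts (illustrative sketch).

    conv_common_parts: list of 2-D tensors, one per convolution layer, whose
    columns are treated as basis vectors (shape: rows x columns).
    """
    loss = 0.0
    for w in conv_common_parts:
        s = w.t() @ w                              # Gram matrix S_i of the columns
        off_diag = s - torch.diag(torch.diag(s))   # drop the diagonal, keep cross terms only
        loss = loss + off_diag.pow(2).sum()        # squared Frobenius norm of off-diagonal entries
    return loss
```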

Appendix D Experimental Details

D.1 Datasets

CIFAR10\100

Each CIFAR dataset [44] consists of 60,000 32\times 32 colour images in 10\100 classes (i.e., 50,000 training images and 10,000 test images). We apply random data augmentation to both datasets [1]. For CIFAR10, we partition the training data by sampling each client's local label distribution \mathbb{p}_{k}\sim Dirichlet(\alpha\mathbb{p}^{*}), where \mathbb{p}^{*} is the label distribution of the original dataset; we use \alpha=1.0 in our experiments. For CIFAR100, we randomly let each client own 20 of the 100 classes. The resulting partitions are visualized in Fig. 5(a) and 5(b).
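For intuition, a sketch of this label-skewed partition is shown below; sampling without replacement from shuffled per-class index pools and all helper names are our own simplifying assumptions.

```python
import numpy as np

# Sketch: each client's label distribution p_k ~ Dirichlet(alpha * p_star),
# with p_star the global label distribution (illustrative implementation).
def dirichlet_partition(labels, num_clients=100, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    p_star = counts / counts.sum()
    idx_per_class = {c: rng.permutation(np.where(labels == c)[0]).tolist() for c in classes}
    samples_per_client = len(labels) // num_clients
    partition = []
    for _ in range(num_clients):
        p_k = rng.dirichlet(alpha * p_star)              # client-specific label distribution
        draw = rng.multinomial(samples_per_client, p_k)  # per-class sample counts
        client_idx = []
        for c, n_c in zip(classes, draw):
            take = min(n_c, len(idx_per_class[c]))       # sample without replacement
            client_idx += [idx_per_class[c].pop() for _ in range(take)]
        partition.append(client_idx)
    return partition
```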

FashionMNIST.

The dataset [43] consists of a training set of 60,000 examples and a test set of 10,000 examples, where each example is a 28\times 28 image of a fashion item associated with a label from 10 classes. In this work, we partition the dataset over 100 clients in an i.i.d. manner and allocate the same number of examples to each client. A visualization of the resulting partition is provided in Fig. 5(c).

(a) CIFAR10
(b) CIFAR100
(c) FashionMNIST
Figure 5: The visualization of the data partition for CIFAR10 (a), CIFAR100 (b) and FashionMNIST (c). Each bar represents a client's local dataset and each label is assigned one color. The length of each bar reflects the size of the local data.

D.2 Models

The model architectures used for the three datasets are summarized in Table 4: a ResNet18-style network for CIFAR10 and the classical CNN architecture from [1] for CIFAR100 and FashionMNIST.

CIFAR10: Input (3,32,32); Conv(3, 64, kernel=7, stride=2, pad=3), pool, relu; ResBlock(2) ×4; FC (512, 10)
CIFAR100: Input (3,32,32); Conv(3, 64, kernel=5), pool, relu; Conv(64, 64, kernel=5), pool, relu; FC (1600, 384), relu; FC (384, 192), relu; FC (192, 100), relu
FASHION: Input (1,28,28); Conv(1, 32, kernel=5, pad=2), pool, relu; Conv(32, 64, kernel=5, pad=2), pool, relu; FC (3136, 512), relu; FC (512, 128), relu; FC (128, 10), relu
Table 4: The model architecture used for each dataset.
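For concreteness, a sketch of the CIFAR100 column of Table 4 is shown below, assuming 2×2 max pooling (which yields the 1600-dimensional flattened feature, 64×5×5) and omitting any activation on the output layer; these two details are our assumptions.

```python
import torch.nn as nn

# Sketch of the CIFAR100 CNN from Table 4 (pooling and output activation are assumptions).
class CNNCifar100(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),   # 32x32 -> 14x14
            nn.Conv2d(64, 64, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),  # 14x14 -> 5x5
            nn.Flatten(),                                                  # 64 * 5 * 5 = 1600
            nn.Linear(1600, 384), nn.ReLU(),
            nn.Linear(384, 192), nn.ReLU(),
        )
        self.head = nn.Linear(192, num_classes)

    def forward(self, x):
        return self.head(self.encoder(x))
```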

D.3 Baselines & Hyperparameters

We consider the following baselines, including SOTA personalized and capacity-aware FL methods:

  • LocalOnly is a non-federated method where each client independently trains its local model;

  • FedAvg [1] is the classical FL method that iteratively averages the locally trained models to update the global model;

  • Ditto [12] personalizes the local model by limiting its distance to the global model for each client with a proximal term. We tune the coefficient μ\mu from {0.01,0.1,1}\{0.01,0.1,1\} for this method;

  • pFedHN [20] is a SOTA PFL method that uses a hyper-network to generate personalized parameters for each client;

  • HeteroFL [7] is a capacity-aware FL method that prunes model parameters in a nested manner and uses the Scaler to align information across models of different capacities.

  • Fjord [8] drops filters of the neural network at each layer in order and aligns information across models of different scalability via knowledge distillation.

  • FLANC [2] reduces the model size by decomposing the model parameters into commonly shared parts and capacity-specific parts. The commonly shared parts are averaged to update the model, while only capacity-specific parts at the same capacity level are aggregated. In addition, it uses an orthogonal regularization term to enhance the expressiveness of the commonly shared parts; we tune the term's coefficient \lambda over \{0.01,0.1,1.0\}.

  • TailorFL [10] is an efficient personalized FL method that prunes filters in a personalized manner and enhances local model performance by encouraging similar clients to share more identical filters. It regularizes the local training objective by limiting the norm of its proposed importance layer; we tune the coefficient of the regularization over \{0.1,0.5\}.

  • pFedGate [11] dynamically masks blocks of model parameters according to batch-level information to personalize the model. A gating layer is maintained locally by each client to decide which parameters are activated during computation. We set the default splitting number B=5 as in the original work and tune the learning rate of the gating layer over the grid \{0.005,0.01,0.05,0.1\} of the model learning rate.

  • LG-FedAvg [29] jointly learns compact local representations on each device and a global model across all devices.

  • FedRolex [49] employs a rolling sub-model extraction scheme that allows different parts of the global server model to be trained evenly, mitigating the client drift induced by the inconsistency between individual client models and the server model architecture.

Hyperparameters.

We fix the batch size B=50 and the number of local epochs E=5 for all datasets. We set the number of communication rounds R=1000/2000/500 for CIFAR10/CIFAR100/FashionMNIST, respectively. We perform early stopping when there is no gain in the optimal validation metric for more than 0.2R rounds. We tune the learning rate over the grid \{0.005,0.01,0.05,0.1\} with a decay rate of 0.998, and tune the algorithmic hyperparameters of each method to its optimum. For Pa3dFL, we select the dimensions of the clients' embeddings and the HN hidden layers from \{(64,64),(16,128)\}, and tune the depth of the HN encoder over \{2,3,4\}, the regularization coefficient over \{0.1,0.01,0.001\}, and the learning rate of the HN over \{1.0,0.1,0.01,0.001\}. The optimal configurations for all methods are summarized in Tables 5, 6 and 7.

Table 5: Optimal Configuration on CIFAR10
Ideal Hetero.
Step size Algorithmic Step Size Algorithmic
FedAvg 0.1 - 0.1 -
Ditto 0.1 0.1 0.05 1.0
pFedHN 0.1 [5, 100, 2] 0.1 [5, 100, 2]
FLANC 0.1 0.01 0.1 0.01
Fjord 0.1 - 0.1 -
HeteroFL 0.1 - 0.1 -
FedRolex 0.1 - 0.1 -
LocalOnly 0.005 - 0.005 -
LG-FedAvg 0.005 - 0.1 -
TailorFL 0.1 0.1, 0.05 0.1 0.5, 0.05
pFedGate 0.1 0.005 0.1 0.1
Pa3dFL 0.1 0.1,0.001,[64,64,4] 0.1 0.1,0.001,[64,64,3]
Table 6: Optimal Configuration on CIFAR100
Ideal Hetero.
Step size Algorithmic Step Size Algorithmic
FedAvg 0.1 - 0.1 -
Ditto 0.1 0.1 0.05 1.0
pFedHN 0.05 [5, 100, 2] 0.1 [5, 100, 2]
FLANC 0.1 1.0 0.1 0.01
Fjord 0.1 - 0.1 -
HeteroFL 0.05 - 0.05 -
FedRolex 0.1 - 0.1 -
LocalOnly 0.1 - 0.1 -
LG-FedAvg 0.1 - 0.1 -
TailorFL 0.1 0.5, 0.2 0.01 0.5, 0.2
pFedGate 0.1 0.05 0.1 0.05
Pa3dFL 0.1 0.001,1.0,[16,128,2] 0.1 0.01,0.1,[16,128,3]
Table 7: Optimal Configuration on FASHION
Ideal Hetero.
Step size Algorithmic Step Size Algorithmic
FedAvg 0.1 - 0.05 -
Ditto 0.05 1.0 0.05 1.0
pFedHN 0.1 - 0.1 -
FLANC 0.1 1.0 0.1 1.0
Fjord 0.1 - 0.1 -
HeteroFL 0.05 - 0.1 -
FedRolex 0.1 - 0.1 -
LocalOnly 0.1 - 0.1 -
LG-FedAvg 0.1 - 0.1 -
TailorFL 0.1 0.1, 0.2 0.01 0.05, 0.2
pFedGate 0.1 0.1 0.1 0.05
Pa3dFL 0.1 0.01,0.1,[64,64,3] 0.1 0.001,1.0,[64,64,4]

D.4 Efficiency Evaluation For pFedGate

We discuss the details of the efficiency evaluation for pFedGate. While the efficiency of pruning-based baselines can be computed directly by tracing the memory and the intermediates during computation, this approach fails for pFedGate, since dynamically masking model parameters (i.e., setting their values to zero) without actually removing them from memory does not save any computation. [11] simply assumes that the computation cost is proportional to the size of the non-masked model parameters and implements pFedGate based on zero-masking. As illustrated above, this type of implementation cannot be evaluated directly by tracing systemic variables (e.g., memory). A possible workaround is to split each weight of a model into independently stored parts, allowing users to drop specified parts before feeding data into the model. However, such a mechanism is too complex to realize in practice due to the varying splitting numbers, operator shapes, model architectures, and dimension alignment. As a result, we keep our implementation consistent with the authors' official open-source code (https://github.com/yxdyc/pFedGate). Since the additional costs of pFedGate mainly come from the gating layer, we consider the costs of the gating layer and the sparse model separately and report their sum. A sparse model with sparsity s shares almost the same computation cost as a pruning-based model with width reduction ratio \sqrt{s}. The gating layer receives a batch of data and outputs a vector whose length equals the total number of blocks; this length is independent of the model's reduction ratio and determined only by the input shape and the model splitting scheme. Therefore, for each batch size we estimate the additional cost of the gating layer as the difference between the cost of pFedGate's model with sparsity 1.0 and the cost of a full ordinary model, and we use this estimated gating-layer cost as the additional cost for all reduction ratios at that batch size. We finally add the cost of the shrunk model and the cost of the gating layer to obtain the total cost of pFedGate.
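In code form, the bookkeeping described above reduces to the following; the three inputs are quantities measured as described, and the function name is ours.

```python
# Sketch of how pFedGate's total cost is assembled from measured pieces (Sec. D.4).
def pfedgate_total_cost(cost_sqrt_s_width_model, cost_full_plain_model, cost_pfedgate_sparsity_one):
    """A model with sparsity s is costed like a sqrt(s)-width pruned model; the
    gating layer's extra cost is the gap between pFedGate at sparsity 1.0 and a
    full ordinary model, measured once per batch size."""
    gating_layer_cost = cost_pfedgate_sparsity_one - cost_full_plain_model
    return cost_sqrt_s_width_model + gating_layer_cost
```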

Appendix E Additional Experiments

Table 8: Comparison of efficiency for different methods on CIFAR100 with the CNN model.
FLOPs (×107\times 10^{7}) Peak Memory (MB)
reduction ratio 1/256 1/64 1/4 1 1/256 1/64 1/4 1
p-width B=16 0.4 1.0 7.3 23.5 17.0 17.4 20.4 25.7
B=64 1.7 4.1 29.5 94.0 19.2 20.8 31.4 45.1
B=128 3.5 8.2 59.0 188.1 23.1 26.2 45.3 71.1
B=512 14.2 32.8 236.3 752.3 41.2 53.4 127.9 227.3
FLANC B=16 1.8 2.2 5.2 10.9 17.9 18.4 22.0 28.2
B=64 7.4 9.0 21.1 43.8 22.8 24.8 38.3 55.2
B=128 14.9 18.1 42.2 87.7 30.1 34.1 58.4 90.9
B=512 59.9 72.7 169.1 350.9 69.6 84.0 179.6 305.4
pFedGate B=16 0.6 1.2 7.6 23.7 27.6 28.1 31.0 36.4
B=64 2.7 5.0 30.5 95.0 32.4 33.9 44.6 58.2
B=128 5.5 10.1 61.0 190.0 40.8 43.9 63.0 88.8
B=512 22.1 40.7 244.1 760.1 68.0 80.2 154.7 254.1
Pa3dFL B=16 0.47 1.0 8.0 27.3 17.1 17.6 21.3 28.6
B=64 1.8 4.2 30.2 97.8 19.4 21.0 32.4 48.0
B=128 3.6 8.3 59.9 191.9 23.3 26.5 46.3 74.8
B=512 14.5 33.3 237.9 756.1 41.1 53.9 128.9 231.0

E.1 Full Efficiency Comparison

We provide a full comparison of the computation efficiency of different methods under varying batch sizes in Table 8. With a small batch size and a small reduction ratio, the reductions in both FLOPs and peak memory of Pa3dFL are closest to those of p-width. We notice that when the batch size is small and the reduction ratio is large, pFedGate achieves a larger reduction in computation costs. However, since clients with large reduction ratios usually incur higher computation costs anyway, it is more valuable to save costs for clients with small reduction ratios than for those with large ones.

Figure 6: Test accuracy vs. communication rounds under the UNI setting (top row) and the IDL setting (bottom row)

E.2 Learning Curve

We visualize the learning curves of different methods in Fig. 6. In the Hetero. setting, Pa3dFL consistently outperforms the other methods on CIFAR10 and CIFAR100 and achieves competitive results against the SOTA method Fjord. In the Ideal setting, Pa3dFL still achieves outstanding performance regardless of the data distribution. We also notice that our method has not yet converged after 2000 rounds on CIFAR100 while already achieving the highest performance among all methods, indicating room to accelerate its convergence and further enhance its performance. We will investigate this issue in future work.

E.3 Impact of Regularization

Figure 7: Test accuracy vs. regularization coefficient
Figure 8: Test accuracy vs. HN depth

We study the impact of the regularization term in Fig. 7 by varying the coefficient \lambda on the three datasets for both the Ideal and the Hetero. case. On FashionMNIST, a small regularization coefficient is preferred; we infer that this is because the knowledge is already well organized in the model parameters. The impact of different coefficient values is not obvious on CIFAR10, and the coefficient needs to be carefully tuned in practice. These results also suggest that \lambda=1e-3 is a proper default choice in the absence of prior knowledge.

Table 9: Generating/optimization time cost (s) of hyper-networks with different numbers of layers for 20 clients.
1-layer 2-layer 3-layer
CNN-CIFAR100 0.01/0.26 0.01/0.27 0.01/0.30
ResNet18-CIFAR10 0.03/0.81 0.03/0.85 0.03/0.87

Appendix F Computation Cost of HN

We evaluate the computation cost of generating personal parameters with the HN and of optimizing the HN on CIFAR10 (i.e., ResNet18) and CIFAR100 (i.e., the 2-layer CNN) in Table 9, using a (64,64,layers) HN. The time costs of parameter generation and hyper-network optimization are both very small compared to local training. In addition, scaling up the model brings only limited additional cost, indicating the practicality of the HN.

Appendix G Impact of HN depth

Generally, the clients' model performance benefits from a deeper HN, as shown in Fig. 8.

Appendix H Limitation

There are several limitations to our work. First, we only evaluate Pa3dFL on two commonly used operators, i.e., the convolution operator and the linear operator; a broader range of operators, such as attention, is not discussed in this work. However, Pa3dFL can be extended to other operators since we compress the model at a fine-grained matrix level, and we will investigate extensions to other operators (e.g., attention, RNN) in future work. Second, we have only validated the effectiveness of the proposed method on CV models and datasets, and its performance on other types of tasks remains unexplored. In addition, Pa3dFL's performance and convergence speed should be further improved when there are no capacity constraints. Finally, we manually search for a proper hyper-network architecture for each task, which is tedious and introduces a large computation cost in architecture search. Nevertheless, we confirm the effectiveness of generating personalized parameters with hyper-networks. We will address these issues in future work.