

pMixFed: Efficient Personalized Federated Learning through Adaptive Layer-Wise Mixup

Yasaman Saadati1,2, Mohammad Rostami3, and M. Hadi Amini1,2
1 Knight Foundation School of Computing and Information Sciences, Florida International University
2 Sustainability, Optimization, and Learning for InterDependent networks laboratory (solid lab)
3 University of Southern California
[email protected] , [email protected] , [email protected]
Abstract

Traditional Federated Learning (FL) methods encounter significant challenges when dealing with heterogeneous data and providing personalized solutions for non-IID scenarios. Personalized Federated Learning (PFL) approaches aim to address these issues by balancing generalization and personalization, often through parameter decoupling or partial models that freeze some neural network layers for personalization while aggregating other layers globally. However, existing methods still face challenges of global-local model discrepancy, client drift, and catastrophic forgetting, which degrade model accuracy. To overcome these limitations, we propose pMixFed, a dynamic, layer-wise PFL approach that integrates mixup between shared global and personalized local models. Our method introduces an adaptive strategy for partitioning between personalized and shared layers, a gradual transition of personalization degree to enhance local client adaptation, improved generalization across clients, and a novel aggregation mechanism to mitigate catastrophic forgetting. Extensive experiments demonstrate that pMixFed outperforms state-of-the-art PFL methods, showing faster model training, increased robustness, and improved handling of data heterogeneity under different heterogeneous settings. Our code is available for reproducing our results: https://github.com/YasMinSdt/pMixFed.

1 Introduction

Figure 1: Discrepancy between personalized and global shared layers in partial PFL: (1) The global model $G^{t}$ is constructed by aggregating asynchronous local updates from clients, denoted $L^{t}_{i}$, $L^{t}_{j}$, and $L^{t}_{k}$. (2), (3) In communication round $t$, available clients $i$ and $j$ aggregate shared parameters to produce the updated global model $G^{t+1}$, while the personalized parameters of unavailable clients, such as $L^{t}_{k}$, remain unchanged. (4) Integrating the distinct models $G^{t+1}$ and $L^{t}_{k}$ induces inconsistencies in the overall model updates. (Bottom) During joint training of the generalized and personalized models, gradient updates from the generalized layers are affected by gradients from the personalized layers, resulting in catastrophic forgetting, performance drops, and slower convergence.

The goal in federated learning (FL) [22] is to facilitate collaborative learning of several machine learning (ML) models in a decentralized scheme. FL requires addressing data privacy, catastrophic forgetting, and the client drift problem (a phenomenon where the global model fails to serve as an accurate representation because local models gradually drift apart due to high data heterogeneity) [38, 15, 43, 31, 36]. Existing FL methods cannot address all these challenges with non-IID (non-Independent and Identically Distributed) data. For instance, although FedAvg [32] demonstrates strong generalization performance, it fails to provide personalized solutions for a cohort of clients with non-IID datasets. Hence, the global model, or one “average client” in FedAvg, may not adequately represent all individual local models in non-IID settings due to client drift [53]. Personalized FL (PFL) methods handle data heterogeneity by considering both generalization and personalization during the training stage. Since there is a trade-off between generalization and personalization in heterogeneous environments, PFL methods leverage heterogeneity and diversity as advantages rather than adversities [35, 47]. A group of PFL approaches train personalized local models on each device while collaborating toward a shared global model. Partial PFL, also known as parameter decoupling, uses partial model sharing: only a subset of the model is shared, while the remaining parameters are kept “frozen” to balance generalization and personalization until the subsequent round of local training.

While partial PFL methods are effective in mitigating catastrophic forgetting, strengthening privacy, and reducing computation and communication overhead [34, 45], several challenges remain unaddressed. First, the question of when, where, and how to optimally partition the full model remains unresolved. Recent studies [34, 45] have shown that there is no “one-size-fits-all” solution; the optimal partitioning strategy depends on factors such as the task type (e.g., next-word prediction or speech recognition) and the local model architecture. An improper partitioning choice can lead to issues such as underfitting, overfitting, increased bias, and catastrophic forgetting. Some studies [29] suggest that personalized layers should reside in the base layers, while others [9, 4] argue that the base layers contain more generalized information and should be shared. Further, using a fixed partitioning strategy across all communication rounds for heterogeneous clients can limit the efficacy of collaborative learning. For instance, if a client’s performance suddenly drops due to new incoming data, the partitioning strategy should be changed because the client requires more frozen layers. Another issue is catastrophic forgetting of previously shared global knowledge after only a few rounds of local training, because the shared global model can be completely overwritten by local updates, leading to generalization degradation [31, 41, 15, 54]. Most importantly, partial models may converge more slowly than full model personalization, as frozen local model updates can diverge in the opposite direction from the globally shared model. Since the generalized and personalized models are trained on non-IID datasets, there may also be a domain shift, leading to model discrepancy as depicted in Figure 1. These discrepancies arise from variations in local and global objective functions, differences in initialization, and asynchronous updates [56, 24]. As a result, merging the shared and personalized layers can disrupt information flow within the network, impede the learning process, and lead to slower convergence or accuracy drops in partial PFL models such as FedAlt and FedSim [34] (more details are discussed in Section 4). Further, while partial PFL techniques improve overall training accuracy, they can reduce test accuracy on some devices, particularly those with limited samples, leading to variations in performance levels [34]. Hence, there is a need for novel solutions to achieve the following goals in PFL:

  • Dynamic and Adaptive Partitioning: The balance between shared and personalized layers should be dynamically and adaptively adjusted for each client during every communication round, rather than relying on a static, fixed partitioning strategy for all participants.

  • Gradual Personalization Transition: The degree of personalization should transition gradually across layers, as opposed to an “all-or-nothing” approach that employs strict partitioning or hard splits within the model, as discussed in Figure 1. This allows nuanced adaptation to individual client needs under heterogeneity.

  • Improved Generalization Across All Clients: The average personalization accuracy should be such that the global model is unbiased toward specific subsets of clients.

  • Mitigation of Catastrophic Forgetting: The strategy should address the catastrophic forgetting problem by incorporating mechanisms to strengthen the generalization and retain the state of the previous global model when updating the global model in aggregation.

  • Scalability and Adaptability: The approach should be fast, scalable, and easily adaptable to new cold-start clients while accounting for model/device heterogeneity.

To achieve the above, we propose “pMixFed”, a layer-wise, dynamic PFL approach that integrates Mixup [61] between the shared global and personalized local models’ layers during both the broadcasting (global model sharing with local clients) and aggregation (aggregating distributed local models to update the global model) stages within a partial PFL framework. Our main contributions include:

  • We develop an online, dynamic interpolation method between local and global models using Mixup [57] that effectively addresses data heterogeneity and scales across varying cohort sizes, degrees of data heterogeneity, and diverse model sizes and architectures.

  • Our solutions facilitate a gradual increase in the degree of personalization across layers, rather than relying on a strict cut-off layer, helping to mitigate client drift.

  • We introduce a new fast and efficient aggregation technique which addresses catastrophic forgetting by keeping the previous global model state.

  • “pMixFed” reduces the participant gap (test accuracy for cold-start users) and the out-of-sample gap (test accuracy on unseen data) caused by data heterogeneity through linear interpolation between client updates, thereby mitigating the impact of client drift.

2 Related Work

PFL aims to adapt local models to the individual needs, preferences, and contexts of each participant. Since the inception of FL, various PFL approaches have been explored to address the challenges stemming from heterogeneity, categorized into three areas: 1) data-centric strategies, 2) global model adaptation, and 3) local model personalization. The following provides a more detailed discussion of these three approaches:
Data-centric: These approaches in PFL aim to address data heterogeneity and class imbalances by manipulating statistical data properties, such as data size, distribution, and local data selection [47]. Examples of these techniques include data normalization, feature engineering, data augmentation, synthetic data generation, and client selection. Representative methods, such as Astraea [11], P2P k-SMOTE [50], FedMCCS [1], FedAug [19], [63], and FedHome [52], exemplify data-centric approaches. The shortcoming of these methods is that they often modify the natural distribution and statistical characteristics of federated data, potentially injecting bias or eliminating valuable, rare information. Additionally, many of these strategies depend on the availability of proxy data on the global server, which may conflict with the privacy regulations of some entities.
Global Model Adaptation: In these approaches, a single global model is maintained on the server and subsequently adapted to individual local models in the following phase. The primary goals of these methods are: 1) learning a robust, generalized model and 2) enabling fast, efficient local model adaptation. Several techniques employ regularization terms to mitigate client drift and prevent model divergence during local updates, e.g., FedProx [27], SCAFFOLD [21], FL-MOON [26], and FedCurv [42]. Other approaches leverage meta-learning, such as PerFedAvg [12], pFedMe [46], and pFedHN [39], or transfer learning, e.g., FedSteg [55] and DPFed [58], yet these introduce their own challenges. Meta-learning methods can be computationally intensive, while regularization techniques may add computational overhead by incorporating additional terms in the objective function. Similarly, transfer learning algorithms can be inefficient in terms of communication overhead and often require a public dataset to enhance the global model on the server. A common limitation across most adaptation techniques is the need for a uniform model architecture across all clients, requiring devices with varying computational capabilities to use the same model size.
Local Model Personalization: The limitations of existing PFL methods have led to the development of local model personalization approaches that focus on training customized models for each client. Some approaches utilize multi-task learning (MTL), a collaborative framework that facilitates information exchange across distinct tasks (e.g., FedRes [2], MOCHA [44], FedAMP [16], and Ditto [28]). Another category leverages knowledge distillation (KD) to support personalization when client-specific training objectives differ, as seen in FedDF [30], FedMD [25], FedGen [64], and FedGKT [14]. However, both MTL and KD can incur significant computational and communication overhead, limiting their scalability for large-scale FL deployments across resource-constrained devices.

Partial PFL: These methods, also known as parameter decoupling, are the primary focus of this paper. These methods mitigate catastrophic forgetting by retaining shared components while freezing other layers for personalization. FedPer [4] introduced partial models in FL, sharing only initial layers with generalized information and reserving final layers for personalization. FedBABU [33] divides the network into a shared body and a frozen head with fully connected layers. Other frameworks, such as FURL [6] and LG-FedAvg [29], apply partial PFL by retaining private feature embeddings or using compact representation learning for high-level features, respectively. Two baseline methods in this paper are FedAlt [43] and FedSim [34]. FedAlt uses a stateless FL paradigm, reconstructing local models from the global model, while FedSim synchronously updates shared and local models with each iteration.
Limitations: These methods face challenges such as model update discrepancies (as shown in Figure 1) and catastrophic forgetting, where shared layers may undergo significant changes after only a few rounds of local training, resulting in sudden accuracy drops. Additionally, users with high personalization accuracy may freeze more layers than cold-start users or unreliable participants, who should rely more heavily on the global model. These challenges have motivated our development of a dynamic, adaptive layer-wise approach to balance generalization and personalization across clients, allowing tuning at different communication rounds to accommodate varying performance conditions.

3 Problem Formulation

Consider $K$ collaborating clients (or agents), each trying to optimize a local loss function $F_{k}(\theta)$ on the distributed local dataset $\mathcal{D}_{k}=\{(x_{k},y_{k})\}$, where $(x,y)$ denote the data features and the corresponding labels, respectively. Since the agents collaborate, the parameters $\theta$ (parameters of the global model) are shared across the agents. A basic FL objective function aims to optimize the overall global loss:

$$\min_{\theta}F(\theta)=\sum_{k=1}^{K}\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}F_{k}(\theta),\qquad F_{k}(\theta)=\frac{1}{|\mathcal{D}_{k}|}\sum_{(x_{i},y_{i})\in\mathcal{D}_{k}}\ell(f_{k}(\theta,x_{i}),y_{i}) \qquad (1)$$

where $|\mathcal{D}|=\sum_{k=1}^{K}|\mathcal{D}_{k}|$ and $F(\theta)$ is the global loss function of FedAvg [32]. FL is performed in an iterative fashion. At each round, each client downloads the current version of the global model and trains it using their local data. Clients then send the updated model parameters to the central server. The central server aggregates the model updates from the selected clients to update the global model. Iterations continue until convergence.
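For concreteness, the weighted aggregation in Eq. (1) can be sketched as follows; this is a minimal illustration assuming PyTorch-style state_dicts, with function and variable names of our own choosing rather than the released implementation:

```python
import copy

def fedavg_aggregate(local_states, local_sizes):
    """Weighted average of client state_dicts with weights |D_k| / |D|, as in Eq. (1)."""
    total = float(sum(local_sizes))
    global_state = copy.deepcopy(local_states[0])
    for key in global_state:
        global_state[key] = sum(
            (size / total) * state[key].float()
            for state, size in zip(local_states, local_sizes)
        )
    return global_state
```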

In Equation 1, the assumption is that the data is drawn from an IID distribution and that all clients train the exact same model. However, this assumption does not hold in many practical FL settings due to the non-IID nature of data and resource limitations [18, 17]. In PFL settings with high heterogeneity and non-IID data distributions, the same issue persists, and the local parameters need to be customized for each agent. PFL extends FL by solving the following [26, 34]:

$$\min_{\theta,\theta_{k}}\sum_{k=1}^{K}\frac{1}{|\mathcal{D}_{k}|}\left(\mathcal{F}_{k}(\theta_{k})+\alpha_{k}\|\theta_{k}-\theta\|^{2}\right). \qquad (2)$$

PFL explicitly handles data heterogeneity through the term $\mathcal{F}_{k}(\theta_{k})$, which accounts for model heterogeneity by considering a personalized parameter $\theta_{k}$ for client $k$. Meanwhile, $\theta$ represents the shared global model parameters in Equation 2, and $\alpha_{k}$ acts as a regularizer that controls the degree of personalization, tuning the collaboration between the personalized local models $\theta_{k}$ and the generalized global model $\theta$. When $\alpha_{k}$ is small, the personalization power of the local models increases; when $\alpha_{k}$ is large, the local model parameters tend to stay closer to the global model parameters $\theta$.
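To make the local objective concrete, a minimal sketch of one client's regularized loss in Eq. (2) is given below, with the proximal term $\alpha_{k}\|\theta_{k}-\theta\|^{2}$ added to the local training loss; the model, loss, and argument names are illustrative assumptions rather than the authors' implementation:

```python
import torch.nn.functional as F

def personalized_local_loss(local_model, global_params, batch, alpha_k):
    """Client loss of Eq. (2): local cross-entropy plus alpha_k * ||theta_k - theta||^2."""
    x, y = batch
    ce = F.cross_entropy(local_model(x), y)
    prox = sum(
        ((p_local - p_global.detach()) ** 2).sum()
        for p_local, p_global in zip(local_model.parameters(), global_params)
    )
    return ce + alpha_k * prox
```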

3.1 Partial Personalized Model

The limitations of full model personalization methods with global and fully independent local models are discussed in Section 2. Partial PFL methods improve personalization by providing more flexibility: clients choose which parts of their models to personalize based on their specific needs and constraints. Let $L_{k}^{t}$ be the partial local model of client $k$ in round $t$, partitioned into two parts $\langle L_{l,k}^{t};L_{g,k}^{t}\rangle$, where $l,g\subseteq\{1,\dots,M\}$ index the personalized and global layers, respectively, and $M$ is the number of layers. We can integrate both personalized and generalized layers in a local model $L_{k}$ as:

$$\mathcal{F}_{k}(\theta_{k})=\ell(f_{k}(\langle L_{g,k};L_{l,k}\rangle,x_{k}),y_{k}) \qquad (3)$$

Among the different partitioning strategies for partial PFL [34], the most popular technique is to assign the final layers as the local personalized layers $L_{l,k}^{t}$ and let the base layers $L_{g,k}^{t}$ share knowledge, similar to FedPer [4]. This choice aligns with insights from MAML (the Model-Agnostic Meta-Learning algorithm), which suggest that initial layers keep general, broad information while personalized characteristics manifest prominently in the higher layers. Accordingly, we would have:

$$\mathcal{F}_{k}(\theta_{k})=L_{l}^{(t)}(L_{g}^{(t)}(x_{k}))\;\xrightarrow{\text{local update}}\;L^{\prime}_{l}(L^{\prime}_{g}(x_{k})),$$
$$L^{\prime}_{l}(L^{\prime}_{g}(x_{k}))\;\xrightarrow{\text{broadcasting}}\;L_{l}^{(t+1)}(G^{(t+1)}(x_{k}))$$

For simplicity, $L_{g}=L_{g,k}$ and $L_{l}=L_{l,k}$, where $1\leq g\leq s\leq l\leq M$ and $s$ is the split (cut) layer. The objective in solving Equation 3.1 is to find the optimal cut layer $s$ that minimizes the personalization objective $\sum_{k=1}^{K}\frac{1}{|\mathcal{D}_{k}|}\mathcal{F}_{k}(\theta_{k})$, with $\mathcal{F}_{k}(\theta_{k})=L_{l}^{(t)}(L_{g}^{(t)}(x_{k}))$.

In partial models, after several rounds of local training, both the personalized and global layers of the local model are updated. This update can be synchronous, as in FedSim, or asynchronous, as in FedAlt. The personalized layers are frozen until the next communication round, $L_{l}^{(t+1)}=L^{\prime}_{l}$, and the global layers are sent to the server for global model aggregation: $G^{(t+1)}\leftarrow\sum_{k=1}^{K}\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}L^{\prime}_{g}(x_{k})$. In the next broadcasting phase, the shared layers of the local model are updated as $L_{g}^{(t+1)}\leftarrow G^{(t+1)}$. Our goal is to benefit from Mixup to improve personalization.
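As an illustration of this hard-split scheme (the baseline behavior, not the pMixFed rule introduced later), the sketch below overwrites only the shared base layers of a local model with the aggregated global layers; identifying shared layers by parameter-name prefixes is a simplifying assumption of ours:

```python
def split_state(state_dict, shared_prefixes):
    """Split a model state_dict into shared (base) and personalized (head) parts."""
    shared = {k: v for k, v in state_dict.items()
              if any(k.startswith(p) for p in shared_prefixes)}
    personal = {k: v for k, v in state_dict.items() if k not in shared}
    return shared, personal

def broadcast_hard_split(local_state, global_shared):
    """L_g^{(t+1)} <- G^{(t+1)}: overwrite only the shared layers of the local model."""
    local_state.update(global_shared)
    return local_state
```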

4 The Proposed Method

4.1 Mixup

Mixup is a data augmentation technique for enhancing model generalization [61] based on learning to generalize on linear combinations of training examples. Variations of Mixup have consistently excelled in vision tasks, contributing to improved robustness, generalization, and adversarial privacy. Mixup creates augmented samples as:

$$\bar{x}=\lambda\,x_{i}+(1-\lambda)\,x_{j},\qquad\bar{y}=\lambda\,y_{i}+(1-\lambda)\,y_{j}, \qquad (4)$$

where $x_{i}$ and $x_{j}$ are two input samples, $y_{i}$ and $y_{j}$ are the corresponding labels, and $\lambda\sim\text{Beta}(\alpha,\alpha)$, $\lambda\in[0,1]$, is the degree of interpolation between the two samples. Mixup relates data points belonging to different classes, which has been shown to mitigate overfitting and improve model generalization [49, 13, 62]. Many variants of Mixup have been developed to address specific challenges and enhance its effectiveness. For instance, AlignMixup [48] improves local spatial alignment by introducing transformations that better preserve semantic consistency between input pairs. Manifold Mixup [49] extends the concept to hidden-layer representations, acting as a powerful regularization technique by training deep neural networks (DNNs) on linear combinations of intermediate features. CutMix [60] replaces patches between images, blending visual information while retaining spatial structure. Remix [7] addresses class imbalance by assigning higher weights to minority classes during the mixing process, enhancing the robustness of the trained model on imbalanced datasets. AdaMix [13] dynamically optimizes the mixing distributions, reducing overlaps and improving training efficiency. This linear interpolation also serves as a regularization technique that shapes smoother decision boundaries, thereby enhancing the ability of a trained model to generalize to unseen data. Mixup can also increase robustness against adversarial attacks [62, 5] and improves performance against noise, corrupted labels, and uncertainty, as it relaxes the dependency on specific information [13].
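A minimal sketch of input-level mixup (Eq. (4)) on a mini-batch is shown below, assuming one-hot labels and a single $\lambda$ per batch as in [61]; function and argument names are ours:

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.2):
    """Mix each example with a randomly permuted partner, following Eq. (4)."""
    lam = np.random.beta(alpha, alpha)      # lambda ~ Beta(alpha, alpha), in [0, 1]
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```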

Figure 2: Workflow of pMixFed. Mixup is used in two stages. 1) Broadcasting: when transferring knowledge to local models, the frozen personalized model $L_{k}^{(t)}$ is mixed up with the global model $G^{(t)}$ according to the adaptive mix factor $\mu_{k}^{(t)}$, which determines the layer-wise mixup degree $\lambda_{i}$ for layer $i$. 2) Aggregation: the updated global model $G^{(t+1)}$ is generated through a Mixup between the updated local model $L^{(t+1)}$ and the current shared global model $G^{\prime(t)}$. Consequently, a state of the previous global model $G^{\prime(t)}$ is retained, which aids in mitigating catastrophic forgetting.

4.2 pMixFed: Partial Mixed-up Personalized Federated Learning

Our goal is to leverage the well-established benefits of Mixup in the context of personalized FL. While Mixup has previously been employed in FL frameworks, such as XORMixup [40], FEDMIX [57], and FedMix [51], prior studies have primarily focused on using Mixup for data augmentation or data averaging. We propose pMixFed by integrating Mixup in the model parameter space rather than in the feature space. We apply Mixup between the parameters of the global and local models in a layer-wise manner for more customized and adaptive PFL. Our approach eliminates the need for static and rigid partitioning strategies. Specifically, during both the broadcasting and aggregation stages of our partial PFL framework, we generate mixed model weights using an interpolation strategy, illustrated in Figure 2. Mixup offers flexibility in combining models by introducing a mix degree $\lambda_{i}$ for each layer, which changes gradually according to the mix factor $\mu$. The parameter $\mu$ is also updated adaptively in each communication round and for each client according to the test accuracy of the global and local models during the evaluation phase of FL. $\mu$ is computed as follows:

$$\mu_{k}^{t}=1-\frac{1}{1+e^{-\delta(Acc^{b}-1)}} \qquad (5)$$

Here $\delta=t/T$, where $t$ and $T$ are the current communication round and the total number of communication rounds, respectively. In the broadcasting phase, $Acc$ is calculated as $Acc=Acc_{k}^{t}/Acc_{overall}^{(t-1)}$; in the aggregation stage, it is the average test accuracy of the previous global model $G^{t}(x,\theta^{t})$ on all local test sets $x=\{x_{1},x_{2},\ldots,x_{K}\}$. The offset parameter $b$ of the sigmoid function is set to $2$ after several experiments. The update rule of $\mu$ is discussed in more detail in Section 5, and the reasoning behind this formulation is given in Appendix 8.
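A sketch of this adaptive mix factor is given below; the accuracy ratio passed in corresponds to $Acc$ in Eq. (5), the offset $b=2$ follows the paper, and the function and argument names are ours:

```python
import math

def mix_factor(acc, t, T, b=2.0):
    """mu_k^t = 1 - sigmoid(delta * (Acc^b - 1)) with delta = t / T, as in Eq. (5)."""
    delta = t / T
    return 1.0 - 1.0 / (1.0 + math.exp(-delta * (acc ** b - 1.0)))
```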

As shown in Figure 2, Mixup is applied in two distinct stages of FL. First, when transferring shared knowledge to local models, the local model $L_{k}$ is mixed up with the current global model $G$ according to the dynamic mix factor $\mu$, which determines the change ratio of $\lambda_{i}$ (the layer-wise Mixup degree in Eq. (4)) across different layers. $\lambda_{i}$ gradually decreases from $1$ to $0$ as we move from the base layer to the head. $\lambda_{i}=1$ means sharing 100% of the global model, while $\lambda_{i}=0$ means that the corresponding layer of the local model is frozen and will not be mixed up with the global model $G$. The Mixup degree $\lambda_{i}$ of layer $i$ at both the broadcasting and aggregation stages is computed as follows:

$$\text{Broadcasting stage:}\quad\lambda_{i}=\begin{cases}1&\lambda_{i}>1\\ \mu\,(n-i)&\lambda_{i}\leq 1\end{cases}\qquad\qquad \text{Aggregation stage:}\quad\lambda_{i}=\begin{cases}0&\lambda_{i}\leq 0\\ 1-(i\,\mu)&\lambda_{i}>0\end{cases} \qquad (6)$$

where $n$ is the total number of layers and $i$ is the current layer index, running from the base layer to the head as $i=0\rightarrow i=n$. $\mu$ is the mix factor, which is adaptively updated in each communication round $t$ and for each local model $k$ according to Eq. 5.
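The schedule in Eq. (6) can be sketched as below, clipping $\lambda_{i}$ to $[0,1]$; indexing from the base layer ($i=0$) to the head ($i=n$) follows the text, and the function name is ours:

```python
def lambda_schedule(mu, n, stage="broadcast"):
    """Layer-wise mixup degrees lambda_i of Eq. (6), clipped to [0, 1]."""
    lambdas = []
    for i in range(n + 1):                  # i = 0 (base) ... n (head)
        if stage == "broadcast":
            lam = min(1.0, mu * (n - i))    # lambda_i = mu * (n - i), capped at 1
        else:                               # aggregation stage
            lam = max(0.0, 1.0 - i * mu)    # lambda_i = 1 - i * mu, floored at 0
        lambdas.append(lam)
    return lambdas
```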

Algorithm 1 pMixFed: Broadcasting global to local model
1: Input: initial global model $GM^{(0)}$, local models $\{LM^{(0)}_{i}\}_{i=1,\ldots,M}$, number of communication rounds $T$, number of devices per round $m$, number of layers in local models $\{L_{i}\}$
2: for $t=0,1,\ldots,T-1$ do
3:    Server selects $K$ devices $S(t)\subset\{1,\ldots,N\}$
4:    Update $\mu$ for each $LM_{i}$, $i=\{1,\ldots,K\}$
5:    Server broadcasts $GM^{(t)}$ to each device in $S(t)$
6:    for each device $m\in S(t)$ in parallel do
7:        $(LM^{(t+1)}_{m},GM^{(t+1)}_{m})=\text{Mixup}[(LM^{\prime(t+1)}_{m},GM^{\prime(t+1)}_{m}),\mu]$
8:        Device sends $GM^{(t+1)}_{m}$ back to the server
9:        Update $\mu_{i=1,\ldots,K}$
10:    end for
11:    Server updates $GM^{(t+1)}=\frac{1}{K}\sum_{m\in S(t)}GM^{(t+1)}_{m}$
12: end for
Algorithm 2 Proposed Mixup for Aggregation
1: Input: initial global model $GM^{(0)}$, number of communication rounds $T$, number of local iterations $Itr$, number of devices per round $m$, Mixup degree $\lambda$
2: for $t=0,1,\ldots,T-1$ do
3:    Server broadcasts $GM^{(t)}$ to each device in $S(t)$: $LM_{i}^{(t)}$
4:    Update $\lambda_{i=1,\ldots,M}$ for each device
5:    for each device $i\in S(t)$ do
6:        for $epoch=0,1,\ldots,Itr-1$ do
7:            Train local model: $LM_{i}^{(t+1)}=GM^{(t)}$
8:            $GM^{(t+1)}=\text{Mixup}(LM_{i}^{(t+1)},GM^{(t)})$
9:        end for
10:        Device sends $GM^{(t+1)}$ back to the server
11:        Adaptively update $\lambda_{i=1,\ldots,M}$
12:    end for
13: end for

4.2.1 Broadcasting : Global to local model transfer

This stage involves sharing global knowledge with local clients. In existing PFL methods, the same weight allocation is typically applied to every heterogeneous local model. In our work, we personalize this process by allowing each local model to select the proportion of layers it requires. For instance, a cold-start user should extract more information from the shared knowledge model, implying that only a few layers should be frozen for personalization. Accordingly, a history of the previous global model $G^{t}$ remains, which helps against catastrophic forgetting of the generalized model. Additionally, we introduce a gradual update procedure in which the value of $\lambda$ decreases from one (indicating fully shared layers) at the base layer toward the end of the network, based on the mix factor $\mu$. The mix factor is adaptively updated in each communication round for each client individually, according to personalization accuracy. With this adaptive and flexible approach, not only can upcoming unseen streaming data be managed, but the participation gap (test accuracy of new cold-start users) is also improved. The broadcasting phase is illustrated in Algorithm 1, which is inspired by [20]. The update rule of the local model $L_{k}^{(t)}$ is as follows:

$${L^{\prime}}^{(t)}_{k,i}=\lambda^{(t)}_{k,i}\,G^{(t)}_{i}+(1-\lambda^{(t)}_{k,i})\,L^{(t)}_{k,i},\qquad L_{k}^{(t+1)}={L^{\prime}}^{(t)}_{k}-\eta\nabla F_{k}({L^{\prime}}^{(t)}_{k}). \qquad (7)$$
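A sketch of the first line of Eq. (7) is shown below: each layer of the local model is interpolated with the corresponding global layer before local training, after which the client takes gradient steps from the mixed weights (second line of Eq. (7)). Iterating over state_dict entries in layer order is an implementation assumption, and the names are ours:

```python
def broadcast_mixup(local_state, global_state, lambdas):
    """L'_{k,i} = lambda_i * G_i + (1 - lambda_i) * L_{k,i}, first line of Eq. (7)."""
    mixed = {}
    for i, key in enumerate(local_state):       # assumes keys follow layer order
        lam = lambdas[min(i, len(lambdas) - 1)]
        mixed[key] = lam * global_state[key] + (1 - lam) * local_state[key]
    return mixed
```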

4.2.2 Aggregation: Local to Global Model Transfer

Existing methods primarily categorize layers into two types: personalized layers and generalized layers. The updated global layers from different clients, $G_{k}^{(t+1)}$, are typically aggregated using Equation 3, which can lead to catastrophic forgetting. Since the base layers of the global model serve as the backbone of shared knowledge [37], this issue arises because the generalized layers undergo substantial modifications during each local update [34, 33]. When the global model is updated by simply aggregating local models, valuable information from previously shared knowledge may be lost, leading to forgetting, even if $G^{t}$ performs better than the newly aggregated model. To address this challenge, we propose a new strategy that applies Mixup between the previous global model and each client's updated local model before aggregation. For each client $i$, the mixup coefficient $\lambda_{i}$ gradually increases from 0 to 1, moving from the head to the base layer, controlled by the mix factor $\mu_{i}$. Additionally, the mix factor is adaptively updated based on the communication round and the generalization accuracy, ensuring robust integration of shared and personalized knowledge. It should be noted that the parameter $\mu$ is constant in the aggregation stage for all clients, as it depends on the average performance of the previous global model.

$${G^{\prime}}^{(t)}_{k,i}=\lambda^{(t)}_{k,i}\,G^{(t)}_{i}+(1-\lambda^{(t)}_{k,i})\,L^{(t+1)}_{k,i},\qquad G^{(t+1)}=\sum_{k=1}^{K}\frac{|D_{k}|}{\sum_{k=1}^{K}|D_{k}|}\,{G^{\prime}}^{(t)}_{k}. \qquad (8)$$
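A sketch of Eq. (8): each client's updated model is first mixed layer-wise with the retained previous global model, and the mixed models are then averaged with weights $|D_{k}|/\sum_{k}|D_{k}|$. The variable names are illustrative, not the released implementation:

```python
def aggregate_mixup(prev_global, local_states, local_sizes, lambdas_per_client):
    """Per-client mixing (first line of Eq. (8)) followed by weighted averaging."""
    total = float(sum(local_sizes))
    new_global = {key: 0.0 for key in prev_global}
    for state, size, lambdas in zip(local_states, local_sizes, lambdas_per_client):
        for i, key in enumerate(prev_global):
            lam = lambdas[min(i, len(lambdas) - 1)]
            mixed = lam * prev_global[key] + (1 - lam) * state[key]
            new_global[key] = new_global[key] + (size / total) * mixed
    return new_global
```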

The high-level block-diagram visualization of the proposed method is shown in Figure 2. It is important to note that the sizes of the local models $LM^{(0)}_{i}$ can differ from each other. Consequently, the size of $GM^{(0)}$ should be greater than the maximum size of the local models. The parameter $\lambda$ in Equation 4 determines the Mixup degree between the shared model $GM$ and the local models $LM^{(\cdot)}_{i}$, while $\mu$ governs the slope of the change in $\lambda$ across different layers. The degree of Mixup gradually decreases according to the parameter $\mu$ from $1$ to $0$: $\lambda=1$ for the first base layer, indicating total sharing, while $\lambda=0$ applies to the final layer, which represents no sharing. The underlying concept is that the base layer contains more general information, whereas the final layers retain client-specific information. The use of the parameter $\mu$, relative to the number of local layers, eliminates the need for a specified cut layer and allows application across different model sizes and numbers of layers. The parameter $\mu_{i}$ is adaptively updated based on the personalized and global model accuracy for each client. Algorithm 1 shows how Mixup is used in the broadcasting stage between the server and individual clients, while Algorithm 2 shows how Mixup is employed as an aggregation technique between the clients and the server. In each training round, each client's model is individually mixed up with the global model, and $\lambda$ is adaptively learned based on the objective function using online learning. For a theoretical convergence analysis of pMixFed, please refer to Appendix 9.

5 Experiments

Table 1: Comparison of state-of-the-art methods using the MobileNet model. Columns report personalized test accuracy on CIFAR-10 and CIFAR-100 for N=100 and N=10 clients, each with participation rates C=10% and C=100%.
Method | CIFAR-10: N=100 (C=10%, C=100%), N=10 (C=10%, C=100%) | CIFAR-100: N=100 (C=10%, C=100%), N=10 (C=10%, C=100%)
FedAvg 26.93 25.61 15.64 19.55 4.50 4.54 17.65 21.24
FedAlt 39.36 46.06 39.60 42.79 27.85 19.07 48.30 14.31
FedSim 51.12 45.43 40.30 35.43 28.18 19.52 47.41 48.30
FedBaBU 28.70 25.98 17.58 20.74 4.63 4.72 17.46 17.57
Ditto 26.58 62.77 18.60 43.98 7.00 16.07 16.89 20.74
Per-FedAvg 34.30 43.59 19.52 38.29 24.04 30.65 16.27 37.20
Lg-FedAvg 34.52 34.17 48.88 58.51 5.65 5.73 31.89 35.78
pMixFed 69.94 72.42 54.90 74.62 45.62 56.63 54.71 58.25
Table 2: Comparison of state-of-the-art methods using the CNN model. Columns report personalized test accuracy on CIFAR-10, CIFAR-100, and MNIST for N=100 and N=10 clients with the indicated participation rates C.
Method | CIFAR-10: N=100 (C=10%, C=100%), N=10 (C=10%, C=100%) | CIFAR-100: N=100 (C=10%, C=100%), N=10 (C=10%, C=100%) | MNIST: N=100 (C=100%), N=10 (C=100%)
FedAvg 54.78 56.82 44.11 54.37 25.77 26.73 34.47 39.93 97.54 98.59
FedAlt 56.41 56.77 69.50 64.80 15.19 10.56 28.30 26.53 97.37 99.21
FedSim 59.90 56.07 63.46 38.34 14.80 10.46 27.00 26.47 98.63 99.60
FedBaBU 53.12 54.60 39.77 53.21 16.77 17.33 25.60 32.47 98.19 99.07
Ditto 46.86 79.60 31.65 60.75 27.16 42.93 25.38 55.27 98.03 95.51
Per-FedAvg 39.37 45.03 10.00 48.13 32.67 39.01 8.71 41.21 98.32 50.34
Lg-FedAvg 62.28 62.99 62.46 71.73 28.75 28.03 33.75 45.76 97.65 98.82
pMixFed 65.30 75.49 74.36 75.06 34.66 41.56 43.47 51.46 99.88 99.98

5.1 Experimental Setup

Datasets: We used three datasets widely used in federated learning: MNIST [23], CIFAR-10, and CIFAR-100 [3]. We followed the setup in [32] and [33] to simulate heterogeneous, non-IID data distributions across clients for both the train and test datasets. The maximum number of classes per user is set to $S=5$ for CIFAR-10 and MNIST, and $S=50$ for CIFAR-100. Experiments were conducted across varying heterogeneous settings, including small-scale ($N=10$) and large-scale ($N=100$) client populations, with different client participation rates $C=[100\%,10\%]$ to measure the effects of stragglers.
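For reference, a label-limited partition in the spirit of [32] can be sketched as below, capping each client at a maximum number of classes; this is a simplified illustration (per-class subsets may overlap across clients), not the exact splitting code used in the experiments:

```python
import numpy as np

def partition_noniid(labels, num_clients, classes_per_client, seed=0):
    """Assign each client samples drawn from at most `classes_per_client` classes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    client_indices = {k: [] for k in range(num_clients)}
    for k in range(num_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        for c in chosen:
            idx = np.where(labels == c)[0]
            take = rng.choice(idx, size=len(idx) // num_clients, replace=False)
            client_indices[k].extend(take.tolist())
    return client_indices
```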
Baselines and backbone: We compare two versions of our method, (i) pMixFed, an adaptive and dynamic mixup-based PFL approach, and (ii) pMixFed-Dynamic, a dynamic-only mixup variant where the parameter $\mu$ is fixed across communication rounds, against several baselines: FedAvg [32], FedAlt [34], FedSim [34], and FedBABU [33]. Additionally, we compare against full model personalization methods, namely Ditto [28], Per-FedAvg [12], and LG-FedAvg [29]. For all experiments, the number of local training epochs was set to $r=5$, the number of communication rounds was fixed at $T=100$, and the batch size was 32. The Adam optimizer was used with a learning rate of $lr=0.001$ for both global and local updates across all communication rounds. The average personalized test accuracy over individual clients' data in the final communication round is reported in Tables 1 and 2. Figure 3 presents the training accuracy versus communication rounds for the CIFAR-10 and CIFAR-100 datasets.

Model Architectures: Following the FL literature, we utilized several model architectures. For MNIST, we used a simple CNN consisting of 2 convolutional layers (each with 1 block) and 2 fully connected layers. For CIFAR-10 and CIFAR-100, we employed a CNN with 4 convolutional layers (1 block each) and 1 fully connected layer. Additionally, we used MobileNet, which comprises 14 convolutional layers (2 blocks each) and 1 fully connected layer, for CIFAR-10 and CIFAR-100. For partial model approaches such as FedAlt and FedSim, the split layer is fixed in the middle of the network: for CNNs, layers [1, 2] are shared, while for MobileNet, layers [1–7] are considered the shared part. More details about the training process are discussed in Appendix 10.1. Our implementation is available as a supplement for reproducing the results.

Figure 3: Test accuracy curves over global communication rounds for pMixFed and the PFL baselines on CIFAR-10 and CIFAR-100 with N=100, C=10%.

5.2 Comparative Results

We evaluate two versions of our proposed method: pMixFed, which utilizes both adaptive and dynamic layer-wise mixup degree updates, and pMixFed-Dynamic, where the mixup coefficient $\mu_{i}$ is fixed across all training rounds and the mixup degree gradually decreases from $1\rightarrow 0$ moving from the base layer to the head, following a sigmoid function designed for each client. Our results show that the final accuracy of pMixFed outperforms baseline methods in most cases. As shown in Figure 3, the accuracy curve of pMixFed exhibits smoother and faster convergence, which may suggest the potential for early stopping in FL settings. Previous studies on mixup also suggest that linear interpolation between features provides the most benefit in the early training phase [65]. Moreover, Figure 3 highlights that partial models with a hard split are highly sensitive to hyperparameter selection and different distribution settings, which can lead to training instability. This issue appears to be addressed more effectively in pMixFed, as discussed further in Section 4. According to the results in Tables 1 and 2, Ditto demonstrates relatively robust performance across different heterogeneity settings. However, its effectiveness diminishes as the model depth increases, such as with MobileNet, and under larger client populations ($N=100$). On the other hand, while FedAlt and FedSim report above-baseline results, they consistently fail during training; adjusting hyperparameters did not resolve this issue.

5.3 Analytic Experiments

Figure 4: (a) The accuracy drop in FedSim at round 42, caused by vanishing gradients. (b) The accuracy decline at round 10 in FedAlt due to the introduction of 5 new participants. Applying adaptive mixup solely between corresponding global and local shared layers mitigates the accuracy drop.

Adaptive Robustness to Performance Degradation: During our experiments, we observed the algorithm's ability to adapt and recover from performance degradation, especially in challenging scenarios such as vanishing gradients, the addition of new users, or the introduction of unseen incoming data. For instance, in complex settings with larger models, such as MobileNet on the CIFAR-100 dataset, partial PFL models like FedSim and FedAlt experience sudden accuracy drops due to zero gradients or the incorporation of new participants into the cohort. We attribute this phenomenon to: (1) the local and global model update discrepancy in partial models with a strict cut in the middle, as depicted in Figure 1; this degradation is mitigated by the adaptive mixup coefficient, which dynamically adjusts the degree of personalization based on the local model's performance during both the broadcasting and aggregation stages. Specifically, if the global model $G^{(t)}$ lacks sufficient strength, the mix factor $\mu^{(t)}$ is reduced, decreasing the influence of the global model. (2) Catastrophic forgetting, which is addressed in pMixFed by keeping the historical models $H\big|_{1}^{T}G^{t}$ in the aggregation process, as discussed in Section 4.2.2. Figure 4 illustrates that even applying mixup only to the shared layers of the same partial PFL models (FedAlt and FedSim) enhances resilience against sudden accuracy drops, maintaining model performance over time. In both experiments, the mixup degree for personalized layers is set to $\lambda_{i}=0$ for all clients, similar to the FedSim and FedAlt algorithms.

5.4 Ablation Study

Random vs. gradual mix factor from the Beta distribution: In this paper, we have explored different designs for calculating the mix factor $\mu_{k}$. The value of $\lambda$ in Eq. 4 is naturally sampled from a $\mathrm{Beta}(\alpha,\alpha)$ distribution [61], which lies on the interval $[0,1]$. We also experimented with random $\lambda_{i}$ sampled from a Beta distribution with different values of $\alpha$. If $\alpha=1$, the Beta distribution is uniform, meaning that $\lambda$ is sampled uniformly from $[0,1]$. For $\alpha>1$, $\lambda$ concentrates toward intermediate values, creating a more mixed output between $L_{k}$ and $G$. On the contrary, if $\alpha<1$, the mixed model tends to choose just one of the global and local models, i.e., $\lambda=1$ or $\lambda=0$. The effects of different $\alpha$ values on the mixup degree $\lambda$ are discussed in the Appendix.
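The random alternative above can be sketched as sampling each layer's $\lambda_{i}$ independently from a Beta distribution; the helper below is illustrative only:

```python
import numpy as np

def random_lambdas(num_layers, alpha=1.0, seed=None):
    """Sample a layer-wise mixup degree lambda_i ~ Beta(alpha, alpha) for each layer."""
    rng = np.random.default_rng(seed)
    return rng.beta(alpha, alpha, size=num_layers)
```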

Figure 5: (a) Effect of the learning rate on the average test-accuracy (out-of-sample) gap and on cold-start users. (b) Comparison of test accuracy on cold-start users with different mix-factor functions: (Dynamic-only) a fixed $\mu$ for all communication rounds; (Sigmoid) the original updating strategy based on a sigmoid function; (Gradual) a simple linear function adapted for updating $\mu$; (Random) mixup degree $\lambda_{i}$ selected randomly from a Beta distribution.

Mix factor ($\mu$): sigmoid vs. Dynamic-only: In this study, we exploited two different functions to update the adaptive mix factor $\mu$ in each communication round, based on the performance of the model being updated. In the first scenario, the sigmoid function in Equation 5 is used to adaptively update the mixup degree $\lambda_{i}$. In the second scenario, Dynamic-only, $\mu$ is fixed over all communication rounds. The comparison of these two scenarios, as well as the effect of different $t$ values on the test accuracy, is depicted in Figure 5 (b).
Effect of Mixup Degree and Learning Rate (lr): We observed that the effect of the mixup coefficient is highly influenced by the learning rate and its decay. To empirically demonstrate this relationship, we measured the impact of the learning rate on new participants (cold-start users) as well as on the out-of-sample gap (average test accuracy on unseen data). The results of this comparison are presented in Figure 5 (a). Additional details are provided in Appendix 10.3.

6 Conclusions

We introduced pMixFed, a dynamic, layer-wise personalized federated learning approach that uses mixup to integrate the shared global and personalized local models. Our approach features adaptive partitioning between shared and personalized layers, along with a gradual transition of personalization, enabling seamless adaptation for local clients, improved generalization across clients, and reduced risk of catastrophic forgetting. We provided a theoretical analysis of pMixFed to study its convergence properties. Our experiments on three datasets demonstrated its superior performance over existing PFL methods. Empirically, pMixFed exhibited faster training, increased robustness, and better handling of data heterogeneity compared to state-of-the-art PFL models. Future research directions include exploring multi-modal personalization and adapting pMixFed to resource-constrained devices.

7 Acknowledgment

This work is based upon work partly supported by the National Center for Transportation Cybersecurity and Resiliency (TraCR) (a U.S. Department of Transportation National University Transportation Center) headquartered at Clemson University, Clemson, South Carolina, USA. Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of TraCR, and the U.S. Government assumes no liability for the contents or use thereof.

References

  • Abdulrahman et al. [2021] Sawsan Abdulrahman, Hanine Tout, Azzam Mourad, and Chamseddine Talhi. Fedmccs: Multicriteria client selection model for optimal iot federated learning. IEEE Internet of Things Journal, 8(6):4723–4735, 2021.
  • Agarwal et al. [2020] Alekh Agarwal, John Langford, and Chen-Yu Wei. Federated residual learning. arXiv preprint arXiv:2003.12880, 2020.
  • Alex [2009] Krizhevsky Alex. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/kriz/learning-features-2009-TR.pdf, 2009.
  • Arivazhagan et al. [2019] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
  • Beckham et al. [2019] Christopher Beckham, Sina Honari, Vikas Verma, Alex M Lamb, Farnoosh Ghadiri, R Devon Hjelm, Yoshua Bengio, and Chris Pal. On adversarial mixup resynthesis. Advances in neural information processing systems, 32, 2019.
  • Bui et al. [2019] Duc Bui, Kshitiz Malik, Jack Goetz, Honglei Liu, Seungwhan Moon, Anuj Kumar, and Kang G Shin. Federated user representation learning. arXiv preprint arXiv:1909.12535, 2019.
  • Chou et al. [2020] Hsin-Ping Chou, Shih-Chieh Chang, Jia-Yu Pan, Wei Wei, and Da-Cheng Juan. Remix: rebalanced mixup. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 95–110. Springer, 2020.
  • Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017.
  • Collins et al. [2021] Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In International conference on machine learning, pages 2089–2099. PMLR, 2021.
  • Collins et al. [2022] Liam Collins, Aryan Mokhtari, Sewoong Oh, and Sanjay Shakkottai. Maml and anil provably learn representations. In International Conference on Machine Learning, pages 4238–4310. PMLR, 2022.
  • Duan et al. [2019] Moming Duan, Duo Liu, Xianzhang Chen, Yujuan Tan, Jinting Ren, Lei Qiao, and Liang Liang. Astraea: Self-balancing federated learning for improving classification accuracy of mobile deep learning applications. In 2019 IEEE 37th international conference on computer design (ICCD), pages 246–254. IEEE, 2019.
  • Fallah et al. [2020] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
  • Guo et al. [2019] Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI conference on artificial intelligence, pages 3714–3722, 2019.
  • He et al. [2020] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33:14068–14080, 2020.
  • Huang et al. [2022] Wenke Huang, Mang Ye, and Bo Du. Learn from others and be yourself in heterogeneous federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10143–10153, 2022.
  • Huang et al. [2021] Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. Personalized cross-silo federated learning on non-iid data. In Proceedings of the AAAI conference on artificial intelligence, pages 7865–7873, 2021.
  • Imteaj and Amini [2021] Ahmed Imteaj and M Hadi Amini. Fedparl: Client activity and resource-oriented lightweight federated learning model for resource-constrained heterogeneous iot environment. Frontiers in Communications and Networks, 2:657653, 2021.
  • Imteaj et al. [2021] Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M Hadi Amini. A survey on federated learning for resource-constrained iot devices. IEEE Internet of Things Journal, 9(1):1–24, 2021.
  • Jeong et al. [2018] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.
  • Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Karimireddy et al. [2020] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132–5143. PMLR, 2020.
  • Konečnỳ et al. [2016] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee et al. [2023] Sunwoo Lee, Anit Kumar Sahu, Chaoyang He, and Salman Avestimehr. Partial model averaging in federated learning: Performance guarantees and benefits. Neurocomputing, 556:126647, 2023.
  • Li and Wang [2019] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
  • Li et al. [2021a] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10713–10722, 2021a.
  • Li et al. [2020] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
  • Li et al. [2021b] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021b.
  • Liang et al. [2020] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B Allen, Randy P Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
  • Lin et al. [2020] Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in neural information processing systems, 33:2351–2363, 2020.
  • Luo et al. [2023] Kangyang Luo, Xiang Li, Yunshi Lan, and Ming Gao. Gradma: A gradient-memory-based accelerated federated learning with alleviated catastrophic forgetting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3708–3717, 2023.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Oh et al. [2021] Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fedbabu: Towards enhanced representation for federated image classification. arXiv preprint arXiv:2106.06042, 2021.
  • Pillutla et al. [2022] Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. Federated learning with partial model personalization. In International Conference on Machine Learning, pages 17716–17758. PMLR, 2022.
  • Pye and Yu [2021] Sone Kyaw Pye and Han Yu. Personalised federated learning: A combinational approach. arXiv preprint arXiv:2108.09618, 2021.
  • Qu et al. [2022] Liangqiong Qu, Yuyin Zhou, Paul Pu Liang, Yingda Xia, Feifei Wang, Ehsan Adeli, Li Fei-Fei, and Daniel Rubin. Rethinking architecture design for tackling data heterogeneity in federated learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10061–10071, 2022.
  • Raghu et al. [2019] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of maml. arXiv preprint arXiv:1909.09157, 2019.
  • Rostami et al. [2018] Mohammad Rostami, Soheil Kolouri, Kyungnam Kim, and Eric Eaton. Multi-agent distributed lifelong learning for collective knowledge acquisition. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 712–720, 2018.
  • Shamsian et al. [2021] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, pages 9489–9502. PMLR, 2021.
  • Shin et al. [2020] MyungJae Shin, Chihoon Hwang, Joongheon Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Xor mixup: Privacy-preserving data augmentation for one-shot federated learning. arXiv preprint arXiv:2006.05148, 2020.
  • Shirvani-Mahdavi et al. [2023] Nasim Shirvani-Mahdavi, Farahnaz Akrami, Mohammed Samiul Saeef, Xiao Shi, and Chengkai Li. Comprehensive analysis of freebase and dataset creation for robust evaluation of knowledge graph link prediction models. In International Semantic Web Conference, pages 113–133. Springer, 2023.
  • Shoham et al. [2019] Neta Shoham, Tomer Avidor, Aviv Keren, Nadav Israel, Daniel Benditkis, Liron Mor-Yosef, and Itai Zeitak. Overcoming forgetting in federated learning on non-iid data. arXiv preprint arXiv:1910.07796, 2019.
  • Singhal et al. [2021] Karan Singhal, Hakim Sidahmed, Zachary Garrett, Shanshan Wu, John Rush, and Sushant Prakash. Federated reconstruction: Partially local federated learning. Advances in Neural Information Processing Systems, 34:11220–11232, 2021.
  • Smith et al. [2017] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in neural information processing systems, 30, 2017.
  • Sun et al. [2023] Guangyu Sun, Matias Mendieta, Jun Luo, Shandong Wu, and Chen Chen. Fedperfix: Towards partial model personalization of vision transformers in federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4988–4998, 2023.
  • T Dinh et al. [2020] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
  • Tan et al. [2022] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Venkataramanan et al. [2022] Shashanka Venkataramanan, Ewa Kijak, Laurent Amsaleg, and Yannis Avrithis. Alignmixup: Improving representations by interpolating aligned features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19174–19183, 2022.
  • Verma et al. [2019] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International conference on machine learning, pages 6438–6447. PMLR, 2019.
  • Wang et al. [2021] Han Wang, Luis Muñoz-González, David Eklund, and Shahid Raza. Non-iid data re-balancing at iot edge with peer-to-peer federated learning for anomaly detection. In Proceedings of the 14th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pages 153–163, 2021.
  • Wicaksana et al. [2022] Jeffry Wicaksana, Zengqiang Yan, Dong Zhang, Xijie Huang, Huimin Wu, Xin Yang, and Kwang-Ting Cheng. Fedmix: Mixed supervised federated learning for medical image segmentation. IEEE Transactions on Medical Imaging, 42(7):1955–1968, 2022.
  • Wu et al. [2020] Qiong Wu, Xu Chen, Zhi Zhou, and Junshan Zhang. Fedhome: Cloud-edge based personalized federated learning for in-home health monitoring. IEEE Transactions on Mobile Computing, 21(8):2818–2832, 2020.
  • Xiao et al. [2020] Peng Xiao, Samuel Cheng, Vladimir Stankovic, and Dejan Vukobratovic. Averaging is probably not the optimum way of aggregating parameters in federated learning. Entropy, 22(3):314, 2020.
  • Xu et al. [2022] Chencheng Xu, Zhiwei Hong, Minlie Huang, and Tao Jiang. Acceleration of federated learning with alleviated forgetting in local training. arXiv preprint arXiv:2203.02645, 2022.
  • Yang et al. [2020] Hongwei Yang, Hui He, Weizhe Zhang, and Xiaochun Cao. Fedsteg: A federated transfer learning framework for secure image steganalysis. IEEE Transactions on Network Science and Engineering, 8(2):1084–1094, 2020.
  • Yang et al. [2024] Xiyuan Yang, Wenke Huang, and Mang Ye. Fedas: Bridging inconsistency in personalized federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11986–11995, 2024.
  • Yoon et al. [2021] Tehrim Yoon, Sumin Shin, Sung Ju Hwang, and Eunho Yang. Fedmix: Approximation of mixup under mean augmented federated learning. arXiv preprint arXiv:2107.00233, 2021.
  • Yu et al. [2020] Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758, 2020.
  • Yuan et al. [2021] Honglin Yuan, Warren Morningstar, Lin Ning, and Karan Singhal. What do we mean by generalization in federated learning? arXiv preprint arXiv:2110.14216, 2021.
  • Yun et al. [2019] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhang et al. [2020] Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, and James Zou. How does mixup help with robustness and generalization? arXiv preprint arXiv:2010.04819, 2020.
  • Zhao et al. [2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • Zhu et al. [2021] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International conference on machine learning, pages 12878–12889. PMLR, 2021.
  • Zou et al. [2023] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. The benefits of mixup for feature learning. In International Conference on Machine Learning, pages 43423–43479. PMLR, 2023.

Appendix

8 More Detailed Discussion on the Mix Factor

This section elaborates on two key properties of pMixFed: 1) dynamic behavior, achieved through a gradual transition of the mixup degree across layers ($\lambda_{i}$), and 2) adaptive behavior, introduced via the mix factor ($\mu$). Below, we detail the formulation of each component.

8.1 Dynamic Mixup Degree

Among the various partitioning strategies for partial PFL discussed in the main paper and in [34], one of the most widely adopted techniques is assigning the higher layers of the local model $L_{l,k}^{t}$ to personalization while sharing the base layers $L_{g,k}^{t}$ across clients as the global model [33, 45, 4]. This design aligns with insights from the Model-Agnostic Meta-Learning (MAML) algorithm [10], which shows that lower layers generally retain task-agnostic, generalized features, while higher layers capture task-specific, personalized characteristics. Accordingly, in this work, we designate the head of the model as containing personalized information, while the base layers represent generalized information shared across clients. It should be noted that our method is fully adaptable to the other partial PFL designs discussed in [34].
Broadcasting: To achieve a nuanced and gradual transition in the mixing process between the global and local models, we define the mixup degree $\lambda_{i}$ as follows. The local model's head $L_{n}$ remains frozen, and the head of the global model is excluded from sharing with the local model. The mixup degree $\lambda_{i}$ then increases incrementally, layer by layer, based on the mix factor $\mu$, so that the personalization impact decreases as we move toward the base layers. This dynamic behavior is expressed as:

\lambda_{i}=\lambda_{i+1}+\mu,

where $\lambda_{i}$ controls the degree of mixing at layer $i$. This process is visually illustrated in Figure 6.

Figure 6: Layer-wise dynamic transition of the mixup degree $\lambda_{i}$ in the broadcasting and aggregation phases. Darker colors indicate a higher mixup degree ($\lambda_{i}$) for the corresponding layer $i$.
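A minimal sketch of this broadcasting rule is given below, assuming PyTorch-style state dictionaries ordered from base to head. The function name, the initial value $\lambda_{\text{head}}=0$, and the clipping to $[0,1]$ are illustrative choices rather than details of the released implementation.

def broadcast_mix(global_sd, local_sd, mu, lambda_head=0.0):
    # Mix global parameters into the local model before local training.
    # lambda_i = lambda_{i+1} + mu grows from the head toward the base layers,
    # so base layers receive more generalized (global) information while the
    # head stays dominated by the personalized local parameters.
    layer_names = list(local_sd.keys())              # ordered base ... head
    mixed = {}
    lam = lambda_head                                # head layer stays personalized
    for name in reversed(layer_names):               # walk from head toward base
        lam = min(max(lam, 0.0), 1.0)
        mixed[name] = lam * global_sd[name] + (1.0 - lam) * local_sd[name]
        lam = lam + mu                               # lambda_i = lambda_{i+1} + mu
    return {name: mixed[name] for name in layer_names}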

Aggregation: The aggregation stage focuses on preserving the generalized information from the history of previous global models to mitigate catastrophic forgetting. In contrast to broadcasting, the primary goal here is to retain the generalized information of the previous global model, which encapsulates the history of all prior models $H\big|_{1}^{t-1}G$. Therefore, the base layers should predominantly be shared from the global model, particularly during the first rounds of training, when the updated local models are still underdeveloped. As we transition toward the head, however, it becomes less important to transfer knowledge from the previous global models. Figure 6 illustrates how the mixup degree transitions from the head to the base layers.
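A complementary sketch of the mixed aggregation stage, under the same assumptions as above; the starting value $\lambda_{\text{base}}=1$, the client weights, and the per-layer clipping are illustrative.

def aggregate_mix(prev_global_sd, local_sds, client_weights, mu, lambda_base=1.0):
    # Mixed aggregation on the server: lambda_i is largest at the base layers,
    # so the previous global model G^t (the accumulated history) dominates the
    # base, while the freshly updated local models dominate the head.
    layer_names = list(prev_global_sd.keys())        # ordered base ... head
    new_global = {}
    lam = lambda_base                                # base layer keeps most of G^t
    for name in layer_names:                         # walk from base toward head
        lam = min(max(lam, 0.0), 1.0)
        local_avg = sum(w * sd[name] for w, sd in zip(client_weights, local_sds))
        new_global[name] = lam * prev_global_sd[name] + (1.0 - lam) * local_avg
        lam = lam - mu                               # lambda decreases toward head
    return new_global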

8.2 Adaptive Mix-Factor

To enable online, adaptive updates of the mixup degree $\lambda_{i}$, we implemented several algorithms, as detailed in the ablation study in Section 5.4. We observed that the mixup degree behaves much like a learning rate, which is further discussed in Section 9.4.1. Inspired by learning rate schedulers, we introduced the mix factor $\mu$ to adaptively update the layer-wise mixup degree based on the current communication round $\delta=t/T$ and the relative performance of the current local model $L_{k}^{t}$ compared to the global model $G^{t}$. The sigmoid function in Figure 8 illustrates how $\mu$ evolves with respect to $\delta$ and the accuracy ratio $Acc$. The best results were achieved when setting $b=1$ and using the square of $Acc$ as an exponent. The rationale behind this approach is that a more experienced, better-performing model should share more information. Specifically, if the local model accuracy $Acc_{l}$ significantly exceeds that of the global model $Acc_{G}$ in the current round $t$, so that $Acc\gg 1$, less information is shared from the global model. Conversely, when $0<Acc<1$, the global model dominates the parameter updates in both the global and local models. Figure 7 shows the distribution of the calculated $\mu$ across different communication rounds for client 0 in the broadcasting stage. In the broadcasting stage, higher $Acc_{l}$ values result in freezing more layers for personalization, leading to a decrease in $\mu$. Conversely, during the aggregation phase, if the global model accuracy $Acc_{G}$ outperforms the updated local model accuracy $Acc_{l'}^{(t+1)}$, more base layers are shared. This is especially important during the first rounds of training, when local models are less stable. As demonstrated in Section 5.3, our proposed method adapts dynamically to performance drops and high distribution complexity, adjusting the mixup degree as needed. This adaptability remains effective even when applied partially to partial PFL methods such as FedAlt and FedSim.
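Since the exact closed form of $\mu$ is not reproduced here, the following is a minimal sketch of one plausible instantiation consistent with the description above (a sigmoid in the round fraction $\delta=t/T$ with bias $b=1$, raised to the power $Acc^{2}$); the precise functional form in the released implementation may differ.

import math

def mix_factor(t, T, acc_local, acc_global, b=1.0, eps=1e-8):
    # delta = t / T is the training progress; acc = Acc_l / Acc_G is the accuracy
    # ratio. Using acc**2 as an exponent means a strong local model (acc >> 1)
    # sharply reduces how much global information is mixed in, while acc < 1
    # lets the global model dominate.
    delta = t / T
    acc = acc_local / max(acc_global, eps)
    sigmoid = 1.0 / (1.0 + math.exp(-b * delta))     # in (0.5, 0.73) for delta in (0, 1)
    mu = sigmoid ** (acc ** 2)                       # acc >> 1 pushes mu toward 0
    return min(max(mu, 0.0), 1.0)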

8.3 Model Heterogeneity

pMixFed is capable of handling variable model sizes across clients. The global model $M_{G}$ retains the maximum number of layers over all clients, i.e., $M_{G}=\max(M_{1},M_{2},\dots,M_{N})$. During the matching process between the global model $G_{i}$ and the local model $L_{i}$, if a layer block of the local model has no corresponding global layer, we set $\lambda_{i}=0$, meaning that this layer block participates in neither the broadcasting nor the aggregation process. Each layer therefore contributes in proportion to its participation rate across clients. For instance, if only 40% of clients have more than 4 layers, the generalization degree of those deeper layers (in both the broadcasting and aggregation stages) will be less than $0.4$ in each training round due to the different participation rates.
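A minimal sketch of this matching rule, assuming state dictionaries keyed by layer name; the helper name and its dictionary interface are illustrative.

def match_lambdas(global_sd, local_sd, lambdas):
    # Zero out the mixup degree for layers that a smaller client model lacks:
    # such layers participate in neither broadcasting nor aggregation for this
    # client, so they contribute only in proportion to their participation rate.
    return {name: (lam if name in local_sd else 0.0)
            for name, lam in lambdas.items()}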

Figure 7: Mix factors for client 0 across different communication rounds.

9 Theoretical Analysis

In this section, we provide the convergence analysis of pMixFed. Moreover, we compare the aggregation process and the server-side global model update of the FedSGD algorithm with our proposed mixed aggregation stage in pMixFed. We begin by introducing the key notations and assumptions used throughout the convergence analysis.

Notations:

  • $t\in\{0,\ldots,T-1\}$: communication round index.

  • $\eta_{l}$: learning rate for local updates; $\eta_{g}$: learning rate for global updates.

  • $\lambda_{k,i}$: mixup coefficient for client $k$ at layer $i$ in round $t$; $\lambda_{k}$: mixup coefficient for client $k$ in round $t$ (assumed uniform across layers).

  • $G^{(t)}$: global model parameters at round $t$; $L_{k}^{(t)}$: local model parameters of client $k$ at round $t$.

  • $|\mathcal{D}_{k}|$: size of the dataset at client $k$; $|\mathcal{D}|$: total dataset size across all clients.

  • $\nabla L_{k}^{(t+1)}$: gradient of the local model at client $k$ in round $t$.

We make the following assumptions to establish the convergence properties of pMixFed:

Assumption 1.

Unbiased Gradient Estimation with Bounded Variance. The stochastic gradient estimate $\tilde{\nabla}\mathcal{L}_{k}^{(t)}$ used for local model updates is an unbiased estimate of the true gradient and has bounded variance, i.e.,

\mathbb{E}\big[\tilde{\nabla}\mathcal{L}_{k}^{(t)}\big]=\nabla\mathcal{L}_{k}^{(t)},\quad\text{Var}\big(\tilde{\nabla}\mathcal{L}_{k}^{(t)}\big)\leq\sigma^{2}. (9)
Assumption 2 (Smoothness of Local Objectives).

The local objective functions $\mathcal{L}_{k}(\cdot)$ are $L_{1}$-smooth, i.e.,

\|\nabla\mathcal{L}_{k}^{(t+1)}-\nabla\mathcal{L}_{k}^{(t)}\|\leq L_{1}\|L_{k}^{(t+1)}-L_{k}^{(t)}\|. (10)
Assumption 3 (Bounded Gradients).

The gradients at each client are bounded by a constant $G_{\max}$, i.e.,

\|\nabla\mathcal{L}_{k}^{(t+1)}\|\leq G_{\max},\quad\forall k,t. (11)

These assumptions are standard in convergence analysis and ensure that the optimization process is well-behaved.

9.1 Local Training Convergence

We first analyze the local training updates that occur between communication rounds.

Lemma 1.

Local Model Training Progress. Under Assumptions 1 and 2, after $r$ local updates, the expected loss for client $k$ satisfies:

\mathbb{E}[\mathcal{L}_{k}^{(t+r)}]\leq\mathcal{L}_{k}^{(t)}+\left(\frac{L_{1}\eta_{l}^{2}}{2}-\eta_{l}\right)\sum_{j=0}^{r-1}\|\nabla\mathcal{L}_{k}^{(t+j)}\|^{2}+\frac{L_{1}r\eta_{l}^{2}\sigma^{2}}{2}. (12)

This lemma provides a bound on how the local training process improves the loss function. The bound depends on the learning rate $\eta_{l}$, the smoothness constant $L_{1}$, and the variance of the gradient estimates $\sigma^{2}$.
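For completeness, the single-step version of this bound follows from the standard descent lemma for $L_{1}$-smooth functions. A sketch of one local SGD step (under Assumptions 1 and 2, writing $\tilde{\nabla}$ for the stochastic gradient) is:

% One local step: L_k^{(t+1)} = L_k^{(t)} - \eta_l \tilde{\nabla}\mathcal{L}_k^{(t)}.
% By L_1-smoothness, taking expectation over the stochastic gradient and using
% \mathbb{E}\|\tilde{\nabla}\mathcal{L}_k^{(t)}\|^2 \le \|\nabla\mathcal{L}_k^{(t)}\|^2 + \sigma^2:
\mathbb{E}\big[\mathcal{L}_k^{(t+1)}\big]
  \le \mathcal{L}_k^{(t)} - \eta_l\|\nabla\mathcal{L}_k^{(t)}\|^2
      + \frac{L_1\eta_l^2}{2}\,\mathbb{E}\big[\|\tilde{\nabla}\mathcal{L}_k^{(t)}\|^2\big]
  \le \mathcal{L}_k^{(t)}
      + \Big(\frac{L_1\eta_l^2}{2}-\eta_l\Big)\|\nabla\mathcal{L}_k^{(t)}\|^2
      + \frac{L_1\eta_l^2\sigma^2}{2}.
% Summing this inequality over the r local steps j = 0, ..., r-1 yields Eq. (12).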

Figure 8: Adaptive mix factor $\mu_{k}^{t}$ as a function of the accuracy ratio $Acc^{t}=Acc^{t}_{L}/Acc^{t}_{G}$ at different communication rounds.

9.2 Global Model Aggregation

Next, we consider the effect of aggregating local models at the server after each communication round.

Lemma 2.

Global Model Aggregation Dynamics. After aggregating the local models, the global model loss changes as follows:

\mathbb{E}[\mathcal{L}^{(t+1)}]\leq\mathbb{E}[\mathcal{L}^{(t)}]+\eta_{g}\delta^{2}, (13)

where $\delta^{2}$ bounds the difference between the global and local model parameters before and after aggregation.

This lemma shows that, during aggregation, the global loss increases by at most a term proportional to the global learning rate $\eta_{g}$ and the parameter variation $\delta^{2}$.

9.3 Convergence of pMixFed

With the above lemmas, we can now derive the convergence properties of pMixFed.

Theorem 1.

Convergence Rate of pMixFed. Under the previously stated assumptions, the expected global loss after a complete round of communication and local training satisfies:

\mathbb{E}[\mathcal{L}^{(t+1)}]\leq\mathcal{L}^{(t)}+\left(\frac{L_{1}\eta_{l}^{2}}{2}-\eta_{l}\right)\sum_{k=1}^{K}\sum_{j=0}^{r-1}\|\nabla\mathcal{L}_{k}^{(t+j)}\|^{2}+\frac{L_{1}r\eta_{l}^{2}\sigma^{2}}{2}+\eta_{g}\delta^{2}. (14)

This theorem shows how the expected global loss evolves over a communication round: when the local learning rate satisfies $\eta_{l}<2/L_{1}$, the gradient term is negative and drives the loss down, while the variance term and the aggregation term $\eta_{g}\delta^{2}$ bound the possible increase. The overall convergence therefore depends on the local and global learning rates, the gradient norms, and the parameter variation.

Theorem 2.

Non-Convex Convergence Rate. For non-convex loss functions, pMixFed achieves convergence at the following rate:

\frac{1}{T}\sum_{t=0}^{T-1}\sum_{k=1}^{K}\|\nabla\mathcal{L}_{k}^{(t)}\|^{2}\leq\mathcal{O}(1/T), (15)

where $T$ is the total number of communication rounds.

This final theorem indicates that, for non-convex objectives, pMixFed drives the average squared gradient norm to zero at a rate of $\mathcal{O}(1/T)$, i.e., the algorithm keeps improving as the number of communication rounds increases.

9.4 The Effect of Aggregating Model Parameters and Gradients on Catastrophic Forgetting

In FedSGD, the clients' gradients are aggregated and the server updates the global model according to the aggregated gradients. FedSGD is sometimes preferred over FedAvg due to its potentially faster convergence, but it lacks robustness in heterogeneous environments. pMixFed leverages the faster convergence characteristics of FedSGD by incorporating early stopping mechanisms, facilitated by the use of mixup. As demonstrated in Section 5, the mixup factor $\lambda$ functions analogously to an SGD update at the server, even though pMixFed aggregates model weights rather than gradients, similar to FedAvg. Section 9.4.1 provides a more detailed explanation of this mechanism.
The pMixFed algorithm thus combines the advantages of FedSGD and FedAvg by aggregating model weights rather than gradients, while still ensuring convergence in heterogeneous data settings. The incorporation of the mixup mechanism enhances stability and provides faster convergence than FedSGD, particularly in non-convex settings. Considering $r$ local SGD steps with local learning rate $\eta_{l}$, the weight-based server update can be expanded as follows:

\begin{split}&G^{t+1}=G^{t}-\eta_{g}\sum_{k=1}^{K}\Omega_{k}\left[L_{k}^{(t)}-L_{k}^{(t+r)}\right]\\ &G^{t+1}=G^{t}-\eta_{g}\,\eta_{l}\sum_{k=1}^{K}\Omega_{k}\sum_{j=0}^{r-1}\nabla L_{k}^{(t+j)}\end{split} (16)

Since, in FedSGD, the global model $G^{(t)}$ is fully shared with each local model $L_{k}^{(t)}$ at every communication round $t$ (i.e., $L_{k}^{(t)}=G^{(t)}$), and $\sum_{k=1}^{K}\Omega_{k}=1$, we can rewrite the update as:

G^{t+1}=G^{t}-\eta_{g}\,\eta_{l}\sum_{j=0}^{r-1}\nabla G^{(t+j)}
\sum_{j=0}^{r-1}\nabla G^{(t+j)}=\frac{1}{\eta_{g}\,\eta_{l}}\left(G^{t}-G^{t+1}\right)

The right-hand side represents the history of the global model gradients, which can be written in the form $H\big|_{0}^{r-1}\nabla G^{t}=a\,G^{t}-b\,G^{t+1}$ for positive constants $a$ and $b$. The problem with the FedSGD algorithm is that these gradients are too small to preserve the state of the previous global model, leading to catastrophic forgetting in partial models that use gradients in their aggregation stage. We hypothesize that the mixup coefficient $\lambda$ acts similarly to the learning rate $\eta$, i.e., that $\lambda$ plays a role in pMixFed analogous to that of $\eta$ in FedSGD.
FedSGD Update Rule: In FedSGD, the global model $G^{(t)}$ is updated by aggregating the gradients from all clients:

G^{(t+1)}=G^{(t)}-\eta_{g}\sum_{k=1}^{K}\Omega_{k}\nabla L_{k}^{(t+1)}, (17)

where $\Omega_{k}=\frac{|\mathcal{D}_{k}|}{|\mathcal{D}|}$ is the weight associated with client $k$. Since $L_{k}^{(t+1)}=L_{k}^{(t)}-\eta_{l}\nabla L_{k}^{(t+1)}$, we can rewrite the update as:

G^{(t+1)}=G^{(t)}-\eta_{g}\sum_{k=1}^{K}\Omega_{k}\left(\frac{L_{k}^{(t)}-L_{k}^{(t+1)}}{\eta_{l}}\right). (18)

pMixFed Update Rule: In pMixFed, the global model update incorporates mixup coefficients:

G^{(t+1)}=\sum_{k=1}^{K}\Omega_{k}\left[(1-\lambda_{k})L_{k}^{(t+1)}+\lambda_{k}G^{(t)}\right]. (19)

Assuming $\lambda_{k,i}=\lambda_{k}$ for all layers $i$ (and, for this comparison, the same value across clients), and noting that $G^{(t)}$ is sent to all clients at round $t$ with $\sum_{k=1}^{K}\Omega_{k}=1$, we simplify the update to:

G^{(t+1)}=(1-\lambda_{k})\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}+\lambda_{k}G^{(t)}. (20)
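To make the comparison concrete, the following minimal numerical sketch applies the two server update rules, Eq. (17) and Eq. (20), to toy parameter vectors; the client weights, learning rates, and $\lambda$ value are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 5
G = rng.normal(size=d)                               # global model G^(t)
grads = [rng.normal(size=d) for _ in range(K)]       # per-client gradients at round t
omega = np.array([0.5, 0.3, 0.2])                    # data-size weights, sum to 1
eta_g, eta_l, lam = 0.01, 0.1, 0.5

# FedSGD (Eq. 17): the server applies the weighted average of client gradients.
G_fedsgd = G - eta_g * sum(w * g for w, g in zip(omega, grads))

# pMixFed-style mixed aggregation (Eq. 20): the server mixes the updated local
# weights L_k^(t+1) = G - eta_l * grad_k with the previous global model, so lam
# interpolates between keeping G^(t) (lam = 1) and a plain weighted average of
# the updated local models (lam = 0).
locals_next = [G - eta_l * g for g in grads]
G_pmix = (1 - lam) * sum(w * L for w, L in zip(omega, locals_next)) + lam * G

print(G_fedsgd)
print(G_pmix)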

9.4.1 Analytical Analysis of the Effect of the Learning Rate and the Mixup Degree

To establish the relationship between $\lambda$ and the ratio $\eta_{g}/\eta_{l}$, we align the FedSGD and pMixFed update equations. From Equation (20), rearranged:

G^{(t+1)}-\lambda_{k}G^{(t)}=(1-\lambda_{k})\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}. (21)

From Equation (18), rearranged:

G^{(t+1)}=G^{(t)}-\frac{\eta_{g}}{\eta_{l}}\sum_{k=1}^{K}\Omega_{k}\left(L_{k}^{(t)}-L_{k}^{(t+1)}\right). (22)

Assuming that, in FedSGD, $L_{k}^{(t)}$ is replaced with $G^{(t)}$ at round $t$, i.e., $L_{k}^{(t)}=G^{(t)}$, substituting into Equation (22) gives:

G^{(t+1)}=G^{(t)}-\frac{\eta_{g}}{\eta_{l}}\sum_{k=1}^{K}\Omega_{k}\left(G^{(t)}-L_{k}^{(t+1)}\right). (23)

Simplifying and comparing with Equation (20), we see that if $\lambda_{k}=\frac{\eta_{g}}{\eta_{l}}$, then the updates are analogous:

G^{(t+1)}=G^{(t)}\left(1-\frac{\eta_{g}}{\eta_{l}}\sum_{k=1}^{K}\Omega_{k}\right)+\frac{\eta_{g}}{\eta_{l}}\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}. (24)
Theorem 3.

Under Assumptions 2 and 3, the mixup coefficient $\lambda_{k}$ in pMixFed acts analogously to the learning rate ratio $\frac{\eta_{g}}{\eta_{l}}$ in FedSGD, such that $\lambda_{k}=\frac{\eta_{g}}{\eta_{l}}$. This implies that the mixup mechanism in pMixFed can be interpreted as a form of learning rate control analogous to FedSGD.

Proof.

Starting from the pMixFed update in Equation (21):

G^{(t+1)}-\lambda_{k}G^{(t)}=(1-\lambda_{k})\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}. (25)

From the rearranged FedSGD update in Equation (24):

G^{(t+1)}=G^{(t)}\left(1-\frac{\eta_{g}}{\eta_{l}}\right)+\frac{\eta_{g}}{\eta_{l}}\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}. (26)

Setting the two expressions equal:

G^{(t)}\left(1-\lambda_{k}\right)+(1-\lambda_{k})\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}=G^{(t)}\left(1-\frac{\eta_{g}}{\eta_{l}}\right)+\frac{\eta_{g}}{\eta_{l}}\sum_{k=1}^{K}\Omega_{k}L_{k}^{(t+1)}. (27)

This simplifies to:

\lambda_{k}=\frac{\eta_{g}}{\eta_{l}}. (28)

Given that $\lambda_{k}$ must lie in the range $[0,1]$, this relationship holds when $\eta_{g}\leq\eta_{l}$. Since both $\eta_{g}$ and $\eta_{l}$ are also constrained to the interval $[0,1]$, our findings are consistent with previous studies [33, 10], which recommend keeping the head unfrozen (i.e., $\eta_{l}\neq 0$).

Remark 1. Theorem 3 establishes a direct relationship between the mixup coefficient $\lambda_{k}$ and the learning rates used in local and global updates. This insight allows us to interpret the mixup mechanism in pMixFed as adjusting the effective learning rate at the server, providing a theoretical foundation for selecting $\lambda_{k}$ based on desired convergence properties.

Remark 2. In practice, this relationship suggests that by tuning $\lambda_{k}$ we can control the influence of the global model versus the local models in the aggregation process, similar to adjusting the learning rate in FedSGD. This is particularly beneficial in heterogeneous environments where clients have varying data distributions.

Our theoretical analysis indicates that the mixup coefficient $\lambda_{k}$ in pMixFed plays a role analogous to the learning rate in FedSGD. This equivalence provides a deeper understanding of how pMixFed leverages the strengths of FedSGD while mitigating its weaknesses in heterogeneous settings. By choosing $\lambda_{k}$ appropriately, pMixFed can achieve faster convergence and improved robustness.

10 Additional Details about the Experiments

10.1 Experimental Setup

Dataset: We used four widely adopted federated learning datasets: MNIST [23], CIFAR-10, CIFAR-100 [3], and EMNIST [8]. CIFAR-10 consists of 50,000 training images of size $32\times 32$ and 10,000 test images. CIFAR-100 consists of 100 classes, with 500 $32\times 32$ training images and 100 test images per class. MNIST contains 10 labels and includes 60,000 $28\times 28$ grayscale training samples and 10,000 test samples. For creating heterogeneity, we followed the
Training Details: For evaluation, we report the average test accuracy of the global model [59] for the different approaches. The final global model at the last communication round is saved and used during evaluation. The global model is then personalized according to each baseline's personalization or fine-tuning algorithm for $r=4$ local epochs and $T=50$ communication rounds. For FedAlt, the local model is reconstructed from the global model and fine-tuned on the test data. For FedSim, both the global and local models are fine-tuned partially but simultaneously. In the case of FedBABU, the head (fully connected layers) remains frozen during local training, while the body is updated. Since we could not directly apply pFedHN in our platform setting, we adapted their method using the same hyperparameters discussed above and employed hidden layers with 100 units for the hypernetwork and 16 kernels. The local update process for LG-FedAvg, FedAvg, and Per-FedAvg simply involves updating all layers jointly during fine-tuning. The global learning rate for methods that require an SGD update at the global server, e.g., FedAvg, was selected from $lr_{global}\in\{1e{-}3,\,1e{-}4,\,1e{-}5\}$. It should be noted that, due to the performance drop of some methods (FedAlt, FedSim) at round 10 or 40 in some settings, we report the highest accuracy achieved; this is also why the accuracy curves are shown for 39 rounds instead of 50. As discussed in the main paper, changing hyper-parameters such as the learning rate, batch size, and momentum, and even switching the optimizer to Adam, did not resolve this performance drop in most cases.

10.2 Out-of-Sample and Participation Gap

In this evaluation, the average test accuracy (classification accuracy using softmax) is measured on cold-start clients $|D^{ts}_{k\cap\text{unseen}}|,\ k\subset\{1,\dots,M\}$, where $D^{ts}\neq D^{tr}$; these clients have not participated in the federation at any point during training. FedAlt and FedSim perform poorly on such cold-start or unseen clients, highlighting their limited generalization capability. The test accuracy of pMixFed, while affected under a 10% participation rate, benefits significantly from seeing more clients: increased client participation directly improves accuracy, as observed in previous studies [34].

10.3 Ablation Study: The Effects of Different $\alpha$ Values on the Mixup Degree

The Beta distribution $\beta(\alpha,\alpha)$ is defined on the interval $[0,1]$, where the parameter $\alpha$ controls the shape of the distribution. The value of $\lambda$ used in Eq. 4 is naturally sampled from this distribution. By varying $\alpha$, we can adjust how much mixing occurs between the global model $G$ and the local model $L_{k}$.

  • Uniform Distribution ($\alpha=1$): When $\alpha=1$, the Beta distribution becomes uniform over $[0,1]$. In this case, $\lambda$ is sampled uniformly across the entire interval, meaning that each model, $G$ and $L_{k}$, has an equal probability of being weighted more or less in the mixup process. This leads to a broad exploration of different combinations of global and local models, allowing for a wide range of mixed models.

  • Concentrated Mixup ($\alpha>1$): When $\alpha>1$, the Beta distribution concentrates around the center of the interval $[0,1]$. As a result, the mixup factor $\lambda$ is more likely to be close to 0.5, leading to more balanced combinations of the global and local models. The outputs are more "mixed," with neither model dominating the mixup process. Such a setup can enhance the robustness of the combined model, as it prevents extreme weighting of either model and creates smoother interpolations between them.

  • Extremal Mixup ($\alpha<1$): In contrast, when $\alpha<1$, the Beta distribution becomes U-shaped, with more probability mass near 0 and 1. This means that $\lambda$ tends to be either very close to 0 or very close to 1, favoring one model over the other in the mixup process. When $\lambda\approx 0$, the local model $L_{k}$ is chosen almost exclusively, and when $\lambda\approx 1$, the global model $G$ is predominantly selected. This form of mixup creates a more deterministic selection between global and local models, with less mixing occurring.

Figure 9: Probability density function of $\lambda$ (mixup degree) for different values of $\alpha$ in the Beta distribution.
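To make these regimes concrete, the following minimal sketch samples $\lambda\sim\beta(\alpha,\alpha)$ for several $\alpha$ values and summarizes each distribution; the chosen $\alpha$ values and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 0.5, 1.0, 2.0, 5.0):              # illustrative alpha values
    lam = rng.beta(alpha, alpha, size=10_000)        # sampled mixup degrees lambda
    extreme = np.mean((lam < 0.1) | (lam > 0.9))     # probability mass near 0 or 1
    print(f"alpha={alpha:4.1f}  mean={lam.mean():.2f}  "
          f"std={lam.std():.2f}  P(lambda extreme)={extreme:.2f}")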

The behavior of different $\alpha$ values is depicted in Figure 9, which visualizes the distribution of the mixup factor $\lambda$. These distributions highlight the varying degrees of mixup, ranging from uniform blending to nearly deterministic model selection. To thoroughly investigate the impact of $\alpha$ on model performance, we designed two distinct experimental setups:

  • Random Sampling: In this scenario, we fix a value of $\alpha$ and sample $\lambda$ directly from $\beta(\alpha,\alpha)$; for $\alpha=1$, the $\lambda$ values are drawn uniformly from the interval $[0,1]$. This ensures a wide range of mixup combinations between the global and local models and lets us assess the general robustness of the model when the mixup degree $\lambda$ is not biased towards any specific value. Table 3 shows the effect of different $\alpha$ values on the overall test accuracy of pMixFed.

  • Adaptive Sampling: For this case, we divide the communication rounds into three distinct stages, each consisting of $\frac{epoch_{global}}{3}$ epochs. Across these stages, the parameter $\alpha$ is adaptively changed as follows:

    \alpha=\begin{cases}0.1,&\text{initial stage (early training)}\\ 100,&\text{middle stage (convergence phase)}\\ 10,&\text{final stage (fine-tuning)}\end{cases} (29)

    This adaptive strategy mimics the behavior of the original pMixFed algorithm while also allowing for more controlled exploration of different mixup combinations. During the early stage, a small $\alpha$ (0.1) encourages more deterministic model selections (i.e., mostly local or mostly global), the middle stage with $\alpha=100$ promotes more balanced mixing, and the final stage with $\alpha=10$ keeps the mixing moderately balanced for fine-tuning. This dynamic adjustment of $\alpha$ allows us to control the degree of mixup at different phases of training; a sketch of this schedule is given after this list. Table 4 shows the effect of the two sampling approaches (random and adaptive) on the overall test accuracy of pMixFed on all three datasets. The results show that adaptive sampling, which effectively creates a schedule for the mixup degree, yields promising results even compared to the original algorithm using the adaptive $\mu$.
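A minimal sketch of this staged schedule, assuming the $T$ communication rounds are split evenly into three stages; the stage boundaries and the direct Beta sampling call are illustrative.

import numpy as np

def staged_alpha(t, T):
    # Return the alpha of Eq. (29) for communication round t out of T rounds.
    if t < T // 3:            # initial stage (early training)
        return 0.1
    elif t < 2 * T // 3:      # middle stage (convergence phase)
        return 100.0
    else:                     # final stage (fine-tuning)
        return 10.0

rng = np.random.default_rng(0)
T = 50
for t in (0, 20, 45):         # one illustrative round from each stage
    alpha = staged_alpha(t, T)
    lam = rng.beta(alpha, alpha)                     # sampled mixup degree
    print(f"round {t:2d}: alpha={alpha:5.1f}, lambda={lam:.3f}")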

Table 3: Test accuracy (%) of pMixFed for different $\alpha$ values of the Beta distribution.
Dataset      α = 0.1   α = 0.5   α = 1   α = 2   α = 5
CIFAR-10     78.5      81.2      82.6    84.3    83.1
CIFAR-100    42.3      45.6      47.2    49.8    48.1
Table 4: Test accuracy (%) of pMixFed under random and adaptive sampling of the mixup degree.
Dataset      Sampling Strategy            Accuracy (%)
CIFAR-10     Random Sampling (α = 1)      82.6
CIFAR-10     Adaptive Sampling            85.3
CIFAR-100    Random Sampling (α = 1)      47.2
CIFAR-100    Adaptive Sampling            50.1
MNIST        Random Sampling (α = 1)      97.9
MNIST        Adaptive Sampling            98.6