Personalized Federated Learning with Contextual Modulation and Meta-Learning

Anna Vettoruzzo
[email protected] Center for Applied Intelligent Systems Research (CAISR), Halmstad University, Halmstad, Sweden. Mohamed-Rafik Bouguelia¹¹footnotemark: 1
[email protected] Thorsteinn Rögnvaldsson¹¹footnotemark: 1
[email protected]

Abstract

Federated learning has emerged as a promising approach for training machine learning models on decentralized data sources while preserving data privacy. However, challenges such as communication bottlenecks, heterogeneity of client devices, and non-i.i.d. data distribution pose significant obstacles to achieving optimal model performance. We propose a novel framework that combines federated learning with meta-learning techniques to enhance both efficiency and generalization capabilities. Our approach introduces a federated modulator that learns contextual information from data batches and uses this knowledge to generate modulation parameters. These parameters dynamically adjust the activations of a base model, which operates using a MAML-based approach for model personalization. Experimental results across diverse datasets highlight the improvements in convergence speed and model performance compared to existing federated learning approaches. These findings highlight the potential of incorporating contextual information and meta-learning techniques into federated learning, paving the way for advancements in distributed machine learning paradigms.

Keywords: Personalized federated learning, Meta-learning, Federated learning, Context learning

1 Introduction

Machine learning and deep learning have revolutionized various domains by enabling models to learn from vast amounts of centralized data. However, the traditional approach of collecting data into a central server raises concerns regarding data privacy and security. To address these issues, federated learning (FL) [27] has emerged as a promising framework that enables collaborative model training while keeping the data decentralized on individual devices. This decentralized approach has gained significant attention in domains such as healthcare, Internet of Things (IoT), and mobile applications, where data cannot be easily transferred to a central server due to privacy regulations, connectivity limitations, or security risks. However, in the context of FL, data distributions across clients are often not “independent and identically distributed” (non-i.i.d.) since different devices generate and collect data separately. This poses a challenge for FL methods, such as FedAvg [27], that were primarily designed to handle i.i.d. data. Personalized federated learning (PFL) [9] takes a step further by customizing models to the needs and preferences of individual clients, an important aspect in domains such as recommendation systems and user behavior analysis. One way to accomplish this is by integrating meta-learning techniques into the FL framework. For example, Per-FedAvg [9] finds an initialization of a global model that performs well after each client’s update with respect to its own data and loss function, potentially by performing a few steps of gradient descent. This approach accounts for the heterogeneity of data distributions while benefiting from collaborative training. However, when there are substantial differences between client distributions, relying on a single global model may be insufficient. Some approaches have attempted to tackle this issue by learning multiple global models that simultaneously fit different data distributions [8, 11, 26, 33, 43], but this can impose a high computational cost at the server level, as storing multiple models typically requires significant memory resources. Moreover, selecting the most suitable model for each client based on the limited locally available data can be challenging.

In this paper, we propose CAFeMe, a context-aware federated learning solution that leverages meta-learning to facilitate personalization to each client. Our approach is based on a global model comprising a federated modulator and a base model. The federated modulator network captures contextual information from the data of each client and generates modulation parameters that dynamically adjust the activations of the base model. In each communication round, the global model is distributed to the clients, which personalize it with their locally available data. The modulation parameters generated by the federated modulator serve as additional parameters within the base network, allowing for the personalization of the model to the unique characteristics of each client’s data distribution. This allows our proposed approach to deal with high heterogeneity in the clients’ data while maintaining a single model at the server level.

We demonstrate the effectiveness of CAFeMe in improving model convergence, generalization, and personalized prediction on various datasets within the PFL framework. The results highlight the potential of leveraging only relevant information for personalization to each client. This also results in a fast learning process that is particularly advantageous in time-sensitive applications. Furthermore, we showcase the robustness of our framework under the concept shift scenario, where the same class-label may be related to different concepts across clients. The code is available online at https://github.com/annaVettoruzzo/CAFeMe.

2 Related works

The pioneering federated learning (FL) algorithm, FedAvg [27], performs well in scenarios where the local data across clients is i.i.d. However, in scenarios with heterogeneous (non-i.i.d.) data distributions, several techniques have been proposed to improve local learning. These include regularization methods [1, 21, 24], local bias correction techniques [6, 17, 28], data sharing mechanisms [47], data augmentation methods [44], and selective clients aggregation [36, 41]. These methods primarily focus on improving the performance of the global model without considering better customization for each client. An alternative approach is personalized federated learning (PFL), which aims to develop customized models for individual clients while leveraging the collaborative nature of FL. Popular PFL methods include L2GD [13], which combines local and global models, FedRep [5] that decouples the parameters of the feature extractor and the classifier to learn unique local heads for each client, methods that investigate client-specific model aggregations [15, 46], as well as multi-task learning methods such as pFedMe [35], Ditto [22], and FedPAC [42]. Recently, some frameworks propose to incorporate meta-learning into the PFL framework [38]. One notable approach is Per-FedAvg [9], which proposes a decentralized version of model-agnostic meta-learning (MAML) [10] for the FL setting. MAML is a meta-learning algorithm that learns models that can efficiently be adapted, through an optimization procedure, to new tasks. In the context of PFL, personalization to a client aligns with adaptation to a task [16]. Other papers [3, 16, 23] explore the combination of MAML-type methods with FL architectures to improve personalization for each client’s local dataset. Additionally, the authors of [18] propose ARUBA, a meta-learning algorithm inspired by online convex optimization, which enhances the performance of FedAvg.

While the effectiveness of these approaches in PFL has been demonstrated, concerns remain regarding their ability to handle highly heterogeneous data distributions among clients. A special type of PFL approach that tries to address this concern is clustered or group-based FL [8, 11, 26, 33, 43], where multiple (i.e., $K$ ) group-based global models are learned after partitioning clients into $K$ groups. These methods can either perform clustering at the server level, [33, 12, 43], or let the clients compute the cluster identity based on their local empirical loss [11]. However, partitioning clients into groups presents challenges due to the ambiguity in the clustering structure, and the memory costs of storing $K$ models at the server level.

Our contribution consists of improving the performance and scalability of the PFL framework, even in scenarios where clients exhibit substantial differences in the data distribution. Our proposed method incorporates context information extracted from each client’s local data to drive the personalized model towards a better fit of the client-specific data distribution, while ensuring that the personalized base model does not deviate too far from the global version at the server level. This approach ensures a balance between local adaptation and global consistency, facilitating effective model personalization for each client.

3 Background and Notation

This section provides a formal description of the PFL problem and its connection with the MAML algorithm [10], specifically designed for meta-learning problems. In PFL, we consider a distributed learning setting where a central server communicates with $C$ clients. Let $\mathbb{C}=\{1,\cdots,C\}$ denote the set of clients participating in the training process. Each client $i$ has its own data distribution $P_{\mathcal{X}\mathcal{Y}}^{(i)}$ on $\mathcal{X}\times\mathcal{Y}$ , where $\mathcal{X}$ represents the input space and $\mathcal{Y}$ denotes the label space. Each client $i$ has access to $n_{i}$ data points, represented as $\mathcal{D}_{i}=\{x_{k},y_{k}\}$ , which are samples drawn from $P_{\mathcal{X}\mathcal{Y}}^{(i)}$ . In this context, it is assumed that the probability distributions $P_{\mathcal{X}\mathcal{Y}}^{(i)}$ and $P_{\mathcal{X}\mathcal{Y}}^{(j)}$ differ for any pair of clients $i$ and $j$ , which is typically the case in PFL. This is referred to as the non-i.i.d setting. The differences in probability distributions can arise from variations in $P_{\mathcal{X}}$ , in $P_{\mathcal{Y}}$ or in the conditional $P_{\mathcal{Y}|\mathcal{X}}$ , resulting in covariate shift, label shift, or concept shift, respectively. While previous works have primarily focused on the first two, our proposed solution is designed to effectively handle the concept shift scenario as well.

The main goal of PFL is to learn a global model $f_{\omega}$ that performs well after personalization to each client’s local dataset. This is achieved through an update scheme that proceeds in rounds of communication. In each communication round of the training process, the global model is initialized with $\omega$ and sent to the clients. Each client $i\in\mathbb{C}$ then personalizes the global model using its local data $\mathcal{D}_{i}$ to derive personalized model parameters $\omega^{\prime}_{i}$ . Typically, this personalization process involves performing a few steps of gradient descent on the client’s local dataset. Formally, let $\mathcal{L}_{\mathcal{D}_{i}}(f_{\omega^{\prime}_{i}})$ be the empirical risk of the local model $f_{\omega^{\prime}_{i}}$ on the local dataset $\mathcal{D}_{i}$ . The objective of PFL can be described as follows:

(3.1)

\min_{\bm{W}}\biggl{\{}F(\bm{W}):=\frac{1}{C}\sum_{i=1}^{C}\mathcal{L}_{\mathcal{D}_{i}}(f_{\omega^{\prime}_{i}})\biggr{\}}

where $\bm{W}$ denotes the collection of all local models, i.e., $\bm{W}=\{\omega^{\prime}_{i}\}_{i=1}^{C}$ . However, in the non-i.i.d setting, it is challenging to find a global model that performs well after personalization to heterogeneous clients. One way to address this challenge is to incorporate meta-learning into the PFL setting.

Meta-learning approaches aim to learn models that can be efficiently adapted to new tasks by leveraging the knowledge acquired from previous tasks. Optimization-based meta-learning methods, such as MAML [10], use a bi-level optimization process to learn an initial set of parameters $\omega$ that can be effectively adapted with gradient descent on new tasks. During the meta-training phase, a set of training tasks $\{\mathcal{T}_{i}\}_{i=1}^{T}$ is used. Each task $\mathcal{T}_{i}$ corresponds to data generating distributions $\mathcal{T}_{i}\triangleq\{P_{\mathcal{X}}^{(i)},P_{\mathcal{Y}|\mathcal{X}}^{(i)}\}$ . The data sampled from each task is divided into a support set $\mathcal{D}_{i}^{(sp)}$ containing $K$ training examples, and a query set $\mathcal{D}_{i}^{(qr)}$ used for evaluating the model’s performance. MAML meta-trains a neural network model $f_{\omega}$ parameterized by $\omega$ using a two-stage procedure consisting of an inner and an outer loop. In the inner loop, the initial parameters $\omega$ are adapted to each training task $\mathcal{T}_{i}$ by taking a few gradient descents steps on the support set $\mathcal{D}_{i}^{(sp)}\triangleq\{x_{k},y_{k}\}_{k=1}^{K}$ . This process results in task-specific parameters $\omega^{\prime}_{i}$ , as illustrated in Eq. 3.2 when a single gradient step is used:

(3.2)

\omega^{\prime}_{i}=\omega-\alpha\nabla_{\omega}\mathcal{L}_{\mathcal{D}_{i}^{(sp)}}(f_{\omega}).

The initial parameters $\omega$ are then optimized in the outer loop by minimizing the loss achieved by the task-specific parameters $\omega^{\prime}_{i}$ on the query set $\mathcal{D}_{i}^{(qr)}$ of the same task $\mathcal{T}_{i}$ . This is shown in Eq. 3.3:

(3.3)

\omega\leftarrow\omega-\beta\nabla_{\omega}\sum_{\mathcal{T}_{i}\sim\textnormal{P}(\mathcal{T})}\mathcal{L}_{\mathcal{D}_{i}^{(qr)}}(f_{\omega^{\prime}_{i}}).

The result is a model initialization $\omega$ that performs well when updated with a few training examples $\mathcal{D}_{new}^{(sp)}\triangleq\{x_{k},y_{k}\}_{k=1}^{K}$ from a new task $\mathcal{T}_{new}$ . The goal is to train the model on $\mathcal{D}_{new}^{(sp)}$ while leveraging the previous knowledge acquired during meta-training in order to achieve good generalization performance on new unlabeled test examples from $\mathcal{T}_{new}$ .

As suggested in [16, 9], optimization-based meta-learning algorithms, such as MAML, bear relevance to the PFL setting due to the inherent heterogeneity present both in tasks (for meta-learning) and in clients (for federated learning). While MAML assumes access to only a few labeled data from a new task at test time, PFL does not have this constraint. Consequently, an estimation of the empirical loss (in Equations 3.2 and 3.3) can be computed using independent batches of data (e.g., $\mathcal{D}_{i}^{\prime}$ , $\mathcal{D}_{i}^{\prime\prime}$ , etc.) that belongs to $\mathcal{D}_{i}$ , as suggested in [9]. By incorporating the MAML formulation into PFL, the objective of PFL shifts towards finding a model initialization shared among all clients that performs well once each client updates it based on the loss function computed with their own local data.

4 Proposed approach

In this section, we present our proposed CAFeMe approach in detail. The aim is to improve the performance and scalability of the PFL framework by leveraging meta-learning techniques and extracting contextual information from each client’s local data. To achieve this, we introduce a federated modulator network $g_{\mu}$ followed by a base network $f_{\psi}$ , as shown in Figure 1. We define the complete set of model parameters encompassing both the modulator and the base network as $\omega=\{\mu,\psi\}$ . Additionally, we incorporate modulation layers in the base network to enable efficient personalization to each client’s local dataset. These layers aim to condition the base network $f_{\psi}$ on that client’s data by employing a feature-wise gating mechanism to the activations of the preceding layers, allowing only the relevant features (or activations) to propagate forward while disregarding the non-relevant ones. The parameters of the modulation layers are predicted by the federated modulator. The latter extracts some context from batches of data available at the client level and transforms it to generate the modulation parameters. The context is designed to be representative of the client’s local dataset and invariant to the permutation of data samples, ensuring the predicted modulation parameters are independent of the ordering of the data. We delve into the architecture of the federated modulator in Section 4.1 and provide the complete algorithm in Section 4.2.

Refer to caption — Figure 1: Overall framework of *CAFeMe*. For each client involved in the federation, the federated modulator $g_{\mu}$ takes in input a batch of local data $\mathcal{D}^{\prime}_{i}$ , sampled from $\mathcal{D}_{i}^{pers}$ , and predicts the parameters of the modulation layers $\bm{\zeta}_{i}$ . These parameters are used to modulate the activations of the base network $f_{\psi}$ . The modulation operator is denoted by $\odot$ .

4.1 Federated Modulator.

As illustrated in Figure 2, the federated modulator network $g_{\mu}$ consists of three parts: a feature extractor (A), an averaging layer (B), and a multilayer perception (MLP) (C). The feature extractor (A) takes as input a batch of data $\mathcal{D}_{i}^{\prime}=\{x_{b},y_{b}\}_{b=1}^{B}$ , where $B$ is the batch size. Each batch consists of inputs $x_{b}$ and labels $y_{b}$ (represented as one-hot-vectors in classification) sampled from a client $i$ . It then transforms them into $\mathcal{Z}_{i}=\{h(~{}h_{x}(x_{b})~{}\oplus~{}y_{b}~{})\}_{b=1}^{B}$ , where $h_{x}(.)$ denotes an input-specific feature extractor, and $\oplus$ denotes the concatenation operator. The averaging layer (B) calculates the mean of the transformed data $\mathcal{Z}_{i}$ along the $B$ examples, creating a vector representation $z_{i}$ that characterizes client $i$ . This task representation is invariant to permutations of the examples and is expected to capture client-specific information. The final part (C) of the network generates the parameters for the modulation layers in the base network. It is an MLP that takes $z_{i}$ as input and predicts a set of parameters $\bm{\zeta}_{i}=\{\zeta_{i}^{1},\cdots\zeta_{i}^{L}\}$ , where $L$ is the number of hidden layers in the base model. Each $\zeta_{i}^{l}$ , for $l=1,...,L$ , corresponds to the parameters of the modulation layer following the $l^{\textnormal{th}}$ layer in the base model. These parameters will act as “scaling” or “gating” parameters; they adjust the feature importance and diminish the contribution of less informative features for the given client $i$ . This is described more formally in the next section.

4.2 Algorithm.

The complete algorithm of the CAFeMe method is presented in Algorithm 1. At each communication round $t$ , with $t=1,\cdots,T$ , a subset of $M$ clients is randomly sampled and denoted with $\mathcal{C}^{(t)}$ . In line 4 of Algorithm 1, the parameters of the global model $\omega$ are sent to each of the selected clients with the aim of personalizing them to each client’s local data. To do so, each client’s dataset $\mathcal{D}_{i}$ is split into two disjoint sets, $\mathcal{D}_{i}^{pers}$ and $\mathcal{D}_{i}^{eval}$ . The former set is used in line 7 to personalize the global model’s parameters $\omega$ to client $i$ . This personalization is described in Algorithm 2, where after a first initialization step in line 1, a batch of data is sampled from $\mathcal{D}_{i}^{pers}$ and input in the federated modulator $g_{\mu}$ to obtain the modulation parameters $\bm{\zeta}_{i}$ , as described in Section 4.1. Note that if $S>1$ , an independent batch $\mathcal{D}^{\prime}$ is sampled at each iteration. The modulation parameters $\bm{\zeta}_{i}$ are then used to modulate the activations of the base network, resulting in a modulated model denoted by $f_{\psi}|\bm{\zeta}_{i}$ which is more personalized for the client at hand. To better explain the modulated network $f_{\psi}|\bm{\zeta}_{i}$ , let us examine the activations $a(\mathbf{x}~{}\psi^{l})$ produced by a particular layer $l$ . Using modulation parameters $\zeta_{i}^{l}$ , the activations can be transformed to produce modulated activations, indicated as $a(\mathbf{x}~{}\psi^{l})\odot\zeta_{i}^{l}$ , as exemplified by Equation 4.4. This equation outlines a “sigmoidal gating” mechanism that enables $\zeta_{i}^{l}$ to determine which activations are propagated forward to the next layers and which are zeroed out. Note that this approach is not restricted to this type of modulation, but we believe that gating the model’s activations is sufficient to specialize the model to specific clients while simultaneously leveraging the knowledge learned from other clients.

(4.4)

a(\mathbf{x}~{}\psi^{l})\odot\zeta_{i}^{l}=a(\mathbf{x}~{}\psi^{l})\otimes\sigma(\zeta_{i}^{l}),

where $\otimes$ denotes an element-wise multiplication and $\sigma$ is the sigmoid function making each value of $\zeta_{i}^{l}$ in the range $[0,1]$ . The parameters $\mu$ and $\psi$ are then personalized to the specific client in lines 5 and 6 of Algorithm 2, resulting in client-specific parameters $\mu^{\prime}$ and $\psi^{\prime}$ , respectively. This personalization is performed in a MAML-based manner, see Equation 3.2, but the loss is now computed with a batch of data $\mathcal{D}_{i}^{\prime}$ (not necessarily small as in MAML), and with $f_{\psi}|\bm{\zeta}_{i}$ that is the base model whose activations have been transformed with the parameters of the modulation layers. This process can be iterated for $S\geq 1$ gradient descent steps, and the personalized parameters $\psi_{i}^{\prime}$ and $\bm{\zeta}_{i}$ are returned.

In line 8 of Algorithm 1, the initial set of parameters $\omega$ is updated, similarly to Equation 3.3, by minimizing the loss computed on $\mathcal{D}_{i}^{eval}$ using $f_{\psi^{\prime}_{i}}|\bm{\zeta}_{i}$ (i.e., the evaluation loss achieved by the model after being personalized to client $i$ ). Here, any optimizer of choice, e.g., Adam, can be used (not necessarily gradient descent). This results in a set of parameters $\omega_{i}^{\prime}$ (comprising both $\mu_{i}^{\prime}$ and $\psi_{i}^{\prime}$ ) that are personalized to the client’s data and, at the same time, generalize well to a held-out dataset sampled from the same distribution. This guarantee good performance when new data from the same client are observed and prevents overfitting to the client’s initial data. The updated parameters $\omega_{i}^{\prime}$ are then sent back to the server in line 9. After receiving the parameters from all the selected clients, the server collects the local models and computes the average to reinitialize $\omega$ , following a similar approach to the FedAvg algorithm [27].

Algorithm 1 CAFeMe

Require Active clients $M$ , step sizes $\alpha,\beta$

1: Randomly initialize

\omega=\{\mu,\psi\}

2: for

t=1,\cdots,T

3: Choose a subset of clients

\mathcal{C}^{(t)}

with size

M

4: Send

\omega

to all clients in

\mathcal{C}^{(t)}

5: for all

i\in\mathcal{C}^{(t)}

6: Split data

\mathcal{D}_{i}=\{\mathcal{D}_{i}^{pers},\mathcal{D}_{i}^{eval}\}

7: Personalize the global model:

\psi_{i}^{\prime},\bm{\zeta}_{i}\leftarrow\emph{Personalization}(S,\alpha,\mathcal{D}_{i}^{pers},\omega)

8: Update

\omega_{i}^{\prime}\leftarrow\omega-\beta\nabla_{\omega}\mathcal{L}_{\mathcal{D}_{i}^{eval}}(f_{\psi_{i}^{\prime}}|\bm{\zeta}_{i})

9: Send

\omega_{i}^{\prime}

back to the server

10: end for

11: Server updates

\omega=\frac{1}{M}\sum_{i\in\mathcal{C}^{(t)}}\omega_{i}^{\prime}

12: end for

Algorithm 2 Personalization(

S,\alpha,\mathcal{D}^{pers},\omega

)

Require Personalization steps $S$ , step size $\alpha$ , local dataset for personalization $\mathcal{D}_{i}^{pers}$ , initial parameters of the model $\omega=\{\mu,\psi\}$

1: Initialize

\mu^{\prime}\leftarrow\mu

\psi^{\prime}\leftarrow\psi

2: for

s=1,\cdots,S

3: Sample batch of data

\mathcal{D}^{\prime}

from

\mathcal{D}^{pers}

4: Predict modulation parameters

\bm{\zeta}=g_{\mu^{\prime}}(\mathcal{D}^{\prime})

\mu^{\prime}=\mu^{\prime}-\alpha\nabla_{\mu^{\prime}}\mathcal{L}_{\mathcal{D}^{\prime}}(f_{\psi^{\prime}}|\bm{\zeta})

\psi^{\prime}=\psi^{\prime}-\alpha\nabla_{\psi^{\prime}}\mathcal{L}_{\mathcal{D}^{\prime}}(f_{\psi^{\prime}}|\bm{\zeta})

7: end for

8: Predict modulation parameters

\bm{\zeta}=g_{\mu^{\prime}}(\mathcal{D}^{\prime})

9: return

\psi^{\prime},\bm{\zeta}

During the testing phase, new clients can join the federation, possibly with local data sampled from different data distributions compared to those encountered during training. The objective is to obtain a personalized model for each new client that performs effectively on its local data, potentially requiring a few personalization steps and only a small amount of labeled data.

5 Experimental setup

5.1 Datasets.

We assess the performance of CAFeMe on various datasets, including synthetic and realistic non-i.i.d datasets. For the synthetic datasets, we synthesize distributed non-i.i.d datasets by partitioning well-known benchmarks, such as CIFAR-10 [19] and RMNIST (a modified version of the classic MNIST dataset [7]), into smaller subsets specific to each client. Inspired by previous works in the field [27, 45, 14, 40], we explore two different data partitioning schemes: the shards partition and the dirichlet partition.

We also evaluate the performance of the proposed approach on two more realistic datasets, namely FEMNIST [2] and Meta-Dataset [37]. While FEMNIST has been extensively used in the FL literature, this study represents the first time Meta-Dataset is employed to evaluate the performance of a PFL algorithm. Meta-Dataset is a large-scale dataset comprising various classification datasets. In this setting, we assume clients share the common tasks of classifying images into $N$ different classes, but their local data are sampled from completely different datasets. For this reason, the same label might have different meanings across clients, hence the concept shift scenario. This scenario closely resembles real-world situations where clients that participate in the federation might belong to different enterprises with diverse goals and interests.

5.2 Compared methods.

We compare the results with several state-of-the-art approaches. Specifically, we include FL works such as FedAvg [27] with its fine-tuned variant (i.e., FedAvg-FT), and IFCA [11], a framework specifically designed for clustered FL, together with IFCA-FT. Furthermore, we consider Ditto [22] a multi-task learning framework tailored for PFL, FedRep [5] a PFL method characterized by parameter decoupling of feature extractor and classifier, and Per-FedAvg [9] a meta-learning approach specifically designed for solving PFL problems.

5.3 Training setting.

To ensure fair and reliable comparisons, similar hyperparameters and model architectures have been used for all the baselines. For all the datasets, $80$ clients ( $C=80$ ) are used for training the models and $20$ clients for evaluation of the performance. The number of communication rounds is set to $1000$ , except for FEMNIST and Meta-Dataset where it is set to $2000$ , due to the complexity of the datasets. For each round of communication, we randomly select $M=5$ clients, and we perform $S=5$ personalization steps. At test time, we let the models perform $50$ personalization steps to allow a good personalization to each client’s own data. We use mini-batch stochastic gradient descent (SGD) as a local optimizer, with a fixed batch size of $30$ samples. For CAFeMe, we use the feature-wise gating defined in Equation 4.4 as a way to modulate the activations of the base model. However, we also investigated an alternative way by using an affine transformation, as used in FiLM [31]. In this case, the modulation parameters $\zeta^{l}$ of a layer $l$ consists of two vectors, $\zeta_{a}^{l}$ and $\zeta_{b}^{l}$ , used to scale and bias the activations, i.e.,

a(\mathbf{x}~{}\psi^{l})\odot\zeta^{l}=\zeta_{a}^{l}~{}\otimes~{}a(\mathbf{x}~{}\psi^{l})~{}+~{}\zeta_{b}^{l}.

Nonetheless, with this modulation, the performance of CAFeMe is highly variable across different runs of the algorithm and doesn’t achieve a good performance. This is likely because scaling the model’s activations with a value that is not normalized in $[0,1]$ (unlike the sigmoidal feature-wise gating of Equation 4.4) may result in ending up in a local minima far from the optimal results. Therefore, in the rest of the experiments, all results are reported using the feature-wise gating mechanism in Equation 4.4.

We construct two different models for RMNIST/FEMNIST and CIFAR-10/Meta-Dataset, respectively. The first CNN model comprises two modules, each consisting of a convolutional layer with 32 filters, followed by batch normalization and ReLU nonlinearities. One linear layer with a size of 576 is used to complete the classification model. Similarly, the architecture of the federated modulator consists of the same two modules followed by three linear layers with sizes of 100, 200, 200, respectively. The second CNN model is similar to the first one but comprises three modules with 64 filters and two linear layers with a size of 576. Also, the federated modulator has three modules and four linear layers with sizes of 100, 100, 200, 200.

To account for statistical variations, we conduct five complete runs of the algorithms and report the results as the average accuracy and standard deviation.

6 Results

Table 1: Comparison of test accuracy (%) after

50

personalization steps on new clients at test time on synthetic datasets.

Method	CIFAR-10		RMNIST
	Shards	Dirichlet	Shards	Dirichlet
FedAvg	$27.87\pm 1.10$	$40.91\pm 1.98$	$60.26\pm 3.77$	$79.16\pm 2.93$
FedAvg-FT	$87.00\pm 2.40$	$67.13\pm 1.48$	$87.08\pm 3.39$	$89.47\pm 1.43$
IFCA	$29.01\pm 2.06$	$42.10\pm 3.00$	$65.13\pm 6.26$	$80.50\pm 1.69$
IFCA-FT	$87.26\pm 2.50$	$64.96\pm 1.21$	$88.52\pm 3.32$	$89.84\pm 1.21$
Ditto	$89.87\pm 1.02$	$70.71\pm 1.54$	$90.44\pm 1.44$	$89.27\pm 1.50$
FedRep	$88.77\pm 1.43$	$68.44\pm 2.16$	$90.95\pm 1.64$	$89.88\pm 1.46$
Per-FedAvg	$88.41\pm 2.06$	$71.04\pm 1.91$	$83.86\pm 1.74$	$74.57\pm 3.12$
CAFeMe	$\mathbf{90.66\pm 0.66}$	$\mathbf{74.97\pm 1.67}$	$\mathbf{98.82\pm 0.11}$	$\mathbf{94.31\pm 1.00}$

Table 2: Comparison of test accuracy (%) after

50

personalization steps on new clients at test time on realistic datasets.

Method	FEMNIST	Meta-Dataset
FedAvg	$79.81\pm 2.21$	$10.58\pm 0.75$
FedAvg-FT	$82.07\pm 1.92$	$57.78\pm 1.16$
IFCA	$79.81\pm 2.04$	$9.74\pm 0.73$
IFCA-FT	$82.56\pm 1.79$	$57.17\pm 1.44$
Ditto	$80.41\pm 1.29$	$45.31\pm 1.44$
FedRep	$78.10\pm 1.81$	$45.17\pm 1.90$
Per-FedAvg	$80.26\pm 2.60$	$57.89\pm 1.09$
CAFeMe	$\mathbf{82.76\pm 0.83}$	$\mathbf{62.62\pm 1.33}$

The experimental results are summarized in Tables 1 and 2 for the synthetic and realistic datasets, respectively. Across all datasets, CAFeMe consistently outperforms the other baseline methods. The performance gap becomes even more evident when the heterogeneity in clients’ data distribution is large (as observed in RMNIST and Meta-Dataset), highlighting the effectiveness of the proposed approach in personalizing to each client’s local data distribution. This observation is supported by the results presented in Figure 3, which illustrates the performance of different PFL methods under varying levels of data heterogeneity in the RMNIST dataset.

When the data are (almost) balanced and i.i.d, as indicated by the “None” column in the histogram and when $\alpha=1000$ , the performance of CAFeMe is comparable to other baselines, such as Ditto and FedRep. However, as the data heterogeneity across clients increases (lower $\alpha$ values), CAFeMe consistently maintains strong performance, while the performance of the other baselines significantly deteriorates. This is particularly evident when using the shards partition scheme, where most of the clients have data from only two classes. In this challenging scenario, personalizing the global model to each local dataset becomes difficult due to the narrow local data distribution compared to the one used for learning the global model. Nevertheless, CAFeMe overcomes this challenge by effectively modulating the global model through meta-learning, resulting in robust personalization even in the presence of concept shifts. This observation is further supported by the results on Meta-Dataset (Table 2). The variability in the conditional distribution $P_{\mathcal{Y}|\mathcal{X}}$ across different clients hurts the performance of other methods, which struggle to adapt the global model to such variability. Even using multiple global models, as in IFCA-FT, is insufficient for effective personalization on the client side, also due to the limitations on the estimation of the cluster identity. In contrast, CAFeMe leverages meta-learning and the federated modulator to achieve superior performance in personalizing the global model among clients with concept shifts.

Furthermore, Figure 4 shows that CAFeMe achieves efficient personalization with a small number of personalization steps at test time across all datasets. This demonstrates the efficiency of the proposed approach, enabling quick personalization for clients with limited computational resources. We also conducted additional experiments to investigate the impact of different dataset sizes at the client level. As depicted in Figure 5, when the local data size is too small (e.g., 100), all baselines exhibit low performance due to the inability of personalizing the global model effectively to the client’s local data distribution. On the other hand, as the size of data increases, CAFeMe shows a substantial improvement in performance, suggesting its applicability even in scenarios with small amounts of data at the client level.

7 Conclusion

In this work, we presented CAFeMe, a novel approach for PFL that leverages meta-learning and a federated modulator network to achieve efficient and effective personalization of the global model to each client’s local data distribution. Our proposed approach overcomes the limitations of traditional FL methods in scenarios with non-i.i.d data distributions by introducing modulation layers that condition the base model on each client’s data.

Through extensive experiments on various synthetic and realistic datasets, we demonstrated the superiority of CAFeMe over state-of-the-art baselines. The performance gap was particularly evident when the data heterogeneity among clients was significant, showing the robustness and adaptability of our approach in such challenging scenarios. The ability to personalize the global model effectively, even in the presence of concept shifts, was one of the key strengths of CAFeMe, setting it apart from traditional approaches that struggled to handle such scenarios. We believe that CAFeMe holds great promise for advancing personalized federated learning in various real-world applications, and we hope that our work inspires further research in this exciting and rapidly evolving field. Future research directions may involve investigating more sophisticated modulator architectures as well as diverse meta-learning techniques and algorithms to optimize the personalization process for clients with varying data distributions. Furthermore, it would be interesting to explore the robustness of CAFeMe under adversarial settings and develop mechanisms to defend against potential security and privacy threats.

Acknowledgments

This work was supported by the “Knowledge Foundation” (KK-stiftelsen).

References

[1] Durmus Alp Emre Acar et al. “Federated Learning Based on Dynamic Regularization” In International Conference on Learning Representations, 2021
[2] Sebastian Caldas et al. “Leaf: A benchmark for federated settings” In arXiv preprint arXiv:1812.01097, 2018
[3] Fei Chen et al. “Federated meta-learning with fast convergence and efficient communication” In arXiv preprint arXiv:1802.07876, 2018
[4] Mircea Cimpoi et al. “Describing textures in the wild” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613
[5] Liam Collins, Hamed Hassani, Aryan Mokhtari and Sanjay Shakkottai “Exploiting shared representations for personalized federated learning” In International Conference on Machine Learning, 2021, pp. 2089–2099 PMLR
[6] Yatin Dandi, Luis Barba and Martin Jaggi “Implicit gradient alignment in distributed and federated learning” In Proceedings of the AAAI Conference on Artificial Intelligence 36.6, 2022, pp. 6454–6462
[7] Li Deng “The mnist database of handwritten digit images for machine learning research [best of the web]” In IEEE signal processing magazine 29.6 IEEE, 2012, pp. 141–142
[8] Moming Duan et al. “Flexible clustered federated learning for client-level data distribution shift” In IEEE Transactions on Parallel and Distributed Systems 33.11 IEEE, 2021, pp. 2661–2674
[9] Alireza Fallah, Aryan Mokhtari and Asuman Ozdaglar “Personalized federated learning: A meta-learning approach” In arXiv preprint arXiv:2002.07948, 2020
[10] Chelsea Finn, Pieter Abbeel and Sergey Levine “Model-agnostic meta-learning for fast adaptation of deep networks” In International conference on machine learning, 2017, pp. 1126–1135 PMLR
[11] Avishek Ghosh, Jichan Chung, Dong Yin and Kannan Ramchandran “An efficient framework for clustered federated learning” In Advances in Neural Information Processing Systems 33, 2020, pp. 19586–19597
[12] Avishek Ghosh, Justin Hong, Dong Yin and Kannan Ramchandran “Robust federated learning in a heterogeneous environment” In arXiv preprint arXiv:1906.06629, 2019
[13] Filip Hanzely and Peter Richtárik “Federated learning of a mixture of global and local models” In arXiv preprint arXiv:2002.05516, 2020
[14] Tzu-Ming Harry Hsu, Hang Qi and Matthew Brown “Measuring the effects of non-identical data distribution for federated visual classification” In arXiv preprint arXiv:1909.06335, 2019
[15] Yutao Huang et al. “Personalized cross-silo federated learning on non-iid data” In Proceedings of the AAAI Conference on Artificial Intelligence 35.9, 2021, pp. 7865–7873
[16] Yihan Jiang, Jakub Konečnỳ, Keith Rush and Sreeram Kannan “Improving federated learning personalization via model agnostic meta learning” In arXiv preprint arXiv:1909.12488, 2019
[17] Sai Praneeth Karimireddy et al. “Scaffold: Stochastic controlled averaging for federated learning” In International Conference on Machine Learning, 2020, pp. 5132–5143 PMLR
[18] Mikhail Khodak, Maria-Florina F Balcan and Ameet S Talwalkar “Adaptive gradient-based meta-learning methods” In Advances in Neural Information Processing Systems 32, 2019
[19] Alex Krizhevsky “Learning multiple layers of features from tiny images”, 2009
[20] Brenden Lake, Ruslan Salakhutdinov, Jason Gross and Joshua Tenenbaum “One shot learning of simple visual concepts” In Proceedings of the annual meeting of the cognitive science society 33.33, 2011
[21] Qinbin Li, Bingsheng He and Dawn Song “Model-contrastive federated learning” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10713–10722
[22] Tian Li, Shengyuan Hu, Ahmad Beirami and Virginia Smith “Ditto: Fair and robust federated learning through personalization” In International Conference on Machine Learning, 2021, pp. 6357–6368 PMLR
[23] Tian Li, Maziar Sanjabi, Ahmad Beirami and Virginia Smith “Fair Resource Allocation in Federated Learning” In International Conference on Learning Representations, 2020
[24] Tian Li et al. “Federated optimization in heterogeneous networks” In Proceedings of Machine learning and systems 2, 2020, pp. 429–450
[25] Subhransu Maji et al. “Fine-grained visual classification of aircraft” In arXiv preprint arXiv:1306.5151, 2013
[26] Yishay Mansour, Mehryar Mohri, Jae Ro and Ananda Theertha Suresh “Three approaches for personalization with applications to federated learning” In arXiv preprint arXiv:2002.10619, 2020
[27] Brendan McMahan et al. “Communication-efficient learning of deep networks from decentralized data” In Artificial intelligence and statistics, 2017, pp. 1273–1282 PMLR
[28] Tomoya Murata and Taiji Suzuki “Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning” In International Conference on Machine Learning, 2021, pp. 7872–7881 PMLR
[29] Maria-Elena Nilsback and Andrew Zisserman “Automated flower classification over a large number of classes” In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008, pp. 722–729 IEEE
[30] Boris Oreshkin, Pau Rodriguez Lopez and Alexandre Lacoste “Tadam: Task dependent adaptive metric for improved few-shot learning” In Advances in neural information processing systems 31, 2018
[31] Ethan Perez et al. “Film: Visual reasoning with a general conditioning layer” In Proceedings of the AAAI Conference on Artificial Intelligence 32.1, 2018
[32] Sachin Ravi and Hugo Larochelle “Optimization as a model for few-shot learning” In International conference on learning representations, 2017
[33] Felix Sattler, Klaus-Robert Müller and Wojciech Samek “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints” In IEEE transactions on neural networks and learning systems 32.8 IEEE, 2020, pp. 3710–3722
[34] Johannes Stallkamp, Marc Schlipsing, Jan Salmen and Christian Igel “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition” In Neural networks 32 Elsevier, 2012, pp. 323–332
[35] Canh T Dinh, Nguyen Tran and Josh Nguyen “Personalized federated learning with moreau envelopes” In Advances in Neural Information Processing Systems 33, 2020, pp. 21394–21405
[36] Minxue Tang et al. “FedCor: Correlation-based active client selection strategy for heterogeneous federated learning” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10102–10111
[37] Eleni Triantafillou et al. “Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples” In International Conference on Learning Representations, 2020
[38] Anna Vettoruzzo et al. “Advances and Challenges in Meta-Learning: A Technical Review” In arXiv preprint arXiv:2307.04722, 2023
[39] Catherine Wah et al. “The caltech-ucsd birds-200-2011 dataset”, 2011
[40] Hongyi Wang et al. “Federated Learning with Matched Averaging” In International Conference on Learning Representations, 2020
[41] Hongda Wu and Ping Wang “Node selection toward faster convergence for federated learning on non-iid data” In IEEE Transactions on Network Science and Engineering 9.5 IEEE, 2022, pp. 3099–3111
[42] Jian Xu, Xinyi Tong and Huang Shao-Lun “Personalized Federated Learning with Feature Alignment and Classifier Collaboration” In International conference on learning representations, 2023
[43] Lei Yang, Jiaming Huang, Wanyu Lin and Jiannong Cao “Personalized federated learning on non-IID data via group-based meta-learning” In ACM Transactions on Knowledge Discovery from Data 17.4 ACM New York, NY, 2023, pp. 1–20
[44] Tehrim Yoon, Sumin Shin, Sung Ju Hwang and Eunho Yang “FedMix: Approximation of Mixup under Mean Augmented Federated Learning” In International Conference on Learning Representations
[45] Mikhail Yurochkin et al. “Bayesian nonparametric federated learning of neural networks” In International conference on machine learning, 2019, pp. 7252–7261 PMLR
[46] Michael Zhang et al. “Personalized Federated Learning with First Order Model Optimization” In International Conference on Learning Representations, 2021
[47] Yue Zhao et al. “Federated learning with non-iid data” In arXiv preprint arXiv:1806.00582, 2018

Supplementary Material

8 Datasets

8.1 Synthetic datasets.

We evaluate our model on two synthetic datasets: CIFAR-10 and a rotated version of MNIST, referred to as RMNIST. For the latter, we partition MNIST data into $10$ subsets of equal size and we apply a distinct degree of rotation within the range $[0,200]$ to each subset. This rotational variation enhances the diversity within the MNIST dataset, enabling a more realistic representation of non-i.i.d. data.

To partition data into clients, we designed two partition schemes: the shards partition and the dirichlet partition. A visual representation of these two partition schemes is shown in Figure 6. The shards partition scheme involves a balanced non-i.i.d. partition, where each client is allocated an equal number of samples. To construct this scheme, we first sort the data by label, divide it into $200$ shards, and assign two shards to each client. Therefore, most clients solely have examples from two classes. On the other hand, the dirichlet partition exhibits an unbalanced and non-i.i.d. scheme, characterized by varying numbers of data points and class proportions across clients. For each client, we sample $\mathbf{p}_{n}\sim\textnormal{Dir}_{C}(\alpha)$ , where $C$ represents the number of clients, and allocate a $\mathbf{p}_{n,i}$ proportion of class $n$ to local client $i$ . The concentration parameter $\alpha$ regulates the level of data heterogeneity across different clients: the smaller this value, the higher the class imbalance. We set it to $0.3$ by default if not explicitly mentioned.

8.2 Realistic datasets.

To assess the performance of CAFeMe in a more realistic scenario we evaluate it on FEMNIST and Meta-Dataset. While FEMNIST is a well-known dataset in the field of PFL, this is the first work that uses Meta-Dataset to assess the performance of a PFL framework. Meta-Dataset is a large-scale dataset comprising various classification datasets such as Mini-Imagenet [32], FC100 [30], Omniglot [20], Aircraft [25], CUB Birds [39], Describable Textures Dataset (DTD) [4], Traffic Signs [34], and VGG Flowers [29]. In this context, each client is assigned data with $N$ classes sampled from one of these classification datasets. Therefore, while all clients share the same task of classifying images into $N$ classes, the nature of their local data varies significantly, resulting in labels that might have different meanings across clients. This leads to the emergence of the concept shift scenario, a common phenomenon in real-world situations.