DualFed: Enjoying both Generalization and Personalization in Federated Learning via Hierachical Representations

Guogang Zhu buaa˙[email protected] 0000-0002-6381-1420 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang UniversityBeijingChina , Xuefeng Liu liu˙[email protected] 0000-0003-2705-8731 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang UniversityBeijingChina Zhongguancun LaboratoryBeijingChina , Jianwei Niu [email protected] 0000-0003-3946-5107 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang UniversityBeijingChina Zhongguancun LaboratoryBeijingChina , Shaojie Tang [email protected] 0000-0001-9261-5210 Department of Management Science and Systems, University at BuffaloBuffaloNew YorkUnited States , Xinghao Wu [email protected] 0000-0002-6987-3972 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang UniversityBeijingChina and Jiayuan Zhang [email protected] 0009-0006-0424-2930 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang UniversityBeijingChina

(2024)

Abstract.

In personalized federated learning (PFL), it is widely recognized that achieving both high model generalization and effective personalization poses a significant challenge due to their conflicting nature. As a result, existing PFL methods can only manage a trade-off between these two objectives. This raises an interesting question: Is it feasible to develop a model capable of achieving both objectives simultaneously? Our paper presents an affirmative answer, and the key lies in the observation that deep models inherently exhibit hierarchical architectures, which produce representations with various levels of generalization and personalization at different stages. A straightforward approach stemming from this observation is to select multiple representations from these layers and combine them to concurrently achieve generalization and personalization. However, the number of candidate representations is commonly huge, which makes this method infeasible due to high computational costs. To address this problem, we propose DualFed, a new method that can directly yield dual representations correspond to generalization and personalization respectively, thereby simplifying the optimization task. Specifically, DualFed inserts a personalized projection network between the encoder and classifier. The pre-projection representations are able to capture generalized information shareable across clients, and the post-projection representations are effective to capture task-specific information on local clients. This design minimizes the mutual interference between generalization and personalization, thereby achieving a win-win situation. Extensive experiments show that DualFed can outperform other FL methods. Code is available at https://github.com/GuogangZhu/DualFed.

Federated Learning, Generalization, Personalization, Representation Learning

^†^†journalyear: 2024^†^†copyright: acmlicensed^†^†conference: Proceedings of the 32nd ACM International Conference on Multimedia; October 28-November 1, 2024; Melbourne, VIC, Australia^†^†booktitle: Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia^†^†doi: 10.1145/3664647.3681260^†^†isbn: 979-8-4007-0686-8/24/10^†^†ccs: Computing methodologies Distributed artificial intelligence

1. Introduction

Federated learning (FL) (McMahan et al., 2017) is an emerging machine learning paradigm that enables multiple clients to collaboratively train a model while preserving their data privacy. In real-world applications, data distributions across clients are often non-independent and identically distributed (Non-IID). For instance, in video surveillance, the data collected by distributed cameras can vary significantly due to differences in weather and lighting conditions (Hassaballah et al., 2020; Mou et al., 2021; Leroux et al., 2022; Chen et al., 2024). This Non-IID data distributions can significantly degrade the FL model performance (Zhao et al., 2018; Zhu et al., 2021). Currently, there are primarily two objectives to mitigate this issue: improving model generalization to accommodate more clients or enhancing model personalization to better adapt local data distributions. However, since local data distributions often differ from the global distribution in Non-IID FL, these two optimized objectives are typically in conflict.

Personalized federated learning (PFL), which aims to balance model generalization with personalization, serves as an effective approach to address the challenges posed by Non-IID data. Earlier PFL approaches suggest sharing the classifier or encoder, while personalizing the other (Arivazhagan et al., 2019; Collins et al., 2021; Liang et al., 2020). This strategy aims to strike a balance between client collaboration and local adaptation, as presented in Figure 1 (a) and (b). However, these approaches can only ensure that the encoder to generate either the generalized or personalized representations. Thereby, some PFL methods suggest personalizing specific parameters within the encoder, allowing it to extract the representations that exhibit both generalization and personalization (Li et al., 2021; Sun et al., 2021; Shen et al., 2022). Additionally, some PFL techniques concurrently use global and personalized classifiers for predictions (Chen and Chao, 2022; Zhang et al., 2023a) to harmonize generalization and personalization. Nevertheless, these methods inherently involve a trade-off between model generalization and personalization. This leads to an interesting question: Is it feasible to create a model that can achieve both of these objectives concurrently in Non-IID FL?

Refer to caption — Figure 1. Different forms that combines the representations and the classifier. (a) Global encoder with personalized classifier, (b) Personalized classifier with global encoder, (c) Our proposed DualFed that utilizes hierachical representations.

In fact, the dilemma in existed PFL methods primarily because they rely solely on post-encoder representations for decision-making. This design presents a significant hurdle as it necessitates the post-encoder representations to simultaneously exhibit both high generalization and personalization – objectives that are inherently contradictory in Non-IID FL. It is well known that deep models naturally produce hierarchical representations, as evidenced in studies such as (Zeiler and Fergus, 2014; Yosinski et al., 2014; Shwartz-Ziv and Tishby, 2017; Olah et al., 2017; Recanatesi et al., 2019; Bordes et al., 2023; Gui et al., 2023; Wang et al., 2023; Masarczyk et al., 2024). The shallow layers capture general patterns that are transferable across different data distributions. As we delve into deeper layers, the representations become more specified for the downstream task. This implies that both the generalization and personalization that PFL seeks for are already existed within the model. These observations do shed some lights on us: Can we leverage the hierachical representations within the deep model to achieve both high model generalization and personalization simulatenously?

In this paper, we provide a positive response to the question posed earlier. A straightforward method for leveraging hierarchical representations involves directly selecting both generalized and personalized representations from them. However, this approach can incur substantial computational costs, owing to the volume of the candidate representations (Sariyildiz et al., 2023). To address this problem, we introduce DualFed, a new PFL approach that not only straightforward to implement but also effectively decouple these two types of representations. As shown in Figure 1 (c), in DualFed, we modify the commonly used encoder-classifier architecture by inserting a projection network between the encoder and classifier. This modification generates representations at two distinct stages, aligning with the objectives of generalization and personalization, respectively. Specifically, the pre-projection representations generated before the projection network, are isolated from local tasks, making them more transferable across clients. Conversely, the post-projection representations produced after the projection network are closer to the decision layers, being more discriminative and personalized to local data distributions. To align with the objectives of these two representations, we maintain a shared encoder while localizing the projection network. A global classifier and a personalized classifier are trained using the pre-projection and post-projection representations, respectively. During inference, the outputs from these two classifier are combined to yield the final predictions, effectively benefiting from collaboration across clients and local adaptation.

We conduct extensive experiments on multiple datasets to demonstrate the effectiveness of DualFed. The experimental results show that DualFed can outperform state-of-the-art (SOTA) FL methods.

2. Related Work

Federated Learning. FL (Li et al., 2020a; Kairouz et al., 2021) can be categorized into general FL (GFL) (McMahan et al., 2017; Karimireddy et al., 2020; Li et al., 2020b) and personalized FL (PFL) (Tan et al., 2022b; Li et al., 2021; Collins et al., 2021; Arivazhagan et al., 2019; Liang et al., 2020; Sun et al., 2021; Zheng et al., 2022; Wu et al., 2023, 2024a, 2024b). GFL aims to develop a generalized model that can be shared across clients. However, in Non-IID FL, it becomes challenging for a global model to satisfy the diverse needs of multiple clients, often leading to significant performance degradation (Zhao et al., 2018; Zhu et al., 2021). Consequently, PFL has emerged as an effective solution for these Non-IID situations by introducing model personalization to better align with local data distributions. There are various approaches to implement PFL, including model clustering (Sattler et al., 2020; Ghosh et al., 2020; Caldarola et al., 2021; Cai et al., 2023), and the personalization of specific parameters within the model (Li et al., 2021; Collins et al., 2021; Arivazhagan et al., 2019; Liang et al., 2020; Sun et al., 2021). However, these PFL methods can only manage a trade-off between model generalization and personalization, as they expect the post-encoder representations to achieve the conflicting objectives.

Representation Learning in Deep Models. Since advanced deep learning models are typically organized as hierachical layers, analysing how representations evolve during the representation extraction process has been an established field (Yosinski et al., 2014; Olah et al., 2017; Masarczyk et al., 2024; Wang et al., 2023; Zeiler and Fergus, 2014). Previous research indicates that deep models start by extracting generalized features and progressively filter out irrelevant components, retaining only those crucial for downstream tasks (Yosinski et al., 2014; Masarczyk et al., 2024). This has inspired numerous studies that leverage intermediate representations, in domains like object detection (Lin et al., 2017). However, selecting the optimal representations for each specific problem is computationally challenging (Sariyildiz et al., 2023). In response, SimCLR (Chen et al., 2020b) proposes to use a scalable projection network during training and discard it afterwards. This design has become a common practice in both supervised learning (Khosla et al., 2020; Feng et al., 2021; Wang et al., 2022) and self-supervised learning (Chen et al., 2020a; Grill et al., 2020; Caron et al., 2020; Zbontar et al., 2021; Chen and He, 2021). Since then, numerous studies have explored the projector’s role in model training from empirical (Sariyildiz et al., 2023; Wang et al., 2022; Li et al., 2022; Bordes et al., 2023) and theoretical perspectives (Jing et al., 2022; Wang et al., 2023; Xue et al., 2024). The common explanation is that the projection network differentiates the representations of the pre-training and downstream tasks, thereby enhancing the model transferability (Wang et al., 2022). This situation is especially significant when the pre-training and downstream tasks are misaligned (Bordes et al., 2023). Nevertheless, the effects of projection network within FL are still not fully understood.

Federated Learning within Representation Space. The primary contribution of these methods is the regularization of the representation space to mitigate data heterogeneity (Tan et al., 2022a; Zhou et al., 2023; Zhu et al., 2022; Zhang et al., 2023a, b; Luo et al., 2021; Qi et al., 2023; Long et al., 2023). A straightforward strategy in these approaches involves directly calibrating the representation space. For instance, CCVR (Luo et al., 2021) post-calibrates the classifier after federated training using virtual representations. Another research direction links performance degradation to the misalignment of representation spaces across clients (Zhu et al., 2022; Zhou et al., 2023). In response, various methods have been developed to explicitly align the representation space across clients. Notably, FedProto (Tan et al., 2022a), AlignFed (Zhu et al., 2022, 2023a), and FedFA (Zhou et al., 2023) use class-wise representation centers for representation alignment. Additionally, some methods achieve alignment by implementing a fixed classifier. For instance, FedBABU (Oh et al., 2022) employs a randomly initialized classifier, SphereFed (Dong et al., 2022) introduces an orthogonal classifier, while FedETF (Li et al., 2023) implements an ETF (Equiangular Tight Frame) classifier during model training. However, these methods primarily focus on extracting generalized representations shareable across clients, often overlooking the personalized representations specific to local tasks. Consequently, recent studies have focused on balancing both model generalization and personalization (Zhu et al., 2023b; Chen and Chao, 2022). For example, Fed-RoD (Chen and Chao, 2022) achieving this goal combining the predictions of personalized and global classifiers. Yet, these methods face challenges, as they rely solely on representations at the same stage. Expecting the single-stage representations to exhibit both generalization and personalization is often intertwined.

3. Preliminaries

3.1. Federated Learning

In this paper, we consider a standard PFL setting which consists of a central server and $M$ distributed clients. For each client $m\in[M]$ , there are totally $N_{m}$ samples $\{\bm{x}_{m}^{i},\bm{y}_{m}^{i}\}_{i=1}^{N_{m}}$ drawn from the distribution $\mathcal{D}_{m}$ , where $\bm{x}_{m}^{i}\in\mathcal{X}_{m}\subseteq\mathbb{R}^{n}$ represents the raw input and $\bm{y}_{m}^{i}\in\mathcal{Y}_{m}\subseteq\{0,1\}^{C}$ represents the corresponding label, with $C$ denoting the total number of classes. In Non-IID scenarios within PFL, the data distributions are assumed to be heterogeneous across clients, indicating that $\mathcal{D}_{i}\neq\mathcal{D}_{j},\forall i,j\in\{1,2,\dots M\},i\neq j$ .

The goal of a standard PFL setting is to develop a model $\psi_{m}(\cdot)$ parameterized by $\Theta_{m}$ for client $m$ . The corresponding optimization objective can be expressed as:

(1)

\text{arg}\ \underset{\Theta_{1},\ldots,\Theta_{M}}{\text{min}}\mathcal{L}(\Theta_{1},\ldots,\Theta_{M})\triangleq\text{arg}\underset{\Theta_{1},\ldots,\Theta_{M}}{\text{min}}\frac{1}{M}\sum_{m=1}^{M}\mathcal{L}_{m}(\Theta_{m}),

where $\mathcal{L}(\Theta_{1},\ldots,\Theta_{M})$ represents the overall optimization objective for the PFL system, $\mathcal{L}_{m}(\Theta_{m})$ denotes the empirical risk for client $m$ . In PFL, directly optimizing $\mathcal{L}(\Theta_{1},\ldots,\Theta_{M})$ is commonly infeasible as the clients cannot access the data on other clients. Therefore, a PFL training procedure typically involves the independently local updating performed on participating clients utilizing their own empirical risk and the model aggregation performed on the server. Specifically, for client $m$ , its empirical risk is defined as:

(2)

\mathcal{L}_{m}(\Theta_{m}):=\frac{1}{N_{m}}\sum_{i=1}^{N_{m}}\ell(\bm{y}_{m}^{i},\hat{\bm{y}}_{m}^{i}),

with $\hat{\bm{y}}_{m}^{i}=\psi_{m}(\bm{x}_{m}^{i};\Theta_{m})$ representing the model’s prediction for $\bm{x}_{m}^{i}$ , and $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$ being the loss function that quantifies the prediction error (e.g., cross-entropy loss).

Once the local training on clients is completed, the participating clients upload their updated global parameters within the model to the server. The server then averages the parameters at corresponding positions to generate new global parameters. These global parameters are subsequently distributed to the clients for the next round of local updating. By iteratively performing local training and model aggregation, PFL facilitates collaborative model training without the need to share raw data from the clients.

For the sake of brevity, we occasionally omit the superscript denoting the sample index in subsequent sections of this paper. Additionally, we sometimes denote personalized parameters with the superscrip $p$ (e.g., $\Theta_{m}^{p}$ ), and global parameters with the superscript $s$ (e.g., $\Theta_{m}^{s}$ ), to clarify the expressions in the following sections.

3.2. Motivation of DualFed

As shown Figure 1 (a) and (b), in previous studies of PFL, the model $\Theta_{m}$ is commonly divided into an encoder $f_{m}(\cdot)$ and a classifier $h_{m}(\cdot)$ (Oh et al., 2022; Zhou et al., 2023; Zhu et al., 2022; Arivazhagan et al., 2019; Collins et al., 2021), parameterized by $\theta_{m}^{f}$ and $\theta_{m}^{h}$ , respectively. The encoder $f_{m}(\cdot):\mathcal{X}_{m}\rightarrow\mathcal{Z}_{m}$ generally consists of a series of stacked convolutional layers. It maps the raw input $\bm{x}_{m}$ from $\mathcal{X}_{m}\subseteq\mathbb{R}^{n}$ into a representation space $\mathcal{Z}_{m}\subseteq\mathbb{R}^{k}$ , which is denoted as $\bm{z}_{m}=f(\bm{x}_{m};\theta_{m}^{f})$ . Here, $\bm{z}_{m}\in\mathcal{Z}_{m}$ denotes the representation generated from $\bm{x}_{m}$ utilizing the encoder $f_{m}(\cdot)$ . Practically, the dimension of this representation is significantly smaller than that of the raw input, which implies that $k\ll n$ . The classifier, $h_{m}(\cdot):\mathcal{Z}_{m}\rightarrow\mathcal{Y}_{m}$ , generally includes a fully connected (FC) layer and a softmax layer. It generates the normalized predictions $\hat{\bm{y}}_{m}$ based on the representation $\bm{z}_{m}$ , which is indicated as $\hat{\bm{y}}_{m}=h_{m}(\bm{z}_{m};\theta_{m}^{h})$ .

Nevertheless, within the encoder-classifier architecture, only the representations after the encoder, referred to as the post-encoder representations, are used for decision-making. This approach can lead to a dilemma in PFL, as generalization and personalization are contradictory objectives, particularly in Non-IID scenarios. More specifically, to enhance model generalization, the post-encoder representations should capture shared information across varying data distributions among clients. On the other hand, enhancing model personalization requires these representations to capture specific information aligned with each client’s local data distribution. When the data distribution varies significantly across clients, these two types of information can be vastly different. Consequently, in this encoder-classifier architecture, ensuring that the post-encoder representations simultaneously meet these two conflicting objectives is a challenging task.

To address the dilemma mentioned earlier, we shift our focus on the process of representation extraction within the deep models. Advanced deep models are typically organized in a hierachical architecture. As shown in previous studies, these models initially extract generalized representations that are transferable across various data distributions (Zeiler and Fergus, 2014; Yosinski et al., 2014; Shwartz-Ziv and Tishby, 2017; Olah et al., 2017; Recanatesi et al., 2019; Bordes et al., 2023; Gui et al., 2023; Wang et al., 2023; Masarczyk et al., 2024). As the model progresses to deeper layers, it gradually discards irrelevant components and retains only information relevant to the specific task. In other words, both the generalized and personalized representations that PFL seeks for are already existed within the model. By leveraging these hidden generalized and personalized representations, we can achieve both high generalization and personalization in PFL. However, directly extracting these specific representations during the representation extraction phase is computationally challenging (Sariyildiz et al., 2023). Therefore, DualFed adopts a simpler strategy by incorporating a personalized projection network, which effectively decouples the generalized and personalized representations.

4. Method

4.1. Framework Overview of DualFed

Figure 2 presents the framework of DualFed. It aligns with the standard training framework of PFL, which includes iterative local training on clients and global model aggregation on the server. The key innovation in DualFed, as compared to previous PFL methods, is the integration of a personalized projection network situated between the encoder and the classifier. We refer to this personalized projection network as $g^{p}_{m}(\cdot)$ , with its parameters denoted by $\theta_{m}^{g,p}$ . Functionally, this projection network, $g^{p}_{m}(\cdot):\mathcal{Z}_{m}\rightarrow\mathcal{U}_{m}$ is usually a MLP (multi-layer perceptron). By inserting this projection network, the representations produced by the encoder are not directly inputted into the classifier for prediction. Instead, they first pass through the projection network, which remaps them to a personalized representation space $\mathcal{U}_{m}\subseteq\mathbb{R}^{d}$ . Formally, we represent this process as $\bm{u}_{m}=g(\bm{z}_{m};\theta_{m}^{g,p})$ . For clarity, we term the representation before the projection network (i.e., $\bm{z}_{m}$ ) as the pre-projection representations, and the representation after the projection network (i.e., $\bm{u}_{m}$ ) as the post-projection representations.

Drawing on the hierarchical nature of deep model representation extraction, the pre-projection and post-projection representations in our framework exhibit distinct characteristics, aligning with the generalized and personalized objectives of PFL, respectively. Specifically, the pre-projection representations are separated from the final outputs by the projector network, meaning that they are not directly tied to the local tasks on each client. As previously studies have shown, these pre-projection representations are easier transferred across different data distributions (Wang et al., 2022; Sariyildiz et al., 2023). Therefore, in DualFed, the post-projection representations are fed into a global classifier $h_{m}^{s}(\cdot)$ , which is parameterized by $\theta^{h,s}_{m}$ . Additionally, to encourage the encoder to extract more generalized information, we let the encoder be shared among clients in DualFed. The predictions from this global classifier is expressed as:

(3)

\hat{\bm{y}}^{s}_{m}=h_{m}^{s}\circ f_{m}^{s}(\bm{x}_{m}),\forall m\in[M].

Conversely, the post-projection representations are more closely aligned with the final outputs. This implies that these representations are more pertinent to accomplishing tasks related to the local data distribution. In DualFed, to effectively adapt to these local distributions, we utilize a personalized classifier $h^{p}_{m}(\cdot)$ for each client, parameterized by $\theta_{m}^{h,p}$ , to adapt to the local data distribution. For a given input $\bm{x}_{m}$ , the prediction generated by this personalized classifier can be expressed as follows:

(4)

\hat{\bm{y}}^{p}_{m}=h^{p}_{m}\circ g^{p}_{m}\circ f_{m}^{s}(\bm{x}_{m}),\forall m\in[M].

During inference, the final predictions are derived by ensembling the outputs from both the global classifier and the personalized classifier. This process is expressed as follows:

(5)

\hat{\bm{y}}_{m}=\hat{\bm{y}}_{m}^{p}+\hat{\bm{y}}_{m}^{s},\forall m\in[M].

By integrating a personalized projection network between the encoder and the classifier, DualFed effectively separates the contradictory optimization objectives inherent in PFL into distinct stages within the model. This approach resolves the conflict of pursuing contradictory objectives within the representations in the same stage, thereby can achieve a win-win situation between the model generalization and personalization.

4.2. Local Training on Client

In DualFed, each client updates the model for $E$ rounds using its own datasets after revceiving the global models from the sever. In order to fully exploit the hierarchical characteristics of deep model representations and achieve the optimization objectives of PFL, we introduce a stage-wise training procedure for local clients.

At the first stage, we freeze the global classifier and training the main branch of the model. The main branch comprises the global encoder, the personalized projector, and the personalized classifier, with their parameters collectively represented as $\overline{\Theta}_{m}:=\{\theta^{f,s}_{m},\theta^{g,p}_{m},\theta^{h,p}_{m}\}$ . This stage allows the model to extract both generalized and personalized representations. To ensure the model’s effectiveness in accomplishing local tasks, we employ cross-entropy loss as the classification loss, as indicated in the following equation:

(6)

\mathcal{L}_{pcls}=\sum_{i=1}^{N_{m}}\sum_{c=1}^{C}\bm{y}_{m}^{i,c}\log(\hat{\bm{y}}^{p,i,c}_{m}),

where $\bm{y}_{m}^{i,c}$ denotes the value at $c^{th}$ class of the one-hot ground-truth label of the $i^{th}$ sample on client $m$ , $\hat{\bm{y}}^{p,i,c}_{m}$ represents the normalized prediction probability of $c^{th}$ classes of the $i^{th}$ sample on client $m$ from the personalized classifier.

As the post-projection representations are tailored to adapt to the local data distribution, we further enhance its discrimination by implementing supervised contrastive loss (Khosla et al., 2020), as demonstrated in the following equation:

(7)

\mathcal{L}_{con}=-\frac{1}{N_{m}}\sum_{i=1}^{N_{m}}\frac{1}{|A(i)|}\sum_{j\in A(i)}\log\frac{\exp(\bm{u}_{i}\odot\bm{u}_{j}/\tau)}{\sum_{a\in A\backslash\{i\}}\exp(\bm{u}_{i}\odot\bm{u}_{a}/\tau)}

where $A$ is the full set of samples, $A(i)$ consists of samples in $A$ that belong to the same class as $\bm{x}_{m}^{i}$ , $\odot$ is the cosine similarity, and $\tau\in\mathcal{R}^{+}$ is the temperature coefficient.

The optimization objective at this stage is then defined as:

(8)

\overline{\Theta}_{m}^{t}=\text{arg}\ \underset{\overline{\Theta}_{m}}{\text{min}}\ \mathcal{L}_{pcls}+\lambda\mathcal{L}_{con},

where $\lambda$ denotes the hyperparameter used for balancing these two loss terms, $t$ denotes the local updating epochs.

After updating $\overline{\Theta}_{m}$ , we freeze its parameters and train the global classifier using the pre-projection representations to fulfill the local task, as represented in the following equation:

(9)

\mathcal{L}_{scls}=\sum_{i=1}^{N_{m}}\sum_{c=1}^{C}\bm{y}_{m}^{i,c}\log(\hat{\bm{y}}^{s,i,c}_{m}),

where $\bm{y}_{m}^{i,c}$ denotes the value at $c$ class of the one-hot ground-truth label of the $i_{th}$ sample on client $m$ , $\hat{\bm{y}}^{s,i,c}_{m}$ represents the normalized prediction probability of $c$ classes of the $i_{th}$ sample on client $m$ from the global classifier.

The optimization objective in this stage can be expressed as:

(10)

\theta^{h,s,t}_{m}=\text{arg}\ \underset{\theta^{h,s}_{m}}{\text{min}}\ \mathcal{L}_{scls}.

In DualFed, both optimization objectives, as described in Eqs. (8) and (10), are optimized using mini-batch stochastic gradient descent (SGD). As evidenced by our experiments, this stage-wise optimization strategy diminishes the impact of local tasks on the pre-projection representations, thereby effectively preserving its generalization.

4.3. Model Aggregation on Server

Once the local updating process is complete, the clients send their global encoder and classifier parameters to the server. The server then aggregates these parameters using the following equation:

(11)

\tilde{\theta}^{f,s}=\sum_{m=1}^{M}\frac{1}{M}\theta^{f,s}_{m},\quad\tilde{\theta}^{h,s}=\sum_{m=1}^{M}\frac{1}{M}\theta^{h,s}_{m}.

Following the model aggregation, the server broadcast the updated model back to the clients to for subsequent local training.

5. Experiment

5.1. Dataset Description

Our experiments are conducted on PACS (Li et al., 2017), DomainNet (Peng et al., 2019), and Office-Home (Venkateswara et al., 2017), each of them consists of several domains. PACS includes $4$ distinct domains: Photo (P), Art Painting (A), Cartoon (C), and Sketch (S), each featuring images from $7$ common categories. DomainNet encompasses $6$ distinct domains: Clipart (C), Infograph (I), Painting (P), Quickdraw (Q), Real (R), and Sketch (S). Initially, each domain comprises $345$ classes, but for our study, we narrow this down to $10$ commonly used classes to create our experimental dataset. Office-Home contains $4$ distinct domains: Art (A), Clipart (C), Product (P), and Real-World (R), each containing $65$ classes. We retain all classes within Office-Home to conduct a comprehensive evaluation of DualFed on a larger-scale dataset.

For these datasets, we select the images from a single domain to form the dataset of an individual client. In both PACS and DomainNet, we choose a subset of $500$ training images per client from the same domain for the training dataset. For Office-Home, we set the number of training samples to $2,000$ for the Clipart, Product, and Real-World domains. In the case of the Art domain, the number is limited to $1,942$ , matching the total number of samples available in this domain. All the images from the test dataset are reserved for evaluation for these datasets. We apply random flipping and rotational augmentations to these images during the training.

5.2. Compared Methods

We perform a comparative analysis against the following methods, including FedAvg(McMahan et al., 2017), FedProx(Li et al., 2020b), FedPer(Arivazhagan et al., 2019), FedRep(Collins et al., 2021), LG-FedAvg(Liang et al., 2020), FedBN(Li et al., 2021), FedProto(Tan et al., 2022a), SphereFed(Dong et al., 2022), Fed-RoD(Chen and Chao, 2022), FedETF(Li et al., 2023). Additionally, the SingleSet method, where separate models are trained and tested for each client using only their private data, is also used for comparison in our experiments.

5.3. Implementation Details

The adopted encoder is from the one of the ResNet18 model pretrained on the ImageNet dataset (He et al., 2016). It is followed by a projector network, which consists of an FC network with the architecture: [Linear( $512$ , $256$ ) - ReLU - BN - Linear( $256$ , $512$ ) - BN]. To ensure uniform model capacity, all compared methods employ this Encoder-Projector architecture for representation extraction.

The learning rate is set $0.01$ , with a momentum of $0.5$ , for all methods except SphereFed . For SphereFed, we set the learning rate to $1.0$ for Office-Home and to $0.1$ for both DomainNet and PACS. The batch size is set to $256$ for all methods. The epoch of local updating is set to $1$ for all methods except FedRep. For FedRep, it has a total of $5$ local epochs, with the initial 4 epochs focusing on classifier optimization and the last epoch on encoder and projector optimization. The total global rounds is set to $300$ .

The other hyperparameters for different methods are selected by grid searching. To mitigate cross-domain interference and potential privacy issues related to BN layers, we localize the running-mean and running-var components within these layers for all methods.

To ensure the reliability of our results, each experiment is repeated $5$ times with random seeds: $\{0,1,2,3,4\}$ . The subsequent sections will detail the mean and standard deviation of the highest test accuracy achieved during FL training.

5.4. Experimental Results

Tables 1 - 3 showcase the experimental results of our proposed DualFed alongside other FL methods on the PACS, Office-Home, and DomainNet datasets, respectively. Notably, DualFed presents a significant performance gain in comparison to these SOTA methods.

Interestingly, the SingleSet model stands out as a strong benchmark, despite not collaborating with other clients, particularly in simpler domains, such as the Quickdraw domain in the DomainNet dataset. The underlying reason is that these simpler domains requires less complex semantic information for downstream tasks. In these cases, a personalized encoder’s representations are sufficient, and collaboration for extensive semantic extraction might be unnecessary or even detrimental. This observation is supported by LG-FedAvg’s performance, which, while also utilizing a personalized encoder for representation extraction, outperforms SingleSet by leveraging collaborative training for a global classifier.

Table 1. Experimental Results on PACS Dataset.

Method	P	A	C	S	Avg.
SingleSet	97.78 $\pm$ 0.56	88.12 $\pm$ 0.25	89.19 $\pm$ 0.37	91.01 $\pm$ 0.73	91.52 $\pm$ 0.10
FedAvg	97.72 $\pm$ 0.56	89.24 $\pm$ 1.01	89.32 $\pm$ 0.60	91.01 $\pm$ 0.70	91.82 $\pm$ 0.34
FedProx	97.90 $\pm$ 0.38	89.14 $\pm$ 1.18	89.40 $\pm$ 0.61	91.52 $\pm$ 0.72	91.99 $\pm$ 0.38
FedPer	98.20 $\pm$ 0.42	89.54 $\pm$ 1.16	91.28 $\pm$ 0.75	91.29 $\pm$ 0.60	92.58 $\pm$ 0.57
FedRep	97.84 $\pm$ 0.35	89.83 $\pm$ 1.33	89.96 $\pm$ 0.27	91.39 $\pm$ 0.48	92.25 $\pm$ 0.22
LG-FedAvg	97.60 $\pm$ 0.54	88.46 $\pm$ 0.45	89.74 $\pm$ 0.30	91.36 $\pm$ 0.66	91.79 $\pm$ 0.24
FedBN	92.20 $\pm$ 0.46	89.88 $\pm$ 0.86	90.38 $\pm$ 0.75	91.34 $\pm$ 0.53	92.45 $\pm$ 0.37
FedProto	97.90 $\pm$ 0.19	91.15 $\pm$ 0.50	92.22 $\pm$ 0.61	92.99 $\pm$ 0.59	93.57 $\pm$ 0.34
SphereFed	98.26 $\pm$ 0.35	88.95 $\pm$ 0.87	91.11 $\pm$ 0.42	91.03 $\pm$ 0.82	92.34 $\pm$ 0.26
Fed-RoD	98.02 $\pm$ 0.36	88.85 $\pm$ 1.04	89.79 $\pm$ 0.49	90.85 $\pm$ 0.59	91.88 $\pm$ 0.31
FedETF	97.43 $\pm$ 0.24	90.95 $\pm$ 0.77	90.26 $\pm$ 0.29	90.70 $\pm$ 0.68	92.33 $\pm$ 0.30
DualFed	98.32 $\pm$ 0.24	92.47 $\pm$ 0.42	94.91 $\pm$ 0.63	94.32 $\pm$ 0.61	95.01 $\pm$ 0.31

Table 2. Experimental Results on Office-Home Dataset.

Method	A	C	P	R	Avg.
SingleSet	66.52 $\pm$ 1.27	74.27 $\pm$ 0.60	87.46 $\pm$ 1.02	77.54 $\pm$ 0.58	76.45 $\pm$ 0.32
FedAvg	68.82 $\pm$ 1.30	74.91 $\pm$ 1.02	85.82 $\pm$ 0.36	80.30 $\pm$ 0.53	77.46 $\pm$ 0.35
FedProx	68.78 $\pm$ 1.37	74.73 $\pm$ 0.79	85.73 $\pm$ 0.35	80.25 $\pm$ 0.70	77.37 $\pm$ 0.33
FedPer	70.31 $\pm$ 1.07	75.03 $\pm$ 0.38	87.76 $\pm$ 0.18	80.51 $\pm$ 0.43	78.40 $\pm$ 0.40
FedRep	70.23 $\pm$ 0.96	75.44 $\pm$ 0.69	85.82 $\pm$ 0.45	80.39 $\pm$ 0.92	77.97 $\pm$ 0.37
LG-FedAvg	67.22 $\pm$ 1.30	75.33 $\pm$ 0.19	87.44 $\pm$ 0.43	77.80 $\pm$ 0.27	76.94 $\pm$ 0.26
FedBN	68.58 $\pm$ 1.23	76.01 $\pm$ 0.45	86.31 $\pm$ 0.96	79.40 $\pm$ 0.40	77.58 $\pm$ 0.29
FedProto	67.92 $\pm$ 0.74	75.76 $\pm$ 0.57	87.80 $\pm$ 0.30	77.89 $\pm$ 0.41	77.34 $\pm$ 0.25
SphereFed	66.68 $\pm$ 0.89	69.12 $\pm$ 0.82	81.92 $\pm$ 0.95	76.76 $\pm$ 0.28	73.62 $\pm$ 0.48
Fed-RoD	68.21 $\pm$ 0.86	75.42 $\pm$ 0.37	86.40 $\pm$ 0.72	80.30 $\pm$ 0.79	77.58 $\pm$ 0.22
FedETF	69.90 $\pm$ 1.14	74.64 $\pm$ 0.41	85.52 $\pm$ 0.35	80.18 $\pm$ 0.39	77.56 $\pm$ 0.29
DualFed	71.01 $\pm$ 0.71	77.41 $\pm$ 0.47	88.84 $\pm$ 0.47	81.70 $\pm$ 0.28	79.74 $\pm$ 0.37

Table 3. Experimental Results on DomainNet Dataset.

Method	C	I	P	Q	R	S	Avg.
SingleSet	88.25 $\pm$ 0.81	50.99 $\pm$ 1.24	89.60 $\pm$ 1.00	82.78 $\pm$ 0.43	94.07 $\pm$ 0.12	88.16 $\pm$ 0.53	82.31 $\pm$ 0.19
FedAvg	89.47 $\pm$ 0.97	53.70 $\pm$ 0.86	89.60 $\pm$ 0.52	80.58 $\pm$ 0.80	92.85 $\pm$ 0.52	88.56 $\pm$ 0.58	82.46 $\pm$ 0.33
FedProx	89.47 $\pm$ 0.86	53.79 $\pm$ 0.96	89.56 $\pm$ 0.55	80.56 $\pm$ 0.87	92.87 $\pm$ 0.54	88.63 $\pm$ 0.56	82.48 $\pm$ 0.38
FedPer	89.70 $\pm$ 0.81	54.22 $\pm$ 0.68	92.12 $\pm$ 0.98	82.18 $\pm$ 0.65	94.76 $\pm$ 0.41	89.57 $\pm$ 0.66	83.76 $\pm$ 0.32
FedRep	89.62 $\pm$ 0.76	54.19 $\pm$ 0.71	90.60 $\pm$ 0.37	80.84 $\pm$ 0.91	93.03 $\pm$ 0.49	89.03 $\pm$ 0.77	82.88 $\pm$ 0.24
LG-FedAvg	88.56 $\pm$ 0.83	51.54 $\pm$ 1.18	89.89 $\pm$ 0.78	82.68 $\pm$ 0.74	94.20 $\pm$ 0.32	88.59 $\pm$ 0.70	82.58 $\pm$ 0.09
FedBN	89.85 $\pm$ 0.67	54.58 $\pm$ 1.04	91.34 $\pm$ 0.90	80.62 $\pm$ 0.68	93.76 $\pm$ 0.44	89.06 $\pm$ 0.41	83.20 $\pm$ 0.36
FedProto	90.04 $\pm$ 0.86	54.31 $\pm$ 0.91	92.18 $\pm$ 0.55	84.82 $\pm$ 0.67	94.82 $\pm$ 0.25	90.40 $\pm$ 0.56	84.43 $\pm$ 0.30
SphereFed	88.97 $\pm$ 0.52	51.02 $\pm$ 1.63	90.69 $\pm$ 0.43	78.50 $\pm$ 1.18	92.65 $\pm$ 0.33	88.77 $\pm$ 0.54	81.77 $\pm$ 0.48
Fed-RoD	89.70 $\pm$ 0.99	52.91 $\pm$ 0.89	90.18 $\pm$ 0.51	81.64 $\pm$ 0.50	93.03 $\pm$ 0.46	88.88 $\pm$ 0.73	82.72 $\pm$ 0.25
FedETF	88.97 $\pm$ 0.81	55.65 $\pm$ 0.85	91.76 $\pm$ 0.52	79.76 $\pm$ 0.48	94.15 $\pm$ 0.23	89.03 $\pm$ 0.39	83.22 $\pm$ 0.31
DualFed	92.51 $\pm$ 0.41	56.77 $\pm$ 0.95	94.41 $\pm$ 0.30	85.18 $\pm$ 0.30	94.69 $\pm$ 0.08	92.27 $\pm$ 0.54	86.14 $\pm$ 0.12

However, as the complexity within a domain increases, such as in the Infograph domain of DomainNet, the benefits of sharing the encoder among clients become apparent. This collaborative approach allows the encoder to extract more nuanced semantic information from the raw data, improving overall model performance, as demonstrated by the results of FedAvg and FedProx. FedRep and FedPer, employing a personalized classifier to adapt the representations from the global encoder, often outperform FedAvg and FedProx. However, these methods primarily leverage the global encoder’s representations and do not fully utilize personalized information to cater to the local data distribution on individual clients.

FedProto significantly improves model performance by aligning representations from different clients within a unified representation space. Nonetheless, this alignment can result in a loss of semantic information pertinent to local tasks due to varying data distributions across clients. This issue is even more pronounced in models like SphereFed and FedETF, which employ a predefined classifier for representation alignment and lack specific semantic information about local data.

Fed-RoD adopts an architecture similar to ours, utilizing both global and personalized classifiers to capture generalized and personalized information. However, it attempts to utilize representations at the same stage, posing challenges in simultaneously meeting these two contradictory objectives. In contrast, our proposed method strategically separates these two conflicting objectives into different stages of the model. This division allows us to achieve both generalization and personalization more effectively, ultimately resulting in superior performance across a wider range of scenarios.

5.5. Additional Analysis

Comparison of global and personalized classifiers. To gain a deeper understanding of the behavior of the global and personalized classifiers, we compare their accuracy, individually and in combination, during training. Figure 3 shows the corresponding experimental results on DomainNet. It is evident that personalized classifier significantly surpasses the global one, owing to its better alignment with local data distributions. Nevertheless, the accuracy of the local classifier can be significantly improved by combining its predictions with those from the global classifier. This enhancement is particularly notable in complex domains, such as Infograph. Conversely, in simpler domains like Quickdraw and Sketch, the benefit of combining classifiers becomes less pronounced. This occurs because, in simpler domains, the representations extracted by the personalized projection network are sufficient for each client’s local tasks, thereby reducing the necessity for more diverse representations from the global encoder.

Visualization of generalized and personalized representations. To intuitively understand the generalized and personalized representations, we utilize t-SNE (Van der Maaten and Hinton, 2008) for visualization. Figure 4 illustrates the visualization of both the generalized and personalized representations on DomainNet dataset. In these visualizations, different colors indicate different classes. It is noticeable that the personalized representations are more discriminative than the generalized ones, yet they exhibit lower consistency across clients. This demonstrates that DualFed can effectively separate the representation extraction process into two distinct stages, each characterized by high levels of generalization and personalization, respectively.

Quantitative evaluation of generalized and personalized representations. We employ two metrics to quantitatively evaluate the evolution of generalized and personalized representations during training. To quantify the personaliztion of representations on clients, we adopt the class-wise separation in (Kornblith et al., 2021). Additionally, we adopt the linear centered kernel alignment (CKA) (Kornblith et al., 2019), to measure the generalization ability of representation. Figure 5 presents the varying of $separation$ during the training. The personalized representations can achieve higher class separation compared with the generalized representations. However, as shown in Figure 6, the similarity between clients of generalized representations is significant higher that that of the personalized representations.

Comparison of Training Strategy. DualFed employs a stage-wise training strategy, ensuring that the pre-projection representation remain undisturbed by specific local tasks, thereby maintaining its generalization. Here, we compare this training strategy with the one that training all parameters simultaneously. As shown in Table 4, when $E$ is relatively small (i.e., $E=1$ ), simultaneous training can, in fact, outperforms stage-wise training. However, as $E$ increases (i.e., $E=20$ ), simultaneous training lead to a obvious performance drop in PACS and DomainNet. This trend can be attributed to the fact that an increased number of local epochs causes the pre-projection representations to align more closely with the local task, thereby reducing their generalization.

Table 4. Experiments with Different Training Strategy.

$E$	Strategy	PACS	DomainNet	Office-Home
$1$	Stage-wise	95.01 $\pm$ 0.31	86.14 $\pm$ 0.12	79.74 $\pm$ 0.37
$1$	Simu.	95.15 $\pm$ 0.16	86.68 $\pm$ 0.20	80.57 $\pm$ 0.09
$20$	Stage-wise	94.17 $\pm$ 0.28	84.49 $\pm$ 0.18	75.93 $\pm$ 0.77
$20$	Simu.	93.85 $\pm$ 0.30	84.71 $\pm$ 0.33	75.42 $\pm$ 0.65

Effect of Projector Architecture. We investigate the impact of the architecture of the projection network in three key aspects: the model depth ( $D$ ), the dimension of hidden layers ( $H$ ), the impact of BN layers. The corresponding results are shown in Table 5. While increasing $D$ can lead to more generalized pre-projection representations, it simultaneously reduces their discriminative power. Therefore, it is advisable to select an optimal $D$ that maintains a balance in the discriminative and generalized ability of the pre-projection representations. Increasing $H$ can enhance the model performance in most times. The importance of BN layers becomes more pronounced as the scale of the dataset increases.

Table 5. Experiments with Different Projector Architecture.

$D$	$H$	BN	PACS	DomainNet	Office-Home
1	256	✓	94.72 $\pm$ 0.18	86.16 $\pm$ 0.09	79.96 $\pm$ 0.24
2	256	✓	95.01 $\pm$ 0.31	86.14 $\pm$ 0.12	79.74 $\pm$ 0.37
3	256	✓	94.97 $\pm$ 0.18	85.91 $\pm$ 0.26	79.31 $\pm$ 0.36
2	64	✓	95.35 $\pm$ 0.19	86.06 $\pm$ 0.32	79.43 $\pm$ 0.24
2	128	✓	95.15 $\pm$ 0.18	85.95 $\pm$ 0.18	79.49 $\pm$ 0.21
2	512	✓	95.21 $\pm$ 0.17	86.23 $\pm$ 0.23	79.97 $\pm$ 0.35
2	256	✗	95.13 $\pm$ 0.19	86.23 $\pm$ 0.26	79.22 $\pm$ 0.38

Effect of Position of Global Classifier. In DualFed, we employ a global classifier for generalized representations and a personalized classifier for personalized representations. Here we conduct experiments when placing the global classifier after the projector. In these experiments, we maintained a shared encoder and investigated two configurations: sharing the projection network (DualFed-G) and personalizing it (DualFed-P). As indicated in Table 6, removing the global classifier to the same stage as the personalized classifier results in a significant performance decrease. This underscores the importance of the representations at different stages, as they provide complementary information.

Table 6. Experimental Results when Placing Global Classifier at Different Positions.

	PACS	DomainNet	Office-Home
DualFed	95.01 $\pm$ 0.31	86.14 $\pm$ 0.12	79.74 $\pm$ 0.37
DualFed-P	94.95 $\pm$ 0.18	85.55 $\pm$ 0.09	78.24 $\pm$ 0.29
DualFed-G	94.84 $\pm$ 0.12	84.90 $\pm$ 0.42	78.08 $\pm$ 0.17

Effect of Personalized Layers. Table 7 presents the model performance with different personalization strategy. The results indicate that combining a global encoder with a personalized projection network significantly enhances model performance, as it integrates both generalized and personalized information.

Table 7. Experimental Results with Different Parameter Personalized Strategies, where ✓ Denotes the Personalized Parameters, ✗ Denotes the Global Parameters.

Enc.	Prj.	P.C.	G.C.	PACS	DomainNet	Office-Home
✗	✓	✓	✓	94.96 $\pm$ 0.26	86.16 $\pm$ 0.27	79.33 $\pm$ 0.41
✗	✓	✓	✗	95.01 $\pm$ 0.31	86.11 $\pm$ 0.19	79.74 $\pm$ 0.28
✗	✗	✗	✗	94.58 $\pm$ 0.22	84.55 $\pm$ 0.30	78.58 $\pm$ 0.46
✗	✗	✓	✗	94.80 $\pm$ 0.20	85.21 $\pm$ 0.16	79.19 $\pm$ 0.19
✓	✓	✓	✓	93.73 $\pm$ 0.08	83.50 $\pm$ 0.43	77.85 $\pm$ 0.44

Communication Costs. We assess the communication costs by using the total number of model parameters transferred to reach a predefined target accuracy during training. For PACS, DomainNet, and Office-Home, the target accuracies are set to 85%, 75%, and 70%, respectively. As illustrated in Table 8, DualFed outperforms other methods by achieving the same target accuracy with lower communication costs, showcasing its practical efficiency.

Table 8. Averaged Communication Costs (MB) when Reaching the Same Target Accuracy during Training.

	PACS	DomainNet	Office-Home
FedAvg	1920.93	2008.52	2538.72
FedProx	1833.62	2008.52	2538.72
FedPer	1658.47	1658.47	2007.62
FedRep	2705.92	3316.94	3753.37
FedBN	1482.91	1832.08	2098.97
SphereFed	3142.36	5586.42	18330.43
Fed-RoD	1135.10	1309.90	1663.30
FedETF	11783.85	15537.22	15973.66
DualFed	611.21	873.27	1225.59

Effect of Hyper Parameters. We conduct experiments using various hyperparameters, including the temperature coefficient ( $\tau$ ), the loss balance coefficient ( $\lambda$ ), and the number of local epochs ( $E$ ). As depicted in Figure 7, we observe that as $\tau$ increases, its effectiveness in distinguishing between different classes diminishes, thereby losing the advantage of contrastive loss. Figure 8 presents the test accuracy with varying $\lambda$ . Setting $\lambda$ to 0 is equivalent to training the model solely with cross-entropy loss. Increasing $\lambda$ enhances the distinctiveness and relevance of personalized representations to the local task, which, in turn, improves model performance. Figure 9 shows the test accuracy with different local epochs, it illustrates that DualFed consistently surpasses FedAvg across various local epochs, demonstrating its robustness to local epochs.

6. Conclusion

In this paper, we propose a new PFL approach called DualFed. It decouples the objectives of generalization and personalization in PFL by a personalized projection network. This modification reduce the mutual interference between the conflicting optimization objectives in traditional PFL, thereby can achieve a win-win situation of both generalization and personalization in PFL. Our experiments across various datasets have shown the effectiveness of DualFed.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grants 62372028 and 62372027.

References

(1)
Arivazhagan et al. (2019) Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. 2019. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818 (2019).
Bordes et al. (2023) Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. 2023. Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning. Transactions on Machine Learning Research (2023).
Cai et al. (2023) Luxin Cai, Naiyue Chen, Yuanzhouhan Cao, Jiahuan He, and Yidong Li. 2023. FedCE: Personalized Federated Learning Method based on Clustering Ensembles. In Proceedings of the 31st ACM International Conference on Multimedia. 1625–1633.
Caldarola et al. (2021) Debora Caldarola, Massimiliano Mancini, Fabio Galasso, Marco Ciccone, Emanuele Rodolà, and Barbara Caputo. 2021. Cluster-driven graph federated learning over multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2749–2758.
Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33 (2020), 9912–9924.
Chen and Chao (2022) Hong-You Chen and Wei-Lun Chao. 2022. On Bridging Generic and Personalized Federated Learning for Image Classification. In International Conference on Learning Representations.
Chen et al. (2024) Mingkang Chen, Jingtao Sun, Kento Aida, and Atsuko Takefusa. 2024. Weather-aware object detection method for maritime surveillance systems. Future Generation Computer Systems 151 (2024), 111–123.
Chen et al. (2020b) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020b. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
Chen et al. (2020a) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 2020a. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020).
Chen and He (2021) Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15750–15758.
Collins et al. (2021) Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. 2021. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning. PMLR, 2089–2099.
Dong et al. (2022) Xin Dong, Sai Qian Zhang, Ang Li, and HT Kung. 2022. Spherefed: Hyperspherical federated learning. In European Conference on Computer Vision. Springer, 165–184.
Feng et al. (2021) Yutong Feng, Jianwen Jiang, Mingqian Tang, Rong Jin, and Yue Gao. 2021. Rethinking Supervised Pre-training for Better Downstream Transferring. arXiv e-prints (2021), arXiv–2110.
Ghosh et al. (2020) Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. 2020. An efficient framework for clustered federated learning. Advances in Neural Information Processing Systems 33 (2020), 19586–19597.
Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
Gui et al. (2023) Yu Gui, Cong Ma, and Yiqiao Zhong. 2023. Unraveling Projection Heads in Contrastive Learning: Insights from Expansion and Shrinkage. arXiv preprint arXiv:2306.03335 (2023).
Hassaballah et al. (2020) Mahmoud Hassaballah, Mourad A Kenk, Khan Muhammad, and Shervin Minaee. 2020. Vehicle detection and tracking in adverse weather using a deep learning framework. IEEE transactions on intelligent transportation systems 22, 7 (2020), 4230–4242.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Jing et al. (2022) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. 2022. UNDERSTANDING DIMENSIONAL COLLAPSE IN CONTRASTIVE SELF-SUPERVISED LEARNING. In 10th International Conference on Learning Representations, ICLR 2022.
Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. 2021. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14, 1–2 (2021), 1–210.
Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning. PMLR, 5132–5143.
Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in neural information processing systems 33 (2020), 18661–18673.
Kornblith et al. (2021) Simon Kornblith, Ting Chen, Honglak Lee, and Mohammad Norouzi. 2021. Why do better loss functions lead to less transferable features? Advances in Neural Information Processing Systems 34 (2021), 28648–28662.
Kornblith et al. (2019) Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International conference on machine learning. PMLR, 3519–3529.
Leroux et al. (2022) Sam Leroux, Bo Li, and Pieter Simoens. 2022. Multi-branch neural networks for video anomaly detection in adverse lighting and weather conditions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2358–2366.
Li et al. (2017) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. 2017. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision. 5542–5550.
Li et al. (2020a) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020a. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine 37, 3 (2020), 50–60.
Li et al. (2020b) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020b. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems 2 (2020), 429–450.
Li et al. (2021) Xiaoxiao Li, Meirui JIANG, Xiaofei Zhang, Michael Kamp, and Qi Dou. 2021. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In International Conference on Learning Representations.
Li et al. (2022) Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, and Qing Qu. 2022. Principled and efficient transfer learning of deep models via neural collapse. arXiv preprint arXiv:2212.12206 (2022).
Li et al. (2023) Zexi Li, Xinyi Shang, Rui He, Tao Lin, and Chao Wu. 2023. No Fear of Classifier Biases: Neural Collapse Inspired Federated Learning with Synthetic and Fixed Classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 5319–5329.
Liang et al. (2020) Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B Allen, Randy P Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523 (2020).
Lin et al. (2017) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2117–2125.
Long et al. (2023) Yunfei Long, Zhe Xue, Lingyang Chu, Tianlong Zhang, Junjiang Wu, Yu Zang, and Junping Du. 2023. Fedcd: A classifier debiased federated learning framework for non-iid data. In Proceedings of the 31st ACM International Conference on Multimedia. 8994–9002.
Luo et al. (2021) Mi Luo, Fei Chen, Dapeng Hu, Yifan Zhang, Jian Liang, and Jiashi Feng. 2021. No fear of heterogeneity: Classifier calibration for federated learning with non-iid data. Advances in Neural Information Processing Systems 34 (2021), 5972–5984.
Masarczyk et al. (2024) Wojciech Masarczyk, Mateusz Ostaszewski, Ehsan Imani, Razvan Pascanu, Piotr Miłoś, and Tomasz Trzcinski. 2024. The tunnel effect: Building data representations in deep neural networks. Advances in Neural Information Processing Systems 36 (2024).
McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282.
Mou et al. (2021) Quanzheng Mou, Longsheng Wei, Conghao Wang, Dapeng Luo, Songze He, Jing Zhang, Huimin Xu, Chen Luo, and Changxin Gao. 2021. Unsupervised domain-adaptive scene-specific pedestrian detection for static video surveillance. Pattern Recognition 118 (2021), 108038.
Oh et al. (2022) Jaehoon Oh, SangMook Kim, and Se-Young Yun. 2022. FedBABU: Toward Enhanced Representation for Federated Image Classification. In International Conference on Learning Representations.
Olah et al. (2017) Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. 2017. Feature visualization. Distill 2, 11 (2017), e7.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
Peng et al. (2019) Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. 2019. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision. 1406–1415.
Qi et al. (2023) Zhuang Qi, Lei Meng, Zitan Chen, Han Hu, Hui Lin, and Xiangxu Meng. 2023. Cross-Silo Prototypical Calibration for Federated Learning with Non-IID Data. In Proceedings of the 31st ACM International Conference on Multimedia. 3099–3107.
Recanatesi et al. (2019) Stefano Recanatesi, Matthew Farrell, Madhu Advani, Timothy Moore, Guillaume Lajoie, and Eric Shea-Brown. 2019. Dimensionality compression and expansion in deep neural networks. arXiv preprint arXiv:1906.00443 (2019).
Sariyildiz et al. (2023) Mert Bulent Sariyildiz, Yannis Kalantidis, Karteek Alahari, and Diane Larlus. 2023. No Reason for No Supervision: Improved Generalization in Supervised Models. In ICLR 2023-International Conference on Learning Representations. 1–26.
Sattler et al. (2020) Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. 2020. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems 32, 8 (2020), 3710–3722.
Shen et al. (2022) Yiqing Shen, Yuyin Zhou, and Lequan Yu. 2022. Cd2-pfed: Cyclic distillation-guided channel decoupling for model personalization in federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10041–10050.
Shwartz-Ziv and Tishby (2017) Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017).
Sun et al. (2021) Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. 2021. Partialfed: Cross-domain personalized federated learning via partial initialization. Advances in Neural Information Processing Systems 34 (2021), 23309–23320.
Tan et al. (2022b) Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. 2022b. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems (2022).
Tan et al. (2022a) Yue Tan, Guodong Long, Lu Liu, Tianyi Zhou, Qinghua Lu, Jing Jiang, and Chengqi Zhang. 2022a. Fedproto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 8432–8440.
Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5018–5027.
Wang et al. (2023) Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, and Qing Qu. 2023. Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination. arXiv preprint arXiv:2311.02960 (2023).
Wang et al. (2022) Yizhou Wang, Shixiang Tang, Feng Zhu, Lei Bai, Rui Zhao, Donglian Qi, and Wanli Ouyang. 2022. Revisiting the transferability of supervised pretraining: an mlp perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9183–9193.
Wu et al. (2024a) Xinghao Wu, Xuefeng Liu, Jianwei Niu, Haolin Wang, Shaojie Tang, Guogang Zhu, and Hao Su. 2024a. Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-Rank Decomposition. arXiv preprint arXiv:2406.19931 (2024).
Wu et al. (2023) Xinghao Wu, Xuefeng Liu, Jianwei Niu, Guogang Zhu, and Shaojie Tang. 2023. Bold but cautious: Unlocking the potential of personalized federated learning through cautiously aggressive collaboration. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19375–19384.
Wu et al. (2024b) Xinghao Wu, Xuefeng Liu, Jianwei Niu, Guogang Zhu, Shaojie Tang, Xiaotian Li, and Jiannong Cao. 2024b. The Diversity Bonus: Learning from Dissimilar Distributed Clients in Personalized Federated Learning. arXiv:2407.15464 [cs.LG] https://arxiv.org/abs/2407.15464
Xue et al. (2024) Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi, and Baharan Mirzasoleiman. 2024. Investigating the Benefits of Projection Head for Representation Learning. arXiv preprint arXiv:2403.11391 (2024).
Yang et al. (2023) Fu-En Yang, Chien-Yi Wang, and Yu-Chiang Frank Wang. 2023. Efficient model personalization in federated learning via client-specific prompt generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19159–19168.
Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems 27 (2014).
Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. 2021. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning. PMLR, 12310–12320.
Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. Springer, 818–833.
Zhang et al. (2023b) Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, Jian Cao, and Haibing Guan. 2023b. Gpfl: Simultaneously learning global and personalized feature information for personalized federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5041–5051.
Zhang et al. (2023a) Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Haibing Guan. 2023a. Fedcp: Separating feature information for personalized federated learning via conditional policy. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3249–3261.
Zhao et al. (2018) Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. 2018. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018).
Zheng et al. (2022) Kaiyu Zheng, Xuefeng Liu, Guogang Zhu, Xinghao Wu, and Jianwei Niu. 2022. Channelfed: Enabling personalized federated learning via localized channel attention. In GLOBECOM 2022-2022 IEEE Global Communications Conference. IEEE, 2987–2992.
Zhou et al. (2023) Tailin Zhou, Jun Zhang, and Danny HK Tsang. 2023. FedFA: Federated Learning with Feature Anchors to Align Features and Classifiers for Heterogeneous Data. IEEE Transactions on Mobile Computing (2023).
Zhu et al. (2022) Guogang Zhu, Xuefeng Liu, Shaojie Tang, and Jianwei Niu. 2022. Aligning before aggregating: Enabling cross-domain federated learning via consistent feature extraction. In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS). IEEE, 809–819.
Zhu et al. (2023a) Guogang Zhu, Xuefeng Liu, Shaojie Tang, and Jianwei Niu. 2023a. Aligning before Aggregating: Enabling Communication Efficient Cross-Domain Federated Learning via Consistent Feature Extraction. IEEE Transactions on Mobile Computing (2023).
Zhu et al. (2023b) Guogang Zhu, Xuefeng Liu, Shaojie Tang, Jianwei Niu, Xinghao Wu, and Jiaxing Shen. 2023b. Take Your Pick: Enabling Effective Personalized Federated Learning within Low-dimensional Feature Space. arXiv preprint arXiv:2307.13995 (2023).
Zhu et al. (2021) Hangyu Zhu, Jinjin Xu, Shiqing Liu, and Yaochu Jin. 2021. Federated learning on non-IID data: A survey. Neurocomputing 465 (2021), 371–390.

\includepdfmerge

supplementary.pdf, 1-5