
FedIN: Federated Intermediate Layers Learning for Model Heterogeneity

Yun-Hin Chan, Zhihan Jiang, Jing Deng, and Edith C. H. Ngai

Yun-Hin Chan, Zhihan Jiang, Jing Deng, and Edith C. H. Ngai (corresponding author) are with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Abstract

Federated learning (FL) enables edge devices to cooperatively train a globally shared model while keeping the training data local and private. However, a common assumption in FL requires the participating edge devices to have similar computation resources and to train an identical global model architecture. In this study, we propose an FL method called Federated Intermediate Layers Learning (FedIN), which supports heterogeneous models without relying on any public dataset. Instead, FedIN leverages the knowledge embedded in client model features to facilitate knowledge exchange. The training models in FedIN are partitioned into three components: an extractor, intermediate layers, and a classifier. We capture client features by extracting the outputs of the extractor and the inputs of the classifier. To harness the knowledge in client features, we propose IN training, which aligns the intermediate layers based on features obtained from other clients. IN training requires only minimal memory and communication overhead, as it utilizes a single batch of client features. Additionally, we formulate and solve a convex optimization problem to mitigate gradient divergence caused by conflicts between IN training and local training. The experimental results demonstrate the superior performance of FedIN in heterogeneous model environments compared with state-of-the-art algorithms. Furthermore, our ablation study demonstrates the effectiveness of IN training and of the proposed solution for alleviating gradient divergence.

Index Terms:
Federated learning, heterogeneous models, convex optimization.

1 Introduction

The massive increase in the usage of Internet-of-Things (IoT) devices generates enormous amounts of user data [1]. Managing this IoT big data efficiently without violating user privacy has become a significant concern. Federated Learning (FL) [2] is a distributed machine learning paradigm that enables collaborative training on IoT data while keeping the user data local. FedAvg [2] is the first training paradigm that specifies the details of FL. All clients transmit the weights of their local models to the server after a few local training epochs. The server averages these weights to update the global model and sends the updated model back to the clients.

Although FL has been employed successfully in many applications, such as human activity recognition [3] and sentiment learning [4], many practical challenges of FL remain to be solved [5]. One of the most crucial and practical challenges is system heterogeneity, which usually refers to the different available resources of the client devices in the FL training process [6]. Many existing FL schemes (e.g., FedAvg) assume that client devices with distinct resources possess the same architecture as the global shared model for global aggregation. Nevertheless, some clients with fewer computation resources may be unable to complete their local training in time, slowing down the entire communication round. The clients hindering the training process are called stragglers. Some research has proposed asynchronous FL [7, 8, 9], which adapts the number of local training epochs per client and clusters clients according to their available resources in order to mitigate the straggler problem. Nonetheless, given that all clients keep the same model architecture, it is still possible that less capable clients do not have sufficient memory to deploy the shared global model. In this case, the global model must be shrunk to a smaller size, wasting the resources of more capable clients and reducing the performance of FL training. In fact, it is impractical to guarantee that all clients have the same amount of resources, particularly in IoT systems, where heterogeneous devices with different capabilities commonly work together. Thus, supporting heterogeneous models could fully utilize the resources of heterogeneous devices and better address the system heterogeneity problem.

Figure 1: An illustration of model heterogeneity. The clients participate in federated learning with different available resources, leading to different model architectures.

A straightforward way to accommodate system heterogeneity is to deploy different model architectures based on the available resources of the clients, as shown in Figure 1. However, the server cannot aggregate the weights directly as in FedAvg under heterogeneous model architectures. It is therefore essential to investigate alternative ways to incorporate weights and knowledge among the clients. Some recent works propose to solve this problem by performing knowledge distillation [10] on a public dataset, such as RHFL [11] and FedMD [12]. Although they can set up various model architectures on the clients, it is challenging in practice to collect a valid public dataset with a distribution similar to the users' local datasets. To address this issue, inspired by model parallelism-based split learning [13, 14], FedGKT [15] divides the model architecture into two parts, collecting features from the outputs of the former part of the models on the clients and training the remaining part on the server based on the collected features. HeteroFL [16] is another line of work that supports heterogeneous models without relying on a public dataset. It derives local models of different sizes from the largest model (considered to be the server model). However, this approach still requires all client models to share the same architecture design, just with different sizes.

To support system heterogeneity, we propose a method called Federated Intermediate Layers Learning (FedIN), which trains the intermediate layers using one batch of features collected from other clients. In FedIN, a local model architecture consists of three components: an extractor, intermediate layers, and a classifier, as depicted in Figure 2. Client features are derived from the outputs of the extractor and the inputs to the classifier. Notably, clients only need to transmit one batch of features to the server in addition to weight updates. The intermediate layers are updated through a combination of local training and an IN training process, where IN training leverages a single batch of features to extract knowledge from other clients. However, directly deploying these two training processes can induce a critical problem called gradient divergence [17, 18], because the latent information from the local dataset differs from that of the features collected from other clients. To alleviate this problem, we formulate and solve a convex optimization problem to obtain the optimal updated gradients. The experimental results demonstrate that FedIN outperforms the baselines in terms of both accuracy and overhead.

Our contributions are summarized as follows.

  • We propose a novel FL method called FedIN, which utilizes local training and IN training for the intermediate layers and provides a flexible and reliable way to address the system heterogeneity problem.

  • To alleviate the effects of gradient divergence, we formulate a convex optimization problem to derive the optimal updated gradient. The ablation study shows its effectiveness in handling the gradient divergence problem.

  • Our experiments reveal that FedIN achieves the best performance on both IID and non-IID data compared with state-of-the-art algorithms. Moreover, we conduct a thorough analysis to investigate the factors contributing to the improvements attained by FedIN.

2 Related work

In this section, we first introduce federated learning and then review and classify the works for model heterogeneity into three categories.

2.1 Federated Learning

Federated Learning (FL) was proposed by Google in 2017 to organize cooperative model training among edge devices and servers [2]. In FL, numerous clients train models jointly while retaining training data locally to preserve privacy. Various methods have been proposed and have achieved good performance in different scenarios. FedAsyn [7] utilizes coordinators and schedulers to create an asynchronous training process, handling stragglers in the FL training process. FedProx [19] regularizes and re-parametrizes FedAvg, guaranteeing convergence when learning over non-IID data. To share local knowledge among clients with different model architectures, FCCL [20] generates a cross-correlation matrix based on an unlabeled public dataset. In this work, we propose a novel FL framework called FedIN, which supports the training of heterogeneous models at edge devices with heterogeneous resources.

2.2 Heterogeneous Models

Our work focuses on supporting heterogeneous models in FL. This subsection classifies recent research contributing to model heterogeneity into three categories.

2.2.1 Public and auxiliary data

If a server has a public dataset, clients can exploit the general knowledge from this dataset, constructing a simple and efficient bridge to exchange knowledge among clients. FedAUX [21] utilizes unsupervised pre-training and unlabeled auxiliary data to initialize heterogeneous models. FedGen [22] simulates the prior knowledge of all clients with a generator. Inspired by knowledge distillation, several studies [12, 23, 15] mine the latent knowledge in a public dataset to address the system heterogeneity problem. Specifically, knowledge distillation (KD) [10] was proposed by Hinton et al. and trains a student model with the knowledge distilled from a teacher model. In FedMD [12], a large public dataset is deployed on a server, and clients distill and transmit logits on this dataset to learn from both the logits and their local private datasets. In FedH2L [23], clients extract logits from a public dataset consisting of small portions of the local datasets of other clients. In RHFL [11], a server calculates client weights with a symmetric cross-entropy loss function, and clients distill knowledge from the unlabeled dataset. FCCL [20] computes a cross-correlation matrix, also based on the unlabeled public dataset.

2.2.2 Data-free knowledge distillation

However, the aforementioned KD-based FL methods require a public dataset, and the server may not be able to collect sufficient data due to data availability and privacy concerns. In contrast, data-free KD is a novel approach that completes the knowledge distillation process without training data. A basic idea of data-free KD is to optimize noise inputs to minimize the distance to prior knowledge [24]; Chen et al. [25] instead train Generative Adversarial Networks (GANs) [26] to generate training data for the entire KD process, utilizing the knowledge distilled from the teacher model. To remove the reliance on a public dataset, a few research works consider data-free KD in FL. In FedML [27], latent knowledge from homogeneous models is applied to train heterogeneous models. In FedHe [28], logits belonging to the same class are directly averaged on a server. In FedGKT [15], a neural network is split between a client and a server, and the server completes the remaining training process based on the features and logits collected from all clients. Most existing data-free approaches are based on logits. FedIN is also a data-free approach, but instead of transmitting only logits, the knowledge exchanged in FedIN also includes client features, which contain more information than logits alone.

2.2.3 Sub-models

To adapt to the available resources of different clients, several studies split large models into small sub-models. HeteroFL [16] divides a large model into local models of different sizes. However, the architectures of the local and global models are still restricted to the same model architecture. SlimFL [29] integrates slimmable neural network (SNN) architectures [30] into FL, adapting the widths of local neural networks based on resource limitations. FjORD [31] leverages Ordered Dropout and a self-distillation method to determine the model widths. However, similar to HeteroFL, these methods only vary the number of parameters per layer. In this paper, we propose a more heterogeneous and flexible FL framework supporting various edge devices.

Figure 2: Details of the model architectures and the training process of FedIN. Blue arrows represent the transmission of the corresponding client features, i.e., the feature inputs and feature outputs $(s_{in}, s_{out})$. The process of FedIN is as follows. (1) Clients receive client features and global weights $\bar{w}$ from the server. (2) After updating the client weights with the global weights, the clients train their models on the local private dataset and complete the IN training with the feature inputs and outputs $(s_{in}, s_{out})$ from the server. (3) Upon completing local training, clients transmit their model weights and new client features, denoted as $(w_k, s_{in}, s_{out})$, to the server. The aggregation methods for system heterogeneity are discussed in Section 4.3.

3 Problem formulation

We introduce the general federated learning problem for heterogeneous models in this section.

3.1 Heterogeneous Federated Learning

The goal of FL is for the clients to collaboratively train a shared global model while keeping their local data private. We briefly summarize the optimization problem below. We assume that $K$ clients participate in FL. Each client has a private dataset $D_k=\{(x_{i,k},y_{i,k})\mid i=1,2,\dots,|D_k|\}$, where $k\in\{1,\dots,K\}$ is the index of a client, and $|D_k|$ denotes the size of dataset $D_k$. The private dataset $D_k$ is only accessible to client $k$, guaranteeing data privacy. In traditional FL, the clients share an identical model architecture. We denote a training model by $f(x;w)$, where $w$ are the training weights and $x$ are the inputs. The loss function $l_k$ of client $k$ is

\min_{w}\ l_k(w)=\frac{1}{|D_k|}\sum_{i=1}^{|D_k|} l\big(f(x_i;w),y_i\big), \qquad (1)

where $l(\cdot,\cdot)$ is the loss for each data sample $(x_{i,k},y_{i,k})$. If the total size of all datasets is $N=\sum_{k=1}^{K}|D_k|$, the global optimization target is

\min_{w}\ L(w)=\sum_{k=1}^{K}\frac{|D_k|}{N}\, l_k(w), \qquad (2)

where $w$ are the parameters of the global shared model. The size of a private dataset determines the weight of its loss in the global loss function: if a client has a larger dataset, its loss deserves more attention. A common method for updating $w$ in the original federated learning is to aggregate the updated weights or gradients from different clients [2].
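For concreteness, a minimal sketch of the dataset-size-weighted aggregation in Eq. 2 is given below, assuming homogeneous client models represented as PyTorch state dicts; the helper names (`client_states`, `client_sizes`) are ours, not from the paper's released code.

```python
# Sketch of dataset-size-weighted aggregation (Eq. 2) for homogeneous models.
# `client_states` and `client_sizes` are illustrative names, not the paper's API.
from typing import Dict, List
import torch

def fedavg_aggregate(client_states: List[Dict[str, torch.Tensor]],
                     client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        # Weighted average of each parameter tensor: sum_k (|D_k| / N) * w_k
        global_state[name] = sum(
            (size / total) * state[name].float()
            for state, size in zip(client_states, client_sizes)
        )
    return global_state
```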

Nevertheless, it may not be possible to deploy an identical model architecture for all clients due to system heterogeneity. One potential solution is to allow clients to select different model architectures according to their capabilities in heterogeneous FL. The heterogeneous FL problem is described as follows. We denote by $w_k$ the model weights of client $k$. The global loss function then becomes

\min_{w_1,w_2,\dots,w_K}\ L(w_1,\dots,w_K)=\sum_{k=1}^{K}\frac{|D_k|}{N}\, l_k(w_k), \qquad (3)

where the optimized model weights $\{w_1,w_2,\dots,w_K\}$ have different sizes. Thus, direct aggregation of the entire model weights becomes infeasible under model heterogeneity. Therefore, in our experiments we adopt layer-wise heterogeneous aggregation [32] as an alternative approach, aggregating the layer weights of heterogeneous models instead of the entire model weights.

4 FedIN: Federated Intermediate Layers Learning

In this section, we describe the details of FedIN, which addresses system heterogeneity by deploying clients with diverse model architectures that match their available resources. Figure 2 illustrates the workflow of FedIN. The client model consists of three key components: an extractor, intermediate layers, and a classifier. The outputs of the extractor, referred to as feature inputs ($s_{in}$), serve as inputs to the intermediate layers. Similarly, the outputs of the intermediate layers, referred to as feature outputs ($s_{out}$), act as inputs to the classifier. The client features are the pair of feature inputs and outputs, denoted as $(s_{in}, s_{out})$. Specifically, FedIN encompasses two training processes: local training, which leverages the private dataset, and IN training, which relies on the feature inputs and outputs $(s_{in}, s_{out})$. Moreover, to address the gradient divergence arising from conflicts between local training and IN training, we formulate a convex optimization problem to obtain the optimal updated gradients.

4.1 Local Training and IN Training

The clients receive a single batch of feature inputs and feature outputs, denoted as $S=\{(s_{i,in}^{c}, s_{i,out}^{c})\}_{i=1}^{|S|}$, from the server. These samples are utilized for training the intermediate layers during the IN training process. The superscript $c$ indicates that these feature inputs and outputs come from the central server. The clients begin their local training after receiving a batch of client features from the server. For an instance $(x_{i,k}, y_{i,k})\in D_k$, client $k$ conducts local training on its private dataset. The loss function of the local training is

l_{local,k} = l_{CE}\big(f(x_{i,k}; w_k^t), y_{i,k}\big) + \frac{\mu}{2}\|w_k^t - w_k^{t-1}\|^2, \qquad (4)

where $w_k^t$ are the weights of client $k$ at time $t$, and $l_{CE}$ is the cross-entropy loss function for the local training. To ensure client consistency and prevent overfitting, we add a proximal regularization term [19] in Eq. 4.
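As a hedged illustration, the local objective in Eq. 4 (cross-entropy plus the proximal term) could be implemented roughly as follows in PyTorch; the variable names (`model`, `prev_params`, `mu`) are ours, and `mu=0.1` simply matches the $\frac{\mu}{2}=0.05$ setting reported in the experiments.

```python
# Sketch of the local training loss in Eq. 4: cross-entropy plus a proximal
# term penalizing drift from the previous-round weights. Names are illustrative.
import torch
import torch.nn.functional as F

def local_loss(model, prev_params, x, y, mu=0.1):
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    # (mu / 2) * || w_k^t - w_k^{t-1} ||^2, summed over all parameters
    prox = sum(((p - p_prev.detach()) ** 2).sum()
               for p, p_prev in zip(model.parameters(), prev_params))
    return ce + 0.5 * mu * prox
```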

The second training process is IN training, which trains the intermediate layers on the feature dataset $S$. It is worth mentioning that $S$ contains only a single batch of samples. We denote the weights of the extractor and the classifier by $w_{e,k}$ and $w_{c,k}$ for client $k\in\{1,\dots,K\}$, and the weights of the intermediate layers by $w_{in,k}$. The relations among a data sample $(x_{i,k}, y_{i,k})\in D_k$, the client weights, and $(s_{i,in}^{k}, s_{i,out}^{k})$ are

s_{i,in}^{k} = f(x_{i,k}; w_{e,k}), \qquad (5)
s_{i,out}^{k} = f(s_{i,in}^{k}; w_{in,k}), \qquad (6)
f(x_{i,k}; w_k) = f(s_{i,out}^{k}; w_{c,k}). \qquad (7)

Eq. 5 shows that the feature input $s_{i,in}^{k}$ is the output of the extractor $w_{e,k}$ for an instance $(x_{i,k}, y_{i,k})$ of client $k$. Eq. 6 describes that the feature output $s_{i,out}^{k}$ is the output of the intermediate layers $w_{in,k}$ given the feature input $s_{i,in}^{k}$. Eq. 7 states that the output of the classifier $w_{c,k}$ equals the output of the whole client model $w_k$. During local training, the feature inputs and outputs $(s_{i,in}^{k}, s_{i,out}^{k})$ are collected by client $k$. After completing the training process, client $k$ transmits one batch of the collected feature inputs and outputs to the server. This process is indicated by the blue arrows in Figure 2.

Eq. 6 gives the main function of the IN training, as shown in Figure 2. After the client receives the feature dataset $S=\{(s_{i,in}^{c}, s_{i,out}^{c})\}_{i=1}^{|S|}$, it begins the IN training of the intermediate layers. The feature inputs $s_{i,in}^{c}$ from the server are the inputs of the intermediate layers, while the feature outputs $s_{i,out}^{c}$ are the targets of the IN training. The loss function of IN training is defined as

l_{IN,k} = l_{MSE}\big(f(s_{i,in}^{c}; w_{in,k}), s_{i,out}^{c}\big), \qquad (8)

where $l_{MSE}$ is a mean-square error loss function. The weights $w_{in,k}$ are updated by the loss functions of both the local training, $l_{local,k}$, and the IN training, $l_{IN,k}$.
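The following sketch illustrates how a client model could be split into an extractor, intermediate layers, and a classifier, how the client features in Eqs. 5-6 could be captured, and how the IN loss of Eq. 8 could be computed. This is our own illustrative decomposition (module names and split points are assumptions), not the authors' released implementation.

```python
# Sketch: split a client model into extractor / intermediate layers / classifier,
# capture (s_in, s_out), and compute the IN loss of Eq. 8. Split points and
# module names are illustrative assumptions.
import torch
import torch.nn as nn

class SplitClientModel(nn.Module):
    def __init__(self, extractor: nn.Module, intermediate: nn.Module, classifier: nn.Module):
        super().__init__()
        self.extractor = extractor        # w_{e,k}
        self.intermediate = intermediate  # w_{in,k}
        self.classifier = classifier      # w_{c,k}

    def forward(self, x):
        s_in = self.extractor(x)          # Eq. 5: s_in = f(x; w_e)
        s_out = self.intermediate(s_in)   # Eq. 6: s_out = f(s_in; w_in)
        logits = self.classifier(s_out)   # Eq. 7: output of the whole client model
        return logits, s_in.detach(), s_out.detach()

def in_loss(model: SplitClientModel, s_in_c: torch.Tensor, s_out_c: torch.Tensor):
    # Eq. 8: MSE between the intermediate layers' output on the server-provided
    # feature inputs and the server-provided feature outputs.
    pred = model.intermediate(s_in_c)
    return nn.functional.mse_loss(pred, s_out_c)
```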

4.2 Gradient Alleviation

However, the gradients from the local training and the IN training may diverge, which would slow down convergence and prevent the model from reaching the optimum [17, 18]. Mitigating this gradient divergence is therefore critical in FedIN. To address this problem, we formulate a convex optimization problem as follows.

We define the gradients from the local training as a matrix $G_{local}$ and the gradients from the IN training as a matrix $G_{IN}$. To keep the update direction of the model consistent, we impose the following constraint on the gradients,

\langle G_{IN}, G_{local}\rangle \geq 0, \qquad (9)

where $\langle\cdot,\cdot\rangle$ is the dot product, which ensures that the update directions of $G_{local}$ and $G_{IN}$ are aligned. In the optimization problem, we denote the new optimized gradients by a matrix $Z$ and formulate the following convex primal problem,

\min_{Z}\ \|G_{IN}-Z\|_F^2, \qquad \text{s.t.}\ \langle Z, G_{local}\rangle \geq 0, \qquad (10)

where we keep the update direction of $Z$ consistent with $G_{local}$ while minimizing the distance between $Z$ and $G_{IN}$. We consider the information from the feature inputs and outputs to be richer than that from the local private dataset, which is more prone to overfitting during training. We solve this convex optimization problem via its Lagrange dual problem [33]. The Lagrangian is

L(Z,\lambda) = tr(G_{IN}^T G_{IN}) - tr(Z^T G_{IN}) - tr(G_{IN}^T Z) + tr(Z^T Z) - \lambda\, tr(G_{local}^T Z), \qquad (11)

where $tr(A)$ denotes the trace of the matrix $A$, and $\lambda$ is the Lagrange multiplier associated with $\langle Z, G_{local}\rangle \geq 0$. To derive the dual problem, we first obtain the optimum of $Z$ for the Lagrangian in Eq. 11 and then obtain the Lagrange dual function $g(\lambda)=\inf_{Z} L(Z,\lambda)$. Thus, the Lagrange dual problem is

\max_{\lambda}\ g(\lambda) = -\frac{\lambda^2}{4}\, tr(G_{local}^T G_{local}) - \lambda\, tr(G_{local}^T G_{IN}), \qquad \text{s.t.}\ \lambda \geq 0, \qquad (12)

where the optimum of the Lagrangian in Eq. 11 is $Z = G_{IN} + \frac{\lambda}{2} G_{local}$. For a sufficiently large $\lambda$, $\langle Z, G_{local}\rangle > 0$ holds strictly, so the problem satisfies Slater's constraint qualification [34] and strong duality holds; i.e., the optimum of the primal problem in Eq. 10 is also $Z = G_{IN} + \frac{\lambda}{2} G_{local}$. Furthermore, the dual problem in Eq. 12 can be solved to obtain the analytic solution for $\lambda$ and $Z$:

Z = \begin{cases} G_{IN}, & \text{if } b \geq 0 \\ G_{IN} - \frac{b}{a} G_{local}, & \text{if } b < 0 \end{cases} \qquad (13)

where $a = tr(G_{local}^T G_{local})$ and $b = tr(G_{local}^T G_{IN})$. However, one crucial point is that the clients have to carry out this optimization process. Computing each gradient matrix following Eq. 13 would consume considerable computing resources because of the matrix multiplications. Therefore, to reduce the computational pressure on the clients, we simplify the updated gradient matrix as

Z = G_{IN} + \frac{\lambda}{2} G_{local}, \qquad (14)

where $\lambda = 1$ is set for the optimum point of the primal problem in our experiment settings. Since $G_{IN}$ is only associated with the weights $w_{in,k}$ and not related to $w_{e,k}$ and $w_{c,k}$, the client models in FedIN are optimized by Eq. 14 directly.
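A minimal sketch of the simplified update in Eq. 14, with the exact solution of Eq. 13 as an option, is given below for the intermediate-layer gradients; the flattening of the gradients into vectors and the function interface are assumptions on our part.

```python
# Sketch of gradient alleviation. g_in and g_local are flattened gradient
# vectors of the intermediate layers from IN training and local training;
# the flattening and the function interface are illustrative assumptions.
import torch

def combined_gradient(g_in: torch.Tensor, g_local: torch.Tensor,
                      exact: bool = False, lam: float = 1.0) -> torch.Tensor:
    if exact:
        # Eq. 13: keep G_IN if <G_local, G_IN> >= 0, otherwise remove the conflict.
        a = torch.dot(g_local, g_local)   # tr(G_local^T G_local)
        b = torch.dot(g_local, g_in)      # tr(G_local^T G_IN)
        return g_in if b >= 0 else g_in - (b / a) * g_local
    # Eq. 14: simplified update with a fixed multiplier (lambda = 1 in the paper).
    return g_in + 0.5 * lam * g_local
```

The `exact` branch avoids the simplification but requires the extra dot products on every step, which is exactly the per-client cost that Eq. 14 is meant to save.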

4.3 Weight Aggregation

FedIN does not impose any limitation on the model architecture, meaning that diverse model architectures can be deployed in FedIN by leveraging a single batch of client features as the communication knowledge. If client models have different numbers of layers, FedIN adopts layer-wise heterogeneous aggregation [32], enabling the server to aggregate weights of the same layer rather than of the same model. When client models have entirely different architectures, FedIN aggregates model weights only among models with identical architectures, the same as the aggregation method used in FedAvg [2] and FedDF [35]. For example, the weights of CNNs cannot be aggregated with the weights of Transformers [36], but they can be aggregated with those of other CNNs with the same depth and width. The effectiveness of FedIN with these two aggregation methods is further demonstrated in our experimental section.
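As an illustration, layer-wise heterogeneous aggregation could be sketched as below: the server averages each parameter tensor only over the clients whose models contain a layer with the same name and shape. This is our simplified reading of [32] (matching layers by parameter name and shape is an assumption), not its reference implementation.

```python
# Sketch of layer-wise heterogeneous aggregation: average each named parameter
# only across the clients that share a layer with that name and shape.
from collections import defaultdict
from typing import Dict, List, Tuple
import torch

def layerwise_aggregate(client_states: List[Dict[str, torch.Tensor]]):
    buckets = defaultdict(list)
    for state in client_states:
        for name, tensor in state.items():
            buckets[(name, tuple(tensor.shape))].append(tensor)
    # Each client later copies back the averaged tensors whose (name, shape)
    # keys match layers in its own model.
    return {key: torch.stack([t.float() for t in tensors]).mean(dim=0)
            for key, tensors in buckets.items()}
```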

4.4 Detailed Algorithm of FedIN

The FedIN procedure is presented in Algorithm 1. On the server side, the server receives the model weights, feature inputs, and feature outputs $(w_k, s_{i,in}^{k}, s_{i,out}^{k})$. The new feature inputs and outputs $(s_{i,in}^{k}, s_{i,out}^{k})$ are stored in the server dataset. The server samples one batch of feature inputs and outputs, denoted as $S=\{(s_{i,in}^{c}, s_{i,out}^{c})\}_{i=1}^{|S|}$, as indicated in line 5 of Algorithm 1. Finally, the server transmits the averaged weights $\bar{w}$ and the batch of feature inputs and outputs $S$ to all the clients. On the client side, no data is received during the initial training process. Each client $k$ performs local training using Eq. 4. During the initial training, client $k$ also collects the feature inputs and outputs $(s_{i,in}^{k}, s_{i,out}^{k})$. If client $k$ is not in the initial training, it receives $(\bar{w}, S)$ from the server. Client $k$ computes $l_{local,k}$ on the local private dataset and $l_{IN,k}$ on the feature dataset $S$, and updates its model by Eq. 14. Finally, client $k$ transmits $(w_k, s_{i,in}^{k}, s_{i,out}^{k})$ to the server.

Input: Local datasets $D_k$, $k\in\{1,\dots,K\}$, $K$ clients and their weights $w_1,\dots,w_K$.
Output: Optimal weights for all the clients $w_1,\dots,w_K$.
1  Server process:
2  while not converged do
3        Receive $(w_k, s_{i,in}^{k}, s_{i,out}^{k})$ from client $k$.
4        Save the feature inputs and outputs $(s_{i,in}^{k}, s_{i,out}^{k})$ in the server dataset.
5        Randomly sample one batch of feature inputs and outputs $S=\{(s_{i,in}^{c}, s_{i,out}^{c})\}_{i=1}^{|S|}$ from the server dataset.
6        Transmit the averaged weights and a batch of feature inputs and outputs $(\bar{w}, S)$ to the sampled clients.
7  Client processes:
8  for each randomly sampled client $k$, $k\in\{1,\dots,K\}$ do
9        if initial training then
10             Update the client $k$ model by the local training, Eq. 4.
11             Collect $(s_{i,in}^{k}, s_{i,out}^{k})$ from data samples $(x_{i,k}, y_{i,k})\in D_k$ during local training.
12       else
13             Receive $(\bar{w}, S)$ from the server.
14             $w_k = \bar{w}$.
15             Compute $l_{local,k}$ on the private dataset by Eq. 4.
16             Collect $(s_{i,in}^{k}, s_{i,out}^{k})$ from data samples $(x_{i,k}, y_{i,k})\in D_k$ during local training.
17             Compute $l_{IN,k}$ on $S$ by Eq. 8.
18             Update the weights by Eq. 14.
19       Transmit the weights and collected information $(w_k, s_{i,in}^{k}, s_{i,out}^{k})$ to the server.
Algorithm 1 FedIN
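To make the client-side update concrete, the following sketch (our own illustrative PyTorch code reusing the `SplitClientModel` sketch above, not the released implementation) performs one training step combining the local loss of Eq. 4, the IN loss of Eq. 8, and the gradient combination of Eq. 14; how the extractor and classifier gradients are handled here is our assumption.

```python
# Sketch of one client training step in FedIN: compute the IN loss and the
# local loss, then combine their gradients on the intermediate layers per
# Eq. 14 with lambda = 1. Wiring and gradient handling are illustrative.
import torch
import torch.nn.functional as F

def client_step(model, optimizer, prev_params, x, y, s_in_c, s_out_c, mu=0.1):
    # 1) IN loss (Eq. 8): its gradient only touches the intermediate layers.
    optimizer.zero_grad()
    loss_in = F.mse_loss(model.intermediate(s_in_c), s_out_c)
    loss_in.backward()
    g_in = {n: p.grad.clone() for n, p in model.intermediate.named_parameters()}

    # 2) Local loss (Eq. 4) on the private batch; gradients for all parameters.
    optimizer.zero_grad()
    logits, s_in, s_out = model(x)
    prox = sum(((p - q.detach()) ** 2).sum()
               for p, q in zip(model.parameters(), prev_params))
    loss_local = F.cross_entropy(logits, y) + 0.5 * mu * prox
    loss_local.backward()

    # 3) Eq. 14 with lambda = 1 on the intermediate layers: Z = G_IN + 0.5 * G_local.
    #    Here the extractor and classifier keep their local-training gradients
    #    (this detail is an assumption of the sketch).
    for n, p in model.intermediate.named_parameters():
        p.grad = g_in[n] + 0.5 * p.grad
    optimizer.step()
    return s_in, s_out  # collected client features to send to the server
```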

5 Experiments

In this section, we conduct experiments to evaluate the performance of FedIN on the CIFAR-10 [37], Fashion-MNIST [38], and SVHN [39] datasets, and we carry out ablation studies to assess the effectiveness of the individual components of FedIN. Our code will be released on GitHub.

5.1 Experiment Settings

5.1.1 Federated settings

CIFAR-10 contains 50,000 32×32 color images in the training dataset and 10,000 images in the testing dataset with ten different classes, such as car, dog, and cat. Fashion-MNIST has 70,000 28×28 gray-scale images in ten classes, such as T-shirts, trousers, and dresses, including 60,000 samples for training and 10,000 samples for validation. SVHN is a real-world image dataset obtained from house numbers in Google Street View images. It includes 73,257 32×32 color images for training and 26,032 images for testing. We establish two distributions for these datasets, independent and identically distributed (IID) and non-IID, as illustrated in Figure 3. The non-IID data is generated using a Dirichlet distribution with parameter α = 0.5, as shown in Figure 3b. We have 100 clients in the FL training process. The model architectures are ResNet10, ResNet14, ResNet18, ResNet22, and ResNet26 from the PyTorch source code, evenly distributed among the 100 clients. The number of communication rounds is set to 500. The batch size is 16 during training. For all datasets, the clients complete 5 epochs of inner training during each communication round.
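A common way to generate such a non-IID split with a Dirichlet distribution (α = 0.5) is sketched below; this is a standard recipe, and the exact partition code used in the paper may differ.

```python
# Sketch of a Dirichlet(alpha) non-IID partition: for each class, sample client
# proportions from Dir(alpha) and split that class's sample indices accordingly.
import numpy as np

def dirichlet_partition(labels: np.ndarray, num_clients: int, alpha: float = 0.5, seed: int = 0):
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cut points for splitting this class's samples among the clients.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]
```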

(a) IID data. (b) Non-IID data.
Figure 3: Illustrations for IID data and non-IID data with α = 0.5.

5.1.2 Baselines

We compare against two classic algorithms, FedAvg and FedProx, and five state-of-the-art methods, Scaffold, FedNova, MOON, HeteroFL, and InclusiveFL, as our baselines.

  • FedAvg [2]: Clients transmit their updated gradients to the server, and the server performs gradient averaging to update the global model.

  • FedProx [19]: Based on FedAvg, clients incorporate a regularization term to minimize the disparities between local models and a global model.

  • Scaffold [40]: Clients retain the local control variate, while the server maintains server control variate, effectively mitigating the impact of client-drift.

  • FedNova [17]: The server uses a normalized averaging method that mitigates objective inconsistency while maintaining fast error convergence.

  • MOON [41]: Clients conduct the contrastive learning between the current local model, global model, and the local model from the previous time stamp.

  • HeteroFL [16]: The smaller models are derived from the largest model sharing the same architecture. These small models are deployed on the clients. The corresponding parameters from all models are aggregated to update the largest model.

  • InclusiveFL [32]: This method proposes a momentum knowledge distillation method to enhance the transfer of knowledge from large models to smaller models.

5.1.3 Details of baselines

FedIN and the baselines, with the exception of HeteroFL, utilize the layer-wise aggregation technique proposed in [32] under our heterogeneous model environment. Since HeteroFL requires model splitting based on its own methodology, it cannot use this aggregation technique. Therefore, in order to maintain a number of parameters similar to the other baselines, we deploy ResNet152 in HeteroFL instead of using the largest model, ResNet26, as in the other methods. The model split mode in HeteroFL is "dynamic_a1-b1-c1-d1-e1" from the source code, matching the five heterogeneous models used in all other methods. The hyper-parameter $\frac{\mu}{2}$ for FedProx and FedIN is 0.05. We use the Adam optimizer with a learning rate of 0.001, $\beta_1 = 0.9$, and $\beta_2 = 0.999$ (the default parameter settings) for all methods. All experiments are conducted in the same environment utilizing four Nvidia RTX3090 GPUs.

TABLE I: Model accuracy (%) for IID and non-IID data of Fashion-MNIST. Target accuracy is 85%.
Methods           IID ↑   Non-IID ↑   Round ↓   Speedup ↑
FedAvg [2]        90.3    89.4        47        ×1.0
FedProx [19]      89.7    87.6        40        ×1.2
Scaffold [40]     88.3    87.1        25        ×1.9
FedNova [17]      87.5    87.3        36        ×1.3
MOON [41]         89.5    89.0        34        ×1.4
HeteroFL [16]     89.3    89.5        140       ×0.3
InclusiveFL [32]  88.4    89.1        31        ×1.5
FedIN (Ours)      91.2    90.3        20        ×2.4
TABLE II: Model accuracy (%) for IID and non-IID data of SVHN. Target accuracy is 80%.
Methods           IID ↑   Non-IID ↑   Round ↓   Speedup ↑
FedAvg [2]        89.2    84.5        82        ×1.0
FedProx [19]      90.6    87.3        45        ×1.8
Scaffold [40]     91.1    86.0        72        ×1.1
FedNova [17]      87.3    86.7        106       ×0.8
MOON [41]         89.5    86.1        55        ×1.5
HeteroFL [16]     93.8    89.3        107       ×0.8
InclusiveFL [32]  90.9    88.7        67        ×1.2
FedIN (Ours)      91.8    89.3        29        ×2.8
TABLE III: Model accuracy (%) for IID and non-IID data of CIFAR-10. Target accuracy is 60%.
Methods           IID ↑   Non-IID ↑   Round ↓   Speedup ↑
FedAvg [2]        76.8    66.2        109       ×1.0
FedProx [19]      77.6    72.0        72        ×1.5
Scaffold [40]     79.0    68.1        120       ×0.9
FedNova [17]      62.9    60.3        229       ×0.5
MOON [41]         74.1    67.4        129       ×0.8
HeteroFL [16]     72.1    61.0        273       ×0.4
InclusiveFL [32]  75.0    66.1        160       ×0.7
FedIN (Ours)      80.5    75.9        54        ×2.0
Figure 4: Smoothed test accuracy on non-IID data of CIFAR-10. The grey lines show the original (unsmoothed) accuracy. The red dotted line denotes the target accuracy in TABLE III.

5.2 Accuracy Analyses.

5.2.1 Accuracy of IID and non-IID data.

We conduct experiments on the IID and non-IID data of the Fashion-MNIST, SVHN, and CIFAR-10 datasets. The experimental results are shown in TABLE I, TABLE II, and TABLE III. These tables report the communication round (denoted as "Round") at which each method achieves the target accuracy (ACC) under the non-IID setting. The best results in each table are highlighted in bold. The symbols "↑" and "↓" indicate that a higher or lower value of the respective metric is better. The target accuracy for non-IID data is specified at the top of each table.

From TABLE I, FedIN achieves the highest accuracy among all methods. It attains an accuracy of 91.2% on IID data and 90.3% on non-IID data. Furthermore, FedIN requires only 20 communication rounds to reach the target accuracy, demonstrating a speedup of 2.4 times compared to the baseline FedAvg.

In TABLE II, FedIN achieves the best results with 91.8% on IID and 89.3% on non-IID data. FedAvg requires 82 communication rounds to achieve the target accuracy of SVHN (80%), while FedIN accomplishes it in just 29 rounds. FedIN exhibits the fastest convergence speed among all state-of-the-art baselines, with a speedup of 2.8 times compared to FedAvg.

In TABLE III, FedIN demonstrates substantial improvements over most baselines. It achieves an accuracy of 80.5% on IID data and 75.9% on non-IID data, surpassing the second-best baseline, FedProx, which achieves 77.6% and 72.0%, respectively. Notably, two baselines that focus on addressing system heterogeneity, HeteroFL [16] and InclusiveFL [32], perform worse than FedIN, especially on non-IID data. HeteroFL achieves 72.1% on IID data and 61.0% on non-IID data, while InclusiveFL attains 75.0% on IID and 66.1% on non-IID data. Moreover, the target accuracy for CIFAR-10 is 60%, and FedAvg requires 109 rounds to achieve it, whereas FedIN only needs 54 rounds. HeteroFL requires 273 rounds and InclusiveFL needs 160 rounds, indicating slower convergence than FedAvg. In contrast, FedIN achieves a speedup of 2.0 times compared to FedAvg, demonstrating significantly accelerated convergence.

Additionally, Figure 4 shows the smoothed test accuracy on non-IID data of CIFAR-10. FedIN (red line) achieves the highest accuracy and exhibits the fastest convergence speed throughout the training process. It is the first method to achieve the target accuracy (red dotted line). FedProx [19] is the second-best method but still lags behind FedIN. Several methods, including FedAvg [2], Scaffold [40], MOON [41], and InclusiveFL [32], exhibit similar convergence behavior, achieving results ranging from 64% to 68%, as indicated in TABLE III. FedNova [17] and HeteroFL [16] have significantly slower convergence speeds than FedAvg in Figure 4, achieving only 60.3% and 61.0%, respectively. Moreover, FedIN incurs only a small additional overhead of one batch of feature inputs and outputs compared to FedAvg, as shown in TABLE VI.

TABLE IV: Model accuracy with homogeneous models.
CIFAR-10 Methods
FedProx Scaffold FedNova MOON FedIN
IID 83.5 84.3 82.0 84.2 84.7
Non-IID 77.5 76.8 75.4 78.2 79.2

5.2.2 Accuracy of homogeneous models

While FedIN primarily addresses the system heterogeneity challenge in FL, we also conduct experiments in a homogeneous model environment using CIFAR-10. All client models are ResNet18 in this experiment and the remaining federated settings are the same as the system heterogeneity experiments. As presented in TABLE IV, FedIN still outperforms state-of-the-art baselines, specifically designed to enhance FL performance in homogeneous model environments. Notably, FedIN achieves the highest accuracy, 84.7% on IID data and 79.2% on non-IID data of CIFAR-10, while the second-best result is 84.3% from Scaffold on IID data and 78.2% from MOON on non-IID data.

TABLE V: Model accuracy (%) with heterogeneous models under FedAvg aggregation on Fashion-MNIST.
Data      FedAvg  FedProx  Scaffold  FedNova  MOON  InclusiveFL  FedIN
IID       86.1    83.4     87.7      84.2     87.0  88.1         88.9
Non-IID   85.4    82.1     86.3      83.9     86.5  86.4         88.0

5.2.3 Accuracy with FedAvg aggregation

To ensure a fair comparison, both the baselines and FedIN employ layer-wise aggregation. However, it is worth noting that FedIN can be deployed in scenarios with extreme heterogeneity, where layer-wise aggregation is not feasible. In such cases, model weights with the same architectures are the only ones that can be aggregated. To demonstrate the effectiveness of FedIN in such extreme environments, we conducted experiments on the Fashion-MNIST dataset, utilizing FedAvg aggregation. The remaining federated settings are the same in this experiment. As indicated in TABLE V, FedIN still achieves the highest accuracy, 88.9% on IID data and 88.0% on non-IID data. These results further emphasize the effectiveness of FedIN in extreme system heterogeneity environments.

TABLE VI: Training overheads for different methods. "Params" indicates the communication overheads. "Memory" refers to the memory occupied by methods in the training process.
Metrics         FedAvg  Scaffold  MOON   HeteroFL  FedIN
Params (M) ↓    12.28   24.56     12.28  16.29     12.35
Memory (MB) ↓   235.0   470.0     705.0  445.6     235.3
(a) CKA similarity for IID data in CIFAR-10. (b) CKA similarity for non-IID data in CIFAR-10.
Figure 5: Illustrations for CKA similarity of IID data and non-IID data with CIFAR-10.
(a) Stage 2 of FedAvg. (b) Stage 3 of FedAvg. (c) Stage 2 of InclusiveFL. (d) Stage 3 of InclusiveFL. (e) Stage 2 of FedIN. (f) Stage 3 of FedIN.
Figure 6: Heatmaps of CKA similarity from stage 2 and stage 3 among different clients in non-IID data with CIFAR-10.
(a) FedAvg. (b) InclusiveFL. (c) FedIN.
Figure 7: t-SNE visualization of features learned by different methods from stage 3 on CIFAR-10. We select data from the same class and utilize three models with different architectures (Client0: ResNet10, Client1: ResNet14, Client2: ResNet26).

5.3 The Reason for the Improvements

5.3.1 CKA similarity for different stages

Inspired by [42] and [43], we use CKA similarity [44] to examine the layer similarity among different clients across different methods, in order to shed light on the reasons behind the improvements observed with FedIN. A higher CKA similarity indicates that client models can effectively capture common features in the context of system heterogeneity. In our analysis, stage $i$ denotes the $i$-th block in the ResNet architecture, aligned with the corresponding layer $i$ in the PyTorch ResNet source code. It is worth mentioning that ResNet10, ResNet14, ResNet18, ResNet22, and ResNet26 all consist of four stages. Our analysis focuses on the CKA similarity of the outputs of these four stages. To simplify the figure annotations, we concentrate on three representative methods: FedAvg as an essential baseline, InclusiveFL as a representative method for system heterogeneity, and our proposed method, FedIN.
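For reference, linear CKA [44] between two activation matrices extracted at the same stage of two client models can be computed as sketched below; this is the standard formula, not code from the paper.

```python
# Sketch of linear CKA [44] between activation matrices X (n x d1) and Y (n x d2)
# obtained from the same n inputs at a given stage of two client models.
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    X = X - X.mean(dim=0, keepdim=True)   # center the features
    Y = Y - Y.mean(dim=0, keepdim=True)
    # HSIC-based similarity: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (Y.t() @ X).norm(p='fro') ** 2
    denominator = (X.t() @ X).norm(p='fro') * (Y.t() @ Y).norm(p='fro')
    return numerator / denominator
```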

Figure 5 illustrates the CKA similarity of different stages under the IID and non-IID settings. Notably, in Figure 5a, FedIN exhibits the highest similarity even in the deepest stage (stage 3), while FedAvg and InclusiveFL struggle to maintain high similarity levels in stage 3, as evidenced by the gray area in the figure. In Figure 5b, FedIN still maintains higher similarity than FedAvg and InclusiveFL, especially in the deep stage (stage 3).

To gain further insights into the dissimilarities between FedIN and the other methods, we present heatmaps of similarity from stage 2 and stage 3 among clients in Figure 6. Figure 6a, Figure 6c, and Figure 6e demonstrate the heatmaps from stage 2. Similar to the observations in Figure 5b, the average similarity of FedIN marginally surpasses that of FedAvg and InclusiveFL. However, in stage 3, as shown in Figure 6b, Figure 6d, and Figure 6f, the heatmap corresponding to FedIN (Figure 6f) exhibits significantly lighter shades compared to the heatmaps of FedAvg (Figure 6b) and InclusiveFL (Figure 6d). These results and analyses suggest that FedIN ensures consistency among the deep layers of client models, captures valuable shared features, and achieves superior accuracy in the presence of system heterogeneity.

5.3.2 t-SNE visualization

We conduct t-SNE visualizations [45] of the features extracted from stage 3 in Figure 7, focusing on data belonging to the same class. The objective is to observe the clustering behavior of these data points. In Figure 7a and Figure 7b, it is evident that the features from clients 0 and 1 are separated from the features of client 2. In contrast, the features from these three clients form a single cluster in FedIN, as depicted in Figure 7c, validating that in FedIN the features of data from the same class are consistent across different model architectures.
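This visualization can be reproduced roughly as follows with scikit-learn and matplotlib; how the stage-3 features are extracted from each client model is an assumption of the sketch.

```python
# Sketch of the t-SNE visualization: embed stage-3 features of same-class
# samples from several clients into 2-D and scatter-plot them by client.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features_per_client, seed: int = 0):
    # features_per_client: list of (n_i, d) arrays, one per client, same class.
    all_feats = np.concatenate(features_per_client, axis=0)
    embedded = TSNE(n_components=2, random_state=seed).fit_transform(all_feats)
    start = 0
    for client_id, feats in enumerate(features_per_client):
        end = start + len(feats)
        plt.scatter(embedded[start:end, 0], embedded[start:end, 1],
                    s=5, label=f"Client{client_id}")
        start = end
    plt.legend()
    plt.show()
```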

TABLE VII: Model accuracy (%) in the ablation studies on CIFAR-10.
Data      FedAvg  w/o IN  w/o Prox  w/o Opt  FedIN
IID       76.8    77.6    78.8      79.4     80.5
Non-IID   66.2    72.0    66.4      74.9     75.9
Figure 8: Smoothed test accuracy for non-IID data of CIFAR-10 in the ablation study. The grey lines show the original (unsmoothed) accuracy.
(a) Effects of different batch sizes. (b) Effects of different sample numbers.
Figure 9: Effects of different batch sizes and different sample numbers on non-IID CIFAR-10.
TABLE VIII: Model accuracy (%) with different client numbers N_c on CIFAR-10.
                    IID                                      Non-IID
Methods        N_c=10  N_c=20  N_c=50  N_c=100  N_c=200     N_c=10  N_c=20  N_c=50  N_c=100  N_c=200
FedAvg         79.3    79.2    78.7    76.8     74.0        68.3    67.9    66.9    66.2     62.5
InclusiveFL    77.5    76.7    79.1    75.0     73.4        66.8    68.4    67.1    66.1     61.2
FedIN          82.8    83.1    81.0    80.5     74.3        76.7    76.3    74.1    75.9     72.2

5.4 Ablation Study

We conduct an ablation study to evaluate the contributions of the key components of FedIN. Our ablation study includes the following methods: (i) FedAvg, (ii) FedIN w/o IN (FedIN without the IN loss), (iii) FedIN w/o Prox (FedIN without the Prox regularization term), and (iv) FedIN w/o Opt (FedIN without gradient alleviation, i.e., the optimization). TABLE VII and Figure 8 illustrate the results of the ablation studies.

5.4.1 Effects of the gradient alleviation

In Section 4.2, we propose a convex optimization formulation to address gradient divergence. In FedIN, we simultaneously update the intermediate layers with the gradients from the local training and the IN training via Eq. 14. This approach, referred to as gradient alleviation, serves as a solution to the gradient divergence problem. In this experiment, we show that this solution is effective in mitigating gradient divergence.

Figure 8 compares the results of considering the gradient divergence problem and ignoring it. When FedIN disregards this problem, it updates the entire model from the local training and then continues to update the intermediate layers from the IN training, as depicted by the result of FedIN w/o Opt in Figure 8. The accuracy achieved by FedIN surpasses that of FedIN without gradient alleviation (FedIN w/o Opt), and the convergence speed of FedIN is also accelerated, as observed in Figure 8. Furthermore, the application of gradient alleviation leads to a performance improvement of about 1%, as shown in TABLE VII. These findings validate the effectiveness and efficiency of our proposed solution to the gradient divergence problem. It is noteworthy that gradient alleviation does not impose any additional burden on either the clients or the server.

5.4.2 Effects of the loss function

In FedIN, the overall update (Eq. 14) involves two additional terms, the IN loss and the Prox regularization term. When considering FedIN w/o Prox, the client models are trained without regularization, and the convergence speed is similar to that of FedIN w/o IN before round 200, as shown in Figure 8. However, after 200 rounds, FedIN w/o Prox becomes unstable and its performance deteriorates during the subsequent training process, suggesting that the client models overfit their local datasets. In the end, FedIN w/o Prox only achieves performance similar to FedAvg, as shown in TABLE VII, hinting that the benefit of the IN loss is eliminated by the end of the training process. Therefore, the inclusion of the regularization term is essential to maintain the effectiveness of the IN loss throughout training. After adding the Prox regularization term, FedIN w/o Opt achieves better results than FedIN w/o IN and FedIN w/o Prox, indicating the efficacy of combining the IN loss with the Prox regularization term.

5.4.3 Effects of client numbers

To investigate the effects of varying client numbers, we conduct experiments on CIFAR-10, as presented in TABLE VIII. $N_c$ denotes the number of clients. Notably, FedIN outperforms the other methods across all client-number settings. Furthermore, as the number of clients increases, we observe a decline in accuracy because the amount of local data per client decreases. With a higher number of clients, such as 200 clients, the clients face greater challenges in learning meaningful features due to the limited local data. However, FedIN still achieves 72.2% with 200 clients under non-IID data, surpassing the performance of FedAvg (62.5%) and InclusiveFL (61.2%). These results demonstrate the robustness and effectiveness of FedIN in handling the challenges posed by a large number of clients.

5.4.4 Effects of batch sizes and sample numbers

We also analyze different batch sizes and sample numbers on CIFAR-10 to verify the effects of these hyperparameters. As shown in Figure 9a, batch sizes of 16, 32, and 64 are the best selections, but batch sizes of 8 and 128 still outperform HeteroFL and InclusiveFL. Considering the communication overhead, a batch size of 16 is the optimal choice. From Figure 9b, it is clear that increasing the sample number has little impact on accuracy. However, even with a sample number of 1, there is a significant improvement over the baselines. It is therefore unnecessary to add excessive overhead for marginal improvement.

6 Conclusions

We propose a novel method, called FedIN, which supports model heterogeneity in FL environments. FedIN conducts local training on the private dataset and IN training on the client features, requiring only one batch of features. Moreover, we formulate a convex optimization problem to tackle the gradient divergence problem induced by the combination of local training and IN training. We conduct extensive experiments on IID and non-IID data from three public datasets with seven baselines. The experimental results illustrate that FedIN achieves superior performance on both IID and non-IID data. Moreover, we conduct an analysis to elucidate the efficiency and effectiveness of FedIN in heterogeneous environments. Furthermore, we investigate the contributions of each component of FedIN in the ablation studies. These studies not only highlight the advantages of our proposed solution to the gradient divergence problem but also emphasize the importance of IN training and the impact of varying batch sizes, sample numbers, and client numbers.

References

  • [1] X. Song, H. Zhang, R. Akerkar, H. Huang, S. Guo, L. Zhong, Y. Ji, A. L. Opdahl, H. Purohit, A. Skupin, A. Pottathil, and A. Culotta, “Big data and emergency management: Concepts, methodologies, and applications,” IEEE Transactions on Big Data, vol. 8, no. 2, pp. 397–419, 2022.
  • [2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics.   PMLR, 2017, pp. 1273–1282.
  • [3] Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation,” IEEE transactions on neural networks and learning systems, vol. 31, no. 10, pp. 4229–4238, 2019.
  • [4] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated multi-task learning,” 31st Conference on Neural Information Processing Systems (NeurIPS), 2017.
  • [5] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” 2021.
  • [6] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [7] C. Xie, S. Koyejo, and I. Gupta, “Asynchronous federated optimization,” 12th Annual Workshop on Optimization for Machine Learning, 2020.
  • [8] Y. Chen, Y. Ning, M. Slawski, and H. Rangwala, “Asynchronous online federated learning for edge devices with non-iid data,” in 2020 IEEE International Conference on Big Data (Big Data).   IEEE, 2020, pp. 15–24.
  • [9] Z. Chai, Y. Chen, A. Anwar, L. Zhao, Y. Cheng, and H. Rangwala, “Fedat: A high-performance and communication-efficient federated learning system with asynchronous tiers,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21.   New York, NY, USA: Association for Computing Machinery, 2021.
  • [10] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [11] X. Fang and M. Ye, “Robust federated learning with noisy and heterogeneous clients,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 072–10 081.
  • [12] D. Li and J. Wang, “FedMD: Heterogenous federated learning via model distillation,” NeurIPS Workshop on Federated Learning for Data Privacy and Confidentiality, 2019.
  • [13] O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” Journal of Network and Computer Applications, vol. 116, pp. 1–8, 2018.
  • [14] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” arXiv preprint arXiv:1812.00564, 2018.
  • [15] C. He, M. Annavaram, and S. Avestimehr, “Group knowledge transfer: Federated learning of large cnns at the edge,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 068–14 080, 2020.
  • [16] E. Diao, J. Ding, and V. Tarokh, “HeteroFL: Computation and communication efficient federated learning for heterogeneous clients,” in International Conference on Learning Representations, 2021.
  • [17] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Advances in neural information processing systems, vol. 33, pp. 7611–7623, 2020.
  • [18] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [19] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of the 3rd MLSys Conference, 2020.
  • [20] W. Huang, M. Ye, and B. Du, “Learn from others and be yourself in heterogeneous federated learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 143–10 153.
  • [21] F. Sattler, T. Korjakow, R. Rischke, and W. Samek, “Fedaux: Leveraging unlabeled auxiliary data in federated learning,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [22] Z. Zhu, J. Hong, and J. Zhou, “Data-free knowledge distillation for heterogeneous federated learning,” in International Conference on Machine Learning.   PMLR, 2021, pp. 12 878–12 889.
  • [23] Y. Li, W. Zhou, H. Wang, H. Mi, and T. M. Hospedales, “FedH2L: Federated learning with model and statistical heterogeneity,” arXiv preprint arXiv:2101.11296, 2021.
  • [24] G. K. Nayak, K. R. Mopuri, V. Shaj, V. B. Radhakrishnan, and A. Chakraborty, “Zero-shot knowledge distillation in deep networks,” in International Conference on Machine Learning.   PMLR, 2019, pp. 4743–4751.
  • [25] H. Chen, Y. Wang, C. Xu, Z. Yang, C. Liu, B. Shi, C. Xu, C. Xu, and Q. Tian, “Data-free learning of student networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3514–3522.
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [27] T. Shen, J. Zhang, X. Jia, F. Zhang, G. Huang, P. Zhou, K. Kuang, F. Wu, and C. Wu, “Federated mutual learning,” arXiv preprint arXiv:2006.16765, 2020.
  • [28] Y. H. Chan and E. Ngai, “Fedhe: Heterogeneous models and communication-efficient federated learning,” IEEE International Conference on Mobility, Sensing and Networking (MSN 2021), 2021.
  • [29] H. Baek, W. J. Yun, Y. Kwak, S. Jung, M. Ji, M. Bennis, J. Park, and J. Kim, “Joint superposition coding and training for federated learning over multi-width neural networks,” in IEEE INFOCOM 2022-IEEE Conference on Computer Communications.   IEEE, 2022, pp. 1729–1738.
  • [30] J. Yu and T. S. Huang, “Universally slimmable networks and improved training techniques,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1803–1811.
  • [31] S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. Venieris, and N. Lane, “Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout,” Advances in Neural Information Processing Systems, vol. 34, pp. 12 876–12 889, 2021.
  • [32] R. Liu, F. Wu, C. Wu, Y. Wang, L. Lyu, H. Chen, and X. Xie, “No one left behind: Inclusive federated learning over heterogeneous devices,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3398–3406.
  • [33] R. I. Bot, S.-M. Grad, and G. Wanka, Duality in vector optimization.   Springer Science & Business Media, 2009.
  • [34] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.
  • [35] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [37] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [38] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [39] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
  • [40] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning.   PMLR, 2020, pp. 5132–5143.
  • [41] Q. Li, B. He, and D. Song, “Model-contrastive federated learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 713–10 722.
  • [42] M. Luo, F. Chen, D. Hu, Y. Zhang, J. Liang, and J. Feng, “No fear of heterogeneity: Classifier calibration for federated learning with non-iid data,” Advances in Neural Information Processing Systems, vol. 34, pp. 5972–5984, 2021.
  • [43] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy, “Do vision transformers see like convolutional neural networks?” Advances in Neural Information Processing Systems, vol. 34, pp. 12 116–12 128, 2021.
  • [44] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International conference on machine learning.   PMLR, 2019, pp. 3519–3529.
  • [45] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.