
EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Edge Devices

Meihan Wu1, Tao Chang1, Cui Miao1, Jie Zhou1, Chun Li2, Xiangyu Xu3, Ming Li4 and Xiaodong Wang1
1National University of Defense Technology 2Shenzhen MSU-BIT
3 Xi’an Jiaotong University 4 Guangming Laboratory
{meihanwu20,changtao15,miaocui1024,jiezhou,xdwang}@nudt.edu.cn
[email protected]
Corresponding author.
Abstract

Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. However, ViT training demands substantially more computation because ViTs lack the 2D inductive biases inherent in CNNs, and efficient federated training of ViTs on resource-constrained edge devices remains unexplored in the community. In this paper, we propose EFTViT, a hierarchical federated framework that leverages masked images to enable efficient, full-parameter training on resource-constrained edge devices, offering substantial benefits for learning on heterogeneous data. In general, we patchify images and randomly mask a portion of the patches, observing that excluding them from training has minimal impact on performance while substantially reducing computation costs and enhancing data content privacy. Specifically, EFTViT comprises a series of lightweight local modules and a larger global module, updated independently on the clients and the central server, respectively. The local modules are trained on masked image patches, while the global module is trained on intermediate patch features uploaded from the clients, which are balanced through a proposed median sampling strategy to conceal each client's data distribution. We analyze the computational complexity and privacy protection of EFTViT. Extensive experiments on popular benchmarks show that EFTViT achieves up to 28.17% accuracy improvement, reduces local training computational cost by up to 2.8×, and cuts local training time by up to 4.4× compared to existing methods.

1 Introduction

Federated Learning (FL) aims to enable collaborative training over data distributed across multiple clients while prioritizing data privacy protection [24, 25, 21]. Early research on FL primarily concentrated on Convolutional Neural Networks (CNNs) [22, 20, 1]. Recently, the focus has increasingly shifted toward Vision Transformers (ViTs) [8], whose self-attention mechanisms excel at capturing long-range correspondences within images, achieving state-of-the-art performance across visual problems, e.g., object recognition [8], detection [13, 6], and semantic segmentation [40]. Despite their impressive capabilities, training ViTs generally incurs significantly higher computational costs and longer training times due to the lack of spatial inductive biases [30, 3], making it prohibitively challenging for resource-constrained edge devices.

(a) Previous CNN Method
(b) Our ViT Method
Figure 1: Comparison of our ViT-based method with previous CNN approaches in resource-limited federated learning. Previous methods employ stage-wise model division and full images for training, limiting flexibility and efficiency. In contrast, our method utilizes layer-wise division in ViTs, allowing flexible adaptation to client resources, and leverages masked images for training, fundamentally reducing computational costs, time, and data privacy leakage.

In the CNN era, the resource-constrained FL problem has been explored by some researchers. The workflow of these methods is summarized in Figure 1(a). Typically, model-heterogeneous methods [23, 1, 4, 37] train models of varying sizes on clients based on their available resources. However, these approaches are not well-suited to ViTs, as they fail to fundamentally reduce the computational demands of client-side training.

In this work, we explore whether the training computational costs of ViTs can be fundamentally reduced without significantly compromising FL performance. Recent work in self-supervised learning has demonstrated that masked image modeling can effectively learn generalizable visual representations by reconstructing randomly masked pixels in input images [13, 32], highlighting the substantial redundancy in images that may be unnecessary for recognition. To test this hypothesis, we conduct FL experiments with no resource constraints, using masked images to examine their impact on model performance and training computational costs. In the experiments, images are uniformly partitioned into non-overlapping patches, with a specified ratio $r_m$ of patches randomly masked. Only the unmasked patches are utilized for model training.
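To make this masking procedure concrete, the following minimal PyTorch sketch patchifies a batch of images and keeps a random subset of patches; the function and variable names are illustrative rather than taken from any released implementation.

```python
import torch

def mask_image_patches(images, patch_size=16, mask_ratio=0.75):
    """Split images into non-overlapping patches and keep a random (1 - mask_ratio) subset.

    images: (B, C, H, W) tensor with H and W divisible by patch_size.
    Returns the kept patches (B, n_keep, C * patch_size**2) and their indices (B, n_keep).
    """
    b, c, h, w = images.shape
    # (B, C, H, W) -> (B, num_patches, C * patch_size * patch_size)
    patches = (
        images.reshape(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
        .permute(0, 2, 4, 1, 3, 5)
        .reshape(b, (h // patch_size) * (w // patch_size), -1)
    )
    n = patches.shape[1]
    n_keep = int(n * (1.0 - mask_ratio))
    # Random permutation per image; the first n_keep indices are the unmasked patches.
    keep_idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return kept, keep_idx
```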

As illustrated in Figure 2, we conduct experiments under a challenging data heterogeneity setting with $\beta=0.1$, where $\beta$ is the concentration parameter of the Dirichlet distribution $Dir_N(\beta)$ used in FL. Results indicate that varying the masking ratio has minimal impact on model accuracy but significantly reduces training computation costs. For instance, increasing $r_m$ from 0.00 to 0.75 reduces the computational load by up to 5.2×, with only a marginal decrease in accuracy. These findings suggest that using masked images in FL is a promising approach for enabling efficient ViT training on resource-constrained edge devices.

Figure 2: Impact of varying the masking ratio $r_m$ in federated learning experiments conducted on two benchmarks without resource constraints. Experiments are conducted under a high data heterogeneity setting with $\beta=0.1$, the concentration parameter of the Dirichlet distribution $Dir_N(\beta)$. Results indicate that increasing the masking ratio to 0.75 minimally affects model performance while substantially reducing training costs.
Figure 3: Overview of our hierarchical framework, EFTViT, for efficient federated training of ViTs on resource-constrained edge devices. The framework consists of local modules trained on clients and a shared global module trained on the central server. Each client utilizes masked images from various tasks to train its respective local module and classification head. Prior to uploading to the server, the proposed median sampling strategy is applied to balance patch features from masked images to match the median of class distribution, thereby enhancing client data privacy protection. The server trains the global module using patch features received from clients, then transmits the updated parameters back to the clients for the next training round.

Inspired by these observations, we propose EFTViT, a hierarchical federated learning framework (as illustrated in Figure 1(b)) that employs masked images to efficiently train ViT models on resource-constrained clients with heterogeneous data, while also enhancing privacy protection by concealing client data content. EFTViT comprises lightweight local modules on edge clients and a larger global module on the central server, designed to accommodate limited client resources. The local modules are trained on masked images. Rather than aggregating parameters from clients, the global module receives intermediate patch features from the local modules, enabling it to learn universal representations suitable for heterogeneous data. To conceal each client's data distribution, we propose a median sampling strategy that adjusts the patch feature count for each class to the median across all classes prior to uploading, enhancing both performance and training efficiency.

Our main contributions in this work are summarized as follows:

  • To the best of our knowledge, we present EFTViT, the first federated learning framework to leverage masked images for efficiently training ViT models across multiple resource-constrained clients, while also enhancing client data content protection.

  • EFTViT enables hierarchical training of all model parameters across clients and the central server, demonstrating substantial benefits for heterogeneous data. Additionally, we introduce a median sampling strategy to obscure the distribution information of intermediate features before they are uploaded to the server.

  • Experiments on popular benchmarks demonstrate that EFTViT improves accuracy by up to 28.17%, reduces local training computational costs by up to 2.8×, and lowers local training time by as much as 4.4×, setting new state-of-the-art results.

2 Related Works

2.1 General Federated Learning

Federated learning is a decentralized machine learning approach that enhances privacy by training models directly on client devices and transmitting only model parameters to a central server. Most studies focus on addressing data heterogeneity [22, 17, 11, 20] and privacy protection [2, 27, 5] in FL. For instance, FedProx [22] adds a proximal term to the local updates to address data heterogeneity. Regarding privacy protection, Asad et al. [2] apply homomorphic encryption to FL, enabling clients to encrypt their local models using private keys. Shi et al. [27] propose an FL method with differential privacy (DP). However, these works rely on the ideal assumption that clients have sufficient resources to handle the model training process.

2.2 Federated Learning on Edge Devices

Federated learning approaches on resource-constrained clients can be categorized into federated distillation (FD) [12, 15, 19, 31] and partial training (PT) [7, 1]. FD methods focus on aggregating knowledge from heterogeneous client models to a server model. For instance, FedGKT [12] trains small models on clients and periodically transfers their knowledge to a large server model via knowledge distillation. PT methods divide a global model into smaller sub-models that can be locally trained on resource-constrained clients. For instance, HeteroFL [7] randomly selects sub-models from the global model to distribute to clients. However, these methods adapt model size to clients’ capacities, rather than fundamentally addressing the computational burden of client-side training.

2.3 Parameter-Efficient Fine-Tuning

When dealing with complex Transformer-based models, Parameter-Efficient Fine-Tuning (PEFT) [36, 16, 14] provides a practical way to adapt pre-trained models to various downstream tasks, reducing storage and computation costs by freezing most pre-trained parameters and fine-tuning only a small subset [10]. Several studies [29, 38] have explored different PEFT techniques to assess performance improvements and resource savings in federated systems. However, the limited number of fine-tuned parameters in PEFT inevitably constrains the adaptability of pre-trained models to new tasks, potentially resulting in suboptimal performance in federated systems with data heterogeneity.

3 Efficient Federated Learning with Masked Images

3.1 Problem Definition

We formulate our problem with supervised classification tasks distributed across $K$ clients. Each client $k$ possesses a dataset $D_k := (X_k, Y_k)$, where $X_k \in \mathbb{R}^{N_k \times d_k}$ denotes the data samples and $Y_k \in \mathbb{R}^{N_k \times c_k}$ represents their corresponding labels. Here, $N_k$ is the number of data points, $d_k$ the input dimension, and $c_k$ the number of classes.

3.2 Overview

As illustrated in Figure 3, EFTViT employs hierarchical training across clients and the central server to enable privacy-preserving and efficient collaborative learning. Each client includes a local module with $M$ Transformer layers, a shared global module with $N$ Transformer layers, and a classification head. The local module and classification head are trained on each client with unmasked image patches $X_p$, enabling efficient local training and generating patch features that represent local knowledge. To safeguard data distribution privacy, a median sampling strategy is applied on each client to create a balanced patch features (BPF) dataset before uploading to the server. The global module is then trained on the server using the BPF datasets from clients to effectively learn global representations for all tasks. Finally, the server transmits the updated global module parameters back to the clients for the next training round.

3.3 Training with Masked Images

To enable efficient local training on resource-constrained clients, we present a patch-wise optimization strategy. First, each input image is divided into a sequence of regular, non-overlapping patches, which are randomly masked at a ratio $r_m$. The remaining unmasked patches, denoted as $X_p$, are then used to train our framework. We define the patch features obtained by the local module $\mathcal{M}_k(\phi_k;\cdot)$ on client $k$ as $H_p^k = \mathcal{M}_k(\phi_k; X_p^k)$, where $X_p^k = Mask\_D(X_k)$ and $Mask\_D(X_k)$ is the operation of randomly masking image patches of $X_k$ and discarding the selected patches. To preserve patch ordering for ViTs, the positional embeddings [28] of the remaining patches are retained. This design is inspired by the internal redundancy of images and reduces the amount of data the model needs to process, thereby lowering computational complexity. Additionally, the patch features $H_p^k$ make it highly challenging to reconstruct the original images, since they are encoded from only a small portion of each image, inherently providing EFTViT with a content privacy advantage. Notably, full images are used for inference on each client.
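As a rough sketch of how $Mask\_D$ can be realized at the token level, the module below embeds patches, adds positional embeddings, and then drops a random subset of tokens so that only a $(1-r_m)$ fraction is processed during training while full images pass through at inference; all names and defaults are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MaskedPatchEmbed(nn.Module):
    """Illustrative Mask_D: embed patches, add positional embeddings,
    then keep a random (1 - r_m) subset of tokens during training."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768, mask_ratio=0.75):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, (img_size // patch_size) ** 2, dim))
        self.mask_ratio = mask_ratio

    def forward(self, images):
        tokens = self.proj(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = tokens + self.pos_embed                        # positions added before masking
        if not self.training or self.mask_ratio == 0.0:
            return tokens                                       # full images at inference time
        b, n, d = tokens.shape
        n_keep = int(n * (1.0 - self.mask_ratio))
        keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
        return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
```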

3.4 Data Distribution Protection with Median Sampling

To enhance privacy in EFTViT, we propose a median sampling strategy to generate a balanced patch features dataset $D_H^k$ on each client. It aims to ensure that the generated patch features on each client contain an equal number of samples for each class, thereby preventing the leakage of statistical information or user preferences when uploaded to the central server. Imbalanced data distribution on clients is a common issue in federated learning, and the median, being less sensitive to extreme values, is well-suited for addressing this challenge. Our median sampling strategy uses the median of class sample counts on each client to differentiate between minority and majority classes. It then applies oversampling to increase samples of minority classes and downsampling to reduce samples of majority classes. Specifically, for minority class samples, all patch features generated across multiple local training epochs are retained, whereas for majority class samples, only patch features from the final epoch are preserved. Next, downsampling is applied to reduce the number of samples in each class to the median. Empirically, we find that increasing the sampling threshold adds to computation costs but does not significantly improve final performance.
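A simplified sketch of this balancing step is given below, assuming the client has already buffered (patch feature, label) pairs as described (minority classes kept from every local epoch, majority classes from the final epoch only); the helper names are ours, and the even-count median is taken as the upper middle element for brevity.

```python
import random
from collections import defaultdict

def median_balance(feature_buffer):
    """Balance buffered (patch_feature, label) pairs so that no class
    exceeds the median per-class sample count before uploading."""
    per_class = defaultdict(list)
    for feat, label in feature_buffer:
        per_class[label].append((feat, label))

    counts = sorted(len(samples) for samples in per_class.values())
    n_med = counts[len(counts) // 2]                # median class count

    balanced = []
    for label, samples in per_class.items():
        if len(samples) > n_med:                    # downsample majority classes to the median
            samples = random.sample(samples, n_med)
        balanced.extend(samples)                    # minority classes were oversampled upstream
    return balanced
```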

3.5 Hierarchical Training Paradigm

To effectively reduce the computational burden on clients without compromising performance, we propose a new hierarchical training strategy for ViTs that minimizes the number of trainable parameters on the clients. As described above, our framework comprises a collection of lightweight local modules, a shared larger global module, and a classification head.

Training on Clients. On client $k$, the local module $\mathcal{M}_k(\phi_k;\cdot)$ is responsible for mapping image patches $X_p$ into patch features $H_p$, while the global module $\mathcal{M}_k(w;\cdot)$ encodes $H_p$ into representation vectors $H_r$. The final classification head $\mathcal{M}_k(\theta_k;\cdot)$ maps the representation vectors $H_r$ to outputs matching the number of classes. Only the parameters of the local module and classification head are trainable, while the parameters of the global module remain frozen and are iteratively updated via downloads from the server. For client $k$, the loss function used in local training is defined as

\mathcal{L}(\phi_{k},\theta_{k}) = -\sum_{i}^{c_{k}} p(y=i)\log\big(\mathcal{M}_{k}(\phi_{k},w,\theta_{k};X_{ip}^{k})\big),   (1)

where $c_k$ is the number of classes on client $k$, and $p(y=i)$ is the probability distribution of label $i$. The parameters $\phi_k$, $w$, $\theta_k$ are from the local module, global module, and classification head, respectively. Therefore, the optimization objective is to minimize

\min_{\phi_{k},\theta_{k}} \mathbb{E}_{X_{ip}^{k}\sim D_{k}}\big[\mathcal{L}(\mathcal{M}_{k}(\phi_{k},w,\theta_{k};X_{ip}^{k}),Y_{k})\big],   (2)

where $\phi_k$ and $\theta_k$ are trainable.
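The parameter split can be illustrated with the sketch below, which divides a pre-trained ViT-B into a trainable local module (the first $M$ Transformer layers) plus a classification head, and a frozen global module (the remaining layers). The torchvision backbone and weights here are stand-ins for the ImageNet-21K pre-trained ViT-B used in the paper.

```python
import torch.nn as nn
from torchvision.models import vit_b_16

def build_client_model(num_classes, m_local=2):
    """Split a ViT-B encoder into a trainable local module (first m_local layers)
    and a frozen global module (remaining layers), plus a trainable head."""
    vit = vit_b_16(weights="IMAGENET1K_V1")            # placeholder pre-trained backbone
    blocks = list(vit.encoder.layers)                  # 12 Transformer blocks in ViT-B
    local_module = nn.Sequential(*blocks[:m_local])    # trainable on the client
    global_module = nn.Sequential(*blocks[m_local:])   # frozen; refreshed from the server
    head = nn.Linear(vit.hidden_dim, num_classes)      # trainable classification head

    for p in global_module.parameters():
        p.requires_grad = False                        # only phi_k and theta_k receive gradients
    return local_module, global_module, head
```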

Training on Server. The server aggregates heterogeneous knowledge from clients to learn universal representations across diverse tasks. The global module $\mathcal{M}_S(w;\cdot)$ and classification head $\mathcal{M}_S(\theta;\cdot)$ are trained using the balanced patch features datasets uploaded from participating clients in the latest training round. The loss function can be formulated as

\mathcal{L}_{s}(w,\theta) = -\sum_{i}^{C} p(y=i)\log\big(\mathcal{M}_{S}(w,\theta;H_{p})\big),   (3)

where $C$ is the total number of classes, and $p(y=i)$ is the probability distribution of label $i$ on the data. The optimization objective on the server is to minimize

\min_{w,\theta} \mathbb{E}_{H_{p}\sim D_{H}}\big[\mathcal{L}_{s}(\mathcal{M}_{S}(w,\theta;H_{p}),Y)\big],   (4)

where $H_p$ and $Y$ denote the patch features and labels uploaded from clients, respectively.
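The following sketch, under the same illustrative assumptions as above, shows what one server-side update on the uploaded balanced patch features could look like; in particular, mean pooling over tokens before the head is our simplification, since the paper does not specify the pooling scheme.

```python
import torch

def server_update(global_module, head, balanced_loader, epochs=2, lr=5e-5):
    """Train the shared global module and head on batches of
    (patch_features, labels) uploaded by clients."""
    params = list(global_module.parameters()) + list(head.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.05)
    criterion = torch.nn.CrossEntropyLoss()

    global_module.train(); head.train()
    for _ in range(epochs):
        for patch_features, labels in balanced_loader:
            reps = global_module(patch_features)      # (B, n_keep, dim) -> representations
            logits = head(reps.mean(dim=1))           # mean-pool tokens (assumed) before the head
            loss = criterion(logits, labels)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return global_module.state_dict(), head.state_dict()
```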

3.6 Collaborative Algorithms

The overall workflow of EFTViT is shown in Algorithm 1 and Algorithm 2. At the start of each round $t$, the server randomly selects a proportion $P$ of the $K$ clients to participate in training. Each client updates the parameters of its global module and classification head with those received from the server, and then starts local training. Median sampling is applied to the patch features $H_p^k$ to obscure the local data distribution and produce a balanced dataset. The detailed process is presented in Algorithm 1.

The server receives the balanced patch features datasets $D_H^k$ from the $P\cdot K$ participating clients and updates the global dataset $D_H$, storing data from new clients and refreshing data from previously seen ones. This dataset is used to train the global module $\mathcal{M}_S(w;\cdot)$ and classification head $\mathcal{M}_S(\theta;\cdot)$, and the updated parameters $w$ and $\theta$ are sent back to the clients upon completion of training. The process is elaborated in Algorithm 2.

Algorithm 1 EFTViT: Clients

Input: $D_k := (X_k, Y_k)$ is the dataset on client $k$. $E_c$ is the number of training epochs on each client. $\phi_k$, $w$, $\theta_k$ are the parameters of the local module, global module, and classification head, respectively. $Mask\_D(X_k)$ is the operation of randomly masking and discarding image patches.

1:  Local_Update_Sampling($k, w, \theta$):
2:  Update the parameters $w, \theta$.
3:  Calculate the maximum sample count $n_{max}$ and the median $n_{med}$ on $D_c^k$.
4:  for each class $i$ do
5:     if $n_i < n_{med}$ and $y_k == i$ then
6:        $D_{minority} \leftarrow add(X_k, Y_k)$
7:     end if
8:  end for
9:  for each epoch $e$ from $1$ to $E_c$ do
10:     for each batch in $D_k$ do
11:        $\phi_k \leftarrow \phi_k - \eta_c \nabla\mathcal{L}_c(\phi_k, \theta_k)$
12:        $\theta_k \leftarrow \theta_k - \eta_c \nabla\mathcal{L}_c(\phi_k, \theta_k)$
13:        for each sample $s \in D_{minority}$ do
14:           $H_p^s, Y_k^s \leftarrow \mathcal{M}_k(\phi_k, Mask\_D(X_k), Y_k)$
15:           $D_H^k \leftarrow add(H_p^s, Y_k^s)$                  // Oversampling
16:        end for
17:     end for
18:  end for
19:  for each sample $s \notin D_{minority}$ do
20:     $D_H^k \leftarrow add(H_p^s, Y_k^s)$
21:  end for
22:  for each class $i$ do
23:     Delete the first $(n_i - n_{med})$ samples.                 // Downsampling
24:  end for
25:  Send $D_H^k$ to the server.
Algorithm 2 EFTViT: Server

Input: $T$ is the number of training rounds. $E_s$ is the number of training epochs on the server. $w$, $\theta$ are the parameters of the global module and classification head of the server model, respectively.

1:  Server_Execute:
2:  Send $\phi, w, \theta$ to clients.
3:  for each round $t$ from $1$ to $T$ do
4:     for each client $k$ in parallel do
5:        $D_H^k \leftarrow$ Local_Update_Sampling($k, w, \theta$)
6:        Update $D_H$ with $D_H^k$.
7:     end for
8:     for each server epoch $e$ from 1 to $E_s$ do
9:        for each batch $\in D_H$ do
10:           $w \leftarrow w - \eta_s \nabla\mathcal{L}_s(w, \theta)$,  $\theta \leftarrow \theta - \eta_s \nabla\mathcal{L}_s(w, \theta)$
11:        end for
12:     end for
13:     Send $w, \theta$ to clients.
14:  end for

3.7 Privacy & Complexity Analysis

Data Content Privacy. Contrary to previous beliefs, recent studies show that exchanging intermediate features during federated learning is safer than sharing gradients. This is because attackers only have access to evolving feature maps rather than the final, fully trained maps, making data reconstruction attacks more challenging [12, 35, 39, 41]. Furthermore, EFTViT uploads patch features corresponding to only 25% of the image area, controlled by the masking rate $r_m$, which makes recovering the original image highly challenging, even if theoretically feasible. The masking rate can be further increased to enhance data content privacy, if necessary.

Data Distribution Privacy. To protect user statistical information and preferences, our patch features are balanced via the proposed median sampling strategy on clients, ensuring an equal number of samples for each class. Additionally, our strategy is orthogonal to other privacy protection methods, such as Differential Privacy [9], which can be seamlessly integrated into EFTViT to offer enhanced protection against attacks.

Table 1: Performance comparison between our method and state-of-the-art approaches under two data heterogeneity levels ($\beta=0.1$ and $\beta=1.0$). Notably, Fed-Full trains all parameters on clients with no resource constraints, representing the theoretical upper bound for other methods, and is therefore excluded from the comparison. Our method, EFTViT, achieves the highest accuracy across all test scenarios, demonstrating strong capability in handling highly heterogeneous data at $\beta=0.1$, particularly on CIFAR-100 and UC Merced Land-Use. Best results are highlighted in bold.
Dataset | Heterogeneity Setting | Fed-Full | Fed-Head | Fed-Bias [36] | Fed-Prompt [16] | Fed-LoRA [14] | FEDBFPT [33] | Ours
UC Merced Land-Use | $\beta=0.1$ | 99.31 | 76.67 | 90.48 | 86.43 | 91.19 | 90.95 | 98.80
UC Merced Land-Use | $\beta=1.0$ | 99.33 | 93.57 | 94.52 | 94.52 | 97.14 | 95.71 | 98.10
CIFAR-100 | $\beta=0.1$ | 90.40 | 74.41 | 88.76 | 88.34 | 70.23 | 88.91 | 90.02
CIFAR-100 | $\beta=1.0$ | 92.10 | 79.57 | 90.67 | 90.45 | 84.05 | 90.37 | 90.81
CIFAR-10 | $\beta=0.1$ | 98.31 | 92.27 | 97.83 | 97.99 | 96.91 | 97.99 | 98.12
CIFAR-10 | $\beta=1.0$ | 98.71 | 94.23 | 98.21 | 98.18 | 98.09 | 98.26 | 98.35
(a) CIFAR-10 (b) CIFAR-100 (c) UC Merced Land-Use
Figure 4: Testing accuracy progression of EFTViT and other baselines under high data heterogeneity ($\beta=0.1$) on CIFAR-10, CIFAR-100, and UC Merced Land-Use. The results show that EFTViT consistently outperforms other methods throughout the training process, converging faster and more stably.

Complexity. Given a ViT model, let $(h,w)$ be the resolution of the original image, $(p,p)$ the resolution of each image patch, $n = hw/p^2$ the resulting number of patches, $d$ the latent vector size, and $N_T$ the number of Transformer layers. To simplify the calculation, we assume that the size of $Q$, $K$, and $V$ is $n\times d$. Each client model has $N_T$ Transformer layers, divided into $M$ layers for the local module and $N$ layers for the global module. The model trains on a $(1-r_m)$ fraction of the image patches, where $r_m$ is the masking ratio. The time cost of forward propagation on the client is $\mathcal{O}(5\cdot N_T\cdot(1-r_m)\cdot n\cdot d^2 + 2\cdot N_T\cdot(1-r_m)^2\cdot n^2\cdot d)$. As the parameters of the $N$ Transformer layers in the global module are frozen, the backward propagation time cost is $\mathcal{O}(10\cdot(N_T-N)\cdot(1-r_m)\cdot n\cdot d^2 + 4\cdot(N_T-N)\cdot(1-r_m)^2\cdot n^2\cdot d)$. Therefore, the overall time complexity of the client training stage is $\mathcal{O}((15N_T-10N)\cdot(1-r_m)\cdot n\cdot d^2 + (6N_T-4N)\cdot(1-r_m)^2\cdot n^2\cdot d)$. As $N$ approaches $N_T$ and $r_m$ approaches 1, the computational complexity on the client gradually declines. Our default configuration is $N_T=12$, $N=10$, and $r_m=0.75$, substantially reducing the computational load on the client.
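As a quick sanity check of this expression, the short script below plugs in the default configuration (treating the constants inside the big-O literally, with $n=196$ and $d=768$ for ViT-B at 224×224 resolution) and compares it against full-parameter training on unmasked images; this is our own illustrative calculation, not taken from the paper.

```python
def client_cost(n_t=12, n_frozen=10, r_m=0.75, n=196, d=768):
    """Evaluate (15*N_T - 10*N)*(1-r_m)*n*d^2 + (6*N_T - 4*N)*(1-r_m)^2*n^2*d."""
    keep = 1.0 - r_m
    return (15 * n_t - 10 * n_frozen) * keep * n * d**2 \
         + (6 * n_t - 4 * n_frozen) * keep**2 * n**2 * d

default_cost = client_cost()                          # EFTViT defaults: N=10, r_m=0.75
full_cost = client_cost(n_frozen=0, r_m=0.0)          # train all layers on full images
print(f"relative client training cost: {default_cost / full_cost:.3f}")  # roughly 0.10
```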

4 Experiments

4.1 Datasets

To comprehensively evaluate EFTViT, we conduct experiments on two widely used federated learning datasets, CIFAR-10 [18] and CIFAR-100 [18], as well as a more challenging remote sensing dataset, UC Merced Land-Use [34]. CIFAR-10 and CIFAR-100 each contain 60,000 color images. CIFAR-10 is organized into 10 classes, with 6,000 images per class (5,000 for training and 1,000 for testing), while CIFAR-100 has 100 classes, with 600 images per class (500 for training and 100 for testing). The UC Merced Land-Use dataset contains 21 land-use classes, e.g., agricultural, forest, freeway, and beach, each with 100 images (80 for training and 20 for testing). We partition samples across clients following a Dirichlet distribution $Dir_N(\beta)$ with concentration parameter $\beta$, setting $\beta=\{0.1, 1\}$ to simulate high and low levels of heterogeneity.
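A common way to realize such a Dirichlet split, and the one we assume in the sketch below, is to draw per-class client proportions from $Dir(\beta)$ and assign each class's sample indices accordingly (a smaller $\beta$ yields more skewed client distributions); the function name is illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, beta=0.1, seed=0):
    """Assign sample indices to clients using per-class Dirichlet proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(np.full(num_clients, beta))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, shard in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices
```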

4.2 Implementations

We use ViT-B [8] pre-trained on ImageNet-21K [26] as the backbone of our framework. The input images are resized to $224\times 224$ with a patch size of $16\times 16$. During training, data augmentation techniques such as random cropping, flipping, and brightness adjustment are applied. Following federated learning practices, we set the number of clients to 100, with a client selection ratio $P=0.1$. The AdamW optimizer is used with an initial learning rate of $5\times 10^{-5}$, weight decay of 0.05, and a cosine annealing learning rate schedule with warm-up. We use a batch size of 32 for both training and testing on each client. All experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. In each round, clients train for 5 epochs locally, while the server performs an additional 2 epochs. The framework is trained for a total of 200 rounds, requiring approximately 24 hours.
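For reference, a minimal sketch of this optimization setup is shown below; the warm-up length is not reported in the paper, so warmup_steps is a placeholder assumption.

```python
import math
import torch

def build_optimizer_and_scheduler(trainable_params, total_steps, warmup_steps=500):
    """AdamW (lr 5e-5, weight decay 0.05) with linear warm-up followed by cosine annealing."""
    optimizer = torch.optim.AdamW(trainable_params, lr=5e-5, weight_decay=0.05)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                       # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))            # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```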

Table 2: The number of training rounds (# Rounds) required by EFTViT and other baselines to reach the target accuracy of 85%. Improve denotes the improvement of EFTViT over the other methods. The results demonstrate that EFTViT significantly shortens the convergence process. Note that N/A indicates the corresponding method cannot reach the target accuracy.
Dataset | Metric | Fed-Head | Fed-Bias [36] | Fed-Prompt [16] | Fed-LoRA [14] | FEDBFPT [33] | Ours
UC Merced Land-Use | # Rounds | N/A | 119 | 163 | 108 | 136 | 6
UC Merced Land-Use | Improve | N/A | 19.8× | 27.1× | 18× | 17× | -
CIFAR-100 | # Rounds | N/A | 30 | 35 | N/A | 31 | 4
CIFAR-100 | Improve | N/A | 7.5× | 8.7× | N/A | 7.7× | -
CIFAR-10 | # Rounds | 13 | 10 | 19 | 13 | 10 | 3
CIFAR-10 | Improve | 4.3× | 3.3× | 6.3× | 4.3× | 3.3× | -
Table 3: Training GFLOPs and Time of EFTViT and other baselines, where Time indicates the maximum local training time in federated learning. Improve represents the improvement of EFTViT over other baselines. The results show that EFTViT significantly enhances computational efficiency across both metrics.
Metric | Dataset | Fed-Full | Fed-Head | Fed-Bias [36] | Fed-Prompt [16] | Fed-LoRA [14] | FEDBFPT [33] | Ours
GFLOPs | - | 12.005 | 12.005 | 12.005 | 12.646 | 16.982 | 12.005 | 5.911
GFLOPs Improve | - | 2.0× | 2.0× | 2.0× | 2.1× | 2.8× | 2.0× | -
Time (s) | UC Merced Land-Use | 6.887 | 4.376 | 5.506 | 5.915 | 6.087 | 5.744 | 2.025
Time Improve | UC Merced Land-Use | 3.4× | 2.1× | 2.7× | 2.9× | 3.0× | 2.8× | -
Time (s) | CIFAR-100 | 97.023 | 40.116 | 72.559 | 79.951 | 88.165 | 49.077 | 20.085
Time Improve | CIFAR-100 | 4.8× | 2.0× | 3.6× | 4.0× | 4.4× | 2.4× | -
Time (s) | CIFAR-10 | 96.378 | 42.245 | 69.782 | 75.657 | 83.657 | 51.325 | 18.923
Time Improve | CIFAR-10 | 5× | 2.2× | 3.6× | 3.9× | 4.4× | 2.7× | -

4.3 Comparison with State-of-the-Art Methods

Given the lack of studies on training ViTs on resource-constrained clients, we adapt FEDBFPT [33], originally designed for natural language processing tasks, as a strong baseline; it progressively optimizes the shallower layers while selectively sampling deeper layers to reduce resource consumption. To establish additional baselines, we adapt several well-known PEFT methods to our federated learning setup: (a) Fed-Head trains only the head layer parameters; (b) Fed-Bias applies bias-tuning [36], training only the bias terms; (c) Fed-Prompt incorporates prompt-tuning [16], adding trainable prompt embeddings to the input; and (d) Fed-LoRA integrates LoRA-tuning [14], adding LoRA modules to the query and value layers. These methods use FedAVG [24] for parameter aggregation. Apart from these differences, our method and the baselines share the same federated learning settings.

Testing Accuracy. The testing results of all methods across various datasets and data heterogeneity levels are presented in Table 1. Note that Fed-Full trains all ViT parameters on clients without resource constraints, serving only as a reference for the comparison. Compared with the baselines, EFTViT demonstrates clear performance gains across all scenarios. For instance, we outperform the second-best method by 7.61% on UC Merced Land-Use with $\beta=0.1$. Notably, our method shows consistent results in high and low data heterogeneity settings, with even better performance under higher heterogeneity. In contrast, the baseline methods degrade substantially as data heterogeneity increases. These findings underscore the importance of our hierarchical training strategy in handling data heterogeneity effectively.

Convergence. We report the testing accuracy of EFTViT, FEDBFPT, and the other baselines over 100 training rounds on CIFAR-10, CIFAR-100, and UC Merced Land-Use under the high heterogeneity setting, as shown in Figure 4. Our method consistently achieves the highest testing accuracy on all three datasets throughout training, converging faster and more stably. To quantitatively compare convergence speed, we set a target accuracy of 85% and record the number of training rounds (# Rounds) required to reach this threshold. As shown in Table 2, EFTViT significantly accelerates convergence, reaching the target 27.1× faster than Fed-Prompt on the UC Merced Land-Use dataset.

Computational Efficiency. We evaluate the client-side computational efficiency of EFTViT from two perspectives: the computational cost of forward propagation during training and the maximum local training time across clients. Computational cost is measured in Giga Floating-Point Operations (GFLOPs). At a target accuracy of 85%, we report the maximum local training time (Time) for EFTViT and the other baselines across three datasets. The results in Table 3 show that our method significantly improves computational efficiency on both metrics. Specifically, EFTViT achieves at least 2× the efficiency of other methods in terms of GFLOPs. For training time, EFTViT reduces local training time by 2.8× compared to FEDBFPT on the UC Merced Land-Use dataset. This demonstrates that our masked image and hierarchical training strategies effectively reduce client computation, making EFTViT well-suited for federated learning in resource-constrained environments.

4.4 Ablation Study

We conduct extensive ablation experiments to investigate the key components of our approach.

Table 4: GFLOPs for different masking ratios $r_m$. GFLOPs decrease significantly as the masking ratio increases.
$r_m$ | 0.00 | 0.25 | 0.50 | 0.75 | 0.95
GFLOPs | 12.005 | 8.914 | 5.911 | 2.997 | 0.684
Figure 5: Accuracy of EFTViT with varying masking ratio on CIFAR-100 at $\beta=0.1$. EFTViT is shown to work over a wide range of masking ratios: the testing accuracy fluctuates only marginally and remains above 90% as the masking ratio increases from 0 to 75%. However, performance deteriorates noticeably when the masking ratio exceeds 75%.

Effect of Masking Ratio. The masking ratio $r_m$ determines the number of masked image patches. A larger $r_m$ reduces the amount of input data, thus lowering the computational requirements of model training. Table 4 provides the GFLOPs for various masking ratios, demonstrating that increasing the masking ratio significantly reduces GFLOPs. However, increasing the masking ratio also affects overall performance. We therefore evaluate the effect of different masking ratios on EFTViT. Figure 5 shows the results of EFTViT with varying masking ratios on CIFAR-100 at $\beta=0.1$. The results indicate that EFTViT supports a wide range of masking ratios: when the masking ratio increases from 0% to 75%, the accuracy remains above 90%. However, performance drops sharply when the masking ratio exceeds 75%. Therefore, we select a masking ratio of 75% to strike a balance between accuracy and computational efficiency.

Effect of Layer Number $M$ in Local Module. The layer number $M$ determines how the trainable parameters are divided between clients and the server, affecting the computational load on clients and the final performance. Table 5 presents the number of trainable parameters (# Params) on each client and the corresponding accuracy for different values of $M$. The results show that $M$ has minimal impact on testing accuracy, showcasing the robustness of EFTViT w.r.t. client resources. Given the higher computational cost of a large $M$ on clients and the slight accuracy decrease, we select $M=2$ as the default setting.

Effect of Sampling Threshold. As elaborated in Section 3.4, the sampling threshold determines the number of balanced patch features to upload for server training. Therefore, a higher threshold increases the training cost on the server. We investigate the impact of utilizing median or higher sampling thresholds in EFTViT, as shown in Figure 6. Results indicate that increasing the threshold provides minimal performance improvements. To enhance the computational efficiency on the server, we select the median as the threshold in our method.

Table 5: Accuracy and number of trainable parameters (# Params) on each client for different layer numbers $M$. Results demonstrate that our EFTViT has superior robustness w.r.t. client resources.
$M$ | # Params | CIFAR-10 Accuracy (%) | CIFAR-100 Accuracy (%) | UC Merced Land-Use Accuracy (%)
2 | 14.23M | 98.35 | 90.81 | 98.80
4 | 27.82M | 97.56 | 90.14 | 97.85
6 | 41.34M | 97.98 | 89.37 | 97.38
Figure 6: Accuracy of our approach with different sampling thresholds for balancing patch features (Section 3.4) on three datasets at $\beta=0.1$. Results suggest that increasing the threshold does not significantly enhance performance.

5 Conclusion

In this work, we propose EFTViT, a hierarchical federated framework designed for efficient training on resource-constrained edge devices and for handling heterogeneous data effectively. EFTViT reduces client computation by leveraging masked images with an appropriate masking ratio, exploiting the redundancy in image information to significantly lower computational overhead with minimal performance degradation. The masked images also help prevent data content leakage through the uploaded local features. Additionally, the hierarchical training strategy, which splits parameter training between the clients and the server, achieves full-parameter optimization and improves performance on heterogeneous data across multiple clients. Finally, EFTViT incorporates a median sampling strategy to protect user data distributions, ensuring privacy while maintaining robust performance. Extensive experiments on three benchmarks demonstrate that EFTViT significantly improves classification accuracy and reduces client training computational costs and time by large margins.

References

  • Alam et al. [2022] Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. Fedrolex: Model-heterogeneous federated learning with rolling sub-model extraction. Advances in Neural Information Processing Systems, 35:29677–29690, 2022.
  • Asad et al. [2020] Muhammad Asad, Ahmed Moustafa, and Takayuki Ito. Fedopt: Towards communication efficiency and privacy preservation in federated learning. Applied Sciences, 10(8):2864, 2020.
  • Chen et al. [2022] Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. Dearkd: data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12052–12062, 2022.
  • Cho et al. [2022] Yae Jee Cho, Andre Manoel, Gauri Joshi, Robert Sim, and Dimitrios Dimitriadis. Heterogeneous ensemble knowledge transfer for training large models in federated learning. arXiv preprint arXiv:2204.12703, 2022.
  • Choudhury et al. [2020] Olivia Choudhury, Aris Gkoulalas-Divanis, Theodoros Salonidis, Issa Sylla, Yoonyoung Park, Grace Hsu, and Amar Das. Anonymizing data for privacy-preserving federated learning. arXiv preprint arXiv:2002.09096, 2020.
  • Dai et al. [2021] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1601–1610, 2021.
  • Diao et al. [2020] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2020.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Dwork [2008] Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pages 1–19. Springer, 2008.
  • Han et al. [2024] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
  • Hanzely and Richtárik [2020] Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020.
  • He et al. [2020] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33:14068–14080, 2020.
  • He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Itahara et al. [2021] Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data. IEEE Transactions on Mobile Computing, 22(1):191–205, 2021.
  • Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
  • Karimireddy et al. [2020] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132–5143. PMLR, 2020.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Li and Wang [2019] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
  • Li et al. [2021a] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10713–10722, 2021a.
  • Li et al. [2021b] Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 35(4):3347–3366, 2021b.
  • Li et al. [2020] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
  • Lin et al. [2020] Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Nguyen et al. [2021] Dinh C Nguyen, Ming Ding, Pubudu N Pathirana, Aruna Seneviratne, Jun Li, and H Vincent Poor. Federated learning for internet of things: A comprehensive survey. IEEE Communications Surveys & Tutorials, 23(3):1622–1658, 2021.
  • Ridnik et al. [2021] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
  • Shi et al. [2023] Yifan Shi, Yingqi Liu, Kang Wei, Li Shen, Xueqian Wang, and Dacheng Tao. Make landscape flatter in differentially private federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24552–24562, 2023.
  • Steiner et al. [2021] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
  • Sun et al. [2022] Guangyu Sun, Matias Mendieta, Taojiannan Yang, and Chen Chen. Conquering the communication constraints to enable large pre-trained models in federated learning. arXiv preprint arXiv:2210.01708, 2022.
  • Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  • Wu et al. [2024] Meihan Wu, Li Li, Tao Chang, Peng Qiao, Cui Miao, Jie Zhou, Jingnan Wang, and Xiaodong Wang. Fedekt: Ensemble knowledge transfer for model-heterogeneous federated learning. In 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), pages 1–10. IEEE, 2024.
  • Xie et al. [2023] Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14475–14485, 2023.
  • Wang et al. [2023] Xin’ao Wang, Huan Li, Ke Chen, and Lidan Shou. Fedbfpt: An efficient federated learning framework for bert further pre-training. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 4344–4352, 2023.
  • Yang and Newsam [2010] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, pages 270–279, 2010.
  • Yin et al. [2021] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. See through gradients: Image batch recovery via gradinversion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16337–16346, 2021.
  • Zaken et al. [2021] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
  • Zhang et al. [2023a] Jie Zhang, Song Guo, Jingcai Guo, Deze Zeng, Jingren Zhou, and Albert Zomaya. Towards data-independent knowledge transfer in model-heterogeneous federated learning. IEEE Transactions on Computers, 2023a.
  • Zhang et al. [2023b] Zhuo Zhang, Yuanhang Yang, Yong Dai, Qifan Wang, Yue Yu, Lizhen Qu, and Zenglin Xu. Fedpetuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Annual Meeting of the Association of Computational Linguistics 2023, pages 9963–9977. Association for Computational Linguistics (ACL), 2023b.
  • Zhao et al. [2020] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. idlg: Improved deep leakage from gradients. arXiv preprint arXiv:2001.02610, 2020.
  • Zheng et al. [2021] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021.
  • Zhu et al. [2019] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. Advances in neural information processing systems, 32, 2019.