
On the Federated Learning Framework for Cooperative Perception

Zhenrong Zhang1, Jianan Liu1, Xi Zhou, Tao Huang, Qing-Long Han, Jingxin Liu, and Hongbin Liu2

1 Both authors contributed equally to this work and are co-first authors. 2 Corresponding author.
The work of Z. Zhang, J.X. Liu, and H. Liu was jointly supported by the National Natural Science Foundation of China (62201474), the Suzhou Science and Technology Development Planning Programme (Grant No. ZXL2023171), and the XJTLU Research Development Fund (RDF-22-01-129, RDF-21-02-084).
Z. Zhang, H. Liu, and J.X. Liu are with the School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, Suzhou, P.R. China. Email: [email protected], {hongbin.liu, jingxin.liu}@xjtlu.edu.cn.
J. Liu is with Momoni AI, Gothenburg, Sweden. Email: [email protected].
X. Zhou and T. Huang are with the College of Science and Engineering, James Cook University, Cairns, QLD 4870, Australia. Email: [email protected], [email protected].
Q.-L. Han is with the School of Science, Computing and Engineering Technologies, Swinburne University of Technology, Melbourne, VIC 3122, Australia. Email: [email protected].
Abstract

Cooperative perception (CP) is essential for enhancing the efficiency and safety of future transportation systems, and it requires extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, complemented by a dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Using the BEV transformer as the primary model, rigorous testing on the FedBEVT dataset, which extends the OpenV2V dataset, demonstrates significant improvements in average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.

Index Terms:
Cooperative intelligent transportation system, cooperative perception, autonomous driving, federated learning, bird’s-eye-view segmentation.

I Introduction

Environmental perception is crucial for advanced driver assistance systems (ADAS) and autonomous driving, encompassing tasks like instance segmentation [2, 3, 1, 4], traffic scene understanding [5], 2D object detection [6], 3D object detection [7, 8, 9, 10], and multi-object tracking [11, 12, 13, 14]. These technologies also play a critical role in enhancing cooperative intelligent transportation systems (C-ITS), significantly improving transportation safety and efficiency [15, 16, 17]. The rise of C-ITS has drawn considerable interest from both academia and industry, highlighting a collaborative framework that integrates various ITS components—personal, vehicle, roadside, and central systems—to surpass traditional ITS models in service quality. Connected automated vehicles (CAVs) exchange information through vehicle-to-everything (V2X) networks, including vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), vehicle-to-network (V2N), and infrastructure-to-network (I2N).

A major challenge within C-ITS is the security and privacy of shared data. Federated learning offers a robust solution by enabling decentralized model training without compromising data privacy, as has already been demonstrated in many applications such as wireless communications [18], Industry 4.0 [19], and robotics [20]. In C-ITS, this approach also allows CAVs to collaboratively participate in environment perception, traffic flow prediction, and decision-making, improving deep neural network (DNN) models while preserving privacy by avoiding the exchange of raw sensor data or latent features between CAVs [21, 22]. Additionally, V2X communication supports seamless data exchange between vehicles, roadside units (RSUs), and cloud systems, enhancing the effectiveness of federated learning in C-ITS [23, 24, 25]. Supported by RSUs, digital twins can produce accurate labels for learning algorithms, thereby increasing the precision and performance of the systems.

Despite its numerous advantages, the federated learning paradigm encounters a significant challenge in data heterogeneity when implemented across diverse clients. This issue is particularly pronounced in C-ITS perception, also known as cooperative perception (CP), due to the varied sensor configurations among participating vehicles and infrastructure. Such variations include discrepancies in camera counts, sensor placements, and non-independently and identically distributed (Non-IID) data among traffic agents, leading to marked differences in data characteristics [26].

In a CP setting, handling data heterogeneity and distribution shifts is crucial, as these challenges can significantly impact the performance and reliability of the system. On one hand, some clients may carry the same type of sensors, yet their configurations, such as mounting positions, angles, or calibration, may vary. On the other hand, some clients may carry different types of sensors altogether, leading to different data characteristics. Specifically, the dataset supplied by FedBEVT [27] incorporates data from three distinct client types: cars, buses, and trucks. Each vehicle type has sensors installed at different positions and angles, and the calibration among these sensors also differs, resulting in data heterogeneity. Moreover, environmental variations further induce data distribution shifts; for example, lighting conditions can change significantly across successive time frames within a single scene, as can be clearly observed in the dataset supplied by FedBEVT [27]. These are significant challenges for traditional federated learning methods, which may struggle to effectively synchronize the contributions of local models towards the convergence of the global model. This paper explores strategies to harness the benefits of external data while mitigating the impacts of data heterogeneity within a federated learning framework, with the aim of significantly improving local models’ performance.

Among the critical tasks essential for CP, bird’s-eye-view (BEV) perception is emphasized due to its key role in achieving comprehensive environmental awareness. To address the challenge of data heterogeneity, we introduce a novel federated learning framework specifically designed for BEV perception. Specifically, we propose an advanced aggregation methodology within our federated learning framework to confront the varied data distributions among clients and their disparate impacts on global model convergence. This aggregation mechanism is designed to handle the heterogeneity intrinsic to diverse agents, thereby enhancing training efficiency and model efficacy. Additionally, we introduce a novel loss function designed to mitigate unstable convergence, steering the model toward alignment with the global optimum.

To validate our approach with a fair comparison, we conducted extensive experiments using the same BEV semantic segmentation network model proposed in [27], which consists of five main components: an encoder for image feature extraction; positional embedding to capture camera geometry; a cross-attention module that transitions front views to BEV; convolution layers within transformers for feature refinement; and a decoder to convert BEV representations into actionable predictions for BEV semantic segmentation. The results demonstrate the superiority of our framework, showing improved test accuracy and reduced communication overhead across most clients, thus confirming the effectiveness of the proposed federated learning framework in addressing these complex challenges.

Our contributions are summarized as follows:

  • A federated learning framework tailored for the CP task with a novel federated dynamic weighted aggregation (FedDWA) algorithm is introduced. This framework addresses data heterogeneity across clients by employing a dynamic constant mechanism, which adjusts the balance between the global and local models, narrowing the performance gap and fostering a cohesive learning environment.

  • To further enhance training efficacy, a dynamic adjusting loss (DALoss) function is proposed for the CP task under the federated learning framework. This function is tailored to the varying data distributions among clients and the central server, fine-tuning the model’s convergence direction based on real-time data distribution insights, thus ensuring improved convergence and significantly boosting the accuracy of BEV perception outcomes.

II Related Work

II-A Cooperative Perception in Autonomous Driving

Accurate environmental perception is essential for autonomous driving systems, as it directly affects the efficiency of decision-making and overall safety. Traditional single-vehicle perception systems, even when equipped with the recent BEV perception paradigm, which extracts BEV-space features from inputs in perspective view or 3D space and supports perception tasks such as 3D bounding box detection and BEV segmentation [29, 28, 30], still often struggle with occlusions and distant objects because of their limited sensing capabilities. CP, enabled by V2X communications, marks a significant improvement, allowing CAVs to enhance perception through extensive wireless information sharing, encompassing V2V, V2I, and V2N communications. OpenCDA, introduced by [32], provides a versatile framework for developing and testing cooperative driving automation (CDA) systems within a simulated environment. Complementarily, the work [33] presented a large-scale simulated dataset for V2V perception, enriched with a diverse collection of scenes and annotated data. This dataset proves critical for advancing CP systems, as supported by [15], which underscores the expanded sensing capabilities facilitated by V2X. Further developments include the CoBEVT framework introduced in [31], a sparse vision transformer tailored for autonomous driving that leverages multi-agent multi-camera sensors, enabling collaborative perception across multiple vehicles. The work [34] discusses the integral role of advanced algorithms in optimizing CP systems, while the works [35] and [36] explore the potential of integrating infrastructure-based sensors to bolster CP. Notable recent advancements in perception frameworks include V2X-ViT [37], a novel vision Transformer designed to effectively combine information from on-road agents; the work of Xu et al. [38], who created the first extensive real-world multimodal dataset for V2V perception, significantly enhancing data availability for CP research; and HM-ViT [39] by Xiang et al., an innovative multi-agent hetero-modal cooperative perception framework that can accurately predict 3D objects in dynamic V2V scenarios. Collectively, these advancements represent a major leap towards more comprehensive and precise environmental perception in autonomous driving, addressing critical challenges and enhancing driving safety.

II-B Federated Learning

Federated learning, initially proposed as FedAvg by [40], marks a significant shift in machine learning by enabling distributed model training across multiple systems without accessing users’ raw data, thus enhancing data privacy. A primary challenge in federated learning is the management of data heterogeneity and distribution shifts, which complicate efficient model optimization. The work [41] introduces a method to mitigate client heterogeneity by incorporating proximal terms, ensuring that local model updates align with the global model. Additionally, the work [42] utilizes knowledge distillation techniques to aggregate locally computed logits, allowing for the synthesis of cohesive global models without requiring uniform local model architectures. Further addressing this heterogeneity, the works [43, 41] propose algorithms that adjust the training objective to tackle these challenges effectively. The work [44] introduced SCAFFOLD, which employs control variables for variance reduction to correct ‘client-drift’ in local updates. Moreover, the work [45] presents FedSAM, an algorithm that integrates sharpness aware minimization (SAM) as a local optimizer with a momentum-based federated learning algorithm to enhance synergy between local and global models.

II-C Federated Learning with Cooperative Perception

In CP involving CAVs and RSUs, the exchange of sensor data raises significant privacy concerns. This data, sourced from vehicles and smart city infrastructure, can reveal sensitive information about individuals, such as location and personal habits. To address this, the work [46] applied federated learning to object detection tasks in AVs, capitalizing on the method’s ability to train models without compromising data privacy. However, the broader application of federated learning in CP was initially underexplored. Recent studies such as [21], [22] and [47] discuss the application of federated learning in CP settings. The work [48] introduces an innovative federated learning framework, H2-Fed, designed to handle hierarchical heterogeneity, significantly improving conventional pre-trained deep learning models for CP. Furthermore, the work [49] proposes a new client selection pipeline using V2X communications, optimizing client selection based on predicted communication latency in real CP scenarios. These developments underscore the potential of federated learning for collaborative model training in privacy-sensitive environments. Despite these advances, implementing CP through federated learning presents challenges, primarily due to heterogeneity in data distribution, computational resources, and network connectivity among devices. This variability can significantly affect learning efficiency and effectiveness. To address these challenges, the work [50] introduces a fair and efficient algorithm to manage the imbalanced distribution of data and fluctuating channel conditions. Moreover, the work [27] proposed FedBEVT, a federated transformer learning approach that focuses on BEV perception to tackle common data heterogeneity issues in CP.

III Methodology

III-A Problem Formulation

In this paper, we aim to establish a federated learning framework, $FL$, consisting of $M$ clients, each identified by a unique label $m \in \{1, 2, ..., M\}$, for the BEV semantic segmentation task under the V2X CP setting. The $m$-th client holds a set of data samples denoted by $\mathbf{N}_m = \{(\mathbf{N}_{m,i}, \mathbf{Y}_{m,i})\}_{i \in \{1, 2, ..., |\mathbf{N}_m|\}}$, where $|\cdot|$ denotes the number of elements in a set, $\mathbf{N}_{m,i}$ is the set of images in the $i$-th data sample captured from a multi-view camera system, and $\mathbf{Y}_{m,i}$ is the set of corresponding BEV semantic masks. Within this framework, each client $m$ trains a local model $G_m$ independently using the calculated gradient, by minimizing a primary optimization objective loss function $L_{m,i}$ for each data sample in its dataset, and then transmits the model updates to a central server. This loss function is thus crucial for assessing and minimizing the error throughout the distributed network of clients. The server integrates these updates to improve the global model. This cycle of local training and central updating repeats across several communication rounds to refine the global model optimally.

To demonstrate the effectiveness of our proposed federated learning framework, we follow FedBEVT [27] and employ the same transformer-based BEV segmentation network as the network model, $G_m$ on each client and $G_g$ on the server, respectively. The network model comprises an image feature encoder utilizing a CNN, a BEV transformer with a multi-layer attention structure (incorporating both sparse cross-view and self-attention mechanisms), and a BEV decoder. This configuration effectively encodes images from multiple cameras and transforms them into BEV features, which are then decoded to produce semantic segmentation outputs in the BEV perspective.
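For concreteness, the overall structure of this model can be sketched as a simple composition of the three stages. The following PyTorch-style skeleton is illustrative only; the module names, shapes, and interfaces are our assumptions rather than the implementation of [27].

import torch
import torch.nn as nn

class BEVSegmentationNet(nn.Module):
    """Structural sketch: per-camera CNN encoder -> BEV transformer -> BEV decoder."""

    def __init__(self, encoder: nn.Module, bev_transformer: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder                  # CNN image feature extractor
        self.bev_transformer = bev_transformer  # sparse cross-view + self-attention
        self.decoder = decoder                  # maps BEV features to semantic masks

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, num_cameras, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.encoder(images.flatten(0, 1))  # (b*n, C', H', W') per-view features
        feats = feats.unflatten(0, (b, n))          # regroup per camera: (b, n, C', H', W')
        bev = self.bev_transformer(feats)           # (b, C_bev, H_bev, W_bev) BEV features
        return self.decoder(bev)                    # (b, num_classes, H_bev, W_bev) masks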

Our framework introduces a novel aggregation algorithm and a customized loss function designed to tackle two primary challenges in the CP setting: (i) the heterogeneity of data and distribution shifts across different clients, and (ii) the varying contributions of different clients to the convergence of the global model. These enhancements are vital to ensure robust and precise model performance in autonomous driving environments.

III-B Federated Dynamic Weighted Aggregation

To address the specific challenges of applying federated learning in CP, particularly client drift due to diverse sensor types and Non-IID data distributions, we introduce the federated dynamic weighted aggregation (FedDWA) algorithm for our federated learning framework $FL$. This algorithm updates the client model as described by:

$$G_m = G_m^- - \frac{1}{|\mathbf{N}_m|} U_m, \qquad (1)$$

$$U_m = \eta_m \Big( \sum_{i}^{|\mathbf{N}_m|} \partial_i(G_m^-) + c_g^- - c_m^- \Big), \qquad (2)$$
Algorithm 1: FedDWA

Input:
  Number of communication rounds R and data samples N_m
  Server input: global model G_g^- and global control variable c_g^- from the previous round, global learning rate η_g
  m-th client input: local control variable c_m^- from the previous round and local step-size η_m

for each round r = 1, ..., R do
    send G_g^-, c_g^- to all clients m ∈ {1, 2, ..., M}
    for each client m ∈ {1, 2, ..., M} in parallel do
        initialize the local model on the m-th client: G_m ← G_g^-
        initialize U_m = 0, W_m = 0, O_m = 0
        for each data sample N_{m,i} do
            // Local model training
            calculate the mini-batch gradient ∂_i(G_m)
            U_m += ∂_i(G_m^-)
            W_m += ∂_i(G_m)
            O_m += D_KL(P_g^-(N_{m,i}), P_m(N_{m,i}))
        end for
        U_m = η_m (U_m + c_g^- - c_m^-)
        G_m = G_m^- - (1 / |N_m|) U_m
        // Local control variable update
        c_m = (1 / |N_m|) W_m
        T_m = (1 / |N_m|) c_m · O_m
        send G_m, T_m to the global server
    end for
    // Global control variable update
    c_g = c_g^- + (1 / M) Σ_{m=1}^{M} T_m
    // Global model update
    G_g = G_g^- + (η_g / M) Σ_{m=1}^{M} (G_g^- - G_m)
end for

where $G_m$ represents (the weights of) the local model of the $m$-th client, $\eta_m$ is the local step-size, and $\partial_i(G_m^-)$ denotes the gradient of the $i$-th data sample for the local model in the previous communication round.

Each client uses a local control variable $c_m$ to account for variances in model updates, aligning local objectives with the global model to mitigate client-specific drift; $c_m^-$ denotes the local control variable from the previous communication round. Both $c_m$ and $c_g$ are initialized as 0.

Inspired by [44], we maintain a global control variable $c_g$ on the server, aggregating information from all clients to guide the overall direction of model updates. Diverging from [44], which uses the average of local variables to update $c_g$, we utilize the Kullback-Leibler divergence (KLD) [51] to dynamically weigh the contributions of different clients:

$$c_g = c_g^- + \frac{1}{M}\sum_{m=1}^{M} T_m, \qquad (3)$$

where $T_m$ is an intermediate variable given by:

$$T_m = \frac{1}{|\mathbf{N}_m|} c_m \cdot O_m, \qquad (4)$$

and

$$O_m = \sum_{i}^{|\mathbf{N}_m|} \mathcal{D}_{\mathrm{KL}}\left(P_g^-(\mathbf{N}_{m,i}), P_m(\mathbf{N}_{m,i})\right). \qquad (5)$$

The KLD is calculated between the predicted data distribution of the global model from the previous communication round, $P_g^-(\mathbf{N}_{m,i})$, and that of the $m$-th local client, $P_m(\mathbf{N}_{m,i})$. Here, $P_g^-(\mathbf{N}_{m,i})$ is approximated using the segmentation output of the global model from the previous communication round, $G_g^-$, and $P_m(\mathbf{N}_{m,i})$ is approximated using the segmentation output of each local model $G_m$. To transform a segmentation output into a distribution, a log softmax function is applied to convert it into log probabilities:

$$P_m(\mathbf{N}_{m,i}) = \begin{bmatrix} \log(\sigma(V_{1,1})) & \dots & \log(\sigma(V_{1,k})) \\ \dots & \dots & \dots \\ \log(\sigma(V_{j,1})) & \dots & \log(\sigma(V_{j,k})) \end{bmatrix}, \qquad (6)$$

where $V_{j,k}$ denotes the $(j,k)$-th pixel value in the segmentation output matrix $G_m(\mathbf{N}_{m,i})$ and $\sigma$ represents the softmax function. The same calculation is applied for $P_g^-(\mathbf{N}_{m,i})$.
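As a concrete illustration of Eqs. (5) and (6), the per-sample KLD can be computed from the two segmentation outputs as in the following PyTorch sketch; the function name and signature are assumptions of ours, and both models are assumed to output per-pixel class logits of identical shape.

import torch
import torch.nn.functional as F

@torch.no_grad()  # used here only to weigh aggregation, so no gradients are needed
def sample_kld(global_model, local_model, images):
    """Approximate D_KL(P_g^-(N_{m,i}) || P_m(N_{m,i})) for one data sample.

    Both models map multi-camera images to per-pixel class logits of shape
    (batch, num_classes, H_bev, W_bev); log-softmax over the class dimension
    yields the log-probability maps of Eq. (6).
    """
    log_pg = F.log_softmax(global_model(images), dim=1)  # previous-round global model
    log_pm = F.log_softmax(local_model(images), dim=1)   # current local model
    # F.kl_div(input, target) computes KL(target || input) when given log inputs
    return F.kl_div(log_pm, log_pg, log_target=True, reduction="batchmean")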

Each client uses gradients to update the local control variable $c_m$:

$$c_m = \frac{1}{|\mathbf{N}_m|} W_m, \qquad (7)$$

where

$$W_m = \sum_{i}^{|\mathbf{N}_m|} \partial_i(G_m). \qquad (8)$$

The global model is updated as outlined below:

$$G_g = G_g^- + \frac{\eta_g}{M}\sum_{m=1}^{M}(G_g^- - G_m), \qquad (9)$$

where $\eta_g$, the global learning rate, plays a crucial role in balancing the updates, and $G_g$ represents the global model. This equation captures the aggregation of differences between the global and local models, promoting the convergence of the global model. To provide a comprehensive understanding of FedDWA, the pseudocode is presented in Algorithm 1.
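To make the round structure concrete, the following is a minimal sketch of one FedDWA communication round, operating on flattened parameter vectors for brevity. The helper names (feddwa_round, client.data, client.c_m, client.eta_m, loss_fn, kld_fn) and the flat-vector bookkeeping are our assumptions rather than the authors' implementation, and all model parameters are assumed trainable.

import copy
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def feddwa_round(global_model, c_g, clients, eta_g, loss_fn, kld_fn):
    """One FedDWA communication round (sketch of Algorithm 1).

    c_g and each client's c_m are flat vectors (initialized to zeros) with the
    same length as the flattened model parameters.  Each client exposes `.data`
    (list of (images, masks) samples), `.c_m`, and `.eta_m`; `loss_fn` computes
    the per-sample training loss and `kld_fn` the KLD term of Eq. (5).
    """
    g_vec = parameters_to_vector(global_model.parameters()).detach()
    local_models, T_terms = [], []

    for client in clients:
        local_model = copy.deepcopy(global_model)             # G_m <- G_g^-
        U = torch.zeros_like(g_vec)                            # gradient accumulator, Eq. (2)
        W = torch.zeros_like(g_vec)                            # gradient accumulator, Eq. (8)
        O = 0.0                                                # KLD accumulator, Eq. (5)
        for images, masks in client.data:
            local_model.zero_grad()
            loss_fn(local_model, images, masks).backward()     # per-sample gradient d_i(G_m)
            grad = parameters_to_vector(
                [p.grad for p in local_model.parameters()]).detach()
            U += grad
            W += grad
            O += float(kld_fn(global_model, local_model, images))
        n = len(client.data)
        U = client.eta_m * (U + c_g - client.c_m)              # Eq. (2)
        vector_to_parameters(
            parameters_to_vector(local_model.parameters()).detach() - U / n,
            local_model.parameters())                          # local update, Eq. (1)
        client.c_m = W / n                                     # local control variable, Eq. (7)
        T_terms.append(client.c_m * O / n)                     # intermediate variable, Eq. (4)
        local_models.append(local_model)

    c_g = c_g + torch.stack(T_terms).mean(dim=0)               # global control variable, Eq. (3)
    deltas = torch.stack(
        [g_vec - parameters_to_vector(m.parameters()).detach() for m in local_models])
    vector_to_parameters(g_vec + eta_g * deltas.mean(dim=0),
                         global_model.parameters())            # global model update, Eq. (9)
    return global_model, c_g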

III-C Dynamic Adjusting Loss Function

As discussed above, it is critical to define a suitable loss function $L_{m,i}$ to calculate the gradient of the $i$-th data sample for each client’s local model, $\partial_i(G_m)$. The cross-entropy loss function is widely used within federated learning frameworks to train individual clients, and it effectively gauges the performance of classification models that output probabilities between 0 and 1. Nevertheless, the inherent data heterogeneity across clients, particularly with Non-IID data, complicates convergence to a global optimum.

To address this issue, we introduce an enhanced local training loss function termed dynamic adjusting loss (DALoss):

$$L_{m,i} = \mathcal{S}\left(G_m(\mathbf{N}_{m,i}), \mathbf{Y}_{m,i}\right) + Q_{m,i}, \qquad (10)$$

where $\mathcal{S}\left(G_m(\mathbf{N}_{m,i}), \mathbf{Y}_{m,i}\right)$ denotes the standard cross-entropy loss for the $i$-th data sample of client model $G_m$ against the true labels $\mathbf{Y}_{m,i}$, and $Q_{m,i}$ is the dynamic control term defined by:

$$Q_{m,i} = C \cdot \mathcal{D}_{\mathrm{KL}}\left(P_g^-(\mathbf{N}_{m,i}), P_m(\mathbf{N}_{m,i})\right) \cdot ||G_g^- - G_m||_2^2, \qquad (11)$$

where $C \cdot \mathcal{D}_{\mathrm{KL}}\left(P_g^-(\mathbf{N}_{m,i}), P_m(\mathbf{N}_{m,i})\right)$ scales the quadratic penalty based on the KLD, measuring the discrepancy between the probabilistic distributions of the global and local models from the previous communication round. $C$ is a regularization constant, and $||\cdot||_2$ denotes the $L^2$ norm. This divergence quantifies the differences, ensuring that local updates contribute significantly toward global model convergence.

This modification incorporates a dynamic adjustable regularization mechanism to better align each client’s model updates with the global model. Such dynamic adjustment of the regularization parameter allows our approach to adapt to the evolving nature of the training process and the unique characteristics of each client.

By integrating KLD into the loss function, the learning process is encouraged to minimize deviations between data distribution in different clients, effectively aligning the local models closer to the global model. Specifically, the KLD is employed as a weighting factor in the L2-norm calculation between the global and local models, as shown in Eq. (11). A large KLD indicates a significant discrepancy in the distributions, resulting in an increased weight. This increment in weight amplifies the loss, consequently accelerating the convergence of the local model towards the global model. By minimizing this divergence between local updates and the global model, the learning process adaptively adjusts to the unique data distributions present in each client. Therefore, our approach ensures that local updates contribute positively towards a more generalized global model, mitigating the impact of data distribution shifts and data heterogeneity.

While the design of the loss function allows for larger updates when there are significant discrepancies between the local and global models, it is crucial to manage the potential instability that such aggressive updates might introduce. To avoid this instability, we use a regularization constant $C$, which penalizes large changes in model weights and provides a counterbalance to aggressive updates. We also use adaptive learning rates that adjust the step size based on the training dynamics: the learning rate is reduced after several aggregation steps, thus tempering the update magnitude.
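For illustration, a minimal sketch of DALoss under these design choices is given below; treating the KLD as a detached weighting factor and the exact reduction over pixels are assumptions of ours, while C = 0.1 follows the setting reported in Section IV.

import torch
import torch.nn.functional as F
from torch.nn.utils import parameters_to_vector

def da_loss(local_model, global_model, images, masks, C=0.1):
    """Dynamic adjusting loss L_{m,i} of Eq. (10): cross-entropy plus the
    KLD-weighted quadratic penalty Q_{m,i} of Eq. (11).
    `masks` holds per-pixel ground-truth class indices of shape (B, H, W).
    """
    logits_m = local_model(images)                     # (B, num_classes, H, W)
    ce = F.cross_entropy(logits_m, masks)              # S(G_m(N_{m,i}), Y_{m,i})

    with torch.no_grad():                              # previous-round global model is fixed
        log_pg = F.log_softmax(global_model(images), dim=1)
    log_pm = F.log_softmax(logits_m, dim=1)
    kld = F.kl_div(log_pm, log_pg, log_target=True,
                   reduction="batchmean")              # D_KL(P_g^- || P_m)

    w_g = parameters_to_vector(global_model.parameters()).detach()
    w_m = parameters_to_vector(local_model.parameters())
    penalty = torch.sum((w_g - w_m) ** 2)              # ||G_g^- - G_m||_2^2

    # KLD treated as a (detached) dynamic weight on the L2 penalty -- an assumption
    return ce + C * kld.detach() * penalty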

IV Experiments and Analysis

The experiments were conducted on a computer equipped with a single NVIDIA GeForce RTX 4090 GPU and a 12th Generation Intel Core i9-12900K CPU. The software environment consisted of PyTorch 2.0.1 and CUDA 12.3. Our experiments are implemented on BEV semantic segmentation, a crucial task in autonomous driving technologies, which provides a comprehensive top-down environmental perspective essential for navigation, obstacle avoidance, and path planning in autonomous vehicles.

IV-A Experiment Setup

We conduct our experiments on the dataset provided by FedBEVT [27], which extends the original OpenV2V [33] dataset, previously containing only car agents, by adding two further vehicle types, truck and bus, with different camera sensor installation positions. This enhancement broadens the diversity and volume of the scenarios represented, providing a robust foundation for testing our model.

TABLE I: Comparison between Our Method and Other State-of-the-Art Methods. For each client, Loss, IoU, and the number of communication rounds to reach peak IoU (Com.) are reported; ↑ denotes that higher values are better and ↓ that lower values are better.

Method                      | Client 1: Bus (N1 = 1388)   | Client 2: Truck (N2 = 1488) | Client 3: Car A (N3 = 2140) | Client 4: Car B (N4 = 1384)
                            | Loss↓   IoU↑    Com.↓       | Loss↓   IoU↑    Com.↓       | Loss↓   IoU↑    Com.↓       | Loss↓   IoU↑    Com.↓
Local Training              | 0.0716  5.42%   -           | 0.1740  4.16%   -           | 0.0288  13.15%  -           | 0.0298  9.41%   -
FedAvg [40]                 | 0.2202  5.89%   105         | 0.0483  5.89%   180         | 0.0518  14.09%  175         | 0.0899  14.61%  175
FedRep [52]                 | 0.0170  7.45%   320         | 0.1319  7.03%   105         | 0.0352  18.94%  345         | 0.0394  15.41%  330
FedTP [53]                  | 0.0765  6.73%   175         | 0.1015  5.42%   140         | 0.0183  17.29%  335         | 0.0389  15.56%  300
FedBEVT [27]                | 0.0061  10.32%  340         | 0.1294  7.33%   65          | 0.0162  18.40%  360         | 0.0268  16.16%  370
FedDWA with DALoss (Ours)   | 0.0062  10.72%  235         | 0.0094  15.91%  105         | 0.0157  21.30%  484         | 0.0198  19.35%  100

Following the protocol established in [27], our study orchestrates a collaborative framework among four distinct industrial clients to assess the efficacy of our proposed model under various operational conditions. Specifically, the dataset comprises:

  • Client 1 Bus: 1,398 frames in the training set and 413 frames in the testing set.

  • Client 2 Truck: 1,459 frames in the training set and 356 frames in the testing set.

  • Client 3 Car A: 11,636 frames in the training set and 3,546 frames in the testing set.

  • Client 4 Car B: 7,142 frames in the training set and 10,191 frames in the testing set.

For a fair comparison between our federated learning strategy and the state-of-the-art approach FedBEVT [27], our model architecture adheres to the configuration described in [27]. This involves processing the camera images observed by clients. As in [27], we begin by sending these images through a 3-layer ResNet34 encoder, which is proficient at extracting and encoding image features across various spatial dimensions. This encoding process yields feature tensors at differentiated spatial resolutions: 64×64×128, 32×32×256, and 16×16×512.
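These resolutions can be reproduced, for example, by tapping three stages of a torchvision ResNet34 on a 512×512 input, as in the following illustrative sketch; the wrapper and the assumed input resolution are ours and do not reproduce the exact encoder of [27].

import torch
from torchvision.models import resnet34

class MultiScaleEncoder(torch.nn.Module):
    """ResNet34 backbone tapped at three stages.  For a 512x512 input, the
    tapped feature maps are 64x64x128, 32x32x256, and 16x16x512 (HxWxC)."""

    def __init__(self):
        super().__init__()
        net = resnet34(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.layer1(self.stem(x))   # stride-4 features
        f1 = self.layer2(x)             # (B, 128, H/8,  W/8)
        f2 = self.layer3(f1)            # (B, 256, H/16, W/16)
        f3 = self.layer4(f2)            # (B, 512, H/32, W/32)
        return f1, f2, f3

# Example: torch.randn(1, 3, 512, 512) yields maps of 64x64, 32x32, and 16x16.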

Figure 1: Test results for the four distinct clients: (a) bus client, (b) truck client, (c) car client A, and (d) car client B. The x-axis represents the number of communication rounds, while the y-axis indicates the IoU values.

Subsequently, we employ a focused attention cross (FAX) transformer operation, which is attention-based and manages interactions between BEV embeddings and the encoded image features, conceptualized as query, key, and value within this framework. Finally, a 3-layer bilinear module is used to transform these enriched BEV features. In our model configuration, the hyper-parameter $C$ defined in Eq. (11) is set to 0.1. For the other hyper-parameters, we adopt the configurations detailed in [27]. We selected these settings due to their proven effectiveness in similar contexts, ensuring that our model benefits from established best practices while maintaining consistency with the state of the art for a fair comparison.

IV-B Performance Analysis

In our comparative study, we evaluate the performance of our proposed algorithm, FedDWA, against several established methods, including FedAvg [40], FedRep [52], FedTP [53], and FedBEVT [27]. The evaluation focuses on the average intersection over union (IoU) achieved by the models for the four clients and the number of communication rounds required to attain optimal IoU values.

The pixel-wise IoU metric is a common evaluation metric for BEV semantic segmentation. It quantifies the overlap between the predicted segmentation and the ground truth, providing a measure of accuracy at the pixel level. The IoU is calculated as follows:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad (12)$$

where TP is the number of true positive pixels, FP is the number of false positive pixels, and FN is the number of false negative pixels.
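For reference, a minimal sketch of this computation for a single foreground class is shown below; the function name and the class-index convention are assumptions of ours.

import torch

def pixel_iou(pred_logits: torch.Tensor, target: torch.Tensor, cls: int = 1) -> float:
    """Pixel-wise IoU of Eq. (12) for one class.

    pred_logits: (B, num_classes, H, W) BEV segmentation output
    target:      (B, H, W) ground-truth class indices
    """
    pred = pred_logits.argmax(dim=1) == cls
    gt = target == cls
    tp = (pred & gt).sum().item()        # true positive pixels
    fp = (pred & ~gt).sum().item()       # false positive pixels
    fn = (~pred & gt).sum().item()       # false negative pixels
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 0.0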

The experimental analysis presented in Table I provides a comprehensive comparison of our federated learning framework against these prominent approaches.

Performance metrics include training loss, IoU, and the number of communication rounds (Com.) required to achieve peak IoU, offering a detailed assessment of the efficacy and efficiency of our approach within a federated learning context.

The key findings for each client scenario are as follows:

  • Client 1 Bus: Our method shows a significant improvement, achieving a 10.72% IoU after 235 communication rounds, demonstrating effective data heterogeneity management and communication efficiency.

  • Client 2 Truck: Our approach excels, reaching a 15.91% IoU, marking a notable improvement over local training and indicating robust adaptability and superior management of client-specific data characteristics.

  • Client 3 Car A: Achieving the highest IoU of 21.30% among all methods, our approach excels in complex vehicular environments, underscoring its potential to optimize the federated learning framework across diverse datasets.

  • Client 4 Car B: This client achieves a 19.35% IoU in just 100 communication rounds, highlighting both the efficacy and efficiency of our approach and establishing it as a leading method in federated learning for BEV perception.

Fig. 1 illustrates the relationship between the number of communication rounds and IoU during the testing phase for the four clients, providing a detailed view of test progression. Notably, the truck client and car client B reached optimal IoU with fewer communication rounds, potentially due to data distributions more closely aligned with the predictions of their local models. However, after reaching peak IoU, there is a noticeable decline in performance, suggesting possible overfitting as the models become excessively tailored to their local datasets, to the detriment of generalization.

IV-C Ablation Experiment

Figure 2: Visualization of model outputs across different clients, with the ego vehicle at the center. Each panel presents comparative results for a different vehicle client: (a) bus, (b) truck, and (c) car. Within each panel, the data is organized into six columns: the first column displays camera imagery from the frontal perspective, the second shows the rear perspective, the third represents the ground truth, the fourth illustrates the output of FedBEVT [27], the fifth shows the result of FedDWA without DALoss, and the final column depicts the result of the model trained with both FedDWA and DALoss, showcasing the effectiveness of the proposed method.
TABLE II: Ablation Experiment for the Effectiveness of DALoss (IoU)

Method                    Client 1   Client 2   Client 3   Client 4
FedBEVT [27]              10.32%     7.33%      18.40%     16.16%
FedBEVT + DALoss          10.49%     11.97%     19.20%     16.45%
FedDWA (ours)             10.66%     14.88%     20.83%     18.28%
FedDWA (ours) + DALoss    10.72%     15.91%     21.30%     19.35%

The ablation study summarized in Table II elucidates the incremental impact of our DALoss and the intrinsic benefits of the proposed FedDWA framework. The addition of DALoss consistently improves performance across all clients. When DALoss is added to FedBEVT [27], there is a noticeable improvement in IoU for Client 2 (from 7.33% to 11.97%) and modest gains across the other clients, indicating that DALoss facilitates better generalization. Our proposed FedDWA framework, even without DALoss, outperforms FedBEVT. For example, FedDWA achieves an IoU of 20.83% for Client 3 and 18.28% for Client 4, significant improvements over FedBEVT’s 18.40% and 16.16%, respectively. This enhancement underscores the effectiveness of our weighting approach in distributing model updates across different clients in a way that respects their unique data distributions. The combination of FedDWA and DALoss yields the most substantial gains. For instance, Client 2’s IoU increases from 14.88% with FedDWA alone to 15.91% with the addition of DALoss, and similar incremental benefits are observed for the other clients. The highest performance is observed for Client 3, where the IoU reaches 21.30%. Fig. 2 visualizes an example network output across different clients, comparing the ground truth, the current state-of-the-art approach FedBEVT [27], FedDWA alone, and FedDWA with DALoss for the bus, truck, and car clients. The comparison shows that the model incorporating both FedDWA and DALoss achieves the best performance.

V Conclusion

In this paper, we explore the effectiveness of federated learning frameworks for optimizing transformer-based models in BEV perception within road traffic datasets. A primary focus of our investigation is the challenge of data heterogeneity, which significantly impedes the performance of federated learning methodologies. To address this issue, we introduce an innovative algorithm, FedDWA, together with DALoss, specifically designed to mitigate the adverse effects of data diversity. Our comparative analyses with existing federated learning frameworks demonstrate that our proposed solutions significantly enhance the efficacy of the federated learning framework. These improvements are primarily achieved through dynamic adjustments of the model convergence direction and the equitable integration of diverse local models into a unified global model. Our findings underscore the viability and potential of federated learning frameworks in BEV perception, suggesting the approach as a resilient, efficient, and inclusive alternative for model training.

References

  • [1] Y. Sun, W. Zuo, H. Huang, P. Cai, and M. Liu, “Pointmoseg: Sparse tensor-based end-to-end moving-obstacle segmentation in 3-d lidar point clouds for autonomous driving,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 510–517, 2021.
  • [2] W. Xiong, J. Liu, Y. Xia, T. Huang, B. Zhu, and W. Xiang, “Contrastive learning for automotive mmWave radar detection points based instance segmentation,” in Proceedings of the IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), (Macau, China), 2022, pp. 1255–1261.
  • [3] J. Liu, W. Xiong, L. Bai, Y. Xia, T. Huang, W. Ouyang, and B. Zhu, “Deep instance segmentation with automotive radar detection points,” IEEE Transactions on Intelligent Vehicles, vol. 8, pp. 84–94, January 2023.
  • [4] M. Zeller, V. S. Sandhu, B. Mersch, J. Behley, M. Heidingsfeld and C. Stachniss, ”Radar Instance Transformer: Reliable Moving Instance Segmentation in Sparse Radar Point Clouds,” IEEE Transactions on Robotics, vol. 40, pp. 2357-2372, 2024.
  • [5] T. Monninger, J. Schmidt, J. Rupprecht, D. Raba, J. Jordan, D. Frank, S. Staab, and K. Dietmayer, “Scene: Reasoning about traffic scenes using heterogeneous graph neural networks,” IEEE Robotics and Automation Letters, vol. 8, no. 3, pp. 1531–1538, 2023.
  • [6] O. Mazhar, R. Babuška, and J. Kober, “Gem: Glare or gloom, I can still see you – end-to-end multi-modal object detection,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6321–6328, 2021.
  • [7] Y. Liu, Y. Yixuan and M. Liu, ”Ground-Aware Monocular 3D Object Detection for Autonomous Driving,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 919-926, 2021.
  • [8] J. Liu, Q. Zhao, W. Xiong, T. Huang, Q.-L. Han, and B. Zhu, “SMURF: Spatial multi-representation fusion for 3d object detection with 4d imaging radar,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 799–812, 2024.
  • [9] W. Xiong, J. Liu, T. Huang, Q.-L. Han, Y. Xia, and B. Zhu, “LXL: LiDAR excluded lean 3d object detection with 4d imaging radar and camera fusion,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 79–92, 2024.
  • [10] Y. Yang, J. Liu, T. Huang, Q.-L. Han, G. Ma, and B. Zhu, “RaLiBEV: Radar and LiDAR BEV fusion learning for anchor box free object detection systems,” 2022. arXiv:2211.06108.
  • [11] J. Liu, L. Bai, Y. Xia, T. Huang, B. Zhu, and Q.-L. Han, “GNN-PMB: A simple but effective online 3d multi-object tracker without bells and whistles,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1176–1189, 2023.
  • [12] J. Liu, G. Ding, Y. Xia, J. Sun, T. Huang, L. Xie, and B. Zhu, “Which framework is suitable for online 3d multi-object tracking for autonomous driving with automotive 4d imaging radar?,” in Proceedings of the IEEE 35th Intelligent Vehicles Symposium (IV), 2024, pp. 1258-1265.
  • [13] T. Sadjadpour, J. Li, R. Ambrus and J. Bohg, “ShaSTA: Modeling Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking,” IEEE Robotics and Automation Letters, vol. 9, no. 5, 2024.
  • [14] G. Ding, J. Liu, Y. Xia, T. Huang, B. Zhu, and J. Sun, “LiDAR point cloud-based multiple vehicle tracking with probabilistic measurement-region association,” in Proceedings of the 27th International Conference on Information Fusion (FUSION), 2024. arXiv:2403.06423.
  • [15] T. Huang, J. Liu, X. Zhou, D. C. Nguyen, M. R. Azghadi, Y. Xia, Q.-L. Han, and S. Sun, “V2X cooperative perception for autonomous driving: Recent advances and challenges,” 2023. arXiv:2310.03525.
  • [16] Y. Hu, Y. Lu, R. Xu, W. Xie, S. Chen, and Y. Wang, “Collaboration helps camera overtake lidar in 3d detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 9243–9252.
  • [17] Y. Li, Q. Fang, J. Bai, S. Chen, F. Juefei-Xu, and C. Feng, “Among us: Adversarially robust collaborative perception by consensus,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 186–195.
  • [18] M. Chen, H. V. Poor, W. Saad and S. Cui, ”Wireless Communications for Collaborative Federated Learning,” in IEEE Communications Magazine, vol. 58, no. 12, pp. 48-54, 2020.
  • [19] M. Hao, H. Li, X. Luo, G. Xu, H. Yang and S. Liu, ”Efficient and Privacy-Enhanced Federated Learning for Industrial Artificial Intelligence,” in IEEE Transactions on Industrial Informatics, vol. 16, no. 10, pp. 6532-6542, 2020.
  • [20] C. I. Huang, Y. Y. Huang, J. X. Liu, Y. T. Ko, H. C. Wang, K. H. Chiang, and L. F. Yu, “Fed-HANet: Federated visual grasping learning for human robot handovers,” IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3772-3779, 2023.
  • [21] V. P. Chellapandi, L. Yuan, S. H. Zak, and Z. Wang, “A survey of federated learning for connected and automated vehicles,” in Proceedings of the IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), 2023, pp. 2485–2492.
  • [22] V. P. Chellapandi, L. Yuan, C. G. Brinton, S. H. Żak and Z. Wang, “Federated Learning for Connected and Automated Vehicles: A Survey of Existing Approaches and Challenges,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 119-137, 2024.
  • [23] C. Creß, Z. Bing, and A. C. Knoll, “Intelligent transportation systems using external infrastructure: A literature survey,” 2021. arXiv:2112.05615.
  • [24] X. Huang, T. Huang, S. Gu, S. Zhao, and G. Zhang, “Responsible federated learning in smart transportation: Outlooks and challenges,” 2024. arXiv:2404.06777.
  • [25] T. Huang, X. Yuan, J. Yuan, and W. Xiang, “Optimization of data exchange in 5g vehicle-to-infrastructure edge networks,” IEEE Transactions on Vehicular Technology, vol. 69, no. 9, pp. 9376–9389, 2020.
  • [26] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 194–210.
  • [27] R. Song, R. Xu, A. Festag, J. Ma, and A. Knoll, “FedBEVT: Federated learning bird’s eye view perception transformer in road traffic systems,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 958–969, 2024.
  • [28] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, “Bevdet: High-performance multi-camera 3d object detection in bird-eye-view,” 2021, arXiv:2112.11790.
  • [29] Z. Li, et al., “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 1-18.
  • [30] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 2774-2781.
  • [31] R. Xu, Z. Tu, H. Xiang, W. Shao, B. Zhou, and J. Ma, “CoBEVT: Cooperative bird’s eye view semantic segmentation with sparse transformers,” 2022. arXiv:2207.02202.
  • [32] R. Xu, Y. Guo, X. Han, X. Xia, H. Xiang, and J. Ma, “OpenCDA: an open cooperative driving automation framework integrated with co-simulation,” in Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, pp. 1155–1162.
  • [33] R. Xu, H. Xiang, X. Xia, X. Han, J. Li, and J. Ma, “OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,” in Proceedings of the International Conference on Robotics and Automation (ICRA), 2022, pp. 2583–2589.
  • [34] Q. Yang, S. Fu, H. Wang, and H. Fang, “Machine-learning-enabled cooperative perception for connected autonomous vehicles: Challenges and opportunities,” IEEE Network, vol. 35, no. 3, pp. 96–101, 2021.
  • [35] Z. Bai, G. Wu, X. Qi, Y. Liu, K. Oguchi, and M. J. Barth, “Infrastructure-based object detection and tracking for cooperative driving automation: A survey,” in Proceedings of the Intelligent Vehicles Symposium (IV), pp. 1366–1373, 2022.
  • [36] Y. Han, H. Zhang, H. Li, Y. Jin, C. Lang, and Y. Li, “Collaborative perception in autonomous driving: Methods, datasets and challenges,” 2023. arXiv:2301.06262.
  • [37] R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma, “V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 107–124.
  • [38] R. Xu, X. Xia, J. Li, H. Li, S. Zhang, Z. Tu, Z. Meng, H. Xiang, X. Dong, R. Song, et al., “V2V4real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 13712–13722.
  • [39] H. Xiang, R. Xu, and J. Ma, “HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer,” 2023. arXiv:2304.10628.
  • [40] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of International Conference on Artificial Intelligence and Statistics, pp. 1273–1282, PMLR, 2017.
  • [41] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.
  • [42] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
  • [43] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “A novel framework for the analysis and design of heterogeneous federated learning,” IEEE Transactions on Signal Processing, vol. 69, pp. 5234–5249, 2021.
  • [44] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in Proceedings of the International Conference on Machine Learning, 2020, pp. 5132–5143.
  • [45] Z. Qu, X. Li, R. Duan, Y. Liu, B. Tang, and Z. Lu, “Generalized federated learning via sharpness aware minimization,” in Proceedings of the International Conference on Machine Learning, 2022, pp. 18250–18280.
  • [46] D. Jallepalli, N. C. Ravikumar, P. V. Badarinath, S. Uchil, and M. A. Suresh, “Federated learning for object detection in autonomous vehicles,” in Proceedings of the IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService), 2021, pp. 107–114.
  • [47] X. Chang, X. Xue, B. Yao, A. Li, J. Ma, and Y. Yu, “Reliable federated learning with mobility-aware reputation mechanism for internet of vehicles,” in Proceedings of the IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), 2023, pp. 362–367.
  • [48] R. Song, L. Zhou, V. Lakshminarasimhan, A. Festag, and A. Knoll, “Federated learning framework coping with hierarchical heterogeneity in cooperative its,” in Proceedings of the IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), 2022, pp. 3502–3508.
  • [49] R. Song, L. Lyu, W. Jiang, A. Festag, and A. Knoll, “V2X-boosted federated learning for cooperative intelligent transportation systems with contextual client selection,” 2023. arXiv:2305.11654.
  • [50] X. Tang, J. Zhang, Y. Fu, C. Li, N. Cheng, and X. Yuan, “A fair and efficient federated learning algorithm for autonomous driving,” in Proceedings of the IEEE 98th Vehicular Technology Conference (VTC2023-Fall), 2023, pp. 1–5.
  • [51] T. Van Erven and P. Harremos, “Renyi divergence and kullback-leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
  • [52] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai, “Exploiting shared representations for personalized federated learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2021, pp. 2089–2099.
  • [53] H. Li, Z. Cai, J. Wang, J. Tang, W. Ding, C.-T. Lin, and Y. Shi, “Fedtp: Federated learning by transformer personalization,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023. doi: 10.1109/TNNLS.2023.3269062.