Personalized Federated Learning of Driver Prediction Models for Autonomous Driving

Manabu Nakanoya¹, Junha Im², Hang Qiu³, Sachin Katti³, Marco Pavone³, Sandeep Chinchali⁴ ¹ NEC Corporation, Kanagawa, Japan [email protected]² Samsung Electronics, Hwaseong, Korea [email protected]³ Stanford University, Stanford, CA, USA {hangqiu, skatti, pavone}@stanford.edu⁴ The University of Texas at Austin, Austin, TX [email protected]

Abstract

Autonomous vehicles (AVs) must interact with a diverse set of human drivers in heterogeneous geographic areas. Ideally, fleets of AVs should share trajectory data to continually re-train and improve trajectory forecasting models from collective experience using cloud-based distributed learning. At the same time, these robots should ideally avoid uploading raw driver interaction data in order to protect proprietary policies (when sharing insights with other companies) or protect driver privacy from insurance companies. Federated learning (FL) is a popular mechanism to learn models in cloud servers from diverse users without divulging private local data. However, FL is often not robust – it learns sub-optimal models when user data comes from highly heterogeneous distributions, which is a key hallmark of human-robot interactions. In this paper, we present a novel variant of personalized FL to specialize robust robot learning models to diverse user distributions. Our algorithm outperforms standard FL benchmarks by up to ${\color[rgb]{0,0,0}2\times}$ in real user studies that we conducted where human-operated vehicles must gracefully merge lanes with simulated AVs in the standard CARLA and CARLO AV simulators.

I Introduction

Future robotic fleets must operate amongst diverse humans with heterogeneous preferences on human-robot interaction, in applications ranging from nursing assistance robots to home robots and autonomous vehicles (AVs). Given such heterogeneity, there is a large incentive to share data from robotic fleet deployments to improve computer vision, prediction, and control modules based on diverse interactions with human users. For example, AVs can learn human trajectory forecasting models to proactively anticipate the behavior of nearby humans to aid decision-making [24]. Sharing rare and risk-sensitive driver styles during challenging contexts (e.g. traffic disruptions) can potentially help learn more robust forecasting models that generalize to new cities and driver populations. At the same time, raw trajectory data should be kept private to avoid revealing proprietary logic to competitor companies or individual driver behavior to insurance companies. In this paper, we address how to balance the competing objectives of user privacy and utility of data sharing in large-scale robotic fleet learning.

Federated learning (FL) is a promising approach to train machine learning (ML) models from distributed datasets while preserving privacy [19, 20, 14]. FL trains a model locally at a user’s device (e.g. a robot) and simply shares model parameter updates with a central server to learn from diverse users while protecting privacy by not uploading raw training data. Typically, FL performs poorly (e.g., converges to a poor global model) when user data is highly heterogeneous [11, 8], which is a key problem for AVs that must interact with a variety of diverse human driving styles. The recently-proposed personalized variant of FL [8, 5] is a promising approach to handle user diversity, but, to the best of our knowledge, has hitherto not been applied in robotics. Moreover, our subsequent results illustrate poor performance for standard FL in robotics applications owing to diverse human-robot interactions. As such, the key contribution of this work is to contribute novel algorithms for personalized FL for human trajectory forecasting models in AV deployments.

Contributions and Organization: The technical contributions and organization of this paper are as follows. To the best of our knowledge, we present the first user study that assesses the efficacy of personalized federated learning in robotics, especially for trajectory forecasting models for AVs. Then, we introduce a simple experiment where a fleet of robots have similar dynamics models, but widely different cost functions, for which standard applications of personalized FL are sub-optimal. Then, to mitigate these problems, we introduce novel algorithmic extensions to personalized FL that effectively learn from global experience for common parameters (e.g., shared dynamics models) while adaptively specializing local cost functions. Finally, we show strong performance gains for our algorithm in a user-study with real human drivers on photo-realistic simulators like CARLA [6] and lightweight simulators like CARLO [3] for a lane merging scenario requiring challenging human-robot interaction.

II Related Work

Motivation for Personalized Federated Learning in Robotics: FL [19] is a method to train a global ML model from multiple networked devices, such as mobile phones, that each have local labelled datasets. The key benefits of FL are privacy protection and reduced communication overhead since locally-trained model parameters, instead of raw user data, are sent to the cloud for knowledge sharing. FedAVG [20] is a simple, widely-used implementation of the FL algorithm that averages model gradients from diverse clients to learn a global model, which is then synced back to individual clients for continual learning. Standard applications of FL, including FedAVG, lose robust performance when individual mobile clients’ (e.g., robots) data is highly heterogenous [11, 14, 1]. Such vulnerability to dataset heterogeneity is a crucial problem for networked robotic fleets such as AVs, since perception and forecasting models can fail to converge when robots observe sensory data from diverse environment distributions and interactions with radically different humans.

Recently, personalized FL [4, 5, 8] has emerged as an effective method to learn robust models under such heterogeneity by specializing potentially unique models for each client (e.g. robot). It aims to first learn a global model and then efficiently adapt it to individual robots while minimizing the extra training cost of personalization. This technique is inspired by multi-task learning [22, 27] and meta-learning [26], and is especially important in situations where the cost of personalization is relatively high, such as for low-power robots or mobile phones. Another approach to personalized FL is to weigh the parameters of a global model and unique personalized local model. For example, Deng et al. [4] propose an algorithm that adaptively changes the relative weighting of parameters of a global model during local training. This approach is generally able to improve the accuracy of personalized models, although it leads to increased computation and communication costs.

Our key observation is that many models in robotics have common internal structures – dynamics models are often shared by similar robots, but cost functions or risk sensitivities can be unique. As such, our approach embraces the fact that robotic models can have a subset of parameters that are relatively invariant to data measured on each robot (e.g., shared dynamics) and should therefore resemble global parameters. On the other hand, other subsets of parameters should be personalized for each client, such as those that model risk-sensitivity and cost functions. Thus, our method flexibly adjusts the learning rate for each class of parameters based on its variance across heterogenous robots, which leads to more accurate and stable learning. As such, we embrace that different sub-sets of parameters represent global and local patterns, which makes our work different from [4] which simply produces a weighted average of global and local models.

Adaptive Learning Rates in Optimization: In optimization, the learning rate is a hyper-parameter that governs how quickly parameters can be updated during one step of gradient descent, such as in the standard Stochastic Gradient Descent (SGD) algorithm. For example, the AdaGrad [7] algorithm adjusts the learning rate for each parameter based on its cumulative gradient. This algorithm helps to stabilize and speed up the learning process because it can apply a more appropriate learning rate than a uniform one depending on the progress of learning individual parameters. SGD algorithms such as RMSProp [25] and Adam [12], which are widely used in modern ML training, also use this core idea, which can easily be used in personalized FL. In key contrast to these algorithms, our algorithm adjusts the learning rate based on the progress of learning across distributed robots. Specifically, our algorithm calculates the variance of each individual parameter across FL clients (robots) to gauge the progress of learning and set an appropriate learning rate for that parameter. Moreover, our method can be used alongside standard SGD optimizers for local datasets, such as Adam, as we show in our evaluation.

Trajectory Forecasting for AVs: Our work is complementary to a rich body of research on trajectory forecasting models in robotics, which attempt to predict the motion of pedestrians or the future trajectories of human drivers conditioned on behaviors of AVs. For example, Gupta et al. [9] proposes a Generative Adversarial Network (GAN) model that predicts pedestrians trajectories while considering their interactions. Likewise, Ivanovic et al. [10, 23] propose recurrent neural network models that predict distributions over future agent trajectories by learning from past timeseries of agent interactions. Schmerling et al. [24] incorporate such trajectory forecasting models to construct control policies for AVs that anticipate the behavior of nearby human drivers to smoothly negotiate lane changes. Our work is complementary to such prior research – rather than develop novel forecasting models, we instead develop novel learning techniques to specialize models from heterogeneous agent interactions while protecting data privacy.

Federated Learning in Robotics: We note prior work has applied FL to robotics, but to learn vision models in private scenarios [15, 16, 17]. Instead, we address personalized federated learning to adapt to heterogenous human interactions, which is the key novelty of our paper. Moreover, a key novelty of our work is our new algorithm for personalization with differentiates between common parameters, such as for dynamics models, that apply to many robots and those that are robot, human, or scenario-specific.

III Problem Statement and Proposed Method

We now formalize our problem statement to show how FL is applied to robotic systems. Fig. 2 illustrates an overview of the system we assume in our problem. First, we introduce a fleet of robots, such as autonomous vehicles, consisting of $n$ robots denoted by $\mathcal{R}_{k}(k=0,1,..,n-1)$ which measure sensory data $x^{t}_{k}$ such as an image or LIDAR point cloud where $t\in\mathbb{N}$ is discrete time. $\mathcal{R}_{k}$ connects to a cloud environment $\mathcal{C}$ with a wireless network. Our goal is to learn a model that predicts future system states or controls of other objects, such as pedestrians and vehicles, based on measured data $x^{t}_{k}$ and a parameter $\theta_{k}$ to be learned. Additionally, we introduce $\theta_{\mathrm{global}}$ which is a parameter of the global model. The global model is held in the cloud $\mathcal{C}$ and trained without sharing the measured data with robots.

The measured dataset is used to learn a machine learning model, such as a trajectory forecasting model, that predicts future values $x^{t}_{k}$ with model parameters $\theta_{k}$ . A series of $x^{t}_{k}$ measured from time $0$ to $T$ , such as a trajectory of a specific object or a segment of video, is denoted by $\mathbf{x}_{k}=x^{0:T}_{k}$ . We note that our problem especially applies when each $\mathbf{x}_{k}$ is very different from others. This is because personalization is not needed if $\mathbf{x}_{k}$ has the same distribution for all $k$ .

The learning algorithms of our federated learning system aim to learn the parameters of a prediction model. Therefore, similar to general ML algorithms, we assume an objective function $F$ to be optimized. Typically, this is a loss function of a prediction model, such as mean squared error for future predicted system states. For personalization, the learning algorithms aim to find optimal parameter $\theta^{*}_{k}$ which minimizes $F(\mathbf{x}_{k};\theta_{k})$ for each robot $\mathcal{R}_{k}$ . Thus, we formalize our problem as a minimization problem of the sum of the objective functions $F(\mathbf{x}_{k};\theta_{k})$ :

\displaystyle\min{\sum^{n-1}_{k=0}F(\mathbf{x}_{k};\theta_{k})}\;\;\;\mathrm{where}\>\mathbf{x}_{k}=x^{0:T}_{k}.

This means learning algorithms for our system should learn $\hat{\theta}_{k}$ that minimizes the expected value of the objective functions by training the prediction model at robot $\mathcal{R}_{k}$ . We note that $\hat{\theta}_{k}$ are the same value for all $k$ if the algorithm trains only a single global model, like general FedAVG [18] without personalization.

III-A Proposed Method

Even in the setting we showed above, general FL algorithms such as FedAVG can potentially perform well. Namely, we can obtain a better set of parameters $\hat{\theta}_{k}$ than randomly initialized without sharing $\mathbf{x}_{k}$ in a cloud environment $\mathcal{C}$ . However, we hypothesize that we can improve $\hat{\theta}_{k}$ by syncing parameters from a global model as in standard FL, but also specializing the model for robot $\mathcal{R}_{k}$ with the local data $\mathbf{x}_{k}$ . Therefore, we propose a variant of FL algorithm with an additional training step for personalization, which is shown in Alg. 1. There are two major differences from general FL. First, we add a training step executed on each robot $\mathcal{R}_{k}$ for personalization in Alg. 1 line 7. Unlike some prior personalized FL works such as [8], we assume $\mathcal{R}_{k}$ can execute many more training steps (e.g. not one step of SGD but a few epochs) because today’s robot systems can do so with modern low-power deep learning accelerators. Our main innovation is to carefully limit parameter updates to avoid overfitting by personalized training. This is inspired by standard methods in fine-tuning DNNs using transfer learning, which freeze some layers’ parameters. Also, retraining all the parameters on a local dataset, which can be of limited size and highly biased, often causes overfitting. We found that it is effective to slow down the update of parameters that have common features across many robots’ data during personalized training. To do so, our key insight is to estimate the variance of different parameters across robots during FL, which is codified in Algorithm lines 7, 13, and 14.

Specifically, we now discuss how to properly adjust the extent of parameter updates. In short, we propose to apply an adaptive learning rate to FL for robotics systems. Generally, it is hard to accurately determine which parameters learn common dynamics and shared trends across robots, especially when training a model such as a DNN with a huge number of parameters. However, we can expect such common parameters tend to eventually converge to a similar value on any dataset. Therefore, we propose a method to estimate how much parameters have in common from the variation of parameters learned by each robot in a fleet. The estimated degree of similarity finally affects the learning rate of each parameter (lines 13-14). As a result, higher learning rates are assigned to parameters with higher variation in personalized training, which might indicate they need specialization per robot. Specifically, we introduce a variation vector $\vec{\sigma}$ whose elements represent the degree of variability of each parameter. Then, the learning rate of each parameter is determined based on a ratio of the variability. Namely, the learning rate of the $i$ -th parameter of parameter vector $\theta$ is denoted by $L\sigma_{i}/\max\vec{\sigma}$ , where $L$ is a constant representing the maximum learning rate. In practical systems, we can cluster parameters (e.g. parameters of a layer of a DNN) and calculate learning rates for each cluster for computational scalability.

1 Randomly initialize parameter

\theta_{global}

2 Initialize

\vec{\sigma}=\{\sigma_{0},\sigma_{1},\dots,\sigma_{I}\}\leftarrow\{1,1,\dots,1\}

3 Initialize

\vec{l}=\{l_{0},l_{1},\dots,l_{I}\}\leftarrow\{L,L,\dots,L\}

4 repeat

5 for $k~{}\leftarrow 0$ to $n-1$ do

\theta_{k}\leftarrow\theta_{global}

7 Update

\theta_{k}

with

F(\mathbf{x}_{k};)

and learning rate

\vec{l}

\hat{\theta}_{k}\leftarrow\theta_{k}

\theta_{k}\leftarrow\theta_{global}

9 Update

\theta_{k}

with

F(\mathbf{x}_{k};)

and learning rate

L

10 Send

\theta_{k}

to cloud environment

\mathcal{C}

Result: Return

\hat{\theta}_{k}

12 end for

13 Update

\theta_{global}

by aggregating

\{\theta_{0},\theta_{1},\dots,\theta_{n-1}\}

14 Update

\vec{\sigma}

\sigma_{i}\leftarrow\sum_{k=0}^{n-1}(\theta^{i}_{k}-\bar{\theta}^{i})^{2}

15 Update

\vec{l}

l_{i}\leftarrow L\sigma_{i}/\max\{\sigma_{0},\sigma_{1},\dots,\sigma_{I}\}

17until $\hat{\theta}_{k}$ converges

Algorithm 1 Federated Control Learning:

L

is a scalar constant representing the basic learning rate, and

i

is the id of an individual parameter in

\theta_{k}

. Therefore,

\theta^{i}_{k}

represents the

i

-th parameter in

\theta_{k}

(the number of individual parameters is

I

). The key process is line 7 where a model is trained with the adaptive learning rate

\vec{l}

\vec{\sigma}

, which is used to update

\vec{l}

, which represents a vector of the deviation sum of squares for each

\theta^{i}

. We assume every robot (client)

\mathcal{R}_{k}

participates in every training round.

IV Experimental Results

We now evaluate our proposed method (Algorithm 1) on three diverse tasks. We start with a simple toy example to benchmark the proposed method. The second and third are more practical tasks in autonomous driving where a robot vehicle changes a lane while anticipating the behavior of other human-driven vehicles.

IV-A Illustrative Toy Example: LQR control of a 1-dimensional point mass “robot”

In this section, we introduce a very simple task where a point mass must be controlled by a linear-quadratic regulator (LQR) controller in a 1-dimensional space, as illustrated in Fig. 3. The goal is to learn parameters of an LQR controller that moves the point mass from an arbitrary initial state to the origin. Importantly, we have $K=3$ robots with identical dynamics, but different LQR cost functions. Our goal is to estimate both the dynamics model and LQR control policy parameters using local datasets $\mathbf{x}_{k}$ for robot $k$ and personalized federated learning to share knowledge amongst robots $k$ . We now introduce the system dynamics and LQR controller for each robot $k$ .

Each robot $k$ ’s system state at discrete time $t$ is denoted by $x^{k}_{t}=[y^{k}_{t},v^{k}_{t}]$ , which is a vector of its velocity $v^{k}_{t}$ and position $y^{k}_{t}$ . The dynamics of this simple linear system are given by $x^{k}_{t+1}=Ax^{k}_{t}+Bu^{k}_{t}$ , where $A\in\mathbb{R}^{2\times 2}$ is the dynamics matrix, $B\in\mathbb{R}^{2\times 1}$ is the control matrix, and $u^{k}_{t}\in\mathbb{R}$ is the control input. Assuming a unit-mass system, the dynamics of our linear system are given by:

\displaystyle x^{k}_{t+1}=\underbrace{\begin{pmatrix}1&1\\ 0&1\end{pmatrix}}_{A}x^{k}_{t}+\underbrace{\begin{pmatrix}0\\ 1\end{pmatrix}}_{B}u^{k}_{t}.

Next, we define a cost function to identify the optimal control input. The standard LQR cost is defined as $\sum_{t=0}^{\infty}((x^{k}_{t})^{\mathrm{T}}Q^{k}x^{k}_{t}+(u^{k}_{t})^{\mathrm{T}}R^{k}u^{t}_{k})$ , where $Q^{k}$ and $R^{k}$ are weight matrices ( $Q^{k}\in\mathbb{R}^{2\times 2},R^{k}\in\mathbb{R}^{1\times 1}$ ). It is well-known that the optimal control input $u^{*,k}_{t}=\mathcal{K}^{\mathrm{LQR}}x^{k}_{t}$ , where $\mathcal{K}^{\mathrm{LQR}}$ is a feedback matrix that arises from the solution to the discrete time Riccati Equation.

In our toy experiment, we assume each robot $\mathcal{R}_{k}$ knows the parametric form of the linear dynamics equations and linear feedback policy, but does not know the specific parameters $A$ , $B$ , $Q^{k}$ and $R^{k}$ . Instead, each robot must learn these parameters from measured rollouts (e.g., trajectory data) of an expert controlling the system. We note that it is sufficient to learn dynamics matrices $A$ and $B$ to predict the next system state from the current state and applied control, as well as the LQR feedback matrix $\mathcal{K}^{\mathrm{LQR},k}$ to general controls from the current system state.

Crucially, since each robot has the same dynamics, our experiments should show that $A$ and $B$ quickly converge to the same global value using knowledge transfer. However, since the cost functions differ, we should see each robot converge to its unique LQR feedback matrix, illustrating the benefits of personalization.

Heterogenous Robot Datasets: We generated $K=3$ different synthetic datasets (one per robot) containing rollouts of the LQR system using common dynamics but different cost functions. For simplicity, we fixed the weight matrix $Q^{k}$ as an identity matrix but varied $R^{k}$ for each robot $k$ . The first robot has the reference controller whose $R^{k}=1$ . The second and third are 50 and 100, respectively, which indicate they prefer larger magnitude actuation. During simulation, we added Gaussian noise to the system state and control inputs with a small variance of $0.01$ . Using the optimal controllers for each robot’s cost function, we collected expert trajectories of a point mass for 40 randomly initialized states and a simulation time horizon of $T=30$ . Therefore, the number of total states collected was $3,600$ $(40\times 30\times 3)$ , with $10\%$ of the data held-out as testing data.

Benchmark Algorithms: To compare to our proposed method (Algorithm 1), we evaluated five algorithms shown in Table I. These benchmarks rigorously cover a spectrum of algorithms used in practice today, ranging from purely local training without data sharing (“Local”), simply pooling all data in the cloud without privacy considerations (“Cloud”), standard FL with FedAvg (“SFL”), standard personalized FL (“SPFL”), and finally our proposed method with adaptive, parameter-wise learning rates for personalized FL (“APFL”). Of course, each robot only has access to its own training dataset locally.

TABLE I: Description of training algorithms

Name	Description
Local	All robots train their models only on data they measure themselves.
Cloud	Training is processed on the cloud environment using all raw data, without privacy guarantees.
Standard FL (SFL)	Standard FL, which is the same as Alg. 1 except there is no personalization (line 7).
Standard Personalized FL (SPFL)	Standard personalized FL, which is Alg. 1 without applying our contribution of a parameter-wise learning rate.
Adaptive Personalized FL (APFL, Ours)	Our proposed method in Algorithm 1.

Evaluation Metrics: The overall loss function $F(\mathbf{x}_{k};\theta_{k})$ is the error in predicting a trajectory rollout using learned parameters $\hat{A}^{k}$ , $\hat{B}^{k}$ , and $\hat{\mathcal{K}}^{\mathrm{LQR,k}}$ instead of the true parameters $A$ , $B$ , and $\mathcal{K}^{\mathrm{LQR,k}}$ . This is a sum of the mean squared error (MSE) of the control loss and state loss. Therefore, the loss function $F(\mathbf{x}_{k};\theta_{k})$ is denoted by $1/|\mathbf{x}_{k}|\sum_{k,t}(\|\hat{x}^{k}_{t}-x^{k}_{t}\|_{2}^{2}+(\hat{u}^{k}_{t}-u^{k}_{t})^{2})$ where $\hat{x}^{k}_{t}$ and $\hat{u}^{k}_{t}$ are prediction vectors of the system state and control input computed by $\hat{A}x^{k}_{t-1}+\hat{B}u^{k}_{t-1}$ and $-\hat{\mathcal{K}}^{\mathrm{LQR,k}}x^{k}_{t-1}$ , respectively. We implemented the training algorithms in Table 1 and evaluated each robot on test data drawn from its local data distribution. Then, we report the loss averaged across each robot, which corresponds to the objective function in Sec. III. We used the ADAM optimizer [12], and both the default learning rate and initial uniform learning rate $L$ in Algorithm 1 are set to $0.01$ .

TABLE II: Prediction loss for each algorithm: We present the total loss and constituent terms of state loss and control loss for each scheme, averaged over 10 test trials. Clearly, our method of APFL has the lowest total loss (our optimization objective) since it learns the global dynamics but learns unique cost functions per robot, which is crucial to achieve low control loss. We learn the dynamics and control with a small, non-zero loss since our simulations had Gaussian dynamics and actuation noise.

Algorithm	Local	Cloud	SFL	SPFL	APFL
State Loss	0.01101	0.01012	0.01023	0.01023	0.01018
Control Loss	0.01251	0.19666	0.20123	0.01095	0.01096
Total Loss	0.02352	0.20678	0.21146	0.02118	0.02114

Results: Table II shows the prediction losses for each training method. We show the prediction loss for next system states ( $x^{k}_{t}$ predicted by $A$ and $B$ ) and control loss ( $u^{k}_{t}$ predicted by $\mathcal{K}^{LQR,k}$ ) separately to investigate how learning affects shared and robot-specific parameters. Most methods are able to learn the common state dynamics that depend on the global $A$ and $B$ . However, purely local training achieves the worst loss since each robot has a smaller amount of local training data. Our key result is that for control loss, the methods with personalization (Local, SPFL, and our APFL) outperform others that train a global model by mixing robots’ heterogenous data. Crucially, our method of APFL has the lowest total loss.

Fig. 4 provides further evidence for the benefits of our APFL method by showing how close the learned parameters $\hat{A}^{k}$ , $\hat{B}^{k}$ , and $\hat{\mathcal{K}}^{\mathrm{LQR},k}$ differ from the ground-truth parameters. Clearly, our method of APFL (purple) shows strong benefits of not only personalization, but also our scheme of adaptively adjusting the learning rate of parameters based on their variation across robots. Specifically, we achieve both low errors for dynamics and control matrices (x-axis) and the estimated LQR feedback matrix (y-axis). Next we evaluate the benefits of APFL on two challenging AV tasks featuring rich human-robot interaction in our user studies.

IV-B Lane swapping in the CARLO driving simulator

The second case study is shown on the left side of Fig. 5, where a human-driven vehicle and AV must safely interchange lanes in a short distance, inspired by [24]. The AV is equipped with a trajectory forecasting model that predicts the future motion of the human-driven car, conditioned on past interaction history and a candidate robot future control decision. A key feature of our work is we performed a study with 7 real human drivers, who exhibited a mixture of aggressive and cautious driving styles when deciding whether to overtake or yield to the AV. To quickly collect data with diverse human volunteers, we used the CARLO 2D driving simulator [3], which is a light-weight version of CARLA without photo-realistic rendered scenes. Crucially, our FL framework ensures that the human-robot interaction datasets, which often show risky driving styles from human subjects, are kept private and never shared with a central server.

System Definition: We define the joint robot and human (i.e., system) state $S_{t}$ as $(\vec{p}^{\;r}_{t},\vec{p}^{\;h}_{t},\vec{v}^{\;r}_{t},\vec{v}^{\;h}_{t})$ where $\vec{p}^{\;r}_{t}$ and $\vec{p}^{\;h}_{t}$ are the 2D position vectors of the robot and human car at discrete time $t$ , and $\vec{v}^{\;r}_{t}$ and $\vec{v}^{\;r}_{t}$ are the velocity vectors. The 2D vectors are composed of $x$ and $y$ components. For example, a robot’s velocity is denoted by $\vec{v}^{\;r}_{t}=(v^{rx}_{t},v^{ry}_{t})$ where $v^{rx}_{t}$ and $v^{ry}_{t}$ are scalar values representing a robot’s velocity in the lateral and longitudinal directions. The control space of each car involves throttle and steering which are finite sets. For example, the throttle of a robot is denoted by $c^{rth}_{t}\in\{-1.5,0,1.5\}$ . Similarly, the steering of a human is denoted by $c^{hst}_{t}\in\{-0.04,0,0.04\}$ . Thus, the control of the cars at time $t$ is denoted by $C_{t}=(C^{r}_{t},C^{h}_{t})=(c^{rth}_{t},c^{rst}_{t},c^{hth}_{t},c^{hst}_{t})$ .

Prediction Models for Federated Learning: The robot controller has to make predictions of future system states to choose appropriate controls. Our scenario is inspired by [24], and makes the same practical assumption that we know the AV’s internal dynamics and can fully observe robot and human states. Crucially, all we need is to predict the human’s future control inputs and trajectory conditioned on candidate future robot controls. Then, given such a prediction model, we can embed it into a Model Predictive Controller (MPC) [2] for the AV to choose low cost robot actions that anticipate the response of humans. This means that the prediction model needs to estimate a future series of human control inputs at a single time-step. To achieve this, we introduce a Conditional Variational Autoencoder (CVAE) illustrated in Fig. 6 with recurrent subcomponents [24] which are able to predict time-series data. Our MPC control inputs are decided periodically at a discrete interval $\tau$ .

Robot Controller: We define the robot controller as a two-phase controller. In the first phase, the controller is responsible for negotiating the longitudinal distance between human vehicles with different driving styles. This is the most important part in our experiment because it requires accurate, personalized predictions of human driver behavior. For the first phase, we introduce a cost function $J_{t}$ to evaluate candidate control inputs at time $t$ as follows:

\displaystyle J_{t}=\alpha(p^{ry}_{t}-p^{hy}_{t})(v^{ry}_{t}-v^{hy}_{t})+\beta/|\vec{p}^{\;r}_{t}-\vec{p}^{\;h}_{t}|

The first term represents longitudinal cost based on the fact that cars need sufficient longitudinal distance during a lane change. The second term represents a distance cost between cars to avoid collisions. $\alpha$ and $\beta$ are parameters that weight each objective. We chose $\alpha=-3$ and $\beta=5000$ to heavily emphasize safety in this experiment.

When the robot controller decides a series of control inputs $[(c^{rth}_{t+1},c^{rst}_{t+1}),(c^{rth}_{t+2},c^{rst}_{t+2}),..,(c^{rth}_{t+\tau},c^{rst}_{t+\tau})]$ at time $t$ , it computes the sum of the expected cost function in the time horizon $H>\tau$ for all possible robot control candidates. The controller chooses a series of control inputs that minimizes the sum of the expected cost, which naturally depends on anticipated future human controls. We set the control interval $\tau$ and planning horizon $H$ to $3$ and $20$ respectively, which performed well with an acceptable compute budget.

Once there is enough longitudinal distance to swap lanes, the controller switches to the second phase, and just works as a lane change controller. Since we observed humans’ driving styles almost do not differ during the lane change (and largely govern when to safely initiate it), the lane change controller is a static LQR controller. This static LQR controller drives the robot to a target lateral position $p^{rx}$ representing the x-coordinate of a target lane with longitudinal velocity $v^{ry}$ . In the following quantitative evaluation, we incorporate human trajectory prediction models trained with the benchmarks in Table 1 into the MPC-based AV controller. Then, we compare these controllers in terms of how safe and efficiently the cars are able to swap lanes given the different human’s driving styles.

Dataset: Our dataset consists of system trajectories (including robot and human states, controls, and costs) from time $0$ to the time when both cars finished swapping lanes. The dataset consists of a training set, a held-out evaluation set, and a third held-out test set. In order to add diversity to each session’s data, we randomly created 50 initial states $S_{0}$ .

For the training set, seven humans play the lane swapping scenario running on the CARLO simulator with 50 initial states above. Since the prediction model with CVAE shown in Fig. 6 is not trained at this time, the robot controller uses a very simple model based on only the current state. Specifically, the predicted human control $\hat{C}^{h}_{i}$ for future time $i$ is always the current human control $C^{h}_{t}$ for any $i\in\{t+1,t+2,..,t+\tau\}$ .

Once we gather initial data for how humans respond to a baseline AV without a sophisticated model of human behavior, we can train a better driven prediction model. This prediction model is in turn used in the AV’s MPC and resulting rollouts against human drivers allows us to generate an evaluation set. Notably, this evaluation set is used to fine-tune and re-train the human trajectory forecasting model.

Similar to the experiment IV-A, each of the 7 robot cars observes only one type of data corresponding to one human driver. During training, both the default learning rate and $L$ are 0.001, and the loss function is the evidence lower bound objective (ELBO) [13, 21]. The number of epochs is 30 for each training step in Alg. 1. We used 40 sessions for training and 10 challenging held-out sessions for testing. We trained the model with four algorithms shown in Tab. I. Similar to the experiment in Sec. IV-A, the federated learning algorithms use FedAVG to aggregate trained parameters. In our method (APFL), we used layer-wise learning rate instead of parameter-wise to improve compute efficiency, which essentially groups parameters in one layer.

Evaluation Results: We evaluate the test losses of the prediction models trained with our benchmark algorithms. Fig. 7 illustrates the held-out test dataset loss for model’s successively trained using our benchmark algorithms. In contrast to standard FL, algorithms involving personalization adapt better to heterogenous human driver interactions to achieve a lower loss. Crucially, our method of APFL achieves the lowest loss, which is consistent with the results of the experiment in Sec. IV-A.

Additionally, evaluation results show an AV (with our learned prediction models in-the-loop) can safely and efficiently swap lanes with actual human drivers in test episodes in the CARLO simulator. Fig. 8 shows distributions of key quantitative metrics over the test episodes, such as the elapsed time until the AV starts a lane change, distance between cars when lane changes commence, and the mean cost for each session. Clearly, an AV using our APFL model can start to change lanes quickly with a larger safe distance to the human car.

IV-C Lane change with the CARLA AV simulator

We now consider a more sophisticated AV driving scenario in the photo-realistic, standard CARLA [6] simulator. As shown on the right of Fig. 5, our scenario is similar to the lane-swap scenario of the previous section. However, there is additional complexity since there is a gray human-driven car that starts at random distances from the AV with random velocities, which requires the AV to reason about whether there is enough safety margin to overtake the red car.

Experimental Setup: The settings of this experiment closely follow those of the lane swapping case in IV-B. The control space is the same as before, and the state space is the same except that we add the state of the gray car. The architecture of the prediction model is exactly the same as the one in IV-B, shown in Fig. 6. Further, the robot controller is also the same except it does not switch to the second phase where the robot is going to change a lane because robot cars do not change lanes in the scenario. Additionally, we set a lower limit $C_{low}$ on the cost function to prevent the robot cars from unnecessarily increasing the distance to the human car after both cars have a safe and sufficient distance, which will artificially decrease the cost. Therefore, the new cost function $J^{\prime}_{t}$ is denoted by $\max(C_{low},J_{t})$ .

Unlike the lane swapping case, we use a programmed synthetic controller for human cars to test on uniformly diverse driving styles, including risky behaviors. This synthetic human controller is also a two-phase controller, but control inputs are decided according to a target velocity in the first phase. The target velocity is chosen as high $v_{\mathrm{high}}$ or low $v_{\mathrm{low}}$ , where the high velocity $v_{\mathrm{high}}$ is only chosen if the human car has a safe relative difference from the other cars, modulated by its risk tolerance parameter $\gamma$ :

\displaystyle v^{\mathrm{target}}_{t}=\left\{\begin{array}[]{ll}v_{\mathrm{high}}&\text{if}\;\;\gamma D_{\mathrm{rel}}\geq 0\;\land\;p^{hy}_{t}-p^{gy}_{t}\geq D_{\mathrm{safe}}\\ v_{\mathrm{low}}&\text{otherwise}\end{array}\right.

In the above equation, $\gamma$ is a parameter representing the degree of preference for overtaking robot cars, which we henceforth called the risk-tolerance parameter. Further, in the above equation, $D_{\mathrm{rel}}$ indicates the relative gap between cars based on the positions and velocities denoted by $(p^{hy}_{t}-p^{ry}_{t})+(v^{hy}_{t}-v^{ry}_{t})$ , and $p^{gy}_{t}$ and $D_{\mathrm{safe}}$ are a longitudinal position of the gray car and a safety distance constant, respectively.

The target velocity for the human-driven vehicle is set depending on the relative velocity to the robot car driving alongside it. Once the target velocity is determined, the synthetic human-driver controller computes control inputs with a PID controller in CARLA. We tested 5 kinds of driving styles whose risk-sensitivity ( $\gamma$ ) parameters were in the range $[-1.0,-0.5,0,0.5,1.0]$ . These driving styles include a mixture of cautious and aggressive styles to ensure a diverse dataset. For instance, a human car with $\gamma=1.0$ sometimes tries to pass the robot car even when the current relative position to the robot car is minimal.

We collected a diverse dataset in the same manner as in the previous scenario. Specifically, we generated 54 initial states where the relative position between cars and velocities are uniformly distributed. We trained trajectory forecasting models according to Table I with the same hyper-parameters and LSTM encoder-decoder model as the previous scenario. Finally, we integrated the trained trajectory forecasting models into the AV’s controller and evaluated it in 9 challenging scenarios where the robot cars need to change their behavior depending on humans’ driving styles.

TABLE III: Prediction loss for each training scheme: All losses are on the test data after training has converged.

Algorithm	Cloud	SFL	SPFL	APFL(Ours)
Loss	0.01643	0.01760	0.01600	0.01160

Evaluation Results: Using the same metrics as the lane swapping case in Sec. IV-B, we evaluate the test loss of the prediction model and the controllers’ performance in terms of how safe and efficiently the cars are able to change a lane. The prediction losses on test data for each model, which are shown in Tab. III, have the same trend as in the previous scenario of lane swapping, clearly illustrating the benefits of our APFL method. Moreover, Fig. 9 illustrates APFL provides the lowest cost while maintaining a safe distance from other cars.

V Discussion and Conclusions

This paper presents a novel personalized federated learning framework for deployments of robots that measure diverse sensory streams in varied environmental contexts. Our first contribution is to show the drawbacks, and potential, of standard personalized FL through a case study where robotic models share a common structure for dynamics but heterogeneous cost functions. Then, our second contribution is to propose a novel algorithm which mitigates these drawbacks and effectively leverages both local and global knowledge to improve robotic control. Specifically, we demonstrate strong experimental performance of our algorithm in state-of-the-art driving simulators featuring real human driver data.

In future work, we plan to test our algorithm on large-scale deployments of AVs using a combination of public trajectory datasets from Waymo, Aptiv, Lyft, etc. Moreover, we plan to provide theoretical convergence guarantees for our personalized FL algorithm in distributed convex optimization settings that arise in multi-agent systems and robotics. Overall, our work is a timely first step to address how to learn privacy-preserving, specialized deep learning models for robot fleets that will increasingly interact with diverse humans in diverse operating scenarios.

References

[1] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov. How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics, pages 2938–2948. PMLR, 2020.
[2] E. F. Camacho and C. B. Alba. Model predictive control. Springer Science & Business Media, 2013.
[3] Z. Cao, E. Biyik, W. Z. Wang, A. Raventos, A. Gaidon, G. Rosman, and D. Sadigh. Reinforcement learning based control of imitative policies for near-accident driving. In Proceedings of Robotics: Science and Systems (RSS), July 2020.
[4] Y. Deng, M. M. Kamani, and M. Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
[5] C. T. Dinh, N. H. Tran, and T. D. Nguyen. Personalized federated learning with moreau envelopes. arXiv preprint arXiv:2006.08848, 2020.
[6] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.
[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
[8] A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
[9] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
[10] B. Ivanovic and M. Pavone. The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2375–2384, 2019.
[11] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[14] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
[15] Z. Li, L. Wang, L. Jiang, and C.-Z. Xu. Fc-slam: Federated learning enhanced distributed visual-lidar slam in cloud robotic system. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 1995–2000. IEEE, 2019.
[16] B. Liu, L. Wang, and M. Liu. Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems. IEEE Robotics and Automation Letters, 4(4):4555–4562, 2019.
[17] B. Liu, L. Wang, M. Liu, and C.-Z. Xu. Federated imitation learning: A novel framework for cloud robotic systems with heterogeneous sensor data. IEEE Robotics and Automation Letters, 5(2):3509–3516, 2020.
[18] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
[19] B. McMahan and D. Ramage. Federated learning: Collaborative machine learning without centralized training data. https://ai.googleblog.com/2017/04/federated-learning-collaborative.html, 2017. [Online; accessed 13-Aug.-2021].
[20] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas. Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629, 2016.
[21] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR, 2014.
[22] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
[23] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 683–700. Springer, 2020.
[24] E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone. Multimodal probabilistic model-based planning for human-robot interaction. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3399–3406. IEEE, 2018.
[25] T. Tieleman, G. Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[26] R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning. Artificial intelligence review, 18(2):77–95, 2002.
[27] Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.