
Improved Robustness and Safety for Pre-Adaptation of Meta Reinforcement Learning with Prior Regularization

Lu Wen1, Songan Zhang2, H. Eric Tseng2, Baljeet Singh2, Dimitar Filev2, Huei Peng3 1Lu Wen is a Ph.D. Student in Mechanical Engineering, University of Michigan, email: [email protected]2Songan Zhang, Eric Tseng, Baljeet Singh, Dimitar Filev are with Ford Research and Advanced Engineering.3Huei Peng is a professor of Mechanical Engineering, University of Michigan, email: [email protected]
Abstract

Meta Reinforcement Learning (Meta-RL) has seen substantial advancements recently. In particular, off-policy methods have been developed to improve the data efficiency of Meta-RL techniques. Probabilistic embeddings for actor-critic RL (PEARL) is a leading approach for multi-MDP adaptation problems. A major drawback of many existing Meta-RL methods, including PEARL, is that they do not explicitly consider the safety of the prior policy when it is exposed to a new task for the first time. Safety is essential for many real-world applications, including field robots and Autonomous Vehicles (AVs). In this paper, we develop the PEARL PLUS (PEARL+) algorithm, which optimizes the policy for both prior (pre-adaptation) safety and posterior (after-adaptation) performance. Building on top of PEARL, our proposed PEARL+ algorithm introduces a prior regularization term in the training objective and a new Q-network for recovering the state-action value under the prior context assumption, to improve the robustness to task distribution shift and the safety of the trained network when it is exposed to a new task for the first time. The performance of PEARL+ is validated on three safety-critical problems related to robots and AVs, including two MuJoCo benchmark problems. The simulation experiments show that the safety of the prior policy is significantly improved and is more robust to task distribution shift compared with PEARL.

I Introduction

It is important for trained Artificial Intelligence (AI) to adapt to a new task relatively quickly with a small amount of data. Although Deep Reinforcement Learning (DRL) methods have proven powerful for complicated sequential decision-making problems, they usually learn a separate policy for each specific task. In recent years, researchers have tried to design AI that can solve multiple tasks, either by switching to a corresponding policy it has already learned [1], or by learning and adapting to the new task. The first approach requires that the tasks can be enumerated, which may not be possible or practical for some real-world applications. Moreover, learning to master each and every task often requires a lot of training data, making this approach data inefficient.

The second approach, which is to enable quick learning/adaptation as new tasks arise, is named Meta Reinforcement Learning (Meta-RL)[2]. Meta-RL techniques aim to gain “general learning” for better dealing with new tasks by leveraging learned experience from previous related tasks. This requires that these tasks share some common features. For instance, in autonomous driving, merging into a highway from a ramp and entering a round-about both involve the Autonomous Vehicle (AV) agent choosing a gap, adjusting its velocity, and making the move, while the exact road geometry and behaviors of the adjacent vehicles may vary.

How to leverage learning across a family of tasks sharing similar structures, instead of simply doing well in a single task, is still an open topic. In [3, 4, 5], the agent is trained to find the parameters $\theta$ of its policy such that, when the agent takes a few gradient steps from $\theta$, it converges toward $\theta^{*}$, the optimum for a new task. In this case, the common structure of different tasks is implicitly represented by the initial parameter $\theta$. Another approach is to learn latent contexts among tasks, with the context acting as part of the input to the policy network. For example, in [6] Rakelly et al. developed a data-efficient off-policy Meta-RL method called Probabilistic Embeddings for Actor-critic meta-RL (PEARL), which performs online probabilistic filtering of the latent task variables to infer how to solve a new task from a small amount of new data. While the off-policy nature of PEARL enables data efficiency when adapting to a new task, it requires collecting the context and the corresponding latent variables of the new task online with the prior policy (the policy conditioned on the prior distribution). As a result, the performance is not guaranteed when encountering a new task for the first time. This poses a serious challenge for applications like autonomous driving, where the agent cannot afford to perform very poorly. In other words, it is important for the prior (pre-adaptation) policy to remain robust with respect to task distribution shift.

Safety in RL is not a new problem and a lot of research has been done [7, 8], but the safety of Meta-RL before adaptation has not received enough attention. In this paper, we consider a policy safe if it does not lead to a pre-defined unsafe state. Robustness in meta-learning has also been studied in multiple recent works. The work in [9] explores robustness to perturbations in the task samples, [10] deals with imbalances in the number of samples per task instance, and [11] proposes a solution to improve robustness to shifts in the task distribution between meta-training and meta-testing. In this paper, our discussion is on ‘pre-adaptive task-robustness’, in the sense that the policy performs well before adaptation despite the task distribution shift between the prior assumption and the real tasks. All of our experimental studies are safety-critical cases, and we consider a prior policy pre-adaptive task-robust if its safety-violation rate does not exceed an absolute rate of 0.1%. We develop PEARL+, a Meta-RL method based on PEARL [6], to improve the pre-adaptation safety and robustness of Meta-RL. We introduce a prior regularization term and a new Q-network, which does not back-propagate into the context encoder, to explicitly consider the robustness and safety of the prior policy without sacrificing after-adaptation performance.

The rest of the paper is organized as follows. We first describe the related work in Section II. Then the preliminaries of Meta-RL and the PEARL method will be introduced in Section III. In Section IV, we present the details of our new method PEARL+. We study the performance of PEARL+ on three applications and the results are shown in Section V. Finally, the paper is concluded in Section VI.

II Related Work

The current state-of-the-art Meta-RL algorithms can be categorized into four types.

RNN-based Meta-RL. In these methods (e.g., [2][12]), a Recurrent Neural Network (RNN) structure is used, and the experience from previous tasks is preserved in the RNN parameters. The agent starts with no memory/knowledge when it begins to adapt to a new task. However, RNN-based policies have no mathematical convergence proof and are believed to be vulnerable to meta-overfitting [13].

Gradient-based Meta-RL. These methods learn a policy initialization that can quickly adapt to a new task after a few policy gradient steps. They usually separate learning into two loops: an inner loop for policy adaptation, and an outer loop for policy updates [14]. In most existing work, both loops are optimized through gradient descent [3][5]. One notable exception is [15], which implements behavior cloning in the inner loop to reduce the need for task exploration. Due to their on-policy nature, these methods usually need a lot of data and parameter tuning [13].

Model-based Meta-RL. These methods learn a predictive model, represented by a deep neural network, and find model parameters that are sensitive to variations among the tasks [16][17]. The action to take is then determined through optimization methods such as model-predictive control (MPC). Due to their model-based nature, these methods require less data and can adapt more quickly. However, they incur a relatively high computation cost for the online optimization process that produces actions [13].

Context-based Meta-RL. In this category, the meta-agents learn the latent variables of a distribution of tasks, which the policy uses to adapt to new tasks [6, 18, 19, 20, 21]. Our method falls into this category. It is worth noting that this approach allows off-policy learning. That is, the RL training does not require the latest updated policy to interact with the environment; rather, it can leverage experience from other policies, such as replay buffer samples from old policies as used in [19]. Examples in this category include [20][21], whose off-policy Meta-RL algorithms were developed by decoupling the task inference from the policy training.

Regardless of the category, most of the Meta-RL literature has focused on data efficiency and fast adaptability. However, in some applications the performance of the prior policy (i.e., prior to adaptation) is critical, because these applications cannot tolerate serious failures. For example, for automated vehicles, the factory-installed default setting can be considered the prior policy, while vehicles may be sold and driven in different cities/countries with different driving cultures/roadmanship, which can be considered different tasks (MDPs). In this case, the safety performance of the prior policy (factory setting) cannot be overlooked. We are not aware of any Meta-RL methods that explicitly consider the safety of the prior policy at the beginning of meta-testing, i.e., when it is first exposed to a new task in the real world. This lack of assurance of robust initial performance is a concern for applying Meta-RL to safety-critical tasks.

III Preliminaries

In this section, we will briefly review the formulation of our meta reinforcement learning problem, and introduce the PEARL algorithm which we build upon and enhance.

III-A Meta Reinforcement Learning

Following the traditional Meta-RL problem formulation [3], we consider a distribution of tasks $p(\mathcal{T})$ that we want our model to adapt to, where each task $\mathcal{T}_{i}$ is modeled as a Markov decision process (MDP) consisting of a set of states $S$, a set of actions $A$, a transition function $P_{i}$, a bounded reward function $R_{i}$, and a discount factor $\gamma_{i}$, i.e., $\mathcal{T}_{i}=(S,A,P_{i},R_{i},\gamma_{i})$, where the state space and action space are the same across tasks.

In Meta-RL, the goal is to train a meta-learner $\mathcal{M}(\cdot)$ that learns from a large number of tasks (known as meta-training tasks) drawn from a given distribution, with the hope that it can generalize to new tasks or new environments from the same distribution. In the meta-testing process, the meta-learner's ability to adapt to new tasks (known as meta-testing tasks) sampled from the same distribution is verified. In mathematical form, the meta-learner $\mathcal{M}$ with parameters $\theta$ is optimized by:

\mathcal{M}_{\theta}:=\arg\max_{\theta}\sum_{i=1}^{N}\mathcal{L}_{\mathcal{T}_{i}}(\phi_{i}), (1)
\phi_{i}=f_{\theta}(\mathcal{T}_{i}), (2)

where $f$ is the adaptation operation and $\mathcal{L}$ denotes the objective function, for which we use the expected discounted return:

\mathcal{L}(\theta)=\mathbf{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma_{i}^{t}r_{i}(t)\right] (3)

Facing a new task $\mathcal{T}_{i}$, the meta-learner adapts its parameters from $\theta$ to $\phi_{i}$ through the adaptation operation in Eq. 2. The adaptation function $f_{\theta}$ varies between methods. For example, in the MAML algorithm [3] it consists of gradient descent steps, while in PEARL it consists of encoding experience into, and inferring from, a posterior task embedding. The meta-learner optimizes its parameters to maximize the sum of objective values collected from performing the meta-training tasks with the corresponding adapted parameters $\phi_{i}$, as in Eq. 1.
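To make the interplay of Eqs. 1-3 concrete, the following is a minimal sketch of the generic meta-training loop they describe. The interfaces (`meta_learner.adapt`, `meta_learner.update`, `sample_tasks`, `evaluate_return`) are hypothetical placeholders for the method-specific adaptation operator $f_{\theta}$ and objective $\mathcal{L}$, not part of any particular library.

```python
def meta_train(meta_learner, sample_tasks, evaluate_return,
               num_iterations=1000, tasks_per_batch=16):
    """Generic meta-training loop corresponding to Eqs. 1-3 (hypothetical interfaces)."""
    for _ in range(num_iterations):
        tasks = sample_tasks(tasks_per_batch)                 # T_i ~ p(T)
        objectives = []
        for task in tasks:
            phi_i = meta_learner.adapt(task)                  # Eq. 2: phi_i = f_theta(T_i)
            objectives.append(evaluate_return(task, phi_i))   # Eq. 3: L_{T_i}(phi_i)
        meta_learner.update(sum(objectives))                  # Eq. 1: maximize the summed objective
    return meta_learner
```

In MAML, `adapt` would take gradient steps on the task; in PEARL, it would infer a posterior task embedding from collected context.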

III-B PEARL algorithm

The PEARL [6] algorithm is an off-policy Meta-RL algorithm that disentangles task inference and control. PEARL captures knowledge about how the current task should be performed in a latent probabilistic context variable $Z$, on which it conditions the policy as $\pi_{\theta}(\mathbf{a}|\mathbf{s},\mathbf{z})$ to adapt to the task. In the meta-training process, PEARL leverages data from a variety of training tasks to train a probabilistic encoder $q(\mathbf{z}|\mathbf{c})$ that estimates the posterior $p(\mathbf{z}|\mathbf{c})$, where $\mathbf{c}$ refers to context and $\mathbf{c}_{i}=(\mathbf{s}_{i},\mathbf{a}_{i},r_{i},\mathbf{s}_{i}^{\prime})$ denotes one transition tuple.

The objective optimized by PEARL is:

\mathbb{E}_{\mathcal{T}}\left[\mathbb{E}_{\mathbf{z}\sim q(\mathbf{z}|\mathbf{c}^{\mathcal{T}})}\left[R(\mathcal{T},\mathbf{z})+\beta D_{KL}\left(q(\mathbf{z}|\mathbf{c}^{\mathcal{T}})\,||\,p_{0}(\mathbf{z})\right)\right]\right], (4)

where $p_{0}(\mathbf{z})$ is a unit Gaussian prior over the context variable $Z$. The KL divergence term can be interpreted as the result of a variational approximation to an information bottleneck [22] that constrains $\mathbf{z}$ to contain only the information necessary to adapt to the task, mitigating overfitting.
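As an illustration of this information-bottleneck term, the snippet below computes the KL divergence between a diagonal-Gaussian posterior $q(\mathbf{z}|\mathbf{c})$ produced by an encoder and the unit-Gaussian prior $p_{0}(\mathbf{z})$ using PyTorch distributions. The latent dimension and the way the posterior parameters are produced are assumptions for illustration only.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical encoder outputs for one task: mean and std of the diagonal Gaussian q(z|c).
latent_dim = 5
mu = torch.randn(latent_dim)           # posterior mean produced by the context encoder
sigma = torch.rand(latent_dim) + 0.1   # posterior std, kept strictly positive

posterior = Normal(mu, sigma)                                    # q(z|c)
prior = Normal(torch.zeros(latent_dim), torch.ones(latent_dim))  # p_0(z), the unit Gaussian

# KL(q(z|c) || p_0(z)), summed over latent dimensions, as penalized in Eq. 4.
kl_term = kl_divergence(posterior, prior).sum()
print(kl_term.item())
```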

PEARL is built on top of the soft actor-critic (SAC) algorithm [23], which exhibits good sample efficiency and stability. The inference network (also known as the encoder) $q_{\phi}(\mathbf{z}|\mathbf{c})$ is optimized jointly with the actor $\pi_{\theta}(\mathbf{a}|\mathbf{s},\mathbf{z})$ and the critic $Q_{\theta}(\mathbf{s},\mathbf{a},\mathbf{z})$. Using the state-action value as the metric to train the encoder, the critic and actor losses can be written as Eq. 5 and Eq. 6, respectively:

\mathcal{L}_{critic}=\mathbf{E}_{\substack{(\mathbf{s},\mathbf{a},r,\mathbf{s}^{\prime})\sim\mathcal{B}\\ \mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{c})}}\left[Q_{\theta}(\mathbf{s},\mathbf{a},\mathbf{z})-\left(r+\bar{V}(\mathbf{s}^{\prime},\bar{\mathbf{z}})\right)\right]^{2}, (5)

where $\bar{V}$ is a target value network and $\bar{\mathbf{z}}$ means that gradients are not back-propagated through the network.

\mathcal{L}_{actor}=\mathbf{E}_{\substack{\mathbf{s}\sim\mathcal{B},\,\mathbf{a}\sim\pi_{\theta}\\ \mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{c})}}\left[D_{KL}\left(\pi_{\theta}(\mathbf{a}|\mathbf{s},\bar{\mathbf{z}})\,\Big|\Big|\,\frac{\exp\left(Q_{\theta}(\mathbf{s},\mathbf{a},\bar{\mathbf{z}})\right)}{\mathcal{Z}_{\theta}(\mathbf{s})}\right)\right], (6)

where the partition function $\mathcal{Z}_{\theta}$ normalizes the distribution; it does not contribute to the gradient with respect to the policy.

The data used to infer $q_{\phi}(\mathbf{z}|\mathbf{c})$ is distinct from that used to construct the critic loss. As illustrated in Fig. 1, the context data sampler $\mathcal{S}_{\mathbf{c}}$ samples uniformly from the most recently collected data batch, while the actor and critic are trained with data drawn uniformly from the entire replay buffer.
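A minimal sketch of the critic loss in Eq. 5 is given below, assuming hypothetical `q_net`, `target_v_net`, and `encoder` modules (this is not the authors' released code). The key details are that the context batch and the RL batch come from different samplers, and that the latent variable fed into the bootstrap target is detached so no gradient flows back into the encoder through the target.

```python
import torch
import torch.nn.functional as F

def pearl_critic_loss(q_net, target_v_net, encoder, context_batch, rl_batch, gamma=0.99):
    """Sketch of Eq. 5 with hypothetical network interfaces.

    context_batch comes from the context sampler S_c (recently collected data),
    while rl_batch = (s, a, r, s') is drawn from the entire replay buffer.
    """
    s, a, r, s_next = rl_batch
    z = encoder.sample_posterior(context_batch)   # z ~ q_phi(z|c); gradients reach the encoder
    q_pred = q_net(s, a, z)                       # Q_theta(s, a, z)
    with torch.no_grad():
        # Bootstrap target r + V_bar(s', z_bar); z is detached so the target does not
        # propagate gradients into the encoder (a discount factor is applied in practice).
        target = r + gamma * target_v_net(s_next, z.detach())
    return F.mse_loss(q_pred, target)
```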

During meta-testing, when faced with an unseen task, PEARL samples the context variable from the unit Gaussian prior and holds it constant within an episode to perform temporally-extended exploration. The latent context variable of the new task is then inferred from the gathered experience with the inference network $q_{\phi}(\mathbf{z}|\mathbf{c})$.
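The sketch below illustrates this meta-testing procedure under assumed interfaces for the environment, policy, and encoder (none of these names come from the PEARL codebase): the first episode acts on a prior sample of $\mathbf{z}$, and subsequent episodes act on the posterior inferred from the accumulated context.

```python
import torch

def pearl_meta_test(env, policy, encoder, latent_dim=5, num_episodes=3):
    """Sketch of PEARL-style meta-testing on an unseen task (hypothetical interfaces)."""
    context = []                         # transitions gathered in the new task
    z = torch.randn(latent_dim)          # first episode: sample z from the unit Gaussian prior
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy.act(s, z)         # z is held constant within the episode
            s_next, r, done = env.step(a)
            context.append((s, a, r, s_next))
            s = s_next
        z = encoder.sample_posterior(context)   # infer q_phi(z|c) for the next episode
    return z
```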

IV PEARL+ and the prior regularization term

Most Meta-RL algorithms proposed in the literature focus on optimizing the performance when adapting to specific tasks, but do not explicitly consider the performance before adaptation. For some applications, such as robots and automated vehicles, fatal mistakes are unacceptable, including before adaptation. In this section, we explain the details of our algorithm PEARL+, which is a direct enhancement of PEARL.

In PEARL's configuration, the prior over the context variable $\mathbf{z}$ is denoted as $\mathbf{z}_{0}$ and is assumed to follow a unit Gaussian distribution. However, this assumption does not hold for real tasks and environments. Since the original formulation (Eq. 4) only optimizes the posterior return, a policy trained in this way does not perform well under the prior assumption before adaptation when there is a task distribution shift. We propose to solve this problem by introducing a prior regularization term, and to compute it we create a new Q-network for state-action value estimation under the prior context assumption.

The prior regularization term is formulated as the expected return under a random context variable following the prior unit Gaussian distribution $p(\mathbf{z}_{0})$. The objective function of PEARL+ can then be formulated as:

\mathbb{E}_{\mathcal{T}}\Big[\mathbb{E}_{\mathbf{z}\sim q(\mathbf{z}|\mathbf{c}^{\mathcal{T}})}\left[R(\mathcal{T},\mathbf{z})+\beta D_{KL}\left(q(\mathbf{z}|\mathbf{c}^{\mathcal{T}})\,||\,p_{0}(\mathbf{z})\right)\right]+\alpha\,\mathbb{E}_{\mathbf{z}\sim p(\mathbf{z}_{0})}\left[R(\mathcal{T},\mathbf{z})\right]\Big], (7)

where $\alpha>0$ is the trade-off coefficient.

The policy is trained to maximize a return that considers both the prior (before adaptation) and the posterior (after adaptation) of the context variable. Similar to the exploration-exploitation trade-off, emphasizing the prior regularization term decreases performance under the posterior context. Note that the prior regularization term does not depend on the inference network, and therefore does not affect the training of the context encoder $q_{\phi}$. However, it allows us to train a policy that is more robust to a mismatch between the prior assumption and the real context.
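The following is a minimal sketch of how the prior regularization enters the actor update (cf. Eq. 7 and line 16 of Algorithm 1). The `actor_loss_fn` and `encoder` interfaces are hypothetical; the essential points are that $\mathbf{z}_{0}$ is drawn from the unit Gaussian prior and that neither term propagates gradients into the encoder.

```python
import torch

def pearl_plus_actor_loss(actor_loss_fn, rl_batch, encoder, context_batch,
                          latent_dim=5, alpha=0.1):
    """Sketch of the combined PEARL+ actor objective (hypothetical interfaces)."""
    z = encoder.sample_posterior(context_batch).detach()  # posterior context, detached from encoder
    z0 = torch.randn(latent_dim)                          # prior context z_0 ~ N(0, I)
    # Original posterior term plus the alpha-weighted prior regularization term.
    return actor_loss_fn(rl_batch, z) + alpha * actor_loss_fn(rl_batch, z0)
```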

Figure 1: The PEARL+ diagram

Based on the structure of PEARL, we introduce a new Q-network $\tilde{Q}_{\theta}$ to recover the state-action value under the prior context assumption $\mathbf{z}_{0}$. As shown in Fig. 1, the $\tilde{Q}_{\theta}$ network is updated with gradients computed from the loss $\tilde{\mathcal{L}}_{critic}$ (red line), using data sampled from the whole replay buffer. Similar to PEARL, the policy $\pi_{\theta}$ of PEARL+ is also built on top of the Soft Actor-Critic (SAC) algorithm, but it now has two optimization sources (both shown as green lines): the original $\mathcal{L}_{actor}(\mathbf{z})$ computed with the posterior context, and an additional $\mathcal{L}_{actor}(\mathbf{z}_{0})$ computed with the prior context. They share the same data sampled from the whole replay buffer. The encoder is optimized by $\mathcal{L}_{critic}$ (yellow line) and $D_{KL}$ (purple line), the same as in PEARL.

We summarize the meta-training procedure of PEARL+ in Algorithm 1. The meta-testing procedure is the same as in PEARL and is therefore not repeated here.

Algorithm 1 PEARL+ Meta-training
0:  Batch of training tasks $\{\mathcal{T}_{i}\}_{i=1,\dots,T}$ from $p(\mathcal{T})$, learning rates $\alpha_{1},\alpha_{2},\alpha_{3}$
1:  Initialize replay buffers $\mathcal{B}^{i}$ for each training task
2:  while not done do
3:     for each $\mathcal{T}_{i}$ do
4:        Initialize context $\mathbf{c}^{i}=\{\}$
5:        for $k=1,\dots,K$ do
6:           Sample $\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{c}^{i})$
7:           Gather data from $\pi_{\theta}(\mathbf{a}|\mathbf{s},\mathbf{z})$ and add to $\mathcal{B}^{i}$
8:           Update $\mathbf{c}^{i}=\{(\mathbf{s}_{j},\mathbf{a}_{j},\mathbf{s}^{\prime}_{j},r_{j})\}_{j=1,\dots,N}\sim\mathcal{B}^{i}$
9:        end for
10:     end for
11:     for step in training steps do
12:        for each $\mathcal{T}_{i}$ do
13:           Sample context $\mathbf{c}^{i}\sim\mathcal{S}_{C}(\mathcal{B}^{i})$ and RL batch $b^{i}\sim\mathcal{B}^{i}$
14:           Sample $\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{c}^{i})$
15:           Sample $\mathbf{z}_{0}\sim N(0,1)$
16:           $\mathcal{L}_{actor}^{i}=\mathcal{L}_{actor}(b^{i},\mathbf{z})+\alpha\mathcal{L}_{actor}(b^{i},\mathbf{z}_{0})$
17:           $\mathcal{L}_{critic}^{i}=\mathcal{L}_{critic}(b^{i},\mathbf{z})$
18:           $\tilde{\mathcal{L}}_{critic}^{i}=\tilde{\mathcal{L}}_{critic}(b^{i},\mathbf{z}_{0})$
19:           $\mathcal{L}_{KL}^{i}=\beta D_{KL}(q(\mathbf{z}|\mathbf{c}^{i})\,||\,p_{0}(\mathbf{z}))$
20:        end for
21:        $\phi\leftarrow\phi-\alpha_{1}\nabla_{\phi}\sum_{i}(\mathcal{L}_{critic}^{i}+\mathcal{L}_{KL}^{i})$
22:        $\theta_{\pi}\leftarrow\theta_{\pi}-\alpha_{2}\nabla_{\theta_{\pi}}\sum_{i}\mathcal{L}_{actor}^{i}$
23:        $\theta_{Q}\leftarrow\theta_{Q}-\alpha_{3}\nabla_{\theta_{Q}}\sum_{i}\mathcal{L}_{critic}^{i}$
24:        $\theta_{\tilde{Q}}\leftarrow\theta_{\tilde{Q}}-\alpha_{3}\nabla_{\theta_{\tilde{Q}}}\sum_{i}\tilde{\mathcal{L}}_{critic}^{i}$
25:     end for
26:  end while
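To make the gradient bookkeeping in lines 13-24 of Algorithm 1 concrete, the sketch below performs one PEARL+ update for a single task. All network and loss interfaces are hypothetical (this is not the released implementation), the encoder and the posterior critic share one optimizer for brevity, and the prior actor term is evaluated with the extra critic $\tilde{Q}$, consistent with its role of estimating the state-action value under the prior context.

```python
import torch

def pearl_plus_train_step(encoder, q_net, q_prior_net, actor,
                          critic_loss_fn, actor_loss_fn, kl_loss_fn,
                          opt_encoder_q, opt_q_prior, opt_actor,
                          context_batch, rl_batch, latent_dim=5, alpha=0.1, beta=0.1):
    """One PEARL+ update for a single task (hypothetical interfaces, cf. lines 13-24)."""
    z = encoder.sample_posterior(context_batch)     # z ~ q_phi(z|c^i)
    z0 = torch.randn(latent_dim)                    # z_0 ~ N(0, I)

    critic_loss = critic_loss_fn(q_net, rl_batch, z)               # L_critic: trains Q and the encoder
    kl_loss = beta * kl_loss_fn(encoder, context_batch)            # beta * D_KL(q(z|c) || p_0(z))
    prior_critic_loss = critic_loss_fn(q_prior_net, rl_batch, z0)  # L~_critic: extra Q-network only
    actor_loss = (actor_loss_fn(actor, q_net, rl_batch, z.detach())
                  + alpha * actor_loss_fn(actor, q_prior_net, rl_batch, z0))  # posterior + prior terms

    opt_encoder_q.zero_grad()
    (critic_loss + kl_loss).backward()              # reaches the encoder and Q_theta (lines 21, 23)
    opt_encoder_q.step()

    opt_q_prior.zero_grad()
    prior_critic_loss.backward()                    # updates only the extra critic Q~ (line 24)
    opt_q_prior.step()

    opt_actor.zero_grad()
    actor_loss.backward()                           # policy update (line 22)
    opt_actor.step()
```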

V Experiments and results

In this section, we introduce the three experiments used to examine the performance of the PEARL+ algorithm. All three examples are in the field of robotics and autonomous vehicles, and their prior safety performance (after training, but before adaptation to a new task) is important.

V-A Experimental setup

The three experiments are: two continuous control tasks with robots in the MuJoCo simulator[24], and one discrete decision-making problem in highway ramp merging.

V-A1 MuJoCo

Walker-2D-Vel, Ant-3D-Vel.

Figure 2: MuJoCo robots: ant (3D) and walker (2D)

We choose these two MuJoCo examples from [3] and [6] because their safety metrics are well defined, namely the health status. The health status evaluates how biomechanically reasonable the robot's pose is while moving. Specifically, a healthy pose means moving like human legs without jumping, falling, or leaning for the MuJoCo Walker, and moving like a four-legged insect without jumping or slipping for the MuJoCo Ant. An episode terminates when the pose becomes unhealthy or the maximum episode length is reached. These two locomotion tasks require adaptation to varying reward functions: the target walking velocity for Walker-2D-Vel, and the target moving direction and velocity for Ant-3D-Vel. For details of the environment configurations, please refer to [6].

V-A2 Self-driving

Highway Merge

Figure 3: Highway-Merging Scenario

Meta-RL has been applied to decision-making problems in autonomous driving [14][25][26], but none of the references we reviewed studied prior safety, which is an important issue for AVs.

As shown in Fig. 3, we consider a scenario for an automated vehicle: the ego vehicle (the green vehicle in Fig. 3) must perform a mandatory lane change to merge into the main lane before the merging area ends. Under normal conditions, the vehicles in the main lane follow the Intelligent Driver Model (IDM) [27] for their longitudinal dynamics. Since there is only one lane on the highway, lateral dynamics are not considered. When the ego vehicle is still on the ramp and has not yet merged, the main-lane vehicle right behind it (the yellow vehicle in Fig. 3) switches from the IDM car-following model to Hidas' model [28] to interact with the merging ego vehicle.

Figure 4: Hidas' merging model. The green, yellow, and blue vehicles are the subject, follower, and lead vehicles, respectively, with velocities $v_{s}$, $v_{f}$, and $v_{l}$.

In the Hidas model, the main-lane vehicle behind the merging ego vehicle assesses whether the “follow gap” (shown in Fig. 4) would be sufficient for merging (i.e., larger than a minimum gap) if it were to willingly yield and slow down. The willingness is characterized by a maximum braking $b_{f}$ and a maximum speed decrease $Dv$. If the gap would not be sufficient, the main-lane vehicle ignores the merging vehicle. The feasibility of the follower slowing down is calculated as follows:

Dt=Dv/b_{f} (8)
g_{f,sld}=g_{f}-(v_{f}Dt-b_{f}Dt^{2}/2)+v_{s}Dt (9)
g_{f,min}=g_{min}+\begin{cases}c(v_{f}-v_{s}),&\text{if }v_{f}>v_{s}\\ 0,&\text{otherwise}\end{cases} (10)

The follower vehicle slows down if $g_{f,sld}>g_{f,min}$; otherwise, it keeps following its lead vehicle on the main lane using the IDM model.

The parameters of Hidas' interactive model used in our experiment were calibrated from video data collected in [28] and are as follows:

  • Maximum speed decrease, $Dv=2.7\,\text{m/s}$ ($\sim 10\,\text{km/h}$).

  • Minimum safe constant gap, $g_{min}=2.0\,\text{m}$.

  • Acceptable gap parameter, $c=0.9$.

  • Acceptable deceleration rate, $b_{f}=1.5\,\text{m/s}^{2}$.
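Below is a minimal sketch of the feasibility check in Eqs. 8-10, using the calibrated parameters above as defaults; the function name and the example values are illustrative assumptions.

```python
def follower_yields(g_f, v_f, v_s, Dv=2.7, b_f=1.5, g_min=2.0, c=0.9):
    """Return True if the follower can feasibly slow down for the merging vehicle (Eqs. 8-10)."""
    Dt = Dv / b_f                                              # Eq. 8: time needed for the speed decrease
    g_f_sld = g_f - (v_f * Dt - b_f * Dt**2 / 2) + v_s * Dt    # Eq. 9: follow gap after slowing down
    g_f_min = g_min + (c * (v_f - v_s) if v_f > v_s else 0.0)  # Eq. 10: minimum acceptable gap
    return g_f_sld > g_f_min                                   # otherwise the follower ignores the merger

# Example: follower at 20 m/s, merging (subject) vehicle at 15 m/s, current follow gap 8 m.
print(follower_yields(g_f=8.0, v_f=20.0, v_s=15.0))
```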

This self-driving task requires the Meta-RL agent to adapt to varying transition probability functions. The MDP variations considered here are marked in blue in Fig. 3 and are sampled within the following ranges to represent different driving environments (a task-sampling sketch follows the list):

  • traffic density $p\in[30,50]$ veh/km;

  • traffic speed $v_{env}\in[50,70]$ mph.
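A minimal sketch of sampling one Highway-Merge task variant from these ranges is shown below; the dictionary keys are hypothetical names, not the actual environment parameters of our simulator.

```python
import random

def sample_merge_task(seed=None):
    """Sample one Highway-Merge MDP variant from the ranges above (hypothetical task spec)."""
    rng = random.Random(seed)
    return {
        "traffic_density_veh_per_km": rng.uniform(30.0, 50.0),
        "traffic_speed_mph": rng.uniform(50.0, 70.0),
    }

print(sample_merge_task(seed=0))
```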

The state space $S\subseteq R^{n}$ of the learning agent includes the ego-vehicle-related states [$D_{EOL}$, $y$, $v_{x}$, $v_{y}$] and the surrounding-vehicle-related states [$f_{e}^{i}$, $\Delta x^{i}$, $\Delta y^{i}$, $\Delta v_{x}^{i}$, $\Delta v_{y}^{i}$]. We consider the 4 nearest surrounding vehicles on the main lane, giving an overall state space $S\subseteq R^{24}$. The five discrete actions of the learning agent are: accelerate ($1.5\,m/s^{2}$), decelerate ($-1.5\,m/s^{2}$), cruise (IDM), left lane-change, and right lane-change.

The reward function is designed to consider multiple factors: relative velocity, distance to the front vehicle, merging safety (evaluated by the deceleration of the rear vehicle), penalty for collisions, and action cost:

R(s,a)=\alpha\left|v_{x}-v_{ego,target}\right|+\beta\left|\Delta x^{front}-\frac{1000}{\text{traffic density}}\right|+\gamma a_{rear,merge}+r_{collision}+r_{a}, (11)

where $\alpha$, $\beta$, and $\gamma$ are the weighting coefficients, and $a_{rear,merge}$ is the deceleration of the rear vehicle, applied only at the time step when the ego vehicle merges into the main lane.
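A sketch of computing the reward in Eq. 11 is given below. The coefficient values, sign conventions, and argument names are assumptions made only for illustration; the actual weights used in our experiments are not listed here.

```python
def merge_reward(v_x, v_ego_target, dx_front, traffic_density,
                 a_rear_merge, collided, action_cost,
                 alpha=-0.1, beta=-0.01, gamma=0.5, r_collision=-100.0):
    """Sketch of Eq. 11 (illustrative coefficients; negative weights act as penalties)."""
    r = alpha * abs(v_x - v_ego_target)                    # relative-velocity term
    r += beta * abs(dx_front - 1000.0 / traffic_density)   # spacing to the front vehicle
    r += gamma * a_rear_merge                              # merging safety: rear-vehicle deceleration,
                                                           # nonzero only at the merging time step
    r += r_collision if collided else 0.0                  # collision penalty
    r += action_cost                                       # action cost r_a
    return r
```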

V-B Safety and performance comparison

Figure 5: (a) Meta Training Results; (b) Meta Testing Results

The average episodic return on the meta-training tasks throughout training is shown in Fig. 5(a). Upon completion of meta-training, we further compare the performance of the different algorithms along the adaptation trajectories (see Fig. 5(b)). The unhealthy/crash rates corresponding to Fig. 5(b) are shown in Tables I(a), I(b), and I(c). The results are discussed below.

Prior safety and performance. From Fig. 5(b) and Tables I(a), I(b), and I(c) we can see that, even though PEARL and PEARL+ both show increasing reward with adaptation, PEARL+ has a much higher reward and a significantly lower unhealthy/crash rate before adaptation (using the policies conditioned on the prior inference of the task). As unhealthy poses and crashes carry a very heavy penalty in all three experiments, the substantially better performance implies improved safety.

Posterior adaptation and performance. Fig. 5(a) and Fig. 5(b) show that, in all experiments, the episodic reward of PEARL+ after sufficient adaptation is very close to or better than that of PEARL. In other words, the better prior performance of PEARL+ is achieved without sacrificing posterior performance compared with PEARL.

Training efficiency. From Fig. 5(a) we can see that PEARL+ is similar to PEARL in terms of sample efficiency and time efficiency, since both are built upon the same off-policy algorithm, SAC. PEARL+ outperforms MAML (an on-policy algorithm) in sample efficiency by a factor of 10 to 100 in our experiments.

In conclusion, PEARL+ improves the policy's prior performance without sacrificing its adaptation performance. In the robotics and autonomous driving tasks described above, PEARL+ learns a meta-policy that is safe before adaptation (the main improvement) and achieves similar or better performance during and after adaptation compared with the original PEARL algorithm.

TABLE I: Meta-Testing Unhealthy/Crash Rate Results

(a) Walker2D-vel: Unhealthy Rates
          Fine Tune   MAML     PEARL    PEARL+ (ours)
0 traj.   ~0.0%       2.00%    60.7%    ~0.0%
1 traj.   ~0.0%       1.67%    2.9%     ~0.0%
3 traj.   1.11%       2.67%    0.7%     ~0.0%
5 traj.   1.67%       1.33%    ~0.0%    ~0.0%

(b) Ant3D-vel: Unhealthy Rates
          Fine Tune   MAML     PEARL    PEARL+ (ours)
0 traj.   13.3%       3.33%    33.3%    ~0.0%
1 traj.   13.1%       6.67%    ~0.0%    ~0.0%
3 traj.   12.7%       6.67%    ~0.0%    ~0.0%
5 traj.   13.9%       3.33%    ~0.0%    ~0.0%

(c) Highway-Merge: Crash Rates
          Fine Tune   MAML     PEARL    PEARL+ (ours)
0 traj.   14.5%       19.0%    27.48%   ~0.04%
1 traj.   14.8%       15.3%    0.86%    ~0.00%
3 traj.   14.1%       18.3%    0.12%    ~0.00%
5 traj.   13.7%       19.7%    ~0.00%   ~0.00%

V-C Visualization of prior policy exploration

In this section, we present example simulation results to visualize how PEARL+ improves the decision-making process compared with PEARL, focusing on the Highway-Merge task.

Each panel in Fig. 6 plots the speed profiles of 50 trajectories, both before and after adaptation to a meta-testing task. The before-adaptation comparison between PEARL+ and PEARL shows that PEARL+'s exploration is more consistent and exhibits safe, reasonable behavior (accelerating to catch up to the vehicles driving in the main lane). After adaptation, the two algorithms perform similarly.

Figure 6: Exploration trajectories

To compare the safety of the prior policies, we also analyze in Fig. 7 the minimum braking required to avoid a crash at the moment the vehicle merges into the main lane. The minimum braking $b_{min}$ is computed with the following equations:

b_{min}=\max\{B_{min}(front,ego),\,B_{min}(ego,rear)\} (12)
B_{min}(veh_{0},veh_{1})=\frac{v_{1}^{2}-v_{0}^{2}}{2(x_{0}-x_{1})} (13)

where $x$ is the longitudinal position, $v$ is the velocity, and $front$, $rear$, $ego$ denote the front vehicle, rear vehicle, and ego vehicle, respectively.
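The computation in Eqs. 12-13 is sketched below; the example positions and velocities are made up for illustration.

```python
def b_min_pair(x_lead, v_lead, x_follow, v_follow):
    """Eq. 13: minimum braking required of the following vehicle given the leading vehicle."""
    return (v_follow**2 - v_lead**2) / (2.0 * (x_lead - x_follow))

def b_min_merge(x_front, v_front, x_ego, v_ego, x_rear, v_rear):
    """Eq. 12: the worse of the two vehicle pairs at the moment of merging."""
    return max(b_min_pair(x_front, v_front, x_ego, v_ego),
               b_min_pair(x_ego, v_ego, x_rear, v_rear))

# Example: ego merges 20 m behind the front vehicle and 15 m ahead of the rear vehicle.
print(b_min_merge(x_front=120.0, v_front=25.0, x_ego=100.0, v_ego=28.0,
                  x_rear=85.0, v_rear=30.0))
```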

Figure 7: Minimum braking ($b_{min}$) required to avoid an accident (1000 roll-outs in total). Green represents negative $b_{min}$, red represents positive $b_{min}$, and the grey dotted line is where $b_{min}=0$.

The magnitude of a positive value quantifies the braking effort required of the following vehicle (either the ego or the rear vehicle), so higher values imply a higher likelihood of a crash (shown in red in Fig. 7). In contrast, a negative $b_{min}$ means the following vehicle's speed is lower than that of the leading vehicle, so the risk of a crash is low (shown in green). We pay more attention to the cases with a positive $b_{min}$. From Fig. 7 we can see that PEARL has more high-risk cases than PEARL+.

TABLE II: Meta-testing results and training efficiency comparison of the ablation study

(a) Ant3D-vel Experiment
Ablation Algorithm   Return (Before Adpt.)   Return (After Adpt.)   Crash Rate (Before Adpt.)   Crash Rate (After Adpt.)   Epoch #¹
PEARL                45.48 ± 40.83           106.36 ± 20.74         33.33 ± 11.39%              ~0.00%                     ~275
PEARL+ (w/o)         92.47 ± 11.54           151.32 ± 23.86         ~0.00%                      ~0.00%                     ~300
PEARL+ (300-Node)    109.34 ± 10.18          166.10 ± 20.49         ~0.00%                      ~0.00%                     ~250
PEARL+ (100-Node)    104.26 ± 12.63          163.79 ± 19.58         ~0.00%                      ~0.00%                     ~220

(b) Highway-Merge Experiment
Ablation Algorithm   Return (Before Adpt.)   Return (After Adpt.)   Crash Rate (Before Adpt.)   Crash Rate (After Adpt.)   Epoch #¹
PEARL                -33.44 ± 3.41           67.28 ± 2.52           27.84 ± 9.37%               ~0.00%                     ~400
PEARL+ (w/o)         42.47 ± 18.13           54.19 ± 15.95          0.74 ± 0.95%                ~0.00%                     ~300 ± 100
PEARL+ (300-Node)    58.12 ± 4.59            69.52 ± 4.72           0.86 ± 0.75%                ~0.00%                     ~430
PEARL+ (100-Node)    56.47 ± 3.88            70.14 ± 3.26           0.86 ± 0.75%                ~0.00%                     ~400

  • ¹ The number of training epochs before the network weights converge.

  • * Each mean and standard deviation is calculated over 4 independent runs.

  • * Bold indicates a dominant advantage; underline indicates a dominant disadvantage.

V-D Ablation study

We conduct an ablation study to investigate the necessity of introducing an extra Q-network to estimate the state-action value under the prior context assumption. Because of the large time and computation cost, we only carry out the ablation study on one MuJoCo robotic task (Ant3D-vel) and the AV task (Highway-Merge). The four ablation algorithms we consider are as follows:

  • PEARL: the original PEARL algorithm, which does not include a regularization term on the expected return of the policy conditioned on the prior context assumption.

  • PEARL+ (w/o): this ablation introduces the regularization term for the policy under the prior context assumption, but without a new independent Q-network dedicated to state-action value estimation conditioned on the prior. In this setting, we calculate the loss $\tilde{\mathcal{L}}_{critic}$ using $Q_{\theta}(\mathbf{s},\mathbf{a},\mathbf{z}_{0})$.

  • PEARL+ (300-Node) and PEARL+ (100-Node): both are based on our proposed algorithm, including the regularization term in the objective function and a dedicated Q-network to compute it, but with different sizes of the extra Q-network (see the sketch after this list). PEARL+ (300-Node) uses a 3-layer, 300-hidden-node Fully Connected Network (FCN) as the extra Q-network, the same size as the Q-network used for SAC policy training, while the 100-Node setting uses a 3-layer, 100-hidden-node FCN.
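A sketch of such an extra Q-network is given below: a 3-layer FCN whose hidden width is the only difference between the 300-Node and 100-Node variants. The class name and input dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PriorQNetwork(nn.Module):
    """3-layer fully connected Q-network Q~(s, a, z_0); hidden width 300 or 100 in the ablation."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, z):
        # Concatenate state, action, and prior context sample along the feature dimension.
        return self.net(torch.cat([s, a, z], dim=-1))

# Example with hypothetical dimensions (e.g., the 24-dim Highway-Merge state).
q_tilde = PriorQNetwork(state_dim=24, action_dim=5, latent_dim=5, hidden=100)
```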

We use the same training hyperparameters for all ablation algorithms, and list the meta-testing results and training efficiency comparison in Table II. In the following discussion, PEARL+ refers to both PEARL+ (300-Node) and PEARL+ (100-Node).

Comparing the meta-testing results, we conclude that both PEARL+ and PEARL+ (w/o) are safe before and after adaptation. However, PEARL+ gives better performance before adaptation and better adaptation ability. Specifically, in the Highway-Merge experiment we found that PEARL+ (w/o) has higher variance than PEARL+, due to one run getting stuck in a local optimum. The advantage of PEARL+ over PEARL+ (w/o) may come from the extra Q-network sharing the burden of estimating state-action values, reducing the complexity of each value function. Moreover, the two-Q-network PEARL+ framework can be viewed as a division of labor between prior safety and adaptation ability: the original SAC Q-network is responsible for adaptation (together with the encoder), and the extra Q-network is responsible for prior safety before adaptation. This division might also reduce the sensitivity to the weighting coefficient $\alpha$, whose effect is discussed in Section V-E.

We further explore how the size of the extra Q-network influences safety and adaptation ability. Comparing the meta-testing returns of PEARL+ (100-Node) and PEARL+ (300-Node), we do not see an obvious performance difference, but PEARL+ (100-Node) tends to require fewer epochs to converge than PEARL+ (300-Node). This means we can save training time and memory by using a smaller network to estimate the prior state-action value.

Figure 8: Meta-testing returns of Highway-Merge
TABLE III: Meta-Testing crash rates of Highway-Merge
α       Before Adpt.   After Adpt.
5       ~0.0%          11.8%
1       0.5%           ~0.0%
0.1     2.5%           ~0.0%
0.01    35.0%          ~0.0%
0       37.5%          ~0.0%

V-E Effect of the weighting coefficient α\alpha

In this section, we investigate the choice of the weighting coefficient $\alpha$ of the additional optimization term in PEARL+. A larger $\alpha$ puts more emphasis on the performance of the policy under the prior, while a smaller $\alpha$ optimizes more for the posterior policy.

Fig. 8 indicates that an $\alpha$ that is too large can impair learning during adaptation (e.g., $\alpha=1$), or even prevent adaptation entirely ($\alpha=5$). On the other hand, an $\alpha$ that is too small does not achieve the desired prior policy safety, as shown in Table III ($\alpha=0.01$). Therefore, the weight needs to be chosen to balance prior safety and posterior adaptation. In our case, $\alpha=0.1$ is a good choice in all three experiments, but a different $\alpha$ may be needed for other applications.

VI Conclusion

In this paper, we propose a Meta-RL algorithm that considers pre-adaptation safety and robust performance on unseen tasks, an essential feature for safety-critical real-world applications. By introducing an additional regularization term on the expected return of the policy under the prior context assumption, and a corresponding additional critic network that returns the Q value conditioned on the prior context, our algorithm PEARL+ achieves superior performance (mostly through a better safety reward) when it is exposed to a new task for the first time. In addition, it preserves the fast adaptation of PEARL, the off-policy Meta-RL method our algorithm builds upon. Our algorithm has been validated on three example applications: two MuJoCo robotic tasks and a highway-merge autonomous driving task. These experiments show that PEARL+ achieves better robustness and safety, and the ablation study demonstrates the necessity of introducing the extra Q-network. Furthermore, the prior regularization term is not restricted to context-based Meta-RL algorithms and can be applied to other Meta-RL frameworks, which is a promising direction for future work.

References

  • [1] R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez, and K. Goldberg, “Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning,” arXiv preprint arXiv:1711.01503, 2017.
  • [2] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,” arXiv preprint arXiv:1611.05763, 2016.
  • [3] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International Conference on Machine Learning, pp. 1126–1135, PMLR, 2017.
  • [4] A. Antoniou, H. Edwards, and A. Storkey, “How to train your maml,” 2019.
  • [5] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.
  • [6] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen, “Efficient off-policy meta-reinforcement learning via probabilistic context variables,” in International conference on machine learning, pp. 5331–5340, PMLR, 2019.
  • [7] J. Garcıa and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
  • [8] L. Wen, J. Duan, S. E. Li, S. Xu, and H. Peng, “Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization*,” in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–7, 2020.
  • [9] D. Zügner and S. Günnemann, “Adversarial attacks on graph neural networks via meta learning,” arXiv preprint arXiv:1902.08412, 2019.
  • [10] H. Rafique, M. Liu, Q. Lin, and T. Yang, “Weakly-convex–concave min–max optimization: provable algorithms and applications in machine learning,” Optimization Methods and Software, pp. 1–35, 2021.
  • [11] L. Collins, A. Mokhtari, and S. Shakkottai, “Task-robust model-agnostic meta-learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 18860–18871, 2020.
  • [12] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “RL$^2$: Fast reinforcement learning via slow reinforcement learning,” arXiv preprint arXiv:1611.02779, 2016.
  • [13] S. Levine and C. Finn, “Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning.” http://tinyurl.com/icml-meta-slides, 2019.
  • [14] S. Zhang, L. Wen, H. Peng, and H. E. Tseng, “Quick learner automated vehicle adapting its roadmanship to varying traffic cultures with meta reinforcement learning,” arXiv preprint arXiv:2104.08876, 2021.
  • [15] R. Mendonca, A. Gupta, R. Kralev, P. Abbeel, S. Levine, and C. Finn, “Guided meta-policy search,” arXiv preprint arXiv:1904.00956, 2019.
  • [16] S. Belkhale, R. Li, G. Kahn, R. McAllister, R. Calandra, and S. Levine, “Model-based meta-reinforcement learning for flight with suspended payloads,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1471–1478, 2021.
  • [17] Z. Lin, G. Thomas, G. Yang, and T. Ma, “Model-based adversarial meta-reinforcement learning,” arXiv preprint arXiv:2006.08875, 2020.
  • [18] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, “Meta-reinforcement learning of structured exploration strategies,” arXiv preprint arXiv:1802.07245, 2018.
  • [19] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola, “Meta-q-learning,” arXiv preprint arXiv:1910.00125, 2019.
  • [20] L. Kirsch, S. van Steenkiste, and J. Schmidhuber, “Improving generalization in meta reinforcement learning using learned objectives,” arXiv preprint arXiv:1910.04098, 2019.
  • [21] W. Zhou, Y. Li, Y. Yang, H. Wang, and T. M. Hospedales, “Online meta-critic learning for off-policy actor-critic methods,” arXiv preprint arXiv:2003.05334, 2020.
  • [22] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410, 2016.
  • [23] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning, pp. 1861–1870, PMLR, 2018.
  • [24] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, IEEE, 2012.
  • [25] P. Wang, H. Li, and C.-Y. Chan, “Meta-adversarial inverse reinforcement learning for decision-making tasks,” arXiv preprint arXiv:2103.12694, 2021.
  • [26] Y. Jaafra, A. Deruyver, J. L. Laurent, and M. S. Naceur, “Context-aware autonomous driving using meta-reinforcement learning,” in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 450–455, IEEE, 2019.
  • [27] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations and microscopic simulations,” Physical review E, vol. 62, no. 2, p. 1805, 2000.
  • [28] P. Hidas, “Modelling vehicle interactions in microscopic simulation of merging and weaving,” Transportation Research Part C: Emerging Technologies, vol. 13, no. 1, pp. 37–62, 2005.