
Learning to drive from a world on rails

Dian Chen
UT Austin
   Vladlen Koltun
Intel Labs
   Philipp Krähenbühl
UT Austin
Abstract

We learn an interactive vision-based driving policy from pre-recorded driving logs via a model-based approach. A forward model of the world supervises a driving policy that predicts the outcome of any potential driving trajectory. To support learning from pre-recorded logs, we assume that the world is on rails, meaning neither the agent nor its actions influence the environment. This assumption greatly simplifies the learning problem, factorizing the dynamics into a nonreactive world model and a low-dimensional and compact forward model of the ego-vehicle. Our approach computes action-values for each training trajectory using a tabular dynamic-programming evaluation of the Bellman equations; these action-values in turn supervise the final vision-based driving policy. Despite the world-on-rails assumption, the final driving policy acts well in a dynamic and reactive world. At the time of writing, our method ranks first on the CARLA leaderboard, attaining a $25\%$ higher driving score while using $40\times$ less data. Our method is also an order of magnitude more sample-efficient than state-of-the-art model-free reinforcement learning techniques on navigational tasks in the ProcGen benchmark.

1 Introduction

Vision-based autonomous driving is hard. An agent needs to perceive, understand, and interact with its environment from incomplete and partial experiences. Most successful driving approaches [6, 30, 37, 38] reduce autonomous navigation to imitating an expert, usually a human actor. Expert actions serve as a source of strong supervision, sensory inputs of the expert trajectories explore the world, and policy learning reduces to supervised learning backed by powerful deep networks. However, expert trajectories are often heavily biased, and safety-critical observations are rare. After all, human operators drive hundreds of thousands of miles before observing a traffic incident [43]. This sparsity of safety-critical training data makes it difficult for a behavior-cloning agent to learn and recover from mistakes. Model-free reinforcement learning [27, 44] offers a solution, allowing an agent to actively explore its environment and learn from it. However, this exploration is even less data-efficient than behavior cloning, as it needs to experience mistakes to avoid them. For reinforcement learning, the required sample complexity for safe driving is prohibitively large, even in simulation [44].

Figure 1: We learn a reactive visuomotor driving policy that gets to explore the effects of its own actions at training time. The policy simulates the effects of its own actions using a forward model in pre-recorded driving logs. It then learns to choose safe actions without explicitly experiencing unsafe driving behavior. Picture selected from the Waymo open dataset [41].
(a) Forward model   (b) Bellman update   (c) Distillation
Figure 2: Overview of our approach. Given a dataset of offline driving trajectories of sensor readings, driving states, and actions, we first learn a forward model of the ego-vehicle (a). Using the offline driving trajectories, we then compute action-values under a predefined reward and learned forward model using dynamic programming and backward induction on the Bellman equation (b). Finally, the action-values supervise a reactive visuomotor driving policy through policy distillation (c). For a single image, we supervise the policy for all vehicle speeds and actions for a richer supervisory signal.

In this paper, we present a method to learn a navigation policy that recovers from mistakes without ever making them, as illustrated in Figure 1. We first learn a world model on static pre-recorded trajectories. This world model is able to simulate the agent’s actions without ever executing them. Next, we estimate action-value functions for all pre-recorded trajectories. Finally, we train a reactive visuomotor policy that gets to observe the impact of all its actions as predicted by the action-value function. The policy learns to avoid costly mistakes, or recover from them. We use driving logs, recorded lane maps and locations of traffic participants, to train the world model and compute the action-value function. However, our visuomotor policy drives using raw sensor inputs, namely RGB images and speed readings alone. Figure 2 provides an overview.

The core challenge in our approach is to build a sufficiently expressive and accurate world model that allows the agent to explore its environment and the impact of its actions. For autonomous driving, this involves modeling the autonomous vehicle and all other scene elements, such as other vehicles, pedestrians, traffic lights, etc. In its raw form, the state space in which the agent operates is too high-dimensional to effectively explore. We thus make a simplifying assumption: The agent’s actions only affect its own state, and cannot directly influence the environment around it. In other words: the world is “on rails”. This naturally factorizes the world model into an agent-specific component that reacts to the agent’s commands, and a passively moving world. For the agent, we learn an action-conditional forward model. For the environment, we simply replay pre-recorded trajectories from the training data.

The factorization of the world model lends itself to a simple evaluation of the Bellman equations through dynamic programming and backward induction. For each driving trajectory, we compute a tabular approximation of the value function over all potential agent states. We use this value function and the agent's forward model to compute action-value functions, which then supervise the visuomotor policy. The action values are computed over all agent states, and thus serve as a denser supervision signal for the same number of environment interactions. They provide the visuomotor policy with action targets for any camera viewpoint, vehicle speed, or high-level command augmentation.

We evaluate our method in the CARLA simulator [14]. On the CARLA leaderboard (https://leaderboard.carla.org/leaderboard/), we achieve a $25\%$ higher driving score than the prior top-ranking entry while using $40\times$ less training data. Notably, our method uses camera-only sensors, while some prior work relies on LiDAR. We also outperform all prior methods on the NoCrash benchmark [11]. Finally, we show that our method generalizes to other environments using the ProcGen platform [8]. Our method successfully learns navigational policies in the Maze and Heist environments with an order of magnitude fewer observations than baseline algorithms. Code and data are available at https://dotchen.github.io/world_on_rails.

2 Related Work

Imitation learning is one of the earliest and most successful approaches to vision-based driving and navigation. Pomerleau [35] pioneered this direction with ALVINN. Recent work extends imitation learning to challenging urban driving and navigation in complicated environments [31, 33, 1, 10, 11, 38, 27]. Imitation learning algorithms train on trajectories collected by human experts [10, 14, 35], or privileged experts constructed with rich sensory data [6, 33]. These approaches are limited to the expert’s observations and actions. In contrast, our work learns to drive from passive driving logs and integrates mental exploration into the learning process so as to imagine and learn from scenarios that were not experienced when the logs were collected.

Model-based reinforcement learning builds a forward model to help train the policy. Sutton [42], Gu et al. [16], Kalweit and Boedecker [22], Kurutach et al. [23] use a forward world model to generate imagined trajectories to improve the sample complexity. World models [32, 17, 19, 39] use the forward model to provide additional context to assist the learning agents’ decision making. Feinberg et al. [15], Buckman et al. [4] roll out the forward model for short horizons to improve the fidelity of their Q or value function approximation. In our work, we factorize the forward world model into the controllable ego-agent and a passively moving environment. This factorization significantly simplifies policy learning and allows for a tabular evaluation of the Q and value functions. Our idea of factorizing the agent and the environment is similar to the idea of exogenous events in policy learning [3]. Recently, Dietterich et al. [13], Chitnis and Lozano-Pérez [7] considered finding a minimal factorized MDP. In contrast, we explicitly factorize the environment and focus on leveraging the factorization for planning and supervision of a visuomotor policy.

Policy distillation remaps the outputs of a privileged agent to a visuomotor agent [6, 26, 33, 25]. Levine et al. [26] use optimal control methods to learn local controllers for robotic manipulation tasks, and use them to supervise a visuomotor policy. Pan et al. [33] train a visuomotor driving policy by imitating an MPC controller that has access to expensive sensors. Lee et al. [25] first learn a privileged policy using model-free RL, then distill a visuomotor agent. Chen et al. [6] distill a visuomotor agent from a policy learned by imitation on privileged simulator states. Our approach uses a similar privileged simulator state to infer an action-value function that supervises the final visuomotor policy. While prior work uses one policy to supervise another, in our work a tabular action-value function supervises the policy. A reactive driving policy only exists after distillation.

Cost-volume-based planners [45, 37, 5] score and rank a set of candidate future ego-vehicle trajectories. In tabular form, they closely resemble our action-value estimate. However, our action-value estimate has two advantages. First, it supervises a policy at training time in an offline process, while cost volumes need to be predicted at inference time [45, 37, 5]. Second, we make use of ground-truth states, while cost volumes use imitation [45] or affordances [37, 5] from partial observations.

3 Method

We aim to learn a reactive visuomotor policy $\pi(I)$ that produces an action $a\in\mathcal{A}$ for a sensory input $I$. At training time, we are given a set of trajectories $\tau\in D$. Each trajectory $\tau=\{(\hat{I}_{1},\hat{L}_{1},\hat{a}_{1}),(\hat{I}_{2},\hat{L}_{2},\hat{a}_{2}),\ldots\}$ contains a stream of sensor readings $\hat{I}_{t}$, corresponding driving logs $\hat{L}_{t}$, and executed actions $\hat{a}_{t}$. The hat symbol denotes data from driving logs; regular symbols denote free or random variables. The driving logs record the state (position, velocity, and orientation) of the ego-vehicle and all other traffic participants, as well as the environment state (lane information, traffic light state, etc.). We use the driving logs to compute a forward model $\mathcal{T}$ of the world and an action-value function $Q$ from a scalar reward. The forward model $\mathcal{T}$ takes a driving state $L_{t}$ and an agent's action $a_{t}$ and predicts the next state $L_{t+1}$. We use a hybrid semi-parametric model to estimate $\mathcal{T}$, as described in Section 3.1. Specifically, we factorize the forward model into an ego-vehicle component $\mathcal{T}^{ego}$ and a world component $\mathcal{T}^{world}$. We approximate the ego-vehicle forward model with a simple deep network, while the collected trajectories serve non-parametrically as the world forward model. This factorization allows us to estimate an action-value function using a tabular approximation of the Bellman equation, as described in Section 3.2. Finally, we use the estimated action-values $Q$ to distill a visuomotor policy $\pi$. This policy maximizes the expected return under our forward model and tabular action-value approximation. At training time, our algorithm uses privileged information, i.e., driving logs, to supervise policy learning, but the final policy $\pi(I_{t})$ drives from sensor inputs alone. Algorithm 1 and Figure 2 summarize the entire training process.

Data: Training trajectories $D$
Result: Policy $\pi(I)\in\mathcal{A}$
// Forward-model fitting §3.1
Function $\mathbf{FitForward}(D)\to\mathcal{T}^{ego}$:
   Minimize Equation (1);
   return ego-vehicle forward model $\mathcal{T}^{ego}$;
end
// Action-value estimate §3.2
Function $\mathbf{EstimateQ}(D,\mathcal{T}^{ego})\to Q$:
   for $\tau\in D$ do
      Initialize $V_{|\tau|+1}(\cdot)=0$;
      for $t=|\tau|\ldots 1$ do
         Compute $Q_{t}$ and $V_{t}$ using Equation (2);
         Store $Q_{t}$;
      end for
   end for
   return stored $Q$-values;
end
// Policy distillation §3.3
Function $\mathbf{DistillPolicy}(D,Q)\to\pi$:
   Minimize Equation (3);
   return visuomotor policy $\pi$;
end
Learn forward model $\mathcal{T}^{ego}=\mathbf{FitForward}(D)$;
Estimate action-values $Q=\mathbf{EstimateQ}(D,\mathcal{T}^{ego})$;
Learn visuomotor policy $\pi=\mathbf{DistillPolicy}(D,Q)$;
Algorithm 1: Learning in a world on rails

3.1 A factorized forward model

In its raw form, the forward model $\mathcal{T}$ is too complex to efficiently predict and simulate. After all, entire driving simulators are designed to forecast just one of the many possible future driving states. We thus factorize the driving state $L_{t}$ and forward model $\mathcal{T}$ into two parts: a part considering just the vehicle being controlled, $L^{ego}_{t+1}=\mathcal{T}^{ego}(L^{ego}_{t},L^{world}_{t},a_{t})$, and a part modeling the rest of the world, $L^{world}_{t+1}=\mathcal{T}^{world}(L^{ego}_{t},L^{world}_{t},a_{t})$. Here we consider only deterministic transitions. We furthermore assume that the world is on rails and cannot react to the agent's commands $a$ or the state of the ego-vehicle $L^{ego}$. Specifically, the transition of the world state depends only on the prior world state itself: $L^{world}_{t+1}=\mathcal{T}^{world}(L^{world}_{t})$. The initial state of the world $L^{world}_{0}$ thus determines the entire trajectory of the world $\{L^{world}_{1},L^{world}_{2},\ldots\}$. This allows us to model the world transition using the collected trajectories $\tau$ directly. We only need to model the ego-vehicle's forward model $\mathcal{T}^{ego}$ for any ego-vehicle state $L^{ego}_{t}$ and action $a_{t}$. We train $\mathcal{T}^{ego}$ on the collected trajectories using L1 regression

E_{\hat{L}^{ego}_{t:t+T},\hat{a}_{t}}\left[\sum_{\Delta=1}^{T}\left|{\mathcal{T}^{ego}}^{\Delta}(\hat{L}^{ego}_{t},\hat{a}_{t+\Delta-1})-\hat{L}^{ego}_{t+\Delta}\right|\right], (1)

where we roll out the forward model for $T=10$ steps to obtain a more robust regression target. We use a simple parametric bicycle model that easily generalizes beyond the training states $\hat{L}^{ego}$, as described in Section 4.
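As an illustration, here is a minimal PyTorch-style sketch of this rolled-out L1 regression. The `ego_forward` callable, tensor shapes, and the choice to penalize all state dimensions equally are assumptions made for the sketch, not the exact implementation (Appendix A describes the loss we actually use).

```python
import torch

def rollout_loss(ego_forward, ego_states, actions, T=10):
    """L1 regression over a T-step rollout of the ego forward model (Equation 1).

    ego_states: (B, T+1, 4) recorded ego states (x, y, theta, v)
    actions:    (B, T, 3)   recorded actions (steer, throttle, brake)
    ego_forward(state, action) -> next state, a differentiable callable
    """
    state = ego_states[:, 0]                              # start from the recorded state at time t
    loss = torch.zeros(())
    for delta in range(T):
        state = ego_forward(state, actions[:, delta])     # auto-regressive rollout
        loss = loss + (state - ego_states[:, delta + 1]).abs().mean()
    return loss / T
```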

The world-on-rails assumption clearly does not hold, either in a simulator or in the real world. Other agents in the world will react to the ego-vehicle and its actions. However, this does not imply that a world on rails cannot provide strong and useful supervision to the agent. Our experiments show that an agent trained in a world on rails significantly outperforms agents trained with a full forward model of the world. The world-on-rails assumption also significantly simplifies the estimation of an action-value function in Section 3.2 and the subsequent policy learning in Section 3.3.

3.2 A factorized Bellman equation

Our goal is to estimate an action-value function $Q(\hat{L}_{t},a)$ for each state $\hat{L}_{t}$ of the training trajectories and each action $a$. We use the Bellman equation and a tabular discretization of the value function. Recall the $\gamma$-discounted Bellman equation: $V(L_{t})=\max_{a}Q(L_{t},a)$ and $Q(L_{t},a)=\gamma V(\mathcal{T}(L_{t},a))+r(L_{t},a)$ for any state $L_{t}$, action $a$, and reward $r$. Ordinarily, one would need to resort to Bellman iterations to estimate $V$ and $Q$. However, our factorized forward model simplifies this:

V(L^{ego}_{t},\hat{L}^{world}_{t}) = \max_{a}Q(L^{ego}_{t},\hat{L}^{world}_{t},a)
Q(L^{ego}_{t},\hat{L}^{world}_{t},a_{t}) = r(L^{ego}_{t},\hat{L}^{world}_{t},a_{t}) + \gamma V(\mathcal{T}^{ego}(L^{ego}_{t},\hat{L}^{world}_{t},a_{t}),\hat{L}^{world}_{t+1}).

The action-value function is needed for all ego-vehicle states $L^{ego}$, but only for the recorded world states $\hat{L}_{t}^{world}$. It is thus sufficient to evaluate the value and action-value functions at the recorded world states for all ego-vehicle states: $\hat{V}_{t}(L^{ego}_{t})=V(L^{ego}_{t},\hat{L}^{world}_{t})$ and $\hat{Q}_{t}(L^{ego}_{t},a_{t})=Q(L^{ego}_{t},\hat{L}^{world}_{t},a_{t})$. Furthermore, the world states are strictly ordered in time, hence the Bellman equations simplify to

\hat{V}_{t}(L^{ego}_{t}) = \max_{a}\hat{Q}_{t}(L^{ego}_{t},a) (2)
\hat{Q}_{t}(L^{ego}_{t},a_{t}) = r(L^{ego}_{t},\hat{L}^{world}_{t},a_{t}) + \gamma\hat{V}_{t+1}(\mathcal{T}^{ego}(L^{ego}_{t},a_{t})).

Here the value and action-value functions only consider recorded world states, but all possible ego-vehicle states. The model is thus able to “imagine” driving behaviors and their reward without ever executing them. In order to collect rewards from these “imagined” states, we require an explicit reward function $r$, and not just a scalar reward signal provided by the environment. For a detailed discussion of the reward see Section 4.

We solve Equation (2) using backward induction and dynamic programming. The state of the ego-vehicle $L^{ego}$ is compact (position, orientation, and velocity). This allows us to compute a tabular approximation of the value function $V_{t}(L^{ego}_{t})$ that can be evaluated efficiently in batch operations. Specifically, we discretize $V_{t}(L^{ego}_{t})$ into bins corresponding to the position, orientation, and velocity of the ego-vehicle. At evaluation time, we use linear interpolation if the requested value falls between bins. Furthermore, the action space is also small, allowing for a discretization of the $\max$ operator in the value update. During backward induction, we implicitly represent the action-value function $Q_{t}$ using $V_{t+1}$ and the forward model $\mathcal{T}^{ego}$. We only discretize $Q_{t}(\hat{L}^{ego}_{t},\cdot)$ to supervise the visuomotor policy at timestep $t$. Algorithm 1 summarizes our backward induction. More details are provided in the supplementary material.
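The sketch below illustrates this backward induction for a single recorded trajectory. The `ego_forward` and `reward_fn` callables, the grid layout, and the nearest-bin lookup (the paper uses factorized linear interpolation instead) are simplifying assumptions for the sketch.

```python
import numpy as np

def nearest_bin_value(V, query_states, ego_grid):
    """Nearest-bin lookup of V; a stand-in for the factorized linear interpolation."""
    d = np.linalg.norm(query_states[:, None, :] - ego_grid[None, :, :], axis=-1)
    return V[d.argmin(axis=1)]

def backward_induction(ego_forward, reward_fn, ego_grid, actions, world_states, gamma=0.9):
    """Tabular evaluation of Equation (2) along one recorded trajectory.

    ego_grid:     (N, 4) array enumerating the discretized ego states (x, y, v, theta)
    actions:      (A, 3) array of discretized actions (steer, throttle, brake)
    world_states: recorded world states, strictly ordered in time
    Returns a list of Q_t arrays of shape (N, A), one per timestep.
    """
    V_next = np.zeros(len(ego_grid))                       # V_{|tau|+1} = 0
    Q_all = []
    for t in reversed(range(len(world_states))):
        Q_t = np.empty((len(ego_grid), len(actions)))
        for a_idx, a in enumerate(actions):
            next_ego = ego_forward(ego_grid, a)            # (N, 4) predicted ego states
            r = reward_fn(ego_grid, world_states[t], a)    # (N,) rewards at recorded world state
            Q_t[:, a_idx] = r + gamma * nearest_bin_value(V_next, next_ego, ego_grid)
        V_next = Q_t.max(axis=1)                           # V_t = max_a Q_t
        Q_all.append(Q_t)
    return Q_all[::-1]
```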

3.3 Policy Distillation

We use the action-value functions $Q_{t}(\hat{L}^{ego}_{t},\cdot)$ to supervise a visuomotor policy $\pi(\hat{I}_{t})$. The action-value $Q_{t}(\hat{L}^{ego}_{t},\cdot)$ represents the expected return of an optimal policy at each vehicle state. We directly optimize this expected return in our policy:

E_{\hat{L}^{world}_{t},L^{ego}_{t},\hat{I}_{t}}\left[\sum_{a}\pi(a|\hat{I}_{t})\,\hat{Q}_{t}(L^{ego}_{t},a)+\alpha H\left(\pi(\cdot|\hat{I}_{t})\right)\right]. (3)

Since the action-value functions are computed densely, only the environment needs to be recorded, not the ego state. We can therefore supervise the policy with augmented inputs $\hat{I}_{t}$ representing arbitrary ego states $L^{ego}_{t}$. We additionally add an entropy regularizer $H$ [18] to encourage a more diverse output policy, where $\alpha$ is a temperature hyperparameter. In practice, we discretize both the action-values and the visuomotor policy, as described below.
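A minimal PyTorch-style sketch of this distillation objective is shown below, assuming that the policy head and the action-values share the same discrete action bins; the negative sign turns the maximization of Equation (3) into a loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(policy_logits, q_values, alpha=1e-2):
    """Negative expected return under the policy, plus entropy regularization (Equation 3).

    policy_logits: (B, A) raw outputs of the visuomotor policy
    q_values:      (B, A) action-values Q_t from the tabular backward induction
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    pi = log_pi.exp()
    expected_return = (pi * q_values).sum(dim=-1)   # sum_a pi(a|I) Q_t(L_ego, a)
    entropy = -(pi * log_pi).sum(dim=-1)            # H(pi(.|I))
    return -(expected_return + alpha * entropy).mean()
```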

4 Implementation

Forward model.

We train the ego-vehicle forward model $\mathcal{T}^{ego}$ on a small subset of trajectories. We collect this subset so as to span the entire action space of the ego-vehicle: steering $s\in[-1,1]$ and throttle $t\in[0,1]$ are sampled uniformly, and brake $b\in\{0,1\}$ is sampled from a Bernoulli distribution. The forward model $\mathcal{T}^{ego}$ takes as input the current ego-vehicle state, i.e. 2D location $x_{t},y_{t}$, orientation $\theta_{t}$, and speed $v_{t}$, and predicts the next ego-vehicle state $x_{t+1},y_{t+1},\theta_{t+1},v_{t+1}$. We use a parameterized bicycle model as the structural prior for $\mathcal{T}^{ego}$. In particular, we only learn the vehicle wheelbases $f_{b},r_{b}$, the mapping from user steering $s$ to wheel steering $\phi$, and the mapping from throttle and braking to acceleration $a$. The kinematics of the bicycle model are described in the appendix for reference. We train $\mathcal{T}^{ego}$ in an auto-regressive manner using an L1 loss and stochastic gradient descent.

Bellman equation evaluation.

For each time-step $t$, we represent the value function $V_{t}$ as a 4D tensor discretized into $N_{H}\times N_{W}$ position bins, $N_{v}$ velocity bins, and $N_{\theta}$ orientation bins. We use $N_{H}=N_{W}=96$, $N_{v}=4$, and $N_{\theta}=5$. Each bin has a physical size of $\frac{1}{3}\times\frac{1}{3}$ m² and corresponds to a 2 m/s velocity range and a $38^{\circ}$ orientation range. The ego-vehicle state $\hat{L}^{ego}_{t}=(x_{t},y_{t},v,\theta)$ is always centered in this discretization. The position of the ego-vehicle $(x_{t},y_{t})$ is at the center of the spatial discretization. We only represent orientations in the range $[-95^{\circ},95^{\circ}]$ relative to the ego-vehicle. When computing the action-value function, any value $V_{t}$ that does not lie in the center of a bin is linearly interpolated among its $2^{4}$ neighboring bins. The interpolation is computed over all states at once and is factorized over the ego-state dimensions (location, speed, and orientation), and is thus efficient. Values that fall outside the discretization are 0. We discretize actions into $M_{s}\times M_{t}$ bins for steering and throttle respectively, with one additional bin for braking. We do not steer or throttle while braking. We use $M_{s}=9$ and $M_{t}=3$ for a total of $9\cdot 3+1=28$ discrete actions.
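A sketch of this discretization and interpolation, using SciPy's regular-grid interpolator; the exact axis ranges below are assumptions chosen to roughly match the bin sizes stated above.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Bin centers for the 4D value-function grid (ranges are illustrative).
xs = np.linspace(-16, 16, 96)                    # 96 position bins, roughly 1/3 m each
ys = np.linspace(-16, 16, 96)
vs = np.linspace(0, 8, 4)                        # 4 speed bins, 2 m/s each
thetas = np.deg2rad(np.linspace(-95, 95, 5))     # 5 orientation bins, ~38 degrees each

V_t = np.zeros((96, 96, 4, 5))                   # tabular value function at timestep t

# Linear interpolation between bins; states outside the grid evaluate to 0.
V_interp = RegularGridInterpolator((xs, ys, vs, thetas), V_t,
                                   method="linear", bounds_error=False, fill_value=0.0)

query = np.array([[1.2, -0.4, 3.1, 0.2]])        # predicted ego state (x, y, v, theta)
value = V_interp(query)                          # interpolated V_t at that state
```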

Policy network.

The policy network uses a ResNet34 [20] backbone to parse the RGB inputs. We use global average pooling to flatten the ResNet features, before concatenating them with the ego-vehicle speed and feeding this to a fully-connected network. The network produces a categorical distribution over the discretized action space.

In CARLA, the agent receives a high-level navigation command $c_{t}$ for each time-step. We supervise the visuomotor agent simultaneously on all the high-level commands [6]. Additionally, we task the agent to predict semantic segmentation as an auxiliary loss. This consistently improves the agent's driving performance, especially when generalizing to new environments. We use image data augmentations following [11, 6]. More details on the augmentations are in the appendix for reference.
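A sketch of this architecture is given below. The hidden width, the use of torchvision's ResNet34, and the exact head layout are assumptions for illustration; the auxiliary segmentation decoder is omitted.

```python
import torch
import torch.nn as nn
import torchvision

class PolicyNet(nn.Module):
    """Visuomotor policy: pooled ResNet34 features + ego speed -> per-command action logits."""

    def __init__(self, num_actions=28, num_commands=6):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        # Keep everything up to (and including) global average pooling; drop the classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.num_actions, self.num_commands = num_actions, num_commands
        self.head = nn.Sequential(
            nn.Linear(512 + 1, 256), nn.ReLU(),
            nn.Linear(256, num_commands * num_actions),
        )

    def forward(self, image, speed):
        f = self.features(image).flatten(1)           # (B, 512) pooled features
        x = torch.cat([f, speed[:, None]], dim=1)     # concatenate scalar ego speed
        logits = self.head(x)
        # One categorical action distribution per high-level navigation command.
        return logits.view(-1, self.num_commands, self.num_actions)
```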

Reward design.

The reward function $r(L_{t}^{ego},L_{t}^{world},a_{t},c_{t})$ considers the ego-vehicle state, world state, action, and high-level command, and is computed from the driving log at each timestep. We use the lane information of the world and the high-level command to first compute the target lane of the ego-vehicle. The agent receives a reward of $+1$ for staying in the target lane at the desired position, orientation, and speed, and is smoothly penalized for deviating from the lane, down to a value of 0. If the agent is located in a “zero-speed” region (e.g., at a red light, or close to other traffic participants), it is rewarded for zero velocity regardless of orientation, and penalized otherwise, except in red-light zones. All “zero-speed” rewards are scaled by $r_{\text{stop}}=0.01$ to prevent the agent from disregarding the target lane. The agent receives a greedy reward of $r_{\text{brake}}=+5$ if it brakes in a zero-speed zone. To keep the agent from chasing braking regions, the braking reward cannot be accumulated. All rewards are additive. We found that with zero-speed zones and brake rewards, there is no need to explicitly penalize collisions. We compute the action-values over all high-level commands (“turn left”, “turn right”, “go straight”, “follow lane”, “change left”, or “change right”) for each timestep, and use multi-branch supervision [6] when distilling the visuomotor agent.
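A simplified sketch of this reward is shown below. The lane-keeping score, its coefficients, and the zone flags are stand-ins; the actual reward is computed from the recorded lane map and traffic state in the driving log.

```python
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float
    y: float
    yaw: float
    speed: float

def lane_keeping_score(ego, lane_y=0.0, lane_yaw=0.0, target_speed=6.0):
    """Smoothly decays from 1 (on the target lane at the desired pose and speed) towards 0."""
    penalty = 0.5 * abs(ego.y - lane_y) + 0.5 * abs(ego.yaw - lane_yaw) \
        + 0.1 * abs(ego.speed - target_speed)
    return max(0.0, 1.0 - penalty)

def reward(ego, brake, in_zero_speed_zone, in_red_light_zone,
           r_stop=0.01, r_brake=5.0):
    r = lane_keeping_score(ego)
    if in_zero_speed_zone:
        if ego.speed < 0.1:
            r += r_stop                   # reward stopping near hazards or red lights
        elif not in_red_light_zone:
            r -= r_stop                   # penalize motion, except inside red-light zones
        if brake:
            r += r_brake                  # greedy brake bonus (non-accumulating in the full method)
    return r
```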

5 Experiments

Dataset.

We evaluate our approach in the open-source CARLA simulator [14]. We train our ego-vehicle forward model on a small subset of trajectories consisting of 2,400 frames collected with random actions.

The bulk of our training set uses just passive sensor information $I$ and training logs $L$. For the CARLA leaderboard, we collect 1M frames, corresponding to roughly 69 hours of driving. For the NoCrash benchmark [11], we collect 270K frames. The dataset uses a privileged autopilot $\pi_{b}$. However, unlike imitation learning, we do not store the controls of the ego-vehicle autopilot. The RGB image is collected and stitched from three front-facing cameras, all mounted at $x=1.5$ m, $z=2.4$ m in the ego-vehicle frame. Each camera has a $60^{\circ}$ FOV; the side cameras are angled at $55^{\circ}$. For the CARLA leaderboard, we additionally use a telephoto camera with a $50^{\circ}$ FOV to capture distant traffic lights. To augment the dataset, we additionally mount two side camera suites with the same setup, each mounted as if the vehicle were angled at $\pm 30^{\circ}$, following Bojarski et al. [2]. For the CARLA leaderboard, we collect our dataset in the 8 public towns under a variety of weather conditions. For the NoCrash benchmark, we collect our entire dataset in Town1 under the four training weathers specified by the CARLA benchmark [14, 10].

Experimental setup.

We evaluate our approach on both the CARLA leaderboard and the NoCrash benchmark. In both benchmarks, at each frame, the agent receives an RGB camera reading $I$, a speed reading $v$, and a high-level command $c$, and computes steering $s$, throttle $t$, and brake $b$.

For the CARLA leaderboard, agents are asked to navigate to specified goals through a variety of areas, including freeways, urban scenes, and residential districts, and in a variety of weather conditions. The agents face challenging traffic situations along the route, including lane merging/changing, negotiations, traffic lights, and interactions with pedestrians and cyclists. Agents are evaluated in held-out towns in terms of a Driving Score metric that is determined by route completion and traffic infractions.

In the NoCrash benchmark, agents are asked to safely navigate to specified goals in an urban setting with intersections, traffic lights, pedestrians, and other vehicles in the environment. The NoCrash benchmark consists of three driving conditions, with traffic density ranging from empty to heavily packed with vehicles and pedestrians. Each driving condition has the same set of 50 predefined routes: 25 in the training town (Town1) and 25 in an unseen town (Town2). Agents are evaluated based on their success rates. A trial on a route is considered successful if the agent safely navigates from the starting position to the goal within a certain time limit. The time limit corresponds to the amount of time required to drive the route at a cruising speed of 55 km/h, excluding time spent stopping for traffic lights or other traffic participants. In addition, a trial is considered a failure and aborts if a collision above a preset threshold occurs, or the vehicle deviates from the route by a preset margin. Each trial is evaluated on six weathers, four of which are seen in training and two that are only used at test time. The four training weathers are “Clear noon”, “Clear noon after rain”, “Heavy raining noon”, and “Clear sunset”. The two test weathers are “Wet sunset” and “Soft rain sunset”. We use CARLA 0.9.10 for all experiments.

Method | DS ↑ | RC ↑ | IS ↑ | Data | LiDAR
CILRS [11] | 5.37 | 14.40 | 0.55 | - | ×
LBC [6] | 8.97 | 17.54 | 0.73 | - | ×
Transfuser [36] | 16.93 | 51.82 | 0.42 | 150K | ✓
IA [44] | 24.98 | 46.97 | 0.52 | 40M | ×
Rails | 31.37 | 57.65 | 0.56 | 1M | ×
Table 1: Comparison of the driving score (DS, main metric), route completion (RC), and infraction score (IS) on the CARLA leaderboard (accessed July 2021). For all three metrics, higher is better. Our method improves the driving score by $25\%$ relative to the prior state of the art [44] while using $40\times$ less data.
Task | Town | Weather | IA | LBC | Rails
Empty | train | train | 85 | 89 | 98
Regular | train | train | 85 | 87 | 100
Dense | train | train | 63 | 75 | 96
Empty | test | train | 77* | 86 | 94
Regular | test | train | 66* | 79 | 89
Dense | test | train | 33* | 53 | 74
Empty | train | test | - | 60 | 90
Regular | train | test | - | 60 | 90
Dense | train | test | - | 54 | 84
Empty | test | test | - | 36 | 78
Regular | test | test | - | 36 | 82
Dense | test | test | - | 12 | 66
Table 2: Comparison of the success rate of the presented approach (Rails) to the state of the art on NoCrash (LBC) and the winning entry of the 2020 CARLA Challenge (IA). All three methods are trained and evaluated on CARLA 0.9.10. IA uses all towns and all weathers to train; it thus does not have test weathers. Numbers marked with * indicate that the policy was trained on the test town. Additional route completion measurements are provided in the appendix for reference.

Comparison to the state of the art.

Table 1 compares the performance of the presented approach on the CARLA leaderboard. We list the three key metrics from the leaderboard: driving score (the primary summary measure used for ranking entries on the leaderboard), route completion, and infraction score. We compare to CILRS [11], LBC [6], Transfuser [36], and IA [44]. LBC is the state of the art on the NoCrash benchmark, and Transfuser is a very recent method utilizing sensor fusion. Both LBC and Transfuser are based on imitation learning. IA is the winning entry in the 2020 CARLA Challenge, and the prior leading entry on the CARLA leaderboard. IA is based on model-free reinforcement learning with Rainbow [21] and IQN [12]. In comparison to the prior leading entry (IA), we improve the driving score by $25\%$ while using $40\times$ less data.

Table 2 compares the performance on the CARLA NoCrash benchmark. We retrain LBC (the prior state of the art on NoCrash) on CARLA 0.9.10 using the same training data with augmented camera views as in our approach. To help LBC generalize, we found it important to train with additional semantic segmentation supervision. CARLA 0.9.10 features more complex visuals, and generalization to new weather conditions is harder. IA features two models, a published model trained on CARLA 0.9.6 Town1 alone, and a much stronger CARLA Challenge model (trained on CARLA 0.9.10). We compare to the stronger challenge model. However, this model was trained on many more towns, and under both training and testing weather conditions. It thus does not have held-out testing weathers. Our method outperforms LBC and IA on all 12 tasks and conditions. Furthermore, our method does not require expert actions anywhere in the training pipeline, unlike LBC. We outperform IA on all traffic scenarios in both towns, even though we train only on Town1.

Task | Town | DM | F-DM | CEM | Rails
Factorized world | | × | ✓ | ✓ | ✓
Straight | train | 37 | 44 | 100 | 100
Turn | train | 0 | 0 | 88 | 100
Straight | test | 44 | 52 | 100 | 100
Turn | test | 0 | 0 | 97 | 100
Empty | train | 0 | 0 | 88 | 98
Regular | train | 0 | 0 | 86 | 100
Dense | train | 0 | 0 | 72 | 96
Empty | test | 0 | 0 | 97 | 94
Regular | test | 0 | 0 | 84 | 89
Dense | test | 0 | 0 | 47 | 74
Table 3: Comparison of the success rate on the CoRL17 and NoCrash benchmarks under training weathers. We compare our full visuomotor agent with model-based baselines. Dreamer (DM) [19] trains a full world model, whereas the remaining baselines follow our factorization and use the same forward model $\mathcal{T}^{ego}$ as our approach. DM, F-DM, and CEM use privileged information (such as driving logs) at test time; our approach uses sensor readings alone. Nevertheless, our approach outperforms all baselines.

Ablation study.

Table 3 compares our visuomotor agent with other model-based approaches. All baselines optimize the same reward function described in Section 4. Dreamer (DM) [19] trains a full-fledged embedding-based world model, and uses it to backpropagate analytic gradients to the policy during rollouts. Building a full forward model of our driving scenarios can be challenging. To help this baseline, we give it access to driving logs both during training and testing. We additionally construct a variant, F-DM, which utilizes our factorized world model. F-DM replaces the full embedding-based world model with our ego forward model $\mathcal{T}^{ego}$. Akin to our method, it observes the pre-recorded world states and thus cannot backpropagate through a forward model of the world. F-DM still trains the policy the same way as DM, using imaginary differentiable rollouts. Since Dreamer is off-policy, we implement both DM and F-DM in an offline RL manner, and train both on the same dataset that is used to supervise our visuomotor agent. CEM is an MPC baseline that factorizes the world and uses the cross-entropy method [29] to search for the best actions. It uses our forward model, but cannot simulate the environment forward at test time; it assumes a static world. Like Dreamer, CEM has access to the driving log of the current timestep at test time, and it replans at every timestep over the most recent driving log. All baselines use privileged information (driving logs), whereas our method takes sensor inputs alone.
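For reference, a generic sketch of such a CEM planner over action sequences is shown below; the population sizes, the action parameterization, and the `ego_forward` / `reward_fn` callables are illustrative assumptions, not the exact baseline implementation.

```python
import numpy as np

def cem_plan(ego_forward, reward_fn, ego_state, world_state,
             horizon=5, pop=64, elites=8, iters=4):
    """Cross-entropy-method search over (steer, throttle) sequences, assuming a static world."""
    mu = np.zeros((horizon, 2))
    sigma = np.ones((horizon, 2))
    for _ in range(iters):
        plans = np.clip(np.random.normal(mu, sigma, size=(pop, horizon, 2)), -1.0, 1.0)
        returns = np.zeros(pop)
        for i, plan in enumerate(plans):
            state = ego_state
            for a in plan:
                state = ego_forward(state, a)
                # The recorded world_state is reused: the baseline cannot roll the world forward.
                returns[i] += reward_fn(state, world_state, a)
        elite = plans[np.argsort(returns)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu[0]   # execute the first action, then replan at the next timestep
```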

We evaluate under the training weathers for our method, as the driving logs used by the baselines are weather-agnostic (the physics in CARLA do not vary with weather; only sensor readings change under different weather conditions). We found that the NoCrash benchmark is too hard for the Dreamer baseline, and thus additionally test on the much easier CoRL17 benchmark [14]. Akin to NoCrash, each task in the CoRL17 benchmark contains 50 predefined routes: 25 in the training town and 25 in an unseen test town. It runs on empty roads with simpler routes than NoCrash. Our method outperforms all other model-based baselines on almost all tasks by a clear margin, despite using sensor inputs instead of driving logs. Dreamer with a factorized world model outperforms the full world model but still fails to generalize beyond straight driving. One reason for the poor performance of Dreamer may be a bias in the training set: cars mostly drive straight. Dreamer may simply see too few turning scenarios compared to the endless stream of straight driving.

(a) RGB camera   (b) Map   (c) Value map   (d) Action-value
Figure 3: Visualization of the computed value function and action-value function for the current frame. The RGB camera image (a) and bird's-eye view maps (b) show the ego-vehicle location in the world. The value maps (c) show the discretized tabular value estimate for 4 speed bins $\times$ 5 orientation bins. The orientation bins range from $-95^{\circ}$ to $95^{\circ}$ from left to right, and the speed bins range from 0 m/s to 8 m/s from top to bottom. Each map has a resolution of $96\times 96$, corresponding to a 24 m² area around the vehicle. We crop areas behind the ego-vehicle for visualization. The value maps use 5 Bellman updates and see 1.25 s into the future. (d) shows the action-values for the current ego-vehicle state. Actions with the highest values are highlighted with red boxes. These action-values supervise the visuomotor policy that takes camera RGB images as input.
CA | SA | Train town, train weather (E/R/D) | Train town, test weather (E/R/D) | Test town, train weather (E/R/D) | Test town, test weather (E/R/D)
× | × | 87 / 82 / 82 | 60 / 78 / 82 | 85 / 80 / 63 | 68 / 54 / 42
× | ✓ | 97 / 97 / 92 | 78 / 82 / 80 | 92 / 91 / 64 | 66 / 72 / 58
✓ | × | 100 / 98 / 90 | 92 / 94 / 76 | 90 / 82 / 60 | 78 / 62 / 48
✓ | ✓ | 98 / 100 / 96 | 90 / 90 / 84 | 94 / 89 / 74 | 78 / 82 / 66
Table 4: Comparison of success rate in the NoCrash benchmark under different ablation conditions, reported as Empty / Regular / Dense (E/R/D) traffic. CA stands for “camera augmentation” and SA stands for “speed augmentation”. All ablation models are trained on the same dataset and evaluated on CARLA 0.9.10. CA models additionally train on two augmented camera views per dataset frame.
Town | Weather | w/o aux. loss | w/ aux. loss
train | train | 95 | 100
train | test | 70 | 100
test | train | 80 | 98
test | test | 46 | 76
Table 5: Comparison of success rate in the NoCrash benchmark on the empty traffic condition with and without the auxiliary semantic segmentation loss.

Table 4 compares different variants of our visuomotor agent at the distillation stage. CA stands for camera augmentation, meaning the model trains on the additional augmented camera images described in Section 5. SA stands for speed augmentation. An SA model is trained to predict action-values for all discretized speed bins, instead of taking as input the recorded speed reading from the dataset. At test time, an SA model uses linear interpolation to extract the action-values corresponding to the ego-vehicle speed. Models trained with camera or speed augmentation consistently outperform models that were not, showing the benefits of the dense action-values computed using our factorized Bellman updates. We therefore use camera and speed augmentation for our models on both the CARLA leaderboard and the NoCrash benchmark. With the augmented supervision extracted from the dense action-values, models perform well even without techniques such as trajectory noise injection [24, 11]. Results for models trained with injected steering noise are provided in the appendix for reference.

Table 5 compares our visuomotor agent, which is trained with an auxiliary semantic segmentation loss, with a simpler baseline that does not use this auxiliary loss. Policies trained with semantic segmentation consistently outperform the action-only baseline, especially under generalization settings. We observed the same for the LBC baseline, which also uses semantic segmentation as an auxiliary loss.

Traffic light infraction analysis.

We additionally analyze traffic light infractions on the NoCrash benchmark. Table 6 compares the average number of traffic light violations per hour on all trials in the NoCrash benchmark. The presented approach has fewer traffic light infractions than the reinforcement learning baseline (IA) on all six tasks under the training weathers.

Visualization.

Figure 3 shows a visualization of the computed value and action-value functions for various driving scenarios. Each of these action-value functions densely supervises the policy for the displayed image.

Oracle actions | | | × | ✓ | ×
Task | Town | Weather | IA | LBC | Rails
Empty | train | train | 3.34 | 1.35 | 0.00
Regular | train | train | 6.71 | 1.89 | 0.43
Dense | train | train | 15.41 | 3.27 | 2.61
Empty | test | train | 62.18 | 8.45 | 10.68
Regular | test | train | 53.28 | 8.22 | 6.95
Dense | test | train | 54.94 | 7.26 | 12.90
Empty | train | test | - | 0.36 | 0.00
Regular | train | test | - | 0.81 | 0.00
Dense | train | test | - | 0.52 | 4.29
Empty | test | test | - | 8.17 | 14.46
Regular | test | test | - | 8.61 | 11.30
Dense | test | test | - | 4.87 | 13.28
Table 6: Comparison of the average number of traffic light violations per hour of trials on the NoCrash benchmark. We compare our approach to LBC (prior state of the art on NoCrash) and IA (the winning entry of the 2020 CARLA challenge). LBC trains from oracle trajectories, whereas IA and ours do not.
(a) Maze, 2000 training levels   (b) Maze, 10000 training levels   (c) Heist, 2000 training levels   (d) Heist, 10000 training levels
Figure 4: Comparison of our method to state-of-the-art model-free reinforcement learning on the navigational tasks in the ProcGen benchmark. All plots measure the average episode returns on the testing levels. “PPO w/ priv” is a customized PPO implementation that during training additionally takes as input the same privileged information that our approach uses to compute rewards and train the agent forward model. The presented approach is an order of magnitude more sample-efficient.

ProcGen navigation.

To demonstrate the broad applicability of our approach, we additionally evaluate on the navigational tasks (Maze and Heist) in the ProcGen benchmark [8]. In both environments, the agent is rewarded for navigating to desired locations. Maze features a plain navigation task through a complex environment. Heist additionally requires the agent to collect keys and unlock doors before navigating to the goal. In ProcGen, the action space is discrete, hence we only discretize the ego-agent's states. Similar to CARLA, we discretize the agent state into $N_{H}\times N_{W}$ location bins and $N_{\theta}$ orientation bins. We use $N_{H}=N_{W}=32$ and $N_{\theta}=8$, and ignore velocity. Unlike in CARLA, the agent's forward dynamics in ProcGen are not location-agnostic. To address this, we use a small ConvNet to extract the environment context around the ego-agent for the forward model $\mathcal{T}^{ego}$. The ConvNet takes as input a cropped $13\times 13$ region around the ego-agent in the original $64\times 64$ RGB observations. The ConvNet features are concatenated with the agent orientation to predict the next ego-agent state under all discrete action commands. In order to evaluate sample efficiency, we implement our method on ProcGen in an off-policy reinforcement learning manner: we alternate between training or fine-tuning the policy and forward model, and rolling out new trajectories under the current policy. Compared to model-free baselines, our approach needs access to a dense reward function instead of just the scalar reward signal of the environment. We compute this reward function using semantic labels obtained via the ProcGen renderer. For Maze, the reward function awards $+1$ at the goal location regardless of orientation. For Heist, the reward function awards $+1$ at key and unlockable door locations regardless of orientation. In addition, we mask all unachievable ego-state values to 0 during the Bellman equation evaluation. We use this privileged information in the action-value computation only, and in no other place in our algorithm. Figure 4 compares the performance and sample-efficiency of our method with the model-free reinforcement learning baselines PPO [40] and PPG [9]. PPG is the current state of the art on the ProcGen benchmark. In addition, we compare to a customized PPO implementation which during training also takes as input the same privileged information used in our method. Our method converges within 3M frames, while the model-free baselines take up to 25M frames. For both the Maze and Heist environments, we train all agents under two different conditions: 2000 and 10000 (procedurally generated) training levels. For both environments, agents are tested on completely randomized procedurally generated levels. The comparison of average episode returns on the training levels is in the appendix for reference. Our method is an order of magnitude more sample-efficient than all the model-free RL baselines, even when those methods are given the same privileged information used by our reward computation.
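A sketch of this context-conditioned ego forward model is shown below; the layer sizes and the assumption of ProcGen's 15 discrete actions are illustrative choices, not the exact architecture.

```python
import torch
import torch.nn as nn

class ProcGenEgoModel(nn.Module):
    """Predicts the next discretized ego state under every discrete action (sketch)."""

    def __init__(self, num_actions=15, state_dim=3):   # state: (x, y, orientation)
        super().__init__()
        self.conv = nn.Sequential(                      # encode the 13x13 RGB crop around the agent
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 2 * 2 + 1, 128), nn.ReLU(),  # crop features + agent orientation
            nn.Linear(128, num_actions * state_dim),
        )
        self.num_actions, self.state_dim = num_actions, state_dim

    def forward(self, crop, orientation):
        f = self.conv(crop)                             # crop: (B, 3, 13, 13)
        x = torch.cat([f, orientation[:, None]], dim=1)
        return self.head(x).view(-1, self.num_actions, self.state_dim)
```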

6 Conclusion

We show that assuming independence between the agent and the environment, which we refer to as a world on rails, significantly simplifies modern reinforcement learning. While true independence rarely holds, the gains in training efficacy outweigh the modeling constraints. Even with a simple reward function, an agent trained in a world-on-rails learns to drive better than state-of-the-art imitation learning agents on standard driving benchmarks. In addition, the presented policy learning framework is an order of magnitude more sample-efficient than state-of-the-art reinforcement learning on challenging ProcGen navigation tasks.

Acknowledgements

We thank Yuke Zhu for his valuable feedback. We thank Tianwei Yin for his help with Figure 1. This work was supported by the NSF Institute for Foundations of Machine Learning and NSF award #1845485.

References

  • Bansal et al. [2019] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. In RSS, 2019.
  • Bojarski et al. [2016] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. 2016.
  • Boutilier et al. [1999] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. In JAIR, 1999.
  • Buckman et al. [2018] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In NeurIPS, 2018.
  • Casas et al. [2021] Sergio Casas, Abbas Sadat, and Raquel Urtasun. Mp3: A unified model to map, perceive, predict and plan. Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Chen et al. [2019] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In CoRL, 2019.
  • Chitnis and Lozano-Pérez [2020] Rohan Chitnis and Tomás Lozano-Pérez. Learning compact models for planning with exogenous processes. In CoRL, 2020.
  • Cobbe et al. [2019] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In arXiv preprint, 2019.
  • Cobbe et al. [2020] Karl Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In arXiv preprint, 2020.
  • Codevilla et al. [2018] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In ICRA, 2018.
  • Codevilla et al. [2019] Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In ICCV, 2019.
  • Dabney et al. [2018] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In ICML, 2018.
  • Dietterich et al. [2018] Thomas Dietterich, George Trimponias, and Zhitang Chen. Discovering and removing exogenous state variables and rewards for reinforcement learning. In ICML, 2018.
  • Dosovitskiy et al. [2017] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
  • Feinberg et al. [2018] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. In arXiv preprint, 2018.
  • Gu et al. [2016] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In ICML, 2016.
  • Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In NeurIPS, 2018.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In ICLR, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
  • Kalweit and Boedecker [2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In CoRL, 2017.
  • Kurutach et al. [2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In ICLR, 2018.
  • Laskey et al. [2017] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. In CoRL, 2017.
  • Lee et al. [2020] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain. In Science robotics, 2020.
  • Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. In JMLR, 2016.
  • Liang et al. [2018] Xiaodan Liang, Tairui Wang, Luona Yang, and Eric Xing. Cirl: Controllable imitative reinforcement learning for vision-based self-driving. In ECCV, 2018.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
  • Mannor et al. [2003] Shie Mannor, Reuven Y Rubinstein, and Yohai Gat. The cross entropy method for fast policy search. In ICML, 2003.
  • Mueller et al. [2018] Matthias Mueller, Alexey Dosovitskiy, Bernard Ghanem, and Vladlen Koltun. Driving policy transfer via modularity and abstraction. In CoRL, 2018.
  • Muller et al. [2006] Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann L Cun. Off-road obstacle avoidance through end-to-end learning. In NeurIPS, 2006.
  • Oh et al. [2017] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In NeurIPS, 2017.
  • Pan et al. [2018] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntak Lee, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Agile autonomous driving using end-to-end deep imitation learning. In RSS, 2018.
  • Polack et al. [2017] P. Polack, F. Altché, B. d’Andréa-Novel, and A. de La Fortelle. The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles? In IV, 2017.
  • Pomerleau [1989] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In NeurIPS, 1989.
  • Prakash et al. [2021] Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Sadat et al. [2020] Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In ECCV, 2020.
  • Sauer et al. [2018] A Sauer, N Savinov, and A Geiger. Conditional affordance learning for driving in urban environments. In CoRL, 2018.
  • Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. In Nature, 2020.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. In arXiv preprint, 2017.
  • Sun et al. [2020] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
  • Sutton [1991] Richard S Sutton. Planning by incremental dynamic programming. In PMLR, 1991.
  • Tefft [2017] Brian Tefft. Rates of motor vehicle crashes, injuries and deaths in relation to driver age, united states, 2014-2015. In AAA Foundation for Traffic Safety., 2017.
  • Toromanoff et al. [2020] Marin Toromanoff, Emilie Wirbel, and Fabien Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In CVPR, 2020.
  • Zeng et al. [2019] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2019.

Appendix

Appendix A Kinematic Bicycle Model

The kinematics of the bicycle model [34] used as $\mathcal{T}^{ego}$ in our CARLA experiments are described below:

\dot{x} = v\cos(\theta+\beta)
\dot{y} = v\sin(\theta+\beta)
\dot{v} = a
\dot{\theta} = \frac{v}{r_{b}}\sin(\beta)
\tan(\beta) = \frac{r_{b}}{f_{b}+r_{b}}\tan(\phi)

We train $\mathcal{T}^{ego}$ in an auto-regressive manner using an L1 loss and stochastic gradient descent:

J_{ego} = \sum_{t}^{T}\left(|x_{t}-\hat{x}_{t}|+|y_{t}-\hat{y}_{t}|\right) + \sum_{t}^{T}\left(|\cos(\theta_{t})-\cos(\hat{\theta}_{t})|+|\sin(\theta_{t})-\sin(\hat{\theta}_{t})|\right)

where $x_{t+1},y_{t+1},\theta_{t+1},v_{t+1}=\mathcal{T}^{ego}(x_{t},y_{t},\theta_{t},v_{t},a_{t})$ and $a_{t}=(s_{t},t_{t},b_{t})$. We only learn $\phi$ as a transform of $s$, $a$ as a transform of $(t,b)$, and the vehicle wheelbases $r_{b},f_{b}$. We use an action repeat of 5 frames, hence both data collection and planning operate at 4 FPS, whereas the simulator and the visuomotor policy run at 20 FPS.
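A discrete-time sketch of this parameterized model is shown below; the initial parameter values, the scalar steering and acceleration mappings, and the 0.25 s step (matching the 4 FPS planning rate) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BicycleModel(nn.Module):
    """Learnable kinematic bicycle model T_ego (illustrative sketch)."""

    def __init__(self, dt=0.25):                              # 4 FPS planning rate
        super().__init__()
        self.dt = dt
        self.front_wb = nn.Parameter(torch.tensor(1.0))       # f_b
        self.rear_wb = nn.Parameter(torch.tensor(1.0))        # r_b
        self.steer_gain = nn.Parameter(torch.tensor(0.4))     # user steering s -> wheel angle phi
        self.throttle_gain = nn.Parameter(torch.tensor(2.0))  # throttle -> acceleration
        self.brake_accel = nn.Parameter(torch.tensor(-4.0))   # braking deceleration

    def forward(self, x, y, theta, v, steer, throttle, brake):
        phi = self.steer_gain * steer
        beta = torch.atan(self.rear_wb / (self.front_wb + self.rear_wb) * torch.tan(phi))
        accel = torch.where(brake > 0.5, self.brake_accel, self.throttle_gain * throttle)
        x = x + v * torch.cos(theta + beta) * self.dt
        y = y + v * torch.sin(theta + beta) * self.dt
        theta = theta + v / self.rear_wb * torch.sin(beta) * self.dt
        v = torch.clamp(v + accel * self.dt, min=0.0)
        return x, y, theta, v
```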

Appendix B ProcGen Training Levels Returns

Figure 5 plots the average episode returns of our method against PPO [40], PPG [9], and PPO with access to privileged information.

(a) Maze, 2000 training levels   (b) Maze, 10000 training levels   (c) Heist, 2000 training levels   (d) Heist, 10000 training levels
Figure 5: Comparison of our method to state-of-the-art model-free reinforcement learning on the navigational tasks of the ProcGen benchmark. All plots measure the average episode returns on the training levels. The experimental setup follows Figure 4.

Appendix C Additional NoCrash Experiments

Table 7 additionally compares the route completion rates of the presented approach (Rails) to prior state-of-the-art on the CARLA NoCrash benchmark.

Table 8 compares a variant of our approach using noisy training trajectories (with Ornstein-Uhlenbeck noise [28]). This variant is most similar to data collected in LBC [6]. The experimental setup is equivalent to Table 4, just on noisy trajectories.

Task | Town | Weather | IA | LBC | Rails
Empty | train | train | 95.02 | 97.15 | 98.82
Regular | train | train | 94.72 | 96.38 | 100.00
Dense | train | train | 82.93 | 91.35 | 98.24
Empty | test | train | 88.87* | 92.41 | 98.91
Regular | test | train | 84.09* | 88.32 | 94.95
Dense | test | train | 63.63* | 74.84 | 88.89
Empty | train | test | - | 79.35 | 94.25
Regular | train | test | - | 79.20 | 93.03
Dense | train | test | - | 76.72 | 95.73
Empty | test | test | - | 62.47 | 84.72
Regular | test | test | - | 63.55 | 88.53
Dense | test | test | - | 44.99 | 80.75
Table 7: Comparison of the mean route completion rate on NoCrash. The experimental setup follows Table 2; numbers marked with * indicate that the policy was trained on the test town.
CA | SA | Train town, train weather (E/R/D) | Train town, test weather (E/R/D) | Test town, train weather (E/R/D) | Test town, test weather (E/R/D)
× | × | 82 / 88 / 83 | 70 / 70 / 64 | 67 / 72 / 51 | 44 / 62 / 40
× | ✓ | 95 / 91 / 97 | 90 / 96 / 94 | 81 / 80 / 64 | 54 / 74 / 52
✓ | × | 97 / 96 / 89 | 84 / 86 / 74 | 92 / 87 / 66 | 70 / 62 / 36
✓ | ✓ | 100 / 98 / 94 | 94 / 88 / 82 | 97 / 92 / 60 | 64 / 72 / 46
Table 8: Comparison of success rate in the NoCrash benchmark for models trained on noisy trajectories, collected with injected Ornstein–Uhlenbeck noise [28]. The metric and evaluation protocol are comparable to Table 4; the data-collection protocol follows LBC [6].

Appendix D Action-value Computation

In CARLA, we use a planning horizon of $H=5$ to subsample the trajectories during action-value computation. At each frame $t$, we compute and discretize the rewards from $t$ to $t+H-1$ around the ego-vehicle state at time $t$. We then compute the values and action-values for time $t$ using backward induction, as described in Section 3. In ProcGen, we use $H=30$.

Appendix E CARLA Controls

In CARLA, to ensure smooth control outputs from the discretized action space, we assume independence between steering and throttle, and use their softmax probabilities to compute smooth steering and throttle values. In particular, the sensorimotor policy predicts logits $\log\pi_{s}\in\mathbb{R}^{N_{s}}$, $\log\pi_{t}\in\mathbb{R}^{N_{t}}$, and $\log\pi_{b}\in\mathbb{R}$. During training we model $\log\pi(s,t,\mathbb{I}_{b})=(1-\mathbb{I}_{b})(\log\pi_{s}(s)+\log\pi_{t}(t))+\mathbb{I}_{b}\log\pi_{b}$. During testing, we use

s = \sum_{c}^{N_{s}}\pi_{s}(s_{c})\,s_{c}
t = \sum_{c}^{N_{t}}\pi_{t}(t_{c})\,t_{c}
b = \begin{cases}1, & \pi_{b}\geq t_{b}\\ 0, & \pi_{b}<t_{b}\end{cases}

We use $t_{b}=0.5$ in all our experiments. In addition, we apply a bang-bang controller on the throttle, i.e., we explicitly set the computed throttle to 0 if the vehicle speed exceeds a predefined threshold.
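A sketch of this conversion from discretized policy outputs to smooth CARLA controls; the sigmoid on the brake logit and the bin arrays are assumptions for illustration.

```python
import numpy as np

def smooth_controls(steer_logits, throttle_logits, brake_logit,
                    steer_bins, throttle_bins, speed, speed_limit, t_b=0.5):
    """Expected steering/throttle over bins, thresholded brake, bang-bang speed limiter."""
    p_s = np.exp(steer_logits - steer_logits.max())
    p_s /= p_s.sum()
    p_t = np.exp(throttle_logits - throttle_logits.max())
    p_t /= p_t.sum()
    steer = float(np.dot(p_s, steer_bins))          # s = sum_c pi_s(s_c) s_c
    throttle = float(np.dot(p_t, throttle_bins))    # t = sum_c pi_t(t_c) t_c
    brake = 1.0 / (1.0 + np.exp(-brake_logit)) >= t_b
    if brake or speed > speed_limit:                # bang-bang throttle above the speed limit
        throttle = 0.0
    return steer, throttle, float(brake)
```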

Appendix F CARLA leaderboard

Following Toromanoff et al. [44], we use an ensemble of 6 models to obtain more stable control for our top leaderboard submission.

Appendix G Training Hyperparameters

Table 9 provides a list of training hyperparameters for reference. In our CARLA experiments we use the following image augmentations: Gaussian Blur, Additive Gaussian Noise, Pixel Dropout, Multiply (scaling), Linear Contrast, Grayscale, and ElasticTransformation.

Hyperparameter | CARLA | ProcGen
Batch size | 128 | 128
Learning rate - ego model | 1e-2 | 3e-4
Learning rate - distillation | 3e-4 | 3e-4
Entropy loss scale ($\alpha$) | 1e-2 | 1e-2
Segmentation loss scale | 5e-2 | -
Table 9: Additional hyperparameters.