
World Model as a Graph:
Learning Latent Landmarks for Planning

Lunjun Zhang    Ge Yang    Bradly Stadie
Abstract

Planning, the ability to analyze the structure of a problem in the large and decompose it into interrelated subproblems, is a hallmark of human intelligence. While deep reinforcement learning (RL) has shown great promise for solving relatively straightforward control tasks, it remains an open problem how to best incorporate planning into existing deep RL paradigms to handle increasingly complex environments. One prominent framework, Model-Based RL, learns a world model and plans using step-by-step virtual rollouts. This type of world model quickly diverges from reality as the planning horizon increases, and thus struggles at long-horizon planning. How can we learn world models that endow agents with the ability to do temporally extended reasoning? In this work, we propose to learn graph-structured world models composed of sparse, multi-step transitions. We devise a novel algorithm to learn latent landmarks that are scattered (in terms of reachability) across the goal space as the nodes on the graph. In this same graph, the edges are the reachability estimates distilled from Q-functions. On a variety of high-dimensional continuous control tasks ranging from robotic manipulation to navigation, we demonstrate that our method, named $L^{3}P$, significantly outperforms prior work, and is oftentimes the only method capable of leveraging both the robustness of model-free RL and the generalization of graph-search algorithms. We believe our work is an important step towards scalable planning in reinforcement learning.


1 Introduction

An intelligent agent should be able to solve difficult problems by breaking them down into sequences of simpler problems. Classically, planning algorithms have been the tool of choice for endowing AI agents with the ability to reason over complex long-horizon problems (Doran & Michie, 1966; Hart et al., 1968). Recent years have seen an uptick in work examining the intersection of classical planning techniques – which excel at temporal abstraction – with deep reinforcement learning (RL) algorithms – which excel at state abstraction. Perhaps the ripest fruit born of this relationship is the AlphaGo algorithm, wherein a model-free policy is combined with Monte Carlo tree search (MCTS) (Coulom, 2006) to achieve superhuman performance on the game of Go (Silver et al., 2016a).

In the field of robotics, progress on combining planning and reinforcement learning has been somewhat less rapid, although still steady. Indeed, the laws of physics in the real world are vastly more complex than the simple rules of Go. Unlike board games such as chess and Go, which have known, deterministic dynamics and discrete action spaces, robots have to deal with a probabilistic and unpredictable world. Moreover, the action space in robotics is often continuous. As a result of these difficulties, planning in robotics presents a much harder problem. One general class of methods (Sutton, 1991) seeks to combine model-based planning and deep RL. These methods can be thought of as an extension of model-predictive control (MPC) algorithms, with the key difference being that the agent is trained on hypothetical experience in addition to the actually collected experience. The primary shortcoming of this class of methods is that, like MCTS in AlphaGo, they resort to planning with action sequences – forcing the robot to plan each individual action, often at the scale of hundreds of milliseconds. Planning on the level of action sequences is fundamentally bottlenecked by the accuracy of the learned dynamics model and the horizon of a task, as the learned world model quickly diverges over a long horizon. This limitation shows that world models in the traditional Model-based RL (MBRL) setting often fail to deliver on the promise of planning.

Another general class of methods, Hierarchical RL (HRL), introduces a higher-level learner to address the problem of planning (Dayan & Hinton, 1993; Vezhnevets et al., 2017; Nachum et al., 2018). In this case, a goal-based RL agent serves as the worker, and a manager learns what sequences of goals it must set for the worker to achieve a complex task. While this is apparently a sound solution to the problem of planning, hierarchical learners neither explicitly learn a higher-level model of the world nor take advantage of the graph structure inherent to the problem of search.

Figure 1: MBRL versus $L^{3}P$ (World Model as a Graph). MBRL does step-by-step virtual rollouts with the world model and quickly diverges from reality when the planning horizon increases. $L^{3}P$ models the world as a graph of sparse multi-step transitions, where the nodes are learned latent landmarks and the edges are reachability estimates. $L^{3}P$ succeeds at temporally extended reasoning. Code for $L^{3}P$ is available at: https://github.com/LunjunZhang/world-model-as-a-graph.

To better combine classical planning and reinforcement learning, we propose to learn graph-structured world models composed of sparse multi-step transitions. To model the world as a graph, we borrow a concept from the navigation literature – the idea of landmarks (Wang et al., 2008). Landmarks are essentially states that an agent can navigate between in order to complete tasks. However, rather than simply using previously seen states as landmarks, as is traditionally done, we instead develop a novel algorithm to learn the landmarks used for planning. Our key insight is that by mapping previously achieved goals into a latent space that captures the temporal distance between goals, we can perform clustering in the latent space to group together goals that are easily reachable from one another. We can then decode the latent centroids to obtain a set of goals scattered (in terms of reachability) across the goal space. Since our learned landmarks are obtained from latent clustering, we call them latent landmarks. The chief algorithmic contribution of this paper is a new method for planning over learned latent landmarks in high-dimensional continuous control domains, which we name Learning Latent Landmarks for Planning ($L^{3}P$).

The idea of reducing planning in RL to a graph search problem has enjoyed some attention recently (Savinov et al., 2018a; Eysenbach et al., 2019; Huang et al., 2019; Liu et al., 2019; Yang et al., 2020; Emmons et al., 2020). A key difference between those works and $L^{3}P$ is that our use of learned latent landmarks allows us to substantially reduce the size of the search space. What’s more, we refine the graph search module and the online planning algorithm to improve the robustness and sample efficiency of our method. As a result of these decisions, our algorithm achieves superior performance on a variety of robotics domains involving both navigation and manipulation. In addition to the results presented in Section 5, videos of our algorithm’s performance and a more detailed analysis of the sub-tasks discovered by the latent landmarks can be found at: https://sites.google.com/view/latent-landmarks/.

Figure 2: An overview of $L^{3}P$, which learns a small number of latent landmarks for planning. The main components of our method are: learning reachability estimates (via Q-learning and regression), learning a latent space (via an auto-encoder with reachability constraints), learning latent landmarks (via clustering in the latent space), graph search on the world model, and online planning.

2 Related Works

The problem of learning landmarks to aid in robotics problems has a long and rich history (Gillner & Mallot, 1998; Wang & Spelke, 2002; Wang et al., 2008). Prior art has been deeply rooted in the classical planning literature. For example, traditional methods would utilize Dijkstra’s algorithm (Dijkstra et al., 1959) to plan over generated waypoints, SLAM (Durrant-Whyte & Bailey, 2006) to simultaneously localize and build a map, or the RRT algorithm (LaValle, 1998) for explicit path planning. The A* algorithm (Hart et al., 1968) further improved the computational efficiency of Dijkstra’s algorithm. These types of methods often rely heavily on a hand-crafted configuration space that provides prior knowledge.

Planning is intimately related to model-based RL (MBRL), as the core ideas underlying learned models and planners can enjoy considerable overlap. Perhaps the clearest instances of this overlap are Model Predictive Control (MPC) and the related Dyna algorithm (Sutton, 1991). When combined with modern techniques (Kurutach et al., 2018; Luo et al., 2018; Nagabandi et al., 2018; Ha & Schmidhuber, 2018; Hafner et al., 2019; Wang & Ba, 2019; Janner et al., 2019), MBRL is able to achieve some level of success. (Corneil et al., 2018) and (Hafner et al., 2020) also learn a discrete latent representation of the environment in the MBRL framework. As discussed in the introduction, however, planning over action sequences will fundamentally struggle to scale in robotics.

Our method makes extensive use of a parametric goal-based RL agent to accomplish low-level navigation between states. This area has seen rapid progress recently, largely stemming from the success of Hindsight Experience Replay (HER) (Andrychowicz et al., 2017). Several improvements to HER augment the goal relabeling and sampling strategies to boost performance (Nair et al., 2018; Pong et al., 2018, 2019; Zhao et al., 2019; Pitis et al., 2020). There have also been attempts at incorporating search as inductive biases within the value function (Silver et al., 2016b; Tamar et al., 2016; Farquhar et al., 2017; Racanière et al., 2017; Lee et al., 2018; Srinivas et al., 2018). The focus of this line of work is to improve the low-level policy and is thus orthogonal to our work.

Recent work in Hierarchical RL (HRL) builds upon goal-based RL by learning a high-level parametric manager that feeds goals to the low-level goal-based agent (Dayan & Hinton, 1993; Vezhnevets et al., 2017; Nachum et al., 2018). This can be viewed as a parametric alternative to classical planning, as discussed in the introduction. Recently, (Jurgenson et al., 2020; Pertsch et al., 2020) have derived HRL methods that are intimately tied to tree search algorithms. These papers are further connected to a recent trend in the literature wherein classical search methods are combined with parametric control (Savinov et al., 2018a; Eysenbach et al., 2019; Huang et al., 2019; Liu et al., 2019; Yang et al., 2020; Emmons et al., 2020). Several of these articles will be discussed throughout this paper. LEAP (Nasiriany et al., 2019) also considers the problem of proposing sub-goals for a goal-conditioned agent: it uses a VAE (Kingma & Welling, 2013) and runs the cross-entropy method (CEM) on the prior distribution to form the landmarks. Our method instead constrains the latent space with temporal reachability between goals, a concept previously explored in (Savinov et al., 2018b), and uses latent clustering and graph search rather than sampling-based methods to learn and propose sub-goals.

3 Background

We consider the problem of Multi-Goal RL under a Markov Decision Process (MDP) parameterized by $(S, A, \mathds{P}, G, \Psi, R, \rho_0)$. $S$ and $A$ are the state and action spaces. The probability distribution of the initial states is given by $\rho_0(s)$, and $\mathds{P}(s'|s,a)$ is the transition probability. $\Psi: S \mapsto G$ is a mapping from the state space to the goal space, which assumes that every state $s$ can be mapped to a corresponding achieved goal $g$. The reward function $R$ can be defined as $R(s,a,s',g) = -\mathds{1}\{\Psi(s') \neq g\}$. We further assume that each episode has a fixed horizon $T$.
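As a concrete illustration (our own, not from the paper), the sparse multi-goal reward and the goal mapping $\Psi$ can be written as two small functions; the choice of goal dimensions and the tolerance used in place of exact equality are hypothetical.

```python
import numpy as np

def psi(state: np.ndarray) -> np.ndarray:
    """Psi: map a state to the goal it achieves.
    Hypothetical choice: the goal is the first three state dimensions
    (e.g., an object or end-effector position)."""
    return state[:3]

def sparse_reward(next_state: np.ndarray, goal: np.ndarray, tol: float = 0.05) -> float:
    """R(s, a, s', g) = -1{Psi(s') != g}, with an epsilon-ball test standing in
    for exact equality in a continuous goal space."""
    return 0.0 if np.linalg.norm(psi(next_state) - goal) < tol else -1.0
```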

A multi-goal policy is a probability distribution $\pi: S \times G \times A \rightarrow \mathbb{R}^{+}$, which gives rise to trajectory samples of the form $\tau = \{s_0, a_0, g, s_1, \cdots, s_T\}$. The purpose of the policy $\pi$ is to learn how to reach the goals drawn from the goal distribution $p_g$. With a discount factor $\gamma \in (0,1)$, it maximizes $\mathcal{J}(\pi) = \mathbb{E}_{g \sim p_g, \tau \sim \pi(g)}\big[\sum_{t=0}^{T-1} \gamma^t \cdot R(s_t, a_t, s_{t+1}, g)\big]$. Q-learning provides a sample-efficient way to optimize the above objective by utilizing off-policy data stored in a replay buffer $B$. $Q(s,a,g)$ estimates the reward-to-go under the current policy $\pi$, conditioned on the given goal. An additional technique, Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), uses hindsight relabelling to drastically speed up training. This relabelling crucially relies on the mapping $\Psi: S \mapsto G$ in the multi-goal MDP setting. We can write the joint objective of multi-goal Q-learning with HER as minimizing (with $Q$ being the online network and $\widehat{Q}$ being the target network):

\[
\Big(Q(s_t, a_t, g) - \big(R(s_{t+1}, g) + \gamma \cdot \widehat{Q}(s_{t+1}, a', g)\big)\Big)^2 \tag{1}
\]

where $\tau \sim B$, $t \sim \{0 \cdots T-1\}$, $(s_t, a_t, s_{t+1}) \sim \tau$, $k \sim \{t+1 \cdots T\}$, $g = \Psi(s_k)$, and $a' \sim \pi(\cdot \mid s_{t+1}, g)$.
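To make the relabelling scheme concrete, here is a minimal sketch (our own illustration, not the authors' released code) of how a single hindsight-relabelled TD target could be assembled from a stored trajectory; the buffer layout, network interfaces, and tolerance are assumptions.

```python
import random
import torch

def her_td_target(traj, q_target, policy, psi, gamma=0.98, tol=5e-2):
    """Build one hindsight-relabelled TD target for the loss in Eq. (1).

    traj:     list of (state, action, next_state) tensors for one episode.
    q_target: target network Q_hat(s, a, g) -> scalar tensor.
    policy:   goal-conditioned policy pi(s, g) -> action tensor.
    psi:      maps a state to the goal it achieves.
    """
    T = len(traj)
    t = random.randrange(T)                 # transition index in {0, ..., T-1}
    k = random.randrange(t, T)              # a future transition, used for relabelling
    s_t, a_t, s_tp1 = traj[t]
    g = psi(traj[k][2])                     # relabel with a goal that was actually achieved later
    # Sparse reward: 0 once the (relabelled) goal is reached, -1 otherwise.
    r = 0.0 if torch.norm(psi(s_tp1) - g) < tol else -1.0
    with torch.no_grad():
        a_prime = policy(s_tp1, g)
        target = r + gamma * q_target(s_tp1, a_prime, g)
    return s_t, a_t, g, target              # plug into the squared TD error of Eq. (1)
```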

4 The L3PL^{3}P Algorithm

Our overall objective in this section is to derive an algorithm that learns a small number of landmarks scattered across the goal space in terms of reachability and uses those learned landmarks for planning. There are three chief difficulties we must overcome when designing such an algorithm. First, how can we group together goals that are easily reachable from one another? The answer is to embed goals into a latent space where the latent representation captures a notion of temporal distance between goals – in the sense that goals that would take many timesteps to navigate between are further apart in latent space. Second, we need to find a way to learn a sparse set of landmarks used for planning. Our method performs clustering on the constrained latent space and decodes the learned centroids as the landmarks we seek. Finally, we need to develop a non-parametric planning algorithm responsible for selecting sequences of landmarks the agent must traverse to accomplish its high-level goal. The proposed online planning algorithm is simple, scalable, and robust.

4.1 Learning a Latent Space

Let us consider the following question: “How should we go about learning a latent space of goals where the metric reflects reachability?” Suppose we have an auto-encoder (AE) in the agent’s goal space, with deterministic encoder $f_E$ and decoder $f_D$. As usual, the reconstruction loss is given by $\mathcal{L}_{rec}(g) = \big\lVert f_D(f_E(g)) - g \big\rVert_2^2$. We want to make sure that the distance between two latent codes roughly corresponds to the number of steps it would take the policy to go from one goal to the other. Concretely, for any pair of goals $(g_1, g_2)$, we optimize the following loss $\mathcal{L}_{latent}(g_1, g_2)$:

\[
\Big(\big\lVert f_E(g_1) - f_E(g_2) \big\rVert_2^2 - \frac{1}{2}\big(V(g_1, g_2) + V(g_2, g_1)\big)\Big)^2 \tag{2}
\]

where $V: G \times G \rightarrow \mathbb{R}^{+}$ is a mapping that estimates how many steps it would take the policy $\pi$, on average, to go from one goal to another. By adding this constraint and solving the joint optimization $\mathcal{L}_{rec} + \lambda \cdot \mathcal{L}_{latent}$, the encoding-decoding mapping can no longer be arbitrary, giving more structure to the latent space. Goals that are close in terms of reachability will be naturally clustered in the latent space, and interpolations between latent codes will lead to meaningful results.
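For concreteness, a minimal PyTorch sketch of the joint objective $\mathcal{L}_{rec} + \lambda \cdot \mathcal{L}_{latent}$ under our own assumptions: the network sizes, the default $\lambda$, and the stop-gradient through $V$ are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ReachabilityAutoEncoder(nn.Module):
    """Goal auto-encoder whose latent distances are regressed onto reachability (Eq. 2)."""
    def __init__(self, goal_dim: int, latent_dim: int = 16):
        super().__init__()
        self.f_E = nn.Sequential(nn.Linear(goal_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.f_D = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, goal_dim))

    def loss(self, g1, g2, v_fn, lam: float = 1.0):
        z1, z2 = self.f_E(g1), self.f_E(g2)
        # Reconstruction term for both goals in the sampled pair.
        rec = ((self.f_D(z1) - g1) ** 2).sum(-1).mean() + ((self.f_D(z2) - g2) ** 2).sum(-1).mean()
        # Symmetrized reachability target from the V estimator (treated as a fixed target here).
        with torch.no_grad():
            target = 0.5 * (v_fn(g1, g2) + v_fn(g2, g1))
        latent = ((z1 - z2) ** 2).sum(-1) - target
        return rec + lam * (latent ** 2).mean()
```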

Of course, the constraint in Equation 2 is meaningless if we do not have a way to estimate the mapping $V$. We proceed towards this objective by noting the following interesting connection between multi-goal Q-functions and reachability. In the multi-goal RL framework considered in the background section, the reward is binary in nature: the agent receives a reward of $-1$ at every step until it reaches the desired goal, and $0$ afterwards. In this setting, the Q-function implicitly estimates the number of steps it takes to reach the goal $g$ from the current state $s$ after the action $a$ is taken. Denoting this quantity as $D(s,a,g)$, the Q-function can be re-written as:

\[
Q(s,a,g) = \sum_{t=0}^{D(s,a,g)-1} \gamma^t \cdot (-1) + \sum_{t=D(s,a,g)}^{T-1} \gamma^t \cdot 0 = -\frac{1 - \gamma^{D(s,a,g)}}{1 - \gamma} \tag{3}
\]

Choosing to parameterize Q-functions in this way disentangles the effect of $\gamma$ on multi-goal Q-learning. It also gives us direct access to the distance estimation function $D(s,a,g)$. We note that this distance is not a mathematical distance in the sense of a metric; rather, we use the word distance to refer to the number of steps the policy $\pi$ needs to take in the environment.
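One way to realize this parameterization is to have the critic predict $D(s,a,g) \geq 0$ and convert it to a Q-value through Equation 3. The sketch below is our own illustration; the architecture and the softplus used to keep $D$ non-negative are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceCritic(nn.Module):
    """Critic with Q(s, a, g) = -(1 - gamma^D) / (1 - gamma), where D(s, a, g) >= 0."""
    def __init__(self, state_dim: int, action_dim: int, goal_dim: int,
                 gamma: float = 0.98, hidden: int = 256):
        super().__init__()
        self.gamma = gamma
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def distance(self, s, a, g):
        # Softplus keeps the predicted number of remaining steps non-negative.
        return F.softplus(self.net(torch.cat([s, a, g], dim=-1))).squeeze(-1)

    def forward(self, s, a, g):
        d = self.distance(s, a, g)
        return -(1.0 - self.gamma ** d) / (1.0 - self.gamma)   # Equation 3
```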

Given our tractable estimate of $D$, it is now a straightforward matter to estimate the desired quantity $V$, which approximates how many steps it takes the policy to transition between goals. To get the desired estimate, we regress $V$ towards $D$ by minimizing

\[
\min_V \Big(D\big(s_t, a_t, \Psi(s_k)\big) - V\big(\Psi(s_{t+1}), \Psi(s_k)\big)\Big)^2 \tag{4}
\]

with $\tau \sim B$, $t \sim \{0 \cdots T-1\}$, $(s_t, a_t, s_{t+1}) \sim \tau$, $k \sim \{t+1 \cdots T\}$, and $\Psi$ being given by the environment to map the states to the goal space. One crucial detail is the use of $\Psi(s_{t+1})$ rather than $\Psi(s_t)$ in the inputs to $V$. This is due to the fact that $D: S \times A \times G \rightarrow \mathbb{R}$ outputs the number of steps to go after an action is taken, when the state has transitioned into $s_{t+1}$. The objective above provides an unbiased estimate of the average number of steps between two goals.
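A minimal sketch of this distillation step, reusing the hypothetical DistanceCritic from the previous sketch; the batch layout and interfaces are our assumptions.

```python
import torch

def v_regression_loss(v_net, critic, batch, psi):
    """Distill goal-to-goal reachability V from the distance head D (Eq. 4).

    batch: tuple (s_t, a_t, s_tp1, s_k) sampled as in the hindsight scheme,
           where s_k comes from later in the same trajectory.
    """
    s_t, a_t, s_tp1, s_k = batch
    g_k = psi(s_k)
    with torch.no_grad():
        # Target: remaining number of steps predicted by D after taking a_t.
        target = critic.distance(s_t, a_t, g_k)
    pred = v_net(psi(s_tp1), g_k)            # V(Psi(s_{t+1}), Psi(s_k))
    return ((pred - target) ** 2).mean()
```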

The estimates $D$ and $V$ will prove useful beyond helping to optimize the auto-encoder in Equation 2. They will prove essential in weighting and planning over latent landmark nodes in Section 4.3.

4.2 Learning Latent Landmarks

Planning on a graph can be expensive, as the number of edges can grow quadratically with the number of nodes. To combat this scalability issue, we use the constrained latent space to learn a sparse set of landmarks. A landmark can be thought of as a waypoint that the agent can pass through en route to a desired goal. Ideally, goals that are easily reachable from one another should be grouped to form a single landmark. Since our latent representation captures the temporal reachability between goals, this can be achieved by clustering in the latent space. The cluster centroids, once decoded, are precisely the latent landmarks we seek.

Clustering proceeds as follows. For $N$ clusters to be learned, we define a mixture of Gaussians in the latent space with $N$ trainable latent centroids $\{\textbf{c}_1 \cdots \textbf{c}_N\}$ and a shared trainable variance vector $\bm{\sigma}$. We maximize the evidence lower bound (ELBO) with a uniform prior $p(\textbf{c})$:

\[
\log p\big(z = f_E(g)\big) \geq \mathbb{E}_{q(\textbf{c} \mid z)}\big[\log p(z \mid \textbf{c})\big] - D_{KL}\big(q(\textbf{c} \mid z) \,\|\, p(\textbf{c})\big) \tag{5}
\]

Ideally, we would like each batch of data given to the latent clustering model to be representative of the whole replay buffer, so that the centroids quickly learn to scatter out. To this end, we propose to use the Greedy Latent Sparsification (GLS) algorithm (see the Appendix) on each batch of data sampled from the replay buffer before taking a gradient step with that batch. GLS is inspired by k-means++ (Arthur & Vassilvitskii, 2007), with several key differences: the sparsification process is used for both training and initialization, it uses a neural metric to determine the distance between data points, and it is compatible with mini-batch-style gradient-based training.
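For illustration, a sketch of the mixture-of-Gaussians clustering loss implied by Equation 5, assuming a uniform prior over centroids and a shared diagonal variance; the exact parameterization of $q(\textbf{c} \mid z)$ is our assumption.

```python
import math
import torch
import torch.nn as nn

class LatentClustering(nn.Module):
    """N trainable centroids with a shared diagonal variance; trained by maximizing Eq. (5)."""
    def __init__(self, n_landmarks: int, latent_dim: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_landmarks, latent_dim))
        self.log_sigma = nn.Parameter(torch.zeros(latent_dim))

    def neg_elbo(self, z):
        # log N(z | c_i, sigma^2) for every centroid, dropping the Gaussian constant.
        diff = z.unsqueeze(1) - self.centroids.unsqueeze(0)          # (batch, N, dim)
        log_p_z_given_c = -0.5 * ((diff / self.log_sigma.exp()) ** 2).sum(-1) - self.log_sigma.sum()
        # Posterior over centroids; with a uniform prior this is a softmax over log-likelihoods.
        log_q = torch.log_softmax(log_p_z_given_c, dim=-1)
        q = log_q.exp()
        recon = (q * log_p_z_given_c).sum(-1)                        # E_q [ log p(z | c) ]
        kl = (q * log_q).sum(-1) + math.log(q.shape[-1])             # KL( q(c|z) || uniform )
        return -(recon - kl).mean()                                  # minimize the negative ELBO
```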

4.3 Planning with Latent Landmarks

Having derived a latent encoding algorithm and an algorithm for learning latent landmarks, we at last turn our attention to search and planning. $L^{3}P$ is agnostic to the graph search algorithm being used. In practice, we use a variant of the Floyd algorithm, where the relaxation operations use a softmax rather than a hard max for better stability (see the Appendix for more details). To construct a weight matrix that provides raw distance estimates between latent landmarks, we begin by decoding the learned latent centroids into the nodes of the graph, $\{f_D(\textbf{c}_1) \cdots f_D(\textbf{c}_N)\}$. To build the graph, we add two directed edges, one in each direction, for every pair of latent landmarks. For an edge going from $f_D(\textbf{c}_i)$ to $f_D(\textbf{c}_j)$, the weight on that edge is $w_{i,j} = -V(f_D(\textbf{c}_i), f_D(\textbf{c}_j))$. Notice that the distances are negated. At the start of an episode, the agent receives a goal $g$, and we construct the matrix $W$:

\[
W = \begin{pmatrix}
0 & \dots & w_{1,N} & -V(f_D(\textbf{c}_1), g) \\
\vdots & \ddots & \vdots & \vdots \\
w_{N,1} & \dots & 0 & -V(f_D(\textbf{c}_N), g) \\
-\infty & \dots & -\infty & 0
\end{pmatrix} \tag{6}
\]
Algorithm 1 Online Planning in $L^{3}P$

Given: environment env, initial state $s$, goal $g$.

1: Cnt = 0. SubG = None.
2: Solve for $\bm{d}_{\bm{c} \rightarrow g}$ with graph search using $W$.
3: for $t = 1$ to $T$ do $\quad\triangleright$ One episode
4:   if Cnt $\geq 1.0$ then
5:     Cnt $\leftarrow$ Cnt $-$ 1
6:   else $\quad\triangleright$ We do not re-plan at every step
7:     Calculate $\bm{d}_{s \rightarrow \bm{c}}$.
8:     $\bm{d} \leftarrow \bm{d}_{s \rightarrow \bm{c}} + \bm{d}_{\bm{c} \rightarrow g}$
9:     $\triangleright$ Remove the immediately preceding landmark
10:    if SubG $\neq$ None then
11:      $\bm{d}[\texttt{SubG}] \leftarrow -\infty$
12:    end if
13:    SubG, Cnt $\leftarrow \operatorname{arg\,max} \bm{d}$, $-\max \bm{d}$
14:  end if
15:  $a \sim \pi(s, \texttt{SubG})$; $s \leftarrow$ env.step($a$).
16: end for
Figure 3: We consider two environments involving a fetch robot, a block, and a box. In Box-Distractor-PickAndPlace, the fetch must learn to pick and place the block while avoiding collision with the box. In Place-Inside-Box, the fetch must pick the block and place it inside the box. We visualize the fetch states corresponding to learned landmarks in the second row of images.

For online planning, when the agent receives a goal at the start of an episode, we use graph search to solve for $\bm{d}_{\bm{c} \rightarrow g}$ (which is fixed throughout the episode). For an observation state $s$, the algorithm calculates $\bm{d}_{s \rightarrow \bm{c}}$:

\[
\bm{d}_{s \rightarrow \bm{c}} = \begin{pmatrix}
-D\big(s, \pi(s, f_D(\textbf{c}_1)), f_D(\textbf{c}_1)\big) \\
\vdots \\
-D\big(s, \pi(s, f_D(\textbf{c}_N)), f_D(\textbf{c}_N)\big) \\
-D\big(s, \pi(s, g), g\big)
\end{pmatrix} \tag{7}
\]

The chosen landmark is $\texttt{subgoal} \leftarrow \operatorname{arg\,max}(\bm{d}_{s \rightarrow \bm{c}} + \bm{d}_{\bm{c} \rightarrow g})$. To further provide temporal abstraction and robustness, the agent is then asked to consistently pursue subgoal for $K = -\bm{d}_{s \rightarrow \bm{c}}[\texttt{subgoal}]$ steps, which is how many steps it thinks it will need. The proposed goal does not change during this period. In this way, $L^{3}P$ makes sure that the agent does not re-plan at every step, and this mechanism for temporal abstraction is crucial to its robustness. The mechanism is detailed in Algorithm 1.

After these $K$ steps, the agent decides on the next landmark to pursue by re-calculating $\bm{d}_{s \rightarrow \bm{c}}$, but the immediately preceding landmark is not considered as a candidate. The reasoning is that, if the agent has failed to reach a self-proposed landmark within the reachability limit it set for itself, it should try something new for the next sub-goal rather than stick to the previous landmark for another round. We have found that this simple rule helps the agent avoid getting stuck and improves its overall robustness.
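To make the planning loop concrete, here is a minimal sketch in the spirit of Equation 6 and Algorithm 1. The graph-search routine, the distance and reachability interfaces, and the environment API are placeholders that we assume; this is not the released implementation.

```python
import numpy as np

def build_weight_matrix(landmarks, goal, v_fn):
    """W from Eq. (6): negated reachability between decoded landmarks,
    with the episode goal appended as an extra node."""
    N = len(landmarks)
    W = np.zeros((N + 1, N + 1))
    for i in range(N):
        for j in range(N):
            if i != j:
                W[i, j] = -v_fn(landmarks[i], landmarks[j])
        W[i, N] = -v_fn(landmarks[i], goal)
    W[N, :N] = -1e6            # stand-in for -infinity: no edges leaving the goal node
    return W

def run_episode(env, policy, d_fn, v_fn, graph_search, landmarks, goal, horizon):
    """Online planning: re-plan only when the step budget for the current subgoal runs out."""
    W = build_weight_matrix(landmarks, goal, v_fn)
    d_c_to_g = graph_search(W)[:, -1]           # negated distance-to-goal for every node
    candidates = list(landmarks) + [goal]
    s, cnt, sub_idx = env.reset(), 0.0, None
    for _ in range(horizon):
        if cnt >= 1.0:
            cnt -= 1.0
        else:
            d_s_to_c = np.array([-d_fn(s, c) for c in candidates])
            d = d_s_to_c + d_c_to_g
            if sub_idx is not None:
                d[sub_idx] = -np.inf            # do not re-propose the landmark we just pursued
            sub_idx = int(np.argmax(d))
            cnt = -d_s_to_c[sub_idx]            # pursue it for as many steps as we think it takes
        s = env.step(policy(s, candidates[sub_idx]))
    return s
```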

In summary, we have derived an algorithm that learns a sparse set of latent landmarks scattered across goal space in terms of reachability, and uses those learned landmarks for robust planning.

5 Experiments and Evaluation

We investigate the impact of $L^{3}P$ in a variety of robotic manipulation and navigation environments. These include standard benchmarks such as Fetch-PickAndPlace, and more difficult environments such as AntMaze-Hard and Place-Inside-Box that have been engineered to require test-time generalization. Videos of our algorithm in action are available at: https://sites.google.com/view/latent-landmarks/.

Figure 4: Test-time success rate vs. total number of timesteps, on a variety of challenging robotic navigation and manipulation environments. $L^{3}P$ demonstrates better sample efficiency, higher asymptotic performance, and in some cases, the ability to generalize to longer horizons.
Figure 5: For both Point and Ant, during training, the initialization state distribution and the goal proposal distribution are uniform around the maze. During test time, the agent is asked to traverse the longest path in the maze, which is not seen during training. Importantly, the map of the environment is never given to the agent; the agent has to learn the structure of the environment purely through interaction. The success rate during test is reported in Figure 4. This environment demonstrates $L^{3}P$’s ability to generalize to longer-horizon goals during test time.
Figure 6: Visualizing the paths taken by SORB, MSS, and $L^{3}P$ on AntMaze at test time. The blue dots in the background are the landmarks learned by $L^{3}P$. The orange dot is the starting location of the Ant. The red dot is the final goal. The blue stars indicate the landmarks chosen by the planning algorithms. As illustrated above, $L^{3}P$ addresses two major failure modes of graph-based planning with RL. Firstly, graph-based methods tend to switch proposed subgoals too frequently and fall into a loop due to wormholes in distance estimates, whereas $L^{3}P$ leverages temporal abstraction in both landmark learning and online planning to avoid this pitfall. Secondly, when the agent pursues a subgoal unsuccessfully (due to obstacles, etc.), other methods tend to get stuck by continuing to propose the same subgoal, whereas $L^{3}P$ can adapt to the encountered failure and propose different subgoals in the event of getting stuck.

5.1 Baselines

We compare our method with a variety of baselines. HER (Andrychowicz et al., 2017) is a model-free RL algorithm. SORB (Eysenbach et al., 2019) is a method that combines RL and graph search by using the entire replay buffer. Mapping State Space (MSS; Huang et al., 2019) reduces the number of vertices by sub-sampling the replay buffer. $L^{3}P$, SORB, and MSS all use the same hindsight relabelling strategy proposed in HER. All of the domains are continuous control tasks, so we adopt DDPG (Lillicrap et al., 2015) as the learning algorithm for the low-level actor.

5.2 Generalization to Longer Horizons

The PointMaze-Hard and AntMaze-Hard environments introduced in Figure 6 are designed to test an agent’s ability to generalize to longer horizons. While PointMaze and AntMaze have been previously used in (Duan et al., 2016; Huang et al., 2019; Pitis et al., 2020), we make slight changes to those environments in order to increase their difficulty. We use a short, 200-timestep horizon during training and a $\rho_0$ that is uniform over the maze. At test time, we always initialize the agent at one end of the maze and set the goal at the other end. The horizon of the test environment is 500 steps. Crucially, no prior knowledge of the shape of the maze is given to the agent. We also set a much stricter threshold for determining whether an agent has reached the goal. In Figure 4, we see that $L^{3}P$ is the only algorithm capable of solving AntMaze-Hard consistently.

We observe an interesting trend where the success rates of some of the other graph-search methods crash and then slowly recover after making some initial progress. We postulate this occurs because methods that build the graph by using the entire replay buffer, or by sub-sampling it for landmark selection, will struggle as the buffer size increases. For instance, in the AntMaze-Hard environment, MSS and SORB use 400 and tens of thousands of landmarks respectively, whereas $L^{3}P$ obtains a lean graph that contains only 50 learnable landmarks. The result suggests that learning latent landmarks is significantly more sample-efficient and stable than either directly using or sub-sampling the replay buffer to build the graph. The online planning algorithm in $L^{3}P$, which effectively leverages temporal abstraction to improve robustness, also contributes to the asymptotic success rate. As explained in Figure 6, $L^{3}P$ successfully addresses the common failure modes of graph-based RL methods. The result convincingly shows that, at least on the navigation tasks considered, $L^{3}P$ is the most effective at taking advantage of the problem’s inherent graph structure (without any prior knowledge of the map or environment configuration) and generalizing to longer horizons at test time.

5.3 Robotic Manipulation Tasks

We also benchmark challenging robotic manipulation tasks with a Fetch robot, introduced in (Plappert et al., 2018; Andrychowicz et al., 2017). Besides the PickAndPlace task, we evaluate our method on two additional Fetch tasks involving a box on a table, as illustrated in Figure 3. In the Box-Distractor-PickAndPlace environment, the agent needs to perform the pick-and-place task with a box in the middle of the table serving as a distractor. The Place-Inside-Box environment aims to teach the agent to place an object, initialized at a random location, inside the box, and has a simple curriculum. During training, 80% of the goals are regular pick-and-place goals, enabling the agent to first learn how to fetch in general, while only 20% of the goals are inside the box, which is the harder part of the task. During testing, we evaluate the agent’s ability to pick up the object from the table and place it inside the box. Our method achieves dominant performance in both learning speed and test-time generalization on these three robotic manipulation environments. We note that on the manipulation tasks considered, many prior planning methods hurt the performance of the model-free agent; $L^{3}P$ is the only method able to help the model-free agent learn faster and perform better on all three tasks.

5.4 Understanding Model Choices in $L^{3}P$

We investigate $L^{3}P$’s sensitivity to different design choices and hyper-parameters via a set of ablation studies. More specifically, we study how the following four factors affect the performance of $L^{3}P$: the choice of graph search algorithm and the edge weight cutoff threshold (a key hyper-parameter in the graph search module); and the choice of online planning algorithm and the number of latent landmarks being learned (a key hyper-parameter in the planning module).

Figure 7: Ablation studies on the graph search module, including the choice of graph search algorithms and a key hyper-parameter in graph search: the edge weight cutoff threshold.

While $L^{3}P$ is agnostic to the graph search algorithm being used, we study the effect of two possible choices: the Floyd algorithm and a soft version of it (soft Floyd). As shown in Figure 7, the choice has a relatively small effect on learning. During the early phase of experimentation, we found that having a soft relaxation operation in Floyd leads to better overall training stability. A hard relaxation helps the learning curve take off faster but suffers from greater instability during policy improvement. The likely reason is that neural distance estimates are not entirely accurate, and in the presence of occasional bad edges, using a softmax rather than a hard max improves robustness. We therefore use soft relaxation in $L^{3}P$.

Figure 8: Ablation studies on the online planning module, including the choice of planners and a key hyper-parameter in graph-based planning: the number of nodes (landmarks).

In the graph search module, a very sensitive hyper-parameter is the edge weight cutoff threshold, denoted $d\_max$. This clipping threshold is commonly used in prior work such as (Savinov et al., 2018a; Eysenbach et al., 2019; Huang et al., 2019; Emmons et al., 2020). It means that if the weight of an edge is bigger than $d\_max$, it is set to infinity during the graph search process. The motivation for introducing this common hyper-parameter is two-fold. Firstly, we only trust distance estimates when they are local, because value iteration is inherently local. Secondly, we want the next sub-goal to be relatively nearby in terms of reachability. The $d\_max$ value determines the maximum (perceived) distance from the current state to the next proposed subgoal. As shown in Figure 7, our current approach is still quite sensitive to this hyper-parameter; changes to $d\_max$ can have a considerable impact on learning. As this weakness is common to this class of approaches, we believe that further research is required to discover more principled ways of encouraging the search results to be local.

For online planning, the $L^{3}P$ planner introduced in Algorithm 1 is essential to the success of $L^{3}P$. Our planning algorithm takes advantage of the temporal abstraction provided by the graph-structured world model. As previously shown in Figure 6, the design of the $L^{3}P$ planner avoids many common pitfalls. It does not re-plan at every step, but instead uses the reachability estimates to dynamically decide when to re-plan, striking a balance between adaptability and consistency in planning. The planner is also more tolerant of errors: it removes the immediately preceding landmark when it re-plans, so the agent is less prone to getting stuck. In Figure 8, we compare the $L^{3}P$ planner to a naive planner, which simply re-calculates the shortest path at every step. The result shows that our planning algorithm is crucial to the success of $L^{3}P$.

An important hyper-parameter in graph-based planning is the number of landmarks being used. Intuitively, since $L^{3}P$ learns the nodes on the graph, it should be robust to changes in the number of nodes (landmarks) being learned. In Figure 8, we show that this is indeed the case: $L^{3}P$ is robust to the number of latent landmarks. In contrast to prior methods, $L^{3}P$ is able to learn the nodes (landmarks) used for graph search from the agent’s own experience. We vary this hyper-parameter in the challenging AntMaze-Hard environment and find that $L^{3}P$ is robust across a variety of values. This is expected, because the landmarks in the latent space of $L^{3}P$ are encouraged to be evenly scattered across the goal space according to the learned reachability metric. As the number of landmarks decreases, the learning procedure automatically pushes the landmarks further away from one another.

6 Closing Remarks

In this work, we introduce a way of learning graph-structured world models that endow agents with the ability to do temporally extended reasoning. The algorithm, $L^{3}P$, learns a set of latent landmarks scattered across the goal space to enable scalable planning. We demonstrate that $L^{3}P$ achieves significantly better sample efficiency, higher asymptotic performance, and generalization to longer horizons on a range of challenging robotic navigation and manipulation tasks. Here we briefly discuss two promising future directions. First, how can an agent quickly generate a set of plausible landmarks in a previously unseen environment? A lot of progress has been made on the topics of meta reinforcement learning and learning to explore; can $L^{3}P$ be combined with meta-learning techniques for fast landmark generation? Second, can we learn graph-structured world models from offline datasets? Batch RL is a more realistic setting for many RL applications, since online interaction can be expensive in the real world. Applying $L^{3}P$ to offline datasets might require a notion of uncertainty in different parts of the graph.

Acknowledgements

We thank the anonymous reviewers for providing helpful comments on the paper. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute for Artificial Intelligence (www.vectorinstitute.ai/partners).

References

  • Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in neural information processing systems, pp. 5048–5058, 2017.
  • Arthur & Vassilvitskii (2007) Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
  • Corneil et al. (2018) Corneil, D., Gerstner, W., and Brea, J. Efficient model-based deep reinforcement learning with variational state tabulation. arXiv preprint arXiv:1802.04325, 2018.
  • Coulom (2006) Coulom, R. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pp. 72–83. Springer, 2006.
  • Dayan & Hinton (1993) Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278, 1993.
  • Dijkstra et al. (1959) Dijkstra, E. W. et al. A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269–271, 1959.
  • Doran & Michie (1966) Doran, J. E. and Michie, D. Experiments with the graph traverser program. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 294(1437):235–259, 1966.
  • Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
  • Durrant-Whyte & Bailey (2006) Durrant-Whyte, H. and Bailey, T. Simultaneous localization and mapping: part i. IEEE robotics & automation magazine, 13(2):99–110, 2006.
  • Emmons et al. (2020) Emmons, S., Jain, A., Laskin, M., Kurutach, T., Abbeel, P., and Pathak, D. Sparse graphical memory for robust planning. arXiv preprint arXiv:2003.06417, 2020.
  • Eysenbach et al. (2019) Eysenbach, B., Salakhutdinov, R. R., and Levine, S. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15246–15257, 2019.
  • Farquhar et al. (2017) Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. Treeqn and atreec: Differentiable tree-structured models for deep reinforcement learning. October 2017. URL http://arxiv.org/abs/1710.11417.
  • Gillner & Mallot (1998) Gillner, S. and Mallot, H. A. Navigation and acquisition of spatial knowledge in a virtual maze. Journal of cognitive neuroscience, 10(4):445–463, 1998.
  • Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462, 2018.
  • Hafner et al. (2019) Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565. PMLR, 2019.
  • Hafner et al. (2020) Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
  • Hart et al. (1968) Hart, P. E., Nilsson, N. J., and Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
  • Huang et al. (2019) Huang, Z., Liu, F., and Su, H. Mapping state space using landmarks for universal goal reaching. In Advances in Neural Information Processing Systems, pp. 1942–1952, 2019.
  • Janner et al. (2019) Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12519–12530, 2019.
  • Jurgenson et al. (2020) Jurgenson, T., Avner, O., Groshev, E., and Tamar, A. Sub-goal trees–a framework for goal-based reinforcement learning. arXiv preprint arXiv:2002.12361, 2020.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kurutach et al. (2018) Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
  • Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.  1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • LaValle (1998) LaValle, S. M. Rapidly-exploring random trees: A new tool for path planning. 1998.
  • Lee et al. (2018) Lee, L., Parisotto, E., Chaplot, D. S., Xing, E., and Salakhutdinov, R. Gated path planning networks. June 2018. URL http://arxiv.org/abs/1806.06408.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Liu et al. (2019) Liu, K., Kurutach, T., Tung, C. K.-C., Abbeel, P., and Tamar, A. Hallucinative topological memory for zero-shot visual planning. ArXiv, 2019. URL https://arxiv.org/abs/2002.12336.
  • Luo et al. (2018) Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
  • Nachum et al. (2018) Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313, 2018.
  • Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  7559–7566. IEEE, 2018.
  • Nair et al. (2018) Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200, 2018.
  • Nasiriany et al. (2019) Nasiriany, S., Pong, V., Lin, S., and Levine, S. Planning with goal-conditioned policies. In Advances in Neural Information Processing Systems, pp. 14843–14854, 2019.
  • Pertsch et al. (2020) Pertsch, K., Rybkin, O., Ebert, F., Finn, C., Jayaraman, D., and Levine, S. Long-horizon visual planning with goal-conditioned hierarchical predictors. arXiv preprint arXiv:2006.13205, 2020.
  • Pitis et al. (2020) Pitis, S., Chan, H., Zhao, S., Stadie, B., and Ba, J. Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. arXiv preprint arXiv:2007.02832, 2020.
  • Plappert et al. (2018) Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
  • Pong et al. (2018) Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.
  • Pong et al. (2019) Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • Racanière et al. (2017) Racanière, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Jimenez Rezende, D., Puigdomènech Badia, A., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Hassabis, D., Silver, D., and Wierstra, D. Imagination-augmented agents for deep reinforcement learning. Neural Information Processing Systems, 2017.
  • Savinov et al. (2018a) Savinov, N., Dosovitskiy, A., and Koltun, V. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018a.
  • Savinov et al. (2018b) Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018b.
  • Silver et al. (2016a) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016a.
  • Silver et al. (2016b) Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. December 2016b. URL http://arxiv.org/abs/1612.08810.
  • Srinivas et al. (2018) Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. Universal planning networks. April 2018. URL http://arxiv.org/abs/1804.00645.
  • Sutton (1991) Sutton, R. S. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
  • Tamar et al. (2016) Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. February 2016. URL http://arxiv.org/abs/1602.02867.
  • Vezhnevets et al. (2017) Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • Wang & Spelke (2002) Wang, R. F. and Spelke, E. S. Human spatial representation: Insights from animals. Trends in cognitive sciences, 6(9):376–382, 2002.
  • Wang & Ba (2019) Wang, T. and Ba, J. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649, 2019.
  • Wang et al. (2008) Wang, Y., Mulvaney, D., Sillitoe, I., and Swere, E. Robot navigation by waypoints. Journal of Intelligent and Robotic Systems, 52(2):175–207, 2008.
  • Yang et al. (2020) Yang, G., Zhang, A., Morcos, A. S., Pineau, J., Abbeel, P., and Calandra, R. Plan2vec: Unsupervised representation learning by latent plans. In Proceedings of The 2nd Annual Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp.  1–12, 2020. arXiv:2005.03648.
  • Zhao et al. (2019) Zhao, R., Sun, X., and Tresp, V. Maximum entropy-regularized multi-goal reinforcement learning. arXiv preprint arXiv:1905.08786, 2019.

Appendix A: Greedy Latent Sparsification

Algorithm 2: Greedy Latent Sparsification (GLS). (The pseudocode listing is not reproduced here.)

The Greedy Latent Sparsification (GLS) algorithm sub-samples a large batch by sparsification. GLS first randomly selects a latent embedding from the batch, and then greedily chooses the next embedding that is furthest away from the already selected embeddings. After collecting some warm-up trajectories before planning starts (see Table 1 below) during training, we first use GLS to initialize the latent centroids, and then continue to use it to sample the batches used to train the latent clusters. GLS is strongly inspired by k-means++ (Arthur & Vassilvitskii, 2007), and this type of seeding is known to improve clustering.
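A minimal sketch of the greedy sparsification step as we understand it from the description above; the latent-distance interface and batch handling are assumptions on our part, not the paper's exact procedure.

```python
import torch

def greedy_latent_sparsification(z_batch, n_keep, latent_dist):
    """Sub-sample a batch of latent codes so that the kept codes are spread out.

    z_batch:     (B, d) latent embeddings of achieved goals.
    n_keep:      number of embeddings to keep.
    latent_dist: callable (n, d), (m, d) -> (n, m) pairwise neural distances.
    """
    B = z_batch.shape[0]
    chosen = [int(torch.randint(B, (1,)))]                    # start from a random embedding
    min_dist = latent_dist(z_batch, z_batch[chosen]).squeeze(-1)
    for _ in range(n_keep - 1):
        idx = int(torch.argmax(min_dist))                     # furthest from the chosen set so far
        chosen.append(idx)
        d_new = latent_dist(z_batch, z_batch[idx:idx + 1]).squeeze(-1)
        min_dist = torch.minimum(min_dist, d_new)             # distance to nearest chosen embedding
    return z_batch[chosen]
```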

Appendix B: Graph Search with Soft Relaxations

In this paper, we employ a soft version of the Floyd algorithm, which we find to work well empirically. Rather than simply using the $\min$ operation for relaxation, the soft value iteration procedure uses a softmin operation when doing an update (note that, since we negated the distances in the weight matrix of the graph, the operations we actually use are $\max$ and softmax). The reason is that neural distances can be inconsistent and inaccurate at times, and using a soft operation makes the whole procedure more robust. More concretely, we repeat the following update on the weight matrix for $S$ steps with temperature $\beta$:

\[
w_{i,j} \leftarrow \sum_{k=1}^{N+1} \frac{\exp\big(\frac{1}{\beta}(w_{i,k} + w_{k,j})\big)}{\sum_{k'=1}^{N+1} \exp\big(\frac{1}{\beta}(w_{i,k'} + w_{k',j})\big)} \big(w_{i,k} + w_{k,j}\big) \tag{8}
\]

Following the practice in (Eysenbach et al., 2019; Huang et al., 2019), we initialize the distance matrix as follows: for entries smaller than the negative of $d\_max$, we penalize the entry by adding $-\infty$ to it (in this paper, we use $-10^{6}$ as the $-\infty$ value). The essential idea is that we only trust a neural estimate when it is local, and we rely on graph search to solve for global, longer-horizon distances. The $-\infty$ penalty effectively masks out those entries with large negative values in the softmax operation above. If we replace the softmax with a hard max, we recover the original update of the Floyd algorithm; we can interpolate between a hard Floyd and a soft Floyd by tuning the temperature $\beta$.
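For reference, a vectorized sketch of the soft relaxation in Equation 8, including the $d\_max$ masking described above; the default values of the number of steps, $\beta$, and $d\_max$ are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def soft_floyd(W, n_steps=10, beta=1.0, d_max=10.0, neg_inf=-1e6):
    """Soft Floyd relaxation on a matrix of negated distance estimates (Eq. 8)."""
    W = W.copy()
    W[W < -d_max] += neg_inf                         # trust only local estimates; mask the rest
    for _ in range(n_steps):
        # Candidate path lengths through every intermediate node k: w_ik + w_kj.
        cand = W[:, :, None] + W[None, :, :]         # shape (M, M, M); axis 1 indexes k
        logits = cand / beta
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W = (p * cand).sum(axis=1)                   # softmax-weighted relaxation over k
    return W
```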

Appendix C: Overall Training Procedure

Here we provide the overall training procedure for $L^{3}P$ in Algorithm 3. Given an environment env and a training goal distribution $p(g)$, we initialize a replay buffer $B$ and the following trainable modules: policy $\pi$, distance function $D$, value function $V$, encoder $f_E$ and decoder $f_D$, and latent centroids $\{\textbf{c}_1 \cdots \textbf{c}_N\}$.

Every $K_{env}$ episodes of sampling, we take gradient steps for the above modules. The ratio between the number of environment steps and the number of gradient steps is a hyper-parameter.

Algorithm 3: Overall training procedure of $L^{3}P$. (The pseudocode listing is not reproduced here.)
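As a rough guide to how such a loop can be organized, here is a high-level sketch in the spirit of Algorithm 3; every component is passed in as a callable because the concrete implementations (collection, relabelling, and the individual updates) are assumptions rather than details reproduced from the paper.

```python
def train_l3p(n_iterations, k_env, n_grad_steps,
              collect_episodes, sample_batch,
              update_rl, update_v, update_autoencoder, update_clustering):
    """Alternate between collecting K_env episodes and updating all modules."""
    for _ in range(n_iterations):
        collect_episodes(k_env)            # roll out K_env episodes into the replay buffer
        for _ in range(n_grad_steps):
            batch = sample_batch()         # hindsight-relabelled replay batch
            update_rl(batch)               # DDPG + HER on the policy and Q/D (Eq. 1)
            update_v(batch)                # distill V from D (Eq. 4)
            update_autoencoder(batch)      # reconstruction + reachability losses (Eq. 2)
            update_clustering(batch)       # GLS sub-sampling + ELBO update (Eq. 5)
```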

Appendix D: Implementation Details

  • We find that having a centralized replay for all parallel workers is significantly more sample efficient than having separate replays for each worker and simply averaging the gradients across workers.

  • For the AntMaze environment, we clip gradient norms at a value of 15.0 for all networks. For Fetch tasks, we normalize the inputs using running means and standard deviations per input dimension.

  • Since $L^{3}P$ is able to decompose a long-horizon goal into many short-horizon goals, we shorten the range of future steps where we do hindsight relabelling; as a result, the agent can focus its optimization effort on more immediate goals. This corresponds to the hyper-parameter: hindsight relabelling range.

  • During training, we collect 50% of the data without the planning module, and the other 50% of the data with planning. This corresponds to the hyper-parameter: probability of using search during train.

  • At train time, to encourage exploration during planning, we temporarily add a small number of random landmarks from GLS (Algorithm 2) to the existing latent landmarks. A new set of random landmarks is selected for each episode before graph search starts (Algorithm 1). This corresponds to the hyper-parameter: random landmarks added during train.

  • We find that collecting a certain number of warm-up trajectories for every worker, before the planning procedure starts (during training) and before GLS (Algorithm 2) is used for initialization, helps improve the planning results. This corresponds to the hyper-parameter: number of warm-up trajectories.

Appendix E: Hyper-parameters

The first table below lists the common hyper-parameters across all environments. The second table below lists the hyper-parameters that differ across the environments.

Table 1: Common hyper-parameters across all environments. (Values not reproduced here.)
Table 2: Environment-specific hyper-parameters. (Values not reproduced here.)