
Proto Successor Measure: Representing the space of all possible solutions of
Reinforcement Learning

Siddhant Agarwal∗,1, Harshit Sikchi∗,1, Peter Stone†,1,2, Amy Zhang†,1,3
1 The University of Texas at Austin, 2 Sony AI, 3 Meta AI
{siddhant,hsikchi}@utexas.edu
∗ Equal contribution, † Equal advising
Abstract

Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment. Referred to as “zero-shot learning,” this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible solutions of Reinforcement Learning in a dynamical system. We prove that any possible policy can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights, corresponding to the optimal policy, for combining these basis functions. We derive a practical algorithm to learn these basis functions using only interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions. Project page: agarwalsiddhant10.github.io/psm.html

1 Introduction

A wide variety of tasks can be defined within an environment (or any dynamical system). For instance, in navigation environments, tasks can be defined as reaching a goal, following a path, or reaching a goal while avoiding certain states. Once familiar with an environment, humans have the wonderful ability to perform new tasks in that environment without any additional practice. For example, consider the last time you moved to a new city. At first, you may have needed to explore various routes to figure out the most efficient way to get to the nearest supermarket or place of work. But eventually, you could probably travel to new places efficiently the very first time you needed to get there. Like humans, intelligent agents should be able to infer the necessary information about the environment during exploration and use this experience to solve any downstream task efficiently. Reinforcement Learning (RL) algorithms have seen great success in finding a sequence of decisions that optimally solves a given task in the environment (Wurman et al., 2022; Fawzi et al., 2022). In RL settings, tasks are defined using reward functions, with each task having its own optimal agent policy or behavior corresponding to the task reward. RL agents are usually trained for a given task (reward function) or on a distribution of related tasks; most RL agents do not generalize to solving any task, even in the same environment. While related machine learning fields like computer vision and natural language processing have shown success in zero-shot (Ramesh et al., 2021) and few-shot (Radford et al., 2021) adaptation to a wide range of downstream tasks, RL lags behind in such functionalities. Unsupervised reinforcement learning aims to extract reusable information such as skills (Eysenbach et al., 2019; Zahavy et al., 2023), representations (Ghosh et al., 2023; Ma et al., 2023), world models (Janner et al., 2019; Hafner et al., 2020), goal-reaching policies (Agarwal et al., 2024; Sikchi et al., 2024a), etc., from the environment using data independent of the task reward to efficiently train RL agents for any task. Recent advances in unsupervised RL (Wu et al., 2019; Touati & Ollivier, 2021; Blier et al., 2021b; Touati et al., 2023) have shown some promise towards achieving zero-shot RL.

Recently proposed pretraining algorithms (Stooke et al., 2021; Schwarzer et al., 2021b; Sermanet et al., 2017; Nair et al.; Ma et al., 2023) use self-supervised learning to learn representations from large-scale data to facilitate few-shot RL, but these representations are dependent on the policies used for collecting the data. These algorithms assume that the large-scale data is collected from a “good” policy demonstrating expert task-solving behaviors. Several prior works aim to achieve generalization in multi-task RL by building upon successor features (Dayan, 1993), which represent rewards as a linear combination of state features. These methods have limited generalization capacity to unseen arbitrary tasks. Other works (Mahadevan, 2005; Machado et al., 2017; 2018; Bellemare et al., 2019; Farebrother et al., 2023) represent value functions using eigenvectors of the graph Laplacian obtained from a random policy to approximate the global basis of value functions. However, the eigenvectors from a random policy cannot represent all value functions. In fact, we show that an alternative strategy of representing visitation distributions using a set of basis functions covers a larger set of solutions than doing the same with value functions. Skill learning methods (Eysenbach et al., 2019; Park et al., 2024b; Eysenbach et al., 2022) view any policy as a combination of skills, but as shown by Eysenbach et al. (2022), these methods do not recover all possible skills from the MDP. Some recent works have attempted zero-shot RL by decomposing the representation of visitation distributions (Touati & Ollivier, 2021; Touati et al., 2023), but they learn policy representations as a projection of the reward function, which can lead to loss of task-relevant information. We present a stronger, more principled approach for representing any solution of RL in the MDP.

Figure 1: Method Overview: Visitation distributions corresponding to any policy must obey the Bellman Flow constraint for the dynamical system. This means they must lie on the plane defined by the Bellman Flow equation. Being a plane, it can be represented using a basis set $\Phi$ and a bias. All valid (non-negative) visitation distributions lie within a convex hull on this plane. The boundary of this hull is defined using the non-negativity constraints: $\Phi w+b\geq 0$. Each point within this convex hull corresponds to a visitation distribution for a valid policy and is defined simply by the “coordinate” $w$.

Any policy in the environment can be represented using visitation distributions, i.e., the distributions over states and actions that the agent visits when following a policy. We learn a basis set to represent any possible visitation distribution in the underlying environmental dynamics. We draw our inspiration from the linear programming view (Manne, 1960; Denardo, 1970; Nachum & Dai, 2020; Sikchi et al., 2024b) of reinforcement learning; the objective is to find the visitation distribution that maximizes the return (the dot product of the visitation distribution and the reward) subject to the Bellman Flow constraints. We show that any solution of the Bellman Flow constraint for the visitation distribution can be represented as a linear combination of policy-independent basis functions and a bias. As shown in Figure 1, any visitation distribution that is a solution of the Bellman Flow for a given dynamical system lies on a plane defined using a policy-independent basis $\Phi$ and a bias $b$. On this plane, only a small convex region defines the valid (non-negative) visitation distributions. Any visitation distribution in this convex hull can be obtained simply using the “coordinates” $w$. We introduce Proto-Successor Measure, the set of basis functions and bias to represent any successor measure (a generalization of visitation distributions) in the MDP, which can be learnt using reward-free interaction data. At test time, obtaining the optimal policy reduces to simply finding the linear weights to combine these basis vectors that maximize the dot product with the user-specified reward. These basis vectors only depend on the state-action transition dynamics of the MDP, independent of the initial state distribution, reward, or policy, and can be thought of as compactly representing the entire dynamics.

The contributions of our work are (1) a novel, mathematically complete perspective on representation learning for Markov decision processes; (2) an efficient practical instantiation that reduces basis learning to a single-player optimization; and (3) evaluations on a number of tasks demonstrating the capability of our learned representations to quickly infer optimal policies.

2 Related Work

Unsupervised Reinforcement Learning: Unsupervised RL generally refers to a broad class of algorithms that use reward-free data to improve the efficiency of RL algorithms. We focus on methods that learn representations to produce optimal value functions for any given reward function. Representation learning through unsupervised or self-supervised RL has been discussed both for pre-training (Nair et al.; Ma et al., 2023) and for training with auxiliary objectives (Agarwal et al., 2021; Schwarzer et al., 2021a). While using auxiliary objectives for representation learning does accelerate policy learning for downstream tasks, policy learning begins from scratch for a new task. Pre-training methods like Ma et al. (2023) and Nair et al. use self-supervised learning techniques from computer vision, such as masked auto-encoding, to learn representations that can be used directly for downstream tasks. These methods use large-scale datasets (Grauman et al., 2022) to learn representations, but these are fitted around the policies used for collecting data. These representations neither represent any possible behavior nor are trained to represent Q functions for arbitrary reward functions. A number of works in prior literature aim to discover intents or skills using a diversity objective. These methods use the fact that the latents or skills should define the output state-visitation distributions; thus, diversity can be ensured by maximizing mutual information (Warde-Farley et al., 2019; Eysenbach et al., 2019; Achiam et al., 2018; Eysenbach et al., 2022) or minimizing the Wasserstein distance (Park et al., 2024b) between the latents and the corresponding state-visitation distributions. PSM differs from these works and takes a step towards learning representations optimal for predicting value functions as well as a zero-shot near-optimal policy for any reward.

Methods that linearize RL quantities: Learning basis vectors has been leveraged in RL to allow for transfer to new tasks. Successor features (Barreto et al., 2017) represent rewards as a linear combination of transition features, and subsequently the Q-functions are linear in successor features. Several methods have extended successor features (Lehnert & Littman, 2020; Hoang et al., 2021; Alegre et al., 2022; Reinke & Alameda-Pineda, 2021) to learn better policies in more complex domains.

Spectral methods like Proto Value Functions (PVFs) (Mahadevan, 2005; Mahadevan & Maggioni, 2007) instead represent the value functions as a linear combination of basis vectors. PVFs use the eigenvectors of the random walk operator (graph Laplacian) as the basis vectors. Adversarial Value Functions (Bellemare et al., 2019) and Proto Value Networks (Farebrother et al., 2023) have attempted to scale up this idea in different ways. However, deriving these eigenvectors from a Laplacian is not scalable to larger state spaces. Wu et al. (2019) recently presented an approximate scalable objective, but the Laplacian is still dependent on the policy, which makes it incapable of representing all behaviors or Q functions.

Similar to our work, Forward Backward (FB) Representations (Touati & Ollivier, 2021; Touati et al., 2023) use an inductive bias on the successor measure to decompose it into a forward and a backward representation. Unlike FB, our representations are linear in a set of basis features. Additionally, FB ties the reward representation to the representation of the optimal policy derived using Q-function maximization, which can lead to overestimation issues and instability during training as a result of Bellman optimality backups.

3 Preliminaries

In this section, we introduce preliminaries and define terminology that will be used in later sections. We begin with MDP fundamentals and RL preliminaries, followed by a discussion of affine spaces, which form the basis for our representation learning paradigm.

3.1 Markov Decision Processes

A Markov Decision Process is defined as a tuple $\langle\mathcal{S},\mathcal{A},P,r,\gamma,\mu\rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P:\mathcal{S}\times\mathcal{A}\longmapsto\Delta(\mathcal{S})$ is the transition probability ($\Delta(\cdot)$ denotes a probability distribution over a set), $\gamma\in[0,1)$ is the discount factor, $\mu$ is the distribution over initial states and $r:\mathcal{S}\times\mathcal{A}\longmapsto\mathbb{R}$ is the reward function. The task is specified using the reward function $r$ and the initial state distribution $\mu$. The goal for the RL agent is to learn a policy $\pi_{\theta}:\mathcal{S}\longmapsto\mathcal{A}$ that maximizes the expected return $J(\pi_{\theta})=\mathbb{E}_{s_{0}\sim\mu}\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})]$.

In this work, we consider a task-free MDP which does not provide the reward function or the initial state distribution. Hence, a task-free or reward-free MDP is simply the tuple $\langle\mathcal{S},\mathcal{A},P,\gamma\rangle$. A task-free MDP essentially only captures the underlying environment dynamics and can have infinite downstream tasks specified through different reward functions.

The state-action visitation distribution, $d^{\pi}(s,a)$, is defined as the normalized probability of being in a state $s$ and taking an action $a$ if the agent follows the policy $\pi$ from a state sampled from the initial state distribution. Concretely, $d^{\pi}(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbb{P}(s_{t}=s,a_{t}=a)$. A more general quantity, the successor measure $M^{\pi}(s,a,s^{+},a^{+})$, is defined as the probability of being in state $s^{+}$ and taking action $a^{+}$ when starting from the state-action pair $s,a$ and following the policy $\pi$. Mathematically, $M^{\pi}(s,a,s^{+},a^{+})=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbb{P}(s_{t}=s^{+},a_{t}=a^{+}|s_{0}=s,a_{0}=a)$. The state-action visitation distribution can be written as $d^{\pi}(s,a)=\mathbb{E}_{s_{0}\sim\mu(s),a_{0}\sim\pi(a_{0}|s_{0})}[M^{\pi}(s_{0},a_{0},s,a)]$.

Both these quantities, state-action visitation distribution and successor measure, follow the Bellman Flow equations:

d^{\pi}(s,a)=(1-\gamma)\mu(s)\pi(a|s)+\gamma\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}P(s|s^{\prime},a^{\prime})d^{\pi}(s^{\prime},a^{\prime})\pi(a|s). (1)

For the successor measure, the initial state distribution is replaced by an indicator function:

M^{\pi}(s,a,s^{+},a^{+})=(1-\gamma)\mathbbm{1}[s=s^{+},a=a^{+}]+\gamma\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}P(s^{+}|s^{\prime},a^{\prime})M^{\pi}(s,a,s^{\prime},a^{\prime})\pi(a^{+}|s^{+}). (2)
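To make Equation 2 concrete, it can be solved in closed form for a small tabular MDP by stacking the successor measure into an $|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|$ matrix, which gives $M^{\pi}=(1-\gamma)(I-\gamma P_{\pi})^{-1}$ with $P_{\pi}[(s^{\prime},a^{\prime}),(s^{+},a^{+})]=P(s^{+}|s^{\prime},a^{\prime})\pi(a^{+}|s^{+})$. The NumPy sketch below uses a randomly generated MDP purely as a placeholder:

```python
import numpy as np

# Minimal tabular sketch of Equation 2; sizes and the random MDP are placeholders.
nS, nA, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = P(s'|s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a]    = pi(a|s)

# State-action transition matrix: P_pi[(s', a'), (s+, a+)] = P(s+|s', a') * pi(a+|s+)
P_pi = np.einsum("xas,sb->xasb", P, pi).reshape(nS * nA, nS * nA)

# Bellman flow in matrix form: M = (1 - gamma) I + gamma M P_pi
M = (1 - gamma) * np.linalg.inv(np.eye(nS * nA) - gamma * P_pi)

# Each row of M is a probability distribution over future (s+, a+) pairs
assert np.allclose(M.sum(axis=1), 1.0)
```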

The RL objective has a well-studied linear programming interpretation (Manne, 1960). Given any task reward function $r$, the RL objective can be rewritten in the form of a constrained linear program:

\max_{d}\quad\sum_{s,a}d(s,a)r(s,a),\quad s.t.\quad d(s,a)\geq 0\quad\forall s,a, (3)
s.t.\quad d(s,a)=(1-\gamma)\mu(s)\pi(a|s)+\gamma\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}P(s|s^{\prime},a^{\prime})d(s^{\prime},a^{\prime})\pi(a|s)

and the unique policy corresponding to the visitation $d$ is obtained by $\pi(a|s)=\frac{d(s,a)}{\sum_{a}d(s,a)}$. The Q function can then be defined using the successor measure as $Q^{\pi}(s,a)=\sum_{s^{+},a^{+}}M^{\pi}(s,a,s^{+},a^{+})r(s^{+},a^{+})$, or $Q^{\pi}=M^{\pi}r$. Obtaining the optimal policy requires maximizing the Q function, i.e., solving $\operatorname*{arg\,max}_{\pi}M^{\pi}r$.
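As an illustration of this linear-programming view, the sketch below solves the flow-constrained program for a small random MDP with SciPy, using the state-marginal form of the constraint (Equation 11 in the appendix) so that the program is linear in $d$ alone; the MDP, reward, and sizes are placeholders rather than anything from the paper's experiments:

```python
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.standard_normal(nS * nA)                # reward over (s, a), flattened with s major
mu = np.ones(nS) / nS                           # initial state distribution

# Flow constraint: sum_a d(s,a) - gamma * sum_{s',a'} P(s|s',a') d(s',a') = (1 - gamma) mu(s)
S_mat = np.kron(np.eye(nS), np.ones((1, nA)))   # marginalizes d over actions
P_mat = P.reshape(nS * nA, nS)                  # rows (s, a), columns s'
A_eq = S_mat - gamma * P_mat.T
b_eq = (1 - gamma) * mu

# linprog minimizes, so negate the return; the bounds enforce d >= 0
res = linprog(c=-r, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (nS * nA))
d_star = res.x.reshape(nS, nA)
pi_star = d_star / d_star.sum(axis=1, keepdims=True)   # recover the policy from the visitation
```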

3.2 Affine Spaces

Let $\mathcal{V}$ be a vector space and $b$ be a vector. An affine set is defined as $A=b+\mathcal{V}=\{x\,|\,x=b+v,v\in\mathcal{V}\}$. Any vector in a vector space can be written as a linear combination of basis vectors, i.e., $v=\sum_{i}^{n}\alpha_{i}v_{i}$ where $n$ is the dimensionality of the vector space. This property implies that any element of an affine space can be expressed as $x=b+\sum_{i}^{n}\alpha_{i}v_{i}$. Given a system of linear equations $Ax=c$, with $A$ being an $m\times n$ matrix ($m<n$) and $c\neq 0$, the solutions $x$ form an affine set. Hence, there exist coefficients $\alpha_{i}$ such that $x=b+\sum_{i}\alpha_{i}x_{i}$. The vectors $\{x_{i}\}$ form the basis set of the null space or kernel of $A$. The values $(\alpha_{i})$ form the affine coordinates of $x$ for the basis $\{x_{i}\}$. Hence, for a given system with known $\{x_{i}\}$ and $b$, any solution can be represented using only the affine coordinates $(\alpha_{i})$.
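The construction above can be reproduced numerically with a few lines of NumPy/SciPy; the matrices below are arbitrary placeholders used only to illustrate the particular-solution-plus-null-space decomposition:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
m, n = 3, 6                                     # underdetermined system: m < n
A = rng.standard_normal((m, n))
c = rng.standard_normal(m)

b_part = np.linalg.lstsq(A, c, rcond=None)[0]   # one particular solution (the "bias" b)
X = null_space(A)                               # columns x_i: basis of the kernel of A

# Any affine coordinates alpha yield another solution of A x = c
alpha = rng.standard_normal(X.shape[1])
x = b_part + X @ alpha
assert np.allclose(A @ x, c)
```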

We first explain the theoretical foundations of our method in Section 4 and derive a practical algorithm following the theory in Section 5.

4 The Basis Set for All Solutions of RL

In this section, we introduce the theoretical results that form the foundation for our representation learning approach. The goal is to learn policy-independent representations that can represent any valid visitation distribution in the environment (i.e., satisfying the Bellman Flow constraint in Equation 3). With a compact way to represent these distributions, it is possible to reduce the policy optimization problem to a search in this compact representation space. We will show that state visitation distributions and successor measures form an affine set and thus can be represented as $\sum_{i}\phi_{i}w_{i}^{\pi}+b$, where $\phi_{i}$ are basis functions, $w^{\pi}$ are “coordinates” or weights to linearly combine the basis functions, and $b$ is a bias term. First, we build up the formal intuition for this statement and later we will use a toy example to show how these representations can make policy search easier.

The first constraint in Equation 3 is the Bellman Flow equation. We begin with Theorem 4.1, showing that state-action visitation distributions that satisfy the Bellman Flow form affine sets.

Theorem 4.1.

All possible state-action visitation distributions in an MDP form an affine set.

While Theorem 4.1 shows that any state-action visitation distribution in an MDP can be written using a linear combination of basis and bias terms, state-action visitation distributions still depend on the initial state distribution. Moreover, as shown in Equation 1, computing the state-action visitation distribution requires a summation over all states and actions in the MDP, which is not always possible. Successor measures are more general than state-visitation distributions as they encode the state-action visitation of the policy conditioned on a starting state-action pair. Using similar techniques, we show that successor measures also form affine sets.

Corollary 4.2.

Any successor measure $M^{\pi}$ in an MDP forms an affine set and so can be represented as $\sum_{i}^{d}\phi_{i}w_{i}^{\pi}+b$ where $\phi_{i}$ and $b$ are independent of the policy $\pi$ and $d$ is the dimension of the affine space.

Following Corollary 4.2, for any $w$, the function $\sum_{i}^{d}\phi_{i}w_{i}+b$ will be a solution of Equation 2. Hence, given $\Phi$ (the $\phi_{i}$ stacked together) and $b$, we no longer need the first constraint of the linear program (in Equation 3). The other constraint, $\Phi w+b\geq 0$, still remains and must be satisfied by $w$. We discuss ways to manage this constraint in Section 5.3. The linear program given a reward function now becomes,

\max_{w}\quad\mathbb{E}_{\mu}[(\Phi w+b)r] (4)
s.t.\quad\Phi w+b\geq 0\quad\forall s,a.
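With $\Phi$ and $b$ precomputed, Equation 4 becomes a small linear program in $w$ alone (the expectation and the bias contribute only constants, so the objective is linear in $w$). The SciPy sketch below uses random placeholder matrices for $\Phi$, $b$, and $r$; for arbitrary random inputs the program can be unbounded, hence the status check, whereas the polytope induced by a real MDP is bounded:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, k = 8, 3                                    # n entries of the visitation vector, k basis functions
Phi = rng.standard_normal((n, k))
b = rng.random(n)                              # nonnegative bias so that w = 0 is feasible
r = rng.standard_normal(n)

# maximize (Phi w + b) . r   s.t.  Phi w + b >= 0
# <=> minimize -(r^T Phi) w  s.t.  -Phi w <= b, with w unconstrained in sign
res = linprog(c=-(r @ Phi), A_ub=-Phi, b_ub=b, bounds=[(None, None)] * k)
if res.status == 0:
    w_star = res.x
    d_star = Phi @ w_star + b                  # the induced (nonnegative) visitation vector
```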

In fact, any visitation distribution that is a policy-independent linear transformation of $M^{\pi}$, such as the state visitation distribution or the future state-visitation distribution, can be represented in the same way, as shown in Corollary 4.3.

Corollary 4.3.

Any quantity that is a policy-independent linear transformation of $M^{\pi}$ can be written as a linear combination of policy-independent basis and bias terms.

Extension to Continuous Spaces: In continuous spaces, the basis matrices $\phi$ and bias $b$ become functions $\phi:S\times A\times S\rightarrow\mathbb{R}^{d}$ and $b:S\times A\times S\rightarrow\mathbb{R}$. The linear equation with matrix operations becomes a linear equation with functional transformations, and any sum over states is replaced with an expectation under the data distribution.

Toy Example: Let us consider a simple 2-state MDP (as shown in Figure 2a) to depict how the precomputation and inference take place. Consider the state-action visitation distribution as in Equation 1. For this simple MDP, $\Phi$ and $b$ can be computed using simple algebraic manipulations. For a given initial state distribution $\mu$ and discount $\gamma$, the state-action visitation distribution $d=(d(s_{0},a_{0}),d(s_{1},a_{0}),d(s_{0},a_{1}),d(s_{1},a_{1}))^{T}$ can be written as,

d=w_{1}\begin{pmatrix}\frac{-\gamma}{1+\gamma}\\ \frac{-1}{1+\gamma}\\ 1\\ 0\end{pmatrix}+w_{2}\begin{pmatrix}\frac{-1}{1+\gamma}\\ \frac{-\gamma}{1+\gamma}\\ 0\\ 1\end{pmatrix}+\begin{pmatrix}\frac{\mu(s_{0})+\gamma\mu(s_{1})}{1+\gamma}\\ \frac{\mu(s_{1})+\gamma\mu(s_{0})}{1+\gamma}\\ 0\\ 0\end{pmatrix}. (5)

The derivation of these basis vectors and the bias vector can be found in Appendix 5. Equation 21 represents any vector that is a solution of Equation 1 for the simple MDP. Any state-action visitation distribution possible in the MDP can now be represented using only $w=(w_{1},w_{2})^{T}$. The only constraint in the linear program of Equation 4 is $\Phi w+b\geq 0$. Looking closely, this constraint gives rise to four inequalities in $w$ and the linear program reduces to,

\begin{aligned}\max_{w_{1},w_{2}}\quad&\left(\frac{-\gamma w_{1}-w_{2}}{1+\gamma},\frac{-w_{1}-\gamma w_{2}}{1+\gamma},w_{1},w_{2}\right)^{T}r\\ s.t.\quad&w_{1}+\gamma w_{2}\leq\mu(s_{0})+\gamma\mu(s_{1})\\ &\gamma w_{1}+w_{2}\leq\mu(s_{1})+\gamma\mu(s_{0})\\ &w_{1}\geq 0,\;w_{2}\geq 0\end{aligned} (6)

The inequalities in $w$ give rise to a simplex as shown in Figure 2b. For any specific instantiation of $\mu$ and $r$, the optimal policy can be easily found. For instance, if $\mu=(1,0)^{T}$ and the reward function is $r=(1,0,1,0)^{T}$, the optimal $w$ is obtained at the vertex $(w_{1}=1,w_{2}=0)$ and the corresponding state-action visitation distribution is $d=(0,0,1,0)^{T}$.

Figure 2: (left) A Toy MDP with 2 states and 2 actions to depict how the linear program of RL is reduced using precomputation. (right) The corresponding simplex for $w$ assuming the initial state distribution is $\mu=(1,0)^{T}$.

As shown for the toy MDP, the successor measures form a simplex, as discussed in Eysenbach et al. (2022). Spectral methods following Proto Value Functions (Mahadevan & Maggioni, 2007) have instead tried to learn policy-independent basis functions $\Phi^{vf}$ to represent value functions as a linear span, $V^{\pi}=\Phi^{vf}w^{\pi}$. Prior work (Dadashi et al., 2019) has already argued that value functions do not form convex polytopes. We show through Theorem 4.4 that, for identical dimensionalities, the span of value functions using such basis functions represents a smaller class of value functions than the set of value functions that can be represented using the span of successor measures.

Theorem 4.4.

Given a $d$-dimensional basis $\mathbf{V}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{d}$, define $span\{\mathbf{V}\}$ as the space of all linear combinations of the basis $\mathbf{V}$. Let $span\{\Phi^{vf}\}$ represent the space of value functions spanned by $\Phi^{vf}$, i.e., $V^{\pi}=\Phi^{vf}w^{\pi}$, and let $\{span\{\Phi\}r\}$ represent the space of value functions using the successor measures spanned by $\Phi$, i.e., $V^{\pi}=\sum_{s^{+}}[\Phi w^{\pi}\cdot r(s^{+})]$. For the same dimensionality of task (policy or reward) independent basis, $span\{\Phi^{vf}\}\subseteq\{span\{\Phi\}r\}$.

Approaches such as Forward Backward Representations (Touati & Ollivier, 2021) have also been based on representing successor measures, but they force a latent variable $z$ representing the policy to be a function of the reward for which the policy is optimal. The forward map that they propose is a function of this latent $z$. We, on the other hand, propose a representation that is truly independent of the policy or the reward.

5 Method

In this section, we start by introducing the core practical algorithm for representation learning, inspired by the theory discussed in Section 4, for obtaining $\Phi$ and $b$. We then discuss the inference step, i.e., obtaining $w$ for a given reward function.

5.1 Learning $\Phi$ and $b$

For a given policy $\pi$, its successor measure under our framework is denoted by $M^{\pi}=\Phi w^{\pi}+b$, with $w^{\pi}$ the only object depending on the policy. Given an offline dataset with density $\rho$, we follow prior works (Touati & Ollivier, 2021; Blier et al., 2021b) and model the densities $m^{\pi}=M^{\pi}/\rho$, learned with the following objective:

L^{\pi}(\Phi,b,w^{\pi})=-\mathbb{E}_{s,a\sim\rho}[m^{\Phi,b,w^{\pi}}(s,a,s,a)]+\frac{1}{2}\mathbb{E}_{s,a,s^{\prime}\sim\rho,\,s^{+},a^{+}\sim\rho}\left[\left(m^{\Phi,b,w^{\pi}}(s,a,s^{+},a^{+})-\gamma\bar{m}^{\bar{\Phi},\bar{b},\bar{w}^{\pi}}(s^{\prime},\pi(s^{\prime}),s^{+},a^{+})\right)^{2}\right]. (7)

The above objective only requires samples $(s,a,s^{\prime})$ from the reward-free dataset and a random state-action pair $(s^{+},a^{+})$ (also sampled from the same data) to compute and minimize $L^{\pi}$.
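A rough PyTorch sketch of Equation 7 is shown below. The callables `m_net` (the $\Phi w+b$-parameterized density), its target copy `m_target`, and the policy `pi` are assumed to exist; the names, batch layout, and network design are illustrative rather than the authors' implementation:

```python
import torch

def psm_loss(m_net, m_target, pi, batch, gamma=0.99):
    """Sketch of Eq. 7: batch holds (s, a, s') transitions and independently
    sampled pairs (s_plus, a_plus) from the same reward-free dataset."""
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]
    s_plus, a_plus = batch["s_plus"], batch["a_plus"]

    # First term: -E[m(s, a, s, a)], coming from the (1 - gamma) indicator in Eq. 2
    diag_term = -m_net(s, a, s, a).mean()

    # Second term: 1/2 E[(m(s, a, s+, a+) - gamma * m_bar(s', pi(s'), s+, a+))^2]
    with torch.no_grad():
        target = gamma * m_target(s_next, pi(s_next), s_plus, a_plus)
    td_term = 0.5 * (m_net(s, a, s_plus, a_plus) - target).pow(2).mean()

    return diag_term + td_term
```

In Section 5.2, the explicit policy argument is replaced by a latent code $z$, and $w$ becomes a function of $z$.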

A $\Phi$ and $b$ that allow minimizing $L^{\pi}$ for all $\pi\in\Pi$ form a solution to our representation learning problem. But how do we go about learning such $\Phi$ and $b$? A naïve way to learn $\Phi$ and $b$ is via a bi-level optimization: we sample policies from the policy space $\Pi$, for each policy we learn a $w^{\pi}$ that optimizes the policy evaluation loss (Eq 7), and we take a gradient update w.r.t. $\Phi$ and $b$. In general, the objective can be optimized by any two-player game-solving strategy with $[\Phi,b]$ as the first player and $w^{\pi}$ as the second player. Instead, in the next section, we present an approach that simplifies learning representations to a single-player game.

5.2 Simplifying Optimization via a Discrete Codebook of Policies

Learning a new $w^{\pi}$ for each specific sampled policy $\pi$ does not leverage precomputation and requires retraining from scratch. We propose parameterizing $w$ to be conditional on the policy, which allows leveraging generalization between policies that induce similar visitations and, as we show, allows us to simplify the two-player game into a single-player optimization. In general, policies are high-dimensional objects and compressing them can result in additional overhead. Compression by parameterizing policies with a latent variable $z$ is another alternative, but it presents the challenge of covering the space of all possible policies by sampling $z$. Instead, we propose using a discrete codebook of policies as a way to simulate uniform sampling of all possible policies with support in the offline dataset.

Discrete Codebook of Policies: Denote $z$ as a compact representation of policies. We propose to represent $z$ as a random sampling seed that will generate a deterministic policy from the set of supported policies as follows:

\pi(a|s,z)=\text{Uniform Sample}(\text{seed}=z+\text{hash}(s)). (8)

The above sampling strategy defines a unique mapping from a seed to a policy. If the seed generator is unbiased, the approach provably samples from among all possible deterministic policies uniformly. Now, with policy $\pi_{z}$ and $w_{z}$ parameterized as a function of $z$, we derive the following single-player reduction to learn $\Phi,b,w$ jointly.
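A small Python sketch of the seed-based codebook in Equation 8 follows; the specific hashing and seeding scheme here is one plausible choice, not necessarily the one used by the authors:

```python
import numpy as np

def codebook_policy(z: int, state: tuple, num_actions: int) -> int:
    """Deterministic policy indexed by the seed z: each (z, state) pair maps to one action."""
    seed = (z + hash(state)) % (2**32)                 # combine the policy seed with a state hash
    return int(np.random.default_rng(seed).integers(num_actions))

# The same z always gives the same action in a given state, so each z indexes
# one deterministic policy; different z values index different policies.
assert codebook_policy(z=7, state=(2, 3), num_actions=5) == codebook_policy(7, (2, 3), 5)
```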

\texttt{PSM-objective:}\quad\operatorname*{arg\,min}_{\Phi,b,w(z)}\mathbb{E}_{z}[L^{\pi_{z}}(\Phi,b,w(z))]. (9)

5.3 Fast Optimal Policy Inference on Downstream Tasks

After obtaining $\Phi$ and $b$ via the pretraining step, the only parameter to compute for obtaining the optimal Q function for a downstream task in the MDP is $w$. As discussed earlier, $Q^{*}=\max_{w}(\Phi w+b)r$, but simply maximizing this objective will not yield a valid Q function: the linear program still has the constraint $\Phi w+b\geq 0,\forall s,a$. We solve the constrained linear program by constructing the Lagrangian dual using Lagrange multipliers $\lambda(s,a)$. The dual problem is shown in Equation 10. Here, we write the corresponding loss for the constraint as $\min(\Phi w+b,0)$.

\max_{\lambda\geq 0}\min_{w}\;-\Phi wr-\sum_{s,a}\lambda(s,a)\min(\Phi w+b,0). (10)

Once $w^{*}$ is obtained, the corresponding $M^{*}$ and $Q^{*}$ can be easily computed. The policy can be obtained as $\pi^{*}=\operatorname*{arg\,max}_{a}Q^{*}(s,a)$ for discrete action spaces and via DDPG-style policy learning for continuous action spaces.
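A gradient-based sketch of this inference step is given below, with a fixed quadratic penalty standing in for the Lagrange multipliers $\lambda(s,a)$ in Equation 10; `Phi` (an $n\times d$ tensor), `b`, and `r` (length $n$) are placeholder tensors evaluated on dataset samples:

```python
import torch

def infer_w(Phi, b, r, penalty=10.0, steps=2000, lr=1e-2):
    """Sketch of Eq. 10 with a fixed penalty coefficient instead of learned multipliers."""
    w = torch.zeros(Phi.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        m = Phi @ w + b                                   # candidate successor-measure values
        objective = (m * r).sum()                         # linear objective (Phi w + b) . r
        violation = torch.clamp(m, max=0.0).pow(2).sum()  # penalize entries with m < 0
        loss = -objective + penalty * violation
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```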

6 Connections to Successor Features

In this section, we uncover the theoretical connections between PSM and successor features. Successor Features (Barreto et al., 2017), $\psi^{\pi}(s,a)$, are defined as the discounted sum of state features $\varphi(s)$: $\psi^{\pi}(s,a)=\mathbb{E}_{\pi}[\sum_{t}\gamma^{t}\varphi(s_{t})]$. These state features can be used to span reward functions as $r=\varphi z$. Using this construction, the Q function is linear in $z$: $Q(s,a)=\psi^{\pi}(s,a)z$. We can establish a simple relation between $M^{\pi}$ and $\psi^{\pi}$: $\psi^{\pi}(s,a)=\int_{s^{\prime}}M^{\pi}(s,a,s^{\prime})\varphi(s^{\prime})ds^{\prime}$. This connection shows that, like successor measures, successor features can also be represented using a similar basis.
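In the tabular case this relation is a single matrix product, as the short NumPy sketch below illustrates (all arrays are placeholders; `M` is a state-only successor measure with rows indexed by $(s,a)$ and columns by $s^{\prime}$):

```python
import numpy as np

nS, nA, k = 5, 3, 4
rng = np.random.default_rng(4)
M = rng.dirichlet(np.ones(nS), size=nS * nA)    # placeholder successor measure M(s, a, s')
phi = rng.standard_normal((nS, k))              # state features varphi(s')
z = rng.standard_normal(k)                      # reward weights, r = phi @ z

psi = M @ phi                                   # successor features psi(s, a), shape (nS * nA, k)
Q = psi @ z                                     # Q(s, a) = psi(s, a) . z
assert np.allclose(Q, M @ (phi @ z))            # consistent with Q = M r
```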

Theorem 6.1.

Successor Features $\psi^{\pi}(s,a)$ belong to an affine set and can be represented using a linear combination of basis functions and a bias.

Interestingly, instead of learning the basis of successor measures, we show below that PSM can also be used to learn the basis of successor features. While traditional successor feature-based methods assume that the state features $\varphi$ are provided, PSM can be used to jointly learn the successor feature and the state feature. We begin by introducing the following Lemma 6.2 from Touati et al. (2023), which connects a specific decomposition of successor measures to the ability to jointly learn state features and successor representations.

Lemma 6.2.

(Theorem 13 of Touati et al. (2023)) For an offline dataset with density $\rho$, if the successor measure is represented as $M^{\pi}(s,a,s^{+})=\psi^{\pi}(s,a)\varphi(s^{+})\rho(s^{+})$, then $\psi$ is the successor feature $\psi^{\pi}(s,a)$ for the state feature $\varphi(s)^{T}(\mathbb{E}_{\rho}(\varphi\varphi^{T}))^{-1}$.

According to Lemma 6.2, if $M^{\pi}(s,a,s^{+})=\psi^{\pi}(s,a)\varphi(s^{+})\rho(s^{+})$, then the corresponding successor feature is $\psi^{\pi}(s,a)$ and the state feature is $\varphi(s)^{T}(\mathbb{E}_{\rho}(\varphi\varphi^{T}))^{-1}$. PSM represents successor measures as $M^{\pi}(s,a,s^{+})=\phi(s,a,s^{+})w^{\pi}\rho(s^{+})$ (for simplicity, combining the bias into the basis without loss of generality). It can be shown that if the basis learned for the successor measure using PSM, $\phi(s,a,s^{+})$, is represented as a decomposition $\phi_{\psi}(s,a)^{T}\varphi(s^{+})$, then $\phi_{\psi}(s,a)$ forms the basis for successor features for the state feature $\varphi(s)^{T}(\mathbb{E}_{\rho}(\varphi\varphi^{T}))^{-1}$. Formally, we present the following theorem,

Theorem 6.3.

For the PSM representation $M^{\pi}(s,a,s^{+})=\phi(s,a,s^{+})w^{\pi}$ with $\phi(s,a,s^{+})=\phi_{\psi}(s,a)^{T}\varphi(s^{+})$, the successor feature is $\psi^{\pi}(s,a)=\phi_{\psi}(s,a)w^{\pi}$ for the state feature $\varphi(s)^{T}(\mathbb{E}_{\rho}(\varphi\varphi^{T}))^{-1}$.

Thus, successor features can be obtained by enforcing a particular inductive bias to decompose $\phi$ in PSM. For rewards linear in state features ($r(s)=\langle\varphi(s),z\rangle$ for some weights $z$), the $Q$-functions remain linear, given by $Q^{\pi}(s,a)=\phi_{\psi}(s,a)w^{\pi}\mathbb{E}_{\rho}[\varphi(s)z]$. A natural question to ask is: with this decomposition, do we lose the expressibility of PSM compared to the methods that compute bases spanning value functions, thus contradicting Theorem 4.4? The answer is negative, since (1) even though the value function appears to be a linear combination of some basis with weights $w^{\pi}$, these weights are not tied to $z$ or the reward; the relationship between the optimal weights $w^{\pi^{*}}$ and the $z$ defining the reward function is not necessarily linear, as prior works assume; and (2) the decomposition $\phi(s,a,s^{+})=\phi_{\psi}(s,a)\varphi(s^{+})$ reduces the representation capacity of the basis. While prior works are only able to recover features pertaining to this reduced representation capacity, PSM does not assume this decomposition and can learn a larger representation space.

7 Experimental Study

Our experiments evaluate how PSM can be used to encapsulate a task-free MDP into a representation that enables zero-shot inference on any downstream task. In the experiments we investigate a) the quality of value functions learned by PSM, b) the zero-shot performance of PSM in contrast to other baselines, c) the ability to learn general goal-reaching skills arising from the PSM objective on a robot manipulation task, and d) the quality of learned PSM representations in enabling zero-shot RL for continuous state-action space tasks.

Baselines

We compare to the methods that have stood the test of time and perform best: Laplacian features (Wu et al., 2018) and Forward-Backward (Touati et al., 2023). Laplacian features learn features of a state by considering eigenvectors of a graph Laplacian induced by a random walk. These features $\psi(s)\in\mathbb{R}^{d}$, obtained for each state, are used to define a reward function $r(s;\psi)=\psi(s)\cdot z$, where $z$ is sampled uniformly from a unit $d$-dimensional sphere. For each $z$, an optimal policy is pretrained from the dataset on the induced reward function. During inference, the corresponding $z$ for a given reward function is obtained as the solution to the following linear regression: $\min_{z}\mathbb{E}_{s}[(\psi(s)^{\top}z-r(s))^{2}]$. Forward-Backward (FB) learns both the optimal policy and state features jointly for all rewards that are in the linear span of the state features. FB methods typically assume a goal-conditioned prior during pretraining, which typically helps in learning policies that reach various states in the dataset. HILP (Park et al., 2024a) makes two changes to FB: a) it reduces the tasks to goal reaching to learn the features of a state, and b) it uses a more performant offline RL method, IQL (Kostrikov et al., 2021), to learn features. We provide the detailed experimental setup and hyperparameters in Appendix B.3.
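For reference, the inference step of the Laplacian baseline described above is an ordinary least-squares problem; a minimal sketch with placeholder arrays is:

```python
import numpy as np

rng = np.random.default_rng(5)
psi = rng.standard_normal((1000, 64))            # features psi(s) for sampled states (placeholder)
r = rng.standard_normal(1000)                    # reward r(s) evaluated on the same states

# z = argmin_z E_s[(psi(s)^T z - r(s))^2]
z, *_ = np.linalg.lstsq(psi, r, rcond=None)
```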

7.1 Zero shot Value function and Optimal Policy prediction

In this section, we consider goal-conditioned rewards on a discrete gridworld and the classic four-room environment. Since goal-conditioned rewards are state-only reward functions, we learn representations for $M^{\pi}(s,a,s^{+})$ instead of $M^{\pi}(s,a,s^{+},a^{+})$ using the learning objective in Equation 9.

(a) Gridworld
(b) Four-room
Figure 3: Qualitative results on a gridworld and four-room: G denotes the goal sampled for every episode. The black regions are the boundaries/obstacles. The agent needs to navigate across the grid and through the small opening (in case of four-room) to reach the goal. We visualize the optimal Q-functions inferred at test time for the given goal in the image. The arrows denote the optimal policy. (Top row) Results for PSM, (Middle Row) Results for FB, (Bottom row) Results for Laplacian Eigenfunctions.

Task Setup: Both environments have discrete state and action spaces. The action space consists of five actions: $\{up,right,down,left,stay\}$. We collect transitions in the environment by uniformly spawning the agent and taking a random-uniform action. This allows us to form an offline reward-free dataset with full coverage to train $\Phi$ and $b$. During inference, we sample a goal and infer the optimal Q function for the goal. Since the reward function is given by $r(s)=\mathbbm{1}_{s=g}$, the inference is $Q(s,a)=\max_{w}\Phi(s,a,g)w\;\;s.t.\;\;\Phi(s,a,s^{\prime})w+b(s,a,s^{\prime})\geq 0\;\;\forall s,a,s^{\prime}$. Figure 3 shows the Q function and the corresponding optimal policy (when executed from a fixed start state) on the gridworld and the four-room environment. As illustrated clearly, for both environments, the optimal Q function and policy can be obtained zero-shot for any given goal-conditioned downstream task. We observe a 100% success rate on both these tasks.

Figure 4: Quantitative results on FetchReach: The success rates (averaged over 3 seeds) are plotted (along with the standard deviation as shaded) with respect to the training updates for PSM, FB and Laplacian. PSM quickly reaches optimal performance while FB shows instability in maintaining its optimality. Laplacian is far from the optimal performance.

Comparison to baselines: We can draw a couple of conclusions from the visualization of the Q functions inferred by the different methods. First, the Q function learnt by PSM is more sharply concentrated on optimal state-action pairs compared to the two baselines. Both baselines have more uniform value estimates, leaving only a minor differential over state values. Second, the baselines produce far more incorrect optimal actions (represented by the green arrows) compared to PSM.

7.2 Learning zero-shot policies for manipulation

We consider the Fetch-Reach environment with continuous states and discrete actions (Touati & Ollivier, 2021). A dataset of size 1M is constructed using DQN+RND. FB, Laplacian and PSM all use this dataset to learn pretrained objects that can be used for zero-shot RL.

We observe that PSM outperforms baselines FB and Laplacian in its ability to learn a zero-shot policy. One key observation is that PSM learning is stable whereas FB exhibits a drop in performance, likely due to the use of Bellman optimality backups resulting in overestimation bias during training. Laplacian’s capacity to output zero-shot policies is far exceeded by PSM because Laplacian methods construct the graph Laplacian for random policies and may not be able to represent optimal value functions for all rewards.

7.3 Learning Zero-shot Policies for Continuous Control

We use the ExoRL suite (Yarats et al., 2022) to obtain exploratory datasets collected by running RND (Burda et al., 2019). The PSM objective in Equation 9 directly enables learning the basis for successor measures. We decompose the basis representation $\phi(s,a,s^{+})$ into $\phi_{\psi}(s,a)^{T}\varphi(s^{+})$ as discussed in detail in Section 6. PSM thus ensures that $\varphi(s^{+})$ can be used to construct features that span any reward function. Note that this is not a limiting assumption, as the features can be arbitrarily non-linear in states. In these experiments, we compare the ability of PSM to obtain these representations as compared to prior zero-shot RL methods. Additional experimental details can be found in Appendix B.3.

Environment Task Laplace FB HILP PSM
Walker Stand 243.70 ± 151.40 902.63 ± 38.94 607.07 ± 165.28 872.61 ± 38.81
Run 63.65 ± 31.02 392.76 ± 31.29 107.84 ± 34.24 351.50 ± 19.46
Walk 190.53 ± 168.45 877.10 ± 81.05 399.67 ± 39.31 891.44 ± 46.81
Flip 48.73 ± 17.66 206.22 ± 162.27 277.95 ± 59.63 640.75 ± 31.88
Average 136.65 594.67 348.13 689.07
Cheetah Run 96.32 ± 35.69 257.59 ± 58.51 68.22 ± 47.08 276.41 ± 70.23
Run Backward 106.38 ± 29.4 307.07 ± 14.91 37.99 ± 25.16 286.13 ± 25.38
Walk 409.15 ± 56.08 799.83 ± 67.51 318.30 ± 168.42 887.02 ± 59.87
Walk Backward 654.29 ± 219.81 980.76 ± 2.32 349.61 ± 236.29 980.90 ± 2.04
Average 316.53 586.31 193.53 607.61
Quadruped Stand 854.50 ± 41.47 740.05 ± 107.15 409.54 ± 97.59 842.86 ± 82.18
Run 412.98 ± 54.03 386.67 ± 32.53 205.44 ± 47.89 431.77 ± 44.69
Walk 494.56 ± 62.49 566.57 ± 53.22 218.54 ± 86.67 603.97 ± 73.67
Jump 642.84 ± 114.15 581.28 ± 107.38 325.51 ± 93.06 596.37 ± 94.23
Average 601.22 568.64 289.75 618.74
Pointmass Reach Top Left 713.46 ± 58.90 897.83 ± 35.79 944.46 ± 12.94 831.43 ± 69.51
Reach Top Right 581.14 ± 214.79 274.95 ± 197.90 96.04 ± 166.34 730.27 ± 58.10
Reach Bottom Left 689.05 ± 37.08 517.23 ± 302.63 192.34 ± 177.48 451.38 ± 73.46
Reach Bottom Right 21.29 ± 42.54 19.37 ± 33.54 0.17 ± 0.29 43.29 ± 38.40
Average 501.23 427.34 308.25 514.09
Table 1: Comparison (averaged over 5 seeds) of the zero-shot RL performance of different methods with representation size $d=128$. PSM demonstrates a marked improvement over prior methods.

Table 1 compares PSM’s zero-shot performance in continuous state-action spaces to representative methods: Laplacian, FB, and HILP. We note that, to make the comparisons fair, we use the same representation dimension of $d=128$, the same discount factor, and the same inference and policy extraction across environments for a particular method. Overall, PSM demonstrates a marked improvement over baselines across most environments. Further ablations studying the effect of latent dimensionality can be found in Appendix C.

8 Conclusion

In this work, we propose Proto Successor Measures (PSM), a zero-shot RL method that compresses any MDP to allow optimal policy inference for any reward function without additional environmental interactions. This framework marks a step away from the common ideology in RL of solving single tasks optimally and towards pretraining reward-free agents that are able to solve an infinite number of tasks. PSM is based on the principle that successor measures form an affine set and proposes an efficient and mathematically grounded algorithm to extract the basis of this affine set. Our empirical results show that PSM can produce the optimal Q function and the optimal policy for a number of goal-conditioned as well as reward-specified tasks in a number of environments, outperforming prior baselines.

Limitations and Future Work: PSM shows that any MDP can be compressed to a representation space that allows zero-shot RL, but it remains unclear what the size of the representation space should be. A large representation dimension can lead to increased compute requirements and training time with a possible chance of overfitting, while a small representation dimension can fail to capture nuances of environments that have non-smooth dynamics. It is also an interesting future direction to study the impact that dataset coverage has on zero-shot RL performance.

Acknowledgements

The authors would like to thank Scott Niekum, Ahmed Touati and Ben Eysenbach for inspiring discussions during this work. We thank Caleb Chuck, Max Rudolph, and members of MIDI Lab for their feedback on this work. SA, HS, and AZ are supported by NSF 2340651 and ARO W911NF-24-1-0193. This work has in part taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by the National Science Foundation (FAIN-2019844, NRT-2125858), the Office of Naval Research (N00014-18-2243), Army Research Office (W911NF-23-2-0004, W911NF-17-2-0181), DARPA (Cooperative Agreement HR00112520004 on Ad Hoc Teamwork), Lockheed Martin, and Good Systems, a research grand challenge at the University of Texas at Austin. The views and conclusions contained in this document are those of the authors alone. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

References

  • Achiam et al. (2018) Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. CoRR, abs/1807.10299, 2018. URL http://arxiv.org/abs/1807.10299.
  • Agarwal et al. (2021) Siddhant Agarwal, Aaron Courville, and Rishabh Agarwal. Behavior predictive representations for generalization in reinforcement learning. In Deep RL Workshop NeurIPS 2021, 2021. URL https://openreview.net/forum?id=b5PJaxS6Jxg.
  • Agarwal et al. (2024) Siddhant Agarwal, Ishan Durugkar, Peter Stone, and Amy Zhang. f-policy gradients: A general framework for goal-conditioned rl using f-divergences. Advances in Neural Information Processing Systems, 36, 2024.
  • Alegre et al. (2022) Lucas Nunes Alegre, Ana Bazzan, and Bruno C Da Silva. Optimistic linear support and successor features as a basis for optimal policy transfer. In International conference on machine learning, pp. 394–413. PMLR, 2022.
  • Barreto et al. (2017) André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 30, 2017.
  • Bellemare et al. (2019) Marc Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A geometric perspective on optimal representations for reinforcement learning. Advances in neural information processing systems, 32, 2019.
  • Blier et al. (2021a) Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint. arXiv preprint arXiv:2101.07123, 2021a.
  • Blier et al. (2021b) Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint. CoRR, abs/2101.07123, 2021b. URL https://arxiv.org/abs/2101.07123.
  • Burda et al. (2019) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1lJJnR5Ym.
  • Dadashi et al. (2019) Robert Dadashi, Adrien Ali Taiga, Nicolas Le Roux, Dale Schuurmans, and Marc G Bellemare. The value function polytope in reinforcement learning. In International Conference on Machine Learning, pp. 1486–1495. PMLR, 2019.
  • Dayan (1993) Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural computation, 5(4):613–624, 1993.
  • Denardo (1970) Eric V Denardo. On linear programming in a markov decision problem. Management Science, 16(5):281–288, 1970.
  • Eysenbach et al. (2019) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJx63jRqFm.
  • Eysenbach et al. (2022) Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. The information geometry of unsupervised reinforcement learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=3wU2UX0voE.
  • Farebrother et al. (2023) Jesse Farebrother, Joshua Greaves, Rishabh Agarwal, Charline Le Lan, Ross Goroshin, Pablo Samuel Castro, and Marc G Bellemare. Proto-value networks: Scaling representation learning with auxiliary tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=oGDKSt9JrZi.
  • Fawzi et al. (2022) Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.
  • Ghosh et al. (2023) Dibya Ghosh, Chethan Anand Bhateja, and Sergey Levine. Reinforcement learning from passive data via latent intentions. In International Conference on Machine Learning, pp. 11321–11339. PMLR, 2023.
  • Grauman et al. (2022) Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18995–19012, 2022.
  • Hafner et al. (2020) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1lOTC4tDS.
  • Hoang et al. (2021) Christopher Hoang, Sungryull Sohn, Jongwook Choi, Wilka Torrico Carvalho, and Honglak Lee. Successor feature landmarks for long-horizon goal-conditioned reinforcement learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=rD6ulZFTbf.
  • Janner et al. (2019) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
  • Koren (2003) Yehuda Koren. On spectral graph drawing. In International Computing and Combinatorics Conference, pp. 496–508. Springer, 2003.
  • Kostrikov et al. (2021) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
  • Lehnert & Littman (2020) Lucas Lehnert and Michael L. Littman. Successor features combine elements of model-free and model-based reinforcement learning. Journal of Machine Learning Research, 21(196):1–53, 2020. URL http://jmlr.org/papers/v21/19-060.html.
  • Ma et al. (2023) Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=YJ7o2wetJ2.
  • Machado et al. (2017) Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304. PMLR, 2017.
  • Machado et al. (2018) Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Bk8ZcAxR-.
  • Mahadevan (2005) Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pp.  553–560, 2005.
  • Mahadevan & Maggioni (2007) Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A laplacian framework for learning representation and control in markov decision processes. Journal of Machine Learning Research, 8(10), 2007.
  • Manne (1960) Alan S Manne. Linear programming and sequential decisions. Management Science, 6(3):259–267, 1960.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.
  • Nachum & Dai (2020) Ofir Nachum and Bo Dai. Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
  • (33) Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning.
  • Park et al. (2024a) Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with hilbert representations. In Forty-first International Conference on Machine Learning, 2024a. URL https://openreview.net/forum?id=LhNsSaAKub.
  • Park et al. (2024b) Seohong Park, Oleh Rybkin, and Sergey Levine. METRA: Scalable unsupervised RL with metric-aware abstraction. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=c5pwL0Soay.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021. URL https://arxiv.org/abs/2103.00020.
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. URL https://arxiv.org/abs/2102.12092.
  • Reinke & Alameda-Pineda (2021) Chris Reinke and Xavier Alameda-Pineda. Xi-learning: Successor feature transfer learning for general reward functions. CoRR, abs/2110.15701, 2021. URL https://arxiv.org/abs/2110.15701.
  • Schwarzer et al. (2021a) Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=uCQfPZwRaUu.
  • Schwarzer et al. (2021b) Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Charlin, R Devon Hjelm, Philip Bachman, and Aaron C Courville. Pretraining representations for data-efficient reinforcement learning. Advances in Neural Information Processing Systems, 34:12686–12699, 2021b.
  • Sermanet et al. (2017) Pierre Sermanet, Corey Lynch, Jasmine Hsu, and Sergey Levine. Time-contrastive networks: Self-supervised learning from multi-view observation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.  14–15, 2017.
  • Sikchi et al. (2024a) Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, and Scott Niekum. Score models for offline goal-conditioned reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=oXjnwQLcTA.
  • Sikchi et al. (2024b) Harshit Sikchi, Qinqing Zheng, Amy Zhang, and Scott Niekum. Dual RL: Unification and new methods for reinforcement and imitation learning. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=xt9Bu66rqv.
  • Stooke et al. (2021) Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. In International conference on machine learning, pp. 9870–9879. PMLR, 2021.
  • Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. Deepmind control suite. CoRR, abs/1801.00690, 2018. URL http://arxiv.org/abs/1801.00690.
  • Touati & Ollivier (2021) Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34:13–23, 2021.
  • Touati et al. (2023) Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MYEap_OcQI.
  • Warde-Farley et al. (2019) David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1eVMnA9K7.
  • Wu et al. (2018) Yifan Wu, George Tucker, and Ofir Nachum. The laplacian in rl: Learning representations with efficient approximations. arXiv preprint arXiv:1810.04586, 2018.
  • Wu et al. (2019) Yifan Wu, George Tucker, and Ofir Nachum. The laplacian in RL: Learning representations with efficient approximations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJlNpoA5YQ.
  • Wurman et al. (2022) Peter R Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Subramanian, Thomas J Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Florian Fuchs, et al. Outracing champion gran turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.
  • Yarats et al. (2022) Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don’t change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022.
  • Zahavy et al. (2023) Tom Zahavy, Yannick Schroecker, Feryal Behbahani, Kate Baumli, Sebastian Flennerhag, Shaobo Hou, and Satinder Singh. Discovering policies with DOMiNO: Diversity optimization maintaining near optimality. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=kjkdzBW3b8p.

Appendix

Appendix A Theoretical Results

In this section, we present the proofs of all theorems and corollaries stated in Sections 4 and 6.

A.1 Proof of Theorem 4.1

Theorem 4.1.

All possible state-action visitation distributions in an MDP form an affine set.

Proof.

Any state-action visitation distribution $d^{\pi}(s,a)$ must satisfy the Bellman Flow equation:

\sum_{a}d^{\pi}(s,a)=(1-\gamma)\mu(s)+\gamma\sum_{s^{\prime},a^{\prime}}\mathbb{P}(s|s^{\prime},a^{\prime})d^{\pi}(s^{\prime},a^{\prime}).   (11)

This equation can be written in matrix notation as:

\sum_{a}d^{\pi}=(1-\gamma)\mu+\gamma P^{T}d^{\pi}.   (12)

Rearranging the terms,

(S-\gamma P^{T})d^{\pi}=(1-\gamma)\mu,   (13)

where $S$ is the matrix representing $\sum_{a}$, of size $|\mathcal{S}|\times|\mathcal{S}||\mathcal{A}|$, with exactly $|\mathcal{A}|$ entries equal to 1 in each row, at the columns corresponding to the state indexed by that row. This is an affine equation of the form $Ax=b$, whose solution set forms an affine set. Hence all state-action visitation distributions $d^{\pi}$ form an affine set.

In continuous spaces, visitation distributions are represented as functions $d^{\pi}:S\times A\rightarrow\mathbb{R}$ rather than as vectors in $[0,1]^{S\times A}$. The state-action visitation distribution $d^{\pi}(s,a)$ satisfies the following continuous Bellman Flow equation,

\int_{A}d^{\pi}(s,a)\,da=(1-\gamma)\mu(s)+\gamma\int_{S}\int_{A}\mathbb{P}(s|s^{\prime},a^{\prime})d^{\pi}(s^{\prime},a^{\prime})\,ds^{\prime}da^{\prime}.   (14)

This equation is identical to Equation 11 except that the vectors representing distributions are replaced by functions and the discrete operator $\sum$ is replaced by the integral $\int$.

The Bellman Flow operator $T$, acting on $d^{\pi}$, can be defined as

T[d^{\pi}](s)=\int_{A}d^{\pi}(s,a)\,da-\gamma\int_{S}\int_{A}\mathbb{P}(s|s^{\prime},a^{\prime})d^{\pi}(s^{\prime},a^{\prime})\,ds^{\prime}da^{\prime}.   (15)

From Equation 14, $T[d^{\pi}](s)=(1-\gamma)\mu(s)$. Since $T$ is a linear operator, the set of solutions $d^{\pi}(s,a)$ forms an affine set. ∎
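To make this concrete, the following minimal numpy sketch (a randomly generated tabular MDP; the sizes, discount, and seed are illustrative assumptions, not values used in the paper) computes the visitation distributions of two random policies and checks that any affine combination of them still satisfies the linear constraint of Equation 13.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] = P(s' | s, a)
mu = rng.dirichlet(np.ones(nS))                  # initial state distribution

def visitation(pi):
    """d^pi(s, a) of a stochastic policy pi[s, a], computed from the Bellman Flow equation."""
    P_pi = np.einsum('sa,sat->st', pi, P)        # state-to-state transitions under pi
    d_s = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
    return d_s[:, None] * pi                     # d(s, a) = d(s) * pi(a | s)

def flow_residual(d):
    """Left-hand side minus right-hand side of Equation 11; zero iff d satisfies the constraint."""
    return d.sum(axis=1) - (1 - gamma) * mu - gamma * np.einsum('sat,sa->t', P, d)

d1 = visitation(rng.dirichlet(np.ones(nA), size=nS))
d2 = visitation(rng.dirichlet(np.ones(nA), size=nS))
for alpha in (-0.5, 0.3, 1.7):                   # affine weights need not lie in [0, 1]
    assert np.allclose(flow_residual(alpha * d1 + (1 - alpha) * d2), 0.0)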

A.2 Proof of Corollary 4.2

Corollary 4.2.

Any successor measure $M^{\pi}$ in an MDP forms an affine set and so can be represented as $\sum_{i}^{d}\phi_{i}w_{i}^{\pi}+b$, where $\phi_{i}$ and $b$ are independent of the policy $\pi$ and $d$ is the dimension of the affine space.

Proof.

Using Theorem 4.1, we have shown that state-action visitation distributions form affine sets. Similarly, successor measures $M^{\pi}(s,a,s^{+},a^{+})$ are solutions of the Bellman Flow equation:

M^{\pi}(s,a,s^{+},a^{+})=(1-\gamma)\mathbbm{1}[s=s^{+},a=a^{+}]+\gamma\sum_{s^{\prime},a^{\prime}\in\mathcal{SA}}P(s^{+}|s^{\prime},a^{\prime})M^{\pi}(s,a,s^{\prime},a^{\prime})\pi(a^{+}|s^{+}).   (16)

Summing both sides over $a^{+}$ gives an equation very similar to Equation 11, which can be rearranged as

(S-\gamma P^{T})M^{\pi}=(1-\gamma)\mathbbm{1}[s=s^{+}].   (17)

By the same argument as in Theorem 4.1, $M^{\pi}$ also forms an affine set.

Following the previous proof, in continuous spaces $M^{\pi}$ becomes a function $M^{\pi}:S\times A\times S\times A\rightarrow\mathbb{R}$ and the Bellman Flow equation becomes

M^{\pi}(s,a,s^{+},a^{+})=(1-\gamma)p(s=s^{+},a=a^{+})+\gamma\int_{S}\int_{A}P(s^{+}|s^{\prime},a^{\prime})M^{\pi}(s,a,s^{\prime},a^{\prime})\pi(a^{+}|s^{+})\,ds^{\prime}da^{\prime}.   (18)

Integrating both sides over $a^{+}$, the Bellman Flow operator $T$ acting on $M^{\pi}$ can be constructed as

T[M^{\pi}](s,a,s^{+})=\int_{A}M^{\pi}(s,a,s^{+},a^{+})\,da^{+}-\gamma\int_{S}\int_{A}P(s^{+}|s^{\prime},a^{\prime})M^{\pi}(s,a,s^{\prime},a^{\prime})\,ds^{\prime}da^{\prime}   (19)
\implies T[M^{\pi}](s,a,s^{+})=(1-\gamma)p(s=s^{+})   (20)

As $T$ is a linear operator, $M^{\pi}$ belongs to an affine set.

Any element $x$ of an affine set of dimensionality $d$ can be written as $\sum_{i}^{d}\phi_{i}w_{i}+b$, where $\{\phi_{i}\}$ is a basis and $b$ is a bias vector. The basis is given by the null space of the matrix operator $(S-\gamma P^{T})$ ($T$ in the continuous case). Since the operator $(S-\gamma P^{T})$ (respectively $T$) and the vector $(1-\gamma)\mathbbm{1}[s=s^{+}]$ (respectively the function $(1-\gamma)p(s=s^{+})$) are independent of the policy, the basis $\Phi$ and the bias $b$ are also independent of the policy. ∎
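To illustrate the last step, the sketch below (again a randomly generated tabular MDP with illustrative sizes and seed) builds the policy-independent basis as the null space of $(S-\gamma P^{T})$ and the bias as one particular solution, and checks that the visitation distribution of a random policy is recovered as $\Phi w+b$. The successor-measure case is identical, with the right-hand side $(1-\gamma)\mu$ replaced by $(1-\gamma)\mathbbm{1}[s=s^{+}]$.

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))          # P[s, a, s'] = P(s' | s, a)
mu = rng.dirichlet(np.ones(nS))

# The operator (S - gamma P^T) acting on d(s, a), flattened in (state-major, action-minor) order.
S_op = np.kron(np.eye(nS), np.ones(nA))                # sums a visitation vector over actions
P_flat = P.reshape(nS * nA, nS).T                      # P_flat[s', (s, a)] = P(s' | s, a)
A_op = S_op - gamma * P_flat                           # shape (nS, nS * nA)

Phi = null_space(A_op)                                 # policy-independent basis (columns)
b, *_ = np.linalg.lstsq(A_op, (1 - gamma) * mu, rcond=None)   # one particular solution = bias

# Visitation distribution of a random policy, computed directly from the dynamics.
pi = rng.dirichlet(np.ones(nA), size=nS)
P_pi = np.einsum('sa,sat->st', pi, P)
d_s = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
d = (d_s[:, None] * pi).reshape(-1)

w = Phi.T @ (d - b)                                    # null_space returns orthonormal columns
assert np.allclose(Phi @ w + b, d)                     # d lies in the affine set {Phi w + b}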

A.3 Proof of Theorem 4.4

Theorem 4.4.

For the same dimensionality, let $span\{\Phi^{vf}\}$ denote the set of value functions spanned by $\Phi^{vf}$, and let $\{span\{\Phi\}r\}$ denote the set of value functions obtained from the successor measures spanned by $\Phi$. Then $span\{\Phi^{vf}\}\subseteq\{span\{\Phi\}r\}$.

Proof.

We need to show that any element of $span\{\Phi^{vf}\}$ also belongs to the set $\{span\{\Phi\}r\}$, i.e., that any value function of the form $V^{\pi}(s)=\sum_{i}\beta_{i}^{\pi}\Phi_{i}^{vf}(s)$ can also be obtained by applying a reward to some element in the span of $\Phi$.

Assume the special rank-one form $\Phi_{i}(s,s^{\prime})=\sigma_{i}(s)\eta_{i}(s^{\prime})$. Then,

V^{\pi}(s)=\sum_{i}w^{\pi}_{i}\sum_{s^{\prime}}\Phi_{i}(s,s^{\prime})r(s^{\prime})=\sum_{i}\Big[w^{\pi}_{i}\sum_{s^{\prime}}\eta_{i}(s^{\prime})r(s^{\prime})\Big]\sigma_{i}(s).

The two expressions match with $\beta_{i}^{\pi}=w^{\pi}_{i}\sum_{s^{\prime}}\eta_{i}(s^{\prime})r(s^{\prime})$ and $\sigma_{i}(s)=\Phi^{vf}_{i}(s)$. Hence, for every element in the span of $\Phi^{vf}$ there exists a corresponding element in $\{span\{\Phi\}r\}$. ∎

A.4 Proof of Theorem 6.1

Theorem 6.1.

Successor features $\psi^{\pi}(s,a)$ belong to an affine set and can be represented using a linear combination of basis functions and a bias.

Proof.

Given basic state features $\varphi:S\rightarrow\mathbb{R}^{d}$, the successor feature is defined as $\psi^{\pi}(s,a)=\mathbb{E}_{\pi}[\sum_{t}\gamma^{t}\varphi(s_{t+1})]$. It is connected to successor measures via $\psi^{\pi}(s,a)=\sum_{s^{\prime}}M^{\pi}(s,a,s^{\prime})\varphi(s^{\prime})$ (replace $\sum_{s^{\prime}}$ with $\int_{s^{\prime}}$ for continuous domains). In linear-algebraic notation, let $M^{\pi}$ be the $(S\times A)\times S$ matrix representing the successor measure, and let $\Phi_{s}$ be the $S\times d$ matrix whose rows are the features $\varphi$ of each state. The $(S\times A)\times d$ matrix representing $\Psi^{\pi}$ is then given by

\Psi^{\pi}=M^{\pi}\Phi_{s}
\implies\Psi^{\pi}=\sum_{i}\phi_{i}w^{\pi}_{i}\Phi_{s}\quad\quad(M^{\pi}\text{ is affine with policy-independent basis }\phi\text{, absorbing the bias as in Theorem 6.3})
\implies\Psi^{\pi}=\sum_{s^{\prime}}\sum_{i}\phi_{i}(\cdot,\cdot,s^{\prime})w^{\pi}_{i}\varphi(s^{\prime})
\implies\Psi^{\pi}=\sum_{i}\sum_{s^{\prime}}\phi_{i}(\cdot,\cdot,s^{\prime})\varphi(s^{\prime})w^{\pi}_{i}
\implies\Psi^{\pi}=\sum_{i}\phi_{\psi,i}w^{\pi}_{i}\quad\quad(\phi_{\psi,i}=\sum_{s^{\prime}}\phi_{i}(\cdot,\cdot,s^{\prime})\varphi(s^{\prime}))
\implies\Psi^{\pi}=\Phi_{\psi}w^{\pi}

Hence, the successor features are affine with policy-independent basis $\Phi_{\psi}$. ∎

A.5 Proof of Theorem 6.3

Theorem 6.3.

If $M^{\pi}(s,a,s^{+})=\phi(s,a,s^{+})w^{\pi}$ and $\phi(s,a,s^{+})=\phi_{\psi}(s,a)^{T}\phi_{s}(s^{+})$, then the successor feature $\psi^{\pi}(s,a)=\phi_{\psi}(s,a)w^{\pi}$ for the basic feature $\phi_{s}(s)^{T}(\phi_{s}\phi_{s}^{T})^{-1}$.

Proof.

Consider $\phi(s,a,s^{+})\in\mathbb{R}^{d}$ as the set of $d-1$ basis vectors together with the bias, with $w^{\pi}\in\mathbb{R}^{d}$ containing the corresponding $d-1$ combination weights and $w^{\pi}_{d}=1$. From Corollary 4.2, $M^{\pi}(s,a,s^{+})$ can then be represented as $\phi(s,a,s^{+})w^{\pi}$. Further, $\phi(s,a,s^{+})=\phi_{\psi}(s,a)^{T}\phi_{s}(s^{+})$ where $\phi_{\psi}(s,a)\in\mathbb{R}^{d\times d}$ and $\phi_{s}(s^{+})\in\mathbb{R}^{d}$. So,

M^{\pi}(s,a,s^{+})=\sum_{i}\sum_{j}\phi_{\psi}(s,a)_{ij}\phi_{s}(s^{+})_{j}w^{\pi}_{i}
\implies M^{\pi}(s,a,s^{+})=\sum_{j}\sum_{i}\phi_{\psi}(s,a)_{ij}w^{\pi}_{i}\,\phi_{s}(s^{+})_{j}
\implies M^{\pi}(s,a,s^{+})=\sum_{j}\phi_{\psi}(s,a)_{j}^{T}w^{\pi}\,\phi_{s}(s^{+})_{j}
\implies M^{\pi}(s,a,s^{+})=\sum_{j}\psi^{\pi}(s,a)_{j}\,\phi_{s}(s^{+})_{j}\quad\quad(\text{writing }\phi_{\psi}(s,a)^{T}w^{\pi}\text{ as }\psi^{\pi}(s,a))
\implies M^{\pi}(s,a,s^{+})=\psi^{\pi}(s,a)^{T}\phi_{s}(s^{+})

From Lemma 6.2, $\psi^{\pi}(s,a)$ is the successor feature for the basic feature $\phi_{s}(s)^{T}(\phi_{s}\phi_{s}^{T})^{-1}$.

Note: In continuous settings, we can use the dataset marginal density as described in Section 5. The basic features become $\phi_{s}(s)^{T}(\mathbb{E}_{\rho}[\phi_{s}\phi_{s}^{T}])^{-1}$. ∎

A.6 Deriving a basis for the Toy Example

Figure 5: The toy MDP described in Section 4.

Consider the MDP shown in Figure 5. The state-action visitation distribution is written as the vector $d=(d(s_{0},a_{0}),d(s_{1},a_{0}),d(s_{0},a_{1}),d(s_{1},a_{1}))^{T}$. The corresponding dynamics can be written as

P=\begin{bmatrix}0&1&1&0\\ 1&0&0&1\end{bmatrix},

where the rows are indexed by the next state ($s_{0}$, $s_{1}$) and the columns by the previous state-action pair ($(s_{0},a_{0}),(s_{1},a_{0}),(s_{0},a_{1}),(s_{1},a_{1})$, matching the ordering of $d$), i.e., $P_{s,(s^{\prime},a^{\prime})}=\mathbb{P}(s|s^{\prime},a^{\prime})$.

The Bellman Flow equation thus becomes,

\sum_{a}d(s,a)=(1-\gamma)\mu(s)+\gamma\sum_{s^{\prime},a^{\prime}}P(s|s^{\prime},a^{\prime})d(s^{\prime},a^{\prime})
\implies\begin{bmatrix}1&0&1&0\\ 0&1&0&1\end{bmatrix}\begin{pmatrix}d(s_{0},a_{0})\\ d(s_{1},a_{0})\\ d(s_{0},a_{1})\\ d(s_{1},a_{1})\end{pmatrix}=(1-\gamma)\begin{pmatrix}\mu(s_{0})\\ \mu(s_{1})\end{pmatrix}+\gamma\begin{bmatrix}0&1&1&0\\ 1&0&0&1\end{bmatrix}\begin{pmatrix}d(s_{0},a_{0})\\ d(s_{1},a_{0})\\ d(s_{0},a_{1})\\ d(s_{1},a_{1})\end{pmatrix}
\implies\begin{bmatrix}1&-\gamma&1-\gamma&0\\ -\gamma&1&0&1-\gamma\end{bmatrix}\begin{pmatrix}d(s_{0},a_{0})\\ d(s_{1},a_{0})\\ d(s_{0},a_{1})\\ d(s_{1},a_{1})\end{pmatrix}=(1-\gamma)\begin{pmatrix}\mu(s_{0})\\ \mu(s_{1})\end{pmatrix}

This affine system can be solved in closed form using Gaussian elimination to obtain

\begin{pmatrix}d(s_{0},a_{0})\\ d(s_{1},a_{0})\\ d(s_{0},a_{1})\\ d(s_{1},a_{1})\end{pmatrix}=w_{1}\begin{pmatrix}\frac{-1}{1+\gamma}\\ \frac{-\gamma}{1+\gamma}\\ 1\\ 0\end{pmatrix}+w_{2}\begin{pmatrix}\frac{-\gamma}{1+\gamma}\\ \frac{-1}{1+\gamma}\\ 0\\ 1\end{pmatrix}+\begin{pmatrix}\frac{\mu(s_{0})+\gamma\mu(s_{1})}{1+\gamma}\\ \frac{\mu(s_{1})+\gamma\mu(s_{0})}{1+\gamma}\\ 0\\ 0\end{pmatrix}.   (21)
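This closed form can be checked numerically. The sketch below (with illustrative choices $\gamma=0.9$ and a uniform $\mu$) computes the visitation distribution of a randomly drawn policy directly from the dynamics and verifies both the flow constraint and the affine decomposition of Equation 21, where the weights are $w_{1}=d(s_{0},a_{1})$ and $w_{2}=d(s_{1},a_{1})$.

import numpy as np

gamma = 0.9
mu = np.array([0.5, 0.5])                       # uniform initial state distribution (assumed)

# Column order: (s0,a0), (s1,a0), (s0,a1), (s1,a1); action a0 toggles the state, a1 stays.
P = np.array([[0., 1., 1., 0.],                 # P[s, (s', a')] = P(s | s', a')
              [1., 0., 0., 1.]])
S = np.array([[1., 0., 1., 0.],                 # S sums a visitation vector over actions
              [0., 1., 0., 1.]])
A, rhs = S - gamma * P, (1 - gamma) * mu

# Basis and bias from Equation 21.
v1 = np.array([-1 / (1 + gamma), -gamma / (1 + gamma), 1., 0.])
v2 = np.array([-gamma / (1 + gamma), -1 / (1 + gamma), 0., 1.])
b = np.array([(mu[0] + gamma * mu[1]) / (1 + gamma),
              (mu[1] + gamma * mu[0]) / (1 + gamma), 0., 0.])

# Visitation distribution of a random policy, computed directly from the dynamics.
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(2), size=2)          # pi[s, a]
P_pi = np.array([[pi[0, 1], pi[0, 0]],          # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
                 [pi[1, 0], pi[1, 1]]])
d_s = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)
d = np.array([d_s[0] * pi[0, 0], d_s[1] * pi[1, 0],
              d_s[0] * pi[0, 1], d_s[1] * pi[1, 1]])

assert np.allclose(A @ d, rhs)                          # Bellman flow constraint (Equation 13)
assert np.allclose(d, d[2] * v1 + d[3] * v2 + b)        # affine combination from Equation 21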

Appendix B Experimental Details

B.1 Environments

B.1.1 Gridworlds

We build on the https://github.com/facebookresearch/controllable_agent codebase for the gridworld and four-room experiments. The task is to reach a goal state that is randomly sampled at the beginning of every episode. The reward is 0 at all non-goal states and 1 at the goal state. The episode length for these tasks is 200.

The state is represented by the coordinates $(x,y)$, scaled to lie in $[0,1]$. The action space consists of five actions: {up, right, down, left, stay}.

B.1.2 Fetch

We build on top of https://github.com/ahmed-touati/controllable_agent, which contains the Fetch environments with discretized action spaces. The state space is unchanged, but the action space is discretized to produce Manhattan-style movements, i.e., moving along one coordinate at a time. The six discrete actions are mapped to the true continuous actions of Fetch as $\{0:[1,0,0,0],\,1:[0,1,0,0],\,2:[0,0,1,0],\,3:[-1,0,0,0],\,4:[0,-1,0,0],\,5:[0,0,-1,0]\}$. Fetch has an episode length of 50. A minimal sketch of such an action-discretization wrapper is shown below.
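The sketch is illustrative only: it assumes a Gymnasium-style ActionWrapper and a hypothetical wrapper name, and is not the exact wrapper used in the codebase above.

import numpy as np
import gymnasium as gym

# Six discrete actions -> Fetch's 4-D continuous actions, exactly as listed above.
DISCRETE_TO_CONTINUOUS = np.array([
    [ 1.,  0.,  0., 0.], [0.,  1.,  0., 0.], [0., 0.,  1., 0.],
    [-1.,  0.,  0., 0.], [0., -1.,  0., 0.], [0., 0., -1., 0.],
], dtype=np.float32)

class ManhattanActionWrapper(gym.ActionWrapper):
    """Expose a 6-way discrete action space on top of a continuous Fetch env (illustrative)."""

    def __init__(self, env):
        super().__init__(env)
        self.action_space = gym.spaces.Discrete(len(DISCRETE_TO_CONTINUOUS))

    def action(self, act):
        # Translate the discrete index into the corresponding continuous action.
        return DISCRETE_TO_CONTINUOUS[int(act)]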

B.1.3 DM-control environments

Figure 6: DM Control environments. Visual rendering of each of the four DM Control environments we use (from left to right): Walker, Cheetah, Quadruped, Pointmass.

These continuous control environments are discussed at length in the DeepMind Control Suite (Tassa et al., 2018). We use them to evaluate PSM on larger, continuous state and action spaces. The following four environments are used:

Walker: It has a 24-dimensional state space consisting of joint positions and velocities, and a 6-dimensional action space where each action dimension lies in $[-1,1]$. The system represents a planar walker. At test time, we evaluate the following four tasks: Walk, Run, Stand and Flip, each with complex dense rewards.

Cheetah: It has a 17-dimensional state space consisting of joint positions and velocities, and a 6-dimensional action space where each action dimension lies in $[-1,1]$. The system represents a planar biped “cheetah”. At test time, we evaluate the following four tasks: Run, Run Backward, Walk and Walk Backward, each with complex dense rewards.

Quadruped: It has a 78-dimensional state space consisting of joint positions and velocities, and a 12-dimensional action space where each action dimension lies in $[-1,1]$. The system represents a 3-dimensional ant with 4 legs. At test time, we evaluate the following four tasks: Walk, Run, Stand and Jump, each with complex dense rewards.

Pointmass: The environment represents a planar four-room grid with a 4-dimensional state space $(x,y,v_{x},v_{y})$ and a 2-dimensional action space. The four tasks we evaluate are Reach Top Left, Reach Top Right, Reach Bottom Left and Reach Bottom Right, each a goal-reaching task for the center of the corresponding room.

All DM Control tasks have an episode length of 1000.

B.2 Datasets

Gridworld: The exploratory data is collected by spawning the agent uniformly at random and taking a random action. Each of the three methods is trained on this reward-free exploratory data. At test time, a random goal is sampled and each method infers the optimal Q-function.

Fetch: The exploratory data is collected by running DQN (Mnih et al., 2013) with an RND reward (Burda et al., 2019), using the implementation from https://github.com/iDurugkar/adversarial-intrinsic-motivation. 20000 trajectories, each of length 50, are collected.

DM Control: We use publicly available datasets from the ExoRL suite (Yarats et al., 2022), collected using RND exploration.

B.3 Implementation Details

B.3.1 Baselines

We consider a variety of baselines that represent different state-of-the-art approaches to zero-shot reinforcement learning. In particular, we consider Laplacian, Forward-Backward, and HILP.

1. Laplacian (Wu et al., 2018; Koren, 2003): This method constructs the graph Laplacian of the MDP induced by a random policy. Eigenfunctions of this graph Laplacian give a representation $\phi(s)$, or state feature, for each state. These state features are used to learn successor features, which are trained to optimize a family of reward functions $r(s)=\langle\phi(s),z\rangle$, where $z$ is sampled uniformly from a unit hypersphere (the same for all baselines). The reward functions are optimized via TD3.

2. Forward-Backward (Blier et al., 2021a; Touati & Ollivier, 2021; Touati et al., 2023): The forward-backward algorithm takes a slightly different perspective: instead of training a state representation first, it defines a mapping from a reward function to a latent variable ($z=\sum_{s}\phi(s)r(s)$) and sets the optimal policy for that reward function to $\pi_{z}$, i.e., the policy conditioned on the corresponding latent variable $z$. Training to optimize all reward functions in this class allows the state features and successor features to co-emerge. The reward functions are optimized via TD3. (A schematic of this test-time task inference is sketched after this list.)

3. HILP (Park et al., 2024a): Instead of letting the state features co-emerge as in FB, HILP proposes to learn features from offline datasets that are sufficient for goal reaching: two states are close to each other if they are reachable from one another in a few steps under the environment dynamics. HILP uses a specialized offline RL algorithm with different discounting to learn these state features, which could explain its benefit on some datasets where TD3 is not suitable for offline learning.
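All methods above (and PSM) share the same zero-shot protocol at test time: a batch of reward-labelled states is used to infer a task latent $z$, after which the $z$-conditioned policy is rolled out without further learning. The sketch below is schematic only; the function and network names are hypothetical and not taken from the codebase.

import torch

@torch.no_grad()
def infer_task_latent(phi, reward_fn, states):
    # Schematic zero-shot task inference: map a reward function to a latent z
    # through learned state features phi (names here are hypothetical).
    feats = phi(states)                          # (N, d) state features
    rewards = reward_fn(states).reshape(-1, 1)   # (N, 1) rewards labelled at test time
    z = (feats * rewards).mean(dim=0)            # z = E[phi(s) r(s)] over the inference batch
    # Project z onto the unit hypersphere from which z is sampled during training.
    return torch.nn.functional.normalize(z, dim=0)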

Implementation: We build upon the FB codebase https://github.com/facebookresearch/controllable_agent and implement all algorithms under a uniform setup, with the same network architectures and the same hyperparameters for modules shared across algorithms. We keep the same method-agnostic hyperparameters and use the author-suggested method-specific hyperparameters. The hyperparameters for all methods are listed in Table 2:

Table 2: Hyperparameters for baselines and PSM.
Hyperparameter Value
Replay buffer size $5\times 10^{6}$ ($10\times 10^{6}$ for maze)
Representation dimension 128
Batch size 1024
Discount factor $\gamma$ 0.98 (0.99 for maze)
Optimizer Adam
Learning rate $3\times 10^{-4}$
Momentum coefficient for target networks 0.99
Stddev $\sigma$ for policy smoothing 0.2
Truncation level for policy smoothing 0.3
Number of gradient steps $2\times 10^{6}$
Batch size for task inference $10^{4}$
Regularization weight for orthonormality loss (ensures diversity) 1
FB-specific hyperparameters
Hidden units ($F$) 1024
Number of layers ($F$) 3
Hidden units ($b$) 256
Number of layers ($b$) 2
HILP-specific hyperparameters
Hidden units ($\phi$) 256
Number of layers ($\phi$) 2
Hidden units ($\psi$) 1024
Number of layers ($\psi$) 3
Discount factor for $\phi$ 0.96
Discount factor for $\psi$ 0.98 (0.99 for maze)
Loss type Q-loss
PSM-specific hyperparameters
Hidden units ($\phi$, $b$) 1024
Number of layers ($\phi$, $b$) 3
Hidden units ($w$) 1024
Number of layers ($w$) 3
Double GD learning rate $1\times 10^{-4}$

Proto Successor Measures (PSM): PSM differs from the baselines in that it learns richer representations than Laplacian or HILP, since it is neither biased by the behavior policy nor restricted to representations sufficient only for goal reaching. Compared to FB, our representation-learning phase is more stable, as we learn representations via Bellman evaluation backups and do not need Bellman optimality backups. Our approach is therefore not susceptible to the learning instabilities arising from overestimation that are common in deep RL and that make stabilizing FB hard. The hyperparameters are listed in Appendix Table 2.

B.3.2 PSM representation learning pseudocode

# Method of the PSM agent (module-level imports shown for completeness).
import typing as tp

import torch


def psm_loss(
    self,
    obs: torch.Tensor,
    action: torch.Tensor,
    discount: torch.Tensor,
    next_obs: torch.Tensor,
    next_goal: torch.Tensor,
    z: torch.Tensor,
    step: int,
) -> tp.Dict[str, float]:
    metrics: tp.Dict[str, float] = {}
    # Build a batch_size x batch_size grid of (transition, goal) pairs so that
    # M^pi(s, a, s+) is learned for every goal s+ in the batch.
    idx = torch.arange(obs.shape[0]).to(obs.device)
    mesh = torch.stack(torch.meshgrid(idx, idx, indexing='xy')).T.reshape(-1, 2)
    m_obs = obs[mesh[:, 0]]
    m_next_obs = next_obs[mesh[:, 0]]
    m_action = action[mesh[:, 0]]
    m_next_goal = next_goal[mesh[:, 1]]

    # TD target for M^pi under the z-conditioned policy.
    # NOTE: next_actions is assumed to be produced by the z-conditioned policy
    # evaluated at m_next_obs (that step is omitted from this snippet).
    with torch.no_grad():
        target_phi, target_b = self.psm_target(m_next_obs, m_next_goal)
        target_w = self.w_target(z)
        target_phi = target_phi[torch.arange(target_phi.shape[0]), next_actions.squeeze(1)]
        target_b = target_b[torch.arange(target_b.shape[0]), next_actions.squeeze(1)]
        target_M = torch.einsum("sd, sd -> s", target_phi, target_w) + target_b

    # Current estimate M(s, a, s+) = phi(s, a, s+) w(z) + b(s, a, s+).
    phi, b = self.psm(m_obs, m_next_goal)
    phi = phi[torch.arange(phi.shape[0]), m_action.squeeze(1)]
    b = b[torch.arange(b.shape[0]), m_action.squeeze(1)]
    M = torch.einsum("sd, sd -> s", phi, self.w(z)) + b
    M = M.reshape(obs.shape[0], obs.shape[0])
    target_M = target_M.reshape(obs.shape[0], obs.shape[0])

    # Off-diagonal entries carry the squared Bellman error; diagonal entries (goal drawn
    # from the same transition) carry the (1 - gamma) term of the Bellman flow equation.
    I = torch.eye(*M.size(), device=M.device)
    off_diag = ~I.bool()
    psm_offdiag: tp.Any = 0.5 * (M - discount * target_M)[off_diag].pow(2).mean()
    psm_diag: tp.Any = -((1 - discount) * (M.diag().unsqueeze(1))).mean()
    psm_loss = psm_offdiag + psm_diag

    # Optimize the representation (phi, b, w) and the actor.
    self.opt.zero_grad(set_to_none=True)
    self.actor_opt.zero_grad(set_to_none=True)
    psm_loss.backward()
    self.opt.step()
    self.actor_opt.step()

    metrics["psm_loss"] = float(psm_loss.detach())
    return metrics

Compute: All our experiments were run on Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz CPUs and NVIDIA GeForce GTX TITAN GPUs. Each training run took around 10-12 hours.

Appendix C Additional Experiments

C.1 Ablation on the dimension of the affine space $d$

We perform the experiments described in Section 7.3 for two of the continuous environments while varying the dimensionality of the affine space (and of the corresponding successor feature in the inductive construction), $d$. Interestingly, the performance of PSM does not change much across values of $d$ ranging from 32 to 256. This is in contrast to methods like HILP, which see a significant drop in performance when $d$ is modified.

Environment Task $d=32$ $d=50$ $d=128$ $d=256$
Walker Stand 898.98 ± 48.64 942.85 ± 19.43 872.61 ± 38.81 911.25 ± 32.86
Run 359.51 ± 70.66 392.76 ± 31.29 351.50 ± 19.46 372.39 ± 41.29
Walk 825.66 ± 60.14 822.39 ± 60.14 891.44 ± 46.81 886.03 ± 28.96
Flip 628.92 ± 94.95 521.78 ± 29.06 640.75 ± 31.88 593.78 ± 27.14
Average 678.27 669.45 689.07 690.86
Cheetah Run 298.98 ± 95.63 386.75 ± 55.79 276.41 ± 70.23 268.91 ± 79.07
Run Backward 295.43 ± 19.72 260.13 ± 24.93 286.13 ± 25.38 290.89 ± 14.36
Walk 942.12 ± 84.25 893.89 ± 91.69 887.02 ± 59.87 920.50 ± 68.98
Walk Backward 978.64 ± 8.74 916.68 ± 124.34 980.90 ± 2.04 982.29 ± 0.70
Average 628.79 615.61 607.61 615.64
Table 3: Comparison (averaged over 5 seeds) between different representation sizes (affine-space dimensionality $d$) for PSM.

C.2 Quantitative Results on Gridworld and Discrete Maze

We provide quantitative results for the experiments performed in Section 7.1.

Environment Laplacian FB PSM
Gridworld 19.28 ± 2.34 14.53 ± 0.68 2.05 ± 1.20
Discrete Maze 38.47 ± 7.01 28.80 ± 10.50 11.54 ± 1.07
Table 4: Average error (averaged over 3 seeds) of the policy inferred by each zero-shot RL method with respect to the oracle optimal policy.

Quantitative Experiment Description: For each randomly sampled goal, we obtain the inferred value function and the inferred policy using PSM and the baselines. At every state in the discrete space, we check whether the policy inferred by each algorithm is optimal. The oracle (optimal) policy is obtained by running the Bellman-Ford algorithm on the discrete gridworld or maze. We report (in Table 4) the average error (# incorrect policy predictions / total # of states) over 10 randomly sampled goals and 3 seeds.
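For reference, this error metric amounts to the following small helper (a schematic sketch, not the exact evaluation code): it counts the fraction of states whose inferred action is not among the oracle-optimal actions.

def policy_error(inferred_actions, optimal_action_sets):
    # inferred_actions: one action per state, as predicted by the zero-shot method.
    # optimal_action_sets: for each state, the set of optimal actions given by the oracle
    # (Bellman-Ford on the discrete gridworld or maze).
    wrong = sum(a not in opt for a, opt in zip(inferred_actions, optimal_action_sets))
    return wrong / len(inferred_actions)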

As clearly seen, the average error for PSM is significantly lower than that of the baselines, which corroborates the qualitative results presented in Section 7.1.