
Control-Aware Representations for Model-based Reinforcement Learning

Brandon Cui (Facebook AI Research), Yinlam Chow (Google Research), Mohammad Ghavamzadeh (Google Research)
(April 2020)
Abstract

A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning. We derive three implementations of CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.

1 Introduction

Control of non-linear dynamical systems is a key problem in control theory. Many methods have been developed, with varying degrees of success, for different classes of such problems. The majority of these methods assume that a model of the system is known and that the underlying state of the system is low-dimensional and observable. These requirements limit the use of such techniques in controlling dynamical systems from high-dimensional raw sensory data (e.g., images and audio), where the system dynamics is unknown, a scenario often encountered in modern reinforcement learning (RL).

Recent years have witnessed the rapid development of a large arsenal of model-free RL algorithms, such as DQN (Mnih et al., 2013), TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), with impressive success in solving high-dimensional control problems. However, most of this success has been limited to simulated environments (e.g., computer games), mainly because these algorithms often require a large number of samples from the environment. This restricts their applicability to real-world physical systems, for which data collection is often a difficult process. On the other hand, model-based RL algorithms, such as PILCO (Deisenroth and Rasmussen, 2011), MBPO (Janner et al., 2019), and Visual Foresight (Ebert et al., 2018), despite their success, still struggle with the difficulty of learning a model (dynamics) in a high-dimensional (pixel) space.

To address this issue, a class of algorithms has been developed that learns a low-dimensional latent (embedding) space and a latent model (dynamics), and then uses this model to control the system in the latent space. This class has been referred to as learning controllable embedding (LCE) and includes recently developed algorithms such as E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), SOLAR (Zhang et al., 2019), PCC (Levine et al., 2020), Dreamer (Hafner et al., 2020), and PC3 (Shu et al., 2020). Two properties are particularly important in designing LCE models and algorithms. The first is to learn a representation that is most suitable for the control process at hand. This suggests incorporating the control algorithm into the representation learning process. This view of learning control-aware representations is aligned with the value-aware and policy-aware model learning frameworks, VAML (Farahmand, 2018) and PAML (Abachi et al., 2020), that have recently been proposed in model-based RL. The second is to interleave the representation learning and control processes, and to update both using a unifying objective function. This allows for an end-to-end framework for representation learning and control.

LCE methods, such as SOLAR and Dreamer, have taken steps towards the second objective by performing representation learning and control in an online fashion. This is in contrast to offline methods, such as E2C, RCE, PCC, and PC3, which learn a representation once and then use it for the entire control process. On the other hand, methods like PCC and PC3 address the first objective by adding a term to their representation learning loss that accounts for the curvature of the latent dynamics. This term regularizes the representation towards smoother latent dynamics, which are better suited to the locally-linear controllers, e.g., iLQR (Li and Todorov, 2004), used by these methods.

In this paper, we take a few steps towards the above two objectives. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration (PI) style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that exhibits a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning, proposed in Levine et al. (2020). We derive three implementations of CARL: offline, online, and value-guided. Similar to offline LCE methods, such as E2C, RCE, PCC, and PC3, in offline CARL we first learn a representation and then use it for the entire control process. However, in offline CARL, we replace the locally-linear control algorithm (e.g., iLQR) used by these LCE methods with a PI-style (actor-critic) RL algorithm. Our choice of RL algorithm is the model-based implementation of soft actor-critic (SAC) (Haarnoja et al., 2018). Our experiments show significant performance improvement by replacing iLQR with SAC. Online CARL is an iterative algorithm in which, at each iteration, we first learn a latent representation by minimizing the CARL loss, and then perform several policy updates using SAC in this latent space. Our experiments with online CARL show further performance gain over its offline version. Finally, in value-guided CARL (V-CARL), we optimize a weighted version of the CARL loss function, in which the weights depend on the TD-error of the current policy. This further incorporates the control algorithm into the representation learning process. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.

2 Problem Formulation

We are interested in learning control policies for non-linear dynamical systems, where the states $s\in\mathcal{S}\subseteq\mathbb{R}^{n_{s}}$ are not fully observed and we only have access to their high-dimensional observations $x\in\mathcal{X}\subseteq\mathbb{R}^{n_{x}},\;n_{x}\gg n_{s}$. This scenario captures many practical applications in which we interact with a system only through high-dimensional sensory signals, such as image and audio. We assume that the observations $x$ have been selected such that we can model the system in the observation space using a Markov decision process (MDP) $\mathcal{M}_{\mathcal{X}}=\langle\mathcal{X},\mathcal{A},r,P,\gamma\rangle$ (one way to ensure observations are Markovian is to buffer them for several time steps; Mnih et al., 2013), where $\mathcal{X}$ and $\mathcal{A}$ are the observation and action spaces; $r:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function with maximum value $R_{\max}$, defined by the designer of the system to achieve the control objective (for example, in a goal-tracking problem in which the agent aims at finding the shortest path to reach the observation goal $x_{g}$, the observation corresponding to the goal state $s_{g}$, we may define the reward for each observation $x$ as the negative of its distance to $x_{g}$, i.e., $-\|x-x_{g}\|^{2}$); $P:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{P}(\mathcal{X})$ is the unknown transition kernel; and $\gamma\in(0,1)$ is the discount factor. Our goal is to find a mapping from observations to control signals, $\mu:\mathcal{X}\rightarrow\mathbb{P}(\mathcal{A})$, with maximum expected return, i.e., $J(\mu)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(x_{t},a_{t})\mid P,\mu]$.
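As a concrete instance of the goal-tracking reward mentioned above, a minimal sketch, assuming the observation is handled as a flat array:

```python
import numpy as np

def goal_tracking_reward(x, x_goal):
    """r(x) = -||x - x_goal||^2: negative squared distance to the goal observation."""
    x, x_goal = np.asarray(x, dtype=float), np.asarray(x_goal, dtype=float)
    return -float(np.sum((x - x_goal) ** 2))
```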

Since the observations $x$ are high-dimensional and the observation dynamics $P$ is unknown, solving the control problem in the observation space may not be efficient. As discussed in Section 1, the class of learning controllable embedding (LCE) algorithms addresses this issue by learning a low-dimensional latent (embedding) space $\mathcal{Z}\subseteq\mathbb{R}^{n_{z}},\;n_{z}\ll n_{x}$ and a latent state dynamics, and controlling the system there. The main idea behind LCE is to learn an encoder $E:\mathcal{X}\rightarrow\mathbb{P}(\mathcal{Z})$, a latent space dynamics $F:\mathcal{Z}\times\mathcal{A}\rightarrow\mathbb{P}(\mathcal{Z})$, and a decoder $D:\mathcal{Z}\rightarrow\mathbb{P}(\mathcal{X})$ (some recent LCE models, such as PC3 (Shu et al., 2020), advocate latent models without a decoder; although we are aware of the merits of such an approach, we use a decoder in the models proposed in this paper), such that a good or optimal controller (policy) in $\mathcal{Z}$ minimizes the expected loss in the observation space $\mathcal{X}$. This means that if we model the control problem in $\mathcal{Z}$ as an MDP $\mathcal{M}_{\mathcal{Z}}=\langle\mathcal{Z},\mathcal{A},\bar{r},F,\gamma\rangle$ and solve it using a model-based RL algorithm to obtain a policy $\pi:\mathcal{Z}\rightarrow\mathbb{P}(\mathcal{A})$, the image of $\pi$ in the observation space, i.e., $(\pi\circ E)(a|x)=\int_{z}dE(z|x)\pi(a|z)$, should have a high return. Thus, the loss function to learn $\mathcal{Z}$ and $(E,F,D)$ from observations $\{(x_{t},a_{t},r_{t},x_{t+1})\}$ should be designed to comply with this goal.
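To make the $(E,F,D)$ parameterization concrete, a minimal sketch of one possible instantiation as conditional diagonal Gaussians in PyTorch; the architectures, layer sizes, and the diagonal-Gaussian choice are our assumptions, not the paper's.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Independent

def gaussian_head(net_out, dim):
    """Split a network output into the mean and log-std of a diagonal Gaussian."""
    mean, log_std = net_out.chunk(2, dim=-1)
    return Independent(Normal(mean, log_std.clamp(-5, 2).exp()), 1)

class Encoder(nn.Module):          # E : X -> P(Z)
    def __init__(self, x_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))
        self.z_dim = z_dim
    def forward(self, x):
        return gaussian_head(self.net(x), self.z_dim)

class LatentDynamics(nn.Module):   # F : Z x A -> P(Z)
    def __init__(self, z_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))
        self.z_dim = z_dim
    def forward(self, z, a):
        return gaussian_head(self.net(torch.cat([z, a], dim=-1)), self.z_dim)

class Decoder(nn.Module):          # D : Z -> P(X)
    def __init__(self, z_dim, x_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * x_dim))
        self.x_dim = x_dim
    def forward(self, z):
        return gaussian_head(self.net(z), self.x_dim)
```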

This is why, in this paper, we propose an LCE framework that tries to incorporate the control algorithm used in the latent space in the representation learning process. We call this model control-aware representation learning (CARL). In CARL, we set the class of control (RL) algorithms used in the latent space to approximate policy iteration (PI), and more specifically to soft actor-critic (SAC) (Haarnoja et al., 2018). Before describing CARL in detail in the following sections, we present a number of useful definitions and notations here.

For any policy $\mu$ in $\mathcal{X}$, we define its value function $U_{\mu}$ and Bellman operator $T_{\mu}$ as

U_{\mu}(x)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{\mu}(x_{t})\mid P_{\mu},x_{0}=x],\qquad\qquad T_{\mu}[U](x)=\mathbb{E}_{x^{\prime}\sim P_{\mu}(\cdot|x)}[r_{\mu}(x)+\gamma U(x^{\prime})],   (1)

for all $x\in\mathcal{X}$ and $U:\mathcal{X}\rightarrow\mathbb{R}$, where $r_{\mu}(x)=\int_{a}d\mu(a|x)r(x,a)$ and $P_{\mu}(x^{\prime}|x)=\int_{a}d\mu(a|x)P(x^{\prime}|x,a)$ are the reward function and dynamics induced by $\mu$.

Similarly, for any policy $\pi$ in $\mathcal{Z}$, we define its induced reward function and dynamics as $\bar{r}_{\pi}(z)=\int_{a}d\pi(a|z)\bar{r}(z,a)$ and $F_{\pi}(z^{\prime}|z)=\int_{a}d\pi(a|z)F(z^{\prime}|z,a)$. We also define its value function $V_{\pi}$ and Bellman operator $T_{\pi}$, for all $z\in\mathcal{Z}$ and $V:\mathcal{Z}\rightarrow\mathbb{R}$, as

V_{\pi}(z)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}\bar{r}_{\pi}(z_{t})\mid F_{\pi},z_{0}=z],\qquad T_{\pi}[V](z)=\mathbb{E}_{z^{\prime}\sim F_{\pi}(\cdot|z)}[\bar{r}_{\pi}(z)+\gamma V(z^{\prime})].   (2)

For any policy $\pi$ and value function $V$ in the latent space $\mathcal{Z}$, we denote by $\pi\circ E$ and $V\circ E$ their images in the observation space $\mathcal{X}$, given encoder $E$, and define them as

(\pi\circ E)(a|x)=\int_{z}dE(z|x)\pi(a|z),\qquad\qquad(V\circ E)(x)=\int_{z}dE(z|x)V(z).   (3)
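As an illustration of (3), a hypothetical sketch of lifting a latent policy and value function to the observation space by sampling from the encoder (a Monte-Carlo estimate of the integrals; the `encoder`, `policy`, and `value` interfaces follow the sketch above and are assumptions):

```python
def policy_image(encoder, policy, x):
    """(pi o E)(.|x): encode the observation, then act with the latent policy."""
    z = encoder(x).sample()              # z ~ E(.|x)
    return policy(z)                     # the action distribution pi(.|z)

def value_image(encoder, value, x, n_samples=10):
    """(V o E)(x): Monte-Carlo estimate of int dE(z|x) V(z)."""
    z = encoder(x).sample((n_samples,))  # shape [n_samples, batch, z_dim]
    return value(z).mean(dim=0)
```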

3 A Control Perspective for the CARL Model

In this section, we formulate our LCE model, which we refer to as control-aware representation learning (CARL). As described in Section 2, CARL is a model for learning a low-dimensional latent space $\mathcal{Z}$ and the latent dynamics, from data generated in the observation space $\mathcal{X}$, such that this representation is suitable to be used by a policy iteration (PI) algorithm in $\mathcal{Z}$. In order to derive the loss function used by CARL to learn $\mathcal{Z}$ and its dynamics, i.e., $(E,F,D,\bar{r})$, we first describe how representation learning can be interleaved with PI in $\mathcal{Z}$. Algorithm 1 contains the pseudo-code of the resulting algorithm, which we refer to as latent space learning policy iteration (LSLPI).

Each iteration $i$ of LSLPI starts with a policy $\mu^{(i)}$ in the observation space $\mathcal{X}$, which is the mapping of the improved policy in $\mathcal{Z}$ from iteration $i-1$, i.e., $\pi_{+}^{(i-1)}$, back into $\mathcal{X}$ through the encoder $E^{(i-1)}$ (Lines 6 and 7). We then compute $\pi^{(i)}$, the current policy in $\mathcal{Z}$, as the image of $\mu^{(i)}$ in $\mathcal{Z}$ through the encoder $E^{(i)}$ (Line 4). Note that $E^{(i)}$ is the encoder learned at the end of iteration $i-1$ (Line 8). We then use the latent space dynamics $F^{(i)}$ learned at the end of iteration $i-1$ (Line 8), and first compute the value function of $\pi^{(i)}$ in the policy evaluation or critic step, i.e., $V^{(i)}=V_{\pi^{(i)}}$ (Line 5), and then use $V^{(i)}$ to compute the improved policy $\pi_{+}^{(i)}$ as the greedy policy w.r.t. $V^{(i)}$, i.e., $\pi_{+}^{(i)}=\mathcal{G}[V^{(i)}]$, in the policy improvement or actor step (Line 6). Using the samples in the buffer $\mathcal{D}$, together with the current policies in $\mathcal{Z}$, i.e., $\pi^{(i)}$ and $\pi_{+}^{(i)}$, we learn the new representation $(E^{(i+1)},F^{(i+1)},D^{(i+1)},\bar{r}^{(i+1)})$ (Line 8). Finally, we generate samples $\mathcal{D}^{(i+1)}$ by following $\mu^{(i+1)}$, the image of the improved policy $\pi_{+}^{(i)}$ back in $\mathcal{X}$ using the old encoder $E^{(i)}$ (Line 7), add them to the buffer $\mathcal{D}$ (Line 9), and the algorithm iterates. It is important to note that both the critic and the actor operate in the low-dimensional latent space $\mathcal{Z}$.

Algorithm 1 Latent Space Learning with Policy Iteration (LSLPI)
1:  Inputs: $E^{(0)}$, $F^{(0)}$, $D^{(0)}$;
2:  Initialization: $\mu^{(0)}=$ random policy;     $\mathcal{D}\leftarrow$ samples generated from $\mu^{(0)}$;
3:  for $i=0,1,\ldots$ do
4:     Compute $\pi^{(i)}$ such that $\mu^{(i)}=\pi^{(i)}\circ E^{(i)}$; # set $\pi^{(i)}$ to be the image of $\mu^{(i)}$ in the latent space
5:     Compute the value function of $\pi^{(i)}$ and set $V^{(i)}=V_{\pi^{(i)}}$; # policy evaluation (critic)
6:     Compute the greedy policy w.r.t. $V^{(i)}$ and set $\pi_{+}^{(i)}=\mathcal{G}[V^{(i)}]$; # policy improvement (actor)
7:     Set $\mu^{(i+1)}=\pi_{+}^{(i)}\circ E^{(i)}$; # project the improved policy $\pi_{+}^{(i)}$ back into the observation space
8:     Learn $(E^{(i+1)},F^{(i+1)},D^{(i+1)},\bar{r}^{(i+1)})$ from $\mathcal{D}$, $\pi^{(i)}$, and $\pi_{+}^{(i)}$; # representation learning
9:     Generate samples $\mathcal{D}^{(i+1)}=\{(x_{t},a_{t},r_{t},x^{\prime}_{t})\}_{t=1}^{n}$ from $\mu^{(i+1)}$;     $\mathcal{D}\leftarrow\mathcal{D}\cup\mathcal{D}^{(i+1)}$;
10:  end for
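For readers who prefer code to pseudo-code, a schematic (and purely hypothetical) Python rendering of the LSLPI loop; all helper callables (`rollout`, `project_policy`, `policy_eval`, `policy_improve`, `learn_representation`, `random_policy`) are assumptions standing in for the components described above, not part of the paper.

```python
def lslpi(E, F, D, r_bar, random_policy, rollout, project_policy,
          policy_eval, policy_improve, learn_representation, num_iters=20):
    """Latent Space Learning with Policy Iteration (Algorithm 1), schematically."""
    mu = random_policy                          # mu^(0): random observation-space policy
    data = rollout(mu)                          # initial buffer D
    for _ in range(num_iters):
        pi = project_policy(mu, E)              # Line 4: mu^(i) = pi^(i) o E^(i)
        V = policy_eval(pi, F, r_bar)           # Line 5: policy evaluation (critic)
        pi_plus = policy_improve(V, F, r_bar)   # Line 6: policy improvement (actor)
        # Line 7: mu^(i+1) = pi_+^(i) o E^(i); default args freeze the old encoder
        mu = lambda x, pi=pi_plus, E=E: pi(E(x).sample())
        E, F, D, r_bar = learn_representation(data, pi, pi_plus)   # Line 8
        data = data + rollout(mu)               # Line 9: grow the buffer
    return mu
```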

LSLPI is a PI algorithm in $\mathcal{Z}$. However, what is desired is that it also acts as a PI algorithm in $\mathcal{X}$, i.e., it results in (monotonic) policy improvement in $\mathcal{X}$: $U_{\mu^{(i+1)}}\geq U_{\mu^{(i)}}$. Therefore, we define the representation learning loss function in CARL such that it ensures that LSLPI also results in policy improvement in $\mathcal{X}$. The following theorem, whose proof is reported in Appendix B, shows the relationship between the value functions of two consecutive policies generated by LSLPI in $\mathcal{X}$.

Theorem 1.

Let $\mu$, $\mu_{+}$, $\pi$, $\pi_{+}$, and $(E,F,D,\bar{r})$ be the policies $\mu^{(i)}$, $\mu^{(i+1)}$, $\pi^{(i)}$, $\pi_{+}^{(i)}$, and the learned latent representation $(E^{(i+1)},F^{(i+1)},D^{(i+1)},\bar{r}^{(i+1)})$ at iteration $i$ of the LSLPI algorithm (Algorithm 1). Then, the following holds for the value functions of $\mu$ and $\mu_{+}$:

U_{\mu_{+}}(x)\geq U_{\mu}(x)-\Big{(}\frac{\gamma}{1-\gamma}\sum_{\widetilde{\pi}\in\{\pi,\pi_{+}\}}\Delta(E,F,D,\bar{r},\widetilde{\pi},x)+\frac{R_{\max}}{1-\gamma}\cdot\underbrace{D_{\text{KL}}\big{(}(\pi\circ E)(\cdot|x)\;||\;\mu(\cdot|x)\big{)}}_{\mathrm{L_{reg}(E,E_{-},\pi,x)}}\Big{)},   (4)

for all $x\in\mathcal{X}$, where the error term $\Delta$ for a policy $\pi$ is given by

\Delta(E,F,D,\bar{r},\pi,x)=\frac{R_{\max}}{1-\gamma}\overbrace{\sqrt{\frac{-1}{2}\int_{z}dE(z|x)\log D(x|z)}}^{(\mathrm{I})=\mathrm{L_{ed}(E,D,x)}}\;\;+\;\;2\overbrace{\big{|}r_{\pi\circ E}(x)-\int_{z}dE(z|x)\bar{r}_{\pi}(z)\big{|}}^{(\mathrm{II})=\mathrm{L_{r}(E,\bar{r},\pi,x)}}   (5)
+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\Big{(}\underbrace{\sqrt{D_{\text{KL}}\big{(}P_{\pi\circ E}(\cdot|x)\;||\;(D\circ F_{\pi}\circ E)(\cdot|x)\big{)}}}_{(\mathrm{III})=\mathrm{L_{p}(E,F,D,\pi,x)}}+\underbrace{\sqrt{D_{\text{KL}}\big{(}(E\circ P_{\pi\circ E})(\cdot|x)\;||\;(F_{\pi}\circ E)(\cdot|x)\big{)}}}_{(\mathrm{IV})}\Big{)}.

It is easy to see that LSLPI guarantees (policy) improvement in $\mathcal{X}$ if the terms in the parentheses on the RHS of (4) are zero. We now describe these terms. The last term on the RHS of (4) is the KL divergence between $\pi^{(i)}\circ E$ and $\mu^{(i)}=\pi^{(i)}\circ E^{(i)}$. This term can be seen as a regularizer that keeps the new encoder $E$ close to the old one $E^{(i)}$. The four terms in (5) are: (I) the encoding-decoding error that ensures $x\approx(D\circ E)(x)$; (II) the error that measures the mismatch between the reward of taking an action according to policy $\pi\circ E$ at $x\in\mathcal{X}$, and the reward of taking an action according to policy $\pi$ at the image of $x$ in $\mathcal{Z}$ under $E$; (III) the error in predicting the next observation through paths in $\mathcal{X}$ and $\mathcal{Z}$, i.e., the error between $x^{\prime}$ and $\hat{x}^{\prime}$ shown in Fig. 1(a); and (IV) the error in predicting the next latent state through paths in $\mathcal{X}$ and $\mathcal{Z}$, i.e., the error between $z^{\prime}$ and $\tilde{z}^{\prime}$ shown in Fig. 1(b).

Figure 1: (a) Paths from the current observation $x$ to the next one, (left) in $\mathcal{X}$ and (right) through $\mathcal{Z}$. (b) Paths from the current observation $x$ to the next latent state, (left) through $\mathcal{X}$ followed by encoding and (right) starting with encoding and through $\mathcal{Z}$.

Representation Learning in CARL:   Theorem 1 provides us with a recipe (loss function) to learn the latent space $\mathcal{Z}$ and $(E,F,D,\bar{r})$. In CARL, we propose to learn a representation for which the terms in the parentheses on the RHS of (4) are small. As mentioned earlier, the second term, $L_{\text{reg}}(E,E_{-},\pi,x)$, can be considered as a regularizer that keeps the new encoder $E$ close to the old one $E_{-}$. Term (II), which measures the mismatch between rewards, can be kept small, or even zero, if the designer of the system selects the rewards in a compatible way. Although CARL allows us to learn a reward function in the latent space, similar to several other LCE works (Watter et al., 2015; Banijamali et al., 2018; Levine et al., 2020; Shu et al., 2020), in this paper we assume that a compatible reward function is given. Terms (III) and (IV) are the equivalent of the prediction and consistency terms in PCC (Levine et al., 2020) for a particular latent space policy $\pi$. Since PCC has been designed for an offline setting (i.e., one-shot representation learning and control), its prediction and consistency terms are independent of a particular policy and are defined for state-action pairs. CARL, in contrast, is designed for an online setting (i.e., interleaving representation learning and control), and thus its loss function at each iteration depends on the current latent space policies $\pi$ and $\pi_{+}$. As we will see in Section 4.1, our offline implementation of CARL uses a loss function very similar to that of PCC. Note that (IV) is slightly different from the consistency term in PCC. However, if we upper-bound it using Jensen's inequality as $(\mathrm{IV})\leq L_{\text{c}}(E,F,\pi,x):=\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\big{(}E(\cdot|x^{\prime})\;||\;(F_{\pi}\circ E)(\cdot|x)\big{)}$, we obtain the loss term $L_{\text{c}}(E,F,\pi,x)$, which is very similar to the consistency term in PCC. Similar to PCC, we also add a curvature loss to the loss function of CARL to encourage smoother latent space dynamics $F_{\pi}$. Putting all these terms together, we obtain the CARL loss function as

(E^{*},F^{*},D^{*})\in\operatorname*{arg\,min}_{(E,F,D)}\sum_{x\sim\mathcal{D}}\lambda_{\text{ed}}L_{\text{ed}}(E,D,x)+\lambda_{\text{p}}L_{\text{p}}(E,F,D,\pi,x)+\lambda_{\text{c}}L_{\text{c}}(E,F,\pi,x)
+\lambda_{\text{cur}}L_{\text{cur}}(F,\pi,x)+\lambda_{\text{reg}}L_{\text{reg}}(E,E_{-},\pi,x),   (6)

where $(\lambda_{\text{ed}},\lambda_{\text{p}},\lambda_{\text{c}},\lambda_{\text{cur}},\lambda_{\text{reg}})$ are hyper-parameters of the algorithm, $(L_{\text{ed}},L_{\text{p}})$ are the encoding-decoding and prediction losses defined in (5), $L_{\text{c}}$ is the consistency loss defined above, $L_{\text{cur}}$ is the curvature loss, and $L_{\text{reg}}$ is the regularizer that ensures the new encoder remains close to the old one. We set $\lambda_{\text{reg}}$ to a small value so that $L_{\text{reg}}$ does not play a major role in our implementations.
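To make (6) concrete, here is a minimal single-sample sketch of how the loss terms could be estimated for one transition $(x,x^{\prime})$, using PyTorch-style probabilistic modules as in the earlier sketch. The Monte-Carlo estimators, the omission of the square roots and constants of (5), the finite-difference curvature probe, and the encoder-KL proxy for $L_{\text{reg}}$ are all our simplifying assumptions, not the paper's exact estimators.

```python
import torch
from torch.distributions import kl_divergence

def carl_loss(E, E_old, F, D, policy, x, x_next,
              lam_ed=1.0, lam_p=1.0, lam_c=1.0, lam_cur=0.1, lam_reg=0.01, eps=0.01):
    """Single-sample Monte-Carlo estimate of the CARL loss terms in Eq. (6)."""
    enc = E(x)                                   # E(.|x)
    z = enc.rsample()
    a = policy(z).rsample()                      # a ~ pi(.|z)

    # (I) encoding-decoding loss: x should be reconstructable from z
    L_ed = -D(z).log_prob(x).mean()

    # (III) prediction loss: x' predicted through the latent path D o F_pi o E
    next_latent = F(z, a)
    L_p = -D(next_latent.rsample()).log_prob(x_next).mean()

    # consistency loss: the encoding of x' should match the predicted next latent state
    L_c = kl_divergence(E(x_next), next_latent).mean()

    # curvature loss: penalize non-smoothness of F via a finite-difference probe
    dz, da = eps * torch.randn_like(z), eps * torch.randn_like(a)
    L_cur = ((F(z + dz, a + da).mean - next_latent.mean) ** 2).sum(-1).mean()

    # regularizer: keep the new encoder close to the old one (a proxy for L_reg in (6))
    L_reg = kl_divergence(enc, E_old(x)).mean()

    return (lam_ed * L_ed + lam_p * L_p + lam_c * L_c
            + lam_cur * L_cur + lam_reg * L_reg)
```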

4 Different Implementations of CARL

The CARL loss function in (6) introduces an optimization problem that takes a policy $\pi$ as input and learns a representation suitable for its evaluation and improvement. To optimize this loss in practice, similar to the PCC model (Levine et al., 2020), we define $\widehat{P}=D\circ F_{\pi}\circ E$ as a latent variable model that factorizes as $\widehat{P}(x_{t+1},z_{t},\hat{z}_{t+1}|x_{t},\pi)=\widehat{P}(z_{t}|x_{t})\widehat{P}(\hat{z}_{t+1}|z_{t},\pi)\widehat{P}(x_{t+1}|\hat{z}_{t+1})$, and use a variational approximation to the intractable negative log-likelihood of the loss terms in (6). The variational bounds for these terms can be obtained similarly to Eqs. 6 and 7 in Levine et al. (2020). Below we describe three instantiations of the CARL model in practice. Implementation details can be found in Algorithm 2 in Appendix E. While CARL is compatible with most PI-style (actor-critic) RL algorithms, following a recent model-based RL work, MBPO (Janner et al., 2019), we choose SAC as the RL algorithm in CARL. Since most actor-critic algorithms are based on first-order gradient updates, as discussed in Section 3, we regularize the curvature of the latent dynamics $F$ (see Eqs. 8 and 9 in Levine et al. 2020) in CARL to improve its empirical stability and performance in policy learning.

4.1 Offline CARL

We first implement CARL in an offline setting, where we generate a (relatively) large batch of observation samples $\{(x_{t},a_{t},r_{t},x^{\prime}_{t})\}_{t=1}^{N}$ using an exploratory (e.g., random) policy. We then use this batch to optimize the CARL loss function (6) via a variational approximation scheme, as described above, and learn a latent representation $\mathcal{Z}$ and $(E,F,D)$. Finally, we solve the decision problem in $\mathcal{Z}$ using a model-based RL algorithm, which in our case is model-based soft actor-critic (SAC) (Haarnoja et al., 2018). The learned policy $\hat{\pi}^{*}$ in $\mathcal{Z}$ is then used to control the system from observations as $a_{t}\sim(\hat{\pi}^{*}\circ E)(\cdot|x_{t})$. This is the setting that has been used in several recent LCE works, such as E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), PCC (Levine et al., 2020), and PC3 (Shu et al., 2020). Our offline implementation differs from these works in that 1) we replace the locally-linear control algorithm, namely iterative LQR (iLQR) (Li and Todorov, 2004), used by them with model-based SAC, which results in significant performance improvement as shown in our experimental results in Section 5, and 2) we optimize the CARL loss function, which, as mentioned above, despite its close connection, is still different from the one used by PCC.
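A hedged sketch of the control step of offline CARL: model-based SAC is run entirely on latent transitions generated by the learned model $(F,\bar{r})$, starting from encodings of real observations. The `sac_update` routine, the rollout horizon, and the batching scheme are assumptions; the actual training schedule is given in Algorithm 2 of Appendix E.

```python
def latent_model_rollout(E, F, r_bar, policy, x_batch, horizon=5):
    """Generate synthetic latent transitions (z, a, r, z') from the learned model."""
    transitions = []
    z = E(x_batch).sample()                 # start from encodings of real observations
    for _ in range(horizon):
        a = policy(z).sample()
        z_next = F(z, a).sample()
        transitions.append((z, a, r_bar(z, a), z_next))
        z = z_next
    return transitions

def model_based_sac(E, F, r_bar, policy, sac_update, x_batches, n_updates=1000):
    """Offline CARL control step: run SAC purely on model-generated latent data."""
    for i in range(n_updates):
        batch = latent_model_rollout(E, F, r_bar, policy, x_batches[i % len(x_batches)])
        sac_update(policy, batch)           # actor-critic update in the latent space
    return policy
```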

The CARL loss function presented in Section 3 has been designed for an online setting in which, at each iteration, it takes a policy as input and learns a representation that is suitable for evaluating and improving this policy. However, in the offline setting, the learned representation should be good for any policy generated in the course of running the PI-style control algorithm. Therefore, we marginalize out the policy from the (online) CARL loss function and use the RHS of the following corollary (proof in Appendix C) to construct the CARL loss function used in our offline experiments.

Corollary 2.

Let $\mu$ and $\mu_{+}$ be two consecutive policies in $\mathcal{X}$ generated by a PI-style control algorithm in the latent space constructed by $(E,F,D,\bar{r})$. Then, the following holds for the value functions of $\mu$ and $\mu_{+}$, where $\Delta$ is defined by (5) (modulo replacing the sampled action $a\sim\pi\circ E$ with the action $a$):

U_{\mu_{+}}(x)\geq U_{\mu}(x)-\frac{2\gamma}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\;\Delta(E,F,D,\bar{r},a,x),\,\,\forall x\in\mathcal{X}.   (7)

4.2 Online CARL

In the online implementation of CARL, at each iteration $i$, the current policy $\pi^{(i)}$ is the improved policy of the last iteration, $\pi_{+}^{(i-1)}$. We first generate a (relative to offline CARL) small batch of samples from the image of this policy in $\mathcal{X}$, i.e., $\mu^{(i)}=\pi^{(i)}\circ E^{(i-1)}$, and then learn a representation $(E^{(i)},F^{(i)},D^{(i)})$ suitable for evaluating and improving the image of $\mu^{(i)}$ in $\mathcal{Z}$ under the new encoder $E^{(i)}$. This means that with the new representation, the current policy, which was the image of $\mu^{(i)}$ in $\mathcal{Z}$ under $E^{(i-1)}$, should be replaced by its image $\pi^{(i)}$ under the new encoder, i.e., $\pi^{(i)}\circ E^{(i)}\approx\mu^{(i)}$. In online CARL, this is done by a policy distillation step in which we minimize the following loss (our experiments reported in Appendix G show that adding distillation improves the performance of online CARL; thus, all the results reported for online CARL and value-guided CARL in the main paper are with policy distillation):

\pi^{(i)}\in\operatorname*{arg\,min}_{\pi}\sum_{x\sim\mathcal{D}}D_{\text{KL}}\big{(}(\pi\circ E^{(i)})(\cdot|x)\;||\;(\pi_{+}^{(i-1)}\circ E^{(i-1)})(\cdot|x)\big{)}.   (8)

After the current policy $\pi^{(i)}$ is set, we perform multiple steps of (model-based) SAC in $\mathcal{Z}$ using the current model, $(F^{(i)},\bar{r}^{(i)})$, and then pass the resulting policy $\pi_{+}^{(i)}$ to the next iteration.
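A minimal sketch of the distillation step (8), assuming Gaussian latent policies, encoders held fixed during distillation, and a single-sample estimate of the KL:

```python
from torch.distributions import kl_divergence

def distill_policy(pi_new, pi_old, E_new, E_old, x_batch, optimizer, n_steps=200):
    """Minimize KL( (pi_new o E_new)(.|x) || (pi_old o E_old)(.|x) ), cf. Eq. (8)."""
    for _ in range(n_steps):
        z_new = E_new(x_batch).sample()      # encoders are held fixed here (our choice)
        z_old = E_old(x_batch).sample()
        loss = kl_divergence(pi_new(z_new), pi_old(z_old)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return pi_new
```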

4.3 Value-Guided CARL

In the previous two implementations of CARL, we learn the model $(E,F,D)$ using the loss function (6). Theorem 1 shows that minimizing this loss guarantees performance improvement. While this loss depends on the current policy $\mu$ (through the latent space policy and encoder), it does not use its value function $U_{\mu}$. To incorporate this extra piece of information in the representation learning process, we utilize results from the variational model-based policy optimization (VMBPO) work of Chow et al. (2020). Using Lemma 3 in Chow et al. (2020), we can include the value function in the observation space dynamics (in general, this can also be applied to reward learning, but for simplicity we only focus on learning the dynamics) as

P^{*}(x^{\prime}|x,a)=P(x^{\prime}|x,a)\cdot\exp\big{(}\tau\cdot\frac{r(x,a)+\gamma\tilde{U}_{\mu}(x^{\prime})-\tilde{W}_{\mu}(x,a)}{\gamma}\big{)},   (9)

where $\tilde{U}_{\mu}(x):=\frac{1}{\tau}\log\mathbb{E}\left[\exp\left(\tau\cdot\sum_{t=0}^{\infty}\gamma^{t}r_{\mu,t}\right)\mid P_{\mu},x_{0}=x\right]$ is the risk-adjusted value function of policy $\mu$, and $\tilde{W}_{\mu}(x,a)$ is its corresponding state-action value function, i.e., $\tilde{W}_{\mu}(x,a):=r(x,a)+\gamma\cdot\frac{1}{\tau}\log\mathbb{E}_{x^{\prime}\sim P(\cdot|x,a)}\left[\exp(\tau\cdot U_{\mu}(x^{\prime}))\right]$. We refer to $\tilde{W}_{\mu}$ and $\tilde{U}_{\mu}$ as risk-adjusted value functions because their Bellman operator is no longer defined by the expectation over $P(x^{\prime}|x,a)$, but instead by the exponential risk $\rho_{\tau}(U(\cdot)|x,a)=\frac{1}{\tau}\log\mathbb{E}_{x^{\prime}\sim P(\cdot|x,a)}[\exp(\tau\cdot U(x^{\prime}))]$, with temperature parameter $\tau>0$ (Ruszczyński and Shapiro, 2006). The modified dynamics $P^{*}$ is a re-weighting of $P$ using the exponential-twisting weights $\exp(\frac{\tau}{\gamma}\cdot w(x,a,x^{\prime}))$, where $w(x,a,x^{\prime}):=r(x,a)+\gamma\tilde{U}_{\mu}(x^{\prime})-\tilde{W}_{\mu}(x,a)$ is the temporal difference (TD) error of the risk-adjusted value functions.
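For concreteness, a small numerically stable sketch of the exponential risk operator $\rho_{\tau}$ defined above, estimated from samples of the next observation; the sample-based log-mean-exp estimator is our choice, not prescribed by the paper.

```python
import math
import torch

def exponential_risk(U_next_samples, tau):
    """rho_tau(U|x,a) = (1/tau) * log E[exp(tau * U(x'))], estimated from samples.

    U_next_samples: tensor of shape [n_samples, batch] holding U(x') for x' ~ P(.|x,a).
    As tau -> 0, this recovers the ordinary expectation E[U(x')].
    """
    n = U_next_samples.shape[0]
    # log-mean-exp computed stably with logsumexp
    return (torch.logsumexp(tau * U_next_samples, dim=0) - math.log(n)) / tau
```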

To incorporate the value-guided transition model $P^{*}$ into the CARL loss function, we need to modify all the loss terms that depend on $P$, i.e., the prediction loss $L_{\text{p}}(E,F,D,\pi,x)$ and the consistency loss $L_{\text{c}}(E,F,\pi,x)$. Because of the regularization term $L_{\text{reg}}(E,E_{-},\pi,x)$ that enforces the policy $\pi\circ E$ to be close to $\mu$, we may replace the transition dynamics $P_{\pi\circ E}$ in the prediction loss $L_{\text{p}}(E,F,D,\pi,x)$ with $P_{\mu}$. Since $\log P_{\mu}(x^{\prime}|x)$ does not depend on the representation, minimizing the prediction loss is equivalent to maximizing the expected log-likelihood, i.e., minimizing $L_{\text{mle}}(E,F,D,\pi,x)=-\int_{x^{\prime}}dP_{\mu}(x^{\prime}|x)\cdot\log(D\circ F_{\pi}\circ E)(x^{\prime}|x)$. Now if we replace the dynamics $P$ with its value-guided counterpart $P^{*}$ in the MLE loss, we obtain the modified prediction loss

L^{\prime}_{\text{p}}(E,F,D,\pi,x)=-\int_{a}d\mu(a|x)\int_{x^{\prime}}dP(x^{\prime}|x,a)\cdot\exp(\frac{\tau}{\gamma}\cdot w(x,a,x^{\prime}))\cdot\log(D\circ F_{\pi}\circ E)(x^{\prime}|x),   (10)

which corresponds to a weighted MLE loss in which the weights are the exponential TD errors.

Using analogous arguments, we may write the value-guided version of the consistency loss $L_{\text{c}}(E,F,\pi,x)=\int_{x^{\prime}\in\mathcal{X}}dP_{\mu}(x^{\prime}|x)\cdot D_{\text{KL}}(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x))$ as

L^{\prime}_{\text{c}}(E,F,D,\pi,x)=\int_{a}d\mu(a|x)\int_{x^{\prime}}dP(x^{\prime}|x,a)\cdot\exp(\frac{\tau}{\gamma}\cdot w(x,a,x^{\prime}))\cdot D_{\text{KL}}(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)).   (11)

To complete the value-guided CARL (V-CARL) implementation, we need to compute the risk-adjusted value functions $\tilde{U}_{\mu}$ and $\tilde{W}_{\mu}$ to construct the weight $w(x,a,x^{\prime})$. Here we provide a recipe for the case when $\tau$ is small (see Appendix D for details), in which the risk-adjusted value functions can be approximated by their risk-neutral counterparts, i.e., $\tilde{U}_{\mu}(x)\approx U_{\mu}(x)$ and $\tilde{W}_{\mu}(x,a)\approx W_{\mu}(x,a):=r(x,a)+\int_{x^{\prime}}dP(x^{\prime}|x,a)U_{\mu}(x^{\prime})$. Following Lemma 7 in Appendix A, we can approximate $U_{\mu}(x)$ with $(V_{\pi}\circ E)(x)$ and $W_{\mu}(x,a)$ with $(Q_{\pi}\circ E)(x,a)$. Together with the compatibility of the rewards, i.e., $r(x,a)\approx(\bar{r}\circ E)(x,a)$, the weight $w(x,a,x^{\prime})$ can be approximated by $\widehat{w}(x,a,x^{\prime}):=\int_{z,z^{\prime}}dE(z|x)\cdot dE(z^{\prime}|x^{\prime})\cdot(\bar{r}(z,a)-Q_{\pi}(z,a)+\gamma V_{\pi}(z^{\prime}))$, which is simply the TD error in the latent space.
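Putting the approximation together, a sketch of how the per-transition weights $\exp(\frac{\tau}{\gamma}\cdot\widehat{w})$ used in (10) and (11) could be computed from the latent critic; the single-sample encodings, the critic interface, and the batch normalization of the weights are assumptions.

```python
import torch

def vcarl_weights(E, r_bar, Q, V, x, a, x_next, tau, gamma):
    """exp((tau/gamma) * w_hat), with w_hat the latent TD error of the current policy."""
    z = E(x).sample()
    z_next = E(x_next).sample()
    w_hat = r_bar(z, a) - Q(z, a) + gamma * V(z_next)    # latent TD error
    weights = torch.exp((tau / gamma) * w_hat)
    # normalizing over the batch is an optional stabilization (our choice), keeping the
    # weighted losses in (10)-(11) on a comparable scale across batches
    return weights / weights.mean()
```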

5 Experimental Results

In this section, we experiment with the following continuous control domains (see Appendix F for more descriptions): (i) Planar System, (ii) Inverted Pendulum, (iii) Cartpole, and (iv) 3-link manipulator, and compare the performance of our CARL algorithms with two LCE baselines: PCC (Levine et al., 2020) and SOLAR (Zhang et al., 2019). These tasks have underlying start and goal states that are "not" observable to the algorithms; instead, the algorithms only have access to the start and goal image observations. To evaluate the performance of the control algorithms, similar to Levine et al. (2020), we report the %-time spent in the goal. The initial policy used for data generation is uniformly random. To measure performance reproducibility, for each experiment we 1) train 25 models and 2) perform 10 control tasks for each model. For SOLAR, due to its high computation cost, we only train and evaluate 10 different models. Besides the average results, we also report the results of the best LCE models, averaged over the 10 control tasks.

General Results

Table 1 shows the means and standard deviations of the metrics on different control tasks. To compare the data efficiency of different methods, we also report the number of samples required to train the latent space and controller in each method. Our main experimental observations are four-fold. First, by replacing the iLQR control algorithm used by PCC with model-based SAC, offline CARL achieves significantly better performance in all of the tasks. This can be attributed to SAC being more robust and effective in non-(locally)-linear environments. More detailed comparisons between PCC and offline CARL can be found in Appendix G, in which we explicitly compare their control performance and latent representation maps. Second, in all the tasks online CARL is more data-efficient than its offline counterpart, i.e., it achieves similar or better performance with fewer environment samples. In particular, online CARL is notably superior in the planar, cartpole, and swingup tasks, in which similar performance is achieved with 2, 2.5, and 4 times fewer samples, respectively (see Figure 2). Interestingly, Figure 3 shows that the latent representations learned by online CARL also tend to improve progressively with the value of the policy. Third, in the simpler tasks (Planar, Swingup, Cartpole), value-guided CARL (V-CARL) manages to achieve even better performance. This corroborates our hypothesis that CARL delivers extra improvement when its LCE model is more accurate in regions of the $\mathcal{Z}$ space with higher temporal difference error, i.e., regions with higher anticipated future return. Unfortunately, in three-pole, the performance of V-CARL is worse than its online counterpart. This is likely due to instability in representation learning when the sample variance is amplified by the exponential-TD weights. Finally, SOLAR requires many more samples to learn a reasonable latent space for control, and with limited data it fails to converge to a good policy. Note that we are able to obtain better results in the planar task when the goal location is fixed (see Appendix 3). Yet even with the fine-tuned latent space from Zhang et al. (2019), its performance is not comparable with that of the CARL algorithms.

Environment | Algorithm | Number of Samples | Avg %-Goal | Best %-Goal
Planar | PCC | 5000 | 38.85 ± 2.45 | 62.5 ± 10.42
Planar | Offline CARL | 5000 | 63.43 ± 2.78 | \textbf{79.51 ± 0.38}
Planar | Online CARL | 3072 | 68.03 ± 1.69 | 79.02 ± 0.38
Planar | Value-Guided CARL | 3200 | \textbf{71.05 ± 1.46} | \textbf{79.51 ± 0.38}
Planar | SOLAR | 5000 (VAE) + 16000 (Control) | 5.82 ± 2.50 | 9.13 ± 3.54
Swingup | PCC | 5000 | 86.60 ± 1.00 | 97.40 ± 0.61
Swingup | Offline CARL | 5000 | 88.43 ± 2.02 | \textbf{98.50 ± 0.0}
Swingup | Online CARL | 1408 | 95.04 ± 0.96 | \textbf{98.50 ± 0.0}
Swingup | Value-Guided CARL | 1408 | \textbf{96.50 ± 0.25} | \textbf{98.50 ± 0.0}
Swingup | SOLAR | 5200 (VAE) + 40000 (Control) | 16.1 ± 0.69 | 22.45 ± 1.96
Cartpole | PCC | 10000 | 83.64 ± 0.63 | \textbf{100.0 ± 0.0}
Cartpole | Offline CARL | 10000 | 91.11 ± 1.50 | \textbf{100.0 ± 0.0}
Cartpole | Online CARL | 5120 | 95.34 ± 1.17 | \textbf{100.0 ± 0.0}
Cartpole | Value-Guided CARL | 5120 | \textbf{95.79 ± 1.06} | \textbf{100.0 ± 0.0}
Cartpole | SOLAR | 5000 (VAE) + 40000 (Control) | 10.61 ± 2.58 | 12.33 ± 2.96
Three-pole | PCC | 4096 | 4.41 ± 0.75 | 36.20 ± 7.06
Three-pole | Offline CARL | 4096 | \textbf{63.20 ± 1.77} | 88.55 ± 0.0
Three-pole | Online CARL | 2944 | 62.17 ± 2.28 | \textbf{90.05 ± 0.0}
Three-pole | Value-Guided CARL | 2816 | 55.06 ± 2.42 | 89.05 ± 0.0
Three-pole | SOLAR | 2000 (VAE) + 20000 (Control) | 0 ± 0 | 0 ± 0
Table 1: Mean ± standard error results (%-goal) and number of samples used by different LCE algorithms; the best values in each task are shown in bold.
Figure 2: Training curves of offline CARL, online CARL, and value-guided CARL on (a) Planar, (b) Swingup, (c) Cartpole, and (d) Three-pole. The shaded region represents mean ± standard error. The online variants of CARL are more data-efficient than offline CARL.
Figure 3: Evolution of the latent representation of the Planar problem learned by online CARL, shown at $i=1,3,6,10,14$, where $i$ is the number of LCE model-learning episodes (Algorithm 2 in Appendix E).
Figure 4: Training curves comparing online CARL and value-guided CARL on (a) Planar, (b) Swingup, (c) Cartpole, and (d) Three-pole, when the initial samples are drawn from the biased regions described in Appendix F.
Results with Environment-biased Sampling

In the previous experiments, all the online LCE algorithms are warm-started with data collected by a uniformly random policy over the entire environment. With this uniform data collection, we do not observe a significant difference between online CARL and V-CARL. This is because, with sufficient data, the LCE dynamics is accurate enough on most parts of the state-action space for control. To further illustrate the advantage of V-CARL over online CARL, we modify the experimental setting by gathering initial samples only from a specific region of the environment (see Appendix F for details). Figure 4 shows the learning curves of online CARL and V-CARL in this case. With biased initial data, both algorithms experience a certain level of performance degradation. Yet, V-CARL clearly outperforms online CARL. This again verifies our conjecture that value-aware LCE models are more robust to the initial data distribution and better suited for policy optimization.

6 Conclusions

In this paper, we argued for incorporating control into the representation learning process and for the interaction between control and representation learning in learning controllable embedding (LCE) algorithms. We proposed an LCE model called control-aware representation learning (CARL) that learns representations suitable for policy iteration (PI) style control algorithms. We proposed three implementations of CARL that combine representation learning with model-based soft actor-critic (SAC), as the controller, in offline and online fashions. In the third implementation, called value-guided CARL, we further included the control process in representation learning by optimizing a weighted version of the CARL loss function, in which the weights depend on the TD-error of the current policy. We evaluated the proposed algorithms on benchmark tasks and compared them with several LCE baselines. The experiments show the importance of SAC as the controller and of the online implementation. Future directions include 1) investigating other PI-style algorithms in place of SAC, 2) developing LCE models suitable for value iteration style algorithms, and 3) identifying other forms of bias for learning an effective embedding and latent dynamics.

Broader Impact

Controlling non-linear dynamical systems from high-dimensional observations is challenging. Direct application of model-free and model-based reinforcement learning algorithms to this problem may not be efficient, due to the large number of samples required from the real system (model-free) and the challenges of learning a model in a high-dimensional space (model-based). A popular approach to address this problem is learning controllable embedding (LCE), i.e., learning a low-dimensional latent space and a latent space model, and performing the optimal control in this learned latent space. This work is a step towards end-to-end representation learning and control in this setting. We propose methods that interleave representation learning and control, which allows us to learn control-aware representations, i.e., representations that are suitable for the control problem at hand.

References

  • Abachi et al. [2020] R. Abachi, M. Ghavamzadeh, and A. Farahmand. Policy-aware model learning for policy gradient methods. preprint arXiv:2003.00030, 2020.
  • Banijamali et al. [2018] E. Banijamali, R. Shu, M. Ghavamzadeh, H. Bui, and A. Ghodsi. Robust locally-linear controllable embedding. In Proceedings of the Twenty First International Conference on Artificial Intelligence and Statistics, pages 1751–1759, 2018.
  • Borkar [2002] V. Borkar. Q-learning for risk-sensitive control. Mathematics of operations research, 27(2):294–311, 2002.
  • Breivik and Fossen [2005] M. Breivik and T. Fossen. Principles of guidance-based path following in 2D and 3D. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 627–634, 2005.
  • Chow et al. [2020] Y. Chow, B. Cui, M. Ryu, and M. Ghavamzadeh. Variational model-based policy optimization. preprint arXiv, 2020.
  • Deisenroth and Rasmussen [2011] M. Deisenroth and C. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, page 465–472, 2011.
  • Ebert et al. [2018] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual Foresight: Model-based deep reinforcement learning for vision-based robotic control. preprint arXiv:1812.00568, 2018.
  • Farahmand [2018] A. Farahmand. Iterative value-aware model learning. In Advances in Neural Information Processing Systems 31, pages 9072–9083, 2018.
  • Furuta et al. [1991] K. Furuta, M. Yamakita, and S. Kobayashi. Swing up control of inverted pendulum. In Proceedings of International Conference on Industrial Electronics, Control and Instrumentation, pages 2193–2198, 1991.
  • Geva and Sitte [1993] S. Geva and J. Sitte. A cartpole experiment benchmark for trainable controllers. IEEE Control Systems Magazine, 13(5):40–51, 1993.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1861–1870, 2018.
  • Hafner et al. [2020] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.
  • Janner et al. [2019] M. Janner, J. Fu, M. Zhang, and S. Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems 32, pages 12519–12530. 2019.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. preprint arXiv:1412.6980, 2014.
  • Kingma and Welling [2013] D. Kingma and M. Welling. Auto-encoding variational Bayes. preprint arXiv:1312.6114, 2013.
  • Lai et al. [2015] X. Lai, A. Zhang, M. Wu, and J. She. Singularity-avoiding swing-up control for underactuated three-link gymnast robot using virtual coupling between control torques. International Journal of Robust and Nonlinear Control, 25(2):207–221, 2015.
  • Levine et al. [2020] N. Levine, Y. Chow, R. Shu, A. Li, M. Ghavamzadeh, and H. Bui. Prediction, consistency, curvature: Representation learning for locally-linear control. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Li and Todorov [2004] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
  • Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. preprint arXiv:1312.5602, 2013.
  • Ruszczyński and Shapiro [2006] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of operations research, 31(3):433–452, 2006.
  • Schulman et al. [2015a] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015a.
  • Schulman et al. [2015b] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015b.
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. preprint arXiv:1707.06347, 2017.
  • Shu et al. [2020] R. Shu, T. Nguyen, Y. Chow, T. Pham, K. Than, M. Ghavamzadeh, S. Ermon, and H. Bui. Predictive coding for locally-linear control. preprint arXiv:2003.01086, 2020.
  • Spong [1995] M. Spong. The swing up control problem for the acrobot. IEEE Control Systems Magazine, 15(1):49–55, 1995.
  • Watter et al. [2015] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems 28, pages 2746–2754. 2015.
  • Zhang et al. [2019] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pages 7444–7453, 2019.

Appendix A Approximate Latent Policy Evaluation

To start with, for any value function $U:\mathcal{X}\rightarrow\mathbb{R}$, any observation $x\in\mathcal{X}$, and any arbitrary policy $\mu$, the $\mu$-induced observation-space Bellman operator can be written as

T_{\mu}[U](x):=\int_{a\in\mathcal{A}}d\mu(a|x)\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot\big{(}r(x,a)+\gamma\cdot U(x^{\prime})\big{)}.

On the other hand, utilizing the LCE model parameterization and the policy parameterization $\mu(a|x)=(\pi\circ E)(a|x)$, where $\pi\circ E$ corresponds to sampling a latent state $z$ from the LCE encoder $E$ and applying a latent space policy $\pi$, we also define, for any observation $x\in\mathcal{X}$, the approximate Bellman operator

T^{\prime}_{\mu}[U](x):=\int_{z\in\mathcal{Z}}dE(z|x)\cdot\int_{a\in\mathcal{A}}d\pi(a|z)\cdot\int_{z^{\prime}\in\mathcal{Z}}dF(z^{\prime}|z,a)\cdot\int_{x^{\prime}\in\mathcal{X}}dD(x^{\prime}|z^{\prime})\cdot(r(x,a)+\gamma U(x^{\prime})).

Similarly, for any value function $V:\mathcal{Z}\rightarrow\mathbb{R}$, consider the following latent-space Bellman operator:

\mathcal{T}_{\pi}[V](z):=\int_{a\in\mathcal{A}}d\pi(a|z)\cdot\int_{z^{\prime}\in\mathcal{Z}}dF(z^{\prime}|z,a)\cdot\left(\bar{r}(z,a)+\gamma\cdot V(z^{\prime})\right),

where $\bar{r}(z,a)$ is a latent reward function that approximates the corresponding observation reward induced by the policy $\pi\circ E$, i.e., $|\int_{a\in\mathcal{A}}\int_{z\in\mathcal{Z}}d\pi(a|z)dE(z|x)(\bar{r}(z,a)-r(x,a))|\approx 0$, and $V:\mathcal{Z}\rightarrow\mathbb{R}$ is the latent value function.

Using the results from Farahmand [2018], we first bound the difference between the observation Bellman backup $T_{\mu}[U]$, for $\mu=\pi\circ E$, and the latent Bellman backup $T^{\prime}_{\mu}[U]$ w.r.t. an arbitrary value function $U$ using the following inequality:

\begin{split}\big{|}T_{\mu}[U](x)&-T^{\prime}_{\mu}[U](x)\big{|}\leq\gamma\|U\|_{\infty}\cdot D_{\text{TV}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)(\cdot|x)\bigg{)}\\ &\leq\frac{\gamma\|U\|_{\infty}}{\sqrt{2}}\cdot\sqrt{D_{\text{KL}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)(\cdot|x)\bigg{)}},\,\,\forall x\in\mathcal{X},\end{split}   (12)

in which $\|U\|_{\infty}=\max_{x\in\mathcal{X}}|U(x)|$, and the second inequality follows from Pinsker's inequality. In a $\gamma$-discounted MDP whose immediate reward is bounded in magnitude by $R_{\max}$, this quantity is bounded by $R_{\max}/(1-\gamma)$. This implies that the difference between these Bellman operators can be bounded by a prediction loss. Alternatively, one can derive the above TV-divergence bound without dependence on the policy by considering the worst-case actions as follows:

\begin{split}D_{\text{TV}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)&(\cdot|x)\bigg{)}\leq\max_{a\in\mathcal{A}}\,D_{\text{TV}}\bigg{(}P(\cdot|x,a)||(D\circ F\circ E)(\cdot|x,a)\bigg{)}\\ \leq&\frac{1}{\sqrt{2}}\max_{a\in\mathcal{A}}\,\sqrt{D_{\text{KL}}\bigg{(}P(\cdot|x,a)||(D\circ F\circ E)(\cdot|x,a)\bigg{)}}.\end{split}   (13)

This upper bound corresponds to the prediction loss in PCC.

Second, for any arbitrary observation value function $U$, we have the following result that connects the observation Bellman residual $\left|T^{\prime}_{\mu}[U](x)-U(x)\right|$ and the latent Bellman residual $\int_{z\in\mathcal{Z}}dE(z|x)\left|\mathcal{T}_{\pi}[V](z)-V(z)\right|$ when the latent value function is $V(z)=\int_{\widetilde{x}\in\mathcal{X}}dD(\widetilde{x}|z)U(\widetilde{x})$.

Lemma 3.

For any observation $x\in\mathcal{X}$, encoder-transition-decoder tuple $(E,F,D)$, latent-space policy $\pi$ such that the policy is parameterized as $\mu=\pi\circ E$, and value function $U$, with the latent value function defined as $V(z)=\int_{\widetilde{x}\in\mathcal{X}}dD(\widetilde{x}|z)U(\widetilde{x})$, the following statement holds:

\left|T^{\prime}_{\mu}[U](x)\!-\!U(x)\right|\geq\left|\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V](z)-V(z))\right|   (14)
\!-\!\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|-\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)},\,\,\,\forall x\in\mathcal{X}.
Proof.

Using the definitions of Bellman operators, we have the following chain of inequalities:

\begin{split}&\left|T^{\prime}_{\mu}[U](x)-\int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}_{\pi}[V](z)\right|\leq\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)\cdot\right.\\ &\quad\left.\left(\int_{x^{\prime}\in\mathcal{X}}\int_{z^{\prime}\in\mathcal{Z}}dD(x^{\prime}|z^{\prime})\cdot dF(z^{\prime}|z,a)\cdot\gamma\cdot U(x^{\prime})-\int_{z^{\prime}\in\mathcal{Z}}dF(z^{\prime}|z,a)\cdot\gamma\cdot V(z^{\prime})\right)\right|\\ &\quad+\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)\int_{z^{\prime}\in\mathcal{Z}}dF(z^{\prime}|z,a)\cdot\left(\int_{x^{\prime}\in\mathcal{X}}dD(x^{\prime}|z^{\prime})r(x,a)-\bar{r}(z,a)\right)\right|\\ \leq&\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|,\end{split}

where the inequalities are based on the triangle inequality, and the first term in this bound vanishes due to the definition $V(z)=\int_{\widetilde{x}\in\mathcal{X}}dD(\widetilde{x}|z)U(\widetilde{x})$. Furthermore, using basic arithmetic manipulations and the triangle inequality, at any $x\in\mathcal{X}$ we bound the Bellman residual $\left|T^{\prime}_{\mu}[U](x)-U(x)\right|$ as

\bigg{|}T^{\prime}_{\mu}[U](x)-U(x)\bigg{|}   (15)
=\bigg{|}T^{\prime}_{\mu}[U](x)-\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V](z)-V(z))+\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V](z)-V(z))-U(x)\bigg{|}
\geq\bigg{|}\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V](z)-V(z))\bigg{|}-\bigg{|}T^{\prime}_{\mu}[U](x)-\int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}_{\pi}[V](z)\bigg{|}-\left|U(x)-\int_{z\in\mathcal{Z}}dE(z|x)V(z)\right|
\geq\bigg{|}\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V](z)-V(z))\bigg{|}-\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|
-\left|\int_{\widetilde{x}\in\mathcal{X}}\left(\int_{z\in\mathcal{Z}}dE(z|x)dD(\widetilde{x}|z)-d\mathbf{1}\{x=\widetilde{x}\}\right)U(\widetilde{x})\right|.

The last term in the above inequality can be further upper-bounded as follows:

|x~𝒳(z𝒵dE(z|x)dD(x~|z)d𝟏{x=x~})U(x~)|x~𝒳|z𝒵dE(z|x)D(x~|z)𝟏{x=x~}|U.\left|\int_{\widetilde{x}\in\mathcal{X}}\left(\int_{z\in\mathcal{Z}}dE(z|x)dD(\widetilde{x}|z)-d\mathbf{1}\{x=\widetilde{x}\}\right)U(\widetilde{x})\right|\leq\int_{\widetilde{x}\in\mathcal{X}}\left|\int_{z\in\mathcal{Z}}dE(z|x)D(\widetilde{x}|z)-\mathbf{1}\{x=\widetilde{x}\}\right|\cdot\|U\|_{\infty}.

This upper bound can, in turn, be bounded by

x~𝒳z𝒵dE(z|x)|D(x~|z)𝟏{x=x~}|z𝒵dE(z|x)12DKL(𝟏{=x}||D(|z))=z𝒵𝑑E(z|x)12logD(x|z)12z𝒵𝑑E(z|x)logD(x|z),\begin{split}&\int_{\widetilde{x}\in\mathcal{X}}\int_{z\in\mathcal{Z}}dE(z|x)\left|D(\widetilde{x}|z)-\mathbf{1}\{x=\widetilde{x}\}\right|\leq\int_{z\in\mathcal{Z}}dE(z|x)\sqrt{\frac{1}{2}D_{\text{KL}}(\mathbf{1}\{\cdot=x\}||D(\cdot|z))}\\ =&\int_{z\in\mathcal{Z}}dE(z|x)\sqrt{\frac{-1}{2}\log D(x|z)}\leq\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)},\end{split}

where the first inequality follows from Pinsker’s inequality and the second from Jensen’s inequality and the concavity of \sqrt{\cdot}. Combining this result with the above inequality completes the proof. ∎

The right side of inequality (14) contains several terms. The first corresponds to the latent Bellman residual error, the second to the latent reward estimation error w.r.t. policy \pi, and the third is a reconstruction loss along the encoder-decoder path, which is commonly found in training auto-encoders (and is also a regularization term in PCC). Utilizing the inequality in equation 12 and this lemma, one can further show that for any observation value function U:\mathcal{X}\rightarrow\mathbb{R} (with V defined from U as above) and at any x\in\mathcal{X},

|Tμ[U](x)U(x)|=|Tμ[U](x)Tμ[U](x)+Tμ[U](x)U(x)||Tμ[U](x)U(x)||Tμ[U](x)Tμ[U](x)|\displaystyle\left|T_{\mu}[U](x)\!-\!U(x)\right|=\left|T_{\mu}[U](x)\!-\!T^{\prime}_{\mu}[U](x)+T^{\prime}_{\mu}[U](x)\!-\!U(x)\right|\geq\left|T^{\prime}_{\mu}[U](x)\!-\!U(x)\right|-\left|T_{\mu}[U](x)\!-\!T^{\prime}_{\mu}[U](x)\right|
\displaystyle\geq |z𝒵dE(z|x)(𝒯π[V](z)V(z))|γRmax2(1γ)DKL(PπE(|x)||(DFπE)(|x))\displaystyle\left|\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V](z)-V(z))\right|-\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\cdot\sqrt{D_{\text{KL}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)(\cdot|x)\bigg{)}}
|z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|Rmax1γ12z𝒵𝑑E(z|x)logD(x|z).\displaystyle\!-\!\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|-\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)}. (16)

Fourth, for any arbitrary latent value function V, we have the following result that connects the observation Bellman residual \left|T_{\mu}[U](x)-U(x)\right| and the latent Bellman residual \int_{z\in\mathcal{Z}}dE(z|x)\left|\mathcal{T}_{\pi}[V](z)-V(z)\right| when the observation value function is U(x)=\int_{\widetilde{z}\in\mathcal{Z}}dE(\widetilde{z}|x)V(\widetilde{z}).

Lemma 4.

For any observation x\in\mathcal{X}, encoder-transition-decoder tuple (E,F,D), latent-space policy \pi that parameterizes the observation policy as \mu=\pi\circ E, and latent value function V with the corresponding observation value function defined as U(x)=\int_{\widetilde{z}\in\mathcal{Z}}dE(\widetilde{z}|x)V(\widetilde{z}), the following statement holds:

|z𝒵dE(z|x)(𝒯π[V](z)V(z))||Tμ[U](x)U(x)||z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|\displaystyle\left|\int_{z\in\mathcal{Z}}dE(z|x)\cdot\big{(}\mathcal{T}_{\pi}[V](z)-V(z)\big{)}\right|\geq\big{|}T_{\mu}[U](x)-U(x)\big{|}-\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|
γRmax2(1γ)x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x)).\displaystyle\,\,-\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}. (17)
Proof.

Using the definition U(x)=\int_{\widetilde{z}\in\mathcal{Z}}dE(\widetilde{z}|x)V(\widetilde{z}) and the triangle inequality,

|z𝒵dE(z|x)(𝒯π[V](z)V(z))||Tμ[U](x)U(x)||z𝒵dE(z|x)𝒯π[V](z)Tμ[U](x)|\left|\int_{z\in\mathcal{Z}}dE(z|x)\cdot\big{(}\mathcal{T}_{\pi}[V](z)-V(z)\big{)}\right|\geq\big{|}T_{\mu}[U](x)-U(x)\big{|}-\left|\int_{z\in\mathcal{Z}}dE(z|x)\cdot\mathcal{T}_{\pi}[V](z)-T_{\mu}[U](x)\right|

the proof of this lemma is completed by bounding the last term on the right-hand side via the following inequality:

|z𝒵dE(z|x)𝒯π[V](z)Tμ[U](x)||z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|+γ|a𝒜dμ(a|x)x𝒳dP(x|x,a)z𝒵dE(z|x)V(z)z𝒵dE(z|x)a𝒜dπ(a|z)z𝒵dF(z|z,a)V(z)|\begin{split}&\left|\int_{z\in\mathcal{Z}}dE(z|x)\cdot\mathcal{T}_{\pi}[V](z)-T_{\mu}[U](x)\right|\leq\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|+\\ &\quad\gamma\left|\int_{a\in\mathcal{A}}d\mu(a|x)\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot\int_{z^{\prime}\in\mathcal{Z}}dE(z^{\prime}|x^{\prime})V(z^{\prime})-\int_{z\in\mathcal{Z}}dE(z|x)\cdot\int_{a\in\mathcal{A}}d\pi(a|z)\cdot\int_{z^{\prime}\in\mathcal{Z}}dF(z^{\prime}|z,a)\cdot V(z^{\prime})\right|\end{split}

in which the second term can be further upper bounded as follows:

|a𝒜dμ(a|x)x𝒳dP(x|x,a)z𝒵dE(z|x)V(z)z𝒵dE(z|x)a𝒜dπ(a|z)z𝒵dF(z|z,a)V(z)|z𝒵|a𝒜dμ(a|x)x𝒳dP(x|x,a)E(z|x)z𝒵dE(z|x)a𝒜dπ(a|z)F(z|z,a)|V=DTV(a𝒜dμ(a|x)x𝒳dP(x|x,a)E(|x)||z𝒵dE(z|x)a𝒜dπ(a|z)F(|z,a))VV2x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x)).\begin{split}&\left|\int_{a\in\mathcal{A}}d\mu(a|x)\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot\int_{z^{\prime}\in\mathcal{Z}}dE(z^{\prime}|x^{\prime})V(z^{\prime})-\int_{z\in\mathcal{Z}}dE(z|x)\cdot\int_{a\in\mathcal{A}}d\pi(a|z)\cdot\int_{z^{\prime}\in\mathcal{Z}}dF(z^{\prime}|z,a)\cdot V(z^{\prime})\right|\\ \leq&\int_{z^{\prime}\in\mathcal{Z}}\left|\int_{a\in\mathcal{A}}d\mu(a|x)\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot E(z^{\prime}|x^{\prime})-\int_{z\in\mathcal{Z}}dE(z|x)\cdot\int_{a\in\mathcal{A}}d\pi(a|z)\cdot F(z^{\prime}|z,a)\right|\cdot\|V\|_{\infty}\\ =&D_{\text{TV}}\left(\int_{a\in\mathcal{A}}d\mu(a|x)\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot E(\cdot|x^{\prime})||\int_{z\in\mathcal{Z}}dE(z|x)\cdot\int_{a\in\mathcal{A}}d\pi(a|z)\cdot F(\cdot|z,a)\right)\cdot\|V\|_{\infty}\\ \leq&\frac{\|V\|_{\infty}}{\sqrt{2}}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}.\end{split}

where the equality follows from the definition of the total-variation distance, and the last inequality follows from Pinsker's inequality and the joint convexity of D_{\text{KL}}(\cdot||\cdot). The proof of this lemma is completed by combining the above results and using \|V\|_{\infty}\leq R_{\max}/(1-\gamma). ∎

Notice that since both the observation Bellman operator T_{\mu} and the latent Bellman operator \mathcal{T}_{\pi} are contraction mappings, there exists a unique solution U_{\mu}:\mathcal{X}\rightarrow\mathbb{R} to the observation fixed-point equation T_{\mu}[U](x)=U(x), \forall x\in\mathcal{X}, and a unique solution V_{\pi}:\mathcal{Z}\rightarrow\mathbb{R} to the latent fixed-point equation \mathcal{T}_{\pi}[V](z)=V(z), \forall z\in\mathcal{Z}. Together with the result in (16), this immediately yields Theorem 5, which connects the optimal observation Bellman residual error and the optimal latent Bellman residual error.

Theorem 5.

Let V_{\pi}\circ E(x)=\int_{\widetilde{z}\in\mathcal{Z}}dE(\widetilde{z}|x)V_{\pi}(\widetilde{z}) be the observation value function in which V_{\pi} is the solution of the soft latent fixed-point equation \mathcal{T}_{\pi}[V](z)=V(z) w.r.t. the latent policy \pi, the encoder-transition-decoder tuple (E,F,D), and the parameterized observation-based policy \mu=\pi\circ E. Then the following statement holds at any x\in\mathcal{X}:

|Tμ[VπE](x)VπE(x)|Rmax1γ12z𝒵𝑑E(z|x)logD(x|z)+2|z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|+γRmax2(1γ){DKL(PπE(|x)||(DFπE)(|x))+x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x))}.\begin{split}&\big{|}T_{\mu}[V_{\pi}\circ E](x)-V_{\pi}\circ E(x)\big{|}\leq\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)}\\ &+2\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\left\{\sqrt{D_{\text{KL}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)(\cdot|x)\bigg{)}}\right.\\ &+\left.\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}\right\}.\end{split}
Proof.

Using inequality (16) (which follows from Lemma 3) with U=U_{\mu}, the fixed-point solution T_{\mu}[U_{\mu}](x)=U_{\mu}(x), and denoting \hat{V}_{\mu}(z)=\int_{\widetilde{x}\in\mathcal{X}}dD(\widetilde{x}|z)U_{\mu}(\widetilde{x}), for any x\in\mathcal{X}, we have

|Tμ[Uμ](x)Uμ(x)|\displaystyle\left|T_{\mu}[U_{\mu}](x)\!-\!U_{\mu}(x)\right|
\displaystyle\geq |z𝒵dE(z|x)(𝒯π[V^μ](z)V^μ(z))|γRmax2(1γ)DKL(PπE(|x)||(DFπE)(|x))\displaystyle\left|\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[\hat{V}_{\mu}](z)-\hat{V}_{\mu}(z))\right|-\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\cdot\sqrt{D_{\text{KL}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)(\cdot|x)\bigg{)}}
|z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|Rmax1γ12z𝒵𝑑E(z|x)logD(x|z).\displaystyle\!-\!\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)\!-\!\bar{r}(z,a))\right|-\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)}.

Using the fact that VπV_{\pi} is the fixed-point solution, i.e., 𝒯π[Vπ](z)=Vπ(z)\mathcal{T}_{\pi}[V_{\pi}](z)=V_{\pi}(z), we have

|z𝒵dE(z|x)(𝒯π[V^μ](z)V^μ(z))||z𝒵dE(z|x)(𝒯π[Vπ](z)Vπ(z))|=0.\left|\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[\hat{V}_{\mu}](z)-\hat{V}_{\mu}(z))\right|\geq\left|\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[V_{\pi}](z)-V_{\pi}(z))\right|=0.

On the other hand, using Lemma 4 with V=V_{\pi}, the definition V_{\pi}\circ E(x)=\int_{\widetilde{z}\in\mathcal{Z}}dE(\widetilde{z}|x)V_{\pi}(\widetilde{z}), and the inequality above, we can further show that

|z𝒵dE(z|x)(𝒯π[V^μ](z)V^μ(z))||Tμ[VπE](x)VπE(x)||z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|\displaystyle\left|\int_{z\in\mathcal{Z}}dE(z|x)(\mathcal{T}_{\pi}[\hat{V}_{\mu}](z)-\hat{V}_{\mu}(z))\right|\geq\big{|}T_{\mu}[V_{\pi}\circ E](x)-V_{\pi}\circ E(x)\big{|}-\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|
γRmax2(1γ)x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x)).\displaystyle\,\,-\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}.

Together these inequalities imply that

|Tμ[Uμ](x)Uμ(x)||Tμ[VπE](x)VπE(x)|2|z𝒵dE(z|x)a𝒜dπ(a|z)(r(x,a)r¯(z,a))|\displaystyle\left|T_{\mu}[U_{\mu}](x)\!-\!U_{\mu}(x)\right|\geq\big{|}T_{\mu}[V_{\pi}\circ E](x)-V_{\pi}\circ E(x)\big{|}-2\left|\int_{z\in\mathcal{Z}}dE(z|x)\int_{a\in\mathcal{A}}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\right|
γRmax2(1γ)x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x)).\displaystyle\,\,-\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}.
γRmax2(1γ)DKL(PπE(|x)||(DFπE)(|x))Rmax1γ12z𝒵𝑑E(z|x)logD(x|z).\displaystyle\,\,-\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\cdot\sqrt{D_{\text{KL}}\bigg{(}P_{\pi\circ E}(\cdot|x)||(D\circ F\circ\pi\circ E)(\cdot|x)\bigg{)}}-\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)}.

Recalling that U_{\mu} is the fixed-point solution, i.e., T_{\mu}[U_{\mu}](x)=U_{\mu}(x), \forall x\in\mathcal{X}, the left side is zero, which completes the proof. ∎

This theorem shows that the observation Bellman residual error w.r.t. the value function V_{\pi}\circ E, where V_{\pi} is the optimal latent value function (w.r.t. the soft Bellman fixed-point equation), depends on (i) the prediction error, (ii) the consistency term, (iii) the latent reward estimation error, and (iv) the encoder-decoder reconstruction error. Using derivations analogous to (13) for the prediction term, we can further derive an observation Bellman residual error upper bound w.r.t. the value function V_{\pi}\circ E that does not depend on the policy.

Corollary 6.

Let V_{\pi}\circ E(x)=\int_{\widetilde{z}\in\mathcal{Z}}dE(\widetilde{z}|x)V_{\pi}(\widetilde{z}) be the observation value function in which V_{\pi} is the solution of the soft latent fixed-point equation \mathcal{T}_{\pi}[V](z)=V(z) w.r.t. the latent policy \pi, the encoder-transition-decoder tuple (E,F,D), and the parameterized observation-based policy \mu=\pi\circ E. Then the following statement holds at any x\in\mathcal{X}:

|Tμ[VπE](x)VπE(x)|Rmax1γ12z𝒵𝑑E(z|x)logD(x|z)+2maxa𝒜{|r(x,a)zdE(z|x)r¯(z,a)|+γRmax2(1γ){DKL(P(|x,a)||(DFE)(|x,a))+x𝒳dP(x|x,a)DKL(E(|x)||FE(|x,a))}}.\begin{split}&\big{|}T_{\mu}[V_{\pi}\circ E](x)-V_{\pi}\circ E(x)\big{|}\leq\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z\in\mathcal{Z}}dE(z|x)\log D(x|z)}\\ &\qquad+2\cdot\max_{a\in\mathcal{A}}\left\{\left|r(x,a)-\int_{z}dE(z|x)\bar{r}(z,a)\right|+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\left\{\sqrt{D_{\text{KL}}\bigg{(}P(\cdot|x,a)||(D\circ F\circ E)(\cdot|x,a)\bigg{)}}\right.\right.\\ &\qquad\qquad\qquad\qquad+\left.\left.\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||F\circ E(\cdot|x,a)\right)}\right\}\right\}.\end{split}
Lemma 7.

The observation value function UπEU_{\pi\circ E} w.r.t. policy πE\pi\circ E satisfies the following bound

|VπE(x)UπE(x)|γ1γΔ(E,F,D,r¯,π,x),x𝒳.\left|V_{\pi}\circ E(x)-U_{\pi\circ E}(x)\right|\leq\frac{\gamma}{1-\gamma}\Delta(E,F,D,\bar{r},\pi,x),\,\,\forall x\in\mathcal{X}.

where the error term is given by

Δ(E,F,D,r¯,π,x)=Rmax1γ12z𝑑E(z|x)logD(x|z)+2|zdE(z|x)adπ(a|z)(r(x,a)r¯(z,a))|+γRmax2(1γ)DKL(PπE(|x)||(DFπE)(|x))+γRmax2(1γ)x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x)).\begin{split}\Delta(\,&E,F,D,\bar{r},\pi,x)=\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z}dE(z|x)\log D(x|z)}\\ &+2\big{|}\int_{z}dE(z|x)\int_{a}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\big{|}+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{D_{\text{KL}}\big{(}P_{\pi\circ E}(\cdot|x)\;||\;(D\circ F_{\pi}\circ E)(\cdot|x)\big{)}}\\ &+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}.\end{split}
Proof.

To prove one direction of the approximate policy evaluation inequality, recall from Theorem 5, with \mu=\pi\circ E and the LCE and latent reward models (E,F,D,\bar{r}), that

TπE[VπE](x)VπE(x)+Δ(E,F,D,r¯,π,x),x𝒳.T_{\pi\circ E}[V_{\pi}\circ E](x)\leq V_{\pi}\circ E(x)+\Delta(E,F,D,\bar{r},\pi,x),\,\,\forall x\in\mathcal{X}.

Applying the Bellman operator TπET_{\pi\circ E} on both sides we get

TπE2[VπE](x)TπE[VπE](x)+γΔ(E,F,D,r¯,π,x),x𝒳.T^{2}_{\pi\circ E}[V_{\pi}\circ E](x)\leq T_{\pi\circ E}[V_{\pi}\circ E](x)+\gamma\Delta(E,F,D,\bar{r},\pi,x),\,\,\forall x\in\mathcal{X}.

Proceeding similarly it follows that

TπE[VπE](x)TπE1[VπE](x)+γΔ(E,F,D,r¯,π,x),x𝒳.T^{\ell}_{\pi\circ E}[V_{\pi}\circ E](x)\leq T^{\ell-1}_{\pi\circ E}[V_{\pi}\circ E](x)+\gamma\Delta(E,F,D,\bar{r},\pi,x),\,\,\forall x\in\mathcal{X}.

Therefore, for every k>0k>0

TπEk[VπE](x)VπE(x)=1kTπE[VπE](x)TπE1[VπE](x)γ1γΔ(E,F,D,r¯,π,x),x𝒳.\begin{split}T^{k}_{\pi\circ E}[V_{\pi}\circ E](x)-V_{\pi}\circ E(x)&\leq\sum_{\ell=1}^{k}T^{\ell}_{\pi\circ E}[V_{\pi}\circ E](x)-T^{\ell-1}_{\pi\circ E}[V_{\pi}\circ E](x)\\ &\leq\frac{\gamma}{1-\gamma}\cdot\Delta(E,F,D,\bar{r},\pi,x),\,\,\forall x\in\mathcal{X}.\end{split}

Taking k\rightarrow\infty we then obtain U_{\pi\circ E}(x)\leq V_{\pi}\circ E(x)+\frac{\gamma}{1-\gamma}\Delta(E,F,D,\bar{r},\pi,x). The other direction of the policy evaluation inequality follows from analogous arguments. This completes the proof of the approximate policy evaluation bound. ∎

Appendix B Approximate Policy Iteration with CARL

Consider the following latent policy iteration procedure that optimizes a policy of the form \mu=\pi\circ E. Starting at iteration i=0, given an initial observation policy \mu^{(0)}, an initial observation value function U^{(0)}, an LCE model (E^{(0)},F^{(0)},D^{(0)}), and a latent reward model \bar{r}^{(0)}, proceed as follows:

  1. Compute the distilled latent policy \pi^{(i)}\leftarrow\arg\min_{\pi}D_{\text{KL}}(\pi\circ E^{(i)}(\cdot|x)||\mu^{(i)}(\cdot|x)).

  2. Compute the \pi^{(i)}-induced latent value function V_{\pi^{(i)}}(z)=\lim_{n\rightarrow\infty}\mathcal{T}^{n}_{\pi^{(i)}}[V^{(i)}](z), \forall z\in\mathcal{Z}, w.r.t. the models (F^{(i)},\bar{r}^{(i)}), and the latent state-action value function Q_{\pi^{(i)}}(z,a):=\bar{r}^{(i)}(z,a)+\gamma\int_{z^{\prime}\in\mathcal{Z}}dF^{(i)}(z^{\prime}|z,a)V_{\pi^{(i)}}(z^{\prime}).

  3. Compute the updated latent policy \pi^{(i+1)}(\cdot|z)\in\arg\max_{p\in\Delta}\int_{a\in\mathcal{A}}dp(a)\cdot Q_{\pi^{(i)}}(z,a), and the updated observation policy \mu^{(i+1)}(\cdot|x)=\pi^{(i+1)}\circ E^{(i)}(\cdot|x).

  4. Update the LCE model (E^{(i+1)},F^{(i+1)},D^{(i+1)}) and the latent reward model \bar{r}^{(i+1)} by minimizing the loss \Delta(E,F,D,\bar{r},\pi^{(i+1)}).

Repeat steps 1-4 with the updated value function U^{(i+1)}=U_{\pi^{(i)}}. A schematic sketch of the core latent policy evaluation and improvement steps (steps 2-3) is given below.
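To make the latent policy evaluation and improvement steps concrete, the following is a minimal tabular sketch in Python, assuming the learned latent dynamics and reward have been discretized into arrays; all names are illustrative, and this is not the implementation used in the paper (which works with continuous latent states and SAC).

```python
import numpy as np

# Minimal tabular sketch of latent policy iteration (steps 2-3), assuming the
# learned latent dynamics F and reward r_bar have been discretized into arrays:
#   F[a, z, z'] = probability of z' given (z, a),  r_bar[z, a] = latent reward.

def latent_policy_evaluation(F, r_bar, pi, gamma):
    """Solve V = r_pi + gamma * P_pi V for the current latent policy pi[z, a]."""
    n_z = r_bar.shape[0]
    P_pi = np.einsum("za,azx->zx", pi, F)          # P_pi[z, z'] under pi
    r_pi = np.einsum("za,za->z", pi, r_bar)        # expected latent reward under pi
    return np.linalg.solve(np.eye(n_z) - gamma * P_pi, r_pi)

def latent_policy_improvement(F, r_bar, V, gamma):
    """Greedy latent policy w.r.t. Q(z,a) = r_bar(z,a) + gamma * E_F[V(z')]."""
    Q = r_bar + gamma * np.einsum("azx,x->za", F, V)
    pi_new = np.zeros_like(r_bar)
    pi_new[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi_new, Q

# Toy usage: 4 latent states, 2 actions, random (but valid) model.
rng = np.random.default_rng(0)
F = rng.random((2, 4, 4)); F /= F.sum(axis=-1, keepdims=True)
r_bar = rng.random((4, 2))
pi = np.full((4, 2), 0.5)
for _ in range(10):                                # alternate steps 2-3
    V = latent_policy_evaluation(F, r_bar, pi, gamma=0.9)
    pi, Q = latent_policy_improvement(F, r_bar, V, gamma=0.9)
```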

Equipped with this technical property and the policy evaluation result in Theorem 5, we can now provide a policy improvement result for the procedure above.

Theorem 8.

For any observation x𝒳x\in\mathcal{X}, the latent policy iteration procedure has the approximate policy improvement property:

Uμ(i+1)(x)Uμ(i)(x)γ1γπ{π(i),π(i+1)}Δ(E(i),F(i),D(i),r¯(i),π,x)Rmax1γDKL(μ(i)(|x)||π(i)E(i)(|x)).U_{\mu^{(i+1)}}(x)\geq U_{\mu^{(i)}}(x)-\frac{\gamma}{1-\gamma}\sum_{\pi\in\{\pi^{(i)},\pi^{(i+1)}\}}\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi,x)-\frac{R_{\max}}{1-\gamma}\cdot D_{\text{KL}}(\mu^{(i)}(\cdot|x)||\pi^{(i)}\circ E^{(i)}(\cdot|x)). (18)

where the first error term is given by

Δ(E,F,D,r¯,π,x)=Rmax1γ12z𝑑E(z|x)logD(x|z)+2|zdE(z|x)adπ(a|z)(r(x,a)r¯(z,a))|+γRmax2(1γ)DKL(PπE(|x)||(DFπE)(|x))+γRmax2(1γ)x𝒳dPπE(x|x)DKL(E(|x)||(FπE)(|x)).\begin{split}\Delta(\,&E,F,D,\bar{r},\pi,x)=\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z}dE(z|x)\log D(x|z)}\\ &+2\big{|}\int_{z}dE(z|x)\int_{a}d\pi(a|z)(r(x,a)-\bar{r}(z,a))\big{|}+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{D_{\text{KL}}\big{(}P_{\pi\circ E}(\cdot|x)\;||\;(D\circ F_{\pi}\circ E)(\cdot|x)\big{)}}\\ &+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP_{\pi\circ E}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)}.\end{split}
Proof.

Using the policy evaluation property, we have

Uπ(i)E(i)(x)Vπ(i)E(i)(x)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x),x𝒳.U_{\pi^{(i)}\circ E^{(i)}}(x)\leq V_{\pi^{(i)}}\circ E^{(i)}(x)+\frac{\gamma}{1-\gamma}\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x),\,\,\,\forall x\in\mathcal{X}.

Applying the Bellman operator T_{\pi^{(i)}\circ E^{(i)}} to both sides and noticing that T_{\pi^{(i)}\circ E^{(i)}}[U](x)\leq T[U](x) for all x\in\mathcal{X} and any observation value function U, we can then show that for any x\in\mathcal{X},

Uπ(i)E(i)(x)Vπ(i)E(i)(x)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x)=z𝒵𝑑E(i)(z|x)𝒯π(i)[Vπ(i)](z)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x)z𝒵𝑑E(i)(z|x)𝒯[Vπ(i)](z)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x)=z𝒵𝑑E(i)(z|x)𝒯π(i+1)[Vπ(i)](z)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x)z𝒵𝑑E(i)(z|x)𝒯π(i+1)[Vπ(i+1)](z)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x)=z𝒵𝑑E(i)(z|x)Vπ(i+1)(z)+γ1γΔ(E(i),F(i),D(i),r¯(i),π(i),x)Uπ(i+1)E(i)(x)Uμ(i+1)(x)+γ1γ{Δ(E(i),F(i),D(i),r¯(i),π(i),x)+Δ(E(i),F(i),D(i),r¯(i),π(i+1),x)}.\begin{split}&U_{\pi^{(i)}\circ E^{(i)}}(x)\\ \leq&{V}_{\pi^{(i)}}\circ E^{(i)}(x)+\frac{\gamma}{1-\gamma}\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)\\ =&\int_{z\in\mathcal{Z}}dE^{(i)}(z|x)\mathcal{T}_{\pi^{(i)}}[V_{\pi^{(i)}}](z)+\frac{\gamma}{1-\gamma}\cdot\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)\\ \leq&\int_{z\in\mathcal{Z}}dE^{(i)}(z|x)\mathcal{T}[V_{\pi^{(i)}}](z)+\frac{\gamma}{1-\gamma}\cdot\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)\\ =&\int_{z\in\mathcal{Z}}dE^{(i)}(z|x)\mathcal{T}_{\pi^{(i+1)}}[V_{\pi^{(i)}}](z)+\frac{\gamma}{1-\gamma}\cdot\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)\\ \leq&\int_{z\in\mathcal{Z}}dE^{(i)}(z|x)\mathcal{T}_{\pi^{(i+1)}}[V_{\pi^{(i+1)}}](z)+\frac{\gamma}{1-\gamma}\cdot\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)\\ =&\int_{z\in\mathcal{Z}}dE^{(i)}(z|x)V_{\pi^{(i+1)}}(z)+\frac{\gamma}{1-\gamma}\cdot\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)\\ \leq&\underbrace{U_{\pi^{(i+1)}\circ E^{(i)}}(x)}_{U_{\mu^{(i+1)}}(x)}+\frac{\gamma}{1-\gamma}\left\{\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i)},x)+\Delta(E^{(i)},F^{(i)},D^{(i)},\bar{r}^{(i)},\pi^{(i+1)},x)\right\}.\end{split} (19)

The first inequality is based on Lemma 7. The first equality is based on the property that V_{\pi^{(i)}} is the unique solution to the fixed-point equation \mathcal{T}_{\pi^{(i)}}[V](z)=V(z). The second inequality is based on the definition

\begin{split}\mathcal{T}[V_{\pi^{(i)}}](z)&=\max_{p\in\Delta}\int_{a}dp(a)\left\{\bar{r}^{(i)}(z,a)+\gamma\int_{z^{\prime}\in\mathcal{Z}}dF^{(i)}(z^{\prime}|z,a)V_{\pi^{(i)}}(z^{\prime})\right\}\\&\geq\int_{a}d\pi^{(i)}(a|z)\left\{\bar{r}^{(i)}(z,a)+\gamma\int_{z^{\prime}\in\mathcal{Z}}dF^{(i)}(z^{\prime}|z,a)V_{\pi^{(i)}}(z^{\prime})\right\}.\end{split}

The second equality is based on the definition of π(i+1)\pi^{(i+1)}. The third inequality is based on the policy improvement property in latent policy iteration, i.e.,

𝒯[Vπ(i)]𝒯π(i)[Vπ(i)]=Vπ(i)Vπ(i+1)=limk𝒯π(i+1)k[Vπ(i)]=limk𝒯k[Vπ(i)]Vπ(i),\mathcal{T}[V_{\pi^{(i)}}]\geq\mathcal{T}_{\pi^{(i)}}[V_{\pi^{(i)}}]=V_{\pi^{(i)}}\implies V_{\pi^{(i+1)}}=\lim_{k\rightarrow\infty}\mathcal{T}_{\pi^{(i+1)}}^{k}[V_{\pi^{(i)}}]=\lim_{k\rightarrow\infty}\mathcal{T}^{k}[V_{\pi^{(i)}}]\geq V_{\pi^{(i)}},

and the monotonicity property of the latent Bellman operator. The third equality is based on the fact that V_{\pi^{(i+1)}} is the unique solution to the fixed-point equation \mathcal{T}_{\pi^{(i+1)}}[V](z)=V(z). The fourth inequality is again based on Lemma 7 (with \pi=\pi^{(i+1)}).

Furthermore, considering the error from the distillation step, using the result from Schulman et al. [2015b], one can show that

Uμ(i)(x)Uπ(i)E(i)(x)+Rmax2(1γ)DKL(π(i)E(i)(|x)||μ(i)(|x))U_{\mu^{(i)}}(x)\leq U_{\pi^{(i)}\circ E^{(i)}}(x)+\frac{R_{\max}}{\sqrt{2}(1-\gamma)}\cdot\sqrt{D_{\text{KL}}(\pi^{(i)}\circ E^{(i)}(\cdot|x)||\mu^{(i)}(\cdot|x))} (20)

Combining these inequalities yields the approximate policy improvement property. ∎

Appendix C Offline CARL

Now consider the following offline latent policy iteration procedure that optimizes a policy of the form \mu=\pi\circ E. Starting at iteration i=0, given an initial observation policy \mu^{(0)}, an initial observation value function U^{(0)}, an offline LCE model (E,F,D), and an offline latent reward model \bar{r}, proceed as follows:

  1. Compute the distilled latent policy \pi^{(i)}\leftarrow\arg\min_{\pi}D_{\text{KL}}(\pi\circ E(\cdot|x)||\mu^{(i)}(\cdot|x)).

  2. Compute the updated latent policy \pi^{(i+1)}(\cdot|z)\in\arg\max_{p\in\Delta}\int_{a\in\mathcal{A}}dp(a)\cdot Q_{\pi^{(i)}}(z,a), and the updated observation policy \mu^{(i+1)}(\cdot|x)=\pi^{(i+1)}\circ E(\cdot|x).

Repeat steps 1-2 with the updated value function U^{(i+1)}=U_{\pi^{(i)}} until convergence.

Corollary 9.

For any observation x𝒳x\in\mathcal{X}, the offline latent policy iteration procedure has the approximate policy improvement property

Uμ(i+1)(x)Uμ(i)(x)2γ1γmaxa𝒜Δ(E,F,D,r¯(i),a,x),U_{\mu^{(i+1)}}(x)\geq U_{\mu^{(i)}}(x)-\frac{2\gamma}{1-\gamma}\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r}^{(i)},a,x), (21)

and the sub-optimality performance bound:

limiUμ(i)(x)U(x)2γ(1γ)2maxa𝒜,x𝒳Δ(E,F,D,r¯,a,x),\lim_{i\rightarrow\infty}\|U_{\mu^{(i)}}(x)-U^{*}(x)\|\leq\frac{2\gamma}{(1-\gamma)^{2}}\max_{a\in\mathcal{A},x\in\mathcal{X}}\Delta(E,F,D,\bar{r},a,x),

where the optimal value function is given by

U(x)=𝔼[t=0γtrμ(xt)Pμ,μ=πE,x0=x],πargmaxπ𝔼[t=0γtr¯π(zt)Fπ,z0=z]U^{*}(x)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{\mu}(x_{t})\mid P_{\mu},\mu=\pi^{*}\circ E,x_{0}=x],\,\,\pi^{*}\in\arg\max_{\pi}\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}\bar{r}_{\pi}(z_{t})\mid F_{\pi},z_{0}=z]

and the action-dependent (and policy-independent) error term is given by

Δ(E,F,D,r¯,a,x)=Rmax1γ12z𝑑E(z|x)logD(x|z)+2|r(x,a)zdE(z|x)r¯(z,a)|+γRmax2(1γ)DKL(P(|x,a)||(DFE)(|x,a))+γRmax2(1γ)zdE(z|x)xdP(x|x,a)DKL(E(|x)||F(|z,a)).\begin{split}\Delta(\,&E,F,D,\bar{r},a,x)=\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z}dE(z|x)\log D(x|z)}\\ &+2\big{|}r(x,a)-\int_{z}dE(z|x)\bar{r}(z,a)\big{|}+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{D_{\text{KL}}\big{(}P(\cdot|x,a)\;||\;(D\circ F\circ E)(\cdot|x,a)\big{)}}\\ &+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{z}dE(z|x)\int_{x^{\prime}}dP(x^{\prime}|x,a)\cdot D_{\text{KL}}\big{(}E(\cdot|x^{\prime})\;||\;F(\cdot|z,a)\big{)}}.\end{split}
Proof.

We use the result from Theorem 8; since the LCE model (E,F,D) and the latent reward model \bar{r} do not change online, there is no need for the distillation step. Following arguments analogous to Corollary 6, replacing \Delta(E,F,D,\bar{r}^{(i)},\pi,x) with the more conservative, worst-case error term over actions, \max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r}^{(i)},a,x), eliminates the dependence on the policy. Therefore, the policy improvement property simplifies to (21).

Denote by T_{\mu^{*}}[U](x) the observation Bellman operator w.r.t. the optimal latent policy \pi^{*}. Recall that \int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}_{\pi^{(i+1)}}[V_{\pi^{(i+1)}}](z)=V_{\pi^{(i+1)}}\circ E(x). Using the results in Lemma 7, we have the following chain of inequalities

\begin{split}U_{\mu^{(i+1)}}(x)\geq&\int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}_{\pi^{(i+1)}}[V_{\pi^{(i+1)}}](z)-\frac{\gamma}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x)\\ \geq&\int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}_{\pi^{(i+1)}}[V_{\pi^{(i)}}](z)-\frac{\gamma}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x)\\ =&\int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}[V_{\pi^{(i)}}](z)-\frac{\gamma}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x)\\ \geq&\int_{z\in\mathcal{Z}}dE(z|x)\mathcal{T}_{\pi^{*}}[V_{\pi^{(i)}}](z)-\frac{\gamma}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x)\\ \geq&T_{\mu^{*}}[V_{\pi^{(i)}}\circ E](x)-\frac{\gamma+\gamma(1-\gamma)}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x),\\ \geq&T_{\mu^{*}}[U_{\mu^{(i)}}](x)-\frac{\gamma+\gamma(1-\gamma)+\gamma^{2}}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x),\end{split}

where the fourth inequality follows from direct algebraic manipulations and the last inequality follows from Lemma 7 when applied to Uμ(i)U_{\mu^{(i)}}, i.e.,

Uμ(i)(x)Vπ(i)E(x)+γ1γmaxa𝒜Δ(E,F,D,r¯,a,x)U_{\mu^{(i)}}(x)\leq V_{\pi^{(i)}}\circ E(x)+\frac{\gamma}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x)

and from the contraction property of TμT_{\mu^{*}}, i.e.,

Tμ[Uμ(i)](x)Tμ[Vπ(i)E](x)+γ21γmaxa𝒜Δ(E,F,D,r¯,a,x).T_{\mu^{*}}[U_{\mu^{(i)}}](x)\leq T_{\mu^{*}}[V_{\pi^{(i)}}\circ E](x)+\frac{\gamma^{2}}{1-\gamma}\cdot\max_{a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x).

Also notice that with UU^{*} equal to the fixed-point solution of this Bellman operator, we have the following property:

-\gamma\|U_{\mu^{(i)}}-U^{*}\|_{\infty}\leq T_{\mu^{*}}[U_{\mu^{(i)}}]-T_{\mu^{*}}[U^{*}]=T_{\mu^{*}}[U_{\mu^{(i)}}]-U^{*}\leq\gamma\|U_{\mu^{(i)}}-U^{*}\|_{\infty}.

Furthermore, since Uμ(i+1)(x)U(x)U_{\mu^{(i+1)}}(x)\leq U^{*}(x), we have the chain of inequalities

Uμ(i+1)U=Uμ(i+1)(x)U(x)Tμ[Uμ(i)](x)U(x)2γ1γmaxx𝒳,a𝒜Δ(E,F,D,r¯,a,x)γUμ(i)U2γ1γmaxx𝒳,a𝒜Δ(E,F,D,r¯,a,x).\begin{split}-\|U_{\mu^{(i+1)}}-U^{*}\|_{\infty}=U_{\mu^{(i+1)}}(x)-U^{*}(x)\geq&T_{\mu^{*}}[U_{\mu^{(i)}}](x)-U^{*}(x)-\frac{2\gamma}{1-\gamma}\max_{x\in\mathcal{X},a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x)\\ \geq&-\gamma\|U_{\mu^{(i)}}-U^{*}\|_{\infty}-\frac{2\gamma}{1-\gamma}\cdot\max_{x\in\mathcal{X},a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x).\end{split}

In other words, we have

Uμ(i+1)(x)U(x)γUμ(i)U+2γ1γmaxx𝒳,a𝒜Δ(E,F,D,r¯,a,x).\|U_{\mu^{(i+1)}}(x)-U^{*}(x)\|_{\infty}\leq\gamma\|U_{\mu^{(i)}}-U^{*}\|_{\infty}+\frac{2\gamma}{1-\gamma}\max_{x\in\mathcal{X},a\in\mathcal{A}}\Delta(E,F,D,\bar{r},a,x).

The proof is completed by taking ii\rightarrow\infty and noticing that

\lim_{i\rightarrow\infty}\|U_{\mu^{(i+1)}}(x)-U^{*}(x)\|_{\infty}=\lim_{i\rightarrow\infty}\|U_{\mu^{(i)}}(x)-U^{*}(x)\|_{\infty}. ∎

Appendix D Value-Guided CARL

Previously, the LCE model (E,F,D) and the latent reward model \bar{r} were learned by minimizing the error terms in the lower bound of approximate policy improvement. While this procedure depends on the current policy \pi and encoder E to generate new data for updating these models, the model learning part consists only of (i) the prediction loss, (ii) the consistency loss, and (iii) the policy-matching regularization loss between the observation policies \mu and \pi\circ E (the distillation loss). The LCE model learning objective does not explicitly take the primary RL objective into account.

To tackle this issue, we apply techniques from variational model-based policy optimization [Chow et al., 2020], which aims at learning a dynamics model that is also sensitive to the value function of the RL objective, to learn the LCE model. In particular, according to (16) of Chow et al. [2020], the "optimal" observation dynamics model that also takes the value function w.r.t. policy \mu into account has the form

P(x|x,a)=P(x|x,a)exp(τU~μ(x))exp(τ(W~μ(x,a)r(x,a))/γ)=P(x|x,a)exp(τr(x,a)+γU~μ(x)W~μ(x,a)γ),P^{*}(x^{\prime}|x,a)=\frac{P(x^{\prime}|x,a)\cdot\exp\big{(}\tau\cdot\tilde{U}_{\mu}(x^{\prime})\big{)}}{\exp\big{(}\tau\cdot(\tilde{W}_{\mu}(x,a)-r(x,a))/\gamma\big{)}}=P(x^{\prime}|x,a)\cdot\exp\left(\tau\cdot\frac{r(x,a)+\gamma\tilde{U}_{\mu}(x^{\prime})-\tilde{W}_{\mu}(x,a)}{\gamma}\right), (22)

in which U~μ(x)\tilde{U}_{\mu}(x) is the risk-adjusted observation value function at policy μ\mu, i.e.,

U~μ(x):=1τlog𝔼[exp(τt=0γtrμ(xt))Pμ,x0=x],\tilde{U}_{\mu}(x):=\frac{1}{\tau}\log\mathbb{E}\left[\exp\left(\tau\cdot\sum_{t=0}^{\infty}\gamma^{t}r_{\mu}(x_{t})\right)\mid P_{\mu},x_{0}=x\right],

which is also a unique solution that satisfies the fixed-point property:

U~μ(x)=a𝑑μ(a|x)[r(x,a)+γ1τlog𝔼xP(|x,a)[exp(τU~(x))]],\tilde{U}_{\mu}(x)=\int_{a}d\mu(a|x)\left[r(x,a)+\gamma\cdot\frac{1}{\tau}\cdot\log\mathbb{E}_{x^{\prime}\sim P(\cdot|x,a)}\big{[}\exp\big{(}\tau\cdot\tilde{U}(x^{\prime})\big{)}\big{]}\right],

and \tilde{W}_{\mu}(x,a) is the corresponding risk-adjusted observation state-action value function at policy \mu, i.e.,

W~μ(x,a):=r(x,a)+γ1τlog𝔼xP(|x,a)[exp(τUμ(x))].\tilde{W}_{\mu}(x,a):=r(x,a)+\gamma\cdot\frac{1}{\tau}\cdot\log\mathbb{E}_{x^{\prime}\sim P(\cdot|x,a)}\big{[}\exp\big{(}\tau\cdot U_{\mu}(x^{\prime})\big{)}\big{]}.

The above value functions are termed "risk-adjusted" because the next state is no longer marginalized by taking an expectation over the original transition probability P(x|x,a)P(x^{\prime}|x,a), but instead it is marginalized by taking the exponential risk [Ruszczyński and Shapiro, 2006], i.e., ρτ(U()|x,a)=1τlog𝔼xP(|x,a)[exp(τU(x))]\rho_{\tau}(U(\cdot)|x,a)=\frac{1}{\tau}\log\mathbb{E}_{x^{\prime}\sim P(\cdot|x,a)}[\exp(\tau\cdot U(x^{\prime}))], where τ\tau is the temperature constant. This modified dynamics model PP^{*} is an exponential twisting of the original transition dynamics PP with weight

w(x,a,x)=τ(r(x,a)+γU~μ(x)W~μ(x,a))/γ,w(x,a,x^{\prime})=\tau\cdot(r(x,a)+\gamma\tilde{U}_{\mu}(x^{\prime})-\tilde{W}_{\mu}(x,a))/\gamma, (23)

which corresponds to the standard discounted TD-error of the risk-adjusted value functions.
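As a small illustration, the weight in (23) and the corresponding multiplicative factor can be computed as follows, assuming scalar estimates of r(x,a), \tilde{U}_{\mu}(x^{\prime}), and \tilde{W}_{\mu}(x,a) are available; the function names are illustrative only.

```python
import math

def twisting_weight(r, U_next, W, gamma, tau):
    """Exponential-twisting weight w(x, a, x') in (23): a discounted TD-error of
    the risk-adjusted value functions, scaled by the temperature tau."""
    return tau * (r + gamma * U_next - W) / gamma

def twisting_factor(r, U_next, W, gamma, tau):
    """Multiplicative factor exp(w) used to reweight the original dynamics P."""
    return math.exp(twisting_weight(r, U_next, W, gamma, tau))

# Toy usage with made-up numbers: a transition with positive TD-error is
# up-weighted, one with negative TD-error is down-weighted.
print(twisting_factor(r=1.0, U_next=5.0, W=5.2, gamma=0.99, tau=0.03))
```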

To incorporate the “value-guided” transition model PP^{*} in the LCE model learning, all we need to do is to replace the original transition model PP in the prediction loss and in the consistency loss with the value-guided counterpart. Recall the prediction loss:

L_{p}(E,F,D,\pi,x)=D_{\text{KL}}\big{(}P_{\mu}(\cdot|x)\;||\;(D\circ F_{\pi}\circ E)(\cdot|x)\big{)}=\int_{x^{\prime}}dP_{\mu}(x^{\prime}|x)\log\frac{P_{\mu}(x^{\prime}|x)}{(D\circ F_{\pi}\circ E)(x^{\prime}|x)}.

Since the term \log P_{\mu}(x^{\prime}|x) is independent of the LCE model, one can equivalently optimize the maximum-likelihood (MLE) loss:

Lmle(E,F,D,π,x)=x𝑑Pμ(x|x)log(DFπE)(x|x).L_{\text{mle}}(E,F,D,\pi,x)=-\int_{x^{\prime}}dP_{\mu}(x^{\prime}|x)\cdot\log(D\circ F_{\pi}\circ E)(x^{\prime}|x).

Now with the value-guided transition model, this MLE loss function can be re-written as

\begin{split}L_{\text{mle}}(E,F,D,\pi,x)=&-\int_{x^{\prime}}dP^{*}_{\mu}(x^{\prime}|x)\cdot\log(D\circ F_{\pi}\circ E)(x^{\prime}|x)\\ =&-\int_{a}d\mu(a|x)\int_{x^{\prime}}dP(x^{\prime}|x,a)\cdot\exp(w(x,a,x^{\prime}))\cdot\log(D\circ F_{\pi}\circ E)(x^{\prime}|x).\end{split} (24)

For the consistency loss, consider

Lc(E,F,D,π,x)=x𝒳dPμ(x|x)DKL(E(|x)||(FπE)(|x)).L_{c}(E,F,D,\pi,x)=\int_{x^{\prime}\in\mathcal{X}}dP_{\mu}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right).

Similar to the derivation of the prediction loss, with the value-guided transition model this consistency loss function can be re-written as

Lc(E,F,D,π,x)=x𝒳dPμ(x|x)DKL(E(|x)||(FπE)(|x))=adμ(a|x)xdP(x|x,a)exp(w(x,a,x))DKL(E(|x)||(FπE)(|x)).\begin{split}L_{c}(E,F,&D,\pi,x)=\int_{x^{\prime}\in\mathcal{X}}dP^{*}_{\mu}(x^{\prime}|x)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right)\\ =&\int_{a}d\mu(a|x)\int_{x^{\prime}}dP(x^{\prime}|x,a)\cdot\exp(w(x,a,x^{\prime}))\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)\right).\end{split} (25)
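As a minimal sketch of how these weighted losses can be estimated from a batch, assume per-sample quantities have already been computed by the LCE networks: log-likelihoods \log(D\circ F_{\pi}\circ E)(x^{\prime}|x), consistency KL terms D_{\text{KL}}(E(\cdot|x^{\prime})||(F_{\pi}\circ E)(\cdot|x)), and twisting weights w(x,a,x^{\prime}); the array names below are illustrative.

```python
import numpy as np

def value_guided_losses(log_probs, kl_terms, w):
    """Monte-Carlo estimates of the weighted MLE loss (24) and the weighted
    consistency loss (25), where each sample (x, a, x') is reweighted by exp(w).

    log_probs : per-sample log (D o F_pi o E)(x'|x), shape [batch]
    kl_terms  : per-sample KL( E(.|x') || (F_pi o E)(.|x) ), shape [batch]
    w         : per-sample twisting weights w(x, a, x'), shape [batch]
    """
    weights = np.exp(w)
    weighted_mle = -np.mean(weights * log_probs)          # estimates (24)
    weighted_consistency = np.mean(weights * kl_terms)    # estimates (25)
    return weighted_mle, weighted_consistency

# Toy usage with random per-sample statistics.
rng = np.random.default_rng(0)
losses = value_guided_losses(log_probs=rng.normal(-2.0, 0.5, size=32),
                             kl_terms=rng.uniform(0.0, 1.0, size=32),
                             w=rng.normal(0.0, 0.1, size=32))
```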

Below we propose ways to efficiently compute the exponential twisting weight w(x,a,x^{\prime})=\tau\cdot(r(x,a)+\gamma\tilde{U}_{\mu}(x^{\prime})-\tilde{W}_{\mu}(x,a))/\gamma. For simplicity we consider the case when \tau is small, in which case \rho_{\tau}(U(\cdot)|x,a)\approx\mathbb{E}[U|x,a]. Extending the following arguments to general \tau requires directly learning the risk-adjusted value function \tilde{U}_{\mu}(x) and state-action value function \tilde{W}_{\mu}(x,a); details can be found in Borkar [2002].

Under this condition, for any (x,a)𝒳×𝒜(x,a)\in\mathcal{X}\times\mathcal{A}, we have

U~μ(x)Uμ(x),W~μ(x,a)Wμ(x,a):=r(x,a)+x𝑑P(x|x,a)Uμ(x).\tilde{U}_{\mu}(x)\approx U_{\mu}(x),\quad\quad\tilde{W}_{\mu}(x,a)\approx W_{\mu}(x,a):=r(x,a)+\int_{x^{\prime}}dP(x^{\prime}|x,a)U_{\mu}(x^{\prime}).

Instead of computing the value functions in the observation space, we can approximate them with their low-dimensional latent-space counterparts. In particular, Lemma 7 implies that

|VπE(x)Uμ(x)|γ1γΔ(E,F,D,r¯,π,x)+Rmax1γDKL(πE(|x)||μ(|x)),x𝒳.\left|V_{\pi}\circ E(x)-U_{\mu}(x)\right|\leq\frac{\gamma}{1-\gamma}\Delta(E,F,D,\bar{r},\pi,x)+\frac{R_{\max}}{1-\gamma}\cdot D_{\text{KL}}(\pi\circ E(\cdot|x)||\mu(\cdot|x)),\,\,\forall x\in\mathcal{X}.

Since we are optimizing the terms on the right side of the bound for the LCE model, if these terms are small, then Uμ(x)VπE(x)U_{\mu}(x)\approx V_{\pi}\circ E(x). Following analogous derivations we also have the following error bound for the state-action value function:

|QπE(x,a)Wμ(x,a)|γ1γΔ(E,F,D,r¯,x,a)+γRmax1γDKL(πE(|x)||μ(|x)),x𝒳,a𝒜.\begin{split}&\left|Q_{\pi}\circ E(x,a)-W_{\mu}(x,a)\right|\\ \leq&\frac{\gamma}{1-\gamma}\Delta(E,F,D,\bar{r},x,a)+\frac{\gamma R_{\max}}{1-\gamma}\cdot D_{\text{KL}}(\pi\circ E(\cdot|x)||\mu(\cdot|x)),\,\,\forall x\in\mathcal{X},\,\,a\in\mathcal{A}.\end{split}

where the state-action error term is given by

Δ(E,F,D,r¯,x,a):=Rmax1γ12z𝑑E(z|x)logD(x|z)+2|r(x,a)r¯(z,a)|+γRmax2(1γ)DKL(P(|x,a)||(DFE)(|x,a))+γRmax2(1γ)x𝒳dP(x|x,a)DKL(E(|x)||(FE)(|x,a)).\begin{split}\Delta(E,F,D,\bar{r},\,&x,a):=\frac{R_{\max}}{1-\gamma}\sqrt{\frac{-1}{2}\int_{z}dE(z|x)\log D(x|z)}\\ &+2\big{|}r(x,a)-\bar{r}(z,a)\big{|}+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{D_{\text{KL}}\big{(}P(\cdot|x,a)\;||\;(D\circ F\circ E)(\cdot|x,a)\big{)}}\\ &+\frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)}\sqrt{\int_{x^{\prime}\in\mathcal{X}}dP(x^{\prime}|x,a)\cdot D_{\text{KL}}\left(E(\cdot|x^{\prime})||(F\circ E)(\cdot|x,a)\right)}.\end{split}

If this term is small, the state-action value function can be approximated by W_{\mu}(x,a)\approx Q_{\pi}\circ E(x,a).

Finally, recall that we are learning the latent reward model by minimizing the following reward loss: |z,adE(z|x)dπ(a|z)(r(x,a)r¯(z,a))|\big{|}\int_{z,a}dE(z|x)d\pi(a|z)(r(x,a)-\bar{r}(z,a))\big{|}. Therefore, we have that r(x,a)r¯E(x,a)r(x,a)\approx\bar{r}\circ E(x,a). Together, the exponential twisting weights can be approximated by the latent reward, latent value function, and latent state-action value function as follows:

w(x,a,x^{\prime})\approx\widehat{w}(x,a,x^{\prime}):=\int_{z,z^{\prime}}dE(z|x)\cdot dE(z^{\prime}|x^{\prime})\cdot\big{(}\bar{r}(z,a)-Q_{\pi}(z,a)+\gamma V_{\pi}(z^{\prime})\big{)},

and correspondingly the exponential twisting term can be approximated by \exp(\frac{\tau}{\gamma}\cdot\widehat{w}(x,a,x^{\prime})); a small sketch is given below.
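A minimal sketch of this latent approximation, using single encoder samples as a one-sample Monte-Carlo estimate of the integral, is shown below; encode, r_bar, Q_pi, and V_pi are illustrative stand-ins for the learned encoder, latent reward, and latent critics.

```python
import numpy as np

def latent_twisting_weight(x, a, x_next, encode, r_bar, Q_pi, V_pi, gamma, tau):
    """One-sample Monte-Carlo estimate of w_hat(x, a, x') and the corresponding
    exponential twisting factor exp((tau / gamma) * w_hat).

    encode(x) is assumed to return a sampled latent state z ~ E(.|x);
    r_bar, Q_pi, V_pi are the learned latent reward, state-action value, and
    state value functions.
    """
    z = encode(x)                     # z  ~ E(.|x)
    z_next = encode(x_next)           # z' ~ E(.|x')
    w_hat = r_bar(z, a) - Q_pi(z, a) + gamma * V_pi(z_next)
    return w_hat, np.exp(tau / gamma * w_hat)
```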

Appendix E CARL Algorithms

Below in Algorithm 2 we present the practical implementation of the CARL algorithm with notation for all of its variants (offline CARL, online CARL, Value-Guided CARL).

Algorithm 2 Control Aware Representation Learning (CARL)
1:  Inputs: A dataset \mathcal{D}_{\text{real}} of tuples (x,a,x^{\prime},x_{g}) from the environment Env. An LCE network \mathcal{M} consisting of an encoder E:\mathcal{X}\rightarrow\mathcal{Z}, transition dynamics F:\mathcal{Z}\times\mathcal{A}\rightarrow\mathcal{Z}, decoder D:\mathcal{Z}\rightarrow\mathcal{X}, and backwards encoder B:\mathcal{X}\times\mathcal{Z}\times\mathcal{A}\rightarrow\mathcal{Z}; plus networks for control: latent critic networks V_{\phi_{1}},V_{\phi_{2}}:\mathcal{Z}\rightarrow\mathbb{R}, Q_{\theta_{1}},Q_{\theta_{2}}:\mathcal{Z}\times\mathcal{A}\rightarrow\mathbb{R}, and a latent actor network \pi_{\psi}:\mathcal{Z}\rightarrow\mathcal{A}
2:  for i=0,,Ti=0,\ldots,T #when T=0T=0 this is offline CARL do
3:     for j=1,,j=1,\ldots, num_pcc_epochs do
4:        #representation learning.
5:        Train the LCE model \mathcal{M}^{(i)} using the dataset \mathcal{D}_{\text{real}}
6:        For value-guided CARL, the prediction and consistency loss functions require the exponential twisting weight \exp(\frac{\tau}{\gamma}\cdot\hat{w}(x,a,x^{\prime})), where \hat{w}(x,a,x^{\prime})=\int_{z,z^{\prime}}dE^{(i-1)}(z|x)\cdot dE^{(i-1)}(z^{\prime}|x^{\prime})\cdot\big{(}-||z_{g}-z^{\prime}||_{2}-Q_{\bar{\theta}}(z,a)+\gamma V_{\bar{\phi}}(z^{\prime})\big{)}
7:     end for
8:     Initialize a soft actor critic (SAC) policy π(i)\pi^{(i)}
9:     if Do policy distillation and i1i\geq 1 then
10:         #Corresponds to the policy distillation loss LpL_{p}
11:        for Each policy distillation epoch do
12:           πψ(i)πψ(i)ψ𝔼xD[DKL(πψ(i)(E(i)(|x))||πψ(i)(E(i1)(|x)))]\pi^{(i)}_{\psi}\leftarrow\pi^{(i)}_{\psi}-\nabla_{\psi}\mathbb{E}_{x\sim D}\left[D_{KL}\left(\pi^{(i)}_{\psi}\left(E^{(i)}(\cdot|x)\right)||\pi^{(i)}_{\psi}\left(E^{(i-1)}(\cdot|x)\right)\right)\right]
13:        end for
14:     end if
15:     Initialize a latent space buffer latent\mathcal{B}_{latent}
16:     #learning a latent space policy
17:     for Each soft actor critic step do
18:        Sample real dataset (x,a,xg)𝒟real(i)(x,a,x_{g})\sim\mathcal{D}_{\text{real}}^{(i)}
19:        Generate the necessary latent-space variables: z\sim E(\cdot|x),\;z^{\prime}\sim F(\cdot|z,a),\;z_{g}\sim E(\cdot|x_{g}),\;r=-||z_{g}-z^{\prime}||_{2}
20:        Add latent batch to latent buffer latentlatent(z,a,z,r,zg)\mathcal{B}_{latent}\leftarrow\mathcal{B}_{latent}\cup(z,a,z^{\prime},r,z_{g})
21:        Sample latent buffer (z,a,z,r,zg)latent(z,a,z^{\prime},r,z_{g})\sim\mathcal{B}_{latent}
22:         #Train the policy \pi^{(i)} on (z,a,z^{\prime},r,z_{g}) with the SAC algorithm
23:        θiθiκQθiJQ(θi)\theta_{i}\leftarrow\theta_{i}-\kappa_{Q}\nabla_{\theta_{i}}J_{Q}(\theta_{i}) for i{1,2}i\in\{1,2\} #Update the Q-function weights
24:        ϕiϕiκVϕiJV(ϕi)\phi_{i}\leftarrow\phi_{i}-\kappa_{V}\nabla_{\phi_{i}}J_{V}(\phi_{i}) for i{1,2}i\in\{1,2\} #Update the V-function weights
25:        ψψκπψJψ(ψ)\psi\leftarrow\psi-\kappa_{\pi}\nabla_{\psi}J_{\psi}(\psi) #Update the policy weights
26:        θi¯νθi+(1ν)θi¯\bar{\theta_{i}}\leftarrow\nu\theta_{i}+(1-\nu)\bar{\theta_{i}} for i{1,2}i\in\{1,2\} #Update the Q-target critic networks weights
27:        ϕi¯νϕi+(1ν)ϕi¯\bar{\phi_{i}}\leftarrow\nu\phi_{i}+(1-\nu)\bar{\phi_{i}} for i{1,2}i\in\{1,2\} #Update the V-target critic networks weights
28:     end for
29:      #Sample the environment for new real data
30:     for Each Interaction with Environment do
31:        Sample Actions aπ(i)(E(|x))a\sim\pi^{(i)}(E(\cdot|x))
32:        Interact with the environment xEnv(x,a)x^{\prime}\leftarrow Env(x,a)
33:        Update real data dataset 𝒟real𝒟real(x,a,x,xg)\mathcal{D}_{\text{real}}\leftarrow\mathcal{D}_{\text{real}}\cup(x,a,x^{\prime},x_{g})
34:        Update current state xxx\leftarrow x^{\prime}
35:     end for
36:  end for
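The latent rollout step of Algorithm 2 (lines 18-20) can be sketched as follows, assuming the encoder and latent dynamics return the mean and standard deviation of diagonal Gaussians; the function names are illustrative and not tied to our actual implementation.

```python
import numpy as np

def sample_gaussian(mean, std, rng):
    """Reparameterized sample from a diagonal Gaussian."""
    return mean + std * rng.standard_normal(mean.shape)

def latent_transition(x, a, x_goal, encoder, dynamics, rng):
    """Generate one latent transition (z, a, z', r, z_g) for the SAC buffer.

    encoder(x)     -> (mean, std) of E(.|x)
    dynamics(z, a) -> (mean, std) of F(.|z, a)
    The reward is the negative distance to the encoded goal, as in Algorithm 2.
    """
    z = sample_gaussian(*encoder(x), rng)
    z_next = sample_gaussian(*dynamics(z, a), rng)
    z_goal = sample_gaussian(*encoder(x_goal), rng)
    r = -np.linalg.norm(z_goal - z_next)
    return z, a, z_next, r, z_goal
```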

Soft Actor Critic (SAC) Updates. The policy parameters \psi are optimized to move the latent-space policy towards the exponential of the soft Q-function,

J_{\pi}(\psi)=\underset{z_{t}\sim\mathcal{D}}{\mathbb{E}}\left[\underset{a_{t}\sim\pi_{\psi}}{\mathbb{E}}\left[\alpha\log\pi_{\psi}(a_{t}|z_{t})-Q_{\theta}(z_{t},a_{t})\right]\right] (26)

Our updates to the Q network minimize the following loss function:

JQ(θ)=𝔼(zt,at)𝒟[12(Qθ(zt,at)Q^(zt,at))2]J_{Q}(\theta)=\mathbb{E}_{(z_{t},a_{t})\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}(z_{t},a_{t})-\hat{Q}(z_{t},a_{t})\right)^{2}\right] (27)

where:

Q^θ(zt,at)=r(zt,at)+i=0k1γi+1r(zt+i+1,at+i+1)|atπψ(|zt),zt+1F(|zt,at)\hat{Q}_{\theta}(z_{t},a_{t})=r(z_{t},a_{t})+\sum_{i=0}^{k-1}\gamma^{i+1}r(z_{t+i+1},a_{t+i+1})|a_{t}\sim\pi_{\psi}(\cdot|z_{t}),z_{t+1}\sim F(\cdot|z_{t},a_{t}) (28)

Here, F is the learned latent-space transition model and r(z_{t},a_{t})=-||z_{\text{goal}}-z_{t+1}||^{2}_{2}, where z_{t+1}\sim F(\cdot|z_{t},a_{t}), z_{\text{goal}}\sim E(\cdot|x_{\text{goal}}), and x_{\text{goal}} is the goal observation of the environment; additionally, k is a tunable hyperparameter specifying how many steps we roll out the latent model to sufficiently approximate the Q-value (see the sketch below). As in Haarnoja et al. [2018], we utilize two Q-functions and take their minimum to generate the value in the actor loss function. We note that we do not use value network updates; we tried adding value networks but were unable to obtain good results.
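The k-step latent rollout estimate of the Q-target in (28) can be sketched as follows; policy, dynamics, and reward are illustrative stand-ins for the learned latent actor \pi_{\psi}, latent transition model F, and latent reward.

```python
def k_step_latent_q_target(z0, a0, policy, dynamics, reward, gamma, k):
    """Monte-Carlo estimate of the k-step latent rollout target in (28):
    Q_hat(z_t, a_t) = r(z_t, a_t) + sum_{i=0}^{k-1} gamma^{i+1} r(z_{t+i+1}, a_{t+i+1}),
    where a_{t+i+1} ~ pi_psi(.|z_{t+i+1}) and z_{t+i+1} ~ F(.|z_{t+i}, a_{t+i})."""
    z, a = z0, a0
    q_target = reward(z, a)
    for i in range(k):
        z = dynamics(z, a)        # z_{t+i+1} ~ F(.|z, a)
        a = policy(z)             # a_{t+i+1} ~ pi_psi(.|z)
        q_target += gamma ** (i + 1) * reward(z, a)
    return q_target
```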

Appendix F Experimental Details

In the following section we describe the domains and implementation details used for the experiments. For all the experiments we define the reward function as r(z,a)=-(z-z_{g})^{\top}Q(z-z_{g})-a^{\top}Ra, where z and z_{g} are the latent states of the current and goal observations, and Q=\kappa\cdot I_{n_{z}} and R=I_{n_{a}}, with \kappa=50, are the penalty weights on the state error and the action, respectively. This reward configuration follows Levine et al. [2020] exactly; a minimal sketch is given below.
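As a small illustration (not the exact code used in the experiments), this reward can be computed as:

```python
import numpy as np

def latent_reward(z, z_goal, a, kappa=50.0):
    """Quadratic latent reward r(z, a) = -(z - z_g)^T Q (z - z_g) - a^T R a,
    with Q = kappa * I and R = I, as used in all experiments."""
    state_err = z - z_goal
    return -kappa * float(state_err @ state_err) - float(a @ a)

# Toy usage with a 2-dimensional latent state and action.
print(latent_reward(z=np.array([0.1, -0.2]), z_goal=np.zeros(2), a=np.array([0.05, 0.0])))
```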

Planar System

In this task the goal is to navigate an agent in a bounded area on a 2D plane [Breivik and Fossen, 2005] from one corner to the opposite one, while avoiding the six obstacles in this area. The system is observed through 40\times 40 pixel images taken from the top view, which specify the agent’s location in the area. Actions are two-dimensional and specify the x-y direction of the agent’s movement; given these actions, the next position of the agent is generated by a deterministic underlying (unobservable) state evolution function. Start State: top-left corner. Goal State: one of three corners (excluding the top-left corner). Agent’s Objective: the agent is within Euclidean distance 5 of the goal state.

For the biased variant of this experiment we uniformly sample a proportion p of the total samples within a 30\times 30 pixel region that does not include any of the goal states, and the remaining 1-p proportion of the samples is drawn uniformly from the entire underlying state space; a small sketch of this sampling scheme is given below.
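A minimal sketch of this biased sampling scheme is shown below; the placement of the 30\times 30 region is an assumption for illustration and need not match the region used in the experiments.

```python
import numpy as np

def sample_biased_positions(n, p, rng, full_size=40, biased_size=30):
    """Mix of biased and uniform samples for the planar task: a fraction p of the
    positions is drawn uniformly from a biased_size x biased_size sub-region
    (assumed here to be anchored at the origin), and the remaining 1 - p fraction
    is drawn uniformly from the full full_size x full_size area."""
    n_biased = int(p * n)
    biased = rng.uniform(0, biased_size, size=(n_biased, 2))
    uniform = rng.uniform(0, full_size, size=(n - n_biased, 2))
    return np.concatenate([biased, uniform], axis=0)

positions = sample_biased_positions(n=1000, p=0.5, rng=np.random.default_rng(0))
```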

Inverted Pendulum – SwingUp

This is the classic problem of controlling an inverted pendulum [Furuta et al., 1991] from 48\times 48 pixel images. The goal of this task is to swing up an under-actuated pendulum from the downward resting position (pendulum hanging down) to the top position and to balance it. The underlying (unobservable) state s_{t} of the system has two dimensions: angle and angular velocity. The control (action) is 1-dimensional: the torque applied to the joint of the pendulum. For all PCC-based algorithms, we consider each observation x_{t} as two images generated from consecutive time-frames (the current time and the previous time); this was also done in the original PCC paper [Levine et al., 2020]. This is because each image only shows the position of the pendulum and does not contain any information about the velocity. Start State: pole is resting down. Agent’s Objective: the pole’s angle is within \pm\pi/6 of the upright position.

For the biased variant of this experiment we sample a proportion p of the total samples from when the pendulum is close to its resting position, with angle in [-\pi,-2.0]\cup[2,\pi], and the other 1-p proportion of the samples from when the pendulum is within \pm 0.5 of the upright position.

CartPole

This is the visual version of the classic task of controlling a cart-pole system [Geva and Sitte, 1993]. The goal in this task is to balance a pole on a moving cart, while the cart avoids hitting the left and right boundaries. The control (action) is 11-dimensional, which is the force applied to the cart. The underlying state of the system sts_{t} is 44- dimensional, which indicates the angle and angular velocity of the pole, as well as the position and velocity of the cart. Similar to the inverted pendulum, in order to maintain the Markovian property the observation xtx_{t} is a stack of two 80×8080\times 80 pixel images generated from consecutive time-frames. Start State: Pole is randomly sampled in ±π/6\pm\pi/6. Agent’s Objective: pole’s angle is within ±π/10\pm\pi/10 from an upright position.

For the biased experiment we sample a proportion p of the total samples from configurations where the pole’s angle lies in [-\pi/6,-\pi/10]\cup[\pi/10,\pi/6], and the other 1-p proportion of the samples is drawn as before, uniformly from the given state space.

3-link Manipulator – SwingUp & Balance

The goal in this task is to move a 3-link manipulator from the initial position (the downward resting position) to a final position (the top position) and balance it. In the 1-link case, this experiment reduces to the inverted pendulum. In the 2-link case the setup is similar to that of the acrobot [Spong, 1995], except that we have torques applied to all intermediate joints, and in the 3-link case the setup is similar to that of the 3-link planar robot arm domain used in the E2C paper, except that the robotic arms are modeled by simple rectangular rods (instead of real images of robot arms), and our task success criterion requires both swing-up (manipulate to the final position) and balance. (Unfortunately, due to copyright issues, we cannot test our algorithms on the original 3-link planar robot arm domain.) The underlying (unobservable) state s_{t} of the system is 2N-dimensional, indicating the relative angle and angular velocity at each link, and the actions are N-dimensional, representing the force applied to each joint of the arm. The state evolution is modeled by the standard Euler-Lagrange equations [Spong, 1995, Lai et al., 2015]. Similar to the inverted pendulum and cart-pole, in order to maintain the Markovian property, the observation x_{t} is a stack of two 80\times 80 pixel images of the N-link manipulator generated from consecutive time-frames. In the experiments we evaluate the models for N=2 (2-link manipulator) and N=3 (3-link manipulator). Start State: 1st pole with angle \pi, 2nd pole with angle 2\pi/3, and 3rd pole with angle \pi/3, where angle \pi is the resting position. Agent’s Objective: the sum of all poles’ angles is within \pm\pi/6 of the upright position.

For the biased experiment we sample a proportion p of the total samples from configurations where the 1st pole is within \pm\pi/2, the 2nd pole is within \pm\pi/3, and the 3rd pole is within \pm\pi/6 of the upright position, and the other 1-p proportion of the samples is drawn as before, uniformly from the given state space.

F.1 Data Generation Procedure

For all algorithms that use the PCC framework for representation learning, we always start by sampling triplets of the form (xt,at,xt+1)(x_{t},a_{t},x_{t+1}), which is done by (1) uniformly randomly sampling an underlying state sts_{t} from the environment and creating the corresponding observation xtx_{t}, (2) uniformly randomly sampling a valid action ata_{t}, and (3) obtaining the next state st+1s_{t+1} through the environment’s true dynamics and creating the corresponding observation xt+1x_{t+1}.

When interacting with the true underlying MDP to collect more data for the iterative online variants of our algorithm, at iteration i we follow the learned policy \pi^{(i)}. We start with an initial observation x_{0} and generate the initial action a_{0}\sim\pi^{(i)}_{\psi}(E(\cdot|x_{0})), and continue following the learned policy \pi^{(i)} to obtain each subsequent action a_{j}\sim\pi^{(i)}_{\psi}(E(\cdot|x_{j})). We continue this process until we have reached the end of the episode or the pre-defined number of samples drawn from the environment. A sketch of both sampling procedures is given below.
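A minimal sketch of these two data-collection modes is given below; the environment interface (sample_state, sample_action, step, render_obs, step_obs) and the policy/encoder callables are illustrative assumptions, not the actual APIs used in our code.

```python
def collect_random_triplets(env, n):
    """Uniform exploration data: triplets (x_t, a_t, x_{t+1}) obtained by sampling a
    random underlying state, a random valid action, and the resulting next observation."""
    data = []
    for _ in range(n):
        s = env.sample_state()               # uniform underlying state s_t
        a = env.sample_action()              # uniform valid action a_t
        s_next = env.step(s, a)              # true (unobserved) dynamics
        data.append((env.render_obs(s), a, env.render_obs(s_next)))
    return data

def collect_on_policy(env, policy, encoder, x0, n):
    """Online data collection at iteration i: follow the learned policy
    a_j ~ pi_psi^(i)(E(.|x_j)) starting from an initial observation x0."""
    data, x = [], x0
    for _ in range(n):
        a = policy(encoder(x))               # a_j ~ pi_psi^(i)(E(.|x_j))
        x_next = env.step_obs(x, a)          # environment transition in observation space
        data.append((x, a, x_next))
        x = x_next
    return data
```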

In SOLAR each training sample is an episode {x1,a1,x2,,xT,aT,xT+1}\{x_{1},a_{1},x_{2},\cdots,x_{T},a_{T},x_{T+1}\} where TT is the control horizon. We uniformly sample TT actions from the action space, apply the dynamics TT times, and generate the TT corresponding observations.

F.2 Implementation

In the following we describe architectures and hyper-parameters that were used for training the different algorithms.

F.2.1 Training Hyper-Parameters and Regularizers

SOLAR training specifics (we used their default settings):

  • Batch size of 2.

  • ADAM [Kingma and Ba, 2014] with \beta_{1}=0.9, \beta_{2}=0.999, and \epsilon=10^{-8}. We use a learning rate \alpha_{\text{model}}=2\cdot 10^{-5}\times\textrm{horizon} for the \mathcal{M}\mathcal{N}\mathcal{I}\mathcal{W} prior and \alpha=10^{-3} for the other parameters.

  • \beta_{start},\beta_{end},\beta_{rate}=(10^{-4},10.0,5\cdot 10^{-5})

  • Local inference and control:

    • Data strength: 50

    • KL step: 2.0

    • Number of Iterations: 10

PCC training specifics (we use their reported optimal hyperparameters):

  • Batch size of 128.

  • ADAM with \alpha=5\cdot 10^{-4}, \beta_{1}=0.9, \beta_{2}=0.999, and \epsilon=10^{-8}.

  • L2 regularization with a coefficient of 10^{-3}.

  • (\lambda_{p},\lambda_{c},\lambda_{\text{cur}})=(1,7,1), and the additive Gaussian noise in the curvature loss is \mathcal{N}(0,\sigma^{2}), where \sigma^{2}=0.01.

  • Additional VAE [Kingma and Welling, 2013] loss term \ell_{\text{VAE}}=-\mathbb{E}_{q(z|x)}[\log p(x|z)]+D_{\text{KL}}(q(z|x)||p(z)) with a very small coefficient of 0.01, where p(z)=\mathcal{N}(0,1).

  • Additional deterministic reconstruction loss with coefficient 0.3: given the current observation x, we take the means of the encoder output and the dynamics model output, and decode to get the reconstruction of the next observation.

CARL training specifics:

  • Batch size of 128.

  • ADAM with \alpha=5\cdot 10^{-4}, \beta_{1}=0.9, \beta_{2}=0.999, and \epsilon=10^{-8}.

  • L2 regularization with a coefficient of 10^{-3}.

  • The additive Gaussian noise in the curvature loss is \mathcal{N}(0,\sigma^{2}), where \sigma^{2}=0.01.

  • As in Levine et al. [2020] we use the deterministic reconstruction loss with coefficient 0.3.

F.2.2 Network Architectures

We next present the specific architecture choices for each domain. For a fair comparison, the numbers of layers and neurons of each component were shared across all algorithms. ReLU non-linearities were used between every two layers.

Encoder: composed of a backbone (either an MLP or a CNN, depending on the domain) and an additional fully-connected layer that outputs mean and variance vectors that induce a diagonal Gaussian distribution (for PCC, SOLAR, and all CARL variants).

Decoder: composed of a backbone (either an MLP or a CNN, depending on the domain) and an additional fully-connected layer that outputs logits that induce a Bernoulli distribution (for PCC, SOLAR, and all CARL variants).

Dynamical model: the path that leads from {zt,at}\{z_{t},a_{t}\} to z^t+1\hat{z}_{t+1}. Composed of a MLP backbone and an additional fully-connected layer that outputs mean and variance vectors that induce a diagonal Gaussian distribution (for PCC, SOLAR, and all CARL variants).

Backwards dynamical model: the path that leads from \{\hat{z}_{t+1},a_{t},x_{t}\} to z_{t}. Each of these inputs goes into a fully-connected layer of \{N_{z},N_{a},N_{x}\} neurons, respectively. These outputs are then concatenated and passed through another layer of N_{\text{joint}} neurons, and finally an additional fully-connected layer outputs the mean and variance vectors that induce a diagonal Gaussian distribution.

SAC Architecture: For all of our environments with all CARL algorithms, we utilized the same SAC architecture as seen in Table 2:

Hyper Parameters for SAC Value(s)
Discount Factor 0.99
Critic Network Architecture MLP with 2 hidden layers of size 256
Actor Network Architecture MLP with 2 hidden layers of size 256
Exploration policy \mathcal{N}(0,\sigma=1)
Exploration noise (\sigma) decay 0.999
Exploration noise (\sigma) minimum 0.025
Temperature 0.99995
Soft target update rate (\tau) 0.005
Replay memory size 10^{6}
Minibatch size 128
Number of Rollouts in the Latent space, k in (28) 5
Critic learning rate 0.001
Actor learning rate 0.0005
Neural network optimizer Adam
Table 2: Hyper parameters for the SAC controller.
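
The row "Number of rollouts in the latent space, $k$ in Eq. (28)" refers to short model-based rollouts performed in the latent space to generate training data for SAC. A hedged sketch of one plausible way such $k$-step rollouts could populate the SAC replay buffer is given below; the `encoder`, `dynamics`, `reward_model`, `policy`, and `replay_buffer` interfaces are illustrative assumptions, not the exact implementation.

```python
import torch

@torch.no_grad()
def latent_rollouts(encoder, dynamics, reward_model, policy, replay_buffer,
                    x_batch, k=5, sigma=1.0):
    """Roll the learned latent dynamics forward for k steps from encoded
    observations and store the latent transitions for SAC training."""
    z = encoder(x_batch).mean                # start from the encoder means
    for _ in range(k):
        a = policy(z)                        # action from the current SAC actor
        a = a + sigma * torch.randn_like(a)  # N(0, sigma) exploration noise
        z_next = dynamics(z, a).mean         # mean latent transition
        r = reward_model(z, a, z_next)       # latent reward estimate
        replay_buffer.add(z, a, r, z_next)
        z = z_next
```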
Planar system
  • Input: $40\times 40$ images.

  • Action space: 2-dimensional

  • Latent space: 2-dimensional

  • Encoder: 3 layers: 300 units - 300 units - 4 units (2 for mean and 2 for variance; see the sketch after this list)

  • Decoder: 3 layers: 300 units - 300 units - 1600 units (logits)

  • Dynamics: 3 layers: 20 units - 20 units - 4 units

  • Backwards dynamics: $N_{z}=5,N_{a}=5,N_{x}=100$ - $N_{\text{joint}}=100$ - 4 units

  • Number of control actions, i.e., the planning horizon: $T=40$

  • Offline and Online CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=1,\lambda_{c}=7,\lambda_{\text{cur}}=1$

  • Value-Guided CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=2,\lambda_{c}=11,\lambda_{\text{cur}}=1,\tau=1/30.0$

  • Proportion of biased samples: $p=0.5$

  • Number of samples from the environment per iteration $i$ in Algorithm 2: 128

  • Initial standard deviation for collecting data (SOLAR): 1.5 for both global and local training.
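
For concreteness, a minimal sketch of how the planar networks could be instantiated from the sizes listed above (a hypothetical PyTorch construction, not the released code):

```python
import torch.nn as nn

# Planar system: 40x40 images (1600 pixels), 2-D actions, 2-D latent space.
x_dim, a_dim, z_dim = 40 * 40, 2, 2

encoder = nn.Sequential(
    nn.Linear(x_dim, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.ReLU(),
    nn.Linear(300, 2 * z_dim),       # 2 units for mean, 2 for variance
)

decoder = nn.Sequential(
    nn.Linear(z_dim, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.ReLU(),
    nn.Linear(300, x_dim),           # 1600 Bernoulli logits
)

dynamics = nn.Sequential(
    nn.Linear(z_dim + a_dim, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 2 * z_dim),        # mean and variance of z_{t+1}
)
```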

Inverted Pendulum – SwingUp
  • Input: Two $48\times 48$ images.

  • Action space: 1-dimensional

  • Latent space: 3-dimensional

  • Encoder: 3 layers: 500 units - 500 units - 6 units (3 for mean and 3 for variance)

  • Decoder: 3 layers: 500 units - 500 units - 4608 units (logits)

  • Dynamics: 3 layers: 30 units - 30 units - 6 units

  • Backwards dynamics: $N_{z}=10,N_{a}=10,N_{x}=200$ - $N_{\text{joint}}=200$ - 6 units

  • Number of control actions, i.e., the planning horizon: $T=400$

  • Offline and Online CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=1,\lambda_{c}=11,\lambda_{\text{cur}}=1$

  • Value-Guided CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=1,\lambda_{c}=7,\lambda_{\text{cur}}=1,\tau=1/60.0$

  • Proportion of biased samples: $p=0.95$

  • Number of samples from the environment per iteration $i$ in Algorithm 2: 128

  • Initial standard deviation for collecting data (SOLAR): 0.5 for both global and local training.

Cart-pole – Balancing
  • Input: Two $80\times 80$ images.

  • Action space: 1-dimensional

  • Latent space: 8-dimensional

  • Encoder: 6 layers: convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - convolutional layer: $32\times 5\times 5$; stride $(2,2)$ - convolutional layer: $32\times 5\times 5$; stride $(2,2)$ - convolutional layer: $10\times 5\times 5$; stride $(2,2)$ - 200 units - 16 units (8 for mean and 8 for variance)

  • Decoder: 6 layers: 200 units - 1000 units - 100 units - convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - upsampling $(2,2)$ - convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - upsampling $(2,2)$ - convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - upsampling $(2,2)$ - convolutional layer: $2\times 5\times 5$; stride $(1,1)$

  • Dynamics: 3 layers: 40 units - 40 units - 16 units

  • Backwards dynamics: $N_{z}=10,N_{a}=10,N_{x}=300$ - $N_{\text{joint}}=300$ - 16 units

  • Number of control actions, i.e., the planning horizon: $T=200$

  • Offline and Online CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=1,\lambda_{c}=7,\lambda_{\text{cur}}=1$

  • Value-Guided CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=1,\lambda_{c}=7,\lambda_{\text{cur}}=1,\tau=1/40.0$

  • Proportion of biased samples: $p=0.8$

  • Number of samples from the environment per iteration $i$ in Algorithm 2: 256

  • Initial standard deviation for collecting data (SOLAR): 10 for global and 5 for local training.

3-link Manipulator — Swing Up & Balance
  • Input: Two $80\times 80$ images.

  • Action space: 3-dimensional

  • Latent space: 8-dimensional

  • Encoder: 6 layers: convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - convolutional layer: $32\times 5\times 5$; stride $(2,2)$ - convolutional layer: $32\times 5\times 5$; stride $(2,2)$ - convolutional layer: $10\times 5\times 5$; stride $(2,2)$ - 500 units - 16 units (8 for mean and 8 for variance)

  • Decoder: 6 layers: 200 units - 1000 units - 100 units - convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - upsampling $(2,2)$ - convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - upsampling $(2,2)$ - convolutional layer: $32\times 5\times 5$; stride $(1,1)$ - upsampling $(2,2)$ - convolutional layer: $2\times 5\times 5$; stride $(1,1)$

  • Dynamics: 3 layers: 40 units - 40 units - 16 units

  • Backwards dynamics: $N_{z}=10,N_{a}=10,N_{x}=400$ - $N_{\text{joint}}=400$ - 16 units

  • Number of control actions, i.e., the planning horizon: $T=200$

  • Offline and Online CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=1,\lambda_{c}=11,\lambda_{\text{cur}}=1$

  • Value-Guided CARL hyperparameters: $\lambda_{ed}=0.01,\lambda_{p}=2,\lambda_{c}=11,\lambda_{\text{cur}}=1,\tau=1/60.0$

  • Proportion of biased samples: $p=0.2$

  • Number of samples from the environment per iteration $i$ in Algorithm 2: 128

  • Initial standard deviation for collecting data (SOLAR): 1 for global and 0.5 for local training.

Appendix G Additional Experiments

Policy Distillation In our iterative algorithm, we describe in Eq. (8) a method for connecting policies from two different latent spaces. Figure 5 shows the learning curves for online CARL with and without policy distillation. In general, policy distillation achieves performance similar to the iterative variant of our algorithm. Moreover, on the three-pole and swingup tasks, policy distillation leads to faster convergence. Another observed benefit is that policy distillation yields more stable final performance across environments as more samples from the environment are added.
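
Since the distilled policies live in two different latent spaces, one plausible implementation of this step is to regress the new actor onto the actions of the old actor on shared observations. The sketch below is only illustrative: the names and the squared-error objective are our assumptions and are not necessarily the exact form of Eq. (8).

```python
import torch

def distill_policy(old_encoder, old_actor, new_encoder, new_actor,
                   optimizer, observations, n_steps=100):
    """Warm-start the new actor so that, on the same observations, it mimics
    the actions the old actor takes in the old latent space."""
    for _ in range(n_steps):
        with torch.no_grad():
            z_old = old_encoder(observations).mean
            target_actions = old_actor(z_old)
        z_new = new_encoder(observations).mean
        loss = ((new_actor(z_new) - target_actions) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```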

(a) Planar (b) Swingup (c) Cartpole (d) Three-pole
Figure 5: Training curves comparing our online algorithm with and without policy distillation on continuous control environments. The solid curves depict the mean over experiments, and the deviations correspond to the standard deviation of the means.

Latent Representation and Performance for the Planar System All of the representations in the following figures were trained using the PCC framework. We present the 5 representations with the worst control performance (Figure 6) and the 5 representations with the best control performance (Figure 7) under the PCC algorithm, i.e., with iLQR as the controller. Similarly, we present the 5 worst-performing (Figure 8) and the 5 best-performing (Figure 9) representations under our offline CARL algorithm, i.e., with SAC as the controller. The maps were generated by uniformly sampling a state $s$ from the underlying environment, creating the corresponding observation $x$, and using the encoder to compute the latent representation $z=\mathbb{E}[E(\cdot|x)]$. Since all of the latent maps in Figures 6-9 are generated from the same PCC framework with the same hyperparameters, the best-performing maps are fairly similar to one another, as are the worst-performing maps. It is important to note that even though these latent maps are similar, there is a large difference in performance, as shown in Table 3. In particular, iLQR struggles significantly more than SAC in these non-linear environments, as seen in the worst-case performance and the corresponding latent maps, which contain additional twisting or curvature that results in poorer performance.

In this case, it is evident that control would be difficult in several of the poorly performing latent representations, as they contain regions that are highly non-smooth and far from locally linear; thus, a locally linear controller such as iLQR is likely to perform poorly. In Table 3, we compare the 5 best and 5 worst representations trained using the PCC framework, where the only difference is the controller (SAC vs. iLQR).
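
For reference, a minimal sketch of how the latent maps in Figures 6-9 could be generated following the description above; the `sample_state` and `render` helpers and the plotting details are illustrative assumptions.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_latent_map(env, encoder, n_samples=2000):
    """Uniformly sample underlying states, render observations, and scatter
    the 2-D encoder means to visualize the learned planar latent space."""
    latents = []
    for _ in range(n_samples):
        s = env.sample_state()                               # uniform state
        x = torch.as_tensor(env.render(s)).float().flatten().unsqueeze(0)
        latents.append(encoder(x).mean.squeeze(0).numpy())   # z = E[E(.|x)]
    latents = np.stack(latents)
    plt.scatter(latents[:, 0], latents[:, 1], s=2)
    plt.title("Planar latent map")
    plt.show()
```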

Figure 6: Latent maps for the 5 worst performing representations on average with PCC as the algorithm; thus, latent control with iLQR as the controller.
Figure 7: Latent maps for the 5 best performing representations on average with PCC as the algorithm; thus, latent control with iLQR as the controller.
Figure 8: Latent maps for the 5 worst performing representations on average with Offline CARL as the algorithm; thus, latent control with SAC as the controller.
Figure 9: Latent maps for the 5 best performing representations on average with Offline CARL as the algorithm; thus, latent control with SAC as the controller.
Environment | Algorithm | Worst 5 Avg. Results | Top 5 Avg. Results
Planar | PCC | $6.15\pm 2.89$ | $62.25\pm 4.45$
Planar | Offline CARL | $\textbf{33.46}\pm\textbf{4.61}$ | $\textbf{78.87}\pm\textbf{0.19}$
Swingup | PCC | $\textbf{68.07}\pm\textbf{3.49}$ | $95.71\pm 0.38$
Swingup | Offline CARL | $57.22\pm 3.71$ | $\textbf{98.50}\pm\textbf{0.0}$
Cartpole | PCC | $50.98\pm 5.44$ | $99.85\pm 0.08$
Cartpole | Offline CARL | $\textbf{74.44}\pm\textbf{5.28}$ | $\textbf{100.0}\pm\textbf{0.0}$
Three-pole | PCC | $0\pm 0$ | $18.42\pm 2.98$
Three-pole | Offline CARL | $\textbf{6.17}\pm\textbf{1.71}$ | $\textbf{85.77}\pm\textbf{0.23}$
Table 3: Percentage of steps in goal state; averaged over the 5 worst models and the 5 best models.

Additional SOLAR Results In our experiments for the planar, swingup, and cartpole environments, the agent starts from a point randomly chosen from a region surrounding the nominal start point of the underlying MDP; additionally, in the planar case, we randomize the target every episode. In Table 4, we present results in which the start and goal states are fixed, to see whether this improves the SOLAR results. We also shorten the horizon for swingup to 100 to see whether shorter horizons help in domains with rather long horizons. We do not present new results on the three-pole task, as it already has a fixed start state and a fixed goal. We note that there is a dramatic improvement in the planar case with a fixed start and goal state, and a modest improvement in the cartpole and swingup cases. However, these results are still not comparable to the performance of any of the CARL variants (offline CARL, online CARL, and value-guided CARL) introduced in this paper.

Environment | Algorithm | Number of Samples | Avg. Result | Best Result
Planar | SOLAR | 5000 (VAE) + 40000 (Control) | $26.70\pm 5.92$ | $41\pm 7.28$
Cartpole | SOLAR | 10000 (VAE) + 40000 (Control) | $14.60\pm 1.70$ | $20.05\pm 2.91$
Swingup | SOLAR | 20000 (VAE) + 40000 (Control) | $22.40\pm 3.07$ | $34.03\pm 2.09$
Table 4: Percentage of steps in goal state, averaged over all models and for the best model. The reported numbers of samples used for training SOLAR are for the setting in which the start and goal states are the same in every episode.