
Jump Operator Planning: Goal-Conditioned
Policy Ensembles and Zero-Shot Transfer

Thomas J. Ringstrom
Department of Computer Science
University of Minnesota
Minneapolis, MN 55404
[email protected]
Mohammadhosein Hasanbeig
Department of Computer Science
University of Oxford
Oxford, United Kingdom, OX1 3QD
[email protected]
Alessandro Abate
Department of Computer Science
University of Oxford
Oxford, United Kingdom, OX1 3QD
[email protected]
Corresponding author
Abstract

In Hierarchical Control, compositionality, abstraction, and task-transfer are crucial for designing versatile algorithms which can solve a variety of problems with maximal representational reuse. We propose a novel hierarchical and compositional framework called Jump-Operator Dynamic Programming for quickly computing solutions within a super-exponential space of sequential sub-goal tasks with ordering constraints, while also providing a fast linearly-solvable algorithm as an implementation. This approach involves controlling over an ensemble of reusable goal-conditioned policies functioning as temporally extended actions, and utilizes transition operators called feasibility functions, which summarize the initial-to-final state dynamics of those policies. Consequently, the added complexity of grounding a high-level task space onto a larger ambient state-space can be mitigated by optimizing in a lower-dimensional subspace defined by the grounding, substantially improving the scalability of the algorithm while yielding transferable solutions. We then identify classes of objective functions on this subspace whose solutions are invariant to the grounding, resulting in optimal zero-shot transfer.

Keywords: Hierarchical Control, Sequential Goal Tasks, State-Action LMDP, Zero-Shot Transfer, Goal-conditioned Policies, Compositionality

1 Introduction

An open problem in the domains of Planning, Dynamic Programming (DP) and Reinforcement Learning (RL) involves the construction of policies achieving sequential multi-subgoal tasks over long ranges of space and time. To achieve such tasks efficiently, one needs the right set of hierarchical abstractions, representations, and optimizations: identifying these features is a challenging problem with many related questions. What is the definition of a task? How do tasks interface with environments? How can representations, whether they concern policies, value functions, or task rules, be transferable within and across new environments? We propose a framework called Jump-Operator Dynamic Programming (JODP) for ordered sequential sub-goal planning, which addresses these questions by solving structured, multi-subgoal tasks using coarse-grained transition operators that jump the agent from initial to final states.

DP is a recursive and exact algorithm that provides analytic insight into the problem of sequential decision making, often to the benefit of approximation or learning frameworks like RL [1]. For instance, we understand the convergence properties of RL algorithms in relation to the optimal value function produced by DP [2, 3, 4], as they share the same underlying formalism, i.e. Markov Decision Processes (MDPs) [5]. Likewise, we argue that DP-based methods can help elucidate mechanisms for transfer and generalization in sequential-goal problems. However, it is not necessarily the case that simply augmenting a state-space with more task-variables will admit reusable solutions: we need a foundation built on representations that can facilitate efficient computation and transfer. In this paper we do not focus on learning; rather, we provide formalisms from which we can build an analytic theory of compositionality and zero-shot task transfer for sequential-goal planning. Learning algorithms, of any stripe, can then inherit the properties and guarantees of the framework by computing the appropriate representations to promote generalization.

JODP is a general framework that uses specific kinds of objects and functions that engender key properties. We show that ordered sub-goal tasks can be formulated as first-exit problems in a binary-vector task space that records sub-goal progress (Section 3.2). Dynamics through the task space can be induced by high-level goal variables mapped to low-level state-actions via grounding functions (Section 3.3). We demonstrate how to construct an ensemble of reusable goal-conditioned shortest-path policies on a base state-space in order to drive the task-space dynamics. Crucially, from these policies we can build new abstract operators that represent coarse-grained initial-to-final state and goal dynamics for each policy, called Jump Operators and Feasibility Functions (Sections 3.4, 3.5). These operators play a central role in an efficient task-policy optimization, called the task-LMDP (tLMDP), which stitches together low-level goal-conditioned policies to solve the task (Section 3.6). The key property of a policy ensemble is remappability: we can remap ensemble members to new tasks with different grounding functions, allowing us to solve policies within a super-exponential space of problems without recomputing low-level representations. Furthermore, we show that it is sufficient to compute the task solution restricted to the grounded subspace of state-actions defined by the sub-goal grounding function, which has significantly better scaling properties under increases in state-space size and sub-goal number than computing a solution with DP in the full Cartesian product state-space formed by the task- and base-space. A powerful consequence of this is that solutions to the tLMDP in the grounded subspace can have an important grounding-invariance property, which allows the solution to be remapped to different ensemble policies with perfect zero-shot transfer.

Related Research  This work has connections with several areas of research. The tLMDP and its precursor, the saLMDP (yet to be discussed), are new members of the family of Linearly-Solvable Optimal Control problems [6, 7], and are inspired by recent hierarchical innovations for controlling over sub-problems [8, 9, 10]. Research regarding goal-conditioned policies was introduced both in the context of RL [11] and in the context of model-based non-stationary control [12]. Ordered-goal tasks have also been recently proposed in RL for training agents with symbolic and temporal rules [13, 14, 15, 16, 17, 18, 19, 20]. Policy concatenation (stitching) has also been proposed as a means of skill-learning in the Options framework [21, 22, 23], as well as in Hierarchical Control [24, 12], and in other schemes for concatenating policies such as policy sketches (using modular policies and task-specific reward functions) [25] or compositional plan vectors [26]. Lastly, a compositional Boolean task algebra was recently introduced demonstrating zero-shot transfer for non-sequential policies [27].

2 Background

The Linearly Solvable Markov Decision Process (LMDP) [7] is an entropy-regularized relaxation of a standard Markov Decision Process (MDP). It is defined as a three-tuple $(\mathcal{X},p,q)$, where $\mathcal{X}$ is the finite set of states, $p:\mathcal{X}\times\mathcal{X}\rightarrow[0,1]$ is an uncontrolled "passive dynamics" transition function, and $q:\mathcal{X}\rightarrow\mathbb{R}$ is a state-cost function. The Linear Bellman equation is defined as:

$$v(x)=\min_{u}\Big[q(x)+KL\big(u(x^{\prime}|x)\,||\,p(x^{\prime}|x)\big)+\mathop{\mathbb{E}}_{x^{\prime}\sim u}[v(x^{\prime})]\Big], \qquad (1)$$

where $v:\mathcal{X}\rightarrow\mathbb{R}$ is the value function, $u:\mathcal{X}\times\mathcal{X}\rightarrow[0,1]$ is an arbitrary policy, and $KL(\cdot||\cdot)$ is the Kullback–Leibler (KL) divergence between two given distributions. Unlike the standard MDP, the action variable has been replaced with a controlled transition matrix $u(x^{\prime}|x)$. Action costs in (1) are conceptualized as deviations away from the passive dynamics, as measured by the KL-divergence. Therefore, instead of optimizing a state-to-action map $\pi^{*}(a|x)$, as one would with standard algorithms such as value-iteration [1], we optimize for desirable state-to-state Markov chain dynamics. The solution to (1) will be a policy $u^{*}$ which minimizes the expected accumulated cost contributed by both the state-cost function $q$ and the action cost under the future dynamics. Solving the equation requires a change-of-variable transformation of the value function $v(x)$, which converts it to a desirability function $z(x)=\exp(-v(x))$. It is shown in [7] that the optimal controlled dynamics, $u^{*}$, can be determined by rescaling the passive dynamics by the optimal desirability function vector, which is the largest eigenvector of $QP$, i.e. $\mathbf{z}=QP\mathbf{z}$, where $\mathbf{z}$ is the desirability function vector. The matrix $Q$ is a diagonal matrix of the negatively exponentiated cost function, i.e. $Q=diag(\exp(-\mathbf{q}))$, where $\mathbf{q}$ is the cost vector on $\mathcal{X}$, and $P$ is the matrix form of the passive transition function $p$ such that $\forall i:\sum_{j}P_{ij}=1$, where $P_{ij}=p(x_{j}|x_{i})$. The rescaled dynamics are normalized by $G[z](x)=\mathop{\mathbb{E}}_{x^{\prime}\sim p}z(x^{\prime})$ to obtain the optimal policy

$$u^{*}(x^{\prime}|x)=\frac{p(x^{\prime}|x)z(x^{\prime})}{G[z](x)}.$$
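As an illustration of the eigenvector computation above, the following minimal sketch (in NumPy; the helper names are ours, not from any particular library) computes $\mathbf{z}$ by power iteration on $QP$ and recovers $u^{*}$ by rescaling and normalizing the passive dynamics. It assumes a first-exit setup in which goal states are zero-cost absorbing rows of $P$.

```python
import numpy as np

def lmdp_desirability(P, q, tol=1e-12, max_iter=100_000):
    """Power iteration for z = Q P z.
    P: (n, n) passive dynamics, rows sum to 1; goal states are absorbing.
    q: (n,) state costs; goal states have cost 0, obstacles a very large cost.
    """
    Q = np.diag(np.exp(-q))
    z = np.ones(len(q))
    for _ in range(max_iter):
        z_next = Q @ P @ z
        z_next /= np.max(z_next)          # eigenvector is defined up to scale
        if np.max(np.abs(z_next - z)) < tol:
            break
        z = z_next
    return z

def lmdp_policy(P, z):
    """u*(x'|x) = p(x'|x) z(x') / G[z](x), with G[z](x) = E_{x'~p} z(x')."""
    U = P * z[None, :]
    return U / U.sum(axis=1, keepdims=True)
```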

3 Algorithm: Jump-Operator Dynamic Programming

3.1 State-Action LMDP

We introduce a novel extension of the LMDP, that operates on state-action pairs. This extension will be employed in two places throughout the algorithm, for both the low-level goal-conditioned policy ensemble, and the high-level tLMDP task policy that “stitches” policies from the ensemble together. Reintroducing the action variable will have non-trivial consequences for the tLMDP, as we can construct more expressive cost functions including ordering constraints on sub-goals, and the cost of following a given policy to complete a sub-goal, all while retaining a largest eigenvector solution.

Definition 3.1 (state-action LMDP).

A state-action LMDP (saLMDP) is defined as the tuple $\mathscr{M}=(\mathcal{X},\mathcal{A},p_{x},p_{a},q)$, where $\mathcal{X}$ is the finite set of states, $\mathcal{A}$ is the finite set of actions, $p_{x}:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\rightarrow[0,1]$ is a state-transition model, $p_{a}:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\times\mathcal{A}\rightarrow[0,1]$ is the "passive" action-transition dynamics, and $q:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}$ is the cost function.

We formulate a new objective function by defining the state in the saLMDP as a state-action pair $y=(x,a)$, and then substitute $y$ for $x$ in (1). After substitution, we can decompose the resulting joint distributions $u_{xa}(x^{\prime},a^{\prime}|x,a)$ and $p_{xa}(x^{\prime},a^{\prime}|x,a)$ into $u_{xa}:=u_{x}(x^{\prime}|x,a)u_{a}(a^{\prime}|x^{\prime},x,a)$ and $p_{xa}:=p_{x}(x^{\prime}|x,a)p_{a}(a^{\prime}|x^{\prime},x,a)$ via the chain rule, where $u_{x}:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\rightarrow[0,1]$ is a state-transition policy and $u_{a}:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\times\mathcal{A}\rightarrow[0,1]$ is an action-transition policy. We enforce an optimization constraint that $u_{x}(x^{\prime}|x,a)=p_{x}(x^{\prime}|x,a)$, so the controller is not free to optimize $u_{x}$, which would modify a fixed physics model encoded in $p_{x}$. Thus, we only allow the controller to modify the distribution over actions, $u_{a}$, giving rise to the saLMDP objective and solution:

$$v(x,a)=\min_{u_{a}}\Bigg[q(x,a)+\mathop{\mathbb{E}}_{a^{\prime}\sim u_{a}}\Bigg[\log\Bigg(\frac{u_{a}(a^{\prime}|x^{\prime},x,a)}{p_{a}(a^{\prime}|x^{\prime},x,a)}\Bigg)\Bigg]+\mathop{\mathbb{E}}_{\substack{a^{\prime}\sim u_{a}\\ x^{\prime}\sim u_{x}=p_{x}}}[v(x^{\prime},a^{\prime})]\Bigg], \qquad (2)$$
$$u_{a}^{*}(a^{\prime}|x^{\prime},x,a)=\frac{p_{a}(a^{\prime}|x^{\prime},x,a)z(x^{\prime},a^{\prime})}{G[z](x^{\prime},x,a)}, \qquad (3)$$
$$\mathbf{z}=QP_{xa}\mathbf{z}, \qquad (4)$$

where $G[z](x^{\prime},x,a)=\mathop{\mathbb{E}}_{a^{\prime}\sim p_{a}}z(x^{\prime},a^{\prime})$ is a normalization term and $\mathbf{z}$ is the desirability vector on $\mathcal{X}\times\mathcal{A}$, i.e. for each $y=(x,a)$ we have $z(x,a)=\exp(-v(x,a))$, and $u_{a}^{*}$ is the optimal policy; we will refer to $u^{*}_{a}$ as $u^{*}$ for compactness. ($v(x,a)$ is the same as the "Q-function", $Q(x,a)$, in RL [28]; however, since we mostly work with $z(x,a)$, we keep consistent with the original LMDP notation, in which $q$ is a cost function and $Q$ is a cost matrix.) When $p_{x}$ is a deterministic distribution, as in the LMDP, the optimal policy $u^{*}$ can be obtained by computing the optimal desirability function $z(x,a)$, which is the largest eigenvector of $QP_{xa}$ (see Appx. 5.2 for the derivation), where $Q=diag(\exp(-\mathbf{q}))$ and $P_{xa}$ is the matrix form of the joint passive dynamics $p_{xa}(x^{\prime},a^{\prime}|x,a)$. In the case where $p_{x}$ is stochastic, the solution does not reduce to a linear equation, but there exists an iterative non-linear mapping for computing the desirability function (25). Henceforth, the theory we develop will be linear, considering only deterministic $p_{x}$.

Note that as the cost of a state-action in $\mathbf{q}$ approaches infinity, its negatively exponentiated counterpart, represented as a diagonal entry in $Q$, approaches zero. Thus, zero entries on the diagonal effectively edit the transition structure of $QP_{xa}$, shaping the backwards propagation of desirability when computing (4) with power iteration and suppressing state-actions which would violate the infinite-cost transitions. Todorov [7] showed that $\mathbf{z}$ can approximate shortest paths with arbitrary precision in the limit of driving the internal state costs arbitrarily high towards infinity, i.e. $S(x)=\lim_{c\rightarrow\infty}\frac{v_{c}(x)}{c}$, where $S(x)$ is the shortest-path distance from $x$ to the goal state and $c$ is the internal state cost. In this paper, obstacles are represented with infinite cost, internal states have an arbitrarily high constant cost, and the boundary (goal) state-action has zero cost. Therefore, the saLMDP objective with this cost structure is a discrete reach-avoid shortest-path first-exit control problem, avoiding obstacle states while controlling to a boundary state-action. These policies are said to be goal-conditioned, as we can explicitly interpret each policy by its goal's grounded state-action, rendering the policies remappable for new tasks.
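To make the reach-avoid cost structure concrete, here is a small sketch (our own helper, with hypothetical indices) of the cost vector described above: infinite cost is approximated by a very large constant on obstacles, internal states carry a constant cost $c$, and the goal carries zero cost. The shortest-path distance can then be read off as $S(x)\approx -\log z(x)/c$ for large $c$.

```python
import numpy as np

def reach_avoid_costs(n_states, goal, obstacles, c=10.0):
    """Cost vector q for a goal-conditioned first-exit problem:
    constant internal cost c, (near-)infinite cost on obstacle states,
    and zero cost at the goal."""
    q = np.full(n_states, c)
    q[list(obstacles)] = 1e9    # stands in for the infinite obstacle cost
    q[goal] = 0.0
    return q

# With z solved as in the LMDP sketch above, shortest-path distances are
# approximately recovered as S = -np.log(z) / c for large internal cost c.
```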

3.2 Ordered Goal Task

We define an ordered goal task, which specifies a set of necessary sub-goals to be completed under precedence constraints.

Definition 3.2 (OG-task).

An Ordered-Goal task (OG-task) is defined as the tuple $\mathscr{T}=(\Sigma,\mathcal{G},\mathcal{O}_{\Psi},\Psi,f_{\Psi})$, where $\Sigma$ is a set of $2^{|\mathcal{G}|}$ binary vectors indicating sub-goal completion, $\mathcal{G}$ is a set of goal variables which are actions on $\Sigma$, $\mathcal{O}_{\Psi}$ is a set of precedence constraints on goal types, $\Psi$ is a set of types, and $f_{\Psi}:\mathcal{G}\rightarrow 2^{\Psi}$ is a function assigning goals to subsets of types.


Figure 1: A policy solved with the saLMDP on the OG-task state-space. Red arrows denote penalized ordering constraints on sub-goal colors (types).

Here, $\mathcal{O}_{\Psi}$ is a set of precedence relations, $o(\cdot,\cdot)$, on types, e.g. $o(\psi_{1},\psi_{2})\in\mathcal{O}_{\Psi}\implies\psi_{1}\prec\psi_{2}$, $\psi_{1},\psi_{2}\in\Psi$, and $\Sigma$ is a binary-vector state-space which tracks the progress of goal achievement. For example: $\sigma=[0,1,0,0,1]\in\Sigma$ indicates that $\textsl{g}_{2},\textsl{g}_{5}\in\mathcal{G}$ have been completed and three other goals $\textsl{g}_{1},\textsl{g}_{3},\textsl{g}_{4}\in\mathcal{G}$ remain to be completed. With these primary objects, we can derive a transition operator $T_{\sigma}(\sigma^{\prime}|\sigma,\textsl{g})$ which specifies the dynamics of the task space under given sub-goals, with the restriction that $\textsl{g}_{i}$ can only flip the $i^{th}$ bit in the vector, $\sigma(i)$. The task-space transition operator $T_{\sigma}$ encodes the task-space cube in the example of Fig. 1. Likewise, a cost function on this space, $q_{\sigma\textsl{g}}(\sigma,\textsl{g})$, can be derived directly from the ordering constraints. That is, if we would like to specify that a sub-goal of one type precedes a sub-goal of another type, e.g. $\psi_{1}\prec\psi_{2}$, then these constraints are imposed directly on the cost function. We can define an induced precedence relation set on goals

$$\mathcal{O}_{\mathcal{G}}:=\{o(\textsl{g}_{i},\textsl{g}_{j})\ |\ o(\psi_{1},\psi_{2})\in\mathcal{O}_{\Psi},\ \exists(\psi_{1},\psi_{2})\in f_{\Psi}(\textsl{g}_{i})\times f_{\Psi}(\textsl{g}_{j})\}.$$

Type relations add an additional level of symbolic reasoning into the architecture, enabling us to make statements such as "complete the purple sub-goals before the orange sub-goals" or "chop wood with an axe in addition to collecting water", as depicted by red arrows in Fig. 2.A and 5.A. The cost can then be defined as $q=q_{\sigma\textsl{g}}+q_{\sigma}$:

$$q_{\sigma\textsl{g}}(\sigma,\textsl{g}_{i})=\begin{cases}\infty & \mathbf{if}:\ \exists\textsl{g}_{j}\in\mathcal{G}\ s.t.\ o(\textsl{g}_{i},\textsl{g}_{j})\in\mathcal{O}_{\mathcal{G}}\ \land\ \sigma(j)=1;\\ 0 & \mathbf{else},\end{cases}$$
$$q_{\sigma}(\sigma)=\begin{cases}0 & \mathbf{if}:\ \sigma=\sigma_{f};\\ const. & \mathbf{else},\end{cases}$$

where $q_{\sigma\textsl{g}}$ penalizes ordering violations on the bit-flips of $\sigma$, and $q_{\sigma}$ is a constant cost on task states, with the exception of the task-space boundary state $\sigma_{f}=[1,1,...]$.
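The task-space objects above are small and explicit. The sketch below (our own hypothetical encoding, with goals indexed $0,\dots,N_{\textsl{g}}-1$) realizes $T_{\sigma}$ as deterministic bit-flip dynamics and $q_{\sigma\textsl{g}}$ as an infinite penalty on ordering violations.

```python
import numpy as np

def task_transition(sigma, g):
    """T_sigma: goal g can only flip the g-th bit of sigma (to 1)."""
    sigma_next = list(sigma)
    sigma_next[g] = 1
    return tuple(sigma_next)

def ordering_cost(sigma, g_i, order_G):
    """q_sigma_g(sigma, g_i): infinite cost if some g_j that g_i must
    precede (i.e. (g_i, g_j) in order_G) has already been completed."""
    violated = any(sigma[g_j] == 1 for (gi, g_j) in order_G if gi == g_i)
    return np.inf if violated else 0.0

# Example: three goals, with goal 0 required to precede goal 2.
order_G = {(0, 2)}
print(task_transition((0, 0, 0), 1))          # (0, 1, 0)
print(ordering_cost((0, 0, 1), 0, order_G))   # inf: goal 2 was done before goal 0
```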

Since the objects of an OG-task $\mathscr{T}$ can be directly used to define a transition operator and a cost function, much like Definition 3.1, we can introduce a passive transition dynamics $p_{\textsl{g}}(\textsl{g}^{\prime}|\sigma^{\prime},\sigma,\textsl{g})$ and solve for a policy in the task space, $u_{\textsl{g}}^{*}(\textsl{g}^{\prime}|\sigma^{\prime},\sigma,\textsl{g})$, which satisfies our constraints, as illustrated in Fig. 2.A. However, synthesizing a policy only in the task space is not sufficiently useful for an agent. To make task states and actions useful and reusable representations, we show how to map goal-actions to the agent's underlying controllable space, in this case the base space $\mathcal{X}\times\mathcal{A}$.

3.3 Grounding and Ungrounding Functions

In order to handle this mapping, we introduce specialized functions called grounding and ungrounding functions, which mediate the interaction between high-level goals and the agent's low-level dynamics. A grounding function $h:\mathcal{G}\rightarrow\mathcal{X}\times\mathcal{A}$ associates a goal variable with a state-action pair in the base space. An ungrounding function $\bar{h}:\mathcal{X}\times\mathcal{A}\rightarrow\mathcal{G}$ is a left-inverse of $h$ which maps state-actions back to the goal variable. For the grid-world examples in Fig. 2, the base action-space $\mathcal{A}=\{a_{u},a_{d},a_{r},a_{l},a_{n},a_{g}\}$ includes an action $a_{g}$, which is a special neutral action for completing a sub-goal, in addition to the five standard actions: up, down, right, left, and neutral. Consider a saLMDP transition operator $p_{xa}(x^{\prime},a^{\prime}|x,a)$. If we compose the Markov chain with the ungrounding function we obtain

$$p_{\textsl{g}xa}(\textsl{g}^{\prime},x^{\prime},a^{\prime}|\textsl{g},x,a):=\bar{h}(\textsl{g}^{\prime}|x^{\prime},a^{\prime})\,p_{xa}(x^{\prime},a^{\prime}|x,a).$$

Since we have specified transition dynamics $T_{\sigma}$ over $\Sigma$ in Section 3.2, we can also obtain the following joint dynamics

$$p_{\sigma\textsl{g}xa}(\sigma^{\prime},\textsl{g}^{\prime},x^{\prime},a^{\prime}|\sigma,\textsl{g},x,a):=T_{\sigma}(\sigma^{\prime}|\sigma,\textsl{g}^{\prime})\,p_{\textsl{g}xa}(\textsl{g}^{\prime},x^{\prime},a^{\prime}|\textsl{g},x,a).$$

We call $p_{\sigma\textsl{g}xa}$ the full dynamics. Note that this transition operator is specified over a space of size $|\Sigma\times\mathcal{G}\times\mathcal{X}\times\mathcal{A}|$. Since $\Sigma$ grows exponentially in the number of sub-goals, i.e. $2^{|\mathcal{G}|}$, this operator over the full product space can introduce considerable space and time complexity into the computation of the optimal policy. However, the added complexity can be mitigated by creating specialized transition operators called jump operators, which summarize the resulting dynamics of a policy from its initial state to the state-action of sub-goal completion. In doing so, we can avoid representing internal (non-subgoal) states of $\mathcal{X}$ and compute a solution in a lower-dimensional subspace of the full space, to be discussed in Section 3.6.
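As a minimal sketch (the container choices and names are our own, hypothetical ones), a one-to-one grounding function and its left inverse can be stored as a dictionary and its reversal; the ungrounding lookup is what attaches a completed goal, if any, to a state-action visited under $p_{xa}$.

```python
# Grounding h: goal -> (x, a); ungrounding h_bar: (x, a) -> goal (left inverse).
h = {"g1": (12, "a_g"), "g2": (40, "a_g")}       # hypothetical groundings
h_bar = {xa: g for g, xa in h.items()}

def unground(x, a):
    """Return the goal variable completed by the state-action (x, a), or None."""
    return h_bar.get((x, a))
```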


Figure 2: A: Grounding and ungrounding functions relate goals with state-actions. B: Optimal policies are shortest-path state-action Markov chains to a single goal state-action. C: Optimal policies, represented as a Markov chain $U$, can be used to compute a new jump operator. D: The jump operator and ungrounding functions are combined to form a feasibility function, which is used to couple the task-space dynamics to the base space.

3.4 Jump Operator

Let $\textsl{g}\in\mathcal{G}$ be a goal in a given OG-task $\mathscr{T}$. Recall the joint distribution $u_{xa}$ over the base space $\mathcal{X}\times\mathcal{A}$ in Section 3.1 that expresses the control policy. Once a control policy is optimized with a saLMDP, a Markov chain is induced. Consider an optimal shortest-path first-exit Markov chain matrix $U_{g}$, induced by applying the distribution ${u_{xa}}_{g}^{*}(x^{\prime},a^{\prime}|x,a)={u^{*}_{a}}_{g}(a^{\prime}|x^{\prime};x,a)p(x^{\prime}|x,a)$ that controls to the goal g's state-action pair $(x,a)_{g}$. (We use stylized 'g' to refer to goal variables and '$g$' to refer to the goal index on a state-action (Fig. 2.A); by slight abuse of notation we may write $(x_{g},a_{g})$ instead of $(x,a)_{g}$.) With this Markov chain, we can construct a matrix

$$C=\begin{bmatrix}U_{\bar{g}}&H_{g}\\ 0&I\end{bmatrix},$$

which segregates the transient-state dynamics $U_{\bar{g}}$ from the absorbing-state dynamics $H_{g}$. $U_{\bar{g}}$ is a sub-matrix of $U_{g}$ which excludes the $g^{th}$ row and column, and $H_{g}$ is a rank-one matrix of zeros except for the $g^{th}$ column, which is taken from the $g^{th}$ column of $U_{g}$, excluding the $g^{th}$ element. By taking an arbitrary power, $C^{\tau}$, the elements represent the probability of remaining in the transient states or absorbing into the goal state-action after $\tau$ time steps. The power of $C$ in the infinite limit can be written as [29]:

$$\lim_{\tau\rightarrow\infty}C^{\tau}=\begin{bmatrix}\lim_{\tau\rightarrow\infty}U_{\bar{g}}^{\tau}&(\sum_{\tau=0}^{\infty}U_{\bar{g}}^{\tau})H_{g}\\ 0&I\end{bmatrix}=\begin{bmatrix}\lim_{\tau\rightarrow\infty}U_{\bar{g}}^{\tau}&(I-U_{\bar{g}})^{-1}H_{g}\\ 0&I\end{bmatrix}, \qquad (5)$$

where $(I-U_{\bar{g}})^{-1}$ is the fundamental matrix of the absorbing chain (invertibility is generally not a concern if $U$ is derived from a first-exit problem; see 5.6 for discussion), which is the solution to the Neumann series $\sum_{\tau=0}^{\infty}U_{\bar{g}}^{\tau}$. Note that the upper-right sub-matrix $(I-U_{\bar{g}})^{-1}H_{g}$ is a rank-one matrix, which represents the probability of starting in state-action $(x,a)_{i}$, indexing the row space, and absorbing into state-action $(x,a)_{j}$, indexing the column space. There can only be positive probability in $(I-U_{\bar{g}})^{-1}H_{g}$ in the case that $j=g$, corresponding to the single defined absorbing goal state-action $(x_{g},a_{g})$. Let $\mathscr{M}^{N}$ be a set of $N$ first-exit saLMDPs, one for every grounded state-action. Consider a set of policies for each saLMDP, $\mathcal{U}_{N}=\{{u^{*}_{xa}}_{g_{1}},{u^{*}_{xa}}_{g_{2}},...\}$, and an associated set of policy indices $\Pi=\{\pi_{\textsl{g}_{1}},\pi_{\textsl{g}_{2}},...\}$, which indexes corresponding members of $\mathcal{U}_{N}$ to avoid overlapping notation for high- and low-level policies. Specifically, $\pi_{\textsl{g}}$ points to a low-level policy: $\pi_{\textsl{g}}\rightarrow{u^{*}_{xa}}_{g}$. We can define a new operator using the upper-right sub-matrix for any $g$ and its associated ${u^{*}_{xa}}_{g}$ as:

$$J((x,a)_{j}|(x,a)_{i},\pi_{\textsl{g}})=e_{i}^{T}(I-U_{\bar{g}})^{-1}H_{g}e_{j}, \qquad (6)$$

where $e_{i}$ and $e_{j}$ are one-hot unit vectors corresponding to $(x,a)_{i}$ and $(x,a)_{j}$. The probability of starting and ending at the boundary state-action $(x,a)_{g}$ is defined to be 1. $J$ can also be written as $J((x,a)_{j}|(x,a)_{i},u_{xa_{g}})$, given that $\pi_{\textsl{g}}$ is simply an index variable; this will be important when we remap low-level solutions to new problems. The new operator $J$ is a jump operator, which maps the agent from an initial state-action to the goal state-action with the probability that it will eventually reach it. We now show how $J$ can be combined with $\bar{h}$ to construct joint-state jump dynamics.
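A minimal NumPy sketch of equations (5)-(6), assuming $U$ is the first-exit state-action Markov chain induced by the policy for goal $g$ and that the goal's row/column index is known (the helper is our own, not the authors' implementation):

```python
import numpy as np

def jump_operator(U, g):
    """J(.|., pi_g): probability of eventually absorbing into the goal
    state-action g from each initial state-action, via the fundamental
    matrix (I - U_gbar)^{-1} H_g of the absorbing chain."""
    n = U.shape[0]
    idx = [i for i in range(n) if i != g]          # transient (non-goal) indices
    U_gbar = U[np.ix_(idx, idx)]                   # transient-to-transient block
    H_g = U[idx, g][:, None]                       # transient-to-goal column
    absorb = np.linalg.solve(np.eye(n - 1) - U_gbar, H_g)
    J = np.zeros((n, n))
    J[idx, g] = absorb[:, 0]                       # jump straight to the goal
    J[g, g] = 1.0                                  # boundary maps to itself w.p. 1
    return J
```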

3.5 Feasibility Functions

Recall that in Section 3.3 we defined a new passive dynamics $p_{\textsl{g}xa}(\textsl{g}^{\prime},x^{\prime},a^{\prime}|\textsl{g},x,a)$ by composing $p_{xa}$ with $\bar{h}$. Likewise, now that we have defined the jump operator, we can define a new representation called a feasibility function, $\kappa$, which composes the jump operator with the ungrounding function:

$$\kappa(\textsl{g}^{\prime},x^{\prime},a^{\prime}|x,a,\pi_{\textsl{g}})=\bar{h}(\textsl{g}^{\prime}|x^{\prime},a^{\prime})J(x^{\prime},a^{\prime}|x,a,\pi_{\textsl{g}}), \qquad (7)$$
$$T_{\kappa}(\sigma^{\prime},\textsl{g}^{\prime},x^{\prime},a^{\prime}|\sigma,x,a,\pi_{\textsl{g}})=T_{\sigma}(\sigma^{\prime}|\sigma,\textsl{g}^{\prime})\kappa(\textsl{g}^{\prime},x^{\prime},a^{\prime}|x,a,\pi_{\textsl{g}}). \qquad (8)$$

The feasibility function $\kappa:(\mathcal{X}\times\mathcal{A}\times\Pi)\times(\mathcal{G}\times\mathcal{X}\times\mathcal{A})\rightarrow[0,1]$, originally conceived in [12] in a non-stationary context, is a function in which the transition of the action variable (goal) of the higher-level space is parameterized in terms of the policy on the lower-level space. In (8), we use the feasibility function with the high-level operator $T_{\sigma}$ to create coupled dynamics, $T_{\kappa}$, summarizing the initial-to-final dynamics over both spaces.
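Concretely, (7) and (8) are products of an ungrounding indicator, the jump operator, and the task dynamics. A sketch under the representations introduced above (state-actions flattened to integer indices, goals integer-indexed, `J_pig` the jump matrix for the policy indexed by $\pi_{\textsl{g}}$; all names are our own) is:

```python
def feasibility(h_bar, J_pig, g_next, xa_next, xa):
    """kappa(g', x', a' | x, a, pi_g) = h_bar(g'|x', a') * J(x', a'|x, a, pi_g)."""
    ungrounds = 1.0 if h_bar.get(xa_next) == g_next else 0.0
    return ungrounds * J_pig[xa, xa_next]

def coupled_dynamics(task_transition, h_bar, J_pig,
                     sigma_next, g_next, xa_next, sigma, xa):
    """T_kappa = T_sigma(sigma'|sigma, g') * kappa(g', x', a'|x, a, pi_g)."""
    matches = 1.0 if task_transition(sigma, g_next) == sigma_next else 0.0
    return matches * feasibility(h_bar, J_pig, g_next, xa_next, xa)
```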

3.6 Task-LMDP

With the definition of the new joint dynamics, $T_{\kappa}$, we are in a position to repurpose the saLMDP to optimize a task policy over the joint space, stitching together policies from the ensemble to complete the task; we call this algorithm the task-LMDP (tLMDP).

Definition 3.3 (task-LMDP).

A task-LMDP is defined by the tuple $\mathscr{L}=(\mathscr{M}^{N},\mathscr{T},h)$, where $\mathscr{M}^{N}$ is a set of first-exit saLMDP control problems of size $N$, the number of states in $\mathcal{X}$ to be controlled to, $\mathscr{T}$ is an OG-task, and $h:\mathcal{G}\rightarrow\mathcal{X}\times\mathcal{A}$ is the grounding function.

From these three components, we can derive a desirability ensemble $\mathcal{Z}_{N}=\{z_{1},z_{2},...\}$, policy ensemble $\mathcal{U}_{N}=\{u_{1},u_{2},...\}$, jump operator $J$, ungrounding function $\bar{h}$, and feasibility function $\kappa$, in order to construct the coupled dynamics $T_{\kappa}$ in (8). $N$ can either be $N_{All}$, for a complete ensemble of all non-obstacle states, or $N_{GS}$, the number of goal states defined by the grounding function. The choice depends on whether or not the solution is intended for task transfer, where policies in the complete ensemble can be remapped to different problems, as we will soon explain. By defining a uniform passive and controlled dynamics distribution $p_{\pi_{\textsl{g}}},u_{\pi_{\textsl{g}}}$, we arrive at the tLMDP objective:

$$v(s,\pi_{\textsl{g}})=\min_{u_{\pi_{\textsl{g}}}}\Bigg[q(s,\pi_{\textsl{g}})+\mathop{\mathbb{E}}_{\pi_{\textsl{g}}\sim u_{\pi_{\textsl{g}}}}\Bigg[\log\Bigg(\frac{u_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}^{\prime}|s^{\prime},s,\pi_{\textsl{g}})}{p_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}^{\prime}|s^{\prime},s,\pi_{\textsl{g}})}\Bigg)\Bigg]+\mathop{\mathbb{E}}_{\substack{\pi_{\textsl{g}}^{\prime}\sim u_{\pi_{\textsl{g}}}\\ s^{\prime}\sim T_{\kappa}}}[v(s^{\prime},\pi_{\textsl{g}}^{\prime})]\Bigg], \qquad (9)$$

where $s=(\sigma,\textsl{g},x,a)$. This objective function is similar to the saLMDP objective; however, the cost function $q(s,\pi_{\textsl{g}})$ is an additive mixture of functions $q_{\sigma\textsl{g}}(\sigma,\pi_{\textsl{g}})+q_{\sigma}(\sigma)+q_{\pi}(x,a,\pi_{\textsl{g}})$. Just as in Section 3.2, $q_{\sigma\textsl{g}}$ and $q_{\sigma}$ encode the ordering constraints and a constant state cost, while $q_{\pi}(x,a,\pi_{\textsl{g}})=v_{\pi_{\textsl{g}}}(x,a)$ encodes the cost of following the policy from $(x,a)$ until the ungrounding of g, which is the value function of the policy. This is significant because jump operators combined with a value-function cost form compact representations which preserve low-level trajectory information. We motivate this approach with a proof that, in the case of deterministic transition operators and policies, value functions based on the full operator are equivalent to value functions based on the jump operator when the jump-operator cost function equals the value function of the low-level policy (Appx. 5.9). The optimal desirability function and policy for this coupled-state system are:

$$\mathbf{z}_{GS}=Q_{\sigma\textsl{g}}Q_{\sigma}Q_{\pi}P_{\kappa}\mathbf{z}_{GS}, \qquad (10)$$
$$u_{\pi_{\textsl{g}}}^{*}(\pi_{\textsl{g}}^{\prime}|s^{\prime},s,\pi_{\textsl{g}})=\frac{p_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}^{\prime}|s^{\prime},s,\pi_{\textsl{g}})z_{GS}(s^{\prime},\pi_{\textsl{g}}^{\prime})}{G[z](s^{\prime},s,\pi_{\textsl{g}})}, \qquad (11)$$

where the $Q$-matrices are diagonal matrices of each negatively exponentiated cost function, and $P_{\kappa}$ is the matrix for the joint state-policy passive dynamics $p_{\kappa}(s^{\prime},\pi^{\prime}|s,\pi):=T_{\kappa}(s^{\prime}|s,\pi)p_{\pi_{\textsl{g}}}(\pi^{\prime}|s^{\prime},s,\pi)$, analogous to $P_{xa}$ in (4). In (10), all matrices have the indexing scheme $(\sigma,\textsl{g},x_{g},a_{g},\pi_{\textsl{g}})$. However, the grounding function is one-to-one, so there are only $N_{\textsl{g}}$ $(\textsl{g},x,a)$ tuples, and there are $N_{\textsl{g}}$ policies $\pi_{\textsl{g}}$ which advance $\sigma$ by ungrounding from these state-action pairs; we call this set the Grounded Subspace (GS). Hence, the transition matrix $P_{\kappa}$ can be represented as a sparse matrix of size $|\Sigma|N_{\textsl{g}}^{2}\times|\Sigma|N_{\textsl{g}}^{2}$. We can compute the first-exit problem restricted to this subspace, then compute the desirability for the agent's best initial state-policy pair $(s,\pi_{\textsl{g}})_{0}$, which lies outside the GS (for all possible $\pi_{\textsl{g}}$ from the state). This is the desirability to enter the grounded subspace (and then complete the task) at a given grounded state-action under a policy from the initial $(x,a,\pi_{\textsl{g}})_{0}$. Henceforth, we call this the Desirability-To-Enter (DTE): $\mathbf{z}_{DTE}=\bar{Q}_{\sigma\textsl{g}}\bar{Q}_{\sigma}\bar{Q}_{\pi}\bar{P}_{\kappa}\mathbf{z}_{GS}$, which is a single matrix-vector multiplication where $\mathbf{z}_{GS}$ is the result of (10) computed for the grounded subspace (see Fig. 3 for a visualization). $\bar{P}_{\kappa}$ is a wide matrix of size $N_{\textsl{g}}\times|\Sigma|N_{\textsl{g}}^{2}$ whose rows correspond to the agent's current state $s$ paired with members of the set of GS-policies $\{\pi_{\textsl{g}}\}_{GS}$, and the $\bar{Q}$-matrices are $N_{\textsl{g}}\times N_{\textsl{g}}$ diagonal cost matrices specific to the starting state. By computing $\mathbf{z}_{GS}$ separately from $\mathbf{z}_{DTE}$ we effectively break the problem into two parts: a one-step finite-horizon problem for interior states (where the "step" is a single policy call, not one discrete time-step) and a first-exit problem for the GS. This enables the agent to be anywhere in the state-space while requiring only one additional linear transformation to act optimally.
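Schematically, the grounded-subspace solve (10) and the desirability-to-enter reduce to one eigenvector problem and one matrix-vector product. The sketch below assumes the diagonal cost matrices and the passive matrices have already been assembled per the definitions above; it is an illustration on our part, not the authors' implementation.

```python
import numpy as np

def solve_grounded_subspace(Q_sg, Q_s, Q_pi, P_kappa, iters=10_000, tol=1e-12):
    """z_GS = Q_sg Q_s Q_pi P_kappa z_GS (Eq. 10), via power iteration."""
    M = Q_sg @ Q_s @ Q_pi @ P_kappa
    z = np.ones(M.shape[0])
    for _ in range(iters):
        z_new = M @ z
        z_new /= np.max(z_new)            # eigenvector defined up to scale
        if np.max(np.abs(z_new - z)) < tol:
            break
        z = z_new
    return z

def desirability_to_enter(Qb_sg, Qb_s, Qb_pi, Pb_kappa, z_GS):
    """z_DTE = Qbar_sg Qbar_s Qbar_pi Pbar_kappa z_GS: one matrix-vector
    product giving the desirability of entering the GS from the current state."""
    return Qb_sg @ Qb_s @ Qb_pi @ (Pb_kappa @ z_GS)
```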


Figure 3: First-exit/Finite-horizon Decomposition (Grounded Subspace). A tLMDP policy $u_{\pi_{\textsl{g}}}$ generates coupled trajectories for a task. Green arrows on the base state-space depict the jump trajectories, whereas green arrows on the top cube are the corresponding dynamics induced in task space. The green line on $\mathcal{X}$ shows the agent's trajectory when following the sampled policies until ungrounding. Dashed green arrows indicate that the policy was sampled using the DTE, which was computed from outside the grounded subspace. $\mathcal{XA}\Pi_{GS}:=\mathcal{XA}_{GS}\times\Pi_{GS}$ is shorthand for the set of indices $(x,a,\pi_{\textsl{g}})$ which are restricted to the grounded subspace, where $\mathcal{XA}_{GS}:=\{(x,a)\leftarrow h(\textsl{g})\ |\ \forall\textsl{g}\in\mathcal{G}\}$, $\Pi_{GS}:=\{\pi_{\textsl{g}}\ |\ \forall\textsl{g}\in\mathcal{G}\}$, and $\mathcal{XA}\Pi:=\mathcal{X}\times\mathcal{A}\times\Pi$ is the full product space. $P_{\kappa_{All}}$ is the passive dynamics matrix over the complete ensemble $\mathcal{U}_{All}$, and $Q_{All}$ is a cost matrix on the same space. By computing the solution in the grounded subspace, we can compute the largest-eigenvector solution on the grounded submatrix without needing to form the full matrices. Because the grounded subspace (row space in the complete matrices) has zero entries outside of the grounded state-action-policies, an agent that starts at $(x,a)_{g}$ and calls policy $\pi_{\textsl{g}}$ will remain within $\mathcal{XA}\Pi_{GS}$. When the agent starts outside the GS at state $s_{0}$, it chooses a policy in $\Pi_{GS}$, which will only communicate with the grounded submatrix.

Note that the ordering constraints in $Q_{\sigma\textsl{g}}$ act on the row space of $P_{\kappa}$, where $\kappa$ determines the structure of $P_{\kappa}$ using global knowledge of sub-goal connectivity. Also, since the cost function $q_{\pi}$ equals the value function, i.e. $q_{\pi}(\pi_{\textsl{g}},x,a)=v_{\pi_{\textsl{g}}}(x,a)$, the cost matrix $Q_{\pi}$ has diagonal entries which are the desirability functions of each sub-problem concatenated into a vector: $Q_{\pi}=diag(\exp(-\mathbf{q}_{\pi}))=diag(\exp(-\mathbf{v}_{\pi}))=diag(\mathbf{z}_{\pi})$ (see Appx. 5.1 for an architecture visualization). Once $u_{\pi_{\textsl{g}}}^{*}$ has been computed, the agent follows the policy indexed by each sampled $\pi_{\textsl{g}}$, i.e. $\pi_{\textsl{g}}\rightarrow u_{xa_{g}}^{*}(x^{\prime},a^{\prime}|x,a)$, until the goal state-action has been reached, followed by a newly sampled $\pi_{\textsl{g}}$. Fig. 5.A shows an example of a coupled state trajectory and jump dynamics drawn from the optimal policy, which satisfies the task of getting an axe before chopping wood, and collecting water. Because the tLMDP solves order-constrained multi-goal tasks, it is structurally analogous to (precedence-constrained) Traveling Salesman Problems [30], given a final goal that grounds to the agent's initial state. Fig. 5.A shows a problem with nine sub-goals with precedence relations on sub-goal colors as types, $\psi$.
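At execution time the agent alternates between the two levels exactly as described above; a hedged sketch of this loop (goals integer-indexed, with hypothetical sampling helpers for $u_{\pi_{\textsl{g}}}^{*}$ and the low-level ensemble) is:

```python
def run_task(sample_task_policy, low_level_policies, unground, x, a, sigma, sigma_f):
    """Sample pi_g from the tLMDP policy, follow the indexed low-level policy
    until its goal state-action ungrounds, update sigma, and repeat."""
    g = None
    while sigma != sigma_f:
        pi_g = sample_task_policy(sigma, g, x, a)     # high-level policy call
        step_low_level = low_level_policies[pi_g]     # pi_g -> u*_{xa, g}
        while True:
            x, a = step_low_level(x, a)               # one base-space transition
            g = unground(x, a)
            if g is not None:                         # sub-goal completed
                sigma = tuple(1 if i == g else bit for i, bit in enumerate(sigma))
                break
    return sigma
```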

The tLMDP solution (10) has three key properties that make it particularly flexible in the context of efficient task transfer:

  1. P1

    Policy Remappability: Given a complete policy-desirability ensemble and jump operator, under a new grounding function we only need to re-index the policies (where $\pi_{\textsl{g}}$ references different low-level elements of $\mathcal{U}_{All}$ and $\mathcal{Z}_{All}$) to derive new matrices for equation (10). Consequently, for $N_{\textsl{g}}$ goals there is a space of ${N_{x}\choose N_{\textsl{g}}}2^{N_{\textsl{g}}^{2}}$ possible tasks one could formulate (for $N_{\textsl{g}}$ goals grounded onto $N_{x}$ possible states, along with a set of ordering constraints) which would not require recomputation of $\mathcal{U}_{All}$, $\mathcal{Z}_{All}$, or $J_{All}$; a minimal remapping sketch follows this list.

  2. P2

    Grounding Invariance: Since the optimal desirability function is a largest eigenvector, there is an invariance of the solution to scalar multiples of the diagonal matrices, assuming $P_{\kappa}$ remains constant under the new grounding.

  3. P3

    Minimal Solution Distance: The agent can compute the optimal desirability (the desirability-to-enter, DTE) for any given initial state, $s_{0}$, outside the GS with one matrix-vector product. Importantly, this also holds true after the zero-shot regrounding of a previous grounded-subspace solution.
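A minimal remapping sketch for P1, referenced in the list above (hypothetical containers; the complete ensemble is assumed to be indexed by grounded state-action):

```python
def remap_ensemble(h_new, Z_all, U_all, J_all):
    """Under a new grounding h_new: goal -> (x, a), re-index the complete
    ensemble so that each pi_g points at the desirability function, policy,
    and jump operator of its newly grounded state-action; nothing is recomputed."""
    Z_GS = {g: Z_all[xa] for g, xa in h_new.items()}
    U_GS = {g: U_all[xa] for g, xa in h_new.items()}
    J_GS = {g: J_all[xa] for g, xa in h_new.items()}
    return Z_GS, U_GS, J_GS
```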


Figure 4: Policy Remapping: A simple $3\times 3$ state-space with a complete policy ensemble and two different grounding functions that map $\pi_{\textsl{g}}$ variables to ensemble policies and desirability functions.

These properties all allow for the fast transfer of task solutions within the same environment, or across different environments with distinct obstacle configurations and transition operators. The following sections will elaborate on the empirical and theoretical implications of these properties.


Figure 5: A: An order-constrained Traveling Salesman Problem with ordering constraints defined over sub-goal colors. B: Computation time when fixing the number of sub-goals and increasing the size of $\mathcal{X}$. C: Computation time when fixing the state-space sizes while varying the number of sub-goals. D: Cumulative computation time for solving N re-grounded tasks when fixing the size of $\mathcal{X}$ to $20\times 20$ states and the sub-goal count to nine with two ordering constraints. The tLMDP is efficient by exploiting representational reuse of a full policy ensemble and the jump operator. E: A close-up of the cumulative time of $\mathscr{L}_{GS}$ compared to the zero-shot $\mathscr{L}_{tGIE}$ solution, which requires no additional computation. In B-C, only the GS base-policies were computed, whereas D-E use complete ensembles and the jump operator. Sub-goal locations were randomly sampled, and in B-C we plotted the mean time over 15 episodes. (2.6 GHz 6-Core Intel Core i7, 32 GB 2400 MHz DDR4)

3.6.1 Results

In Fig. 5.B-D we empirically show the computational advantage of computing tLMDP policies in the GS ($\mathscr{L}_{GS}$), relative to computing policies with value iteration on the full product space, denoted as $\mathscr{V}_{Full}$. In Fig. 5.B, we hold the number of sub-goals constant at $N_{\textsl{g}}=6,8,10$, and vary the number of states in the grid-world. We can see the clear advantage $\mathscr{L}_{GS}$ gains from separating the low-level GS-ensemble and jump-operator computation from the task-policy optimization. The plots show that by working in the lower-dimensional GS, $\mathscr{L}_{GS}$ can optimize over much larger base state-spaces with a large number of sub-goals. $\mathscr{V}_{Full}$, however, optimizes the task policy over the full product space, hindering scalability. In Fig. 5.C, we fix the number of grid-world states at $15\times 15$, $30\times 30$, and $60\times 60$, and only vary the number of sub-goals: we expect an exponential increase in run-time for both $\mathscr{L}_{GS}$ and $\mathscr{V}_{Full}$, given that the complexity is dominated by $|\Sigma|=2^{N_{\textsl{g}}}$. However, since $\mathscr{L}_{GS}$ works in the GS it only needs to compute $N_{\textsl{g}}$ base-policies and $J_{GS}$, whereas $\mathscr{V}_{Full}$ is again burdened by the extra dimensionality of the non-GS internal states. The $\mathscr{L}_{GS}$ plots for all grid-world sizes remain relatively similar, due to the fact that the GS dimensionality is not a function of internal state count, whereas $\mathscr{V}_{Full}$ sees a progressively larger leftward shift of the curve for each state-space size.

We demonstrate the flexibility granted by $\mathbf{P1}$ in Fig. 5.D, which plots the total time for computing $N_{\mathscr{L}}$ different tasks with $N_{\textsl{g}}=9$ sub-goals and $|\mathcal{O}_{\textsl{g}}|=2$ grounded in the same $20\times 20$ environment. $\mathscr{V}_{Full}$ does not have a policy ensemble at its disposal, so it must solve new tasks from scratch, resulting in a worst-case constant-time increase per task proportional to $(2^{N_{\textsl{g}}}|\mathcal{X}|)^{3}|\mathcal{A}|$ [5]. Alternatively, $\mathscr{L}_{GS}$ can remap the goal variables to reusable ensemble policies to form a new GS; therefore it does not require the computation of new $\mathcal{U}_{All}$, $\mathcal{Z}_{All}$, or $J_{All}$ (which contribute to a one-time upfront cost, manifesting in a higher intercept), and only requires recomputation of (10). This results in a constant time-per-task increase proportional to $2^{N_{\textsl{g}}}N_{\textsl{g}}^{4}$ (Appx. 5.7). For a modest number of sub-goals this solution can be recomputed very quickly. The green line, $\mathscr{L}_{GS}$, shows a negligible growth in time, where each solution to the regrounded nine-subgoal task was computed in approximately $2\times 10^{-3}$ s. The speed with which this remapping and task reoptimization is possible could have important applications for real-time online systems which can learn and rapidly adapt to changes in the structure of the task.

3.7 Compositionality and Transfer

A key advantage of computing the solution in the GS pertains to Grounding Invariance (property $\mathbf{P2}$), conferred by Policy Remappability (property $\mathbf{P1}$). Consider a solution for the tLMDP problem $\mathscr{L}_{1}$ under grounding $h_{1}$. $\mathbf{P1}$ means that under a new grounding function $h_{2}$ for $\mathscr{L}_{2}$, the GS-index $\pi_{\textsl{g}}$ now references a different policy and desirability vector, and also conditions different dynamics in $J_{All}$. Since the matrices $Q_{\pi}$ and $P_{\kappa}$ are the only two matrices whose elements are functions of $h$, if these matrices are exactly the same under $h_{1}$ and $h_{2}$, then (10) is the same, holding the task matrices $Q_{\sigma\textsl{g}}$ and $Q_{\sigma}$ constant. Furthermore, (10) being a largest-eigenvector solution implies that if $Q_{\pi,h_{2}}$ is a scalar multiple of $Q_{\pi,h_{1}}$, it has the same solution. This means that the solution for $\mathscr{L}_{1}$ will preserve cost-minimization for the new grounding $h_{2}$, which can change distances between sub-goals. This invariance is called a task and cost preserving Grounding Invariant Equation (tc-GIE), as in (12):

$$\text{tc-GIE:}\ \mathbf{z}_{GS}=\gamma Q_{\sigma\textsl{g}}Q_{\sigma}Q_{\pi}P_{\kappa}\mathbf{z}_{GS}\ \ :\ \ \exists\gamma\in\mathbb{R}^{+},\ Q_{\pi,h_{1}}=\gamma Q_{\pi,h_{2}},\ P_{\kappa_{1}}=P_{\kappa_{2}}, \qquad (12)$$
$$\text{t-GIE:}\ \mathbf{z}_{GS}=Q_{\sigma\textsl{g}}Q_{\sigma}P_{\kappa}\mathbf{z}_{GS}\ \ :\ \ Q_{\pi,h_{1}},Q_{\pi,h_{2}}\leftarrow I,\ P_{\kappa_{1}}=P_{\kappa_{2}}. \qquad (13)$$

If, however, $Q_{\pi,h_{2}}$ is not a scalar multiple of $Q_{\pi,h_{1}}$, it is still possible for the two problems to share the same solution if $Q_{\pi}$ is set to the identity matrix for both problems (policy cost equal to zero). This means that the task will be solved; however, the policy will not minimize the low-level trajectory costs on $\mathcal{X}$. This is a relaxed version of grounding invariance called the task preserving Grounding Invariant Equation (t-GIE), (13). Fig. 5.E shows the relationship between $\mathscr{L}_{GS}$ (from 5.D) and $\mathscr{L}_{tGIE}$, a t-GIE problem with the same task specification. Whereas $\mathscr{L}_{GS}$ sees a slight additional time increase for optimizing (10) for every new grounding, $\mathscr{L}_{tGIE}$ solves the task with no further optimization, resulting in zero-shot transfer. Furthermore, regardless of whether two problem formulations are related by (12) or (13), the ordering violation costs in $Q_{\sigma\textsl{g}}$ are hard constraints (infinite cost), so as long as $P_{\kappa}$ and $Q_{\sigma\textsl{g}}$ remain constant between $\mathscr{L}_{1}$ and $\mathscr{L}_{2}$, a policy $u^{*}_{\pi,1}$ for $\mathscr{L}_{1}$ can be used to solve $\mathscr{L}_{2}$; however, the sub-goal dynamics will retain the biases of $\mathscr{L}_{1}$'s low-level policy costs, dictated by the obstacle configuration.
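The tc-GIE and t-GIE conditions in (12)-(13) can be checked directly from the grounded-subspace matrices of two problems; a sketch (dense NumPy arrays assumed, helper name ours) is:

```python
import numpy as np

def gie_relation(Q_pi_1, Q_pi_2, P_k_1, P_k_2, tol=1e-9):
    """Return 'tc-GIE' if the Q_pi matrices are scalar multiples and the
    P_kappa matrices match, 't-GIE' if only P_kappa matches, else None."""
    if not np.allclose(P_k_1, P_k_2, atol=tol):
        return None                       # different sub-goal connectivity
    d1, d2 = np.diag(Q_pi_1), np.diag(Q_pi_2)
    mask = d2 > tol
    ratio = d1[mask] / d2[mask]
    if ratio.size and np.allclose(ratio, ratio[0], atol=tol):
        return "tc-GIE"                   # same solution, trajectory-cost minimizing
    return "t-GIE"                        # shared solution only with Q_pi <- I
```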


Figure 6: A cliff example, in which the agent can descend but is not able to ascend the cliff. Four different tasks ($T_{1}$-$T_{4}$) are defined on four cliff environments ($E_{1}$-$E_{4}$) that are related by two possible GIEs. $S$ is a shortest-path matrix. If $S_{1}$ is the same as $S_{2}$ up to an additive constant and they share the same feasibility matrix $K$, Task 2 will have the same solution as Task 1, but mapped to different underlying policies (tc-GIE); this solution will minimize trajectory length on $\mathcal{X}$. If no such additive constant exists but the tasks share the same $K$-matrix, the tasks can only share an optimal solution where trajectory length is not minimized (t-GIE) ($Q_{\pi}$ is set to $I$). Task 4 does not share a $K$-matrix with the other tasks and therefore does not share a GIE solution.

Figure 6 demonstrates both GIE equations for an environment with a cliff that the agent can descend, but not ascend, resulting in different $P_{\kappa}$-matrices depending on the grounding. Since $P_{\kappa}$ is a function of $h$ indirectly through $\kappa$, differences can only arise in $P_{\kappa}$ under different sub-goal connectivity structures. These connectivity structures are stored in the matrix $K(i,g)=\kappa((x,a)_{g},\textsl{g}|(x,a)_{i},\pi_{\textsl{g}})$, which represents the global connectivity of the sub-goals. Furthermore, recall that $Q_{\pi}=diag(\mathbf{z}_{\pi})$, where $\mathbf{z}_{\pi}$ is a concatenated state-action vector of all low-level desirability functions in $\mathcal{Z}_{GS}$. Because the desirability function is defined as $z(x)=e^{-v(x)}$, constant multiples of $z$ translate into constant additives of $v(x)$ (i.e. $\gamma z(x)=e^{-(v(x)+\alpha)}$ where $\gamma=e^{-\alpha}$). Since we are working with shortest-path policies, to better illustrate the tc-GIE relationship in Figure 6 we construct a matrix $S$ of shortest-path distances, derived from the value function divided by the internal state cost, i.e. $S(i,g)=\frac{v_{g}((x,a)_{i})}{c}$, where $c$ is the constant internal state-action cost used to define $q(x,a)$, and $v_{g}=-\log(z_{g})$ is the value function corresponding to the low-level policy with boundary state $(x,a)_{g}$. The new grounding functions preserve the solution under an additive constant $\alpha$ to inter-subgoal distances, which translates into a multiplicative constant $e^{-a}e^{1/c}$ between the desirability functions used to define $Q_{\pi}$. Therefore, even between environments with different obstacle constraints, we can compute that two tasks in the grounded subspace are exactly the same because they share the same underlying objective function. The t-GIE is also depicted in Figure 6: task $T_{3}$ in $E_{3}$ does not share a policy cost matrix with a multiplicative constant with $T_{1},T_{2}$; however, the feasibility structure remains the same, and we can therefore set $Q_{\pi}\leftarrow I$ for both problems to obtain the same objective function.

4 Discussion

JODP employs specific representations and architectural motifs which contribute to the flexibility and scalability of the framework. The construction of the policy ensemble and jump operator is computationally decoupled from the global tLMDP task-optimization. This decoupling is necessary for scalability: both objects can be parallelized and the jump operator is required to work in the grounded-subspace, which is critical for the method to scale to large state-space sizes. Crucially, because jump operators are paired with cost functions derived from the low-level value function, the method effectively performs multi-step tasks while integrating all of the low-level cost information for each successive jump, which is a new abstraction technique for hierarchical control.

While the focus of this paper was on exact dynamic programming methods, the motifs have important implications for learning frameworks. As we have shown, the task can be a structured object which can be traversed by a first-exit policy, where task rules correspond to a cost function which shapes permissible dynamics on this space: the reward is not the task. While it is certainly true that the value function is interpreted as an accumulated reward or cost, first-exit value functions are endowed with additional explicit meaning: the accumulated cost until hitting the boundary state. This explicitness is necessary for the property of remappability, which allows low-level policies to generalize to new tasks by mapping the boundary state-actions to high-level actions. This also suggests that many RL algorithms, insofar as they are used for Artificial Intelligence, do not possess the requisite constraints on the form of the policies which make them composable and remappable to modular and transferable tasks, and future learning methods could benefit from utilizing shortest-path policies as low-level representations along with reusable task-spaces.

In conclusion, JODP provides key new representations for hierarchical control along with a theory of grounding-invariant zero-shot transfer, which has major implications for computationally efficient and interpretable artificial intelligence.

Acknowledgments

Thank you to Keno Juchems and Andrew Saxe for valuable discussions and feedback.

References

  • [1] R. Bellman, “A Markovian decision process,” Journal of mathematics and mechanics, pp. 679–684, 1957.
  • [2] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [3] T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in Advances in neural information processing systems, 1994, pp. 703–710.
  • [4] J. N. Tsitsiklis, “Asynchronous stochastic approximation and Q-learning,” Machine learning, vol. 16, no. 3, pp. 185–202, 1994.

5 Appendix

5.1 Function Dependency Graph


Figure 7: Function Dependency Graph: This graph shows the dependency structure of the different objects in this paper. Arrows denote that one object was either derived from another or takes another object as an argument. Not every relationship is plotted; only the essential structure needed to understand the architecture is shown.

5.2 saLMDP Derivation

We begin by creating a state vector y=(x,a) and substituting this state into the equation for an LMDP:

v^{*}(y)=v^{*}(x,a)=\min_{u_{xa}}\Big[q(x,a)+KL\big(u_{xa}(x',a'|x,a)\,\|\,p_{xa}(x',a'|x,a)\big)+\mathbb{E}_{x',a'\sim u_{xa}}[v^{*}(x',a')]\Big] \qquad (14)

We now decompose the joint distributions using the product rule and impose a dynamics constraint. We would like the low-level dynamics of the saLMDP, p_{x}(x'|x,a), to behave like an uncontrolled physical system, with control over the state space achieved only by manipulating the action distribution u_{a}. To enforce this condition we introduce the constraint u_{x}=p_{x} and minimize over u_{a}.

=\min_{u_a,\,u_x}\Bigg[q(x,a)+\mathbb{E}_{\substack{a'\sim u_a\\ x'\sim u_x}}\Bigg[\log\Bigg(\frac{u_a(a'|x';x,a)\,u_x(x'|x,a)}{p_a(a'|x';x,a)\,p_x(x'|x,a)}\Bigg)\Bigg]+\mathbb{E}_{\substack{a'\sim u_a\\ x'\sim u_x}}[v(x',a')]\Bigg]\quad\text{s.t. } u_x=p_x \qquad (15)

=\min_{u_a}\Bigg[q(x,a)+\mathbb{E}_{a'\sim u_a}\Bigg[\log\Bigg(\frac{u_a(a'|x';x,a)}{p_a(a'|x';x,a)}\Bigg)\Bigg]+\mathbb{E}_{\substack{a'\sim u_a\\ x'\sim u_x=p_x}}[v(x',a')]\Bigg] \qquad (16)

We consider (16) to be the canonical form of the saLMDP objective function.

After transforming the cost-to-go v(x,a) into a desirability via v(x,a)=-\log(z(x,a)), the derivation proceeds as follows:

=\min_{u_a}\Bigg[q(x,a)+\mathbb{E}_{a'\sim u_a}\Bigg[\log\Bigg(\frac{u_a(a'|x';x,a)}{p_a(a'|x';x,a)}\Bigg)\Bigg]+\mathbb{E}_{\substack{a'\sim u_a\\ x'\sim p_x}}[-\log(z(x',a'))]\Bigg] \qquad (17)

=\min_{u_a}\Bigg[q(x,a)+\mathbb{E}_{x'\sim p_x}\Bigg[\mathbb{E}_{a'\sim u_a}\Bigg[\log\Bigg(\frac{u_a(a'|x';x,a)}{p_a(a'|x';x,a)\,z(x',a')}\Bigg)\Bigg]\Bigg]\Bigg] \qquad (18)

=\min_{u_a}\Bigg[q(x,a)+\mathbb{E}_{x'\sim p_x}\Bigg[\mathbb{E}_{a'\sim u_a}\Bigg[\log\Bigg(\frac{u_a(a'|x';x,a)}{p_a(a'|x';x,a)\,z(x',a')\,\frac{G[z](x',x,a)}{G[z](x',x,a)}}\Bigg)\Bigg]\Bigg]\Bigg] \qquad (19)

=\min_{u_a}\Bigg[q(x,a)+\mathbb{E}_{x'\sim p_x}\Bigg[-\log\big(G[z](x',x,a)\big)+\mathbb{E}_{a'\sim u_a}\Bigg[\log\Bigg(\frac{u_a(a'|x';x,a)}{\frac{p_a(a'|x';x,a)\,z(x',a')}{G[z](x',x,a)}}\Bigg)\Bigg]\Bigg]\Bigg] \qquad (20)

where G[\cdot](\cdot) is a normalization term ensuring that the denominator in (20) sums to 1, as required for the KL divergence:

G[z](x',x,a)=\sum_{a'\in\mathcal{A}}p_a(a'|x';x,a)\,z(x',a') \qquad (21)

By setting the policy u_a^{*} equal to the denominator of (20), the third term goes to zero:

u_a^{*}(a'|x';x,a)=\frac{p_a(a'|x';x,a)\,z(x',a')}{G[z](x',x,a)} \qquad (22)

This leaves two terms that do not depend on the minimization argument u_a. The resulting equation is:

-\log(z(x,a))=q(x,a)-\mathbb{E}_{x'\sim p_x}\big[\log\big(G[z](x',x,a)\big)\big] \qquad (23)

z(x,a)=\exp(-q(x,a))\,\exp\Big(\mathbb{E}_{x'\sim p_x}\big[\log\big(G[z](x',x,a)\big)\big]\Big) \qquad (24)

The above equation can be written in matrix-vector notation with nested non-linearities, but it requires us to construct non-square matrices. This is because if \mathbf{z} is a state-action desirability vector, then G[z](x',x,a)=\mathbb{E}_{a'\sim p_a}[z(x',a')] requires summing over only the a' entries of the vector.

To construct the appropriate matrices, we encode the distributions p_x and p_a into two matrices, M and W respectively. M is an XA \times XAX' matrix (where X=|\mathcal{X}|, A=|\mathcal{A}|) encoding p_x(x'|x,a), where the additional XA in the column space are dummy variables. That is, rows are indexed by (x,a) and columns by (x,a,x'), so a column can have a non-zero entry only if its (x,a) pair matches that of its row. Similarly, W is an XAX' \times X'A' matrix encoding p_a(a'|x';x,a), which can have a non-zero entry only when the x' of the row matches the x' of the column. It is straightforward to check that multiplying M and W yields a matrix of the joint distribution, P_{xa}=MW. W takes the expectation over the a' entries of \mathbf{z}, followed by the point-wise \log nonlinearity; M then takes the expectation over the x' variables.
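As an illustration of this construction, the following is a minimal NumPy sketch (with hypothetical small dimensions and randomly generated distributions, not the paper's code) that builds M and W from dense arrays p_x[x,a,x'] and p_a[x,a,x',a'] and checks that the product MW recovers the joint passive chain:

import numpy as np

# Minimal sketch of the M and W construction; sizes and distributions are placeholders.
X, A = 3, 2
rng = np.random.default_rng(0)
p_x = rng.random((X, A, X)); p_x /= p_x.sum(-1, keepdims=True)      # p_x(x'|x,a)
p_a = rng.random((X, A, X, A)); p_a /= p_a.sum(-1, keepdims=True)   # p_a(a'|x';x,a)

M = np.zeros((X * A, X * A * X))   # rows: (x,a); columns: (x,a,x') with dummy (x,a)
W = np.zeros((X * A * X, X * A))   # rows: (x,a,x'); columns: (x',a')
for x in range(X):
    for a in range(A):
        row = x * A + a
        for x1 in range(X):
            col = row * X + x1                              # dummy (x,a) must match the row
            M[row, col] = p_x[x, a, x1]
            for a1 in range(A):
                W[col, x1 * A + a1] = p_a[x, a, x1, a1]     # x' of row matches x' of column

P_xa = M @ W                       # joint p(x',a'|x,a): rows (x,a), columns (x',a')
assert np.allclose(P_xa.sum(axis=1), 1.0)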

Written in matrix notation, we have:

\mathbf{z}=Q\exp(M\log(W\mathbf{z})) \qquad (25)

Since \exp(\cdot) and \log(\cdot) are element-wise functions, Jensen's inequality implies the following element-wise inequality:

\mathbf{z}_1=Q\exp(M\log(W\mathbf{z}))\preceq QMW\mathbf{z}=\mathbf{z}_2 \qquad (26)

However, Jenson’s inequality holds as equality when the expectation is over a deterministic distribution. Therefore if every row of the state dynamics matrix MM is a deterministic distribution, then we can exchange the order of the point-wise exponential with the matrix multiplication:

\mathbf{z}=Q\exp(M\log(W\mathbf{z})) \qquad (27)

\implies\ \mathbf{z}=QM\exp(\log(W\mathbf{z})) \qquad (28)

\implies\ \mathbf{z}=QMW\mathbf{z} \qquad (29)

\implies\ \mathbf{z}=QP_{xa}\mathbf{z} \qquad (30)

where P_{xa}=MW is the matrix of the joint distribution p(x',a'|x,a). The original non-linear mapping thus reduces to a linear equation, and \mathbf{z} can be solved for as the largest eigenvector of QP_{xa}. We can solve for \mathbf{z} by iterating either (25), in the stochastic case, or (30), in the deterministic case, until convergence, and then plug \mathbf{z} into (22) to obtain the optimal policy.
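A minimal sketch of this iterative solution for the stochastic case, reusing Q, M, and W from the construction above; it assumes a single zero-cost, absorbing boundary state-action indexed by g, whose desirability is pinned at 1 during the iteration:

import numpy as np

def salmdp_desirability(Q, M, W, g, tol=1e-10, max_iter=10_000):
    """Iterate the non-linear map (25), z <- Q exp(M log(W z)); a sketch, not the paper's code."""
    z = np.full(Q.shape[0], 1e-6)
    z[g] = 1.0
    for _ in range(max_iter):
        z_new = Q @ np.exp(M @ np.log(W @ z))
        z_new[g] = 1.0                          # pin the known boundary desirability
        if np.abs(z_new - z).sum() < tol:
            break
        z = z_new
    # Optimal policy (22): u_a*(a'|x'; x,a) is proportional to p_a(a'|x'; x,a) z(x',a')
    U = W * z[None, :]
    U /= U.sum(axis=1, keepdims=True)
    return z, U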

5.3 tLMDP Derivation

The tLMDP differs only in the transition operators and the number of cost functions. Following the same derivation as above, we arrive at:

-\log(z(s,\pi_{\textsl{g}}))=q(s,\pi_{\textsl{g}})-\mathbb{E}_{s'\sim T_{\kappa}}\big[\log\big(G[z](s',s,\pi_{\textsl{g}})\big)\big] \qquad (31)

z(s,\pi_{\textsl{g}})=\exp(-q_{\sigma\textsl{g}}(\sigma,\pi_{\textsl{g}}))\,\exp(-q_{\sigma}(\sigma))\,\exp(-q_{\pi_{\textsl{g}}}(\pi_{\textsl{g}},x,a))\,\exp\Big(\mathbb{E}_{s'\sim T_{\kappa}}\big[\log\big(G[z](s',s,\pi_{\textsl{g}})\big)\big]\Big) \qquad (32)

Following the previous methods for the saLMDP solution derivation, in matrix-vector notation and under deterministic T_{\kappa} this results in:

\mathbf{z}=Q_{\sigma\textsl{g}}Q_{\sigma}Q_{\pi}P_{\kappa}\mathbf{z} \qquad (33)
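Since (33) is linear, \mathbf{z} can alternatively be obtained directly as the eigenvector of the sparse product matrix with eigenvalue 1 (which, as with QP_{xa} above, is its largest), rather than by power iteration. A sketch using SciPy's sparse eigensolver, assuming the factor matrices have already been assembled as sparse matrices (the assembly itself is not shown and the function name is a placeholder):

import numpy as np
import scipy.sparse.linalg as spla

def tlmdp_desirability(Q_sg, Q_s, Q_pi, P_kappa):
    # z = Q_sg Q_s Q_pi P_kappa z, so z is the eigenvector with eigenvalue 1.
    A = (Q_sg @ Q_s @ Q_pi @ P_kappa).tocsc()
    vals, vecs = spla.eigs(A, k=1, which='LM')
    z = np.real(vecs[:, 0])
    return z / z.max()          # fix sign/scale so the boundary desirability equals 1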

5.4 Convergence of the Iterative Non-linear Map

Here we show that the iterative mapping for the stochastic case converges to a fixed point. The proof uses Banach's fixed point theorem, which requires us to show that (25) is a contraction mapping. To do so, we define the mapping as

\mathbf{z}=T[\mathbf{z}'] \qquad (34)

To demonstrate that T is a contraction mapping we show that, for any \mathbf{z}_1 and \mathbf{z}_2, d(T(\mathbf{z}_1),T(\mathbf{z}_2))\leq\beta\,d(\mathbf{z}_1,\mathbf{z}_2) for some \beta\in[0,1) and metric d(x,y). Let d(x,y)=\lVert x-y\rVert_{\infty}. Considering \mathbf{z} as the state-action desirability vector, we write T as:

T[\mathbf{z}']=Q\exp(M\log(W\mathbf{z}')) \qquad (35)

\begin{split}\lVert T[\mathbf{z}_1]-T[\mathbf{z}_2]\rVert_{\infty}&=\lVert Q\exp(M\log(W\mathbf{z}_1))-Q\exp(M\log(W\mathbf{z}_2))\rVert_{\infty}\\ &\leq\lVert QMW\mathbf{z}_1-QMW\mathbf{z}_2\rVert_{\infty},\ \text{Jensen's Ineq.}\\ &\leq\lVert\beta\mathbf{z}_1-\beta\mathbf{z}_2\rVert_{\infty},\ 0\leq\beta<1\\ &=\beta\lVert\mathbf{z}_1-\mathbf{z}_2\rVert_{\infty}\end{split} \qquad (36)

Thus, T[\mathbf{z}'] is a contraction mapping, which implies via Banach's fixed point theorem that the iterative algorithm converges globally to \mathbf{z}^{*}.

5.5 Finite-Horizon/First-Exit Decomposition (Grounded Subspace)

Because we are working with shortest-path first-exit control problems, all trajectories from any given \sigma_k \rightarrow \sigma_{k+1} must be shortest paths (where k is a lumped time-step called a period, corresponding to the window of time over which \sigma remains constant). This means that instead of working with the complete ensembles \mathcal{U}_{All}, \mathcal{Z}_{All}, and J_{All}, we only need to work with \mathcal{U}_{GS}, \mathcal{Z}_{GS}, and J_{GS}. This is because any policy u_i^{*} which controls to an internal state-action (x,a)_i, followed by a policy u_g^{*} that controls to (x,a)_g, can never have an accumulated cost-to-go v_i+v_g lower than the value function v_g of u_g^{*} alone, since shortest-path policies have the lowest cost-to-go. The consequence of the GS restriction is that an agent that controls from a state inside the GS will remain inside the GS (given that J_{GS} can only jump the agent to GS state-actions). Therefore, a first-exit tLMDP problem can be decomposed by separating the first period out of the Bellman equation and solving a first-exit (FE) problem starting at k=1, while solving the policy for period k=0 as a one-step finite-horizon (FH) problem which controls into the grounded subspace.

We define the Bellman equation in two parts, one for controlling within the grounded subspace (38), and one for controlling from outside the grounded subspace into the grounded subspace (37):

v_{k_0}^{*}(s_0,\pi_{\textsl{g}})=\min_{u_{\pi_{\textsl{g}}}}\Bigg[q(s_0,\pi_{\textsl{g}})+\mathbb{E}_{\pi_{\textsl{g}}'\sim u_{\pi_{\textsl{g}}}}\Bigg[\log\Bigg(\frac{u_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}'|s',s_0,\pi_{\textsl{g}})}{p_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}'|s',s_0,\pi_{\textsl{g}})}\Bigg)\Bigg]+\mathbb{E}_{\substack{\pi_{\textsl{g}}'\sim u_{\pi_{\textsl{g}}}\\ s'\sim T_{\kappa}}}[v_{FE}(s',\pi_{\textsl{g}}')]\Bigg] \qquad (37)

where:

v_{FE}^{*}(s,\pi_{\textsl{g}})=\min_{u_{\pi_{\textsl{g}}}}\Bigg[q(s,\pi_{\textsl{g}})+\mathbb{E}_{\pi_{\textsl{g}}'\sim u_{\pi_{\textsl{g}}}}\Bigg[\log\Bigg(\frac{u_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}'|s',s,\pi_{\textsl{g}})}{p_{\pi_{\textsl{g}}}(\pi_{\textsl{g}}'|s',s,\pi_{\textsl{g}})}\Bigg)\Bigg]+\mathbb{E}_{\substack{\pi_{\textsl{g}}'\sim u_{\pi_{\textsl{g}}}\\ s'\sim T_{\kappa}}}[v_{FE}(s',\pi_{\textsl{g}}')]\Bigg] \qquad (38)

where in (38), (s,\pi)\in\mathcal{XA}\Pi_{GS}, and T_{\kappa} and u_{\pi_{\textsl{g}}} in both equations are restricted to policies in the grounded subspace, \Pi_{GS}.

5.6 Invertibility of I-U_{\bar{g}}

(I-U_{\bar{g}})^{-1} is the limit of a Neumann series representing the total expected state-action occupancies under the transient state-action Markov chain dynamics. Normally, the concern would be that if the Markov chain U is reducible, then there is a communicating class of state-actions which does not communicate with the boundary state-action. This would result in infinite state-action visits, and thus (I-U_{\bar{g}}) would not be invertible.

However, since we are computing J=(I-U_{\bar{g}})^{-1}H_g, and H_g is a matrix of zeros except for the g^{th} column, we know that J_{ij}=0 for all i and all j\neq g. Therefore, we are only interested in solving the linear system (I-U_{\bar{g}})\mathbf{p}_{abs}=\mathbf{h}_g, where \mathbf{h}_g is the (known) g^{th} column of H_g, \mathbf{p}_{abs}(i) (unknown) is the probability of absorption into the boundary state when starting from state-action (x,a)_i under the Markov chain dynamics, and \mathbf{p}_{abs} is assigned to the g^{th} column of J, i.e. J_{:g}\leftarrow\mathbf{p}_{abs}. The solution \mathbf{p}_{abs} describes an agent that, starting at state-action (x,a)_i, either arrives at state-action (x,a)_g with some positive probability or does not arrive at all, so \mathbf{p}_{abs} is guaranteed to take values between 0 and 1, inclusive.
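A sketch of this sparse linear solve with SciPy, under the additional assumption (of this sketch only) that the full state-action chain decomposes as U = U_{\bar{g}} + H_g, with H_g holding only the g-th column:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def absorption_probabilities(U, g):
    """Solve (I - U_gbar) p_abs = h_g for the g-th column of J (a sketch)."""
    U = sp.csc_matrix(U)
    h_g = np.asarray(U[:, [g]].todense()).ravel()   # one-step probabilities into the boundary
    U_gbar = U.tolil(copy=True)
    U_gbar[:, g] = 0.0                              # remove transitions into the boundary column
    n = U.shape[0]
    p_abs = spla.spsolve(sp.identity(n, format='csc') - U_gbar.tocsc(), h_g)
    return p_abs                                    # entries lie in [0, 1]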

5.7 Complexity

The run-time complexity of the tLMDP solution (10) depends on the specialized structure of the task. Because the ordering constraints preclude the agent from needing to "undo" sub-goals encoded in \sigma, the agent is never more than N_{\textsl{g}} policy-calls away from arriving at \sigma_f=[1,1,...], since it takes at most N_{\textsl{g}} bit flips to solve the task. This means that we can initialize \mathbf{z}_0=\mathbf{b}_{\textsl{g}}, where \mathbf{b}_{\textsl{g}} is a vector of zeros except for the boundary state-actions for \sigma_f, which are set to the known boundary value of 1. Q_{\sigma\textsl{g}}Q_{\sigma}Q_{\pi}P_{\kappa} is a sparse matrix of size 2^{N_{\textsl{g}}}N_{\textsl{g}}^2\times 2^{N_{\textsl{g}}}N_{\textsl{g}}^2, with only 2^{N_{\textsl{g}}}N_{\textsl{g}}^3 non-zero entries, and power iteration starting from this initialization converges within N_{\textsl{g}} iterations. Therefore, we have N_{\textsl{g}} sparse matrix-vector multiplications, resulting in a total of 2^{N_{\textsl{g}}}N_{\textsl{g}}^4 floating-point operations.
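As a quick arithmetic illustration of these counts (a hypothetical example, not from the paper), for N_{\textsl{g}} = 10 subgoals:

# Matrix dimension, non-zero count, and total flops for N_g = 10 subgoals.
N_g = 10
dim   = 2**N_g * N_g**2   # 102,400 rows/columns of Q_sg Q_s Q_pi P_kappa
nnz   = 2**N_g * N_g**3   # 1,024,000 non-zero entries
flops = 2**N_g * N_g**4   # 10,240,000 floating-point operations over N_g iterations
print(dim, nnz, flops)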

5.8 Algorithm Pseudocode

input : saLMDP \mathscr{M}=(\mathcal{X},\mathcal{A},p_x,p_a,q), convergence constant \epsilon
output : Optimal policy: u^{*}(a'|x';x,a); desirability function: z
Construct P as the passive Markov chain matrix for p_{xa}
Q \leftarrow \mathrm{diag}(\exp(-\mathbf{q}))
Initialize \mathbf{z} to a one-hot vector with a 1 on the boundary state
delta \leftarrow \infty
# Power-Iteration
while delta > \epsilon do
      \mathbf{z}_{new} \leftarrow QP\mathbf{z}
      delta \leftarrow \mathrm{sum}(\mathrm{abs}(\mathbf{z}_{new}-\mathbf{z}))
      \mathbf{z} \leftarrow \mathbf{z}_{new}
end while
u^{*}_{a} \leftarrow \frac{p_a(a'|x';x,a)\,z(x',a')}{G[z](x',x,a)}
Algorithm 1 saLMDP_solve (deterministic p_x)
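For concreteness, a runnable NumPy sketch of Algorithm 1 for the deterministic-p_x case follows (not the paper's code); it assumes the passive joint chain P_{xa} (e.g., the product MW from the sketch in Section 5.2) and the cost vector q are given, with a single zero-cost, absorbing boundary state-action indexed by g:

import numpy as np

def salmdp_solve(P_xa, q, g, eps=1e-10):
    """Sketch of Algorithm 1: power iteration z <- Q P z (deterministic p_x)."""
    Q = np.diag(np.exp(-q))
    z = np.zeros(P_xa.shape[0]); z[g] = 1.0     # one-hot on the boundary state-action
    delta = np.inf
    while delta > eps:
        z_new = Q @ P_xa @ z
        z_new[g] = 1.0                          # redundant if the boundary row is absorbing with q = 0
        delta = np.abs(z_new - z).sum()
        z = z_new
    # Controlled joint u_xa(x',a'|x,a), proportional to p_xa(x',a'|x,a) z(x',a');
    # for deterministic p_x this determines the policy u_a* in (22).
    u = P_xa * z[None, :]
    u /= np.maximum(u.sum(axis=1, keepdims=True), 1e-300)
    return u, z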
input : saLMDP ensemble \mathscr{M}^{N}, OG-task \mathscr{T}=(\Sigma,\mathcal{G},\mathcal{O}_{\Psi},\Psi,f_{\Psi}), grounding function h
output : Task policy: u^{*}_{\pi}
nxa: number of state-action pairs
npi: number of policies
J \leftarrow \mathrm{zeros}(nxa, nxa, npi)
\mathcal{O}_{\mathcal{G}} \leftarrow \{o(\textsl{g}_i,\textsl{g}_j)\ |\ o(\psi_1,\psi_2)\in\mathcal{O}_{\Psi},\ \exists(\psi_1,\psi_2)\in f_{\Psi}(\textsl{g}_i)\times f_{\Psi}(\textsl{g}_j)\}
Construct T_{\sigma}: binary vector transition operator on \Sigma   # Section 3.2
for \mathscr{M}_g in \mathscr{M}^{N} do
      u_{x_g,a_g}, z_g \leftarrow saLMDP_solve(\mathscr{M}_g)
      store (u_{x_g,a_g}, z_g) in (\mathcal{U}, \mathcal{Z})
end for
for U_g in \mathcal{U} do
      \mathbf{p}_{abs} \leftarrow solve_sparse_linsys(I-U_{\bar{g}}, \mathbf{h}_g)
      J(:,g,g) \leftarrow \mathbf{p}_{abs}
end for
Define p_{\pi} as uniform passive dynamics
\kappa \leftarrow J\circ\bar{h}   # (7)
T_{\kappa} \leftarrow T_{\sigma}\circ\kappa   # (8)
P_{\kappa} \leftarrow T_{\kappa}\circ p_{\pi}   # Section 3.6
q_{\sigma\textsl{g}}(\sigma,\pi_{\textsl{g}_i}) \leftarrow \{\infty\ \mathbf{if}:\ \exists\textsl{g}_j\in\mathcal{G}\ s.t.\ o(\textsl{g}_i,\textsl{g}_j)\in\mathcal{O}_{\mathcal{G}}\ \land\ \sigma(j)=1;\ \mathbf{else}:\ 0\}
Q_{\sigma\textsl{g}} \leftarrow \mathrm{diag}(\exp(-\mathbf{q}_{\sigma\textsl{g}}))
q_{\sigma}(\sigma) \leftarrow 0 \text{ if } \sigma=\sigma_f, \text{ else } const.
Q_{\sigma} \leftarrow \mathrm{diag}(\exp(-\mathbf{q}_{\sigma}))
Q_{\pi} \leftarrow \mathrm{diag}(\mathrm{vec}(\mathcal{Z}_{GS}))
Q \leftarrow Q_{\sigma\textsl{g}}Q_{\sigma}Q_{\pi}
Initialize \mathbf{z} to a zero vector with ones on the \sigma_f boundary states
delta \leftarrow \infty
# Power-Iteration
while delta > \epsilon do
      \mathbf{z}_{new} \leftarrow QP_{\kappa}\mathbf{z}
      delta \leftarrow \mathrm{sum}(\mathrm{abs}(\mathbf{z}_{new}-\mathbf{z}))
      \mathbf{z} \leftarrow \mathbf{z}_{new}
end while
u^{*}_{\pi_{\textsl{g}}} \leftarrow \frac{p_{\pi}(\pi_{\textsl{g}}'|s',s,\pi_{\textsl{g}})\,z(s',\pi_{\textsl{g}}')}{G[z](s',s,\pi_{\textsl{g}})}
Algorithm 2 tLMDP_solve

5.9 Jump Operator vs. Full Operator Value Function Equivalence

Here we show that a Bellman equation based on a deterministic full transition operator with a deterministic policy achieves the same value function as a Bellman equation based on the jump operator, when the jump operator's cost function is set to the value function of the underlying ensemble policy.

Let us assume that the "full operator" and the policy \pi(a|x) are deterministic, namely

T(\sigma',x'|\sigma,x,a)=T_{\sigma}(\sigma'|\sigma,\textsl{g}')\,\bar{h}(\textsl{g}|x,a)\,T_{x}(x'|x,a).

We begin with the definition of the optimal value function for a deterministic first-exit control problem:

v^{*}(\sigma,x)=\min_{a}\big[q(\sigma,x,a)+v^{*}(\sigma',x')\,T(\sigma',x'|\sigma,x,a)\big]. \qquad (39)

Unrolling the recursion we have:

v^{*}(\sigma,x)=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{t=0}^{t_f-1}q(\sigma_t,x_t,a_t)\,T(\sigma_t,x_t|\sigma_{t-1},x_{t-1},a_{t-1})\,\pi^{*}(a|\sigma,x). \qquad (40)

We can decompose this summation by isolating the constant \sigma variables within each \sigma\rightarrow\sigma' transition, creating a new index k for each period during which \sigma remains constant, where \tau_k indexes the intra-period discrete time steps:

v^{*}(\sigma,x)=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}\sum_{\tau_k=0}^{\tau_{k,f}}q(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k})\,T(\sigma_{\tau_k},x_{\tau_k}|\sigma_{\tau_k-1},x_{\tau_k-1},a_{\tau_k-1})\,\pi^{*}(a|\sigma,x) \qquad (41)

=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}\Bigg[q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T(\sigma_{\tau_{k,f}},x_{\tau_{k,f}}|\sigma_{\tau_{k,f}-1},x_{\tau_{k,f}-1},a_{\tau_{k,f}-1})\,\pi^{*}(a|\sigma,x)+\sum_{\tau_k=0}^{\tau_{k,f}-1}q(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k})\,T(\sigma_{\tau_k},x_{\tau_k}|\sigma_{\tau_k-1},x_{\tau_k-1},a_{\tau_k-1})\,\pi^{*}(a|\sigma,x)\Bigg] \qquad (42)

=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}\Bigg[q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T_{\pi^{*}}(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}}|\sigma_{\tau_{k,f}-1},x_{\tau_{k,f}-1},a_{\tau_{k,f}-1})+\sum_{\tau_k=0}^{\tau_{k,f}-1}q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T_{\pi^{*}}(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k}|\sigma_{\tau_k-1},x_{\tau_k-1},a_{\tau_k-1})\Bigg], \qquad (43)

where T_{\pi^{*}}:(\Sigma\times\mathcal{X}\times\mathcal{A})\times(\Sigma\times\mathcal{X}\times\mathcal{A})\rightarrow[0,1] is the Markov chain induced by applying the policy \pi^{*}. By conditioning the transition operator on the first time step of period k, we have:

v^{*}(\sigma,x)=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}\Bigg[q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T^{\tau_{k,f}}_{\pi^{*}}(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k}|\sigma_k,x_k,a_k)+\sum_{\tau_k=0}^{\tau_{k,f}-1}q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T^{\tau_k}_{\pi^{*}}(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k}|\sigma_k,x_k,a_k)\Bigg] \qquad (44)

=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}\Bigg[q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T^{\tau_{k,f}}(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k}|\sigma_k,x_k,a_k,\pi^{*}_{\sigma})\,\phi^{*}(\pi^{*}_{\sigma}|\sigma,x)+\sum_{\tau_k=0}^{\tau_{k,f}-1}q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T^{\tau_k}(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k}|\sigma_k,x_k,\pi^{*}_{\sigma})\,\phi^{*}(\pi^{*}_{\sigma}|\sigma,x)\Bigg], \qquad (45)

where \phi^{*}:\Sigma\times\mathcal{X}\rightarrow\Pi_{\sigma}, with \Pi_{\sigma}=\{\pi^{*}_{\sigma_1},\pi^{*}_{\sigma_2},...\}, is a map to the set of optimal policies conditioned on \sigma, and \pi^{*}_{\sigma}:\mathcal{X}\rightarrow\mathcal{A} is the optimal state-action map conditioned on \sigma being constant. We omit the timestamps on the arguments of \pi^{*}_{\sigma}, since it is a stationary policy.

Note that the term inside the first summation in (45) is itself a first-exit control problem, which can be defined as:

v^{*}_{\sigma}(\sigma_k,x_k|\pi^{*}_{\sigma}):=q(\sigma_{\tau_{k,f}},x_{\tau_{k,f}},a_{\tau_{k,f}})\,T^{\tau_{k,f}}(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k}|\sigma_k,x_k,a_k,\pi^{*}_{\sigma})\,\phi^{*}(\pi^{*}_{\sigma}|\sigma,x)+\sum_{\tau_k=0}^{\tau_{k,f}-1}q(\sigma_{\tau_k},x_{\tau_k},a_{\tau_k})\,T^{\tau_k}(\sigma_{\tau_k},x_{\tau_k}|\sigma_k,x_k,\pi^{*}_{\sigma})\,\phi^{*}(\pi^{*}_{\sigma}|\sigma,x). \qquad (46)

Substituting v^{*}_{\sigma}(\sigma_k,x_k|\pi^{*}_{\sigma}) for the inner summation, we obtain:

v^{*}(\sigma,x)=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}v^{*}_{\sigma}(\sigma_k,x_k|\pi^{*}_{\sigma})\,P(\sigma_k,x_k|\sigma_{k-1},x_{k-1},\pi^{*}_{\sigma_k})\,\phi^{*}(\pi^{*}_{\sigma}|\sigma,x), \qquad (47)

where P(\sigma_k,x_k|\sigma_{k-1},x_{k-1},\pi^{*}_{\sigma_k})=T^{\tau_{k,f}}_{\pi^{*}}(\sigma_{\tau_k},x_{\tau_k}|\sigma_k,x_k,\pi^{*}_{\sigma_k}). Given an optimal policy, we have no a priori knowledge of which terminal states are traversed under the dynamics. However, if we define a new set of first-exit problems specified by individual terminal states in \mathcal{T}, we can substitute these individual first-exit problems into the overall problem.

In (47) we are compressing the inner loop into a policy evaluation, with the (deterministic) probability of jumping from (\sigma_{k-1},x_{k-1}) to (\sigma_k,x_k). Since q_{xa}(x,a) does not depend on \sigma, we can reuse the associated policy for any \sigma. Therefore, we can introduce a set of policies \hat{\Pi}=\{\hat{\pi}_1,\hat{\pi}_2,...\}, derived from a set of value functions \mathcal{V}=\{\hat{v}_1,\hat{v}_2,...\}, the induced Markov chains \mathcal{M}=\{T_{\pi_1},T_{\pi_2},...\}, and their corresponding "jump" J(\sigma',x'|\sigma,x,\hat{\pi}), which stand in for v^{*}_{\sigma}(\sigma_k,x_k|\pi^{*}_{\sigma}) and P(\sigma_k,x_k,a_k|\sigma_{k-1},x_{k-1},a_{k-1},\pi^{*}_{\sigma_k}). That is, if we can show that v^{*}_{\sigma} is contained within \mathcal{V}, and P_{\pi} can be produced by the corresponding policy in J, then we can write an optimization which substitutes the (\hat{v},J) pair for the original functions.

We define a new high-level policy \hat{\phi}:\Sigma\times\mathcal{X}\rightarrow\hat{\Pi}. Since we are working with deterministic dynamics and policies, the agent follows only a single trajectory under the optimal policy, implying that only one of the boundary states is traversed. This means that for a given P(\sigma_k,x_k,a_k|\sigma_{k-1},x_{k-1},a_{k-1},\pi^{*}_{\sigma_k}), there exists a way to condition J(\sigma',x'|\sigma,x,\hat{\pi}_{\textsl{g}}) to achieve the same terminal state. That is,

P(\sigma_k',x_k',a_k'|\sigma_{k-1},x_{k-1},a_{k-1},\pi^{*}_{\sigma_k})-\max_{\hat{\pi}\in\hat{\Pi}}J(\sigma',x',a'|\sigma,x,a,\hat{\pi}_{\textsl{g}})=0.

We call the policy that reaches the single traversed boundary state-action of the full-operator policy the jump-optimal policy:

\hat{\pi}_{\textsl{g}}^{+}=\operatorname*{argmax}_{\hat{\pi}\in\hat{\Pi}}J(\sigma',x',a'|\sigma,x,a,\hat{\pi}_{\textsl{g}}).

Because the jump-optimal policy, \hat{\pi}_{\textsl{g}}^{+}, is a deterministic shortest-path policy to the same boundary state as \pi^{*}_{\sigma_k}, the corresponding ensemble value function, \hat{v}_{\textsl{g}}, must be the same as \hat{v}_{\sigma}^{*} evaluated at a given state x. Therefore, the jump-optimal policy is also the value-optimal policy to select from state x, which must be equivalent in value to the policy \pi_{\sigma}^{*} followed from x: \hat{\phi}^{*}(\hat{\pi}_{\textsl{g}}^{+}|\sigma,x)=\hat{\phi}^{*}(\hat{\pi}_{\textsl{g}}^{*}|\sigma,x)=\phi^{*}(\pi_{\sigma}^{*}|\sigma,x). We can then write the following equivalence:

P(\sigma_k',x_k'|\sigma_{k-1},x_{k-1},\pi^{*}_{\sigma_k})\,\phi^{*}(\pi^{*}_{\sigma}|\sigma,x)=J(\sigma',x'|\sigma,x,\hat{\pi}_{\textsl{g}}^{*})\,\hat{\phi}^{*}(\hat{\pi}_{\textsl{g}}^{*}|\sigma,x).

Substituting the jump operator for P and the optimal ensemble value function \hat{v}^{*} for v_{\sigma}, the optimization in (47) becomes:

v^{*}(\sigma,x)=q(\sigma_{t_f},x_{t_f},a_{t_f})+\sum_{k=0}^{k_f-1}\Big[\hat{v}_{\textsl{g}}^{*}(\sigma_k,x_k|\pi^{*}_{\textsl{g}})\,J(\sigma_k,x_k|\sigma_{k-1},x_{k-1},\hat{\pi}_{\textsl{g}})\,\hat{\phi}^{*}(\hat{\pi}_{\textsl{g}}|\sigma,x)\Big]. \qquad (48)

Replacing the optimal policy with an explicit minimization over policies, we have:

v^{*}(\sigma,x)=q(\sigma_{t_f},x_{t_f},a_{t_f})+\min_{\hat{\pi}_{\textsl{g}}\in\hat{\Pi}}\sum_{k=0}^{k_f-1}\Big[\hat{v}_{\textsl{g}}(\sigma_k,x_k|\hat{\pi}_{\textsl{g}})\,J(\sigma_k,x_k|\sigma_{k-1},x_{k-1},\hat{\pi}_{\textsl{g}})\Big], \qquad (49)

v^{*}(\sigma,x)=\min_{\hat{\pi}_{\textsl{g}}\in\hat{\Pi}}\Big[\hat{q}(\sigma_k,x_k,\hat{\pi}_{\textsl{g}})+v^{*}(\sigma',x')\,J(\sigma',x'|\sigma,x,\hat{\pi}_{\textsl{g}})\Big], \qquad (50)

\hat{\phi}^{*}(\hat{\pi}_{\textsl{g}}|\sigma,x)=\operatorname*{argmin}_{\hat{\pi}_{\textsl{g}}\in\hat{\Pi}}\Big[\hat{q}(\sigma_k,x_k,\hat{\pi}_{\textsl{g}})+v^{*}(\sigma',x')\,J(\sigma',x'|\sigma,x,\hat{\pi}_{\textsl{g}})\Big], \qquad (51)

where \hat{q}(\sigma_k,x_k,\hat{\pi}_{\textsl{g}})=\hat{v}_{\textsl{g}}(\sigma_k,x_k|\hat{\pi}_{\textsl{g}}).

We have thus obtained a new Bellman equation with a different transition operator, the "jump operator", which is guaranteed to have the same solution as the Bellman equation with the full operator when the cost function is defined as the value function of the ensemble policy.
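To make the final recursion (50)-(51) concrete, below is a minimal sketch of value iteration over the jump operator for the deterministic case. It assumes hypothetical tables jump[(sigma, x)][pi] -> (sigma', x') and q_hat[(sigma, x)][pi] (the latter holding the ensemble value functions, per \hat{q}=\hat{v}_{\textsl{g}}); the names and data layout are illustrative only, and every successor state is assumed to appear in jump or goal_states:

from typing import Dict, Hashable, Set, Tuple

State = Tuple[Hashable, Hashable]                 # (sigma, x)

def jump_value_iteration(jump: Dict[State, Dict[str, State]],
                         q_hat: Dict[State, Dict[str, float]],
                         goal_states: Set[State],
                         sweeps: int = 1000):
    """Value iteration for v(sigma,x) = min_pi [ q_hat(sigma,x,pi) + v(sigma',x') ], eq. (50); a sketch."""
    v = {s: (0.0 if s in goal_states else float('inf')) for s in jump}
    for s in goal_states:
        v.setdefault(s, 0.0)                      # terminal (sigma_f) states have zero cost-to-go
    for _ in range(sweeps):
        changed = False
        for s, policies in jump.items():
            if s in goal_states:
                continue
            best = min(q_hat[s][pi] + v[jump[s][pi]] for pi in policies)
            if best < v[s]:
                v[s], changed = best, True
        if not changed:
            break
    # eq. (51): the high-level policy selects the minimizing ensemble policy at each (sigma, x)
    phi = {s: min(jump[s], key=lambda pi: q_hat[s][pi] + v[jump[s][pi]])
           for s in jump if s not in goal_states}
    return v, phi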