Jump Operator Planning: Goal-Conditioned
Policy Ensembles and Zero-Shot Transfer
Abstract
In Hierarchical Control, compositionality, abstraction, and task-transfer are crucial for designing versatile algorithms which can solve a variety of problems with maximal representational reuse. We propose a novel hierarchical and compositional framework called Jump-Operator Dynamic Programming for quickly computing solutions within a super-exponential space of sequential sub-goal tasks with ordering constraints, while also providing a fast linearly-solvable algorithm as an implementation. This approach involves controlling over an ensemble of reusable goal-conditioned policies functioning as temporally extended actions, and utilizes transition operators called feasibility functions, which are used to summarize initial-to-final state dynamics of the policies. Consequently, the added complexity of grounding a high-level task space onto a larger ambient state-space can be mitigated by optimizing in a lower-dimensional subspace defined by the grounding, substantially improving the scalability of the algorithm while yielding transferable solutions. We then identify classes of objective functions on this subspace whose solutions are invariant to the grounding, resulting in optimal zero-shot transfer.
Keywords: Hierarchical Control, Sequential Goal Tasks, State-Action LMDP, Zero-Shot Transfer, Goal-conditioned Policies, Compositionality
1 Introduction
An open problem in the domains of Planning, Dynamic Programming (DP) and Reinforcement Learning (RL) involves the construction of policies achieving sequential multi-subgoal tasks over long ranges of space and time. To achieve such tasks efficiently, one needs the right set of hierarchical abstractions, representations, and optimizations: identifying these features is a challenging problem with many related questions. What is the definition of a task? How do tasks interface with environments? How can representations, whether they concern policies, value functions, or task rules, be transferable within and across new environments? We propose a framework called Jump-Operator Dynamic Programming (JODP) for ordered sequential sub-goal planning, which addresses these questions by solving structured multi-subgoal tasks using coarse-grained transition operators that jump the agent from initial to final states.
DP is a recursive and exact algorithm that provides analytic insight into the problem of sequential decision making, often to the benefit of approximation or learning frameworks like RL [1]. For instance, we understand the convergence properties of RL algorithms in relation to the optimal value function produced by DP [2, 3, 4], as they share the same underlying formalism, i.e. Markov Decision Processes (MDP) [5]. Likewise, we argue that DP-based methods can help elucidate mechanisms for transfer and generalization in sequential-goal problems. However, it is not necessarily the case that simply augmenting a state-space with more task variables will admit reusable solutions: we need a foundation built on representations that can facilitate efficient computation and transfer. In this paper, we do not focus on the subject of learning; rather, we provide formalisms from which we can build an analytic theory of compositionality and zero-shot task transfer for sequential-goal planning. Learning algorithms, of any stripe, can then inherit the properties and guarantees of the framework by computing the appropriate representations to promote generalization.
JODP is a general framework that uses specific kinds of objects and functions that engender key properties. We show that ordered sub-goal tasks can be formulated as first-exit problems in a binary-vector task space that records sub-goal progress (Section 3.2). Dynamics through the task-space can be induced by high-level goal variables mapped to low-level state-actions via grounding functions (Section 3.3). We demonstrate how to construct an ensemble of reusable goal-conditioned shortest-path policies on a base state-space in order to drive the task-space dynamics. Crucially, from these policies we can build new abstract operators that represent coarse-grained initial-to-final state and goal dynamics for each policy, called Jump Operators and Feasibility Functions (Sections 3.4, 3.5). These operators play a central role in an efficient task-policy optimization, called the task-LMDP (tLMDP), which stitches together low-level goal-conditioned policies to solve the task (Section 3.6). The key property of a policy ensemble is remappability: we can remap ensemble members to new tasks with different grounding functions, allowing us to compute policies for a super-exponential space of problems without recomputing low-level representations. Furthermore, we show that it is sufficient to compute the task solution restricted to the grounded subspace of state-actions defined by the sub-goal grounding function, which has significantly better scaling properties under increases in state-space size and sub-goal number than computing a solution with DP in the full Cartesian product state-space formed by the task- and base-space. A powerful consequence of this is that solutions to the tLMDP in the grounded subspace can have an important grounding invariance property, which allows the solution to be remapped to different ensemble policies with perfect zero-shot transfer.
Related Research This work has connections with several areas of research. The tLMDP and its precursor, the saLMDP (yet to be discussed), are new members of the family of Linearly-Solvable Optimal Control problems [6, 7], and are inspired by recent hierarchical innovations for controlling over sub-problems [8, 9, 10]. Goal-conditioned policies have been introduced both in the context of RL [11] and in the context of model-based non-stationary control [12]. Ordered-goal tasks have also been recently proposed in RL for training agents with symbolic and temporal rules [13, 14, 15, 16, 17, 18, 19, 20]. Policy concatenation (stitching) has also been proposed as a means of skill-learning in the Options framework [21, 22, 23], as well as in Hierarchical Control [24, 12], along with other schemes for concatenating policies, such as policy sketches (using modular policies and task-specific reward functions) [25] or compositional plan vectors [26]. Lastly, a compositional Boolean task algebra was recently introduced demonstrating zero-shot transfer for non-sequential policies [27].
2 Background
The Linearly Solvable Markov Decision Process (LMDP) [7] is an entropy-regularized relaxation of a standard Markov Decision Process (MDP). It is defined as a three-tuple $(\mathcal{S}, p, q)$, where $\mathcal{S}$ is the finite set of states, $p(s'|s)$ is an uncontrolled "passive dynamics" transition function, and $q(s)$ is a state-cost function. The Linear Bellman equation is defined as:
$v(s) = \min_{u(\cdot|s)} \Big[\, q(s) + \mathrm{KL}\big(u(\cdot|s)\,\|\,p(\cdot|s)\big) + \mathbb{E}_{s' \sim u(\cdot|s)}\big[v(s')\big] \Big]$   (1)
where $v(s)$ is the value function, $u(s'|s)$ is an arbitrary controlled transition distribution (the policy), and $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler (KL) divergence between the two given distributions. Unlike the standard MDP, the action variable has been replaced with a controlled transition matrix $u(s'|s)$. Action costs in (1) are conceptualized as deviations away from the passive dynamics, as measured by the KL-divergence. Therefore, instead of optimizing a state-to-action map $\pi(s) = a$, as one would with standard algorithms such as value-iteration [1], we optimize for desirable state-to-state Markov chain dynamics. The solution to (1) will be a policy $u^*$ which minimizes the expected accumulated cost contributed by both the state-cost function and the action cost under the future dynamics. Solving the equation requires a change of variables applied to the value function, $z(s) = e^{-v(s)}$, which converts it into a desirability function. It is shown in [7] that the optimal controlled dynamics, $u^*(s'|s)$, can be determined by rescaling the passive dynamics by the optimal desirability function, which is the largest eigenvector of $GP$, i.e. $z = GPz$, where $z$ is the desirability function vector. The matrix $G = \mathrm{diag}(\exp(-q))$ is a diagonal matrix of the negatively exponentiated cost function, where $q$ is the cost vector on $\mathcal{S}$, and $P$ is the matrix form of the passive transition function, with $P_{ss'} = p(s'|s)$. The rescaled dynamics are normalized to obtain the optimal policy $u^*(s'|s) = \frac{p(s'|s)\,z(s')}{\sum_{s''} p(s''|s)\,z(s'')}$.
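To make the recursion above concrete, the following is a minimal NumPy sketch (not the paper's code) of a first-exit LMDP solve: it iterates $z \leftarrow GPz$ with the boundary (goal) desirability clamped, and recovers the optimal policy by rescaling the passive dynamics. The function names, array layout, and convergence tolerance are our own assumptions.

```python
import numpy as np

def lmdp_desirability(P, q, boundary, tol=1e-10, max_iter=100000):
    """First-exit LMDP: iterate z <- G P z with goal desirability held fixed.
    P: (n, n) row-stochastic passive dynamics; q: (n,) state costs (np.inf for
    obstacles); boundary: boolean mask of absorbing goal states."""
    G = np.exp(-q)                          # diagonal of exp(-cost), stored as a vector
    z = np.zeros(len(q))
    z[boundary] = np.exp(-q[boundary])      # boundary desirability is known
    for _ in range(max_iter):
        z_new = G * (P @ z)                 # backwards propagation of desirability
        z_new[boundary] = np.exp(-q[boundary])
        if np.max(np.abs(z_new - z)) < tol:
            break
        z = z_new
    return z

def lmdp_policy(P, z):
    """Optimal controlled dynamics u*(s'|s) = p(s'|s) z(s') / normalization."""
    U = P * z[None, :]
    rows = U.sum(axis=1, keepdims=True)
    return np.divide(U, rows, out=np.zeros_like(U), where=rows > 0)
```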
3 Algorithm: Jump-Operator Dynamic Programming
3.1 State-Action LMDP
We introduce a novel extension of the LMDP that operates on state-action pairs. This extension will be employed in two places in the algorithm: for the low-level goal-conditioned policy ensemble, and for the high-level tLMDP task policy that "stitches" policies from the ensemble together. Reintroducing the action variable will have non-trivial consequences for the tLMDP: we can construct more expressive cost functions, including ordering constraints on sub-goals and the cost of following a given policy to complete a sub-goal, all while retaining a largest-eigenvector solution.
Definition 3.1 (state-action LMDP).
A state-action LMDP (saLMDP) is defined as the tuple $(\mathcal{S}, \mathcal{A}, p_x, p_a, q)$, where $\mathcal{S}$ is the finite set of states, $\mathcal{A}$ is the finite set of actions, $p_x(s'|s,a)$ is a state-transition model, $p_a(a'|s',a)$ is the "passive" action-transition dynamics, and $q(s,a)$ is the cost function.
We formulate a new objective function by defining the state in the saLMDP as a state-action pair $x = (s,a)$, and then substituting $x$ for $s$ in (1). After substitution, we can decompose the resulting joint distributions $u(s',a'|s,a)$ and $p(s',a'|s,a)$ into $u_a(a'|s',a)\,u_x(s'|s,a)$ and $p_a(a'|s',a)\,p_x(s'|s,a)$ via the chain rule, where $u_x$ is a state-transition policy and $u_a$ is an action-transition policy. We enforce an optimization constraint that $u_x = p_x$, so the controller is not free to optimize $u_x$, which would modify a fixed physics model encoded in $p_x$. Thus, we only allow the controller to modify the distribution over actions, $u_a$, giving rise to the saLMDP objective and solution:
(2) | |||
(3) | |||
(4) |
where $\mathcal{N}[z](s',a) = \sum_{a''} p_a(a''|s',a)\,z(s',a'')$ is a normalization term and $z$ is the desirability vector on state-actions, i.e. for each $x = (s,a)$ we have $z(x) = e^{-v(x)}$, and $u_a^*$ is the optimal policy; we will refer to $u_a^*$ as $u^*$ for compactness. (Here $v(s,a)$ plays the same role as the "Q-function", $Q(s,a)$, in RL [28]; however, we keep consistent with the original LMDP notation, in which $q$ is a cost function and $G$ is a cost matrix.) When $p_x$ is a deterministic distribution, the optimal policy can be obtained, as in the LMDP, by computing the optimal desirability function, which is the largest eigenvector of $GP_{xx'}$ (see Appx. 5.2 for the derivation), where $G = \mathrm{diag}(\exp(-q))$ and $P_{xx'}$ is the matrix form of the joint passive dynamics $p_a(a'|s',a)\,p_x(s'|s,a)$. In the case where $p_x$ is stochastic, the solution does not reduce to a linear equation, but there exists an iterative non-linear mapping for computing the desirability function (25). Henceforth, the theory we develop will be linear, considering only deterministic $p_x$.
Note that as the cost of a state-action approaches infinity, its negatively exponentiated counterpart, represented as a diagonal entry in $G$, approaches zero. Thus, zero entries on the diagonal effectively edit the transition structure of $GP_{xx'}$, shaping the backwards propagation of desirability when computing (4) with power-iteration and suppressing state-actions which would violate the infinite-cost transitions. Todorov [7] showed that the LMDP solution can approximate shortest paths with arbitrary precision in the limit of driving the internal state costs arbitrarily high, i.e. $\lim_{c\to\infty} v_c(s)/c = d(s, s_g)$, where $d(s, s_g)$ is the shortest-path distance from $s$ to the goal state and $c$ is the internal state cost. In this paper, obstacles are represented with infinite cost, internal states have an arbitrarily high constant cost, and the boundary (goal) state-action has zero cost. Therefore, the saLMDP objective with this cost structure is a discrete reach-avoid shortest-path first-exit control problem, avoiding obstacle states while controlling to a boundary state-action. These policies are said to be goal-conditioned, as we can explicitly interpret each policy by its goal's grounded state-action, rendering the policies remappable for new tasks.
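As a sketch of how a goal-conditioned shortest-path ensemble might be assembled (assumed encodings, reusing `lmdp_desirability` from the Background sketch): each grounded goal state-action defines its own first-exit saLMDP over state-action pairs, whose passive dynamics are the product $p_a(a'|s',a)\,p_x(s'|s,a)$.

```python
import numpy as np

def joint_passive_dynamics(p_x, p_a):
    """Build P[(s,a),(s',a')] = p_x(s'|s,a) * p_a(a'|s',a).
    p_x: (S, A, S) state transitions; p_a: (S, A, A) passive action dynamics,
    indexed as p_a[s', a, a']. Returns an (S*A, S*A) row-stochastic matrix."""
    S, A, _ = p_x.shape
    P = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                if p_x[s, a, s2] == 0.0:
                    continue
                for a2 in range(A):
                    P[s * A + a, s2 * A + a2] = p_x[s, a, s2] * p_a[s2, a, a2]
    return P

def goal_conditioned_ensemble(p_x, p_a, q_sa, goal_state_actions):
    """One first-exit saLMDP per grounded goal state-action; returns a dict of
    desirability vectors keyed by the goal (state, action) pair."""
    S, A, _ = p_x.shape
    P = joint_passive_dynamics(p_x, p_a)
    ensemble = {}
    for (gs, ga) in goal_state_actions:
        q = q_sa.copy()
        boundary = np.zeros(S * A, dtype=bool)
        boundary[gs * A + ga] = True
        q[boundary] = 0.0                    # zero cost at the goal state-action
        ensemble[(gs, ga)] = lmdp_desirability(P, q, boundary)
    return ensemble
```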
3.2 Ordered Goal Task
We define an ordered goal task, which specifies a set of necessary sub-goals to be completed under precedence constraints.
Definition 3.2 (OG-task).
An Ordered-Goal task (OG-task) is defined as the tuple $\mathcal{T} = (\Sigma, \mathcal{G}, \mathcal{O}, \Phi, \mu)$, where $\Sigma$ is a set of binary vectors indicating sub-goal completion, $\mathcal{G}$ is a set of goal variables which are actions on $\Sigma$, $\mathcal{O}$ is a set of precedence constraints on goal types, $\Phi$ is a set of types, and $\mu$ is a function assigning goals to subsets of types.
Here, $\mathcal{O}$ is a set of precedence relations on types (e.g. requiring one type of sub-goal to be completed before another), and $\Sigma$ is a binary-vector state-space which tracks the progress of goal achievement. For example, $\sigma = (1,1,0,0,0)$ indicates that the first two sub-goals have been completed and three other goals remain to be completed. With these primary objects, we can derive a transition operator $T_\sigma(\sigma' \mid \sigma, g)$ which specifies the dynamics of the task space under given sub-goals, with the restriction that a goal g can only flip its corresponding bit in the vector from 0 to 1. The task-space transition operator encodes the task-space cube in the example of Fig. 1. Likewise, a cost function on this space can be derived directly from the ordering constraints. That is, if we would like to specify that a sub-goal of one type precedes a sub-goal of another type, then these constraints are imposed directly onto the cost function. The type constraints also induce a precedence relation set on the goals themselves.
Type relations add an additional level of symbolic reasoning to the architecture, enabling us to make statements such as "complete the purple sub-goals before the orange sub-goals" or "chop wood with an axe in addition to collecting water", as depicted by red arrows in Figs. 2.A and 5.A. The cost function on this space penalizes ordering violations on the bit-flips of $\sigma$ and otherwise assigns a constant cost to task states, with the exception of the task-space boundary state (all sub-goals complete).
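The task-space dynamics and ordering costs described above admit a very small implementation; the sketch below uses our own encodings (task states as bit-tuples, goals as bit indices, and the induced precedence relation as pairs of goal indices), and is not the paper's code.

```python
from itertools import product

def task_transition(sigma, g):
    """Deterministic T_sigma: goal g can only flip its bit from 0 to 1."""
    sigma_next = list(sigma)
    sigma_next[g] = 1
    return tuple(sigma_next)

def task_cost(sigma, g, precedence, c_const=1.0, c_violation=float("inf")):
    """precedence: set of (g_before, g_after) pairs induced from the type ordering."""
    if all(sigma):                            # boundary task state
        return 0.0
    for (g_before, g_after) in precedence:
        if g_after == g and sigma[g_before] == 0:
            return c_violation                # completing g now violates the ordering
    return c_const

# Example with 3 sub-goals and the induced relation "goal 0 before goal 2":
precedence = {(0, 2)}
print(task_cost((0, 0, 0), 2, precedence))    # inf: goal 0 not yet complete
print(task_cost((1, 0, 0), 2, precedence))    # 1.0: allowed
print(task_transition((1, 0, 0), 2))          # (1, 0, 1)
print(len(list(product([0, 1], repeat=3))))   # |Sigma| = 2^3 task states
```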
Since the objects of an OG-task can be used directly to define a transition operator and a cost function, much like Definition 3.1, we could introduce a passive transition dynamics and solve for a policy in the task space which satisfies our constraints, as illustrated in Fig. 2.A. However, synthesizing a policy only in the task space is not sufficiently useful for an agent. To make task states and actions useful and reusable representations, we show how to map goal-actions to the agent's underlying controllable space, in this case the base space of state-actions.
3.3 Grounding and Ungrounding Functions
In order to handle this mapping, we introduce specialized functions called grounding and ungrounding functions, which mediate the interaction between high-level goals and the agent's low-level dynamics. A grounding function is a function which associates a goal variable with a state-action pair in the base space. An ungrounding function, $f$, is a left-inverse of the grounding function, mapping state-actions back to the goal variable. For the grid-world examples in Fig. 2, the base action-space includes a special neutral action for completing a sub-goal, in addition to the five standard actions (up, down, right, left, and neutral). Consider a saLMDP transition operator. If we compose the induced Markov chain with the ungrounding function, we obtain dynamics over goal variables.
Since we have specified transition dynamics over the task space in Section 3.2, we can also obtain joint dynamics over the base space and the task space.
We call this operator the full dynamics. Note that it is specified over a space of size $|\mathcal{S}||\mathcal{A}||\Sigma|$. Since $|\Sigma|$ grows exponentially in the number of sub-goals, i.e. $|\Sigma| = 2^{n}$, this operator over the full product space can introduce considerable space and time complexity into the computation of the optimal policy. However, the added complexity can be mitigated by creating specialized transition operators called jump operators, which summarize the resulting dynamics of a policy from its initial state to the state-action of sub-goal completion. In doing so, we can avoid representing the internal (non-subgoal) states of $\mathcal{S}\times\mathcal{A}$ and compute a solution in a lower-dimensional subspace of the full space, to be discussed in 3.6.
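A quick back-of-the-envelope comparison of the two spaces discussed above (illustrative numbers, not taken from the paper's experiments):

```python
# Full product space vs. the grounded subspace used later (Section 3.6).
S, A, n = 10_000, 6, 9                   # grid cells, actions, sub-goals (illustrative)
task_states    = 2 ** n                  # |Sigma| = 512
full_product   = S * A * task_states     # 30,720,000 coupled state-actions
grounded_space = n * task_states         #      4,608 coupled entries
print(full_product, grounded_space)
```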
3.4 Jump Operator
Let g be a goal in a given OG-task. Recall the joint distribution over the base space in Section 3.1 that expresses the control policy. Once a control policy is optimized with a saLMDP, a Markov chain is induced. Consider an optimal shortest-path first-exit Markov chain matrix $P_{\pi}$, induced by applying the distribution that controls to goal g's state-action pair $x_g$ (we use a stylized 'g' to refer to goal variables and $x_g$ to refer to the goal-indexed state-action, Fig. 2.A; by slight abuse of notation we may use one in place of the other). With this Markov chain, we can construct a matrix
$\bar{P}_{\pi} = \begin{bmatrix} U & A \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$
which segregates the transient state-action dynamics $U$ from the absorbing state-action dynamics $A$. $U$ is a sub-matrix of $P_{\pi}$ which excludes the goal row and column, and $A$ is a rank-one matrix of zeros except for the goal column, which is taken from the $x_g$ column of $P_{\pi}$, excluding the $x_g$ element. By taking an arbitrary power $k$, the elements of $\bar{P}_{\pi}^{\,k}$ represent the probability of remaining in the transient states or absorbing into the goal state-action after $k$ time steps. The power of $\bar{P}_{\pi}$ in the infinite limit can be written as [29]:
$\lim_{k\to\infty} \bar{P}_{\pi}^{\,k} = \begin{bmatrix} \mathbf{0} & NA \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$   (5)
where $N = (I - U)^{-1}$ is the fundamental matrix of the absorbing chain (invertibility is generally not a concern if $U$ is derived from a first-exit problem; see 5.6 for discussion), which is the solution to the Neumann series $N = \sum_{k=0}^{\infty} U^{k}$. Note that the upper-right sub-matrix $NA$ is rank-one: it represents the probability of starting in a state-action, indexing the row space, and absorbing into a state-action, indexing the column space, and it can only have positive probability in the column corresponding to the single defined absorbing goal state-action $x_g$. Let there be a set of first-exit saLMDPs, one for every grounded state-action. Consider a set of policies $\Pi$, one for each saLMDP, and an associated set of policy indices, which index corresponding members of $\Pi$ to avoid overlapping notation for high- and low-level policies; specifically, each index $j$ points to a low-level policy $\pi_j \in \Pi$. We can define a new operator using the upper-right sub-matrix, for any $x$ and its associated $x_g$, as:
$J(x_g \mid x, \pi_j) = \mathbf{e}_{x}^{\top}\,(NA)\,\mathbf{e}_{x_g}$   (6)
where $\mathbf{e}_{x}$ and $\mathbf{e}_{x_g}$ are one-hot unit vectors corresponding to $x$ and $x_g$. The probability of starting and ending at the boundary state-action is defined to be 1. $J(x_g \mid x, \pi_j)$ can also be written as $J(x_g \mid x, j)$, given that $j$ is simply an index variable; this will be important when we remap low-level solutions to new problems. The new operator $J$ is a jump operator which maps the agent from an initial state-action to the goal state-action with the probability that it will eventually reach it. We now show how $J$ can be combined with the ungrounding function and the task dynamics $T_\sigma$ to construct joint-state jump dynamics.
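A sketch of the jump-operator construction from an induced Markov chain with a single absorbing goal state-action, following the absorbing-chain decomposition above (the linear solve plays the role of $NA$, cf. Appx. 5.6); the function name and array layout are our own assumptions.

```python
import numpy as np

def jump_operator(P_pi, goal_idx):
    """P_pi: (m, m) Markov chain induced by a goal-conditioned policy, with
    absorbing state-action goal_idx. Returns J with
    J[x, goal_idx] = Pr(eventually absorbing at the goal | start at x)."""
    m = P_pi.shape[0]
    transient = [i for i in range(m) if i != goal_idx]
    U = P_pi[np.ix_(transient, transient)]        # transient-to-transient block
    a = P_pi[transient, goal_idx]                 # transient-to-goal column
    b = np.linalg.solve(np.eye(m - 1) - U, a)     # b = N a, with N = (I - U)^(-1)
    J = np.zeros((m, m))
    J[transient, goal_idx] = b
    J[goal_idx, goal_idx] = 1.0                   # defined to be 1 at the boundary
    return J
```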
3.5 Feasibility Functions
Recall that in Section 3.3 we defined a new passive dynamics by composing the base-space Markov chain with the ungrounding function. Likewise, now that we have defined the jump operator, we can define a new representation called a feasibility function, which composes the jump operator with the ungrounding function:
(7) | |||
(8) |
The feasibility function, originally conceived of in [12] in a non-stationary context, is a function in which the transition of the action variable (goal) of the higher-level space is parameterized in terms of the policy on the lower-level space. In (8), we use the feasibility function together with the high-level operator $T_\sigma$ to create coupled dynamics summarizing the initial-to-final dynamics over both spaces.
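Below is a sketch of assembling the feasibility function of (7) from a jump operator and an ungrounding map, and of weighting it by the task dynamics as in (8); the dictionary encoding of the ungrounding function is our own assumption, and `task_transition` comes from the earlier task-space sketch.

```python
import numpy as np

def feasibility(J, unground, n_goals):
    """J: (m, m) jump operator over state-actions; unground: dict mapping a goal
    state-action index x_g to its goal variable g. Returns K with
    K[x, g] = Pr(policy launched from x eventually ungrounds goal g)."""
    K = np.zeros((J.shape[0], n_goals))
    for x_g, g in unground.items():
        K[:, g] += J[:, x_g]
    return K

def coupled_step_prob(K, x, g, sigma, sigma_next):
    """One entry of the coupled dynamics in (8): the task bit-flip weighted by
    the feasibility of ungrounding g under the chosen policy."""
    return K[x, g] if task_transition(sigma, g) == sigma_next else 0.0
```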
3.6 Task-LMDP
With the definition of the new joint dynamics in (8), we are in a position to repurpose the saLMDP to optimize a task-policy over the joint space, stitching together policies from the ensemble to complete the task; we call this algorithm the task-LMDP (tLMDP).
Definition 3.3 (task-LMDP).
A task-LMDP is defined by a tuple comprising a set of first-exit saLMDP control problems of size $k$ (the number of state-actions to be controlled to), an OG-task $\mathcal{T}$, and a grounding function.
From these three components, we can derive a desirability ensemble, a policy ensemble, a jump operator, an ungrounding function, and a feasibility function, in order to construct the coupled dynamics in (8). The ensemble size $k$ can either be the number of all non-obstacle state-actions, for a complete ensemble, or $n$, the number of goal state-actions defined by the grounding function. The choice depends on whether or not the solution is intended for task-transfer, where policies in the complete ensemble can be remapped to different problems, as we will soon explain. By defining a uniform passive and controlled dynamics distribution over policies, we arrive at the tLMDP objective:
$v(x, \sigma) = \min_{u_\pi}\Big[\, q_{o}(\sigma, g) + q_{c}(\sigma) + q_{v}(x, \pi_j) + \mathrm{KL}\big(u_\pi(\cdot \mid x, \sigma)\,\|\,p_\pi(\cdot \mid x, \sigma)\big) + \mathbb{E}_{(x_g, \sigma')}\big[v(x_g, \sigma')\big] \Big]$   (9)
where $g = f(x_g)$ is the sub-goal ungrounded by the chosen policy $\pi_j$. This objective function is similar to the saLMDP; however, the cost function is an additive mixture of functions. Just like in Section 3.2, $q_o$ and $q_c$ encode the ordering constraints and a constant state-cost, while $q_v$ encodes the cost of following policy $\pi_j$ from $x$ until the ungrounding of g, which is the value function of the policy. This is significant because jump operators combined with a value-function cost form compact representations which preserve low-level trajectory information. We motivate this approach with a proof that, in the case of deterministic transition operators and policies, value functions based on the full operator are equivalent to value functions based on the jump operator when the jump-operator cost function equals the value function of the low-level policy (Appx. 5.9). The optimal desirability function and policy for this coupled-state system are:
(10) | |||
(11) |
where the $G$-matrices are diagonal matrices of each negatively exponentiated cost function, and $P_{\pi}$ is the matrix of the joint state-policy passive dynamics, analogous to $P_{xx'}$ in (4). In (10), all matrices are indexed by tuples of grounded state-action, task state, and policy. However, the grounding function is one-to-one, so there are only $n$ grounded state-action pairs, and there are $n$ policies which advance the task by ungrounding from these state-action pairs; we call this the Grounded Subspace (GS). Hence, the transition matrix can be represented as a sparse matrix over the GS. We can compute the first-exit problem restricted to this subspace, and then compute the desirability for the agent's best initial state-policy pair lying outside the GS (for all possible policies from that state). This is the desirability to enter the grounded subspace (and then complete the task) at a given grounded state-action under a given policy from the initial state. Henceforth, we call this the Desirability-To-Enter (DTE); it is obtained with a single matrix-vector multiplication applied to the desirability vector of (10) computed for the grounded subspace (see Fig. 3 for a visualization). The matrix in this product is a wide matrix whose rows correspond to the agent's current state paired with members of the set of GS-policies, and whose diagonal cost matrices are specific to the starting state. By computing the DTE separately from the GS solution, we effectively break the problem into two parts: a one-step finite-horizon problem for interior states (where the "step" is a single policy call, not one discrete time-step) and a first-exit problem for the GS. This enables the agent to be anywhere in the state-space while requiring only one additional linear transformation to act optimally.
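A sketch of the two-part computation described above, under our own matrix encodings: a first-exit power iteration for the grounded-subspace desirability in (10), followed by a single matrix-vector product for the desirability-to-enter from an interior state (the wide matrix `M_s` is treated as given).

```python
import numpy as np

def grounded_subspace_desirability(P_gs, cost_vectors, boundary, tol=1e-12):
    """P_gs: (k, k) coupled passive dynamics over (grounded state-action, task
    state, policy) tuples; cost_vectors: list of (k,) cost vectors (ordering,
    constant, and policy/value costs), combined multiplicatively after
    exponentiation; boundary: mask of completed-task entries."""
    G = np.exp(-sum(cost_vectors))             # product of the diagonal G-matrices
    z = np.zeros(P_gs.shape[0])
    z[boundary] = 1.0
    while True:
        z_new = G * (P_gs @ z)
        z_new[boundary] = 1.0
        if np.max(np.abs(z_new - z)) < tol:
            return z_new
        z = z_new

def desirability_to_enter(M_s, z_gs):
    """M_s: wide matrix whose rows pair the agent's current interior state with
    each GS policy; one matrix-vector product yields the DTE for every entry
    point into the grounded subspace."""
    return M_s @ z_gs
```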
Note that the ordering constraints in $G_o$ act on the row space of $P_\pi$, whose structure is determined by the feasibility function using global knowledge of sub-goal connectivity. Also, since the policy cost function equals the value function, the cost matrix $G_v$ has diagonal entries which are the desirability functions of each sub-problem concatenated into a vector (see Appx. 5.1 for an architecture visualization). Once (10) has been computed, the agent follows the policy indexed by each sampled goal state-action and policy pair until that goal state-action has been reached, followed by a newly sampled pair. Fig. 5.A shows an example of a coupled state trajectory and jump dynamics, drawn from the optimal policy which satisfies the task of getting an axe before chopping wood, and collecting water. Because the tLMDP solves order-constrained multi-goal tasks, it is structurally analogous to (precedence-constrained) Traveling Salesman Problems [30] given a final goal that grounds to the agent's initial state. Fig. 5.A shows a problem with nine sub-goals with precedence relations on sub-goal colors as types.
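At execution time, the hierarchical rollout alternates between sampling from the task policy and running the selected low-level policy until its goal state-action ungrounds a sub-goal, as the following pseudocode-style sketch illustrates (`task_policy`, `policies`, `env.step`, `sa_index`, and `unground` are hypothetical stand-ins; `task_transition` is from the earlier task-space sketch).

```python
def run_task(task_policy, policies, env, s, sigma, unground):
    """Follow the sampled low-level policy until its goal state-action is hit,
    flip the corresponding task bit, then resample from the task policy."""
    while not all(sigma):
        x_g, j = task_policy(s, sigma)          # sampled goal state-action + policy index
        pi = policies[j]
        while True:
            a = pi(s)
            reached_goal = (sa_index(s, a) == x_g)   # hypothetical state-action index
            s = env.step(s, a)                       # hypothetical environment step
            if reached_goal:
                sigma = task_transition(sigma, unground[x_g])   # sub-goal ungrounds
                break
    return s, sigma
```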
The tLMDP solution (10) has three key properties that make it particularly flexible in the context of efficient task transfer:
- Policy Remappability: Given a complete policy-desirability ensemble and jump operator, under a new grounding function we only need to re-index the policies (so that each index $j$ references different low-level elements of the ensemble and of $J$) to derive new matrices for equation (10). Consequently, for $n$ goals there is a super-exponential space of possible tasks one could formulate (goals grounded onto possible states, along with a set of ordering constraints) which would not require recomputation of the ensemble or the jump operator.
- Grounding Invariance: Since the optimal desirability function is a largest eigenvector, the solution is invariant to scalar multiples of the diagonal cost matrices, assuming $P_\pi$ remains constant under the new grounding.
- Minimal Solution Distance: The agent can compute the optimal desirability (the desirability-to-enter, DTE) for any given initial state outside the GS with one matrix-vector product. Importantly, this also holds true after the zero-shot regrounding of a previous grounded-subspace solution.
These properties all allow for the fast transfer of task solutions within the same environment, or across different environments with distinct obstacle configurations and transition operators. The following sections will elaborate on the empirical and theoretical implications of these properties.
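Property 1 amounts to bookkeeping: under a new grounding, the grounded-subspace matrices in (10) are rebuilt from different columns of the same precomputed ensemble and jump operator. A small sketch of that re-indexing, using hypothetical dictionary encodings consistent with the earlier ensemble sketch:

```python
def reground(ensemble, J, new_grounding):
    """ensemble: dict mapping a grounded state-action key to its desirability
    vector; J: jump operator matrix; new_grounding: dict goal g -> state-action
    index/key x_g. Returns the re-indexed pieces used to rebuild (10); no
    low-level representation is recomputed."""
    z_cols    = {g: ensemble[x_g] for g, x_g in new_grounding.items()}
    jump_cols = {g: J[:, x_g] for g, x_g in new_grounding.items()}
    return z_cols, jump_cols
```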
3.6.1 Results
In Fig. 5.B-D we empirically show the computational advantage of computing tLMDP policies in the GS, relative to computing policies with value-iteration on the full product space. In Fig. 5.B, we hold the number of sub-goals constant and vary the number of states in the grid-world. We can see the clear advantage the tLMDP gains from separating the low-level GS-ensemble and jump-operator computation from the task-policy optimization. The plots show that by working in the lower-dimensional GS, the tLMDP can optimize over much larger base state-spaces with a large number of sub-goals. Value-iteration, however, optimizes the task policy over the full product space, hindering scalability. In Fig. 5.C, we fix the number of grid-world states and only vary the number of sub-goals: we expect an exponential increase in run-time for both methods, given that the complexity is dominated by the exponential growth of the task space. However, since the tLMDP works in the GS, it only needs to compute the base policies and the jump operator, whereas value-iteration is again burdened by the extra dimensionality of the non-GS internal states. The tLMDP curves for all grid-world sizes remain relatively similar, due to the fact that the GS dimensionality is not a function of internal state count, whereas value-iteration sees a progressively larger leftward shift of the curve for each state-space size.
We demonstrate the flexibility granted by the ensemble in Fig. 5.D, which plots the total time for computing many different tasks grounded in the same environment. Value-iteration does not have a policy ensemble at its disposal, so it must solve each new task from scratch, resulting in a worst-case constant-time increase per task proportional to the full product-space solve [5]. Alternatively, the tLMDP can remap the goal variables to reusable ensemble policies to form a new GS, and therefore does not require the computation of a new ensemble, jump operator, or feasibility function (which contribute a one-time upfront cost, manifesting as a higher intercept); it only requires recomputation of (10). This results in a constant time-per-task increase proportional to the grounded-subspace solve (Appx. 5.7). For a modest number of sub-goals this solution can be recomputed very quickly: the green line shows negligible growth in time, with each regrounded nine-subgoal task adding only a small amount of computation. The speed with which this remapping and task re-optimization is possible could have important applications for real-time online systems which can learn and rapidly adapt to changes in the structure of the task.
3.7 Compositionality and Transfer
A key advantage of computing the solution in the GS pertains to Grounding Invariance (property 2), conferred by Policy Remappability (property 1). Consider a solution for the tLMDP problem under a given grounding. Policy remappability means that under a new grounding function, each GS index references a different policy and desirability vector, and also conditions different dynamics in $P_\pi$. Since $G_v$ and $P_\pi$ are the only two matrices whose elements are functions of the grounding, if these matrices are exactly the same under both groundings, then (10) is the same, holding the task matrices constant. Furthermore, (10) being a largest-eigenvector solution implies that if one grounding's $G_v$ is a scalar multiple of the other's, the two problems have the same solution. This means that the solution under the first grounding will preserve cost-minimization for the new grounding, which can change distances between sub-goals. This invariance is called a task- and cost-preserving Grounding Invariant Equation (tc-GIE), as in (12):
(12) | |||
(13) |
If, however, $G_v^{(1)}$ is not a scalar multiple of $G_v^{(2)}$, it is still possible for the two problems to share the same solution if $G_v$ is set to the identity matrix for both problems (policy cost equal to zero). This means that the task will be solved; however, the policy will not minimize the low-level trajectory costs on the base space. This is a relaxed version of grounding invariance called the task-preserving Grounding Invariant Equation (t-GIE), (13). Fig. 5.E shows the relationship between the regrounded tLMDP (from 5.D) and a t-GIE problem with the same task specification. Whereas the former incurs a slight additional time cost for re-optimizing (10) under every new grounding, the t-GIE problem is solved with no further optimization, resulting in zero-shot transfer. Furthermore, regardless of whether two problem formulations are related by (12) or (13), the ordering-violation costs in $G_o$ are hard constraints (infinite cost), so as long as the task and the coupled passive dynamics remain constant between the two groundings, a policy for one problem can be used to solve the other; however, the sub-goal dynamics will retain the biases of the original problem's low-level policy costs, dictated by its obstacle configuration.
Figure 6 demonstrates both GIE equations for an environment with a cliff that the agent can descend, but not ascend, resulting in different transition matrices depending on the grounding. Since $P_\pi$ is a function of the grounding only indirectly, through the feasibility function, differences can only arise in $P_\pi$ under different sub-goal connectivity structures; these connectivity structures represent the global connectivity of the sub-goals. Furthermore, recall that $G_v$ has diagonal entries given by the concatenated state-action vector of all low-level desirability functions in the GS. Because the desirability function is defined as $z = e^{-v}$, constant multiples of $z$ translate into constant additive offsets of $v$ (i.e. $cz = e^{-(v+\delta)}$ with $\delta = -\log c$). Since we are working with shortest-path policies, to better illustrate the tc-GIE relationship in Figure 6 we construct a matrix of shortest-path distances, derived from the value function divided by the internal state cost, i.e. $d = v/c_{int}$, where $c_{int}$ is the constant internal state-action cost used to define the low-level problems and $v$ is the value function corresponding to the low-level policy with the given boundary state. The new grounding functions preserve the solution under an additive constant to inter-subgoal distances, which translates into a multiplicative constant between the desirability functions used to define $G_v$. Therefore, even between environments with different obstacle constraints, we can verify that two tasks in the grounded subspace are exactly the same because they share the same underlying objective function. The t-GIE is also depicted in Figure 6: one of the depicted tasks does not share a policy-cost matrix related by a multiplicative constant with the others; however, the feasibility structure remains the same, and we can therefore set $G_v = I$ for both problems to obtain the same objective function.
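A sketch of checking the two invariance conditions numerically for two groundings, given their coupled passive dynamics and the diagonals of their policy-cost matrices (assumed strictly positive); this is our own illustrative check, not the paper's procedure.

```python
import numpy as np

def gie_relation(Gv1, Gv2, P1, P2, rtol=1e-9):
    """Return 'tc-GIE' if the policy-cost diagonals are scalar multiples and the
    coupled dynamics match, 't-GIE' if only the dynamics match (so Gv must be set
    to the identity), and None otherwise."""
    if not np.allclose(P1, P2, rtol=rtol, atol=1e-12):
        return None                            # feasibility structure differs
    ratio = Gv1 / Gv2
    if np.allclose(ratio, ratio[0], rtol=rtol):
        return "tc-GIE"                        # same solution, low-level costs preserved
    return "t-GIE"                             # same task only, drop low-level costs
```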
4 Discussion
JODP employs specific representations and architectural motifs which contribute to the flexibility and scalability of the framework. The construction of the policy ensemble and jump operator is computationally decoupled from the global tLMDP task-optimization. This decoupling is necessary for scalability: both objects can be parallelized and the jump operator is required to work in the grounded-subspace, which is critical for the method to scale to large state-space sizes. Crucially, because jump operators are paired with cost functions derived from the low-level value function, the method effectively performs multi-step tasks while integrating all of the low-level cost information for each successive jump, which is a new abstraction technique for hierarchical control.
While the focus of this paper was on exact dynamic programming methods, the motifs have important implications for learning frameworks. As we have shown, the task can be a structured object which can be traversed by a first-exit policy, where task rules correspond to a cost function which shapes permissible dynamics on this space; the reward is not the task. While it is certainly true that the value function is interpreted as an accumulated reward or cost, first-exit value functions are endowed with additional explicit meaning: the accumulated cost until hitting the boundary state. This explicitness is necessary for the property of remappability, which allows low-level policies to generalize to new tasks by mapping the boundary state-actions to high-level actions. This also suggests that many RL algorithms, insofar as they are used for Artificial Intelligence, do not possess the requisite constraints on the form of their policies to make them composable and remappable to modular and transferable tasks; future learning methods could benefit from utilizing shortest-path policies as low-level representations along with reusable task-spaces.
In conclusion, JODP provides key new representations for hierarchical control along with a theory of grounding-invariant zero-shot transfer, which has major implications for computationally efficient and interpretable artificial intelligence.
Acknowledgments
Thank you to Keno Juchems and Andrew Saxe for valuable discussions and feedback.
References
- [1] R. Bellman, “A Markovian decision process,” Journal of mathematics and mechanics, pp. 679–684, 1957.
- [2] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
- [3] T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in Advances in neural information processing systems, 1994, pp. 703–710.
- [4] J. N. Tsitsiklis, “Asynchronous stochastic approximation and Q-learning,” Machine learning, vol. 16, no. 3, pp. 185–202, 1994.
- [5] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- [6] H. J. Kappen, “Linear theory for control of nonlinear stochastic systems,” Physical review letters, vol. 95, no. 20, p. 200201, 2005.
- [7] E. Todorov, “Efficient computation of optimal actions,” Proceedings of the national academy of sciences, vol. 106, no. 28, pp. 11 478–11 483, 2009.
- [8] A. Jonsson and V. Gómez, “Hierarchical linearly-solvable Markov decision problems.”
- [9] A. M. Saxe, A. C. Earle, and B. Rosman, “Hierarchy through composition with multitask LMDPs,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 3017–3026.
- [10] D. Tirumala, H. Noh, A. Galashov, L. Hasenclever, A. Ahuja, G. Wayne, R. Pascanu, Y. W. Teh, and N. Heess, “Exploiting hierarchy for learning and transfer in KL-regularized RL,” arXiv preprint arXiv:1903.07438, 2019.
- [11] S. Nasiriany, V. Pong, S. Lin, and S. Levine, “Planning with goal-conditioned policies,” in Advances in Neural Information Processing Systems, 2019, pp. 14 814–14 825.
- [12] T. J. Ringstrom and P. R. Schrater, “Constraint satisfaction propagation: non-stationary policy synthesis for temporal logic planning,” arXiv preprint arXiv:1901.10405, 2019.
- [13] L. Illanes, X. Yan, R. Toro Icarte, and S. A. McIlraith, “Symbolic plans as high-level instructions for reinforcement learning.”
- [14] M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, and I. Lee, “Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees,” in Proceedings of the 58th Conference on Decision and Control. IEEE, 2019, pp. 5338–5343.
- [15] M. Hasanbeig, A. Abate, and D. Kroening, “Cautious reinforcement learning with logical constraints,” in Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2020, pp. 483–491.
- [16] R. Toro Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,” in International Conference on Machine Learning, 2018, pp. 2107–2116.
- [17] M. Hasanbeig, A. Abate, and D. Kroening, “Logically-Constrained Neural Fitted Q-Iteration,” in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2019, pp. 2012–2014.
- [18] ——, “Logically-constrained reinforcement learning,” arXiv preprint arXiv:1801.08099, 2018.
- [19] L. Z. Yuan, M. Hasanbeig, A. Abate, and D. Kroening, “Modular deep reinforcement learning with temporal logic specifications,” arXiv preprint arXiv:1909.11591, 2019.
- [20] M. Hasanbeig, A. Abate, and D. Kroening, “Certified reinforcement learning with logic guidance,” arXiv preprint arXiv:1902.00778, 2019.
- [21] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
- [22] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg, “Multi-level discovery of deep options,” arXiv preprint arXiv:1703.08294, 2017.
- [23] A. Barreto, D. Borsa, S. Hou, G. Comanici, E. Aygün, P. Hamel, D. Toyama, S. Mourad, D. Silver, D. Precup et al., “The option keyboard: Combining skills in reinforcement learning,” in Advances in Neural Information Processing Systems, 2019, pp. 13 031–13 041.
- [24] X. B. Peng, M. Chang, G. Zhang, P. Abbeel, and S. Levine, “MCP: Learning composable hierarchical control with multiplicative compositional policies,” in Advances in Neural Information Processing Systems, 2019, pp. 3681–3692.
- [25] J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 166–175.
- [26] C. Devin, D. Geng, P. Abbeel, T. Darrell, and S. Levine, “Plan arithmetic: Compositional plan vectors for multi-task control,” arXiv preprint arXiv:1910.14033, 2019.
- [27] G. N. Tasse, S. James, and B. Rosman, “A Boolean task algebra for reinforcement learning,” arXiv preprint arXiv:2001.01394, 2020.
- [28] R. S. Sutton, A. G. Barto et al., Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135.
- [29] P. Brémaud, Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer Science & Business Media, 2013, vol. 31.
- [30] L. Bianco, A. Mingozzi, S. Ricciardelli, and M. Spadoni, “Exact and heuristic procedures for the traveling salesman problem with precedence constraints, based on dynamic programming,” INFOR: Information Systems and Operational Research, vol. 32, no. 1, pp. 19–32, 1994.
5 Appendix
5.1 Function Dependency Graph
5.2 saLMDP Derivation
We begin by defining the state as a state-action vector $x = (s,a)$ and substituting this state into the LMDP equation (1):
(14) |
We now decompose the joint distributions using the product rule and impose a dynamics constraint. We would like the low-level state dynamics of the saLMDP, $p_x$, to behave like an uncontrollable physical system, with control achieved over the state-space only by manipulating the action distribution, $u_a$. To enforce this condition we introduce the constraint that $u_x = p_x$, and we minimize over $u_a$.
(15) | |||
(16) |
We consider (16) to be the canonical form of the saLMDP objective function.
After transforming the cost-to-go, $v(x)$, into desirability via $z(x) = e^{-v(x)}$, the derivation proceeds as follows:
(17) | |||
(18) | |||
(19) | |||
(20) |
where $\mathcal{N}[z]$ is a normalization term that satisfies the KL-divergence requirement that the distribution terms sum to 1.
(21) |
By setting the policy to the normalized expression in (20), the third term goes to zero.
(22) |
This leaves two terms that do not depend on the minimization argument $u_a$. The resulting equation is:
(23) | |||
(24) |
The above equation can be written in matrix-vector notation with nested non-linearities, but it requires us to construct non-square matrices. This is because, if $z$ is a state-action desirability vector, the inner expectation requires us to sum over only a subset of the variables of the vector.
To construct the appropriate matrices, we encode the distributions $p_a$ and $p_x$ into two matrices, $P_a$ and $P_x$ respectively. $P_a$ is a rectangular matrix encoding $p_a(a'|s',a)$, where the additional variables in its column space are dummy variables; that is, elements can only be non-zero if the dummy variables in the column match the corresponding pair on the row. Similarly, $P_x$ is a rectangular matrix encoding $p_x(s'|s,a)$, which can only have non-zero entries when the pair on the row matches the corresponding dummy variables on the column. It is straightforward to check that multiplying $P_x$ and $P_a$ forms a matrix of the joint distribution. $P_a$ allows us to take the expectation over the action variables, followed by the point-wise nonlinearity, and $P_x$ then takes the expectation over the state variables.
Now written in matrix notation we have:
(25) |
Since the exponential and logarithm are element-wise functions, Jensen's inequality implies the following element-wise inequality:
(26) |
However, Jensen's inequality holds with equality when the expectation is over a deterministic distribution. Therefore, if every row of the state-dynamics matrix is a deterministic distribution, then we can exchange the order of the point-wise exponential with the matrix multiplication:
(27) | ||||
(28) | ||||
(29) | ||||
(30) |
where $P_{xx'}$ is the matrix of the joint passive distribution. The original non-linear mapping reduces to a linear equation, and the desirability can be solved for as the largest eigenvector of $GP_{xx'}$. We can solve for $z$ by iterating either (25), in the stochastic case, or (30), in the deterministic case, until convergence, and then plug $z$ into (22) to obtain the optimal policy.
5.3 tLMDP Derivation
The tLMDP differs only in its transition operators and its number of cost functions. Following the same derivation as above, we arrive at:
(31) | |||
(32) |
Following the previous method for the saLMDP solution derivation, in matrix-vector notation and under deterministic dynamics this results in:
(33) |
5.4 Convergence of iterative non-linear map
Here we show that the iterative mapping for the stochastic case converges to a fixed point. The proof uses Banach's fixed-point theorem, which requires us to show that (25) is a contraction mapping. To do so, we define the mapping as
(34) |
To demonstrate that the mapping is a contraction, we show that the distance between the images of any two desirability vectors is bounded by a factor strictly less than one times the distance between the vectors. Considering the state-action desirability vector, we write:
(35) |
(36) |
Thus, the mapping is a contraction, which implies via Banach's fixed-point theorem that the iterative algorithm will globally converge to a unique fixed point.
5.5 Finite-Horizon/First-Exit Decomposition (Grounded Subspace)
Because we are working with shortest-path first-exit control problems, all trajectories from any given period onward must be shortest paths (a period is a lumped time-step corresponding to the window of time during which the task state remains constant). This means that instead of working with the complete ensembles, we only need to work with the grounded-subspace ensemble. This is because any policy which controls to an internal state-action, followed by a policy that controls from there to a grounded state-action, can never have an accumulated cost-to-go lower than the value function of the direct shortest-path policy (because shortest-path policies have the lowest cost-to-go). The consequence of the GS restriction is that an agent that controls from a state inside the GS will remain inside the GS (given that the jump operator can only jump the agent to GS state-actions). Therefore, a first-exit tLMDP problem can be decomposed by separating the first period out of the Bellman equation and solving a first-exit (FE) problem starting in the GS, while solving the policy for the first period as a one-step finite-horizon (FH) problem which controls into the grounded subspace.
We define the Bellman equation in two parts, one for controlling within the grounded subspace (38), and one for controlling from outside the grounded subspace into the grounded subspace (37):
(37) |
where:
(38) |
where, in (38), the policies in both equations are restricted to policies in the grounded subspace.
5.6 Invertibility of $(I - U)$
$N$ is the solution to a Neumann series and represents the total expected state-action occupancies under the transient state-action Markov chain dynamics. Normally, the concern would be that if the Markov chain is reducible, then there is a communicating class of state-actions which does not communicate with the boundary state-action. This would result in infinite expected state-action visits, and thus $(I-U)$ would not be invertible.
However, since we are computing $NA$, and $A$ is a matrix of zeros except for the goal column, only that single column of the product is needed. Therefore, we are only interested in solving the linear system $(I-U)\,b = a$, where $a$ is the goal column of $A$ (known) and $b$ (unknown) is the vector of probabilities of absorption into the boundary state-action, starting from each transient state-action under the Markov chain dynamics. The solution $b$ is such that an agent starting at a given state-action will either arrive at the goal state-action with some positive probability, or it will not arrive at all, meaning $b$ is guaranteed to take on values between 0 and 1, inclusive.
5.7 Complexity
The run-time complexity of the tLMDP solution (10) depends on the specialized structure of the task. Because the ordering constraints preclude the agent from needing to "undo" sub-goals, the agent is never more than $n$ policy-calls away from the final task state, due to the fact that it takes at most $n$ bit flips to solve the task. This means that we can initialize the desirability vector to zeros except for the boundary state-actions, which are set to the known boundary value of 1. The transition matrix is sparse, and power-iteration starting from this initialization converges within a number of iterations bounded by the number of sub-goals. Therefore, we require only that many sparse matrix-vector multiplications, for a total floating-point cost proportional to the number of non-zero entries times the number of iterations.
5.8 Algorithm Pseudocode
5.9 Jump Operator Vs. Full Operator Value Function Equivalency
Here we show that a Bellman equation based on a deterministic full transition operator with a deterministic policy achieves the same value function as a Bellman equation based on the jump operator, when the jump-operator cost function is set to be the value function of the underlying ensemble policy.
Let us assume that the "full operator" and the policy are deterministic.
We begin with the definition of the optimal value function for a deterministic first-exit control problem:
(39) |
Unrolling the recursion we have:
(40) |
We can decompose this summation by isolating constant variables within each transition, creating a new index for each period during which the task state remains constant, and an index for the inter-period discrete time steps:
(41) | |||
(42) | |||
(43) |
where the Markov chain is induced by applying the policy. By conditioning the transition operator on the first time step of each period, we have:
(44) | |||
(45) |
where the first map selects the set of optimal policies conditioned on the task state, and the second is the optimal state-action map conditioned on the task state being constant. We omit the timestamps on the arguments of the policy, since it is a stationary policy.
Note that the term inside the first summation in (45) is itself a first-exit control problem, which can be defined as:
(46) |
Substituting this for the inner summation, we obtain:
(47) |
Given an optimal policy, we have no a priori knowledge of which terminal states are traversed under the dynamics. However, if we define a new set of first-exit problems, each specified by an individual terminal state, we can substitute these individual first-exit problems into the overall problem.
In (47) we are compressing the inner loop into a policy evaluation with the (deterministic) probability of jumping from the initial state-action of the period to its terminal state-action. Since the inner problem does not depend on the task state, we can reuse the associated policy for any task state. Therefore, we can introduce a set of policies, derived from a set of value functions, the induced Markov chains, and the corresponding jump operator, which stand in for the original per-period quantities. That is, if we can show that the required value function is contained within this set, and that the same jump can be produced by the corresponding policy in the set, then we can write an optimization which substitutes this pair for the original functions.
We define a new high-level policy over the ensemble. Since we are working with deterministic dynamics and policies, there is only a single trajectory followed by the agent under the optimal policy, implying that only one of the boundary states is traversed. This means that, for a given period, there exists a way to condition the ensemble policy to achieve the same terminal state.
We call the policy that reaches the single traversed boundary state-action of the full-operator policy the jump-optimal policy.
Because the jump-optimal policy is a deterministic shortest path to the same boundary state as the full-operator policy, the corresponding ensemble value function must be the same as the full-operator value function evaluated at the given state. Therefore, the jump-optimal policy is also the value-optimal policy to select from that state, and it must be equivalent in value to the policy followed from it. We can then write the corresponding equivalence.
Substituting the jump operator and the optimal ensemble value function into the optimization in (47), we obtain:
(48) |
Replacing the optimal policy with a minimization over policies, we have:
(49) | |||
(50) | |||
(51) |
where .
We have thus obtained a new Bellman equation with a different transition operator, the "jump operator", which is guaranteed to have the same solution as the Bellman equation with the full operator when the cost function is defined as the value function of the ensemble policy.