Hierarchical Reinforcement Learning for Deep Goal Reasoning:
An Expressiveness Analysis
Abstract
Hierarchical DQN (h-DQN) is a two-level architecture of feedforward neural networks where the meta level selects goals and the lower level takes actions to achieve the goals. We show tasks that cannot be solved by h-DQN, exemplifying the limitation of this type of hierarchical framework (HF). We describe the recurrent hierarchical framework (RHF), generalizing architectures that use a recurrent neural network at the meta level. We analyze the expressiveness of HF and RHF using context-sensitive grammars. We show that RHF is more expressive than HF. We perform experiments comparing an implementation of RHF with two HF baselines; the results corroborate our theoretical findings.
1 Introduction
Hierarchical DQN (h-DQN) (Kulkarni et al. 2016) is a reinforcement learning (RL) system that combines deep RL and goal reasoning. h-DQN uses a two-level hierarchical architecture: a meta controller at the top level and a controller at the lower level. The meta controller and the controller operate at different temporal scales. The meta controller receives a state and selects a goal g, which can be achieved in some state. The controller receives the current state and g; it takes actions until g is achieved in a state s'. Then the meta controller receives s' and selects the next goal; the process repeats. Both levels of h-DQN use deep Q-networks (DQN) (Mnih et al. 2015) to jointly learn value functions; the meta controller tries to maximize the cumulative reward from the environment, whereas the controller is motivated by user-defined intrinsic rewards.
Goal reasoning is the study of agents that are capable of reasoning about their goals and changing them when the need arises (Aha 2018; Muñoz-Avila, Dannenhauer, and Reifsnyder 2019). h-DQN selects its goals and plans toward achieving them while adapting the goal selection process using RL. Thus h-DQN can be considered a goal reasoning agent.
To exemplify the capability and limitation of h-DQN, consider the one-dimensional corridor environment shown in Figure 1 (top), which is based on the environments in (Kulkarni et al. 2016; Osband et al. 2016). The corridor consists of 7 states arranged from left to right. An agent starts in the circled state and can move left or right; the gray state is the terminal state. In task (i), the agent receives a reward of 1 if it visits the asterisk state at least once and then reaches the terminal state; otherwise, the reward is 0. Figure 1 (a) shows a trajectory that results in a reward of 1 in task (i). h-DQN is able to generate this trajectory and obtain the maximum reward in task (i). Now consider task (ii): the agent must visit the asterisk state at least twice to receive a reward of 1. We will show that it is impossible for h-DQN to deterministically generate a trajectory (e.g., Figure 1 (b)) that solves task (ii).
h-DQN can be considered an instance of the hierarchical framework (HF): a two-level architecture of feedforward neural networks that can use any RL algorithm, not just Q-learning. HF can use different algorithms for the meta controller and the controller, whereas h-DQN uses the same algorithm at both levels.
Consider using a recurrent neural network (RNN) in place of the feedforward network in the HF meta controller. We refer to this type of architecture as the recurrent hierarchical framework (RHF). A related instance is feudal networks (FuNs) (Vezhnevets et al. 2017), which uses RNNs at both levels. To investigate the effects of a recurrent meta controller, we focus on the case where HF and RHF use the same controller, isolating the difference in their meta controllers.
In the remainder of this paper, we describe HF and RHF, and construct two types of context-sensitive grammars that capture the expressiveness of HF and RHF respectively. Using the grammars, we show that RHF is more expressive than HF: (1) any state-goal trajectory generated by an HF system can be generated by an RHF system; (2) some state-goal trajectories can be generated by RHF but not by HF. We present an implementation of RHF. We perform experiments comparing this implementation with two HF baselines; the results corroborate our expressiveness analysis of HF and RHF.
2 Related Work
The options framework (Sutton, Precup, and Singh 1999) defines an option as a temporally extended course of actions consisting of three components: an initiation set I, a policy π, and a termination condition β. An agent in a state s can select an option if and only if s ∈ I. If the option is selected, the agent takes actions according to π until the option terminates according to β.
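To make the three components concrete, here is a minimal sketch (our illustration, not code from the cited work); the state names are hypothetical, and termination is simplified to a deterministic predicate rather than a termination probability:

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class Option:
    """An option as a triple: initiation set I, policy pi, termination condition beta
    (simplified here to a deterministic predicate)."""
    initiation_set: Set[State]            # states in which the option may be selected
    policy: Callable[[State], Action]     # action to take while the option is executing
    terminates: Callable[[State], bool]   # whether the option terminates in a state

def can_select(option: Option, state: State) -> bool:
    # An option is available in a state if and only if the state is in its initiation set.
    return state in option.initiation_set

# Hypothetical example: keep moving right until a designated state is reached.
go_right = Option(initiation_set={"s1", "s2"}, policy=lambda s: "right",
                  terminates=lambda s: s == "s3")
print(can_select(go_right, "s2"), can_select(go_right, "s3"))   # True False
```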
DQN (Mnih et al. 2015) combines a semi-gradient Q-learning algorithm (Watkins and Dayan 1992) and a deep convolutional neural network (CNN) (LeCun et al. 1998) to learn to play Atari 2600 games. To improve stability, DQN uses experience replay and maintains both a Q-network and a target network. The Q-network provides the behavior policy and is updated online. The target network, which is used to compute the temporal difference error, is updated periodically and remains fixed between updates. DQN plays at or above human level in 29 of the games but performs poorly in Montezuma's Revenge, a game that requires strategic exploration due to sparse rewards. Hierarchical deep RL systems such as h-DQN (Kulkarni et al. 2016) and FuNs (Vezhnevets et al. 2017) achieve better performance than DQN in Montezuma's Revenge.
The deep recurrent Q-network (DRQN) (Hausknecht and Stone 2015) replaces the fully-connected layer in DQN with a layer of long short-term memory (LSTM) units (Hochreiter and Schmidhuber 1997), which provides the capability to integrate state information through time. DRQN learns to play several partially observable versions of Atari games and achieves results comparable to DQN.
FuNs (Vezhnevets et al. 2017) is a two-level architecture consisting of a manager and a worker. Both modules are recurrent: the worker uses a standard LSTM; the manager uses a dilated LSTM that extends its state memory. The manager is similar to a meta controller; the distinction is that the goals in FuNs are directional whereas the goals in h-DQN are states.
3 Background
Reinforcement Learning.
We consider reinforcement learning for episodic tasks. A decision-making agent interacts with an external environment. The environment is modeled by a finite set of nonterminal states, a terminal state, a finite set of actions, a transition function, and a reward function defined over nonempty state sequences (the Kleene plus operation on the set of states is used in the reward function's domain). At each time step t, the agent observes a state s(t) and takes an action a(t). At the next time step, the environment transitions to s(t+1) and the agent receives a reward r(t+1). (The parentheses in a subscript indicate that the subscript denotes a time step; for example, s(0) is the environment's state at time step 0. Without the parentheses, the subscript denotes a distinct element of a set. This notation applies to states, actions, and goals.) The process repeats until the terminal state is reached. The objective of the agent is to take actions in a way that maximizes the discounted return (Sutton and Barto 2018):
G(t) = Σ_{k=t+1}^{T} γ^{k−t−1} r(k), where γ ∈ [0, 1] is the discount rate and T is the terminal time step.
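As a small illustration of this objective (our sketch, using the standard episodic definition):

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one finite episode,
    with rewards listed in the order they are received."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A sparse-reward episode in which only the final step is rewarded.
print(discounted_return([0, 0, 0, 1], gamma=0.9))   # 0.9**3 = 0.729
```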
Context-Sensitive Grammars.
A context-sensitive grammar (CSG) is a 4-tuple G = (N, Σ, S, P), where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols, S ∈ N is the start symbol, and P is a finite set of production rules. All the rules in P are of the form
αAβ → αγβ, where A ∈ N, α, β ∈ (N ∪ Σ)*, and γ ∈ (N ∪ Σ)+.
Given a set of strings V, the k-th power of V (i.e., the concatenation of V with itself k times) is recursively defined as follows:
V^0 = {ε} and V^k = V^{k−1}V for k ≥ 1, where ε is the empty string.
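As a small illustration (ours), a CSG can be stored as a 4-tuple and the k-th power of a set of strings computed directly from the recursive definition; the representation below is illustrative, not the one used later in the paper:

```python
from collections import namedtuple
from itertools import product

# A CSG as a 4-tuple: nonterminals N, terminals Sigma, start symbol S, production rules P.
CSG = namedtuple("CSG", ["nonterminals", "terminals", "start", "rules"])

def kth_power(strings, k):
    """k-th power of a set of strings: V^0 = {""}, and V^k = V^(k-1) concatenated with V."""
    if k == 0:
        return {""}                                  # the set containing only the empty string
    return {u + v for u, v in product(kth_power(strings, k - 1), strings)}

print(kth_power({"a", "b"}, 2))                      # {'aa', 'ab', 'ba', 'bb'}
```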
4 Expressiveness Analysis
When an HF or RHF system interacts with an environment, the meta controller receives states and generates goals, forming a trajectory of alternating states and goals. For example, the state-goal trajectory in Figure 1 (a) can be written as a string of alternating state and goal symbols. To analyze the expressiveness of HF and RHF, we define two special types of CSGs that generate strings representing state-goal trajectories. A synopsis of our arguments is as follows:
1. We describe how HF and RHF generate goals in an environment. Then we define two types of CSGs, constrained and k-recurrent, to represent HF's and RHF's goal generation respectively.
2. We prove that a constrained CSG is a special case of a k-recurrent CSG. In other words, for any constrained CSG, there exists a k-recurrent CSG such that both grammars generate the same strings.
3. We prove that constrained CSGs cannot generate a certain string, whereas there exists a k-recurrent CSG that can generate it.
Combined, these points imply that RHF is more expressive than HF.
In our analysis, we make the assumption of deterministic policies: the system selects goals and actions deterministically after training. RL algorithms oftentimes use an exploratory stochastic policy. For example, h-DQN uses an ε-greedy policy to select goals and actions during training; the best goal or action is selected with probability 1 − ε and a random one is selected with probability ε. As learning proceeds, ε is gradually reduced to a small value. We want to analyze the system's behavior after it has converged to deterministic policies and hence make this assumption.
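For example, a generic ε-greedy selector (our sketch) is stochastic during training and reduces to the deterministic policy assumed in this analysis when ε reaches 0:

```python
import random

def epsilon_greedy(values, epsilon):
    """Return the index of the best value with probability 1 - epsilon, a random index otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

q = [0.1, 0.7, 0.3]
print(epsilon_greedy(q, epsilon=0.1))  # mostly 1, occasionally a random goal/action index
print(epsilon_greedy(q, epsilon=0.0))  # always 1: the deterministic limit assumed here
```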
4.1 Expressiveness of HF
HF is a two-level architecture consisting of a meta controller and a controller. At a time step t, the meta controller receives a state s(t) and selects a goal g(t) from a finite set of goals G. From t to a time step t′ at which either g(t) is achieved or the terminal state is reached, the controller receives the current state and g(t), and takes actions from the action set A. At t′, if the terminal state is not reached, the meta controller receives s(t′) and selects the next goal; the process repeats. Regardless of implementation details, under the assumption of deterministic policies, the meta controller defines a function from nonterminal states to goals.
To capture the expressiveness of HF, we construct a type of CSG encompassing the following components:
(a) There can be one or more starting states. For every starting state, we add one rule that rewrites the start symbol into a string containing that state and a nonterminal symbol representing the meta controller.
(b) For every state-goal pair defined by the meta controller's function, we add one rule that rewrites the meta-controller symbol, in the context of that state, into the selected goal followed by a nonterminal symbol representing the controller; the resulting expression is used to derive a new state, as explained in (c).
(c) From the time the meta controller hands over control to the time the controller relinquishes it, the controller receives the state-goal pair and takes actions from the action set. This iteration has three possible outcomes:
i. The controller returns control to the meta controller when the goal is achieved in a state that is not the terminal state. This is captured by a rule involving the state in which control was handed to the controller, the selected goal, and the state in which the goal is achieved.
ii. The episode terminates when the terminal state is reached (the goal may or may not be achieved there). This is captured by a rule that introduces the terminal state.
iii. The iteration gets stuck in an infinite loop because the controller never achieves the goal and the terminal state is never reached. This is captured by a rule that does not yield any string.
We formally define this type of CSG as follows.
Definition 1.
A context-sensitive grammar G = (N, Σ, S, P) is constrained if
1. the set of nonterminals N consists of the start symbol and the nonterminal symbols representing the meta controller and the controller;
2. the set of terminals Σ consists of the states and the goals;
3. the set P contains only production rules of the following forms:
(a) for every starting state, there exists exactly one rule of form (a) in the construction above, rewriting the start symbol;
(b) for every nonterminal state, there exists exactly one rule of form (b), rewriting the meta-controller symbol in the context of that state into a goal (drawn from the finite set of goals) followed by the controller symbol;
(c) for every combination of a state and a goal, there exists exactly one rule of one of the following forms: (i) a rule of form (c)i, in which control returns to the meta controller in a nonterminal state; (ii) a rule of form (c)ii, in which the terminal state is reached; or (iii) a rule of form (c)iii, which does not yield any string.
Example 1.
This example shows a constrained CSG that generates a string representing the state-goal trajectory in Figure 1 (a), which explains how an HF system solves corridor task (i). We define each goal to be achieved in its corresponding state.
Consider a constrained CSG for this task whose terminal state is the gray state in Figure 1. Its rule set includes a start rule of form (a), one rule of form (b) selecting a goal in each state the meta controller observes, and rules of form (c) for the controller.
Applying these rules derives a string whose prefix represents the target state-goal trajectory; the remaining suffix lists the visited states in reverse order and is an artifact of the derivation.
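To show the mechanics of such a derivation, the following sketch rewrites a string using rules shaped like forms (a)-(c); the symbols (s1, s4, s7, g4, g7) and the exact rule shapes are hypothetical stand-ins rather than the grammar of Example 1:

```python
# Hypothetical rules in the spirit of forms (a)-(c); symbol names are illustrative only.
rules = [
    ("S",       "s1 M s1"),      # form (a): start in s1 and introduce the meta controller M
    ("M s1",    "g4 C g4 s1"),   # form (b): in s1 the meta controller selects goal g4
    ("M s4",    "g7 C g7 s4"),   # form (b): in s4 it selects goal g7
    ("C g4 s1", "s4 M s4 s1"),   # form (c)i: pursuing g4 from s1, control returns in s4
    ("C g7 s4", "s7 s4"),        # form (c)ii: pursuing g7 from s4, the terminal state s7 is reached
]

def derive(start="S", max_steps=20):
    string = start
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in string:
                string = string.replace(lhs, rhs, 1)
                break
        else:
            return string            # no rule applies: the derivation is finished
    return string

# Prefix "s1 g4 s4 g7 s7" is the state-goal trajectory; the trailing "s4 s1" is the
# reverse-order artifact mentioned above.
print(derive())                      # s1 g4 s4 g7 s7 s4 s1
```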
Theorem 1.
Let G be a constrained context-sensitive grammar, let s be a state, and let g1 and g2 be goals with g1 ≠ g2. Then G cannot generate a string that contains both s g1 and s g2 as substrings.
Proof.
Observe that the only rules that place a goal immediately to the right of a state symbol are the rules of form (b) in Definition 1. Assume that G can generate a string that contains both s g1 and s g2 as substrings. Then P must contain at least the following rules of form (b) for the state s:
(1) a rule that rewrites the meta-controller symbol, in the context of s, into g1 followed by the controller symbol;
(2) the corresponding rule for g2.
By assumption g1 ≠ g2, so rules (1) and (2) are distinct. However, Definition 1 requires that P contain at most one rule of form (b) for each state, so rules (1) and (2) must coincide, which forces g1 = g2 and contradicts g1 ≠ g2. Therefore Theorem 1 holds.
∎
Theorem 1 implies that constrained CSGs cannot generate the string representing the state-goal trajectory in Figure 1 (b): that trajectory contains the same state followed by two different goals. Consequently, HF systems cannot generate this trajectory.
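The intuition behind this limitation can be checked mechanically: a deterministic HF meta controller maps each state to exactly one goal, so no trajectory may contain the same state followed by two different goals. A small sketch (ours; the state and goal names are hypothetical):

```python
def consistent_with_deterministic_hf(state_goal_pairs):
    """True iff no state is followed by two different goals in the trajectory."""
    state_to_goal = {}
    for state, goal in state_goal_pairs:
        if state_to_goal.setdefault(state, goal) != goal:
            return False
    return True

# Task (i): one goal per state suffices.
print(consistent_with_deterministic_hf(
    [("start", "go_to_asterisk"), ("asterisk", "go_to_terminal")]))          # True
# Task (ii): the asterisk state needs two different follow-up goals -- impossible for HF.
print(consistent_with_deterministic_hf(
    [("start", "go_to_asterisk"), ("asterisk", "go_to_start"),
     ("start", "go_to_asterisk"), ("asterisk", "go_to_terminal")]))          # False
```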
4.2 Expressiveness of RHF
RHF is a two-level architecture consisting of a meta controller and a controller. RHF differs from HF in that the RHF meta controller receives as input a sequence of states instead of a single state. Observe that the meta controller has a finite memory: it can recall a sequence of states of up to a certain length k. We define the meta controller's memory as the set of state sequences of length at most k, i.e., the union of the i-th powers of the set of nonterminal states for i = 0, ..., k, where k is a nonnegative integer. Such a sequence represents the chronologically ordered states observed by the meta controller before the current time step. At a time step t, the meta controller receives the remembered sequence together with the current state s(t) and selects a goal g(t); then s(t) is appended to the sequence. From t to a time step t′ at which either g(t) is achieved or the terminal state is reached, the controller receives the current state and g(t), and takes actions from the action set A. At t′, if the terminal state is not reached, the meta controller receives the sequence and s(t′) and selects the next goal; then s(t′) is appended to the sequence; the process repeats. Therefore, regardless of implementation details, under the assumption of deterministic policies, the meta controller defines a function from state sequences to goals.
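The contrast between the two frameworks can be summarized by the type of function each meta controller realizes; the sketch below (ours, with hypothetical state and goal names) shows an RHF-style policy for corridor task (ii) that no single-state HF policy can express:

```python
from typing import Any, Callable, Sequence

State, Goal = Any, Any

# Under the deterministic-policy assumption, an HF meta controller is a function of
# the current state only, whereas an RHF meta controller is a function of the
# remembered state sequence (up to some finite length).
HFMetaController = Callable[[State], Goal]
RHFMetaController = Callable[[Sequence[State]], Goal]

def rhf_corridor_meta(history: Sequence[State]) -> Goal:
    """Hypothetical RHF-style policy for corridor task (ii): reach the asterisk state
    twice, then head for the terminal state. State/goal names are illustrative."""
    visits = sum(1 for s in history if s == "asterisk")
    return "go_to_terminal" if visits >= 2 else "go_to_asterisk"

print(rhf_corridor_meta(["start"]))                                    # go_to_asterisk
print(rhf_corridor_meta(["start", "asterisk", "start"]))               # go_to_asterisk
print(rhf_corridor_meta(["start", "asterisk", "start", "asterisk"]))   # go_to_terminal
```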
We construct a type of CSG as follows to capture the expressiveness of RHF and provide a formal definition afterward.
(a) There can be one or more starting states. For every starting state, we add one rule that rewrites the start symbol into a string containing that state and a nonterminal symbol representing the meta controller.
(b) For every sequence-goal pair defined by the meta controller's function, we add one rule that rewrites the meta-controller symbol, in the context of that state sequence, into the selected goal followed by a nonterminal symbol representing the controller; the resulting expression is used to derive a new state, as explained in (c).
(c) The controller receives a state-goal pair and behaves the same way as in HF. Three outcomes are possible:
i. The controller returns control to the meta controller when the goal is achieved in a state that is not the terminal state, captured by a rule analogous to form (c)i above.
ii. The episode terminates when the terminal state is reached (the goal may or may not be achieved there), captured by a rule that introduces the terminal state.
iii. The iteration gets stuck in an infinite loop because the controller never achieves the goal and the terminal state is never reached, captured by a rule that does not yield any string.
Definition 2.
A context-sensitive grammar G = (N, Σ, S, P) is k-recurrent (k is a nonnegative integer) if
1. the set of nonterminals N consists of the start symbol and the nonterminal symbols representing the meta controller and the controller;
2. the set of terminals Σ consists of the states and the goals;
3. the set P contains only production rules of the following forms:
(a) for every starting state, there exists exactly one rule of form (a) in the construction above, rewriting the start symbol;
(b) for every combination of a state sequence of length at most k and a state, there exists zero or exactly one rule of form (b), rewriting the meta-controller symbol in the context of that sequence and state into a goal (drawn from the finite set of goals) followed by the controller symbol;
(c) for every combination of a state and a goal, there exists exactly one rule of one of the following forms: (i) a rule of form (c)i, in which control returns to the meta controller in a nonterminal state; (ii) a rule of form (c)ii, in which the terminal state is reached; or (iii) a rule of form (c)iii, which does not yield any string.
Example 2.
This example shows a k-recurrent CSG that generates a string representing the state-goal trajectory in Figure 1 (b), which explains how an RHF system solves corridor task (ii). We define each goal to be achieved in its corresponding state.
Consider a 2-recurrent CSG for this task whose terminal state is the gray state in Figure 1. Its rule set includes a start rule of form (a), rules of form (b) that select a goal conditioned on the remembered state sequence, and rules of form (c) for the controller.
Applying these rules derives a string whose prefix represents the target state-goal trajectory (in which the same state is followed by two different goals); the remaining suffix lists the visited states in reverse order and is an artifact of the derivation.
Proposition 1.
Any constrained context-sensitive grammar is 0-recurrent.
Proof.
All constrained CSGs satisfy the conditions of Definition 2. Definition 1 is a special case of Definition 2 where k = 0. Therefore Proposition 1 holds.
∎
Constrained CSGs and k-recurrent CSGs capture the expressiveness of HF and RHF respectively. Proposition 1 implies that for any constrained CSG, there exists a k-recurrent CSG such that both grammars generate the same strings. Theorem 1 implies that constrained CSGs cannot generate the string representing the trajectory in Figure 1 (b), which contains the same state followed by two different goals. Example 2 shows that there exists a k-recurrent CSG that can generate this string. Therefore RHF is more expressive than HF.
5 Implementation
We describe an implementation of RHF named Rh-REINFORCE, which is used in our experiments. Figure 2 shows the architecture of Rh-REINFORCE.
The meta controller is a policy approximator that uses an RNN. The input to the meta controller is a variable-length sequence of vectors, each of which represents a state observed by the meta controller at a time step. When the vector dimension is high (e.g., image pixels), the input is passed to two convolutional layers with rectifiers. The output of the convolutional layers is passed to a layer of gated recurrent units (GRU) (Cho et al. 2014). The output of the GRU layer is passed to a fully-connected softmax output layer with a number of units equal to the number of goals. When the vector dimension is low (e.g., the corridor task), the input is passed directly to a GRU layer followed by an output layer. The last vector in the output sequence, which corresponds to the most recent time step, is used as a probability distribution to select a goal.
The controller is an actor-critic model (Konda and Tsitsiklis 2000) that uses two feedforward neural networks. The actor has two fully-connected or convolutional layers (depending on the input dimension) with rectifiers, followed by a fully-connected softmax output layer. The critic has the same network structure as the actor except that the output layer uses a linear activation instead of softmax. The input to the controller is a state extended by a vector representing the goal selected by the meta controller.
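To make the architecture concrete, here is a minimal PyTorch sketch of the low-dimensional-input variant (our illustration; the hidden sizes, the one-hot encodings, and the class names are assumptions rather than the reported implementation):

```python
import torch
import torch.nn as nn

class MetaController(nn.Module):
    """GRU meta controller for low-dimensional state vectors (e.g., the corridor tasks)."""
    def __init__(self, state_dim, num_goals, hidden=64):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_goals)

    def forward(self, state_seq):                  # state_seq: (batch, time, state_dim)
        out, _ = self.gru(state_seq)
        logits = self.head(out[:, -1, :])          # use the most recent time step
        return torch.softmax(logits, dim=-1)       # probability distribution over goals

class Controller(nn.Module):
    """Feedforward actor-critic controller; the input is a state concatenated with a goal vector."""
    def __init__(self, state_dim, goal_dim, num_actions, hidden=(16, 32)):
        super().__init__()
        def body():
            return nn.Sequential(nn.Linear(state_dim + goal_dim, hidden[0]), nn.ReLU(),
                                 nn.Linear(hidden[0], hidden[1]), nn.ReLU())
        self.actor = nn.Sequential(body(), nn.Linear(hidden[1], num_actions))   # softmax policy head
        self.critic = nn.Sequential(body(), nn.Linear(hidden[1], 1))            # linear value head

    def forward(self, state_goal):
        return torch.softmax(self.actor(state_goal), dim=-1), self.critic(state_goal)

# Example shapes: a 7-state corridor with one-hot states, 2 goals, and 2 actions.
meta = MetaController(state_dim=7, num_goals=2)
ctrl = Controller(state_dim=7, goal_dim=2, num_actions=2)
goal_probs = meta(torch.zeros(1, 3, 7))                    # a length-3 state sequence
action_probs, value = ctrl(torch.zeros(1, 9))
print(goal_probs.shape, action_probs.shape, value.shape)   # (1, 2) (1, 2) (1, 1)
```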
Learning.
(Algorithm 1) The meta controller uses REINFORCE (Williams 1992) to learn a parameterized policy π(g | h; θ), where h is the sequence of states observed so far. (The state sequence here corresponds to the sequence received by the RHF meta controller in the expressiveness analysis.) The performance measure J(θ) is the expected return. The parameters θ are optimized by gradient ascent on J in the direction
(1) G(t) ∇_θ ln π(g(t) | h(t); θ)
The controller uses a one-step actor-critic method (Sutton and Barto 2018) to learn a policy π(a | s, g; θ_c) and a value function v(s, g; w). Compared to REINFORCE, this method uses the one-step return and the value function as a baseline to reduce variance. The parameters θ_c are optimized using the gradient
(2) δ ∇_{θ_c} ln π(a(t) | s(t), g(t); θ_c)
where δ = r(t+1) + γ v(s(t+1), g(t); w) − v(s(t), g(t); w) is the temporal difference error and r(t+1) is the intrinsic reward. The loss for the value function is the squared temporal difference error. The parameters w are optimized by semi-gradient descent on this loss, which moves w in the direction
(3) δ ∇_w v(s(t), g(t); w)
In an episode, the meta controller selects a goal (line 7). Until the goal is achieved or the terminal state is reached, the controller takes actions (line 11); after each action, the controller's policy and value-function parameters are updated using gradients (2) and (3) (line 15). After the goal is achieved, the state and goal are saved to construct a trajectory (line 17). The process repeats until the episode terminates. At the end of the episode, the trajectory is used to compute the returns in gradient (1); then the meta controller's parameters are updated using backpropagation through time for the length of the episode (line 21).
Algorithm 1 begins by initializing the meta controller policy, the controller policy, and the controller value function; its episode loop then follows the steps described above.
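The following is a schematic, runnable sketch of that episode loop (our rendering, not the authors' code); the environment interface, the goal test, and the tiny 3-cell demo are hypothetical placeholders, and the parameter updates are reduced to an optional callback:

```python
def run_episode(reset, step, is_terminal, achieved, select_goal, select_action,
                update_controller=None):
    """Schematic sketch of the episode loop in Algorithm 1. Every callable is a
    hypothetical placeholder standing in for the corresponding component."""
    state = reset()
    history = [state]                   # state sequence fed to the recurrent meta controller
    trajectory = []                     # (history, goal, external reward) tuples for REINFORCE
    while not is_terminal(state):
        goal = select_goal(history)                        # meta controller selects a goal (cf. line 7)
        ext_reward = 0.0
        while not is_terminal(state) and not achieved(state, goal):
            action = select_action(state, goal)            # controller takes an action (cf. line 11)
            state, reward = step(state, action)
            ext_reward += reward
            if update_controller:                          # actor-critic update (cf. line 15)
                update_controller(state, goal, action, reward)
        trajectory.append((tuple(history), goal, ext_reward))   # save for the policy gradient (cf. line 17)
        history.append(state)
    return trajectory   # used at episode end to update the meta controller (cf. line 21)

# Tiny hypothetical demo: a 3-cell corridor {0, 1, 2} whose terminal cell is 2.
reset = lambda: 0
step = lambda s, a: (min(s + 1, 2) if a == "right" else max(s - 1, 0), 0.0)
is_terminal = lambda s: s == 2
achieved = lambda s, g: s == g
select_goal = lambda hist: 1 if 1 not in hist else 2       # visit cell 1 first, then the terminal cell
select_action = lambda s, g: "right" if g > s else "left"
print(run_episode(reset, step, is_terminal, achieved, select_goal, select_action))
```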
6 Experiments
To demonstrate that RHF is more expressive than HF, we compare Rh-REINFORCE with two HF systems: h-REINFORCE (substituting a fully-connected layer for the GRU layer in Rh-REINFORCE) and h-DQN. The systems are trained in three variants of the corridor environment and a grid environment. In each environment, the systems use an optimal controller, which takes the actions that result in the shortest path from the current state to a state that satisfies a given goal (or the most likely actions if the environment is stochastic). Therefore, any difference in performance between the systems is due to their meta controllers.
6.1 Environments
Corridor.
(Figure 1 (b)) The environment is corridor task (ii). It has 7 states arranged from left to right; the gray state is the terminal state. The actions are left move and right move. The agent starts in the circled state; at each time step, it can move to the state immediately to its left or right. (Taking a right move in the rightmost state has no effect.) To constitute a visit to the asterisk state, the agent must move into it from a neighboring state. The agent receives a reward of 1 if it visits the asterisk state at least twice and then reaches the terminal state; otherwise, the reward is 0. Actions themselves yield no reward. An episode ends after 20 time steps; the agent receives a reward of 0 if it fails to reach the terminal state.
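As a concrete sketch of task (ii) (our illustration), the environment below assumes cells indexed 0-6 from left to right with the terminal state at cell 0, the start at cell 1, and the asterisk state at cell 6; the actual layout is the one shown in Figure 1, so these indices are assumptions:

```python
class CorridorTaskII:
    """Hypothetical rendering of corridor task (ii). Cells are indexed 0-6 left to right;
    we assume the terminal state is cell 0, the start is cell 1, and the asterisk state
    is cell 6 (assumptions, not the paper's exact layout)."""
    def __init__(self, max_steps=20):
        self.max_steps = max_steps

    def reset(self):
        self.pos, self.steps, self.asterisk_visits = 1, 0, 0
        return self.pos

    def step(self, action):                       # action: -1 = left move, +1 = right move
        self.steps += 1
        new_pos = min(max(self.pos + action, 0), 6)
        if new_pos == 6 and self.pos == 5:        # entering cell 6 from cell 5 counts as a visit
            self.asterisk_visits += 1
        self.pos = new_pos
        if self.pos == 0:                         # terminal cell reached
            return self.pos, (1.0 if self.asterisk_visits >= 2 else 0.0), True
        if self.steps >= self.max_steps:          # episode cut off without reaching the terminal cell
            return self.pos, 0.0, True
        return self.pos, 0.0, False

# A 13-step trajectory that visits the asterisk cell twice and then terminates.
env = CorridorTaskII()
env.reset()
for a in [+1] * 5 + [-1] + [+1] + [-1] * 6:
    state, reward, done = env.step(a)
print(reward, done)                               # 1.0 True
```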
Stochastic Corridor.
This is a modified version of Corridor based on the environment in (Osband et al. 2016). It has the same state configuration and reward function as Corridor. The difference is that when the agent takes a right move, it ends up in the state to its right with probability 0.5 and in the state to its left with probability 0.5. An episode ends after 20 time steps, with a reward of 0 if the terminal state is not reached (the same as Corridor).
Doom Corridor.
This is Corridor set in a Doom map created using the ViZDoom API (Kempka et al. 2016). It has the same reward function as Corridor. The agent always faces right and observes RGB game frames. Each input frame is resized before being passed to a convolutional layer.
Grid.
(Figure 3) This is a navigation task in a gridworld. The terminal state (gray) is above the first row. The landmark states are numbered 1, 2, and 3. The agent starts at the top-left state (circled). To obtain a reward of 1, the agent must visit landmark 1, return to the circled state, then visit landmark 2, return to the circled state, then visit landmark 3, return to the circled state, and finally reach the terminal state. The visits can be intertwined with additional movement and still result in a reward of 1. In all other cases the reward is 0. An episode ends after 60 actions. The task cannot be solved by HF because it requires a state-goal trajectory in which the same state (the circled state) is followed by at least two different goals, which by Theorem 1 cannot be generated by HF.
6.2 Training
In each environment, all three systems use the same fixed controller. Hence only the meta controller’s parameters are updated.
Rh-REINFORCE and h-REINFORCE use Algorithm 1 to learn softmax policies. (Intrinsic rewards and line 15 are not applicable because only the meta controller is being trained.) In Corridor and Doom Corridor, the training includes a random exploration phase for the initial 1000 episodes, where the meta controller selects random goals. This phase is not used in the other two environments.
h-DQN uses a one-step Q-learning algorithm with experience replay: the replay size is 100000; the batch size is 64; the target network update rate is 0.001. The goal selection is ε-greedy, with ε decaying from 1 to 0.01 over 15000 steps. Other hyperparameters are summarized in Table 1.
Table 1: Hyperparameters.

| | Rh-REINFORCE | h-REINFORCE | h-DQN |
|---|---|---|---|
| Corridor / Stochastic Corridor / Grid | GRU: 64 units | Dense 1: 16 units [ReLU]; Dense 2: 32 units [ReLU] | Dense 1: 16 units [ReLU]; Dense 2: 32 units [ReLU] |
| Doom Corridor | Conv 1: 32 filters, (8, 8) kernel, strides=4 [ReLU]; Conv 2: 64 filters, (4, 4) kernel, strides=2 [ReLU]; GRU: 256 units | Conv 1 and Conv 2 as in Rh-REINFORCE; Dense: 256 units [ReLU] | Conv 1 and Conv 2 as in Rh-REINFORCE; Dense: 256 units [ReLU] |
| Output activation | softmax | softmax | linear |
| Loss | categorical crossentropy | categorical crossentropy | Huber |
| Optimizer | Adam | Adam | RMSprop |
| Learning rate | 0.001 | 0.001 | 0.001 |
6.3 Results
Rh-REINFORCE outperforms h-REINFORCE and h-DQN in all the environments. Figure 4 shows the average episodic returns of 10 runs in each environment.
In Corridor, Rh-REINFORCE learns an optimal policy in 2000 episodes. h-REINFORCE and h-DQN achieve an average return of 0.013 and 0.116 respectively after 10000 episodes.
In Stochastic Corridor, Rh-REINFORCE, h-REINFORCE, and h-DQN achieve an average return of 0.416, 0.071, and 0.054, respectively. For comparison, a handcrafted optimal policy has an average return of 0.423; a random policy has an average return of 0.031.
In Doom Corridor, Rh-REINFORCE, h-REINFORCE, and h-DQN achieve an average return of 0.835, 0.029, and 0.006, respectively.
In Grid, Rh-REINFORCE learns an optimal policy in 14000 episodes. h-DQN and h-REINFORCE are unable to find an optimal policy after 20000 episodes.
7 Concluding Remarks
The experimental results are consistent with our expressiveness analysis. In the deterministic environments, the two HF systems are unable to generate the state-goal trajectories that result in the maximum reward; in contrast, the RHF system is capable of learning to generate the trajectories and thus obtaining a much higher reward. Since the policies used during training (i.e., softmax and ε-greedy) are not deterministic, it is possible for the HF systems to obtain a high reward by chance, but they are unable to do so consistently. The same conclusion holds in the stochastic environment.
We claim neither that an HF or RHF system can optimally solve an arbitrary RL task nor that they perform better than other deep RL systems. It is possible for an RL system without a hierarchical architecture to achieve a high performance in a variety of tasks. Our analysis focuses on the expressiveness of HF and RHF, i.e., the kinds of state-goal trajectories that the frameworks can and cannot generate.
For future work, we want to investigate the expressiveness of architectures that have additional recurrent levels on top of the meta level. We conjecture that a single recurrent level with a sufficient number of units can simulate any number of recurrent levels and thus adding more levels will not increase a system’s expressive power.
Acknowledgements
This research was supported by the Office of Naval Research grants N00014-18-1-2009 and N68335-18-C-4027 and the National Science Foundation grant 1909879. The views, opinions, and findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
References
- Aha 2018 Aha, D. W. 2018. Goal reasoning: Foundations, emerging applications, and prospects. AI Magazine 39(2):3–24.
- Cho et al. 2014 Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).
- Hausknecht and Stone 2015 Hausknecht, M., and Stone, P. 2015. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series.
- Hochreiter and Schmidhuber 1997 Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Kempka et al. 2016 Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; and Jaśkowski, W. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, 341–348. IEEE.
- Konda and Tsitsiklis 2000 Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In Advances in neural information processing systems, 1008–1014.
- Kulkarni et al. 2016 Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, 3675–3683.
- LeCun et al. 1998 LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
- Mnih et al. 2015 Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
- Muñoz-Avila, Dannenhauer, and Reifsnyder 2019 Muñoz-Avila, H.; Dannenhauer, D.; and Reifsnyder, N. 2019. Is everything going according to plan? expectations in goal reasoning agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9823–9829.
- Osband et al. 2016 Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, 4026–4034.
- Sutton and Barto 2018 Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
- Sutton, Precup, and Singh 1999 Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
- Vezhnevets et al. 2017 Vezhnevets, A. S.; Osindero, S.; Schaul, T.; Heess, N.; Jaderberg, M.; Silver, D.; and Kavukcuoglu, K. 2017. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3540–3549.
- Watkins and Dayan 1992 Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine learning 8(3-4):279–292.
- Williams 1992 Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.