
Hierarchical Reinforcement Learning for Deep Goal Reasoning:
An Expressiveness Analysis

Weihang Yuan    Héctor Muñoz-Avila
Department of Computer Science and Engineering, Lehigh University
{wey218, hem4}@lehigh.edu
Abstract

Hierarchical DQN (h-DQN) is a two-level architecture of feedforward neural networks where the meta level selects goals and the lower level takes actions to achieve the goals. We show tasks that cannot be solved by h-DQN, exemplifying the limitation of this type of hierarchical framework (HF). We describe the recurrent hierarchical framework (RHF), generalizing architectures that use a recurrent neural network at the meta level. We analyze the expressiveness of HF and RHF using context-sensitive grammars. We show that RHF is more expressive than HF. We perform experiments comparing an implementation of RHF with two HF baselines; the results corroborate our theoretical findings.

1 Introduction

Hierarchical DQN (h-DQN) (Kulkarni et al. 2016) is a reinforcement learning (RL) system that combines deep RL and goal reasoning. h-DQN uses a two-level hierarchical architecture: a meta controller at the top level and a controller at the lower level. The meta controller and the controller operate at different temporal scales. The meta controller receives a state s and selects a goal g, which can be achieved in some state. The controller receives the current state and g; it takes actions until g is achieved in a state s^{\prime}. Then the meta controller receives s^{\prime} and selects the next goal g^{\prime}; the process repeats. Both levels of h-DQN use deep Q-networks (DQN) (Mnih et al. 2015) to jointly learn value functions; the meta controller tries to maximize the cumulative reward from the environment whereas the controller is motivated by user-defined intrinsic rewards.

Goal reasoning is the study of agents that are capable of reasoning about their goals and changing them when the need arises (Aha 2018; Muñoz-Avila, Dannenhauer, and Reifsnyder 2019). h-DQN selects its goals and plans toward achieving them while adapting the goal selection process using RL. Thus h-DQN can be considered a goal reasoning agent.

To exemplify the capability and limitation of h-DQN, consider a one-dimensional corridor environment shown in Figure 1 (top), which is based on the environments in (Kulkarni et al. 2016; Osband et al. 2016). The corridor consists of 7 states from left to right: s_{0},\dots,s_{6}. An agent starts in s_{3} (circled) and can move left or right. s_{0} is the terminal state (gray). In task (i), the agent receives a reward of +1 if it visits s_{6} (asterisk) at least once and then reaches s_{0}; otherwise, the reward is +0.01. Figure 1 (a) shows a trajectory that results in a reward of +1 in task (i). h-DQN is able to generate this trajectory and obtain the maximum reward in task (i). Now consider task (ii): the agent must visit s_{6} at least twice to receive a reward of +1. We will show that it is impossible for h-DQN to deterministically generate a trajectory (e.g., Figure 1 (b)) that solves task (ii).

Figure 1: The corridor environment (top) and two trajectories, (a) and (b). The starting state is circled. The terminal state is gray. In corridor task (ii), trajectory (a) results in a reward of +0.01; trajectory (b) results in a reward of +1.

h-DQN can be considered an instance of the hierarchical framework (HF): a two-level architecture of feedforward neural networks that can use any RL algorithm, not just Q-learning. HF can use different algorithms for the meta controller and the controller, whereas h-DQN uses the same algorithm at both levels.

Consider using a recurrent neural network (RNN) in place of the feedforward network in the HF meta controller. We refer to this type of architecture as the recurrent hierarchical framework (RHF). A related instance is feudal networks (FuNs) (Vezhnevets et al. 2017), which uses RNNs at both levels. To investigate the effects of a recurrent meta controller, we focus on the case where HF and RHF use the same controller, isolating the difference in their meta controllers.

In the remainder of this paper, we describe HF and RHF, and construct two types of context-sensitive grammars that capture the expressiveness of HF and RHF respectively. Using the grammars, we show that RHF is more expressive than HF: (1) any state-goal trajectory generated by an HF system can be generated by an RHF system; (2) some state-goal trajectories can be generated by RHF but not by HF. We present an implementation of RHF and perform experiments comparing it with two HF baselines; the results corroborate our expressiveness analysis of HF and RHF.

2 Related Work

The options framework (Sutton, Precup, and Singh 1999) defines an option as a temporally extended course of actions consisting of three components: an initiation set I, a policy \pi, and a termination condition \beta. An agent in a state s can select an option o if and only if s\in I. If o is selected, the agent takes actions according to \pi until o terminates according to \beta.

DQN (Mnih et al. 2015) combines a semi-gradient Q-learning algorithm (Watkins and Dayan 1992) and a deep convolutional neural network (CNN) (LeCun et al. 1998) to learn to play Atari 2600 games. To improve stability, DQN uses experience replay and maintains both a Q-network and a target network. The Q-network provides the behavior policy and is updated online. The target network, which is used to compute the temporal difference error, is updated periodically and remains fixed between updates. DQN plays at human level or above in 29 of the games but performs poorly in Montezuma’s Revenge, a game that requires strategic exploration due to sparse rewards. Hierarchical deep RL systems such as h-DQN (Kulkarni et al. 2016) and FuNs (Vezhnevets et al. 2017) achieve better performance than DQN in Montezuma’s Revenge.

The deep recurrent Q-network (DRQN) (Hausknecht and Stone 2015) replaces the fully-connected layer in DQN with a layer of long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), which provides the capability to integrate state information through time. DRQN learns to play several partially observable versions of Atari games and achieves results comparable to DQN.

FuNs (Vezhnevets et al. 2017) is a two-level architecture consisting of a manager and a worker. Both modules are recurrent: the worker uses a standard LSTM; the manager uses a dilated LSTM that extends its state memory. The manager is similar to a meta controller. The distinction is that the goals in FuNs are directional whereas the goals in h-DQN are states.

3 Background

Reinforcement Learning.

We consider reinforcement learning for episodic tasks. A decision-making agent interacts with an external environment. The environment is modeled by a finite set of nonterminal states \mathcal{S}, a terminal state \tau, a finite set of actions \mathcal{A}, a transition function \mathcal{T}:\mathcal{S}\times\mathcal{A}\mapsto\mathcal{S}, and a reward function \mathcal{R}:\mathcal{S}^{+}\mapsto\mathbb{R}. (\mathcal{S}^{+} is the Kleene plus operation on \mathcal{S}.) At each time step t, the agent observes a state s_{(t)}\in\mathcal{S} and takes an action a_{(t)}\in\mathcal{A}. At the next time step, the environment transitions to s_{(t+1)} and the agent receives a reward r_{t+1}. (The parentheses in a subscript indicate that the subscript denotes a time step; for example, s_{(0)} is the environment’s state at time step 0. Without the parentheses, the subscript denotes a distinct element in a set. This notation applies to states, actions, and goals.) The process repeats until the terminal state is reached. The objective of the agent is to take actions in a way that maximizes the discounted return (Sutton and Barto 2018):

G_{t}=\sum_{i=0}^{T}\gamma^{i}r_{t+i+1},

where \gamma is the discount rate, 0\leq\gamma\leq 1, and T is the terminal time step.
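As a quick check of this formula, consider a hypothetical episode with \gamma=0.9 and rewards r_{1}=0, r_{2}=0, r_{3}=1 (all later rewards zero). The return from time step 0 is

G_{0}=r_{1}+\gamma r_{2}+\gamma^{2}r_{3}=0+0+0.81=0.81.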

Context-Sensitive Grammars.

A context-sensitive grammar (CSG) is a 4-tuple (N,\Sigma,P,S), where N is a finite set of nonterminal symbols, \Sigma is a finite set of terminal symbols, S is the start symbol, and P is a finite set of production rules. All the rules in P are of the form

\alpha A\beta\to\alpha\gamma\beta,

where A\in N, \alpha,\beta\in(\Sigma\cup N)^{*}, and \gamma\in(\Sigma\cup N)^{+}.

Given a set of strings V, the i-th power of V (i.e., the concatenation of V with itself i times) is recursively defined as follows:

V^{0}=\{\lambda\}
V^{i+1}=\{wv\mid w\in V^{i}\text{ and }v\in V\}

where \lambda is the empty string.
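The recursive definition can be computed directly; the following Python sketch (ours, for illustration only) builds V^{i} from a finite set of strings:

# i-th power of a set of strings V, following the recursive definition above.
def power(V, i):
    result = {""}                                    # V^0 = {lambda}, the empty string
    for _ in range(i):
        result = {w + v for w in result for v in V}  # V^{i+1} = {wv | w in V^i and v in V}
    return result

# Example: power({"s0", "s1"}, 2) == {"s0s0", "s0s1", "s1s0", "s1s1"}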

4 Expressiveness Analysis

When an HF or RHF system interacts with an environment, the meta controller receives states and generates goals, forming a trajectory of alternating states and goals. For example, the string s_{3}g_{6}s_{6}g_{0}s_{0} represents the state-goal trajectory in Figure 1 (a). To analyze the expressiveness of HF and RHF, we define two special types of CSGs that generate strings representing state-goal trajectories. A synopsis of our arguments is as follows:

  1. We describe how HF and RHF generate goals in an environment. Then we define two types of CSGs, constrained and k-recurrent, to represent HF’s and RHF’s goal generation respectively.

  2. We prove that a constrained CSG is a special case of a k-recurrent CSG. In other words, for any constrained CSG, there exists a k-recurrent CSG such that both grammars generate the same strings.

  3. We exhibit a string that constrained CSGs cannot generate, and we show that there exists a k-recurrent CSG that generates it.

Combined, these points imply that RHF is more expressive than HF.

In our analysis, we make the assumption of deterministic policies: the system selects goals and actions deterministically after training. RL algorithms oftentimes use an exploratory stochastic policy. For example, h-DQN uses an \epsilon-greedy policy to select goals and actions during training; the best goal or action is selected with a probability of 1-\epsilon and a random one is selected with a probability of \epsilon. As learning proceeds, \epsilon is gradually reduced to a small value. We want to analyze the system’s behavior after it has converged to deterministic policies and hence make this assumption.

4.1 Expressiveness of HF

HF is a two-level architecture consisting of a meta controller and a controller. At a time step t, the meta controller receives a state s_{(t)}\in\mathcal{S} and selects a goal g_{(t)}\in\mathcal{G}, where \mathcal{G} is a finite set of goals. From t to a time step t+n when either g_{(t)} is achieved or the terminal state is reached, the controller receives s_{(t+i)} and g_{(t)}, and takes an action a_{(t+i)}, where 0\leq i\leq n-1. At t+n, if the terminal state is not reached, the meta controller receives s_{(t+n)} and selects the next goal g_{(t+n)}; the process repeats. Regardless of implementation details, under the assumption of deterministic policies, the meta controller defines a function \mathcal{S}\mapsto\mathcal{G}.
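Concretely, under the assumption of deterministic policies the HF meta controller is a lookup from states to goals. A minimal Python sketch (with hypothetical names of ours) of such a mapping for corridor task (i):

# A deterministic HF meta controller is a function S -> G; here, a plain table.
hf_meta_policy = {
    "s3": "g6",   # from the start state, head for the right end of the corridor
    "s6": "g0",   # after reaching s6, head for the terminal state
}

def select_goal(state):
    return hf_meta_policy[state]

Because the table maps each state to exactly one goal, s_{6} can never be assigned two different goals; this is the limitation formalized in Theorem 1 below.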

To capture the expressiveness of HF, we construct a type of CSG encompassing the following components (a code sketch of this construction follows the list):

  (a) There can be one or more starting states. For every starting state s, we add one rule of the form S\to s\langle\text{META}\rangle. S is the start symbol. \langle\text{META}\rangle is a nonterminal symbol representing the meta controller.

  (b) For every state-goal pair \{s\}\mapsto\{g\} defined by the function \mathcal{S}\mapsto\mathcal{G}, we add one rule of the form s\langle\text{META}\rangle\to sg\langle\text{ACT}\rangle s. \langle\text{ACT}\rangle is a nonterminal symbol representing the controller. The expression sg\langle\text{ACT}\rangle is used to derive a new state, as explained in (c).

  (c) From t to t+n, the controller receives a state-goal pair (s_{(t+i)},g_{(t)}) and takes an action a_{(t+i)}, where 0\leq i\leq n-1. The iteration has three possible outcomes:

    i. The controller returns control to the meta controller when g_{(t)} is achieved in s_{(t+n)} and s_{(t+n)} is not the terminal state. This is captured by the rule sg\langle\text{ACT}\rangle\to sgs^{\prime}\langle\text{META}\rangle, where s is the state when the meta controller gives control to the controller (i.e., s=s_{(t)}), g is the goal selected by the meta controller (i.e., g=g_{(t)}), and s^{\prime} is the state satisfying g (i.e., s^{\prime}=s_{(t+n)}).

    ii. The episode terminates when s_{(t+n)} is the terminal state. It is possible that g is achieved in s_{(t+n)}. This is captured by the rule sg\langle\text{ACT}\rangle\to sg\tau, where \tau is the terminal state (i.e., \tau=s_{(t+n)}).

    iii. The iteration gets stuck in an infinite loop because the controller never achieves g and the terminal state is never reached. This is captured by the rule sg\langle\text{ACT}\rangle\to sg\langle\text{ACT}\rangle, which does not yield any string.
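The following Python sketch (a hypothetical helper of ours, not part of the formal construction) builds rules of forms (a)-(c) for corridor task (i) from a deterministic state-to-goal mapping; for brevity it emits only the rules reachable under that mapping, as in Example 1 below.

def constrained_rules(start_states, meta_policy, achieved_in, terminal):
    """Build constrained-CSG production rules (as strings) from a deterministic
    meta controller that maps states to goals."""
    rules = []
    for s in start_states:                                 # form (a)
        rules.append(f"S -> {s}<META>")
    for s, g in meta_policy.items():                       # form (b)
        rules.append(f"{s}<META> -> {s}{g}<ACT>{s}")
        s_prime = achieved_in[g]                           # the state that satisfies g
        if s_prime == terminal:                            # form (c) ii
            rules.append(f"{s}{g}<ACT> -> {s}{g}{terminal}")
        else:                                              # form (c) i
            rules.append(f"{s}{g}<ACT> -> {s}{g}{s_prime}<META>")
    return rules

# Corridor task (i): the meta controller behind Figure 1 (a)
rules = constrained_rules(
    start_states=["s3"],
    meta_policy={"s3": "g6", "s6": "g0"},
    achieved_in={"g6": "s6", "g0": "s0"},
    terminal="s0",
)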

We formally define this type of CSG as follows.

Definition 1.

A context-sensitive grammar (N,\Sigma,P,S) is constrained if

  1. The set of nonterminals is N=\{S,\langle\text{META}\rangle,\langle\text{ACT}\rangle\}.

  2. The set of terminals is \Sigma=\mathcal{S}\cup\mathcal{G}\cup\{\tau\}.

  3. The set P contains only production rules of the following forms:

    (a) For every starting state s\in\mathcal{S}, there exists exactly one rule of the form S\to s\langle\text{META}\rangle.

    (b) For every s\in\mathcal{S}, there exists exactly one rule of the form s\langle\text{META}\rangle\to sg\langle\text{ACT}\rangle s, where g\in\mathcal{G}.

    (c) For every combination of s\in\mathcal{S} and g\in\mathcal{G}, there exists exactly one rule of one of the following forms:

      i. sg\langle\text{ACT}\rangle\to sgs^{\prime}\langle\text{META}\rangle, where s^{\prime}\in\mathcal{S}.

      ii. sg\langle\text{ACT}\rangle\to sg\tau.

      iii. sg\langle\text{ACT}\rangle\to sg\langle\text{ACT}\rangle.

Example 1.

This example shows a constrained CSG that generates a string representing the state-goal trajectory in Figure 1 (a), which explains how an HF system solves corridor task (i). We define that g_{i} is achieved in s_{i} (0\leq i\leq 6).

Consider a constrained CSG G_{1}=(N,\Sigma,P,S). \Sigma=\{s_{i},g_{i}\mid 0\leq i\leq 6\}. s_{0} is the terminal state. Some rules in P are:

S\to s_{3}\langle\text{META}\rangle   (1)
s_{3}\langle\text{META}\rangle\to s_{3}g_{6}\langle\text{ACT}\rangle s_{3}   (2)
s_{6}\langle\text{META}\rangle\to s_{6}g_{0}\langle\text{ACT}\rangle s_{6}   (3)
s_{3}g_{6}\langle\text{ACT}\rangle\to s_{3}g_{6}s_{6}\langle\text{META}\rangle   (4)
s_{6}g_{0}\langle\text{ACT}\rangle\to s_{6}g_{0}s_{0}   (5)

Applying these rules derives

S \xrightarrow{1} s_{3}\langle\text{META}\rangle
  \xrightarrow{2} s_{3}g_{6}\langle\text{ACT}\rangle s_{3}
  \xrightarrow{4} s_{3}g_{6}s_{6}\langle\text{META}\rangle s_{3}
  \xrightarrow{3} s_{3}g_{6}s_{6}g_{0}\langle\text{ACT}\rangle s_{6}s_{3}
  \xrightarrow{5} s_{3}g_{6}s_{6}g_{0}s_{0}s_{6}s_{3}

The substring s_{3}g_{6}s_{6}g_{0}s_{0} represents the target state-goal trajectory. The substring s_{0}s_{6}s_{3} lists the visited states in reverse order; it is an artifact of the derivation.
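The derivation above can be reproduced mechanically by repeatedly rewriting the first applicable rule. A minimal Python sketch (ours; nonterminals are written as <META> and <ACT>) using the five rules of G_{1}:

# Rewriting with the rules (1)-(5) of G_1 from Example 1.
rules = [
    ("S",         "s3<META>"),        # (1)
    ("s3<META>",  "s3g6<ACT>s3"),     # (2)
    ("s6<META>",  "s6g0<ACT>s6"),     # (3)
    ("s3g6<ACT>", "s3g6s6<META>"),    # (4)
    ("s6g0<ACT>", "s6g0s0"),          # (5)
]

NONTERMINALS = ("S", "<META>", "<ACT>")

def derive(start="S"):
    string = start
    while any(nt in string for nt in NONTERMINALS):
        for lhs, rhs in rules:
            if lhs in string:
                string = string.replace(lhs, rhs, 1)   # rewrite the first occurrence
                break
        else:
            break                                      # no applicable rule
    return string

print(derive())   # prints s3g6s6g0s0s6s3, matching the derivation above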

Theorem 1.

Let G=(N,\Sigma,P,S) be a constrained context-sensitive grammar and \{s_{0},g_{a},g_{b}\}\subseteq\Sigma with g_{a}\neq g_{b}. Then G cannot generate a string that contains both s_{0}g_{a} and s_{0}g_{b} as substrings.

Proof.

Observe that the only rules that add a g to the right of an s are of form (b) in Definition 1. Assume that G can generate a string that contains both s_{0}g_{a} and s_{0}g_{b} as substrings. Then P must contain at least the following rules:

s_{0}\langle\text{META}\rangle\to s_{0}g_{a}\langle\text{ACT}\rangle s_{0}   (1)
s_{0}\langle\text{META}\rangle\to s_{0}g_{b}\langle\text{ACT}\rangle s_{0}   (2)

Rules (1) and (2) are both of form (b) with left side s_{0}\langle\text{META}\rangle, but Definition 1 allows at most one such rule in P. Hence rules (1) and (2) must be the same rule and g_{a}=g_{b}, which contradicts the assumption that g_{a}\neq g_{b}. Therefore Theorem 1 holds.

Theorem 1 implies that constrained CSGs cannot generate a string that contains s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}s_{6}g_{0}s_{0} as a substring, because this string contains both s_{6}g_{5} and s_{6}g_{0}. The substring represents the state-goal trajectory in Figure 1 (b). Hence HF systems cannot generate this trajectory.

4.2 Expressiveness of RHF

RHF is a two-level architecture consisting of a meta controller and a controller. RHF differs from HF in that the RHF meta controller receives as input a sequence of states instead of a single state. Observe that the meta controller has a finite memory: it can recall a sequence of states of up to a certain length k. We define

\mathcal{S}^{\leq k}=\bigcup_{0\leq i\leq k}\mathcal{S}^{i},

where \mathcal{S} is the set of nonterminal states and k is a nonnegative integer. \tilde{s}\in\mathcal{S}^{\leq k} represents a chronologically ordered sequence of states observed by the meta controller before the current time step. At a time step t, the meta controller receives s_{(t)}\tilde{s} and selects a goal g\in\mathcal{G}; then s_{(t)} is appended to \tilde{s}. From t to a time step t+n when either g_{(t)} is achieved or the terminal state is reached, the controller receives s_{(t+i)} and g_{(t)}, and takes an action a_{(t+i)}, where 0\leq i\leq n-1. At t+n, if the terminal state is not reached, the meta controller receives s_{(t+n)}\tilde{s} and selects the next goal g_{(t+n)}; then s_{(t+n)} is appended to \tilde{s}; the process repeats. Therefore, regardless of implementation details, under the assumption of deterministic policies, the meta controller defines a function \mathcal{S}^{\leq k+1}\mapsto\mathcal{G}.
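In contrast to the memoryless HF mapping sketched in Section 4.1, a deterministic RHF meta controller maps a bounded state history to a goal. A minimal Python sketch (hypothetical names of ours) of a mapping that produces the goal choices behind Figure 1 (b); as in the grammar strings below, the current state is listed first and earlier states follow:

# A deterministic RHF meta controller: a function from state sequences to goals.
rhf_meta_policy = {
    ("s3",):                  "g6",   # first decision: head for the right end
    ("s6", "s3"):             "g5",   # first time at s6: step back to s5
    ("s5", "s6", "s3"):       "g6",   # return to s6 for the second visit
    ("s6", "s5", "s6", "s3"): "g0",   # second visit done: head for the terminal state
}

def select_goal(history):
    return rhf_meta_policy[tuple(history)]

Unlike the HF table, the same current state s_{6} is mapped to different goals depending on the states observed before it.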

We construct a type of CSG as follows to capture the expressiveness of RHF and provide a formal definition afterward.

  (a) There can be one or more starting states. For every starting state s, we add one rule of the form S\to s\langle\text{META}\rangle. S is the start symbol. \langle\text{META}\rangle is a nonterminal symbol representing the meta controller.

  (b) For every sequence-goal pair \{s\tilde{s}\}\mapsto\{g\} defined by the function \mathcal{S}^{\leq k+1}\mapsto\mathcal{G}, we add one rule of the form s\langle\text{META}\rangle\tilde{s}\to sg\langle\text{ACT}\rangle s\tilde{s}. \langle\text{ACT}\rangle is a nonterminal symbol representing the controller. The expression sg\langle\text{ACT}\rangle is used to derive a new state, as explained in (c).

  (c) The controller receives a state-goal pair (s,g) and behaves the same way as in HF. Three outcomes are possible:

    i. The controller returns control to the meta controller when g is achieved in a state s^{\prime} and s^{\prime} is not the terminal state. This is captured by the rule sg\langle\text{ACT}\rangle\to sgs^{\prime}\langle\text{META}\rangle.

    ii. The episode terminates when the terminal state \tau is reached. It is possible that g is achieved in \tau. This is captured by the rule sg\langle\text{ACT}\rangle\to sg\tau.

    iii. The iteration gets stuck in an infinite loop because the controller never achieves g and the terminal state is never reached. This is captured by the rule sg\langle\text{ACT}\rangle\to sg\langle\text{ACT}\rangle.

Definition 2.

A context-sensitive grammar (N,\Sigma,P,S) is k-recurrent (k is a nonnegative integer) if

  1. The set of nonterminals is N=\{S,\langle\text{META}\rangle,\langle\text{ACT}\rangle\}.

  2. The set of terminals is \Sigma=\mathcal{S}\cup\mathcal{G}\cup\{\tau\}.

  3. The set P contains only production rules of the following forms:

    (a) For every starting state s\in\mathcal{S}, there exists exactly one rule of the form S\to s\langle\text{META}\rangle.

    (b) For every combination of s\in\mathcal{S} and \tilde{s}\in\mathcal{S}^{\leq k}, there exists at most one rule of the form s\langle\text{META}\rangle\tilde{s}\to sg\langle\text{ACT}\rangle s\tilde{s}, where g\in\mathcal{G}.

    (c) For every combination of s\in\mathcal{S} and g\in\mathcal{G}, there exists exactly one rule of one of the following forms:

      i. sg\langle\text{ACT}\rangle\to sgs^{\prime}\langle\text{META}\rangle, where s^{\prime}\in\mathcal{S}.

      ii. sg\langle\text{ACT}\rangle\to sg\tau.

      iii. sg\langle\text{ACT}\rangle\to sg\langle\text{ACT}\rangle.

Example 2.

This example shows a k-recurrent CSG that generates a string representing the state-goal trajectory in Figure 1 (b), which explains how an RHF system solves corridor task (ii). We define that g_{i} is achieved in s_{i} (0\leq i\leq 6).

Consider a 2-recurrent CSG G_{2}=(N,\Sigma,P,S). \Sigma=\{s_{i},g_{i}\mid 0\leq i\leq 6\}. s_{0} is the terminal state. Some rules in P are:

S\to s_{3}\langle\text{META}\rangle   (1)
s_{3}\langle\text{META}\rangle\to s_{3}g_{6}\langle\text{ACT}\rangle s_{3}   (2)
s_{6}\langle\text{META}\rangle s_{3}\to s_{6}g_{5}\langle\text{ACT}\rangle s_{6}s_{3}   (3)
s_{5}\langle\text{META}\rangle s_{6}s_{3}\to s_{5}g_{6}\langle\text{ACT}\rangle s_{5}s_{6}s_{3}   (4)
s_{6}\langle\text{META}\rangle s_{5}s_{6}s_{3}\to s_{6}g_{0}\langle\text{ACT}\rangle s_{6}s_{5}s_{6}s_{3}   (5)
s_{6}g_{5}\langle\text{ACT}\rangle\to s_{6}g_{5}s_{5}\langle\text{META}\rangle   (6)
s_{3}g_{6}\langle\text{ACT}\rangle\to s_{3}g_{6}s_{6}\langle\text{META}\rangle   (7)
s_{5}g_{6}\langle\text{ACT}\rangle\to s_{5}g_{6}s_{6}\langle\text{META}\rangle   (8)
s_{6}g_{0}\langle\text{ACT}\rangle\to s_{6}g_{0}s_{0}   (9)

Applying these rules derives

S \xrightarrow{1} s_{3}\langle\text{META}\rangle
  \xrightarrow{2} s_{3}g_{6}\langle\text{ACT}\rangle s_{3}
  \xrightarrow{7} s_{3}g_{6}s_{6}\langle\text{META}\rangle s_{3}
  \xrightarrow{3} s_{3}g_{6}s_{6}g_{5}\langle\text{ACT}\rangle s_{6}s_{3}
  \xrightarrow{6} s_{3}g_{6}s_{6}g_{5}s_{5}\langle\text{META}\rangle s_{6}s_{3}
  \xrightarrow{4} s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}\langle\text{ACT}\rangle s_{5}s_{6}s_{3}
  \xrightarrow{8} s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}s_{6}\langle\text{META}\rangle s_{5}s_{6}s_{3}
  \xrightarrow{5} s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}s_{6}g_{0}\langle\text{ACT}\rangle s_{6}s_{5}s_{6}s_{3}
  \xrightarrow{9} s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}s_{6}g_{0}s_{0}s_{6}s_{5}s_{6}s_{3}

The substring s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}s_{6}g_{0}s_{0} represents the target state-goal trajectory. The substring s_{0}s_{6}s_{5}s_{6}s_{3} lists the visited states in reverse order; it is an artifact of the derivation.

Proposition 1.

Any constrained context-sensitive grammar is 0-recurrent.

Proof.

All constrained CSGs satisfy the conditions of Definition 2: Definition 1 is the special case of Definition 2 where k=0. Therefore Proposition 1 holds.

Constrained CSGs and k-recurrent CSGs capture the expressiveness of HF and RHF respectively. Proposition 1 implies that for any constrained CSG, there exists a k-recurrent CSG such that both grammars generate the same strings. Theorem 1 implies that constrained CSGs cannot generate the string s_{3}g_{6}s_{6}g_{5}s_{5}g_{6}s_{6}g_{0}s_{0}s_{6}s_{5}s_{6}s_{3}, which contains both s_{6}g_{5} and s_{6}g_{0}. Example 2 shows that there exists a k-recurrent CSG that can generate this string. Therefore RHF is more expressive than HF.

5 Implementation

We describe an implementation of RHF named Rh-REINFORCE, which is used in our experiments. Figure 2 shows the architecture of Rh-REINFORCE.

The meta controller is a policy approximator that uses an RNN. The input to the meta controller is a variable-length sequence of vectors, each of which represents a state observed by the meta controller at a time step. When the vector dimension is high (e.g., image pixels), the input is passed to two convolutional layers with rectifiers. The output of the convolutional layers is passed to a layer of gated recurrent units (GRU) (Cho et al. 2014). The output of the GRU layer is passed to a fully-connected softmax output layer with a number of units equal to \lvert\mathcal{G}\rvert. When the vector dimension is low (e.g., the corridor task), the input is passed directly to a GRU layer followed by an output layer. The last vector in the output sequence, which corresponds to the most recent time step, is used as a probability distribution to select a goal.
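A minimal PyTorch sketch of the low-dimensional variant of this meta controller (our own illustration; the layer sizes and the one-hot state encoding are assumptions, not the authors’ exact implementation):

import torch
import torch.nn as nn

class RecurrentMetaController(nn.Module):
    """GRU layer followed by a softmax output layer over the goal set."""
    def __init__(self, state_dim, num_goals, hidden=64):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_goals)

    def forward(self, state_seq):
        # state_seq: (batch, seq_len, state_dim), the states observed so far
        h, _ = self.gru(state_seq)
        logits = self.out(h[:, -1])           # keep only the most recent time step
        return torch.softmax(logits, dim=-1)  # probability distribution over goals

# Usage: sample a goal for the corridor task (7 one-hot states, 7 goals)
meta = RecurrentMetaController(state_dim=7, num_goals=7)
probs = meta(torch.eye(7)[[3]].unsqueeze(0))           # the start state s3 as a length-1 sequence
goal = torch.distributions.Categorical(probs).sample()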

The controller is an actor-critic model (Konda and Tsitsiklis 2000) that uses two feedforward neural networks. The actor has two fully-connected or convolutional layers (depending on the input dimension) with rectifiers, followed by a fully-connected softmax output layer. The critic has the same network structure as the actor except that the output layer uses a linear activation instead of softmax. The input to the controller is a state extended by a vector representing the goal selected by the meta controller.

Figure 2: The architecture of Rh-REINFORCE for high-dimensional input vectors. The meta controller samples a goal g\sim\pi(\cdot|\tilde{s};\theta); the controller samples an action a\sim\pi_{a}(\cdot|s_{a},g;\theta_{a}). When the input dimension is low, the meta controller does not use convolutional layers. The controller has a similar network architecture to the meta controller without a recurrent layer.

Learning.

(Algorithm 1) The meta controller uses REINFORCE (Williams 1992) to learn a parameterized policy \pi(g|\tilde{s};\theta). (In this context, \tilde{s} corresponds to s\tilde{s} in the expressiveness analysis.) The performance measure is the expected return \mathbb{E}[G_{t}]. The parameters \theta are optimized by gradient ascent on \mathbb{E}[G_{t}] in the direction

G_{t}\nabla\ln{\pi(g_{(t)}|\tilde{s}_{(t)};\theta)}   (1)

The controller uses a one-step actor-critic method (Sutton and Barto 2018) to learn a policy \pi_{a}(a|s,g;\theta_{a}) and a value function v(s,g;\theta_{v}). Compared to REINFORCE, this method uses the one-step return and the value function as a baseline to reduce variance. The parameters \theta_{a} are optimized using the gradient

\delta\nabla\ln{\pi_{a}(a_{(t)}|s_{(t)},g;\theta_{a})},   (2)

where \delta=i_{t}+\gamma v(s_{(t+1)},g;\theta_{v})-v(s_{(t)},g;\theta_{v}) is the temporal difference error and i_{t} is an intrinsic reward.

The loss for the value function is L=\mathbb{E}[\delta^{2}]. The parameters \theta_{v} are optimized by gradient descent on L in the direction

\delta\nabla v(s_{(t)},g;\theta_{v})   (3)
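A minimal PyTorch sketch of this controller update, applying gradients (2) and (3) after every action (the network sizes and the concatenated state-goal encoding are our own assumptions, not the paper’s exact configuration):

import torch
import torch.nn as nn

# Assumed encoding: the controller input x is a state vector concatenated with a goal vector.
state_goal_dim, num_actions = 14, 2
actor  = nn.Sequential(nn.Linear(state_goal_dim, 32), nn.ReLU(),
                       nn.Linear(32, num_actions), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(state_goal_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_v = torch.optim.Adam(critic.parameters(), lr=1e-3)

def controller_update(x, a, i_t, x_next, done, gamma=0.99):
    """One-step actor-critic update; i_t is the intrinsic reward for the taken action a."""
    with torch.no_grad():
        target = i_t + (0.0 if done else gamma * critic(x_next))  # one-step return
        delta = target - critic(x)                                # temporal difference error
    # Gradient (3): gradient descent on the value loss E[delta^2].
    value_loss = (target - critic(x)).pow(2).mean()
    opt_v.zero_grad(); value_loss.backward(); opt_v.step()
    # Gradient (2): scale the log-probability of the taken action by delta.
    policy_loss = -(delta * torch.log(actor(x)[a])).mean()
    opt_a.zero_grad(); policy_loss.backward(); opt_a.step()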

In an episode, the meta controller selects a goal g (line 7). Until g is achieved or the terminal state is reached, the controller takes actions (line 11); after each action, \theta_{a} and \theta_{v} are updated using gradients (2) and (3) (line 15). After g is achieved, (s,g,R) is saved to construct a trajectory (line 17). The process repeats until the episode terminates. At the end of the episode, the trajectory is used to compute G_{t} in gradient (1); then \theta is updated using backpropagation through time for the length of the episode (line 21).

Algorithm 1 Rh-REINFORCE learning

Initialize meta controller policy \pi(g|\tilde{s};\theta)
Initialize controller policy \pi_{a}(a|s,g;\theta_{a})
Initialize controller value function v(s,g;\theta_{v})

1:for episode=1 to n do
2:     trajectory\leftarrow[ ]
3:     \tilde{s}\leftarrow[ ]
4:     Initialize the environment and get state s
5:     Append s to \tilde{s}
6:     while s is not terminal do
7:         Select goal g\sim\pi(\cdot|\tilde{s};\theta)
8:         R\leftarrow 0
9:         s_{a}\leftarrow s
10:         while not (s_{a} is terminal or g reached) do
11:              Take action a\sim\pi_{a}(\cdot|s_{a},g;\theta_{a})
12:              Get s^{\prime}, external reward r, intrinsic reward i
13:              R\leftarrow R+r
14:              s_{a}\leftarrow s^{\prime}
15:              Update \theta_{a}, \theta_{v}
16:         end while
17:         Append (s,g,R) to trajectory
18:         s\leftarrow s_{a}
19:         Append s to \tilde{s}
20:     end while
21:     Update \theta
22:end for
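For concreteness, a minimal PyTorch sketch of the meta controller update on line 21 (our own illustration; it assumes a policy network such as the RecurrentMetaController sketched above, returning a probability distribution over goals, and a discount \gamma applied per meta decision):

import torch

def update_meta_controller(meta, optimizer, trajectory, gamma=0.99):
    """REINFORCE update for the meta controller (line 21 of Algorithm 1).
    trajectory: list of (state_seq, goal_index, R) tuples saved during the episode,
    where state_seq is the tensor of states observed up to that decision."""
    # Compute the return G_t for each meta decision, working backward.
    returns, G = [], 0.0
    for _, _, R in reversed(trajectory):
        G = R + gamma * G
        returns.insert(0, G)

    loss = 0.0
    for (state_seq, goal, _), G in zip(trajectory, returns):
        probs = meta(state_seq)                       # distribution over goals, shape (1, |G|)
        loss = loss - G * torch.log(probs[0, goal])   # gradient ascent on E[G_t]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()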

6 Experiments

To demonstrate that RHF is more expressive than HF, we compare Rh-REINFORCE with two HF systems: h-REINFORCE (substituting a fully-connected layer for the GRU layer in Rh-REINFORCE) and h-DQN. The systems are trained in three variants of the corridor environment and a grid environment. In each environment, the systems use an optimal controller, which takes the actions that result in the shortest path from the current state to a state that satisfies a given goal (or the most likely actions if the environment is stochastic). Therefore, any difference in performance between the systems is due to their meta controllers.

6.1 Environments

Corridor.

(Figure 1 (b)) The environment is corridor task (ii). It has 7 states from left to right: s_{0},s_{1},\dots,s_{6}. s_{0} is the terminal state (gray). The actions are left move and right move. The agent starts in s_{3} (circled); at each time step, it can move to the state immediately to its left or right. (Taking a right move in s_{6} has no effect.) To constitute a visit to s_{6} (asterisk), the agent must move from s_{5} to s_{6}. The agent receives a reward of +1 if it visits s_{6} at least twice and then reaches s_{0}; otherwise, the reward is +0.01. Actions have no rewards. An episode ends after 20 time steps; the agent receives a reward of 0 if it fails to reach s_{0}.
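A minimal Python sketch of this task (class and method names are ours, not from the paper’s code):

class Corridor:
    """Corridor task (ii): reward +1 for visiting s6 at least twice and then reaching s0."""
    def __init__(self):
        self.pos, self.visits, self.steps = 3, 0, 0      # start in s3

    def step(self, action):                              # action: 0 = left move, 1 = right move
        prev = self.pos
        self.pos = max(self.pos - 1, 0) if action == 0 else min(self.pos + 1, 6)
        if prev == 5 and self.pos == 6:                  # a visit requires moving from s5 to s6
            self.visits += 1
        self.steps += 1
        if self.pos == 0:                                # terminal state reached
            return self.pos, (1.0 if self.visits >= 2 else 0.01), True
        if self.steps >= 20:                             # time limit: episode ends with reward 0
            return self.pos, 0.0, True
        return self.pos, 0.0, False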

Stochastic Corridor.

This is a modified version of Corridor based on the environment in (Osband et al. 2016). It has the same state configuration and reward function as Corridor. The difference is that when the agent takes a right move, with a probability of 0.5 it ends up in the state to its right and with a probability of 0.5 it ends up in the state to its left. An episode ends after 20 time steps with a reward of 0, the same as in Corridor.

Doom Corridor.

This is Corridor set in a Doom map created using the ViZDoom API (Kempka et al. 2016). It has the same reward function as Corridor. The agent always faces right and observes 320\times 240\times 3 RGB game frames. Each input frame is resized to 100\times 100\times 3 before being passed to a convolutional layer.

Grid.

(Figure 3) This is a navigation task in a 5\times 5 gridworld. The terminal state (gray) is above the first row. The landmark states are numbered 1, 2, and 3. The agent starts at the top-left state (circled). To obtain a reward of +1, the agent must visit landmark 1, return to the circled state, then visit landmark 2, return to the circled state, then visit landmark 3, return to the circled state, and finally reach the terminal state. The visits can be intertwined; for example, the visitation sequence 0 1 0 1 2 1 0 3 0 \tau (where 0 denotes the circled state) still results in a reward of +1. In all other cases the reward is 0. An episode ends after 60 actions. The task cannot be solved by HF because it requires a state-goal trajectory that contains at least s_{0}g_{1} and s_{0}g_{\tau}, which by Theorem 1 cannot be generated by HF.

Figure 3: The grid environment. The starting state is circled. The terminal state is gray. The landmarks are numbered 1, 2, and 3.

6.2 Training

In each environment, all three systems use the same fixed controller. Hence only the meta controller’s parameters are updated.

Rh-REINFORCE and h-REINFORCE use Algorithm 1 to learn softmax policies. (Intrinsic rewards and line 15 are not applicable because only the meta controller is being trained.) In Corridor and Doom Corridor, the training includes a random exploration phase for the initial 1000 episodes, where the meta controller selects random goals. This phase is not used in the other two environments.

h-DQN uses a one-step Q-learning algorithm with experience replay: the replay size is 100000; the batch size is 64; the target network update rate is 0.001. The goal selection is \epsilon-greedy; \epsilon decays from 1 to 0.01 over 15000 steps. Other hyperparameters are summarized in Table 1.

Table 1: Meta controller hyperparameters. All three systems use two convolutional layers followed by a GRU/Dense layer in Doom Corridor. Rh-REINFORCE and h-REINFORCE use the same output activation, loss, and optimizer. h-REINFORCE and h-DQN have the same network architectures except for the output activation, loss, and optimizer. The same learning rate is used across the systems.

Corridor/Stochastic and Grid. Rh-REINFORCE: GRU, 64 units. h-REINFORCE and h-DQN: Dense 1, 16 units [ReLU]; Dense 2, 32 units [ReLU].
Doom Corridor. All systems: Conv 1, 32 filters, (8, 8) kernel, strides=4 [ReLU]; Conv 2, 64 filters, (4, 4) kernel, strides=2 [ReLU]. Rh-REINFORCE: GRU, 256 units. h-REINFORCE and h-DQN: Dense, 256 units [ReLU].
Output activation. Rh-REINFORCE and h-REINFORCE: softmax. h-DQN: linear.
Loss. Rh-REINFORCE and h-REINFORCE: categorical crossentropy. h-DQN: Huber.
Optimizer. Rh-REINFORCE and h-REINFORCE: Adam. h-DQN: RMSprop.
Learning rate. All systems: 0.001.

6.3 Results

Rh-REINFORCE outperforms h-REINFORCE and h-DQN in all the environments. Figure 4 shows the average episodic returns of 10 runs in each environment.

In Corridor, Rh-REINFORCE learns an optimal policy in 2000 episodes. h-REINFORCE and h-DQN achieve an average return of 0.013 and 0.116 respectively after 10000 episodes.

In Stochastic Corridor, Rh-REINFORCE, h-REINFORCE, and h-DQN achieve an average return of 0.416, 0.071, and 0.054, respectively. For comparison, a handcrafted optimal policy has an average return of 0.423; a random policy has an average return of 0.031.

In Doom Corridor, Rh-REINFORCE, h-REINFORCE, and h-DQN achieve an average return of 0.835, 0.029, and 0.006, respectively.

In Grid, Rh-REINFORCE learns an optimal policy in 14000 episodes. h-DQN and h-REINFORCE are unable to find an optimal policy after 20000 episodes.

(a) Corridor. (b) Stochastic Corridor. (c) Doom Corridor. (d) Grid.
Figure 4: Comparison of Rh-REINFORCE (blue), h-REINFORCE (yellow), and h-DQN (red). Episodic returns are plotted against training episodes. The data points are the average of 10 runs smoothed by a 100-episode moving average. In each environment, all three systems use the same controller. Thus the plots illustrate the difference in meta controller performance.

7 Concluding Remarks

The experimental results are consistent with our expressiveness analysis. In the deterministic environments, the two HF systems are unable to generate the state-goal trajectories that result in the maximum reward; in contrast, the RHF system is capable of learning to generate these trajectories and thus obtains a much higher reward. Since the policies used during training (i.e., softmax and \epsilon-greedy) are not deterministic, it is possible for the HF systems to obtain a high reward by chance, but they are unable to do so consistently. The same conclusion holds in the stochastic environment.

We claim neither that an HF or RHF system can optimally solve an arbitrary RL task nor that such systems perform better than other deep RL systems. It is possible for an RL system without a hierarchical architecture to achieve high performance in a variety of tasks. Our analysis focuses on the expressiveness of HF and RHF, i.e., the kinds of state-goal trajectories that the frameworks can and cannot generate.

For future work, we want to investigate the expressiveness of architectures that have additional recurrent levels on top of the meta level. We conjecture that a single recurrent level with a sufficient number of units can simulate any number of recurrent levels and thus adding more levels will not increase a system’s expressive power.

Acknowledgements

This research was supported by the Office of Naval Research grants N00014-18-1-2009 and N68335-18-C-4027 and the National Science Foundation grant 1909879. The views, opinions, and findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

  • Aha 2018 Aha, D. W. 2018. Goal reasoning: Foundations, emerging applications, and prospects. AI Magazine 39(2):3–24.
  • Cho et al. 2014 Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).
  • Hausknecht and Stone 2015 Hausknecht, M., and Stone, P. 2015. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series.
  • Hochreiter and Schmidhuber 1997 Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Kempka et al. 2016 Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; and Jaśkowski, W. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, 341–348. IEEE.
  • Konda and Tsitsiklis 2000 Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In Advances in neural information processing systems, 1008–1014.
  • Kulkarni et al. 2016 Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, 3675–3683.
  • LeCun et al. 1998 LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • Mnih et al. 2015 Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • Muñoz-Avila, Dannenhauer, and Reifsnyder 2019 Muñoz-Avila, H.; Dannenhauer, D.; and Reifsnyder, N. 2019. Is everything going according to plan? expectations in goal reasoning agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9823–9829.
  • Osband et al. 2016 Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, 4026–4034.
  • Sutton and Barto 2018 Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
  • Sutton, Precup, and Singh 1999 Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112(1-2):181–211.
  • Vezhnevets et al. 2017 Vezhnevets, A. S.; Osindero, S.; Schaul, T.; Heess, N.; Jaderberg, M.; Silver, D.; and Kavukcuoglu, K. 2017. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3540–3549.
  • Watkins and Dayan 1992 Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine learning 8(3-4):279–292.
  • Williams 1992 Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.