
Learning Hidden Subgoals under Temporal Ordering Constraints in Reinforcement Learning

Duo Xu, Faramarz Fekri
Abstract

In real-world applications, the success of completing a task is often determined by multiple key steps which are distant in time and have to be achieved in a fixed time order. For example, the key steps listed in a cooking recipe should be achieved one by one in the right time order. These key steps can be regarded as subgoals of the task, and their time orderings are described as temporal ordering constraints. However, in many real-world problems, subgoals or key states are often hidden in the state space and their temporal ordering constraints are also unknown, which makes it challenging for previous RL algorithms to solve this kind of task. In order to address this issue, in this work we propose a novel RL algorithm for learning hidden subgoals under temporal ordering constraints (LSTOC). We propose a new contrastive learning objective which can effectively learn hidden subgoals (key states) and their temporal orderings at the same time, based on the first-occupancy representation and temporal geometric sampling. In addition, we propose a sample-efficient learning strategy which discovers subgoals one by one following their temporal ordering constraints, building a subgoal tree to represent the discovered subgoals and their temporal ordering relationships. Specifically, this tree can be used to improve the sample efficiency of trajectory collection, speed up task solving, and generalize to unseen tasks. The LSTOC framework is evaluated on several environments with image-based observations, showing its significant improvement over baseline methods. (Source code will be released upon acceptance.)

1 Introduction

In real life, successfully completing a task often involves multiple temporally extended key steps, where these key steps have to be achieved in a specified time order. For instance, in the process of making chemicals, different operations have to be strictly performed in the right time order, e.g., sulfuric acid must be added after water. Otherwise, the right chemical reaction may never occur or safety may even be threatened. These key steps are necessary for successfully completing the given task, and skipping any of them or doing them in the wrong time order leads to failure of the task. In this work, these key steps are regarded as subgoals. Tasks consisting of multiple subgoals with temporal ordering constraints are common in many real-world applications, such as temporal logic tasks in control systems and robotics (Baier and Katoen 2008). Since these tasks may have long time horizons and sparse rewards, knowledge of the subgoals and their temporal orderings is necessary for modern RL algorithms to solve these tasks efficiently. However, these subgoals can be hidden and unknown in many real-world scenarios. For instance, due to the user's lack of knowledge, these subgoals may be missing when specifying the task. Alternatively, due to the partial observability of the environment, the agent does not know the subgoals and their temporal orderings in advance.

Motivating example. Consider a service robot tasked with collecting the diamond within a limited number of time steps, as shown in Figure 1(a). Due to its limited power and the blockage of the river, the agent has to first go to the charger to get charged, then pick up the wheel or board to cross the river, and finally get the diamond. If the agent first picks up the wheel or board and then goes to the charger, the task cannot be finished within the required time steps. The temporal dependencies of these subgoals can be described by the finite state machine (FSM) in Figure 1(b). The temporal logic expression for describing these dependencies is $c;(b\vee w);d$. However, since the agent can only observe things around it, it is not aware of the river and does not know that the charger, wheel and board are subgoals, i.e., the subgoals are hidden to the agent.

Figure 1: (a) Layout of the example task. (b) The FSM for the temporal dependencies of subgoals, $c;(b\vee w);d$, with initial node $v_0$ and accepting node $v_T$. Letters "c", "b", "w" and "d" are short for charger, board, wheel and diamond, respectively.

When solving tasks with hidden subgoals, the binary label at the end of the episode, indicating whether the task is accomplished successfully or not, is the only reward information the agent can leverage to solve the task. This situation is challenging for modern RL algorithms, which use the Bellman equation to propagate value estimates back to earlier key steps (Sutton and Barto 2018). These algorithms suffer from slow convergence and high learning complexity, which is verified by our empirical experiments. Therefore, it is necessary to develop new RL algorithms for tasks which contain multiple hidden subgoals with unknown temporal ordering constraints. To the best of our knowledge, this is the first work which investigates this problem.

In this work, we propose a novel framework for Learning hidden Subgoals under Temporal Ordering Constraints in RL (LSTOC). It consists of a subgoal learning component and a labeling component. In subgoal learning, the proposed framework efficiently discovers states or observations corresponding to hidden subgoals and learns their temporal dependencies by using contrastive learning, iteratively building a subgoal tree (denoted as $\mathcal{T}$ and defined in Section 3.2) from the discovered subgoals to represent the learned temporal ordering constraints. Specifically, in $\mathcal{T}$, nodes are labeled with discovered key states of subgoals and edges represent their temporal ordering relationships. This tree $\mathcal{T}$ is used to guide trajectory collection, ground the semantic meaning of learned subgoals (labeling component), accelerate task solving, and help generalization.

Subgoal Learning. In order to improve sample efficiency, we propose a new learning method which discovers hidden subgoals one by one and iteratively builds $\mathcal{T}$ to represent the discovered subgoals and their temporal ordering relationships. Trajectory collection is guided by $\mathcal{T}$ to focus more on the working node $v_w$, a node of $\mathcal{T}$ that is not fully explored yet. In every iteration of building $\mathcal{T}$, by using a novel contrastive learning method, the agent only discovers the next subgoal, i.e., the subgoal immediately following the subgoal of the working node $v_w$ under the temporal ordering constraints, and expands $\mathcal{T}$ by adding this newly discovered subgoal as a new child of $v_w$. This new child node is then used as the working node, initiating the next iteration of tree expansion. The iterative process stops whenever the success of every collected trajectory can be explained by the constructed $\mathcal{T}$, meaning that $\mathcal{T}$ is fully constructed, i.e., all the hidden subgoals and their temporal orderings have been learned in $\mathcal{T}$.

Contrastive Learning. In order to discover subgoals next to the working node $v_w$, we propose a new contrastive learning method. Since only the first visit to the next subgoal is meaningful, based on pre-processed trajectories, the proposed method first computes the first-occupancy representation (FR) (Moskovitz, Wilson, and Sahani 2021) of trajectories by removing repetitive states. Then, contrastive learning is used to detect subgoals. However, conventional contrastive learning can detect multiple subgoals without giving any temporal distance information; since the next subgoal is the one temporally closest to $v_w$ among the detected ones, the real next subgoal we need to discover could be missed by conventional methods. It is therefore necessary to learn the temporal distances (i.e., temporal orderings) of detected subgoals and select the temporally closest one as the next subgoal. Hence, we propose a new contrastive learning objective which can detect key states for subgoals and learn their temporal orderings at the same time. It formulates the contrastive objective by using temporal geometric sampling to sample one positive state from the FR of a processed positive trajectory and several negative states from the FRs of a batch of processed negative trajectories. To the best of our knowledge, we are the first to propose a contrastive learning method which can detect key states and learn their temporal distances at the same time.

Labeling. In the labeling part, if the specification representing the temporal dependencies of subgoal semantic symbols is given, we formulate an integer linear programming (ILP) problem to determine the mapping from the discovered key states of subgoals to subgoal semantic symbols, making every path of the constructed subgoal tree $\mathcal{T}$ satisfy the given specification. When this ILP problem is solved, the labeling function is obtained, essentially giving semantic meaning to every learned subgoal.

We evaluate LSTOC on 9 tasks across three environments: the Letter, Office, and Crafter domains. In these environments, with image-based observations, the agent needs to visit different objects under temporal ordering constraints. Our evaluations show that LSTOC outperforms baselines on learning subgoals and on the efficiency of solving the given tasks. The generalizability of LSTOC is also empirically verified. The limitations of LSTOC are discussed at the end.

2 Related Works

Recently, linear temporal logic (LTL) formulas have been widely used in reinforcement learning (RL) to specify temporal logic tasks (Littman et al. 2017). Some papers develop RL algorithms to solve tasks given an LTL specification (Camacho et al. 2019; De Giacomo et al. 2019; Bozkurt et al. 2020). In other papers, the authors focus on learning the task machine from traces of symbolic observations based on binary labels received from the environment (Gaon and Brafman 2020; Xu et al. 2021; Ronca et al. 2022). However, all these papers assume access to a labeling function which maps raw states into propositional symbols, i.e., they work in a labeled MDP (Hasanbeig et al. 2019).

Some papers assume an imperfect labeling function, where the predicted symbols can be erroneous or uncertain (Li et al. 2022; Hatanaka, Yamashina, and Matsubara 2023), but they do not address the problem of learning subgoals. A recent paper studies the problem of grounding LTL over finite traces (LTLf) formulas in image sequences (Umili, Capobianco, and De Giacomo 2023). However, their method is only applicable to offline settings with a static dataset and does not consider online exploration in the environment. In addition, the authors of (Luo et al. 2023) propose an algorithm for learning rational subgoals based on dynamic programming. However, their approach requires availability of the state transition model, which is not feasible in general real-world applications.

Contrastive learning has been used to detect subgoals (key states) in previous RL papers (Zhang and Kashima 2023; Sun et al. 2023; Casper et al. 2023; Park et al. 2022; Liang et al. 2022). However, these methods sample clips of positive and negative trajectories to formulate the contrastive objective, which makes the temporal distances of states indistinguishable and hence cannot learn the temporal distances of detected subgoals. Therefore, these methods cannot be used to learn subgoals under temporal ordering constraints.

3 Preliminaries

3.1 Reinforcement Learning

Reinforcement learning (RL) is a framework for learning a strategy of selecting actions in an environment in order to maximize the collected rewards over time (Sutton and Barto 2018). The problems addressed by RL can be formalized as Markov decision processes (MDPs), defined as a tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},R,\gamma,\mathcal{S}_0\rangle$, where $\mathcal{S}$ is a finite set of environment states, $\mathcal{A}$ is a finite set of agent actions, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ is a probabilistic transition function, $R:\mathcal{S}\times\mathcal{A}\to[R_{\min},R_{\max}]$ is a reward function with $R_{\min},R_{\max}\in\mathbb{R}$, and $\gamma\in[0,1)$ is a discount factor. Note that $\mathcal{S}_0$ is the set of initial states where the agent starts in every episode, and the initial state $s_0$ is drawn from a distribution over $\mathcal{S}_0$.

In addition, corresponding to the MDP, we assume that there is a set of semantic symbols $\mathcal{G}$ representing subgoals. We also define a labeling function $L:\mathcal{S}\to\mathcal{G}\cup\{\varnothing\}$ that maps an environment state to a subgoal, i.e., a key step for accomplishing the task. Furthermore, a key state is defined as an environment state whose output under $L$ is a subgoal in $\mathcal{G}$. Most states of the environment are not key states, and for them the output of $L$ is empty ($\varnothing$). In this work, the states corresponding to subgoals and the labeling function are initially unknown to the agent. The agent can only leverage collected trajectories and their task-completion labels to solve the task.

3.2 Temporal Logic Language

We assume that the temporal dependencies (orderings) of subgoals considered in this work can be described by a formal language $\mathcal{TL}$ composed of three operators. Syntactically, all subgoals in $\mathcal{G}$ are in $\mathcal{TL}$, and for all $\varphi_1,\varphi_2\in\mathcal{TL}$, the expressions $(\varphi_1;\varphi_2)$, $(\varphi_1\vee\varphi_2)$ and $(\varphi_1\wedge\varphi_2)$ are also in $\mathcal{TL}$, representing "$\varphi_1$ then $\varphi_2$", "$\varphi_1$ or $\varphi_2$" and "$\varphi_1$ and $\varphi_2$", respectively. Formally, a trajectory of states $\tau=(s_1,\ldots,s_n)$ satisfies a task description $\varphi$, written as $\tau\models\varphi$, whenever one of the following holds:

  • If $\varphi$ is a single subgoal $g\in\mathcal{G}$, then the first state of $\tau$ must not satisfy $g$ and the last state must satisfy $g$, which implies that $\tau$ has at least 2 states.

  • If $\varphi=(\varphi_1;\varphi_2)$, then there exists $0<j<n$ such that $(s_1,\ldots,s_j)\models\varphi_1$ and $(s_j,\ldots,s_n)\models\varphi_2$, i.e., task $\varphi_1$ should be finished before $\varphi_2$.

  • If $\varphi=(\varphi_1\vee\varphi_2)$, then $\tau\models\varphi_1$ or $\tau\models\varphi_2$, i.e., the agent should finish either $\varphi_1$ or $\varphi_2$.

  • If $\varphi=(\varphi_1\wedge\varphi_2)$, then $\tau\models(\varphi_1;\varphi_2)$ or $\tau\models(\varphi_2;\varphi_1)$, i.e., the agent should finish both $\varphi_1$ and $\varphi_2$, in either order.

Note that the language $\mathcal{TL}$ for specifying temporal dependencies is expressive and covers LTLf (De Giacomo and Vardi 2013), the finite-trace fragment of LTL without the "always" operator $\Box$.
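To make the semantics above concrete, the following is a minimal sketch (not the authors' implementation) of a recursive satisfaction check for $\mathcal{TL}$ formulas. It assumes a hypothetical labeling function `label(s)` mapping a state to its subgoal symbol (or `None`), and encodes formulas as nested tuples with operators 'seq', 'or' and 'and'.

```python
def satisfies(traj, phi, label):
    """Check whether the state sequence `traj` satisfies formula `phi`."""
    n = len(traj)
    if isinstance(phi, str):                      # single subgoal g
        return n >= 2 and label(traj[0]) != phi and label(traj[-1]) == phi
    op, f1, f2 = phi
    if op == 'seq':                               # (f1 ; f2): prefix and suffix share a split state
        return any(satisfies(traj[:j + 1], f1, label) and
                   satisfies(traj[j:], f2, label) for j in range(n - 1))
    if op == 'or':                                # (f1 v f2)
        return satisfies(traj, f1, label) or satisfies(traj, f2, label)
    if op == 'and':                               # (f1 ^ f2): both, in either order
        return (satisfies(traj, ('seq', f1, f2), label) or
                satisfies(traj, ('seq', f2, f1), label))
    raise ValueError(f"unknown operator: {op}")

# Toy usage for c;(b v w);d from Figure 1, with symbolic states labeled by themselves:
task = ('seq', 'c', ('seq', ('or', 'b', 'w'), 'd'))
label = lambda s: s
print(satisfies(['s0', 'c', 'x', 'b', 'y', 'd'], task, label))  # True
```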

Figure 2: Examples of TL formulas ($a;b$, $(a;b)\vee c$, $a\wedge b$) and their corresponding FSMs. The initial node is $v_0$ and the accepting (terminal) node is $v_T$.

Every task specification $\varphi\in\mathcal{TL}$ can be represented by a non-deterministic finite-state machine (FSM) (Luo et al. 2023), which encodes the temporal orderings and branching structure. The FSM $\mathcal{M}_\varphi$ of task $\varphi$ is a tuple $(V_\varphi,E_\varphi,I_\varphi,F_\varphi)$ denoting the subgoal nodes, edges, set of initial nodes and set of accepting (terminal) nodes, respectively. Some examples of FSMs are shown in Figure 2. There exists a deterministic algorithm for transforming any specification in $\mathcal{TL}$ into a unique FSM (Luo et al. 2023).
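Purely as an illustration (the FSM construction algorithm itself is given in Luo et al. 2023), such an FSM $(V_\varphi,E_\varphi,I_\varphi,F_\varphi)$ can be stored as labeled edges and checked against a sequence of subgoal symbols by tracking the set of reachable nodes; the node and symbol names below are our own choices for the example in Figure 1(b).

```python
class SubgoalFSM:
    """Non-deterministic FSM (V, E, I, F) over subgoal symbols."""
    def __init__(self, nodes, edges, initial, accepting):
        self.nodes = set(nodes)            # V
        self.edges = list(edges)           # E: (u, symbol, u') triples
        self.initial = set(initial)        # I
        self.accepting = set(accepting)    # F

    def step(self, current, symbol):
        """All nodes reachable from `current` by one transition on `symbol`."""
        return {v for (u, g, v) in self.edges if u in current and g == symbol}

    def accepts(self, symbols):
        """True if some run over the full symbol sequence ends in an accepting node."""
        current = set(self.initial)
        for g in symbols:
            current = self.step(current, g)
            if not current:
                return False
        return bool(current & self.accepting)

# FSM of Figure 1(b) for c;(b v w);d (node names are illustrative):
fsm = SubgoalFSM(
    nodes={'v0', 'v1', 'v2', 'vT'},
    edges=[('v0', 'c', 'v1'), ('v1', 'b', 'v2'), ('v1', 'w', 'v2'), ('v2', 'd', 'vT')],
    initial={'v0'}, accepting={'vT'})
assert fsm.accepts(['c', 'w', 'd']) and not fsm.accepts(['w', 'c', 'd'])
```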

Figure 3: The subgoal tree representing the subgoal temporal dependencies $c;(b\vee w);d$. The root is $v_0$; the other nodes are labeled by the key states $(0,9)$, $(3,7)$, $(7,1)$ and $(9,9)$.

First, we define a satisfying sequence as a sequence of key states which satisfies the task. Second, the tree formed by all the satisfying sequences of the task is defined as the subgoal tree. For example, for the problem in Figure 1(a), the satisfying sequences are $[(0,9),(3,7),(9,9)]$ and $[(0,9),(7,1),(9,9)]$ (where $(x,y)$ denotes the coordinates of a state), and $(0,9),(3,7),(7,1)$ and $(9,9)$ are the states corresponding to subgoals "c", "b", "w" and "d", respectively. Its subgoal tree is shown in Figure 3. In this work, we only consider subgoal temporal dependencies whose FSMs do not contain loops; hence such an FSM can be converted into a tree. Note that the subgoals and their temporal dependencies (i.e., ordering constraints) are hidden and unknown to the agent. The algorithm proposed in this work learns these subgoals and their temporal ordering constraints based on collected trajectories and labels indicating whether the task is accomplished or not.

3.3 First-occupancy Representation

We use the first-occupancy representation (FR) for learning subgoals. The FR measures how soon a policy is expected to reach a state for the first time, thereby emphasizing the first occupancy.

Definition 1 (Moskovitz, Wilson, and Sahani 2021). For an MDP with finite $\mathcal{S}$, the first-occupancy representation (FR) of a policy $\pi$, $F^\pi\in[0,1]^{|\mathcal{S}|\times|\mathcal{S}|}$, is given by

$$F^{\pi}(s,s') := \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,\mathbb{1}\big(s_{t+k}=s',\, s'\notin\{s_{t:t+k}\}\big)\,\Big|\, s_{t}=s\Big] \qquad (1)$$

where $\{s_{t:t+k}\}=\{s_t,s_{t+1},\ldots,s_{t+k-1}\}$ and $\{s_{t:t+0}\}=\emptyset$. The indicator $\mathbb{1}$ above equals 1 only when $s'$ first occurs at time $t+k$ after time $t$. Hence $F^\pi(s,s')$ gives the expected discount at the time the policy first reaches $s'$ starting from $s$.
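As a rough illustration of Definition 1 (our own sketch, not part of LSTOC), $F^\pi$ can be estimated by Monte Carlo from rollouts of $\pi$ over a tabular state space, accumulating $\gamma^k$ at the first occurrence of each $s'$ after each visit to $s$; episode truncation effects are ignored here.

```python
import numpy as np

def estimate_fr(trajectories, num_states, gamma=0.95):
    """Monte Carlo estimate of F^pi(s, s') from sampled state-index trajectories."""
    totals = np.zeros((num_states, num_states))
    visits = np.zeros(num_states)
    for traj in trajectories:
        for t, s in enumerate(traj):
            visits[s] += 1
            seen = set()
            for k, s_next in enumerate(traj[t:]):
                if s_next not in seen:            # gamma^k only at the first occurrence
                    totals[s, s_next] += gamma ** k
                    seen.add(s_next)
    return totals / np.maximum(visits[:, None], 1)
```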

3.4 Contrastive Learning in RL

Contrastive learning was used in previous RL algorithms to detect important states (Sun et al. 2023; Zhang and Kashima 2023) by learning an importance function $f_\omega$ that compares positive and negative trajectories. The objective loss function can be written as

$$\mathcal{L}(\omega;\mathcal{B}_{P},\mathcal{B}_{N}) = -\,\mathbb{E}_{\tau_{p}\sim\mathcal{B}_{P},\,\tau_{n}\sim\mathcal{B}_{N}}\Bigg[\frac{\sum_{\nu=1}^{L}\exp(f_{\omega}(\tau_{p}[\nu]))}{\sum_{\nu=1}^{L}\exp(f_{\omega}(\tau_{p}[\nu]))+\sum_{\mu=1}^{L}\exp(f_{\omega}(\tau_{n}[\mu]))}\Bigg] \qquad (2)$$

which aims to make important states take high values at the output of $f_\omega$. However, this conventional method has two major shortcomings. First, it cannot learn the temporal distances of important states. In our work, hidden subgoals are under temporal ordering constraints, and different subgoals may have different temporal distances from the initial state; a function $f_\omega$ trained this way treats every subgoal equally and cannot identify the real next subgoal to achieve. Second, the numerator in (2) contains multiple states, so the importance value ($f_\omega$ output) of a real subgoal may not stand out among its neighboring states, which can make subgoal learning inaccurate. In order to tackle these two issues, we propose a new contrastive learning method.
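For reference, a minimal PyTorch sketch of the per-pair term of objective (2) is given below; here `f_omega` is assumed to map a tensor of per-step state features of shape (L, d) to per-step importance logits of shape (L,), and the expectation in (2) corresponds to averaging this term over sampled trajectory pairs.

```python
import torch

def conventional_contrastive_loss(f_omega, pos_traj, neg_traj):
    """One sampled pair of positive / negative trajectories from Eq. (2)."""
    pos = torch.exp(f_omega(pos_traj)).sum()   # sum over all L states of tau_p
    neg = torch.exp(f_omega(neg_traj)).sum()   # sum over all L states of tau_n
    return -pos / (pos + neg)                  # minimized over omega
```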

4 Methodology

In this work, we propose the LSTOC framework for learning hidden subgoals and their temporal ordering constraints by leveraging trajectories and task-completion results (positive or negative labels) collected from the environment. Based on the constructed subgoal tree, the agent can solve the task faster and generalize to other unseen tasks involving the same subgoals.

Figure 4: Diagram of the LSTOC framework, consisting of the subgoal learning and labeling components. The subgoal learning component interacts with the environment (actions and states), and iterates between discovering the next subgoal, expanding the subgoal tree and training the exploration policy. Here $\mathcal{B}_P$ ($\mathcal{B}_N$) is the buffer of positive (negative) trajectories, $f_\theta$ is the state representation function, $\hat{\mathcal{S}}_K$ is the set of discovered key states, $\mathcal{T}_\varphi$ is the subgoal tree, and $\pi_{\text{exp}}$ is the exploration policy. Trajectory collection is guided by $\mathcal{T}_\varphi$ and $\pi_{\text{exp}}$. In the labeling part, based on $\mathcal{M}_\varphi$, $\hat{\mathcal{S}}_K$ and $\mathcal{T}_\varphi$, the mapping from discovered key states to subgoal symbols is determined by solving an ILP problem, yielding the labeling function. $\mathcal{M}_\varphi$ denotes the FSM of temporal dependencies of subgoals in task $\varphi$.

4.1 General Context

The diagram of LSTOC is shown in Figure 4. In subgoal learning, the agent iteratively learns hidden subgoals one by one in a depth-first manner and builds a subgoal tree $\mathcal{T}_\varphi$ to guide trajectory collection and represent the learned subgoals and temporal orderings. For example, for the problem shown in Figure 1, the agent first learns subgoal "c", then "w", then "d", then "b", and finally "d", building the subgoal tree in Figure 3.

In every iteration, the agent focuses on learning the subgoals next to the current working node by using a contrastive learning method, and expands $\mathcal{T}_\varphi$ by adding a new leaf node labeled with the newly learned subgoal, which is used as the working node in the next iteration. This iteration is repeated until the success of every trajectory on task completion can be explained by the learned subgoals and their temporal relationships in $\mathcal{T}_\varphi$. Then, the agent proceeds to the labeling component. When collecting a trajectory $\tau_k$, by using an exploration policy $\pi_{\text{exp}}$ trained with a reward shaping method, the agent is guided by $\mathcal{T}_\varphi$ to reach the working node (an unexplored node of $\mathcal{T}_\varphi$). A binary label $l_k$ indicating the result of task completion is received at the end of $\tau_k$. Positive ($l_k=1$) and negative ($l_k=0$) trajectories are stored in the positive ($\mathcal{B}_P$) and negative ($\mathcal{B}_N$) buffers, respectively.

Once the FSM $\mathcal{M}_\varphi$, which describes the subgoal temporal dependencies in terms of the semantic symbols in $\mathcal{G}$, becomes available, the labeling component of LSTOC determines the mapping from discovered key states to subgoal symbols in $\mathcal{G}$ by solving an integer linear programming (ILP) problem, leveraging $\mathcal{T}_\varphi$ to impose semantic meaning onto every learned subgoal.

Subgoal Notation. We define $\hat{\mathcal{S}}_K$ as an ordered set of discovered key states, which form a subset of the state space of the environment, i.e., $\hat{\mathcal{S}}_K\subset\mathcal{S}$. Every node of the subgoal tree is labeled by a state in $\hat{\mathcal{S}}_K$. For the $k$-th state in $\hat{\mathcal{S}}_K$, i.e., $\hat{s}_k$, the index $k$ identifies the detected subgoal and $\hat{s}_k$ is the key state corresponding to it. Only newly discovered key states not already included in $\hat{\mathcal{S}}_K$ are added to $\hat{\mathcal{S}}_K$, each creating an index for a newly detected subgoal.

Remark. Although subgoals are hidden, the result of task completion can still be returned by the environment to the agent for every trajectory. This is common in robotic applications. For example, for a service robot, when the task specification misses key steps, the robot can learn to discover these missing steps based on its trajectory data and feedback about task completion from users or other external sources (Gonzalez-Aguirre et al. 2021).

4.2 Learning Subgoals

Since the task is assumed to have multiple temporally extended hidden subgoals, it is impractical to collect enough informative trajectories for an offline supervised learning approach to detect all the subgoals at once. Therefore, we propose a new learning method which discovers subgoals one by one iteratively. In every iteration, in addition to trajectory collection, the agent conducts three operations, introduced below: discovering the next subgoal, expanding the subgoal tree, and training the exploration policy.

Figure 5: Examples of building the subgoal tree $\mathcal{T}_\varphi$. The temporal dependencies of subgoals can be expressed as $(a;b)\vee(b;c)$, whose FSM is shown in the rightmost figure. In the left three figures, the red node denotes the working node $v_w$, and $\hat{\mathcal{S}}_K$ is given in the upper left corner. The dashed nodes are unexplored nodes to be explored. Every node is labeled with a discovered key state and its index. The fourth figure shows a fully built $\mathcal{T}_\varphi$. The subgoals of the task are hidden, and the agent only knows the result of task completion for each episode.

Expand Subgoal Tree

Tree Definition. As discussed in Section 3.2, the temporal ordering constraints of discovered subgoals can be expressed by the subgoal tree $\mathcal{T}_\varphi$. In $\mathcal{T}_\varphi$, except for the root, each node $v_n$ is labeled by the key state $\hat{s}_{k_{v_n}}$ of the $k_{v_n}$-th learned subgoal in $\hat{\mathcal{S}}_K$. Specifically, in $\mathcal{T}_\varphi$, the children of a node labeled with subgoal $p$ contain the key states of the next subgoals to achieve after $p$ is visited in the FSM of subgoal temporal dependencies. Since the children of every node are labeled only by the temporally nearest subgoals for task completion, the temporal orderings of subgoals are well represented in $\mathcal{T}_\varphi$. Whenever $\mathcal{T}_\varphi$ is fully established, every path from the root to a leaf node in $\mathcal{T}_\varphi$ corresponds to a satisfying sequence of the task (a concept introduced in Section 3.2).

Tree Expansion. Based on discovered key states, $\mathcal{T}_\varphi$ is expanded iteratively in a depth-first manner. Initially, $\mathcal{T}_\varphi$ has only the root node $v_0$. For a node $v_l$, we define the path of $v_l$ (denoted as $\xi_l$) as the sequence of key states along the path from $v_0$ to $v_l$ in $\mathcal{T}_\varphi$; an example is shown in Figure 6. We define nodes whose paths have not yet led to task completion as unexplored nodes. In each iteration of tree expansion, the agent first selects an unexplored node $v_w$ from $\mathcal{T}_\varphi$ as the working node, and then focuses on discovering the next subgoals to achieve after visiting $v_w$ in $\mathcal{T}_\varphi$. Examples of building the subgoal tree are shown in Figure 5, where the dashed nodes are unexplored nodes.

Expansion at a Working Node. For expansion at $v_w$, the agent first collects trajectories conditioned on $v_w$ and then discovers next subgoals by using contrastive learning. We define a trajectory which visits the key states along the path of $v_w$ (i.e., $\xi_w$) sequentially as a trajectory conditioned on the working node $v_w$. An example of a trajectory conditioned on $v_w$ is shown in Figure 6, where the path $\xi_w$ is $[\hat{s}_1,\hat{s}_2]$. Every trajectory is collected by applying the exploration policy $\pi_{\text{exp}}$. In order to encourage the agent to collect trajectories conditioned on $v_w$, $\pi_{\text{exp}}$ is trained by a reward shaping method guided by $\mathcal{T}_\varphi$, which is introduced later in Section 4.2.
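A minimal data-structure sketch of the subgoal tree used in this expansion loop (our own illustration and naming, not the authors' code) is shown below, together with a simple check of whether a trajectory's visited key states are conditioned on a node.

```python
class SubgoalTreeNode:
    """A node of the subgoal tree T_phi; the root carries no key state."""
    def __init__(self, key_state=None, parent=None):
        self.key_state = key_state        # discovered key state \hat{s}_k (None for root)
        self.parent = parent
        self.children = []
        self.fully_explored = False       # True once its path has been fully explained

    def add_child(self, key_state):
        child = SubgoalTreeNode(key_state, parent=self)
        self.children.append(child)
        return child                      # the new working node

    def path(self):
        """xi_l: key states along the path from the root to this node."""
        node, xi = self, []
        while node.parent is not None:
            xi.append(node.key_state)
            node = node.parent
        return list(reversed(xi))

def conditioned_on(traj_key_states, node):
    """Subsequence check: the trajectory visits node.path() in order.
    Stricter variants (e.g., forbidding earlier visits to later key states) are possible."""
    xi, i = node.path(), 0
    for s in traj_key_states:
        if i < len(xi) and s == xi[i]:
            i += 1
    return i == len(xi)
```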

Figure 6: A trajectory conditioned on $v_w$. The red node is the working node $v_w$. The dashed nodes are unexplored nodes. Blue: the part that visits the path $\xi_w:=[\hat{s}_1,\hat{s}_2]$. Red: the part that explores until the end of the episode. Only the red part is used to discover key states of next subgoals.

If the number of collected positive trajectories conditioned on $v_w$ is larger than a threshold $N_T$, the agent initiates its subgoal discovery process at the working node $v_w$. Specifically, the agent picks trajectories conditioned on $v_w$ from $\mathcal{B}_P$ and $\mathcal{B}_N$, and in every trajectory keeps only the part after $v_w$ is visited (the red part in Figure 6). Based on these pre-processed trajectories, the key states of the next subgoals to visit (after $v_w$) are discovered by the contrastive learning method introduced later in Section 4.2. The newly discovered key state is then used to build a new node $v_{\text{new}}$, which is added to $\mathcal{T}_\varphi$ as a new child of $v_w$. The working node moves to $v_{\text{new}}$, initiating a new iteration of tree expansion at $v_{\text{new}}$. Since the agent is unfamiliar with the environment initially, we design an adaptive mechanism for selecting $N_T$, introduced in Section 4.4.

Satisfying Sequence. At working node $v_w$, if the agent completes the task successfully whenever it follows the path $\xi_w$, then $\xi_w$ is regarded as a satisfying sequence of the task (a concept introduced in Section 3.2). In this case, we say that positive trajectories which complete the task successfully by following the path $\xi_w$ are explained by $\xi_w$. The path $\xi_w$ is added to the set of discovered satisfying sequences $\mathcal{P}_S$. The node $v_w$ then becomes fully explored, and the parent node $v_p$ of $v_w$ is selected as the new working node, initiating a new iteration of expansion at $v_p$.

Discovering Alternative Branches. With the working node at $v_p$, as long as we can find positive trajectories conditioned on $v_p$ that are not explained by any discovered satisfying sequence in $\mathcal{P}_S$, there must be alternative hidden subgoals next to $v_p$ which have not been discovered yet. The agent continues exploring and discovering hidden subgoals (next to $v_p$) based on these unexplained trajectories until the success of every positive trajectory conditioned on $v_p$ is explained by descendant nodes of $v_p$ in $\mathcal{T}_\varphi$ (i.e., $v_p$ becomes fully explored). Specifically, if the unexplained trajectories conditioned on $v_p$ are not sufficient, the agent applies $\pi_{\text{exp}}$ to collect more trajectories until the number of unexplained positive trajectories conditioned on $v_p$ exceeds $N_T$. If new key states of next subgoals are found, these states are used to build new nodes added to $\mathcal{T}_\varphi$ as new children of $v_p$, and additional iterations of expansion at these new nodes are initiated. Otherwise, $v_p$ becomes fully explored and the parent of $v_p$ is selected as the new working node.

Termination Condition of Tree Expansion. The tree expansion stops only when 1) every collected positive trajectory can be explained by $\mathcal{T}_\varphi$, or 2) $\mathcal{T}_\varphi$ becomes structurally inconsistent with $\mathcal{M}_\varphi$ (the FSM of subgoal temporal dependencies). Condition 1) means that every positive trajectory $\tau_p$ can be explained by a satisfying sequence $\xi_l$ which is a path in $\mathcal{T}_\varphi$. Condition 2) means that the current $\mathcal{T}_\varphi$ is wrong and has to be rebuilt from scratch; its details are discussed in Appendix B. If condition 2) holds, the LSTOC framework is reset and restarted with a larger threshold $N_T$.

Discovering Next Subgoal

In each iteration of the LSTOC framework, based on the up-to-date subgoal tree $\mathcal{T}_\varphi$, the agent focuses on learning the next subgoal to achieve after visiting the working node (a concept introduced in Section 4.2). For instance, in the example of Figure 1, if the up-to-date $\mathcal{T}_\varphi$ has two nodes labeled by the key states of learned subgoals "c" and "b" ("c" being the parent of "b") and the working node is at "b", then the next subgoal to discover is the state of "d". As discussed in Section 3.4, although contrastive learning has been used in RL to detect important states, several issues make it inapplicable to our problem. In this work, we propose a new contrastive learning method which discovers the key state of the next subgoal by introducing some new techniques.

First-occupancy Representation (FR). Starting from the working node $v_w$ of $\mathcal{T}_\varphi$, only the first visit to the next subgoal is meaningful for completing the task. In tasks with multiple subgoals under temporal ordering constraints, repetitive state visitations, especially of non-key states, are frequent and can distract the weight function $f_\omega$ from paying attention to real key states. Therefore, we propose to compute and use the FRs of trajectories for contrastive learning. Specifically, for any trajectory $\tau$ conditioned on $v_w$ (an example is shown in Figure 6), we select the part of $\tau$ after $v_w$ is visited and remove repetitive states, keeping only the first visit to any state, which yields the FR of $\tau$, denoted as $\tilde{\tau}$.
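A simple sketch of this pre-processing step (assuming states are already comparable, e.g., hashable indices or deduplicated representations as described in the Remark below):

```python
def fr_of_trajectory(traj, visit_index_of_vw):
    """Keep the part after v_w is visited and drop repeated states (first visits only)."""
    suffix = traj[visit_index_of_vw:]
    seen, fr = set(), []
    for s in suffix:
        if s not in seen:
            seen.add(s)
            fr.append(s)
    return fr
```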

Remark. Comparison of states is needed when removing repetitive states in a trajectory. However, raw states or observations may not be directly comparable. In this case, we propose to learn a distinguishable $d$-dimensional representation of any state or observation in the buffer $\mathcal{B}$, i.e., $o_\theta(\cdot):\mathcal{S}\to\mathbb{R}^d$, based on the contrastive loss in (Oord, Li, and Vinyals 2018). Then, if two states $s_1$ and $s_2$ have a high cosine similarity between the vectors $o_\theta(s_1)$ and $o_\theta(s_2)$, i.e., greater than 0.99, the states $s_1$ and $s_2$ are regarded as the same.
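A sketch of this similarity test, assuming `o_theta` returns a NumPy feature vector for a state:

```python
import numpy as np

def same_state(s1, s2, o_theta, thresh=0.99):
    """Treat two observations as the same state if their learned embeddings align."""
    v1, v2 = o_theta(s1), o_theta(s2)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return cos > thresh
```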

Contrastive Learning Objective. Since the task has multiple temporally extended hidden subgoals, applying conventional contrastive learning to train $f_\omega$ based on the FRs of selected trajectories would produce multiple subgoals, some of which are not the next subgoal to achieve. For the example in Figure 1, when $\mathcal{T}_\varphi$ has only one node labeled with the key state of subgoal "c" and the working node $v_w$ is at that node, direct application of conventional contrastive learning can produce multiple key states, including the states of "b", "w" and "d". However, "d" is not the next subgoal to achieve after "c", which can distract the learning of hidden subgoals. Since hidden subgoals need to be achieved following the temporal ordering constraints, the agent needs to select only one of the next subgoals to expand the subgoal tree $\mathcal{T}_\varphi$, namely the one with the smallest temporal distance from the subgoal of the working node $v_w$.

Therefore, we propose a new contrastive learning objective which detects key states for subgoals and learns their temporal distances at the same time. The proposed contrastive objective is formulated with one positive sample and multiple negative ones. Specifically, for a trajectory $\tau^+$ conditioned on $v_w$ randomly selected from $\mathcal{B}_P$, with the FR of the pre-processed trajectory (introduced above) denoted as $\tilde{\tau}^+$, the positive sample is one state $s_{t^+}$ drawn from $\tilde{\tau}^+$ where the time index $t^+$ is sampled from the temporal geometric distribution, i.e., $t^+\sim\text{Geom}(1-\gamma)$, where $\text{Geom}(1-\gamma)$ is the geometric distribution with parameter $1-\gamma$ and $\gamma$ is the discount factor of the environmental MDP. The negative samples consist of $B$ states, denoted as $\{s_{t_i^-}\}_{i=1}^B$. To sample each $s_{t_i^-}$, a negative trajectory $\tau_i^-$ conditioned on $v_w$ is randomly selected from $\mathcal{B}_N$; after pre-processing and computing the FR of $\tau_i^-$, $s_{t_i^-}$ is sampled from $\tilde{\tau}_i^-$ with $t_i^-\sim\text{Geom}(1-\gamma)$. The proposed contrastive learning objective can then be formulated as

$$\mathcal{L}_{\text{contrastive}}(\omega;\{s_{t^{+}}\},\{s_{t_{i}^{-}}\}_{i=1}^{B}) = \frac{\exp(f_{\omega}(o_{\theta}(s_{t^{+}})))}{\exp(f_{\omega}(o_{\theta}(s_{t^{+}})))+\sum_{i=1}^{B}\exp(f_{\omega}(o_{\theta}(s_{t_{i}^{-}})))} \qquad (3)$$

In the objective above, the numerator is a function of only one state $s_{t^+}$ sampled from a positive trajectory, while the denominator contains multiple states sampled from negative trajectories. This lets the importance function $f_\omega$ concentrate high values on the right state without being distracted by neighbors of the real key state. Temporal geometric sampling gives states that are temporally closer to the initial state of a trajectory a higher chance of being used to formulate the contrastive objective. This makes the value of $f_\omega$ inversely proportional to the temporal distance of the input state, so that the temporal ordering is reflected by the values of $f_\omega$: the higher the value of $f_\omega$, the earlier the state's position in the temporal ordering.
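A minimal PyTorch sketch of the sampling scheme and objective (3) follows; `f_omega` and `o_theta` are assumed to be differentiable modules, and clipping the geometric index to the trajectory length is a practical choice of ours, not specified in the text. In training, this ratio would be maximized (e.g., by gradient ascent) averaged over many sampled tuples.

```python
import numpy as np
import torch

def sample_state_geometric(fr_traj, gamma):
    """Draw one state from an FR trajectory with time index t ~ Geom(1 - gamma)."""
    t = np.random.geometric(1.0 - gamma)             # t in {1, 2, ...}
    return fr_traj[min(t - 1, len(fr_traj) - 1)]     # clip to trajectory length

def contrastive_objective(f_omega, o_theta, pos_fr, neg_frs, gamma):
    """Objective (3): one positive state vs. B negative states (to be maximized)."""
    s_pos = sample_state_geometric(pos_fr, gamma)
    pos = torch.exp(f_omega(o_theta(s_pos)))
    neg = sum(torch.exp(f_omega(o_theta(sample_state_geometric(fr, gamma))))
              for fr in neg_frs)
    return pos / (pos + neg)
```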

After the objective in (3) is maximized with a sufficient number of states sampled from positive and negative trajectories, the outputs of $f_\omega$ at key states of subgoals become much larger than at other states. The state with the highest value of $f_\omega$ (i.e., the smallest temporal distance) is selected as the key state of the next subgoal to achieve and is used to expand the subgoal tree $\mathcal{T}_\varphi$.

Exploration Policy

The exploration policy $\pi_{\text{exp}}$ is realized by a GRU-based policy model. Each action is history-dependent and drawn from the action distribution at the output of the policy model. The policy $\pi_{\text{exp}}$ is trained by the recurrent PPO algorithm (Goyal et al. 2020; Ni et al. 2021), which extends classical PPO (Schulman et al. 2017) to the POMDP setting. The binary label of task satisfaction given at the end of a trajectory is used as the reward for training $\pi_{\text{exp}}$.

Additionally, in order to guide exploration toward collecting more trajectories conditioned on the working node $v_w$ in $\mathcal{T}_\varphi$, auxiliary rewards ($r_a>0$) are given for visits to the key states along the path $\xi_w$ from the root to $v_w$, in the temporal order specified by $\xi_w$. Training $\pi_{\text{exp}}$ with these rewards encourages the agent to collect more trajectory data useful for discovering next subgoals at $v_w$. For the example in Figure 6, if a collected trajectory $\tau$ is conditioned on $v_w$ ($=\hat{s}_2$), $r_a$ is given for the first visit to $\hat{s}_1$ and then for the first visit to $\hat{s}_2$. If $\hat{s}_2$ is visited first in a trajectory $\tau'$, no $r_a$ is given to any state in $\tau'$, since $\tau'$ follows another path in $\mathcal{T}_\varphi$ that does not contain $v_w$ and hence is not conditioned on $v_w$.
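A simplified sketch of how this shaped reward could be assigned during a rollout (our own illustration; $r_a=0.1$ is a hypothetical value, state comparison is assumed to use the learned representation as in the earlier Remark, and the stricter rule that an out-of-order trajectory receives no auxiliary reward at all would require an extra check):

```python
def shaped_reward(traj_states, xi_w, task_label, r_a=0.1):
    """Per-step rewards: r_a for visiting xi_w's key states in order,
    plus the binary task-completion label at the final step."""
    rewards = [0.0] * len(traj_states)
    next_idx = 0                                  # next key state of xi_w to reach
    for t, s in enumerate(traj_states):
        if next_idx < len(xi_w) and s == xi_w[next_idx]:
            rewards[t] += r_a                     # first in-order visit to xi_w[next_idx]
            next_idx += 1
    rewards[-1] += float(task_label)              # sparse task reward at episode end
    return rewards
```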

4.3 Labeling

Given the constructed subgoal tree $\mathcal{T}_\varphi$ and the FSM $\mathcal{M}_\varphi$, in the labeling part of LSTOC the mapping from $\hat{\mathcal{S}}_K$ to the subgoal semantic symbols $\mathcal{G}$ (the labeling function $L_\varphi$) is determined by solving an integer linear programming (ILP) problem, which requires every discovered satisfying sequence to lead to an accepting state in $\mathcal{M}_\varphi$ and hence yields the semantic meaning of every hidden subgoal. The details of the ILP problem are introduced in Appendix A.
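The ILP formulation is given in Appendix A; purely as an illustration of what the labeling must achieve, the sketch below brute-forces injective mappings from discovered key states to symbols in $\mathcal{G}$ and returns one under which every discovered satisfying sequence is accepted by $\mathcal{M}_\varphi$ (using the `SubgoalFSM.accepts` sketch from Section 3.2). This exhaustive search assumes $|\mathcal{G}|\ge|\hat{\mathcal{S}}_K|$, is only practical for small $|\hat{\mathcal{S}}_K|$, and is not the authors' ILP method.

```python
from itertools import permutations

def find_labeling(key_states, symbols, satisfying_sequences, fsm):
    """Brute-force a key-state -> symbol mapping consistent with the FSM."""
    for perm in permutations(symbols, len(key_states)):
        mapping = dict(zip(key_states, perm))
        if all(fsm.accepts([mapping[s] for s in seq]) for seq in satisfying_sequences):
            return mapping                     # candidate labeling function L_phi
    return None                                # no consistent assignment found
```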

4.4 Algorithm

Since the agent initially knows little about the environment, it is difficult to set the initiation condition (the value $N_T$) for subgoal discovery at every working node. If $N_T$ is too small, there is not enough trajectory data for subgoal discovery; if $N_T$ is too large, many trajectories are redundant, reducing learning efficiency. Therefore, we propose an adaptive mechanism which tries values of $N_T$ from small to large. Specifically, a main loop (Algorithm 1 in Appendix B) calls the LSTOC framework with $N_T=K_T,2K_T,\ldots$ until the hidden subgoals are correctly learned and the ILP problem is successfully solved (Lines 8 and 9 of Algorithm 3 in Appendix B), yielding the labeling function $L_\varphi$. The algorithm tables are presented in Appendix B.
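The main loop can be sketched as follows (Algorithm 1 is in Appendix B; `run_lstoc` is a hypothetical function standing in for one full run of the framework with threshold $N_T$, and the retry limit is our own addition):

```python
def adaptive_main_loop(run_lstoc, K_T, max_rounds=10):
    """Retry LSTOC with N_T = K_T, 2*K_T, ... until subgoals and labeling are found."""
    for i in range(1, max_rounds + 1):
        N_T = i * K_T
        subgoal_tree, labeling = run_lstoc(N_T)
        if labeling is not None:               # ILP solved: subgoals learned correctly
            return subgoal_tree, labeling, N_T
    raise RuntimeError("LSTOC did not converge within the allotted rounds")
```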

5 Experiments

The experiments aim at answering the following questions:

  1. How well does the proposed contrastive learning method detect key states for subgoals and learn their temporal distances?

  2. Can we solve the task and learn hidden subgoals more efficiently with the help of building the subgoal tree?

  3. Do the learned subgoals help generalization to other new tasks?

The details of the environments are introduced in Appendix C. The tasks used for evaluation are presented in Appendix D. Experimental results answering questions 1) and 2) above are presented in Section 5.1, and question 3) is answered in Appendix E. The limitations of LSTOC are discussed at the end. Implementation details and additional results are also included in the Appendix.

5.1 Results

Learning Subgoals

In this section, we evaluate the proposed contrastive learning method, which consists of two operations: 1) computing the first-occupancy representations (FRs) of trajectories; 2) formulating the contrastive objective by using temporal geometric sampling to sample one state from every selected trajectory. The detailed introduction is in Section 4.2.

According to these two operations, we design two baseline contrastive learning methods for comparison. In Baseline 1, the FR computation is omitted, and the contrastive learning objective is formulated as in LSTOC. In Baseline 2, the FR computation is kept the same, but the contrastive learning objective is formulated by the conventional method in (Sun et al. 2023; Zhang and Kashima 2023), where the objective is computed over all states in the selected positive and negative trajectories without a specific state sampling process.

Figure 7: Comparison on accuracy of subgoal learning: (a) Letter Task 1, (b) Office Task 2, (c) Crafter Task 1, (d) Crafter Task 3.

The comparison on subgoal learning is shown in Figure 7, and the results for all tasks are shown in Figure 13 in the Appendix. In every figure, LSTOC and the baselines are compared using the same number of transition samples collected from the environment: $10^6$ samples in the Letter domain, $2\times 10^6$ samples in the Office domain and $2.5\times 10^6$ in the Crafter domain. We can see that the contrastive learning method in LSTOC significantly outperforms the baselines, showing the effects of the FR and temporal geometric sampling. The FR is important since the agent focuses on detecting the key state of the next subgoal only, and the FR prevents distraction by key states of future subgoals. The temporal distances of detected key states are learned by temporal geometric sampling. Baseline 1 performs better than Baseline 2 in the Office and Crafter domains, showing that temporal geometric sampling is more important for key state detection than the FR, since the task can only be accomplished when subgoals are achieved following the temporal ordering constraints.

Figure 8: Comparison of the proposed contrastive learning method in LSTOC and the baseline on learning temporal distance: (a) LSTOC Task 1 Letter, (b) Baseline Task 1 Letter, (c) LSTOC Task 2 Letter, (d) Baseline Task 2 Letter, (e) LSTOC Task 1 Office, (f) Baseline Task 1 Office. The value in each grid cell is the output of the importance function $f_\omega$. The higher the value, the smaller the temporal distance, meaning that the input state is closer to the initial state.

Visualization

Furthermore, we visualize the performance of the proposed contrastive learning method in LSTOC in Figure 8, demonstrating its ability to detect key states and learn their temporal distances. We use the exploration policy $\pi_{\text{exp}}$ to collect a fixed number of trajectories labeled with task-completion results, based on which we apply the proposed contrastive learning method to learn the key states of subgoals. The number of collected trajectories is 5000 in the Letter domain and 8000 in the Office domain. The proposed contrastive learning method in LSTOC is compared with a baseline which first computes the FRs of trajectories and then uses the classical contrastive objective (Zhang and Kashima 2023) to detect key states. Specifically, in Figure 8, the task in the first row is task 1 in the Letter domain in Figure 11(a), and the positions of "a", "b" and "c" are $(3,1)$, $(5,2)$ and $(7,7)$. In the second row, the task is task 2 in the Letter domain of Figure 11(a), and the positions of "a", "b", "c" and "d" are $(3,2)$, $(1,8)$, $(7,9)$ and $(5,6)$. In the third row, the task is task 1 in the Office domain of Figure 11(b), and the positions of "c", "o" and "p" are $(20,14)$, $(14,2)$ and $(3,14)$. Note that the positions of subgoals are unknown to the agent initially. The contrastive learning is conducted on a fixed set of trajectories.

In Figure 8, the output of the importance function $f_\omega$ visualized in every grid cell is inversely proportional to the temporal distance (ordering) of the input state. In the evaluation of LSTOC, the positions of subgoals are all detected correctly. Besides, the key state of the first subgoal of every task has the highest value at the output of $f_\omega$ by a significant margin and hence the smallest temporal distance. The relative temporal distances of key states learned by LSTOC obey the temporal orderings in the task specification. Compared with the baseline, the proposed contrastive learning method in LSTOC detects the key states of subgoals accurately, while the baseline method is influenced by neighboring states. Besides, the output of $f_\omega$ trained by the proposed method clearly reflects the temporal orderings of detected subgoals in the task specification (the higher the value of $f_\omega$, the earlier the subgoal in the temporal ordering), while the baseline shows some ambiguity in reflecting subgoal temporal orderings in some cases.

However, when using the proposed method in LSTOC, the key states of subgoals which have large temporal distances (far away from the initial state) do not stand out at the output of $f_\omega$ among other neighboring non-key states, e.g., in Figure 8(c) and 8(e). This is because the contrastive learning is conducted on a fixed set of trajectories, which can only be used to detect the first or next subgoal to achieve. That is also the reason why we propose to detect subgoals one by one by building a subgoal tree.

Figure 9: Comparison of efficiency on task solving for (a) Letter Task 1, (b) Office Task 2, (c) Crafter Task 1 and (d) Crafter Task 3. The y-axis is the success rate of completing the task; the x-axis is the environment step.

Learning Efficiency

As introduced in Section 4.2, in order to collect more trajectories conditioned on the working node, LSTOC adds an auxiliary reward for every visit to a discovered key state that brings the agent toward the working node in $\mathcal{T}_\varphi$. This can significantly accelerate the agent's learning to solve tasks composed of temporally extended subgoals. To verify this, we compare LSTOC with a baseline method which uses an RNN-based policy trained by the PPO algorithm to solve the task. The results for selected tasks are shown in Figure 9, with complete results in Figure 14 in the Appendix. We can see that LSTOC solves every task significantly faster than the baseline. This is because the baseline does not specifically learn the subgoals and does not use auxiliary rewards to resolve the issue of reward sparsity.

In addition, we evaluate the average number of trajectories (both positive and negative) collected until every subgoal is learned (i.e., the condition for executing Line 11 of Algorithm 3 is met). The baseline method uses a fixed exploration policy $\bar{\pi}_{\text{exp}}$ to collect trajectories, based on which the contrastive learning in Section 4.2 is applied to learn subgoals. Specifically, the policy $\bar{\pi}_{\text{exp}}$ is trained only by the reward of task completion before the first tree expansion is conducted. The comparison results are shown in Table 1. Compared with LSTOC, the difference of the baseline is that the exploration policy $\bar{\pi}_{\text{exp}}$ is not further tuned to collect more trajectories conditioned on the working node. In this case, discovering subgoals along harder-to-reach branches of the task FSM costs more trajectory data, which explains the advantage of LSTOC in Table 1.

Task            | LSTOC        | Baseline
Task 2 Letter   | 18123 ± 2915 | 39256 ± 2567
Task 2 Office   | 32195 ± 5623 | 56232 ± 4621
Task 3 Crafter  | 33562 ± 4482 | 50172 ± 5215
Table 1: Comparison of subgoal learning efficiency (average number of collected trajectories until all subgoals are learned).

6 Correctness and Limitation

Empirically, as long as $N_T$ is large enough and the exploration policy $\pi_{\text{exp}}$ is sufficiently random, the contrastive learning in (3) can always discover the correct key state of the next subgoal at every working node. Then, the correct subgoal tree can be built and the labeling function can be correctly obtained by solving the ILP problem. Specifically, the randomness of $\pi_{\text{exp}}$ can be guaranteed by using $\epsilon$-greedy action selection, where $\epsilon=0.5$ is enough for all the environments.

However, the LSTOC framework still has limitations. In some cases, the labeling component cannot distinguish environmental bottleneck states from hidden subgoals. In other cases, the labeling component cannot tell apart symmetric branches in the given FSM. Furthermore, trajectory collection can be problematic in some hard-exploration environments. The details of correctness and limitations are presented in Appendix G.

7 Conclusion

In this work, we propose a framework for learning hidden subgoals under temporal ordering constraints, including a new contrastive learning method and a sample-efficient learning strategy for temporally extended hidden subgoals. In future work, we will address the issues of extending this work to hard-exploration environments and improve its sample efficiency in environments with large state spaces.

References

  • Baier and Katoen (2008) Baier, C.; and Katoen, J.-P. 2008. Principles of model checking. MIT press.
  • Bozkurt et al. (2020) Bozkurt, A. K.; Wang, Y.; Zavlanos, M. M.; and Pajic, M. 2020. Control synthesis from linear temporal logic specifications using model-free reinforcement learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 10349–10355. IEEE.
  • Camacho et al. (2019) Camacho, A.; Icarte, R. T.; Klassen, T. Q.; Valenzano, R. A.; and McIlraith, S. A. 2019. LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning. In IJCAI, volume 19, 6065–6073.
  • Casper et al. (2023) Casper, S.; Davies, X.; Shi, C.; Gilbert, T. K.; Scheurer, J.; Rando, J.; Freedman, R.; Korbak, T.; Lindner, D.; Freire, P.; et al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
  • De Giacomo et al. (2019) De Giacomo, G.; Iocchi, L.; Favorito, M.; and Patrizi, F. 2019. Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications. In Proceedings of the international conference on automated planning and scheduling, volume 29, 128–136.
  • De Giacomo and Vardi (2013) De Giacomo, G.; and Vardi, M. Y. 2013. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI’13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, 854–860. Association for Computing Machinery.
  • Gaon and Brafman (2020) Gaon, M.; and Brafman, R. 2020. Reinforcement learning with non-markovian rewards. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 3980–3987.
  • Gonzalez-Aguirre et al. (2021) Gonzalez-Aguirre, J. A.; Osorio-Oliveros, R.; Rodriguez-Hernandez, K. L.; Lizárraga-Iturralde, J.; Morales Menendez, R.; Ramirez-Mendoza, R. A.; Ramirez-Moreno, M. A.; and Lozoya-Santos, J. d. J. 2021. Service robots: Trends and technology. Applied Sciences, 11(22): 10702.
  • Goyal et al. (2020) Goyal, A.; Lamb, A.; Hoffmann, J.; Sodhani, S.; Levine, S.; Bengio, Y.; and Schölkopf, B. 2020. Recurrent Independent Mechanisms. In International Conference on Learning Representations.
  • GurobiOptimization (2023) GurobiOptimization, L. 2023. Gurobi Optimizer Reference Manual.
  • Hafner (2021) Hafner, D. 2021. Benchmarking the Spectrum of Agent Capabilities. arXiv preprint arXiv:2109.06780.
  • Hasanbeig et al. (2019) Hasanbeig, M.; Kantaros, Y.; Abate, A.; Kroening, D.; Pappas, G. J.; and Lee, I. 2019. Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In 2019 IEEE 58th conference on decision and control (CDC), 5338–5343. IEEE.
  • Hatanaka, Yamashina, and Matsubara (2023) Hatanaka, W.; Yamashina, R.; and Matsubara, T. 2023. Reinforcement Learning of Action and Query Policies with LTL Instructions under Uncertain Event Detector. IEEE Robotics and Automation Letters.
  • Icarte et al. (2018) Icarte, R. T.; Klassen, T.; Valenzano, R.; and McIlraith, S. 2018. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, 2107–2116. PMLR.
  • Li et al. (2022) Li, A. C.; Chen, Z.; Vaezipoor, P.; Klassen, T. Q.; Icarte, R. T.; and McIlraith, S. A. 2022. Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines. In Deep Reinforcement Learning Workshop NeurIPS 2022.
  • Liang et al. (2022) Liang, X.; Shu, K.; Lee, K.; and Abbeel, P. 2022. Reward Uncertainty for Exploration in Preference-based Reinforcement Learning. In 10th International Conference on Learning Representations, ICLR 2022. International Conference on Learning Representations.
  • Littman et al. (2017) Littman, M. L.; Topcu, U.; Fu, J.; Isbell, C.; Wen, M.; and MacGlashan, J. 2017. Environment-independent task specifications via GLTL. arXiv preprint arXiv:1704.04341.
  • Luo et al. (2023) Luo, Z.; Mao, J.; Wu, J.; Lozano-Pérez, T.; Tenenbaum, J. B.; and Kaelbling, L. P. 2023. Learning rational subgoals from demonstrations and instructions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 12068–12078.
  • Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. nature, 518(7540): 529–533.
  • Moskovitz, Wilson, and Sahani (2021) Moskovitz, T.; Wilson, S. R.; and Sahani, M. 2021. A First-Occupancy Representation for Reinforcement Learning. In International Conference on Learning Representations.
  • Ni et al. (2021) Ni, T.; Eysenbach, B.; Levine, S.; and Salakhutdinov, R. 2021. Recurrent Model-Free RL is a Strong Baseline for Many POMDPs.
  • Oord, Li, and Vinyals (2018) Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Park et al. (2022) Park, J.; Seo, Y.; Shin, J.; Lee, H.; Abbeel, P.; and Lee, K. 2022. SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning. In International Conference on Learning Representations.
  • Ronca et al. (2022) Ronca, A.; Paludo Licks, G.; De Giacomo, G.; et al. 2022. Markov Abstractions for PAC Reinforcement Learning in Non-Markov Decision Processes. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, 3408–3415. International Joint Conferences on Artificial Intelligence.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Sun et al. (2023) Sun, C.; Yang, W.; Jiralerspong, T.; Malenfant, D.; Alsbury-Nealy, B.; Bengio, Y.; and Richards, B. A. 2023. Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
  • Umili, Capobianco, and De Giacomo (2023) Umili, E.; Capobianco, R.; and De Giacomo, G. 2023. Grounding LTLf specifications in image sequences. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, volume 19, 668–678.
  • Xu et al. (2021) Xu, Z.; Wu, B.; Ojha, A.; Neider, D.; and Topcu, U. 2021. Active finite reward automaton inference and reinforcement learning using queries and counterexamples. In Machine Learning and Knowledge Extraction: 5th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2021, Virtual Event, August 17–20, 2021, Proceedings 5, 115–135. Springer.
  • Zhang and Kashima (2023) Zhang, G.; and Kashima, H. 2023. Learning state importance for preference-based reinforcement learning. Machine Learning, 1–17.

Appendix A Labeling Component of LSTOC

The FSM $\mathcal{M}_{\varphi}$ is the user-given specification of subgoal temporal dependencies, where each node is described by a subgoal semantic symbol in $\mathcal{G}$. Denote the state space of $\mathcal{M}_{\varphi}$ as $\mathcal{U}$, which has $U$ states. The transition function of $\mathcal{M}_{\varphi}$ is defined as $\delta_{\varphi}:\mathcal{U}\times\mathcal{G}\times\mathcal{U}\to\{0,1\}$, and the transition from state $u$ to $u^{\prime}$ conditioned on symbol $g$ is expressed as $\delta_{\varphi}(u,g,u^{\prime})=1$, where $g$ is the index of the symbol in $\mathcal{G}$.

Assume that the set of discovered satisfying sequences $\mathcal{P}_{S}$ has $P$ sequences, and that the $m$-th sequence has $l_{m}$ elements. Note that $\mathcal{P}_{S}$ consists of all the paths in the subgoal tree $\mathcal{T}_{\varphi}$, which is built by the subgoal learning component of the LSTOC framework. The set $\hat{\mathcal{S}}_{K}$ contains the discovered key states of subgoals. The goal of the labeling component of LSTOC is to determine the mapping from $\hat{\mathcal{S}}_{K}$ to $\mathcal{G}$ such that every sequence in $\mathcal{P}_{S}$ leads to an accepting state in $\mathcal{M}_{\varphi}$, which yields the labeling function $L_{\varphi}$.

We now formulate the integer linear programming (ILP) problem for learning the labeling function. The binary variables of this ILP problem are the state-transition variables $u_{m,n,i,j}$ and the mapping variables $v_{k,l}$, where $u_{m,n,i,j}=1$ denotes that the $n$-th element of the $m$-th sequence in $\mathcal{P}_{S}$ makes the agent transit from state $i$ to state $j$ over the FSM $\mathcal{M}_{\varphi}$, and $v_{k,l}=1$ denotes that the $k$-th state in $\hat{\mathcal{S}}_{K}$ is mapped to the $l$-th symbol in $\mathcal{G}$. Based on these definitions, we first have five constraints on these binary variables:

$\sum_{i,j=1}^{U} u_{m,n,i,j}=1, \quad \forall m=1,\ldots,|\mathcal{P}_{S}|,\; n=1,\ldots,l_{m}$   (4)
$\sum_{j=1}^{U} u_{m,1,1,j}=1, \quad \forall m=1,\ldots,|\mathcal{P}_{S}|$   (5)
$\sum_{i=1}^{U} u_{m,n,i,j}=\sum_{i=1}^{U} u_{m,n+1,j,i}, \quad \forall m=1,\ldots,|\mathcal{P}_{S}|,\; n=1,\ldots,l_{m}-1,\; j=1,\ldots,U$   (6)
$\sum_{l=1}^{|\mathcal{G}|} v_{k,l}\leq 1, \quad \forall k=1,\ldots,|\hat{\mathcal{S}}_{K}|$   (7)
$\sum_{k=1}^{|\hat{\mathcal{S}}_{K}|} v_{k,l}=1, \quad \forall l=1,\ldots,|\mathcal{G}|$   (8)

These constraints mean: 1) every element of every sequence in $\mathcal{P}_{S}$ makes exactly one transition, possibly a self-loop that stays at the same state of the FSM $\mathcal{M}_{\varphi}$; 2) the first element of every sequence makes its transition from the initial state of $\mathcal{M}_{\varphi}$; 3) for any pair of consecutive elements of a sequence, the state reached by the former element is the state from which the latter element departs; 4) every state in $\hat{\mathcal{S}}_{K}$ is mapped to at most one semantic symbol in $\mathcal{G}$, since some discovered key states may be redundant; 5) every symbol in $\mathcal{G}$ is associated with exactly one discovered key state in $\hat{\mathcal{S}}_{K}$.

Since the transition variables $u_{m,n,i,j}$ and the mapping variables $v_{k,l}$ must be consistent with the transition function $\delta_{\varphi}$ of $\mathcal{M}_{\varphi}$, we have another set of constraints:

$u_{m,n,i,j}\leq\delta_{\varphi}(i,l,j)\cdot v_{k,l}$   (9)

where $k$ indexes the key state in $\hat{\mathcal{S}}_{K}$ that appears as the $n$-th element of the $m$-th sequence, and $m,n,i,j$ and $l$ range over the same values as in the constraints above.

Finally, we have a set of constraints ensuring that the last element of every sequence in $\mathcal{P}_{S}$ brings the agent to an accepting state of the FSM $\mathcal{M}_{\varphi}$:

$\sum_{i=1}^{U}\sum_{j\in\mathcal{U}_{F}} u_{m,l_{m},i,j}=1, \quad \forall m=1,\ldots,|\mathcal{P}_{S}|$   (10)

where $\mathcal{U}_{F}$ denotes the set of accepting states of $\mathcal{M}_{\varphi}$. In order to ignore states in $\hat{\mathcal{S}}_{K}$ that are not associated with any subgoal during mapping, such as a bottleneck state in the environment layout, we use the sum of the mapping variables as the objective:

$\sum_{k=1}^{|\hat{\mathcal{S}}_{K}|}\sum_{l=1}^{|\mathcal{G}|} v_{k,l}$   (11)

The formulated ILP problem has the objective (11) and the constraints (4)–(10). We solve it with the Gurobi solver (GurobiOptimization 2023).
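For concreteness, the following sketch sets up a labeling ILP of this form with the Python interface of Gurobi (gurobipy). It is a minimal illustration rather than the released implementation: the toy FSM (a sequential task $a;b;c$), the example satisfying sequence, the summed form of constraint (9), and the choice of minimizing objective (11) are all assumptions made for the example.

```python
# A minimal sketch (not the released code) of the labeling ILP, assuming a toy
# sequential FSM a;b;c and a hypothetical discovered sequence of key-state indices.
import gurobipy as gp
from gurobipy import GRB

U = 4                       # FSM states: 0 (initial), 1, 2, 3 (accepting)
accepting = {3}
symbols = ["a", "b", "c"]   # |G| = 3

def delta_f(i, l, j):
    """delta_phi(i, l, j): 1 if symbol index l moves the FSM from state i to state j."""
    return {(0, 0, 1): 1, (1, 1, 2): 1, (2, 2, 3): 1}.get((i, l, j), 0)

seqs = [[0, 1, 2]]          # P_S: each element is an index into the key-state set S_K
K = 3                       # |S_K|: number of discovered key states

model = gp.Model("labeling")
max_len = max(len(s) for s in seqs)
u = model.addVars(len(seqs), max_len, U, U, vtype=GRB.BINARY)   # transition variables
v = model.addVars(K, len(symbols), vtype=GRB.BINARY)            # mapping variables

for p, seq in enumerate(seqs):
    for n, k in enumerate(seq):
        # (4) each element makes exactly one FSM transition
        model.addConstr(u.sum(p, n, "*", "*") == 1)
        # (9) the transition must be allowed by a symbol that key state k is mapped to
        for i in range(U):
            for j in range(U):
                model.addConstr(u[p, n, i, j] <= gp.quicksum(
                    delta_f(i, l, j) * v[k, l] for l in range(len(symbols))))
    # (5) the first element leaves the initial state 0
    model.addConstr(u.sum(p, 0, 0, "*") == 1)
    # (6) consecutive elements chain through the same FSM state
    for n in range(len(seq) - 1):
        for j in range(U):
            model.addConstr(u.sum(p, n, "*", j) == u.sum(p, n + 1, j, "*"))
    # (10) the last element ends in an accepting state
    model.addConstr(gp.quicksum(u[p, len(seq) - 1, i, j]
                                for i in range(U) for j in accepting) == 1)

# (7)/(8): each key state maps to at most one symbol; each symbol to exactly one key state
for k in range(K):
    model.addConstr(v.sum(k, "*") <= 1)
for l in range(len(symbols)):
    model.addConstr(v.sum("*", l) == 1)

model.setObjective(v.sum(), GRB.MINIMIZE)   # objective (11); minimized here as one plausible reading
model.optimize()
```

For the toy FSM above, the solver would map the three key states appearing in the sequence to the symbols a, b and c in order; redundant key states would simply remain unmapped because of constraint (7).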

Appendix B Algorithms

We present the algorithms of the proposed framework in this section. Since the agent is not familiar with the environment in advance and the optimal choice of $N_{T}$ is not known, we design a main loop that calls LSTOC (described in Algorithm 3) with incrementally increasing values of $N_{T}$. In our implementation, we set $N_{T}=80$ and $H_{T}=20$ for every domain.

The subgoal discovery process at a working node is described in Algorithm 2. It is the same as the process described in Section 4.2. The inputs $\mathcal{D}_{P}$ and $\mathcal{D}_{N}$ only contain trajectories conditioned on the current working node. In line 3, the discriminative state representation is first pre-trained on the data in $\mathcal{B}_{P}\cup\mathcal{B}_{N}$. In lines 5 and 6, trajectories in $\mathcal{D}_{P}$ and $\mathcal{D}_{N}$ are truncated to keep only the part after the working node is achieved (the red part in Figure 6, for example), and the FR of every truncated trajectory is then computed by removing repetitive states. Then, from line 7 to 13, the importance function $f_{\omega}$ is iteratively trained with the objective formulated in (3). In our implementation, the number of iterations $L$ is set to $700$ for every domain, the batch size of negative samples $B$ is $64$, and, due to the simple architecture of $f_{\omega}$, the learning rate of each iteration is set to $0.01$. In line 14, the state with the highest value of $f_{\omega}$ (highest temporal ordering) is selected as the discovered key state of the next subgoal and returned to LSTOC in line 15.
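To make the data processing in Algorithm 2 concrete, the sketch below shows one way the FR truncation and the temporal geometric sampling $t\sim\text{Geom}(1-\gamma)$ could be implemented; the function names, the state encoding, and the toy trajectories are illustrative assumptions, not the released code.

```python
import numpy as np

def first_occupancy(trajectory, working_state):
    """Keep the suffix after the working node is reached, then drop repeated states.

    A sketch of the FR processing used in Algorithm 2 (not the released code);
    `trajectory` is a list of hashable states, e.g., encoded observations.
    """
    if working_state in trajectory:
        trajectory = trajectory[trajectory.index(working_state) + 1:]
    seen, fr = set(), []
    for s in trajectory:
        if s not in seen:
            seen.add(s)
            fr.append(s)
    return fr

def geometric_sample(fr_trajectory, gamma, rng=None):
    """Sample one state at offset t ~ Geom(1 - gamma), clipped to the trajectory end."""
    rng = rng or np.random.default_rng()
    t = int(rng.geometric(1.0 - gamma))          # t = 1, 2, ...
    return fr_trajectory[min(t, len(fr_trajectory)) - 1]

# Example: one positive sample and a few negative samples from toy trajectories.
gamma = 0.95
positive = geometric_sample(first_occupancy(["s0", "w", "s3", "s3", "s5"], "w"), gamma)
negatives = [geometric_sample(first_occupancy(traj, "w"), gamma)
             for traj in (["s0", "w", "s2", "s2"], ["w", "s4", "s4", "s1"])]
```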

LSTOC is described in Algorithm 3. In line 5, the agent first collects a sufficient number of trajectories from the environment. The loop starting at line 6 is the iterative process of building the subgoal tree $\mathcal{T}_{\varphi}$, which terminates when 1) every positive trajectory is explained by $\mathcal{T}_{\varphi}$ (line 8) and the ILP problem for symbol grounding is successfully solved (line 9); or 2) $\mathcal{T}_{\varphi}$ is wrongly built (line 37), detected by condition (***) introduced in the following paragraph. The loop starting at line 7 collects a sufficient number of positive trajectories for subgoal discovery at the current working node $v_{w}$. In lines 14 to 16, trajectories conditioned on $v_{w}$ are selected from $\mathcal{B}_{P}$ and $\mathcal{B}_{N}$, explained trajectories are discarded, and the results are stored as $\mathcal{D}_{P}$ and $\mathcal{D}_{N}$. In line 17, if every trajectory in $\mathcal{D}_{P}$ that follows the path of $v_{w}$ leads to the accomplishment of the task (a positive label), the path $p_{w}$ is a newly discovered satisfying sequence and the tree expansion moves back to the parent node of $v_{w}$, initiating another iteration. In line 21, if condition (**) of being fully explored is met, the tree expansion also moves to the parent node of $v_{w}$ and starts another iteration of expansion. In line 24, if the condition for subgoal discovery at $v_{w}$ is not met, more trajectories are collected by the Explore process and the exploration policy is further trained so that more trajectories conditioned on $v_{w}$ are collected. Otherwise, the subgoal discovery process is called in line 32. After the key state of a new subgoal is discovered, in lines 33 to 39, $\mathcal{T}_{\varphi}$ and the set of key states $\hat{\mathcal{S}}_{K}$ are expanded by adding a new node $v_{\text{new}}$. Then, in lines 40 and 41, the working node is moved to $v_{\text{new}}$.

Condition (*) in Algorithm 3 states that $p_{w}$ is a newly discovered satisfying sequence (a concept defined in Section 3.2) and that no further expansion at the current working node is needed. Condition (**) states that every positive trajectory conditioned on $v_{w}$ can be explained and no further expansion at the current working node is needed. Condition (***) refers to situations where $\mathcal{T}_{\varphi}$ is wrongly built; it becomes true when the longest path in $\mathcal{T}_{\varphi}$ is longer than that in $\mathcal{M}_{\varphi}$, or when the largest node degree in $\mathcal{T}_{\varphi}$ exceeds that in $\mathcal{M}_{\varphi}$.
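As an illustration, the sketch below shows how condition (***) could be checked on a simple node structure for $\mathcal{T}_{\varphi}$; the field names and the FSM bounds passed as arguments are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    key_state: object = None                    # discovered key state labeling this node (None at root)
    children: list = field(default_factory=list)
    parent: "TreeNode" = None

def depth(node):
    """Length (in edges) of the longest path from `node` down the subgoal tree."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)

def max_degree(node):
    """Largest branching factor anywhere in the subgoal tree."""
    return max([len(node.children)] + [max_degree(c) for c in node.children])

def is_inconsistent(root, fsm_longest_path, fsm_max_degree):
    """Condition (***): the tree cannot be embedded in the FSM of the task."""
    return depth(root) > fsm_longest_path or max_degree(root) > fsm_max_degree
```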

Algorithm 1 Main Loop
1: Require $M_{\varphi}$: FSM representing the temporal dependencies of subgoals; $K_{T}$: threshold increment for each round of LSTOC;
2: $N_{T}\leftarrow K_{T}$;
3: while True do
4:     result, dict $\leftarrow$ LSTOC$(N_{T},M_{\varphi})$; % Call Algorithm 3
5:     if result is True then
6:         Return dict; % Hidden subgoals and their semantic symbols are learned and returned
7:     else
8:         $N_{T}\leftarrow N_{T}+K_{T}$; % The threshold $N_{T}$ is increased to start another call of the LSTOC algorithm
9:     end if
10: end while
Algorithm 2 Contrastive$(\mathcal{D}_{P},\mathcal{D}_{N},v_{w})$
1: Input $\mathcal{D}_{P}$ ($\mathcal{D}_{N}$): buffer of selected positive (negative) trajectories for subgoal discovery; $v_{w}$: current working node;
2: Notation FR(): function for computing the FR introduced in Section 4.2; $o_{\theta}$: representation of the non-symbolic state or observation; $f_{\omega}$: function inversely proportional to the temporal distance; $B$: number of negative states used to formulate the contrastive objective; $\gamma$: discount factor of the working environment;
3: Pre-train the state representation $o_{\theta}$ with the InfoNCE loss (Oord, Li, and Vinyals 2018);
4: Initialize $f_{\omega}$;
5: $\tilde{\mathcal{D}}_{P}\leftarrow\text{FR}(\mathcal{D}_{P},v_{w})$; % Process introduced in the "FR" paragraph of Section 4.2
6: $\tilde{\mathcal{D}}_{N}\leftarrow\text{FR}(\mathcal{D}_{N},v_{w})$;
7: for $i=1,2,\ldots,L$ do
8:     Sample $\tau^{+}\sim\tilde{\mathcal{D}}_{P}$;
9:     $\{s_{t^{+}}\}\leftarrow$ sample one state from $\tau^{+}$ with $t^{+}\sim\text{Geom}(1-\gamma)$;
10:     $\{s_{t_{i}^{-}}\}_{i=1}^{B}\leftarrow$ sample $B$ trajectories from $\tilde{\mathcal{D}}_{N}$ and sample $B$ states from these trajectories with $t_{i}^{-}\sim\text{Geom}(1-\gamma)$;
11:     Formulate objective (3) with $\{s_{t^{+}}\}$ and $\{s_{t_{i}^{-}}\}_{i=1}^{B}$;
12:     Compute the gradient of (3) to train $f_{\omega}$;
13: end for
14: $\hat{s}_{\text{new}}\leftarrow$ select the state with the highest output of $f_{\omega}$ among the trajectories in $\mathcal{D}_{P}$;
15: Return $\hat{s}_{\text{new}}$
Algorithm 3 LSTOC$(N_{T},M_{\varphi})$
1: Input $N_{T}$: threshold on the number of unexplained positive trajectories needed to initiate a subgoal discovery; $M_{\varphi}$: the FSM for temporal dependencies of subgoals;
2: Notation $\mathcal{T}_{\varphi}$: subgoal tree; $v_{w}$: working node on $\mathcal{T}_{\varphi}$; $p_{w}$: working path (sequence of discovered key states along the path from $v_{0}$ to $v_{w}$ on $\mathcal{T}_{\varphi}$); $\mathcal{P}_{S}$: set of discovered satisfying sequences; $\mathcal{D}_{P}$ ($\mathcal{D}_{N}$): selected unexplained trajectories for subgoal discovery; Explore$(X)$: collect trajectories from the environment with policy $\pi_{\text{exp}}$ until $X$ positive trajectories are collected; $B_{T}$: number of positive trajectories collected in each exploration; other notations are introduced in the caption of Figure 4;
3: Initialize $\mathcal{T}_{\varphi}$ with root node $v_{0}$ and set $v_{w}\leftarrow v_{0}$;
4: Initialize $p_{w}\leftarrow[\,]$, $\hat{\mathcal{S}}_{K}\leftarrow\{\}$ and $\mathcal{P}_{S}\leftarrow\{\}$;
5: $\mathcal{B}_{P},\mathcal{B}_{N}\leftarrow$ Explore$(N_{T})$;
6: while True do
7:     while True do
8:         if $v_{w}==v_{0}$ and $\mathcal{B}_{P}$ can all be explained by $\mathcal{P}_{S}$ then
9:              result, $L_{\varphi}\leftarrow$ ILP$(M_{\varphi},\mathcal{P}_{S})$;
10:              if result is True then
11:                  Return True, $\{L_{\varphi},\hat{\mathcal{S}}_{K}\}$ % Hidden subgoals and temporal orderings are all correctly learned
12:              end if
13:         end if
14:         $\mathcal{D}_{P}\leftarrow$ select trajectories conditioned on $v_{w}$ from $\mathcal{B}_{P}$;
15:         $\mathcal{D}_{P}\leftarrow$ discard trajectories in $\mathcal{D}_{P}$ which can be explained by paths in $\mathcal{P}_{S}$;
16:         $\mathcal{D}_{N}\leftarrow$ select trajectories conditioned on $v_{w}$ from $\mathcal{B}_{N}$;
17:         if $p_{w}$ is a satisfying sequence then % Condition (*)
18:              $\mathcal{P}_{S}\leftarrow\mathcal{P}_{S}\cup\{p_{w}\}$;
19:              $v_{w}\leftarrow$ parent of $v_{w}$;
20:              $p_{w}\leftarrow p_{w}[:-1]$; % Discard the last element of path $p_{w}$
21:         else if $v_{w}$ is fully explored then % Condition (**)
22:              $v_{w}\leftarrow$ parent of $v_{w}$;
23:              $p_{w}\leftarrow p_{w}[:-1]$;
24:         else if $|\mathcal{D}_{P}|<N_{T}$ then
25:              $\mathcal{B}_{P}^{\prime},\mathcal{B}_{N}^{\prime}\leftarrow$ Explore$(B_{T})$;
26:              $\mathcal{B}_{P}\leftarrow\mathcal{B}_{P}\cup\mathcal{B}_{P}^{\prime}$, $\mathcal{B}_{N}\leftarrow\mathcal{B}_{N}\cup\mathcal{B}_{N}^{\prime}$;
27:              PPO$(\mathcal{B}_{P}^{\prime}\cup\mathcal{B}_{N})$; % Train $\pi_{\text{exp}}$ with PPO
28:         else
29:              Break;
30:         end if
31:     end while
32:     $\hat{s}_{\text{new}}\leftarrow$ Contrastive$(\mathcal{D}_{P},\mathcal{D}_{N},v_{w})$; % Call Algorithm 2 to discover the next subgoal
33:     if $\hat{s}_{\text{new}}$ not in $\hat{\mathcal{S}}_{K}$ then
34:         $\hat{\mathcal{S}}_{K}\leftarrow\hat{\mathcal{S}}_{K}\cup\{\hat{s}_{\text{new}}\}$;
35:     end if
36:     Add a new node $v_{\text{new}}$ to $\mathcal{T}_{\varphi}$ as a child of $v_{w}$, labeled with $\hat{s}_{\text{new}}$;
37:     if $\mathcal{T}_{\varphi}$ is inconsistent with $M_{\varphi}$ then % Condition (***)
38:         Return False, None % Subgoals are wrongly learned and another round of LSTOC is needed
39:     end if
40:     $v_{w}\leftarrow v_{\text{new}}$;
41:     $p_{w}\leftarrow p_{w}\cup[\hat{s}_{\text{new}}]$;
42: end while
Figure 10: Domains: (a) Letter, (b) Office, (c) Crafter. The agent has full or partial observations in the Letter domain and partial observations in the Office and Crafter domains. The Crafter domain provides pixel-based observations to the agent.

Appendix C Environments

The environment domains adopted in the experiments include both grid-based and pixel-based observations. Note that in all environment domains there is no labeling function that maps the agent's observation to any letter or item, so letters and items are all hidden to the agent.

Letter. The first environment domain is Letter. As shown in Figure 10(a), multiple letters are placed on the map. In this domain, the agent observes either the full or a partial map as an image-based tensor without any explicit information on the locations of letters. If the map has size $m\times n$ with $k$ letters, the observation is an $m\times n\times(k+1)$ binary tensor. The agent's actions are movements in the four cardinal directions. Only a subset of the letters on the map is used as task subgoals, which makes the problem more challenging.
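As a toy illustration of this encoding, the snippet below builds one such binary observation tensor, assuming hypothetical letter positions and that the extra channel marks the agent's position.

```python
import numpy as np

# Toy illustration of the Letter observation encoding (not the released code):
# an m x n x (k+1) binary tensor with one channel per letter plus one extra channel,
# assumed here to mark the agent's position.
m, n, letters = 5, 5, ["a", "b", "c"]              # hypothetical 5x5 map with k = 3 letters
obs = np.zeros((m, n, len(letters) + 1), dtype=np.uint8)
obs[1, 2, letters.index("a")] = 1                   # letter "a" at cell (1, 2)
obs[3, 4, letters.index("c")] = 1                   # letter "c" at cell (3, 4)
obs[0, 0, len(letters)] = 1                         # extra channel: agent at cell (0, 0)
```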

Office. The second environment domain is Office, a variant of the office game widely used in previous papers (Icarte et al. 2018). As shown in Figure 10(b), the agent only has a partial observation of the environment, a $5\times 5$ grid centered on the agent. There are 12 rooms in the map, with some wall segments as obstacles. Five letters, "c", "o", "m", "d" and "p", represent coffee, office, mailbox, desk and printer, and are randomly located in these rooms; each room contains at most one letter. The task subgoals may use a subset of these five letters. The observation of the agent is a $5\times 5\times 7$ binary tensor covering the letters, walls and the agent. The whole map has size $19\times 25$.

Crafter. Crafter is a powerful sandbox for designing custom environments (Hafner 2021). The agent navigates the map to visit landmarks and pick up various objects to finish the task. Our experiments adapt the original environment to focus on the navigation part. An example screenshot is shown in Figure 10(c). In our experiments, the map is a $20\times 20$ grid. The observation of the agent is a $5\times 5$ partial view centered on the agent, where each grid cell is described by $16\times 16$ pixels. The objects include tree, sapling, coal, diamond, energy, furnace and health, abbreviated as "t", "s", "c", "d", "e", "f" and "h". The task formula is described in terms of the letters of these objects, and the agent needs to visit the right objects in the right order.

Figure 11: The FSMs of subgoal temporal dependencies for the given tasks in the (a) Letter, (b) Office, and (c) Crafter domains (three tasks per domain). The agent cannot know the position directly via the interaction with the environment.

Appendix D Tasks

The FSMs describing the temporal dependencies of hidden subgoals are shown in Figure 11. LSTOC is evaluated and compared on tasks with these FSMs. The first task of every domain is a sequential task consisting of 3 subgoals. The other tasks contain converging and diverging branches, designed to evaluate LSTOC's capability of discovering alternative branches. Some tasks, e.g., task 3 in Letter and task 2 in Office, have repetitive subgoals, designed to test whether LSTOC can handle repetitive subgoals.

Appendix E Generalization

We evaluate the generalization capability of LSTOC in the Office and Crafter domains. In the training environment, the hidden subgoals and the labeling function are first learned by the LSTOC framework, where both domains use task 2 in Figures 11(b) and 11(c). Then, in the testing environment, the agent solves an unseen task with the help of auxiliary rewards given by the labeling function (learned by LSTOC during training) and the task FSM. Specifically, the agent gets a positive reward whenever it visits a key state corresponding to a subgoal that makes progress toward an accepting state in the task FSM. The testing environment is different from the training one: in the Office domain, the testing environment has a different layout of wall segments; in the Crafter domain, new and unseen objects, including drink, sand, table and zombie, are added to the testing environment.
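The sketch below illustrates one way such an auxiliary reward could be computed from the learned labeling function and the task FSM; the function names, the reward value, and the simplification that any FSM transition counts as progress are assumptions made for this example.

```python
def auxiliary_reward(key_state, fsm_state, labeling_fn, delta, bonus=1.0):
    """Sketch of the auxiliary reward used in the generalization experiments.

    `labeling_fn` maps a visited key state to a subgoal symbol (or None), and
    `delta(u, g)` returns the next FSM state; both names are hypothetical. Any
    change of FSM state is treated as progress in this simplified sketch.
    """
    symbol = labeling_fn(key_state)
    if symbol is None:                      # state not recognized as a subgoal
        return 0.0, fsm_state
    next_state = delta(fsm_state, symbol)
    if next_state != fsm_state:             # the subgoal advances the task FSM
        return bonus, next_state
    return 0.0, fsm_state
```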

Figure 12: Generalization performance in the (a) Office and (b) Crafter domains. The x-axis is the environmental step. The y-axis is the success rate of completing the given task, which is unseen in training.

In evaluation, both the proposed and baseline methods use an RNN-based agent trained by the PPO algorithm. In the baseline, the agent only receives a positive reward when the task is accomplished, whereas in the proposed method the agent additionally receives the auxiliary rewards. The comparison is shown in Figure 12. The significant acceleration of the proposed method verifies the generalization benefit of the subgoals learned by the LSTOC framework: the learned labeling function adapts to changes of layout and ignores distracting unseen objects.

Figure 13: Performance of contrastive learning on tasks 1–3 of the Letter, Office, and Crafter domains. The vertical axis in every panel denotes the accuracy of detecting the key states of subgoals. Initially, the agent does not know any of these (hidden) subgoals, and no symbolic observation is available. The agent only knows the result of task accomplishment at the end of the trajectory.
Figure 14: Comparison of learning efficiency on tasks 1–3 of the Letter, Office, and Crafter domains. The x-axis is the environmental steps taken by the agent. The y-axis is the success rate on completing the given task.

Appendix F Neural Architecture

We build neural network architectures for the state representation function $o_{\theta}$, the importance function $f_{\omega}$, and the exploration policy $\pi_{\text{exp}}$. The function $o_{\theta}$ extracts a $d$-dimensional vector as the representation of an input raw state or observation. In the Letter and Office domains, $d$ is $128$ and $o_{\theta}$ is realized by a two-layer MLP with 128 neurons in each layer. In the Crafter domain, $o_{\theta}$ is realized by a convolutional neural network (CNN) module. This CNN is the same as the classical CNN for deep RL proposed in (Mnih et al. 2015): the first convolutional layer has 32 channels with a kernel size of 8 and a stride of 4, the second layer has 64 channels with a kernel size of 4 and a stride of 2, and the third layer has 64 channels with a kernel size of 3 and a stride of 1. The CNN module produces an embedding vector of size $d=512$.
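The following PyTorch sketch mirrors the description above; the input resolution (an $80\times 80$ RGB observation, i.e., a $5\times 5$ grid of $16\times 16$-pixel cells) and the use of a lazy linear layer to reach $d=512$ are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrafterEncoder(nn.Module):
    """Sketch of the o_theta CNN described above (layer sizes from the text;
    input resolution assumed to be 3 x 80 x 80)."""
    def __init__(self, d=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(d), nn.ReLU(),   # infers the flattened size at the first forward pass
        )

    def forward(self, obs):               # obs: (batch, 3, 80, 80), values in [0, 1]
        return self.net(obs)

emb = CrafterEncoder()(torch.rand(1, 3, 80, 80))   # -> shape (1, 512)
```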

The importance function $f_{\omega}$ is realized by an MLP in all three domains: a two-layer MLP with $128$ neurons per layer in the Letter and Office domains and $256$ neurons per layer in the Crafter domain.

The exploration policy $\pi_{\text{exp}}$ is a GRU-based policy. In $\pi_{\text{exp}}$, the hidden dimension of the gated recurrent unit (GRU) module is 128 for the Letter/Office domains and 256 for the Crafter domain. The outputs of $\pi_{\text{exp}}$ consist of the action and the predicted value, both conditioned on the hidden state and the embedding vector of the input observation.
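A minimal sketch of such a recurrent actor-critic is given below; the exact way the hidden state and the observation embedding are combined, and the head sizes, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch of a GRU-based actor-critic in the spirit of pi_exp; how the hidden
    state and the observation embedding are combined is an assumption."""
    def __init__(self, emb_dim=512, hidden_dim=256, num_actions=5):
        super().__init__()
        self.gru = nn.GRUCell(emb_dim, hidden_dim)
        head_in = hidden_dim + emb_dim             # condition heads on hidden state and embedding
        self.actor = nn.Linear(head_in, num_actions)
        self.critic = nn.Linear(head_in, 1)

    def forward(self, emb, h):
        h = self.gru(emb, h)                       # emb: (batch, emb_dim), h: (batch, hidden_dim)
        x = torch.cat([h, emb], dim=-1)
        return torch.distributions.Categorical(logits=self.actor(x)), self.critic(x), h
```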

Appendix G Correctness and Limitation

G.1 Correctness

Empirically, as long as $N_{T}$ is large enough and the exploration policy $\pi_{\text{exp}}$ is sufficiently random, the contrastive learning in (3) can always discover the correct key state of the next subgoal at every working node. This is because sufficient trajectory data and randomness in data collection ensure that most of the state space is covered by both positive and negative trajectories. The correct subgoal tree can then be built, and the labeling function can be correctly obtained by solving the ILP problem. Specifically, the randomness of $\pi_{\text{exp}}$ can be guaranteed by using $\epsilon$-greedy action selection, where $\epsilon=0.5$ is enough for all the environments.

In other words, whenever the state space is covered sufficiently by the collected trajectory data, no hidden subgoal in the environment can be missed. Then, whenever every collected positive trajectory can be explained by some path in $\mathcal{T}_{\varphi}$, all the hidden subgoals and their temporal orderings have been learned, which shows the correctness of the termination condition of LSTOC.

G.2 Limitation

However, the LSTOC framework still has limitations. First, its labeling component cannot distinguish between bottleneck states in the environment and the real key states of subgoals, even though both are key states. For example, as shown in Figure 15, consider a navigation environment with two rooms.

Figure 15: Map of rooms 1 and 2.

Room 1 has a red ball, and room 2 has a blue ball and a green ball. An open corridor connects rooms 1 and 2 and thus becomes a bottleneck state between them. Assume that the given task has the hidden subgoals of first picking up the red ball, then the blue ball, and finally the green ball, i.e., $r;b;g$ in the temporal logic language. In this case, four key states will be discovered by LSTOC: the corridor and the states of the red, blue and green balls. Since the FSM does not contain the corridor and the agent does not know the location of the blue ball, the corridor cannot be distinguished from the state of the blue ball. Therefore, the ILP problem in the labeling component does not have a unique solution, and the labeling function cannot be obtained. However, the subgoal learning part of LSTOC can still detect every meaningful key state.

Second, the labeling component cannot distinguish symmetric branches of the FSM. For example, if the given FSM is $(a;b;c)\vee(d;e;f)$, the two branches are symmetric. The subgoal learning component of LSTOC will discover the 6 key states of the subgoals, but the mapping from the discovered key states to the semantic symbols ($a$, $b$, $c$, $d$, $e$, and $f$) cannot be determined.

It is important to note that the limitations outlined above cannot be resolved within our problem formulation. The only feedback available to the agent for grounding subgoal symbols is a binary label indicating task completion at the end of each trajectory. The agent learns the labeling function (i.e., the mapping from learned subgoals to subgoal symbols in the given FSM) by utilizing these binary labels and the structural information of the subgoal symbols in the given FSM. As a result, when there are hidden bottlenecks in the environment or the task involves symmetric branches of subgoals, the given FSM is ambiguous and the agent lacks sufficient information to accurately identify the mapping from learned subgoals to subgoal symbols. Consequently, in these situations it is impossible for the agent to correctly learn the labeling function, although the agent can still accurately identify the hidden subgoals in the environment.

Third, trajectory collection may fail to cover some important states in hard-exploration environments. In environments like Montezuma's Revenge, some key states require thousands of actions to reach, and an exploration policy trained only with the task completion signal cannot collect trajectories covering these key states. Then, some hidden subgoals cannot be discovered and the labeling component cannot produce the correct result.