HILONet: Hierarchical Imitation Learning from Non-Aligned Observations

Shanqi Liu, Junjie Cao, Wenzhou Chen, Licheng Wen, and Yong Liu Shanqi Liu, Junjie Cao, Wenzhou Chen, Licheng Wen, and Yong Liu are with the State Key Laboratory of Industrial Control Technology and Institute of Cyber-Systems and Control, Zhejiang University, Zhejiang, 310027, China, [email protected]). This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract

It is challenging learning from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation(HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Furthermore, our method can learn different kinds of policy corresponding to different tasks by using a novel reward structure. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.

Index Terms:

hierarchical, imitation learning from observation.

I Introduction

Robots can acquire complex behavior skills suitable for various unstructured environments through learning. Two of the most prevalent paradigms for behavior learning in robots are imitation learning(IL) and reinforcement learning(RL). RL methods can theoretically learn behaviors that are optimal with respect to a clear task reward. However, it usually takes millions of training steps to converge. IL methods, on the other hand, can learn faster by mimicking expert demonstrations. But in many real world scenarios, the demonstrations are hard to obtain as we may not be able to obtain expert’s actions or the expert has a different action space. In such a case, the more-specific problem of imitation learning from observation(ILfO) must be considered. ILfO refers to only using the observation of demonstrated trajectories to imitate expert, and does not require expert actions label or external rewards.

Previous works in ILfO focus on mimicking an expert skill by following the demonstration step-by-step, such as OptionGAN[1] and TCN[2]. They expect the agent can achieve the same observation of the demonstrated trajectory at every time step. Such methods require assuming that the demonstration of the learning task can be temporally aligned with the agent’s actions. However, this assumption does not hold when environments do not have a constant number of steps to end, which is common in real world scenarios, especially when demonstrator and learning agent have different physical abilities or in different environments. In such a case, it is infeasible to follow the demonstrated trajectory at every time step. Moreover, these methods can only utilize one demonstrated trajectory, as multiple trajectories will provide multiple states at each time step and the agent can not choose which one should be followed. This leads to lower utilization efficiency of expert trajectories. As a result, there are a few works focusing on making the non-time-aligned demonstration be time aligned[3], [4]. However, instead of strictly following a demonstration step-by-step, a more natural way is to select the observations that are feasible to achieve adaptively. In this case, the key to ILfO is how to choose the feasible goals. This task is similar to the goal-based hierarchical reinforcement learning’s(HRL) task. However, although few works[5] are drawing on hierarchical reinforcement learning and use it in imitation learning, they did not use it to choose the sub-goal from all demonstrated trajectories.

In our work, we propose a novel hierarchical reinforcement learning method that can flexibly choose an observation from the expert’s trajectories and use it as the sub-goal to follow. The whole structure of our method consists of two-part, high level policy and low level policy. High level policy outputs the sub-goal chosen from expert trajectory every a few steps and low level policy takes it as the sub-goal to achieve. The insight is that when human imitate from observation, we tend to choose some key position to achieve and how we get there is not essential. This means the agent does not have to focus on every step of the expert. Instead, it searches for the observations in demonstrations that can actually be achieved from the current observation. Then the agent only needs to interact with the environment and find the path to achieve the goal.

Refer to caption — Figure 1: The structure of our overall policy. The high level policy is charged for choosing a reachable sub-goal depending on current observation from all demonstrated trajectories. Low level policy is capable of achieving that sub-goal in specific steps. The rollout is shown in Algorithm. 1.

Furthermore, to the best of our knowledge, most tasks in real world scenario can be classified into two categories. One can be effectively described by a single goal observation such as navigation. As well as the other kind of tasks such as swimming, which are usually described by a sequence of key observations rather than a single goal. As agent must go through a series of specific position to gain high speed when swimming. It is clear that we want the agent explores more novel paths to the goal position in the tasks described by a single goal observation and mimics the expert behavior as closely as possible in the tasks described by a sequence of key observations. In our method, we propose a novel reward structure that can control whether the agent follows the trajectories as closely as possible or explores more. We theoretically prove that our method can solve all two kinds of tasks by control the behavior mode of the agent. Thus, our method has broad applicability.

Moreover, to increase the sample efficiency, we propose several methods to overcome the non-stationarity in hierarchical structure. Firstly, we use a hindsight replacement method to transfer the non-optimal transitions of high level policy in HRL into optimal ones. We also propose a time-delay training method to stabilize the low level policy. Additionally, we choose differentiated experience pools for high level policy and low level policy.

Finally, we test our method and the state-of-art imitation learning methods in five different environments including both single-goal tasks and sequenced-goal tasks. The result indicates our method outperforms all compared methods. And we demonstrate that our method can solve all two kinds of tasks by control the behavior mode of the agent.

In summary, the main contributions of our work include:

$\bullet$

We propose a new way of learning from observation using hierarchical reinforcement learning structure to choose feasible sub-goal, which can solve all kinds of non-time-aligned environments.
$\bullet$

We increase the sample efficiency of hierarchical reinforcement learning by overcoming the non-stationarity.
$\bullet$

We test our method and SOTA imitation learning method in five different tasks. And we show the ability of our method to solve different kinds of tasks.

II Related Work

II-A Imitation Learning from Observation(ILfO)

There are two general groups of ILfO: model-based algorithms, and model-free algorithms[6]. Model-based approaches to ILfO are characterized by the fact that they learn some dynamics model during the imitation process. As for model-based part, [7] have proposed an algorithm, behavioural cloning from observation(BCO), that is learned an inverse dynamics model using an exploratory policy, and then uses that model to infer the actions from the demonstrations and to use behaviour clone[8] to learn. There is also another ILfO approach that learns and uses forward dynamics like imitating latent policies from observation (ILPO) [9].

In this work, we mostly discuss model-free algorithms. Adversarial approaches to ILfO are inspired by the generative adversarial imitation learning (GAIL)[10] algorithm, but takes ${(o_{t},o_{t+1})}$ instead of ${(o_{t},a_{t})}$ in GAIL. This is called GAILfO[11]. There are other adversarial approaches like [12], [13].

Another type of model-free algorithm is reward engineering. It means that, based on the expert demonstrations, a manually-designed reward function is used to find imitation policies via RL. Reward engineering methods use Euler distance of the same time step’s policy observation and expert observation to design rewards, like in TCN and [14]. Another approach of this type is [15] in which the algorithm uses a formulation similar to shuffle-and-learn [16] to train a classifier that learns the order of frames in the demonstration.

II-B Hierarchical Reinforcement Learning(HRL)

There are several RL approaches to learning hierarchical policies[17]; [18]; [19]. However, these have many strict limits and are not off-policy training methods. Recently popular works like HIRO[20], Option-Critic[21] and FeUdal Networks[22]have achieve quite good performance. Especially, HAC[23] using hindsight[24] to overcome non-stationarity in HRL and makes off-policy method have a better performance. In this work, we also use hindsight in a different way to overcome non-stationarity in the form of imitation learning. As for using HRL in imitation learning, there are few works, too. Like [5] using HRL to choose which observation of the expert trajectory should be jumped. However, it still follows a single trajectory step-by-step essentially.

In the hierarchical framework, we consider agent with a natural two-level hierarchy. The high level corresponds to choosing sub-goals, and the low level corresponds to achieve those sub-goals. This structure is typical in past works[25].

III Background

III-A Markov Decision Process (MDP)

In standard Markov Decision Process (MDP), one agent sequentially chooses an action $a_{t}$ according to a policy $\pi(a|o)$ based on the observation $o_{t}$ at time t. After taking the action $a_{t}$ , observation $o_{t}$ transforms to the next observation $o_{t+1}$ according to the transition probability which satisfies Markov property and is entirely determined by the observation-action pair one time step before, i.e. $o_{t+1}\sim p(o_{t+1}|o_{t},a_{t})$ . Then the agent receives a scalar reward signal $r(o_{t},a_{t})$ from the environment. Deep Reinforcement Learning is one kind of deep learning algorithms that finds a policy $\pi$ which can extract features with deep neural network and maximizes the expected discounted cumulative reward in one episode, i.e. $R(o_{t})=E\left[\sum_{t}\gamma^{t}r(o_{t},a_{t})\right]$ , where $\gamma$ is a discount factor.

III-B Hindsight Experience Replay(HER)

As we use the off-policy learning algorithm, we also use Hindsight Experience Replay(HER)[24]. HER is a data augmentation technique that can accelerate learning in sparse reward tasks. HER first creates copies of the observation, action, reward, next observation, goal transitions that are created in traditional off-policy RL. In the copied transitions, the original goal element is replaced with a observation that was achieved during the episode, which guarantees that at least one of the HER transitions will contain the sparse reward. This method can help us to overcome non-stationarity in HRL.

III-C Deep Deterministic Policy Gradient(DDPG)

Policy gradient methods maximize the expected cumulative reward by estimating the performance gradient with respect to the policy parameter vector $\theta$ : $\nabla_{\theta}J(\pi_{\theta})$ , and updating the policy parameter vector with gradient ascent.

As the original version of policy gradient algorithm, REINFORCE [26] tends to be of high variance due to the gradient estimation with Monte Carlo method:

\nabla_{\theta}J(\theta)\approx\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{N-1}\nabla_{\theta}\log\pi_{\theta}(a_{t}|o_{t})R_{t},

(1)

where $R_{t}$ represents the cumulative reward from time t to the end of one episode. Actor-critic methods use the value function $V(o_{t})$ , action-value function $Q(o_{t},a_{t})$ or advantage function $A(o_{t},a_{t})=Q(o_{t},a_{t})-V(o_{t})$ to substitute for the cumulative reward $R_{t}$ , so as to reduce the variance of gradient estimation and improve the performance of policy gradient methods. Policy gradient methods maximize the expected cumulative reward by estimating the performance gradient with respect to the policy parameter vector $\theta$ : $\nabla_{\theta}J(\pi_{\theta})$ , and updating the policy parameter vector with gradient ascent.

For deterministic policy such as DPG [27] and DDPG [28], according to the deterministic policy gradient theorem [27], the gradient of the objective $J(\theta)$ can be estimated as:

\nabla_{\theta}J(\theta)\approx\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{N-1}\nabla_{\theta}\mu_{\theta}(a_{t}|o_{t})\nabla_{a}Q^{\mu}(o_{t},a)|_{a=\mu_{\theta}(o_{t})}.

(2)

III-D Notations

In order to show the algorithm more clearly, we list all the main notations used in our method below. We first define $D=\{\tau_{1},\tau_{2},...,\tau_{N}\}$ means all $N$ trajectories we used in training process. $\tau_{i}=\{d_{1}^{i},d_{2}^{i},...,d_{T_{i}}^{i}\}$ is one trajectory of demonstrations. $d_{j}^{i}$ means the observation of time step $j$ in the number $i$ trajectory of demonstrations. $T_{i}$ means the length of trajectory $i$ . Then, we define $I(d_{j}^{i})$ to represent the index of $d_{j}^{i}$ in its own trajectory.

I(d_{j}^{i})=\frac{j}{T_{i}}

(3)

IV Method

In this paper, we focus on solving imitation learning from observation in non-time-aligned settings. We propose a hierarchical RL framework that learns to find a feasible demonstration observation that the agent can achieve from its correct state. Following this guide, the agent can solve the target task without knowing the exact environment reward. Furthermore, we propose a novel reward structure which can control the behavior mode of the agent. Then, we propose several methods to overcome the non-stationarity in the hierarchical framework to increase sample efficiency.

IV-A Hierarchical Imitation Learning From Observation Framework

In our approach, the whole policy is consisted of two part, high level policy $\pi_{high}(o_{g}|o_{t};\theta_{h})$ and low level policy $\pi_{low}(a|o_{t},o_{g};\theta_{l})$ , where $o_{g}$ is sub-goal chosen from expert trajectory observations for low level policy, $o_{t}$ is current observation. We use DDPG algorithm to train both high level policy and low level policy.

The high level policy is charged for choosing a feasible demonstrated observation depending on current observation so low level policy is capable of achieving that sub-goal in certain steps, i.e. five steps in our experiment. To represent the $o_{g}$ among $D$ , we select a 2-dim action space for high level policy. The high level policy’s action consists of two rates between 0 and 1, the first dimension of action stands for which trajectory in $D$ is chosen, and second action dimension is the index of observation chosen in the trajectory. In this way, the sub-goal can be formed as:

\begin{array}[]{c}i=a_{h}^{1}\times N\\ j=a_{h}^{2}\times T_{i}\\ o_{g}=d_{j}^{i}=\pi_{high}(a_{h}^{1},a_{h}^{2}|o_{t};\theta_{h})\end{array}

(4)

where the $N$ is the number of all demonstrated trajectories and $T_{i}$ means the length of trajectory $i$ .

The low level policy is focused on interacting with the environment and find a way to achieve the sub-goal provided by the high level policy in certain steps. It takes $\{o_{g},o_{t}\}$ as input and output $a_{t}$ as inter-action that interacts with the environment. The low level policy can be viewed as

a_{t}=\pi_{low}(a_{t}|o_{t},o_{g};\theta_{l})

(5)

In this way, the low level policy and the high level policy together form the overall strategy. By flexibly selecting the observation in the demonstration as the sub-goal, the target task can be completed step-by-step. The structure of our overall policy is shown in Fig. 1. The training process is shown in Algorithm. 1.

Algorithm 1 HILONet

Initialize $\pi_{high}(a_{h}^{1},a_{h}^{2}|o_{t};\theta_{h})$ , $\pi_{low}(a_{t}|o_{t},o_{g};\theta_{l})$ .

Collect $D$ using expert policy in the environment.

Initialize replay buffer $R_{h}$ , $R_{l}$ .

Initialize Q-Network $Q_{\theta_{h}}$ and target Q-Network $Q_{\theta_{h}^{\prime}}$ , Q-Network $Q_{\theta_{l}}$ and target Q-Network $Q_{\theta_{l}^{\prime}}$ .

1:for n steps = 1 to T do

\pi_{high}

takes action

o_{g}=\pi_{high}(a_{h}^{1},a_{h}^{2}|o_{t};\theta_{h})

\pi_{low}

takes action

a_{t}=\pi_{low}(a_{t}|o_{t},o_{g};\theta_{l})

4: Get

<(o_{t},o_{g}),a_{t},r_{l},o_{t+1}>

and

<o_{t},o_{g},r_{h},o_{t+1}>

5: if

o_{t+1}

in the

D

and

o_{t+1}

is not

o_{g}

then

6: Replace

o_{g}

o_{t+1}

in both

<(o_{t},o_{g}),a_{t},r_{l},o_{t+1}>

and

<o_{t},o_{g},r_{h},o_{t+1}>

8: Store

<(o_{t},o_{g}),a_{t},r_{l},o_{t+1}>

R_{l}

9: Store

<o_{t},o_{g},r_{h},o_{t+1}>

R_{h}

10: Sample a mini-batch

B_{h}

and

B_{l}

from

R_{h}

R_{l}

11: if n mod TimeDelay then

12: Perform a gradient decent step on

\pi_{high}

13: Perform a gradient decent step on

Q_{\theta_{h}}

14: if n mod TargetUpdate

\times

TimeDelay then.

15: Update

Q_{\theta_{h}^{\prime}}:\theta_{h}^{\prime}\leftarrow\theta_{h}

16: Perform a gradient decent step on

\pi_{low}

17: Perform a gradient decent step on

Q_{\theta_{l}}

18: if n mod TargetUpdate then

19: Update

Q_{\theta_{l}^{\prime}}:\theta_{l}^{\prime}\leftarrow\theta_{l}

. Start next episode.

IV-B Reward Structure

The rewards for both high level and low level policy are designed based on observation information as there is no action label available. For low level policy, the insight is simple, that we want the reward to encourage the agent to achieve sub-goals. We can use the Euler distance of goal observation and current observation to do so directly. However, this reward can not reflect whether the sub-goal is achieved. For example, the reward increases similarly when agent approaching the sub-goal from a far distance or when agent is close to the target but missing the target slightly. To encourage agent achieving sub-goal, we add a sparse reward $r$ that is given only when the agent achieves the current sub-goal. We define if $|o_{g}-o_{t}|<\epsilon$ then we consider agent has achieved the sub-goal. The overall reward of low level policy can be viewed as:

r_{low}(o_{t},o_{g})=\left\{\begin{array}[]{lr}-|o_{g}-o_{t}|^{2}&\|o_{g}-o_{t}\|>\epsilon\\ -|o_{g}-o_{t}|^{2}+r&\|o_{g}-o_{t}\|<\epsilon\\ \end{array}\right.

(6)

As for the high level policy, we consider it should guide the low level policy to accomplish the specific tasks. In this way, the high level policy must do two tasks, finding a feasible sub-goal that agent can achieve and the sub-goal should gradually solve the specific tasks. The reward for the high level policy is designed based on these two proposals. First, we naturally use a sparse reward that given only when low level policy achieved the sub-goal. Then we add another reward related with which phase agent is right now. This reward can be evaluated by $I(o_{g})=I(d_{j}^{i})$ . Because it represents the index of the reached expert state in the entire expert sequence. If the value of $I(o_{g})$ is getting larger, means the agent is getting closer to the final goal. And since $I(o_{g})$ is normalized between 0 and 1, it can be used among all trajectories even if they have the different number of steps to the end. If we output sub-goal every $\Delta t$ , we can use

\Delta I(o_{g})=I(o_{g}^{t})-I(o_{g}^{t-\Delta t})

(7)

as the reward to guide agent. Furthermore, we define $I(o_{g})=0$ if $o_{t}$ is not in expert trajectory, which can punish agent when it deviates from the expert trajectory. The overall reward of high level policy can be viewed as:

r_{high}(o_{g}^{t},o_{g}^{t-\Delta t})=\left\{\begin{array}[]{lr}1+\alpha\cdot(I(o_{g}^{t})-I(o_{g}^{t-\Delta t}))&\|o_{g}^{i}-o_{t}\|<\epsilon\\ 0&\|o_{g}^{i}-o_{t}\|>\epsilon\end{array}\right.

(8)

where $\alpha$ is the multipliers to control the optimization ratio of these two parts of the reward. And here we propose,

Theorems 1: By changing $\alpha$ , we can control the behavior pattern of the learned policy, the higher $\alpha$ is, the learned policy would focus on following the trajectories as closely as possible. And the lower $\alpha$ is, the learned policy would focus on exploring more novel paths to the goal position.

$Proof$ Firstly, we can have the optimal value at a state is given by the state-value function

V^{*}\left(o_{t}\right)=\max_{\pi}\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r\left(o_{t}\right)\right].

(9)

where

r\left(o_{t}\right)=\max_{\pi}r\left(o_{t},a_{t}\right).

we consider two optimum policies that can solve the tasks $\pi_{1}$ and $\pi_{2}$ . Both policies are well-trained, which means the low level policies is the same, only high level policies are different. $\pi_{1}$ choose a path to reach the goal observation exactly as the expert does, while $\pi_{2}$ choose a total different path but reach the same goal position. The state-value function of these two policy are:

\begin{multlined}V_{1}^{*}\left(o_{t}\right)=\max_{\pi_{1}}\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r\left(o_{t}\right)\right].\\ V_{2}^{*}\left(o_{t}\right)=\max_{\pi_{2}}\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r\left(o_{t}\right)\right].\end{multlined}V_{1}^{*}\left(o_{t}\right)=\max_{\pi_{1}}\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r\left(o_{t}\right)\right].\\ V_{2}^{*}\left(o_{t}\right)=\max_{\pi_{2}}\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}r\left(o_{t}\right)\right].

(10)

We substitute our reward $r_{high}$ into Eq. 10 to obtain

\begin{multlined}V_{1}^{*}\left(o_{t}\right)=\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}(1+\alpha\cdot(I(o_{t})-I(o_{t-\Delta t})))\right]\\ =\sum_{t=0}^{T}\gamma^{t}(1+\alpha\cdot\frac{\Delta t}{T})).\end{multlined}V_{1}^{*}\left(o_{t}\right)=\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}(1+\alpha\cdot(I(o_{t})-I(o_{t-\Delta t})))\right]\\ =\sum_{t=0}^{T}\gamma^{t}(1+\alpha\cdot\frac{\Delta t}{T})).

(11)

\begin{multlined}V_{2}^{*}\left(o_{t}\right)=\mathbb{E}\left[\gamma^{T}(1+\alpha\cdot I(o_{T})-0))\right]=(1+\alpha)\cdot\gamma^{T}.\end{multlined}V_{2}^{*}\left(o_{t}\right)=\mathbb{E}\left[\gamma^{T}(1+\alpha\cdot I(o_{T})-0))\right]=(1+\alpha)\cdot\gamma^{T}.

(12)

We use Eq. 11 minus Eq. 12, we have

\begin{multlined}\Delta V^{*}\left(o_{t}\right)=\sum_{t=0}^{T}\gamma^{t}(1+\alpha\cdot\frac{\Delta t}{T}))-(1+\alpha)\cdot\gamma^{T}\\ =\sum_{t=0}^{T}\gamma^{t}-\gamma^{T}+\alpha\cdot(\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T})\\ =\sum_{t=0}^{T-1}\gamma^{t}+\alpha\cdot(\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}),\end{multlined}\Delta V^{*}\left(o_{t}\right)=\sum_{t=0}^{T}\gamma^{t}(1+\alpha\cdot\frac{\Delta t}{T}))-(1+\alpha)\cdot\gamma^{T}\\ =\sum_{t=0}^{T}\gamma^{t}-\gamma^{T}+\alpha\cdot(\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T})\\ =\sum_{t=0}^{T-1}\gamma^{t}+\alpha\cdot(\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}),

(13)

Then we have

\begin{multlined}\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}=\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}+T\cdot\gamma^{T}\cdot\frac{1}{T}\\ =\sum_{t=0}^{T}(\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}\cdot\frac{1}{T}),\end{multlined}\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}=\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}+T\cdot\gamma^{T}\cdot\frac{1}{T}\\ =\sum_{t=0}^{T}(\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}\cdot\frac{1}{T}),

(14)

\begin{multlined}\gamma^{t}\geq\gamma^{T};\frac{\Delta t}{T}\geq\frac{1}{T},\end{multlined}\gamma^{t}\geq\gamma^{T};\frac{\Delta t}{T}\geq\frac{1}{T},

(15)

We have

\begin{multlined}\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}\geq 0.\end{multlined}\sum_{t=0}^{T}\gamma^{t}\cdot\frac{\Delta t}{T}-\gamma^{T}\geq 0.

(16)

In this way, we prove that $\Delta V^{*}\left(o_{t}\right)$ is monotonically increasing with $\alpha$ , when $\alpha$ is high, the return distinguish between $\pi_{1}$ and $\pi_{2}$ are large, which pushes the agent to follow the expert as close as possible. However, when the $\alpha$ is low, there is not much difference in the return of $\pi_{1}$ and $\pi_{2}$ , which makes the agent tend to explore more novel path, as the novel path can increase the chance of obtaining rewards.

In summary, our hierarchical framework method has three advantages. First of all, comparing to another reward engineering imitation learning from observation methods, our method can plan dynamically thus it can solve the non-time-aligned problem. Furthermore, the reward structure of our method can adopt in all kinds of environments, no matter the tasks are described by a single goal observation or a sequence of key observations. Additionally, our method can use information from multiple trajectories simultaneously, while most reward engineering methods can only imitate one trajectory if it imitates an expert step-by-step.

Secondly, our method is based on reward engineering, so comparing with adversarial methods, our method has access to observation information directly, which can offer more dense reward and will not miss any important information in demonstrated trajectories.

The third advantage is that using a hierarchical framework policy can naturally divide a complex task into two more straightforward tasks, which will accelerate learning in sequential decision-making tasks.

IV-C Methods to Overcome the Non-stationarity in Hierarchical Framework

In this section, we will introduce three methods that overcome the non-stationarity in the hierarchical framework so that we can increase sample efficiency.

IV-C1 Hindsight Transitions

Non-stationarity of hierarchical framework mainly comes from the low level policy. A transition collected when the low level policy is not well-trained can be useless. As the low level policy is changing that even giving the same sub-goal, the low level policy will execute a different transition. Moreover, this will change the distribution of reward for high level policy.

We can overcome this problem by using a hindsight replacement method, as shown in Fig. 3. Once the low level policy achieves an observation in expert trajectories, we consider the low level policy finishes the sub-goal offered by the high level policy. In this way, we can take this transition as a stationary transition generated by a well-trained low level policy.

To achieve this, we need to replace the original transitions of high level policy to a hindsight one. If we have $\{o_{t},o_{g}=\pi_{high}(o_{t}),r_{high}(o_{t+\Delta t},o_{g}),o_{t+\Delta t}\}$ as original transition. Once $o_{t+\Delta t}$ is in expert trajectories, we can use hindsight methods to replace sub-goal(high level policy action) by $o_{t+\Delta t}$ , this means our high level policy’s action chose $o_{t+\Delta t}$ as sub-goal at the first place, and our low level policy also use the best policy to achieve that sub-goal. So, the hindsight transition can be viewed as $\{o_{t},o_{g}=o_{t+\Delta t},r_{high}(o_{t+\Delta t},o_{g}),o_{t+\Delta t}\}$ . By now, we have stationary hindsight transitions in high level policy replay buffer. However, not every transition can be replaced by stationary hindsight transition, for $o_{t+\Delta t}$ is not necessary within expert trajectories. We consider other methods to overcome this problem which will show below.

We also use hindsight transitions in low level policy replay buffer. The reason is that we could use failed trajectory as success one offering the agent more dense reward. Assuming high level policy outputs sub-goal every $\Delta t$ . In this way, if we have $\{o_{t},\pi_{low}(o_{t},o_{g}),r_{low}(o_{t+1},o_{g}),o_{t+1}\}$ as original transition of low level policy. Once $o_{t+\Delta t}$ is in expert trajectories, we can use hindsight methods to replace sub-goal $o_{g}$ by $o_{t+\Delta t}$ like we did in high level policy. Then, the hindsight transition of low level policy is $\{o_{t},\pi_{low}(o_{t},o_{t+\Delta t}),r_{low}(o_{t+1},o_{t+\Delta t}),o_{t+1}\}$ , it means our low level policy using a well-trained policy to achieve sub-goal that offered by high level policy instead of achieving a wrong observation.

To summarize, we use two different hindsight trajectory replacement methods in both high and low level policy. These methods can overcome the non-stationarity of the hierarchical framework in high level policy. Furthermore, we use the hindsight trajectory replacement method to increase the sample efficiency of the low level policy. These methods can overcome the non-stationarity of the hierarchical framework and increase the sample efficiency of our method.

IV-C2 Asynchronous Delayed Update

Since we know, the non-stationarity of hierarchical framework mainly comes from the imperfect low level policy. We can take measures to ensure that low level policy is trained firstly. As the better trained low level policy can significantly reduce non-stationary transitions in high level policy replay buffer. To achieve this, we use a time-delayed training process to do so. The low level policy is trained every step while high level policy is trained with a delay, for instance, double steps.

The idea of time delay has shown its effectiveness in TD3[29]. TD3 uses it to train the actor and critic networks because it makes sense to train the actor network after the critic network is trained more accurately. In our case, this makes even more sense, because our high and low level policy are not facing similar tasks like actor and critic network in TD3. As the hierarchical structure divides the overall complex task into two simpler tasks, our high level policy does not need to interact with the environment. This makes its task not so hard to learn that even if we delay its training process, it can still learn a good enough policy.

IV-C3 Smaller Experience Replay Buffer Size

The last method to reduce the non-stationarity of the hierarchical framework is keeping our high level policy experience replay buffer size small. This can be easily explained that smaller size means using less old experience which we already know is bad for high level policy training. However, we just used a relatively small experience replay buffer, which does not have a significant impact on off-policy sample efficiency. Furthermore, as we discussed in earlier section, the high level policy has an easier task to learn, which indicating using a smaller replay buffer does not do much harm to the learning process.

V Experiments

We evaluate our method in five different environments to show how our method benefits imitation learning in non-time-aligned environments. Especially, we are interested in answering two questions: (1) can our method choose the reachable observation from demonstration to solve these two kinds of non-time-aligned tasks? And (2) can it be compared with other state-of-the-art algorithms?

V-A Simulated Tasks

We choose MountainCar, LunarLander, Swimmer in Gym[30] and Reacher, 3Dball in ml-agents[31]. All environments are shown in Fig. 2. These five environments are all non-time-aligned and represent different characteristics, including two kinds of tasks can be effectively described by a single goal observation or a sequence of key observations rather than a single goal. Specially, the MountainCar and LunarLander can be viewed as a single goal environment, while the Swimmer and 3Dball can be viewed as a sequence of goal environment. And Reacher is more complex that it can be viewed as a mixing environment. More details about environments will discuss below. Experiments in these environments can show the universality of our method.

MountainCar: The goal is to have the car reach the target point. The target is on top of a hill on the right-hand side of the car. If the car reaches it or goes beyond, the episode terminates. This means the number of steps of each episode is changing, which is a typical non-time-aligned environment. On the left-hand side, there is another hill. Climbing this hill can be used to gain potential energy and accelerate towards the target. On top of this second hill, the car cannot go further than a position equal to -1, as if there was a wall. Hitting this limit does not generate a penalty.

LunarLander: The goal is to land on a specific position without crashing. The landing pad is always at coordinates (0,0). If lander moves away from the landing pad, it loses reward back. The episode finishes if the lander crashes or comes to reset, receiving additional -100 or +100 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. These feature make sure that the agent can choose different path to the goal position, which is also a typical non-time-aligned environment.

Reacher: The agent must move their hand to a constantly moving goal location, and keep it there. The agent is a double-jointed arm which can move to target locations. The target location is a constantly moving location in 3D space, which makes the agent must follow its trajectory. This environment is different from the other environments. The moving target points bring an enormous challenge, as the first phrase of reaching the goal position can be viewed as a single-goal task and the second phrase of keeping moving within the goal area can be viewed as a sequenced-goal task.

3Dball: The agent must balance the ball on it’s head for as long as possible. This goal makes the environment different from the single-goal ones, as it not a reaching given goal point task which the goal point is usually the last frame of demonstration. This task needs to follow a special action pattern to keep a requested status, which is not directly related to the last frame of demonstration.

Swimmer: This task involves a 3-link swimming robot in a viscous fluid, where the goal is to make it swim forward as fast as possible, by actuating the two joints. Like 3Dball, this task requires the agent to arrive at the orderly goal frames in order to achieve high speed.

V-B Effect of Hierarchical Imitation Learning From Observation

In this part, we evaluate the effect our Hierarchical Imitation Learning From Observation(HILONet) in these five environments. For comparison, we evaluate several state-of-the-art imitation learning algorithms, including adversarial methods like GAIL, which using a specific class of cost functions. This allows for the use of generative adversarial networks in order to do apprenticeship learning. And GAILfO[11], which is similar to GAIL but only using observations. Meanwhile, we evaluate a baseline reward engineering method, which imitates the expert observation step-by-step. This method learns to reach the observation of the same time step of the demonstrated trajectory at each time step. We call it time-sequential reward engineering(TSRE). Notable, our method, GAILfO and reward engineering baseline are using only expert observations, but GAIL have access to the demonstrator’s actions.

We implement HILONet using DDPG algorithm. In the implementation of DDPG, the policies are all parameterized by a two-layer full connected network with 64 units per layer and “relu” active function and are initialized randomly. The high level policy acts every five steps. The $\alpha$ is 0.5 in single-goal task and 10 in the sequenced-goal task. We use precisely the same parameters for all compared algorithms. As for demonstrations, we use 20 trajectories in MountainCar, 30 trajectories in LunarLander, Swimmer, 3Dball and Reacher. However, TSRE can only use one of these trajectories as it can only follow one expert step-by-step. All demonstrations are collected by pre-trained expert policies which are trained with extinct reward. All tests are evaluated over three seeds and using 20 or 30 experts’ trajectories in all environments. For all experiments, we utilize the environmental return as the performance of each method, which is the y-axis in all plots.

		MountainCar	LunarLander	Reacher	3Dball	Swimmer
IL	GAIL	56.3 $\pm$ 11.5	149.9 $\pm$ 38.5	1.32 $\pm$ 0.49	3.1 $\pm$ 0.3	43.3 $\pm$ 1.7
ILfO	GAILfO	-98 $\pm$ 1.5	32.0 $\pm$ 73	0.2 $\pm$ 0.1	0.18 $\pm$ 0.02	25.6 $\pm$ 2.4
	TSRE	-3.3 $\pm$ 11.2	-148.9 $\pm$ 99	0.73 $\pm$ 0.3	0.3 $\pm$ 0.0	17.3 $\pm$ 2.2
	HILONet	22.9 $\pm$ 9.7	61.0 $\pm$ 37.7	1.1 $\pm$ 0.5	6.5 $\pm$ 3.4	38.5 $\pm$ 1.95

TABLE I: The final performance of all four methods mentioned before. Notably, the GAIL is an imitation learning method as it has access to action labels, while others is learning from observation methods that utilize pure observation without action label from demonstrations.

V-B1 Tasks with Single Goal

Fig. 4 and Fig. 5 depicts the performance of the agents trained with all methods in MountainCar and LunarLander. In all cases, our method outperforms the GAILfO and TSRE. These two algorithms are our main comparison algorithms because they are the same as ours belonging to ILfO algorithms. We find GAILfO has a much poorer performance in MountainCar, which learned a sub-optimal policy that only focuses on going back. This is because the backward acceleration samples account for the majority in the demonstrations, we think that in the process of GAILfO learning, it pays too much attention to these samples and ignores the small part of the forward acceleration samples(including final goal position) that are important to complete the task. However, unlike GAILfO, the reward of HILONet makes the agent to reach the desired goal observations directly. Therefore, there will not be a situation where a small part of important samples is ignored like GAILfO. We also notice that the TSRE can not learn a decent policy in both two environments, as it can only learn from a single trajectory which limits the possibility of finding a novel way to the goal position. On the contrary, HILONet can dynamically choose sub-goals at each time step to explore more paths. As a result, HILONet has the best performance among all methods.

V-B2 Visualization of Learned Trajectory

In order to show the novel paths learned to solve single-goal tasks, we plot expert trajectory and our policy’s trajectory in LunarLander for comparison. Each image contains ten trajectories, for instance. We show our policy’s trajectories from beginning to the end of the training so that we can observe the learning process. The result is shown in Fig. 6. We find that our policy’s trajectory is not exactly the same as the expert’s. This means our method learns some new paths to the final goal.

V-B3 Tasks with Sequenced Goals

Fig. 7 and Fig. 8 depicts the performance of the agents trained with all methods in 3Dball and Swimmer. The results show that our method learns the expert action pattern from the order of demonstration frames. In 3Dball and Swimmer environments, tasks can not be solved by purely achieve the final observation of a demonstrated trajectory. Agents must go through a series of specific observations to solve the task. And our method can learn to find these specific observations by following the order of demonstration frames, which is exactly our high level policy’s task when $\alpha$ is large. This conclusion is in the line with our theoretical proof. Moreover, we notice that TSRE has a comparable performance to GAILfO in sequenced-goals tasks. We believe that the reason is the sequenced-goals tasks requiring to follow the demonstrated trajectory strictly, which makes TSRE more advantageous than GAILfO, even if it only utilizes one trajectory.

V-B4 Mixing Tasks

Fig. 9 depicts the performance of the agents trained with all methods in Reacher. As Reacher is a mixing task, the result is similar to the former ones. We find our method still outperforms all compared methods when the environment involves both two kinds of tasks.

Finally, we show the final performance of all four methods in Table. I. As for GAIL, it performs best because it has access to action labels from the expert. However, in all these five environments, our method is not far worse than GAIL, HILONet even gets a higher score in environment 3Dball.

In summary, all experiment results show that our method has a better performance compared with either TSRE or adversarial method GAILfO. As a unique hierarchical structure is used, we can prove that the hierarchical structure does have advantages when performing imitation learning in non-time-aligned environments by achieving dynamic planning.

In addition, the hierarchical structure can solve the non-time-aligned problem in both two kinds of environments, including single-goal task and sequenced-goals task. Our method can learn new paths to the goal in tasks described by a single goal observation and follow the demonstrated trajectories as close as possible to learn the special action pattern in sequenced-goals tasks. As a result, HILONet has strong versatility.

V-C Ablation Study

In this setting, we test how different ways of overcoming non-stationarity affect the agent’s performance. Firstly, we test our method training without hindsight transition replacement in both high level policy and low level policy. Then, we find out how the time delay method affects the training process. At last, we test our method with a bigger replay buffer size, e.g. twice as big as the original one.

These results in Fig. 10, Fig. 11 and Fig. 12 shows that all these three ways that we proposed to overcome non-stationarity in the hierarchical framework works. Especially, methods without hindsight transition replacement or time delay training can hardly even get out of ground level. As for using a bigger replay buffer size, the influence is not as huge as others. However, we can still observe a significant decay. The reason for this is that hindsight transition replacement or time delay training changes the distribution of the sample, so it has a greater impact on the learned policy. In contrast, changing the size of replay buffer only changes the learning hyperparameters, and the impact will not be so large.

VI Conclusion

In this paper, we introduced a new imitation learning from observation method, hierarchical imitation learning from observation (HILONet), using hierarchical reinforcement structure to choose observations from expert trajectories’ observations as goals. By achieving these goals, our method can imitate expert with only observations offered. Furthermore, we propose a reward structure that can control the behavior pattern of learning policy whether to explore more or mimic more. In this way, our method has the ability to solve tasks with single goal position and tasks described by a sequence of key observations. Additionally, we propose three different ways to overcome the non-stationarity problem in hierarchical structure to increase sample efficiency. We evaluate the method with extensive experiments based on five different environments, including both these have a single goal position and those do not have a specific target. The result shows that HILONet can solve all kinds of tasks and improves the training procedure of imitation learning. It outperforms GAILfO and reward engineering baseline and achieves nearly the same performance as GAIL in all environments. Furthermore, we test the effect of these three ways of overcoming non-stationarity. As a result, we find they improve the effectiveness of our method.

For further research, we will attempt to use a multi-level hierarchical structure, which is believed can divide complex tasks into more simple ones. This may be the key to a general imitation learning algorithm that can be used in robot control, autonomous driving and other fields.

References

[1] P. Henderson, W. D. Chang, P. L. Bacon, D. Meger, J. Pineau, and D. Precup, “Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning,” 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 3199–3206, 2018.
[2] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain, “Time-Contrastive Networks: Self-Supervised Learning from Video,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 1134–1141, 2018.
[3] K. H. Kim, Y. Gu, J. Song, S. Zhao, and S. Ermon, “Cross domain imitation learning,” arXiv preprint arXiv:1910.00105, 2019.
[4] F. Liu, Z. Ling, T. Mu, and H. Su, “State alignment-based imitation learning,” arXiv preprint arXiv:1911.10947, 2019.
[5] Y. Lee, E. S. Hu, Z. Yang, and J. J. Lim, “To Follow or not to Follow: Selective Imitation Learning from Observations,” no. CoRL, 2019. [Online]. Available: http://arxiv.org/abs/1912.07670
[6] F. Torabi, G. Warnell, and P. Stone, “Recent advances in imitation learning from observation,” IJCAI International Joint Conference on Artificial Intelligence, vol. 2019-Augus, pp. 6325–6331, 2019.
[7] ——, “Behavioral cloning from observation,” IJCAI International Joint Conference on Artificial Intelligence, vol. 2018-July, no. July, pp. 4950–4957, 2018.
[8] M. Bain and C. Sammut, “A framework for behavioural cloning.” in Machine Intelligence 15, 1995, pp. 103–129.
[9] A. Edwards, H. Sahni, Y. Schroecker, and C. Isbell, “Imitating latent policies from observation,” in International Conference on Machine Learning, 2019, pp. 1755–1763.
[10] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in neural information processing systems, 2016, pp. 4565–4573.
[11] F. Torabi, G. Warnell, and P. Stone, “Generative adversarial imitation from observation,” arXiv preprint arXiv:1807.06158, 2018.
[12] J. Merel, Y. Tassa, D. TB, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess, “Learning human behaviors from motion capture by adversarial imitation,” arXiv preprint arXiv:1707.02201, 2017.
[13] B. C. Stadie, P. Abbeel, and I. Sutskever, “Third-person imitation learning,” arXiv preprint arXiv:1703.01703, 2017.
[14] Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 1118–1125, 2018.
[15] W. Goo and S. Niekum, “One-shot learning of multi-step tasks from observation via activity localization in auxiliary video,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7755–7761.
[16] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in European Conference on Computer Vision. Springer, 2016, pp. 527–544.
[17] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[18] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in neural information processing systems, 2016, pp. 3675–3683.
[19] B. Bakker and J. Schmidhuber, “Hierarchical reinforcement learning with subpolicies specializing for learned subgoals.” in Neural Networks and Computational Intelligence. Citeseer, 2004, pp. 125–130.
[20] O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 3303–3313.
[21] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[22] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” arXiv preprint arXiv:1703.01161, 2017.
[23] A. Levy, G. Konidaris, R. Platt, and K. Saenko, “Learning multi-level hierarchies with hindsight,” arXiv preprint arXiv:1712.00948, 2017.
[24] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in neural information processing systems, 2017, pp. 5048–5058.
[25] G. Konidaris and A. G. Barto, “Skill discovery in continuous reinforcement learning domains using skill chaining,” in Advances in neural information processing systems, 2009, pp. 1015–1023.
[26] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, ser. ICML, 2014.
[28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[29] S. Dankwa and W. Zheng, “Twin-delayed ddpg: A deep reinforcement learning technique to model a continuous movement of an intelligent robot agent,” in Proceedings of the 3rd International Conference on Vision, Image and Signal Processing, 2019, pp. 1–5.
[30] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
[31] A. Nandy and M. Biswas, “Unity ml-agents,” in Neural Networks in Unity. Springer, 2018, pp. 27–67.