
A Primer on Maximum Causal Entropy Inverse Reinforcement Learning

Adam Gleave*
[email protected]

Sam Toyer*
[email protected]

*Equal contribution.
Abstract

Inverse Reinforcement Learning (IRL) algorithms [17, 1] infer a reward function that explains demonstrations provided by an expert acting in the environment. Maximum Causal Entropy (MCE) IRL [31, 29] is currently the most popular formulation of IRL, with numerous extensions [5, 8, 21]. In this tutorial, we present a compressed derivation of MCE IRL and the key results from contemporary implementations of MCE IRL algorithms. We hope this will serve both as an introductory resource for those new to the field, and as a concise reference for those already familiar with these topics.

1 Introduction

The most direct approach to automating a task is to manually specify the steps required to complete the task in the form of a policy. However, it is often easier to specify a reward function that captures the overarching task objective, and then use reinforcement learning (RL) to obtain a policy that carries out the steps to achieve that objective [23]. Unfortunately, procedurally specifying a reward function can also be challenging. Even a task as simple as peg insertion from pixels has a non-trivial reward function [27, Section IV.A]. Most real-world tasks have far more complex reward functions than this, especially when they involve human interaction.

A natural solution is to learn the reward function itself. A common approach is Inverse Reinforcement Learning (IRL) [17, 1]: inferring a reward function from a set of demonstrations of a particular task. This is well-suited to tasks that humans can easily perform but would find difficult to directly specify the reward function for, such as walking or driving. An additional benefit is that demonstrations can be cheaply collected at scale: for example, vehicle manufacturers can learn from their customers’ driving behaviour [24, 14].

A key challenge for IRL is that the problem is underconstrained: many different reward functions are consistent with the observed expert behaviour [2, 15, 4, 22]. Some of these differences, such as scale or potential shaping, never change the optimal policy and so may be ignored [18, 10]. However, many differences do change the optimal policy—yet perhaps only in states that were never observed in the expert demonstrations. By contrast, alternative modalities such as actively querying the user for preference comparisons [20] can avoid this ambiguity, at the cost of a higher cognitive workload for the user.

Maximum Causal Entropy (MCE) IRL is a popular framework for IRL. Introduced by Ziebart [31, 29], MCE IRL models the demonstrator as maximising return achieved (like an RL algorithm) plus an entropy bonus that rewards randomising between actions. The entropy bonus allows the algorithm to model suboptimal demonstrator actions as random mistakes. In particular, it means that any set of sampled demonstrations has support under the demonstrator model, even if the trajectories are not perfectly optimal for any non-trivial reward function. This is important for modelling humans, who frequently deviate from optimality in complex tasks.

An alternative framework, Bayesian IRL [19], goes beyond finding a point estimate of the reward function and instead infers a posterior distribution over reward functions. It therefore assigns probability mass to all reward functions compatible with the demonstrations (so long as they have support in the prior). Unfortunately, Bayesian IRL is difficult to scale, and has to date only been demonstrated in relatively simple environments such as small, discrete MDPs.

In contrast to Bayesian IRL, algorithms based on MCE IRL have scaled to high-dimensional environments. Maximum Entropy Deep IRL [28] was one of the first extensions, and is able to learn rewards in gridworlds from pixel observations. More recently, Guided Cost Learning [6] and Adversarial IRL [8] have scaled to MuJoCo continuous control tasks. Given its popularity and accomplishments we focus on MCE IRL in the remainder of this document; we refer the reader to Jeon et al. [13] for a broader overview of reward learning.

2 Background

Before describing the MCE IRL algorithm itself, we first need to introduce some notation and concepts. First, we define Markov Decision Processes (MDPs). Then, we outline IRL based on feature matching. Finally, we introduce the notion of causal entropy.

2.1 Markov decision processes

In this tutorial, we consider Markov decision processes that are either discounted, or have a finite horizon, or both. Below we give the definition and notation that we use throughout. Note that when doing IRL, we drop the assumption that the true reward $r$ is known, and instead infer it from data.

Definition 2.1.

A Markov Decision Process (MDP) $M=(\mathcal{S},\mathcal{A},\gamma,T,\mathcal{I},\mathcal{T},r)$ consists of a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$; a discount factor $\gamma\in[0,1]$; a horizon $T\in\mathbb{N}\cup\{\infty\}$; an initial state distribution $\mathcal{I}(s)$; a transition distribution $\mathcal{T}(s'\,|\,s,a)$ specifying the probability of transitioning to $s'$ from $s$ after taking action $a$; and a reward function $r(s,a,s')$ specifying the reward upon taking action $a$ in state $s$ and transitioning to state $s'$. Either the discount factor must satisfy $\gamma<1$ or the horizon must be finite ($T<\infty$).

Given an MDP, we can define a (stochastic) policy $\pi_{t}(a_{t}\,|\,s_{t})$ that assigns a probability to taking action $a_{t}\in\mathcal{A}$ in state $s_{t}\in\mathcal{S}$ at time step $t$. The probability of a policy acting in the MDP producing a trajectory fragment $\tau=(s_{0},a_{0},s_{1},a_{1},\ldots,s_{k-1},a_{k-1},s_{k})$ of length $k\in\mathbb{N}$ is given by:

p(\tau)=\mathcal{I}(s_{0})\prod_{t=0}^{k-1}\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})\,\pi_{t}(a_{t}\,|\,s_{t})\,. (1)

Note that in a finite-horizon MDP ($T\in\mathbb{N}$) the policy may be non-stationary, i.e. it can depend on the time step $t$. In the infinite-horizon case ($T=\infty$), the MDP is symmetric over time, and so we assume the policy is stationary. We drop the subscript and write only $\pi$ when the policy is acting across multiple timesteps or is known to be stationary.

The objective of an agent is to maximise the expected return:

G(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}r(S_{t},A_{t},S_{t+1})\right]\,. (2)

An optimal policy $\pi^{*}$ attains the highest possible expected return: $\pi^{*}\in\arg\max_{\pi}G(\pi)$.
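To make these definitions concrete, the following minimal Python sketch (our own illustrative code, not from the paper; names such as TabularMDP and sample_trajectory are made up) samples trajectories according to Eq. 1 and estimates the expected return of Eq. 2 by Monte Carlo, using a reward that depends only on $(s,a)$ for simplicity:

# A minimal sketch of a tabular MDP, trajectory sampling under Eq. 1, and a
# Monte Carlo estimate of the expected return G(pi) in Eq. 2.
import numpy as np

class TabularMDP:
    def __init__(self, init_dist, transitions, rewards, gamma, horizon):
        self.I = init_dist          # shape (S,): initial state distribution
        self.T = transitions        # shape (S, A, S): T[s, a, s'] = P(s' | s, a)
        self.R = rewards            # shape (S, A): reward for taking a in s
        self.gamma = gamma
        self.horizon = horizon

def sample_trajectory(mdp, policy, rng):
    """policy: array of shape (horizon, S, A), possibly non-stationary."""
    s = rng.choice(len(mdp.I), p=mdp.I)
    traj = []
    for t in range(mdp.horizon):
        a = rng.choice(mdp.T.shape[1], p=policy[t, s])
        s_next = rng.choice(mdp.T.shape[2], p=mdp.T[s, a])
        traj.append((s, a, s_next))
        s = s_next
    return traj

def monte_carlo_return(mdp, policy, n_rollouts=1000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        traj = sample_trajectory(mdp, policy, rng)
        total += sum(mdp.gamma ** t * mdp.R[s, a] for t, (s, a, _) in enumerate(traj))
    return total / n_rollouts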

2.2 Imitation as feature expectation matching

In IRL, our objective is to recover a reward function that, when maximised by a reinforcement learner, leads to behaviour similar to the demonstrations. One way to formalise "similar behaviour" is feature expectation matching. Suppose the demonstrator is optimising some unknown linear reward function $r_{\theta_\star}(s_{t},a_{t})=\theta_\star^{T}\phi(s_{t},a_{t})$, where $\phi(s_{t},a_{t})\in\mathbb{R}^{d}$ is a fixed feature mapping. In this case, the demonstrator's expected return under its behaviour distribution $\mathcal{D}$ will be linear in the expected sum of discounted feature vectors observed by the agent:

\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{\theta_\star}(S_{t},A_{t})\right]=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\theta_\star^{T}\phi(S_{t},A_{t})\right]=\theta_\star^{T}\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]\,. (3)

Say that we recover some imitation policy $\pi$ with identical expected feature counts to the demonstrator:

\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]\,. (4)

Because expected reward is linear in the (matched) feature expectations above, the return obtained by $\pi$ under the unknown true reward function $r_{\theta_\star}(s_{t},a_{t})$ must be the same as the return obtained by the demonstrator $\mathcal{D}$ under that reward function [1]. If our imitation policy $\pi$ is optimal under reward function parameters $\hat{\theta}$, then it is reasonable to say that $\hat{\theta}$ is an estimate of the demonstrator's true reward parameters. However, in general there will be many choices of $\hat{\theta}$ that produce imitation policies $\pi$ with the same expected feature counts as the demonstrator. In the next section, we will see how we can apply the principle of maximum entropy to break ties between these reward functions.
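As an illustration, here is a short Python sketch (ours; names such as feature_expectations are made up) that estimates the discounted feature expectations appearing in Eq. 4 from sampled trajectories, so the two sides of the constraint can be compared empirically:

# Monte Carlo estimate of E[sum_t gamma^t phi(S_t, A_t)]; `phi` maps a
# (state, action) pair to a d-dimensional feature vector, and each trajectory
# is a list of (s, a) pairs.
import numpy as np

def feature_expectations(trajectories, phi, gamma, d):
    total = np.zeros(d)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            total += (gamma ** t) * phi(s, a)
    return total / len(trajectories)

# Feature matching check: if Eq. 4 holds, any linear reward theta @ phi(s, a)
# gives the demonstrator and the imitator the same expected return.
# expert_fe = feature_expectations(demo_trajs, phi, 0.99, d)
# imitation_fe = feature_expectations(imitation_trajs, phi, 0.99, d)
# np.allclose(expert_fe, imitation_fe)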

2.3 Maximum causal entropy

The principle of maximum entropy holds that when choosing between many probability distributions that are consistent with the data, one should pick the distribution that has highest entropy. Intuitively, such a distribution is the “most uncertain” among those that meet the data-fitting constraints. This principle can also be formally justified with an appeal to games: choosing the maximum entropy distribution minimises one’s expected log loss in the setting where an adversary is able to choose the true distribution from those consistent with the data [25].

In an IRL setting, this principle leads to a simple and effective algorithm for simultaneously recovering a reward function and corresponding imitation policy from demonstrations. In particular, in MCE IRL we choose the reward function whose corresponding policy has maximal causal entropy:

Definition 2.2.

Let $S_{0:T}$ and $A_{0:T}$ be random variables representing states and actions induced by following a policy $\pi$ in an MDP and sampled according to Eq. 1. Then the causal entropy $H(A_{0:T-1}\|S_{0:T-1})$ (a definition that can also be generalised to non-Markovian contexts [29, Section 4.2]) is the discounted sum of the entropies of the policy's action selection conditioned on the state at each timestep:

H(A_{0:T-1}\|S_{0:T-1})=\mathbb{E}_{\pi}\left[-\sum_{t=0}^{T-1}\gamma^{t}\log\pi_{t}(A_{t}\,|\,S_{t})\right]=\sum_{t=0}^{T-1}\gamma^{t}H(A_{t}\,|\,S_{t})\,. (5)

Note the sum in Definition 2.2 is discounted, effectively valuing the entropy of later actions less. This is needed for consistency with discounted returns, and for convergence in infinite-horizon problems; see Haarnoja et al. [12, Appendix A] for more information. We will revisit the subtleties of infinite horizons in Section 4.1.1.
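As a companion to Definition 2.2, here is an illustrative Python sketch (ours, not from the paper) that computes the discounted causal entropy of Eq. 5 exactly for a tabular, possibly non-stationary policy, by propagating the state marginal $\mathbb{P}_{\pi}(S_{t}=s)$ forward through the dynamics:

# Exact discounted causal entropy (Eq. 5) for a tabular policy.
import numpy as np

def causal_entropy(init_dist, transitions, policy, gamma):
    """init_dist: (S,), transitions: (S, A, S), policy: (T, S, A)."""
    T_horizon, S, A = policy.shape
    state_marginal = init_dist.copy()
    total = 0.0
    for t in range(T_horizon):
        pi_t = policy[t]                                   # (S, A)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_pi = np.where(pi_t > 0, np.log(pi_t), 0.0)
        per_state_entropy = -(pi_t * log_pi).sum(axis=1)   # H(A_t | S_t = s)
        total += gamma ** t * state_marginal @ per_state_entropy
        # next marginal: sum_{s,a} P(S_t = s) pi_t(a|s) T(s'|s,a)
        state_marginal = np.einsum("s,sa,sap->p", state_marginal, pi_t, transitions)
    return total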

Remark 2.1.

The causal entropy $H(A_{0:T-1}\|S_{0:T-1})$ has the useful property that, at each time step, it conditions only on information available to the agent (that is, the current state, as well as prior states and actions). By contrast, the conditional entropy of actions given states, $H(A_{0:T-1}\,|\,S_{0:T-1})$, conditions on states that arise after each action was taken. Moreover, the conventional Shannon entropy $H(S_{0:T-1},A_{0:T-1})$ measures the entropy of the entire trajectory distribution, introducing an unwanted dependency on the transition dynamics via the terms $H(S_{t}\,|\,S_{t-1},A_{t-1})$:

H(S_{0:T-1},A_{0:T-1})=\sum_{t=0}^{T-1}\gamma^{t}H(S_{t},A_{t}\,|\,S_{0:t-1},A_{0:t-1}) [chain rule] (6)
=\sum_{t=0}^{T-1}\gamma^{t}H(S_{t}\,|\,S_{0:t-1},A_{0:t-1})+\sum_{t=0}^{T-1}\gamma^{t}H(A_{t}\,|\,S_{0:t},A_{0:t-1}) [chain rule] (7)
=\sum_{t=0}^{T-1}\gamma^{t}H(S_{t}\,|\,S_{t-1},A_{t-1})+\sum_{t=0}^{T-1}\gamma^{t}H(A_{t}\,|\,S_{t}) [independence] (8)
=\sum_{t=0}^{T-1}\underbrace{\gamma^{t}H(S_{t}\,|\,S_{t-1},A_{t-1})}_{\text{state transition entropy}}+\underbrace{H(A_{0:T-1}\|S_{0:T-1})}_{\text{causal entropy}}\,. [causal entropy definition] (9)

Maximising Shannon entropy therefore introduces a bias towards taking actions with uncertain (and possibly risky) outcomes. For this reason, maximum causal entropy should be used rather than maximum (Shannon) entropy in stochastic MDPs. In deterministic MDPs, $H(S_{t}\,|\,S_{t-1},A_{t-1})=0$ and the two methods are equivalent (Section 3.3).

3 Maximum Causal Entropy (MCE) IRL

In this section, we start by reviewing two complementary ways of formalising the MCE IRL objective as an optimisation problem. First, we will consider MCE IRL as a way of solving a particular constrained optimisation problem. Second, we will describe MCE IRL in terms of maximum likelihood estimation, which will allow us to replace the linear reward function with a non-linear one. Finally, we will discuss how MCE IRL simplifies to Maximum Entropy (ME) IRL under deterministic dynamics.

3.1 MCE IRL as feature expectation matching

MCE IRL was originally introduced by Ziebart et al. [31], who considered the general setting of non-Markovian dynamics and trajectory-centric features. They then derived simplifications for the special case of Markovian dynamics and policies, as well as feature functions that decompose across state–action transitions. This primer only considers the simpler case (decomposed features, Markovian policies), and is thus substantially shorter. Moreover, we will assume throughout that the horizon $T$ is finite and that the policy is non-stationary (time-dependent); removing these assumptions is discussed in Section 4.1.1.

Optimisation Problem 3.1 (MCE IRL primal problem).

Given an expert data distribution $\mathcal{D}$, the optimisation problem of finding a time-dependent stochastic policy $\pi_{t}(a_{t}\,|\,s_{t})$ that matches feature expectations while maximising causal entropy can be written as

\max_{\pi\in\mathscr{P}}\quad H(A_{0:T-1}\|S_{0:T-1}) (10)
\text{subject to}\quad \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right] (11)

where $\mathscr{P}$ denotes the set of policies whose action distribution at each state lies within the standard policy simplex, so that

\pi\in\mathscr{P}\iff\pi_{t}(a_{t}\,|\,s_{t})\geq 0\ \text{ and }\ \sum_{a'}\pi_{t}(a'\,|\,s_{t})=1\quad(\forall\,0\leq t<T,\ \forall a_{t}\in\mathcal{A},\ \forall s_{t}\in\mathcal{S})\,, (12)

and $S_{0:T-1}$ and $A_{0:T-1}$ are random sequences of states and actions, induced by $\pi_{t}$ and sampled according to Eq. 1 on the left-hand side, and by $\mathcal{D}$ on the right-hand side.

This primal problem is not convex in general, but we will nevertheless approach solving it using a standard recipe for convex optimisation problems [3, Chap. 5]:

1. Form the Lagrangian. First, we will convert the feature expectation matching constraint into a weighted penalty term, producing the Lagrangian for the problem.

2. Derive the dual problem. Next, we will use the Lagrangian to form a dual problem that is equivalent to Optimisation Problem 3.1. The dual problem will introduce an extra set of parameters, called Lagrange multipliers, which in this case will be interpretable as weights of a reward function.

3. Dual ascent. Finally, we will solve the dual problem with a simple procedure called dual ascent, which alternates between exactly maximising the Lagrangian with respect to the policy parameters (the primal variables) and taking a single gradient step to minimise the Lagrangian with respect to the reward weights (the dual variables). This process is repeated until the dual variables converge.

3.1.1 The Lagrangian and the dual problem

We begin our derivation of the dual problem by forming the Lagrangian $\Lambda:\mathscr{P}\times\mathbb{R}^{d}\to\mathbb{R}$ for Optimisation Problem 3.1:

\Lambda(\pi,\theta)=H(A_{0:T-1}\|S_{0:T-1}) (13a)
\qquad+\theta^{T}\left(\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]-\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]\right)\,. (13b)

The dual variable vector $\theta\in\mathbb{R}^{d}$ can be interpreted as a set of penalty weights enforcing the feature expectation matching constraint in Eq. 13b. Importantly, $\pi$ is constrained to the simplex $\mathscr{P}$; later, we will need to be careful with this constraint when we form a separate, nested optimisation problem to compute $\pi$.

Equipped with the Lagrangian, we can now express the dual problem to Optimisation Problem 3.1:

Optimisation Problem 3.2 (Dual MCE IRL problem).

Define the dual function $g(\theta)$ as

g(\theta)=\max_{\pi\in\mathscr{P}}\Lambda(\pi,\theta)\,. (14)

The dual MCE IRL problem is the problem of finding a $\pi^{*}$ and $\theta^{*}$ such that $\theta^{*}$ attains

g(\theta^{*})=\min_{\theta\in\mathbb{R}^{d}}g(\theta)\,, (15)

while $\pi^{*}$ attains

\Lambda(\pi^{*},\theta^{*})=g(\theta^{*})\,. (16)

Rather than optimising the primal (Optimisation Problem 3.1), we will optimise the dual (Optimisation Problem 3.2) and treat the recovered $\pi^{*}$ as a solution to the primal imitation learning problem. Moreover, we will later see that the recovered $\theta^{*}$ can be interpreted as parameters of a reward function that incentivises reproducing the expert demonstrations.

3.1.2 The dual function

Observe that Optimisation Problem 3.2 is expressed in terms of a dual function $g(\theta)=\max_{\pi}\Lambda(\pi,\theta)$. Computing $g(\theta)$ can therefore be viewed as a nested optimisation over $\pi$, subject to the probability simplex constraints defining $\mathscr{P}$.

We will begin by putting the optimisation defining the dual function into a familiar form. First, recall that the policy simplex $\mathscr{P}$ is defined by two constraints:

\pi_{t}(a_{t}\,|\,s_{t})\geq 0\ \text{ and }\ \sum_{a}\pi_{t}(a\,|\,s_{t})=1\quad(\forall\,0\leq t<T,\ \forall a_{t}\in\mathcal{A},\ \forall s_{t}\in\mathcal{S})\,. (17)

Since the causal entropy term in $\Lambda$ is undefined when $\pi$ has negative elements, we will treat the non-negativity constraint as implicit, and focus only on the normalisation constraint. We will rewrite the normalisation constraint as $h_{s_{t}}(\pi)=0$, where the function $h_{s_{t}}$ is defined as

h_{s_{t}}(\pi)=\sum_{a\in\mathcal{A}}\pi_{t}(a\,|\,s_{t})-1\,. (18)

This gives rise to the following optimisation problem:

Optimisation Problem 3.3 (Dual function primal problem).

The problem of computing the dual function can be equivalently expressed as

\max_{\pi\in\mathscr{P}}\quad\Lambda(\pi,\theta) (19)
\text{subject to}\quad h_{s_{t}}(\pi)=0\quad(\forall\,0\leq t<T,\ \forall s_{t}\in\mathcal{S}) (20)

The Lagrangian for Optimisation Problem 3.3 is

\Psi(\pi,\mu;\theta)=\Lambda(\pi,\theta)+\sum_{s_{t}\in\mathcal{S},\,0\leq t<T}\mu_{s_{t}}h_{s_{t}}(\pi) (21)
=H(A_{0:T-1}\|S_{0:T-1})+\theta^{T}\left(\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]-\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]\right) (22)
\qquad+\sum_{s_{t}\in\mathcal{S},\,0\leq t<T}\mu_{s_{t}}\left(\sum_{a\in\mathcal{A}}\pi_{t}(a\,|\,s_{t})-1\right) (23)

where $\{\mu_{s_{t}}\}_{s_{t}\in\mathcal{S},\,0\leq t<T}$ are dual variables for the normalisation constraints, and we have left the dependence on $\theta$ implicit. We will attempt to find a maximiser of Optimisation Problem 3.3 by solving the Karush-Kuhn-Tucker (KKT) conditions for the problem [3, p. 224]. Since this problem has no explicit inequality constraints, we do not require complementary slackness conditions, and the KKT conditions are

\nabla_{\pi}\Psi(\pi,\mu;\theta)=0 (24)
h_{s_{t}}(\pi)=0\quad(\forall\,0\leq t<T,\ \forall s_{t}\in\mathcal{S})\,. (25)

Notice that $\mu$ only appears in the first equation. We will later see that we can set it to a value which ensures that the normalisation constraints are satisfied and the gradient vanishes. First, however, we will compute the gradient of the Lagrangian for Optimisation Problem 3.3:

Lemma 3.1.

The gradient of $\Psi(\pi,\mu;\theta)$ is given by

\nabla_{\pi_{t}(a_{t}|s_{t})}\Psi(\pi,\mu;\theta)=\mu_{s_{t}}-\rho_{\pi,t}(s_{t})\Bigg(1+\log\pi_{t}(a_{t}\,|\,s_{t})-\theta^{T}\phi(s_{t},a_{t})
\qquad-\mathbb{E}_{\pi}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\left(\theta^{T}\phi(S_{t'},A_{t'})-\log\pi_{t'}(A_{t'}\,|\,S_{t'})\right)\,\Big|\,s_{t},a_{t}\right]\Bigg)\,, (26)

where

\rho_{\pi,t}(s_{t})=\gamma^{t}\,\mathbb{E}_{\pi}\left[\mathbb{I}[S_{t}=s_{t}]\right]=\gamma^{t}\,\mathbb{P}_{\pi}(S_{t}=s_{t}) (27)

is the discounted probability that the agent will be in state $s_{t}$ at time $t$ if it follows policy $\pi$.

Proof.

We will derive this gradient term-by-term. Taking the derivative of the third term (normalisation) with respect to some arbitrary $\pi_{t}(a_{t}\,|\,s_{t})$ yields

\nabla_{\pi_{t}(a_{t}|s_{t})}\sum_{t'=0}^{T-1}\sum_{s_{t'}\in\mathcal{S}}\mu_{s_{t'}}\left(\sum_{a'\in\mathcal{A}}\pi_{t'}(a'\,|\,s_{t'})-1\right)=\mu_{s_{t}}\,. (28)

Note that we are differentiating with respect to the action selection probability $\pi_{t}(a_{t}\,|\,s_{t})$, which is a variable specific to time $t$; the policy need not be stationary, so we may have $\pi_{t}(a\,|\,s)\neq\pi_{t'}(a\,|\,s)$.

The derivative of the middle term (feature expectation matching) is

\nabla_{\pi_{t}(a_{t}|s_{t})}\,\theta^{T}\left(\mathbb{E}_{\pi}\left[\sum_{t'=0}^{T-1}\gamma^{t'}\phi(S_{t'},A_{t'})\right]-\mathbb{E}_{\mathcal{D}}\left[\sum_{t'=0}^{T-1}\gamma^{t'}\phi(S_{t'},A_{t'})\right]\right)
=\nabla_{\pi_{t}(a_{t}|s_{t})}\,\theta^{T}\mathbb{E}_{\pi}\left[\sum_{t'=0}^{T-1}\gamma^{t'}\phi(S_{t'},A_{t'})\right] (29)
=\theta^{T}\mathbb{E}_{S_{t}\sim\pi}\left[\nabla_{\pi_{t}(a_{t}|s_{t})}\mathbb{E}_{\pi}\left[\sum_{t'=t}^{T-1}\gamma^{t'}\phi(S_{t'},A_{t'})\,\Big|\,S_{t}\right]\right] (30)
=\theta^{T}\mathbb{E}_{S_{t}\sim\pi}\left[\gamma^{t}\mathbb{I}[S_{t}=s_{t}]\,\nabla_{\pi_{t}(a_{t}|s_{t})}\mathbb{E}_{\pi}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\phi(S_{t'},A_{t'})\,\Big|\,S_{t}=s_{t}\right]\right] (31)
=\rho_{\pi,t}(s_{t})\,\theta^{T}\nabla_{\pi_{t}(a_{t}|s_{t})}\mathbb{E}_{\pi}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\phi(S_{t'},A_{t'})\,\Big|\,S_{t}=s_{t}\right] (32)
=\rho_{\pi,t}(s_{t})\,\theta^{T}\nabla_{\pi_{t}(a_{t}|s_{t})}\Bigg(\pi_{t}(a_{t}\,|\,s_{t})\,\mathbb{E}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\phi(S_{t'},A_{t'})\,\Big|\,S_{t}=s_{t},A_{t}=a_{t}\right]
\qquad+\Big(\sum_{a'_{t}\neq a_{t}}\pi_{t}(a'_{t}\,|\,s_{t})\Big)\,\mathbb{E}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\phi(S_{t'},A_{t'})\,\Big|\,S_{t}=s_{t},A_{t}\neq a_{t}\right]\Bigg) (33)
=\rho_{\pi,t}(s_{t})\,\theta^{T}\mathbb{E}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\phi(S_{t'},A_{t'})\,\Big|\,s_{t},a_{t}\right] (34)
=\rho_{\pi,t}(s_{t})\,\theta^{T}\left[\phi(s_{t},a_{t})+\mathbb{E}_{\pi}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\phi(S_{t'},A_{t'})\,\Big|\,s_{t},a_{t}\right]\right]\,. (35)

Finally, the derivative of the first term (causal entropy) can be derived in a similar manner as

\nabla_{\pi_{t}(a_{t}|s_{t})}H(A_{0:T-1}\|S_{0:T-1})=-\nabla_{\pi_{t}(a_{t}|s_{t})}\mathbb{E}_{\pi}\left[\sum_{t'=0}^{T-1}\gamma^{t'}\log\pi_{t'}(A_{t'}\,|\,S_{t'})\right] (36)
\quad=-\rho_{\pi,t}(s_{t})\,\nabla_{\pi_{t}(a_{t}|s_{t})}\mathbb{E}_{\pi}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\log\pi_{t'}(A_{t'}\,|\,S_{t'})\,\Big|\,s_{t}\right] (37)
\quad=-\rho_{\pi,t}(s_{t})\,\nabla_{\pi_{t}(a_{t}|s_{t})}\mathbb{E}_{\pi}\left[\log\pi_{t}(A_{t}\,|\,S_{t})+\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\log\pi_{t'}(A_{t'}\,|\,S_{t'})\,\Big|\,s_{t}\right] (38)
\quad=-\rho_{\pi,t}(s_{t})\left[1+\log\pi_{t}(a_{t}\,|\,s_{t})+\mathbb{E}_{\pi}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\log\pi_{t'}(A_{t'}\,|\,S_{t'})\,\Big|\,s_{t},a_{t}\right]\right]\,. (39)

Putting it all together, the derivative of $\Psi(\pi,\mu;\theta)$ with respect to our policy is

\nabla_{\pi_{t}(a_{t}|s_{t})}\Psi(\pi,\mu;\theta)=\mu_{s_{t}}-\rho_{\pi,t}(s_{t})\Bigg(1+\log\pi_{t}(a_{t}\,|\,s_{t})-\theta^{T}\phi(s_{t},a_{t})
\qquad-\mathbb{E}_{\pi}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\left(\theta^{T}\phi(S_{t'},A_{t'})-\log\pi_{t'}(A_{t'}\,|\,S_{t'})\right)\,\Big|\,s_{t},a_{t}\right]\Bigg)\,. (26) ∎

We will now solve the first KKT condition, Eq. 24, by setting Eq. 26 equal to zero and solving for $\pi$. This will give us the $\pi$ which attains $g(\theta)=\Lambda(\pi,\theta)$, thereby allowing us to achieve our goal (for this sub-section) of computing the dual function $g(\theta)$. Conveniently, the resulting $\pi$ has a form which resembles the optimal policy under value iteration; indeed, the recursion defining $\pi$ is often called soft value iteration:

Lemma 3.2.

The KKT condition $\nabla_{\pi}\Psi(\pi,\mu;\theta)=0$ is satisfied by a policy

\pi_{t}(a_{t}\,|\,s_{t})=\exp\left(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t})\right) (40)

satisfying the following recursion (Soft VI):

V^{\mathrm{soft}}_{\theta,t}(s_{t})=\log\sum_{a_{t}\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})\quad(\forall\,0\leq t\leq T-1)
Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})=\theta^{T}\phi(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\left[V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\,\Big|\,s_{t},a_{t}\right]\quad(\forall\,0\leq t<T-1)
Q^{\mathrm{soft}}_{\theta,T-1}(s_{T-1},a_{T-1})=\theta^{T}\phi(s_{T-1},a_{T-1}) (41)
Proof.

We begin by setting the derivative of the Lagrangian $\Psi$ to zero and rearranging. Through this, we find that at optimality the policy (i.e. the primal variables) must take the form

\pi_{t}(a_{t}\,|\,s_{t})=\exp\Bigg(\theta^{T}\phi(s_{t},a_{t})-1+\frac{\mu_{s_{t}}}{\rho_{\pi,t}(s_{t})}+\mathbb{E}_{\pi}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\left(\theta^{T}\phi(S_{t'},A_{t'})-\log\pi_{t'}(A_{t'}\,|\,S_{t'})\right)\,\Big|\,s_{t},a_{t}\right]\Bigg)\,. (42)

We abuse notation by assuming that the last term vanishes when $t=T-1$, so that at the end of the trajectory we have

\pi_{T-1}(a_{T-1}\,|\,s_{T-1})=\exp\left(\theta^{T}\phi(s_{T-1},a_{T-1})-1+\frac{\mu_{s_{T-1}}}{\rho_{\pi,T-1}(s_{T-1})}\right)\,. (43)

Naively calculating the optimal policy from Eq. 42 would require enumerating exponentially many trajectories to evaluate the inner expectation. We will instead show that $\pi$ can be recovered by soft value iteration. First we decompose $\pi_{\theta,t}(a_{t}\,|\,s_{t})$ using Eq. 40, where the $Q^{\mathrm{soft}}$ and $V^{\mathrm{soft}}$ functions are defined as:

V^{\mathrm{soft}}_{\theta,t}(s_{t})\triangleq 1-\frac{\mu_{s_{t}}}{\rho_{\pi,t}(s_{t})}\,, (44)
Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})\triangleq\theta^{T}\phi(s_{t},a_{t})+\mathbb{E}_{\pi}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\left(\theta^{T}\phi(S_{t'},A_{t'})-\log\pi_{\theta,t'}(A_{t'}\,|\,S_{t'})\right)\,\Big|\,s_{t},a_{t}\right]\,, (45)
Q^{\mathrm{soft}}_{\theta,T-1}(s_{T-1},a_{T-1})\triangleq\theta^{T}\phi(s_{T-1},a_{T-1})\,. (46)

We can show that $Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})$ and $V^{\mathrm{soft}}_{\theta,t}(s_{t})$ satisfy a "softened" version of the Bellman equations. Note that by setting the $\mu_{s_{t}}$ variables to be appropriate normalising constants, we ensure that the normalisation KKT conditions in Eq. 25 are satisfied. Since we wish to constrain ourselves to normalised policies $\pi\in\mathscr{P}$, we can assume the $\mu_{s_{t}}$ are chosen such that $\sum_{a\in\mathcal{A}}\pi_{t}(a\,|\,s_{t})=1$ for all $s_{t}\in\mathcal{S}$. It then follows that $V^{\mathrm{soft}}_{\theta,t}(s_{t})$ must be a soft maximum over the $Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})$ values in $s_{t}$:

1=\sum_{a_{t}\in\mathcal{A}}\pi_{\theta,t}(a_{t}\,|\,s_{t}) (47)
=\sum_{a_{t}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t})\right) (48)
1\cdot\exp V^{\mathrm{soft}}_{\theta,t}(s_{t})=\left(\sum_{a_{t}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t})\right)\right)\cdot\exp V^{\mathrm{soft}}_{\theta,t}(s_{t}) (49)
=\sum_{a_{t}\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t}) (50)
V^{\mathrm{soft}}_{\theta,t}(s_{t})=\log\sum_{a_{t}\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})\,. (51)

Likewise, we can use the definitions of $Q^{\mathrm{soft}}_{\theta,t}$ and $V^{\mathrm{soft}}_{\theta,t}$ to show that for $t<T-1$, we have

Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})=\theta^{T}\phi(s_{t},a_{t})+\mathbb{E}_{\pi_{\theta}}\left[\sum_{t'=t+1}^{T-1}\gamma^{t'-t}\left(\theta^{T}\phi(S_{t'},A_{t'})-\log\pi_{\theta,t'}(A_{t'}\,|\,S_{t'})\right)\,\Big|\,S_{t}=s_{t},A_{t}=a_{t}\right] (52)
=\theta^{T}\phi(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\Big[\mathbb{E}_{\pi_{\theta}}\Big[Q^{\mathrm{soft}}_{\theta,t+1}(S_{t+1},A_{t+1})-\log\pi_{\theta,t+1}(A_{t+1}\,|\,S_{t+1})\,\Big|\,S_{t+1}\Big]\,\Big|\,S_{t}=s_{t},A_{t}=a_{t}\Big] (53)
=\theta^{T}\phi(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\Big[\mathbb{E}_{\pi_{\theta}}\Big[Q^{\mathrm{soft}}_{\theta,t+1}(S_{t+1},A_{t+1})-\left(Q^{\mathrm{soft}}_{\theta,t+1}(S_{t+1},A_{t+1})-V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\right)\,\Big|\,S_{t+1}\Big]\,\Big|\,S_{t}=s_{t},A_{t}=a_{t}\Big] (54)
=\theta^{T}\phi(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\left[V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\,\Big|\,S_{t}=s_{t},A_{t}=a_{t}\right]\,, (55)

where the penultimate step follows from substituting $\pi_{\theta,t+1}(a_{t+1}\,|\,s_{t+1})=\exp\left(Q^{\mathrm{soft}}_{\theta,t+1}(s_{t+1},a_{t+1})-V^{\mathrm{soft}}_{\theta,t+1}(s_{t+1})\right)$.

Putting it all together, we have the soft VI equations:

V^{\mathrm{soft}}_{\theta,t}(s_{t})=\log\sum_{a_{t}\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t}) (56)
Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})=\theta^{T}\phi(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\left[V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\,\Big|\,s_{t},a_{t}\right] (57)
Q^{\mathrm{soft}}_{\theta,T-1}(s_{T-1},a_{T-1})=\theta^{T}\phi(s_{T-1},a_{T-1})\,. (58) ∎

Remark 3.1 (Soft VI).

Observe that the soft VI equations are analogous to traditional value iteration, but with the hard maximum over actions replaced by a log-sum-exp, which acts as a "soft" maximum. In the finite-horizon case, these equations can be applied recursively from time $t=T-1$ down to $t=0$, yielding an action distribution $\pi_{\theta,t}(a_{t}\,|\,s_{t})=\exp(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t}))$ at each time step $t$ and state $s_{t}$. In the infinite-horizon case, we drop the subscripts $t$ and search for a fixed point $V^{\mathrm{soft}}_{\theta}$ and $Q^{\mathrm{soft}}_{\theta}$ of the soft VI equations, with corresponding stationary policy $\pi_{\theta}$.

In both cases, the agent chooses actions with probability exponential in the soft advantage $A^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})\triangleq Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t})$. Thus, computing the dual function $g(\theta)$ reduces to solving a planning problem. The similarity between the soft VI equations and the ordinary Bellman equations means that we are (somewhat) justified in interpreting the dual variable vector $\theta$ as weights for a reward function $r_{\theta}(s_{t},a_{t})=\theta^{T}\phi(s_{t},a_{t})$.
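The following is a minimal numpy sketch (illustrative only; function and variable names are our own) of the finite-horizon soft VI recursion of Eq. 41, returning the policy of Eq. 40 at every timestep. Rewards are passed as an $(|\mathcal{S}|,|\mathcal{A}|)$ matrix, e.g. $\Phi\theta$ for a linear reward:

# Finite-horizon soft value iteration (Eq. 41) for a tabular MDP.
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(reward, transitions, gamma, horizon):
    """reward: (S, A), transitions: (S, A, S). Returns (policy, V) per timestep."""
    S, A = reward.shape
    policy = np.zeros((horizon, S, A))
    V = np.zeros((horizon, S))
    Q = reward.copy()                                     # Q_{T-1} = theta^T phi
    for t in reversed(range(horizon)):
        if t < horizon - 1:
            Q = reward + gamma * transitions @ V[t + 1]   # middle line of Eq. 41
        V[t] = logsumexp(Q, axis=1)                       # soft maximum over actions
        policy[t] = np.exp(Q - V[t][:, None])             # Eq. 40
    return policy, V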

3.1.3 Learning a reward function with dual ascent

1: Initialise some reward parameter estimate $\theta_{0}$, and set $k\leftarrow 0$
2: repeat
3:     Apply soft value iteration to obtain the optimal policy $\pi_{\theta_{k}}$ w.r.t. $\theta_{k}$ (Eq. 41)
4:     $\theta_{k+1}\leftarrow\theta_{k}+\alpha_{k}\left(\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]-\mathbb{E}_{\pi_{\theta_{k}}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]\right)$ (Eq. 59)
5:     $k\leftarrow k+1$
6: until stopping condition satisfied
Algorithm 1: Maximum Causal Entropy (MCE) IRL on demonstrator $\mathcal{D}$

In the previous subsection, we determined how to find the policy $\pi$ which attains the value of the dual function $g(\theta)$, so that $g(\theta)=\Lambda(\pi,\theta)$. All that remains is to determine the step that we need to take on the dual variables $\theta$ (which we are interpreting as reward parameters). The gradient of the dual with respect to $\theta$ is given by

\nabla_{\theta}\Lambda(\pi,\theta)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]-\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\phi(S_{t},A_{t})\right]\,. (59)

The first (policy $\pi$) term can be computed by applying soft VI to $\theta$ and then rolling the optimal policy forward to obtain occupancy measures. The second (demonstrator $\mathcal{D}$) term can be computed directly from demonstration trajectories. This cycle of soft VI followed by a gradient step on $\theta$ is illustrated in Algorithm 1. In the next section, we will see an alternative perspective on the same algorithm that does not appeal to feature expectation matching or duality, and which can be extended to non-linear reward functions.
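Below is an illustrative sketch of Algorithm 1 for a linear reward $r_{\theta}(s,a)=\theta^{T}\phi(s,a)$, reusing the soft_value_iteration sketch above. The policy feature expectations are computed exactly from discounted occupancy measures rather than by sampling rollouts; the names, constant learning rate, and fixed iteration count are our own choices, not prescribed by the paper:

# Dual ascent for linear-reward MCE IRL. Phi has shape (S, A, d); demo_features
# is the empirical estimate of E_D[sum_t gamma^t phi(S_t, A_t)].
import numpy as np

def policy_feature_expectations(policy, init_dist, transitions, Phi, gamma):
    """Exact E_pi[sum_t gamma^t phi(S_t, A_t)] via discounted occupancy measures."""
    horizon, S, A = policy.shape
    state_marginal = init_dist.copy()
    fe = np.zeros(Phi.shape[-1])
    for t in range(horizon):
        occupancy = (gamma ** t) * state_marginal[:, None] * policy[t]   # (S, A)
        fe += np.einsum("sa,sad->d", occupancy, Phi)
        state_marginal = np.einsum("s,sa,sap->p", state_marginal, policy[t], transitions)
    return fe

def mce_irl(demo_features, Phi, init_dist, transitions, gamma, horizon,
            lr=0.1, n_iters=200):
    theta = np.zeros(Phi.shape[-1])
    for _ in range(n_iters):
        reward = Phi @ theta                                   # (S, A)
        policy, _ = soft_value_iteration(reward, transitions, gamma, horizon)
        policy_features = policy_feature_expectations(
            policy, init_dist, transitions, Phi, gamma)
        theta += lr * (demo_features - policy_features)        # descent on Eq. 59
        # (equivalently, a gradient ascent step on the log likelihood, Eq. 77)
    return theta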

3.2 MCE IRL as maximum likelihood estimation

In the previous section, we saw that IRL can be reduced to dual ascent on a feature expectation matching problem. We can also interpret this method as maximum likelihood estimation of reward parameters $\theta$ under the distribution over trajectories induced by the policy $\pi_{\theta,t}(a_{t}\,|\,s_{t})=\exp(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t}))$ defined by the soft VI recursion in Eq. 41. This perspective has the advantage of allowing us to replace the linear reward function $r_{\theta}(s_{t},a_{t})=\theta^{T}\phi(s_{t},a_{t})$ with a general non-linear reward function $r_{\theta}(s_{t},a_{t})$. If $r_{\theta}(s_{t},a_{t})$ is non-linear, the resulting problem no longer has the same maximum-entropy feature-matching justification as before, although it does still appear to work well in practice.

To show that the feature expectation matching and maximum likelihood views are equivalent when $r_{\theta}$ is linear, we will show that the gradient of the dual function $g(\theta)$ and the gradient of the expected log likelihood of the dataset are equivalent. First, we introduce the notion of the discounted likelihood of a trajectory.

Definition 3.1 (Discounted likelihood).

The discounted likelihood of a trajectory $\tau=(s_{0},a_{0},s_{1},a_{1},\ldots,a_{T-1},s_{T})$ under policy $\pi_{\theta}$ is

p_{\theta,\gamma}(\tau)=\mathcal{I}(s_{0})\prod_{t=0}^{T-1}\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})\,\pi_{\theta,t}(a_{t}\,|\,s_{t})^{\gamma^{t}}\,, (60)

where $\gamma\in[0,1]$ is a discount factor.

When $\gamma=1$, the discounted likelihood is simply the likelihood of $\tau$ under the policy $\pi_{\theta}$. When $\gamma<1$, the probabilities of actions later in the trajectory are regularised towards 1 by raising them to the power $\gamma^{t}$. The result is no longer a normalised probability distribution, but we will see that it allows us to draw a connection between maximum likelihood inference and discounted MCE IRL.
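For concreteness, here is a tiny sketch (ours) of the log of the discounted likelihood in Eq. 60 for a single tabular trajectory; note how the exponent $\gamma^{t}$ becomes a per-timestep weight on the action log-probabilities:

# Log discounted likelihood (Eq. 60) of one trajectory under a tabular,
# possibly non-stationary policy of shape (T, S, A).
import numpy as np

def log_discounted_likelihood(traj, policy, init_dist, transitions, gamma):
    """traj: list of (s, a, s_next) tuples of length T."""
    s0 = traj[0][0]
    logp = np.log(init_dist[s0])
    for t, (s, a, s_next) in enumerate(traj):
        logp += np.log(transitions[s, a, s_next])
        logp += (gamma ** t) * np.log(policy[t, s, a])   # gamma^t regularises later actions
    return logp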

Now consider the log discounted likelihood of a single trajectory $\tau$ under a model in which the agent follows the soft VI policy $\pi_{\theta,t}(a_{t}\,|\,s_{t})$ with respect to reward parameters $\theta$:

\log p_{\theta,\gamma}(\tau)=\sum_{t=0}^{T-1}\gamma^{t}\log\pi_{\theta,t}(a_{t}\,|\,s_{t})+\log\mathcal{I}(s_{0})+\sum_{t=1}^{T}\log\mathcal{T}(s_{t}\,|\,s_{t-1},a_{t-1}) (61)
=\sum_{t=0}^{T-1}\gamma^{t}\left(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})-V^{\mathrm{soft}}_{\theta,t}(s_{t})\right)+f(\tau)\,. (62)

Here $f(\tau)$ collects the $\log\mathcal{T}$ and $\log\mathcal{I}$ (dynamics) terms, which are constant in $\theta$. Although the expression above only matches the true log likelihood when $\gamma=1$, we have nevertheless added explicit discount factors so that the gradient equivalence with $g(\theta)$ holds when $\gamma<1$. The expected log likelihood of the demonstrator's trajectories, sampled from the demonstration distribution $\mathcal{D}$, is then

\mathcal{L}(\mathcal{D};\theta)=\mathbb{E}_{\mathcal{D}}\left[\log p_{\theta,\gamma}(\tau)\right] (63)
=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\left(Q^{\mathrm{soft}}_{\theta,t}(S_{t},A_{t})-V^{\mathrm{soft}}_{\theta,t}(S_{t})\right)\right]+c (64)
=\mathbb{E}_{\mathcal{D}}\Bigg[\sum_{t=0}^{T-2}\gamma^{t}\left(r_{\theta}(S_{t},A_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\left[V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\,|\,S_{t},A_{t}\right]-V^{\mathrm{soft}}_{\theta,t}(S_{t})\right)+\gamma^{T-1}\left(r_{\theta}(S_{T-1},A_{T-1})-V^{\mathrm{soft}}_{\theta,T-1}(S_{T-1})\right)\Bigg]+c (65)
=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(S_{t},A_{t})\right]+\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-2}\gamma^{t+1}\,\mathbb{E}_{\mathcal{T}}\left[V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\,|\,S_{t},A_{t}\right]\right]-\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}V^{\mathrm{soft}}_{\theta,t}(S_{t})\right]+c\,, (66)

where $c$ is a constant that does not depend on $\theta$.

The transitions in the demonstration distribution $\mathcal{D}$ will follow $\mathcal{T}$ provided the demonstrator acts in the same MDP, so we can drop the inner $\mathbb{E}_{\mathcal{T}}$. Critically, this would not be possible were $\mathcal{D}$ instead an empirical distribution over a finite number of samples, as then $\mathcal{D}$ might not exactly follow $\mathcal{T}$. However, this is a reasonable approximation for sufficiently large demonstration datasets.

After dropping $\mathbb{E}_{\mathcal{T}}$, we simplify the telescoping sums:

\mathcal{L}(\mathcal{D};\theta)=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(S_{t},A_{t})\right]+\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-2}\gamma^{t+1}V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})-\sum_{t=0}^{T-1}\gamma^{t}V^{\mathrm{soft}}_{\theta,t}(S_{t})\right]+c (67)
=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(S_{t},A_{t})\right]-V^{\mathrm{soft}}_{\theta,0}(s_{0})+c\,. (68)

All that remains is to show that the gradient of this expression coincides (up to sign) with the gradient of the dual function $g(\theta)$ from Section 3.1.

Lemma 3.3.

Assume that the reward function is linear in its parameters $\theta$, so that $r_{\theta}(s,a)=\theta^{T}\phi(s,a)$. Then gradient ascent on the expected log likelihood from Eq. 68 updates the parameters in the same direction as gradient descent on $g(\theta)$ from Optimisation Problem 3.2; that is,

\nabla_{\theta}\mathcal{L}(\mathcal{D};\theta)=-\nabla_{\theta}g(\theta)\,, (69)

where the negative sign on the right arises because we are doing gradient ascent on the expected log likelihood, and gradient descent on the dual function $g(\theta)$.

Proof.

We already know the gradient of $g(\theta)$ from Eq. 59, so we only need to derive the gradient of $\mathcal{L}(\mathcal{D};\theta)$. Computing the gradient of the first term of $\mathcal{L}(\mathcal{D};\theta)$ is trivial: we push the differentiation operator inside the expectation, so that we are averaging $\nabla_{\theta}r_{\theta}(s_{t},a_{t})$ over the states and actions in our dataset of samples from $\mathcal{D}$:

\nabla_{\theta}\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(S_{t},A_{t})\right]=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]\,. (70)

The second term is slightly more involved, but can be derived recursively as

\nabla_{\theta}V^{\mathrm{soft}}_{\theta,t}(s_{t})=\nabla_{\theta}\log\sum_{a_{t}\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t}) (71)
=\frac{1}{\sum_{a_{t}\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})}\sum_{a_{t}\in\mathcal{A}}\nabla_{\theta}\exp Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t}) (72)
=\frac{\sum_{a_{t}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})\right)\nabla_{\theta}Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t})}{\exp V^{\mathrm{soft}}_{\theta,t}(s_{t})} (73)
=\sum_{a_{t}\in\mathcal{A}}\pi_{\theta,t}(a_{t}\,|\,s_{t})\,\nabla_{\theta}Q^{\mathrm{soft}}_{\theta,t}(s_{t},a_{t}) (74)
=\mathbb{E}_{\pi_{\theta,t}}\left[\nabla_{\theta}r_{\theta}(s_{t},A_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\left[\nabla_{\theta}V^{\mathrm{soft}}_{\theta,t+1}(S_{t+1})\,\Big|\,s_{t},A_{t}\right]\,\Big|\,s_{t}\right] (75)
=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t'=t}^{T-1}\gamma^{t'-t}\nabla_{\theta}r_{\theta}(S_{t'},A_{t'})\,\Big|\,s_{t}\right]\,, (76)

where the last step unrolled the recursion to the end of the horizon. Armed with this derivative, the gradient of our log likelihood is

\nabla_{\theta}\mathcal{L}(\mathcal{D};\theta)=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]-\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]\,. (77)

In the special case of a linear reward $r_{\theta}(s,a)=\theta^{T}\phi(s,a)$, this gradient is equal to $-\nabla_{\theta}g(\theta)$ (where $\nabla_{\theta}g(\theta)$ is given in Eq. 59). This shows that maximising the log likelihood $\mathcal{L}$ is equivalent to minimising the dual function $g(\theta)$ from the previous section. ∎
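As an illustration of the maximum likelihood view with a non-linear reward, the sketch below (our own, assuming PyTorch is available; the RewardNet-style setup and toy problem are made up, not from the paper) performs gradient ascent on the discounted log likelihood of Eq. 68 by running differentiable soft VI and letting automatic differentiation supply the gradient of Eq. 77:

# Non-linear reward MCE IRL via autodiff on Eq. 68 (tabular toy problem).
import torch
import torch.nn as nn

def soft_values(R, P, gamma, T):
    """Finite-horizon soft VI (Eq. 41) with differentiable ops.
    R: (S, A) reward matrix, P: (S, A, S) transition tensor."""
    Q = R.clone()                                 # Q_{T-1} = r_theta
    V = torch.logsumexp(Q, dim=1)                 # V_{T-1}
    Vs = [V]
    for _ in range(T - 1):                        # backward induction
        Q = R + gamma * torch.einsum("sap,p->sa", P, V)
        V = torch.logsumexp(Q, dim=1)
        Vs.append(V)
    Vs.reverse()                                  # Vs[t] = V_t
    return Vs

def demo_return(R, trajs, gamma):
    """Empirical estimate of E_D[sum_t gamma^t r_theta(S_t, A_t)]."""
    total = 0.0
    for traj in trajs:                            # traj: list of (s, a) pairs
        for t, (s, a) in enumerate(traj):
            total = total + (gamma ** t) * R[s, a]
    return total / len(trajs)

# toy problem: 3 states, 2 actions, deterministic start in state 0
S, A, T, gamma = 3, 2, 10, 0.95
P = torch.rand(S, A, S); P = P / P.sum(-1, keepdim=True)
feats = torch.eye(S * A)                          # one-hot (s, a) features
reward_net = nn.Sequential(nn.Linear(S * A, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-2)
demos = [[(0, 1), (1, 0), (2, 1)]]                # stand-in demonstrations

for step in range(200):
    R = reward_net(feats).reshape(S, A)           # r_theta(s, a) for all (s, a)
    V0 = soft_values(R, P, gamma, T)[0][0]        # V^soft_{theta,0}(s_0), s_0 = 0
    loglik = demo_return(R, demos, gamma) - V0    # Eq. 68 (up to a constant)
    opt.zero_grad(); (-loglik).backward(); opt.step()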

3.3 Maximum Entropy (ME) IRL: MCE IRL for deterministic MDPs

So far we have considered the general setting of stochastic transition dynamics, where for any given state–action pair $(s_{t},a_{t})\in\mathcal{S}\times\mathcal{A}$ there may be many possible successor states $s_{t+1}\in\mathcal{S}$ for which $\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})>0$. Consider what happens if we act under the Maximum Causal Entropy (MCE) policy $\pi_{\theta,t}(a_{t}\,|\,s_{t})=\exp(Q_{\theta,t}(s_{t},a_{t})-V_{\theta,t}(s_{t}))$ for some $\theta$, but restrict ourselves to deterministic dynamics, where $\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})=1$ for exactly one successor state $s_{t+1}$ and zero for all others. Here the discounted likelihood of a feasible trajectory $\tau$ (one whose state transitions agree with the dynamics) simplifies substantially.

Lemma 3.4.

In an MDP with deterministic state transitions and a deterministic initial state distribution, the MCE IRL density $p_{\theta,\gamma}(\tau)$ takes the form

p_{\theta,\gamma}(\tau)=\begin{cases}\frac{1}{Z_{\theta}}\exp\left(\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(s_{t},a_{t})\right)&\text{feasible }\tau\\0&\text{infeasible }\tau\,,\end{cases} (78)

where $Z_{\theta}=\exp V_{\theta,0}(s_{0})$ is a normalising constant that depends on the (deterministic) initial state $s_{0}$.

Proof.

Assuming deterministic dynamics, the discounted likelihood for a feasible $\tau$ becomes:

p_{\theta,\gamma}(\tau)=\mathcal{I}(s_{0})\prod_{t=0}^{T-1}\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})\,\pi_{t}(a_{t}\,|\,s_{t})^{\gamma^{t}} (79)
=\mathcal{I}(s_{0})\prod_{t=0}^{T-1}\pi_{t}(a_{t}\,|\,s_{t})^{\gamma^{t}} (80)
=\mathcal{I}(s_{0})\exp\left(\sum_{t=0}^{T-1}\gamma^{t}\left(Q_{\theta,t}(s_{t},a_{t})-V_{\theta,t}(s_{t})\right)\right) (81)
=\mathcal{I}(s_{0})\exp\Bigg(\gamma^{T-1}\left(r_{\theta}(s_{T-1},a_{T-1})-V_{\theta,T-1}(s_{T-1})\right)+\sum_{t=0}^{T-2}\gamma^{t}\left(r_{\theta}(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{T}}\left[V_{\theta,t+1}(S_{t+1})\,|\,s_{t},a_{t}\right]-V_{\theta,t}(s_{t})\right)\Bigg) (82)
=\mathcal{I}(s_{0})\exp\Bigg(\gamma^{T-1}\left(r_{\theta}(s_{T-1},a_{T-1})-V_{\theta,T-1}(s_{T-1})\right)+\sum_{t=0}^{T-2}\gamma^{t}\left(r_{\theta}(s_{t},a_{t})+\gamma V_{\theta,t+1}(s_{t+1})-V_{\theta,t}(s_{t})\right)\Bigg) (83)
=\frac{\mathcal{I}(s_{0})}{\exp V_{\theta,0}(s_{0})}\exp\left(\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(s_{t},a_{t})\right)\,. (84)

We made use of determinism to drop the dynamics factors in Eq. 80, and to collapse the expectation over successor states that appeared in Eq. 82. If we now assume that the initial state distribution $\mathcal{I}$ is deterministic (with all weight on the actual initial state $s_{0}$ at the beginning of the trajectory $\tau$), then we get $\mathcal{I}(s_{0})=1$, from which the result follows. ∎

If we perform IRL under the assumption that our trajectory distribution follows the form of Eq. 78, then we recover the Maximum Entropy IRL algorithm (abbreviated MaxEnt or ME) that was first proposed by Ziebart [30]. As we will see in Section 4, this simplified algorithm lends itself well to approximation in environments with unknown (but deterministic) dynamics [6, 5, 8].

3.3.1 ME IRL implies risk-seeking behaviour in stochastic environments

Observe that the Maximum Entropy (ME) IRL discounted likelihood in Eq. 78 takes a much simpler form than the likelihood induced by MCE IRL, which requires recursive evaluation of the soft VI equations. Given that ME IRL has a simpler discounted likelihood, one might be tempted to use ME IRL in environments with stochastic dynamics by inserting dynamics factors to get a discounted likelihood of the form

\hat{p}_{\theta,\gamma}(\tau)=\mathcal{I}(s_{0})\prod_{t=0}^{T-1}\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})\exp\left(\gamma^{t}r_{\theta}(s_{t},a_{t})\right)\,. (85)

Unfortunately, "generalising" ME IRL in this way may not produce appropriate behaviour in environments with truly non-deterministic dynamics. The intuitive reason is that the ME IRL discounted likelihood in Eq. 78 can assign arbitrarily high likelihood to any feasible trajectory, so long as the return associated with that trajectory is high enough. Indeed, by moving the dynamics factors into the $\exp$, we can see that the discounted likelihood takes the form

\hat{p}_{\theta,\gamma}(\tau)=\exp\left(R_{\theta}(\tau)+\log\mathcal{I}(s_{0})+\sum_{t=0}^{T-1}\log\mathcal{T}(s_{t+1}\,|\,s_{t},a_{t})\right)\,, (86)

where $R_{\theta}(\tau)=\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(s_{t},a_{t})$. This form suggests that the agent can simply pay a "log cost" in order to choose outcomes that would give it better return. A trajectory $\tau$ with extremely high return $R_{\theta}(\tau)$ relative to other trajectories may therefore have high likelihood even when the associated transition probabilities are low. In reality, if a non-deterministic environment has a state transition $(s,a,s')$ with very low (but non-zero) probability $\epsilon$, then no trajectory containing that transition can have likelihood more than $\epsilon$ under any physically realisable policy, even if the trajectory has extremely high return. (Recall that the discounted likelihood must be normalised to give a valid probability distribution; the reasoning here still applies after normalisation, since as the return of a trajectory increases, its probability under the model approaches 1, even if stochastic transitions make it impossible for any policy to actually achieve such a trajectory distribution.)

[Figure 1 shows the RiskyPath MDP: states $s_{0},s_{1},s_{2},s_{3}$, with rewards $r=1.0$ at $s_{2}$ and $r=-100.0$ at $s_{3}$, and a risky transition from $s_{0}$ that reaches $s_{2}$ or $s_{3}$ with 50% probability each.]
Figure 1: RiskyPath: The agent can either take a long but certain path to the goal ($s_{0}\to s_{1}\to s_{2}$), or attempt to take a shortcut ($s_{0}\to s_{2}$), with the risk of receiving a low reward ($s_{0}\to s_{3}$). This example is simplified from Ziebart's thesis [29, Figure 6.4].

This mismatch between the real MDP dynamics and the ME IRL model can manifest as risk-taking behaviour in the presence of stochastic transitions. For example, in Fig. 1, an agent can reach the goal via $s_{0}\to s_{1}\to s_{2}$ with 100% probability, or take the shortcut and obtain $s_{0}\to s_{2}$ or $s_{0}\to s_{3}$ with 50% probability each. The ME IRL transition model would wrongly believe the agent could obtain $s_{0}\to s_{2}$ with arbitrarily high probability by paying a small cost of $\log\mathcal{T}(s_{2}\,|\,s_{0},\cdot)=\log\frac{1}{2}=-\log 2$. It would therefore conclude that an agent given rewards $r(s_{2})=1$ and $r(s_{3})=-100$ would favour the risky path for a sufficiently small discount factor $\gamma$, despite the true expected return always being higher for the safe path.
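To make this concrete, here is a short worked comparison under the illustrative convention (our assumption, not stated in the figure) that the reward is collected on the transition into $s_{2}$ or $s_{3}$, all other rewards are zero, and the episode ends once either state is reached. Scoring each feasible trajectory with the unnormalised ME model of Eq. 86 gives

\text{safe }(s_{0}\to s_{1}\to s_{2}):\quad R_{\theta}=\gamma\cdot 1,\quad \textstyle\sum\log\mathcal{T}=0,\quad \text{score}=e^{\gamma},
\text{risky, lucky }(s_{0}\to s_{2}):\quad R_{\theta}=1,\quad \textstyle\sum\log\mathcal{T}=-\log 2,\quad \text{score}=e/2,
\text{risky, unlucky }(s_{0}\to s_{3}):\quad R_{\theta}=-100,\quad \textstyle\sum\log\mathcal{T}=-\log 2,\quad \text{score}=e^{-100}/2\approx 0.

The model therefore prefers the risky action whenever $e/2>e^{\gamma}$, i.e. whenever $\gamma<1-\log 2\approx 0.31$, even though the true expected return of the risky action, $\tfrac{1}{2}(1)+\tfrac{1}{2}(-100)=-49.5$, is always far below the safe path's expected return of $\gamma$.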

4 Dynamics-free approximations to ME IRL

Real-world environments are often too complex to be described in the tabular form required by MCE IRL. To perform IRL in these settings, we can use “dynamics-free” approximations to MCE IRL which do not require a known world model. In this section we describe several such approximations, all of which assume the environment is deterministic (simplifying MCE to ME IRL) and that the horizon is effectively infinite (simplifying to a stationary policy).

We begin by describing the assumptions made by these algorithms in more detail, and adapting our previous definitions to the non-tabular setting. We then consider Guided Cost Learning (GCL), which directly approximates the gradient of the ME IRL objective using importance-weighted samples [6]. Finally, we will describe Adversarial IRL (AIRL) [8], a GAN-based approximation to ME IRL, that distinguishes between state–action pairs instead of distinguishing between trajectories. AIRL has been shown to be successful in standard RL benchmark environments, including some MuJoCo-based control tasks [8] and some Atari games [26].

4.1 Notation and Assumptions

Determinism

GCL is based on ME IRL, which Section 3.3 showed is only well-founded for deterministic dynamics. AIRL can be derived using MCE IRL, but many of the key results only hold for deterministic environments. Thus we will assume that environments are deterministic throughout this section.

Function approximation

Since we’ll be discussing results about non-parametric and non-linear reward functions, we will overload our QsoftQ^{\mathrm{soft}}{} and VsoftV^{\mathrm{soft}}{} notation from earlier to refer to arbitrary reward functions. For example, given some reward function r(s,a,s)r(s,a,s^{\prime}), we use Qrsoft(s,a)Q^{\mathrm{soft}}_{r}(s,a) to denote the soft Q-function under rr. We’ll also be making use of the soft advantage function, defined as

A^{\mathrm{soft}}_{r}(s,a)\triangleq Q^{\mathrm{soft}}_{r}(s,a)-V^{\mathrm{soft}}_{r}(s)\,. (87)

4.1.1 Stationary policies and infinite-horizon MDPs

In addition to nonlinear function approximation, the algorithms which we will consider in the next few sections are designed to work with stationary (time-independent) policies. Theoretically, the use of stationary policies can be justified by appealing to infinite horizons. In an infinite horizon problem, the value of a given state is independent of the current time step, since there will always be an infinite number of time steps following it. Consequently, the value function Vθ,tsoftV^{\mathrm{soft}}_{\theta,t} and Q-function Qθ,tsoftQ^{\mathrm{soft}}_{\theta,t} lose their time-dependence, and can instead be recovered as the fixed point of the following recurrence:

V^{\mathrm{soft}}_{r}(s)=\log\sum_{a\in\mathcal{A}}\exp Q^{\mathrm{soft}}_{r}(s,a)\,, (88)
Q^{\mathrm{soft}}_{r}(s,a)=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})+\gamma V^{\mathrm{soft}}_{r}(S^{\prime})\mid s,a\right]\,. (89)

This yields a stationary policy \pi(a\mid s)=\exp A^{\mathrm{soft}}_{r}(s,a).
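As a concrete illustration, here is a minimal sketch of soft value iteration (Eqs. 88–89) on a small random tabular MDP. For brevity it assumes a reward of the form r(s,a) rather than r(s,a,s′); the arrays R and T are stand-ins for the reward and dynamics.

```python
# A minimal sketch of soft value iteration (Eqs. 88-89) on a tabular MDP.
import numpy as np

def soft_value_iteration(R, T, gamma=0.95, iters=1000):
    """Iterate the soft Bellman recurrence to a fixed point; return Q, V, pi."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = np.log(np.exp(Q).sum(axis=1))        # Eq. 88 (log-sum-exp)
        Q = R + gamma * T @ V                    # Eq. 89, with r(s, a) reward
    V = np.log(np.exp(Q).sum(axis=1))
    pi = np.exp(Q - V[:, None])                  # pi(a|s) = exp A_soft(s, a)
    return Q, V, pi

# Toy example: 3 states, 2 actions, random dynamics and rewards.
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(3), size=(3, 2))       # T[s, a, :] sums to 1
R = rng.normal(size=(3, 2))
Q, V, pi = soft_value_iteration(R, T)
assert np.allclose(pi.sum(axis=1), 1.0)          # exp A_soft is a valid policy
```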

The notion of an “infinite” horizon might sound abstruse, given that real-world demonstration trajectories τ𝒟\tau\sim\mathcal{D} must be finite, but we can conceptually view them as being infinite trajectories padded with a virtual absorbing state. This helps us model situations where the demonstrator always completes the task in finite time, after which the demonstration ends.

We can also use this padding approach to model simulated tasks, where episodes often end when some termination condition is satisfied, such as a robot falling over or a vehicle reaching its destination. For the purpose of IRL, we can again imagine such trajectories as ending with an infinite sequence of repeated absorbing states, each with the same learned absorbing state reward value.
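A minimal sketch of this padding, purely for illustration: the sentinel names and the fixed horizon T_max are assumptions, and real implementations differ in how they represent the absorbing state.

```python
# A minimal sketch of padding a finite demonstration with a virtual absorbing
# state, as described above.
ABSORBING_STATE, NOOP = "absorbing", "noop"

def pad_with_absorbing(states, actions, T_max):
    """Extend a finite trajectory so it has exactly T_max actions."""
    states, actions = list(states), list(actions)
    while len(actions) < T_max:
        states.append(ABSORBING_STATE)
        actions.append(NOOP)
    return states, actions

states, actions = pad_with_absorbing(["s0", "s1", "goal"], ["a0", "a1"], T_max=5)
# A learned reward can then assign a single trainable value to the
# (ABSORBING_STATE, NOOP) pair instead of implicitly fixing it to zero.
print(states, actions)
```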

Note the presence of a termination condition may simplify the IRL problem considerably when episode termination is (positively or negatively) correlated with task success. Indeed, Kostrikov et al. [16] observe it is possible to get seemingly-competent behaviour in the Hopper environment by simply giving the agent a fixed positive reward at all time steps until termination, then zero afterwards. In general, the absorbing state reward tends to be low in tasks which require the agent to stay “alive” and high in goal-oriented tasks where the agent needs to finish an episode quickly.

Implementation termination bias

In practice, implementations of IRL algorithms often wrongly treat the episodes as being of finite but variable length, implicitly assuming that the absorbing state reward is exactly zero. To make matters worse, certain choices of neural network architecture can force a learned reward function to always be positive (or negative) up until termination. This can lead to situations where the algorithm “solves” an environment regardless of what reward parameters it learns, or where the algorithm cannot ever solve the environment because its reward function has the wrong sign. It is possible to fix this problem by explicitly learning an absorbing state reward, but for the purpose of comparing existing algorithms we instead recommend using fixed-horizon environments, which side-steps this problem entirely [9].

Convergence of soft VI

Soft value iteration still converges to a unique value in the infinite-horizon case, provided the discount factor γ is less than 1 [12, Appendix A.2]. In particular, Haarnoja et al. [12]—adapting an earlier proof by Fox et al. [7]—show that soft value iteration is a contraction mapping. It then follows from the Banach fixed point theorem that soft value iteration has a unique fixed point.

4.2 Guided cost learning: Approximation via importance sampling

Consider the gradient of the log discounted likelihood of a demonstration distribution 𝒟\mathcal{D} under the ME IRL model, which was previously shown in Eq. 77 to be

\nabla_{\theta}\mathcal{L}(\mathcal{D};\theta)=\underbrace{\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]}_{(a)}-\underbrace{\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]}_{(b)}\,. (90)

Sample-based approximation of term (a) in Eq. 90 is straightforward: we can sample N trajectories \tau_{1},\tau_{2},\ldots,\tau_{N}\sim\mathcal{D} from the demonstration distribution \mathcal{D}, then approximate the expectation with the sample mean

\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]\approx\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(s_{t,i},a_{t,i})\,. (91)

Approximating term (b) of Eq. 90 is harder, since recovering \pi_{\theta} requires planning against the estimated reward function r_{\theta}. Although we could compute \pi_{\theta} using maximum entropy RL [12], doing so would be extremely expensive, and we would have to repeat the process each time we update the reward parameters θ.

Importance sampling

Instead of learning and sampling from πθ\pi_{\theta}, we can estimate the ME IRL gradient using importance sampling. Importance sampling is a widely applicable technique for working with distributions that are difficult to sample from or evaluate densities for. Imagine that we would like to evaluate an expectation 𝔼pf(X)\operatorname*{{\mathbb{E}}}_{p}f(X), where XX is a random variable with density pp and f:𝒳f:\mathcal{X}\to\mathbb{R} is some function on the sample space 𝒳\mathcal{X}. Say that the distribution pp is intractable to sample from, and we only know how to evaluate an unnormalised density p~(x)=αp(x)\tilde{p}(x)=\alpha p(x) with unknown scaling factor α\alpha. However, imagine that it is still possible to draw samples from another distribution qq for which we can evaluate the density q(x)q(x) at any point x𝒳x\in\mathcal{X}. In this case, we can rewrite our expectation as:

\mathbb{E}_{p}\left[f(X)\right]=\int f(x)p(x)\,dx (92)
=\int f(x)\frac{p(x)}{q(x)}q(x)\,dx (93)
=\frac{1}{\alpha}\int f(x)\frac{\tilde{p}(x)}{q(x)}q(x)\,dx (94)
=\frac{1}{\alpha}\mathbb{E}_{q}\left[\frac{\tilde{p}(X)}{q(X)}f(X)\right]\,. (95)

Similarly, we can evaluate α\alpha using an expectation with respect to qq:

\alpha=\alpha\int p(x)\,dx (96)
=\int\tilde{p}(x)\,dx (97)
=\int\frac{\tilde{p}(x)}{q(x)}q(x)\,dx (98)
=\mathbb{E}_{q}\left[\frac{\tilde{p}(X)}{q(X)}\right]\,. (99)

If we let w(x)=\tilde{p}(x)/q(x), then we have

\mathbb{E}_{p}\left[f(X)\right]=\frac{\mathbb{E}_{q}\left[w(X)f(X)\right]}{\mathbb{E}_{q}\left[w(X)\right]}\,. (100)

Both the numerator and the denominator of Eq. 100 can be approximated by averaging w(X)f(X) and w(X) over samples drawn from q. This removes the need to sample from p or evaluate its density directly.
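For example, the following sketch estimates E_p[X^2] for a standard normal target known only up to scale, using a wider normal as the proposal q. Both distributions are illustrative choices.

```python
# A minimal sketch of self-normalised importance sampling (Eq. 100).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    return np.exp(-0.5 * x**2)                    # unnormalised N(0, 1) density

q_mean, q_std = 0.0, 2.0
xs = rng.normal(q_mean, q_std, size=100_000)      # samples from the proposal q
q_density = np.exp(-0.5 * ((xs - q_mean) / q_std) ** 2) / (q_std * np.sqrt(2 * np.pi))

w = p_tilde(xs) / q_density                       # importance weights
f = xs**2                                         # estimate E_p[X^2] = 1
estimate = np.sum(w * f) / np.sum(w)              # Eq. 100
print(f"self-normalised IS estimate of E_p[X^2]: {estimate:.3f}")  # close to 1
```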

Applying importance sampling to IRL

Now consider how we could use importance sampling to estimate Eq. 90. In this case, we wish to sample from the distribution induced by πθ\pi_{\theta}, which has discounted likelihood pθ,γ(τ)=1Zexprθ(τ)p_{\theta,\gamma}(\tau)=\frac{1}{Z}\exp r_{\theta}(\tau), for Z=exprθ(τ)𝑑τZ=\int\!\exp r_{\theta}(\tau)\,d\tau integrated over all feasible trajectories τ\tau.444 Note that if γ=1\gamma=1, then pθ,γp_{\theta,\gamma} will be undefined over infinite horizons for some MDPs, since rθ(τ)r_{\theta}(\tau) will diverge. On the other hand, if γ<1\gamma<1, then pθ,γp_{\theta,\gamma} will not be a normalised density unless we incorporate an additional γ\gamma-dependent factor into the Z(θ)Z(\theta) we derived in Lemma 3.4. We will ignore these subtleties for the remainder of the section by implicitly assuming that the horizon is finite, but that the policy is still stationary. Again, computing ZZ is intractable—doing so is equivalent to optimally solving an entropy-regularised RL problem. Instead of evaluating ZZ directly, we can use importance sampling: if there is some easy-to-compute trajectory distribution q(τ)q(\tau) that we can draw samples from, then we can rewrite the expectation in Eq. 90 as555To avoid confusion with the horizon TT, we are treating lowercase τ\tau as a random variable in these expectations.

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]=\frac{\mathbb{E}_{\tau\sim q}\left[w_{\theta}(\tau)\sum_{t=0}^{T-1}\gamma^{t}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]}{\mathbb{E}_{\tau\sim q}\left[w_{\theta}(\tau)\right]}\,, (101)

where w_{\theta}(\tau)=\left(\exp r_{\theta}(\tau)\right)/q(\tau) denotes the importance weighting function.
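The sketch below computes this importance-weighted estimate for a linear reward r_{\theta}(s,a)=\theta\cdot\phi(s,a), so that the gradient of r_{\theta} is just the feature vector φ. The random feature arrays and proposal log-densities are stand-ins for quantities that would come from rolling out q.

```python
# A self-contained sketch of the importance-weighted estimate in Eq. 101,
# assuming a linear reward so that grad_theta r_theta(s, a) = phi(s, a).
import numpy as np

rng = np.random.default_rng(0)
n_trajs, horizon, n_features = 32, 10, 4
gamma = 0.95
theta = rng.normal(size=n_features)

# phi[i, t] is the feature vector at step t of trajectory i; log_q[i] is the
# proposal's log-density of trajectory i (both stand-ins for rollout data).
phi = rng.normal(size=(n_trajs, horizon, n_features))
log_q = rng.normal(size=n_trajs)

discounts = gamma ** np.arange(horizon)
returns = np.einsum("t,itf,f->i", discounts, phi, theta)    # r_theta(tau)
disc_grads = np.einsum("t,itf->if", discounts, phi)         # sum_t gamma^t phi_t

log_w = returns - log_q                                     # log of exp(r)/q
w = np.exp(log_w - log_w.max())                             # stabilised weights
term_b = (w[:, None] * disc_grads).sum(axis=0) / w.sum()    # Eq. 101 estimate
print(term_b)
```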

The main insight of Guided Cost Learning (GCL) is that the proposal distribution q(τ)q(\tau) can be iteratively updated to bring it close to the optimal proposal distribution pθ(τ)p_{\theta}(\tau) without having to compute pθ(τ)p_{\theta}(\tau) exactly [6]. Specifically, after each cost function update, the proposal distribution q(τ)q(\tau) is updated by using reinforcement learning to maximise the entropy-regularised return,

R_{H}(q)=\mathbb{E}_{q}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{\theta}(S_{t},A_{t})\right]+H(q)\,. (102)

Since GCL uses importance sampling to evaluate the ME IRL gradient, it does not need to train qq to optimality against RH(q)R_{H}(q). Instead, doing a few RL steps after each cost function update can suffice to produce a good proposal distribution qq.

In addition to the core insight that an adaptive proposal distribution can improve importance-sampled gradient estimates in ME IRL, the GCL paper also proposes a number of tricks that are necessary to make the method work in practice. Most significantly, the actual proposal distribution q(τ)q(\tau) used in the GCL paper is not just a policy trained with RL, but a mixture between an RL-trained policy and an approximation of the expert trajectory distribution. Moreover, the learnt reward function uses an architecture and pair of regularisers that make it particularly well-suited to the goal-reaching tasks that GCL was evaluated on. In the remainder of this primer, we will introduce the Adversarial IRL algorithm [8], which does not rely on importance sampling and does not require as many special tricks to work well in practice.

4.3 An interlude on generative adversarial networks

In the next sub-section, we will see that ME IRL can be recast as a way of learning a specially-structured kind of Generative Adversarial Network (GAN). This sub-section will briefly review the basic principles behind GANs. Readers unfamiliar with GANs may wish to consult the original GAN paper [11].

GANs include a generator function x=G(z)x=G(z) that maps from a noise vector zz to a generated sample x𝒳x\in\mathcal{X}, for some output space 𝒳\mathcal{X}. Given some fixed noise density pn(z)p_{n}(z) (e.g. from a unit Gaussian distribution), we define pg(x)p_{g}(x) to represent the density over 𝒳\mathcal{X} induced by sampling ZpnZ\sim p_{n} and then computing X=G(Z)X=G(Z). The overarching objective of GAN training is to ensure that pg=pdatap_{g}=p_{\text{data}}, i.e. to have XX match the true data distribution.

In GAN training, we alternate between training a discriminator D:𝒳[0,1]D:\mathcal{X}\to[0,1] to distinguish between pgp_{g} and pdatap_{\text{data}}, and training the generator G(z)G(z) to produce samples that appear “real” to D(x)D(x). Specifically, D(x)D(x) is treated as a classifier that predicts the probability that xx is a sample from pdatap_{\text{data}} instead of pgp_{g}. D(x)D(x) is trained to minimise the cross-entropy loss

L_{D}=-\mathbb{E}_{Z\sim p_{n}}\left[\log\left(1-D(G(Z))\right)\right]-\mathbb{E}_{X\sim p_{\text{data}}}\left[\log D(X)\right]\,. (103)

In tandem, GG is trained to maximise LDL_{D}—that is, we want to solve maxGminDLD\max_{G}\min_{D}L_{D}. If DD can be an arbitrary function, then minDLD\min_{D}L_{D} is attained at [11, Proposition 1]:

D_{G}^{*}(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{g}(x)}\,. (104)

It is known that G\in\operatorname*{arg\,max}_{G}\min_{D}L_{D} if and only if p_{g}=p_{\text{data}} [11, Theorem 1]. As a proof sketch, if p_{g}=p_{\text{data}} then the loss is L_{D}=\log 4. Moreover, it can be shown that \min_{D}L_{D} is equal to \log 4-2\cdot\operatorname{JSD}(p_{\text{data}}\|p_{g}), where \operatorname{JSD}(p\|q) denotes the Jensen-Shannon divergence between distributions p and q. Since the Jensen-Shannon divergence is always non-negative and becomes zero if and only if p_{\text{data}}=p_{g}, it follows that \log 4 is the optimal value of \max_{G}\min_{D}L_{D}, and that p_{g}=p_{\text{data}} is the only choice of G that attains this optimum.
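A small numeric check of these identities, using two arbitrary discrete distributions in place of p_data and p_g:

```python
# Numeric check that the optimal discriminator of Eq. 104 attains the loss
# log 4 - 2 * JSD(p_data || p_g). The two discrete distributions are
# illustrative assumptions.
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.5, 0.3])

# Optimal discriminator from Eq. 104, evaluated pointwise.
d_star = p_data / (p_data + p_g)

# Discriminator loss L_D (Eq. 103) at D = D*, written as a sum over outcomes.
loss = -np.sum(p_g * np.log(1 - d_star)) - np.sum(p_data * np.log(d_star))

# Jensen-Shannon divergence between p_data and p_g.
m = 0.5 * (p_data + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

assert np.isclose(loss, np.log(4) - 2 * jsd)      # min_D L_D = log 4 - 2 JSD
```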

4.4 Adversarial IRL: A state-centric, GAN-based approximation

GANs can be used to produce an IRL method similar to GCL, called GAN-GCL [5]. However, GAN-GCL obtains poor performance in practice [8, Table 2]. In particular, performing discrimination over entire trajectories τ\tau can complicate training.

Adversarial IRL (AIRL) [8] instead discriminates between state-action pairs. Concretely, it learns a (stationary) stochastic policy π(a|s)\pi{}(a\>|\>s) and reward function fθ(s,a)f_{\theta}(s,a). The AIRL discriminator is defined by:

D_{\theta}(s,a)=\frac{\exp f_{\theta}(s,a)}{\exp f_{\theta}(s,a)+\pi(a\mid s)}\,. (105)

The generator π\pi{} is trained with the discriminator confusion:

\hat{r}_{\theta}(s,a)\triangleq\log D_{\theta}(s,a)-\log\left(1-D_{\theta}(s,a)\right)\,. (106)

The discriminator is trained with cross-entropy loss, updating only the parameters θ\theta corresponding to the reward function fθf_{\theta}:

L(\theta)\triangleq-\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\log\left(1-D_{\theta}(S_{t},A_{t})\right)\right]-\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\log D_{\theta}(S_{t},A_{t})\right]\,. (107)

Recall we are working in the setting of infinite-horizon MDPs, since AIRL (and GCL) assume stationary policies. Yet the summation in Eq. 107 is undiscounted, and would usually diverge if T=\infty. For AIRL, we must therefore additionally assume that the transition dynamics are proper, such that the MDP will almost surely enter an absorbing state in finite time. In Eq. 107, the trajectory fragments τ consist of the finite-length component prior to entering an absorbing state, ensuring the loss is well-defined. Unfortunately, omitting the absorbing state from τ also means the discriminator—and therefore the reward function—cannot learn an absorbing-state reward.
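The following sketch assembles the AIRL discriminator (Eq. 105), the confusion reward (Eq. 106) and the discriminator loss (Eq. 107) on a single batch. The arrays f_vals and log_pi are stand-ins for outputs of the reward network and policy at sampled state–action pairs.

```python
# A minimal sketch of the AIRL discriminator, confusion reward, and loss.
import numpy as np

def airl_discriminator(f_vals, log_pi):
    """D_theta(s, a) = exp f / (exp f + pi), Eq. 105."""
    return np.exp(f_vals) / (np.exp(f_vals) + np.exp(log_pi))

def confusion_reward(f_vals, log_pi):
    """log D - log(1 - D) (Eq. 106), which simplifies to f - log pi (Eq. 108)."""
    d = airl_discriminator(f_vals, log_pi)
    return np.log(d) - np.log(1.0 - d)

def discriminator_loss(f_expert, log_pi_expert, f_policy, log_pi_policy):
    """Cross-entropy loss of Eq. 107, averaged over the two batches."""
    d_expert = airl_discriminator(f_expert, log_pi_expert)
    d_policy = airl_discriminator(f_policy, log_pi_policy)
    return -np.mean(np.log(d_expert)) - np.mean(np.log(1.0 - d_policy))

rng = np.random.default_rng(0)
f_vals, log_pi = rng.normal(size=8), -np.abs(rng.normal(size=8))
# Sanity check: the confusion reward equals f - log pi, as derived in Eq. 108.
assert np.allclose(confusion_reward(f_vals, log_pi), f_vals - log_pi)
```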

In the following sections, we summarise the key theoretical results of AIRL. To make it clear where assumptions are being used, we state intermediate lemmas with minimal assumptions. However, all theorems require deterministic dynamics, and the theorems in Section 4.4.2 additionally rely on the dynamics being “decomposable”.

4.4.1 Policy objective is entropy-regularised f_{\theta}

The reward r^θ\hat{r}_{\theta} used to train the generator π(a|s)\pi(a\>|\>s) is a sum of two widely used GAN objectives: the minimax objective log(1Dθ(s,a))-\log\left(1-D_{\theta}(s,a)\right) and the more widely used logDθ(s,a)\log D_{\theta}(s,a) objective. Combining the two objectives is reasonable since both objectives have the same fixed point. Moreover, the combined objective has a pleasing interpretation in terms of fθf_{\theta}:

\hat{r}(s,a)=f_{\theta}(s,a)-\log\pi(a\mid s)\,. (108)

Summing over entire trajectories, we obtain the entropy-regularised policy objective:

\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\hat{r}(S_{t},A_{t})\right]=\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\gamma^{t}\left(f_{\theta}(S_{t},A_{t})-\log\pi(A_{t}\mid S_{t})\right)\right]\,. (109)

It is known that the policy π\pi that maximises this objective is [29, 12]:

\pi(a\mid s)=\exp\left(Q^{\mathrm{soft}}_{f_{\theta}}(s,a)-V^{\mathrm{soft}}_{f_{\theta}}(s)\right)=\exp A^{\mathrm{soft}}_{f_{\theta}}(s,a)\,. (110)

4.4.2 f_{\theta}(s,a) recovers the optimal advantage

Recall that for a fixed generator GG, the optimal discriminator DD is

D_{G}^{*}(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{g}(x)}\,, (104)

where pdata(x)p_{\text{data}}(x) is the true data distribution (in our case, expert trajectories), and pg(x)p_{g}(x) is the distribution induced by the generator (in our case, policy π\pi{}).

Normally in GAN training, it is computationally efficient to sample from the generator distribution pgp_{g} but expensive to evaluate the density pg(x)p_{g}(x) of a given sample. Fortunately, in AIRL (like GAN-GCL), density evaluation is cheap since the generator pg(x)p_{g}(x) is defined by a stochastic policy π(a|s)\pi{}(a\>|\>s), which explicitly defines a distribution.

Suppose the generator is always pitted against the optimal discriminator in Eq. 104. It is known that the generator then maximises the loss (Eq. 107) of this optimal discriminator if and only if p_{g}(x)=p_{\text{data}}(x). In our case, this is attained when the generator \pi(a\mid s) equals the expert policy \pi_{E}(a\mid s).

The optimal discriminator for an optimal generator always outputs \tfrac{1}{2}: i.e. f^{*}_{\theta}(s,a)=\log\pi_{E}(a\mid s). Moreover, the expert policy \pi_{E} is the optimal maximum causal entropy policy for the MDP’s reward function r, so \pi_{E}(a\mid s)=\exp A^{\mathrm{soft}}_{r}(s,a). So, if the GAN converges, then at optimality f^{*}_{\theta}(s,a)=A^{\mathrm{soft}}_{r}(s,a) and \pi(a\mid s)=\pi_{E}(a\mid s).

4.4.3 Reward shaping in MCE RL

In this section, we will introduce a classical result of (hard) optimal policy equivalence under potential shaping due to Ng et al. [18], and then generalise it to the case of maximum entropy (soft) optimal policies.

Definition 4.1.

Let rr and rr^{\prime} be two reward functions. We say rr and rr^{\prime} induce the same hard optimal policy under transition dynamics 𝒯\mathcal{T} if, for all states s𝒮s\in\mathcal{S}{}:

\arg\max_{a}Q_{r,\mathcal{T}}(s,a)=\arg\max_{a}Q_{r^{\prime},\mathcal{T}}(s,a)\,. (111)
Theorem 4.1.

Let rr and rr^{\prime} be two reward functions. rr and rr^{\prime} induce the same hard optimal policy under all transition dynamics 𝒯\mathcal{T} if:

r^{\prime}(s,a,s^{\prime})=\lambda\left(r(s,a,s^{\prime})+\gamma\phi(s^{\prime})-\phi(s)\right)\,, (112)

for some λ>0\lambda>0 and potential-shaping function ϕ:𝒮\phi:\mathcal{S}\to\mathbb{R}.

Proof.

See [18, 10]. ∎

Definition 4.2.

Let rr and rr^{\prime} be two reward functions. We say they induce the same soft optimal policy under transition dynamics 𝒯\mathcal{T} if, for all states s𝒮s\in\mathcal{S}{} and actions a𝒜a\in\mathcal{A}{}:

A^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)=A^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)\,. (113)
Theorem 4.2.

Let r and r′ be two reward functions. Then r and r′ induce the same soft optimal policy under all transition dynamics \mathcal{T} if r^{\prime}(s,a,s^{\prime})=r(s,a,s^{\prime})+\gamma\phi(s^{\prime})-\phi(s) for some potential-shaping function \phi:\mathcal{S}\to\mathbb{R}.

Proof.

We have:

Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})+\gamma\phi(S^{\prime})-\phi(s)+\gamma V^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(S^{\prime})\mid s,a\right]\,. (114)

So:

Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)+\phi(s)=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})+\gamma\log\sum_{a^{\prime}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(S^{\prime},a^{\prime})+\phi(S^{\prime})\right)\mid s,a\right]\,. (115)

Thus Qr,𝒯soft(s,a)+ϕ(s)Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)+\phi(s) satisfies the soft Bellman backup for rr, so:

Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)=Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)+\phi(s)\,. (116)

It follows that the optimal advantage is invariant to shaping:

A^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)=Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)-V^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s) (117)
=Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)-\log\sum_{a^{\prime}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a^{\prime})\right) (118)
=Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)-\phi(s)-\log\sum_{a^{\prime}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a^{\prime})-\phi(s)\right) (119)
=Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)-\log\sum_{a^{\prime}\in\mathcal{A}}\exp\left(Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a^{\prime})\right) (120)
=A^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)\,. (121)

Remark 4.1.

Note that rescaling r by λ≠1 changes the soft optimal advantage function and, therefore, the soft-optimal policy. In particular, it approximately rescales the soft optimal advantage function (not exactly, since log-sum-exp is non-linear). Rescaling therefore tends to make the soft-optimal policy less stochastic (λ>1) or more stochastic (λ<1).
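Theorem 4.2 is also easy to check numerically: shaping a random reward over (s,a,s′) triples by a random potential leaves the soft advantage unchanged. The sketch below re-implements the soft value iteration of Eqs. 88–89 for this purpose; the random MDP is an illustrative assumption.

```python
# Numeric check of Theorem 4.2: r'(s,a,s') = r(s,a,s') + gamma*phi(s') - phi(s)
# leaves the soft advantage (and hence the soft-optimal policy) unchanged.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))     # T[s, a, s']
R = rng.normal(size=(nS, nA, nS))                 # r(s, a, s')
phi = rng.normal(size=nS)                         # potential function phi(s)
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

def soft_advantage(R, T, gamma, iters=2000):
    """Soft value iteration (Eqs. 88-89), returning A_soft = Q - V."""
    n_states, n_actions = R.shape[0], R.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = np.log(np.exp(Q).sum(axis=1))
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
    V = np.log(np.exp(Q).sum(axis=1))
    return Q - V[:, None]

assert np.allclose(soft_advantage(R, T, gamma),
                   soft_advantage(R_shaped, T, gamma), atol=1e-6)
```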

4.4.4 Discriminator objective

In this section, we show that minimising the loss of the discriminator corresponds to ME IRL in deterministic dynamics when fθf_{\theta} is already an advantage for some reward function.

Theorem 4.3.

Consider an undiscounted, deterministic MDP. Suppose fθf_{\theta} and πθ\pi_{\theta} are the soft-optimal advantage function and policy for reward function rθr_{\theta}. Then minimising the cross-entropy loss of the discriminator under generator πθ\pi_{\theta} is equivalent to maximising the log-likelihood of observations under the Maximum Entropy (ME) IRL model.

Specifically, recall that the gradient of ME IRL (with γ=1\gamma=1) log likelihood is

\nabla_{\theta}\mathcal{L}(\mathcal{D};\theta)=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]-\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}r_{\theta}(S_{t},A_{t})\right]\,.

We will show that the gradient of the discriminator objective is

-2\nabla_{\theta}L(\theta)=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}f_{\theta}(S_{t},A_{t})\right]-\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}f_{\theta}(S_{t},A_{t})\right]\,. (122)
Proof.

Note that L(θ)L(\theta) is the discriminator loss, and so we wish to maximise the discriminator objective L(θ)-L(\theta). This has the gradient:

-\nabla_{\theta}L(\theta)=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}\log D_{\theta}(S_{t},A_{t})\right]+\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1}\nabla_{\theta}\log\left(1-D_{\theta}(S_{t},A_{t})\right)\right]\,. (123)

Observe that:

\log D_{\theta}(s_{t},a_{t})=f_{\theta}(s_{t},a_{t})-\log\left(\exp f_{\theta}(s_{t},a_{t})+\pi(a_{t}\mid s_{t})\right)\,. (124)

Thus:

\nabla_{\theta}\log D_{\theta}(s_{t},a_{t})=\nabla_{\theta}f_{\theta}(s_{t},a_{t})-\frac{\exp\left(f_{\theta}(s_{t},a_{t})\right)\nabla_{\theta}f_{\theta}(s_{t},a_{t})}{\exp f_{\theta}(s_{t},a_{t})+\pi(a_{t}\mid s_{t})}\,. (125)

Similarly:

\log\left(1-D_{\theta}(s_{t},a_{t})\right)=\log\pi(a_{t}\mid s_{t})-\log\left(\exp f_{\theta}(s_{t},a_{t})+\pi(a_{t}\mid s_{t})\right)\,. (126)

So:

\nabla_{\theta}\log\left(1-D_{\theta}(s_{t},a_{t})\right)=-\frac{\exp\left(f_{\theta}(s_{t},a_{t})\right)\nabla_{\theta}f_{\theta}(s_{t},a_{t})}{\exp f_{\theta}(s_{t},a_{t})+\pi(a_{t}\mid s_{t})}\,. (127)

Recall we train the policy π(at|st)\pi{}(a_{t}\>|\>s_{t}) to maximise Eq. 109. The optimal maximum entropy policy for a given fθf_{\theta} is πfθ(at|st)=expAfθsoft(st,at)\pi^{*}_{f_{\theta}}(a_{t}\>|\>s_{t})=\exp A^{\mathrm{soft}}_{f_{\theta}}(s_{t},a_{t}).

By assumption, fθf_{\theta} is the advantage for reward function rθr_{\theta}, so:

f_{\theta}(s,a)=A^{\mathrm{soft}}_{r_{\theta}}(s,a) (128)
=Q^{\mathrm{soft}}_{r_{\theta}}(s,a)-V^{\mathrm{soft}}_{r_{\theta}}(s) (129)
=\sum_{s^{\prime}\in\mathcal{S}}\mathcal{T}(s,a,s^{\prime})\left(r_{\theta}(s,a,s^{\prime})+\gamma V^{\mathrm{soft}}_{r_{\theta}}(s^{\prime})\right)-V^{\mathrm{soft}}_{r_{\theta}}(s) (130)
=r_{\theta}\left(s,a,\mathcal{T}(s,a)\right)+\gamma V^{\mathrm{soft}}_{r_{\theta}}\left(\mathcal{T}(s,a)\right)-V^{\mathrm{soft}}_{r_{\theta}}(s)\,, (131)

where we write s=𝒯(s,a)s^{\prime}=\mathcal{T}(s,a) for the deterministic next-state. Restricting ourselves only to feasible transitions (s,a,s)(s,a,s^{\prime}), we can alternatively write:

f_{\theta}(s,a,s^{\prime})=r_{\theta}(s,a,s^{\prime})+\gamma V^{\mathrm{soft}}_{r_{\theta}}(s^{\prime})-V^{\mathrm{soft}}_{r_{\theta}}(s)\,. (132)

That is, f_{\theta} is r_{\theta} shaped by the potential function V^{\mathrm{soft}}_{r_{\theta}}(s). Applying Theorem 4.2, it follows that:

A^{\mathrm{soft}}_{f_{\theta}}(s,a)=A^{\mathrm{soft}}_{r_{\theta}}(s,a)\,, (133)

but by assumption f_{\theta}(s,a)=A^{\mathrm{soft}}_{r_{\theta}}(s,a), so we have that f_{\theta}(s,a) is a fixed point of the soft advantage operator:

A^{\mathrm{soft}}_{f_{\theta}}(s,a)=f_{\theta}(s,a)\,. (134)

Thus the optimal policy is πfθ(at|st)=expfθ(st,at)=πrθ(at|st)\pi^{*}_{f_{\theta}}(a_{t}\>|\>s_{t})=\exp f_{\theta}(s_{t},a_{t})=\pi^{*}_{r_{\theta}}(a_{t}\>|\>s_{t}). Substituting this expression into Eq. 125 gives

\nabla_{\theta}\log D_{\theta}(s_{t},a_{t})=\frac{1}{2}\nabla_{\theta}f_{\theta}(s_{t},a_{t})\,, (135)

and into Eq. 127 gives

\nabla_{\theta}\log\left(1-D_{\theta}(s_{t},a_{t})\right)=-\frac{1}{2}\nabla_{\theta}f_{\theta}(s_{t},a_{t})\,. (136)

So:

-2\nabla_{\theta}L(\theta)=\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}f_{\theta}(S_{t},A_{t})\right]-\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}f_{\theta}(S_{t},A_{t})\right]\,. (137)

Remark 4.2.

Section 4.4.2 showed the globally optimal fθf_{\theta} is the optimal soft advantage function. However, there is no guarantee that fθf_{\theta} is ever a soft advantage function during training. So this theorem does not demonstrate convergence, but does provide intuition for why AIRL often works well in practice.

4.4.5 Recovering rewards

In Section 4.4.3, we saw that if a reward function r′ is a potential-shaped version of r, then r′ induces the same soft Q-values as r up to a state-only function. In the case that both reward functions are state-only, i.e. r(s) and r′(s), potential shaping (when γ<1) reduces to the special case r′(s)=r(s)+k for some constant k. Perhaps surprisingly, AIRL can determine state-only rewards up to a constant, provided the (deterministic) transition dynamics \mathcal{T} satisfy a strong requirement known as the decomposability condition.

Definition 4.3 (Decomposability Condition).

We define two states u,v\in\mathcal{S} as being 1-step linked under a transition distribution \mathcal{T}(s^{\prime}\mid s,a) if there exists a state s\in\mathcal{S} and actions a,b\in\mathcal{A} such that u and v are both successor states of s: i.e. \mathcal{T}(u\mid s,a)>0 and \mathcal{T}(v\mid s,b)>0.

We define two states u,v\in\mathcal{S} as being (n+1)-step linked if they are n-step linked, or if there is an intermediate state s\in\mathcal{S} such that u is n-step linked to s and s is 1-step linked to v.

We define two states u,v𝒮u,v\in\mathcal{S} as being linked if there is some nn\in\mathbb{N} for which they are nn-step linked.

A transition distribution 𝒯\mathcal{T} is decomposable if all pairs of states in the MDP are linked.

The decomposability condition can be counterintuitive, so we consider some examples before using the definition further. Note that although the decomposability condition is stated in terms of transition probabilities, later applications of this condition will assume deterministic dynamics, where probabilities are either 0 or 1.

A simple MDP that does not satisfy the condition is a two-state cyclic MDP, where it is only possible to transition from state A to B and vice-versa. There is no state that can reach both A and B, so they are not 1-step linked. They are therefore also not n-step linked for any n, since there are no possible intermediary states. However, the MDP would be decomposable if the dynamics were extended to allow self-transitions (A→A and B→B).

A similar pattern holds in gridworlds. Imagine a checkerboard pattern on the grid. If all actions move to an adjacent cell (left, right, up or down), then all the successors of white cells are black, and vice-versa. Consequently, cells are only ever 1-step linked to cells of the same colour. Taking the transitive closure, all cells of the same colour are linked together, but never to cells of a different colour. However, if one adds a ‘stay’ action to the gridworld, then all cells are linked.

We have not been able to determine whether the decomposability condition is satisfied in standard RL benchmarks, such as MuJoCo tasks or Atari games.
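For small tabular MDPs, however, the condition can be checked directly from the transition tensor. The sketch below builds the 1-step linked relation and takes its transitive closure; the two-state examples mirror the discussion above.

```python
# A sketch of checking the decomposability condition for a tabular MDP with
# dynamics tensor T[s, a, s']. Two states are 1-step linked if they share a
# possible predecessor; the MDP is decomposable if the transitive closure of
# that relation connects every pair of states.
import numpy as np

def is_decomposable(T):
    n_states = T.shape[0]
    reachable = (T > 0).any(axis=1)              # reachable[s, s'] under any action
    linked = np.zeros((n_states, n_states), dtype=bool)
    for s in range(n_states):
        succ = np.flatnonzero(reachable[s])
        linked[np.ix_(succ, succ)] = True        # successors of s are 1-step linked
    for _ in range(n_states):                    # transitive closure
        linked = linked | ((linked.astype(int) @ linked.astype(int)) > 0)
    return bool(linked.all())

# Two-state cycle A <-> B: not decomposable, as argued above.
cycle = np.zeros((2, 1, 2))
cycle[0, 0, 1] = cycle[1, 0, 0] = 1.0
# Adding a 'stay' action gives every state a shared predecessor with the other.
with_stay = np.zeros((2, 2, 2))
with_stay[:, 0] = cycle[:, 0]
with_stay[0, 1, 0] = with_stay[1, 1, 1] = 1.0
print(is_decomposable(cycle), is_decomposable(with_stay))   # False True
```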

We will now show a key application of the decomposability condition: that equality of soft optimal policies implies equality of state-only rewards up to a constant.

Theorem 4.4.

Let 𝒯\mathcal{T} be a deterministic dynamics model satisfying the decomposability condition, and let γ>0\gamma>0. Let r(s)r(s) and r(s)r^{\prime}(s) be two reward functions producing the same MCE policy in 𝒯\mathcal{T}. That is, for all states s𝒮s\in\mathcal{S} and actions a𝒜a\in\mathcal{A}:

A^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)=A^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)\,. (138)

Then r(s)=r(s)+kr^{\prime}(s)=r(s)+k, for some constant kk.

Proof.

We start by considering the general case of a stochastic dynamics model 𝒯\mathcal{T} and reward functions over (s,a,s)(s,a,s^{\prime}) triples. We introduce the simplifying assumptions only when necessary, to highlight why we make these assumptions.

Substituting the definition for AsoftA^{\mathrm{soft}} in Eq. 138:

Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)-V^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s)=Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)-V^{\mathrm{soft}}_{r,\mathcal{T}}(s)\,. (139)

So:

Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)=Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)-f(s)\,, (140)

where f(s)=V^{\mathrm{soft}}_{r,\mathcal{T}}(s)-V^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s). Now:

Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)=Q^{\mathrm{soft}}_{r,\mathcal{T}}(s,a)-f(s)
=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})-f(s)+\gamma\log\sum_{a^{\prime}}\exp Q^{\mathrm{soft}}_{r,\mathcal{T}}(S^{\prime},a^{\prime})\mid s,a\right] (141)
=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})-f(s)+\gamma f(S^{\prime})+\gamma\log\sum_{a^{\prime}}\exp\left(Q^{\mathrm{soft}}_{r,\mathcal{T}}(S^{\prime},a^{\prime})-f(S^{\prime})\right)\mid s,a\right] (142)
=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})-f(s)+\gamma f(S^{\prime})+\gamma\log\sum_{a^{\prime}}\exp Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(S^{\prime},a^{\prime})\mid s,a\right]\,. (143)

Contrast this with the Bellman backup on rr^{\prime}:

Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a)=\mathbb{E}_{\mathcal{T}}\left[r^{\prime}(s,a,S^{\prime})+\gamma\log\sum_{a^{\prime}}\exp Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(S^{\prime},a^{\prime})\mid s,a\right]\,. (144)

So, equating these two expressions for Qr,𝒯soft(s,a)Q^{\mathrm{soft}}_{r^{\prime},\mathcal{T}}(s,a):

\mathbb{E}_{\mathcal{T}}\left[r^{\prime}(s,a,S^{\prime})\mid s,a\right]=\mathbb{E}_{\mathcal{T}}\left[r(s,a,S^{\prime})-f(s)+\gamma f(S^{\prime})\mid s,a\right]\,. (145)

In the special case of deterministic dynamics, then if s=𝒯(s,a)s^{\prime}=\mathcal{T}(s,a), we have:

r^{\prime}(s,a,s^{\prime})=r(s,a,s^{\prime})-f(s)+\gamma f(s^{\prime})\,. (146)

This looks like rr^{\prime} being a potential-shaped version of rr, but note this equality may not hold for transitions that are not feasible under this dynamics model 𝒯\mathcal{T}.

If we now constrain rr^{\prime} and rr to be state-only, we get:

r^{\prime}(s)-r(s)+f(s)=\gamma f(s^{\prime})\,. (147)

In particular, since γ≠0, this implies that for a given state s, all possible successor states s′ (reached via different actions) must have the same value f(s′). In other words, all 1-step linked states have the same f value. Moreover, as 2-step linked states are linked via a 1-step linked state and equality is transitive, they must also have the same f value. By induction, all linked states must have the same f value. Since by assumption \mathcal{T} is decomposable, f(s)=c for some constant c, and so:

r^{\prime}(s)=r(s)+(\gamma-1)c=r(s)+k\,. (148)

Now, we will see how the decomposability condition allows us to make inferences about equality between state-only functions. This will then allow us to prove that AIRL can recover state-only reward functions.

Lemma 4.1.

Suppose the transition distribution \mathcal{T} is decomposable. Let a(s), b(s), c(s), d(s) be functions of the state. Suppose that for all states s\in\mathcal{S}, actions a\in\mathcal{A} and successor states s^{\prime}\in\mathcal{S} for which \mathcal{T}(s^{\prime}\mid s,a)>0, we have

a(s)-c(s)=b(s^{\prime})-d(s^{\prime})\,. (149)

Then for all s𝒮s\in\mathcal{S},

a(s)=c(s)+k (150)
b(s)=d(s)+k\,, (151)

where kk\in\mathbb{R} is a constant.

Proof.

Let f(s)=a(s)-c(s) and g(s^{\prime})=b(s^{\prime})-d(s^{\prime}).

Base case: Since f(s)=g(s^{\prime}) for any successor state s^{\prime} of s, it must be that g(s^{\prime}) takes the same value for all successor states s^{\prime} of s. This shows that all 1-step linked states have the same g value.

Inductive case: Moreover, this extends by transitivity. Suppose that all nn-step linked states ss^{\prime} have the same g(s)g(s^{\prime}). Let uu and vv be (n+1)(n+1)-step linked. So uu is nn-step linked to some intermediate s𝒮s\in\mathcal{S} that is 11-step linked to vv. But then g(u)=g(s)=g(v)g(u)=g(s)=g(v). So in fact all (n+1)(n+1)-step linked states ss^{\prime} have the same g(s)g(s^{\prime}).

By induction, it follows that all linked states ss^{\prime} have the same g(s)g(s^{\prime}). In a decomposable MDP, all states are linked, so g(s)g(s) is equal to some constant kk\in\mathbb{R} for all s𝒮s\in\mathcal{S}.666Note that in a decomposable MDP all states are the successor of at least one state. Moreover, in any (infinite-horizon) MDP any state s𝒮s\in\mathcal{S} must have at least one successor state s𝒮s^{\prime}\in\mathcal{S} (possibly itself). By assumption, f(s)=g(s)f(s)=g(s^{\prime}), so f(s)=kf(s)=k for all s𝒮s\in\mathcal{S}. ∎

Finally, we can use the preceding result to show when AIRL can recover a state-only reward (up to a constant). Note that even if the ground-truth reward r(s)r(s) is state-only, the reward network ff in AIRL must be a function of states and actions (or next states) in order to be able to represent the global minimum: the soft advantage function Asoft(s,a)A^{\mathrm{soft}}(s,a). The key idea in the following theorem is to decompose ff into a state-only reward gθ(s)g_{\theta}(s) and a potential shaping term. This gives ff capacity to represent Asoft(s,a)A^{\mathrm{soft}}(s,a), while the structure ensures gθ(s)g_{\theta}(s) equals the ground-truth reward up to a constant.

Theorem 4.5.

Suppose the reward network is parameterized by

f_{\theta,\phi}(s,a,s^{\prime})=g_{\theta}(s)+\gamma h_{\phi}(s^{\prime})-h_{\phi}(s)\,. (152)

Suppose the ground-truth reward is state-only, r(s)r(s). Suppose moreover that the MDP has deterministic dynamics 𝒯\mathcal{T} satisfying the decomposability condition, and that γ>0\gamma>0. Then if fθ,ϕf_{\theta^{*},\phi^{*}} is the reward network for a global optimum (Dθ,ϕ,π)(D_{\theta^{*},\phi^{*}},\pi) of the AIRL problem, that is fθ,ϕ(s,a,s)=f(s,a,s)f_{\theta^{*},\phi^{*}}(s,a,s^{\prime})=f^{*}(s,a,s^{\prime}), we have:

g_{\theta^{*}}(s)=r(s)+k_{1},\quad h_{\phi^{*}}(s)=V^{\mathrm{soft}}_{r}(s)+k_{2}\,, (153)

where k1,k2k_{1},k_{2}\in\mathbb{R} are constants and s𝒮s\in\mathcal{S} is a state.

Proof.

We know the global minimum f(s,a,s)=Asoft(s,a)f^{*}(s,a,s^{\prime})=A^{\mathrm{soft}}(s,a), from Section 4.4.2. Now:

A^{\mathrm{soft}}(s,a)=Q^{\mathrm{soft}}_{r}(s,a)-V^{\mathrm{soft}}_{r}(s) (154)
=r(s)+\gamma V^{\mathrm{soft}}_{r}(s^{\prime})-V^{\mathrm{soft}}_{r}(s)\,. (155)

So for all states s𝒮s\in\mathcal{S}, actions a𝒜a\in\mathcal{A} and resulting deterministic successor states s=𝒯(s,a)s^{\prime}=\mathcal{T}(s,a) we have:

g_{\theta^{*}}(s)+\gamma h_{\phi^{*}}(s^{\prime})-h_{\phi^{*}}(s)=r(s)+\gamma V^{\mathrm{soft}}_{r}(s^{\prime})-V^{\mathrm{soft}}_{r}(s) (156)
\iff\left(g_{\theta^{*}}(s)-h_{\phi^{*}}(s)\right)-\left(r(s)-V^{\mathrm{soft}}_{r}(s)\right)=\gamma V^{\mathrm{soft}}_{r}(s^{\prime})-\gamma h_{\phi^{*}}(s^{\prime})\,. (157)

Now applying Lemma 4.1 with a(s)=g_{\theta^{*}}(s)-h_{\phi^{*}}(s), c(s)=r(s)-V^{\mathrm{soft}}_{r}(s), b(s^{\prime})=\gamma V^{\mathrm{soft}}_{r}(s^{\prime}) and d(s^{\prime})=\gamma h_{\phi^{*}}(s^{\prime}) gives:

g_{\theta^{*}}(s)-h_{\phi^{*}}(s)=a(s)=c(s)+k_{0}=r(s)-V^{\mathrm{soft}}_{r}(s)+k_{0} (158)
\gamma h_{\phi^{*}}(s^{\prime})=d(s^{\prime})=b(s^{\prime})-k_{0}=\gamma V^{\mathrm{soft}}_{r}(s^{\prime})-k_{0}\,, (159)

where k0k_{0}\in\mathbb{R} is a constant. Rearranging and using γ0\gamma\neq 0 we have:

g_{\theta^{*}}(s)=r(s)+h_{\phi^{*}}(s)-V^{\mathrm{soft}}_{r}(s)+k_{0} (160)
h_{\phi^{*}}(s^{\prime})=V^{\mathrm{soft}}_{r}(s^{\prime})-\frac{k_{0}}{\gamma}=V^{\mathrm{soft}}_{r}(s^{\prime})+k_{2}\,. (161)

By the decomposability condition, all states s𝒮s\in\mathcal{S} are the successor of some state (possibly themselves). So we can apply Eq. 161 to all s𝒮s\in\mathcal{S}:

h_{\phi^{*}}(s)=V^{\mathrm{soft}}_{r}(s)+k_{2}\,. (162)

Finally, applying Eq. 162 to Eq. 160 yields:

g_{\theta^{*}}(s)=r(s)+(k_{0}+k_{2})=r(s)+k_{1}\,, (163)

as required. ∎

Remark 4.3.

Note this theorem makes several strong assumptions. In particular, it requires that ff attains the global minimum, but AIRL is not in general guaranteed to converge to a global optimum. Additionally, many environments have stochastic dynamics or are not 1-step linked.

Note that with stochastic dynamics there may not exist any function f(s,s^{\prime}) that is always equal to A^{\mathrm{soft}}(s,a). This is because there may exist actions a,a^{\prime} such that \mathcal{T}(\cdot\mid s,a) and \mathcal{T}(\cdot\mid s,a^{\prime}) differ but both have support for s^{\prime}, while A^{\mathrm{soft}}(s,a)\neq A^{\mathrm{soft}}(s,a^{\prime}).
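To make the parameterisation in Eq. 152 concrete, here is a minimal tabular sketch of the g/h decomposition; the arrays g and h are stand-ins for the networks g_θ and h_φ, which in a deep implementation would be function approximators.

```python
# A minimal tabular sketch of the g/h reward decomposition in Eq. 152.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
g = rng.normal(size=n_states)      # candidate state-only reward, g_theta(s)
h = rng.normal(size=n_states)      # candidate potential / value term, h_phi(s)

def f(s, s_next):
    """f_{theta,phi}(s, a, s') = g(s) + gamma * h(s') - h(s)."""
    return g[s] + gamma * h[s_next] - h[s]

# By construction, f differs from g only by potential shaping with potential h,
# so (by Theorem 4.2) f and g share the same soft-optimal policy. Theorem 4.5
# says that, at a global optimum and under its assumptions, g recovers the
# ground-truth state-only reward and h recovers V_soft, each up to a constant.
print(f(0, 1))
```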

5 Conclusion

We have described three Inverse Reinforcement Learning (IRL) algorithms: the tabular method Maximum Causal Entropy (MCE) IRL; and deep-learning based, dynamics-free algorithms Guided Cost Learning (GCL) and Adversarial IRL (AIRL). We have shown that MCE IRL can be derived from maximising entropy under a feature expectation matching constraint. Furthermore, we have shown how this is equivalent to maximising the likelihood of the data. Finally, we have explained how GCL and AIRL can both be viewed as extensions of this maximum likelihood solution to settings with unknown dynamics and potentially continuous state and action spaces.

While contemporary methods such as GCL and AIRL have many points in common with MCE IRL, the connection is far from exact. For example, the discriminator objective of AIRL only aligns with that of MCE IRL in undiscounted MDPs, yet it is routinely applied to discounted MDPs. One promising direction for future work would be to derive a dynamics-free algorithm using function approximation directly from the original MCE IRL approach. We consider it probable that such an algorithm would provide more stable performance than existing heuristic approximations to MCE IRL.

Acknowledgements

We would like to thank Alyssa Li Dayan, Michael Dennis, Yawen Duan, Daniel Filan, Erik Jenner, Niklas Lauffer and Cody Wild for feedback on earlier versions of this manuscript.

References

  • Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
  • Amin et al. [2017] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In NIPS, 2017.
  • Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Cao et al. [2021] Haoyang Cao, Samuel N. Cohen, and Lukasz Szpruch. Identifiability in inverse reinforcement learning. arXiv: 2106.03498v2 [cs.LG], 2021.
  • Finn et al. [2016a] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv: 1611.03852v3 [cs.LG], 2016a.
  • Finn et al. [2016b] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016b.
  • Fox et al. [2016] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In UAI, 2016.
  • Fu et al. [2018] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.
  • Gleave et al. [2020] Adam Gleave, Pedro Freire, Steven Wang, and Sam Toyer. seals: Suite of environments for algorithms that learn specifications. https://github.com/HumanCompatibleAI/seals, 2020.
  • Gleave et al. [2021] Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, and Jan Leike. Quantifying differences in reward functions. In ICLR, 2021.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • Haarnoja et al. [2017] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In ICML, 2017.
  • Jeon et al. [2020] Hong Jun Jeon, Smitha Milli, and Anca Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. In NeurIPS, 2020.
  • Karpathy [2020] Andrej Karpathy. Keynote talk. CVPR Workshop on Scalability in Autonomous Driving, 2020.
  • Kim et al. [2021] Kuno Kim, Shivam Garg, Kirankumar Shiragur, and Stefano Ermon. Reward identification in inverse reinforcement learning. In ICML, 2021.
  • Kostrikov et al. [2019] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In ICLR, 2019.
  • Ng and Russell [2000] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.
  • Ng et al. [1999] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
  • Ramachandran and Amir [2007] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, 2007.
  • Sadigh et al. [2017] Dorsa Sadigh, Anca D. Dragan, S. Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In RSS, 2017.
  • Shah et al. [2019] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. Preferences implicit in the state of the world. In ICLR, 2019.
  • Skalse et al. [2022] Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. arXiv:2203.07475v1 [cs.LG], 2022.
  • Sutton and Barto [2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Tesla [2016] Tesla. Upgrading Autopilot: Seeing the world in radar. https://www.tesla.com/blog/upgrading-autopilot-seeing-world-radar, 2016. Accessed: 2020-07-27.
  • Topsøe [1979] Flemming Topsøe. Information-theoretical optimization techniques. Kybernetika, 1979.
  • Tucker et al. [2018] Aaron Tucker, Adam Gleave, and Stuart Russell. Inverse reinforcement learning for video games. arXiv: 1810.10593v1 [cs.LG], 2018.
  • Vecerik et al. [2019] Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon Scholz. A practical approach to insertion with variable socket position using deep reinforcement learning. In ICRA, 2019.
  • Wulfmeier et al. [2015] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv: 1507.04888v3 [cs.LG], 2015.
  • Ziebart [2010] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, CMU, 2010.
  • Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
  • Ziebart et al. [2010] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.