
Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Ting-Han Fan and Peter J. Ramadge
Department of Electrical and Computer Engineering, Princeton University
{tinghanf,ramadge}@princeton.edu
Abstract

Off-policy actor-critic algorithms have demonstrated phenomenal experimental performance but still require better explanations. To this end, we show that their policy evaluation error on the distribution of transitions decomposes into a Bellman error, a bias from policy mismatch, and a variance term from sampling. By comparing the magnitudes of the bias and variance, we explain the success of the Emphasizing Recent Experience sampling and 1/age weighted sampling. Both sampling strategies yield smaller bias and variance and are hence preferable to uniform sampling.

1 Introduction

A practical reinforcement learning (RL) algorithm is often cast in an actor-critic setting (Lin, 1992; Precup et al., 2000) where the policy (actor) generates actions and the Q/value function (critic) evaluates the policy's performance. Under this setting, off-policy RL uses transitions sampled from a replay buffer to perform Q function updates, yielding a new policy $\pi$. Then, a finite-length trajectory under $\pi$ is added to the buffer, and the process repeats. Notice that sampling from a replay buffer is an offline operation and that the growth of the replay buffer is an online operation. This implies that off-policy actor-critic RL lies between offline RL (Yu et al., 2020; Levine et al., 2020) and on-policy RL (Schulman et al., 2015, 2017). From a bias-variance perspective, offline RL experiences large policy mismatch bias but low sampling variance, while on-policy RL has low bias but high variance. Hence, with a careful choice of the sampling from its replay buffer, off-policy actor-critic RL may achieve a better bias-variance trade-off. This is the direction we explore in this work.

To reduce policy mismatch bias, off-policy RL employs importance sampling, with the weight given by the probability ratio of the current policy to the behavior policy (the policy that samples the trajectories) (Precup et al., 2000; Xie et al., 2019; Schmitt et al., 2020). However, because the behavior policy is usually not given in practice, one can either estimate the probability ratio from data (Lazic et al., 2020; Yang et al., 2020; Sinha et al., 2020) or use other reasonable quantities, such as the Bellman error (Schaul et al., 2016), as the sampling weight. Even with naive uniform sampling from the replay buffer, some off-policy actor-critic algorithms achieve nontrivial performance (Haarnoja et al., 2018; Fujimoto et al., 2018). These observations suggest we need to better understand the success of off-policy actor-critic algorithms, especially in practical situations where a fixed behavior policy is unavailable.

Our contributions are as follows. To understand the actor-critic setting without a fixed behavior policy, we construct a non-stationary policy that generates the averaged occupancy measure. We use this policy as a reference and show that the policy evaluation error in an off-policy actor-critic setting decomposes into the Bellman error, the policy mismatch bias, and the variance from sampling. Since supervised learning during the Q function update only controls the Bellman error, we need careful sampling strategies to mitigate the bias and variance. We show that 1/age weighting and its variants, such as the Emphasizing Recent Experience (ERE) strategy (Wang and Ross, 2019), are preferable because their biases and variances are smaller than those of uniform weighting.

To ensure the applicability of our explanation to practical off-policy actor-critic algorithms, we adopt weak but verifiable assumptions such as Lipschitz Q functions, bounded rewards, and a bounded action space. We avoid strong assumptions such as a fixed well-explored behavior policy, concentration coefficients, bounded probability ratios (e.g., ratios of current to behavior policy), and tabular or linear function approximation. Hence our analysis is more applicable to practical settings. In addition, we explain the success of the ERE strategy (Wang and Ross, 2019) from a bias-variance perspective and show that ERE is essentially equivalent to 1/age weighting. Our simulations also support this conclusion. Our results thus provide theoretical foundations for practical off-policy actor-critic RL algorithms and allow us to reduce the algorithm to a simpler form.

2 Preliminaries

2.1 Reinforcement learning

Consider an infinite-horizon Markov Decision Process (MDP) $\langle\mathcal{S},\mathcal{A},T,r,\gamma\rangle$, where $\mathcal{S},\mathcal{A}$ are finite-dimensional continuous state and action spaces, $r(s,a)$ is a deterministic reward function, $\gamma\in(0,1)$ is the discount factor, and $T(s'|s,a)$ is the state transition density, i.e., the density of the next state $s'$ given the current state and action $(s,a)$. Given an initial state distribution $\rho_0$, the objective of RL is to find a policy $\pi$ that maximizes the $\gamma$-discounted cumulative reward when the actions along the trajectory follow $\pi$:

$$\max_{\pi}J(\pi)=\max_{\pi}\mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^{i}r(s_{i},a_{i})\,\Big|\,s_{0}\sim\rho_{0},~a_{i}\sim\pi(\cdot|s_{i}),~s_{i+1}\sim T(\cdot|s_{i},a_{i})\right]. \quad (1)$$

Let $\rho_{i}(s|\rho_{0},\pi,T)$ be the state distribution under $\pi$ at trajectory step $i$. Define the normalized state occupancy measure by

$$\rho^{\pi}_{\rho_{0}}(s)\triangleq(1-\gamma)\sum_{i=0}^{\infty}\gamma^{i}\rho_{i}(s|\rho_{0},\pi,T). \quad (2)$$
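To make Eq. (2) concrete, the following minimal sketch (our own illustration, not part of the paper's algorithm) estimates the normalized occupancy measure of a toy discrete MDP by Monte-Carlo, weighting a state visited at step $i$ by $(1-\gamma)\gamma^{i}$; the random dynamics, horizon, and sample sizes are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, horizon, n_traj = 5, 0.9, 100, 2000

# Placeholder dynamics: a random row-stochastic transition matrix under a fixed policy.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
rho0 = np.full(n_states, 1.0 / n_states)       # uniform initial distribution

occupancy = np.zeros(n_states)
for _ in range(n_traj):
    s = rng.choice(n_states, p=rho0)
    for i in range(horizon):                   # truncate the infinite sum at `horizon`
        occupancy[s] += (1 - gamma) * gamma ** i
        s = rng.choice(n_states, p=P[s])
occupancy /= n_traj                            # Monte-Carlo estimate of rho^pi_{rho0}

print(occupancy, occupancy.sum())              # sums to 1 - gamma**horizon, i.e., approximately 1
```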

Ideally, the maximization in (1) is achieved using a parameterized policy $\pi_{\theta}$ and policy gradient updates (Sutton et al., 1999; Silver et al., 2014):

$$\nabla_{\theta}J(\pi_{\theta})=(1-\gamma)^{-1}\,\mathbb{E}_{(s,a)\sim\rho^{\pi_{\theta}}_{\rho_{0}}}\big[(\nabla_{\theta}\log\pi_{\theta}(a|s))\,Q^{\pi_{\theta}}(s,a)\big]. \quad (3)$$

Here $\rho^{\pi_{\theta}}_{\rho_{0}}(s,a)=\rho^{\pi_{\theta}}_{\rho_{0}}(s)\,\pi_{\theta}(a|s)$ with $\rho^{\pi_{\theta}}_{\rho_{0}}(s)$ given in (2), and $Q^{\pi_{\theta}}$ is the Q function under policy $\pi_{\theta}$:

$$Q^{\pi}(s,a)=\mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^{i}r(s_{i},a_{i})\,\Big|\,s_{0}=s,~a_{0}=a,~a_{i}\sim\pi(\cdot|s_{i}),~s_{i+1}\sim T(\cdot|s_{i},a_{i})\right]. \quad (4)$$

Off-policy RL estimates $Q^{\pi}$ by approximating the solution of the Bellman fixed-point equation:

$$(\mathcal{B}^{\pi}Q^{\pi})(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a),~a^{\prime}\sim\pi(\cdot|s^{\prime})}\,Q^{\pi}(s^{\prime},a^{\prime})=Q^{\pi}(s,a).$$

It is well known that $Q^{\pi}$ is the unique fixed point of the Bellman operator $\mathcal{B}^{\pi}$. Hence if $(\mathcal{B}^{\pi}\hat{Q})(s,a)\approx\hat{Q}(s,a)$, then $\hat{Q}$ may be a “good” estimate of $Q^{\pi}$. In the next subsection, we will see that an off-policy actor-critic algorithm encourages this to hold on the replay buffer.

2.2 Off-policy actor-critic algorithm

We study an off-policy actor-critic algorithm of the form shown in Alg. 1.

Algorithm 1 Off-policy Actor-critic Algorithm
0:  Parameterized policy $\pi^{e}=\pi_{\theta_{e}}$. Learning rate $\alpha$.
1:  for episode $e=1,2,\dots$ do
2:     Sample a length-$L$ trajectory $\{(s_{i}^{e},a_{i}^{e})\}_{i=0}^{L-1}$ under initial distribution $\rho_{0}(s)$, policy $\pi^{e}(a|s)$, and state transition $T(\cdot|s,a)$.
3:     Train a “learned” Q function: $\hat{Q}=\arg\min_{Q}\sum_{j=1}^{e}\sum_{i=0}^{L-1}\big(Q(s_{i}^{j},a_{i}^{j})-(\mathcal{B}^{\pi^{e}}Q)(s_{i}^{j},a_{i}^{j})\big)^{2}$.
4:     Approximate Eq. (3) as $\nabla_{\theta}\hat{J}(\pi_{\theta})=(1-\gamma)^{-1}\,\mathbb{E}_{(s,a)\sim\rho^{\pi_{\theta}}_{\rho_{0}}}[(\nabla_{\theta}\log\pi_{\theta}(a|s))\hat{Q}(s,a)]$ and update $\theta_{e+1}=\theta_{e}+\alpha\nabla_{\theta}\hat{J}(\pi_{\theta})$.
5:  end for

In line 2, for episode index $e$, $\pi^{e}$ samples one trajectory of length $L$. The transitions in this trajectory are then added to the replay buffer. Since the policies for different episodes are distinct, the collection of transitions in the replay buffer is generally inconsistent with the current policy.

Line 3 is a supervised learning step that minimizes the Bellman error of $\hat{Q}$, making $(\mathcal{B}^{\pi^{e}}\hat{Q})(s,a)\approx\hat{Q}(s,a)$ for $(s,a)$ in the replay buffer. When this holds, we will prove that the “distance” between $\hat{Q}$ and $Q^{\pi^{e}}$ becomes smaller over the replay buffer. This is a crucial step for line 4 to be truly useful. In line 4, $Q^{\pi^{e}}$ is replaced by $\hat{Q}$, and the policy is updated accordingly.

In practice, line 3 is replaced by gradient-descent-based mini-batch updates (Fujimoto et al., 2018; Haarnoja et al., 2018). For example, $\phi\leftarrow\phi-\alpha\nabla_{\phi}J(\hat{Q}_{\phi})$, with $\phi$ and $J(\hat{Q}_{\phi})$ being the parameters and the Bellman error of $\hat{Q}$, respectively. In Section 5.2, we show that uniform-weighted mini-batch sampling biases learning toward older samples; this motivates the need for a countermeasure.
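As a concrete illustration of such a mini-batch update, here is a minimal PyTorch sketch of our own (not the exact implementation of Fujimoto et al. (2018) or Haarnoja et al. (2018)); the network sizes, the target network, the stand-in policy, and the random mini-batch are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Placeholder Q network Q_phi(s, a) for continuous states and actions."""
    def __init__(self, s_dim, a_dim, width=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, width), nn.ReLU(),
                                 nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def critic_step(q, q_target, policy, optimizer, batch, gamma=0.99):
    """One mini-batch step phi <- phi - alpha * grad of the squared Bellman error."""
    s, a, r, s_next = batch                    # tensors sampled from the replay buffer
    with torch.no_grad():                      # bootstrap target r + gamma * Q_target(s', a')
        a_next = policy(s_next)                # a' from the current policy; a stand-in here
        target = r + gamma * q_target(s_next, a_next)
    loss = F.mse_loss(q(s, a), target)         # empirical Bellman error on the mini-batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with random placeholder data.
s_dim, a_dim, batch_size = 8, 2, 32
q, q_tgt = QNet(s_dim, a_dim), QNet(s_dim, a_dim)
q_tgt.load_state_dict(q.state_dict())
opt = torch.optim.Adam(q.parameters(), lr=3e-4)
policy = lambda s: torch.tanh(torch.randn(s.shape[0], a_dim))   # stand-in policy
batch = (torch.randn(batch_size, s_dim), torch.randn(batch_size, a_dim),
         torch.randn(batch_size), torch.randn(batch_size, s_dim))
critic_step(q, q_tgt, policy, opt, batch)
```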

Note that Alg. 1 appears inconsistent in the horizon because the Q function, Eq. (4), is defined over the infinite horizon while the trajectories are of finite length. In Corollary 1, we address this inconsistency by approximating the Q function using finite-length trajectories.

2.3 The construction of our behavior policy

Although Alg. 1 does not have a fixed behavior policy, in Lemma 1 we construct a non-stationary policy that generates the averaged occupancy measure at every step. This policy describes the averaged-over-episodes behavior at every trajectory step $i$ of the historical trajectories $\{(s_{i}^{e},a_{i}^{e})\}_{i=0,e=1}^{L-1,N}$. We hence define it to be our behavior policy and use it as a reference when analyzing Alg. 1.

The construction is as follows. Denote the state distribution at trajectory step $i$ in episode $e$ as $\rho_{i}^{e}(s)=\rho_{i}(s|\rho_{0},\pi^{e},T)$. By Eq. (2), $\rho_{\rho_{0}}^{\pi^{e}}$ is the state occupancy measure generated by $(\rho_{0},\pi^{e},T)$. Intuitively, $\rho_{\rho_{0}}^{\pi^{e}}$ describes the discounted state distribution starting at trajectory step 0 in episode $e$.

More generally, define $\rho^{\pi^{e}}_{\rho_{i}^{e}}$ as the state occupancy measure generated by $(\rho_{i}^{e},\pi^{e},T)$. Then $\rho^{\pi^{e}}_{\rho_{i}^{e}}$ describes the discounted state distribution starting at trajectory step $i$ in episode $e$:

$$\rho^{\pi^{e}}_{\rho_{i}^{e}}(s)\triangleq(1-\gamma)\sum_{j=i}^{\infty}\gamma^{j-i}\rho_{j}^{e}(s). \quad (5)$$

Since Alg. 1 considers trajectories from all episodes, the average-over-episodes distribution $N^{-1}\sum_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}$ is of interest. It describes the averaged discounted state distribution at trajectory step $i$ over all episodes.

Behavior policy through averaging.

Let $\overline{\rho}_{i}^{o}$ be the average over episodes of $\{\rho_{\rho_{i}^{e}}^{\pi^{e}}\}_{e}$; i.e., $\overline{\rho}_{i}^{o}$ is the averaged occupancy measure when initialized at step $i$. To describe the average-over-episodes behavior at step $i$, we want a policy that generates $\overline{\rho}_{i}^{o}$, and Eq. (5) helps construct such a policy. Concretely, let $\overline{\rho}_{i}$ and $\pi_{i}^{D}(a|s)$ be the averaged state distribution and the averaged policy at step $i$, respectively:

$$\overline{\rho}_{i}^{o}(s)\triangleq N^{-1}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s),\qquad\overline{\rho}_{i}(s)\triangleq N^{-1}\sum_{e=1}^{N}\rho_{i}^{e}(s),\qquad\pi_{i}^{D}(a|s)\triangleq\frac{\sum_{e=1}^{N}\pi^{e}(a|s)\,\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)}{\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)}. \quad (6)$$

Lemma 1 shows that $\pi_{i}^{D}$ in Eq. (6) is a notion of behavior policy in the sense that $(\overline{\rho}_{i},\pi_{i}^{D},T)$ generates $\overline{\rho}_{i}^{o}$; i.e., $\pi_{i}^{D}$ generates the averaged occupancy measure when the initial state follows $\overline{\rho}_{i}$. Since $\pi_{i}^{D}$ describes the averaged discounted behavior starting at step $i$, we define it to be our behavior policy. This is key to analyzing the policy evaluation error in Alg. 1.

Lemma 1.

Let $\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s)$ be the normalized state occupancy measure generated by $(\overline{\rho}_{i},\pi_{i}^{D},T)$. Then $\overline{\rho}_{i}^{o}(s)=\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s)$ a.e., and hence, from Eq. (6), $N^{-1}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)\,\pi^{e}(a|s)=\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s)\,\pi^{D}_{i}(a|s)$ a.e.
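For intuition, the following sketch numerically checks the first claim of Lemma 1 in a small tabular MDP with $i=0$ (so $\overline{\rho}_{0}=\rho_{0}$). This is our own illustration under stated assumptions: the random MDP, the two "episode" policies, and the linear-system solver are not part of the paper's continuous setting.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, N = 4, 3, 0.9, 2

T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)   # transition kernel T(s'|s,a)
rho0 = np.full(S, 1.0 / S)                                  # shared initial distribution
policies = [rng.random((S, A)) for _ in range(N)]           # two episode policies pi^e(a|s)
policies = [p / p.sum(-1, keepdims=True) for p in policies]

def occupancy(rho_init, pi):
    """Normalized occupancy: rho = (1-gamma) * rho_init * (I - gamma P_pi)^{-1}."""
    P_pi = np.einsum('sa,sat->st', pi, T)                   # state-to-state kernel under pi
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho_init)

occ = [occupancy(rho0, pi) for pi in policies]              # rho^{pi^e}_{rho_0} for each episode
avg_occ = np.mean(occ, axis=0)                              # left-hand side of Lemma 1

# Eq. (6): occupancy-weighted mixture policy pi_0^D.
num = sum(o[:, None] * pi for o, pi in zip(occ, policies))
pi_D = num / num.sum(-1, keepdims=True)

print(np.allclose(avg_occ, occupancy(rho0, pi_D)))          # True: pi_0^D generates the averaged occupancy
```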

3 Related work

There are two main approaches to off-policy RL: importance sampling (IS) (Tokdar and Kass, 2010) and regression-based approaches. IS has low bias but high variance, while the opposite holds for regression-based approaches. Prior work also suggests that combining IS and regression yields robust results (Dudík et al., 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016; Kallus and Uehara, 2020). Below, we briefly review these techniques.

Importance Sampling. Standard IS uses the behavior policy to form an importance weight from the probability ratio of the current to the behavior policy. For a fully accessible behavior policy, examples of this approach include the naive IS weight (Precup et al., 2000), importance weight clipping (Schmitt et al., 2020), and the marginalized importance weight (Xie et al., 2019; Yin and Wang, 2020; Yin et al., 2021). Alternatively, one can use the density ratio of the occupancy measures of the current and behavior policies (Liu et al., 2018), estimated using a maximum entropy approach (Lazic et al., 2020), a Lagrangian approach (Yang et al., 2020), or a variational approach (Sinha et al., 2020). A distinct line of work emphasizes experience without considering probability ratios. Examples include emphasizing samples with a higher TD error (Schaul et al., 2016; Horgan et al., 2018), emphasizing recent experience (Wang and Ross, 2019), or updating the policy towards the past and discarding distant experience (Novati and Koumoutsakos, 2019).

Regression-based. A regression-based approach can achieve strong experimental results using proper policy exploration and good function approximation (Fujimoto et al., 2018; Haarnoja et al., 2018). It also admits strong theoretical results, such as generalization error bounds using a concentration coefficient (Le et al., 2019), policy evaluation error bounds using bounded probability ratios (Agarwal et al., 2019, Chap. 5.1), a minimax-optimal bound under linear function approximation (Duan et al., 2020), confidence bounds constructed from the kernel Bellman loss (Feng et al., 2020), and sharp sample complexity in the asynchronous setting (Li et al., 2020). However, these settings require a fixed behavior policy and bounded probability ratios (i.e., the behavior policy should be sufficiently exploratory). Both requirements rarely hold in practice. To avoid this issue, we construct a non-stationary behavior policy that generates the averaged occupancy measure and use it to analyze the policy evaluation error. This allows us to avoid such assumptions and to work with continuous state and action spaces.

4 Policy evaluation error of off-policy actor-critic algorithms

Let $Q^{*}$ and $Q^{\pi^{N}}$ be the Q functions of the optimal policy and of $\pi^{N}$, respectively, and let $\hat{Q}$ be the estimated Q function. The performance error $|\hat{Q}-Q^{*}|$ decomposes into $|\hat{Q}-Q^{\pi^{N}}|+|Q^{\pi^{N}}-Q^{*}|$. The first term, $|\hat{Q}-Q^{\pi^{N}}|$, is the policy evaluation error and is the focus of the off-policy evaluation literature (Duan et al., 2020). The second term is the policy's optimality gap, and bounding it currently requires strong assumptions such as tabular or linear MDPs (Jin et al., 2018, 2020). In an off-policy actor-critic setting, the policy evaluation error has not been analyzed adequately since most analyses require a fixed behavior policy. This is the focus of our analysis.

Suppose we are given the trajectories sampled in the past episodes (the replay buffer). We analyze the policy evaluation error over the expected distribution of transitions. We express this error in terms of the Bellman error of $\hat{Q}$, the bias term in 1-Wasserstein distance between the policies $(\pi^{N},\pi^{D}_{i})$, and the variance term in the number of trajectories $N$. Note $\pi^{D}_{i}$ is the behavior policy at trajectory step $i$ defined in Eq. (6). The use of the 1-Wasserstein distance (Villani, 2008) makes the results applicable to both stochastic and deterministic policies. Since supervised learning only makes the Bellman error small, we need good sampling strategies to mitigate the bias and variance terms. We hence investigate sampling techniques from a bias-variance perspective in the next section.

4.1 Problem setup

Notation.

In episode $e$, a length-$L$ trajectory $\{(s_{i}^{e},a_{i}^{e})\}_{i=0}^{L-1}$ following policy $\pi^{e}$ is sampled (Alg. 1, line 2). Then, $\hat{Q}$ is fitted over the replay buffer (line 3). Because the error of $\hat{Q}$ at step $i$ depends on the states sampled at steps $i,i+1,\dots$, the importance of these samples $(s,a)$ depends on the trajectory step $i$. Also, due to the discount factor, the importance of step $j>i$ is discounted by $\gamma^{j-i}$ relative to step $i$. Hence, we use a Bellman error and a policy mismatch error that reflect the dependency on the trajectory step and the discount factor. For $f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ and $g:\mathcal{S}\rightarrow\mathbb{R}$, define an averaging-discounted operator over the replay buffer of $N$ episodes:

$$\tilde{E}_{i}^{L}f(\cdot,\cdot)\triangleq\frac{1}{N}\sum_{e=1}^{N}\sum_{j=i}^{L-1}(1-\gamma)\gamma^{j-i}f(s_{j}^{e},a_{j}^{e})\qquad\text{and}\qquad\tilde{E}_{i}^{L}g(\cdot)\triangleq\frac{1}{N}\sum_{e=1}^{N}\sum_{j=i}^{L-1}(1-\gamma)\gamma^{j-i}g(s_{j}^{e}).$$

Using $\tilde{E}_{i}^{L}$, the Bellman error of $\hat{Q}$ and the distance between the policies $(\pi^{N},\pi^{D}_{i})$ on the replay buffer are written as

$$\epsilon^{i,L}_{\hat{Q}}=\tilde{E}_{i}^{L}\Big|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)\Big|,\qquad W_{1}^{i,L}=\tilde{E}_{i}^{L}W_{1}(\pi^{N}||\pi^{D}_{i})(\cdot). \quad (7)$$

Here $W_{1}(\pi^{N}||\pi^{D}_{i})(s)=W_{1}(\pi^{N}(\cdot|s)\,||\,\pi^{D}_{i}(\cdot|s))$ is the 1-Wasserstein distance between the two policies at state $s$, which can be viewed as a function of $(s,a)$. Both the Bellman error $\epsilon^{i,L}_{\hat{Q}}$ and the policy mismatch error $W_{1}^{i,L}$ depend on the trajectory step $i$ and are discounted by $\gamma$.
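The averaging-discounted operator is straightforward to compute from a replay buffer. Below is a minimal NumPy sketch of our own (the buffer contents and the function `f` are placeholders) of $\tilde{E}_{i}^{L}f$ for trajectories stored episode by episode.

```python
import numpy as np

def avg_discounted(buffer, f, i, gamma):
    """E~_i^L f = (1/N) sum_e sum_{j=i}^{L-1} (1-gamma) * gamma^(j-i) * f(s_j^e, a_j^e)."""
    total = 0.0
    for episode in buffer:                       # episode = list of (s, a) pairs, length L
        for j in range(i, len(episode)):
            s, a = episode[j]
            total += (1 - gamma) * gamma ** (j - i) * f(s, a)
    return total / len(buffer)

# Usage with placeholder data: N = 3 episodes of length L = 5, scalar states/actions.
rng = np.random.default_rng(0)
buffer = [[(rng.normal(), rng.normal()) for _ in range(5)] for _ in range(3)]
bellman_gap = lambda s, a: abs(s - a)            # stand-in for |Q_hat - B^{pi^N} Q_hat|
print(avg_discounted(buffer, bellman_gap, i=1, gamma=0.99))
```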

Assumptions.

We now relate the Bellman error and the policy mismatch error defined in Eq. (7) to the policy evaluation error $|\hat{Q}-Q^{\pi^{N}}|$. Doing so requires some assumptions. First, to control the error of $\hat{Q}$ by the policy mismatch error in $W_{1}$ distance, we assume that, for every state, $\hat{Q}$ and $Q^{\pi^{N}}$ are $L_{\mathcal{A}}$-Lipschitz over actions. We provide reasoning for this assumption in the later discussion. Next, observe that both quantities in Eq. (7) are random, with the randomness coming from the initial states, the policies at different episodes, and the state transitions. To control this randomness, we need assumptions on the Q functions and a concentration inequality. Because Alg. 1 only samples one trajectory in each episode, higher-order quantities (e.g., variances) are unavailable. This motivates us to use first-order quantities (e.g., a uniform bound on the Q functions) and apply Hoeffding's inequality. Hence, we assume that $\hat{Q}(s,a)$ and $Q^{\pi^{N}}(s,a)$ are bounded in the interval $[0,r^{\max}/(1-\gamma)]$ and that the action space is bounded with diameter $\text{diam}_{\mathcal{A}}$. A justification of these assumptions is provided in Appendix A.2.

4.2 Policy evaluation error

Observe that the Q function, Eq. (4), is defined on infinite-length trajectories while the errors, Eq. (7), are evaluated on length-$L$ trajectories. Discounting makes it possible to approximate the Q function using finite length $L$. To approximate the Q function at trajectory step $i$ up to a constant error, we need samples at least up to step $i+\Omega((1-\gamma)^{-1})$. Hence, we first prove the main theorem with $i=0$ and $L=\infty$. Then, we generalize the result to $i\geq 0$ and finite $L$ in Corollary 1.
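For a quick sense of scale (our own illustrative numbers, not values used in the paper), take $\gamma=0.99$ and a target truncation error $\epsilon=0.01$. The normalized weight of the truncated tail beyond step $L$ is $\sum_{j\geq L}(1-\gamma)\gamma^{j-i}=\gamma^{L-i}$, so

$$\gamma^{L-i}\leq\epsilon\iff L-i\geq\frac{\log(1/\epsilon)}{\log(1/\gamma)}\approx\frac{\log(1/\epsilon)}{1-\gamma}\approx 460,$$

i.e., roughly $(1-\gamma)^{-1}\log(1/\epsilon)$ extra steps beyond $i$ suffice, consistent with the $i+\Omega((1-\gamma)^{-1})$ requirement above.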

Theorem 1.

Let $N$ be the number of episodes. For $f\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, define the operator $\tilde{E}_{0}^{\infty}f\triangleq\frac{1}{N}\sum_{e=1}^{N}\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}f(s_{i}^{e},a_{i}^{e})$. Denote the policy mismatch error and the Bellman error as $W_{1}^{0,\infty}=\tilde{E}_{0}^{\infty}W_{1}(\pi^{N}||\pi^{D}_{0})(\cdot)$ and $\epsilon^{0,\infty}_{\hat{Q}}=\tilde{E}_{0}^{\infty}|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)|$, respectively. Assume

  • (1) For each $s\in\mathcal{S}$, $\hat{Q}(s,a)$ and $Q^{\pi^{N}}(s,a)$ are $L_{\mathcal{A}}$-Lipschitz in $a$.

  • (2) $\hat{Q}(s,a)$ and $Q^{\pi^{N}}(s,a)$ are bounded and take values in $[0,r^{\max}/(1-\gamma)]$.

  • (3) The action space is bounded with diameter $\text{diam}_{\mathcal{A}}<\infty$.

Then, with probability at least $1-\delta$,

$$\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0}(a|s)}\Big|\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big|\leq\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big(\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\,\text{diam}_{\mathcal{A}}}{1-\gamma}\Big)\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\frac{1}{1-\gamma}\Big(\epsilon_{\hat{Q}}^{0,\infty}+2L_{\mathcal{A}}W_{1}^{0,\infty}\Big).$$

Theorem 1 expresses the error of $\hat{Q}$ as the sum of a Bellman error, a bias, and variance terms. More specifically, the first two terms can be understood as the “variance from sampling” because they decrease in the number of episodes $N$. On the other hand, $W_{1}^{0,\infty}$ is the “policy mismatch bias” w.r.t. $\pi^{N}$. Because the behavior policy $\pi_{0}^{D}$ is a mixture of the historical policies $\{\pi^{e}\}_{e=1}^{N}$, Eq. (6), we expect this bias to increase in $N$ until $\pi^{N}$ begins to converge.

Theorem 1 only bounds the difference between $\hat{Q}$ and $Q^{\pi^{N}}$ at $i=0$ with infinite-length trajectories. We can generalize it to $i\geq 0$ and finite-length trajectories as follows. Recall from Eq. (6) and Lemma 1 that the average state distribution at the $i$-th step, $\overline{\rho}_{i}$, and the behavior policy at the $i$-th step, $\pi^{D}_{i}$, generate the average state occupancy measure, i.e., $\rho_{\overline{\rho}_{i}}^{\pi^{D}_{i}}=N^{-1}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}$ a.e. Therefore, by restricting attention to the states sampled at time steps $i,i+1,\dots$, the “initial state distribution” and the behavior policy become $\overline{\rho}_{i}$ and $\pi^{D}_{i}$, which generalizes Theorem 1 from $i=0$ to $i\geq 0$. In addition, due to $\gamma$-discounting, we may use length-$L$ trajectories to approximate infinite-length ones, provided that $L\geq i+\Omega((1-\gamma)^{-1})$. These observations lead to the following corollary.

Corollary 1.

Fix assumptions (1), (2), (3) of Theorem 1. Write the policy mismatch error and the Bellman error as $W_{1}^{i,L}=\tilde{E}_{i}^{L}W_{1}(\pi^{N}||\pi^{D}_{i})(\cdot)$ and $\epsilon_{\hat{Q}}^{i,L}=\tilde{E}_{i}^{L}|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)|$, respectively, with $\tilde{E}_{i}^{L}$ as in Eq. (7). Then, with probability at least $1-\delta$,

$$\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big|\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big|\leq\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big(\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\,\text{diam}_{\mathcal{A}}}{1-\gamma}\Big)\Big(\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\gamma^{L-i}\Big)+\frac{1}{1-\gamma}\Big(\epsilon_{\hat{Q}}^{i,L}+2L_{\mathcal{A}}W_{1}^{i,L}\Big).$$

Moreover, if $i\leq L-\frac{\log\epsilon}{\log\gamma}$ with $0<\epsilon<1$ and the constants are normalized as $L_{\mathcal{A}}=c/(1-\gamma)$ and $\epsilon_{\hat{Q}}^{i,L}=\xi_{\hat{Q}}^{i,L}/(1-\gamma)$, then, with probability at least $1-\delta$,

$$\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big|\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big|\leq\tilde{O}\Big(\frac{1}{(1-\gamma)^{2}}\Big((r^{\max}+c\cdot\text{diam}_{\mathcal{A}})(1/\sqrt{N}+\epsilon)+\xi_{\hat{Q}}^{i,L}+c\cdot W_{1}^{i,L}\Big)\Big),$$

where $\tilde{O}(\cdot)$ is a variant of $O(\cdot)$ that ignores logarithmic factors.

Note that if $i=0$ and $L=\infty$, Corollary 1 is identical to Theorem 1. Because the Bellman error of $\hat{Q}$ and the bias of the policy are both evaluated using the averaging-discounted operator $\tilde{E}_{i}^{L}$, Corollary 1 implies that the difference between $\hat{Q}$ and $Q^{\pi^{N}}$ at trajectory step $i$ mainly depends on the states at trajectory steps $\geq i$. This is natural: since the Q function is a discounted sum of rewards from the current step onward, the error at step $i$ should mainly depend on steps $\geq i$.

Normalization.

Although the first conclusion of Corollary 1 gives a bound on the error of $\hat{Q}$, the constants may implicitly depend on the expected horizon $(1-\gamma)^{-1}$, so its interpretation requires care. For instance, $L_{\mathcal{A}}$, the Lipschitz constant of the Q functions w.r.t. actions, is probably the trickiest constant. While it is used extensively in prior work (Luo et al., 2019; Xiao et al., 2019; Ni et al., 2019; Yu et al., 2020), its magnitude is never properly addressed in the literature. Intuitively, if a policy $\pi$ is good enough, it should quickly correct a disturbance on the actions. In this case, the rewards after the disturbance only differ for a few trajectory steps, so the Lipschitz constant of $Q^{\pi}$ in actions is sublinear in $(1-\gamma)^{-1}$. On the other hand, if the policy $\pi$ fails to correct a disturbance $\delta$, then, due to error propagation, the disturbance affects every future step, leading to an error that is linear in the horizon, $O(\delta H)$. Therefore, the Lipschitz constant of $Q^{\pi}$ in actions can be as large as $O((1-\gamma)^{-1})$. In addition to $L_{\mathcal{A}}$, the Bellman error $\epsilon_{\hat{Q}}^{i,L}$ should scale linearly in $(1-\gamma)^{-1}$ because the Q function represents the discounted cumulative reward, Eq. (4), which scales linearly in $(1-\gamma)^{-1}$. These observations suggest that $\epsilon_{\hat{Q}}^{i,L}$ and $L_{\mathcal{A}}$ in Corollary 1 are either linear in the horizon or lie between sublinear and linear. To better capture the dependency on the horizon, we normalize the constants and obtain the second conclusion.

Interpretation.

The second conclusion shows that the approximation error from infinite- to finite-length trajectories is bounded by a constant for $i\leq L-\frac{\log\epsilon}{\log\gamma}\overset{\text{Taylor}}{\approx}L-\Omega((1-\gamma)^{-1})$, and becomes harder to control for larger $i$ due to the lack of samples. Moreover, the variance from sampling, $\tilde{O}(1/\sqrt{N})$, and the bias from policy mismatch, $W_{1}^{i,L}$, carry the same order of the expected horizon $(1-\gamma)^{-1}$. This means that their relative magnitude is irrelevant to the expected horizon and largely depends on $N$. The variance term dominates when $N$ is small, while the bias term dominates when $N$ is large. Therefore, one may imagine that the training of Alg. 1 has two phases. In phase 1, the variance term dominates and decreases in $N$, so learning improves quickly as more trajectories are collected. In phase 2, the bias term dominates, so $\pi^{N}$ becomes harder to improve and tends to converge. The tendency to converge makes the bias smaller, meaning that the error of $\hat{Q}$ becomes smaller, too. Therefore, the policy can still slowly improve in $N$ during phase 2. This explains why experiments (Figure 2) usually show a quick performance improvement at the beginning followed by a drastic slowdown.

5 Practical sampling strategies

As previously mentioned, the supervised learning during the Q function update fails to control the bias and variance. We therefore need careful sampling techniques when drawing from the replay buffer to mitigate the policy evaluation error. In particular, Wang and Ross (2019) propose to emphasize recent experience (ERE) because the recent policies are closer to the latest policy. We show below that the ERE strategy is a refinement of 1/age weighting and that both methods help balance the expected selection number of each training transition $(s,a)$. Balanced selection numbers reduce both the policy mismatch bias and the sampling variance. This suggests the potential usefulness of ERE and 1/age weighting, which we verify through experiments in the last subsection.

5.1 Emphasizing recent experience

In Wang and Ross (2019), the authors use a length-$K$ trajectory ($K$ may differ across episodes) and perform $K$ updates. In the $k$-th update ($1\leq k\leq K$), a mini-batch is sampled uniformly from the most recent $c_{k}=\max(N_{0}\eta^{k\frac{L_{0}}{K}},c_{\min})$ samples, where $N_{0}$ is the current size of the replay buffer, $L_{0}$ is the maximum horizon of the environment, $\eta$ is the decay parameter, and $c_{\min}$ is the minimum coverage of the sampling. For MuJoCo (Todorov et al., 2012) environments, the paper suggests the values $(L_{0},\eta,c_{\min})=(1000,0.996,5000)$. Setting $\eta=1$ gives uniform weighting over the replay buffer, and the emphasis on recent data grows as $\eta$ decreases. To see how the ERE strategy affects the mini-batch sampling, we prove the following result.

Proposition 1.

ERE is approximately equivalent (Taylor Approx.) to the non-uniform weighting:

$$w_{t}\propto\frac{1}{\max(t,\,c_{\min},\,N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}+\frac{\mathbb{1}(t\leq c_{\min})}{c_{\min}}\max\Big(\ln\frac{c_{\min}}{N_{0}\eta^{L_{0}}},~0\Big), \quad (8)$$

where $t$ is the age of a data point relative to the newest time step; i.e., $w_{1}$ is the weight of the newest sample.

Note that Prop. 1 holds for $\eta\neq 1$ because it is derived from the geometric series formula $(1-\eta^{n})/(1-\eta)$, which is valid only when $\eta\neq 1$. Despite this discontinuity, we may still claim that the ERE strategy performs a uniform weighting when $\eta$ is close to 1: when $\eta\approx 1$, Eq. (8) suggests $w_{t}$ is proportional to $1/(N_{0}\eta^{L_{0}})-1/N_{0}$ for all $1\leq t\leq N_{0}$, which is a uniform weighting. The emphasis on recent experience (indicated by $\mathbb{1}(t\leq c_{\min})$) is also evident from Eq. (8). Precisely, the second term grows with $\ln\frac{1}{\eta}$ as $\eta$ becomes smaller, so a smaller $\eta$ indeed places more weight on recent experience.

Discussion.

A key distinction between the original ERE and Prop. 1 is that the original ERE depends on the trajectory length $K$ while Prop. 1 does not. Intuitively, the dependence on $K$ disappears because Prop. 1 aggregates the effects of all $1\leq k\leq K$ updates. We verify that Eq. (8) tracks the original ERE well in the next subsection.
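To make the comparison concrete, here is a minimal sketch of our own of the original ERE sampling schedule and of the approximate weights in Eq. (8); the buffer size, batch size, and number of updates are placeholder assumptions, while the hyper-parameter values are the MuJoCo defaults quoted above.

```python
import numpy as np

N0, K = 100_000, 1000                 # replay-buffer size and number of updates this episode
L0, eta, c_min = 1000, 0.996, 5000    # ERE hyper-parameters suggested for MuJoCo
batch_size = 512
rng = np.random.default_rng(0)

# Original ERE: the k-th update samples uniformly from the most recent c_k points.
def ere_minibatch_indices(k):
    c_k = int(max(N0 * eta ** (k * L0 / K), c_min))
    ages = rng.integers(1, c_k + 1, size=batch_size)   # age 1 = newest sample
    return N0 - ages                                    # buffer indices (index N0 - 1 = newest)

# ERE_apx, Eq. (8): an explicit weight over ages t = 1..N0.
t = np.arange(1, N0 + 1)
w = (1.0 / np.maximum.reduce([t, np.full_like(t, c_min), np.full_like(t, int(N0 * eta ** L0))])
     - 1.0 / N0
     + (t <= c_min) / c_min * max(np.log(c_min / (N0 * eta ** L0)), 0.0))
w = np.clip(w, 0.0, None)
w /= w.sum()                                            # normalized sampling weights over ages

one_over_age = (1.0 / t) / (1.0 / t).sum()              # plain 1/age weighting for comparison
print(ere_minibatch_indices(k=1)[:5], w[:3], one_over_age[:3])
```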

Another feature of Prop. 1 is the implicit 1/age weighting in Eq. (8). Although the original ERE samples uniformly from the most recent $c_{k}$ points, the aggregate effect of all $1\leq k\leq K$ updates is well approximated by a 1/age weighting. To understand the effect of 1/age, recall from Section 2.2 that, in practice, $\hat{Q}$ is updated using mini-batch samples. Define a point $(s_{i}^{e},a_{i}^{e})$'s time step as $i+1+L\cdot(e-1)$. Then the expected number of times a point at a given time step is selected across all mini-batches (the expected selection number) gives the aggregate weight over time steps. As shown in Figure 1, 1/age weighting and ERE_apx, Eq. (8), give almost uniform expected selection numbers across time steps, while uniform weighting is significantly biased toward old samples. Therefore, 1/age weighting and its variants help balance the expected selection number.

Figure 1: Expected selection numbers (aggregate weights) over a million time steps.
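The following sketch (our own, complementary to Figure 1; the buffer growth schedule and batch size are simplified assumptions) computes the expected selection number analytically as the buffer grows, with one update per newly added sample. It shows why per-update uniform sampling accumulates far more aggregate weight on old samples, while 1/age weighting keeps the aggregate weights much more balanced.

```python
import numpy as np

T_steps, batch = 5000, 4           # buffer grows to T_steps samples; one update per new sample

uniform_count = np.zeros(T_steps)  # expected selection number of the sample born at each time step
age_count = np.zeros(T_steps)

for n in range(1, T_steps + 1):    # after the n-th sample arrives, the buffer holds samples 0..n-1
    born = np.arange(n)
    # Uniform sampling: each of the n samples is selected with probability 1/n per draw.
    uniform_count[:n] += batch / n
    # 1/age weighting: the sample born at time b has age n - b; weights are normalized.
    ages = n - born
    w = 1.0 / ages
    age_count[:n] += batch * w / w.sum()

# Old samples (small index) accumulate orders of magnitude more expected selections under uniform
# sampling, while 1/age weighting keeps the aggregate weights far more balanced across time steps.
print("uniform:", uniform_count[[0, T_steps // 2, -1]].round(2))
print("1/age:  ", age_count[[0, T_steps // 2, -1]].round(2))
```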

5.2 Merit of balanced selection number

Recall that the expected selection numbers are the aggregate weights over time. To understand the merit of balanced selection numbers, we develop an error analysis under non-uniform aggregate weights in the Appendix (Theorem 2 & Corollary 2). As with the uniform-weight error in Section 4, the non-uniform-weight error is controlled by a bias from policy mismatch and a variance from sampling. In the following, we discuss these biases and variances for ERE, 1/age weighting, and uniform weighting. We conclude that ERE and 1/age weighting are better because both their biases and their variances are smaller than those of uniform weighting.

For the bias from policy mismatch, Figure 1 shows that uniform weighting (for each draw from the replay buffer) makes the aggregate weights (expected selection numbers) biased toward old samples. Because the old policies tend to be distant from the current policy, uniform weighting has a larger policy mismatch bias than ERE and 1/age weighting.

As for the variance from sampling, we generalize Corollary 1's uniform case, $\tilde{O}(1/\sqrt{N})$, to Corollary 2's weighted case, $\tilde{O}(\sqrt{\sum_{i}w_{i}^{2}}/\sum_{i}w_{i})$, in the Appendix. To see this, let $\{X_{i}\}_{i=1}^{N}$ be a martingale difference sequence with $0\leq X_{i}\leq\Delta$. Consider a weighted sequence $\{w_{i}X_{i}\}_{i=1}^{N}$ with weights $\{w_{i}>0\}$. The Azuma-Hoeffding inequality suggests that, with probability $\geq 1-\delta$,

$$\frac{\sum_{i}w_{i}X_{i}-\mathbb{E}[\sum_{i}w_{i}X_{i}]}{\sum_{i}w_{i}}\leq\sqrt{\frac{\sum_{i}(w_{i}\Delta)^{2}}{2}\log\frac{1}{\delta}}\Big/\sum_{i}w_{i}=\tilde{O}\Big(\Delta\frac{\sqrt{\sum_{i}w_{i}^{2}}}{\sum_{i}w_{i}}\Big),$$

which means that a general non-uniform weighting admits a Hoeffding error of $\tilde{O}(\sqrt{\sum_{i}w_{i}^{2}}/\sum_{i}w_{i})$, which reduces to $\tilde{O}(1/\sqrt{N})$ when the weights are equal. Furthermore, one can prove that the Hoeffding error is minimized under uniform weights:

Proposition 2.

Let $\{w_{t}>0\}_{t=1}^{N}$ be the weights of the data indexed by $t$. Then the Hoeffding error $\sqrt{\sum_{t=1}^{N}w_{t}^{2}}/\sum_{t=1}^{N}w_{t}$ is minimized when the weights are equal: $w_{t}=c>0$ for all $t$.

Therefore, the variance from sampling is large under non-uniform aggregate weights and is minimized by uniform aggregate weights. Since the uniform weighting leads to more non-uniform aggregate weights than ERE and 1/age weighting, its variance is also larger.
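A quick numerical check of Proposition 2 (our own sketch with arbitrary example aggregate-weight profiles) compares the Hoeffding error $\sqrt{\sum_{t}w_{t}^{2}}/\sum_{t}w_{t}$ for uniform and skewed weights:

```python
import numpy as np

def hoeffding_error(w):
    """sqrt(sum w_t^2) / sum w_t; equals 1/sqrt(N) for uniform weights."""
    w = np.asarray(w, dtype=float)
    return np.sqrt(np.sum(w ** 2)) / np.sum(w)

N = 10_000
t = np.arange(1, N + 1)
print(hoeffding_error(np.ones(N)))        # uniform aggregate weights: 1/sqrt(N) = 0.01
print(hoeffding_error(1.0 / t))           # skewed toward early indices: ~0.13, decays only like 1/log N
print(hoeffding_error(0.99 ** t))         # heavily skewed: ~0.07, does not shrink with N at all
```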

Because ERE and 1/age weighting have balanced selection numbers (i.e., balanced aggregate weights), their biases and variances are smaller. They should therefore perform better than uniform weighting for off-policy actor-critic RL. We verify this in the next subsection.

5.3 Experimental verification

Having established a theoretical foundation for ERE and 1/age weighting from a bias-variance perspective, we explore two main questions from the preceding subsections: (1) Are ERE and 1/age weighting better than uniform weighting? (2) Does the approximated ERE proposed in Eq. (8) track the original ERE well? In addition, since prioritized experience replay (PER) (Schaul et al., 2016) is a popular sampling method, a natural third question is: (3) Do ERE and 1/age outperform PER?

We evaluate five sampling methods (ERE, ERE_apx, 1/age weighting, uniform, PER) on five MuJoCo continuous-control environments (Todorov et al., 2012): Humanoid, Ant, Walker2d, HalfCheetah, and Hopper. All tasks have a maximum horizon of 1000 and are trained using a PyTorch Soft Actor-Critic (Haarnoja et al., 2018) implementation on GitHub (Tandon, 2018). Most hyper-parameters of the SAC algorithm are the same as those in Tandon (2018), except for the batch size, where we find that a batch size of 512 tends to give more stable results. Our code is available at https://github.com/sunfex/weighted-sac. The SAC implementation and the MuJoCo environment are licensed under the MIT license and the personal student license, respectively. The experiments are run on a server with an Intel i7-6850K CPU and Nvidia GTX 1080 Ti GPUs.

Figure 2: ERE, ERE_apx, 1/age weighting, and uniform weighting on MuJoCo environments over a million steps. The solid lines and shaded areas are the means and one standard deviation.

In Figure 2, the ERE and 1/age weighting strategies perform better than uniform weighting in all tasks. This verifies our preceding assertion that ERE and 1/age weighting are superior because the biases and variances of their estimated Q functions are smaller. Moreover, ERE and ERE_apx mostly coincide with each other, so Eq. (8) is indeed a good approximation of the ERE strategy. This also explains the implicit connection between the ERE and 1/age weighting strategies: ERE is almost equivalent to ERE_apx, and 1/age is the main factor in ERE_apx, so ERE and 1/age weighting should produce similar results. Finally, ERE and 1/age generally outperform PER, so the 1/age-based samplings that we study achieve nontrivial performance improvements.

6 Conclusion

To understand an off-policy actor-critic algorithm, we show the policy evaluation error on the expected distribution of transitions decomposes into the Bellman error, the bias from policy mismatch, and the variance from sampling. We use this to explain that a successful off-policy actor-critic algorithm should have a careful sampling strategy that controls its bias and variance. Motivated by the empirical success of Emphasizing Recent Experience (ERE), we prove that ERE is a variant of the 1/age weighting. We then explain from a bias-variance perspective that ERE and 1/age weighting are preferable over uniform weighting because the resulting biases and variances are smaller.

References

  • Agarwal et al. (2019) Agarwal, A., N. Jiang, S. M. Kakade, and W. Sun
    2019.
    Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.
  • Duan et al. (2020) Duan, Y., Z. Jia, and M. Wang
    2020.
    Minimax-optimal off-policy evaluation with linear function approximation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, Pp.  2701–2709, Virtual. PMLR.
  • Dudík et al. (2011) Dudík, M., J. Langford, and L. Li
    2011.
    Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, P.  1097–1104, Madison, WI, USA. Omnipress.
  • Fan and Ramadge (2021) Fan, T.-H. and P. Ramadge
    2021.
    A contraction approach to model-based reinforcement learning. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, A. Banerjee and K. Fukumizu, eds., volume 130 of Proceedings of Machine Learning Research, Pp.  325–333. PMLR.
  • Feng et al. (2020) Feng, Y., T. Ren, Z. Tang, and Q. Liu
    2020.
    Accountable off-policy evaluation with kernel Bellman statistics. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh, eds., volume 119 of Proceedings of Machine Learning Research, Pp.  3102–3111. PMLR.
  • Fujimoto et al. (2018) Fujimoto, S., H. van Hoof, and D. Meger
    2018.
    Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause, eds., volume 80 of Proceedings of Machine Learning Research, Pp.  1582–1591. PMLR.
  • Fujita and Maeda (2018) Fujita, Y. and S.-i. Maeda
    2018.
    Clipped action policy gradient. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., volume 80 of Proceedings of Machine Learning Research, Pp.  1597–1606. PMLR.
  • Gulrajani et al. (2017) Gulrajani, I., F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville
    2017.
    Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., volume 30. Curran Associates, Inc.
  • Haarnoja et al. (2018) Haarnoja, T., A. Zhou, P. Abbeel, and S. Levine
    2018.
    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause, eds., volume 80 of Proceedings of Machine Learning Research, Pp.  1856–1865. PMLR.
  • Horgan et al. (2018) Horgan, D., J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver
    2018.
    Distributed prioritized experience replay. In International Conference on Learning Representations.
  • Jiang and Li (2016) Jiang, N. and L. Li
    2016.
    Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., volume 48 of Proceedings of Machine Learning Research, Pp.  652–661, New York, New York, USA. PMLR.
  • Jin et al. (2018) Jin, C., Z. Allen-Zhu, S. Bubeck, and M. I. Jordan
    2018.
    Is q-learning provably efficient? In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., volume 31. Curran Associates, Inc.
  • Jin et al. (2020) Jin, C., Z. Yang, Z. Wang, and M. I. Jordan
    2020.
    Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, Pp.  2137–2143. PMLR.
  • Kallus and Uehara (2020) Kallus, N. and M. Uehara
    2020.
    Doubly robust off-policy value and gradient estimation for deterministic policies. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds., volume 33, Pp.  10420–10430. Curran Associates, Inc.
  • Lazic et al. (2020) Lazic, N., D. Yin, M. Farajtabar, N. Levine, D. Gorur, C. Harris, and D. Schuurmans
    2020.
    A maximum-entropy approach to off-policy evaluation in average-reward mdps. Advances in Neural Information Processing Systems, 33.
  • Le et al. (2019) Le, H., C. Voloshin, and Y. Yue
    2019.
    Batch policy learning under constraints. In International Conference on Machine Learning, Pp.  3703–3712. PMLR.
  • Levine et al. (2020) Levine, S., A. Kumar, G. Tucker, and J. Fu
    2020.
    Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • Li et al. (2020) Li, G., Y. Wei, Y. Chi, Y. Gu, and Y. Chen
    2020.
    Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds., volume 33, Pp.  7031–7043. Curran Associates, Inc.
  • Lin (1992) Lin, L. J.
    1992.
    Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321.
  • Liu et al. (2018) Liu, Q., L. Li, Z. Tang, and D. Zhou
    2018.
    Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, volume 31, Pp.  5356–5366. Curran Associates, Inc.
  • Luo et al. (2019) Luo, Y., H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma
    2019.
    Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations.
  • Miyato et al. (2018) Miyato, T., T. Kataoka, M. Koyama, and Y. Yoshida
    2018.
    Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
  • Ni et al. (2019) Ni, C., L. F. Yang, and M. Wang
    2019.
    Learning to control in metric space with optimal regret. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Pp.  726–733.
  • Novati and Koumoutsakos (2019) Novati, G. and P. Koumoutsakos
    2019.
    Remember and forget for experience replay. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., volume 97 of Proceedings of Machine Learning Research, Pp.  4851–4860, Long Beach, California, USA. PMLR.
  • Precup et al. (2000) Precup, D., R. S. Sutton, and S. P. Singh
    2000.
    Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, P.  759–766, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Schaul et al. (2016) Schaul, T., J. Quan, I. Antonoglou, and D. Silver
    2016.
    Prioritized experience replay. In ICLR.
  • Schmitt et al. (2020) Schmitt, S., M. Hessel, and K. Simonyan
    2020.
    Off-policy actor-critic with shared experience replay. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh, eds., volume 119 of Proceedings of Machine Learning Research, Pp.  8545–8554, Virtual. PMLR.
  • Schulman et al. (2015) Schulman, J., S. Levine, P. Abbeel, M. Jordan, and P. Moritz
    2015.
    Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., volume 37 of Proceedings of Machine Learning Research, Pp.  1889–1897, Lille, France. PMLR.
  • Schulman et al. (2017) Schulman, J., F. Wolski, P. Dhariwal, A. Radford, and O. Klimov
    2017.
    Proximal policy optimization algorithms. arXiv:1707.06347.
  • Silver et al. (2014) Silver, D., G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller
    2014.
    Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara, eds., volume 32 of Proceedings of Machine Learning Research, Pp.  387–395, Bejing, China. PMLR.
  • Sinha et al. (2020) Sinha, S., J. Song, A. Garg, and S. Ermon
    2020.
    Experience replay with likelihood-free importance weights. arXiv:2006.13169.
  • Sutton et al. (1999) Sutton, R. S., D. McAllester, S. Singh, and Y. Mansour
    1999.
    Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, P.  1057–1063, Cambridge, MA, USA. MIT Press.
  • Tandon (2018) Tandon, P.
    2018.
    Pytorch soft actor critic. Github Repository, https://github.com/pranz24/pytorch-soft-actor-critic.
  • Thomas and Brunskill (2016) Thomas, P. and E. Brunskill
    2016.
    Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., volume 48 of Proceedings of Machine Learning Research, Pp.  2139–2148, New York, New York, USA. PMLR.
  • Todorov et al. (2012) Todorov, E., T. Erez, and Y. Tassa
    2012.
    Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Pp.  5026–5033.
  • Tokdar and Kass (2010) Tokdar, S. T. and R. E. Kass
    2010.
    Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):54–60.
  • Villani (2008) Villani, C.
    2008.
    Optimal transport – Old and new, volume 338, Pp.  xxii+973.
  • Wang and Ross (2019) Wang, C. and K. Ross
    2019.
    Boosting soft actor-critic: Emphasizing recent experience without forgetting the past. arXiv:1906.04009.
  • Xiao et al. (2019) Xiao, C., Y. Wu, C. Ma, D. Schuurmans, and M. Müller
    2019.
    Learning to combat compounding-error in model-based reinforcement learning. arXiv:1912.11206.
  • Xie et al. (2019) Xie, T., Y. Ma, and Y.-X. Wang
    2019.
    Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, volume 32, Pp.  9668–9678. Curran Associates, Inc.
  • Yang et al. (2020) Yang, M., O. Nachum, B. Dai, L. Li, and D. Schuurmans
    2020.
    Off-policy evaluation via the regularized lagrangian. Advances in Neural Information Processing Systems, 33.
  • Yin et al. (2021) Yin, M., Y. Bai, and Y.-X. Wang
    2021.
    Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, A. Banerjee and K. Fukumizu, eds., volume 130 of Proceedings of Machine Learning Research, Pp.  1567–1575. PMLR.
  • Yin and Wang (2020) Yin, M. and Y.-X. Wang
    2020.
    Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, S. Chiappa and R. Calandra, eds., volume 108 of Proceedings of Machine Learning Research, Pp.  3948–3958, Online. PMLR.
  • Yu et al. (2020) Yu, T., G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma
    2020.
    Mopo: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds., volume 33, Pp.  14129–14142. Curran Associates, Inc.

Appendix A Appendix

A.1 Hyper-parameters

To train the SAC agents, we use deep neural networks to parameterize the policy and Q functions. Both networks consist of dense layers with the same widths. Table 1 presents the suggested hyper-parameters. As mentioned in the experiment section, the hyper-parameters are similar to those in the implementation of Tandon (2018).

Variable           Value
Optimizer          Adam
Learning rate      3E-4
Discount factor    0.99
Batch size         512
Model width        256
Model depth        3
Table 1: Suggested hyper-parameters for SAC.

Environment    Hopper   HalfCheetah   Walker2D   Ant    Humanoid
Temperature    0.2      0.2           0.2        0.2    0.05
Table 2: Temperature parameters for SAC in MuJoCo environments.

A.2 Justification of assumptions

In Section 4.1, we introduce three main assumptions. Below is a justification for each.

  • 1. For each $s\in\mathcal{S}$, $\hat{Q}(s,a)$ and $Q^{\pi^{N}}(s,a)$ are $L_{\mathcal{A}}$-Lipschitz in $a$. As mentioned in the “Normalization” paragraph of Section 4.2, the Lipschitz constant of $Q^{\pi^{N}}$ is sublinear or linear in the horizon, which quantifies its magnitude. Since $\hat{Q}$ approximates $Q^{\pi^{N}}$, $\hat{Q}$ should have a similar property as long as the training error is well controlled. The practitioner can also enforce the Lipschitzness of $\hat{Q}$ using a gradient penalty (Gulrajani et al., 2017) or spectral normalization (Miyato et al., 2018).

  • 2. $\hat{Q}(s,a)$ and $Q^{\pi^{N}}(s,a)$ are bounded and take values in $[0,r^{\max}/(1-\gamma)]$. This is a standard assumption in the RL literature. If the bound is violated, one can clip, translate, or rescale to obtain new Q functions that satisfy the constraint. Note that a bounded reward $r(s,a)\in[0,r^{\max}]$ already implies $Q^{\pi^{N}}\in[0,r^{\max}/(1-\gamma)]$.

  • 3. The action space is bounded with diameter $\text{diam}_{\mathcal{A}}<\infty$. This is a standard assumption in continuous-control environments and is usually satisfied in practice. It is also common to use clipping to ensure that the actions generated by the policy stay within bounds (Fujita and Maeda, 2018).

A.3 The construction of our behavior policy

We first discuss some important relations between state occupancy measures and the Bellman flow operator. Results similar to Fact 1 and Fact 2 can be found in Liu et al. (2018, Lemma 3) and Fan and Ramadge (2021, Appendix A.1).

Definition 1 (Bellman flow operator).

The Bellman flow operator $B_{\rho_{0},\pi,T}(\cdot)$ generated by $(\rho_{0},\pi,T)$ with discount factor $\gamma$ is defined as

$$B_{\rho_{0},\pi,T}(\rho)(s)\triangleq(1-\gamma)\rho_{0}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\,\pi(a^{\prime}|s^{\prime})\,\rho(s^{\prime})\,ds^{\prime}da^{\prime}.$$
Fact 1.

$B_{\rho_{0},\pi,T}$ is a $\gamma$-contraction w.r.t. the total variation distance.

Proof.

Let $p_{1}(s),p_{2}(s)$ be the density functions of two state distributions. Then

$$\begin{aligned}
D_{TV}(B_{\rho_{0},\pi,T}(p_{1})\,||\,B_{\rho_{0},\pi,T}(p_{2}))&=\frac{1}{2}\int\big|B_{\rho_{0},\pi,T}(p_{1})(s)-B_{\rho_{0},\pi,T}(p_{2})(s)\big|\,ds\\
&=\frac{1}{2}\int\gamma\Big|\int T(s|s^{\prime},a^{\prime})\pi(a^{\prime}|s^{\prime})\big(p_{1}(s^{\prime})-p_{2}(s^{\prime})\big)\,ds^{\prime}da^{\prime}\Big|\,ds\\
&\leq\frac{\gamma}{2}\int T(s|s^{\prime},a^{\prime})\pi(a^{\prime}|s^{\prime})\big|p_{1}(s^{\prime})-p_{2}(s^{\prime})\big|\,ds^{\prime}da^{\prime}ds\\
&=\frac{\gamma}{2}\int\big|p_{1}(s^{\prime})-p_{2}(s^{\prime})\big|\,ds^{\prime}=\gamma D_{TV}(p_{1}\,||\,p_{2}).
\end{aligned}$$

Fact 2.

The normalized state occupancy measure $\rho_{\rho_{0}}^{\pi}$ generated by $(\rho_{0},\pi,T)$ with discount factor $\gamma$ is a fixed point of $B_{\rho_{0},\pi,T}$; i.e., $B_{\rho_{0},\pi,T}(\rho_{\rho_{0}}^{\pi})(s)=\rho_{\rho_{0}}^{\pi}(s)$.

Proof.

Writing $f_{i}(s|\rho_{0},\pi,T)$ for the state distribution at step $i$ (i.e., $\rho_{i}(s|\rho_{0},\pi,T)$ in Eq. (2)),

$$\begin{aligned}
\rho_{\rho_{0}}^{\pi}(s)=&(1-\gamma)\sum_{i=0}^{\infty}\gamma^{i}f_{i}(s|\rho_{0},\pi,T)\\
=&(1-\gamma)f_{0}(s|\rho_{0},\pi,T)+\gamma(1-\gamma)\sum_{i=0}^{\infty}\gamma^{i}f_{i+1}(s|\rho_{0},\pi,T)\\
=&(1-\gamma)\rho_{0}(s)+\gamma(1-\gamma)\sum_{i=0}^{\infty}\gamma^{i}\int T(s|s^{\prime},a^{\prime})\pi(a^{\prime}|s^{\prime})f_{i}(s^{\prime}|\rho_{0},\pi,T)\,ds^{\prime}da^{\prime}\\
=&(1-\gamma)\rho_{0}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi(a^{\prime}|s^{\prime})(1-\gamma)\sum_{i=0}^{\infty}\gamma^{i}f_{i}(s^{\prime}|\rho_{0},\pi,T)\,ds^{\prime}da^{\prime}\\
=&(1-\gamma)\rho_{0}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi(a^{\prime}|s^{\prime})\rho_{\rho_{0}}^{\pi}(s^{\prime})\,ds^{\prime}da^{\prime}=B_{\rho_{0},\pi,T}(\rho_{\rho_{0}}^{\pi})(s).
\end{aligned}$$

Thus, the Bellman flow operator is useful for analyzing state occupancy measures, and the following lemma uses it to construct the behavior policy πiD\pi_{i}^{D}.

Lemma 1.

Let ρie(s)\rho_{i}^{e}(s) be the state distribution at trajectory step ii in episode ee. Let ρρieπe(s)\rho_{\rho_{i}^{e}}^{\pi^{e}}(s) be the normalized occupancy measure starting at trajectory step ii in episode ee. Then 1Ne=1Nρρieπe(s)=ρρ¯iπiD(s)a.e.\frac{1}{N}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)=\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s)~{}a.e., where ρρ¯iπiD\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}} is the normalized state occupancy measure generated by (ρ¯i,πiD,T)(\overline{\rho}_{i},\pi_{i}^{D},T). Moreover, we have 1Ne=1Nρρieπe(s)πe(a|s)=ρρ¯iπiD(s)πiD(a|s)a.e.\frac{1}{N}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)\pi^{e}(a|s)=\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s)\pi^{D}_{i}(a|s)~{}a.e.

Proof.

Precisely, ρie(s)=ρi(s|ρ0,πe,T)\rho_{i}^{e}(s)=\rho_{i}(s|\rho_{0},\pi^{e},T) is the state distribution at trajectory step ii following the laws of (ρ0,πe,T)(\rho_{0},\pi^{e},T). Since ρρieπe\rho_{\rho_{i}^{e}}^{\pi^{e}} is the normalized occupancy measure generated by (ρie,πe,T)(\rho_{i}^{e},\pi^{e},T), each ρρieπe\rho_{\rho_{i}^{e}}^{\pi^{e}} is the fixed-point of the Bellman flow equation:

ρρieπe(s)=(1γ)ρie(s)+γT(s|s,a)πe(a|s)ρρieπe(s)𝑑s𝑑a,e[1,,N].\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)=(1-\gamma)\rho_{i}^{e}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})ds^{\prime}da^{\prime},~{}~{}~{}\forall~{}e\in[1,...,N].

This implies the average normalized occupancy measure is the fixed point of the Bellman flow equation characterized by (ρ¯i,πiD,T)(\overline{\rho}_{i},\pi_{i}^{D},T):

1Ne=1Nρρieπe(s)\displaystyle\frac{1}{N}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s) =1Ne=1N[(1γ)ρie(s)+γT(s|s,a)πe(a|s)ρρieπe(s)𝑑s𝑑a]\displaystyle=\frac{1}{N}\sum_{e=1}^{N}\left[(1-\gamma)\rho_{i}^{e}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})ds^{\prime}da^{\prime}\right]
=(1γ)ρ¯i(s)+γT(s|s,a)1Ne=1N[πe(a|s)ρρieπe(s)]dsda\displaystyle=(1-\gamma)\overline{\rho}_{i}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\frac{1}{N}\sum_{e=1}^{N}\left[\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})\right]ds^{\prime}da^{\prime}
=(1γ)ρ¯i(s)+γT(s|s,a)e=1Nπe(a|s)ρρieπe(s)e=1Nρρieπe(s)e=1Nρρieπe(s)N𝑑s𝑑a\displaystyle=(1-\gamma)\overline{\rho}_{i}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\frac{\sum_{e=1}^{N}\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})}{\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})}\frac{\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})}{N}ds^{\prime}da^{\prime}
=(1γ)ρ¯i(s)+γT(s|s,a)πiD(a|s)1Ne=1Nρρieπe(s)dsda,\displaystyle=(1-\gamma)\overline{\rho}_{i}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi_{i}^{D}(a^{\prime}|s^{\prime})\frac{1}{N}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})ds^{\prime}da^{\prime},

where πiD(a|s)e=1Nπe(a|s)ρρieπe(s)e=1Nρρieπe(s)\pi_{i}^{D}(a|s)\triangleq\frac{\sum_{e=1}^{N}\pi^{e}(a|s)\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)}{\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)} is the average behavior policy at step ii. Since the Bellman flow operator is a γ\gamma-contraction in TV distance and hence has a unique fixed point in TV distance, denoted as ρρ¯iπiD(s)\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s), we arrive at 1Ne=1Nρρieπe(s)=ρρ¯iπiD(s)\frac{1}{N}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)=\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s) almost everywhere. Also, by construction, we have

1Ne=1Nρρieπe(s)πe(a|s)=ρρ¯iπiD(s)πiD(a|s)a.e.\frac{1}{N}\sum_{e=1}^{N}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)\pi^{e}(a|s)=\rho_{\overline{\rho}_{i}}^{\pi_{i}^{D}}(s)\pi^{D}_{i}(a|s)~{}a.e.
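As a sanity check of Lemma 1, the sketch below (again with synthetic finite-MDP quantities of our own choosing) forms the occupancy-weighted mixture policy pi_i^D from N per-episode policies and verifies numerically that averaging the per-episode occupancy measures gives the occupancy measure generated by (rho_bar_i, pi_i^D, T).

import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, N = 4, 2, 0.9, 3
T = rng.random((S, S, A)); T /= T.sum(axis=0, keepdims=True)

def occupancy(rho_start, pi):
    # Normalized occupancy (1-gamma) sum_i gamma^i f_i under the kernel induced by pi.
    P = np.einsum('nsa,as->ns', T, pi)
    f, occ = rho_start.copy(), np.zeros(S)
    for i in range(3000):
        occ += (1 - gamma) * gamma**i * f
        f = P @ f
    return occ

# N "episodes" with different policies pi^e and step-i state laws rho_i^e.
pis = [rng.random((A, S)) for _ in range(N)]
for p in pis:
    p /= p.sum(axis=0, keepdims=True)
rhos = [rng.random(S) for _ in range(N)]
for r in rhos:
    r /= r.sum()

occs    = [occupancy(rhos[e], pis[e]) for e in range(N)]
avg_occ = np.mean(occs, axis=0)
rho_bar = np.mean(rhos, axis=0)

# pi_i^D(a|s) = sum_e pi^e(a|s) occ^e(s) / sum_e occ^e(s), as constructed in the proof.
num  = sum(pis[e] * occs[e][None, :] for e in range(N))
pi_D = num / num.sum(axis=0, keepdims=True)

assert np.allclose(occupancy(rho_bar, pi_D), avg_occ, atol=1e-8)
print("Average of per-episode occupancies matches the occupancy of (rho_bar, pi_D, T).")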

A.4 Policy Evaluation Error

Lemma 2.

If Q(s,a)Q(s,a) is L𝒜L_{\mathcal{A}}-Lipschitz in aa for any ss, then, for any state distribution ρ\rho,

𝔼sρ|𝔼aπ1(|s)Q(s,a)𝔼aπ2(|s)Q(s,a)|L𝒜𝔼sρW1(π1(|s)||π2(|s)).\mathbb{E}_{s\sim\rho}\Big{|}\mathbb{E}_{a\sim\pi_{1}(\cdot|s)}Q(s,a)-\mathbb{E}_{a\sim\pi_{2}(\cdot|s)}Q(s,a)\Big{|}\leq L_{\mathcal{A}}\mathbb{E}_{s\sim\rho}W_{1}(\pi_{1}(\cdot|s)\lvert\rvert\pi_{2}(\cdot|s)).
Proof.

For any fixed ss, we have

𝔼aπ1(|s)Q(s,a)𝔼aπ2(|s)Q(s,a)=L𝒜[𝔼aπ1(|s)Q(s,a)L𝒜𝔼aπ2(|s)Q(s,a)L𝒜]\displaystyle\mathbb{E}_{a\sim\pi_{1}(\cdot|s)}Q(s,a)-\mathbb{E}_{a\sim\pi_{2}(\cdot|s)}Q(s,a)=L_{\mathcal{A}}\Big{[}\mathbb{E}_{a\sim\pi_{1}(\cdot|s)}\frac{Q(s,a)}{L_{\mathcal{A}}}-\mathbb{E}_{a\sim\pi_{2}(\cdot|s)}\frac{Q(s,a)}{L_{\mathcal{A}}}\Big{]}
\displaystyle\leq L𝒜[supfLip1𝔼aπ1(|s)f(a)𝔼aπ2(|s)f(a)]=L𝒜W1(π1(|s)||π2(|s)),\displaystyle L_{\mathcal{A}}\Big{[}\sup_{\|f\|_{\text{Lip}}\leq 1}\mathbb{E}_{a\sim\pi_{1}(\cdot|s)}f(a)-\mathbb{E}_{a\sim\pi_{2}(\cdot|s)}f(a)\Big{]}=L_{\mathcal{A}}W_{1}(\pi_{1}(\cdot|s)\lvert\rvert\pi_{2}(\cdot|s)),

where the second line follows from Kantorovich-Rubinstein duality (Villani, 2008). Since the 1-Wasserstein distance W1W_{1} is symmetric, we can interchange the roles of π1\pi_{1} and π2\pi_{2}, yielding

|𝔼aπ1(|s)Q(s,a)𝔼aπ2(|s)Q(s,a)|L𝒜W1(π1(|s)||π2(|s)).\Big{|}\mathbb{E}_{a\sim\pi_{1}(\cdot|s)}Q(s,a)-\mathbb{E}_{a\sim\pi_{2}(\cdot|s)}Q(s,a)\Big{|}\leq L_{\mathcal{A}}W_{1}(\pi_{1}(\cdot|s)\lvert\rvert\pi_{2}(\cdot|s)).

Taking the expectation 𝔼sρ\mathbb{E}_{s\sim\rho} on both sides completes the proof. ∎
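A quick one-dimensional illustration of Lemma 2 (our own sketch, assuming a discretized scalar action space and using SciPy's 1-D Wasserstein routine) checks that the gap in expected Q-values is bounded by L_A times the W1 distance between the two action distributions.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
actions = np.linspace(-1.0, 1.0, 41)      # discretized 1-D action space (assumption)
p1 = rng.random(41); p1 /= p1.sum()       # pi_1(.|s)
p2 = rng.random(41); p2 /= p2.sum()       # pi_2(.|s)

L_A = 3.0
Q = lambda a: L_A * np.sin(a)             # Lipschitz in a with constant L_A

gap = abs(p1 @ Q(actions) - p2 @ Q(actions))
w1  = wasserstein_distance(actions, actions, u_weights=p1, v_weights=p2)
assert gap <= L_A * w1 + 1e-10
print(f"|E_pi1 Q - E_pi2 Q| = {gap:.4f} <= L_A * W1 = {L_A * w1:.4f}")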

Lemma 3.

If Qπ(s,a)Q^{\pi}(s,a) is L𝒜L_{\mathcal{A}}-Lipschitz in aa for any ss, then

𝔼(s0,a0)ρ^0(s)π0D(a|s)|Qπ0D(s0,a0)Qπ(s0,a0)|L𝒜1γ𝔼sρρ^0π0DW1(π0D(|s)||π(|s)).\mathbb{E}_{(s_{0},a_{0})\sim\hat{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}Q^{\pi^{D}_{0}}(s_{0},a_{0})-Q^{\pi}(s_{0},a_{0})\Big{|}\leq\frac{L_{\mathcal{A}}}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}}W_{1}(\pi^{D}_{0}(\cdot|s)\lvert\rvert\pi(\cdot|s)).
Proof.

For any (s,a)(s,a), we have

|Qπ0D(s,a)Qπ(s,a)|=γ|𝔼sT(|s,a)𝔼π0DQπ0D(s,π0D(s))𝔼πQπ(s,π(s))|γ𝔼sT(|s,a)|𝔼π0D,πQπ0D(s,π0D(s))Qπ(s,π0D(s))+Qπ(s,π0D(s))Qπ(s,π(s))|γ𝔼sT(|s,a)(|𝔼π0DQπ0D(s,π0D(s))Qπ(s,π0D(s))|+|𝔼π0D,πQπ(s,π0D(s))Qπ(s,π(s))|)γ𝔼sT(|s,a)(𝔼π0D|Qπ0D(s,π0D(s))Qπ(s,π0D(s))|+L𝒜W1(π0D(|s)||π(|s))),\begin{split}&|Q^{\pi^{D}_{0}}(s,a)-Q^{\pi}(s,a)|\\ =&\gamma\Big{|}\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}\mathbb{E}_{\pi^{D}_{0}}Q^{\pi^{D}_{0}}(s^{\prime},\pi^{D}_{0}(s^{\prime}))-\mathbb{E}_{\pi}Q^{\pi}(s^{\prime},\pi(s^{\prime}))\Big{|}\\ \leq&\gamma\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}\Big{|}\mathbb{E}_{\pi^{D}_{0},\pi}Q^{\pi^{D}_{0}}(s^{\prime},\pi^{D}_{0}(s^{\prime}))-Q^{\pi}(s^{\prime},\pi^{D}_{0}(s^{\prime}))+Q^{\pi}(s^{\prime},\pi^{D}_{0}(s^{\prime}))-Q^{\pi}(s^{\prime},\pi(s^{\prime}))\Big{|}\\ \leq&\gamma\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}\Big{(}\Big{|}\mathbb{E}_{\pi^{D}_{0}}Q^{\pi^{D}_{0}}(s^{\prime},\pi^{D}_{0}(s^{\prime}))-Q^{\pi}(s^{\prime},\pi^{D}_{0}(s^{\prime}))\Big{|}+\Big{|}\mathbb{E}_{\pi^{D}_{0},\pi}Q^{\pi}(s^{\prime},\pi^{D}_{0}(s^{\prime}))-Q^{\pi}(s^{\prime},\pi(s^{\prime}))\Big{|}\Big{)}\\ \leq&\gamma\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}\Big{(}\mathbb{E}_{\pi^{D}_{0}}\Big{|}Q^{\pi^{D}_{0}}(s^{\prime},\pi^{D}_{0}(s^{\prime}))-Q^{\pi}(s^{\prime},\pi^{D}_{0}(s^{\prime}))\Big{|}+L_{\mathcal{A}}W_{1}(\pi^{D}_{0}(\cdot|s^{\prime})\lvert\rvert\pi(\cdot|s^{\prime}))\Big{)},\end{split}

where the last line follows from Lemma 2. Let ρiπ0D\rho_{i}^{\pi^{D}_{0}} be the state distribution at step ii following the laws of (ρ^0,π0D,T)(\hat{\rho}_{0},\pi^{D}_{0},T). Take expectation over ρ^0\hat{\rho}_{0} and expand the recursive relation. We arrive at

𝔼(s0,a0)ρ^0(s)π0D(a|s)|Qπ0D(s0,a0)Qπ(s0,a0)|L𝒜i=1γi𝔼siρiπ0DW1(π0D(|si)||π(|si))\displaystyle\mathbb{E}_{(s_{0},a_{0})\sim\hat{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}Q^{\pi^{D}_{0}}(s_{0},a_{0})-Q^{\pi}(s_{0},a_{0})\Big{|}\leq L_{\mathcal{A}}\sum_{i=1}^{\infty}\gamma^{i}\mathbb{E}_{s_{i}\sim\rho_{i}^{\pi^{D}_{0}}}W_{1}(\pi^{D}_{0}(\cdot|s_{i})\lvert\rvert\pi(\cdot|s_{i}))
=\displaystyle= L𝒜1γ𝔼sρρ^0π0DW1(π0D(|s)||π(|s))L𝒜𝔼sρ^0W1(π0D(|s)||π(|s))\frac{L_{\mathcal{A}}}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}}W_{1}(\pi^{D}_{0}(\cdot|s)\lvert\rvert\pi(\cdot|s))-L_{\mathcal{A}}\mathbb{E}_{s\sim\hat{\rho}_{0}}W_{1}(\pi^{D}_{0}(\cdot|s)\lvert\rvert\pi(\cdot|s))
\displaystyle\leq L𝒜1γ𝔼sρρ^0π0DW1(π0D(|s)||π(|s)),\displaystyle\frac{L_{\mathcal{A}}}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}}W_{1}(\pi^{D}_{0}(\cdot|s)\lvert\rvert\pi(\cdot|s)),

where the second line follows from ρρ^0π0D=(1γ)i=0γiρiπ0D\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}=(1-\gamma)\sum_{i=0}^{\infty}\gamma^{i}\rho_{i}^{\pi^{D}_{0}}. ∎
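For a concrete check of Lemma 3, the sketch below (a random finite MDP of our own construction, with actions embedded on the real line) solves for both Q-functions exactly, takes L_A to be the Lipschitz constant of Q^pi over the embedded action set, and confirms the stated bound.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
S, A, gamma = 4, 3, 0.9
acts = np.linspace(0.0, 1.0, A)                    # embed actions on the real line
T = rng.random((S, S, A)); T /= T.sum(axis=0, keepdims=True)
r = rng.random((S, A))

def q_function(pi):
    # Solve Q = r + gamma * P Q exactly, with P[(s,a),(s',a')] = T(s'|s,a) pi(a'|s').
    P = np.einsum('nsa,bn->sanb', T, pi).reshape(S * A, S * A)
    return np.linalg.solve(np.eye(S * A) - gamma * P, r.reshape(-1)).reshape(S, A)

def occupancy(rho_start, pi):
    # Normalized state occupancy under (rho_start, pi, T), via a truncated sum.
    P = np.einsum('nsa,as->ns', T, pi)
    f, occ = rho_start.copy(), np.zeros(S)
    for i in range(2000):
        occ += (1 - gamma) * gamma**i * f
        f = P @ f
    return occ

pi_D = rng.random((A, S)); pi_D /= pi_D.sum(axis=0, keepdims=True)   # plays the role of pi_0^D
pi   = rng.random((A, S)); pi   /= pi.sum(axis=0, keepdims=True)     # target policy pi
rho0 = rng.random(S); rho0 /= rho0.sum()                             # plays the role of rho_hat_0

Q_D, Q_pi = q_function(pi_D), q_function(pi)

# L_A: Lipschitz constant (in a) of Q^pi over the embedded finite action set.
L_A = max(abs(Q_pi[s, i] - Q_pi[s, j]) / abs(acts[i] - acts[j])
          for s in range(S) for i in range(A) for j in range(A) if i != j)

lhs = sum(rho0[s] * pi_D[a, s] * abs(Q_D[s, a] - Q_pi[s, a])
          for s in range(S) for a in range(A))
occ = occupancy(rho0, pi_D)
w1  = sum(occ[s] * wasserstein_distance(acts, acts, u_weights=pi_D[:, s], v_weights=pi[:, s])
          for s in range(S))
assert lhs <= L_A / (1 - gamma) * w1 + 1e-8
print(f"LHS = {lhs:.4f} <= (L_A/(1-gamma)) * E[W1] = {L_A / (1 - gamma) * w1:.4f}")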

Lemma 4.

If Q(s,a)Q(s,a) is L𝒜L_{\mathcal{A}}-Lipschitz in aa for any ss, then

𝔼(s0,a0)ρ^0(s)π0D(a|s)|Q(s0,a0)Qπ0D(s0,a0)|\displaystyle\mathbb{E}_{(s_{0},a_{0})\sim\hat{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s_{0},a_{0})-Q^{\pi^{D}_{0}}(s_{0},a_{0})\Big{|}
\displaystyle\leq 𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)πQ(s,a)|+L𝒜W1(π(|s)||π0D(|s))1γ.\displaystyle\frac{\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\Big{|}+L_{\mathcal{A}}W_{1}(\pi(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s))}{1-\gamma}.
Proof.

For any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have a recursive relation:

|Q(s,a)Qπ0D(s,a)|=|Q(s,a)π0DQπ0D(s,a)|\displaystyle\Big{|}Q(s,a)-Q^{\pi^{D}_{0}}(s,a)\Big{|}=\Big{|}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q^{\pi^{D}_{0}}(s,a)\Big{|}
\displaystyle\leq |Q(s,a)π0DQ(s,a)|+|π0DQ(s,a)π0DQπ0D(s,a)|\displaystyle\Big{|}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q(s,a)\Big{|}+\Big{|}\mathcal{B}^{\pi^{D}_{0}}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q^{\pi^{D}_{0}}(s,a)\Big{|}
=\displaystyle= |Q(s,a)π0DQ(s,a)|+γ|𝔼sT(|s,a),aπ0D(|s)Q(s,a)Qπ0D(s,a)|\displaystyle\Big{|}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q(s,a)\Big{|}+\gamma\Big{|}\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a),a^{\prime}\sim\pi^{D}_{0}(\cdot|s^{\prime})}Q(s^{\prime},a^{\prime})-Q^{\pi^{D}_{0}}(s^{\prime},a^{\prime})\Big{|}
\displaystyle\leq |Q(s,a)π0DQ(s,a)|+γ𝔼sT(|s,a),aπ0D(|s)|Q(s,a)Qπ0D(s,a)|\displaystyle\Big{|}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q(s,a)\Big{|}+\gamma\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a),a^{\prime}\sim\pi^{D}_{0}(\cdot|s^{\prime})}\Big{|}Q(s^{\prime},a^{\prime})-Q^{\pi^{D}_{0}}(s^{\prime},a^{\prime})\Big{|}

Expand the recursive relation. We have

𝔼(s0,a0)ρ^0(s)π0D(a|s)|Q(s0,a0)Qπ0D(s0,a0)|𝔼s0ρ^0𝔼a0π0D(|s0)|Q(s0,a0)π0DQ(s0,a0)|+𝔼s0ρ^0i=1γi𝔼a0,s1,a1,,si1,ai1𝔼siT(|si1,ai1),aiπ0D(|si)|Q(si,ai)π0DQ(si,ai)|=𝔼s0ρ^0i=0γi𝔼a0,s1,a1,,si,aiT,π0D|Q(si,ai)π0DQ(si,ai)|=11γ𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)π0DQ(s,a)|.\begin{split}&\mathbb{E}_{(s_{0},a_{0})\sim\hat{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s_{0},a_{0})-Q^{\pi^{D}_{0}}(s_{0},a_{0})\Big{|}\\ \leq&\mathbb{E}_{s_{0}\sim\hat{\rho}_{0}}\mathbb{E}_{a_{0}\sim\pi^{D}_{0}(\cdot|s_{0})}\Big{|}Q(s_{0},a_{0})-\mathcal{B}^{\pi^{D}_{0}}Q(s_{0},a_{0})\Big{|}\\ &+\mathbb{E}_{s_{0}\sim\hat{\rho}_{0}}\sum_{i=1}^{\infty}\gamma^{i}\mathbb{E}_{a_{0},s_{1},a_{1},...,s_{i-1},a_{i-1}}\mathbb{E}_{s_{i}\sim T(\cdot|s_{i-1},a_{i-1}),a_{i}\sim\pi^{D}_{0}(\cdot|s_{i})}\Big{|}Q(s_{i},a_{i})-\mathcal{B}^{\pi^{D}_{0}}Q(s_{i},a_{i})\Big{|}\\ =&\mathbb{E}_{s_{0}\sim\hat{\rho}_{0}}\sum_{i=0}^{\infty}\gamma^{i}\mathbb{E}_{a_{0},s_{1},a_{1},...,s_{i},a_{i}\sim T,\pi^{D}_{0}}\Big{|}Q(s_{i},a_{i})-\mathcal{B}^{\pi^{D}_{0}}Q(s_{i},a_{i})\Big{|}\\ =&\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q(s,a)\Big{|}.\end{split} (9)

The last line uses the following identity for the normalized occupancy measure:

𝔼ρρ^0π0D=(1γ)[𝔼s0ρ^0𝔼a0π0D(|s0)+i=1γi𝔼s0ρ^0𝔼a0π0D(|s0)𝔼siT(|si1,ai1),aiπ0D(|si)].\displaystyle\mathbb{E}_{\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}}=(1-\gamma)\Big{[}\mathbb{E}_{s_{0}\sim\hat{\rho}_{0}}\mathbb{E}_{a_{0}\sim\pi^{D}_{0}(\cdot|s_{0})}+\sum_{i=1}^{\infty}\gamma^{i}\mathbb{E}_{s_{0}\sim\hat{\rho}_{0}}\mathbb{E}_{a_{0}\sim\pi^{D}_{0}(\cdot|s_{0})}...\mathbb{E}_{s_{i}\sim T(\cdot|s_{i-1},a_{i-1}),a_{i}\sim\pi^{D}_{0}(\cdot|s_{i})}\Big{]}.

We are almost done once the π0D\mathcal{B}^{\pi^{D}_{0}} in Eq. (9) is replaced by π\mathcal{B}^{\pi}. Note that

𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)π0DQ(s,a)|𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)πQ(s,a)|+|πQ(s,a)π0DQ(s,a)|=𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)πQ(s,a)|+γ|𝔼sT(|s,a)𝔼π,π0DQ(s,π(s))Q(s,π0D(s))|Lem.2𝔼(s,a)ρρ^0π0D(s)π0D(a|s)[|Q(s,a)πQ(s,a)|+γL𝒜𝔼sT(|s,a)W1(π(|s)||π0D|s)]=𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)πQ(s,a)|+L𝒜𝔼sρρ^0π0DW1(π(|s)||π0D(|s))(1γ)L𝒜𝔼sρ^0W1(π(|s)||π0D(|s))𝔼(s,a)ρρ^0π0D(s)π0D(a|s)|Q(s,a)πQ(s,a)|+L𝒜W1(π(|s)||π0D(|s)).\begin{split}&\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q(s,a)\Big{|}\\ \leq&\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\Big{|}+\Big{|}\mathcal{B}^{\pi}Q(s,a)-\mathcal{B}^{\pi^{D}_{0}}Q(s,a)\Big{|}\\ =&\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\Big{|}+\gamma\Big{|}\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}\mathbb{E}_{\pi,\pi^{D}_{0}}Q(s^{\prime},\pi(s^{\prime}))-Q(s^{\prime},\pi^{D}_{0}(s^{\prime}))\Big{|}\\ \overset{Lem.~{}\ref{lem:wass-bound}}{\leq}&\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\bigg{[}\Big{|}Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\Big{|}+\gamma L_{\mathcal{A}}\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}W_{1}(\pi(\cdot|s^{\prime})\lvert\rvert\pi^{D}_{0}{\cdot}|s^{\prime})\bigg{]}\\ =&\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\Big{|}\\ &+L_{\mathcal{A}}\mathbb{E}_{s\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}}W_{1}(\pi(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s))-(1-\gamma)L_{\mathcal{A}}\mathbb{E}_{s\sim\hat{\rho}_{0}}W_{1}(\pi(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s))\\ \leq&\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\Big{|}Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\Big{|}+L_{\mathcal{A}}W_{1}(\pi(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s)).\end{split} (10)

The second-to-last line follows because the distribution of ss^{\prime} is ρρ^0,1π0D(s)=saT(s|s,a)ρρ^0π0D(s,a)\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0},1}(s^{\prime})=\int_{sa}T(s^{\prime}|s,a)\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s,a), which satisfies the identity

ρρ^0π0D(s)=(1γ)ρ^0(s)+γT(s|s,a)ρρ^0π0D(s,a)𝑑s𝑑a=(1γ)ρ^0(s)+γρρ^0,1π0D(s).\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s^{\prime})=(1-\gamma)\hat{\rho}_{0}(s^{\prime})+\gamma\int T(s^{\prime}|s,a)\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s,a)dsda=(1-\gamma)\hat{\rho}_{0}(s^{\prime})+\gamma\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0},1}(s^{\prime}).

Combining Eq. (9) and Eq. (10), the result follows. ∎

Lemma 5.

Let 0f(s,a)Δ0\leq f(s,a)\leq\Delta be a bounded function for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Let E~0\tilde{E}_{0}^{\infty} be the averaging-discounted operator with infinite-length trajectories in NN episodes; i.e., E~0f=1Ne=1Ni=0(1γ)γif(sie,aie)\tilde{E}_{0}^{\infty}f=\frac{1}{N}\sum_{e=1}^{N}\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}f(s_{i}^{e},a_{i}^{e}). Then, with probability greater than 1δ1-\delta,

(E~0𝔼ρρ^0π0D(s)π0D(a|s))fΔ2Nlog1δ.\left(\tilde{E}_{0}^{\infty}-\mathbb{E}_{\rho_{\hat{\rho}_{0}}^{\pi^{D}_{0}}(s)\pi^{D}_{0}(a|s)}\right)f\leq\Delta\sqrt{\frac{2}{N}\log\frac{1}{\delta}}.
Proof.

Let δs0e(s)\delta_{s_{0}^{e}}(s) be the delta measure at the initial state in episode ee. Because the empirical distribution ρ^0\hat{\rho}_{0} is composed of the realizations of the trajectories’ initial states, we know ρ^0=1Ne=1Nδs0e\hat{\rho}_{0}=\frac{1}{N}\sum_{e=1}^{N}\delta_{s_{0}^{e}}. Then, Lemma 1 implies ρρ^0π0D\rho_{\hat{\rho}_{0}}^{\pi^{D}_{0}} is an average of the normalized occupancy measures in NN episodes.

ρρ^0π0D(s)π0D(a|s)=a.e.1Ne=1Nρδs0eπe(s)πe(a|s)=def.ofρδs0eπe1Ne=1Ni=0(1γ)γiρie(s|δs0e,πe,T)πe(a|s),\begin{split}\rho_{\hat{\rho}_{0}}^{\pi^{D}_{0}}(s)\pi^{D}_{0}(a|s)\overset{a.e.}{=}&\frac{1}{N}\sum_{e=1}^{N}\rho_{\delta_{s_{0}^{e}}}^{\pi^{e}}(s)\pi^{e}(a|s)\\ \overset{def.~{}of~{}\rho_{\delta_{s_{0}^{e}}}^{\pi^{e}}}{=}&\frac{1}{N}\sum_{e=1}^{N}\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}\rho_{i}^{e}(s|\delta_{s_{0}^{e}},\pi^{e},T)\pi^{e}(a|s),\end{split} (11)

where ρie\rho_{i}^{e} is the state density at trajectory step ii in episode ee. Since (sie,aie)ρie(s|δs0e,πe,T)πe(a|s)(s_{i}^{e},a_{i}^{e})\sim\rho_{i}^{e}(s|\delta_{s_{0}^{e}},\pi^{e},T)\pi^{e}(a|s), we have

(E~0𝔼ρρ^0π0D(s)π0D(a|s))f=1Ne=1N[i=0(1γ)γif(sie,aie)𝔼ρδs0eπe(s)πe(a|s)f]=(11)1Ne=1N[i=0(1γ)γi[f(sie,aie)𝔼[f(sie,aie)|s0e]]]=1Ne=1NMe.\begin{split}\left(\tilde{E}_{0}^{\infty}-\mathbb{E}_{\rho_{\hat{\rho}_{0}}^{\pi^{D}_{0}}(s)\pi^{D}_{0}(a|s)}\right)f&=\frac{1}{N}\sum_{e=1}^{N}\left[\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}f(s_{i}^{e},a_{i}^{e})-\mathbb{E}_{\rho_{\delta_{s_{0}^{e}}}^{\pi^{e}}(s)\pi^{e}(a|s)}f\right]\\ &\overset{\eqref{eq:occu-expand}}{=}\frac{1}{N}\sum_{e=1}^{N}\left[\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}\big{[}f(s_{i}^{e},a_{i}^{e})-\mathbb{E}[f(s_{i}^{e},a_{i}^{e})|s_{0}^{e}]\big{]}\right]=\frac{1}{N}\sum_{e=1}^{N}M^{e}.\end{split} (12)

We claim that {Me}e=1N\{M^{e}\}_{e=1}^{N} defined as Me=i=0(1γ)γi[f(sie,aie)𝔼[f(sie,aie)|s0e]]M^{e}=\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}\big{[}f(s_{i}^{e},a_{i}^{e})-\mathbb{E}[f(s_{i}^{e},a_{i}^{e})|s_{0}^{e}]\big{]} is a martingale difference sequence. To see this, let e\mathcal{F}^{e} be the filtration of all randomness from episode 11 to ee, with 0\mathcal{F}^{0} being the trivial σ\sigma-algebra. Clearly, MeM^{e} is e\mathcal{F}^{e}-measurable, 𝔼[Me|e1]=0\mathbb{E}[M^{e}|\mathcal{F}^{e-1}]=0, and Me[Δ,Δ]M^{e}\in[-\Delta,\Delta] since f(s,a)[0,Δ]f(s,a)\in[0,\Delta] by assumption, which proves {Me}e=1N\{M^{e}\}_{e=1}^{N} is a martingale difference sequence.

Finally, since MeM^{e} is bounded in [Δ,Δ][-\Delta,\Delta], the Azuma-Hoeffding inequality implies that with probability greater than 1δ1-\delta,

Eq. (12)N1e=1N(2Δ)22log1δ=Δ2Nlog1δ.\text{Eq.~{}}\eqref{eq:main-mgd}\leq N^{-1}\sqrt{\frac{\sum_{e=1}^{N}(2\Delta)^{2}}{2}\log\frac{1}{\delta}}=\Delta\sqrt{\frac{2}{N}\log\frac{1}{\delta}}.
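The concentration claim of Lemma 5 can be visualized with the following Monte-Carlo sketch (a synthetic two-state chain with the policy folded into the transition kernel; all numbers are illustrative): repeated draws of the discounted empirical average stay within Delta*sqrt(2 log(1/delta)/N) of the occupancy-measure expectation at least as often as the two-sided guarantee predicts.

import numpy as np

rng = np.random.default_rng(3)
gamma, N, delta, horizon, reps = 0.9, 100, 0.05, 100, 50
P = np.array([[0.8, 0.3],        # P[s_next, s]: 2-state chain with the policy folded in
              [0.2, 0.7]])
f = np.array([0.0, 1.0])         # bounded f with Delta = 1
rho0 = np.array([1.0, 0.0])

# Exact expectation of f under the normalized occupancy measure.
occ, fs = np.zeros(2), rho0.copy()
for i in range(3000):
    occ += (1 - gamma) * gamma**i * fs
    fs = P @ fs
exact = occ @ f

def discounted_avg():
    # One draw of E~_0^infty f: the discounted empirical average over N episodes.
    total = 0.0
    for _ in range(N):
        s = rng.choice(2, p=rho0)
        for i in range(horizon):             # gamma^100 is negligible
            total += (1 - gamma) * gamma**i * f[s]
            s = rng.choice(2, p=P[:, s])
    return total / N

bound = 1.0 * np.sqrt(2 / N * np.log(1 / delta))      # Delta * sqrt(2 log(1/delta) / N)
devs = np.array([discounted_avg() - exact for _ in range(reps)])
print(f"fraction of runs within the bound: {(np.abs(devs) <= bound).mean():.2f}"
      f" (two-sided guarantee: >= {1 - 2 * delta})")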

Theorem 1.

Let NN be the number of episodes. For f:𝒮×𝒜,f\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}, define the operator E~0f1Ne=1Ni=0(1γ)γif(sie,aie)\tilde{E}_{0}^{\infty}f\triangleq\frac{1}{N}\sum_{e=1}^{N}\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}f(s_{i}^{e},a_{i}^{e}). Denote the policy mismatch error and the Bellman error as W10,=E~0W1(πN||π0D)()W_{1}^{0,\infty}=\tilde{E}_{0}^{\infty}W_{1}(\pi^{N}\lvert\rvert\pi^{D}_{0})(\cdot) and ϵQ^0,=E~0|Q^(,)(πNQ^)(,)|\epsilon^{0,\infty}_{\hat{Q}}=\tilde{E}_{0}^{\infty}|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)|, respectively. Assume

  • (1)

    For each s𝒮,s\in\mathcal{S}, Q^(s,a)\hat{Q}(s,a) and QπN(s,a)Q^{\pi^{N}}(s,a) are L𝒜L_{\mathcal{A}}-Lipschitz in a.a.

  • (2)

    Q^(s,a)\hat{Q}(s,a) and QπN(s,a)Q^{\pi^{N}}(s,a) are bounded and take values in [0,rmax/(1γ)][0,r^{\max}/(1-\gamma)].

  • (3)

    The action space is bounded with diameter diam𝒜<\text{diam}_{\mathcal{A}}<\infty.

Note W1(πN||πiD)(s)W_{1}(\pi^{N}\lvert\rvert\pi^{D}_{i})(s) means W1(πN(|s)||πiD(|s))W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{i}(\cdot|s)), which is a function of ss. Then, with probability greater than 1δ1-\delta, we have

𝔼(s0,a0)ρ0(s)π0D(a|s)|Q^(s0,a0)QπN(s0,a0)|\displaystyle\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}
\displaystyle\leq rmax1γ12Nlog2δ+(rmax(1γ)2+2L𝒜diam𝒜1γ)2Nlog2δ+11γ(ϵQ^0,+2L𝒜W10,).\displaystyle\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q}}^{0,\infty}+2L_{\mathcal{A}}W_{1}^{0,\infty}\Big{)}.
Proof.

The proof combines Lemmas 3, 4, and 5. To start with, the objective is decomposed as:

𝔼(s0,a0)ρ0(s)π0D(a|s)|Q^(s0,a0)QπN(s0,a0)|=H1+𝔼(s0,a0)ρ^0(s)π0D(a|s)|Q^(s0,a0)QπN(s0,a0)|H1+𝔼(s,a)ρ^0(s)π0D(a|s)|Q^(s0,a0)Qπ0D(s0,a0)|+|Qπ0D(s0,a0)QπN(s0,a0)|4,3H1+11γ𝔼(s,a)ρρ^0π0D(s)π0D(a|s)[|Q^(s,a)(πNQ^)(s,a)|+2L𝒜W1(πN(|s)||π0D(|s))]=H1+H21γ+11γE~0[|Q^(,)(πNQ^)(,)|+2L𝒜W1(πN||π0D)()]Assump.H1+H21γ+11γ(ϵQ^0,+2L𝒜W10,).\begin{split}&\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\\ =&H_{1}+\mathbb{E}_{(s_{0},a_{0})\sim\hat{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\\ \leq&H_{1}+\mathbb{E}_{(s,a)\sim\hat{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{D}_{0}}(s_{0},a_{0})\Big{|}+\Big{|}Q^{\pi^{D}_{0}}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\\ \overset{\ref{lem:q-qpid},~{}\ref{lem:qpid-qpi}}{\leq}&H_{1}+\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}\bigg{[}\Big{|}\hat{Q}(s,a)-(\mathcal{B}^{\pi^{N}}\hat{Q})(s,a)\Big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s))\bigg{]}\\ =&H_{1}+\frac{H_{2}}{1-\gamma}+\frac{1}{1-\gamma}\tilde{E}_{0}^{\infty}\bigg{[}\Big{|}\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)\Big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}\lvert\rvert\pi^{D}_{0})(\cdot)\bigg{]}\\ \overset{\text{Assump.}}{\leq}&H_{1}+\frac{H_{2}}{1-\gamma}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q}}^{0,\infty}+2L_{\mathcal{A}}W_{1}^{0,\infty}\Big{)}.\end{split} (13)

Because |Q^(s,a)QπN(s,a)|[0,rmax1γ]|\hat{Q}(s,a)-Q^{\pi^{N}}(s,a)|\in[0,\frac{r^{\max}}{1-\gamma}], we know 𝔼aπ0D(|s)|Q^(s,a)QπN(s,a)|[0,rmax1γ]\mathbb{E}_{a\sim\pi^{D}_{0}(\cdot|s)}|\hat{Q}(s,a)-Q^{\pi^{N}}(s,a)|\in[0,\frac{r^{\max}}{1-\gamma}], too. Suppose ρ^0\hat{\rho}_{0} consists of NN independent samples. By Hoeffding’s inequality, with probability 1δ\geq 1-\delta,

H1=(𝔼s0ρ0𝔼s0ρ^0)𝔼a0π0D(|s0)|Q^(s0,a0)QπN(s0,a0)|rmax1γ12Nlog1δ.\begin{split}H_{1}&=\big{(}\mathbb{E}_{s_{0}\sim\rho_{0}}-\mathbb{E}_{s_{0}\sim\hat{\rho}_{0}}\big{)}\mathbb{E}_{a_{0}\sim\pi^{D}_{0}(\cdot|s_{0})}\big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\big{|}\leq\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{1}{\delta}}.\end{split} (14)

Also, let f(s,a)=|Q^(s,a)(πNQ^)(s,a)|+2L𝒜W1(πN(|s)||π0D(|s))f(s,a)=\big{|}\hat{Q}(s,a)-(\mathcal{B}^{\pi^{N}}\hat{Q})(s,a)\big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s)). Then f(s,a)f(s,a) is bounded in [0,rmax1γ+2L𝒜diam𝒜][0,\frac{r^{\max}}{1-\gamma}+2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}]. Therefore, Lemma 5 implies that with probability greater than 1δ1-\delta,

H2=(𝔼(s,a)ρρ^0π0D(s)π0D(a|s)E~0)[|Q^(s,a)(πNQ^)(s,a)|+2L𝒜W1(πN(|s)||π0D(|s))](rmax1γ+2L𝒜diam𝒜)2Nlog1δ\begin{split}H_{2}=&\Big{(}\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0}}_{\hat{\rho}_{0}}(s)\pi^{D}_{0}(a|s)}-\tilde{E}_{0}^{\infty}\Big{)}\bigg{[}\Big{|}\hat{Q}(s,a)-(\mathcal{B}^{\pi^{N}}\hat{Q})(s,a)\Big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{0}(\cdot|s))\bigg{]}\\ \leq&\Big{(}\frac{r^{\max}}{1-\gamma}+2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}\Big{)}\sqrt{\frac{2}{N}\log\frac{1}{\delta}}\end{split} (15)

Combining Eq. (13), (14) and (15), a union bound implies with probability greater than 12δ1-2\delta,

𝔼(s0,a0)ρ0(s)π0D(a|s)|Q^(s0,a0)QπN(s0,a0)|\displaystyle\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}
\displaystyle\leq rmax1γ12Nlog1δ+(rmax(1γ)2+2L𝒜diam𝒜1γ)2Nlog1δ+11γ(ϵQ^0,+2L𝒜W10,)\displaystyle\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{1}{\delta}}+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\sqrt{\frac{2}{N}\log\frac{1}{\delta}}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q}}^{0,\infty}+2L_{\mathcal{A}}W_{1}^{0,\infty}\Big{)}

Finally, rescaling δ\delta to δ/2\delta/2 finishes the proof. ∎

Corollary 1.

Let E~iL\tilde{E}_{i}^{L} be the operator defined as E~iLf=1Ne=1Nj=iL1(1γ)γjif(sje,aje)\tilde{E}_{i}^{L}f=\frac{1}{N}\sum_{e=1}^{N}\sum_{j=i}^{L-1}(1-\gamma)\gamma^{j-i}f(s_{j}^{e},a_{j}^{e}). Fix assumptions (1) (2) (3) of Theorem 1. Rewrite the policy mismatch error and the Bellman error as W1i,L=E~iLW1(πN||πiD)()W_{1}^{i,L}=\tilde{E}_{i}^{L}W_{1}(\pi^{N}\lvert\rvert\pi^{D}_{i})(\cdot) and ϵQ^i,L=E~iL|Q^(,)(πNQ^)(,)|\epsilon_{\hat{Q}}^{i,L}=\tilde{E}_{i}^{L}|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)|, respectively, where W1(πN||πiD)(s)=W1(πN(|s)||πiD(|s))W_{1}(\pi^{N}\lvert\rvert\pi^{D}_{i})(s)=W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{i}(\cdot|s)). Let ρ¯i(s)\overline{\rho}_{i}(s) be the average state density at trajectory step ii over all NN episodes. Then, with probability greater than 1δ1-\delta, we have

𝔼(si,ai)ρ¯i(s)πiD(a|s)|Q^(si,ai)QπN(si,ai)|\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}\leq
rmax1γ12Nlog2δ+(rmax(1γ)2+2L𝒜diam𝒜1γ)(2Nlog2δ+γLi)+11γ(ϵQ^i,L+2L𝒜W1i,L).\displaystyle\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\Big{(}\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\gamma^{L-i}\Big{)}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q}}^{i,L}+2L_{\mathcal{A}}W_{1}^{i,L}\Big{)}.

Moreover, if iLlogϵlogγi\leq L-\frac{\log\epsilon}{\log\gamma} and the constants are normalized as L𝒜=c/(1γ)L_{\mathcal{A}}=c/(1-\gamma), ϵQ^i,L=ξQ^i,L/(1γ)\epsilon_{\hat{Q}}^{i,L}=\xi_{\hat{Q}}^{i,L}/(1-\gamma), then, with probability greater than 1δ1-\delta, we have

𝔼(si,ai)ρ¯i(s)πiD(a|s)|Q^(si,ai)QπN(si,ai)|\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}
\displaystyle\leq O~(1(1γ)2((rmax+cdiam𝒜)(1/N+ϵ)+ξQ^i,L+cW1i,L))\displaystyle\tilde{O}\Big{(}\frac{1}{(1-\gamma)^{2}}\Big{(}(r^{\max}+c\cdot\text{diam}_{\mathcal{A}})(1/\sqrt{N}+\epsilon)+\xi_{\hat{Q}}^{i,L}+c\cdot W_{1}^{i,L}\Big{)}\Big{)}
Proof.

Notice that Theorem 1 is the case i=0i=0 and L=L=\infty. To extend this to i0i\geq 0 and L=L=\infty, recall that Lemma 1 defines the behavior policy πiD\pi^{D}_{i}, the average state distribution ρ¯i\overline{\rho}_{i} and the state occupancy measure ρρ¯iπiD\rho_{\overline{\rho}_{i}}^{\pi^{D}_{i}} at step ii. Also, since the trajectories in each episode are initialized using the same initial state distribution, we have ρ0=ρ¯0\rho_{0}=\overline{\rho}_{0}. Therefore, the objective of Theorem 1 is actually 𝔼(s0,a0)ρ¯0(s)π0D(a|s)|Q^(s0,a0)QπN(s0,a0)|\mathbb{E}_{(s_{0},a_{0})\sim\overline{\rho}_{0}(s)\pi^{D}_{0}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}. This is generalized to i0i\geq 0 using the substitutions (ρ¯0,π0D,E~0)(ρ¯i,πiD,E~i)(\overline{\rho}_{0},~{}\pi^{D}_{0},~{}\tilde{E}_{0}^{\infty})\rightarrow(\overline{\rho}_{i},~{}\pi^{D}_{i},~{}\tilde{E}_{i}^{\infty}), yielding

𝔼(si,ai)ρ¯i(s)πiD(a|s)|Q^(si,ai)QπN(si,ai)|\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}
\displaystyle\leq rmax1γ12Nlog2δ+(rmax(1γ)2+2L𝒜diam𝒜1γ)2Nlog2δ+11γ(ϵQ^i,+2L𝒜W1i,).\displaystyle\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q}}^{i,\infty}+2L_{\mathcal{A}}W_{1}^{i,\infty}\Big{)}.

Finally, observe that for any bounded ff:

E~ifE~iLf+γLif\tilde{E}_{i}^{\infty}f\leq\tilde{E}_{i}^{L}f+\gamma^{L-i}\|f\|_{\infty}

We arrive at

𝔼(si,ai)ρ¯i(s)πiD(a|s)|Q^(si,ai)QπN(si,ai)|\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}
\displaystyle\leq rmax1γ12Nlog2δ+(rmax(1γ)2+2L𝒜diam𝒜1γ)(2Nlog2δ+γLi)+11γ(ϵQ^i,L+2L𝒜W1i,L).\displaystyle\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\Big{(}\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\gamma^{L-i}\Big{)}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q}}^{i,L}+2L_{\mathcal{A}}W_{1}^{i,L}\Big{)}.

As for the second conclusion, with the substitutions in the statement, we know γLiϵ\gamma^{L-i}\leq\epsilon. Hence,

𝔼(si,ai)ρ¯i(s)πiD(a|s)|Q^(si,ai)QπN(si,ai)|\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i}(s)\pi^{D}_{i}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}
\displaystyle\leq rmax1γ12Nlog2δ+(rmax(1γ)2+2cdiam𝒜(1γ)2)(2Nlog2δ+ϵ)+1(1γ)2(ξQ^i,L+2cW1i,L)\displaystyle\frac{r^{\max}}{1-\gamma}\sqrt{\frac{1}{2N}\log\frac{2}{\delta}}+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2c\cdot\text{diam}_{\mathcal{A}}}{(1-\gamma)^{2}}\Big{)}\Big{(}\sqrt{\frac{2}{N}\log\frac{2}{\delta}}+\epsilon\Big{)}+\frac{1}{(1-\gamma)^{2}}\Big{(}\xi_{\hat{Q}}^{i,L}+2c\cdot W_{1}^{i,L}\Big{)}
=\displaystyle= O~(1(1γ)2((rmax+cdiam𝒜)(1/N+ϵ)+ξQ^i,L+cW1i,L))\displaystyle\tilde{O}\Big{(}\frac{1}{(1-\gamma)^{2}}\Big{(}(r^{\max}+c\cdot\text{diam}_{\mathcal{A}})(1/\sqrt{N}+\epsilon)+\xi_{\hat{Q}}^{i,L}+c\cdot W_{1}^{i,L}\Big{)}\Big{)}
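The truncation condition of Corollary 1 is easy to evaluate; the tiny helper below (illustrative only) returns the smallest remaining trajectory length L - i for which the tail term gamma^(L-i) falls below epsilon.

import math

def min_remaining_length(gamma: float, eps: float) -> int:
    # Smallest L - i such that gamma**(L - i) <= eps, i.e., L - i >= log(eps) / log(gamma).
    return math.ceil(math.log(eps) / math.log(gamma))

for gamma in (0.9, 0.99):
    for eps in (1e-2, 1e-3):
        k = min_remaining_length(gamma, eps)
        print(f"gamma={gamma}, eps={eps}: need L - i >= {k:4d} (gamma^k = {gamma**k:.2e})")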

A.5 Policy Evaluation Error under Non-uniform Weights

Lemma 6.

Following Lemma 1, let w={we>0}e=1Nw=\{w_{e}>0\}_{e=1}^{N} be the (unnormalized) weights for the episodes. For the weighted case, we have 1eweeweρρieπe(s)=ρρ¯i,wπi,wD(s)a.e.\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)=\rho_{\overline{\rho}_{i,w}}^{\pi_{i,w}^{D}}(s)~{}a.e., where ρρ¯i,wπi,wD(s)\rho_{\overline{\rho}_{i,w}}^{\pi_{i,w}^{D}}(s) is the normalized state occupancy measure generated by (ρ¯i,w,πi,wD,T)(\overline{\rho}_{i,w},\pi_{i,w}^{D},T). Also, 1eweeweρρieπe(s)πe(a|s)=ρρ¯i,wπi,wD(s)πi,wD(a|s)a.e.\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)\pi^{e}(a|s)=\rho_{\overline{\rho}_{i,w}}^{\pi_{i,w}^{D}}(s)\pi_{i,w}^{D}(a|s)~{}a.e.

Proof.

The proof is basically a generalization of Lemma 1. Since ρρieπe\rho_{\rho_{i}^{e}}^{\pi^{e}} is the normalized occupancy measure generated by (ρie,πe,T)(\rho_{i}^{e},\pi^{e},T), we know each ρρieπe\rho_{\rho_{i}^{e}}^{\pi^{e}} is the fixed-point of the Bellman flow equation:

ρρieπe(s)=(1γ)ρie(s)+γT(s|s,a)πe(a|s)ρρieπe(s)𝑑s𝑑a,e[1,,N].\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)=(1-\gamma)\rho_{i}^{e}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})ds^{\prime}da^{\prime},~{}~{}~{}\forall~{}e\in[1,...,N].

Taking a weighted average, we get

1eweeweρρieπe(s)\displaystyle\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)
=\displaystyle= 1eweewe[(1γ)ρie(s)+γT(s|s,a)πe(a|s)ρρieπe(s)𝑑s𝑑a]\displaystyle\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\left[(1-\gamma)\rho_{i}^{e}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})ds^{\prime}da^{\prime}\right]
=\displaystyle= (1γ)ρ¯i,w(s)+γT(s|s,a)1eweewe[πe(a|s)ρρieπe(s)]dsda\displaystyle(1-\gamma)\overline{\rho}_{i,w}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\left[\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})\right]ds^{\prime}da^{\prime}
=\displaystyle= (1γ)ρ¯i,w(s)+γT(s|s,a)eweπe(a|s)ρρieπe(s)eweρρieπe(s)eweρρieπe(s)ewe𝑑s𝑑a\displaystyle(1-\gamma)\overline{\rho}_{i,w}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\frac{\sum_{e}w_{e}\pi^{e}(a^{\prime}|s^{\prime})\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})}{\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})}\frac{\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})}{\sum_{e}w_{e}}ds^{\prime}da^{\prime}
=\displaystyle= (1γ)ρ¯i,w(s)+γT(s|s,a)πi,wD(a|s)1eweeweρρieπe(s)dsda,\displaystyle(1-\gamma)\overline{\rho}_{i,w}(s)+\gamma\int T(s|s^{\prime},a^{\prime})\pi_{i,w}^{D}(a^{\prime}|s^{\prime})\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s^{\prime})ds^{\prime}da^{\prime},

where πi,wD(a|s)eweπe(a|s)ρρieπe(s)eweρρieπe(s)\pi_{i,w}^{D}(a|s)\triangleq\frac{\sum_{e}w_{e}\pi^{e}(a|s)\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)}{\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)} is the weighted behavior policy at step ii. Since the Bellman flow operator is a γ\gamma-contraction in TV distance and hence has a unique fixed point (up to differences on a set of measure zero), denoted as ρρ¯i,wπi,wD(s)\rho_{\overline{\rho}_{i,w}}^{\pi_{i,w}^{D}}(s), we arrive at 1eweeweρρieπe(s)=ρρ¯i,wπi,wD(s)a.e.\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)=\rho_{\overline{\rho}_{i,w}}^{\pi_{i,w}^{D}}(s)~{}a.e. Finally, by definition of πi,wD(a|s)\pi_{i,w}^{D}(a|s), we conclude that 1eweeweρρieπe(s)πe(a|s)=ρρ¯i,wπi,wD(s)πi,wD(a|s)a.e.\frac{1}{\sum_{e}w_{e}}\sum_{e}w_{e}\rho_{\rho_{i}^{e}}^{\pi^{e}}(s)\pi^{e}(a|s)=\rho_{\overline{\rho}_{i,w}}^{\pi_{i,w}^{D}}(s)\pi_{i,w}^{D}(a|s)~{}a.e.

Theorem 2.

Let w={we>0}e=1Nw=\{w_{e}>0\}_{e=1}^{N} be the (unnormalized) weights for the episodes. Let E~0,w\tilde{E}_{0,w}^{\infty} be the operator defined as E~0,wf=1ewee=1Ni=0(1γ)γiwef(sie,aie)\tilde{E}_{0,w}^{\infty}f=\frac{1}{\sum_{e}w_{e}}\sum_{e=1}^{N}\sum_{i=0}^{\infty}(1-\gamma)\gamma^{i}w_{e}f(s_{i}^{e},a_{i}^{e}). Fix assumptions (1) (2) (3) of Theorem 1. Rewrite the policy mismatch error and the Bellman error as W1,w0,=E~0,wW1(πN||π0,wD)()W_{1,w}^{0,\infty}=\tilde{E}_{0,w}^{\infty}W_{1}(\pi^{N}\lvert\rvert\pi_{0,w}^{D})(\cdot) and ϵQ^,w0,=E~0,w|Q^(,)(πNQ^)(,)|\epsilon^{0,\infty}_{\hat{Q},w}=\tilde{E}_{0,w}^{\infty}|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)|, respectively, where W1(πN||π0,wD)(s)=W1(πN(|s)||π0,wD(|s))W_{1}(\pi^{N}\lvert\rvert\pi_{0,w}^{D})(s)=W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi_{0,w}^{D}(\cdot|s)). With probability greater than 1δ1-\delta, we have

𝔼(s0,a0)ρ0(s)π0,wD(a|s)|Q^(s0,a0)QπN(s0,a0)|rmax1γewe22(ewe)2log2δ\displaystyle\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0,w}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\leq\frac{r^{\max}}{1-\gamma}\sqrt{\frac{\sum_{e}w_{e}^{2}}{2(\sum_{e}w_{e})^{2}}\log\frac{2}{\delta}}
+(rmax(1γ)2+2L𝒜diam𝒜1γ)2ewe2(ewe)2log2δ+11γ(ϵQ^,w0,+2L𝒜W1,w0,).\displaystyle\quad\quad\quad\quad\quad+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\sqrt{\frac{2\sum_{e}w_{e}^{2}}{(\sum_{e}w_{e})^{2}}\log\frac{2}{\delta}}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q},w}^{0,\infty}+2L_{\mathcal{A}}W_{1,w}^{0,\infty}\Big{)}.
Proof.

The proof closely follows that of Theorem 1. Decompose the objective using Lemmas 3 and 4:

𝔼(s0,a0)ρ0(s)π0,wD(a|s)|Q^(s0,a0)QπN(s0,a0)|=H1+𝔼(s0,a0)ρ^0,w(s)π0,wD(a|s)|Q^(s0,a0)QπN(s0,a0)|H1+𝔼(s,a)ρ^0,w(s)π0,wD(a|s)|Q^(s0,a0)Qπ0,wD(s0,a0)|+|Qπ0,wD(s0,a0)QπN(s0,a0)|4,3H1+11γ𝔼(s,a)ρρ^0,wπ0,wD(s)π0,wD(a|s)[|Q^(s,a)(πNQ^)(s,a)|+2L𝒜W1(πN(|s)||π0,wD(|s))]=H1+H21γ+11γE~0,w[|Q^(,)(πNQ^)(,)|+2L𝒜W1(πN||π0,wD)()]Assump.H1+H21γ+11γ(ϵQ^,w0,+2L𝒜W1,w0,),\begin{split}&\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0,w}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\\ =&H_{1}+\mathbb{E}_{(s_{0},a_{0})\sim\hat{\rho}_{0,w}(s)\pi^{D}_{0,w}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\\ \leq&H_{1}+\mathbb{E}_{(s,a)\sim\hat{\rho}_{0,w}(s)\pi^{D}_{0,w}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{D}_{0,w}}(s_{0},a_{0})\Big{|}+\Big{|}Q^{\pi^{D}_{0,w}}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\\ \overset{\ref{lem:q-qpid},~{}\ref{lem:qpid-qpi}}{\leq}&H_{1}+\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0,w}}_{\hat{\rho}_{0,w}}(s)\pi^{D}_{0,w}(a|s)}\bigg{[}\Big{|}\hat{Q}(s,a)-(\mathcal{B}^{\pi^{N}}\hat{Q})(s,a)\Big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{0,w}(\cdot|s))\bigg{]}\\ =&H_{1}+\frac{H_{2}}{1-\gamma}+\frac{1}{1-\gamma}\tilde{E}_{0,w}^{\infty}\bigg{[}\Big{|}\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)\Big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}\lvert\rvert\pi^{D}_{0,w})(\cdot)\bigg{]}\\ \overset{\text{Assump.}}{\leq}&H_{1}+\frac{H_{2}}{1-\gamma}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q},w}^{0,\infty}+2L_{\mathcal{A}}W_{1,w}^{0,\infty}\Big{)},\end{split} (16)

where ρ^0,w(s)=eweδ(ss0e)ewe\hat{\rho}_{0,w}(s)=\frac{\sum_{e}w_{e}\delta(s-s^{e}_{0})}{\sum_{e}w_{e}} is the weighted empirical initial state distribution. Because |Q^(s,a)QπN(s,a)|[0,rmax1γ]|\hat{Q}(s,a)-Q^{\pi^{N}}(s,a)|\in[0,\frac{r^{\max}}{1-\gamma}], we know 𝔼aπ0,wD(|s)|Q^(s,a)QπN(s,a)|[0,rmax1γ]\mathbb{E}_{a\sim\pi^{D}_{0,w}(\cdot|s)}|\hat{Q}(s,a)-Q^{\pi^{N}}(s,a)|\in[0,\frac{r^{\max}}{1-\gamma}], too. By Hoeffding’s inequality, with probability greater than 1δ1-\delta,

H1=(𝔼s0ρ0𝔼s0ρ^0,w)𝔼a0π0,wD(|s0)|Q^(s0,a0)QπN(s0,a0)|1ewee(wermax1γ)22log1δ=rmax1γewe22(ewe)2log1δ.\begin{split}H_{1}&=\big{(}\mathbb{E}_{s_{0}\sim\rho_{0}}-\mathbb{E}_{s_{0}\sim\hat{\rho}_{0,w}}\big{)}\mathbb{E}_{a_{0}\sim\pi^{D}_{0,w}(\cdot|s_{0})}\big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\big{|}\\ &\leq\frac{1}{\sum_{e}w_{e}}\sqrt{\frac{\sum_{e}\Big{(}w_{e}\frac{r^{\max}}{1-\gamma}\Big{)}^{2}}{2}\log\frac{1}{\delta}}=\frac{r^{\max}}{1-\gamma}\sqrt{\frac{\sum_{e}w_{e}^{2}}{2(\sum_{e}w_{e})^{2}}\log\frac{1}{\delta}}.\end{split} (17)

Also, let f(s,a)=|Q^(s,a)(πNQ^)(s,a)|+2L𝒜W1(πN(|s)||π0,wD(|s))f(s,a)=\big{|}\hat{Q}(s,a)-(\mathcal{B}^{\pi^{N}}\hat{Q})(s,a)\big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{0,w}(\cdot|s)). Then f(s,a)f(s,a) is bounded in [0,rmax1γ+2L𝒜diam𝒜][0,\frac{r^{\max}}{1-\gamma}+2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}]. Using a weighted version of the Azuma-Hoeffding argument in Lemma 5, we have that with probability greater than 1δ1-\delta,

H2=(𝔼(s,a)ρρ^0,wπ0,wD(s)π0,wD(a|s)E~0,w)[|Q^(s,a)(πNQ^)(s,a)|+2L𝒜W1(πN(|s)||π0,wD(|s))](rmax1γ+2L𝒜diam𝒜)2ewe2(ewe)2log1δ\begin{split}H_{2}=&\Big{(}\mathbb{E}_{(s,a)\sim\rho^{\pi^{D}_{0,w}}_{\hat{\rho}_{0,w}}(s)\pi^{D}_{0,w}(a|s)}-\tilde{E}_{0,w}^{\infty}\Big{)}\bigg{[}\Big{|}\hat{Q}(s,a)-(\mathcal{B}^{\pi^{N}}\hat{Q})(s,a)\Big{|}+2L_{\mathcal{A}}W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi^{D}_{0,w}(\cdot|s))\bigg{]}\\ \leq&\Big{(}\frac{r^{\max}}{1-\gamma}+2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}\Big{)}\sqrt{\frac{2\sum_{e}w_{e}^{2}}{(\sum_{e}w_{e})^{2}}\log\frac{1}{\delta}}\end{split} (18)

Combining Eq. (16), (17) and (18), a union bound implies with probability greater than 12δ1-2\delta,

𝔼(s0,a0)ρ0(s)π0,wD(a|s)|Q^(s0,a0)QπN(s0,a0)|rmax1γewe22(ewe)2log1δ\displaystyle\mathbb{E}_{(s_{0},a_{0})\sim\rho_{0}(s)\pi^{D}_{0,w}(a|s)}\Big{|}\hat{Q}(s_{0},a_{0})-Q^{\pi^{N}}(s_{0},a_{0})\Big{|}\leq\frac{r^{\max}}{1-\gamma}\sqrt{\frac{\sum_{e}w_{e}^{2}}{2(\sum_{e}w_{e})^{2}}\log\frac{1}{\delta}}
+(rmax(1γ)2+2L𝒜diam𝒜1γ)2ewe2(ewe)2log1δ+11γ(ϵQ^,w0,+2L𝒜W1,w0,)\displaystyle\quad\quad\quad\quad\quad+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\sqrt{\frac{2\sum_{e}w_{e}^{2}}{(\sum_{e}w_{e})^{2}}\log\frac{1}{\delta}}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q},w}^{0,\infty}+2L_{\mathcal{A}}W_{1,w}^{0,\infty}\Big{)}

Finally, rescaling δ\delta to δ/2\delta/2 finishes the proof. ∎

Corollary 2.

Let E~i,wL\tilde{E}_{i,w}^{L} be the operator defined as E~i,wLf=1eweej=iL1(1γ)γjiwef(sje,aje)\tilde{E}_{i,w}^{L}f=\frac{1}{\sum_{e}w_{e}}\sum_{e}\sum_{j=i}^{L-1}(1-\gamma)\gamma^{j-i}w_{e}f(s_{j}^{e},a_{j}^{e}). Fix assumptions (1) (2) (3) of Theorem 1. Rewrite the policy mismatch error and the Bellman error as W1,wi,L=E~i,wLW1(πN||πi,wD)()W_{1,w}^{i,L}=\tilde{E}_{i,w}^{L}W_{1}(\pi^{N}\lvert\rvert\pi_{i,w}^{D})(\cdot) and ϵQ^,wi,L=E~i,wL|Q^(,)(πNQ^)(,)|\epsilon_{\hat{Q},w}^{i,L}=\tilde{E}_{i,w}^{L}|\hat{Q}(\cdot,\cdot)-(\mathcal{B}^{\pi^{N}}\hat{Q})(\cdot,\cdot)|, respectively, where W1(πN||πi,wD)(s)=W1(πN(|s)||πi,wD(|s))W_{1}(\pi^{N}\lvert\rvert\pi_{i,w}^{D})(s)=W_{1}(\pi^{N}(\cdot|s)\lvert\rvert\pi_{i,w}^{D}(\cdot|s)). Let ρ¯i,w(s)\overline{\rho}_{i,w}(s) be the average weighted state density at trajectory step ii. Then, with probability greater than 1δ1-\delta, we have

𝔼(si,ai)ρ¯i,w(s)πi,wD(a|s)|Q^(si,ai)QπN(si,ai)|rmax1γewe22(ewe)2log2δ\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i,w}(s)\pi_{i,w}^{D}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}\leq\frac{r^{\max}}{1-\gamma}\sqrt{\frac{\sum_{e}w_{e}^{2}}{2(\sum_{e}w_{e})^{2}}\log\frac{2}{\delta}}
+(rmax(1γ)2+2L𝒜diam𝒜1γ)(2ewe2(ewe)2log2δ+γLi)+11γ(ϵQ^,wi,L+2L𝒜W1,wi,L).\displaystyle\quad\quad\quad+\Big{(}\frac{r^{\max}}{(1-\gamma)^{2}}+\frac{2L_{\mathcal{A}}\text{diam}_{\mathcal{A}}}{1-\gamma}\Big{)}\Big{(}\sqrt{\frac{2\sum_{e}w_{e}^{2}}{(\sum_{e}w_{e})^{2}}\log\frac{2}{\delta}}+\gamma^{L-i}\Big{)}+\frac{1}{1-\gamma}\Big{(}\epsilon_{\hat{Q},w}^{i,L}+2L_{\mathcal{A}}W_{1,w}^{i,L}\Big{)}.

Moreover, if iLlogϵlogγi\leq L-\frac{\log\epsilon}{\log\gamma} and the constants are normalized as L𝒜=c/(1γ)L_{\mathcal{A}}=c/(1-\gamma), ϵQ^,wi,L=ξQ^,wi,L/(1γ)\epsilon_{\hat{Q},w}^{i,L}=\xi_{\hat{Q},w}^{i,L}/(1-\gamma), then, with probability greater than 1δ1-\delta, we have

𝔼(si,ai)ρ¯i,w(s)πi,wD(a|s)|Q^(si,ai)QπN(si,ai)|\displaystyle\mathbb{E}_{(s_{i},a_{i})\sim\overline{\rho}_{i,w}(s)\pi_{i,w}^{D}(a|s)}\Big{|}\hat{Q}(s_{i},a_{i})-Q^{\pi^{N}}(s_{i},a_{i})\Big{|}
\displaystyle\leq O~(1(1γ)2((rmax+cdiam𝒜)((ewe2)/(ewe)2+ϵ)+ξQ^,wi,L+cW1,wi,L))\displaystyle\tilde{O}\Big{(}\frac{1}{(1-\gamma)^{2}}\Big{(}(r^{\max}+c\cdot\text{diam}_{\mathcal{A}})\Big{(}\sqrt{(\sum_{e}w_{e}^{2})/(\sum_{e}w_{e})^{2}}+\epsilon\Big{)}+\xi_{\hat{Q},w}^{i,L}+c\cdot W_{1,w}^{i,L}\Big{)}\Big{)}
Proof.

Observe that for any bounded ff:

E~i,wfE~i,wLf+γLif\tilde{E}_{i,w}^{\infty}f\leq\tilde{E}_{i,w}^{L}f+\gamma^{L-i}\|f\|_{\infty}

Thus, we can start from Theorem 2 and prove with the same argument in Corollary 1. ∎

A.6 Proof of Emphasizing Recent Experience

Proposition 1.

The ERE strategy in Wang and Ross (2019) is equivalent to a non-uniform sampling with weight wtw_{t}:

wt11ηL0/K(1max(t,cmin,N0ηL0)1N0)+𝟙(tcmin)Kcminmax(1lncmin/N0L0lnη,0),\displaystyle w_{t}\propto\frac{1}{1-\eta^{L_{0}/K}}\Big{(}\frac{1}{\max(t,c_{\min},N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}\Big{)}+\frac{\mathbbm{1}(t\leq c_{\min})K}{c_{\min}}\max\Big{(}1-\frac{\ln c_{\min}/N_{0}}{L_{0}\ln\eta},~{}0\Big{)},

where tt is the age of a data point relative to the newest time step; i.e., w0w_{0} is the weight of the newest sample. N0N_{0} is the size of the experience replay buffer. L0=1000L_{0}=1000 is the maximum horizon of the environment. KK is the length of the recent trajectory. η0.996\eta\approx 0.996 is the decay parameter. cmin5000c_{\min}\approx 5000 is the minimum coverage of the sampling. Moreover, the ERE strategy can be approximated (by a first-order Taylor approximation) as

wt1max(t,cmin,N0ηL0)1N0+𝟙(tcmin)cminmax(lncminN0ηL0,0)\displaystyle w_{t}\propto\frac{1}{\max(t,c_{\min},N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}+\frac{\mathbbm{1}(t\leq c_{\min})}{c_{\min}}\max\Big{(}\ln\frac{c_{\min}}{N_{0}\eta^{L_{0}}},~{}0\Big{)}
Proof.

Recall that Wang and Ross (2019) consider performing KK updates in each episode. In the kkth update, the data is sampled uniformly from the most recent ck=max(N0ηkL0K,cmin)c_{k}=\max(N_{0}\eta^{k\frac{L_{0}}{K}},c_{\min}) points.

To compute the aggregated weight wtw_{t} over these KK updates, observe that a data point of age tt is in the most recent ckc_{k} points if cktc_{k}\geq t and that the weight in each uniform sample is 1/ck1/c_{k}. Therefore, wtw_{t} should be proportional to

wtk:1kK,ckt1ck.w_{t}\propto\sum_{\begin{subarray}{c}k:~{}1\leq k\leq K,\\ ~{}c_{k}\geq t\end{subarray}}\frac{1}{c_{k}}. (19)

Because ckc_{k} is designed to be lower bounded by cminc_{\min}, we shall discuss Eq. (19) by cases.

  1. (1)

    When t>cmint>c_{\min}, we know ck>cminc_{k}>c_{\min} because cktc_{k}\geq t is a constraint in the sum. This means ck=N0ηkL0/Kc_{k}=N_{0}\eta^{kL_{0}/K} and hence cktc_{k}\geq t is equivalent to klntN0/lnηL0/Kk\leq\ln\frac{t}{N_{0}}/\ln\eta^{L_{0}/K}. Eq. (19) becomes

    k:1kK,ckt1ck=k=1min(K,lntN0/lnηL0/K)1N0ηL0k/K=()k=1min(K,lntN0/lnξ)1N0ξk={ξ1N01ξlogξtN01ξ1=1/t1/N01ηL0/KifK>lntN0/lnηL0/Kξ1N01ξK1ξ1=ηL0/N01/N01ηL0/KifKlntN0/lnηL0/K={1/t1/N01ηL0/Kift>N0ηL01/(N0ηL0)1/N01ηL0/KiftN0ηL0=11ηL0/K(1max(t,N0ηL0)1N0)\begin{split}&\sum_{\begin{subarray}{c}k:~{}1\leq k\leq K,\\ ~{}c_{k}\geq t\end{subarray}}\frac{1}{c_{k}}=\sum_{k=1}^{\min(K,~{}\ln\frac{t}{N_{0}}/\ln\eta^{L_{0}/K})}\frac{1}{N_{0}}\eta^{-L_{0}k/K}\overset{(*)}{=}\sum_{k=1}^{\min(K,~{}\ln\frac{t}{N_{0}}/\ln\xi)}\frac{1}{N_{0}}\xi^{-k}\\ =&\begin{cases}\frac{\xi^{-1}}{N_{0}}\frac{1-\xi^{-\log_{\xi}\frac{t}{N_{0}}}}{1-\xi^{-1}}=\frac{1/t-1/N_{0}}{1-\eta^{L_{0}/K}}&\text{if}~{}K>\ln\frac{t}{N_{0}}/\ln\eta^{L_{0}/K}\\ \frac{\xi^{-1}}{N_{0}}\frac{1-\xi^{-K}}{1-\xi^{-1}}=\frac{\eta^{-L_{0}}/N_{0}-1/N_{0}}{1-\eta^{L_{0}/K}}&\text{if}~{}K\leq\ln\frac{t}{N_{0}}/\ln\eta^{L_{0}/K}\end{cases}\\ =&\begin{cases}\frac{1/t-1/N_{0}}{1-\eta^{L_{0}/K}}&\text{if}~{}t>N_{0}\eta^{L_{0}}\\ \frac{1/(N_{0}\eta^{L_{0}})-1/N_{0}}{1-\eta^{L_{0}/K}}&\text{if}~{}t\leq N_{0}\eta^{L_{0}}\end{cases}=\frac{1}{1-\eta^{L_{0}/K}}\Big{(}\frac{1}{\max(t,~{}N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}\Big{)}\end{split}

    where (*) is a substitution: ξ=ηL0/K\xi=\eta^{L_{0}/K}.

  2. (2)

    When tcmint\leq c_{\min}, we know cktc_{k}\geq t for all kk because ckcminc_{k}\geq c_{\min} by definition. Eq. (19) becomes

    k:1kK,ckt1ck=k=1K1ck=()k=1min(K,lncminN0/lnξ)1N0ξk+max(KlncminN0/lnξ,0)1cmin=()11ηL0/K(1max(cmin,N0ηL0)1N0)+max(KKlncmin/N0L0lnη,0)1cmin=11ηL0/K(1max(cmin,N0ηL0)1N0)+Kcminmax(1lncmin/N0L0lnη,0),\begin{split}&\sum_{\begin{subarray}{c}k:~{}1\leq k\leq K,\\ ~{}c_{k}\geq t\end{subarray}}\frac{1}{c_{k}}=\sum_{k=1}^{K}\frac{1}{c_{k}}\\ \overset{(*)}{=}&\sum_{k=1}^{\min(K,~{}\ln\frac{c_{\min}}{N_{0}}/\ln\xi)}\frac{1}{N_{0}}\xi^{-k}+\max(K-\ln\frac{c_{\min}}{N_{0}}/\ln\xi,~{}0)\frac{1}{c_{\min}}\\ \overset{(**)}{=}&\frac{1}{1-\eta^{L_{0}/K}}\Big{(}\frac{1}{\max(c_{\min},~{}N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}\Big{)}+\max(K-K\frac{\ln c_{\min}/N_{0}}{L_{0}\ln\eta},~{}0)\frac{1}{c_{\min}}\\ =&\frac{1}{1-\eta^{L_{0}/K}}\Big{(}\frac{1}{\max(c_{\min},~{}N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}\Big{)}+\frac{K}{c_{\min}}\max\Big{(}1-\frac{\ln c_{\min}/N_{0}}{L_{0}\ln\eta},~{}0\Big{)},\end{split}

    where (*) substitutes ξ=ηL0/K\xi=\eta^{L_{0}/K} and splits the sum, and (**) reuses the analysis in case (1).

Combining cases (1) and (2), we arrive at the first conclusion. As for the approximation, since η0.996\eta\approx 0.996, let η=1κ\eta=1-\kappa. We have

1ηL0/K=1(1κ)L0/K11+κL0/K=κL0KL0Kln(1κ)=L0Klnη1-\eta^{L_{0}/K}=1-(1-\kappa)^{L_{0}/K}\approx 1-1+\kappa L_{0}/K=\kappa\frac{L_{0}}{K}\approx-\frac{L_{0}}{K}\ln(1-\kappa)=-\frac{L_{0}}{K}\ln\eta

Thus the first term in the conclusion of Prop. 1 is proportional to KK, and so is the second term. Since wtw_{t} is only defined up to a proportionality constant and both terms on the RHS are proportional to KK, we can drop the factor KK on the RHS:

wt1L0lnη(1max(t,cmin,N0ηL0)1N0)+𝟙(tcmin)cminmax(1lncmin/N0L0lnη,0)\displaystyle w_{t}\propto\frac{1}{-L_{0}\ln\eta}\Big{(}\frac{1}{\max(t,c_{\min},N_{0}\eta^{L_{0}})}-\frac{1}{N_{0}}\Big{)}+\frac{\mathbbm{1}(t\leq c_{\min})}{c_{\min}}\max\Big{(}1-\frac{\ln c_{\min}/N_{0}}{L_{0}\ln\eta},~{}0\Big{)}

Finally, because 0<η<10<\eta<1, L0lnη-L_{0}\ln\eta is a positive number, and the above expression can be further simplified by multiplying the RHS by L0lnη-L_{0}\ln\eta, yielding the result. ∎
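To see the equivalence concretely, the sketch below evaluates the exact ERE-equivalent weight of Proposition 1 and its Taylor approximation; it reuses the constants quoted above except for a smaller, hypothetical replay size N0 = 100,000 chosen so that both branches of the max are exercised.

import math

N0, L0, K, eta, c_min = 100_000, 1000, 64, 0.996, 5000   # N0 chosen for illustration

def w_exact(t):
    term1 = (1.0 / max(t, c_min, N0 * eta**L0) - 1.0 / N0) / (1.0 - eta**(L0 / K))
    if t <= c_min:
        term2 = (K / c_min) * max(1.0 - math.log(c_min / N0) / (L0 * math.log(eta)), 0.0)
    else:
        term2 = 0.0
    return term1 + term2

def w_approx(t):
    term1 = 1.0 / max(t, c_min, N0 * eta**L0) - 1.0 / N0
    if t <= c_min:
        term2 = (1.0 / c_min) * max(math.log(c_min / (N0 * eta**L0)), 0.0)
    else:
        term2 = 0.0
    return term1 + term2

ages   = [1, 2_000, 5_000, 20_000, 80_000]
exact  = [w_exact(t)  for t in ages]
approx = [w_approx(t) for t in ages]
z_e, z_a = sum(exact), sum(approx)
for t, e, a in zip(ages, exact, approx):
    print(f"age {t:>6}: exact {e / z_e:.4f}  approx {a / z_a:.4f}")

After normalization, the two weightings agree to within a few percent, which is what the Taylor approximation suggests.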

Proposition 2.

Let {wt>0}t=1N\{w_{t}>0\}_{t=1}^{N} be the weights of the data indexed by tt. Then the Hoeffding error t=1Nwt2/t=1Nwt\sqrt{\sum_{t=1}^{N}w_{t}^{2}}/\sum_{t=1}^{N}w_{t} is minimized when the weights are equal: wt=c>0,tw_{t}=c>0,~{}\forall~{}t.

Proof.

Let w=[w1,,wN]w=[w_{1},...,w_{N}]^{\top} be the weight vector and f(w)=ww/(𝟙w)f(w)=\sqrt{w^{\top}w}/(\mathbbm{1}^{\top}w) be the Hoeffding error. Observe that f(w)f(w) is of the form:

f(w)=w/(𝟙w)=w/(𝟙w),f(w)=\|w\|/(\mathbbm{1}^{\top}w)=\|w/(\mathbbm{1}^{\top}w)\|,

where w=ww\|w\|=\sqrt{w^{\top}w} is the 2-norm of ww. Therefore, let z=w/(𝟙w)z=w/(\mathbbm{1}^{\top}w) be the normalized vector. Minimizing f(w)f(w) is equivalent to minimizing g(z)=zg(z)=\|z\| subject to 𝟙z=1\mathbbm{1}^{\top}z=1. By the Lagrange multiplier condition, this happens when

g=zz=λ𝟙,for someλ.\nabla g=\frac{z}{\|z\|}=\lambda\mathbbm{1},~{}\text{for~{}some}~{}\lambda\in\mathbb{R}.

This is achieved when zt=cz_{t}=c for all tt, for some c>0c>0. Therefore, ff is minimized when

zt=wt𝟙w=cfor somec.z_{t}=\frac{w_{t}}{\mathbbm{1}^{\top}w}=c~{}\text{for~{}some}~{}c.

Since 𝟙w{\mathbbm{1}^{\top}w} does not depend on tt, we conclude that the minimum is attained at wt=c>0,t.w_{t}=c>0,~{}\forall t. ∎
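Proposition 2 is easy to verify numerically; the short check below (with synthetic weighting schemes of our own choosing) computes the Hoeffding error for uniform, 1/age, and recent-only weights and confirms that uniform weights attain the minimum 1/sqrt(N).

import numpy as np

def hoeffding_error(w):
    w = np.asarray(w, dtype=float)
    return np.sqrt((w ** 2).sum()) / w.sum()

N = 1000
ages = np.arange(1, N + 1)
schemes = {
    "uniform":     np.ones(N),
    "1/age":       1.0 / ages,
    "recent-only": np.where(ages <= 100, 1.0, 1e-12),   # tiny mass keeps all weights positive
}
for name, w in schemes.items():
    print(f"{name:>12}: {hoeffding_error(w):.4f}")
# Uniform gives 1/sqrt(N) ~ 0.0316; the more concentrated the weights, the larger the error.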