
Improved Regret Bound and Experience Replay in Regularized Policy Iteration

Nevena Lazić (DeepMind), Dong Yin (DeepMind), Yasin Abbasi-Yadkori (DeepMind), Csaba Szepesvári (DeepMind and University of Alberta)
Abstract

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.

1 Introduction

Model-free reinforcement learning (RL) algorithms combined with powerful function approximation have achieved impressive performance in a variety of application domains over the last decade. Unfortunately, the theoretical understanding of such methods is still quite limited. In this work, we study single-trajectory learning in infinite-horizon undiscounted Markov decision processes (MDPs), also known as average-reward MDPs, which capture tasks such as routing and the control of physical systems.

One line of work with performance guarantees for the average-reward setting follows the “online MDP” approach proposed by Even-Dar et al. (2009), where the agent selects policies by running an online learning algorithm in each state, typically mirror descent. The resulting algorithm is a version of approximate policy iteration (API), which alternates between (1) estimating the state-action value function (or Q-function) of the current policy and (2) setting the next policy to be optimal w.r.t. the sum of all previous Q-functions plus a regularizer. Note that, by contrast, standard API sets the next policy based only on the most recent Q-function. The policy update can also be written as maximizing the most recent Q-function minus the KL-divergence to the previous policy, which is somewhat similar to recently popular versions of API (Schulman et al., 2015, 2017; Achiam et al., 2017; Abdolmaleki et al., 2018; Song et al., 2019a).

The original work of Even-Dar et al. (2009) studied this scheme with known dynamics, a tabular representation, and adversarial reward functions. More recent works (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Wei et al., 2020a) have adapted the approach to the case of unknown dynamics, stochastic rewards, and value function approximation. With linear value functions, the Politex algorithm of Abbasi-Yadkori et al. (2019) achieves $O(T^{3/4})$ high-probability regret in ergodic MDPs, and the bound scales with the number of features rather than the number of states. Wei et al. (2020a) later show an $O(\sqrt{T})$ bound on expected regret for a similar algorithm named MDP-EXP2. In this work, we revisit the analysis of Politex and show that it can be sharpened to $O(\sqrt{T})$ under nearly identical assumptions, resulting in the first $O(\sqrt{T})$ high-probability regret bound for a computationally efficient algorithm in this setting.

In addition to the improved analysis, our work also addresses the practical implementation of Politex with neural networks. The policies produced by Politex in each iteration require access to the sum of all previous Q-function estimates. With neural network function approximation, an exact implementation requires us to keep all past networks in memory and evaluate them at each step, which is inefficient in terms of memory and computation. Practical implementation choices include subsampling the Q-functions and/or optimizing a KL-divergence regularized objective w.r.t. a parametric policy at each iteration. We propose an alternative approach, where we approximate the average of all past Q-functions by training a single network on a replay buffer with past data. We demonstrate that this choice often outperforms other approximate implementations, especially in terms of run-time. When available memory is constrained, we propose to subsample transitions using the notion of coresets (Bachem et al., 2017).

Our work also provides a novel perspective on the benefits of experience replay. Experience replay is a standard tool for stabilizing learning in modern deep RL, and is typically used in off-policy methods like Deep Q-Networks (Mnih et al., 2013), as well as “value gradient” methods such as DDPG (Lillicrap et al., 2016) and SVG (Heess et al., 2015). A different line of on-policy methods typically does not rely on experience replay; instead, learning is stabilized by constraining consecutive policies to be close in terms of KL divergence (Schulman et al., 2015; Song et al., 2019a; Degrave et al., 2019). We observe that both experience replay and KL-divergence regularization can be viewed as approximate implementations of Politex. Thus, we provide a theoretical justification for using experience replay in API, as an approximate implementation of online learning in each state. Note that this online-learning view differs from the commonly used justification for experience replay, namely that it “breaks temporal correlations” (Schaul et al., 2016; Mnih et al., 2013). Our analysis also suggests a new objective for subsampling or priority-sampling transitions in the replay buffer, which differs from the priority-sampling objectives of previous work (Schaul et al., 2016).

In summary, our main contributions are (1) an improved analysis of Politex, showing an $O(\sqrt{T})$ regret bound under the same assumptions as the original work, and (2) an efficient implementation that also offers a new perspective on the benefits of experience replay.

2 Setting and Notation

We consider learning in infinite-horizon undiscounted ergodic MDPs $(\mathcal{X},\mathcal{A},r,P)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is a finite action space, $r:\mathcal{X}\times\mathcal{A}\to[0,1]$ is an unknown reward function, and $P:\mathcal{X}\times\mathcal{A}\to\Delta_{\mathcal{X}}$ is the unknown probability transition function. A policy $\pi:\mathcal{X}\to\Delta_{\mathcal{A}}$ is a mapping from a state to a distribution over actions. Let $\{(x_{t}^{\pi},a_{t}^{\pi})\}_{t=1}^{\infty}$ denote the state-action sequence obtained by following policy $\pi$. The expected average reward of policy $\pi$ is defined as

J_{\pi}:=\lim_{T\to\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}r(x_{t}^{\pi},a_{t}^{\pi})\right].   (2.1)

Let $\mu_{\pi}$ denote the stationary state distribution of a policy $\pi$, satisfying $\mu_{\pi}=\mathbb{E}_{x\sim\mu_{\pi},a\sim\pi}[P(\cdot|x,a)]$. We will sometimes write $\mu_{\pi}$ as a vector, and use $\nu_{\pi}=\mu_{\pi}\otimes\pi$ to denote the stationary state-action distribution. In ergodic MDPs, $J_{\pi}$ and $\mu_{\pi}$ are well-defined and independent of the initial state. The optimal policy $\pi_{*}$ is a policy that maximizes the expected average reward. We will denote by $J_{*}$ and $\mu_{*}$ the expected average reward and stationary state distribution of $\pi_{*}$, respectively.

The value function of a policy $\pi$ is defined as:

V_{\pi}(x):=\mathbb{E}\left[\sum_{t=1}^{\infty}\big(r(x_{t}^{\pi},a_{t}^{\pi})-J_{\pi}\big)\,\Big|\,x_{1}^{\pi}=x\right].   (2.2)

The state-action value function $Q_{\pi}(x,a)$ is defined as

Q_{\pi}(x,a):=r(x,a)-J_{\pi}+\sum_{x^{\prime}}P(x^{\prime}|x,a)V_{\pi}(x^{\prime}).   (2.3)

Notice that

V_{\pi}(x)=\sum_{a}\pi(a|x)Q_{\pi}(x,a).   (2.4)

Equations (2.3) and (2.4) are known as the Bellman equations. If, instead of using the definition of $V_{\pi}(x)$ in Eq. (2.2), we solve for $V_{\pi}(x)$ and $Q_{\pi}(x,a)$ using the Bellman equations, the solutions are unique only up to an additive constant. Therefore, in the following, we may use $V_{\pi}(x)$ and $Q_{\pi}(x,a)$ to denote the same function up to a constant. We will use the shorthand notation $Q_{\pi}(x,\pi^{\prime})=\mathbb{E}_{a\sim\pi^{\prime}(\cdot|x)}[Q_{\pi}(x,a)]$; note that $Q_{\pi}(x,\pi)=V_{\pi}(x)$.

The agent interacts with the environment as follows: at each round $t$, the agent observes a state $x_{t}\in\mathcal{X}$, chooses an action $a_{t}\sim\pi_{t}(\cdot|x_{t})$, and receives a reward $r_{t}:=r(x_{t},a_{t})$. The environment then transitions to the next state $x_{t+1}$ with probability $P(x_{t+1}|x_{t},a_{t})$. Recall that $\pi_{*}$ is the optimal policy and $J_{*}$ is its expected average reward. The regret of an algorithm with respect to this fixed policy is defined as

R_{T}:=\sum_{t=1}^{T}\big(J_{*}-r(x_{t},a_{t})\big).   (2.5)

The learning goal is to find an algorithm that minimizes the long-term regret $R_{T}$.

Our analysis will require the following assumption on the mixing rate of policies.

Assumption 2.1 (Uniform mixing).

Let $H_{\pi}$ be the state-action transition matrix of a policy $\pi$. Let $\gamma(H_{\pi})$ be the corresponding ergodicity coefficient (Seneta, 1979), defined as

\gamma(H_{\pi}):=\max_{z:\,z^{\top}{\bf 1}=0,\,\|z\|_{1}=1}\|z^{\top}H_{\pi}\|_{1}.

We assume that there exists a scalar $\gamma<1$ such that for any policy $\pi$, $\gamma(H_{\pi})\leq\gamma<1$.

Assumption 2.1 implies that for any pair of distributions $\nu,\nu^{\prime}$, $\|(\nu-\nu^{\prime})^{\top}H_{\pi}\|_{1}\leq\gamma\|\nu-\nu^{\prime}\|_{1}$; see Lemma A.1 in Appendix A for a proof.
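As a concrete illustration, the following is a minimal numerical sketch (the transition matrix below is purely illustrative) that computes the ergodicity coefficient via its standard closed form, half the largest $\ell_{1}$ distance between rows, and checks the contraction property above on random pairs of distributions:

import numpy as np

def ergodicity_coefficient(H):
    # gamma(H) = max_{z: z^T 1 = 0, ||z||_1 = 1} ||z^T H||_1, which equals
    # half the largest l1 distance between any two rows of H.
    n = H.shape[0]
    return max(0.5 * np.abs(H[i] - H[j]).sum() for i in range(n) for j in range(n))

# A small illustrative 3-state transition matrix (rows sum to 1).
H = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
gamma = ergodicity_coefficient(H)

# Check the contraction property ||(nu - nu')^T H||_1 <= gamma * ||nu - nu'||_1.
rng = np.random.default_rng(0)
for _ in range(1000):
    nu, nu_p = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    assert np.abs((nu - nu_p) @ H).sum() <= gamma * np.abs(nu - nu_p).sum() + 1e-12
print("gamma(H) =", gamma)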

3 Related Work

Regret bounds for average-reward MDPs. Most no-regret algorithms for infinite-horizon undiscounted MDPs are model-based and only applicable to tabular representations (Bartlett, 2009; Jaksch et al., 2010; Ouyang et al., 2017; Fruit et al., 2018; Jian et al., 2019; Talebi and Maillard, 2018). For weakly-communicating MDPs with diameter $D$, these algorithms nearly achieve the minimax lower bound $\Omega(\sqrt{D|\mathcal{X}||\mathcal{A}|T})$ (Jaksch et al., 2010) with high probability. Wei et al. (2020b) provide model-free algorithms with regret bounds in the tabular setting. In the model-free setting with function approximation, the Politex algorithm (Abbasi-Yadkori et al., 2019) achieves $O(d^{1/2}T^{3/4})$ regret in uniformly ergodic MDPs, where $d$ is the size of the compressed state-action space. Hao et al. (2020) improve these results to $O(T^{2/3})$. More recently, Wei et al. (2020a) present three algorithms for average-reward MDPs with linear function approximation. Among these, FOPO achieves $O(\sqrt{T})$ regret but is computationally inefficient, and OLSVI.FH is efficient but obtains $O(T^{3/4})$ regret. The MDP-EXP2 algorithm is computationally efficient, and under similar assumptions as in Abbasi-Yadkori et al. (2019) it obtains an $O(\sqrt{T})$ bound on expected regret (a weaker guarantee than the high-probability bounds in other works). Our analysis shows a high-probability $O(\sqrt{T})$ regret bound under the same assumptions.

KL-regularized approximate policy iteration. Our work is also related to approximate policy iteration algorithms which constrain each policy to be close to the previous policy in the sense of KL divergence. This approach was popularized by TRPO (Schulman et al., 2015), where it was motivated as an approximate implementation of conservative policy iteration (Kakade and Langford, 2002). Some of the related subsequent works include PPO (Schulman et al., 2017), MPO (Abdolmaleki et al., 2018), V-MPO (Song et al., 2019a), and CPO (Achiam et al., 2017). While these algorithms place a constraint on consecutive policies and are mostly heuristic, another line of research shows that using KL divergence as a regularizer has a theoretical justification in terms of either regret guarantees (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Wei et al., 2020b, a) or error propagation (Vieillard et al., 2020a, b).

Experience replay. Experience replay (Lin, 1992) is one of the central tools for achieving good performance in deep reinforcement learning. While it is mostly used in off-policy methods such as deep Q-learning (Mnih et al., 2013, 2015), it has also shown benefits in value-gradient methods (Heess et al., 2015; Lillicrap et al., 2016), and has been used in some variants of KL-regularized policy iteration (Abdolmaleki et al., 2018; Tomar et al., 2020). Its success has been attributed to removing some temporal correlations from data fed to standard gradient-based optimization algorithms. Schaul et al. (2016) have shown that non-uniform replay sampling based on the Bellman error can improve performance. Unlike these works, we motivate experience replay from the perspective of online learning in MDPs (Even-Dar et al., 2009) with the goal of approximating the average of past value functions well.

Continual learning. Continual learning (CL) is the paradigm of learning a classifier or regressor that performs well on a set of tasks, where each task corresponds to a different data distribution. The tasks are observed sequentially, and the learning goal is to avoid forgetting past tasks without storing all the data in memory. This is quite similar to our goal of approximating the average of sequentially-observed Q-functions, where the data for approximating each Q-function has a different distribution. In general, approaches to CL can be categorized as regularization-based (Kirkpatrick et al., 2017; Zenke et al., 2017; Farajtabar et al., 2020; Yin et al., 2020), expansion-based (Rusu et al., 2016), and replay-based (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2018; Borsos et al., 2020), with the approaches based on experience replay typically having superior performance over other methods.

4 Algorithm

Our algorithm is similar to the Politex schema (Abbasi-Yadkori et al., 2019). In each phase $k$, Politex obtains an estimate $\widehat{Q}_{\pi_{k}}$ of the action-value function $Q_{\pi_{k}}$ of the current policy $\pi_{k}$, and then sets the next policy using the mirror descent update rule:

\pi_{k+1}(\cdot|x) = \mathop{\mathrm{argmax}}_{\pi}\ \widehat{Q}_{\pi_{k}}(x,\pi)-\eta^{-1}D_{KL}(\pi\,\|\,\pi_{k}(\cdot|x))
                   = \mathop{\mathrm{argmax}}_{\pi}\ \sum_{i=1}^{k}\widehat{Q}_{\pi_{i}}(x,\pi)+\eta^{-1}\mathcal{H}(\pi)
                   \propto \exp\bigg(\eta\sum_{i=1}^{k}\widehat{Q}_{\pi_{i}}(x,\cdot)\bigg),   (4.1)

where $\mathcal{H}(\cdot)$ is the entropy function. When the functions $\{\widehat{Q}_{\pi_{i}}\}_{i=1}^{k}$ are tabular or linear, the above update can be implemented efficiently by simply summing all the table entries or weight vectors. However, with neural network function approximation, we need to keep all networks in memory and evaluate them at each step, which quickly becomes inefficient in terms of storage and computation. Efficient implementations proposed in the literature include keeping a subset of the action-value functions (Abbasi-Yadkori et al., 2019), and using a parametric policy and optimizing the KL-regularized objective w.r.t. the parameters over the available data (Tomar et al., 2020).

Our proposed method, presented in Algorithm 1, attempts to directly approximate the average of all previous action-value functions

Q_{k}(x,a):=\frac{1}{k}\sum_{i=1}^{k}Q_{\pi_{i}}(x,a).

To do so, we only use a single network, and continually train it to approximate $Q_{k}$. At each iteration $k$, we obtain a dataset of tuples $\mathcal{D}_{k}=\{(x_{t},a_{t},R_{t})\}$, where $R_{t}$ is the empirical return from the state-action pair $(x_{t},a_{t})$. We initialize $\widehat{Q}_{k}$ to $\widehat{Q}_{k-1}$ and update it by minimizing the squared error over the union of $\mathcal{D}_{k}$ and the replay buffer $\mathcal{R}$. We then update the replay buffer with all or a subset of the data in $\mathcal{D}_{k}$.

The remainder of the paper is organized as follows. In Section 5, we show that by focusing on estimating the average Q-function, we can improve the regret bound of Politex under nearly identical assumptions. In Section 6, we instantiate this bound for linear value functions, where we can estimate the average Q-function simply by averaging weights. In Section 7, we focus on practical implementations: we discuss the limitations of weight averaging and provide details on how to leverage replay data with non-linear function approximation. In Section 8 we present our experimental results, and in Section 9 we make final remarks.

Algorithm 1 Schema for policy iteration with replay
1:  Input: phase length $\tau$, num. phases $K$, parameter $\eta$
2:  Initialize: $\pi_{1}(a|x)=1/|\mathcal{A}|$, empty replay buffer $\mathcal{R}$
3:  for $k=1,\ldots,K$ do
4:     Execute $\pi_{k}$ for $\tau$ time steps and collect data $\mathcal{D}_{k}$
5:     Compute $\widehat{Q}_{k}$, an estimate of $Q_{k}=\frac{1}{k}\sum_{i=1}^{k}Q_{\pi_{i}}$, from data $\mathcal{D}_{k}$ and replay $\mathcal{R}$
6:     Set $\pi_{k+1}(a|x)\propto\exp\big(\eta k\widehat{Q}_{k}(x,a)\big)$
7:     Update replay $\mathcal{R}$ with $\mathcal{D}_{k}$
8:  end for
9:  Output: $\pi_{K+1}$
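For concreteness, the following is a minimal Python sketch of this schema. The environment interface (reset/step returning the next state and a reward) and the routine fit_average_q, which is assumed to return a callable approximating $Q_{k}=\frac{1}{k}\sum_{i=1}^{k}Q_{\pi_{i}}$ from the replay buffer and the current phase's data, are illustrative placeholders rather than part of the algorithm itself:

import numpy as np

def softmax_policy(q_values, scale):
    # pi(a|x) proportional to exp(scale * Qhat(x, a)); q_values has shape (num_actions,).
    z = scale * q_values
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def policy_iteration_with_replay(env, fit_average_q, tau, num_phases, eta, num_actions, rng):
    replay = []
    q_hat = lambda x: np.zeros(num_actions)  # phase 1: pi_1 is uniform
    x = env.reset()
    for k in range(1, num_phases + 1):
        data = []
        for _ in range(tau):  # line 4: execute pi_k for tau steps
            probs = softmax_policy(q_hat(x), eta * (k - 1))
            a = rng.choice(num_actions, p=probs)
            x_next, r = env.step(a)
            data.append((x, a, r))
            x = x_next
        q_hat = fit_average_q(replay, data, q_hat)  # line 5: estimate Q_k
        replay.extend(data)                         # line 7: here we keep all data
    return q_hat  # final policy: softmax_policy(q_hat(x), eta * num_phases)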

5 Regret Analysis of Politex

In this section, we revisit the regret analysis of Politex (Abbasi-Yadkori et al., 2019) in ergodic average-reward MDPs, and show that it can be improved from $O(T^{3/4})$ to $O(\sqrt{T})$ under similar assumptions. Our analysis relies in part on a simple modification of the regret decomposition. Namely, instead of including the estimation error of each value function $Q_{\pi_{k}}$ in the regret, we consider the error in the running average $Q_{k}$. When this error scales as $O(1/\sqrt{k})$ in a particular weighted norm, the regret of Politex is $O(\sqrt{T})$. As we show in the following section, this bound can be instantiated for linear value functions under the same assumptions as Abbasi-Yadkori et al. (2019).

Assumption 5.1 (Boundedness).

Let $\widehat{Q}_{\pi_{k}}:=k\widehat{Q}_{k}-(k-1)\widehat{Q}_{k-1}$. We assume that there exists a constant $Q_{\max}$ such that for all $k=1,\ldots,K$ and for all $x\in\mathcal{X}$,

\max_{a}\widehat{Q}_{\pi_{k}}(x,a)-\min_{a}\widehat{Q}_{\pi_{k}}(x,a)\leq Q_{\max}.

For the purpose of our algorithm, the functions $\widehat{Q}_{\pi_{k}}$ are unique up to a constant. Thus we can equivalently assume that $\|\widehat{Q}_{\pi_{k}}(x,\cdot)\|_{\infty}\leq Q_{\max}$ for all $x$.

Define $\widehat{V}_{\pi_{k}}(x):=\widehat{Q}_{\pi_{k}}(x,\pi_{k})$. Let $V_{k}(x):=\frac{1}{k}\sum_{i=1}^{k}V_{\pi_{i}}(x)$ and $\widehat{V}_{k}(x):=\frac{1}{k}\sum_{i=1}^{k}\widehat{V}_{\pi_{i}}(x)$ be the average of the state-value functions and its estimate, respectively. We will require the estimation error of the running average to scale as in the following assumption.

Assumption 5.2 (Estimation error).

Let $\mu_{*}$ be the stationary state distribution of the optimal policy $\pi_{*}$. With probability at least $1-\delta$, for a problem-dependent constant $C$, the errors in $\widehat{Q}_{K}$ and $\widehat{V}_{K}$ are bounded as

\mathbb{E}_{x\sim\mu_{*}}[\widehat{V}_{K}(x)-V_{K}(x)] \leq C\sqrt{\log(1/\delta)/K}
\mathbb{E}_{x\sim\mu_{*}}[Q_{K}(x,\pi_{*})-\widehat{Q}_{K}(x,\pi_{*})] \leq C\sqrt{\log(1/\delta)/K}.

Define $S_{\delta}(|\mathcal{A}|,\mu_{*})$ as in Abbasi-Yadkori et al. (2019):

S_{\delta}(|\mathcal{A}|,\mu_{*}):=\sqrt{\frac{\log|\mathcal{A}|}{2}}+\Big\langle\mu_{*},\sqrt{\frac{1}{2}\log\frac{1}{\delta\mu_{*}}}\Big\rangle.

We bound the regret of Politex in the following theorem. Here, recall that $\tau$ is the length of each phase, and $\gamma$ and $\eta$ are defined in Assumption 2.1 and Eq. (4.1), respectively.

Theorem 5.3 (Regret of Politex).

Let Assumptions 2.1, 5.1, and 5.2 hold. For $\tau\geq\frac{\log T}{2\log(1/\gamma)}$ and $\eta=\frac{\sqrt{8\log|\mathcal{A}|}}{Q_{\max}\sqrt{K}}$, for a constant $C_{1}$, with probability at least $1-4\delta$, the regret of Politex in ergodic average-reward MDPs is bounded as

R_{T}\leq\frac{C_{1}(1+Q_{\max})S_{\delta}(|\mathcal{A}|,\mu_{*})\sqrt{\tau}}{(1-\gamma)^{2}}\sqrt{T}.
Proof.

We start by decomposing the cumulative regret, following similar steps as Abbasi-Yadkori et al. (2019):

R_{T}=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}\big[(J_{*}-J_{\pi_{k}})+(J_{\pi_{k}}-r_{t})\big].   (5.1)

The second term $V_{T}=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}(J_{\pi_{k}}-r_{t})$ captures the sum of differences between observed rewards and their long-term averages. In previous work, this term was shown to scale as $O(K\sqrt{\tau})$. We show that the analysis can in fact be tightened to $O(\sqrt{T})$ using improved concentration bounds and the slow-changing nature of the policies. See Lemma B.1 in Appendix B for precise details.

The first term, also called the pseudo-regret in the literature, measures the difference between the expected reward of the reference policy and that of the policies produced by the algorithm. Applying the performance difference lemma (Cao, 1999), we can write each pseudo-regret term as

J_{*}-J_{\pi_{k}} =\mathbb{E}_{x\sim\mu_{*}}\left[Q_{\pi_{k}}(x,\pi_{*})-Q_{\pi_{k}}(x,\pi_{k})\right].

Now, notice that Politex is running exact mirror descent for the loss functions $\widehat{Q}_{\pi_{k}}$. We bridge the pseudo-regret by the $\widehat{Q}_{\pi_{k}}$ terms:

R_{T1a} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\left[Q_{\pi_{k}}(x,\pi_{*})-\widehat{Q}_{\pi_{k}}(x,\pi_{*})\right]   (5.2)
R_{T1b} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\left[\widehat{Q}_{\pi_{k}}(x,\pi_{k})-Q_{\pi_{k}}(x,\pi_{k})\right]   (5.3)
R_{T2}  =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\left[\widehat{Q}_{\pi_{k}}(x,\pi_{*})-\widehat{Q}_{\pi_{k}}(x,\pi_{k})\right].   (5.4)

$R_{T2}$ can be bounded using the regret of mirror descent as in previous work (Abbasi-Yadkori et al., 2019). Setting $\eta=\frac{\sqrt{\log|\mathcal{A}|}}{Q_{\max}\tau\sqrt{2K}}$, and using a union bound over all states, with probability at least $1-\delta$, $R_{T2}$ is bounded as

R_{T2}\leq\tau Q_{\max}S_{\delta}(|\mathcal{A}|,\mu_{*})\sqrt{K}.

Bounding regret due to estimation error. We now focus on bounding $R_{T1}=R_{T1a}+R_{T1b}$ under Assumption 5.2. We have the following:

R_{T1a} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\big[Q_{\pi_{k}}(x,\pi_{*})-\widehat{Q}_{\pi_{k}}(x,\pi_{*})\big]
        =\tau K\,\mathbb{E}_{x\sim\mu_{*},a\sim\pi_{*}}\left[Q_{K}(x,a)-\widehat{Q}_{K}(x,a)\right]
        \leq C\tau\sqrt{K\log(1/\delta)}.
R_{T1b} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\big[\widehat{V}_{\pi_{k}}(x)-V_{\pi_{k}}(x)\big]
        =\tau K\,\mathbb{E}_{x\sim\mu_{*}}\big[\widehat{V}_{K}(x)-V_{K}(x)\big]
        \leq C\tau\sqrt{K\log(1/\delta)}.

We can then obtain the final result by combining the bounds on $R_{T1a}$, $R_{T1b}$, $R_{T2}$, and $V_{T}$ from Appendix B, and using a union bound as well as the fact that $K=T/\tau$. ∎

6 Linear Value Functions

In this section, we show that the estimation error condition in Assumption 5.2 (and thus $O(\sqrt{T})$ regret) can be achieved under similar assumptions as in Abbasi-Yadkori et al. (2019) and Wei et al. (2020a), which we state next.

Assumption 6.1 (Linear value functions).

The action-value function $Q_{\pi}$ of any policy $\pi$ is linear: $Q_{\pi}(x,a)=w_{\pi}^{\top}\phi(x,a)$, where $\phi:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}^{d}$ is a known feature function such that $\max_{x,a}\|\phi(x,a)\|\leq C_{\Phi}$.

Assumption 6.2 (Feature excitation).

There exists a constant $\sigma^{2}$ such that for any policy $\pi$,

\lambda_{\min}\left(\mathbb{E}_{(x,a)\sim\mu_{\pi}\otimes\pi}[\phi(x,a)\phi(x,a)^{\top}]\right)\geq\sigma^{2}>0.

We now describe a simple procedure for estimating the average action-value function $Q_{k}(x,a)=\phi(x,a)^{\top}w_{k}$, where $w_{k}=\frac{1}{k}\sum_{i=1}^{k}w_{\pi_{i}}$, such that the conditions of Assumption 5.2 are satisfied. Essentially, we estimate each $Q_{\pi_{i}}$ using least-squares Monte Carlo and then average the weights. We will use the shorthand notation $\phi_{t}=\phi(x_{t},a_{t})$. Let $\mathcal{H}_{i}$ and $\mathcal{T}_{i}$ be subsets of time indices in phase $i$ (defined later). We estimate $Q_{\pi_{i}}$ as follows:

\widehat{w}_{\pi_{i}} =\bigg(\sum_{t\in\mathcal{H}_{i}}\phi_{t}\phi_{t}^{\top}+\alpha I\bigg)^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}R_{t}   (6.1)

where $R_{t}$ are the empirical $b$-step returns ($b$ is specified later), computed as

R_{t}=\sum_{j=t}^{t+b}(r_{j}-\widehat{J}_{\pi_{i}}),\qquad\widehat{J}_{\pi_{i}}=\frac{1}{|\mathcal{T}_{i}|}\sum_{t\in\mathcal{T}_{i}}r_{t}.   (6.2)

We then estimate $w_{k}$ as $\widehat{w}_{k}=\frac{1}{k}\sum_{i=1}^{k}\widehat{w}_{\pi_{i}}$. Note that for this special case of linear value functions, we do not need to use the replay buffer in Algorithm 1. For analysis purposes, we divide each phase of length $\tau$ into $2m$ blocks of size $b$ and let $\mathcal{H}_{i}$ ($\mathcal{T}_{i}$) be the starting indices of the odd (even) blocks in phase $i$. Due to the gaps between indices and fast mixing, this makes the data almost independent (we make this precise in Appendix C) and the error easier to analyze. In practice, one may simply want to use all the data.
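A minimal NumPy sketch of this estimator is given below. For simplicity, it forms the $b$-step returns from all usable indices in a phase rather than the odd/even block split used only for the analysis; variable names are illustrative:

import numpy as np

def estimate_phase_weights(phis, rewards, b, alpha):
    # Least-squares Monte Carlo estimate of w_{pi_i} for one phase, as in Eqs. (6.1)-(6.2).
    # phis: (tau, d) array of features phi(x_t, a_t); rewards: (tau,) array of rewards r_t.
    tau, d = phis.shape
    J_hat = rewards.mean()                     # estimate of the average reward J_{pi_i}
    usable = tau - b                           # indices with a full b-step return
    returns = np.array([np.sum(rewards[t:t + b + 1] - J_hat) for t in range(usable)])
    Phi = phis[:usable]
    A = Phi.T @ Phi + alpha * np.eye(d)        # regularized design matrix
    return np.linalg.solve(A, Phi.T @ returns)

def average_weights(per_phase_weights):
    # In the linear case, averaging the weights averages the Q-functions:
    # what_k = (1/k) * sum_i what_{pi_i}.
    return np.mean(per_phase_weights, axis=0)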

For a distribution $\mu$, let $\|x\|_{\mu}$ denote the distribution-weighted norm such that $\|x\|_{\mu}^{2}=\sum_{i}\mu_{i}x_{i}^{2}$. Using Jensen’s inequality, we have that $(\mathbb{E}_{x\sim\mu}[q(x)])^{2}\leq\mathbb{E}_{x\sim\mu}[q(x)^{2}]$. Thus, it suffices to bound the $Q$-function error in the distribution-weighted norm, $\|Q_{K}-\widehat{Q}_{K}\|_{\mu_{*}\otimes\pi_{*}}$. Furthermore, given bounded features,

\|\widehat{Q}_{K}-Q_{K}\|_{\mu_{*}\otimes\pi_{*}} \leq C_{\Phi}\|\widehat{w}_{K}-w_{K}\|_{2},

so it suffices to bound the error in the weights. We bound this error in the following Lemma, proven in Appendix C.

Lemma 6.3 (Estimation error for linear functions).

Suppose that Assumptions 2.1, 6.1 and 6.2 hold and that the true action-value weights are bounded as $\|w_{\pi_{i}}\|_{2}\leq C_{w}$ for all $i=1,\ldots,K$. Then for any policy $\pi$, for $\alpha=\sqrt{\tau/K}$, $m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta)$, and $b\geq\frac{\log(T\delta^{-1}(1-\gamma)^{-1})}{\log(1/\gamma)}$, there exists an absolute constant $c$ such that with probability at least $1-\delta$,

\|\widehat{w}_{K}-w_{K}\|_{2}\leq c\sigma^{-2}(C_{w}+C_{\Phi})b\sqrt{\frac{\log(2d/\delta)}{Km}}.

Furthermore, in Appendix D, we show that the error in the average state-value function satisfies the following:

\mathbb{E}_{\mu_{*}}[\widehat{V}_{K}(x)-V_{K}(x)]\leq cC_{\Phi}|\mathcal{A}|(C_{w}+C_{\Phi})\frac{b}{\sigma^{2}}\sqrt{\frac{\log(2d/\delta)}{Km}}.

We have demonstrated that Assumption 5.2 can be satisfied with linear value functions. For Assumption 5.1, it suffices for the weight estimates $\{\widehat{w}_{\pi_{i}}\}$ to be bounded. This will be true, since we assume that the true weights are bounded and we can bound the error in the weight space. Thus Politex has an $O(\sqrt{T})$ regret in this setting (though note that we incur an extra $|\mathcal{A}|$ factor coming from the $\widehat{V}_{K}$ error bound).

7 Practical Implementation

As mentioned earlier, the key idea in our policy update is to obtain an estimate Q^k\widehat{Q}_{k} of the average of all the Q-functions in previous phases. We have seen that when we use linear functions to approximate the Q-functions, we can simply average the weights in order to get an estimate of the average Q-function. However, in practice, we often need to use non-linear functions, especially neural networks, to approximate complex Q-functions. In this section, we discuss how to efficiently implement our algorithm with non-linear function approximation.

7.1 Weight Averaging

The simplest idea may be to average the weights of the neural networks. However, a crucial difference from the linear setting is that averaging the weights of neural networks is not equivalent to averaging the functions that they represent: the function that a neural network represents is invariant to permutations of its hidden units, so two networks with very different weights can represent similar functions. Therefore, this implementation may only succeed when all the Q-function approximations lie in the same local region of weight space. Thus, when we learn the new Q-function $\widehat{Q}_{\pi_{k}}$ in phase $k$, we should initialize it with $\widehat{Q}_{k-1}$, run SGD with the new data, and then average with $\widehat{Q}_{k-1}$.
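A minimal sketch of one way to realize this averaging in parameter space is shown below; it assumes the phase-$k$ network was initialized from $\widehat{Q}_{k-1}$, that parameters are given as a list of arrays, and that the running-average weights $\frac{k-1}{k}$ and $\frac{1}{k}$ are used so that the result tracks $\frac{1}{k}\sum_{i=1}^{k}\widehat{Q}_{\pi_{i}}$:

def update_average_weights(avg_params, new_params, k):
    # Running average Qhat_k = ((k-1)/k) * Qhat_{k-1} + (1/k) * Qhat_{pi_k}, applied
    # parameter-wise. This only approximates averaging the functions themselves when
    # the networks stay in the same local region of weight space.
    return [((k - 1) * a + w) / k for a, w in zip(avg_params, new_params)]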

7.2 Experience Replay

Another natural idea is to leverage the data in the replay buffer to obtain an estimate of the average Q-function. We elaborate on the details below. We use a simple $b$-step Monte Carlo estimate for the Q-value of each state-action pair. For any $(i-1)\tau+1\leq t\leq i\tau-b$, we can estimate the state-action value of $(x_{t},a_{t})$ by the $b$-step cumulative reward (this is a practical implementation of Eq. (6.2), i.e., we do not split the data in each phase into blocks):

R_{t}=\sum_{j=t}^{t+b}(r_{j}-\widehat{J}_{\pi_{i}}),\qquad\widehat{J}_{\pi_{i}}=\frac{1}{\tau}\sum_{j=(i-1)\tau+1}^{i\tau}r_{j}.

In the following, we denote by $\tau^{\prime}:=\tau-b$ the maximum number of data points that we can collect from each phase. At the end of each phase, we store all or a subset of the $(x_{t},a_{t},R_{t})$ tuples in our replay buffer $\mathcal{R}$. We extract a feature vector $\phi(x,a)\in\mathbb{R}^{d}$ for the state-action pair $(x,a)$ and let $\mathcal{F}\subseteq\{f:\mathbb{R}^{d}\mapsto\mathbb{R}\}$ be a class of functions that we use to approximate the Q-functions. For phase $i$, we propose the following method to estimate $Q_{\pi_{i}}$: $\widehat{Q}_{\pi_{i}}(x,a)=\widehat{f}(\phi(x,a))$, where $\widehat{f}\in\arg\min_{f\in\mathcal{F}}\ell_{i}(f)$ and

\ell_{i}(f):=\frac{1}{\tau^{\prime}}\sum_{t=(i-1)\tau+1}^{i\tau-b}\big(f(\phi(x_{t},a_{t}))-R_{t}\big)^{2}.   (7.1)

Suppose that we store all the data from the previous phases in the replay buffer. Then, in order to estimate the average of the Q-functions of the first $k$ phases, i.e., $\widehat{Q}_{k}$, we propose the heuristic of minimizing the average of the $k$ squared loss functions defined in Eq. (7.1), i.e., $\frac{1}{k}\sum_{i=1}^{k}\ell_{i}(f)$.
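Below is a minimal PyTorch sketch of this heuristic (an illustrative training loop, not the exact setup used in our experiments). It trains a single network on the union of the replay buffer and the current phase's data; when every phase contributes the same number of transitions, the uniform average over the union coincides with $\frac{1}{k}\sum_{i=1}^{k}\ell_{i}(f)$:

import torch
import torch.nn as nn

def train_average_q(q_net, replay, current_data, num_steps=2000, batch_size=256, lr=1e-3):
    # Each element of replay / current_data is a pair (feature, return) with feature = phi(x, a).
    data = replay + current_data
    feats = torch.stack([torch.as_tensor(f, dtype=torch.float32) for f, _ in data])
    rets = torch.tensor([r for _, r in data], dtype=torch.float32)
    opt = torch.optim.SGD(q_net.parameters(), lr=lr)
    for _ in range(num_steps):
        idx = torch.randint(len(data), (batch_size,))
        pred = q_net(feats[idx]).squeeze(-1)
        loss = ((pred - rets[idx]) ** 2).mean()   # minibatch squared error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_net

# Example Q-network over state-action features, e.g. one hidden layer with ReLU:
# q_net = nn.Sequential(nn.Linear(feature_dim, 50), nn.ReLU(), nn.Linear(50, 1))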

Subsampling and coreset. In practice, due to the high memory cost, it may be hard to store all the data from the previous phases in the replay buffer. We found that a simple strategy to resolve this issue is to begin by storing all the data from every phase and to place a limit on the size of the replay buffer. When the buffer size exceeds the limit, we eliminate a subset of the data uniformly at random.

Another approach is to sample a subset of size $s$ from the data collected in each phase. Denote this subset by $\mathcal{R}_{i}$ for phase $i$. Thus, in the $k$-th phase, we have $\tau^{\prime}$ data points $\mathcal{D}_{k}$ from the current phase as well as $s(k-1)$ data points from the replay buffer $\mathcal{R}$. First, suppose that the $s$ data points are sampled uniformly at random. We can then minimize the following objective:

\min_{f\in\mathcal{F}}\frac{1}{k}\Big(\ell_{k}(f)+\sum_{i=1}^{k-1}\widehat{\ell}_{i}(f)\Big),   (7.2)

where $\widehat{\ell}_{i}(f):=\frac{1}{s}\sum_{(x_{t},a_{t},R_{t})\in\mathcal{R}_{i}}(f(\phi(x_{t},a_{t}))-R_{t})^{2}$ is an unbiased estimate of $\ell_{i}(f)$. Further, uniform sampling is not the only way to construct an unbiased estimate. In fact, for any discrete distribution with PMF $q=\{q_{t}\}$ over the $\tau^{\prime}$ data points in $\mathcal{D}_{i}$, we can sample $s$ data points according to $q$ and construct

\widehat{\ell}_{i}(f):=\frac{1}{\tau^{\prime}}\sum_{(x_{t},a_{t},R_{t})\in\mathcal{R}_{i}}\frac{1}{s\,q_{t}}(f(\phi(x_{t},a_{t}))-R_{t})^{2},   (7.3)

in order to obtain an unbiased estimate of $\ell_{i}(f)$. As shown by Bachem et al. (2017), by choosing $q_{t}\propto(f(\phi(x_{t},a_{t}))-R_{t})^{2}$, we can minimize the variance of $\widehat{\ell}_{i}(f)$ for any fixed $f$. In the following, we call the subset of data sampled according to this distribution a coreset of the data. In the experiments in Section 8, we show that, thanks to the variance reduction effect, sampling a coreset often produces better performance than sampling a subset uniformly at random, especially when the rewards are sparse.
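A minimal NumPy sketch of this subsampling step is shown below; the stored per-example weights are chosen so that the weighted squared error is an unbiased estimate of $\ell_{i}(f)$ for any fixed $f$, and the names are illustrative:

import numpy as np

def coreset_sample(features, returns, q_fn, s, rng):
    # Sample s transitions with probability proportional to the squared residual under
    # the current function q_fn, and keep weights so that
    # sum_t weights_t * (f(features_t) - returns_t)^2 estimates ell_i(f) without bias.
    tau_prime = len(returns)
    residuals_sq = (q_fn(features) - returns) ** 2 + 1e-12   # avoid an all-zero distribution
    q = residuals_sq / residuals_sq.sum()                    # sampling distribution q_t
    idx = rng.choice(tau_prime, size=s, replace=True, p=q)
    weights = 1.0 / (s * q[idx] * tau_prime)                 # importance weight per sample
    return features[idx], returns[idx], weights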

Comparison to Abbasi-Yadkori et al. (2019). Our algorithm can be viewed as an efficient implementation of Politex (Abbasi-Yadkori et al., 2019) via experience replay. In the original Politex algorithm, the Q-functions are estimated using Eq. (7.1) for each phase, and all of these functions are stored in memory. When the agent interacts with the environment, it needs to evaluate all the Q-functions in order to obtain the probability of each action. This implies that the time complexity of computing action probabilities increases with the number of phases, and as a result the algorithm is hard to scale to a large number of phases. Although we can choose to evaluate only a random subset of the Q-functions to estimate the action probabilities in Politex, our implementation via experience replay can still be faster, since we only need to evaluate a single function, trained on replay data, to take actions.

8 Experiments

In this section, we evaluate our implementations empirically. We make comparisons with several baselines in two control environments.

Environments. We use two control environments with the simulators described in Tassa et al. (2018). Both environments are episodic with episode length 1000. The environments we evaluate are:

  • Cart-pole (Barto et al., 1983): The goal of this environment is to balance an unactuated pole by applying forces to a cart at its base. We discretize the continuous force to 5 values: $\{-2,-1,0,1,2\}$. The reward at each time step is a real number in $[0,1]$. We end episodes early when the pole falls (we use rewards less than 0.5 as an indicator for falling), and assume zero reward for the remaining steps when reporting results.

  • Ball-in-cup: Here, an actuated planar receptacle can translate in the vertical plane in order to swing and catch a ball attached to its bottom. This task has a sparse reward: 1 when the ball is in the cup, and 0 otherwise. We discretize the two-dimensional continuous action space to a $3\times 3$ grid, i.e., 9 actions in total.

Figure 1: Experiments on the Cart-pole environment (top) and Ball-in-cup (bottom). Panels: (a) iteration complexity, (b) training speed, (c) replay subsampling strategies.

Algorithms. We compare the following algorithms: our proposed implementation of Politex using experience replay, Politex using weight averaging, the original Politex algorithm, the variant of Politex that averages 10 randomly selected Q-functions, the mirror descent policy optimization (MDPO) algorithm (Tomar et al., 2020), constrained policy optimization (CPO) (Achiam et al., 2017), and the V-MPO algorithm (Song et al., 2019b).

For all the algorithms, we extract Fourier basis features (Konidaris et al., 2011) from the raw observations for both environments: for Cart-pole, we use 4 bases and for Ball-in-cup we use 2 bases. For variants of Politex we construct the state-action features $\phi(x,a)$ by block one-hot encoding, i.e., we partition the $\phi(x,a)$ vector into $|\mathcal{A}|$ blocks, and set the $a$-th block to be the Fourier features of $x$ and the other blocks to be zero. We approximate Q-functions using neural networks with one hidden layer and ReLU activation: the width of the hidden layer is 50 for Cart-pole and 250 for Ball-in-cup. For MDPO, CPO and V-MPO, the algorithms use a policy network whose input and output are the state feature and action probability, respectively. We use one-hidden-layer networks with width 50 for Cart-pole and 250 for Ball-in-cup. These three algorithms also need a value network, which takes the state feature as input and outputs its value for the current policy. For both environments, we use a one-hidden-layer network with width 50.

For Cart-pole, we choose phase length $\tau=10^{4}$ and for Ball-in-cup, we choose $\tau=2\times 10^{4}$. Since the environments are episodic, we have multiple episodes in each phase. For Cart-pole, since training usually does not need a large number of phases, we do not set a limit on the size of the replay buffer, whereas for Ball-in-cup, we set the limit to $2\times 10^{6}$. For our algorithm and the variants of Politex, we choose the parameter $\eta$ in $\{5,10,20,40,80,160\}$ and report the result of each algorithm using the value of $\eta$ that produces the best average reward. For MDPO, CPO and V-MPO, we note that according to Eq. (4.1), the KL regularization coefficient between policies is the reciprocal of the $\eta$ parameter in our algorithm, so we also run these baseline algorithms over the same range of $\eta$ and report the best result. In addition, in CPO, we set the limit on the KL divergence between adjacent policies to $0.001\eta$, which, in our experiments, leads to good performance. We treat the length $b$ of the Monte Carlo estimate as a hyperparameter, choosing it in $\{100,300\}$ and reporting the best performance.

Results. We run each algorithm in each environment 20 times and report the average results as well as the standard deviation of the 20 runs (shaded area). We report the average reward in each phase before training on the data collected during the phase. The results are presented in Figure 1. Every run is conducted on a single P100 GPU.

From Fig. 1(a), we can see that the iteration complexity (average reward vs. number of phases) and final performance of the experience replay implementation using all the data from past phases (blue curve) are similar to those of many state-of-the-art algorithms, such as Politex and V-MPO. Meanwhile, according to Fig. 1(b), the experience replay implementation achieves strong performance in training speed: the best performance at 30 minutes in Cart-pole and the second-best performance at 200 minutes in Ball-in-cup. We note here that the training time includes both the time for data collection, i.e., interacting with the environment simulator, and the training of a new policy. The observation that the experience replay based implementation is faster than the original Politex and the implementation that uses 10 Q-functions can be explained by the fact that the replay-based algorithm only uses a single Q-function and thus is faster at data collection, as discussed in Section 7.2. In addition, the fact that the replay-based algorithm runs faster than MDPO, CPO, and V-MPO can be explained by its simplicity: in variants of Politex, we only need to approximate the Q-function, whereas in MDPO, CPO, and V-MPO, we need to train both the policy and value networks.

The Politex weight averaging scheme achieves the best performance on Ball-in-cup in both iteration complexity and training speed, but does not converge to a solution that is sufficiently close to the optimum on Cart-pole. Notice that for weight averaging, we used the initialization technique mentioned in Section 7.1. Considering the consistent strong performance of experience replay in both environments, we still recommend applying experience replay when using non-linear function approximation.

In Fig. 1(c), we can see that when we only sample a small subset (1% or 4%) of the data to store in the replay buffer, the coreset technique described in Section 7.2 achieves better performance due to the variance reduction effect. This improvement is quite significant in the Ball-in-cup environment, which has sparse rewards. Notice that sampling a coreset adds only negligible computational overhead during training, and thus we recommend the coreset technique when there is a strict memory constraint.

9 Discussion

The main contributions of our work are an improved analysis and a practical implementation of Politex. On the theoretical side, we show that Politex obtains an $O(\sqrt{T})$ high-probability regret bound in uniformly mixing average-reward MDPs with linear function approximation, which is the first such bound for a computationally efficient algorithm. The main limitation of these results is that, similarly to previous works, they hold under somewhat strong assumptions which circumvent the need for exploration. An interesting direction for future work would be to relax these assumptions and incorporate explicit exploration instead.

On the practical side, we propose an efficient implementation with neural networks that relies on experience replay, a standard tool in modern deep RL. Our work shows that experience replay and KL regularization can both be viewed as approximately implementing mirror descent policy updates within a policy iteration scheme. This provides an online-learning justification for using replay buffers in policy iteration, which differs from the standard explanation for their success. Our work also suggests a new objective for storing and prioritizing replay samples, with the goal of approximating the average of value functions well. This goal has some similarities with continual learning, where experience replay has also led to empirical successes. One interesting direction for future work would be to explore other continual learning techniques in the approximate implementation of mirror descent policy updates.

Acknowledgements

We would like to thank Mehrdad Farajtabar, Dilan Görür, Nir Levine, and Ang Li for helpful discussions.

References

  • Abbasi-Yadkori et al. (2019) Y. Abbasi-Yadkori, P. Bartlett, K. Bhatia, N. Lazic, C. Szepesvari, and G. Weisz. Politex: Regret bounds for policy iteration using expert prediction. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3692–3702. PMLR, 09–15 Jun 2019.
  • Abdolmaleki et al. (2018) A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1ANxQW0b.
  • Achiam et al. (2017) J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
  • Bachem et al. (2017) O. Bachem, M. Lucic, and A. Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
  • Bartlett (2009) P. L. Bartlett. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, 2009.
  • Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
  • Borsos et al. (2020) Z. Borsos, M. Mutnỳ, and A. Krause. Coresets via bilevel optimization for continual learning and streaming. arXiv preprint arXiv:2006.03875, 2020.
  • Cao (1999) X.-R. Cao. Single sample path-based optimization of Markov chains. Journal of optimization theory and applications, 100(3):527–548, 1999.
  • Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Chaudhry et al. (2018) A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
  • Cho and Meyer (2001) G. E. Cho and C. D. Meyer. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137–150, 2001.
  • Degrave et al. (2019) J. Degrave, A. Abdolmaleki, J. T. Springenberg, N. Heess, and M. Riedmiller. Quinoa: a Q-function you infer normalized over actions. arXiv preprint arXiv:1911.01831, 2019.
  • Even-Dar et al. (2009) E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • Farajtabar et al. (2020) M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773, 2020.
  • Fruit et al. (2018) R. Fruit, M. Pirotta, A. Lazaric, and R. Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, 2018.
  • Hao et al. (2020) B. Hao, N. Lazic, Y. Abbasi-Yadkori, P. Joulani, and C. Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
  • Heess et al. (2015) N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 28:2944–2952, 2015.
  • Jaksch et al. (2010) T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jian et al. (2019) Q. Jian, R. Fruit, M. Pirotta, and A. Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In Advances in Neural Information Processing Systems, pages 4891–4900, 2019.
  • Jin et al. (2019) C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
  • Kakade and Langford (2002) S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Konidaris et al. (2011) G. Konidaris, S. Osentoski, and P. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, 2011.
  • Lillicrap et al. (2016) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.
  • Lin (1992) L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • Lopez-Paz and Ranzato (2017) D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in neural information processing systems, pages 6467–6476, 2017.
  • Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Neu et al. (2014) G. Neu, A. György, C. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, December 2014.
  • Ouyang et al. (2017) Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
  • Rusu et al. (2016) A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations (ICLR), 2016.
  • Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
  • Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Seneta (1979) E. Seneta. Coefficients of ergodicity: structure and applications. Advances in applied probability, pages 576–590, 1979.
  • Seneta (1988) E. Seneta. Perturbation of the stationary distribution measured by ergodicity coefficients. Advances in Applied Probability, 20(1):228–230, 1988.
  • Song et al. (2019a) H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, N. Heess, D. Belov, M. A. Riedmiller, and M. M. Botvinick. V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. CoRR, abs/1909.12238, 2019a. URL http://arxiv.org/abs/1909.12238.
  • Song et al. (2019b) H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, et al. V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238, 2019b.
  • Talebi and Maillard (2018) M. S. Talebi and O.-A. Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, 2018.
  • Tassa et al. (2018) Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
  • Tomar et al. (2020) M. Tomar, L. Shani, Y. Efroni, and M. Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
  • Tropp (2012) J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • Vieillard et al. (2020a) N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, and M. Geist. Leverage the average: an analysis of regularization in RL. arXiv preprint arXiv:2003.14089, 2020a.
  • Vieillard et al. (2020b) N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 2529–2538, 2020b.
  • Wei et al. (2020a) C.-Y. Wei, M. Jafarnia-Jahromi, H. Luo, and R. Jain. Learning infinite-horizon average-reward mdps with linear function approximation. arXiv preprint arXiv:2007.11849, 2020a.
  • Wei et al. (2020b) C.-Y. Wei, M. J. Jahromi, H. Luo, H. Sharma, and R. Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020b.
  • Yin et al. (2020) D. Yin, M. Farajtabar, and A. Li. SOLA: Continual learning with second-order loss approximation. arXiv preprint arXiv:2006.10974, 2020.
  • Yu (1994) B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.
  • Zenke et al. (2017) F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70:3987, 2017.

Appendix

Appendix A Preliminaries

We state some useful definitions and lemmas in this section.

Lemma A.1.

Let $X_{1}$ and $X_{2}$ be a pair of distribution vectors. Let $H$ be the transition matrix of an ergodic Markov chain with a stationary distribution $\nu$, and ergodicity coefficient (defined in Assumption 2.1) upper-bounded by $\gamma<1$. Then

\|(H^{m})^{\top}(X_{1}-X_{2})\|_{1}\leq\gamma^{m}\|X_{1}-X_{2}\|_{1}.
Proof.

Let $\{v_{1},\ldots,v_{n}\}$ be the normalized left eigenvectors of $H$ corresponding to the ordered eigenvalues $\{\lambda_{1},\ldots,\lambda_{n}\}$. Then $v_{1}=\nu$, $\lambda_{1}=1$, and for all $i\geq 2$, we have that $\lambda_{i}<1$ (since the chain is ergodic) and $v_{i}^{\top}{\bf 1}=0$. Write $X_{1}$ and $X_{2}$ in terms of the eigenvector basis as

X_{1}=\alpha_{1}\nu+\sum_{i=2}^{n}\alpha_{i}v_{i}\quad\text{and}\quad X_{2}=\beta_{1}\nu+\sum_{i=2}^{n}\beta_{i}v_{i}.

Since $X_{1}^{\top}{\bf 1}=1$ and $X_{2}^{\top}{\bf 1}=1$, it is easy to see that $\alpha_{1}=\beta_{1}=1$. Thus we have

\|H^{\top}(X_{1}-X_{2})\|_{1}=\Big\|H^{\top}\sum_{i=2}^{n}(\alpha_{i}-\beta_{i})v_{i}\Big\|_{1}\leq\gamma\Big\|\sum_{i=2}^{n}(\alpha_{i}-\beta_{i})v_{i}\Big\|_{1}=\gamma\|X_{1}-X_{2}\|_{1},

where the inequality follows from the definition of the ergodicity coefficient and the fact that ${\bf 1}^{\top}v_{i}=0$ for all $i\geq 2$. Since

{\bf 1}^{\top}H^{\top}\sum_{i=2}^{n}(\alpha_{i}-\beta_{i})v_{i}={\bf 1}^{\top}\sum_{i=2}^{n}\lambda_{i}(\alpha_{i}-\beta_{i})v_{i}=0,

the inequality also holds for powers of $H$. ∎

Lemma A.2 (Doob martingale).

Let Assumption 2.1 hold, and let $\{(x_{t},a_{t})\}_{t=1}^{T}$ be the state-action sequence obtained when following policies $\pi_{1},\ldots,\pi_{K}$ for $\tau$ steps each from an initial distribution $\nu_{0}$. For $t\in[T]$, let $X_{t}$ be a binary indicator vector with a non-zero element at the linear index of the state-action pair $(x_{t},a_{t})$. Define, for $i\in[T]$,

B_{i}=\mathbb{E}\left[\sum_{t=1}^{T}X_{t}\,\Big|\,X_{1},\ldots,X_{i}\right],\quad\text{and}\quad B_{0}=\mathbb{E}\left[\sum_{t=1}^{T}X_{t}\right].

Then, $\{B_{i}\}_{i=0}^{T}$ is a vector-valued martingale: $\mathbb{E}[B_{i}-B_{i-1}|B_{0},\dots,B_{i-1}]=0$ for $i=1,\dots,T$, and $\|B_{i}-B_{i-1}\|_{1}\leq 2(1-\gamma)^{-1}$ holds for $i\in[T]$.

The constructed martingale is known as the Doob martingale underlying the sum $\sum_{t=1}^{T}X_{t}$.

Proof.

That \{B_{i}\}_{i=0}^{T} is a martingale follows from the definition. We now bound its difference sequence. Let H_{t} be the state-action transition matrix at time t, let H_{i:t}=\prod_{j=i}^{t-1}H_{j}, and define H_{i:i}=I. Then, for t=0,\dots,T-1, \mathbb{E}[X_{t+1}|X_{t}]=H_{t}^{\top}X_{t} (with the convention that X_{0} denotes the initial state-action distribution), and by the Markov property, for any i\in[T],

Bi\displaystyle B_{i} =t=1iXt+t=i+1T𝔼[Xt|Xi]=t=1iXt+t=i+1THi:tXi,andB0=t=1TH0:tX0.\displaystyle=\sum_{t=1}^{i}X_{t}+\sum_{t=i+1}^{T}\mathbb{E}[X_{t}|X_{i}]=\sum_{t=1}^{i}X_{t}+\sum_{t=i+1}^{T}H_{i:t}^{\top}X_{i},\quad\text{and}\quad B_{0}=\sum_{t=1}^{T}H_{0:t}^{\top}X_{0}.

For any i[T]i\in[T],

BiBi1\displaystyle B_{i}-B_{i-1} =t=1iXtt=1i1Xt+t=i+1THi:tXit=iTHi1:tXi1\displaystyle=\sum_{t=1}^{i}X_{t}-\sum_{t=1}^{i-1}X_{t}+\sum_{t=i+1}^{T}H_{i:t}^{\top}X_{i}-\sum_{t=i}^{T}H_{i-1:t}^{\top}X_{i-1}
=t=iTHi:t(XiHi1Xi1).\displaystyle=\sum_{t=i}^{T}H_{i:t}^{\top}(X_{i}-H_{i-1}^{\top}X_{i-1}). (A.1)

Since XiX_{i} and Hi1Xi1H_{i-1}^{\top}X_{i-1} are distribution vectors, under Assumption 2.1 and using Lemma A.1,

BiBi11t=iTHi:t(XiHi1Xi1)12j=0Tiγj2(1γ)1.\displaystyle\|B_{i}-B_{i-1}\|_{1}\leq\sum_{t=i}^{T}\|H_{i:t}^{\top}(X_{i}-H_{i-1}^{\top}X_{i-1})\|_{1}\leq 2\sum_{j=0}^{T-i}\gamma^{j}\leq 2(1-\gamma)^{-1}\,.

Let (k)k(\mathcal{F}_{k})_{k} be a filtration and define 𝔼k[]:=𝔼[|k]\mathbb{E}_{k}[\cdot]:=\mathbb{E}[\cdot|\mathcal{F}_{k}]. We will make use of the following concentration results for the sum of random matrices and vectors.

Theorem A.3 (Matrix Azuma, Tropp (2012) Thm 7.1).

Consider a finite (\mathcal{F}_{k})_{k}-adapted sequence \{X_{k}\} of Hermitian matrices of dimension m, and a fixed sequence \{A_{k}\} of Hermitian matrices that satisfy \mathbb{E}_{k-1}X_{k}=0 and X_{k}^{2}\preceq A_{k}^{2} almost surely. Let v=\|\sum_{k}A_{k}^{2}\|. Then with probability at least 1-\delta, \|\sum_{k}X_{k}\|_{2}\leq 2\sqrt{2v\ln(m/\delta)}.

A version of Theorem A.3 for non-Hermitian matrices of dimension m1×m2m_{1}\times m_{2} can be obtained by applying the theorem to a Hermitian dilation of XX, 𝒟(X)=[0XX0]\mathcal{D}(X)=\big{[}\begin{smallmatrix}0&X\\ X^{*}&0\end{smallmatrix}\big{]}, which satisfies λmax(𝒟(X))=X\lambda_{\max}(\mathcal{D}(X))=\|X\| and 𝒟(X)2=[XX00XX]\mathcal{D}(X)^{2}=\big{[}\begin{smallmatrix}XX^{*}&0\\ 0&X^{*}X\end{smallmatrix}\big{]}. In this case, we have that v=max(kXkXk,kXkXk)v=\max\left(\|\sum_{k}X_{k}X_{k}^{*}\|,\|\sum_{k}X_{k}^{*}X_{k}\|\right).
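For concreteness, a small numerical sketch of the Hermitian dilation (the 3\times 5 dimensions are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))  # a non-Hermitian 3 x 5 matrix

# Hermitian dilation D(X) = [[0, X], [X^*, 0]].
D = np.block([[np.zeros((3, 3)), X], [X.conj().T, np.zeros((5, 5))]])

# lambda_max(D(X)) equals the spectral norm of X ...
assert np.isclose(np.linalg.eigvalsh(D).max(), np.linalg.norm(X, 2))
# ... and D(X)^2 is block-diagonal with blocks X X^* and X^* X.
D2 = np.block([[X @ X.conj().T, np.zeros((3, 5))],
               [np.zeros((5, 3)), X.conj().T @ X]])
assert np.allclose(D @ D, D2)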

Lemma A.4 (Hoeffding-type inequality for norm-subGaussian random vectors, Jin et al. (2019)).

Consider random vectors X_{1},\ldots,X_{n}\in\mathbb{R}^{d} and corresponding filtrations \mathcal{F}_{i}=\sigma(X_{1},\ldots,X_{i}) for i\in[n], such that X_{i}|\mathcal{F}_{i-1} is zero-mean norm-subGaussian with \sigma_{i}\in\mathcal{F}_{i-1}. That is:

\displaystyle\mathbb{E}[X_{i}|\mathcal{F}_{i-1}]=0,\quad P(\|X_{i}\|\geq t|\mathcal{F}_{i-1})\leq 2\exp(-t^{2}/2\sigma_{i}^{2})\quad\forall t\in\mathbb{R},\forall i\in[n].

If the condition is satisfied for fixed {σi}\{\sigma_{i}\}, there exists a constant cc such that for any δ>0\delta>0, with probability at least 1δ1-\delta,

i=1nXici=1nσi2log(2d/δ).\displaystyle\|\sum_{i=1}^{n}X_{i}\|\leq c\sqrt{\sum_{i=1}^{n}\sigma_{i}^{2}\log(2d/\delta)}\,.

Appendix B Bounding the Difference Between Empirical and Average Rewards

In this section, we bound the second term in Equation 5.1, corresponding to the difference between empirical and average rewards.

Lemma B.1.

Let Assumption 2.1 hold, and assume that τlogT2log(1/γ)\tau\geq\frac{\log T}{2\log(1/\gamma)} and that r(x,a)[0,1]r(x,a)\in[0,1] for all x,ax,a. Then, by choosing η=8log|𝒜|QmaxK\eta=\frac{\sqrt{8\log|\mathcal{A}|}}{Q_{\max}\sqrt{K}}, we have with probability at least 1δ1-\delta,

k=1Kt=(k1)τ+1kτ(rtJπk)2(1γ)12Tlog(2/δ)+2T+(1γ)28Klog|𝒜|.\displaystyle\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}(r_{t}-J_{\pi_{k}})\leq 2(1-\gamma)^{-1}\sqrt{2T\log(2/\delta)}+2\sqrt{T}+(1-\gamma)^{-2}\sqrt{8K\log|\mathcal{A}|}\,.
Proof.

Let rr denote the vector of rewards, and recall that Jπ=νπrJ_{\pi}=\nu_{\pi}^{\top}r. Let XtX_{t} be the indicator vector for the state-action pair at time tt, as in Lemma A.2, and let νt=𝔼[Xt]\nu_{t}=\mathbb{E}[X_{t}]. We have the following:

VT\displaystyle V_{T} :=k=1Kt=(k1)τ+1kτ(rtJπk)=k=1Kt=(k1)τ+1kτr(Xtνt+νtνπk)\displaystyle:=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}(r_{t}-J_{\pi_{k}})=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}r^{\top}(X_{t}-\nu_{t}+\nu_{t}-\nu_{\pi_{k}})

We slightly abuse the notation above by letting νt\nu_{t} denote the state-action distribution at time tt, and νπ\nu_{\pi} the stationary distribution of policy π\pi. Let {Bi}i=0T\{B_{i}\}_{i=0}^{T} be the Doob martingale in Lemma A.2. Then B0=t=1TνtB_{0}=\sum_{t=1}^{T}\nu_{t} and BT=t=1TXtB_{T}=\sum_{t=1}^{T}X_{t}, and the first term can be expressed as

VT1:=t=1Tr(Xtνt)=r(BTB0).\displaystyle V_{T1}:=\sum_{t=1}^{T}r^{\top}(X_{t}-\nu_{t})=r^{\top}(B_{T}-B_{0}).

By Lemma A.2, |BiBi1,r|BiBi11r2(1γ)1|\langle B_{i}-B_{i-1},r\rangle|\leq\|B_{i}-B_{i-1}\|_{1}\|r\|_{\infty}\leq 2(1-\gamma)^{-1}. Hence by Azuma’s inequality, with probability at least 1δ1-\delta,

VT12(1γ)12Tlog(2/δ).V_{T1}\leq 2(1-\gamma)^{-1}\sqrt{2T\log(2/{\delta})}. (B.1)

For the second term we have

VT2\displaystyle V_{T2} :=k=1Kt=(k1)τ+1kτr(νtνπk)\displaystyle:=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}r^{\top}(\nu_{t}-\nu_{\pi_{k}})
\displaystyle=\sum_{k=1}^{K}r^{\top}\sum_{i=1}^{\tau}\left((H_{\pi_{k}}^{i})^{\top}\nu_{(k-1)\tau}-\nu_{\pi_{k}}\right)
k=1Kri=1τ(Hπki)(ν(k1)τνπk1+νπk1)νπk1\displaystyle\leq\sum_{k=1}^{K}\|r\|_{\infty}\sum_{i=1}^{\tau}\left\|(H_{\pi_{k}}^{i})^{\top}(\nu_{(k-1)\tau}-\nu_{\pi_{k-1}}+\nu_{\pi_{k-1}})-\nu_{\pi_{k}}\right\|_{1}
k=1Ki=1τν(k1)τνπk11+(Hπki)νπk1νπk1\displaystyle\leq\sum_{k=1}^{K}\sum_{i=1}^{\tau}\|\nu_{(k-1)\tau}-\nu_{\pi_{k-1}}\|_{1}+\|(H_{\pi_{k}}^{i})^{\top}\nu_{\pi_{k-1}}-\nu_{\pi_{k}}\|_{1}
k=1Ki=1τ(Hπ(k1)τ)ν(k2)τνπk11+γiνπk1νπk1\displaystyle\leq\sum_{k=1}^{K}\sum_{i=1}^{\tau}\|(H_{\pi_{(k-1)}}^{\tau})^{\top}\nu_{(k-2)\tau}-\nu_{\pi_{k-1}}\|_{1}+\gamma^{i}\|\nu_{\pi_{k-1}}-\nu_{\pi_{k}}\|_{1}
2Tγτ+11γk=1Kνπkνπk11.\displaystyle\leq 2T\gamma^{\tau}+\frac{1}{1-\gamma}\sum_{k=1}^{K}\|\nu_{\pi_{k}}-\nu_{\pi_{k-1}}\|_{1}\,.

For τlogT2log(1/γ)\tau\geq\frac{\log T}{2\log(1/\gamma)}, the first term is upper-bounded by 2T2\sqrt{T}.

Using results on perturbations of Markov chains (Seneta, 1988; Cho and Meyer, 2001), we have that

νπkνπk1111γHπkHπk111γmaxxπk(|x)πk1(|x)1.\displaystyle\|\nu_{\pi_{k}}-\nu_{\pi_{k-1}}\|_{1}\leq\frac{1}{1-\gamma}\|H_{\pi_{k}}-H_{\pi_{k-1}}\|_{\infty}\leq\frac{1}{1-\gamma}\max_{x}\|\pi_{k}(\cdot|x)-\pi_{k-1}(\cdot|x)\|_{1}.
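As a sanity check on this perturbation bound, here is a minimal sketch of ours (again taking the ergodicity coefficient to be the Dobrushin coefficient, and writing \|\cdot\|_{\infty} for the maximum absolute row sum); the two transition matrices stand in for H_{\pi_{k}} and H_{\pi_{k-1}}:

import numpy as np

def stationary(H):
    # Left eigenvector of H for eigenvalue 1, normalized to a distribution.
    vals, vecs = np.linalg.eig(H.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return v / v.sum()

def dobrushin(H):
    # Assumed form of the ergodicity coefficient in Assumption 2.1.
    n = H.shape[0]
    return 0.5 * max(np.abs(H[i] - H[j]).sum() for i in range(n) for j in range(n))

rng = np.random.default_rng(5)
H1 = rng.random((4, 4)) + 0.2; H1 /= H1.sum(axis=1, keepdims=True)
H2 = np.clip(H1 + 0.05 * rng.standard_normal((4, 4)), 1e-3, None)
H2 /= H2.sum(axis=1, keepdims=True)

gamma = max(dobrushin(H1), dobrushin(H2))
lhs = np.abs(stationary(H1) - stationary(H2)).sum()
rhs = np.abs(H1 - H2).sum(axis=1).max() / (1 - gamma)
assert lhs <= rhs + 1e-10
print(f"||nu_1 - nu_2||_1 = {lhs:.4f} <= {rhs:.4f}")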

Note that the policies πk(|x)\pi_{k}(\cdot|x) are generated by running mirror descent on reward functions Q^πk(x,)\widehat{Q}_{\pi_{k}}(x,\cdot). A well-known property of mirror descent updates with entropy regularization (or equivalently, the exponentially-weighted-average algorithm) is that the difference between consecutive policies is bounded as

πk+1(|x)πk(|x)1ηQ^πk(x,).\displaystyle\|\pi_{k+1}(\cdot|x)-\pi_{k}(\cdot|x)\|_{1}\leq\eta\|\widehat{Q}_{\pi_{k}}(x,\cdot)\|_{\infty}\,.

See e.g. Neu et al. (2014) Section V.A for a proof, which involves applying Pinsker’s inequality and Hoeffding’s lemma (Cesa-Bianchi and Lugosi (2006) Section A.2 and Lemma A.6). Since we assume that Q^πkQmax\|\widehat{Q}_{\pi_{k}}\|_{\infty}\leq Q_{\max}, we can obtain

VT22T+(1γ)2KηQmax.V_{T2}\leq 2\sqrt{T}+(1-\gamma)^{-2}K\eta Q_{\max}.

By choosing η=8log|𝒜|QmaxK\eta=\frac{\sqrt{8\log|\mathcal{A}|}}{Q_{\max}\sqrt{K}}, we can bound the second term as

VT22T+(1γ)28Klog|𝒜|.V_{T2}\leq 2\sqrt{T}+(1-\gamma)^{-2}\sqrt{8K\log|\mathcal{A}|}. (B.2)

Putting Eq. (B.1) and (B.2) together, we obtain that with probability at least 1δ1-\delta,

VT2(1γ)12Tlog(2/δ)+2T+(1γ)28Klog|𝒜|.V_{T}\leq 2(1-\gamma)^{-1}\sqrt{2T\log(2/\delta)}+2\sqrt{T}+(1-\gamma)^{-2}\sqrt{8K\log|\mathcal{A}|}\,.
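To illustrate the mirror-descent stability step used in the proof above, here is a minimal single-state sketch of ours; the uniformly random Q-estimates are placeholders, and the exponential-weights update on the running sum of Q-estimates is the form of update referred to in the text:

import numpy as np

rng = np.random.default_rng(2)
num_actions, Q_max, K = 5, 1.0, 100
eta = np.sqrt(8 * np.log(num_actions)) / (Q_max * np.sqrt(K))

cumulative_Q = np.zeros(num_actions)
policy = np.full(num_actions, 1.0 / num_actions)
for k in range(K):
    Q_hat = rng.uniform(0, Q_max, size=num_actions)  # stand-in for Q-hat_{pi_k}(x, .)
    cumulative_Q += Q_hat
    # Exponential weights (entropy-regularized mirror descent) on the cumulative Q-estimates.
    logits = eta * cumulative_Q
    new_policy = np.exp(logits - logits.max())
    new_policy /= new_policy.sum()
    # One-step stability: ||pi_{k+1}(.|x) - pi_k(.|x)||_1 <= eta * ||Q_hat(x, .)||_inf.
    assert np.abs(new_policy - policy).sum() <= eta * np.abs(Q_hat).max() + 1e-12
    policy = new_policy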

Appendix C Proof of Lemma 6.3

Proof.

Recall that we split each phase into 2m blocks of size b and let \mathcal{H}_{i} and {\mathcal{T}}_{i} denote the starting indices of the odd and even blocks, respectively. We let R_{t} denote the empirical b-step return from the state-action pair (x_{t},a_{t}) in phase i:

\displaystyle R_{t}=\sum_{s=t}^{t+b}(r_{s}-\widehat{J}_{\pi_{i}}),\quad\widehat{J}_{\pi_{i}}=\frac{1}{|{\mathcal{T}}_{i}|}\sum_{t^{\prime}\in{\mathcal{T}}_{i}}r_{t^{\prime}}.
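To make the block indexing concrete, a minimal sketch of ours (0-based Python indices and synthetic rewards, so the details are illustrative only) of how the odd-block starts \mathcal{H}_{i}, the even-block starts {\mathcal{T}}_{i}, the estimate \widehat{J}_{\pi_{i}}, and the returns R_{t} are formed within one phase:

import numpy as np

rng = np.random.default_rng(3)
m, b = 8, 10                                  # m odd blocks and m even blocks of size b
phase_len = 2 * m * b
rewards = rng.uniform(0, 1, size=phase_len)   # hypothetical rewards from one phase

odd_starts = np.arange(0, phase_len, 2 * b)   # H_i: starts of blocks 1, 3, 5, ...
even_starts = np.arange(b, phase_len, 2 * b)  # T_i: starts of blocks 2, 4, 6, ...

# Average-reward estimate from rewards observed at even-block starts.
J_hat = rewards[even_starts].mean()

# Centered b-step returns from each odd-block start.
R = np.array([rewards[t:t + b + 1].sum() - (b + 1) * J_hat for t in odd_starts])
print(J_hat, R)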

We start by bounding the error in RtR_{t}. Let XX be a binary indicator vector for a state-action pair (x,a)(x,a). Let HπH_{\pi} be the state-action transition kernel for policy π\pi, and let νπ\nu_{\pi} be the corresponding stationary state-action distribution. We can write the action-value function at (x,a)(x,a) as

Qπ(x,a)\displaystyle Q_{\pi}(x,a) =r(x,a)Jπ+XHπQπ\displaystyle=r(x,a)-J_{\pi}+X^{\top}H_{\pi}Q_{\pi}
=(Xνπ)r+XHπ(rJπ𝟏+HπQπ)\displaystyle=(X-\nu_{\pi})^{\top}r+X^{\top}H_{\pi}(r-J_{\pi}{\bf 1}+H_{\pi}Q_{\pi})
=i=0(Xνπ)Hπir.\displaystyle=\sum_{i=0}^{\infty}(X-\nu_{\pi})^{\top}H_{\pi}^{i}r\,.

Let Qπb(x,a)=i=0b(Xνπ)HπirQ_{\pi}^{b}(x,a)=\sum_{i=0}^{b}(X-\nu_{\pi})^{\top}H_{\pi}^{i}r be a version of QπQ_{\pi} truncated to bb steps. Under uniform mixing, the difference to the true QπQ_{\pi} is bounded as

|Qπ(x,a)Qπb(x,a)|i=1|(Xνπ)Hπi+br|2γb+11γ.\displaystyle|Q_{\pi}(x,a)-Q_{\pi}^{b}(x,a)|\leq\sum_{i=1}^{\infty}\left|(X-\nu_{\pi})^{\top}H_{\pi}^{i+b}r\right|\leq\frac{2\gamma^{b+1}}{1-\gamma}\,. (C.1)

Let b_{t}=Q_{\pi_{i}}^{b}(x_{t},a_{t})-Q_{\pi_{i}}(x_{t},a_{t}) denote the truncation bias at time t, and let z_{t}=\sum_{s=t}^{t+b}\big(r_{s}-X_{t}^{\top}H_{\pi_{i}}^{(s-t)}r\big) denote the reward noise. We will write

Rt=Qπi(xt,at)+b(JπiJ^πi)+zt+bt.\displaystyle R_{t}=Q_{\pi_{i}}(x_{t},a_{t})+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})+z_{t}+b_{t}.

Note that m=|i|m=|\mathcal{H}_{i}| and let

M^i=1mtiϕtϕt+αmI.\widehat{M}_{i}=\frac{1}{m}\sum_{t\in\mathcal{H}_{i}}\phi_{t}\phi_{t}^{\top}+\frac{\alpha}{m}I\,.

We estimate the value function of each policy πi\pi_{i} using data from phase ii as

w^πi\displaystyle\widehat{w}_{\pi_{i}} =M^i1m1tiϕtRt\displaystyle=\widehat{M}_{i}^{-1}m^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}R_{t}
\displaystyle=\widehat{M}_{i}^{-1}m^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(\phi_{t}^{\top}w_{\pi_{i}}+b_{t}+z_{t}+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}}))+\widehat{M}_{i}^{-1}\frac{\alpha}{m}w_{\pi_{i}}-\widehat{M}_{i}^{-1}\frac{\alpha}{m}w_{\pi_{i}}
=wπi+M^i1m1tiϕt(zt+bt+b(JπiJ^πi))M^i1m1αwπi\displaystyle=w_{\pi_{i}}+\widehat{M}_{i}^{-1}m^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(z_{t}+b_{t}+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}}))-\widehat{M}_{i}^{-1}m^{-1}\alpha w_{\pi_{i}}

Our estimate w^k\widehat{w}_{k} of wk=1ki=1kwπiw_{k}=\frac{1}{k}\sum_{i=1}^{k}w_{\pi_{i}} can thus be written as follows:

w^kwk\displaystyle\widehat{w}_{k}-w_{k} =1kmi=1ktiM^i1ϕt(zt+bt+b(JπiJ^πi))αkmi=1kM^i1wπi.\displaystyle=\frac{1}{km}\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\widehat{M}_{i}^{-1}\phi_{t}(z_{t}+b_{t}+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}}))-\frac{\alpha}{km}\sum_{i=1}^{k}\widehat{M}_{i}^{-1}w_{\pi_{i}}.

We proceed to upper-bound the norm of the RHS.

Set α=m/k\alpha=\sqrt{m/k}. Let CwC_{w} be an upper-bound on the norm of the true value-function weights wπi2\|w_{\pi_{i}}\|_{2} for i=1,,Ki=1,...,K. In Appendix C.3, we show that with probability at least 1δ1-\delta, for m72CΦ4σ2(1γ)2log(d/δ)m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta), M^i122σ2\|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2}. Thus with probability at least 1δ1-\delta, the last error term is upper-bounded as

\displaystyle\frac{\alpha}{km}\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}w_{\pi_{i}}\right\|_{2}\leq 2\sigma^{-2}C_{w}(km)^{-1/2}. (C.2)

Similarly, for

blog((1γ)1km)log(1/γ),b\geq\frac{\log((1-\gamma)^{-1}\sqrt{km})}{\log(1/\gamma)}, (C.3)

the norm of the truncation bias term is upper-bounded as

1kmi=1ktiM^i1ϕtbt2\displaystyle\frac{1}{km}\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\|\widehat{M}_{i}^{-1}\phi_{t}b_{t}\|_{2} 2γbkm(1γ)i=1ktiM^i1ϕt22σ2CΦ(km)1/2.\displaystyle\leq\frac{2\gamma^{b}}{km(1-\gamma)}\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\|\widehat{M}_{i}^{-1}\phi_{t}\|_{2}\leq 2\sigma^{-2}C_{\Phi}(km)^{-1/2}. (C.4)

To bound the error terms corresponding to reward noise ztz_{t} and average-error noise JπiJ^πiJ_{\pi_{i}}-\widehat{J}_{\pi_{i}}, we rely on the independent blocks techniques of Yu (1994). We show in Sections C.1 and C.2 that with probability 12δ1-2\delta, for constants c1c_{1} and c2c_{2}, each of these terms can be bounded as:

1kmi=1ktiM^i1ϕtzt2\displaystyle\frac{1}{km}\left\|\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\widehat{M}_{i}^{-1}\phi_{t}z_{t}\right\|_{2} 2c1CΦσ2blog(2d/δ)km\displaystyle\leq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{\frac{b\log(2d/\delta)}{km}}
bkmi=1k(JπiJ^πi)tiM^i1ϕt2\displaystyle\frac{b}{km}\left\|\sum_{i=1}^{k}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})\sum_{t\in\mathcal{H}_{i}}\widehat{M}_{i}^{-1}\phi_{t}\right\|_{2} 2c2CΦσ2blog(2d/δ)km.\displaystyle\leq 2c_{2}C_{\Phi}\sigma^{-2}b\sqrt{\frac{\log(2d/\delta)}{km}}.

Thus, putting terms together, we have for an absolute constant cc, with probability at least 1δ1-\delta,

w^kwk2cσ2(Cw+CΦ)blog(2d/δ)km.\displaystyle\|\widehat{w}_{k}-w_{k}\|_{2}\leq c\sigma^{-2}(C_{w}+C_{\Phi})b\sqrt{\frac{\log(2d/\delta)}{km}}.

Note that this result holds for every k[K]k\in[K] and thus also holds for k=Kk=K. ∎
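For concreteness, a minimal sketch of ours (synthetic features and returns, with \alpha=\sqrt{m/k} as above; not the full algorithm) of the per-phase estimates \widehat{M}_{i} and \widehat{w}_{\pi_{i}} and the averaged weights \widehat{w}_{k} used in the proof:

import numpy as np

rng = np.random.default_rng(4)
d, m, k = 6, 50, 20              # feature dimension, odd blocks per phase, phases so far
alpha = np.sqrt(m / k)           # regularization parameter used in the proof

w_hat_per_phase = []
for i in range(k):
    Phi = rng.standard_normal((m, d))                    # phi_t for t in H_i (stand-ins)
    R = Phi @ np.ones(d) + 0.1 * rng.standard_normal(m)  # stand-in b-step returns R_t
    M_hat = (Phi.T @ Phi) / m + (alpha / m) * np.eye(d)  # M-hat_i
    w_hat = np.linalg.solve(M_hat, Phi.T @ R / m)        # w-hat_{pi_i}
    w_hat_per_phase.append(w_hat)

# Averaged weights defining the averaged Q-estimate used by the policy update.
w_hat_k = np.mean(w_hat_per_phase, axis=0)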

C.1 Bounding i=1kM^i1tiϕtzt2\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\|_{2}

Let tv\|\cdot\|_{\mathrm{tv}} denote the total variation norm.

Definition C.1 (β\beta-mixing).

Let {Zt}t=1,2,\{Z_{t}\}_{t=1,2,\ldots} be a stochastic process. Denote by Z1:tZ_{1:t} the collection (Z1,,Zt)(Z_{1},\ldots,Z_{t}), where we allow t=t=\infty. Let σ(Zi:j)\sigma(Z_{i:j}) denote the sigma-algebra generated by Zi:jZ_{i:j} (iji\leq j). The kthk^{\rm th} β\beta-mixing coefficient of {Zt}\{Z_{t}\}, βk\beta_{k}, is defined by

βk\displaystyle\beta_{k} =supt1𝔼supBσ(Zt+k:)|P(B|Z1:t)P(B)|\displaystyle=\sup_{t\geq 1}\mathbb{E}{\sup_{B\in\sigma(Z_{t+k:\infty})}|P(B|Z_{1:t})-P(B)|}
=supt1𝔼PZt+k:|Z1:t(|Z1:t)PZt+k:()tv.\displaystyle=\sup_{t\geq 1}\mathbb{E}{\|P_{Z_{t+k:\infty}|Z_{1:t}}(\cdot|Z_{1:t})-P_{Z_{t+k:\infty}}(\cdot)\|_{\mathrm{tv}}}\,.

{Zt}\{Z_{t}\} is said to be β\beta-mixing if βk0\beta_{k}\rightarrow 0 as kk\rightarrow\infty. In particular, we say that a β\beta-mixing process mixes at an exponential rate with parameters β¯,α,γ>0\overline{\beta},\alpha,\gamma>0 if βkβ¯exp(αkγ)\beta_{k}\leq\overline{\beta}\exp(-\alpha k^{\gamma}) holds for all k0k\geq 0.

Let XtX_{t} be the indicator vector for the state-action pair (xt,at)(x_{t},a_{t}) as in Lemma A.2. Note that the distribution of (xt+1,at+1)(x_{t+1},a_{t+1}) given (xt,at)(x_{t},a_{t}) can be written as 𝔼[Xt+1|Xt]\mathbb{E}[X_{t+1}|X_{t}]. Let HtH_{t} be the state-action transition matrix at time tt, let Hi:t=j=it1HjH_{i:t}=\prod_{j=i}^{t-1}H_{j}, and define Hi:i=IH_{i:i}=I. Then we have that 𝔼[Xt+k|X1:t]=Ht:t+kXt\mathbb{E}[X_{t+k}|X_{1:t}]=H_{t:t+k}^{\top}X_{t} and 𝔼[Xt+k]=H1:t+kν0\mathbb{E}[X_{t+k}]=H_{1:t+k}^{\top}\nu_{0}, where ν0\nu_{0} is the initial state distribution. Thus, under the uniform mixing Assumption  2.1, the kthk^{th} β\beta-mixing coefficient is bounded as:

βk\displaystyle\beta_{k} supt1𝔼j=kHt:t+jXtH1:t+jν01supt1𝔼j=kγjXtH1:tν012γk1γ.\displaystyle\leq\sup_{t\geq 1}\mathbb{E}\sum_{j=k}^{\infty}\|H_{t:t+j}^{\top}X_{t}-H_{1:t+j}^{\top}\nu_{0}\|_{1}\leq\sup_{t\geq 1}\mathbb{E}\sum_{j=k}^{\infty}\gamma^{j}\|X_{t}-H_{1:t}^{\top}\nu_{0}\|_{1}\leq\frac{2\gamma^{k}}{1-\gamma}\,.

We bound the noise terms using the independent blocks technique of Yu (1994). Recall that we partition each phase into 2m2m blocks of size bb. Thus, after kk phases we have a total of 2km2km blocks. Let \mathbb{P} denote the joint distribution of state-action pairs in odd blocks. Let i\mathcal{I}_{i} denote the set of indices in the ithi^{th} block, and let xi,aix_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}} denote the corresponding states and actions. We factorize the joint distribution according to blocks:

(x1,a1,x3,a3,,x2km1,a2km1)=\displaystyle\mathbb{P}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}},x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}},\ldots,x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}})= 1(x1,a1)×3(x3,a3|x1,a1)×\displaystyle\;\;\mathbb{P}_{1}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\mathbb{P}_{3}(x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}}|x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\cdots
×2km1(x2km1,a2km1|x2km3,a2km3).\displaystyle\times\mathbb{P}_{2km-1}(x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}}|x_{\mathcal{I}_{2km-3}},a_{\mathcal{I}_{2km-3}}).

Let ~i\widetilde{\mathbb{P}}_{i} be the marginal distribution over the variables in block ii, and let ~\widetilde{\mathbb{P}} be the product of marginals of odd blocks.

Corollary 2.7 of Yu (1994) implies that for any Borel-measurable set EE,

|(E)~(E)|(km1)βb,|\mathbb{P}(E)-\widetilde{\mathbb{P}}(E)|\leq(km-1)\beta_{b}, (C.5)

where βb\beta_{b} is the bthb^{th} β\beta-mixing coefficient of the process. The result follows since the size of the “gap” between successive blocks is bb; see Appendix E for more details.

Recall that our estimates w^πi\widehat{w}_{\pi_{i}} are based only on data in odd blocks in each phase. Let 𝔼~\widetilde{\mathbb{E}} denote the expectation w.r.t. the product-of-marginals distribution ~\widetilde{\mathbb{P}}. Then 𝔼~[M^i1tiϕtzt]=0\widetilde{\mathbb{E}}[\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}]=0 because for tit\in\mathcal{H}_{i} and under ~\widetilde{\mathbb{P}}, ztz_{t} is zero-mean given ϕt\phi_{t} and is independent of other feature vectors outside of the block. Furthermore, by Hoeffding’s inequality ~(|zt|/ba)2exp(2ba2)\widetilde{\mathbb{P}}(|z_{t}|/b\geq a)\leq 2\exp(-2ba^{2}). Since ϕt2CΦ\|\phi_{t}\|_{2}\leq C_{\Phi} and M^i122σ2\|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2} for large enough mm, we have that

~(M^i1ϕtzt22bσ2CΦa)2exp(2ba2).\widetilde{\mathbb{P}}(\|\widehat{M}_{i}^{-1}\phi_{t}z_{t}\|_{2}\geq 2b\sigma^{-2}C_{\Phi}a)\leq 2\exp(-2ba^{2}).

Since, under \widetilde{\mathbb{P}}, the vectors \widehat{M}_{i}^{-1}\phi_{t}z_{t} are norm-subGaussian, Lemma A.4 implies that there exists a constant c_{1} such that for any \delta>0,

~(i=1kM^i1tiϕtzt22c1CΦσ2bkmlog(2d/δ))δ.\displaystyle\widetilde{\mathbb{P}}\left(\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\right\|_{2}\geq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{bkm\log(2d/\delta)}\right)\leq\delta\,.

Thus, using (C.5),

(i=1kM^i1tiϕtzt22c1CΦσ2bkmlog(2d/δ))δ+(km1)βb.\mathbb{P}\left(\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\right\|_{2}\geq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{bkm\log(2d/\delta)}\right)\leq\delta+(km-1)\beta_{b}\;.

Under Assumption 2.1, we have that βb2γb(1γ)1\beta_{b}\leq 2\gamma^{b}(1-\gamma)^{-1}. Setting δ=2kmγb(1γ)1\delta=2km\gamma^{b}(1-\gamma)^{-1} and solving for bb we get

b=log(2kmδ1(1γ)1)log(1/γ).\displaystyle b=\frac{\log(2km\delta^{-1}(1-\gamma)^{-1})}{\log(1/\gamma)}. (C.6)

Notice that when bb is chosen as in Eq. (C.6), the condition (C.3) is also satisfied. Plugging this into the previous display gives that with probability at least 12δ1-2\delta,

i=1kM^i1tiϕtzt22c1CΦσ2bkmlog(2d/δ).\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\right\|_{2}\leq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{bkm\log(2d/\delta)}.

C.2 Bounding i=1kM^i1tiϕt(JπiJ^πi)2\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})\|_{2}

Recall that the average-reward estimates J^πi\widehat{J}_{\pi_{i}} are computed using time indices corresponding to the starts of even blocks, 𝒯i{\mathcal{T}}_{i}. Thus this error term is only a function of the indices corresponding to block starts. Now let \mathbb{P} denote the distribution over state-action pairs (xt,at)(x_{t},a_{t}) for indices tt corresponding to block starts, i.e. t{1,b+1,2b+1,,(2km1)b+1}t\in\{1,b+1,2b+1,...,(2km-1)b+1\}. We again factorize the distribution over blocks as =122km\mathbb{P}=\mathbb{P}_{1}\otimes\mathbb{P}_{2}\otimes\cdots\otimes\mathbb{P}_{2km}. Let ~=~1~2~2km\widetilde{\mathbb{P}}=\widetilde{\mathbb{P}}_{1}\otimes\widetilde{\mathbb{P}}_{2}\otimes\cdots\otimes\widetilde{\mathbb{P}}_{2km} be a product-of-marginals distribution defined as follows. For odd jj, let ~j\widetilde{\mathbb{P}}_{j} be the marginal of \mathbb{P} over (xjb+1,ajb+1)(x_{jb+1},a_{jb+1}). For even jj in phase ii, let ~j=νπi\widetilde{\mathbb{P}}_{j}=\nu_{\pi_{i}} correspond to the stationary distribution of the corresponding policy πi\pi_{i}. Using arguments similar to independent blocks, we show in Appendix E that

~12(2km1)γb1.\|\mathbb{P}-\widetilde{\mathbb{P}}\|_{1}\leq 2(2km-1)\gamma^{b-1}.

Let \widetilde{\mathbb{E}} denote expectation w.r.t. the product-of-marginals distribution \widetilde{\mathbb{P}}. Then \widetilde{\mathbb{E}}[\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})]=0, since under \widetilde{\mathbb{P}}, \widehat{J}_{\pi_{i}} is an average of rewards at state-action pairs distributed according to \nu_{\pi_{i}}, and these state-action pairs are independent of the other data. Using a similar argument as in the previous section, for b=1+\frac{\log(4km/\delta)}{\log(1/\gamma)}, there exists a constant c_{2} such that with probability at least 1-2\delta,

i=1kM^i1tiϕt(JπiJ^πi)22c2CΦσ2kmlog(2d/δ).\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})\right\|_{2}\leq 2c_{2}C_{\Phi}\sigma^{-2}\sqrt{km\log(2d/\delta)}\;.

C.3 Bounding M^i12\|\widehat{M}_{i}^{-1}\|_{2}

In this subsection, we show that with probability at least 1-\delta, for m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta), we have \|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2}.

Let \Phi be a |\mathcal{X}||\mathcal{A}|\times d matrix of all features. Let D_{i}={\rm diag}(\nu_{\pi_{i}}), and let \widehat{D}_{i}={\rm diag}\big(\frac{1}{m}\sum_{t\in\mathcal{H}_{i}}X_{t}\big), where X_{t} is a state-action indicator as in Lemma A.2. Let M_{i}=\Phi^{\top}D_{i}\Phi+\alpha m^{-1}I. We can write \widehat{M}_{i}^{-1} as

\displaystyle\widehat{M}_{i}^{-1} =(\Phi^{\top}D_{i}\Phi+\alpha m^{-1}I+\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi)^{-1}
\displaystyle=(M_{i}+\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi)^{-1}
\displaystyle=(I+M_{i}^{-1}\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi)^{-1}M_{i}^{-1}

By Assumption 6.2 and 6.1, Mi12σ2\|M_{i}^{-1}\|_{2}\leq\sigma^{-2}. In Appendix C.4, we show that w.p. at least 1δ1-\delta,

Φ(D^iDi)Φ26m1/2CΦ2(1γ)12log(d/δ)\displaystyle\|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2}\leq 6m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)}

Thus

M^i12σ2(1σ26m1/2CΦ2(1γ)12log(d/δ))1\displaystyle\|\widehat{M}_{i}^{-1}\|_{2}\leq\sigma^{-2}(1-\sigma^{-2}6m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)})^{-1}

For m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta), the above norm is upper-bounded by \|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2}.

C.4 Bounding \|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2}

For any matrix AA,

ΦAΦ2=ijAijϕiϕj2i,j|Aij|ϕiϕj2CΦ2i,j|Aij|=CΦ2A1,1.\displaystyle\|\Phi^{\top}A\Phi\|_{2}=\bigg{\|}\sum_{ij}A_{ij}\phi_{i}\phi_{j}^{\top}\bigg{\|}_{2}\leq\sum_{i,j}|A_{ij}|\|\phi_{i}\phi_{j}^{\top}\|_{2}\leq C_{\Phi}^{2}\sum_{i,j}|A_{ij}|=C_{\Phi}^{2}\|A\|_{1,1}\,. (C.7)

where A1,1\|A\|_{1,1} denotes the sum of absolute entries of AA. Using the same notation for XtX_{t} as in Lemma A.2,

\displaystyle\|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2} =\bigg{\|}\frac{1}{m}\sum_{t\in\mathcal{H}_{i}}\Phi^{\top}{\rm diag}(X_{t}-\nu_{t}+\nu_{t}-\nu_{\pi_{i}})\Phi\bigg{\|}_{2}
\displaystyle\leq\frac{1}{m}\bigg{\|}\sum_{t\in\mathcal{H}_{i}}\Phi^{\top}{\rm diag}(X_{t}-\nu_{t})\Phi\bigg{\|}_{2}+\frac{C_{\Phi}^{2}}{m}\sum_{t\in\mathcal{H}_{i}}\|\nu_{t}-\nu_{\pi_{i}}\|_{1}\,.

Under the fast-mixing Assumption 2.1, the second term is bounded by 2C_{\Phi}^{2}m^{-1}(1-\gamma)^{-1}.

For the first term, we can define a martingale (B_{j})_{j=0}^{m} analogous to the Doob martingale in Lemma A.2, but restricted to the m indices in \mathcal{H}_{i}. Note that \sum_{t\in\mathcal{H}_{i}}\Phi^{\top}{\rm diag}(X_{t}-\nu_{t})\Phi=\Phi^{\top}{\rm diag}(B_{m}-B_{0})\Phi, so we can apply the matrix Azuma inequality (Theorem A.3) to this sum. Given that the squared martingale differences satisfy

\displaystyle\|(\Phi^{\top}{\rm diag}(B_{j}-B_{j-1})\Phi)^{2}\|_{2}\leq 4C_{\Phi}^{4}(1-\gamma)^{-2},

combining the two terms, we have that with probability at least 1δ1-\delta,

Φ(D^iDi)Φ2\displaystyle\|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2} 4m1/2CΦ2(1γ)12log(d/δ)+2m1CΦ2(1γ)1\displaystyle\leq 4m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)}+2m^{-1}C_{\Phi}^{2}(1-\gamma)^{-1}
6m1/2CΦ2(1γ)12log(d/δ).\displaystyle\leq 6m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)}\,.

Appendix D Bounding \mathbb{E}_{x\sim\mu_{*}}[\widehat{V}_{K}(x)-V_{K}(x)]

We write the value function error as follows:

𝔼xμ[V^K(x)VK(x)]\displaystyle\mathbb{E}_{x\sim\mu*}[\widehat{V}_{K}(x)-V_{K}(x)] =xμ(x)aϕ(x,a)1Ki=1Kπi(a|x)(w^πiwπi)\displaystyle=\sum_{x}\mu_{*}(x)\sum_{a}\phi(x,a)^{\top}\frac{1}{K}\sum_{i=1}^{K}\pi_{i}(a|x)(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})
1Kxμ(x)aϕ(x,a)2i=1Kπi(a|x)(w^πiwπi)2\displaystyle\leq\frac{1}{K}\sum_{x}\mu_{*}(x)\sum_{a}\|\phi(x,a)\|_{2}\left\|\sum_{i=1}^{K}\pi_{i}(a|x)(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})\right\|_{2}

Note that for any set of scalars {pi}i=1K\{p_{i}\}_{i=1}^{K} with pi[0,1]p_{i}\in[0,1], the term i=1Kpi(w^πiwπi)2\left\|\sum_{i=1}^{K}p_{i}(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})\right\|_{2} has the same upper bound as i=1K(w^πiwπi)2\|\sum_{i=1}^{K}(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})\|_{2}. The reason is as follows. One part of the error includes bias terms (C.2) and (C.4), whose upper bounds are only smaller when reweighted by scalars in [0,1][0,1]. Thus we can simply upper-bound the bias by setting all {pi}i=1K\{p_{i}\}_{i=1}^{K} to 1. Another part of the error, analyzed in Appendices C.1 and C.2 involves sums of norm-subGaussian vectors. In this case, applying the weights only results in these vectors potentially having smaller norm bounds. We keep the same bounds for simplicity, again corresponding to all {pi}i=1K\{p_{i}\}_{i=1}^{K} equal to 1. Thus, reusing the results of the previous section, we have

𝔼xμ[V^K(x)VK(x)]\displaystyle\mathbb{E}_{x\sim\mu*}[\widehat{V}_{K}(x)-V_{K}(x)] CΦ|𝒜|cσ2(Cw+CΦ)blog(2d/δ)Km.\displaystyle\leq C_{\Phi}|\mathcal{A}|c\sigma^{-2}(C_{w}+C_{\Phi})b\sqrt{\frac{\log(2d/\delta)}{Km}}.

Appendix E Independent Blocks

Blocks. Recall that we partition each phase into 2m2m blocks of size bb. Thus, after kk phases we have a total of 2km2km blocks. Let \mathbb{P} denote the joint distribution of state-action pairs in odd blocks. Let i\mathcal{I}_{i} denote the set of indices in the ithi^{th} block, and let xi,aix_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}} denote the corresponding states and actions. We factorize the joint distribution according to blocks:

(x1,a1,x3,a3,,x2km1,a2km1)=\displaystyle\mathbb{P}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}},x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}},\ldots,x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}})= 1(x1,a1)×3(x3,a3|x1,a1)×\displaystyle\;\;\mathbb{P}_{1}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\mathbb{P}_{3}(x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}}|x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\cdots
×2km1(x2km1,a2km1|x2km3,a2km3).\displaystyle\times\mathbb{P}_{2km-1}(x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}}|x_{\mathcal{I}_{2km-3}},a_{\mathcal{I}_{2km-3}}).

Let ~i\widetilde{\mathbb{P}}_{i} be the marginal distribution over the variables in block ii, and let ~\widetilde{\mathbb{P}} be the product of marginals. Then the difference between the distributions ~\widetilde{\mathbb{P}} and \mathbb{P} can be written as

~=\displaystyle\mathbb{P}-\widetilde{\mathbb{P}}= 132km11~3~2km1\displaystyle\;\;{\mathbb{P}_{1}}\otimes\mathbb{P}_{3}\otimes\cdots\otimes\mathbb{P}_{2km-1}-{\mathbb{P}_{1}}\otimes\widetilde{\mathbb{P}}_{3}\cdots\otimes\widetilde{\mathbb{P}}_{2km-1}
=\displaystyle= 1(3~3)52km1\displaystyle\;\;{\mathbb{P}_{1}}\otimes(\mathbb{P}_{3}-\widetilde{\mathbb{P}}_{3})\otimes\mathbb{P}_{5}\otimes\cdots\otimes\mathbb{P}_{2km-1}
+1~3(5~5)72km1\displaystyle+\mathbb{P}_{1}\otimes\widetilde{\mathbb{P}}_{3}\otimes(\mathbb{P}_{5}-\widetilde{\mathbb{P}}_{5})\otimes\mathbb{P}_{7}\otimes\ldots\otimes\mathbb{P}_{2km-1}
+\displaystyle+\cdots
+1~3~5~2km3(2km1~2km1).\displaystyle+{\mathbb{P}}_{1}\otimes\widetilde{\mathbb{P}}_{3}\otimes\widetilde{\mathbb{P}}_{5}\otimes\cdots\otimes\widetilde{\mathbb{P}}_{2km-3}\otimes(\mathbb{P}_{2km-1}-\widetilde{\mathbb{P}}_{2km-1}).

Under β\beta-mixing, since the gap between the blocks is of size bb, we have that

\displaystyle\|\mathbb{P}_{i}(x_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}}|x_{\mathcal{I}_{i-2}},a_{\mathcal{I}_{i-2}})-\widetilde{\mathbb{P}}_{i}(x_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}})\|_{1}\leq\beta_{b}\leq\frac{2\gamma^{b}}{1-\gamma}.

Thus the difference between the joint distribution and the product of marginals is bounded as

~1(km1)βb.\displaystyle\|\mathbb{P}-\widetilde{\mathbb{P}}\|_{1}\leq(km-1)\beta_{b}.

Block starts. Now let \mathbb{P} denote the distribution over state-action pairs (xt,at)(x_{t},a_{t}) for indices tt corresponding to block starts, i.e. t{1,b+1,2b+1,,(2km1)b+1}t\in\{1,b+1,2b+1,...,(2km-1)b+1\}. We again factorize the distribution over blocks:

\displaystyle\mathbb{P}(x_{1},a_{1},x_{b+1},a_{b+1},\ldots,x_{(2km-1)b+1},a_{(2km-1)b+1})=\mathbb{P}_{1}(x_{1},a_{1})\prod_{j=2}^{2km}\mathbb{P}_{j}(x_{jb+1},a_{jb+1}|x_{(j-1)b+1},a_{(j-1)b+1}).

Define a product-of-marginals distribution ~=~1~2~2km\widetilde{\mathbb{P}}=\widetilde{\mathbb{P}}_{1}\otimes\widetilde{\mathbb{P}}_{2}\otimes\cdots\otimes\widetilde{\mathbb{P}}_{2km} over the block-start variables as follows. For odd jj, let ~j\widetilde{\mathbb{P}}_{j} be the marginal of \mathbb{P} over (xjb+1,ajb+1)(x_{jb+1},a_{jb+1}). For even jj in phase ii, let ~j=νπi\widetilde{\mathbb{P}}_{j}=\nu_{\pi_{i}} correspond to the stationary distribution of the policy πi\pi_{i}. Using the same notation as in Appendix A, let XtX_{t} be the indicator vector for (xt,at)(x_{t},a_{t}) and let Hi:jH_{i:j} be the product of state-action transition matrices at times i+1,,ji+1,...,j. For odd blocks jj, we have

j(|x(j1)b+1,a(j1)b+1)~j()1=H(j1)b+1:jb(X(j1)b+1~j1)12γb1.\displaystyle\|\mathbb{P}_{j}(\cdot|x_{(j-1)b+1},a_{(j-1)b+1})-\widetilde{\mathbb{P}}_{j}(\cdot)\|_{1}=\|H_{(j-1)b+1:jb}^{\top}(X_{(j-1)b+1}-\widetilde{\mathbb{P}}_{j-1})\|_{1}\leq 2\gamma^{b-1}\,.

Slightly abusing notation, let HπiH_{\pi_{i}} be the state-action transition matrix under policy πi\pi_{i}. For even blocks jj in phase ii, since they always follow an odd block in the same phase,

\displaystyle\|\mathbb{P}_{j}(\cdot|x_{(j-1)b+1},a_{(j-1)b+1})-\widetilde{\mathbb{P}}_{j}(\cdot)\|_{1}=\|(H_{\pi_{i}}^{b-1})^{\top}(X_{(j-1)b+1}-\nu_{\pi_{i}})\|_{1}\leq 2\gamma^{b-1}\,.

Thus, using a similar distribution decomposition as before, we have that ~12(2km1)γb1\|\mathbb{P}-\widetilde{\mathbb{P}}\|_{1}\leq 2(2km-1)\gamma^{b-1}.