
Improved Regret Bound and Experience Replay in Regularized Policy Iteration

Nevena Lazić (DeepMind), Dong Yin (DeepMind), Yasin Abbasi-Yadkori (DeepMind), Csaba Szepesvári (DeepMind and University of Alberta)
Abstract

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.

1 Introduction

Model-free reinforcement learning (RL) algorithms combined with powerful function approximation have achieved impressive performance in a variety of application domains over the last decade. Unfortunately, the theoretical understanding of such methods is still quite limited. In this work, we study single-trajectory learning in infinite-horizon undiscounted Markov decision processes (MDPs), also known as average-reward MDPs, which capture tasks such as routing and the control of physical systems.

One line of work with performance guarantees for the average-reward setting follows the “online MDP” approach proposed by Even-Dar et al. (2009), where the agent selects policies by running an online learning algorithm in each state, typically mirror descent. The resulting algorithm is a version of approximate policy iteration (API), which alternates between (1) estimating the state-action value function (or Q-function) of the current policy and (2) setting the next policy to be optimal w.r.t. the sum of all previous Q-functions plus a regularizer. Note that, by contrast, standard API sets the next policy based only on the most recent Q-function. The policy update can also be written as maximizing the most recent Q-function minus the KL-divergence to the previous policy, which is somewhat similar to recently popular versions of API (Schulman et al., 2015, 2017; Achiam et al., 2017; Abdolmaleki et al., 2018; Song et al., 2019a).

The original work of Even-Dar et al. (2009) studied this scheme with known dynamics, a tabular representation, and adversarial reward functions. More recent works (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Wei et al., 2020a) have adapted the approach to the case of unknown dynamics, stochastic rewards, and value function approximation. With linear value functions, the Politex algorithm of Abbasi-Yadkori et al. (2019) achieves $O(T^{3/4})$ high-probability regret in ergodic MDPs, and the bound scales with the number of features rather than the number of states. Wei et al. (2020a) later show an $O(\sqrt{T})$ bound on expected regret for a similar algorithm named MDP-EXP2. In this work, we revisit the analysis of Politex and show that it can be sharpened to $O(\sqrt{T})$ under nearly identical assumptions, resulting in the first $O(\sqrt{T})$ high-probability regret bound for a computationally efficient algorithm in this setting.

In addition to the improved analysis, our work also addresses the practical implementation of Politex with neural networks. The policies produced by Politex in each iteration require access to the sum of all previous Q-function estimates. With neural network function approximation, an exact implementation requires us to keep all past networks in memory and evaluate them at each step, which is inefficient in terms of memory and computation. Practical implementation choices include subsampling the Q-functions and/or optimizing a KL-divergence regularized objective w.r.t. a parametric policy at each iteration. We propose an alternative approach, where we approximate the average of all past Q-functions by training a single network on a replay buffer with past data. We demonstrate that this choice often outperforms other approximate implementations, especially in terms of run-time. When available memory is constrained, we propose to subsample transitions using the notion of coresets (Bachem et al., 2017).

Our work also provides a novel perspective on the benefits of experience replay. Experience replay is a standard tool for stabilizing learning in modern deep RL, and is typically used in off-policy methods like Deep Q-Networks (Mnih et al., 2013), as well as “value gradient” methods such as DDPG (Lillicrap et al., 2016) and SVG (Heess et al., 2015). A different line of on-policy methods typically does not rely on experience replay; instead, learning is stabilized by constraining consecutive policies to be close in terms of KL divergence (Schulman et al., 2015; Song et al., 2019a; Degrave et al., 2019). We observe that both experience replay and KL-divergence regularization can be viewed as approximate implementations of Politex. Thus, we provide a theoretical justification for using experience replay in API, as an approximate implementation of online learning in each state. Note that this online-learning view differs from the commonly used justification for experience replay, namely that it “breaks temporal correlations” (Schaul et al., 2016; Mnih et al., 2013). Our analysis also suggests a new objective for subsampling or priority-sampling transitions in the replay buffer, which differs from the priority-sampling objectives of previous work (Schaul et al., 2016).

In summary, our main contributions are (1) an improved analysis of Politex, showing an $O(\sqrt{T})$ regret bound under the same assumptions as the original work, and (2) an efficient implementation that also offers a new perspective on the benefits of experience replay.

2 Setting and Notation

We consider learning in infinite-horizon undiscounted ergodic MDPs $(\mathcal{X},\mathcal{A},r,P)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is a finite action space, $r:\mathcal{X}\times\mathcal{A}\to[0,1]$ is an unknown reward function, and $P:\mathcal{X}\times\mathcal{A}\to\Delta_{\mathcal{X}}$ is the unknown probability transition function. A policy $\pi:\mathcal{X}\to\Delta_{\mathcal{A}}$ is a mapping from a state to a distribution over actions. Let $\{(x_{t}^{\pi},a_{t}^{\pi})\}_{t=1}^{\infty}$ denote the state-action sequence obtained by following policy $\pi$. The expected average reward of policy $\pi$ is defined as

J_{\pi}:=\lim_{T\to\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}r(x_{t}^{\pi},a_{t}^{\pi})\right].   (2.1)

Let $\mu_{\pi}$ denote the stationary state distribution of a policy $\pi$, satisfying $\mu_{\pi}=\mathbb{E}_{x\sim\mu_{\pi},a\sim\pi}[P(\cdot|x,a)]$. We will sometimes write $\mu_{\pi}$ as a vector, and use $\nu_{\pi}=\mu_{\pi}\otimes\pi$ to denote the stationary state-action distribution. In ergodic MDPs, $J_{\pi}$ and $\mu_{\pi}$ are well-defined and independent of the initial state. The optimal policy $\pi_{*}$ is a policy that maximizes the expected average reward. We will denote by $J_{*}$ and $\mu_{*}$ the expected average reward and stationary state distribution of $\pi_{*}$, respectively.

The value function of a policy $\pi$ is defined as:

V_{\pi}(x):=\mathbb{E}\left[\sum_{t=1}^{\infty}\big(r(x_{t}^{\pi},a_{t}^{\pi})-J_{\pi}\big)\,\Big|\,x_{1}^{\pi}=x\right].   (2.2)

The state-action value function $Q_{\pi}(x,a)$ is defined as

Q_{\pi}(x,a):=r(x,a)-J_{\pi}+\sum_{x^{\prime}}P(x^{\prime}|x,a)V_{\pi}(x^{\prime}).   (2.3)

Notice that

V_{\pi}(x)=\sum_{a}\pi(a|x)Q_{\pi}(x,a).   (2.4)

Equations (2.3) and (2.4) are known as the Bellman equations. If, instead of using the definition of $V_{\pi}(x)$ in Eq. (2.2), we solve for $V_{\pi}(x)$ and $Q_{\pi}(x,a)$ using the Bellman equations, the solutions are unique only up to an additive constant. Therefore, in the following, we may use $V_{\pi}(x)$ and $Q_{\pi}(x,a)$ to denote the same function up to a constant. We will use the shorthand notation $Q_{\pi}(x,\pi^{\prime})=\mathbb{E}_{a\sim\pi^{\prime}(\cdot|x)}[Q_{\pi}(x,a)]$; note that $Q_{\pi}(x,\pi)=V_{\pi}(x)$.

The agent interacts with the environment as follows: at each round $t$, the agent observes a state $x_{t}\in\mathcal{X}$, chooses an action $a_{t}\sim\pi_{t}(\cdot|x_{t})$, and receives a reward $r_{t}:=r(x_{t},a_{t})$. The environment then transitions to the next state $x_{t+1}$ with probability $P(x_{t+1}|x_{t},a_{t})$. Recall that $\pi_{*}$ is the optimal policy and $J_{*}$ is its expected average reward. The regret of an algorithm with respect to this fixed policy is defined as

R_{T}:=\sum_{t=1}^{T}\big(J_{*}-r(x_{t},a_{t})\big).   (2.5)

The learning goal is to find an algorithm that minimizes the long-term regret $R_{T}$.

Our analysis will require the following assumption on the mixing rate of policies.

Assumption 2.1 (Uniform mixing).

Let $H_{\pi}$ be the state-action transition matrix of a policy $\pi$. Let $\gamma(H_{\pi})$ be the corresponding ergodicity coefficient (Seneta, 1979), defined as

\gamma(H_{\pi}):=\max_{z:\,z^{\top}{\bf 1}=0,\,\|z\|_{1}=1}\|z^{\top}H_{\pi}\|_{1}.

We assume that there exists a scalar $\gamma<1$ such that for any policy $\pi$, $\gamma(H_{\pi})\leq\gamma<1$.

Assumption 2.1 implies that for any pair of distributions $\nu,\nu^{\prime}$, $\|(\nu-\nu^{\prime})^{\top}H_{\pi}\|_{1}\leq\gamma\|\nu-\nu^{\prime}\|_{1}$; see Lemma A.1 in Appendix A for a proof.
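As a concrete illustration, the following is a minimal numerical sketch (the transition matrix below is purely illustrative) that computes the ergodicity coefficient via its standard closed form, half the largest $\ell_{1}$ distance between rows, and checks the contraction property above on random pairs of distributions:

import numpy as np

def ergodicity_coefficient(H):
    # gamma(H) = max_{z: z^T 1 = 0, ||z||_1 = 1} ||z^T H||_1, which equals
    # half the largest l1 distance between any two rows of H.
    n = H.shape[0]
    return max(0.5 * np.abs(H[i] - H[j]).sum() for i in range(n) for j in range(n))

# A small illustrative 3-state transition matrix (rows sum to 1).
H = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
gamma = ergodicity_coefficient(H)

# Check the contraction property ||(nu - nu')^T H||_1 <= gamma * ||nu - nu'||_1.
rng = np.random.default_rng(0)
for _ in range(1000):
    nu, nu_p = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    assert np.abs((nu - nu_p) @ H).sum() <= gamma * np.abs(nu - nu_p).sum() + 1e-12
print("gamma(H) =", gamma)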

3 Related Work

Regret bounds for average-reward MDPs. Most no-regret algorithms for infinite-horizon undiscounted MDPs are model-based and only applicable to tabular representations (Bartlett, 2009; Jaksch et al., 2010; Ouyang et al., 2017; Fruit et al., 2018; Jian et al., 2019; Talebi and Maillard, 2018). For weakly-communicating MDPs with diameter $D$, these algorithms nearly achieve the minimax lower bound $\Omega(\sqrt{D|\mathcal{X}||\mathcal{A}|T})$ (Jaksch et al., 2010) with high probability. Wei et al. (2020b) provide model-free algorithms with regret bounds in the tabular setting. In the model-free setting with function approximation, the Politex algorithm (Abbasi-Yadkori et al., 2019) achieves $O(d^{1/2}T^{3/4})$ regret in uniformly ergodic MDPs, where $d$ is the size of the compressed state-action space. Hao et al. (2020) improve these results to $O(T^{2/3})$. More recently, Wei et al. (2020a) present three algorithms for average-reward MDPs with linear function approximation. Among these, FOPO achieves $O(\sqrt{T})$ regret but is computationally inefficient, and OLSVI.FH is efficient but obtains $O(T^{3/4})$ regret. The MDP-EXP2 algorithm is computationally efficient, and under similar assumptions as in Abbasi-Yadkori et al. (2019) it obtains an $O(\sqrt{T})$ bound on expected regret (a weaker guarantee than the high-probability bounds in other works). Our analysis shows a high-probability $O(\sqrt{T})$ regret bound under the same assumptions.

KL-regularized approximate policy iteration. Our work is also related to approximate policy iteration algorithms which constrain each policy to be close to the previous policy in the sense of KL divergence. This approach was popularized by TRPO (Schulman et al., 2015), where it was motivated as an approximate implementation of conservative policy iteration (Kakade and Langford, 2002). Some of the related subsequent works include PPO (Schulman et al., 2017), MPO (Abdolmaleki et al., 2018), V-MPO (Song et al., 2019a), and CPO (Achiam et al., 2017). While these algorithms place a constraint on consecutive policies and are mostly heuristic, another line of research shows that using KL divergence as a regularizer has a theoretical justification in terms of either regret guarantees (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Wei et al., 2020b, a) or error propagation (Vieillard et al., 2020a, b).

Experience replay. Experience replay (Lin, 1992) is one of the central tools for achieving good performance in deep reinforcement learning. While it is mostly used in off-policy methods such as deep Q-learning (Mnih et al., 2013, 2015), it has also shown benefits in value-gradient methods (Heess et al., 2015; Lillicrap et al., 2016), and has been used in some variants of KL-regularized policy iteration (Abdolmaleki et al., 2018; Tomar et al., 2020). Its success has been attributed to removing some temporal correlations from data fed to standard gradient-based optimization algorithms. Schaul et al. (2016) have shown that non-uniform replay sampling based on the Bellman error can improve performance. Unlike these works, we motivate experience replay from the perspective of online learning in MDPs (Even-Dar et al., 2009) with the goal of approximating the average of past value functions well.

Continual learning. Continual learning (CL) is the paradigm of learning a classifier or regressor that performs well on a set of tasks, where each task corresponds to a different data distribution. The tasks are observed sequentially, and the learning goal is to avoid forgetting past tasks without storing all the data in memory. This is quite similar to our goal of approximating the average of sequentially-observed Q-functions, where the data for approximating each Q-function has a different distribution. In general, approaches to CL can be categorized as regularization-based (Kirkpatrick et al., 2017; Zenke et al., 2017; Farajtabar et al., 2020; Yin et al., 2020), expansion-based (Rusu et al., 2016), and replay-based (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2018; Borsos et al., 2020), with the approaches based on experience replay typically having superior performance over other methods.

4 Algorithm

Our algorithm is similar to the Politex schema (Abbasi-Yadkori et al., 2019). In each phase $k$, Politex obtains an estimate $\widehat{Q}_{\pi_{k}}$ of the action-value function $Q_{\pi_{k}}$ of the current policy $\pi_{k}$, and then sets the next policy using the mirror descent update rule:

\pi_{k+1}(\cdot|x) = \mathop{\mathrm{argmax}}_{\pi}\ \widehat{Q}_{\pi_{k}}(x,\pi)-\eta^{-1}D_{KL}(\pi\,\|\,\pi_{k}(\cdot|x))
                   = \mathop{\mathrm{argmax}}_{\pi}\ \sum_{i=1}^{k}\widehat{Q}_{\pi_{i}}(x,\pi)+\eta^{-1}\mathcal{H}(\pi)
                   \propto \exp\bigg(\eta\sum_{i=1}^{k}\widehat{Q}_{\pi_{i}}(x,\cdot)\bigg),   (4.1)

where $\mathcal{H}(\cdot)$ is the entropy function. When the functions $\{\widehat{Q}_{\pi_{i}}\}_{i=1}^{k}$ are tabular or linear, the above update can be implemented efficiently by simply summing all the table entries or weight vectors. However, with neural network function approximation, we need to keep all networks in memory and evaluate them at each step, which quickly becomes inefficient in terms of storage and computation. Efficient implementations proposed in the literature include keeping a subset of the action-value functions (Abbasi-Yadkori et al., 2019), and using a parametric policy and optimizing the KL-regularized objective w.r.t. the parameters over the available data (Tomar et al., 2020).

Our proposed method, presented in Algorithm 1, attempts to directly approximate the average of all previous action-value functions

Q_{k}(x,a):=\frac{1}{k}\sum_{i=1}^{k}Q_{\pi_{i}}(x,a).

To do so, we only use a single network, and continually train it to approximate $Q_{k}$. At each iteration $k$, we obtain a dataset of tuples $\mathcal{D}_{k}=\{(x_{t},a_{t},R_{t})\}$, where $R_{t}$ is the empirical return from the state-action pair $(x_{t},a_{t})$. We initialize $\widehat{Q}_{k}$ to $\widehat{Q}_{k-1}$ and update it by minimizing the squared error over the union of $\mathcal{D}_{k}$ and the replay buffer $\mathcal{R}$. We then update the replay buffer with all or a subset of the data in $\mathcal{D}_{k}$.

The remainder of the paper is organized as follows. In Section 5, we show that by focusing on estimating the average Q-function, we can improve the regret bound of Politex under nearly identical assumptions. In Section 6, we instantiate this bound for linear value functions, where we can estimate the average Q-function simply by averaging weights. In Section 7, we focus on practical implementations: we discuss the limitations of weight averaging and provide details on how to leverage replay data with non-linear function approximation. In Section 8 we present our experimental results, and in Section 9 we make final remarks.

Algorithm 1 Schema for policy iteration with replay
1:  Input: phase length $\tau$, num. phases $K$, parameter $\eta$
2:  Initialize: $\pi_{1}(a|x)=1/|\mathcal{A}|$, empty replay buffer $\mathcal{R}$
3:  for $k=1,\ldots,K$ do
4:     Execute $\pi_{k}$ for $\tau$ time steps and collect data $\mathcal{D}_{k}$
5:     Compute $\widehat{Q}_{k}$, an estimate of $Q_{k}=\frac{1}{k}\sum_{i=1}^{k}Q_{\pi_{i}}$, from data $\mathcal{D}_{k}$ and replay $\mathcal{R}$
6:     Set $\pi_{k+1}(a|x)\propto\exp\big(\eta k\widehat{Q}_{k}(x,a)\big)$
7:     Update replay $\mathcal{R}$ with $\mathcal{D}_{k}$
8:  end for
9:  Output: $\pi_{K+1}$
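For concreteness, the following is a minimal Python sketch of this schema. The environment interface (reset/step returning the next state and a reward) and the routine fit_average_q, which is assumed to return a callable approximating $Q_{k}=\frac{1}{k}\sum_{i=1}^{k}Q_{\pi_{i}}$ from the replay buffer and the current phase's data, are illustrative placeholders rather than part of the algorithm itself:

import numpy as np

def softmax_policy(q_values, scale):
    # pi(a|x) proportional to exp(scale * Qhat(x, a)); q_values has shape (num_actions,).
    z = scale * q_values
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def policy_iteration_with_replay(env, fit_average_q, tau, num_phases, eta, num_actions, rng):
    replay = []
    q_hat = lambda x: np.zeros(num_actions)  # phase 1: pi_1 is uniform
    x = env.reset()
    for k in range(1, num_phases + 1):
        data = []
        for _ in range(tau):  # line 4: execute pi_k for tau steps
            probs = softmax_policy(q_hat(x), eta * (k - 1))
            a = rng.choice(num_actions, p=probs)
            x_next, r = env.step(a)
            data.append((x, a, r))
            x = x_next
        q_hat = fit_average_q(replay, data, q_hat)  # line 5: estimate Q_k
        replay.extend(data)                         # line 7: here we keep all data
    return q_hat  # final policy: softmax_policy(q_hat(x), eta * num_phases)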

5 Regret Analysis of Politex

In this section, we revisit the regret analysis of Politex (Abbasi-Yadkori et al., 2019) in ergodic average-reward MDPs, and show that it can be improved from $O(T^{3/4})$ to $O(\sqrt{T})$ under similar assumptions. Our analysis relies in part on a simple modification of the regret decomposition. Namely, instead of including the estimation error of each value function $Q_{\pi_{k}}$ in the regret, we consider the error in the running average $Q_{k}$. When this error scales as $O(1/\sqrt{k})$ in a particular weighted norm, the regret of Politex is $O(\sqrt{T})$. As we show in the following section, this bound can be instantiated for linear value functions under the same assumptions as Abbasi-Yadkori et al. (2019).

Assumption 5.1 (Boundedness).

Let $\widehat{Q}_{\pi_{k}}:=k\widehat{Q}_{k}-(k-1)\widehat{Q}_{k-1}$. We assume that there exists a constant $Q_{\max}$ such that for all $k=1,\ldots,K$ and for all $x\in\mathcal{X}$,

\max_{a}\widehat{Q}_{\pi_{k}}(x,a)-\min_{a}\widehat{Q}_{\pi_{k}}(x,a)\leq Q_{\max}.

For the purpose of our algorithm, the functions $\widehat{Q}_{\pi_{k}}$ are unique up to a constant. Thus we can equivalently assume that $\|\widehat{Q}_{\pi_{k}}(x,\cdot)\|_{\infty}\leq Q_{\max}$ for all $x$.

Define $\widehat{V}_{\pi_{k}}(x):=\widehat{Q}_{\pi_{k}}(x,\pi_{k})$. Let $V_{k}(x):=\frac{1}{k}\sum_{i=1}^{k}V_{\pi_{i}}(x)$ and $\widehat{V}_{k}(x):=\frac{1}{k}\sum_{i=1}^{k}\widehat{V}_{\pi_{i}}(x)$ be the average of the state-value functions and its estimate, respectively. We will require the estimation error of the running average to scale as in the following assumption.

Assumption 5.2 (Estimation error).

Let $\mu_{*}$ be the stationary state distribution of the optimal policy $\pi_{*}$. With probability at least $1-\delta$, for a problem-dependent constant $C$, the errors in $\widehat{Q}_{K}$ and $\widehat{V}_{K}$ are bounded as

\mathbb{E}_{x\sim\mu_{*}}[\widehat{V}_{K}(x)-V_{K}(x)] \leq C\sqrt{\log(1/\delta)/K}
\mathbb{E}_{x\sim\mu_{*}}[Q_{K}(x,\pi_{*})-\widehat{Q}_{K}(x,\pi_{*})] \leq C\sqrt{\log(1/\delta)/K}.

Define $S_{\delta}(|\mathcal{A}|,\mu_{*})$ as in Abbasi-Yadkori et al. (2019):

S_{\delta}(|\mathcal{A}|,\mu_{*}):=\sqrt{\frac{\log|\mathcal{A}|}{2}}+\Big\langle\mu_{*},\sqrt{\frac{1}{2}\log\frac{1}{\delta\mu_{*}}}\Big\rangle.

We bound the regret of Politex in the following theorem. Here, recall that $\tau$ is the length of each phase, and $\gamma$ and $\eta$ are defined in Assumption 2.1 and Eq. (4.1), respectively.

Theorem 5.3 (Regret of Politex).

Let Assumptions 2.1, 5.1, and 5.2 hold. For $\tau\geq\frac{\log T}{2\log(1/\gamma)}$ and $\eta=\frac{\sqrt{8\log|\mathcal{A}|}}{Q_{\max}\sqrt{K}}$, for a constant $C_{1}$, with probability at least $1-4\delta$, the regret of Politex in ergodic average-reward MDPs is bounded as

R_{T}\leq\frac{C_{1}(1+Q_{\max})S_{\delta}(|\mathcal{A}|,\mu_{*})\sqrt{\tau}}{(1-\gamma)^{2}}\sqrt{T}.
Proof.

We start by decomposing the cumulative regret, following similar steps as Abbasi-Yadkori et al. (2019):

R_{T}=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}\big[(J_{*}-J_{\pi_{k}})+(J_{\pi_{k}}-r_{t})\big].   (5.1)

The second term $V_{T}=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}(J_{\pi_{k}}-r_{t})$ captures the sum of differences between observed rewards and their long-term averages. In previous work, this term was shown to scale as $O(K\sqrt{\tau})$. We show that the analysis can in fact be tightened to $O(\sqrt{T})$ using improved concentration bounds and the slow-changing nature of the policies. See Lemma B.1 in Appendix B for precise details.

The first term, also called the pseudo-regret in the literature, measures the difference between the expected reward of the reference policy and that of the policies produced by the algorithm. Applying the performance difference lemma (Cao, 1999), we can write each pseudo-regret term as

J_{*}-J_{\pi_{k}} =\mathbb{E}_{x\sim\mu_{*}}\left[Q_{\pi_{k}}(x,\pi_{*})-Q_{\pi_{k}}(x,\pi_{k})\right].

Now, notice that Politex is running exact mirror descent for the loss functions $\widehat{Q}_{\pi_{k}}$. We bridge the pseudo-regret by the $\widehat{Q}_{\pi_{k}}$ terms:

R_{T1a} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\left[Q_{\pi_{k}}(x,\pi_{*})-\widehat{Q}_{\pi_{k}}(x,\pi_{*})\right]   (5.2)
R_{T1b} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\left[\widehat{Q}_{\pi_{k}}(x,\pi_{k})-Q_{\pi_{k}}(x,\pi_{k})\right]   (5.3)
R_{T2}  =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\left[\widehat{Q}_{\pi_{k}}(x,\pi_{*})-\widehat{Q}_{\pi_{k}}(x,\pi_{k})\right].   (5.4)

$R_{T2}$ can be bounded using the regret of mirror descent as in previous work (Abbasi-Yadkori et al., 2019). Setting $\eta=\frac{\sqrt{\log|\mathcal{A}|}}{Q_{\max}\tau\sqrt{2K}}$, and using a union bound over all states, with probability at least $1-\delta$, $R_{T2}$ is bounded as

R_{T2}\leq\tau Q_{\max}S_{\delta}(|\mathcal{A}|,\mu_{*})\sqrt{K}.

Bounding regret due to estimation error. We now focus on bounding $R_{T1}=R_{T1a}+R_{T1b}$ under Assumption 5.2. We have the following:

R_{T1a} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\big[Q_{\pi_{k}}(x,\pi_{*})-\widehat{Q}_{\pi_{k}}(x,\pi_{*})\big]
        =\tau K\,\mathbb{E}_{x\sim\mu_{*},a\sim\pi_{*}}\left[Q_{K}(x,a)-\widehat{Q}_{K}(x,a)\right]
        \leq C\tau\sqrt{K\log(1/\delta)}.
R_{T1b} =\tau\sum_{k=1}^{K}\mathbb{E}_{x\sim\mu_{*}}\big[\widehat{V}_{\pi_{k}}(x)-V_{\pi_{k}}(x)\big]
        =\tau K\,\mathbb{E}_{x\sim\mu_{*}}\big[\widehat{V}_{K}(x)-V_{K}(x)\big]
        \leq C\tau\sqrt{K\log(1/\delta)}.

We can then obtain the final result by combining the bounds on $R_{T1a}$, $R_{T1b}$, $R_{T2}$, and $V_{T}$ from Appendix B, and using a union bound as well as the fact that $K=T/\tau$. ∎

6 Linear Value Functions

In this section, we show that the estimation error condition in Assumption 5.2 (and thus $O(\sqrt{T})$ regret) can be achieved under similar assumptions as in Abbasi-Yadkori et al. (2019) and Wei et al. (2020a), which we state next.

Assumption 6.1 (Linear value functions).

The action-value function $Q_{\pi}$ of any policy $\pi$ is linear: $Q_{\pi}(x,a)=w_{\pi}^{\top}\phi(x,a)$, where $\phi:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}^{d}$ is a known feature function such that $\max_{x,a}\|\phi(x,a)\|\leq C_{\Phi}$.

Assumption 6.2 (Feature excitation).

There exists a constant $\sigma^{2}$ such that for any policy $\pi$,

\lambda_{\min}\left(\mathbb{E}_{(x,a)\sim\mu_{\pi}\otimes\pi}[\phi(x,a)\phi(x,a)^{\top}]\right)\geq\sigma^{2}>0.

We now describe a simple procedure for estimating the average action-value function $Q_{k}(x,a)=\phi(x,a)^{\top}w_{k}$, where $w_{k}=\frac{1}{k}\sum_{i=1}^{k}w_{\pi_{i}}$, such that the conditions of Assumption 5.2 are satisfied. Essentially, we estimate each $Q_{\pi_{i}}$ using least-squares Monte Carlo and then average the weights. We will use the shorthand notation $\phi_{t}=\phi(x_{t},a_{t})$. Let $\mathcal{H}_{i}$ and $\mathcal{T}_{i}$ be subsets of time indices in phase $i$ (defined later). We estimate $Q_{\pi_{i}}$ as follows:

\widehat{w}_{\pi_{i}} =\bigg(\sum_{t\in\mathcal{H}_{i}}\phi_{t}\phi_{t}^{\top}+\alpha I\bigg)^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}R_{t}   (6.1)

where $R_{t}$ are the empirical $b$-step returns ($b$ is specified later), computed as

R_{t}=\sum_{j=t}^{t+b}(r_{j}-\widehat{J}_{\pi_{i}}),\qquad\widehat{J}_{\pi_{i}}=\frac{1}{|\mathcal{T}_{i}|}\sum_{t\in\mathcal{T}_{i}}r_{t}.   (6.2)

We then estimate $w_{k}$ as $\widehat{w}_{k}=\frac{1}{k}\sum_{i=1}^{k}\widehat{w}_{\pi_{i}}$. Note that for this special case of linear value functions, we do not need to use the replay buffer in Algorithm 1. For analysis purposes, we divide each phase of length $\tau$ into $2m$ blocks of size $b$ and let $\mathcal{H}_{i}$ ($\mathcal{T}_{i}$) be the starting indices of the odd (even) blocks in phase $i$. Due to the gaps between indices and fast mixing, this makes the data almost independent (we make this precise in Appendix C) and the error easier to analyze. In practice, one may simply want to use all the data.
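A minimal NumPy sketch of this estimator is given below. For simplicity, it forms the $b$-step returns from all usable indices in a phase rather than the odd/even block split used only for the analysis; variable names are illustrative:

import numpy as np

def estimate_phase_weights(phis, rewards, b, alpha):
    # Least-squares Monte Carlo estimate of w_{pi_i} for one phase, as in Eqs. (6.1)-(6.2).
    # phis: (tau, d) array of features phi(x_t, a_t); rewards: (tau,) array of rewards r_t.
    tau, d = phis.shape
    J_hat = rewards.mean()                     # estimate of the average reward J_{pi_i}
    usable = tau - b                           # indices with a full b-step return
    returns = np.array([np.sum(rewards[t:t + b + 1] - J_hat) for t in range(usable)])
    Phi = phis[:usable]
    A = Phi.T @ Phi + alpha * np.eye(d)        # regularized design matrix
    return np.linalg.solve(A, Phi.T @ returns)

def average_weights(per_phase_weights):
    # In the linear case, averaging the weights averages the Q-functions:
    # what_k = (1/k) * sum_i what_{pi_i}.
    return np.mean(per_phase_weights, axis=0)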

For a distribution $\mu$, let $\|x\|_{\mu}$ denote the distribution-weighted norm such that $\|x\|_{\mu}^{2}=\sum_{i}\mu_{i}x_{i}^{2}$. Using Jensen’s inequality, we have that $(\mathbb{E}_{x\sim\mu}[q(x)])^{2}\leq\mathbb{E}_{x\sim\mu}[q(x)^{2}]$. Thus, it suffices to bound the $Q$-function error in the distribution-weighted norm, $\|Q_{K}-\widehat{Q}_{K}\|_{\mu_{*}\otimes\pi_{*}}$. Furthermore, given bounded features,

\|\widehat{Q}_{K}-Q_{K}\|_{\mu_{*}\otimes\pi_{*}} \leq C_{\Phi}\|\widehat{w}_{K}-w_{K}\|_{2},

so it suffices to bound the error in the weights. We bound this error in the following Lemma, proven in Appendix C.

Lemma 6.3 (Estimation error for linear functions).

Suppose that Assumptions 2.1, 6.1 and 6.2 hold and that the true action-value weights are bounded as $\|w_{\pi_{i}}\|_{2}\leq C_{w}$ for all $i=1,\ldots,K$. Then for any policy $\pi$, for $\alpha=\sqrt{\tau/K}$, $m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta)$, and $b\geq\frac{\log(T\delta^{-1}(1-\gamma)^{-1})}{\log(1/\gamma)}$, there exists an absolute constant $c$ such that with probability at least $1-\delta$,

\|\widehat{w}_{K}-w_{K}\|_{2}\leq c\sigma^{-2}(C_{w}+C_{\Phi})b\sqrt{\frac{\log(2d/\delta)}{Km}}.

Furthermore, in Appendix D, we show that the error in the average state-value function satisfies the following:

\mathbb{E}_{\mu_{*}}[\widehat{V}_{K}(x)-V_{K}(x)]\leq cC_{\Phi}|\mathcal{A}|(C_{w}+C_{\Phi})\frac{b}{\sigma^{2}}\sqrt{\frac{\log(2d/\delta)}{Km}}.

We have demonstrated that Assumption 5.2 can be satisfied with linear value functions. For Assumption 5.1, it suffices for the weight estimates $\{\widehat{w}_{\pi_{i}}\}$ to be bounded. This will be true, since we assume that the true weights are bounded and we can bound the error in the weight space. Thus Politex has an $O(\sqrt{T})$ regret in this setting (though note that we incur an extra $|\mathcal{A}|$ factor coming from the $\widehat{V}_{K}$ error bound).

7 Practical Implementation

As mentioned earlier, the key idea in our policy update is to obtain an estimate Q^k\widehat{Q}_{k} of the average of all the Q-functions in previous phases. We have seen that when we use linear functions to approximate the Q-functions, we can simply average the weights in order to get an estimate of the average Q-function. However, in practice, we often need to use non-linear functions, especially neural networks, to approximate complex Q-functions. In this section, we discuss how to efficiently implement our algorithm with non-linear function approximation.

7.1 Weight Averaging

The simplest idea may be to average the weights of the neural networks. However, a crucial difference from the linear setting is that averaging the weights of neural networks is not equivalent to averaging the functions that they represent: the function that a neural network represents is invariant to permutations of its hidden units, so two networks with very different weights can represent similar functions. Therefore, this implementation may only succeed when all the Q-function approximations lie in the same local region of weight space. Thus, when we learn the new Q-function $\widehat{Q}_{\pi_{k}}$ in phase $k$, we should initialize it with $\widehat{Q}_{k-1}$, run SGD with the new data, and then average with $\widehat{Q}_{k-1}$.
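A minimal sketch of one way to realize this averaging in parameter space is shown below; it assumes the phase-$k$ network was initialized from $\widehat{Q}_{k-1}$, that parameters are given as a list of arrays, and that the running-average weights $\frac{k-1}{k}$ and $\frac{1}{k}$ are used so that the result tracks $\frac{1}{k}\sum_{i=1}^{k}\widehat{Q}_{\pi_{i}}$:

def update_average_weights(avg_params, new_params, k):
    # Running average Qhat_k = ((k-1)/k) * Qhat_{k-1} + (1/k) * Qhat_{pi_k}, applied
    # parameter-wise. This only approximates averaging the functions themselves when
    # the networks stay in the same local region of weight space.
    return [((k - 1) * a + w) / k for a, w in zip(avg_params, new_params)]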

7.2 Experience Replay

Another natural idea is to leverage the data in the replay buffer to obtain an estimate of the average Q-function. We elaborate on the details below. We use a simple $b$-step Monte Carlo estimate for the Q-value of each state-action pair. For any $(i-1)\tau+1\leq t\leq i\tau-b$, we can estimate the state-action value of $(x_{t},a_{t})$ by the $b$-step cumulative reward (this is a practical implementation of Eq. (6.2), i.e., we do not split the data in each phase into blocks):

R_{t}=\sum_{j=t}^{t+b}(r_{j}-\widehat{J}_{\pi_{i}}),\qquad\widehat{J}_{\pi_{i}}=\frac{1}{\tau}\sum_{j=(i-1)\tau+1}^{i\tau}r_{j}.

In the following, we denote by $\tau^{\prime}:=\tau-b$ the maximum number of data points that we can collect from each phase. At the end of each phase, we store all or a subset of the $(x_{t},a_{t},R_{t})$ tuples in our replay buffer $\mathcal{R}$. We extract a feature vector $\phi(x,a)\in\mathbb{R}^{d}$ for the state-action pair $(x,a)$ and let $\mathcal{F}\subseteq\{f:\mathbb{R}^{d}\mapsto\mathbb{R}\}$ be a class of functions that we use to approximate the Q-functions. For phase $i$, we propose the following method to estimate $Q_{\pi_{i}}$: $\widehat{Q}_{\pi_{i}}(x,a)=\widehat{f}(\phi(x,a))$, where $\widehat{f}\in\arg\min_{f\in\mathcal{F}}\ell_{i}(f)$ and

\ell_{i}(f):=\frac{1}{\tau^{\prime}}\sum_{t=(i-1)\tau+1}^{i\tau-b}\big(f(\phi(x_{t},a_{t}))-R_{t}\big)^{2}.   (7.1)

Suppose that we store all the data from the previous phases in the replay buffer. Then, in order to estimate the average of the Q-functions of the first $k$ phases, i.e., $\widehat{Q}_{k}$, we propose the heuristic of minimizing the average of the $k$ squared loss functions defined in Eq. (7.1), i.e., $\frac{1}{k}\sum_{i=1}^{k}\ell_{i}(f)$.
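Below is a minimal PyTorch sketch of this heuristic (an illustrative training loop, not the exact setup used in our experiments). It trains a single network on the union of the replay buffer and the current phase's data; when every phase contributes the same number of transitions, the uniform average over the union coincides with $\frac{1}{k}\sum_{i=1}^{k}\ell_{i}(f)$:

import torch
import torch.nn as nn

def train_average_q(q_net, replay, current_data, num_steps=2000, batch_size=256, lr=1e-3):
    # Each element of replay / current_data is a pair (feature, return) with feature = phi(x, a).
    data = replay + current_data
    feats = torch.stack([torch.as_tensor(f, dtype=torch.float32) for f, _ in data])
    rets = torch.tensor([r for _, r in data], dtype=torch.float32)
    opt = torch.optim.SGD(q_net.parameters(), lr=lr)
    for _ in range(num_steps):
        idx = torch.randint(len(data), (batch_size,))
        pred = q_net(feats[idx]).squeeze(-1)
        loss = ((pred - rets[idx]) ** 2).mean()   # minibatch squared error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_net

# Example Q-network over state-action features, e.g. one hidden layer with ReLU:
# q_net = nn.Sequential(nn.Linear(feature_dim, 50), nn.ReLU(), nn.Linear(50, 1))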

Subsampling and coreset. In practice, due to the high memory cost, it may be hard to store all the data from the previous phases in the replay buffer. We found that a simple strategy to resolve this issue is to begin by storing all the data from every phase and to place a limit on the size of the replay buffer. When the buffer size exceeds the limit, we eliminate a subset of the data uniformly at random.

Another approach is to sample a subset of size $s$ from the data collected in each phase. Denote this subset by $\mathcal{R}_{i}$ for phase $i$. Thus, in the $k$-th phase, we have $\tau^{\prime}$ data points $\mathcal{D}_{k}$ from the current phase as well as $s(k-1)$ data points from the replay buffer $\mathcal{R}$. First, suppose that the $s$ data points are sampled uniformly at random. We can then minimize the following objective:

\min_{f\in\mathcal{F}}\frac{1}{k}\Big(\ell_{k}(f)+\sum_{i=1}^{k-1}\widehat{\ell}_{i}(f)\Big),   (7.2)

where $\widehat{\ell}_{i}(f):=\frac{1}{s}\sum_{(x_{t},a_{t},R_{t})\in\mathcal{R}_{i}}(f(\phi(x_{t},a_{t}))-R_{t})^{2}$ is an unbiased estimate of $\ell_{i}(f)$. Further, uniform sampling is not the only way to construct an unbiased estimate. In fact, for any discrete distribution with PMF $q=\{q_{t}\}$ over the $\tau^{\prime}$ data points in $\mathcal{D}_{i}$, we can sample $s$ data points according to $q$ and construct

\widehat{\ell}_{i}(f):=\frac{1}{\tau^{\prime}}\sum_{(x_{t},a_{t},R_{t})\in\mathcal{R}_{i}}\frac{1}{s\,q_{t}}(f(\phi(x_{t},a_{t}))-R_{t})^{2},   (7.3)

in order to obtain an unbiased estimate of $\ell_{i}(f)$. As shown by Bachem et al. (2017), by choosing $q_{t}\propto(f(\phi(x_{t},a_{t}))-R_{t})^{2}$, we can minimize the variance of $\widehat{\ell}_{i}(f)$ for any fixed $f$. In the following, we call the subset of data sampled according to this distribution a coreset of the data. In the experiments in Section 8, we show that, thanks to the variance reduction effect, sampling a coreset often produces better performance than sampling a subset uniformly at random, especially when the rewards are sparse.
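A minimal NumPy sketch of this subsampling step is shown below; the stored per-example weights are chosen so that the weighted squared error is an unbiased estimate of $\ell_{i}(f)$ for any fixed $f$, and the names are illustrative:

import numpy as np

def coreset_sample(features, returns, q_fn, s, rng):
    # Sample s transitions with probability proportional to the squared residual under
    # the current function q_fn, and keep weights so that
    # sum_t weights_t * (f(features_t) - returns_t)^2 estimates ell_i(f) without bias.
    tau_prime = len(returns)
    residuals_sq = (q_fn(features) - returns) ** 2 + 1e-12   # avoid an all-zero distribution
    q = residuals_sq / residuals_sq.sum()                    # sampling distribution q_t
    idx = rng.choice(tau_prime, size=s, replace=True, p=q)
    weights = 1.0 / (s * q[idx] * tau_prime)                 # importance weight per sample
    return features[idx], returns[idx], weights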

Comparison to Abbasi-Yadkori et al. (2019). Our algorithm can be viewed as an efficient implementation of Politex (Abbasi-Yadkori et al., 2019) via experience replay. In the original Politex algorithm, the Q-functions are estimated using Eq. (7.1) for each phase, and all of these functions are stored in memory. When the agent interacts with the environment, it needs to evaluate all the Q-functions in order to obtain the probability of each action. This implies that the time complexity of computing action probabilities increases with the number of phases, and as a result the algorithm is hard to scale to a large number of phases. Although we can choose to evaluate only a random subset of the Q-functions to estimate the action probabilities in Politex, our implementation via experience replay can still be faster, since we only need to evaluate a single function, trained on replay data, to take actions.

8 Experiments

In this section, we evaluate our implementations empirically. We make comparisons with several baselines in two control environments.

Environments. We use two control environments with the simulators described in Tassa et al. (2018). Both environments are episodic with episode length 1000. The environments we evaluate are:

  • Cart-pole (Barto et al., 1983): The goal of this environment is to balance an unactuated pole by applying forces to a cart at its base. We discretize the continuous force to 5 values: $\{-2,-1,0,1,2\}$. The reward at each time step is a real number in $[0,1]$. We end episodes early when the pole falls (we use rewards less than 0.5 as an indicator for falling), and assume zero reward for the remaining steps when reporting results.

  • Ball-in-cup: Here, an actuated planar receptacle can translate in the vertical plane in order to swing and catch a ball attached to its bottom. This task has a sparse reward: 1 when the ball is in the cup, and 0 otherwise. We discretize the two-dimensional continuous action space to a $3\times 3$ grid, i.e., 9 actions in total.

Figure 1: Experiments on the Cart-pole environment (top) and Ball-in-cup (bottom). Panels: (a) iteration complexity, (b) training speed, (c) replay subsampling strategies.

Algorithms. We compare the following algorithms: our proposed implementation of Politex using experience replay, Politex using weight averaging, the original Politex algorithm, the variant of Politex that averages 10 randomly selected Q-functions, the mirror descent policy optimization (MDPO) algorithm (Tomar et al., 2020), constrained policy optimization (CPO) (Achiam et al., 2017), and the V-MPO algorithm (Song et al., 2019b).

For all the algorithms, we extract Fourier basis features (Konidaris et al., 2011) from the raw observations for both environments: for Cart-pole, we use 4 bases and for Ball-in-cup we use 2 bases. For variants of Politex we construct the state-action features $\phi(x,a)$ by block one-hot encoding, i.e., we partition the $\phi(x,a)$ vector into $|\mathcal{A}|$ blocks, and set the $a$-th block to be the Fourier features of $x$ and the other blocks to be zero. We approximate Q-functions using neural networks with one hidden layer and ReLU activation: the width of the hidden layer is 50 for Cart-pole and 250 for Ball-in-cup. For MDPO, CPO and V-MPO, the algorithms use a policy network whose input and output are the state feature and action probability, respectively. We use one-hidden-layer networks with width 50 for Cart-pole and 250 for Ball-in-cup. These three algorithms also need a value network, which takes the state feature as input and outputs its value for the current policy. For both environments, we use a one-hidden-layer network with width 50.

For Cart-pole, we choose phase length $\tau=10^{4}$ and for Ball-in-cup, we choose $\tau=2\times 10^{4}$. Since the environments are episodic, we have multiple episodes in each phase. For Cart-pole, since training usually does not need a large number of phases, we do not set a limit on the size of the replay buffer, whereas for Ball-in-cup, we set the limit to $2\times 10^{6}$. For our algorithm and the variants of Politex, we choose the parameter $\eta$ in $\{5,10,20,40,80,160\}$ and report the result of each algorithm using the value of $\eta$ that produces the best average reward. For MDPO, CPO and V-MPO, we note that according to Eq. (4.1), the KL regularization coefficient between policies is the reciprocal of the $\eta$ parameter in our algorithm, so we also run these baseline algorithms over the same range of $\eta$ and report the best result. In addition, in CPO, we set the limit on the KL divergence between adjacent policies to $0.001\eta$, which, in our experiments, leads to good performance. We treat the length $b$ of the Monte Carlo estimate as a hyperparameter, choosing it in $\{100,300\}$ and reporting the best performance.

Results. We run each algorithm in each environment 20 times and report the average results as well as the standard deviation of the 20 runs (shaded area). We report the average reward in each phase before training on the data collected during the phase. The results are presented in Figure 1. Every run is conducted on a single P100 GPU.

From Fig. 1(a), we can see that the iteration complexity (average reward vs. number of phases) and final performance of the experience replay implementation using all the data from past phases (blue curve) are similar to those of many state-of-the-art algorithms, such as Politex and V-MPO. Meanwhile, according to Fig. 1(b), the experience replay implementation achieves strong performance in training speed: the best performance at 30 minutes in Cart-pole and the second-best performance at 200 minutes in Ball-in-cup. We note here that the training time includes both the time for data collection, i.e., interacting with the environment simulator, and the training of a new policy. The observation that the experience replay based implementation is faster than the original Politex and the implementation that uses 10 Q-functions can be explained by the fact that the replay-based algorithm only uses a single Q-function and thus is faster at data collection, as discussed in Section 7.2. In addition, the fact that the replay-based algorithm runs faster than MDPO, CPO, and V-MPO can be explained by its simplicity: in variants of Politex, we only need to approximate the Q-function, whereas in MDPO, CPO, and V-MPO, we need to train both the policy and value networks.

The Politex weight averaging scheme achieves the best performance on Ball-in-cup in both iteration complexity and training speed, but does not converge to a solution that is sufficiently close to the optimum on Cart-pole. Notice that for weight averaging, we used the initialization technique mentioned in Section 7.1. Considering the consistent strong performance of experience replay in both environments, we still recommend applying experience replay when using non-linear function approximation.

In Fig. 1(c), we can see that when we only sample a small subset (1% or 4%) of the data to store in the replay buffer, the coreset technique described in Section 7.2 achieves better performance due to the variance reduction effect. This improvement is quite significant in the Ball-in-cup environment, which has sparse rewards. Notice that sampling a coreset adds only negligible computational overhead during training, and thus we recommend the coreset technique when there is a strict memory constraint.

9 Discussion

The main contributions of our work are an improved analysis and a practical implementation of Politex. On the theoretical side, we show that Politex obtains an $O(\sqrt{T})$ high-probability regret bound in uniformly mixing average-reward MDPs with linear function approximation, which is the first such bound for a computationally efficient algorithm. The main limitation of these results is that, similarly to previous works, they hold under somewhat strong assumptions which circumvent the need for exploration. An interesting direction for future work would be to relax these assumptions and incorporate explicit exploration instead.

On the practical side, we propose an efficient implementation with neural networks that relies on experience replay, a standard tool in modern deep RL. Our work shows that experience replay and KL regularization can both be viewed as approximately implementing mirror descent policy updates within a policy iteration scheme. This provides an online-learning justification for using replay buffers in policy iteration, which differs from the standard explanation for their success. Our work also suggests a new objective for storing and prioritizing replay samples, with the goal of approximating the average of value functions well. This goal has some similarities with continual learning, where experience replay has also led to empirical successes. One interesting direction for future work would be to explore other continual learning techniques in the approximate implementation of mirror descent policy updates.

Acknowledgements

We would like to thank Mehrdad Farajtabar, Dilan Görür, Nir Levine, and Ang Li for helpful discussions.

References

  • Abbasi-Yadkori et al. (2019) Y. Abbasi-Yadkori, P. Bartlett, K. Bhatia, N. Lazic, C. Szepesvari, and G. Weisz. Politex: Regret bounds for policy iteration using expert prediction. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3692–3702. PMLR, 09–15 Jun 2019.
  • Abdolmaleki et al. (2018) A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1ANxQW0b.
  • Achiam et al. (2017) J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
  • Bachem et al. (2017) O. Bachem, M. Lucic, and A. Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
  • Bartlett (2009) P. L. Bartlett. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, 2009.
  • Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
  • Borsos et al. (2020) Z. Borsos, M. Mutnỳ, and A. Krause. Coresets via bilevel optimization for continual learning and streaming. arXiv preprint arXiv:2006.03875, 2020.
  • Cao (1999) X.-R. Cao. Single sample path-based optimization of Markov chains. Journal of optimization theory and applications, 100(3):527–548, 1999.
  • Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Chaudhry et al. (2018) A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
  • Cho and Meyer (2001) G. E. Cho and C. D. Meyer. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137–150, 2001.
  • Degrave et al. (2019) J. Degrave, A. Abdolmaleki, J. T. Springenberg, N. Heess, and M. Riedmiller. Quinoa: a Q-function you infer normalized over actions. arXiv preprint arXiv:1911.01831, 2019.
  • Even-Dar et al. (2009) E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • Farajtabar et al. (2020) M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773, 2020.
  • Fruit et al. (2018) R. Fruit, M. Pirotta, A. Lazaric, and R. Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, 2018.
  • Hao et al. (2020) B. Hao, N. Lazic, Y. Abbasi-Yadkori, P. Joulani, and C. Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
  • Heess et al. (2015) N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 28:2944–2952, 2015.
  • Jaksch et al. (2010) T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jian et al. (2019) Q. Jian, R. Fruit, M. Pirotta, and A. Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In Advances in Neural Information Processing Systems, pages 4891–4900, 2019.
  • Jin et al. (2019) C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
  • Kakade and Langford (2002) S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Konidaris et al. (2011) G. Konidaris, S. Osentoski, and P. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, 2011.
  • Lillicrap et al. (2016) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.
  • Lin (1992) L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • Lopez-Paz and Ranzato (2017) D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in neural information processing systems, pages 6467–6476, 2017.
  • Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Neu et al. (2014) G. Neu, A. György, C. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, December 2014.
  • Ouyang et al. (2017) Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
  • Rusu et al. (2016) A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations (ICLR), 2016.
  • Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
  • Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Seneta (1979) E. Seneta. Coefficients of ergodicity: structure and applications. Advances in applied probability, pages 576–590, 1979.
  • Seneta (1988) E. Seneta. Perturbation of the stationary distribution measured by ergodicity coefficients. Advances in Applied Probability, 20(1):228–230, 1988.
  • Song et al. (2019a) H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, N. Heess, D. Belov, M. A. Riedmiller, and M. M. Botvinick. V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. CoRR, abs/1909.12238, 2019a. URL http://arxiv.org/abs/1909.12238.
  • Song et al. (2019b) H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, et al. V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238, 2019b.
  • Talebi and Maillard (2018) M. S. Talebi and O.-A. Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, 2018.
  • Tassa et al. (2018) Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
  • Tomar et al. (2020) M. Tomar, L. Shani, Y. Efroni, and M. Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
  • Tropp (2012) J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • Vieillard et al. (2020a) N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, and M. Geist. Leverage the average: an analysis of regularization in RL. arXiv preprint arXiv:2003.14089, 2020a.
  • Vieillard et al. (2020b) N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 2529–2538, 2020b.
  • Wei et al. (2020a) C.-Y. Wei, M. Jafarnia-Jahromi, H. Luo, and R. Jain. Learning infinite-horizon average-reward mdps with linear function approximation. arXiv preprint arXiv:2007.11849, 2020a.
  • Wei et al. (2020b) C.-Y. Wei, M. J. Jahromi, H. Luo, H. Sharma, and R. Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020b.
  • Yin et al. (2020) D. Yin, M. Farajtabar, and A. Li. SOLA: Continual learning with second-order loss approximation. arXiv preprint arXiv:2006.10974, 2020.
  • Yu (1994) B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.
  • Zenke et al. (2017) F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70:3987, 2017.

Appendix

Appendix A Preliminaries

We state some useful definitions and lemmas in this section.

Lemma A.1.

Let $X_{1}$ and $X_{2}$ be a pair of distribution vectors. Let $H$ be the transition matrix of an ergodic Markov chain with a stationary distribution $\nu$, and ergodicity coefficient (defined in Assumption 2.1) upper-bounded by $\gamma<1$. Then

\|(H^{m})^{\top}(X_{1}-X_{2})\|_{1}\leq\gamma^{m}\|X_{1}-X_{2}\|_{1}.
Proof.

Let $\{v_{1},\ldots,v_{n}\}$ be the normalized left eigenvectors of $H$ corresponding to the ordered eigenvalues $\{\lambda_{1},\ldots,\lambda_{n}\}$. Then $v_{1}=\nu$, $\lambda_{1}=1$, and for all $i\geq 2$, we have that $\lambda_{i}<1$ (since the chain is ergodic) and $v_{i}^{\top}{\bf 1}=0$. Write $X_{1}$ and $X_{2}$ in terms of the eigenvector basis as

X_{1}=\alpha_{1}\nu+\sum_{i=2}^{n}\alpha_{i}v_{i}\quad\text{and}\quad X_{2}=\beta_{1}\nu+\sum_{i=2}^{n}\beta_{i}v_{i}.

Since $X_{1}^{\top}{\bf 1}=1$ and $X_{2}^{\top}{\bf 1}=1$, it is easy to see that $\alpha_{1}=\beta_{1}=1$. Thus we have

\|H^{\top}(X_{1}-X_{2})\|_{1}=\Big\|H^{\top}\sum_{i=2}^{n}(\alpha_{i}-\beta_{i})v_{i}\Big\|_{1}\leq\gamma\Big\|\sum_{i=2}^{n}(\alpha_{i}-\beta_{i})v_{i}\Big\|_{1}=\gamma\|X_{1}-X_{2}\|_{1},

where the inequality follows from the definition of the ergodicity coefficient and the fact that ${\bf 1}^{\top}v_{i}=0$ for all $i\geq 2$. Since

{\bf 1}^{\top}H^{\top}\sum_{i=2}^{n}(\alpha_{i}-\beta_{i})v_{i}={\bf 1}^{\top}\sum_{i=2}^{n}\lambda_{i}(\alpha_{i}-\beta_{i})v_{i}=0,

the inequality also holds for powers of $H$. ∎

Lemma A.2 (Doob martingale).

Let Assumption 2.1 hold, and let $\{(x_{t},a_{t})\}_{t=1}^{T}$ be the state-action sequence obtained when following policies $\pi_{1},\ldots,\pi_{K}$ for $\tau$ steps each from an initial distribution $\nu_{0}$. For $t\in[T]$, let $X_{t}$ be a binary indicator vector with a non-zero element at the linear index of the state-action pair $(x_{t},a_{t})$. Define, for $i\in[T]$,

B_{i}=\mathbb{E}\left[\sum_{t=1}^{T}X_{t}\,\Big|\,X_{1},\ldots,X_{i}\right],\quad\text{and}\quad B_{0}=\mathbb{E}\left[\sum_{t=1}^{T}X_{t}\right].

Then, $\{B_{i}\}_{i=0}^{T}$ is a vector-valued martingale: $\mathbb{E}[B_{i}-B_{i-1}|B_{0},\dots,B_{i-1}]=0$ for $i=1,\dots,T$, and $\|B_{i}-B_{i-1}\|_{1}\leq 2(1-\gamma)^{-1}$ holds for $i\in[T]$.

The constructed martingale is known as the Doob martingale underlying the sum $\sum_{t=1}^{T}X_{t}$.

Proof.

That \{B_{i}\}_{i=0}^{T} is a martingale follows from the definition. We now bound its difference sequence. Let H_{t} be the state-action transition matrix at time t, let H_{i:t}=\prod_{j=i}^{t-1}H_{j}, and define H_{i:i}=I. Then, for t=0,\dots,T-1, \mathbb{E}[X_{t+1}|X_{t}]=H_{t}^{\top}X_{t} (with the convention that X_{0} denotes the initial state-action distribution), and by the Markov property, for any i\in[T],

Bi\displaystyle B_{i} =t=1iXt+t=i+1T𝔼[Xt|Xi]=t=1iXt+t=i+1THi:tXi,andB0=t=1TH0:tX0.\displaystyle=\sum_{t=1}^{i}X_{t}+\sum_{t=i+1}^{T}\mathbb{E}[X_{t}|X_{i}]=\sum_{t=1}^{i}X_{t}+\sum_{t=i+1}^{T}H_{i:t}^{\top}X_{i},\quad\text{and}\quad B_{0}=\sum_{t=1}^{T}H_{0:t}^{\top}X_{0}.

For any i[T]i\in[T],

BiBi1\displaystyle B_{i}-B_{i-1} =t=1iXtt=1i1Xt+t=i+1THi:tXit=iTHi1:tXi1\displaystyle=\sum_{t=1}^{i}X_{t}-\sum_{t=1}^{i-1}X_{t}+\sum_{t=i+1}^{T}H_{i:t}^{\top}X_{i}-\sum_{t=i}^{T}H_{i-1:t}^{\top}X_{i-1}
=t=iTHi:t(XiHi1Xi1).\displaystyle=\sum_{t=i}^{T}H_{i:t}^{\top}(X_{i}-H_{i-1}^{\top}X_{i-1}). (A.1)

Since XiX_{i} and Hi1Xi1H_{i-1}^{\top}X_{i-1} are distribution vectors, under Assumption 2.1 and using Lemma A.1,

BiBi11t=iTHi:t(XiHi1Xi1)12j=0Tiγj2(1γ)1.\displaystyle\|B_{i}-B_{i-1}\|_{1}\leq\sum_{t=i}^{T}\|H_{i:t}^{\top}(X_{i}-H_{i-1}^{\top}X_{i-1})\|_{1}\leq 2\sum_{j=0}^{T-i}\gamma^{j}\leq 2(1-\gamma)^{-1}\,.

Let (k)k(\mathcal{F}_{k})_{k} be a filtration and define 𝔼k[]:=𝔼[|k]\mathbb{E}_{k}[\cdot]:=\mathbb{E}[\cdot|\mathcal{F}_{k}]. We will make use of the following concentration results for the sum of random matrices and vectors.

Theorem A.3 (Matrix Azuma, Tropp (2012) Thm 7.1).

Consider a finite (\mathcal{F}_{k})_{k}-adapted sequence \{X_{k}\} of Hermitian matrices of dimension m, and a fixed sequence \{A_{k}\} of Hermitian matrices that satisfy \mathbb{E}_{k-1}X_{k}=0 and X_{k}^{2}\preceq A_{k}^{2} almost surely. Let v=\|\sum_{k}A_{k}^{2}\|. Then with probability at least 1-\delta, \|\sum_{k}X_{k}\|_{2}\leq 2\sqrt{2v\ln(m/\delta)}.

A version of Theorem A.3 for non-Hermitian matrices of dimension m1×m2m_{1}\times m_{2} can be obtained by applying the theorem to a Hermitian dilation of XX, 𝒟(X)=[0XX0]\mathcal{D}(X)=\big{[}\begin{smallmatrix}0&X\\ X^{*}&0\end{smallmatrix}\big{]}, which satisfies λmax(𝒟(X))=X\lambda_{\max}(\mathcal{D}(X))=\|X\| and 𝒟(X)2=[XX00XX]\mathcal{D}(X)^{2}=\big{[}\begin{smallmatrix}XX^{*}&0\\ 0&X^{*}X\end{smallmatrix}\big{]}. In this case, we have that v=max(kXkXk,kXkXk)v=\max\left(\|\sum_{k}X_{k}X_{k}^{*}\|,\|\sum_{k}X_{k}^{*}X_{k}\|\right).
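For concreteness, a small numerical sketch of the Hermitian dilation (the 3\times 5 dimensions are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))  # a non-Hermitian 3 x 5 matrix

# Hermitian dilation D(X) = [[0, X], [X^*, 0]].
D = np.block([[np.zeros((3, 3)), X], [X.conj().T, np.zeros((5, 5))]])

# lambda_max(D(X)) equals the spectral norm of X ...
assert np.isclose(np.linalg.eigvalsh(D).max(), np.linalg.norm(X, 2))
# ... and D(X)^2 is block-diagonal with blocks X X^* and X^* X.
D2 = np.block([[X @ X.conj().T, np.zeros((3, 5))],
               [np.zeros((5, 3)), X.conj().T @ X]])
assert np.allclose(D @ D, D2)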

Lemma A.4 (Hoeffding-type inequality for norm-subGaussian random vectors, Jin et al. (2019)).

Consider random vectors X_{1},\ldots,X_{n}\in\mathbb{R}^{d} and corresponding filtrations \mathcal{F}_{i}=\sigma(X_{1},\ldots,X_{i}) for i\in[n], such that X_{i}|\mathcal{F}_{i-1} is zero-mean norm-subGaussian with \sigma_{i}\in\mathcal{F}_{i-1}. That is:

\displaystyle\mathbb{E}[X_{i}|\mathcal{F}_{i-1}]=0,\quad P(\|X_{i}\|\geq t|\mathcal{F}_{i-1})\leq 2\exp(-t^{2}/2\sigma_{i}^{2})\quad\forall t\in\mathbb{R},\forall i\in[n].

If the condition is satisfied for fixed {σi}\{\sigma_{i}\}, there exists a constant cc such that for any δ>0\delta>0, with probability at least 1δ1-\delta,

i=1nXici=1nσi2log(2d/δ).\displaystyle\|\sum_{i=1}^{n}X_{i}\|\leq c\sqrt{\sum_{i=1}^{n}\sigma_{i}^{2}\log(2d/\delta)}\,.

Appendix B Bounding the Difference Between Empirical and Average Rewards

In this section, we bound the second term in Equation 5.1, corresponding to the difference between empirical and average rewards.

Lemma B.1.

Let Assumption 2.1 hold, and assume that τlogT2log(1/γ)\tau\geq\frac{\log T}{2\log(1/\gamma)} and that r(x,a)[0,1]r(x,a)\in[0,1] for all x,ax,a. Then, by choosing η=8log|𝒜|QmaxK\eta=\frac{\sqrt{8\log|\mathcal{A}|}}{Q_{\max}\sqrt{K}}, we have with probability at least 1δ1-\delta,

k=1Kt=(k1)τ+1kτ(rtJπk)2(1γ)12Tlog(2/δ)+2T+(1γ)28Klog|𝒜|.\displaystyle\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}(r_{t}-J_{\pi_{k}})\leq 2(1-\gamma)^{-1}\sqrt{2T\log(2/\delta)}+2\sqrt{T}+(1-\gamma)^{-2}\sqrt{8K\log|\mathcal{A}|}\,.
Proof.

Let rr denote the vector of rewards, and recall that Jπ=νπrJ_{\pi}=\nu_{\pi}^{\top}r. Let XtX_{t} be the indicator vector for the state-action pair at time tt, as in Lemma A.2, and let νt=𝔼[Xt]\nu_{t}=\mathbb{E}[X_{t}]. We have the following:

VT\displaystyle V_{T} :=k=1Kt=(k1)τ+1kτ(rtJπk)=k=1Kt=(k1)τ+1kτr(Xtνt+νtνπk)\displaystyle:=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}(r_{t}-J_{\pi_{k}})=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}r^{\top}(X_{t}-\nu_{t}+\nu_{t}-\nu_{\pi_{k}})

We slightly abuse the notation above by letting νt\nu_{t} denote the state-action distribution at time tt, and νπ\nu_{\pi} the stationary distribution of policy π\pi. Let {Bi}i=0T\{B_{i}\}_{i=0}^{T} be the Doob martingale in Lemma A.2. Then B0=t=1TνtB_{0}=\sum_{t=1}^{T}\nu_{t} and BT=t=1TXtB_{T}=\sum_{t=1}^{T}X_{t}, and the first term can be expressed as

VT1:=t=1Tr(Xtνt)=r(BTB0).\displaystyle V_{T1}:=\sum_{t=1}^{T}r^{\top}(X_{t}-\nu_{t})=r^{\top}(B_{T}-B_{0}).

By Lemma A.2, |BiBi1,r|BiBi11r2(1γ)1|\langle B_{i}-B_{i-1},r\rangle|\leq\|B_{i}-B_{i-1}\|_{1}\|r\|_{\infty}\leq 2(1-\gamma)^{-1}. Hence by Azuma’s inequality, with probability at least 1δ1-\delta,

VT12(1γ)12Tlog(2/δ).V_{T1}\leq 2(1-\gamma)^{-1}\sqrt{2T\log(2/{\delta})}. (B.1)

For the second term we have

VT2\displaystyle V_{T2} :=k=1Kt=(k1)τ+1kτr(νtνπk)\displaystyle:=\sum_{k=1}^{K}\sum_{t=(k-1)\tau+1}^{k\tau}r^{\top}(\nu_{t}-\nu_{\pi_{k}})
\displaystyle=\sum_{k=1}^{K}r^{\top}\sum_{i=1}^{\tau}\left((H_{\pi_{k}}^{i})^{\top}\nu_{(k-1)\tau}-\nu_{\pi_{k}}\right)
k=1Kri=1τ(Hπki)(ν(k1)τνπk1+νπk1)νπk1\displaystyle\leq\sum_{k=1}^{K}\|r\|_{\infty}\sum_{i=1}^{\tau}\left\|(H_{\pi_{k}}^{i})^{\top}(\nu_{(k-1)\tau}-\nu_{\pi_{k-1}}+\nu_{\pi_{k-1}})-\nu_{\pi_{k}}\right\|_{1}
k=1Ki=1τν(k1)τνπk11+(Hπki)νπk1νπk1\displaystyle\leq\sum_{k=1}^{K}\sum_{i=1}^{\tau}\|\nu_{(k-1)\tau}-\nu_{\pi_{k-1}}\|_{1}+\|(H_{\pi_{k}}^{i})^{\top}\nu_{\pi_{k-1}}-\nu_{\pi_{k}}\|_{1}
k=1Ki=1τ(Hπ(k1)τ)ν(k2)τνπk11+γiνπk1νπk1\displaystyle\leq\sum_{k=1}^{K}\sum_{i=1}^{\tau}\|(H_{\pi_{(k-1)}}^{\tau})^{\top}\nu_{(k-2)\tau}-\nu_{\pi_{k-1}}\|_{1}+\gamma^{i}\|\nu_{\pi_{k-1}}-\nu_{\pi_{k}}\|_{1}
2Tγτ+11γk=1Kνπkνπk11.\displaystyle\leq 2T\gamma^{\tau}+\frac{1}{1-\gamma}\sum_{k=1}^{K}\|\nu_{\pi_{k}}-\nu_{\pi_{k-1}}\|_{1}\,.

For τlogT2log(1/γ)\tau\geq\frac{\log T}{2\log(1/\gamma)}, the first term is upper-bounded by 2T2\sqrt{T}.

Using results on perturbations of Markov chains (Seneta, 1988; Cho and Meyer, 2001), we have that

νπkνπk1111γHπkHπk111γmaxxπk(|x)πk1(|x)1.\displaystyle\|\nu_{\pi_{k}}-\nu_{\pi_{k-1}}\|_{1}\leq\frac{1}{1-\gamma}\|H_{\pi_{k}}-H_{\pi_{k-1}}\|_{\infty}\leq\frac{1}{1-\gamma}\max_{x}\|\pi_{k}(\cdot|x)-\pi_{k-1}(\cdot|x)\|_{1}.
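As a sanity check on this perturbation bound, here is a minimal sketch of ours (again taking the ergodicity coefficient to be the Dobrushin coefficient, and writing \|\cdot\|_{\infty} for the maximum absolute row sum); the two transition matrices stand in for H_{\pi_{k}} and H_{\pi_{k-1}}:

import numpy as np

def stationary(H):
    # Left eigenvector of H for eigenvalue 1, normalized to a distribution.
    vals, vecs = np.linalg.eig(H.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return v / v.sum()

def dobrushin(H):
    # Assumed form of the ergodicity coefficient in Assumption 2.1.
    n = H.shape[0]
    return 0.5 * max(np.abs(H[i] - H[j]).sum() for i in range(n) for j in range(n))

rng = np.random.default_rng(5)
H1 = rng.random((4, 4)) + 0.2; H1 /= H1.sum(axis=1, keepdims=True)
H2 = np.clip(H1 + 0.05 * rng.standard_normal((4, 4)), 1e-3, None)
H2 /= H2.sum(axis=1, keepdims=True)

gamma = max(dobrushin(H1), dobrushin(H2))
lhs = np.abs(stationary(H1) - stationary(H2)).sum()
rhs = np.abs(H1 - H2).sum(axis=1).max() / (1 - gamma)
assert lhs <= rhs + 1e-10
print(f"||nu_1 - nu_2||_1 = {lhs:.4f} <= {rhs:.4f}")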

Note that the policies πk(|x)\pi_{k}(\cdot|x) are generated by running mirror descent on reward functions Q^πk(x,)\widehat{Q}_{\pi_{k}}(x,\cdot). A well-known property of mirror descent updates with entropy regularization (or equivalently, the exponentially-weighted-average algorithm) is that the difference between consecutive policies is bounded as

πk+1(|x)πk(|x)1ηQ^πk(x,).\displaystyle\|\pi_{k+1}(\cdot|x)-\pi_{k}(\cdot|x)\|_{1}\leq\eta\|\widehat{Q}_{\pi_{k}}(x,\cdot)\|_{\infty}\,.

See e.g. Neu et al. (2014) Section V.A for a proof, which involves applying Pinsker’s inequality and Hoeffding’s lemma (Cesa-Bianchi and Lugosi (2006) Section A.2 and Lemma A.6). Since we assume that Q^πkQmax\|\widehat{Q}_{\pi_{k}}\|_{\infty}\leq Q_{\max}, we can obtain

VT22T+(1γ)2KηQmax.V_{T2}\leq 2\sqrt{T}+(1-\gamma)^{-2}K\eta Q_{\max}.

By choosing η=8log|𝒜|QmaxK\eta=\frac{\sqrt{8\log|\mathcal{A}|}}{Q_{\max}\sqrt{K}}, we can bound the second term as

VT22T+(1γ)28Klog|𝒜|.V_{T2}\leq 2\sqrt{T}+(1-\gamma)^{-2}\sqrt{8K\log|\mathcal{A}|}. (B.2)

Putting Eq. (B.1) and (B.2) together, we obtain that with probability at least 1δ1-\delta,

VT2(1γ)12Tlog(2/δ)+2T+(1γ)28Klog|𝒜|.V_{T}\leq 2(1-\gamma)^{-1}\sqrt{2T\log(2/\delta)}+2\sqrt{T}+(1-\gamma)^{-2}\sqrt{8K\log|\mathcal{A}|}\,.
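To illustrate the mirror-descent stability step used in the proof above, here is a minimal single-state sketch of ours; the uniformly random Q-estimates are placeholders, and the exponential-weights update on the running sum of Q-estimates is the form of update referred to in the text:

import numpy as np

rng = np.random.default_rng(2)
num_actions, Q_max, K = 5, 1.0, 100
eta = np.sqrt(8 * np.log(num_actions)) / (Q_max * np.sqrt(K))

cumulative_Q = np.zeros(num_actions)
policy = np.full(num_actions, 1.0 / num_actions)
for k in range(K):
    Q_hat = rng.uniform(0, Q_max, size=num_actions)  # stand-in for Q-hat_{pi_k}(x, .)
    cumulative_Q += Q_hat
    # Exponential weights (entropy-regularized mirror descent) on the cumulative Q-estimates.
    logits = eta * cumulative_Q
    new_policy = np.exp(logits - logits.max())
    new_policy /= new_policy.sum()
    # One-step stability: ||pi_{k+1}(.|x) - pi_k(.|x)||_1 <= eta * ||Q_hat(x, .)||_inf.
    assert np.abs(new_policy - policy).sum() <= eta * np.abs(Q_hat).max() + 1e-12
    policy = new_policy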

Appendix C Proof of Lemma 6.3

Proof.

Recall that we split each phase into 2m blocks of size b and let \mathcal{H}_{i} and {\mathcal{T}}_{i} denote the starting indices of the odd and even blocks, respectively. We let R_{t} denote the empirical b-step return from the state-action pair (x_{t},a_{t}) in phase i:

\displaystyle R_{t}=\sum_{s=t}^{t+b}(r_{s}-\widehat{J}_{\pi_{i}}),\quad\widehat{J}_{\pi_{i}}=\frac{1}{|{\mathcal{T}}_{i}|}\sum_{t^{\prime}\in{\mathcal{T}}_{i}}r_{t^{\prime}}.
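To make the block indexing concrete, a minimal sketch of ours (0-based Python indices and synthetic rewards, so the details are illustrative only) of how the odd-block starts \mathcal{H}_{i}, the even-block starts {\mathcal{T}}_{i}, the estimate \widehat{J}_{\pi_{i}}, and the returns R_{t} are formed within one phase:

import numpy as np

rng = np.random.default_rng(3)
m, b = 8, 10                                  # m odd blocks and m even blocks of size b
phase_len = 2 * m * b
rewards = rng.uniform(0, 1, size=phase_len)   # hypothetical rewards from one phase

odd_starts = np.arange(0, phase_len, 2 * b)   # H_i: starts of blocks 1, 3, 5, ...
even_starts = np.arange(b, phase_len, 2 * b)  # T_i: starts of blocks 2, 4, 6, ...

# Average-reward estimate from rewards observed at even-block starts.
J_hat = rewards[even_starts].mean()

# Centered b-step returns from each odd-block start.
R = np.array([rewards[t:t + b + 1].sum() - (b + 1) * J_hat for t in odd_starts])
print(J_hat, R)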

We start by bounding the error in RtR_{t}. Let XX be a binary indicator vector for a state-action pair (x,a)(x,a). Let HπH_{\pi} be the state-action transition kernel for policy π\pi, and let νπ\nu_{\pi} be the corresponding stationary state-action distribution. We can write the action-value function at (x,a)(x,a) as

Qπ(x,a)\displaystyle Q_{\pi}(x,a) =r(x,a)Jπ+XHπQπ\displaystyle=r(x,a)-J_{\pi}+X^{\top}H_{\pi}Q_{\pi}
=(Xνπ)r+XHπ(rJπ𝟏+HπQπ)\displaystyle=(X-\nu_{\pi})^{\top}r+X^{\top}H_{\pi}(r-J_{\pi}{\bf 1}+H_{\pi}Q_{\pi})
=i=0(Xνπ)Hπir.\displaystyle=\sum_{i=0}^{\infty}(X-\nu_{\pi})^{\top}H_{\pi}^{i}r\,.

Let Qπb(x,a)=i=0b(Xνπ)HπirQ_{\pi}^{b}(x,a)=\sum_{i=0}^{b}(X-\nu_{\pi})^{\top}H_{\pi}^{i}r be a version of QπQ_{\pi} truncated to bb steps. Under uniform mixing, the difference to the true QπQ_{\pi} is bounded as

|Qπ(x,a)Qπb(x,a)|i=1|(Xνπ)Hπi+br|2γb+11γ.\displaystyle|Q_{\pi}(x,a)-Q_{\pi}^{b}(x,a)|\leq\sum_{i=1}^{\infty}\left|(X-\nu_{\pi})^{\top}H_{\pi}^{i+b}r\right|\leq\frac{2\gamma^{b+1}}{1-\gamma}\,. (C.1)

Let b_{t}=Q_{\pi_{i}}^{b}(x_{t},a_{t})-Q_{\pi_{i}}(x_{t},a_{t}) denote the truncation bias at time t, and let z_{t}=\sum_{s=t}^{t+b}\big(r_{s}-X_{t}^{\top}H_{\pi_{i}}^{(s-t)}r\big) denote the reward noise. We will write

Rt=Qπi(xt,at)+b(JπiJ^πi)+zt+bt.\displaystyle R_{t}=Q_{\pi_{i}}(x_{t},a_{t})+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})+z_{t}+b_{t}.

Note that m=|i|m=|\mathcal{H}_{i}| and let

M^i=1mtiϕtϕt+αmI.\widehat{M}_{i}=\frac{1}{m}\sum_{t\in\mathcal{H}_{i}}\phi_{t}\phi_{t}^{\top}+\frac{\alpha}{m}I\,.

We estimate the value function of each policy πi\pi_{i} using data from phase ii as

w^πi\displaystyle\widehat{w}_{\pi_{i}} =M^i1m1tiϕtRt\displaystyle=\widehat{M}_{i}^{-1}m^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}R_{t}
\displaystyle=\widehat{M}_{i}^{-1}m^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(\phi_{t}^{\top}w_{\pi_{i}}+b_{t}+z_{t}+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}}))+\widehat{M}_{i}^{-1}\frac{\alpha}{m}w_{\pi_{i}}-\widehat{M}_{i}^{-1}\frac{\alpha}{m}w_{\pi_{i}}
=wπi+M^i1m1tiϕt(zt+bt+b(JπiJ^πi))M^i1m1αwπi\displaystyle=w_{\pi_{i}}+\widehat{M}_{i}^{-1}m^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(z_{t}+b_{t}+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}}))-\widehat{M}_{i}^{-1}m^{-1}\alpha w_{\pi_{i}}

Our estimate w^k\widehat{w}_{k} of wk=1ki=1kwπiw_{k}=\frac{1}{k}\sum_{i=1}^{k}w_{\pi_{i}} can thus be written as follows:

w^kwk\displaystyle\widehat{w}_{k}-w_{k} =1kmi=1ktiM^i1ϕt(zt+bt+b(JπiJ^πi))αkmi=1kM^i1wπi.\displaystyle=\frac{1}{km}\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\widehat{M}_{i}^{-1}\phi_{t}(z_{t}+b_{t}+b(J_{\pi_{i}}-\widehat{J}_{\pi_{i}}))-\frac{\alpha}{km}\sum_{i=1}^{k}\widehat{M}_{i}^{-1}w_{\pi_{i}}.

We proceed to upper-bound the norm of the RHS.

Set α=m/k\alpha=\sqrt{m/k}. Let CwC_{w} be an upper-bound on the norm of the true value-function weights wπi2\|w_{\pi_{i}}\|_{2} for i=1,,Ki=1,...,K. In Appendix C.3, we show that with probability at least 1δ1-\delta, for m72CΦ4σ2(1γ)2log(d/δ)m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta), M^i122σ2\|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2}. Thus with probability at least 1δ1-\delta, the last error term is upper-bounded as

\displaystyle\frac{\alpha}{km}\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}w_{\pi_{i}}\right\|_{2}\leq 2\sigma^{-2}C_{w}(km)^{-1/2}. (C.2)

Similarly, for

blog((1γ)1km)log(1/γ),b\geq\frac{\log((1-\gamma)^{-1}\sqrt{km})}{\log(1/\gamma)}, (C.3)

the norm of the truncation bias term is upper-bounded as

1kmi=1ktiM^i1ϕtbt2\displaystyle\frac{1}{km}\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\|\widehat{M}_{i}^{-1}\phi_{t}b_{t}\|_{2} 2γbkm(1γ)i=1ktiM^i1ϕt22σ2CΦ(km)1/2.\displaystyle\leq\frac{2\gamma^{b}}{km(1-\gamma)}\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\|\widehat{M}_{i}^{-1}\phi_{t}\|_{2}\leq 2\sigma^{-2}C_{\Phi}(km)^{-1/2}. (C.4)

To bound the error terms corresponding to reward noise ztz_{t} and average-error noise JπiJ^πiJ_{\pi_{i}}-\widehat{J}_{\pi_{i}}, we rely on the independent blocks techniques of Yu (1994). We show in Sections C.1 and C.2 that with probability 12δ1-2\delta, for constants c1c_{1} and c2c_{2}, each of these terms can be bounded as:

1kmi=1ktiM^i1ϕtzt2\displaystyle\frac{1}{km}\left\|\sum_{i=1}^{k}\sum_{t\in\mathcal{H}_{i}}\widehat{M}_{i}^{-1}\phi_{t}z_{t}\right\|_{2} 2c1CΦσ2blog(2d/δ)km\displaystyle\leq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{\frac{b\log(2d/\delta)}{km}}
bkmi=1k(JπiJ^πi)tiM^i1ϕt2\displaystyle\frac{b}{km}\left\|\sum_{i=1}^{k}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})\sum_{t\in\mathcal{H}_{i}}\widehat{M}_{i}^{-1}\phi_{t}\right\|_{2} 2c2CΦσ2blog(2d/δ)km.\displaystyle\leq 2c_{2}C_{\Phi}\sigma^{-2}b\sqrt{\frac{\log(2d/\delta)}{km}}.

Thus, putting terms together, we have for an absolute constant cc, with probability at least 1δ1-\delta,

w^kwk2cσ2(Cw+CΦ)blog(2d/δ)km.\displaystyle\|\widehat{w}_{k}-w_{k}\|_{2}\leq c\sigma^{-2}(C_{w}+C_{\Phi})b\sqrt{\frac{\log(2d/\delta)}{km}}.

Note that this result holds for every k[K]k\in[K] and thus also holds for k=Kk=K. ∎
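For concreteness, a minimal sketch of ours (synthetic features and returns, with \alpha=\sqrt{m/k} as above; not the full algorithm) of the per-phase estimates \widehat{M}_{i} and \widehat{w}_{\pi_{i}} and the averaged weights \widehat{w}_{k} used in the proof:

import numpy as np

rng = np.random.default_rng(4)
d, m, k = 6, 50, 20              # feature dimension, odd blocks per phase, phases so far
alpha = np.sqrt(m / k)           # regularization parameter used in the proof

w_hat_per_phase = []
for i in range(k):
    Phi = rng.standard_normal((m, d))                    # phi_t for t in H_i (stand-ins)
    R = Phi @ np.ones(d) + 0.1 * rng.standard_normal(m)  # stand-in b-step returns R_t
    M_hat = (Phi.T @ Phi) / m + (alpha / m) * np.eye(d)  # M-hat_i
    w_hat = np.linalg.solve(M_hat, Phi.T @ R / m)        # w-hat_{pi_i}
    w_hat_per_phase.append(w_hat)

# Averaged weights defining the averaged Q-estimate used by the policy update.
w_hat_k = np.mean(w_hat_per_phase, axis=0)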

C.1 Bounding i=1kM^i1tiϕtzt2\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\|_{2}

Let tv\|\cdot\|_{\mathrm{tv}} denote the total variation norm.

Definition C.1 (β\beta-mixing).

Let {Zt}t=1,2,\{Z_{t}\}_{t=1,2,\ldots} be a stochastic process. Denote by Z1:tZ_{1:t} the collection (Z1,,Zt)(Z_{1},\ldots,Z_{t}), where we allow t=t=\infty. Let σ(Zi:j)\sigma(Z_{i:j}) denote the sigma-algebra generated by Zi:jZ_{i:j} (iji\leq j). The kthk^{\rm th} β\beta-mixing coefficient of {Zt}\{Z_{t}\}, βk\beta_{k}, is defined by

βk\displaystyle\beta_{k} =supt1𝔼supBσ(Zt+k:)|P(B|Z1:t)P(B)|\displaystyle=\sup_{t\geq 1}\mathbb{E}{\sup_{B\in\sigma(Z_{t+k:\infty})}|P(B|Z_{1:t})-P(B)|}
=supt1𝔼PZt+k:|Z1:t(|Z1:t)PZt+k:()tv.\displaystyle=\sup_{t\geq 1}\mathbb{E}{\|P_{Z_{t+k:\infty}|Z_{1:t}}(\cdot|Z_{1:t})-P_{Z_{t+k:\infty}}(\cdot)\|_{\mathrm{tv}}}\,.

{Zt}\{Z_{t}\} is said to be β\beta-mixing if βk0\beta_{k}\rightarrow 0 as kk\rightarrow\infty. In particular, we say that a β\beta-mixing process mixes at an exponential rate with parameters β¯,α,γ>0\overline{\beta},\alpha,\gamma>0 if βkβ¯exp(αkγ)\beta_{k}\leq\overline{\beta}\exp(-\alpha k^{\gamma}) holds for all k0k\geq 0.

Let XtX_{t} be the indicator vector for the state-action pair (xt,at)(x_{t},a_{t}) as in Lemma A.2. Note that the distribution of (xt+1,at+1)(x_{t+1},a_{t+1}) given (xt,at)(x_{t},a_{t}) can be written as 𝔼[Xt+1|Xt]\mathbb{E}[X_{t+1}|X_{t}]. Let HtH_{t} be the state-action transition matrix at time tt, let Hi:t=j=it1HjH_{i:t}=\prod_{j=i}^{t-1}H_{j}, and define Hi:i=IH_{i:i}=I. Then we have that 𝔼[Xt+k|X1:t]=Ht:t+kXt\mathbb{E}[X_{t+k}|X_{1:t}]=H_{t:t+k}^{\top}X_{t} and 𝔼[Xt+k]=H1:t+kν0\mathbb{E}[X_{t+k}]=H_{1:t+k}^{\top}\nu_{0}, where ν0\nu_{0} is the initial state distribution. Thus, under the uniform mixing Assumption  2.1, the kthk^{th} β\beta-mixing coefficient is bounded as:

βk\displaystyle\beta_{k} supt1𝔼j=kHt:t+jXtH1:t+jν01supt1𝔼j=kγjXtH1:tν012γk1γ.\displaystyle\leq\sup_{t\geq 1}\mathbb{E}\sum_{j=k}^{\infty}\|H_{t:t+j}^{\top}X_{t}-H_{1:t+j}^{\top}\nu_{0}\|_{1}\leq\sup_{t\geq 1}\mathbb{E}\sum_{j=k}^{\infty}\gamma^{j}\|X_{t}-H_{1:t}^{\top}\nu_{0}\|_{1}\leq\frac{2\gamma^{k}}{1-\gamma}\,.

We bound the noise terms using the independent blocks technique of Yu (1994). Recall that we partition each phase into 2m2m blocks of size bb. Thus, after kk phases we have a total of 2km2km blocks. Let \mathbb{P} denote the joint distribution of state-action pairs in odd blocks. Let i\mathcal{I}_{i} denote the set of indices in the ithi^{th} block, and let xi,aix_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}} denote the corresponding states and actions. We factorize the joint distribution according to blocks:

(x1,a1,x3,a3,,x2km1,a2km1)=\displaystyle\mathbb{P}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}},x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}},\ldots,x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}})= 1(x1,a1)×3(x3,a3|x1,a1)×\displaystyle\;\;\mathbb{P}_{1}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\mathbb{P}_{3}(x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}}|x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\cdots
×2km1(x2km1,a2km1|x2km3,a2km3).\displaystyle\times\mathbb{P}_{2km-1}(x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}}|x_{\mathcal{I}_{2km-3}},a_{\mathcal{I}_{2km-3}}).

Let ~i\widetilde{\mathbb{P}}_{i} be the marginal distribution over the variables in block ii, and let ~\widetilde{\mathbb{P}} be the product of marginals of odd blocks.

Corollary 2.7 of Yu (1994) implies that for any Borel-measurable set EE,

|(E)~(E)|(km1)βb,|\mathbb{P}(E)-\widetilde{\mathbb{P}}(E)|\leq(km-1)\beta_{b}, (C.5)

where βb\beta_{b} is the bthb^{th} β\beta-mixing coefficient of the process. The result follows since the size of the “gap” between successive blocks is bb; see Appendix E for more details.

Recall that our estimates w^πi\widehat{w}_{\pi_{i}} are based only on data in odd blocks in each phase. Let 𝔼~\widetilde{\mathbb{E}} denote the expectation w.r.t. the product-of-marginals distribution ~\widetilde{\mathbb{P}}. Then 𝔼~[M^i1tiϕtzt]=0\widetilde{\mathbb{E}}[\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}]=0 because for tit\in\mathcal{H}_{i} and under ~\widetilde{\mathbb{P}}, ztz_{t} is zero-mean given ϕt\phi_{t} and is independent of other feature vectors outside of the block. Furthermore, by Hoeffding’s inequality ~(|zt|/ba)2exp(2ba2)\widetilde{\mathbb{P}}(|z_{t}|/b\geq a)\leq 2\exp(-2ba^{2}). Since ϕt2CΦ\|\phi_{t}\|_{2}\leq C_{\Phi} and M^i122σ2\|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2} for large enough mm, we have that

~(M^i1ϕtzt22bσ2CΦa)2exp(2ba2).\widetilde{\mathbb{P}}(\|\widehat{M}_{i}^{-1}\phi_{t}z_{t}\|_{2}\geq 2b\sigma^{-2}C_{\Phi}a)\leq 2\exp(-2ba^{2}).

Since, under \widetilde{\mathbb{P}}, the vectors \widehat{M}_{i}^{-1}\phi_{t}z_{t} are norm-subGaussian, Lemma A.4 implies that there exists a constant c_{1} such that for any \delta>0,

~(i=1kM^i1tiϕtzt22c1CΦσ2bkmlog(2d/δ))δ.\displaystyle\widetilde{\mathbb{P}}\left(\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\right\|_{2}\geq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{bkm\log(2d/\delta)}\right)\leq\delta\,.

Thus, using (C.5),

(i=1kM^i1tiϕtzt22c1CΦσ2bkmlog(2d/δ))δ+(km1)βb.\mathbb{P}\left(\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\right\|_{2}\geq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{bkm\log(2d/\delta)}\right)\leq\delta+(km-1)\beta_{b}\;.

Under Assumption 2.1, we have that βb2γb(1γ)1\beta_{b}\leq 2\gamma^{b}(1-\gamma)^{-1}. Setting δ=2kmγb(1γ)1\delta=2km\gamma^{b}(1-\gamma)^{-1} and solving for bb we get

b=log(2kmδ1(1γ)1)log(1/γ).\displaystyle b=\frac{\log(2km\delta^{-1}(1-\gamma)^{-1})}{\log(1/\gamma)}. (C.6)

Notice that when bb is chosen as in Eq. (C.6), the condition (C.3) is also satisfied. Plugging this into the previous display gives that with probability at least 12δ1-2\delta,

i=1kM^i1tiϕtzt22c1CΦσ2bkmlog(2d/δ).\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}z_{t}\right\|_{2}\leq 2c_{1}C_{\Phi}\sigma^{-2}\sqrt{bkm\log(2d/\delta)}.

C.2 Bounding i=1kM^i1tiϕt(JπiJ^πi)2\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})\|_{2}

Recall that the average-reward estimates J^πi\widehat{J}_{\pi_{i}} are computed using time indices corresponding to the starts of even blocks, 𝒯i{\mathcal{T}}_{i}. Thus this error term is only a function of the indices corresponding to block starts. Now let \mathbb{P} denote the distribution over state-action pairs (xt,at)(x_{t},a_{t}) for indices tt corresponding to block starts, i.e. t{1,b+1,2b+1,,(2km1)b+1}t\in\{1,b+1,2b+1,...,(2km-1)b+1\}. We again factorize the distribution over blocks as =122km\mathbb{P}=\mathbb{P}_{1}\otimes\mathbb{P}_{2}\otimes\cdots\otimes\mathbb{P}_{2km}. Let ~=~1~2~2km\widetilde{\mathbb{P}}=\widetilde{\mathbb{P}}_{1}\otimes\widetilde{\mathbb{P}}_{2}\otimes\cdots\otimes\widetilde{\mathbb{P}}_{2km} be a product-of-marginals distribution defined as follows. For odd jj, let ~j\widetilde{\mathbb{P}}_{j} be the marginal of \mathbb{P} over (xjb+1,ajb+1)(x_{jb+1},a_{jb+1}). For even jj in phase ii, let ~j=νπi\widetilde{\mathbb{P}}_{j}=\nu_{\pi_{i}} correspond to the stationary distribution of the corresponding policy πi\pi_{i}. Using arguments similar to independent blocks, we show in Appendix E that

~12(2km1)γb1.\|\mathbb{P}-\widetilde{\mathbb{P}}\|_{1}\leq 2(2km-1)\gamma^{b-1}.

Let \widetilde{\mathbb{E}} denote expectation w.r.t. the product-of-marginals distribution \widetilde{\mathbb{P}}. Then \widetilde{\mathbb{E}}[\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})]=0, since under \widetilde{\mathbb{P}}, \widehat{J}_{\pi_{i}} is an average of rewards at state-action pairs distributed according to \nu_{\pi_{i}}, and these state-action pairs are independent of the other data. Using a similar argument as in the previous section, for b=1+\frac{\log(4km/\delta)}{\log(1/\gamma)}, there exists a constant c_{2} such that with probability at least 1-2\delta,

i=1kM^i1tiϕt(JπiJ^πi)22c2CΦσ2kmlog(2d/δ).\left\|\sum_{i=1}^{k}\widehat{M}_{i}^{-1}\sum_{t\in\mathcal{H}_{i}}\phi_{t}(J_{\pi_{i}}-\widehat{J}_{\pi_{i}})\right\|_{2}\leq 2c_{2}C_{\Phi}\sigma^{-2}\sqrt{km\log(2d/\delta)}\;.

C.3 Bounding M^i12\|\widehat{M}_{i}^{-1}\|_{2}

In this subsection, we show that with probability at least 1-\delta, for m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta), we have \|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2}.

Let \Phi be a |\mathcal{X}||\mathcal{A}|\times d matrix of all features. Let D_{i}={\rm diag}(\nu_{\pi_{i}}), and let \widehat{D}_{i}={\rm diag}\big(\frac{1}{m}\sum_{t\in\mathcal{H}_{i}}X_{t}\big), where X_{t} is a state-action indicator as in Lemma A.2. Let M_{i}=\Phi^{\top}D_{i}\Phi+\alpha m^{-1}I. We can write \widehat{M}_{i}^{-1} as

\displaystyle\widehat{M}_{i}^{-1} =(\Phi^{\top}D_{i}\Phi+\alpha m^{-1}I+\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi)^{-1}
\displaystyle=(M_{i}+\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi)^{-1}
\displaystyle=(I+M_{i}^{-1}\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi)^{-1}M_{i}^{-1}

By Assumption 6.2 and 6.1, Mi12σ2\|M_{i}^{-1}\|_{2}\leq\sigma^{-2}. In Appendix C.4, we show that w.p. at least 1δ1-\delta,

Φ(D^iDi)Φ26m1/2CΦ2(1γ)12log(d/δ)\displaystyle\|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2}\leq 6m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)}

Thus

M^i12σ2(1σ26m1/2CΦ2(1γ)12log(d/δ))1\displaystyle\|\widehat{M}_{i}^{-1}\|_{2}\leq\sigma^{-2}(1-\sigma^{-2}6m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)})^{-1}

For m\geq 72C_{\Phi}^{4}\sigma^{-2}(1-\gamma)^{-2}\log(d/\delta), the above norm is upper-bounded by \|\widehat{M}_{i}^{-1}\|_{2}\leq 2\sigma^{-2}.

C.4 Bounding \|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2}

For any matrix AA,

ΦAΦ2=ijAijϕiϕj2i,j|Aij|ϕiϕj2CΦ2i,j|Aij|=CΦ2A1,1.\displaystyle\|\Phi^{\top}A\Phi\|_{2}=\bigg{\|}\sum_{ij}A_{ij}\phi_{i}\phi_{j}^{\top}\bigg{\|}_{2}\leq\sum_{i,j}|A_{ij}|\|\phi_{i}\phi_{j}^{\top}\|_{2}\leq C_{\Phi}^{2}\sum_{i,j}|A_{ij}|=C_{\Phi}^{2}\|A\|_{1,1}\,. (C.7)

where A1,1\|A\|_{1,1} denotes the sum of absolute entries of AA. Using the same notation for XtX_{t} as in Lemma A.2,

\displaystyle\|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2} =\bigg{\|}\frac{1}{m}\sum_{t\in\mathcal{H}_{i}}\Phi^{\top}{\rm diag}(X_{t}-\nu_{t}+\nu_{t}-\nu_{\pi_{i}})\Phi\bigg{\|}_{2}
\displaystyle\leq\frac{1}{m}\bigg{\|}\sum_{t\in\mathcal{H}_{i}}\Phi^{\top}{\rm diag}(X_{t}-\nu_{t})\Phi\bigg{\|}_{2}+\frac{C_{\Phi}^{2}}{m}\sum_{t\in\mathcal{H}_{i}}\|\nu_{t}-\nu_{\pi_{i}}\|_{1}\,.

Under the fast-mixing Assumption 2.1, the second term is bounded by 2C_{\Phi}^{2}m^{-1}(1-\gamma)^{-1}.

For the first term, we can define a martingale (B_{j})_{j=0}^{m} analogous to the Doob martingale in Lemma A.2, but restricted to the m indices in \mathcal{H}_{i}. Note that \sum_{t\in\mathcal{H}_{i}}\Phi^{\top}{\rm diag}(X_{t}-\nu_{t})\Phi=\Phi^{\top}{\rm diag}(B_{m}-B_{0})\Phi, so we can apply the matrix Azuma inequality (Theorem A.3) to this sum. Given that the squared martingale differences satisfy

\displaystyle\|(\Phi^{\top}{\rm diag}(B_{j}-B_{j-1})\Phi)^{2}\|_{2}\leq 4C_{\Phi}^{4}(1-\gamma)^{-2},

combining the two terms, we have that with probability at least 1δ1-\delta,

Φ(D^iDi)Φ2\displaystyle\|\Phi^{\top}(\widehat{D}_{i}-D_{i})\Phi\|_{2} 4m1/2CΦ2(1γ)12log(d/δ)+2m1CΦ2(1γ)1\displaystyle\leq 4m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)}+2m^{-1}C_{\Phi}^{2}(1-\gamma)^{-1}
6m1/2CΦ2(1γ)12log(d/δ).\displaystyle\leq 6m^{-1/2}C_{\Phi}^{2}(1-\gamma)^{-1}\sqrt{2\log(d/\delta)}\,.

Appendix D Bounding \mathbb{E}_{x\sim\mu_{*}}[\widehat{V}_{K}(x)-V_{K}(x)]

We write the value function error as follows:

𝔼xμ[V^K(x)VK(x)]\displaystyle\mathbb{E}_{x\sim\mu*}[\widehat{V}_{K}(x)-V_{K}(x)] =xμ(x)aϕ(x,a)1Ki=1Kπi(a|x)(w^πiwπi)\displaystyle=\sum_{x}\mu_{*}(x)\sum_{a}\phi(x,a)^{\top}\frac{1}{K}\sum_{i=1}^{K}\pi_{i}(a|x)(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})
1Kxμ(x)aϕ(x,a)2i=1Kπi(a|x)(w^πiwπi)2\displaystyle\leq\frac{1}{K}\sum_{x}\mu_{*}(x)\sum_{a}\|\phi(x,a)\|_{2}\left\|\sum_{i=1}^{K}\pi_{i}(a|x)(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})\right\|_{2}

Note that for any set of scalars {pi}i=1K\{p_{i}\}_{i=1}^{K} with pi[0,1]p_{i}\in[0,1], the term i=1Kpi(w^πiwπi)2\left\|\sum_{i=1}^{K}p_{i}(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})\right\|_{2} has the same upper bound as i=1K(w^πiwπi)2\|\sum_{i=1}^{K}(\widehat{w}_{\pi_{i}}-w_{\pi_{i}})\|_{2}. The reason is as follows. One part of the error includes bias terms (C.2) and (C.4), whose upper bounds are only smaller when reweighted by scalars in [0,1][0,1]. Thus we can simply upper-bound the bias by setting all {pi}i=1K\{p_{i}\}_{i=1}^{K} to 1. Another part of the error, analyzed in Appendices C.1 and C.2 involves sums of norm-subGaussian vectors. In this case, applying the weights only results in these vectors potentially having smaller norm bounds. We keep the same bounds for simplicity, again corresponding to all {pi}i=1K\{p_{i}\}_{i=1}^{K} equal to 1. Thus, reusing the results of the previous section, we have

𝔼xμ[V^K(x)VK(x)]\displaystyle\mathbb{E}_{x\sim\mu*}[\widehat{V}_{K}(x)-V_{K}(x)] CΦ|𝒜|cσ2(Cw+CΦ)blog(2d/δ)Km.\displaystyle\leq C_{\Phi}|\mathcal{A}|c\sigma^{-2}(C_{w}+C_{\Phi})b\sqrt{\frac{\log(2d/\delta)}{Km}}.

Appendix E Independent Blocks

Blocks. Recall that we partition each phase into 2m2m blocks of size bb. Thus, after kk phases we have a total of 2km2km blocks. Let \mathbb{P} denote the joint distribution of state-action pairs in odd blocks. Let i\mathcal{I}_{i} denote the set of indices in the ithi^{th} block, and let xi,aix_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}} denote the corresponding states and actions. We factorize the joint distribution according to blocks:

(x1,a1,x3,a3,,x2km1,a2km1)=\displaystyle\mathbb{P}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}},x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}},\ldots,x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}})= 1(x1,a1)×3(x3,a3|x1,a1)×\displaystyle\;\;\mathbb{P}_{1}(x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\mathbb{P}_{3}(x_{\mathcal{I}_{3}},a_{\mathcal{I}_{3}}|x_{\mathcal{I}_{1}},a_{\mathcal{I}_{1}})\times\cdots
×2km1(x2km1,a2km1|x2km3,a2km3).\displaystyle\times\mathbb{P}_{2km-1}(x_{\mathcal{I}_{2km-1}},a_{\mathcal{I}_{2km-1}}|x_{\mathcal{I}_{2km-3}},a_{\mathcal{I}_{2km-3}}).

Let ~i\widetilde{\mathbb{P}}_{i} be the marginal distribution over the variables in block ii, and let ~\widetilde{\mathbb{P}} be the product of marginals. Then the difference between the distributions ~\widetilde{\mathbb{P}} and \mathbb{P} can be written as

~=\displaystyle\mathbb{P}-\widetilde{\mathbb{P}}= 132km11~3~2km1\displaystyle\;\;{\mathbb{P}_{1}}\otimes\mathbb{P}_{3}\otimes\cdots\otimes\mathbb{P}_{2km-1}-{\mathbb{P}_{1}}\otimes\widetilde{\mathbb{P}}_{3}\cdots\otimes\widetilde{\mathbb{P}}_{2km-1}
=\displaystyle= 1(3~3)52km1\displaystyle\;\;{\mathbb{P}_{1}}\otimes(\mathbb{P}_{3}-\widetilde{\mathbb{P}}_{3})\otimes\mathbb{P}_{5}\otimes\cdots\otimes\mathbb{P}_{2km-1}
+1~3(5~5)72km1\displaystyle+\mathbb{P}_{1}\otimes\widetilde{\mathbb{P}}_{3}\otimes(\mathbb{P}_{5}-\widetilde{\mathbb{P}}_{5})\otimes\mathbb{P}_{7}\otimes\ldots\otimes\mathbb{P}_{2km-1}
+\displaystyle+\cdots
+1~3~5~2km3(2km1~2km1).\displaystyle+{\mathbb{P}}_{1}\otimes\widetilde{\mathbb{P}}_{3}\otimes\widetilde{\mathbb{P}}_{5}\otimes\cdots\otimes\widetilde{\mathbb{P}}_{2km-3}\otimes(\mathbb{P}_{2km-1}-\widetilde{\mathbb{P}}_{2km-1}).

Under β\beta-mixing, since the gap between the blocks is of size bb, we have that

\displaystyle\|\mathbb{P}_{i}(x_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}}|x_{\mathcal{I}_{i-2}},a_{\mathcal{I}_{i-2}})-\widetilde{\mathbb{P}}_{i}(x_{\mathcal{I}_{i}},a_{\mathcal{I}_{i}})\|_{1}\leq\beta_{b}\leq\frac{2\gamma^{b}}{1-\gamma}.

Thus the difference between the joint distribution and the product of marginals is bounded as

~1(km1)βb.\displaystyle\|\mathbb{P}-\widetilde{\mathbb{P}}\|_{1}\leq(km-1)\beta_{b}.

Block starts. Now let \mathbb{P} denote the distribution over state-action pairs (xt,at)(x_{t},a_{t}) for indices tt corresponding to block starts, i.e. t{1,b+1,2b+1,,(2km1)b+1}t\in\{1,b+1,2b+1,...,(2km-1)b+1\}. We again factorize the distribution over blocks:

\displaystyle\mathbb{P}(x_{1},a_{1},x_{b+1},a_{b+1},\ldots,x_{(2km-1)b+1},a_{(2km-1)b+1})=\mathbb{P}_{1}(x_{1},a_{1})\prod_{j=2}^{2km}\mathbb{P}_{j}(x_{jb+1},a_{jb+1}|x_{(j-1)b+1},a_{(j-1)b+1}).

Define a product-of-marginals distribution ~=~1~2~2km\widetilde{\mathbb{P}}=\widetilde{\mathbb{P}}_{1}\otimes\widetilde{\mathbb{P}}_{2}\otimes\cdots\otimes\widetilde{\mathbb{P}}_{2km} over the block-start variables as follows. For odd jj, let ~j\widetilde{\mathbb{P}}_{j} be the marginal of \mathbb{P} over (xjb+1,ajb+1)(x_{jb+1},a_{jb+1}). For even jj in phase ii, let ~j=νπi\widetilde{\mathbb{P}}_{j}=\nu_{\pi_{i}} correspond to the stationary distribution of the policy πi\pi_{i}. Using the same notation as in Appendix A, let XtX_{t} be the indicator vector for (xt,at)(x_{t},a_{t}) and let Hi:jH_{i:j} be the product of state-action transition matrices at times i+1,,ji+1,...,j. For odd blocks jj, we have

j(|x(j1)b+1,a(j1)b+1)~j()1=H(j1)b+1:jb(X(j1)b+1~j1)12γb1.\displaystyle\|\mathbb{P}_{j}(\cdot|x_{(j-1)b+1},a_{(j-1)b+1})-\widetilde{\mathbb{P}}_{j}(\cdot)\|_{1}=\|H_{(j-1)b+1:jb}^{\top}(X_{(j-1)b+1}-\widetilde{\mathbb{P}}_{j-1})\|_{1}\leq 2\gamma^{b-1}\,.

Slightly abusing notation, let HπiH_{\pi_{i}} be the state-action transition matrix under policy πi\pi_{i}. For even blocks jj in phase ii, since they always follow an odd block in the same phase,

\displaystyle\|\mathbb{P}_{j}(\cdot|x_{(j-1)b+1},a_{(j-1)b+1})-\widetilde{\mathbb{P}}_{j}(\cdot)\|_{1}=\|(H_{\pi_{i}}^{b-1})^{\top}(X_{(j-1)b+1}-\nu_{\pi_{i}})\|_{1}\leq 2\gamma^{b-1}\,.

Thus, using a similar distribution decomposition as before, we have that ~12(2km1)γb1\|\mathbb{P}-\widetilde{\mathbb{P}}\|_{1}\leq 2(2km-1)\gamma^{b-1}.