
A Tighter Convergence Proof of Reverse Experience Replay

Nan Jiang, Jinzhao Li, Yexiang Xue
{jiang631, li4255, yexiang}@purdue.edu
Department of Computer Science
Purdue University, USA
Abstract

In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update its parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis only holds for a very small learning rate and short consecutive sequences, and the resulting method converges more slowly than algorithms that use a large learning rate without RER. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitation on the learning rate and the length of the consecutive steps. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.

1 Introduction

Reinforcement Learning (RL) has been highly successful for a variety of practical problems in the realm of long-term decision-making. Experience Replay (ER) of historical trajectories plays a vital role in RL algorithms (Lin, 1992; Mnih et al., 2015). A trajectory is a sequence of transitions, where each transition is a state, action, and reward tuple. The memory used to store these experienced trajectories is referred to as the replay buffer. How transitions are sampled from the replay buffer determines the rate and stability of convergence of the learning algorithm.

Reverse Experience Replay (RER) (Florensa et al., 2017; Rotinov, 2019; Lee et al., 2019; Agarwal et al., 2022) is a recently proposed approach inspired by the hippocampal reverse replay mechanism in human and animal brains (Foster & Wilson, 2006; Ambrose et al., 2016; Igata et al., 2021). Theoretical analysis shows that RER improves the convergence rate towards optimal policies compared with ER-based algorithms. Unlike ER, which samples transitions from the replay buffer uniformly (van Hasselt et al., 2016) (classic experience replay) or with non-uniform weights (Schaul et al., 2016) (prioritized experience replay), RER samples consecutive sequences of transitions from the buffer and feeds them into the learning algorithm in reverse order.

However, the most recent theoretical analysis of RER with $Q$-learning only holds for a very small learning rate and short consecutive sequences (Agarwal et al., 2022), which makes it converge more slowly than the classic $Q$-learning algorithm (together with ER) run with a large learning rate. We attempt to bridge this gap between theory and practice for the newly proposed reverse experience replay algorithm.

In this paper, we provide a tighter analysis that relaxes the limitation on the learning rate and the length of the consecutive transitions. Our key idea is to transform the original problem, which involves a giant summation (shown in Equation (5)), into a combinatorial counting problem (shown in Lemma 2), which greatly simplifies the whole problem. We hope this idea of transforming the original problem into a combinatorial counting problem can benefit other relevant domains. Furthermore, we show in Theorem 2 that RER converges faster with a larger learning rate $\eta$ and a longer consecutive sequence $L$ of state-action-reward tuples.

2 Preliminaries

Markov Decision Process

We consider a Markov decision process (MDP) with discounted rewards, denoted $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$. Here $\mathcal{S}\subset\mathbb{R}^{d}$ is the set of states, $\mathcal{A}$ is the set of actions, and $\gamma\in(0,1)$ is the discount factor. We write $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ for the transition probability kernel of the MDP. For each pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, $P(s'|s,a)$ is the probability of transitioning to state $s'$ from state $s$ when action $a$ is executed. The reward function is $r:\mathcal{S}\times\mathcal{A}\to[-1,1]$, so that $r(s,a)$ is the immediate reward received in state $s$ when action $a$ is executed (Puterman, 1994). A policy $\pi$ maps each state to a distribution over the set of actions: $\pi(s):\mathcal{A}\to[0,1]$ for $s\in\mathcal{S}$. A trajectory is written $\{(s_{t},a_{t},r_{t})\}_{t=0}^{\infty}$, where $s_{t}$ (respectively $a_{t}$) is the state (respectively the action taken) at time $t$, $r_{t}=r(s_{t},a_{t})$ is the reward received at time $t$, and $(s_{t},a_{t},r_{t},s_{t+1})$ is the $t$-th transition.

Value Function and $Q$-Function

The value function of a policy $\pi$ is denoted $V^{\pi}:\mathcal{S}\to\mathbb{R}$. For $s\in\mathcal{S}$, $V^{\pi}(s):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s\right]$, the expected discounted cumulative reward received when 1) the initial state is $s_{0}=s$; 2) actions are taken according to the policy $\pi$, i.e., $a_{t}\sim\pi(s_{t})$ for $t\geq 0$; and 3) the trajectory is generated by the transition kernel, i.e., $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for all $t\geq 0$. Similarly, let $Q^{\pi}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ be the action-value function (also known as the $Q$-function) of a policy $\pi$. For $(s,a)\in\mathcal{S}\times\mathcal{A}$, it is defined as $Q^{\pi}(s,a):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s,a_{0}=a\right]$.

There exists an optimal policy, denoted $\pi^{*}$, that maximizes $Q^{\pi}(s,a)$ uniformly over all state-action pairs $(s,a)\in\mathcal{S}\times\mathcal{A}$ (Watkins, 1989). We write $Q^{*}$ for the $Q$-function of $\pi^{*}$, i.e., $Q^{*}=Q^{\pi^{*}}$. The Bellman operator $\mathcal{T}$ acts on a $Q$-function as follows: for $(s,a)\in\mathcal{S}\times\mathcal{A}$,

\mathcal{T}(Q)(s,a):=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right].

The optimal $Q$-function $Q^{*}$ is the unique fixed point of the Bellman operator (Bertsekas & Yu, 2012).
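Since the Bellman operator is a $\gamma$-contraction, simply iterating it converges to $Q^{*}$. The following is a minimal sketch on a randomly generated tabular MDP; the sizes, kernel, and rewards are illustrative assumptions rather than anything from this paper.

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """T(Q)(s,a) = r(s,a) + gamma * E_{s' ~ P(.|s,a)}[ max_{a'} Q(s', a') ]."""
    return r + gamma * P @ Q.max(axis=1)

# Toy MDP: 4 states, 2 actions, random kernel and rewards (illustrative only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.uniform(-1, 1, size=(S, A))

Q = np.zeros((S, A))
for _ in range(500):                      # value iteration: repeated application of T
    Q = bellman_operator(Q, P, r, gamma)
print(np.abs(bellman_operator(Q, P, r, gamma) - Q).max())  # ~0: Q is the fixed point
```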

$Q$-learning

The $Q$-learning algorithm is a model-free algorithm for learning $Q^{*}$ (Watkins & Dayan, 1992). The high-level idea is to find the fixed point of the Bellman operator. Given a trajectory $\{(s_{t},a_{t},r_{t})\}_{t=0}^{\infty}$ generated by some underlying behavior policy $\pi^{\prime}$, asynchronous $Q$-learning produces a new estimate $Q_{t+1}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ at each step. At time $t\geq 0$, given a transition $(s_{t},a_{t},r_{t},s_{t+1})$, the algorithm updates as follows:

Q_{t+1}(s_{t},a_{t}) = (1-\eta)Q_{t}(s_{t},a_{t})+\eta\,\mathcal{T}_{t+1}(Q_{t})(s_{t},a_{t}), \qquad (1)
Q_{t+1}(s,a) = Q_{t}(s,a) \quad \text{for all }(s,a)\neq(s_{t},a_{t}).

Here $\eta\in(0,1)$ is the learning rate and $\mathcal{T}_{t+1}$ is the empirical Bellman operator: $\mathcal{T}_{t+1}(Q_{t})(s_{t},a_{t}):=r(s_{t},a_{t})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{t}(s_{t+1},a^{\prime})$. Under mild conditions, $Q_{t}$ converges to the fixed point of the Bellman operator and hence to $Q^{*}$. When the state space $\mathcal{S}$ is small, a tabular structure can be used to store the values of $Q_{t}(s,a)$ for $(s,a)\in\mathcal{S}\times\mathcal{A}$.
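As a concrete illustration, here is a minimal sketch of the tabular update in Equation (1), assuming a NumPy array `Q` indexed by (state, action); the function name and arguments are ours, not part of the paper.

```python
def q_update(Q, s, a, r, s_next, eta, gamma):
    """One asynchronous Q-learning step, Equation (1)."""
    target = r + gamma * Q[s_next].max()          # empirical Bellman operator at (s, a)
    Q[s, a] = (1 - eta) * Q[s, a] + eta * target  # only the visited entry is updated
    return Q
```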

$Q$-learning with Function Approximation

When the state space $\mathcal{S}$ is large, the asynchronous $Q$-learning update in Equation (1) cannot be applied directly, since it needs to maintain a table over all states and actions. In this case, function approximation is brought into $Q$-learning. Let $Q^{w}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ be an approximate $Q$-function, typically represented by a deep neural network (Mnih et al., 2015), where $w$ denotes the parameters of the network. $Q^{w}$ is often called the $Q$-network. Given a batch of transitions $\{(s_{t_{i}},a_{t_{i}},r_{t_{i}},s_{t_{i}+1})\}_{i=1}^{m}$, we define $y_{t_{i}}$ as the image of $Q^{w^{\prime}}(s_{t_{i}},a_{t_{i}})$ under the empirical Bellman operator, that is:

y_{t_{i}}:=r_{t_{i}}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{w^{\prime}}(s_{t_{i}+1},a^{\prime}),\quad\text{for }1\leq i\leq m,

where $w^{\prime}$ represents the parameters of the target neural network. The parameters $w^{\prime}$ are synchronized to $w$ every $T_{target}$ steps of Stochastic Gradient Descent (SGD). Since $Q^{*}$ is the fixed point of the Bellman operator, $y_{t_{i}}$ should match $Q^{w}(s_{t_{i}},a_{t_{i}})$ when $Q^{w}$ converges to $Q^{*}$. Hence, learning minimizes the following objective via SGD: $\ell(w)=\frac{1}{m}\sum_{i=1}^{m}\|y_{t_{i}}-Q^{w}(s_{t_{i}},a_{t_{i}})\|_{2}^{2}$.
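To make the objective concrete, the sketch below uses a linear parameterization $Q^{w}(s,a)=\langle w,\phi(s,a)\rangle$ in place of a neural network, so the gradient of $\ell(w)$ is explicit; the helpers `phi`, `actions`, and the tuple layout of `batch` are illustrative assumptions.

```python
import numpy as np

def td_targets(batch, w_target, phi, actions, gamma):
    """y_i = r_i + gamma * max_{a'} Q^{w'}(s_{i+1}, a'), with the target parameter w' frozen."""
    return np.array([
        r + gamma * max(phi(s_next, ap) @ w_target for ap in actions)
        for (_, _, r, s_next) in batch
    ])

def sgd_step(w, w_target, batch, phi, actions, gamma, eta):
    """One SGD step on l(w) = (1/m) sum_i (y_i - <w, phi(s_i, a_i)>)^2, holding y_i fixed."""
    y = td_targets(batch, w_target, phi, actions, gamma)
    grad = np.zeros_like(w)
    for (s, a, _, _), y_i in zip(batch, y):
        feat = phi(s, a)
        grad += 2.0 * (feat @ w - y_i) * feat / len(batch)
    return w - eta * grad
```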

Experience Replay

For $Q$-learning with function approximation, new trajectories are generated by executing a behavioral policy and are then saved into the replay buffer, denoted $\mathcal{B}$. When minimizing $\ell(w)$, SGD is performed on batches of transitions sampled at random from the replay buffer. This process is called Experience Replay (ER) (Lin, 1992; Li et al., 2022). To improve the stability and convergence rate of $Q$-learning, follow-up works sample transitions from the replay buffer with non-uniform probability distributions. Prioritized experience replay favors transitions with large temporal-difference errors (Schaul et al., 2016; Saglam et al., 2023). DisCor (Kumar et al., 2020) favors transitions with small Bellman errors. LaBER proposes a generalized TD error to reduce the variance of the gradient and improve learning stability (Lahire et al., 2022). Hindsight experience replay relabels goals in each episode with imagined outcomes, allowing the agent to learn from unsuccessful attempts as if they were successful (Andrychowicz et al., 2017).

Reverse Experience Replay

Reverse Experience Replay (RER) is a recently proposed variant of experience replay (Goyal et al., 2019; Bai et al., 2021; Agarwal et al., 2022). RER samples consecutive sequences of transitions from the replay buffer, and the $Q$-learning algorithm updates its parameters by traversing each sampled sequence in reverse order. Compared with ER, RER converges faster towards the optimal policy both empirically (Lee et al., 2019) and theoretically (Agarwal et al., 2022), under tabular and linear MDP settings. One intuitive explanation of why RER works is to consider a sequence of consecutive transitions $s_{1}\xrightarrow{a_{1},r_{1}}s_{2}\xrightarrow{a_{2},r_{2}}s_{3}$. An incorrect estimate of $Q(s_{2},a_{2})$ affects the estimate of $Q(s_{1},a_{1})$. Updating in reverse order therefore lets the update of $Q(s_{1},a_{1})$ use the most up-to-date value of $Q(s_{2},a_{2})$, accelerating convergence.
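A minimal tabular sketch of this reverse sweep (all names ours): the last transition of the sampled sequence is replayed first, so every earlier update bootstraps from an already-refreshed value.

```python
def rer_update(Q, sequence, eta, gamma):
    """Replay a consecutive sequence of (s, a, r, s_next) transitions in reverse order."""
    for (s, a, r, s_next) in reversed(sequence):
        target = r + gamma * Q[s_next].max()
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    return Q
```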

2.1 Problem Setups for Reverse Experience Replay

Linear MDP Assumption

In this paper, we follow the definition of a linear MDP from Zanette et al. (2020), which states that the reward function can be written as the inner product of the parameter $w$ and the feature function $\phi$. Therefore, the $Q$-function depends only on $w$ and the feature vector $\phi(s,a)\in\mathbb{R}^{d}$ for state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$.

Assumption 1 (Linear MDP setting from Zanette et al., 2020).

There exists a vector $w\in\mathbb{R}^{d}$ such that $R(s,a;w)=\langle w,\phi(s,a)\rangle$, and the transition probability is proportional to its corresponding feature, $\mathcal{P}(\cdot|s,a)\propto\phi(s,a)$. Consequently, the optimal $Q$-function satisfies $Q^{*}(s,a;w^{*})=\langle w^{*},\phi(s,a)\rangle$ for every $s\in\mathcal{S},a\in\mathcal{A}$.

Assumption 1 is the widely used linear MDP assumption, which allows us to quantify the convergence rate (or sample complexity) of the $Q$-learning algorithm (Zanette et al., 2020; Agarwal et al., 2022). We need the following additional assumptions to obtain the final convergence result. Throughout, the sequence of consecutive transitions has length $L$ and the gradient descent algorithm uses a constant learning rate $\eta$.

Assumption 2 (from Zanette et al. (2020)).

The MDP has zero inherent Bellman error and $\phi(s,a)^{\top}\phi(s,a)\leq 1$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. There exists a constant $\kappa>0$ such that $\mathbb{E}_{(s,a)\sim\mu}\left[\phi(s,a)\phi(s,a)^{\top}\right]\succeq\mathrm{I}/\kappa$. Here $\mu$ is the stationary distribution over all state-action pairs of the Markov chain determined by the transition kernel and the policy.

Remark 1.

Suppose we pick a set of state-action tuples $\mathcal{L}=\{(s,a)\,|\,(s,a)\in\mathcal{S}\times\mathcal{A}\}$, which may contain duplicated tuples. By linearity of expectation, we have $\mathbb{E}_{\mu}\left(\sum_{(s,a)\in\mathcal{L}}\phi(s,a)\phi(s,a)^{\top}\right)=\sum_{\mathcal{L}}\mathbb{E}_{(s,a)\sim\mu}\left(\phi(s,a)\phi(s,a)^{\top}\right)\succeq\frac{|\mathcal{L}|}{\kappa}\mathrm{I}$, where $|\mathcal{L}|$ denotes the number of state-action tuples in the set.

Definition 1.

Given the feature function $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d}$, denote the largest inner product between a parameter $w$ and the feature function $\phi$ as $\|w\|_{\phi}=\sup_{(s,a)}|\langle\phi(s,a),w\rangle|$.

Definition 2.

Let $\mathrm{I}$ be the $d\times d$ identity matrix and let $\eta\in\mathbb{R}$ be the learning rate. Define the matrix $\Gamma_{l}$ recursively as follows:

\Gamma_{l}\coloneqq\begin{cases}\mathrm{I}&\text{for }l=0,\\ \left(\mathrm{I}-\eta\phi_{L+1-l}\phi^{\top}_{L+1-l}\right)\Gamma_{l-1}&\text{for }1\leq l\leq L,\end{cases} \qquad (2)

where we use the shorthand $\phi_{L+1-l}$ for $\phi(s_{L+1-l},a_{L+1-l})$. The explicit form of $\Gamma_{L}$ is:

\Gamma_{L}=\left(\mathrm{I}-\eta\phi_{1}\phi^{\top}_{1}\right)\left(\mathrm{I}-\eta\phi_{2}\phi^{\top}_{2}\right)\cdots\left(\mathrm{I}-\eta\phi_{L}\phi^{\top}_{L}\right)=\prod_{l=1}^{L}\left(\mathrm{I}-\eta\phi_{l}\phi^{\top}_{l}\right). \qquad (3)

The semantic interpretation of $\Gamma_{L}$ in Definition 2 is that it is the coefficient of the bias term in the error analysis of the learning algorithm's parameter (as outlined in Lemma 4). This joint product arises because the RER algorithm updates the parameter over a sub-sequence of consecutive transitions of length $L$. The norm of $\Gamma_{L}$ is influenced by both the sequence length $L$ and the learning rate $\eta$. When the norm of $\Gamma_{L}$ is small, the parameters of the learning model converge more rapidly to their optimal values.
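For intuition, here is a short sketch that builds $\Gamma_{L}$ from Equation (3) for random unit-norm features and reports its spectral norm; the dimensions and values are illustrative.

```python
import numpy as np

def gamma_L(phis, eta):
    """Gamma_L = prod_{l=1}^{L} (I - eta * phi_l phi_l^T), multiplied left to right as in Eq. (3)."""
    d = phis.shape[1]
    G = np.eye(d)
    for phi in phis:
        G = G @ (np.eye(d) - eta * np.outer(phi, phi))
    return G

rng = np.random.default_rng(0)
d, L, eta = 5, 8, 0.2
phis = rng.normal(size=(L, d))
phis /= np.linalg.norm(phis, axis=1, keepdims=True)   # features with norm at most 1
print(np.linalg.norm(gamma_L(phis, eta), 2))          # smaller norm -> faster convergence
```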

3 Methodology

3.1 Motivation

Let $\mu$ denote the stationary distribution of the state-action pairs in the MDP, $\eta$ the learning rate of the gradient descent algorithm, and $L$ the length of the consecutive transitions processed by the RER algorithm. Previous work (Agarwal et al., 2022, Lemma 8 and Lemma 14) established that when $\eta L\leq\frac{1}{3}$, the following inequality holds:

\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right]\preceq\mathrm{I}-\eta\sum_{l=1}^{L}\mathbb{E}_{(s,a)\sim\mu}\left[\phi_{l}\phi^{\top}_{l}\right]\preceq\left(1-\frac{\eta L}{\kappa}\right)\mathrm{I}, \qquad (4)

where the matrix $\Gamma_{L}\in\mathbb{R}^{d\times d}$ is defined in Definition 2 and serves as a "coefficient" in the convergence analysis, as outlined in Lemma 4. The positive semi-definite relation $\preceq$ between two matrices is defined in Definition 4. Here, $\mathrm{I}$ is the $d\times d$ identity matrix, and the constant $\kappa>0$ is introduced in Assumption 2. The matrix $\Gamma_{L}$ was mentioned in (Agarwal et al., 2022, Appendix E, Equation 5), but we provide a formal definition here and streamline the original expression by removing unnecessary variables.

The condition in Equation (4) was further incorporated into the convergence requirement in (Agarwal et al., 2022, Theorem 1). It implies that the RER algorithm cannot handle consecutive sequences that are too long (large $L$) or learning rates that are too large (large $\eta$). This creates a major gap between the theoretical justification and the real-world application of the RER algorithm. In this work, we address this gap by providing a tighter theoretical analysis that relaxes the constraint $\eta L\leq 1/3$.

We begin by explaining the main difficulty in upper-bounding the term $\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right]$. According to Definition 2, we can expand $\Gamma_{L}^{\top}$ as $\Gamma_{L}^{\top}=\left(\mathrm{I}-\eta\phi_{L}\phi^{\top}_{L}\right)\cdots\left(\mathrm{I}-\eta\phi_{1}\phi^{\top}_{1}\right)$. Using the linearity of expectation, we expand the entire joint product $\Gamma_{L}^{\top}\Gamma_{L}$ under the expectation as follows:

\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right] =\mathbb{E}_{(s,a)\sim\mu}\left[\left(\mathrm{I}-\eta\phi_{L}\phi^{\top}_{L}\right)\cdots\left(\mathrm{I}-\eta\phi_{1}\phi^{\top}_{1}\right)\left(\mathrm{I}-\eta\phi_{1}\phi^{\top}_{1}\right)\cdots\left(\mathrm{I}-\eta\phi_{L}\phi^{\top}_{L}\right)\right]
=\mathrm{I}-2\eta\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right]+\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}\right]. \qquad (5)

In the third term on the right-hand side (RHS) of the second line, the summation runs over all valid combinations of indices $(l_{1},l_{2},\ldots,l_{k})$, where $l_{1},\ldots,l_{k}\in\{1,2,\ldots,L\}$. A combination is built by first selecting the index $l_{1}$ from a position in the index sequence $[L,L-1,\ldots,2,1,1,2,\ldots,L-1,L]$, as read off the first line of the equation above. The second index $l_{2}$ is then chosen from a position to the right of $l_{1}$. In general, a valid combination requires that the position of $l_{i-1}$ lies to the left of the position of $l_{i}$ for every $i$.

The main challenge in upper-bounding the entire product $\Gamma_{L}^{\top}\Gamma_{L}$ under the expectation lies in upper-bounding the combinatorially many high-order terms. Our approach leverages the high-level idea that the RHS of Equation (5) can be upper-bounded by a form of $\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right]$ with an appropriate coefficient. Specifically, we demonstrate that the third term on the RHS, which contains a large number of combinatorial terms of the form $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$, can be bounded by terms involving only $\phi_{l}\phi^{\top}_{l}$ (with $1\leq l\leq L$) through a proposed combinatorial counting method.
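Equation (5) is simply the distributive expansion of the product of $2L$ factors. The sketch below (with illustrative $d$, $L$, $\eta$) checks this numerically by enumerating all ordered subsets of the factor positions $[L,\ldots,1,1,\ldots,L]$ and comparing the expansion with the direct matrix product.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, L, eta = 4, 3, 0.3
phis = rng.normal(size=(L, d))
phis /= np.linalg.norm(phis, axis=1, keepdims=True)

idx = list(range(L, 0, -1)) + list(range(1, L + 1))          # indices [L, ..., 1, 1, ..., L]
factors = [np.eye(d) - eta * np.outer(phis[i - 1], phis[i - 1]) for i in idx]
direct = np.linalg.multi_dot(factors)                        # Gamma_L^T Gamma_L

expanded = np.zeros((d, d))
for k in range(2 * L + 1):                                   # choose k factor positions, in order
    for positions in combinations(range(2 * L), k):
        term = np.eye(d)
        for p in positions:
            i = idx[p]
            term = term @ np.outer(phis[i - 1], phis[i - 1])
        expanded += (-eta) ** k * term

print(np.allclose(direct, expanded))                         # True: Equation (5) holds
```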

Theorem 1.

Let $\mu$ be the stationary distribution of the state-action pairs in the MDP. The following positive semi-definite matrix inequality holds for $\eta\in(0,1)$:

\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right]\preceq\left(1-\frac{\left(\eta(4-2L)-(1-\eta)^{L-1}-\eta^{2}+1\right)L}{\kappa}\right)\mathrm{I}, \qquad (6)

where the matrix $\Gamma_{L}$ is defined in Definition 2. The relation $\preceq$ between the matrices on both sides is defined in Definition 4, referring to the positive semi-definite ordering.

Proof Sketch.

By the linearity of expectation, we can upper-bound the second term of Equation (5) as follows:

-2\eta\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right]=-2\eta\sum_{l=1}^{L}\mathbb{E}_{(s,a)\sim\mu}\left[\phi_{l}\phi^{\top}_{l}\right]=-2\eta L\,\mathbb{E}_{(s,a)\sim\mu}\left[\phi\phi^{\top}\right]\preceq-\frac{2\eta L}{\kappa}\mathrm{I}. \qquad (7)

Based on the new analysis in Lemma 2, the third term of Equation (5) is upper-bounded as:

\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}\right] \preceq\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\tfrac{1}{2}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)\right]
\preceq\left((1-\eta)^{L-1}+\eta^{2}+\eta(2L-2)-1\right)\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right]
\preceq\frac{\left((1-\eta)^{L-1}+\eta^{2}+\eta(2L-2)-1\right)L}{\kappa}\mathrm{I}.

Combining these two inequalities, we arrive at the upper bound stated in the theorem. A detailed proof can be found in Appendix B. ∎

Theorem 1 is established based on the new analysis in Lemma 2, which is introduced in Section 3.2. It serves as a key component of the final convergence proof of the RER algorithm, presented in Section 4.

Numerical Justification of the Tighter Bound

We provide a numerical evaluation of the derived bound and the original bound of Agarwal et al. (2022, Lemma 8) in Figure 1. (The code for the numerical evaluation of the equalities and inequalities in this paper is available at https://github.com/jiangnanhugo/RER-proof.) For a fixed sequence length $L$, we compare the value $(\eta(4-2L)-(1-\eta)^{L-1}-\eta^{2}+1)L$ from our derived upper bound with the original value $\eta L$. For all the different sequence lengths, our derived expression is numerically higher than the original one, which implies that our bound (in Lemma 3) is tighter than the original bound in Agarwal et al. (2022, Lemma 8).

Figure 1: For all the different sequence lengths, our derived expression value is numerically higher than the original expression, which implies our bound (in Lemma 3) is tighter than the original one in Agarwal et al. (2022, Lemma 8).
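For reference, a minimal sketch (assuming the two coefficient expressions exactly as written above) that evaluates both quantities on a grid of $\eta$ for a fixed $L$, so they can be tabulated or plotted as in the released code.

```python
import numpy as np

def derived_coefficient(eta, L):
    """(eta*(4 - 2L) - (1 - eta)^(L-1) - eta^2 + 1) * L, the coefficient in Theorem 1."""
    return (eta * (4 - 2 * L) - (1 - eta) ** (L - 1) - eta ** 2 + 1) * L

def original_coefficient(eta, L):
    """eta * L, the coefficient in Agarwal et al. (2022, Lemma 8)."""
    return eta * L

L = 10
for eta in np.linspace(0.01, 0.9, 5):
    print(f"eta={eta:.2f}  derived={derived_coefficient(eta, L):+.3f}  "
          f"original={original_coefficient(eta, L):+.3f}")
```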

3.2 Relaxing the Requirement $\eta L\leq 1/3$ through Combinatorial Counting

Lemma 1.

Let $\mathbf{x}\in\mathbb{R}^{d}$ be any non-zero $d$-dimensional vector. For $l_{1},\ldots,l_{k}\in\{1,2,\ldots,L\}$ and $2\leq k\leq 2L$, consider a high-order term $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$ in Equation (5). Using the feature-norm bound in Assumption 2, we can relax this high-order term as follows:

\left|\mathbf{x}^{\top}\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}\mathbf{x}\right| \leq\frac{1}{2}\mathbf{x}^{\top}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)\mathbf{x}. \qquad (8)

The proof of this inequality can be found in Appendix A.1.

This result implies that, after the relaxation, only the first term $\phi_{l_{1}}\phi^{\top}_{l_{1}}$ (indexed by $l_{1}$) and the last term $\phi_{l_{k}}\phi^{\top}_{l_{k}}$ (indexed by $l_{k}$) determine the upper bound of the high-order term $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$. This reduces the original complex summation $\sum_{1\leq l_{1},\ldots,l_{k}\leq L}$ to counting how many valid choices of $l_{1}$ and $l_{k}$ can occupy each possible position in the sequence of transitions.
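A quick numerical sanity check of the relaxation in Lemma 1, using random unit-norm features and a random vector $\mathbf{x}$ (all names and sizes are illustrative): the chained outer products are dominated by the average of the first and last outer products.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 6, 5
phis = rng.normal(size=(k, d))
phis /= np.linalg.norm(phis, axis=1, keepdims=True)   # ||phi|| <= 1, as in Assumption 2
x = rng.normal(size=d)

chain = np.linalg.multi_dot([np.outer(p, p) for p in phis])
lhs = abs(x @ chain @ x)                         # |x^T phi_{l1} phi_{l1}^T ... phi_{lk} phi_{lk}^T x|
rhs = 0.5 * (x @ np.outer(phis[0], phis[0]) @ x  # (1/2) x^T (phi_{l1} phi_{l1}^T + phi_{lk} phi_{lk}^T) x
             + x @ np.outer(phis[-1], phis[-1]) @ x)
print(lhs <= rhs + 1e-12)                        # True
```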

Lemma 2.

Based on the relaxation provided in Lemma 1, the third term of Equation (5) can be expanded combinatorially as follows:

\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\frac{1}{2}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)=\underbrace{\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l=1}^{L}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)}_{\text{sum over combinatorially many terms}}\phi_{l}\phi^{\top}_{l}. \qquad (9)
Sketch of Proof.

As depicted in Figure 2, we consider two arrays of length $L$. The indices in these arrays are symmetrical: the left array decreases from $L$ to $1$, while the right array increases from $1$ to $L$. These arrays represent the indices of the matrix products in the first line of Equation (5): the left array corresponds to $\Gamma_{L}^{\top}$ and the right array to $\Gamma_{L}$. The key idea is to count the number of combinations of $l_{1}$ and $l_{k}$ that can produce $\phi_{l}\phi_{l}^{\top}$ for a fixed $l$ (where $1\leq l\leq L$).

Figure 2: Case 1 in the proposed combinatorial counting procedure. This case illustrates how many terms of the form $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$ reduce to $\phi_{l}\phi_{l}^{\top}$ for a fixed $l$ ($1\leq l\leq L$) via Lemma 1. If $l_{1}$ is assigned to the left $l$-th slot, then $l_{k}$ cannot choose any of the left slots with indices $L,\ldots,l+1$, because the sequential ordering constraint requires $l_{i}$ to lie to the right of $l_{i-1}$. To avoid double counting, $l_{k}$ is also disallowed from occupying the right $l$-th slot. Consequently, there are $L+l-2$ available slots for the remaining sequence $l_{2},\ldots,l_{k}$ of length $k-1$, giving $\binom{L+l-2}{k-1}$ such terms for this case. Further cases are illustrated in Figure 3 in the appendix.

In the first case, illustrated in Figure 2, we fix $l_{1}$ in the left $l$-th slot. Then $l_{k}$ cannot choose any of the slots in the left array with indices $L,\ldots,l+1$, because the sequential ordering constraint requires the position of $l_{i-1}$ to lie to the left of the position of $l_{i}$. Additionally, to avoid double counting, we also exclude the right $l$-th slot for $l_{k}$. Consequently, there are $L+l-2$ available slots for assigning the remaining sequence $l_{2},\ldots,l_{k}$. This results in $\binom{L+l-2}{k-1}$ contributions for this case, as shown on the right-hand side.

The remaining cases, illustrated in Figure 3 and analyzed in Appendix A.2, contribute the second and third terms in Equation (9). ∎

Lemma 2 shows how the complex summation $\sum_{l_{1},\ldots,l_{k}}$ is reduced to the much more manageable form $\sum_{l=1}^{L}$. This transformation significantly simplifies the task of obtaining a tighter upper bound.

Lemma 3.

For $\eta\in(0,1)$ and $L>1$, the following holds:
\sum_{k=2}^{2L}(-\eta)^{k}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)=(1-\eta)^{L+l-2}+(1-\eta)^{L-l}+\eta^{2}(1-\eta)^{2l-2}+\eta(2L-2)-2.

The proof of Lemma 3 is presented in detail in Appendix A.3, where we use the binomial theorem. To ensure that the alternating term $(-\eta)^{k}$ does not cause divergence, we require the learning rate to lie in the interval $\eta\in(0,1)$.

4 Sample Complexity of Reverse Experience Replay-Based $Q$-Learning on Linear MDPs

The convergence analysis assumes that every sub-trajectory of length $L$ is almost (or asymptotically) independent of the others with high probability. This condition, known as the mixing requirement for Markovian data, implies that the statistical dependence between two sub-trajectories $\tau_{L}$ and $\tau^{\prime}_{L}$ diminishes as they become further apart along the trajectory (Tagorti & Scherrer, 2015; Nagaraj et al., 2020).

Prior work (Lee et al., 2019) provided a convergence proof for the Reverse Experience Replay (RER) approach but did not address the rate of convergence, primarily because of the difficulty of analyzing deep neural networks quantitatively. By contrast, linear MDPs (defined in Assumption 1), which approximate the reward function and transition kernel linearly via features, allow for an asymptotic performance analysis of RER. Recently, Agarwal et al. (2022) presented the first theoretical proof for RER. However, their analysis is limited by stringent conditions, notably the requirement $\eta L\leq\frac{1}{3}$ on the learning rate. This constraint suggests that RER may struggle to compete with plain Experience Replay (ER) when using larger learning rates.

To address this challenge, we provide a tighter theoretical analysis of the RER method in Theorem 1. Our analysis mitigates the constraints on the learning rate required for convergence. We demonstrate that the convergence rate can be improved with a larger learning rate and a longer sequence of state-action-reward tuples, thus bridging the gap between theoretical convergence analysis and empirical learning results.

Algorithm 1 Episodic Q-learning with Reverse Experience Replay
1: Input: sequence length $L$ of consecutive state-action tuples; replay buffer $\mathcal{B}$; total learning episodes $T$; target network update frequency $N$.
2: Output: the best-learned policy.
3: for $t=1$ to $T$ do
4:     Act by the $\epsilon$-greedy strategy w.r.t. policy $\pi$.
5:     Save the new trajectory into the replay buffer $\mathcal{B}$.
6:     Retrieve a sub-trajectory $\tau_{L}$ from buffer $\mathcal{B}$, where $\tau_{l}:=(s_{l},a_{l},r_{l})$ for all $1\leq l\leq L$.
7:     for $l=1$ to $L$ do \triangleright reverse experience replay
8:         $\varepsilon\leftarrow r_{L+1-l}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q(s_{L+2-l},a^{\prime};\theta_{k})-Q(s_{L+1-l},a_{L+1-l};w_{t,l})$
9:         $w_{t,l+1}\leftarrow w_{t,l}+\eta\,\varepsilon\,\nabla_{w}Q(s_{L+1-l},a_{L+1-l};w_{t,l})$
10:     if $t\bmod N=0$ then \triangleright online target update
11:         $\theta_{k}\leftarrow w_{t,L+1}$
12:         $k\leftarrow k+1$
13:     $\pi(s)\leftarrow\arg\max_{a\in\mathcal{A}}Q(s,a;w_{t,L+1})$, for all $s\in\mathcal{S}$. \triangleright policy extraction
14: Return the converged policy $\pi$.
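A minimal Python sketch of the reverse sweep in Lines 7-9 of Algorithm 1 under the linear parameterization $Q(s,a;w)=\langle w,\phi(s,a)\rangle$; `traj`, `phi`, and `actions` are illustrative placeholders, and the target parameter `theta` stays frozen during the sweep, as in the pseudocode.

```python
import numpy as np

def rer_inner_loop(w, theta, traj, phi, actions, eta, gamma):
    """Sweep a length-L sub-trajectory in reverse, bootstrapping from the frozen target theta.

    traj:    list of (s, a, r, s_next) tuples of length L
    phi:     feature map, phi(s, a) -> d-dimensional vector
    actions: finite action set used for the max in the TD target
    """
    for (s, a, r, s_next) in reversed(traj):
        target = r + gamma * max(phi(s_next, ap) @ theta for ap in actions)
        td_error = target - phi(s, a) @ w
        w = w + eta * td_error * phi(s, a)   # gradient of a linear Q is its feature vector
    return w
```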
Lemma 4 (Bias and variance decomposition).

Define the error term for a parameter $w$ as the difference between the empirical estimate and the true MDP: $\varepsilon_{i}(w)\coloneqq Q(s_{i},a_{i})-Q^{*}(s_{i},a_{i})$. At the current iteration $t$, the difference between the current estimated parameter $w$ and the optimal parameter $w^{*}$, accumulated along the length-$L$ transitions with reverse updates, is:

w_{L}-w^{*}=\underbrace{\Gamma_{L}\left(w_{1}-w^{*}\right)}_{\text{bias term}}+\underbrace{\eta\sum_{l=1}^{L}\varepsilon_{l}\Gamma_{l-1}\phi_{l}}_{\text{variance term}}. \qquad (10)

For clarity, $\Gamma_{L}$ in Definition 2 is a joint product of $L$ terms involving the feature vectors of the consecutive state-action tuples. When the norm of $\Gamma_{L}$ is small, the parameter converges quickly to its optimum.

The first term on the RHS is the bias and the second is the variance along the sub-trajectory, which we will later show has zero mean.

The proof is presented in Appendix C.1. The result is obtained by unrolling the updates over the $L$ consecutive steps in reverse order, following Lines 7-9 of Algorithm 1. This allows us to separately quantify upper bounds for the bias term and the variance term.
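The decomposition in Lemma 4 is an instance of the generic unrolling of a linear recursion: if $w_{l+1}=A_{l}w_{l}+b_{l}$, then the final iterate equals the product of the $A$'s applied to the initial point plus a sum of the per-step terms weighted by partial products. The sketch below checks this generic identity on random matrices (all sizes illustrative); the lemma instantiates $A_{l}=\mathrm{I}-\eta\phi_{L+1-l}\phi^{\top}_{L+1-l}$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, L = 4, 6
A = rng.normal(size=(L, d, d)) * 0.2
b = rng.normal(size=(L, d))
w0 = rng.normal(size=d)

# Run the recursion w_{l+1} = A_l w_l + b_l directly.
w = w0.copy()
for l in range(L):
    w = A[l] @ w + b[l]

def tail_product(A, start):
    """A_{L-1} ... A_{start} (identity if start == L)."""
    P = np.eye(d)
    for l in range(start, len(A)):
        P = A[l] @ P
    return P

w_unrolled = tail_product(A, 0) @ w0 + sum(tail_product(A, l + 1) @ b[l] for l in range(L))
print(np.allclose(w, w_unrolled))   # True
```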

Lemma 5 (Bound on the bias term).

Let $\mathbf{x}\in\mathbb{R}^{d}$ be a non-zero vector and let $N$ be the frequency at which the target network is updated. For $\eta\in(0,1)$ and $L>1$, the following inequality holds with probability at least $1-\delta$:

\mathbb{E}\left\lVert\prod^{1}_{j=N}\Gamma_{L}\mathbf{x}\right\rVert_{\phi}^{2}\leq\exp\left(-\frac{(\eta(4-2L)-\eta^{2}+1)NL}{\kappa}\right)\sqrt{\frac{\kappa}{\delta}}\left\lVert\mathbf{x}\right\rVert_{\phi}. \qquad (11)

The $\phi$-based norm is defined in Definition 1.

Sketch of proof.

The result is obtained by first expanding the joint product $\prod_{j=N}^{1}$ over $\Gamma_{L}$ and then integrating the bound from Theorem 1. The detailed proof is presented in Appendix C.2. ∎

Regarding the bound on the variance term in Lemma 4, even though $\Gamma_{l}$ appears in the expression, it turns out that the original proof does not need to be modified, and we therefore follow the result of the original work. The exact statement is presented in Appendix C.3.

Theorem 2.

For a linear MDP, assume the reward function and the features are bounded: $R(s,a)\in[0,1]$ and $\|\phi(s,a)\|_{2}\leq 1$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Let $T$ be the maximum number of learning episodes, $N$ the frequency of the target network update, $\eta$ the learning rate, and $L$ the length of the sequence for RER as described in Algorithm 1. When $\eta\in(0,1)$ and $L\geq 1$, with sample complexity

\mathcal{O}\left(\frac{\gamma^{T/N}}{1-\gamma}+\sqrt{\frac{T\kappa}{N\delta(1-\gamma)^{4}}}\exp\left(-\frac{(\eta(4-2L)-\eta^{2}+1)NL}{\kappa}\right)+\sqrt{\frac{\eta\log(\frac{T}{N\delta})}{(1-\gamma)^{4}}}\right), \qquad (12)

$\|Q_{T}(s,a)-Q^{*}(s,a)\|_{\infty}\leq\varepsilon$ holds with probability at least $1-\delta$.

Sketch of Proof.

We first establish the independence of sub-trajectories of length $L$. We then decompose the error term of the $Q$-value using the bias-variance decomposition (as shown in Lemma 4), where the RER method and the target network help control the variance term using martingale sequences. The upper bound for the bias term is given in Lemma 5, and the upper bound for the variance term is presented in Appendix C.3. Finally, we summarize the results and provide the complete proof in Lemma 6, leading to the probabilistic bound in this theorem. ∎

Compared to the original theorem in Agarwal et al. (2022, Theorem 1), our work provides a tighter upper bound and relaxes the assumptions needed for the result to hold. This advancement narrows the gap between theoretical justification and empirical evaluation. Furthermore, we hope that the new approach of transforming the original problem into a combinatorial counting problem will inspire further research in related domains.

We acknowledge that the main structure of the convergence proof (i.e., Theorem 2) follows the original work. Our contribution lies in presenting a cleaner proof pipeline and incorporating our tighter bound as detailed in Theorem 1.

5 Conclusion

In this work, we gave a tighter finite-sample analysis for heuristics that are heavily used in practical $Q$-learning and showed that seemingly simple modifications can have far-reaching consequences in linear MDP settings. We provided a rigorous analysis that relaxes the limitation on the learning rate and the length of the consecutive tuples. Our key idea is to transform the original problem, which involves a giant summation, into a combinatorial counting problem, which greatly simplifies the analysis. Finally, we showed theoretically that RER converges faster with a larger learning rate $\eta$ and a longer consecutive sequence $L$ of state-action-reward tuples.

Acknowledgments

We thank all the reviewers for their constructive comments. We also thank Yi Gu for his feedback on the theoretical justification part of this paper. This research was supported by NSF grant CCF-1918327 and DOE – Fusion Energy Science grant: DE-SC0024583.

References

  • Agarwal et al. (2022) Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Online target q-learning with reverse experience replay: Efficiently finding the optimal policy for linear mdps. In ICLR. OpenReview.net, 2022.
  • Ambrose et al. (2016) R. Ellen Ambrose, Brad E. Pfeiffer, and David J. Foster. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron, 91(5):1124–1136, 2016. ISSN 0896-6273.
  • Andrychowicz et al. (2017) Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, pp.  5048–5058, 2017.
  • Bai et al. (2021) Chenjia Bai, Lingxiao Wang, Lei Han, Jianye Hao, Animesh Garg, Peng Liu, and Zhaoran Wang. Principled exploration via optimistic bootstrapping and backward induction. In ICML, volume 139 of Proceedings of Machine Learning Research, pp.  577–587. PMLR, 2021.
  • Bertsekas & Yu (2012) Dimitri P. Bertsekas and Huizhen Yu. Q-learning and enhanced policy iteration in discounted dynamic programming. Math. Oper. Res., 37(1):66–94, 2012.
  • Florensa et al. (2017) Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In CoRL, volume 78 of Proceedings of Machine Learning Research, pp.  482–495. PMLR, 2017.
  • Foster & Wilson (2006) David J Foster and Matthew A Wilson. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680–683, 2006.
  • Goyal et al. (2019) Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy P. Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. In ICLR. OpenReview.net, 2019.
  • Haddad (2021) Roudy El Haddad. Repeated sums and binomial coefficients. arXiv preprint arXiv:2102.12391, 2021.
  • Igata et al. (2021) Hideyoshi Igata, Yuji Ikegaya, and Takuya Sasaki. Prioritized experience replays on a hippocampal predictive map for learning. Proceedings of the National Academy of Sciences, 118(1), 2021.
  • Jain et al. (2021) Prateek Jain, Suhas S. Kowshik, Dheeraj Nagaraj, and Praneeth Netrapalli. Streaming linear system identification with reverse experience replay. In NIPS, volume 34, pp.  30140–30152. Curran Associates, Inc., 2021.
  • Kumar et al. (2020) Aviral Kumar, Abhishek Gupta, and Sergey Levine. Discor: Corrective feedback in reinforcement learning via distribution correction. In NeurIPS, volume 33, pp.  18560–18572, 2020.
  • Lahire et al. (2022) Thibault Lahire, Matthieu Geist, and Emmanuel Rachelson. Large batch experience replay. In ICML, volume 162 of Proceedings of Machine Learning Research, pp.  11790–11813. PMLR, 2022.
  • Lee et al. (2019) Su Young Lee, Sung-Ik Choi, and Sae-Young Chung. Sample-efficient deep reinforcement learning via episodic backward update. In NeurIPS, pp.  2110–2119, 2019.
  • Li et al. (2022) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. IEEE Trans. Inf. Theory, 68(1):448–473, 2022.
  • Lin (1992) Long Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn., 8:293–321, 1992.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Nagaraj et al. (2020) Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. In NeurIPS, 2020.
  • Puterman (1994) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.
  • Rotinov (2019) Egor Rotinov. Reverse experience replay. CoRR, abs/1910.08780, 2019.
  • Saglam et al. (2023) Baturay Saglam, Furkan B. Mutlu, Dogan Can Çiçek, and Suleyman S. Kozat. Actor prioritized experience replay. J. Artif. Intell. Res., 78:639–672, 2023.
  • Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR, 2016.
  • Tagorti & Scherrer (2015) Manel Tagorti and Bruno Scherrer. On the rate of convergence and error bounds for lstd (λ\lambda). In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp.  1521–1529. JMLR.org, 2015.
  • van Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pp.  2094–2100. AAAI Press, 2016.
  • Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • Watkins & Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Technical note q-learning. Mach. Learn., 8:279–292, 1992.
  • Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, 1989.
  • Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In ICML, volume 119 of Proceedings of Machine Learning Research, pp.  10978–10989. PMLR, 2020.

Appendix A Proof of Combinatorial Counting in Reverse Experience Replay

In this section, we present detailed proofs of the lemmas used for Theorem 1.

A.1 Proof of Lemma 1

Lemma.

Let $\mathbf{x}\in\mathbb{R}^{d}$ be any $d$-dimensional non-zero vector, i.e., $\mathbf{x}\neq\mathbf{0}$. For $1\leq l_{1},\ldots,l_{k}\leq L$ and $2\leq k\leq 2L$, consider one high-order term $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$ in Equation (5). We have:

\left|\mathbf{x}^{\top}\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}\mathbf{x}\right| \leq\frac{1}{2}\mathbf{x}^{\top}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)\mathbf{x}, \qquad (13)

where $|\cdot|$ denotes the absolute value.

Proof.

For $1\leq l_{1},\ldots,l_{k}\leq L$ and $2\leq k\leq 2L$, we consider one high-order term $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$ in Equation (5). By the Cauchy-Schwarz inequality (Lemma 7), we have:

\left|\phi^{\top}_{l_{j}}\phi_{l_{j+1}}\right|\leq 1,\qquad\text{for }1\leq j<k. \qquad (14)

This allows us to simplify the neighboring vector-vector inner products in the high-order term $\phi_{l_{1}}\left(\phi^{\top}_{l_{1}}\phi_{l_{2}}\right)\left(\phi^{\top}_{l_{2}}\phi_{l_{3}}\right)\cdots\left(\phi^{\top}_{l_{k-1}}\phi_{l_{k}}\right)\phi^{\top}_{l_{k}}$.

Let $\mathbf{x}\in\mathbb{R}^{d}$ be any $d$-dimensional non-zero vector, i.e., $\mathbf{x}\neq\mathbf{0}$. Based on the inequality in Equation (14), we have:

|\mathbf{x}^{\top}\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}\mathbf{x}| \leq|\mathbf{x}^{\top}\phi_{l_{1}}\phi^{\top}_{l_{k}}\mathbf{x}| \qquad \text{by Equation (14)}
\leq\frac{1}{2}\left(\left(\mathbf{x}^{\top}\phi_{l_{1}}\right)^{2}+\left(\phi^{\top}_{l_{k}}\mathbf{x}\right)^{2}\right) \qquad \text{AM-GM inequality (Lemma 8)}
=\frac{1}{2}\left(\left(\phi_{l_{1}}^{\top}\mathbf{x}\right)^{2}+\left(\phi^{\top}_{l_{k}}\mathbf{x}\right)^{2}\right)
=\frac{1}{2}\left(\mathbf{x}^{\top}\phi_{l_{1}}\phi^{\top}_{l_{1}}\mathbf{x}+\mathbf{x}^{\top}\phi_{l_{k}}\phi^{\top}_{l_{k}}\mathbf{x}\right)
=\frac{1}{2}\mathbf{x}^{\top}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)\mathbf{x}, \qquad (15)

where the third line holds because the inner product between two vectors equals its transpose. ∎

Note that the above Lemma 1 is extended from (Jain et al., 2021, Claim 4).

A.2 Proof of Lemma 2

(a) When $l_{1}$ selects the right $l$-th slot, the terms $l_{2},\ldots,l_{k}$ can only choose from the right slots with indices $l+1,\ldots,L$, due to the sequential order constraint. Thus, there are $L-l$ available slots for $l_{2},\ldots,l_{k}$.
(b) When $l_{k}$ selects the right $l$-th slot, $l_{1},\ldots,l_{k-1}$ cannot choose from the right slots with indices $l+1,\ldots,L$ because of the sequential order constraint. To avoid double counting, the left $l$-th slot is also disallowed. Therefore, there are $L+l-2$ slots available for $l_{1},\ldots,l_{k-1}$.
(c) When $l_{k}$ selects the left $l$-th slot, $l_{1},\ldots,l_{k-1}$ can only choose from the left slots with indices $L,L-1,\ldots,l+1$, due to the sequential order constraint. Similar to case 3(a), there are $L-l$ slots available for $l_{1},\ldots,l_{k-1}$.
(d) When $l_{1}$ selects the left $l$-th slot and $l_{k}$ selects the right $l$-th slot, the intermediate sequence $l_{2},\ldots,l_{k-1}$ can only select from the intermediate slots with indices $l-1,\ldots,2,1,1,2,\ldots,l-1$. There are $2l-2$ slots available for placing the $k-2$ intermediate elements.
Figure 3: Visualization of all cases for the combinatorial counting problem.
Lemma.

Based on the relaxation in Lemma 1, the weighted summation $\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}$ in Equation (5) can be expanded combinatorially as follows:

\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\frac{1}{2}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)=\underbrace{\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l=1}^{L}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)}_{\text{sum over combinatorially many terms}}\phi_{l}\phi^{\top}_{l}. \qquad (16)
Proof.

In the main paper, we outlined the steps for transforming the estimation of the large summation $\sum_{1\leq l_{1},\ldots,l_{k}\leq L}$ over many high-order terms $\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}$ into a combinatorial counting problem. Here, we present the remaining cases beyond the one sketched in Figure 2, where we focused on one way of selecting $l_{1}$ and $l_{k}$.

More specifically, the element $\phi_{l}\phi^{\top}_{l}$ arises from the following cases:

(a) $l_{1}$ selects the $l$-th slot and $l_{k}$ does not. This includes two subcases:
   (1) $l_{1}$ picks the left $l$-th slot, and the remaining $k-1$ elements (i.e., $l_{2},\ldots,l_{k}$) are selected from a subarray of size $L+l-2$. See Figure 2 for a visual example.
   (2) $l_{1}$ picks the right $l$-th slot, and the remaining $k-1$ elements are selected from a subarray of size $L-l$. See Figure 3(a) for a visual example.
(b) $l_{1}$ does not pick the $l$-th slot but $l_{k}$ does. This is symmetric to the previous case and also includes two subcases:
   (1) $l_{k}$ picks the right $l$-th slot, while the remaining $k-1$ elements (i.e., $l_{1},\ldots,l_{k-1}$) are selected from a subarray of size $L+l-2$. See Figure 3(b).
   (2) $l_{k}$ picks the left $l$-th slot, and the remaining $k-1$ elements are selected from a subarray of size $L-l$. See Figure 3(c).
(c) $l_{1}$ picks the left $l$-th slot and $l_{k}$ picks the right $l$-th slot. The intermediate $k-2$ elements (i.e., $l_{2},\ldots,l_{k-1}$) are selected from a subarray of size $2l-2$. To ensure that the number of available slots is at least the number of elements, we require $2l-2\geq k-2$, i.e., $2l\geq k$. See Figure 3(d) for a visual example.

Note that cases (a) and (b) yield the same result, while case (c) counts twice because both the first and the last elements occupy the $l$-th slots. Thus, we derive the final result:

\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\frac{1}{2}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)=\underbrace{\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l=1}^{L}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)}_{\text{sum over combinatorially many terms}}\phi_{l}\phi^{\top}_{l}. \qquad (17)

The following Lemma 3 further simplifies the summation over the many binomial coefficients in Lemma 2. The resulting expression helps us mitigate the constraint $\eta L\leq 1/3$ in the final theorem (Theorem 1).

A.3 Proof of Lemma 3

Lemma.

For $\eta\in(0,1)$ and $L>1$, we have:

\sum_{k=2}^{2L}(-\eta)^{k}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)=(1-\eta)^{L+l-2}+(1-\eta)^{L-l}+\eta^{2}(1-\eta)^{2l-2}+\eta(2L-2)-2. \qquad (18)
Proof.

We start by separating the sum into three parts:

\sum_{k=2}^{2L}\binom{L+l-2}{k-1}(-\eta)^{k}+\sum_{k=2}^{2L}\binom{L-l}{k-1}(-\eta)^{k}+\sum_{k=2}^{2L}\binom{2l-2}{k-2}(-\eta)^{k}. \qquad (20)

Then we use the binomial expansion formula $(1-x)^{n}=\sum_{k=0}^{n}\binom{n}{k}(-x)^{k}$, which converges when $|x|<1$. Setting $x=\eta$, we have:

(1-\eta)^{n}=1+\sum_{k=1}^{n}\binom{n}{k}(-\eta)^{k}=1-\eta n+\sum_{k=2}^{n}\binom{n}{k}(-\eta)^{k}. \qquad (21)

Setting $n=L+l-2$ and $x=\eta$, we obtain the result for the first sum:

\sum_{k=2}^{2L-1}\binom{L+l-2}{k}(-\eta)^{k}=\sum_{k=2}^{L+l-2}\binom{L+l-2}{k}(-\eta)^{k}=(1-\eta)^{L+l-2}+\eta(L+l-2)-1. \qquad (22)

Note that the binomial coefficient $\binom{n}{k}$ is defined to be zero when $k>n$, since one cannot choose more elements than are available. In the above equation, $\binom{L+l-2}{k-1}$ is zero for $k-1>L+l-2$, so we may restrict the summation to $k\leq L+l-2$.

Similarly, setting $n=L-l$ and $x=\eta$, we obtain the result for the second sum:

\sum_{k=2}^{2L}\binom{L-l}{k}(-\eta)^{k}=\sum_{k=2}^{L-l}\binom{L-l}{k}(-\eta)^{k}=(1-\eta)^{L-l}+\eta(L-l)-1. \qquad (23)

Similarly, $\binom{L-l}{k-1}$ is zero for $k-1>L-l$, so the summation is restricted to $k\leq L-l$.

For the third sum, we shift the index and apply the binomial theorem:

\sum_{k=2}^{2L}\binom{2l-2}{k-2}(-\eta)^{k}=\eta^{2}\sum_{k=0}^{2L-2}\binom{2l-2}{k}(-\eta)^{k}=\eta^{2}\sum_{k=0}^{2l-2}\binom{2l-2}{k}(-\eta)^{k}=\eta^{2}(1-\eta)^{2l-2}. \qquad (24)

By combining all three simplified sums, we obtain the final result:

(1-\eta)^{L+l-2}+(1-\eta)^{L-l}+\eta^{2}(1-\eta)^{2l-2}+\eta(2L-2)-2. \qquad (25)

The above result holds when $|\eta|<1$. When $\eta>1$, the series could diverge due to the unbounded growth of the terms. Since the learning rate is always positive, the requirement becomes $\eta\in(0,1)$. This completes the proof. ∎
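A quick numerical check of the binomial identities used above, Equation (21) and Equation (24), for illustrative values of $\eta$, $n$, $L$, and $l$.

```python
import numpy as np
from math import comb

eta, n = 0.3, 7

# Equation (21): (1 - eta)^n = 1 - eta*n + sum_{k=2}^{n} C(n, k) (-eta)^k.
rhs = 1 - eta * n + sum(comb(n, k) * (-eta) ** k for k in range(2, n + 1))
print(np.isclose((1 - eta) ** n, rhs))                       # True

# Equation (24): sum_{k=2}^{2L} C(2l-2, k-2) (-eta)^k = eta^2 (1 - eta)^(2l-2).
L, l = 6, 4
lhs = sum(comb(2 * l - 2, k - 2) * (-eta) ** k for k in range(2, 2 * L + 1))
print(np.isclose(lhs, eta ** 2 * (1 - eta) ** (2 * l - 2)))  # True
```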

We aim to establish upper and lower bounds for the expression in Lemma 3 that are independent of $l$; the results are provided in the following remark.

Remark 2.

For $0<l<L$ and $\eta\in(0,1)$, the following inequalities hold:

(1-\eta)^{L+l-2}+(1-\eta)^{L-l}+\eta^{2}(1-\eta)^{2l-2}>(1-\eta)^{2L-3}+(1-\eta)^{L-1}+\eta^{2}(1-\eta)^{2L-4}, \qquad (26)
(1-\eta)^{L+l-2}+(1-\eta)^{L-l}+\eta^{2}(1-\eta)^{2l-2}<(1-\eta)^{L-1}+\eta^{2}+1. \qquad (27)

Within the range $\eta\in(0,1)$, both the upper and the lower bound are positive.

Proof.

The term $(1-\eta)^{L+l-2}$ decreases as $l$ increases, since $(1-\eta)<1$ for $\eta>0$. Therefore, its maximum occurs at $l=1$ and its minimum at $l=L-1$.

Similarly, for $(1-\eta)^{L-l}$, the maximum occurs at $l=L-1$ and the minimum at $l=1$. For the term $\eta^{2}(1-\eta)^{2l-2}$, the maximum occurs at $l=1$ and the minimum at $l=L-1$.

Thus, an upper bound on the entire expression is obtained by combining the maximum values of each term, yielding $(1-\eta)^{L-1}+\eta^{2}+1$. The lower bound is given by $(1-\eta)^{2L-3}+(1-\eta)^{L-1}+\eta^{2}(1-\eta)^{2L-4}$. ∎

Appendix B Theoretical Justification of Theorem 1

Theorem.

Let $\mu$ be the stationary distribution of the state-action pairs in the MDP. The following positive semi-definite matrix inequality holds for $\eta\in(0,1)$:

\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right]\preceq\left(1-\frac{\left(\eta(4-2L)-(1-\eta)^{L-1}-\eta^{2}+1\right)L}{\kappa}\right)\mathrm{I}, \qquad (28)

where the matrix $\Gamma_{L}$ is defined in Definition 2 and "$\preceq$" denotes the positive semi-definite ordering between the matrices on both sides (see Definition 4).

Proof.

Based on Equation (5), we have:

\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right]=\mathrm{I}-2\eta\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right]+\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\phi_{l_{1}}\phi^{\top}_{l_{1}}\cdots\phi_{l_{k}}\phi^{\top}_{l_{k}}\right]. \qquad (29)

By linearity of expectation, the second term of Equation (29) can be reformulated as

-2\eta\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right]=-2\eta\sum_{l=1}^{L}\mathbb{E}_{(s,a)\sim\mu}\left(\phi_{l}\phi^{\top}_{l}\right)=-2\eta L\,\mathbb{E}_{(s,a)\sim\mu}\left(\phi\phi^{\top}\right)\preceq\frac{-2\eta L}{\kappa}\mathrm{I}. \qquad (30)

The last step is obtained by Remark 1.

Based on the results in Lemma 2, Lemma 3, and Remark 2, the third term of Equation (29) can be upper-bounded as:

\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l_{1},\ldots,l_{k}}\frac{1}{2}\left(\phi_{l_{1}}\phi^{\top}_{l_{1}}+\phi_{l_{k}}\phi^{\top}_{l_{k}}\right)\right] \qquad (31)
=\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{k=2}^{2L}(-\eta)^{k}\sum_{l=1}^{L}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)\phi_{l}\phi^{\top}_{l}\right] \quad\text{by Lemma 2} \qquad (32)
=\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\sum_{k=2}^{2L}\left(\binom{L+l-2}{k-1}+\binom{L-l}{k-1}+\binom{2l-2}{k-2}\right)(-\eta)^{k}\phi_{l}\phi^{\top}_{l}\right] \quad\text{swap the two sums} \qquad (33)
=\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\left((1-\eta)^{L+l-2}+(1-\eta)^{L-l}+\eta^{2}(1-\eta)^{2l-2}+\eta(2L-2)-2\right)\phi_{l}\phi^{\top}_{l}\right] \quad\text{by Lemma 3} \qquad (34)
\preceq\left((1-\eta)^{L-1}+\eta^{2}+\eta(2L-2)-1\right)\mathbb{E}_{(s,a)\sim\mu}\left[\sum_{l=1}^{L}\phi_{l}\phi^{\top}_{l}\right] \quad\text{by Remark 2} \qquad (35)
\preceq\frac{(1-\eta)^{L-1}L+\eta^{2}L+\eta(2L-2)L-L}{\kappa}\mathrm{I}. \qquad (36)

Combining the results in the above two inequalities, we finally have the upper bound:

\mathbb{E}_{(s,a)\sim\mu}\left[\Gamma_{L}^{\top}\Gamma_{L}\right]\preceq\left(1-\frac{\left(\eta(4-2L)-(1-\eta)^{L-1}-\eta^{2}+1\right)L}{\kappa}\right)\mathrm{I}. \qquad (37)

To the best of the authors' knowledge, Theorem 1 and its theoretical justification are novel and have not been used in any prior analysis related to experience replay.

Appendix C Sample Complexity Proof of Theorem 2

We acknowledge that the main structure of the convergence proof follows the original work (Agarwal et al., 2022). Our contribution here is to present a cleaner proof pipeline and to integrate our tighter bound from Theorem 1.

C.1 Proof of Bias-Variance Decomposition for the Error in Lemma 4

Lemma.

Define the error term for a parameter ww as the difference between the empirical estimate and the true MDP: εi(w)Q(si,ai)Q(si,ai)\varepsilon_{i}(w)\coloneqq{Q}(s_{i},a_{i})-Q^{*}(s_{i},a_{i}). For the current iteration tt, the difference between the current estimated parameter ww and the optimal parameter ww^{*}, accumulated along the length-LL sub-trajectory with reverse updates, is:

wLw=ΓL(w1w)Bias term+ηl=1LεlΓl1ϕlvariance term.\displaystyle w_{L}-w^{*}=\underbrace{\Gamma_{L}\left(w_{1}-w^{*}\right)}_{\text{Bias term}}+\underbrace{\eta\sum_{l=1}^{L}\varepsilon_{l}\Gamma_{l-1}\phi_{l}}_{\text{variance term}}. (38)
Proof.

As shown in Lines 5-7 of Algorithm 1, we use a sampled sub-trajectory of length LL to execute the QQ-update in reverse order at iteration tt: for l=1,2,,Ll=1,2,\ldots,L,

wl+1\displaystyle w_{l+1} =wl+η(rL+1l+γmaxa𝒜w1,ϕ(sL+2l,a)wl,ϕL+1l)ϕL+1l\displaystyle=w_{l}+\eta\left(r_{L+1-l}+\gamma\max_{a^{\prime}\in\mathcal{A}}\langle w_{1},\phi(s_{L+2-l},a^{\prime})\rangle-\langle w_{l},\phi_{L+1-l}\rangle\right)\phi_{L+1-l} (39)
=(IηϕL+1lϕL+1l)wl+η(rL+1l+γsupa𝒜w1,ϕ(sL+2l,a))ϕL+1l\displaystyle=\left(\mathrm{I}-\eta\phi_{L+1-l}\phi^{\top}_{L+1-l}\right)w_{l}+\eta\left(r_{L+1-l}+\gamma\sup_{a^{\prime}\in\mathcal{A}}\langle w_{1},\phi(s_{L+2-l},a^{\prime})\rangle\right)\phi_{L+1-l} (40)

where I\mathrm{I} denotes the identity matrix and w1w_{1} is the parameter of the target network. The second equality is obtained by rearranging the terms of the first. The max\max operator is replaced by the sup\sup operator for rigor, and ,\langle\cdot,\cdot\rangle denotes the inner product between two vectors of the same dimension.
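For concreteness, the following Python sketch mirrors the reverse update of Equations (39)-(40) under linear function approximation. The names (`traj`, `phi`, `actions`, `w_target`, `eta`, `gamma`) are illustrative placeholders introduced here, not code from the paper; the target parameter plays the role of w1w_{1}, which both initializes the pass and supplies the bootstrap value.

```python
import numpy as np

def reverse_q_update(traj, phi, actions, w_target, eta, gamma):
    """One RER pass (a sketch of Equations (39)-(40)): walk a length-L
    sub-trajectory in reverse and apply the linear Q-update, bootstrapping
    from the frozen target parameter w_target (w_1 in the text).

    traj: list of (s, a, r, s_next) transitions of length L.
    phi:  feature map phi(s, a) -> np.ndarray of shape (d,).
    """
    w = w_target.copy()                       # w_1 also initializes the pass
    for (s, a, r, s_next) in reversed(traj):  # reverse order: l = 1, ..., L
        target = r + gamma * max(w_target @ phi(s_next, a2) for a2 in actions)
        td_err = target - w @ phi(s, a)       # TD error with the current w_l
        w = w + eta * td_err * phi(s, a)      # w_{l+1} = w_l + eta * td_err * phi
    return w
```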

Let Q(si,ai)Q^{*}(s_{i},a_{i}) be the optimal QQ-value at state sis_{i} when taking action aia_{i}, and let ww^{*} be the optimal parameter. The Bellman optimality equation is written as:

Q(si,ai)=Ri+γ𝔼sP(|si,ai)supa𝒜w,ϕ(s,a)Q^{*}(s_{i},a_{i})=R_{i}+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s_{i},a_{i})}\sup_{a^{\prime}\in\mathcal{A}}\langle w^{*},\phi(s^{\prime},a^{\prime})\rangle

Define the error term εi(w)\varepsilon_{i}(w) for parameter ww and the ii-th tuple (si,ai,ri)(s_{i},a_{i},r_{i}) as the difference between the empirical estimate and the true probabilistic MDP:

εi(w)\displaystyle\varepsilon_{i}(w) Q(si,ai)Q(si,ai)\displaystyle\coloneqq{Q}(s_{i},a_{i})-Q^{*}(s_{i},a_{i}) (41)
=(ri+γsupa𝒜w,ϕ(si+1,a))(Ri+γ𝔼sP(|si,ai)supa𝒜w,ϕ(s,a))\displaystyle=\left({r}_{i}+\gamma{\sup_{a^{\prime}\in\mathcal{A}}\langle w,\phi(s_{i+1}},a^{\prime})\rangle\right)-\left(R_{i}+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s_{i},a_{i})}\sup_{a^{\prime}\in\mathcal{A}}\langle w,\phi(s^{\prime},a^{\prime})\rangle\right) (42)
=(riRi)γ(supa𝒜w,ϕ(si+1,a)𝔼sP(|si,ai)supa𝒜w,ϕ(s,a))\displaystyle=\left({r}_{i}-R_{i}\right)-\gamma\left({\sup_{a^{\prime}\in\mathcal{A}}\langle w,\phi(s_{i+1}},a^{\prime})\rangle-\mathbb{E}_{s^{\prime}\sim P(\cdot|s_{i},a_{i})}\sup_{a^{\prime}\in\mathcal{A}}\langle w,\phi(s^{\prime},a^{\prime})\rangle\right) (43)

For all l=1,,Ll=1,\ldots,L, apply the Bellman optimality equation with the optimal parameter ww^{*} to the transitions used by the RER update:

w,ϕL+1l=RL+1l+γ𝔼sP(|sL+1l,aL+1l)supa𝒜w1,ϕ(s,a)\displaystyle\langle w^{*},\phi_{L+1-l}\rangle=R_{L+1-l}+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s_{L+1-l},a_{L+1-l})}\sup_{a^{\prime}\in\mathcal{A}}\langle w_{1},\phi(s^{\prime},a^{\prime})\rangle (44)

Multiplying both sides of the above equation by ηϕL+1l\eta\phi_{L+1-l} and combining with Equation (40), we get

wl+1w=(IηϕL+1lϕL+1l)(wlw)+ηεL+1lϕL+1l\displaystyle w_{l+1}-w^{*}=\left(\mathrm{I}-\eta\phi_{L+1-l}\phi^{\top}_{L+1-l}\right)\left(w_{l}-w^{*}\right)+\eta\varepsilon_{L+1-l}\phi_{L+1-l} (45)

where the error term εL+1l\varepsilon_{L+1-l} is defined in Equation (43). Combining this with Definition 2 and recursively expanding the RHS, we obtain the difference from the optimal parameter after LL consecutive reverse updates:

wL+1w\displaystyle w_{L+1}-w^{*} =ΓL(w1w)+ηl=1LεlΓl1ϕl\displaystyle=\Gamma_{L}\left(w_{1}-w^{*}\right)+\eta\sum_{l=1}^{L}\varepsilon_{l}\Gamma_{l-1}\phi_{l} (46)

The proof is thus finished. ∎
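The "recursively expand the RHS" step can be checked mechanically. The short Python sketch below verifies the generic unrolling xl+1=Alxl+blxL+1=(lAl)x1+l(j>lAj)blx_{l+1}=A_{l}x_{l}+b_{l}\Rightarrow x_{L+1}=(\prod_{l}A_{l})x_{1}+\sum_{l}(\prod_{j>l}A_{j})b_{l} with random factors of the form Iηϕϕ\mathrm{I}-\eta\phi\phi^{\top}; the matrices and vectors are synthetic stand-ins, not quantities from the MDP.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, eta = 4, 6, 0.1

# random factors A_l = I - eta * phi_l phi_l^T and offsets b_l (synthetic stand-ins)
A, b = [], []
for _ in range(L):
    p = rng.normal(size=d)
    p /= np.linalg.norm(p)
    A.append(np.eye(d) - eta * np.outer(p, p))
    b.append(rng.normal(size=d))

x1 = rng.normal(size=d)

# direct recursion x_{l+1} = A_l x_l + b_l
x = x1.copy()
for A_l, b_l in zip(A, b):
    x = A_l @ x + b_l

# unrolled form: (A_L ... A_1) x_1 + sum_l (A_L ... A_{l+1}) b_l
def chain(mats):
    out = np.eye(d)
    for M in mats:        # out = M_k ... M_1 after the loop
        out = M @ out
    return out

x_unrolled = chain(A) @ x1 + sum(chain(A[l + 1:]) @ b[l] for l in range(L))
assert np.allclose(x, x_unrolled)
```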

Remark 3.

Suppose we execute Algorithm 1 for NN iterations; then we get:

\displaystyle w_{N,L}-w^{*}=\underbrace{\prod_{t=N}^{1}\Gamma^{t}_{L}\left(w_{1}-w^{*}\right)}_{\text{Bias term}}+\underbrace{\sum_{i=1}^{N}\prod_{t=N}^{i+1}\Gamma^{t}_{L}H^{i}}_{\text{Variance term}} (47)

where Hj:=ηi=1LεiΓi1jϕijH^{j}:=\eta\sum_{i=1}^{L}\varepsilon_{i}\Gamma^{j}_{i-1}\phi_{i}^{j}. The first term on the RHS is the bias, which decays geometrically with NN, and the second term on the RHS is the variance accumulated along the sub-trajectories of length LL, which we will later show has zero mean.

Note that the above Remark 3 is analogous to (Agarwal et al., 2022, Lemma 5).

C.2 Bound the Bias Term in Remark 3

We now quantify the upper bound of the bias term t=1NΓL{\prod_{t=1}^{N}\Gamma_{L}} from Remark 3. In comparison to (Agarwal et al., 2022, Lemma 8), the following Lemma 5 requires only η(0,1)\eta\in(0,1) instead of ηL<1/3\eta L<1/3, which is a relaxation of the condition in the original work.

Lemma.

Let 𝐱d\mathbf{x}\in\mathbb{R}^{d} be a non-zero vector and let NN be the number of iterations between two consecutive updates of the target network. For η(0,1),L\eta\in(0,1),L\in\mathbb{N} with L>1L>1, the following inequality holds:

𝔼j=N1ΓL𝐱2\displaystyle\mathbb{E}\left\lVert\prod^{1}_{j=N}\Gamma_{L}\mathbf{x}\right\rVert^{2} exp(N(η(42L)L+Lη2L)κ)𝐱2.\displaystyle\leq\exp\left(-\frac{N(\eta(4-2L)L+L-\eta^{2}L)}{\kappa}\right)\left\lVert\mathbf{x}\right\rVert^{2}. (48)

Furthermore, the following inequality holds with probability at least 1δ1-\delta:

\displaystyle\left\lVert\prod^{1}_{j=N}\Gamma_{L}\mathbf{x}\right\rVert_{\phi}\leq\exp\left(-\frac{N(\eta(4-2L)L+L-\eta^{2}L)}{\kappa}\right)\sqrt{\tfrac{\kappa}{\delta}}\left\lVert\mathbf{x}\right\rVert_{\phi}. (49)

The ϕ\phi-based norm is defined in Definition 1.

Proof.

In Theorem 1, the term (1η)L1(1-\eta)^{L-1} decays to zero exponentially in LL, so we omit it, incurring only an exponentially small error. Thus, we obtain the simplified form

𝔼(s,a)μ(ΓLΓL)\displaystyle\mathbb{E}_{(s,a)\sim\mu}\left(\Gamma_{L}^{\top}\Gamma_{L}\right) (1η(42L)L+L(1η)L1Lη2Lκ)I\displaystyle\preceq\left(1-\frac{\eta(4-2L)L+L-(1-\eta)^{L-1}L-\eta^{2}L}{\kappa}\right)\mathrm{I} (50)
(1η(42L)L+Lη2Lκ)I.\displaystyle\preceq\left(1-\frac{\eta(4-2L)L+L-\eta^{2}L}{\kappa}\right)\mathrm{I}. (51)

Now, we observe that:

j=N1ΓL𝐱2=𝐱j=1N1(ΓLj)((ΓLN)ΓLN)is independent of the restj=N11ΓLj𝐱\left\lVert\prod_{j=N}^{1}\Gamma_{L}\mathbf{x}\right\rVert^{2}=\mathbf{x}^{\top}\prod_{j=1}^{N-1}(\Gamma^{j}_{L})^{\top}\underbrace{\left(\left(\Gamma^{N}_{L}\right)^{\top}\Gamma^{N}_{L}\right)}_{\text{is independent of the rest}}\prod_{j=N-1}^{1}\Gamma^{j}_{L}\mathbf{x} (52)

Therefore, we take the expectation conditioned on ΓLj\Gamma^{j}_{L} for jN1j\leq N-1 in Equation (52):

𝔼j=N1ΓLj𝐱2(1η(42L)L+Lη2Lκ)𝔼j=N11ΓLj𝐱2\displaystyle\mathbb{E}\left\lVert\prod_{j=N}^{1}\Gamma^{j}_{L}\mathbf{x}\right\rVert^{2}\leq\left(1-\frac{\eta(4-2L)L+L-\eta^{2}L}{\kappa}\right)\mathbb{E}\left\lVert\prod_{j=N-1}^{1}\Gamma^{j}_{L}\mathbf{x}\right\rVert^{2} using Equation (51) (53)

Applying the inequality above inductively, we conclude the result:

𝔼j=N1ΓLj𝐱2\displaystyle\mathbb{E}\left\lVert\prod_{j=N}^{1}\Gamma^{j}_{L}\mathbf{x}\right\rVert^{2} (1η(42L)L+Lη2Lκ)N𝐱2\displaystyle\leq\left(1-\frac{\eta(4-2L)L+L-\eta^{2}L}{\kappa}\right)^{N}\left\lVert\mathbf{x}\right\rVert^{2} (54)
\displaystyle\leq\exp\left(-\frac{N(\eta(4-2L)L+L-\eta^{2}L)}{\kappa}\right)\left\lVert\mathbf{x}\right\rVert^{2}. (55)

The last step follows from the elementary inequality 1xexp(x)1-x\leq\exp(-x), which holds for all xx\in\mathbb{R}.

We then extend this to the ϕ\phi-based norm (Definition 1) as follows:

j=N1ΓL𝐱ϕκj=N1ΓLj𝐱\displaystyle\left\lVert\prod_{j=N}^{1}\Gamma_{L}\mathbf{x}\right\rVert_{\phi}\leq\sqrt{\kappa}\left\lVert\prod_{j=N}^{1}\Gamma^{j}_{L}\mathbf{x}\right\rVert exp(N(η(42L)L+Lη2L)κ)κ𝐱\displaystyle\leq{\exp\left(-\frac{N(\eta(4-2L)L+L-\eta^{2}L)}{\kappa}\right)}\sqrt{\kappa}\left\lVert\mathbf{x}\right\rVert (56)

Thus, by Markov’s inequality, with probability at least 1δ1-\delta, the following event holds:

j=N1ΓL𝐱ϕexp(N(η(42L)L+Lη2L)κ)κδ𝐱ϕ\left\lVert\prod_{j=N}^{1}\Gamma_{L}\mathbf{x}\right\rVert_{\phi}\leq\exp\left(-\frac{N(\eta(4-2L)L+L-\eta^{2}L)}{\kappa}\right)\sqrt{\tfrac{\kappa}{\delta}}\left\lVert\mathbf{x}\right\rVert_{\phi} (57)
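As a sanity check on the bias-decay mechanism (not a verification of the constants above), the Python sketch below builds products of factors Iηϕϕ\mathrm{I}-\eta\phi\phi^{\top} from synthetic unit-norm features drawn i.i.d., an assumption made purely for illustration since the paper's features come from the Markov chain under μ\mu, and confirms that applying NN such blocks never expands a vector, which is the qualitative property driving the geometric decay of the bias term.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L, N, eta = 4, 5, 20, 0.5

def sample_gamma_L():
    """One block Gamma_L: a product of (I - eta * phi phi^T) factors over a
    length-L sub-trajectory, with synthetic i.i.d. unit-norm features used
    purely as stand-ins for phi(s_l, a_l)."""
    G = np.eye(d)
    for _ in range(L):
        phi = rng.normal(size=d)
        phi /= np.linalg.norm(phi)
        G = (np.eye(d) - eta * np.outer(phi, phi)) @ G
    return G

x = rng.normal(size=d)
v = x.copy()
for _ in range(N):              # apply N independent blocks, as in prod_{j=N}^{1} Gamma_L^j x
    v = sample_gamma_L() @ v

# Each factor I - eta*phi*phi^T has eigenvalues in {1 - eta, 1}, hence operator
# norm at most 1, so the product can only shrink x.
assert np.linalg.norm(v) <= np.linalg.norm(x) + 1e-10
print("norm ratio after N blocks:", np.linalg.norm(v) / np.linalg.norm(x))
```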

C.3 Bound the Variance Term in Remark 3

Even though the term Γl\Gamma_{l} is involved in the expression, it turns out we do not need to modify the original proof, and thus we directly use its result. Please see (Agarwal et al., 2022, Appendix L.3) for the original proof steps.

Theorem (Agarwal et al. (2022) Theorem 4).

Suppose 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d} are fixed. There exists a universal constant CC such that the following event holds with probability at least 1δ1-\delta:

\left|\left\langle\mathbf{x},\sum_{t=1}^{N}\prod_{j=N}^{t+1}\Gamma^{j}_{L}H^{t}(\mathbf{w})\right\rangle\right|\leq C\left\lVert\mathbf{x}\right\rVert\left(1+\left\lVert\mathbf{w}\right\rVert_{\phi}\right)\sqrt{\eta\log\left(\frac{2}{\delta}\right)} (58)

By a direct application of (Vershynin, 2018, Theorem 8.1.6), we derive the following corollary. Notice that CΦC_{\Phi} is the covering number defined in Definition 3.

Corollary 1.

For fixed parameters 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}, there exists a constant CC such that the following inequality holds with probability at least 1δ1-\delta:

ηi=1Nt=Ni+1l=1Lεl(𝐰)Γl1ϕlϕC(1+𝐰ϕ)(CΦ+log(2δ))\left\lVert\eta\sum_{i=1}^{N}\prod_{t=N}^{i+1}\sum_{l=1}^{L}\varepsilon_{l}(\mathbf{w})\Gamma_{l-1}\phi_{l}\right\rVert_{\phi}\leq C(1+\left\lVert\mathbf{w}\right\rVert_{\phi})\left(C_{\Phi}+\sqrt{\log\left(\frac{2}{\delta}\right)}\right) (59)

The notation ϕ\|\cdot\|_{\phi} is defined in Definition 1. Here CΦC_{\Phi} is the covering number with respect to the L2L_{2} norm in d\mathbb{R}^{d}; see Theorem 3 for details.

C.4 Overall Sample Complexity Analysis

Lemma 6.

There exist constants CΦ,C1,C2C_{\Phi},C_{1},C_{2} such that:

  1. C1κlog(TκNδ(1γ))<NC_{1}\kappa\log\left(\frac{T\kappa}{N\delta(1-\gamma)}\right)<N;

  2. ηC2(1γ)2CΦ2+log(TNδ)\eta\leq C_{2}\frac{(1-\gamma)^{2}}{C^{2}_{\Phi}+\log\left(\frac{T}{N\delta}\right)}.

Let 1kTN1\leq k\leq\frac{T}{N}, where kk is an index for the target network and T/NT/N is the total number of target network updates. The following holds with probability at least 1δ1-\delta:

  1. For the target network, θk41γ\|\theta_{k}\|\leq\frac{4}{1-\gamma};

  2. For the error accumulated along LL consecutive steps of reverse updates,

    wk,Lwϕ\displaystyle\|w_{k,L}-w^{*}\|_{\phi}\leq 25TκNδ(1γ)2exp(N((η(42L)L+Lη2L))κ)\displaystyle\sqrt{\frac{25T\kappa}{N\delta(1-\gamma)^{2}}}\exp\left(-\frac{N((\eta(4-2L)L+L-\eta^{2}L))}{\kappa}\right)
    +η(CΦ2+log(TNδ))(1γ)2\displaystyle+\sqrt{\frac{\eta\left(C^{2}_{\Phi}+\log\left(\frac{T}{N\delta}\right)\right)}{(1-\gamma)^{2}}}

When we combine the above two cases, we have:

QTQ\displaystyle\|Q^{T}-Q^{*}\|_{\infty}\leq 𝒪(γT/N1γ)\displaystyle\mathcal{O}\left(\frac{\gamma^{T/N}}{1-\gamma}\right)
+𝒪(TκNδ(1γ)4exp(N(η(42L)L+Lη2L)κ))\displaystyle+\mathcal{O}\left(\sqrt{\frac{T\kappa}{N\delta(1-\gamma)^{4}}}\exp\left(-\frac{N(\eta(4-2L)L+L-\eta^{2}L)}{\kappa}\right)\right)
+𝒪(η(CΦ2+log(TNδ))(1γ)4)\displaystyle+\mathcal{O}\left(\sqrt{\frac{\eta\left(C^{2}_{\Phi}+\log\left(\frac{T}{N\delta}\right)\right)}{(1-\gamma)^{4}}}\right)

We can obtain the sample complexity of the whole learning framework by setting the RHS to ε\varepsilon, which recovers the result stated in Theorem 2.

Appendix D Extra Notations and Definitions

Definition 3 (ε\varepsilon-Net, Covering Number and Metric Entropy).

Given the metric space (d,2)(\mathbb{R}^{d},\|\cdot\|_{2}), consider a region 𝒦d\mathcal{K}\subset\mathbb{R}^{d}. 1) A subset 𝒩𝒦\mathcal{N}\subseteq\mathcal{K} is called an ε\varepsilon-net of 𝒦\mathcal{K} (for ε>0\varepsilon>0) if every point x𝒦x\in\mathcal{K} is within distance ε\varepsilon of some point of 𝒩\mathcal{N}:

x𝒩,xx2ε, for all x𝒦\displaystyle\exists x^{\prime}\in\mathcal{N},\|x-x^{\prime}\|_{2}\leq\varepsilon,\text{ for all }x\in\mathcal{K}

2) The covering number 𝒩(𝒦,ε)\mathcal{N}(\mathcal{K},\varepsilon) is the smallest number of closed balls with centers in 𝒦\mathcal{K} and radius ε\varepsilon whose union covers 𝒦\mathcal{K}. 3) The metric entropy is the logarithm of the covering number, log2𝒩(𝒦,ε)\log_{2}\mathcal{N}(\mathcal{K},\varepsilon).
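To make Definition 3 concrete, the following Python sketch greedily builds an ε\varepsilon-net for a finite cloud of points in [1,1]2[-1,1]^{2} (a toy stand-in for 𝒦\mathcal{K}) and reports its size, which upper-bounds the covering number 𝒩(𝒦,ε)\mathcal{N}(\mathcal{K},\varepsilon); the construction is illustrative only.

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedily pick centers so that every point lies within eps of some
    center; the number of centers upper-bounds the covering number N(K, eps)."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) > eps for c in centers):
            centers.append(x)
    return centers

rng = np.random.default_rng(0)
K = rng.uniform(-1, 1, size=(500, 2))        # toy region: random points in [-1, 1]^2
for eps in (0.5, 0.25, 0.1):
    print(eps, len(greedy_eps_net(K, eps)))  # net size grows as eps shrinks
```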

Theorem 3 (Dudley’s Integral Inequality).

Let {𝐱td}t=1N𝒦\{\mathbf{x}_{t}\in\mathbb{R}^{d}\}_{t=1}^{N}\in\mathcal{K} be a zero-mean random process on the metric space (d,2)(\mathbb{R}^{d},\|\cdot\|_{2}) with sub-gaussian increments with parameter KK. Then

\displaystyle\mathbb{E}\sup_{\mathbf{x}_{t}\in\mathcal{K}}x_{t}\leq CK\int_{0}^{\infty}\sqrt{\log_{2}\mathcal{N}(\mathcal{K},\varepsilon)}\,\mathit{d}\varepsilon
Lemma 7 (Cauchy-Schwarz Inequality).

By Assumption 2, the feature vector of every state-action pair is bounded: ϕ(s,a)ϕ(s,a)1\phi(s,a)^{\top}\phi(s,a)\leq 1 for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. For 1l,lL1\leq l,l^{\prime}\leq L, we have:

|ϕlϕl|2ϕlϕlϕlϕl1.\displaystyle|\phi_{l}^{\top}\phi_{l^{\prime}}|^{2}\leq\phi_{l}^{\top}\phi_{l}\cdot\phi_{l^{\prime}}^{\top}\phi_{l^{\prime}}\leq 1. (60)
Definition 4 (Positive Semi-Definite Property of Matrix).

For two symmetric matrices A,Bd×dA,B\in\mathbb{R}^{d\times d}, we say ABA\preceq B if BAB-A is positive semi-definite, i.e., 𝐱(BA)𝐱0\mathbf{x}^{\top}(B-A)\mathbf{x}\geq 0 for every non-zero vector 𝐱d\mathbf{x}\in\mathbb{R}^{d}.

Lemma 8 (AM–GM Inequality).

The inequality of arithmetic and geometric means, or more briefly the AM–GM inequality, states that the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list. For x,yx,y\in\mathbb{R}, |xy|12(x2+y2)|xy|\leq\frac{1}{2}(x^{2}+y^{2}).

Lemma 9 (Recursive Formula).

Let nn\in\mathbb{N} be a positive integer. (nk)\binom{n}{k} denotes a binomial coefficient, which is computed as n!k!(nk)!\frac{n!}{k!(n-k)!}. Then for all 1kn11\leq k\leq n-1, we have:

(n1k)+(n1k1)=(nk)\binom{n-1}{k}+\binom{n-1}{k-1}=\binom{n}{k}
Lemma 10 (Rising Sum of Binomial Coefficients).

Let m,nm,n\in\mathbb{N} be positive integers. Then:

j=0m(n+jn)=(n+m+1n+1)=(n+m+1m)\sum_{j=0}^{m}\binom{n+j}{n}=\binom{n+m+1}{n+1}=\binom{n+m+1}{m}
Lemma 11 (Vandermonde identity Haddad (2021)).

For any k,q,nk,q,n\in\mathbb{N} such that nqn\geq q, we have:

i=qn(ik)=(n+1k+1)(qk+1)\sum_{i=q}^{n}\binom{i}{k}=\binom{n+1}{k+1}-\binom{q}{k+1}
Lemma 12.

For any k,q,nk,q,n\in\mathbb{N} such that nqn\geq q, we have:

An=i=qn(2ik)Bn=i=qn(2i1k)A_{n}=\sum_{i=q}^{n}\binom{2i}{k}\qquad B_{n}=\sum_{i=q}^{n}\binom{2i-1}{k}

Both An+BnA_{n}+B_{n} and AnBnA_{n}-B_{n} admit closed forms. In particular, merging the two sums and applying Lemma 11 gives A_{n}+B_{n}=\sum_{j=2q-1}^{2n}\binom{j}{k}=\binom{2n+1}{k+1}-\binom{2q-1}{k+1}, while applying Lemma 9 term by term gives A_{n}-B_{n}=\sum_{i=q}^{n}\binom{2i-1}{k-1}, which can be evaluated in the same fashion.
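The identities in Lemmas 9-12 are easy to confirm numerically; the Python snippet below checks them for one choice of (n,q,k,m)(n,q,k,m), including the closed form of An+BnA_{n}+B_{n} obtained by merging the two sums and applying Lemma 11.

```python
from math import comb

n, q, k, m = 12, 3, 4, 7

# Lemma 9 (Pascal's rule)
assert all(comb(n - 1, j) + comb(n - 1, j - 1) == comb(n, j) for j in range(1, n))

# Lemma 10 (rising sum of binomial coefficients)
assert sum(comb(n + j, n) for j in range(m + 1)) == comb(n + m + 1, n + 1) == comb(n + m + 1, m)

# Lemma 11 (column-sum identity)
assert sum(comb(i, k) for i in range(q, n + 1)) == comb(n + 1, k + 1) - comb(q, k + 1)

# Lemma 12: A_n + B_n collapses to a single column sum over j = 2q-1, ..., 2n,
# and A_n - B_n reduces via Pascal's rule to a sum of binom(2i-1, k-1).
A = sum(comb(2 * i, k) for i in range(q, n + 1))
B = sum(comb(2 * i - 1, k) for i in range(q, n + 1))
assert A + B == sum(comb(j, k) for j in range(2 * q - 1, 2 * n + 1)) \
             == comb(2 * n + 1, k + 1) - comb(2 * q - 1, k + 1)
assert A - B == sum(comb(2 * i - 1, k - 1) for i in range(q, n + 1))
print("Lemmas 9-12 verified on this instance")
```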

Lemma 13 (Linearity of Expectation).

For random variables X1,X2,,XnX_{1},X_{2},\ldots,X_{n} and constants c1,c2,,cnc_{1},c_{2},\ldots,c_{n}, we have:

\mathbb{E}\sum_{i=1}^{n}c_{i}X_{i}=\sum_{i=1}^{n}c_{i}\mathbb{E}X_{i}