A Tighter Convergence Proof of Reverse Experience Replay
Abstract
In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update its parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis of RER holds only for a very small learning rate and short consecutive sequences, under which RER converges more slowly than algorithms without RER that use large learning rates. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitations on the learning rate and on the length of the consecutive sequence. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.
1 Introduction
Reinforcement Learning (RL) is highly successful for a variety of practical problems in the realm of long-term decision-making. Experience Replay (ER) of historical trajectories plays a vital role in RL algorithms (Lin, 1992; Mnih et al., 2015). A trajectory is a sequence of transitions, where each transition is a state-action-reward tuple. The memory used to store these experienced trajectories is referred to as the replay buffer. The strategy used to sample transitions from the replay buffer determines the rate and stability of convergence of the learning algorithm.
Reverse Experience Replay (RER) (Florensa et al., 2017; Rotinov, 2019; Lee et al., 2019; Agarwal et al., 2022) is a recently proposed approach inspired by the hippocampal reverse replay mechanism in human and animal neurons (Foster & Wilson, 2006; Ambrose et al., 2016; Igata et al., 2021). Theoretical analysis shows that RER improves the convergence rate towards the optimal policy in comparison with ER-based algorithms. Unlike ER, which samples transitions from the replay buffer uniformly (van Hasselt et al., 2016) (known as classic experience replay) or with non-uniform weights (Schaul et al., 2016) (known as prioritized experience replay), RER samples consecutive sequences of transitions from the buffer and feeds them into the learning algorithm in reverse order.
However, the most recent theoretical analysis of RER with $Q$-learning only holds for a very small learning rate and short consecutive sequences (Agarwal et al., 2022), under which it converges more slowly than the classic $Q$-learning algorithm (together with ER) run with a large learning rate. We attempt to bridge this gap between theory and practice for the newly proposed reverse experience replay algorithm.
In this paper, we provide a tighter analysis that relaxes the limitation on the learning rate and the length of the consecutive transitions. Our key idea is to transform the original problem involving a giant summation (shown in Equation 3.1) into a combinatorial counting problem (shown in Lemma 2), which greatly simplifies the whole problem. We hope the new idea of transforming the original problem into a combinatorial counting problem can enlighten other relevant domains. Furthermore, we show in Theorem 2 that RER converges faster with a larger learning rate and a longer consecutive sequence of state-action-reward tuples.
2 Preliminaries
Markov Decision Process
We consider a Markov decision process (MDP) with discounted rewards, denoted as $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. Here $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, and $\gamma \in (0, 1)$ is the discount factor. We use $P$ to denote the transition probability kernel of the MDP. For each pair $(s, a) \in \mathcal{S} \times \mathcal{A}$, $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ when action $a$ is executed. The reward function is $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, such that $r(s, a)$ is the immediate reward received in state $s$ when action $a$ is executed (Puterman, 1994). A policy $\pi$ is a mapping from states to distributions over the set of actions: $a_t \sim \pi(\cdot \mid s_t)$ for $t \geq 0$. A trajectory is denoted as $\{(s_t, a_t, r_t)\}_{t \geq 0}$, where $s_t$ (respectively $a_t$) is the state (respectively the action taken) at time $t$, $r_t$ is the reward received at time $t$, and $(s_t, a_t, r_t, s_{t+1})$ is the $t$-th transition.
Value Function and -Function
The value function of a policy $\pi$ is denoted as $V^{\pi}$. For $s \in \mathcal{S}$, $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s\right]$, which is the expected discounted cumulative reward received when 1) the initial state is $s$, 2) the actions are taken based on the policy $\pi$, i.e., $a_t \sim \pi(\cdot \mid s_t)$ for $t \geq 0$, and 3) the trajectory is generated by the transition kernel, i.e., $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for all $t \geq 0$. Similarly, let $Q^{\pi}$ be the action-value function (also known as the $Q$-function) of a policy $\pi$. For $(s, a) \in \mathcal{S} \times \mathcal{A}$, it is defined as $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s, a_0 = a\right]$.
There exists an optimal policy, denoted as $\pi^{*}$, that maximizes $Q^{\pi}(s, a)$ uniformly over all state-action pairs (Watkins, 1989). We denote $Q^{*}$ as the $Q$-function corresponding to $\pi^{*}$, i.e., $Q^{*} = Q^{\pi^{*}}$. The Bellman operator $\mathcal{T}$ on a $Q$-function is defined as: for $(s, a) \in \mathcal{S} \times \mathcal{A}$, $(\mathcal{T} Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\max_{a' \in \mathcal{A}} Q(s', a')\right]$.
The optimal $Q$-function $Q^{*}$ is the unique fixed point of the Bellman operator (Bertsekas & Yu, 2012).
$Q$-learning
The $Q$-learning algorithm is a model-free algorithm to learn $Q^{*}$ (Watkins & Dayan, 1992). The high-level idea is to find the fixed point of the Bellman operator. Given the trajectory generated by some underlying behavior policy, the asynchronous $Q$-learning algorithm estimates a new $Q$-function at each time step. At time $t$, given a transition $(s_t, a_t, r_t, s_{t+1})$, the algorithm updates the $Q$-function as follows:
$Q_{t+1}(s_t, a_t) = (1 - \eta)\, Q_t(s_t, a_t) + \eta\, \widehat{\mathcal{T}} Q_t(s_t, a_t).$ (1)
Here $\eta$ is the learning rate and $\widehat{\mathcal{T}}$ is the empirical Bellman operator: $\widehat{\mathcal{T}} Q_t(s_t, a_t) = r_t + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a')$. Under mild conditions, $Q_t$ will converge to the fixed point of the Bellman operator and hence to $Q^{*}$. When the state space is small, a tabular structure can be used to store the values of $Q(s, a)$ for every $(s, a) \in \mathcal{S} \times \mathcal{A}$.
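For illustration, the following is a minimal sketch of tabular asynchronous $Q$-learning following Equation (1); the environment interface (`env.reset`, `env.step`) and the epsilon-greedy behavior policy are illustrative assumptions rather than part of the analysis in this paper.

```python
import numpy as np

def tabular_q_learning(env, num_states, num_actions,
                       eta=0.1, gamma=0.99, epsilon=0.1, num_steps=10_000):
    """Asynchronous tabular Q-learning following the update in Equation (1)."""
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # Behavior policy: epsilon-greedy over the current Q estimates.
        if np.random.rand() < epsilon:
            a = np.random.randint(num_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)          # assumed environment interface
        # Empirical Bellman operator applied to the sampled transition.
        td_target = r + gamma * np.max(Q[s_next])
        # Q(s, a) <- (1 - eta) Q(s, a) + eta * td_target
        Q[s, a] = (1 - eta) * Q[s, a] + eta * td_target
        s = env.reset() if done else s_next
    return Q
```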
$Q$-learning with Function Approximation
When the state space is large, the asynchronous $Q$-learning update in Equation (1) cannot be applied directly, since it needs to loop over a table of all states and actions. In this case, function approximation is brought into $Q$-learning. Let $Q_{\theta}$ be an approximate $Q$-function, which is typically represented with a deep neural network (Mnih et al., 2015), where $\theta$ denotes the parameters of the neural network. $Q_{\theta}$ is often called the $Q$-network. Given a batch of transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{B}$, we define the image of $Q_{\bar{\theta}}$ under the empirical Bellman operator as:
$\widehat{\mathcal{T}} Q_{\bar{\theta}}(s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q_{\bar{\theta}}(s_{i+1}, a'),$
where $\bar{\theta}$ represents the parameters of the target network. The target parameters $\bar{\theta}$ are synchronized to $\theta$ every $N$ steps of Stochastic Gradient Descent (SGD). Since $Q^{*}$ is the fixed point of the Bellman operator, $Q_{\theta}$ should match $\widehat{\mathcal{T}} Q_{\bar{\theta}}$ when $Q_{\theta}$ converges to $Q^{*}$. Hence, learning is done by minimizing the following objective with SGD: $\mathcal{L}(\theta) = \sum_{i=1}^{B} \big( Q_{\theta}(s_i, a_i) - \widehat{\mathcal{T}} Q_{\bar{\theta}}(s_i, a_i) \big)^2$.
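As a concrete illustration of this objective, the sketch below computes the squared TD error against a frozen target network in PyTorch; the network architecture, batch format, and helper names are illustrative assumptions, not the implementation of the cited works.

```python
import torch
import torch.nn as nn

def make_q_network(state_dim, num_actions):
    # A small fully connected Q-network (illustrative architecture).
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, num_actions))

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error against a frozen target network (the SGD objective)."""
    s, a, r, s_next = batch      # s: [B, d], a: [B] (long), r: [B], s_next: [B, d]
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():        # empirical Bellman operator under the target parameters
        target = r + gamma * target_net(s_next).max(dim=1).values
    return ((q_sa - target) ** 2).sum()

# The target parameters are synchronized every N SGD steps, e.g.:
#   if step % N == 0: target_net.load_state_dict(q_net.state_dict())
```

Holding the target parameters fixed between synchronizations keeps the regression target stable, which is also why the target-network update frequency $N$ appears in the analysis of Section 4.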
Experience Replay
For $Q$-learning with function approximation, new trajectories are generated by executing a behavior policy and are then saved into the replay buffer, denoted as $\mathcal{D}$. When learning to minimize $\mathcal{L}(\theta)$, SGD is performed on batches of transitions sampled uniformly at random from the replay buffer. This process is often called Experience Replay (ER) (Lin, 1992; Li et al., 2022). To improve the stability and convergence rate of $Q$-learning, follow-up works sample transitions from the replay buffer with non-uniform probability distributions. Prioritized experience replay favors transitions with large temporal-difference errors (Schaul et al., 2016; Saglam et al., 2023). DisCor (Kumar et al., 2020) favors transitions with small Bellman errors. LaBER proposes a generalized TD error to reduce the variance of the gradient and improve learning stability (Lahire et al., 2022). Hindsight experience replay uses imagined outcomes by relabeling goals in each episode, allowing the agent to learn from unsuccessful attempts as if they were successful (Andrychowicz et al., 2017).
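For contrast with RER below, here is a minimal sketch of a uniform-sampling replay buffer (classic ER); the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Classic ER: store transitions and sample mini-batches uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```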
Reverse Experience Replay
RER is a recently proposed variant of experience replay (Goyal et al., 2019; Bai et al., 2021; Agarwal et al., 2022). RER samples consecutive sequences of transitions from the replay buffer, and the $Q$-learning algorithm updates its parameters by processing each sampled sequence in reverse order. Compared with ER, RER converges faster towards the optimal policy both empirically (Lee et al., 2019) and theoretically (Agarwal et al., 2022), under tabular and linear MDP settings. One intuitive explanation of why RER works is to consider two consecutive transitions $(s_1, a_1, r_1, s_2)$ and $(s_2, a_2, r_2, s_3)$. An incorrect $Q$-function estimate at $s_2$ will affect the estimate at $s_1$. Hence, updating in reverse order allows the $Q$-value update at $s_1$ to use the most up-to-date value of $s_2$, which accelerates convergence.
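The following sketch illustrates the RER sampling-and-update pattern for a linear $Q$-function; the function signature, the use of a separate target parameter, and the feature-based parameterization are illustrative assumptions rather than a verbatim transcription of Algorithm 1.

```python
import numpy as np

def rer_update(theta, theta_target, buffer, phi, num_actions, L, eta, gamma):
    """One RER step: sample L consecutive transitions and replay them in reverse.

    Q is linear, Q(s, a) = <theta, phi(s, a)>; theta_target stands in for the
    less frequently synchronized target-network parameter.
    """
    start = np.random.randint(0, len(buffer) - L + 1)
    for (s, a, r, s_next) in reversed(buffer[start:start + L]):
        # TD target computed with the target parameter.
        q_next = max(phi(s_next, b) @ theta_target for b in range(num_actions))
        td_error = r + gamma * q_next - phi(s, a) @ theta
        theta = theta + eta * td_error * phi(s, a)
    return theta
```

Because the later transitions in the sequence are processed first, the update for an earlier state already sees the value information propagated backward from its successors.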
2.1 Problem Setups for Reverse Experience Replay
Linear MDP Assumption
In this paper, we follow the definition of a linear MDP from Zanette et al. (2020), which states that the reward function can be written as the inner product of a parameter vector and a feature function $\phi(s, a)$. Therefore, the $Q$-function depends only on the parameter and the feature vector $\phi(s, a)$ for state $s$ and action $a$.
Assumption 1 (Linear MDP setting from Zanette et al., 2020).
There exists a vector $\theta_r \in \mathbb{R}^{d}$ such that $r(s, a) = \langle \phi(s, a), \theta_r \rangle$, and the transition probability is proportional to its corresponding feature $\phi(s, a)$. Therefore, the optimal $Q$-function is linear in the features, i.e., $Q^{*}(s, a) = \langle \phi(s, a), \theta^{*} \rangle$ for some parameter $\theta^{*}$ and every $(s, a) \in \mathcal{S} \times \mathcal{A}$.
Assumption 1 is the widely used linear MDP assumption that allows us to quantify the convergence rate (or sample complexity) of the $Q$-learning algorithm (Zanette et al., 2020; Agarwal et al., 2022). We need the following additional assumptions to obtain the final convergence-rate result. Throughout, the sequence of consecutive transitions is of length $L$ and the constant learning rate of the gradient descent algorithm is $\eta$.
Assumption 2 (from Zanette et al. (2020)).
The MDP has zero inherent Bellman error and $\|\phi(s, a)\| \leq 1$ for all $(s, a)$. There exists a constant $\kappa > 0$ such that $\mathbb{E}_{(s, a) \sim \mu}\left[\phi(s, a)\, \phi(s, a)^{\top}\right] \succeq \kappa I$. Here $\mu$ is the stationary distribution over all state-action pairs of the Markov chain determined by the transition kernel and the behavior policy.
Remark 1.
Suppose we pick a set of state-action tuples $\{(s_i, a_i)\}_{i=1}^{n}$, which may contain duplicated tuples. By linearity of expectation, we have $\mathbb{E}_{\mu}\left[\sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^{\top}\right] \succeq n \kappa I$. Here $n$ indicates the number of state-action tuples in the set.
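As a sanity check of the covariance condition in Assumption 2, the sketch below estimates the smallest eigenvalue of the empirical feature covariance from sampled state-action pairs; the sampling interface is an illustrative assumption.

```python
import numpy as np

def estimate_kappa(sample_state_action, phi, num_samples=10_000):
    """Estimate the smallest eigenvalue of E[phi phi^T] under the behavior policy.

    sample_state_action() should draw (s, a) approximately from the stationary
    distribution; phi(s, a) returns a d-dimensional feature vector.
    """
    feats = np.stack([phi(*sample_state_action()) for _ in range(num_samples)])
    cov = feats.T @ feats / num_samples            # empirical feature covariance
    return float(np.linalg.eigvalsh(cov).min())    # candidate lower bound kappa
```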
Definition 1.
Given the feature function $\phi$, denote the largest inner product between a parameter $\theta$ and the feature function as $\|\theta\|_{\phi} := \max_{(s, a) \in \mathcal{S} \times \mathcal{A}} \left| \langle \theta, \phi(s, a) \rangle \right|$.
Definition 2.
Let $I$ be the identity matrix of dimension $d$ and let $\eta$ be the learning rate. Define the matrix $A_j$ recursively as follows:
$A_j = A_{j-1}\left(I - \eta\, \phi_j \phi_j^{\top}\right), \qquad A_0 = I,$ (2)
where we use the simplified notation $\phi_j$ to denote $\phi(s_j, a_j)$. The explicit form for $A_L$ is:
$A_L = \left(I - \eta\, \phi_1 \phi_1^{\top}\right)\left(I - \eta\, \phi_2 \phi_2^{\top}\right) \cdots \left(I - \eta\, \phi_L \phi_L^{\top}\right).$ (3)
The semantic interpretation of $A_L$ in Definition 2 is that it represents the coefficient of the bias term in the error analysis of the learning algorithm's parameter (as outlined in Lemma 4). This joint product arises because the RER algorithm updates the parameter over a subsequence of consecutive transitions of length $L$. The norm of $A_L$ is influenced by both the sequence length $L$ and the learning rate $\eta$. When the norm of $A_L$ is small, the parameters of the learning model converge more rapidly to their optimal values.
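To build intuition for how the norm of $A_L$ shrinks with $\eta$ and $L$, the following sketch forms the product in Equation (3) from random unit-norm features and reports its spectral norm; the feature distribution is purely illustrative.

```python
import numpy as np

def bias_coefficient_norm(d=8, L=20, eta=0.1, seed=0):
    """Spectral norm of A_L = (I - eta phi_1 phi_1^T) ... (I - eta phi_L phi_L^T)."""
    rng = np.random.default_rng(seed)
    A = np.eye(d)
    for _ in range(L):
        phi = rng.normal(size=d)
        phi /= np.linalg.norm(phi)                 # features with norm at most 1
        A = A @ (np.eye(d) - eta * np.outer(phi, phi))
    return np.linalg.norm(A, 2)

# Longer sequences and larger learning rates contract the bias coefficient more.
for eta in (0.05, 0.2):
    print(eta, [round(bias_coefficient_norm(L=L, eta=eta), 3) for L in (5, 20, 80)])
```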
3 Methodology
3.1 Motivation
Let $\mu$ denote the stationary distribution of the state-action pairs in the MDP, $\eta$ the learning rate of the gradient descent algorithm, and $L$ the length of consecutive transitions processed by the RER algorithm. Previous work (Agarwal et al., 2022, Lemma 8 and Lemma 14) established that, when $\eta$ and $L$ satisfy a restrictive joint smallness condition, the following inequality holds:
(4) |
where the matrix $A_L$ is defined in Definition 2 and serves as a “coefficient” in the convergence analysis, as outlined in Lemma 4. The positive semi-definite relation $\preceq$ between two matrices is defined in Definition 4. Here, $I$ represents the identity matrix of dimension $d$, and the coefficient $\kappa$ is introduced in Assumption 2. The matrix $A_L$ was mentioned in (Agarwal et al., 2022, Appendix E, Equation 5), but we provide a formal definition here and streamline the original expression by removing unnecessary variables.
The condition in Equation (4) was further incorporated into the convergence requirement in (Agarwal et al., 2022, Theorem 1). It suggests that the RER algorithm cannot handle sequences of consecutive transitions that are too long (corresponding to a large $L$) or a learning rate $\eta$ that is too large. This presents a major gap between the theoretical justification and the real-world application of the RER algorithm. In this work, we address this gap by providing a tighter theoretical analysis that relaxes this constraint.
We begin by explaining the main difficulty in upper-bounding the term $\mathbb{E}_{\mu}\left[A_L^{\top} A_L\right]$. According to Definition 2, we can expand the joint product $A_L^{\top} A_L$. Using the linearity of expectation, we expand the entire joint product under the expectation as follows:
(5) |
In the third term on the right-hand side (RHS) of the second line, the summation is over all valid combinations of the indices $i$ and $j$. This is determined by first selecting the index $i$ from the index sequence, as seen in the first row of the equation above. The second index $j$ is then chosen, ensuring that $\phi_j$ lies to the right of $\phi_i$. The valid-combination constraint requires the entire sequence to satisfy the condition that $\phi_i$ must appear to the left of $\phi_j$.
The main challenge in upper-bounding the entire product under expectation lies in bounding the combinatorially many high-order terms. Our approach leverages the high-level idea that the RHS of Equation (3.1) can be upper-bounded by a form involving only $\mathbb{E}_{\mu}\left[\phi \phi^{\top}\right]$ with an appropriate coefficient. Specifically, we demonstrate that the third term on the RHS, which contains a large number of combinatorial high-order terms, can be bounded by terms involving only individual features $\phi_k$ (with $1 \leq k \leq L$) through the use of a proposed combinatorial counting method.
Theorem 1.
Let $\mu$ be the stationary distribution of the state-action pairs in the MDP. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds:
(6)
where the matrix $A_L$ is defined in Definition 2. The relation $\preceq$ between the matrices on both sides is the positive semi-definite ordering defined in Definition 4.
Proof Sketch.
By the linearity of expectation, we can upper-bound the second part of Equation (3.1) as follows:
(7)
Based on the new analysis from Lemma 2, the third part in Equation (3.1) is upper-bounded as:
Combining these two inequalities, we arrive at the upper bound stated in the theorem. A detailed proof can be found in Appendix B. ∎
Theorem 1 is established based on the new analysis in Lemma 2, which is introduced in Section 3.2. It serves as a key component in the final convergence proof of the RER algorithm, which is presented in Section 4.
Numerical Justification of the Tighter Bound
We provide a numerical evaluation of our derived bound and the original bound in Agarwal et al. (2022, Lemma 8) in Figure 1. (The code implementation for the numerical evaluation of the equalities and inequalities in this paper is available at https://github.com/jiangnanhugo/RER-proof.) For a fixed value of the sequence length $L$, we compare the value of our derived upper-bound expression with the original value. Across all sequence lengths, our derived expression is numerically higher than the original one, which implies that our bound (in Lemma 3) is tighter than the original bound in Agarwal et al. (2022, Lemma 8).
[Figure 1: Numerical comparison of our derived bound with the original bound from Agarwal et al. (2022, Lemma 8) across different sequence lengths.]
3.2 Relaxing the Requirement through Combinatorial Counting
Lemma 1.
The proof of this inequality can be found in Appendix A.1.
This result implies that, after the relaxation, only the first term (indexed by $i$) and the last term (indexed by $j$) determine the upper bound of the high-order term. This relaxation reduces the original complex summation problem to counting how many valid $i$ and $j$ can be selected at each possible position in the sequence of transitions.
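The relaxation idea can be checked numerically: for features with norm at most one, a chained term $v^{\top}\phi_{i_1}\phi_{i_1}^{\top}\cdots\phi_{i_m}\phi_{i_m}^{\top}v$ factors into scalars whose middle factors are at most one in absolute value, so it is bounded by $\frac{1}{2}\big(\langle v, \phi_{i_1}\rangle^{2} + \langle v, \phi_{i_m}\rangle^{2}\big)$ via Cauchy-Schwarz and AM-GM (Lemmas 7 and 8). The sketch below verifies this bounding idea on random data; it illustrates the relaxation rather than reproducing the exact statement of Lemma 1.

```python
import numpy as np

def high_order_term(v, feats):
    """Compute v^T (phi_1 phi_1^T) (phi_2 phi_2^T) ... (phi_m phi_m^T) v."""
    out = v.copy()
    for phi in feats:
        out = phi * (phi @ out)          # apply phi phi^T
    return float(v @ out)

rng = np.random.default_rng(0)
d, m = 6, 5
v = rng.normal(size=d)
feats = rng.normal(size=(m, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)     # ||phi|| <= 1

lhs = abs(high_order_term(v, feats))
rhs = 0.5 * ((v @ feats[0]) ** 2 + (v @ feats[-1]) ** 2)  # first and last features only
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```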
Lemma 2.
Based on the relaxation provided in Lemma 1, the third part in Equation (3.1) can be expanded combinatorially as follows:
(9) |
Sketch of Proof.
As depicted in Figure 2, we consider two arrays of length $L$. The indices in these arrays are symmetric: the left array decreases from $L$ to $1$, while the right array increases from $1$ to $L$. These arrays represent the indices of the matrix products in the first line of Equation (3.1): the left array simulates the indices of the left product, and the right array simulates the indices of the right product. The key idea is to count the number of combinations of $i$ and $j$ that contribute a term associated with a fixed index $k$ (where $1 \leq k \leq L$).
[Figure 2: Two index arrays of length L; the left array decreases from L to 1 and the right array increases from 1 to L, representing the indices of the matrix products in Equation (3.1).]
In the first case, illustrated in Figure 2, we fix $i$ in the left $k$-th slot. Due to the sequential ordering constraint, which requires that $\phi_i$ must be to the left of $\phi_j$, the index $j$ cannot be placed in the slots of the left array that violate this ordering. Additionally, to avoid double counting, we also exclude the right $k$-th slot for $j$. Consequently, only the remaining slots are available for assigning the rest of the sequence, and counting them gives the contribution of this case, as shown on the right-hand side.
Lemma 2 shows how to simplify the complex summation into a more manageable form involving only individual feature terms. This transformation significantly simplifies the task of obtaining a tighter upper bound.
Lemma 3.
For admissible $\eta$ and $L$, the weighted sum in Lemma 2 admits a closed-form upper bound; the precise statement and proof are given in Appendix A.3.
4 Sample Complexity of Reverse Experience Replay-Based $Q$-Learning on Linear MDPs
The convergence analysis assumes that every sub-trajectory of length $L$ is almost (or asymptotically) independent of the others with high probability. This condition, known as the mixing requirement for Markovian data, implies that the statistical dependence between two sub-trajectories diminishes as they become further apart along the trajectory (Tagorti & Scherrer, 2015; Nagaraj et al., 2020).
Prior work (Lee et al., 2019) provided a convergence proof for the Reverse Experience Replay (RER) approach but did not address the rate of convergence, primarily due to the challenges associated with analyzing deep neural networks. By contrast, linear MDPs (defined in Assumption 1), which approximate the reward function and transition kernel linearly via features, allow for an asymptotic performance analysis of RER. Recently, Agarwal et al. (2022) presented the first theoretical convergence-rate proof for RER. However, their analysis is limited by stringent conditions, notably requiring a very small learning rate. This constraint suggests that RER may struggle to compete with plain Experience Replay (ER) when using larger learning rates.
To address this challenge, we provide a tighter theoretical analysis of the RER method in Theorem 1. Our analysis mitigates the constraints on the learning rate required for convergence. We demonstrate that the convergence rate can be improved with a larger learning rate and a longer sequence of state-action-reward tuples, thus bridging the gap between theoretical convergence analysis and empirical learning results.
Lemma 4 (Bias and variance decomposition).
Define the error term for each parameter as the difference between the empirical estimate and the true MDP. For the current iteration, the difference between the current estimated parameter and the optimal parameter, accumulated along the length-$L$ sub-trajectory with reverse updates, is:
(10) |
For clarity, $A_L$ in Definition 2 is a joint product of terms involving the feature vectors of the consecutive state-action tuples. When the norm of $A_L$ is small, the parameter quickly converges to its optimum.
The first part on the RHS is referred to as the bias, and the second part on the RHS is the variance along the sub-trajectory, which we will later show has zero mean.
The proof is presented in Appendix C.1. The result is obtained by unrolling the update for $L$ consecutive steps in reverse order, according to Lines 5-7 in Algorithm 1. This allows us to separately quantify the upper bounds of the bias term and the variance term.
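The unrolling can also be checked numerically. The sketch below runs $L$ noisy linear updates and verifies that the final parameter error equals the product-form bias term plus the accumulated noise terms; the update rule used here is a generic linear TD-style recursion standing in for Lines 5-7 of Algorithm 1, so it illustrates the decomposition rather than reproducing the exact algorithm.

```python
import numpy as np

def check_bias_variance_unrolling(d=5, L=10, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    phis = rng.normal(size=(L, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)   # unit-norm features
    noise = rng.normal(scale=0.1, size=L)
    theta_star = rng.normal(size=d)
    theta = np.zeros(d)
    e0 = theta - theta_star

    # Run L noisy linear updates with targets y_k = <phi_k, theta*> + eps_k.
    for k in range(L):
        y = phis[k] @ theta_star + noise[k]
        theta = theta + eta * (y - phis[k] @ theta) * phis[k]

    def contraction(from_k):
        # prod over m = L-1 down to from_k of (I - eta phi_m phi_m^T), later factors on the left
        A = np.eye(d)
        for m in range(L - 1, from_k - 1, -1):
            A = A @ (np.eye(d) - eta * np.outer(phis[m], phis[m]))
        return A

    bias = contraction(0) @ e0
    variance = sum(contraction(k + 1) @ (eta * noise[k] * phis[k]) for k in range(L))
    assert np.allclose(theta - theta_star, bias + variance)
    return np.linalg.norm(bias), np.linalg.norm(variance)

print(check_bias_variance_unrolling())
```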
Lemma 5 (Bound on the bias term).
Let $v$ be a non-zero vector and let $N$ be the frequency at which the target network is updated. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds with probability at least $1 - \delta$:
(11) |
The $\phi$-based norm is defined in Definition 1.
Regarding the bound on the variance term in Lemma 4, even though the matrix $A_L$ is involved in the expression, it turns out we do not need to modify the original proof, and thus we follow the result of the original work. The exact statement is presented in Appendix C.3.
Theorem 2.
For a linear MDP, assume the reward function and the feature vectors are bounded for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Let $T$ be the maximum number of learning episodes, $N$ be the frequency of the target network update, $\eta$ be the learning rate, and $L$ be the length of the sequence for RER described in Algorithm 1. Under the stated conditions on $\eta$ and $L$, with sample complexity
(12)
the learned $Q$-function is within $\epsilon$ of the optimal $Q^{*}$ with probability at least $1 - \delta$.
Sketch of Proof.
We first establish the approximate independence of sub-trajectories of length $L$. We then decompose the error term of the $Q$-value estimate using a bias-variance decomposition (as shown in Lemma 4), where the RER method and the target network help control the variance term using martingale sequences. The upper bound for the bias term is given in Lemma 5, and the upper bound for the variance term is presented in the theorem restated in Appendix C.3 (Agarwal et al., 2022, Theorem 4). Finally, we summarize the results and provide the complete proof in Lemma 6, leading to the probabilistic bound in this theorem. ∎
Compared to the original theorem in Agarwal et al. (2022, Theorem 1), our work provides a tighter upper bound and relaxes the assumptions needed for the result to hold. This advancement bridges the gap between theoretical justification and empirical MDP evaluation. Furthermore, we hope that the new approach of transforming the original problem into a combinatorial counting problem will inspire further research in related domains.
5 Conclusion
In this work, we gave a tighter finite-sample analysis of reverse experience replay, a heuristic heavily used in practical $Q$-learning, and showed that seemingly simple modifications can have far-reaching consequences in linear MDP settings. We provided a rigorous analysis that relaxes the limitations on the learning rate and the length of the consecutive sequence. Our key idea is to transform the original problem, which involves a giant summation, into a combinatorial counting problem, which greatly simplifies the analysis. Finally, we showed theoretically that RER converges faster with a larger learning rate and a longer consecutive sequence of state-action-reward tuples.
Acknowledgments
We thank all the reviewers for their constructive comments. We also thank Yi Gu for his feedback on the theoretical justification part of this paper. This research was supported by NSF grant CCF-1918327 and DOE – Fusion Energy Science grant: DE-SC0024583.
References
- Agarwal et al. (2022) Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Online target q-learning with reverse experience replay: Efficiently finding the optimal policy for linear mdps. In ICLR. OpenReview.net, 2022.
- Ambrose et al. (2016) R. Ellen Ambrose, Brad E. Pfeiffer, and David J. Foster. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron, 91(5):1124–1136, 2016. ISSN 0896-6273.
- Andrychowicz et al. (2017) Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, pp. 5048–5058, 2017.
- Bai et al. (2021) Chenjia Bai, Lingxiao Wang, Lei Han, Jianye Hao, Animesh Garg, Peng Liu, and Zhaoran Wang. Principled exploration via optimistic bootstrapping and backward induction. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 577–587. PMLR, 2021.
- Bertsekas & Yu (2012) Dimitri P. Bertsekas and Huizhen Yu. Q-learning and enhanced policy iteration in discounted dynamic programming. Math. Oper. Res., 37(1):66–94, 2012.
- Florensa et al. (2017) Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In CoRL, volume 78 of Proceedings of Machine Learning Research, pp. 482–495. PMLR, 2017.
- Foster & Wilson (2006) David J Foster and Matthew A Wilson. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680–683, 2006.
- Goyal et al. (2019) Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy P. Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. In ICLR. OpenReview.net, 2019.
- Haddad (2021) Roudy El Haddad. Repeated sums and binomial coefficients. arXiv preprint arXiv:2102.12391, 2021.
- Igata et al. (2021) Hideyoshi Igata, Yuji Ikegaya, and Takuya Sasaki. Prioritized experience replays on a hippocampal predictive map for learning. Proceedings of the National Academy of Sciences, 118(1), 2021.
- Jain et al. (2021) Prateek Jain, Suhas S. Kowshik, Dheeraj Nagaraj, and Praneeth Netrapalli. Streaming linear system identification with reverse experience replay. In NIPS, volume 34, pp. 30140–30152. Curran Associates, Inc., 2021.
- Kumar et al. (2020) Aviral Kumar, Abhishek Gupta, and Sergey Levine. Discor: Corrective feedback in reinforcement learning via distribution correction. In NeurIPS, volume 33, pp. 18560–18572, 2020.
- Lahire et al. (2022) Thibault Lahire, Matthieu Geist, and Emmanuel Rachelson. Large batch experience replay. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 11790–11813. PMLR, 2022.
- Lee et al. (2019) Su Young Lee, Sung-Ik Choi, and Sae-Young Chung. Sample-efficient deep reinforcement learning via episodic backward update. In NeurIPS, pp. 2110–2119, 2019.
- Li et al. (2022) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. IEEE Trans. Inf. Theory, 68(1):448–473, 2022.
- Lin (1992) Long Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn., 8:293–321, 1992.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Nagaraj et al. (2020) Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. In NeurIPS, 2020.
- Puterman (1994) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.
- Rotinov (2019) Egor Rotinov. Reverse experience replay. CoRR, abs/1910.08780, 2019.
- Saglam et al. (2023) Baturay Saglam, Furkan B. Mutlu, Dogan Can Çiçek, and Suleyman S. Kozat. Actor prioritized experience replay. J. Artif. Intell. Res., 78:639–672, 2023.
- Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR, 2016.
- Tagorti & Scherrer (2015) Manel Tagorti and Bruno Scherrer. On the rate of convergence and error bounds for lstd (). In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 1521–1529. JMLR.org, 2015.
- van Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pp. 2094–2100. AAAI Press, 2016.
- Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- Watkins & Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Technical note q-learning. Mach. Learn., 8:279–292, 1992.
- Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, 1989.
- Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In ICML, volume 119 of Proceedings of Machine Learning Research, pp. 10978–10989. PMLR, 2020.
Appendix A Proof of Combinatorial Counting in Reverse Experience Replay
In this section, we present detailed proof for each of the necessary lemmas used for Theorem 1.
A.1 Proof of Lemma 1
Lemma.
Let $v$ be any $d$-dimensional non-zero vector, i.e., $v \neq 0$. Consider one high-order term in Equation (3.1). We have:
(13)
where $|\cdot|$ computes the absolute value.
A.2 Proof of Lemma 2
[Figure 3: Illustration of the counting cases described below; panel (d) depicts the case where both endpoints occupy the k-th slots.]
Lemma.
Based on the relaxation in Lemma 1, the weighted summation in Equation (3.1) can be expanded combinatorially as follows:
(16) |
Proof.
In the main paper, we outlined the steps for transforming the problem of estimating the large summation involving many high-order terms into a combinatorial counting problem. Here, we present the details for the remaining cases beyond what was sketched in Figure 2, where we focused on one instance of selecting $i$ and $j$.
More specifically, the element $\phi_k \phi_k^{\top}$ associated with the $k$-th slot arises from the following cases:
- (a) When $i$ selects the $k$-th slot and $j$ does not; this includes two subcases.
- (b) When $j$ selects the $k$-th slot and $i$ does not; this is symmetric to the previous case and also includes two subcases.
- (c) When $i$ picks the left $k$-th slot and $j$ picks the right $k$-th slot, the intermediate elements are selected from a subarray of limited size. To ensure that the number of available slots is at least the number of intermediate elements, a corresponding condition on the indices is required. See Figure 3(d) for a visual example.
Note that cases (a) and (b) yield the same count, while case (c) is counted twice because both the first and last elements occupy the $k$-th slots. Thus, we derive the final result:
(17) |
∎
A.3 Proof of Lemma 3
Lemma.
For admissible $\eta$ and $L$, we have:
(18)
(19)
Proof.
We start by separating the sum into three parts:
(20) |
Then we use the binomial series expansion of $(1 - x)^{-m}$, which converges when $|x| < 1$. Substituting the relevant quantity for $x$, we have:
(21) |
Setting the corresponding parameters, we obtain the result for the first sum:
(22) |
Note that the binomial coefficient $\binom{n}{k}$ is defined to be zero when $k > n$, because one cannot choose more elements than are available. In the above equation, the coefficient is therefore zero for out-of-range indices, and thus we limit the summation range accordingly.
Similarly, setting the corresponding parameters, we obtain the result for the second sum:
(23) |
Similarly, the binomial coefficient is zero for out-of-range indices, and thus the summation is restricted accordingly.
For the third sum, we adjust the index and apply the binomial theorem:
(24) |
By combining all three simplified sums, we obtain the final result:
(25) |
The above result holds when the series converges; otherwise, the series could diverge due to the unbounded growth of the terms. Since the learning rate is always positive, the convergence condition translates into an upper bound on the learning rate. This completes the proof. ∎
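As a quick numerical sanity check of the series manipulations above, the sketch below compares a truncated binomial-type series against its closed form for $|x| < 1$; the particular identity shown (the negative binomial expansion) is a standard one and stands in for the elided expressions in this proof.

```python
import math

def negative_binomial_series(x, k, terms=500):
    """Truncated sum_{n>=0} C(n+k, k) x^n, which equals (1-x)^{-(k+1)} for |x| < 1."""
    return sum(math.comb(n + k, k) * x**n for n in range(terms))

for x in (0.1, 0.5, 0.9):
    for k in (0, 1, 3):
        approx = negative_binomial_series(x, k)
        exact = (1 - x) ** (-(k + 1))
        assert abs(approx - exact) < 1e-6 * exact
print("series identities verified for sample (x, k) pairs")
```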
We aim to establish upper and lower bounds for the expression in Lemma 3 that do not depend on the summation variable; the results are provided in the following remark.
Remark 2.
For admissible $\eta$ and $L$, the following inequalities hold:
(26)
(27)
Within the admissible range, both the upper and lower bounds are positive.
Proof.
Each term in the expression is monotone over the relevant range, so its maximum and minimum are attained at the endpoints of that range. Thus, the upper bound of the entire expression is obtained by combining the maximum values of the individual terms, and the lower bound by combining their minimum values. ∎
Appendix B Theoretical Justification of Theorem 1
Theorem.
Let $\mu$ be the stationary distribution of the state-action pairs in the MDP. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds:
(28) |
where the matrix $A_L$ is defined in Definition 2. Here “$\preceq$” denotes the positive semi-definite ordering between the matrices on both sides (see Definition 4).
Proof.
Based on Equation (3.1), we have:
(29) |
To the best of our knowledge, Theorem 1 together with its theoretical justification is novel and has not been used in any prior analysis of experience replay.
Appendix C Sample Complexity Proof of Theorem 2
We acknowledge that the main structure of the convergence proof follows the original work. Our contribution here is to present a cleaner proof pipeline and to integrate our tighter bound from Theorem 1.
C.1 Proof of Bias-Variance Decomposition for the Error in Lemma 4
Lemma.
Define the error term for each parameter as the difference between the empirical estimate and the true MDP. For the current iteration, the difference between the current estimated parameter and the optimal parameter, accumulated along the length-$L$ sub-trajectory with reverse updates, is:
(38) |
Proof.
As shown in Lines 5-7 of Algorithm 1, we use a sampled sub-trajectory of length $L$ to execute the $Q$-update in reverse order at each iteration:
(39)
(40)
where $I$ denotes the identity matrix and $\bar{\theta}$ is the parameter of the target network. The second equality is obtained by rearranging the terms of the first equality. One operator is rewritten in an equivalent form for the purpose of rigorous analysis, and $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors of the same dimension.
Let $Q^{*}(s, a)$ be the optimal $Q$-value at state $s$ when taking action $a$, and assume $\theta^{*}$ is the optimal parameter; the Bellman optimality equation is written as:
Define the error term for parameter $\theta$ and the $i$-th tuple as the difference between the empirical estimate and the true probabilistic MDP:
(41)
(42)
(43)
For every tuple in the sub-trajectory, apply the Bellman optimality equation to the RER update at the optimal parameter $\theta^{*}$:
(44) |
Right-multiplying both sides of the above equation by the corresponding feature term and combining with the first equation in this proof, we get
(45) |
where the error term is defined in Equation (43). Combining this with Definition 2 and recursively expanding the RHS, we obtain the difference with respect to the optimal parameter after reversely updating for $L$ consecutive steps:
(46) |
The proof is thus finished. ∎
Remark 3.
Suppose we execute Algorithm 1 for $T$ iterations; we would then get:
(47) |
where the remaining notation is defined above. The first term on the RHS is the bias, which decays geometrically, and the second term on the RHS is the variance along the sub-trajectory of length $L$, which we will later show has zero mean.
C.2 Bound the Bias Term in Remark 3
We briefly explain here how to quantify the upper bound of the bias term in Lemma 4. In comparison to (Agarwal et al., 2022, Lemma 8), the following Lemma 5 requires a weaker condition on $\eta$ and $L$, which is a relaxation of the original work.
Lemma.
Let $v$ be a non-zero vector and let $N$ be the frequency at which the target network is updated. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds:
(48) |
Furthermore, the following inequality holds with probability at least $1 - \delta$:
(49) |
The $\phi$-based norm is defined in Definition 1.
Proof.
From Theorem 1, we notice that one term converges to zero exponentially and can thus be omitted at sufficient precision. We therefore obtain the simplified form
(50)
(51)
Now, we observe that:
(52) |
Therefore, we take the expectation of Equation (52) conditioned on the preceding iterations:
(53) (using Equation (51))
Applying the equation above inductively, we conclude the result:
(54)
(55)
The last step of the numerical approximation follows from a standard inequality.
We then extend the result to the $\phi$-based norm (Definition 1) as follows:
(56) |
Thus, by Markov’s inequality, with probability at least , the following event holds:
(57) |
∎
C.3 Bound the Variance Term in Remark 3
Even though the matrix $A_L$ is involved in the expression, it turns out we do not need to modify the original proof, and thus we follow the result of the original work. Please see (Agarwal et al., 2022, Appendix L.3) for the original proof steps.
Theorem (Agarwal et al. (2022) Theorem 4).
Suppose the relevant parameters are fixed. There exists a universal constant such that the following event holds with probability at least $1 - \delta$:
(58) |
C.4 Overall Sample Complexity Analysis
Lemma 6.
There exist constants such that the following two conditions hold:
1. ;
2. .
Consider the target network index ranging over the total number of target network updates. The following holds with probability at least $1 - \delta$:
1. For the target network, ;
2. For the error accumulated along $L$ consecutive steps of reverse updates, .
When we combine the above two cases, we have:
We can obtain the sample complexity of the whole learning framework by setting the RHS to the target accuracy, which recovers the result shown in Theorem 2.
Appendix D Extra Notations and Definitions
Definition 3 (-Net, Covering Number and Metric Entropy).
Given a metric space $(\mathcal{X}, d)$, consider a region $K \subseteq \mathcal{X}$. 1) A subset $\mathcal{N} \subseteq K$ is called an $\varepsilon$-net of $K$ (for $\varepsilon > 0$) if every point of $K$ is within distance $\varepsilon$ of a point of $\mathcal{N}$: for every $x \in K$ there exists $y \in \mathcal{N}$ with $d(x, y) \leq \varepsilon$.
2) Equivalently, we denote by $\mathcal{N}(K, \varepsilon)$ the smallest number of closed balls with centers in $K$ and radius $\varepsilon$ whose union covers $K$; this is the covering number. 3) Metric entropy: the logarithm of the covering number, $\log \mathcal{N}(K, \varepsilon)$.
Theorem 3 (Dudley’s Integral Inequality).
Let $(X_t)_{t \in T}$ be a zero-mean random process on the metric space $(T, d)$ with sub-gaussian increments. Then $\mathbb{E}\left[\sup_{t \in T} X_t\right] \leq C \int_{0}^{\infty} \sqrt{\log \mathcal{N}(T, d, \varepsilon)}\, \mathrm{d}\varepsilon$ for a universal constant $C$.
Lemma 7 (Cauchy-Schwarz Inequality).
Based on Assumption 2, the feature vector of each state-action pair is bounded, $\|\phi(s, a)\| \leq 1$ for all $(s, a)$. For any parameter vector $\theta$, we have:
(60) |
Definition 4 (Positive Semi-Definite Property of Matrix).
For two symmetric matrices $A, B \in \mathbb{R}^{d \times d}$, we say $A \preceq B$ if $B - A$ is positive semi-definite, i.e., $v^{\top}(B - A)\, v \geq 0$ for every non-zero $d$-dimensional vector $v$.
Lemma 8 (AM–GM Inequality).
The inequality of arithmetic and geometric means, or more briefly the AM-GM inequality, states that the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list. For $a, b \geq 0$, we have $\sqrt{ab} \leq \frac{a + b}{2}$.
Lemma 9 (Recursive Formula).
Let $n$ and $k$ be positive integers, and let $\binom{n}{k}$ denote the binomial coefficient, computed as $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$. Then for all $1 \leq k \leq n$, we have: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.
Lemma 10 (Rising Sum of Binomial Coefficients).
Let $n$ and $m$ be positive integers. Then: $\sum_{j=0}^{m} \binom{n + j}{n} = \binom{n + m + 1}{n + 1}$.
Lemma 11 (Vandermonde identity Haddad (2021)).
For any non-negative integers $m$, $n$, and $p$ such that $p \leq m + n$, we have: $\sum_{k=0}^{p} \binom{m}{k} \binom{n}{p - k} = \binom{m + n}{p}$.
Lemma 12.
For any admissible parameters, we have that the first sum has a closed form, and the second sum also has a closed form.
Lemma 13 (Linearity of Expectation).
For random variables $X, Y$ and constants $a, b$, we have: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$.