A Tighter Convergence Proof of Reverse Experience Replay
Abstract
In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update its parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis of RER holds only for a very small learning rate and short consecutive sequences, under which RER converges more slowly than algorithms without RER that use large learning rates. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitations on the learning rate and on the length of the consecutive sequence. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.
1 Introduction
Reinforcement Learning (RL) is highly successful for a variety of practical problems in the realm of long-term decision-making. Experience Replay (ER) of historical trajectories plays a vital role in RL algorithms (Lin, 1992; Mnih et al., 2015). A trajectory is a sequence of transitions, where each transition is a state-action-reward tuple. The memory used to store these experienced trajectories is referred to as the replay buffer. The strategy used to sample transitions from the replay buffer determines the rate and stability of convergence of the learning algorithm.
Reverse Experience Replay (RER) (Florensa et al., 2017; Rotinov, 2019; Lee et al., 2019; Agarwal et al., 2022) is a recently proposed approach inspired by the hippocampal reverse replay mechanism in human and animal neurons (Foster & Wilson, 2006; Ambrose et al., 2016; Igata et al., 2021). Theoretical analysis shows that RER improves the convergence rate towards the optimal policy in comparison with ER-based algorithms. Unlike ER, which samples transitions from the replay buffer uniformly (van Hasselt et al., 2016) (known as classic experience replay) or with non-uniform weights (Schaul et al., 2016) (known as prioritized experience replay), RER samples consecutive sequences of transitions from the buffer and feeds them into the learning algorithm in reverse order.
However, the most recent theoretical analysis of RER with $Q$-learning only holds for a very small learning rate and short consecutive sequences (Agarwal et al., 2022), under which it converges more slowly than the classic $Q$-learning algorithm (together with ER) run with a large learning rate. We attempt to bridge this gap between theory and practice for the newly proposed reverse experience replay algorithm.
In this paper, we provide a tighter analysis that relaxes the limitation on the learning rate and the length of the consecutive transitions. Our key idea is to transform the original problem involving a giant summation (shown in Equation 3.1) into a combinatorial counting problem (shown in Lemma 2), which greatly simplifies the whole problem. We hope the new idea of transforming the original problem into a combinatorial counting problem can enlighten other relevant domains. Furthermore, we show in Theorem 2 that RER converges faster with a larger learning rate and a longer consecutive sequence of state-action-reward tuples.
2 Preliminaries
Markov Decision Process
We consider a Markov decision process (MDP) with discounted rewards, denoted as $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. Here $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, and $\gamma \in (0, 1)$ is the discount factor. We use $P$ to denote the transition probability kernel of the MDP. For each pair $(s, a) \in \mathcal{S} \times \mathcal{A}$, $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ when action $a$ is executed. The reward function is $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, such that $r(s, a)$ is the immediate reward received in state $s$ when action $a$ is executed (Puterman, 1994). A policy $\pi$ is a mapping from states to distributions over the set of actions: $a_t \sim \pi(\cdot \mid s_t)$ for $t \geq 0$. A trajectory is denoted as $\{(s_t, a_t, r_t)\}_{t \geq 0}$, where $s_t$ (respectively $a_t$) is the state (respectively the action taken) at time $t$, $r_t$ is the reward received at time $t$, and $(s_t, a_t, r_t, s_{t+1})$ is the $t$-th transition.
Value Function and -Function
The value function of a policy $\pi$ is denoted as $V^{\pi}$. For $s \in \mathcal{S}$, $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s\right]$, which is the expected discounted cumulative reward received when 1) the initial state is $s$, 2) the actions are taken based on the policy $\pi$, i.e., $a_t \sim \pi(\cdot \mid s_t)$ for $t \geq 0$, and 3) the trajectory is generated by the transition kernel, i.e., $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for all $t \geq 0$. Similarly, let $Q^{\pi}$ be the action-value function (also known as the $Q$-function) of a policy $\pi$. For $(s, a) \in \mathcal{S} \times \mathcal{A}$, it is defined as $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s, a_0 = a\right]$.
There exists an optimal policy, denoted as $\pi^{*}$, that maximizes $Q^{\pi}(s, a)$ uniformly over all state-action pairs (Watkins, 1989). We denote $Q^{*}$ as the $Q$-function corresponding to $\pi^{*}$, i.e., $Q^{*} = Q^{\pi^{*}}$. The Bellman operator $\mathcal{T}$ on a $Q$-function is defined as: for $(s, a) \in \mathcal{S} \times \mathcal{A}$, $(\mathcal{T} Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\max_{a' \in \mathcal{A}} Q(s', a')\right]$.
The optimal $Q$-function $Q^{*}$ is the unique fixed point of the Bellman operator (Bertsekas & Yu, 2012).
$Q$-learning
The $Q$-learning algorithm is a model-free algorithm to learn $Q^{*}$ (Watkins & Dayan, 1992). The high-level idea is to find the fixed point of the Bellman operator. Given the trajectory generated by some underlying behavior policy, the asynchronous $Q$-learning algorithm estimates a new $Q$-function at each time step. At time $t$, given a transition $(s_t, a_t, r_t, s_{t+1})$, the algorithm updates the $Q$-function as follows:
$Q_{t+1}(s_t, a_t) = (1 - \eta)\, Q_t(s_t, a_t) + \eta\, \widehat{\mathcal{T}} Q_t(s_t, a_t).$ (1)
Here $\eta$ is the learning rate and $\widehat{\mathcal{T}}$ is the empirical Bellman operator: $\widehat{\mathcal{T}} Q_t(s_t, a_t) = r_t + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a')$. Under mild conditions, $Q_t$ will converge to the fixed point of the Bellman operator and hence to $Q^{*}$. When the state space is small, a tabular structure can be used to store the values of $Q(s, a)$ for every $(s, a) \in \mathcal{S} \times \mathcal{A}$.
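For illustration, the following is a minimal sketch of tabular asynchronous $Q$-learning following Equation (1); the environment interface (`env.reset`, `env.step`) and the epsilon-greedy behavior policy are illustrative assumptions rather than part of the analysis in this paper.

```python
import numpy as np

def tabular_q_learning(env, num_states, num_actions,
                       eta=0.1, gamma=0.99, epsilon=0.1, num_steps=10_000):
    """Asynchronous tabular Q-learning following the update in Equation (1)."""
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # Behavior policy: epsilon-greedy over the current Q estimates.
        if np.random.rand() < epsilon:
            a = np.random.randint(num_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)          # assumed environment interface
        # Empirical Bellman operator applied to the sampled transition.
        td_target = r + gamma * np.max(Q[s_next])
        # Q(s, a) <- (1 - eta) Q(s, a) + eta * td_target
        Q[s, a] = (1 - eta) * Q[s, a] + eta * td_target
        s = env.reset() if done else s_next
    return Q
```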
$Q$-learning with Function Approximation
When the state space is large, the asynchronous $Q$-learning update in Equation (1) cannot be applied directly, since it needs to loop over a table of all states and actions. In this case, function approximation is brought into $Q$-learning. Let $Q_{\theta}$ be an approximate $Q$-function, which is typically represented with a deep neural network (Mnih et al., 2015), where $\theta$ denotes the parameters of the neural network. $Q_{\theta}$ is often called the $Q$-network. Given a batch of transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{B}$, we define the image of $Q_{\bar{\theta}}$ under the empirical Bellman operator as:
$\widehat{\mathcal{T}} Q_{\bar{\theta}}(s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q_{\bar{\theta}}(s_{i+1}, a'),$
where $\bar{\theta}$ represents the parameters of the target network. The target parameters $\bar{\theta}$ are synchronized to $\theta$ every $N$ steps of Stochastic Gradient Descent (SGD). Since $Q^{*}$ is the fixed point of the Bellman operator, $Q_{\theta}$ should match $\widehat{\mathcal{T}} Q_{\bar{\theta}}$ when $Q_{\theta}$ converges to $Q^{*}$. Hence, learning is done by minimizing the following objective with SGD: $\mathcal{L}(\theta) = \sum_{i=1}^{B} \big( Q_{\theta}(s_i, a_i) - \widehat{\mathcal{T}} Q_{\bar{\theta}}(s_i, a_i) \big)^2$.
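As a concrete illustration of this objective, the sketch below computes the squared TD error against a frozen target network in PyTorch; the network architecture, batch format, and helper names are illustrative assumptions, not the implementation of the cited works.

```python
import torch
import torch.nn as nn

def make_q_network(state_dim, num_actions):
    # A small fully connected Q-network (illustrative architecture).
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, num_actions))

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error against a frozen target network (the SGD objective)."""
    s, a, r, s_next = batch      # s: [B, d], a: [B] (long), r: [B], s_next: [B, d]
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():        # empirical Bellman operator under the target parameters
        target = r + gamma * target_net(s_next).max(dim=1).values
    return ((q_sa - target) ** 2).sum()

# The target parameters are synchronized every N SGD steps, e.g.:
#   if step % N == 0: target_net.load_state_dict(q_net.state_dict())
```

Holding the target parameters fixed between synchronizations keeps the regression target stable, which is also why the target-network update frequency $N$ appears in the analysis of Section 4.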
Experience Replay
For $Q$-learning with function approximation, new trajectories are generated by executing a behavior policy and are then saved into the replay buffer, denoted as $\mathcal{D}$. When learning to minimize $\mathcal{L}(\theta)$, SGD is performed on batches of transitions sampled uniformly at random from the replay buffer. This process is often called Experience Replay (ER) (Lin, 1992; Li et al., 2022). To improve the stability and convergence rate of $Q$-learning, follow-up works sample transitions from the replay buffer with non-uniform probability distributions. Prioritized experience replay favors transitions with large temporal-difference errors (Schaul et al., 2016; Saglam et al., 2023). DisCor (Kumar et al., 2020) favors transitions with small Bellman errors. LaBER proposes a generalized TD error to reduce the variance of the gradient and improve learning stability (Lahire et al., 2022). Hindsight experience replay uses imagined outcomes by relabeling goals in each episode, allowing the agent to learn from unsuccessful attempts as if they were successful (Andrychowicz et al., 2017).
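For contrast with RER below, here is a minimal sketch of a uniform-sampling replay buffer (classic ER); the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Classic ER: store transitions and sample mini-batches uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```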
Reverse Experience Replay
RER is a recently proposed variant of experience replay (Goyal et al., 2019; Bai et al., 2021; Agarwal et al., 2022). RER samples consecutive sequences of transitions from the replay buffer, and the $Q$-learning algorithm updates its parameters by processing each sampled sequence in reverse order. Compared with ER, RER converges faster towards the optimal policy both empirically (Lee et al., 2019) and theoretically (Agarwal et al., 2022), under tabular and linear MDP settings. One intuitive explanation of why RER works is to consider two consecutive transitions $(s_1, a_1, r_1, s_2)$ and $(s_2, a_2, r_2, s_3)$. An incorrect $Q$-function estimate at $s_2$ will affect the estimate at $s_1$. Hence, updating in reverse order allows the $Q$-value update at $s_1$ to use the most up-to-date value of $s_2$, which accelerates convergence.
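The following sketch illustrates the RER sampling-and-update pattern for a linear $Q$-function; the function signature, the use of a separate target parameter, and the feature-based parameterization are illustrative assumptions rather than a verbatim transcription of Algorithm 1.

```python
import numpy as np

def rer_update(theta, theta_target, buffer, phi, num_actions, L, eta, gamma):
    """One RER step: sample L consecutive transitions and replay them in reverse.

    Q is linear, Q(s, a) = <theta, phi(s, a)>; theta_target stands in for the
    less frequently synchronized target-network parameter.
    """
    start = np.random.randint(0, len(buffer) - L + 1)
    for (s, a, r, s_next) in reversed(buffer[start:start + L]):
        # TD target computed with the target parameter.
        q_next = max(phi(s_next, b) @ theta_target for b in range(num_actions))
        td_error = r + gamma * q_next - phi(s, a) @ theta
        theta = theta + eta * td_error * phi(s, a)
    return theta
```

Because the later transitions in the sequence are processed first, the update for an earlier state already sees the value information propagated backward from its successors.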
2.1 Problem Setups for Reverse Experience Replay
Linear MDP Assumption
In this paper, we follow the definition of a linear MDP from Zanette et al. (2020), which states that the reward function can be written as the inner product of a parameter vector and a feature function $\phi(s, a)$. Therefore, the $Q$-function depends only on the parameter and the feature vector $\phi(s, a)$ for state $s$ and action $a$.
Assumption 1 (Linear MDP setting from Zanette et al., 2020).
There exists a vector $\theta_r \in \mathbb{R}^{d}$ such that $r(s, a) = \langle \phi(s, a), \theta_r \rangle$, and the transition probability is proportional to its corresponding feature $\phi(s, a)$. Therefore, the optimal $Q$-function is linear in the features, i.e., $Q^{*}(s, a) = \langle \phi(s, a), \theta^{*} \rangle$ for some parameter $\theta^{*}$ and every $(s, a) \in \mathcal{S} \times \mathcal{A}$.
Assumption 1 is the widely used linear MDP assumption that allows us to quantify the convergence rate (or sample complexity) of the $Q$-learning algorithm (Zanette et al., 2020; Agarwal et al., 2022). We need the following additional assumptions to obtain the final convergence-rate result. Throughout, the sequence of consecutive transitions is of length $L$ and the constant learning rate of the gradient descent algorithm is $\eta$.
Assumption 2 (from Zanette et al. (2020)).
The MDP has zero inherent Bellman error and $\|\phi(s, a)\| \leq 1$ for all $(s, a)$. There exists a constant $\kappa > 0$ such that $\mathbb{E}_{(s, a) \sim \mu}\left[\phi(s, a)\, \phi(s, a)^{\top}\right] \succeq \kappa I$. Here $\mu$ is the stationary distribution over all state-action pairs of the Markov chain determined by the transition kernel and the behavior policy.
Remark 1.
Suppose we pick a set of state-action tuples $\{(s_i, a_i)\}_{i=1}^{n}$, which may contain duplicated tuples. By linearity of expectation, we have $\mathbb{E}_{\mu}\left[\sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^{\top}\right] \succeq n \kappa I$. Here $n$ indicates the number of state-action tuples in the set.
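As a sanity check of the covariance condition in Assumption 2, the sketch below estimates the smallest eigenvalue of the empirical feature covariance from sampled state-action pairs; the sampling interface is an illustrative assumption.

```python
import numpy as np

def estimate_kappa(sample_state_action, phi, num_samples=10_000):
    """Estimate the smallest eigenvalue of E[phi phi^T] under the behavior policy.

    sample_state_action() should draw (s, a) approximately from the stationary
    distribution; phi(s, a) returns a d-dimensional feature vector.
    """
    feats = np.stack([phi(*sample_state_action()) for _ in range(num_samples)])
    cov = feats.T @ feats / num_samples            # empirical feature covariance
    return float(np.linalg.eigvalsh(cov).min())    # candidate lower bound kappa
```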
Definition 1.
Given the feature function $\phi$, denote the largest inner product between a parameter $\theta$ and the feature function as $\|\theta\|_{\phi} := \max_{(s, a) \in \mathcal{S} \times \mathcal{A}} \left| \langle \theta, \phi(s, a) \rangle \right|$.
Definition 2.
Let $I$ be the identity matrix of dimension $d$ and let $\eta$ be the learning rate. Define the matrix $A_j$ recursively as follows:
$A_j = A_{j-1}\left(I - \eta\, \phi_j \phi_j^{\top}\right), \qquad A_0 = I,$ (2)
where we use the simplified notation $\phi_j$ to denote $\phi(s_j, a_j)$. The explicit form for $A_L$ is:
$A_L = \left(I - \eta\, \phi_1 \phi_1^{\top}\right)\left(I - \eta\, \phi_2 \phi_2^{\top}\right) \cdots \left(I - \eta\, \phi_L \phi_L^{\top}\right).$ (3)
The semantic interpretation of $A_L$ in Definition 2 is that it represents the coefficient of the bias term in the error analysis of the learning algorithm's parameter (as outlined in Lemma 4). This joint product arises because the RER algorithm updates the parameter over a subsequence of consecutive transitions of length $L$. The norm of $A_L$ is influenced by both the sequence length $L$ and the learning rate $\eta$. When the norm of $A_L$ is small, the parameters of the learning model converge more rapidly to their optimal values.
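To build intuition for how the norm of $A_L$ shrinks with $\eta$ and $L$, the following sketch forms the product in Equation (3) from random unit-norm features and reports its spectral norm; the feature distribution is purely illustrative.

```python
import numpy as np

def bias_coefficient_norm(d=8, L=20, eta=0.1, seed=0):
    """Spectral norm of A_L = (I - eta phi_1 phi_1^T) ... (I - eta phi_L phi_L^T)."""
    rng = np.random.default_rng(seed)
    A = np.eye(d)
    for _ in range(L):
        phi = rng.normal(size=d)
        phi /= np.linalg.norm(phi)                 # features with norm at most 1
        A = A @ (np.eye(d) - eta * np.outer(phi, phi))
    return np.linalg.norm(A, 2)

# Longer sequences and larger learning rates contract the bias coefficient more.
for eta in (0.05, 0.2):
    print(eta, [round(bias_coefficient_norm(L=L, eta=eta), 3) for L in (5, 20, 80)])
```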
3 Methodology
3.1 Motivation
Let $\mu$ denote the stationary distribution of the state-action pairs in the MDP, $\eta$ the learning rate of the gradient descent algorithm, and $L$ the length of consecutive transitions processed by the RER algorithm. Previous work (Agarwal et al., 2022, Lemma 8 and Lemma 14) established that, when $\eta$ and $L$ satisfy a restrictive joint smallness condition, the following inequality holds:
(4) |
where the matrix $A_L$ is defined in Definition 2 and serves as a “coefficient” in the convergence analysis, as outlined in Lemma 4. The positive semi-definite relation $\preceq$ between two matrices is defined in Definition 4. Here, $I$ represents the identity matrix of dimension $d$, and the coefficient $\kappa$ is introduced in Assumption 2. The matrix $A_L$ was mentioned in (Agarwal et al., 2022, Appendix E, Equation 5), but we provide a formal definition here and streamline the original expression by removing unnecessary variables.
The condition in Equation (4) was further incorporated into the convergence requirement in (Agarwal et al., 2022, Theorem 1). It suggests that the RER algorithm cannot handle sequences of consecutive transitions that are too long (corresponding to a large $L$) or a learning rate $\eta$ that is too large. This presents a major gap between the theoretical justification and the real-world application of the RER algorithm. In this work, we address this gap by providing a tighter theoretical analysis that relaxes this constraint.
We begin by explaining the main difficulty in upper-bounding the term $\mathbb{E}_{\mu}\left[A_L^{\top} A_L\right]$. According to Definition 2, we can expand the joint product $A_L^{\top} A_L$. Using the linearity of expectation, we expand the entire joint product under the expectation as follows:
(5) |
In the third term on the right-hand side (RHS) of the second line, the summation is over all valid combinations of the indices $i$ and $j$. This is determined by first selecting the index $i$ from the index sequence, as seen in the first row of the equation above. The second index $j$ is then chosen, ensuring that $\phi_j$ lies to the right of $\phi_i$. The valid-combination constraint requires the entire sequence to satisfy the condition that $\phi_i$ must appear to the left of $\phi_j$.
The main challenge in upper-bounding the entire product under expectation lies in bounding the combinatorially many high-order terms. Our approach leverages the high-level idea that the RHS of Equation (3.1) can be upper-bounded by a form involving only $\mathbb{E}_{\mu}\left[\phi \phi^{\top}\right]$ with an appropriate coefficient. Specifically, we demonstrate that the third term on the RHS, which contains a large number of combinatorial high-order terms, can be bounded by terms involving only individual features $\phi_k$ (with $1 \leq k \leq L$) through the use of a proposed combinatorial counting method.
Theorem 1.
Let $\mu$ be the stationary distribution of the state-action pairs in the MDP. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds:
(6)
where the matrix $A_L$ is defined in Definition 2. The relation $\preceq$ between the matrices on both sides is the positive semi-definite ordering defined in Definition 4.
Proof Sketch.
By the linearity of expectation, we can upper-bound the second part of Equation (3.1) as follows:
(7)
Based on the new analysis from Lemma 2, the third part in Equation (3.1) is upper-bounded as:
Combining these two inequalities, we arrive at the upper bound stated in the theorem. A detailed proof can be found in Appendix B. ∎
Theorem 1 is established based on the new analysis in Lemma 2, which is introduced in Section 3.2. It serves as a key component in the final convergence proof of the RER algorithm, which is presented in Section 4.
Numerical Justification of the Tighter Bound
We provide a numerical evaluation of our derived bound and the original bound in Agarwal et al. (2022, Lemma 8) in Figure 1. (The code implementation for the numerical evaluation of the equalities and inequalities in this paper is available at https://github.com/jiangnanhugo/RER-proof.) For a fixed value of the sequence length $L$, we compare the value of our derived upper-bound expression with the original value. Across all sequence lengths, our derived expression is numerically higher than the original one, which implies that our bound (in Lemma 3) is tighter than the original bound in Agarwal et al. (2022, Lemma 8).
[Figure 1: Numerical comparison of our derived bound with the original bound from Agarwal et al. (2022, Lemma 8) across different sequence lengths.]
3.2 Relaxing the Requirement through Combinatorial Counting
Lemma 1.
The proof of this inequality can be found in Appendix A.1.
This result implies that, after the relaxation, only the first term (indexed by $i$) and the last term (indexed by $j$) determine the upper bound of the high-order term. This relaxation reduces the original complex summation problem to counting how many valid $i$ and $j$ can be selected at each possible position in the sequence of transitions.
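The relaxation idea can be checked numerically: for features with norm at most one, a chained term $v^{\top}\phi_{i_1}\phi_{i_1}^{\top}\cdots\phi_{i_m}\phi_{i_m}^{\top}v$ factors into scalars whose middle factors are at most one in absolute value, so it is bounded by $\frac{1}{2}\big(\langle v, \phi_{i_1}\rangle^{2} + \langle v, \phi_{i_m}\rangle^{2}\big)$ via Cauchy-Schwarz and AM-GM (Lemmas 7 and 8). The sketch below verifies this bounding idea on random data; it illustrates the relaxation rather than reproducing the exact statement of Lemma 1.

```python
import numpy as np

def high_order_term(v, feats):
    """Compute v^T (phi_1 phi_1^T) (phi_2 phi_2^T) ... (phi_m phi_m^T) v."""
    out = v.copy()
    for phi in feats:
        out = phi * (phi @ out)          # apply phi phi^T
    return float(v @ out)

rng = np.random.default_rng(0)
d, m = 6, 5
v = rng.normal(size=d)
feats = rng.normal(size=(m, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)     # ||phi|| <= 1

lhs = abs(high_order_term(v, feats))
rhs = 0.5 * ((v @ feats[0]) ** 2 + (v @ feats[-1]) ** 2)  # first and last features only
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```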
Lemma 2.
Based on the relaxation provided in Lemma 1, the third part in Equation (3.1) can be expanded combinatorially as follows:
(9) |
Sketch of Proof.
As depicted in Figure 2, we consider two arrays of length $L$. The indices in these arrays are symmetric: the left array decreases from $L$ to $1$, while the right array increases from $1$ to $L$. These arrays represent the indices of the matrix products in the first line of Equation (3.1): the left array simulates the indices of the left product, and the right array simulates the indices of the right product. The key idea is to count the number of combinations of $i$ and $j$ that contribute a term associated with a fixed index $k$ (where $1 \leq k \leq L$).
[Figure 2: Two index arrays of length L; the left array decreases from L to 1 and the right array increases from 1 to L, representing the indices of the matrix products in Equation (3.1).]
In the first case, illustrated in Figure 2, we fix $i$ in the left $k$-th slot. Due to the sequential ordering constraint, which requires that $\phi_i$ must be to the left of $\phi_j$, the index $j$ cannot be placed in the slots of the left array that violate this ordering. Additionally, to avoid double counting, we also exclude the right $k$-th slot for $j$. Consequently, only the remaining slots are available for assigning the rest of the sequence, and counting them gives the contribution of this case, as shown on the right-hand side.
Lemma 2 shows how to simplify the complex summation into a more manageable form involving only individual feature terms. This transformation significantly simplifies the task of obtaining a tighter upper bound.
Lemma 3.
For admissible $\eta$ and $L$, the weighted sum in Lemma 2 admits a closed-form upper bound; the precise statement and proof are given in Appendix A.3.
4 Sample Complexity of Reverse Experience Replay-Based $Q$-Learning on Linear MDPs
The convergence analysis assumes that every sub-trajectory of length $L$ is almost (or asymptotically) independent of the others with high probability. This condition, known as the mixing requirement for Markovian data, implies that the statistical dependence between two sub-trajectories diminishes as they become further apart along the trajectory (Tagorti & Scherrer, 2015; Nagaraj et al., 2020).
Prior work (Lee et al., 2019) provided a convergence proof for the Reverse Experience Replay (RER) approach but did not address the rate of convergence, primarily due to the challenges associated with analyzing deep neural networks. By contrast, linear MDPs (defined in Assumption 1), which approximate the reward function and transition kernel linearly via features, allow for an asymptotic performance analysis of RER. Recently, Agarwal et al. (2022) presented the first theoretical convergence-rate proof for RER. However, their analysis is limited by stringent conditions, notably requiring a very small learning rate. This constraint suggests that RER may struggle to compete with plain Experience Replay (ER) when using larger learning rates.
To address this challenge, we provide a tighter theoretical analysis of the RER method in Theorem 1. Our analysis mitigates the constraints on the learning rate required for convergence. We demonstrate that the convergence rate can be improved with a larger learning rate and a longer sequence of state-action-reward tuples, thus bridging the gap between theoretical convergence analysis and empirical learning results.
Lemma 4 (Bias and variance decomposition).
Define the error term for each parameter as the difference between the empirical estimate and the true MDP. For the current iteration, the difference between the current estimated parameter and the optimal parameter, accumulated along the length-$L$ sub-trajectory with reverse updates, is:
(10) |
For clarity, $A_L$ in Definition 2 is a joint product of terms involving the feature vectors of the consecutive state-action tuples. When the norm of $A_L$ is small, the parameter quickly converges to its optimum.
The first part on the RHS is referred to as the bias, and the second part on the RHS is the variance along the sub-trajectory, which we will later show has zero mean.
The proof is presented in Appendix C.1. The result is obtained by unrolling the update for $L$ consecutive steps in reverse order, according to Lines 5-7 in Algorithm 1. This allows us to separately quantify the upper bounds of the bias term and the variance term.
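The unrolling can also be checked numerically. The sketch below runs $L$ noisy linear updates and verifies that the final parameter error equals the product-form bias term plus the accumulated noise terms; the update rule used here is a generic linear TD-style recursion standing in for Lines 5-7 of Algorithm 1, so it illustrates the decomposition rather than reproducing the exact algorithm.

```python
import numpy as np

def check_bias_variance_unrolling(d=5, L=10, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    phis = rng.normal(size=(L, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)   # unit-norm features
    noise = rng.normal(scale=0.1, size=L)
    theta_star = rng.normal(size=d)
    theta = np.zeros(d)
    e0 = theta - theta_star

    # Run L noisy linear updates with targets y_k = <phi_k, theta*> + eps_k.
    for k in range(L):
        y = phis[k] @ theta_star + noise[k]
        theta = theta + eta * (y - phis[k] @ theta) * phis[k]

    def contraction(from_k):
        # prod over m = L-1 down to from_k of (I - eta phi_m phi_m^T), later factors on the left
        A = np.eye(d)
        for m in range(L - 1, from_k - 1, -1):
            A = A @ (np.eye(d) - eta * np.outer(phis[m], phis[m]))
        return A

    bias = contraction(0) @ e0
    variance = sum(contraction(k + 1) @ (eta * noise[k] * phis[k]) for k in range(L))
    assert np.allclose(theta - theta_star, bias + variance)
    return np.linalg.norm(bias), np.linalg.norm(variance)

print(check_bias_variance_unrolling())
```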
Lemma 5 (Bound on the bias term).
Let $v$ be a non-zero vector and let $N$ be the frequency at which the target network is updated. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds with probability at least $1 - \delta$:
(11) |
The $\phi$-based norm is defined in Definition 1.
Regarding the bound on the variance term in Lemma 4, even though the matrix $A_L$ is involved in the expression, it turns out we do not need to modify the original proof, and thus we follow the result of the original work. The exact statement is presented in Appendix C.3.
Theorem 2.
For a linear MDP, assume the reward function and the feature vectors are bounded for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Let $T$ be the maximum number of learning episodes, $N$ be the frequency of the target network update, $\eta$ be the learning rate, and $L$ be the length of the sequence for RER described in Algorithm 1. Under the stated conditions on $\eta$ and $L$, with sample complexity
(12)
the learned $Q$-function is within $\epsilon$ of the optimal $Q^{*}$ with probability at least $1 - \delta$.
Sketch of Proof.
We first establish the approximate independence of sub-trajectories of length $L$. We then decompose the error term of the $Q$-value estimate using a bias-variance decomposition (as shown in Lemma 4), where the RER method and the target network help control the variance term using martingale sequences. The upper bound for the bias term is given in Lemma 5, and the upper bound for the variance term is presented in the theorem restated in Appendix C.3 (Agarwal et al., 2022, Theorem 4). Finally, we summarize the results and provide the complete proof in Lemma 6, leading to the probabilistic bound in this theorem. ∎
Compared to the original theorem in Agarwal et al. (2022, Theorem 1), our work provides a tighter upper bound and relaxes the assumptions needed for the result to hold. This advancement bridges the gap between theoretical justification and empirical MDP evaluation. Furthermore, we hope that the new approach of transforming the original problem into a combinatorial counting problem will inspire further research in related domains.
5 Conclusion
In this work, we gave a tighter finite-sample analysis of reverse experience replay, a heuristic heavily used in practical $Q$-learning, and showed that seemingly simple modifications can have far-reaching consequences in linear MDP settings. We provided a rigorous analysis that relaxes the limitations on the learning rate and the length of the consecutive sequence. Our key idea is to transform the original problem, which involves a giant summation, into a combinatorial counting problem, which greatly simplifies the analysis. Finally, we showed theoretically that RER converges faster with a larger learning rate and a longer consecutive sequence of state-action-reward tuples.
Acknowledgments
We thank all the reviewers for their constructive comments. We also thank Yi Gu for his feedback on the theoretical justification part of this paper. This research was supported by NSF grant CCF-1918327 and DOE – Fusion Energy Science grant: DE-SC0024583.
References
- Agarwal et al. (2022) Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Online target q-learning with reverse experience replay: Efficiently finding the optimal policy for linear mdps. In ICLR. OpenReview.net, 2022.
- Ambrose et al. (2016) R. Ellen Ambrose, Brad E. Pfeiffer, and David J. Foster. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron, 91(5):1124–1136, 2016. ISSN 0896-6273.
- Andrychowicz et al. (2017) Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, pp. 5048–5058, 2017.
- Bai et al. (2021) Chenjia Bai, Lingxiao Wang, Lei Han, Jianye Hao, Animesh Garg, Peng Liu, and Zhaoran Wang. Principled exploration via optimistic bootstrapping and backward induction. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 577–587. PMLR, 2021.
- Bertsekas & Yu (2012) Dimitri P. Bertsekas and Huizhen Yu. Q-learning and enhanced policy iteration in discounted dynamic programming. Math. Oper. Res., 37(1):66–94, 2012.
- Florensa et al. (2017) Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In CoRL, volume 78 of Proceedings of Machine Learning Research, pp. 482–495. PMLR, 2017.
- Foster & Wilson (2006) David J Foster and Matthew A Wilson. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680–683, 2006.
- Goyal et al. (2019) Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy P. Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. In ICLR. OpenReview.net, 2019.
- Haddad (2021) Roudy El Haddad. Repeated sums and binomial coefficients. arXiv preprint arXiv:2102.12391, 2021.
- Igata et al. (2021) Hideyoshi Igata, Yuji Ikegaya, and Takuya Sasaki. Prioritized experience replays on a hippocampal predictive map for learning. Proceedings of the National Academy of Sciences, 118(1), 2021.
- Jain et al. (2021) Prateek Jain, Suhas S. Kowshik, Dheeraj Nagaraj, and Praneeth Netrapalli. Streaming linear system identification with reverse experience replay. In NIPS, volume 34, pp. 30140–30152. Curran Associates, Inc., 2021.
- Kumar et al. (2020) Aviral Kumar, Abhishek Gupta, and Sergey Levine. Discor: Corrective feedback in reinforcement learning via distribution correction. In NeurIPS, volume 33, pp. 18560–18572, 2020.
- Lahire et al. (2022) Thibault Lahire, Matthieu Geist, and Emmanuel Rachelson. Large batch experience replay. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 11790–11813. PMLR, 2022.
- Lee et al. (2019) Su Young Lee, Sung-Ik Choi, and Sae-Young Chung. Sample-efficient deep reinforcement learning via episodic backward update. In NeurIPS, pp. 2110–2119, 2019.
- Li et al. (2022) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. IEEE Trans. Inf. Theory, 68(1):448–473, 2022.
- Lin (1992) Long Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn., 8:293–321, 1992.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Nagaraj et al. (2020) Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. In NeurIPS, 2020.
- Puterman (1994) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.
- Rotinov (2019) Egor Rotinov. Reverse experience replay. CoRR, abs/1910.08780, 2019.
- Saglam et al. (2023) Baturay Saglam, Furkan B. Mutlu, Dogan Can Çiçek, and Suleyman S. Kozat. Actor prioritized experience replay. J. Artif. Intell. Res., 78:639–672, 2023.
- Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR, 2016.
- Tagorti & Scherrer (2015) Manel Tagorti and Bruno Scherrer. On the rate of convergence and error bounds for lstd (). In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 1521–1529. JMLR.org, 2015.
- van Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pp. 2094–2100. AAAI Press, 2016.
- Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- Watkins & Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Technical note q-learning. Mach. Learn., 8:279–292, 1992.
- Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, 1989.
- Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In ICML, volume 119 of Proceedings of Machine Learning Research, pp. 10978–10989. PMLR, 2020.
Appendix A Proof of Combinatorial Counting in Reverse Experience Replay
In this section, we present detailed proof for each of the necessary lemmas used for Theorem 1.
A.1 Proof of Lemma 1
Lemma.
Let $v$ be any $d$-dimensional non-zero vector, i.e., $v \neq 0$. Consider one high-order term in Equation (3.1). We have:
(13)
where $|\cdot|$ computes the absolute value.
A.2 Proof of Lemma 2
[Figure 3: Illustration of the counting cases described below; panel (d) depicts the case where both endpoints occupy the k-th slots.]
Lemma.
Based on the relaxation in Lemma 1, the weighted summation in Equation (3.1) can be expanded combinatorially as follows:
(16) |
Proof.
In the main paper, we outlined the steps for transforming the problem of estimating the large summation involving many high-order terms into a combinatorial counting problem. Here, we present the details for the remaining cases beyond what was sketched in Figure 2, where we focused on one instance of selecting $i$ and $j$.
More specifically, the element $\phi_k \phi_k^{\top}$ associated with the $k$-th slot arises from the following cases:
- (a) When $i$ selects the $k$-th slot and $j$ does not; this includes two subcases.
- (b) When $j$ selects the $k$-th slot and $i$ does not; this is symmetric to the previous case and also includes two subcases.
- (c) When $i$ picks the left $k$-th slot and $j$ picks the right $k$-th slot, the intermediate elements are selected from a subarray of limited size. To ensure that the number of available slots is at least the number of intermediate elements, a corresponding condition on the indices is required. See Figure 3(d) for a visual example.
Note that cases (a) and (b) yield the same count, while case (c) is counted twice because both the first and last elements occupy the $k$-th slots. Thus, we derive the final result:
(17) |
∎
A.3 Proof of Lemma 3
Lemma.
For admissible $\eta$ and $L$, we have:
(18)
(19)
Proof.
We start by separating the sum into three parts:
(20) |
Then we use the binomial series expansion of $(1 - x)^{-m}$, which converges when $|x| < 1$. Substituting the relevant quantity for $x$, we have:
(21) |
Setting the corresponding parameters, we obtain the result for the first sum:
(22) |
Note that the binomial coefficient $\binom{n}{k}$ is defined to be zero when $k > n$, because one cannot choose more elements than are available. In the above equation, the coefficient is therefore zero for out-of-range indices, and thus we limit the summation range accordingly.
Similarly, setting the corresponding parameters, we obtain the result for the second sum:
(23) |
Similarly, the binomial coefficient is zero for out-of-range indices, and thus the summation is restricted accordingly.
For the third sum, we adjust the index and apply the binomial theorem:
(24) |
By combining all three simplified sums, we obtain the final result:
(25) |
The above result holds when the series converges; otherwise, the series could diverge due to the unbounded growth of the terms. Since the learning rate is always positive, the convergence condition translates into an upper bound on the learning rate. This completes the proof. ∎
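As a quick numerical sanity check of the series manipulations above, the sketch below compares a truncated binomial-type series against its closed form for $|x| < 1$; the particular identity shown (the negative binomial expansion) is a standard one and stands in for the elided expressions in this proof.

```python
import math

def negative_binomial_series(x, k, terms=500):
    """Truncated sum_{n>=0} C(n+k, k) x^n, which equals (1-x)^{-(k+1)} for |x| < 1."""
    return sum(math.comb(n + k, k) * x**n for n in range(terms))

for x in (0.1, 0.5, 0.9):
    for k in (0, 1, 3):
        approx = negative_binomial_series(x, k)
        exact = (1 - x) ** (-(k + 1))
        assert abs(approx - exact) < 1e-6 * exact
print("series identities verified for sample (x, k) pairs")
```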
We aim to establish upper and lower bounds for the expression in Lemma 3 that do not depend on the summation variable; the results are provided in the following remark.
Remark 2.
For admissible $\eta$ and $L$, the following inequalities hold:
(26)
(27)
Within the admissible range, both the upper and lower bounds are positive.
Proof.
Each term in the expression is monotone over the relevant range, so its maximum and minimum are attained at the endpoints of that range. Thus, the upper bound of the entire expression is obtained by combining the maximum values of the individual terms, and the lower bound by combining their minimum values. ∎
Appendix B Theoretical Justification of Theorem 1
Theorem.
Let $\mu$ be the stationary distribution of the state-action pairs in the MDP. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds:
(28) |
where the matrix $A_L$ is defined in Definition 2. Here “$\preceq$” denotes the positive semi-definite ordering between the matrices on both sides (see Definition 4).
Proof.
Based on Equation (3.1), we have:
(29) |
To the best of our knowledge, Theorem 1 together with its theoretical justification is novel and has not been used in any prior analysis of experience replay.
Appendix C Sample Complexity Proof of Theorem 2
We acknowledge that the main structure of the convergence proof follows the original work. Our contribution here is to present a cleaner proof pipeline and to integrate our tighter bound from Theorem 1.
C.1 Proof of Bias-Variance Decomposition for the Error in Lemma 4
Lemma.
Define the error term for each parameter as the difference between the empirical estimate and the true MDP. For the current iteration, the difference between the current estimated parameter and the optimal parameter, accumulated along the length-$L$ sub-trajectory with reverse updates, is:
(38) |
Proof.
As shown in Lines 5-7 of Algorithm 1, we use a sampled sub-trajectory of length $L$ to execute the $Q$-update in reverse order at each iteration:
(39)
(40)
where $I$ denotes the identity matrix and $\bar{\theta}$ is the parameter of the target network. The second equality is obtained by rearranging the terms of the first equality. One operator is rewritten in an equivalent form for the purpose of rigorous analysis, and $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors of the same dimension.
Let $Q^{*}(s, a)$ be the optimal $Q$-value at state $s$ when taking action $a$, and assume $\theta^{*}$ is the optimal parameter; the Bellman optimality equation is written as:
Define the error term for parameter $\theta$ and the $i$-th tuple as the difference between the empirical estimate and the true probabilistic MDP:
(41)
(42)
(43)
For every tuple in the sub-trajectory, apply the Bellman optimality equation to the RER update at the optimal parameter $\theta^{*}$:
(44) |
Right-multiplying both sides of the above equation by the corresponding feature term and combining with the first equation in this proof, we get
(45) |
where the error term is defined in Equation (43). Combining this with Definition 2 and recursively expanding the RHS, we obtain the difference with respect to the optimal parameter after reversely updating for $L$ consecutive steps:
(46) |
The proof is thus finished. ∎
Remark 3.
Suppose we execute Algorithm 1 for $T$ iterations; we would then get:
(47) |
where the remaining notation is defined above. The first term on the RHS is the bias, which decays geometrically, and the second term on the RHS is the variance along the sub-trajectory of length $L$, which we will later show has zero mean.
C.2 Bound the Bias Term in Remark 3
We briefly explain here how to quantify the upper bound of the bias term in Lemma 4. In comparison to (Agarwal et al., 2022, Lemma 8), the following Lemma 5 requires a weaker condition on $\eta$ and $L$, which is a relaxation of the original work.
Lemma.
Let $v$ be a non-zero vector and let $N$ be the frequency at which the target network is updated. For admissible $\eta$ and $L$, the following positive semi-definite matrix inequality holds:
(48) |
Furthermore, the following inequality holds with probability at least $1 - \delta$:
(49) |
The $\phi$-based norm is defined in Definition 1.
Proof.
From Theorem 1, we notice that one term converges to zero exponentially and can thus be omitted at sufficient precision. We therefore obtain the simplified form
(50)
(51)
Now, we observe that:
(52) |
Therefore, we take the expectation of Equation (52) conditioned on the preceding iterations:
(53) (using Equation (51))
Applying the equation above inductively, we conclude the result:
(54)
(55)
The last step of the numerical approximation follows from a standard inequality.
We then extend the result to the $\phi$-based norm (Definition 1) as follows:
(56) |
Thus, by Markov’s inequality, with probability at least , the following event holds:
(57) |
∎
C.3 Bound the Variance Term in Remark 3
Even though the matrix $A_L$ is involved in the expression, it turns out we do not need to modify the original proof, and thus we follow the result of the original work. Please see (Agarwal et al., 2022, Appendix L.3) for the original proof steps.
Theorem (Agarwal et al. (2022) Theorem 4).
Suppose the relevant parameters are fixed. There exists a universal constant such that the following event holds with probability at least $1 - \delta$:
(58) |
C.4 Overall Sample Complexity Analysis
Lemma 6.
There exist constants such that the following two conditions hold:
1. ;
2. .
Consider the target network index ranging over the total number of target network updates. The following holds with probability at least $1 - \delta$:
1. For the target network, ;
2. For the error accumulated along $L$ consecutive steps of reverse updates, .
When we combine the above two cases, we have:
We can obtain the sample complexity of the whole learning framework by setting the RHS to the target accuracy, which recovers the result shown in Theorem 2.
Appendix D Extra Notations and Definitions
Definition 3 (-Net, Covering Number and Metric Entropy).
Given a metric space $(\mathcal{X}, d)$, consider a region $K \subseteq \mathcal{X}$. 1) A subset $\mathcal{N} \subseteq K$ is called an $\varepsilon$-net of $K$ (for $\varepsilon > 0$) if every point of $K$ is within distance $\varepsilon$ of a point of $\mathcal{N}$: for every $x \in K$ there exists $y \in \mathcal{N}$ with $d(x, y) \leq \varepsilon$.
2) Equivalently, we denote by $\mathcal{N}(K, \varepsilon)$ the smallest number of closed balls with centers in $K$ and radius $\varepsilon$ whose union covers $K$; this is the covering number. 3) Metric entropy: the logarithm of the covering number, $\log \mathcal{N}(K, \varepsilon)$.
Theorem 3 (Dudley’s Integral Inequality).
Let $(X_t)_{t \in T}$ be a zero-mean random process on the metric space $(T, d)$ with sub-gaussian increments. Then $\mathbb{E}\left[\sup_{t \in T} X_t\right] \leq C \int_{0}^{\infty} \sqrt{\log \mathcal{N}(T, d, \varepsilon)}\, \mathrm{d}\varepsilon$ for a universal constant $C$.
Lemma 7 (Cauchy-Schwarz Inequality).
Based on Assumption 2, the feature vector of each state-action pair is bounded, $\|\phi(s, a)\| \leq 1$ for all $(s, a)$. For any parameter vector $\theta$, we have:
(60) |
Definition 4 (Positive Semi-Definite Property of Matrix).
For two symmetric matrices $A, B \in \mathbb{R}^{d \times d}$, we say $A \preceq B$ if $B - A$ is positive semi-definite, i.e., $v^{\top}(B - A)\, v \geq 0$ for every non-zero $d$-dimensional vector $v$.
Lemma 8 (AM–GM Inequality).
The inequality of arithmetic and geometric means, or more briefly the AM-GM inequality, states that the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list. For $a, b \geq 0$, we have $\sqrt{ab} \leq \frac{a + b}{2}$.
Lemma 9 (Recursive Formula).
Let $n$ and $k$ be positive integers, and let $\binom{n}{k}$ denote the binomial coefficient, computed as $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$. Then for all $1 \leq k \leq n$, we have: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.
Lemma 10 (Rising Sum of Binomial Coefficients).
Let $n$ and $m$ be positive integers. Then: $\sum_{j=0}^{m} \binom{n + j}{n} = \binom{n + m + 1}{n + 1}$.
Lemma 11 (Vandermonde identity Haddad (2021)).
For any non-negative integers $m$, $n$, and $p$ such that $p \leq m + n$, we have: $\sum_{k=0}^{p} \binom{m}{k} \binom{n}{p - k} = \binom{m + n}{p}$.
Lemma 12.
For any admissible parameters, we have that the first sum has a closed form, and the second sum also has a closed form.
Lemma 13 (Linearity of Expectation).
For random variables $X, Y$ and constants $a, b$, we have: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$.