Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning
Recently there has been a surge of interest in understanding the horizon-dependence of the sample complexity in reinforcement learning (RL). Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interactions when the number of states and actions is fixed. It is yet unknown whether the $\mathrm{polylog}(H)$ dependence is necessary or not. In this work, we resolve this question by developing an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions, completely settling the horizon-dependence of the sample complexity in RL. We achieve this bound by (i) establishing a connection between value functions in discounted and finite-horizon Markov decision processes (MDPs) and (ii) a novel perturbation analysis in MDPs. We believe our new techniques are of independent interest and could be applied in related questions in RL.
1 Introduction
Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of $H$ steps and receives a sequence of random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy that maximizes the expected sum of these reward values instead of any single one of them. Since decisions made at early stages could significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when $H = 1$, RL reduces to the contextual bandits problem, in which it suffices to act myopically to achieve optimality.
Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is $H$ and the underlying MDP has unknown and time-invariant transition probabilities and rewards. Starting from some unknown but fixed initial state distribution, how many episodes of interaction with the MDP do we need to learn an $\varepsilon$-optimal policy, whose value (the expected sum of rewards collected by the policy) differs from that of the optimal policy by at most $\varepsilon \cdot r_{\max}$? Here $r_{\max}$ is the maximum possible sum of rewards in an episode and $\varepsilon$ is the target accuracy. Jiang and Agarwal [JA18] conjecture that the number of episodes required by the above problem depends polynomially on $H$ even when the total number of state-action pairs is a constant. Wang et al. [WDYK20] refute this conjecture by showing that the correct sample complexity of episodic RL is at most $\mathrm{polylog}(H)$ when there are a constant number of states and actions. The result in [WDYK20] reveals the surprising fact that one can achieve similar sample efficiency regardless of the horizon length of the RL problem. A number of recent results [ZJD20, MDSV21, ZDJ20, RLD+21] further improve the result in [WDYK20], obtaining more (computationally or statistically) efficient algorithms and/or handling more general settings. However, these results all have a $\mathrm{polylog}(H)$ factor in their sample complexity, seemingly implying that the $\mathrm{polylog}(H)$ dependence is necessary. Hence, it is natural to ask:
What is the exact dependence on the horizon length of the sample complexity in episodic RL?
The goal of the current paper is to answer the above question.
Compared to other problems in machine learning, RL is challenging mainly because it requires an effective exploration mechanism. For example, to obtain samples from some state in an MDP, one needs to first learn a policy that can reach that state with good probability. In order to decouple the sample complexity of exploration from that of learning a near-optimal policy, there is a line of research (see, e.g., [KS99, AMK13, SWW+18, Wan17, YW19, LWC+20]) studying the generative model setting, where a learner is able to query samples from any state-action pair, circumventing the issue of exploration. The above sample complexity question is still well-posed even with a generative model. However, to achieve a fair comparison with the RL setting, we measure the sample complexity as the number of batches, where a batch corresponds to $H$ queries in the generative model. Indeed, in the episodic RL setting, the learner also obtains $H$ samples in each episode. Nevertheless, we are not aware of any result addressing the above question even with a generative model.
1.1 Our Results
In this paper, we settle the horizon-dependence of the sample complexity in episodic RL by developing an algorithm whose sample complexity is completely independent of the planning horizon $H$. We state our results in the following theorems, which target RL with an $H$-horizon MDP (formally defined in Section 2) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
Theorem 1.1 (Informal version of Corollary 5.11).
Suppose the reward at each time step is non-negative and the total reward of each episode is upper bounded by 1.¹ Given a target accuracy $\varepsilon$ and a failure probability $\delta$, our algorithm returns an $\varepsilon$-optimal non-stationary policy with probability at least $1 - \delta$ by sampling a number of episodes that depends only on the number of states $S$, the number of actions $A$, the accuracy $\varepsilon$, and the failure probability $\delta$, and is completely independent of $H$.
¹Without loss of generality, we normalize so that $r_{\max} = 1$.
Notably, when the number of states $S$, the number of actions $A$, the desired accuracy $\varepsilon$, and the failure probability $\delta$ are all fixed constants, the (episode) sample complexity of our algorithm is also a constant. Our result suggests that episodic RL is possible with a sample complexity that is completely independent of the horizon length $H$.
In fact, we can show a much more general result beyond finding the optimal policy. We prove that, using the same number of episodes, one can construct an oracle that returns an $\varepsilon$-approximation of the value of any non-stationary policy $\pi$, as stated in the following theorem.
Theorem 1.2 (Informal version of Theorem 5.10).
In the same setting as Theorem 1.1, using the same number of episodes, with probability at least $1 - \delta$, we can construct an oracle such that, for every given non-stationary policy $\pi$, the oracle returns an $\varepsilon$-approximation of the expected total reward of $\pi$.
This theorem suggests that even completely learning the MDP (i.e., being able to simultaneously estimate the value of all non-stationary policies) can be done with a sample complexity that is independent of $H$.
We now switch to the more powerful generative model and show that a better sample complexity can be achieved. In the generative model, the agent can query samples from any state-action pair and from the initial state distribution of the environment. Here, the sample complexity is defined as the total number of batches of queries (a batch corresponds to $H$ queries) to the environment needed to obtain an $\varepsilon$-optimal policy.²
²It is well-known that any algorithm requires a number of queries that grows at least linearly with $H$ to find a near-optimal policy (see, e.g., [LH12]), and thus it is reasonable to define a single batch to be $H$ queries.
Theorem 1.3 (Informal version of Theorem 6.3).
Suppose the reward at each time step is non-negative and the total reward of each episode is upper bounded by 1. Given a target accuracy $\varepsilon$ and a failure probability $\delta$, our algorithm returns an $\varepsilon$-optimal non-stationary policy with probability at least $1 - \delta$ by sampling a number of batches of queries in the generative model that depends only polynomially on the number of states $S$ and the number of actions $A$ (and on the accuracy $\varepsilon$ and the failure probability $\delta$), and is independent of $H$.
Compared to the result in Theorem 1.1, the sample complexity in Theorem 6.3 has only polynomial dependence on the number of states $S$ and has better dependence on the desired accuracy $\varepsilon$.
We remark that although our algorithms in Theorem 1.1 and Theorem 1.3 achieve tight dependence in terms of $H$, they do not aim to tighten the dependence on the number of states $S$, the number of actions $A$, or the desired accuracy $\varepsilon$. In fact, the sample complexity of our algorithms has much worse dependence on $S$, $A$, and $\varepsilon$ compared to prior work. See Section 1.2 for a detailed discussion of prior work, and Section 3 for an overview of our techniques.
1.2 Related Work
In this section, we discuss related work on the sample complexity of RL in the tabular setting, where the number of states $S$, the number of actions $A$, and the horizon length $H$ are all assumed to be finite. There is a long line of research studying the sample complexity of tabular RL [KS02, BT03, Kak03, SLW+06, SL08, KN09, BT09, JOA10, SS10, LH12, ORVR13, DB15, AOM17, DLB17, OVR17, JAZBJ18, FPL18, TM18, DLWB19, DWCW19, SJ19, Rus19, ZJ19, ZZJ20, YYD21, PBPH+20, NPB20, WDYK20, ZJD20, MDSV21, ZDJ20, RLD+21]. In most prior work, up to a scaling factor, the standard assumption on the reward values is that $r_h \in [0, 1/H]$ for all $h \in [H]$ and hence $\sum_{h=1}^{H} r_h \le 1$, where $r_h$ is the reward value collected at step $h$ of an episode. We refer to this assumption as the reward uniformity assumption. However, Jiang and Agarwal [JA18] point out that one should only impose an upper bound on the summation of the reward values, i.e., $\sum_{h=1}^{H} r_h \le 1$, which we refer to as the bounded total reward assumption, instead of imposing uniformity, which could be much stronger. This new assumption allows a fair comparison between long-horizon and short-horizon problems, and is also more natural when the reward signal is sparse. Also note that algorithms designed under the reward uniformity assumption can be modified to work under the bounded total reward assumption; however, the sample complexity is usually worsened by $\mathrm{poly}(H)$ factors. Many existing results in tabular RL focus on regret minimization, where the goal is to collect maximum cumulative reward within a limited number of interactions with the environment. To draw comparisons with our results, we convert them, using standard techniques, to the PAC setting, where the algorithm aims to obtain an $\varepsilon$-optimal policy while minimizing the number of episodes of environment interactions (or batches of queries in the generative model).
Under the reward uniformity assumption, a line of work has attempted to provide tight sample complexity bounds [AOM17, DB15, DLB17, DLWB19, JAZBJ18, OVR17]. To obtain an $\varepsilon$-optimal policy, state-of-the-art results give a two-term bound on the number of episodes that suffice³ [DLWB19, AOM17]. In particular, the first term matches the lower bound up to logarithmic factors [DB15, OVR16, AOM17], while the second term has at least linear dependence on $H$. Moreover, converting these bounds to the bounded total reward setting induces extra $\mathrm{poly}(H)$ factors; for instance, the sample complexity in [AOM17, DLWB19] picks up such factors under the bounded total reward assumption.
³$\widetilde{O}(\cdot)$ omits logarithmic factors.
Recently, there has been a surge of interest in designing algorithms under the bounded total reward assumption. In particular, Zanette and Brunskill [ZB19] develop an algorithm whose sample complexity contains a second term that still scales polynomially with $H$. Wang et al. [WDYK20] show that it is possible to obtain an $\varepsilon$-optimal policy with a sample complexity that depends on the horizon only through $\mathrm{polylog}(H)$ factors, establishing the first sample complexity with polylogarithmic dependence on the horizon length $H$. They achieve this result by using the following ideas: (1) samples collected by different policies can be reused to evaluate other policies; (2) to evaluate all policies in a finite set $\Pi$, the number of sample episodes required scales only logarithmically with $|\Pi|$; (3) one can construct a set of policies $\Pi$ that contains at least one $\varepsilon$-optimal policy for any MDP by using $\varepsilon$-nets over the reward values and the transition probabilities. Here $\Pi$ contains all optimal non-stationary policies of each MDP in the $\varepsilon$-net. The accuracy of the $\varepsilon$-net needs to scale inverse-polynomially with $H$, which induces an inevitable $\mathrm{polylog}(H)$ factor. Other $\mathrm{polylog}(H)$ factors come from Step (2), where they use a potential function, increasing over the run of the algorithm, to measure its progress. It is unclear how to remove any of the log factors in their sample complexity upper bound, and they also conjecture that the optimal sample complexity could still grow with $\log H$, which is refuted by our new result.
Another recent work [ZJD20] obtains a more efficient algorithm whose sample complexity has only polylogarithmic dependence on $H$. Zhang et al. [ZJD20] achieve such a sample complexity by using a novel upper confidence bound (UCB) analysis on a model-based algorithm. In particular, they use a recursive structure to bound the total variance in the MDP, which in turn bounds the final sample complexity. However, their bound critically relies on an upper bound on the total variance of value function moments that inevitably induces $\mathrm{polylog}(H)$ factors. It is unlikely that one can get rid of all $\log H$ factors in their sample complexity by following their approach.
In the generative model, there is also a long line of research studying the sample complexity of finding near-optimal policies; see, e.g., [KS99, Kak03, SY94, AMK13, SWWY18, SWW+18, AKY19, LWC+20]. However, most of these works target the infinite-horizon discounted setting, in which case the problem becomes much easier. This is because in the infinite-horizon discounted setting, there always exists a stationary optimal policy, and the total number of such policies depends only on the number of states $S$ and the number of actions $A$. Sidford et al. [SWW+18] develop an algorithm based on a variance-reduced estimator that finds an $\varepsilon$-optimal policy under the reward uniformity assumption. However, their result relies on splitting samples into sub-groups, and therefore, the lower order term of their sample complexity fundamentally depends on $H$. To the best of our knowledge, even in the stronger generative model, no algorithm for finite-horizon MDPs achieves a sample complexity that is independent of $H$ under the bounded total reward assumption.
2 Preliminaries
Notations.
Throughout this paper, for a given positive integer $N$, we use $[N]$ to denote the set $\{1, 2, \ldots, N\}$. For a condition $\mathcal{E}$, we use $\mathbb{I}[\mathcal{E}]$ to denote the indicator function, i.e., $\mathbb{I}[\mathcal{E}] = 1$ if $\mathcal{E}$ holds and $\mathbb{I}[\mathcal{E}] = 0$ otherwise.
For a random variable $X$ and a real number $\delta \in (0, 1)$, its $\delta$-quantile $Q^{\delta}(X)$ is defined so that $\Pr[X \ge Q^{\delta}(X)] \ge \delta$.
Markov Chains.
Let $M = (\mathcal{S}, P, \mu)$ be a Markov chain, where $\mathcal{S}$ is the state space, $P$ is the transition operator and $\mu$ is the initial state distribution. A Markov chain induces a sequence of random states
$s_1, s_2, s_3, \ldots,$ where $s_1 \sim \mu$ and $s_{t+1} \sim P(\cdot \mid s_t)$ for each $t \ge 1$.
Finite-Horizon Markov Decision Processes.
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, H, \mu)$ be an $H$-horizon Markov Decision Process (MDP), where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $P$ is the transition operator which takes a state-action pair and returns a distribution over states, $R$ is the reward distribution, $H$ is the planning horizon (episode length), and $\mu$ is the initial state distribution. In the learning setting, $P$, $R$ and $\mu$ are unknown but fixed, while $\mathcal{S}$, $\mathcal{A}$ and $H$ are known to the learner. Throughout, we assume $H$ is sufficiently large compared to the other problem parameters, since otherwise we can apply existing algorithms (e.g. [ZJD20]) to obtain a sample complexity that is independent of $H$. Throughout the paper, for a state $s$, we occasionally abuse notation and use $s$ to denote the deterministic distribution that always takes value $s$.
Interacting with the MDP.
We now introduce how an RL agent (or an algorithm) interacts with an unknown MDP. In the setting of Theorem 1.1, in each episode, the agent first receives an initial state $s_1 \sim \mu$. At each time step $h \in [H]$, the agent first decides an action $a_h$, and then observes and moves to the next state $s_{h+1} \sim P(\cdot \mid s_h, a_h)$ and obtains a reward $r_h \sim R(s_h, a_h)$. The episode stops at time $H$, where the final state is $s_{H+1}$. Note that even if the agent decides to stop before time $H$, e.g. at time $h < H$, this still counts as one full episode.
In the generative model setting (as in Theorem 1.3), the agent is allowed to query samples starting from any state-action pair instead of only from the initial state distribution $\mu$, and is allowed to restart at any time. The sample complexity is defined as the number of batches of queries (each batch consists of $H$ queries) the agent uses.
Policy Class.
The final goal of RL is to output a good policy with respect to the unknown MDP that the agent interacts with. In this paper, we consider (non-stationary) deterministic policies which choose an action based on the current state $s$ and the time step $h$. Formally, $\pi = (\pi_1, \pi_2, \ldots, \pi_H)$, where for each $h \in [H]$, $\pi_h : \mathcal{S} \to \mathcal{A}$ maps a given state to an action. The policy $\pi$ induces a (random) trajectory
$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_H, a_H, r_H, s_{H+1},$
where $s_1 \sim \mu$, $a_h = \pi_h(s_h)$, $r_h \sim R(s_h, a_h)$ and $s_{h+1} \sim P(\cdot \mid s_h, a_h)$ for each $h \in [H]$. Throughout the paper, all policies are assumed to be deterministic unless specified otherwise.
Value Functions.
To measure the performance of a policy $\pi$, we define the value of a policy as
$V^{\pi} := \mathbb{E}\left[\sum_{h=1}^{H} r_h\right],$
where the expectation is over the random trajectory induced by executing $\pi$ in $\mathcal{M}$.
Note that in general a policy does not need to be deterministic, and it can even depend on the history, in which case the policy chooses an action at time $h$ based on the entire transition history before $h$. It is well-known (see e.g. [Put94]) that the optimal value of $\mathcal{M}$ can be achieved by a non-stationary deterministic optimal policy $\pi^\star$. Hence, we only need to consider non-stationary deterministic policies.
The goal of the agent is then to find a policy $\pi$ with $V^{\pi} \ge V^{\pi^\star} - \varepsilon$, i.e., to obtain an $\varepsilon$-optimal policy, while minimizing the number of episodes of environment interactions (or batches of queries in the generative model).
For a policy , we also define
to be the -quantile of the visitation frequency of a state-action pair .
Stationary Policies.
For the sake of the analysis, we shall also consider stationary policies. A stationary deterministic policy chooses an action solely based on the current state $s$, i.e., $\pi_1 = \pi_2 = \cdots = \pi_H$. We use $\Pi^{\mathrm{stat}}$ to denote the set of all stationary policies. Note that $|\Pi^{\mathrm{stat}}| = A^{S}$.
We remark that when the horizon length is finite, the value of the best stationary policy and that of the best non-stationary policy can differ significantly. Consider the case where there are two states and two actions. Starting from the first state, taking the first action goes back to that state and obtains a small per-step reward, while taking the second action moves to the second state and obtains a large one-time reward. Taking any action at the second state stays there with reward 0. In this case, the optimal non-stationary policy can collect the per-step reward for the first $H - 1$ steps and exit at time $H$, while deterministic stationary policies can only collect one of the two rewards (and even a randomized stationary policy achieves a value that is smaller by a constant factor).
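To make this gap concrete, the following sketch evaluates both policy classes by backward induction on one instance of the construction above; the specific reward values ($1/(2H)$ for staying and $1/2$ for exiting) and the state/action encoding are our own illustrative choices, consistent with the bounded total reward assumption, and are not taken from the paper.

```python
import itertools
import numpy as np

H = 100       # horizon length
S, A = 2, 2   # states {0: "stay state", 1: "absorbing state"}, actions {0: "stay", 1: "exit"}

# Illustrative choice (ours): at state 0, action 0 stays there with reward 1/(2H);
# action 1 moves to state 1 with reward 1/2. State 1 is absorbing with reward 0,
# so the total reward of any episode is at most 1.
P = np.zeros((S, A, S))
R = np.zeros((S, A))
P[0, 0, 0] = 1.0; R[0, 0] = 1.0 / (2 * H)
P[0, 1, 1] = 1.0; R[0, 1] = 0.5
P[1, :, 1] = 1.0

def value(policy):
    """Value from state 0 of a deterministic policy; policy[h][s] is the action at step h."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        V = np.array([R[s, policy[h][s]] + P[s, policy[h][s]] @ V for s in range(S)])
    return V[0]

# Best non-stationary policy: stay for the first H-1 steps, exit at the last step.
v_nonstat = value([[0, 0]] * (H - 1) + [[1, 0]])
# Best stationary deterministic policy: enumerate all A^S = 4 of them.
v_stat = max(value([list(pi)] * H) for pi in itertools.product(range(A), repeat=S))

print(v_nonstat, v_stat)   # ~0.995 versus 0.5: a constant-factor gap, as claimed
```

The factor-two gap here is consistent with Corollary 4.7 below, which shows that some stationary policy always recovers the value of the best non-stationary policy up to a horizon-independent constant factor.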
For a stationary policy , we use to denote the Markov chain induced by and , where the transition operator is defined so that
Assumption on Rewards.
Below, we formally introduce the bounded total reward assumption.
Assumption 2.1 (Bounded Total Reward).
For any policy $\pi$ and a random trajectory $s_1, a_1, r_1, \ldots, s_H, a_H, r_H, s_{H+1}$ induced by $\pi$, we have
•	$r_h \ge 0$ for all $h \in [H]$; and
•	$\sum_{h=1}^{H} r_h \le 1$ almost surely.
As discussed in the previous section, this assumption is more general than the standard reward uniformity assumption, where $r_h \in [0, 1/H]$ for all $h \in [H]$. Throughout this paper, we assume that the above assumption holds for the unknown MDP that the agent interacts with.
The above assumption in fact implies a very interesting consequence on the reward values.
Lemma 2.1.
Under Assumption 2.1, for any with , for any , if there exists a (possibly non-stationary) policy such that for the random trajectory
induced by executing in , we have
for some , then almost surely and therefore .
Proof.
By the assumption, there exists a trajectory
such that there exists with and . Moreover,
We may assume and , since otherwise we can replace sub-trajectories that start and end with the same state by that state, and the resulting trajectory can still appear with strictly positive probability. Now consider the policy which is defined so that for each , and for each ,
i.e., repeating the trajectory’s actions indefinitely. The policy is defined arbitrarily for other states and time steps.
By executing , with strictly positive probability, is visited for times. Therefore, by Assumption 2.1, with probability and thus .
∎
Discounted Markov Decision Processes.
We also introduce another variant of the MDP, the discounted MDP, which is specified by $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \mu)$, where $\gamma \in (0, 1)$ is a discount factor and all other components have the same meaning as in an $H$-horizon MDP. Note that we consider discounted MDPs only for the purpose of analysis, and the unknown MDP that the agent interacts with is always a finite-horizon MDP. The difference between a discounted MDP and an $H$-horizon MDP is that discounted MDPs have an infinite horizon length, i.e., the length of a trajectory can be infinite. Hence, to measure the value in a discounted MDP, suppose policy $\pi$ obtains a random trajectory $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$; we denote
$V^{\pi}_{\gamma} := \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$
as the discounted value of $\pi$. Throughout, for a (discounted or finite-horizon) MDP, we use the corresponding value function notation, making the underlying MDP explicit when needed.
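For a stationary policy, the discounted value above has a simple closed form: writing $P_{\pi}$ for the transition matrix under $\pi$ and $r_{\pi}$ for the expected one-step reward, the value vector solves the linear system $(I - \gamma P_{\pi})V = r_{\pi}$. The sketch below computes it on a small hypothetical chain; the choice $\gamma = 1 - 1/H$ is only meant to suggest the order of the discount factor used later (Lemma 4.5), not its exact value.

```python
import numpy as np

def discounted_value(P_pi, r_pi, gamma):
    """Discounted value of a stationary policy pi, where P_pi[s, s'] = P(s' | s, pi(s))
    and r_pi[s] = E[r | s, pi(s)]. Solves the Bellman equation V = r_pi + gamma * P_pi V."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Hypothetical 3-state chain, for illustration only.
P_pi = np.array([[0.90, 0.10, 0.00],
                 [0.00, 0.80, 0.20],
                 [0.00, 0.00, 1.00]])
r_pi = np.array([0.0, 0.0, 0.01])
H = 100
print(discounted_value(P_pi, r_pi, gamma=1 - 1 / H))
```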
3 Technical Overview
In this section, we discuss the techniques for establishing our results. To introduce the high-level ideas, we first start with the simpler setting, the generative model, where exploration is not a concern. We then switch to the more challenging RL setting, where we need to carefully design policies to explore the state-action space so that a good policy can be learned. For simplicity, throughout the discussion in this section, we assume that the number of states $S$, the number of actions $A$, and the accuracy $\varepsilon$ are all constants.
Algorithm and Analysis in the Generative Model.
Our algorithm in the generative model is conceptually simple: for each state-action pair , we draw samples from and and then return the optimal policy with respect to the empirical model which is obtained by using the empirical estimators for and (denoted as and ). Here for simplicity, we assume which allows us to focus on the estimation error induced by the transition probabilities. Moreover, we assume that differs from only for a single state-action pair . To further simplify the discussion, we assume that there are only two different states on the support of (say and ).
In order to prove the correctness of the algorithm, we show that for any policy $\pi$, the value of $\pi$ in the empirical model is close to that in the true model. However, standard analysis based on dynamic programming only shows that the difference between the two values could be as large as $H$ times the estimation error on the transition probabilities, which is clearly insufficient for obtaining an algorithm that uses a number of batches of queries independent of $H$. Our main idea here is to show that for most trajectories, the probability of the trajectory in the empirical model is a multiplicative approximation to that in the true model, with a constant approximation ratio.
To establish the multiplicative approximation guarantee, our observation is that one should consider the two states on the support of the transition distribution as a whole. To see this, consider the case where the two successor states each have probability $1/2$. The additive estimation errors on the two transition probabilities are roughly of the same magnitude. Now, consider a trajectory that visits both successor states many times. The multiplicative approximation ratio between the empirical and true probability of the trajectory, if each successor state is treated in isolation, could be exponentially large in the number of visits. However, since the empirical estimator is still a probability distribution, the two per-state errors must sum to zero, so the overestimation of one probability is exactly compensated by the underestimation of the other. Hence the empirical probability of the trajectory is a constant factor approximation to the true probability due to cancellation.
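The following numerical sketch (with made-up parameters) illustrates the cancellation: treating one successor state in isolation, the multiplicative error compounds over the visits and becomes enormous, but because the empirical probabilities still sum to one, the joint error over both successors stays at a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10_000         # a trajectory can visit (s, a) up to H times
N = 10_000         # samples drawn for (s, a); of the same order as H in this illustration
p1, p2 = 0.5, 0.5  # true transition probabilities of the two successor states

# Empirical estimates from N samples: each has additive error of order 1/sqrt(N),
# and the two errors are exactly opposite since the estimates sum to one.
p1_hat = rng.binomial(N, p1) / N
p2_hat = 1.0 - p1_hat

# Successor counts along a trajectory that visits (s, a) for H steps.
m1 = rng.binomial(H, p1)
m2 = H - m1

log_ratio_single = m1 * (np.log(p1_hat) - np.log(p1))                    # one successor alone
log_ratio_joint = log_ratio_single + m2 * (np.log(p2_hat) - np.log(p2))  # both successors together

print(log_ratio_single)  # of order sqrt(H): the single-state ratio is hopeless
print(log_ratio_joint)   # of order 1: the joint ratio gives a constant-factor approximation
```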
In our analysis, to formalize the above intuition, for each trajectory , we take into consideration only when for both and . Here is the number of times that is visited on and is the number of times that is visited on . By Chebyshev’s inequality (see Lemma 5.7 for details), we only ignore a small subset of trajectories whose total probability can be upper bounded by a constant. For the remaining trajectories, it can be shown that is a constant factor approximation to so long as for both and due to the cancellation mentioned above. The formal argument is given in Lemma 5.5. Note that using samples, holds only when . On the other hand, we can also ignore trajectories that visit with since such trajectories have negligible cumulative probability by Markov’s inequality (formalized in Lemma 5.6).
The above analysis can be readily generalized to handle perturbation on the transition probabilities of multiple state-action pairs, and to handle the case when the transition operator is not supported on two states. In summary, by using samples for each state-action pair , the empirical model provides a constant factor approximation to the probabilities of all trajectories, except for a small subset of them whose cumulative probability can be upper bounded by a constant. Hence, for all policies, the empirical model provides an accurate estimate to its value and thus, the optimal policy with respect to the empirical model is near-optimal.
Exploration by Stationary Policies.
In the discussion above, we heavily rely on the ability of the generative model to obtain samples for each state-action pair. However, in the RL setting, it is not possible to reach every state-action pair freely. Although each trajectory contains $H$ state-action-state tuples (corresponding to a batch of $H$ queries in the generative model), these samples may not cover states that are crucial for learning an optimal policy. Indeed, one could use all possible deterministic non-stationary policies to collect samples, which would then cover the whole reachable state-action space. Unfortunately, such a naïve method introduces a dependence on the number of non-stationary policies, which is exponential in $H$. The sample complexity of other existing methods in the literature also inevitably depends on $H$, as their sample complexity intrinsically depends on the number of non-stationary policies.
In this work, we adopt a completely different approach for exploration. Our new idea is to show that if there exists a non-stationary policy that visits a state-action pair $(s, a)$ a certain number of times in expectation, then there exists a stationary policy that visits $(s, a)$ a comparable number of times in expectation. This is formalized in Lemma 4.7. If the above claim is true, then intuitively, one can simply enumerate all stationary policies and sample trajectories using each of them to obtain samples of $(s, a)$. Note that there are only $A^{S}$ stationary policies, which is completely independent of $H$. In order to prove the above claim, we show that for any stationary policy, its value in the infinite-horizon discounted setting is close to that in the finite-horizon undiscounted setting (up to a constant factor) by using a properly chosen discount factor (see Lemma 4.5). Note that this implies the correctness of the above claim since there always exists a stationary optimal policy in the infinite-horizon discounted setting.
In order to show that the value of a stationary policy in the infinite-horizon discounted setting is close to that in the finite-horizon setting, we study reaching probabilities in time-invariant Markov chains. In particular, we show in Lemma 4.4 that in a time-invariant Markov chain, the probability of reaching a specific state within a constant multiple of $H$ steps is close to the probability of reaching that state within $H$ steps, up to a constant factor. Previous literature on time-invariant Markov chains mostly focuses on the asymptotic behavior, and as far as we are aware, we are the first to prove the above claim. Note that this claim directly establishes a connection between the value of a stationary policy in the infinite-horizon discounted setting and that in the finite-horizon setting. Moreover, as a direct consequence of the above claim, we can show that the value of a stationary policy within $2H$ steps is close to that of the same policy within $H$ steps, up to a constant factor. This consequence is crucial for later parts of the analysis.
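The following sketch checks the flavor of this claim numerically on a hypothetical chain, reading it as a comparison between the first $H$ steps and the first $2H$ steps (the form the argument above actually uses); the chain and the constants are our own illustrative choices.

```python
import numpy as np

def reach_prob_within(P, mu, target, t):
    """Probability that the chain (transition matrix P, initial distribution mu)
    visits `target` within its first t steps: make `target` absorbing and
    propagate the state distribution for t - 1 steps."""
    P_abs = P.copy()
    P_abs[target] = 0.0
    P_abs[target, target] = 1.0
    dist = mu.copy()
    for _ in range(t - 1):
        dist = dist @ P_abs
    return dist[target]

# A chain where the target state 1 is hard to reach from state 0.
P = np.array([[0.999, 0.001],
              [0.000, 1.000]])
mu = np.array([1.0, 0.0])
H = 50
p_H = reach_prob_within(P, mu, target=1, t=H)
p_2H = reach_prob_within(P, mu, target=1, t=2 * H)
print(p_H, p_2H, p_2H / p_H)   # the ratio stays bounded by a small constant
```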
From Expectation to Quantile.
The above analysis shows that if there exists a non-stationary policy that visits $(s, a)$ a certain number of times in expectation, then our algorithm, which uses all stationary policies to collect samples, also visits $(s, a)$ a comparable number of times in expectation. However, this does not necessarily mean that one can obtain samples of $(s, a)$ with good probability by sampling trajectories using our algorithm. To see this, consider the case when our policy visits $(s, a)$ many times with tiny probability and does not visit it at all otherwise. In this case, our policy may not obtain even a single sample of $(s, a)$ unless one rolls out the policy a huge number of times. Therefore, instead of obtaining a visitation frequency guarantee which holds in expectation, it is more desirable to obtain a visitation frequency guarantee that holds with good probability.
To resolve this issue, we establish a connection between the expectation and the quantiles of the visitation frequency of a stationary policy. We note that such a connection cannot hold without any restriction. To see this, consider a policy that visits $(s, a)$ many times but only with tiny probability. In this case, the expected visitation frequency can be large while the quantile is zero. On the other hand, if the initial state is $s$ almost surely, then such a connection is easy to establish by using the Martingale Stopping Theorem. In particular, in Corollary 4.9, we show that if there exists a non-stationary policy that visits $(s, a)$ a certain number of times with good probability, then there exists a stationary policy that visits $(s, a)$ a comparable number of times with constant probability, when the initial state is $s$ almost surely.
In general, the initial state comes from a distribution $\mu$ and could be different from $s$ with high probability. To tackle this issue, in our algorithm, we simultaneously enumerate two stationary policies $\pi_1$ and $\pi_2$. Here $\pi_1$ should be thought of as the policy that visits $s$ with the highest probability within $H$ steps starting from the initial state distribution $\mu$, and $\pi_2$ should be thought of as the policy that maximizes the quantile of the visitation frequency of $(s, a)$ within $H$ steps when starting from $s$. In our algorithm, we execute $\pi_1$ before $(s, a)$ is visited for the first time, and switch to $\pi_2$ once $(s, a)$ has been visited. Intuitively, we first use $\pi_1$ to reach $(s, a)$ for the first time and then use $\pi_2$ to collect as many samples as possible. As mentioned above, the value of a stationary policy within $2H$ steps is close to the value of the same policy within $H$ steps, up to a constant factor. Thus, by sampling the above policy (formed by concatenating $\pi_1$ and $\pi_2$) sufficiently many times, we obtain enough samples of $(s, a)$, provided there exists a non-stationary policy that visits $(s, a)$ a comparable number of times with good probability. This is formalized in Lemma 5.1.
Perturbation Analysis in the RL Setting.
By the above analysis, suppose is the largest integer such that there exists a non-stationary policy that visits with probability for times, then our dataset contains samples of . However, could be significantly smaller than and therefore the perturbation analysis established in the generative model no longer applies here. For example, previously we show that if , then is a constant factor approximation to when for both and . However, if , it is hopeless to obtain an estimate with . Fortunately, our perturbation analysis still goes through so long as and , i.e., replacing all appearances with .
The above analysis introduces a final subtlety in our algorithm. In particular, the visitation frequency of $(s, a)$ in the empirical model could be significantly larger than that in the true model, while the number of samples of $(s, a)$ in our dataset is bounded in terms of a quantity defined by the true model. This means the value estimated in the empirical model could be significantly larger than that in the true model. To resolve this issue, we employ the principle of “pessimism in the face of uncertainty”: for each policy $\pi$, the estimated value of $\pi$ is set to be the lowest value among all models that lie in the confidence set. Since the true model always lies in the confidence set, the estimated value is then guaranteed to be close to the true value.
4 Properties of Stationary Policies
In this section, we prove several properties of stationary policies. In Section 4.1, we first prove properties regarding the reaching probabilities in Markov chains, and then use them to prove properties for stationary policies in Section 4.2.
4.1 Reaching Probabilities in Markov Chains
Let be a Markov chain. For a positive integer and a sequence of states , we write
to denote the probability of in . For a state and an integer , we also write
to denote the probability of reaching with exactly steps.
Our first lemma shows that for any Markov chain , for any sequence of states with , there exists a sequence of states with so that .
Lemma 4.1.
Let be a Markov chain. For a sequence of states
with , there exists a sequence of states
with , and .
Proof.
By pigeonhole principle, since , there exists such that . Consider the sequence induced by removing from , i.e.,
Since , we have
and
Therefore, we have . We continue this process until the length is at most . ∎
Combining Lemma 4.1 with a simple counting argument directly implies the following lemma, which shows that .
Lemma 4.2.
Let be a Markov chain. For any ,
Proof.
Consider a sequence of states with and . By Lemma 4.1, there exists another sequence of states with and so that . Therefore
which implies
Therefore,
∎
By applying Lemma 4.2 in a Markov chain with modified initial state distribution and transition operator, we can also prove that for any integer and integer .
Lemma 4.3.
Let be a Markov chain. For any integer and integer , for any ,
Proof.
We define a new Markov chain based on . The state space of is the same as that of . The initial state distribution is the same as the distribution of in , i.e., the distribution after taking steps in . The transition operator is defined so that taking one step in is equivalent to taking steps in , i.e.,
Clearly, for any state , . By using Lemma 4.2 in , for any , we have
which implies
∎
Finally, Lemma 4.3 implies the main result of this section, which shows that for any , .
Lemma 4.4.
Let be a Markov chain. For any and ,
Proof.
On the other hand,
Note that if , then . Moreover, if , then we have , which implies
Hence, we have
∎
4.2 Implications of Lemma 4.4
In this section, we list several implications of Lemma 4.4 which would be crucial for analysis in later sections.
Our first lemma shows that for any MDP and any stationary policy , for a properly chosen discount factor , is a multiplicative approximation to with approximation ratio .
Lemma 4.5.
For any MDP and any stationary policy , if , by taking ,
As another implication of Lemma 4.4, for any MDP and any stationary policy , we have
Lemma 4.6.
For any MDP and any stationary policy , if ,
Proof.
As a corollary of Lemma 4.5, we show that for any -horizon MDP , there always exists a stationary policy whose value is as large as the best non-stationary policy up to a factor of .
Corollary 4.7.
For any MDP , if , then there exists a stationary policy such that
Proof.
In this proof we fix . We also use to denote a non-stationary policy such that when and is defined arbitrarily when .
By applying Corollary 4.7 in an MDP with an extra terminal state , we can show that for any , there always exists a stationary policy that visits in the first time steps with probability as large as the probability that the best non-stationary policy visits in all the time steps, up to a factor of .
Corollary 4.8.
For any MDP , if , then for any , there exists a stationary policy , such that for any (possibly non-stationary) policy ,
where
is a random trajectory induced by executing in and
is a random trajectory induced by executing in .
Proof.
For the given MDP , we create a new MDP
where is a state such that . Moreover,
and
Clearly, for any policy ,
where
is a random trajectory induced by executing in . Therefore, by Corollary 4.7, there exists a stationary policy such that for any (possibly non-stationary) policy ,
Moreover, by Lemma 4.6, for any (possibly non-stationary) policy ,
which implies the desired result. ∎
Finally, by combining Lemma 4.6 and Corollary 4.7, we can show that for any , if the initial state distribution is and there exists a non-stationary policy that visits for times with probability in all the steps, then there exists a stationary policy that visits for times with constant probability in the first steps.
Corollary 4.9.
For a given MDP and a state-action pair , suppose the initial state distribution is and . If there exists a (possibly non-stationary) policy such that for some integer , then there exists a stationary policy such that
where
is a trajectory induced by executing in .
Proof.
If then the lemma is clearly true. Now consider the case . Consider a new MDP where . Clearly, . By Corollary 4.7, there exists a stationary policy such that
By Lemma 4.6,
This implies .
Now we use to denote a random variable which is defined to be
Here the trajectory
is induced by executing the stationary policy in . We also write . We use to denote a sequence of i.i.d. copies of . We use to denote a random variable which is defined to be
Clearly, almost surely. Moreover, is a stationary policy, the initial state distribution deterministically and , which implies and have the same distribution. Indeed, whenever the trajectory visits , it corresponds to restarting a new copy of .
Now for each , we define . Clearly . Let and for all . Clearly is a stopping time, and
since for all . By the Martingale Stopping Theorem, we have
which implies and therefore
where we use the fact that
Let . By Markov’s inequality, with probability at least ,
in which case . Consequently,
∎
5 Algorithm in the RL Setting
In this section, we present our algorithm in the RL setting together with its analysis. Our algorithm is divided into two parts. In Section 5.1, we first present the algorithm for collecting samples together with its analysis. In Section 5.2, we establish a perturbation analysis on the value functions which is crucial for the analysis in later proofs. Finally, in Section 5.3, we present the algorithm for finding near-optimal policies based on the dataset found by the algorithm in Section 5.1, together with its analysis based on the machinery developed in Section 5.2.
5.1 Collecting Samples
In this section, we present our algorithm for collecting samples. The algorithm is formally presented in Algorithm 1. The dataset returned by Algorithm 1 consists of lists, where the elements of each list are tuples of the form $(s, a, r, s')$. To construct these lists, Algorithm 1 enumerates a state-action pair $(s, a)$ and a pair of stationary policies $(\pi_1, \pi_2)$, and then collects a trajectory using $\pi_1$ and $\pi_2$. More specifically, $\pi_1$ is executed until the trajectory visits $(s, a)$, at which point $\pi_2$ is executed until the last step.
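A minimal sketch of one such data-collection episode is given below. Here `env` is a hypothetical episodic simulator with `reset()`/`step()` methods, `pi1` and `pi2` are stationary policies stored as arrays indexed by state, and the exact switching rule (switch as soon as the target pair has been taken once) is our reading of the description above rather than a verbatim transcription of Algorithm 1.

```python
def collect_episode(env, H, s_target, a_target, pi1, pi2):
    """Roll out one H-step episode: follow the stationary policy pi1 until the target
    pair (s_target, a_target) has been visited once, then follow pi2 for the rest of
    the episode. Returns the list of observed (s, a, r, s') tuples."""
    trajectory, visited = [], False
    s = env.reset()
    for _ in range(H):
        a = (pi2 if visited else pi1)[s]
        s_next, r = env.step(a)          # hypothetical simulator interface
        trajectory.append((s, a, r, s_next))
        if (s, a) == (s_target, a_target):
            visited = True
        s = s_next
    return trajectory
```

The full collection loop repeats this for every target pair $(s, a)$ and every ordered pair of stationary policies $(\pi_1, \pi_2)$; since there are only $A^{S}$ stationary policies, the number of such combinations is independent of $H$.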
Throughout this section, we use to denote the underlying MDP that the agent interacts with. For each and , let
be a trajectory where and for all , for all , and
for all . Note that the above trajectory is the one collected by Algorithm 1 when a specific state-action pair and a specific pair of policies are used.
For any , define
Clearly, is the -quantile of the frequency that appears in each .
In Lemma 5.1, we first show that for each , if there exists a policy that visits for times with probability at least , then
Lemma 5.1.
Let be a given real number. For each , let be the largest integer such that there exists a (possibly non-stationary) policy such that . Then for each ,
Proof.
For each , there exists a (possibly non-stationary) policy such that . Here we consider the case that , since otherwise the lemma clearly holds. By Corollary 4.8, there exists a stationary policy such that
where
is a random trajectory induced by executing in .
In the remaining part of the analysis, we consider two cases.
Case I: .
Let
be a random trajectory induced by executing in . Let be the random variable which is defined to be
Clearly,
Therefore, there exists such that
and
Note that we must have , since otherwise .
Now we consider a new MDP where . Let be an arbitrary policy so that for all . Clearly,
where
is a random trajectory induced by executing in . Therefore, by Corollary 4.9, there exists a stationary policy such that
where
is a random trajectory induced by executing in . Since and thus , we must have .
Now we consider the case when and . Since ,
Therefore, let be the random variable which is defined to be
We have that
Moreover, for each , since ,
Therefore,
Since , we have
and thus
Case II: .
Consider the case when . Clearly,
and thus
∎
Now we show that, for a given percentile, for the dataset returned by Algorithm 1, each state-action pair appears sufficiently many times in sufficiently many of the lists returned by Algorithm 1.
Lemma 5.2.
Let be a given real number. Let be the dataset returned by Algorithm 1 where
Suppose . With probability at least , for each , we have
Proof.
For each , by the definition of , for each , we have
Hence, the desired result follows by Chernoff bound and a union bound over all . ∎
We also need a subroutine to estimate for some to be decided. Such estimates are crucial for building estimators for the transition probabilities and the rewards with bounded variance, which we elaborate in later parts of this section.
Our algorithm for estimating these quantiles is described in Algorithm 2. Algorithm 2 collects lists, where the elements of each list are tuples of the form $(s, a, r, s')$. These lists are collected using the same approach as in Algorithm 1. Once these lists are collected, for each state-action pair $(s, a)$, our estimate is set to be an appropriate order statistic (the $k$-th largest, for a suitably chosen $k$) of the counts recording how many times $(s, a)$ appears in each of the lists.
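A sketch of the resulting estimator, assuming the lists have already been collected as above; the index `k` (which order statistic of the per-list counts to return) is left as a parameter, since the precise choice comes from the Chernoff-bound argument in Lemma 5.3.

```python
from collections import Counter

def estimate_quantile(lists, s, a, k):
    """Estimate of the visitation-frequency quantile of the pair (s, a): count how many
    times (s, a) appears in each collected list of (s, a, r, s') tuples, and return the
    k-th largest of these counts (1-indexed)."""
    counts = sorted((Counter((si, ai) for (si, ai, _, _) in lst)[(s, a)] for lst in lists),
                    reverse=True)
    return counts[k - 1]
```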
We now show that for each , is an accurate estimate of .
Lemma 5.3.
Let be the function returned by Algorithm 2. With probability at least , for all ,
Proof.
Fix a state-action pair . For each , define
where
For each , by the definition of , we have and thus . By Chernoff bound, with probability at most ,
On the other hand, for each , define
where
For each , by the definition of , we have and thus . By Chernoff bound, with probability at most ,
Hence, by union bound, with probability at least ,
and
in which case the -th largest element in is in . We finish the proof by a union bound over all . ∎
In Lemma 5.4, we show that using the dataset returned by Algorithm 1, and the estimates of the quantiles returned by Algorithm 2, we can compute accurate estimates of the transition probabilities and rewards. The estimators used in Lemma 5.4 are the empirical estimators, with proper truncation if a list contains too many samples. As will be made clear in the proof, such truncation is crucial for obtaining estimators with bounded variance.
Lemma 5.4.
Suppose Algorithm 2 is invoked with the percentile set to be and the failure probability set to be , and Algorithm 1 is invoked with . Let be the estimates returned by Algorithm 2. Let be the dataset returned by Algorithm 1 where
For each , for each and , define
For each , define
and
Then with probability at least , for all with , we have
and
Proof.
Fix a state-action pair and . For each and , let be the filtration induced by
For each and , define
and
Clearly,
and
which implies
and
and thus
Moreover, for any and , we have
and
Note that for each ,
and
Furthermore, for each and ,
and
Since
we have
and
Now, for each , define
and
We have ,
and
Also note that
and
By Bernstein’s inequality,
Thus, by setting , we have
By applying a union bound over all , with probability at least , for all ,
which we define to be event . Note that conditioned on and the events in Lemma 5.3 and Lemma 5.2, we have
which implies
and thus
for all .
When
we have
and therefore
When
we have
and thus
which implies
Hence, conditioned on and the events in Lemma 5.3 and Lemma 5.2, for all ,
By Bernstein’s inequality,
Thus, by setting , we have
By applying a union bound over all , with probability at least , for all ,
which we define to be event . Note that conditioned on and the events in Lemma 5.3 and Lemma 5.2, we have
Finally, for each , for each , define
Note that
Therefore, by Chernoff bound, with probability at least we have
Hence, with probability at least , for all , we have
which we define to be event .
5.2 Perturbation Analysis
In this section, we establish a perturbation analysis on the value functions which is crucial for the analysis in the next section. We first recall a few basic facts.
Fact 5.1.
Let be a real number, we have
-
1.
;
-
2.
.
We now prove the following lemma using the above facts.
Lemma 5.5.
Let , be positive integers. Let be some real numbers. Let be a vector with . Let be a vector such that for each , and . For every and every such that for all , we have
Proof.
Note that
where
Clearly,
By the choice of , we have
Using Fact 5.1, for all , we have
Hence, we have
Note that , , , and . We have,
By the choice of , we have , and therefore
∎
In the following lemma, we show that for any , with probability at least , the number of times is visited can be upper bounded in terms of the -quantile of the number of times is visited and .
Lemma 5.6.
For a given MDP . Suppose a random trajectory
is obtained by executing a (possibly non-stationary) policy in . For any , with probability at least , we have
Proof.
For each , define
Let be the event that for all , . By definition of , we have
For each , let be the filtration induced by . For each , define
and
When , we have
When , we have
which implies
Note that
which implies
By Markov’s inequality, with probability at least ,
which we denote as event .
Conditioned on which happens with probability , we have
∎
In the following lemma, we show that for any , the number of times is visited should be close to the number of times is visited times .
Lemma 5.7.
For a given MDP . Suppose a random trajectory
is obtained by executing a policy in . For any , with probability at least , we have
Proof.
For each , define
Let be the event that for all , . By definition of , we have
For each , let be the filtration induced by . For each , define
As we have shown in the proof of Lemma 5.6, for each , . Moreover, for any , we have
Therefore,
Note that for each .
As we have shown in the proof of Lemma 5.6, for each ,
which implies
By Chebyshev’s inequality, we have with probability at least ,
which we denote as event .
Conditioned on which happens with probability , we have
∎
Using Lemma 5.5, Lemma 5.6 and Lemma 5.7, we now present the main result in this section, which shows that for two MDPs and that are close enough in terms of rewards and transition probabilities, for any policy , its value in should be lower bounded by that in up to an error of .
Lemma 5.8.
Let be an MDP and be a policy. Let be a parameter. For each , define . Let be another MDP. If for all with , we have
and
then
Proof.
Define be set of all possible trajectories, where for each , has the form
For a trajectory , for each , we write
as the number of times is visited and
as the number of times is visited. We say a trajectory
is compatible with a (possibly non-stationary) policy if for all ,
For a (possibly non-stationary) policy , we use to denote the set of all trajectories that are compatible with .
For an MDP and a (possibly non-stationary) policy , for a trajectory that is compatible with , we write
to be the probability of when executing in . Here we assume .
Using these definitions, we have
Note that for any trajectory , if , by Assumption 2.1,
We define be the set of trajectories that for each , for each ,
By a union bound over all , we have
We also define be the set of trajectories that for each , for each ,
By Lemma 5.7 and a union bound over all , we have
Finally, we define be the set of trajectories such that for each , for each ,
By Lemma 5.6 and a union bound over all , we have
Thus, by defining , we have
Note that for each with
for each state-action , we must have . This is because Moreover, for any , if , then
This is because , and if
then
For each , define
Therefore, for each ,
and
with
Note that , which implies
By applying Lemma 5.5 and setting to be , to be , to be , and to be , we have
Therefore,
which implies
For the summation of rewards, we have
For those with , we have . By Lemma 2.1, we have
Since and , we have
For those with , we have
Thus,
For each with
we have
Since
we have
∎
Now we show that for two MDPs and with the same transition probabilities and close enough rewards, for any policy , its value in should be upper bounded by that in up to an error of .
Lemma 5.9.
Let be an MDP and be a policy. Let be a parameter. For each , define . Let be another MDP. If for all with , we have
then
5.3 Pessimistic Planning
We now present our final algorithm in the RL setting. The formal description is provided in Algorithm 3. In our algorithm, we first invoke Algorithm 1 to collect a dataset and then invoke Algorithm 2 to estimate for some properly chosen . We then use the estimators in Lemma 5.4 to define , and . Note that Lemma 5.4 not only provides an estimator but also provides a computable confidence interval for and , which we also utilize in our algorithm.
At this point, a natural idea is to find the optimal policy with respect to the MDP defined by , and . However, our Lemma 5.8 only provides a lower bound guarantee for without any upper bound guarantee. We resolve this issue by pessimistic planning. More specifically, for any policy , we define its pessimistic value to be
where the confidence set includes all MDPs whose transition probabilities are within the confidence intervals provided in Lemma 5.4. We simply return the policy that maximizes the pessimistic value. Since the true MDP lies in the confidence set, the pessimistic value is never an overestimate. On the other hand, Lemma 5.8 guarantees that it is also lower bounded by the true value up to a small error. Therefore, the pessimistic value provides an accurate estimate of the true value of each policy. However, note that Lemma 5.4 does not provide a computable confidence interval for the rewards. Fortunately, as we have shown in Lemma 5.9, perturbations of the rewards cannot significantly increase the value of a policy, and thus the estimate is still accurate.
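A sketch of one way to compute such a pessimistic value is given below. The paper's confidence set constrains each transition probability individually (Lemma 5.4); the sketch substitutes a simpler $\ell_1$-style confidence ball per state-action pair, for which the inner minimization has a standard closed form (shift probability mass from the highest-value successors to the lowest-value one). The interface and the radii `xi` are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def pessimistic_backup(p_hat, V, xi):
    """min_p p @ V over distributions p with ||p - p_hat||_1 <= xi: move xi/2 mass
    (or as much as is available) onto the lowest-value successor, taking it away
    from the highest-value successors first."""
    p = p_hat.astype(float).copy()
    lo = int(np.argmin(V))
    budget = min(xi / 2.0, 1.0 - p[lo])
    p[lo] += budget
    for s in np.argsort(V)[::-1]:
        if budget <= 0 or s == lo:
            continue
        take = min(budget, p[s])
        p[s] -= take
        budget -= take
    return float(p @ V)

def pessimistic_value(P_hat, R_hat, xi, policy, H, mu):
    """Pessimistic H-step value of a non-stationary policy under per-(s, a) radii xi;
    policy[h][s] is the action at step h in state s."""
    V = np.zeros(P_hat.shape[0])
    for h in reversed(range(H)):
        V = np.array([R_hat[s, policy[h][s]]
                      + pessimistic_backup(P_hat[s, policy[h][s]], V, xi[s, policy[h][s]])
                      for s in range(P_hat.shape[0])])
    return float(mu @ V)
```

Maximizing this pessimistic value over policies, as Algorithm 3 does, then yields the returned policy; for a rectangular confidence set like the one above, the maximization can itself be folded into the same backward recursion by taking a max over actions at every step.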
We now present the formal analysis for our algorithm.
Theorem 5.10.
Proof.
By Lemma 5.4, with probability at least , for all , we have
and
In the remaining part of the analysis, we condition on the above event.
By Lemma 5.1, for any (possibly non-stationary) policy ,
Let be the true MDP. By Lemma 5.8, for any (possibly non-stationary) policy , for any , we have
Moreover, . Therefore, by Lemma 5.9,
Consequently,
Finally, the algorithm samples at most
trajectories. ∎
Theorem 5.10 immediately implies the following corollary.
Corollary 5.11.
Algorithm 3 returns a policy such that
with probability at least . Moreover, the algorithm samples at most
trajectories.
6 Algorithm in the Generative Model
In this section, we present our algorithm in the generative model together with its analysis. The formal description of the algorithm is given in Algorithm 4. Compared to our algorithm in the RL setting, Algorithm 4 is conceptually much simpler. For each state-action pair $(s, a)$, Algorithm 4 first draws samples from $P(\cdot \mid s, a)$ and $R(s, a)$, and then builds an empirical model using the empirical estimators. The algorithm simply returns the optimal policy of the empirical model. In the remaining part of this section, we present the formal analysis of our algorithm.
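A minimal sketch of this procedure, assuming a hypothetical generative-model interface `sampler(s, a) -> (s_next, r)`; the per-pair sample size `N` is a free parameter here, whereas Algorithm 4 fixes it according to Theorem 6.3.

```python
import numpy as np

def learn_with_generative_model(sampler, S, A, H, N):
    """Draw N samples per state-action pair, build the empirical MDP, and return the
    optimal non-stationary policy of the empirical model by backward induction."""
    P_hat = np.zeros((S, A, S))
    R_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                s_next, r = sampler(s, a)
                P_hat[s, a, s_next] += 1.0 / N
                R_hat[s, a] += r / N
    V = np.zeros(S)
    policy = []
    for _ in range(H):                    # backward induction on the empirical model
        Q = R_hat + P_hat @ V             # Q[s, a] = r_hat(s, a) + sum_s' P_hat(s'|s, a) V(s')
        policy.append(Q.argmax(axis=1))   # greedy action for this step
        V = Q.max(axis=1)
    policy.reverse()                      # policy[h][s] is the action at step h
    return policy
```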
The following lemma establishes a perturbation analysis similar to those in Lemma 5.8 and Lemma 5.9. Note that here, due to the availability of a generative model, one can obtain samples for every state-action pair, and hence we can prove both an upper bound and a lower bound on the estimated value. This also explains why pessimistic planning is no longer necessary in the generative model setting.
Lemma 6.1.
Let be an MDP and be a policy. Let be a parameter. Let be another MDP. If for all , we have
and
then
Proof.
Define be set of all possible trajectories, where for each , has the form
For a trajectory , for each , we write
as the number of times is visited and
as the number of times is visited. We say a trajectory
is compatible with a (possibly non-stationary) policy if for all ,
For a (possibly non-stationary) policy , we use to denote the set of all trajectories that are compatible with .
For an MDP and a (possibly non-stationary) policy , for a trajectory that is compatible with , we write
to be the probability of when executing in . Here we assume .
Using these definitions, we have
Note that for any trajectory , if , by Assumption 2.1,
We define be the set of trajectories such that for each , for each , if then
By Lemma 5.6 and a union bound over all , we have
We also define be the set of trajectories such that for each , for each , if then
Again by Lemma 5.6 and a union bound over all , we have
Now we show that
and therefore,
For each , if there exists such that and
then
and thus . Moreover, for any , for all with , we have
Note that this implies . Furthermore, for all with , we have
which implies .
We define be the set of trajectories that for each , for each ,
By Lemma 5.7 and a union bound over all , we have
Again, we define be the set of trajectories that for each , for each ,
Similarly, by Lemma 5.7 and a union bound over all , we have
Now we show that
and therefore,
For any , there exists such that
Note that this implies
since otherwise (because ) and therefore
Since
we have
and
Hence,
and thus .
Note that
Similarly,
For each , define
Therefore, for each ,
and
with
Note that , which implies
By applying Lemma 5.5 and setting to be , to be , to be , and to be , we have
and
Therefore,
and
For the summation of rewards, we have
For those with , since , by Lemma 2.1, we have
Since and , we have
For those with , we have
Thus,
For each with
we have
where
and
Since
and
we have
∎
The following lemma provides error bound on the empirical model built by the algorithm. Note that the error bounds established here are similar to those in Lemma 5.4.
Lemma 6.2.
With probability at least , for all :
and
Proof.
For each , by Bernstein’s inequality,
Thus, by setting , we have
By applying a union bound over all , with probability at least , for all ,
By a similar argument, with probability at least , for all ,
Moreover, with probability at least , for all , we have
We complete the proof by applying a union bound. ∎
Theorem 6.3.
With probability at least , Algorithm 4 returns a policy such that
by using at most batches of queries.
Proof.
By Lemma 6.2, with probability at least , for all , we have
and
We condition on this event in the remaining part of this proof. By Lemma 6.1, for any (possibly non-stationary) policy , we have
Let be the policy returned by the algorithm. We thus have
The total number of samples taken by the algorithm is
∎
Acknowledgements
The authors would like to thank Simon S. Du for helpful discussions, and the anonymous reviewers for helpful comments.
Ruosong Wang was supported in part by NSF IIS1763562, US Army W911NF1920104, and ONR Grant N000141812861. Part of this work was done while Ruosong Wang and Lin F. Yang were visiting the Simons Institute for the Theory of Computing.
References
- [AKY19] Alekh Agarwal, Sham Kakade, and Lin F Yang. On the optimality of sparse model-based planning for Markov decision processes. arXiv preprint arXiv:1906.03804, 2019.
- [AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
- [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- [BT03] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3(Oct):213–231, March 2003.
- [BT09] Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
- [DB15] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
- [DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 5717–5727, Red Hook, NY, USA, 2017. Curran Associates Inc.
- [DLWB19] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1507–1516, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- [DWCW19] Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311, 2019.
- [FPL18] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating markov decision processes. In Advances in Neural Information Processing Systems, pages 2994–3004, 2018.
- [JA18] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398, 2018.
- [JAZBJ18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- [Kak03] Sham M Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London London, England, 2003.
- [KN09] J Zico Kolter and Andrew Y Ng. Near-bayesian exploration in polynomial time. In Proceedings of the 26th annual international conference on machine learning, pages 513–520, 2009.
- [KS99] Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002, 1999.
- [KS02] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
- [LH12] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer, 2012.
- [LWC+20] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
- [MDSV21] Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-learning: Correcting the bias without forgetting. arXiv preprint arXiv:2103.01312, 2021.
- [NPB20] Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning. arXiv preprint arXiv:2007.01891, 2020.
- [ORVR13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. arXiv preprint arXiv:1306.0940, 2013.
- [OVR16] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. ArXiv, abs/1608.02732, 2016.
- [OVR17] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR. org, 2017.
- [PBPH+20] Aldo Pacchiano, Philip Ball, Jack Parker-Holder, Krzysztof Choromanski, and Stephen Roberts. On optimism in model-based reinforcement learning. arXiv preprint arXiv:2006.11911, 2020.
- [Put94] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
- [RLD+21] Tongzheng Ren, Jialian Li, Bo Dai, Simon S Du, and Sujay Sanghavi. Nearly horizon-free offline reinforcement learning. arXiv preprint arXiv:2103.14077, 2021.
- [Rus19] Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. arXiv preprint arXiv:1906.02870, 2019.
- [SJ19] Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. arXiv preprint arXiv:1905.03814, 2019.
- [SL08] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- [SLW+06] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. Pac model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
- [SS10] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML, 2010.
- [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196, 2018.
- [SWWY18] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. Society for Industrial and Applied Mathematics, 2018.
- [SY94] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
- [TM18] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. arXiv preprint arXiv:1803.01626, 2018.
- [Wan17] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017.
- [WDYK20] Ruosong Wang, Simon S Du, Lin F Yang, and Sham M Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
- [YW19] Lin Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
- [YYD21] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- [ZB19] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
- [ZDJ20] Zihan Zhang, Simon S Du, and Xiangyang Ji. Nearly minimax optimal reward-free reinforcement learning. arXiv preprint arXiv:2010.05901, 2020.
- [ZJ19] Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, pages 2823–2832, 2019.
- [ZJD20] Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020.
- [ZZJ20] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020.