Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning
Recently there has been a surge of interest in understanding the horizon-dependence of the sample complexity in reinforcement learning (RL). Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interactions when the number of states and actions is fixed. It is yet unknown whether the $\mathrm{polylog}(H)$ dependence is necessary or not. In this work, we resolve this question by developing an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions, completely settling the horizon-dependence of the sample complexity in RL. We achieve this bound by (i) establishing a connection between value functions in discounted and finite-horizon Markov decision processes (MDPs) and (ii) a novel perturbation analysis in MDPs. We believe our new techniques are of independent interest and could be applied in related questions in RL.
1 Introduction
Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of $H$ steps and receives a sequence of random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy that maximizes the expected sum of these reward values instead of any single one of them. Since decisions made at early stages could significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when $H = 1$, RL reduces to the contextual bandits problem, in which it suffices to act myopically to achieve optimality.
Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is $H$ and the underlying MDP has unknown and time-invariant transition probabilities and rewards. Starting from some unknown but fixed initial state distribution, how many episodes of interaction with the MDP do we need to learn an $\varepsilon$-optimal policy, whose value (the expected sum of rewards collected by the policy) differs from that of the optimal policy by at most $\varepsilon \cdot r_{\max}$? Here $r_{\max}$ is the maximum possible sum of rewards in an episode and $\varepsilon$ is the target accuracy. Jiang and Agarwal [JA18] conjecture that the number of episodes required by the above problem depends polynomially on $H$ even when the total number of state-action pairs is a constant. Wang et al. [WDYK20] refute this conjecture by showing that the correct sample complexity of episodic RL is at most $\mathrm{polylog}(H)$ when there are a constant number of states and actions. The result in [WDYK20] reveals the surprising fact that one can achieve similar sample efficiency regardless of the horizon length of the RL problem. A number of recent results [ZJD20, MDSV21, ZDJ20, RLD+21] further improve the result in [WDYK20], obtaining more (computationally or statistically) efficient algorithms and/or handling more general settings. However, these results all have a $\mathrm{polylog}(H)$ factor in their sample complexity, seemingly implying that the $\mathrm{polylog}(H)$ dependence is necessary. Hence, it is natural to ask:
What is the exact dependence on the horizon length of the sample complexity in episodic RL?
The goal of the current paper is to answer the above question.
Compared to other problems in machine learning, RL is challenging mainly because it requires an effective exploration mechanism. For example, to obtain samples from some state in an MDP, one needs to first learn a policy that can reach that state with good probability. In order to decouple the sample complexity of exploration from that of learning a near-optimal policy, there is a line of research (see, e.g., [KS99, AMK13, SWW+18, Wan17, YW19, LWC+20]) studying the generative model setting, where a learner is able to query samples from any state-action pair, circumventing the issue of exploration. The above sample complexity question is still well-posed even with a generative model. However, to achieve a fair comparison with the RL setting, we measure the sample complexity as the number of batches, where a batch corresponds to $H$ queries in the generative model. Indeed, in the episodic RL setting, the learner also obtains $H$ samples in each episode. Nevertheless, we are not aware of any result addressing the above question even with a generative model.
1.1 Our Results
In this paper, we settle the horizon-dependence of the sample complexity in episodic RL by developing an algorithm whose sample complexity is completely independent of the planning horizon $H$. We state our results in the following theorems, which target RL with an $H$-horizon MDP (formally defined in Section 2) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
Theorem 1.1 (Informal version of Corollary 5.11).
Suppose the reward at each time step is non-negative and the total reward of each episode is upper bounded by 1.¹ Given a target accuracy $\varepsilon$ and a failure probability $\delta$, our algorithm returns an $\varepsilon$-optimal non-stationary policy with probability at least $1 - \delta$ by sampling a number of episodes that depends only on the number of states $S$, the number of actions $A$, the accuracy $\varepsilon$, and the failure probability $\delta$, and is completely independent of $H$.
¹Without loss of generality, we normalize so that $r_{\max} = 1$.
Notably, when the number of states $S$, the number of actions $A$, the desired accuracy $\varepsilon$, and the failure probability $\delta$ are all fixed constants, the (episode) sample complexity of our algorithm is also a constant. Our result suggests that episodic RL is possible with a sample complexity that is completely independent of the horizon length $H$.
In fact, we can show a much more general result beyond finding the optimal policy. We prove that, using the same number of episodes, one can construct an oracle that returns an $\varepsilon$-approximation of the value of any non-stationary policy $\pi$, as stated in the following theorem.
Theorem 1.2 (Informal version of Theorem 5.10).
In the same setting as Theorem 1.1, using the same number of episodes, with probability at least $1 - \delta$, we can construct an oracle such that, for every given non-stationary policy $\pi$, the oracle returns an $\varepsilon$-approximation of the expected total reward of $\pi$.
This theorem suggests that even completely learning the MDP (i.e., being able to simultaneously estimate the value of all non-stationary policies) can be done with a sample complexity that is independent of $H$.
We now switch to the more powerful generative model and show that a better sample complexity can be achieved. In the generative model, the agent can query samples from any state-action pair and from the initial state distribution of the environment. Here, the sample complexity is defined as the total number of batches of queries (a batch corresponds to $H$ queries) to the environment needed to obtain an $\varepsilon$-optimal policy.²
²It is well-known that any algorithm requires a number of queries that grows at least linearly with $H$ to find a near-optimal policy (see, e.g., [LH12]), and thus it is reasonable to define a single batch to be $H$ queries.
Theorem 1.3 (Informal version of Theorem 6.3).
Suppose the reward at each time step is non-negative and the total reward of each episode is upper bounded by 1. Given a target accuracy $\varepsilon$ and a failure probability $\delta$, our algorithm returns an $\varepsilon$-optimal non-stationary policy with probability at least $1 - \delta$ by sampling a number of batches of queries in the generative model that depends only polynomially on the number of states $S$ and the number of actions $A$ (and on the accuracy $\varepsilon$ and the failure probability $\delta$), and is independent of $H$.
Compared to the result in Theorem 1.1, the sample complexity in Theorem 6.3 has only polynomial dependence on the number of states $S$ and has better dependence on the desired accuracy $\varepsilon$.
We remark that although our algorithms in Theorem 1.1 and Theorem 1.3 achieve tight dependence in terms of $H$, they do not aim to tighten the dependence on the number of states $S$, the number of actions $A$, or the desired accuracy $\varepsilon$. In fact, the sample complexity of our algorithms has much worse dependence on $S$, $A$, and $\varepsilon$ compared to prior work. See Section 1.2 for a detailed discussion of prior work, and Section 3 for an overview of our techniques.
1.2 Related Work
In this section, we discuss related work on the sample complexity of RL in the tabular setting, where the number of states $S$, the number of actions $A$, and the horizon length $H$ are all assumed to be finite. There is a long line of research studying the sample complexity of tabular RL [KS02, BT03, Kak03, SLW+06, SL08, KN09, BT09, JOA10, SS10, LH12, ORVR13, DB15, AOM17, DLB17, OVR17, JAZBJ18, FPL18, TM18, DLWB19, DWCW19, SJ19, Rus19, ZJ19, ZZJ20, YYD21, PBPH+20, NPB20, WDYK20, ZJD20, MDSV21, ZDJ20, RLD+21]. In most prior work, up to a scaling factor, the standard assumption on the reward values is that $r_h \in [0, 1/H]$ for all $h \in [H]$ and hence $\sum_{h=1}^{H} r_h \le 1$, where $r_h$ is the reward value collected at step $h$ of an episode. We refer to this assumption as the reward uniformity assumption. However, Jiang and Agarwal [JA18] point out that one should only impose an upper bound on the summation of the reward values, i.e., $\sum_{h=1}^{H} r_h \le 1$, which we refer to as the bounded total reward assumption, instead of imposing uniformity, which could be much stronger. This new assumption allows a fair comparison between long-horizon and short-horizon problems, and is also more natural when the reward signal is sparse. Also note that algorithms designed under the reward uniformity assumption can be modified to work under the bounded total reward assumption; however, the sample complexity is usually worsened by $\mathrm{poly}(H)$ factors. Many existing results in tabular RL focus on regret minimization, where the goal is to collect maximum cumulative reward within a limited number of interactions with the environment. To draw comparisons with our results, we convert them, using standard techniques, to the PAC setting, where the algorithm aims to obtain an $\varepsilon$-optimal policy while minimizing the number of episodes of environment interactions (or batches of queries in the generative model).
Under the reward uniformity assumption, a line of work has attempted to provide tight sample complexity bounds [AOM17, DB15, DLB17, DLWB19, JAZBJ18, OVR17]. To obtain an $\varepsilon$-optimal policy, state-of-the-art results give a two-term bound on the number of episodes that suffice³ [DLWB19, AOM17]. In particular, the first term matches the lower bound up to logarithmic factors [DB15, OVR16, AOM17], while the second term has at least linear dependence on $H$. Moreover, converting these bounds to the bounded total reward setting induces extra $\mathrm{poly}(H)$ factors; for instance, the sample complexity in [AOM17, DLWB19] picks up such factors under the bounded total reward assumption.
³$\widetilde{O}(\cdot)$ omits logarithmic factors.
Recently, there has been a surge of interest in designing algorithms under the bounded total reward assumption. In particular, Zanette and Brunskill [ZB19] develop an algorithm whose sample complexity contains a second term that still scales polynomially with $H$. Wang et al. [WDYK20] show that it is possible to obtain an $\varepsilon$-optimal policy with a sample complexity that depends on the horizon only through $\mathrm{polylog}(H)$ factors, establishing the first sample complexity with polylogarithmic dependence on the horizon length $H$. They achieve this result by using the following ideas: (1) samples collected by different policies can be reused to evaluate other policies; (2) to evaluate all policies in a finite set $\Pi$, the number of sample episodes required scales only logarithmically with $|\Pi|$; (3) one can construct a set of policies $\Pi$ that contains at least one $\varepsilon$-optimal policy for any MDP by using $\varepsilon$-nets over the reward values and the transition probabilities. Here $\Pi$ contains all optimal non-stationary policies of each MDP in the $\varepsilon$-net. The accuracy of the $\varepsilon$-net needs to scale inverse-polynomially with $H$, which induces an inevitable $\mathrm{polylog}(H)$ factor. Other $\mathrm{polylog}(H)$ factors come from Step (2), where they use a potential function, increasing over the run of the algorithm, to measure its progress. It is unclear how to remove any of the log factors in their sample complexity upper bound, and they also conjecture that the optimal sample complexity could still grow with $\log H$, which is refuted by our new result.
Another recent work [ZJD20] obtains a more efficient algorithm whose sample complexity has only polylogarithmic dependence on $H$. Zhang et al. [ZJD20] achieve such a sample complexity by using a novel upper confidence bound (UCB) analysis on a model-based algorithm. In particular, they use a recursive structure to bound the total variance in the MDP, which in turn bounds the final sample complexity. However, their bound critically relies on an upper bound on the total variance of value function moments that inevitably induces $\mathrm{polylog}(H)$ factors. It is unlikely that one can get rid of all $\log H$ factors in their sample complexity by following their approach.
In the generative model, there is also a long line of research studying the sample complexity of finding near-optimal policies; see, e.g., [KS99, Kak03, SY94, AMK13, SWWY18, SWW+18, AKY19, LWC+20]. However, most of these works target the infinite-horizon discounted setting, in which case the problem becomes much easier. This is because in the infinite-horizon discounted setting, there always exists a stationary optimal policy, and the total number of such policies depends only on the number of states $S$ and the number of actions $A$. Sidford et al. [SWW+18] develop an algorithm based on a variance-reduced estimator that finds an $\varepsilon$-optimal policy under the reward uniformity assumption. However, their result relies on splitting samples into sub-groups, and therefore, the lower order term of their sample complexity fundamentally depends on $H$. To the best of our knowledge, even in the stronger generative model, no algorithm for finite-horizon MDPs achieves a sample complexity that is independent of $H$ under the bounded total reward assumption.
2 Preliminaries
Notations.
Throughout this paper, for a given positive integer $N$, we use $[N]$ to denote the set $\{1, 2, \ldots, N\}$. For a condition $\mathcal{E}$, we use $\mathbb{I}[\mathcal{E}]$ to denote the indicator function, i.e., $\mathbb{I}[\mathcal{E}] = 1$ if $\mathcal{E}$ holds and $\mathbb{I}[\mathcal{E}] = 0$ otherwise.
For a random variable $X$ and a real number $\delta \in (0, 1)$, its $\delta$-quantile $Q^{\delta}(X)$ is defined so that $\Pr[X \ge Q^{\delta}(X)] \ge \delta$.
Markov Chains.
Let $M = (\mathcal{S}, P, \mu)$ be a Markov chain, where $\mathcal{S}$ is the state space, $P$ is the transition operator and $\mu$ is the initial state distribution. A Markov chain induces a sequence of random states
$s_1, s_2, s_3, \ldots,$ where $s_1 \sim \mu$ and $s_{t+1} \sim P(\cdot \mid s_t)$ for each $t \ge 1$.
Finite-Horizon Markov Decision Processes.
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, H, \mu)$ be an $H$-horizon Markov Decision Process (MDP), where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $P$ is the transition operator which takes a state-action pair and returns a distribution over states, $R$ is the reward distribution, $H$ is the planning horizon (episode length), and $\mu$ is the initial state distribution. In the learning setting, $P$, $R$ and $\mu$ are unknown but fixed, while $\mathcal{S}$, $\mathcal{A}$ and $H$ are known to the learner. Throughout, we assume $H$ is sufficiently large compared to the other problem parameters, since otherwise we can apply existing algorithms (e.g. [ZJD20]) to obtain a sample complexity that is independent of $H$. Throughout the paper, for a state $s$, we occasionally abuse notation and use $s$ to denote the deterministic distribution that always takes value $s$.
Interacting with the MDP.
We now introduce how an RL agent (or an algorithm) interacts with an unknown MDP. In the setting of Theorem 1.1, in each episode, the agent first receives an initial state $s_1 \sim \mu$. At each time step $h \in [H]$, the agent first decides an action $a_h$, and then observes and moves to the next state $s_{h+1} \sim P(\cdot \mid s_h, a_h)$ and obtains a reward $r_h \sim R(s_h, a_h)$. The episode stops at time $H$, where the final state is $s_{H+1}$. Note that even if the agent decides to stop before time $H$, e.g. at time $h < H$, this still counts as one full episode.
In the generative model setting (as in Theorem 1.3), the agent is allowed to query samples starting from any state-action pair instead of only from the initial state distribution $\mu$, and is allowed to restart at any time. The sample complexity is defined as the number of batches of queries (each batch consists of $H$ queries) the agent uses.
Policy Class.
The final goal of RL is to output a good policy with respect to the unknown MDP that the agent interacts with. In this paper, we consider (non-stationary) deterministic policies which choose an action based on the current state $s$ and the time step $h$. Formally, $\pi = (\pi_1, \pi_2, \ldots, \pi_H)$, where for each $h \in [H]$, $\pi_h : \mathcal{S} \to \mathcal{A}$ maps a given state to an action. The policy $\pi$ induces a (random) trajectory
$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_H, a_H, r_H, s_{H+1},$
where $s_1 \sim \mu$, $a_h = \pi_h(s_h)$, $r_h \sim R(s_h, a_h)$ and $s_{h+1} \sim P(\cdot \mid s_h, a_h)$ for each $h \in [H]$. Throughout the paper, all policies are assumed to be deterministic unless specified otherwise.
Value Functions.
To measure the performance of a policy $\pi$, we define the value of a policy as
$V^{\pi} := \mathbb{E}\left[\sum_{h=1}^{H} r_h\right],$
where the expectation is over the random trajectory induced by executing $\pi$ in $\mathcal{M}$.
Note that in general a policy does not need to be deterministic, and it can even depend on the history, in which case the policy chooses an action at time $h$ based on the entire transition history before $h$. It is well-known (see e.g. [Put94]) that the optimal value of $\mathcal{M}$ can be achieved by a non-stationary deterministic optimal policy $\pi^\star$. Hence, we only need to consider non-stationary deterministic policies.
The goal of the agent is then to find a policy $\pi$ with $V^{\pi} \ge V^{\pi^\star} - \varepsilon$, i.e., to obtain an $\varepsilon$-optimal policy, while minimizing the number of episodes of environment interactions (or batches of queries in the generative model).
For a policy , we also define
to be the -quantile of the visitation frequency of a state-action pair .
Stationary Policies.
For the sake of the analysis, we shall also consider stationary policies. A stationary deterministic policy chooses an action solely based on the current state $s$, i.e., $\pi_1 = \pi_2 = \cdots = \pi_H$. We use $\Pi^{\mathrm{stat}}$ to denote the set of all stationary policies. Note that $|\Pi^{\mathrm{stat}}| = A^{S}$.
We remark that when the horizon length is finite, the value of the best stationary policy and that of the best non-stationary policy can differ significantly. Consider the case where there are two states and two actions. Starting from the first state, taking the first action goes back to that state and obtains a small per-step reward, while taking the second action moves to the second state and obtains a large one-time reward. Taking any action at the second state stays there with reward 0. In this case, the optimal non-stationary policy can collect the per-step reward for the first $H - 1$ steps and exit at time $H$, while deterministic stationary policies can only collect one of the two rewards (and even a randomized stationary policy achieves a value that is smaller by a constant factor).
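To make this gap concrete, the following sketch evaluates both policy classes by backward induction on one instance of the construction above; the specific reward values ($1/(2H)$ for staying and $1/2$ for exiting) and the state/action encoding are our own illustrative choices, consistent with the bounded total reward assumption, and are not taken from the paper.

```python
import itertools
import numpy as np

H = 100       # horizon length
S, A = 2, 2   # states {0: "stay state", 1: "absorbing state"}, actions {0: "stay", 1: "exit"}

# Illustrative choice (ours): at state 0, action 0 stays there with reward 1/(2H);
# action 1 moves to state 1 with reward 1/2. State 1 is absorbing with reward 0,
# so the total reward of any episode is at most 1.
P = np.zeros((S, A, S))
R = np.zeros((S, A))
P[0, 0, 0] = 1.0; R[0, 0] = 1.0 / (2 * H)
P[0, 1, 1] = 1.0; R[0, 1] = 0.5
P[1, :, 1] = 1.0

def value(policy):
    """Value from state 0 of a deterministic policy; policy[h][s] is the action at step h."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        V = np.array([R[s, policy[h][s]] + P[s, policy[h][s]] @ V for s in range(S)])
    return V[0]

# Best non-stationary policy: stay for the first H-1 steps, exit at the last step.
v_nonstat = value([[0, 0]] * (H - 1) + [[1, 0]])
# Best stationary deterministic policy: enumerate all A^S = 4 of them.
v_stat = max(value([list(pi)] * H) for pi in itertools.product(range(A), repeat=S))

print(v_nonstat, v_stat)   # ~0.995 versus 0.5: a constant-factor gap, as claimed
```

The factor-two gap here is consistent with Corollary 4.7 below, which shows that some stationary policy always recovers the value of the best non-stationary policy up to a horizon-independent constant factor.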
For a stationary policy , we use to denote the Markov chain induced by and , where the transition operator is defined so that
Assumption on Rewards.
Below, we formally introduce the bounded total reward assumption.
Assumption 2.1 (Bounded Total Reward).
For any policy $\pi$ and a random trajectory $s_1, a_1, r_1, \ldots, s_H, a_H, r_H, s_{H+1}$ induced by $\pi$, we have
•	$r_h \ge 0$ for all $h \in [H]$; and
•	$\sum_{h=1}^{H} r_h \le 1$ almost surely.
As discussed in the previous section, this assumption is more general than the standard reward uniformity assumption, where $r_h \in [0, 1/H]$ for all $h \in [H]$. Throughout this paper, we assume that the above assumption holds for the unknown MDP that the agent interacts with.
The above assumption in fact implies a very interesting consequence on the reward values.
Lemma 2.1.
Under Assumption 2.1, for any with , for any , if there exists a (possibly non-stationary) policy such that for the random trajectory
induced by executing in , we have
for some , then almost surely and therefore .
Proof.
By the assumption, there exists a trajectory
such that there exists with and . Moreover,
We may assume and , since otherwise we can replace sub-trajectories that start and end with the same state by that state, and the resulting trajectory can still appear with strictly positive probability. Now consider the policy which is defined so that for each , and for each ,
i.e., repeating the trajectory’s actions indefinitely. The policy is defined arbitrarily for other states and time steps.
By executing , with strictly positive probability, is visited for times. Therefore, by Assumption 2.1, with probability and thus .
∎
Discounted Markov Decision Processes.
We also introduce another variant of the MDP, the discounted MDP, which is specified by $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \mu)$, where $\gamma \in (0, 1)$ is a discount factor and all other components have the same meaning as in an $H$-horizon MDP. Note that we consider discounted MDPs only for the purpose of analysis, and the unknown MDP that the agent interacts with is always a finite-horizon MDP. The difference between a discounted MDP and an $H$-horizon MDP is that discounted MDPs have an infinite horizon length, i.e., the length of a trajectory can be infinite. Hence, to measure the value in a discounted MDP, suppose policy $\pi$ obtains a random trajectory $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$; we denote
$V^{\pi}_{\gamma} := \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$
as the discounted value of $\pi$. Throughout, for a (discounted or finite-horizon) MDP, we use the corresponding value function notation, making the underlying MDP explicit when needed.
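For a stationary policy, the discounted value above has a simple closed form: writing $P_{\pi}$ for the transition matrix under $\pi$ and $r_{\pi}$ for the expected one-step reward, the value vector solves the linear system $(I - \gamma P_{\pi})V = r_{\pi}$. The sketch below computes it on a small hypothetical chain; the choice $\gamma = 1 - 1/H$ is only meant to suggest the order of the discount factor used later (Lemma 4.5), not its exact value.

```python
import numpy as np

def discounted_value(P_pi, r_pi, gamma):
    """Discounted value of a stationary policy pi, where P_pi[s, s'] = P(s' | s, pi(s))
    and r_pi[s] = E[r | s, pi(s)]. Solves the Bellman equation V = r_pi + gamma * P_pi V."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Hypothetical 3-state chain, for illustration only.
P_pi = np.array([[0.90, 0.10, 0.00],
                 [0.00, 0.80, 0.20],
                 [0.00, 0.00, 1.00]])
r_pi = np.array([0.0, 0.0, 0.01])
H = 100
print(discounted_value(P_pi, r_pi, gamma=1 - 1 / H))
```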
3 Technical Overview
In this section, we discuss the techniques for establishing our results. To introduce the high-level ideas, we first start with the simpler setting, the generative model, where exploration is not a concern. We then switch to the more challenging RL setting, where we need to carefully design policies to explore the state-action space so that a good policy can be learned. For simplicity, throughout the discussion in this section, we assume that the number of states $S$, the number of actions $A$, and the accuracy $\varepsilon$ are all constants.
Algorithm and Analysis in the Generative Model.
Our algorithm in the generative model is conceptually simple: for each state-action pair , we draw samples from and and then return the optimal policy with respect to the empirical model which is obtained by using the empirical estimators for and (denoted as and ). Here for simplicity, we assume which allows us to focus on the estimation error induced by the transition probabilities. Moreover, we assume that differs from only for a single state-action pair . To further simplify the discussion, we assume that there are only two different states on the support of (say and ).
In order to prove the correctness of the algorithm, we show that for any policy $\pi$, the value of $\pi$ in the empirical model is close to that in the true model. However, standard analysis based on dynamic programming only shows that the difference between the two values could be as large as $H$ times the estimation error on the transition probabilities, which is clearly insufficient for obtaining an algorithm that uses a number of batches of queries independent of $H$. Our main idea here is to show that for most trajectories, the probability of the trajectory in the empirical model is a multiplicative approximation to that in the true model, with a constant approximation ratio.
To establish the multiplicative approximation guarantee, our observation is that one should consider the two states on the support of the transition distribution as a whole. To see this, consider the case where the two successor states each have probability $1/2$. The additive estimation errors on the two transition probabilities are roughly of the same magnitude. Now, consider a trajectory that visits both successor states many times. The multiplicative approximation ratio between the empirical and true probability of the trajectory, if each successor state is treated in isolation, could be exponentially large in the number of visits. However, since the empirical estimator is still a probability distribution, the two per-state errors must sum to zero, so the overestimation of one probability is exactly compensated by the underestimation of the other. Hence the empirical probability of the trajectory is a constant factor approximation to the true probability due to cancellation.
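The following numerical sketch (with made-up parameters) illustrates the cancellation: treating one successor state in isolation, the multiplicative error compounds over the visits and becomes enormous, but because the empirical probabilities still sum to one, the joint error over both successors stays at a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10_000         # a trajectory can visit (s, a) up to H times
N = 10_000         # samples drawn for (s, a); of the same order as H in this illustration
p1, p2 = 0.5, 0.5  # true transition probabilities of the two successor states

# Empirical estimates from N samples: each has additive error of order 1/sqrt(N),
# and the two errors are exactly opposite since the estimates sum to one.
p1_hat = rng.binomial(N, p1) / N
p2_hat = 1.0 - p1_hat

# Successor counts along a trajectory that visits (s, a) for H steps.
m1 = rng.binomial(H, p1)
m2 = H - m1

log_ratio_single = m1 * (np.log(p1_hat) - np.log(p1))                    # one successor alone
log_ratio_joint = log_ratio_single + m2 * (np.log(p2_hat) - np.log(p2))  # both successors together

print(log_ratio_single)  # of order sqrt(H): the single-state ratio is hopeless
print(log_ratio_joint)   # of order 1: the joint ratio gives a constant-factor approximation
```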
In our analysis, to formalize the above intuition, for each trajectory , we take into consideration only when for both and . Here is the number of times that is visited on and is the number of times that is visited on . By Chebyshev’s inequality (see Lemma 5.7 for details), we only ignore a small subset of trajectories whose total probability can be upper bounded by a constant. For the remaining trajectories, it can be shown that is a constant factor approximation to so long as for both and due to the cancellation mentioned above. The formal argument is given in Lemma 5.5. Note that using samples, holds only when . On the other hand, we can also ignore trajectories that visit with since such trajectories have negligible cumulative probability by Markov’s inequality (formalized in Lemma 5.6).
The above analysis can be readily generalized to handle perturbation on the transition probabilities of multiple state-action pairs, and to handle the case when the transition operator is not supported on two states. In summary, by using samples for each state-action pair , the empirical model provides a constant factor approximation to the probabilities of all trajectories, except for a small subset of them whose cumulative probability can be upper bounded by a constant. Hence, for all policies, the empirical model provides an accurate estimate to its value and thus, the optimal policy with respect to the empirical model is near-optimal.
Exploration by Stationary Policies.
In the discussion above, we heavily rely on the ability of the generative model to obtain samples for each state-action pair. However, in the RL setting, it is not possible to reach every state-action pair freely. Although each trajectory contains $H$ state-action-state tuples (corresponding to a batch of $H$ queries in the generative model), these samples may not cover states that are crucial for learning an optimal policy. Indeed, one could use all possible deterministic non-stationary policies to collect samples, which would then cover the whole reachable state-action space. Unfortunately, such a naïve method introduces a dependence on the number of non-stationary policies, which is exponential in $H$. The sample complexity of other existing methods in the literature also inevitably depends on $H$, as their sample complexity intrinsically depends on the number of non-stationary policies.
In this work, we adopt a completely different approach for exploration. Our new idea is to show that if there exists a non-stationary policy that visits a state-action pair $(s, a)$ a certain number of times in expectation, then there exists a stationary policy that visits $(s, a)$ a comparable number of times in expectation. This is formalized in Lemma 4.7. If the above claim is true, then intuitively, one can simply enumerate all stationary policies and sample trajectories using each of them to obtain samples of $(s, a)$. Note that there are only $A^{S}$ stationary policies, which is completely independent of $H$. In order to prove the above claim, we show that for any stationary policy, its value in the infinite-horizon discounted setting is close to that in the finite-horizon undiscounted setting (up to a constant factor) by using a properly chosen discount factor (see Lemma 4.5). Note that this implies the correctness of the above claim since there always exists a stationary optimal policy in the infinite-horizon discounted setting.
In order to show that the value of a stationary policy in the infinite-horizon discounted setting is close to that in the finite-horizon setting, we study reaching probabilities in time-invariant Markov chains. In particular, we show in Lemma 4.4 that in a time-invariant Markov chain, the probability of reaching a specific state within a constant multiple of $H$ steps is close to the probability of reaching that state within $H$ steps, up to a constant factor. Previous literature on time-invariant Markov chains mostly focuses on the asymptotic behavior, and as far as we are aware, we are the first to prove the above claim. Note that this claim directly establishes a connection between the value of a stationary policy in the infinite-horizon discounted setting and that in the finite-horizon setting. Moreover, as a direct consequence of the above claim, we can show that the value of a stationary policy within $2H$ steps is close to that of the same policy within $H$ steps, up to a constant factor. This consequence is crucial for later parts of the analysis.
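The following sketch checks the flavor of this claim numerically on a hypothetical chain, reading it as a comparison between the first $H$ steps and the first $2H$ steps (the form the argument above actually uses); the chain and the constants are our own illustrative choices.

```python
import numpy as np

def reach_prob_within(P, mu, target, t):
    """Probability that the chain (transition matrix P, initial distribution mu)
    visits `target` within its first t steps: make `target` absorbing and
    propagate the state distribution for t - 1 steps."""
    P_abs = P.copy()
    P_abs[target] = 0.0
    P_abs[target, target] = 1.0
    dist = mu.copy()
    for _ in range(t - 1):
        dist = dist @ P_abs
    return dist[target]

# A chain where the target state 1 is hard to reach from state 0.
P = np.array([[0.999, 0.001],
              [0.000, 1.000]])
mu = np.array([1.0, 0.0])
H = 50
p_H = reach_prob_within(P, mu, target=1, t=H)
p_2H = reach_prob_within(P, mu, target=1, t=2 * H)
print(p_H, p_2H, p_2H / p_H)   # the ratio stays bounded by a small constant
```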
From Expectation to Quantile.
The above analysis shows that if there exists a non-stationary policy that visits $(s, a)$ a certain number of times in expectation, then our algorithm, which uses all stationary policies to collect samples, also visits $(s, a)$ a comparable number of times in expectation. However, this does not necessarily mean that one can obtain samples of $(s, a)$ with good probability by sampling trajectories using our algorithm. To see this, consider the case when our policy visits $(s, a)$ many times with tiny probability and does not visit it at all otherwise. In this case, our policy may not obtain even a single sample of $(s, a)$ unless one rolls out the policy a huge number of times. Therefore, instead of obtaining a visitation frequency guarantee which holds in expectation, it is more desirable to obtain a visitation frequency guarantee that holds with good probability.
To resolve this issue, we establish a connection between the expectation and the quantiles of the visitation frequency of a stationary policy. We note that such a connection cannot hold without any restriction. To see this, consider a policy that visits $(s, a)$ many times but only with tiny probability. In this case, the expected visitation frequency can be large while the quantile is zero. On the other hand, if the initial state is $s$ almost surely, then such a connection is easy to establish by using the Martingale Stopping Theorem. In particular, in Corollary 4.9, we show that if there exists a non-stationary policy that visits $(s, a)$ a certain number of times with good probability, then there exists a stationary policy that visits $(s, a)$ a comparable number of times with constant probability, when the initial state is $s$ almost surely.
In general, the initial state comes from a distribution $\mu$ and could be different from $s$ with high probability. To tackle this issue, in our algorithm, we simultaneously enumerate two stationary policies $\pi_1$ and $\pi_2$. Here $\pi_1$ should be thought of as the policy that visits $s$ with the highest probability within $H$ steps starting from the initial state distribution $\mu$, and $\pi_2$ should be thought of as the policy that maximizes the quantile of the visitation frequency of $(s, a)$ within $H$ steps when starting from $s$. In our algorithm, we execute $\pi_1$ before $(s, a)$ is visited for the first time, and switch to $\pi_2$ once $(s, a)$ has been visited. Intuitively, we first use $\pi_1$ to reach $(s, a)$ for the first time and then use $\pi_2$ to collect as many samples as possible. As mentioned above, the value of a stationary policy within $2H$ steps is close to the value of the same policy within $H$ steps, up to a constant factor. Thus, by sampling the above policy (formed by concatenating $\pi_1$ and $\pi_2$) sufficiently many times, we obtain enough samples of $(s, a)$, provided there exists a non-stationary policy that visits $(s, a)$ a comparable number of times with good probability. This is formalized in Lemma 5.1.
Perturbation Analysis in the RL Setting.
By the above analysis, suppose is the largest integer such that there exists a non-stationary policy that visits with probability for times, then our dataset contains samples of . However, could be significantly smaller than and therefore the perturbation analysis established in the generative model no longer applies here. For example, previously we show that if , then is a constant factor approximation to when for both and . However, if , it is hopeless to obtain an estimate with . Fortunately, our perturbation analysis still goes through so long as and , i.e., replacing all appearances with .
The above analysis introduces a final subtlety in our algorithm. In particular, the visitation frequency of $(s, a)$ in the empirical model could be significantly larger than that in the true model, while the number of samples of $(s, a)$ in our dataset is bounded in terms of a quantity defined by the true model. This means the value estimated in the empirical model could be significantly larger than that in the true model. To resolve this issue, we employ the principle of “pessimism in the face of uncertainty”: for each policy $\pi$, the estimated value of $\pi$ is set to be the lowest value among all models that lie in the confidence set. Since the true model always lies in the confidence set, the estimated value is then guaranteed to be close to the true value.
4 Properties of Stationary Policies
In this section, we prove several properties of stationary policies. In Section 4.1, we first prove properties regarding the reaching probabilities in Markov chains, and then use them to prove properties for stationary policies in Section 4.2.
4.1 Reaching Probabilities in Markov Chains
Let be a Markov chain. For a positive integer and a sequence of states , we write
to denote the probability of in . For a state and an integer , we also write
to denote the probability of reaching with exactly steps.
Our first lemma shows that for any Markov chain , for any sequence of states with , there exists a sequence of states with so that .
Lemma 4.1.
Let be a Markov chain. For a sequence of states
with , there exists a sequence of states
with , and .
Proof.
By pigeonhole principle, since , there exists such that . Consider the sequence induced by removing from , i.e.,
Since , we have
and
Therefore, we have . We continue this process until the length is at most . ∎
Combining Lemma 4.1 with a simple counting argument directly implies the following lemma, which shows that .
Lemma 4.2.
Let be a Markov chain. For any ,
Proof.
Consider a sequence of states with and . By Lemma 4.1, there exists another sequence of states with and so that . Therefore
which implies
Therefore,
∎
By applying Lemma 4.2 in a Markov chain with modified initial state distribution and transition operator, we can also prove that for any integer and integer .
Lemma 4.3.
Let be a Markov chain. For any integer and integer , for any ,
Proof.
We define a new Markov chain based on . The state space of is the same as that of . The initial state distribution is the same as the distribution of in , i.e., the distribution after taking steps in . The transition operator is defined so that taking one step in is equivalent to taking steps in , i.e.,
Clearly, for any state , . By using Lemma 4.2 in , for any , we have
which implies
∎
Finally, Lemma 4.3 implies the main result of this section, which shows that for any , .
Lemma 4.4.
Let be a Markov chain. For any and ,
Proof.
On the other hand,
Note that if , then . Moreover, if , then we have , which implies
Hence, we have
∎
4.2 Implications of Lemma 4.4
In this section, we list several implications of Lemma 4.4 which would be crucial for analysis in later sections.
Our first lemma shows that for any MDP and any stationary policy , for a properly chosen discount factor , is a multiplicative approximation to with approximation ratio .
Lemma 4.5.
For any MDP and any stationary policy , if , by taking ,
As another implication of Lemma 4.4, for any MDP and any stationary policy , we have
Lemma 4.6.
For any MDP and any stationary policy , if ,
Proof.
As a corollary of Lemma 4.5, we show that for any -horizon MDP , there always exists a stationary policy whose value is as large as the best non-stationary policy up to a factor of .
Corollary 4.7.
For any MDP , if , then there exists a stationary policy such that
Proof.
In this proof we fix . We also use to denote a non-stationary policy such that when and is defined arbitrarily when .
By applying Corollary 4.7 in an MDP with an extra terminal state , we can show that for any , there always exists a stationary policy that visits in the first time steps with probability as large as the probability that the best non-stationary policy visits in all the time steps, up to a factor of .
Corollary 4.8.
For any MDP , if , then for any , there exists a stationary policy , such that for any (possibly non-stationary) policy ,
where
is a random trajectory induced by executing in and
is a random trajectory induced by executing in .
Proof.
For the given MDP , we create a new MDP
where is a state such that . Moreover,
and
Clearly, for any policy ,
where
is a random trajectory induced by executing in . Therefore, by Corollary 4.7, there exists a stationary policy such that for any (possibly non-stationary) policy ,
Moreover, by Lemma 4.6, for any (possibly non-stationary) policy ,
which implies the desired result. ∎
Finally, by combining Lemma 4.6 and Corollary 4.7, we can show that for any , if the initial state distribution is and there exists a non-stationary policy that visits for times with probability in all the steps, then there exists a stationary policy that visits for times with constant probability in the first steps.
Corollary 4.9.
For a given MDP and a state-action pair , suppose the initial state distribution is and . If there exists a (possibly non-stationary) policy such that for some integer , then there exists a stationary policy such that
where
is a trajectory induced by executing in .
Proof.
If then the lemma is clearly true. Now consider the case . Consider a new MDP where . Clearly, . By Corollary 4.7, there exists a stationary policy such that
By Lemma 4.6,
This implies .
Now we use to denote a random variable which is defined to be
Here the trajectory
is induced by executing the stationary policy in . We also write . We use to denote a sequence of i.i.d. copies of . We use to denote a random variable which is defined to be
Clearly, almost surely. Moreover, is a stationary policy, the initial state distribution deterministically and , which implies and have the same distribution. Indeed, whenever the trajectory visits , it corresponds to restarting a new copy of .
Now for each , we define . Clearly . Let and for all . Clearly is a stopping time, and
since for all . By the Martingale Stopping Theorem, we have
which implies and therefore
where we use the fact that
Let . By Markov’s inequality, with probability at least ,
in which case . Consequently,
∎
5 Algorithm in the RL Setting
In this section, we present our algorithm in the RL setting together with its analysis. Our algorithm is divided into two parts. In Section 5.1, we first present the algorithm for collecting samples together with its analysis. In Section 5.2, we establish a perturbation analysis on the value functions which is crucial for the analysis in later proofs. Finally, in Section 5.3, we present the algorithm for finding near-optimal policies based on the dataset found by the algorithm in Section 5.1, together with its analysis based on the machinery developed in Section 5.2.
5.1 Collecting Samples
In this section, we present our algorithm for collecting samples. The algorithm is formally presented in Algorithm 1. The dataset returned by Algorithm 1 consists of lists, where the elements of each list are tuples of the form $(s, a, r, s')$. To construct these lists, Algorithm 1 enumerates a state-action pair $(s, a)$ and a pair of stationary policies $(\pi_1, \pi_2)$, and then collects a trajectory using $\pi_1$ and $\pi_2$. More specifically, $\pi_1$ is executed until the trajectory visits $(s, a)$, at which point $\pi_2$ is executed until the last step.
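A minimal sketch of one such data-collection episode is given below. Here `env` is a hypothetical episodic simulator with `reset()`/`step()` methods, `pi1` and `pi2` are stationary policies stored as arrays indexed by state, and the exact switching rule (switch as soon as the target pair has been taken once) is our reading of the description above rather than a verbatim transcription of Algorithm 1.

```python
def collect_episode(env, H, s_target, a_target, pi1, pi2):
    """Roll out one H-step episode: follow the stationary policy pi1 until the target
    pair (s_target, a_target) has been visited once, then follow pi2 for the rest of
    the episode. Returns the list of observed (s, a, r, s') tuples."""
    trajectory, visited = [], False
    s = env.reset()
    for _ in range(H):
        a = (pi2 if visited else pi1)[s]
        s_next, r = env.step(a)          # hypothetical simulator interface
        trajectory.append((s, a, r, s_next))
        if (s, a) == (s_target, a_target):
            visited = True
        s = s_next
    return trajectory
```

The full collection loop repeats this for every target pair $(s, a)$ and every ordered pair of stationary policies $(\pi_1, \pi_2)$; since there are only $A^{S}$ stationary policies, the number of such combinations is independent of $H$.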
Throughout this section, we use to denote the underlying MDP that the agent interacts with. For each and , let
be a trajectory where and for all , for all , and
for all . Note that the above trajectory is the one collected by Algorithm 1 when a specific state-action pair and a specific pair of policies are used.
For any , define
Clearly, is the -quantile of the frequency that appears in each .
In Lemma 5.1, we first show that for each , if there exists a policy that visits for times with probability at least , then
Lemma 5.1.
Let be a given real number. For each , let be the largest integer such that there exists a (possibly non-stationary) policy such that . Then for each ,
Proof.
For each , there exists a (possibly non-stationary) policy such that . Here we consider the case that , since otherwise the lemma clearly holds. By Corollary 4.8, there exists a stationary policy such that
where
is a random trajectory induced by executing in .
In the remaining part of the analysis, we consider two cases.
Case I: .
Let
be a random trajectory induced by executing in . Let be the random variable which is defined to be
Clearly,
Therefore, there exists such that
and
Note that we must have , since otherwise .
Now we consider a new MDP where . Let be an arbitrary policy so that for all . Clearly,
where
is a random trajectory induced by executing in . Therefore, by Corollary 4.9, there exists a stationary policy such that
where
is a random trajectory induced by executing in . Since and thus , we must have .
Now we consider the case when and . Since ,
Therefore, let be the random variable which is defined to be
We have that
Moreover, for each , since ,
Therefore,
Since , we have
and thus
Case II: .
Consider the case when . Clearly,
and thus
∎
Now we show that, for a given percentile, for the dataset returned by Algorithm 1, each state-action pair appears sufficiently many times in sufficiently many of the lists returned by Algorithm 1.
Lemma 5.2.
Let be a given real number. Let be the dataset returned by Algorithm 1 where
Suppose . With probability at least , for each , we have
Proof.
For each , by the definition of , for each , we have
Hence, the desired result follows by Chernoff bound and a union bound over all . ∎
We also need a subroutine to estimate for some to be decided. Such estimates are crucial for building estimators for the transition probabilities and the rewards with bounded variance, which we elaborate in later parts of this section.
Our algorithm for estimating these quantiles is described in Algorithm 2. Algorithm 2 collects lists, where the elements of each list are tuples of the form $(s, a, r, s')$. These lists are collected using the same approach as in Algorithm 1. Once these lists are collected, for each state-action pair $(s, a)$, our estimate is set to be an appropriate order statistic (the $k$-th largest, for a suitably chosen $k$) of the counts recording how many times $(s, a)$ appears in each of the lists.
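A sketch of the resulting estimator, assuming the lists have already been collected as above; the index `k` (which order statistic of the per-list counts to return) is left as a parameter, since the precise choice comes from the Chernoff-bound argument in Lemma 5.3.

```python
from collections import Counter

def estimate_quantile(lists, s, a, k):
    """Estimate of the visitation-frequency quantile of the pair (s, a): count how many
    times (s, a) appears in each collected list of (s, a, r, s') tuples, and return the
    k-th largest of these counts (1-indexed)."""
    counts = sorted((Counter((si, ai) for (si, ai, _, _) in lst)[(s, a)] for lst in lists),
                    reverse=True)
    return counts[k - 1]
```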
We now show that for each , is an accurate estimate of .
Lemma 5.3.
Let be the function returned by Algorithm 2. With probability at least , for all ,
Proof.
Fix a state-action pair . For each , define
where
For each , by the definition of , we have and thus . By Chernoff bound, with probability at most ,
On the other hand, for each , define
where
For each , by the definition of , we have and thus . By Chernoff bound, with probability at most ,
Hence, by union bound, with probability at least ,
and
in which case the -th largest element in is in . We finish the proof by a union bound over all . ∎
In Lemma 5.4, we show that using the dataset returned by Algorithm 1, and the estimates of the quantiles returned by Algorithm 2, we can compute accurate estimates of the transition probabilities and rewards. The estimators used in Lemma 5.4 are the empirical estimators, with proper truncation if a list contains too many samples. As will be made clear in the proof, such truncation is crucial for obtaining estimators with bounded variance.
Lemma 5.4.
Suppose Algorithm 2 is invoked with the percentile set to be and the failure probability set to be , and Algorithm 1 is invoked with . Let be the estimates returned by Algorithm 2. Let be the dataset returned by Algorithm 1 where
For each , for each and , define
For each , define
and
Then with probability at least , for all with , we have
and
Proof.
Fix a state-action pair and . For each and , let be the filtration induced by
For each and , define
and
Clearly,
and
which implies
and
and thus
Moreover, for any and , we have
and
Note that for each ,
and
Furthermore, for each and ,
and
Since
we have
and
Now, for each , define
and
We have ,
and
Also note that
and
By Bernstein’s inequality,
Thus, by setting , we have
By applying a union bound over all , with probability at least , for all ,
which we define to be event . Note that conditioned on and the events in Lemma 5.3 and Lemma 5.2, we have
which implies
and thus
for all .
When
we have
and therefore
When
we have
and thus
which implies
Hence, conditioned on and the events in Lemma 5.3 and Lemma 5.2, for all ,
By Bernstein’s inequality,
Thus, by setting , we have
By applying a union bound over all , with probability at least , for all ,
which we define to be event . Note that conditioned on and the events in Lemma 5.3 and Lemma 5.2, we have
Finally, for each , for each , define
Note that
Therefore, by Chernoff bound, with probability at least we have
Hence, with probability at least , for all , we have
which we define to be event .
5.2 Perturbation Analysis
In this section, we establish a perturbation analysis on the value functions which is crucial for the analysis in the next section. We first recall a few basic facts.
Fact 5.1.
Let be a real number, we have
-
1.
;
-
2.
.
We now prove the following lemma using the above facts.
Lemma 5.5.
Let , be positive integers. Let be some real numbers. Let be a vector with . Let be a vector such that for each , and . For every and every such that for all , we have
Proof.
Note that
where
Clearly,
By the choice of , we have
Using Fact 5.1, for all , we have
Hence, we have
Note that , , , and . We have,
By the choice of , we have , and therefore
∎
In the following lemma, we show that for any , with probability at least , the number of times is visited can be upper bounded in terms of the -quantile of the number of times is visited and .
Lemma 5.6.
For a given MDP . Suppose a random trajectory
is obtained by executing a (possibly non-stationary) policy in . For any , with probability at least , we have
Proof.
For each , define
Let be the event that for all , . By definition of , we have
For each , let be the filtration induced by . For each , define
and
When , we have
When , we have
which implies
Note that
which implies
By Markov’s inequality, with probability at least ,
which we denote as event .
Conditioned on which happens with probability , we have
∎
In the following lemma, we show that for any , the number of times is visited should be close to the number of times is visited times .
Lemma 5.7.
For a given MDP . Suppose a random trajectory
is obtained by executing a policy in . For any , with probability at least , we have
Proof.
For each , define
Let be the event that for all , . By definition of , we have
For each , let be the filtration induced by . For each , define
As we have shown in the proof of Lemma 5.6, for each , . Moreover, for any , we have
Therefore,
Note that for each .
As we have shown in the proof of Lemma 5.6, for each ,
which implies
By Chebyshev’s inequality, we have with probability at least ,
which we denote as event .
Conditioned on which happens with probability , we have
∎
Using Lemma 5.5, Lemma 5.6 and Lemma 5.7, we now present the main result in this section, which shows that for two MDPs and that are close enough in terms of rewards and transition probabilities, for any policy , its value in should be lower bounded by that in up to an error of .
Lemma 5.8.
Let be an MDP and be a policy. Let be a parameter. For each , define . Let be another MDP. If for all with , we have
and
then
Proof.
Define be set of all possible trajectories, where for each , has the form
For a trajectory , for each , we write
as the number of times is visited and
as the number of times is visited. We say a trajectory
is compatible with a (possibly non-stationary) policy if for all ,
For a (possibly non-stationary) policy , we use to denote the set of all trajectories that are compatible with .
For an MDP and a (possibly non-stationary) policy , for a trajectory that is compatible with , we write
to be the probability of when executing in . Here we assume .
Using these definitions, we have
Note that for any trajectory , if , by Assumption 2.1,
We define be the set of trajectories that for each , for each ,
By a union bound over all , we have
We also define be the set of trajectories that for each , for each ,
By Lemma 5.7 and a union bound over all , we have
Finally, we define be the set of trajectories such that for each , for each ,
By Lemma 5.6 and a union bound over all , we have
Thus, by defining , we have
Note that for each with
for each state-action , we must have . This is because Moreover, for any , if , then
This is because , and if
then
For each , define
Therefore, for each ,
and
with
Note that , which implies
By applying Lemma 5.5 and setting to be , to be , to be , and to be , we have
Therefore,
which implies
For the summation of rewards, we have
For those with , we have . By Lemma 2.1, we have
Since and , we have
For those with , we have
Thus,
For each with
we have
Since
we have
∎
Now we show that for two MDPs and with the same transition probabilities and close enough rewards, for any policy , its value in should be upper bounded by that in up to an error of .
Lemma 5.9.
Let be an MDP and be a policy. Let be a parameter. For each , define . Let be another MDP. If for all with , we have
then
5.3 Pessimistic Planning
We now present our final algorithm in the RL setting. The formal description is provided in Algorithm 3. In our algorithm, we first invoke Algorithm 1 to collect a dataset and then invoke Algorithm 2 to estimate for some properly chosen . We then use the estimators in Lemma 5.4 to define , and . Note that Lemma 5.4 not only provides an estimator but also provides a computable confidence interval for and , which we also utilize in our algorithm.
At this point, a natural idea is to find the optimal policy with respect to the MDP defined by , and . However, our Lemma 5.8 only provides a lower bound guarantee for without any upper bound guarantee. We resolve this issue by pessimistic planning. More specifically, for any policy , we define its pessimistic value to be
where the confidence set includes all MDPs whose transition probabilities are within the confidence intervals provided in Lemma 5.4. We simply return the policy that maximizes the pessimistic value. Since the true MDP lies in the confidence set, the pessimistic value is never an overestimate. On the other hand, Lemma 5.8 guarantees that it is also lower bounded by the true value up to a small error. Therefore, the pessimistic value provides an accurate estimate of the true value of each policy. However, note that Lemma 5.4 does not provide a computable confidence interval for the rewards. Fortunately, as we have shown in Lemma 5.9, perturbations of the rewards cannot significantly increase the value of a policy, and thus the estimate is still accurate.
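A sketch of one way to compute such a pessimistic value is given below. The paper's confidence set constrains each transition probability individually (Lemma 5.4); the sketch substitutes a simpler $\ell_1$-style confidence ball per state-action pair, for which the inner minimization has a standard closed form (shift probability mass from the highest-value successors to the lowest-value one). The interface and the radii `xi` are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def pessimistic_backup(p_hat, V, xi):
    """min_p p @ V over distributions p with ||p - p_hat||_1 <= xi: move xi/2 mass
    (or as much as is available) onto the lowest-value successor, taking it away
    from the highest-value successors first."""
    p = p_hat.astype(float).copy()
    lo = int(np.argmin(V))
    budget = min(xi / 2.0, 1.0 - p[lo])
    p[lo] += budget
    for s in np.argsort(V)[::-1]:
        if budget <= 0 or s == lo:
            continue
        take = min(budget, p[s])
        p[s] -= take
        budget -= take
    return float(p @ V)

def pessimistic_value(P_hat, R_hat, xi, policy, H, mu):
    """Pessimistic H-step value of a non-stationary policy under per-(s, a) radii xi;
    policy[h][s] is the action at step h in state s."""
    V = np.zeros(P_hat.shape[0])
    for h in reversed(range(H)):
        V = np.array([R_hat[s, policy[h][s]]
                      + pessimistic_backup(P_hat[s, policy[h][s]], V, xi[s, policy[h][s]])
                      for s in range(P_hat.shape[0])])
    return float(mu @ V)
```

Maximizing this pessimistic value over policies, as Algorithm 3 does, then yields the returned policy; for a rectangular confidence set like the one above, the maximization can itself be folded into the same backward recursion by taking a max over actions at every step.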
We now present the formal analysis for our algorithm.
Theorem 5.10.
Proof.
By Lemma 5.4, with probability at least , for all , we have
and
In the remaining part of the analysis, we condition on the above event.
By Lemma 5.1, for any (possibly non-stationary) policy ,
Let be the true MDP. By Lemma 5.8, for any (possibly non-stationary) policy , for any , we have
Moreover, . Therefore, by Lemma 5.9,
Consequently,
Finally, the algorithm samples at most
trajectories. ∎
Theorem 5.10 immediately implies the following corollary.
Corollary 5.11.
Algorithm 3 returns a policy such that
with probability at least . Moreover, the algorithm samples at most
trajectories.
6 Algorithm in the Generative Model
In this section, we present our algorithm in the generative model together with its analysis. The formal description of the algorithm is given in Algorithm 4. Compared to our algorithm in the RL setting, Algorithm 4 is conceptually much simpler. For each state-action pair $(s, a)$, Algorithm 4 first draws samples from $P(\cdot \mid s, a)$ and $R(s, a)$, and then builds an empirical model using the empirical estimators. The algorithm simply returns the optimal policy of the empirical model. In the remaining part of this section, we present the formal analysis of our algorithm.
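A minimal sketch of this procedure, assuming a hypothetical generative-model interface `sampler(s, a) -> (s_next, r)`; the per-pair sample size `N` is a free parameter here, whereas Algorithm 4 fixes it according to Theorem 6.3.

```python
import numpy as np

def learn_with_generative_model(sampler, S, A, H, N):
    """Draw N samples per state-action pair, build the empirical MDP, and return the
    optimal non-stationary policy of the empirical model by backward induction."""
    P_hat = np.zeros((S, A, S))
    R_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                s_next, r = sampler(s, a)
                P_hat[s, a, s_next] += 1.0 / N
                R_hat[s, a] += r / N
    V = np.zeros(S)
    policy = []
    for _ in range(H):                    # backward induction on the empirical model
        Q = R_hat + P_hat @ V             # Q[s, a] = r_hat(s, a) + sum_s' P_hat(s'|s, a) V(s')
        policy.append(Q.argmax(axis=1))   # greedy action for this step
        V = Q.max(axis=1)
    policy.reverse()                      # policy[h][s] is the action at step h
    return policy
```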
The following lemma establishes a perturbation analysis similar to those in Lemma 5.8 and Lemma 5.9. Note that here, due to the availability of a generative model, one can obtain samples for every state-action pair, and hence we can prove both an upper bound and a lower bound on the estimated value. This also explains why pessimistic planning is no longer necessary in the generative model setting.
Lemma 6.1.
Let be an MDP and be a policy. Let be a parameter. Let be another MDP. If for all , we have
and
then
Proof.
Define be set of all possible trajectories, where for each , has the form
For a trajectory , for each , we write
as the number of times is visited and
as the number of times is visited. We say a trajectory
is compatible with a (possibly non-stationary) policy if for all ,
For a (possibly non-stationary) policy , we use to denote the set of all trajectories that are compatible with .
For an MDP and a (possibly non-stationary) policy , for a trajectory that is compatible with , we write
to be the probability of when executing in . Here we assume .
Using these definitions, we have
Note that for any trajectory , if , by Assumption 2.1,
We define be the set of trajectories such that for each , for each , if then
By Lemma 5.6 and a union bound over all , we have
We also define be the set of trajectories such that for each , for each , if then
Again by Lemma 5.6 and a union bound over all , we have
Now we show that
and therefore,
For each , if there exists such that and
then
and thus . Moreover, for any , for all with , we have
Note that this implies . Furthermore, for all with , we have
which implies .
We define be the set of trajectories that for each , for each ,
By Lemma 5.7 and a union bound over all , we have
Again, we define be the set of trajectories that for each , for each ,
Similarly, by Lemma 5.7 and a union bound over all , we have
Now we show that
and therefore,
For any , there exists such that
Note that this implies
since otherwise (because ) and therefore
Since
we have
and
Hence,
and thus .
Note that
Similarly,
For each , define
Therefore, for each ,
and
with
Note that , which implies
By applying Lemma 5.5 and setting to be , to be , to be , and to be , we have
and
Therefore,
and
For the summation of rewards, we have
For those with , since , by Lemma 2.1, we have
Since and , we have
For those with , we have
Thus,
For each with
we have
where
and
Since
and
we have
∎
The following lemma provides error bound on the empirical model built by the algorithm. Note that the error bounds established here are similar to those in Lemma 5.4.
Lemma 6.2.
With probability at least , for all :
and
Proof.
For each , by Bernstein’s inequality,
Thus, by setting , we have
By applying a union bound over all , with probability at least , for all ,
By a similar argument, with probability at least , for all ,
Moreover, with probability at least , for all , we have
We complete the proof by applying a union bound. ∎
Theorem 6.3.
With probability at least , Algorithm 4 returns a policy such that
by using at most batches of queries.
Proof.
By Lemma 6.2, with probability at least , for all , we have
and
We condition on this event in the remaining part of this proof. By Lemma 6.1, for any (possibly non-stationary) policy , we have
Let be the policy returned by the algorithm. We thus have
The total number of samples taken by the algorithm is
∎
Acknowledgements
The authors would like to thank Simon S. Du for helpful discussions, and the anonymous reviewers for helpful comments.
Ruosong Wang was supported in part by NSF IIS1763562, US Army W911NF1920104, and ONR Grant N000141812861. Part of this work was done while Ruosong Wang and Lin F. Yang were visiting the Simons Institute for the Theory of Computing.
References
- [AKY19] Alekh Agarwal, Sham Kakade, and Lin F Yang. On the optimality of sparse model-based planning for Markov decision processes. arXiv preprint arXiv:1906.03804, 2019.
- [AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
- [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- [BT03] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3(Oct):213–231, March 2003.
- [BT09] Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
- [DB15] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
- [DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 5717–5727, Red Hook, NY, USA, 2017. Curran Associates Inc.
- [DLWB19] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1507–1516, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- [DWCW19] Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311, 2019.
- [FPL18] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating markov decision processes. In Advances in Neural Information Processing Systems, pages 2994–3004, 2018.
- [JA18] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398, 2018.
- [JAZBJ18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- [Kak03] Sham M Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London London, England, 2003.
- [KN09] J Zico Kolter and Andrew Y Ng. Near-bayesian exploration in polynomial time. In Proceedings of the 26th annual international conference on machine learning, pages 513–520, 2009.
- [KS99] Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002, 1999.
- [KS02] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
- [LH12] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer, 2012.
- [LWC+20] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
- [MDSV21] Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-learning: Correcting the bias without forgetting. arXiv preprint arXiv:2103.01312, 2021.
- [NPB20] Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning. arXiv preprint arXiv:2007.01891, 2020.
- [ORVR13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. arXiv preprint arXiv:1306.0940, 2013.
- [OVR16] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. ArXiv, abs/1608.02732, 2016.
- [OVR17] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR. org, 2017.
- [PBPH+20] Aldo Pacchiano, Philip Ball, Jack Parker-Holder, Krzysztof Choromanski, and Stephen Roberts. On optimism in model-based reinforcement learning. arXiv preprint arXiv:2006.11911, 2020.
- [Put94] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
- [RLD+21] Tongzheng Ren, Jialian Li, Bo Dai, Simon S Du, and Sujay Sanghavi. Nearly horizon-free offline reinforcement learning. arXiv preprint arXiv:2103.14077, 2021.
- [Rus19] Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. arXiv preprint arXiv:1906.02870, 2019.
- [SJ19] Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. arXiv preprint arXiv:1905.03814, 2019.
- [SL08] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- [SLW+06] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. Pac model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
- [SS10] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML, 2010.
- [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196, 2018.
- [SWWY18] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. Society for Industrial and Applied Mathematics, 2018.
- [SY94] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
- [TM18] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. arXiv preprint arXiv:1803.01626, 2018.
- [Wan17] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017.
- [WDYK20] Ruosong Wang, Simon S Du, Lin F Yang, and Sham M Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
- [YW19] Lin Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
- [YYD21] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- [ZB19] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
- [ZDJ20] Zihan Zhang, Simon S Du, and Xiangyang Ji. Nearly minimax optimal reward-free reinforcement learning. arXiv preprint arXiv:2010.05901, 2020.
- [ZJ19] Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, pages 2823–2832, 2019.
- [ZJD20] Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020.
- [ZZJ20] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020.