Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning

Yuanzhi Li
Carnegie Mellon University
[email protected]
   Ruosong Wang
Carnegie Mellon University
[email protected]
   Lin F. Yang
University of California, Los Angeles
[email protected]

Recently there has been a surge of interest in understanding the horizon-dependence of the sample complexity in reinforcement learning (RL). Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interactions when the number of states and actions is fixed. It is yet unknown whether the $\mathrm{polylog}(H)$ dependence is necessary or not. In this work, we resolve this question by developing an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions, completely settling the horizon-dependence of the sample complexity in RL. We achieve this bound by (i) establishing a connection between value functions in discounted and finite-horizon Markov decision processes (MDPs) and (ii) a novel perturbation analysis in MDPs. We believe our new techniques are of independent interest and could be applied to related questions in RL.

1 Introduction

Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of $H$ steps and receives a sequence of $H$ random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy that maximizes the expected sum of these reward values instead of any single one of them. Since decisions made at early stages could significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when $H=1$, RL reduces to the contextual bandits problem, in which it suffices to act myopically to achieve optimality.

Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is $H$ and the underlying MDP has unknown and time-invariant transition probabilities and rewards. Starting from some unknown but fixed initial state distribution, how many episodes of interaction with the MDP do we need to learn an $\epsilon$-optimal policy, whose value (the expected sum of rewards collected by the policy) differs from that of the optimal policy by at most $\epsilon V_{\max}$? Here $V_{\max}$ is the maximum possible sum of rewards in an episode and $\epsilon>0$ is the target accuracy. Jiang and Agarwal [JA18] conjecture that the number of episodes required by the above problem depends polynomially on $H$ even when the total number of state-action pairs is a constant. Wang et al. [WDYK20] refute this conjecture by showing that the correct sample complexity of episodic RL is at most $\mathrm{poly}((\log H)/\epsilon)$ when there are a constant number of states and actions. The result in [WDYK20] reveals the surprising fact that one can achieve similar sample efficiency regardless of the horizon length of the RL problem. A number of recent results [ZJD20, MDSV21, ZDJ20, RLD+21] further improve the result in [WDYK20], obtaining more (computationally or statistically) efficient algorithms and/or handling more general settings. However, these results all have the $\mathrm{poly}((\log H)/\epsilon)$ factor in their sample complexity, seemingly implying that the $\mathrm{polylog}(H)$ dependence is necessary. Hence, it is natural to ask:

What is the exact dependence on the horizon length $H$ of the sample complexity in episodic RL?

The goal of the current paper is to answer the above question.

Compared to other problems in machine learning, RL is challenging mainly because it requires an effective exploration mechanism. For example, to obtain samples from some state in an MDP, one needs to first learn a policy that can reach that state with good probability. In order to decouple the sample complexity of exploration from that of learning a near-optimal policy, there is a line of research (see, e.g., [KS99, AMK13, SWW+18, Wan17, YW19, LWC+20]) studying the generative model setting, where a learner is able to query as many samples as desired from each state-action pair, circumventing the issue of exploration. The above sample complexity question is still well-posed even with a generative model. However, to allow a fair comparison with the RL setting, we measure the sample complexity as the number of batches, where a batch of queries corresponds to $H$ queries in the generative model. Indeed, in the episodic RL setting, the learner also obtains $H$ samples in each episode. Nevertheless, we are not aware of any result addressing the above question even with a generative model.

1.1 Our Results

In this paper, we settle the horizon-dependence of the sample complexity in episodic RL by developing an algorithm whose sample complexity is completely independent of the planning horizon $H$. We state our results in the following theorems, which target RL with an $H$-horizon MDP (formally defined in Section 2) with state space $\mathcal{S}$ and action space $\mathcal{A}$.

Theorem 1.1 (Informal version of Corollary 5.11).

Suppose the reward at each time step is non-negative and the total reward of each episode is upper bounded by 1 (without loss of generality, we set $V_{\max}=1$). Given a target accuracy $\epsilon>0$ and a failure probability $\delta>0$, our algorithm returns an $\epsilon$-optimal non-stationary policy with probability at least $1-\delta$ by sampling at most $(|\mathcal{S}||\mathcal{A}|)^{O(|\mathcal{S}|)}\cdot\log(1/\delta)/\epsilon^{5}$ episodes, where $|\mathcal{S}|$ is the number of states and $|\mathcal{A}|$ is the number of actions.

Notably, when the number of states $|\mathcal{S}|$, the number of actions $|\mathcal{A}|$, the desired accuracy $\epsilon$, and the failure probability $\delta$ are all fixed constants, the (episode) sample complexity of our algorithm is also a constant. Our result suggests that episodic RL is possible with a sample complexity that is completely independent of the horizon length $H$.

In fact, we can show a much more general result beyond finding the optimal policy. We actually prove that using the same number of episodes, one can construct an oracle that returns an $\epsilon$-approximation of the value of any non-stationary policy $\pi$, as stated in the following theorem.

Theorem 1.2 (Informal version of Theorem 5.10).

In the same setting as Theorem 1.1, with $(|\mathcal{S}||\mathcal{A}|)^{O(|\mathcal{S}|)}\cdot\log(1/\delta)/\epsilon^{5}$ episodes, with probability at least $1-\delta$, we can construct an oracle $\mathcal{O}$ such that for every given non-stationary policy $\pi$, $\mathcal{O}$ returns an $\epsilon$-approximation of the expected total reward of $\pi$.

This theorem suggests that even completely learning the MDP (i.e., being able to simultaneously estimate the value of all non-stationary policies) can be done with a sample complexity that is independent of $H$.

We now switch to the more powerful generative model and show that a better sample complexity can be achieved. In the generative model, the agent can query samples from any state-action pair and the initial state distribution of the environment. Here, the sample complexity is defined as the total number of batches of queries (a batch corresponds to $H$ queries) to the environment needed to obtain an $\epsilon$-optimal policy. (It is well-known that any algorithm requires $\Omega(H)$ queries to find a near-optimal policy (see, e.g., [LH12]), and thus it is reasonable to define a single batch to be $H$ queries.)

Theorem 1.3 (Informal version of Theorem 6.3).

Suppose the reward at each time step is non-negative and the total reward of each episode is upper bounded by 1. Given a target accuracy $\epsilon>0$ and a failure probability $\delta>0$, our algorithm returns an $\epsilon$-optimal non-stationary policy with probability at least $1-\delta$ by sampling at most $O(|\mathcal{S}|^{6}|\mathcal{A}|^{4}\log(1/\delta)/\epsilon^{3})$ batches of queries in the generative model, where $|\mathcal{S}|$ is the number of states and $|\mathcal{A}|$ is the number of actions.

Compared to the result in Theorem 1.1, the sample complexity in Theorem 6.3 has polynomial dependence on the number of states and has better dependence on the desired accuracy $\epsilon$.

We remark that although our algorithms in Theorem 1.1 and Theorem 1.3 achieve tight dependence on $H$, they do not aim to tighten the dependence on the number of states $|\mathcal{S}|$, the number of actions $|\mathcal{A}|$, and the desired accuracy $\epsilon$. In fact, the sample complexity of our algorithms has much worse dependence on $|\mathcal{S}|$, $|\mathcal{A}|$ and $\epsilon$ compared to prior work. See Section 1.2 for a detailed discussion of prior work, and Section 3 for an overview of our techniques.

1.2 Related Work

In this section, we discuss related work on the sample complexity of RL in the tabular setting, where the number of states $|\mathcal{S}|$, the number of actions $|\mathcal{A}|$, and the horizon length $H$ are all assumed to be finite. There is a long line of research studying the sample complexity of tabular RL [KS02, BT03, Kak03, SLW+06, SL08, KN09, BT09, JOA10, SS10, LH12, ORVR13, DB15, AOM17, DLB17, OVR17, JAZBJ18, FPL18, TM18, DLWB19, DWCW19, SJ19, Rus19, ZJ19, ZZJ20, YYD21, PBPH+20, NPB20, WDYK20, ZJD20, MDSV21, ZDJ20, RLD+21]. In most prior work, up to a scaling factor, the standard assumption on the reward values is that $r_{h}\in[0,1/H]$ for all $h$ and hence $\sum_{h}r_{h}\in[0,1]$, where $r_{h}$ is the reward value collected at step $h$ of an episode. We refer to this assumption as the reward uniformity assumption. However, Jiang and Agarwal [JA18] point out that one should only impose an upper bound on the sum of the reward values, i.e., $\sum_{h\in[H]}r_{h}\leq 1$, which we refer to as the bounded total reward assumption, instead of imposing uniformity, which could be much stronger. This new assumption allows a fair comparison between long-horizon and short-horizon problems, and is also more natural when the reward signal is sparse. Also note that algorithms designed under the reward uniformity assumption can be modified to work under the bounded total reward assumption; however, the sample complexity usually worsens by a $\mathrm{poly}(H)$ factor. Many existing results in tabular RL focus on regret minimization, where the goal is to collect maximum cumulative reward within a limited number of interactions with the environment. To draw comparisons with our results, we convert them, using standard techniques, to the PAC setting, where the algorithm aims to obtain an $\epsilon$-optimal policy while minimizing the number of episodes of environment interactions (or batches of queries in the generative model).

Under the reward uniformity assumption, a line of work has attempted to provide tight sample complexity bounds [AOM17, DB15, DLB17, DLWB19, JAZBJ18, OVR17]. To obtain an $\varepsilon$-optimal policy, state-of-the-art results show that $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}}+\frac{\mathrm{poly}(|\mathcal{S}|,|\mathcal{A}|,H)}{\varepsilon}\right)$ episodes suffice [DLWB19, AOM17], where $\widetilde{O}(\cdot)$ omits logarithmic factors. In particular, the first term matches the lower bound $\Omega\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}}\right)$ up to logarithmic factors [DB15, OVR16, AOM17], while the second term has at least linear dependence on $H$. Moreover, converting these bounds to the bounded total reward setting induces extra $\mathrm{poly}(H)$ factors. For instance, the sample complexity in [AOM17, DLWB19] becomes $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|H^{2}}{\varepsilon^{2}}+\frac{\mathrm{poly}(|\mathcal{S}|,|\mathcal{A}|,H)}{\varepsilon}\right)$ under the bounded total reward assumption.

Recently, there has been a surge of interest in designing algorithms under the bounded total reward assumption. In particular, Zanette and Brunskill [ZB19] develop an algorithm which enjoys a sample complexity of $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}}+\frac{\mathrm{poly}(|\mathcal{S}|,|\mathcal{A}|,H)}{\varepsilon}\right)$, where the second term still scales polynomially with $H$. Wang et al. [WDYK20] show that it is possible to obtain an $\epsilon$-optimal policy with a sample complexity of $\mathrm{poly}(|\mathcal{S}||\mathcal{A}|/\epsilon)\cdot\log^{3}H$, establishing the first sample complexity with $\mathrm{polylog}(H)$ dependence on the horizon length $H$. They achieve this result by using the following ideas: (1) samples collected by different policies can be reused to evaluate other policies; (2) to evaluate all policies in a finite set $\Pi$, the number of sample episodes required is at most $\mathrm{poly}(|\mathcal{S}||\mathcal{A}|/\epsilon)\cdot\log|\Pi|\cdot\log^{2}H$; (3) there exists a set of policies $\Pi$ that contains at least one $\epsilon$-optimal policy for any MDP, constructed by taking an $\epsilon$-net over the reward values and the transition probabilities; here $\Pi$ contains all optimal non-stationary policies of each MDP in the $\epsilon$-net. The accuracy of the $\epsilon$-net needs to be at least $\mathrm{poly}(\epsilon/H)$ and hence $|\Pi|=\mathrm{poly}(\epsilon/H)^{|\mathcal{S}||\mathcal{A}|}$, which induces an inevitable $\log H$ factor. Other $\log H$ factors come from Step (2), where they use a potential function which increases from 0 to $H$ to measure the progress of the algorithm. It is unclear how to remove any of the log factors in their sample complexity upper bound, and they also conjecture that the optimal sample complexity could be $\Omega(\mathrm{polylog}(H))$, which is refuted by our new result.

Another recent work [ZJD20] obtains a more efficient algorithm that achieves a sample complexity of $\widetilde{O}\left(\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}}+\frac{|\mathcal{S}|^{2}|\mathcal{A}|}{\epsilon}\right)\cdot\mathrm{polylog}(H)\right)$. Zhang et al. [ZJD20] achieve such a sample complexity by using a novel upper confidence bound (UCB) analysis on a model-based algorithm. In particular, they use a recursive structure to bound the total variance in the MDP, which in turn bounds the final sample complexity. However, their bound critically relies on the fact that the total variance of value function moments is upper bounded by $H$, which inevitably induces $\mathrm{polylog}(H)$ factors. It is unlikely that all $\mathrm{polylog}(H)$ factors in their sample complexity can be removed by following their approach.

In the generative model, there is also a long line of research studying the sample complexity of finding near-optimal policies; see, e.g., [KS99, Kak03, SY94, AMK13, SWWY18, SWW+18, AKY19, LWC+20]. However, most of these works target the infinite-horizon discounted setting, in which case the problem becomes much easier. This is because in the infinite-horizon discounted setting, there always exists a stationary optimal policy, and the total number of such policies depends only on the number of states $|\mathcal{S}|$ and the number of actions $|\mathcal{A}|$. Sidford et al. [SWW+18] develop an algorithm based on a variance-reduced estimator which uses $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}}+H^{2}|\mathcal{S}||\mathcal{A}|\right)$ batches of queries to find an $\epsilon$-optimal policy under the reward uniformity assumption. However, their result relies on splitting samples into $H$ sub-groups, and therefore the lower order term of their sample complexity fundamentally depends on $H$. To the best of our knowledge, even in the stronger generative model, no algorithm for finite-horizon MDPs achieves a sample complexity that is independent of $H$ under the bounded total reward assumption.

2 Preliminaries

Notations.

Throughout this paper, for a given positive integer $H$, we use $[H]$ to denote the set $\{0,1,\ldots,H-1\}$. For a condition $\mathcal{E}$, we use $\mathbb{I}[\mathcal{E}]$ to denote the indicator function, i.e., $\mathbb{I}[\mathcal{E}]=1$ if $\mathcal{E}$ holds and $\mathbb{I}[\mathcal{E}]=0$ otherwise.

For a random variable $X$ and a real number $\varepsilon\in[0,1]$, its $\varepsilon$-quantile $\mathcal{Q}_{\varepsilon}(X)$ is defined so that

$$\mathcal{Q}_{\varepsilon}(X)=\sup\{x\mid\Pr[X\geq x]\geq\varepsilon\}.$$
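To make the definition concrete, here is a minimal sketch (in Python with NumPy; our own illustration, not part of the paper) that estimates $\mathcal{Q}_{\varepsilon}(X)$ from i.i.d. samples by taking the largest sample value $x$ such that at least an $\varepsilon$ fraction of the samples are $\geq x$.

```python
import numpy as np

def empirical_quantile(samples, eps):
    """Estimate Q_eps(X) = sup{x : Pr[X >= x] >= eps} from i.i.d. samples."""
    xs = np.sort(np.asarray(samples, dtype=float))
    n = len(xs)
    best = xs[0]
    for i, x in enumerate(xs):
        # At least (n - i) of the n sorted samples are >= xs[i].
        if (n - i) / n >= eps:
            best = x
    return best

# Sanity check: for Exp(1), Pr[X >= x] = exp(-x), so Q_eps(X) = -ln(eps).
rng = np.random.default_rng(0)
print(empirical_quantile(rng.exponential(size=200000), 0.1), -np.log(0.1))
```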

Markov Chains.

Let $C=(\mathcal{S},P,\mu)$ be a Markov chain, where $\mathcal{S}$ is the state space, $P:\mathcal{S}\to\Delta(\mathcal{S})$ is the transition operator and $\mu\in\Delta(\mathcal{S})$ is the initial state distribution. A Markov chain $C$ induces a sequence of random states

$$s_{0},s_{1},\ldots$$

where $s_{0}\sim\mu$ and $s_{h+1}\sim P(s_{h})$ for each $h\geq 0$.

Finite-Horizon Markov Decision Processes.

Let $M=\left(\mathcal{S},\mathcal{A},P,R,H,\mu\right)$ be an $H$-horizon Markov Decision Process (MDP), where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $P:\mathcal{S}\times\mathcal{A}\rightarrow\Delta\left(\mathcal{S}\right)$ is the transition operator which takes a state-action pair and returns a distribution over states, $R:\mathcal{S}\times\mathcal{A}\rightarrow\Delta\left(\mathbb{R}\right)$ is the reward distribution, $H\in\mathbb{Z}_{+}$ is the planning horizon (episode length), and $\mu\in\Delta\left(\mathcal{S}\right)$ is the initial state distribution. In the learning setting, $P$, $R$ and $\mu$ are unknown but fixed, while $\mathcal{S}$, $\mathcal{A}$ and $H$ are known to the learner. Throughout, we assume $H=\Omega(|\mathcal{S}|\log|\mathcal{S}|)$, since otherwise we can apply existing algorithms (e.g., [ZJD20]) to obtain a sample complexity that is independent of $H$. Throughout the paper, for a state $s\in\mathcal{S}$, we occasionally abuse notation and use $s$ to denote the deterministic distribution that always takes $s$.

Interacting with the MDP.

We now introduce how an RL agent (or an algorithm) interacts with an unknown MDP. In the setting of Theorem 1.1, in each episode, the agent first receives an initial state $s_{0}\sim\mu$. For each time step $h\in[H]$, the agent first decides on an action $a_{h}\in\mathcal{A}$, then observes and moves to $s_{h+1}\sim P(s_{h},a_{h})$ and obtains a reward $r_{h}\sim R(s_{h},a_{h})$. The episode stops at time $H$, where the final state is $s_{H}$. Note that even if the agent decides to stop before time $H$, e.g., at time $\sqrt{H}$, this still counts as one full episode.

In the generative model setting (as in Theorem 1.3), the agent is allowed to start from any $s_{0}\in\mathcal{S}$ instead of $s_{0}\sim\mu$, and is allowed to restart at any time. The sample complexity is defined as the number of batches of queries (each batch consists of $H$ queries) the agent uses.

Policy Class.

The final goal of RL is to output a good policy $\pi$ with respect to the unknown MDP that the agent interacts with. In this paper, we consider (non-stationary) deterministic policies $\pi$ which choose an action $a$ based on the current state $s\in\mathcal{S}$ and the time step $h\in[H]$. Formally, $\pi=\{\pi_{h}\}_{h=0}^{H-1}$ where for each $h\in[H]$, $\pi_{h}:\mathcal{S}\to\mathcal{A}$ maps a given state to an action. The policy $\pi$ induces a (random) trajectory

$$(s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{H-1},a_{H-1},r_{H-1}),s_{H},$$

where $s_{0}\sim\mu$, $a_{0}=\pi_{0}(s_{0})$, $r_{0}\sim R(s_{0},a_{0})$, $s_{1}\sim P(s_{0},a_{0})$, $a_{1}=\pi_{1}(s_{1})$, $r_{1}\sim R(s_{1},a_{1})$, $\ldots$, $a_{H-1}=\pi_{H-1}(s_{H-1})$, $r_{H-1}\sim R(s_{H-1},a_{H-1})$ and $s_{H}\sim P(s_{H-1},a_{H-1})$. Throughout the paper, all policies are assumed to be deterministic unless specified otherwise.
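For concreteness, a minimal sketch (in Python/NumPy, using an array representation of the tabular MDP that is our own choice, not the paper's) of sampling one trajectory under a non-stationary deterministic policy:

```python
import numpy as np

def rollout(P, R, mu, pi, H, rng):
    """Sample one episode of length H.

    P:  array of shape (S, A, S); P[s, a] is the next-state distribution.
    R:  array of shape (S, A); mean rewards (deterministic rewards for simplicity).
    mu: array of shape (S,); initial state distribution.
    pi: array of shape (H, S); pi[h, s] is the action taken at step h in state s.
    """
    S = len(mu)
    traj = []
    s = rng.choice(S, p=mu)
    for h in range(H):
        a = pi[h, s]
        r = R[s, a]
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, r))
        s = s_next
    return traj, s  # (s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1}), s_H
```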

Value Functions.

To measure the performance of a policy $\pi$, we define the value of a policy as

$$V^{\pi}_{M,H}=\mathbb{E}\left[\sum_{h=0}^{H-1}r_{h}\right].$$

Note that in general a policy does not need to be deterministic, and it can even depend on the history, in which case the policy chooses an action at time $h$ based on the entire transition history before $h$. It is well known (see, e.g., [Put94]) that the optimal value of $M$ can be achieved by a non-stationary deterministic optimal policy $\pi^{*}$. Hence, we only need to consider non-stationary deterministic policies.

The goal of the agent is then to find a policy $\pi$ with $V^{\pi}_{M,H}\geq V^{\pi^{*}}_{M,H}-\epsilon$, i.e., an $\epsilon$-optimal policy, while minimizing the number of episodes of environment interactions (or batches of queries in the generative model).
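When the model is known, $V^{\pi}_{M,H}$ can be computed exactly by backward dynamic programming over the $H$ steps; a brief sketch under the same array conventions as above (our own illustration):

```python
import numpy as np

def policy_value(P, R, mu, pi, H):
    """Exact V^pi_{M,H} for a non-stationary deterministic policy pi of shape (H, S)."""
    S = len(mu)
    V = np.zeros(S)  # value-to-go with 0 steps remaining
    for h in range(H - 1, -1, -1):
        V_new = np.zeros(S)
        for s in range(S):
            a = pi[h, s]
            V_new[s] = R[s, a] + P[s, a] @ V
        V = V_new
    return mu @ V
```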

For a policy $\pi$, we also define

$$\mathcal{Q}^{\pi}_{\delta}(s,a)=\mathcal{Q}_{\delta}\left[\sum_{t=0}^{H}\mathbb{I}[(s,a)=(s_{t},a_{t})]\right]$$

to be the $\delta$-quantile of the visitation frequency of a state-action pair $(s,a)$.

Stationary Policies.

For the sake of the analysis, we shall also consider stationary policies. A stationary deterministic policy $\pi$ chooses an action $a$ solely based on the current state $s\in\mathcal{S}$, i.e., $\pi_{0}=\pi_{1}=\cdots=\pi_{H-1}$. We use $\Pi_{\mathsf{st}}$ to denote the set of all stationary policies. Note that $|\Pi_{\mathsf{st}}|=|\mathcal{A}|^{|\mathcal{S}|}$.

We remark that when the horizon length $H$ is finite, the value of the best stationary policy and that of the best non-stationary policy can differ significantly. Consider the case where there are two states $s,s^{\prime}$ and two actions $a,a^{\prime}$. Starting from $s$, taking action $a$ returns to $s$ and yields a reward of $1/H$, while taking action $a^{\prime}$ moves to $s^{\prime}$ and yields a reward of $1$. Taking any action at $s^{\prime}$ stays at $s^{\prime}$ with reward $0$. In this case, the optimal non-stationary policy has value $2-1/H$, since it can take $a$ for the first $H-1$ steps and then exit $s$ at the last step, while deterministic stationary policies can only have value $\leq 1$ (even a random stationary policy can only achieve value $2-\Omega(1)$).
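A quick numerical check of this example (a sketch we add for illustration; the construction itself is the one described above):

```python
def example_values(H):
    # States: s and s'. Actions: a (stay at s, reward 1/H), a' (move to s', reward 1).
    # Best non-stationary policy: play a for H - 1 steps, then a'.
    v_nonstationary = (H - 1) * (1.0 / H) + 1.0   # = 2 - 1/H
    # Deterministic stationary policies starting from s:
    v_always_a = H * (1.0 / H)                    # = 1
    v_always_a_prime = 1.0                        # move to s' immediately, reward 0 afterwards
    return v_nonstationary, max(v_always_a, v_always_a_prime)

print(example_values(100))  # (1.99, 1.0)
```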

For a stationary policy $\pi:\mathcal{S}\to\mathcal{A}$, we use $M^{\pi}=(\mathcal{S},P^{\pi},\mu)$ to denote the Markov chain induced by $M$ and $\pi$, where the transition operator $P^{\pi}$ is defined so that

$$P^{\pi}(s^{\prime}\mid s)=P(s^{\prime}\mid s,\pi(s)).$$

Assumption on Rewards.

Below, we formally introduce the bounded total reward assumption.

Assumption 2.1 (Bounded Total Reward).

For any policy $\pi$ and a random trajectory $((s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{H-1},a_{H-1},r_{H-1}),s_{H})$ induced by $\pi$, we have

  • $r_{h}\in[0,1]$ for all $h\in[H]$; and

  • $\sum_{h=0}^{H-1}r_{h}\leq 1$ almost surely.

As discussed in the previous section, this assumption is more general than the standard reward uniformity assumption, where $r_{h}\in[0,1/H]$ for all $h\in[H]$. Throughout this paper, we assume that the above assumption holds for the unknown MDP that the agent interacts with.

The above assumption in fact implies a very interesting consequence on the reward values.

Lemma 2.1.

Under Assumption 2.1, for any $M=\left(\mathcal{S},\mathcal{A},P,R,H,\mu\right)$ with $H\geq|\mathcal{S}|$, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, if there exists a (possibly non-stationary) policy $\pi$ such that for the random trajectory

$$(s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{H-1},a_{H-1},r_{H-1}),s_{H}$$

induced by executing $\pi$ in $M$, we have

$$\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h},a_{h})=(s,a)]>1\right]\geq\epsilon$$

for some $\epsilon>0$, then $R(s,a)\leq 2|\mathcal{S}|/H$ almost surely and therefore $\mathbb{E}[(R(s,a))^{2}]\leq 4|\mathcal{S}|^{2}/H^{2}$.

Proof.

By the assumption, there exists a trajectory

$$((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H})$$

such that there exist $0\leq h_{1}<h_{2}<H$ with $(s_{h_{1}},a_{h_{1}})=(s,a)$ and $(s_{h_{2}},a_{h_{2}})=(s,a)$. Moreover,

$$\mu(s_{0})\prod_{h=0}^{h_{2}}P(s_{h+1}\mid s_{h},a_{h})>0.$$

We may assume $h_{1}<|\mathcal{S}|$ and $h_{2}-h_{1}\leq|\mathcal{S}|$, since otherwise we can replace sub-trajectories that start and end with the same state by that state, and the resulting trajectory still appears with strictly positive probability. Now consider the policy $\widehat{\pi}$ defined so that for each $h<h_{1}$, $\widehat{\pi}_{h}(s_{h})=a_{h}$, and for each $t\in[h_{2}-h_{1}]$,

$$\widehat{\pi}_{h_{1}+t}(s_{h_{1}+t})=\widehat{\pi}_{h_{1}+(h_{2}-h_{1})+t}(s_{h_{1}+t})=\widehat{\pi}_{h_{1}+2(h_{2}-h_{1})+t}(s_{h_{1}+t})=\cdots=a_{h_{1}+t},$$

i.e., repeating the trajectory's actions in $[h_{1},h_{2}]$ indefinitely. $\widehat{\pi}$ is defined arbitrarily for other states and time steps.

By executing $\widehat{\pi}$, with strictly positive probability, $(s,a)$ is visited $\lfloor H/|\mathcal{S}|\rfloor\geq H/(2|\mathcal{S}|)$ times. Therefore, by Assumption 2.1, $R(s,a)\leq 2|\mathcal{S}|/H$ with probability $1$, and thus $\mathbb{E}[(R(s,a))^{2}]\leq 4|\mathcal{S}|^{2}/H^{2}$.

Discounted Markov Decision Processes.

We also introduce another variant of the MDP, the discounted MDP, which is specified by $M=\left(\mathcal{S},\mathcal{A},P,R,\gamma,\mu\right)$, where $\gamma\in(0,1)$ is a discount factor and all other components have the same meaning as in an $H$-horizon MDP. Note that we consider discounted MDPs only for the purpose of analysis, and the unknown MDP that the agent interacts with is always a finite-horizon MDP. The difference between a discounted MDP and an $H$-horizon MDP is that discounted MDPs have an infinite horizon, i.e., the length of a trajectory can be infinite. Hence, to measure values in a discounted MDP, suppose policy $\pi$ obtains a random trajectory

$$(s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{h},a_{h},r_{h}),\ldots;$$

we denote

$$V^{\pi}_{M,\gamma}=\mathbb{E}\left[\sum_{h=0}^{\infty}\gamma^{h}r_{h}\mid s=s_{0}\right]$$

as the discounted value of $\pi$. Throughout, for a (discounted or finite-horizon) MDP $M=\left(\mathcal{S},\mathcal{A},P,R,\cdot,\mu\right)$, we simply denote by $V^{\pi}_{M,H}$ the value function of $\left(\mathcal{S},\mathcal{A},P,R,H,\mu\right)$ and by $V^{\pi}_{M,\gamma}$ the value function of $\left(\mathcal{S},\mathcal{A},P,R,\gamma,\mu\right)$.
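For a stationary policy in a known tabular model, the discounted value can be computed by the standard linear-system identity $V^{\pi}_{M,\gamma}=\mu^{\top}(I-\gamma P^{\pi})^{-1}r^{\pi}$, where $r^{\pi}(s)=\mathbb{E}[R(s,\pi(s))]$; a small sketch of this standard computation (our illustration, same array conventions as before):

```python
import numpy as np

def discounted_value(P, R, mu, pi, gamma):
    """V^pi_{M,gamma} for a stationary deterministic policy pi of shape (S,)."""
    S = len(mu)
    P_pi = P[np.arange(S), pi]   # induced chain P^pi, shape (S, S)
    r_pi = R[np.arange(S), pi]   # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return mu @ V
```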

3 Technical Overview

In this section, we discuss the techniques for establishing our results. To introduce the high-level ideas, we first start with the simpler setting, the generative model, where exploration is not a concern. We then switch to the more challenging RL setting, where we need to carefully design policies to explore the state-action space so that a good policy can be learned. For simplicity, throughout the discussion in this section, we assume $|\mathcal{S}|$, $|\mathcal{A}|$ and $1/\epsilon$ are all constants.

Algorithm and Analysis in the Generative Model.

Our algorithm in the generative model is conceptually simple: for each state-action pair $(s,a)$, we draw $O(H)$ samples from $P(s,a)$ and $R(s,a)$ and then return the optimal policy with respect to the empirical model $\widehat{M}$, which is obtained by using the empirical estimators for $P$ and $R$ (denoted $\widehat{P}$ and $\widehat{R}$). Here, for simplicity, we assume $R=\widehat{R}$, which allows us to focus on the estimation error induced by the transition probabilities. Moreover, we assume that $P$ differs from $\widehat{P}$ only at a single state-action pair $(s,a)$. To further simplify the discussion, we assume that there are only two states on the support of $P(\cdot\mid s,a)$ (say $s_{1}$ and $s_{2}$).

In order to prove the correctness of the algorithm, we show that for any policy $\pi$, the value of $\pi$ in the empirical model $\widehat{M}$ is close to that in the true model $M$. However, standard analysis based on dynamic programming only shows that the difference between the value of $\pi$ in $\widehat{M}$ and that in $M$ could be as large as $H$ times the estimation error on $P(s,a)$, which is clearly insufficient for obtaining an algorithm that uses $O(1)$ batches of queries. Our main idea is to show that for most trajectories $T$, the probability of $T$ in the empirical model $\widehat{M}$ is a multiplicative approximation to that in the true model $M$ with constant approximation ratio.

To establish the multiplicative approximation guarantee, our observation is that one should consider $s_{1}$ and $s_{2}$, the two states on the support of $P(\cdot\mid s,a)$, as a whole. To see this, consider the case where $P(s_{1}\mid s,a)=P(s_{2}\mid s,a)=1/2$. The additive estimation errors on both $P(s_{1}\mid s,a)$ and $P(s_{2}\mid s,a)$ are roughly $O(1/\sqrt{H})$. Now, consider a trajectory that visits both $(s,a,s_{1})$ and $(s,a,s_{2})$ for $H/2$ times. Note that the multiplicative approximation ratio between $\widehat{P}(s^{\prime}\mid s,a)^{H/2}$ and $P(s^{\prime}\mid s,a)^{H/2}$ could be as large as $\exp(\sqrt{H})$, for both $s^{\prime}=s_{1}$ and $s^{\prime}=s_{2}$. However, since $\widehat{P}(s_{1}\mid s,a)+\widehat{P}(s_{2}\mid s,a)=1$, as the empirical estimator $\widehat{P}(s,a)$ is still a probability distribution, it must be the case that $\widehat{P}(s_{1}\mid s,a)/P(s_{1}\mid s,a)=1-2\delta$ and $\widehat{P}(s_{2}\mid s,a)/P(s_{2}\mid s,a)=1+2\delta$, where $\delta=P(s_{1}\mid s,a)-\widehat{P}(s_{1}\mid s,a)$ and thus $|\delta|\leq O(1/\sqrt{H})$. Since $(1+2\delta)^{H/2}(1-2\delta)^{H/2}=(1-4\delta^{2})^{H/2}$ is a constant, $(\widehat{P}(s_{1}\mid s,a))^{H/2}(\widehat{P}(s_{2}\mid s,a))^{H/2}$ is a constant factor approximation to the true probability $(P(s_{1}\mid s,a))^{H/2}(P(s_{2}\mid s,a))^{H/2}$ due to cancellation.
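The cancellation can be checked numerically; the following sketch (our own illustration, not from the paper) shows that each per-state ratio blows up or vanishes like $\exp(\pm\sqrt{H})$, while their product stays a constant.

```python
import numpy as np

H = 10000
p1 = p2 = 0.5
delta = 1.0 / np.sqrt(H)           # additive estimation error of order 1/sqrt(H)
p1_hat, p2_hat = p1 - delta, p2 + delta

ratio1 = (p1_hat / p1) ** (H / 2)  # per-state ratio: roughly exp(-sqrt(H))
ratio2 = (p2_hat / p2) ** (H / 2)  # per-state ratio: roughly exp(+sqrt(H))
joint = ratio1 * ratio2            # = (1 - 4 * delta**2) ** (H / 2), a constant

print(ratio1, ratio2, joint)       # tiny, huge, ~0.135
```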

In our analysis, to formalize the above intuition, for each trajectory $T$, we take $T$ into consideration only when $|m_{T}(s,a,s^{\prime})-P(s^{\prime}\mid s,a)\cdot m_{T}(s,a)|\leq O(\sqrt{P(s^{\prime}\mid s,a)\cdot H})$ for both $s^{\prime}=s_{1}$ and $s^{\prime}=s_{2}$. Here $m_{T}(s,a)$ is the number of times that $(s,a)$ is visited on $T$ and $m_{T}(s,a,s^{\prime})$ is the number of times that $(s,a,s^{\prime})$ is visited on $T$. By Chebyshev's inequality (see Lemma 5.7 for details), we only ignore a small subset of trajectories whose total probability can be upper bounded by a constant. For the remaining trajectories, it can be shown that $\widehat{P}(s_{1}\mid s,a)^{m_{T}(s,a,s_{1})}\cdot\widehat{P}(s_{2}\mid s,a)^{m_{T}(s,a,s_{2})}$ is a constant factor approximation to $P(s_{1}\mid s,a)^{m_{T}(s,a,s_{1})}\cdot P(s_{2}\mid s,a)^{m_{T}(s,a,s_{2})}$ so long as $|\widehat{P}(s,a,s^{\prime})-P(s,a,s^{\prime})|\leq O(\sqrt{P(s,a,s^{\prime})/H})$ for both $s^{\prime}=s_{1}$ and $s^{\prime}=s_{2}$, due to the cancellation mentioned above. The formal argument is given in Lemma 5.5. Note that using $O(H)$ samples, $|\widehat{P}(s,a,s^{\prime})-P(s,a,s^{\prime})|\leq O(\sqrt{P(s,a,s^{\prime})/H})$ holds only when $P(s,a,s^{\prime})\geq\Omega(1/H)$. On the other hand, we can also ignore trajectories that visit $(s,a,s^{\prime})$ with $P(s,a,s^{\prime})\leq O(1/H)$, since such trajectories have negligible cumulative probability by Markov's inequality (formalized in Lemma 5.6).

The above analysis readily generalizes to handle perturbations of the transition probabilities of multiple state-action pairs, and to handle the case when the transition operator $P(\cdot\mid s,a)$ is not supported on two states. In summary, by using $O(H)$ samples for each state-action pair $(s,a)$, the empirical model $\widehat{M}$ provides a constant factor approximation to the probabilities of all trajectories, except for a small subset of them whose cumulative probability can be upper bounded by a constant. Hence, for every policy, the empirical model provides an accurate estimate of its value, and thus the optimal policy with respect to the empirical model is near-optimal.
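Putting the pieces together, here is a minimal sketch of the generative-model procedure described above (our own Python illustration; `sample_transition` and `sample_reward` are hypothetical names standing in for the generative-model oracle):

```python
import numpy as np

def learn_with_generative_model(S, A, H, sample_transition, sample_reward, n_per_pair):
    """Build the empirical model from n_per_pair = O(H) samples per (s, a),
    then return the optimal non-stationary policy of the empirical model."""
    P_hat = np.zeros((S, A, S))
    R_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(n_per_pair):
                P_hat[s, a, sample_transition(s, a)] += 1.0
                R_hat[s, a] += sample_reward(s, a)
            P_hat[s, a] /= n_per_pair
            R_hat[s, a] /= n_per_pair

    # Finite-horizon dynamic programming in the empirical model.
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = R_hat + P_hat @ V          # shape (S, A)
        pi[h] = np.argmax(Q, axis=1)
        V = np.max(Q, axis=1)
    return pi
```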

Exploration by Stationary Policies.

In the discussion above, we heavily rely on the ability of the generative model to obtain $\Omega(H)$ samples for each state-action pair. However, in the RL setting, it is not possible to reach every state-action pair freely. Although each trajectory contains $H$ state-action-state tuples (corresponding to a batch of queries in the generative model), these samples may not cover states that are crucial for learning an optimal policy. Indeed, one could use all possible deterministic non-stationary policies to collect samples, which would then cover the whole state-action space. Unfortunately, such a naïve method introduces a dependence on the number of non-stationary policies, which is exponential in $H$. The sample complexity of other existing methods in the literature also inevitably depends on $H$, as it intrinsically depends on the number of non-stationary policies.

In this work, we adopt a completely different approach to exploration. Our new idea is to show that if there exists a non-stationary policy that visits $(s,a)$ $f$ times in expectation, then there exists a stationary policy that visits $(s,a)$ $f/\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$ times in expectation. This is formalized in Lemma 4.7. If the above claim is true, then intuitively, one can simply enumerate all stationary policies and sample $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$ trajectories using each of them to obtain $f$ samples of $(s,a)$. Note that there are only $|\mathcal{A}|^{|\mathcal{S}|}$ stationary policies, which is completely independent of $H$. In order to prove the above claim, we show that for any stationary policy $\pi$, its value in the infinite-horizon discounted setting is close to that in the finite-horizon undiscounted setting (up to a factor of $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$) for a properly chosen discount factor (see Lemma 4.5). Note that this implies the correctness of the above claim, since there always exists a stationary optimal policy in the infinite-horizon discounted setting.

In order to show that the value of a stationary policy in the infinite-horizon discounted setting is close to that in the finite-horizon setting, we study reaching probabilities in time-invariant Markov chains. In particular, we show in Lemma 4.4 that in a time-invariant Markov chain, for any $H\geq|\mathcal{S}|$, the probability of reaching a specific state $s$ within $H$ steps is close to the probability of reaching $s$ within $4H$ steps, up to a factor of $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$. Previous literature on time-invariant Markov chains mostly focuses on asymptotic behavior, and as far as we are aware, we are the first to prove the above claim. Note that this claim directly establishes a connection between the value of a stationary policy in the infinite-horizon discounted setting and that in the finite-horizon setting. Moreover, as a direct consequence of the above claim, we can show that if $H>2|\mathcal{S}|$, the value of a stationary policy within $H$ steps is close to that of the same policy within $H/2$ steps, up to a factor of $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$. This consequence is crucial for later parts of the analysis.
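To illustrate the enumeration idea (a simplified sketch of ours, not the paper's final algorithm, which concatenates two stationary policies as described in the next subsection), one can roll out every stationary deterministic policy a number of times that depends only on $|\mathcal{S}|$ and record how often each state-action pair is visited; the number of policies is $|\mathcal{A}|^{|\mathcal{S}|}$, independent of $H$.

```python
import itertools
import numpy as np

def collect_with_stationary_policies(P, R, mu, H, rollouts_per_policy, rng):
    """Roll out every stationary deterministic policy and count (s, a) visits."""
    S, A = R.shape
    counts = np.zeros((S, A), dtype=int)
    for pi in itertools.product(range(A), repeat=S):   # all |A|^|S| stationary policies
        for _ in range(rollouts_per_policy):
            s = rng.choice(S, p=mu)
            for _ in range(H):
                a = pi[s]
                counts[s, a] += 1
                s = rng.choice(S, p=P[s, a])
    return counts
```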

From Expectation to Quantile.

The above analysis shows that if there exists a non-stationary policy that visits $(s,a)$ $f$ times in expectation, then our algorithm, which uses all stationary policies to collect samples, visits $(s,a)$ $f/\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$ times in expectation. However, this does not necessarily mean that one can obtain $f$ samples of $(s,a)$ with good probability by sampling $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$ trajectories using our algorithm. To see this, consider the case when our policy visits $(s,a)$ $H$ times with probability $1/\sqrt{H}$ and does not visit $(s,a)$ at all with probability $1-1/\sqrt{H}$. In this case, our policy may not obtain even a single sample of $(s,a)$ unless one rolls out the policy $O(\sqrt{H})$ times. Therefore, instead of a visitation frequency guarantee that holds in expectation, it is more desirable to obtain a visitation frequency guarantee that holds with good probability.

To resolve this issue, we establish a connection between the expectation and the $\epsilon$-quantile of the visitation frequency of a stationary policy. We note that such a connection cannot hold without any restriction. To see this, consider a policy that visits $(s,a)$ $H$ times with probability $\epsilon/2$. In this case, the expected visitation frequency is $\epsilon H/2$ while the $\epsilon$-quantile is zero. On the other hand, if the initial state satisfies $s_{0}=s$ almost surely, then such a connection is easy to establish by using the Martingale Stopping Theorem. In particular, in Corollary 4.9, we show that if there exists a non-stationary policy that visits $(s,a)$ $f$ times with probability $\epsilon$, then there exists a stationary policy that visits $(s,a)$ $\epsilon f/\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$ times with constant probability, when the initial state $s_{0}=s$ almost surely.

In general, the initial state $s_{0}$ comes from a distribution $\mu$ and could be different from $s$ with high probability. To tackle this issue, in our algorithm we simultaneously enumerate two stationary policies $\pi_{1}$ and $\pi_{2}$. Here $\pi_{1}$ should be thought of as the policy that visits $(s,a)$ with the highest probability within $H/2$ steps starting from the initial state distribution $\mu$, and $\pi_{2}$ should be thought of as the policy that maximizes the $\epsilon$-quantile of the visitation frequency of $(s,a)$ within $H/2$ steps when $s_{0}=s$. In our algorithm, we execute $\pi_{1}$ before $(s,a)$ is visited for the first time, and switch to $\pi_{2}$ once $(s,a)$ has been visited. Intuitively, we first use $\pi_{1}$ to reach $(s,a)$ for the first time and then use $\pi_{2}$ to collect as many samples as possible. As mentioned above, the value of a stationary policy within $H$ steps is close to the value of the same policy within $H/2$ steps, up to a factor of $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$. Thus, by sampling the above policy (formed by concatenating $\pi_{1}$ and $\pi_{2}$) $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)/\epsilon^{2}$ times, we obtain at least $f$ samples of $(s,a)$, provided there exists a non-stationary policy that visits $(s,a)$ $f$ times with probability $\epsilon$. This is formalized in Lemma 5.1.
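A sketch of one rollout of such a concatenated policy (our illustration; `pi1` and `pi2` are the two stationary policies as arrays of shape `(S,)`, and `target` is the state-action pair $(s,a)$ of interest):

```python
import numpy as np

def rollout_concatenated(P, mu, pi1, pi2, target, H, rng):
    """Follow pi1 until (s, a) = target is visited, then follow pi2.

    Returns the number of times target is visited in the episode.
    """
    S = len(mu)
    s = rng.choice(S, p=mu)
    visits, switched = 0, False
    for _ in range(H):
        a = (pi2 if switched else pi1)[s]
        if (s, a) == target:
            visits += 1
            switched = True
        s = rng.choice(S, p=P[s, a])
    return visits
```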

Perturbation Analysis in the RL Setting.

By the above analysis, if $m(s,a)$ is the largest integer such that there exists a non-stationary policy that visits $(s,a)$ $m(s,a)$ times with probability $\epsilon$, then our dataset contains $\Omega(m(s,a))$ samples of $(s,a)$. However, $m(s,a)$ could be significantly smaller than $H$, and therefore the perturbation analysis established in the generative model no longer applies. For example, previously we showed that if $|m_{T}(s,a,s^{\prime})-P(s^{\prime}\mid s,a)\cdot m_{T}(s,a)|\leq O(\sqrt{P(s^{\prime}\mid s,a)\cdot H})$, then $\widehat{P}(s_{1}\mid s,a)^{m_{T}(s,a,s_{1})}\cdot\widehat{P}(s_{2}\mid s,a)^{m_{T}(s,a,s_{2})}$ is a constant factor approximation to $P(s_{1}\mid s,a)^{m_{T}(s,a,s_{1})}\cdot P(s_{2}\mid s,a)^{m_{T}(s,a,s_{2})}$ when $|\widehat{P}(s,a,s^{\prime})-P(s,a,s^{\prime})|\leq O(\sqrt{P(s,a,s^{\prime})/H})$ for both $s^{\prime}=s_{1}$ and $s^{\prime}=s_{2}$. However, if $m(s,a)\ll H$, it is hopeless to obtain an estimate $\widehat{P}(s,a,s^{\prime})$ with $|\widehat{P}(s,a,s^{\prime})-P(s,a,s^{\prime})|\leq O(\sqrt{P(s,a,s^{\prime})/H})$. Fortunately, our perturbation analysis still goes through so long as $m_{T}(s,a,s^{\prime})\leq P(s^{\prime}\mid s,a)\cdot m_{T}(s,a)+O(\sqrt{P(s^{\prime}\mid s,a)\cdot m(s,a)})$ and $|\widehat{P}(s,a,s^{\prime})-P(s,a,s^{\prime})|\leq O(\sqrt{P(s,a,s^{\prime})/m(s,a)})$, i.e., after replacing all appearances of $H$ with $m(s,a)$.

The above analysis introduces a final subtlety in our algorithm. In particular, $m(s,a)$ in the empirical model $\widehat{M}$ could be significantly larger than that in the true model. On the other hand, the number of samples of $(s,a)$ in our dataset is at most $O(m(s,a))$, where $m(s,a)$ is defined by the true model. This means the value estimated in the empirical model $\widehat{M}$ could be significantly larger than that in the true model $M$. To resolve this issue, we employ the principle of “pessimism in the face of uncertainty”: for each policy $\pi$, the estimated value of $\pi$ is set to be the lowest value among all models that lie in the confidence set. Since the true model always lies in the confidence set, the estimated value is guaranteed to be close to the true value.

4 Properties of Stationary Policies

In this section, we prove several properties of stationary policies. In Section 4.1, we first prove properties regarding the reaching probabilities in Markov chains, and then use them to prove properties for stationary policies in Section 4.2.

4.1 Reaching Probabilities in Markov Chains

Let $C=(\mathcal{S},P,\mu)$ be a Markov chain. For a positive integer $L$ and a sequence of states $T=(s_{0},s_{1},\ldots,s_{L-1})\in\mathcal{S}^{L}$, we write

$$p(T,C)=\mu(s_{0})\cdot\prod_{h=0}^{L-2}P(s_{h+1}\mid s_{h})$$

to denote the probability of $T$ in $C$. For a state $s\in\mathcal{S}$ and an integer $L\geq 0$, we also write

$$p_{L}(s,C)=\sum_{(s_{0},s_{1},\ldots,s_{L-1})\in\mathcal{S}^{L}}p((s_{0},s_{1},\ldots,s_{L-1},s),C)$$

to denote the probability of reaching $s$ in exactly $L$ steps.
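Concretely, $p_{L}(s,C)$ is the probability that the chain is at $s$ at time $L$, which can be computed by propagating the state distribution; a short sketch (our own illustration):

```python
import numpy as np

def reach_probs(P, mu, L):
    """Return the vector (p_L(s, C))_s, i.e., the state distribution after L steps."""
    dist = np.asarray(mu, dtype=float)
    for _ in range(L):
        dist = dist @ P   # P[s, s'] = P(s' | s)
    return dist
```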

Our first lemma shows that for any Markov chain $C$ and any sequence of $L$ states $T$ with $L>|\mathcal{S}|$, there exists a sequence of $L^{\prime}$ states $T^{\prime}$ with $L^{\prime}\leq|\mathcal{S}|$ such that $p(T,C)\leq p(T^{\prime},C)$.

Lemma 4.1.

Let $C=(\mathcal{S},P,\mu)$ be a Markov chain. For a sequence of $L$ states

$$T=(s_{0},s_{1},\ldots,s_{L-1})\in\mathcal{S}^{L}$$

with $L>|\mathcal{S}|$, there exists a sequence of $L^{\prime}$ states

$$T^{\prime}=(s_{0}^{\prime},s_{1}^{\prime},\ldots,s_{L^{\prime}-1}^{\prime})\in\mathcal{S}^{L^{\prime}}$$

with $s_{L^{\prime}-1}^{\prime}=s_{L-1}$, $L^{\prime}\leq|\mathcal{S}|$ and $p(T,C)\leq p(T^{\prime},C)$.

Proof.

By the pigeonhole principle, since $L>|\mathcal{S}|$, there exist $0\leq i<j<L$ such that $s_{i}=s_{j}$. Consider the sequence induced by removing $s_{i},s_{i+1},s_{i+2},\ldots,s_{j-1}$ from $T$, i.e.,

$$T^{\prime}=(s_{0},s_{1},\ldots,s_{i-1},s_{j},s_{j+1},\ldots,s_{L-1}).$$

Since $s_{i}=s_{j}$, we have

$$p(T,C)=\mu(s_{0})\cdot\prod_{h=0}^{L-2}P(s_{h+1}\mid s_{h})$$

and

$$p(T^{\prime},C)=\mu(s_{0})\cdot\prod_{h=0}^{i-1}P(s_{h+1}\mid s_{h})\cdot\prod_{h=j}^{L-2}P(s_{h+1}\mid s_{h}).$$

Therefore, we have $p(T,C)\leq p(T^{\prime},C)$. We continue this process until the length is at most $|\mathcal{S}|$. ∎

Combining Lemma 4.1 with a simple counting argument directly implies the following lemma, which shows that $\sum_{h=0}^{4|\mathcal{S}|-1}p_{h}(s,C)\leq\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C)$.

Lemma 4.2.

Let $C=(\mathcal{S},P,\mu)$ be a Markov chain. For any $s\in\mathcal{S}$,

$$\sum_{h=0}^{4|\mathcal{S}|-1}p_{h}(s,C)\leq 4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C).$$
Proof.

Consider a sequence of $L+1$ states $T=(s_{0},s_{1},\ldots,s_{L})\in\mathcal{S}^{L+1}$ with $L\geq|\mathcal{S}|$ and $s_{L}=s$. By Lemma 4.1, there exists another sequence of $L^{\prime}$ states $T^{\prime}=(s_{0}^{\prime},s_{1}^{\prime},\ldots,s_{L^{\prime}-1}^{\prime})\in\mathcal{S}^{L^{\prime}}$ with $s_{L^{\prime}-1}^{\prime}=s_{L}=s$ and $L^{\prime}\leq|\mathcal{S}|$ such that $p(T,C)\leq p(T^{\prime},C)$. Therefore,

$$p(T,C)\leq p(T^{\prime},C)\leq p_{L^{\prime}-1}(s,C)\leq\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C),$$

which implies

$$p_{L}(s,C)=\sum_{(s_{0},s_{1},\ldots,s_{L-2},s_{L-1})\in\mathcal{S}^{L}}p((s_{0},s_{1},\ldots,s_{L-2},s_{L-1},s),C)\leq|\mathcal{S}|^{L}\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C).$$

Therefore,

$$\sum_{h=0}^{4|\mathcal{S}|-1}p_{h}(s,C)\leq 4\cdot|\mathcal{S}|\cdot|\mathcal{S}|^{4|\mathcal{S}|-1}\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C)=4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C).$$

By applying Lemma 4.2 to a Markov chain $C^{\prime}$ with modified initial state distribution and transition operator, we can also prove that $\sum_{h=0}^{4|\mathcal{S}|-1}p_{\beta h+\alpha}(s,C)\leq\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{\beta h+\alpha}(s,C)$ for any integer $\alpha\geq 0$ and integer $\beta\geq 1$.

Lemma 4.3.

Let $C=(\mathcal{S},P,\mu)$ be a Markov chain. For any integer $\alpha\geq 0$, any integer $\beta\geq 1$, and any $s\in\mathcal{S}$,

$$\sum_{h=0}^{4|\mathcal{S}|-1}p_{\beta h+\alpha}(s,C)\leq 4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{\beta h+\alpha}(s,C).$$
Proof.

We define a new Markov chain $C^{\prime}=(\mathcal{S},P^{\prime},\mu^{\prime})$ based on $C=(\mathcal{S},P,\mu)$. The state space of $C^{\prime}$ is the same as that of $C$. The initial state distribution $\mu^{\prime}$ is the distribution of $s_{\alpha}$ in $C$, i.e., the distribution after taking $\alpha$ steps in $C$. The transition operator is defined so that taking one step in $C^{\prime}$ is equivalent to taking $\beta$ steps in $C$, i.e.,

$$P^{\prime}(s^{\prime}\mid s)=\sum_{(s_{1},s_{2},\ldots,s_{\beta-1})\in\mathcal{S}^{\beta-1}}P(s^{\prime}\mid s_{\beta-1})\cdot P(s_{\beta-1}\mid s_{\beta-2})\cdots P(s_{1}\mid s).$$

Clearly, for any state $s\in\mathcal{S}$, $p_{L}(s,C^{\prime})=p_{\beta L+\alpha}(s,C)$. By applying Lemma 4.2 to $C^{\prime}$, for any $s\in\mathcal{S}$, we have

$$\sum_{h=0}^{4|\mathcal{S}|-1}p_{h}(s,C^{\prime})\leq 4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{h}(s,C^{\prime}),$$

which implies

$$\sum_{h=0}^{4|\mathcal{S}|-1}p_{\beta h+\alpha}(s,C)\leq 4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{|\mathcal{S}|-1}p_{\beta h+\alpha}(s,C).$$

Finally, Lemma 4.3 implies the main result of this section, which shows that for any $L\geq|\mathcal{S}|$, $\sum_{h=0}^{4L-1}p_{h}(s,C)\leq\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)\cdot\sum_{h=0}^{L-1}p_{h}(s,C)$.

Lemma 4.4.

Let $C=(\mathcal{S},P,\mu)$ be a Markov chain. For any $s\in\mathcal{S}$ and $L\geq|\mathcal{S}|$,

$$\sum_{h=0}^{2L}p_{h}(s,C)\leq 4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{L-1}p_{h}(s,C).$$
Proof.

Clearly,

$$\sum_{h=0}^{L-1}p_{h}(s,C)\geq\sum_{h=0}^{\lfloor L/|\mathcal{S}|\rfloor\cdot|\mathcal{S}|-1}p_{h}(s,C)=\sum_{i=0}^{\lfloor L/|\mathcal{S}|\rfloor-1}\sum_{j=0}^{|\mathcal{S}|-1}p_{\lfloor L/|\mathcal{S}|\rfloor\cdot j+i}(s,C).$$

For each $0\leq i<\lfloor L/|\mathcal{S}|\rfloor$, by Lemma 4.3, we have

$$\sum_{j=0}^{|\mathcal{S}|-1}p_{\lfloor L/|\mathcal{S}|\rfloor\cdot j+i}(s,C)\geq\frac{1}{4|\mathcal{S}|^{4|\mathcal{S}|}}\sum_{j=0}^{4|\mathcal{S}|-1}p_{\lfloor L/|\mathcal{S}|\rfloor\cdot j+i}(s,C).$$

On the other hand,

$$\sum_{h=0}^{2L}p_{h}(s,C)\leq\sum_{i=0}^{\lfloor L/|\mathcal{S}|\rfloor-1}\sum_{j=0}^{\lfloor(2L+1)/\lfloor L/|\mathcal{S}|\rfloor\rfloor-1}p_{\lfloor L/|\mathcal{S}|\rfloor\cdot j+i}(s,C).$$

Note that if $|\mathcal{S}|>L/2$, then $\lfloor(2L+1)/\lfloor L/|\mathcal{S}|\rfloor\rfloor=2L+1<4|\mathcal{S}|$. Moreover, if $|\mathcal{S}|\leq L/2$, then we have $\lfloor L/|\mathcal{S}|\rfloor\geq 2L/(3|\mathcal{S}|)$, which implies

$$\lfloor(2L+1)/\lfloor L/|\mathcal{S}|\rfloor\rfloor\leq\lfloor(2L+1)/L\cdot 3|\mathcal{S}|/2\rfloor\leq\lfloor 4|\mathcal{S}|\rfloor=4|\mathcal{S}|.$$

Hence, we have

$$\sum_{h=0}^{2L}p_{h}(s,C)\leq\sum_{i=0}^{\lfloor L/|\mathcal{S}|\rfloor-1}\sum_{j=0}^{4|\mathcal{S}|-1}p_{\lfloor L/|\mathcal{S}|\rfloor\cdot j+i}(s,C)\leq 4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\cdot\sum_{h=0}^{L-1}p_{h}(s,C).$$
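The inequality of Lemma 4.4 can also be sanity-checked numerically on small random chains; a brief sketch of such a check (ours, not part of the argument):

```python
import numpy as np

def check_lemma_4_4(num_states, L, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_states)); P /= P.sum(axis=1, keepdims=True)
    mu = rng.random(num_states); mu /= mu.sum()

    # p_h(s, C) for h = 0, ..., 2L: state distribution after h steps.
    dists = [mu]
    for _ in range(2 * L):
        dists.append(dists[-1] @ P)
    dists = np.array(dists)                  # shape (2L + 1, num_states)

    lhs = dists.sum(axis=0)                  # sum_{h=0}^{2L} p_h(s, C)
    rhs = 4 * num_states ** (4 * num_states) * dists[:L].sum(axis=0)
    return bool(np.all(lhs <= rhs))

print(check_lemma_4_4(num_states=3, L=5))    # True (requires L >= num_states)
```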

4.2 Implications of Lemma 4.4

In this section, we list several implications of Lemma 4.4 which would be crucial for analysis in later sections.

Our first lemma shows that for any MDP $M$ and any stationary policy $\pi$, for a properly chosen discount factor $\gamma$, $V^{\pi}_{M,\gamma}$ is a multiplicative approximation to $V^{\pi}_{M,H}$ with approximation ratio $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$.

Lemma 4.5.

For any MDP $M$ and any stationary policy $\pi$, if $H\geq 2\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|})$, by taking $\gamma=1-\frac{\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|})}{H}$,

$$\frac{1}{64\cdot|\mathcal{S}|^{8|\mathcal{S}|}}V^{\pi}_{M,H}\leq V^{\pi}_{M,\gamma}\leq 2V^{\pi}_{M,H}.$$
Proof.
$$\begin{aligned}
V^{\pi}_{M,\gamma}&=\sum_{s\in\mathcal{S}}\sum_{h=0}^{\infty}\gamma^{h}\cdot p_{h}(s,M^{\pi})\cdot\mathbb{E}[R(s,\pi(s))]\\
&\leq\sum_{s\in\mathcal{S}}\left(\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})+\sum_{i=1}^{\infty}\gamma^{H\cdot 2^{i-1}}\left(\sum_{h=0}^{2^{i}\cdot H-1}p_{h}(s,M^{\pi})\right)\right)\cdot\mathbb{E}[R(s,\pi(s))].
\end{aligned}$$

For each $i\geq 1$, by Lemma 4.4, for any $s\in\mathcal{S}$,

$$\begin{aligned}
\gamma^{H\cdot 2^{i-1}}\left(\sum_{h=0}^{2^{i}\cdot H-1}p_{h}(s,M^{\pi})\right)&\leq\gamma^{H\cdot 2^{i-1}}\cdot\left(4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\right)^{i}\cdot\left(\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})\right)\\
&\leq\left(8\cdot|\mathcal{S}|^{4|\mathcal{S}|}\right)^{-2^{i-1}}\cdot\left(4\cdot|\mathcal{S}|^{4|\mathcal{S}|}\right)^{i}\cdot\left(\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})\right)\\
&\leq 1/2^{i}\cdot\left(\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})\right).
\end{aligned}$$

Therefore,

$$V^{\pi}_{M,\gamma}\leq\sum_{s\in\mathcal{S}}2\cdot\left(\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})\right)\cdot\mathbb{E}[R(s,\pi(s))]=2V^{\pi}_{M,H}.$$

On the other hand,

$$\begin{aligned}
V^{\pi}_{M,\gamma}&=\sum_{s\in\mathcal{S}}\sum_{h=0}^{\infty}\gamma^{h}\cdot p_{h}(s,M^{\pi})\cdot\mathbb{E}[R(s,\pi(s))]\\
&\geq\sum_{s\in\mathcal{S}}\sum_{h=0}^{H-1}\gamma^{h}\cdot p_{h}(s,M^{\pi})\cdot\mathbb{E}[R(s,\pi(s))]\\
&\geq\gamma^{H}\cdot\sum_{s\in\mathcal{S}}\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})\cdot\mathbb{E}[R(s,\pi(s))]\\
&=\gamma^{H}\cdot V^{\pi}_{M,H}=\left(1-\frac{\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|})}{H}\right)^{H}\cdot V^{\pi}_{M,H}\\
&\geq(1/4)^{\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|})}\cdot V^{\pi}_{M,H}\geq\frac{1}{64\cdot|\mathcal{S}|^{8|\mathcal{S}|}}\cdot V^{\pi}_{M,H}.
\end{aligned}$$

As another implication of Lemma 4.4, for any MDP $M$ and any stationary policy $\pi$, we have $V^{\pi}_{M,\lfloor H/2\rfloor}\geq\exp\left(-O(|\mathcal{S}|\log|\mathcal{S}|)\right)V^{\pi}_{M,H}$.

Lemma 4.6.

For any MDP $M$ and any stationary policy $\pi:\mathcal{S}\to\mathcal{A}$, if $H\geq 2|\mathcal{S}|$,

$$V^{\pi}_{M,\lfloor H/2\rfloor}\geq\frac{1}{4\cdot|\mathcal{S}|^{4|\mathcal{S}|}}V^{\pi}_{M,H}.$$
Proof.

Note that

$$V^{\pi}_{M,\lfloor H/2\rfloor}=\sum_{s\in\mathcal{S}}\sum_{h=0}^{\lfloor H/2\rfloor-1}p_{h}(s,M^{\pi})\cdot\mathbb{E}[R(s,\pi(s))].$$

Since $H\geq 2|\mathcal{S}|$, by Lemma 4.4, for any $s\in\mathcal{S}$,

$$\sum_{h=0}^{\lfloor H/2\rfloor-1}p_{h}(s,M^{\pi})\geq\frac{1}{4\cdot|\mathcal{S}|^{4|\mathcal{S}|}}\sum_{h=0}^{2\lfloor H/2\rfloor}p_{h}(s,M^{\pi})\geq\frac{1}{4\cdot|\mathcal{S}|^{4|\mathcal{S}|}}\sum_{h=0}^{H-1}p_{h}(s,M^{\pi}).$$

Therefore,

$$V^{\pi}_{M,\lfloor H/2\rfloor}\geq\frac{1}{4\cdot|\mathcal{S}|^{4|\mathcal{S}|}}\sum_{s\in\mathcal{S}}\sum_{h=0}^{H-1}p_{h}(s,M^{\pi})\cdot\mathbb{E}[R(s,\pi(s))]=\frac{1}{4\cdot|\mathcal{S}|^{4|\mathcal{S}|}}V^{\pi}_{M,H}.$$

As a corollary of Lemma 4.5, we show that for any $H$-horizon MDP $M$, there always exists a stationary policy whose value is as large as that of the best non-stationary policy, up to a factor of $\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right)$.

Corollary 4.7.

For any MDP $M$, if $H\geq 2\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|})$, then there exists a stationary policy $\pi$ such that

$$V^{\pi}_{M,H}\geq\frac{1}{128\cdot|\mathcal{S}|^{8|\mathcal{S}|}}V^{\pi^{*}}_{M,H}.$$
Proof.

In this proof we fix γ=1ln(8|𝒮|4|𝒮|)H\gamma=1-\frac{\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|})}{H}. We also use π~\widetilde{\pi}^{*} to denote a non-stationary policy such that π~h=πh\widetilde{\pi}^{*}_{h}=\pi^{*}_{h} when h[H]h\in[H] and π~h\widetilde{\pi}^{*}_{h} is defined arbitrarily when hHh\geq H.

Clearly, there exists a stationary policy π\pi such that for any (possibly non-stationary) policy π\pi^{\prime},

VM,γπVM,γπ.V^{\pi}_{M,\gamma}\geq V^{\pi^{\prime}}_{M,\gamma}.

For a proof, see Theorem 5.5.3 in [Put94]. Clearly,

VM,γπ~γHVM,Hπ.V^{\widetilde{\pi}^{*}}_{M,\gamma}\geq\gamma^{H}\cdot V^{\pi^{*}}_{M,H}.

Moreover, by Lemma 4.5,

VM,Hπ12VM,γπ12VM,γπ~12γHVM,Hπ1128|𝒮|8|𝒮|VM,Hπ.V^{\pi}_{M,H}\geq\frac{1}{2}V^{\pi}_{M,\gamma}\geq\frac{1}{2}V^{\widetilde{\pi}^{*}}_{M,\gamma}\geq\frac{1}{2}\cdot\gamma^{H}\cdot V^{\pi^{*}}_{M,H}\geq\frac{1}{128\cdot|\mathcal{S}|^{8|\mathcal{S}|}}\cdot V^{\pi^{*}}_{M,H}.
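Corollary 4.7 can be illustrated numerically on a tiny instance by computing the optimal non-stationary value via backward induction and enumerating all deterministic stationary policies (feasible only for very small |\mathcal{S}| and |\mathcal{A}|). The following Python sketch is an illustration under arbitrary randomly generated dynamics, not part of the proof; in practice the best stationary policy is typically far closer to optimal than the worst-case factor \frac{1}{128\cdot|\mathcal{S}|^{8|\mathcal{S}|}}.

import itertools
import numpy as np

def optimal_nonstationary_value(mu, P, r, H):
    # Optimal H-step value via backward induction; P has shape (S, A, S), r has shape (S, A).
    V = np.zeros(P.shape[0])
    for _ in range(H):
        Q = r + P @ V        # Q[s, a] = r[s, a] + sum_{s'} P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return mu @ V

def stationary_value(mu, P, r, H, pi):
    # H-step value of a deterministic stationary policy pi: S -> A.
    S = len(pi)
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    v, d = 0.0, mu.copy()
    for _ in range(H):
        v += d @ r_pi
        d = d @ P_pi
    return v

rng = np.random.default_rng(1)
S, A, H = 3, 2, 32                          # H >= 2 ln(8 |S|^{4|S|}) for |S| = 3
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))
mu = rng.dirichlet(np.ones(S))

v_star = optimal_nonstationary_value(mu, P, r, H)
v_stat = max(stationary_value(mu, P, r, H, np.array(pi))
             for pi in itertools.product(range(A), repeat=S))
print(v_stat >= v_star / (128 * S ** (8 * S)))   # the guarantee of Corollary 4.7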

By applying Corollary 4.7 in an MDP with an extra terminal state sterminals_{\mathrm{terminal}}, we can show that for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, there always exists a stationary policy that visits (s,a)(s,a) in the first H/2H/2 time steps with probability as large as the probability that the best non-stationary policy visits (s,a)(s,a) in all the HH time steps, up to a factor of exp(O(|𝒮|log|𝒮|))\exp\left(O(|\mathcal{S}|\log|\mathcal{S}|)\right).

Corollary 4.8.

For any MDP MM, if H2ln(8(|𝒮|+1)4(|𝒮|+1))H\geq 2\ln(8\cdot(|\mathcal{S}|+1)^{4(|\mathcal{S}|+1)}), then for any z𝒮×𝒜z\in\mathcal{S}\times\mathcal{A}, there exists a stationary policy π\pi, such that for any (possibly non-stationary) policy π\pi^{\prime},

Pr[h=0H/21𝕀[(sh,ah)=z]1]1512(|𝒮|+1)12(|𝒮|+1)Pr[h=0H1𝕀[(sh,ah)=z]1],\Pr\left[\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h},a_{h})=z]\geq 1\right]\geq\frac{1}{512\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h}^{\prime},a_{h}^{\prime})=z]\geq 1\right],

where

(s0,a0),(s1,a1),,(sH1,aH1),sH(s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}

is a random trajectory induced by executing π\pi in MM and

(s0,a0),(s1,a1),,(sH1,aH1),sH(s_{0}^{\prime},a_{0}^{\prime}),(s_{1}^{\prime},a_{1}^{\prime}),\ldots,(s_{H-1}^{\prime},a_{H-1}^{\prime}),s_{H}^{\prime}

is a random trajectory induced by executing π\pi^{\prime} in MM.

Proof.

For the given MDP M=(𝒮,𝒜,P,R,H,μ)M=\left(\mathcal{S},\mathcal{A},P,R,H,\mu\right), we create a new MDP

M=(𝒮{sterminal},𝒜,P,R,H,μ),M^{\prime}=\left(\mathcal{S}\cup\{s_{\mathrm{terminal}}\},\mathcal{A},P^{\prime},R^{\prime},H,\mu\right),

where sterminals_{\mathrm{terminal}} is a state such that sterminal𝒮s_{\mathrm{terminal}}\notin\mathcal{S}. Moreover,

P(s,a)={P(s,a)ssterminal and (s,a)zsterminals=sterminal or (s,a)=zP^{\prime}(s,a)=\begin{cases}P(s,a)&s\neq s_{\mathrm{terminal}}\text{ and }(s,a)\neq z\\ s_{\mathrm{terminal}}&s=s_{\mathrm{terminal}}\text{ or }(s,a)=z\end{cases}

and

R(s,a)=𝕀[(s,a)=z].R^{\prime}(s,a)=\mathbb{I}[(s,a)=z].

Clearly, for any policy π\pi,

V^{\pi}_{M^{\prime},H}=\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h},a_{h})=z]\geq 1\right]

where

(s0,a0),(s1,a1),,(sH1,aH1),sH(s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}

is a random trajectory induced by executing π\pi in MM. Therefore, by Corollary 4.7, there exists a stationary policy π\pi such that for any (possibly non-stationary) policy π\pi^{\prime},

VM,Hπ1128(|𝒮|+1)8(|𝒮|+1)VM,Hπ.V^{\pi}_{M^{\prime},H}\geq\frac{1}{128\cdot(|\mathcal{S}|+1)^{8(|\mathcal{S}|+1)}}V^{\pi^{\prime}}_{M^{\prime},H}.

Moreover, combining the above inequality with Lemma 4.6 (applied to M^{\prime}), for any (possibly non-stationary) policy \pi^{\prime},

VM,H/2π1512(|𝒮|+1)12(|𝒮|+1)VM,Hπ,V^{\pi}_{M^{\prime},\lfloor H/2\rfloor}\geq\frac{1}{512\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}V^{\pi^{\prime}}_{M^{\prime},H},

which implies the desired result. ∎

Finally, by combining Lemma 4.6 and Corollary 4.7, we can show that for any (s,a)\in\mathcal{S}\times\mathcal{A}, if the initial state distribution \mu is concentrated on s and there exists a non-stationary policy that visits (s,a) at least f times with probability at least \epsilon within the H steps, then there exists a stationary policy that visits (s,a) at least \exp\left(-O(|\mathcal{S}|\log|\mathcal{S}|)\right)\cdot\epsilon\cdot f times with constant probability within the first \lfloor H/2\rfloor steps.

Corollary 4.9.

For a given MDP MM and a state-action pair z=(sz,az)𝒮×𝒜z=(s_{z},a_{z})\in\mathcal{S}\times\mathcal{A}, suppose the initial state distribution μ\mu is szs_{z} and H2ln(8|𝒮|4|𝒮|)H\geq 2\ln(8\cdot|\mathcal{S}|^{4|\mathcal{S}|}). If there exists a (possibly non-stationary) policy π\pi^{\prime} such that 𝒬ϵπ(sz,az)f\mathcal{Q}^{\pi^{\prime}}_{\epsilon}(s_{z},a_{z})\geq f for some integer 0fH0\leq f\leq H, then there exists a stationary policy π\pi such that

𝒬1/2π(h=0H/21𝕀[(sh,ah)=z])12048|𝒮|12|𝒮|ϵf\mathcal{Q}^{\pi}_{1/2}\left(\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h},a_{h})=z]\right)\geq\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot f\right\rfloor

where

(s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{H-1},a_{H-1},r_{H-1}),s_{H}

is a trajectory induced by executing π\pi in MM.

Proof.

If f=0 then the corollary is clearly true. Now consider the case f>0. Consider a new MDP M^{\prime}=\left(\mathcal{S},\mathcal{A},P,R^{\prime},H,\mu\right) where R^{\prime}(s,a)=\mathbb{I}[(s,a)=z]. Clearly, V^{\pi^{\prime}}_{M^{\prime},H}\geq\epsilon\cdot f. By Corollary 4.7, there exists a stationary policy \pi such that

VM,Hπ1128|𝒮|8|𝒮|ϵf.V^{\pi}_{M^{\prime},H}\geq\frac{1}{128\cdot|\mathcal{S}|^{8|\mathcal{S}|}}\cdot\epsilon\cdot f.

By Lemma 4.6,

VM,H/2π1512|𝒮|12|𝒮|ϵf.V^{\pi}_{M^{\prime},\lfloor H/2\rfloor}\geq\frac{1}{512\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot f.

This implies \pi(s_{z})=a_{z}, since otherwise \pi would never visit z and we would have V^{\pi}_{M^{\prime},\lfloor H/2\rfloor}=0.

Now we use XX to denote a random variable which is defined to be

X=min{h1(sh,ah)=z}.X=\min\{h\geq 1\mid(s_{h},a_{h})=z\}.

Here the trajectory

(s0,a0),(s1,a1),(s_{0},a_{0}),(s_{1},a_{1}),\ldots

is induced by executing the stationary policy π\pi in MM^{\prime}. We also write X^=min{H/2,X}\hat{X}=\min\{\lfloor H/2\rfloor,X\}. We use {Xi}i=1\{X_{i}\}_{i=1}^{\infty} to denote a sequence of i.i.d. copies of X^\hat{X}. We use τ\tau to denote a random variable which is defined to be

τ=min{i1j=1iXjH/2}.\tau=\min\left\{i\geq 1\mid\sum_{j=1}^{i}X_{j}\geq\lfloor H/2\rfloor\right\}.

Clearly, \tau\leq\lfloor H/2\rfloor almost surely. Moreover, since \pi is a stationary policy, the initial state distribution \mu is deterministically s_{z}, and \pi(s_{z})=a_{z}, the random variables \tau and \sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h},a_{h})=z] have the same distribution. Indeed, whenever the trajectory (s_{0},a_{0}),(s_{1},a_{1}),\ldots visits z, a new independent copy of \hat{X} is started.

Now for each i>0, we define Y_{i}=X_{i}-\mathbb{E}[\hat{X}]. Clearly \mathbb{E}[Y_{i}]=0. Let S_{0}=0 and S_{i}=\sum_{j=1}^{i}Y_{j} for all i>0. Clearly \tau is a stopping time, and

j=1τXjH\sum_{j=1}^{\tau}X_{j}\leq H

since X_{i}\leq\lfloor H/2\rfloor for all i>0. By the martingale stopping theorem, we have

\mathbb{E}[S_{\tau}]=\mathbb{E}\left[\sum_{j=1}^{\tau}X_{j}\right]-\mathbb{E}[\tau]\cdot\mathbb{E}[\hat{X}]=0,

which implies 𝔼[τ]𝔼[X^]H\mathbb{E}[\tau]\cdot\mathbb{E}[\hat{X}]\leq H and therefore

𝔼[X^]H/𝔼[τ]=H/VM,H/2π512|𝒮|12|𝒮|H/(ϵf),\mathbb{E}[\hat{X}]\leq H/\mathbb{E}[\tau]=H/V^{\pi}_{M^{\prime},\lfloor H/2\rfloor}\leq 512\cdot|\mathcal{S}|^{12|\mathcal{S}|}H/(\epsilon\cdot f),

where we use the fact that

VM,H/2π=𝔼[h=0H/21𝕀[(sh,ah)=z]]=𝔼[τ].V^{\pi}_{M^{\prime},\lfloor H/2\rfloor}=\mathbb{E}\left[\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h},a_{h})=z]\right]=\mathbb{E}[\tau].

Let τ=12048|𝒮|12|𝒮|ϵf\tau^{\prime}=\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot f\right\rfloor. By Markov’s inequality, with probability at least 1/21/2,

i=1τXi2τ𝔼[X^]H/2,\sum_{i=1}^{\tau^{\prime}}X_{i}\leq 2\tau^{\prime}\mathbb{E}[\hat{X}]\leq H/2,

in which case ττ\tau\geq\tau^{\prime}. Consequently,

𝒬1/2π(h=0H/21𝕀[(sh,ah)=z])=𝒬1/2(τ)12048|𝒮|12|𝒮|ϵf.\mathcal{Q}^{\pi}_{1/2}\left(\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h},a_{h})=z]\right)=\mathcal{Q}_{1/2}(\tau)\geq\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot f\right\rfloor.
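The renewal and optional-stopping argument above can be illustrated with a short Monte Carlo sketch. Here \hat{X} is drawn from an arbitrary bounded integer-valued distribution (a truncated geometric, purely an illustrative assumption) rather than from an MDP, and we check the two facts used in the proof: \mathbb{E}[\tau]\cdot\mathbb{E}[\hat{X}]\leq H, and the median of \tau is not much smaller than H/(4\mathbb{E}[\hat{X}]).

import numpy as np

rng = np.random.default_rng(2)
H = 1000
half = H // 2

def draw_X_hat(n):
    # X_hat = min(floor(H/2), X) for a positive integer-valued X; the geometric
    # distribution below is an arbitrary illustrative choice, not derived from an MDP.
    return np.minimum(half, rng.geometric(p=0.05, size=n))

EX = draw_X_hat(200_000).mean()    # Monte Carlo estimate of E[X_hat]

taus = []
for _ in range(5_000):
    total, i = 0, 0
    while total < half:            # tau = first i with X_1 + ... + X_i >= floor(H/2)
        total += int(draw_X_hat(1)[0])
        i += 1
    taus.append(i)
taus = np.array(taus)

print(taus.mean() * EX <= H)                # Wald / optional stopping: E[tau] * E[X_hat] <= H
print(np.median(taus) >= H // (4 * EX))     # Markov's inequality: the median of tau is large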

5 Algorithm in the RL Setting

In this section, we present our algorithm in the RL setting together with its analysis. Our algorithm consists of two parts: collecting samples and computing a near-optimal policy from the collected data. In Section 5.1, we first present the algorithm for collecting samples together with its analysis. In Section 5.2, we establish a perturbation analysis of the value functions which is crucial for the later proofs. Finally, in Section 5.3, we present the algorithm for finding near-optimal policies based on the dataset collected by the algorithm in Section 5.1, together with an analysis based on the machinery developed in Section 5.2.

5.1 Collecting Samples

In this section, we present our algorithm for collecting samples, which is formally described in Algorithm 1. The dataset D returned by Algorithm 1 consists of N lists, where the elements of each list are tuples of the form (s,a,r,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times[0,1]\times\mathcal{S}. To construct these lists, Algorithm 1 enumerates a state-action pair (s,a)\in\mathcal{S}\times\mathcal{A} and a pair of stationary policies (\pi_{1},\pi_{2}), and then collects a trajectory using \pi_{1} and \pi_{2}. More specifically, \pi_{1} is executed until the trajectory visits (s,a); from the next step on, \pi_{2} is executed until the end of the episode. A minimal Python sketch of this sampling scheme is given after the pseudocode.

Algorithm 1 Collect Samples
1:Input: number of repetitions NN
2:Output: Dataset DD where D=(((si,t,ai,t,ri,t,si,t))t=0|𝒮||𝒜||𝒜|2|𝒮|H1)i=0N1D=\left(\left(\left(s_{i,t},a_{i,t},r_{i,t},s^{\prime}_{i,t}\right)\right)_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\right)_{i=0}^{N-1}
3:for i[N]i\in[N] do
4:     Let TiT_{i} be an empty list
5:     for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} do
6:         for (π1,π2)Π𝗌𝗍×Π𝗌𝗍(\pi_{1},\pi_{2})\in\Pi_{\mathsf{st}}\times\Pi_{\mathsf{st}} do
7:              Receive s0μs_{0}\sim\mu
8:              for h[H]h\in[H] do
9:                  if (s,a)=(sh,ah)(s,a)=(s_{h^{\prime}},a_{h^{\prime}}) for some h<hh^{\prime}<h then
10:                       Take ah=π2(sh)a_{h}=\pi_{2}(s_{h})
11:                  else
12:                       Take ah=π1(sh)a_{h}=\pi_{1}(s_{h})                   
13:                  Receive rhR(sh,ah)r_{h}\sim R(s_{h},a_{h}) and sh+1P(sh,ah)s_{h+1}\sim P(s_{h},a_{h})
14:                  Append (sh,ah,rh,sh+1)(s_{h},a_{h},r_{h},s_{h+1}) to the end of TiT_{i}                             
15:return DD where D=(Ti)i=0N1D=(T_{i})_{i=0}^{N-1}
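The sampling scheme of Algorithm 1 can be written compactly as in the following Python sketch. The env.reset()/env.step(a) simulator interface is an assumption (it is not part of the paper), and enumerating all deterministic stationary policies is feasible only for very small |\mathcal{S}| and |\mathcal{A}|.

import itertools

def collect_samples(env, states, actions, H, N):
    # Sketch of Algorithm 1: for every (s, a) and every pair of deterministic stationary
    # policies (pi1, pi2), roll out one episode that follows pi1 until (s, a) has been
    # visited and pi2 from the next step on.  env.reset() is assumed to return an initial
    # state drawn from mu, and env.step(a) to return a pair (reward, next_state).
    all_policies = [dict(zip(states, choice))
                    for choice in itertools.product(actions, repeat=len(states))]
    dataset = []
    for _ in range(N):
        trajectory = []                     # this is the list T_i
        for (s, a) in itertools.product(states, actions):
            for pi1, pi2 in itertools.product(all_policies, repeat=2):
                visited = False             # has (s, a) appeared at an earlier step?
                s_h = env.reset()
                for _ in range(H):
                    a_h = pi2[s_h] if visited else pi1[s_h]
                    r_h, s_next = env.step(a_h)
                    trajectory.append((s_h, a_h, r_h, s_next))
                    if (s_h, a_h) == (s, a):
                        visited = True
                    s_h = s_next
        dataset.append(trajectory)
    return dataset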

Throughout this section, we use M=(𝒮,𝒜,P,R,H,μ)M=\left(\mathcal{S},\mathcal{A},P,R,H,\mu\right) to denote the underlying MDP that the agent interacts with. For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} and (π1,π2)Π𝗌𝗍×Π𝗌𝗍(\pi_{1},\pi_{2})\in\Pi_{\mathsf{st}}\times\Pi_{\mathsf{st}}, let

(s0s,a,π1,π2,a0s,a,π1,π2,r0s,a,π1,π2),(s1s,a,π1,π2,a1s,a,π1,π2,r1s,a,π1,π2),,(sH1s,a,π1,π2,aH1s,a,π1,π2,rH1s,a,π1,π2),sHs,a,π1,π2(s_{0}^{s,a,\pi_{1},\pi_{2}},a_{0}^{s,a,\pi_{1},\pi_{2}},r_{0}^{s,a,\pi_{1},\pi_{2}}),(s_{1}^{s,a,\pi_{1},\pi_{2}},a_{1}^{s,a,\pi_{1},\pi_{2}},r_{1}^{s,a,\pi_{1},\pi_{2}}),\ldots,(s_{H-1}^{s,a,\pi_{1},\pi_{2}},a_{H-1}^{s,a,\pi_{1},\pi_{2}},r_{H-1}^{s,a,\pi_{1},\pi_{2}}),s_{H}^{s,a,\pi_{1},\pi_{2}}

be a trajectory where s_{0}^{s,a,\pi_{1},\pi_{2}}\sim\mu, s_{h}^{s,a,\pi_{1},\pi_{2}}\sim P(s_{h-1}^{s,a,\pi_{1},\pi_{2}},a_{h-1}^{s,a,\pi_{1},\pi_{2}}) for all 1\leq h\leq H, r_{h}^{s,a,\pi_{1},\pi_{2}}\sim R(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}}) for all h\in[H], and

ahs,a,π1,π2={π2(shs,a,π1,π2)(s,a)=(shs,a,π1,π2,ahs,a,π1,π2) for some h<hπ1(shs,a,π1,π2)otherwisea_{h}^{s,a,\pi_{1},\pi_{2}}=\begin{cases}\pi_{2}(s_{h}^{s,a,\pi_{1},\pi_{2}})&(s,a)=(s_{h^{\prime}}^{s,a,\pi_{1},\pi_{2}},a_{h^{\prime}}^{s,a,\pi_{1},\pi_{2}})\text{ for some $h^{\prime}<h$}\\ \pi_{1}(s_{h}^{s,a,\pi_{1},\pi_{2}})&\text{otherwise}\end{cases}

for all h[H]h\in[H]. Note that the above trajectory is the one collected by Algorithm 1 when a specific state-action pair (s,a)(s,a) and a specific pair of policies (π1,π2)(\pi_{1},\pi_{2}) are used.

For any ϵ(0,1]\epsilon\in(0,1], define

𝒬ϵ𝗌𝗍(s,a)=𝒬ϵ((s,a)𝒮×𝒜π1Π𝗌𝗍π2Π𝗌𝗍h=0H1𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]).\mathcal{Q}^{\mathsf{st}}_{\epsilon}(s,a)=\mathcal{Q}_{\epsilon}\left(\sum_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\sum_{\pi_{1}\in\Pi_{\mathsf{st}}}\sum_{\pi_{2}\in\Pi_{\mathsf{st}}}\sum_{h=0}^{H-1}\mathbb{I}[(s_{h}^{s^{\prime},a^{\prime},\pi_{1},\pi_{2}},a_{h}^{s^{\prime},a^{\prime},\pi_{1},\pi_{2}})=(s,a)]\right).

Clearly, \mathcal{Q}^{\mathsf{st}}_{\epsilon}(s,a) is the \epsilon-quantile of the number of times (s,a) appears in a single list T_{i}.

In Lemma 5.1, we first show that for each (s,a)\in\mathcal{S}\times\mathcal{A}, if there exists a policy \pi that visits (s,a) at least m(s,a) times with probability at least \epsilon, then

\mathcal{Q}^{\mathsf{st}}_{\epsilon/\exp(O(|\mathcal{S}|\log|\mathcal{S}|))}(s,a)\geq\epsilon\cdot m(s,a)/\exp(O(|\mathcal{S}|\log|\mathcal{S}|)).
Lemma 5.1.

Let ϵ(0,1]\epsilon\in(0,1] be a given real number. For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, let mϵ(s,a)m_{\epsilon}(s,a) be the largest integer such that there exists a (possibly non-stationary) policy πs,a\pi_{s,a} such that 𝒬ϵπs,a(s,a)mϵ(s,a)\mathcal{Q}^{\pi_{s,a}}_{\epsilon}(s,a)\geq m_{\epsilon}(s,a). Then for each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A},

𝒬ϵ(|𝒮|+1)12(|𝒮|+1)/1024𝗌𝗍(s,a)14096|𝒮|12|𝒮|ϵmϵ(s,a).\mathcal{Q}^{\mathsf{st}}_{\epsilon(|\mathcal{S}|+1)^{-12(|\mathcal{S}|+1)}/1024}(s,a)\geq\frac{1}{4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a).
Proof.

For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, there exists a (possibly non-stationary) policy πs,a\pi_{s,a} such that 𝒬ϵπs,a(s,a)mϵ(s,a)\mathcal{Q}^{\pi_{s,a}}_{\epsilon}(s,a)\geq m_{\epsilon}(s,a). Here we consider the case that mϵ(s,a)1m_{\epsilon}(s,a)\geq 1, since otherwise the lemma clearly holds. By Corollary 4.8, there exists a stationary policy πs,a\pi_{s,a}^{\prime} such that

Pr[h=0H/21𝕀[(sh,ah)=(s,a)]1]ϵ512(|𝒮|+1)12(|𝒮|+1),\Pr\left[\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h},a_{h})=(s,a)]\geq 1\right]\geq\frac{\epsilon}{512\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}},

where

(s0,a0),(s1,a1),,(sH1,aH1),sH(s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}

is a random trajectory induced by executing πs,a\pi_{s,a}^{\prime} in MM.

In the remaining part of the analysis, we consider two cases.

Case I: mϵ(s,a)4096|𝒮|12|𝒮|/ϵm_{\epsilon}(s,a)\geq 4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}/\epsilon.

Let

(s0,a0),(s1,a1),,(sH1,aH1),sH(s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}

be a random trajectory induced by executing πs,a\pi_{s,a} in MM. Let Xs,aX_{s,a} be the random variable which is defined to be

Xs,a={min{h[H](sh,ah)=(s,a)}if there exists h[H] such that (sh,ah)=(s,a)Hotherwise.X_{s,a}=\begin{cases}\min\{h\in[H]\mid(s_{h},a_{h})=(s,a)\}&\text{if there exists $h\in[H]$ such that $(s_{h},a_{h})=(s,a)$}\\ H&\text{otherwise}\end{cases}.

Clearly,

h=0H1Pr[Xs,a=h]Pr[h=hH1𝕀[(sh,ah)=(s,a)]mϵ(s,a)(sh,ah)=(s,a)]\displaystyle\sum_{h^{\prime}=0}^{H-1}\Pr[X_{s,a}=h^{\prime}]\cdot\Pr\left[\sum_{h=h^{\prime}}^{H-1}\mathbb{I}[(s_{h},a_{h})=(s,a)]\geq m_{\epsilon}(s,a)\mid(s_{h^{\prime}},a_{h^{\prime}})=(s,a)\right]
=\displaystyle= Pr[h=0H1𝕀[(sh,ah)=(s,a)]mϵ(s,a)]ϵ.\displaystyle\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h},a_{h})=(s,a)]\geq m_{\epsilon}(s,a)\right]\geq\epsilon.

Therefore, there exists h[H]h^{\prime}\in[H] such that

Pr[Xs,a=h]>0\Pr[X_{s,a}=h^{\prime}]>0

and

Pr[h=hH1𝕀[(sh,ah)=(s,a)]mϵ(s,a)(sh,ah)=(s,a)]ϵ.\Pr\left[\sum_{h=h^{\prime}}^{H-1}\mathbb{I}[(s_{h},a_{h})=(s,a)]\geq m_{\epsilon}(s,a)\mid(s_{h^{\prime}},a_{h^{\prime}})=(s,a)\right]\geq\epsilon.

Note that we must have (\pi_{s,a})_{h^{\prime}}(s)=a, since otherwise \Pr[X_{s,a}=h^{\prime}]=0.

Now we consider a new MDP Ms=(𝒮,𝒜,P,R,H,μs)M_{s}=\left(\mathcal{S},\mathcal{A},P,R,H,\mu_{s}\right) where μs=s\mu_{s}=s. Let π~\widetilde{\pi} be an arbitrary policy so that π~h=(πs,a)h+h\widetilde{\pi}_{h}=(\pi_{s,a})_{h^{\prime}+h} for all h[Hh]h\in[H-h^{\prime}]. Clearly,

Pr[h=0H1𝕀[(sh,ah)=(s,a)]mϵ(s,a)]Pr[h=0Hh1𝕀[(sh,ah)=(s,a)]mϵ(s,a)]ϵ\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h}^{\prime},a_{h}^{\prime})=(s,a)]\geq m_{\epsilon}(s,a)\right]\geq\Pr\left[\sum_{h=0}^{H-h^{\prime}-1}\mathbb{I}[(s_{h}^{\prime},a_{h}^{\prime})=(s,a)]\geq m_{\epsilon}(s,a)\right]\geq\epsilon

where

(s0,a0),(s1,a1),,(sH1,aH1),sH(s_{0}^{\prime},a_{0}^{\prime}),(s_{1}^{\prime},a_{1}^{\prime}),\ldots,(s_{H-1}^{\prime},a_{H-1}^{\prime}),s_{H}^{\prime}

is a random trajectory induced by executing π~\widetilde{\pi} in MsM_{s}. Therefore, by Corollary 4.9, there exists a stationary policy π~s,a\widetilde{\pi}_{s,a} such that

Pr[h=0H/21𝕀[(sh′′,ah′′)=(s,a)]12048|𝒮|12|𝒮|ϵmϵ(s,a)]1/2\Pr\left[\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h}^{\prime\prime},a_{h}^{\prime\prime})=(s,a)]\geq\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)\right\rfloor\right]\geq 1/2

where

(s0′′,a0′′),(s1′′,a1′′),,(sH1′′,aH1′′),sH′′(s_{0}^{\prime\prime},a_{0}^{\prime\prime}),(s_{1}^{\prime\prime},a_{1}^{\prime\prime}),\ldots,(s_{H-1}^{\prime\prime},a_{H-1}^{\prime\prime}),s_{H}^{\prime\prime}

is a random trajectory induced by executing \widetilde{\pi}_{s,a} in M_{s}. Since m_{\epsilon}(s,a)\geq 4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}/\epsilon, we have \left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)\right\rfloor\geq 1, and thus we must have \widetilde{\pi}_{s,a}(s)=a.

Now we consider the case when π1=πs,a\pi_{1}=\pi_{s,a}^{\prime} and π2=π~s,a\pi_{2}=\widetilde{\pi}_{s,a}. Since π1=πs,a\pi_{1}=\pi_{s,a}^{\prime},

Pr[h=0H/21𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]1]ϵ512(|𝒮|+1)12(|𝒮|+1).\Pr\left[\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq 1\right]\geq\frac{\epsilon}{512\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}.

Therefore, let Xs,aX_{s,a}^{\prime} be the random variable which is defined to be

X_{s,a}^{\prime}=\begin{cases}\min\{h\in[\lfloor H/2\rfloor]\mid(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)\}&\text{if $(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)$ for some $h\in[\lfloor H/2\rfloor]$}\\ \lfloor H/2\rfloor&\text{otherwise}\end{cases}.

We have that

Pr[Xs,a[H/2]]ϵ512(|𝒮|+1)12(|𝒮|+1).\Pr[X_{s,a}^{\prime}\in[\lfloor H/2\rfloor]]\geq\frac{\epsilon}{512\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}.

Moreover, for each h[H/2]h^{\prime}\in[\lfloor H/2\rfloor], since π2=π~s,a\pi_{2}=\widetilde{\pi}_{s,a},

Pr[h=hH1𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]12048|𝒮|12|𝒮|ϵmϵ(s,a)(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]\displaystyle\Pr\left[\sum_{h=h^{\prime}}^{H-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)\right\rfloor\mid(s_{h^{\prime}}^{s,a,\pi_{1},\pi_{2}},a_{h^{\prime}}^{s,a,\pi_{1},\pi_{2}})=(s,a)\right]
\displaystyle\geq 1/2.\displaystyle 1/2.

Therefore,

Pr[h=0H1𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]12048|𝒮|12|𝒮|ϵmϵ(s,a)]\displaystyle\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)\right\rfloor\right]
\displaystyle\geq h=0H/21Pr[Xs,a=h]\displaystyle\sum_{h^{\prime}=0}^{\lfloor H/2\rfloor-1}\Pr[X_{s,a}^{\prime}=h^{\prime}]
\displaystyle\cdot Pr[h=hH1𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]12048|𝒮|12|𝒮|ϵmϵ(s,a)(shπ1,π2,ahπ1,π2)=(s,a)]\displaystyle\Pr\left[\sum_{h=h^{\prime}}^{H-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq\left\lfloor\frac{1}{2048\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)\right\rfloor\mid(s_{h^{\prime}}^{\pi_{1},\pi_{2}},a_{h^{\prime}}^{\pi_{1},\pi_{2}})=(s,a)\right]
\displaystyle\geq ϵ1024(|𝒮|+1)12(|𝒮|+1).\displaystyle\frac{\epsilon}{1024\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}.

Since mϵ(s,a)4096|𝒮|12|𝒮|/ϵm_{\epsilon}(s,a)\geq 4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}/\epsilon, we have

\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq\frac{1}{4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)\right]\geq\frac{\epsilon}{1024\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}

and thus

𝒬ϵ(|𝒮|+1)12(|𝒮|+1)/1024𝗌𝗍(s,a)14096|𝒮|12|𝒮|ϵmϵ(s,a).\mathcal{Q}^{\mathsf{st}}_{\epsilon(|\mathcal{S}|+1)^{-12(|\mathcal{S}|+1)}/1024}(s,a)\geq\frac{1}{4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a).

Case II: mϵ(s,a)<4096|𝒮|12|𝒮|/ϵm_{\epsilon}(s,a)<4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}/\epsilon.

Consider the case when π1=π2=πs,a\pi_{1}=\pi_{2}=\pi_{s,a}^{\prime}. Clearly,

Pr[h=0H1𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]1]\displaystyle\Pr\left[\sum_{h=0}^{H-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq 1\right]
\displaystyle\geq Pr[h=0H/21𝕀[(shs,a,π1,π2,ahs,a,π1,π2)=(s,a)]1]ϵ512(|𝒮|+1)12(|𝒮|+1)\displaystyle\Pr\left[\sum_{h=0}^{\lfloor H/2\rfloor-1}\mathbb{I}[(s_{h}^{s,a,\pi_{1},\pi_{2}},a_{h}^{s,a,\pi_{1},\pi_{2}})=(s,a)]\geq 1\right]\geq\frac{\epsilon}{512\cdot(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}

and thus, since \frac{1}{4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a)<1 in this case,

𝒬ϵ(|𝒮|+1)12(|𝒮|+1)/1024𝗌𝗍(s,a)14096|𝒮|12|𝒮|ϵmϵ(s,a).\mathcal{Q}^{\mathsf{st}}_{\epsilon(|\mathcal{S}|+1)^{-12(|\mathcal{S}|+1)}/1024}(s,a)\geq\frac{1}{4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon\cdot m_{\epsilon}(s,a).

Now we show that for a given percentile \epsilon, with high probability, each (s,a)\in\mathcal{S}\times\mathcal{A} appears at least \mathcal{Q}^{\mathsf{st}}_{\epsilon/4}(s,a) times in at least \Omega(N\cdot\epsilon) of the N lists in the dataset D returned by Algorithm 1.

Lemma 5.2.

Let \epsilon,\delta\in(0,1] be given real numbers. Let D be the dataset returned by Algorithm 1 where

D=(((si,t,ai,t,ri,t,si,t))t=0|𝒮||𝒜||𝒜|2|𝒮|H1)i=0N1.D=\left(\left(\left(s_{i,t},a_{i,t},r_{i,t},s^{\prime}_{i,t}\right)\right)_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\right)_{i=0}^{N-1}.

Suppose N16/ϵlog(3|𝒮||𝒜|/δ)N\geq 16/\epsilon\cdot\log(3|\mathcal{S}||\mathcal{A}|/\delta). With probability at least 1δ/31-\delta/3, for each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

i=0N1𝕀[t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]𝒬ϵ/4𝗌𝗍(s,a)]Nϵ/8.\sum_{i=0}^{N-1}\mathbb{I}\left[\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}[(s_{i,t},a_{i,t})=(s,a)]\geq\mathcal{Q}^{\mathsf{st}}_{\epsilon/4}(s,a)\right]\geq N\epsilon/8.
Proof.

For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, by the definition of 𝒬ϵ/4𝗌𝗍(s,a)\mathcal{Q}^{\mathsf{st}}_{\epsilon/4}(s,a), for each i[N]i\in[N], we have

𝔼[𝕀[t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]𝒬ϵ/4𝗌𝗍(s,a)]]ϵ/4.\mathbb{E}\left[\mathbb{I}\left[\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}[(s_{i,t},a_{i,t})=(s,a)]\geq\mathcal{Q}^{\mathsf{st}}_{\epsilon/4}(s,a)\right]\right]\geq\epsilon/4.

Hence, the desired result follows by Chernoff bound and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. ∎

We also need a subroutine to estimate \mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a) for some \epsilon_{\mathrm{est}} to be decided. Such estimates are crucial for building estimators of the transition probabilities and the rewards with bounded variance, which we elaborate on later in this section.

Our algorithm for estimating \mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a) is described in Algorithm 2. Algorithm 2 collects N lists, where the elements of each list are tuples of the form (s,a,r,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times[0,1]\times\mathcal{S}. These N lists are collected using the same approach as in Algorithm 1. Once these N lists are collected, for each (s,a)\in\mathcal{S}\times\mathcal{A}, our estimate (denoted by \overline{m}^{\mathsf{st}}(s,a)) is set to be the \lceil N\cdot\epsilon_{\mathrm{est}}/2\rceil-th largest element in F_{s,a}, where F_{s,a} is the multiset of the numbers of times (s,a) appears in each of the N lists. A minimal Python sketch of this last step is given after the pseudocode.

Algorithm 2 Estimate Quantiles
1:Input: Percentile ϵest\epsilon_{\mathrm{est}}, failure probability δest\delta_{\mathrm{est}}
2:Output: Estimates m¯𝗌𝗍:𝒮×𝒜\overline{m}^{\mathsf{st}}:\mathcal{S}\times\mathcal{A}\to\mathbb{N}
3:Let N=300log(6|𝒮||𝒜|/δest)/ϵestN=\lceil 300\log(6|\mathcal{S}||\mathcal{A}|/\delta_{\mathrm{est}})/\epsilon_{\mathrm{est}}\rceil
4:Let Fs,aF_{s,a} be an empty multiset for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}
5:for i[N]i\in[N] do
6:     Let TiT_{i} be an empty list
7:     for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} do
8:         for (π1,π2)Π𝗌𝗍×Π𝗌𝗍(\pi_{1},\pi_{2})\in\Pi_{\mathsf{st}}\times\Pi_{\mathsf{st}} do
9:              Receive s0μs_{0}\sim\mu
10:              for h[H]h\in[H] do
11:                  if (s,a)=(sh,ah)(s,a)=(s_{h^{\prime}},a_{h^{\prime}}) for some 0h<h0\leq h^{\prime}<h then
12:                       Take ah=π2(sh)a_{h}=\pi_{2}(s_{h})
13:                  else
14:                       Take ah=π1(sh)a_{h}=\pi_{1}(s_{h})                   
15:                  Receive rhR(sh,ah)r_{h}\sim R(s_{h},a_{h}) and sh+1P(sh,ah)s_{h+1}\sim P(s_{h},a_{h})
16:                  Append (sh,ah,rh,sh+1)(s_{h},a_{h},r_{h},s_{h+1}) to the end of TiT_{i}                             
17:     for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} do
18:         Add t=0|Ti|1𝕀[(st,at)=(s,a)]\sum_{t=0}^{|T_{i}|-1}\mathbb{I}[(s_{t},a_{t})=(s,a)] into Fs,aF_{s,a} where
Ti=((s0,a0,r0,s0),(s1,a1,r1,s1),,(s|Ti|1,a|Ti|1,r|Ti|1,s|Ti|1))T_{i}=((s_{0},a_{0},r_{0},s_{0}^{\prime}),(s_{1},a_{1},r_{1},s_{1}^{\prime}),\ldots,(s_{|T_{i}|-1},a_{|T_{i}|-1},r_{|T_{i}|-1},s_{|T_{i}|-1}^{\prime}))
     
19:for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} do
20:     Set m¯𝗌𝗍(s,a)\overline{m}^{\mathsf{st}}(s,a) be the Nϵest/2\lceil N\cdot\epsilon_{\mathrm{est}}/2\rceil-th largest element in Fs,aF_{s,a}
21:return m¯𝗌𝗍\overline{m}^{\mathsf{st}}
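Beyond the data collection already sketched above, the only new step in Algorithm 2 is the computation in lines 17-20. A minimal Python sketch, assuming the dataset format of the earlier collection sketch:

import math

def estimate_quantiles(dataset, states, actions, eps_est):
    # Sketch of lines 17-20 of Algorithm 2: for each (s, a), collect the per-list visit
    # counts F_{s,a} and return the ceil(N * eps_est / 2)-th largest one as m_bar(s, a).
    N = len(dataset)
    k = math.ceil(N * eps_est / 2)
    m_bar = {}
    for s in states:
        for a in actions:
            counts = sorted((sum(1 for (s_t, a_t, _, _) in traj if (s_t, a_t) == (s, a))
                             for traj in dataset), reverse=True)
            m_bar[(s, a)] = counts[k - 1]   # k-th largest element of the multiset F_{s,a}
    return m_bar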

We now show that for each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, m¯𝗌𝗍(s,a)\overline{m}^{\mathsf{st}}(s,a) is an accurate estimate of 𝒬ϵest𝗌𝗍(s,a)\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a).

Lemma 5.3.

Let m¯𝗌𝗍\overline{m}^{\mathsf{st}} be the function returned by Algorithm 2. With probability at least 1δest/31-\delta_{\mathrm{est}}/3, for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A},

𝒬ϵest𝗌𝗍(s,a)m¯𝗌𝗍(s,a)𝒬ϵest/4𝗌𝗍(s,a).\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\leq\overline{m}^{\mathsf{st}}(s,a)\leq\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}/4}(s,a).
Proof.

Fix a state-action pair (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. For each i[N]i\in[N], define

X¯i=𝕀[t=0|Ti|1𝕀[(st,at)=(s,a)]>𝒬ϵest/4𝗌𝗍(s,a)]\underline{X}_{i}=\mathbb{I}\left[\sum_{t=0}^{|T_{i}|-1}\mathbb{I}[(s_{t},a_{t})=(s,a)]>\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}/4}(s,a)\right]

where

Ti=((s0,a0,r0,s0),(s1,a1,r1,s1),,(s|Ti|1,a|Ti|1,r|Ti|1,s|Ti|1)).T_{i}=((s_{0},a_{0},r_{0},s_{0}^{\prime}),(s_{1},a_{1},r_{1},s_{1}^{\prime}),\ldots,(s_{|T_{i}|-1},a_{|T_{i}|-1},r_{|T_{i}|-1},s_{|T_{i}|-1}^{\prime})).

For each i[N]i\in[N], by the definition of 𝒬ϵest/4𝗌𝗍(s,a)\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}/4}(s,a), we have 𝔼[X¯i]ϵest/4\mathbb{E}[\underline{X}_{i}]\leq\epsilon_{\mathrm{est}}/4 and thus i=0N1𝔼[X¯i]Nϵest/4\sum_{i=0}^{N-1}\mathbb{E}[\underline{X}_{i}]\leq N\cdot\epsilon_{\mathrm{est}}/4. By Chernoff bound, with probability at most δest/(6|𝒮||𝒜|)\delta_{\mathrm{est}}/(6|\mathcal{S}||\mathcal{A}|),

i=0N1X¯iNϵest/3.\sum_{i=0}^{N-1}\underline{X}_{i}\geq N\cdot\epsilon_{\mathrm{est}}/3.

On the other hand, for each i[N]i\in[N], define

\overline{X}_{i}=\mathbb{I}\left[\sum_{t=0}^{|T_{i}|-1}\mathbb{I}[(s_{t},a_{t})=(s,a)]\geq\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\right]

where

Ti=((s0,a0,r0,s0),(s1,a1,r1,s1),,(s|Ti|1,a|Ti|1,r|Ti|1,s|Ti|1)).T_{i}=((s_{0},a_{0},r_{0},s_{0}^{\prime}),(s_{1},a_{1},r_{1},s_{1}^{\prime}),\ldots,(s_{|T_{i}|-1},a_{|T_{i}|-1},r_{|T_{i}|-1},s_{|T_{i}|-1}^{\prime})).

For each i[N]i\in[N], by the definition of 𝒬ϵest𝗌𝗍(s,a)\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a), we have 𝔼[X¯i]ϵest\mathbb{E}[\overline{X}_{i}]\geq\epsilon_{\mathrm{est}} and thus i=0N1𝔼[X¯i]Nϵest\sum_{i=0}^{N-1}\mathbb{E}[\overline{X}_{i}]\geq N\cdot\epsilon_{\mathrm{est}}. By Chernoff bound, with probability at most δest/(6|𝒮||𝒜|)\delta_{\mathrm{est}}/(6|\mathcal{S}||\mathcal{A}|),

i=0N1X¯i2Nϵest/3.\sum_{i=0}^{N-1}\overline{X}_{i}\leq 2N\cdot\epsilon_{\mathrm{est}}/3.

Hence, by union bound, with probability at least 1δest/(3|𝒮||𝒜|)1-\delta_{\mathrm{est}}/(3|\mathcal{S}||\mathcal{A}|),

\sum_{i=0}^{N-1}\underline{X}_{i}<N\cdot\epsilon_{\mathrm{est}}/3

and

i=0N1X¯i>2Nϵest/3,\sum_{i=0}^{N-1}\overline{X}_{i}>2N\cdot\epsilon_{\mathrm{est}}/3,

in which case the Nϵest/2\lceil N\cdot\epsilon_{\mathrm{est}}/2\rceil-th largest element in Fs,aF_{s,a} is in [𝒬ϵest𝗌𝗍(s,a),𝒬ϵest/4𝗌𝗍(s,a)]\left[\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a),\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}/4}(s,a)\right]. We finish the proof by a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. ∎

In Lemma 5.4, we show that using the dataset D returned by Algorithm 1, together with the estimates of the quantiles returned by Algorithm 2, we can compute accurate estimates of the transition probabilities and rewards. The estimators used in Lemma 5.4 are the empirical estimators, with proper truncation when a list T_{i} contains too many samples from a state-action pair (i.e., more than \overline{m}^{\mathsf{st}}(\cdot,\cdot) of them). As will be made clear in the proof, such truncation is crucial for obtaining estimators with bounded variance.
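A minimal Python sketch of these truncated estimators (the dataset format matches the earlier sketches, and all function and variable names are illustrative):

from collections import defaultdict

def truncated_estimators(dataset, m_bar, states):
    # Empirical estimates of P(s' | s, a), E[R(s, a)] and mu(s); each list contributes at
    # most m_bar[(s, a)] transitions out of (s, a), which implements the Trunc indicator.
    m_D = defaultdict(int)            # m_D(s, a): number of transitions kept after truncation
    trans = defaultdict(int)          # kept counts of (s, a, s')
    rew = defaultdict(float)          # kept rewards, summed per (s, a)
    init = defaultdict(int)
    N = len(dataset)
    for traj in dataset:
        init[traj[0][0]] += 1         # the first state of each list is a sample from mu
        seen = defaultdict(int)       # visits to (s, a) so far within this list
        for (s, a, r, s_next) in traj:
            if seen[(s, a)] < m_bar[(s, a)]:
                m_D[(s, a)] += 1
                trans[(s, a, s_next)] += 1
                rew[(s, a)] += r
            seen[(s, a)] += 1
    P_hat = {key: c / max(1, m_D[key[:2]]) for key, c in trans.items()}
    R_hat = {sa: rew[sa] / max(1, m_D[sa]) for sa in m_D}
    mu_hat = {s: init[s] / N for s in states}
    return P_hat, R_hat, mu_hat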

Lemma 5.4.

Suppose Algorithm 2 is invoked with the percentile set to be ϵest\epsilon_{\mathrm{est}} and the failure probability set to be δ\delta, and Algorithm 1 is invoked with N16/ϵestlog(3|𝒮||𝒜|/δ)N\geq 16/\epsilon_{\mathrm{est}}\cdot\log(3|\mathcal{S}||\mathcal{A}|/\delta). Let m¯𝗌𝗍:𝒮×𝒜\overline{m}^{\mathsf{st}}:\mathcal{S}\times\mathcal{A}\to\mathbb{N} be the estimates returned by Algorithm 2. Let DD be the dataset returned by Algorithm 1 where

D=(((si,t,ai,t,ri,t,si,t))t=0|𝒮||𝒜||𝒜|2|𝒮|H1)i=0N1.D=\left(\left(\left(s_{i,t},a_{i,t},r_{i,t},s^{\prime}_{i,t}\right)\right)_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\right)_{i=0}^{N-1}.

For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, for each i[N]i\in[N] and t[|𝒮||𝒜||𝒜|2|𝒮|H]t\in\left[|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H\right], define

𝖳𝗋𝗎𝗇𝖼i,t(s,a)=𝕀[t=0t1𝕀[(si,t,ai,t)=(s,a)]<m¯𝗌𝗍(s,a)].\mathsf{Trunc}_{i,t}(s,a)=\mathbb{I}\left[\sum_{t^{\prime}=0}^{t-1}\mathbb{I}\left[(s_{i,t^{\prime}},a_{i,t^{\prime}})=(s,a)\right]<\overline{m}^{\mathsf{st}}(s,a)\right].

For each (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, define

mD(s,a)=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a),m_{D}(s,a)=\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a),
P^(ss,a)=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t,si,t)=(s,a,s)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)max{1,mD(s,a)},\widehat{P}(s^{\prime}\mid s,a)=\frac{\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]\cdot\mathsf{Trunc}_{i,t}(s,a)}{\max\{1,m_{D}(s,a)\}},
R^(s,a)=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]ri,t𝖳𝗋𝗎𝗇𝖼i,t(s,a)max{1,mD(s,a)}\widehat{R}(s,a)=\frac{\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot r_{i,t}\cdot\mathsf{Trunc}_{i,t}(s,a)}{\max\{1,m_{D}(s,a)\}}

and

\widehat{\mu}(s)=\frac{\sum_{i=0}^{N-1}\mathbb{I}[s_{i,0}=s]}{N}.

Then with probability at least 1δ1-\delta, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} with 𝒬ϵest𝗌𝗍(s,a)>0\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)>0, we have

|P^(ss,a)P(ss,a)|\displaystyle\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq max{512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,32P^(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest}\displaystyle\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},32\sqrt{\frac{\widehat{P}(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\}
\displaystyle\leq max{512log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest,64P(ss,a)log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest},\displaystyle\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},64\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\},
\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq 8\sqrt{\frac{\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}+\frac{8\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},

and

|μ^(s)μ(s)|log(18|𝒮|/δ)N.\left|\widehat{\mu}(s)-\mu(s)\right|\leq\sqrt{\frac{\log(18|\mathcal{S}|/\delta)}{N}}.
Proof.

Fix a state-action pair (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} and s𝒮s^{\prime}\in\mathcal{S}. For each i[N]i\in[N] and t[|𝒮||𝒜||𝒜|2|𝒮|H]t\in\left[|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H\right], let i,t\mathcal{F}_{i,t} be the filtration induced by

{(si,t,ai,t,ri,t,si,t)}t=0t1.\left\{\left(s_{i,t^{\prime}},a_{i,t^{\prime}},r_{i,t^{\prime}},s_{i,t^{\prime}}^{\prime}\right)\right\}_{t^{\prime}=0}^{t-1}.

For each i[N]i\in[N] and t[|𝒮||𝒜||𝒜|2|𝒮|H]t\in\left[|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H\right], define

Xi,t=(𝕀[(si,t,ai,t,si,t)=(s,a,s)]P(ss,a)𝕀[(si,t,ai,t)=(s,a)])𝖳𝗋𝗎𝗇𝖼i,t(s,a)X_{i,t}=\left(\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]-P(s^{\prime}\mid s,a)\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\right)\cdot\mathsf{Trunc}_{i,t}(s,a)

and

Yi,t=𝕀[(si,t,ai,t)=(s,a)](ri,t𝔼[R(s,a)])𝖳𝗋𝗎𝗇𝖼i,t(s,a).Y_{i,t}=\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot(r_{i,t}-\mathbb{E}[R(s,a)])\cdot\mathsf{Trunc}_{i,t}(s,a).

Clearly,

𝔼[𝕀[(si,t,ai,t,si,t)=(s,a,s)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)i,t]\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\mid\mathcal{F}_{i,t}\right]
=\displaystyle= P(ss,a)𝔼[𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)i,t]\displaystyle P(s^{\prime}\mid s,a)\cdot\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\mid\mathcal{F}_{i,t}\right]

and

𝔼[𝕀[(si,t,ai,t)=(s,a)]ri,t𝖳𝗋𝗎𝗇𝖼i,t(s,a)i,t]\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot r_{i,t}\cdot\mathsf{Trunc}_{i,t}(s,a)\mid\mathcal{F}_{i,t}\right]
=\displaystyle= 𝔼[𝕀[(si,t,ai,t)=(s,a)]𝔼[R(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)i,t],\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathbb{E}[R(s,a)]\cdot\mathsf{Trunc}_{i,t}(s,a)\mid\mathcal{F}_{i,t}\right],

which implies

𝔼[𝕀[(si,t,ai,t,si,t)=(s,a,s)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)]\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\right]
=\displaystyle= P(ss,a)𝔼[𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)],\displaystyle P(s^{\prime}\mid s,a)\cdot\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\right],

and

𝔼[𝕀[(si,t,ai,t)=(s,a)]ri,t𝖳𝗋𝗎𝗇𝖼i,t(s,a)]\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot r_{i,t}\cdot\mathsf{Trunc}_{i,t}(s,a)\right]
=\displaystyle= 𝔼[𝕀[(si,t,ai,t)=(s,a)]𝔼[R(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)],\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathbb{E}[R(s,a)]\cdot\mathsf{Trunc}_{i,t}(s,a)\right],

and thus

𝔼[Xi,t]=𝔼[Yi,t]=0.\mathbb{E}\left[X_{i,t}\right]=\mathbb{E}\left[Y_{i,t}\right]=0.

Moreover, for any i[N]i\in[N] and 0t<t<|𝒮||𝒜||𝒜|2|𝒮|H0\leq t^{\prime}<t<|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H, we have

𝔼[Xi,tXi,t]=𝔼[𝔼[Xi,tXi,ti,t]]=𝔼[Xi,t𝔼[Xi,ti,t]]=0\mathbb{E}\left[X_{i,t^{\prime}}\cdot X_{i,t}\right]=\mathbb{E}\left[\mathbb{E}\left[X_{i,t^{\prime}}\cdot X_{i,t}\mid\mathcal{F}_{i,t}\right]\right]=\mathbb{E}\left[X_{i,t^{\prime}}\cdot\mathbb{E}\left[X_{i,t}\mid\mathcal{F}_{i,t}\right]\right]=0

and

𝔼[Yi,tYi,t]=𝔼[𝔼[Yi,tYi,ti,t]]=𝔼[Yi,t𝔼[Yi,ti,t]]=0.\mathbb{E}\left[Y_{i,t^{\prime}}\cdot Y_{i,t}\right]=\mathbb{E}\left[\mathbb{E}\left[Y_{i,t^{\prime}}\cdot Y_{i,t}\mid\mathcal{F}_{i,t}\right]\right]=\mathbb{E}\left[Y_{i,t^{\prime}}\cdot\mathbb{E}\left[Y_{i,t}\mid\mathcal{F}_{i,t}\right]\right]=0.

Note that for each i[N]i\in[N],

𝔼[(t=0|𝒮||𝒜||𝒜|2|𝒮|H1Xi,t)2]=t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝔼[(Xi,t)2]\mathbb{E}\left[\left(\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}X_{i,t}\right)^{2}\right]=\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{E}\left[\left(X_{i,t}\right)^{2}\right]

and

𝔼[(t=0|𝒮||𝒜||𝒜|2|𝒮|H1Yi,t)2]=t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝔼[(Yi,t)2].\mathbb{E}\left[\left(\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}Y_{i,t}\right)^{2}\right]=\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{E}\left[\left(Y_{i,t}\right)^{2}\right].

Furthermore, for each i[N]i\in[N] and t[|𝒮||𝒜||𝒜|2|𝒮|H]t\in\left[|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H\right],

𝔼[Xi,t2]\displaystyle\mathbb{E}\left[X_{i,t}^{2}\right]\leq 𝔼[𝕀[(si,t,ai,t,si,t)=(s,a,s)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)]\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\right]
+\displaystyle+ 𝔼[(P(ss,a))2𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)]\displaystyle\mathbb{E}\left[\left(P(s^{\prime}\mid s,a)\right)^{2}\cdot\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\right]
\displaystyle\leq 2P(ss,a)𝔼[𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)]\displaystyle 2P(s^{\prime}\mid s,a)\cdot\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\right]

and

𝔼[Yi,t2]\displaystyle\mathbb{E}\left[Y_{i,t}^{2}\right]\leq 𝔼[𝕀[(si,t,ai,t)=(s,a)](ri,t𝔼[R(s,a)])2𝖳𝗋𝗎𝗇𝖼i,t(s,a)]\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot(r_{i,t}-\mathbb{E}[R(s,a)])^{2}\cdot\mathsf{Trunc}_{i,t}(s,a)\right]
\displaystyle\leq 𝔼[𝕀[(si,t,ai,t)=(s,a)]𝔼[(R(s,a))2]𝖳𝗋𝗎𝗇𝖼i,t(s,a)].\displaystyle\mathbb{E}\left[\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\right].

Since

t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)m¯𝗌𝗍(s,a),\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a)\leq\overline{m}^{\mathsf{st}}(s,a),

we have

𝔼[(t=0|𝒮||𝒜||𝒜|2|𝒮|H1Xi,t)2]2P(ss,a)m¯𝗌𝗍(s,a)\mathbb{E}\left[\left(\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}X_{i,t}\right)^{2}\right]\leq 2P(s^{\prime}\mid s,a)\cdot\overline{m}^{\mathsf{st}}(s,a)

and

𝔼[(t=0|𝒮||𝒜||𝒜|2|𝒮|H1Yi,t)2]𝔼[(R(s,a))2]m¯𝗌𝗍(s,a).\mathbb{E}\left[\left(\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}Y_{i,t}\right)^{2}\right]\leq\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\overline{m}^{\mathsf{st}}(s,a).

Now, for each i[N]i\in[N], define

𝒳i=t=0|𝒮||𝒜||𝒜|2|𝒮|H1Xi,t\mathcal{X}_{i}=\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}X_{i,t}

and

𝒴i=t=0|𝒮||𝒜||𝒜|2|𝒮|H1Yi,t.\mathcal{Y}_{i}=\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}Y_{i,t}.

We have 𝔼[𝒳i]=𝔼[𝒴i]=0\mathbb{E}[\mathcal{X}_{i}]=\mathbb{E}[\mathcal{Y}_{i}]=0,

𝔼[𝒳i2]2P(ss,a)m¯𝗌𝗍(s,a)\mathbb{E}[\mathcal{X}_{i}^{2}]\leq 2P(s^{\prime}\mid s,a)\cdot\overline{m}^{\mathsf{st}}(s,a)

and

𝔼[𝒴i2]𝔼[(R(s,a))2]m¯𝗌𝗍(s,a).\mathbb{E}[\mathcal{Y}_{i}^{2}]\leq\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\overline{m}^{\mathsf{st}}(s,a).

Also note that

i=0N1𝒳i=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t,si,t)=(s,a,s)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)P(ss,a)mD(s,a)\sum_{i=0}^{N-1}\mathcal{X}_{i}=\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]\cdot\mathsf{Trunc}_{i,t}(s,a)-P(s^{\prime}\mid s,a)\cdot m_{D}(s,a)

and

i=0N1𝒴i=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]ri,t𝖳𝗋𝗎𝗇𝖼i,t(s,a)𝔼[R(s,a)]mD(s,a).\sum_{i=0}^{N-1}\mathcal{Y}_{i}=\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot r_{i,t}\cdot\mathsf{Trunc}_{i,t}(s,a)-\mathbb{E}[R(s,a)]\cdot m_{D}(s,a).

By Bernstein’s inequality,

Pr[|i=0N1𝒳i|t]2exp(t22m¯𝗌𝗍(s,a)NP(ss,a)+t/3).\Pr\left[\left|\sum_{i=0}^{N-1}\mathcal{X}_{i}\right|\geq t\right]\leq 2\exp\left(\frac{-t^{2}}{2\cdot\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot P(s^{\prime}\mid s,a)+t/3}\right).

Thus, by setting t=2m¯𝗌𝗍(s,a)NP(ss,a)log(18|𝒮|2|𝒜|/δ)+log(18|𝒮|2|𝒜|/δ)t=2\sqrt{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}+\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta), we have

Pr[|i=0N1𝒳i|t]δ/(9|𝒮|2|𝒜|).\Pr\left[\left|\sum_{i=0}^{N-1}\mathcal{X}_{i}\right|\geq t\right]\leq\delta/(9|\mathcal{S}|^{2}|\mathcal{A}|).

By applying a union bound over all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, with probability at least 1δ/91-\delta/9, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

|P^(ss,a)P(ss,a)|2m¯𝗌𝗍(s,a)NP(ss,a)log(18|𝒮|2|𝒜|/δ)mD(s,a)+log(18|𝒮|2|𝒜|/δ)mD(s,a),\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq\frac{2\sqrt{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}}{m_{D}(s,a)}+\frac{\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{m_{D}(s,a)},

which we define to be event P\mathcal{E}_{P}. Note that conditioned on P\mathcal{E}_{P} and the events in Lemma 5.3 and Lemma 5.2, we have

𝒬ϵest𝗌𝗍(s,a)m¯𝗌𝗍(s,a)𝒬ϵest/4𝗌𝗍(s,a),\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\leq\overline{m}^{\mathsf{st}}(s,a)\leq\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}/4}(s,a),

which implies

mD(s,a)Nϵest/8𝒬ϵest/4𝗌𝗍(s,a)Nϵest/8m¯𝗌𝗍(s,a)Nϵest/8𝒬ϵest𝗌𝗍(s,a),m_{D}(s,a)\geq N\cdot\epsilon_{\mathrm{est}}/8\cdot\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}/4}(s,a)\geq N\cdot\epsilon_{\mathrm{est}}/8\cdot\overline{m}^{\mathsf{st}}(s,a)\geq N\cdot\epsilon_{\mathrm{est}}/8\cdot\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a),

and thus

|P^(ss,a)P(ss,a)|\displaystyle\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq 8P(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest+8log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest\displaystyle 8\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}+\frac{8\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}
\displaystyle\leq 8P(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest+64log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest\displaystyle 8\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}+\frac{64\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}
\displaystyle\leq max{16P(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest}\displaystyle\max\left\{16\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}},\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}\right\}

for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}.

When

P(ss,a)1024log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,P(s^{\prime}\mid s,a)\leq\frac{1024\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},

we have

16P(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,16\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\leq\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},

and therefore

|P^(ss,a)P(ss,a)|512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest512log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest.\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}\leq\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}.

When

P(ss,a)1024log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,P(s^{\prime}\mid s,a)\geq\frac{1024\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},

we have

|P^(ss,a)P(ss,a)|16P(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)NϵestP(ss,a)/2\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq 16\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\leq P(s^{\prime}\mid s,a)/2

and thus

P(ss,a)/2P^(ss,a)2P(ss,a),P(s^{\prime}\mid s,a)/2\leq\widehat{P}(s^{\prime}\mid s,a)\leq 2P(s^{\prime}\mid s,a),

which implies

|P^(ss,a)P(ss,a)|32P^(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest64P(ss,a)log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest.\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq 32\sqrt{\frac{\widehat{P}(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\leq 64\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}.

Hence, conditioned on P\mathcal{E}_{P} and the events in Lemma 5.3 and Lemma 5.2, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

|P^(ss,a)P(ss,a)|\displaystyle\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq max{512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,32P^(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest}\displaystyle\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},32\sqrt{\frac{\widehat{P}(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\}
\displaystyle\leq max{512log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest,64P(ss,a)log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest}.\displaystyle\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},64\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\}.

By Bernstein’s inequality,

Pr[|i=0N1𝒴i|t]2exp(t2𝔼[(R(s,a))2]m¯𝗌𝗍(s,a)N+t/3).\Pr\left[\left|\sum_{i=0}^{N-1}\mathcal{Y}_{i}\right|\geq t\right]\leq 2\exp\left(\frac{-t^{2}}{\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\overline{m}^{\mathsf{st}}(s,a)\cdot N+t/3}\right).

Thus, by setting t=2𝔼[(R(s,a))2]m¯𝗌𝗍(s,a)Nlog(18|𝒮||𝒜|/δ)+log(18|𝒮||𝒜|/δ)t=2\sqrt{\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\log(18|\mathcal{S}||\mathcal{A}|/\delta)}+\log(18|\mathcal{S}||\mathcal{A}|/\delta), we have

Pr[|i=0N1𝒴i|t]δ/(9|𝒮||𝒜|).\Pr\left[\left|\sum_{i=0}^{N-1}\mathcal{Y}_{i}\right|\geq t\right]\leq\delta/(9|\mathcal{S}||\mathcal{A}|).

By applying a union bound over all (s,a)\in\mathcal{S}\times\mathcal{A}, with probability at least 1-\delta/9, for all (s,a)\in\mathcal{S}\times\mathcal{A},

\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq\frac{2\sqrt{\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\log(18|\mathcal{S}||\mathcal{A}|/\delta)}}{m_{D}(s,a)}+\frac{\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{m_{D}(s,a)},

which we define to be event R\mathcal{E}_{R}. Note that conditioned on R\mathcal{E}_{R} and the events in Lemma 5.3 and Lemma 5.2, we have

\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq 8\sqrt{\frac{\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}+\frac{8\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}.

Finally, for each s𝒮s\in\mathcal{S}, for each i[N]i\in[N], define

𝒵i=𝕀[si,0=s]μ(s).\mathcal{Z}_{i}=\mathbb{I}[s_{i,0}=s]-\mu(s).

Note that

\sum_{i=0}^{N-1}\mathcal{Z}_{i}=\sum_{i=0}^{N-1}\mathbb{I}[s_{i,0}=s]-N\cdot\mu(s).

Therefore, by the Chernoff bound, with probability at least 1-\delta/(9|\mathcal{S}|), we have

\left|\widehat{\mu}(s)-\mu(s)\right|\leq\sqrt{\frac{\log(18|\mathcal{S}|/\delta)}{N}}.

Hence, with probability at least 1δ/91-\delta/9, for all s𝒮s\in\mathcal{S}, we have

|μ^(s)μ(s)|log(18|𝒮|/δ)N\left|\widehat{\mu}(s)-\mu(s)\right|\leq\sqrt{\frac{\log(18|\mathcal{S}|/\delta)}{N}}

which we define to be event μ\mathcal{E}_{\mu}.

We finish the proof by applying a union bound over P\mathcal{E}_{P}, R\mathcal{E}_{R}, μ\mathcal{E}_{\mu} and the events in Lemma 5.3 and Lemma 5.2. ∎

5.2 Perturbation Analysis

In this section, we establish a perturbation analysis of the value functions which is crucial for the analysis in the next section. We first recall a few basic facts.

Fact 5.1.

Let x be a real number with |x|\leq 1/2. Then:

  1. x-x^{2}\leq\log(1+x)\leq x;

  2. 1+x\leq e^{x}\leq 1+2|x|.

We now prove the following lemma using the above facts.

Lemma 5.5.

Let \overline{m}\geq 1 and \bar{n}\geq n\geq 1 be positive integers, and let \epsilon\in[0,1/(8\bar{n})] be a real number. Let p\in[1/\overline{m},1]^{n} be a vector with \sum_{i=1}^{n}p_{i}\leq 1. Let \delta\in\mathbb{R}^{n} be a vector such that for each 1\leq i\leq n, |\delta_{i}|\leq\epsilon\sqrt{p_{i}/\overline{m}} and \left|\sum_{i=1}^{n}\delta_{i}\right|\leq\epsilon\bar{n}/\overline{m}. For every m\in[0,\overline{m}] and every \Gamma\in\mathbb{R}^{n} such that \Gamma_{i}\in[-\sqrt{p_{i}\overline{m}},\sqrt{p_{i}\overline{m}}] for all 1\leq i\leq n, we have

(18n¯ϵ)i=1npipim+Γii=1n(pi+δi)pim+Γi(1+8n¯ϵ)i=1npipim+Γi.(1-8\bar{n}\epsilon)\prod_{i=1}^{n}p_{i}^{p_{i}m+\Gamma_{i}}\leq\prod_{i=1}^{n}(p_{i}+\delta_{i})^{p_{i}m+\Gamma_{i}}\leq(1+8\bar{n}\epsilon)\prod_{i=1}^{n}p_{i}^{p_{i}m+\Gamma_{i}}.
Proof.

Note that

i=1n(pi+δi)pim+Γi=i=1n(pi)pim+ΓiF\prod_{i=1}^{n}(p_{i}+\delta_{i})^{p_{i}m+\Gamma_{i}}=\prod_{i=1}^{n}(p_{i})^{p_{i}m+\Gamma_{i}}\cdot F

where

F=i=1n(1+δipi)pim+Γi.F=\prod_{i=1}^{n}\left(1+\frac{\delta_{i}}{p_{i}}\right)^{p_{i}m+\Gamma_{i}}.

Clearly,

logF=i=1n(pim+Γi)log(1+δipi).\log F=\sum_{i=1}^{n}(p_{i}m+\Gamma_{i})\log\left(1+\frac{\delta_{i}}{p_{i}}\right).

By the choice of δ\delta, we have

|δipi|ϵ12.\left|\frac{\delta_{i}}{p_{i}}\right|\leq\epsilon\leq\frac{1}{2}.

Using Fact 5.1, for all 1in1\leq i\leq n, we have

δipiδi2pi2log(1+δipi)δipi.\displaystyle\frac{\delta_{i}}{p_{i}}-\frac{\delta_{i}^{2}}{p_{i}^{2}}\leq\log\left(1+\frac{\delta_{i}}{p_{i}}\right)\leq\frac{\delta_{i}}{p_{i}}.

Hence,

|logF|\displaystyle|\log F| |i=1nmδi|+i=1n(|Γi||δi|pi+|Γi|δi2pi2+mδi2pi).\displaystyle\leq\left|\sum_{i=1}^{n}m\delta_{i}\right|+\sum_{i=1}^{n}\left(\frac{|\Gamma_{i}||\delta_{i}|}{p_{i}}+\frac{|\Gamma_{i}|\delta_{i}^{2}}{p_{i}^{2}}+\frac{m\delta_{i}^{2}}{p_{i}}\right).

Note that |i=1nmδi|ϵn¯\left|\sum_{i=1}^{n}m\delta_{i}\right|\leq\epsilon\bar{n}, |Γi||δi|ϵpi|\Gamma_{i}||\delta_{i}|\leq\epsilon p_{i}, |Γi|δi2ϵpiϵpi/m¯ϵ2pi2|\Gamma_{i}|\delta_{i}^{2}\leq\epsilon p_{i}\cdot\epsilon\sqrt{p_{i}/\overline{m}}\leq\epsilon^{2}p_{i}^{2}, and mδi2ϵ2pim\delta_{i}^{2}\leq\epsilon^{2}p_{i}. We have,

|logF|ϵn¯+ϵn+ϵ2n+ϵ2n4n¯ϵ.\displaystyle|\log F|\leq\epsilon\bar{n}+\epsilon n+\epsilon^{2}n+\epsilon^{2}n\leq 4\bar{n}\epsilon.

By the choice of ϵ\epsilon, we have 4n¯ϵ1/24\bar{n}\epsilon\leq 1/2, and therefore

18n¯ϵexp(logF)1+8n¯ϵ.1-8\bar{n}\epsilon\leq\exp(\log F)\leq 1+8\bar{n}\epsilon.
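Lemma 5.5 can be sanity-checked numerically by sampling p, \delta, m and \Gamma that satisfy the stated constraints and verifying that the ratio of the two products lies in [1-8\bar{n}\epsilon,1+8\bar{n}\epsilon]. The following Python sketch only tests randomly sampled instances and is of course not a proof; all parameter values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
n = n_bar = 4
m_bar = 50
eps = 1 / (16 * n_bar)                           # eps must lie in [0, 1/(8 * n_bar)]

checked = 0
while checked < 1000:
    p = rng.uniform(1 / m_bar, 1 / n, size=n)    # entries in [1/m_bar, 1] with sum(p) <= 1
    delta = rng.uniform(-1, 1, size=n) * eps * np.sqrt(p / m_bar)
    if abs(delta.sum()) > eps * n_bar / m_bar:   # also enforce the constraint on sum(delta)
        continue
    m = rng.uniform(0, m_bar)
    Gamma = rng.uniform(-1, 1, size=n) * np.sqrt(p * m_bar)
    exponents = p * m + Gamma
    ratio = np.prod((p + delta) ** exponents) / np.prod(p ** exponents)
    assert 1 - 8 * n_bar * eps <= ratio <= 1 + 8 * n_bar * eps
    checked += 1
print("all sampled instances satisfy the bound of Lemma 5.5")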

In the following lemma, we show that for any (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, with probability at least 1δ1-\delta, the number of times (s,a,s)(s,a,s^{\prime}) is visited can be upper bounded in terms of the δ/2\delta/2-quantile of the number of times (s,a)(s,a) is visited and P(ss,a)P(s^{\prime}\mid s,a).

Lemma 5.6.

Let M be a given MDP, and suppose a random trajectory

T=((s0,a0,r0),(s1,a1,r1),,(sH1,aH1,rH1),sH)T=((s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{H-1},a_{H-1},r_{H-1}),s_{H})

is obtained by executing a (possibly non-stationary) policy π\pi in MM. For any (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, with probability at least 1δ1-\delta, we have

h=0H1𝕀[(s,a,s)=(sh,ah,sh+1)]2𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+4δP(ss,a).\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]\leq\frac{2\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+4}{\delta}\cdot P(s^{\prime}\mid s,a).
Proof.

For each h[H]h\in[H], define

Ih={1h=0𝕀[t=0h1𝕀[(st,at)=(s,a)]𝒬δ/2(h=0H1𝕀[(sh,ah)=(s,a)])+1]h>0.I_{h}=\begin{cases}1&h=0\\ \mathbb{I}\left[\sum_{t=0}^{h-1}\mathbb{I}\left[(s_{t},a_{t})=(s,a)\right]\leq\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s_{h},a_{h})=(s,a)\right]\right)+1\right]&h>0\end{cases}.

Let 1\mathcal{E}_{1} be the event that for all h[H]h\in[H], Ih=1I_{h}=1. By definition of 𝒬δ/2\mathcal{Q}_{\delta/2}, we have Pr[1]1δ/2.\Pr\left[\mathcal{E}_{1}\right]\geq 1-\delta/2.

For each h[H]h\in[H], let h\mathcal{F}_{h} be the filtration induced by {(s0,a0,r0),,(sh,ah,rh)}\{(s_{0},a_{0},r_{0}),\ldots,(s_{h},a_{h},r_{h})\}. For each h[H]h\in[H], define

Xh=𝕀[(s,a,s)=(sh,ah,sh+1)]IhX_{h}=\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]\cdot I_{h}

and

Yh=𝕀[(s,a)=(sh,ah)]Ih.Y_{h}=\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\cdot I_{h}.

When h=0h=0, we have

\mathbb{E}[X_{h}]=\mathbb{E}\left[\mathbb{I}\left[(s,a)=(s_{0},a_{0})\right]\right]\cdot P(s^{\prime}\mid s,a)=\mathbb{E}[Y_{h}]\cdot P(s^{\prime}\mid s,a).

When h[H]{0}h\in[H]\setminus\{0\}, we have

𝔼[Xhh1]=\displaystyle\mathbb{E}[X_{h}\mid\mathcal{F}_{h-1}]= 𝔼[𝕀[(s,a,s)=(sh,ah,sh+1)]Ihh1]\displaystyle\mathbb{E}[\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]\cdot I_{h}\mid\mathcal{F}_{h-1}]
=\displaystyle= 𝔼[𝕀[(s,a)=(sh,ah)]Ihh1]P(ss,a),\displaystyle\mathbb{E}[\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\cdot I_{h}\mid\mathcal{F}_{h-1}]\cdot P(s^{\prime}\mid s,a),

which implies

𝔼[Xh]=𝔼[𝕀[(s,a)=(sh,ah)]Ih]P(ss,a)=𝔼[Yh]P(ss,a).\mathbb{E}[X_{h}]=\mathbb{E}[\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\cdot I_{h}]\cdot P(s^{\prime}\mid s,a)=\mathbb{E}[Y_{h}]\cdot P(s^{\prime}\mid s,a).

Note that

h=0H1Yh𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+2,\sum_{h=0}^{H-1}Y_{h}\leq\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+2,

which implies

𝔼[h=0H1Xh](𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+2)P(ss,a).\mathbb{E}\left[\sum_{h=0}^{H-1}X_{h}\right]\leq\left(\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+2\right)\cdot P(s^{\prime}\mid s,a).

By Markov’s inequality, with probability at least 1δ/21-\delta/2,

h=0H1Xh2𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+4δP(ss,a).\sum_{h=0}^{H-1}X_{h}\leq\frac{2\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+4}{\delta}\cdot P(s^{\prime}\mid s,a).

We denote this event by \mathcal{E}_{2}.

Conditioned on \mathcal{E}_{1}\cap\mathcal{E}_{2}, which happens with probability at least 1-\delta, we have

h=0H1𝕀[(s,a,s)=(sh,ah,sh+1)]=h=0H1Xh2𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+4δP(ss,a).\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]=\sum_{h=0}^{H-1}X_{h}\leq\frac{2\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+4}{\delta}\cdot P(s^{\prime}\mid s,a).
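
As a sanity check, the following Python snippet simulates a small two-state MDP of our own choosing and verifies empirically that the bound of Lemma 5.6 is violated with frequency at most \delta. The empirical (1-\delta/2)-quantile of the visit count is used as a stand-in for \mathcal{Q}_{\delta/2}, which we assume matches the quantile convention used above; the instance and parameters are illustrative only.

# Monte Carlo sanity check of Lemma 5.6 on a toy 2-state, 1-action MDP (illustration only).
import numpy as np
rng = np.random.default_rng(0)
H, delta, trials = 20, 0.1, 20000
P = np.array([[0.7, 0.3],    # transition row of state 0 (single action)
              [0.4, 0.6]])   # transition row of state 1
s_star, sp_star = 0, 1       # the transition (s, a, s') we track
def episode():
    s, n_sa, n_sas = 0, 0, 0
    for _ in range(H):
        s_next = rng.choice(2, p=P[s])
        if s == s_star:
            n_sa += 1
            if s_next == sp_star:
                n_sas += 1
        s = s_next
    return n_sa, n_sas
counts = np.array([episode() for _ in range(trials)])
q = np.quantile(counts[:, 0], 1 - delta / 2)                 # empirical proxy for Q_{delta/2}
bound = (2 * q + 4) / delta * P[s_star, sp_star]
print("empirical violation rate:", np.mean(counts[:, 1] > bound), "<= delta =", delta)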

In the following lemma, we show that for any (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, the number of times (s,a,s)(s,a,s^{\prime}) is visited should be close to the number of times (s,a)(s,a) is visited times P(ss,a)P(s^{\prime}\mid s,a).

Lemma 5.7.

For a given MDP M, suppose a random trajectory

T=((s0,a0,r0),(s1,a1,r1),,(sH1,aH1,rH1),sH)T=((s_{0},a_{0},r_{0}),(s_{1},a_{1},r_{1}),\ldots,(s_{H-1},a_{H-1},r_{H-1}),s_{H})

is obtained by executing a policy π\pi in MM. For any (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, with probability at least 1δ1-\delta, we have

\displaystyle\left|\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]-P(s^{\prime}\mid s,a)\cdot\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right|
\displaystyle\leq\sqrt{\frac{4\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+8}{\delta}\cdot P(s^{\prime}\mid s,a)}.
Proof.

For each h[H]h\in[H], define

Ih={1h=0𝕀[t=0h1𝕀[(s,a)=(st,at)]𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+1]h>0.I_{h}=\begin{cases}1&h=0\\ \mathbb{I}\left[\sum_{t=0}^{h-1}\mathbb{I}\left[(s,a)=(s_{t},a_{t})\right]\leq\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+1\right]&h>0\end{cases}.

Let 1\mathcal{E}_{1} be the event that for all h[H]h\in[H], Ih=1I_{h}=1. By definition of 𝒬δ/2\mathcal{Q}_{\delta/2}, we have Pr[1]1δ/2.\Pr\left[\mathcal{E}_{1}\right]\geq 1-\delta/2.

For each h[H]h\in[H], let h\mathcal{F}_{h} be the filtration induced by {(s0,a0,r0),,(sh,ah,rh)}\{(s_{0},a_{0},r_{0}),\ldots,(s_{h},a_{h},r_{h})\}. For each h[H]h\in[H], define

Xh=𝕀[(s,a,s)=(sh,ah,sh+1)]IhP(ss,a)𝕀[(s,a)=(sh,ah)]Ih.X_{h}=\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]\cdot I_{h}-P(s^{\prime}\mid s,a)\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\cdot I_{h}.

As we have shown in the proof of Lemma 5.6, for each h[H]h\in[H], 𝔼[Xh]=0\mathbb{E}[X_{h}]=0. Moreover, for any 0h<hH10\leq h^{\prime}<h\leq H-1, we have

𝔼[XhXh]=𝔼[𝔼[XhXh|h1]]=𝔼[Xh𝔼[Xh|h1]]=0.\mathbb{E}[X_{h}X_{h^{\prime}}]=\mathbb{E}[\mathbb{E}[X_{h}X_{h^{\prime}}|\mathcal{F}_{h-1}]]=\mathbb{E}[X_{h^{\prime}}\mathbb{E}[X_{h}|\mathcal{F}_{h-1}]]=0.

Therefore,

𝔼[(h=0H1Xh)2]=𝔼[h=0H1Xh2].\mathbb{E}\left[\left(\sum_{h=0}^{H-1}X_{h}\right)^{2}\right]=\mathbb{E}\left[\sum_{h=0}^{H-1}X_{h}^{2}\right].

Note that for each h\in[H],

Xh2\displaystyle X_{h}^{2} =(𝕀[(s,a,s)=(sh,ah,sh+1)]IhP(ss,a)𝕀[(s,a)=(sh,ah)]Ih)2\displaystyle=(\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]\cdot I_{h}-P(s^{\prime}\mid s,a)\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\cdot I_{h})^{2}
Ih(𝕀[(s,a,s)=(sh,ah,sh+1)]+(P(ss,a))2𝕀[(s,a)=(sh,ah)]).\displaystyle\leq I_{h}\cdot\left(\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]+\left(P(s^{\prime}\mid s,a)\right)^{2}\cdot\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right).

As we have shown in the proof of Lemma 5.6, for each h[H]h\in[H],

𝔼[Ih𝕀[(s,a,s)=(sh,ah,sh+1)]]=𝔼[Ih𝕀[(s,a)=(sh,ah)]]P(ss,a),\mathbb{E}[I_{h}\cdot\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]]=\mathbb{E}[I_{h}\cdot\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]]\cdot P(s^{\prime}\mid s,a),

which implies

𝔼[(h=0H1Xh)2]\displaystyle\mathbb{E}\left[\left(\sum_{h=0}^{H-1}X_{h}\right)^{2}\right] =𝔼[h=0H1Xh2]\displaystyle=\mathbb{E}\left[\sum_{h=0}^{H-1}X_{h}^{2}\right]
h=0H1𝔼[Ih(𝕀[(s,a,s)=(sh,ah,sh+1)]+(P(ss,a))2𝕀[(s,a)=(sh,ah)])]\displaystyle\leq\sum_{h=0}^{H-1}\mathbb{E}\left[I_{h}\cdot\left(\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]+\left(P(s^{\prime}\mid s,a)\right)^{2}\cdot\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)\right]
2P(ss,a)h=0H1𝔼[Ih𝕀[(s,a)=(sh,ah)]]\displaystyle\leq 2P(s^{\prime}\mid s,a)\cdot\sum_{h=0}^{H-1}\mathbb{E}\left[I_{h}\cdot\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right]
2P(ss,a)(𝒬δ/2(h=0H1𝕀[(s,a)=(sh,ah)])+2).\displaystyle\leq 2P(s^{\prime}\mid s,a)\cdot\left(\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+2\right).

By Chebyshev’s inequality, we have with probability at least 1δ/21-\delta/2,

\left|\sum_{h=0}^{H-1}X_{h}\right|\leq\sqrt{\frac{4\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+8}{\delta}\cdot P(s^{\prime}\mid s,a)}.

We denote this event by \mathcal{E}_{2}.

Conditioned on \mathcal{E}_{1}\cap\mathcal{E}_{2}, which happens with probability at least 1-\delta, we have

\displaystyle\left|\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a,s^{\prime})=(s_{h},a_{h},s_{h+1})\right]-P(s^{\prime}\mid s,a)\cdot\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right|
\displaystyle\leq\sqrt{\frac{4\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right)+8}{\delta}\cdot P(s^{\prime}\mid s,a)}.
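
For the reader's convenience, the Chebyshev step above can be spelled out explicitly: writing q=\mathcal{Q}_{\delta/2}\left(\sum_{h=0}^{H-1}\mathbb{I}\left[(s,a)=(s_{h},a_{h})\right]\right) and t=\sqrt{(4q+8)P(s^{\prime}\mid s,a)/\delta}, we have

\Pr\left[\left|\sum_{h=0}^{H-1}X_{h}\right|\geq t\right]\leq\frac{\mathbb{E}\left[\left(\sum_{h=0}^{H-1}X_{h}\right)^{2}\right]}{t^{2}}\leq\frac{2P(s^{\prime}\mid s,a)\cdot(q+2)\cdot\delta}{(4q+8)\cdot P(s^{\prime}\mid s,a)}=\frac{\delta}{2}.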

Using Lemma 5.5, Lemma 5.6 and Lemma 5.7, we now present the main result of this section, which shows that for two MDPs M and \widehat{M} that are sufficiently close in terms of rewards and transition probabilities, the value of any policy \pi in \widehat{M} is lower bounded by its value in M up to an error of \epsilon.

Lemma 5.8.

Let M=(𝒮,𝒜,P,R,H,μ)M=(\mathcal{S},\mathcal{A},P,R,H,\mu) be an MDP and π\pi be a policy. Let 0<ϵ1/20<\epsilon\leq 1/2 be a parameter. For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, define m¯(s,a):=𝒬ϵ/(12|𝒮||𝒜|)π(s,a)\overline{m}(s,a):=\mathcal{Q}^{\pi}_{\epsilon/(12|\mathcal{S}||\mathcal{A}|)}(s,a). Let M^=(𝒮,𝒜,P^,R^,H,μ^)\widehat{M}=(\mathcal{S},\mathcal{A},\widehat{P},\widehat{R},H,\widehat{\mu}) be another MDP. If for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} with m¯(s,a)1\overline{m}(s,a)\geq 1, we have

|P^(s|s,a)P(s|s,a)|ϵ96|𝒮|2|𝒜|max(ϵP(s|s,a)72m¯(s,a)|𝒮||𝒜|,ϵ72m¯(s,a)|𝒮||𝒜|),|\widehat{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)|\leq\frac{\epsilon}{96|\mathcal{S}|^{2}|\mathcal{A}|}\cdot\max\left(\sqrt{\frac{\epsilon{P}(s^{\prime}|s,a)}{72\cdot\overline{m}(s,a)\cdot|\mathcal{S}||\mathcal{A}|}},\frac{\epsilon}{72\cdot\overline{m}(s,a)\cdot|\mathcal{S}||\mathcal{A}|}\right),
|𝔼[R^|s,a]𝔼[R|s,a]|ϵ24|𝒮||𝒜|max{𝔼[(R(s,a))2]m¯(s,a),1m¯(s,a)}\left|\mathbb{E}[\widehat{R}|s,a]-\mathbb{E}[{R}|s,a]\right|\leq\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{\overline{m}(s,a)}},\frac{1}{\overline{m}(s,a)}\right\}

and

|μ(s)μ^(s)|ϵ/(6|𝒮|),\left|\mu(s)-\widehat{\mu}(s)\right|\leq\epsilon/(6|\mathcal{S}|),

then

VM^,HπVM,Hπϵ.V^{\pi}_{\widehat{M},H}\geq V^{\pi}_{{M},H}-\epsilon.
Proof.

Define \mathcal{T}=(\mathcal{S}\times\mathcal{A})^{H}\times\mathcal{S} to be the set of all possible trajectories, where each T\in\mathcal{T} has the form

((s0,a0),(s1,a1),,(sH1,aH1),sH).((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}).

For a trajectory T𝒯T\in\mathcal{T}, for each (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, we write

mT(s,a)=h=0H1𝕀[(sh,ah)=(s,a)]m_{T}(s,a)=\sum_{h=0}^{H-1}\mathbb{I}[(s_{h},a_{h})=(s,a)]

as the number of times (s,a) is visited and

mT(s,a,s)=h=0H1𝕀[(sh,ah,sh+1)=(s,a,s)]m_{T}(s,a,s^{\prime})=\sum_{h=0}^{H-1}\mathbb{I}[(s_{h},a_{h},s_{h+1})=(s,a,s^{\prime})]

as the number of times (s,a,s)(s,a,s^{\prime}) is visited. We say a trajectory

T=((s0,a0),(s1,a1),,(sH1,aH1),sH)𝒯T=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H})\in\mathcal{T}

is compatible with a (possibly non-stationary) policy π\pi if for all h[H]h\in[H],

ah=πh(sh).a_{h}=\pi_{h}(s_{h}).

For a (possibly non-stationary) policy π\pi, we use 𝒯π𝒯\mathcal{T}^{\pi}\subseteq\mathcal{T} to denote the set of all trajectories that are compatible with π\pi.

For an MDP M=(𝒮,𝒜,P,R,H,μ)M=(\mathcal{S},\mathcal{A},P,R,H,\mu) and a (possibly non-stationary) policy π\pi, for a trajectory TT that is compatible with π\pi, we write

p(T,M,π)=μ(s0)h=0H1P(sh+1sh,ah)=μ(s0)(s,a,s)𝒮×𝒜×𝒮P(ss,a)mT(s,a,s)p(T,M,\pi)=\mu(s_{0})\cdot\prod_{h=0}^{H-1}P(s_{h+1}\mid s_{h},a_{h})=\mu(s_{0})\cdot\prod_{(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

to be the probability of TT when executing π\pi in MM. Here we assume 00=10^{0}=1.

Using these definitions, we have

VM,Hπ=T𝒯πp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]).V^{\pi}_{M,H}=\sum_{T\in\mathcal{T}^{\pi}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right).

Note that for any trajectory T=((s0,a0),(s1,a1),,(sH1,aH1),sH)𝒯πT=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H})\in\mathcal{T}^{\pi}, if p(T,M,π)>0p(T,M,\pi)>0, by Assumption 2.1,

(s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]1.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\leq 1.

We define \mathcal{T}^{\pi}_{1}\subseteq\mathcal{T}^{\pi} to be the set of trajectories such that for each T\in\mathcal{T}^{\pi}_{1} and each (s,a)\in\mathcal{S}\times\mathcal{A},

mT(s,a)m¯(s,a).m_{T}(s,a)\leq\overline{m}(s,a).

By the definition of \overline{m}(s,a) and a union bound over all (s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯π𝒯1πp(T,M,π)ϵ/6.\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}}p(T,M,\pi)\leq\epsilon/6.

We also define \mathcal{T}^{\pi}_{2}\subseteq\mathcal{T}^{\pi} to be the set of trajectories such that for each T\in\mathcal{T}^{\pi}_{2} and each (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

|mT(s,a,s)mT(s,a)P(s|s,a)|6P(s|s,a)(4m¯(s,a)+8)|𝒮||𝒜|ϵ.\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right|\leq\sqrt{\frac{6P(s^{\prime}|s,a)(4\overline{m}(s,a)+8)|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

By Lemma 5.7 and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯π𝒯2πp(T,M,π)ϵ/6.\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{2}}p(T,M,\pi)\leq\epsilon/6.

Finally, we define \mathcal{T}^{\pi}_{3}\subseteq\mathcal{T}^{\pi} to be the set of trajectories such that for each T\in\mathcal{T}^{\pi}_{3} and each (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

mT(s,a,s)6|𝒮||𝒜|(2m¯(s,a)+4)ϵP(s|s,a).m_{T}(s,a,s^{\prime})\leq\frac{6|\mathcal{S}||\mathcal{A}|(2\overline{m}(s,a)+4)}{\epsilon}\cdot P(s^{\prime}|s,a).

By Lemma 5.6 and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯π𝒯3πp(T,M,π)ϵ/6.\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{3}}p(T,M,\pi)\leq\epsilon/6.

Thus, by defining 𝒯prunedπ=𝒯1π𝒯2π𝒯3π\mathcal{T}^{\pi}_{\mathrm{pruned}}=\mathcal{T}^{\pi}_{1}\cap\mathcal{T}^{\pi}_{2}\cap\mathcal{T}^{\pi}_{3}, we have

T𝒯prunedπp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)])VM,Hπϵ/2.\sum_{T\in\mathcal{T}^{\pi}_{\mathrm{pruned}}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right)\geq V^{\pi}_{M,H}-\epsilon/2.

Note that for each T𝒯prunedπT\in\mathcal{T}^{\pi}_{\mathrm{pruned}} with

T=((s0,a0),(s1,a1),,(sH1,aH1),sH),T=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}),

for each state-action pair (s,a)\in\mathcal{S}\times\mathcal{A}, we must have m_{T}(s,a)\leq\overline{m}(s,a); this is because T\in\mathcal{T}^{\pi}_{1}. Moreover, for any (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, if m_{T}(s,a,s^{\prime})\geq 1, then

P(ss,a)ϵ36|𝒮||𝒜|m¯(s,a).P(s^{\prime}\mid s,a)\geq\frac{\epsilon}{36|\mathcal{S}||\mathcal{A}|\overline{m}(s,a)}.

This is because T𝒯3πT\in\mathcal{T}^{\pi}_{3}, and if

P(ss,a)<ϵ36|𝒮||𝒜|m¯(s,a),P(s^{\prime}\mid s,a)<\frac{\epsilon}{36|\mathcal{S}||\mathcal{A}|\overline{m}(s,a)},

then

mT(s,a,s)6|𝒮||𝒜|(2m¯(s,a)+4)ϵP(s|s,a)36|𝒮||𝒜|m¯(s,a)ϵP(s|s,a)<1.m_{T}(s,a,s^{\prime})\leq\frac{6|\mathcal{S}||\mathcal{A}|(2\overline{m}(s,a)+4)}{\epsilon}\cdot P(s^{\prime}|s,a)\leq\frac{36|\mathcal{S}||\mathcal{A}|\overline{m}(s,a)}{\epsilon}\cdot P(s^{\prime}|s,a)<1.

For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, define

𝒮s,a={s𝒮P(s|s,a)ϵ36|𝒮||𝒜|m¯(s,a)}.\mathcal{S}_{s,a}=\left\{s^{\prime}\in\mathcal{S}\ \mid P(s^{\prime}|s,a)\geq\frac{\epsilon}{36|\mathcal{S}||\mathcal{A}|\overline{m}(s,a)}\right\}.

Therefore, for each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A},

s𝒮P(ss,a)mT(s,a,s)=s𝒮s,aP(ss,a)mT(s,a,s)\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}=\prod_{s^{\prime}\in\mathcal{S}_{s,a}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

and

s𝒮P^(ss,a)mT(s,a,s)=s𝒮s,aP^(ss,a)mT(s,a,s)\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}=\prod_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

with

|mT(s,a,s)mT(s,a)P(s|s,a)|\displaystyle\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right| 6P(s|s,a)(4m¯(s,a)+8)|𝒮||𝒜|ϵ\displaystyle\leq\sqrt{\frac{6P(s^{\prime}|s,a)(4\overline{m}(s,a)+8)|\mathcal{S}||\mathcal{A}|}{\epsilon}}
72P(s|s,a)m¯(s,a)|𝒮||𝒜|ϵ.\displaystyle\leq\sqrt{\frac{72P(s^{\prime}|s,a)\overline{m}(s,a)|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

Note that \sum_{s^{\prime}\in\mathcal{S}}\left(\widehat{P}(s^{\prime}|s,a)-{P}(s^{\prime}|s,a)\right)=0, which implies

|s𝒮s,aP^(s|s,a)P(s|s,a)|=|s𝒮s,aP(s|s,a)P^(s|s,a)|ϵ96|𝒮||𝒜|ϵ72m¯(s,a)|𝒮||𝒜|.\left|\sum_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}|s,a)-{P}(s^{\prime}|s,a)\right|=\left|\sum_{s^{\prime}\not\in\mathcal{S}_{s,a}}{P}(s^{\prime}|s,a)-\widehat{P}(s^{\prime}|s,a)\right|\leq\frac{\epsilon}{96|\mathcal{S}||\mathcal{A}|}\cdot\frac{\epsilon}{72\cdot\overline{m}{(s,a)}\cdot|\mathcal{S}||\mathcal{A}|}.

By applying Lemma 5.5 and setting \bar{n} to be |\mathcal{S}|, n to be |\mathcal{S}_{s,a}|, \epsilon to be \epsilon/(96|\mathcal{S}|^{2}|\mathcal{A}|), and \overline{m} to be 72\cdot\overline{m}(s,a)\cdot|\mathcal{S}||\mathcal{A}|/\epsilon, we have

s𝒮s,aP^(ss,a)mT(s,a,s)(1ϵ12|𝒮||𝒜|)s𝒮s,aP(ss,a)mT(s,a,s).\prod_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\geq\left(1-\frac{\epsilon}{12|\mathcal{S}||\mathcal{A}|}\right)\prod_{s^{\prime}\in\mathcal{S}_{s,a}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}.

Therefore,

(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)(1ϵ12|𝒮||𝒜|)|𝒮||𝒜|(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s),\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\geq\left(1-\frac{\epsilon}{12|\mathcal{S}||\mathcal{A}|}\right)^{|\mathcal{S}||\mathcal{A}|}\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})},

which implies

(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)(1ϵ/6)(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s).\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\geq(1-\epsilon/6)\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}.

For the summation of rewards, we have

(s,a)𝒮×𝒜mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|
=\displaystyle= (s,a)𝒮×𝒜mT(s,a)=1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)=1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|
+\displaystyle+ (s,a)𝒮×𝒜mT(s,a)>1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|.\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)>1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|.

For those (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with mT(s,a)>1m_{T}(s,a)>1, we have m¯(s,a)>1\overline{m}(s,a)>1. By Lemma 2.1, we have

|𝔼[R^(s,a)]𝔼[R(s,a)]|\displaystyle|\mathbb{E}[\widehat{R}(s,a)]-\mathbb{E}[R(s,a)]|
\displaystyle\leq ϵ24|𝒮||𝒜|max{𝔼[(R(s,a))2]m¯(s,a),1m¯(s,a)}max{ϵ12H,ϵ24|𝒮||𝒜|m¯(s,a)}.\displaystyle\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{\overline{m}(s,a)}},\frac{1}{\overline{m}(s,a)}\right\}\leq\max\left\{\frac{\epsilon}{12H},\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|\overline{m}(s,a)}\right\}.

Since (s,a)𝒮×𝒜mT(s,a)H\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\leq H and mT(s,a)m¯(s,a)m_{T}(s,a)\leq\overline{m}(s,a), we have

(s,a)𝒮×𝒜mT(s,a)>1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ8.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)>1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{8}.

For those (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with mT(s,a)=1m_{T}(s,a)=1, we have

(s,a)𝒮×𝒜mT(s,a)=1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ24.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)=1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{24}.

Thus,

(s,a)𝒮×𝒜mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ6.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{6}.

For each T𝒯prunedπT\in\mathcal{T}^{\pi}_{\mathrm{pruned}} with

T=((s0,a0),(s1,a1),,(sH1,aH1),sH),T=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}),

we have

p(T,M^,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])\displaystyle p(T,\widehat{M},\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)
=\displaystyle= μ^(s0)(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])\displaystyle\widehat{\mu}(s_{0})\cdot\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)
\displaystyle\geq (μ(s0)ϵ/(6|𝒮|))(1ϵ/6)(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]ϵ/6).\displaystyle(\mu(s_{0})-\epsilon/(6|\mathcal{S}|))\cdot(1-\epsilon/6)\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]-\epsilon/6\right).

Since

T𝒯prunedπp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)])VM,Hπϵ/2,\sum_{T\in\mathcal{T}^{\pi}_{\mathrm{pruned}}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right)\geq V^{\pi}_{M,H}-\epsilon/2,

we have

VM^,HπT𝒯prunedπp(T,M^,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])VM,Hπϵ.V^{\pi}_{\widehat{M},H}\geq\sum_{T\in\mathcal{T}^{\pi}_{\mathrm{pruned}}}p(T,\widehat{M},\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)\geq V^{\pi}_{M,H}-\epsilon.
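
The trajectory-sum representation of V^{\pi}_{M,H} used in the proof above can be checked numerically. The Python snippet below, on a toy instance of our own choosing, compares the sum over all trajectories weighted by p(T,M,\pi) against standard backward-induction policy evaluation; the two agree up to floating-point error. The mean rewards are kept at most 1/H per step so that the total mean reward along any trajectory is at most 1, in the spirit of Assumption 2.1.

# Toy numeric check of the trajectory decomposition of V^pi_{M,H} (illustration only).
import itertools
import numpy as np
S, A, H = 2, 2, 4
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] is a distribution over s'
R = rng.uniform(0, 1.0 / H, size=(S, A))        # mean rewards, at most 1/H each
mu = np.array([0.6, 0.4])                       # initial distribution
pi = np.array([0, 1])                           # stationary deterministic policy: pi[s] = a
# (i) Backward-induction policy evaluation.
V = np.zeros(S)
for _ in range(H):
    V = np.array([R[s, pi[s]] + P[s, pi[s]] @ V for s in range(S)])
v_dp = mu @ V
# (ii) Sum over all trajectories compatible with pi, weighted by p(T, M, pi).
v_traj = 0.0
for states in itertools.product(range(S), repeat=H + 1):
    prob = mu[states[0]]
    total_reward = 0.0
    for h in range(H):
        s, s_next = states[h], states[h + 1]
        prob *= P[s, pi[s], s_next]
        total_reward += R[s, pi[s]]
    v_traj += prob * total_reward
print(v_dp, v_traj)   # the two quantities agree up to floating-point error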

Now we show that for two MDPs M and \widehat{M} with the same transition probabilities and sufficiently close rewards, for any policy \pi, the value of \pi in \widehat{M} is upper bounded by its value in M up to an error of \epsilon.

Lemma 5.9.

Let M=(𝒮,𝒜,P,R,H,μ)M=(\mathcal{S},\mathcal{A},P,R,H,\mu) be an MDP and π\pi be a policy. Let 0<ϵ1/20<\epsilon\leq 1/2 be a parameter. For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, define m¯(s,a)=𝒬ϵ/(12|𝒮||𝒜|)π(s,a)\overline{m}(s,a)=\mathcal{Q}^{\pi}_{\epsilon/(12|\mathcal{S}||\mathcal{A}|)}(s,a). Let M^=(𝒮,𝒜,P,R^,H,μ)\widehat{M}=(\mathcal{S},\mathcal{A},P,\widehat{R},H,\mu) be another MDP. If for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with m¯(s,a)1\overline{m}(s,a)\geq 1, we have

|𝔼[R^|s,a]𝔼[R|s,a]|ϵ24|𝒮||𝒜|max{𝔼[(R(s,a))2]m¯(s,a),1m¯(s,a)}.\left|\mathbb{E}[\widehat{R}|s,a]-\mathbb{E}[{R}|s,a]\right|\leq\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{\overline{m}(s,a)}},\frac{1}{\overline{m}(s,a)}\right\}.

then

VM^,HπVM,Hπ+ϵ.V^{\pi}_{\widehat{M},H}\leq V^{\pi}_{{M},H}+\epsilon.
Proof.

We adopt the same notations as in the proof of Lemma 5.8. Recall that

VM,Hπ=T𝒯πp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]).V^{\pi}_{M,H}=\sum_{T\in\mathcal{T}^{\pi}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right).

and

T𝒯π𝒯1πp(T,M,π)ϵ/6.\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}}p(T,M,\pi)\leq\epsilon/6.

As in the proof of Lemma 5.8, for the summation of rewards, we have

(s,a)𝒮×𝒜mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|
=\displaystyle= (s,a)𝒮×𝒜mT(s,a)=1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)=1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|
+\displaystyle+ (s,a)𝒮×𝒜mT(s,a)>1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|.\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)>1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|.

For each T𝒯1πT\in\mathcal{T}^{\pi}_{1}, for each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we must have mT(s,a)m¯(s,a)m_{T}(s,a)\leq\overline{m}(s,a). Therefore, for those (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with mT(s,a)>1m_{T}(s,a)>1, by Lemma 2.1, we have

|𝔼[R^(s,a)]𝔼[R(s,a)]|\displaystyle|\mathbb{E}[\widehat{R}(s,a)]-\mathbb{E}[R(s,a)]|
\displaystyle\leq ϵ24|𝒮||𝒜|max{𝔼[(R(s,a))2]m¯(s,a),1m¯(s,a)}max{ϵ12H,ϵ24|𝒮||𝒜|m¯(s,a)}.\displaystyle\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{\overline{m}(s,a)}},\frac{1}{\overline{m}(s,a)}\right\}\leq\max\left\{\frac{\epsilon}{12H},\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|\overline{m}(s,a)}\right\}.

Since (s,a)𝒮×𝒜mT(s,a)H\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\leq H and mT(s,a)m¯(s,a)m_{T}(s,a)\leq\overline{m}(s,a), we have

(s,a)𝒮×𝒜mT(s,a)>1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ8.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)>1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{8}.

For those (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with mT(s,a)=1m_{T}(s,a)=1, we have

(s,a)𝒮×𝒜mT(s,a)=1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ24.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)=1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{24}.

Thus,

(s,a)𝒮×𝒜mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ6.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{6}.

Hence,

\displaystyle V^{\pi}_{\widehat{M},H}\leq\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}}p(T,M,\pi)+\sum_{T\in\mathcal{T}^{\pi}_{1}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)
ϵ/6+T𝒯1πp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]+ϵ/6)\displaystyle\leq\epsilon/6+\sum_{T\in\mathcal{T}^{\pi}_{1}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]+\epsilon/6\right)
\displaystyle\leq V^{\pi}_{M,H}+\epsilon.

5.3 Pessimistic Planning

We now present our final algorithm in the RL setting. The formal description is provided in Algorithm 3. In our algorithm, we first invoke Algorithm 1 to collect a dataset D and then invoke Algorithm 2 to estimate \mathcal{Q}^{\mathsf{st}}_{\epsilon}(s,a) for some properly chosen \epsilon. We then use the estimators in Lemma 5.4 to define \widehat{P}, \widehat{R} and \widehat{\mu}. Note that Lemma 5.4 provides not only estimators but also computable confidence intervals for \widehat{P} and \widehat{\mu}, which we also utilize in our algorithm.

At this point, a natural idea is to find the optimal policy with respect to the MDP M^\widehat{M} defined by P^\widehat{P}, R^\widehat{R} and μ^\widehat{\mu}. However, our Lemma 5.8 only provides a lower bound guarantee for VM^,HπV^{\pi}_{\widehat{M},H} without any upper bound guarantee. We resolve this issue by pessimistic planning. More specifically, for any policy π\pi, we define its pessimistic value to be

V¯π=minMVM,Hπ\underline{V}^{\pi}=\min_{M\in\mathcal{M}}V^{\pi}_{M,H}

where \mathcal{M} includes all MDPs whose transition probabilities are within the confidence intervals provided in Lemma 5.4. We simply return the policy \pi that maximizes \underline{V}^{\pi}. Since the true MDP lies in \mathcal{M}, \underline{V}^{\pi} is never an overestimate. On the other hand, Lemma 5.8 guarantees that \underline{V}^{\pi} is also lower bounded by V^{\pi}_{M,H} up to an error of \epsilon. Therefore, \underline{V}^{\pi} provides an accurate estimate of the true value of \pi. However, note that Lemma 5.4 does not provide a computable confidence interval for the rewards. Fortunately, as we show in Lemma 5.9, perturbing the rewards does not significantly increase the value of the policy, and thus the estimate is still accurate.
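
To illustrate the idea, the following Python sketch approximates the pessimistic value of a fixed stationary deterministic policy by evaluating it on transition models randomly perturbed within given per-entry radii. The function names, the radii rad, and the sampling scheme are our own simplifications; they are not the quantities prescribed by Algorithm 3, which minimizes exactly over the confidence set \mathcal{M} and optimizes over possibly non-stationary policies.

# Minimal sketch (our own simplification) of pessimistic policy evaluation.
import numpy as np
def policy_value(P, R, mu, pi, H):
    # Finite-horizon policy evaluation by backward induction.
    S = len(mu)
    V = np.zeros(S)
    for _ in range(H):
        V = np.array([R[s, pi[s]] + P[s, pi[s]] @ V for s in range(S)])
    return float(mu @ V)
def approx_pessimistic_value(P_hat, R_hat, mu_hat, pi, H, rad, n_samples=200, seed=0):
    # Crude stand-in for min over the confidence set: sample models inside the
    # per-entry intervals [P_hat - rad, P_hat + rad] and keep the smallest value.
    rng = np.random.default_rng(seed)
    best = policy_value(P_hat, R_hat, mu_hat, pi, H)
    for _ in range(n_samples):
        P_tilde = np.clip(P_hat + rng.uniform(-rad, rad), 1e-12, None)
        P_tilde /= P_tilde.sum(axis=-1, keepdims=True)   # renormalize to a transition kernel
        best = min(best, policy_value(P_tilde, R_hat, mu_hat, pi, H))
    return best
# Example usage with an arbitrary 2-state, 2-action empirical model.
S, A, H = 2, 2, 5
rng = np.random.default_rng(1)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
R_hat = rng.uniform(0, 1.0 / H, size=(S, A))
mu_hat = np.array([0.5, 0.5])
rad = np.full((S, A, S), 0.05)
print(approx_pessimistic_value(P_hat, R_hat, mu_hat, pi=np.array([0, 1]), H=H, rad=rad))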

Algorithm 3 Pessimistic Planning
1:Input: desired accuracy ϵ\epsilon, failure probability δ\delta
2:Output: An ϵ\epsilon-optimal policy π\pi
3:Invoke Algorithm 1 with
N=266(|𝒮|+1)24(|𝒮|+1)log(18|𝒮|2|𝒜|/δ)|𝒮|7|𝒜|5/ε5N=2^{66}\cdot(|\mathcal{S}|+1)^{24(|\mathcal{S}|+1)}\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)\cdot|\mathcal{S}|^{7}|\mathcal{A}|^{5}/\varepsilon^{5}
and receive
D=(((si,t,ai,t,ri,t,si,t))t=0|𝒮||𝒜||𝒜|2|𝒮|H1)i=0N1D=\left(\left(\left(s_{i,t},a_{i,t},r_{i,t},s^{\prime}_{i,t}\right)\right)_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\right)_{i=0}^{N-1}
4:Invoke Algorithm 2 with
ϵest=ϵ32768|𝒮||𝒜|(|𝒮|+1)12(|𝒮|+1)\epsilon_{\mathrm{est}}=\frac{\epsilon}{32768\cdot|\mathcal{S}||\mathcal{A}|(|\mathcal{S}|+1)^{12(|\mathcal{S}|+1)}}
and δ=δ\delta=\delta, and receive estimates m¯𝗌𝗍:𝒮×𝒜\overline{m}^{\mathsf{st}}:\mathcal{S}\times\mathcal{A}\to\mathbb{N}
5:For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, for each i[N]i\in[N] and t[|𝒮||𝒜||𝒜|2|𝒮|H]t\in\left[|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H\right], define
𝖳𝗋𝗎𝗇𝖼i,t(s,a)=𝕀[t=0t1𝕀[(si,t,ai,t)=(s,a)]<m¯𝗌𝗍(s,a)]\mathsf{Trunc}_{i,t}(s,a)=\mathbb{I}\left[\sum_{t^{\prime}=0}^{t-1}\mathbb{I}\left[(s_{i,t^{\prime}},a_{i,t^{\prime}})=(s,a)\right]<\overline{m}^{\mathsf{st}}(s,a)\right]
6:For each (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, define
mD(s,a)=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]𝖳𝗋𝗎𝗇𝖼i,t(s,a),m_{D}(s,a)=\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot\mathsf{Trunc}_{i,t}(s,a),
P^(ss,a)=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t,si,t)=(s,a,s)]𝖳𝗋𝗎𝗇𝖼i,t(s,a)max{1,mD(s,a)},\widehat{P}(s^{\prime}\mid s,a)=\frac{\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t},s_{i,t}^{\prime})=(s,a,s^{\prime})\right]\cdot\mathsf{Trunc}_{i,t}(s,a)}{\max\{1,m_{D}(s,a)\}},
R^(s,a)=i=0N1t=0|𝒮||𝒜||𝒜|2|𝒮|H1𝕀[(si,t,ai,t)=(s,a)]ri,t𝖳𝗋𝗎𝗇𝖼i,t(s,a)max{1,mD(s,a)}\widehat{R}(s,a)=\frac{\sum_{i=0}^{N-1}\sum_{t=0}^{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot H-1}\mathbb{I}\left[(s_{i,t},a_{i,t})=(s,a)\right]\cdot r_{i,t}\cdot\mathsf{Trunc}_{i,t}(s,a)}{\max\{1,m_{D}(s,a)\}}
and
\widehat{\mu}(s)=\frac{\sum_{i=0}^{N-1}\mathbb{I}[s_{i,0}=s]}{N}
7:Define \mathcal{M} to be a set of MDPs where for each M=(𝒮,𝒜,P~,R^,H,μ~)M=(\mathcal{S},\mathcal{A},\widetilde{P},\widehat{R},H,\widetilde{\mu})\in\mathcal{M},
|P^(ss,a)P~(ss,a)|max{512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,32P^(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest}\left|\widehat{P}(s^{\prime}\mid s,a)-\widetilde{P}(s^{\prime}\mid s,a)\right|\leq\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},32\sqrt{\frac{\widehat{P}(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\}
and
|μ^(s)μ~(s)|log(18|𝒮|/δ)N\left|\widehat{\mu}(s)-\widetilde{\mu}(s)\right|\leq\sqrt{\frac{\log(18|\mathcal{S}|/\delta)}{N}}
for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}
8:For each (possibly non-stationary) policy π\pi, define V¯π=minMVM,Hπ\underline{V}^{\pi}=\min_{M\in\mathcal{M}}V^{\pi}_{M,H}
9:return \mathrm{argmax}_{\pi}\underline{V}^{\pi}
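
For concreteness, the following Python sketch implements the truncated counts and the estimators of lines 5-6 of Algorithm 3 for a dataset D given as a list of trajectories of (s, a, r, s') tuples with integer states. The data layout and the dictionary m_bar_st, which plays the role of \overline{m}^{\mathsf{st}}, are assumptions of ours.

# A sketch (our own concretization) of the truncated estimators in lines 5-6 of Algorithm 3.
from collections import defaultdict
def truncated_estimators(D, m_bar_st, num_states):
    # D: list of trajectories, each a list of (s, a, r, s_next) tuples with integer states.
    # m_bar_st: dict mapping (s, a) to the truncation threshold.
    m_D = defaultdict(int)       # truncated visit counts m_D(s, a)
    trans = defaultdict(int)     # truncated counts for (s, a, s')
    rew = defaultdict(float)     # truncated reward sums for (s, a)
    init = defaultdict(int)      # initial-state counts
    for traj in D:
        init[traj[0][0]] += 1
        seen = defaultdict(int)  # visits to (s, a) before the current step
        for (s, a, r, s_next) in traj:
            if seen[(s, a)] < m_bar_st[(s, a)]:   # the truncation indicator Trunc_{i,t}(s, a)
                m_D[(s, a)] += 1
                trans[(s, a, s_next)] += 1
                rew[(s, a)] += r
            seen[(s, a)] += 1
    P_hat = {k: trans[k] / max(1, m_D[(k[0], k[1])]) for k in trans}
    R_hat = {k: rew[k] / max(1, m_D[k]) for k in m_D}
    mu_hat = {s: init[s] / len(D) for s in range(num_states)}
    return P_hat, R_hat, mu_hat, m_D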

We now present the formal analysis for our algorithm.

Theorem 5.10.

With probability at least 1δ1-\delta, for any (possibly non-stationary) policy π\pi,

|V¯πVM,Hπ|ϵ/2\left|\underline{V}^{\pi}-V^{\pi}_{M,H}\right|\leq\epsilon/2

where \underline{V}^{\pi} is defined in Line 8 of Algorithm 3. Moreover, Algorithm 3 samples at most

2^{66}\cdot(|\mathcal{S}|+1)^{24(|\mathcal{S}|+1)}\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)\cdot|\mathcal{S}|^{8}|\mathcal{A}|^{6}/\varepsilon^{5}

trajectories.

Proof.

By Lemma 5.4, with probability at least 1δ1-\delta, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, we have

|P^(ss,a)P(ss,a)|\displaystyle\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq max{512log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest,32P^(ss,a)log(18|𝒮|2|𝒜|/δ)m¯𝗌𝗍(s,a)Nϵest}\displaystyle\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},32\sqrt{\frac{\widehat{P}(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\overline{m}^{\mathsf{st}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\}
\displaystyle\leq max{512log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest,64P(ss,a)log(18|𝒮|2|𝒜|/δ)𝒬ϵest𝗌𝗍(s,a)Nϵest},\displaystyle\max\left\{\frac{512\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},64\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}\right\},
\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq 8\sqrt{\frac{\mathbb{E}\left[(R(s,a))^{2}\right]\cdot\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}}}+\frac{8\log(18|\mathcal{S}||\mathcal{A}|/\delta)}{\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\cdot N\cdot\epsilon_{\mathrm{est}}},

and

|μ^(s)μ(s)|log(18|𝒮|/δ)N.\left|\widehat{\mu}(s)-\mu(s)\right|\leq\sqrt{\frac{\log(18|\mathcal{S}|/\delta)}{N}}.

In the remaining part of the analysis, we condition on the above event.

By Lemma 5.1, for any (possibly non-stationary) policy \pi,

𝒬ϵest𝗌𝗍(s,a)14096|𝒮|12|𝒮|ϵ/(24|𝒮||𝒜|)𝒬ϵ/(24|𝒮||𝒜|)π(s,a).\mathcal{Q}^{\mathsf{st}}_{\epsilon_{\mathrm{est}}}(s,a)\geq\frac{1}{4096\cdot|\mathcal{S}|^{12|\mathcal{S}|}}\cdot\epsilon/(24|\mathcal{S}||\mathcal{A}|)\cdot\mathcal{Q}^{\pi}_{\epsilon/(24|\mathcal{S}||\mathcal{A}|)}(s,a).

Let M=(𝒮,𝒜,P,R,H,μ)M=(\mathcal{S},\mathcal{A},P,R,H,\mu) be the true MDP. By Lemma 5.8, for any (possibly non-stationary) policy π\pi, for any M¯\overline{M}\in\mathcal{M}, we have

VM¯,HπVM,Hπϵ/2.V^{\pi}_{\overline{M},H}\geq V^{\pi}_{M,H}-\epsilon/2.

Moreover, M=(𝒮,𝒜,P,R^,H,μ)M^{\prime}=(\mathcal{S},\mathcal{A},P,\widehat{R},H,\mu)\in\mathcal{M}. Therefore, by Lemma 5.9,

VM,HπVM,Hπ+ϵ/2.V^{\pi}_{M^{\prime},H}\leq V^{\pi}_{M,H}+\epsilon/2.

Consequently,

|V¯πVM,Hπ|ϵ/2.\left|\underline{V}^{\pi}-V^{\pi}_{M,H}\right|\leq\epsilon/2.

Finally, the algorithm samples at most

(N+300log(6|𝒮||𝒜|/δ)/ϵest)×|𝒮|×|𝒜|×|𝒜|2|𝒮|\displaystyle(N+\lceil 300\log(6|\mathcal{S}||\mathcal{A}|/\delta)/\epsilon_{\mathrm{est}}\rceil)\times|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{A}|^{2|\mathcal{S}|}
\displaystyle\leq 2^{66}\cdot(|\mathcal{S}|+1)^{24(|\mathcal{S}|+1)}\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)\cdot|\mathcal{S}|^{8}|\mathcal{A}|^{6}/\varepsilon^{5}

trajectories. ∎

Theorem 5.10 immediately implies the following corollary.

Corollary 5.11.

Algorithm 3 returns a policy π\pi such that

VM,HπVM,HπϵV^{\pi}_{M,H}\geq V^{\pi^{*}}_{M,H}-\epsilon

with probability at least 1δ1-\delta. Moreover, the algorithm samples at most

2^{66}\cdot(|\mathcal{S}|+1)^{24(|\mathcal{S}|+1)}\cdot|\mathcal{A}|^{2|\mathcal{S}|}\cdot\log(18|\mathcal{S}|^{2}|\mathcal{A}|/\delta)\cdot|\mathcal{S}|^{8}|\mathcal{A}|^{6}/\varepsilon^{5}

trajectories.

6 Algorithm in the Generative Model

In this section, we present our algorithm in the generative model together with its analysis. The formal description of the algorithm is given in Algorithm 4. Compared to our algorithm in the RL setting, Algorithm 4 is conceptually much simpler. For each state-action pair (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, Algorithm 4 first draws NN samples from P(s,a)P(s,a) and R(s,a)R(s,a), and then builds a model M^\widehat{M} using the empirical estimators. The algorithm simply returns the optimal policy for M^\widehat{M}. In the remaining part of this section, we present the formal analysis for our algorithm.

Algorithm 4 Algorithm in Generative Model
1:Input: desired accuracy ϵ\epsilon, failure probability δ\delta
2:Output: An ϵ\epsilon-optimal policy π\pi
3:Let N=229|𝒮|5|𝒜|3H/ϵ3N=2^{29}\cdot|\mathcal{S}|^{5}\cdot|\mathcal{A}|^{3}\cdot H/\epsilon^{3}
4:for (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} do
5:     Draw s0,s1,,sN1s^{\prime}_{0},s^{\prime}_{1},\ldots,s^{\prime}_{N-1} from P(s,a)P(s,a)
6:     Draw r0,r1,,rN1r_{0},r_{1},\ldots,r_{N-1} from R(s,a)R(s,a)
7:     For each s^{\prime}\in\mathcal{S}, let \widehat{P}(s^{\prime}\mid s,a)=\sum_{i=0}^{N-1}\mathbb{I}[s^{\prime}_{i}=s^{\prime}]/N
8:     Let R^(s,a)=i=0N1ri/N\widehat{R}(s,a)=\sum_{i=0}^{N-1}r_{i}/N
9:Draw s0,s1,,sN1s_{0},s_{1},\ldots,s_{N-1} from μ\mu
10:For each s\in\mathcal{S}, let \widehat{\mu}(s)=\sum_{i=0}^{N-1}\mathbb{I}[s_{i}=s]/N
11:return argmaxπVM^,Hπ\mathrm{argmax}_{\pi}V^{\pi}_{\widehat{M},H} where M^=(𝒮,𝒜,P^,R^,H,μ^)\widehat{M}=(\mathcal{S},\mathcal{A},\widehat{P},\widehat{R},H,\widehat{\mu})
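
A minimal Python sketch of Algorithm 4 is given below. It assumes access to a generative-model sampler sample(s, a) that returns a pair (s', r) and a sampler sample_init() for \mu; these interfaces, and planning by backward induction over the empirical model, are our own concretization of the algorithm.

# A minimal sketch of Algorithm 4 under assumed sampler interfaces (our own).
import numpy as np
def build_empirical_model(sample, sample_init, S, A, N):
    # sample(s, a) -> (s_next, r) is one query to the generative model; sample_init() -> s ~ mu.
    P_hat = np.zeros((S, A, S))
    R_hat = np.zeros((S, A))
    mu_hat = np.zeros(S)
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                s_next, r = sample(s, a)
                P_hat[s, a, s_next] += 1.0 / N
                R_hat[s, a] += r / N
    for _ in range(N):
        mu_hat[sample_init()] += 1.0 / N
    return P_hat, R_hat, mu_hat
def plan(P_hat, R_hat, H):
    # Optimal (non-stationary, deterministic) policy for the empirical MDP by backward induction.
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R_hat + P_hat @ V        # Q[s, a] = R_hat[s, a] + sum_{s'} P_hat[s, a, s'] * V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy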

The following lemma establishes a perturbation analysis similar to those in Lemma 5.8 and Lemma 5.9. Note that here, since a generative model is available and one can therefore obtain \Omega(H) samples for each state-action pair (s,a)\in\mathcal{S}\times\mathcal{A}, we can prove both upper and lower bounds on the estimated value. This also explains why pessimistic planning is no longer necessary in the generative model setting.

Lemma 6.1.

Let M=(𝒮,𝒜,P,R,H,μ)M=(\mathcal{S},\mathcal{A},P,R,H,\mu) be an MDP and π\pi be a policy. Let 0<ϵ1/20<\epsilon\leq 1/2 be a parameter. Let M^=(𝒮,𝒜,P^,R^,H,μ^)\widehat{M}=(\mathcal{S},\mathcal{A},\widehat{P},\widehat{R},H,\widehat{\mu}) be another MDP. If for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, we have

|P^(s|s,a)P(s|s,a)|ϵ192|𝒮|2|𝒜|max(ϵP(s|s,a)576H|𝒮||𝒜|,ϵ576H|𝒮||𝒜|),|\widehat{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)|\leq\frac{\epsilon}{192|\mathcal{S}|^{2}|\mathcal{A}|}\cdot\max\left(\sqrt{\frac{\epsilon{P}(s^{\prime}|s,a)}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}},\frac{\epsilon}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}\right),
|𝔼[R^|s,a]𝔼[R|s,a]|ϵ48|𝒮||𝒜|max{𝔼[(R(s,a))2]H,1H},\left|\mathbb{E}[\widehat{R}|s,a]-\mathbb{E}[{R}|s,a]\right|\leq\frac{\epsilon}{48|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{H}},\frac{1}{H}\right\},

and

\left|\mu(s)-\widehat{\mu}(s)\right|\leq\epsilon/(12|\mathcal{S}|),

then

|VM^,HπVM,Hπ|ϵ/2.\left|V^{\pi}_{\widehat{M},H}-V^{\pi}_{{M},H}\right|\leq\epsilon/2.
Proof.

Define \mathcal{T}=(\mathcal{S}\times\mathcal{A})^{H}\times\mathcal{S} to be the set of all possible trajectories, where each T\in\mathcal{T} has the form

((s0,a0),(s1,a1),,(sH1,aH1),sH).((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}).

For a trajectory T𝒯T\in\mathcal{T}, for each (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, we write

mT(s,a)=h=0H1𝕀((sh,ah)=(s,a))m_{T}(s,a)=\sum_{h=0}^{H-1}\mathbb{I}((s_{h},a_{h})=(s,a))

as the number of times (s,a) is visited and

mT(s,a,s)=h=0H1𝕀((sh,ah,sh+1)=(s,a,s))m_{T}(s,a,s^{\prime})=\sum_{h=0}^{H-1}\mathbb{I}((s_{h},a_{h},s_{h+1})=(s,a,s^{\prime}))

as the number of times (s,a,s)(s,a,s^{\prime}) is visited. We say a trajectory

T=((s0,a0),(s1,a1),,(sH1,aH1),sH)𝒯T=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H})\in\mathcal{T}

is compatible with a (possibly non-stationary) policy π\pi if for all h[H]h\in[H],

ah=πh(sh).a_{h}=\pi_{h}(s_{h}).

For a (possibly non-stationary) policy π\pi, we use 𝒯π𝒯\mathcal{T}^{\pi}\subseteq\mathcal{T} to denote the set of all trajectories that are compatible with π\pi.

For an MDP M=(𝒮,𝒜,P,R,H,μ)M=(\mathcal{S},\mathcal{A},P,R,H,\mu) and a (possibly non-stationary) policy π\pi, for a trajectory TT that is compatible with π\pi, we write

p(T,M,π)=μ(s0)h=0H1P(sh+1sh,ah)=μ(s0)(s,a,s)𝒮×𝒜×𝒮P(ss,a)mT(s,a,s)p(T,M,\pi)=\mu(s_{0})\cdot\prod_{h=0}^{H-1}P(s_{h+1}\mid s_{h},a_{h})=\mu(s_{0})\cdot\prod_{(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

to be the probability of TT when executing π\pi in MM. Here we assume 00=10^{0}=1.

Using these definitions, we have

VM,Hπ=T𝒯πp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]).V^{\pi}_{M,H}=\sum_{T\in\mathcal{T}^{\pi}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right).

Note that for any trajectory T=((s0,a0),(s1,a1),,(sH1,aH1),sH)𝒯πT=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H})\in\mathcal{T}^{\pi}, if p(T,M,π)>0p(T,M,\pi)>0, by Assumption 2.1,

(s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]1.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\leq 1.

We define \mathcal{T}^{\pi}_{1}\subseteq\mathcal{T}^{\pi} to be the set of trajectories such that for each T\in\mathcal{T}^{\pi}_{1} and each (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, if m_{T}(s,a,s^{\prime})\geq 1 then

P(s|s,a)ϵ144|𝒮||𝒜|H.P(s^{\prime}|s,a)\geq\frac{\epsilon}{144|\mathcal{S}||\mathcal{A}|H}.

By Lemma 5.6 and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯π𝒯1πp(T,M,π)ϵ/12.\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}}p(T,M,\pi)\leq\epsilon/12.

We also define \widehat{\mathcal{T}}^{\pi}_{1}\subseteq\mathcal{T}^{\pi} to be the set of trajectories such that for each T\in\widehat{\mathcal{T}}^{\pi}_{1} and each (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, if m_{T}(s,a,s^{\prime})\geq 1 then

P^(s|s,a)ϵ72|𝒮||𝒜|H.\widehat{P}(s^{\prime}|s,a)\geq\frac{\epsilon}{72|\mathcal{S}||\mathcal{A}|H}.

Again by Lemma 5.6 and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯π𝒯^1πp(T,M^,π)ϵ/12.\sum_{T\in\mathcal{T}^{\pi}\setminus\widehat{\mathcal{T}}^{\pi}_{1}}p(T,\widehat{M},\pi)\leq\epsilon/12.

Now we show that

𝒯π𝒯1π𝒯π𝒯^1π,\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}\subseteq\mathcal{T}^{\pi}\setminus\widehat{\mathcal{T}}^{\pi}_{1},

and therefore,

T𝒯π𝒯1πp(T,M^,π)ϵ/12.\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}}p(T,\widehat{M},\pi)\leq\epsilon/12.

For each T𝒯π𝒯1πT\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}, if there exists (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} such that mT(s,a,s)1m_{T}(s,a,s^{\prime})\geq 1 and

P(ss,a)<ϵ144|𝒮||𝒜|H,P(s^{\prime}\mid s,a)<\frac{\epsilon}{144|\mathcal{S}||\mathcal{A}|H},

then

P^(ss,a)P(ss,a)+ϵ192|𝒮|2|𝒜|max(ϵP(s|s,a)576H|𝒮||𝒜|,ϵ576H|𝒮||𝒜|)<ϵ72|𝒮||𝒜|H\widehat{P}(s^{\prime}\mid s,a)\leq P(s^{\prime}\mid s,a)+\frac{\epsilon}{192|\mathcal{S}|^{2}|\mathcal{A}|}\cdot\max\left(\sqrt{\frac{\epsilon{P}(s^{\prime}|s,a)}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}},\frac{\epsilon}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}\right)<\frac{\epsilon}{72|\mathcal{S}||\mathcal{A}|H}

and thus T𝒯π𝒯^1πT\in\mathcal{T}^{\pi}\setminus\widehat{\mathcal{T}}^{\pi}_{1}. Moreover, for any T𝒯π𝒯1πT\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{1}, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} with mT(s,a,s)1m_{T}(s,a,s^{\prime})\geq 1, we have

P(s|s,a)ϵ144|𝒮||𝒜|H>0.P(s^{\prime}|s,a)\geq\frac{\epsilon}{144|\mathcal{S}||\mathcal{A}|H}>0.

Note that this implies p(T,M,π)>0p(T,M,\pi)>0. Furthermore, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} with mT(s,a,s)1m_{T}(s,a,s^{\prime})\geq 1, we have

P^(ss,a)P(ss,a)ϵ192|𝒮|2|𝒜|max(ϵP(s|s,a)576H|𝒮||𝒜|,ϵ576H|𝒮||𝒜|)>0\widehat{P}(s^{\prime}\mid s,a)\geq P(s^{\prime}\mid s,a)-\frac{\epsilon}{192|\mathcal{S}|^{2}|\mathcal{A}|}\cdot\max\left(\sqrt{\frac{\epsilon{P}(s^{\prime}|s,a)}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}},\frac{\epsilon}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}\right)>0

which implies p(T,M^,π)>0p(T,\widehat{M},\pi)>0.

We define \mathcal{T}^{\pi}_{2}\subseteq\mathcal{T}^{\pi}_{1} to be the set of trajectories such that for each T\in\mathcal{T}^{\pi}_{2} and each (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

|mT(s,a,s)mT(s,a)P(s|s,a)|576P(s|s,a)H|𝒮||𝒜|ϵ.\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right|\leq\sqrt{\frac{576P(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

By Lemma 5.7 and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯1π𝒯2πp(T,M,π)T𝒯π𝒯2πp(T,M,π)ϵ/12.\sum_{T\in\mathcal{T}^{\pi}_{1}\setminus\mathcal{T}^{\pi}_{2}}p(T,M,\pi)\leq\sum_{T\in\mathcal{T}^{\pi}\setminus\mathcal{T}^{\pi}_{2}}p(T,M,\pi)\leq\epsilon/12.

Again, we define \widehat{\mathcal{T}}^{\pi}_{2}\subseteq\mathcal{T}^{\pi} to be the set of trajectories such that for each T\in\widehat{\mathcal{T}}^{\pi}_{2} and each (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

|mT(s,a,s)mT(s,a)P^(s|s,a)|144P^(s|s,a)H|𝒮||𝒜|ϵ.\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot\widehat{P}(s^{\prime}|s,a)\right|\leq\sqrt{\frac{144\widehat{P}(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

Similarly, by Lemma 5.7 and a union bound over all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, we have

T𝒯1π𝒯^2πp(T,M^,π)T𝒯π𝒯^2πp(T,M^,π)ϵ/12.\sum_{T\in\mathcal{T}^{\pi}_{1}\setminus\widehat{\mathcal{T}}^{\pi}_{2}}p(T,\widehat{M},\pi)\leq\sum_{T\in\mathcal{T}^{\pi}\setminus\widehat{\mathcal{T}}^{\pi}_{2}}p(T,\widehat{M},\pi)\leq\epsilon/12.

Now we show that

𝒯1π𝒯2π𝒯1π𝒯^2π,\mathcal{T}^{\pi}_{1}\setminus\mathcal{T}^{\pi}_{2}\subseteq\mathcal{T}^{\pi}_{1}\setminus\widehat{\mathcal{T}}^{\pi}_{2},

and therefore,

T𝒯1π𝒯2πp(T,M^,π)ϵ/12.\sum_{T\in\mathcal{T}^{\pi}_{1}\setminus\mathcal{T}^{\pi}_{2}}p(T,\widehat{M},\pi)\leq\epsilon/12.

For any T𝒯1π𝒯2πT\in\mathcal{T}^{\pi}_{1}\setminus\mathcal{T}^{\pi}_{2}, there exists (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} such that

|mT(s,a,s)mT(s,a)P(s|s,a)|>576P(s|s,a)H|𝒮||𝒜|ϵ.\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right|>\sqrt{\frac{576P(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

Note that this implies

P(s|s,a)ϵ144|𝒮||𝒜|HP(s^{\prime}|s,a)\geq\frac{\epsilon}{144|\mathcal{S}||\mathcal{A}|H}

since otherwise mT(s,a,s)=0m_{T}(s,a,s^{\prime})=0 (because T𝒯1πT\in\mathcal{T}^{\pi}_{1}) and therefore

|mT(s,a,s)mT(s,a)P(s|s,a)||HP(s|s,a)|576P(s|s,a)H|𝒮||𝒜|ϵ.\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right|\leq|H\cdot P(s^{\prime}|s,a)|\leq\sqrt{\frac{576P(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

Since

P(s|s,a)ϵ144|𝒮||𝒜|H,P(s^{\prime}|s,a)\geq\frac{\epsilon}{144|\mathcal{S}||\mathcal{A}|H},

we have

P^(ss,a)2P(ss,a)\widehat{P}(s^{\prime}\mid s,a)\leq 2P(s^{\prime}\mid s,a)

and

|P^(ss,a)P(ss,a)|ϵP(ss,a)144|𝒮||𝒜|H.\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq\sqrt{\frac{\epsilon\cdot P(s^{\prime}\mid s,a)}{144|\mathcal{S}||\mathcal{A}|H}}.

Hence,

|mT(s,a,s)mT(s,a)P^(s|s,a)|\displaystyle\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot\widehat{P}(s^{\prime}|s,a)\right| |mT(s,a,s)mT(s,a)P(s|s,a)|HϵP(ss,a)144|𝒮||𝒜|H\displaystyle\geq\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right|-H\cdot\sqrt{\frac{\epsilon\cdot P(s^{\prime}\mid s,a)}{144|\mathcal{S}||\mathcal{A}|H}}
>576P(s|s,a)H|𝒮||𝒜|ϵϵP(ss,a)H144|𝒮||𝒜|\displaystyle>\sqrt{\frac{576P(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}-\sqrt{\frac{\epsilon\cdot P(s^{\prime}\mid s,a)\cdot H}{144|\mathcal{S}||\mathcal{A}|}}
500P(s|s,a)H|𝒮||𝒜|ϵ\displaystyle\geq\sqrt{\frac{500P(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}
144P^(s|s,a)H|𝒮||𝒜|ϵ\displaystyle\geq\sqrt{\frac{144\widehat{P}(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}

and thus T𝒯1π𝒯^2πT\in\mathcal{T}^{\pi}_{1}\setminus\widehat{\mathcal{T}}^{\pi}_{2}.

Note that

VM,Hπϵ/6T𝒯2πp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)])VM,Hπ.V^{\pi}_{M,H}-\epsilon/6\leq\sum_{T\in\mathcal{T}^{\pi}_{2}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right)\leq V^{\pi}_{M,H}.

Similarly,

VM^,Hπϵ/6T𝒯2πp(T,M^,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])VM^,Hπ.V^{\pi}_{\widehat{M},H}-\epsilon/6\leq\sum_{T\in\mathcal{T}^{\pi}_{2}}p(T,\widehat{M},\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)\leq V^{\pi}_{\widehat{M},H}.

For each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, define

𝒮s,a={s𝒮P(s|s,a)ϵ72|𝒮||𝒜|H}.\mathcal{S}_{s,a}=\left\{s^{\prime}\in\mathcal{S}\ \mid P(s^{\prime}|s,a)\geq\frac{\epsilon}{72|\mathcal{S}||\mathcal{A}|H}\right\}.

Therefore, for each (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A},

s𝒮P(ss,a)mT(s,a,s)=s𝒮s,aP(ss,a)mT(s,a,s)\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}=\prod_{s^{\prime}\in\mathcal{S}_{s,a}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

and

s𝒮P^(ss,a)mT(s,a,s)=s𝒮s,aP^(ss,a)mT(s,a,s)\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}=\prod_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

with

|mT(s,a,s)mT(s,a)P(s|s,a)|576P(s|s,a)H|𝒮||𝒜|ϵ.\displaystyle\left|m_{T}(s,a,s^{\prime})-m_{T}(s,a)\cdot P(s^{\prime}|s,a)\right|\leq\sqrt{\frac{576P(s^{\prime}|s,a)H|\mathcal{S}||\mathcal{A}|}{\epsilon}}.

Note that \sum_{s^{\prime}\in\mathcal{S}}\left(\widehat{P}(s^{\prime}|s,a)-{P}(s^{\prime}|s,a)\right)=0, which implies

|s𝒮s,aP^(s|s,a)P(s|s,a)|=|s𝒮s,aP(s|s,a)P^(s|s,a)|ϵ192|𝒮||𝒜|ϵ144H|𝒮||𝒜|.\left|\sum_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}|s,a)-{P}(s^{\prime}|s,a)\right|=\left|\sum_{s^{\prime}\not\in\mathcal{S}_{s,a}}{P}(s^{\prime}|s,a)-\widehat{P}(s^{\prime}|s,a)\right|\leq\frac{\epsilon}{192|\mathcal{S}||\mathcal{A}|}\cdot\frac{\epsilon}{144\cdot H\cdot|\mathcal{S}||\mathcal{A}|}.

By applying Lemma 5.5 and setting \bar{n} to be |\mathcal{S}|, n to be |\mathcal{S}_{s,a}|, \epsilon to be \epsilon/(192|\mathcal{S}|^{2}|\mathcal{A}|), and \overline{m} to be 576\cdot H\cdot|\mathcal{S}||\mathcal{A}|/\epsilon, we have

s𝒮s,aP^(ss,a)mT(s,a,s)(1ϵ24|𝒮||𝒜|)s𝒮s,aP(ss,a)mT(s,a,s)\prod_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\geq\left(1-\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|}\right)\prod_{s^{\prime}\in\mathcal{S}_{s,a}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}

and

s𝒮s,aP^(ss,a)mT(s,a,s)(1+ϵ24|𝒮||𝒜|)s𝒮s,aP(ss,a)mT(s,a,s).\prod_{s^{\prime}\in\mathcal{S}_{s,a}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\leq\left(1+\frac{\epsilon}{24|\mathcal{S}||\mathcal{A}|}\right)\prod_{s^{\prime}\in\mathcal{S}_{s,a}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}.

Therefore,

(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)(1ϵ/12)(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s).\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\geq(1-\epsilon/12)\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}.

and

(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)(1+ϵ/12)(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s).\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\leq(1+\epsilon/12)\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}.

For the summation of rewards, we have

(s,a)𝒮×𝒜mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|
=\displaystyle= (s,a)𝒮×𝒜mT(s,a)=1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)=1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|
+\displaystyle+ (s,a)𝒮×𝒜mT(s,a)>1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|.\displaystyle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)>1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|.

For those (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with mT(s,a)>1m_{T}(s,a)>1, since p(T,M,π)>0p(T,M,\pi)>0, by Lemma 2.1, we have

|𝔼[R^(s,a)]𝔼[R(s,a)]|\displaystyle|\mathbb{E}[\widehat{R}(s,a)]-\mathbb{E}[R(s,a)]|
\displaystyle\leq ϵ48|𝒮||𝒜|max{𝔼[(R(s,a))2]H,1H}max{ϵ24H,ϵ48|𝒮||𝒜|H}.\displaystyle\frac{\epsilon}{48|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{H}},\frac{1}{H}\right\}\leq\max\left\{\frac{\epsilon}{24H},\frac{\epsilon}{48|\mathcal{S}||\mathcal{A}|H}\right\}.

Since (s,a)𝒮×𝒜mT(s,a)H\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\leq H and mT(s,a)Hm_{T}(s,a)\leq H, we have

(s,a)𝒮×𝒜mT(s,a)>1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ16.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)>1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{16}.

For those (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A} with mT(s,a)=1m_{T}(s,a)=1, we have

(s,a)𝒮×𝒜mT(s,a)=1mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ48.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}\mid m_{T}(s,a)=1}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{48}.

Thus,

(s,a)𝒮×𝒜mT(s,a)|𝔼[R(s,a)]𝔼[R^(s,a)]|ϵ12.\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\left|\mathbb{E}\left[R(s,a)\right]-\mathbb{E}\left[\widehat{R}(s,a)\right]\right|\leq\frac{\epsilon}{12}.

For each T𝒯2πT\in\mathcal{T}^{\pi}_{2} with

T=((s0,a0),(s1,a1),,(sH1,aH1),sH),T=((s_{0},a_{0}),(s_{1},a_{1}),\ldots,(s_{H-1},a_{H-1}),s_{H}),

we have

p(T,M^,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])\displaystyle p(T,\widehat{M},\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)
=\displaystyle= μ^(s0)(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])\displaystyle\widehat{\mu}(s_{0})\cdot\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)

where

|μ^(s0)μ(s0)|ϵ/(12|𝒮|),\left|\widehat{\mu}(s_{0})-\mu(s_{0})\right|\leq\epsilon/(12|\mathcal{S}|),
|(s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)](s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)]|ϵ/12,\left|\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]-\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right|\leq\epsilon/12,
(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)(1ϵ/12)(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s),\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\geq(1-\epsilon/12)\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})},

and

(s,a)𝒮×𝒜s𝒮P^(ss,a)mT(s,a,s)(1+ϵ/12)(s,a)𝒮×𝒜s𝒮P(ss,a)mT(s,a,s).\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}\widehat{P}(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}\leq(1+\epsilon/12)\prod_{(s,a)\in\mathcal{S}\times\mathcal{A}}\prod_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)^{m_{T}(s,a,s^{\prime})}.

Since

VM,Hπϵ/6T𝒯2πp(T,M,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R(s,a)])VM,HπV^{\pi}_{M,H}-\epsilon/6\leq\sum_{T\in\mathcal{T}^{\pi}_{2}}p(T,M,\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[R(s,a)]\right)\leq V^{\pi}_{M,H}

and

VM^,Hπϵ/6T𝒯2πp(T,M^,π)((s,a)𝒮×𝒜mT(s,a)𝔼[R^(s,a)])VM^,Hπ,V^{\pi}_{\widehat{M},H}-\epsilon/6\leq\sum_{T\in\mathcal{T}^{\pi}_{2}}p(T,\widehat{M},\pi)\cdot\left(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}m_{T}(s,a)\cdot\mathbb{E}[\widehat{R}(s,a)]\right)\leq V^{\pi}_{\widehat{M},H},

we have

|VM^,HπVM,Hπ|ϵ/2.\left|V^{\pi}_{\widehat{M},H}-V^{\pi}_{M,H}\right|\leq\epsilon/2.

The following lemma provides error bounds on the empirical model \widehat{M} built by the algorithm. Note that the error bounds established here are similar to those in Lemma 5.4.

Lemma 6.2.

With probability at least 1δ1-\delta, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}:

|P^(ss,a)P(ss,a)|\displaystyle\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq max{4P(ss,a)log(6|𝒮|2|𝒜|/δ)N,2log(6|𝒮|2|𝒜|/δ)N},\displaystyle\max\left\{4\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{N}},\frac{2\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{N}\right\},
|R^(s,a)𝔼[R(s,a)]|\displaystyle\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq max{4𝔼[(R(s,a))2]log(6|𝒮||𝒜|/δ)N,2log(6|𝒮||𝒜|/δ)N}\displaystyle\max\left\{4\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]\cdot\log(6|\mathcal{S}||\mathcal{A}|/\delta)}{N}},\frac{2\log(6|\mathcal{S}||\mathcal{A}|/\delta)}{N}\right\}

and

|μ^(s)μ(s)|log((6|𝒮|)/δ)N.\left|\widehat{\mu}(s)-\mu(s)\right|\leq\sqrt{\frac{\log((6|\mathcal{S}|)/\delta)}{N}}.
Proof.

For each (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, by Bernstein’s inequality,

\Pr\left[\left|\sum_{i=0}^{N-1}\mathbb{I}\left[s^{\prime}=s^{\prime}_{i}\right]-P(s^{\prime}\mid s,a)\cdot N\right|\geq t\right]
\leq 2\exp\left(\frac{-t^{2}}{2\left(N\cdot P(s^{\prime}\mid s,a)+t/3\right)}\right).

Thus, by setting t=2NP(ss,a)log(6|𝒮|2|𝒜|/δ)+log(6|𝒮|2|𝒜|/δ)t=2\sqrt{N\cdot P(s^{\prime}\mid s,a)\cdot\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}+\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta), we have

\Pr\left[\left|\sum_{i=0}^{N-1}\mathbb{I}\left[s^{\prime}=s^{\prime}_{i}\right]-P(s^{\prime}\mid s,a)\cdot N\right|\geq t\right]\leq\delta/(3|\mathcal{S}|^{2}|\mathcal{A}|).
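For completeness, one way to verify this choice of tt is the following. Writing L=\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta) and P=P(s^{\prime}\mid s,a), so that t=2\sqrt{N\cdot P\cdot L}+L, we have

t^{2}=4NPL+4L\sqrt{NPL}+L^{2}\geq 2NPL+\frac{4}{3}L\sqrt{NPL}+\frac{2}{3}L^{2}=2\left(N\cdot P+t/3\right)\cdot L,

so the right-hand side of Bernstein’s inequality is at most 2\exp(-L)=\delta/(3|\mathcal{S}|^{2}|\mathcal{A}|).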

By applying a union bound over all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, with probability at least 1δ/31-\delta/3, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

|P^(ss,a)P(ss,a)|\displaystyle\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq 2P(ss,a)log(6|𝒮|2|𝒜|/δ)N+log(6|𝒮|2|𝒜|/δ)N\displaystyle 2\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{N}}+\frac{\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{N}
\displaystyle\leq max{4P(ss,a)log(6|𝒮|2|𝒜|/δ)N,2log(6|𝒮|2|𝒜|/δ)N}.\displaystyle\max\left\{4\sqrt{\frac{P(s^{\prime}\mid s,a)\cdot\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{N}},\frac{2\log(6|\mathcal{S}|^{2}|\mathcal{A}|/\delta)}{N}\right\}.

By a similar argument, with probability at least 1δ/31-\delta/3, for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A},

|R^(s,a)𝔼[R(s,a)]|max{4𝔼[(R(s,a))2]log(6|𝒮||𝒜|/δ)N,2log(6|𝒮||𝒜|/δ)N}.\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq\max\left\{4\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]\cdot\log(6|\mathcal{S}||\mathcal{A}|/\delta)}{N}},\frac{2\log(6|\mathcal{S}||\mathcal{A}|/\delta)}{N}\right\}.

Moreover, with probability at least 1δ/31-\delta/3, for all s𝒮s\in\mathcal{S}, we have

\left|\widehat{\mu}(s)-\mu(s)\right|\leq\sqrt{\frac{\log((6|\mathcal{S}|)/\delta)}{N}}.

We complete the proof by applying a union bound. ∎
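The bound on the transition probabilities in Lemma 6.2 can also be checked numerically. The Python sketch below repeatedly draws N samples from a fixed transition row, forms the empirical estimate, and reports how often the stated bound is violated; the particular row p, the values of N and delta, and the number of trials are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of the transition-probability bound in Lemma 6.2.
rng = np.random.default_rng(1)
S, A = 4, 3                                   # enter only through the log factor
p = np.array([0.5, 0.3, 0.15, 0.05])          # a fixed row P(. | s, a)
N, delta, trials = 2000, 0.05, 10000
log_term = np.log(6 * S**2 * A / delta)
bound = np.maximum(4 * np.sqrt(p * log_term / N), 2 * log_term / N)

violations = 0
for _ in range(trials):
    p_hat = rng.multinomial(N, p) / N         # empirical estimate from N i.i.d. samples
    violations += np.any(np.abs(p_hat - p) > bound)

print(f"fraction of trials violating the bound: {violations / trials:.4f}")
# The observed violation rate should be far below delta.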

Using Lemma 6.1 and Lemma 6.2, we can now prove the correctness of our algorithm.

Theorem 6.3.

With probability at least 1δ1-\delta, Algorithm 3 returns a policy π\pi such that

VπVϵV^{\pi}\geq V^{*}-\epsilon

by using at most 230|𝒮|6|𝒜|4/ϵ32^{30}\cdot|\mathcal{S}|^{6}\cdot|\mathcal{A}|^{4}/\epsilon^{3} batches of queries.

Proof.

By Lemma 6.2, with probability at least 1δ1-\delta, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}, we have

\left|\widehat{P}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\right|\leq\frac{\epsilon}{192|\mathcal{S}|^{2}|\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\epsilon\cdot P(s^{\prime}\mid s,a)}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}},\frac{\epsilon}{576\cdot H\cdot|\mathcal{S}||\mathcal{A}|}\right\},
\left|\widehat{R}(s,a)-\mathbb{E}[R(s,a)]\right|\leq\frac{\epsilon}{48|\mathcal{S}||\mathcal{A}|}\cdot\max\left\{\sqrt{\frac{\mathbb{E}[(R(s,a))^{2}]}{H}},\frac{1}{H}\right\},

and

\left|\widehat{\mu}(s)-\mu(s)\right|\leq\epsilon/(12|\mathcal{S}|).

We condition on this event for the rest of the proof. By Lemma 6.1, for any (possibly non-stationary) policy π\pi, we have

|VM^,HπVM,Hπ|ϵ/2.\left|V^{\pi}_{\widehat{M},H}-V^{\pi}_{{M},H}\right|\leq\epsilon/2.

Let π\pi be the policy returned by the algorithm. We thus have

VM,HπVM^,Hπϵ/2VM^,Hπϵ/2VM,Hπϵ.V^{\pi}_{M,H}\geq V^{\pi}_{\widehat{M},H}-\epsilon/2\geq V^{\pi^{*}}_{\widehat{M},H}-\epsilon/2\geq V^{\pi^{*}}_{M,H}-\epsilon.

The total number of samples taken by the algorithm is

(|\mathcal{S}||\mathcal{A}|+1)\cdot N\leq 2^{30}\cdot|\mathcal{S}|^{6}\cdot|\mathcal{A}|^{4}\cdot H/\epsilon^{3}. ∎
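To make the plug-in argument above concrete, the following Python sketch builds an empirical model from samples and then computes an optimal non-stationary policy for it by backward induction, which is how a near-optimal policy for the true MDP is obtained in the analysis. The sampling-oracle interface (sample_next_state, sample_reward, sample_init_state), the toy instance, and the value of N are hypothetical placeholders and are not taken from Algorithm 3.

import numpy as np

def empirical_model(sample_next_state, sample_reward, sample_init_state, S, A, N):
    # Build (mu_hat, P_hat, r_hat) from N queries per (s, a) plus N initial-state draws.
    P_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / N
                r_hat[s, a] += sample_reward(s, a) / N
    mu_hat = np.bincount([sample_init_state() for _ in range(N)], minlength=S) / N
    return mu_hat, P_hat, r_hat

def plan_in_empirical_mdp(P_hat, r_hat, H):
    # Optimal non-stationary policy for the empirical finite-horizon MDP by backward
    # induction; pi[h, s] is the greedy action at step h.
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r_hat + P_hat @ V            # Q[s, a] = r_hat[s, a] + sum_s' P_hat[s, a, s'] V[s']
        pi[h] = np.argmax(Q, axis=1)
        V = Q[np.arange(S), pi[h]]
    return pi

# Tiny usage on a synthetic MDP (illustrative parameters only).
rng = np.random.default_rng(2)
S, A, H, N = 3, 2, 8, 500
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(0, 1.0 / H, size=(S, A))
mu = rng.dirichlet(np.ones(S))

mu_hat, P_hat, r_hat = empirical_model(
    lambda s, a: rng.choice(S, p=P[s, a]),
    lambda s, a: r[s, a] + rng.uniform(-0.01, 0.01) / H,
    lambda: rng.choice(S, p=mu),
    S, A, N)
print("greedy policy per step:\n", plan_in_empirical_mdp(P_hat, r_hat, H))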

Acknowledgements

The authors would like to thank Simon S. Du for helpful discussions, and the anonymous reviewers for helpful comments.

Ruosong Wang was supported in part by NSF IIS1763562, US Army W911NF1920104, and ONR Grant N000141812861. Part of this work was done while Ruosong Wang and Lin F. Yang were visiting the Simons Institute for the Theory of Computing.

References

  • [AKY19] Alekh Agarwal, Sham Kakade, and Lin F Yang. On the optimality of sparse model-based planning for Markov decision processes. arXiv preprint arXiv:1906.03804, 2019.
  • [AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
  • [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org, 2017.
  • [BT03] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3(Oct):213–231, March 2003.
  • [BT09] Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
  • [DB15] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
  • [DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 5717–5727, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [DLWB19] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1507–1516, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • [DWCW19] Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311, 2019.
  • [FPL18] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems, pages 2994–3004, 2018.
  • [JA18] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398, 2018.
  • [JAZBJ18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • [Kak03] Sham M Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
  • [KN09] J Zico Kolter and Andrew Y Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th annual international conference on machine learning, pages 513–520, 2009.
  • [KS99] Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002, 1999.
  • [KS02] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • [LH12] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer, 2012.
  • [LWC+20] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
  • [MDSV21] Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. UCB momentum Q-learning: Correcting the bias without forgetting. arXiv preprint arXiv:2103.01312, 2021.
  • [NPB20] Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning. arXiv preprint arXiv:2007.01891, 2020.
  • [ORVR13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. arXiv preprint arXiv:1306.0940, 2013.
  • [OVR16] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. ArXiv, abs/1608.02732, 2016.
  • [OVR17] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR.org, 2017.
  • [PBPH+20] Aldo Pacchiano, Philip Ball, Jack Parker-Holder, Krzysztof Choromanski, and Stephen Roberts. On optimism in model-based reinforcement learning. arXiv preprint arXiv:2006.11911, 2020.
  • [Put94] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
  • [RLD+21] Tongzheng Ren, Jialian Li, Bo Dai, Simon S Du, and Sujay Sanghavi. Nearly horizon-free offline reinforcement learning. arXiv preprint arXiv:2103.14077, 2021.
  • [Rus19] Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. arXiv preprint arXiv:1906.02870, 2019.
  • [SJ19] Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. arXiv preprint arXiv:1905.03814, 2019.
  • [SL08] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • [SLW+06] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
  • [SS10] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML, 2010.
  • [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196, 2018.
  • [SWWY18] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. Society for Industrial and Applied Mathematics, 2018.
  • [SY94] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
  • [TM18] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. arXiv preprint arXiv:1803.01626, 2018.
  • [Wan17] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017.
  • [WDYK20] Ruosong Wang, Simon S Du, Lin F Yang, and Sham M Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
  • [YW19] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
  • [YYD21] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
  • [ZB19] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
  • [ZDJ20] Zihan Zhang, Simon S Du, and Xiangyang Ji. Nearly minimax optimal reward-free reinforcement learning. arXiv preprint arXiv:2010.05901, 2020.
  • [ZJ19] Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, pages 2823–2832, 2019.
  • [ZJD20] Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020.
  • [ZZJ20] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020.