
Policy Finetuning in Reinforcement Learning
via Design of Experiments using Offline Data

Ruiqi Zhang
Department of Statistics
University of California
Berkeley
[email protected]
Andrea Zanette
Department of EECS
University of California
Berkeley
[email protected]
Abstract

In some applications of reinforcement learning, a dataset of pre-collected experience is already available but it is also possible to acquire some additional online data to help improve the quality of the policy. However, it may be preferable to gather additional data with a single, non-reactive exploration policy and avoid the engineering costs associated with switching policies.

In this paper we propose an algorithm with provable guarantees that can leverage an offline dataset to design a single non-reactive policy for exploration. We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.

1 Introduction

Reinforcement learning (RL) is a general framework for data-driven, sequential decision making (Puterman, 1994; Sutton and Barto, 2018). In RL, a common goal is to identify a near-optimal policy, and there exist two main paradigms: online and offline RL.

Online RL is effective when the practical cost of a bad decision is low, such as in simulated environments (e.g., (Mnih et al., 2015; Silver et al., 2016)). In online RL, a well-designed learning algorithm starts tabula rasa and implements a sequence of policies whose value should approach that of an optimal policy. When the cost of making a mistake is high, such as in healthcare (Gottesman et al., 2018) and in self-driving (Kiran et al., 2021), an offline approach is preferred. In offline RL, the agent uses a dataset of pre-collected experience to extract a policy that is as good as possible. In this latter case, the quality of the policy that can be extracted from the dataset is limited by the quality of the dataset.

Many applications, however, fall between these two opposite settings: for example, a company that sells products online has most likely recorded the feedback that it has received from its customers, but can also collect a small amount of additional strategic data in order to improve its recommendation engine. While in principle an online exploration algorithm can be used to collect fresh data, in practice a number of engineering considerations require the deployed policy to be non-reactive. We say that a policy is non-reactive (or passive, or memoryless) if it chooses actions only according to the current state of the system. Most online algorithms are, by design, reactive to the data being acquired.

One example of a situation where non-reactive policies may be preferred is when a human in the loop is required to validate each exploratory policy before it is deployed, to ensure that it is of high quality (Dann et al., 2019), safe (Yang et al., 2021), and free of discriminatory content (Koenecke et al., 2020). Other situations that may warrant non-reactive exploration are those where the interaction with the user occurs through a distributed system with delayed feedback. In recommendation systems, data collection may only take minutes, but policy deployment and updates can span weeks (Afsar et al., 2022). Similar considerations apply across various RL application domains, including healthcare (Yu et al., 2021), computer networks (Xu et al., 2018), and new material design (Raccuglia et al., 2016). In all such cases, the engineering effort required to implement a system that handles real-time policy switches may be prohibitive: deploying a single, non-reactive policy is much preferred.

Non-reactive exploration from offline data Most exploration algorithms that we are aware of incorporate policy switches when they interact with the environment (Dann and Brunskill, 2015; Dann et al., 2017; Azar et al., 2017; Jin et al., 2018; Dann et al., 2019; Zanette and Brunskill, 2019; Zhang et al., 2020b). Switching among a sequence of non-reactive policies is in fact necessary in order to achieve near-optimal regret: the number of policy switches must be at least $\widetilde{O}(H|\mathcal{S}||\mathcal{A}|\log\log K)$, where $\mathcal{S},\mathcal{A},H,K$ are the state space, action space, horizon, and total number of episodes, respectively (Qiao et al., 2022). With no switches, i.e., when a fully non-reactive data collection strategy is implemented, it is information-theoretically impossible (Xiao et al., 2022) to identify a good policy using a number of samples polynomial in the size of the state and action space.

However, these fundamental limits apply to the case where the agent learns tabula rasa. In the more common case where offline data is available, we demonstrate that it is possible to leverage the dataset to design an effective non-reactive exploratory policy. More precisely, an available offline dataset contains information (e.g., transitions) about a certain area of the state-action space, a concept known as partial coverage. A dataset with partial coverage naturally identifies a ‘sub-region’ of the original MDP—more precisely, a sub-graph—that is relatively well explored. We demonstrate that it is possible to use the dataset to design a non-reactive policy that further explores such a sub-region. The additional data collected can then be used to learn a near-optimal policy within that sub-region.

In other words, exploration with no policy switches can collect additional information and compete with the best policy that is restricted to an area where the original dataset has sufficient information. The value of such a policy can be much higher than that of the policy computed using only the offline dataset, and it does not directly depend on a concentrability coefficient (Munos and Szepesvári, 2008; Chen and Jiang, 2019).

Perhaps surprisingly, addressing the problem of non-reactive exploration in reinforcement learning requires an approach that combines both optimism and pessimism in the face of uncertainty in order to explore efficiently. While optimism drives exploration, pessimism ensures that the agent explores conservatively, restricting its exploration effort to a region that it knows how to navigate; this interplay is a technical contribution which can be of independent interest.

Contributions To the best of our knowledge, this is the first theoretically rigorous paper that considers the problem of designing an experiment in reinforcement learning for online, passive exploration using a dataset of pre-collected experience. More precisely, our contributions are as follows:

  • We introduce an algorithm that takes as input a dataset, uses it to design and deploy a non-reactive exploratory policy, and then outputs a locally near-optimal policy.

  • We introduce the concept of sparsified MDP, which is actively used by our algorithm to design the exploratory policy, as well as to theoretically analyze the quality of the final policy that it finds.

  • We rigorously establish a nearly minimax-optimal upper bound on the sample complexity needed to learn a locally $\varepsilon$-optimal policy using our algorithm.

2 Related Work

In this section we discuss some related literature. Our work is related to low-switching algorithms, but unlike those, we focus on the limit case where no switches are allowed. For more related work on low-switching algorithms, offline RL, task-agnostic RL, and reward-free RL we refer to Appendix F.

Low-switching RL In reinforcement learning, (Bai et al., 2019) first proposed Q-learning with UCB2 exploration, proving an $O(H^{3}|\mathcal{S}||\mathcal{A}|\log K)$ switching cost. This was later improved by a factor of $H$ by the UCB-Advantage algorithm in (Zhang et al., 2020b). Recently, (Qiao et al., 2022) generalized the policy elimination algorithm from (Cesa-Bianchi et al., 2013) and introduced APEVE, which attains an optimal $O(H|\mathcal{S}||\mathcal{A}|\log\log K)$ switching cost. The reward-free version of their algorithm (which is not regret minimizing) has an $O(H|\mathcal{S}||\mathcal{A}|)$ switching cost.

Similar ideas were soon applied in RL with linear function approximation (Gao et al., 2021; Wang et al., 2021; Qiao and Wang, 2022) and general function approximation (Qiao et al., 2023). Additionally, numerous research efforts have focused on low-adaptivity in other learning domains, such as batched dueling bandits (Agarwal et al., 2022), batched convex optimization (Duchi et al., 2018), linear contextual bandits (Ruan et al., 2021), and deployment-efficient RL (Huang et al., 2022).

Our work was inspired by the problem of non-reactive policy design in linear contextual bandits. Given access to an offline dataset, (Zanette et al., 2021a) proposed an algorithm that outputs a single exploratory policy, which generates a dataset from which a near-optimal policy can be extracted. However, a number of additional challenges arise in reinforcement learning, including the fact that the state space is only partially explored in the offline dataset. In fact, in reinforcement learning, (Xiao et al., 2022) established an exponential lower bound for any non-adaptive policy learning algorithm starting tabula rasa.

3 Setup

Throughout this paper, we let $[n]=\{1,2,\dots,n\}$. We adopt the big-O notation, where $\widetilde{O}(\cdot)$ suppresses poly-log factors of the input parameters. We indicate the cardinality of a set $\mathcal{X}$ with $|\mathcal{X}|$.

Markov decision process We consider time-homogeneous episodic Markov decision processes (MDPs). They are defined by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $\mathbb{P}$, a reward function $r$, and the episodic length $H$. The transition probability $\mathbb{P}(s'\mid s,a)$, which does not depend on the current time-step $h\in[H]$, denotes the probability of transitioning to state $s'\in\mathcal{S}$ when taking action $a\in\mathcal{A}$ in the current state $s\in\mathcal{S}$. Typically we denote with $s_{1}$ the initial state. For simplicity, we consider deterministic reward functions $r:\mathcal{S}\times\mathcal{A}\to[0,1]$. A deterministic non-reactive (or memoryless, or passive) policy $\pi=\{\pi_{h}\}_{h\in[H]}$ maps a given state to an action.

The value function is defined as the expected cumulative reward. It depends on the state $s$ under consideration, on the transition kernel $\mathbb{P}$ and reward $r$ that define the MDP, and on the policy $\pi$ being implemented. It is defined as $V_{h}(s;\mathbb{P},r,\pi)=\mathbb{E}_{\mathbb{P},\pi}\big[\sum_{i=h}^{H}r(s_{i},a_{i})\mid s_{h}=s\big]$, where $\mathbb{E}_{\mathbb{P},\pi}$ denotes the expectation with respect to the distribution generated by $\mathbb{P}$ and the policy $\pi$. A closely related quantity is the state-action value function, or $Q$-function, defined as $Q_{h}(s,a;\mathbb{P},r,\pi)=\mathbb{E}_{\mathbb{P},\pi}\big[\sum_{i=h}^{H}r(s_{i},a_{i})\mid s_{h}=s,a_{h}=a\big]$. When it is clear from the context, we sometimes omit $(\mathbb{P},r)$ and simply write $V_{h}^{\pi}(s)$ and $Q_{h}^{\pi}(s,a)$. We denote an MDP defined by $\mathcal{S},\mathcal{A}$ and the transition kernel $\mathbb{P}$ as $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P})$.
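To make the notation concrete, the following minimal Python sketch evaluates a deterministic non-reactive policy by backward induction on a small tabular, time-homogeneous MDP. The arrays P, r, and pi below are made-up illustrations, not objects defined in the paper.

```python
import numpy as np

def policy_value(P, r, pi, H):
    """Backward induction for V_h(s; P, r, pi) on a tabular, time-homogeneous MDP.

    P  : (S, A, S) array, P[s, a, s'] = probability of reaching s' from (s, a)
    r  : (S, A) array of deterministic rewards in [0, 1]
    pi : (H, S) integer array, pi[h, s] = action played at step h + 1 in state s
    Returns V of shape (H + 1, S), where V[h, s] corresponds to V_{h+1}(s; P, r, pi).
    """
    S = P.shape[0]
    V = np.zeros((H + 1, S))                  # V_{H+1} = 0
    for h in range(H - 1, -1, -1):            # steps H, H-1, ..., 1
        a = pi[h]                             # actions prescribed by the policy at this step
        V[h] = r[np.arange(S), a] + P[np.arange(S), a] @ V[h + 1]
    return V

# A toy 2-state, 2-action instance with made-up numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
pi = np.zeros((3, 2), dtype=int)              # always play action 0, horizon H = 3
print(policy_value(P, r, pi, H=3)[0])         # V_1(s) for each state s
```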

3.1 Interaction protocol

Algorithm 1 Design of experiments in reinforcement learning
1: Input: offline dataset $\mathcal{D}$
2: Offline phase: use $\mathcal{D}$ to compute the exploratory policy $\pi_{ex}$
3: Online phase: deploy $\pi_{ex}$ to collect the online dataset $\mathcal{D}'$
4: Planning phase: receive the reward function $r$ and use $\mathcal{D}\cup\mathcal{D}'$ to extract $\pi_{final}$
5: Return $\pi_{final}$

In this paper we assume access to an offline dataset $\mathcal{D}=\{(s,a,s')\}$ where every state-action pair $(s,a)$ is sampled in an i.i.d. fashion from some distribution $\mu$ and $s'\sim\mathbb{P}(\cdot\mid s,a)$, which is common in the offline RL literature (Xie et al., 2021a; Zhan et al., 2022; Rashidinejad et al., 2021; Uehara and Sun, 2021). We denote by $N(s,a)$ and $N(s,a,s')$ the number of $(s,a)$ and $(s,a,s')$ samples in the offline dataset $\mathcal{D}$, respectively. The interaction protocol considered in this paper consists of three distinct phases, which are displayed in algorithm 1. They are:

  • the offline phase, where the learner uses an offline dataset $\mathcal{D}$ of pre-collected experience to design the non-reactive exploratory policy $\pi_{ex}$;

  • the online phase, where $\pi_{ex}$ is deployed to generate the online dataset $\mathcal{D}'$;

  • the planning phase, where the learner receives a reward function and uses all the data collected to extract a good policy $\pi_{final}$ with respect to that reward function.

The objective is to minimize the number of online episodic interactions needed to find a policy $\pi_{final}$ whose value is as high as possible. Moreover, we focus on the reward-free RL setting (Jin et al., 2020a; Kaufmann et al., 2021), which is more general than reward-aware RL.
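As a small illustration of the counts $N(s,a)$ and $N(s,a,s')$ used throughout the paper, the sketch below tallies them from an offline dataset stored as a list of $(s,a,s')$ tuples; the dataset contents are made up for the example.

```python
from collections import Counter

# A hypothetical offline dataset D: a list of (s, a, s') transitions.
D = [(0, 1, 2), (0, 1, 2), (0, 0, 1), (2, 1, 0), (0, 1, 3)]

N_sa = Counter((s, a) for (s, a, _) in D)    # N(s, a)
N_sas = Counter(D)                           # N(s, a, s')

print(N_sa[(0, 1)])       # 3 samples of the state-action pair (0, 1)
print(N_sas[(0, 1, 2)])   # 2 of them transition to state 2
```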

4 Algorithm: balancing optimism and pessimism for experimental design

In this section we outline our algorithm Reward-Free Non-reactive Policy Design (RF-NPD), which follows the high-level protocol described in algorithm 1. The technical novelty lies almost entirely in the design of the exploratory policy $\pi_{ex}$. In order to prepare the reader for the discussion of the algorithm, we first give some intuition in section 4.1, followed by the definition of the sparsified MDP in section 4.2, a central concept of this paper, and then describe the implementation of line 2 of the protocol in algorithm 1 in section 4.3. We conclude by presenting the implementation of lines 3 and 4 of the protocol in algorithm 1.

4.1 Intuition

In order to present the main intuition for this paper, in this section we assume that enough transitions are available in the dataset for every edge $(s,a)\rightarrow s'$, namely that the critical condition

$$N(s,a,s')\geq\Phi=\widetilde{\Theta}(H^{2}) \qquad (4.1)$$

holds for all tuples $(s,a,s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$ (the precise value for $\Phi$ will be given later in eq. 5.1). Such a condition is rarely satisfied everywhere in the state-action-state space, but assuming it in this section simplifies the presentation of one of the key ideas of this paper.

The key observation is that when eq. 4.1 holds for all $(s,a,s')$, we can use the empirical transition kernel to design an exploration policy $\pi_{ex}$ to eventually extract a near-optimal policy $\pi_{final}$ for any desired level of sub-optimality $\varepsilon$, despite eq. 4.1 being independent of $\varepsilon$. More precisely, let $\widehat{\mathbb{P}}$ be the empirical transition kernel defined in the usual way, $\widehat{\mathbb{P}}(s'\mid s,a)=N(s,a,s')/N(s,a)$ for any tuple $(s,a,s')$. The intuition—which will be verified rigorously in the analysis of the algorithm—is the following:

If eq. 4.1 holds for every $(s,a,s')$, then $\widehat{\mathbb{P}}$ can be used to design a non-reactive exploration policy $\pi_{ex}$ which can be deployed on $\mathcal{M}$ to find an $\varepsilon$-optimal policy $\pi_{final}$ using $\asymp\frac{1}{\varepsilon^{2}}$ samples.

We remark that even if condition 4.1 holds for all tuples $(s,a,s')$, the empirical kernel $\widehat{\mathbb{P}}$ is not accurate enough to extract an $\varepsilon$-optimal policy from the dataset $\mathcal{D}$ without collecting further data. Indeed, the threshold $\Phi=\widetilde{\Theta}(H^{2})$ on the number of samples is independent of the desired sub-optimality $\varepsilon>0$, while it is well known that at least $\sim\frac{1}{\varepsilon^{2}}$ offline samples are needed to find an $\varepsilon$-optimal policy. Therefore, directly running an offline RL algorithm on the available offline dataset $\mathcal{D}$ does not yield an $\varepsilon$-optimal policy. However, the threshold $\Phi=\widetilde{\Theta}(H^{2})$ is sufficient to design a non-reactive exploratory policy $\pi_{ex}$ that can discover an $\varepsilon$-optimal policy $\pi_{final}$ after collecting $\sim\frac{1}{\varepsilon^{2}}$ online samples.

4.2 Sparsified MDP

The intuition in the prior section must be modified to work with heterogeneous datasets and dynamics where $N(s,a,s')\geq\Phi$ may fail to hold everywhere. For example, if $\mathbb{P}(s'\mid s,a)$ is very small for a certain tuple $(s,a,s')$, it is unlikely that the dataset contains $N(s,a,s')\geq\Phi$ samples for that particular tuple. In a more extreme setting, if the dataset is empty, the critical condition in eq. 4.1 is violated for all tuples $(s,a,s')$, and in fact the lower bound of Xiao et al. (2022) states that finding $\varepsilon$-optimal policies by exploring with a non-reactive policy is not feasible with $\sim\frac{1}{\varepsilon^{2}}$ sample complexity. This suggests that in general it is not possible to output an $\varepsilon$-optimal policy using the protocol in algorithm 1.

However, a real-world dataset generally covers at least a portion of the state-action space, and so we expect the condition $N(s,a,s')\geq\Phi$ to hold somewhere; the sub-region of the MDP where it holds represents the connectivity graph of the sparsified MDP. This is the region that the agent knows how to navigate using the offline dataset $\mathcal{D}$, and so it is the one that the agent can explore further using $\pi_{ex}$. More precisely, the sparsified MDP is defined to have identical dynamics as the original MDP on the edges $(s,a)\longrightarrow s'$ that satisfy the critical condition 4.1. When instead an edge $(s,a)\longrightarrow s'$ fails to satisfy the critical condition 4.1, it is replaced with a transition $(s,a)\longrightarrow s^{\dagger}$ to an absorbing state $s^{\dagger}$.

Definition 4.1 (Sparsified MDP).

Let $s^{\dagger}$ be an absorbing state, i.e., such that $\mathbb{P}(s^{\dagger}\mid s^{\dagger},a)=1$ and $r(s^{\dagger},a)=0$ for all $a\in\mathcal{A}$. The state space of the sparsified MDP $\mathcal{M}^{\dagger}$ is defined as that of the original MDP with the addition of $s^{\dagger}$. The dynamics $\mathbb{P}^{\dagger}$ of the sparsified MDP are defined as

$$\mathbb{P}^{\dagger}(s'\mid s,a)=\begin{cases}\mathbb{P}(s'\mid s,a)&\text{if } N(s,a,s')\geq\Phi\\ 0&\text{if } N(s,a,s')<\Phi,\end{cases}\qquad \mathbb{P}^{\dagger}(s^{\dagger}\mid s,a)=\sum_{\substack{s'\neq s^{\dagger}\\ N(s,a,s')<\Phi}}\mathbb{P}(s'\mid s,a). \qquad (4.2)$$

For any deterministic reward function $r:\mathcal{S}\times\mathcal{A}\to[0,1]$, the reward function on the sparsified MDP is defined as $r^{\dagger}(s,a)=r(s,a)$; for simplicity we only consider deterministic reward functions.

The empirical sparsified MDP $\widehat{\mathcal{M}}^{\dagger}=(\mathcal{S}\cup\{s^{\dagger}\},\mathcal{A},\widehat{\mathbb{P}}^{\dagger})$ is defined in the same way, but using the empirical transition kernel $\widehat{\mathbb{P}}$ in eq. 4.2. The empirical sparsified MDP is used by our algorithm to design the exploratory policy, while the (population) sparsified MDP is used for its theoretical analysis. They are two fundamental concepts in this paper.
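The construction in definition 4.1, applied to the empirical kernel, amounts to zeroing out the transitions whose offline counts fall below $\Phi$ and redirecting the removed probability mass to the absorbing state. The following sketch builds the empirical sparsified kernel from the offline counts; the array layout, the handling of unvisited pairs, and the extra index reserved for $s^{\dagger}$ are implementation choices of this illustration, not prescriptions from the paper.

```python
import numpy as np

def empirical_sparsified_kernel(N_sas, Phi):
    """Build the empirical sparsified kernel from offline counts N(s, a, s') and threshold Phi.

    N_sas : (S, A, S) array of counts; the absorbing state s_dagger is appended as index S.
    Returns an (S + 1, A, S + 1) array whose rows are probability distributions.
    """
    S, A, _ = N_sas.shape
    N_sa = N_sas.sum(axis=2)                              # N(s, a)
    P_hat = np.zeros((S + 1, A, S + 1))
    for s in range(S):
        for a in range(A):
            if N_sa[s, a] == 0:
                P_hat[s, a, S] = 1.0                      # nothing is known: go to s_dagger
                continue
            keep = N_sas[s, a] >= Phi                     # edges meeting the critical condition 4.1
            P_hat[s, a, :S] = np.where(keep, N_sas[s, a] / N_sa[s, a], 0.0)
            P_hat[s, a, S] = 1.0 - P_hat[s, a, :S].sum()  # removed mass is sent to s_dagger
    P_hat[S, :, S] = 1.0                                  # s_dagger is absorbing
    return P_hat
```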

4.3 Offline design of experiments

Algorithm 2 RF-UCB$(\widehat{\mathcal{M}}^{\dagger},K_{ucb},\varepsilon,\delta)$
1: Input: $\delta\in(0,1)$, $\varepsilon>0$, number of episodes $K_{ucb}$, MDP $\widehat{\mathcal{M}}^{\dagger}$.
2: Initialize counters $n^{1}(s,a)=n^{1}(s,a,s')=0$ for all $(s,a,s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$.
3: for $k=1,2,\dots,K_{ucb}$ do
4:     Set $U^{k}_{H+1}(s,a)=0$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.
5:     for $h=H,H-1,\dots,1$ do
6:         for $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
7:             Calculate the empirical uncertainty $U^{k}_{h}(s,a)$ using eq. 4.4, where $\phi$ is from eq. 4.3.
8:         end for
9:         $\pi^{k}_{h}(s):=\mathop{\arg\max}_{a\in\mathcal{A}}U^{k}_{h}(s,a)$ for all $s\in\mathcal{S}$, and $\pi^{k}_{h}(s^{\dagger}):=$ any action.
10:     end for
11:     Set initial state $s_{1}^{k}=s_{1}$.
12:     for $h=1,2,\dots,H$ do  Sample $a_{h}^{k}\sim\pi_{h}^{k}(s_{h}^{k})$ and $s_{h+1}^{k}\sim\widehat{\mathbb{P}}^{\dagger}(\cdot\mid s_{h}^{k},a_{h}^{k})$.
13:     end for
14:     $n^{k+1}(s,a)=n^{k}(s,a)+\sum_{h\in[H]}\mathbb{I}[(s,a)=(s_{h}^{k},a_{h}^{k})]$.
15:     $n^{k+1}(s,a,s')=n^{k}(s,a,s')+\sum_{h\in[H]}\mathbb{I}[(s,a,s')=(s_{h}^{k},a_{h}^{k},s_{h+1}^{k})]$.
16: end for
17: $\pi_{ex}=\operatorname{Uniform}\{\pi^{k}\}_{k\in[K_{ucb}]}$.

In this section we describe the main sub-component of the algorithm, namely the sub-routine that uses the offline dataset $\mathcal{D}$ to compute the exploratory policy $\pi_{ex}$. The exploratory policy $\pi_{ex}$ is a mixture of the policies $\pi^{1},\pi^{2},\dots$ produced by a variant of the reward-free exploration algorithm of (Kaufmann et al., 2021; Ménard et al., 2021). Unlike prior literature, the reward-free algorithm is not interfaced with the real MDP $\mathcal{M}$, but rather simulated on the empirical sparsified MDP $\widehat{\mathcal{M}}^{\dagger}$. This avoids interacting with $\mathcal{M}$ with a reactive policy, but it introduces some bias that must be controlled. The overall procedure is detailed in algorithm 2. To be clear, no real-world samples are collected by algorithm 2; instead we use the word ‘virtual samples’ to refer to those generated from $\widehat{\mathcal{M}}^{\dagger}$.

At a high level, algorithm 2 implements value iteration using the empirical transition kernel $\widehat{\mathbb{P}}^{\dagger}$, with the exploration bonus defined in eq. 4.3 that replaces the reward function. The exploration bonus can be seen as implementing the principle of optimism in the face of uncertainty; however, the possibility of transitioning to an absorbing state with zero reward (due to the use of the absorbing state in the definition of $\widehat{\mathbb{P}}^{\dagger}$) implements the principle of pessimism.

This delicate interplay between optimism and pessimism is critical to the success of the overall procedure: while optimism encourages exploration, pessimism ensures that the exploration efforts are directed to the region of the state space that the agent actually knows how to navigate, and prevents the agent from getting ‘trapped’ in unknown regions. In fact, these latter regions could have combinatorial structures (Xiao et al., 2022) which cannot be explored with non-reactive policies.

More precisely, at the beginning of the $k$-th virtual episode in algorithm 2, $n^{k}(s,a)$ and $n^{k}(s,a,s')$ denote the counters for the number of virtual samples simulated from $\widehat{\mathcal{M}}^{\dagger}$ at each $(s,a)$ and $(s,a,s')$ tuple. We define the bonus function

$$\phi(x,\delta)=\frac{H}{x}\Big[\log\big(6H|\mathcal{S}||\mathcal{A}|/\delta\big)+|\mathcal{S}|\log\big(e(1+x/|\mathcal{S}|)\big)\Big], \qquad (4.3)$$

which is used to construct the empirical uncertainty function $U_{h}^{k}$, a quantity that serves as a proxy for the uncertainty of the value of any policy $\pi$ on the sparsified MDP. Specifically, for the $k$-th virtual episode, we set $U_{H+1}^{k}(s,a)=0$ for all $s\in\mathcal{S},a\in\mathcal{A}$. For $h\in[H]$, we further define:

$$U_{h}^{k}(s,a)=H\min\{1,\phi(n^{k}(s,a))\}+\widehat{\mathbb{P}}^{\dagger}(\cdot\mid s,a)^{\top}\big(\max_{a'}U_{h+1}^{k}(\cdot,a')\big). \qquad (4.4)$$

Note that the above bonus function takes a form similar to that of the bonus function in (Ménard et al., 2021). The $O(1/x)$ order is set to achieve the optimal sample complexity, and other works have also investigated other forms of bonus functions (Kaufmann et al., 2021). Finally, in lines 11 through 15 the current policy $\pi^{k}$—which is the greedy policy with respect to $U^{k}$—is simulated on the empirical sparsified MDP $\widehat{\mathcal{M}}^{\dagger}$, and the virtual counters are updated. It is crucial to note that the simulation takes place entirely offline, by generating virtual transitions from $\widehat{\mathcal{M}}^{\dagger}$. Upon termination of algorithm 2, the uniform mixture of the policies $\pi^{1},\pi^{2},\dots$ forms the non-reactive exploration policy $\pi_{ex}$, ensuring that the latter has wide ‘coverage’ over $\mathcal{M}^{\dagger}$.
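For concreteness, the sketch below mirrors the structure of algorithm 2 under simplifying assumptions: the empirical sparsified kernel is an $(|\mathcal{S}|+1)\times|\mathcal{A}|\times(|\mathcal{S}|+1)$ array with the absorbing state as the last index, the failure probability is fixed, ties are broken arbitrarily, and the uncertainty at the absorbing state is set to zero. All function and variable names are ours; this is an illustration of the virtual simulation loop rather than a faithful reimplementation.

```python
import numpy as np

def phi(x, H, S, A, delta):
    # Bonus function of eq. 4.3; phi(0) is treated as +inf so that unvisited pairs
    # receive the maximal per-step bonus H * min(1, phi) = H.
    if x == 0:
        return np.inf
    return (H / x) * (np.log(6 * H * S * A / delta) + S * np.log(np.e * (1 + x / S)))

def rf_ucb(P_hat_dagger, H, K_ucb, delta, seed=0):
    """Simulate reward-free exploration on the empirical sparsified MDP (no real samples)."""
    rng = np.random.default_rng(seed)
    Sp, A, _ = P_hat_dagger.shape             # Sp = |S| + 1, last index is the absorbing state
    S = Sp - 1
    n_sa = np.zeros((Sp, A))                  # virtual counters n^k(s, a)
    policies = []
    for k in range(K_ucb):
        # Backward recursion for the empirical uncertainty U^k_h of eq. 4.4.
        # (Algorithm 2 also tracks n^k(s, a, s'), which the bonus does not need; omitted here.)
        U_next = np.zeros((Sp, A))
        pi_k = np.zeros((H, Sp), dtype=int)
        for h in range(H - 1, -1, -1):
            bonus = np.array([[H * min(1.0, phi(n_sa[s, a], H, S, A, delta))
                               for a in range(A)] for s in range(Sp)])
            expect = (P_hat_dagger.reshape(Sp * A, Sp) @ U_next.max(axis=1)).reshape(Sp, A)
            U = bonus + expect
            U[S, :] = 0.0                     # no uncertainty is accumulated at the absorbing state
            pi_k[h] = U.argmax(axis=1)        # greedy policy with respect to U^k_h
            U_next = U
        policies.append(pi_k)
        # Virtual rollout on the empirical sparsified MDP; counters feed the next episode.
        s = 0                                 # index of the initial state s_1
        for h in range(H):
            a = pi_k[h, s]
            n_sa[s, a] += 1
            s = rng.choice(Sp, p=P_hat_dagger[s, a])
    return policies                           # pi_ex is the uniform mixture over these policies
```

The uniform mixture over the returned policies plays the role of the single non-reactive policy $\pi_{ex}$ that is deployed once in the online phase.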

4.4 Online and planning phase

Algorithm 2 implements line 2 of the procedure in algorithm 1 by finding the exploratory policy $\pi_{ex}$. After that, in line 3 of the interaction protocol the online dataset $\mathcal{D}'$ is generated by deploying $\pi_{ex}$ on the real MDP $\mathcal{M}$ to collect $K_{de}$ trajectories. Conceptually, the online dataset $\mathcal{D}'$ and the offline dataset $\mathcal{D}$ identify an updated empirical transition kernel $\widetilde{\mathbb{P}}$ and its sparsified version $\widetilde{\mathbb{P}}^{\dagger}$ (for any $(s,a,s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$, we define $\widetilde{\mathbb{P}}^{\dagger}(s'\mid s,a)=\frac{m(s,a,s')}{m(s,a)}$ if $N(s,a,s')\geq\Phi$ and $\widetilde{\mathbb{P}}^{\dagger}(s'\mid s,a)=0$ otherwise; for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ we set $\widetilde{\mathbb{P}}^{\dagger}(s^{\dagger}\mid s,a)=\frac{1}{m(s,a)}\sum_{s'\in\mathcal{S},\,N(s,a,s')<\Phi}m(s,a,s')$, and for any $a\in\mathcal{A}$ we set $\widetilde{\mathbb{P}}^{\dagger}(s^{\dagger}\mid s^{\dagger},a)=1$; here $N(s,a,s')$ is the counter of the initial offline data and $m(\cdot,\cdot)$ is the counter of the online data). Finally, in line 4 a reward function $r$ is received, the value iteration algorithm (see Appendix E) is invoked with $r$ as the reward function and $\widetilde{\mathbb{P}}^{\dagger}$ as the dynamics, and the near-optimal policy $\pi_{final}$ is produced. The use of the (updated) empirical sparsified dynamics $\widetilde{\mathbb{P}}^{\dagger}$ can be seen as incorporating the principle of pessimism under uncertainty due to the presence of the absorbing state.
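For concreteness, a minimal sketch of the planning phase is given below: it forms the updated sparsified kernel $\widetilde{\mathbb{P}}^{\dagger}$ from the offline counts $N(\cdot,\cdot,\cdot)$ and the online counts $m(\cdot,\cdot,\cdot)$ as in the parenthetical note above, and then runs backward value iteration with the received reward. Array shapes, the appended absorbing-state index, and the guard against division by zero are choices of this illustration.

```python
import numpy as np

def planning_phase(N_sas, m_sas, r, Phi, H):
    """Value iteration for pi_final on the updated sparsified kernel.

    N_sas : (S, A, S) offline counts, used only to decide which edges survive (N >= Phi)
    m_sas : (S, A, S) online counts, used to estimate the surviving transition probabilities
    r     : (S, A) deterministic reward received in the planning phase
    """
    S, A, _ = N_sas.shape
    m_sa = np.maximum(m_sas.sum(axis=2), 1)                     # guard against division by zero
    keep = N_sas >= Phi
    P_tilde = np.zeros((S + 1, A, S + 1))                       # last index is the absorbing state
    P_tilde[:S, :, :S] = np.where(keep, m_sas / m_sa[:, :, None], 0.0)
    P_tilde[:S, :, S] = 1.0 - P_tilde[:S, :, :S].sum(axis=2)    # discarded mass goes to s_dagger
    P_tilde[S, :, S] = 1.0                                      # s_dagger is absorbing
    r_dagger = np.vstack([r, np.zeros((1, A))])                 # zero reward at s_dagger

    V = np.zeros(S + 1)
    pi_final = np.zeros((H, S + 1), dtype=int)
    for h in range(H - 1, -1, -1):                              # backward value iteration
        Q = r_dagger + (P_tilde.reshape((S + 1) * A, S + 1) @ V).reshape(S + 1, A)
        pi_final[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi_final
```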

Our complete algorithm is reported in algorithm 3, and it can be seen as implementing the interaction protocol described in algorithm 1.

Algorithm 3 Reward-Free Non-reactive Policy Design (RF-NPD)
1: Input: offline dataset $\mathcal{D}$, target suboptimality $\varepsilon>0$, failure tolerance $\delta\in(0,1]$.
2: Construct the empirical sparsified MDP $\widehat{\mathcal{M}}^{\dagger}$.
3: Offline phase: run RF-UCB$(\widehat{\mathcal{M}}^{\dagger},K_{ucb},\varepsilon,\delta)$ to obtain the exploratory policy $\pi_{ex}$.
4: Online phase: deploy $\pi_{ex}$ on the MDP $\mathcal{M}$ for $K_{de}$ episodes to get the online dataset $\mathcal{D}'$.
5: Planning phase: receive the reward function $r$, construct $\widetilde{\mathcal{M}}^{\dagger}$ from the online dataset $\mathcal{D}'$, compute $\pi_{final}$ (which is the optimal policy on $\widetilde{\mathcal{M}}^{\dagger}$) using value iteration (Appendix E).
6: Return $\pi_{final}$.

5 Main Result

In this section, we present a performance bound on our algorithm, namely a bound on the sub-optimality of the value of the final policy $\pi_{final}$ when measured on the sparsified MDP $\mathcal{M}^{\dagger}$. The sparsified MDP arises because it is generally not possible to directly compete with the optimal policy using a non-reactive data collection strategy and a polynomial number of samples, due to the lower bound of Xiao et al. (2022); more details are given in Appendix C.

In order to state the main result, we let $K=K_{ucb}=K_{de}$, where $K_{ucb}$ and $K_{de}$ are the number of episodes for the offline simulation and the online interaction, respectively. Let $C$ be some universal constant, and choose the threshold in the definition of the sparsified MDP as

$$\Phi=6H^{2}\log\big(12H|\mathcal{S}|^{2}|\mathcal{A}|/\delta\big). \qquad (5.1)$$
Theorem 5.1.

For any $\varepsilon>0$ and $0<\delta<1$, if we let the number of online episodes be

$$K=\frac{CH^{2}|\mathcal{S}|^{2}|\mathcal{A}|}{\varepsilon^{2}}\operatorname{polylog}\Big(|\mathcal{S}|,|\mathcal{A}|,H,\frac{1}{\varepsilon},\frac{1}{\delta}\Big),$$

then with probability at least $1-\delta$, for any reward function $r$, the final policy $\pi_{final}$ returned by Algorithm 3 satisfies the bound

$$\max_{\pi\in\Pi}V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi\big)-V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi_{final}\big)\leq\varepsilon. \qquad (5.2)$$

The theorem gives a performance guarantee on the value of the policy $\pi_{final}$, which depends both on the initial coverage of the offline dataset $\mathcal{D}$ and on the number of samples collected in the online phase. The dependence on the coverage of the offline dataset is implicit through the definition of the (population) sparsified MDP $\mathcal{M}^{\dagger}$, which is determined by the counts $N(\cdot,\cdot)$.

In order to gain some intuition, we examine some special cases as a function of the coverage of the offline dataset.

Empty dataset Suppose that the offline dataset $\mathcal{D}$ is empty. Then the sparsified MDP identifies a multi-armed bandit at the initial state $s_{1}$, where any action $a$ taken from such state gives back the reward $r(s_{1},a)$ and leads to the absorbing state $s^{\dagger}$. In this case, our algorithm essentially designs an allocation strategy $\pi_{ex}$ that is uniform across all actions at the starting state $s_{1}$. Given enough online samples, $\pi_{final}$ converges to the action with the highest instantaneous reward on the multi-armed bandit induced by the start state. With no coverage from the offline dataset, the lower bound of Xiao et al. (2022) for non-reactive policies precludes finding an $\varepsilon$-optimal policy on the original MDP $\mathcal{M}$ unless exponentially many samples are collected.

Known connectivity graph On the other extreme, assume that the offline dataset contains enough information everywhere in the state-action space such that the critical condition 4.1 is satisfied for all $(s,a,s')$ tuples. Then the sparsified MDP and the real MDP coincide, i.e., $\mathcal{M}=\mathcal{M}^{\dagger}$, and so the final policy $\pi_{final}$ directly competes with the optimal policy $\pi^{*}$ for any given reward function in eq. 5.2. More precisely, the policy $\pi_{final}$ is $\varepsilon$-suboptimal on $\mathcal{M}$ if $\widetilde{O}(H^{2}|\mathcal{S}|^{2}|\mathcal{A}|/\varepsilon^{2})$ trajectories are collected in the online phase, a result that matches the lower bound for reward-free exploration of Jin et al. (2020b) up to log factors. However, we achieve such a result with a data collection strategy that is completely passive, one that is computed with the help of an initial offline dataset whose size $|\mathcal{D}|\approx\Phi\times|\mathcal{S}|^{2}|\mathcal{A}|=\widetilde{O}(H^{2}|\mathcal{S}|^{2}|\mathcal{A}|)$ need not depend on the final accuracy $\varepsilon$.

Partial coverage In more typical cases, the offline dataset has only partial coverage over the state-action space, and the critical condition 4.1 may be violated at certain state-action-successor state tuples. In this case, the connectivity graph of the sparsified MDP $\mathcal{M}^{\dagger}$ is a sub-graph of the original MDP $\mathcal{M}$ augmented with edges towards the absorbing state. The lack of coverage of the original dataset enters the guarantees of theorem 5.1 through the sparsified MDP. In this section, we ‘translate’ such guarantees into guarantees on $\mathcal{M}$, in which case the ‘lack of coverage’ is naturally represented by the concentrability coefficient

$$C^{*}=\sup_{s,a}d_{\pi}(s,a)/\mu(s,a),$$

see for example the papers (Munos and Szepesvári, 2008; Chen and Jiang, 2019) for background material on the concentrability coefficient. More precisely, we compute the sample complexity—in terms of online as well as offline samples—required for $\pi_{final}$ to be $\varepsilon$-suboptimal with respect to any comparator policy $\pi$, and so in particular with respect to the optimal policy $\pi_{*}$ on the “real” MDP $\mathcal{M}$. The next corollary is proved in section B.3.

Corollary 5.2.

Suppose that the offline dataset contains

$$\widetilde{O}\Big(\frac{H^{4}|\mathcal{S}|^{2}|\mathcal{A}|C^{*}}{\varepsilon}\Big)$$

samples and that additional

$$\widetilde{O}\Big(\frac{H^{3}|\mathcal{S}|^{2}|\mathcal{A}|}{\varepsilon^{2}}\Big)$$

online samples are collected during the online phase. Then with probability at least $1-\delta$, for any reward function $r$, the policy $\pi_{final}$ is $\varepsilon$-suboptimal with respect to any comparator policy $\pi$:

$$V_{1}\big(s_{1};\mathbb{P},r,\pi\big)-V_{1}\big(s_{1};\mathbb{P},r,\pi_{final}\big)\leq\varepsilon. \qquad (5.3)$$

The online sample size is equivalent to the one that arises in the statement of theorem 5.1 (expressed as number of online trajectories), and does not depend on the concentrability coefficient. The dependence on the offline dataset in theorem 5.1 is implicit in the definition of sparsified MDP; here we have made it explicit using the notion of concentrability.

Corollary 5.2 can be used to compare the achievable guarantees of our procedure with those of an offline algorithm, such as the minimax-optimal procedure detailed in (Xie et al., 2021b). The procedure described in (Xie et al., 2021b) achieves (5.3) with probability at least $1-\delta$ by using

$$\widetilde{O}\left(\frac{H^{3}|\mathcal{S}|C^{*}}{\varepsilon^{2}}+\frac{H^{5.5}|\mathcal{S}|C^{*}}{\varepsilon}\right) \qquad (5.4)$$

offline samples. (Technically, (Zhan et al., 2022) considers the non-homogeneous setting and expresses their result in terms of the number of trajectories. In obtaining eq. 5.4, we ‘removed’ an $H$ factor because our dynamics are homogeneous, and added it back to express the result in terms of the number of samples. However, notice that (Zhan et al., 2022) considers the reward-aware setting, which is simpler than the reward-free RL setting that we consider; this should add an additional $|\mathcal{S}|$ factor that is not accounted for in eq. 5.4, see the paper Jin et al. (2020b) for more details.) In terms of offline data, our procedure has a similar dependence on the various factors, but it depends on the desired accuracy $\varepsilon$ through $\widetilde{O}(1/\varepsilon)$ as opposed to $\widetilde{O}(1/\varepsilon^{2})$, which is typical for an offline algorithm. This implies that in the small-$\varepsilon$ regime, if sufficient online samples are collected, one can improve upon a fully offline procedure by collecting a number of additional online samples in a non-reactive way.

Finally, notice that one may improve upon an offline dataset by simply collecting more data from the distribution $\mu$, i.e., without performing experimental design. Compared to this latter case, our online sample complexity does not depend on the concentrability coefficient. Further discussion can be found in appendix B.

6 Proof

In this section we prove theorem 5.1, and defer the proofs of the supporting statements to Appendix A.

Let us define the comparator policy $\pi_{*}^{\dagger}$ used for the comparison in eq. 5.2 to be the (deterministic) policy with the highest value function on the sparsified MDP:

$$\pi_{*}^{\dagger}:=\mathop{\arg\max}_{\pi\in\Pi}V_{1}(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi).$$

We can bound the suboptimality using the triangle inequality as

$$\begin{aligned} V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi_{*}^{\dagger}\big)-V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi_{final}\big) &\leq \big|V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi_{*}^{\dagger}\big)-V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi_{*}^{\dagger}\big)\big| \\ &\quad+\underbrace{V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi_{*}^{\dagger}\big)-V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi_{final}\big)}_{\leq 0}+\big|V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi_{final}\big)-V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi_{final}\big)\big| \\ &\leq 2\sup_{\pi\in\Pi,\,r}\big|V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi\big)-V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi\big)\big|. \end{aligned}$$

The middle term after the first inequality is non-positive due to the optimality of $\pi_{final}$ on $\widetilde{\mathbb{P}}^{\dagger}$ and $r^{\dagger}$. It suffices to prove that for any policy $\pi$ and reward function $r$ the following statement holds with probability at least $1-\delta$:

$$\underbrace{\big|V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi\big)-V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi\big)\big|}_{\text{Estimation error}}\leq\frac{\varepsilon}{2}. \qquad (6.1)$$
Bounding the estimation error using the population uncertainty function

In order to prove eq. 6.1, we first define the population uncertainty function $X$, which is a scalar function over the state-action space. It represents the maximum estimation error in the value of any policy when it is evaluated on $\widetilde{\mathcal{M}}^{\dagger}$ instead of $\mathcal{M}^{\dagger}$. For any $(s,a)\in\mathcal{S}\times\mathcal{A}$, the uncertainty function is defined as $X_{H+1}(s,a):=0$ and, for $h\in[H]$,

$$X_{h}(s,a):=\min\Big\{H-h+1;\;9H\phi(m(s,a))+\Big(1+\frac{1}{H}\Big)\sum_{s'}\widetilde{\mathbb{P}}^{\dagger}(s'\mid s,a)\,\max_{a'}X_{h+1}(s',a')\Big\}.$$

We extend the definition to the absorbing state by letting $X_{h}(s^{\dagger},a)=0$ for any $h\in[H],a\in\mathcal{A}$. The summation $\sum_{s'}$ used above is over $s'\in\mathcal{S}\cup\{s^{\dagger}\}$, but since $X_{h}(s^{\dagger},a)=0$ for any $h\in[H],a\in\mathcal{A}$, it is equivalent to the summation over $s'\in\mathcal{S}$. Intuitively, $X_{h}(s,a)$ takes a form similar to the Bellman optimality equation. The additional $(1+1/H)$ factor and the additional term $9H\phi(m(s,a))$ quantify the uncertainty of the true $Q$-function on the sparsified MDP, and $9H\phi(m(s,a))$ converges to zero as the sample size goes to infinity. This definition of the uncertainty function and the following lemma follow closely the uncertainty function defined in (Ménard et al., 2021).
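For intuition only, the recursion above can be translated directly into code, as in the sketch below. Note that $X$ is an analysis device: it depends on the updated sparsified kernel $\widetilde{\mathbb{P}}^{\dagger}$ and the online counts $m(\cdot,\cdot)$, which the sketch simply treats as given hypothetical arrays.

```python
import numpy as np

def uncertainty_X(P_tilde_dagger, m_sa, H, delta):
    """Backward recursion for the population uncertainty X_h(s, a) defined above.

    P_tilde_dagger : (S + 1, A, S + 1) updated sparsified kernel (absorbing state last)
    m_sa           : (S + 1, A) online visit counts m(s, a)
    """
    Sp, A, _ = P_tilde_dagger.shape
    S = Sp - 1

    def phi(x):
        # Bonus function of eq. 4.3, with phi(0) treated as +inf (the min below then clips to H - h + 1).
        if x == 0:
            return np.inf
        return (H / x) * (np.log(6 * H * S * A / delta) + S * np.log(np.e * (1 + x / S)))

    X_next = np.zeros((Sp, A))                    # X_{H+1} = 0
    for h in range(H, 0, -1):                     # h = H, ..., 1
        bonus = np.vectorize(phi)(m_sa)
        expect = (P_tilde_dagger.reshape(Sp * A, Sp) @ X_next.max(axis=1)).reshape(Sp, A)
        X = np.minimum(H - h + 1, 9 * H * bonus + (1 + 1 / H) * expect)
        X[S, :] = 0.0                             # X_h(s_dagger, a) = 0 by definition
        X_next = X
    return X_next                                 # X_1
```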

The next lemma highlights the key property of the uncertainty function $X$, namely that for any reward function and any policy $\pi$ we can upper bound the estimation error via the uncertainty function at the initial time-step; it is proved in section A.2.1.

Lemma 6.1.

With probability $1-\delta$, for any reward function $r$ and any deterministic policy $\pi$, it holds that

$$\big|V_{1}(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi)-V_{1}(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi)\big|\leq\max_{a}X_{1}(s_{1},a)+C\sqrt{\max_{a}X_{1}(s_{1},a)}. \qquad (6.2)$$

The uncertainty function $X$ contains the inverse number of online samples $1/m(s,a)$ through $\phi(m(s,a))$, and so lemma 6.1 expresses the estimation error in eq. 6.1 as the maximum expected size of the confidence intervals $\sup_{\pi}\mathbb{E}_{\widetilde{\mathbb{P}}^{\dagger},(s,a)\sim\pi}\sqrt{1/m(s,a)}$, a quantity that directly depends on the number $m(\cdot,\cdot)$ of samples collected during the online phase.

Leveraging the exploration mechanics

Throughout this section, $C$ denotes some universal constant and may vary from line to line. Recall that the agent acts greedily with respect to the empirical uncertainty function $U$ to compute the exploratory policy $\pi_{ex}$. The empirical uncertainty is defined as $U_{H+1}^{k}(s,a)=0$ for any $k$, $s\in\mathcal{S}$, $a\in\mathcal{A}$, and

$$U_{h}^{k}(s,a)=H\min\{1,\phi(n^{k}(s,a))\}+\widehat{\mathbb{P}}^{\dagger}(\cdot\mid s,a)^{\top}\big(\max_{a'}U_{h+1}^{k}(\cdot,a')\big), \qquad (6.3)$$

where $n^{k}(s,a)$ is the counter of the number of times we encounter $(s,a)$ before the beginning of the $k$-th virtual episode in the simulation phase. Note that $U_{h}^{k}(s,a)$ takes a form similar to that of $X_{h}(s,a)$, except that $U_{h}^{k}(s,a)$ depends on the empirical transition probability $\widehat{\mathbb{P}}^{\dagger}$ while $X_{h}(s,a)$ depends on the true transition probability of the sparsified MDP. For the exploration scheme to be effective, $X$ and $U$ should be close in value, a concept which is at the core of this work and which we formally state below and prove in section A.2.2.

Lemma 6.2 (Bounding uncertainty function with empirical uncertainty functions).

With probability at least $1-\delta$, we have, for any $(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}$,

$$X_{h}(s,a)\leq\frac{C}{K}\sum_{k=1}^{K}U_{h}^{k}(s,a).$$

Notice that $X_{h}$ is the population uncertainty after the online samples have been collected, while $U_{h}^{k}$ is the corresponding empirical uncertainty, which varies during the offline simulation phase.

Rate of decrease of the estimation error

Combining lemmas 6.1 and 6.2 shows that (a function of) the agent’s uncertainty estimate $U$ upper bounds the estimation error in eq. 6.1. In order to conclude, we need to show that $U$ decreases on average at the rate $1/K$, a statement that we present below and prove in section A.2.3.

Lemma 6.3.

With probability at least $1-\delta$, we have

$$\frac{1}{K}\sum_{k=1}^{K}U_{1}^{k}(s,a)\leq\frac{H^{2}|\mathcal{S}|^{2}|\mathcal{A}|}{K}\operatorname{polylog}\Big(K,|\mathcal{S}|,|\mathcal{A}|,H,\frac{1}{\varepsilon},\frac{1}{\delta}\Big). \qquad (6.4)$$

Then, for any $\varepsilon>0$, if we take

$$K:=\frac{CH^{2}|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}}\big(\iota+|\mathcal{S}|\big)\operatorname{polylog}\Big(|\mathcal{S}|,|\mathcal{A}|,H,\frac{1}{\varepsilon},\frac{1}{\delta}\Big),$$

then with probability at least $1-\delta$, it holds that

$$\frac{1}{K}\sum_{k=1}^{K}U_{1}^{k}(s_{1},a)\leq\varepsilon^{2}.$$

After combining lemmas 6.1, 6.2 and 6.3, we see that the estimation error can be bounded as

$$\begin{aligned} \big|V_{1}\big(s_{1};\mathbb{P}^{\dagger},r^{\dagger},\pi\big)-V_{1}\big(s_{1};\widetilde{\mathbb{P}}^{\dagger},r^{\dagger},\pi\big)\big| &\leq \max_{a}X_{1}(s_{1},a)+C\sqrt{\max_{a}X_{1}(s_{1},a)} \\ &\leq C\max_{a}\left[\sqrt{\frac{1}{K}\sum_{k=1}^{K}U_{1}^{k}(s_{1},a)}+\frac{1}{K}\sum_{k=1}^{K}U_{1}^{k}(s_{1},a)\right] \\ &\leq C\big(\varepsilon+\varepsilon^{2}\big) \\ &\leq C\varepsilon \qquad (\text{for } 0<\varepsilon<\text{const}). \end{aligned}$$

Here, the constant $C$ may vary between lines. Rescaling the universal constant $C$ and the failure probability $\delta$, we obtain the upper bound in eq. 6.1 and hence complete the proof of the main result.

References

  • Afsar et al. (2022) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1–38, 2022.
  • Agarwal et al. (2022) Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Batched dueling bandits. In International Conference on Machine Learning, pages 89–110. PMLR, 2022.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • Bai et al. (2019) Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
  • Cesa-Bianchi et al. (2013) Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26, 2013.
  • Chen and Jiang (2019) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
  • Chen et al. (2022) Jinglin Chen, Aditya Modi, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. On the statistical efficiency of reward-free exploration in non-linear rl. arXiv preprint arXiv:2206.10770, 2022.
  • Dann and Brunskill (2015) Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
  • Dann et al. (2019) Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516, 2019.
  • Du et al. (2019) Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
  • Duan et al. (2021) Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning, pages 2892–2902. PMLR, 2021.
  • Duchi et al. (2018) John Duchi, Feng Ruan, and Chulhee Yun. Minimax bounds on stochastic batched convex optimization. In Conference On Learning Theory, pages 3065–3162. PMLR, 2018.
  • Durrett (2019) Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
  • Foster et al. (2021) Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919, 2021.
  • Gao et al. (2021) Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
  • Gao et al. (2019) Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
  • Gottesman et al. (2018) Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings, 2018.
  • Hazan et al. (2019) Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pages 2681–2691. PMLR, 2019.
  • Huang et al. (2022) Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, and Tie-Yan Liu. Towards deployment-efficient reinforcement learning: Lower bound and optimality. arXiv preprint arXiv:2202.06450, 2022.
  • Jiang and Huang (2020) Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747–2758, 2020.
  • Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Jin et al. (2020a) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020a.
  • Jin et al. (2020b) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning (ICML), 2020b.
  • Jin et al. (2020c) Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? arXiv preprint arXiv:2012.15085, 2020c.
  • Jonsson et al. (2020) Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253–1263, 2020.
  • Kallus and Uehara (2022) Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research, 2022.
  • Kaufmann et al. (2021) Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865–891. PMLR, 2021.
  • Kiran et al. (2021) B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2021.
  • Koenecke et al. (2020) Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Li et al. (2023) Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. Optimal reward-agnostic exploration in reinforcement learning. 2023.
  • Long et al. (2021) Jihao Long, Jiequn Han, and E Weinan. An l2 analysis of reinforcement learning in high dimensions with kernel and neural network approximation. arXiv preprint arXiv:2104.07794, 3, 2021.
  • Maurer and Pontil (2009) Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • Ménard et al. (2021) Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pages 7599–7608. PMLR, 2021.
  • Misra et al. (2020) Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961–6971. PMLR, 2020.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Munos and Szepesvári (2008) Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
  • Nachum et al. (2019) Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
  • Nguyen-Tang et al. (2022) Thanh Nguyen-Tang, Ming Yin, Sunil Gupta, Svetha Venkatesh, and Raman Arora. On instance-dependent bounds for offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2211.13208, 2022.
  • Puterman (1994) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994. ISBN 0471619779.
  • Qiao and Wang (2022) Dan Qiao and Yu-Xiang Wang. Near-optimal deployment efficiency in reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2210.00701, 2022.
  • Qiao et al. (2022) Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
  • Qiao et al. (2023) Dan Qiao, Ming Yin, and Yu-Xiang Wang. Logarithmic switching cost in reinforcement learning beyond linear mdps. arXiv preprint arXiv:2302.12456, 2023.
  • Qiu et al. (2021) Shuang Qiu, Jieping Ye, Zhaoran Wang, and Zhuoran Yang. On reward-free rl with kernel and neural function approximations: Single-agent mdp and markov game. In International Conference on Machine Learning, pages 8737–8747. PMLR, 2021.
  • Raccuglia et al. (2016) Paul Raccuglia, Katherine C Elbert, Philip DF Adler, Casey Falk, Malia B Wenny, Aurelio Mollo, Matthias Zeller, Sorelle A Friedler, Joshua Schrier, and Alexander J Norquist. Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601):73–76, 2016.
  • Rashidinejad et al. (2021) Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint arXiv:2103.12021, 2021.
  • Rashidinejad et al. (2022) Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, and Jiantao Jiao. Optimal conservative offline rl with general function approximation via augmented lagrangian. arXiv preprint arXiv:2211.00716, 2022.
  • Ruan et al. (2021) Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Talebi and Maillard (2018) Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In Algorithmic Learning Theory, pages 770–805. PMLR, 2018.
  • Uehara and Sun (2021) Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage, 2021.
  • Wagenmaker et al. (2022) Andrew J Wagenmaker, Yifang Chen, Max Simchowitz, Simon Du, and Kevin Jamieson. Reward-free rl is no harder than reward-aware rl in linear markov decision processes. In International Conference on Machine Learning, pages 22430–22456. PMLR, 2022.
  • Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. (2020a) Ruosong Wang, Simon S Du, Lin Yang, and Russ R Salakhutdinov. On reward-free reinforcement learning with linear function approximation. Advances in neural information processing systems, 33:17816–17826, 2020a.
  • Wang et al. (2020b) Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline rl with linear function approximation? arXiv preprint arXiv:2010.11895, 2020b.
  • Wang et al. (2021) Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34:13524–13536, 2021.
  • Xiao et al. (2022) Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, and Csaba Szepesvari. The curse of passive data collection in batch reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 8413–8438. PMLR, 2022.
  • Xie and Jiang (2020a) Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. arXiv preprint arXiv:2008.04990, 2020a.
  • Xie and Jiang (2020b) Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR, 2020b.
  • Xie et al. (2021a) Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. arXiv preprint arXiv:2106.06926, 2021a.
  • Xie et al. (2021b) Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34:27395–27407, 2021b.
  • Xiong et al. (2022) Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Liwei Wang, and Tong Zhang. Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent mdp and markov game. arXiv preprint arXiv:2205.15512, 2022.
  • Xu et al. (2018) Zhiyuan Xu, Jian Tang, Jingsong Meng, Weiyi Zhang, Yanzhi Wang, Chi Harold Liu, and Dejun Yang. Experience-driven networking: A deep reinforcement learning based approach. In IEEE INFOCOM 2018-IEEE conference on computer communications, pages 1871–1879. IEEE, 2018.
  • Yang et al. (2021) Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J. Ramadge. Accelerating safe reinforcement learning with constraint-mismatched policies, 2021.
  • Yin and Wang (2020) Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
  • Yin et al. Ming Yin, Yu-Xiang Wang, Yaqi Duan, and Mengdi Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism.
  • Yin et al. (2020) Ming Yin, Yu Bai, and Yu-Xiang Wang. Near optimal provable uniform convergence in off-policy evaluation for reinforcement learning. arXiv preprint arXiv:2007.03760, 2020.
  • Yin et al. (2022) Ming Yin, Mengdi Wang, and Yu-Xiang Wang. Offline reinforcement learning with differentiable function approximation is provably efficient. arXiv preprint arXiv:2210.00750, 2022.
  • Yu et al. (2021) Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 55(1):1–36, 2021.
  • Zanette (2023) Andrea Zanette. When is realizability sufficient for off-policy reinforcement learning? ICML 2023, 2023.
  • Zanette and Brunskill (2019) Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
  • Zanette and Wainwright (2022) Andrea Zanette and Martin J. Wainwright. Bellman residual orthogonalization for offline reinforcement learning, 2022. URL https://arxiv.org/abs/2203.12786.
  • Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. Advances in Neural Information Processing Systems, 33:11756–11766, 2020.
  • Zanette et al. (2021a) Andrea Zanette, Kefan Dong, Jonathan N Lee, and Emma Brunskill. Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731, 2021a.
  • Zanette et al. (2021b) Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34:13626–13640, 2021b.
  • Zhan et al. (2022) Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D Lee. Offline reinforcement learning with realizability and single-policy concentrability. arXiv preprint arXiv:2202.04634, 2022.
  • Zhang et al. (2022) Ruiqi Zhang, Xuezhou Zhang, Chengzhuo Ni, and Mengdi Wang. Off-policy fitted q-evaluation with differentiable function approximators: Z-estimation and inference theory. In International Conference on Machine Learning, pages 26713–26749. PMLR, 2022.
  • Zhang et al. (2020a) Xuezhou Zhang, Yuzhe Ma, and Adish Singla. Task-agnostic exploration in reinforcement learning. Advances in Neural Information Processing Systems, 33:11734–11743, 2020a.
  • Zhang et al. (2020b) Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020b.
  • Zhang et al. (2021) Zihan Zhang, Simon Du, and Xiangyang Ji. Near optimal reward-free reinforcement learning. In International Conference on Machine Learning, pages 12402–12412. PMLR, 2021.

Appendix A Proof of the main result

A.1 Definitions

In this section, we define some crucial concepts that will be used in the proof of the main result.

A.1.1 Sparsified MDP

First, we restate Definition 4.1 from the main text.

Definition A.1 (Sparsified MDP).

Let s^{\dagger} be an absorbing state, i.e., such that \P\left(s^{\dagger}\mid s^{\dagger},a\right)=1 and r(s^{\dagger},a)=0 for all a\in\cA. The state space in the sparsified MDP \mathcal{M}^{\dagger} is defined as that of the original MDP with the addition of s^{\dagger}. The dynamics \mathbb{P}^{\dagger} of the sparsified MDP are defined as

\mathbb{P}^{\dagger}(s^{\prime}\mid s,a)=\begin{cases}\mathbb{P}(s^{\prime}\mid s,a)&\text{if}\;N(s,a,s^{\prime})\geq\Phi\\ 0&\text{if}\;N(s,a,s^{\prime})<\Phi,\end{cases}\qquad\mathbb{P}^{\dagger}(s^{\dagger}\mid s,a)=\sum_{\begin{subarray}{c}s^{\prime}\neq s^{\dagger}\\ N(s,a,s^{\prime})<\Phi\end{subarray}}\P(s^{\prime}\mid s,a). (A.1)

For any deterministic reward function r:\cS\times\cA\to[0,1], the reward function on the sparsified MDP is defined as r^{\dagger}(s,a)=r(s,a); for simplicity we only consider deterministic reward functions.

In the offline phase of our algorithm, we simulate virtual episodes from the empirical version of the sparsified MDP (see Algorithm 2). We now formally define this MDP.

Definition A.2 (Empirical sparsified MDP).

Let s^{\dagger} be the absorbing state defined in the sparsified MDP. The state space in the empirical sparsified MDP \widehat{\mathcal{M}}^{\dagger} is defined as that of the original MDP with the addition of s^{\dagger}. The dynamics \emP^{\dagger} of the empirical sparsified MDP are defined as

\emP^{\dagger}(s^{\prime}\mid s,a)=\begin{cases}\frac{N(s,a,s^{\prime})}{N(s,a)}&\text{if}\;N(s,a,s^{\prime})\geq\Phi\\ 0&\text{if}\;N(s,a,s^{\prime})<\Phi,\end{cases}\qquad\emP^{\dagger}(s^{\dagger}\mid s,a)=\sum_{\begin{subarray}{c}s^{\prime}\neq s^{\dagger}\\ N(s,a,s^{\prime})<\Phi\end{subarray}}\frac{N(s,a,s^{\prime})}{N(s,a)}. (A.2)

For any deterministic reward function r:\cS\times\cA\to[0,1], the reward function on the empirical sparsified MDP is defined as r^{\dagger}(s,a)=r(s,a); for simplicity we only consider deterministic reward functions. Here, the counters N(s,a) and N(s,a,s^{\prime}) are the number of occurrences of (s,a) and (s,a,s^{\prime}) in the offline data, respectively.

Finally, in the planning phase, after we have interacted with the true environment for the online episodes, we construct a fine-estimated sparsified MDP, which is used to extract the optimal policy for a given reward function. We formally define it below.

Definition A.3 (Fine-estimated sparsified MDP).

Let s^{\dagger} be the absorbing state defined in the sparsified MDP. The state space in the fine-estimated sparsified MDP \widetilde{\mdp}^{\dagger} is defined as that of the original MDP with the addition of s^{\dagger}. The dynamics \widetilde{\mathbb{P}}^{\dagger} of the fine-estimated sparsified MDP are defined as

\widetilde{\P}^{\dagger}(s^{\prime}\mid s,a)=\begin{cases}\frac{m(s,a,s^{\prime})}{m(s,a)}&\text{if}\;N(s,a,s^{\prime})\geq\Phi\\ 0&\text{if}\;N(s,a,s^{\prime})<\Phi,\end{cases}\qquad\widetilde{\P}^{\dagger}(s^{\dagger}\mid s,a)=\sum_{\begin{subarray}{c}s^{\prime}\neq s^{\dagger}\\ N(s,a,s^{\prime})<\Phi\end{subarray}}\frac{m(s,a,s^{\prime})}{m(s,a)}. (A.3)

For any deterministic reward function r:\cS\times\cA\to[0,1], the reward function on the fine-estimated sparsified MDP is defined as r^{\dagger}(s,a)=r(s,a); for simplicity we only consider deterministic reward functions. Here, the counters N(s,a) and N(s,a,s^{\prime}) are the number of occurrences of (s,a) and (s,a,s^{\prime}) in the offline data, respectively, while m(s,a) and m(s,a,s^{\prime}) are the corresponding counters for the online episodes.
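To make the three constructions above concrete, the following sketch (a hypothetical tabular implementation; the function name and array layout are our own assumptions, not part of the algorithm) builds the transition kernels of Definitions A.1–A.3 from the offline counters N(s,a,s^{\prime}), the online counters m(s,a,s^{\prime}), and the threshold \Phi.

```python
import numpy as np

def sparsified_kernels(P, N_sas, m_sas, Phi):
    """Sketch of Definitions A.1-A.3 in tabular form.

    P      : (S, A, S) true transition kernel (used only for Definition A.1)
    N_sas  : (S, A, S) offline counts N(s, a, s')
    m_sas  : (S, A, S) online counts m(s, a, s')
    Phi    : integer threshold

    Returns three kernels of shape (S, A, S + 1); the last index is the absorbing
    state s_dagger, which collects the probability mass of all cut transitions.
    Rows for s_dagger itself are not represented (it is absorbing with zero reward).
    """
    S, A, _ = P.shape
    keep = N_sas >= Phi                          # transitions kept by the sparsification rule
    N_sa = np.maximum(N_sas.sum(axis=2), 1)      # N(s, a), clipped to avoid division by zero
    m_sa = np.maximum(m_sas.sum(axis=2), 1)      # m(s, a), clipped likewise

    def assemble(base):
        # keep the mass of retained transitions; everything else goes to s_dagger
        kernel = np.zeros((S, A, S + 1))
        kernel[:, :, :S] = np.where(keep, base, 0.0)
        kernel[:, :, S] = 1.0 - kernel[:, :, :S].sum(axis=2)
        return kernel

    P_dagger = assemble(P)                                  # Definition A.1
    P_hat_dagger = assemble(N_sas / N_sa[:, :, None])       # Definition A.2
    P_tilde_dagger = assemble(m_sas / m_sa[:, :, None])     # Definition A.3
    return P_dagger, P_hat_dagger, P_tilde_dagger
```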

A.1.2 High probability events

In this section, we define all the high-probability events needed for our theorem to hold. Specifically, we define

\event^{P}:=\left\{\forall(s,a,s^{\prime})\text{ s.t. }N(s,a,s^{\prime})\geq\Phi,\;\left|\emP^{\dagger}(s^{\prime}\mid s,a)-\P^{\dagger}(s^{\prime}\mid s,a)\right|\leq\sqrt{\frac{2\emP^{\dagger}(s^{\prime}\mid s,a)}{N(s,a)}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right)}+\frac{14}{3N(s,a)}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right)\right\};
\event^{2}:=\left\{\forall k\in\mathbb{N},(s,a)\in\cS\times\cA,\;n^{k}(s,a)\geq\frac{1}{2}\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)-H\ln\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right)\right\};
\event^{3}:=\left\{\forall k\in\mathbb{N},(s,a)\in\cS\times\cA,\;n^{k}(s,a)\leq 2\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)+H\ln\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right)\right\};
\event^{4}:=\left\{\forall(s,a)\in\cS\times\cA,\;\KL\left(\widetilde{\P}^{\dagger}(s,a);\P^{\dagger}(s,a)\right)\leq\frac{1}{m(s,a)}\left[\log\left(\frac{6\left|\cS\right|\left|\cA\right|}{\delta}\right)+\left|\cS\right|\log\left(e\left(1+\frac{m(s,a)}{\left|\cS\right|}\right)\right)\right]\right\};
\event^{5}:=\left\{\forall(s,a)\in\cS\times\cA,\;m(s,a)\geq\frac{1}{2}K_{de}\sum_{h\in[H]}w_{h}^{mix}(s,a)-H\ln\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right)\right\},

where

w^{mix}_{h}(s,a):=\frac{1}{K_{ucb}}\sum_{k=1}^{K_{ucb}}d^{\dagger}_{\pi^{k},h}(s,a).

Here, \KL(\cdot;\cdot) denotes the Kullback–Leibler divergence between two distributions. K_{ucb} is the number of episodes of the sub-routine RF-UCB and \pi^{i} is the policy executed at the i-th episode of RF-UCB (see Algorithm 2). K_{de} is the number of episodes executed in the online phase. n^{k}(s,a) is the counter of (s,a) before the beginning of the k-th episode in the offline phase and m(s,a) is the number of (s,a) samples in the online data (see the definitions in section A.1). We denote d_{\pi,h}(s,a), d_{\pi,h}^{\dagger}(s,a) and \widehat{d}_{\pi,h}^{\dagger}(s,a) as the occupancy measures of (s,a) at stage h under policy \pi on \P (the true transition dynamics), \P^{\dagger} (the transition dynamics of the sparsified MDP) and \emP^{\dagger} (the transition dynamics of the empirical sparsified MDP), respectively.
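As an illustration of the quantities entering \event^{5}, the following sketch (our own hypothetical helper, assuming the occupancy measures d^{\dagger}_{\pi^{k},h} are available as an array) computes the mixture occupancy w^{mix}_{h}(s,a) and the expected online counts K_{de}\sum_{h}w^{mix}_{h}(s,a) that m(s,a) concentrates around.

```python
import numpy as np

def mixture_occupancy(d_dagger, K_de):
    """Sketch of the quantities appearing in event E^5.

    d_dagger : array of shape (K_ucb, H, S, A) with
               d_dagger[k, h, s, a] = d^dagger_{pi^k, h}(s, a).
    K_de     : number of online (data-collection) episodes.
    """
    w_mix = d_dagger.mean(axis=0)           # w^mix_h(s, a) = (1/K_ucb) sum_k d^dagger_{pi^k, h}(s, a)
    expected_m = K_de * w_mix.sum(axis=0)   # K_de * sum_h w^mix_h(s, a), the target of event E^5
    return w_mix, expected_m
```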

For the events defined above, we have the following guarantee, which is proved in section A.3.1.

Lemma A.4.

For \delta>0, the event

\event:=\event^{P}\cap\event^{2}\cap\event^{3}\cap\event^{4}\cap\event^{5}

happens with probability at least 1-\delta.

A.1.3 Uncertainty function and empirical uncertainty function

In this section, we restate the definitions of the uncertainty function and the empirical uncertainty function, as well as some intermediate uncertainty functions that will be used frequently in the proof. First, we restate the definition of the bonus function.

Definition A.5 (Bonus Function).

We define

\phi\left(x\right)=\frac{H}{x}\left[\log\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right)+\left|\cS\right|\log\left(e\left(1+\frac{x}{\left|\cS\right|}\right)\right)\right]. (A.4)

Further, we define

\hphi(x)=\min\left\{1,H\phi(x)\right\}. (A.5)
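For reference, the bonus functions (A.4) and (A.5) can be transcribed directly; the sketch below (with S and A standing for |\cS| and |\cA|) is only meant to make the dependence on the arguments explicit.

```python
import numpy as np

def phi(x, H, S, A, delta):
    """Bonus function of eq. (A.4): (H / x) * [log(6 H S A / delta) + S log(e (1 + x / S))]."""
    return (H / x) * (np.log(6 * H * S * A / delta) + S * np.log(np.e * (1 + x / S)))

def hphi(x, H, S, A, delta):
    """Clipped bonus of eq. (A.5): min{1, H * phi(x)}."""
    return np.minimum(1.0, H * phi(x, H, S, A, delta))
```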

Next, we restate the definitions of the uncertainty function and the empirical uncertainty function. Notice that the uncertainty function depends on neither the reward function nor a specific policy \pi.

Definition A.6 (Uncertainty function).

We define, for any (h,s,a)\in[H]\times\cS\times\cA,

X_{H+1}(s,a):=0,
X_{h}(s,a):=\min\left\{H-h+1,\;9H\phi(m(s,a))+\left(1+\frac{1}{H}\right)\widetilde{\mathbb{P}}^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}X_{h+1}\left(\cdot,a^{\prime}\right)\right)\right\},

where we specify

\widetilde{\mathbb{P}}^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}\left(\cdot,a^{\prime}\right)\right\}\right):=\sum_{s^{\prime}}\widetilde{\mathbb{P}}^{\dagger}\left(s^{\prime}\mid s,a\right)\left(\max_{a^{\prime}}\left\{X_{h+1}\left(s^{\prime},a^{\prime}\right)\right\}\right).

In addition, we define X_{h}(s^{\dagger},a)=0 for any h\in[H],a\in\cA. Here, the notation \sum_{s^{\prime}} means summation over s^{\prime}\in\cS\cup\left\{s^{\dagger}\right\}; since X_{h}(s^{\dagger},a)=0 for any h\in[H],a\in\cA, this is equivalent to the summation over s^{\prime}\in\cS. \widetilde{\P}^{\dagger} is the transition probability of the fine-estimated sparsified MDP defined in section A.1.1.
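The recursion in Definition A.6 is a standard backward dynamic program; a minimal sketch (tabular arrays, hypothetical names, and the bonus \phi passed in as a function) could look as follows.

```python
import numpy as np

def uncertainty_X(P_tilde, m_sa, H, phi_fn):
    """Backward recursion of Definition A.6 (sketch).

    P_tilde : (S + 1, A, S + 1) fine-estimated sparsified kernel; last index is s_dagger
    m_sa    : (S + 1, A) online counts m(s, a) (the entry for s_dagger is irrelevant)
    phi_fn  : the bonus function of eq. (A.4), applied elementwise
    Returns X with X[h, s, a] for h = 1, ..., H + 1 and X[H + 1] = 0.
    """
    S1, A, _ = P_tilde.shape
    X = np.zeros((H + 2, S1, A))
    for h in range(H, 0, -1):
        next_val = X[h + 1].max(axis=1)          # max_{a'} X_{h+1}(s', a')
        backup = P_tilde @ next_val              # P_tilde(. | s, a)^T max_{a'} X_{h+1}(., a')
        X[h] = np.minimum(H - h + 1, 9 * H * phi_fn(m_sa) + (1 + 1 / H) * backup)
        X[h, -1, :] = 0.0                        # X_h(s_dagger, a) = 0 by definition
    return X
```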

Then, to help the reader follow the proof, we define some intermediate quantities, which also measure the uncertainty of the value estimates but may be reward- or policy-dependent.

Definition A.7 (Intermediate uncertainty function).

We define, for any (h,s,a)\in[H]\times\cS\times\cA and any deterministic policy \pi,

W_{H+1}^{\pi}(s,a,r):=0,
W_{h}^{\pi}(s,a,r):=\min\left\{H-h+1,\;\sqrt{\frac{8}{H^{2}}\hphi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+9H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)W_{h+1}^{\pi}(s,a,r)\right\},

where

\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)W_{h+1}^{\pi}(s,a,r):=\sum_{s^{\prime}}\sum_{a^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\pi_{h+1}\left(a^{\prime}\mid s^{\prime}\right)W_{h+1}^{\pi}\left(s^{\prime},a^{\prime},r\right).

In addition, we define W_{h}^{\pi}(s^{\dagger},a,r)=0 for any h\in[H],a\in\cA and any reward function r. Here, \widetilde{\P}^{\dagger} is the transition probability of the fine-estimated sparsified MDP defined in section A.1.1.

The intermediate function W is reward- and policy-dependent and can be used to upper bound the value estimation error. Further, we need a policy-dependent version of the uncertainty function, denoted X_{h}^{\pi}(s,a), as well as another quantity Y_{h}^{\pi}(s,a,r).

Definition A.8.

We define X_{H+1}^{\pi}(s,a)=Y_{H+1}^{\pi}(s,a,r)=0 for any \pi,r,s,a and

X_{h}^{\pi}(s,a):=\min\left\{H-h+1,\;9H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)X_{h+1}^{\pi}(s,a)\right\};
Y_{h}^{\pi}(s,a,r):=\sqrt{\frac{8}{H^{2}}\hphi\left(m(s,a)\right)\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+\left(1+\frac{1}{H}\right)\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)Y_{h+1}^{\pi}(s,a,r),

where

\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)X_{h+1}^{\pi}(s,a)=\sum_{s^{\prime}}\sum_{a^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\pi_{h+1}\left(a^{\prime}\mid s^{\prime}\right)X_{h+1}^{\pi}\left(s^{\prime},a^{\prime}\right);
\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)Y_{h+1}^{\pi}(s,a,r)=\sum_{s^{\prime}}\sum_{a^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\pi_{h+1}\left(a^{\prime}\mid s^{\prime}\right)Y_{h+1}^{\pi}\left(s^{\prime},a^{\prime},r\right).

In addition, we define X_{h}^{\pi}(s^{\dagger},a)=Y_{h}^{\pi}(s^{\dagger},a,r)=0 for any a\in\cA,h\in[H], any reward r, and any policy \pi. Here, \widetilde{\P}^{\dagger} is the transition probability of the fine-estimated sparsified MDP defined in section A.1.1.

Finally, we define the empirical uncertainty functions.

Definition A.9.

We define U_{H+1}^{k}(s,a)=0 for any k\in[K_{ucb}] and s\in\cS,a\in\cA. Further, we define

U_{h}^{k}(s,a)=H\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}+\emP^{\dagger}(s,a)^{\top}\left(\max_{a^{\prime}}U_{h+1}^{k}\left(\cdot,a^{\prime}\right)\right), (A.6)

where \phi is the bonus function defined in eq. A.4 and n^{k}(s,a) is the number of times we have encountered (s,a) before the beginning of the k-th episode when running the sub-routine RF-UCB in the offline phase.
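For intuition, the empirical recursion (A.6), together with the greedy choice of \pi^{k}_{h} with respect to U^{k}_{h} used by RF-UCB (see section A.2.3), can be sketched as below; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def empirical_uncertainty(P_hat, n_k, H, phi_fn):
    """One backward pass of the recursion (A.6) plus the greedy policy (sketch).

    P_hat  : (S + 1, A, S + 1) empirical sparsified kernel; last index is s_dagger
    n_k    : (S + 1, A) counters n^k(s, a) at the start of episode k, clipped to >= 1
    phi_fn : the bonus function of eq. (A.4)
    """
    S1, A, _ = P_hat.shape
    U = np.zeros((H + 2, S1, A))
    pi = np.zeros((H + 1, S1), dtype=int)
    for h in range(H, 0, -1):
        next_val = U[h + 1].max(axis=1)                              # max_{a'} U^k_{h+1}(s', a')
        U[h] = H * np.minimum(1.0, phi_fn(n_k)) + P_hat @ next_val   # eq. (A.6)
        pi[h] = U[h].argmax(axis=1)                                  # pi^k_h is greedy w.r.t. U^k_h
    return U, pi
```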

A.2 Proof of the key theorems and lemmas

A.2.1 The uncertainty function upper bounds the estimation error

Proof.

This proof closely follows the techniques in [Ménard et al., 2021], but here we consider a homogeneous MDP. We let \widetilde{d}^{\dagger}_{\pi,h}(s,a) be the probability of encountering (s,a) at stage h when running policy \pi on \widetilde{\P}^{\dagger}, starting from s_{1}. Now we assume \event happens and fix a reward function r and a policy \pi. From lemma A.14, we know

\left|V_{1}\left(s_{1};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{1}\left(s_{1};\P^{\dagger},r^{\dagger},\pi\right)\right|\leq\left|Q_{1}\left(s_{1},\pi_{1}(s_{1});\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-Q_{1}\left(s_{1},\pi_{1}(s_{1});\P^{\dagger},r^{\dagger},\pi\right)\right|\leq W_{1}^{\pi}(s_{1},\pi_{1}(s_{1}),r),

where W_{h}^{\pi}(s,a,r) is the intermediate uncertainty function defined in definition A.7. Here, we use the policy-dependent version of the uncertainty function, W^{\pi}(s,a,r), which will then be upper bounded using two other policy-dependent quantities, X_{h}^{\pi}(s,a) and Y_{h}^{\pi}(s,a,r). We define these policy-dependent uncertainty functions to upper bound the estimation error of a specific policy; but since we are considering a reward-free setting, which requires a low estimation error for any policy and reward, these quantities cannot be used directly in the algorithm to update the policy. Next, we claim

W_{h}^{\pi}(s,a,r)\leq X_{h}^{\pi}(s,a)+Y_{h}^{\pi}(s,a,r),

where X_{h}^{\pi}(s,a) and Y_{h}^{\pi}(s,a,r) are defined in definition A.8. This follows readily from the definitions of X_{h}^{\pi}(s,a), Y_{h}^{\pi}(s,a,r) and W_{h}^{\pi}(s,a,r). For h=H+1 it is trivial. Assuming it holds for h+1, the case of h follows from the definitions and the fact that \min\left\{x,y+z\right\}\leq\min\left\{x,y\right\}+z for any x,y,z\geq 0. Therefore, we have

\left|V_{1}\left(s_{1};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{1}\left(s_{1};\P^{\dagger},r^{\dagger},\pi\right)\right|\leq X_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))+Y_{1}^{\pi}\left(s_{1},\pi_{1}(s_{1}),r\right). (A.7)

Next, we eliminate the dependence on the policy \pi and obtain an upper bound on the estimation error using the policy-independent uncertainty function X_{h}(s,a). This is done by bounding Y_{1}^{\pi}(s_{1},\pi_{1}(s_{1}),r) in terms of X_{h}^{\pi}(s,a) and then upper bounding X_{h}^{\pi}(s,a) by X_{h}(s,a). From definition A.8, the Cauchy-Schwarz inequality and the fact that r\in[0,1], we have, for any reward function and any deterministic policy \pi,

Y_{1}^{\pi}(s_{1},\pi_{1}(s_{1}),r)
\leq\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\left(1+\frac{1}{H}\right)^{h-1}\sqrt{\frac{8}{H^{2}}\hphi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)} (induction, successively applying the definition of Y)
\leq\frac{e}{H}\sqrt{8\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}\cdot\sqrt{\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\hphi\left(m(s,a)\right)} (Cauchy-Schwarz inequality)
\leq\frac{e}{H}\sqrt{8H^{2}\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)}\cdot\sqrt{\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\hphi\left(m(s,a)\right)}
\leq e\sqrt{8\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\hphi\left(m(s,a)\right)}.

We notice that the right hand side of the last inequality above is the value of a specific reward function when running \pi on \widetilde{\P}^{\dagger}. Concretely, if the transition probability is \widetilde{\P}^{\dagger} and the reward at (s,a) is \hphi\left(m(s,a)\right), then the state value function at the initial state s_{1} is \sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\hphi\left(m(s,a)\right). This specific reward function is non-negative and uniformly bounded by one, so it holds that \sum_{s,a}\sum_{i=h}^{H}\widetilde{d}^{\dagger}_{\pi,i}(s,a)\hphi\left(m(s,a)\right)\leq H-h+1. Moreover, from the definitions of \phi and \hphi (eqs. A.4 and A.5), we know \hphi\left(m(s,a)\right)\leq H\phi\left(m(s,a)\right). Then, from the definition of X_{h}(s,a) in definition A.6, we apply an inductive argument to obtain

\sum_{s,a}\sum_{h=1}^{H}\widetilde{d}^{\dagger}_{\pi,h}(s,a)\hphi\left(m(s,a)\right)\leq X_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))

for any deterministic policy \pi. So we have

Y_{1}^{\pi}(s_{1},\pi_{1}(s_{1}),r)\leq 2\sqrt{2}e\sqrt{X_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))}.

Therefore, combining the last inequality with (A.7), we have

\left|V_{1}\left(s_{1};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{1}\left(s_{1};\P^{\dagger},r^{\dagger},\pi\right)\right|\leq X_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))+2\sqrt{2}e\sqrt{X_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))}.

From the definitions of X_{h}(s,a) (definition A.6) and X_{h}^{\pi}(s,a) (definition A.8), we can see that

X_{h}^{\pi}(s,a)\leq X_{h}(s,a)

for any (h,s,a) and any deterministic policy \pi. Therefore, we conclude that

\left|V_{1}\left(s_{1};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{1}\left(s_{1};\P^{\dagger},r^{\dagger},\pi\right)\right|\leq X_{1}(s_{1},\pi_{1}(s_{1}))+2\sqrt{2}e\sqrt{X_{1}(s_{1},\pi_{1}(s_{1}))}\leq\max_{a}X_{1}(s_{1},a)+2\sqrt{2}e\sqrt{\max_{a}X_{1}(s_{1},a)}. ∎

A.2.2 Proof of lemma 6.2

Proof.

It suffices to prove that, for any k\in[K],h\in[H],s\in\cS,a\in\cA, it holds that

X_{h}(s,a)\leq C\left(1+\frac{1}{H}\right)^{3(H-h)}U_{h}^{k}(s,a),

since the theorem follows by averaging the inequalities above and using the fact that \left(1+1/H\right)^{H}\leq e. For any fixed k, we prove the claim by induction on h. When h=H+1, both sides are zero by definition. Suppose the claim holds for h+1; we will prove that it also holds for h. We denote K_{ucb} as the number of virtual episodes in the offline phase. From lemma A.16, the decreasing property of \phi(\cdot) (lemma A.17) and lemma A.21, we have

X_{h}(s,a)\leq CH\phi\left(n^{K_{ucb}}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.16)
\leq CH\phi\left(n^{k}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.17)
\leq CH\phi\left(n^{k}(s,a)\right)+\left(1+\frac{1}{H}\right)^{3}\emP^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right). (Lemma A.21 and 1+2/H\leq(1+1/H)^{2})

The inductive hypothesis gives us, for any s^{\prime},

\left(\max_{a^{\prime}}\left\{X_{h+1}(s^{\prime},a^{\prime})\right\}\right)\leq C\left(1+\frac{1}{H}\right)^{3(H-h-1)}\left(\max_{a^{\prime}}U_{h+1}^{k}(s^{\prime},a^{\prime})\right),

which implies

X_{h}(s,a)\leq CH\phi\left(n^{k}(s,a)\right)+\left(1+\frac{1}{H}\right)^{3(H-h)}\emP^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{U_{h+1}^{k}(\cdot,a^{\prime})\right\}\right)\leq C\left(1+\frac{1}{H}\right)^{3(H-h)}\left[H\phi\left(n^{k}(s,a)\right)+\emP^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{U_{h+1}^{k}(\cdot,a^{\prime})\right\}\right)\right]. (A.8)

Again, by the definition of X_{h}(s,a), we have

X_{h}(s,a)\leq H-h+1\leq C\left(1+\frac{1}{H}\right)^{3(H-h)}H. (A.9)

Combining (A.8) and (A.9) with the definition of U_{h}^{k} (definition A.9), we prove the case of stage h and conclude the theorem by induction. ∎

A.2.3 Upper bounding the empirical uncertainty function (lemma 6.3)

Proof.

First, we are going to prove

\frac{1}{K}\sum_{k=1}^{K}U_{1}^{k}(s_{1},a)\leq\frac{16H^{2}|\mathcal{S}||\mathcal{A}|}{K}\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\log\left(e\left(1+\frac{KH}{|\mathcal{S}|}\right)\right)\right]\log\left(1+\frac{HK}{|\mathcal{S}||\mathcal{A}|}\right). (A.10)

From the definition (see Algorithm 2), an important property of the empirical uncertainty function is that \pi^{k}_{h} is greedy with respect to U^{k}_{h}. So we have

U_{h}^{k}(s,a)=H\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}+\emP^{\dagger}(s,a)^{\top}\left(\max_{a^{\prime}}U_{h+1}^{k}\left(\cdot,a^{\prime}\right)\right)=H\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}+\sum_{s^{\prime},a^{\prime}}\emP^{\dagger}(s^{\prime}\mid s,a)\pi^{k}_{h+1}\left(a^{\prime}\mid s^{\prime}\right)U_{h+1}^{k}\left(s^{\prime},a^{\prime}\right).

Therefore, we have

\frac{1}{K_{ucb}}\sum_{k=1}^{K_{ucb}}U_{1}^{k}(s_{1},a)\leq\frac{1}{K_{ucb}}\sum_{k=1}^{K_{ucb}}\max_{a}U_{1}^{k}(s_{1},a)
\leq\frac{H}{K_{ucb}}\sum_{k=1}^{K_{ucb}}\sum_{h=1}^{H}\sum_{(s,a)}\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\} (definition A.9)
\leq\frac{4H^{2}}{K_{ucb}}\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\log\left(e\left(1+\frac{K_{ucb}H}{|\mathcal{S}|}\right)\right)\right]\sum_{h=1}^{H}\sum_{k=1}^{K_{ucb}}\sum_{(s,a)}\frac{\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)}{\max\left\{1,\sum_{i=1}^{k-1}\sum_{h\in[H]}\widehat{d}^{\dagger}_{\pi^{i},h}(s,a)\right\}}. (Lemma A.20)

We apply Lemma D.7 and Jensen’s Inequality to obtain

\sum_{h\in[H]}\sum_{k=1}^{K_{ucb}}\sum_{(s,a)}\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)\frac{1}{\max\left\{1,\sum_{i=1}^{k-1}\sum_{h\in[H]}\widehat{d}^{\dagger}_{\pi^{i},h}(s,a)\right\}}
\leq 4\sum_{(s,a)}\log\left(1+\sum_{i=1}^{K_{ucb}}\sum_{h\in[H]}\widehat{d}^{\dagger}_{\pi^{i},h}(s,a)\right) (Lemma D.7)
\leq 4\left|\cS\right|\left|\cA\right|\log\left(1+\frac{1}{\left|\cS\right|\left|\cA\right|}\sum_{i=1}^{K_{ucb}}\sum_{h\in[H]}\sum_{(s,a)}\widehat{d}^{\dagger}_{\pi^{i},h}(s,a)\right) (Jensen's inequality)
\leq 4\left|\cS\right|\left|\cA\right|\log\left(1+\frac{HK_{ucb}}{\left|\cS\right|\left|\cA\right|}\right).

Inserting the last display into the upper bound above, we conclude the proof of the upper bound on U.

Then, to prove the sample complexity, we use lemma D.8. We take

B=\frac{64H^{2}\left|\cA\right|}{\varepsilon^{2}}\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\right]\quad\text{ and }\quad x=\frac{K_{ucb}}{\left|\cS\right|}

in Lemma D.8. From lemma D.8, we know there exists a universal constant C_{1} such that when x\geq C_{1}B\log(B)^{2}, it holds that B\log(1+x)\left(1+\log(1+x)\right)\leq x, which implies that whenever

K_{ucb}\geq\frac{64H^{2}\left|\cS\right|\left|\cA\right|}{\varepsilon^{2}}\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\right],

it holds that

\frac{64H^{2}\left|\cS\right|\left|\cA\right|}{K_{ucb}}\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\right]\log\left(1+\frac{K_{ucb}}{\left|\cS\right|}\right)\left[1+\log\left(1+\frac{K_{ucb}}{|\mathcal{S}|}\right)\right]\leq\varepsilon^{2}. (A.11)

For notational simplicity, we define

L=16H^{2}\left|\cS\right|\left|\cA\right|\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\right].

Comparing the left hand side of (A.11) and the right hand side of (A.10), we know

\text{Right hand side of (A.10)}=\frac{16H^{2}\left|\cS\right|\left|\cA\right|}{K_{ucb}}\log\left(1+\frac{HK_{ucb}}{\left|\cS\right|\left|\cA\right|}\right)\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|+|\mathcal{S}|\log\left(1+\frac{K_{ucb}H}{|\mathcal{S}|}\right)\right]
\leq\frac{16H^{2}\left|\cS\right|\left|\cA\right|}{K_{ucb}}\left[\log(H)+\log\left(1+\frac{K_{ucb}}{\left|\cS\right|}\right)\right]\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|+|\mathcal{S}|\log\left(1+\frac{K_{ucb}H}{|\mathcal{S}|}\right)\right]
\leq\frac{16H^{2}\left|\cS\right|\left|\cA\right|}{K_{ucb}}\left[\log(H)+\log\left(1+\frac{K_{ucb}}{\left|\cS\right|}\right)\right]\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\right]\left[1+\log(H)+\log\left(1+\frac{K_{ucb}}{|\mathcal{S}|}\right)\right]
=\frac{L}{K_{ucb}}\left[\log(H)+\log\left(1+\frac{K_{ucb}}{\left|\cS\right|}\right)\right]\left[1+\log(H)+\log\left(1+\frac{K_{ucb}}{|\mathcal{S}|}\right)\right].

It is easy to see that \log(H)\leq\log\left(1+K_{ucb}/|\mathcal{S}|\right), so the right hand side of the last inequality is upper bounded by

\frac{4L}{K_{ucb}}\log\left(1+\frac{K_{ucb}}{\left|\cS\right|}\right)\left[1+\log\left(1+\frac{K_{ucb}}{|\mathcal{S}|}\right)\right].

From (A.11), we know this is upper bounded by \varepsilon^{2} as long as K_{ucb}\geq 4L/\varepsilon^{2}. From the definition of L and equations (A.10) and (A.11), we therefore have that, when K_{ucb}\geq 4L/\varepsilon^{2}, it holds that

\frac{1}{K_{ucb}}\sum_{k=1}^{K_{ucb}}U_{1}^{k}(s_{1},a)\leq\text{ right hand side of (A.10) }\leq\text{ left hand side of (A.11) }\leq\varepsilon^{2}.

In conclusion, this admits a sample complexity of

K_{ucb}=\frac{CH^{2}\left|\cS\right|\left|\cA\right|}{\varepsilon^{2}}\left(\iota+\left|\cS\right|\right)\polylog\left(\left|\cS\right|,\left|\cA\right|,H,\frac{1}{\varepsilon},\frac{1}{\delta}\right). ∎
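As a rough numerical illustration (not part of the proof), one can evaluate the explicit threshold K_{ucb}\geq 4L/\varepsilon^{2} derived above for concrete problem sizes, ignoring the unspecified universal constant and the polylog factors; the parameter values below are arbitrary.

```python
import math

def K_ucb_threshold(H, S, A, eps, delta):
    """Evaluate 4 L / eps^2 with L = 16 H^2 S A [log(6 H S A / delta) + S]."""
    L = 16 * H**2 * S * A * (math.log(6 * H * S * A / delta) + S)
    return 4 * L / eps**2

# Example: H = 10, |S| = 20, |A| = 5, eps = 0.1, delta = 0.05
print(f"{K_ucb_threshold(10, 20, 5, 0.1, 0.05):.2e}")   # roughly 2e9 offline virtual episodes
```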

A.3 Omitted proofs

A.3.1 Proof for lemma A.4 (high probability event)

Lemma A.4 is proved by combining the lemmas below via a union bound. Throughout, we denote by N(s,a,s^{\prime}) the number of occurrences of (s,a,s^{\prime}) in the offline dataset.

Lemma A.10.

\event^{P} holds with probability at least 1-\delta/6.

Proof.

For any fixed (s,a,s^{\prime}) such that N(s,a,s^{\prime})\geq\Phi, we denote by n_{i}(s,a) the index of the i-th time we visit (s,a). For notational simplicity, we fix the state-action pair (s,a) and simply write n_{i}(s,a) as n_{i} for i=1,2,\ldots,N(s,a). Notice that the total number of visits N(s,a) is random, so our argument is based on conditional probability. We denote X_{i}=\mathbb{I}\left(s_{n_{i}}^{\prime}=s^{\prime}\right) as the indicator of whether the next state is s^{\prime} at the i-th visit to (s,a). From the data-generating mechanism and the definition of the reference MDP, we know that, conditional on the total number of visits N(s,a) and on all the indexes n_{1}\leq\ldots\leq n_{N(s,a)}, the variables X_{1},X_{2},\ldots,X_{N(s,a)} are independent Bernoulli random variables with success probability \P^{\dagger}\left(s^{\prime}\mid s,a\right). We denote \overline{X} as their arithmetic average. Using the empirical Bernstein inequality (Lemma D.3) and the fact that \frac{1}{N(s,a)}\sum_{i=1}^{N(s,a)}\left(X_{i}-\overline{X}\right)^{2}\leq\frac{1}{N(s,a)}\sum_{i=1}^{N(s,a)}X_{i}^{2}=\frac{1}{N(s,a)}\sum_{i=1}^{N(s,a)}X_{i}=\emP^{\dagger}(s^{\prime}\mid s,a), we have

\P\left(\event^{P}\mid n_{i}\left(1\leq i\leq N(s,a)\right),N(s,a)=n\right)\geq 1-\delta/6.

Integrating over the conditional probability, we conclude that the unconditional probability of the event \event^{P} is at least 1-\delta/6. ∎

Lemma A.11.

For \delta>0, it holds that \mathbb{P}\left(\event^{4}\right)\geq 1-\delta/6.

Proof.

Consider a fixed pair (s,a)\in\cS\times\cA. Let m(s,a) denote the total number of times the pair (s,a) was encountered during the online interaction phase. We define X_{i} as follows. For i\leq m(s,a), we let X_{i} be the subsequent state s^{\prime} when (s,a) was encountered the i-th time in the whole run of the algorithm. Otherwise, we let X_{i} be an independent sample from \P^{\dagger}(\cdot\mid s,a). By construction, \left\{X_{i}\right\} is a sequence of i.i.d. categorical random variables with distribution \P^{\dagger}\left(\cdot\mid s,a\right). We denote \widetilde{\P}^{{\dagger},i}_{X}=\frac{1}{i}\sum_{j=1}^{i}\delta_{X_{j}} as the empirical probability mass after i samples and \P^{\dagger}_{X} as the probability mass of X_{i}. Then, from Lemma D.4, with probability at least 1-\delta/6 it holds that, for any i\in\mathbb{N},

\KL\left(\widetilde{\P}^{{\dagger},i}_{X};\P^{\dagger}_{X}\right)\leq\frac{1}{i}\left[\log\left(\frac{6}{\delta}\right)+\left|\cS\right|\log\left(e\left(1+\frac{i}{\left|\cS\right|}\right)\right)\right],

which implies

\KL\left(\widetilde{\P}^{\dagger}_{X};\P^{\dagger}(s,a)\right)\leq\frac{1}{m(s,a)}\left[\log\left(\frac{6}{\delta}\right)+\left|\cS\right|\log\left(e\left(1+\frac{m(s,a)}{\left|\cS\right|}\right)\right)\right].

Using a union bound over (s,a)\in\cS\times\cA, we conclude. ∎

Lemma A.12 (Lower Bound on Counters).

For \delta>0, it holds that \P\left(\event^{2}\right)\geq 1-\delta/6 and \P\left(\event^{5}\right)\geq 1-\delta/6.

Proof.

We denote by n_{h}^{k}(s,a) the number of times we encounter (s,a) at stage h before the beginning of episode k. Concretely, we define n_{h}^{1}(s,a)=0 for any (h,s,a) and then set

n_{h}^{k+1}(s,a)=n_{h}^{k}(s,a)+\mathbb{I}\left[(s,a)=(s_{h}^{k},a_{h}^{k})\right].

We fix an (h,s,a)\in[H]\times\cS\times\cA, denote by \cF_{k} the sigma-field generated by the first k-1 episodes when running RF-UCB, and set X_{k}=\mathbb{I}\left[(s_{h}^{k},a_{h}^{k})=(s,a)\right]. Then X_{k} is \cF_{k+1}-measurable and \E\left[X_{k}\mid\cF_{k}\right]=\widehat{d}_{\pi^{k},h}^{\dagger}(s,a) is \cF_{k}-measurable. Taking W=\ln\left(6/\delta\right) in Lemma D.1 and applying a union bound, we know that with probability at least 1-\delta/6 the following event happens:

\forall k\in\mathbb{N},(h,s,a)\in[H]\times\cS\times\cA,\quad n^{k}_{h}(s,a)\geq\frac{1}{2}\sum_{i<k}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)-\ln\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right).

To finish the proof, it remains to note that the event above implies the desired event by summing over h\in[H] for each k\in\mathbb{N} and each (s,a)\in\cS\times\cA. For \event^{5}, the proof is almost identical. ∎

Lemma A.13 (Upper Bound on Counters).

For \delta>0, it holds that \P\left(\event^{3}\right)\geq 1-\delta/6.

Proof.

We denote by n_{h}^{k}(s,a) the number of times we encounter (s,a) at stage h before the beginning of episode k. Concretely, we define n_{h}^{1}(s,a)=0 for any (h,s,a) and then set

n_{h}^{k+1}(s,a)=n_{h}^{k}(s,a)+\mathbb{I}\left[(s,a)=(s_{h}^{k},a_{h}^{k})\right].

We fix an (h,s,a)\in[H]\times\cS\times\cA, denote by \cF_{k} the sigma-field generated by the first k-1 episodes when running RF-UCB, and set X_{k}=\mathbb{I}\left[(s_{h}^{k},a_{h}^{k})=(s,a)\right]. Then X_{k} is \cF_{k+1}-measurable and \E\left[X_{k}\mid\cF_{k}\right]=\widehat{d}_{\pi^{k},h}^{\dagger}(s,a) is \cF_{k}-measurable. Taking W=\ln\left(6/\delta\right) in Lemma D.2 and applying a union bound, we know that with probability at least 1-\delta/6 the following event happens:

\forall k\in\mathbb{N},(h,s,a)\in[H]\times\cS\times\cA,\quad n^{k}_{h}(s,a)\leq 2\sum_{i<k}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)+\ln\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right).

To finish the proof, it remains to note that the event above implies the desired event by summing over h\in[H] for each k\in\mathbb{N} and each (s,a)\in\cS\times\cA. ∎

A.3.2 Proof for lemma A.14 (property of intermediate uncertainty function)

Lemma A.14 (Intermediate Uncertainty Function).

We define, for any (h,s,a)\in[H]\times\cS\times\cA and any deterministic policy \pi,

W_{H+1}^{\pi}(s,a,r):=0,
W_{h}^{\pi}(s,a,r):=\min\left\{H-h+1,\;\sqrt{\frac{8}{H^{2}}\hphi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+9H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)W_{h+1}^{\pi}(s,a,r)\right\},

where

\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)W_{h+1}^{\pi}(s,a,r):=\sum_{s^{\prime}}\sum_{a^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\pi_{h+1}\left(a^{\prime}\mid s^{\prime}\right)W_{h+1}^{\pi}\left(s^{\prime},a^{\prime},r\right).

In addition, we define W_{h}^{\pi}(s^{\dagger},a,r)=0 for any h\in[H],a\in\cA and any reward function r. Then, under \event, for any (h,s,a)\in[H]\times\cS\times\cA, any deterministic policy \pi, and any deterministic reward function r (with its augmentation r^{\dagger}), it holds that

\left|Q_{h}\left(s,a;\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-Q_{h}\left(s,a;\P^{\dagger},r^{\dagger},\pi\right)\right|\leq W_{h}^{\pi}(s,a,r).
Proof.

We prove the claim by induction. For h=H+1, we know W_{H+1}^{\pi}(s,a,r)=0 by definition and the left hand side of the desired inequality also vanishes. Suppose the claim holds for h+1 and for any s,a. The Bellman equation gives us

\Delta_{h}:=Q_{h}\left(s,a;\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-Q_{h}\left(s,a;\P^{\dagger},r^{\dagger},\pi\right)\leq\underbrace{\sum_{s^{\prime}}\left(\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)-\P^{\dagger}\left(s^{\prime}\mid s,a\right)\right)V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)}_{I}+\underbrace{\sum_{s^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\left|V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right|}_{II}.

Under \event, we know that \KL\left(\widetilde{\P}^{\dagger}(\cdot\mid s,a);\P^{\dagger}(\cdot\mid s,a)\right)\leq\frac{1}{H}\phi\left(m(s,a)\right) for any (s,a), so from Lemma D.5 we have

\left|I\right|\leq\sqrt{\frac{2}{H}\phi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\P^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right)}+\frac{2}{3}(H-h)\frac{\phi\left(m(s,a)\right)}{H}\leq\sqrt{\frac{2}{H}\phi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\P^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right)}+\frac{2}{3}\phi\left(m(s,a)\right),

where \phi is the bonus defined in (A.4). Here, we let f in Lemma D.5 be V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right), so its range is H-h and the upper bound on the KL divergence is given by the high-probability event \event. We further apply Lemma D.6 to get

\operatorname{Var}_{s^{\prime}\sim\P^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right)\leq 2\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right)+4H\phi\left(m(s,a)\right)

and

\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right)\leq 2\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)+2H\sum_{s^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\left|V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right|.

Therefore, we have

\left|I\right|\leq\frac{2}{3}\phi\left(m(s,a)\right)+\left[\frac{8}{H}\phi\left(m(s,a)\right)\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)+8\phi\left(m(s,a)\right)^{2}+8\phi\left(m(s,a)\right)\sum_{s^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\left|V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-V_{h+1}\left(s^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right|\right]^{1/2}.

Using the facts that \sqrt{x+y}\leq\sqrt{x}+\sqrt{y} and \sqrt{xy}\leq\frac{x+y}{2} for positive x,y, and the definition of II, we obtain

\left|I\right|\leq\sqrt{\frac{8}{H}\phi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+6H\phi\left(m(s,a)\right)+\frac{1}{H}\left|II\right|,

which implies

\left|\Delta_{h}\right|\leq\sqrt{\frac{8}{H}\phi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+6H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left|II\right|.

If H\phi\left(m(s,a)\right)\leq 1, then by the definition of \hphi we have

\left|\Delta_{h}\right|\leq\sqrt{\frac{8}{H^{2}}\cdot H\phi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+6H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left|II\right|\leq\sqrt{\frac{8}{H^{2}}\hphi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+6H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left|II\right|.

Otherwise, if H\phi\left(m(s,a)\right)\geq 1, we have

\sqrt{\frac{8}{H}\phi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}\leq\sqrt{8H\phi\left(m(s,a)\right)}\leq 3H\phi\left(m(s,a)\right).

Therefore, in either case we have

\left|\Delta_{h}\right|\leq\sqrt{\frac{8}{H^{2}}\hphi\left(m(s,a)\right)\cdot\operatorname{Var}_{s^{\prime}\sim\widetilde{\P}^{\dagger}(s,a)}\left(V_{h+1}\left(s^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)\right)}+9H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\left|II\right|.

The induction hypothesis gives us that

\left|II\right|\leq\sum_{s^{\prime}}\sum_{a^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\pi_{h+1}\left(a^{\prime}\mid s^{\prime}\right)\left|Q_{h+1}\left(s^{\prime},a^{\prime};\widetilde{\P}^{\dagger},r^{\dagger},\pi\right)-Q_{h+1}\left(s^{\prime},a^{\prime};\P^{\dagger},r^{\dagger},\pi\right)\right|\leq\left(\widetilde{\P}^{\dagger}\pi_{h+1}\right)W_{h+1}^{\pi}(s,a,r).

Inserting this into the upper bound on \Delta_{h}, we prove the case of h by the definition of W_{h}^{\pi}(s,a,r) and the simple fact that \left|\Delta_{h}\right|\leq H-h+1. ∎

A.3.3 Proof for lemma A.15 and lemma A.16 (properties of uncertainty function)

In this section, we prove some lemmas for upper bounding the uncertainty function X_{h}(s,a). We first provide a basic upper bound for X_{h}(s,a). The uncertainty function is defined in definition A.6.

Lemma A.15.

Under the high-probability event \event (defined in section A.1.2), for all (h,s,a)\in[H]\times\cS\times\cA it holds that

X_{h}(s,a)\leq 11H\min\left\{1,\phi\left(m(s,a)\right)\right\}+\left(1+\frac{2}{H}\right)\P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right).

In addition, for s=s^{\dagger} the above upper bound holds trivially by definition.

Proof.

For any fixed (h,s,a)\in[H]\times\cS\times\cA, from the definition of the uncertainty function we know

X_{h}(s,a)\leq 9H\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\widetilde{\P}^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right).

To prove the lemma, it suffices to bound the difference between \widetilde{\P}^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right) and \P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right). Under \event (which happens with probability at least 1-\delta), it holds that \KL\left(\widetilde{\P}^{\dagger}\left(\cdot\mid s,a\right);\P^{\dagger}\left(\cdot\mid s,a\right)\right)\leq\frac{1}{H}\phi\left(m(s,a)\right). Applying Lemma D.5 and a simple bound on the variance gives us

\left|\widetilde{\P}^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right)-\P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right)\right|
\leq\sqrt{\frac{2}{H}\operatorname{Var}_{s^{\prime}\sim\P^{\dagger}\left(s,a\right)}\left(\max_{a^{\prime}}X_{h+1}\left(s^{\prime},a^{\prime}\right)\right)\phi\left(m(s,a)\right)}+\frac{2}{3}\phi\left(m(s,a)\right)
\leq\sqrt{\left[\frac{2}{H}\P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})^{2}\right)\right]\phi\left(m(s,a)\right)}+\frac{2}{3}\phi\left(m(s,a)\right)
\leq\sqrt{\left[\frac{2}{H}\P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right)\right]\cdot H\phi\left(m(s,a)\right)}+\frac{2}{3}\phi\left(m(s,a)\right)
\leq\frac{1}{H}\P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right)+2H\phi\left(m(s,a)\right).

The last line uses \sqrt{ab}\leq\frac{a+b}{2} for any positive a,b, together with the fact that \frac{1}{2}H\phi(m(s,a))+\frac{2}{3}\phi(m(s,a))\leq 2H\phi(m(s,a)). Inserting this bound into the definition of X_{h}(s,a), we obtain

X_{h}(s,a)\leq 11H\phi\left(m(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right).

Noticing that X_{h}(s,a)\leq H-h+1\leq 11H, we conclude. ∎

Lemma A.16.

There exists a universal constant C\geq 1 such that, under the high-probability event \event and when 3K_{ucb}\geq K_{de}, for any (h,s,a)\in[H]\times\cS\times\cA it holds that

X_{h}(s,a)\leq CH\frac{K_{ucb}}{K_{de}}\phi\left(n^{K_{ucb}}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right).

In addition, for s=s^{\dagger} the above upper bound holds trivially by definition.

Proof.

Here, the universal constant may vary from line to line. Under \event,\event, we have

Xh(s,a)\displaystyle X_{h}(s,a)
\displaystyle\leq CHϕ(m(s,a))+(1+1H)~(s,a)(maxaXh+1(,a)).\displaystyle CH\phi\left(m(s,a)\right)+\left(1+\frac{1}{H}\right)\widetilde{\P}^{\dagger}\left(\cdot\mid s,a\right)^{\top}\left(\max_{a^{\prime}}X_{h+1}(\cdot,a^{\prime})\right). (definition)
\displaystyle\leq CHmin{1,ϕ(m(s,a))}+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle CH\min\left\{1,\phi\left(m(s,a)\right)\right\}+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.15)
\displaystyle\leq CHϕ(Kdeh[H]whmix(s,a))+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle CH\phi\bigg{(}K_{de}\sum_{h\in[H]}w^{mix}_{h}(s,a)\bigg{)}+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.18)
=\displaystyle= CHϕ(KdeKucbk=1Kucbh[H]dπk,h(s,a))+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle CH\phi\left(\frac{K_{de}}{K_{ucb}}\sum_{k=1}^{K_{ucb}}\sum_{h\in[H]}d^{\dagger}_{\pi^{k},h}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (definition)
\displaystyle\leq CHϕ(Kde3Kucbk=1Kucbh[H]d^πk,h(s,a))+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle CH\phi\left(\frac{K_{de}}{3K_{ucb}}\sum_{k=1}^{K_{ucb}}\sum_{h\in[H]}\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.22 and A.17)
\displaystyle\leq CHKucbKdeϕ(k=1Kucbh[H]d^πk,h(s,a))+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle CH\frac{K_{ucb}}{K_{de}}\phi\left(\sum_{k=1}^{K_{ucb}}\sum_{h\in[H]}\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.17)

Since X_{h}(s,a)\leq H-h+1\leq CH\frac{K_{ucb}}{K_{de}}, the last display also holds with \phi replaced by \min\{1,\phi\}:

Xh(s,a)\displaystyle X_{h}(s,a) CHKucbKdemin{1,ϕ(k=1Kucbh[H]d^πk,h(s,a))}+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle\leq CH\frac{K_{ucb}}{K_{de}}\min\left\{1,\phi\left(\sum_{k=1}^{K_{ucb}}\sum_{h\in[H]}\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)\right)\right\}+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right)
CHKucbKdeϕ(nKucb(s,a))+(1+2H)(s,a)(maxa{Xh+1(,a)})\displaystyle\leq CH\frac{K_{ucb}}{K_{de}}\phi\left(n^{K_{ucb}}(s,a)\right)+\left(1+\frac{2}{H}\right)\P^{\dagger}(\cdot\mid s,a)^{\top}\left(\max_{a^{\prime}}\left\{X_{h+1}(\cdot,a^{\prime})\right\}\right) (Lemma A.19)

Therefore, we conclude. ∎

A.3.4 Proof for lemma A.17, lemma A.18, lemma A.19, lemma A.20 (properties of bonus function)

For our bonus function, we have the following basic property.

Lemma A.17.

\phi(x) is non-increasing for x>0. For any 0<\alpha\leq 1, we have \phi(\alpha x)\leq\frac{1}{\alpha}\phi(x).

Proof.

We define f(x):=1x[C+Dlog(e(1+xD))],f(x):=\frac{1}{x}\left[C+D\log\left(e\left(1+\frac{x}{D}\right)\right)\right], where C,D1.C,D\geq 1. Then, for x>0,x>0,

f(x)=C+Dlog(e(1+xD))x2+Dx(D+x)C+Dlog(1+xD)x20.f^{\prime}(x)=-\frac{C+D\log\left(e\left(1+\frac{x}{D}\right)\right)}{x^{2}}+\frac{D}{x(D+x)}\leq-\frac{C+D\log\left(1+\frac{x}{D}\right)}{x^{2}}\leq 0.

Taking C=\log\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right) and D=\left|\cS\right|, we conclude the first claim. The second claim follows because, for 0<\alpha\leq 1, the logarithm is increasing, so \phi(\alpha x)\leq\frac{1}{\alpha x}\left[C+D\log\left(e\left(1+\frac{x}{D}\right)\right)\right]=\frac{1}{\alpha}\phi(x). ∎
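As a quick numerical sanity check of both claims, the following minimal sketch (in Python; the values of C and D below are illustrative placeholders rather than the constants used by the algorithm) evaluates f on a grid and on random pairs (x,\alpha).

import numpy as np

# Sanity check for Lemma A.17 with f(x) = (C + D*log(e*(1 + x/D))) / x.
# C and D are illustrative placeholders; any C, D >= 1 behaves the same way.
C, D = 5.0, 10.0

def f(x):
    return (C + D * np.log(np.e * (1.0 + x / D))) / x

xs = np.linspace(0.1, 1e4, 100_000)
assert np.all(np.diff(f(xs)) <= 1e-12), "f should be non-increasing on x > 0"

# Second claim: f(alpha * x) <= f(x) / alpha for any 0 < alpha <= 1.
rng = np.random.default_rng(0)
for _ in range(1_000):
    x, alpha = rng.uniform(0.1, 1e4), rng.uniform(1e-3, 1.0)
    assert f(alpha * x) <= f(x) / alpha + 1e-12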

Lemma A.18.

Under \event,\event, we have for any (s,a)\cS×\cA,(s,a)\in\cS\times\cA,

min{1,ϕ(m(s,a))}4ϕ(Kdeh=1Hwhmix(s,a)).\min\left\{1,\phi\left(m(s,a)\right)\right\}\leq 4\phi\left(K_{de}\sum_{h=1}^{H}w_{h}^{mix}(s,a)\right).
Proof.

For fixed (s,a)\in\cS\times\cA, when H\ln\left(6H\left|\cS\right|\left|\cA\right|/\delta\right)\leq\frac{1}{4}K_{de}\sum_{h\in[H]}w_{h}^{mix}(s,a), the event \event^{5} implies m(s,a)\geq\frac{1}{4}K_{de}\sum_{h\in[H]}w_{h}^{mix}(s,a). From Lemma A.17, we know

ϕ(m(s,a))ϕ(14h[H]Kdewhmix(s,a))4ϕ(h[H]Kdewhmix(s,a)).\phi\left(m(s,a)\right)\leq\phi\left(\frac{1}{4}\sum_{h\in[H]}K_{de}w_{h}^{mix}(s,a)\right)\leq 4\phi\left(\sum_{h\in[H]}K_{de}w_{h}^{mix}(s,a)\right).

When Hln(6H|\cS||\cA|/δ)>14(Kdeh[H]whmix(s,a)),H\ln\left(6H\left|\cS\right|\left|\cA\right|/\delta\right)>\frac{1}{4}\left(K_{de}\sum_{h\in[H]}w_{h}^{mix}(s,a)\right), simple algebra shows that

min{1,ϕ(m(s,a))}14Hln(6H|\cS||\cA|/δ)Kdeh[H]whmix(s,a)4ϕ(Kdeh[H]whmix(s,a)).\min\left\{1,\phi\left(m(s,a)\right)\right\}\leq 1\leq\frac{4H\ln\left(6H\left|\cS\right|\left|\cA\right|/\delta\right)}{K_{de}\sum_{h\in[H]}w^{mix}_{h}(s,a)}\leq 4\phi\left(K_{de}\sum_{h\in[H]}w_{h}^{mix}(s,a)\right).

Therefore, we conclude. ∎

Lemma A.19.

Under \event,\event, we have for any (s,a)\cS×\cA,(s,a)\in\cS\times\cA,

min{1,ϕ(k=1Kucbh=1Hd^πk,h(s,a))}4ϕ(nKucb(s,a)).\min\left\{1,\phi\left(\sum_{k=1}^{K_{ucb}}\sum_{h=1}^{H}\widehat{d}^{{\dagger}}_{\pi^{k},h}(s,a)\right)\right\}\leq 4\phi\left(n^{K_{ucb}}(s,a)\right).
Proof.

Under the event \event^{3}, it holds that for any (s,a)\in\cS\times\cA,

k=1Kucbh=1Hd^πk,h(s,a)12nKucb(s,a)12Hln(6H|\cS||\cA|δ).\sum_{k=1}^{K_{ucb}}\sum_{h=1}^{H}\widehat{d}_{\pi^{k},h}^{\dagger}(s,a)\geq\frac{1}{2}n^{K_{ucb}}(s,a)-\frac{1}{2}H\ln\left(\frac{6H\left|\cS\right|\left|\cA\right|}{\delta}\right).

If Hln(6H|\cS||\cA|/δ)12nKucb(s,a),H\ln\left(6H\left|\cS\right|\left|\cA\right|/\delta\right)\leq\frac{1}{2}n^{K_{ucb}}(s,a), then we have k=1Kucbh=1Hd^πk,h(s,a)14nKucb(s,a),\sum_{k=1}^{K_{ucb}}\sum_{h=1}^{H}\widehat{d}_{\pi^{k},h}^{\dagger}(s,a)\geq\frac{1}{4}n^{K_{ucb}}(s,a), which implies

ϕ(k=1Kucbh=1Hd^πk,h(s,a))ϕ(14nKucb(s,a))4ϕ(nKucb(s,a)).\phi\left(\sum_{k=1}^{K_{ucb}}\sum_{h=1}^{H}\widehat{d}^{{\dagger}}_{\pi^{k},h}(s,a)\right)\leq\phi\left(\frac{1}{4}n^{K_{ucb}}(s,a)\right)\leq 4\phi\left(n^{K_{ucb}}(s,a)\right).

Otherwise, if Hln(6H|\cS||\cA|/δ)>12nKucb(s,a),H\ln\left(6H\left|\cS\right|\left|\cA\right|/\delta\right)>\frac{1}{2}n^{K_{ucb}}(s,a), then we have

\min\left\{1,\phi\left(\sum_{k=1}^{K_{ucb}}\sum_{h=1}^{H}\widehat{d}^{\dagger}_{\pi^{k},h}(s,a)\right)\right\}\leq 1\leq\frac{2H\ln\left(6H\left|\cS\right|\left|\cA\right|/\delta\right)}{n^{K_{ucb}}(s,a)}\leq 4\phi\left(n^{K_{ucb}}(s,a)\right).

Combining the two cases above, we conclude. ∎

Lemma A.20.

Under \event,\event, we have for any k[Kucb]k\in[K_{ucb}] and any (s,a)\cS×\cA,(s,a)\in\cS\times\cA,

min{1,ϕ(nk(s,a))}\displaystyle\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}
\displaystyle\leq 4Hmax{1,i<kh[H]d^πi,h(s,a)}[log(6H|𝒮||𝒜|δ)+|𝒮|log(e(1+i<kh[H]d^πi,h(s,a)|𝒮|))].\displaystyle\frac{4H}{\max\left\{1,\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)\right\}}\left[\log\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right)+|\mathcal{S}|\log\left(e\left(1+\frac{\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)}{|\mathcal{S}|}\right)\right)\right].
Proof.

Under \event2,\event^{2}, it holds that

nk(s,a)12i<kh[H]d^πi,h(s,a)Hln(6H|𝒮||𝒜|δ).n^{k}(s,a)\geq\frac{1}{2}\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)-H\ln\left(\frac{6H|\mathcal{S}||\mathcal{A}|}{\delta}\right).

If Hln(6H|𝒮||𝒜|/δ)14i<kh[H]d^πi,h(s,a),H\ln\left(6H|\mathcal{S}||\mathcal{A}|/\delta\right)\leq\frac{1}{4}\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a), then nk(s,a)14i<kh[H]d^πi,h(s,a)n^{k}(s,a)\geq\frac{1}{4}\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a) and hence,

min{1,ϕ(nk(s,a))}ϕ(nk(s,a))ϕ(14i<kh[H]d^πi,h(s,a))4ϕ(i<kh[H]d^πi,h(s,a)).\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}\leq\phi\left(n^{k}(s,a)\right)\leq\phi\left(\frac{1}{4}\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)\right)\leq 4\phi\left(\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)\right).

This bound coincides with the right-hand side of the lemma, because \sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)\geq 4H\ln\left(6H|\mathcal{S}||\mathcal{A}|/\delta\right)\geq 1, so taking the maximum changes nothing. Otherwise, if H\ln\left(6H|\mathcal{S}||\mathcal{A}|/\delta\right)>\frac{1}{4}\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a), then

\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}\leq 1\leq\frac{4H\ln\left(6H|\mathcal{S}||\mathcal{A}|/\delta\right)}{\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)}.

Since 14Hln(6H|𝒮||𝒜|/δ),1\leq 4H\ln\left(6H|\mathcal{S}||\mathcal{A}|/\delta\right), we have

\min\left\{1,\phi\left(n^{k}(s,a)\right)\right\}\leq 1\leq\frac{4H\ln\left(6H|\mathcal{S}||\mathcal{A}|/\delta\right)}{\max\left\{1,\sum_{i<k}\sum_{h\in[H]}\widehat{d}_{\pi^{i},h}^{\dagger}(s,a)\right\}}\leq\text{RHS}

The last inequality comes from simple algebra. Therefore, we conclude. ∎

A.3.5 Proof for lemma A.21 and lemma A.22 (properties of empirical sparsified MDP)

In this section, we state two important properties of the empirical sparsified MDP and prove them. We remark that we do not include (ss,a)\P\left(s^{\dagger}\mid s,a\right) in these two lemmas, since by definition s\cS.s^{\dagger}\notin\cS.

Lemma A.21 (Multiplicative Accuracy).

We set

Φ6H2log(12|\cS|2|\cA|δ).\Phi\geq 6H^{2}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right).

Then, when H2,H\geq 2, under \eventP,\event^{P}, we have for any (s,a,s)\cS×\cA×\cS,(s,a,s^{\prime})\in\cS\times\cA\times\cS,

(11H)\emP(ss,a)(ss,a)(1+1H)\emP(ss,a).\left(1-\frac{1}{H}\right)\emP^{\dagger}(s^{\prime}\mid s,a)\leq\P^{\dagger}(s^{\prime}\mid s,a)\leq\left(1+\frac{1}{H}\right)\emP^{\dagger}(s^{\prime}\mid s,a).
Proof.

For N(s,a,s^{\prime})<\Phi, both sides are zero. For N(s,a,s^{\prime})\geq\Phi, recall that \emP^{\dagger}(s^{\prime}\mid s,a)=\frac{N(s,a,s^{\prime})}{N(s,a)}; then, from lemma A.10 and the definition of \event^{P},

|\emP(ss,a)(ss,a)|\displaystyle\left|\emP^{\dagger}(s^{\prime}\mid s,a)-\P^{\dagger}(s^{\prime}\mid s,a)\right|
\displaystyle\leq 2\emP(ss,a)N(s,a)log(12|\cS|2|\cA|δ)+143N(s,a)log(12|\cS|2|\cA|δ)\displaystyle\sqrt{\frac{2\emP^{\dagger}(s^{\prime}\mid s,a)}{N(s,a)}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right)}+\frac{14}{3N(s,a)}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right) (lemma A.10 and definition of \eventP\event^{P})
=\displaystyle= [2N(s,a,s)log(12|\cS|2|\cA|δ)+143N(s,a,s)log(12|\cS|2|\cA|δ)]\emP(ss,a)\displaystyle\left[\sqrt{\frac{2}{N(s,a,s^{\prime})}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right)}+\frac{14}{3N(s,a,s^{\prime})}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right)\right]\cdot\emP^{\dagger}(s^{\prime}\mid s,a) (\emP(ss,a)=N(s,a,s)N(s,a)\emP^{\dagger}(s^{\prime}\mid s,a)=\frac{N(s,a,s^{\prime})}{N(s,a)})
\displaystyle\leq [13H2+79H2]\emP(ss,a)\emP(ss,a)H,\displaystyle\left[\sqrt{\frac{1}{3H^{2}}}+\frac{7}{9H^{2}}\right]\cdot\emP^{\dagger}(s^{\prime}\mid s,a)\leq\frac{\emP^{\dagger}(s^{\prime}\mid s,a)}{H},

where the last line uses N(s,a,s^{\prime})\geq\Phi\geq 6H^{2}\log\left(\frac{12\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right) and H\geq 2. We conclude. ∎

Lemma A.22 (Bound on Ratios of Visitation Probability).

For any deterministic policy π\pi and any (h,s,a)[H]×\cS×\cA,(h,s,a)\in[H]\times\cS\times\cA, it holds that

14dπ,h(s,a)d^π,h(s,a)3dπ,h(s,a).\frac{1}{4}d_{\pi,h}^{\dagger}(s,a)\leq\widehat{d}_{\pi,h}^{\dagger}(s,a)\leq 3d_{\pi,h}^{\dagger}(s,a).

Here, recall that d_{\pi,h}^{\dagger}(s,a) and \widehat{d}_{\pi,h}^{\dagger}(s,a) denote the occupancy measures of (s,a) at stage h under policy \pi on \P^{\dagger} (the transition dynamics of the sparsified MDP) and on \emP^{\dagger} (the transition dynamics of the empirical sparsified MDP), respectively.

We remark that the inequality does not necessarily hold for the absorbing state s^{\dagger}, which is not in \cS.

Proof.

We denote by T_{h,s,a} the set of all truncated trajectories (s_{1},a_{1},s_{2},a_{2},\ldots,s_{h},a_{h}) up to stage h such that (s_{h},a_{h})=(s,a). Notice that if \tau_{h}=(s_{1},a_{1},s_{2},a_{2},\ldots,s_{h},a_{h})\in T_{h,s,a}, then s_{i}\neq s^{\dagger} for 1\leq i\leq h-1. We write \P\left(\cdot;\P^{\prime},\pi\right) for the probability measure under a specific transition dynamics \P^{\prime} and policy \pi. For any fixed (h,s,a)\in[H]\times\cS\times\cA and any fixed \tau\in T_{h,s,a}, we apply Lemma A.21 to get

[τ;\emP,π]\displaystyle\P\left[\tau;\emP^{\dagger},\pi\right] =i=1hπi(aisi)i=1h1\emP(si+1si,ai)\displaystyle=\prod_{i=1}^{h}\pi_{i}\left(a_{i}\mid s_{i}\right)\prod_{i=1}^{h-1}\emP^{\dagger}\left(s_{i+1}\mid s_{i},a_{i}\right)
\displaystyle\leq\left(1+\frac{1}{H-1}\right)^{H-1}\prod_{i=1}^{h}\pi_{i}\left(a_{i}\mid s_{i}\right)\prod_{i=1}^{h-1}\P^{\dagger}\left(s_{i+1}\mid s_{i},a_{i}\right)\leq 3\P\left[\tau;\P^{\dagger},\pi\right]

and

[τ;\emP,π]\displaystyle\P\left[\tau;\emP^{\dagger},\pi\right] =i=1hπi(aisi)i=1h1\emP(si+1si,ai)\displaystyle=\prod_{i=1}^{h}\pi_{i}\left(a_{i}\mid s_{i}\right)\prod_{i=1}^{h-1}\emP^{\dagger}\left(s_{i+1}\mid s_{i},a_{i}\right)
(11H)Hi=1hπi(aisi)i=1h1(si+1si,ai)14[τ;,π].\displaystyle\geq\left(1-\frac{1}{H}\right)^{H}\prod_{i=1}^{h}\pi_{i}\left(a_{i}\mid s_{i}\right)\prod_{i=1}^{h-1}\P^{\dagger}\left(s_{i+1}\mid s_{i},a_{i}\right)\geq\frac{1}{4}\P\left[\tau;\P^{\dagger},\pi\right].

Here, the first chain uses \emP^{\dagger}(s^{\prime}\mid s,a)\leq\left(1-\frac{1}{H}\right)^{-1}\P^{\dagger}(s^{\prime}\mid s,a)=\left(1+\frac{1}{H-1}\right)\P^{\dagger}(s^{\prime}\mid s,a), the fact that each product contains at most H-1 transition factors, and \left(1+\frac{1}{H-1}\right)^{H-1}\leq e\leq 3; the second chain uses \emP^{\dagger}(s^{\prime}\mid s,a)\geq\left(1+\frac{1}{H}\right)^{-1}\P^{\dagger}(s^{\prime}\mid s,a)\geq\left(1-\frac{1}{H}\right)\P^{\dagger}(s^{\prime}\mid s,a) and \left(1-\frac{1}{H}\right)^{H}\geq\frac{1}{4} for H\geq 2. We conclude by rewriting the visiting probability as

dπ,h(s,a)=τTh,s,a[τ;,π];d^π,h(s,a)=τTh,s,a[τ;\emP,π].d_{\pi,h}^{\dagger}(s,a)=\sum_{\tau\in T_{h,s,a}}\P\left[\tau;\P^{\dagger},\pi\right];\quad\widehat{d}_{\pi,h}^{\dagger}(s,a)=\sum_{\tau\in T_{h,s,a}}\P\left[\tau;\emP^{\dagger},\pi\right].
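To illustrate this argument numerically, the following minimal sketch (in Python, with a small synthetic layered MDP; the sizes, policy, and kernels are illustrative and not part of the algorithm) builds two kernels satisfying the multiplicative closeness of Lemma A.21 and checks the resulting bound \frac{1}{4}d\leq\widehat{d}\leq 3d on the occupancy measures.

import numpy as np

# Numerical illustration of the argument in Lemma A.22: if two transition
# kernels are entrywise multiplicatively close (as in Lemma A.21), then the
# induced occupancy measures are within constant factors of each other.
rng = np.random.default_rng(0)
S, A, H = 6, 3, 5

P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s'] "true" kernel
z = rng.uniform(-1.0, 1.0, size=(S, A, S))
z -= (P * z).sum(axis=-1, keepdims=True)            # make sum_s' P*z = 0
z /= (H + 1) * np.maximum(1.0, np.abs(z).max())     # |z| <= 1/(H+1)
P_hat = P * (1.0 + z)                                # row-stochastic, and
# (1 - 1/H) * P_hat <= P <= (1 + 1/H) * P_hat entrywise, as in Lemma A.21.

pi = rng.dirichlet(np.ones(A), size=(H, S))          # pi[h, s, a]

def occupancy(kernel):
    d = np.zeros((H, S, A))
    d[0, 0] = pi[0, 0]                               # start deterministically at state 0
    for h in range(H - 1):
        next_state = np.einsum("sa,sax->x", d[h], kernel)
        d[h + 1] = next_state[:, None] * pi[h + 1]
    return d

d, d_hat = occupancy(P), occupancy(P_hat)
assert np.all(d_hat >= 0.25 * d - 1e-12) and np.all(d_hat <= 3.0 * d + 1e-12)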

Appendix B Additional comparisons

B.1 Comparison with other comparator policy

Our main result compares the sub-optimality of the policy πfinal\pi_{final} against the optimal policy on the sparsified MDP. We can further derive the sub-optimality of our output with respect to any comparator policy on the original MDP \cM.\cM. If we denote π,π\pi_{*},\pi_{*}^{\dagger} and πfinal\pi_{final} as the global optimal policy, the optimal policy on the sparsified MDP and the policy output by our algorithm, respectively, and denote π\pi as the comparator policy, we have

V1(s1,,r,π)V1(s1,,r,πfinal)\displaystyle V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P,r,\pi_{final}\right)
\displaystyle\leq V1(s1,,r,π)V1(s1,,r,π)+V1(s1,,r,π)V1(s1,,r,π)0\displaystyle V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)+\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi_{*}^{\dagger}\right)}_{\leq 0}
+\displaystyle+ V1(s1,,r,π)V1(s1,,r,πfinal)ε+V1(s1,,r,πfinal)V1(s1,,r,πfinal)0\displaystyle\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi_{*}^{\dagger}\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)}_{\lesssim\varepsilon}+\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)-V_{1}\left(s_{1},\P,r,\pi_{final}\right)}_{\leq 0}
\displaystyle\lesssim \underbrace{V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)}_{\text{Approximation Error}}+\varepsilon. (B.1)

Here, the second term is non-positive from the definition of π,\pi_{*}^{\dagger}, the third term is upper bounded by ε\varepsilon due to our main theorem (Theorem 5.1), and the last term is non-positive from the definition of the sparsified MDP. Since the connectivity graph of the sparsified MDP is a sub-graph of the original MDP, for any policy, the policy value on the sparsified MDP must be no higher than that on the original MDP.

At a high level, the \varepsilon term in the last line of (B.1) represents the error from the finite number of online episodes, while the approximation error term V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right) measures the difference between the value of \pi on the original MDP and on the sparsified one, which reflects how well the offline dataset covers \pi. If the dataset covers most of the state-action pairs that \pi visits, then this gap is small. When \pi is the globally optimal policy \pi_{*}, this means the data should cover the state-action pairs that the optimal policy visits. The approximation error thus plays a role similar to that of the concentrability coefficient in offline reinforcement learning.

B.2 Comparison with offline reinforcement learning

Our algorithm leverages information from the offline dataset, so it is natural to ask what we require of the offline dataset compared to what traditional offline reinforcement learning requires. In offline RL, we typically need a concentrability condition, namely good coverage by the offline dataset, in order to achieve polynomial sample complexity. Specifically, if we assume the offline data are sampled by first drawing (s,a) i.i.d. from \mu and then sampling the subsequent state from the transition dynamics, then the concentrability condition requires the following constant C^{*} to be well-defined and finite.

C:=sup(s,a)dπ(s,a)μ(s,a)<.C^{*}:=\sup_{(s,a)}\frac{d^{\pi_{*}}(s,a)}{\mu(s,a)}<\infty.

The concentrability coefficient can be defined in several alternative ways, either for a set of policies or with respect to a single policy [Chen and Jiang, 2019, Zhan et al., 2022, Xie et al., 2021b, Zanette et al., 2021b]. Here, we follow the definition in [Xie et al., 2021b]. It means that the sampling distribution must cover the region visited by the globally optimal policy, which is very close to the intuition obtained in our setting.

[Xie et al., 2021b] also showed that the optimal sample complexity (in terms of state-action pairs) of an offline RL algorithm is

N=O~(CH3|\cS|ε2+CH5.5|\cS|ε),N=\widetilde{O}\left(\frac{C^{*}H^{3}\left|\cS\right|}{\varepsilon^{2}}+\frac{C^{*}H^{5.5}\left|\cS\right|}{\varepsilon}\right),

which is minimax optimal up to logarithmic and higher-order terms. Similar sample complexities were also obtained in several related works [Yin and Wang, 2020, Yin et al., 2020, Xie and Jiang, 2020b].

Uniform data distribution

For simplicity, we first assume that \mu is uniform over all state-action pairs and that the reward function is given. Suppose we have N state-action pairs in the offline data, sampled i.i.d. from the distribution \mu. Notice that the globally optimal policy \pi_{*} may still differ from the optimum on the sparsified MDP, \pi_{*}^{\dagger}: even if we get enough samples from each (s,a) pair, we might not get enough samples for every (s,a,s^{\prime}), and hence not all (s,a,s^{\prime}) will be included in the set of known tuples.

Concretely, consider the case where each state-action pair is sampled N/(\left|\cS\right|\left|\cA\right|) times and we simply treat the transition frequency as the true probability. Then for any N(s,a,s^{\prime})<\Phi it holds that \P\left(s^{\prime}\mid s,a\right)=\frac{N(s,a,s^{\prime})}{N(s,a)}=\frac{N(s,a,s^{\prime})\left|\cS\right|\left|\cA\right|}{N}\leq\frac{\Phi\left|\cS\right|\left|\cA\right|}{N}. So for any N(s,a,s^{\prime})\geq\Phi we know \P(s^{\prime}\mid s,a)=\P^{\dagger}(s^{\prime}\mid s,a), while for any N(s,a,s^{\prime})<\Phi we have \P\left(s^{\prime}\mid s,a\right)\leq\frac{\Phi\left|\cS\right|\left|\cA\right|}{N} and \P^{\dagger}(s^{\prime}\mid s,a)=0. Therefore, we have

|(ss,a)(ss,a)|Φ|\cS||\cA|N\left|\P(s^{\prime}\mid s,a)-\P^{\dagger}(s^{\prime}\mid s,a)\right|\leq\frac{\Phi\left|\cS\right|\left|\cA\right|}{N}

From the value difference lemma (lemma D.11), we can upper bound the approximation error by

V1(s1,,r,π)V1(s1,,r,π)\displaystyle V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)
=\displaystyle=~{} \E,π[h=1Hsh+1((sh+1sh,ah)(sh+1sh,ah))Vh(sh+1,,r,π)|sh=s]\displaystyle\E_{\P,\pi}\left[\sum_{h=1}^{H}\sum_{s_{h+1}}\left(\P^{\dagger}(s_{h+1}\mid s_{h},a_{h})-\P(s_{h+1}\mid s_{h},a_{h})\right)\cdot V_{h}\left(s_{h+1},\P,r,\pi\right)\bigg{|}s_{h}=s\right] (lemma D.11)
\displaystyle\leq~{} \E,π[h=1Hsh+1Φ|\cS||\cA|NH|sh=s]\displaystyle\E_{\P,\pi}\left[\sum_{h=1}^{H}\sum_{s_{h+1}}\frac{\Phi\left|\cS\right|\left|\cA\right|}{N}\cdot H\bigg{|}s_{h}=s\right] (the value function is upper bounded by HH)
\displaystyle\leq~{} ΦH2|\cS|2|\cA|N\displaystyle\frac{\Phi H^{2}\left|\cS\right|^{2}\left|\cA\right|}{N} (summation over h[H]h\in[H] and sh+1\cSs_{h+1}\in\cS)
\displaystyle\asymp~{} O~(H4|\cS|2|\cA|N).\displaystyle\widetilde{O}\left(\frac{H^{4}\left|\cS\right|^{2}\left|\cA\right|}{N}\right). (definition of Φ\Phi in (5.1))

Therefore, to get an ε\varepsilon-optimal policy compared to the global optimal one, we need the number of state-action pairs in the initial offline dataset \cD\cD to be

N=O~(H4|\cS|2|\cA|ε).N=\widetilde{O}\left(\frac{H^{4}\left|\cS\right|^{2}\left|\cA\right|}{\varepsilon}\right).

From Theorem 5.1, the offline data size needed here is significantly smaller than what an offline algorithm would need. As long as \sup_{(s,a)}d^{\pi_{*}}(s,a) is not too small, for instance larger than H^{-1.5}, we shave off the whole 1/\varepsilon^{2} term. The offline sample complexity here is of order O(1/\varepsilon) instead of the O(1/\varepsilon^{2}) typical of offline RL, which is significantly smaller in the small-\varepsilon regime. To compensate for the smaller offline sample size, we need more online samples to obtain a globally \varepsilon-optimal policy; the general requirements on the offline and online sample sizes are summarized in corollary 5.2.
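The count-based sparsification discussed above is straightforward to implement. The following minimal sketch (in Python) builds an empirical sparsified kernel from a synthetic count array with threshold \Phi; routing the remaining probability mass to the absorbing state s^{\dagger} is our reading of the construction (the formal definition is in section A.1.1), and all sizes and counts below are illustrative.

import numpy as np

# Minimal sketch of count-based sparsification: keep a transition (s, a, s')
# only if it was observed at least Phi times; state index S plays the role of
# the absorbing state s_dagger. All numbers here are synthetic.
rng = np.random.default_rng(1)
S, A, Phi = 5, 3, 4
counts = rng.poisson(6.0, size=(S, A, S))             # hypothetical N(s, a, s')

P_emp_dagger = np.zeros((S, A, S + 1))                 # last column is s_dagger
for s in range(S):
    for a in range(A):
        n_sa = counts[s, a].sum()
        if n_sa == 0:
            P_emp_dagger[s, a, S] = 1.0                # no data: all mass to s_dagger
            continue
        known = counts[s, a] >= Phi                    # "known" tuples (s, a, s')
        P_emp_dagger[s, a, :S][known] = counts[s, a][known] / n_sa
        P_emp_dagger[s, a, S] = max(0.0, 1.0 - P_emp_dagger[s, a, :S].sum())

assert np.allclose(P_emp_dagger.sum(axis=-1), 1.0)     # each row is a distribution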

Non-uniform data distribution

Assume that the data-generating distribution \mu is not uniform but is still supported on all (s,a) pairs such that d^{\pi_{*}}(s,a)>0, so that the concentrability coefficient in offline RL is still well defined. We simply consider the case where each state-action pair (s,a) is sampled N\mu(s,a) times and treat the transition frequency as the true underlying probability. Then, following an argument very similar to that of the last paragraph, the number of state-action pairs needed in the initial offline dataset in order to extract a globally \varepsilon-optimal policy is

N=O~(H4|\cS|εs,adπ(s,a)μ(s,a)).N=\widetilde{O}\left(\frac{H^{4}\left|\cS\right|}{\varepsilon}\sum_{s,a}\frac{d^{\pi_{*}}(s,a)}{\mu(s,a)}\right).

Here, the quantity

C:=s,adπ(s,a)μ(s,a)C^{\dagger}:=\sum_{s,a}\frac{d^{\pi_{*}}(s,a)}{\mu(s,a)}

plays a role similar to the classical concentrability coefficient and likewise measures the distribution shift between the occupancy measure of the optimal policy and the data distribution. In the worst case, this coefficient can be as large as \left|\cS\right|\left|\cA\right|C^{*}, resulting in an extra \left|\cS\right|\left|\cA\right| factor compared to the optimal offline sample complexity. However, we still shave off the entire 1/\varepsilon^{2} term, as well as an H^{1.5} factor in the 1/\varepsilon term.

Partial coverage data

Under partial coverage, we expect the output policy \pi_{final} to be competitive with the best policy supported on the region covered by the offline dataset. In this case, theorem 5.1 provides guarantees with respect to the best comparator policy on the sparsified MDP \mathcal{M}^{\dagger}. In order to gain further intuition, it is best to ‘translate’ such guarantees into guarantees on \mathcal{M}.

In the worst case, the data distribution \mu at a certain (s,a) pair can be zero while d^{\pi}(s,a)>0, which implies an infinite concentrability coefficient C^{*}=\infty. Here, \pi is an arbitrary comparator policy. In this case, neither a classical offline RL algorithm nor our policy finetuning algorithm can guarantee an \varepsilon-optimal policy compared to the globally optimal policy. However, we can still output a locally \varepsilon-optimal policy, i.e., one that is \varepsilon-optimal compared to the optimal policy on the sparsified MDP.

In order to compare \pi_{final} to any policy on the original MDP, we have corollary 5.2, which is proved in section B.3.

The statement in corollary 5.2 is a quite direct consequence of theorem 5.1, and it expresses the sub-optimality gap of πfinal\pi_{final} with respect to any comparator policy π\pi on the original MDP \mathcal{M}. It can also be written in terms of the sub-optimality: If we fix a comparator policy π\pi, then with probability at least 1δ1-\delta, for any reward function rr, the policy πfinal\pi_{final} returned by algorithm 2 satisfies:

V1(s1;,r,π)V1(s1;,r,πfinal)\displaystyle V_{1}\left(s_{1};\mathbb{P},r,\pi\right)-V_{1}\left(s_{1};\mathbb{P},r,\pi_{final}\right) =O~(H|\cS||\cA|KdeOnline error+H4|\cS|Ns,adπ(s,a)μ(s,a)Offline error)\displaystyle=\widetilde{O}\Big{(}\underbrace{\vphantom{\frac{H^{4}\left|\cS\right|}{N_{\text{tot}}}\sum_{s,a}\frac{d^{\pi}(s,a)}{\mu(s,a)}}\frac{H\left|\cS\right|\sqrt{\left|\cA\right|}}{\sqrt{K_{de}}}}_{\text{Online error}}+\underbrace{\frac{H^{4}\left|\cS\right|}{N}\sum_{s,a}\frac{d^{\pi}(s,a)}{\mu(s,a)}}_{\text{Offline error}}\Big{)}
=O~(H|\cS||\cA|Kde+H4|\cS|2|\cA|Nsups,adπ(s,a)μ(s,a).),\displaystyle=\widetilde{O}\left(\frac{H\left|\cS\right|\sqrt{\left|\cA\right|}}{\sqrt{K_{de}}}+\frac{H^{4}\left|\cS\right|^{2}\left|\cA\right|}{N}\sup_{s,a}\frac{d^{\pi}(s,a)}{\mu(s,a)}.\right),

where K_{de} is the number of online episodes and N is the number of state-action pairs in the offline data. Here, the sub-optimality depends on an online error as well as on an offline error. The online error is the one that also arises in the statement of theorem 5.1; it can be reduced by collecting more online samples, i.e., by increasing K_{de}, with the typical inverse square-root dependence 1/\sqrt{K_{de}}.

However, the upper bound suggests that even in the limit of infinite online data, the value of \pi_{final} will not approach that of \pi_{*}, because of a residual error due to the offline dataset \mathcal{D}. This residual error depends on a concentrability factor expressed as \sum_{s,a}\frac{d^{\pi}(s,a)}{\mu(s,a)}\leq|\cS||\cA|\sup_{s,a}\frac{d^{\pi}(s,a)}{\mu(s,a)}, whose presence is intuitive: if a comparator policy \pi is not covered well, our algorithm does not have enough information to navigate to the area that \pi tends to visit, and so it is unable to refine its estimates there. However, the dependence on the number of offline samples N=|\mathcal{D}| is through its inverse, i.e., 1/N as opposed to the more typical 1/\sqrt{N}: this gap represents the performance that can be recovered when additional online data are collected non-reactively.
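To make the trade-off concrete, the following back-of-the-envelope sketch (in Python; constants and logarithmic factors are ignored, and every number below is illustrative) splits an \varepsilon error budget evenly between the online and offline terms of the bound above.

import math

# Illustrative error-budget split for the bound above (constants/logs ignored):
# online term  ~ H * S * sqrt(A / K_de),  offline term ~ H^4 * S^2 * A * Cstar / N.
H, S, A, Cstar, eps = 10, 50, 5, 2.0, 0.1
K_de = math.ceil((2 * H * S * math.sqrt(A) / eps) ** 2)   # makes the online term <= eps/2
N = math.ceil(2 * H**4 * S**2 * A * Cstar / eps)          # makes the offline term <= eps/2
print(f"online episodes K_de ~ {K_de:,},  offline transitions N ~ {N:,}")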

It is useful to compare corollary 5.2 with what is achievable by a minimax-optimal offline algorithm [Xie et al., 2021b]. In this latter case, one can bound the sub-optimality gap for any comparator policy \pi with high probability as

V1(s1;,r,π)V1(s1;,r,πfinal)O~(H3|\cS|Nsups,adπ(s,a)μ(s,a)).\displaystyle V_{1}\left(s_{1};\mathbb{P},r^{\dagger},\pi\right)-V_{1}\left(s_{1};\mathbb{P},r^{\dagger},\pi_{final}\right)\leq\widetilde{O}\left(\sqrt{\frac{H^{3}\left|\cS\right|}{N}\sup_{s,a}\frac{d^{\pi}(s,a)}{\mu(s,a)}}\right). (B.2)

B.3 Proof of corollary 5.2

Let us denote the offline dataset by \cD=\left\{(s_{i},a_{i},s_{i}^{\prime})\right\}_{i\in[N]}, where N is the total number of tuples. We keep the notation the same as in the main text. We use N(s,a) and N(s,a,s^{\prime}) to denote the counts of (s,a) and (s,a,s^{\prime}) in the offline data \cD. The state-action pairs are sampled i.i.d. from \mu(s,a) and the subsequent states are sampled from the transition dynamics. We fix a comparator policy \pi and assume \mu(s,a)>0 for any (s,a) such that d^{\pi}(s,a)>0, which implies a finite concentrability constant C^{*}. Here, d^{\pi}(s,a) is the occupancy probability of (s,a) when executing policy \pi, averaged over all stages h\in[H].

Similarly to Section B.1, we have

V1(s1,,r,π)V1(s1,,r,πfinal)\displaystyle V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P,r,\pi_{final}\right)
\displaystyle\leq V1(s1,,r,π)V1(s1,,r,π)+V1(s1,,r,π)V1(s1,,r,π)0\displaystyle V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)+\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi_{*}^{\dagger}\right)}_{\leq 0}
+\displaystyle+ V1(s1,,r,π)V1(s1,,r,πfinal)+V1(s1,,r,πfinal)V1(s1,,r,πfinal)0\displaystyle V_{1}\left(s_{1},\P^{\dagger},r,\pi_{*}^{\dagger}\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)+\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)-V_{1}\left(s_{1},\P,r,\pi_{final}\right)}_{\leq 0}
\displaystyle\lesssim V1(s1,,r,π)V1(s1,,r,π)Approximation Error+V1(s1,,r,π)V1(s1,,r,πfinal)Estimation Error.\displaystyle\underbrace{V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)}_{\text{Approximation Error}}+\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi_{*}^{\dagger}\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)}_{\text{Estimation Error}}. (B.3)

where \pi_{*}^{\dagger} and \pi_{final} are the optimal policy on the sparsified MDP and the policy output by our algorithm, respectively, and \pi is the fixed comparator policy. Here, the second term is non-positive by the definition of \pi_{*}^{\dagger}, and the last term is non-positive by the definition of the sparsified MDP. This is because, for any state-action pair (s,a) and any fixed policy \pi, the probability of reaching (s,a) under \kprob does not exceed that under the true transition probability \P. If we denote the visiting probabilities under \pi and \P (resp. \kprob) by d_{\pi,h}(s,a) (resp. d^{\dagger}_{\pi,h}(s,a)), then we have

dπ,h(s,a)dπ,h(s,a)d^{\dagger}_{\pi,h}(s,a)\leq d_{\pi,h}(s,a)

for any h\in[H], s\in\cS, a\in\cA. Note that, for s=s^{\dagger}, this does not necessarily hold. Then, for any policy \pi, we have

V1(s1,,r,π)\displaystyle V_{1}\left(s_{1},\P^{\dagger},r,\pi\right) =h=1Hs,adπ,h(s,a)r(s,a)\displaystyle=\sum_{h=1}^{H}\sum_{s,a}d^{\dagger}_{\pi,h}(s,a)r(s,a) (definition of policy value)
=h=1Hss,adπ,h(s,a)r(s,a)\displaystyle=\sum_{h=1}^{H}\sum_{s\neq s^{\dagger},a}d^{\dagger}_{\pi,h}(s,a)r(s,a) (r(s,a)=0r(s^{\dagger},a)=0 for any aa)
h=1Hss,adπ,h(s,a)r(s,a)=V1(s1,,r,π).\displaystyle\leq\sum_{h=1}^{H}\sum_{s\neq s^{\dagger},a}d_{\pi,h}(s,a)r(s,a)=V_{1}\left(s_{1},\P,r,\pi\right).

Therefore, we get V1(s1,,r,πfinal)V1(s1,,r,πfinal)0V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)-V_{1}\left(s_{1},\P,r,\pi_{final}\right)\leq 0 when we take π=πfinal.\pi=\pi_{final}.

From the main result (Theorem 5.1), we know the estimation error is bounded as

V1(s1,,r,π)V1(s1,,r,πfinal)Estimation ErrorO~(H|𝒮||𝒜|Kde),\underbrace{V_{1}\left(s_{1},\P^{\dagger},r,\pi_{*}^{\dagger}\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi_{final}\right)}_{\text{Estimation Error}}\lesssim\widetilde{O}\left(\frac{H|\mathcal{S}|\sqrt{|\mathcal{A}|}}{\sqrt{K_{de}}}\right), (B.4)

where K_{de} is the number of online episodes. Therefore, to make the right-hand side of (B.4) less than \varepsilon/2, one needs at least \widetilde{O}\left(\frac{H^{2}\left|\cS\right|^{2}\left|\cA\right|}{\varepsilon^{2}}\right) online episodes. This is exactly what the main result shows.

So it suffices to bound the approximation error term. From the value difference lemma (Lemma D.11), we have

|V1(s1,,r,π)V1(s1,,r,π)|\displaystyle\left|V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)\right|
=\displaystyle= |\E,π[i=1H((si,ai)(si,ai))Vi+1(;,r,π)|s1=s]|\displaystyle\left|\E_{\P,\pi}\left[\sum_{i=1}^{H}\left(\P^{\dagger}(\cdot\mid s_{i},a_{i})-\P(\cdot\mid s_{i},a_{i})\right)^{\top}V_{i+1}\left(\cdot;\P^{\dagger},r,\pi\right)\bigg{|}s_{1}=s\right]\right|
=\displaystyle= |i=1Hsi,aidπ,h(si,ai)((si,ai)(si,ai))Vi+1(;,r,π)|\displaystyle\left|\sum_{i=1}^{H}\sum_{s_{i},a_{i}}d_{\pi,h}(s_{i},a_{i})\left(\P^{\dagger}(\cdot\mid s_{i},a_{i})-\P(\cdot\mid s_{i},a_{i})\right)^{\top}V_{i+1}\left(\cdot;\P^{\dagger},r,\pi\right)\right| (by expanding the expectation)
\displaystyle\leq H|\cS|i=1Hsi,aidπ,h(si,ai)(sups|(ssi,ai)(ssi,ai)|).\displaystyle H\left|\cS\right|\cdot\sum_{i=1}^{H}\sum_{s_{i},a_{i}}d_{\pi,h}(s_{i},a_{i})\left(\sup_{s^{\prime}}\left|\P^{\dagger}(s^{\prime}\mid s_{i},a_{i})-\P(s^{\prime}\mid s_{i},a_{i})\right|\right). (Vi+1HV_{i+1}\leq H and the inner product has |\cS|\left|\cS\right| terms)

We define

dπ(s,a)=1Hh=1Hdπ,h(s,a)d_{\pi}(s,a)=\frac{1}{H}\sum_{h=1}^{H}d_{\pi,h}(s,a) (B.5)

as the average visiting probability. Then, we have

|V1(s1,,r,π)V1(s1,,r,π)|\displaystyle\left|V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)\right|\leq |\cS|H2s,a[sups|(ss,a)(ss,a)|dπ(s,a)],\displaystyle\left|\cS\right|H^{2}\sum_{s,a}\left[\sup_{s^{\prime}}\left|\P^{\dagger}(s^{\prime}\mid s,a)-\P(s^{\prime}\mid s,a)\right|d_{\pi}(s,a)\right], (B.6)

So it suffices to upper bound |(ss,a)(ss,a)|\left|\P^{\dagger}(s^{\prime}\mid s,a)-\P(s^{\prime}\mid s,a)\right|. Notice here we only consider sss\neq s^{\dagger} and ss,s^{\prime}\neq s^{\dagger}, since the value function starting from ss^{\dagger} is always zero.

For (s,a,s)(s,a,s^{\prime}), if N(s,a,s)Φ,N(s,a,s^{\prime})\geq\Phi, it holds that (ss,a)=(ss,a).\P^{\dagger}(s^{\prime}\mid s,a)=\P(s^{\prime}\mid s,a). Otherwise, from the definition, we know (ss,a)=0,\P^{\dagger}(s^{\prime}\mid s,a)=0, so it suffices to bound (ss,a)\P(s^{\prime}\mid s,a) in this case. From lemma B.1, we know that with probability at least 1δ/2,1-\delta/2, for any (s,a,s)(s,a,s^{\prime}) such that N(s,a,s)<ΦN(s,a,s^{\prime})<\Phi, it holds that

(ss,a)\displaystyle\P(s^{\prime}\mid s,a) 2N(s,a,s)+2log(2|\cS|2|\cA|/δ)N(s,a).\displaystyle\leq\frac{2N(s,a,s^{\prime})+2\log\left(2\left|\cS\right|^{2}\left|\cA\right|/\delta\right)}{N(s,a)}.

Then we deal with two cases. When Nμ(s,a)6Φ,N\mu(s,a)\geq 6\Phi, from lemma B.2 we have

(ss,a)\displaystyle\P(s^{\prime}\mid s,a) 4N(s,a,s)+2log(4|\cS|2|\cA|/δ)Nμ(s,a)2log(2|\cS||\cA|/δ)4Φ+2log(4|\cS|2|\cA|/δ)Nμ(s,a)2log(2|\cS||\cA|/δ).\displaystyle\leq\frac{4N(s,a,s^{\prime})+2\log\left(4\left|\cS\right|^{2}\left|\cA\right|/\delta\right)}{N\mu(s,a)-2\log\left(2\left|\cS\right|\left|\cA\right|/\delta\right)}\leq\frac{4\Phi+2\log\left(4\left|\cS\right|^{2}\left|\cA\right|/\delta\right)}{N\mu(s,a)-2\log\left(2\left|\cS\right|\left|\cA\right|/\delta\right)}.

From the definition of Φ,\Phi, we know that 2log(4|\cS|2|\cA|/δ)Φ2\log\left(4\left|\cS\right|^{2}\left|\cA\right|/\delta\right)\leq\Phi and 2log(2|\cS||\cA|/δ)Φ,2\log\left(2\left|\cS\right|\left|\cA\right|/\delta\right)\leq\Phi, which implies

(ss,a)5ΦNμ(s,a)Φ6ΦNμ(s,a).\P(s^{\prime}\mid s,a)\leq\frac{5\Phi}{N\mu(s,a)-\Phi}\leq\frac{6\Phi}{N\mu(s,a)}.

The last inequality comes from our assumption that N\mu(s,a)\geq 6\Phi.

In the other case, when Nμ(s,a)<6Φ,N\mu(s,a)<6\Phi, it holds that

(ss,a)16ΦNμ(s,a)\P(s^{\prime}\mid s,a)\leq 1\leq\frac{6\Phi}{N\mu(s,a)}

for any (s,a,s^{\prime}). Therefore, for any (s,a,s^{\prime}) such that N(s,a,s^{\prime})<\Phi, we have

(ss,a)6ΦNμ(s,a).\P(s^{\prime}\mid s,a)\leq\frac{6\Phi}{N\mu(s,a)}. (B.7)

Combining equations (B.7) and (B.6), we know for any comparator policy π,\pi, it holds that

|V1(s1,,r,π)V1(s1,,r,π)|Approximation ErrorΦH2|\cS|Ns,adπ(s,a)μ(s,a)O~(H4|\cS|Ns,adπ(s,a)μ(s,a)),\underbrace{\left|V_{1}\left(s_{1},\P,r,\pi\right)-V_{1}\left(s_{1},\P^{\dagger},r,\pi\right)\right|}_{\text{Approximation Error}}\lesssim\frac{\Phi H^{2}\left|\cS\right|}{N}\sum_{s,a}\frac{d_{\pi}(s,a)}{\mu(s,a)}\lesssim\widetilde{O}\left(\frac{H^{4}\left|\cS\right|}{N}\sum_{s,a}\frac{d_{\pi}(s,a)}{\mu(s,a)}\right),

where NN is the total number of transitions in the offline data. In order to make the right hand side of last display less than ε/2,\varepsilon/2, one needs at least

O~(H4|\cS|εs,adπ(s,a)μ(s,a))O~(H4|\cS|2|\cA|Cε),\widetilde{O}\Big{(}\frac{H^{4}\left|\cS\right|}{\varepsilon}\sum_{s,a}\frac{d_{\pi}(s,a)}{\mu(s,a)}\Big{)}\leq\widetilde{O}\Big{(}\frac{H^{4}\left|\cS\right|^{2}\left|\cA\right|C^{*}}{\varepsilon}\Big{)},

offline transitions. Combining the bounds on the estimation error and the approximation error, we conclude.

B.4 Proof for lemma B.1 and lemma B.2

Lemma B.1.

With probability at least 1δ/2,1-\delta/2, for any (s,a,s)\cS×\cA×\cS,(s,a,s^{\prime})\in\cS\times\cA\times\cS, it holds that

N(s,a,s)12N(s,a)(ss,a)log(2|\cS|2|\cA|δ).N(s,a,s^{\prime})\geq\frac{1}{2}N(s,a)\P\left(s^{\prime}\mid s,a\right)-\log\left(\frac{2\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right).
Proof.

We fix (s,a,s^{\prime}) and denote by I the index set of those i with (s_{i},a_{i})=(s,a). We order the indices in I as i_{1}<i_{2}<\ldots<i_{N(s,a)}. For j\leq N(s,a), we let X_{j}=\mathbb{I}\left(s_{i_{j}}^{\prime}=s^{\prime}\right), the indicator that the next state is s^{\prime} the j-th time we encounter (s,a). For j>N(s,a), we let X_{j} be independent Bernoulli random variables with success probability \P(s^{\prime}\mid s,a). Then the X_{j}, j\in\mathbb{N}, form an i.i.d. sequence of Bernoulli random variables. From Lemma D.1 (applied with W=\log\left(2\left|\cS\right|^{2}\left|\cA\right|/\delta\right)), with probability at least 1-\frac{\delta}{2\left|\cS\right|^{2}\left|\cA\right|}, for every positive integer n it holds that

\sum_{j=1}^{n}X_{j}\geq\frac{1}{2}\sum_{j=1}^{n}\P(s^{\prime}\mid s,a)-\log\left(\frac{2\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right).

We take n=N(s,a)n=N(s,a) (although N(s,a)N(s,a) is random, we can still take it because for any nn the inequality above holds) to get

N(s,a,s^{\prime})=\sum_{j=1}^{N(s,a)}X_{j}\geq\frac{1}{2}N(s,a)\P(s^{\prime}\mid s,a)-\log\left(\frac{2\left|\cS\right|^{2}\left|\cA\right|}{\delta}\right).

Applying a union bound for all (s,a,s),(s,a,s^{\prime}), we conclude. ∎

Lemma B.2.

With probability at least 1δ/2,1-\delta/2, for any (s,a)\cS×\cA,(s,a)\in\cS\times\cA, it holds that

N(s,a)12Nμ(s,a)log(2|\cS||\cA|δ).N(s,a)\geq\frac{1}{2}N\mu(s,a)-\log\left(\frac{2\left|\cS\right|\left|\cA\right|}{\delta}\right).
Proof.

If \mu(s,a)=0, the claim is trivial. We fix an (s,a) such that \mu(s,a)>0. For j\leq N, we let X_{j}=\mathbb{I}\left(s_{j}=s,a_{j}=a\right), the indicator that the j-th state-action pair in the offline dataset is (s,a). For j>N, we let X_{j} be independent Bernoulli random variables with success probability \mu(s,a). Then the X_{j}, j\in\mathbb{N}, form an i.i.d. sequence of Bernoulli random variables. From Lemma D.1 (applied with W=\log\left(2\left|\cS\right|\left|\cA\right|/\delta\right)), with probability at least 1-\frac{\delta}{2\left|\cS\right|\left|\cA\right|}, for every positive integer n it holds that

\sum_{j=1}^{n}X_{j}\geq\frac{1}{2}\sum_{j=1}^{n}\mu(s,a)-\log\left(\frac{2\left|\cS\right|\left|\cA\right|}{\delta}\right).

We take n=Nn=N to get

N(s,a)=\sum_{j=1}^{N}X_{j}\geq\frac{1}{2}N\mu(s,a)-\log\left(\frac{2\left|\cS\right|\left|\cA\right|}{\delta}\right).

Applying a union bound for all (s,a)(s,a) such that μ(s,a)>0,\mu(s,a)>0, we conclude. ∎
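As a quick Monte Carlo sanity check of the anytime Bernoulli lower bound from Lemma D.1 that drives lemmas B.1 and B.2 (a sketch only; the success probability, sample size, confidence level, and number of trials below are arbitrary), one can estimate the failure frequency empirically.

import numpy as np

# Monte Carlo check of the anytime lower bound: with probability >= 1 - delta,
# for all n, sum_j X_j >= n*p/2 - log(1/delta). All parameters are illustrative.
rng = np.random.default_rng(2)
p, N, delta, trials = 0.3, 2000, 0.05, 5000

failures = 0
for _ in range(trials):
    x = rng.random(N) < p
    csum = np.cumsum(x)
    n = np.arange(1, N + 1)
    if np.any(csum < n * p / 2.0 - np.log(1.0 / delta)):
        failures += 1

print(f"empirical failure rate: {failures / trials:.4f} (theory: at most {delta})")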

Appendix C Lower bound

In this section we briefly discuss the optimality of the algorithm. Although the following considerations are also mentioned in the main text, here we mention how they naturally lead to a lower bound.

Lower bound for reward-free exploration

Consider the MDP class \mathcal{M} defined in the proof of the lower bound of Theorem 4.1 in [Jin et al., 2020b]. Assume that the dataset arises from a logging policy \pi_{log} which induces the condition N(s,a,s^{\prime})\geq\Phi for all (s,a,s^{\prime})\in\cS\times\cA\times\cS on every instance of the class. In this case, every MDP instance \mathcal{M}\in\mathcal{M} and its sparsified version \mathcal{M}^{\dagger} coincide. Then the concatenation of the logging policy \pi_{log} and of the policy \pi_{final} produced by our algorithm (i.e., algorithm 3) can be interpreted as a reactive policy, which must comply with the reward-free lower bound established in Theorem 4.1 of [Jin et al., 2020b]. More precisely, the reward-free sample complexity lower bound established in Theorem 4.1 of [Jin et al., 2020b] is

Ω(|\cS|2|\cA|H2ε2)\displaystyle\Omega\Big{(}\frac{|\cS|^{2}|\cA|H^{2}}{\varepsilon^{2}}\Big{)} (C.1)

trajectories. This matches the sample complexity of theorem 5.1. Notice that the number of samples originally present in the dataset can be

|\cS|2|\cA|×O~(H2)=O~(H2|\cS|2|\cA|),\displaystyle|\cS|^{2}|\cA|\times\widetilde{O}(H^{2})=\widetilde{O}(H^{2}|\cS|^{2}|\cA|), (C.2)

a term independent of the accuracy \varepsilon. Given that, when \varepsilon\leq 1, we have

O~(H2|\cS|2|\cA|)+Ω(|\cS|2|\cA|H2ε2)=Ω(|\cS|2|\cA|H2ε2),\displaystyle\widetilde{O}(H^{2}|\cS|^{2}|\cA|)+\Omega\Big{(}\frac{|\cS|^{2}|\cA|H^{2}}{\varepsilon^{2}}\Big{)}=\Omega\Big{(}\frac{|\cS|^{2}|\cA|H^{2}}{\varepsilon^{2}}\Big{)},

our algorithm is minimax optimal up to constant and logarithmic factors.

Lower bound for non-reactive exploration

Consider the MDP class \mathcal{M} defined in the proof of the lower bound in Theorem 1 of [Xiao et al., 2022], which establishes an exponential sample complexity for non-reactive exploration when no prior knowledge is available. In other words, in the absence of any data about the MDP, non-reactive exploration must suffer an exponential sample complexity. In such a case, our theorem 5.1 (correctly) provides vacuous guarantees, because the sparsified MDP \mathcal{M}^{\dagger} is degenerate (all edges lead to the absorbing state).

Combining the two constructions

It is possible to combine the MDP class 1\mathcal{M}_{1} from the paper [Xiao et al., 2022] with the MDP class 2\mathcal{M}_{2} from the paper [Jin et al., 2020b]. In lieu of a formal proof, here we provide only a sketch of the construction that would induce a lower bound. More precisely, consider a starting state s1s_{1} where only two actions—a=1a=1 and a=2a=2—are available. Taking a=1a=1 leads to the start state of an instance of the class 1\mathcal{M}_{1}, while taking a=2a=2 leads to the start state of an instance of the class 2\mathcal{M}_{2}; in both cases the transition occurs with probability one and zero reward is collected.

Furthermore, assume that the reward function over the MDPs in \mathcal{M}_{1} is shifted such that the value of a policy that takes a=1 in s_{1} and then plays optimally is 1, and that the reward function on \mathcal{M}_{2} is shifted such that the value of a policy that takes a=2 initially and then plays optimally is 2\varepsilon.

In addition, assume that the dataset arises from a logging policy πlog\pi_{log} which takes a=2a=2 initially and then visits all (s,a,s)(s,a,s^{\prime}) uniformly.

Such a construction and dataset identify a sparsified MDP which coincides with \mathcal{M}_{2}, with the addition of s_{1} (and its zero-reward transition to \mathcal{M}_{2}). Intuitively, a policy with value arbitrarily close to 1 must take action a=1, which leads to \mathcal{M}_{1}, the portion of the MDP that is unexplored in the dataset. In this case, unless the agent collects exponentially many trajectories in the online phase, the lower bound from [Xiao et al., 2022] implies that it is not possible to discover a policy with value close to 1 (e.g., larger than 1/2). On the other hand, our theorem 5.1 guarantees that \pi_{final} has value at least \varepsilon, because \pi_{final} is \varepsilon-optimal on the sparsified MDP—i.e., \varepsilon-optimal when restricted to an instance of \mathcal{M}_{2}—with high probability, using at most \sim H^{2}|\cS|^{2}|\cA|/\varepsilon^{2} trajectories. This value is unimprovable given the lower bound of Jin et al. [2020b], which applies to the class \mathcal{M}_{2}.

For completeness, in the next sub-section we refine the lower bound of [Xiao et al., 2022] to handle mixture policies.

Appendix D Technical lemmas and proofs

Lemma D.1 (Lemma F.4 in [Dann et al., 2017]).

Let i\mathcal{F}_{i} for i=1i=1\ldots be a filtration and X1,XnX_{1},\ldots X_{n} be a sequence of Bernoulli random variables with (Xi=1i1)=Pi\mathbb{P}\left(X_{i}=1\mid\mathcal{F}_{i-1}\right)=P_{i} with PiP_{i} being i1\mathcal{F}_{i-1}-measurable and XiX_{i} being i\mathcal{F}_{i} measurable. It holds that

(n:t=1nXt<t=1nPt/2W)eW.\mathbb{P}\left(\exists n:\sum_{t=1}^{n}X_{t}<\sum_{t=1}^{n}P_{t}/2-W\right)\leq e^{-W}.
Lemma D.2.

Let i\mathcal{F}_{i} for i=1i=1\ldots be a filtration and X1,XnX_{1},\ldots X_{n} be a sequence of Bernoulli random variables with (Xi=1i1)=Pi\mathbb{P}\left(X_{i}=1\mid\mathcal{F}_{i-1}\right)=P_{i} with PiP_{i} being i1\mathcal{F}_{i-1}-measurable and XiX_{i} being i\mathcal{F}_{i} measurable. It holds that

(n:t=1nXt>t=1n2Pt+W)eW.\mathbb{P}\left(\exists n:\sum_{t=1}^{n}X_{t}>\sum_{t=1}^{n}2P_{t}+W\right)\leq e^{-W}.
Proof.

Notice that 1u2[exp(u)u1]\frac{1}{u^{2}}\left[\exp(u)-u-1\right] is non-decreasing on ,\mathbb{R}, where at zero we continuously extend this function. For any t,t\in\mathbb{N}, since XtPt1,X_{t}-P_{t}\leq 1, we have exp(XtPt)(XtPt)1(XtPt)2(e2)(XtPt)2.\exp\left(X_{t}-P_{t}\right)-\left(X_{t}-P_{t}\right)-1\leq\left(X_{t}-P_{t}\right)^{2}\left(e-2\right)\leq\left(X_{t}-P_{t}\right)^{2}. Taking expectation conditional on \cFt1\cF_{t-1} and noticing that PtXtP_{t}-X_{t} is a Martingale difference sequence w.r.t. the filtration \cFt,\cF_{t}, we have

\E[exp(XtPt)\cFt1]1+\E[(XtPt)2\cFt1]exp[\E[(XtPt)2\cFt1]]exp(Pt),\E\left[\exp\left(X_{t}-P_{t}\right)\mid\cF_{t-1}\right]\leq 1+\E\left[\left(X_{t}-P_{t}\right)^{2}\mid\cF_{t-1}\right]\leq\exp\left[\E\left[\left(X_{t}-P_{t}\right)^{2}\mid\cF_{t-1}\right]\right]\leq\exp\left(P_{t}\right),

where the last inequality comes from the fact that conditional on \cFt1,\cF_{t-1}, XtX_{t} is a Bernoulli random variable. We define Mn:=exp[t=1n(Xt2Pt)],M_{n}:=\exp\left[\sum_{t=1}^{n}\left(X_{t}-2P_{t}\right)\right], which is a supermartingale from our derivation above. We define now the stopping time τ=min{t:Mt>eW}\tau=\min\left\{t\in\mathbb{N}:M_{t}>e^{W}\right\} and the sequence τn=min{t:Mt>eWtn}\tau_{n}=\min\left\{t\in\mathbb{N}:M_{t}>e^{W}\vee t\geq n\right\}. Applying the convergence theorem for nonnegative supermartingales (Theorem 4.2.12 in [Durrett, 2019]), we get that limtMt\lim_{t\rightarrow\infty}M_{t} is well-defined almost surely. Therefore, MτM_{\tau} is well-defined even when τ=\tau=\infty. By the optional stopping theorem for non-negative supermartingales (Theorem 4.8.4 in [Durrett, 2019], we have 𝔼[Mτn]𝔼[M0]1\mathbb{E}\left[M_{\tau_{n}}\right]\leq\mathbb{E}\left[M_{0}\right]\leq 1 for all nn and applying Fatou’s lemma, we obtain 𝔼[Mτ]=𝔼[limnMτn]lim infn𝔼[Mτn]1\mathbb{E}\left[M_{\tau}\right]=\mathbb{E}\left[\lim_{n\rightarrow\infty}M_{\tau_{n}}\right]\leq\liminf_{n\rightarrow\infty}\mathbb{E}\left[M_{\tau_{n}}\right]\leq 1. Using Markov’s inequality, we can finally bound

\mathbb{P}\left(\exists n:\sum_{t=1}^{n}X_{t}>2\sum_{t=1}^{n}P_{t}+W\right)\leq\mathbb{P}(\tau<\infty)\leq\mathbb{P}\left(M_{\tau}>e^{W}\right)\leq e^{-W}\mathbb{E}\left[M_{\tau}\right]\leq e^{-W}.

Lemma D.3 (Empirical Bernstein Inequality, Theorem 11 in [Maurer and Pontil, 2009]).

Let n2n\geq 2 and x1,,xnx_{1},\cdots,x_{n} be i.i.d random variables such that |xi|A\left|x_{i}\right|\leq A with probability 1 . Let x¯=1ni=1nxi\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}, and V^n=1ni=1n(xix¯)2\widehat{V}_{n}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}, then with probability 1δ1-\delta we have

|1ni=1nxi𝔼[x]|2V^nlog(2/δ)n+14A3nlog(2/δ)\left|\frac{1}{n}\sum_{i=1}^{n}x_{i}-\mathbb{E}[x]\right|\leq\sqrt{\frac{2\widehat{V}_{n}\log(2/\delta)}{n}}+\frac{14A}{3n}\log(2/\delta)
Lemma D.4 (Concentration for KL Divergence, Proposition 1 in [Jonsson et al., 2020]).

Let X1,X2,,Xn,X_{1},X_{2},\ldots,X_{n},\ldots be i.i.d. samples from a distribution supported over {1,,m}\{1,\ldots,m\}, of probabilities given by Σm\P\in\Sigma_{m}, where Σm\Sigma_{m} is the probability simplex of dimension m1m-1. We denote by ^n\widehat{\P}_{n} the empirical vector of probabilities. Then, for any δ[0,1]\delta\in[0,1], with probability at least 1δ,1-\delta, it holds that

n,KL(^n,)1nlog(1δ)+mnlog(e(1+nm)).\forall n\in\mathbb{N},\operatorname{KL}\left(\widehat{\P}_{n},\P\right)\leq\frac{1}{n}\log\left(\frac{1}{\delta}\right)+\frac{m}{n}\log\left(e\left(1+\frac{n}{m}\right)\right).

We remark that there is a slight difference between this statement and the original version: [Jonsson et al., 2020] use m-1 instead of m. Since the second term on the right-hand side above is increasing in m, our version also holds.

Lemma D.5 (Bernstein Transportation, Lemma 11 in [Talebi and Maillard, 2018]).

For any function ff and any two probability measure \Q,\Q,\P which satisfy \Q,\Q\ll\P, we denote 𝕍P[f]:=VarX(f(X))\mathbb{V}_{P}[f]:=\operatorname{Var}_{X\sim\P}(f(X)) and 𝕊(f):=supxf(x)infxf(x).\mathbb{S}(f):=\sup_{x}f(x)-\inf_{x}f(x). We assume 𝕍P[f]\mathbb{V}_{P}[f] and 𝕊(f)\mathbb{S}(f) are finite, then we have

𝔼Q[f]𝔼P[f]2𝕍P[f]KL(Q,P)+23𝕊(f)KL(Q,P),\displaystyle\mathbb{E}_{Q}[f]-\mathbb{E}_{P}[f]\leqslant\sqrt{2\mathbb{V}_{P}[f]\mathrm{KL}(Q,P)}+\frac{2}{3}\mathbb{S}(f)\mathrm{KL}(Q,P),
𝔼P[f]𝔼Q[f]2𝕍P[f]KL(Q,P).\displaystyle\mathbb{E}_{P}[f]-\mathbb{E}_{Q}[f]\leqslant\sqrt{2\mathbb{V}_{P}[f]\mathrm{KL}(Q,P)}.
Lemma D.6 (Difference of Variance, Lemma 12 in [Ménard et al., 2021]).

Let ,\Q\P,\Q be two probability measure on a discrete sample space of cardinality \cS.\cS. Let f,gf,g be two functions defined on 𝒮\mathcal{S} such that 0g(s),f(s)b0\leq g(s),f(s)\leq b for all s𝒮s\in\mathcal{S}, we have that

Var(f)2Var(g)+2b\E|fg| and\displaystyle\operatorname{Var}_{\P}(f)\leq 2\operatorname{Var}_{\P}(g)+2b\E_{\P}|f-g|\quad\text{ and }
\displaystyle\operatorname{Var}_{\Q}(f)\leq\operatorname{Var}_{\P}(f)+3b^{2}\|\P-\Q\|_{1},

Further, if \KL(;\Q)α,\KL\left(\P;\Q\right)\leq\alpha, it holds that

Var\Q(f)2Var(f)+4b2α.\operatorname{Var}_{\Q}(f)\leq 2\operatorname{Var}_{\P}(f)+4b^{2}\alpha.
Lemma D.7.

For any sequence of numbers z_{1},\ldots,z_{n} with 0\leq z_{k}\leq 1, we have

k=1nzkmax[1;i=1k1zi]4log(i=1nzi+1)\sum_{k=1}^{n}\frac{z_{k}}{\max\left[1;\sum_{i=1}^{k-1}z_{i}\right]}\leq 4\log\left(\sum_{i=1}^{n}z_{i}+1\right)
Proof.
k=1nzkmax[1;i=1k1zi]\displaystyle\sum_{k=1}^{n}\frac{z_{k}}{\max\left[1;\sum_{i=1}^{k-1}z_{i}\right]} 4k=1ni=1kzii=1k1zi2+2i=1k1zi\displaystyle\leq 4\sum_{k=1}^{n}\frac{\sum_{i=1}^{k}z_{i}-\sum_{i=1}^{k-1}z_{i}}{2+2\sum_{i=1}^{k-1}z_{i}}
4k=1ni=1kzii=1k1zi1+i=1kzi\displaystyle\leq 4\sum_{k=1}^{n}\frac{\sum_{i=1}^{k}z_{i}-\sum_{i=1}^{k-1}z_{i}}{1+\sum_{i=1}^{k}z_{i}}
4k=1ni=1k1zii=1kzi11+xdx4log(i=1nzi+1).\displaystyle\leq 4\sum_{k=1}^{n}\int_{\sum_{i=1}^{k-1}z_{i}}^{\sum_{i=1}^{k}z_{i}}\frac{1}{1+x}\mathrm{d}x\leq 4\log\left(\sum_{i=1}^{n}z_{i}+1\right).
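A quick numerical check of this bound (a minimal sketch in Python; the random sequences below are arbitrary):

import numpy as np

# Numerical check of Lemma D.7 on random sequences z_k in [0, 1].
rng = np.random.default_rng(3)
for _ in range(1000):
    z = rng.random(rng.integers(1, 500))
    prefix = np.concatenate(([0.0], np.cumsum(z)[:-1]))   # sum_{i<k} z_i
    lhs = np.sum(z / np.maximum(1.0, prefix))
    rhs = 4.0 * np.log(z.sum() + 1.0)
    assert lhs <= rhs + 1e-9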

Lemma D.8.

For B16B\geq 16 and x3,x\geq 3, there exists a universal constant C14,C_{1}\geq 4, such that when

xC1Blog(B)2,x\geq C_{1}B\log(B)^{2},

it holds that

Blog(1+x)(1+log(1+x))x.B\log(1+x)\left(1+\log(1+x)\right)\leq x.
Proof.

We have

Blog(1+x)(1+log(1+x))B(1+log(1+x))2B(1+log(2x))2B[log(6x)]2.B\log(1+x)\left(1+\log(1+x)\right)\leq B\left(1+\log(1+x)\right)^{2}\leq B\left(1+\log(2x)\right)^{2}\leq B\left[\log(6x)\right]^{2}.

We define f(x):=x-B\left[\log(6x)\right]^{2}; then f^{\prime}(x)=1-\frac{2B\log(6x)}{x}. Since x\geq C_{1}B\log(B)^{2}\geq 2C_{1}B\log(B) (because \log(B)\geq\log(16)>2 when B\geq 16), we have

f(x)1log(12C1Blog(B))C1log(B).f^{\prime}(x)\geq 1-\frac{\log(12C_{1}B\log(B))}{C_{1}\log(B)}.

We can take C_{1}\geq 4 such that C_{1}\log(B)-\log(12C_{1}B\log(B))\geq 0 whenever B\geq 16. Therefore, f(x) is increasing when x\geq C_{1}B\log(B)^{2}. Then, it suffices to prove

[log(6C1Blog(B)2)]2C1log(B)2.\left[\log\left(6C_{1}B\log(B)^{2}\right)\right]^{2}\leq C_{1}\log(B)^{2}.

Since [log(6C1Blog(B)2)]22log(B)2+2[log(6C1log(B)2)]2,\left[\log\left(6C_{1}B\log(B)^{2}\right)\right]^{2}\leq 2\log(B)^{2}+2\left[\log\left(6C_{1}\log(B)^{2}\right)\right]^{2}, it suffices to prove

log(6C1log(B)2)C122log(B).\log\left(6C_{1}\log(B)^{2}\right)\leq\sqrt{\frac{C_{1}-2}{2}}\log(B).

When C_{1}\geq 4, the difference between the right-hand side and the left-hand side is increasing in B for fixed C_{1}. Therefore, it suffices to prove the case B=16. Since we can always take a sufficiently large universal constant C_{1} such that the inequality above holds when B=16, we conclude. ∎

Lemma D.9 (Chain rule of Kullback-Leibler divergence, Exercise 3.2 in [Wainwright, 2019]).

Given two nn-variate distributions \mathbb{Q} and \mathbb{P}, show that the Kullback-Leibler divergence can be decomposed as

D(;)=D(1;1)+j=2n𝔼1j1[D(j(X1j1);j(X1j1))],D(\mathbb{Q};\mathbb{P})=D\left(\mathbb{Q}_{1};\mathbb{P}_{1}\right)+\sum_{j=2}^{n}\mathbb{E}_{\mathbb{Q}_{1}^{j-1}}\left[D\left(\mathbb{Q}_{j}\left(\cdot\mid X_{1}^{j-1}\right);\mathbb{P}_{j}\left(\cdot\mid X_{1}^{j-1}\right)\right)\right],

where j(X1j1)\mathbb{Q}_{j}\left(\cdot\mid X_{1}^{j-1}\right) denotes the conditional distribution of XjX_{j} given (X1,,Xj1)\left(X_{1},\ldots,X_{j-1}\right) under \mathbb{Q}, with a similar definition for j(X1j1)\mathbb{P}_{j}\left(\cdot\mid X_{1}^{j-1}\right).

Lemma D.10 (Bretagnolle-Huber Inequality, Theorem 14.1 in [Lattimore and Szepesvári, 2020]).

Let \P and \mathbb{Q} be probability measures on the same measurable space (Ω,)(\Omega,\mathcal{F}), and let AA\in\mathcal{F} be an arbitrary event. Then,

(A)+(Ac)12exp(D(;))\P(A)+\mathbb{Q}\left(A^{c}\right)\geq\frac{1}{2}\exp(-\mathrm{D}(\P;\mathbb{Q}))

where Ac=Ω\AA^{c}=\Omega\backslash A is the complement of AA.

Lemma D.11 (Value Difference Lemma, Lemma E.15 in [Dann et al., 2017]).

For any two MDPs \cM\cM^{\prime} and \cM′′\cM^{\prime\prime} with rewards rr^{\prime} and r′′r^{\prime\prime} and transition probabilities \P^{\prime} and ′′\P^{\prime\prime}, the difference in values with respect to the same policy π\pi can be written as

Vi(s)Vi′′(s)\displaystyle V_{i}^{\prime}(s)-V_{i}^{\prime\prime}(s) =𝔼′′[t=iH(r(st,at,t)r′′(st,at,t))si=s]\displaystyle=\mathbb{E}^{\prime\prime}\left[\sum_{t=i}^{H}\left(r^{\prime}\left(s_{t},a_{t},t\right)-r^{\prime\prime}\left(s_{t},a_{t},t\right)\right)\mid s_{i}=s\right]
+𝔼′′[t=iH((st,at,t)′′(st,at,t))Vt+1si=s]\displaystyle+\mathbb{E}^{\prime\prime}\left[\sum_{t=i}^{H}\left(\P^{\prime}\left(s_{t},a_{t},t\right)-\P^{\prime\prime}\left(s_{t},a_{t},t\right)\right)^{\top}V_{t+1}^{\prime}\mid s_{i}=s\right]

where VH+1(s)=VH+1′′(s)=0V_{H+1}^{\prime}(s)=V_{H+1}^{\prime\prime}(s)=0 for any state ss and the expectation 𝔼\mathbb{E}^{\prime} is taken with respect to \P^{\prime} and π\pi and 𝔼′′\mathbb{E}^{\prime\prime} with respect to ′′\P^{\prime\prime} and π\pi.

Appendix E Details of the planning phase

In this section, we provide some details of the planning phase of algorithm 3. In the planning phase, we are given a reward function r:\cS\times\cA\to[0,1] and we compute an estimate \widetilde{\P}^{\dagger} of the sparsified transition dynamics, which is formally defined in section A.1.1. The goal of the planning phase is to compute the optimal policy \pi_{final} on the MDP specified by the transition dynamics \widetilde{\P}^{\dagger} and the reward function r^{\dagger}, where r^{\dagger} is the sparsified version of r: r^{\dagger}(s,a)=r(s,a) and r^{\dagger}(s^{\dagger},a)=0 for any a\in\cA. To compute the optimal policy, we iteratively apply the Bellman optimality equation. First, we define \widetilde{Q}_{H}(s,a)=r^{\dagger}(s,a) for any (s,a) and solve

πfinal,H(s)=argmaxa\cAr(s,a).\pi_{final,H}(s)=\mathop{\arg\max}_{a\in\cA}r^{\dagger}(s,a).

Then, for h=H1,H2,,2,1,h=H-1,H-2,...,2,1, we iteratively define

Q~h(s,a):=r(s,a)+s~(ss,a)Q~h+1(s,πfinal,h+1(s))\widetilde{Q}_{h}(s,a):=r^{\dagger}(s,a)+\sum_{s^{\prime}}\widetilde{\P}^{\dagger}\left(s^{\prime}\mid s,a\right)\widetilde{Q}_{h+1}(s^{\prime},\pi_{final,h+1}(s^{\prime}))

for any (s,a)(s,a), and then solve

πfinal,h(s)=argmaxa\cAQ~h(s,a)\pi_{final,h}(s)=\mathop{\arg\max}_{a\in\cA}\widetilde{Q}_{h}(s,a)

for any s\in\cS. For s^{\dagger} and any h\in[H], \pi_{final,h}(s^{\dagger}) can be an arbitrary action. Then, by the Bellman optimality equations, \pi_{final} is the optimal policy for \widetilde{\P}^{\dagger} and r^{\dagger}.
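This backward induction is standard finite-horizon dynamic programming; the following minimal sketch (in Python, with synthetic arrays standing in for \widetilde{\P}^{\dagger} and r^{\dagger}; all shapes, names, and values are illustrative) computes \widetilde{Q}_{h} and the greedy policy.

import numpy as np

# Minimal sketch of the planning phase: backward induction on the estimated
# sparsified MDP. State index S plays the role of the absorbing state s_dagger,
# and r_dagger is zero there. All arrays below are synthetic.
rng = np.random.default_rng(4)
S, A, H = 5, 3, 6
P_tilde = rng.dirichlet(np.ones(S + 1), size=(S + 1, A))
P_tilde[S] = 0.0
P_tilde[S, :, S] = 1.0                        # s_dagger is absorbing
r_dagger = np.concatenate([rng.random((S, A)), np.zeros((1, A))])

Q = np.zeros((H, S + 1, A))
pi_final = np.zeros((H, S + 1), dtype=int)
Q[H - 1] = r_dagger                           # corresponds to Q_H = r_dagger
pi_final[H - 1] = Q[H - 1].argmax(axis=1)
for h in range(H - 2, -1, -1):                # h = H-1, ..., 1 in the paper's indexing
    V_next = Q[h + 1].max(axis=1)             # Q_{h+1}(s', pi_final,h+1(s'))
    Q[h] = r_dagger + P_tilde @ V_next        # Bellman optimality backup
    pi_final[h] = Q[h].argmax(axis=1)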

Appendix F More related works

Other low-switching algorithms Low-switching learning algorithms were initially studied in the context of bandits, with the UCB2 algorithm [Auer et al., 2002] achieving an O(\cA\log K) switching cost. Gao et al. [2019] showed that an O(\cA\log\log K) switching cost is both sufficient and necessary for attaining the minimax-optimal regret in multi-armed bandits. In both adversarial and stochastic online learning, [Cesa-Bianchi et al., 2013] designed an algorithm that achieves an O(\log\log K) switching cost.

Reward-free reinforcement learning In reward-free reinforcement learning (RFRL), the goal is to find a near-optimal policy for any given reward function. [Jin et al., 2020a] proposed an algorithm based on EULER [Zanette and Brunskill, 2019] that can find an \varepsilon-optimal policy with \widetilde{O}(H^{5}\left|\cS\right|^{2}\left|\cA\right|/\varepsilon^{2}) trajectories. Subsequently, [Kaufmann et al., 2021] reduced the sample complexity by a factor of H by using uncertainty functions to upper bound the value estimation error. The sample complexity was further improved by another H factor by [Ménard et al., 2021].

A lower bound of \Omega(H^{2}\left|\cS\right|^{2}\left|\cA\right|/\varepsilon^{2}) was established for homogeneous MDPs by [Jin et al., 2020a], while an additional H factor is conjectured for the non-homogeneous case. [Zhang et al., 2021] proposed the first algorithm with matching sample complexity in the homogeneous case. Similar results are available with linear [Wang et al., 2020a, Wagenmaker et al., 2022, Zanette et al., 2020] and general function approximation [Chen et al., 2022, Qiu et al., 2021].

Offline reinforcement learning In offline reinforcement learning, the goal is to learn a near-optimal policy from an existing dataset generated by a (possibly very different) logging policy. Offline RL in tabular domains has been investigated extensively [Yin and Wang, 2020, Jin et al., 2020c, Nachum et al., 2019, Rashidinejad et al., 2021, Kallus and Uehara, 2022, Xie and Jiang, 2020a]. Similar results were shown using linear [Yin et al., Xiong et al., 2022, Nguyen-Tang et al., 2022, Zanette et al., 2021b] and general function approximation [Xie et al., 2021a, Long et al., 2021, Zhang et al., 2022, Duan et al., 2021, Jiang and Huang, 2020, Uehara and Sun, 2021, Zanette and Wainwright, 2022, Rashidinejad et al., 2022, Yin et al., 2022]. Offline RL is effective when the dataset ‘covers’ a near-optimal policy, as measured by a certain concentrability factor. In the function approximation setting, additional conditions, such as Bellman completeness, may need to be approximately satisfied [Munos and Szepesvári, 2008, Chen and Jiang, 2019, Zanette, 2023, Wang et al., 2020b, Foster et al., 2021, Zhan et al., 2022].

Task-agnostic reinforcement learning Another related line of work is task-agnostic RL, where N tasks are considered during the planning phase and the reward functions are collected from the environment instead of being provided directly. [Zhang et al., 2020a] presented the first task-agnostic algorithm, UBEZero, with a sample complexity of \widetilde{O}(H^{5}\left|\cS\right|\left|\cA\right|\log(N)/\varepsilon^{2}). Recently, [Li et al., 2023] proposed an algorithm that leverages offline RL techniques to estimate a well-behaved behavior policy in the reward-agnostic phase, achieving minimax sample complexity. Other works exploring effective exploration schemes in RL include [Hazan et al., 2019, Du et al., 2019, Misra et al., 2020].