
Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

Jiawei Huang†, Jinglin Chen†, Li Zhao‡, Tao Qin‡, Nan Jiang†, Tie-Yan Liu‡
† Department of Computer Science, University of Illinois at Urbana-Champaign
  {jiaweih, jinglinc, nanjiang}@illinois.edu
‡ Microsoft Research Asia
  {lizo, taoqin, tyliu}@microsoft.com
Work done during Jiawei Huang's internship at Microsoft Research Asia.
Abstract

Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL). Despite the community's increasing interest, a formal theoretical formulation of the problem has been lacking. In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal deployment complexity, whereas in each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limit in achieving deployment efficiency by establishing information-theoretic lower bounds, and provide algorithms that achieve the optimal deployment efficiency. Moreover, our formulation for DE-RL is flexible and can serve as a building block for other practically relevant settings; we give "Safe DE-RL" and "Sample-Efficient DE-RL" as two examples, which may be worth future investigation.

1 Introduction

In many real-world applications, deploying a new policy to replace the previous one is costly, while generating a large batch of samples with an already deployed policy can be relatively fast and cheap. For example, in recommendation systems (Afsar et al., 2021), education software (Bennane et al., 2013), and healthcare (Yu et al., 2019), a new recommendation, teaching, or medical treatment strategy must pass several internal tests to ensure safety and practicality before being deployed, which can be time-consuming. On the other hand, the algorithm may be able to collect a large number of samples in a short period of time if the system serves a large population of users. Similarly, in robotics applications (Kober et al., 2013), deploying a new policy usually involves operations at the hardware level, which requires non-negligible physical labor and long waiting periods, while sampling trajectories is relatively less laborious. However, deployment efficiency has been neglected in most of the existing RL literature. Even among the few works that consider this criterion (Bai et al., 2020; Gao et al., 2021; Matsushima et al., 2021), either the settings or methods have limitations in the scenarios described above, or a formal mathematical formulation is missing. We defer a detailed discussion of these related works to Section 1.1.

In order to close the gap between existing RL settings and real-world applications requiring high deployment efficiency, our first contribution is to provide a formal definition and tractable objective for Deployment-Efficient Reinforcement Learning (DE-RL) via an "optimization with constraints" perspective. Roughly speaking, we are interested in minimizing the number of deployments K under two constraints: (a) after deploying K times, the algorithm can return a near-optimal policy, and (b) the number of trajectories collected in each deployment, denoted as N, is at the same level across the K deployments, and it can be large but should still be polynomial in standard parameters. Similar to the notion of sample complexity in online RL, we will refer to K as deployment complexity.

To provide a more quantitative understanding, we instantiate our DE-RL framework in finite-horizon linear MDPs (Jin et al., 2019)¹ and develop the essential theory. The main questions we address are:
¹ Although we focus on linear MDPs, the core idea can be extended to more general settings such as RL with general function approximation (Kong et al., 2021).

Q1: What is the optimum of the deployment efficiency in our DE-RL setting?

Q2: Can we achieve the optimal deployment efficiency in our DE-RL setting?

When answering these questions, we separately study algorithms with or without the constraint of deploying deterministic policies each time. While deploying more general forms of policies can be practical (e.g., randomized experiments on a population of users can be viewed as deploying a mixture of deterministic policies), most previous theoretical works in related settings exclusively focused on upper and lower bounds for algorithms using deterministic policies (Jin et al., 2019; Wang et al., 2020b; Gao et al., 2021). As we will show, the origin of the difficulty in optimizing deployment efficiency and the principle of algorithm design for achieving optimal deployment efficiency are quite different in these two settings, and therefore we believe both are of independent interest.

As our second contribution, in Section 3, we answer Q1 by providing information-theoretic lower bounds on the number of deployments required under constraints (a) and (b) in Def. 2.1. We establish Ω(dH) and Ω̃(H) lower bounds for algorithms with and without the constraint of deploying deterministic policies, respectively. Contrary to the impression given by previous empirical works (Matsushima et al., 2021), even if we can deploy unrestricted policies, the minimal number of deployments cannot be reduced to a constant without additional assumptions, which sheds light on the fundamental limitation in achieving deployment efficiency. Besides, the line of work on "horizon-free RL" (e.g., Wang et al., 2020a) shows that RL is not significantly harder than bandits (i.e., H = 1) when we consider sample complexity. In contrast, the H-dependence in our lower bound reveals fundamental hardness that is specific to long-horizon RL in the deployment-efficient setting.² Such hardness was originally conjectured by Jiang & Agarwal (2018), but has not been established in sample-complexity settings.
² Although Wang et al. (2020a) considered stationary MDPs, as shown in our Corollary 3.3, the lower bound on deployment complexity still depends on H.

After identifying the fundamental limits of deployment efficiency, as our third contribution, we address Q2 by proposing novel algorithms whose deployment efficiency matches the lower bounds. In Section 4.1, we propose an algorithm that deploys deterministic policies, based on Least-Squares Value Iteration with a reward bonus (Jin et al., 2019) and a layer-by-layer exploration strategy, which returns an ε-optimal policy within O(dH) deployments. As part of its analysis, we prove Lemma 4.2 as a technical contribution, which can be regarded as a batched finite-sample version of the well-known "Elliptical Potential Lemma" (Carpentier et al., 2020) and may be of independent interest. Moreover, our analysis based on Lemma 4.2 applies to the reward-free setting (Jin et al., 2020; Wang et al., 2020b) and achieves the same optimal deployment efficiency. In Section 4.2, we focus on algorithms that can deploy arbitrary policies. This setting is much more challenging because it requires finding a provably exploratory stochastic policy without interacting with the environment. To our knowledge, Agarwal et al. (2020b) is the only work tackling a similar problem, but their algorithm is model-based and relies on a strong realizability assumption on the true dynamics as well as a sampling oracle that allows the agent to sample data from the model; how to solve the problem in linear MDPs without a model class remains an open problem. To overcome this challenge, we propose a model-free layer-by-layer exploration algorithm based on a novel covariance matrix estimation technique, and prove that it requires Θ(H) deployments to return an ε-optimal policy, which differs from the lower bound Ω̃(H) only by a logarithmic factor. Although the per-deployment sample complexity of our algorithm depends on a "reachability coefficient" (see Def. 4.3), similar quantities appear in related works (Zanette et al., 2020; Agarwal et al., 2020b; Modi et al., 2021); we conjecture that such dependence is unavoidable and leave the investigation to future work.

Finally, thanks to the flexibility of our “optimization with constraints” perspective, our DE-RL setting can serve as a building block for more advanced and practically relevant settings where optimizing the number of deployments is an important consideration. In Appendix F, we propose two potentially interesting settings: “Safe DE-RL” and “Sample-Efficient DE-RL”, by introducing constraints regarding safety and sample efficiency, respectively.

1.1 Closely Related Works

We defer the detailed discussion of the literature on pure online RL and pure offline RL to Appendix A, and in this section focus on works that consider deployment efficiency and are most closely related to ours.

To our knowledge, the term "deployment efficiency" was first coined by Matsushima et al. (2021), but they did not provide a concrete mathematical formulation that is amenable to theoretical investigation. In existing theoretical works, low switching cost is a concept closely related to deployment efficiency, and has been studied in both bandit (Esfandiari et al., 2020; Han et al., 2020; Gu et al., 2021; Ruan et al., 2021) and RL settings (Bai et al., 2020; Gao et al., 2021; Kong et al., 2021). Another related concept is concurrent RL, as proposed by Guo & Brunskill (2015). We highlight the differences in two respects: the problem setting and the techniques.

As for the problem setting, the existing literature on low switching cost mainly focuses on sub-linear regret guarantees, which do not directly imply a near-optimal policy after a given number of policy deployments.³ Besides, low-switching-cost RL algorithms (Bai et al., 2020; Gao et al., 2021; Kong et al., 2021) rely on adaptive switching strategies (i.e., the interval between policy switches is not fixed), which can be difficult to implement in practical scenarios. For example, in recommendation or education systems, once deployed, a policy usually needs to interact with the population of users for a fair amount of time and generate a lot of data. Moreover, since policy preparation is time-consuming (which is what motivates our work to begin with), it is practically difficult, if not impossible, to change the policy immediately once enough data for a policy update has been collected, and the preparation time becomes a significant overhead compared to a short policy-switching interval. Therefore, in the applications we target, it is more reasonable to assume that the sample size in each deployment (i.e., between policy switches) is of the same order of magnitude and is large enough that the overhead of policy preparation can be ignored.
³ Although the conversion from sub-linear regret to polynomial sample complexity is possible ("online-to-batch"), we show in Appendix A that, to achieve accuracy ε after conversion, the number of deployments of previous low-switching-cost algorithms depends on ε, whereas our guarantee does not.

More importantly, on the technical side, previous theoretical works on low switching cost mostly use deterministic policies in each deployment, which are easier to analyze. This also applies to the work of Guo & Brunskill (2015) on concurrent PAC RL. However, if the agent can deploy stochastic (and possibly non-Markov) policies (e.g., a mixture of deterministic policies), then intuitively—and as reflected in our lower bounds—exploration can be done much more deployment-efficiently, and we provide a stochastic-policy algorithm that achieves an Õ(H) deployment complexity and overcomes the Ω(dH) lower bound for deterministic-policy algorithms (Gao et al., 2021).

2 Preliminaries

Notation

Throughout the paper, for n ∈ ℤ⁺, we denote [n] = {1, 2, ..., n}, and ⌈·⌉ denotes the ceiling function. Unless otherwise specified, for a vector x ∈ ℝ^d and a matrix X ∈ ℝ^{d×d}, ‖x‖ denotes the vector l₂-norm of x and ‖X‖ denotes the largest singular value of X. We use standard big-O notation O(·), Ω(·), Θ(·), and notation such as Õ(·) to suppress logarithmic factors.

2.1 Episodic Reinforcement Learning

We consider an episodic Markov Decision Process denoted by M(𝒮, 𝒜, H, P, r), where 𝒮 is the state space, 𝒜 is the finite action space, H is the horizon length, and P = {P_h}_{h=1}^H and r = {r_h}_{h=1}^H denote the transition and reward functions. At the beginning of each episode, the environment samples an initial state s₁ from the initial state distribution d₁. Then, for each time step h ∈ [H], the agent selects an action a_h ∈ 𝒜, interacts with the environment, receives a reward r_h(s_h, a_h), and transitions to the next state s_{h+1}. The episode terminates once s_{H+1} is reached.

A (Markov) policy π_h(·) at step h is a function mapping 𝒮 → Δ(𝒜), where Δ(𝒜) denotes the probability simplex over the action space. With a slight abuse of notation, when π_h(·) is a deterministic policy, we write π_h(·): 𝒮 → 𝒜. A full (Markov) policy π = {π₁, π₂, ..., π_H} specifies such a mapping for each time step. We use V^π_h(s) and Q^π_h(s, a) to denote the value function and Q-function at step h ∈ [H], defined as:

V^{\pi}_{h}(s)=\mathbb{E}\Big[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\,\Big|\,s_{h}=s,\pi\Big],\qquad Q^{\pi}_{h}(s,a)=\mathbb{E}\Big[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\,\Big|\,s_{h}=s,a_{h}=a,\pi\Big]

We also use V^*_h(·) and Q^*_h(·, ·) to denote the optimal value functions, and use π^* to denote the optimal policy that maximizes the expected return J(π) := \mathbb{E}[\sum_{h=1}^{H}r_{h}(s_{h},a_{h})\,|\,\pi]. On some occasions, we use V^π_h(s; r) and Q^π_h(s, a; r) to denote the value functions with respect to r as the reward function for disambiguation purposes. The corresponding optimal value functions and optimal policy are denoted by V^*(s; r), Q^*(s, a; r), and π^*_r, respectively.

Non-Markov Policies

While we focus on Markov policies in the above definition, some of our results apply to or require more general forms of policies. For example, our lower bounds apply to non-Markov policies that can depend on the history (e.g., \mathcal{S}_{1}\times\mathcal{A}_{1}\times\mathbb{R}\times\cdots\times\mathcal{S}_{h-1}\times\mathcal{A}_{h-1}\times\mathbb{R}\times\mathcal{S}_{h}\rightarrow\mathcal{A} for deterministic policies); our algorithm for arbitrary policies deploys a mixture of deterministic Markov policies, which corresponds to choosing a deterministic policy from a given set at the initial state and following that policy for the entire trajectory. This can be viewed as a non-Markov stochastic policy.

2.2 Linear MDP Setting

We mainly focus on linear MDPs (Jin et al., 2019) satisfying the following assumptions:

Assumption A (Linear MDP Assumptions).

An MDP ℳ = (𝒮, 𝒜, H, P, r) is said to be a linear MDP with a feature map φ: 𝒮 × 𝒜 → ℝ^d if the following hold for any h ∈ [H]:

  • There exist d unknown signed measures μ_h = (μ_h^{(1)}, μ_h^{(2)}, ..., μ_h^{(d)}) over 𝒮 such that for any (s, a, s′) ∈ 𝒮 × 𝒜 × 𝒮, P_h(s′|s, a) = ⟨μ_h(s′), φ(s, a)⟩.

  • There exists an unknown vector θ_h ∈ ℝ^d such that for any (s, a) ∈ 𝒮 × 𝒜, r_h(s, a) = ⟨φ(s, a), θ_h⟩.

Similar to Jin et al. (2019) and Wang et al. (2020b), without loss of generality, we assume that for all (s, a) ∈ 𝒮 × 𝒜 and h ∈ [H], ‖φ(s, a)‖ ≤ 1, ‖μ_h‖ ≤ √d, and ‖θ_h‖ ≤ √d. In Section 3 we also refer to linear MDPs with stationary dynamics, which is the special case where μ₁ = μ₂ = ... = μ_H and θ₁ = θ₂ = ... = θ_H.

2.3 A Concrete Definition of DE-RL

In the following, we introduce our formulation of DE-RL in linear MDPs. For a comparison with existing works, please refer to Section 1.1.

Definition 2.1 (Deployment Complexity in Linear MDPs).

We say that an algorithm has deployment complexity K in linear MDPs if the following holds: given an arbitrary linear MDP satisfying Assumption A and arbitrary ε > 0 and 0 < δ < 1, the algorithm returns a policy π_K after K deployments, collecting at most N trajectories in each deployment, under the following constraints:

  (a) With probability 1 − δ, π_K is ε-optimal, i.e., J(π_K) ≥ max_π J(π) − ε.

  (b) The sample size N is polynomial, i.e., N = poly(d, H, 1/ε, log(1/δ)). Moreover, N should be fixed a priori and cannot change adaptively from deployment to deployment.

Under this definition, the goal of Deployment-Efficient RL is to design algorithms with provable guarantees of low deployment complexity.
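For concreteness, the interaction protocol implied by Definition 2.1 can be sketched as follows; env, algorithm, and their methods are hypothetical placeholders rather than part of the formal definition.

# Minimal sketch of the interaction protocol implied by Definition 2.1 (hypothetical interfaces).
def deployment_efficient_loop(env, algorithm, K, N):
    """Run K deployments; each deployment collects exactly N trajectories with a fixed policy."""
    data = []
    for k in range(K):
        policy_k = algorithm.propose_policy(data)           # computed offline from all past data
        batch = [env.rollout(policy_k) for _ in range(N)]   # no policy update within a deployment
        data.extend(batch)
    return algorithm.extract_final_policy(data)             # should be eps-optimal w.p. >= 1 - delta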

Polynomial Size of N

We emphasize that the restriction to polynomially large N is crucial to our formulation, and dropping it can result in degenerate solutions. For example, if N were allowed to be exponentially large, we could finish exploration in a single deployment in the arbitrary-policy setting by deploying a mixture of exponentially many policies that form an ε-net of the policy space. Alternatively, we could sample actions uniformly and use importance sampling (Precup, 2000) to evaluate all policies in an off-policy manner. Neither of these solutions is practically feasible, and both are excluded by our restriction on N.
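To illustrate why the uniform-sampling alternative is ruled out, note that the per-trajectory importance weight for evaluating a deterministic policy from uniformly random actions is |𝒜|^H on the (exponentially rare) matching trajectories, so a low-variance estimate needs exponentially many samples. The sketch below, with hypothetical trajectory and policy objects, simply computes this weight.

# Sketch: per-trajectory importance weight when the behavior policy is uniform over |A| actions.
def importance_weight(trajectory, target_policy, num_actions):
    """trajectory: list of (s_h, a_h); target_policy: deterministic map s -> a (hypothetical)."""
    weight = 1.0
    for s_h, a_h in trajectory:
        # target prob. is 1 if the action matches the target policy, else 0; behavior prob. is 1/|A|
        weight *= (1.0 if target_policy(s_h) == a_h else 0.0) * num_actions
    return weight  # equals |A|^H on matching trajectories, which occur with probability |A|^{-H}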

3 Lower Bound for Deployment Complexity in RL

In this section, we provide information-theoretic lower bounds of the deployment complexity in our DE-RL setting. We defer the lower bound construction and the proofs to Appendix B. As mentioned in Section 2, we consider non-Markov policies when we refer to deterministic and stochastic policies in this section, which strengthens our lower bounds as they apply to very general forms of policies.

We first study algorithms that can only deploy a deterministic policy at each deployment.

Theorem 3.1.

[Lower bound for deterministic policies, informal] For any d ≥ 4 and H, and any algorithm ψ that can only deploy a deterministic policy at each deployment, there exists a linear MDP M satisfying Assumption A such that the deployment complexity of ψ in M is K = Ω(dH).

The basic idea of our construction and proof is that a linear MDP with dimension d and horizon length H has Ω(dH) "independent directions", while a deterministic policy has limited exploration capacity and can only cover Θ(1) directions per deployment, resulting in Ω(dH) deployments in the worst case.

In the next theorem, we show that even if the algorithm can use an arbitrary exploration strategy (e.g., maximizing entropy, adding reward bonuses), without additional assumptions the number of deployments K still has to depend on H and cannot be reduced to a constant when H is large.

Theorem 3.2.

[Lower bound for arbitrary policies, informal] For any d ≥ 4, H, N, and any algorithm ψ that can deploy arbitrary policies, there exists a linear MDP M satisfying Assumption A such that the deployment complexity of ψ in M is K = Ω(H/⌈log_d(NH)⌉) = Ω̃(H).

The origin of the difficulty can be illustrated by a recursive dilemma: in the worst case, if the agent does not have enough information at layer h, then within one deployment it cannot identify a good policy to explore up to layer h + Ω(log_d(NH)), and so on. Given that we enforce N to be polynomial, the agent can only push the "information boundary" forward by Ω(log_d(NH)) = Ω̃(1) layers per deployment. In many real-world applications, such difficulty can indeed arise. For example, in healthcare, the entire treatment is often divided into multiple stages. If the treatment in stage h is not effective, the patient may refuse to continue, which can result in insufficient samples for identifying a policy that performs well in stage h + 1.

Stationary vs. non-stationary dynamics

Since we consider non-stationary dynamics in Assump. A, one may suspect that the H-dependence in the lower bound is mainly due to such non-stationarity. We show that this is not quite the case: the H-dependence persists for stationary dynamics. In fact, our lower bound for non-stationary dynamics directly implies one for stationary dynamics: given a finite-horizon non-stationary MDP M̃ = (S̃, 𝒜, H, P̃, r̃), we can construct a stationary MDP M = (𝒮, 𝒜, H, P, r) by expanding the state space to 𝒮 = S̃ × [H], so that the new transition function P and reward function r are stationary across time steps. As a result, given arbitrary d ≥ 4 and H ≥ 2, we can construct a hard non-stationary MDP instance M̃ with dimension d̃ = max{4, d/H} and horizon h̃ = d/d̃ = min{H, d/4}, and convert it to a stationary MDP M with dimension d and horizon h = h̃ = min{H, d/4} ≤ H. If there exists an algorithm that can solve M in K deployments, then it can be used to solve M̃ in no more than K deployments. Therefore, the lower bounds for stationary MDPs can be extended from Theorems 3.1 and 3.2, as shown in the following corollary:

Corollary 3.3 (Extension to Stationary MDPs).

For stationary linear MDPs with d ≥ 4 and H ≥ 2, and N = poly(d, H, 1/ε, log(1/δ)), the lower bound on deployment complexity is Ω(d) for deterministic-policy algorithms and \Omega\big(\frac{\min\{d/4,H\}}{\lceil\log_{\max\{d/H,4\}}NH\rceil}\big)=\widetilde{\Omega}(\min\{d,H\}) for algorithms that can deploy arbitrary policies.

As we can see, the dependence on the dimension and the horizon is not eliminated even under the stronger assumption that the MDP is stationary. The intuition is that, although the transition function is stationary, some states may not be reachable from the initial state distribution within a small number of steps, so the stationary MDP can effectively have a "layered" structure. For example, in Atari games (Bellemare et al., 2013) (where many algorithms like DQN (Mnih et al., 2013) model the environments as infinite-horizon discounted MDPs) such as Breakout, the agent cannot observe states where most of the bricks are knocked out at the initial stage of a trajectory. Therefore, the agent can still only push the "information frontier" forward a few steps per deployment. That said, it is possible to reduce the deployment complexity lower bound in stationary MDPs by adding more assumptions, such as the initial state distribution providing good coverage over the entire state space, or all states being reachable within the first few time steps. However, because these assumptions do not always hold and may overly trivialize the exploration problem, we do not consider them in our algorithm design. Besides, although our algorithms in the next section are designed for non-stationary MDPs, they can be extended to stationary MDPs by sharing covariance matrices across time steps, and we believe the analyses can also be extended to match the lower bound in Corollary 3.3.
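The reduction from non-stationary to stationary dynamics used above is just a state augmentation with the time index. Below is a minimal sketch, assuming a hypothetical dictionary-based representation of a tabular finite-horizon MDP; it illustrates the construction only and is not the lower-bound instance itself.

# Sketch: make a non-stationary finite-horizon MDP stationary by appending the time step to the state.
def make_stationary(P_nonstat, r_nonstat, H):
    """P_nonstat[h][(s, a)] -> dict of next-state probabilities; r_nonstat[h][(s, a)] -> reward."""
    P_stat, r_stat = {}, {}
    for h in range(1, H + 1):
        for (s, a), next_dist in P_nonstat[h].items():
            # augmented state (s, h); successors move to layer h + 1 with the same probabilities
            P_stat[((s, h), a)] = {(s_next, h + 1): p for s_next, p in next_dist.items()}
            r_stat[((s, h), a)] = r_nonstat[h][(s, a)]
    return P_stat, r_stat  # a single transition/reward function shared by all time steps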

4 Towards Optimal Deployment Efficiency

In this section we provide algorithms with deployment-efficiency guarantees that nearly match the lower bounds established in Section 3. Although our lower bound results in Section 3 consider non-Markov policies, our algorithms in this section only use Markov policies (or a mixture of Markov policies, in the arbitrary policy setting), which are simpler to implement and compute and are already near-optimal in deployment efficiency.

Inspiration from Lower Bounds: a Layer-by-Layer Exploration Strategy  The linear dependence on H in the lower bounds suggests a potentially deployment-efficient way to explore, which we call a layer-by-layer strategy: conditioning on sufficient exploration of the previous h − 1 time steps, we can use poly(d) deployments to sufficiently explore the h-th time step, so we only need H · poly(d) deployments to explore the entire MDP. If we can reduce the deployment cost of each layer from poly(d) to Θ(d) or even Θ(1), we achieve the optimal deployment efficiency. Besides, as another motivation, in Appendix C.4 we briefly discuss the additional benefits of the layer-by-layer strategy, which are useful especially in "Safe DE-RL". In Sections 4.1 and 4.2, we introduce algorithms based on this idea and provide theoretical guarantees.

4.1 Deployment-Efficient RL with Deterministic Policies

1   Input: failure probability δ > 0, target accuracy ε > 0, bonus coefficient \beta\leftarrow c_{\beta}\cdot dH\sqrt{\log(dH\delta^{-1}\varepsilon^{-1})} for some c_{\beta}>0, total number of deployments K, batch size N
2   h_{1}\leftarrow 1        // h_k denotes the layer to explore in iteration k, for all k ∈ [K]
3   for k = 1, 2, ..., K do
4       Q^{k}_{h_{k}+1}(\cdot,\cdot)\leftarrow 0 and V^{k}_{h_{k}+1}(\cdot)\leftarrow 0
5       for h = h_k, h_k − 1, ..., 1 do
6           \Lambda^{k}_{h}\leftarrow I+\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}(\phi_{h}^{\tau n})^{\top},\quad u_{h}^{k}(\cdot,\cdot)\leftarrow\min\{\beta\cdot\sqrt{\phi(\cdot,\cdot)^{\top}(\Lambda^{k}_{h})^{-1}\phi(\cdot,\cdot)},\,H\}
7           w^{k}_{h}\leftarrow(\Lambda^{k}_{h})^{-1}\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\cdot V^{k}_{h+1}(s^{\tau n}_{h+1})
8           Q^{k}_{h}(\cdot,\cdot)\leftarrow\min\{(w^{k}_{h})^{\top}\phi(\cdot,\cdot)+r_{h}(\cdot,\cdot)+u^{k}_{h}(\cdot,\cdot),\,H\} and V^{k}_{h}(\cdot)\leftarrow\max_{a\in\mathcal{A}}Q^{k}_{h}(\cdot,a)
9           \pi^{k}_{h}(\cdot)\leftarrow\arg\max_{a\in\mathcal{A}}Q^{k}_{h}(\cdot,a)
10      end for
11      Define \pi^{k}=\pi^{k}_{1}\circ\pi^{k}_{2}\circ\cdots\circ\pi^{k}_{h_{k}}\circ\mathrm{unif}_{[h_{k}+1:H]}
12      for n = 1, ..., N do
13          Receive initial state s_{1}^{kn}\sim d_{1}
14          for h = 1, 2, ..., H do: take action a^{kn}_{h}\leftarrow\pi^{k}_{h}(s_{h}^{kn}) and observe s_{h+1}^{kn}\sim P_{h}(\cdot\,|\,s^{kn}_{h},a^{kn}_{h})
15      end for
16      Compute \Delta_{k}\leftarrow\frac{2\beta}{N}\sum_{n=1}^{N}\sum_{h=1}^{h_{k}}\sqrt{\phi(s_{h}^{kn},a_{h}^{kn})^{\top}(\Lambda_{h}^{k})^{-1}\phi(s_{h}^{kn},a_{h}^{kn})}
17      if \Delta_{k}\geq\frac{\varepsilon h_{k}}{2H} then h_{k+1}\leftarrow h_{k}
18      else if h_{k}=H then return \pi^{k}
19      else h_{k+1}\leftarrow h_{k}+1
20  end for
Algorithm 1: Layer-by-Layer Batch Exploration Strategy for Linear MDPs Given Reward Function

In this subsection, we focus on the setting where each deployed policy is deterministic. In Alg. 1, we propose a provably deployment-efficient algorithm built on Least-Squares Value Iteration with UCB (Jin et al., 2019)⁴ and the "layer-by-layer" strategy. Briefly speaking, at deployment k, we focus on exploring the first h_k layers and compute π^k_1, π^k_2, ..., π^k_{h_k} by running LSVI-UCB in the MDP truncated at step h_k. After that, we deploy π^k to collect N trajectories, completing each trajectory after time step h_k with an arbitrary policy (in the pseudocode we choose the uniform policy, but the choice is inconsequential). We then compute Δ_k from the collected samples and use it to decide whether to move on to the next layer, until all H layers have been explored. The theoretical guarantee is stated below, and the missing proofs are deferred to Appendix C.
⁴ In order to align with the algorithm in the reward-free setting, slightly differently from Jin et al. (2019) but similarly to Wang et al. (2020b), we run linear regression on P_h V_h instead of Q_h.
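For intuition, the backward least-squares pass inside one deployment of Alg. 1 can be sketched as follows. This is a simplified numpy sketch under extra assumptions (a finite candidate action set, features of past data precomputed per time step, and exact maximization over actions); it is not the paper's implementation.

import numpy as np

# Sketch of the backward LSVI-UCB pass of Alg. 1 for one deployment k (simplified; see assumptions above).
def lsvi_ucb_backward(phi_data, next_states, reward_fn, feature_fn, actions, beta, H, h_k, d):
    """phi_data[h]: array of past step-h features, shape (m, d); next_states[h]: matching s_{h+1}'s."""
    V_next = lambda s: 0.0                                   # V^k_{h_k + 1} = 0
    policies = {}
    for h in range(h_k, 0, -1):
        Phi = phi_data[h]
        Lam = np.eye(d) + Phi.T @ Phi                        # Lambda^k_h
        targets = np.array([V_next(s) for s in next_states[h]])
        w = np.linalg.solve(Lam, Phi.T @ targets)            # ridge-regression weights w^k_h
        Lam_inv = np.linalg.inv(Lam)

        def Q(s, a, w=w, Lam_inv=Lam_inv, h=h):
            f = feature_fn(s, a)
            bonus = min(beta * np.sqrt(f @ Lam_inv @ f), H)  # UCB bonus u^k_h
            return min(w @ f + reward_fn(h, s, a) + bonus, H)

        policies[h] = lambda s, Q=Q: max(actions, key=lambda a: Q(s, a))   # greedy pi^k_h
        V_next = lambda s, Q=Q: max(Q(s, a) for a in actions)              # V^k_h
    return policies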

Theorem 4.1 (Deployment Complexity).

For arbitrary ε, δ > 0 and arbitrary c_K ≥ 2, as long as N\geq c\big(c_{K}\frac{H^{4c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\delta\varepsilon})\big)^{\frac{1}{c_{K}-1}}, where c is an absolute constant, by choosing

K=c_{K}dH+1.    (1)

Algorithm 1 terminates at some iteration k ≤ K and returns a policy π^k such that, with probability 1 − δ, \mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1})-V_{1}^{\pi^{k}}(s_{1})]\leq\varepsilon.

As an interesting observation, Eq. (1) reflects a trade-off between the magnitudes of K and N when K is small. To see this, when we increase c_K (while keeping it a constant), K increases while N decreases because its dependence on d, H, ε, δ becomes milder. Moreover, the benefit of increasing c_K is only notable when c_K is small (e.g., N = O(H^9 d^6 ε^{-4}) if c_K = 2, while N = O(H^5 d^{3.6} ε^{-2.4}) if c_K = 6), and even for moderately large c_K the value of N quickly approaches the limit \lim_{c_K\to\infty}N=c\frac{H^{4}d^{3}}{\varepsilon^{2}}\log^{2}(\frac{Hd}{\delta\varepsilon}). Whether the trade-off in Eq. (1) is tight remains an open problem, which we leave for future work.
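To see where these exponents come from, note that the dominant polynomial part of the bound on N in Theorem 4.1 is (H^{4c_K+1} d^{3c_K} ε^{-2c_K})^{1/(c_K-1)}. The small sketch below, which ignores absolute constants and logarithmic factors and is purely illustrative, prints the resulting exponents for a few values of c_K.

# Sketch: exponents of H, d, and 1/eps in the per-deployment sample bound of Theorem 4.1,
# ignoring absolute constants and logarithmic factors.
for c_K in [2, 3, 6, 20]:
    p = 1.0 / (c_K - 1)
    print(f"c_K={c_K}: K = {c_K}*dH + 1, N ~ H^{(4 * c_K + 1) * p:.1f} d^{3 * c_K * p:.1f} eps^-{2 * c_K * p:.1f}")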

Another key step in proving the deployment efficiency of Alg. 1 is Lem. 4.2 below. In fact, by directly applying Lem. 4.2 to LSVI-UCB (Jin et al., 2019) with large batch sizes, we can already achieve O(dH) deployment complexity in the deterministic-policy setting without exploring in a layer-by-layer manner. We defer this discussion and the additional benefits of the layer-by-layer strategy to Appx. C.4.

Lemma 4.2.

[Batched Finite-Sample Elliptical Potential Lemma] Consider a sequence of matrices \mathbf{A}_{0},\mathbf{A}_{N},...,\mathbf{A}_{(K-1)N}\in\mathbb{R}^{d\times d} with \mathbf{A}_{0}=I_{d\times d} and \mathbf{A}_{kN}=\mathbf{A}_{(k-1)N}+\Phi_{k-1}, where \Phi_{k-1}=\sum_{t=(k-1)N+1}^{kN}\phi_{t}\phi_{t}^{\top} and \max_{t\leq KN}\|\phi_{t}\|\leq 1. Define \mathcal{K}^{+}:=\{k\in[K]\,|\,\mathrm{Tr}(\mathbf{A}_{(k-1)N}^{-1}\Phi_{k-1})\geq N\varepsilon\}. For arbitrary ε < 1 and arbitrary c_K ≥ 2, if K = c_K dH + 1, then by choosing N\geq c\big(c_{K}\frac{Hd^{c_{K}}}{\varepsilon^{c_{K}}}\log^{c_{K}}(\frac{Hd}{\varepsilon})\big)^{\frac{1}{c_{K}-1}}, where c is an absolute constant independent of c_K, d, H, and ε, we have |\mathcal{K}^{+}|\leq c_{K}d<K/H.
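To make the quantity in Lemma 4.2 concrete, the following numpy sketch counts, for randomly drawn unit-norm features (an arbitrary illustrative choice), how many batches satisfy Tr(A_{(k-1)N}^{-1} Φ_{k-1}) ≥ Nε; it is a numerical illustration only and plays no role in the proof.

import numpy as np

# Illustrative check of the quantity bounded in Lemma 4.2, on random unit-norm features.
def count_large_potential_batches(d=5, K=40, N=200, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    A = np.eye(d)                                              # A_0 = I
    large_batches = 0
    for k in range(K):
        phis = rng.normal(size=(N, d))
        phis /= np.linalg.norm(phis, axis=1, keepdims=True)    # enforce ||phi_t|| <= 1
        Phi = phis.T @ phis                                    # Phi_{k-1} = sum_t phi_t phi_t^T
        if np.trace(np.linalg.inv(A) @ Phi) >= N * eps:        # membership in K^+
            large_batches += 1
        A = A + Phi                                            # A_{kN} = A_{(k-1)N} + Phi_{k-1}
    return large_batches                                       # Lemma 4.2 bounds this by c_K * d

print(count_large_potential_batches())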

Extension to Reward-free setting

Based on a similar methodology, we can design algorithms for the reward-free setting (Wang et al., 2020b) and obtain O(dH) deployment complexity. We defer the algorithms and proofs to Appx. D and summarize the main result in Thm. D.4.

4.2 Deployment-Efficient RL with Arbitrary Policies

1   Input: accuracy level ε; iteration number i_max; resolution ε_0; reward r; bonus coefficient β
2   for h = 1, 2, ..., H do
3       Initialize π_{h,1} with an arbitrary deterministic policy; \widetilde{\Sigma}_{h,1}=2I, \Pi_{h}=\{\}
4       for i = 1, 2, ..., i_max do
5           \widehat{\Lambda}^{\pi_{h,i}}_{h}\leftarrow{\rm EstimateCovMatrix}(h,D_{[1:h-1]},\Sigma_{[1:h-1]},\pi_{h,i})        // Alg 6, Appx E
6           \widetilde{\Sigma}_{h,i+1}=\widetilde{\Sigma}_{h,i}+\widehat{\Lambda}^{\pi_{h,i}}_{h}
7           V_{h,i+1},\ \bar{\pi}_{h,i+1}\leftarrow{\rm SolveOptQ}(h,D_{[1:h-1]},\Sigma_{[1:h-1]},\beta,\widetilde{\Sigma}_{h,i+1},\varepsilon_{0})        // Alg 5, Appx E
8           if V_{h,i+1}\leq 3\nu^{2}_{\min}/8 then break
9           \Pi_{h}=\Pi_{h}\cup\{\bar{\pi}_{h,i+1}\}
10      end for
11      \Sigma_{h}=I, D_{h}=\{\}, \pi_{h,\mathrm{mix}}:=\mathrm{unif}(\Pi_{h})
12      for n = 1, 2, ..., N do
13          Sample a trajectory with \pi_{h,\mathrm{mix}}
14          \Sigma_{h}=\Sigma_{h}+\phi(s_{h,n},a_{h,n})\phi(s_{h,n},a_{h,n})^{\top},\quad D_{h}=D_{h}\cup\{(s_{h,n},a_{h,n},r_{h,n},s_{h+1,n})\}
15      end for
16  end for
17  return \widehat{\pi}_{r}\leftarrow Alg 4(H,\{D_{1},...,D_{H}\},r)
Algorithm 2: Deployment-Efficient RL with Covariance Matrix Estimation

From the discussion of lower bounds in Section 3, we know that in order to reduce the deployment complexity from Ω(dH) to Ω̃(H), we have to utilize stochastic (and possibly non-Markov) policies and try to explore as many different directions as possible in each deployment (as opposed to one direction in Algorithm 1). The key challenge is to find a stochastic policy—before the deployment starts—which can sufficiently explore d independent directions.

In Alg. 2, we overcome this difficulty with a new covariance matrix estimation method (Alg. 6 in Appx. E). The basic idea is that, for an arbitrary policy π,⁵ the covariance matrix \Lambda^{\pi}_{h}:=\mathbb{E}_{\pi}[\phi\phi^{\top}] can be estimated element-wise by running policy evaluation for π with φ_iφ_j as the reward function, where i, j ∈ [d] and φ_i denotes the i-th component of the vector φ.
⁵ Here we mainly focus on evaluating deterministic policies or stochastic policies mixed from a finite number of deterministic policies, because for other stochastic policies exactly computing the expectation over the policy distribution may be intractable.
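A minimal sketch of this element-wise idea is given below; policy_eval is a hypothetical stand-in for the regression-based off-policy evaluator (Alg. 6 in Appx. E) and not the paper's exact procedure.

import numpy as np

# Sketch: estimate Lambda^pi_h = E_pi[phi(s_h,a_h) phi(s_h,a_h)^T] element-wise via policy evaluation,
# assuming a hypothetical evaluator policy_eval(pi, h, reward_fn) ~= E_pi[reward_fn(s_h, a_h)].
def estimate_cov_matrix(pi, h, feature_fn, d, policy_eval):
    Lam_hat = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            # use the product of two feature coordinates as a scalar "reward" at step h
            reward_ij = lambda s, a, i=i, j=j: feature_fn(s, a)[i] * feature_fn(s, a)[j]
            Lam_hat[i, j] = policy_eval(pi, h, reward_ij)
            Lam_hat[j, i] = Lam_hat[i, j]                      # the matrix is symmetric
    return Lam_hat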

However, a new challenge is that, because the transitions are stochastic, in order to guarantee low evaluation error for all possible policies \bar{\pi}_{h,i+1}, we need a union bound over all policies to be evaluated, which is problematic if the policy class is infinite. To overcome this issue, we discretize the value functions in Algorithm 5 (see Appendix E) to allow for a union bound over the policy space: after computing the Q-function by LSVI-UCB, before converting it to a greedy policy, we first project it to an ε_0-net of the entire Q-function class. In this way, the number of policy candidates is finite, and the projection error can be controlled as long as ε_0 is small enough.
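One simple way to realize the projection is to round the Q-function's parameters to a grid of resolution ε_0, so that only finitely many greedy policies can arise. The sketch below uses a hypothetical linear-plus-bonus parameterization (a weight vector and the inverse covariance used by the bonus) purely for illustration.

import numpy as np

# Sketch: project a linear-plus-bonus Q-function onto an eps0-net by rounding its parameters,
# so the set of induced greedy policies is finite (the parameterization here is an assumption).
def project_to_net(w, Sigma_inv, eps0):
    w_rounded = np.round(w / eps0) * eps0                      # round each weight coordinate
    Sigma_inv_rounded = np.round(Sigma_inv / eps0) * eps0      # round the bonus matrix entry-wise
    return w_rounded, Sigma_inv_rounded

def greedy_policy(w, Sigma_inv, feature_fn, actions, beta, H):
    def act(s):
        def q(a):
            f = feature_fn(s, a)
            return min(w @ f + beta * np.sqrt(max(f @ Sigma_inv @ f, 0.0)), H)
        return max(actions, key=q)
    return act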

Using the above techniques, in the inner loop of Alg. 2 we repeatedly use Alg. 6 to estimate the accumulated covariance matrix \widetilde{\Sigma}_{h,i+1} and further reduce uncertainty by calling Alg. 5 to find a policy (approximately) maximizing the uncertainty-based reward function \widetilde{R}:=\|\phi\|_{\widetilde{\Sigma}^{-1}_{h,i+1}}. For each h ∈ [H], inductively conditioning on sufficient exploration of the previous h − 1 layers, the errors of Alg. 6 and Alg. 5 will be small, and we will find a finite set of policies Π_h that covers all dimensions in layer h (this is similar to the notion of a "policy cover" in Du et al. (2019); Agarwal et al. (2020a)). Then, layer h can be explored sufficiently by deploying a uniform mixture of Π_h and choosing N large enough in the sampling loop of Alg. 2. Also note that the algorithm does not use the reward information and is essentially a reward-free exploration algorithm. After exploring all H layers, we obtain a dataset {D_1, ..., D_H} and can use Alg. 4 for planning with any given reward function r satisfying Assump. A to obtain a near-optimal policy.

Deployment complexity guarantees

We first introduce a quantity ν_min, which measures the reachability of each dimension in the linear MDP. In Appendix E.8, we show that ν_min is no less than the "explorability" coefficient in Definition 2 of Zanette et al. (2020), and that ν²_min is lower bounded by the maximum over policies π of the smallest singular value of the matrix \mathbb{E}_{\pi}[\phi\phi^{\top}].

Definition 4.3 (Reachability Coefficient).
\nu_{h}:=\min_{\|\theta\|=1}\max_{\pi}\sqrt{\mathbb{E}_{\pi}[(\phi^{\top}_{h}\theta)^{2}]};\qquad\nu_{\min}:=\min_{h\in[H]}\nu_{h}.
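As a numerical illustration of the lower bound mentioned above (ν_h² is at least the maximum over policies of the smallest eigenvalue of E_π[φφ^⊤]), the snippet below computes this bound from a collection of estimated per-policy covariance matrices; the input matrices are hypothetical.

import numpy as np

# Sketch: lower-bound nu_h by the max over candidate policies of the smallest eigenvalue of Lambda^pi_h.
# cov_matrices: dict mapping a policy identifier to an estimate of E_pi[phi(s_h,a_h) phi(s_h,a_h)^T].
def reachability_lower_bound(cov_matrices):
    # min_theta theta^T Lam theta = lambda_min(Lam), and min_theta max_pi >= max_pi min_theta,
    # hence nu_h >= sqrt(max_pi lambda_min(Lambda^pi)).
    return np.sqrt(max(np.linalg.eigvalsh(Lam).min() for Lam in cov_matrices.values()))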

Now we are ready to state the main theorem of this section; the formal version and its proofs are deferred to Appendix E. Our algorithm effectively performs reward-free exploration, and therefore our results hold for arbitrary linear reward functions.

Theorem 4.4.

[Informal] For arbitrary 0 < ε, δ < 1, with proper choices of i_max, ε_0, β, we can choose N = poly(d, H, 1/ε, log(1/δ), 1/ν_min) such that, after K = H deployments, with probability 1 − δ, Algorithm 2 collects a dataset D = {D_1, ..., D_H}, and if we run Alg. 4 with D and an arbitrary reward function satisfying Assump. A, we obtain \widehat{\pi}_{r} such that V_{1}^{\widehat{\pi}_{r}}(s_{1};r)\geq V_{1}^{*}(s_{1};r)-\varepsilon.

Proof Sketch

Next, we briefly discuss the key steps of the proof. Since ε_0 can be chosen to be very small, we ignore the bias induced by ε_0 when providing intuition. Our proof is based on the induction condition below. We first assume it holds after h − 1 deployments (which is true when h = 1), and then prove that at the h-th deployment we can explore layer h well enough that the condition also holds for h.

Condition 4.5.

[Induction Condition] Suppose that after h − 1 deployments, the following induction condition holds for some ξ < 1/d, to be determined later:

\max_{\pi}\mathbb{E}_{\pi}\Big[\sum_{\widetilde{h}=1}^{h-1}\sqrt{\phi(s_{\widetilde{h}},a_{\widetilde{h}})^{\top}\Sigma_{\widetilde{h}}^{-1}\phi(s_{\widetilde{h}},a_{\widetilde{h}})}\Big]\leq\frac{h-1}{H}\xi.    (2)

The l.h.s. of Eq. (2) measures the remaining uncertainty in the previous h − 1 layers after exploration. As a result, with high probability, the following estimates are accurate:

\|\widehat{\Lambda}^{\pi_{h,i}}-\mathbb{E}_{\pi_{h,i}}[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}]\|_{\infty,\infty}\leq O(\xi),    (3)

where \|\cdot\|_{\infty,\infty} denotes the entry-wise maximum norm. This directly implies that:

\|\widetilde{\Sigma}_{h,i+1}-\Sigma_{h,i+1}\|_{\infty,\infty}\leq i\cdot O(\xi),

where \Sigma_{h,i+1}:=2I+\sum_{i^{\prime}=1}^{i}\mathbb{E}_{\pi_{h,i^{\prime}}}[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}] is the target that \widetilde{\Sigma}_{h,i+1} approximates. Besides, recall that in Algorithm 5 we use \sqrt{\phi^{\top}\widetilde{\Sigma}^{-1}_{h,i+1}\phi} as the reward function, and the induction condition also implies that:

|V_{h,i+1}-\max_{\pi}\mathbb{E}_{\pi}[\|\phi(s_{h},a_{h})\|_{\widetilde{\Sigma}^{-1}_{h,i+1}}]|\leq O(\xi).

As a result, if ξ and the resolution ε_0 are small enough, \bar{\pi}_{h,i+1} gradually reduces the uncertainty, and V_{h,i+1} (and hence \max_{\pi}\mathbb{E}_{\pi}[\|\phi(s_{h},a_{h})\|_{\widetilde{\Sigma}^{-1}_{h,i+1}}]) decreases. However, the bias is at the level O(ξ); therefore, no matter how small ξ is, as long as ξ > 0 it is still possible that the policies in Π_h do not cover all directions if some directions are very difficult to reach, and the error due to this bias is at the same level as the accuracy required by the induction condition, i.e., O(ξ). This is exactly where the "reachability coefficient" ν_min helps. The introduction of ν_min provides a threshold: as long as ξ is small enough that the bias falls below this threshold, each dimension will be reached with substantial probability once the breaking criterion in the inner loop of Alg. 2 is satisfied. As a result, by deploying unif(Π_h) and collecting a sufficiently large dataset, the induction condition holds up to layer H. Finally, combining the guarantee of Alg. 4, we complete the proof.

5 Conclusion and Future Work

In this paper, we propose a concrete theoretical formulation of DE-RL to fill the gap between the existing RL literature and real-world applications with deployment constraints. Based on our framework, we establish lower bounds for deployment complexity in linear MDPs and provide novel algorithms and techniques that achieve the optimal deployment efficiency. Besides, our formulation is flexible and can serve as a building block for other practically relevant settings related to DE-RL. We conclude the paper with two such examples, defer a more detailed discussion to Appendix F, and leave the investigation to future work.

Sample-Efficient DE-RL

In our basic formulation in Definition 2.1, we focus on minimizing the deployment complexity K and put only mild constraints on the per-deployment sample complexity N. In practice, however, the latter is also an important consideration, and we may face additional constraints on how large N can be, since it is upper bounded by, e.g., the number of customers or patients our system serves.

Safe DE-RL

In real-world applications, safety is also an important criterion. The definition of the safety criterion in Safe DE-RL is still an open problem, but we believe it is an interesting setting since it implies a trade-off between exploration and exploitation in the deployment-efficient setting.

Acknowledgements

JH’s research activities on this work were completed by December 2021 during his internship at MSRA. NJ acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, and Adobe Data Science Research Award.

References

  • Abbasi-yadkori et al. (2011) Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • Afsar et al. (2021) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. arXiv preprint arXiv:2101.06286, 2021.
  • Agarwal et al. (2020a) Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020a.
  • Agarwal et al. (2020b) Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps, 2020b.
  • Agrawal et al. (2021) Priyank Agrawal, Jinglin Chen, and Nan Jiang. Improved worst-case regret bounds for randomized least-squares value iteration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  6566–6573, 2021.
  • Agrawal & Jia (2017) Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184–1194, 2017.
  • Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Azar et al. (2017) M. G. Azar, Ian Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In ICML, 2017.
  • Bai et al. (2020) Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost, 2020.
  • Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bellemare et al. (2016) Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation, 2016.
  • Bennane et al. (2013) Abdellah Bennane et al. Adaptive educational software by applying reinforcement learning. Informatics in Education-An International Journal, 12(1):13–27, 2013.
  • Burda et al. (2018) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation, 2018.
  • Campero et al. (2020) Andres Campero, Roberta Raileanu, Heinrich Küttler, Joshua B Tenenbaum, Tim Rocktäschel, and Edward Grefenstette. Learning with amigo: Adversarially motivated intrinsic goals. arXiv preprint arXiv:2006.12122, 2020.
  • Carpentier et al. (2020) Alexandra Carpentier, Claire Vernade, and Yasin Abbasi-Yadkori. The elliptical potential lemma revisited, 2020.
  • Chen & Jiang (2019) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pp. 1042–1051. PMLR, 2019.
  • Dann et al. (2018) Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient pac rl with rich observations. Advances in neural information processing systems, 31, 2018.
  • Du et al. (2019) Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674. PMLR, 2019.
  • Ecoffet et al. (2019) Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • Esfandiari et al. (2020) Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits, 2020.
  • Fujimoto & Gu (2021) Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning, 2021.
  • Gao et al. (2021) Minbo Gao, Tianle Xie, Simon S. Du, and Lin F. Yang. A provably efficient algorithm for linear markov decision process with low switching cost, 2021.
  • Gu et al. (2021) Quanquan Gu, Amin Karbasi, Khashayar Khosravi, Vahab Mirrokni, and Dongruo Zhou. Batched neural bandits, 2021.
  • Guo & Brunskill (2015) Zhaohan Guo and Emma Brunskill. Concurrent pac rl. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • Han et al. (2020) Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W. Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits, 2020.
  • Hazan et al. (2019) Elad Hazan, Sham M. Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration, 2019.
  • Jiang & Agarwal (2018) Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pp.  3395–3398. PMLR, 2018.
  • Jiang & Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. PMLR, 2016.
  • Jiang et al. (2017a) Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pp. 1704–1713. PMLR, 2017a.
  • Jiang et al. (2017b) Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1704–1713, 2017b.
  • Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient?, 2018.
  • Jin et al. (2019) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation, 2019.
  • Jin et al. (2020) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning, 2020.
  • Jin et al. (2021a) Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34, 2021a.
  • Jin et al. (2021b) Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021b.
  • Kober et al. (2013) Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Kong et al. (2021) Dingwen Kong, R. Salakhutdinov, Ruosong Wang, and Lin F. Yang. Online sub-sampling for reinforcement learning with general function approximation. ArXiv, abs/2106.07203, 2021.
  • Krishnamurthy et al. (2016) Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Pac reinforcement learning with rich observations. Advances in Neural Information Processing Systems, 29:1840–1848, 2016.
  • Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
  • Laroche et al. (2019) Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pp. 3652–3661. PMLR, 2019.
  • Lee et al. (2021) Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joelle Pineau, and Kee-Eung Kim. Optidice: Offline policy optimization via stationary distribution correction estimation. arXiv preprint arXiv:2106.10783, 2021.
  • Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. arXiv preprint arXiv:1810.12429, 2018.
  • Liu et al. (2020) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch off-policy reinforcement learning without great exploration. Advances in Neural Information Processing Systems, 33:1264–1274, 2020.
  • Matsushima et al. (2021) Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=3hGNqpI4WS.
  • Misra et al. (2020) Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pp. 6961–6971. PMLR, 2020.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Modi et al. (2021) Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. Model-free representation learning and exploration in low-rank mdps. arXiv preprint arXiv:2102.07035, 2021.
  • Moskovitz et al. (2021) Ted Moskovitz, Jack Parker-Holder, Aldo Pacchiano, Michael Arbel, and Michael I. Jordan. Tactical optimism and pessimism for deep reinforcement learning, 2021.
  • Munos & Szepesvári (2008) Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
  • Nachum et al. (2019) Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
  • Nair et al. (2021) Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets, 2021.
  • Pertsch et al. (2020) Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill priors. arXiv preprint arXiv:2010.11944, 2020.
  • Precup (2000) Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp.  80, 2000.
  • Ruan et al. (2021) Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design, 2021.
  • Russo & Van Roy (2013) Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • Tropp (2015) Joel A. Tropp. An introduction to matrix concentration inequalities, 2015.
  • Uehara et al. (2020) Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR, 2020.
  • Wang et al. (2020a) Ruosong Wang, Simon S. Du, Lin F. Yang, and Sham M. Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning?, 2020a.
  • Wang et al. (2020b) Ruosong Wang, Simon S. Du, Lin F. Yang, and Ruslan Salakhutdinov. On reward-free reinforcement learning with linear function approximation, 2020b.
  • Wang et al. (2020c) Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. arXiv preprint arXiv:2005.10804, 2020c.
  • Xie et al. (2021a) Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. arXiv preprint arXiv:2106.06926, 2021a.
  • Xie et al. (2021b) Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning, 2021b.
  • Yang et al. (2020) Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized lagrangian. arXiv preprint arXiv:2007.03438, 2020.
  • Yu et al. (2019) Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2019.
  • Yu et al. (2020) Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.
  • Zanette & Brunskill (2019) Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304–7312. PMLR, 2019.
  • Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. arXiv preprint arXiv:2008.07737, 2020.
  • Zhang et al. (2021) Tianjun Zhang, Paria Rashidinejad, Jiantao Jiao, Yuandong Tian, Joseph Gonzalez, and Stuart Russell. Made: Exploration via maximizing deviation from explored regions. arXiv preprint arXiv:2106.10268, 2021.

Appendix A Extended Related Work

Online RL

Online RL is a paradigm focusing on the challenge of strategic exploration. On the theoretical side, based on the “Optimism in Face of Uncertainty” (OFU) principle or posterior sampling techniques, many provable algorithms have been developed for tabular MDPs (Jin et al., 2018; Azar et al., 2017; Zanette & Brunskill, 2019; Agrawal & Jia, 2017; Agrawal et al., 2021), linear MDPs (Jin et al., 2019; Agarwal et al., 2020a), general function approximation (Wang et al., 2020c; Russo & Van Roy, 2013), and MDPs with structural assumptions (Jiang et al., 2017b; Du et al., 2019). Moreover, another stream of work studies how to guide exploration by utilizing state occupancy (Hazan et al., 2019; Zhang et al., 2021). Beyond learning in MDPs with a pre-specified reward function, Jin et al. (2020); Wang et al. (2020b); Zanette et al. (2020) recently provided algorithms for exploration in scenarios where multiple reward functions are of interest. On the practical side, there are empirical algorithms such as intrinsically-motivated exploration (Bellemare et al., 2016; Campero et al., 2020), exploration with hand-crafted reward bonuses (RND) (Burda et al., 2018), and other more sophisticated strategies (Ecoffet et al., 2019). However, none of these exploration methods takes deployment efficiency into consideration, and they fail to sufficiently explore the MDP and learn near-optimal policies in the DE-RL setting, where the number of deployments is severely limited.

Offline RL

Different from the online setting, where the agent is encouraged to explore rarely visited states to identify the optimal policy, the pure offline RL setting serves as a framework for utilizing historical data to learn a good policy without further interacting with the environment. The core problem of offline RL is therefore the performance guarantee of the deployed policy, which has motivated multiple importance-sampling-based off-policy evaluation and optimization methods (Jiang & Li, 2016; Liu et al., 2018; Uehara et al., 2020; Yang et al., 2020; Nachum et al., 2019; Lee et al., 2021), as well as the “Pessimism in Face of Uncertainty” framework (Liu et al., 2020; Kumar et al., 2020; Fujimoto & Gu, 2021; Yu et al., 2020; Jin et al., 2021b; Xie et al., 2021a), in contrast with OFU in the online exploration setting. However, as suggested in Matsushima et al. (2021), pure offline RL can be regarded as constraining the total number of deployments to be 1.

Bridging Online and Offline RL; Trade-off between Pessimism and Optimism

As pointed out by Xie et al. (2021b); Matsushima et al. (2021), there is a clear gap between existing online and offline RL settings, and some efforts have been made towards bridging them. For example, Nair et al. (2021); Pertsch et al. (2020); Bai et al. (2020) studied how to leverage pre-collected offline datasets to learn a good prior to accelerate the online learning process. Moskovitz et al. (2021) proposed a learning framework which can switch between optimism and pessimism by modeling the selection as a bandit problem. None of these works give provable guarantees in our deployment-efficiency setting.

Conversion from Linear Regret in Gao et al. (2021) to Sample Complexity

Gao et al. (2021) proposed an algorithm with the following guarantee: after interacting with the environment for \widetilde{K} episodes (we use \widetilde{K} to distinguish it from K in our setting), there exists a constant c such that the algorithm’s regret satisfies

\displaystyle\sum_{k=1}^{\widetilde{K}}V^{*}(s_{1})-V^{\pi_{k}}(s_{1})=c\cdot\sqrt{d^{3}H^{4}\widetilde{K}}\cdot\iota=c\cdot\sqrt{d^{3}H^{3}T}\cdot\iota

where we denote T=\widetilde{K}H and use \iota to refer to the logarithmic terms. Besides, among \pi_{1},...,\pi_{\widetilde{K}} there are only O(dH\log\widetilde{K}) policy switches.

As discussed in Section 3.1 of Jin et al. (2018), such a regret bound can be converted to a PAC guarantee: by selecting a policy \pi uniformly at random from \pi_{1},...,\pi_{\widetilde{K}}, with probability at least 2/3 we have:

\displaystyle V^{*}(s_{1})-V^{\pi}(s_{1})=\widetilde{O}(\sqrt{\frac{d^{3}H^{5}}{T}})=\widetilde{O}(\sqrt{\frac{d^{3}H^{4}}{\widetilde{K}}})

In order to make the upper bound on the r.h.s. equal to \varepsilon, we need:

\displaystyle\widetilde{K}=\frac{d^{3}H^{4}}{\varepsilon^{2}}

and the required number of policy switches would be:

\displaystyle O(dH\log\widetilde{K})=O(dH\log\frac{dH}{\varepsilon})

In contrast with our results in Section 4.1, there is an additional logarithmic dependence on d, H and \varepsilon. Moreover, since their algorithm only deploys deterministic policies, its deployment complexity has to depend on d, which is much higher than that of our stochastic-policy algorithm in Section 4.2 when d is large.
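To make the scaling concrete, the following small Python sketch (ours, not from any of the cited works; all hidden constants and the log factors inside the asymptotic notation are set to 1, which is an assumption for illustration only) compares the O(dH\log\frac{dH}{\varepsilon}) deployment count implied above with an O(dH) deployment budget for a few example problem sizes.

```python
import math

# Rough illustration (assumption: all hidden constants set to 1): deployment
# counts implied by O(dH log(dH/eps)) policy switches versus an O(dH) budget.
for d, H, eps in [(10, 20, 0.1), (50, 50, 0.05), (200, 100, 0.01)]:
    switches_log = d * H * math.log(d * H / eps)   # O(dH log(dH/eps))
    deployments_dh = d * H                         # O(dH)
    print(f"d={d:4d} H={H:4d} eps={eps:5.2f}  "
          f"with log factor: {switches_log:12.0f}   without: {deployments_dh:8d}")
```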

Investigation on Trade-off between Sample Complexity and Deployment Complexity

| Methods | R-F? | Deployed Policy | # Trajectories | Deployment Complexity |
| --- | --- | --- | --- | --- |
| LSVI-UCB (Jin et al., 2019) | × | Deterministic | \widetilde{O}(\frac{d^{3}H^{4}}{\varepsilon^{2}}) | \widetilde{O}(\frac{d^{3}H^{4}}{\varepsilon^{2}}) |
| Reward-free LSVI-UCB (Wang et al., 2020b) | ✓ | Deterministic | \widetilde{O}(\frac{d^{3}H^{6}}{\varepsilon^{2}}) | \widetilde{O}(\frac{d^{3}H^{6}}{\varepsilon^{2}}) |
| FRANCIS (Zanette et al., 2020) | ✓ | Deterministic | \widetilde{O}(\frac{d^{3}H^{5}}{\varepsilon^{2}}) | \widetilde{O}(\frac{d^{3}H^{5}}{\varepsilon^{2}}) |
| Gao et al. (2021) | × | Deterministic | \widetilde{O}(\frac{d^{3}H^{4}}{\varepsilon^{2}}) | O(dH\log\frac{dH}{\varepsilon}) |
| Q-type OLIVE (Jiang et al., 2017a; Jin et al., 2021a) | × | Deterministic | \widetilde{O}(\frac{d^{3}H^{6}}{\varepsilon^{2}}) | O(dH\log\frac{1}{\varepsilon}) |
| Simplified MOFFLE (Modi et al., 2021) | ✓ | Stochastic | \widetilde{O}(\frac{d^{8}H^{7}|\mathcal{A}|^{13}}{\min(\varepsilon^{2}\eta_{\min},\eta_{\min}^{5})}) | \widetilde{O}(\frac{Hd^{3}|\mathcal{A}|^{4}}{\eta_{\min}^{2}}) |
| Alg. 1 [Ours] | × | Deterministic | \widetilde{O}(\frac{d^{4}H^{5}}{\varepsilon^{2}}) | O(dH) |
| Alg. 3 + 4 [Ours] | ✓ | Deterministic | \widetilde{O}(\frac{d^{4}H^{7}}{\varepsilon^{2}}) | O(dH) |
| Alg. 2 [Ours] | ✓ | Stochastic | \widetilde{O}(\frac{d^{3}H^{4}}{\varepsilon^{2}\nu^{2}_{\min}}+\frac{d^{7}H^{2}}{\nu^{14}_{\min}}) | H |
Table 1: Comparison between our algorithms and online RL methods that do not consider deployment constraints, in the setting defined in Def. 2.1; R-F is short for Reward-Free. The total number of trajectories used by our methods is computed as K\cdot N. We omit log terms in \widetilde{O}. For the algorithm of Jin et al. (2019), we report the sample complexity after the conversion from regret. For our deterministic-policy algorithms, we report the asymptotic results when c_{K}\rightarrow+\infty, which can be achieved approximately when c_{K} is a large constant (e.g. c_{K}=100).

In Table 1, we compare our algorithms and previous online RL works which did not consider deployment efficiency to shed light on the trade-off between sample and deployment complexities. Besides algorithms that are specialized to linear MDPs, we also include results such as Zanette et al. (2020), which studied a more general linear approximation setting and can be adapted to our setting. As stated in Def. 2 of Zanette et al. (2020), they also rely on some reachability assumption. To avoid ambiguity, we use ν~min\widetilde{\nu}_{\min} to refer to their reachability coefficient (as discussed in Appx E.8, ν~min\widetilde{\nu}_{\min} is no larger than and can be much smaller than our νmin\nu_{\min}). Because they also assume that εO~(ν~min/d)\varepsilon\leq\widetilde{O}(\widetilde{\nu}_{\min}/\sqrt{d}) (see Thm 4.1 in their paper), their results have an implicit dependence on ν~min2\widetilde{\nu}^{-2}_{\min}. In addition, by using the class of linear functions w.r.t. ϕ\phi, Q-type OLIVE (Jiang et al., 2017a; Jin et al., 2021a) has O~(d3H6ε2)\widetilde{O}(\frac{d^{3}H^{6}}{\varepsilon^{2}}) sample complexity and O(dHlog(1/ε))O(dH\log(1/\varepsilon)) deployment complexity. Its deployment complexity is close to our deterministic algorithm, but with additional dependence on ε\varepsilon. We also want to highlight that OLIVE is known to be computationally intractable (Dann et al., 2018), while our algorithms are computationally efficient. With the given feature ϕ\phi in linear MDPs and additional reachability assumption (not comparable to us), we can use a simplified version of MOFFLE (Modi et al., 2021) by skipping their LearnRep subroutine. Though this version of MOFFLE is computationally efficient and its deployment complexity does not depend on ε\varepsilon, it has much worse sample complexity (ηmin\eta_{\min} is their reachability coefficient) and deployment complexity. On the other hand, PCID (Du et al., 2019) and HOMER (Misra et al., 2020) achieve HH deployment complexity in block MDPs. However, block MDPs are more restricted than linear MDPs and these algorithms have worse sample and computational complexities.

It is worth noting that all our algorithms achieve the optimal dependence on \varepsilon (i.e., \varepsilon^{-2}) in sample complexity. For algorithms that deploy deterministic policies, our algorithms have a higher dependence on d and H in the sample complexity in both the reward-known and reward-free settings, while our deployment complexity is much lower. Our stochastic-policy algorithm (last row) is naturally a reward-free algorithm. Compared with Wang et al. (2020b) and Zanette et al. (2020), our sample complexity has a higher dependence on d and on the reachability coefficient \nu_{\min}, while our algorithm achieves the optimal deployment complexity and a better dependence on H.

Appendix B On the Lower Bound of Deployment-Efficient Reinforcement Learning

B.1 A Hard MDP Instance Template

Figure 1: Lower Bound Instance Template. The states in each layer can be divided into three groups. Group 1: absorbing states, marked with green; Group 2: core state, marked with red; Group 3: normal states, marked with blue.

In this subsection, we first introduce a hard MDP template that will be used in the subsequent proofs. As shown in Figure 1, we construct a tabular MDP (which is a special case of linear MDP) whose horizon length is H+1 and in which each layer except the first contains d+2 states and 2d+1 distinct state-action pairs. The initial state is fixed as s_0 and there are d+2 different actions. It is easy to see that we can represent this MDP with linear features of at most 2d+1 dimensions and construct reward and transition functions satisfying Assumption A; as a result, it is a linear MDP with dimension 2d+1 and horizon length H+1. Since this is only a constant-factor blow-up of the dimension, the dimension of these MDPs is still \Theta(d), and we will simply write d instead of \Theta(d) in the rest of the proof. The states in each layer h\geq 1 can be divided into three groups, which we introduce one by one below.

Group 1: Absorbing States (Green Color)

The first group G_h^1=\{u_h^1,u_h^2\} consists of two absorbing states u_h^1 and u_h^2, each admitting a single action (\bar{a}_h^1 and \bar{a}_h^2, respectively) that transits to u_{h+1}^1 and u_{h+1}^2 with probability 1. The reward function is defined as r_h(u_h^1,\bar{a}_h^1)=r_h(u_h^2,\bar{a}_h^2)=0.5 for all h\leq H-2, and r_H(u_H^1,\bar{a}_H^1)=0.0, r_H(u_H^2,\bar{a}_H^2)=1.0.

Group 2: Core States (Red Color)

The second group G_h^2=\{s_h^*\} contains only one state, which we call the core state and denote by s_h^*. For example, in Figure 1, we have s_1^*=s_1^d, s_2^*=s_2^1 and s_3^*=s_3^2. At the core state, the agent can take d actions \widetilde{a}_h^1,\widetilde{a}_h^2,\ldots,\widetilde{a}_h^d, which transit deterministically to s_{h+1}^1,s_{h+1}^2,\ldots,s_{h+1}^d, respectively. Besides, the reward function is r_h(s_h^*,\widetilde{a}_h^i)=0.5 for all i\in[d].

Group 3: Normal States (Blue Color)

The third group G_h^3=\{s_h^i\,|\,i\in[d],\,s_h^i\neq s_h^*\} is what we call the “normal states”; each state s_h^i\in G_h^3 admits only one action \widetilde{a}_h^i and transits randomly to one of the absorbing states in the next layer, i.e., to G_{h+1}^1. The reward function is r_h(s_h^i,\widetilde{a}_h^i)=0.5 for every s_h^i\in G_h^3, and the transition function is P(u_{h+1}^1|s_h^i,\widetilde{a}_h^i)=P(u_{h+1}^2|s_h^i,\widetilde{a}_h^i)=0.5, except for one state-action pair s^{\#}:=s^{i^{\#}}_{h^{\#}}, a^{\#}:=a^{i^{\#}}_{h^{\#}} at layer h^{\#}\in[H-1] with index i^{\#}, such that s^{\#}\not\in G^2_{h^{\#}}, P(u_{h^{\#}+1}^1|s^{i^{\#}}_{h^{\#}},a^{i^{\#}}_{h^{\#}})=0.5-\varepsilon and P(u_{h^{\#}+1}^2|s^{i^{\#}}_{h^{\#}},a^{i^{\#}}_{h^{\#}})=0.5+\varepsilon. In the following, we will call s^{\#} and a^{\#} the “optimal state” and “optimal action” of this MDP. Note that the “optimal state” cannot be the core state.

We will use M(h^{\#},i^{\#},I_{core}=\{i_1,i_2,...,i_H\}) with i_{h^{\#}}\neq i^{\#} to denote the MDP whose optimal state is located at layer h^{\#} with index i^{\#}, and whose core states are s_1^{i_1},s_2^{i_2},...,s_H^{i_H}. As we can see, the only optimal policy is the one that generates the following sequence of states before transiting to the absorbing states at layer h^{\#}+1:

\displaystyle s_{0},s_{1}^{i_{1}},s_{2}^{i_{2}},...,s_{h^{\#}-1}^{i_{h^{\#}-1}},s_{h^{\#}}^{i^{\#}}

and the optimal value is \frac{1}{2}H+\varepsilon. In order to obtain an \varepsilon-optimal policy, the algorithm must identify s^{i^{\#}}_{h^{\#}}, which is the main source of difficulty in exploration.
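To make the template concrete, here is a minimal illustrative sketch (ours, not code from the paper) that builds the transition and reward tables of M(h^{\#}, i^{\#}, I_{core}) as a tabular MDP. The dictionary-based MDP format, the state and action names, and the handling of the boundary layers (layer 0 and the last layer) are our own simplifying conventions and assumptions.

```python
# Illustrative construction (ours) of the hard-instance template: three groups
# of states per layer, with one "optimal" normal state whose transition to the
# good absorbing chain is tilted by eps.  Boundary-layer details are simplified.

def build_hard_instance(d, H, h_sharp, i_sharp, I_core, eps):
    """I_core[h]: index of the core state at layer h (1 <= h <= H - 1)."""
    assert 1 <= h_sharp <= H - 1 and I_core[h_sharp] != i_sharp
    P, r = {}, {}  # P[(h, s, a)] -> {next_state: prob}; r[(h, s, a)] -> reward

    # Layer 0: from s0 the agent chooses which layer-1 state to enter
    # (zero reward here is an assumption; the text leaves it unspecified).
    for j in range(1, d + 1):
        P[(0, "s0", j)], r[(0, "s0", j)] = {("s", 1, j): 1.0}, 0.0

    for h in range(1, H):
        last = (h == H - 1)
        # Group 1: two absorbing chains u1 (bad) and u2 (good), one action each.
        for u in (1, 2):
            P[(h, ("u", h, u), 0)] = {("u", h + 1, u): 1.0}
            r[(h, ("u", h, u), 0)] = 0.5 if not last else (0.0 if u == 1 else 1.0)
        for i in range(1, d + 1):
            s = ("s", h, i)
            if not last and i == I_core[h]:
                # Group 2: the core state, d actions, deterministic transitions.
                for j in range(1, d + 1):
                    P[(h, s, j)], r[(h, s, j)] = {("s", h + 1, j): 1.0}, 0.5
            else:
                # Group 3: normal states, one action, ~50/50 to the absorbing
                # chains; only the optimal state (h_sharp, i_sharp) is tilted.
                tilt = eps if (h, i) == (h_sharp, i_sharp) else 0.0
                P[(h, s, 0)] = {("u", h + 1, 1): 0.5 - tilt,
                                ("u", h + 1, 2): 0.5 + tilt}
                r[(h, s, 0)] = 0.5
    return P, r
```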

Remark B.1 (Markov v.s. Non-Markov Policies).

As we can see, the core state in each layer is the only state with more than one action, and for each core state there exists exactly one deterministic path (a sequence of states, actions and rewards) from the initial state to it. This implies that for an arbitrary non-Markov policy there exists an equivalent Markov policy. Therefore, in the rest of the proofs in this section, we only consider Markov policies.

B.2 Lower bound for algorithms which can deploy deterministic policies only

In the following, we state the formal version of the lower bound theorem for the deterministic-policy setting and give its proof. The intuition is that we construct a family of hard instances that can be regarded as an \Omega(dH)-armed bandit problem, and we show that in expectation the algorithm needs to “pull \Omega(dH) arms” before identifying the optimal one.

Theorem B.2 (Lower bound for number of deployments in deterministic policy setting).

For the linear MDP problem with dimension d and horizon H, consider an arbitrary algorithm \psi (\psi can be deterministic or stochastic) that can only deploy deterministic policies but can collect an arbitrary number of samples in each deployment. There exists an MDP in which the optimal deterministic policy \pi^* is \varepsilon better than all other deterministic policies, yet the estimated best policy \widehat{\pi} (which is also a deterministic policy) output by \psi after K deployments must satisfy P(\pi^*\neq\widehat{\pi})\geq 1/10, unless the number of deployments K>(d-1)(H-1)/2=\Omega(dH).

Proof.

First of all, we introduce how we construct hard instances.

Construction of Hard Instances

We consider a set of MDPs \bar{\mathcal{M}}, where for each MDP in the set, the core states (red) in each layer are fixed to be s_1^1,s_2^1,...,s_{H-1}^1, and the unique optimal state, which has a different probability of transiting to the absorbing states, is chosen from the (d-1)(H-1) normal states (blue). It is easy to see that |\bar{\mathcal{M}}|=(d-1)(H-1).

Because of the different locations of the optimal state, the optimal policy of each MDP in \bar{\mathcal{M}} (i.e., the policy that transits from s_0 to the optimal state) is different. We will use \pi_1,\pi_2,...,\pi_{(d-1)(H-1)} to refer to those policies and use M_{\pi_i} with 1\leq i\leq(d-1)(H-1) to denote the MDP in \bar{\mathcal{M}} for which \pi_i is optimal. For convenience, we will use M_0 to denote the MDP in which all normal states have equal probability of transiting to the two absorbing states, i.e., all states are optimal states. Based on the above, we define \mathcal{M}:=\bar{\mathcal{M}}\cup\{M_0\} and use the MDPs in \mathcal{M} as hard instances.

Lower Bound for Average Failure Probability

Next, we lower bound the average failure probability, which serves as a lower bound for the maximal failure probability among the MDPs in \mathcal{M}. Since any randomized algorithm is just a distribution over deterministic ones, it suffices to consider only deterministic algorithms \psi (Krishnamurthy et al., 2016).

Given an arbitrary algorithm ψ\psi and k[K]k\in[K], we use ψ(k)\psi(k) to denote the policy taken by ψ\psi at the kk-th deployment (which is a random variable). Besides, we denote ψ(K+1)\psi(K+1) as the output policy.

For arbitrary k\in[K], we use P_{M_{\pi_i},\psi}(\psi(k)=\pi_j) with 1\leq i,j\leq(d-1)(H-1) to denote the probability that \psi deploys policy \pi_j at the k-th deployment when running on M_{\pi_i}, and use P_{M_{\pi_i},\psi}(\psi(K+1)=\pi_i) to denote the probability that \psi returns \pi_i as the optimal policy after K deployments on M_{\pi_i}. We are interested in upper bounding the expected success rate:

Pψ,M(success):=\displaystyle P_{\psi,M\sim\mathcal{M}}(\text{success}):= 1||P(success inM0)+1||i=1|¯|PMπi,ψ(ψ(K+1)=πi)\displaystyle\frac{1}{|\mathcal{M}|}P(\text{success~{}in}~{}M_{0})+\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i})
=\displaystyle= 1||+1||i=1|¯|PMπi,ψ(ψ(K+1)=πi),\displaystyle\frac{1}{|\mathcal{M}|}+\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i}),

where we assume that all the policies in M0M_{0} are optimal policies.

In the following, we use Ek,πiE_{k,\pi_{i}} to denote the event that the policy πi\pi_{i} has been deployed at least once in the first kk deployments and PMπj,ψ()P_{M_{\pi_{j}},\psi}(\cdot) to denote the probability of an event when running algorithm ψ\psi under MDP MπjM_{\pi_{j}}.

Next, we prove that, for arbitrary MπiM_{\pi_{i}},

\displaystyle P_{M_{\pi_{i}},\psi}(E_{k,\pi_{i}}^{\complement})=P_{M_{0},\psi}(E_{k,\pi_{i}}^{\complement})\quad\forall k\in[K+1]. (4)

First, the claim holds for k=1: before the first deployment, \psi has not observed any data, so its behavior is identical on M_0 and M_{\pi_i}, and therefore P_{M_{\pi_i},\psi}(E_{1,\pi_i}^{\complement})=P_{M_0,\psi}(E_{1,\pi_i}^{\complement}). Next, we proceed by induction. Suppose the claim holds for 1,2,...,k, and consider the case k+1. Because \psi behaves identically whenever the pre-collected episodes (the only information it uses for its decisions) are the same, we have:

PMπi,ψ(ψ(k+1)=πiEk,πi)=\displaystyle P_{M_{\pi_{i}},\psi}(\psi(k+1)=\pi_{i}\cap E_{k,\pi_{i}}^{\complement})= τψ(k+1)=πiEk,πiPMπi,ψ(τ)\displaystyle\sum_{\tau\in\psi(k+1)=\pi_{i}\cap E_{k,\pi_{i}}^{\complement}}P_{M_{\pi_{i}},\psi}(\tau)
=\displaystyle= τψ(k+1)=πiEk,πiPM0,ψ(τ)\displaystyle\sum_{\tau\in\psi(k+1)=\pi_{i}\cap E_{k,\pi_{i}}^{\complement}}P_{M_{0},\psi}(\tau)
=\displaystyle= PM0,ψ(ψ(k+1)=πiEk,πi).\displaystyle P_{M_{0},\psi}(\psi(k+1)=\pi_{i}\cap E_{k,\pi_{i}}^{\complement}). (5)

The second equality holds because, by construction, each trajectory \tau\in\{\psi(k+1)=\pi_i\}\cap E_{k,\pi_i}^{\complement} has the same probability under M_0 and M_{\pi_i}. Notice that in this induction step we only consider trajectories consisting of the first (k+1)N episodes, because the whole sample space and the events are defined based on the first (k+1)N episodes.

This implies that,

PMπi,ψ(Ek+1,πi)=\displaystyle P_{M_{\pi_{i}},\psi}(E_{k+1,\pi_{i}}^{\complement})= PMπi,ψ(Ek,πi)PMπi,ψ(ψ(k+1)=πiEk,πi)\displaystyle P_{M_{\pi_{i}},\psi}(E_{k,\pi_{i}}^{\complement})-P_{M_{\pi_{i}},\psi}(\psi(k+1)=\pi_{i}\cap E_{k,\pi_{i}}^{\complement})
=\displaystyle= PM0,ψ(Ek,πi)PM0,ψ(ψ(k+1)=πiEk,πi)\displaystyle P_{M_{0},\psi}(E_{k,\pi_{i}}^{\complement})-P_{M_{0},\psi}(\psi(k+1)=\pi_{i}\cap E_{k,\pi_{i}}^{\complement})
=\displaystyle= PM0,ψ(Ek+1,πi).\displaystyle P_{M_{0},\psi}(E_{k+1,\pi_{i}}^{\complement}).

Now we are ready to bound the failure rate. Suppose K<(d1)(H1)/2<||/2K<(d-1)(H-1)/2<|\mathcal{M}|/2, we have:

1||i=1|¯|(PMπi,ψ(ψ(K+1)=πi)PM0,ψ(ψ(K+1)=πi))\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}\left(P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i})-P_{M_{0},\psi}(\psi(K+1)=\pi_{i})\right)
=\displaystyle= 1||i=1|¯|(PMπi,ψ(ψ(K+1)=πiEK+1,πi)PM0,ψ(ψ(K+1)=πiEK+1,πi))\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}\Big{(}P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i}\cap E_{K+1,\pi_{i}})-P_{M_{0},\psi}(\psi(K+1)=\pi_{i}\cap E_{K+1,\pi_{i}})\Big{)}
+1||i=1|¯|(PMπi,ψ(ψ(K+1)=πiEK+1,πi)PM0,ψ(ψ(K+1)=πiEK+1,πi))\displaystyle+\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}\Big{(}P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i}\cap E_{K+1,\pi_{i}}^{\complement})-P_{M_{0},\psi}(\psi(K+1)=\pi_{i}\cap E_{K+1,\pi_{i}}^{\complement})\Big{)}
=\displaystyle= 1||i=1|¯|PMπi,ψ(EK+1,πi)(PMπi,ψ(ψ(K+1)=πi|EK+1,πi)PM0,ψ(ψ(K+1)=πi|EK+1,πi))\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{\pi_{i}},\psi}(E_{K+1,\pi_{i}})\Big{(}P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i}|E_{K+1,\pi_{i}})-P_{M_{0},\psi}(\psi(K+1)=\pi_{i}|E_{K+1,\pi_{i}})\Big{)} (Eq.(5) and P(AB)=P(B)P(A|B)P(A\cap B)=P(B)P(A|B))
\displaystyle\leq 1||i=1|¯|PMπi,ψ(EK+1,πi)\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{\pi_{i}},\psi}(E_{K+1,\pi_{i}}) (PMπi,ψ(ψ(K+1)=πi|EK+1,πi)PM0,ψ(ψ(K+1)=πi|EK+1,πi)1P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i}|E_{K+1,\pi_{i}})-P_{M_{0},\psi}(\psi(K+1)=\pi_{i}|E_{K+1,\pi_{i}})\leq 1)
=\displaystyle= 1||i=1|¯|PM0,ψ(EK+1,πi)\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{0},\psi}(E_{K+1,\pi_{i}}) ( Eq. (4))
\displaystyle\leq K||\displaystyle\frac{K}{|\mathcal{M}|}

where the last step is because:

1||i=1|¯|PM0,ψ(EK+1,πi)\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{0},\psi}(E_{K+1,\pi_{i}})
=\displaystyle= 1||i=1|¯|𝔼M0,ψ[𝟏{πi is selected}]\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}\mathbb{E}_{M_{0},\psi}[\mathbf{1}\{\pi_{i}\text{ is selected}\}]
\displaystyle\leq 1||i=1|¯|𝔼M0,ψ[k=1K+1𝟏{πi is selected at deployment k}]\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}\mathbb{E}_{M_{0},\psi}[\sum_{k=1}^{K+1}\mathbf{1}\{\pi_{i}\text{ is selected at deployment }k\}]
=\displaystyle= 1||𝔼M0,ψ[k=1K+1i=1|¯|𝟏{πi is selected at deployment k}]\displaystyle\frac{1}{|\mathcal{M}|}\mathbb{E}_{M_{0},\psi}[\sum_{k=1}^{K+1}\sum_{i=1}^{|\bar{\mathcal{M}}|}\mathbf{1}\{\pi_{i}\text{ is selected at deployment }k\}]
=\displaystyle= K+1||\displaystyle\frac{K+1}{|\mathcal{M}|} (ψ\psi deploy deterministic policy each time)
\displaystyle\leq 12\displaystyle\frac{1}{2} (Deployment time K<||/2K<|\mathcal{M}|/2)

As a result,

\displaystyle\frac{1}{|\mathcal{M}|}+\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{\pi_{i}},\psi}(\psi(K+1)=\pi_{i})\leq\frac{1}{|\mathcal{M}|}+\frac{1}{|\mathcal{M}|}\sum_{i=1}^{|\bar{\mathcal{M}}|}P_{M_{0},\psi}(\psi(K+1)=\pi_{i})+\frac{1}{2}\leq\frac{2}{|\mathcal{M}|}+\frac{1}{2}.

As long as d,H\geq 3, we have |\mathcal{M}|=(d-1)(H-1)+1\geq 5, so the failure rate is at least 1/10, which finishes the proof. ∎

B.3 Proof for Lower Bound in Arbitrary Setting

In the following, we provide a formal statement of the lower bound theorem for the arbitrary policy setting and its proof.

Theorem B.3 (Lower bound for number of deployments in arbitrary setting).

Consider the linear MDP problem with dimension d\geq 2 and horizon H\geq 3, and let N={\rm poly}(d,H,\frac{1}{\varepsilon},\log\frac{1}{\delta}). Let \psi be an arbitrary algorithm (deterministic or stochastic) that can deploy arbitrary policies but can only collect N samples in each deployment. Unless the number of deployments satisfies K>\frac{H-2}{2\lceil\log_{d}NH\rceil}=\Omega(H/\log_{d}(NH))=\widetilde{\Omega}(H), for any \varepsilon there exists an MDP such that the output policy is not \varepsilon-optimal with probability at least \frac{1}{2e}.

Proof.

Since any randomized algorithm is just a distribution over deterministic ones, it suffices to consider only deterministic algorithms \psi in the following proof (Krishnamurthy et al., 2016). The crucial observation is that a deployment corresponds to a fixed distribution (occupancy) over the state space, and this distribution depends only on prior information.

Construction of Hard Instances

We obtain d^{H-2}\times d instances by enumerating the locations of the core states from level 1 to H-2 and the location of the optimal normal state at level H-1. We assign s_{H-1}^{(i+1)\%d} as the core state at level H-1 if s_{H-1}^i is the optimal state. Notice that for this hard instance class, we only consider the case where the optimal state is at level H-1. We use \mathcal{M} to denote this hard instance class.

We make a few claims below and then prove these claims and the theorem. We will use the notation E_i(j) to denote the event that at least one state at level j is reached within the first i deployments. Throughout the discussion, an event is simply a set of trajectories, and a “state at level L” never refers to a state in the absorbing chains. In addition, we will use P_{\mathcal{M},\psi} to denote the distribution over trajectories when executing algorithm \psi on an instance drawn uniformly at random from the hard instance class.

Claim 1.

Assume LH2L\leq H-2. Then for any deterministic algorithm ψ\psi, we have

P_{\mathcal{M},\psi}(E_{1}(L))\leq\frac{N}{d^{L-1}}.
Claim 2.

Assume L+LH2L+L^{\prime}\leq H-2. We have that for any deterministic algorithm ψ\psi,

P_{\mathcal{M},\psi}(E_{k}^{\complement}(L^{\prime}+L))\geq(1-\frac{N}{d^{L-1}})P(E^{\complement}_{k-1}(L^{\prime})).

Proof of Claim 1

By the nature of a deterministic algorithm, the first deployed policy is the same for all instances: the agent has not observed anything yet, so the deployed policy has to be identical.

Let p(i,j,h) denote the probability that the first deployed policy chooses action j at node i at level h. We know that p(i,j,h) is the same for \psi under all instances.

Note that there is a one-to-one correspondence between an MDP in the hard instance class and the specified locations of the core states in layers 1,\ldots,H-2 together with the optimal state at level H-1. Therefore, we can use (i_1,\ldots,i_{H-2},s_{H-1}) to denote any instance in the hard instance class, where i_1,\ldots,i_{H-2} refer to the locations of the core states and s_{H-1} refers to the location of the optimal state. From the construction, we know that for instance (i_1,\ldots,i_{H-2},s_{H-1}), to arrive at state s_L at level L, the path has to be s_0,i_1,\ldots,i_{L-1},s_L. Therefore the probability that a trajectory sampled from \psi reaches state s_L is

p(s_{0},i_{1},0)p(i_{1},i_{2},1)\ldots p(i_{L-2},i_{L-1},L-2)p(i_{L-1},s_{L},L-1).

Here p(i_{L-1},s_{L},L-1) denotes the probability of taking the action at i_{L-1} that transits to s_L, and similarly for the other factors. In each deployment, \psi draws N episodes, so the probability that executing \psi reaches any particular state s_L at level L during the first deployment is no more than N\,p(s_{0},i_{1},0)p(i_{1},i_{2},1)\ldots p(i_{L-2},i_{L-1},L-2)p(i_{L-1},s_{L},L-1).

Summing over i_1,\ldots,i_{L-1} and s_L gives us

i1,i2,,iL1,sLNp(s0,i1,0)p(i1,i2,1)p(iL2,iL1,L2)p(iL1,sL,L1)\displaystyle\sum_{i_{1},i_{2},\ldots,i_{L-1},s_{L}}Np(s_{0},i_{1},0)p(i_{1},i_{2},1)\ldots p(i_{L-2},i_{L-1},L-2)p(i_{L-1},s_{L},L-1)
=\displaystyle= Ni1p(s0,i1,0)i2p(i1,i2,1)iL2p(iL3,iL2,L3)iL1p(iL2,iL1,L2)sLp(iL1,sL,L1)\displaystyle N\sum_{i_{1}}p(s_{0},i_{1},0)\sum_{i_{2}}p(i_{1},i_{2},1)\ldots\sum_{i_{L-2}}p(i_{L-3},i_{L-2},L-3)\sum_{i_{L-1}}p(i_{L-2},i_{L-1},L-2)\sum_{s_{L}}p(i_{L-1},s_{L},L-1)
=\displaystyle= Ni1p(s0,i1,0)i2p(i1,i2,1)iL2p(iL3,iL2,L3)iL1p(iL2,iL1,L2)\displaystyle N\sum_{i_{1}}p(s_{0},i_{1},0)\sum_{i_{2}}p(i_{1},i_{2},1)\ldots\sum_{i_{L-2}}p(i_{L-3},i_{L-2},L-3)\sum_{i_{L-1}}p(i_{L-2},i_{L-1},L-2)
=\displaystyle= Ni1p(s0,i1,0)i2p(i1,i2,1)iL2p(iL3,iL2,L3)\displaystyle N\sum_{i_{1}}p(s_{0},i_{1},0)\sum_{i_{2}}p(i_{1},i_{2},1)\ldots\sum_{i_{L-2}}p(i_{L-3},i_{L-2},L-3)
=\displaystyle= \displaystyle\ldots
=\displaystyle= N.\displaystyle N. (6)

Therefore we have the following equation about P,ψ(E1(L))P_{\mathcal{M},\psi}(E_{1}(L))

P,ψ(E1(L))\displaystyle P_{\mathcal{M},\psi}(E_{1}(L))
=\displaystyle= 1||iL,,iH2,sH1i1,,iL1PI=(i1,,iH1,sH1),ψ(E1(L))\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i_{L},\ldots,i_{H-2},s_{H-1}}\sum_{i_{1},\ldots,i_{L-1}}P_{I=(i_{1},\ldots,i_{H-1},s_{H-1}),\psi}(E_{1}(L))
\displaystyle\leq 1dH1iL,,iH2,sH1i1,,iL1,sLNp(s0,i1,0)p(i1,i2,1)p(iL2,iL1,L2)p(iL1,sL,L1)\displaystyle\frac{1}{d^{H-1}}\sum_{i_{L},\ldots,i_{H-2},s_{H-1}}\sum_{i_{1},\ldots,i_{L-1},s_{L}}Np(s_{0},i_{1},0)p(i_{1},i_{2},1)\ldots p(i_{L-2},i_{L-1},L-2)p(i_{L-1},s_{L},L-1)
=\displaystyle= 1dH1iL,,iH2,sH1N\displaystyle\frac{1}{d^{H-1}}\sum_{i_{L},\ldots,i_{H-2},s_{H-1}}N
=\displaystyle= NdL1.\displaystyle\frac{N}{d^{L-1}}.

Proof of Claim 2

Let \tau denote any possible concatenation of the first kN episodes collected in the first k deployments. In this claim, it suffices to consider these kN episodes because the event E_k(L^{\prime}+L)\cap E^{\complement}_{k-1}(L^{\prime}) only depends on them; hence the sample space and the event are defined over trajectories consisting of kN episodes. For any \tau, we know that \psi outputs the k-th deployed policy based solely on \tau[0,k-1], and this map is deterministic (we use \tau[i,j] to denote the (iN+1)-th to (jN)-th episodes in \tau). In other words, \psi maps \tau[0,k-1] to a fixed policy \psi(\tau[0,k-1]) to deploy at the k-th deployment.

We have the following equation for any II\in\mathcal{M}

PI,ψ(Ek(L+L)Ek1(L))\displaystyle P_{I,\psi}(E_{k}(L^{\prime}+L)\cap E_{k-1}^{\complement}(L^{\prime}))
=\displaystyle= τEk(L+L)Ek1(L)PI,ψ(τ)\displaystyle\sum_{\tau\in E_{k}(L^{\prime}+L)\cap E_{k-1}^{\complement}(L^{\prime})}P_{I,\psi}(\tau)
=\displaystyle= τEk(L+L)Ek1(L)PI,ψ(τ[0,k1])PI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\sum_{\tau\in E_{k}(L^{\prime}+L)\cap E_{k-1}^{\complement}(L^{\prime})}P_{I,\psi}(\tau[0,k-1])P_{I,\psi(\tau[0,k-1])}(\tau[k-1,k])
=\displaystyle= τ:τ[0,k1]Ek1(L),τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])PI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\sum_{\tau:\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime}),\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi}(\tau[0,k-1])P_{I,\psi(\tau[0,k-1])}(\tau^{\prime}[k-1,k])
=\displaystyle= τ:τ[0,k1]Ek1(L)PI,ψ(τ[0,k1])τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\sum_{\tau:\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime})}P_{I,\psi}(\tau[0,k-1])\sum_{\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi(\tau[0,k-1])}(\tau^{\prime}[k-1,k])

Notice that this equality does not hold in general for the mixture distribution P_{\mathcal{M},\psi}.

Then we fix τ[0,k1]\tau[0,k-1], such that τ[0,k1]Ek1(L)\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime}). We also fix (i1,,iL1)(i_{1},\ldots,i_{L^{\prime}-1}), (iL+L,,iH2,sH1)(i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}) and consider two instances I1=(i1,,iL1,iL1,,iL+L11,iL+L,,iH2,sH1)I_{1}=(i_{1},\ldots,i_{L^{\prime}-1},i_{L^{\prime}}^{1},\ldots,i_{L^{\prime}+L-1}^{1},i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}) and I2=(i1,,iL1,iL2,,iL+L12,iL+L,,iH2,sH1)I_{2}=(i_{1},\ldots,i_{L^{\prime}-1},i_{L^{\prime}}^{2},\ldots,i_{L^{\prime}+L-1}^{2},i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}). Therefore, we have that PI1,ψ(τ[0,k1])=PI2,ψ(τ[0,k1])P_{I_{1},\psi}(\tau[0,k-1])=P_{I_{2},\psi}(\tau[0,k-1]) (from the construction of I1I_{1}, I2I_{2} and the property of deterministic algorithm ψ\psi). We use I(i1,,iL1,iL+L,,iH2,sH1)I(i_{1},\ldots,i_{L^{\prime}-1},i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}) to denote the instance class that has fixed (i1,,iL1)(i_{1},\ldots,i_{L^{\prime}-1}), (iL+L,,iH2,sH1)(i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}), but different (iL,iL+L1)(i_{L^{\prime}},\ldots i_{L^{\prime}+L-1}). In addition, we use I(i1,,iL1)I(i_{1},\ldots,i_{L^{\prime}-1}) to denote the instance class that has fixed (i1,,iL1)(i_{1},\ldots,i_{L^{\prime}-1}), but different (iL,iL+L1)(i_{L^{\prime}},\ldots i_{L^{\prime}+L-1}) and (iL+L,,iH2,sH1)(i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}).

Since we have already fixed τ[0,k1]Ek1(L)\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime}) here, ψ(τ[0,k1])\psi(\tau[0,k-1]) is also fixed (for all II(i1,,iL1,iL+L,,iH2,sH1)I\in I(i_{1},\ldots,i_{L^{\prime}-1},i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1})). Also notice that we are considering the probability of NN episodes τ[k1:k]\tau^{\prime}[k-1:k]. Therefore, we can follow Claim 1 and define p(i,j,h)p(i,j,h) for 0hL+L10\leq h\leq L^{\prime}+L-1, which represents the probability of choosing action jj at node ii at level hh under the kk-th deployment policy. In the kk-th deployment, ψ\psi draws NN episodes, so the probability of executing ψ\psi to reach any state sL+Ls_{L^{\prime}+L} at level L+LL^{\prime}+L under instance I=(i1,,iH2,sH1)I=(i_{1},\ldots,i_{H-2},s_{H-1}) is

Np(s0,i1,0)p(i1,i2,1)p(iL+L2,iL+L1,L+L2)p(iL+L1,sL+L,L+L1)\displaystyle Np(s_{0},i_{1},0)p(i_{1},i_{2},1)\ldots p(i_{L^{\prime}+L-2},i_{L^{\prime}+L-1},L^{\prime}+L-2)p(i_{L^{\prime}+L-1},s_{L^{\prime}+L},L^{\prime}+L-1)
\displaystyle\leq Np(iL1,iL,L1)p(iL+L2,iL+L1,L+L2)p(iL+L1,sL+L,L+L1).\displaystyle Np(i_{L^{\prime}-1},i_{L^{\prime}},L^{\prime}-1)\ldots p(i_{L^{\prime}+L-2},i_{L^{\prime}+L-1},L^{\prime}+L-2)p(i_{L^{\prime}+L-1},s_{L^{\prime}+L},L^{\prime}+L-1).

Following the same step in Eq (6) by summing over iL,,iL+L1i_{L^{\prime}},\ldots,i_{L^{\prime}+L-1} and sL+Ls_{L^{\prime}+L} gives us

II(i1,,iL1,iL+L,,iH2,sH1)τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\sum_{I\in I(i_{1},\ldots,i_{L^{\prime}-1},i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1})}\quad\sum_{\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi(\tau[0,k-1])}(\tau^{\prime}[k-1,k])
\displaystyle\leq N.\displaystyle N.

Now, we sum over all possible (iL+L,,iH2,sH1)(i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}) and take the average. For any fixed τ[0,k1]Ek1(L)\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime}) we have

1|I(i1,,iL1)|II(i1,,iL1)τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\frac{1}{|I(i_{1},\ldots,i_{L^{\prime}-1})|}\sum_{I\in I(i_{1},\ldots,i_{L^{\prime}-1})}\quad\sum_{\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi(\tau[0,k-1])}(\tau[k-1,k])
=\displaystyle= 1|I(i1,,iL1)|iL+L,,iH2,sH1II(i1,,iL1,iL+L,,iH2,sH1)\displaystyle\frac{1}{|I(i_{1},\ldots,i_{L^{\prime}-1})|}\sum_{i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}}\quad\sum_{I\in I(i_{1},\ldots,i_{L^{\prime}-1},i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1})}
τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\quad\sum_{\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi(\tau[0,k-1])}(\tau[k-1,k])
\displaystyle\leq 1|I(i1,,iL1)|iL+L,,iH2,sH1N\displaystyle\frac{1}{|I(i_{1},\ldots,i_{L^{\prime}-1})|}\sum_{i_{L^{\prime}+L},\ldots,i_{H-2},s_{H-1}}N
=\displaystyle= dHLLdHLN\displaystyle\frac{d^{H-L^{\prime}-L}}{d^{H-L^{\prime}}}N
=\displaystyle= 1dLN.\displaystyle\frac{1}{d^{L}}N.

Moreover, summing over all \tau[0,k-1]\in E^{\complement}_{k-1}(L^{\prime}) and over i_1,\ldots,i_{L^{\prime}-1} gives us

P,ψ(Ek(L+L)Ek1(L))\displaystyle P_{\mathcal{M},\psi}(E_{k}(L^{\prime}+L)\cap E_{k-1}^{\complement}(L^{\prime}))
=\displaystyle= 1||i1,,iL1II(i1,,iL1)τ[0,k1]Ek1(L)PI,ψ(τ[0,k1])\displaystyle\frac{1}{|\mathcal{M}|}\sum_{i_{1},\ldots,i_{L^{\prime}-1}}\sum_{I\in I(i_{1},\ldots,i_{L^{\prime}-1})}\sum_{\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime})}P_{I,\psi}(\tau[0,k-1])
τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\quad\sum_{\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi(\tau[0,k-1])}(\tau^{\prime}[k-1,k])
=\displaystyle= 1||τ[0,k1]Ek1(L)i1,,iL1Pi1,,iL1,ψ(τ[0,k1])\displaystyle\frac{1}{|\mathcal{M}|}\sum_{\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime})}\sum_{i_{1},\ldots,i_{L^{\prime}-1}}P_{i_{1},\ldots,i_{L^{\prime}-1},\psi}(\tau[0,k-1])
II(i1,,iL1)τ:τ[0,k1]=τ[0,k1] and τ[k1,k] hit level L+LPI,ψ(τ[0,k1])(τ[k1,k])\displaystyle\quad\sum_{I\in I(i_{1},\ldots,i_{L^{\prime}-1})}\sum_{\tau^{\prime}:\tau^{\prime}[0,k-1]=\tau[0,k-1]\text{ and }\tau^{\prime}[k-1,k]\text{ hit level $L^{\prime}+L$}}P_{I,\psi(\tau[0,k-1])}(\tau^{\prime}[k-1,k])
\displaystyle\leq 1||τ[0,k1]Ek1(L)i1,,iL1Pi1,,iL1,ψ(τ[0,k1])|I(i1,,iL1)|NdL\displaystyle\frac{1}{|\mathcal{M}|}\sum_{\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime})}\sum_{i_{1},\ldots,i_{L^{\prime}-1}}P_{i_{1},\ldots,i_{L^{\prime}-1},\psi}(\tau[0,k-1])|I(i_{1},\ldots,i_{L^{\prime}-1})|\frac{N}{d^{L}}
=\displaystyle= NdL|I(i1,,iL1)|||τ[0,k1]Ek1(L)i1,,iL1Pi1,,iL1,ψ(τ[0,k1])\displaystyle\frac{N}{d^{L}}\frac{|I(i_{1},\ldots,i_{L^{\prime}-1})|}{|\mathcal{M}|}\sum_{\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime})}\sum_{i_{1},\ldots,i_{L^{\prime}-1}}P_{i_{1},\ldots,i_{L^{\prime}-1},\psi}(\tau[0,k-1])
=\displaystyle= NdLτ[0,k1]Ek1(L)P,ψ(τ[0,k1])\displaystyle\frac{N}{d^{L}}\sum_{\tau[0,k-1]\in E_{k-1}^{\complement}(L^{\prime})}P_{\mathcal{M},\psi}(\tau[0,k-1])
=\displaystyle= NdLP,ψ(Ek1(L)).\displaystyle\frac{N}{d^{L}}P_{\mathcal{M},\psi}(E_{k-1}^{\complement}(L^{\prime})).

In the second equality, we write P_{i_1,\ldots,i_{L^{\prime}-1},\psi} because, for any fixed \tau[0,k-1]\in E^{\complement}_{k-1}(L^{\prime}) and all I\in I(i_1,\ldots,i_{L^{\prime}-1}), the probabilities P_{I,\psi}(\tau[0,k-1]) are the same.

Finally, we have

P,ψ(Ek(L+L))\displaystyle P_{\mathcal{M},\psi}(E_{k}^{\complement}(L^{\prime}+L)) P,ψ(Ek(L+L)Ek1(L))\displaystyle\geq P_{\mathcal{M},\psi}(E_{k}^{\complement}(L^{\prime}+L)\cap E^{\complement}_{k-1}(L^{\prime}))
=P,ψ(Ek1(L))P,ψ(Ek(L+L)Ek1(L))\displaystyle=P_{\mathcal{M},\psi}(E_{k-1}^{\complement}(L^{\prime}))-P_{\mathcal{M},\psi}(E_{k}(L^{\prime}+L)\cap E^{\complement}_{k-1}(L^{\prime}))
(1NdL)P(Ek1(L))\displaystyle\geq(1-\frac{N}{d^{L}})P(E^{\complement}_{k-1}(L^{\prime}))
(1NdL1)P(Ek1(L)).\displaystyle\geq(1-\frac{N}{d^{L-1}})P(E^{\complement}_{k-1}(L^{\prime})).

Proof of the Theorem

If KLH2KL\leq H-2, then applying Claim 2 for K1K-1 times and applying Claim 1 tells us

P,ψ(EK(KL))=\displaystyle P_{\mathcal{M},\psi}(E_{K}^{\complement}(KL))= P,ψ(EK((K1)L+L))(1NdL1)P(EK1((K1)L))\displaystyle P_{\mathcal{M},\psi}(E_{K}^{\complement}((K-1)L+L))\geq(1-\frac{N}{d^{L-1}})P(E^{\complement}_{K-1}((K-1)L))
\displaystyle\geq (1NdL1)K1P(E1(L))(1NdL1)K.\displaystyle\ldots\geq(1-\frac{N}{d^{L-1}})^{K-1}P(E^{\complement}_{1}(L))\geq(1-\frac{N}{d^{L-1}})^{K}.

We can set L=logdNH+1L=\lceil\log_{d}NH\rceil+1 and KH22logdNHK\leq\frac{H-2}{2\lceil\log_{d}NH\rceil}. Then for H3H\geq 3, we get KLH2KL\leq H-2 and

\displaystyle P_{\mathcal{M},\psi}(\text{does not hit any state at level $H-2$})\geq(1-\frac{N}{d^{\lceil\log_{d}NH\rceil}})^{\frac{H-2}{2\lceil\log_{d}NH\rceil}}
\displaystyle\geq(1-\frac{N}{NH})^{\frac{H-2}{2\lceil\log_{d}NH\rceil}}
\displaystyle\geq(1-\frac{1}{H})^{H-1}\qquad(\text{since the exponent is at most }(H-2)/2\leq H-1)
\displaystyle\geq\frac{1}{e}.
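As a quick numerical sanity check (ours, not part of the proof), the following snippet evaluates the bound (1-N/d^{L-1})^{K} for the choice L=\lceil\log_d NH\rceil+1 and K=\lfloor\frac{H-2}{2\lceil\log_d NH\rceil}\rfloor on a few example values of d, H and N, and checks that it is at least 1/e.

```python
import math

# Numerical check (ours) of the chain of inequalities above: the probability of
# never reaching level H-2 within K deployments of N episodes stays >= 1/e.
for d, H, N in [(5, 20, 10**4), (10, 50, 10**6)]:
    L = math.ceil(math.log(N * H, d)) + 1
    K = (H - 2) // (2 * (L - 1))
    lower = (1 - N / d ** (L - 1)) ** K          # Claim 1 + Claim 2, applied K times
    print(d, H, N, K, round(lower, 3), lower >= 1 / math.e - 1e-9)
```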

Let F denote the event (a set of trajectories consisting of KN episodes) that no state at level H-2 is hit. Then we have P_{\mathcal{M},\psi}(F)\geq\frac{1}{e}. We use I(i_1,\ldots,i_{H-2}) to denote the instance class that has fixed core states (i_1,\ldots,i_{H-2}) but different optimal states s_{H-1}.

Consider any fixed \tau\in F. As in the proofs of the prior claims, by the property of a deterministic algorithm we can define p(i,j,h) for h=H-2, which represents the probability that the output policy \psi_{\tau}(K+1) under trajectory \tau chooses action j at node i at level H-2. Then we have

I=(i1,,iH2,sH1)I(i1,,iH2)PI,ψ(ψτ(K+1) chooses optimal state)\displaystyle\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}P_{I,\psi}(\psi_{\tau}(K+1)\text{ chooses optimal state)}
=\displaystyle= I=(i1,,iH2,sH1)I(i1,,iH2)PI,ψ((ψτ(K+1))(iH2)=sH1)\displaystyle\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}P_{I,\psi}((\psi_{\tau}(K+1))(i_{H-2})=s_{H-1})
=\displaystyle= I=(i1,,iH2,sH1)I(i1,,iH2)p(iH2,sH1,H2)\displaystyle\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}p(i_{H-2},s_{H-1},H-2)
=\displaystyle= sH1p(iH2,sH1,H2)\displaystyle\sum_{s_{H-1}}p(i_{H-2},s_{H-1},H-2)
=\displaystyle= 1.\displaystyle 1.

Summing over τF\tau\in F gives us

I=(i1,,iH2,sH1)I(i1,,iH2)PI,ψ(F the output policy chooses optimal state)\displaystyle\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}P_{I,\psi}(F\cap\text{ the output policy chooses optimal state})
=\displaystyle= I=(i1,,iH2,sH1)I(i1,,iH2)τFPI,ψ(τ)PI,ψ(ψτ(K+1) chooses optimal state)\displaystyle\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}\sum_{\tau\in F}P_{I,\psi}(\tau)P_{I,\psi}(\psi_{\tau}(K+1)\text{ chooses optimal state)}
=\displaystyle= I=(i1,,iH2,sH1)I(i1,,iH2)τFPi1,,iH2,ψ(τ)PI,ψ(ψτ(K+1) chooses optimal state)\displaystyle\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}\sum_{\tau\in F}P_{i_{1},\ldots,i_{H-2},\psi}(\tau)P_{I,\psi}(\psi_{\tau}(K+1)\text{ chooses optimal state)}
=\displaystyle= τFPi1,,iH2,ψ(τ)I=(i1,,iH2,sH1)I(i1,,iH2)PI,ψ(ψτ(K+1) chooses optimal state)\displaystyle\sum_{\tau\in F}P_{i_{1},\ldots,i_{H-2},\psi}(\tau)\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}P_{I,\psi}(\psi_{\tau}(K+1)\text{ chooses optimal state)}
=\displaystyle= τFPi1,,iH2,ψ(τ)\displaystyle\sum_{\tau\in F}P_{i_{1},\ldots,i_{H-2},\psi}(\tau)
=\displaystyle= Pi1,,iH2,ψ(F).\displaystyle P_{i_{1},\ldots,i_{H-2},\psi}(F).

In the second equality, we notice that for all instances I\in I(i_1,\ldots,i_{H-2}), the probabilities P_{I,\psi}(\tau) are the same, so this distribution essentially depends only on i_1,\ldots,i_{H-2}. In the third equality, we change the order of summation.

Finally, summing over i1,,iH2i_{1},\ldots,i_{H-2} and taking average yields that

P,ψ(F the output policy chooses optimal state)\displaystyle P_{\mathcal{M},\psi}(F\cap\text{ the output policy chooses optimal state})
=\displaystyle= 1dH1i1,,iH2I=(i1,,iH2,sH1)I(i1,,iH2)PI,ψ(F the output policy chooses optimal state)\displaystyle\frac{1}{d^{H-1}}\sum_{i_{1},\ldots,i_{H-2}}\sum_{I=(i_{1},\ldots,i_{H-2},s_{H-1})\in I(i_{1},\ldots,i_{H-2})}P_{I,\psi}(F\cap\text{ the output policy chooses optimal state})
=\displaystyle= 1dH1i1,,iH2Pi1,,iH2,ψ(F)\displaystyle\frac{1}{d^{H-1}}\sum_{i_{1},\ldots,i_{H-2}}P_{i_{1},\ldots,i_{H-2},\psi}(F)
=\displaystyle= 1dH1i1,,iH21dsH1PI=(i1,,iH2,sH1),ψ(F)\displaystyle\frac{1}{d^{H-1}}\sum_{i_{1},\ldots,i_{H-2}}\frac{1}{d}\sum_{s_{H-1}}P_{I=(i_{1},\ldots,i_{H-2},s_{H-1}),\psi}(F) (PI,ψ(F)P_{I,\psi}(F) does not depend on the optimal state)
=\displaystyle= 1dP,ψ(F)\displaystyle\frac{1}{d}P_{\mathcal{M},\psi}(F)

Therefore, we get the probability of not choosing the optimal state is

P,ψ( the output policy does not choose the optimal state)\displaystyle P_{\mathcal{M},\psi}(\text{ the output policy does not choose the optimal state})
\displaystyle\geq P,ψ(F the output policy does not choose the optimal state)\displaystyle P_{\mathcal{M},\psi}(F\cap\text{ the output policy does not choose the optimal state})
=\displaystyle= P,ψ(F)P,ψ(F the output policy chooses the optimal state)\displaystyle P_{\mathcal{M},\psi}(F)-P_{\mathcal{M},\psi}(F\cap\text{ the output policy chooses the optimal state})
=\displaystyle= d1dP,ψ(F)\displaystyle\frac{d-1}{d}P_{\mathcal{M},\psi}(F)
\displaystyle\geq 121e.\displaystyle\frac{1}{2}\cdot\frac{1}{e}.

From the construction, we know that any policy that does not choose the optimal state (and thus does not choose the optimal action associated with it) is \varepsilon suboptimal. This implies that with probability at least \frac{1}{2e}, the output policy is at least \varepsilon suboptimal. ∎

Appendix C Deployment-Efficient RL with Deterministic Policies and given Reward Function

C.1 Additional Notations

In the appendix, we will frequently consider the MDP truncated at h~H\widetilde{h}\leq H, and we will use:

\displaystyle V^{\pi}_{h}(s|\widetilde{h})=\mathbb{E}[\sum_{h^{\prime}=h}^{\widetilde{h}}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})|s_{h}=s,\pi],\quad Q^{\pi}_{h}(s,a|\widetilde{h})=\mathbb{E}[\sum_{h^{\prime}=h}^{\widetilde{h}}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})|s_{h}=s,a_{h}=a,\pi]

to denote the value functions in the truncated MDP for arbitrary h\leq\widetilde{h}, and we also extend the definitions in Section 2 to V^{*}_{h}(\cdot|\widetilde{h}),Q^{*}_{h}(\cdot,\cdot|\widetilde{h}),\pi^{*}_{|\widetilde{h}} for the setting with a given reward function and V^{*}_{h}(\cdot,r|\widetilde{h}),Q^{*}_{h}(\cdot,\cdot,r|\widetilde{h}),\pi^{*}_{r|\widetilde{h}} for the reward-free setting.
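For concreteness, here is a small sketch (ours) of what these truncated value functions compute on a finite MDP: V^{\pi}_{h}(s|\widetilde{h}) obtained by backward induction, summing rewards from layer h through layer \widetilde{h}. The dictionary-based MDP format is the same illustrative convention used in the sketch of Appendix B.1 and is not part of the paper.

```python
# A small sketch (ours) of evaluating the truncated value function defined
# above by backward induction: V^pi_h(s | h_tilde) sums the rewards collected
# from layer h through layer h_tilde under policy pi.
def truncated_values(P, r, pi, states, h_tilde):
    """P[(h, s, a)] -> {s': prob}; r[(h, s, a)] -> reward; pi[(h, s)] -> action."""
    V = {s: 0.0 for s in states}                      # V_{h_tilde + 1} = 0
    for h in range(h_tilde, 0, -1):                   # backward over layers
        V_next, V = V, {}
        for s in states:
            a = pi.get((h, s))
            if a is None or (h, s, a) not in P:       # state has no action at layer h
                V[s] = 0.0
                continue
            V[s] = r[(h, s, a)] + sum(p * V_next.get(s2, 0.0)
                                      for s2, p in P[(h, s, a)].items())
    return V                                          # V[s] = V^pi_1(s | h_tilde)
```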

C.2 Auxiliary Lemma

Lemma C.1 (Elliptical Potential Lemma; Lemma 26 of Agarwal et al. (2020b)).

Consider a sequence of d×dd\times d positive semi-definite matrices X1,,XTX_{1},...,X_{T} with maxtTr(Xt)1\max_{t}Tr(X_{t})\leq 1 and define M0=λI,,Mt=Mt1+XtM_{0}=\lambda I,...,M_{t}=M_{t-1}+X_{t}. Then

\displaystyle\sum_{t=1}^{T}Tr(X_{t}M^{-1}_{t-1})\leq(1+1/\lambda)d\log(1+T/d).
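The following short NumPy snippet (ours, not part of the paper) numerically illustrates the lemma with random rank-one matrices X_t=\phi_t\phi_t^{\top}, \|\phi_t\|\leq 1: the accumulated potential on the left-hand side stays below the (1+1/\lambda)d\log(1+T/d) bound.

```python
import numpy as np

# Numerical sanity check (ours) of the elliptical potential lemma for
# rank-one X_t = phi_t phi_t^T with Tr(X_t) = ||phi_t||^2 <= 1.
rng = np.random.default_rng(0)
d, T, lam = 5, 2000, 1.0
M = lam * np.eye(d)
lhs = 0.0
for _ in range(T):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))      # enforce ||phi|| <= 1
    X = np.outer(phi, phi)
    lhs += np.trace(X @ np.linalg.inv(M))     # Tr(X_t M_{t-1}^{-1})
    M += X
rhs = (1 + 1 / lam) * d * np.log(1 + T / d)
print(lhs, rhs, lhs <= rhs)                   # expected: True
```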
Lemma C.2 (Abbasi-yadkori et al. (2011)).

Suppose 𝐀,𝐁d×d\mathbf{A},\mathbf{B}\in\mathbb{R}^{d\times d} are two positive definite matrices satisfying 𝐀𝐁\mathbf{A}\succeq\mathbf{B}, then for any xdx\in\mathbb{R}^{d}, we have:

\displaystyle\|x\|^{2}_{\mathbf{A}}\leq\|x\|^{2}_{\mathbf{B}}\frac{\det(\mathbf{A})}{\det(\mathbf{B})}.

Next, we prove a lemma bridging the trace and the determinant, which is crucial for establishing our key technical result, Lemma 4.2.

Lemma C.3.

[Bridge between Trace and Determinant] Consider a sequence of matrices 𝐀0,𝐀N,,𝐀(K1)N\mathbf{A}_{0},\mathbf{A}_{N},...,\mathbf{A}_{(K-1)N} with 𝐀0=I\mathbf{A}_{0}=I and 𝐀kN=𝐀(k1)N+Φk1\mathbf{A}_{kN}=\mathbf{A}_{(k-1)N}+\Phi_{k-1}, where Φk1=t=(k1)N+1kNϕtϕt\Phi_{k-1}=\sum_{t=(k-1)N+1}^{kN}\phi_{t}\phi_{t}^{\top}. We have

\displaystyle Tr(\mathbf{A}^{-1}_{(k-1)N}\Phi_{k-1})\leq\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}\log\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}.
Proof.

Consider a more general case, given matrix 𝐘I\mathbf{Y}\succeq I, we have the following inequality

\displaystyle Tr(\mathbf{I}-\mathbf{Y}^{-1})\leq\log\det(\mathbf{Y})\leq Tr(\mathbf{Y}-\mathbf{I}).

By replacing 𝐘\mathbf{Y} with 𝐈+𝐀1𝐗\mathbf{I}+\mathbf{A}^{-1}\mathbf{X} in the above inequality, we have:

Tr((𝐀+𝐗)1𝐗)=\displaystyle Tr((\mathbf{A}+\mathbf{X})^{-1}\mathbf{X})= Tr((𝐈+𝐀1𝐗)1(𝐀1𝐗))=Tr((𝐈+𝐀1𝐗)1(𝐈+𝐀1𝐗𝐈)\displaystyle Tr((\mathbf{I}+\mathbf{A}^{-1}\mathbf{X})^{-1}(\mathbf{A}^{-1}\mathbf{X}))=Tr((\mathbf{I}+\mathbf{A}^{-1}\mathbf{X})^{-1}(\mathbf{I}+\mathbf{A}^{-1}\mathbf{X}-\mathbf{I})
=\displaystyle= Tr(𝐈(𝐈+𝐀1𝐗)1)\displaystyle Tr(\mathbf{I}-(\mathbf{I}+\mathbf{A}^{-1}\mathbf{X})^{-1})
\displaystyle\leq logdet(𝐈+𝐀1𝐗)=logdet(𝐀+𝐗)det(𝐀).\displaystyle\log\det(\mathbf{I}+\mathbf{A}^{-1}\mathbf{X})=\log\frac{\det(\mathbf{A}+\mathbf{X})}{\det(\mathbf{A})}.

By assigning 𝐀=𝐀(k1)N\mathbf{A}=\mathbf{A}_{(k-1)N} and 𝐗=Φk1\mathbf{X}=\Phi_{k-1}, and applying Lemma C.2, we have:

Tr(𝐀(k1)N1Φk1)=\displaystyle Tr(\mathbf{A}_{(k-1)N}^{-1}\Phi_{k-1})= t=(k1)N+1kNϕt𝐀(k1)N12\displaystyle\sum_{t=(k-1)N+1}^{kN}\|\phi_{t}\|^{2}_{\mathbf{A}_{(k-1)N}^{-1}}
\displaystyle\leq t=(k1)N+1kNϕt𝐀kN12det𝐀kNdet(𝐀(k1)N)\displaystyle\sum_{t=(k-1)N+1}^{kN}\|\phi_{t}\|^{2}_{\mathbf{A}_{kN}^{-1}}\frac{\det\mathbf{A}_{kN}}{\det(\mathbf{A}_{(k-1)N})}
=\displaystyle= Tr(𝐀kN1Φk1)det𝐀kNdet(𝐀(k1)N)\displaystyle Tr(\mathbf{A}_{kN}^{-1}\Phi_{k-1})\frac{\det\mathbf{A}_{kN}}{\det(\mathbf{A}_{(k-1)N})}
\displaystyle\leq det𝐀kNdet(𝐀(k1)N)logdet𝐀kNdet(𝐀(k1)N)\displaystyle\frac{\det\mathbf{A}_{kN}}{\det(\mathbf{A}_{(k-1)N})}\log\frac{\det\mathbf{A}_{kN}}{\det(\mathbf{A}_{(k-1)N})}

which finishes the proof. ∎
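A quick numerical check (ours, not part of the proof) of Lemma C.3 with random rank-one updates: the trace-determinant inequality is verified for a few consecutive blocks of N unit feature vectors.

```python
import numpy as np

# Numerical check (ours) of Lemma C.3:
# Tr(A^{-1} Phi) <= (det(A + Phi)/det(A)) * log(det(A + Phi)/det(A)).
rng = np.random.default_rng(1)
d, N = 6, 50
A = np.eye(d)
for _ in range(3):                       # three consecutive "deployments"
    Phi = np.zeros((d, d))
    for _ in range(N):
        phi = rng.normal(size=d)
        phi /= np.linalg.norm(phi)       # unit-norm feature vector
        Phi += np.outer(phi, phi)
    lhs = np.trace(np.linalg.inv(A) @ Phi)
    ratio = np.linalg.det(A + Phi) / np.linalg.det(A)
    print(lhs <= ratio * np.log(ratio))  # expected: True
    A = A + Phi
```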

Lemma 4.2 (restated).

Proof.

Suppose we have Tr(𝐀(k1)N1Φk1)NεTr(\mathbf{A}^{-1}_{(k-1)N}\Phi_{k-1})\geq N\varepsilon, by applying Lemma C.3 we must have:

Nε\displaystyle N\varepsilon\leq det(𝐀kN)det(𝐀(k1)N)logdet(𝐀kN)det(𝐀(k1)N)det(𝐀kN)det(𝐀(k1)N)log(det(𝐀kN))\displaystyle\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}\log\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}\leq\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}\log(\det(\mathbf{A}_{kN}))
\displaystyle\leq ddet(𝐀kN)det(𝐀(k1)N)log(1+KN/d)\displaystyle d\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}\log(1+KN/d) (det(A)(Tr(A)/d)d\det(A)\leq(Tr(A)/d)^{d})

which implies that,

\displaystyle\frac{N\varepsilon}{d\log(1+KN/d)}\leq\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}

Therefore,

|𝒦+|logNεdlog(1+KN/d)\displaystyle|\mathcal{K}^{+}|\log\frac{N\varepsilon}{d\log(1+KN/d)}\leq k𝒦logdet(𝐀kN)det(𝐀(k1)N)k=1Klogdet(𝐀kN)det(𝐀(k1)N)\displaystyle\sum_{k\in\mathcal{K}}\log\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}\leq\sum_{k=1}^{K}\log\frac{\det(\mathbf{A}_{kN})}{\det(\mathbf{A}_{(k-1)N})}
=\displaystyle= logdet(𝐀KN)det(𝐀0)dlog(1+KN/d)\displaystyle\log\frac{\det(\mathbf{A}_{KN})}{\det(\mathbf{A}_{0})}\leq d\log(1+KN/d)

which implies that, conditioning on Ndεlog(1+KN/d)N\geq\frac{d}{\varepsilon}\log(1+KN/d), we have:

|\mathcal{K}^{+}|\leq d\frac{\log(1+KN/d)}{\log(\frac{N\varepsilon}{d\log(1+KN/d)})}

Now, we are interested in finding the minimum N under the constraint that |\mathcal{K}^{+}|\leq c_{K}d. To solve this problem, we first choose an arbitrary p\leq c_{K} and find an N such that

\displaystyle\frac{\log(1+KN/d)}{\log(\frac{N\varepsilon}{d\log(1+KN/d)})}\leq p

In order to guarantee the above, we need:

\displaystyle N\varepsilon\geq d\log(1+KN/d),\quad(\frac{N\varepsilon}{d\log(1+KN/d)})^{p}\geq 1+KN/d

The first constraint can be satisfied easily with Nc1dεlogdHεN\geq c_{1}\frac{d}{\varepsilon}\log\frac{dH}{\varepsilon} for some constant c1c_{1}. Since usually KN/d>1KN/d>1, the second constraint can be directly satisfied if:

\displaystyle(\frac{N\varepsilon}{d\log(1+KN/d)})^{p}\geq 2KN/d

Recall K=cKdH+1K=c_{K}dH+1, it can be satisfied by choosing

\displaystyle N\geq c_{2}\Big{(}c_{K}\frac{Hd^{p}}{\varepsilon^{p}}\log^{p}(\frac{Hd}{\varepsilon})\Big{)}^{\frac{1}{p-1}} (7)

where c2c_{2} is an absolute constant. Therefore, we can find an absolute number cc such that,

\displaystyle N=c\Big{(}c_{K}\frac{Hd^{p}}{\varepsilon^{p}}\log^{p}(\frac{Hd}{\varepsilon})\Big{)}^{\frac{1}{p-1}}\geq\max\{c_{1}\frac{d}{\varepsilon}\log(\frac{d}{\varepsilon}),c_{2}\Big{(}c_{K}\frac{Hd^{p}}{\varepsilon^{p}}\log^{p}(\frac{Hd}{\varepsilon})\Big{)}^{\frac{1}{p-1}}\}

to make sure that

\displaystyle|\mathcal{K^{+}}|\leq pd

Since Eq.(7) requires 1/(p-1)<\infty, we must have p>1 and therefore c_{K}\geq 2. Because the dependence on d, H, \frac{1}{\varepsilon} and \log\frac{dH}{\varepsilon} decreases as p increases over the range 1<p\leq c_{K}, N is minimized by assigning p=c_{K}. This finishes the proof. ∎
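The following small sketch (ours; the absolute constants c, c_1, c_2 are set to 1, which is an assumption for illustration only) evaluates the batch size N from Eq. (7) with p=c_K and the corresponding deployment budget K=c_K dH+1, showing how a larger c_K trades a slightly larger deployment budget for a much smaller per-deployment batch.

```python
import math

# Illustration (ours; constants set to 1) of the batch size N from Eq. (7)
# with p = c_K, and the deployment budget K = c_K * d * H + 1.
def batch_size(d, H, eps, c_K):
    p = c_K
    return (c_K * H * d ** p / eps ** p * math.log(H * d / eps) ** p) ** (1 / (p - 1))

d, H, eps = 20, 30, 0.1
for c_K in (2, 3, 5, 10):
    print(c_K, c_K * d * H + 1, f"{batch_size(d, H, eps, c_K):.3e}")
```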

C.3 Analysis for Algorithms

Next, we will use the above lemma to bound the difference between J(πK)J(\pi_{K}) and J(π)J(\pi^{*}). We first prove a lemma similar to Lemma B.3 in (Jin et al., 2019) and Lemma A.1 in (Wang et al., 2020b).

Lemma C.4 (Concentration Lemma).

We use \mathcal{E}_{1} to denote the event that, when running Algorithm 1, the following inequality holds for all k\in[K], all h\in[h_{k}], and every value function V^{k}_{h+1} occurring in Algorithm 1:

\displaystyle\Big{\|}\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\Big{(}V^{k}_{h+1}(s_{h+1}^{\tau n})-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s_{h}^{\tau n},a_{h}^{\tau n})V^{k}_{h+1}(s^{\prime})\Big{)}\Big{\|}_{(\Lambda^{k}_{h})^{-1}}\leq c\cdot dH\sqrt{\log(dKNH/\delta)}

Under Assumption A, there exists some absolute constant c0c\geq 0, such that P(1)1δ/2P(\mathcal{E}_{1})\geq 1-\delta/2.

Proof.

The proof is almost identical to that of Lemma B.3 in Jin et al. (2019), so we omit it here. The only differences are that we have an inner summation from n=1 to N and that we truncate the horizon at h_{k} in iteration k. ∎

Lemma C.5 (Overestimation).

On the event 1\mathcal{E}_{1} in Lemma C.4, which holds with probability 1δ/21-\delta/2, for all k[K]k\in[K] and n[N]n\in[N],

\displaystyle V_{1}^{*}(s_{1}^{kn}|h_{k})\leq V_{1}^{k}(s_{1}^{kn})

where recall that V1kV_{1}^{k} is the function computed at iteration kk in Alg.1 and V1(|hk)=𝔼[h=1hkrh(sh,ah)|π[1:hk]]V_{1}^{*}(\cdot|h_{k})=\mathbb{E}[\sum_{h=1}^{h_{k}}r_{h}(s_{h},a_{h})|\pi^{*}_{[1:h_{k}]}] denote the optimal value function in the MDP truncated at layer hkh_{k} and π[1:hk]\pi^{*}_{[1:h_{k}]} is the optimal policy in the truncated MDP.

Besides, we also have:

\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1}|h_{k})-V^{\pi_{k}}(s_{1}|h_{k})]\leq 2\beta\mathbb{E}_{s_{1},a_{1},...,s_{h_{k}},a_{h_{k}}\sim\pi_{k}}[\sum_{h=1}^{h_{k}}\|\phi(s_{h},a_{h})\|_{(\Lambda^{k}_{h})^{-1}}]
Proof.

First of all, by applying Lemma C.4 above, after a similar discussion to the proof of Lemma 3.1 in (Wang et al., 2020b), we can show that

|ϕ(s,a)whks𝒮Ph(s|s,a)Vh+1k(s)|βϕ(s,a)(Λhk)1,s𝒮,a𝒜,h[hk]|\phi(s,a)^{\top}w^{k}_{h}-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s,a)V^{k}_{h+1}(s^{\prime})|\leq\beta\|\phi(s,a)\|_{(\Lambda^{k}_{h})^{-1}},\quad\forall s\in\mathcal{S},a\in\mathcal{A},h\in[h_{k}]

and the overestimation

Vh(s|hk)Vhk(s),s𝒮,h[hk]\displaystyle V^{*}_{h}(s|h_{k})\leq V^{k}_{h}(s),\quad\forall s\in\mathcal{S},h\in[h_{k}]

As a result,

𝔼s1d1[V1(s1|hk)Vπk(s1|hk)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1}|h_{k})-V^{\pi_{k}}(s_{1}|h_{k})]
\displaystyle\leq 𝔼s1d1[V1k(s1)Vπk(s1|hk)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{k}(s_{1})-V^{\pi_{k}}(s_{1}|h_{k})]
=\displaystyle= 𝔼s1d1,a1πk[Q1k(s1,a1)Qπk(s1,a1|hk)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[Q_{1}^{k}(s_{1},a_{1})-Q^{\pi_{k}}(s_{1},a_{1}|h_{k})]
\displaystyle=\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1})+r_{1}(s_{1},a_{1})+u_{1}^{k}(s_{1},a_{1}),H\}-r_{1}(s_{1},a_{1})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{\pi_{k}}_{2}(s_{2}|h_{k})]
\displaystyle=\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1}),H-r_{1}(s_{1},a_{1})-u_{1}^{k}(s_{1},a_{1})\}-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})]
\displaystyle+\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{\pi_{k}}_{2}(s_{2}|h_{k})]+\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[u^{k}_{1}(s_{1},a_{1})]
\displaystyle\leq\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V_{2}^{k}(s_{2})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{\pi_{k}}_{2}(s_{2}|h_{k})]+2\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[u^{k}_{1}(s_{1},a_{1})]
\displaystyle=\mathbb{E}_{s_{1}\sim d_{1},a_{1},s_{2},a_{2}\sim\pi_{k}}[V_{2}^{k}(s_{2})-V^{\pi_{k}}_{2}(s_{2}|h_{k})]+2\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[u^{k}_{1}(s_{1},a_{1})]
\displaystyle\leq \displaystyle...
\displaystyle\leq 2𝔼s1d1,a1,,shk,ahkπk[h=1hkuhk(sh,ah)]\displaystyle 2\mathbb{E}_{s_{1}\sim d_{1},a_{1},...,s_{h_{k}},a_{h_{k}}\sim\pi_{k}}[\sum_{h=1}^{h_{k}}u^{k}_{h}(s_{h},a_{h})]
\displaystyle\leq 2β𝔼s1d1,a1,,shk,ahkπk[h=1hkϕ(sh,ah)(Λhk)1]\displaystyle 2\beta\mathbb{E}_{s_{1}\sim d_{1},a_{1},...,s_{h_{k}},a_{h_{k}}\sim\pi_{k}}[\sum_{h=1}^{h_{k}}\|\phi(s_{h},a_{h})\|_{(\Lambda^{k}_{h})^{-1}}]

where in the second inequality, we use the following fact

min{(w1k)ϕ(s1,a1),Hr1(s1,a1)u1k(s1,a1)}s2𝒮P1(s2|s1,a1)V2k(s2)\displaystyle\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1}),H-r_{1}(s_{1},a_{1})-u_{1}^{k}(s_{1},a_{1})\}-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})
\displaystyle=\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2}),H-r_{1}(s_{1},a_{1})-u_{1}^{k}(s_{1},a_{1})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})\}
\displaystyle\leq min{βϕ(s1,a1)(Λ1k)1,H}=u1k(s1,a1)\displaystyle\min\{\beta\|\phi(s_{1},a_{1})\|_{(\Lambda_{1}^{k})^{-1}},H\}=u_{1}^{k}(s_{1},a_{1})

Now we are ready to prove the following theorem restated from Theorem 4.1 in a more detailed version, where we include the guarantees during the execution of the algorithm.

Theorem C.6.

[Deployment Complexity] For arbitrary \varepsilon,\delta>0 and arbitrary c_{K}\geq 2, as long as N\geq c\Big{(}c_{K}\frac{H^{4c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\delta\varepsilon})\Big{)}^{\frac{1}{c_{K}-1}}, where c is an absolute constant independent of c_{K},d,H,\varepsilon,\delta, by choosing

K=cKdH+1.\displaystyle K=c_{K}dH+1. (8)

Algorithm 1 will terminate at iteration k_{H}\leq K and return a policy \pi^{k_{H}}, such that with probability 1-\delta: (1) \mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1})-V_{1}^{\pi^{k_{H}}}(s_{1})]\leq\varepsilon; (2) for each h\in[H-1], there exists an iteration k_{h} such that h_{k_{h}}=h but h_{k_{h}+1}=h+1, and \pi_{k_{h}} is an \varepsilon-optimal policy for the MDP truncated at step h.

Proof.

As stated in the theorem, we use k_{h} to denote the index of the deployment after which the algorithm switches the exploration from layer h to layer h+1, i.e. h_{k_{h}}=h and h_{k_{h}+1}=h+1. According to the definition and the algorithm, we must have \Delta_{k_{h}}\leq\frac{\varepsilon h_{k_{h}}}{2H}, and for arbitrary k_{h-1}+1\leq k\leq k_{h}-1, \Delta_{k}\geq\frac{\varepsilon h_{k}}{2H} (if k_{h-1}+1>k_{h}-1, then \Delta_{k_{h-1}+1} is already small enough, the algorithm directly switches the exploration to the next layer, and we can skip the discussion below). Therefore, for arbitrary k_{h-1}+1\leq k\leq k_{h}-1, during the k-th deployment, there exists h\in[h_{k}] such that,

ε2HΔkhk2βNn=1Nϕ(shkn,ahkn)(Λhk)12β1Nn=1Nϕ(shkn,ahkn)(Λhk)12\displaystyle\frac{\varepsilon}{2H}\leq\frac{\Delta_{k}}{h_{k}}\leq\frac{2\beta}{N}\sum_{n=1}^{N}\|\phi(s_{h}^{kn},a_{h}^{kn})\|_{(\Lambda^{k}_{h})^{-1}}\leq 2\beta\sqrt{\frac{1}{N}\sum_{n=1}^{N}\|\phi(s_{h}^{kn},a_{h}^{kn})\|^{2}_{(\Lambda^{k}_{h})^{-1}}}

where the second inequality holds because the average over h\in[h_{k}] is at most the maximum, and the third inequality follows from the Cauchy-Schwarz inequality. The above implies that

1Nn=1Nϕ(shkn,ahkn)(Λhk)12=1NTr((Λhk)1(n=1Nϕ(shkn,ahkn)ϕ(shkn,ahkn)))ε216H2β2\displaystyle\frac{1}{N}\sum_{n=1}^{N}\|\phi(s_{h}^{kn},a_{h}^{kn})\|^{2}_{(\Lambda^{k}_{h})^{-1}}=\frac{1}{N}Tr\Big{(}(\Lambda^{k}_{h})^{-1}\Big{(}\sum_{n=1}^{N}\phi(s_{h}^{kn},a_{h}^{kn})\phi(s_{h}^{kn},a_{h}^{kn})^{\top}\Big{)}\Big{)}\geq\frac{\varepsilon^{2}}{16H^{2}\beta^{2}} (9)

According to Lemma 4.2, there exist constants c,c^{\prime} such that, by choosing N according to Eq.(10) below, the event in Eq.(9) will not happen more than dc_{K} times at each layer h\in[H].

Nc(cKH4cK+1d3cKε2cKlog2cK(Hdεδ))1cK1c(cKH2cK+1dcKβ2cKε2cKlogcK(Hdβε))1cK1\displaystyle N\geq c\Big{(}c_{K}\frac{H^{4c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\varepsilon\delta})\Big{)}^{\frac{1}{c_{K}-1}}\geq c^{\prime}\Big{(}c_{K}\frac{H^{2c_{K}+1}d^{c_{K}}\beta^{2c_{K}}}{\varepsilon^{2c_{K}}}\log^{c_{K}}(\frac{Hd\beta}{\varepsilon})\Big{)}^{\frac{1}{c_{K}-1}} (10)

Recall that \varepsilon<1 and the covariance matrices in each layer are initialized by I_{d\times d}. Therefore, at the first deployment, although the computation of \pi^{1} does not consider the layers h\geq 2, Eq.(9) happens in each layer h\in[H]. We use \zeta(k,j) to denote the total number of times the event in Eq.(9) happens for layer j prior to deployment k; as a result,

khj=1hζ(kh,j)(h1)+hcKdh+1,h[H]\displaystyle k_{h}\leq\sum_{j=1}^{h}\zeta(k_{h},j)-(h-1)+h\leq c_{K}dh+1,\quad\forall h\in[H]

where we subtract h-1 because such an event must happen at the first deployment for each h\in[H] and we should not count it repeatedly; and we add another h back because there are h deployments whose samples are wasted (i.e. those k such that \Delta_{k}<\frac{\varepsilon h_{k}}{2H}). Therefore, we must have k_{H}\leq c_{K}dH+1=K.

Moreover, because at iteration k=k_{h} we have \Delta_{k_{h}}\leq\varepsilon/2, according to Hoeffding's inequality, with probability 1-\delta/2, for each deployment k we must have:

𝔼s1,a1,,shk,ahkπk[2βh=1hkϕ(sh,ah)(Λhk)1]Δk+2βH12Nlog(Kδ)\displaystyle\mathbb{E}_{s_{1},a_{1},...,s_{h_{k}},a_{h_{k}}\sim\pi_{k}}[2\beta\sum_{h=1}^{h_{k}}\|\phi(s_{h},a_{h})\|_{(\Lambda^{k}_{h})^{-1}}]\leq\Delta_{k}+2\beta H\sqrt{\frac{1}{2N}\log(\frac{K}{\delta})} (11)

Therefore, by choosing

N8β2H2ε2log(Kδ)=O(d2H4ε2log2(Kδ))\displaystyle N\geq\frac{8\beta^{2}H^{2}}{\varepsilon^{2}}\log(\frac{K}{\delta})=O(\frac{d^{2}H^{4}}{\varepsilon^{2}}\log^{2}(\frac{K}{\delta})) (12)

we must have,

\displaystyle\mathbb{E}_{s_{1},a_{1},...,s_{h},a_{h}\sim\pi_{k_{h}}}[2\beta\sum_{h^{\prime}=1}^{h}\|\phi(s_{h^{\prime}},a_{h^{\prime}})\|_{(\Lambda^{k_{h}}_{h^{\prime}})^{-1}}]\leq\Delta_{k_{h}}+\frac{\varepsilon}{2}\leq\varepsilon,\quad\forall h\in[H]

Therefore, after a combination of Eq.(10) and Eq.(12), we can conclude that, for arbitrary cK2c_{K}\geq 2 , there exists absolute constant cc, such that by choosing

Nc(cKH4cK+1d3cKε2cKlog2cK(Hdεδ))1cK1N\geq c\Big{(}c_{K}\frac{H^{4c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\varepsilon\delta})\Big{)}^{\frac{1}{c_{K}-1}}

the algorithm will stop at k_{H}\leq K, and with probability 1-\delta (on the event of \mathcal{E}_{1} in Lemma C.4 and the Hoeffding inequality above), we must have:

\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1})-V^{\pi_{k_{H}}}(s_{1})]\leq\mathbb{E}_{s_{1},a_{1},...,s_{H},a_{H}\sim\pi_{k_{H}}}[2\beta\sum_{h=1}^{H}\|\phi(s_{h},a_{h})\|_{(\Lambda^{k_{H}}_{h})^{-1}}]\leq\varepsilon

and an additional benefit is that, for each h\in[H-1], \pi_{k_{h}} is an \varepsilon-optimal policy in the MDP truncated at step h, or equivalently,

𝔼s1d1[V1(s1|h)V1πkh(s1|h)]ε.\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1}|h)-V_{1}^{\pi_{k_{h}}}(s_{1}|h)]\leq\varepsilon. (13)

C.4 Additional Safety Guarantee Brought with Layer-by-Layer Strategy

The layer-by-layer strategy brings another advantage: if we finish the exploration of the first h layers, then based on the samples collected so far, we can obtain a policy \widehat{\pi}_{|h} which is \varepsilon-optimal in the MDP truncated at step h, or equivalently:

\displaystyle J(\pi^{*})-J(\widehat{\pi}_{|h})\leq H-h+O(\varepsilon),\quad\forall h\in[H]

We formally state these guarantees in Theorem C.6 (a detailed version of Theorem 4.1), Theorem D.4 and Theorem E.9 (the formal version of Theorem 4.4). Such a property may be valuable in certain application scenarios. For example, in “Safe DE-RL”, which we will discuss in Appendix F, π^|h\widehat{\pi}_{|h} can be used as the pessimistic policy in Algorithm 7 and guarantee the monotonic policy improvement criterion. Besides, in some real-world settings, we may hope to maintain a sub-optimal but gradually improving policy before we complete the execution of the entire algorithm.

If we replace Lines 7-8 of LSVI-UCB (Algorithm 1 in Jin et al. (2019)) with Lines 13-18 of our Algorithm 1, a similar analysis can be done based on Lemma 4.2, and the same \Theta(dH) deployment complexity can be derived. However, the direct extension based on LSVI-UCB does not have the above safety guarantee. It is only guaranteed to return a near-optimal policy after K=\Theta(dH) deployments, but if we interrupt the algorithm after some k<K deployments, there is no guarantee about how good the best policy computable from the data collected so far would be.

Appendix D Reward-Free Deployment-Efficient RL with Deterministic Policies

D.1 Algorithm

Similar to other algorithms in the reward-free setting (Wang et al., 2020b; Jin et al., 2020), our algorithm includes an “Exploration Phase” to uniformly explore the entire MDP, and a “Planning Phase” to return a near-optimal policy given an arbitrary reward function. The crucial part is to collect a well-covered dataset in the online “exploration phase”, which is sufficient for the batch RL algorithm (Antos et al., 2008; Munos & Szepesvári, 2008; Chen & Jiang, 2019) in the offline “planning phase” to work.

Our algorithm in Alg.3 and Alg.4 is based on (Wang et al., 2020b) and the layer-by-layer strategy. The main difference from Algorithm 1 is two-fold. First, similar to (Wang et al., 2020b), we replace the reward function with 1/H of the bonus term. Secondly, we use a smaller threshold for \Delta_{k} compared with Algorithm 1.

1 Input: Failure probability δ>0\delta>0, and target accuracy ε>0\varepsilon>0, βcβdHlog(dHδ1ε1)\beta\leftarrow c_{\beta}\cdot dH\sqrt{\log(dH\delta^{-1}\varepsilon^{-1})} for some cβ>0c_{\beta}>0, total number of deployments KK, batch size NN
2 Initialize h1=1h_{1}=1
3 D1={},D2={},,DH={}D_{1}=\{\},D_{2}=\{\},...,D_{H}=\{\}
4 for k=1,2,,Kk=1,2,...,K do
5       Qhk+1k(,)0Q^{k}_{h_{k}+1}(\cdot,\cdot)\leftarrow 0 and Vhk+1k()=0V^{k}_{h_{k}+1}(\cdot)=0
6       for h=hk,hk1,,1h=h_{k},h_{k}-1,...,1 do
7             \Lambda^{k}_{h}\leftarrow I+\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}(\phi_{h}^{\tau n})^{\top}
8             uhk(,)min{βϕ(,)(Λhk)1ϕ(,),H}u_{h}^{k}(\cdot,\cdot)\leftarrow\min\{\beta\cdot\sqrt{\phi(\cdot,\cdot)^{\top}(\Lambda^{k}_{h})^{-1}\phi(\cdot,\cdot)},H\}
9             Define the exploration-driven reward function rhk(,)uhk(,)/Hr^{k}_{h}(\cdot,\cdot)\leftarrow u^{k}_{h}(\cdot,\cdot)/H
10             whk(Λhk)1τ=1k1n=1NϕhτnVh+1k(sh+1τn)w^{k}_{h}\leftarrow(\Lambda^{k}_{h})^{-1}\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\cdot V^{k}_{h+1}(s^{\tau n}_{h+1})
11             Qhk(,)min{(whk)ϕ(,)+rhk(,)+uhk(,),H}Q^{k}_{h}(\cdot,\cdot)\leftarrow\min\{(w^{k}_{h})^{\top}\phi(\cdot,\cdot)+r^{k}_{h}(\cdot,\cdot)+u^{k}_{h}(\cdot,\cdot),H\} and Vhk()=maxa𝒜Qhk(,a)V^{k}_{h}(\cdot)=\max_{a\in\mathcal{A}}Q^{k}_{h}(\cdot,a)
12             πhk()argmaxa𝒜Qhk(,a)\pi^{k}_{h}(\cdot)\leftarrow\arg\max_{a\in\mathcal{A}}Q^{k}_{h}(\cdot,a)
13       end for
14      Define πk=π1kπ2kπhkkunif[hk+1:H]\pi^{k}=\pi^{k}_{1}\circ\pi^{k}_{2}\circ...\pi^{k}_{h_{k}}\circ\mathrm{unif}_{[h_{k}+1:H]}
15       for n=1,,Nn=1,...,N do
16             Receive initial state s1knd1s_{1}^{kn}\sim d_{1}
17             for h=1,2,,Hh=1,2,...,H do
18                   Take action a^{kn}_{h}\leftarrow\pi^{k}(s_{h}^{kn}) and observe s_{h+1}^{kn}\sim P_{h}(\cdot|s^{kn}_{h},a_{h}^{kn})
19                   Dh=Dh{(shkn,ahkn)}D_{h}=D_{h}\bigcup\{(s_{h}^{kn},a_{h}^{kn})\}
20             end for
21            
22       end for
23      Compute Δk2βNn=1Nh=1hkϕ(shkn,ahkn)(Λhk)1ϕ(shkn,ahkn)\Delta_{k}\leftarrow\frac{2\beta}{N}\sum_{n=1}^{N}\sum_{h=1}^{h_{k}}\sqrt{\phi(s_{h}^{kn},a_{h}^{kn})^{\top}(\Lambda_{h}^{k})^{-1}\phi(s_{h}^{kn},a_{h}^{kn})}.
24       if Δk<εhk(4H+2)H\Delta_{k}<\frac{\varepsilon h_{k}}{(4H+2)H} then
25             if hk=Hh_{k}=H then  return D={D1,D2,,DH}D=\{D_{1},D_{2},...,D_{H}\} ;
26             else  hkhk+1h_{k}\leftarrow h_{k}+1 ;
27            
28       end if
29      
30 end for
Algorithm 3 Reward-Free DE-RL with Deterministic Policies in Linear MDPs: Exploration Phase
1 Input: Horizon length h~{\widetilde{h}}; Dataset 𝒟={(shkn,ahkn)k,n,h[K]×[N]×[h~]}\mathcal{D}=\{(s_{h}^{kn},a_{h}^{kn})_{k,n,h\in[K]\times[N]\times[{\widetilde{h}}]}\}, reward function r={rh}h[h~]r=\{r_{h}\}_{h\in[{\widetilde{h}}]}
2 Qh~+1(,)0Q_{{\widetilde{h}}+1}(\cdot,\cdot)\leftarrow 0 and Vh~+1()0V_{{\widetilde{h}}+1}(\cdot)\leftarrow 0
3 for h=h~,h~1,,1h={\widetilde{h}},{\widetilde{h}}-1,...,1 do
4       ΛhI+τ=1Kn=1Nϕ(shτn,ahτn)ϕ(shτn,ahτn)\Lambda_{h}\leftarrow I+\sum_{\tau=1}^{K}\sum_{n=1}^{N}\phi(s_{h}^{\tau n},a_{h}^{\tau n})\phi(s_{h}^{\tau n},a_{h}^{\tau n})^{\top}
5       Let uhplan(,)min{βϕ(,)(Λh)1ϕ(,),h~}u_{h}^{plan}(\cdot,\cdot)\leftarrow\min\{\beta\sqrt{\phi(\cdot,\cdot)^{\top}(\Lambda_{h})^{-1}\phi(\cdot,\cdot)},{\widetilde{h}}\}
6       w_{h}\leftarrow(\Lambda_{h})^{-1}\sum_{\tau=1}^{K}\sum_{n=1}^{N}\phi(s_{h}^{\tau n},a_{h}^{\tau n})\cdot V_{h+1}(s_{h+1}^{\tau n})
7       Qh(,)min{whϕ(,)+rh(,)+uhplan(,),h~}Q_{h}(\cdot,\cdot)\leftarrow\min\{w_{h}^{\top}\phi(\cdot,\cdot)+r_{h}(\cdot,\cdot)+u_{h}^{plan}(\cdot,\cdot),{\widetilde{h}}\} and Vh()=maxa𝒜Qh(,a)V_{h}(\cdot)=\max_{a\in\mathcal{A}}Q_{h}(\cdot,a)
8       πh()argmaxa𝒜Qh(,a)\pi_{h}(\cdot)\leftarrow\arg\max_{a\in\mathcal{A}}Q_{h}(\cdot,a)
9 end for
return πr|h~={πh}h[h~]\pi_{r|{\widetilde{h}}}=\{\pi_{h}\}_{h\in[{\widetilde{h}}]}, V^1(,r|h~):=V1()\widehat{V}_{1}(\cdot,r|{\widetilde{h}}):=V_{1}(\cdot)
Algorithm 4 Reward-Free DE-RL with Deterministic Policies in Linear MDPs: Planning Phase
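To make the quantities driving Algorithm 3 concrete, here is a minimal numpy sketch of the bonus u^{k}_{h}, the exploration-driven reward r^{k}_{h}=u^{k}_{h}/H (line 9), and the per-deployment uncertainty \Delta_{k} (line 23); the array shapes, the synthetic features, and the helper names are assumptions made only for illustration and are not part of the algorithm.

import numpy as np

def bonus_and_reward(phi_sa, Lambda_h, beta, H):
    # u_h(s,a) = min{beta * sqrt(phi^T Lambda_h^{-1} phi), H} and the
    # exploration-driven reward r_h(s,a) = u_h(s,a) / H (Alg 3, lines 8-9).
    u = min(beta * np.sqrt(phi_sa @ np.linalg.solve(Lambda_h, phi_sa)), H)
    return u, u / H

def deployment_uncertainty(Phi, Lambdas, beta):
    # Delta_k from Alg 3, line 23; Phi[n, h] is phi(s_h^{kn}, a_h^{kn}) for the
    # n-th trajectory at layer h (only layers 1..h_k are passed in).
    N, h_k, _ = Phi.shape
    total = 0.0
    for h in range(h_k):
        inv = np.linalg.inv(Lambdas[h])
        total += sum(np.sqrt(Phi[n, h] @ inv @ Phi[n, h]) for n in range(N))
    return 2.0 * beta * total / N

# Tiny synthetic example: d = 4 features, h_k = 3 layers, N = 5 trajectories.
rng = np.random.default_rng(0)
d, h_k, N, beta, H = 4, 3, 5, 2.0, 10
Phi = rng.normal(size=(N, h_k, d))
Phi /= np.linalg.norm(Phi, axis=-1, keepdims=True)      # enforce ||phi|| <= 1
Lambdas = [np.eye(d) + Phi[:, h].T @ Phi[:, h] for h in range(h_k)]
u, r = bonus_and_reward(Phi[0, 0], Lambdas[0], beta, H)
print("bonus u =", u, " exploration reward u/H =", r)
print("Delta_k =", deployment_uncertainty(Phi, Lambdas, beta))

When \Delta_{k} drops below the threshold \frac{\varepsilon h_{k}}{(4H+2)H}, Algorithm 3 moves the exploration frontier to the next layer.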

D.2 Analysis for Alg 3 and Alg 4

We first show a lemma adapted from Lemma C.4 for Alg 3. Since the proof is similar, we omit it here.

Lemma D.1 (Concentration for DE-RL in Reward-Free Setting).

We use \mathcal{E}_{2} to denote the event that, when running Algorithm 3, the following inequality holds for all k\in[K], h\in[h_{k}], and all V=V^{k}_{h+1} occurring in Alg 3 or V=V_{h} occurring in Alg 4:

τ=1k1n=1Nϕhτn(V(sh+1τn)s𝒮Ph(s|shτn,ahτn)V(s))(Λhk)1cdHlog(dKNH/δ)\displaystyle\Big{\|}\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\Big{(}V(s_{h+1}^{\tau n})-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s_{h}^{\tau n},a_{h}^{\tau n})V(s^{\prime})\Big{)}\Big{\|}_{(\Lambda^{k}_{h})^{-1}}\leq c\cdot dH\sqrt{\log(dKNH/\delta)}

Under Assumption A, there exists some absolute constant c\geq 0 such that P(\mathcal{E}_{2})\geq 1-\delta/2.

Proof.

The proof is almost identical to Lemma 3.1 in (Wang et al., 2020b), so we omit it here. The only difference is that we have an inner summation from n=1n=1 to NN and we truncate the horizon at hkh_{k} in iteration kk. ∎

Next, we prove a lemma similar to Lemma C.5 based on Lemma D.1.

Lemma D.2 (Overestimation).

On the event 2\mathcal{E}_{2} in Lemma D.1, which holds with probability 1δ/21-\delta/2, in Algorithm 3, for all k[K]k\in[K] and n[N]n\in[N],

V1(s1kn,rk|hk)V1k(s1kn)\displaystyle V^{*}_{1}(s_{1}^{kn},r^{k}|h_{k})\leq V_{1}^{k}(s_{1}^{kn})

and

𝔼s1d1[V1(s,rk|hk)]𝔼s1d1[V1k(s)](2H+1)𝔼s1d1[Vπk(s,rk|hk)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s,r^{k}|h_{k})]\leq\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{k}(s)]\leq(2H+1)\mathbb{E}_{s_{1}\sim d_{1}}[V^{\pi_{k}}(s,r^{k}|h_{k})]
Proof.

We first prove the overestimation inequality.

Overestimation

First of all, similar to the proof of Lemma 3.1 in (Wang et al., 2020b), on the event of \mathcal{E}_{2} defined in Lemma D.1, which holds with probability 1-\delta/2, we have:

|ϕ(s,a)whks𝒮Ph(s|s,a)Vh+1k(s)|βϕ(s,a)(Λhk)1,s,a𝒮×𝒜,k[K],h[hk]\displaystyle|\phi(s,a)^{\top}w_{h}^{k}-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s,a)V^{k}_{h+1}(s^{\prime})|\leq\beta\cdot\|\phi(s,a)\|_{(\Lambda^{k}_{h})^{-1}},\quad\forall s,a\in\mathcal{S}\times\mathcal{A},~{}k\in[K],~{}h\in[h_{k}] (14)

Then, we can use induction to show the overestimation. For h=hk+1h=h_{k}+1, we have:

0=Vhk+1(s,rk|hk)Vhk+1k(s)=0,s𝒮\displaystyle 0=V^{*}_{h_{k}+1}(s,r^{k}|h_{k})\leq V^{k}_{h_{k}+1}(s)=0,\quad\forall s\in\mathcal{S}

Suppose for some h[hk]h\in[h_{k}], we have

Vh+1(s,rk|hk)Vh+1k(s),s𝒮\displaystyle V^{*}_{h+1}(s,r^{k}|h_{k})\leq V^{k}_{h+1}(s),\quad\forall s\in\mathcal{S}

Then, s𝒮\forall s\in\mathcal{S}, we have

Vh(s,rk|hk)=\displaystyle V_{h}^{*}(s,r^{k}|h_{k})= maxa(rhk(s,a)+s𝒮Ph(s|s,a)Vh+1(s,rk|hk))\displaystyle\max_{a}(r_{h}^{k}(s,a)+\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s,a)V^{*}_{h+1}(s^{\prime},r^{k}|h_{k}))
\displaystyle\leq\min\{\max_{a}(r_{h}^{k}(s,a)+\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s,a)V^{k}_{h+1}(s^{\prime})),H\}
\displaystyle\leq min{maxa(rhk(s,a)+ϕ(s,a)whk+βϕ(s,a)(Λhk)1),H}\displaystyle\min\{\max_{a}(r_{h}^{k}(s,a)+\phi(s,a)^{\top}w^{k}_{h}+\beta\|\phi(s,a)\|_{(\Lambda^{k}_{h})^{-1}}),H\}
=\displaystyle= maxamin{rhk(s,a)+ϕ(s,a)whk+βϕ(s,a)(Λhk)1,H}\displaystyle\max_{a}\min\{r_{h}^{k}(s,a)+\phi(s,a)^{\top}w^{k}_{h}+\beta\|\phi(s,a)\|_{(\Lambda^{k}_{h})^{-1}},H\}
=\displaystyle= Vhk(s)\displaystyle V^{k}_{h}(s)

where in the last inequality, we apply Eq.(14).

Relationship between V1k()V_{1}^{k}(\cdot) and Vπk(,rk)V^{\pi_{k}}(\cdot,r^{k})

𝔼s1d1[V1k(s1)Vπk(s1,rk|hk)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{k}(s_{1})-V^{\pi_{k}}(s_{1},r^{k}|h_{k})]
=\displaystyle= 𝔼s1d1,a1πk[Q1k(s1,a1)Qπk(s1,a1,rk|hk)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[Q_{1}^{k}(s_{1},a_{1})-Q^{\pi_{k}}(s_{1},a_{1},r^{k}|h_{k})]
=\displaystyle= 𝔼s1d1,a1πk[min{(w1k)ϕ(s1,a1)+r1k(s1,a1)+u1k(s1,a1),H}\displaystyle\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1})+r^{k}_{1}(s_{1},a_{1})+u_{1}^{k}(s_{1},a_{1}),H\}
\displaystyle-r^{k}_{1}(s_{1},a_{1})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{\pi_{k}}_{2}(s_{2},r^{k}|h_{k})]
\displaystyle\leq\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1}),H-r_{1}^{k}(s_{1},a_{1})-u_{1}^{k}(s_{1},a_{1})\}-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})]
\displaystyle+\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{\pi_{k}}_{2}(s_{2},r^{k}|h_{k})]+\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[u^{k}_{1}(s_{1},a_{1})]
\displaystyle\leq\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V_{2}^{k}(s_{2})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{\pi_{k}}_{2}(s_{2},r^{k}|h_{k})]+2\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[u^{k}_{1}(s_{1},a_{1})]
\displaystyle=\mathbb{E}_{s_{1}\sim d_{1},a_{1},s_{2},a_{2}\sim\pi_{k}}[V_{2}^{k}(s_{2})-V^{\pi_{k}}_{2}(s_{2},r^{k}|h_{k})]+2\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim\pi_{k}}[u^{k}_{1}(s_{1},a_{1})]
\displaystyle\leq...
\displaystyle\leq 2\mathbb{E}_{s_{1}\sim d_{1},a_{1},...,s_{h_{k}},a_{h_{k}}\sim\pi_{k}}[\sum_{h=1}^{h_{k}}u_{h}^{k}(s_{h},a_{h})]
\displaystyle=2H\mathbb{E}_{s_{1}\sim d_{1}}[V^{\pi_{k}}(s_{1},r^{k}|h_{k})]

where in the first inequality, we add and subtract s2𝒮P1(s2|s1,a1)V2k(s2)\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V_{2}^{k}(s_{2}), and in the second inequality, we use the following fact

min{(w1k)ϕ(s1,a1),Hr1k(s1,a1)u1k(s1,a1)}s2𝒮P1(s2|s1,a1)Vk2(s2)\displaystyle\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1}),H-r_{1}^{k}(s_{1},a_{1})-u_{1}^{k}(s_{1},a_{1})\}-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})
=\displaystyle= min{(w1k)ϕ(s1,a1)s2𝒮P1(s2|s1,a1)Vk2(s2),Hr1k(s1,a1)u1k(s1,a1)s2𝒮P1(s2|s1,a1)Vk2(s2)}\displaystyle\min\{(w_{1}^{k})^{\top}\phi(s_{1},a_{1})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2}),H-r_{1}^{k}(s_{1},a_{1})-u_{1}^{k}(s_{1},a_{1})-\sum_{s_{2}\in\mathcal{S}}P_{1}(s_{2}|s_{1},a_{1})V^{k}_{2}(s_{2})\}
\displaystyle\leq min{βϕ(s1,a1)(Λ1k)1,H}=u1k(s1,a1)\displaystyle\min\{\beta\|\phi(s_{1},a_{1})\|_{(\Lambda_{1}^{k})^{-1}},H\}=u_{1}^{k}(s_{1},a_{1})

∎ Next, we provide some analysis for Algorithm 4, which will help us understand what we want to achieve in Algorithm 3.

Lemma D.3.

On the event \mathcal{E}_{2} in Lemma D.1, which holds with probability 1-\delta/2, if we assign {\widetilde{h}}=h_{k} in Algorithm 4 and assign \mathcal{D} to be the samples collected up to deployment k, i.e. \mathcal{D}=\{(s_{h}^{\tau n},a_{h}^{\tau n})_{\tau,n,h\in[k]\times[N]\times[h_{k}]}\}, then for an arbitrary reward function r satisfying the linear Assumption A, the policy \pi_{r|{\widetilde{h}}} returned by Alg 4 satisfies:

𝔼s1d1[V1πr|h~(s1,r|h~)V1πr|h~(s1,r|h~)]2H𝔼s1d1[V1πrplan|h~(s1,rplan|h~)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|\widetilde{h}}^{*}}(s_{1},r|\widetilde{h})-V_{1}^{\pi_{r|\widetilde{h}}}(s_{1},r|\widetilde{h})]\leq 2H\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r^{plan}|{\widetilde{h}}}^{*}}(s_{1},r^{plan}|\widetilde{h})] (15)

where rplan:=uplan/h~r^{plan}:=u^{plan}/{\widetilde{h}}.

Proof.

By applying a similar technique to the analysis of \mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{k}(s_{1})-V^{\pi_{k}}(s_{1},r^{k}|\widetilde{h})] in Lemma D.2 after replacing r^{k} with r, we have:

𝔼s1d1[V1πr|h~(s1,r|h~)V1πr|h~(s1,r|h~)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi^{*}_{r|\widetilde{h}}}(s_{1},r|\widetilde{h})-V_{1}^{\pi_{r|\widetilde{h}}}(s_{1},r|\widetilde{h})]\leq 𝔼s1d1[V^1(s1,r|h~)V1πr|h~(s1,r|h~)]2𝔼s1d1[V1πr|h~(s1,uplan)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[\widehat{V}_{1}(s_{1},r|\widetilde{h})-V_{1}^{\pi_{r|\widetilde{h}}}(s_{1},r|\widetilde{h})]\leq 2\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|\widetilde{h}}}(s_{1},u^{plan})]

where \widehat{V}_{1} denotes the value function returned by Alg 4. Besides,

2𝔼s1d1[V1πr|h~(s1,uplan)]=\displaystyle 2\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|{\widetilde{h}}}}(s_{1},u^{plan})]= 2h~𝔼s1d1[V1πr|h~(s1,rplan|h~)]2H𝔼s1d1[V1πrplan|h~(s1,rplan|h~)]\displaystyle 2{\widetilde{h}}\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|{\widetilde{h}}}}(s_{1},r^{plan}|{\widetilde{h}})]\leq 2H\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r^{plan}|{\widetilde{h}}}^{*}}(s_{1},r^{plan}|{\widetilde{h}})]

then, we finish the proof. ∎

From Eq.(15) in Lemma D.3, we can see that, after exploring with Algorithm 3, the sub-optimality gap between π\pi^{*} and π\pi returned by Alg.4 can be bounded by the value of the optimal policy w.r.t. rKr^{K}, which we will further bound in the next theorem.

Now we are ready to prove the main theorem.

Theorem D.4.

For arbitrary ε,δ>0\varepsilon,\delta>0, by assigning K=cKdH+1K=c_{K}dH+1 for some cK2c_{K}\geq 2, as long as

\displaystyle N\geq c\Big{(}c_{K}\frac{H^{6c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\delta\varepsilon})\Big{)}^{\frac{1}{c_{K}-1}} (16)

where c is an absolute constant independent of c_{K},d,H,\varepsilon,\delta, then Alg 3 will terminate at iteration k_{H}\leq K and return a dataset D=\{D_{1},D_{2},...,D_{H}\}, such that given an arbitrary reward function r satisfying Assumption A, by running Alg 4 with D and r, with probability 1-\delta, we can obtain a policy \pi_{r} satisfying \mathbb{E}_{s\sim d_{1}}[V_{1}^{*}(s,r)-V^{\pi_{r}}(s,r)]\leq\varepsilon.

Moreover, for each h\in[H-1], there exists an iteration k_{h} such that h_{k_{h}}=h but h_{k_{h}+1}=h+1, and if we run Alg 4 with the reward function r and the dataset Alg 3 has collected up to k=k_{h}, we can obtain a policy \pi_{r|h}, which is an \varepsilon-optimal policy for the MDP truncated at step h.

Proof.

The proof is similar to Theorem 4.1. As stated in the theorem, we use k_{h} to denote the index of the deployment after which the algorithm switches the exploration from layer h to layer h+1, i.e. h_{k_{h}}=h and h_{k_{h}+1}=h+1. According to the definition and the algorithm, we must have \Delta_{k_{h}}\leq\frac{\varepsilon h_{k_{h}}}{(4H+2)H}, and for arbitrary k_{h-1}+1\leq k\leq k_{h}-1, we must have \Delta_{k}\geq\frac{\varepsilon h_{k}}{(4H+2)H} (if k_{h-1}+1>k_{h}-1, then \Delta_{k_{h-1}+1} is already small enough, the algorithm directly switches the exploration to the next layer, and we can skip the discussion below). Therefore, for arbitrary k_{h-1}+1\leq k\leq k_{h}-1, during the k-th deployment, there exists h\in[h_{k}] such that,

ε(4H+2)HΔkhk2βNn=1Nϕ(shkn,ahkn)(Λkh)12β1Nn=1Nϕ(shkn,ahkn)2(Λkh)1\displaystyle\frac{\varepsilon}{(4H+2)H}\leq\frac{\Delta_{k}}{h_{k}}\leq\frac{2\beta}{N}\sum_{n=1}^{N}\|\phi(s_{h}^{kn},a_{h}^{kn})\|_{(\Lambda^{k}_{h})^{-1}}\leq 2\beta\sqrt{\frac{1}{N}\sum_{n=1}^{N}\|\phi(s_{h}^{kn},a_{h}^{kn})\|^{2}_{(\Lambda^{k}_{h})^{-1}}}

which implies that

1Nn=1Nϕ(shkn,ahkn)2(Λkh)1ε216H2(2H+1)2β2\displaystyle\frac{1}{N}\sum_{n=1}^{N}\|\phi(s_{h}^{kn},a_{h}^{kn})\|^{2}_{(\Lambda^{k}_{h})^{-1}}\geq\frac{\varepsilon^{2}}{16H^{2}(2H+1)^{2}\beta^{2}} (17)

According to Lemma 4.2, there exists an absolute constant c such that, for arbitrary \varepsilon<1, by choosing N according to Eq.(18) below, the event in Eq.(17) will not happen more than c_{K}d times at each layer h\in[H].

Nc(cKH6cK+1d3cKε2cKlog2cK(Hdεδ))1cK1\displaystyle N\geq c\Big{(}c_{K}\frac{H^{6c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\varepsilon\delta})\Big{)}^{\frac{1}{c_{K}-1}} (18)

We use \zeta(k,j) to denote the total number of times the event in Eq.(17) happens for layer j up to deployment k_{h}. With a similar discussion as in the proof of Theorem C.6, we have:

khj=1hζ(kh,j)(h1)+hcKdh+1,h[H]\displaystyle k_{h}\leq\sum_{j=1}^{h}\zeta(k_{h},j)-(h-1)+h\leq c_{K}dh+1,\quad\forall h\in[H]

Moreover, we must have \Delta_{k_{h}}\leq\frac{\varepsilon}{4H+2} for each h\in[H], and according to Hoeffding's inequality, with probability 1-\delta/2, for each deployment k we must have

𝔼s1,a1,,sh,ahπkh[2βh=1hϕ(sh,ah)(Λkh)1]Δkh+2βH12Nlog(Kδ)\displaystyle\mathbb{E}_{s_{1},a_{1},...,s_{h},a_{h}\sim\pi_{k_{h}}}[2\beta\sum_{h^{\prime}=1}^{h}\|\phi(s_{h^{\prime}},a_{h^{\prime}})\|_{(\Lambda^{k}_{h^{\prime}})^{-1}}]\leq\Delta_{k_{h}}+2\beta H\sqrt{\frac{1}{2N}\log(\frac{K}{\delta})}

Therefore, by choosing

N8β2H2(2H+1)2ε2log(Kδ)=O(d2H6ε2log2(Kδ))\displaystyle N\geq\frac{8\beta^{2}H^{2}(2H+1)^{2}}{\varepsilon^{2}}\log(\frac{K}{\delta})=O(\frac{d^{2}H^{6}}{\varepsilon^{2}}\log^{2}(\frac{K}{\delta})) (19)

we have,

\displaystyle\mathbb{E}_{s_{1},a_{1},...,s_{h},a_{h}\sim\pi_{k_{h}}}[2\beta\sum_{h^{\prime}=1}^{h}\|\phi(s_{h^{\prime}},a_{h^{\prime}})\|_{(\Lambda^{k_{h}}_{h^{\prime}})^{-1}}]\leq\Delta_{k_{h}}+\frac{\varepsilon}{4H+2}\leq\frac{\varepsilon}{2H+1}

For arbitrary h[H]h\in[H], in Algorithm 4, if we assign h~=h{\widetilde{h}}=h and 𝒟={(shkn,ahkn)k,n,h[kh]×[N]×[h]}\mathcal{D}=\{(s_{h}^{kn},a_{h}^{kn})_{k,n,h\in[k_{h}]\times[N]\times[h]}\}, note that rplan=rkhr^{plan}=r^{k_{h}}, by applying Lemma D.2 and Lemma D.3 we have:

𝔼s1d1[V1πr|h~(s1,r|h~)V1πr|h~(s1,r|h~)]2H𝔼s1d1[V1πrplan|h~(s1,rplan|h~)]\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|\widetilde{h}}^{*}}(s_{1},r|\widetilde{h})-V_{1}^{\pi_{r|\widetilde{h}}}(s_{1},r|\widetilde{h})]\leq 2H\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r^{plan}|{\widetilde{h}}}^{*}}(s_{1},r^{plan}|\widetilde{h})]
=\displaystyle= 2H𝔼s1d1[V1(s1,rkh|h)]2H(2H+1)𝔼s1d1[Vπkh(s1,rkh|h)]\displaystyle 2H\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{*}(s_{1},r^{k_{h}}|h)]\leq 2H(2H+1)\mathbb{E}_{s_{1}\sim d_{1}}[V^{\pi_{k_{h}}}(s_{1},r^{k_{h}}|h)]
=\displaystyle= (2H+1)𝔼s1,a1,,sh,ahπkh[2βh=1hϕ(sh,ah)(Λkh)1]ε\displaystyle(2H+1)\mathbb{E}_{s_{1},a_{1},...,s_{h},a_{h}\sim\pi_{k_{h}}}[2\beta\sum_{h^{\prime}=1}^{h}\|\phi(s_{h^{\prime}},a_{h^{\prime}})\|_{(\Lambda^{k}_{h^{\prime}})^{-1}}]\leq\varepsilon

Therefore, after a combination of Eq.(18) and Eq.(19), we can conclude that, for arbitrary cK2c_{K}\geq 2 , there exists absolute constant cc, such that by choosing

Nc(cKH6cK+1d3cKε2cKlog2cK(Hdδε))1cK1N\geq c\Big{(}c_{K}\frac{H^{6c_{K}+1}d^{3c_{K}}}{\varepsilon^{2c_{K}}}\log^{2c_{K}}(\frac{Hd}{\delta\varepsilon})\Big{)}^{\frac{1}{c_{K}-1}}

Alg 3 will terminate at kHKk_{H}\leq K, and with probability 1δ1-\delta (on the event in Lemma D.1 and Hoeffding inequality above), for each h[H]h\in[H], if we feed Alg 4 with h~=h{\widetilde{h}}=h, 𝒟={(shkn,ahkn)k,n,h[kh]×[N]×[h]}\mathcal{D}=\{(s_{h}^{kn},a_{h}^{kn})_{k,n,h\in[k_{h}]\times[N]\times[h]}\} and arbitrary linear reward function rr, the policy πr|h\pi_{r|h} returned by Alg 4 should satisfy:

𝔼s1d1[V1πr|h(s1,r|h)]𝔼s1d1[V1πr|h(s1,r|h)]ε\displaystyle\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|h}}(s_{1},r|h)]\geq\mathbb{E}_{s_{1}\sim d_{1}}[V_{1}^{\pi_{r|h}^{*}}(s_{1},r|h)]-\varepsilon

Appendix E DE-RL with Arbitrary Deployed Policies

In the proofs for this section, without loss of generality, we assume the initial state is fixed, which makes the notation and derivation simpler without trivializing the results. For the case where the initial state is sampled from some fixed distribution, our algorithms and results can be extended simply by considering the concentration error related to the initial state distribution.

E.1 Algorithms

1 Input: Time step h{h}; Dataset in previous steps {D1,,Dh1}\{D_{1},...,D_{{h}-1}\}; Unregularized Covariance Matrices {Σ1,Σh1}\{\Sigma_{1},...\Sigma_{{h}-1}\}; Bonus factor β{\beta^{\prime}}; Matrix to construct reward function ΣR\Sigma_{R}; Discretize resolution ε012d(N+1){\varepsilon_{0}}\leq\frac{1}{2d(N+1)}
2
3R(,)ϕ(,)(2I+ΣR)1ϕ(,)R(\cdot,\cdot)\leftarrow\sqrt{\phi(\cdot,\cdot)^{\top}(2I+\Sigma_{R})^{-1}\phi(\cdot,\cdot)}
4 ZhDiscretization((2I+ΣR)1,ε024d),R¯(,)ϕ(,)Zhϕ(,)Z_{h}\leftarrow{\rm Discretization}((2I+\Sigma_{R})^{-1},\frac{{\varepsilon_{0}}^{2}}{4d}),~{}\bar{R}(\cdot,\cdot)\leftarrow\sqrt{\phi(\cdot,\cdot)^{\top}Z_{h}\phi(\cdot,\cdot)}
5
6Qh(,)=R(,),Vh()=maxaQh(,a)Q_{h}(\cdot,\cdot)=R(\cdot,\cdot),\quad V_{h}(\cdot)=\max_{a}Q_{h}(\cdot,a), Q¯h(,)=R¯(,),V¯h()=maxaQ¯h(,a)\bar{Q}_{h}(\cdot,\cdot)=\bar{R}(\cdot,\cdot),\quad\bar{V}_{h}(\cdot)=\max_{a}\bar{Q}_{h}(\cdot,a)
7 π¯h()=argmaxaQ¯h(,a)\bar{\pi}_{h}(\cdot)=\arg\max_{a}\bar{Q}_{h}(\cdot,a)
8 for h~=h1,,1{\widetilde{h}}={h}-1,...,1 do
9       wh~Σh~1(sh~,ah~,sh~+1)Dh~ϕ(sh~,ah~)Vh~+1(sh~+1)w_{\widetilde{h}}\leftarrow\Sigma_{\widetilde{h}}^{-1}\sum_{(s_{\widetilde{h}},a_{\widetilde{h}},s_{{\widetilde{h}}+1})\in D_{\widetilde{h}}}\phi(s_{\widetilde{h}},a_{\widetilde{h}})\cdot V_{{\widetilde{h}}+1}(s_{{\widetilde{h}}+1})
10       uh~:=β[ϕ(,)Σh~1ϕ(,)]1/2u_{\widetilde{h}}:={\beta^{\prime}}[\phi(\cdot,\cdot)^{\top}\Sigma_{\widetilde{h}}^{-1}\phi(\cdot,\cdot)]^{1/2}
11       Qh~(,)min{wh~ϕ(,)+uh~,1}Q_{\widetilde{h}}(\cdot,\cdot)\leftarrow\min\{w_{\widetilde{h}}^{\top}\phi(\cdot,\cdot)+u_{\widetilde{h}},1\}, Vh~()maxaQh~(,a)V_{\widetilde{h}}(\cdot)\leftarrow\max_{a}Q_{\widetilde{h}}(\cdot,a)
12       w¯h~Discretization(wh~,ε02d)\bar{w}_{\widetilde{h}}\leftarrow{\rm Discretization}(w_{\widetilde{h}},\frac{{\varepsilon_{0}}}{2d}), Zh~Discretization(β2Σh~1,ε024d)Z_{\widetilde{h}}\leftarrow{\rm Discretization}({\beta^{\prime}}^{2}\Sigma_{\widetilde{h}}^{-1},\frac{{\varepsilon_{0}}^{2}}{4d})
13       u¯h~:=[ϕ(,)Zh~ϕ(,)]1/2\bar{u}_{\widetilde{h}}:=[\phi(\cdot,\cdot)^{\top}Z_{\widetilde{h}}\phi(\cdot,\cdot)]^{1/2}
14       Q¯h~(,)min{w¯h~ϕ(,)+u¯h~,1}\bar{Q}_{\widetilde{h}}(\cdot,\cdot)\leftarrow\min\{\bar{w}_{\widetilde{h}}^{\top}\phi(\cdot,\cdot)+\bar{u}_{\widetilde{h}},1\}, V¯h~()maxaQ¯h~(,a)\bar{V}_{\widetilde{h}}(\cdot)\leftarrow\max_{a}\bar{Q}_{\widetilde{h}}(\cdot,a)
15       π¯h~()argmaxaQ¯h~(,a)\bar{\pi}_{\widetilde{h}}(\cdot)\leftarrow\arg\max_{a}\bar{Q}_{\widetilde{h}}(\cdot,a)
16 end for
return V1(s1),π¯:=π¯1π¯2π¯hV_{1}(s_{1}),\bar{\pi}:=\bar{\pi}_{1}\circ\bar{\pi}_{2}\circ...\bar{\pi}_{h}
Algorithm 5 SolveOptQ
1 Input: Time step h{h}; Dataset in previous steps {D1,,Dh1}\{D_{1},...,D_{{h}-1}\}; Covariance Matrices {Σ1,Σh1}\{\Sigma_{1},...\Sigma_{{h}-1}\}; Deterministic Policy to evaluate π¯={π¯1,π¯2,,π¯h}{\bar{\pi}}=\{{\bar{\pi}}_{1},{\bar{\pi}}_{2},...,{\bar{\pi}}_{h}\}
2 Initialize a zero matrix Λ~πh=O\widetilde{\Lambda}^{\pi}_{h}=O
3 for i=1,2,,di=1,2,...,d do
4       for j=i,i+1,,dj=i,i+1,...,d do
5             Define R~ij\widetilde{R}^{ij}, such that, R~ijh(,)=1+ϕi(,)ϕj(,)2\widetilde{R}^{ij}_{h}(\cdot,\cdot)=\frac{1+\phi_{i}(\cdot,\cdot)\phi_{j}(\cdot,\cdot)}{2} and R~ijh~=0\widetilde{R}^{ij}_{{\widetilde{h}}}=0 for all h~[h1]{\widetilde{h}}\in[h-1];
6             Q^π¯h(,)=R~ijh(,)\widehat{Q}^{\bar{\pi}}_{h}(\cdot,\cdot)=\widetilde{R}^{ij}_{h}(\cdot,\cdot), V^π¯h()=Q^π¯h(,π¯h())\widehat{V}^{\bar{\pi}}_{h}(\cdot)=\widehat{Q}^{\bar{\pi}}_{h}(\cdot,{\bar{\pi}}_{h}(\cdot))
7             for h~=h1,,1{\widetilde{h}}={h}-1,...,1 do
8                   \widehat{w}^{\bar{\pi}}_{\widetilde{h}}\leftarrow\Sigma_{\widetilde{h}}^{-1}\sum_{(s_{\widetilde{h}},a_{\widetilde{h}},s_{{\widetilde{h}}+1})\in D_{\widetilde{h}}}\phi(s_{\widetilde{h}},a_{\widetilde{h}})\cdot\widehat{V}^{\bar{\pi}}_{{\widetilde{h}}+1}(s_{{\widetilde{h}}+1})
9                   Q^π¯h~(,)min{R~ijh~(,)+(w^π¯h~)ϕ(,),1}\widehat{Q}^{\bar{\pi}}_{\widetilde{h}}(\cdot,\cdot)\leftarrow\min\{\widetilde{R}^{ij}_{{\widetilde{h}}}(\cdot,\cdot)+(\widehat{w}^{\bar{\pi}}_{\widetilde{h}})^{\top}\phi(\cdot,\cdot),1\}
10                   V^π¯h~()=Q^π¯h~(,π¯h~())\widehat{V}^{\bar{\pi}}_{\widetilde{h}}(\cdot)=\widehat{Q}^{\bar{\pi}}_{\widetilde{h}}(\cdot,{\bar{\pi}}_{\widetilde{h}}(\cdot))
11             end for
12            (\widetilde{\Lambda}^{\bar{\pi}}_{h})_{ij}\leftarrow\widehat{V}^{\bar{\pi}}_{1}(s_{1});\quad(\widetilde{\Lambda}^{\bar{\pi}}_{h})_{ji}\leftarrow\widehat{V}^{\bar{\pi}}_{1}(s_{1})
13       end for
14      
15 end for
16 \widehat{\Lambda}_{h}^{{\bar{\pi}}}=2\widetilde{\Lambda}^{{\bar{\pi}}}_{h}-\textbf{1}
return \widehat{\Lambda}_{h}^{{\bar{\pi}}}
Algorithm 6 EstimateCovMatrix

We first introduce the definition for Discretization\mathrm{Discretization} function:

Definition E.1 (Discretization\mathrm{Discretization} function).

Given vector w=(w1,w2,,wd)dw=(w_{1},w_{2},...,w_{d})^{\top}\in\mathbb{R}^{d} or matrix Σ=(Σij)i,j[d]d×d\Sigma=(\Sigma_{ij})_{i,j\in[d]}\in\mathbb{R}^{d\times d} as input, we have:

Discretization(w,ε0)=(ε0w1ε0,ε0w2ε0,,ε0wdε0),Discretization(Σ,ε0)=(ε0Σijε0)i,j[d]\displaystyle{\rm Discretization}(w,{\varepsilon_{0}})=({\varepsilon_{0}}\lceil\frac{w_{1}}{{\varepsilon_{0}}}\rceil,{\varepsilon_{0}}\lceil\frac{w_{2}}{{\varepsilon_{0}}}\rceil,...,{\varepsilon_{0}}\lceil\frac{w_{d}}{{\varepsilon_{0}}}\rceil),\quad{\rm Discretization}(\Sigma,{\varepsilon_{0}})=({\varepsilon_{0}}\lceil\frac{\Sigma_{ij}}{{\varepsilon_{0}}}\rceil)_{i,j\in[d]}

where \lceil\cdot\rceil is the ceiling function.
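A minimal numpy sketch of the \mathrm{Discretization} function in Definition E.1 follows (the Python function name and the example inputs are our own choices for illustration):

import numpy as np

def discretization(x, eps0):
    # Definition E.1: round every entry of a vector or matrix up to the next
    # multiple of eps0 (elementwise ceiling), so the entrywise error is < eps0
    # and hence ||Discretization(w, eps0) - w|| <= d * eps0.
    x = np.asarray(x, dtype=float)
    return eps0 * np.ceil(x / eps0)

w = np.array([0.123, -0.456, 0.789])
print(discretization(w, 0.1))               # [ 0.2 -0.4  0.8]
print(discretization(np.eye(2) * 0.33, 0.25))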

In Algorithm 6, we estimate the expected covariance matrix under policy {\bar{\pi}} by policy evaluation. The basic idea is that the expected covariance matrix can be represented as:

𝔼sh,ahπ¯[ϕ(sh,ah)ϕ(sh,ah)]=(𝔼π¯[ϕi(sh,ah)ϕj(sh,ah)])ij=(Vπ¯(s1,Rij))ij\displaystyle\mathbb{E}_{s_{h},a_{h}\sim{\bar{\pi}}}[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}]=\big{(}\mathbb{E}_{{\bar{\pi}}}[\phi_{i}(s_{h},a_{h})\phi_{j}(s_{h},a_{h})]\big{)}_{ij}=\big{(}V^{\bar{\pi}}(s_{1},R^{ij})\big{)}_{ij}

where we use (a_{ij})_{ij} to denote a matrix whose element in row i and column j is a_{ij}. In other words, the element of the covariance matrix indexed by ij equals the value function of policy {\bar{\pi}} with R_{{h}}^{ij}(s_{h},a_{h}):=\phi_{i}(s_{h},a_{h})\phi_{j}(s_{h},a_{h}) as the reward function at the last layer (and zero reward at previous layers), where \phi_{i} denotes the i-th element of the vector \phi. Because the techniques rely on the reward being non-negative and bounded in [0,1], by leveraging the fact that |\phi_{i}(\cdot,\cdot)|\leq\|\phi(\cdot,\cdot)\|\leq 1, we shift and scale R^{ij} to obtain \widetilde{R}^{ij} and use it for policy evaluation.
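To make the shift-and-scale step concrete, the following numpy sketch checks the identity \mathbb{E}[\phi_{i}\phi_{j}]=2\mathbb{E}[\widetilde{R}^{ij}]-1 on synthetic features; the direct Monte-Carlo averaging here is only a stand-in for the least-squares policy evaluation actually performed in Alg 6, and the random features are an illustrative assumption.

import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 3, 20000

# Synthetic stand-in for features phi(s_h, a_h) observed under the policy pi-bar.
phi = rng.normal(size=(n_samples, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)        # ||phi|| <= 1

true_cov = phi.T @ phi / n_samples                       # E[phi phi^T]

# Entry (i, j) recovered from the shifted/scaled reward R~_ij = (1 + phi_i phi_j)/2 in [0, 1].
est_cov = np.empty((d, d))
for i in range(d):
    for j in range(d):
        value_ij = np.mean((1.0 + phi[:, i] * phi[:, j]) / 2.0)  # "value" of R~_ij
        est_cov[i, j] = 2.0 * value_ij - 1.0                     # undo the shift and scale
print("max entrywise error:", np.max(np.abs(est_cov - true_cov)))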

In Alg 5, we maintain two Q functions Q_{h} and \bar{Q}_{h}. The learning of Q_{h} is based on LSVI-UCB, while \bar{Q}_{h} is a “discretized version” of Q_{h} computed by discretizing w_{h},{\beta^{\prime}}^{2}\Sigma_{h}^{-1} (or (2I+\Sigma_{R})^{-1} at layer {h}) elementwise with resolution {\varepsilon_{0}}, and \bar{Q}_{h} will be used to compute {\bar{\pi}}_{h} for deployment. The main reason why we discretize Q_{h} is to make sure the number of possible greedy policies {\bar{\pi}} is bounded, so that we can apply a union bound and upper bound the error when using Alg 6 to estimate the covariance matrix. In Section E.5, we will analyze the error resulting from discretization, and we will upper bound the estimation error of Alg. 6.

E.2 Function Classes and ε0{\varepsilon_{0}}-Cover

We first introduce some useful function classes and their ε0{\varepsilon_{0}}-cover.

Notation for Value Function Classes and Policy Classes

We first introduce some new notations for value and policy classes. Similar to Eq.(6) in (Jin et al., 2019), we define the greedy value function class

𝒱L,B={V()|V()=maxamin{ϕ(,a)w+ϕ(,a)Σϕ(,a),1},wL,ΣB}\displaystyle\mathcal{V}_{L,B}^{*}=\{V(\cdot)|V(\cdot)=\max_{a}\min\{\phi(\cdot,a)^{\top}w+\sqrt{\phi(\cdot,a)^{\top}\Sigma\phi(\cdot,a)},1\},\|w\|\leq L,\|\Sigma\|\leq B\}

and the Q function class:

𝒬L,B={Q(,)|Q(,)=min{ϕ(,)w+ϕ(,)Σϕ(,),1},wL,ΣB}\displaystyle{\mathcal{Q}}_{L,B}=\{Q(\cdot,\cdot)|Q(\cdot,\cdot)=\min\{\phi(\cdot,\cdot)^{\top}w+\sqrt{\phi(\cdot,\cdot)^{\top}\Sigma\phi(\cdot,\cdot)},1\},\|w\|\leq L,\|\Sigma\|\leq B\}

Besides, suppose we have a deterministic policy class \Pi with finitely many candidates (i.e. |\Pi|<\infty); we use \mathcal{V}_{L,B}\times\Pi to denote:

𝒱L,B×Π={V()|V()=min{ϕ(,π())w+ϕ(,π())Σϕ(,π()),1},wL,ΣB,πΠ}\displaystyle\mathcal{V}_{L,B}\times\Pi=\{V(\cdot)|V(\cdot)=\min\{\phi(\cdot,\pi(\cdot))^{\top}w+\sqrt{\phi(\cdot,\pi(\cdot))^{\top}\Sigma\phi(\cdot,\pi(\cdot))},1\},\|w\|\leq L,\|\Sigma\|\leq B,\pi\in\Pi\}

Recall that in Alg.6, we will use a special reward function, and we need to consider it in the union bound. We denote:

𝒱ϕ×Π={V|V()=1+ϕi(,π())ϕj(,π())2,i,j[d],πΠ}\displaystyle\mathcal{V}_{\phi}\times\Pi=\{V|V(\cdot)=\frac{1+\phi_{i}(\cdot,\pi(\cdot))\phi_{j}(\cdot,\pi(\cdot))}{2},i,j\in[d],\pi\in\Pi\}

and easy to check |𝒱ϕ×Π|=d2|Π||\mathcal{V}_{\phi}\times\Pi|=d^{2}|\Pi|.

Moreover, if we have a Q function class 𝒬{\mathcal{Q}}, we will use Π𝒬\Pi_{\mathcal{Q}} to denote the class of greedy policies induced from 𝒬{\mathcal{Q}}, i.e.

Π𝒬:={π()=argmaxQ(,a)|Q𝒬}.\displaystyle\Pi_{\mathcal{Q}}:=\{\pi(\cdot)=\arg\max Q(\cdot,a)|Q\in{\mathcal{Q}}\}.

Discretization with Resolution ε0{\varepsilon_{0}}

In the following, we will use 𝒞w,L,ε0\mathcal{C}_{w,L,{\varepsilon_{0}}} to denote the ε0{\varepsilon_{0}}-cover for wdw\in\mathbb{R}^{d} with wL\|w\|\leq L, concretely,

𝒞w,L,ε0={w||wiε0|[Lε0],i[d]}\displaystyle\mathcal{C}_{w,L,{\varepsilon_{0}}}=\{w||\frac{w_{i}}{{\varepsilon_{0}}}|\in[\lceil\frac{L}{{\varepsilon_{0}}}\rceil],\forall i\in[d]\}

where \lceil\cdot\rceil is the ceiling function.

Similarly, we will use 𝒞Σ,B,ε0\mathcal{C}_{\Sigma,B,{\varepsilon_{0}}} to denote the ε0{\varepsilon_{0}}-cover for matrix Σd×d\Sigma\in\mathbb{R}^{d\times d} with maxi,j|Σij|B\max_{i,j}|\Sigma_{ij}|\leq B

𝒞Σ,B,ε0={Σ||Σijε0|[Bε0],i,j[d]}.\displaystyle\mathcal{C}_{\Sigma,B,{\varepsilon_{0}}}=\{\Sigma||\frac{\Sigma_{ij}}{{\varepsilon_{0}}}|\in[\lceil\frac{B}{{\varepsilon_{0}}}\rceil],\forall i,j\in[d]\}.

Easy to check that:

log|𝒞w,L,ε0|dlog2Lε0,log|𝒞Σ,B,ε0|d2log2Bε0\displaystyle\log|\mathcal{C}_{w,L,{\varepsilon_{0}}}|\leq d\log\frac{2L}{{\varepsilon_{0}}},\quad\log|\mathcal{C}_{\Sigma,B,{\varepsilon_{0}}}|\leq d^{2}\log\frac{2B}{{\varepsilon_{0}}}

Recall the definition of Discretize\mathrm{Discretize} function in Def. E.1, easy to check that:

Discretize(w,ε0)wdε0,Discretize(Σ,ε0)ΣDiscretize(Σ,ε0)ΣFdε0\displaystyle\|{\rm Discretize}(w,{\varepsilon_{0}})-w\|\leq d{\varepsilon_{0}},\|{\rm Discretize}(\Sigma,{\varepsilon_{0}})-\Sigma\|\leq\|{\rm Discretize}(\Sigma,{\varepsilon_{0}})-\Sigma\|_{F}\leq d{\varepsilon_{0}}

ε0{\varepsilon_{0}}-cover

Before we introduce our notations for ε0{\varepsilon_{0}}-net, we first show a useful lemma:

Lemma E.2.

For arbitrary w,Σw,\Sigma, denote w¯=Discretize(w,ε02d)\bar{w}=\mathrm{Discretize}(w,\frac{{\varepsilon_{0}}}{2d}) and Σ¯=Discretize(Σ,ε024d)\bar{\Sigma}=\mathrm{Discretize}(\Sigma,\frac{{\varepsilon_{0}}^{2}}{4d}). Consider the following two functions and their greedy policies, where ϕ(,)1\|\phi(\cdot,\cdot)\|\leq 1

Q(s,a)=min{wϕ(,a)+ϕ(,a)Σϕ(,a),1},π=argmaxaQ(s,a)\displaystyle Q(s,a)=\min\{w^{\top}\phi(\cdot,a)+\sqrt{\phi(\cdot,a)^{\top}\Sigma\phi(\cdot,a)},1\},\quad\pi=\arg\max_{a}Q(s,a)
Q¯(s,a)=min{w¯ϕ(,a)+ϕ(,a)Σ¯ϕ(,a),1},π¯=argmaxaQ¯(s,a)\displaystyle\bar{Q}(s,a)=\min\{\bar{w}^{\top}\phi(\cdot,a)+\sqrt{\phi(\cdot,a)^{\top}\bar{\Sigma}\phi(\cdot,a)},1\},\quad{\bar{\pi}}=\arg\max_{a}\bar{Q}(s,a)

then we have:

|Q(s,π(s))Q(s,π¯(s))|2ε0,s𝒮,\displaystyle|Q(s,\pi(s))-Q(s,\bar{\pi}(s))|\leq 2{\varepsilon_{0}},\quad\forall s\in\mathcal{S},\quad
QQ¯ε0,sups|maxaQ(,a)maxaQ¯(,a)|ε0\displaystyle\|Q-\bar{Q}\|_{\infty}\leq{\varepsilon_{0}},\quad\sup_{s}|\max_{a}Q(\cdot,a)-\max_{a}\bar{Q}(\cdot,a)|\leq{\varepsilon_{0}}
Proof.

After similar derivation as Eq.(28) in (Jin et al., 2019), we can show that

sups|maxaQ(,a)maxaQ¯(,a)|\displaystyle\sup_{s}|\max_{a}Q(\cdot,a)-\max_{a}\bar{Q}(\cdot,a)|\leq QQ¯\displaystyle\|Q-\bar{Q}\|_{\infty}
\displaystyle\leq sups,a|wϕ(,)+ϕ(,)Σϕ(,)w¯ϕ(,)ϕ(,)Σ¯ϕ(,)|\displaystyle\sup_{s,a}\Big{|}w^{\top}\phi(\cdot,\cdot)+\sqrt{\phi(\cdot,\cdot)^{\top}\Sigma\phi(\cdot,\cdot)}-\bar{w}^{\top}\phi(\cdot,\cdot)-\sqrt{\phi(\cdot,\cdot)^{\top}\bar{\Sigma}\phi(\cdot,\cdot)}\Big{|}
\displaystyle\leq ww¯+ΣΣ¯Fdε02d+dε024dε0\displaystyle\|w-\bar{w}\|+\sqrt{\|\Sigma-\bar{\Sigma}\|_{F}}\leq d\frac{{\varepsilon_{0}}}{2d}+\sqrt{d\frac{{\varepsilon_{0}}^{2}}{4d}}\leq{\varepsilon_{0}}

Because π\pi and π¯\bar{\pi} are greedy policies, we have:

0\displaystyle 0\leq Q(s,π(s))Q(s,π¯(s))Q(s,π(s))Q¯(s,π(s))+Q¯(s,π¯(s))Q(s,π¯(s))\displaystyle Q(s,\pi(s))-Q(s,\bar{\pi}(s))\leq Q(s,\pi(s))-\bar{Q}(s,\pi(s))+\bar{Q}(s,\bar{\pi}(s))-Q(s,\bar{\pi}(s))
\displaystyle\leq 2QQ¯2ε0.\displaystyle 2\|Q-\bar{Q}\|_{\infty}\leq 2{\varepsilon_{0}}.
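As a quick numerical check of Lemma E.2, the sketch below discretizes a random (w,\Sigma) pair at the stated resolutions and verifies that the resulting Q functions differ by at most {\varepsilon_{0}}; the random instance and the ridge term added to keep \Sigma positive definite are assumptions made only for illustration.

import numpy as np

def q_value(phi, w, Sigma):
    # Q(s,a) = min{w^T phi + sqrt(phi^T Sigma phi), 1} from Lemma E.2.
    return min(w @ phi + np.sqrt(phi @ Sigma @ phi), 1.0)

def discretization(x, eps0):
    return eps0 * np.ceil(np.asarray(x, dtype=float) / eps0)

rng = np.random.default_rng(2)
d, eps0 = 5, 0.05
w = 0.3 * rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T / (4 * d) + 0.1 * np.eye(d)              # PSD, so the square roots exist
w_bar = discretization(w, eps0 / (2 * d))                # resolution eps0 / (2d)
Sigma_bar = discretization(Sigma, eps0 ** 2 / (4 * d))   # resolution eps0^2 / (4d)

gaps = []
for _ in range(1000):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))                 # ||phi|| <= 1
    gaps.append(abs(q_value(phi, w, Sigma) - q_value(phi, w_bar, Sigma_bar)))
print(f"max |Q - Q_bar| = {max(gaps):.4f} <= eps0 = {eps0}")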

Now, we consider the following Q function class and V function class,

𝒬¯L,B,ε0:={Q|Q(,)=min{wϕ(,)+ϕ(,)Σϕ(,),1},w𝒞w,L,ε02d,Σ𝒞Σ,dB,ε024d}\displaystyle\bar{\mathcal{Q}}_{L,B,{\varepsilon_{0}}}:=\{Q|Q(\cdot,\cdot)=\min\{w^{\top}\phi(\cdot,\cdot)+\sqrt{\phi(\cdot,\cdot)^{\top}\Sigma\phi(\cdot,\cdot)},1\},w\in\mathcal{C}_{w,L,\frac{{\varepsilon_{0}}}{2d}},\Sigma\in\mathcal{C}_{\Sigma,dB,\frac{{\varepsilon_{0}}^{2}}{4d}}\}
𝒱¯L,B,ε0:={V|V()=maxamin{wϕ(,a)+ϕ(,a)Σϕ(,a),1},w𝒞w,L,ε02d,Σ𝒞Σ,dB,ε024d}\displaystyle\bar{\mathcal{V}}^{*}_{L,B,{\varepsilon_{0}}}:=\{V|V(\cdot)=\max_{a}\min\{w^{\top}\phi(\cdot,a)+\sqrt{\phi(\cdot,a)^{\top}\Sigma\phi(\cdot,a)},1\},w\in\mathcal{C}_{w,L,\frac{{\varepsilon_{0}}}{2d}},\Sigma\in\mathcal{C}_{\Sigma,dB,\frac{{\varepsilon_{0}}^{2}}{4d}}\}

based on Lemma E.2, and another important fact that maxi,j|aij|AFdA\max_{i,j}|a_{ij}|\leq\|A\|_{F}\leq d\|A\|, we know that 𝒬¯L,B,ε0\bar{\mathcal{Q}}_{L,B,{\varepsilon_{0}}} is an ε0{\varepsilon_{0}}-cover of 𝒬L,B{\mathcal{Q}}_{L,B}, i.e. for arbitrary Q𝒬L,BQ\in{\mathcal{Q}}_{L,B}, there exists Q¯𝒬¯L,B,ε0\bar{Q}\in\bar{\mathcal{Q}}_{L,B,{\varepsilon_{0}}}, such that QQ¯ε0\|Q-\bar{Q}\|\leq{\varepsilon_{0}}. Similarly, 𝒱¯L,B,ε0\bar{\mathcal{V}}^{*}_{L,B,{\varepsilon_{0}}} is also an ε0{\varepsilon_{0}}-cover of 𝒱L,B\mathcal{V}^{*}_{L,B}.

Besides, we will use ΠQ¯L,B,ε0\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}} to denote the collection of greedy policy induced from elements in 𝒬¯L,B,ε0\bar{\mathcal{Q}}_{L,B,{\varepsilon_{0}}}.

We also define 𝒱¯L,B,ε0×Π\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\Pi, which is an ε0{\varepsilon_{0}} cover for 𝒱L,B×Π\mathcal{V}_{L,B}\times\Pi.

𝒱¯L,B,ε0×Π:={V|V()=min{ϕ(,π())w+ϕ(,π())Σϕ(,π()),1},w𝒞w,L,ε02d,Σ𝒞Σ,dB,ε024d,πΠ}\displaystyle\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\Pi:=\{V|V(\cdot)=\min\{\phi(\cdot,\pi(\cdot))^{\top}w+\sqrt{\phi(\cdot,\pi(\cdot))^{\top}\Sigma\phi(\cdot,\pi(\cdot))},1\},w\in\mathcal{C}_{w,L,\frac{{\varepsilon_{0}}}{2d}},\Sigma\in\mathcal{C}_{\Sigma,dB,\frac{{\varepsilon_{0}}^{2}}{4d}},\pi\in\Pi\}

Obviously, |𝒱¯L,B,ε0×Π|=|Π||𝒞w,L,ε02d||𝒞Σ,dB,ε024d||\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\Pi|=|\Pi|\cdot|\mathcal{C}_{w,L,\frac{{\varepsilon_{0}}}{2d}}|\cdot|\mathcal{C}_{\Sigma,dB,\frac{{\varepsilon_{0}}^{2}}{4d}}|.

Besides, because 𝒱ϕ×Π\mathcal{V}_{\phi}\times\Pi is already a finite set, it is an ε0{\varepsilon_{0}}-cover of itself.

E.3 Constraints in Advance

Induction Condition Related to Accumulative Error

Recall the induction condition in 4.5, which we will use throughout this section.

Constraints for the Validity of the Algorithm

Besides, in order to make sure the algorithm can run successfully, we add the following constraints:

\displaystyle\Sigma_{R}\succeq-\frac{1}{2}I (20)
\displaystyle Z_{\widetilde{h}}\succeq 0,\quad\forall{\widetilde{h}}\in[{h}-1] (21)
\displaystyle I\succeq Z_{h}\succeq 0 (22)

where constraint (22) for Z_{h} is to make sure the reward \bar{R} lies in the [0,1] interval.

Recall the definition of Σh~=I+n=1Nϕ(sh~,ah~)ϕ(sh~,ah~)\Sigma_{\widetilde{h}}=I+\sum_{n=1}^{N}\phi(s_{\widetilde{h}},a_{\widetilde{h}})\phi(s_{\widetilde{h}},a_{\widetilde{h}})^{\top}, therefore,

σmin(β2Σh~1)=β2/σmax(Σh~)β21+N\displaystyle\sigma_{\min}({\beta^{\prime}}^{2}\Sigma_{\widetilde{h}}^{-1})={\beta^{\prime}}^{2}/\sigma_{\max}(\Sigma_{\widetilde{h}})\geq\frac{{\beta^{\prime}}^{2}}{1+N}

According to Lemma E.11, to make sure Zh~0Z_{\widetilde{h}}\succeq 0, we need the following constraint on ε0{\varepsilon_{0}}

dε024dβ2(N+1)\displaystyle d\frac{{\varepsilon_{0}}^{2}}{4d}\leq\frac{{\beta^{\prime}}^{2}}{(N+1)} (23)

which holds as long as {\varepsilon_{0}}\leq{\beta^{\prime}}/\sqrt{N+1}.

As for ZhZ_{h}, the constraint is equivalent to:

2I+ΣR(1+ε024)I,(2I+ΣR)1ε024I\displaystyle 2I+\Sigma_{R}\succeq(1+\frac{{\varepsilon_{0}}^{2}}{4})I,\quad(2I+\Sigma_{R})^{-1}\succeq\frac{{\varepsilon_{0}}^{2}}{4}I

and can be rewritten to

I+ΣRε024I,4ε02I2I+ΣR\displaystyle I+\Sigma_{R}\succeq\frac{{\varepsilon_{0}}^{2}}{4}I,\quad\frac{4}{{\varepsilon_{0}}^{2}}I\succeq 2I+\Sigma_{R} (24)
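The following small numpy sketch illustrates constraint (22): when (24) holds (here \Sigma_{R} is taken positive semi-definite, so (24) is satisfied for small {\varepsilon_{0}}), the discretized matrix Z_{h} indeed stays between 0 and I; the random \Sigma_{R} is an illustrative assumption.

import numpy as np

def discretization(x, eps0):
    return eps0 * np.ceil(np.asarray(x, dtype=float) / eps0)

rng = np.random.default_rng(3)
d, eps0 = 4, 0.05
A = rng.normal(size=(d, d))
Sigma_R = A @ A.T / d                                    # PSD, so (24) holds for small eps0
Z_h = discretization(np.linalg.inv(2 * np.eye(d) + Sigma_R), eps0 ** 2 / (4 * d))
eigs = np.linalg.eigvalsh(Z_h)
print("0 <= Z_h:", bool(eigs.min() >= 0), "  Z_h <= I:", bool(eigs.max() <= 1))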

E.4 Concentration Bound

Based on the notations above, we are ready to claim that:

Claim 3.

By choosing L=dN,B=β2+dε0L=\sqrt{dN},B={\beta^{\prime}}^{2}+d{\varepsilon_{0}} for some ε01/d{\varepsilon_{0}}\leq 1/d, during the running of Algorithm 2

  • In Alg 5, for each h[H]{h}\in[H] and h~[h]{\widetilde{h}}\in[{h}], and the Q¯h~\bar{Q}_{\widetilde{h}} and π¯h~{\bar{\pi}}_{\widetilde{h}} generated while running the algorithm, we must have Q¯h~Q¯L,B,ε0\bar{Q}_{\widetilde{h}}\in\bar{Q}_{L,B,{\varepsilon_{0}}} and π¯h~ΠQ¯L,B,ε0{\bar{\pi}}_{\widetilde{h}}\in\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}. Besides, Qh~𝒬L,BQ_{\widetilde{h}}\in{\mathcal{Q}}_{L,B} and Vh~𝒱L,BV_{\widetilde{h}}\in\mathcal{V}_{L,B}^{*}

  • In Alg 6, for all h~[h]{\widetilde{h}}\in[h], we have π¯h~ΠQ¯L,B,ε0{\bar{\pi}}_{\widetilde{h}}\in\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}, and for arbitrary value function V^π¯h~\widehat{V}^{\bar{\pi}}_{\widetilde{h}} generated in it, we have V^π¯h~(𝒱L,B×ΠQ¯L,B,ε0)(𝒱ϕ×ΠQ¯L,B,ε0)\widehat{V}^{\bar{\pi}}_{\widetilde{h}}\in(\mathcal{V}_{L,B}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})\cup(\mathcal{V}_{\phi}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}) and therefore, there exists V(𝒱¯L,B,ε0×ΠQ¯L,B,ε0)(𝒱ϕ×ΠQ¯L,B,ε0)V\in(\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})\cup(\mathcal{V}_{\phi}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}) such that VV^π¯h~ε0\|V-\widehat{V}^{\bar{\pi}}_{\widetilde{h}}\|\leq{\varepsilon_{0}}

Proof.

Algorithm 5: First we bound the norm of the weights whw_{h} in Algorithm 5. For arbitrary vdv\in\mathbb{R}^{d} and v=1\|v\|=1, we have:

|vwh~|=\displaystyle|v^{\top}w_{\widetilde{h}}|= |vΣh~1(sh~,ah~,sh~+1)Dh~ϕ(sh~,ah~)Vh~+1(sh~+1)||vΣh~1(sh~,ah~,sh~+1)Dh~ϕ(sh~,ah~)|\displaystyle|v\Sigma_{\widetilde{h}}^{-1}\sum_{(s_{\widetilde{h}},a_{\widetilde{h}},s_{{\widetilde{h}}+1})\in D_{\widetilde{h}}}\phi(s_{\widetilde{h}},a_{\widetilde{h}})V_{{\widetilde{h}}+1}(s_{{\widetilde{h}}+1})|\leq|v\Sigma_{\widetilde{h}}^{-1}\sum_{(s_{\widetilde{h}},a_{\widetilde{h}},s_{{\widetilde{h}}+1})\in D_{\widetilde{h}}}\phi(s_{\widetilde{h}},a_{\widetilde{h}})|
\displaystyle\leq |(sh~,ah~,sh~+1)Dh~vΣh~1v||(sh~,ah~,sh~+1)Dh~ϕ(sh~,ah~)Σh~1ϕ(sh~,ah~)|\displaystyle\sqrt{|\sum_{(s_{\widetilde{h}},a_{\widetilde{h}},s_{{\widetilde{h}}+1})\in D_{\widetilde{h}}}v^{\top}\Sigma_{\widetilde{h}}^{-1}v||\sum_{(s_{\widetilde{h}},a_{\widetilde{h}},s_{{\widetilde{h}}+1})\in D_{\widetilde{h}}}\phi(s_{\widetilde{h}},a_{\widetilde{h}})^{\top}\Sigma_{\widetilde{h}}^{-1}\phi(s_{\widetilde{h}},a_{\widetilde{h}})|}
\displaystyle\leq vd|Dh~|=vdN\displaystyle\|v\|\sqrt{d|D_{\widetilde{h}}|}=\|v\|\sqrt{dN}

therefore, wh~dN\|w_{\widetilde{h}}\|\leq\sqrt{dN}. Besides, according to Lemma E.11 and constraint (24), we have:

\displaystyle\|{\beta^{\prime}}^{2}\Sigma_{\widetilde{h}}^{-1}\|\leq{\beta^{\prime}}^{2},\quad\|Z_{\widetilde{h}}\|\leq\|{\beta^{\prime}}^{2}\Sigma_{\widetilde{h}}^{-1}\|+d{\varepsilon_{0}}\leq{\beta^{\prime}}^{2}+d{\varepsilon_{0}},\quad\forall{\widetilde{h}}\in[{h}-1]
\displaystyle\|(2I+\Sigma_{R})^{-1}\|\leq 1,\quad\|Z_{h}\|\leq\|(2I+\Sigma_{R})^{-1}\|+d{\varepsilon_{0}}\leq 1+d{\varepsilon_{0}}

Recall B=β2+dε0B={\beta^{\prime}}^{2}+d{\varepsilon_{0}} and β>1{\beta^{\prime}}>1, the claim about Alg 5 is true.

Algorithm 6: The discussion about the value range of \widehat{w}^{\bar{\pi}}_{\widetilde{h}} is similar to the above. Therefore, all the value functions occurring in the previous {h}-1 layers belong to \mathcal{V}_{L,B}\times\Pi, except that the one at the last layer belongs to \mathcal{V}_{\phi}\times\Pi. Besides, since Alg 6 is only used to evaluate the policies returned by Alg 5, we should have \Pi=\Pi_{\bar{\mathcal{Q}}_{L,B,{\varepsilon_{0}}}}. As a result, the claim for Alg 6 is correct. ∎

Recall Lemma D.4 from (Jin et al., 2019), which holds for arbitrary 𝒱\mathcal{V} with covering number 𝒩ε0\mathcal{N}_{\varepsilon_{0}} and sups|V(s)|H\sup_{s}|V(s)|\leq H. Next, we state a slightly generalized version by replacing sups|V(s)|H\sup_{s}|V(s)|\leq H with sups|V(s)|1\sup_{s}|V(s)|\leq 1, since in our Alg. 5 and Alg. 6 Vmax=1V_{\max}=1. This is also the main reason why we only need the coefficient of the bonus term β=O~(d){\beta^{\prime}}=\widetilde{O}(d) instead of O~(dH)\widetilde{O}(dH) in Alg. 5 so that we can achieve better dependence on HH.

Lemma E.3 ((Revised) Lemma D.4 in (Jin et al., 2019)).

Let {sτ}τ=1\{s_{\tau}\}_{\tau=1}^{\infty}, be a stochastic process on state space 𝒮\mathcal{S} with corresponding filtration {τ}τ=1\{\mathcal{F}_{\tau}\}_{\tau=1}^{\infty}. Let {ϕτ}τ=0\{\phi_{\tau}\}_{\tau=0}^{\infty} be an d\mathbb{R}^{d}-valued stochastic process where ϕττ1\phi_{\tau}\in\mathcal{F}_{\tau-1} and ϕτ1\|\phi_{\tau}\|\leq 1. Let Λt=λI+τ=1tϕτϕτ\Lambda_{t}=\lambda I+\sum_{\tau=1}^{t}\phi_{\tau}\phi_{\tau}^{\top}. Then for any δ>0\delta>0, with probability at least 1δ1-\delta, for all t0t\geq 0, and any V𝒱V\in\mathcal{V} so that sups|V(s)|1\sup_{s}|V(s)|\leq 1, we have:

τ=1tϕτ{V(sτ)𝔼[V(xτ)|τ1]}2Λt14[d2logt+λλ+log𝒩ε0δ]+8t2ε02λ\displaystyle\Big{\|}\sum_{\tau=1}^{t}\phi_{\tau}\{V(s_{\tau})-\mathbb{E}[V(x_{\tau})|\mathcal{F}_{\tau-1}]\}\Big{\|}^{2}_{\Lambda_{t}^{-1}}\leq 4[\frac{d}{2}\log\frac{t+\lambda}{\lambda}+\log\frac{\mathcal{N}_{\varepsilon_{0}}}{\delta}]+\frac{8t^{2}{\varepsilon_{0}}^{2}}{\lambda}

where \mathcal{N}_{\varepsilon_{0}} is the {\varepsilon_{0}}-covering number of \mathcal{V} w.r.t. the distance {\rm dist}(V,V^{\prime})=\sup_{s}|V(s)-V^{\prime}(s)|

Now we are ready to prove the main concentration result for Algorithm 2:

Theorem E.4.

Consider the value function class \mathcal{V}:=\mathcal{V}^{*}_{L,B}\cup(\mathcal{V}_{L,B}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})\cup(\mathcal{V}_{\phi}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}) with L=\sqrt{dN},B={\beta^{\prime}}^{2}+d{\varepsilon_{0}}. According to Claim 3, \mathcal{V} covers all possible value functions occurring when running Alg 2. We use \mathcal{E} to denote the event that the following inequality holds for arbitrary V\in\mathcal{V} and arbitrary k\in[K]=[H],h\in[H]:

τ=1k1n=1Nϕhτn(V(sh~+1τn)s𝒮Ph(s|shτn,ahτn)V(s))(Σh)1cdlog(dNβH/ε0δ)\displaystyle\Big{\|}\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\Big{(}V(s_{{\widetilde{h}}+1}^{\tau n})-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s_{h}^{\tau n},a_{h}^{\tau n})V(s^{\prime})\Big{)}\Big{\|}_{(\Sigma_{h})^{-1}}\leq c\cdot d\sqrt{\log(dN{\beta^{\prime}}H/{\varepsilon_{0}}\delta)}

As long as

ε01/N\displaystyle{\varepsilon_{0}}\leq 1/N (25)

there exists some constant cc, such that P()1δ/2P(\mathcal{E})\geq 1-\delta/2.

Proof.

We consider the value function class:

𝒱:=𝒱L,B(𝒱L,B×ΠQ¯L,B,ε0)(𝒱ϕ×ΠQ¯L,B,ε0)\displaystyle\mathcal{V}:=\mathcal{V}^{*}_{L,B}\cup(\mathcal{V}_{L,B}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})\cup(\mathcal{V}_{\phi}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})

and we have an ε0{\varepsilon_{0}}-cover for it, which we denote as:

𝒱ε0:=𝒱¯L,B,ε0(𝒱¯L,B,ε0×ΠQ¯L,B,ε0)(𝒱ϕ×ΠQ¯L,B,ε0)\displaystyle\mathcal{V}_{\varepsilon_{0}}:=\bar{\mathcal{V}}^{*}_{L,B,{\varepsilon_{0}}}\cup(\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})\cup(\mathcal{V}_{\phi}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}})

Besides, there exists c>0c^{\prime}>0, s.t.

log|𝒱ε0|\displaystyle\log|\mathcal{V}_{\varepsilon_{0}}|\leq log|𝒱ϕ×ΠQ¯L,B,ε0|+log|𝒱¯L,B,ε0|+log|𝒱¯L,B,ε0×ΠQ¯L,B,ε0|\displaystyle\log|\mathcal{V}_{\phi}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}|+\log|\bar{\mathcal{V}}^{*}_{L,B,{\varepsilon_{0}}}|+\log|\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\Pi_{\bar{Q}_{L,B,{\varepsilon_{0}}}}|
\displaystyle\leq log|𝒱ϕ×Q¯L,B,ε0|+log|𝒱¯L,B,ε0|+log|𝒱¯L,B,ε0×Q¯L,B,ε0|\displaystyle\log|\mathcal{V}_{\phi}\times\bar{Q}_{L,B,{\varepsilon_{0}}}|+\log|\bar{\mathcal{V}}^{*}_{L,B,{\varepsilon_{0}}}|+\log|\bar{\mathcal{V}}_{L,B,{\varepsilon_{0}}}\times\bar{Q}_{L,B,{\varepsilon_{0}}}|
\displaystyle\leq cd2logdHNβε0\displaystyle c^{\prime}d^{2}\log\frac{dHN{\beta^{\prime}}}{{\varepsilon_{0}}}

By plugging this into Lemma E.3 and taking a union bound over k\in[K] and h\in[H] (note that K=H), we have, with probability 1-\delta/2,

\Big\|\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\Big(V(s_{h+1}^{\tau n})-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s_{h}^{\tau n},a_{h}^{\tau n})V(s^{\prime})\Big)\Big\|^{2}_{(\Lambda^{k}_{h})^{-1}}\leq c^{\prime\prime}d^{2}\log\frac{dHN{\beta^{\prime}}}{{\varepsilon_{0}}\delta}+8N^{2}{\varepsilon_{0}}^{2}

When {\varepsilon_{0}}\leq\frac{1}{N}, the first term dominates, and there must exist a constant c such that

\Big\|\sum_{\tau=1}^{k-1}\sum_{n=1}^{N}\phi_{h}^{\tau n}\Big(V(s_{h+1}^{\tau n})-\sum_{s^{\prime}\in\mathcal{S}}P_{h}(s^{\prime}|s_{h}^{\tau n},a_{h}^{\tau n})V(s^{\prime})\Big)\Big\|_{(\Lambda^{k}_{h})^{-1}}\leq c\cdot d\sqrt{\log(dN{\beta^{\prime}}H/{\varepsilon_{0}}\delta)} ∎

E.5 Bias Analysis

Lemma E.5 (Overestimation in Alg 5).

Suppose we choose

β=cβdlogdHNε0δ\displaystyle{\beta^{\prime}}=c_{\beta^{\prime}}^{\prime}d\sqrt{\log\frac{dHN}{{\varepsilon_{0}}\delta}} (26)

for some c_{\beta^{\prime}}>0. During the running of Alg 2, under the induction condition 4.5 and on the event \mathcal{E} in Theorem E.4, which holds with probability 1-\delta/2, for arbitrary {h}\in[H] and {\widetilde{h}}\leq{h}-1, the parameter w_{\widetilde{h}} and value function V_{{\widetilde{h}}+1} occurring in Algorithm 5 satisfy:

|ϕ(s,a)wh~s𝒮Ph~(s|s,a)Vh~+1(s)|βϕ(s,a)Σh~1\displaystyle|\phi(s,a)^{\top}w_{\widetilde{h}}-\sum_{s^{\prime}\in\mathcal{S}}P_{\widetilde{h}}(s^{\prime}|s,a)V_{{\widetilde{h}}+1}(s^{\prime})|\leq{\beta^{\prime}}\|\phi(s,a)\|_{\Sigma_{\widetilde{h}}^{-1}}

and

Vh~(s)Vh~(s)Vh~(s)+𝔼π[h=h~hβϕ(sh,ah)Σh1]Vh~(s)+βξ\displaystyle V^{*}_{\widetilde{h}}(s)\leq V_{\widetilde{h}}(s)\leq V^{*}_{\widetilde{h}}(s)+\mathbb{E}_{\pi}[\sum_{h^{\prime}={\widetilde{h}}}^{h}{\beta^{\prime}}\|\phi(s_{h^{\prime}},a_{h^{\prime}})\|_{\Sigma_{h^{\prime}}^{-1}}]\leq V^{*}_{\widetilde{h}}(s)+{\beta^{\prime}}\xi
Proof.

The proof is mainly based on Theorem E.4; the steps are similar to Lemma B.3 in (Jin et al., 2019) and Lemma 3.1 in (Wang et al., 2020b), so we omit them here. ∎

Lemma E.6 (Bias Accumulation in Alg 5).

Under the induction condition 4.5 and on the event \mathcal{E} in Theorem E.4, which holds with probability 1-\delta/2, if

ε0βξ2H\displaystyle{\varepsilon_{0}}\leq\frac{{\beta^{\prime}}\xi}{2H} (27)

in Algorithm 5, for arbitrary R¯\bar{R} generated, we have:

V1(s1;R¯)V1π¯(s1;R¯)3βξ\displaystyle V_{1}^{*}(s_{1};\bar{R})-V_{1}^{{\bar{\pi}}}(s_{1};\bar{R})\leq 3{\beta^{\prime}}\xi

where we recall that V^{\pi}(s;\bar{R}) denotes the value function with \bar{R} as the reward function.

Proof.

We will use \pi_{h}(\cdot):=\arg\max_{a}Q_{h}(\cdot,a) to denote the optimal policy w.r.t. the Q function without discretization, although we do not deploy it in practice. According to Lemma E.2, we have \max_{s}|Q_{h^{\prime}}(s,\pi(s))-Q_{h^{\prime}}(s,{\bar{\pi}}(s))|\leq 2{\varepsilon_{0}} for arbitrary h^{\prime}\in[{h}].

Recall that π¯=π¯1π¯2π¯h\bar{\pi}=\bar{\pi}_{1}\circ\bar{\pi}_{2}...\circ\bar{\pi}_{h}

V1(s1)V1π¯(s1)V1(s1)V1π¯(s1)\displaystyle V_{1}^{*}(s_{1})-V_{1}^{{\bar{\pi}}}(s_{1})\leq V_{1}(s_{1})-V_{1}^{{\bar{\pi}}}(s_{1}) (Lemma E.5; Overestimation)
=\displaystyle= Q1(s1,π1(s1))Q1π¯(s1,π¯1(s1))Q1(s1,π¯1(s1))Q1π¯(s1,π¯1(s1))+2ε0\displaystyle Q_{1}(s_{1},\pi_{1}(s_{1}))-Q_{1}^{\bar{\pi}}(s_{1},{\bar{\pi}}_{1}(s_{1}))\leq Q_{1}(s_{1},{\bar{\pi}}_{1}(s_{1}))-Q_{1}^{\bar{\pi}}(s_{1},{\bar{\pi}}_{1}(s_{1}))+2{\varepsilon_{0}} (Lemma E.2)
= \mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim{\bar{\pi}}_{1}}[\min\{\phi(s_{1},a_{1})^{\top}w_{1}+u_{1}(s_{1},a_{1}),H\}-P_{2}V_{2}(s_{1},a_{1})+P_{2}V_{2}(s_{1},a_{1})-P_{2}V_{2}^{{\bar{\pi}}}(s_{1},a_{1})]+2{\varepsilon_{0}}
\displaystyle\leq 𝔼s1d1,a1π¯1,s2P2(|s1,a1)[V2(s2)V2π¯(s2)]+2β𝔼s1d1,a1π¯1[ϕ(s1,a1)Σ11]+2ε0\displaystyle\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim{\bar{\pi}}_{1},s_{2}\sim P_{2}(\cdot|s_{1},a_{1})}[V_{2}(s_{2})-V_{2}^{{\bar{\pi}}}(s_{2})]+2{\beta^{\prime}}\mathbb{E}_{s_{1}\sim d_{1},a_{1}\sim{\bar{\pi}}_{1}}[\|\phi(s_{1},a_{1})\|_{\Sigma_{1}^{-1}}]+2{\varepsilon_{0}}
\displaystyle\leq \displaystyle...
\displaystyle\leq 2β𝔼s1d1,a1,s2,,sh1,ah1π¯[h~=1h1ϕ(sh~,ah~)Σh~1]+2(h1)ε0+𝔼sh,ahπ¯[Vh(sh)Vhπ¯(sh)]\displaystyle 2{\beta^{\prime}}\mathbb{E}_{s_{1}\sim d_{1},a_{1},s_{2},...,s_{{h}-1},a_{{h}-1}\sim{\bar{\pi}}}[\sum_{{\widetilde{h}}=1}^{{h}-1}\|\phi(s_{\widetilde{h}},a_{\widetilde{h}})\|_{\Sigma_{\widetilde{h}}^{-1}}]+2({h}-1){\varepsilon_{0}}+\mathbb{E}_{s_{h},a_{h}\sim{\bar{\pi}}}[V_{h}(s_{h})-V_{h}^{{\bar{\pi}}}(s_{h})]
\leq 2{\beta^{\prime}}\xi+2{h}{\varepsilon_{0}}\leq 3{\beta^{\prime}}\xi (Condition 4.5 and Eq.(27)) ∎

Lemma E.7 (Bias of Linear Regression in Alg 6).

During the running of Alg 2, on the event \mathcal{E} in Theorem E.4, which holds with probability 1-\delta/2, for arbitrary h\in[H], {\widetilde{h}}\in[h-1], and arbitrary \pi, the quantities \widehat{w}^{\pi}_{\widetilde{h}} and \widehat{V}^{\pi}_{{\widetilde{h}}+1} occurring in Alg 6 satisfy:

|\phi(s,a)^{\top}\widehat{w}^{\pi}_{\widetilde{h}}-\sum_{s^{\prime}\in\mathcal{S}}P_{\widetilde{h}}(s^{\prime}|s,a)\widehat{V}^{\pi}_{{\widetilde{h}}+1}(s^{\prime})|\leq{\beta^{\prime}}\|\phi(s,a)\|_{\Sigma_{\widetilde{h}}^{-1}}

where {\beta^{\prime}} is the same as in Lemma E.5.

The proof of the above lemma is based on Theorem E.4 and Claim 3 and is similar to Lemma 3.1 in (Wang et al., 2020b), so we omit it here.

Lemma E.8 (Policy Evaluation Error in Alg 6).

During the running of Algorithm 2, on the event \mathcal{E} in Theorem E.4, which holds with probability 1-\delta/2, and under the induction condition 4.5, for arbitrary h\in[H] and i,j\in[d], and for an arbitrary policy \pi and its evaluation result \widehat{V}^{\pi} returned by Algorithm 6, we have:

|Vπ(s1;R~ij)V^1π(s1)|βξ\displaystyle|V^{\pi}(s_{1};\widetilde{R}^{ij})-\widehat{V}_{1}^{\pi}(s_{1})|\leq{\beta^{\prime}}\xi

where we use R~ij\widetilde{R}^{ij} to denote the reward function used in Algorithm 6.

Proof.

As a result of Lemma E.7, for arbitrary \widetilde{R}^{ij}, we have:

|V^1π(s1)V1π(s1;R~ij)|=\displaystyle|\widehat{V}_{1}^{\pi}(s_{1})-V_{1}^{\pi}(s_{1};\widetilde{R}^{ij})|= |Q^1π(s1,a1)Q1π(s1,a1;R~ij)|\displaystyle|\widehat{Q}_{1}^{\pi}(s_{1},a_{1})-Q_{1}^{\pi}(s_{1},a_{1};\widetilde{R}^{ij})|
=\displaystyle= |ϕ(s1,a1)w^π1s2Ph(s2|s1,a1)Vπ(s2;R~ij)|\displaystyle|\phi(s_{1},a_{1})^{\top}\widehat{w}^{\pi}_{1}-\sum_{s_{2}}P_{h}(s_{2}|s_{1},a_{1})V^{\pi}(s_{2};\widetilde{R}^{ij})|
\displaystyle\leq |ϕ(s1,a1)w^π1s2Ph(s2|s1,a1)V^2π(s2)|+𝔼π|V^2π(s2)Qπ2(s2,π(s2);R~ij)|\displaystyle|\phi(s_{1},a_{1})^{\top}\widehat{w}^{\pi}_{1}-\sum_{s_{2}}P_{h}(s_{2}|s_{1},a_{1})\widehat{V}_{2}^{\pi}(s_{2})|+\mathbb{E}_{\pi}|\widehat{V}_{2}^{\pi}(s_{2})-Q^{\pi}_{2}(s_{2},\pi(s_{2});\widetilde{R}^{ij})|
\leq{\beta^{\prime}}\|\phi(s_{1},a_{1})\|_{\Sigma_{1}^{-1}}+\mathbb{E}_{s_{2}}|\widehat{V}_{2}^{\pi}(s_{2})-Q^{\pi}_{2}(s_{2},\pi(s_{2});\widetilde{R}^{ij})|
\displaystyle...
\displaystyle\leq β𝔼s1,a1,,sh1,ah1π[t=1h1ϕ(st,at)Σt1]\displaystyle{\beta^{\prime}}\mathbb{E}_{s_{1},a_{1},...,s_{{h}-1},a_{{h}-1}\sim\pi}[\sum_{t=1}^{{h}-1}\|\phi(s_{t},a_{t})\|_{\Sigma_{t}^{-1}}]
\leq{\beta^{\prime}}\frac{{h}-1}{H}\xi\leq{\beta^{\prime}}\xi (Condition 4.5) ∎

E.6 Main Theorem and Proof

Now, we restate Theorem 4.4 in a formal version below:

Theorem E.9 (Formal Version of Theorem 4.4).

For arbitrary 0<\varepsilon,\delta<1, there exist absolute constants c_{i},c_{\beta^{\prime}},c_{\beta} and c_{N}, such that by choosing ({\beta^{\prime}} used in Alg. 5 and 6, and \beta used in Alg. 4) (footnote: The ICLR 2022 version (and also the second version on arXiv) omitted that the bonus coefficients \beta in Alg. 4 and {\beta^{\prime}} in Alg. 5 and 6 should have different dependence on H, which resulted in an incorrect dependence on H in the sample complexity upper bound. We thank Dan Qiao for calling this to our attention.)

i_{\max}=c_{i}\frac{d}{\nu_{\min}^{4}}\log\frac{d}{\nu_{\min}},\quad{\beta^{\prime}}=c_{\beta^{\prime}}d\sqrt{\log\frac{dH}{\varepsilon\delta\nu_{\min}}},\quad\beta=c_{\beta}dH\sqrt{\log\frac{dH}{\varepsilon\delta\nu_{\min}}}
N=c_{N}\Big(\frac{H^{4}d^{3}}{\varepsilon^{2}\nu^{2}_{\min}}+\frac{H^{2}d^{7}}{\nu^{14}_{\min}}\Big)\log^{2}\frac{dH}{\varepsilon\delta\nu_{\min}},\quad\varepsilon_{0}=\frac{1}{N}.

with probability 1-\delta, after K=H deployments, by running Alg 4 with the collected dataset D=\{D_{1},...,D_{H}\} and an arbitrary r satisfying the linear assumption (Assumption A) (in the corresponding line of Algorithm 2), we will obtain a policy \widehat{\pi} such that V_{1}^{\widehat{\pi}}(s_{1};r)\geq V_{1}^{\pi^{*}}(s_{1};r)-\varepsilon.

As an additional guarantee, after h deployments, by running Alg 4 with the collected dataset \{D_{1},D_{2},...,D_{h}\} and reward function r, we will obtain a policy \pi_{|h} which is \varepsilon-optimal in the MDP truncated at step h.

Proof.

We will use Σh,i:=2I+j=1i1𝔼πj[ϕ(sh,ah)ϕ(sh,ah)]\Sigma_{h,i}:=2I+\sum_{j=1}^{i-1}\mathbb{E}_{\pi_{j}}[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}] to denote the matrix which Σ~h,i\widetilde{\Sigma}_{h,i} approximates in Alg. 2, and use Rh,i:=ϕ(,)Σh,i1ϕ(,)R_{h,i}:=\sqrt{\phi(\cdot,\cdot)^{\top}\Sigma_{h,i}^{-1}\phi(\cdot,\cdot)} to denote the reward function used in Alg 5 if the covariance matrix estimation is perfect (i.e. Σ~h,i=Σh,i\widetilde{\Sigma}_{h,i}=\Sigma_{h,i}).

The proof consists of three steps. In step 1, we try to show that the inner loop of Alg 2 will terminate and Πh\Pi_{h} will contain a set of exploratory policies. In step 2, we will analyze the samples generated by a mixture of policies in Πh\Pi_{h}. In the last step, we determine the choice of hyper-parameters and fill the gaps of pre-assumed constraints and induction conditions.

Step 1: Exploration Ability for Policies in Πh\Pi_{h}

In the inner loop (line 5 - 12) in Algorithm 2, our goal is to find a set of policies Πh\Pi_{h}, such that if the algorithm stops at iteration ii, the following uncertainty measure is as small as possible

Vh,i+1(s1;Rh,i):=maxπ𝔼π[ϕ(sh,ah)Σh,i1]\displaystyle V^{*}_{h,i+1}(s_{1};R_{h,i}):=\max_{\pi}\mathbb{E}_{\pi}[\|\phi(s_{h},a_{h})\|_{\Sigma_{h,i}^{-1}}] (28)

To achieve this goal, we repeatedly use Alg 6 to estimate the covariance matrix of each policy and append it to \widetilde{\Sigma}_{h,i} as an approximation of \Sigma_{h,i}, and use Alg 5 to find a near-optimal policy maximizing the uncertainty-based reward function \widetilde{R}; by sampling trajectories with such policies we can reduce the uncertainty measure in Eq.(28).

First, we take a look at the estimation error of the accumulated covariance matrix when running Algorithm 2. Under the conditions of Lemma E.8, we can bound the element-wise estimation error of \Sigma_{h,i}:

|(Σ~h,i)jk(Σh,i)jk|iβξ,j,k[d]\displaystyle|(\widetilde{\Sigma}_{h,i})_{jk}-(\Sigma_{h,i})_{jk}|\leq i\cdot{\beta^{\prime}}\xi,\quad\forall j,k\in[d]

As a result of Lemma E.11, we have:

|R~h,i(sh,ah)Rh,i(sh,ah)|=\displaystyle|\widetilde{R}_{h,i}(s_{h},a_{h})-R_{h,i}(s_{h},a_{h})|= |ϕΣ~h,i1ϕϕΣh,i1ϕ|\displaystyle|\sqrt{\phi^{\top}\widetilde{\Sigma}_{h,i}^{-1}\phi}-\sqrt{\phi^{\top}\Sigma_{h,i}^{-1}\phi}|
\displaystyle\leq idβξ1idβξimaxdβξ1imaxdβξ\displaystyle\sqrt{\frac{i\cdot d{\beta^{\prime}}\xi}{1-i\cdot d{\beta^{\prime}}\xi}}\leq\sqrt{\frac{i_{\max}d{\beta^{\prime}}\xi}{1-i_{\max}d{\beta^{\prime}}\xi}}
\displaystyle\leq ν2min8\displaystyle\frac{\nu^{2}_{\min}}{8} (29)

where the second-to-last step holds because we repeat at most i_{\max} iterations at each layer h, and we introduce the following constraint on \xi during the derivation to make sure the bias is small and all the terms occurring in the derivation are well defined:

ξν4min32imaxdβ\displaystyle\xi\leq\frac{\nu^{4}_{\min}}{32i_{\max}d{\beta^{\prime}}} (30)

Next, we want to find a good choice of imaxi_{\max} to make sure Vh,i+1V_{h,i+1} will not always be large and the for-loop will break for some iimaxi\leq i_{\max}. We first provide an upper bound for Vh,i+1V_{h,i+1}:

Vh,i+1(s1)\displaystyle V_{h,i+1}(s_{1})\leq Vπh,i+1(s1;R~h,i)+βξ\displaystyle V^{\pi_{h,i+1}}(s_{1};\widetilde{R}_{h,i})+{\beta^{\prime}}\xi (Lemma E.5)
\displaystyle\leq Vπ¯h,i+1(s1;R~h,i)+4βξ\displaystyle V^{{\bar{\pi}}_{h,i+1}}(s_{1};\widetilde{R}_{h,i})+4{\beta^{\prime}}\xi (VπVπ¯VVπ¯V^{\pi}-V^{\bar{\pi}}\leq V^{*}-V^{\bar{\pi}}; Lemma E.6)
\displaystyle\leq Vπ¯h,i+1(s1;Rh,i)+4βξ+ν2min8\displaystyle V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})+4{\beta^{\prime}}\xi+\frac{\nu^{2}_{\min}}{8} (bias of reward)
\displaystyle\leq Vπ¯h,i+1(s1;Rh,i)+ν2min4\displaystyle V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})+\frac{\nu^{2}_{\min}}{4} (Constraints on ξ\xi in Eq.(30))

Next, we show that V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i}) cannot always be large. According to the elliptical potential lemma (Lemma C.1), we have:

i=1imaxVπ¯h,i+1(s1;Rh,i)=\displaystyle\sum_{i=1}^{i_{\max}}V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})= i=1imax𝔼π¯h,i+1[ϕ(sh,ah)Σh,i1]\displaystyle\sum_{i=1}^{i_{\max}}\mathbb{E}_{{\bar{\pi}}_{h,i+1}}[\|\phi(s_{h},a_{h})\|_{\Sigma_{h,i}^{-1}}]
\displaystyle\leq i=1imax𝔼π¯h,i+1[ϕ(sh,ah)2Σh,i1]\displaystyle\sum_{i=1}^{i_{\max}}\sqrt{\mathbb{E}_{{\bar{\pi}}_{h,i+1}}[\|\phi(s_{h},a_{h})\|^{2}_{\Sigma_{h,i}^{-1}}]}
\displaystyle\leq imaxi=1imax𝔼π¯h,i+1[ϕ(sh,ah)2Σh,i1]\displaystyle\sqrt{i_{\max}\sum_{i=1}^{i_{\max}}\mathbb{E}_{{\bar{\pi}}_{h,i+1}}[\|\phi(s_{h},a_{h})\|^{2}_{\Sigma_{h,i}^{-1}}]}
=\displaystyle= imaxi=1imaxTr(𝔼π¯h,i+1[ϕ(sh,ah)ϕ(sh,ah)]Σh,i1)\displaystyle\sqrt{i_{\max}\sum_{i=1}^{i_{\max}}Tr(\mathbb{E}_{{\bar{\pi}}_{h,i+1}}[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}]\Sigma_{h,i}^{-1})}
\displaystyle\leq 2imaxdlog(1+imax/d)\displaystyle\sqrt{2i_{\max}d\log(1+i_{\max}/d)}

where in the last step, we use the definition of Σh,i\Sigma_{h,i}. Therefore,

miniVπ¯h,i+1(s1;Rh,i)1imaxi=1imaxVπ¯h,i+1(s1;Rh,i)2dlog(1+imax/d)imax\displaystyle\min_{i}V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})\leq\frac{1}{i_{\max}}\sum_{i=1}^{i_{\max}}V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})\leq\sqrt{2\frac{d\log(1+i_{\max}/d)}{i_{\max}}}

In order to guarantee miniVπ¯h,i+1(s1;Rh,i)ν2min/8\min_{i}V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})\leq\nu^{2}_{\min}/8, we require:

2dlog(1+imax/d)imaxν2min/8\displaystyle\sqrt{2\frac{d\log(1+i_{\max}/d)}{i_{\max}}}\leq\nu^{2}_{\min}/8

which can be satisfied by:

imax=cidν4minlogdνmin\displaystyle i_{\max}=c_{i}\frac{d}{\nu^{4}_{\min}}\log\frac{d}{\nu_{\min}} (31)

for some absolute constant cic_{i}.
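To make this choice of i_{\max} concrete, the minimal Python sketch below searches for the smallest i_{\max} satisfying the requirement displayed above Eq.(31); the values of d and \nu_{\min} are illustrative and not taken from the paper.

import math

def smallest_i_max(d, nu_min):
    # Smallest i_max with sqrt(2 * d * log(1 + i_max / d) / i_max) <= nu_min^2 / 8.
    target = nu_min ** 2 / 8.0
    i_max = 1
    while math.sqrt(2.0 * d * math.log(1.0 + i_max / d) / i_max) > target:
        i_max += 1
    return i_max

# Illustrative values (not from the paper): d = 10, nu_min = 0.5.
print(smallest_i_max(d=10, nu_min=0.5))  # grows like (d / nu_min^4) * log(d / nu_min)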

Combining the above results, we can conclude that the inner loop in Alg 2 will break at some i<imaxi<i_{\max}, such that Vh,i+13ν2min/8V_{h,i+1}\leq 3\nu^{2}_{\min}/8, and guarantee that:

maxπ𝔼π[ϕ(sh,ah)Σ1h,i]:=\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\|\phi(s_{h},a_{h})\|_{\Sigma^{-1}_{h,i}}]:= Vh,i+1(s1;Rh,i)\displaystyle V^{*}_{h,i+1}(s_{1};R_{h,i})
\displaystyle\leq Vh,i+1(s1;R~h,i)+ν2min8\displaystyle V^{*}_{h,i+1}(s_{1};\widetilde{R}_{h,i})+\frac{\nu^{2}_{\min}}{8} (reward estimation error Eq.(29))
\displaystyle\leq Vh,i+1(s1)+ν2min8\displaystyle V_{h,i+1}(s_{1})+\frac{\nu^{2}_{\min}}{8} (Overestimation in Lemma E.5)
\displaystyle\leq Vπ¯h,i+1(s1;Rh,i)+ν2min4+ν2min8\displaystyle V^{{\bar{\pi}}_{h,i+1}}(s_{1};R_{h,i})+\frac{\nu^{2}_{\min}}{4}+\frac{\nu^{2}_{\min}}{8}
\displaystyle\leq ν2min2\displaystyle\frac{\nu^{2}_{\min}}{2}

Step 2: Policy Deployment and Concentration Error

For uniform mixture policy πh,mix:=Unif(Πh)\pi_{h,mix}:=Unif(\Pi_{h}), by applying Lemma E.13, Lemma E.14 and the results above, we must have:

maxπ𝔼π[ϕ(sh,ah)(I+N𝔼πh,mix[ϕϕ])1ϕ(sh,ah)]\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\phi(s_{h},a_{h})}]
=\displaystyle= maxπ𝔼π[ϕ(sh,ah)(I+N|Πh||Πh|𝔼πh,mix[ϕϕ])1ϕ(sh,ah)]\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(I+\frac{N}{|\Pi_{h}|}|\Pi_{h}|\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\phi(s_{h},a_{h})}]
\displaystyle\leq 21+N/|Πh|maxπ𝔼π[ϕ(sh,ah)(I+|Πh|𝔼πh,mix[ϕϕ])1ϕ(sh,ah)]\displaystyle\sqrt{\frac{2}{1+N/|\Pi_{h}|}\max_{\pi}\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(I+|\Pi_{h}|\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\phi(s_{h},a_{h})}]}
\displaystyle\leq 11+N/imaxνminimaxNνmin\displaystyle\sqrt{\frac{1}{1+N/i_{\max}}}\nu_{\min}\leq\sqrt{\frac{i_{\max}}{N}}\nu_{\min}

and this is the motivation for the breaking criterion in Line 2 of Alg 2.

In the following, we will use \Sigma^{-}_{h}:=\Sigma_{h}-I=\sum_{n=1}^{N}\phi(s_{h,n},a_{h,n})\phi(s_{h,n},a_{h,n})^{\top} to denote the matrix of sampled features without the regularization term. According to Lemma E.10, with probability 1-\delta/2, we have:

\frac{1}{N}\sigma_{\max}\Big(N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]-\Sigma^{-}_{h}\Big)\leq\frac{4}{\sqrt{N}}\log\frac{8dH}{\delta},\quad\forall h\in[H]

Following the same steps as in the proof of Lemma E.14, we know that

σmin(N𝔼πh,mix[ϕϕ])=N|Πh|σmin(|Πh|𝔼πh,mix[ϕϕ])N|Πh|Nimax.\sigma_{\min}(N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])=\frac{N}{|\Pi_{h}|}\sigma_{\min}(|\Pi_{h}|\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])\geq\frac{N}{|\Pi_{h}|}\geq\frac{N}{i_{\max}}.

As a result,

\min_{x:\|x\|=1}x^{\top}\Sigma_{h}x=\min_{x:\|x\|=1}x^{\top}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])x-x^{\top}(N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]-\Sigma^{-}_{h})x
\displaystyle\geq σmin(I+N𝔼πh,mix[ϕϕ])σmax(N𝔼πh,mix[ϕϕ]Σh)\displaystyle\sigma_{\min}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])-\sigma_{\max}(N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]-\Sigma^{-}_{h})
\displaystyle\geq 1+(N2imax4Nlog8dHδ)\displaystyle 1+(\frac{N}{2i_{\max}}-4\sqrt{N}\log\frac{8dH}{\delta})

which implies that, as long as

N16i2maxlog28dHδ\displaystyle N\geq 16i^{2}_{\max}\log^{2}\frac{8dH}{\delta} (32)

we have

σmax(Σh1)11+N/2imaxNlog8dHδ4imaxN\displaystyle\sigma_{\max}(\Sigma_{h}^{-1})\leq\frac{1}{1+N/2i_{\max}-\sqrt{N}\log\frac{8dH}{\delta}}\leq\frac{4i_{\max}}{N}

Therefore, for arbitrary π\pi, we have:

|𝔼π[ϕ(sh,ah)(I+N𝔼πh,mix[ϕϕ])1ϕ(sh,ah)]𝔼π[ϕ(sh,ah)(Σh)1ϕ(sh,ah)]|\displaystyle|\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\phi(s_{h},a_{h})}]-\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(\Sigma_{h})^{-1}\phi(s_{h},a_{h})}]|
\displaystyle\leq 𝔼π[|ϕ(sh,ah)(I+N𝔼πh,mix[ϕϕ])1ϕ(sh,ah)ϕ(sh,ah)(Σh)1ϕ(sh,ah)|]\displaystyle\mathbb{E}_{\pi}[\sqrt{|\phi(s_{h},a_{h})^{\top}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\phi(s_{h},a_{h})-\phi(s_{h},a_{h})^{\top}(\Sigma_{h})^{-1}\phi(s_{h},a_{h})|}]
\displaystyle\leq 𝔼π[|ϕ(sh,ah)((I+N𝔼πh,mix[ϕϕ])1(Σh)1)ϕ(sh,ah)|]\displaystyle\mathbb{E}_{\pi}[\sqrt{|\phi(s_{h},a_{h})^{\top}\Big{(}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}-(\Sigma_{h})^{-1}\Big{)}\phi(s_{h},a_{h})|}]
=\displaystyle= 𝔼π[|ϕ(sh,ah)(I+N𝔼πh,mix[ϕϕ])1(N𝔼πh,mix[ϕϕ]Σh)(Σh)1ϕ(sh,ah)|]\displaystyle\mathbb{E}_{\pi}[\sqrt{|\phi(s_{h},a_{h})^{\top}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\big{(}N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]-\Sigma^{-}_{h}\big{)}(\Sigma_{h})^{-1}\phi(s_{h},a_{h})|}]
\displaystyle\leq σmax((I+N𝔼πh,mix[ϕϕ])1)4imaxNσmax(N𝔼πh,mix[ϕϕ]Σh)\displaystyle\sqrt{\sigma_{\max}((I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1})\frac{4i_{\max}}{N}\sigma_{\max}(N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]-\Sigma^{-}_{h})}
\displaystyle\leq 11+N/imax16imaxNlog8dδ\displaystyle\sqrt{\frac{1}{1+N/i_{\max}}\frac{16i_{\max}}{\sqrt{N}}\log\frac{8d}{\delta}} (imax𝔼πh,mix[ϕϕ]|Πh|𝔼πh,mix[ϕϕ]1i_{\max}\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]\succeq|\Pi_{h}|\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}]\succeq 1)
\displaystyle\leq 4imaxN3/4log8dHδ\displaystyle\frac{4i_{\max}}{N^{3/4}}\sqrt{\log\frac{8dH}{\delta}}

As a result,

maxπ𝔼π[ϕ(sh,ah)(Σh)1ϕ(sh,ah)]\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(\Sigma_{h})^{-1}\phi(s_{h},a_{h})}]
\displaystyle\leq maxπ𝔼π[ϕ(sh,ah)(I+N𝔼πh,mix[ϕϕ])1ϕ(sh,ah)]+4imaxN3/4log8dHδ\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\sqrt{\phi(s_{h},a_{h})^{\top}(I+N\mathbb{E}_{\pi_{h,mix}}[\phi\phi^{\top}])^{-1}\phi(s_{h},a_{h})}]+\frac{4i_{\max}}{N^{3/4}}\sqrt{\log\frac{8dH}{\delta}}
\displaystyle\leq imaxNνmin+4imaxN3/4log8dHδ\displaystyle\sqrt{\frac{i_{\max}}{N}}\nu_{\min}+\frac{4i_{\max}}{N^{3/4}}\sqrt{\log\frac{8dH}{\delta}}

In order to make sure the induction condition holds, we need

imaxNνmin+4imaxN3/4log8dHδξ/H\displaystyle\sqrt{\frac{i_{\max}}{N}}\nu_{\min}+\frac{4i_{\max}}{N^{3/4}}\sqrt{\log\frac{8dH}{\delta}}\leq\xi/H

As long as we tighten the constraint in Eq.(32) to:

N256i2maxν4minlog28dHδ=O~(d2ν12min)\displaystyle N\geq 256\frac{i^{2}_{\max}}{\nu^{4}_{\min}}\log^{2}\frac{8dH}{\delta}=\widetilde{O}(\frac{d^{2}}{\nu^{12}_{\min}}) (33)

the induction conditions can be satisfied when

2imaxNνminξ/2H\displaystyle 2\sqrt{\frac{i_{\max}}{N}}\nu_{\min}\leq\xi/2H

or equivalently,

N16H2ν2minimaxξ2=O(H2dξ2ν2min)\displaystyle N\geq\frac{16H^{2}\nu^{2}_{\min}i_{\max}}{\xi^{2}}=O(\frac{H^{2}d}{\xi^{2}\nu^{2}_{\min}})

Step 3: Determine Hyper-parameters

(1) Resolution ε0{\varepsilon_{0}}

Recall that we still have a constraint for ZhZ_{h} in (24)

I+ΣRε024I,4ε02I2I+ΣR\displaystyle I+\Sigma_{R}\succeq\frac{{\varepsilon_{0}}^{2}}{4}I,\quad\frac{4}{{\varepsilon_{0}}^{2}}I\succeq 2I+\Sigma_{R}

Since we have already determined i_{\max} in Eq.(31), and recalling our constraint on \xi in Eq.(30), the above constraints on {\varepsilon_{0}} can be satisfied as long as:

ε01imax\displaystyle{\varepsilon_{0}}\leq\sqrt{\frac{1}{i_{\max}}} (34)

Combining all the constraints of ε0{\varepsilon_{0}}, including (23), (25), (27) and (34), we conclude that:

ε0min{1N,βN+1,1imax,βξH}=1N\displaystyle{\varepsilon_{0}}\leq\min\{\frac{1}{N},\frac{{\beta^{\prime}}}{\sqrt{N+1}},\frac{1}{\sqrt{i_{\max}}},\frac{{\beta^{\prime}}\xi}{H}\}=\frac{1}{N}

(2) Induction error ξ\xi

Besides the constraint in Eq.(30), we need another one to guarantee the quality of the final output policy. By applying Lemma D.3 to the planning algorithm Alg. 4, if the induction condition (4.5) holds for all h\in[H], Alg. 4 will return a policy \widehat{\pi} such that:

V^{*}-V^{\widehat{\pi}}\leq 2\beta\max_{\pi}\mathbb{E}_{\pi}[\sum_{{\widetilde{h}}=1}^{H}\|\phi(s_{\widetilde{h}},a_{\widetilde{h}})\|_{\Sigma_{\widetilde{h}}^{-1}}]\leq 2\beta\xi

To make sure VVπ^εV^{*}-V^{\widehat{\pi}}\leq\varepsilon, we require ξε2β\xi\leq\frac{\varepsilon}{2\beta}, which implies that

ξmin{ν4min32imaxdβ,ε2β}\displaystyle\xi\leq\min\{\frac{\nu^{4}_{\min}}{32i_{\max}d{\beta^{\prime}}},\frac{\varepsilon}{2\beta}\}

(3) Choice of N, {\beta^{\prime}} and \beta

Since ε0=1N\varepsilon_{0}=\frac{1}{N}, by plugging it into Eq.(26), we may choose β{\beta^{\prime}} to be:

β=cβdlogdHNδ\displaystyle{\beta^{\prime}}=c_{\beta^{\prime}}^{\prime\prime}d\sqrt{\log\frac{dHN}{\delta}}

The discussion about the choice of β\beta is similar to what we did for Alg. 3 and 4, and we can choose:

β=cβdHlogdHNδ\displaystyle\beta=c_{\beta}^{\prime\prime}dH\sqrt{\log\frac{dHN}{\delta}}

Now, we are ready to compute NN. When ξ=ε2βν4min32imaxdβ\xi=\frac{\varepsilon}{2\beta}\leq\frac{\nu^{4}_{\min}}{32i_{\max}d{\beta^{\prime}}}, we have:

N=O(H2dξ2ν2min)=O(H4d3ε2ν2min)logdHεδνmin\displaystyle N=O(\frac{H^{2}d}{\xi^{2}\nu^{2}_{\min}})=O(\frac{H^{4}d^{3}}{\varepsilon^{2}\nu^{2}_{\min}})\log\frac{dH}{\varepsilon\delta\nu_{\min}}

and otherwise, we have:

N=O(H2dξ2ν2min)=O(H2d7ν14min)logdHεδνmin\displaystyle N=O(\frac{H^{2}d}{\xi^{2}\nu^{2}_{\min}})=O(\frac{H^{2}d^{7}}{\nu^{14}_{\min}})\log\frac{dH}{\varepsilon\delta\nu_{\min}}

Combining the additional constraint to control the concentration error in Eq.(33), the batch size N should be at the level:

Nc(H4d3ε2ν2min+H2d7ν14min)log2dHεδνmin\displaystyle N\geq c\Big{(}\frac{H^{4}d^{3}}{\varepsilon^{2}\nu^{2}_{\min}}+\frac{H^{2}d^{7}}{\nu^{14}_{\min}}\Big{)}\log^{2}\frac{dH}{\varepsilon\delta\nu_{\min}}

and therefore,

β=cβdHlogdHδενmin,β=cβdlogdHδενmin\displaystyle\beta=c_{\beta}dH\sqrt{\log\frac{dH}{\delta\varepsilon\nu_{\min}}},\quad{\beta^{\prime}}=c_{\beta^{\prime}}d\sqrt{\log\frac{dH}{\delta\varepsilon\nu_{\min}}}

for some cβc_{\beta} and cβc_{\beta^{\prime}}.
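For concreteness, the small sketch below evaluates the hyper-parameter choices of Theorem E.9 for given (d, H, \varepsilon, \delta, \nu_{\min}). The absolute constants c_* are placeholders, since the theorem only asserts their existence, so the resulting numbers are illustrative rather than tight.

import math

def hyperparameters(d, H, eps, delta, nu_min,
                    c_i=1.0, c_beta=1.0, c_beta_prime=1.0, c_N=1.0):
    # Plug-in evaluation of the choices in Theorem E.9; the constants c_* are
    # placeholders, since the theorem only asserts that suitable constants exist.
    log_term = math.log(d * H / (eps * delta * nu_min))
    i_max = c_i * (d / nu_min ** 4) * math.log(d / nu_min)
    beta_prime = c_beta_prime * d * math.sqrt(log_term)
    beta = c_beta * d * H * math.sqrt(log_term)
    N = c_N * (H ** 4 * d ** 3 / (eps ** 2 * nu_min ** 2)
               + H ** 2 * d ** 7 / nu_min ** 14) * log_term ** 2
    return dict(i_max=i_max, beta=beta, beta_prime=beta_prime, N=N, eps0=1.0 / N)

# Illustrative instantiation: d = 10, H = 20, eps = 0.1, delta = 0.05, nu_min = 0.5.
print(hyperparameters(d=10, H=20, eps=0.1, delta=0.05, nu_min=0.5))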

Near-Optimal Guarantee

Under the event \mathcal{E} in Theorem E.4, and accounting for the failure probability of the concentration inequality in Step 2, we can conclude that the induction condition holds for all h\in[H] with probability 1-\delta. Combining this with our discussion about the choice of \xi above, Alg 2 returns an \varepsilon-optimal policy with probability at least 1-\delta.

The additional guarantee in Theorem E.9 can be directly obtained by considering the induction condition at layer h[H]h\in[H]. ∎

E.7 Technical Lemma

Lemma E.10 (Matrix Bernstein Theorem (Theorem 6.1.1 in (Tropp, 2015))).

Consider a finite sequence {𝐒k}\{\mathbf{S}_{k}\} of independent, random matrices with common dimension d1×d2d_{1}\times d_{2}. Assume that

\mathbb{E}\mathbf{S}_{k}=0\quad\text{and}\quad\|\mathbf{S}_{k}\|\leq L\quad\text{for each index }k

Introduce the random matrix

𝐙=k𝐒k\mathbf{Z}=\sum_{k}\mathbf{S}_{k}

Let v(𝐙)v(\mathbf{Z}) be the matrix variance statistic of the sum:

v(𝐙)=\displaystyle v(\mathbf{Z})= max{𝔼(𝐙𝐙),𝔼(𝐙𝐙)}\displaystyle\max\{\|\mathbb{E}(\mathbf{Z}\mathbf{Z}^{*})\|,\|\mathbb{E}(\mathbf{Z}^{*}\mathbf{Z})\|\}
=\displaystyle= max{k𝔼(𝐒k𝐒k),k𝔼(𝐒k𝐒k)}\displaystyle\max\{\|\sum_{k}\mathbb{E}(\mathbf{S}_{k}\mathbf{S}_{k}^{*})\|,\|\sum_{k}\mathbb{E}(\mathbf{S}_{k}^{*}\mathbf{S}_{k})\|\}

Then,

𝔼𝐙2v(𝐙)log(d1+d2)+13Llog(d1+d2)\displaystyle\mathbb{E}\|\mathbf{Z}\|\leq\sqrt{2v(\mathbf{Z})\log(d_{1}+d_{2})}+\frac{1}{3}L\log(d_{1}+d_{2})

Furthermore, for all t0t\geq 0

{𝐙t}(d1+d2)exp(t2/2v(𝐙)+Lt/3)\displaystyle\mathbb{P}\{\|\mathbf{Z}\|\geq t\}\leq(d_{1}+d_{2})\exp\Big{(}\frac{-t^{2}/2}{v(\mathbf{Z})+Lt/3}\Big{)}
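As a quick sanity check of how Lemma E.10 is used, the sketch below builds a hypothetical sequence of centered rank-one random-sign matrices (unrelated to the features in our algorithms) and compares the empirical expectation of \|\mathbf{Z}\| with the stated bound.

import numpy as np

rng = np.random.default_rng(0)
d1 = d2 = 5
n, L = 200, 1.0  # n independent summands S_k, each with ||S_k|| = 1 <= L

def sample_S():
    # Rank-one random-sign matrix: E[S] = 0 and ||S|| = ||u|| * ||v|| = 1.
    u = rng.choice([-1.0, 1.0], size=d1) / np.sqrt(d1)
    v = rng.choice([-1.0, 1.0], size=d2) / np.sqrt(d2)
    return np.outer(u, v)

# For this construction E[S S^T] = I / d1 and E[S^T S] = I / d2, so the matrix
# variance statistic of Z = sum_k S_k is v(Z) = max(n / d1, n / d2).
v_Z = max(n / d1, n / d2)
bound = np.sqrt(2 * v_Z * np.log(d1 + d2)) + L * np.log(d1 + d2) / 3

empirical = np.mean([np.linalg.norm(sum(sample_S() for _ in range(n)), 2)
                     for _ in range(200)])
print(f"empirical E||Z|| ~ {empirical:.2f} <= Bernstein bound {bound:.2f}")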
Lemma E.11 (Matrix Perturbation).

Given a positive definite matrix AIA\succ I and Δ\Delta satisfying |Δij|ε<1/d|\Delta_{ij}|\leq\varepsilon<1/d, define matrix A+=A+ΔA_{+}=A+\Delta, then for arbitrary ϕd\phi\in\mathbb{R}^{d} with ϕ1\|\phi\|\leq 1, we have:

A+0,|ϕ(A+A)ϕ|dε,|ϕ(A+1A1)ϕ|dε1dε\displaystyle A_{+}\succ 0,\quad|\phi^{\top}(A_{+}-A)\phi|\leq d\varepsilon,\quad|\phi^{\top}(A_{+}^{-1}-A^{-1})\phi|\leq\frac{d\varepsilon}{1-d\varepsilon}

which implies that

A+A+ΔA+ΔFA+dε,A+1A1dε1dε\displaystyle\|A_{+}\|\leq\|A\|+\|\Delta\|\leq\|A\|+\|\Delta\|_{F}\leq\|A\|+d\varepsilon,\quad\|A_{+}^{-1}\|\geq\|A^{-1}\|-\frac{d\varepsilon}{1-d\varepsilon}

Moreover,

|ϕA1+ϕA1||ϕ2A1+ϕ2A1|dε1dε\displaystyle|\|\phi\|_{A^{-1}_{+}}-\|\phi\|_{A^{-1}}|\leq\sqrt{|\|\phi\|^{2}_{A^{-1}_{+}}-\|\phi\|^{2}_{A^{-1}}|}\leq\sqrt{\frac{d\varepsilon}{1-d\varepsilon}}
Proof.

First of all, it is easy to see that

σmax(Δ)ΔFdε\displaystyle\sigma_{\max}(\Delta)\leq\|\Delta\|_{F}\leq d\varepsilon

and therefore we have

σmin(A+)=minx:x=1xAx+xΔx>1dε>0.\sigma_{\min}(A_{+})=\min_{x:\|x\|=1}x^{\top}Ax+x^{\top}\Delta x>1-d\varepsilon>0.

where we use \sigma_{\min} and \sigma_{\max} to denote the smallest and the largest singular values, respectively, and use \|\cdot\|_{F} to refer to the Frobenius norm. Therefore,

|ϕ(A+A)ϕ|=|ϕΔϕ|dε\displaystyle|\phi^{\top}(A_{+}-A)\phi|=|\phi^{\top}\Delta\phi|\leq d\varepsilon

and

|ϕ(A+1A1)ϕ|=|ϕA+1(A+A)A1ϕ|σmax(A+1)σmax(A+A)σmax(A1)dε1dε\displaystyle|\phi^{\top}(A_{+}^{-1}-A^{-1})\phi|=|\phi^{\top}A_{+}^{-1}(A_{+}-A)A^{-1}\phi|\leq\sigma_{\max}(A_{+}^{-1})\sigma_{\max}(A_{+}-A)\sigma_{\max}(A^{-1})\leq\frac{d\varepsilon}{1-d\varepsilon}

Moreover,

|\|\phi\|_{A^{-1}_{+}}-\|\phi\|_{A^{-1}}|\leq\sqrt{|\|\phi\|_{A^{-1}_{+}}-\|\phi\|_{A^{-1}}|\cdot|\|\phi\|_{A^{-1}_{+}}+\|\phi\|_{A^{-1}}|}=\sqrt{|\phi^{\top}(A_{+}^{-1}-A^{-1})\phi|}\leq\sqrt{\frac{d\varepsilon}{1-d\varepsilon}} ∎
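A quick numerical illustration of Lemma E.11, with a hypothetical positive definite A and a random symmetric perturbation \Delta chosen only for illustration:

import numpy as np

rng = np.random.default_rng(1)
d, eps = 5, 0.01                                  # need d * eps < 1

A = 1.5 * np.eye(d) + 0.1 * np.ones((d, d))       # a positive definite A with A > I
Delta = rng.uniform(-eps, eps, size=(d, d))
Delta = (Delta + Delta.T) / 2                     # symmetric, |Delta_ij| <= eps
A_plus = A + Delta

phi = rng.normal(size=d)
phi = phi / np.linalg.norm(phi)                   # ||phi|| <= 1

lhs = abs(np.sqrt(phi @ np.linalg.inv(A_plus) @ phi)
          - np.sqrt(phi @ np.linalg.inv(A) @ phi))
rhs = np.sqrt(d * eps / (1 - d * eps))
print(lhs <= rhs, lhs, rhs)                       # the perturbation bound holds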

Next, we will try to prove that, with a proper choice of NN, Algorithm 2 will explore layer hh to satisfy the recursive induction condition.

Lemma E.12 (Random Matrix Estimation Error).

Denote Λπh=𝔼π[ϕϕ]\Lambda^{\pi}_{h}=\mathbb{E}_{\pi}[\phi\phi^{\top}]. Based on the same induction condition 2, we have:

|ϕ(Λπ)1ϕ(Λ^π)1|dε1dε,ϕ1\displaystyle|\|\phi\|_{(\Lambda^{\pi})^{-1}}-\|\phi\|_{(\widehat{\Lambda}^{\pi})^{-1}}|\leq\sqrt{\frac{d\varepsilon}{1-d\varepsilon}},\quad\forall\|\phi\|\leq 1
Proof.

Based on Lemma E.8, we have:

|ΛπijΛ^πij|h1Hεε\displaystyle|\Lambda^{\pi}_{ij}-\widehat{\Lambda}^{\pi}_{ij}|\leq\frac{h-1}{H}\varepsilon\leq\varepsilon

and as a result of Lemma E.11, we finish the proof. ∎

Lemma E.13.

Given a matrix AλIA\succeq\lambda I with λ>0\lambda>0, and ϕ\phi satisfies ϕ1\|\phi\|\leq 1, then we have:

ϕ(cI+nA)1ϕλ+cλn+cϕ(cI+A)1ϕ,n>1,c>0\displaystyle\phi^{\top}(cI+nA)^{-1}\phi\leq\frac{\lambda+c}{\lambda n+c}\phi^{\top}(cI+A)^{-1}\phi,\quad\forall n>1,c>0
Proof.

Because AλIA\succeq\lambda I, we have

cI+nA=\displaystyle cI+nA= cI+c(n1)λ+cA+(nc(n1)λ+c)A(1+λ(n1)λ+c)cI+(nc(n1)λ+c)A\displaystyle cI+\frac{c(n-1)}{\lambda+c}A+(n-\frac{c(n-1)}{\lambda+c})A\succeq\Big{(}1+\frac{\lambda(n-1)}{\lambda+c}\Big{)}cI+(n-\frac{c(n-1)}{\lambda+c})A
=\displaystyle= λn+cλ+c(cI+A)\displaystyle\frac{\lambda n+c}{\lambda+c}\Big{(}cI+A\Big{)}

Therefore,

\phi^{\top}(cI+nA)^{-1}\phi\leq\frac{\lambda+c}{\lambda n+c}\phi^{\top}(cI+A)^{-1}\phi ∎
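The inequality of Lemma E.13 can also be checked numerically; the sketch below draws a hypothetical A\succeq\lambda I and a unit vector \phi and verifies the claim.

import numpy as np

rng = np.random.default_rng(2)
d, c, lam, n = 4, 1.0, 0.3, 7

G = rng.normal(size=(d, d))
A = G @ G.T + lam * np.eye(d)                     # A >= lam * I

phi = rng.normal(size=d)
phi = phi / np.linalg.norm(phi)

lhs = phi @ np.linalg.inv(c * np.eye(d) + n * A) @ phi
rhs = (lam + c) / (lam * n + c) * (phi @ np.linalg.inv(c * np.eye(d) + A) @ phi)
print(lhs <= rhs, lhs, rhs)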

Lemma E.14.

Given a matrix A\succeq 0, suppose \max_{\pi}\mathbb{E}_{\pi}[\|\phi\|_{(cI+A)^{-1}}]\leq\widetilde{\varepsilon}\leq\nu^{2}_{\min}/(2\sqrt{c}), where c>0 is a constant and \nu_{\min} is defined in Definition 4.3. Then we have:

maxπ𝔼π[ϕ(cI+nA)1]c+1c(c+n)ε~\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\|\phi\|_{(cI+nA)^{-1}}]\leq\sqrt{\frac{c+1}{\sqrt{c}(c+n)}\widetilde{\varepsilon}}
Proof.

Because ϕ(cI+A)1ϕ(cI)11/c\|\phi\|_{(cI+A)^{-1}}\leq\|\phi\|_{(cI)^{-1}}\leq 1/\sqrt{c}, we must have:

maxπTr((cI+A)1𝔼π[ϕϕ])=maxπ𝔼π[ϕ2(cI+A)1]1cmaxπ𝔼π[ϕ(cI+A)1]ε~c\displaystyle\max_{\pi}Tr((cI+A)^{-1}\mathbb{E}_{\pi}[\phi\phi^{\top}])=\max_{\pi}\mathbb{E}_{\pi}[\|\phi\|^{2}_{(cI+A)^{-1}}]\leq\frac{1}{\sqrt{c}}\max_{\pi}\mathbb{E}_{\pi}[\|\phi\|_{(cI+A)^{-1}}]\leq\frac{\widetilde{\varepsilon}}{\sqrt{c}}

Consider the SVD of A=UΣUA=U^{\top}\Sigma U with Σ=(σii)i=1,,d\Sigma=(\sigma_{ii})_{i=1,...,d} and U=[u1,u2,ud]U=[u_{1},u_{2}...,u_{d}], then we have:

π,ε~cTr((cI+A)1𝔼π[ϕϕ])=Tr((cI+Σ)1U𝔼π[ϕϕ]U)=i=1d𝔼π[|ϕui|2]c+σii.\displaystyle\forall\pi,\quad\frac{\widetilde{\varepsilon}}{\sqrt{c}}\geq Tr((cI+A)^{-1}\mathbb{E}_{\pi}[\phi\phi^{\top}])=Tr((cI+\Sigma)^{-1}U^{\top}\mathbb{E}_{\pi}[\phi\phi^{\top}]U)=\sum_{i=1}^{d}\frac{\mathbb{E}_{\pi}[|\phi^{\top}u_{i}|^{2}]}{c+\sigma_{ii}}.

According to the Definition 4.3, we have:

ε~cmaxπ𝔼π[|ϕui|2]c+σiiν2minc+σii,i[d]\displaystyle\frac{\widetilde{\varepsilon}}{\sqrt{c}}\geq\frac{\max_{\pi}\mathbb{E}_{\pi}[|\phi^{\top}u_{i}|^{2}]}{c+\sigma_{ii}}\geq\frac{\nu^{2}_{\min}}{c+\sigma_{ii}},\quad\forall i\in[d]

which implies that

σiicν2minε~cc,i[d]\displaystyle\sigma_{ii}\geq\frac{\sqrt{c}\nu^{2}_{\min}}{\widetilde{\varepsilon}}-c\geq c,\quad\forall i\in[d]

By applying Lemma E.13 with \lambda=c, we have:

maxπ𝔼π[ϕ(cI+nA)1]maxπ𝔼π[ϕ2(cI+nA)1]maxπ2cc+cn𝔼π[ϕ2(cI+A)1]2c(1+n)ε~\displaystyle\max_{\pi}\mathbb{E}_{\pi}[\|\phi\|_{(cI+nA)^{-1}}]\leq\max_{\pi}\sqrt{\mathbb{E}_{\pi}[\|\phi\|^{2}_{(cI+nA)^{-1}}]}\leq\max_{\pi}\sqrt{\frac{2c}{c+cn}\mathbb{E}_{\pi}[\|\phi\|^{2}_{(cI+A)^{-1}}]}\leq\sqrt{\frac{2}{\sqrt{c}(1+n)}\widetilde{\varepsilon}}

E.8 More about our reachability coefficient

Recall that the definition of the reachability coefficient in (Zanette et al., 2020) is:

minh[H]minθ=1maxπ|𝔼π[ϕh]θ|\min_{h\in[H]}\min_{\|\theta\|=1}\max_{\pi}|\mathbb{E}_{\pi}[\phi_{h}]^{\top}\theta|

It is easy to see that, for arbitrary \theta with \|\theta\|=1, we have

maxπ𝔼π[(ϕhθ)2]maxπ𝔼π[|ϕhθ|]maxπ|𝔼π[ϕhθ]|\displaystyle\max_{\pi}\sqrt{\mathbb{E}_{\pi}[(\phi^{\top}_{h}\theta)^{2}]}\geq\max_{\pi}\mathbb{E}_{\pi}[|\phi^{\top}_{h}\theta|]\geq\max_{\pi}|\mathbb{E}_{\pi}[\phi^{\top}_{h}\theta]|

Therefore,

νmin=minh[H]νh=minh[H]minθ=1maxπ𝔼π[(ϕhθ)2]minh[H]minθ=1maxπ|𝔼π[ϕhθ]|\displaystyle\nu_{\min}=\min_{h\in[H]}\nu_{h}=\min_{h\in[H]}\min_{\|\theta\|=1}\max_{\pi}\sqrt{\mathbb{E}_{\pi}[(\phi^{\top}_{h}\theta)^{2}]}\geq\min_{h\in[H]}\min_{\|\theta\|=1}\max_{\pi}|\mathbb{E}_{\pi}[\phi^{\top}_{h}\theta]|

Besides, according to the min-max theorem, \nu_{h} is also lower bounded by \sqrt{\max_{\pi}\sigma_{\min}(\mathbb{E}_{\pi}[\phi_{h}\phi_{h}^{\top}])}; to see this,

maxπσmin(𝔼π[ϕhϕh])=maxπminθ=1θ𝔼π[ϕhϕh]θ=maxπminθ=1𝔼π[(ϕθ)2]minθ=1maxπ𝔼π[(ϕθ)2]\displaystyle\max_{\pi}\sigma_{\min}(\mathbb{E}_{\pi}[\phi_{h}\phi_{h}^{\top}])=\max_{\pi}\min_{\|\theta\|=1}\theta^{\top}\mathbb{E}_{\pi}[\phi_{h}\phi_{h}^{\top}]\theta=\max_{\pi}\min_{\|\theta\|=1}\mathbb{E}_{\pi}[(\phi^{\top}\theta)^{2}]\leq\min_{\|\theta\|=1}\max_{\pi}\mathbb{E}_{\pi}[(\phi^{\top}\theta)^{2}]

In fact, the value of \max_{\pi}\sigma_{\min}(\mathbb{E}_{\pi}[\phi_{h}\phi_{h}^{\top}]) is also related to the "Well-Explored Dataset" assumption in much of the previous literature on the offline setting (Jin et al., 2021b), where it is assumed that there exists a behavior policy such that the minimum singular value of the covariance matrix is lower bounded. Therefore, we can conclude that our reachability coefficient \nu_{\min} is also lower bounded by, e.g., \underline{c} in Corollary 4.6 in (Jin et al., 2021b).
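The relations above can be illustrated on a toy example. In the sketch below, the candidate policies, the feature vectors and their distributions are all hypothetical, and the minimum over \theta is approximated by random sampling of unit vectors, so the computed quantities are only approximations of \nu_{h} and the two lower bounds.

import numpy as np

rng = np.random.default_rng(3)
d = 3
# Hypothetical toy model of layer h: 4 candidate policies, each inducing a
# distribution over 5 possible feature vectors (rows of Phi, scaled to norm 1).
Phi = rng.normal(size=(5, d))
Phi = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
dists = rng.dirichlet(np.ones(5), size=4)

means = dists @ Phi                               # E_pi[phi] for each policy
seconds = [(Phi.T * w) @ Phi for w in dists]      # E_pi[phi phi^T] for each policy

# Approximate the minimum over unit vectors theta by random sampling.
thetas = rng.normal(size=(5000, d))
thetas = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)

nu_h = min(max(np.sqrt(t @ M @ t) for M in seconds) for t in thetas)
zanette = min(max(abs(m @ t) for m in means) for t in thetas)
sigma_lb = np.sqrt(max(np.linalg.eigvalsh(M)[0] for M in seconds))

print(f"nu_h ~ {nu_h:.3f}, Zanette et al. coefficient ~ {zanette:.3f}, "
      f"sqrt(max_pi sigma_min) ~ {sigma_lb:.3f}")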

Appendix F Extended Deployment-Efficient RL Setting

F.1 Sample-Efficient DE-RL

In applications such as recommendation systems, the value of N cannot exceed the number of users our system serves during a period of time. Therefore, as an interesting extension of our framework, we can revise constraint (b) in Definition 2.1 and explicitly assign an upper bound for N. Concretely, we may consider the following alternative: (b') The sample size N\leq d^{c_{1}}H^{c_{2}}\varepsilon^{-c_{3}}\log^{c_{4}}\frac{dH}{\varepsilon\delta}, where c_{1},c_{2},c_{3},c_{4}>0 are constants fixed according to the real situation. Under this revised constraint, the lower bound for K may be different.

In fact, given a constraint of the form N\leq N_{0}, our results in Section 4 already imply an upper bound for K, since we can emulate one deployment of our algorithm that uses a large N>N_{0} by deploying the same policy \lceil N/N_{0}\rceil times, as sketched below. However, this may result in sub-optimal deployment complexity, since we are not adaptively updating our policy within those \lceil N/N_{0}\rceil deployments. It would be an interesting open problem to identify the fine-grained trade-off between K and N_{0} in such a setting.
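A minimal sketch of this emulation argument is given below; the deploy(policy, n) interface is hypothetical and only illustrates how one large batch is split into \lceil N/N_{0}\rceil capped deployments.

import math

def emulate_large_batch(deploy, policy, N, N_0):
    # `deploy(policy, n)` is a hypothetical interface returning n trajectories.
    # One "logical" deployment of size N > N_0 is emulated by re-deploying the
    # same (unchanged) policy ceil(N / N_0) times under the per-deployment cap.
    data, k = [], math.ceil(N / N_0)
    for _ in range(k):
        data.extend(deploy(policy, min(N_0, N - len(data))))
    return data, k  # k deployments are consumed instead of 1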

F.2 Safe DE-RL

Monotonic Policy Improvement Constraint

In many applications, improvement of service quality after policy update is highly desired, and can be incorporated in our formulation by adding an additional constraint into Def 2.1:

(c)J(πi+1)J(πi)ε.\displaystyle(c)\quad J(\pi_{i+1})\geq J(\pi_{i})-\varepsilon. (35)

Because we require the deployed policy to have substantial probability of visiting unfamiliar states so that the agent can identify the near-optimal policy, as shown in Def 2.1-(a), we relax strict policy improvement with a small budget term \varepsilon>0 for exploration.

Trade-off between Pessimism and Optimism

1 for k=1,2,…,K do
2       D={D1,D2,,Dk}D=\{D_{1},D_{2},...,D_{k}\}
3       πk,pessimPessimismBased_OfflineAlgorithm(D)\pi_{k,\mathrm{pessim}}\leftarrow{\rm PessimismBased\_OfflineAlgorithm}(D)
4       πk,optimOptimismBased_BatchExplorationStrategy(D)\pi_{k,\mathrm{optim}}\leftarrow{\rm OptimismBased\_BatchExplorationStrategy}(D)
5       // Mix policy in trajectory level, i.e. w.p. 1α1-\alpha, τπpessim\tau\sim\pi_{\mathrm{pessim}}; w.p. α\alpha, τπoptim\tau\sim\pi_{\mathrm{optim}}
6       πk(1α)πk,pessim+απk,optim\pi_{k}\leftarrow(1-\alpha)\pi_{k,\mathrm{pessim}}+\alpha\pi_{k,\mathrm{optim}}
7       Dk={τnπk,n[N]}D_{k}=\{\tau_{n}\sim\pi_{k},\forall n\in[N]\}
8      
9 end for
Algorithm 7 Mixture Policy Strategy

The need to balance two contradictory constraints, (a) and (c), implies that a proper algorithm should leverage both pessimism and optimism in the face of uncertainty. In Algorithm 7, we propose a simple but effective mixture policy strategy, where we treat the pessimistic and optimistic algorithms as black boxes and mix the learned policies with a coefficient \alpha. One key property of the mixed policy is that:

Property F.1 (Policy improvement for mixture policy).
J(πk)J(πk1)J(πk,pessim)J(πk1)O(α)\displaystyle J(\pi_{k})-J(\pi_{k-1})\geq J(\pi_{k,\mathrm{pessim}})-J(\pi_{k-1})-O(\alpha) (36)

As a result, as long as the offline algorithm (which we treat as a black box here) has some policy improvement guarantee, such as (Kumar et al., 2020; Liu et al., 2020; Laroche et al., 2019), Eq.(36) implies a substantial policy improvement if \alpha is chosen appropriately. Besides, if we use the algorithms in Section 4.1 or 4.2 and collect \widetilde{N}=\Theta(N/\alpha) samples, the guarantees in Theorem 4.1 and Theorem 4.4 can be extended correspondingly. Therefore, Alg 7 will return a near-optimal policy after K deployments while satisfying the safety constraint (c) in (35).
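To make the trajectory-level mixing in Algorithm 7 concrete, the sketch below treats the two learners as black boxes; all three callables are hypothetical interfaces rather than components defined in this paper.

import random

def mixture_policy_strategy(pessimistic_offline, optimistic_exploration,
                            sample_trajectory, K, N, alpha):
    # The three callables are hypothetical interfaces: the first two map a
    # dataset to a policy, and sample_trajectory(pi) deploys pi for one episode.
    D = []                                        # D = {D_1, ..., D_k}
    for k in range(K):
        pi_pessim = pessimistic_offline(D)
        pi_optim = optimistic_exploration(D)
        # Trajectory-level mixture pi_k = (1 - alpha) pi_pessim + alpha pi_optim:
        # with prob. 1 - alpha a whole trajectory follows pi_pessim,
        # with prob. alpha it follows pi_optim.
        D_k = [sample_trajectory(pi_optim if random.random() < alpha else pi_pessim)
               for _ in range(N)]
        D = D + D_k
    return D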