
On Well-posedness and Minimax Optimal Rates of Nonparametric $Q$-function Estimation in Off-policy Evaluation (Author names are sorted alphabetically.)

Xiaohong Chen (Cowles Foundation for Research in Economics, Yale University; [email protected]) and Zhengling Qi (Department of Decision Sciences, George Washington University; [email protected])
Abstract

We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation as a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that, under one mild condition, the NPIV formulation of $Q$-function estimation is well-posed in the sense of the $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining the $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posedness property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of the $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on the well-posedness and the minimax lower bounds are of independent interest for studying not only other nonparametric estimators of the $Q$-function but also efficient estimation of the value of any target policy in off-policy settings.

1 Introduction

In recent years, there has been a surge of interest in batch reinforcement learning (RL), which uses previously collected data to perform sequential decision making (Sutton & Barto, 2018) without interacting with the task environment or accessing a simulator. Batch RL techniques are especially attractive in high-stakes real-world application domains where accessing a simulator is too costly or infeasible, such as mobile health (Liao et al., 2018), robotics (Pinto & Gupta, 2016), digital marketing (Thomas et al., 2017) and precision medicine (Kosorok & Laber, 2019). Nevertheless, the batch setting still poses several theoretical challenges that hamper the generalizability of many RL algorithms in practice. Among them, one central challenge is the distributional mismatch between the data collecting process and the target distribution for evaluation (Levine et al., 2020).

Motivated by these challenges, we study the off-policy evaluation (OPE) problem, which is considered one of the fundamental problems in batch RL. The goal of OPE is to leverage pre-collected data generated by a so-called behavior policy to evaluate the performance (e.g., the value) of a new/target policy. In particular, we investigate theoretical properties of nonparametric estimation of the $Q$-function in the setting of infinite-horizon Markov decision processes (MDPs) with discounted rewards and continuous states and actions.

We make several important contributions to the existing literature. Motivated by the Bellman equation, we formulate $Q$-function estimation under the framework of a nonparametric instrumental variables (NPIV) model. We first show that, under mild regularity conditions, the NPIV formulation of $Q$-function estimation is well-posed in the sense of the $L^2$-measure of ill-posedness with respect to the data generating distribution. This justifies the valid use of the $L^2$-norm of the Bellman error/residual to measure the accuracy of $Q$-function estimation in the batch setting. Next, we derive minimax lower bounds for the convergence rates in sup-norm and in $L^2$-norm for the estimation of the $Q$-function and its derivatives. Thanks to the general well-posedness result, the lower bounds are shown to be the same as those for nonparametric regression estimation in the i.i.d. setting (Stone, 1982; Tsybakov, 2009). Thus nonparametric $Q$-function estimation can be as easy as nonparametric regression in terms of the worst-case rate. Using the NPIV formulation, we also propose sieve 2SLS estimators of the $Q$-function (and its derivatives) and establish their convergence rates in both sup-norm and $L^2$-norm. In particular, B-spline and wavelet 2SLS estimators are shown to achieve the sup-norm lower bound for the Hölder class of $Q$-functions (and their derivatives), and many more linear sieve (such as polynomial, cosine, spline, wavelet) 2SLS estimators are shown to achieve the $L^2$-norm lower bound for the Sobolev class of $Q$-functions (and their derivatives). Our results on $L^2$-norm convergence rates under mild conditions are particularly useful for obtaining efficient estimation and optimal inference on the value (i.e., the expectation of the $Q$-function) of a target policy. To the best of our knowledge, ours are the first minimax results for nonparametrically estimating the $Q$-function with continuous states and actions in the off-policy setting. The general results on the well-posedness and the minimax lower bounds (in sup-norm and in $L^2$-norm) are of independent interest for studying properties of other nonparametric estimators of the $Q$-function and the related estimators of the marginal importance weight (see, e.g., Liu et al. (2018)) in the off-policy setting.

1.1 Closely Related Work

Estimation of the $Q$-function for a fixed policy is a key building block of many RL algorithms. There is a growing literature on nonparametric estimation of the $Q$-function in the infinite-horizon off-policy setting; see some recent theoretical developments in Farahmand et al. (2016); Shi et al. (2020); Uehara et al. (2021) among many others. Specifically, Farahmand et al. (2016) established an $L^2$ error bound for the Bellman error of their $Q$-function estimator. Shi et al. (2020); Uehara et al. (2021) derived $L^2$-norm convergence rates and error bounds for their respective nonparametric $Q$-function estimators under a strong assumption that is essentially equivalent to restricting the discount factor $\gamma$ to be close to zero. Our well-posedness result implies that the $L^2$-norm convergence rates of their respective estimators remain valid without their strong assumption on the discount factor $\gamma$. See Section 3 and Remark 1 for more detailed discussions.

The connection of estimating the $Q$-function in the Bellman equation to instrumental variables estimation was, to the best of our knowledge, first pointed out by Bradtke & Barto (1996) for their celebrated least-squares temporal difference (LSTD) method for parametric models. Recently, the relation between nonparametric $Q$-function estimation and nonparametric instrumental variables (NPIV) estimation has also been observed in applied work (such as Chen et al. (2021)) and theoretical work (such as Duan et al. (2021), which focuses on the on-policy setting). The NPIV model has been extensively investigated in the econometric literature; see, e.g., Newey & Powell (2003); Ai & Chen (2003); Hall & Horowitz (2005); Blundell et al. (2007); Darolles et al. (2011); Chen & Reiss (2011); Chen & Christensen (2018) for earlier references. However, there is a subtle difference between nonparametric $Q$-function estimation and the generic NPIV problem. It is known that a generic NPIV model with continuous endogenous variables is a difficult ill-posed inverse problem in econometrics, but we show that estimation of a nonparametric $Q$-function with continuous states and actions can be well-posed under mild regularity conditions that are typically assumed in the batch RL literature. Our well-posedness result implies that nonparametric estimation and inference on OPE and related batch RL problems could be much simpler than the difficult ill-posed NPIV problems studied in the existing econometric literature.

The rest of the paper is organized as follows. Section 2 presents the framework of infinite-horizon MDPs and some necessary notation. In Section 3, we show that nonparametric $Q$-function estimation in sup-norm and in $L^2$-norm is well-posed. Section 4 establishes the minimax lower bounds for the rates of convergence for nonparametric estimation of the $Q$-function in sup-norm and in $L^2$-norm, respectively. In Section 5, we propose sieve 2SLS estimation of the $Q$-function and its derivatives; under some mild conditions, we establish their rates of convergence in both sup-norm and $L^2$-norm, which coincide with the lower bounds. Section 6 briefly concludes. Most proofs are given in the appendix.

2 Preliminaries and Notation

Consider a single trajectory $\{(S_{t},A_{t},R_{t})\}_{t\geq 0}$, where $(S_{t},A_{t},R_{t})$ denotes the state-action-reward triplet collected at time $t$. Let ${\cal S}$ and ${\cal A}$ be the state and action spaces, respectively. We assume both states and actions are continuous (the discrete and finite case is easier). A policy associated with this trajectory defines an agent's way of choosing an action at each decision time $t$. In this paper, we focus on using batch data to evaluate the performance of a stationary policy denoted by $\pi$, which is a function mapping from the state space ${\cal S}$ to a probability distribution over ${\cal A}$. In particular, $\pi(a\,|\,s)$ refers to the probability density of choosing action $a\in{\cal A}$ given the state value $s\in{\cal S}$. In addition, let ${\cal S}\times{\cal A}\subseteq\mathbb{R}^{d}$ for some $d\geq 2$, and let ${\cal B}({\cal S})$ be the family of Borel subsets of ${\cal S}$.

The main goal of this paper is to estimate the so-called $Q$-function of a target policy $\pi$ using the batch data. Specifically, given a stationary policy $\pi$ and any state-action pair $(s,a)\in{\cal S}\times{\cal A}$, we define the $Q$-function as

Q^{\pi}(s,a)=\sum_{t=0}^{+\infty}\gamma^{t}\mathbb{E}^{\pi}(R_{t}\,|\,S_{0}=s,A_{0}=a),

where $\mathbb{E}^{\pi}$ denotes the expectation assuming the actions are selected according to $\pi$, and $0\leq\gamma<1$ denotes a discount factor that balances the trade-off between immediate and future rewards. We consider the framework of a time-homogeneous MDP and hence make the following two assumptions, which are the foundation of many $Q$-function estimation methods.

Assumption 1

There exists a transition kernel $P$ such that for every $t\geq 1$, $s\in{\cal S}$, $a\in{\cal A}$ and any set $B\in{\cal B}({\cal S})$,

\Pr(S_{t+1}\in B\,|\,S_{t}=s,A_{t}=a,\{S_{j},A_{j},R_{j}\}_{0\leq j<t})=P(S_{t+1}\in B\,|\,S_{t}=s,A_{t}=a).

In addition, there exists a probability density function $q$ for the transition kernel $P$.

Assumption 2

For every $t\geq 0$, $R_{t}={\cal R}(S_{t},A_{t},S_{t+1})$, i.e., the reward is a measurable function of $(S_{t},A_{t},S_{t+1})$. In addition, there exists a finite constant $R_{\max}$ such that $|R_{t}|\leq R_{\max}$ for all $t\geq 0$.

Let $r(s,a)=\mathbb{E}\left[R_{t}\,|\,S_{t}=s,A_{t}=a\right]$ for every $t\geq 0$, $s\in{\cal S}$ and $a\in{\cal A}$. Assumption 2 implies that $|r(S_{t},A_{t})|\leq R_{\max}$ for all $t\geq 0$. We note that the uniformly bounded reward assumption is imposed for simplicity only, and can be replaced by assuming the existence of higher-order conditional moments of $R_{t}$ given $(S_{t},A_{t})$; see, e.g., Chen & Christensen (2015, 2018).

To estimate $Q^{\pi}$ under Assumptions 1 and 2, one approach is to solve the following Bellman equation:

Q^{\pi}(s,a)=\mathbb{E}\left[R_{t}+\gamma\int_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\,|\,S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})\,\text{d}a^{\prime}\,\Big{|}\,S_{t}=s,A_{t}=a\right], \qquad (1)

for any $t\geq 0$, $s\in{\cal S}$ and $a\in{\cal A}$. Throughout this paper, we assume the integration with respect to $\pi$ in (1) can be evaluated exactly as long as the integrand is known. In practice, one can use a Monte Carlo method to approximate this integral since the target policy $\pi$ is known.
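To make the Monte Carlo step concrete, here is a minimal Python sketch that approximates $\int_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\,|\,s^{\prime})Q(s^{\prime},a^{\prime})\,\text{d}a^{\prime}$ by averaging over draws from the known target policy; the function names (q_fn, sample_action) are illustrative placeholders, not notation from the paper.

```python
import numpy as np

def expected_q_under_pi(q_fn, s_next, sample_action, n_mc=100):
    """Monte Carlo approximation of the integral of pi(a'|s') Q(s', a') da'.

    q_fn          : callable (s, a) -> float, a candidate Q-function
    s_next        : the next state s'
    sample_action : callable s -> a, draws a' ~ pi(.|s') from the known
                    target policy (names are illustrative placeholders)
    """
    draws = [q_fn(s_next, sample_action(s_next)) for _ in range(n_mc)]
    return float(np.mean(draws))
```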

Now suppose the given batch data consist of $N$ trajectories, which correspond to $N$ independent and identically distributed copies of $\{(S_{t},A_{t},R_{t})\}_{t\geq 0}$. For $1\leq i\leq N$, the data collected from the $i$th trajectory are represented by $\{(S_{i,t},A_{i,t},R_{i,t},S_{i,t+1})\}_{0\leq t<T}$. We aim to leverage these batch data to estimate the $Q$-function of a target policy $\pi$. Before presenting our theoretical results and methods, we make one additional assumption on the data generating process. Let $\pi^{b}$ be a stationary policy, where $\pi^{b}(a\,|\,s)$ refers to the conditional probability density of choosing action $a$ given the state value $s$.

Assumption 3

The batch data ${\cal D}_{N}=\{(S_{i,t},A_{i,t},R_{i,t},S_{i,t+1})\}_{0\leq t<T,1\leq i\leq N}$ are generated by the policy $\pi^{b}$.

Assumptions 1-3 are standard in the batch RL literature. Note that in the literature the policy $\pi^{b}$ is often called the behavior policy and is typically different from the target policy $\pi$. Next, we introduce the average visitation probability measure. Let $q^{\pi^{b}}_{t}(s,a)$ be the marginal probability density of a state-action pair $(s,a)$ at decision point $t$ induced by the behavior policy $\pi^{b}$. Then the average visitation probability density across $T$ decision points is defined as

\bar{d}^{\pi^{b}}_{T}(s,a)=\frac{1}{T}\sum_{t=0}^{T-1}q^{\pi^{b}}_{t}(s,a).

The corresponding expectation with respect to $\bar{d}^{\pi^{b}}_{T}$ is denoted by $\overline{\mathbb{E}}$. We further let $q^{\pi}_{t}(s^{\prime},a^{\prime}\,|\,s,a)$ be the $t$-step visitation probability density induced by a policy $\pi$ at $(s^{\prime},a^{\prime})$ given an initial state-action pair $(s,a)\in{\cal S}\times{\cal A}$.

Notation: For generic sequences $\{\varpi(N)\}_{N\geq 1}$ and $\{\theta(N)\}_{N\geq 1}$, the notation $\varpi(N)\gtrsim\theta(N)$ (resp. $\varpi(N)\lesssim\theta(N)$) means that there exists a sufficiently large (resp. small) constant $c_{1}>0$ (resp. $c_{2}>0$) such that $\varpi(N)\geq c_{1}\theta(N)$ (resp. $\varpi(N)\leq c_{2}\theta(N)$). We write $\varpi(N)\asymp\theta(N)$ when $\varpi(N)\gtrsim\theta(N)$ and $\varpi(N)\lesssim\theta(N)$. For matrix and vector norms, we use $\|\bullet\|_{\ell_{q}}$ to denote either the vector $\ell_{q}$-norm or the operator norm induced by the vector $\ell_{q}$-norm, for $1\leq q<\infty$, when there is no confusion. $\lambda_{\min}(\bullet)$ and $\lambda_{\max}(\bullet)$ denote the minimum and maximum eigenvalues of a square matrix, respectively. For any random variable $X$, we use $L^{q}(X)$ to denote the class of all measurable functions with finite $q$-th moments for $1\leq q\leq\infty$; the $L^{q}$-norm is denoted by $\|\bullet\|_{L^{q}(X)}$. When there is no confusion about the underlying distribution, we also write it as $\|\bullet\|_{L^{q}}$ or $\|\bullet\|_{q}$. In particular, $\|\bullet\|_{\infty}$ denotes the sup-norm. In addition, we use big $O_{p}$ and small $o_{p}$ notation as is conventional. We often use $(S,A,R,S^{\prime})$ or $(S,A,S^{\prime})$ to represent a generic transition tuple, where the transition probability density is $q$. Lastly, we introduce the Hölder class of functions $g:{\cal X}\subseteq\mathbb{R}^{d}\rightarrow\mathbb{R}$ with smoothness $p>0$ as

\Lambda_{\infty}(p,L)\triangleq\left\{g\ \Big{|}\ \sup_{0\leq\|\alpha\|_{\ell_{1}}\leq\lfloor p\rfloor}\|\partial^{\alpha}g\|_{\infty}\leq L,\quad\sup_{\alpha:\|\alpha\|_{\ell_{1}}=\lfloor p\rfloor}\ \sup_{x,y\in{\cal X},x\neq y}\frac{\left|\partial^{\alpha}g(x)-\partial^{\alpha}g(y)\right|}{\|x-y\|_{\ell_{2}}^{p-\lfloor p\rfloor}}\leq L\right\},

where ${\cal X}={\cal S}\times{\cal A}\subset\mathbb{R}^{d}$ is a compact rectangular support with nonempty interior, $\lfloor p\rfloor$ denotes the largest integer no larger than $p$, $\alpha=(\alpha_{1},\alpha_{2},\cdots,\alpha_{d})$ is a non-negative multi-index, and

\partial^{\alpha}g(x)=\frac{\partial^{\|\alpha\|_{\ell_{1}}}g(x)}{\partial x_{1}^{\alpha_{1}}\partial x_{2}^{\alpha_{2}}\cdots\partial x_{d}^{\alpha_{d}}}.

We let $\Lambda_{2}(p,L)$ be the Sobolev space of smoothness $p$ with radius $L$ and support ${\cal X}$, where the underlying measure is the Lebesgue measure.
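For concreteness, with $d=2$ and the multi-index $\alpha=(1,2)$ (so $\|\alpha\|_{\ell_{1}}=3$), the derivative above reads

\partial^{\alpha}g(x)=\frac{\partial^{3}g(x)}{\partial x_{1}\,\partial x_{2}^{2}},

and this $\alpha$ enters the first supremum in $\Lambda_{\infty}(p,L)$ whenever $\lfloor p\rfloor\geq 3$.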

3 A Special Form of NPIV Models: Well-posedness

In this section, we formulate $Q$-function estimation under the framework of a nonparametric instrumental variables (NPIV) model, which has been extensively studied in econometrics (e.g., Ai & Chen (2003); Newey & Powell (2003); Blundell et al. (2007)). A generic NPIV model takes the form

Y=h_{0}(X)+U,\quad\text{with}\quad\mathbb{E}[U\,|\,W]=0, \qquad (2)

where $h_{0}$ is an unknown function to estimate, $X$ denotes the endogenous variables, $W$ denotes the instrumental variables, and $U$ represents a random error. Motivated by Equation (1), we consider the following special form of an NPIV model under Assumptions 1-3 for $Q$-function estimation:

R_{t}=h^{\pi}(S_{t},A_{t},S_{t+1};Q^{\pi})+U_{t},\quad\text{with}\quad\mathbb{E}[U_{t}\,|\,S_{t},A_{t}]=0 \qquad (3)

for $0\leq t\leq T-1$, where

h^{\pi}(s,a,s^{\prime};Q)=Q(s,a)-\gamma\int_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\,|\,s^{\prime})Q(s^{\prime},a^{\prime})\,\text{d}a^{\prime}.

We also write $h^{\pi}(s,a,s^{\prime};Q)$ as $h^{\pi}(Q)(s,a,s^{\prime})$ and $h_{0}^{\pi}=h^{\pi}(Q^{\pi})$ when there is no confusion. By requiring $\mathbb{E}[U_{t}\,|\,S_{t},A_{t}]=0$ for $0\leq t\leq T-1$, we recover the Bellman Equation (1). Therefore Model (3) can be used to estimate $Q^{\pi}$ nonparametrically, where $S_{t+1}$ can be understood as the endogenous variables and $(S_{t},A_{t})$ as the instrumental variables under the framework of the NPIV model. Let $L^{2}(S,A)$ be the space of functions square integrable against the probability measure with density $\bar{d}^{\pi^{b}}_{T}$, and $L^{2}(S,A,S^{\prime})$ the analogue against the probability measure with density $\bar{d}^{\pi^{b}}_{T}\times q$. Denote the conditional expectation operator by ${\cal T}:L^{2}(S,A,S^{\prime})\to L^{2}(S,A)$, i.e., for every $(s,a)\in{\cal S}\times{\cal A}$,

{\cal T}f(s,a)=\mathbb{E}[f(S,A,S^{\prime})\,|\,S=s,A=a],

and in particular,

{\cal T}h^{\pi}(Q)(s,a)=\mathbb{E}[h^{\pi}(S,A,S^{\prime};Q)\,|\,S=s,A=a].

3.1 Well-posedness in sup-norm

In this subsection, we show that $Q$-function estimation is in general well-posed in sup-norm, as established by the following lemma.

Lemma 1

For any discount factor $0\leq\gamma<1$ and any uniformly bounded function $Q$ defined over $({\cal S},{\cal A})$, the following inequalities hold:

\frac{1}{1+\gamma}\|h^{\pi}(Q-Q^{\pi})\|_{\infty}\leq\|Q-Q^{\pi}\|_{\infty}\leq\frac{1}{1-\gamma}\|{\cal T}h^{\pi}(Q-Q^{\pi})\|_{\infty}\leq\frac{1}{1-\gamma}\|h^{\pi}(Q-Q^{\pi})\|_{\infty}. \qquad (4)

Proof. It suffices to show that $\|Q-Q^{\pi}\|_{\infty}\leq\frac{1}{1-\gamma}\|{\cal T}h^{\pi}(Q-Q^{\pi})\|_{\infty}$; the other inequalities follow readily. Observe that

\|Q-Q^{\pi}\|_{\infty} \leq \|{\cal T}h^{\pi}(Q-Q^{\pi})\|_{\infty}+\gamma\|\mathbb{E}^{\pi}\left[(Q-Q^{\pi})(S^{\prime},A^{\prime})\,|\,S=\bullet,A=\bullet\right]\|_{\infty} \qquad (5)
\leq \|{\cal T}h^{\pi}(Q-Q^{\pi})\|_{\infty}+\gamma\|Q-Q^{\pi}\|_{\infty}, \qquad (6)

where the first line follows from the triangle inequality and the second from the fact that a conditional expectation is a sup-norm contraction. This immediately implies

\|Q-Q^{\pi}\|_{\infty} \leq \frac{1}{1-\gamma}\|{\cal T}h^{\pi}(Q-Q^{\pi})\|_{\infty}. \qquad (7)

 

Lemma 1 implies that to obtain the sup-norm rate of $\widehat{Q}^{\pi}$, it suffices to focus on $\|h^{\pi}(\widehat{Q}^{\pi}-Q^{\pi})\|_{\infty}$, the sup-norm of the so-called temporal difference error. One key reason such an inequality holds is that the Bellman operator is $\gamma$-contractive with respect to the sup-norm (from (5)-(6)). However, it is hard to develop an estimator that minimizes the sup-norm of the Bellman error in the batch setting so as to bound the sup-norm directly. Instead, most existing methods focus on minimizing the $L^2$-norm of the Bellman error. This motivates us to study well-posedness in $L^2$-norm below.

3.2 Well-posedness in L2L^{2}-norm

Lemma 1 in general may not hold for the $L^2$-norm with respect to the data generating process (e.g., $\bar{d}^{\pi^{b}}_{T}$) due to the distributional mismatch between the behavior policy and the target one, which is a fundamental barrier in analyzing the OPE problem, as discussed in the introduction. To characterize the difficulty of estimating $Q^{\pi}$ in $L^2$ under Model (3), we define an $L^2$-measure of ill-posedness as

\overline{\tau}=\sup_{Q\in L^{2}(S,A)}\frac{\|h^{\pi}(Q)\|_{L^{2}(S,A,S^{\prime})}}{\|{\cal T}h^{\pi}(Q)\|_{L^{2}(S,A)}}\,. \qquad (8)

It can be seen that $\overline{\tau}\geq 1$, and $\overline{\tau}$ could be arbitrarily large in general; it quantifies the level of ill-posedness in estimating $Q^{\pi}$. We impose the following mild assumption to ensure well-posedness in $L^2$-norm in the sense that $\overline{\tau}\lesssim 1$.

Assumption 4

(a) There exist positive constants $p_{\min}$ and $p_{1,\max}$ such that the average visitation probability density $\bar{d}^{\pi^{b}}_{T}$ satisfies $p_{\min}\leq\bar{d}^{\pi^{b}}_{T}(s,a)\leq p_{1,\max}$ for every $(s,a)\in{\cal S}\times{\cal A}$. (b) The target policy $\pi$ is absolutely continuous with respect to $\pi^{b}$, and $q^{\pi}(s^{\prime},a^{\prime}\,|\,s,a)\leq p_{2,\max}$ for some positive constant $p_{2,\max}$.

Let $p_{\max}=\max(p_{1,\max},p_{2,\max})$. The boundedness assumption on the data generating probability density in Assumption 4(a) is standard in classical nonparametric estimation; see, e.g., Huang et al. (1998); Chen & Christensen (2015). In our setting, the requirement that the average visitation probability density be uniformly bounded away from 0 is often called the coverage assumption, used frequently in the RL literature, e.g., Precup et al. (2000); Antos et al. (2008a); Kallus & Uehara (2019) among many others. We use this standard assumption because we consider an arbitrary target policy for OPE. It can be relaxed to so-called partial coverage if one is willing to impose some structural assumptions on $Q^{\pi}$; see recent studies in Duan et al. (2020); Xie et al. (2021); Agarwal et al. (2021); Uehara & Sun (2021). Assumption 4(b) imposes a mild identification condition on the target policy: it essentially states that our batch data are able to identify the value of the target policy. Lastly, we remark that when both ${\cal S}$ and ${\cal A}$ are discrete and finite, Assumption 4(b) is automatically satisfied because of Assumption 4(a). In the following, we use $\|\bullet\|_{2,\nu}$ to denote the $L^2$-norm with respect to a probability distribution/density $\nu$.
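As a purely illustrative diagnostic (not part of the theory), one can roughly gauge the coverage condition in Assumption 4(a) on batch data via a histogram estimate of the average visitation density; the sketch below assumes scalar states and actions and hypothetical pooled data arrays.

```python
import numpy as np

def coverage_diagnostic(states, actions, bins=10):
    """Histogram estimate of the average visitation density over S x A.

    states, actions : arrays of shape (N*T,) pooling all decision points,
                      assuming scalar states and actions (illustrative only).
    Returns rough analogues of p_min and p_max; a minimum near zero warns
    that Assumption 4(a) may fail on parts of S x A.
    """
    hist, _, _ = np.histogram2d(states, actions, bins=bins, density=True)
    return hist.min(), hist.max()
```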

We are now ready to present a key theorem of this paper, which not only can be used to establish the minimax-optimal sup-norm and $L^2$-norm rates for estimating $Q^{\pi}$, but also provides a foundation for many existing OPE estimators.

Theorem 1

For any policy $\pi$, discount factor $0\leq\gamma<1$, and any two functions $Q_{1}$ and $Q_{2}$ defined over $({\cal S},{\cal A})$ that are square integrable with respect to $\bar{d}_{T}^{\pi^{b}}$, under Assumptions 1, 3 and 4, the following inequalities hold:

\sqrt{\frac{p_{\min}}{p_{\max}}}(1-\gamma)\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}}\leq\|{\cal T}h^{\pi}(Q_{1}-Q_{2})\|_{2,\bar{d}_{T}^{\pi^{b}}}\leq\|h^{\pi}(Q_{1}-Q_{2})\|_{2,\bar{d}_{T}^{\pi^{b}}\times q}. \qquad (9)

In particular, the $L^2$ measure of ill-posedness satisfies

\overline{\tau}\lesssim\frac{\sqrt{p_{\max}\left(1+\frac{p_{\max}\gamma^{2}}{p_{\min}}\right)}}{\sqrt{p_{\min}}\,(1-\gamma)}\lesssim 1.

Proof. For the first statement of Theorem 1, it is enough to establish the first inequality; the second follows from Jensen's inequality. Let ${\cal I}$ be the identity operator and ${\cal P}^{\pi}$ be the operator such that ${\cal P}^{\pi}f(s,a)=\mathbb{E}^{\pi}\left[f(S_{t+1},A_{t+1})\,|\,S_{t}=s,A_{t}=a\right]$ for any $t\geq 0$. By induction, $({\cal P}^{\pi})^{k}f(s,a)=\mathbb{E}^{\pi}\left[f(S_{t+k},A_{t+k})\,|\,S_{t}=s,A_{t}=a\right]$. For an integer $\bar{t}$, to be specified later, we have

\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}}\leq\underbrace{\|\left({\cal I}-\gamma^{\bar{t}}({\cal P}^{\pi})^{\bar{t}}\right)\left(Q_{1}-Q_{2}\right)\|_{2,\bar{d}_{T}^{\pi^{b}}}}_{(I)}+\gamma^{\bar{t}}\underbrace{\|({\cal P}^{\pi})^{\bar{t}}\left(Q_{1}-Q_{2}\right)\|_{2,\bar{d}_{T}^{\pi^{b}}}}_{(II)}.

We first derive an upper bound for $(II)$. By Jensen's inequality,

\{(II)\}^{2} \leq \int_{s\in{\cal S},a\in{\cal A}}\mathbb{E}^{\pi}\left[(Q_{1}-Q_{2})^{2}(S_{\bar{t}},A_{\bar{t}})\,|\,S_{0}=s,A_{0}=a\right]\bar{d}_{T}^{\pi^{b}}(s,a)\,\text{d}s\,\text{d}a
= \int_{s\in{\cal S},a\in{\cal A}}\int_{s^{\prime}\in{\cal S},a^{\prime}\in{\cal A}}(Q_{1}-Q_{2})^{2}(s^{\prime},a^{\prime})q^{\pi}_{\bar{t}}(s^{\prime},a^{\prime}\,|\,s,a)\,\text{d}s^{\prime}\,\text{d}a^{\prime}\,\bar{d}_{T}^{\pi^{b}}(s,a)\,\text{d}s\,\text{d}a
= \int_{s^{\prime}\in{\cal S},a^{\prime}\in{\cal A}}(Q_{1}-Q_{2})^{2}(s^{\prime},a^{\prime})\widetilde{q}^{\pi^{b};\pi}_{T;\bar{t}}(s^{\prime},a^{\prime})\,\text{d}s^{\prime}\,\text{d}a^{\prime}
= \int_{s^{\prime}\in{\cal S},a^{\prime}\in{\cal A}}(Q_{1}-Q_{2})^{2}(s^{\prime},a^{\prime})\frac{\widetilde{q}^{\pi^{b};\pi}_{T;\bar{t}}(s^{\prime},a^{\prime})}{\bar{d}_{T}^{\pi^{b}}(s^{\prime},a^{\prime})}\bar{d}_{T}^{\pi^{b}}(s^{\prime},a^{\prime})\,\text{d}s^{\prime}\,\text{d}a^{\prime}
\leq \frac{p_{\max}}{p_{\min}}\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}}^{2},

where $\widetilde{q}^{\pi^{b};\pi}_{T;\bar{t}}(s^{\prime},a^{\prime})$ refers to the marginal probability density obtained by composing $\bar{d}_{T}^{\pi^{b}}$ with $q_{\bar{t}}^{\pi}$. The last equality holds because $\widetilde{q}^{\pi^{b};\pi}_{T;\bar{t}}$ is absolutely continuous with respect to $\bar{d}_{T}^{\pi^{b}}$ by Assumption 4. The last inequality is also given by Assumption 4, since $\widetilde{q}^{\pi^{b};\pi}_{T;\bar{t}}(s^{\prime},a^{\prime})=\overline{\mathbb{E}}\left[q^{\pi}_{\bar{t}}(s^{\prime},a^{\prime}\,|\,S,A)\right]\leq p_{\max}$ for every $(s^{\prime},a^{\prime})\in{\cal S}\times{\cal A}$ (as long as the one-step transition density is bounded above, the $\bar{t}$-step density is also bounded above). Now for any $\varepsilon>0$, we can choose $\bar{t}$ sufficiently large that

\gamma^{\bar{t}}\sqrt{p_{\max}/p_{\min}}\leq\varepsilon,

which implies that $\gamma^{\bar{t}}\times(II)\leq\varepsilon\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}}$. This further shows that

\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}}\leq(1-\varepsilon)^{-1}\times(I).

Next, we derive an upper bound for $(I)$. Let $g=({\cal I}-\gamma{\cal P}^{\pi})(Q_{1}-Q_{2})$. By an argument similar to the one above, we have

(I) = \|\left({\cal I}-\gamma{\cal P}^{\pi}+\gamma{\cal P}^{\pi}-\gamma^{2}({\cal P}^{\pi})^{2}+\cdots+\gamma^{\bar{t}-1}({\cal P}^{\pi})^{\bar{t}-1}-\gamma^{\bar{t}}({\cal P}^{\pi})^{\bar{t}}\right)\left(Q_{1}-Q_{2}\right)\|_{2,\bar{d}_{T}^{\pi^{b}}}
\leq \sum_{k=0}^{\bar{t}-1}\gamma^{k}\|({\cal P}^{\pi})^{k}({\cal I}-\gamma{\cal P}^{\pi})(Q_{1}-Q_{2})\|_{2,\bar{d}_{T}^{\pi^{b}}}
= \sum_{k=0}^{\bar{t}-1}\gamma^{k}\|({\cal P}^{\pi})^{k}g\|_{2,\bar{d}_{T}^{\pi^{b}}}
\leq \sum_{k=0}^{\bar{t}-1}\gamma^{k}\sqrt{\frac{p_{\max}}{p_{\min}}}\|g\|_{2,\bar{d}_{T}^{\pi^{b}}}
\leq \frac{1-\gamma^{\bar{t}}}{1-\gamma}\sqrt{\frac{p_{\max}}{p_{\min}}}\|g\|_{2,\bar{d}_{T}^{\pi^{b}}}.

Putting the two bounds together, we obtain

\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}} \leq \frac{(1-\varepsilon)^{-1}(1-\gamma^{\bar{t}})}{1-\gamma}\sqrt{\frac{p_{\max}}{p_{\min}}}\|{\cal T}h^{\pi}(Q_{1}-Q_{2})\|_{2,\bar{d}_{T}^{\pi^{b}}},

where we note that ${\cal T}h^{\pi}(Q_{1}-Q_{2})=g$. Since $\varepsilon$ is arbitrary, letting $\varepsilon\to 0$ yields

\|Q_{1}-Q_{2}\|_{2,\bar{d}_{T}^{\pi^{b}}}\leq\frac{1}{1-\gamma}\sqrt{\frac{p_{\max}}{p_{\min}}}\|{\cal T}h^{\pi}(Q_{1}-Q_{2})\|_{2,\bar{d}_{T}^{\pi^{b}}}.

It remains to show that $\overline{\tau}$ is bounded above. Note that for any $Q\in L^{2}(S,A)$,

\|h^{\pi}(Q)\|^{2}_{L^{2}(S,A,S^{\prime})} = \overline{\mathbb{E}}\left[\left(Q(S,A)-\gamma\int_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\,|\,S^{\prime})Q(S^{\prime},a^{\prime})\,\text{d}a^{\prime}\right)^{2}\right]
\lesssim 2\overline{\mathbb{E}}\left[(Q(S,A))^{2}\right]+\frac{2p_{\max}\gamma^{2}}{p_{\min}}\int Q^{2}(s,a)\bar{d}^{\pi^{b}}_{T}(s,a)\,\text{d}s\,\text{d}a
\lesssim \left(1+\frac{p_{\max}\gamma^{2}}{p_{\min}}\right)\|Q\|^{2}_{2,\bar{d}^{\pi^{b}}_{T}},

where the first inequality follows from the AM-GM and Jensen inequalities together with Assumption 4, noting that $\bar{d}^{\pi^{b}}_{T+1}(s,a)\lesssim p_{\max}$ for any $s\in{\cal S}$ and $a\in{\cal A}$. Then by the first inequality in (9), we can show that

\overline{\tau}\lesssim\frac{\sqrt{p_{\max}\left(1+\frac{p_{\max}\gamma^{2}}{p_{\min}}\right)}}{(1-\gamma)\sqrt{p_{\min}}},

which concludes our proof.    

Theorem 1 rigorously justifies the validity of using the $L^2$-norm to measure the Bellman error, which has been widely adopted in the existing literature for constructing various estimators of the $Q$-function. To see this, let $Q_{1}=Q^{\pi}$ and $Q_{2}=\widetilde{Q}$ in Theorem 1, where $\widetilde{Q}$ denotes some estimator of $Q^{\pi}$. Then the first inequality in (9) together with the Bellman equation (1) implies that

\|\widetilde{Q}-Q^{\pi}\|_{2,\bar{d}_{T}^{\pi^{b}}}\lesssim\|r+(\gamma{\cal P}^{\pi}-{\cal I})\widetilde{Q}\|_{2,\bar{d}_{T}^{\pi^{b}}},

where the right-hand side is called the Bellman error (or residual), and recall that $r$ is the reward function defined in Section 2. Therefore the $L^2$-norm of the Bellman error of any $Q$-function estimator provides a valid upper bound on the $L^2$ error of this estimator relative to the true $Q^{\pi}$. Many existing estimators, such as Antos et al. (2008b); Farahmand et al. (2016); Uehara & Jiang (2019); Feng et al. (2020), are indeed based on minimizing the $L^2$-norm of the Bellman error, so our Theorem 1 provides a theoretical guarantee for their procedures. Since Theorem 1 is established without imposing any restriction on the structure of the $Q$-function, it can be used to obtain $L^2$ error bounds for many different nonparametric estimators of the $Q$-function obtained via different models and/or methods such as LSTD, kernel methods or neural networks. For example, combining our Theorem 1 with Theorem 11 of Farahmand et al. (2016) immediately gives an $L^2$ error bound for their estimator relative to the true $Q^{\pi}$. Applying our Theorem 1 to Example 6 of Uehara et al. (2021), one can obtain an $L^2$ error bound for their neural network estimator of $Q^{\pi}$.
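As a sketch of how this bound is used in practice, the snippet below computes the empirical $L^2$-norm of the temporal-difference residual $R+\gamma\mathbb{E}^{\pi}[Q(S^{\prime},\cdot)]-Q(S,A)$ over the batch. Its square equals the squared Bellman error plus a noise-variance contribution, so it is a conservative surrogate for the right-hand side above; all function names are illustrative placeholders.

```python
import numpy as np

def td_residual_l2(q_fn, expected_q_next, batch, gamma):
    """Empirical L2 norm of the TD residual over transition tuples.

    batch           : iterable of (s, a, r, s_next) tuples from D_N
    q_fn            : callable (s, a) -> float, candidate Q-function
    expected_q_next : callable s' -> E_{a'~pi}[q_fn(s', a')], e.g. the
                      Monte Carlo routine sketched in Section 2
    """
    res = [r + gamma * expected_q_next(s_next) - q_fn(s, a)
           for (s, a, r, s_next) in batch]
    return float(np.sqrt(np.mean(np.square(res))))
```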

We also remark that the well-posedness result in Theorem 1 can be extended to other metrics such as the $L^1$-norm, based on which one may develop a new estimator of the $Q$-function by minimizing an empirical approximation of the $L^1$-norm of the Bellman error. We conjecture that such an estimator could be more robust than existing ones, especially when the reward distribution is heavy-tailed. Lastly, a very recent work (Wang et al., 2021) developed a sufficient and necessary condition for the well-posedness of the Bellman operator in $L^2$-norm; they also developed a condition similar to our Assumption 4 for establishing this well-posedness.

4 Minimax Lower Bounds

In this section, we establish minimax lower bounds in both sup-norm and $L^2$-norm for nonparametric estimation of the $Q$-function in the OPE problem. The well-posedness property essentially indicates that nonparametric $Q$-function estimation is as easy as classical nonparametric regression in the i.i.d. setting in terms of the worst-case rate.

Recall that by Theorem 1, under Assumptions 1, 3 and 4, for any square integrable function $Q$ defined over ${\cal S}\times{\cal A}$, we have

\sqrt{\frac{p_{\min}}{p_{\max}}}(1-\gamma)\|Q\|_{2,\bar{d}^{\pi^{b}}_{T}}\leq\|{\cal T}h^{\pi}(Q)\|_{2,\bar{d}^{\pi^{b}}_{T}}\leq\|h^{\pi}(Q)\|_{2,\bar{d}^{\pi^{b}}_{T}\times q}\lesssim\|Q\|_{2,\bar{d}^{\pi^{b}}_{T}}. \qquad (10)

Denote a generic transition tuple by $\{S_{i,t},A_{i,t},R_{i,t},S^{\prime}_{i,t}\}$, indexed by $(i,t)$. Then we have the following lower bound results for estimating $Q^{\pi}$ and its derivatives in sup-norm.

Theorem 2

Let $d^{\nu}$ be the average visitation probability density over ${\cal S}\times{\cal A}$ induced by some policy $\nu$ such that Assumption 4 holds with $\bar{d}^{\pi^{b}}_{T}$ and $\pi^{b}$ replaced by $d^{\nu}$ and $\nu$, respectively. Suppose the data ${\cal D}_{N}=\{S_{i,t},A_{i,t},R_{i,t},S^{\prime}_{i,t}\}_{1\leq i\leq N,0\leq t\leq T-1}$ are i.i.d. draws from Model (3), where the probability density of $(S_{i,t},A_{i,t})$ is $d^{\nu}$ with transition probability density $q$, and for every $0\leq t\leq T-1$ and $1\leq i\leq N$, $\mathbb{E}[U_{i,t}^{2}\,|\,S_{i,t},A_{i,t}]\geq\sigma^{2}$ for some positive constant $\sigma$. Then for any $0\leq\|\alpha\|_{\ell_{1}}<p$,

\liminf_{NT\rightarrow\infty}\inf_{\widehat{Q}}\sup_{Q\in\Lambda_{\infty}(p,L)}{\Pr}^{Q}\left(\|\partial^{\alpha}\widehat{Q}-\partial^{\alpha}Q\|_{\infty}\geq c(\log(NT)/NT)^{(p-\|\alpha\|_{\ell_{1}})/(2p+d)}\right)\geq c^{\prime}>0, \qquad (11)

for some constants $c$ and $c^{\prime}$, where $\inf_{\widehat{Q}}$ denotes the infimum over all estimators based on ${\cal D}_{N}$, and $\Pr^{Q}$ denotes the joint probability distribution of ${\cal D}_{N}$ with $h^{\pi}=h^{\pi}(Q)$ in Model (3).

The following theorem provides lower bound results in $L^2$-norm.

Theorem 3

Under the conditions of Theorem 2, for $0\leq\|\alpha\|_{\ell_{1}}<p$, we have

\liminf_{NT\rightarrow\infty}\inf_{\widehat{Q}}\sup_{Q\in\Lambda_{2}(p,L)}{\Pr}^{Q}\left(\|\partial^{\alpha}\widehat{Q}-\partial^{\alpha}Q\|_{2}\geq\bar{c}(NT)^{(\|\alpha\|_{\ell_{1}}-p)/(2p+d)}\right)\geq\bar{c}^{\prime}>0, \qquad (12)

for some constants $\bar{c}$ and $\bar{c}^{\prime}$.

As we can see from Theorems 2 and 3, the minimax lower bounds for the rates of estimating the $Q$-function and its derivatives are the same as those for nonparametric regression in the i.i.d. setting (Stone, 1982). To the best of our knowledge, these are the first lower bound results for nonparametrically estimating the $Q$-function and its derivatives in the infinite-horizon MDP. In the following section, we propose simple estimators that match these lower bounds.

5 Sieve 2SLS Estimation of QQ-function

Given the NPIV Model (3) as a reformulation of the Bellman equation, we now adopt ideas from, for example, Blundell et al. (2007); Chen & Christensen (2018) to construct a sieve 2SLS estimator of $Q^{\pi}$. Define two vectors of sieve basis functions as

\psi^{J}(s,a)=(\psi_{J1}(s,a),\cdots,\psi_{JJ}(s,a))^{\top}, \qquad (13)
b^{K}(s,a)=(b_{K1}(s,a),\cdots,b_{KK}(s,a))^{\top}, \qquad (14)

to model $Q^{\pi}$ and the space of instrumental variables, respectively. Let $\Psi_{J}=\text{closure}\{\psi_{J1},\ldots,\psi_{JJ}\}\subset L^{2}(S,A)$ and $B_{K}=\text{closure}\{b_{K1},\ldots,b_{KK}\}\subset L^{2}(S,A)$ denote the sieve spaces for $Q^{\pi}$ and the instrumental variables, respectively. Here the underlying probability measure of $L^{2}(S,A)$ is $\bar{d}^{\pi^{b}}_{T}$. Examples of basis functions include splines and wavelet bases (see more examples in Huang et al. (1998); Chen (2007)); the construction of wavelet bases can also be found in Appendix B. We remark that the numbers of basis functions $J$ and $K$ are allowed to grow with either $N$ or $T$, but we require $J\leq K\leq cJ$ for some $c\geq 1$. Due to the special structure of Model (3), it also makes sense to simply let $K=J$ and $\psi^{J}=b^{K}$.
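As a minimal illustration of (13)-(14), the sketch below builds a tensor-product polynomial basis for scalar $s$ and $a$ with $K=J$; B-splines or wavelets (as required by Theorem 4) would replace the monomials in practice.

```python
import numpy as np

def poly_sieve_basis(s, a, deg=3):
    """Tensor-product polynomial sieve psi^J(s, a) with J = (deg + 1)**2.

    A stand-in for the spline/wavelet bases used in the theory;
    scalar s and a are assumed for simplicity (illustrative only).
    """
    return np.array([s**i * a**j
                     for i in range(deg + 1) for j in range(deg + 1)])
```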

Additionally, we let $\psi^{J}_{\pi}(s)=(\int_{a\in{\cal A}}\pi(a\,|\,s)\psi_{J1}(s,a)\,\text{d}a,\cdots,\int_{a\in{\cal A}}\pi(a\,|\,s)\psi_{JJ}(s,a)\,\text{d}a)^{\top}$. Correspondingly, sample versions of these quantities are defined as

\Psi=\left(\psi^{J}(S_{1,0},A_{1,0}),\cdots,\psi^{J}(S_{N,T-1},A_{N,T-1})\right)^{\top}\in\mathbb{R}^{(NT)\times J},
B=\left(b^{K}(S_{1,0},A_{1,0}),b^{K}(S_{1,1},A_{1,1}),\cdots,b^{K}(S_{N,T-1},A_{N,T-1})\right)^{\top}\in\mathbb{R}^{(NT)\times K},
G_{\pi}=\left(\psi^{J}_{\pi}(S_{1,1}),\psi^{J}_{\pi}(S_{1,2}),\cdots,\psi^{J}_{\pi}(S_{1,T}),\psi^{J}_{\pi}(S_{2,1}),\cdots,\psi^{J}_{\pi}(S_{N,T})\right)^{\top}\in\mathbb{R}^{(NT)\times J}.

For notational simplicity, let $\kappa_{\pi}^{J}(s,a,s^{\prime})=\psi^{J}(s,a)-\gamma\psi_{\pi}^{J}(s^{\prime})$ and, correspondingly, $\Gamma_{\pi}=\Psi-\gamma G_{\pi}$. We also denote each element of $\kappa_{\pi}^{J}(s,a,s^{\prime})\in\mathbb{R}^{J}$ by $\kappa_{Jj}^{\pi}(s,a,s^{\prime})=\psi_{Jj}(s,a)-\gamma\int_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid s^{\prime})\psi_{Jj}(s^{\prime},a^{\prime})\,\text{d}a^{\prime}$. Then the sieve 2SLS estimator of $Q^{\pi}$ can be constructed as

\widehat{Q}^{\pi}(s,a)=\psi^{J}(s,a)^{\top}\widehat{c},\quad\text{with}\quad\widehat{c}=\left[\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}\Gamma_{\pi}\right]^{-}\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}\mathbf{R}, \qquad (15)

where $(Z)^{-}$ denotes the generalized inverse of a matrix $Z$ and $\mathbf{R}=(R_{1,0},R_{1,1},\cdots,R_{N,T-1})^{\top}\in\mathbb{R}^{NT\times 1}$. The corresponding estimator of the derivatives of $Q^{\pi}$ is denoted by $\partial^{\alpha}\widehat{Q}^{\pi}$ for any multi-index $\alpha$. Here $\widehat{c}$ can be understood as a minimizer of the following optimization problem:

\underset{c\in\mathbb{R}^{J}}{\text{minimize}}\quad\|B(B^{\top}B)^{-}B^{\top}(\mathbf{R}-\Gamma_{\pi}c)\|^{2}_{\ell_{2}}.

Note that the sieve 2SLS estimator in (15) becomes the solution of the modified Bellman residual minimization of Farahmand et al. (2016) when their function spaces are modeled by sieve spaces.
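The closed form (15) is a few lines of linear algebra. The sketch below assumes the matrices $\Psi$, $G_{\pi}$, $B$ and the reward vector $\mathbf{R}$ have been assembled from the batch data as above (e.g., with rows built from a basis such as the poly_sieve_basis sketch and its $\pi$-integrated counterpart); pseudo-inverses stand in for the generalized inverses.

```python
import numpy as np

def sieve_2sls(Psi, G_pi, B, R, gamma):
    """Sieve 2SLS coefficients c_hat from (15).

    Psi  : (NT, J) matrix with rows psi^J(S_it, A_it)
    G_pi : (NT, J) matrix with rows psi_pi^J(S_it')
    B    : (NT, K) instrument basis matrix, J <= K
    R    : (NT,)  reward vector
    Returns c_hat so that Q_hat(s, a) = psi^J(s, a) @ c_hat.
    """
    Gam = Psi - gamma * G_pi          # Gamma_pi = Psi - gamma * G_pi
    W = np.linalg.pinv(B.T @ B)       # (B'B)^-
    BtGam, BtR = B.T @ Gam, B.T @ R
    # c_hat = [Gam' B (B'B)^- B' Gam]^- Gam' B (B'B)^- B' R
    return np.linalg.pinv(BtGam.T @ W @ BtGam) @ (BtGam.T @ W @ BtR)
```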

5.1 Sieve measure of ill-posedness in NPIV

An important quantity related to the generic NPIV model (2) is the sieve $L^2$ measure of ill-posedness, which characterizes the difficulty of nonparametrically estimating $h_{0}$ by sieve estimation. A similar measure of ill-posedness can be defined under Model (3). Let $\Theta^{\pi}_{J}=\{h^{\pi}(Q)\in L^{2}(S,A,S^{\prime}):Q\in\Psi_{J}\}$. Adapting the sieve $L^2$ measure of ill-posedness of Blundell et al. (2007), we define an average sieve $L^2$ measure of ill-posedness across $T$ decision points under Model (3) as

\tau_{J}=\sup_{h\in\Theta^{\pi}_{J}:h\neq 0}\frac{\|h\|_{L^{2}(S,A,S^{\prime})}}{\|{\cal T}h\|_{L^{2}(S,A)}}\,. \qquad (16)

It can be seen that $\tau_{J}\geq 1$. Roughly, $\tau_{J}$ measures how much information is smoothed out by the conditional expectation operator ${\cal T}$ over the space $\Theta^{\pi}_{J}$. For a generic NPIV model (2), $\tau_{J}$ grows to infinity as $J$ goes to infinity; see, e.g., Blundell et al. (2007); Chen & Christensen (2018). By definition we have $\tau_{J}\leq\overline{\tau}\lesssim 1$ for all $J\geq 1$, so Theorem 1 directly implies that the NPIV Model (3) is also well-posed under the sieve $L^2$ measure of ill-posedness defined in (16). Based on this result, minimax-optimal sup-norm and $L^2$-norm rates for the sieve 2SLS estimator of $Q^{\pi}$ are established in the following subsections.
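For intuition, $\tau_{J}$ has a simple sample analogue in the spirit of the sieve NPIV literature (Chen & Christensen, 2018): the reciprocal of the smallest singular value of the orthonormalized cross-moment matrix. The plug-in estimator below is an illustrative diagnostic under that convention, not a result of this paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def tau_hat(B, Gam):
    """Plug-in estimate of the sieve measure of ill-posedness (16):
    1 / s_min( G_b^{-1/2} Sigma^pi G_kappa^{-1/2} ), with population
    matrices replaced by sample averages (illustrative diagnostic only).
    """
    nt = B.shape[0]
    G_b = B.T @ B / nt        # sample version of G_b
    G_k = Gam.T @ Gam / nt    # sample version of G_kappa^pi
    Sig = B.T @ Gam / nt      # sample version of Sigma^pi
    M = (np.linalg.pinv(np.real(sqrtm(G_b))) @ Sig
         @ np.linalg.pinv(np.real(sqrtm(G_k))))
    return 1.0 / np.linalg.svd(M, compute_uv=False).min()
```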

5.2 Sup-norm Convergence Rates

In this subsection, we establish the sup-norm convergence rate of $\widehat{Q}^{\pi}$ to $Q^{\pi}$. We first introduce an additional assumption on the data generating process.

Assumption 5

The stochastic process $\{S_{t},A_{t}\}_{t\geq 0}$ induced by the behavior policy $\pi^{b}$ is a stationary, exponentially $\bm{\beta}$-mixing stochastic process, i.e., the $\bm{\beta}$-mixing coefficient at time lag $k$ satisfies $\beta_{k}\leq\beta_{0}\exp(-\beta_{1}k)$ for some $\beta_{0}\geq 0$ and $\beta_{1}>0$. The induced stationary density is denoted by $d^{\pi^{b}}$.

Assumption 5 characterizes the temporal dependence among observations, because data modeled by an MDP are not i.i.d. and transition tuples are dependent. Most previous works assume that transition tuples are independent, which is stronger than this assumption. The $\bm{\beta}$-mixing coefficient at time lag $k$ basically means that the dependence between $\{S_{t},A_{t}\}_{t\leq j}$ and $\{S_{t},A_{t}\}_{t\geq(j+k)}$ decays to 0 at an exponential rate in $k$; see Bradley (2005) for the exact definition of exponential $\bm{\beta}$-mixing. A fast mixing rate is imposed here mainly for technical simplicity, and our sup-norm and $L^2$-norm convergence rates are not affected by the mixing coefficients. Indeed, Assumption 5 can be relaxed to stationarity with certain algebraic $\beta$-mixing, and by using the matrix Bernstein inequality for general $\beta$-mixing of Chen & Christensen (2015) one may obtain the same convergence rates as those in Theorems 4 and 5 below. We can also relax the strict stationarity assumption with some extra notation. Since this is not our focus, we do not impose the weakest possible assumptions on the temporal dependence in this paper. When both Assumptions 4 and 5 hold, the average visitation probability density $\bar{d}^{\pi^{b}}_{T}$ used in Assumption 4 becomes the induced stationary density $d^{\pi^{b}}$. We will omit $\nu$ in $\|\bullet\|_{2,\nu}$ when $\nu=d^{\pi^{b}}$. Throughout the remainder of this section, unless otherwise specified, $(S,A)$ has probability density $d^{\pi^{b}}$ and $(S,A,S^{\prime})$ has density $d^{\pi^{b}}\times q$.

Define $L_{2,h^{\pi}}(S,A,S^{\prime})=\left\{h^{\pi}(Q):Q\in L^{2}(S,A)\right\}$. Let $\Pi_{J}:L_{2,h^{\pi}}(S,A,S^{\prime})\to\Theta^{\pi}_{J}$ denote the mapping onto $\Theta^{\pi}_{J}$ given by $\Pi_{J}h_{0}^{\pi}=h^{\pi}(\overline{\Pi}_{J}Q^{\pi})$, where $\overline{\Pi}_{J}Q^{\pi}=\mathrm{arg}\min_{Q\in\Psi_{J}}\|Q^{\pi}-Q\|_{L^{2}(S,A)}$, and let $\Pi_{K}:L^{2}(S,A)\to B_{K}$ denote the $L^{2}(S,A)$ orthogonal projection onto $B_{K}$. Let $\widetilde{\Pi}_{J}h_{0}^{\pi}=\mathrm{arg}\min_{h\in\Theta^{\pi}_{J}}\|\Pi_{K}{\cal T}(h_{0}^{\pi}-h)\|_{L^{2}(S,A)}$ denote the sieve 2SLS projection of $h_{0}^{\pi}$ onto $\Theta^{\pi}_{J}$, and let $\Theta_{J,1}^{\pi}=\{h\in\Theta_{J}^{\pi}\,|\,\|h\|_{2}=1\}$. We make one additional assumption below to control the approximation error of the sieve bases.

Assumption 6

(a) $\sup_{h\in\Theta^{\pi}_{J,1}}\|(\Pi_{K}{\cal T}-{\cal T})h\|_{L^{2}(S,A)}=o_{J}(1)$, where $o_{J}(1)$ refers to a quantity that converges to 0 as $J\rightarrow\infty$; (b) $\|\widetilde{\Pi}_{J}(h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0})\|_{\infty}\leq C_{1}\times\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty}$ for some constant $C_{1}$.

Assumption 6(a) is a mild condition on approximating $\Theta^{\pi}_{J}$ by the sieve space $B_{K}$. For fixed $J$ (and $K$), $\sup_{h\in\Theta^{\pi}_{J,1}}\|(\Pi_{K}{\cal T}-{\cal T})h\|_{L^{2}(S,A)}$ can be interpreted as an inherent Bellman error (for a fixed policy $\pi$), which is widely used in the RL literature, for example in the analysis of fitted-Q iteration (see Assumption 4.2 of Agarwal et al. (2019)). Assumption 6(b) is also mild because $\|\widetilde{\Pi}_{J}(h_{0}^{\pi}-\Pi_{J}h_{0}^{\pi})\|_{L^{2}(S,A)}\leq\|h_{0}^{\pi}-\Pi_{J}h_{0}^{\pi}\|_{L^{2}(S,A)}$ holds automatically by the projection property; here we strengthen it to the sup-norm.

To derive the sup-norm convergence rate, following the proof of Chen & Christensen (2018), we split $\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}$ into two terms. Let $\widetilde{Q}^{\pi}(s,a)=\psi^{J}(s,a)^{\top}\widetilde{c}$ with $\widetilde{c}=[\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}\Gamma_{\pi}]^{-}\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}H_{0}$, where

H_{0}=\left(h_{0}^{\pi}(S_{1,0},A_{1,0},S_{1,1}),h_{0}^{\pi}(S_{1,1},A_{1,1},S_{1,2}),\ldots,h_{0}^{\pi}(S_{N,T-1},A_{N,T-1},S_{N,T})\right)^{\top}\in\mathbb{R}^{NT}.

Then by the triangle inequality, $\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}\leq\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}+\|Q^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}$. The first term $\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}$ can be interpreted as an estimation error, while the second term $\|Q^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}$ can be understood as an approximation error. Denote $G^{\pi}_{\kappa,J}=\mathbb{E}[\kappa^{J}_{\pi}(S,A,S^{\prime})\kappa^{J}_{\pi}(S,A,S^{\prime})^{\top}]$ and $e_{J}=\lambda_{\min}(G^{\pi}_{\kappa,J})$. Let

\zeta^{\pi}_{\kappa}=\sup_{s,a,s^{\prime}}\|[G^{\pi}_{\kappa,J}]^{-1/2}\kappa^{J}_{\pi}(s,a,s^{\prime})\|_{\ell_{2}}, \qquad G_{b}=\mathbb{E}[b^{K}(S,A)b^{K}(S,A)^{\top}],
\xi_{\psi}=\sup_{s,a}\|\psi^{J}(s,a)\|_{\ell_{1}}, \qquad \zeta_{b}=\sup_{s,a}\|G_{b}^{-1/2}b^{K}(s,a)\|_{\ell_{2}},

for each $J$ and $K$ (omitting the dependence on $J$ and $K$ for notational simplicity), and define $\zeta=\max\{\zeta_{b},\zeta^{\pi}_{\kappa}\}$. In the following lemma, we derive bounds for the two terms above.

Lemma 2

(1) Let Assumptions 1-5 and Assumption 6(a) hold. If $\zeta^{2}\sqrt{\log(NT)/NT}=O(1)$, then we have the following sup-norm bound on the estimation error:

\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}=O_{p}\Big{(}\xi_{\psi}\sqrt{(\log J)/(NTe_{J})}\Big{)}. \qquad (17)

(2) Let Assumptions 5-6 hold. If $\zeta^{2}\sqrt{\log(J)\log(NT)/NT}=O(1)$, then the approximation error can be controlled by

\|Q^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}=O_{p}(\|Q^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{\infty}). \qquad (18)

By examining the proof of Lemma 2, it is possible to derive finite-sample error bounds for both $\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}$ and $\|Q^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}$; we omit them for brevity.

Theorem 4

Let Assumptions 1-6 hold and $Q^{\pi}\in\Lambda_{\infty}(p,L)$ for some $L$. Suppose that the sieve space $\Psi_{J}$ is spanned by a B-spline or wavelet basis of Cohen et al. (1993) with regularity larger than $p$, and that $B_{K}$ is spanned by a wavelet, spline or cosine basis. If $J\sqrt{\log(J)\log(NT)/NT}=O(1)$, then we have:

\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}=O_{p}\big{(}J^{-p/d}+\sqrt{J(\log J)/(NT)}\big{)}. \qquad (19)

Further, by choosing $J\asymp(\frac{NT}{\log(NT)})^{d/(2p+d)}$ and assuming $2p>d$, we have for all $0\leq\|\alpha\|_{\ell_{1}}<p$,

\|\partial^{\alpha}\widehat{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{\infty}=O_{p}\left(\left(\frac{\log(NT)}{NT}\right)^{(p-\|\alpha\|_{\ell_{1}})/(2p+d)}\right). \qquad (20)

The smoothness parameter $p$ in Theorem 4 represents the smoothness level of the true $Q$-function and characterizes the size of the function class to which the true $Q$-function belongs. We require $2p>d$ in Theorem 4 mainly for deriving the sup-norm rate that achieves optimality over the Hölder class. Theorem 4 shows that, in terms of the batch data sample size $NT$, the sup-norm rate of our 2SLS estimator $\widehat{Q}^{\pi}$ for $Q^{\pi}$ is the same as the optimal one in classical nonparametric regression (Stone, 1982) using B-splines. The sup-norm convergence rate of $\widehat{Q}^{\pi}$ can be used to develop uniform confidence bands (UCBs) for the $Q$-function, using results in Chen & Christensen (2018) for example; such UCBs may be incorporated into the framework of pessimistic RL algorithms such as Jin et al. (2021); Xie et al. (2021). In addition, we can show that the dependence of the sup-norm bound on the discount factor $\gamma$ is of order $(1-\gamma)^{-3}$. Finally, the results on the sup-norm rates for estimating the derivatives of the $Q$-function may be useful for actor-critic algorithms such as Silver et al. (2014); Kallus & Uehara (2020); Xu et al. (2021).
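As a worked instance of (20) with $\alpha=0$: for $d=2$ (scalar state and scalar action) and Hölder smoothness $p=2$, so that $2p>d$ holds, Theorem 4 gives

\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}=O_{p}\left(\left(\frac{\log(NT)}{NT}\right)^{2/6}\right)=O_{p}\left(\left(\frac{\log(NT)}{NT}\right)^{1/3}\right),\quad\text{with }J\asymp\left(\frac{NT}{\log(NT)}\right)^{1/3}.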

5.3 L2L^{2}-norm Convergence Rates

In this subsection we present the $L^2$ convergence rates of our 2SLS estimator of the $Q$-function. We do not require Assumption 6(b), as the $L^2$-stability condition holds automatically.

Theorem 5

Let Assumptions 1-5 and 6(a) hold. If $\zeta\sqrt{\log(NT)\log(J)/NT}=o(1)$, then:

\|\widehat{Q}^{\pi}-Q^{\pi}\|_{2}=O_{p}\Big{(}\sqrt{J/(NT)}+\|Q^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}\Big{)}. \qquad (21)

If $Q^{\pi}\in\Lambda_{2}(p,L)$ with $p>0$, and $\Psi_{J}$ and $B_{K}$ are spanned by commonly used bases such as polynomials, trigonometric polynomials, splines or wavelets with regularity greater than $p$, then by choosing $J\asymp(NT)^{d/(2p+d)}$ we have, for all $0\leq\|\alpha\|_{\ell_{1}}<p$,

\|\partial^{\alpha}\widehat{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{2}=O_{p}\left(\left(NT\right)^{(\|\alpha\|_{\ell_{1}}-p)/(2p+d)}\right).

According to Theorem 5, the sieve 2SLS estimator $\widehat{Q}^{\pi}$ achieves the minimax-optimal $L^2$-norm convergence rate for $Q^{\pi}$ under conditions much weaker than those for the optimal sup-norm convergence rate. Let $F$ be a known marginal distribution. It is well known that one can estimate the value of a target policy $\pi$,

v^{\pi}=\int_{s\in{\cal S}}\left[\int_{a\in{\cal A}}\pi(a\,|\,S=s)Q^{\pi}(s,a)\,\text{d}a\right]F(\text{d}s),

by the following simple plug-in sieve 2SLS estimator

\widehat{v}^{\pi}=\int_{s\in{\cal S}}\left[\int_{a\in{\cal A}}\pi(a\,|\,S=s)\widehat{Q}^{\pi}(s,a)\,\text{d}a\right]F(\text{d}s).

Theorem 5 is particularly useful for establishing the asymptotic normality of $\sqrt{NT}(\widehat{v}^{\pi}-v^{\pi})$.
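A minimal sketch of the plug-in value estimate, replacing both integrals in $\widehat{v}^{\pi}$ with Monte Carlo draws; the sampler names are illustrative placeholders ($F$ and $\pi$ are known, and q_hat would come from the sieve 2SLS fit).

```python
import numpy as np

def value_estimate(q_hat, sample_initial_state, sample_action, n_mc=10000):
    """Plug-in estimate of v^pi: average q_hat(s, a) with s ~ F and
    a ~ pi(.|s), both integrals replaced by Monte Carlo (sketch only).
    """
    draws = []
    for _ in range(n_mc):
        s = sample_initial_state()                 # s ~ F (known)
        draws.append(q_hat(s, sample_action(s)))   # a ~ pi(.|s)
    return float(np.mean(draws))
```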

Remark 1

(1) Recently Shi et al. (2020) presented a sieve LSTD estimator of $Q^{\pi}$ and obtained its $L^2$-norm rate of convergence (see (E.46) in the appendix of their paper for details) under conditions including their Assumption (A3.) or a small discount factor $\gamma$ condition. They then applied their $L^2$-norm convergence rate to establish the $\sqrt{NT}$-asymptotic normality of the plug-in sieve LSTD estimator of the value $v^{\pi}$. Note that their sieve LSTD is a special case of our sieve 2SLS with $B_{K}=\Psi_{J}$ and $K=J$, and the sieve LSTD automatically satisfies our Assumption 6(a). Our Theorem 5 establishes the $L^2$-norm convergence rate for their sieve LSTD estimator without the need to impose the strong condition of a small discount factor $\gamma$. Thus we may require weaker conditions for establishing the asymptotic normality of the plug-in sieve 2SLS estimator of the value; we leave the details to the longer version of the paper. (2) In this paper, to obtain the optimal rates of convergence in $L^2$-norm (and sup-norm) of our sieve 2SLS estimator of $Q^{\pi}$, we assume strictly stationary data for simplicity. We note that Shi et al. (2020) did not impose this strict stationarity in their $L^2$-norm rate and asymptotic normality calculations. However, they needed to assume that the density of the initial state $S_{i,0}$ in the batch data is bounded away from 0 uniformly in $i$. It is possible to replace the strict stationarity condition in our Assumption 5 by imposing geometric ergodicity and using a truncation argument to obtain the same sup-norm and $L^2$-norm convergence rates for our sieve 2SLS estimator; we leave this for future work.

6 Conclusion

In this paper, we consider nonparametric estimation of the $Q$-function with continuous states and actions in the OPE setting. Under some mild conditions, we show that the NPIV model (3) for estimating the $Q$-function nonparametrically is well-posed in the sense of the $L^2$-measure of ill-posedness, bypassing the need to impose a strong condition on the discount factor $\gamma$ as in the recent literature. The well-posedness property effectively implies that the minimax lower bounds for nonparametric estimation of the $Q$-function coincide with those for nonparametric regression in sup-norm and in $L^2$-norm in the i.i.d. setting. Under mild sufficient conditions, we also establish that the sup-norm and $L^2$-norm rates of convergence of our proposed sieve 2SLS estimators of the $Q$-function achieve the lower bounds, and hence are minimax-optimal. These rate results are useful for optimal estimation of and inference on various functionals of the $Q$-function, such as the value, by plugging in our sieve 2SLS estimators. In particular, one can develop uniform confidence bands (UCBs) for the $Q$-function by slightly modifying the UCB results in Chen & Christensen (2018) for an NPIV function estimated via a spline or wavelet sieve 2SLS. We leave this to future work due to the length of the paper.

In this paper we focus on the direct method of using the Bellman equation to nonparametrically estimate the Q-function in the OPE setting. In the existing literature, there are two additional approaches to performing OPE. One uses the recently proposed marginal importance sampling for the infinite-horizon setting, such as Liu et al. (2018); Nachum et al. (2019); Xie et al. (2019); Uehara et al. (2020); Zhang, Dai, Li & Schuurmans (2020); Zhang, Liu & Whiteson (2020). The other approach combines the direct method and marginal importance sampling to construct so-called doubly robust estimators for the value of the target policy (see, e.g., Kallus & Uehara (2019); Tang et al. (2020); Shi et al. (2021), among many others). Our results on the well-posedness and the minimax lower bounds for Q-function estimation should be useful for establishing theoretical properties of these alternative approaches under conditions weaker than the existing ones. Finally, since OPE serves as the foundation of many RL algorithms, our results on Q-function estimation of a target policy can also be useful for other policy learning methods, such as those proposed in Ernst et al. (2005); Antos et al. (2008b); Le et al. (2019); Liao et al. (2020); Jin et al. (2021); Zanette et al. (2021). We leave the details to future work.

Appendix A Notation

Unless specified otherwise, for any transition tuple (S,A,S^{\prime}), the probability density of (S,A) is d^{\pi_{b}} and the probability density of S^{\prime} given (S,A) is q. In addition, \mathbb{E} refers to the expectation taken with respect to d^{\pi_{b}}. We recall the definitions of some quantities below, which will appear in our proofs.

\begin{array}{rcccl}G^{\pi}_{\kappa}&=&G^{\pi}_{\kappa,J}&=&\mathbb{E}[\kappa^{J}_{\pi}(S,A,S^{\prime})\kappa^{J}_{\pi}(S,A,S^{\prime})^{\top}]=\mathbb{E}[\Gamma_{\pi}^{\top}\Gamma_{\pi}/(NT)]\\ G_{b}&=&G_{b,K}&=&\mathbb{E}[b^{K}(S,A)b^{K}(S,A)^{\top}]=\mathbb{E}[B^{\top}B/(NT)]\\ G_{\psi}&=&G_{\psi,J}&=&\mathbb{E}[\psi^{J}(S,A)\psi^{J}(S,A)^{\top}]\\ \Sigma^{\pi}&=&\Sigma^{\pi}_{K,J}&=&\mathbb{E}[b^{K}(S,A)\kappa^{J}_{\pi}(S,A,S^{\prime})^{\top}]=\mathbb{E}[B^{\top}\Gamma_{\pi}/(NT)]\,.\end{array}

We assume that \Sigma^{\pi} has full column rank J. Denote e_{J}=\lambda_{\min}(G^{\pi}_{\kappa,J}). Let

\begin{aligned}
\zeta^{\pi}_{\kappa}&=\zeta^{\pi}_{\kappa,J}=\sup_{s,a,s^{\prime}}\|[G^{\pi}_{\kappa,J}]^{-1/2}\kappa^{J}_{\pi}(s,a,s^{\prime})\|_{\ell^{2}}, &\qquad\zeta_{b}&=\zeta_{b,K}=\sup_{s,a}\|G_{b}^{-1/2}b^{K}(s,a)\|_{\ell^{2}},\\
\xi^{\pi}_{\kappa}&=\xi^{\pi}_{\kappa,J}=\sup_{s,a,s^{\prime}}\|\kappa^{J}_{\pi}(s,a,s^{\prime})\|_{\ell^{1}}, &\qquad\xi_{\psi}&=\xi_{J}=\sup_{s,a}\|\psi^{J}(s,a)\|_{\ell^{1}},
\end{aligned}

for each J and K, and \zeta=\max\{\zeta_{b,K},\zeta^{\pi}_{\kappa,J}\}. Define

(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}=\left[[\Sigma^{\pi}]^{\top}G_{b}^{-1}\Sigma^{\pi}\right]^{-1}[\Sigma^{\pi}]^{\top}G_{b}^{-1/2},

and similarly for (\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}, where

\widehat{\Sigma}^{\pi}=\frac{B^{\top}\Gamma_{\pi}}{NT}\quad\text{and}\quad\widehat{G}_{b}=\frac{B^{\top}B}{NT}.
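To fix ideas, the notation above maps directly onto the sieve 2SLS computation. Below is a minimal numerical sketch (not the paper's code), assuming the basis matrices B (of size NT x K, with rows b^{K}(S_{i,t},A_{i,t})^{\top}), Gamma (of size NT x J, with rows \kappa^{J}_{\pi}(S_{i,t},A_{i,t},S^{\prime}_{i,t})^{\top}) and the reward vector \mathbf{R} have already been formed; the function name sieve_2sls_coef is hypothetical.

import numpy as np

def sieve_2sls_coef(B, Gamma, R):
    """Sieve 2SLS coefficient with instruments B and regressors Gamma.

    Computes c_hat = [Gamma'B(B'B)^- B'Gamma]^- Gamma'B(B'B)^- B'R,
    rewritten through Sigma_hat = B'Gamma/(NT) and G_hat_b = B'B/(NT).
    A sketch under the stated assumptions, not a definitive implementation.
    """
    NT = B.shape[0]
    G_hat_b = B.T @ B / NT             # \widehat{G}_b
    Sigma_hat = B.T @ Gamma / NT       # \widehat{\Sigma}^{\pi}
    Gb_pinv = np.linalg.pinv(G_hat_b)  # pseudo-inverse, as in (B'B)^-
    M = Sigma_hat.T @ Gb_pinv @ Sigma_hat
    c_hat = np.linalg.pinv(M) @ Sigma_hat.T @ Gb_pinv @ (B.T @ R / NT)
    return c_hat  # the Q-function estimate is psi^J(s,a)' c_hat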

Appendix B Lower bounds

In this section, the probability density of (S_{i,t},A_{i,t}) is d^{\nu} and the expectation is taken with respect to the density d^{\nu}.

B.1 Lower bounds for Sup-norm Rates

The proof mainly follows that of Theorem 3.2 in Chen & Christensen (2018). Consider the Gaussian reduced form of the NPIV model with known operator \mathcal{T}:

\begin{aligned}
R_{i,t}&=\mathcal{T}h^{\pi}_{0}(S_{i,t},A_{i,t})+U_{i,t}, &&(22)\\
U_{i,t}\mid(S_{i,t},A_{i,t})&\sim\mathcal{N}(0,\sigma^{2}(S_{i,t},A_{i,t})),
\end{aligned}

for 1\leq i\leq N and 0\leq t\leq T-1. Knowledge of the operator \mathcal{T} is equivalent to knowledge of the transition density q. By Lemma 1 of Chen & Reiss (2011), the minimax lower bound for Model (3) is no smaller than that for Model (22). In the following, we thus focus on Model (22) and make use of Theorem 2.5 of Tsybakov (2009).

We restrict \mathcal{S}\times\mathcal{A}=[0,1]^{d}. Let \{\widetilde{\phi}_{j,k,G},\widetilde{\psi}_{j,k,G}\}_{j,k,G} be a tensor-product wavelet basis of regularity larger than p for L^{2}([0,1]^{d}), where j is the resolution level, k=(k_{1},k_{2},\cdots,k_{d})\in\{0,1,\cdots,2^{j}-1\}^{d}, and G is a vector indicating which element of a Daubechies pair \{\phi,\psi\} is used. Note that \phi has support [-M+1,M] for some positive integer M. All these pairs are generated by CDV wavelets (Cohen et al., 1993). Following the proof of Chen & Christensen (2018), we consider a class of submodels around Q^{\pi}. In particular, for a given j, consider the wavelet space (\mathcal{S}\times\mathcal{A})_{j}, which consists of the 2^{jd} functions \{\widetilde{\psi}_{j,k,G}\}_{k\in\{0,\cdots,2^{j}-1\}^{d}} with G chosen as all \psi functions. For some constant r, consider \{\widetilde{\psi}_{j,k,G}\}_{k\in\{r,\cdots,2^{j}-1-M\}^{d}} as interior wavelets, with \widetilde{\psi}_{j,k,G}(s,a)=\Pi_{m=1}^{d-1}\psi_{j,k_{m}}(s_{m})\psi_{j,k_{d}}(a) for k=(k_{1},\cdots,k_{d})\in\{r,\cdots,2^{j}-1-M\}^{d}, where s=(s_{1},\cdots,s_{d-1})\in\mathcal{S}, a\in\mathcal{A} and \psi_{j,k_{m}}(\bullet)=2^{j/2}\psi(2^{j}(\bullet)-k_{m}) for 1\leq m\leq d. Then for sufficiently large j, there exists a set \mathcal{I}\subseteq\{r,\cdots,(2^{j}-M-1)\}^{d} of interior wavelets with \text{Card}(\mathcal{I})\gtrsim 2^{jd}, where \text{Card}(\bullet) refers to the cardinality, such that the supports of \widetilde{\psi}_{j,k_{1},G} and \widetilde{\psi}_{j,k_{2},G} are disjoint in at least one coordinate for all k_{1}\neq k_{2}\in\mathcal{I}. In addition, we have \text{Card}(\mathcal{I})\lesssim 2^{jd} by definition.

Then for any Q^{\pi}\in\Lambda_{\infty}(p,L) such that \|Q^{\pi}\|_{\Lambda_{\infty}^{p}}\leq L/2, where \|\bullet\|_{\Lambda_{\infty}^{p}} is the Besov norm, and for each i\in\mathcal{I}, define

Q^{\pi}_{i}=Q^{\pi}+c_{0}2^{-j(p+d/2)}\widetilde{\psi}_{j,i,G}.

Correspondingly, for every (s,a,s^{\prime}), let

h^{\pi}_{i}(s,a,s^{\prime})=h^{\pi}_{0}(s,a,s^{\prime})+c_{0}2^{-j(p+d/2)}\Big(\widetilde{\psi}_{j,i,G}(s,a)-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,s^{\prime})\widetilde{\psi}_{j,i,G}(s^{\prime},a^{\prime})\,\text{d}a^{\prime}\Big),

where c_{0} is some positive constant to be specified later. It can be seen that for all i\in\mathcal{I},

\|c_{0}2^{-j(p+d/2)}\widetilde{\psi}_{j,i,G}\|_{\Lambda^{p}_{\infty}}\lesssim c_{0}.

Hence \|Q_{i}^{\pi}\|_{\Lambda^{p}_{\infty}}\leq L for sufficiently small c_{0}.
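The construction of h^{\pi}_{i} above mirrors that of Q^{\pi}_{i} because the map from a Q-function to the NPIV structural function is linear; explicitly (restating the relation behind the NPIV formulation, for the reader's convenience),

h^{\pi}(s,a,s^{\prime})=Q^{\pi}(s,a)-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,s^{\prime})\,Q^{\pi}(s^{\prime},a^{\prime})\,\text{d}a^{\prime},

so perturbing Q^{\pi} by c_{0}2^{-j(p+d/2)}\widetilde{\psi}_{j,i,G} perturbs h^{\pi}_{0} by the image of this wavelet under the same linear map, which is exactly the expression for h^{\pi}_{i} displayed above.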

For i\in\{0\}^{d}\cup\mathcal{I}, consider Model (22) with true function h^{\pi}_{i} and define the joint probability density of \{S_{j,t},A_{j,t},R_{j,t},S^{\prime}_{j,t}\}_{1\leq j\leq N,0\leq t\leq T-1} as P_{i} such that

P_{i}=\Pi_{j=1}^{N}\Pi_{t=0}^{T-1}d^{\nu}(S_{j,t},A_{j,t})\,p_{h_{i}^{\pi}}(R_{j,t}\,|\,S_{j,t},A_{j,t})\,q(S^{\prime}_{j,t}\,|\,S_{j,t},A_{j,t}),

recalling that these are i.i.d. samples, where p_{h_{i}^{\pi}} denotes the conditional density of the reward given a state-action pair. In particular, when i=\{0\}^{d}, Q^{\pi}_{i}=Q^{\pi} and h^{\pi}_{i}=h^{\pi}_{0}.

First of all, for sufficiently small c_{0}, we can show that

\|c_{0}2^{-j(p+d/2)}\widetilde{\psi}_{j,i,G}\|_{\Lambda^{p}_{\infty}}\lesssim c_{0}\leq L

for every i\in\mathcal{I}. In addition, by Equation (10), we have

\begin{aligned}
\sqrt{p_{\min}/p_{\max}}\,(1-\gamma)\,c_{0}2^{-j(p+d/2)}&\leq\Big\|\mathcal{T}c_{0}2^{-j(p+d/2)}\Big(\widetilde{\psi}_{j,i,G}(S,A)-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,S^{\prime})\widetilde{\psi}_{j,i,G}(S^{\prime},a^{\prime})\,\text{d}a^{\prime}\Big)\Big\|_{2} &&(23)\\
&\lesssim c_{0}2^{-j(p+d/2)}. &&(24)
\end{aligned}

Secondly, for every i\in\mathcal{I}, the Kullback–Leibler divergence K(P_{i},P_{0}) can be bounded as

\begin{aligned}
K(P_{i},P_{0})&\leq\frac{1}{2}c^{2}_{0}2^{-j(2p+d)}\sum_{m=1}^{N}\sum_{t=0}^{T-1}\mathbb{E}\left[\frac{\Big(\mathcal{T}\Big(\widetilde{\psi}_{j,i,G}(S_{m,t},A_{m,t})-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,S^{\prime}_{m,t})\widetilde{\psi}_{j,i,G}(S^{\prime}_{m,t},a^{\prime})\,\text{d}a^{\prime}\Big)\Big)^{2}}{\sigma^{2}(S_{m,t},A_{m,t})}\right] &&(25)\\
&\lesssim NTc^{2}_{0}2^{-j(2p+d)}, &&(26)
\end{aligned}

by the condition in Theorem 2. Choosing 2^{j}\asymp(NT/\log(NT))^{1/(2p+d)} gives

K(P_{i},P_{0})\lesssim c_{0}^{2}\log(NT),

and \log(\text{Card}(\mathcal{I}))\gtrsim j\gtrsim\log(NT)-\log\log(NT). So for sufficiently small c_{0} and large NT, K(P_{i},P_{0})\leq(1/8)\log(\text{Card}(\mathcal{I})) for every i\in\mathcal{I}.

Lastly, it can be seen that for i_{1},i_{2}\in\mathcal{I} with i_{1}\neq i_{2},

\begin{aligned}
\|\partial^{\alpha}Q^{\pi}_{i_{1}}-\partial^{\alpha}Q^{\pi}_{i_{2}}\|_{\infty}&=c_{0}2^{-j(p+d/2)}\|\partial^{\alpha}\widetilde{\psi}_{j,i_{1},G}-\partial^{\alpha}\widetilde{\psi}_{j,i_{2},G}\|_{\infty}\\
&\gtrsim 2c_{0}2^{-j(p+d/2)}2^{jd/2}2^{j\|\alpha\|_{\ell_{1}}}\|\psi^{|\alpha|}\|_{\infty}\\
&=2c_{0}2^{-j(p-\|\alpha\|_{\ell_{1}})}\|\psi^{|\alpha|}\|_{\infty},
\end{aligned}

where the inequality follows by recalling that the supports of \widetilde{\psi}_{j,k_{1},G} and \widetilde{\psi}_{j,k_{2},G} are disjoint in at least one coordinate for all k_{1}\neq k_{2}\in\mathcal{I}. Here \psi^{|\alpha|} refers to \Pi_{m=1}^{d}\partial^{\alpha_{m}}\psi.

Note that 2^{-j(p-\|\alpha\|_{\ell_{1}})}=(\log(NT)/NT)^{(p-\|\alpha\|_{\ell_{1}})/(2p+d)}. Then Theorem 2.5 of Tsybakov (2009) implies that for any 0\leq\|\alpha\|_{\ell_{1}}<p,

\liminf_{NT\rightarrow\infty}\inf_{\widehat{Q}}\sup_{Q\in\Lambda_{\infty}(p,L)}\text{Pr}^{Q}\left(\|\partial^{\alpha}\widehat{Q}-\partial^{\alpha}Q\|_{\infty}\geq c(\log(NT)/NT)^{(p-\|\alpha\|_{\ell_{1}})/(2p+d)}\right)\geq c^{\prime}>0, \qquad (27)

for some constants c and c^{\prime}.

B.2 Lower bounds for L^{2}-norm Rates

The proof mainly follows that of Theorem G.3 in Chen & Christensen (2018). Again we focus on Model (22) and apply Theorem 2.5 of Tsybakov (2009).

We restrict \mathcal{S}\times\mathcal{A}=[0,1]^{d} and use exactly the same tensor-product CDV wavelet construction as in Appendix B.1: the wavelet basis \{\widetilde{\phi}_{j,k,G},\widetilde{\psi}_{j,k,G}\}_{j,k,G} of regularity larger than p, the interior wavelets \widetilde{\psi}_{j,k,G}, and the index set \mathcal{I}\subseteq\{r,\cdots,(2^{j}-M-1)\}^{d} with \text{Card}(\mathcal{I})\asymp 2^{jd}, whose elements index wavelets with supports that are disjoint in at least one coordinate.

Now let Q^{\pi}\in\Lambda_{2}(p,L/2), where \|\bullet\|_{\Lambda_{2,2}^{p}} denotes the Sobolev norm with smoothness p. For each \theta=\{\theta_{i}\}_{i\in\mathcal{I}}, where \theta_{i}\in\{0,1\}, define

Q^{\pi}_{\theta}=Q^{\pi}+c_{0}2^{-j(p+d/2)}\sum_{i\in\mathcal{I}}\theta_{i}\widetilde{\psi}_{j,i,G}(s,a).

Correspondingly, let

h^{\pi}_{\theta}(s,a,s^{\prime})=h^{\pi}_{0}(s,a,s^{\prime})+c_{0}2^{-j(p+d/2)}\Big(\sum_{i\in\mathcal{I}}\theta_{i}\widetilde{\psi}_{j,i,G}(s,a)-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,s^{\prime})\sum_{i\in\mathcal{I}}\theta_{i}\widetilde{\psi}_{j,i,G}(s^{\prime},a^{\prime})\,\text{d}a^{\prime}\Big),

where c_{0} is some positive constant to be specified later. Based on this construction, there are 2^{\text{Card}(\mathcal{I})} possible values of \theta. It can be seen that for every \theta,

\|c_{0}2^{-j(p+d/2)}\sum_{i\in\mathcal{I}}\theta_{i}\widetilde{\psi}_{j,i,G}\|_{\Lambda^{p}_{2,2}}\lesssim c_{0}2^{-j(p+d/2)}\sqrt{\sum_{i\in\mathcal{I}}\theta_{i}^{2}2^{2jp}}\leq c_{0}.

Hence \|Q_{\theta}^{\pi}\|_{\Lambda^{p}_{2,2}}\leq L for sufficiently small c_{0}.

First of all, it can be seen that for every \theta_{1} and \theta_{2},

\begin{aligned}
\|\partial^{\alpha}Q^{\pi}_{\theta_{1}}-\partial^{\alpha}Q^{\pi}_{\theta_{2}}\|_{2}&=c_{0}2^{-j(p+d/2-\|\alpha\|_{\ell_{1}})}\Big\|\sum_{i\in\mathcal{I}}(\theta_{1,i}-\theta_{2,i})2^{jd/2}\psi^{|\alpha|}(2^{j}\bullet-i)\Big\|_{2}\\
&\gtrsim c_{0}2^{-j(p+d/2-\|\alpha\|_{\ell_{1}})}\sqrt{\sum_{i\in\mathcal{I}}(\theta_{1,i}-\theta_{2,i})^{2}\|2^{jd/2}\psi^{|\alpha|}(2^{j}\bullet-i)\|^{2}_{2}}\\
&\asymp 2c_{0}2^{-j(p+d/2-\|\alpha\|_{\ell_{1}})}\sqrt{\sum_{i\in\mathcal{I}}\mathbb{I}(\theta_{1,i}\neq\theta_{2,i})},
\end{aligned}

where the inequality follows by recalling that the supports of \widetilde{\psi}_{j,k_{1},G} and \widetilde{\psi}_{j,k_{2},G} are disjoint in at least one coordinate for all k_{1}\neq k_{2}\in\mathcal{I}. Here \psi^{|\alpha|}(2^{j}\bullet-i)=\Pi_{m=1}^{d}\partial^{\alpha_{m}}\psi(2^{j}\bullet-i_{m}). The last line is based on the fact that \widetilde{\psi}_{j,i,G}\in C^{\omega} with \omega>p>\|\alpha\|_{\ell_{1}} and is compactly supported, and the density is bounded above and below, so that \|2^{jd/2}\psi^{|\alpha|}(2^{j}\bullet-i)\|_{2}\asymp 1. Take j large enough. By the Varshamov–Gilbert bound, there exists a subset \{\theta^{(0)},\cdots,\theta^{(I^{\ast})}\} such that \theta^{(0)}=\{0\}^{\text{Card}(\mathcal{I})}, I^{\ast}\asymp 2^{\text{Card}(\mathcal{I})}, and \sqrt{\sum_{j\in\mathcal{I}}\mathbb{I}(\theta^{(i)}_{j}\neq\theta^{(k)}_{j})}\gtrsim 2^{jd/2} for 0\leq i<k\leq I^{\ast}. Therefore \|\partial^{\alpha}Q^{\pi}_{i}-\partial^{\alpha}Q^{\pi}_{k}\|_{2}\gtrsim c_{0}2^{-j(p-\|\alpha\|_{\ell_{1}})} for 0\leq i<k\leq I^{\ast}, where we denote Q_{\theta^{(i)}}=Q_{i}. Similarly, we denote h^{\pi}_{\theta^{(i)}}=h^{\pi}_{i}.
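For completeness, the form of the Varshamov–Gilbert bound invoked here (cf. Lemma 2.9 of Tsybakov (2009)) is: for m\geq 8, there exists a subset \{\theta^{(0)},\ldots,\theta^{(M)}\}\subseteq\{0,1\}^{m} with

\theta^{(0)}=(0,\ldots,0),\qquad\rho(\theta^{(i)},\theta^{(k)})\geq\frac{m}{8}\ \text{for all}\ 0\leq i<k\leq M,\qquad M\geq 2^{m/8},

where \rho denotes the Hamming distance; it is applied above with m=\text{Card}(\mathcal{I})\gtrsim 2^{jd}, which yields the stated separation \sqrt{\sum_{j\in\mathcal{I}}\mathbb{I}(\theta^{(i)}_{j}\neq\theta^{(k)}_{j})}\gtrsim 2^{jd/2}.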

For 0\leq m\leq I^{\ast}, consider Model (22) with true function h^{\pi}_{m} and define the joint probability distribution of \{S_{j,t},A_{j,t},R_{j,t},S^{\prime}_{j,t}\}_{1\leq j\leq N,0\leq t\leq T-1} as P_{m} such that

P_{m}=\Pi_{j=1}^{N}\Pi_{t=0}^{T-1}d^{\nu}(S_{j,t},A_{j,t})\,p_{h_{m}^{\pi}}(R_{j,t}\,|\,S_{j,t},A_{j,t})\,q(S^{\prime}_{j,t}\,|\,S_{j,t},A_{j,t}),

recalling that these are i.i.d. samples.

Secondly, for sufficiently small c_{0}, we can show that for every 0\leq m\leq I^{\ast},

\|c_{0}2^{-j(p+d/2)}\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}\|_{\Lambda^{p}_{2,2}}\lesssim c_{0}\leq L/2.

In addition, by Equation (10), we have

\begin{aligned}
\sqrt{p_{\min}/p_{\max}}\,(1-\gamma)\,c_{0}2^{-j(p+d/2)}\Big\|\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}\Big\|_{2}&\lesssim\Big\|\mathcal{T}c_{0}2^{-j(p+d/2)}\Big(\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}(S,A)-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,S^{\prime})\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}(S^{\prime},a^{\prime})\,\text{d}a^{\prime}\Big)\Big\|_{2}\\
&\lesssim c_{0}2^{-j(p+d/2)}\Big\|\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}\Big\|_{2}\\
&\lesssim c_{0}2^{-j(p+d/2)}\sqrt{\sum_{i\in\mathcal{I}}(\theta^{(m)}_{i})^{2}}\\
&\asymp c_{0}2^{-jp}.
\end{aligned}

Moreover, for every 0\leq m\leq I^{\ast}, the Kullback–Leibler divergence K(P_{m},P_{0}) can be bounded as

\begin{aligned}
K(P_{m},P_{0})&\leq\frac{1}{2}c^{2}_{0}2^{-j(2p+d)}\sum_{k=1}^{N}\sum_{t=0}^{T-1}\mathbb{E}\left[\frac{\Big(\mathcal{T}\Big(\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}(S_{k,t},A_{k,t})-\gamma\int_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\,|\,S^{\prime}_{k,t})\sum_{i\in\mathcal{I}}\theta^{(m)}_{i}\widetilde{\psi}_{j,i,G}(S^{\prime}_{k,t},a^{\prime})\,\text{d}a^{\prime}\Big)\Big)^{2}}{\sigma^{2}(S_{k,t},A_{k,t})}\right]\\
&\lesssim NTc^{2}_{0}2^{-2jp},
\end{aligned}

by the condition in Theorem 2. Choosing 2^{j}\asymp(NT)^{1/(2p+d)} gives

K(P_{m},P_{0})\lesssim c_{0}^{2}(NT)^{d/(2p+d)},

and \log(I^{\ast})\gtrsim 2^{jd}\asymp(NT)^{d/(2p+d)}, recalling that I^{\ast}\asymp 2^{\text{Card}(\mathcal{I})} and \text{Card}(\mathcal{I})\asymp 2^{jd}. So for sufficiently small c_{0} and large NT, K(P_{m},P_{0})\leq(1/8)\log(I^{\ast}) for every 1\leq m\leq I^{\ast}.

Note that 2^{-j(p-\|\alpha\|_{\ell_{1}})}=(NT)^{(\|\alpha\|_{\ell_{1}}-p)/(2p+d)}. Then Theorem 2.5 of Tsybakov (2009) implies that for any 0\leq\|\alpha\|_{\ell_{1}}<p,

\liminf_{NT\rightarrow\infty}\inf_{\widehat{Q}}\sup_{Q\in\Lambda_{2}(p,L)}\text{Pr}^{Q}\left(\|\partial^{\alpha}\widehat{Q}-\partial^{\alpha}Q\|_{2}\geq\bar{c}(NT)^{(\|\alpha\|_{\ell_{1}}-p)/(2p+d)}\right)\geq\bar{c}^{\prime}>0, \qquad (28)

for some constants \bar{c} and \bar{c}^{\prime}.

Appendix C Proof of Theorem 4

Let Q^{\pi}_{0,J} solve \inf_{Q\in\Psi_{J}}\|Q^{\pi}-Q\|_{\infty}. Under all assumptions in Lemma 2, as long as \zeta^{2}\sqrt{\log(J)\log(NT)/NT}=O(1), we have

\begin{aligned}
\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}&\leq\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}+\|\widetilde{Q}^{\pi}-Q^{\pi}\|_{\infty}\\
&\leq O_{p}\Big(\frac{R_{\max}}{1-\gamma}\tau_{J}\xi_{J}\sqrt{(\log J)/(NTe_{J})}\Big)+O_{p}(1)\times\|Q^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{\infty}\\
&\leq O_{p}\Big(\xi_{J}\sqrt{(\log J)/(NTe_{J})}\Big)+O_{p}(1)\,\|\overline{\Pi}_{J}\|_{\infty}\,\|Q^{\pi}-Q_{0,J}^{\pi}\|_{\infty},
\end{aligned}

where the first inequality is the triangle inequality and the last is given by Lebesgue's lemma and \tau_{J}\lesssim 1.

To proceed with the proof, we consider only the wavelet basis of Cohen et al. (1993) for B_{K} and \Psi_{J}; the results for the other bases covered in Theorem 4 can be derived similarly. Based on the properties of the wavelet basis, we can show that \|\overline{\Pi}_{J}\|_{\infty}\lesssim 1, where the proof is given in Chen & Christensen (2015). Since Q^{\pi}\in\Lambda_{\infty}(p,L), we have \|Q^{\pi}-Q_{0,J}^{\pi}\|_{\infty}\lesssim O(J^{-p/d}) by, e.g., Huang et al. (1998). Putting these together, we have

\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}=O_{p}\Big(\tau_{J}\xi_{J}\sqrt{(\log J)/(NTe_{J})}+J^{-p/d}\Big).

According to Lemma 4 and the properties of wavelet bases, e_{J}\gtrsim(1-\gamma)^{2}p^{2}_{\min}/p_{\max}\gtrsim 1. Similarly, we can show that \zeta_{b}\leq\zeta\lesssim\sqrt{J} and \xi_{J}\lesssim\sqrt{J}. Hence we have our first statement that

\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}=O_{p}\big(J^{-p/d}+\sqrt{J(\log J)/(NT)}\big), \qquad (29)

as long as J\sqrt{\log(J)\log(NT)/NT}=O(1). Lastly, by choosing J\asymp\left(\frac{NT}{\log(NT)}\right)^{d/(2p+d)}, which satisfies the constraint, we have

\|\widehat{Q}^{\pi}-Q^{\pi}\|_{\infty}=O_{p}\left(\left(\frac{\log(NT)}{NT}\right)^{p/(2p+d)}\right).
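The choice of J above is the usual bias–variance balance; for completeness, equating the two terms in (29),

J^{-p/d}\asymp\sqrt{J\log(J)/(NT)}\;\Longleftrightarrow\;J^{(2p+d)/d}\asymp\frac{NT}{\log(NT)}\;\Longleftrightarrow\;J\asymp\Big(\frac{NT}{\log(NT)}\Big)^{d/(2p+d)},

using \log J\asymp\log(NT) in this regime; both terms are then of order (\log(NT)/NT)^{p/(2p+d)}.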

Next, we present the proof for the derivative case. Note that by the previous result, we have

\|Q^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}=O_{p}(J^{-p/d}).

In addition, by the Bernstein inequalities of approximation theory, we have

\|\partial^{\alpha}Q\|_{\infty}=O(J^{\|\alpha\|_{\ell_{1}}/d})\|Q\|_{\infty},

for all Q\in\Psi_{J}. Hence, by Lemma 1 and Lemma 2 Result (2), we can show that

\begin{aligned}
\|\partial^{\alpha}\widetilde{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{\infty}&\leq\|\partial^{\alpha}\widetilde{Q}^{\pi}-\partial^{\alpha}(\overline{\Pi}_{J}Q^{\pi})\|_{\infty}+\|\partial^{\alpha}Q^{\pi}-\partial^{\alpha}(\overline{\Pi}_{J}Q^{\pi})\|_{\infty}\\
&\leq O(J^{\|\alpha\|_{\ell_{1}}/d})\|\widetilde{Q}^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{\infty}+\|\partial^{\alpha}Q^{\pi}-\partial^{\alpha}(\overline{\Pi}_{J}Q^{\pi})\|_{\infty}\\
&\leq O(J^{\|\alpha\|_{\ell_{1}}/d})\|\widetilde{h}^{\pi}-\Pi_{J}h_{0}^{\pi}\|_{\infty}+\|\partial^{\alpha}Q^{\pi}-\partial^{\alpha}(\overline{\Pi}_{J}Q^{\pi})\|_{\infty}\\
&\leq O_{p}(J^{-(p-\|\alpha\|_{\ell_{1}})/d})+\|\partial^{\alpha}Q^{\pi}-\partial^{\alpha}Q^{\pi}_{J}\|_{\infty}+\|\partial^{\alpha}Q^{\pi}_{J}-\partial^{\alpha}(\overline{\Pi}_{J}Q^{\pi})\|_{\infty}\\
&\leq O_{p}(J^{-(p-\|\alpha\|_{\ell_{1}})/d})+\|\partial^{\alpha}Q^{\pi}-\partial^{\alpha}Q^{\pi}_{J}\|_{\infty}+O(J^{\|\alpha\|_{\ell_{1}}/d})\|Q^{\pi}_{J}-Q^{\pi}\|_{\infty}.
\end{aligned}

By choosing Q^{\pi}_{J} such that \|Q^{\pi}_{J}-Q^{\pi}\|_{\infty}=O(J^{-p/d}) and \|\partial^{\alpha}Q^{\pi}-\partial^{\alpha}Q^{\pi}_{J}\|_{\infty}=O(J^{-(p-\|\alpha\|_{\ell_{1}})/d}), we have

\|\partial^{\alpha}\widetilde{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{\infty}=O_{p}(J^{-(p-\|\alpha\|_{\ell_{1}})/d}).

Finally, we can derive that

\begin{aligned}
\|\partial^{\alpha}\widehat{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{\infty}&\leq\|\partial^{\alpha}\widehat{Q}^{\pi}-\partial^{\alpha}\widetilde{Q}^{\pi}\|_{\infty}+\|\partial^{\alpha}\widetilde{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{\infty}\\
&\leq O(J^{\|\alpha\|_{\ell_{1}}/d})\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{\infty}+\|\partial^{\alpha}\widetilde{Q}^{\pi}-\partial^{\alpha}Q^{\pi}\|_{\infty}\\
&\leq O(J^{\|\alpha\|_{\ell_{1}}/d})O_{p}\Big(\xi_{J}\sqrt{(\log J)/(NT)}\Big)+O_{p}(J^{-(p-\|\alpha\|_{\ell_{1}})/d}),
\end{aligned}

where the last inequality is given by Lemma 2 (1). This concludes our proof.

Appendix D Proof of Lemma 2 Result (1)

The proof consists of three steps.

Step 1: Decompose the difference between \widehat{c} and \widetilde{c}:

\begin{aligned}
\widehat{c}-\widetilde{c}&=[\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}\Gamma_{\pi}]^{-}\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}(\mathbf{R}-H_{0})\\
&=[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\\
&\quad+\Big([\widehat{\Sigma}^{\pi\,\top}\widehat{G}_{b}^{-}\widehat{\Sigma}^{\pi}]^{-}\widehat{\Sigma}^{\pi\,\top}\widehat{G}_{b}^{-}-[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}\Big)B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\\
&=(I)+(II),
\end{aligned}

where

\widehat{\Sigma}^{\pi}=\frac{B^{\top}\Gamma_{\pi}}{NT}\quad\text{and}\quad\widehat{G}_{b}=\frac{B^{\top}B}{NT}.

Step 2: Bound the first term (I). Define the event

\mathcal{E}_{NT}=\left\{\left\|\frac{[G_{b}]^{-1/2}B^{\top}B[G_{b}]^{-1/2}}{NT}-I_{K}\right\|\leq\frac{1}{2}\right\},

where I_{K} is the identity matrix of size K. By Lemma 5 (b), we have

\left\|\frac{[G_{b}]^{-1/2}B^{\top}B[G_{b}]^{-1/2}}{NT}-I_{K}\right\|=O_{p}\big(\zeta_{b}\sqrt{\log(NT)\log(K)/(NT)}\big),

as long as \zeta\sqrt{\log(NT)\log(K)/(NT)}=o(1). Hence we obtain \Pr(\mathcal{E}_{NT}^{c})=o(1) by the assumption in Lemma 2 that \zeta^{2}\sqrt{\log(NT)\log(K)/(NT)}=O(1) and \zeta\geq\sqrt{J}.

Now for any x>0, we can show that

\begin{aligned}
&P(\|(I)\|_{\ell_{\infty}}>x) &&(30)\\
&\leq\sum_{j=1}^{J}P\left(\Big|\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=0}^{T-1}q^{\pi}_{j}(S_{i,t},A_{i,t})\big(R_{i,t}-h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1})\big)\Big|>x,\;\mathcal{E}_{NT}\right)+\Pr(\mathcal{E}_{NT}^{c}), &&(31)
\end{aligned}

where q^{\pi}_{j}(S_{i,t},A_{i,t})=\left\{[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}b^{K}(S_{i,t},A_{i,t})\right\}_{j} (the j-th element of the vector). Note that

\mathbb{E}\left[R_{i,t}-h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1})\mid S_{i,t},A_{i,t}\right]=0,

by the Bellman equation (1). Therefore the sequence

\{q^{\pi}_{j}(S_{i,t},A_{i,t})\big(R_{i,t}-h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1})\big)\}_{0\leq t\leq(T-1),\,1\leq i\leq N}

forms a mean-zero martingale difference sequence. We aim to apply Freedman's inequality. First, by Assumption 2 on the reward, we have

|R_{i,t}-h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1})|\leq\frac{2R_{\max}}{1-\gamma}.

In addition, we can show that

\begin{aligned}
|q^{\pi}_{j}(S_{i,t},A_{i,t})|&\leq\|[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}b^{K}(S_{i,t},A_{i,t})\|_{\ell_{2}}\\
&\leq\|[G^{\pi}_{\kappa}]^{-1/2}\|_{\ell_{2}}\,\|[[G^{\pi}_{\kappa}]^{-1/2}\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}[G^{\pi}_{\kappa}]^{-1/2}]^{-1}[G^{\pi}_{\kappa}]^{-1/2}\Sigma^{\pi\,\top}G_{b}^{-1/2}\|_{\ell_{2}}\,\|G_{b}^{-1/2}b^{K}(S_{i,t},A_{i,t})\|_{\ell_{2}}\\
&\leq\frac{\zeta}{s_{JK}\sqrt{e_{J}}},
\end{aligned}

where

s_{JK}^{-1}=\sup_{h\in\Theta^{\pi}_{J}}\frac{\|h\|_{L^{2}(S,A,S^{\prime})}}{\|\Pi_{K}\mathcal{T}h\|_{L^{2}(S,A)}},\qquad s_{JK}=s_{\min}(G_{b}^{-1/2}\Sigma^{\pi}[G^{\pi}_{\kappa}]^{-1/2}),

and s_{\min} refers to the minimum singular value. One can show that s_{JK}^{-1}\leq\tau_{J}\lesssim 1 by Lemma A.1 of Chen & Christensen (2018), using Assumption 6 (a).

Secondly, we can show that, conditional on \mathcal{E}_{NT},

\sum_{i=1}^{N}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{q^{\pi}_{j}(S_{i,t},A_{i,t})\big(R_{i,t}-h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1})\big)\right\}^{2}\mid S_{i,t},A_{i,t}\right]\leq\frac{6NTR_{\max}^{2}}{(1-\gamma)^{2}s^{2}_{JK}e_{J}}.

This relies on the following argument: conditional on \mathcal{E}_{NT}, for every j,

\begin{aligned}
\sum_{i=1}^{N}\sum_{t=0}^{T-1}\mathbb{E}\left[(q^{\pi}_{j}(S_{i,t},A_{i,t}))^{2}\,|\,S_{i,t},A_{i,t}\right]&=\sum_{i=1}^{N}\sum_{t=0}^{T-1}\|\{[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1/2}\}_{j\bullet}G_{b}^{-1/2}b^{K}(S_{i,t},A_{i,t})\|^{2}_{\ell_{2}}\\
&\leq\frac{3NT}{2}\|\{[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1/2}\}_{j\bullet}\|^{2}_{\ell_{2}}\\
&\leq\frac{3NT}{2}\|[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\|^{2}_{\ell_{2}}\\
&\leq\frac{3NT}{2}\|[G^{\pi}_{\kappa}]^{-1/2}[[G^{\pi}_{\kappa}]^{-1/2}\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}[G^{\pi}_{\kappa}]^{-1/2}]^{-1}[G^{\pi}_{\kappa}]^{-1/2}\|^{2}_{\ell_{2}}\\
&\leq\frac{3NT}{2s_{JK}^{2}e_{J}},
\end{aligned}

where the first inequality is given by the event \mathcal{E}_{NT}. Then by Freedman's inequality (e.g., Theorem 1.1 of Tropp (2011)), we can show that

\|(I)\|_{\ell_{\infty}}=O_{p}\left(\sqrt{\frac{\log J}{NTe_{J}}}\right), \qquad (32)

as long as \zeta\sqrt{\log(J)/(NT)}=o(1).
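For the reader's convenience, a standard statement of the matrix Freedman inequality invoked here (cf. Theorem 1.1 of Tropp (2011), self-adjoint case; it is applied coordinatewise, and the union bound over the J coordinates produces the \log J factor) reads: for a self-adjoint matrix martingale with difference sequence \{X_{k}\} satisfying \lambda_{\max}(X_{k})\leq R almost surely and predictable quadratic variation W_{k}=\sum_{j\leq k}\mathbb{E}[X_{j}^{2}\mid\mathcal{F}_{j-1}],

\Pr\Big(\exists\,k\geq 0:\ \lambda_{\max}\Big(\sum_{j\leq k}X_{j}\Big)\geq t\ \text{and}\ \|W_{k}\|\leq\sigma^{2}\Big)\leq d\cdot\exp\Big(\frac{-t^{2}/2}{\sigma^{2}+Rt/3}\Big).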

Recall from Appendix A the definitions of (G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l} and its sample analogue (\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}.

Step 3: We bound the second term (II). Relying on Lemmas 8 (a) and 6, we have

\begin{aligned}
\|(II)\|_{\ell_{\infty}}&\leq\|(\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}\widehat{G}_{b}^{-1/2}G_{b}^{1/2}-(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\|_{\ell^{2}}\,\|G_{b}^{-1/2}B^{\top}(\mathbf{R}-H_{0})/(NT)\|_{\ell^{2}} &&(33)\text{--}(34)\\
&=O_{p}\Big(s_{JK}^{-2}\zeta\sqrt{(\log(NT)\log J)/(NTe_{J})}\Big)\,O_{p}\Big(\frac{R_{\max}}{1-\gamma}\sqrt{\frac{K}{NT}}\Big) &&(35)\\
&=O_{p}\Big(\sqrt{\log(J)/(NTe_{J})}\Big), &&(36)
\end{aligned}

by the assumption in Lemma 2 (1) that \zeta^{2}\sqrt{\log(NT)}/\sqrt{NT}=O(1) and the facts that \zeta\geq\sqrt{K} and s_{JK}^{-1}\leq\tau_{J}\lesssim 1. This completes the proof of Lemma 2 (1), noting that \sup_{s\in\mathcal{S},a\in\mathcal{A}}\|\psi^{J}(s,a)\|_{\ell_{1}}=\xi_{J} by definition.

Appendix E Proof of Lemma 2 Result (2)

We first prove the following Lemma.

Lemma 3

Suppose that \zeta^{2}\sqrt{\log(J)\log(NT)}/\sqrt{NT}=O(1) and let Assumptions 5-6 hold. Then \|\widetilde{h}^{\pi}-\Pi_{J}h^{\pi}_{0}\|_{\infty}\leq O_{p}(1)\times\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty}.

The proof follows similarly to Lemma A.3 of Chen & Christensen (2018). Note that the difference between \widetilde{h}^{\pi} and \Pi_{J}h^{\pi}_{0} can be decomposed as

\begin{aligned}
&\widetilde{h}^{\pi}(s,a,s^{\prime})-\Pi_{J}h_{0}^{\pi}(s,a,s^{\prime})\\
&=\widetilde{\Pi}(h^{\pi}_{0}-\Pi_{J}h_{0}^{\pi})(s,a,s^{\prime})\\
&\quad+(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\big\{G_{b}^{-1/2}\big(B^{\top}(H_{0}-\Gamma_{\pi}c_{J})/(NT)-\mathbb{E}[b^{K}(S,A)(h_{0}^{\pi}(S,A,S^{\prime})-h^{\pi}_{J}(S,A,S^{\prime}))]\big)\big\}\\
&\quad+(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}\big\{(\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}\widehat{G}_{b}^{-1/2}G_{b}^{1/2}-(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\big\}G_{b}^{-1/2}B^{\top}(H_{0}-\Gamma_{\pi}c_{J})/(NT)\\
&=(I)+(II)+(III).
\end{aligned}

For (I), by Assumption 6 (b), we can show that \|(I)\|_{\infty}\lesssim\|h_{0}^{\pi}-\Pi_{J}h_{0}^{\pi}\|_{\infty}. For (II), by Lemma 7, we can show that

\begin{aligned}
\|(II)\|_{\infty}&\leq\zeta^{\pi}_{\kappa,J}\,s_{JK}^{-1}\,O_{p}\Big(\zeta_{b,K}\sqrt{\frac{\log(NT)\log(K)}{NT}}\Big)\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty} &&(37)\\
&=O_{p}\Big(\zeta^{2}\sqrt{\frac{\log(NT)\log(K)}{NT}}\Big)\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty}=O_{p}(1)\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty}, &&(38)
\end{aligned}

where we use the facts that \zeta\geq\max\{\zeta_{b,K},\zeta^{\pi}_{\kappa,J}\}, s_{JK}^{-1}\leq\tau_{J}\lesssim 1, and \zeta^{2}\sqrt{\log(J)\log(NT)}/\sqrt{NT}=O(1). For the term (III), by Lemma 8 (b), we can show that

\begin{aligned}
\|(III)\|_{\infty}&\leq\zeta^{\pi}_{\kappa,J}\,\|(\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}\widehat{G}_{b}^{-1/2}G_{b}^{1/2}-(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\|_{\ell_{2}}\,\|G_{b}^{-1/2}B^{\top}(H_{0}-\Gamma_{\pi}c_{J})/(NT)\|_{\ell_{2}}\\
&\leq\zeta^{\pi}_{\kappa,J}\,O_{p}\Big(\zeta\sqrt{\frac{\log J\log(NT)}{NT}}\Big)\Big\{O_{p}\Big(\zeta_{b,K}\sqrt{\frac{\log(NT)\log K}{NT}}\Big)\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty}+\|\Pi_{K}\mathcal{T}(h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0})\|_{L^{2}(S,A)}\Big\}\\
&\leq\zeta^{\pi}_{\kappa,J}\,O_{p}\Big(\zeta\sqrt{\frac{\log J\log(NT)}{NT}}\Big)\Big\{O_{p}\Big(\zeta_{b,K}\sqrt{\frac{\log(NT)\log K}{NT}}\Big)\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{\infty}+\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{L^{2}(S,A)}\Big\}\\
&=O_{p}\Big(\zeta^{2}\sqrt{\frac{\log(J)\log(NT)}{NT}}\Big)\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{L^{2}(S,A)}\\
&=O_{p}(1)\,\|h^{\pi}_{0}-\Pi_{J}h^{\pi}_{0}\|_{L^{2}(S,A)},
\end{aligned}

by the condition that \zeta^{2}\sqrt{\log(J)\log(NT)}/\sqrt{NT}=O(1).

Now we return to Result (2) of Lemma 2. By Lemma 1, we can see that

\begin{aligned}
\|\widetilde{Q}^{\pi}-Q^{\pi}\|_{\infty}&\lesssim\|\widetilde{h}^{\pi}-h_{0}^{\pi}\|_{\infty}\\
&\leq\|\widetilde{h}^{\pi}-\Pi_{J}h_{0}^{\pi}\|_{\infty}+\|h_{0}^{\pi}-\Pi_{J}h_{0}^{\pi}\|_{\infty}\\
&=O_{p}(1)\,\|h_{0}^{\pi}-\Pi_{J}h_{0}^{\pi}\|_{\infty}\\
&\leq O_{p}(1)\,\|Q^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{\infty},
\end{aligned}

which concludes our proof.

Appendix F Proof of Theorem 5

The idea of the proof is similar to that of Lemma 2 (1) and Theorem 4. By the triangle inequality, we have \|\widehat{Q}^{\pi}-Q^{\pi}\|_{2}\leq\|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{2}+\|\widetilde{Q}^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}+\|Q^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}. In Steps 1-3 below, we first bound \|\widehat{h}^{\pi}-\widetilde{h}^{\pi}\|_{2}, since \|\widehat{Q}^{\pi}-\widetilde{Q}^{\pi}\|_{2}\lesssim\|\widehat{h}^{\pi}-\widetilde{h}^{\pi}\|_{2}. The last step bounds \|\widetilde{Q}^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}.

Step 1: Decompose the difference between \widehat{h}^{\pi}(s,a,s^{\prime}) and \widetilde{h}^{\pi}(s,a,s^{\prime}) as follows:

\begin{aligned}
(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}\widehat{c}-(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}\widetilde{c}&=(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}[\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}\Gamma_{\pi}]^{-}\Gamma_{\pi}^{\top}B(B^{\top}B)^{-}B^{\top}(\mathbf{R}-H_{0})\\
&=(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\\
&\quad+(\kappa_{\pi}^{J}(s,a,s^{\prime}))^{\top}\Big([\widehat{\Sigma}^{\pi\,\top}\widehat{G}_{b}^{-}\widehat{\Sigma}^{\pi}]^{-}\widehat{\Sigma}^{\pi\,\top}\widehat{G}_{b}^{-}-[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}\Big)B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\\
&=(I)+(II),
\end{aligned}

where

\widehat{\Sigma}^{\pi}=\frac{B^{\top}\Gamma_{\pi}}{NT}\quad\text{and}\quad\widehat{G}_{b}=\frac{B^{\top}B}{NT}.

Step 2: Bound the first term (I). Note that

\begin{aligned}
\|(I)\|_{2}&=\Big\|(\kappa_{\pi}^{J}(\bullet,\bullet,\bullet))^{\top}[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\Big\|_{2}\\
&=\Big\|[G^{\pi}_{\kappa}]^{1/2}[\Sigma^{\pi\,\top}G_{b}^{-1}\Sigma^{\pi}]^{-1}\Sigma^{\pi\,\top}G_{b}^{-1}B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\Big\|_{\ell_{2}}\\
&\leq s^{-1}_{JK}\Big\|G_{b}^{-1/2}B^{\top}\Big(\frac{\mathbf{R}-H_{0}}{NT}\Big)\Big\|_{\ell_{2}}=O_{p}\Big(\sqrt{\frac{K}{NT}}\Big),
\end{aligned}

where the last line is given by Lemma 6 and s_{JK}^{-1}\lesssim 1.

Step 3: Bound the second term (II). Relying on Lemmas 8 (b) and 6, we have

\begin{aligned}
\|(II)\|_{2}&\leq\|[G^{\pi}_{\kappa}]^{1/2}\{(\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}\widehat{G}_{b}^{-1/2}G_{b}^{1/2}-(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\}\|_{\ell^{2}}\,\|G_{b}^{-1/2}B^{\top}(\mathbf{R}-H_{0})/(NT)\|_{\ell^{2}} &&(39)\text{--}(40)\\
&=O_{p}\Big(s_{JK}^{-2}\zeta\sqrt{(\log(NT)\log J)/(NT)}\Big)\,O_{p}\Big(\frac{R_{\max}}{1-\gamma}\sqrt{\frac{K}{NT}}\Big) &&(41)\\
&=O_{p}\Big(\sqrt{K/(NT)}\Big), &&(42)
\end{aligned}

by the assumption in Theorem 5 that \zeta\sqrt{\log(NT)\log(J)}/\sqrt{NT}=o(1) and s_{JK}^{-1}\lesssim 1.

Step 4: In the remaining proof, we bound \|\widetilde{Q}^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}. By Theorem 1, we can show that

(1+\gamma)\|\widetilde{Q}^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}\lesssim\|\widetilde{h}^{\pi}-\Pi_{J}h^{\pi}\|_{2}.

Then by a proof similar to that in Appendix E, we can show that, as long as \zeta\sqrt{\log(NT)\log(J)}/\sqrt{NT}=O(1),

\|\widetilde{Q}^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}\leq O_{p}(1)\times\|Q^{\pi}-\overline{\Pi}_{J}Q^{\pi}\|_{2}=O_{p}(1)\times J^{-p/d},

where the last equality uses the existing result on the approximation error of the linear sieve. Summarizing Steps 1-4, we obtain the statements in Theorem 5. Finally, we conclude the proof by an argument similar to that in the proof of Theorem 4 for the derivative case.

Appendix G Technical Lemmas

Lemma 4

For any policy \pi, under Assumptions 1 and 3-5, we have

e_{J}\gtrsim\frac{p^{2}_{\min}}{p_{\max}}(1-\gamma)^{2}\omega_{J}

for every J\geq 1, where \omega_{J}=\lambda_{\min}(\mathbb{E}[\psi^{J}(S,A)(\psi^{J}(S,A))^{\top}]).

Proof. By definition,

e_{J}=\lambda_{\min}\left\{\mathbb{E}\left[\left(\psi^{J}(S,A)-\gamma\psi_{\pi}^{J}(S^{\prime})\right)\left(\psi^{J}(S,A)-\gamma\psi_{\pi}^{J}(S^{\prime})\right)^{\top}\right]\right\}.

Applying Theorem 1 with Q_{1}(s,a)=(\psi^{J}(s,a))^{\top}x and Q_{2}(s,a)=0 for every s\in\mathcal{S} and a\in\mathcal{A} (recall that the sieve space is a subset of L^{2}(S,A)), we have

\begin{aligned}
&x^{\top}\mathbb{E}\left[\left(\psi^{J}(S,A)-\gamma\psi_{\pi}^{J}(S^{\prime})\right)\left(\psi^{J}(S,A)-\gamma\psi_{\pi}^{J}(S^{\prime})\right)^{\top}\right]x\\
&\geq x^{\top}\mathbb{E}\left[\left(\psi^{J}(S,A)-\gamma\mathbb{E}\left[\psi_{\pi}^{J}(S^{\prime})\mid S,A\right]\right)\left(\psi^{J}(S,A)-\gamma\mathbb{E}\left[\psi_{\pi}^{J}(S^{\prime})\mid S,A\right]\right)^{\top}\right]x\\
&=\|(\psi^{J}(S,A)-\gamma\mathbb{E}\left[\psi_{\pi}^{J}(S^{\prime})\mid S,A\right])^{\top}x\|^{2}_{2}\\
&\geq\frac{p_{\min}}{p_{\max}}(1-\gamma)^{2}\|(\psi^{J}(S,A))^{\top}x\|_{2}^{2}\geq\frac{p_{\min}}{p_{\max}}(1-\gamma)^{2}\omega_{J}\|x\|_{\ell_{2}}^{2},
\end{aligned}

where the first inequality is given by Jensen's inequality and the last inequality is by the definition of \omega_{J}.

By examining the proof, we can see that the above lemma also holds for \bar{d}_{T}^{\pi^{b}} without Assumption 5.

Next we present several technical lemmas adapted from Chen & Christensen (2018). Define the orthonormalized matrix estimators

\begin{aligned}
\widehat{G}_{b}^{o}&=G_{b}^{-1/2}\widehat{G}_{b}G_{b}^{-1/2}\\
\widehat{G}_{\kappa}^{\pi,o}&=[G_{\kappa}^{\pi}]^{-1/2}\widehat{G}_{\kappa}^{\pi}[G_{\kappa}^{\pi}]^{-1/2}\\
\widehat{\Sigma}^{\pi,o}&=G_{b}^{-1/2}\widehat{\Sigma}^{\pi}[G_{\kappa}^{\pi}]^{-1/2},
\end{aligned}

where \widehat{G}_{\kappa}^{\pi}=\frac{\Gamma_{\pi}^{\top}\Gamma_{\pi}}{NT}. Let G_{b}^{o}=I_{K}, G_{\kappa}^{\pi,o}=I_{J} and \Sigma^{\pi,o} denote their corresponding expected values.

Lemma 5

Under Assumption 5, the following three bounds hold:

\begin{aligned}
(a)\quad&\|\widehat{G}_{\kappa}^{\pi,o}-G_{\kappa}^{\pi,o}\|_{\ell^{2}}=O_{p}\big(\zeta^{\pi}_{\kappa,J}\sqrt{(\log(NT)\log(J))/(NT)}\big)\\
(b)\quad&\|\widehat{G}_{b}^{o}-G_{b}^{o}\|_{\ell^{2}}=O_{p}\big(\zeta_{b,K}\sqrt{(\log(NT)\log K)/(NT)}\big)\\
(c)\quad&\|\widehat{\Sigma}^{\pi,o}-\Sigma^{\pi,o}\|_{\ell^{2}}=O_{p}\big(\max(\zeta_{b,K},\zeta^{\pi}_{\kappa,J})\sqrt{(\log(NT)\log K)/(NT)}\big),
\end{aligned}

as N,T,J,K\to\infty, as long as \zeta\sqrt{(\log(NT)\log(J))/(NT)}=o(1).

Proof. The proof follows similarly to Lemma 2.2 of Chen & Christensen (2015). The basic idea is to use Berbee's coupling lemma (e.g., Theorem 4.2 of Chen & Christensen (2015)) and the matrix Bernstein inequality (e.g., Tropp (2015)). For brevity, we only show the proof of the second statement of Lemma 5; the others are similar.

Let X_{i,t}=G_{b}^{-1/2}b^{K}(S_{i,t},A_{i,t})[b^{K}(S_{i,t},A_{i,t})]^{\top}G_{b}^{-1/2}/(NT)-I_{K}/(NT), so that \mathbb{E}[X_{i,t}]=0_{K\times K}. Denote the upper bound on the mixing coefficient by \beta(w)=\beta_{0}\exp(-\beta_{1}w), where \beta_{0} and \beta_{1} are given in Assumption 5. By Berbee's lemma, for fixed i with 1\leq i\leq N and some integer w, the stochastic process \{X_{i,t}\}_{t\geq 0} can be coupled by a process X^{\ast}_{i,t} such that the blocks Y_{i,k}=\{X_{i,(k-1)w+j}\}_{0\leq j<w} and Y^{\ast}_{i,k}=\{X^{\ast}_{i,(k-1)w+j}\}_{0\leq j<w} are identically distributed for each k\geq 1 and P(Y_{i,k}\neq Y^{\ast}_{i,k})\leq\beta(w). In addition, the even-numbered blocks \{Y^{\ast}_{i,k}\,|\,k=2z,z\geq 1\} are independent of one another, and so are the odd-numbered blocks \{Y^{\ast}_{i,k}\,|\,k=2z+1,z\geq 0\}. Denote by I_{e} the indices in \{0,\cdots,T-1\} belonging to the even-numbered blocks and by I_{o} those belonging to the odd-numbered blocks. Let I_{r} collect the remaining indices, i.e., I_{r}=\{\lfloor T/w\rfloor w,\cdots,T-1\}, so that \text{Card}(I_{r})<w. We construct such a coupled process for every trajectory 1\leq i\leq N. Now, by the triangle inequality, we can show that for x>0,

\begin{aligned}
&{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t=0}^{T-1}X_{i,t}\Big\|_{\ell_{2}}\geq 4x\Big)\\
&\leq{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t=0}^{\lfloor T/w\rfloor w-1}X^{\ast}_{i,t}\Big\|_{\ell_{2}}\geq 2x\Big)+{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{r}}X_{i,t}\Big\|_{\ell_{2}}\geq x\Big)+{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t=0}^{\lfloor T/w\rfloor w-1}(X^{\ast}_{i,t}-X_{i,t})\Big\|_{\ell_{2}}\geq x\Big)\\
&\leq{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{o}}X^{\ast}_{i,t}\Big\|_{\ell_{2}}\geq x\Big)+{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{e}}X^{\ast}_{i,t}\Big\|_{\ell_{2}}\geq x\Big)+{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{r}}X_{i,t}\Big\|_{\ell_{2}}\geq x\Big)+\frac{NT\beta(w)}{w}.
\end{aligned}

By choosing w=c\log(NT) for sufficiently large c, we can show that

\frac{NT\beta(w)}{w}\lesssim\frac{1}{NT}.
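The blocking behind Berbee's coupling is simple index bookkeeping; here is a minimal sketch (a hypothetical helper, not the paper's code), assuming only a trajectory length T and block width w:

def berbee_blocks(T, w):
    """Split {0, ..., T-1} into alternating width-w blocks plus a remainder.

    Returns (I_o, I_e, I_r): indices in odd-numbered blocks, indices in
    even-numbered blocks, and I_r = {floor(T/w)*w, ..., T-1} with
    Card(I_r) < w.  The coupling itself then replaces the variables on
    each block by independent copies, block by block.
    """
    n_blocks = T // w
    I_o, I_e = [], []
    for k in range(1, n_blocks + 1):      # block k covers [(k-1)w, kw)
        block = list(range((k - 1) * w, k * w))
        (I_o if k % 2 == 1 else I_e).extend(block)
    I_r = list(range(n_blocks * w, T))
    return I_o, I_e, I_r

# e.g., berbee_blocks(10, 3) returns ([0, 1, 2, 6, 7, 8], [3, 4, 5], [9])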

For the term {\mbox{Pr}}(\|\sum_{i=1}^{N}\sum_{t\in I_{o}}X^{\ast}_{i,t}\|_{\ell_{2}}\geq x), notice that \sum_{i=1}^{N}\sum_{t\in I_{o}}X^{\ast}_{i,t} has been decomposed into the sum of fewer than N\times\lfloor T/w\rfloor independent matrices, i.e., Z^{\ast}_{i,k}=\sum_{t=(k-1)w}^{kw-1}X^{\ast}_{i,t}, k\geq 1. One can show that \|Z^{\ast}_{i,k}\|_{\ell_{2}}\leq\frac{w(\zeta^{2}+1)}{NT}=w\bar{R} and \max(\|Z^{\ast}_{i,k}[Z^{\ast}_{i,k}]^{\top}\|_{2},\|[Z^{\ast}_{i,k}]^{\top}Z^{\ast}_{i,k}\|_{2})\leq\frac{w^{2}(\zeta^{2}+1)}{(NT)^{2}}=w^{2}\sigma^{2}. Then by the matrix Bernstein inequality, we have

{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{o}}X^{\ast}_{i,t}\Big\|_{\ell_{2}}\geq x\Big)\leq 2K\exp\left(\frac{-x^{2}/2}{(NT)w\sigma^{2}+w\bar{R}x/3}\right).

We can then drive this probability to 0 as K\rightarrow\infty by choosing x=C\sigma\sqrt{wNT\log(K)} for sufficiently large C, under the condition given in the statement that \bar{R}\sqrt{w\log(K)}=o(\sigma\sqrt{NT}), i.e., \zeta\sqrt{\log(NT)\log(K)}/\sqrt{NT}=o(1). A similar argument applies to {\mbox{Pr}}(\|\sum_{i=1}^{N}\sum_{t\in I_{e}}X^{\ast}_{i,t}\|_{\ell_{2}}\geq x).

Next, we derive an upper bound for {\mbox{Pr}}(\|\sum_{i=1}^{N}\sum_{t\in I_{r}}X_{i,t}\|_{\ell_{2}}\geq\bar{x}) for some \bar{x}>0. By Bernstein's inequality, since \sum_{t\in I_{r}}X_{i,t}, 1\leq i\leq N, are independent, we have

{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{r}}X_{i,t}\Big\|_{\ell_{2}}\geq\bar{x}\Big)\leq 2K\exp\left(\frac{-\bar{x}^{2}/2}{Nw^{2}\sigma^{2}+w\bar{R}\bar{x}/3}\right).

By choosing \bar{x}=C_{1}\sigma\sqrt{NTw\log(K)} for sufficiently large C_{1}, we can show that

{\mbox{Pr}}\Big(\Big\|\sum_{i=1}^{N}\sum_{t\in I_{r}}X_{i,t}\Big\|_{\ell_{2}}\geq\bar{x}\Big)\lesssim K^{-C_{1}(T/w)+1},

as long as \zeta\sqrt{\log(NT)\log(K)}/\sqrt{NT}=o(1). Without loss of generality we can assume w\leq T, which completes the proof of the second statement of Lemma 5, as this probability converges to 0 as K\rightarrow\infty. Otherwise, the result in the statement can be obtained directly by using Bernstein's inequality in the i.i.d. setting, without Berbee's lemma. The other statements follow similarly.

Lemma 6

Under Assumptions 1 and 2, \|G_{b}^{-1/2}B^{\top}(\mathbf{R}-H_{0})/(NT)\|_{\ell^{2}}=O_{p}\big(\frac{R_{\max}}{1-\gamma}\sqrt{\frac{K}{NT}}\big).

Proof. We apply the Markov inequality. Note that

\mathbb{E}\,\|G_{b}^{-1/2}B^{\top}(\mathbf{R}-H_{0})/(NT)\|^{2}_{\ell^{2}}=\frac{1}{(NT)^{2}}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\mathbb{E}\big[(R_{i,t}-h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1}))^{2}\,b^{K}(S_{i,t},A_{i,t})^{\top}G_{b}^{-1}b^{K}(S_{i,t},A_{i,t})\big]\leq\frac{4R^{2}_{\max}}{(1-\gamma)^{2}}\frac{K}{NT},

because all the cross terms in G_{b}^{-1/2}B^{\top}(\mathbf{R}-H_{0})/(NT) are uncorrelated by the Bellman equation (1), |R_{i,t}-h_{0}^{\pi}|\leq 2R_{\max}/(1-\gamma), and \mathbb{E}[b^{K}(S,A)^{\top}G_{b}^{-1}b^{K}(S,A)]=\text{tr}(I_{K})=K. This completes the proof.

Lemma 7

Let h^{\pi}_{J}(s,a,s^{\prime})=\kappa_{\pi}^{J}(s,a,s^{\prime})^{\top}c_{J} for any deterministic c_{J}\in\mathbb{R}^{J} and H_{J}=(h^{\pi}_{J}(S_{1,0},A_{1,0},S_{1,1}),h^{\pi}_{J}(S_{1,1},A_{1,1},S_{1,2}),\ldots,h^{\pi}_{J}(S_{N,T-1},A_{N,T-1},S_{N,T}))^{\top}=\Gamma_{\pi}c_{J}. Under Assumption 5,

\|G_{b}^{-1/2}(B^{\top}(H_{0}-H_{J})/(NT)-\mathbb{E}[b^{K}(S,A)(h_{0}^{\pi}(S,A,S^{\prime})-h^{\pi}_{J}(S,A,S^{\prime}))])\|_{\ell^{2}}=O_{p}\left(\zeta_{b,K}\sqrt{\frac{\log(NT)\log(K)}{NT}}\times\|h_{0}^{\pi}-h^{\pi}_{J}\|_{\infty}\right),

provided \sqrt{\frac{\log(NT)\log(K)}{NT}}=o(1).

Proof. We again use Berbee's coupling lemma and the matrix Bernstein inequality (e.g., Dedecker & Louhichi (2002); Chen & Christensen (2015)) to obtain the result. The argument is similar to that in the proof of Lemma 5. In particular, let

Z_{i,t}=G_{b}^{-1/2}b^{K}(S_{i,t},A_{i,t})(h_{0}^{\pi}(S_{i,t},A_{i,t},S_{i,t+1})-h^{\pi}_{J}(S_{i,t},A_{i,t},S_{i,t+1})).

It can be seen that \|Z_{i,t}\|_{\ell_{2}}\leq\zeta_{b,K}\|h_{0}^{\pi}-h^{\pi}_{J}\|_{\infty} and

\max\{\|\mathbb{E}[Z_{i,t}Z_{i,t}^{\top}]\|_{\ell_{2}},\mathbb{E}[Z_{i,t}^{\top}Z_{i,t}]\}\leq\zeta^{2}_{b,K}\|h_{0}^{\pi}-h^{\pi}_{J}\|^{2}_{\infty},

which gives the result.    

Lemma 8

Let s_{JK}^{-1}\zeta\sqrt{(\log(NT)\log J)/(NT)}=o(1) and let Assumption 5 be satisfied. Then:

\begin{aligned}
(a)\quad&\|(\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}\widehat{G}_{b}^{-1/2}G_{b}^{1/2}-(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\|_{\ell^{2}}=O_{p}\Big(s_{JK}^{-2}\zeta\sqrt{(\log(NT)\log J)/(NTe_{J})}\Big)\\
(b)\quad&\|[G^{\pi}_{\kappa}]^{1/2}\{(\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}\widehat{G}_{b}^{-1/2}G_{b}^{1/2}-(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}\}\|_{\ell^{2}}=O_{p}\Big(s_{JK}^{-2}\zeta\sqrt{(\log(NT)\log J)/(NT)}\Big),
\end{aligned}

where

(G_{b}^{-1/2}\Sigma^{\pi})^{-}_{l}=\left[[\Sigma^{\pi}]^{\top}G_{b}^{-1}\Sigma^{\pi}\right]^{-1}[\Sigma^{\pi}]^{\top}G_{b}^{-1/2},

and similarly for (\widehat{G}_{b}^{-1/2}\widehat{\Sigma}^{\pi})^{-}_{l}.

Proof. We use a proof similar to that of Lemma F.10 of Chen & Christensen (2018), again with Berbee's coupling lemma. The argument is also similar to that in the proof of Lemma 5, and we omit the details for brevity.

References

  • Agarwal et al. (2019) Agarwal, A., Jiang, N., Kakade, S. M. & Sun, W. (2019), ‘Reinforcement learning: Theory and algorithms’, CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep .
  • Agarwal et al. (2021) Agarwal, A., Kakade, S. M., Lee, J. D. & Mahajan, G. (2021), ‘On the theory of policy gradient methods: Optimality, approximation, and distribution shift’, Journal of Machine Learning Research 22(98), 1–76.
  • Ai & Chen (2003) Ai, C. & Chen, X. (2003), ‘Efficient estimation of models with conditional moment restrictions containing unknown functions’, Econometrica 71(6), 1795–1843.
  • Antos et al. (2008a) Antos, A., Szepesvári, C. & Munos, R. (2008a), Fitted q-iteration in continuous action-space mdps, in ‘Advances in Neural Information Processing Systems’, Vol. 20.
  • Antos et al. (2008b) Antos, A., Szepesvári, C. & Munos, R. (2008b), ‘Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path’, Machine Learning 71(1), 89–129.
    http://link.springer.com/10.1007/s10994-007-5038-2
  • Blundell et al. (2007) Blundell, R., Chen, X. & Kristensen, D. (2007), ‘Semi-nonparametric iv estimation of shape-invariant engel curves’, Econometrica 75(6), 1613–1669.
  • Bradley (2005) Bradley, R. C. (2005), ‘Basic properties of strong mixing conditions. a survey and some open questions’, Probability Surveys 2, 107–144.
  • Bradtke & Barto (1996) Bradtke, S. J. & Barto, A. G. (1996), ‘Linear least-squares algorithms for temporal difference learning’, Machine learning 22(1), 33–57.
  • Chen (2007) Chen, X. (2007), ‘Large sample sieve estimation of semi-nonparametric models’, Handbook of Econometrics 6, 5549–5632.
  • Chen & Christensen (2015) Chen, X. & Christensen, T. M. (2015), ‘Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions’, Journal of Econometrics 188(2), 447–465.
  • Chen & Christensen (2018) Chen, X. & Christensen, T. M. (2018), ‘Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression’, Quantitative Economics 9(1), 39–84.
  • Chen & Reiss (2011) Chen, X. & Reiss, M. (2011), ‘On rate optimality for ill-posed inverse problems in econometrics’, Econometric Theory 27(3), 497–521.
  • Chen et al. (2021) Chen, Y., Xu, L., Gulcehre, C., Paine, T. L., Gretton, A., de Freitas, N. & Doucet, A. (2021), ‘On instrumental variable regression for deep offline policy evaluation’, arXiv preprint arXiv:2105.10148 .
  • Cohen et al. (1993) Cohen, A., Daubechies, I. & Vial, P. (1993), ‘Wavelets on the interval and fast wavelet transforms’, Applied and Computational Harmonic Analysis 1(1), 54–81.
  • Darolles et al. (2011) Darolles, S., Fan, Y., Florens, J.-P. & Renault, E. (2011), ‘Nonparametric instrumental regression’, Econometrica 79(5), 1541–1565.
  • Dedecker & Louhichi (2002) Dedecker, J. & Louhichi, S. (2002), Maximal inequalities and empirical central limit theorems, in ‘Empirical process techniques for dependent data’, Springer, pp. 137–159.
  • Duan et al. (2020) Duan, Y., Jia, Z. & Wang, M. (2020), Minimax-optimal off-policy evaluation with linear function approximation, in ‘International Conference on Machine Learning’, PMLR, pp. 2701–2709.
  • Duan et al. (2021) Duan, Y., Wang, M. & Wainwright, M. J. (2021), ‘Optimal policy evaluation using kernel-based temporal difference methods’, arXiv preprint arXiv:2109.12002.
  • Ernst et al. (2005) Ernst, D., Geurts, P. & Wehenkel, L. (2005), ‘Tree-based batch mode reinforcement learning’, Journal of Machine Learning Research 6, 503–556.
  • Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. & Mannor, S. (2016), ‘Regularized policy iteration with nonparametric function spaces’, The Journal of Machine Learning Research 17(1), 4809–4874.
  • Feng et al. (2020) Feng, Y., Ren, T., Tang, Z. & Liu, Q. (2020), Accountable off-policy evaluation with kernel bellman statistics, in ‘International Conference on Machine Learning’, PMLR, pp. 3102–3111.
  • Hall & Horowitz (2005) Hall, P. & Horowitz, J. L. (2005), ‘Nonparametric methods for inference in the presence of instrumental variables’, The Annals of Statistics 33(6), 2904–2929.
  • Huang et al. (1998) Huang, J. Z. (1998), ‘Projection estimation in multiple regression with application to functional ANOVA models’, Annals of Statistics 26(1), 242–272.
  • Jin et al. (2021) Jin, Y., Yang, Z. & Wang, Z. (2021), Is pessimism provably efficient for offline RL?, in ‘International Conference on Machine Learning’, PMLR, pp. 5084–5096.
  • Kallus & Uehara (2019) Kallus, N. & Uehara, M. (2019), ‘Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes’, arXiv preprint arXiv:1909.05850.
  • Kallus & Uehara (2020) Kallus, N. & Uehara, M. (2020), Statistically efficient off-policy policy gradients, in ‘International Conference on Machine Learning’, PMLR, pp. 5089–5100.
  • Kosorok & Laber (2019) Kosorok, M. R. & Laber, E. B. (2019), ‘Precision medicine’, Annual Review of Statistics and Its Application 6, 263–286.
  • Le et al. (2019) Le, H., Voloshin, C. & Yue, Y. (2019), Batch policy learning under constraints, in ‘International Conference on Machine Learning’, pp. 3703–3712.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G. & Fu, J. (2020), ‘Offline reinforcement learning: Tutorial, review, and perspectives on open problems’, arXiv preprint arXiv:2005.01643.
  • Liao et al. (2018) Liao, P., Dempsey, W., Sarker, H., Hossain, S. M., al’Absi, M., Klasnja, P. & Murphy, S. (2018), ‘Just-in-time but not too much: Determining treatment timing in mobile health’, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2(4), 179.
  • Liao et al. (2020) Liao, P., Qi, Z. & Murphy, S. (2020), ‘Batch policy learning in average reward Markov decision processes’, arXiv preprint arXiv:2007.11771.
  • Liu et al. (2018) Liu, Q., Li, L., Tang, Z. & Zhou, D. (2018), Breaking the curse of horizon: Infinite-horizon off-policy estimation, in ‘Advances in Neural Information Processing Systems’, pp. 5356–5366.
  • Nachum et al. (2019) Nachum, O., Chow, Y., Dai, B. & Li, L. (2019), DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections, in ‘Advances in Neural Information Processing Systems’, pp. 2315–2325.
  • Newey & Powell (2003) Newey, W. K. & Powell, J. L. (2003), ‘Instrumental variable estimation of nonparametric models’, Econometrica 71(5), 1565–1578.
  • Pinto & Gupta (2016) Pinto, L. & Gupta, A. (2016), Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours, in ‘2016 IEEE International Conference on Robotics and Automation (ICRA)’, IEEE, pp. 3406–3413.
  • Precup et al. (2000) Precup, D., Sutton, R. S. & Singh, S. P. (2000), Eligibility traces for off-policy policy evaluation, in ‘Proceedings of the Seventeenth International Conference on Machine Learning’, pp. 759–766.
  • Shi et al. (2021) Shi, C., Wan, R., Chernozhukov, V. & Song, R. (2021), Deeply-debiased off-policy interval estimation, in ‘Proceedings of the 38th International Conference on Machine Learning’, Vol. 139, PMLR, pp. 9580–9591.
  • Shi et al. (2020) Shi, C., Zhang, S., Lu, W. & Song, R. (2020), ‘Statistical inference of the value function for reinforcement learning in infinite horizon settings’, Journal of the Royal Statistical Society: Series B, in press.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. & Riedmiller, M. (2014), Deterministic policy gradient algorithms, in ‘International Conference on Machine Learning’, PMLR, pp. 387–395.
  • Stone (1982) Stone, C. J. (1982), ‘Optimal global rates of convergence for nonparametric regression’, The Annals of Statistics 10(4), 1040–1053.
  • Sutton & Barto (2018) Sutton, R. S. & Barto, A. G. (2018), Reinforcement learning: An introduction, MIT Press.
  • Tang et al. (2020) Tang, Z., Feng, Y., Li, L., Zhou, D. & Liu, Q. (2020), Doubly robust bias reduction in infinite horizon off-policy estimation, in ‘International Conference on Learning Representations’.
  • Thomas et al. (2017) Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I. & Brunskill, E. (2017), Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing, in ‘Twenty-Ninth IAAI Conference’.
  • Tropp (2011) Tropp, J. (2011), ‘Freedman’s inequality for matrix martingales’, Electronic Communications in Probability 16, 262–270.
  • Tropp (2015) Tropp, J. A. (2015), ‘An introduction to matrix concentration inequalities’, arXiv preprint arXiv:1501.01571.
  • Tsybakov (2009) Tsybakov, A. B. (2009), Introduction to Nonparametric Estimation, Springer.
  • Uehara et al. (2020) Uehara, M., Huang, J. & Jiang, N. (2020), Minimax weight and Q-function learning for off-policy evaluation, in ‘International Conference on Machine Learning’, PMLR, pp. 9659–9668.
  • Uehara et al. (2021) Uehara, M., Imaizumi, M., Jiang, N., Kallus, N., Sun, W. & Xie, T. (2021), ‘Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency’, arXiv preprint arXiv:2102.02981.
  • Uehara & Jiang (2019) Uehara, M. & Jiang, N. (2019), ‘Minimax weight and Q-function learning for off-policy evaluation’, arXiv preprint arXiv:1910.12809.
  • Uehara & Sun (2021) Uehara, M. & Sun, W. (2021), ‘Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage’, arXiv preprint arXiv:2107.06226.
  • Wang et al. (2021) Wang, J., Qi, Z. & Wong, R. K. (2021), ‘Projected state-action balancing weights for offline reinforcement learning’, arXiv preprint arXiv:2109.04640.
  • Xie et al. (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P. & Agarwal, A. (2021), ‘Bellman-consistent pessimism for offline reinforcement learning’, arXiv preprint arXiv:2106.06926.
  • Xie et al. (2019) Xie, T., Ma, Y. & Wang, Y.-X. (2019), ‘Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling’, arXiv preprint arXiv:1906.03393.
  • Xu et al. (2021) Xu, T., Yang, Z., Wang, Z. & Liang, Y. (2021), ‘Doubly robust off-policy actor-critic: Convergence and optimality’, arXiv preprint arXiv:2102.11866.
  • Zanette et al. (2021) Zanette, A., Wainwright, M. J. & Brunskill, E. (2021), ‘Provable benefits of actor-critic methods for offline reinforcement learning’, Advances in Neural Information Processing Systems 34.
  • Zhang, Dai, Li & Schuurmans (2020) Zhang, R., Dai, B., Li, L. & Schuurmans, D. (2020), ‘GenDICE: Generalized offline estimation of stationary values’, arXiv preprint arXiv:2002.09072.
  • Zhang, Liu & Whiteson (2020) Zhang, S., Liu, B. & Whiteson, S. (2020), GradientDICE: Rethinking generalized offline estimation of stationary values, in ‘International Conference on Machine Learning’, PMLR, pp. 11194–11203.