
Distributional Shift-Aware Off-Policy Interval Estimation:
A Unified Error Quantification Framework

Wenzhuo Zhou (University of California Irvine), Yuhan Li (University of Illinois Urbana-Champaign), Ruoqing Zhu (University of Illinois Urbana-Champaign), and Annie Qu (University of California Irvine)
Abstract

We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes, where the objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. This task faces two primary challenges: providing a comprehensive and rigorous error quantification in CI estimation, and addressing the distributional shift that results from discrepancies between the distribution induced by the target policy and the offline data-generating process. Motivated by an innovative unified error analysis, we jointly quantify the two sources of estimation errors: the misspecification error on modeling marginalized importance weights and the statistical uncertainty due to sampling, within a single interval. This unified framework reveals a previously hidden tradeoff between the errors, which undermines the tightness of the CI. Relying on a carefully designed discriminator function, the proposed estimator achieves a dual purpose: breaking the curse of the tradeoff to attain the tightest possible CI, and adapting the CI to ensure robustness against distributional shifts. Our method is applicable to time-dependent data without assuming any weak dependence conditions via leveraging a local supermartingale/martingale structure. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings. The numerical performance of the proposed method is examined in synthetic datasets and an OhioT1DM mobile health study.

MSC: 62C20, 62E17

Keywords: Finite-sample confidence bound, Function approximation, Precision medicine, Reinforcement learning, Sequential decision-making

February 15, 2023

1 Introduction

Off-policy evaluation (OPE) is a crucial task in reinforcement learning (RL). Its primary goal is to evaluate a new policy (known as the target policy) using observational data generated by various existing policies (called behavior policies). In many real-world applications, e.g., healthcare [45, 40], financial marketing [62], robotics [64] and education [42], implementing a new policy can be costly, risky, or unsafe. OPE enables us to evaluate the performance of a new policy using only a set of observational data, without interacting with the real environment. This method substantially reduces the cost and risk associated with deploying a new policy, making it a valuable tool in practice.

There have been many works on point estimations of OPE in recent years. Popular approaches include value-based [33, 36, 8], importance sampling based [49, 25, 63], and marginalized importance sampling based approaches [38, 72, 67, 73, 46]. On the theoretical side, [67] established asymptotic optimality and efficiency for OPE in the tabular setting, and [29] provided a complete study of semiparametric efficiency in a more general setting. [70] studied the fundamental hardness of OPE with linear function approximation. We refer readers to [69] for a comprehensive review of OPE problems.

While much of the current research on OPE is centered on point estimation, in many real-world scenarios it is desirable to avoid making overconfident decisions that could result in costly and/or irreversible consequences. That is, rather than solely relying on a point estimate, many applications would benefit significantly from having a confidence interval (CI) on the value of a policy. Existing approaches on off-policy CI estimation are mainly based on asymptotic inference [36, 58, 10, 59] or bootstrapping [21, 22, 51]. In particular, [36] constructed a limiting distribution for the policy value function to estimate CIs in a tabular representation setting, and [58] considered an asymptotic CI for policy value with smoothness assumptions on the action-value function. Typically, the asymptotic inference or bootstrap-based methods require a large number of samples to achieve a desirable coverage. Unfortunately, in practical applications such as mobile health, the number of observations may be limited. For example, the OhioT1DM dataset [43] only contains a few hundred observations [57]. In such scenarios, providing non-asymptotic uncertainty quantification to meet practically feasible conditions may be crucial. [64] constructed an empirical Bernstein inequality applied to the stepwise importance sampling estimator. However, the estimated CI can become quite loose due to the high variance introduced by the stepwise importance weights, particularly in infinite horizon decision-making settings. Recently, [16, 17] proposed a finite-sample CI based on a kernel action-value function estimator. However, their approach relies on a strong assumption that the true model must be correctly specified, which may result in significant bias under potential model misspecification.

More importantly, all the aforementioned CI estimation approaches neglect the influence of the distributional shift, which is arguably the most significant challenge in offline RL. The distribution shift refers to the distributional mismatch between offline data distribution and that induced by target policies [34]. Algorithmically, the distributional shift often leads to compounding biased estimations of action values, resulting in unsatisfactory performance for many OPE algorithms [27]. Practically, in fields such as precision medicine, shifts in treatment regimens may produce arbitrary and sometimes harmful effects that are costly to diagnose and treat [6]. To mitigate this problem, it is essential to account for the distributional shift when performing high-confidence OPE.

From the perspective of error analysis in off-policy CI estimation, existing algorithms are typically developed based on a separate error quantification. These approaches focus either on quantifying the model-misspecification bias, which arises from function approximation, while neglecting sampling uncertainty, or on measuring statistical uncertainty under the assumption that the true model is accurately specified. As a result, the true relationship between the "bias" and the "uncertainty" remains unrevealed, which in turn creates a gap between algorithm development and practical applications. How to rigorously quantify the bias and uncertainty in a unified framework is unknown and challenging. Moreover, existing works usually assume independence or weak dependence, e.g., mixing random processes, in the offline data distribution. This assumption is often violated due to the complicated interdependence of samples in many real RL applications [1]. Additionally, most popular approaches only work when the offline data is induced by a single known or unknown behavior policy; they are not applicable in settings involving a mixture of multiple unknown behavior policies [28].

Motivated by the aforementioned issues, in this paper we propose a novel, robust, and possibly tightest off-policy CI estimation framework in infinite-horizon Markov decision process (MDP) settings. The advances of the proposed method and our contributions are summarized as follows. First, our proposed framework mitigates the issue of distributional shifts by properly incorporating a carefully designed discriminator function. This discriminator captures information about the distributional shift and adjusts the resulting CI accordingly, alleviating the risk of poor estimates when the distribution of offline data deviates from the distribution under the target policy. By doing so, the estimated CI effectively reduces potential decision-making risks and avoids unsatisfactory performance.

Second, we develop a novel and unified error quantification analysis that reveals a previously hidden tradeoff among the sources of errors. Specifically, our framework allows us to decompose the error in CI estimation into two distinct parts: the evaluation bias induced by a misspecification error of the function approximator, and the statistical uncertainty due to sampling. By unifying the two source errors within a single interval, we show that there is an inherent tradeoff between the errors, which can be naturally viewed as a "bias-uncertainty" tradeoff, as depicted in Figure 1. This tradeoff makes it challenging to strike a balance and obtain the tightest possible CI estimation.

Figure 1: An illustrative example of the curse of the tradeoff in CI estimation from the CartPole RL environment. Our unified error quantification analysis shows that the tightness of the estimated CI is determined by both bias and uncertainty, as discussed in Section 3. These factors are influenced by the complexity of the function approximation class. As the model complexity increases (i.e., the number of units increases), the evaluation bias is reduced but the statistical uncertainty increases, and vice versa. This tradeoff makes it challenging to maintain an optimal balance, resulting in a less tight CI. In contrast, our proposed tradeoff-free estimation approach effectively breaks this curse, ultimately yielding a tighter CI with a length of 7.14.

Third, we break the curse of the tradeoff and formulate a tradeoff-free framework, resulting in the possibly tightest CI, illustrated in Figure 1. In smooth MDP settings [58], we leverage the principle of maximum mean discrepancy to control uncertainty and preserve function expressivity simultaneously. For more general MDPs, we carefully calibrate the uncertainty deviation, making it entirely independent of the model-misspecification bias. We achieve this by leveraging the advantageous curvature of the designed discriminator function. As a result, both evaluation bias and sampling uncertainty can be minimized jointly without compromising one or the other.

Fourth, our method demonstrates broad applicability. To the best of our knowledge, our algorithm is the first general function approximation algorithm that can be applied to offline data collected without any weak dependence assumptions on the data distribution. The proposed method fully exploits the (super)martingale structure to handle statistical errors and works particularly efficiently in small sample size settings. Also, our algorithm is stable and efficient for a wide range of function approximators, including deep neural networks. These characteristics significantly expand the application scope of the proposed method.

Lastly, the proposed algorithm is sample-efficient and robust against model-misspecification and optimization errors. By sample efficiency, we mean that the sample complexity requirement for finding a near-tight CI is polynomial with respect to key parameters. We further demonstrate that the finite sample bound for learning a near-tight CI can be independent of the function class complexity, and shrink at a sublinear rate, i.e., $\mathcal{O}(1/n)$, under some mild conditions, where $n$ represents the sample size of the offline dataset. We make substantial progress towards solving the longstanding open problem in offline RL with non-linear function approximation [41, 34]. Thanks to the proposed stochastic approximation optimization algorithm, our approach is proven to converge sublinearly to a stationary point with a vanishing gradient even in non-linear (possibly nonconvex) function approximation settings.

The rest of the paper is organized as follows. Section 2 lays out the basic model notation and data-generating process, as well as the connection between the marginalized importance sampling and linear programming. Section 3 formally defines the discriminator function and quantifies bias and uncertainty through a unified error quantification analysis. Section 4 illustrates the curse of the tradeoff between bias and uncertainty, and introduces a novel framework to overcome this challenge. Section 5 presents an efficient stochastic approximation algorithm for CI estimation. A comprehensive theoretical study of our algorithm is provided in Section 6. Sections 7 and 8 demonstrate the empirical performance of our method. Section 9 concludes this paper with a discussion. All technical proofs and additional discussion can be found in the appendix.

2 Preliminaries

2.1 Markov Decision Process

We consider an infinite-horizon discounted Markov decision process (MDP) $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathds{P},\gamma,r,s^{0}\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathds{P}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ is the Markov transition kernel for some probabilistic simplex $\Delta$, $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, $\gamma\in[0,1)$ is the discount factor and $s^{0}$ is the initial state. A policy $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ induces a distribution of the trajectory $s^{0},a^{0},r^{0},s^{1},\ldots$, where $a^{t}\sim\pi(\cdot|s^{t})$, $r^{t}=r(s^{t},a^{t})$, $s^{t+1}\sim\mathds{P}(\cdot|s^{t},a^{t})$ for any $t\geq 0$. The expected discounted return of a policy is defined as $J(\pi)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r^{t}|\pi]$. The discounted return when the trajectory starts with $(s,a)$ and all remaining actions are taken according to $\pi$ is called the $q$-function $q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow(-\infty,\bar{V}]$. The $q$-function $q^{\pi}$ is the unique fixed point of the Bellman operator $\mathcal{B}^{\pi}$, satisfying the Bellman equation [61]: $\mathcal{B}^{\pi}q(s,a)\coloneqq r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s^{\prime},\pi)]$. Here $q(s^{\prime},\pi)$ is shorthand for $\mathbb{E}_{a^{\prime}\sim\pi\left(\cdot|s^{\prime}\right)}\left[q\left(s^{\prime},a^{\prime}\right)\right]$, and we define $\mathds{P}^{\pi}q(s,a):=\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}\left[q\left(s^{\prime},\pi\right)\right]$.

Another important notion is the normalized discounted visitation of $\pi$, defined as $d^{\pi}(s,a)\coloneqq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d^{\pi,t}(s,a)$, where $d^{\pi,t}$ is the marginal state-action distribution at time-step $t$. Essentially, $d^{\pi}(s,a)$ characterizes the states and actions visited by $\pi$. For notational convenience, we write $\mathbb{E}_{d^{\pi}}[\cdot]$ to represent $\mathbb{E}_{d^{\pi}}\left[g\left(s,a,r,s^{\prime}\right)\right]:=\int_{s,a}d^{\pi}(s,a)\mathds{1}_{\{r=r(s,a)\}}\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}\left[g\left(s,a,r,s^{\prime}\right)\right]$. In the offline RL setting, there exists an unknown offline data-generating distribution $\mu$ induced by behavior policies. With a slight abuse of notation, we refer to both the marginal distribution over $(s,a)$ and the joint distribution over $(s,a,r,s^{\prime})$ as $\mu$. Although $\mu$ is unknown, we observe a set of transition pairs, forming the offline dataset $\mathcal{D}_{1:n}\coloneqq\{s_{i},a_{i},r_{i},s^{\prime}_{i}\}^{n}_{i=1}$ sampled from $\mu$.

We impose the following minimal data collection condition throughout the paper:

\[
\mu(r_{i},s^{\prime}_{i}|s_{i},a_{i},s_{i-1},a_{i-1},r_{i-1},s^{\prime}_{i-1},\ldots,s_{1},a_{1})=\mathds{P}(s^{\prime}_{i}|s_{i},a_{i})\mathds{1}_{\{r_{i}=r(s_{i},a_{i})\}}.
\]

This condition relaxes the standard assumptions widely made in the RL literature; e.g., [38, 58, 37] typically require the sample path to be either i.i.d. or weakly dependent. In contrast, we only require that the reward $r_{i}$ and the transition state $s^{\prime}_{i}$ follow the rules specified by $r$ and $\mathds{P}$, but do not impose any restrictions on the generation of the state-action pair $(s_{i},a_{i})$. Therefore, the dataset $\mathcal{D}_{1:n}$ can be collected as a set of multiple trajectories with interdependent samples, or under a mix of arbitrary unknown behavior policies. Ultimately, the goal of high-confidence OPE is to construct an efficient CI for estimating the return of the policy $\pi$ using the offline dataset.

2.2 Marginalized Importance Sampling and Linear Programming

In this section, we provide a concise review of two essential concepts in OPE: marginalized importance sampling (MIS) [38] and linear programming (LP) [11]. These concepts form the foundation for our subsequent discussion. Following the definition of MIS in [38], we have $\tau_{d^{\pi}/\mu}(s,a):=\frac{d^{\pi}(s,a)}{\mu(s,a)}$, which characterizes the ratio of the visitation induced by the target policy to that of the behavior policy. If it exists, $\mathbb{E}_{d^{\pi}}\left[g\left(s,a,r,s^{\prime}\right)\right]=\mathbb{E}_{\tau_{d^{\pi}/\mu}}\left[g\left(s,a,r,s^{\prime}\right)\right]=\mathbb{E}_{\mu}\left[\tau_{d^{\pi}/\mu}(s,a)g\left(s,a,r,s^{\prime}\right)\right]$, where $\mathbb{E}_{\tau}[\cdot]:=\mathbb{E}_{\mu}[\tau(s,a)(\cdot)]$ is shorthand used throughout the paper. The return $J(\pi)$ can be obtained by re-weighting the reward with respect to the MIS weight, i.e., $J(\pi)=\frac{\mathbb{E}_{d^{\pi}}[r]}{1-\gamma}=\frac{\mathbb{E}_{\tau_{d^{\pi}/\mu}}[r]}{1-\gamma}$. Motivated by this fact, [67, 56] proposed a minimax estimator that approximates $\tau_{d^{\pi}/\mu}$ by searching for a weight function $\tau(s,a):=\frac{d(s,a)}{\mu(s,a)}$ in a specified function class $\tau\in\Omega$:

\[
\min_{\tau\in\Omega}\max_{q\in\mathcal{Q}}W(q,\tau):=\mathbb{E}_{\mu}\left[\gamma\tau(s,a)q\left(s^{\prime},\pi\right)-\tau(s,a)q(s,a)\right]+(1-\gamma)q\left(s^{0},\pi\right). \tag{1}
\]

Once $\tau_{d^{\pi}/\mu}$ is solved, $d^{\pi}$ can be obtained using the chain rule, i.e., $d^{\pi}=\tau_{d^{\pi}/\mu}\cdot\mu$.
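For concreteness, the following is a minimal sketch (ours, not from the paper) of a Monte Carlo estimate of the objective $W(q,\tau)$ in (1) on a tabular MDP; the array-based encoding of $\tau$, $q$, and $\pi$ and the function name are illustrative assumptions.

\begin{verbatim}
import numpy as np

def empirical_W(tau, q, pi, data, s0, gamma):
    """Monte-Carlo estimate of the minimax objective W(q, tau) in (1).

    data   : list of transition tuples (s, a, r, s') with integer-coded
             states/actions (a simplifying assumption of this sketch).
    tau, q : arrays indexed as tau[s, a], q[s, a].
    pi     : target-policy matrix, pi[s, a] = pi(a | s).
    """
    terms = []
    for s, a, r, s_next in data:
        q_next = np.dot(pi[s_next], q[s_next])      # q(s', pi)
        terms.append(tau[s, a] * (gamma * q_next - q[s, a]))
    q_s0 = np.dot(pi[s0], q[s0])                    # q(s0, pi)
    return np.mean(terms) + (1.0 - gamma) * q_s0
\end{verbatim}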

Recently, [46] showed that the above minimax problem has an equivalent solution to the LP problem [11]: $\text{maximize}_{d}\;\mathbb{H}(d)\equiv 0$ subject to

\[
(1-\gamma)\mathds{1}\{s=s^{0}\}\pi(a|s)+\gamma\pi(a|s)\sum_{{s}^{\prime},{a}^{\prime}}\mathds{P}(s|{s}^{\prime},{a}^{\prime})d({s}^{\prime},{a}^{\prime})=d(s,a),\;\text{for any}\;s,a, \tag{2}
\]

where $\mathbb{H}(\cdot)$ is a functional that takes a visitation $d$ as input and is set to be a constant zero function for the LP problem. It is crucial to note that (2) is over-constrained for any given $s,a$. Specifically, the visitation $d^{\pi}$ can be uniquely identified solely based on the constraints. This over-constrained characteristic explains why $\mathbb{H}(d)$ can be set as a zero constant function and why (2) is sufficient to recover the optimal solution $d^{\pi}$. In the following, we will utilize this property to incorporate distributional shift information into the proposed framework.

3 Off-Policy Confidence Interval Estimation

In this section, we propose a novel off-policy CI estimation framework that can adapt to distributional shifts. To develop this framework, we first decompose the source errors in CI estimation into two critical components: evaluation bias, which arises from model misspecifications, and statistical uncertainty stemming from sampling. We subsequently integrate and quantify both components, allowing them to be unified within a single interval. The way to encode the distributional shift information is naturally introduced in evaluation bias quantification.

3.1 Evaluation Bias Quantification

We study the first major component in CI estimation, i.e., evaluation bias, and propose a value interval method to quantify it. The developed framework is robust to model misspecification and naturally encodes the distributional shift information. To begin with, we formally define the discriminator function, which serves a dual purpose in this work. For now, we primarily concentrate on one of these roles, which is to measure the extent of deviation between the offline data-generating distribution $\mu$ and the distribution induced by the target policy. Later, in Section 4, we will demonstrate the other role that the discriminator function plays in estimating the CI.

Definition 3.1.

For $x,c_{1},c_{2},C\in\mathbb{R}^{+}$ and $C\geq 1$, the discriminator function $\mathbb{G}(\cdot)$ satisfies the following conditions: (1) Strong convexity: $\mathbb{G}(x)$ is $M$-strongly convex with respect to $x$. (2) Boundedness: $|\mathbb{G}(x)|\leq c_{1}$ for $x\in[0,C]$. (3) Boundedness of the first-order derivative: $|\mathbb{G}^{\prime}(x)|\leq c_{2}$ if $x\in[0,C]$. (4) Non-negativity: $\mathbb{G}(x)\geq 0$. (5) 1-minimum: $\mathbb{G}(1)=0$.

The family of Rényi entropy [54], Bhattacharyya distance [9], and simple quadratic form functions all satisfy the conditions outlined in Definition 3.1.
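As a concrete illustration, the following sketch (ours, with an assumed bound $C=4$) numerically checks the conditions of Definition 3.1 for the quadratic discriminator $\mathbb{G}(x)=(x-1)^{2}$, which is 2-strongly convex.

\begin{verbatim}
import numpy as np

def G(x):
    """Quadratic discriminator G(x) = (x - 1)^2."""
    return (x - 1.0) ** 2

def G_prime(x):
    """First derivative G'(x) = 2(x - 1)."""
    return 2.0 * (x - 1.0)

C = 4.0                                   # assumed upper bound on tau
grid = np.linspace(0.0, C, 1001)
c1 = np.max(np.abs(G(grid)))              # boundedness on [0, C]
c2 = np.max(np.abs(G_prime(grid)))        # bounded first-order derivative
assert np.all(G(grid) >= 0.0)             # non-negativity
assert np.isclose(G(1.0), 0.0)            # 1-minimum
# strong convexity: G''(x) = 2, so G is M-strongly convex with M = 2
print(f"c1 = {c1:.1f}, c2 = {c2:.1f}, M = 2")
\end{verbatim}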

Remark 3.1.

To understand the motivation behind the design of the discriminator function, consider taking the marginal importance ratio $\tau_{d^{\pi}/\mu}(s,a)=d^{\pi}(s,a)/\mu(s,a)$ as the input for $\mathbb{G}(\cdot)$. The 1-minimum condition at $d^{\pi}(s,a)/\mu(s,a)=1$, i.e., $d^{\pi}(s,a)=\mu(s,a)$, indicates that no distributional shift is present or detected by the discriminator $\mathbb{G}$. The strong convexity property characterizes the uniqueness of the minimum point and ensures that the rate of change increases as $d^{\pi}(s,a)$ deviates from $\mu(s,a)$. The boundedness condition on the first-order derivative imposes smoothness on $\mathbb{G}$, while the other two conditions render $\mathbb{G}$ practically quantifiable.

As discussed in Section 2.2, the LP problem (2) is over-constrained and the visitation $d^{\pi}$ can be uniquely identified. Consequently, one may replace the objective $\mathbb{H}(d)$ with another functional, provided that the newly-replaced functional does not affect the optimal solution $d^{*}=d^{\pi}$. Motivated by this observation, we propose to choose $\mathbb{H}(d)=\lambda\mathbb{E}_{\mu}[\mathbb{G}(\tau(s,a))]$, where $\tau(s,a):=\frac{d(s,a)}{\mu(s,a)}$ and the hyperparameter $\lambda\in\mathbb{R}$ encodes the user's preference and degree of sensitivity to the distributional shift. When $\lambda<0$, the user favors small distributional shifts and appeals to pessimistic evaluation, which encourages downgrading policies associated with large distributional shifts. For $\lambda=0$, the user maintains a neutral attitude toward the distributional shift; furthermore, our framework degenerates to (2) or (1) under this condition. From this perspective, both the minimax estimation in (1) and the LP problem in (2) can be considered special cases of our approach with $\lambda=0$. Using the connection between (1) and (2), we can convert the newly-defined LP problem with $\mathbb{H}(d)=\lambda\mathbb{E}_{\mu}[\mathbb{G}(\tau(s,a))]$ to a max-min problem. This can be expressed as $\min_{\tau\in\Omega}\max_{q\in\mathcal{Q}}L(q,\tau)$, where

\[
L(q,\tau):=\frac{\mathbb{E}_{\mu}[\tau(s,a)\left(\gamma q\left(s^{\prime},\pi\right)-q(s,a)\right)+\lambda\mathbb{G}(\tau(s,a))]}{1-\gamma}+q\left(s^{0},\pi\right).
\]

Unfortunately, directly optimizing the objective function $L(q,\tau)$ has some drawbacks. To comprehensively reveal these drawbacks, we first present a key lemma to support our arguments.

Lemma 3.1.

For any target policy $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ and $\tau:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{+}$,

\[
\frac{\mathbb{E}_{\tau}[r(s,a)]}{1-\gamma}-J^{\alpha}(\pi)=\frac{\mathbb{E}_{\tau}\left[q^{\pi}(s,a)-\gamma q^{\pi}\left(s^{\prime},\pi\right)+\lambda\mathbb{G}(\tau(s,a))/\tau(s,a)\right]}{1-\gamma}-q^{\pi}(s^{0},\pi),
\]

where $J^{\alpha}(\pi):=J(\pi)+\lambda\xi(\mathbb{G},\tau)$ for $\xi(\mathbb{G},\tau):=\mathbb{E}_{\mu}[\frac{\mathbb{G}(\tau(s,a))}{1-\gamma}]$, and $q^{\pi}$ is the unique fixed point of the Bellman equation $\mathcal{B}^{\pi}q=q$.

Lemma 3.1 suggests that the rationale behind $\min_{\tau\in\Omega}\max_{q\in\mathcal{Q}}L(q,\tau)$ is to find a "good" $\tau^{*}$ such that $\frac{\mathbb{E}_{\tau}[r(s,a)]}{1-\gamma}$ and $J^{\alpha}(\pi)$ are sufficiently close when $q^{\pi}\in\mathcal{Q}$. Then $\tau^{*}$ can be plugged into $\frac{\mathbb{E}_{\tau}[r(s,a)]}{1-\gamma}$ to approximate $J^{\alpha}(\pi)$. In fact, this rationale also implicitly requires the existence of $\tau^{*}\in\Omega$. When $\Omega$ is misspecified, the evaluation will exhibit a high bias. For OPE problems significantly affected by the distributional shift, the true model is particularly challenging to specify [24, 7]. Additionally, another limitation of the min-max estimation is that the objective function $L(q,\tau)$ completely disregards the reward information.

Motivated by the above-mentioned drawbacks, we propose a more robust interval method for estimation. Free from any model-specification assumptions on $\Omega$, our proposal implicitly quantifies the evaluation bias resulting from the model-misspecification error within the interval. According to Lemma 3.1 and assuming $q^{\pi}$ is correctly specified, the following inequality holds for any $\tau\in\Omega$:

\[
J(\pi) \geq \inf_{q\in\mathcal{Q}}\left\{\frac{\mathbb{E}_{\tau}\left[r(s,a)+\gamma q\left(s^{\prime},\pi\right)-q(s,a)\right]}{1-\gamma}+q(s^{0},\pi)\right\}-\lambda^{-}\xi(\mathbb{G},\tau) \coloneqq \inf_{q\in\mathcal{Q}}L(\tau,q,\pi)-\lambda^{-}\xi(\mathbb{G},\tau), \tag{3}
\]

where $\lambda^{-}:=-\min\{0,\lambda\}$. Similarly, the return $J(\pi)$ can be bounded from above:

\[
J(\pi)\leq\sup_{q\in\mathcal{Q}}L(\tau,q,\pi)+\lambda^{+}\xi(\mathbb{G},\tau),\;\text{for any}\;\tau\in\Omega, \tag{4}
\]

where $\lambda^{+}:=\max\{0,\lambda\}$. Considering that (3) and (4) hold for any $\tau\in\Omega$, we can optimize them over $\tau$ to obtain the tightest possible interval:

\[
J(\pi)\in\left[\sup_{\tau\in\Omega}\inf_{q\in\mathcal{Q}}L(\tau,q,\pi)-\lambda^{-}\xi(\mathbb{G},\tau),\;\inf_{\tau\in\Omega}\sup_{q\in\mathcal{Q}}L(\tau,q,\pi)+\lambda^{+}\xi(\mathbb{G},\tau)\right], \tag{5}
\]

where we denote the upper and lower bounds as $J^{+}(\pi)$ and $J^{-}(\pi)$, respectively. This optimization can also be interpreted as searching for the best $\tau^{*}$ within the class $\Omega$ to minimize the model-misspecification error. Based on the value interval in (5), we demonstrate that the evaluation bias, i.e., the model-misspecification error of $\Omega$, is implicitly quantified in the interval and can be decomposed into two components.

Theorem 3.1 (Evaluation bias).

For any target policy $\pi$, the evaluation bias is bounded above:

\[
J^{+}(\pi)-J^{-}(\pi)\leq 2\bigg{(}\underbrace{\inf_{\tau\in\Omega}\bigg{\{}\sup_{q\in\mathcal{Q}}\left|\mathbb{E}_{\tau}\Big{[}\frac{\gamma q(s^{\prime},\pi)-q(s,a)}{1-\gamma}+q(s^{0},\pi)\Big{]}\right|}_{\epsilon_{mis}}+\frac{1}{2}\underbrace{|\lambda|\xi(\mathbb{G},\tau)}_{\epsilon_{dist}}\bigg{\}}\bigg{)}.
\]

Theorem 3.1 implies that the evaluation bias consists of two components: the model-misspecification error $\epsilon_{mis}$ and the bias caused by the discriminator function, denoted by $\epsilon_{dist}$. The latter can be effectively controlled by adjusting the magnitude of $\lambda$. The former, $\epsilon_{mis}$, depends on the quality of the model specification of $\Omega$. In particular, if the class $\Omega$ is well-specified, $\epsilon_{mis}$ can diminish to zero. We direct readers to Section A in the appendix for a more comprehensive discussion on the tightness and validity of the interval.

Remark 3.2.

In contrast to the role of $\Omega$, the class $\mathcal{Q}$ primarily influences bias quantification. However, there is no need to be concerned about a tradeoff between bias and uncertainty, since the specification of $\mathcal{Q}$ solely affects bias quantification. Consequently, an expressive function class can generally be chosen for modeling $\mathcal{Q}$. Furthermore, since the specification of the function class $\mathcal{Q}$ is independent of uncertainty quantification, it is feasible to construct an optimal data-dependent $\mathcal{Q}$ using the information generated by fitting $q^{\pi}$ with any state-of-the-art method, such as double reinforcement learning [29]. This may increase the likelihood of correctly specifying $\mathcal{Q}$. We offer an in-depth discussion on the role of $\mathcal{Q}$ and demonstrate that the evaluation bias resulting from the model-misspecification errors of $\mathcal{Q}$ can depend on the expressivity of the feature mapping class in Theorem B.1, provided in the appendix.

To conclude this section, we remark that the value interval (5) directly quantifies the evaluation bias and incorporates a well-controlled adjustment scheme for distributional shifts. As a result, our proposed framework is robust against both model-misspecification errors and bias estimation due to distributional shifts. It is important to note that the value interval (5) is not a CI, as it does not address the uncertainty arising from sampling. In the following section, we will explore uncertainty quantification using our established interval estimation framework.

3.2 Statistical Uncertainty Quantification

In this section, we first examine the potential sources of uncertainty that may appear in CI estimation. We then establish two types of CIs for the scenarios where $\lambda=0$ and $\lambda<0$, and separately quantify the uncertainty in these two CIs. Intuitively, these two CIs represent neutral and pessimistic attitudes towards distributional shifts. We begin by presenting a key lemma for our proposal. Recall that we do not impose any independent or weakly dependent data collection assumptions, e.g., mixing conditions, for uncertainty quantification.

Lemma 3.2 (Empirical evaluation lemma).

Given offline data $\mathcal{D}_{1:n}$, for any target policy $\pi$, the following equation holds:

\[
\bar{J}^{\alpha}(\pi)=\frac{1}{n}\sum^{n}_{i=1}\frac{\tau(s_{i},a_{i})(r_{i}+\gamma\mathds{P}^{\pi}q^{\pi}(s_{i},a_{i})-q^{\pi}(s_{i},a_{i}))+\lambda\mathbb{G}(\tau(s_{i},a_{i}))}{1-\gamma}+q^{\pi}(s^{0},\pi), \tag{6}
\]

where $\bar{J}^{\alpha}(\pi)=J(\pi)+\lambda\xi_{n}(\mathbb{G},\tau)$ for $\xi_{n}(\mathbb{G},\tau)=\frac{1}{n}\sum^{n}_{i=1}\frac{\mathbb{G}(\tau(s_{i},a_{i}))}{1-\gamma}$.

Lemma 3.2 suggests that there are two sources of uncertainty when constructing the CI. The first source originates from the estimand $\{\tau(s_{i},a_{i})(r_{i}+\gamma\mathds{P}^{\pi}q^{\pi}(s_{i},a_{i})-q^{\pi}(s_{i},a_{i}))\}^{n}_{i=1}$ and can be approximated by $\{\tau(s_{i},a_{i})(r_{i}+\gamma q^{\pi}(s^{\prime}_{i},\pi)-q^{\pi}(s_{i},a_{i}))\}^{n}_{i=1}$. The second source stems from the discriminator terms $\{\mathbb{G}(\tau(s_{i},a_{i}))\}^{n}_{i=1}$. Once the uncertainty from these two sources is quantified, it can be incorporated into the CIs according to Lemma 3.2.

3.3 Uncertainty in Neutral Confidence Interval

In the following, we present our results on uncertainty quantification and establish the CI for the case $\lambda=0$, which implies that the distributional shift is not taken into account. Specifically, we characterize the uncertainty deviation, denoted by $\sigma_{n}$ in (8), which constitutes a major component of the proposed neutral CI in (7).

Theorem 3.2 (Neutral CI).

For any target policy $\pi$, suppose $q^{\pi}\in\mathcal{Q}$ and $\tau\in\Omega$; then the return $J(\pi)$ falls into the following CI with probability at least $1-\delta$,

\[
J(\pi)\in\left[\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}-\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(-q,\tau)-\sigma_{n},\ \frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}+\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(q,\tau)+\sigma_{n}\right], \tag{7}
\]

if the uncertainty deviation $\sigma_{n}$ satisfies

\[
P\left(\sup_{\tau\in\Omega}\Big{|}\widehat{L}^{q}_{n}(q^{\pi},\tau)\Big{|}\leq\sigma_{n}\right)\geq 1-\delta, \tag{8}
\]

where

\begin{align*}
\widehat{L}^{q}_{n}(q,\tau) &:=\frac{1}{n}\sum^{n}_{i=1}\frac{\tau(s_{i},a_{i})\left(q(s_{i},a_{i})-r_{i}-\gamma q(s^{\prime}_{i},\pi)\right)}{1-\gamma},\\
\widehat{M}^{\tau}_{n}(q,\tau) &:=\frac{1}{n}\sum^{n}_{i=1}\frac{\tau(s_{i},a_{i})(\gamma q(s^{\prime}_{i},\pi)-q(s_{i},a_{i}))}{1-\gamma}+q(s^{0},\pi).
\end{align*}

A closer examination of (7) and (8) reveals two key insights. Regarding the class $\mathcal{Q}$, even though the bias quantification necessitates searching possible $q$ uniformly over $\mathcal{Q}$, the statistical uncertainty deviation $\sigma_{n}$ can be evaluated point-wise, specifically at the true $q$-function $q^{\pi}$, as shown in (8). Regarding the class $\Omega$, the result in (7) holds for any $\tau\in\Omega$, even if the specified $\Omega$ class does not capture the true ratio $\tau_{d^{\pi}/\mu}$. This suggests that the CI in (7) is robust against model-misspecification errors. The misspecification of $\Omega$ only influences the tightness of the CI but does not compromise its validity. Also, optimizing the confidence lower and upper bounds in (7) over $\tau\in\Omega$ can help tighten the CI.
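For illustration, here is a minimal sketch (not part of the paper's implementation) of the empirical statistics $\widehat{L}^{q}_{n}$ and $\widehat{M}^{\tau}_{n}$ and of the neutral CI (7) for a fixed $\tau$, assuming tabular arrays and a finite candidate set standing in for $\mathcal{Q}$; all function names are hypothetical.

\begin{verbatim}
import numpy as np

def L_hat(q, tau, pi, data, gamma):
    """Empirical statistic L^q_n(q, tau) from Theorem 3.2."""
    vals = [tau[s, a] * (q[s, a] - r - gamma * np.dot(pi[s2], q[s2]))
            for s, a, r, s2 in data]
    return np.mean(vals) / (1.0 - gamma)

def M_hat(q, tau, pi, data, s0, gamma):
    """Empirical statistic M^tau_n(q, tau) from Theorem 3.2."""
    vals = [tau[s, a] * (gamma * np.dot(pi[s2], q[s2]) - q[s, a])
            for s, a, _r, s2 in data]
    return np.mean(vals) / (1.0 - gamma) + np.dot(pi[s0], q[s0])

def neutral_ci(q_class, tau, pi, data, s0, gamma, sigma_n):
    """Neutral CI (7) for a fixed tau; the supremum over Q is taken over a
    finite candidate list q_class (a simplifying assumption of this sketch)."""
    center = np.mean([r * tau[s, a] for s, a, r, _ in data]) / (1.0 - gamma)
    lower = center - max(M_hat(-q, tau, pi, data, s0, gamma) for q in q_class) - sigma_n
    upper = center + max(M_hat(q, tau, pi, data, s0, gamma) for q in q_class) + sigma_n
    return lower, upper
\end{verbatim}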

We still need to determine the uncertainty deviation $\sigma_{n}$ in (8) to obtain a tractable CI. Typically, the deviation $\sigma_{n}$ depends on the complexity of the hypothesis function class $\Omega$ and on uniform concentration arguments. To measure the function class complexity, we use a generalization of the VC-dimension, called the pseudo-dimension (see Definition C.3 in the appendix). As for the uniform concentration arguments, we construct a martingale difference sequence, which allows us to apply the empirical Freedman's concentration inequality [65]. In the following, we illustrate our general result for determining $\sigma_{n}$ using $L_{\infty}$-bounded classes for modeling $\Omega$.

Theorem 3.3 ($L_{\infty}$-function class).

For any target policy $\pi$, we define a composite function class $\mathcal{G}\circ\Omega=\{g\circ\tau:g\in\mathcal{G},\tau\in\Omega\}$ such that $g\circ\tau(s,a,r,s^{\prime})={\tau}(s,a)\big{(}r(s,a)+\gamma{q}^{\pi}(s^{\prime},{\pi})-{q}^{\pi}(s,a)\big{)}$ for any $(s,a,r,s^{\prime})$. Suppose $0\leq\tau(s,a)\leq C$ for any $s,a$; then Theorem 3.2 holds if we choose

\[
\sigma_{n}=C\sqrt{\frac{2\bar{V}^{2}\ln\frac{8\mathcal{N}\left(1/\sqrt{n},\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)}{\delta}}{n(1-\gamma)^{2}}}+\frac{2C\bar{V}\ln\frac{8\mathcal{N}\left(1/\sqrt{n},\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)}{\delta}}{3(1-\gamma)n}, \tag{9}
\]

for $\mathcal{N}\left(\epsilon,\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)\leq e\left(D_{\Omega}+1\right)(4e\bar{V}^{2})^{D_{\Omega}}\left({1}/{\epsilon}\right)^{D_{\Omega}}$, where $\mathcal{N}\left(\epsilon,\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)$ is the $\epsilon$-covering number (see Definition C.1 in the appendix) of $\mathcal{G}\circ\Omega$, and $D_{\Omega}$ is the pseudo-dimension of the function class $\Omega$.

As shown in (9), the deviation $\sigma_{n}$ increases with the pseudo-dimension of $\Omega$ and decreases with the sample size at the rate $\mathcal{O}(1/\sqrt{n})$. Note that this is an information-theoretic result, as finding the exact value of the pseudo-dimension for an arbitrary $L_{\infty}$-class function is challenging. To refine our result, we consider using a feature mapping class for modeling $\Omega$. Let $\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d}$ be a $d$-dimensional feature mapping, and define the feature mapping class $\Omega_{\psi}:=\{(s,a)\mapsto\langle\phi(s,a),\psi\rangle\,|\,\|\psi\|_{L_{2}}\leq\text{diam}_{\psi}\}$, where $\text{diam}_{\psi}\in(0,\infty)$.

Corollary 3.1.

For any target policy $\pi$, suppose $\tau\in\Omega_{\psi}$; then Theorem 3.2 holds if we choose

\[
\sigma_{n}=\frac{\bar{V}\sqrt{2\lambda_{\max}(\Sigma_{n})\ln\frac{4d\sqrt{n}\bar{V}\text{diam}_{\psi}+2}{\delta}}}{2(1-\gamma)n}+\frac{2\,\text{diam}_{\psi}\bar{V}\ln\frac{4d\sqrt{n}\bar{V}\text{diam}_{\psi}+2}{\delta}}{3(1-\gamma)n}, \tag{10}
\]

where $\lambda_{\max}(\Sigma_{n})$ is the maximum eigenvalue of the matrix $\Sigma_{n}=\sum^{n}_{i=1}\phi(s_{i},a_{i})\phi^{\top}(s_{i},a_{i})$.

Corollary 3.1 provides a tractable quantification of the uncertainty deviation $\sigma_{n}$. Each element in (10) is either user-specified or can be calculated using the offline data $\mathcal{D}_{1:n}$. Compared to the vanilla rate of $\mathcal{O}(1/\sqrt{n})$ in Theorem 3.3, the uncertainty deviation can vanish at a faster sublinear rate, i.e., $\mathcal{O}(1/n)$, if $\lambda_{\max}(\Sigma_{n})\ll\text{diam}_{\psi}$. This improvement in the rate of convergence benefits from our careful analysis of the variance terms with respect to $\Omega_{\psi}$.
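The deviation in (10) can be computed directly from the offline data; the following is a minimal sketch under the assumption that the feature vectors $\phi(s_{i},a_{i})$ are stacked into a matrix, with illustrative parameter values.

\begin{verbatim}
import numpy as np

def sigma_n_feature(Phi, V_bar, diam_psi, gamma, delta):
    """Uncertainty deviation sigma_n from Corollary 3.1, Eq. (10).

    Phi : (n, d) matrix whose i-th row is phi(s_i, a_i).
    """
    n, d = Phi.shape
    Sigma_n = Phi.T @ Phi                       # sum_i phi phi^T
    lam_max = np.linalg.eigvalsh(Sigma_n)[-1]   # largest eigenvalue
    log_term = np.log((4.0 * d * np.sqrt(n) * V_bar * diam_psi + 2.0) / delta)
    first = V_bar * np.sqrt(2.0 * lam_max * log_term) / (2.0 * (1.0 - gamma) * n)
    second = 2.0 * diam_psi * V_bar * log_term / (3.0 * (1.0 - gamma) * n)
    return first + second

# example usage with random features (illustration only)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 10))
print(sigma_n_feature(Phi, V_bar=10.0, diam_psi=1.0, gamma=0.9, delta=0.05))
\end{verbatim}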

3.4 Uncertainty in Pessimistic Confidence Interval

In this section, we establish the CI for $\lambda<0$, which we call the pessimistic CI. Since $\lambda\neq 0$, the pessimistic CI incorporates the discriminator function defined in Definition 3.1. In the case of $\lambda<0$, the pessimistic CI favors target policies with small distributional shifts. This property allows the pessimistic CI to differentiate the goodness of policies that suffer from biased estimation due to distributional shifts [34]. Additionally, compared to the uncertainty quantification in the neutral CI, we need to take one more step to quantify the uncertainty arising from the estimand of the discriminator function. In the following theorem, we present our formal result on statistical sampling uncertainty quantification and establish the pessimistic CI.

Theorem 3.4 (Pessimistic CI).

For any target policy $\pi$, suppose $q^{\pi}\in\mathcal{Q}$; then the return $J(\pi)$ falls into the following CI for any $\tau\in\Omega$ with probability at least $1-\delta$,

\[
\bigg[\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}-\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(-q,\tau)-\lambda^{-}\xi_{n}(\mathbb{G},\tau)-\sigma^{L}_{n},\ \frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}+\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(q,\tau)+\sigma^{U}_{n}\bigg], \tag{11}
\]

if the uncertainty deviations $\sigma^{L}_{n}$ and $\sigma^{U}_{n}$ satisfy

\[
P\Bigg{(}\left\{\sup_{\tau\in\Omega}\widehat{L}^{q}_{n}(q^{\pi},\tau)\leq\sigma^{U}_{n}\right\}\;\bigcap\;\left\{\inf_{\tau\in\Omega}\big{\{}\widehat{L}^{q}_{n}(q^{\pi},\tau)+\lambda^{-}\xi_{n}(\mathbb{G},\tau)\big{\}}\geq-\sigma^{L}_{n}\right\}\Bigg{)}\geq 1-\delta. \tag{12}
\]

The confidence upper bound in (11) is nearly identical to the one in the neutral CI, so we do not discuss it further here. However, the confidence lower bound in (11) differs from the one in (7). The current confidence lower bound incorporates distributional shift information as well as the corresponding uncertainty included in the deviation $\sigma^{L}_{n}$. By doing so, the estimand of the discriminator $\xi_{n}(\mathbb{G},\tau)$ helps to offset the evaluation bias caused by distributional shifts. When users follow the pessimism principle [31, 26], the confidence lower bound can make policy evaluation and optimization more reliable when the offline dataset is poorly explored or exhibits significant distributional shifts. Careful readers may notice that the involvement of the discriminator tends to widen the interval. However, the degree of widening can be well controlled by $\lambda^{-}$. Additionally, due to the potentially tightest construction of the CI, which will be demonstrated in the next section, the pessimistic CI does not lead to overly pessimistic reasoning even with the inclusion of the extra discriminator.

To make (11) estimable, we need to determine the uncertainty deviations $\sigma^{U}_{n}$ and $\sigma^{L}_{n}$. For $\sigma^{U}_{n}$, we can easily follow the derivation in the neutral CI. However, for $\sigma^{L}_{n}$, the quantification is slightly more complicated. To determine $\sigma^{L}_{n}$, we construct a pseudo estimator

\[
\mathcal{U}(s_{i},a_{i}):=(1-\gamma)^{-1}\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s_{i},a_{i})}[\tau(s_{i},a_{i})(q(s_{i},a_{i})-r_{i}-\gamma q(s^{\prime},\pi))+\lambda^{-}\mathbb{G}(\tau(s_{i},a_{i}))],
\]

such that $\{\widehat{L}^{q}_{n}(q^{\pi},\tau)+\lambda^{-}\xi_{n}(\mathbb{G},\tau)-n^{-1}\sum^{n}_{i=1}\mathcal{U}(s_{i},a_{i})\}$ forms a summation of a martingale difference sequence. We can then follow an almost identical martingale concentration technique as in Theorem 3.3 to determine $\sigma^{L}_{n}$, so we omit it here. In the next section, we will provide a more elegant and tighter approach for quantifying the uncertainty deviation $\sigma^{L}_{n}$.

4 Breaking the Curse of Tradeoff

In this section, we first highlight a finding that there exists a hidden tradeoff between the evaluation bias, resulting from model misspecification, and the statistical uncertainty, originating from sampling, as implied by the unified error quantification analysis for CI estimation. This seeming conflict prevents the simultaneous minimization of these two sources of errors in order to obtain tighter CIs. This phenomenon can be naturally viewed as the curse of the "bias-uncertainty" tradeoff, as shown in Figure 1. Based on this observation, we propose two efficient solutions to break the curse for the neutral and pessimistic CI, respectively.

4.1 Bias and Uncertainty Tradeoff

As discussed in the previous sections, bias and uncertainty are the two major sources of errors in CI estimation. In order to obtain a tight interval, one would hope to minimize both of them. Statistical uncertainty, as shown in (9), is highly dependent on the complexity of the function class, measured in terms of the pseudo-dimension $D_{\Omega}$. In general, a high pseudo-dimension indicates that the set of functions is complex and thus requires a sufficiently large deviation to control the uncertainty uniformly over all functions in the set. To minimize the uncertainty deviation, we prefer to model $\Omega$ using a parsimonious class of functions with a low pseudo-dimension. As for evaluation bias, Theorem 3.1 indicates that the model-misspecification error of $\Omega$ is a major component. Minimizing bias requires minimizing the model-misspecification error. Generally, a more expressive function class, i.e., one with a high pseudo-dimension, is more likely to capture the true model and potentially reduce the bias due to model misspecification. For example, [59] allow the VC-dimension to diverge with the sample size to reduce the estimator's bias due to model misspecification. If we translate this principle to our framework, it is equivalent to allowing the pseudo-dimension $D_{\Omega}=\mathcal{O}(n^{\kappa})$, where $0<\kappa\leq 1$.

Taking both bias and uncertainty into account, we regrettably identify a tradeoff between them, as exemplified in Figure 1. On one hand, a more parsimonious function class can attain enhanced efficiency in uncertainty quantification. However, this comes at the expense of sacrificing the function class’s expressivity, consequently increasing the bias. On the other hand, a more expressive function class aids in reducing model misspecification errors but demands a higher cost to account for the uncertainty inherent in a complex set of functions. Is there a way to break the curse of the bias and uncertainty tradeoff? In the following sections, we will present our solution to this question.

4.2 Tradeoff-Free Neutral Confidence Interval

In this section, we propose a kernel representation approach to break the curse of the tradeoff for the neutral CI in smooth MDP settings. Under this representation, the uncertainty deviation is independent of the function class complexity. We make use of a powerful tool in concentration measures, Pinelis’ inequality [48], to quantify the data-dependent uncertainty deviation in Hilbert space. As a result, this can achieve minimum costs regarding uncertainty quantification, while simultaneously allowing the model-misspecification error to be arbitrarily minimized due to the flexibility of the kernel representation. To begin with, we introduce the definition of the smooth MDP.

Definition 4.1.

Given an infinite-horizon discounted MDP $\mathcal{M}=\left\{\mathcal{S},\mathcal{A},\mathds{P},\gamma,r,s^{0}\right\}$, suppose the reward function $r(s,a)$ and the Markov transition kernel $\mathds{P}(\cdot|s,a)$ are continuous over $(s,a)$; then we call $\mathcal{M}$ a smooth MDP and denote it as $\widetilde{\mathcal{M}}$.

Smooth MDP settings are widely satisfied in many real-world applications and studies in the existing literature [35, 58]. Recall that our goal is to quantify the uncertainty deviation $\sup_{\tau\in\Omega}|\widehat{L}^{q}_{n}(q^{\pi},\tau)|$ in (8) to establish the CI. According to Theorem 3.2 in [74], the maximizer of the problem $\sup_{\tau\in\Omega}|\widehat{L}^{q}_{n}(q^{\pi},\tau)|$, i.e., $\tau^{*}(s,a)$, is a continuous function in the smooth MDP setting $\mathcal{M}=\widetilde{\mathcal{M}}$. This motivates us to consider smooth function classes for modeling $\Omega$, provided that these function classes can approximate continuous functionals sufficiently well. In particular, we model $\Omega$ in a bounded reproducing kernel Hilbert space (RKHS) equipped with a positive definite kernel $K(\cdot,\cdot)$, i.e., $\Omega_{RKHS}(C_{K})\coloneqq\{\tau\in RKHS:\|\tau\|_{RKHS}\leq C_{K}\}$, where $\|\cdot\|_{RKHS}$ denotes the kernel norm and $C_{K}>0$ is a constant. This kernel representation allows the maximization problem $\sup_{\tau\in\Omega_{RKHS}(C_{K})}|\widehat{L}^{q}_{n}(q^{\pi},\tau)|$ to admit a simple closed-form solution $\tau^{*}$; we note that this is based on the maximum mean discrepancy [19]. Accordingly, the maximum value $(1-\gamma)\widehat{L}^{q}_{n}(q^{\pi},\tau^{*})$ equals

\[
\sqrt{\frac{C_{K}}{n^{2}}\sum_{ij}({q}^{\pi}(s_{i},a_{i})-r_{i}-\gamma{q}^{\pi}(s^{\prime}_{i},{\pi}))K(\{s_{i},a_{i}\},\{s_{j},a_{j}\})({q}^{\pi}(s_{j},a_{j})-r_{j}-\gamma{q}^{\pi}(s^{\prime}_{j},{\pi}))}, \tag{13}
\]

where the square root operator is well-defined by the positive definite kernel $K(\cdot,\cdot)$.
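A minimal sketch of evaluating the closed form (13) is given below, assuming a Gaussian RBF kernel on stacked state-action features and a precomputed vector of Bellman residuals $q^{\pi}(s_{i},a_{i})-r_{i}-\gamma q^{\pi}(s^{\prime}_{i},\pi)$; both choices are illustrative assumptions rather than the paper's specification.

\begin{verbatim}
import numpy as np

def rbf_kernel(X, bandwidth=1.0):
    """Gaussian RBF kernel matrix on rows of X (state-action features)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_deviation(delta_bellman, K, C_K, gamma):
    """Closed-form supremum: returns L^q_n(q^pi, tau*), where
    delta_bellman[i] = q(s_i,a_i) - r_i - gamma * q(s'_i, pi)."""
    n = len(delta_bellman)
    quad = delta_bellman @ K @ delta_bellman     # sum_ij delta_i K_ij delta_j
    value = np.sqrt(C_K * max(quad, 0.0)) / n    # Eq. (13), guarded for round-off
    return value / (1.0 - gamma)                 # undo the (1 - gamma) factor
\end{verbatim}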

The closed-form solution helps us elaborate on uncertainty quantification. In (13), the function $\tau$ is maximized out and no longer appears in the sample estimand. Remarkably, the complexity of $\Omega$ does not influence the uncertainty deviation, which is now entirely independent of the function class complexity. This transforms the uniform uncertainty quantification into a point-wise uncertainty quantification focused on the global maximizer $\tau^{*}$, significantly reducing the uncertainty deviation. Furthermore, it has been demonstrated that an RKHS can achieve arbitrarily small approximation errors when approximating continuous functions [2]. In other words, it is possible to minimize the model-misspecification error of $\Omega$, and even eliminate it entirely, as long as the true model $\tau^{\star}$ is continuous, which holds in smooth MDP settings.

In summary, by leveraging the kernel representation, we observe a double descent phenomenon whereby the model-misspecification error and uncertainty deviation can be minimized simultaneously. Imposing a minor constraint on the continuity of MDPs, we overcome the curse of the bias and uncertainty tradeoff. In the following, we provide a brief overview of our method for quantifying the uncertainty deviation under the kernel representation. Inspired by the reproducing property, we initially construct a pseudo-estimator, i.e.,

\[
\Lambda^{\text{pseudo}}(\cdot;i):=\langle\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s_{i},a_{i})}[{q}^{\pi}(s_{i},a_{i})-r(s_{i},a_{i})-\gamma{q}^{\pi}(s^{\prime},{\pi})],\sqrt{C_{K}}K(\{s_{i},a_{i}\},\cdot)\rangle,
\]

and rewrite (13) as $\widehat{L}^{q}_{n}(q^{\pi},\tau^{*})=\sqrt{\|\frac{1}{n}\sum^{n}_{i=1}\Lambda(\cdot;i)\|_{RKHS}}$, where $\Lambda(\cdot;i):=\langle{q}^{\pi}(s_{i},a_{i})-r_{i}-\gamma{q}^{\pi}(s^{\prime}_{i},{\pi}),\sqrt{C_{K}}K(\{s_{i},a_{i}\},\cdot)\rangle$. It can be seen that the sequence $\{\Lambda(\cdot;i)-\Lambda^{\text{pseudo}}(\cdot;i)\}^{n}_{i=1}$ forms a martingale difference sequence in Hilbert space with respect to the filtration $\mathcal{F}_{i}:=(s_{\leq i},a_{\leq i},r_{\leq i-1},s^{\prime}_{\leq i-1})$. This motivates us to use Pinelis' inequality to quantify the uncertainty deviation, as illustrated in the following theorem.

Theorem 4.1.

Given the offline data $\mathcal{D}_{1:n}$, we suppose the kernel satisfies $\|K(\cdot,\cdot)\|_{\infty}\leq\bar{K}$; then

\[
\sqrt{\Big{\|}\frac{1}{n}\sum^{n}_{i=1}\Lambda(\cdot;i)\Big{\|}_{RKHS}}\leq\frac{\bar{V}\sqrt{\bar{K}C_{K}}}{2}\sqrt{\frac{2\ln\left(2/\delta\right)}{n(1-\gamma)}}+\frac{2\ln\left(2/\delta\right)\bar{V}\sqrt{\bar{K}C_{K}}}{3(1-\gamma)n}:=\varepsilon^{K}_{n},
\]

holds with probability at least $1-\delta$. Furthermore, Theorem 3.2 holds for any $\tau\in\Omega_{RKHS}(C_{K})$ when we take $\sigma_{n}=\varepsilon^{K}_{n}$.
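Since $\varepsilon^{K}_{n}$ depends only on user-specified constants, it can be evaluated in closed form; a minimal sketch with illustrative parameter values follows.

\begin{verbatim}
import numpy as np

def epsilon_n_K(n, delta, V_bar, K_bar, C_K, gamma):
    """Kernel uncertainty deviation epsilon_n^K from Theorem 4.1."""
    log_term = np.log(2.0 / delta)
    scale = V_bar * np.sqrt(K_bar * C_K)
    first = 0.5 * scale * np.sqrt(2.0 * log_term / (n * (1.0 - gamma)))
    second = 2.0 * log_term * scale / (3.0 * (1.0 - gamma) * n)
    return first + second

print(epsilon_n_K(n=500, delta=0.05, V_bar=10.0, K_bar=1.0, C_K=1.0, gamma=0.9))
\end{verbatim}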

Theorem 4.1 achieves a convergence rate of order $\mathcal{O}(1/\sqrt{n})$, which is as fast as the minimax rate in learning with i.i.d. data [68]. Enabled by the kernel representation, the upper bound $\varepsilon^{K}_{n}$ is independent of the complexity of $\Omega$. Setting $\sigma_{n}=\varepsilon^{K}_{n}$, the uncertainty deviation is significantly reduced compared to those in Theorem 3.3 or Corollary 3.1. Recall that we can optimize the upper and lower bounds in the neutral CI (7) to minimize the bias and obtain a possibly tightest interval. Intriguingly, the maximization problem $\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(-q,\tau)$ within the CI also has a simple closed-form solution when $\mathcal{Q}$ belongs to a bounded RKHS [38]. This can be employed to alleviate the optimization burden and enhance stability. Now, together with the uncertainty deviation defined in Theorem 4.1, we formulate the tradeoff-free neutral CI as follows.

Corollary 4.1 (RKHS reformulation).

Under the conditions of Theorem 3.2, suppose $\Omega$ and $\mathcal{Q}$ lie in a bounded RKHS with radius $1$, and set $\sigma_{n}=\varepsilon^{K}_{n}$ with $C_{K}=1$ in Theorem 4.1; then the return $J(\pi)$ falls into the CI

\[
\left[\sup_{\tau\in\Omega}\left\{\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}-\widehat{M}^{K}_{n}(\pi)-\sigma_{n}\right\},\;\inf_{\tau\in\Omega}\left\{\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}+\widehat{M}^{K}_{n}(\pi)+\sigma_{n}\right\}\right], \tag{14}
\]

with probability at least $1-\delta$. The detailed expression of $\widehat{M}^{K}_{n}(\pi)$ and the derivations are provided in Section E.11 of the appendix.

As demonstrated in Corollary 4.1, the unstable two-stage optimization problem, which involves optimizing both $\tau$ and $q$, is decoupled into a more easily solvable single-stage optimization.

Remark 4.1.

Finding global optimizers in (14) is quite challenging due to the need to solve an infinite-dimensional problem in RKHS. However, approximate solvers can be employed to replace the global optimizer in (14), which is typically difficult to find. Thanks to the robustness property of our CI with respect to the model-misspecification error, using approximate solvers only makes the interval slightly wider but does not compromise the validity of the CI. Consequently, it is feasible to utilize random feature approximation [50] or the Riesz representation theorem to transform a complex infinite-dimensional optimization problem into an easily solvable finite-dimensional optimization problem.

4.3 Tradeoff-Free Pessimistic Confidence Interval

In this section, we introduce a tradeoff-free pessimistic CI estimation. We take a further step in breaking the curse of the "bias-uncertainty" tradeoff without requiring any additional conditions, e.g., the smoothness conditions on MDPs discussed in the last section. In any general MDP, the evaluation bias and statistical uncertainty in CI estimation can be minimized simultaneously without compromising one or the other. The main idea is to utilize Lagrangian duality and the curvature of the discriminator function $\mathbb{G}$. This is another motivation for defining the discriminator function as in Definition 3.1, in addition to its role in handling distributional shifts. In the following, we characterize a unique and global optimizer related to the lower bound in (12), which builds a foundation for our subsequent methodological developments.

Proposition 4.1.

For any $s,a$, the global optimizer of the maximization problem

\[
\max_{\tau\geq 0}\frac{\tau(s,a)}{1-\gamma}\big{(}r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s^{\prime},\pi)]-q(s,a)\big{)}-\frac{\lambda^{-}\mathbb{G}(\tau(s,a))}{1-\gamma}, \tag{15}
\]

is unique. Moreover, it has the following analytical form:

\[
\tau^{\star}(s,a;q):=\left[\left(\mathbb{G}^{\prime}\right)^{-1}\left(\frac{r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s^{\prime},\pi)]-q(s,a)}{\lambda^{-}}\right)\right]^{+}, \tag{16}
\]

where $\left(\mathbb{G}^{\prime}\right)^{-1}$ is the inverse function of the derivative $\mathbb{G}^{\prime}$, which is strictly increasing due to the strict convexity of $\mathbb{G}$.
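For the quadratic discriminator $\mathbb{G}(x)=(x-1)^{2}$, the derivative inverse is $(\mathbb{G}^{\prime})^{-1}(y)=y/2+1$, so (16) admits an explicit expression; the sketch below (an illustration, replacing the conditional expectation with the sampled transition) computes it.

\begin{verbatim}
import numpy as np

def tau_star_quadratic(bellman_residual, lam_minus):
    """Closed-form optimizer (16) for the quadratic discriminator
    G(x) = (x - 1)^2, whose derivative inverse is (G')^{-1}(y) = y/2 + 1.

    bellman_residual : array of r + gamma * q(s', pi) - q(s, a), the
    transition-sample surrogate for the conditional expectation in (16)
    (an assumption of this sketch; the theorem uses the true expectation).
    """
    return np.maximum(bellman_residual / lam_minus / 2.0 + 1.0, 0.0)
\end{verbatim}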

The proof of Proposition 4.1 relies on Lagrangian duality and the strong convexity of $\mathbb{G}$. The analytical form (16) motivates us to construct an appropriate pseudo estimator to measure the uncertainty deviation, which is independent of the complexity of $\Omega$. Define the pseudo estimator $\Gamma^{\text{pseudo}}$ as

i=1nτ(si,ai;qπ)(r(si,ai)+γ𝔼s(|si,ai)[qπ(s,π)]qπ(si,ai))λ𝔾(τ(si,ai;qπ))n(1γ),\displaystyle\frac{\sum^{n}_{i=1}\tau^{\star}(s_{i},a_{i};q^{\pi})\big{(}r(s_{i},a_{i})+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s_{i},a_{i})}[q^{\pi}(s^{\prime},\pi)]-q^{\pi}(s_{i},a_{i})\big{)}-\lambda^{-}\mathbb{G}(\tau^{\star}(s_{i},a_{i};q^{\pi}))}{n(1-\gamma)},

which can be regarded as the post-optimization objective function of (15). Correspondingly, we can define its transition sample counterpart, where the unknown Markov transition kernel \mathds{P} is replaced by a transition pair, as

Γ~=i=1nτ~(si,ai,ri,si;qπ)(ri+γqπ(si,π)qπ(si,ai))λ𝔾(τ~(si,ai,ri,si;qπ))n(1γ),\displaystyle\widetilde{\Gamma}=\frac{\sum^{n}_{i=1}\widetilde{\tau}^{\star}(s_{i},a_{i},r_{i},s^{\prime}_{i};q^{\pi})\big{(}r_{i}+\gamma q^{\pi}(s^{\prime}_{i},\pi)-q^{\pi}(s_{i},a_{i})\big{)}-\lambda^{-}\mathbb{G}(\widetilde{\tau}^{\star}(s_{i},a_{i},r_{i},s^{\prime}_{i};q^{\pi}))}{n(1-\gamma)},

where τ~(si,ai,ri,si;qπ)\widetilde{\tau}^{\star}(s_{i},a_{i},r_{i},s^{\prime}_{i};q^{\pi}) is the transition pair counterpart of τ(si,ai;qπ)\tau^{\star}(s_{i},a_{i};q^{\pi}), accordingly. By Proposition 4.1, it can be seen that the objective function L^nq(qπ,τ)λξn(𝔾,τ)-\widehat{L}^{q}_{n}(q^{\pi},\tau)-\lambda^{-}\xi_{n}(\mathbb{G},\tau) in confidence set (12) is bounded above by Γ~\widetilde{\Gamma} for any τΩ\tau\in\Omega. With this fact, we have

supτΩ{L^nq(qπ,τ)λξn(𝔾,τ)}Γ~=()Γ~Γpseudo.\displaystyle\sup_{\tau\in\Omega}\{-\widehat{L}^{q}_{n}(q^{\pi},\tau)-\lambda^{-}\xi_{n}(\mathbb{G},\tau)\}\leq\widetilde{\Gamma}\overset{\mathrm{(*)}}{=}\widetilde{\Gamma}-\Gamma^{\text{pseudo}}. (17)

The equality ()\mathrm{(*)} comes from the fact that Γpseudo=0\Gamma^{\text{pseudo}}=0 by the Bellman equation. In (17), the uniform quantification argument on the LHS is transferred to a point-wise deviation on the RHS. This indicates that it suffices to evaluate the uncertainty deviation only at the global optimizer τ\tau^{\star}, which removes the influence of the complexity of the function class Ω\Omega on the uncertainty quantification. Next, we need to quantify the deviation Γ~Γpseudo\widetilde{\Gamma}-\Gamma^{\text{pseudo}}. However, the martingale structure we utilized to establish the uncertainty deviation no longer applies to the difference sequence {Γ~iΓipseudo}i=1n\{\widetilde{\Gamma}_{i}-\Gamma^{\text{pseudo}}_{i}\}^{n}_{i=1}, due to the non-linearity of (𝔾)1(\mathbb{G}^{\prime})^{-1} and the double sampling issue [3] in Γ~i\widetilde{\Gamma}_{i}. This makes the derivation much more intricate than in the previous sections. Interestingly, by exploiting the curvature, i.e., the second-order derivative of Γipseudo\Gamma^{\text{pseudo}}_{i}, we discover that the sequence {ΓipseudoΓ~i}i=1n\{\Gamma^{\text{pseudo}}_{i}-\widetilde{\Gamma}_{i}\}^{n}_{i=1} forms a supermartingale difference sequence. We justify this observation in the following lemma.

Lemma 4.1.

For i=1,,ni=1,...,n, we define the difference Δi=ΓipseudoΓ~i\Delta_{i}=\Gamma^{\text{pseudo}}_{i}-\widetilde{\Gamma}_{i} and the filtration i:=(si,ai,ri1,si1)\mathcal{F}_{i}:=(s_{\leq i},a_{\leq i},r_{\leq i-1},s^{\prime}_{\leq i-1}), then it follows that

𝔼[Δi|i]0,\displaystyle\mathbb{E}[\Delta_{i}|\mathcal{F}_{i}]\leq 0,

where equality holds if and only if the transition kernel of the MDP, i.e., \mathds{P}, is deterministic.

Lemma 4.1 offers new evidence that the deviation of Γ~Γpseudo\widetilde{\Gamma}-\Gamma^{\text{pseudo}} is quantifiable. Although the martingale structure does not hold, we can measure the deviation using supermartingale concentration theory [14], and then quantify the statistical uncertainty in the pessimistic CI estimation based on the relationship depicted in (17).

Theorem 4.2.

For any target policy π\pi, Theorem 3.4 holds if we choose the lower confidence uncertainty deviation

σnL=\displaystyle\sigma^{L}_{n}= CV¯+λ𝔾(C)1γ2ln(4/δ)n,\displaystyle\frac{C^{\star}\bar{V}+\lambda^{-}\mathbb{G}(C^{\star})}{1-\gamma}\sqrt{\frac{2\ln(4/\delta)}{n}}, (18)

where C=[(𝔾)1(min{V¯/λ,c2})]+C^{\star}=[(\mathbb{G}^{\prime})^{-1}(\min\{\bar{V}/\lambda^{-},c_{2}\})]^{+}.

In Theorem 4.2, even without assuming any independence or weak dependence condition, our sample complexity is of order 𝒪(1/n)\mathcal{O}(1/\sqrt{n}), which aligns with the minimax rate in i.i.d. cases as discussed in [12]. A closer examination of (18) reveals that the uncertainty deviation σnL\sigma^{L}_{n} is fully independent of the complexity of Ω\Omega. It is worth noting that equation (17) is valid for any function class Ω\Omega. Consequently, it is not necessary to sacrifice the expressivity of the function class, for example, by restricting it to RKHS, in order to achieve point-wise uncertainty quantification. Following this principle, one can use a sufficiently expressive function class to minimize the model-misspecification error while maintaining efficient uncertainty quantification. The only cost is the additional bias and uncertainty introduced by the discriminator function, and, as shown above, this extra bias can be well controlled by λ\lambda^{-}. We also note that the upper uncertainty σnU\sigma^{U}_{n} can be determined following Corollary 4.14.1, as the confidence upper bound of the pessimistic CI is almost identical to the one in the neutral CI.
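As a small numerical illustration, the sketch below evaluates the deviation (18) for the quadratic discriminator \mathbb{G}(x)=\frac{1}{2}(x-1)^{2} used in Section 7, for which (\mathbb{G}^{\prime})^{-1}(y)=y+1; the constants V̄, λ⁻ and c₂ are assumed to be supplied by the user.

```python
import numpy as np

def sigma_lower(n, delta, V_bar, lam, c2, gamma):
    """Lower uncertainty deviation (18) from Theorem 4.2, specialized to the
    quadratic discriminator G(x) = 0.5*(x-1)^2, so that (G')^{-1}(y) = y + 1.
    The constants V_bar, lam (lambda^-), and c2 are assumed known."""
    G = lambda x: 0.5 * (x - 1.0) ** 2
    # C* = [(G')^{-1}(min{V_bar/lam, c2})]^+
    C_star = max(min(V_bar / lam, c2) + 1.0, 0.0)
    return (C_star * V_bar + lam * G(C_star)) / (1.0 - gamma) \
        * np.sqrt(2.0 * np.log(4.0 / delta) / n)
```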

Remark 4.2.

We note that incorporating the discriminator function facilitates numerical optimization. The maximization over τ\tau can be solved efficiently even when Ω\Omega is not a convex set. For instance, when Ω\Omega is a non-convex set that consists of monotonic functions or linear transformations of strongly convex functions, the optimization can be solved easily.

5 Optimization Algorithm

In this section, we propose a stochastic approximation-type algorithm to solve the CI estimation problem. For simplicity of exposition, we discuss the optimization of the lower bound in the pessimistic CI given by (11). The upper confidence bound in (11) and the neutral CI can be treated as a mirror image and a special case of the proposed algorithm, respectively.

In function approximation settings, Ω\Omega and 𝒬\mathcal{Q} are often represented in practice by compact parametric function classes, either linear or non-linear [61]. We denote the corresponding parameters by ψ\psi and θ\theta for Ω\Omega and 𝒬\mathcal{Q}, respectively. One can express the parametric objective function for optimization as:

\displaystyle\mathcal{L}(q_{\theta},\tau_{\psi})=\frac{\mathbb{E}_{\mu}\left[\tau_{\psi}(s,a)(r(s,a)+\gamma q_{\theta}(s^{\prime},\pi)-q_{\theta}(s,a))\right]}{1-\gamma}+q_{\theta}(s^{0},\pi)-\frac{\lambda^{-}\mathbb{E}_{\mu}\left[\mathbb{G}(\tau_{\psi}(s,a))\right]}{1-\gamma}, (19)

where 𝔼μ[]\mathbb{E}_{\mu}[\cdot] can be estimated from the observed data 𝒟1:n\mathcal{D}_{1:n}. Therefore, the optimization problem we aim to solve is maxψminθ(qθ,τψ)\max_{\psi}\min_{\theta}\mathcal{L}(q_{\theta},\tau_{\psi}), which forms a saddle-point formulation, and we denote the saddle point as (ψ,θ)(\psi^{*},\theta^{*}). We observe that the inner minimization problem is relatively easy to solve. On one hand, when 𝒬θ\mathcal{Q}_{\theta} is within a finite ball of an RKHS, the optimization yields a simple closed-form solution according to Theorem 4.1. On the other hand, as demonstrated in Theorem B.1 in the Appendix, the feature mapping class is sufficient for modeling 𝒬\mathcal{Q}. The feature mapping class simplifies the optimization, making it efficiently solvable by various algorithms as discussed in [60].
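To make the plug-in of the offline data explicit, a minimal sketch of the empirical counterpart of (19) is given below; q, tau and q_pi (which returns 𝔼 over a∼π(⋅|s) of q_θ(s,a)) are assumed to be user-supplied vectorized callables, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def empirical_objective(data, q, tau, q_pi, s0, gamma, lam, G):
    """Sample counterpart of the Lagrangian (19), a minimal sketch.

    data: dict with arrays 's', 'a', 'r', 's_next'; q(s, a) and tau(s, a)
    are vectorized callables; q_pi(states) returns E_{a~pi(.|states)}[q(states, a)]
    for the target policy; s0 is a batch containing the initial state(s).
    """
    s, a, r, s_next = data["s"], data["a"], data["r"], data["s_next"]
    bellman_resid = r + gamma * q_pi(s_next) - q(s, a)   # r + gamma*q(s',pi) - q(s,a)
    weighted_term = np.mean(tau(s, a) * bellman_resid) / (1.0 - gamma)
    init_term = np.mean(q_pi(s0))                        # q(s^0, pi)
    penalty_term = lam * np.mean(G(tau(s, a))) / (1.0 - gamma)
    return weighted_term + init_term - penalty_term
```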

In contrast, the more challenging aspect is optimizing τψ\tau_{\psi}. Due to its complex structure, it demands a sufficiently flexible non-linear function approximation class, e.g., deep neural networks, for optimization [24]. Unfortunately, concavity typically does not hold for non-linear function approximation classes, and thus the outer maximization of maxψminθ(qθ,τψ)\max_{\psi}\min_{\theta}\mathcal{L}(q_{\theta},\tau_{\psi}) is also affected. As a result, we need to develop a more efficient and convergent algorithm. From this perspective, our problem can be seen as solving a non-concave maximization problem, conditional on the solved global optimizer q¯θ:=argminθ(qθ,τψ)\bar{q}_{\theta}:=\operatorname*{arg\,min}_{\theta}\mathcal{L}(q_{\theta},\tau_{\psi}). Under this framework, we first study the gradients of the objective function with respect to ψ{\psi}. Define ¯(τψ)=(q¯θ,τψ)\bar{\mathcal{L}}(\tau_{\psi})=\mathcal{L}(\bar{q}_{\theta},\tau_{\psi}), then the gradient of ¯(τψ)\bar{\mathcal{L}}(\tau_{\psi}) with respect to ψ\psi satisfies

\displaystyle\nabla_{\psi}\bar{\mathcal{L}}(\tau_{\psi})=\frac{\mathbb{E}_{\mu}[(r(s,a)+\gamma q_{\theta}(s^{\prime},\pi)-q_{\theta}(s,a))\nabla_{\psi}\tau_{\psi}(s,a)]}{1-\gamma}-\frac{\lambda^{-}\mathbb{E}_{\mu}[\mathbb{G}^{\prime}(\tau_{\psi}(s,a))\nabla_{\psi}\tau_{\psi}(s,a)]}{1-\gamma}, (20)

where the corresponding stochastic gradient, denoted as ~ψ¯(τψ)\widetilde{\nabla}_{\psi}\bar{\mathcal{L}}(\tau_{\psi}), can be computed from the offline dataset 𝒟1:n\mathcal{D}_{1:n}. With the gradients provided in (20), we propose a stochastic approximation algorithm to update τψ\tau_{\psi}. At each iteration, we update τψ\tau_{\psi} by solving the proximal mapping [47]: Projψ(ψ,;DBerg):=argmaxψ{ψ,DBerg(ψ,ψ)}\text{Proj}_{\psi}(\psi^{*},\nabla;D_{Berg}):=\operatorname*{arg\,max}_{\psi}\{\langle\psi,\nabla\rangle-D_{Berg}(\psi^{*},\psi)\}, where ψ\psi^{*} can be viewed as the current iterate of the parameter, DBerg(,)D_{Berg}(\cdot,\cdot) denotes the Bregman divergence as discussed in [53], and \nabla represents the scaled stochastic gradient of the parameter of interest. In practice, we may consider using the Euclidean distance to reduce the computational burden. The details of the proposed algorithm are summarized in Algorithm 1. In the next section, we will demonstrate that our stochastic approximation algorithm converges at a sublinear rate even under non-linear (non-concave) settings.
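In particular, with the Euclidean choice D_Berg(ψ,ψ′)=½‖ψ′−ψ‖², the prox-mapping admits a closed form and reduces to a plain gradient-ascent step, as the following two-line sketch (with the stepsize already absorbed into the scaled gradient, as in step 7 of Algorithm 1) makes explicit.

```python
def prox_step_euclidean(psi, scaled_grad):
    """Proj_psi(psi, scaled_grad; 0.5*||.||^2)
    = argmax_{psi'} { <psi', scaled_grad> - 0.5*||psi' - psi||^2 } = psi + scaled_grad.
    The scaled gradient already contains the stepsize eta_t (Algorithm 1, step 7)."""
    return psi + scaled_grad
```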

Algorithm 1 Proximal mapping Algorithm for Off-policy CI Estimation
1:  Input offline data 𝒟1:n:={(si,ai,ri,si)}i=1n\mathcal{D}_{1:n}:=\{(s_{i},a_{i},r_{i},s^{\prime}_{i})\}^{n}_{i=1} and the initial state s0s^{0}.
2:  Initialize parameters of interest ψ0\psi^{0} and θ0\theta^{0}, initial learning rate η0\eta^{0}, hyperparameter λ\lambda^{-}, max iteration TT, Bregman divergence function DBergD_{Berg}.
3:  For t=1t=1 to t=Tt=T:
4:     Update θt\theta^{t} by solving
minθ1ni=1nτψt1(si,ai)(γqθ(si,π)qθ(si,ai))1γ+qθ(s0,π)i=1nλ𝔾(τψt1(si,ai))n(1γ).\min_{\theta}\frac{1}{n}\sum^{n}_{i=1}\frac{\tau_{\psi^{t-1}}(s_{i},a_{i})(\gamma q_{\theta}(s^{\prime}_{i},\pi)-q_{\theta}(s_{i},a_{i}))}{1-\gamma}+q_{\theta}(s^{0},\pi)-\frac{\sum^{n}_{i=1}\lambda^{-}\mathbb{G}(\tau_{\psi^{t-1}}(s_{i},a_{i}))}{n(1-\gamma)}.
5:     Decay the stepsize ηt\eta^{t} at the rate 𝒪(t1/4)\mathcal{O}(t^{-1/4}).
6:     Compute the stochastic gradient with respect to ψ\psi as ~ψ(τψ,qθt)\widetilde{\nabla}_{\psi}{\mathcal{L}}(\tau_{\psi},q_{\theta^{t}}).
7:     Update ψt\psi^{t} by solving the prox-mapping: ψt=Projψ(ψt1,ηt~ψ(τψ,qθt);DBerg)\psi^{t}=\text{Proj}_{\psi}(\psi^{t-1},\eta_{t}\widetilde{\nabla}_{\psi}{\mathcal{L}}(\tau_{\psi},q_{\theta^{t}});D_{Berg}).
8:  End for
9:  Output τ^=τψT\widehat{\tau}=\tau_{\psi^{T}} and q^=qθT\widehat{q}=q_{\theta^{T}}.
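To make the algorithm concrete, the following PyTorch sketch instantiates Algorithm 1 under the Euclidean Bregman divergence, so that step 7 reduces to a gradient-ascent step on ψ. The inner minimization in step 4 is approximated by a few Adam iterations rather than solved exactly, the θ-independent λ⁻𝔾(τ) term is dropped from the inner loss, and the networks, learning rates, and the routine q_pi (returning 𝔼 over a∼π(⋅|s) of q_θ(s,a)) are illustrative placeholders rather than the implementation used in our experiments.

```python
import torch

def algorithm_1(data, tau_net, q_net, q_pi, s0, gamma, lam, G,
                T=200, eta0=0.1, inner_steps=50, inner_lr=1e-2):
    """A minimal sketch of Algorithm 1 with the Euclidean Bregman divergence.
    data = (s, a, r, s_next) tensors; tau_net(s, a) and q_net(s, a) are networks;
    q_pi(q_net, states) should return E_{a~pi(.|states)}[q_net(states, a)]."""
    s, a, r, s_next = data
    for t in range(1, T + 1):
        # Step 4: approximately minimize over theta (a few Adam steps instead of an exact solve).
        tau_fixed = tau_net(s, a).detach()
        opt_q = torch.optim.Adam(q_net.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            resid = gamma * q_pi(q_net, s_next) - q_net(s, a)   # gamma*q(s',pi) - q(s,a)
            loss = (tau_fixed * resid).mean() / (1 - gamma) + q_pi(q_net, s0).mean()
            opt_q.zero_grad()
            loss.backward()
            opt_q.step()
        # Step 5: decay the stepsize at rate O(t^{-1/4}).
        eta_t = eta0 * t ** (-0.25)
        # Steps 6-7: stochastic gradient w.r.t. psi, then the Euclidean prox (ascent) update.
        resid = r + gamma * q_pi(q_net, s_next) - q_net(s, a)
        obj = ((tau_net(s, a) * resid).mean() - lam * G(tau_net(s, a)).mean()) / (1 - gamma)
        grads = torch.autograd.grad(obj, list(tau_net.parameters()))
        with torch.no_grad():
            for p, g in zip(tau_net.parameters(), grads):
                p.add_(eta_t * g)                               # psi^t = psi^{t-1} + eta_t * grad
    return tau_net, q_net
```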

6 Theoretical Properties

In this section, we study the theoretical properties of the proposed CI estimations. In Theorems 6.1-6.2, we investigate the finite sample performance of the proposed method. Our theory shows that the sample complexity of learning a near-tight CI is of order 𝒪(1/n)\mathcal{O}(1/\sqrt{n}). The results hold widely for function approximation classes with finite pseudo-dimension. To the best of our knowledge, Theorems 6.1-6.2 are the first finite sample bounds in CI estimation that do not explicitly require any independence or weak dependence conditions on the data collection process. In Theorems 6.3-6.4, we study the robustness of the proposed algorithm with respect to the optimization and model-misspecification errors, which makes our algorithm practical in real-world applications. Moreover, we take a big step towards solving the open problem of algorithmic convergence for non-linear function approximation in offline RL. In Theorem 6.5, we show that our algorithm converges to a stationary point at a sublinear rate even in non-concave settings. This convergence property justifies the use of non-linear function classes, e.g., deep neural networks (DNNs), without divergence concerns.

6.1 Finite Sample Bound

In this section, we present the finite-sample bound of the CI estimation. Before stating our results, we first make some mild assumptions.

Assumption 1.

For any τΩ\tau\in\Omega, suppose 0τ(s,a)C0\leq\tau(s,a)\leq C for all s,as,a, where C>0C>0 is a finite constant.

Assumption 2.

For any target policy π\pi and ν0\nu\geq 0, there exists a τΩ\tau^{*}\in\Omega satisfying that

|n{τ(si,ai)(q(si,ai)γq(si,π)ri)}\displaystyle\Big{|}\mathbb{P}_{n}\left\{\tau^{*}(s_{i},a_{i})(q(s_{i},a_{i})-\gamma q(s^{\prime}_{i},\pi)-r_{i})\right\}
𝔼(s,a)dπ,s(|s,a)[q(s,a)γq(s,π)r(s,a)]|=𝒪(c0n11+ν),\displaystyle\qquad\qquad\qquad-\mathbb{E}_{(s,a)\sim d^{\pi},s^{\prime}\sim\mathds{P}(\cdot|s,a)}\left[q(s,a)-\gamma q(s^{\prime},\pi)-r(s,a)\right]\Big{|}=\mathcal{O}\left(\frac{c_{0}}{n^{\frac{1}{1+\nu}}}\right),

for any q𝒬q\in\mathcal{Q}, where c0c_{0} is some finite and positive constant.

Assumption 1 is a standard condition on function boundedness. Assumption 2 states that there exists a “good” τΩ\tau^{*}\in\Omega, not necessarily exactly equal to the density ratio dπ/μd^{\pi}/\mu, such that τμ\tau^{*}\cdot\mu approximates dπd^{\pi} well. This is a much weaker assumption than the standard correct model specification assumption [38, 16], which requires that the true model be contained in Ω\Omega. In contrast, our assumption only requires that the discrepancy between the limiting distribution of τμ\tau^{*}\cdot\mu and the distribution dπd^{\pi} cannot be distinguished by the functional 𝔼s(|s,a)[q(s,a)γq(s,π)r(s,a)]\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s,a)-\gamma q(s^{\prime},\pi)-r(s,a)], which is aligned with the principle in [7]. As a special case, the assumption holds if the limiting distribution of τμ\tau^{*}\cdot\mu converges to dπd^{\pi}. The factor ν\nu depends on the smoothness of 𝒬\mathcal{Q} [66], and ν=1\nu=1 when 𝒬\mathcal{Q} is a bounded ball in an RKHS. The factor c0c_{0} usually depends on the pseudo-dimension and radius of the specified function class. In the following, we denote the confidence upper and lower bounds optimized in (7) over τΩ\tau\in\Omega as Jn+(π)J^{+}_{n}(\pi) and Jn(π)J^{-}_{n}(\pi), respectively. We first characterize the sample-dependent bound for the neutral CI.

Theorem 6.1.

Under Assumptions 1 and 2, for any target policy π\pi, the length of the estimated neutral CI satisfies

Jn+(π)Jn(π)\displaystyle J^{+}_{n}(\pi)-J^{-}_{n}(\pi)\leq 2c0(1γ)n11+ν+CV¯n(δ,V¯,DΩ)(1γ),\displaystyle\frac{2c_{0}}{(1-\gamma)n^{\frac{1}{1+\nu}}}+\frac{C\bar{V}\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega})}{(1-\gamma)},

where

n(δ,V¯,DΩ):=(2ln2e(DΩ+1)(4eV¯2n)DΩδn+4ln2e(DΩ+1)(4eV¯2n)DΩδ3n),\displaystyle\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega}):=\left(\sqrt{\frac{2\ln\frac{2e\left(D_{\Omega}+1\right)(4e\bar{V}^{2}\sqrt{n})^{D_{\Omega}}}{\delta}}{n}}+\frac{4\ln\frac{2e\left(D_{\Omega}+1\right)(4e\bar{V}^{2}\sqrt{n})^{D_{\Omega}}}{\delta}}{3n}\right), (21)

and DΩD_{\Omega} is the pseudo-dimension of the class Ω\Omega.

The proof of Theorem 6.1 relies on empirical Freedman-type bounds with a careful analysis of the variance term. When ν1\nu\leq 1, the sample complexity of the generalization bound is of order 𝒪(1/n)\mathcal{O}(1/\sqrt{n}). This achieves the minimax optimal rate in off-policy evaluation [12]. The bound can potentially be improved for some special classes, e.g., RKHS, for which it can be independent of the function space complexity. For some feature mapping classes, the finite sample bound may vanish at the faster rate 𝒪(1/n)\mathcal{O}(1/n) if λmax(Σn)diamψ\lambda_{\max}(\Sigma_{n})\ll\text{diam}_{\psi} as discussed in Corollary 3.1. In the following, we present the result for the pessimistic CI estimation, in which Jn+(π)J^{+}_{n}(\pi) and Jn(π)J^{-}_{n}(\pi) are the optimized confidence bounds in (11) within the class Ω\Omega.
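For reference, the complexity term (21) can be evaluated directly; the short helper below computes 𝓔_n(δ, V̄, D_Ω) for a given sample size, confidence level, value bound V̄ and pseudo-dimension D_Ω (the logarithm is expanded to avoid numerical overflow for large D_Ω).

```python
import numpy as np

def complexity_term(n, delta, V_bar, D_omega):
    """Evaluates E_n(delta, V_bar, D_Omega) from (21).  The logarithm
    ln(2e(D+1)(4e*V_bar^2*sqrt(n))^D / delta) is expanded term by term
    to avoid overflow for large pseudo-dimension D_omega."""
    log_term = (np.log(2 * np.e * (D_omega + 1))
                + D_omega * np.log(4 * np.e * V_bar ** 2 * np.sqrt(n))
                - np.log(delta))
    return np.sqrt(2 * log_term / n) + 4 * log_term / (3 * n)
```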

Theorem 6.2.

Under Assumptions 1 and 2, for any target policy π\pi, if we set λ=𝒪(1/n)\lambda^{-}=\mathcal{O}(1/\sqrt{n}), then the estimated pessimistic CI satisfies

(J(π)Jn(π))(1γ)\displaystyle(J(\pi)-J^{-}_{n}(\pi))(1-\gamma)\leq 𝒪(c0n11+ν+c12n+V¯(C)2ln4δn),\displaystyle\mathcal{O}\bigg{(}\frac{c_{0}}{n^{\frac{1}{1+\nu}}}+\sqrt{\frac{c_{1}^{2}}{n}}+\bar{V}\sqrt{\frac{(C^{\star})^{2}\ln\frac{4}{\delta}}{n}}\bigg{)},
(Jn+(π)J(π))(1γ)\displaystyle(J^{+}_{n}(\pi)-J(\pi))(1-\gamma)\leq 𝒪(c0n11+ν+CV¯n(δ/2,V¯,DΩ)(1γ)),\displaystyle\mathcal{O}\bigg{(}\frac{c_{0}}{n^{\frac{1}{1+\nu}}}+\frac{C\bar{V}\mathcal{E}_{n}(\delta/2,\bar{V},D_{\Omega})}{(1-\gamma)}\bigg{)},

where C=[(𝔾)1(c2)]+C^{\star}=[(\mathbb{G}^{\prime})^{-1}(c_{2})]^{+} and n(δ,V¯,DΩ)\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega}) is defined in (21).

Theorem 6.2 shows that it is sufficient to learn a near-tight confidence upper or lower bound with 𝒪(n)\mathcal{O}(\sqrt{n}) size of samples. Note that the error between the confidence lower bound and the discounted return, i.e., J(π)J(π)J(\pi)-J^{-}(\pi), can be independent of function class complexity. To the best of our knowledge, this is the minimum estimation error bound that can be achieved.

6.2 Robustness to Optimization and Model Misspecification Errors

In this section, we consider the setting where Ω×𝒬\Omega\times\mathcal{Q} may contain neither the “good” τ\tau^{*} defined in Assumption 2 nor the true action-value function qπq^{\pi}. We measure the approximation errors for the function classes Ωψ\Omega_{\psi} and 𝒬θ\mathcal{Q}_{\theta} as follows:

ετ=minψτψτ,εq=minθqθqπ.\displaystyle\varepsilon_{\tau}=\min_{\psi}\|\tau_{\psi}-\tau^{*}\|_{\infty},\quad\varepsilon_{q}=\min_{\theta}\|q_{\theta}-q^{\pi}\|_{\infty}. (22)

In addition to the function approximation errors, we allow the existence of optimization errors. Let (q~opt,τ~opt)(\widetilde{q}^{opt},\widetilde{\tau}^{opt}) be the optimizer for the sample version of the objective function ^(qθ,τψ)\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi}) in (19), then we define the optimization errors as

^(q~opt,τ~opt)maxψ^(q~opt,τψ)\displaystyle\widehat{\mathcal{L}}(\widetilde{q}^{opt},\widetilde{\tau}^{opt})-\max_{\psi}\widehat{\mathcal{L}}(\widetilde{q}^{opt},\tau_{\psi})\geq εopt(τ),\displaystyle-\varepsilon_{opt}(\tau),
maxψminθ^(qθ,τψ)maxψ^(q~opt,τψ)\displaystyle\max_{\psi}\min_{\theta}\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi})-\max_{\psi}\widehat{\mathcal{L}}(\widetilde{q}^{opt},\tau_{\psi})\geq εopt(q),\displaystyle-\varepsilon_{opt}(q), (23)

The above equations indicate that the max-min point of maxψminθ^(qθ,τψ)\max_{\psi}\min_{\theta}\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi}) is approximated by the solution ^(q~opt,τ~opt)\widehat{\mathcal{L}}(\widetilde{q}^{opt},\widetilde{\tau}^{opt}). Similarly, we relax the min-max point of minψmaxθ^(qθ,τψ)\min_{\psi}\max_{\theta}\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi}) so that it can also be attained approximately, and we use the same error upper bounds, i.e., εopt(q)\varepsilon_{opt}(q) and εopt(τ)\varepsilon_{opt}(\tau), to avoid unnecessary notation and improve readability. Under the definitions in (22) and (23), we allow the existence of model misspecification error and algorithm optimization error. In this case, we call our algorithm the robust algorithm. In the following, Theorem 6.3 justifies that the robust algorithm is capable of learning a valid and near-tight neutral CI with polynomial sample complexity.

Theorem 6.3.

Under Assumptions 1 and 2, for any target policy π\pi, the estimated neutral CI returned by the robust algorithm is valid, and its length satisfies

Jn+(π)Jn(π)11γ(2c0n11+ν+CV¯n(δ,V¯,DΩ)+εopt+εmis),\displaystyle J^{+}_{n}(\pi)-J^{-}_{n}(\pi)\leq\frac{1}{1-\gamma}\left(\frac{2c_{0}}{n^{\frac{1}{1+\nu}}}+C\bar{V}\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega})+\varepsilon_{opt}+\varepsilon_{mis}\right),

where εopt=εopt(q)+εopt(τ)\varepsilon_{opt}=\varepsilon_{opt}(q)+\varepsilon_{opt}(\tau), εmis=2V¯ετ+4Cεq\varepsilon_{mis}=2\bar{V}\varepsilon_{\tau}+4C\varepsilon_{q}, and n(δ,V¯,DΩ)\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega}) is defined in (21).

Theorem 6.3 is an error-robust version of Theorem 6.1. The optimization error and approximation error are quantified within the finite-sample bound. For the optimization error, according to Theorem 6.5, Algorithm 1 has an optimization error of order 𝒪(1/T)\mathcal{O}(1/T), where TT is the maximum number of optimization iterations. The approximation error depends on the specific structure of τ\tau^{*} and qπq^{\pi}. For example, the expressivity of the feature mapping class might yield an arbitrarily small approximation error εq\varepsilon_{q}, as implied by Theorem B.1 in the appendix. Thanks to the convergence of our algorithm under non-linear approximation settings, one may use a DNN class to model τ\tau^{*} and thereby reduce ετ\varepsilon_{\tau}. Recent works show that DNNs enjoy universal approximation power [39], i.e., ετ0\varepsilon_{\tau}\approx 0. In the following, we establish an error-robust finite sample analysis of the empirical length of the pessimistic CI.

Theorem 6.4.

Under Assumptions 1 and 2, for any target policy π\pi, if we set λ=𝒪(1/n)\lambda^{-}=\mathcal{O}(1/\sqrt{n}), then the estimated pessimistic CI returned by the robust algorithm is valid, and (Jn+(π)Jn(π))(1γ)(J^{+}_{n}(\pi)-J^{-}_{n}(\pi))(1-\gamma) is bounded above by

𝒪\displaystyle\mathcal{O} (c0n11+ν+c12n+V¯{Cn(δ/2,V¯,DΩ)+(C)2ln4δn+c1ln4δnV¯}+εopt+εmis𝔾),\displaystyle\Bigg{(}\frac{c_{0}}{n^{\frac{1}{1+\nu}}}+\sqrt{\frac{c_{1}^{2}}{n}}+\bar{V}\bigg{\{}C\mathcal{E}_{n}(\delta/2,\bar{V},D_{\Omega})+\sqrt{\frac{(C^{\star})^{2}\ln\frac{4}{\delta}}{n}}+\frac{c_{1}\sqrt{\ln\frac{4}{\delta}}}{n\bar{V}}\bigg{\}}+\varepsilon_{opt}+\varepsilon^{\mathbb{G}}_{mis}\Bigg{)},

where εopt=εopt(q)+εopt(τ)\varepsilon_{opt}=\varepsilon_{opt}(q)+\varepsilon_{opt}(\tau) and εmis𝔾=(c2/n+V¯)ετ+Cεq\varepsilon^{\mathbb{G}}_{mis}=(c_{2}/\sqrt{n}+\bar{V})\varepsilon_{\tau}+C\varepsilon_{q}.

In comparison to Theorem 6.3, an additional model-misspecification error of order 𝒪(c2ετ/n)\mathcal{O}(c_{2}\varepsilon_{\tau}/\sqrt{n}), introduced by the discriminator function, is propagated to the finite-sample bound. However, the strong convexity of the discriminator 𝔾()\mathbb{G}(\cdot) facilitates the optimization, and thus our algorithm can achieve εopt(τ)=𝒪(1/T2)\varepsilon_{opt}(\tau)=\mathcal{O}(1/T^{2}), a rate faster than 𝒪(1/T)\mathcal{O}(1/T), when the class Ω\Omega is a special non-convex set consisting of monotonic functions or linear transformations of strongly convex functions. A more concrete analysis of the approximation and optimization errors depends on the specific form of the function approximators, which is beyond the scope of this paper.

6.3 Convergence Analysis

In this section, we investigate the convergence of Algorithm 1 when τψ\tau_{\psi} is approximated by non-linear functions, e.g., DNNs, so that concavity no longer holds. In Theorem 6.5, we show that our algorithm converges sublinearly to a stationary point when the stepsize is diminishing. We first make the following regularity assumptions.

Assumption 3.

For any τψΩψ\tau_{\psi}\in\Omega_{\psi}, τψ\tau_{\psi} is differentiable (not necessarily convex or concave), bounded from below, and satisfies ψτψ1(s,a)ψτψ2(s,a)L0ψ1ψ2,for anys,a\|\nabla_{\psi}\tau_{\psi_{1}}(s,a)-\nabla_{\psi}\tau_{\psi_{2}}(s,a)\|\leq L_{0}\|\psi_{1}-\psi_{2}\|,\;\text{for any}\;s,a, where L0<L_{0}<\infty is some universal Lipschitz constant and \|\cdot\| denotes the Euclidean norm.

Assumption 3 imposes the first-order smoothness condition on the specified function class. It requires that the gradient of τψ\tau_{\psi} be L0L_{0}-Lipschitz continuous, which is a standard assumption in the optimization literature [4, 18].

Assumption 4.

|qθ1(s,a)qθ2(s,a)|L0θ1θ2,for anys,a,andqθ𝒬θ|q_{\theta_{1}}(s,a)-q_{\theta_{2}}(s,a)|\leq L_{0}\|\theta_{1}-\theta_{2}\|,\;\text{for any}\;s,a,\;\text{and}\;q_{\theta}\in\mathcal{Q}_{\theta}.

Assumption 4 holds for a wide range of function approximation classes, including feature mapping space with smooth basis functions, non-linear approximation classes, DNNs with Leaky ReLU activation function, or spectral normalization on ReLU activation [23].

Assumption 5.

The gradient of the function τψ()\tau_{\psi}(\cdot) evaluated at the saddle point ψ\psi^{*} is bounded; i.e., ψτψ(s,a)<c3\|\nabla_{\psi}\tau_{\psi^{*}}(s,a)\|<c_{3} uniformly over (s,a)(s,a) for some finite and positive constant c3c_{3}.

Assumption 5 is much weaker than the bounded-variance assumption on stochastic gradients commonly made in the existing literature [52, 44]. Our relaxation is enabled by a careful analysis of the variance of the stochastic gradient, which is shown to be upper-bounded by the suboptimality of the current iterate and the stochastic gradient at the saddle point. In the following, we derive the convergence rate for Algorithm 1, which holds for non-concave function approximation classes Ωψ\Omega_{\psi}, e.g., DNNs.

Theorem 6.5.

Under Assumptions 1 and 3-5, suppose Algorithm 1 runs T1T\geq 1 rounds with stepsize ηt=min{(tT)1/42Λ/σmax2L1,1/L1}\eta^{t}=\min\{(tT)^{-1/4}\sqrt{2\Lambda/\sigma^{2}_{\max}L_{1}},1/L_{1}\} for t=1,,Tt=1,...,T and the Euclidean distance is used for the Bregman divergence. If we pick the output ψT\psi^{T^{\star}} according to the probability mass function P(T=t)=2ηt(ηt)2L1t=1T(2ηt(ηt)2L1)P(T^{\star}=t)=\frac{2\eta^{t}-(\eta^{t})^{2}L_{1}}{\sum^{T}_{t=1}(2\eta^{t}-(\eta^{t})^{2}L_{1})}, then

𝔼[ψ¯(τψT)2]2ΛL1T+2ΛL1σmax2T3/4+2ΛL1σmax2T,\displaystyle\mathbb{E}[\|\nabla_{\psi}\bar{\mathcal{L}}(\tau_{\psi^{T^{\star}}})\|^{2}]\leq\frac{2\Lambda L_{1}}{T}+\frac{\sqrt{2\Lambda L_{1}\sigma^{2}_{\max}}}{T^{3/4}}+\sqrt{\frac{2\Lambda L_{1}\sigma^{2}_{\max}}{T}}, (24)

where ¯(τψ)\bar{\mathcal{L}}(\tau_{\psi}) is defined in (20), Λ:=¯(τψ0)minψ¯(τψ)\Lambda:=\bar{\mathcal{L}}(\tau_{\psi^{0}})-\min_{\psi}\bar{\mathcal{L}}(\tau_{\psi}) measures the gap between the initial and optimal solutions, L1L_{1} is a Lipschitz constant depending on c2,c3,M,L0,V¯c_{2},c_{3},M,L_{0},\bar{V}, and the variance of the stochastic gradient is bounded above by

σmax:=maxt1:Tc4θ~(ψt)θ2+c5ψtψ2,\displaystyle\sigma_{\max}:=\max_{t\in 1:T}\sqrt{c_{4}\|\widetilde{\theta}(\psi^{t})-\theta^{*}\|^{2}+c_{5}\|\psi^{t}-\psi^{*}\|^{2}},

for some constants c4c_{4} depending on c3,L0,γc_{3},L_{0},\gamma; and c5c_{5} depending on c2,L0,V¯,λc_{2},L_{0},\bar{V},\lambda^{-} and γ\gamma. Here, θ~(ψt)\widetilde{\theta}(\psi^{t}) is the optimizer for (qθ,τψt)\mathcal{L}(q_{\theta},\tau_{\psi^{t}}).

Theorem 6.5 implies that Algorithm 1 converges sublinearly to a stationary point if σmax\sigma_{\max} is sufficiently small. The rate of convergence is also affected by the smoothness of the class Ωψ\Omega_{\psi} and the gap between the initial and optimal solutions. The rate is nearly optimal, since the third term in (24) is unimprovable for gradient descent methods on nonconvex problems, as shown in [30]. In the algorithm, we initially use decreasing stepsizes to enable early stopping. However, as the algorithm progresses and more local information about the gradient is obtained, increasing stepsizes, such as min{1/L1,𝒪(t/T)}\min\{1/L_{1},\mathcal{O}(\sqrt{t}/T)\}, may be more effective. This increasing stepsize strategy can also yield a convergence rate similar to the one obtained in (24).
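For concreteness, a small sketch of the stepsize schedule and the randomized output rule of Theorem 6.5 is given below; the constants Λ, σ_max and L₁ are problem-dependent and assumed given, and we parse the stepsize as (tT)^{-1/4}\sqrt{2\Lambda/(\sigma_{\max}^{2}L_{1})}.

```python
import numpy as np

def stepsizes_and_output_index(T, Lambda, sigma_max, L1, seed=0):
    """Stepsize schedule eta^t = min{(tT)^{-1/4}*sqrt(2*Lambda/(sigma_max^2*L1)), 1/L1}
    and output index T* drawn with P(T*=t) proportional to 2*eta^t - (eta^t)^2*L1."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    eta = np.minimum((t * T) ** (-0.25) * np.sqrt(2 * Lambda / (sigma_max ** 2 * L1)),
                     1.0 / L1)
    w = 2 * eta - eta ** 2 * L1          # positive since eta <= 1/L1
    T_star = rng.choice(t, p=w / w.sum())
    return eta, int(T_star)
```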

7 Simulation Studies

In this section, we evaluate the empirical performance of our methods using two synthetic environments: the CartPole benchmark environment from OpenAI Gym [5] and the simulation environment of [36]. In Section 7.1, we study the tightness of the estimated CIs. In Section 7.2, we investigate the policy evaluation performance based on the estimated CIs in scenarios with large distributional shifts. Additionally, we conduct a sensitivity analysis with respect to the hyperparameter λ\lambda^{-} in the pessimistic CI. We compare the proposed CI with several state-of-the-art approaches, including Bootstrapping Fitted Q-evaluation (B-FQE; [22]), Accountable Bellman Evaluation (ABE; [15]) and the stepwise importance sampling-based estimator HCOPE proposed by [64]. The true discounted return of the target policy π\pi is computed by Monte Carlo approximation [40]. For the simulation environment, the system dynamics are given by

st+1\displaystyle s^{t+1} =(0.75(2at1)000.75(12at))st+(0110)stst𝕀2×1+εt,\displaystyle=\left(\begin{array}[]{cc}0.75\left(2a^{t}-1\right)&0\\ 0&0.75\left(1-2a^{t}\right)\end{array}\right)s^{t}+\left(\begin{array}[]{cc}0&1\\ 1&0\end{array}\right)\odot s^{t}{s^{t}}^{\top}\mathbb{I}_{2\times 1}+\varepsilon^{t},
rt\displaystyle r^{t} =st+1(21)14(2at1)+(st+1st+1)32(0.250.5),\displaystyle={s^{t+1}}^{\top}\left(\begin{array}[]{l}2\\ 1\end{array}\right)-\frac{1}{4}\left(2a^{t}-1\right)+({s^{t+1}}^{\top}s^{t+1})^{\frac{3}{2}}\odot\left(\begin{array}[]{c}0.25\\ 0.5\end{array}\right),

for t0t\geq 0, where \odot denotes the Hadamard product, 𝕀\mathbb{I} is the identity matrix, the noise {εt}t0iidN(02×1,0.25𝕀2×2)\left\{\varepsilon^{t}\right\}_{t\geq 0}\stackrel{{\scriptstyle iid}}{{\sim}}N\left({0}_{2\times 1},0.25\mathbb{I}_{2\times 2}\right) and the initial state variable s0N(02×1,0.25𝕀2×2)s^{0}\sim N\left({0}_{2\times 1},0.25\mathbb{I}_{2\times 2}\right). The transition dynamic mainly follows the design in [36], but the reward function we consider here is more complex. For the basic setup of CartPole environment, we refer the readers to [5] for more details. We fix the confidence level at δ=0.05\delta=0.05, and the discount factor γ=0.95\gamma=0.95 to construct CIs. In particular, for implementing pessimistic CI estimation, we set the discriminator function as a quadratic form, i.e., 𝔾(x)=12(x1)2\mathbb{G}(x)=\frac{1}{2}(x-1)^{2}.

7.1 Empirical Tightness Analysis of Estimated Interval

In this section, we calculate the width of the estimated CIs. We consider different sample size settings. Note that the setting n=200n=200 is an extremely small sample size in offline RL problems [34].

In the CartPole environment, we follow [67] to learn a near-optimal policy using a softmax policy class [20] with a temperature parameter α=1\alpha=1 as the behavior policy. In addition, we set α=0.4\alpha=0.4 and α=0.2\alpha=0.2 for constructing two target policies. We note that the target policies are more deterministic than the behavior policy, and a smaller α\alpha implies a larger difference between the behavior policy and the target policy. In the simulation environment, we consider a completely randomized study and set {at}t0\{a^{t}\}_{t\geq 0} to be i.i.d. Bernoulli random variables with expectation 0.50.5. The target policy we consider is designed as π(a=1|s)=p\pi(a=1|s)=p if s[1]+s[2]>0s_{[1]}+s_{[2]}>0 and 1p1-p otherwise, where s[j]s_{[j]} denotes the jj-th element of the vector ss. The two target policies are generated by p=0.9p=0.9 and 0.70.7, respectively. In both environments, we use a feedforward neural network with 22 hidden layers of 256256 units each to approximate Ω\Omega in the pessimistic CI estimation, and an RKHS finite representer for function approximation in the neutral CI estimation.
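For concreteness, a minimal sketch of sampling from this target policy is given below (Python; the 1-indexed components s_[1], s_[2] of the paper correspond to indices 0 and 1 here).

```python
import numpy as np

def simulation_target_policy(s, p=0.9, rng=None):
    """Target policy of Section 7.1 for the simulation environment:
    pi(a=1|s) = p if s_[1] + s_[2] > 0, and 1 - p otherwise (p = 0.9 or 0.7)."""
    rng = rng if rng is not None else np.random.default_rng()
    prob_a1 = p if (s[0] + s[1]) > 0 else 1.0 - p   # s is the 2-dimensional state (0-indexed)
    return int(rng.random() < prob_a1)
```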

Figure 2: The estimated CIs for target policy 11 in the simulation and CartPole environments over different methods. The experiments are repeated 5050 times with random seeds. The horizontal dashed line represents the Monte Carlo true return for the target policy. The threshold parameter λ\lambda for the pessimistic CI is fixed to be 11.
Figure 3: The sensitivity analysis of the hyperparameter λ\lambda^{-} on the successful ranking proportion in the two synthetic environments with 5050 repeated experiments.

The results for target policy 11 in the simulation and CartPole environments are depicted in Figure 2, and the results for target policy 22 are provided in the appendix. We summarize our findings as follows. First, our proposed neutral and pessimistic CIs yield much tighter bounds than ABE and HCOPE, especially in the small sample size settings (e.g., n=200,500n=200,500). Although B-FQE achieves tighter intervals than ours, it is highly biased and fails to capture the Monte Carlo true reward of the target policy. The tightness of our CIs is mainly due to the fact that our framework breaks the curse of the bias and uncertainty tradeoff. This allows us to choose an expressive neural network or finite RKHS function class to model Ω\Omega and reduce model-misspecification bias without worrying about an increase in uncertainty deviation. Another reason for the performance gain may be that our proposed method does not impose any i.i.d. or weak-dependence assumptions on the data, which is not the case for the competing methods. When applied directly to interdependent data, the competing methods often incur larger bias and an increased interval width [13]. Lastly, our proposed method inherently quantifies the model-misspecification error in the CI and is thus robust to function approximation error. This distinguishes our method from the competing methods, which fail to quantify the model-misspecification error and suffer from large bias when the true model is not correctly specified.

Comparing the neutral and pessimistic CIs, we observe that the pessimistic CI is generally wider than the neutral one in the simulation environment. This might be because modeling Ω\Omega in an RKHS already achieves sufficient expressivity to capture the true model in this smooth MDP. Therefore, the pessimistic CI offers no additional advantage in reducing model-misspecification error and instead inherits additional bias from the discriminator function. In contrast, the MDP smoothness condition does not hold in the CartPole environment, and the neutral CI incorporates more bias. As a result, the neutral CI is almost as tight as the estimated pessimistic CI.

7.2 Robustness to Distributional Shifts

In this section, we aim to evaluate the performance of the proposed CIs in handling distributional shifts. In the context of offline reinforcement learning, a common safe optimization criterion is to maximize the worst-case performance among a set of statistically plausible models [31, 26]. This pessimism principle is closely related to robust MDPs which can handle distributional shift [71, 26]. Our proposed framework naturally provides a lower bound to implement the pessimism principle. To evaluate the proposed method, we conduct experiments with a set of target policies that have varying degrees of distributional shift. We intentionally design these target policies to have almost the same Monte Carlo true rewards. Therefore, users will select the policies close to the behavior policy, i.e., those with a small distributional shift. We use the estimated confidence lower bound to rank the target policies, and the criterion for good performance is to correctly rank the policies according to their degree of distributional shift.

In the distributional shift experiments, we again use the soft actor-critic algorithm [20] to construct the softmax policies. In contrast to the constructions in the last section, here we learn a near-optimal policy with a very small temperature parameter α=0.05\alpha=0.05 as the behavior policy in the CartPole environment. This means that the behavior policy is less exploratory and has a higher probability of causing large distributional shifts in the policy evaluation step. To evaluate the proposed method, we generate three different near-optimal policies as the target policies, denoted as π1,π2,π3\pi^{\text{1}},\pi^{\text{2}},\pi^{\text{3}}. The temperature parameter used for learning the three policies increases from α=0.2\alpha=0.2 for π1\pi^{\text{1}} to α=0.5\alpha=0.5 for π2\pi^{\text{2}} and then α=0.8\alpha=0.8 for π3\pi^{\text{3}}, indicating that π1\pi^{\text{1}} has the smallest distributional shift while π3\pi^{\text{3}} has the largest. We carefully control the true rewards of the three policies to lie within a very small gap of 1, so the correct ranking of the three target policies should be π1\pi^{\text{1}}, then π2\pi^{\text{2}}, and then π3\pi^{\text{3}}. In the simulation environment, we apply the same design principle. Specifically, we generate the target policy π1\pi^{\text{1}} by adding a small random noise to the behavior policy, resulting in the smallest distributional shift. The remaining two target policies are designed to be more exploratory than π1\pi^{\text{1}}.

Neutral CI Pessimistic CI B-FQE ABE HCOPE
Simulation 71/100 92/100 64/100 53/100 37/100
CartPole 78/100 95/100 71/100 67/100 38/100
Table 1: The proportion of successful policy rankings by the estimated confidence lower bound over different methods in 100100 repeated experiments. For each experiment, a success is counted only when the ranking of the three target policies is exactly correct; otherwise, it is counted as a failure.

The results in Table 1 demonstrate that the pessimistic CI achieves the best performance among all approaches. The discriminator in the pessimistic approach helps mitigate the estimation bias caused by distributional shift. Among the remaining approaches, we find that the performance of the proposed neutral CI is comparable to that of B-FQE. The decent performance of the neutral CI is due to the relatively accurate estimation of the confidence lower bound, while the performance of B-FQE mainly relies on its distributional consistency guarantee, which provides certain robustness to distributional shift [22].

We also provide a sensitivity analysis of the hyperparameter λ\lambda^{-} in Figure 3. The values of λ\lambda^{-} vary over a wide range [0.05,5][0.05,5], and the corresponding success percentages of the ranking task are reported. In general, the successful ranking percentage is not very sensitive to the choice of λ\lambda^{-}, and the best results mainly concentrate in the range [0.75,2][0.75,2]. A λ\lambda^{-} that is too small leads to insufficient distributional shift adjustment, whereas a λ\lambda^{-} that is too large results in overly pessimistic reasoning.

8 OhioT1DM Case Study

We apply the proposed methods to the Ohio type 1 diabetes (OhioT1DM) mobile health study [43], which contains six patients with type 1 diabetes, each with eight weeks of life-event data including health status measurements and insulin injection dosage. As each individual patient has distinctive glucose dynamics, we follow [75] in regarding each patient’s data as an independent dataset, where the data from each day form a single trajectory. We summarize the collected measurements over 6060-min intervals, so that the maximum length of the horizon is 2424 and the total number of transition pairs in each patient’s dataset is around n=360n=360 after removing missing samples and outliers. The state variable sts^{t} is set to be a three-dimensional vector including the average blood glucose level s[1]ts^{t}_{[1]}, the average heart rate s[2]ts^{t}_{[2]} and the total carbohydrate intake s[3]ts^{t}_{[3]} during the time interval [t1,t][t-1,t]. Here, the reward is defined as the average of the index of glycemic control [55] between time t1t-1 and tt, measuring the health status of the patient’s glucose level. That is, rt=𝕀(s[1]t>140)|s[1]t140|1.10+𝕀(s[1]t<80)(s[1]t80)230,{r^{t}}=-\frac{\mathbb{I}({s^{t}_{[1]}}>140)|{s^{t}_{[1]}}-140|^{1.10}+\mathbb{I}({s^{t}_{[1]}}<80)({s^{t}_{[1]}}-80)^{2}}{30}, which implies that the reward rtr^{t} is non-positive and a larger value is preferred. We discretize the insulin dose level into 55 levels, forming the action space 𝒜={0,1,,4}\mathcal{A}=\{0,1,...,4\}, and we set γ=0.9,δ=0.05\gamma=0.9,\delta=0.05 for estimating the CIs.
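Since the reward above is a deterministic function of the average blood glucose level, it can be transcribed directly; the short Python sketch below does so (the thresholds 140 and 80 and the exponent 1.10 follow the formula above).

```python
def glycemic_reward(glucose):
    """Index-of-glycemic-control reward used in Section 8:
    r = -( I(g>140)*|g-140|^{1.10} + I(g<80)*(g-80)^2 ) / 30,
    where g is the average blood glucose level over the interval."""
    penalty = 0.0
    if glucose > 140:
        penalty = abs(glucose - 140) ** 1.10
    elif glucose < 80:
        penalty = (glucose - 80) ** 2
    return -penalty / 30.0
```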

We consider three different target policies for evaluation, with the design settings described in Figure 4. The true value of π3\pi^{\text{3}} should be lower than that of the other two policies and also lower than that of the behavior policy followed by the clinicians. Policies π1\pi^{\text{1}} and π2\pi^{\text{2}} are both near-optimal and should achieve similar performance in terms of mean value [32]. However, as the conservative Q-learning algorithm is risk-sensitive to distributional shift, one can expect policy π1\pi^{\text{1}} to have a higher lower bound in the estimated CI.

Figure 4: The estimated CIs for the difference between the three target policies and the behavior policy over 66 patients in the OhioT1DM dataset. Policy 1 is learned by a distributional-shift-robust policy optimization algorithm, conservative Q-learning [32]; policy 2 is learned by the soft actor-critic algorithm [20]; and policy 3 is set to be a randomized policy.

In Figure 4, we plot the estimated CIs obtained by our method and the ABE method. We summarize our findings as follows. In terms of the validity of the proposed CI estimation, both the proposed neutral and pessimistic CIs overlap with, and are contained in, the CIs estimated by ABE. In addition, our CIs for the policies π1\pi^{\text{1}} and π2\pi^{\text{2}} are above 0 and the estimated intervals for π3\pi^{\text{3}} are below 0. This is consistent with the fact that policies π1\pi^{\text{1}} and π2\pi^{\text{2}} are designed as near-optimal policies, which should be better than the behavior policy. In contrast, policy π3\pi^{3} is a randomized policy and thus should be worse than the policy induced by the clinicians. The above findings confirm that our estimated CIs are valid. In terms of tightness, our CIs are narrower than the ones estimated by ABE. This confirms that the proposed method resolves the tradeoff between bias and uncertainty. In addition, ABE requires i.i.d. sampled offline data, and thus struggles in this complicated real environment, leading to wide estimated intervals. In terms of robustness to distributional shift, the estimated pessimistic CI identifies a consistently higher confidence lower bound for the distributional-shift-robust policy π1\pi^{\text{1}} over all 66 patients. This implies that the pessimistic CI can efficiently evaluate the target policy by taking distributional shift information into account. In conclusion, our analysis suggests that applying the proposed method could potentially improve some patients’ health status efficiently and reliably, especially in environments with large distributional shifts.

9 Discussion

This paper investigates the off-policy confidence interval estimation problem in behavior-agnostic settings under minimal data-collection assumptions. The proposed framework incorporates distributional shift information, which makes the estimated confidence interval robust when learning from pre-collected offline datasets. We propose a unified error analysis and point out the issue of the tradeoff between evaluation bias and statistical uncertainty. A tradeoff-free estimation approach is then developed to resolve this issue, which leads to possibly the tightest confidence interval estimations. Several improvements and extensions are worth exploring in the future. First, we may extend the proposed framework to environments with unobservable confounders. In addition, our numerical studies have demonstrated the potential of the proposed method in ranking candidate target policies, which could further support the design of safe policy improvement or policy optimization algorithms based on the current framework. This could be an interesting direction to explore in the future.

References

  • [1] Agarwal, A., Jiang, N., Kakade, S. M. and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep., 4–10.
  • [2] Bach, F. (2017). Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research 18, 629–681.
  • [3] Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, 30–37. Elsevier.
  • [4] Bernstein, J., Wang, Y.-X., Azizzadenesheli, K. and Anandkumar, A. (2018). signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, 560–569. PMLR.
  • [5] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [6] Challen, R., Denny, J., Pitt, M., Gompels, L., Edwards, T. and Tsaneva-Atanasova, K. (2019). Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28, 231–237.
  • [7] Chen, J. and Jiang, N. (2022). Offline reinforcement learning under value and density-ratio realizability: the power of gaps. In Uncertainty in Artificial Intelligence, 378–388. PMLR.
  • [8] Chen, X. and Qi, Z. (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. In International Conference on Machine Learning, 3558–3582. PMLR.
  • [9] Choi, E. and Lee, C. (2003). Feature extraction based on Bhattacharyya distance. Pattern Recognition 36, 1703–1709.
  • [10] Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C. and Schuurmans, D. (2020). CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems 33, 9398–9411.
  • [11] Dai, B., Shaw, A., He, N., Li, L. and Song, L. (2017). Boosting the actor with dual critic. arXiv preprint arXiv:1712.10282.
  • [12] Duan, Y., Jia, Z. and Wang, M. (2020). Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, 2701–2709. PMLR.
  • [13] Duchi, J. C., Glynn, P. W. and Namkoong, H. (2021). Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research 46, 946–969.
  • [14] Fan, X., Grama, I. and Liu, Q. (2015). Exponential inequalities for martingales with applications. Electronic Journal of Probability 20, 1–22.
  • [15] Feng, Y., Li, L. and Liu, Q. (2019). A kernel loss for solving the Bellman equation. arXiv preprint arXiv:1905.10506.
  • [16] Feng, Y., Ren, T., Tang, Z. and Liu, Q. (2020). Accountable off-policy evaluation with kernel Bellman statistics. In International Conference on Machine Learning, 3102–3111. PMLR.
  • [17] Feng, Y., Tang, Z., Zhang, N. and Liu, Q. (2021). Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. arXiv preprint arXiv:2103.05741.
  • [18] Ghadimi, S. and Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23, 2341–2368.
  • [19] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research 13, 723–773.
  • [20] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P. et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
  • [21] Hanna, J., Stone, P. and Niekum, S. (2017). Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence 31.
  • [22] Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvári, C. and Wang, M. (2021). Bootstrapping fitted q-evaluation for off-policy inference. In International Conference on Machine Learning, 4074–4084. PMLR.
  • [23] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 876–885. Association for Uncertainty in Artificial Intelligence (AUAI).
  • [24] Jiang, N. and Huang, J. (2020). Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems 33, 2747–2758.
  • [25] Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 652–661. PMLR.
  • [26] Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, 5084–5096. PMLR.
  • [27] Kallus, N., Mao, X., Wang, K. and Zhou, Z. (2022). Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, 10598–10632. PMLR.
  • [28] Kallus, N. and Uehara, M. (2019). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. Advances in Neural Information Processing Systems 32.
  • [29] Kallus, N. and Uehara, M. (2020). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. The Journal of Machine Learning Research 21, 6742–6804.
  • [30] Kornowski, G. and Shamir, O. (2022). Oracle complexity in nonsmooth nonconvex optimization. Journal of Machine Learning Research 23, 1–44.
  • [31] Kumar, A., Fu, J., Soh, M., Tucker, G. and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems 32.
  • [32] Kumar, A., Zhou, A., Tucker, G. and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, 1179–1191.
  • [33] Le, H., Voloshin, C. and Yue, Y. (2019). Batch policy learning under constraints. In International Conference on Machine Learning, 3703–3712. PMLR.
  • [34] Levine, S., Kumar, A., Tucker, G. and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • [35] Li, M., Shi, C., Wu, Z. and Fryzlewicz, P. (2022). Reinforcement learning in possibly nonstationary environments. arXiv preprint arXiv:2203.01707.
  • [36] Liao, P., Klasnja, P. and Murphy, S. (2020). Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, 1–10.
  • [37] Liao, P., Qi, Z., Wan, R., Klasnja, P. and Murphy, S. A. (2022). Batch policy learning in average reward Markov decision processes. The Annals of Statistics 50, 3364–3387.
  • [38] Liu, Q., Li, L., Tang, Z. and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems 31.
  • [39] Lu, Y. and Lu, J. (2020). A universal approximation theorem of deep neural networks for expressing probability distributions. Advances in Neural Information Processing Systems 33, 3094–3105.
  • [40] Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E. and Kosorok, M. R. (2020). Estimating dynamic treatment regimes in mobile health using V-learning. Journal of the American Statistical Association 115, 692–706.
  • [41] Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., Gemp, I. and Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv preprint arXiv:1405.6757.
  • [42] Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E. and Popovic, Z. (2014). Offline policy evaluation across representations with applications to educational games. In AAMAS, vol. 1077.
  • [43] Marling, C. and Bunescu, R. (2020). The OhioT1DM dataset for blood glucose level prediction: Update 2020. KHD@IJCAI.
  • [44] Mokhtari, A., Ozdaglar, A. and Pattathil, S. (2020). A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, 1497–1507. PMLR.
  • [45] {barticle}[author] \bauthor\bsnmMurphy, \bfnmSusan A\binitsS. A., \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J., \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. and \bauthor\bsnmGroup, \bfnmConduct Problems Prevention Research\binitsC. P. P. R. (\byear2001). \btitleMarginal mean models for dynamic regimes. \bjournalJournal of the American Statistical Association \bvolume96 \bpages1410–1423. \endbibitem
  • [46] {barticle}[author] \bauthor\bsnmNachum, \bfnmOfir\binitsO. and \bauthor\bsnmDai, \bfnmBo\binitsB. (\byear2021). \btitleReinforcement learning via fenchel-rockafellar duality. \bjournalarXiv preprint arXiv:2001.01866. \endbibitem
  • [47] {barticle}[author] \bauthor\bsnmParikh, \bfnmNeal\binitsN., \bauthor\bsnmBoyd, \bfnmStephen\binitsS. \betalet al. (\byear2014). \btitleProximal algorithms. \bjournalFoundations and trends® in Optimization \bvolume1 \bpages127–239. \endbibitem
  • [48] {barticle}[author] \bauthor\bsnmPinelis, \bfnmIosif\binitsI. (\byear2012). \btitleOptimum bounds for the distributions of martingales in Banach spaces. \bjournalarXiv preprint arXiv:1208.2200. \endbibitem
  • [49] {barticle}[author] \bauthor\bsnmPrecup, \bfnmDoina\binitsD. (\byear2000). \btitleEligibility traces for off-policy policy evaluation. \bjournalComputer Science Department Faculty Publication Series \bpages80. \endbibitem
  • [50] {barticle}[author] \bauthor\bsnmRahimi, \bfnmAli\binitsA. and \bauthor\bsnmRecht, \bfnmBenjamin\binitsB. (\byear2007). \btitleRandom features for large-scale kernel machines. \bjournalAdvances in Neural Information Processing Systems \bvolume20. \endbibitem
  • [51] {barticle}[author] \bauthor\bsnmRamprasad, \bfnmPratik\binitsP., \bauthor\bsnmLi, \bfnmYuantong\binitsY., \bauthor\bsnmYang, \bfnmZhuoran\binitsZ., \bauthor\bsnmWang, \bfnmZhaoran\binitsZ., \bauthor\bsnmSun, \bfnmWill Wei\binitsW. W. and \bauthor\bsnmCheng, \bfnmGuang\binitsG. (\byear2022). \btitleOnline bootstrap inference for policy evaluation in reinforcement learning. \bjournalJournal of the American Statistical Association \bpages1–14. \endbibitem
  • [52] {binproceedings}[author] \bauthor\bsnmReddi, \bfnmSashank J\binitsS. J., \bauthor\bsnmHefny, \bfnmAhmed\binitsA., \bauthor\bsnmSra, \bfnmSuvrit\binitsS., \bauthor\bsnmPoczos, \bfnmBarnabas\binitsB. and \bauthor\bsnmSmola, \bfnmAlex\binitsA. (\byear2016). \btitleStochastic variance reduction for nonconvex optimization. In \bbooktitleInternational conference on machine learning \bpages314–323. \bpublisherPMLR. \endbibitem
  • [53] {barticle}[author] \bauthor\bsnmReem, \bfnmDaniel\binitsD., \bauthor\bsnmReich, \bfnmSimeon\binitsS. and \bauthor\bsnmDe Pierro, \bfnmAlvaro\binitsA. (\byear2019). \btitleRe-examination of Bregman functions and new properties of their divergences. \bjournalOptimization \bvolume68 \bpages279–348. \endbibitem
  • [54] {binproceedings}[author] \bauthor\bsnmRényi, \bfnmAlfréd\binitsA. (\byear1961). \btitleOn measures of entropy and information. In \bbooktitleProceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics \bvolume4 \bpages547–562. \bpublisherUniversity of California Press. \endbibitem
  • [55] {barticle}[author] \bauthor\bsnmRodbard, \bfnmDavid\binitsD. (\byear2009). \btitleInterpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control. \bjournalDiabetes Technology & Therapeutics \bvolume11 \bpagesS–55. \endbibitem
  • [56] {binproceedings}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmHuang, \bfnmJiawei\binitsJ. and \bauthor\bsnmJiang, \bfnmNan\binitsN. (\byear2022). \btitleA minimax learning approach to off-policy evaluation in confounded partially observable markov decision processes. In \bbooktitleInternational Conference on Machine Learning \bpages20057–20094. \bpublisherPMLR. \endbibitem
  • [57] {binproceedings}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmWan, \bfnmRunzhe\binitsR., \bauthor\bsnmChernozhukov, \bfnmVictor\binitsV. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2021). \btitleDeeply-debiased off-policy interval estimation. In \bbooktitleInternational Conference on Machine Learning \bpages9580–9591. \bpublisherPMLR. \endbibitem
  • [58] {barticle}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmZhang, \bfnmSheng\binitsS., \bauthor\bsnmLu, \bfnmWenbin\binitsW. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2020). \btitleStatistical inference of the value function for reinforcement learning in infinite horizon settings. \bjournalarXiv preprint arXiv:2001.04515. \endbibitem
  • [59] {barticle}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmZhu, \bfnmJin\binitsJ., \bauthor\bsnmYe, \bfnmShen\binitsS., \bauthor\bsnmLuo, \bfnmShikai\binitsS., \bauthor\bsnmZhu, \bfnmHongtu\binitsH. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2022). \btitleOff-policy confidence interval estimation with confounded markov decision process. \bjournalJournal of the American Statistical Association \bpages1–12. \endbibitem
  • [60] {barticle}[author] \bauthor\bsnmSriperumbudur, \bfnmBharath K\binitsB. K., \bauthor\bsnmFukumizu, \bfnmKenji\binitsK. and \bauthor\bsnmLanckriet, \bfnmGert RG\binitsG. R. (\byear2011). \btitleUniversality, Characteristic Kernels and RKHS Embedding of Measures. \bjournalJournal of Machine Learning Research \bvolume12. \endbibitem
  • [61] {bbook}[author] \bauthor\bsnmSutton, \bfnmRichard S\binitsR. S. and \bauthor\bsnmBarto, \bfnmAndrew G\binitsA. G. (\byear2018). \btitleReinforcement learning: An introduction. \bpublisherMIT press. \endbibitem
  • [62] {barticle}[author] \bauthor\bsnmTheocharous, \bfnmGeorgios\binitsG., \bauthor\bsnmChandak, \bfnmYash\binitsY., \bauthor\bsnmThomas, \bfnmPhilip S\binitsP. S. and \bauthor\bparticlede \bsnmNijs, \bfnmFrits\binitsF. (\byear2020). \btitleReinforcement learning for strategic recommendations. \bjournalarXiv preprint arXiv:2009.07346. \endbibitem
  • [63] {binproceedings}[author] \bauthor\bsnmThomas, \bfnmPhilip\binitsP. and \bauthor\bsnmBrunskill, \bfnmEmma\binitsE. (\byear2016). \btitleData-efficient off-policy policy evaluation for reinforcement learning. In \bbooktitleInternational Conference on Machine Learning \bpages2139–2148. \bpublisherPMLR. \endbibitem
  • [64] {binproceedings}[author] \bauthor\bsnmThomas, \bfnmPhilip\binitsP., \bauthor\bsnmTheocharous, \bfnmGeorgios\binitsG. and \bauthor\bsnmGhavamzadeh, \bfnmMohammad\binitsM. (\byear2015). \btitleHigh-confidence off-policy evaluation. In \bbooktitleProceedings of the AAAI Conference on Artificial Intelligence \bvolume29. \endbibitem
  • [65] {barticle}[author] \bauthor\bsnmTropp, \bfnmJoel A\binitsJ. A. (\byear2011). \btitleFreedman’s Inequality for Matrix Martingales. \bjournalElectronic Communications in Probability \bvolume16 \bpages262–270. \endbibitem
  • [66] {barticle}[author] \bauthor\bsnmTsybakov, \bfnmAlexandre B\binitsA. B. (\byear2004). \btitleIntroduction to nonparametric estimation, 2009. \bjournalSpringer Series in Statistics \bvolume9. \endbibitem
  • [67] {binproceedings}[author] \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmHuang, \bfnmJiawei\binitsJ. and \bauthor\bsnmJiang, \bfnmNan\binitsN. (\byear2020). \btitleMinimax weight and q-function learning for off-policy evaluation. In \bbooktitleInternational Conference on Machine Learning \bpages9659–9668. \bpublisherPMLR. \endbibitem
  • [68] {barticle}[author] \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmImaizumi, \bfnmMasaaki\binitsM., \bauthor\bsnmJiang, \bfnmNan\binitsN., \bauthor\bsnmKallus, \bfnmNathan\binitsN., \bauthor\bsnmSun, \bfnmWen\binitsW. and \bauthor\bsnmXie, \bfnmTengyang\binitsT. (\byear2022). \btitleFinite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. \bjournalarXiv preprint arXiv:2102.02981. \endbibitem
  • [69] {barticle}[author] \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmShi, \bfnmChengchun\binitsC. and \bauthor\bsnmKallus, \bfnmNathan\binitsN. (\byear2022). \btitleA Review of Off-Policy Evaluation in Reinforcement Learning. \bjournalarXiv preprint arXiv:2212.06355. \endbibitem
  • [70] {barticle}[author] \bauthor\bsnmWang, \bfnmRuosong\binitsR., \bauthor\bsnmFoster, \bfnmDean P\binitsD. P. and \bauthor\bsnmKakade, \bfnmSham M\binitsS. M. (\byear2020). \btitleWhat are the statistical limits of offline RL with linear function approximation? \bjournalarXiv preprint arXiv:2010.11895. \endbibitem
  • [71] {barticle}[author] \bauthor\bsnmXie, \bfnmTengyang\binitsT., \bauthor\bsnmCheng, \bfnmChing-An\binitsC.-A., \bauthor\bsnmJiang, \bfnmNan\binitsN., \bauthor\bsnmMineiro, \bfnmPaul\binitsP. and \bauthor\bsnmAgarwal, \bfnmAlekh\binitsA. (\byear2021). \btitleBellman-consistent pessimism for offline reinforcement learning. \bjournalAdvances in neural information processing systems \bvolume34. \endbibitem
  • [72] {barticle}[author] \bauthor\bsnmXie, \bfnmTengyang\binitsT., \bauthor\bsnmMa, \bfnmYifei\binitsY. and \bauthor\bsnmWang, \bfnmYu-Xiang\binitsY.-X. (\byear2019). \btitleTowards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. \bjournalAdvances in Neural Information Processing Systems \bvolume32. \endbibitem
  • [73] {barticle}[author] \bauthor\bsnmZhang, \bfnmRuiyi\binitsR., \bauthor\bsnmDai, \bfnmBo\binitsB., \bauthor\bsnmLi, \bfnmLihong\binitsL. and \bauthor\bsnmSchuurmans, \bfnmDale\binitsD. (\byear2020). \btitleGendice: Generalized offline estimation of stationary values. \bjournalarXiv preprint arXiv:2002.09072. \endbibitem
  • [74] {barticle}[author] \bauthor\bsnmZhou, \bfnmWenzhuo\binitsW., \bauthor\bsnmZhu, \bfnmRuoqing\binitsR. and \bauthor\bsnmQu, \bfnmAnnie\binitsA. (\byear2022). \btitleEstimating optimal infinite horizon dynamic treatment regimes via pt-learning. \bjournalJournal of the American Statistical Association \bpages1–14. \endbibitem
  • [75] {binproceedings}[author] \bauthor\bsnmZhu, \bfnmLiangyu\binitsL., \bauthor\bsnmLu, \bfnmWenbin\binitsW. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2020). \btitleCausal Effect Estimation and Optimal Dose Suggestions in Mobile Health. In \bbooktitleInternational Conference on Machine Learning \bpages11588–11598. \bpublisherPMLR. \endbibitem

SUPPLEMENTARY MATERIALS FOR
“DISTRIBUTIONAL SHIFT-AWARE OFF-POLICY INTERVAL ESTIMATION:
A UNIFIED ERROR QUANTIFICATION FRAMEWORK”

Wenzhuo Zhou, Yuhan Li, Ruoqing Zhu, and Annie Qu


A  Interval Validity and Tightness for Bias Quantification
B  Expressivity and Role of Q-function Class
C  Probabilistic Tools in Statistical Learning
D  Details of Numerical Experiments and Additional Results
E  Technical Proofs and Additional Theoretical Results
  E.1  Proof of Lemma 3.1
  E.2  Proof of Theorem A.1
  E.3  Proof of Corollary A.1
  E.4  Proof of Lemma E.1
  E.5  Proof of Theorem 3.1
  E.6  Proof of Theorem B.1
  E.7  Proof of Lemma 3.2
  E.8  Proof of Theorem 3.2
  E.9  Proof of Theorem 3.3
  E.10  Proof of Corollary 3.1
  E.11  Proof of Corollary 4.1
  E.12  Proof of Theorem 4.1
  E.13  Proof of Theorem 3.4
  E.14  Proof of Proposition 4.1
  E.15  Proof of Lemma 4.1
  E.16  Proof of Theorem 4.2
  E.17  Proof of Theorem 6.1
  E.18  Proof of Theorem 6.2
  E.19  Proof of Lemma E.2
  E.20  Proof of Theorem 6.3
    E.20.1  Proof of the validity of the neutral CI
    E.20.2  Proof of the error-robust finite-sample bound for the neutral CI
  E.21  Proof of Theorem 6.4
    E.21.1  Proof of the validity of the pessimistic CI
    E.21.2  Proof of the error-robust finite-sample bound for the pessimistic CI
  E.22  Proof of Theorem E.3
  E.23  Proof of Theorem E.4
  E.24  Proof of Theorem 6.5
References