
Distributional Shift-Aware Off-Policy Interval Estimation:
A Unified Error Quantification Framework

Wenzhuo Zhou (University of California Irvine), Yuhan Li (University of Illinois Urbana-Champaign), Ruoqing Zhu (University of Illinois Urbana-Champaign), and Annie Qu (University of California Irvine)
Abstract

We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes, where the objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. This task faces two primary challenges: providing a comprehensive and rigorous error quantification in CI estimation, and addressing the distributional shift that results from discrepancies between the distribution induced by the target policy and the offline data-generating process. Motivated by an innovative unified error analysis, we jointly quantify the two sources of estimation errors: the misspecification error on modeling marginalized importance weights and the statistical uncertainty due to sampling, within a single interval. This unified framework reveals a previously hidden tradeoff between the errors, which undermines the tightness of the CI. Relying on a carefully designed discriminator function, the proposed estimator achieves a dual purpose: breaking the curse of the tradeoff to attain the tightest possible CI, and adapting the CI to ensure robustness against distributional shifts. Our method is applicable to time-dependent data without assuming any weak dependence conditions via leveraging a local supermartingale/martingale structure. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings. The numerical performance of the proposed method is examined in synthetic datasets and an OhioT1DM mobile health study.

MSC: 62C20, 62E17

Keywords: Finite-sample confidence bound, Function approximation, Precision medicine, Reinforcement learning, Sequential decision-making

February 15, 2023

1 Introduction

Off-policy evaluation (OPE) is a crucial task in reinforcement learning (RL). Its primary goal is to evaluate a new policy (known as the target policy) using observational data generated by various existing policies (called behavior policies). In many real-world applications, e.g., healthcare [45, 40], financial marketing [62], robotics [64] and education [42], implementing a new policy can be costly, risky, or unsafe. OPE enables us to evaluate the performance of a new policy using only a set of observational data, without interacting with the real environment. This method substantially reduces the cost and risk associated with deploying a new policy, making it a valuable tool in practice.

There have been many works on point estimations of OPE in recent years. Popular approaches include value-based [33, 36, 8], importance sampling based [49, 25, 63], and marginalized importance sampling based approaches [38, 72, 67, 73, 46]. On the theoretical side, [67] established asymptotic optimality and efficiency for OPE in the tabular setting, and [29] provided a complete study of semiparametric efficiency in a more general setting. [70] studied the fundamental hardness of OPE with linear function approximation. We refer readers to [69] for a comprehensive review of OPE problems.

While much of the current research on OPE is centered on point estimation, in many real-world scenarios it is desirable to avoid making overconfident decisions that could result in costly and/or irreversible consequences. That is, rather than solely relying on a point estimate, many applications would benefit significantly from having a confidence interval (CI) on the value of a policy. Existing approaches on off-policy CI estimation are mainly based on asymptotic inference [36, 58, 10, 59] or bootstrapping [21, 22, 51]. In particular, [36] constructed a limiting distribution for the policy value function to estimate CIs in a tabular representation setting, and [58] considered an asymptotic CI for policy value with smoothness assumptions on the action-value function. Typically, the asymptotic inference or bootstrap-based methods require a large number of samples to achieve a desirable coverage. Unfortunately, in practical applications such as mobile health, the number of observations may be limited. For example, the OhioT1DM dataset [43] only contains a few hundred observations [57]. In such scenarios, providing non-asymptotic uncertainty quantification to meet practically feasible conditions may be crucial. [64] constructed an empirical Bernstein inequality applied to the stepwise importance sampling estimator. However, the estimated CI can become quite loose due to the high variance introduced by the stepwise importance weights, particularly in infinite horizon decision-making settings. Recently, [16, 17] proposed a finite-sample CI based on a kernel action-value function estimator. However, their approach relies on a strong assumption that the true model must be correctly specified, which may result in significant bias under potential model misspecification.

More importantly, all the aforementioned CI estimation approaches neglect the influence of the distributional shift, which is arguably the most significant challenge in offline RL. The distribution shift refers to the distributional mismatch between offline data distribution and that induced by target policies [34]. Algorithmically, the distributional shift often leads to compounding biased estimations of action values, resulting in unsatisfactory performance for many OPE algorithms [27]. Practically, in fields such as precision medicine, shifts in treatment regimens may produce arbitrary and sometimes harmful effects that are costly to diagnose and treat [6]. To mitigate this problem, it is essential to account for the distributional shift when performing high-confidence OPE.

From the perspective of error analysis in off-policy CI estimation, existing algorithms are typically developed based on a separate error quantification. These approaches focus either on quantifying the model-misspecification bias, which arises from function approximation, while neglecting sampling uncertainty, or on measuring statistical uncertainty under the assumption that the true model is accurately specified. As a result, the true relationship between the "bias" and the "uncertainty" remains unrevealed, which in turn creates a gap between algorithm development and practical applications. How to rigorously quantify the bias and uncertainty in a unified framework is unknown and challenging. Moreover, existing works usually assume independence or weak dependence, e.g., mixing random processes, in the offline data distribution. This assumption is often violated due to the complicated interdependence of samples in many real RL applications [1]. Additionally, most popular approaches only work when the offline data is induced by a single known or unknown behavior policy; they are not applicable in settings involving a mixture of multiple unknown behavior policies [28].

Motivated by the aforementioned issues, in this paper we propose a novel, robust, and possibly tightest off-policy CI estimation framework in infinite-horizon Markov decision process (MDP) settings. The advances of the proposed method and our contributions are summarized as follows. First, our proposed framework mitigates the issue of distributional shifts by properly incorporating a carefully designed discriminator function. This discriminator captures information about the distributional shift and adjusts the resulting CI accordingly, alleviating the risk of poor estimates when the distribution of offline data deviates from the distribution under the target policy. By doing so, the estimated CI effectively reduces potential decision-making risks and avoids unsatisfactory performance.

Second, we develop a novel and unified error quantification analysis that reveals a previously hidden tradeoff among the sources of errors. Specifically, our framework allows us to decompose the error in CI estimation into two distinct parts: the evaluation bias induced by a misspecification error of the function approximator, and the statistical uncertainty due to sampling. By unifying the two source errors within a single interval, we show that there is an inherent tradeoff between the errors, which can be naturally viewed as a "bias-uncertainty" tradeoff, as depicted in Figure 1. This tradeoff makes it challenging to strike a balance and obtain the tightest possible CI estimation.

Figure 1: An illustrative example of the curse of the tradeoff in CI estimation from the CartPole RL environment. Our unified error quantification analysis shows that the tightness of the estimated CI is determined by both bias and uncertainty, as discussed in Section 3. These factors are influenced by the complexity of the function approximation class. As the model complexity increases (i.e., the number of units increases), the evaluation bias is reduced but the statistical uncertainty increases, and vice versa. This tradeoff makes it challenging to maintain an optimal balance, resulting in a less tight CI. In contrast, our proposed tradeoff-free estimation approach effectively breaks this curse, ultimately yielding a tighter CI with a length of 7.14.

Third, we break the curse of the tradeoff and formulate a tradeoff-free framework, resulting in the possibly tightest CI, illustrated in Figure 1. In smooth MDP settings [58], we leverage the principle of maximum mean discrepancy to control uncertainty and preserve function expressivity simultaneously. For more general MDPs, we carefully calibrate the uncertainty deviation, making it entirely independent of the model-misspecification bias. We achieve this by leveraging the advantageous curvature of the designed discriminator function. As a result, both evaluation bias and sampling uncertainty can be minimized jointly without compromising one or the other.

Fourth, our method demonstrates broad applicability. To the best of our knowledge, our algorithm is the first general function approximation algorithm that can be applied to offline data collected without any weak dependence assumptions on the data distribution. The proposed method fully exploits the (super)martingale structure to handle statistical errors and works particularly efficiently in small sample size settings. Also, our algorithm is stable and efficient for a wide range of function approximators, including deep neural networks. These characteristics significantly expand the application scope of the proposed method.

Lastly, the proposed algorithm is sample-efficient and robust against model-misspecification and optimization errors. By sample efficiency, we mean that the sample complexity requirement for finding a near-tight CI is polynomial with respect to key parameters. We further demonstrate that the finite sample bound for learning a near-tight CI can be independent of the function class complexity, and shrink at a sublinear rate, i.e., $\mathcal{O}(1/n)$, under some mild conditions, where $n$ represents the sample size of the offline dataset. We make substantial progress towards solving the longstanding open problem in offline RL with non-linear function approximation [41, 34]. Thanks to the proposed stochastic approximation optimization algorithm, our approach is proven to converge sublinearly to a stationary point with a vanishing gradient even in non-linear (possibly nonconvex) function approximation settings.

The rest of the paper is organized as follows. Section 2 lays out the basic model notation and data-generating process, as well as the connection between the marginalized importance sampling and linear programming. Section 3 formally defines the discriminator function and quantifies bias and uncertainty through a unified error quantification analysis. Section 4 illustrates the curse of the tradeoff between bias and uncertainty, and introduces a novel framework to overcome this challenge. Section 5 presents an efficient stochastic approximation algorithm for CI estimation. A comprehensive theoretical study of our algorithm is provided in Section 6. Sections 7 and 8 demonstrate the empirical performance of our method. Section 9 concludes this paper with a discussion. All technical proofs and additional discussion can be found in the appendix.

2 Preliminaries

2.1 Markov Decision Process

We consider an infinite-horizon discounted Markov decision process (MDP) $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathds{P},\gamma,r,s^{0}\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathds{P}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ is the Markov transition kernel for some probabilistic simplex $\Delta$, $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, $\gamma\in[0,1)$ is the discount factor and $s^{0}$ is the initial state. A policy $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ induces a distribution of the trajectory $s^{0},a^{0},r^{0},s^{1},\ldots$, where $a^{t}\sim\pi(\cdot|s^{t})$, $r^{t}=r(s^{t},a^{t})$, $s^{t+1}\sim\mathds{P}(\cdot|s^{t},a^{t})$ for any $t\geq 0$. The expected discounted return of a policy is defined as $J(\pi)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r^{t}|\pi]$. The discounted return when the trajectory starts with $(s,a)$ and all remaining actions are taken according to $\pi$ is called the $q$-function $q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow(-\infty,\bar{V}]$. The $q$-function $q^{\pi}$ is the unique fixed point of the Bellman operator $\mathcal{B}^{\pi}$, satisfying the Bellman equation [61]: $\mathcal{B}^{\pi}q(s,a)\coloneqq r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s^{\prime},\pi)]$. Here $q(s^{\prime},\pi)$ is shorthand for $\mathbb{E}_{a^{\prime}\sim\pi\left(\cdot|s^{\prime}\right)}\left[q\left(s^{\prime},a^{\prime}\right)\right]$, and we define $\mathds{P}^{\pi}q(s,a):=\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}\left[q\left(s^{\prime},\pi\right)\right]$.

Another important notion is the normalized discounted visitation of $\pi$, defined as $d^{\pi}(s,a)\coloneqq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d^{\pi,t}(s,a)$, where $d^{\pi,t}$ is the marginal state-action distribution at time-step $t$. Essentially, $d^{\pi}(s,a)$ characterizes the states and actions visited by $\pi$. For notational convenience, we write $\mathbb{E}_{d^{\pi}}[\cdot]$ to represent $\mathbb{E}_{d^{\pi}}\left[g\left(s,a,r,s^{\prime}\right)\right]:=\int_{s,a}d^{\pi}(s,a)\mathds{1}_{\{r=r(s,a)\}}\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}\left[g\left(s,a,r,s^{\prime}\right)\right]$. In the offline RL setting, there exists an unknown offline data-generating distribution $\mu$ induced by behavior policies. With a slight abuse of notation, we refer to both the marginal distribution over $(s,a)$ and the joint distribution over $(s,a,r,s^{\prime})$ as $\mu$. Although $\mu$ is unknown, we observe a set of transition pairs, forming the offline dataset $\mathcal{D}_{1:n}\coloneqq\{s_{i},a_{i},r_{i},s^{\prime}_{i}\}^{n}_{i=1}$ sampled from $\mu$.

We impose the following minimal data collection condition throughout the paper:

\[
\mu(r_{i},s^{\prime}_{i}|s_{i},a_{i},s_{i-1},a_{i-1},r_{i-1},s^{\prime}_{i-1},\ldots,s_{1},a_{1})=\mathds{P}(s^{\prime}_{i}|s_{i},a_{i})\mathds{1}_{\{r_{i}=r(s_{i},a_{i})\}}.
\]

This condition relaxes the standard assumptions widely made in the RL literature; e.g., [38, 58, 37] typically require the sample path to be either i.i.d. or weakly dependent. In contrast, we only require that the reward $r_{i}$ and the transition state $s^{\prime}_{i}$ follow the rules specified by $r$ and $\mathds{P}$, but do not impose any restrictions on the generation of the state-action pair $(s_{i},a_{i})$. Therefore, the dataset $\mathcal{D}_{1:n}$ can be collected as a set of multiple trajectories with interdependent samples, or under a mix of arbitrary unknown behavior policies. Ultimately, the goal of high-confidence OPE is to construct an efficient CI for estimating the return of the policy $\pi$ using the offline dataset.

2.2 Marginalized Importance Sampling and Linear Programming

In this section, we provide a concise review of two essential concepts in OPE: marginalized importance sampling (MIS) [38] and linear programming (LP) [11]. These concepts form the foundation for our subsequent discussion. Following the definition of MIS in [38], we have $\tau_{d^{\pi}/\mu}(s,a):=\frac{d^{\pi}(s,a)}{\mu(s,a)}$, which characterizes the ratio of the visitation induced by the target policy to that of the behavior policy. If it exists, $\mathbb{E}_{d^{\pi}}\left[g\left(s,a,r,s^{\prime}\right)\right]=\mathbb{E}_{\tau_{d^{\pi}/\mu}}\left[g\left(s,a,r,s^{\prime}\right)\right]=\mathbb{E}_{\mu}\left[\tau_{d^{\pi}/\mu}(s,a)g\left(s,a,r,s^{\prime}\right)\right]$, where $\mathbb{E}_{\tau}[\cdot]:=\mathbb{E}_{\mu}[\tau(s,a)(\cdot)]$ is shorthand used throughout the paper. The return $J(\pi)$ can be obtained by re-weighting the reward with respect to the MIS weight, i.e., $J(\pi)=\frac{\mathbb{E}_{d^{\pi}}[r]}{1-\gamma}=\frac{\mathbb{E}_{\tau_{d^{\pi}/\mu}}[r]}{1-\gamma}$. Motivated by this fact, [67, 56] proposed a minimax estimator that approximates $\tau_{d^{\pi}/\mu}$ by searching for a weight function $\tau(s,a):=\frac{d(s,a)}{\mu(s,a)}$ in a specified function class $\tau\in\Omega$:

\[
\min_{\tau\in\Omega}\max_{q\in\mathcal{Q}}W(q,\tau):=\mathbb{E}_{\mu}\left[\gamma\tau(s,a)q\left(s^{\prime},\pi\right)-\tau(s,a)q(s,a)\right]+(1-\gamma)q\left(s^{0},\pi\right). \tag{1}
\]

Once $\tau_{d^{\pi}/\mu}$ is solved, $d^{\pi}$ can be obtained using the chain rule, i.e., $d^{\pi}=\tau_{d^{\pi}/\mu}\cdot\mu$.
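For concreteness, the following is a minimal sketch (ours, not from the paper) of a Monte Carlo estimate of the objective $W(q,\tau)$ in (1) on a tabular MDP; the array-based encoding of $\tau$, $q$, and $\pi$ and the function name are illustrative assumptions.

\begin{verbatim}
import numpy as np

def empirical_W(tau, q, pi, data, s0, gamma):
    """Monte-Carlo estimate of the minimax objective W(q, tau) in (1).

    data   : list of transition tuples (s, a, r, s') with integer-coded
             states/actions (a simplifying assumption of this sketch).
    tau, q : arrays indexed as tau[s, a], q[s, a].
    pi     : target-policy matrix, pi[s, a] = pi(a | s).
    """
    terms = []
    for s, a, r, s_next in data:
        q_next = np.dot(pi[s_next], q[s_next])      # q(s', pi)
        terms.append(tau[s, a] * (gamma * q_next - q[s, a]))
    q_s0 = np.dot(pi[s0], q[s0])                    # q(s0, pi)
    return np.mean(terms) + (1.0 - gamma) * q_s0
\end{verbatim}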

Recently, [46] showed that the above minimax problem has an equivalent solution to the LP problem [11]: $\text{maximize}_{d}\;\mathbb{H}(d)\equiv 0$ subject to

\[
(1-\gamma)\mathds{1}\{s=s^{0}\}\pi(a|s)+\gamma\pi(a|s)\sum_{{s}^{\prime},{a}^{\prime}}\mathds{P}(s|{s}^{\prime},{a}^{\prime})d({s}^{\prime},{a}^{\prime})=d(s,a),\;\text{for any}\;s,a, \tag{2}
\]

where $\mathbb{H}(\cdot)$ is a functional that takes a visitation $d$ as input and is set to be a constant zero function for the LP problem. It is crucial to note that (2) is over-constrained for any given $s,a$. Specifically, the visitation $d^{\pi}$ can be uniquely identified solely based on the constraints. This over-constrained characteristic explains why $\mathbb{H}(d)$ can be set as a zero constant function and why (2) is sufficient to recover the optimal solution $d^{\pi}$. In the following, we will utilize this property to incorporate distributional shift information into the proposed framework.

3 Off-Policy Confidence Interval Estimation

In this section, we propose a novel off-policy CI estimation framework that can adapt to distributional shifts. To develop this framework, we first decompose the source errors in CI estimation into two critical components: evaluation bias, which arises from model misspecifications, and statistical uncertainty stemming from sampling. We subsequently integrate and quantify both components, allowing them to be unified within a single interval. The way to encode the distributional shift information is naturally introduced in evaluation bias quantification.

3.1 Evaluation Bias Quantification

We study the first major component in CI estimation, i.e., evaluation bias, and propose a value interval method to quantify it. The developed framework is robust to model misspecification and naturally encodes the distributional shift information. To begin with, we formally define the discriminator function, which serves a dual purpose in this work. For now, we primarily concentrate on one of these roles, which is to measure the extent of deviation between the offline data-generating distribution $\mu$ and the distribution induced by the target policy. Later, in Section 4, we will demonstrate the other role that the discriminator function plays in estimating the CI.

Definition 3.1.

For $x,c_{1},c_{2},C\in\mathbb{R}^{+}$ and $C\geq 1$, the discriminator function $\mathbb{G}(\cdot)$ satisfies the following conditions: (1) Strong convexity: $\mathbb{G}(x)$ is $M$-strongly convex with respect to $x$. (2) Boundedness: $|\mathbb{G}(x)|\leq c_{1}$ for $x\in[0,C]$. (3) Boundedness of the first-order derivative: $|\mathbb{G}^{\prime}(x)|\leq c_{2}$ if $x\in[0,C]$. (4) Non-negativity: $\mathbb{G}(x)\geq 0$. (5) 1-minimum: $\mathbb{G}(1)=0$.

The family of Rényi entropy [54], Bhattacharyya distance [9], and simple quadratic form functions all satisfy the conditions outlined in Definition 3.1.
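As a concrete illustration, the following sketch (ours, with an assumed bound $C=4$) numerically checks the conditions of Definition 3.1 for the quadratic discriminator $\mathbb{G}(x)=(x-1)^{2}$, which is 2-strongly convex.

\begin{verbatim}
import numpy as np

def G(x):
    """Quadratic discriminator G(x) = (x - 1)^2."""
    return (x - 1.0) ** 2

def G_prime(x):
    """First derivative G'(x) = 2(x - 1)."""
    return 2.0 * (x - 1.0)

C = 4.0                                   # assumed upper bound on tau
grid = np.linspace(0.0, C, 1001)
c1 = np.max(np.abs(G(grid)))              # boundedness on [0, C]
c2 = np.max(np.abs(G_prime(grid)))        # bounded first-order derivative
assert np.all(G(grid) >= 0.0)             # non-negativity
assert np.isclose(G(1.0), 0.0)            # 1-minimum
# strong convexity: G''(x) = 2, so G is M-strongly convex with M = 2
print(f"c1 = {c1:.1f}, c2 = {c2:.1f}, M = 2")
\end{verbatim}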

Remark 3.1.

To understand the motivation behind the design of the discriminator function, consider taking the marginal importance ratio $\tau_{d^{\pi}/\mu}(s,a)=d^{\pi}(s,a)/\mu(s,a)$ as the input for $\mathbb{G}(\cdot)$. The 1-minimum condition at $d^{\pi}(s,a)/\mu(s,a)=1$, i.e., $d^{\pi}(s,a)=\mu(s,a)$, indicates that no distributional shift is present or detected by the discriminator $\mathbb{G}$. The strong convexity property characterizes the uniqueness of the minimum point and ensures that the rate of change increases as $d^{\pi}(s,a)$ deviates from $\mu(s,a)$. The boundedness condition on the first-order derivative imposes smoothness on $\mathbb{G}$, while the other two conditions render $\mathbb{G}$ practically quantifiable.

As discussed in Section 2.2, the LP problem (2) is over-constrained and the visitation $d^{\pi}$ can be uniquely identified. Consequently, one may replace the objective $\mathbb{H}(d)$ with another functional, provided that the newly-replaced functional does not affect the optimal solution $d^{*}=d^{\pi}$. Motivated by this observation, we propose to choose $\mathbb{H}(d)=\lambda\mathbb{E}_{\mu}[\mathbb{G}(\tau(s,a))]$, where $\tau(s,a):=\frac{d(s,a)}{\mu(s,a)}$ and the hyperparameter $\lambda\in\mathbb{R}$ encodes the user's preference and degree of sensitivity to the distributional shift. When $\lambda<0$, the user favors small distributional shifts and appeals to pessimistic evaluation, which encourages downgrading policies associated with large distributional shifts. For $\lambda=0$, the user maintains a neutral attitude toward the distributional shift; furthermore, our framework degenerates to (2) or (1) under this condition. From this perspective, both the minimax estimation in (1) and the LP problem in (2) can be considered special cases of our approach with $\lambda=0$. Using the connection between (1) and (2), we can convert the newly-defined LP problem with $\mathbb{H}(d)=\lambda\mathbb{E}_{\mu}[\mathbb{G}(\tau(s,a))]$ to a max-min problem. This can be expressed as $\min_{\tau\in\Omega}\max_{q\in\mathcal{Q}}L(q,\tau)$, where

\[
L(q,\tau):=\frac{\mathbb{E}_{\mu}[\tau(s,a)\left(\gamma q\left(s^{\prime},\pi\right)-q(s,a)\right)+\lambda\mathbb{G}(\tau(s,a))]}{1-\gamma}+q\left(s^{0},\pi\right).
\]

Unfortunately, directly optimizing the objective function $L(q,\tau)$ has some drawbacks. To comprehensively reveal these drawbacks, we first present a key lemma to support our arguments.

Lemma 3.1.

For any target policy $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ and $\tau:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{+}$,

\[
\frac{\mathbb{E}_{\tau}[r(s,a)]}{1-\gamma}-J^{\alpha}(\pi)=\frac{\mathbb{E}_{\tau}\left[q^{\pi}(s,a)-\gamma q^{\pi}\left(s^{\prime},\pi\right)+\lambda\mathbb{G}(\tau(s,a))/\tau(s,a)\right]}{1-\gamma}-q^{\pi}(s^{0},\pi),
\]

where $J^{\alpha}(\pi):=J(\pi)+\lambda\xi(\mathbb{G},\tau)$ for $\xi(\mathbb{G},\tau):=\mathbb{E}_{\mu}[\frac{\mathbb{G}(\tau(s,a))}{1-\gamma}]$, and $q^{\pi}$ is the unique fixed point of the Bellman equation $\mathcal{B}^{\pi}q=q$.

Lemma 3.1 suggests that the rationale behind $\min_{\tau\in\Omega}\max_{q\in\mathcal{Q}}L(q,\tau)$ is to find a "good" $\tau^{*}$ such that $\frac{\mathbb{E}_{\tau}[r(s,a)]}{1-\gamma}$ and $J^{\alpha}(\pi)$ are sufficiently close when $q^{\pi}\in\mathcal{Q}$. Then $\tau^{*}$ can be plugged into $\frac{\mathbb{E}_{\tau}[r(s,a)]}{1-\gamma}$ to approximate $J^{\alpha}(\pi)$. In fact, this rationale also implicitly requires the existence of $\tau^{*}\in\Omega$. When $\Omega$ is misspecified, the evaluation will exhibit a high bias. For OPE problems significantly affected by the distributional shift, the true model is particularly challenging to specify [24, 7]. Additionally, another limitation of the min-max estimation is that the objective function $L(q,\tau)$ completely disregards the reward information.

Motivated by the above-mentioned drawbacks, we propose a more robust interval method for estimation. Free from any model-specification assumptions on $\Omega$, our proposal implicitly quantifies the evaluation bias resulting from the model-misspecification error within the interval. According to Lemma 3.1 and assuming $q^{\pi}$ is correctly specified, the following inequality holds for any $\tau\in\Omega$:

\[
J(\pi) \geq \inf_{q\in\mathcal{Q}}\left\{\frac{\mathbb{E}_{\tau}\left[r(s,a)+\gamma q\left(s^{\prime},\pi\right)-q(s,a)\right]}{1-\gamma}+q(s^{0},\pi)\right\}-\lambda^{-}\xi(\mathbb{G},\tau) \coloneqq \inf_{q\in\mathcal{Q}}L(\tau,q,\pi)-\lambda^{-}\xi(\mathbb{G},\tau), \tag{3}
\]

where $\lambda^{-}:=-\min\{0,\lambda\}$. Similarly, the return $J(\pi)$ can be bounded from above:

\[
J(\pi)\leq\sup_{q\in\mathcal{Q}}L(\tau,q,\pi)+\lambda^{+}\xi(\mathbb{G},\tau),\;\text{for any}\;\tau\in\Omega, \tag{4}
\]

where $\lambda^{+}:=\max\{0,\lambda\}$. Considering that (3) and (4) hold for any $\tau\in\Omega$, we can optimize them over $\tau$ to obtain the tightest possible interval:

\[
J(\pi)\in\left[\sup_{\tau\in\Omega}\inf_{q\in\mathcal{Q}}L(\tau,q,\pi)-\lambda^{-}\xi(\mathbb{G},\tau),\;\inf_{\tau\in\Omega}\sup_{q\in\mathcal{Q}}L(\tau,q,\pi)+\lambda^{+}\xi(\mathbb{G},\tau)\right], \tag{5}
\]

where we denote the upper and lower bounds as $J^{+}(\pi)$ and $J^{-}(\pi)$, respectively. This optimization can also be interpreted as searching for the best $\tau^{*}$ within the class $\Omega$ to minimize the model-misspecification error. Based on the value interval in (5), we demonstrate that the evaluation bias, i.e., the model-misspecification error of $\Omega$, is implicitly quantified in the interval and can be decomposed into two components.

Theorem 3.1 (Evaluation bias).

For any target policy $\pi$, the evaluation bias is bounded above:

\[
J^{+}(\pi)-J^{-}(\pi)\leq 2\bigg{(}\underbrace{\inf_{\tau\in\Omega}\bigg{\{}\sup_{q\in\mathcal{Q}}\left|\mathbb{E}_{\tau}\Big{[}\frac{\gamma q(s^{\prime},\pi)-q(s,a)}{1-\gamma}+q(s^{0},\pi)\Big{]}\right|}_{\epsilon_{mis}}+\frac{1}{2}\underbrace{|\lambda|\xi(\mathbb{G},\tau)}_{\epsilon_{dist}}\bigg{\}}\bigg{)}.
\]

Theorem 3.1 implies that the evaluation bias consists of two components: the model-misspecification error $\epsilon_{mis}$ and the bias caused by the discriminator function, denoted by $\epsilon_{dist}$. The latter can be effectively controlled by adjusting the magnitude of $\lambda$. The former, $\epsilon_{mis}$, depends on the quality of the model specification of $\Omega$. In particular, if the class $\Omega$ is well-specified, $\epsilon_{mis}$ can diminish to zero. We direct readers to Section A in the appendix for a more comprehensive discussion on the tightness and validity of the interval.

Remark 3.2.

In contrast to the role of $\Omega$, the class $\mathcal{Q}$ primarily influences bias quantification. However, there is no need to be concerned about a tradeoff between bias and uncertainty, since the specification of $\mathcal{Q}$ solely affects bias quantification. Consequently, an expressive function class can generally be chosen for modeling $\mathcal{Q}$. Furthermore, since the specification of the function class $\mathcal{Q}$ is independent of uncertainty quantification, it is feasible to construct an optimal data-dependent $\mathcal{Q}$ using the information generated by fitting $q^{\pi}$ with any state-of-the-art method, such as double reinforcement learning [29]. This may increase the likelihood of correctly specifying $\mathcal{Q}$. We offer an in-depth discussion on the role of $\mathcal{Q}$ and demonstrate that the evaluation bias resulting from the model-misspecification errors of $\mathcal{Q}$ can depend on the expressivity of the feature mapping class in Theorem B.1, provided in the appendix.

To conclude this section, we remark that the value interval (5) directly quantifies the evaluation bias and incorporates a well-controlled adjustment scheme for distributional shifts. As a result, our proposed framework is robust against both model-misspecification errors and bias estimation due to distributional shifts. It is important to note that the value interval (5) is not a CI, as it does not address the uncertainty arising from sampling. In the following section, we will explore uncertainty quantification using our established interval estimation framework.

3.2 Statistical Uncertainty Quantification

In this section, we first examine the potential sources of uncertainty that may appear in CI estimation. We then establish two types of CIs for the scenarios where $\lambda=0$ and $\lambda<0$, and separately quantify the uncertainty in these two CIs. Intuitively, these two CIs represent neutral and pessimistic attitudes towards distributional shifts. We begin by presenting a key lemma for our proposal. Recall that we do not impose any independent or weakly dependent data collection assumptions, e.g., mixing conditions, for uncertainty quantification.

Lemma 3.2 (Empirical evaluation lemma).

Given offline data $\mathcal{D}_{1:n}$, for any target policy $\pi$, the following equation holds:

\[
\bar{J}^{\alpha}(\pi)=\frac{1}{n}\sum^{n}_{i=1}\frac{\tau(s_{i},a_{i})(r_{i}+\gamma\mathds{P}^{\pi}q^{\pi}(s_{i},a_{i})-q^{\pi}(s_{i},a_{i}))+\lambda\mathbb{G}(\tau(s_{i},a_{i}))}{1-\gamma}+q^{\pi}(s^{0},\pi), \tag{6}
\]

where $\bar{J}^{\alpha}(\pi)=J(\pi)+\lambda\xi_{n}(\mathbb{G},\tau)$ for $\xi_{n}(\mathbb{G},\tau)=\frac{1}{n}\sum^{n}_{i=1}\frac{\mathbb{G}(\tau(s_{i},a_{i}))}{1-\gamma}$.

Lemma 3.2 suggests that there are two sources of uncertainty when constructing the CI. The first source originates from the estimand $\{\tau(s_{i},a_{i})(r_{i}+\gamma\mathds{P}^{\pi}q^{\pi}(s_{i},a_{i})-q^{\pi}(s_{i},a_{i}))\}^{n}_{i=1}$ and can be approximated by $\{\tau(s_{i},a_{i})(r_{i}+\gamma q^{\pi}(s^{\prime}_{i},\pi)-q^{\pi}(s_{i},a_{i}))\}^{n}_{i=1}$. The second source stems from the discriminator terms $\{\mathbb{G}(\tau(s_{i},a_{i}))\}^{n}_{i=1}$. Once the uncertainty from these two sources is quantified, it can be incorporated into the CIs according to Lemma 3.2.

3.3 Uncertainty in Neutral Confidence Interval

In the following, we present our results on uncertainty quantification and establish the CI for the case $\lambda=0$, which implies that the distributional shift is not taken into account. Specifically, we characterize the uncertainty deviation, denoted by $\sigma_{n}$ in (8), which constitutes a major component of the proposed neutral CI in (7).

Theorem 3.2 (Neutral CI).

For any target policy $\pi$, suppose $q^{\pi}\in\mathcal{Q}$ and $\tau\in\Omega$; then the return $J(\pi)$ falls into the following CI with probability at least $1-\delta$,

\[
J(\pi)\in\left[\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}-\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(-q,\tau)-\sigma_{n},\ \frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}+\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(q,\tau)+\sigma_{n}\right], \tag{7}
\]

if the uncertainty deviation $\sigma_{n}$ satisfies

\[
P\left(\sup_{\tau\in\Omega}\Big{|}\widehat{L}^{q}_{n}(q^{\pi},\tau)\Big{|}\leq\sigma_{n}\right)\geq 1-\delta, \tag{8}
\]

where

\begin{align*}
\widehat{L}^{q}_{n}(q,\tau) &:=\frac{1}{n}\sum^{n}_{i=1}\frac{\tau(s_{i},a_{i})\left(q(s_{i},a_{i})-r_{i}-\gamma q(s^{\prime}_{i},\pi)\right)}{1-\gamma},\\
\widehat{M}^{\tau}_{n}(q,\tau) &:=\frac{1}{n}\sum^{n}_{i=1}\frac{\tau(s_{i},a_{i})(\gamma q(s^{\prime}_{i},\pi)-q(s_{i},a_{i}))}{1-\gamma}+q(s^{0},\pi).
\end{align*}

A closer examination of (7) and (8) reveals two key insights. Regarding the class $\mathcal{Q}$, even though the bias quantification necessitates searching possible $q$ uniformly over $\mathcal{Q}$, the statistical uncertainty deviation $\sigma_{n}$ can be evaluated point-wise, specifically at the true $q$-function $q^{\pi}$, as shown in (8). Regarding the class $\Omega$, the result in (7) holds for any $\tau\in\Omega$, even if the specified $\Omega$ class does not capture the true ratio $\tau_{d^{\pi}/\mu}$. This suggests that the CI in (7) is robust against model-misspecification errors. The misspecification of $\Omega$ only influences the tightness of the CI but does not compromise its validity. Also, optimizing the confidence lower and upper bounds in (7) over $\tau\in\Omega$ can help tighten the CI.
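For illustration, here is a minimal sketch (not part of the paper's implementation) of the empirical statistics $\widehat{L}^{q}_{n}$ and $\widehat{M}^{\tau}_{n}$ and of the neutral CI (7) for a fixed $\tau$, assuming tabular arrays and a finite candidate set standing in for $\mathcal{Q}$; all function names are hypothetical.

\begin{verbatim}
import numpy as np

def L_hat(q, tau, pi, data, gamma):
    """Empirical statistic L^q_n(q, tau) from Theorem 3.2."""
    vals = [tau[s, a] * (q[s, a] - r - gamma * np.dot(pi[s2], q[s2]))
            for s, a, r, s2 in data]
    return np.mean(vals) / (1.0 - gamma)

def M_hat(q, tau, pi, data, s0, gamma):
    """Empirical statistic M^tau_n(q, tau) from Theorem 3.2."""
    vals = [tau[s, a] * (gamma * np.dot(pi[s2], q[s2]) - q[s, a])
            for s, a, _r, s2 in data]
    return np.mean(vals) / (1.0 - gamma) + np.dot(pi[s0], q[s0])

def neutral_ci(q_class, tau, pi, data, s0, gamma, sigma_n):
    """Neutral CI (7) for a fixed tau; the supremum over Q is taken over a
    finite candidate list q_class (a simplifying assumption of this sketch)."""
    center = np.mean([r * tau[s, a] for s, a, r, _ in data]) / (1.0 - gamma)
    lower = center - max(M_hat(-q, tau, pi, data, s0, gamma) for q in q_class) - sigma_n
    upper = center + max(M_hat(q, tau, pi, data, s0, gamma) for q in q_class) + sigma_n
    return lower, upper
\end{verbatim}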

We still need to determine the uncertainty deviation $\sigma_{n}$ in (8) to obtain a tractable CI. Typically, the deviation $\sigma_{n}$ depends on the complexity of the hypothesis function class $\Omega$ and on uniform concentration arguments. To measure the function class complexity, we use a generalization of the VC-dimension, called the pseudo-dimension (see Definition C.3 in the appendix). As for the uniform concentration arguments, we construct a martingale difference sequence, which allows us to apply the empirical Freedman's concentration inequality [65]. In the following, we illustrate our general result for determining $\sigma_{n}$ using $L_{\infty}$-bounded classes for modeling $\Omega$.

Theorem 3.3 ($L_{\infty}$-function class).

For any target policy $\pi$, we define a composite function class $\mathcal{G}\circ\Omega=\{g\circ\tau:g\in\mathcal{G},\tau\in\Omega\}$ such that $g\circ\tau(s,a,r,s^{\prime})={\tau}(s,a)\big{(}r(s,a)+\gamma{q}^{\pi}(s^{\prime},{\pi})-{q}^{\pi}(s,a)\big{)}$ for any $(s,a,r,s^{\prime})$. Suppose $0\leq\tau(s,a)\leq C$ for any $s,a$; then Theorem 3.2 holds if we choose

\[
\sigma_{n}=C\sqrt{\frac{2\bar{V}^{2}\ln\frac{8\mathcal{N}\left(1/\sqrt{n},\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)}{\delta}}{n(1-\gamma)^{2}}}+\frac{2C\bar{V}\ln\frac{8\mathcal{N}\left(1/\sqrt{n},\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)}{\delta}}{3(1-\gamma)n}, \tag{9}
\]

for $\mathcal{N}\left(\epsilon,\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)\leq e\left(D_{\Omega}+1\right)(4e\bar{V}^{2})^{D_{\Omega}}\left({1}/{\epsilon}\right)^{D_{\Omega}}$, where $\mathcal{N}\left(\epsilon,\mathcal{G}\circ\Omega,\|\cdot\|_{L_{2}}\right)$ is the $\epsilon$-covering number (see Definition C.1 in the appendix) of $\mathcal{G}\circ\Omega$, and $D_{\Omega}$ is the pseudo-dimension of the function class $\Omega$.

As shown in (9), the deviation $\sigma_{n}$ increases with the pseudo-dimension of $\Omega$ and decreases with the sample size at the rate $\mathcal{O}(1/\sqrt{n})$. Note that this is an information-theoretic result, as finding the exact value of the pseudo-dimension for an arbitrary $L_{\infty}$-class function is challenging. To refine our result, we consider using a feature mapping class for modeling $\Omega$. Let $\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d}$ be a $d$-dimensional feature mapping, and define the feature mapping class $\Omega_{\psi}:=\{(s,a)\mapsto\langle\phi(s,a),\psi\rangle\,|\,\|\psi\|_{L_{2}}\leq\text{diam}_{\psi}\}$, where $\text{diam}_{\psi}\in(0,\infty)$.

Corollary 3.1.

For any target policy $\pi$, suppose $\tau\in\Omega_{\psi}$; then Theorem 3.2 holds if we choose

\[
\sigma_{n}=\frac{\bar{V}\sqrt{2\lambda_{\max}(\Sigma_{n})\ln\frac{4d\sqrt{n}\bar{V}\text{diam}_{\psi}+2}{\delta}}}{2(1-\gamma)n}+\frac{2\,\text{diam}_{\psi}\bar{V}\ln\frac{4d\sqrt{n}\bar{V}\text{diam}_{\psi}+2}{\delta}}{3(1-\gamma)n}, \tag{10}
\]

where $\lambda_{\max}(\Sigma_{n})$ is the maximum eigenvalue of the matrix $\Sigma_{n}=\sum^{n}_{i=1}\phi(s_{i},a_{i})\phi^{\top}(s_{i},a_{i})$.

Corollary 3.1 provides a tractable quantification of the uncertainty deviation $\sigma_{n}$. Each element in (10) is either user-specified or can be calculated using the offline data $\mathcal{D}_{1:n}$. Compared to the vanilla rate of $\mathcal{O}(1/\sqrt{n})$ in Theorem 3.3, the uncertainty deviation can vanish at a faster sublinear rate, i.e., $\mathcal{O}(1/n)$, if $\lambda_{\max}(\Sigma_{n})\ll\text{diam}_{\psi}$. This improvement in the rate of convergence benefits from our careful analysis of the variance terms with respect to $\Omega_{\psi}$.
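The deviation in (10) can be computed directly from the offline data; the following is a minimal sketch under the assumption that the feature vectors $\phi(s_{i},a_{i})$ are stacked into a matrix, with illustrative parameter values.

\begin{verbatim}
import numpy as np

def sigma_n_feature(Phi, V_bar, diam_psi, gamma, delta):
    """Uncertainty deviation sigma_n from Corollary 3.1, Eq. (10).

    Phi : (n, d) matrix whose i-th row is phi(s_i, a_i).
    """
    n, d = Phi.shape
    Sigma_n = Phi.T @ Phi                       # sum_i phi phi^T
    lam_max = np.linalg.eigvalsh(Sigma_n)[-1]   # largest eigenvalue
    log_term = np.log((4.0 * d * np.sqrt(n) * V_bar * diam_psi + 2.0) / delta)
    first = V_bar * np.sqrt(2.0 * lam_max * log_term) / (2.0 * (1.0 - gamma) * n)
    second = 2.0 * diam_psi * V_bar * log_term / (3.0 * (1.0 - gamma) * n)
    return first + second

# example usage with random features (illustration only)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 10))
print(sigma_n_feature(Phi, V_bar=10.0, diam_psi=1.0, gamma=0.9, delta=0.05))
\end{verbatim}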

3.4 Uncertainty in Pessimistic Confidence Interval

In this section, we establish the CI for $\lambda<0$, which we call the pessimistic CI. Since $\lambda\neq 0$, the pessimistic CI incorporates the discriminator function defined in Definition 3.1. In the case of $\lambda<0$, the pessimistic CI favors target policies with small distributional shifts. This property allows the pessimistic CI to differentiate the goodness of policies that suffer from biased estimation due to distributional shifts [34]. Additionally, compared to the uncertainty quantification in the neutral CI, we need to take one more step to quantify the uncertainty arising from the estimand of the discriminator function. In the following theorem, we present our formal result on statistical sampling uncertainty quantification and establish the pessimistic CI.

Theorem 3.4 (Pessimistic CI).

For any target policy $\pi$, suppose $q^{\pi}\in\mathcal{Q}$; then the return $J(\pi)$ falls into the following CI for any $\tau\in\Omega$ with probability at least $1-\delta$,

\[
\bigg[\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}-\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(-q,\tau)-\lambda^{-}\xi_{n}(\mathbb{G},\tau)-\sigma^{L}_{n},\ \frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}+\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(q,\tau)+\sigma^{U}_{n}\bigg], \tag{11}
\]

if the uncertainty deviations $\sigma^{L}_{n}$ and $\sigma^{U}_{n}$ satisfy

\[
P\Bigg{(}\left\{\sup_{\tau\in\Omega}\widehat{L}^{q}_{n}(q^{\pi},\tau)\leq\sigma^{U}_{n}\right\}\;\bigcap\;\left\{\inf_{\tau\in\Omega}\big{\{}\widehat{L}^{q}_{n}(q^{\pi},\tau)+\lambda^{-}\xi_{n}(\mathbb{G},\tau)\big{\}}\geq-\sigma^{L}_{n}\right\}\Bigg{)}\geq 1-\delta. \tag{12}
\]

The confidence upper bound in (11) is nearly identical to the one in the neutral CI, so we do not discuss it further here. However, the confidence lower bound in (11) differs from the one in (7). The current confidence lower bound incorporates distributional shift information as well as the corresponding uncertainty included in the deviation $\sigma^{L}_{n}$. By doing so, the estimand of the discriminator $\xi_{n}(\mathbb{G},\tau)$ helps to offset the evaluation bias caused by distributional shifts. When users follow the pessimism principle [31, 26], the confidence lower bound can make policy evaluation and optimization more reliable when the offline dataset is poorly explored or exhibits significant distributional shifts. Careful readers may notice that the involvement of the discriminator tends to widen the interval. However, the degree of widening can be well controlled by $\lambda^{-}$. Additionally, due to the potentially tightest construction of the CI, which will be demonstrated in the next section, the pessimistic CI does not lead to overly pessimistic reasoning even with the inclusion of the extra discriminator.

To make (11) estimable, we need to determine the uncertainty deviations $\sigma^{U}_{n}$ and $\sigma^{L}_{n}$. For $\sigma^{U}_{n}$, we can easily follow the derivation in the neutral CI. However, for $\sigma^{L}_{n}$, the quantification is slightly more complicated. To determine $\sigma^{L}_{n}$, we construct a pseudo estimator

\[
\mathcal{U}(s_{i},a_{i}):=(1-\gamma)^{-1}\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s_{i},a_{i})}[\tau(s_{i},a_{i})(q(s_{i},a_{i})-r_{i}-\gamma q(s^{\prime},\pi))+\lambda^{-}\mathbb{G}(\tau(s_{i},a_{i}))],
\]

such that $\{\widehat{L}^{q}_{n}(q^{\pi},\tau)+\lambda^{-}\xi_{n}(\mathbb{G},\tau)-n^{-1}\sum^{n}_{i=1}\mathcal{U}(s_{i},a_{i})\}$ forms a summation of a martingale difference sequence. We can then follow an almost identical martingale concentration technique as in Theorem 3.3 to determine $\sigma^{L}_{n}$, so we omit it here. In the next section, we will provide a more elegant and tighter approach for quantifying the uncertainty deviation $\sigma^{L}_{n}$.

4 Breaking the Curse of Tradeoff

In this section, we first highlight a finding that there exists a hidden tradeoff between the evaluation bias, resulting from model misspecification, and the statistical uncertainty, originating from sampling, as implied by the unified error quantification analysis for CI estimation. This seeming conflict prevents the simultaneous minimization of these two sources of errors in order to obtain tighter CIs. This phenomenon can be naturally viewed as the curse of the "bias-uncertainty" tradeoff, as shown in Figure 1. Based on this observation, we propose two efficient solutions to break the curse for the neutral and pessimistic CI, respectively.

4.1 Bias and Uncertainty Tradeoff

As discussed in the previous sections, bias and uncertainty are the two major sources of errors in CI estimation. In order to obtain a tight interval, one would hope to minimize both of them. Statistical uncertainty, as shown in (9), is highly dependent on the complexity of the function class, measured in terms of the pseudo-dimension $D_{\Omega}$. In general, a high pseudo-dimension indicates that the set of functions is complex and thus requires a sufficiently large deviation to control the uncertainty uniformly over all functions in the set. To minimize the uncertainty deviation, we prefer to model $\Omega$ using a parsimonious class of functions with a low pseudo-dimension. As for evaluation bias, Theorem 3.1 indicates that the model-misspecification error of $\Omega$ is a major component. Minimizing bias requires minimizing the model-misspecification error. Generally, a more expressive function class, i.e., one with a high pseudo-dimension, is more likely to capture the true model and potentially reduce the bias due to model misspecification. For example, [59] allow the VC-dimension to diverge with the sample size to reduce the estimator's bias due to model misspecification. If we translate this principle to our framework, it is equivalent to allowing the pseudo-dimension $D_{\Omega}=\mathcal{O}(n^{\kappa})$, where $0<\kappa\leq 1$.

Taking both bias and uncertainty into account, we regrettably identify a tradeoff between them, as exemplified in Figure 1. On one hand, a more parsimonious function class can attain enhanced efficiency in uncertainty quantification. However, this comes at the expense of sacrificing the function class’s expressivity, consequently increasing the bias. On the other hand, a more expressive function class aids in reducing model misspecification errors but demands a higher cost to account for the uncertainty inherent in a complex set of functions. Is there a way to break the curse of the bias and uncertainty tradeoff? In the following sections, we will present our solution to this question.

4.2 Tradeoff-Free Neutral Confidence Interval

In this section, we propose a kernel representation approach to break the curse of the tradeoff for the neutral CI in smooth MDP settings. Under this representation, the uncertainty deviation is independent of the function class complexity. We make use of a powerful tool in concentration measures, Pinelis’ inequality [48], to quantify the data-dependent uncertainty deviation in Hilbert space. As a result, this can achieve minimum costs regarding uncertainty quantification, while simultaneously allowing the model-misspecification error to be arbitrarily minimized due to the flexibility of the kernel representation. To begin with, we introduce the definition of the smooth MDP.

Definition 4.1.

Given an infinite-horizon discounted MDP $\mathcal{M}=\left\{\mathcal{S},\mathcal{A},\mathds{P},\gamma,r,s^{0}\right\}$, suppose the reward function $r(s,a)$ and the Markov transition kernel $\mathds{P}(\cdot|s,a)$ are continuous over $(s,a)$; then we call $\mathcal{M}$ a smooth MDP and denote it as $\widetilde{\mathcal{M}}$.

Smooth MDP settings are widely satisfied in many real-world applications and studies in the existing literature [35, 58]. Recall that our goal is to quantify the uncertainty deviation $\sup_{\tau\in\Omega}|\widehat{L}^{q}_{n}(q^{\pi},\tau)|$ in (8) to establish the CI. According to Theorem 3.2 in [74], the maximizer of the problem $\sup_{\tau\in\Omega}|\widehat{L}^{q}_{n}(q^{\pi},\tau)|$, i.e., $\tau^{*}(s,a)$, is a continuous function in the smooth MDP setting $\mathcal{M}=\widetilde{\mathcal{M}}$. This motivates us to consider smooth function classes for modeling $\Omega$, provided that these function classes can approximate continuous functionals sufficiently well. In particular, we model $\Omega$ in a bounded reproducing kernel Hilbert space (RKHS) equipped with a positive definite kernel $K(\cdot,\cdot)$, i.e., $\Omega_{RKHS}(C_{K})\coloneqq\{\tau\in RKHS:\|\tau\|_{RKHS}\leq C_{K}\}$, where $\|\cdot\|_{RKHS}$ denotes the kernel norm and $C_{K}>0$ is a constant. This kernel representation allows the maximization problem $\sup_{\tau\in\Omega_{RKHS}(C_{K})}|\widehat{L}^{q}_{n}(q^{\pi},\tau)|$ to admit a simple closed-form solution $\tau^{*}$; we note that this is based on the maximum mean discrepancy [19]. Accordingly, the maximum value $(1-\gamma)\widehat{L}^{q}_{n}(q^{\pi},\tau^{*})$ equals

\[
\sqrt{\frac{C_{K}}{n^{2}}\sum_{ij}({q}^{\pi}(s_{i},a_{i})-r_{i}-\gamma{q}^{\pi}(s^{\prime}_{i},{\pi}))K(\{s_{i},a_{i}\},\{s_{j},a_{j}\})({q}^{\pi}(s_{j},a_{j})-r_{j}-\gamma{q}^{\pi}(s^{\prime}_{j},{\pi}))}, \tag{13}
\]

where the square root operator is well-defined by the positive definite kernel $K(\cdot,\cdot)$.
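A minimal sketch of evaluating the closed form (13) is given below, assuming a Gaussian RBF kernel on stacked state-action features and a precomputed vector of Bellman residuals $q^{\pi}(s_{i},a_{i})-r_{i}-\gamma q^{\pi}(s^{\prime}_{i},\pi)$; both choices are illustrative assumptions rather than the paper's specification.

\begin{verbatim}
import numpy as np

def rbf_kernel(X, bandwidth=1.0):
    """Gaussian RBF kernel matrix on rows of X (state-action features)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_deviation(delta_bellman, K, C_K, gamma):
    """Closed-form supremum: returns L^q_n(q^pi, tau*), where
    delta_bellman[i] = q(s_i,a_i) - r_i - gamma * q(s'_i, pi)."""
    n = len(delta_bellman)
    quad = delta_bellman @ K @ delta_bellman     # sum_ij delta_i K_ij delta_j
    value = np.sqrt(C_K * max(quad, 0.0)) / n    # Eq. (13), guarded for round-off
    return value / (1.0 - gamma)                 # undo the (1 - gamma) factor
\end{verbatim}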

The closed-form solution helps us elaborate on uncertainty quantification. In (13), the function $\tau$ is maximized out and no longer appears in the sample estimand. Remarkably, the complexity of $\Omega$ does not influence the uncertainty deviation, which is now entirely independent of the function class complexity. This transforms the uniform uncertainty quantification into a point-wise uncertainty quantification focused on the global maximizer $\tau^{*}$, significantly reducing the uncertainty deviation. Furthermore, it has been demonstrated that an RKHS can achieve arbitrarily small approximation errors when approximating continuous functions [2]. In other words, it is possible to minimize the model-misspecification error of $\Omega$, and even eliminate it entirely, as long as the true model $\tau^{\star}$ is continuous, which holds in smooth MDP settings.

In summary, by leveraging the kernel representation, we observe a double descent phenomenon whereby the model-misspecification error and uncertainty deviation can be minimized simultaneously. Imposing a minor constraint on the continuity of MDPs, we overcome the curse of the bias and uncertainty tradeoff. In the following, we provide a brief overview of our method for quantifying the uncertainty deviation under the kernel representation. Inspired by the reproducing property, we initially construct a pseudo-estimator, i.e.,

\[
\Lambda^{\text{pseudo}}(\cdot;i):=\langle\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s_{i},a_{i})}[{q}^{\pi}(s_{i},a_{i})-r(s_{i},a_{i})-\gamma{q}^{\pi}(s^{\prime},{\pi})],\sqrt{C_{K}}K(\{s_{i},a_{i}\},\cdot)\rangle,
\]

and rewrite (13) as $\widehat{L}^{q}_{n}(q^{\pi},\tau^{*})=\sqrt{\|\frac{1}{n}\sum^{n}_{i=1}\Lambda(\cdot;i)\|_{RKHS}}$, where $\Lambda(\cdot;i):=\langle{q}^{\pi}(s_{i},a_{i})-r_{i}-\gamma{q}^{\pi}(s^{\prime}_{i},{\pi}),\sqrt{C_{K}}K(\{s_{i},a_{i}\},\cdot)\rangle$. It can be seen that the sequence $\{\Lambda(\cdot;i)-\Lambda^{\text{pseudo}}(\cdot;i)\}^{n}_{i=1}$ forms a martingale difference sequence in Hilbert space with respect to the filtration $\mathcal{F}_{i}:=(s_{\leq i},a_{\leq i},r_{\leq i-1},s^{\prime}_{\leq i-1})$. This motivates us to use Pinelis' inequality to quantify the uncertainty deviation, as illustrated in the following theorem.

Theorem 4.1.

Given the offline data $\mathcal{D}_{1:n}$, we suppose the kernel satisfies $\|K(\cdot,\cdot)\|_{\infty}\leq\bar{K}$; then

\[
\sqrt{\Big{\|}\frac{1}{n}\sum^{n}_{i=1}\Lambda(\cdot;i)\Big{\|}_{RKHS}}\leq\frac{\bar{V}\sqrt{\bar{K}C_{K}}}{2}\sqrt{\frac{2\ln\left(2/\delta\right)}{n(1-\gamma)}}+\frac{2\ln\left(2/\delta\right)\bar{V}\sqrt{\bar{K}C_{K}}}{3(1-\gamma)n}:=\varepsilon^{K}_{n},
\]

holds with probability at least $1-\delta$. Furthermore, Theorem 3.2 holds for any $\tau\in\Omega_{RKHS}(C_{K})$ when we take $\sigma_{n}=\varepsilon^{K}_{n}$.
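Since $\varepsilon^{K}_{n}$ depends only on user-specified constants, it can be evaluated in closed form; a minimal sketch with illustrative parameter values follows.

\begin{verbatim}
import numpy as np

def epsilon_n_K(n, delta, V_bar, K_bar, C_K, gamma):
    """Kernel uncertainty deviation epsilon_n^K from Theorem 4.1."""
    log_term = np.log(2.0 / delta)
    scale = V_bar * np.sqrt(K_bar * C_K)
    first = 0.5 * scale * np.sqrt(2.0 * log_term / (n * (1.0 - gamma)))
    second = 2.0 * log_term * scale / (3.0 * (1.0 - gamma) * n)
    return first + second

print(epsilon_n_K(n=500, delta=0.05, V_bar=10.0, K_bar=1.0, C_K=1.0, gamma=0.9))
\end{verbatim}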

Theorem 4.1 achieves a convergence rate of order $\mathcal{O}(1/\sqrt{n})$, which is as fast as the minimax rate in learning with i.i.d. data [68]. Enabled by the kernel representation, the upper bound $\varepsilon^{K}_{n}$ is independent of the complexity of $\Omega$. Setting $\sigma_{n}=\varepsilon^{K}_{n}$, the uncertainty deviation is significantly reduced compared to those in Theorem 3.3 or Corollary 3.1. Recall that we can optimize the upper and lower bounds in the neutral CI (7) to minimize the bias and obtain a possibly tightest interval. Intriguingly, the maximization problem $\sup_{q\in\mathcal{Q}}\widehat{M}^{\tau}_{n}(-q,\tau)$ within the CI also has a simple closed-form solution when $\mathcal{Q}$ belongs to a bounded RKHS [38]. This can be employed to alleviate the optimization burden and enhance stability. Now, together with the uncertainty deviation defined in Theorem 4.1, we formulate the tradeoff-free neutral CI as follows.

Corollary 4.1 (RKHS reformulation).

Under the conditions of Theorem 3.2, suppose $\Omega$ and $\mathcal{Q}$ lie in a bounded RKHS with radius $1$, and set $\sigma_{n}=\varepsilon^{K}_{n}$ with $C_{K}=1$ in Theorem 4.1; then the return $J(\pi)$ falls into the CI

\[
\left[\sup_{\tau\in\Omega}\left\{\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}-\widehat{M}^{K}_{n}(\pi)-\sigma_{n}\right\},\;\inf_{\tau\in\Omega}\left\{\frac{1}{n}\sum^{n}_{i=1}\frac{r_{i}\tau(s_{i},a_{i})}{1-\gamma}+\widehat{M}^{K}_{n}(\pi)+\sigma_{n}\right\}\right], \tag{14}
\]

with probability at least $1-\delta$. The detailed expression of $\widehat{M}^{K}_{n}(\pi)$ and the derivations are provided in Section E.11 of the appendix.

As demonstrated in Corollary 4.1, the unstable two-stage optimization problem, which involves optimizing both $\tau$ and $q$, is decoupled into a more easily solvable single-stage optimization.

Remark 4.1.

Finding global optimizers in (14) is quite challenging due to the need to solve an infinite-dimensional problem in RKHS. However, approximate solvers can be employed to replace the global optimizer in (14), which is typically difficult to find. Thanks to the robustness property of our CI with respect to the model-misspecification error, using approximate solvers only makes the interval slightly wider but does not compromise the validity of the CI. Consequently, it is feasible to utilize random feature approximation [50] or the Riesz representation theorem to transform a complex infinite-dimensional optimization problem into an easily solvable finite-dimensional optimization problem.

4.3 Tradeoff-Free Pessimistic Confidence Interval

In this section, we introduce a tradeoff-free pessimistic CI estimation. We take a further step in breaking the curse of the "bias-uncertainty" tradeoff without requiring any additional conditions, e.g., the smoothness conditions on MDPs discussed in the last section. In any general MDP, the evaluation bias and statistical uncertainty in CI estimation can be minimized simultaneously without compromising one or the other. The main idea is to utilize Lagrangian duality and the curvature of the discriminator function $\mathbb{G}$. This is another motivation for defining the discriminator function as in Definition 3.1, in addition to its role in handling distributional shifts. In the following, we characterize a unique and global optimizer related to the lower bound in (12), which builds a foundation for our subsequent methodological developments.

Proposition 4.1.

For any $s,a$, the global optimizer of the maximization problem

\[
\max_{\tau\geq 0}\frac{\tau(s,a)}{1-\gamma}\big{(}r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s^{\prime},\pi)]-q(s,a)\big{)}-\frac{\lambda^{-}\mathbb{G}(\tau(s,a))}{1-\gamma}, \tag{15}
\]

is unique. Moreover, it has the following analytical form:

\[
\tau^{\star}(s,a;q):=\left[\left(\mathbb{G}^{\prime}\right)^{-1}\left(\frac{r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s^{\prime},\pi)]-q(s,a)}{\lambda^{-}}\right)\right]^{+}, \tag{16}
\]

where $\left(\mathbb{G}^{\prime}\right)^{-1}$ is the inverse function of the derivative $\mathbb{G}^{\prime}$, which is strictly increasing due to the strict convexity of $\mathbb{G}$.
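For the quadratic discriminator $\mathbb{G}(x)=(x-1)^{2}$, the derivative inverse is $(\mathbb{G}^{\prime})^{-1}(y)=y/2+1$, so (16) admits an explicit expression; the sketch below (an illustration, replacing the conditional expectation with the sampled transition) computes it.

\begin{verbatim}
import numpy as np

def tau_star_quadratic(bellman_residual, lam_minus):
    """Closed-form optimizer (16) for the quadratic discriminator
    G(x) = (x - 1)^2, whose derivative inverse is (G')^{-1}(y) = y/2 + 1.

    bellman_residual : array of r + gamma * q(s', pi) - q(s, a), the
    transition-sample surrogate for the conditional expectation in (16)
    (an assumption of this sketch; the theorem uses the true expectation).
    """
    return np.maximum(bellman_residual / lam_minus / 2.0 + 1.0, 0.0)
\end{verbatim}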

The proof of Proposition 4.1 relies on Lagrangian duality and the strong convexity of $\mathbb{G}$. The analytical form (16) motivates us to construct an appropriate pseudo estimator to measure the uncertainty deviation, which is independent of the complexity of $\Omega$. Define the pseudo estimator $\Gamma^{\text{pseudo}}$ as

i=1nτ(si,ai;qπ)(r(si,ai)+γ𝔼s(|si,ai)[qπ(s,π)]qπ(si,ai))λ𝔾(τ(si,ai;qπ))n(1γ),\displaystyle\frac{\sum^{n}_{i=1}\tau^{\star}(s_{i},a_{i};q^{\pi})\big{(}r(s_{i},a_{i})+\gamma\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s_{i},a_{i})}[q^{\pi}(s^{\prime},\pi)]-q^{\pi}(s_{i},a_{i})\big{)}-\lambda^{-}\mathbb{G}(\tau^{\star}(s_{i},a_{i};q^{\pi}))}{n(1-\gamma)},

which can be regarded as the post-optimization objective function of (15). Correspondingly, we can define its transition sample counterpart, where the unknown Markov transition kernel \mathds{P} is replaced by a transition pair, as

Γ~=i=1nτ~(si,ai,ri,si;qπ)(ri+γqπ(si,π)qπ(si,ai))λ𝔾(τ~(si,ai,ri,si;qπ))n(1γ),\displaystyle\widetilde{\Gamma}=\frac{\sum^{n}_{i=1}\widetilde{\tau}^{\star}(s_{i},a_{i},r_{i},s^{\prime}_{i};q^{\pi})\big{(}r_{i}+\gamma q^{\pi}(s^{\prime}_{i},\pi)-q^{\pi}(s_{i},a_{i})\big{)}-\lambda^{-}\mathbb{G}(\widetilde{\tau}^{\star}(s_{i},a_{i},r_{i},s^{\prime}_{i};q^{\pi}))}{n(1-\gamma)},

where τ~(si,ai,ri,si;qπ)\widetilde{\tau}^{\star}(s_{i},a_{i},r_{i},s^{\prime}_{i};q^{\pi}) is the transition pair counterpart of τ(si,ai;qπ)\tau^{\star}(s_{i},a_{i};q^{\pi}), accordingly. By Proposition 4.1, it can be seen that the objective function L^nq(qπ,τ)λξn(𝔾,τ)-\widehat{L}^{q}_{n}(q^{\pi},\tau)-\lambda^{-}\xi_{n}(\mathbb{G},\tau) in confidence set (12) is bounded above by Γ~\widetilde{\Gamma} for any τΩ\tau\in\Omega. With this fact, we have

supτΩ{L^nq(qπ,τ)λξn(𝔾,τ)}Γ~=()Γ~Γpseudo.\displaystyle\sup_{\tau\in\Omega}\{-\widehat{L}^{q}_{n}(q^{\pi},\tau)-\lambda^{-}\xi_{n}(\mathbb{G},\tau)\}\leq\widetilde{\Gamma}\overset{\mathrm{(*)}}{=}\widetilde{\Gamma}-\Gamma^{\text{pseudo}}. (17)

The equality ()\mathrm{(*)} comes from the fact that Γpseudo=0\Gamma^{\text{pseudo}}=0 by the Bellman equation. In (17), the uniform quantification argument on the LHS is transferred to a point-wise deviation on the RHS. This indicates that it suffices to evaluate the uncertainty deviation only at the global optimizer τ\tau^{\star}, which removes the influence of the complexity of the function class Ω\Omega on the uncertainty quantification. Next, we need to quantify the deviation Γ~Γpseudo\widetilde{\Gamma}-\Gamma^{\text{pseudo}}. However, the martingale structure we utilized to establish the uncertainty deviation no longer applies to the difference sequence {Γ~iΓipseudo}i=1n\{\widetilde{\Gamma}_{i}-\Gamma^{\text{pseudo}}_{i}\}^{n}_{i=1}, due to the non-linearity of (𝔾)1(\mathbb{G}^{\prime})^{-1} and the double sampling issue [3] in Γ~i\widetilde{\Gamma}_{i}. This makes the derivation much more intricate than in the previous sections. Interestingly, by exploiting the curvature, i.e., the second-order derivative of Γipseudo\Gamma^{\text{pseudo}}_{i}, we discover that the sequence {ΓipseudoΓ~i}i=1n\{\Gamma^{\text{pseudo}}_{i}-\widetilde{\Gamma}_{i}\}^{n}_{i=1} forms a supermartingale difference sequence. We justify this observation in the following lemma.

Lemma 4.1.

For i=1,,ni=1,...,n, we define the difference Δi=ΓipseudoΓ~i\Delta_{i}=\Gamma^{\text{pseudo}}_{i}-\widetilde{\Gamma}_{i} and the filtration i:=(si,ai,ri1,si1)\mathcal{F}_{i}:=(s_{\leq i},a_{\leq i},r_{\leq i-1},s^{\prime}_{\leq i-1}), then it follows that

𝔼[Δi|i]0,\displaystyle\mathbb{E}[\Delta_{i}|\mathcal{F}_{i}]\leq 0,

where equality holds if and only if the transition kernel of the MDP, i.e., \mathds{P}, is deterministic.

Lemma 4.1 offers new evidence that the deviation of Γ~Γpseudo\widetilde{\Gamma}-\Gamma^{\text{pseudo}} is quantifiable. Although the martingale structure does not hold, we can measure the deviation using supermartingale concentration theory [14], and then quantify the statistical uncertainty in the pessimistic CI estimation based on the relationship depicted in (17).

Theorem 4.2.

For any target policy π\pi, Theorem 3.4 holds if we choose the lower confidence uncertainty deviation

σnL=\displaystyle\sigma^{L}_{n}= CV¯+λ𝔾(C)1γ2ln(4/δ)n,\displaystyle\frac{C^{\star}\bar{V}+\lambda^{-}\mathbb{G}(C^{\star})}{1-\gamma}\sqrt{\frac{2\ln(4/\delta)}{n}}, (18)

where C=[(𝔾)1(min{V¯/λ,c2})]+C^{\star}=[(\mathbb{G}^{\prime})^{-1}(\min\{\bar{V}/\lambda^{-},c_{2}\})]^{+}.

In Theorem 4.2, even without assuming any independence or weak dependence condition, our sample complexity is of order 𝒪(1/n)\mathcal{O}(1/\sqrt{n}), which aligns with the minimax rate in i.i.d. cases as discussed in [12]. A closer examination of (18) reveals that the uncertainty deviation σnL\sigma^{L}_{n} is fully independent of the complexity of Ω\Omega. It is worth noting that equation (17) is valid for any function class Ω\Omega. Consequently, it is not necessary to sacrifice the expressivity of the function class, for example, by restricting it to RKHS, in order to achieve point-wise uncertainty quantification. Following this principle, one can use a sufficiently expressive function class to minimize the model-misspecification error while maintaining efficient uncertainty quantification. The only cost is the additional bias and uncertainty introduced by the discriminator function, and, as shown above, this extra bias can be well controlled by λ\lambda^{-}. We also note that the upper uncertainty σnU\sigma^{U}_{n} can be determined following Corollary 4.14.1, as the confidence upper bound of the pessimistic CI is almost identical to the one in the neutral CI.
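As a small numerical illustration, the sketch below evaluates the deviation (18) for the quadratic discriminator \mathbb{G}(x)=\frac{1}{2}(x-1)^{2} used in Section 7, for which (\mathbb{G}^{\prime})^{-1}(y)=y+1; the constants V̄, λ⁻ and c₂ are assumed to be supplied by the user.

```python
import numpy as np

def sigma_lower(n, delta, V_bar, lam, c2, gamma):
    """Lower uncertainty deviation (18) from Theorem 4.2, specialized to the
    quadratic discriminator G(x) = 0.5*(x-1)^2, so that (G')^{-1}(y) = y + 1.
    The constants V_bar, lam (lambda^-), and c2 are assumed known."""
    G = lambda x: 0.5 * (x - 1.0) ** 2
    # C* = [(G')^{-1}(min{V_bar/lam, c2})]^+
    C_star = max(min(V_bar / lam, c2) + 1.0, 0.0)
    return (C_star * V_bar + lam * G(C_star)) / (1.0 - gamma) \
        * np.sqrt(2.0 * np.log(4.0 / delta) / n)
```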

Remark 4.2.

We note that incorporating the discriminator function facilitates numerical optimization. The maximization over τ\tau can be solved efficiently even when Ω\Omega is not a convex set. For instance, when Ω\Omega is a non-convex set that consists of monotonic functions or linear transformations of strongly convex functions, the optimization can be solved easily.

5 Optimization Algorithm

In this section, we propose a stochastic approximation-type algorithm to solve the CI estimation problem. For simplicity of exposition, we discuss the optimization of the lower bound in the pessimistic CI given by (11). The upper confidence bound in (11) and the neutral CI can be treated as a mirror image and a special case of the proposed algorithm, respectively.

In function approximation settings, Ω\Omega and 𝒬\mathcal{Q} are often represented in practice by compact parametric function classes, either linear or non-linear [61]. We denote the corresponding parameters by ψ\psi and θ\theta for Ω\Omega and 𝒬\mathcal{Q}, respectively. One can express the parametric objective function for optimization as:

\displaystyle\mathcal{L}(q_{\theta},\tau_{\psi})=\frac{\mathbb{E}_{\mu}\left[\tau_{\psi}(s,a)(r(s,a)+\gamma q_{\theta}(s^{\prime},\pi)-q_{\theta}(s,a))\right]}{1-\gamma}+q_{\theta}(s^{0},\pi)-\frac{\lambda^{-}\mathbb{E}_{\mu}\left[\mathbb{G}(\tau_{\psi}(s,a))\right]}{1-\gamma}, (19)

where 𝔼μ[]\mathbb{E}_{\mu}[\cdot] can be estimated from the observed data 𝒟1:n\mathcal{D}_{1:n}. Therefore, the optimization problem we aim to solve is maxψminθ(qθ,τψ)\max_{\psi}\min_{\theta}\mathcal{L}(q_{\theta},\tau_{\psi}), which forms a saddle-point formulation, and we denote the saddle point as (ψ,θ)(\psi^{*},\theta^{*}). We observe that the inner minimization problem is relatively easy to solve. On one hand, when 𝒬θ\mathcal{Q}_{\theta} is within a finite ball of an RKHS, the optimization yields a simple closed-form solution according to Theorem 4.1. On the other hand, as demonstrated in Theorem B.1 in the Appendix, the feature mapping class is sufficient for modeling 𝒬\mathcal{Q}. The feature mapping class simplifies the optimization, making it efficiently solvable by various algorithms as discussed in [60].
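To make the plug-in of the offline data explicit, a minimal sketch of the empirical counterpart of (19) is given below; q, tau and q_pi (which returns 𝔼 over a∼π(⋅|s) of q_θ(s,a)) are assumed to be user-supplied vectorized callables, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def empirical_objective(data, q, tau, q_pi, s0, gamma, lam, G):
    """Sample counterpart of the Lagrangian (19), a minimal sketch.

    data: dict with arrays 's', 'a', 'r', 's_next'; q(s, a) and tau(s, a)
    are vectorized callables; q_pi(states) returns E_{a~pi(.|states)}[q(states, a)]
    for the target policy; s0 is a batch containing the initial state(s).
    """
    s, a, r, s_next = data["s"], data["a"], data["r"], data["s_next"]
    bellman_resid = r + gamma * q_pi(s_next) - q(s, a)   # r + gamma*q(s',pi) - q(s,a)
    weighted_term = np.mean(tau(s, a) * bellman_resid) / (1.0 - gamma)
    init_term = np.mean(q_pi(s0))                        # q(s^0, pi)
    penalty_term = lam * np.mean(G(tau(s, a))) / (1.0 - gamma)
    return weighted_term + init_term - penalty_term
```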

In contrast, the more challenging aspect is optimizing τψ\tau_{\psi}. Due to its complex structure, it demands a sufficiently flexible non-linear function approximation class, e.g., deep neural networks, for optimization [24]. Unfortunately, concavity typically does not hold for non-linear function approximation classes, and thus the outer maximization of maxψminθ(qθ,τψ)\max_{\psi}\min_{\theta}\mathcal{L}(q_{\theta},\tau_{\psi}) is also affected. As a result, we need to develop a more efficient and convergent algorithm. From this perspective, our problem can be seen as solving a non-concave maximization problem, conditional on the solved global optimizer q¯θ:=argminθ(qθ,τψ)\bar{q}_{\theta}:=\operatorname*{arg\,min}_{\theta}\mathcal{L}(q_{\theta},\tau_{\psi}). Under this framework, we first study the gradients of the objective function with respect to ψ{\psi}. Define ¯(τψ)=(q¯θ,τψ)\bar{\mathcal{L}}(\tau_{\psi})=\mathcal{L}(\bar{q}_{\theta},\tau_{\psi}), then the gradient of ¯(τψ)\bar{\mathcal{L}}(\tau_{\psi}) with respect to ψ\psi satisfies

\displaystyle\nabla_{\psi}\bar{\mathcal{L}}(\tau_{\psi})=\frac{\mathbb{E}_{\mu}[(r(s,a)+\gamma q_{\theta}(s^{\prime},\pi)-q_{\theta}(s,a))\nabla_{\psi}\tau_{\psi}(s,a)]}{1-\gamma}-\frac{\lambda^{-}\mathbb{E}_{\mu}[\mathbb{G}^{\prime}(\tau_{\psi}(s,a))\nabla_{\psi}\tau_{\psi}(s,a)]}{1-\gamma}, (20)

where the corresponding stochastic gradient, denoted as ~ψ¯(τψ)\widetilde{\nabla}_{\psi}\bar{\mathcal{L}}(\tau_{\psi}), can be computed from the offline dataset 𝒟1:n\mathcal{D}_{1:n}. With the gradients provided in (20), we propose a stochastic approximation algorithm to update τψ\tau_{\psi}. At each iteration, we update τψ\tau_{\psi} by solving the proximal mapping [47]: Projψ(ψ,;DBerg):=argmaxψ{ψ,DBerg(ψ,ψ)}\text{Proj}_{\psi}(\psi^{*},\nabla;D_{Berg}):=\operatorname*{arg\,max}_{\psi}\{\langle\psi,\nabla\rangle-D_{Berg}(\psi^{*},\psi)\}, where ψ\psi^{*} can be viewed as the current iterate of the parameter, DBerg(,)D_{Berg}(\cdot,\cdot) denotes the Bregman divergence as discussed in [53], and \nabla represents the scaled stochastic gradient of the parameter of interest. In practice, we may consider using the Euclidean distance to reduce the computational burden. The details of the proposed algorithm are summarized in Algorithm 1. In the next section, we will demonstrate that our stochastic approximation algorithm converges at a sublinear rate even under non-linear (non-concave) settings.
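In particular, with the Euclidean choice D_Berg(ψ,ψ′)=½‖ψ′−ψ‖², the prox-mapping admits a closed form and reduces to a plain gradient-ascent step, as the following two-line sketch (with the stepsize already absorbed into the scaled gradient, as in step 7 of Algorithm 1) makes explicit.

```python
def prox_step_euclidean(psi, scaled_grad):
    """Proj_psi(psi, scaled_grad; 0.5*||.||^2)
    = argmax_{psi'} { <psi', scaled_grad> - 0.5*||psi' - psi||^2 } = psi + scaled_grad.
    The scaled gradient already contains the stepsize eta_t (Algorithm 1, step 7)."""
    return psi + scaled_grad
```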

Algorithm 1 Proximal mapping Algorithm for Off-policy CI Estimation
1:  Input offline data 𝒟1:n:={(si,ai,ri,si)}i=1n\mathcal{D}_{1:n}:=\{(s_{i},a_{i},r_{i},s^{\prime}_{i})\}^{n}_{i=1} and the initial state s0s^{0}.
2:  Initialize parameters of interest ψ0\psi^{0} and θ0\theta^{0}, initial learning rate η0\eta^{0}, hyperparameter λ\lambda^{-}, max iteration TT, Bregman divergence function DBergD_{Berg}.
3:  For t=1t=1 to t=Tt=T:
4:     Update θt\theta^{t} by solving
minθ1ni=1nτψt1(si,ai)(γqθ(si,π)qθ(si,ai))1γ+qθ(s0,π)i=1nλ𝔾(τψt1(si,ai))n(1γ).\min_{\theta}\frac{1}{n}\sum^{n}_{i=1}\frac{\tau_{\psi^{t-1}}(s_{i},a_{i})(\gamma q_{\theta}(s^{\prime}_{i},\pi)-q_{\theta}(s_{i},a_{i}))}{1-\gamma}+q_{\theta}(s^{0},\pi)-\frac{\sum^{n}_{i=1}\lambda^{-}\mathbb{G}(\tau_{\psi^{t-1}}(s_{i},a_{i}))}{n(1-\gamma)}.
5:     Decay the stepsize ηt\eta^{t} at the rate 𝒪(t1/4)\mathcal{O}(t^{-1/4}).
6:     Compute the stochastic gradient with respect to ψ\psi as ~ψ(τψ,qθt)\widetilde{\nabla}_{\psi}{\mathcal{L}}(\tau_{\psi},q_{\theta^{t}}).
7:     Update ψt\psi^{t} by solving the prox-mapping: ψt=Projψ(ψt1,ηt~ψ(τψ,qθt);DBerg)\psi^{t}=\text{Proj}_{\psi}(\psi^{t-1},\eta_{t}\widetilde{\nabla}_{\psi}{\mathcal{L}}(\tau_{\psi},q_{\theta^{t}});D_{Berg}).
8:  End for
9:  Output τ^=τψT\widehat{\tau}=\tau_{\psi^{T}} and q^=qθT\widehat{q}=q_{\theta^{T}}.
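To make the algorithm concrete, the following PyTorch sketch instantiates Algorithm 1 under the Euclidean Bregman divergence, so that step 7 reduces to a gradient-ascent step on ψ. The inner minimization in step 4 is approximated by a few Adam iterations rather than solved exactly, the θ-independent λ⁻𝔾(τ) term is dropped from the inner loss, and the networks, learning rates, and the routine q_pi (returning 𝔼 over a∼π(⋅|s) of q_θ(s,a)) are illustrative placeholders rather than the implementation used in our experiments.

```python
import torch

def algorithm_1(data, tau_net, q_net, q_pi, s0, gamma, lam, G,
                T=200, eta0=0.1, inner_steps=50, inner_lr=1e-2):
    """A minimal sketch of Algorithm 1 with the Euclidean Bregman divergence.
    data = (s, a, r, s_next) tensors; tau_net(s, a) and q_net(s, a) are networks;
    q_pi(q_net, states) should return E_{a~pi(.|states)}[q_net(states, a)]."""
    s, a, r, s_next = data
    for t in range(1, T + 1):
        # Step 4: approximately minimize over theta (a few Adam steps instead of an exact solve).
        tau_fixed = tau_net(s, a).detach()
        opt_q = torch.optim.Adam(q_net.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            resid = gamma * q_pi(q_net, s_next) - q_net(s, a)   # gamma*q(s',pi) - q(s,a)
            loss = (tau_fixed * resid).mean() / (1 - gamma) + q_pi(q_net, s0).mean()
            opt_q.zero_grad()
            loss.backward()
            opt_q.step()
        # Step 5: decay the stepsize at rate O(t^{-1/4}).
        eta_t = eta0 * t ** (-0.25)
        # Steps 6-7: stochastic gradient w.r.t. psi, then the Euclidean prox (ascent) update.
        resid = r + gamma * q_pi(q_net, s_next) - q_net(s, a)
        obj = ((tau_net(s, a) * resid).mean() - lam * G(tau_net(s, a)).mean()) / (1 - gamma)
        grads = torch.autograd.grad(obj, list(tau_net.parameters()))
        with torch.no_grad():
            for p, g in zip(tau_net.parameters(), grads):
                p.add_(eta_t * g)                               # psi^t = psi^{t-1} + eta_t * grad
    return tau_net, q_net
```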

6 Theoretical Properties

In this section, we study the theoretical properties of the proposed CI estimations. In Theorems 6.1-6.2, we investigate the finite sample performance of the proposed method. Our theory shows that the sample complexity of learning a near-tight CI is of order 𝒪(1/n)\mathcal{O}(1/\sqrt{n}). The results hold widely for function approximation classes with finite pseudo-dimension. To the best of our knowledge, Theorems 6.1-6.2 are the first finite sample bounds in CI estimation that do not explicitly require any independence or weak dependence conditions on the data collection process. In Theorems 6.3-6.4, we study the robustness of the proposed algorithm with respect to the optimization and model-misspecification errors, which makes our algorithm practical in real-world applications. Moreover, we take a big step towards solving the open problem of algorithmic convergence for non-linear function approximation in offline RL. In Theorem 6.5, we show that our algorithm converges to a stationary point at a sublinear rate even in non-concave settings. This convergence property justifies the use of non-linear function classes, e.g., deep neural networks (DNNs), without divergence concerns.

6.1 Finite Sample Bound

In this section, we present the finite-sample bound of the CI estimation. Before stating our results, we first make some mild assumptions.

Assumption 1.

For any τΩ\tau\in\Omega, suppose 0τ(s,a)C0\leq\tau(s,a)\leq C for all s,as,a, where C>0C>0 is a finite constant.

Assumption 2.

For any target policy π\pi and ν0\nu\geq 0, there exists a τΩ\tau^{*}\in\Omega satisfying that

|n{τ(si,ai)(q(si,ai)γq(si,π)ri)}\displaystyle\Big{|}\mathbb{P}_{n}\left\{\tau^{*}(s_{i},a_{i})(q(s_{i},a_{i})-\gamma q(s^{\prime}_{i},\pi)-r_{i})\right\}
𝔼(s,a)dπ,s(|s,a)[q(s,a)γq(s,π)r(s,a)]|=𝒪(c0n11+ν),\displaystyle\qquad\qquad\qquad-\mathbb{E}_{(s,a)\sim d^{\pi},s^{\prime}\sim\mathds{P}(\cdot|s,a)}\left[q(s,a)-\gamma q(s^{\prime},\pi)-r(s,a)\right]\Big{|}=\mathcal{O}\left(\frac{c_{0}}{n^{\frac{1}{1+\nu}}}\right),

for any q𝒬q\in\mathcal{Q}, where c0c_{0} is some finite and positive constant.

Assumption 1 is a standard condition on function boundedness. Assumption 2 states that there exists a “good” τΩ\tau^{*}\in\Omega, not necessarily exactly equal to the density ratio dπ/μd^{\pi}/\mu, such that τμ\tau^{*}\cdot\mu approximates dπd^{\pi} well. This is a much weaker assumption than the standard correct model specification assumption [38, 16], which requires that the true model be contained in Ω\Omega. In contrast, our assumption only requires that the discrepancy between the limiting distribution of τμ\tau^{*}\cdot\mu and the distribution dπd^{\pi} cannot be distinguished by the functional 𝔼s(|s,a)[q(s,a)γq(s,π)r(s,a)]\mathbb{E}_{s^{\prime}\sim\mathds{P}(\cdot|s,a)}[q(s,a)-\gamma q(s^{\prime},\pi)-r(s,a)], which is aligned with the principle in [7]. As a special case, the assumption holds if the limiting distribution of τμ\tau^{*}\cdot\mu converges to dπd^{\pi}. The factor ν\nu depends on the smoothness of 𝒬\mathcal{Q} [66], and ν=1\nu=1 when 𝒬\mathcal{Q} is a bounded ball in an RKHS. The factor c0c_{0} usually depends on the pseudo-dimension and radius of the specified function class. In the following, we denote the confidence upper and lower bounds optimized in (7) over τΩ\tau\in\Omega as Jn+(π)J^{+}_{n}(\pi) and Jn(π)J^{-}_{n}(\pi), respectively. We first characterize the sample-dependent bound for the neutral CI.

Theorem 6.1.

Under Assumptions 1 and 2, for any target policy π\pi, the length of the estimated neutral CI satisfies

Jn+(π)Jn(π)\displaystyle J^{+}_{n}(\pi)-J^{-}_{n}(\pi)\leq 2c0(1γ)n11+ν+CV¯n(δ,V¯,DΩ)(1γ),\displaystyle\frac{2c_{0}}{(1-\gamma)n^{\frac{1}{1+\nu}}}+\frac{C\bar{V}\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega})}{(1-\gamma)},

where

n(δ,V¯,DΩ):=(2ln2e(DΩ+1)(4eV¯2n)DΩδn+4ln2e(DΩ+1)(4eV¯2n)DΩδ3n),\displaystyle\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega}):=\left(\sqrt{\frac{2\ln\frac{2e\left(D_{\Omega}+1\right)(4e\bar{V}^{2}\sqrt{n})^{D_{\Omega}}}{\delta}}{n}}+\frac{4\ln\frac{2e\left(D_{\Omega}+1\right)(4e\bar{V}^{2}\sqrt{n})^{D_{\Omega}}}{\delta}}{3n}\right), (21)

and DΩD_{\Omega} is the pseudo-dimension of the class Ω\Omega.

The proof of Theorem 6.1 relies on empirical Freedman-type bounds with a careful analysis of the variance term. When ν1\nu\leq 1, the sample complexity of the generalization bound is of order 𝒪(1/n)\mathcal{O}(1/\sqrt{n}). This achieves the minimax optimal rate in off-policy evaluation [12]. The bound can potentially be improved for some special classes, e.g., RKHS, for which it can be independent of the function space complexity. For some feature mapping classes, the finite sample bound may vanish at the faster rate 𝒪(1/n)\mathcal{O}(1/n) if λmax(Σn)diamψ\lambda_{\max}(\Sigma_{n})\ll\text{diam}_{\psi} as discussed in Corollary 3.1. In the following, we present the result for the pessimistic CI estimation, in which Jn+(π)J^{+}_{n}(\pi) and Jn(π)J^{-}_{n}(\pi) are the optimized confidence bounds in (11) within the class Ω\Omega.
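For reference, the complexity term (21) can be evaluated directly; the short helper below computes 𝓔_n(δ, V̄, D_Ω) for a given sample size, confidence level, value bound V̄ and pseudo-dimension D_Ω (the logarithm is expanded to avoid numerical overflow for large D_Ω).

```python
import numpy as np

def complexity_term(n, delta, V_bar, D_omega):
    """Evaluates E_n(delta, V_bar, D_Omega) from (21).  The logarithm
    ln(2e(D+1)(4e*V_bar^2*sqrt(n))^D / delta) is expanded term by term
    to avoid overflow for large pseudo-dimension D_omega."""
    log_term = (np.log(2 * np.e * (D_omega + 1))
                + D_omega * np.log(4 * np.e * V_bar ** 2 * np.sqrt(n))
                - np.log(delta))
    return np.sqrt(2 * log_term / n) + 4 * log_term / (3 * n)
```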

Theorem 6.2.

Under Assumptions 1 and 2, for any target policy π\pi, if we set λ=𝒪(1/n)\lambda^{-}=\mathcal{O}(1/\sqrt{n}), then the estimated pessimistic CI satisfies

(J(π)Jn(π))(1γ)\displaystyle(J(\pi)-J^{-}_{n}(\pi))(1-\gamma)\leq 𝒪(c0n11+ν+c12n+V¯(C)2ln4δn),\displaystyle\mathcal{O}\bigg{(}\frac{c_{0}}{n^{\frac{1}{1+\nu}}}+\sqrt{\frac{c_{1}^{2}}{n}}+\bar{V}\sqrt{\frac{(C^{\star})^{2}\ln\frac{4}{\delta}}{n}}\bigg{)},
(Jn+(π)J(π))(1γ)\displaystyle(J^{+}_{n}(\pi)-J(\pi))(1-\gamma)\leq 𝒪(c0n11+ν+CV¯n(δ/2,V¯,DΩ)(1γ)),\displaystyle\mathcal{O}\bigg{(}\frac{c_{0}}{n^{\frac{1}{1+\nu}}}+\frac{C\bar{V}\mathcal{E}_{n}(\delta/2,\bar{V},D_{\Omega})}{(1-\gamma)}\bigg{)},

where C=[(𝔾)1(c2)]+C^{\star}=[(\mathbb{G}^{\prime})^{-1}(c_{2})]^{+} and n(δ,V¯,DΩ)\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega}) is defined in (21).

Theorem 6.2 shows that it is sufficient to learn a near-tight confidence upper or lower bound with 𝒪(n)\mathcal{O}(\sqrt{n}) size of samples. Note that the error between the confidence lower bound and the discounted return, i.e., J(π)J(π)J(\pi)-J^{-}(\pi), can be independent of function class complexity. To the best of our knowledge, this is the minimum estimation error bound that can be achieved.

6.2 Robustness to Optimization and Model Misspecification Errors

In this section, we consider the setting where Ω×𝒬\Omega\times\mathcal{Q} may contain neither the “good” τ\tau^{*} defined in Assumption 2 nor the true action-value function qπq^{\pi}. We measure the approximation errors for the function classes Ωψ\Omega_{\psi} and 𝒬θ\mathcal{Q}_{\theta} as follows:

ετ=minψτψτ,εq=minθqθqπ.\displaystyle\varepsilon_{\tau}=\min_{\psi}\|\tau_{\psi}-\tau^{*}\|_{\infty},\quad\varepsilon_{q}=\min_{\theta}\|q_{\theta}-q^{\pi}\|_{\infty}. (22)

In addition to the function approximation errors, we allow the existence of optimization errors. Let (q~opt,τ~opt)(\widetilde{q}^{opt},\widetilde{\tau}^{opt}) be the optimizer for the sample version of the objective function ^(qθ,τψ)\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi}) in (19), then we define the optimization errors as

^(q~opt,τ~opt)maxψ^(q~opt,τψ)\displaystyle\widehat{\mathcal{L}}(\widetilde{q}^{opt},\widetilde{\tau}^{opt})-\max_{\psi}\widehat{\mathcal{L}}(\widetilde{q}^{opt},\tau_{\psi})\geq εopt(τ),\displaystyle-\varepsilon_{opt}(\tau),
maxψminθ^(qθ,τψ)maxψ^(q~opt,τψ)\displaystyle\max_{\psi}\min_{\theta}\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi})-\max_{\psi}\widehat{\mathcal{L}}(\widetilde{q}^{opt},\tau_{\psi})\geq εopt(q),\displaystyle-\varepsilon_{opt}(q), (23)

The above equations indicate that the max-min point of maxψminθ^(qθ,τψ)\max_{\psi}\min_{\theta}\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi}) is approximated by the solution ^(q~opt,τ~opt)\widehat{\mathcal{L}}(\widetilde{q}^{opt},\widetilde{\tau}^{opt}). Similarly, we relax the min-max point of minψmaxθ^(qθ,τψ)\min_{\psi}\max_{\theta}\widehat{\mathcal{L}}(q_{\theta},\tau_{\psi}) so that it can also be attained approximately, and we use the same error upper bounds, i.e., εopt(q)\varepsilon_{opt}(q) and εopt(τ)\varepsilon_{opt}(\tau), to avoid unnecessary notation and improve readability. Under the definitions in (22) and (23), we allow the existence of model misspecification error and algorithm optimization error. In this case, we call our algorithm the robust algorithm. In the following, Theorem 6.3 justifies that the robust algorithm is capable of learning a valid and near-tight neutral CI with polynomial sample complexity.

Theorem 6.3.

Under Assumptions 1 and 2, for any target policy π\pi, the estimated neutral CI returned by the robust algorithm is valid, and its length satisfies

Jn+(π)Jn(π)11γ(2c0n11+ν+CV¯n(δ,V¯,DΩ)+εopt+εmis),\displaystyle J^{+}_{n}(\pi)-J^{-}_{n}(\pi)\leq\frac{1}{1-\gamma}\left(\frac{2c_{0}}{n^{\frac{1}{1+\nu}}}+C\bar{V}\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega})+\varepsilon_{opt}+\varepsilon_{mis}\right),

where εopt=εopt(q)+εopt(τ)\varepsilon_{opt}=\varepsilon_{opt}(q)+\varepsilon_{opt}(\tau), εmis=2V¯ετ+4Cεq\varepsilon_{mis}=2\bar{V}\varepsilon_{\tau}+4C\varepsilon_{q}, and n(δ,V¯,DΩ)\mathcal{E}_{n}(\delta,\bar{V},D_{\Omega}) is defined in (21).

Theorem 6.3 is an error-robust version of Theorem 6.1. The optimization error and approximation error are quantified within the finite-sample bound. For the optimization error, according to Theorem 6.5, Algorithm 1 has an optimization error of order 𝒪(1/T)\mathcal{O}(1/T), where TT is the maximum number of optimization iterations. The approximation error depends on the specific structure of τ\tau^{*} and qπq^{\pi}. For example, the expressivity of the feature mapping class might yield an arbitrarily small approximation error εq\varepsilon_{q}, as implied by Theorem B.1 in the appendix. Thanks to the convergence of our algorithm under non-linear approximation settings, one may use a DNN class to model τ\tau^{*} and thereby reduce ετ\varepsilon_{\tau}. Recent works show that DNNs enjoy universal approximation power [39], i.e., ετ0\varepsilon_{\tau}\approx 0. In the following, we establish an error-robust finite sample analysis of the empirical length of the pessimistic CI.

Theorem 6.4.

Under Assumptions 1 and 2, for any target policy π\pi, if we set λ=𝒪(1/n)\lambda^{-}=\mathcal{O}(1/\sqrt{n}), then the estimated pessimistic CI returned by the robust algorithm is valid, and (Jn+(π)Jn(π))(1γ)(J^{+}_{n}(\pi)-J^{-}_{n}(\pi))(1-\gamma) is bounded above by

𝒪\displaystyle\mathcal{O} (c0n11+ν+c12n+V¯{Cn(δ/2,V¯,DΩ)+(C)2ln4δn+c1ln4δnV¯}+εopt+εmis𝔾),\displaystyle\Bigg{(}\frac{c_{0}}{n^{\frac{1}{1+\nu}}}+\sqrt{\frac{c_{1}^{2}}{n}}+\bar{V}\bigg{\{}C\mathcal{E}_{n}(\delta/2,\bar{V},D_{\Omega})+\sqrt{\frac{(C^{\star})^{2}\ln\frac{4}{\delta}}{n}}+\frac{c_{1}\sqrt{\ln\frac{4}{\delta}}}{n\bar{V}}\bigg{\}}+\varepsilon_{opt}+\varepsilon^{\mathbb{G}}_{mis}\Bigg{)},

where εopt=εopt(q)+εopt(τ)\varepsilon_{opt}=\varepsilon_{opt}(q)+\varepsilon_{opt}(\tau) and εmis𝔾=(c2/n+V¯)ετ+Cεq\varepsilon^{\mathbb{G}}_{mis}=(c_{2}/\sqrt{n}+\bar{V})\varepsilon_{\tau}+C\varepsilon_{q}.

In comparison to Theorem 6.3, an additional model-misspecification error of order 𝒪(c2ετ/n)\mathcal{O}(c_{2}\varepsilon_{\tau}/\sqrt{n}), introduced by the discriminator function, is propagated to the finite-sample bound. However, the strong convexity of the discriminator 𝔾()\mathbb{G}(\cdot) facilitates the optimization, and thus our algorithm can achieve εopt(τ)=𝒪(1/T2)\varepsilon_{opt}(\tau)=\mathcal{O}(1/T^{2}), a rate faster than 𝒪(1/T)\mathcal{O}(1/T), when the class Ω\Omega is a special non-convex set consisting of monotonic functions or linear transformations of strongly convex functions. A more concrete analysis of the approximation and optimization errors depends on the specific form of the function approximators, which is beyond the scope of this paper.

6.3 Convergence Analysis

In this section, we investigate the convergence of Algorithm 1 when τψ\tau_{\psi} is approximated by non-linear functions, e.g., DNNs, so that concavity no longer holds. In Theorem 6.5, we show that our algorithm converges sublinearly to a stationary point when the stepsize is diminishing. We first make the following regularity assumptions.

Assumption 3.

For any τψΩψ\tau_{\psi}\in\Omega_{\psi}, τψ\tau_{\psi} is differentiable (not necessarily convex or concave), bounded from below, and satisfies ψτψ1(s,a)ψτψ2(s,a)L0ψ1ψ2,for anys,a\|\nabla_{\psi}\tau_{\psi_{1}}(s,a)-\nabla_{\psi}\tau_{\psi_{2}}(s,a)\|\leq L_{0}\|\psi_{1}-\psi_{2}\|,\;\text{for any}\;s,a, where L0<L_{0}<\infty is some universal Lipschitz constant and \|\cdot\| denotes the Euclidean norm.

Assumption 3 imposes the first-order smoothness condition on the specified function class. It requires that the gradient of τψ\tau_{\psi} be L0L_{0}-Lipschitz continuous, which is a standard assumption in the optimization literature [4, 18].

Assumption 4.

|qθ1(s,a)qθ2(s,a)|L0θ1θ2,for anys,a,andqθ𝒬θ|q_{\theta_{1}}(s,a)-q_{\theta_{2}}(s,a)|\leq L_{0}\|\theta_{1}-\theta_{2}\|,\;\text{for any}\;s,a,\;\text{and}\;q_{\theta}\in\mathcal{Q}_{\theta}.

Assumption 4 holds for a wide range of function approximation classes, including feature mapping space with smooth basis functions, non-linear approximation classes, DNNs with Leaky ReLU activation function, or spectral normalization on ReLU activation [23].

Assumption 5.

The gradient of the function τψ()\tau_{\psi}(\cdot) evaluated at the saddle point ψ\psi^{*} is bounded; i.e., ψτψ(s,a)<c3\|\nabla_{\psi}\tau_{\psi^{*}}(s,a)\|<c_{3} uniformly over (s,a)(s,a) for some finite and positive constant c3c_{3}.

Assumption 5 is much weaker than the bounded-variance assumption on stochastic gradients commonly made in the existing literature [52, 44]. Our relaxation is enabled by a careful analysis of the variance of the stochastic gradient, which is shown to be upper-bounded by the suboptimality of the current iterate and the stochastic gradient at the saddle point. In the following, we derive the convergence rate for Algorithm 1, which holds for non-concave function approximation classes Ωψ\Omega_{\psi}, e.g., DNNs.

Theorem 6.5.

Under Assumptions 1 and 3-5, suppose Algorithm 1 runs T1T\geq 1 rounds with stepsize ηt=min{(tT)1/42Λ/σmax2L1,1/L1}\eta^{t}=\min\{(tT)^{-1/4}\sqrt{2\Lambda/\sigma^{2}_{\max}L_{1}},1/L_{1}\} for t=1,,Tt=1,...,T and the Euclidean distance is used for the Bregman divergence. If we pick the output ψT\psi^{T^{\star}} according to the probability mass function P(T=t)=2ηt(ηt)2L1t=1T(2ηt(ηt)2L1)P(T^{\star}=t)=\frac{2\eta^{t}-(\eta^{t})^{2}L_{1}}{\sum^{T}_{t=1}(2\eta^{t}-(\eta^{t})^{2}L_{1})}, then

𝔼[ψ¯(τψT)2]2ΛL1T+2ΛL1σmax2T3/4+2ΛL1σmax2T,\displaystyle\mathbb{E}[\|\nabla_{\psi}\bar{\mathcal{L}}(\tau_{\psi^{T^{\star}}})\|^{2}]\leq\frac{2\Lambda L_{1}}{T}+\frac{\sqrt{2\Lambda L_{1}\sigma^{2}_{\max}}}{T^{3/4}}+\sqrt{\frac{2\Lambda L_{1}\sigma^{2}_{\max}}{T}}, (24)

where ¯(τψ)\bar{\mathcal{L}}(\tau_{\psi}) is defined in (20), Λ:=¯(τψ0)minψ¯(τψ)\Lambda:=\bar{\mathcal{L}}(\tau_{\psi^{0}})-\min_{\psi}\bar{\mathcal{L}}(\tau_{\psi}) measures the gap between the initial and optimal solutions, L1L_{1} is a Lipschitz constant depending on c2,c3,M,L0,V¯c_{2},c_{3},M,L_{0},\bar{V}, and the variance of the stochastic gradient is bounded above by

σmax:=maxt1:Tc4θ~(ψt)θ2+c5ψtψ2,\displaystyle\sigma_{\max}:=\max_{t\in 1:T}\sqrt{c_{4}\|\widetilde{\theta}(\psi^{t})-\theta^{*}\|^{2}+c_{5}\|\psi^{t}-\psi^{*}\|^{2}},

for some constants c4c_{4} depending on c3,L0,γc_{3},L_{0},\gamma; and c5c_{5} depending on c2,L0,V¯,λc_{2},L_{0},\bar{V},\lambda^{-} and γ\gamma. Here, θ~(ψt)\widetilde{\theta}(\psi^{t}) is the optimizer for (qθ,τψt)\mathcal{L}(q_{\theta},\tau_{\psi^{t}}).

Theorem 6.5 implies that Algorithm 1 converges sublinearly to a stationary point if σmax\sigma_{\max} is sufficiently small. The rate of convergence is also affected by the smoothness of the class Ωψ\Omega_{\psi} and the gap between the initial and optimal solutions. The rate is nearly optimal, since the third term in (24) is unimprovable for gradient descent methods on nonconvex problems, as shown in [30]. In the algorithm, we initially use decreasing stepsizes to enable early stopping. However, as the algorithm progresses and more local information about the gradient is obtained, increasing stepsizes, such as min{1/L1,𝒪(t/T)}\min\{1/L_{1},\mathcal{O}(\sqrt{t}/T)\}, may be more effective. This increasing stepsize strategy can also yield a convergence rate similar to the one obtained in (24).
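For concreteness, a small sketch of the stepsize schedule and the randomized output rule of Theorem 6.5 is given below; the constants Λ, σ_max and L₁ are problem-dependent and assumed given, and we parse the stepsize as (tT)^{-1/4}\sqrt{2\Lambda/(\sigma_{\max}^{2}L_{1})}.

```python
import numpy as np

def stepsizes_and_output_index(T, Lambda, sigma_max, L1, seed=0):
    """Stepsize schedule eta^t = min{(tT)^{-1/4}*sqrt(2*Lambda/(sigma_max^2*L1)), 1/L1}
    and output index T* drawn with P(T*=t) proportional to 2*eta^t - (eta^t)^2*L1."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    eta = np.minimum((t * T) ** (-0.25) * np.sqrt(2 * Lambda / (sigma_max ** 2 * L1)),
                     1.0 / L1)
    w = 2 * eta - eta ** 2 * L1          # positive since eta <= 1/L1
    T_star = rng.choice(t, p=w / w.sum())
    return eta, int(T_star)
```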

7 Simulation Studies

In this section, we evaluate the empirical performance of our methods using two synthetic environments: the CartPole benchmark environment from OpenAI Gym [5] and the simulation environment of [36]. In Section 7.1, we study the tightness of the estimated CIs. In Section 7.2, we investigate the policy evaluation performance based on the estimated CIs in scenarios with large distributional shifts. Additionally, we conduct a sensitivity analysis with respect to the hyperparameter λ\lambda^{-} in the pessimistic CI. We compare the proposed CI with several state-of-the-art approaches, including Bootstrapping Fitted Q-evaluation (B-FQE; [22]), Accountable Bellman Evaluation (ABE; [15]) and the stepwise importance sampling-based estimator HCOPE proposed by [64]. The true discounted return of the target policy π\pi is computed by Monte Carlo approximation [40]. For the simulation environment, the system dynamics are given by

st+1\displaystyle s^{t+1} =(0.75(2at1)000.75(12at))st+(0110)stst𝕀2×1+εt,\displaystyle=\left(\begin{array}[]{cc}0.75\left(2a^{t}-1\right)&0\\ 0&0.75\left(1-2a^{t}\right)\end{array}\right)s^{t}+\left(\begin{array}[]{cc}0&1\\ 1&0\end{array}\right)\odot s^{t}{s^{t}}^{\top}\mathbb{I}_{2\times 1}+\varepsilon^{t},
rt\displaystyle r^{t} =st+1(21)14(2at1)+(st+1st+1)32(0.250.5),\displaystyle={s^{t+1}}^{\top}\left(\begin{array}[]{l}2\\ 1\end{array}\right)-\frac{1}{4}\left(2a^{t}-1\right)+({s^{t+1}}^{\top}s^{t+1})^{\frac{3}{2}}\odot\left(\begin{array}[]{c}0.25\\ 0.5\end{array}\right),

for t0t\geq 0, where \odot denotes the Hadamard product, 𝕀\mathbb{I} is the identity matrix, the noise {εt}t0iidN(02×1,0.25𝕀2×2)\left\{\varepsilon^{t}\right\}_{t\geq 0}\stackrel{{\scriptstyle iid}}{{\sim}}N\left({0}_{2\times 1},0.25\mathbb{I}_{2\times 2}\right) and the initial state variable s0N(02×1,0.25𝕀2×2)s^{0}\sim N\left({0}_{2\times 1},0.25\mathbb{I}_{2\times 2}\right). The transition dynamic mainly follows the design in [36], but the reward function we consider here is more complex. For the basic setup of CartPole environment, we refer the readers to [5] for more details. We fix the confidence level at δ=0.05\delta=0.05, and the discount factor γ=0.95\gamma=0.95 to construct CIs. In particular, for implementing pessimistic CI estimation, we set the discriminator function as a quadratic form, i.e., 𝔾(x)=12(x1)2\mathbb{G}(x)=\frac{1}{2}(x-1)^{2}.

7.1 Empirical Tightness Analysis of Estimated Interval

In this section, we calculate the width of the estimated CIs. We consider different sample size settings. Note that the setting n=200n=200 is an extremely small sample size in offline RL problems [34].

In the CartPole environment, we follow [67] to learn a near-optimal policy using a softmax policy class [20] with a temperature parameter α=1\alpha=1 as the behavior policy. In addition, we set α=0.4\alpha=0.4 and α=0.2\alpha=0.2 for constructing two target policies. We note that the target policies are more deterministic than the behavior policy, and a smaller α\alpha implies a larger difference between the behavior policy and the target policy. In the simulation environment, we consider a completely randomized study and set {at}t0\{a^{t}\}_{t\geq 0} to be i.i.d. Bernoulli random variables with expectation 0.50.5. The target policy we consider is designed as π(a=1|s)=p\pi(a=1|s)=p if s[1]+s[2]>0s_{[1]}+s_{[2]}>0 and 1p1-p otherwise, where s[j]s_{[j]} denotes the jj-th element of the vector ss. The two target policies are generated by p=0.9p=0.9 and 0.70.7, respectively. In both environments, we use a feedforward neural network with 22 hidden layers of 256256 units each to approximate Ω\Omega in the pessimistic CI estimation, and an RKHS finite representer for function approximation in the neutral CI estimation.
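For concreteness, a minimal sketch of sampling from this target policy is given below (Python; the 1-indexed components s_[1], s_[2] of the paper correspond to indices 0 and 1 here).

```python
import numpy as np

def simulation_target_policy(s, p=0.9, rng=None):
    """Target policy of Section 7.1 for the simulation environment:
    pi(a=1|s) = p if s_[1] + s_[2] > 0, and 1 - p otherwise (p = 0.9 or 0.7)."""
    rng = rng if rng is not None else np.random.default_rng()
    prob_a1 = p if (s[0] + s[1]) > 0 else 1.0 - p   # s is the 2-dimensional state (0-indexed)
    return int(rng.random() < prob_a1)
```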

Figure 2: The estimated CIs for target policy 11 in the simulation and CartPole environments over different methods. The experiments are repeated 5050 times with random seeds. The horizontal dashed line represents the Monte Carlo true return for the target policy. The threshold parameter λ\lambda for the pessimistic CI is fixed to be 11.
Figure 3: The sensitivity analysis of the hyperparameter λ\lambda^{-} on the successful ranking proportion in the two synthetic environments with 5050 repeated experiments.

The results for target policy 11 in the simulation and CartPole environments are depicted in Figure 2, and the results for target policy 22 are provided in the appendix. We summarize our findings as follows. First, our proposed neutral and pessimistic CIs yield much tighter bounds than ABE and HCOPE, especially in the small sample size settings (e.g., n=200,500n=200,500). Although B-FQE achieves tighter intervals than ours, it is highly biased and fails to capture the Monte Carlo true reward of the target policy. The tightness of our CIs is mainly due to the fact that our framework breaks the curse of the bias and uncertainty tradeoff. This allows us to choose an expressive neural network or finite RKHS function class to model Ω\Omega and reduce model-misspecification bias without worrying about an increase in uncertainty deviation. Another reason for the performance gain may be that our proposed method does not impose any i.i.d. or weak-dependence assumptions on the data, which is not the case for the competing methods. When applied directly to interdependent data, the competing methods often incur larger bias and an increased interval width [13]. Lastly, our proposed method inherently quantifies the model-misspecification error in the CI and is thus robust to function approximation error. This distinguishes our method from the competing methods, which fail to quantify the model-misspecification error and suffer from large bias when the true model is not correctly specified.

Comparing the neutral and pessimistic CIs, we observe that the pessimistic CI is generally wider than the neutral one in the simulation environment. This might be because modeling Ω\Omega in an RKHS already achieves sufficient expressivity to capture the true model in this smooth MDP. Therefore, the pessimistic CI offers no additional advantage in reducing model-misspecification error and instead inherits additional bias from the discriminator function. In contrast, the MDP smoothness condition does not hold in the CartPole environment, and the neutral CI incorporates more bias. As a result, the neutral CI is almost as tight as the estimated pessimistic CI.

7.2 Robustness to Distributional Shifts

In this section, we aim to evaluate the performance of the proposed CIs in handling distributional shifts. In the context of offline reinforcement learning, a common safe optimization criterion is to maximize the worst-case performance among a set of statistically plausible models [31, 26]. This pessimism principle is closely related to robust MDPs which can handle distributional shift [71, 26]. Our proposed framework naturally provides a lower bound to implement the pessimism principle. To evaluate the proposed method, we conduct experiments with a set of target policies that have varying degrees of distributional shift. We intentionally design these target policies to have almost the same Monte Carlo true rewards. Therefore, users will select the policies close to the behavior policy, i.e., those with a small distributional shift. We use the estimated confidence lower bound to rank the target policies, and the criterion for good performance is to correctly rank the policies according to their degree of distributional shift.

In the distributional shift experiments, we again use the soft actor-critic algorithm [20] to construct the softmax policies. In contrast to the constructions in the last section, here we learn a near-optimal policy with a very small temperature parameter α=0.05\alpha=0.05 as the behavior policy in the CartPole environment. This means that the behavior policy is less exploratory and has a higher probability of causing large distributional shifts in the policy evaluation step. To evaluate the proposed method, we generate three different near-optimal policies as the target policies, denoted as π1,π2,π3\pi^{\text{1}},\pi^{\text{2}},\pi^{\text{3}}. The temperature parameter used for learning the three policies increases from α=0.2\alpha=0.2 for π1\pi^{\text{1}} to α=0.5\alpha=0.5 for π2\pi^{\text{2}} and then α=0.8\alpha=0.8 for π3\pi^{\text{3}}, indicating that π1\pi^{\text{1}} has the smallest distributional shift while π3\pi^{\text{3}} has the largest. We carefully control the true rewards of the three policies to lie within a very small gap of 1, so the correct ranking of the three target policies should be π1\pi^{\text{1}}, then π2\pi^{\text{2}}, and then π3\pi^{\text{3}}. In the simulation environment, we apply the same design principle. Specifically, we generate the target policy π1\pi^{\text{1}} by adding a small random noise to the behavior policy, resulting in the smallest distributional shift. The remaining two target policies are designed to be more exploratory than π1\pi^{\text{1}}.

Neutral CI Pessimistic CI B-FQE ABE HCOPE
Simulation 71/100 92/100 64/100 53/100 37/100
CartPole 78/100 95/100 71/100 67/100 38/100
Table 1: The proportion of successful policy rankings by the estimated confidence lower bound over different methods in 100100 repeated experiments. For each experiment, a success is counted only when the ranking of the three target policies is exactly correct; otherwise, it is counted as a failure.

The results in Table 1 demonstrate that the pessimistic CI achieves the best performance among all approaches. The discriminator in the pessimistic approach helps mitigate the estimation bias caused by distributional shift. Among the remaining approaches, we find that the performance of the proposed neutral CI is comparable to that of B-FQE. The decent performance of the neutral CI is due to the relatively accurate estimation of the confidence lower bound, while the performance of B-FQE mainly relies on its distributional consistency guarantee, which provides certain robustness to distributional shift [22].

We also provide a sensitivity analysis of the hyperparameter λ\lambda^{-} in Figure 3. The values of λ\lambda^{-} vary over a wide range [0.05,5][0.05,5], and the corresponding success percentages of the ranking task are reported. In general, the successful ranking percentage is not very sensitive to the choice of λ\lambda^{-}, and the best results mainly concentrate in the range [0.75,2][0.75,2]. A λ\lambda^{-} that is too small leads to insufficient distributional shift adjustment, whereas a λ\lambda^{-} that is too large results in overly pessimistic reasoning.

8 OhioT1DM Case Study

We apply the proposed methods to the Ohio type 1 diabetes (OhioT1DM) mobile health study [43], which contains six patients with type 1 diabetes, each with eight weeks of life-event data including health status measurements and insulin injection dosage. As each individual patient has distinctive glucose dynamics, we follow [75] in regarding each patient’s data as an independent dataset, where the data from each day form a single trajectory. We summarize the collected measurements over 6060-min intervals, so that the maximum length of the horizon is 2424 and the total number of transition pairs in each patient’s dataset is around n=360n=360 after removing missing samples and outliers. The state variable sts^{t} is set to be a three-dimensional vector including the average blood glucose level s[1]ts^{t}_{[1]}, the average heart rate s[2]ts^{t}_{[2]} and the total carbohydrate intake s[3]ts^{t}_{[3]} during the time interval [t1,t][t-1,t]. Here, the reward is defined as the average of the index of glycemic control [55] between time t1t-1 and tt, measuring the health status of the patient’s glucose level. That is, rt=𝕀(s[1]t>140)|s[1]t140|1.10+𝕀(s[1]t<80)(s[1]t80)230,{r^{t}}=-\frac{\mathbb{I}({s^{t}_{[1]}}>140)|{s^{t}_{[1]}}-140|^{1.10}+\mathbb{I}({s^{t}_{[1]}}<80)({s^{t}_{[1]}}-80)^{2}}{30}, which implies that the reward rtr^{t} is non-positive and a larger value is preferred. We discretize the insulin dose level into 55 levels, forming the action space 𝒜={0,1,,4}\mathcal{A}=\{0,1,...,4\}, and we set γ=0.9,δ=0.05\gamma=0.9,\delta=0.05 for estimating the CIs.
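Since the reward above is a deterministic function of the average blood glucose level, it can be transcribed directly; the short Python sketch below does so (the thresholds 140 and 80 and the exponent 1.10 follow the formula above).

```python
def glycemic_reward(glucose):
    """Index-of-glycemic-control reward used in Section 8:
    r = -( I(g>140)*|g-140|^{1.10} + I(g<80)*(g-80)^2 ) / 30,
    where g is the average blood glucose level over the interval."""
    penalty = 0.0
    if glucose > 140:
        penalty = abs(glucose - 140) ** 1.10
    elif glucose < 80:
        penalty = (glucose - 80) ** 2
    return -penalty / 30.0
```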

We consider three different target policies for evaluation, with the design settings described in Figure 4. The true value of π3\pi^{\text{3}} should be lower than that of the other two policies and also lower than that of the behavior policy followed by the clinicians. Policies π1\pi^{\text{1}} and π2\pi^{\text{2}} are both near-optimal and should achieve similar performance in terms of mean value [32]. However, as the conservative Q-learning algorithm is risk-sensitive to distributional shift, one can expect policy π1\pi^{\text{1}} to have a higher lower bound in the estimated CI.

Figure 4: The estimated CIs for the difference between the three target policies and the behavior policy over 66 patients in the OhioT1DM dataset. Policy 1 is learned by a distributional-shift-robust policy optimization algorithm, conservative Q-learning [32]; policy 2 is learned by the soft actor-critic algorithm [20]; and policy 3 is set to be a randomized policy.

In Figure 4, we plot the estimated CIs obtained by our method and the ABE method. We summarize our findings as follows. In terms of the validity of the proposed CI estimation, both the proposed neutral and pessimistic CIs overlap with, and are contained in, the CIs estimated by ABE. In addition, our CIs for the policies π1\pi^{\text{1}} and π2\pi^{\text{2}} are above 0 and the estimated intervals for π3\pi^{\text{3}} are below 0. This is consistent with the fact that policies π1\pi^{\text{1}} and π2\pi^{\text{2}} are designed as near-optimal policies, which should be better than the behavior policy. In contrast, policy π3\pi^{3} is a randomized policy and thus should be worse than the policy induced by the clinicians. The above findings confirm that our estimated CIs are valid. In terms of tightness, our CIs are narrower than the ones estimated by ABE. This confirms that the proposed method resolves the tradeoff between bias and uncertainty. In addition, ABE requires i.i.d. sampled offline data, and thus struggles in this complicated real environment, leading to wide estimated intervals. In terms of robustness to distributional shift, the estimated pessimistic CI identifies a consistently higher confidence lower bound for the distributional-shift-robust policy π1\pi^{\text{1}} over all 66 patients. This implies that the pessimistic CI can efficiently evaluate the target policy by taking distributional shift information into account. In conclusion, our analysis suggests that applying the proposed method could potentially improve some patients’ health status efficiently and reliably, especially in environments with large distributional shifts.

9 Discussion

This paper investigates the off-policy confidence interval estimation problem in behavior-agnostic settings under minimal data-collection assumptions. The proposed framework incorporates distributional shift information, which makes the estimated confidence interval robust when learning from pre-collected offline datasets. We propose a unified error analysis and point out the issue of the tradeoff between evaluation bias and statistical uncertainty. A tradeoff-free estimation approach is then developed to resolve this issue, which leads to possibly the tightest confidence interval estimations. Several improvements and extensions are worth exploring in the future. First, we may extend the proposed framework to environments with unobservable confounders. In addition, our numerical studies have demonstrated the potential of the proposed method in ranking candidate target policies, which could further support the design of safe policy improvement or policy optimization algorithms based on the current framework. This could be an interesting direction to explore in the future.

References

  • [1] Agarwal, A., Jiang, N., Kakade, S. M. and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep., 4–10.
  • [2] Bach, F. (2017). Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research 18, 629–681.
  • [3] Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, 30–37. Elsevier.
  • [4] Bernstein, J., Wang, Y.-X., Azizzadenesheli, K. and Anandkumar, A. (2018). signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, 560–569. PMLR.
  • [5] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [6] Challen, R., Denny, J., Pitt, M., Gompels, L., Edwards, T. and Tsaneva-Atanasova, K. (2019). Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28, 231–237.
  • [7] Chen, J. and Jiang, N. (2022). Offline reinforcement learning under value and density-ratio realizability: the power of gaps. In Uncertainty in Artificial Intelligence, 378–388. PMLR.
  • [8] Chen, X. and Qi, Z. (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. In International Conference on Machine Learning, 3558–3582. PMLR.
  • [9] Choi, E. and Lee, C. (2003). Feature extraction based on Bhattacharyya distance. Pattern Recognition 36, 1703–1709.
  • [10] Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C. and Schuurmans, D. (2020). CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems 33, 9398–9411.
  • [11] Dai, B., Shaw, A., He, N., Li, L. and Song, L. (2017). Boosting the actor with dual critic. arXiv preprint arXiv:1712.10282.
  • [12] Duan, Y., Jia, Z. and Wang, M. (2020). Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, 2701–2709. PMLR.
  • [13] Duchi, J. C., Glynn, P. W. and Namkoong, H. (2021). Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research 46, 946–969.
  • [14] Fan, X., Grama, I. and Liu, Q. (2015). Exponential inequalities for martingales with applications. Electronic Journal of Probability 20, 1–22.
  • [15] Feng, Y., Li, L. and Liu, Q. (2019). A kernel loss for solving the Bellman equation. arXiv preprint arXiv:1905.10506.
  • [16] Feng, Y., Ren, T., Tang, Z. and Liu, Q. (2020). Accountable off-policy evaluation with kernel Bellman statistics. In International Conference on Machine Learning, 3102–3111. PMLR.
  • [17] Feng, Y., Tang, Z., Zhang, N. and Liu, Q. (2021). Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. arXiv preprint arXiv:2103.05741.
  • [18] Ghadimi, S. and Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23, 2341–2368.
  • [19] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research 13, 723–773.
  • [20] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P. et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
  • [21] Hanna, J., Stone, P. and Niekum, S. (2017). Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence 31.
  • [22] Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvári, C. and Wang, M. (2021). Bootstrapping fitted q-evaluation for off-policy inference. In International Conference on Machine Learning, 4074–4084. PMLR.
  • [23] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 876–885. Association for Uncertainty in Artificial Intelligence (AUAI).
  • [24] Jiang, N. and Huang, J. (2020). Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems 33, 2747–2758.
  • [25] Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 652–661. PMLR.
  • [26] Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, 5084–5096. PMLR.
  • [27] Kallus, N., Mao, X., Wang, K. and Zhou, Z. (2022). Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, 10598–10632. PMLR.
  • [28] Kallus, N. and Uehara, M. (2019). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. Advances in Neural Information Processing Systems 32.
  • [29] Kallus, N. and Uehara, M. (2020). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. The Journal of Machine Learning Research 21, 6742–6804.
  • [30] Kornowski, G. and Shamir, O. (2022). Oracle complexity in nonsmooth nonconvex optimization. Journal of Machine Learning Research 23, 1–44.
  • [31] Kumar, A., Fu, J., Soh, M., Tucker, G. and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems 32.
  • [32] Kumar, A., Zhou, A., Tucker, G. and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, 1179–1191.
  • [33] Le, H., Voloshin, C. and Yue, Y. (2019). Batch policy learning under constraints. In International Conference on Machine Learning, 3703–3712. PMLR.
  • [34] Levine, S., Kumar, A., Tucker, G. and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • [35] Li, M., Shi, C., Wu, Z. and Fryzlewicz, P. (2022). Reinforcement learning in possibly nonstationary environments. arXiv preprint arXiv:2203.01707.
  • [36] Liao, P., Klasnja, P. and Murphy, S. (2020). Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, 1–10.
  • [37] Liao, P., Qi, Z., Wan, R., Klasnja, P. and Murphy, S. A. (2022). Batch policy learning in average reward Markov decision processes. The Annals of Statistics 50, 3364–3387.
  • [38] Liu, Q., Li, L., Tang, Z. and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems 31.
  • [39] Lu, Y. and Lu, J. (2020). A universal approximation theorem of deep neural networks for expressing probability distributions. Advances in Neural Information Processing Systems 33, 3094–3105.
  • [40] Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E. and Kosorok, M. R. (2020). Estimating dynamic treatment regimes in mobile health using V-learning. Journal of the American Statistical Association 115, 692–706.
  • [41] Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., Gemp, I. and Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv preprint arXiv:1405.6757.
  • [42] Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E. and Popovic, Z. (2014). Offline policy evaluation across representations with applications to educational games. In AAMAS, vol. 1077.
  • [43] Marling, C. and Bunescu, R. (2020). The OhioT1DM dataset for blood glucose level prediction: Update 2020. KHD@IJCAI.
  • [44] Mokhtari, A., Ozdaglar, A. and Pattathil, S. (2020). A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, 1497–1507. PMLR.
  • [45] {barticle}[author] \bauthor\bsnmMurphy, \bfnmSusan A\binitsS. A., \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J., \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. and \bauthor\bsnmGroup, \bfnmConduct Problems Prevention Research\binitsC. P. P. R. (\byear2001). \btitleMarginal mean models for dynamic regimes. \bjournalJournal of the American Statistical Association \bvolume96 \bpages1410–1423. \endbibitem
  • [46] {barticle}[author] \bauthor\bsnmNachum, \bfnmOfir\binitsO. and \bauthor\bsnmDai, \bfnmBo\binitsB. (\byear2021). \btitleReinforcement learning via fenchel-rockafellar duality. \bjournalarXiv preprint arXiv:2001.01866. \endbibitem
  • [47] {barticle}[author] \bauthor\bsnmParikh, \bfnmNeal\binitsN., \bauthor\bsnmBoyd, \bfnmStephen\binitsS. \betalet al. (\byear2014). \btitleProximal algorithms. \bjournalFoundations and trends® in Optimization \bvolume1 \bpages127–239. \endbibitem
  • [48] {barticle}[author] \bauthor\bsnmPinelis, \bfnmIosif\binitsI. (\byear2012). \btitleOptimum bounds for the distributions of martingales in Banach spaces. \bjournalarXiv preprint arXiv:1208.2200. \endbibitem
  • [49] {barticle}[author] \bauthor\bsnmPrecup, \bfnmDoina\binitsD. (\byear2000). \btitleEligibility traces for off-policy policy evaluation. \bjournalComputer Science Department Faculty Publication Series \bpages80. \endbibitem
  • [50] {barticle}[author] \bauthor\bsnmRahimi, \bfnmAli\binitsA. and \bauthor\bsnmRecht, \bfnmBenjamin\binitsB. (\byear2007). \btitleRandom features for large-scale kernel machines. \bjournalAdvances in Neural Information Processing Systems \bvolume20. \endbibitem
  • [51] {barticle}[author] \bauthor\bsnmRamprasad, \bfnmPratik\binitsP., \bauthor\bsnmLi, \bfnmYuantong\binitsY., \bauthor\bsnmYang, \bfnmZhuoran\binitsZ., \bauthor\bsnmWang, \bfnmZhaoran\binitsZ., \bauthor\bsnmSun, \bfnmWill Wei\binitsW. W. and \bauthor\bsnmCheng, \bfnmGuang\binitsG. (\byear2022). \btitleOnline bootstrap inference for policy evaluation in reinforcement learning. \bjournalJournal of the American Statistical Association \bpages1–14. \endbibitem
  • [52] {binproceedings}[author] \bauthor\bsnmReddi, \bfnmSashank J\binitsS. J., \bauthor\bsnmHefny, \bfnmAhmed\binitsA., \bauthor\bsnmSra, \bfnmSuvrit\binitsS., \bauthor\bsnmPoczos, \bfnmBarnabas\binitsB. and \bauthor\bsnmSmola, \bfnmAlex\binitsA. (\byear2016). \btitleStochastic variance reduction for nonconvex optimization. In \bbooktitleInternational conference on machine learning \bpages314–323. \bpublisherPMLR. \endbibitem
  • [53] {barticle}[author] \bauthor\bsnmReem, \bfnmDaniel\binitsD., \bauthor\bsnmReich, \bfnmSimeon\binitsS. and \bauthor\bsnmDe Pierro, \bfnmAlvaro\binitsA. (\byear2019). \btitleRe-examination of Bregman functions and new properties of their divergences. \bjournalOptimization \bvolume68 \bpages279–348. \endbibitem
  • [54] {binproceedings}[author] \bauthor\bsnmRényi, \bfnmAlfréd\binitsA. (\byear1961). \btitleOn measures of entropy and information. In \bbooktitleProceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics \bvolume4 \bpages547–562. \bpublisherUniversity of California Press. \endbibitem
  • [55] {barticle}[author] \bauthor\bsnmRodbard, \bfnmDavid\binitsD. (\byear2009). \btitleInterpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control. \bjournalDiabetes Technology & Therapeutics \bvolume11 \bpagesS–55. \endbibitem
  • [56] {binproceedings}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmHuang, \bfnmJiawei\binitsJ. and \bauthor\bsnmJiang, \bfnmNan\binitsN. (\byear2022). \btitleA minimax learning approach to off-policy evaluation in confounded partially observable markov decision processes. In \bbooktitleInternational Conference on Machine Learning \bpages20057–20094. \bpublisherPMLR. \endbibitem
  • [57] {binproceedings}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmWan, \bfnmRunzhe\binitsR., \bauthor\bsnmChernozhukov, \bfnmVictor\binitsV. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2021). \btitleDeeply-debiased off-policy interval estimation. In \bbooktitleInternational Conference on Machine Learning \bpages9580–9591. \bpublisherPMLR. \endbibitem
  • [58] {barticle}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmZhang, \bfnmSheng\binitsS., \bauthor\bsnmLu, \bfnmWenbin\binitsW. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2020). \btitleStatistical inference of the value function for reinforcement learning in infinite horizon settings. \bjournalarXiv preprint arXiv:2001.04515. \endbibitem
  • [59] {barticle}[author] \bauthor\bsnmShi, \bfnmChengchun\binitsC., \bauthor\bsnmZhu, \bfnmJin\binitsJ., \bauthor\bsnmYe, \bfnmShen\binitsS., \bauthor\bsnmLuo, \bfnmShikai\binitsS., \bauthor\bsnmZhu, \bfnmHongtu\binitsH. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2022). \btitleOff-policy confidence interval estimation with confounded markov decision process. \bjournalJournal of the American Statistical Association \bpages1–12. \endbibitem
  • [60] {barticle}[author] \bauthor\bsnmSriperumbudur, \bfnmBharath K\binitsB. K., \bauthor\bsnmFukumizu, \bfnmKenji\binitsK. and \bauthor\bsnmLanckriet, \bfnmGert RG\binitsG. R. (\byear2011). \btitleUniversality, Characteristic Kernels and RKHS Embedding of Measures. \bjournalJournal of Machine Learning Research \bvolume12. \endbibitem
  • [61] {bbook}[author] \bauthor\bsnmSutton, \bfnmRichard S\binitsR. S. and \bauthor\bsnmBarto, \bfnmAndrew G\binitsA. G. (\byear2018). \btitleReinforcement learning: An introduction. \bpublisherMIT press. \endbibitem
  • [62] {barticle}[author] \bauthor\bsnmTheocharous, \bfnmGeorgios\binitsG., \bauthor\bsnmChandak, \bfnmYash\binitsY., \bauthor\bsnmThomas, \bfnmPhilip S\binitsP. S. and \bauthor\bparticlede \bsnmNijs, \bfnmFrits\binitsF. (\byear2020). \btitleReinforcement learning for strategic recommendations. \bjournalarXiv preprint arXiv:2009.07346. \endbibitem
  • [63] {binproceedings}[author] \bauthor\bsnmThomas, \bfnmPhilip\binitsP. and \bauthor\bsnmBrunskill, \bfnmEmma\binitsE. (\byear2016). \btitleData-efficient off-policy policy evaluation for reinforcement learning. In \bbooktitleInternational Conference on Machine Learning \bpages2139–2148. \bpublisherPMLR. \endbibitem
  • [64] {binproceedings}[author] \bauthor\bsnmThomas, \bfnmPhilip\binitsP., \bauthor\bsnmTheocharous, \bfnmGeorgios\binitsG. and \bauthor\bsnmGhavamzadeh, \bfnmMohammad\binitsM. (\byear2015). \btitleHigh-confidence off-policy evaluation. In \bbooktitleProceedings of the AAAI Conference on Artificial Intelligence \bvolume29. \endbibitem
  • [65] {barticle}[author] \bauthor\bsnmTropp, \bfnmJoel A\binitsJ. A. (\byear2011). \btitleFreedman’s Inequality for Matrix Martingales. \bjournalElectronic Communications in Probability \bvolume16 \bpages262–270. \endbibitem
  • [66] {barticle}[author] \bauthor\bsnmTsybakov, \bfnmAlexandre B\binitsA. B. (\byear2004). \btitleIntroduction to nonparametric estimation, 2009. \bjournalSpringer Series in Statistics \bvolume9. \endbibitem
  • [67] {binproceedings}[author] \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmHuang, \bfnmJiawei\binitsJ. and \bauthor\bsnmJiang, \bfnmNan\binitsN. (\byear2020). \btitleMinimax weight and q-function learning for off-policy evaluation. In \bbooktitleInternational Conference on Machine Learning \bpages9659–9668. \bpublisherPMLR. \endbibitem
  • [68] {barticle}[author] \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmImaizumi, \bfnmMasaaki\binitsM., \bauthor\bsnmJiang, \bfnmNan\binitsN., \bauthor\bsnmKallus, \bfnmNathan\binitsN., \bauthor\bsnmSun, \bfnmWen\binitsW. and \bauthor\bsnmXie, \bfnmTengyang\binitsT. (\byear2022). \btitleFinite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. \bjournalarXiv preprint arXiv:2102.02981. \endbibitem
  • [69] {barticle}[author] \bauthor\bsnmUehara, \bfnmMasatoshi\binitsM., \bauthor\bsnmShi, \bfnmChengchun\binitsC. and \bauthor\bsnmKallus, \bfnmNathan\binitsN. (\byear2022). \btitleA Review of Off-Policy Evaluation in Reinforcement Learning. \bjournalarXiv preprint arXiv:2212.06355. \endbibitem
  • [70] {barticle}[author] \bauthor\bsnmWang, \bfnmRuosong\binitsR., \bauthor\bsnmFoster, \bfnmDean P\binitsD. P. and \bauthor\bsnmKakade, \bfnmSham M\binitsS. M. (\byear2020). \btitleWhat are the statistical limits of offline RL with linear function approximation? \bjournalarXiv preprint arXiv:2010.11895. \endbibitem
  • [71] {barticle}[author] \bauthor\bsnmXie, \bfnmTengyang\binitsT., \bauthor\bsnmCheng, \bfnmChing-An\binitsC.-A., \bauthor\bsnmJiang, \bfnmNan\binitsN., \bauthor\bsnmMineiro, \bfnmPaul\binitsP. and \bauthor\bsnmAgarwal, \bfnmAlekh\binitsA. (\byear2021). \btitleBellman-consistent pessimism for offline reinforcement learning. \bjournalAdvances in neural information processing systems \bvolume34. \endbibitem
  • [72] {barticle}[author] \bauthor\bsnmXie, \bfnmTengyang\binitsT., \bauthor\bsnmMa, \bfnmYifei\binitsY. and \bauthor\bsnmWang, \bfnmYu-Xiang\binitsY.-X. (\byear2019). \btitleTowards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. \bjournalAdvances in Neural Information Processing Systems \bvolume32. \endbibitem
  • [73] {barticle}[author] \bauthor\bsnmZhang, \bfnmRuiyi\binitsR., \bauthor\bsnmDai, \bfnmBo\binitsB., \bauthor\bsnmLi, \bfnmLihong\binitsL. and \bauthor\bsnmSchuurmans, \bfnmDale\binitsD. (\byear2020). \btitleGendice: Generalized offline estimation of stationary values. \bjournalarXiv preprint arXiv:2002.09072. \endbibitem
  • [74] {barticle}[author] \bauthor\bsnmZhou, \bfnmWenzhuo\binitsW., \bauthor\bsnmZhu, \bfnmRuoqing\binitsR. and \bauthor\bsnmQu, \bfnmAnnie\binitsA. (\byear2022). \btitleEstimating optimal infinite horizon dynamic treatment regimes via pt-learning. \bjournalJournal of the American Statistical Association \bpages1–14. \endbibitem
  • [75] {binproceedings}[author] \bauthor\bsnmZhu, \bfnmLiangyu\binitsL., \bauthor\bsnmLu, \bfnmWenbin\binitsW. and \bauthor\bsnmSong, \bfnmRui\binitsR. (\byear2020). \btitleCausal Effect Estimation and Optimal Dose Suggestions in Mobile Health. In \bbooktitleInternational Conference on Machine Learning \bpages11588–11598. \bpublisherPMLR. \endbibitem

SUPPLEMENTARY MATERIALS FOR
“DISTRIBUTIONAL SHIFT-AWARE OFF-POLICY INTERVAL ESTIMATION:
A UNIFIED ERROR QUANTIFICATION FRAMEWORK”

Wenzhuo Zhou, Yuhan Li, Ruoqing Zhu, and Annie Qu


A  Interval Validity and Tightness for Bias Quantification
B  Expressivity and Role of Q-function Class
C  Probabilistic Tools in Statistical Learning
D  Details of Numerical Experiments and Additional Results
E  Technical Proofs and Additional Theoretical Results
  E.1  Proof of Lemma 3.1
  E.2  Proof of Theorem A.1
  E.3  Proof of Corollary A.1
  E.4  Proof of Lemma E.1
  E.5  Proof of Theorem 3.1
  E.6  Proof of Theorem B.1
  E.7  Proof of Lemma 3.2
  E.8  Proof of Theorem 3.2
  E.9  Proof of Theorem 3.3
  E.10  Proof of Corollary 3.1
  E.11  Proof of Corollary 4.1
  E.12  Proof of Theorem 4.1
  E.13  Proof of Theorem 3.4
  E.14  Proof of Proposition 4.1
  E.15  Proof of Lemma 4.1
  E.16  Proof of Theorem 4.2
  E.17  Proof of Theorem 6.1
  E.18  Proof of Theorem 6.2
  E.19  Proof of Lemma E.2
  E.20  Proof of Theorem 6.3
    E.20.1  Proof of the validity of the neutral CI
    E.20.2  Proof of the error-robust finite-sample bound for the neutral CI
  E.21  Proof of Theorem 6.4
    E.21.1  Proof of the validity of the pessimistic CI
    E.21.2  Proof of the error-robust finite-sample bound for the pessimistic CI
  E.22  Proof of Theorem E.3
  E.23  Proof of Theorem E.4
  E.24  Proof of Theorem 6.5
References