Projected State-action Balancing Weights for Offline Reinforcement Learning
George Washington University
Abstract
Off-policy evaluation (OPE) is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semi-parametrically efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points per trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
Keywords: Infinite horizons; Markov decision process; Policy evaluation; Reinforcement learning
1 Introduction
In reinforcement learning (RL), off-policy evaluation (OPE) refers to the problem of estimating some notion of rewards (e.g., the value defined in (2)) of a target policy based on historical data collected from a potentially different policy. There is a recent surge of interest in OPE among the statistics and machine learning communities. On the one hand, OPE serves as the foundation for many RL methods. On the other hand, OPE is of great importance in some high-stakes domains where deploying a new policy can be very costly or risky, such as in medical applications.
This work is partly motivated by mobile health (mHealth) studies. Due to the rapid development of mobile devices and sensing technology, mHealth studies have recently emerged as a promising way to promote the healthy behaviors of patients [38]. Mobile devices are used to monitor their health conditions in real-time and deliver just-in-time interventions to individuals [50]. With access to a rich amount of longitudinal data pre-collected by wearable devices, researchers are often interested in evaluating a policy (intervention), potentially different from the one behind the data collection process. For example, the OhioT1DM dataset [43] was collected to improve the health and wellbeing of people with type 1 diabetes. The data consist of 8 weeks’ health information of six subjects, based on some unknown policy. For each subject, real-time information such as insulin doses, glucose levels, and self-reported times of meals and exercises was collected. It is often of great interest to evaluate the efficacy of insulin pump therapy. One notable challenge of mHealth data is a limited number of subjects, combined with a usually large number of decision points for each subject. For instance, the OhioT1DM dataset has six subjects with a few thousand decision points per subject. In statistics, there is a rich literature on estimating the treatment effect or optimal policy from complex longitudinal data [57, 56, 48, 47, 35, 77, 72, 59, 34]. However, these methods are mainly designed for studies with very few decision points and often require the number of subjects to grow in order to achieve accurate estimation.
To address the above challenge in mHealth studies, we adopt the framework of the Markov decision process (MDP) [55] and consider an infinite-horizon time-homogeneous setting. This framework is particularly suitable for mHealth studies, and its efficacy has been well demonstrated in the recent literature [42, 39, 62, 40, 24].
In this paper, we focus on developing a new model-free approach for OPE. The existing model-free OPE methods under the infinite-horizon discounted setting can be mainly divided into three categories. The methods in the first category [e.g., 3, 63, 36, 42, 62] directly estimate the state(-action) value function (see Section 2.1), based on the Bellman equation (see (6)). The second category is motivated by importance weights or the so-called marginal importance sampling [e.g., 41, 49, 68]. These approaches utilize the probability ratio function (see (8)) to adjust for the distributional mismatch due to the difference between the target policy and the behavior policy (i.e., the one in the data collection process). The last category of OPE methods combines methods in the first category and the second category to obtain the so-called “double robust” estimators [e.g., 30, 64, 60]. Apart from the model-free methods, we note in passing that there exists a rich class of literature on model-based methods which mainly rely on directly modeling the dynamics (i.e., the transition mechanism and the reward function) [e.g. 53, 11, 26].
This paper focuses on the second category of model-free OPE methods, which is based on the probability ratio function. These methods do not depend on the actual form of the reward function and thus can be used flexibly to evaluate the target policy on different reward functions, which is appealing in practice. Some core theoretical questions associated with the use of the (estimated) probability ratio function directly relate to the fundamental challenges of offline RL [37], and therefore have recently attracted much interest from the machine learning community [41, 49, 76, 68, 69]. Last but not least, the probability ratio function and related estimators can improve the accuracy and stability of offline RL, which has been demonstrated in [49, 29]. Despite some recent progress towards using the (estimated) probability ratio function to perform OPE, the corresponding development and (theoretical) understanding are still incomplete.
Motivated by the covariate balancing methods [e.g., 1, 74, 73, 27] that have recently been studied for average treatment effect (ATE) estimation, we propose a novel OPE estimator via projected state-action balancing weights under the framework of the time-homogeneous MDP, and provide a comprehensive theoretical analysis. Specifically, we characterize the special challenge in developing a weighted estimator for the OPE problem in terms of an "expanded dimension" phenomenon, which significantly complicates the adoption of the balancing idea in OPE as compared with ATE estimation. See the detailed discussion in Section 3.2. Roughly speaking, a direct modification of an ATE estimator [e.g., 73] for the purpose of OPE leads to estimated weights that depend not only on the current state-action pairs, but also on the state variable at the next decision point, which is inconsistent with the definition of the true probability ratio function. To tackle this issue, we propose an approximate projection step to rule out this "expanded dimension" issue. With the help of this additional projection, we are able to show the convergence rate of the proposed weights (or, equivalently, of the estimated ratio function).
As for theoretical contributions, we analyze the convergence rates of the approximate projection step, the projected state-action balancing weights, as well as the estimated value of a given policy. All these convergence rates are characterized in the scaling of both the sample size $n$ (the number of observed trajectories) and the number of observed decision points $T$ per trajectory. The scaling with respect to $T$ is particularly important when $n$ (the number of subjects) is limited, which is common in mHealth datasets such as the aforementioned OhioT1DM dataset (see also Section 6). For instance, under our setup, the estimated value of the policy is still consistent in the asymptotic setting where $n$ is bounded but $T \to \infty$. In the course of analyzing the proposed method, we also obtain a uniform convergence rate (with respect to $n$ and $T$) for a non-parametric regression based on exponentially $\beta$-mixing sequences, which may be of independent interest for other applications. Furthermore, under some appropriate technical assumptions (including notably a non-diminishing minimum eigenvalue condition), we show that the proposed weighted estimator is asymptotically normal with a $\sqrt{nT}$ rate of convergence and achieves the efficiency bound, which aligns with other types of estimators [30, 39]. Besides, our theoretical results do not require the underlying data to be independent or generated from stationary sequences, although these assumptions are widely used in the existing literature [e.g., 15, 30, 49, 64] for general OPE methods. Removing these restrictive assumptions significantly increases the applicability of the proposed method.
As another theoretical contribution, we make the first attempt to analyze the difficulty of non-parametric OPE under infinite-horizon discounted settings. Strong assumptions are often imposed in the existing literature to establish desirable theoretical properties for the corresponding OPE estimators. For instance, [62] assumed that the minimal eigenvalue of some second-moment matrix is strictly bounded away from zero as the number of basis functions grows. [70] adopted the "completeness" condition to study the convergence of their OPE estimators. These works show that such assumptions can be satisfied when the discount factor $\gamma$ in the policy value (2) is small enough (with additional boundedness conditions on the average visitation probability). However, in practice, the discount factor is often preferred to be set close to 1 [32]. As such, these sufficient conditions are of limited use in practice. In this paper, we provide a necessary and sufficient condition for lower bounding the minimal eigenvalue of the relevant operator (see the definition in Section 4.4), with respect to the data generating process, which characterizes the well-posedness of the Q-function estimations in [62] and [70]. With the help of this characterization, we can further show that the minimal eigenvalue is strictly bounded away from zero under some mild sufficient conditions without any restrictions on $\gamma$ or the target policy, which may be of independent interest.
The rest of the paper is organized as follows. Section 2 presents the basic framework for OPE and some existing OPE methods. In Section 3 we introduce the state-action balancing weights and the proposed estimator. Theoretical results of our estimated weight function and the asymptotic properties of the proposed estimator are developed in Section 4. A detailed discussion regarding the lower bound of the minimum eigenvalue is presented in Section 4.4. Lastly, a simulation study and a real data application are presented in Sections 5 and 6 respectively.
2 Offline Policy Evaluation in Infinite-horizon Markov Decision Processes
In this section, we review the framework of discrete-time homogeneous MDPs and the related OPE methods. Specifically, the framework of MDP and some necessary notations are presented in Section 2.1, while three major classes of model-free OPE methods under the infinite-horizon discounted setting are reviewed in Section 2.2.
2.1 Preliminary and Notations
Consider a trajectory $\{(S_t, A_t, R_t)\}_{t \ge 0}$, where $(S_t, A_t, R_t)$ denotes the triplet of the state, action and immediate reward observed at decision point $t$. Let $\mathcal{S}$ and $\mathcal{A}$ be the state and action spaces, respectively. We assume $\mathcal{A}$ is finite and consider the following two assumptions, which are commonly imposed in infinite-horizon OPE problems.
Assumption 1 (Markovian assumption with stationary transitions).
There exists a time-invariant transition kernel $P$ such that, for every $t \ge 0$, $(s, a) \in \mathcal{S} \times \mathcal{A}$, and any set $B \in \mathcal{B}(\mathcal{S})$,
$$\Pr\big(S_{t+1} \in B \mid S_t = s, A_t = a, \{S_j, A_j, R_j\}_{0 \le j < t}\big) = P(B \mid s, a),$$
where $\mathcal{B}(\mathcal{S})$ is the family of Borel subsets of $\mathcal{S}$.
For notational convenience, we assume the transition kernel $P(\cdot \mid s, a)$ admits a density $p(\cdot \mid s, a)$. Assumption 1 requires that, given the current state-action pair, future states are independent of past observations. Note that this assumption can be tested using the observed data [e.g., 61]. If one believes the trajectory satisfies a higher-order Markovian property, one can instead construct a state variable by aggregating the original state variables over the corresponding number of decision points. We refer interested readers to [61] for an example.
Assumption 2.
There exists a reward function $r$ defined on $\mathcal{S} \times \mathcal{A}$ such that
$$\mathbb{E}\big(R_t \mid S_t = s, A_t = a, \{S_j, A_j, R_j\}_{0 \le j < t}\big) = r(s, a)$$
for every $t \ge 0$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$. In addition, $R_t$ is uniformly bounded, i.e., there exists a constant $R_{\max}$ such that $|R_t| \le R_{\max}$ for every $t \ge 0$.
Assumption 2 states that the current reward is conditionally mean independent of the history given the current state-action pair. One can also regard $R_t$ as a part of $S_{t+1}$ if needed. The uniform boundedness assumption on the rewards is introduced to simplify the technical derivations and can be relaxed. In practice, the time-stationarity of the transition density and the reward function can be warranted by incorporating time-associated covariates. Under Assumptions 1 and 2, the process $\{(S_t, A_t, R_t)\}_{t \ge 0}$ forms a discrete-time time-homogeneous MDP.
A policy is defined as a way of choosing actions at all decision points. In this work, we use the value criterion (defined in (2) below) to evaluate policies and thus focus on time-invariant Markovian policies (also called stationary policies) $\pi$, which are functions mapping from the state space $\mathcal{S}$ to a probability mass function over the action space $\mathcal{A}$. More specifically, $\pi(a \mid s)$ is the probability of choosing action $a$ given state $s$. The sufficiency of considering only stationary policies is explained in Section 6.2 of [55].
In the offline RL setting, one of the fundamental tasks is to estimate a target stationary policy's (expected) value function based on the pre-collected (batch) data. Given a stationary policy $\pi$, the value function is defined as
$$V^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{E}^{\pi}\left( R_t \mid S_0 = s \right), \qquad (1)$$
where $\mathbb{E}^{\pi}$ denotes the expectation with respect to the distribution whose actions are generated by $\pi$, and $\gamma \in [0, 1)$ refers to the discount factor controlling the trade-off between future rewards and immediate rewards. The value function $V^{\pi}(s)$, which is always finite due to Assumption 2, represents the discounted sum of rewards under the target policy $\pi$ given the initial state $s$. Our goal is to estimate the policy value, i.e., the expectation of the value function, defined as
$$\mathcal{V}(\pi) = \mathbb{E}_{S \sim \mathbb{G}}\left[ V^{\pi}(S) \right], \qquad (2)$$
using the pre-collected training data, where $\mathbb{G}$ denotes a reference distribution over $\mathcal{S}$. In the RL literature, $\mathbb{G}$ is typically assumed known.
Now suppose we are given pre-collected training data consisting of $n$ independent and identically distributed finite-horizon trajectories, denoted by
$$\mathcal{D}_n = \big\{ (S_{i,t}, A_{i,t}, R_{i,t}, S_{i,t+1}) : 0 \le t < T_i,\ 1 \le i \le n \big\},$$
where $T_i$ denotes the termination time of the $i$-th trajectory. For simplicity, we assume the same number of time points is observed for all trajectories, i.e., $T_i = T$ for $1 \le i \le n$. This assumption can be relaxed as long as $T_1, \dots, T_n$ are of the same order. We also make the following assumption on the data generating mechanism.
Assumption 3.
The training data $\mathcal{D}_n$ are generated by a fixed stationary behavior policy $\pi^{b}$ with an initial state distribution $\mathbb{G}_0$.
Under Assumptions 1 and 3, $\{(S_t, A_t)\}_{t \ge 0}$ forms a discrete-time time-homogeneous Markov chain. In the literature, $\pi^{b}$ is usually called the behavior policy, which may not be known. Next we define several important probability distributions. For any $t \ge 0$, let $p^{\pi}_{t}(s, a)$ denote the marginal density of $(S_t, A_t)$ at $(s, a)$ under the target policy $\pi$ with initial distribution $\mathbb{G}$; in particular, $p^{\pi}_{0}(s, a) = \mathbb{G}(s)\, \pi(a \mid s)$ when $\mathbb{G}$ admits a density. Similarly, let $p^{b}_{t}(s, a)$ denote the marginal density of $(S_t, A_t)$ under the behavior policy $\pi^{b}$ with initial distribution $\mathbb{G}_0$, and let $\bar{p}_{b}(s, a) = T^{-1} \sum_{t=0}^{T-1} p^{b}_{t}(s, a)$ denote the average visitation density of the batch data. With $\gamma \in [0, 1)$, the discounted visitation probability density is defined as
$$d^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p^{\pi}_{t}(s, a), \qquad (3)$$
which is assumed to be well-defined.
2.2 Existing Off-policy Evaluation Methods
Most existing model-free OPE methods for the above setting can be grouped into three categories. The first category is direct methods, which directly estimate the state-action value function [e.g., 42, 62] defined as
$$Q^{\pi}(s, a) = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{E}^{\pi}\left( R_t \mid S_0 = s, A_0 = a \right), \qquad (4)$$
also known as the Q-function. As we can see from (1) and (2),
$$\mathcal{V}(\pi) = \mathbb{E}_{S \sim \mathbb{G}}\Big[ \sum_{a \in \mathcal{A}} \pi(a \mid S)\, Q^{\pi}(S, a) \Big]. \qquad (5)$$
It is well known that $Q^{\pi}$ satisfies the following Bellman equation
$$Q^{\pi}(s, a) = \mathbb{E}\Big[ R_t + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q^{\pi}(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big], \qquad (6)$$
for every $t \ge 0$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$. Clearly (6) forms a conditional moment restriction, based on which $Q^{\pi}$ can be estimated by many methods such as the generalized method of moments [21] and nonparametric instrumental variable regression [52, 8]. The second category of OPE methods is motivated by the idea of marginal importance sampling [e.g., 41, 49]. Notice that, with (3), one can rewrite $\mathcal{V}(\pi)$ as
$$\mathcal{V}(\pi) = \frac{1}{1 - \gamma} \int_{\mathcal{S}} \sum_{a \in \mathcal{A}} r(s, a)\, d^{\pi}(s, a)\, \mathrm{d}s = \frac{1}{1 - \gamma}\, \mathbb{E}_{(S, A) \sim \bar{p}_{b}}\big[ \omega^{\pi}(S, A)\, r(S, A) \big], \qquad (7)$$
as long as $d^{\pi}$ is absolutely continuous with respect to the average visitation density $\bar{p}_{b}$ of the batch data. We call
$$\omega^{\pi}(s, a) = \frac{d^{\pi}(s, a)}{\bar{p}_{b}(s, a)} \qquad (8)$$
the probability ratio function, which is used to adjust for the mismatch between the behavior policy $\pi^{b}$ and the target policy $\pi$. Based on this relationship, one can obtain an estimator of $\mathcal{V}(\pi)$ via estimating $\omega^{\pi}$. By the so-called backward Bellman equation of $d^{\pi}$, for any suitable test function $f$ defined on $\mathcal{S} \times \mathcal{A}$, one can show that
$$d^{\pi}(s', a') = (1 - \gamma)\, \mathbb{G}(s')\, \pi(a' \mid s') + \gamma\, \pi(a' \mid s') \int_{\mathcal{S}} \sum_{a \in \mathcal{A}} p(s' \mid s, a)\, d^{\pi}(s, a)\, \mathrm{d}s. \qquad (9)$$
This implies that
$$\mathbb{E}\Big[ \omega^{\pi}(S_t, A_t) \Big\{ f(S_t, A_t) - \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, f(S_{t+1}, a') \Big\} \Big] = (1 - \gamma)\, \mathbb{E}_{S \sim \mathbb{G}}\Big[ \sum_{a \in \mathcal{A}} \pi(a \mid S)\, f(S, a) \Big], \qquad (10)$$
where the expectation on the left-hand side is taken with respect to the batch data distribution averaged over decision points.
See the detailed derivation of (10) in Section S2.1 of the Supplementary Material. Recall that $\mathbb{G}$ is known. Based on (10), several methods [e.g., 49, 68] can be leveraged to estimate $\omega^{\pi}$. In Section 3.4, we provide more discussion on other weighted estimators as compared with the proposed estimator.
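To make the moment condition (10) concrete, the following sketch evaluates its empirical discrepancy for a candidate weight function on pooled transition data. The array layout and function signatures are illustrative assumptions (not notation from the paper), and the condition itself is the reconstructed form displayed above.

```python
import numpy as np

def balancing_discrepancy(S, A, S_next, w, f, pi, G_states, gamma, actions):
    """Empirical discrepancy of the moment condition (10) for one test function f.

    S, A, S_next : pooled transitions (S_{i,t}, A_{i,t}, S_{i,t+1}) stacked over i and t.
    w            : candidate weights, one per transition (evaluations of a ratio-function model).
    f            : test function, f(states, actions) -> array of values.
    pi           : target policy, pi(a, states) -> array of probabilities of action a.
    G_states     : states drawn from the (known) reference distribution G.
    """
    n_obs, n_ref = len(S), len(G_states)
    # f(S_t, A_t) - gamma * sum_{a'} pi(a' | S_{t+1}) f(S_{t+1}, a')
    td_term = f(S, A) - gamma * sum(pi(a, S_next) * f(S_next, np.full(n_obs, a)) for a in actions)
    lhs = np.mean(w * td_term)
    # (1 - gamma) * E_{S ~ G, a ~ pi(.|S)} f(S, a)
    rhs = (1.0 - gamma) * np.mean(sum(pi(a, G_states) * f(G_states, np.full(n_ref, a)) for a in actions))
    return lhs - rhs  # approximately zero when w matches the true probability ratio function
```

Driving this discrepancy toward zero over a rich collection of test functions is precisely what the balancing construction in Section 3 targets.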
The last category of methods combines direct and marginal importance sampling methods to construct a doubly robust estimator, which is also motivated by the following efficient influence function [e.g., 30]
$$\frac{1}{1 - \gamma}\, \omega^{\pi}(S_t, A_t)\Big\{ R_t + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q^{\pi}(S_{t+1}, a') - Q^{\pi}(S_t, A_t) \Big\} + \mathbb{E}_{S \sim \mathbb{G}}\Big[ \sum_{a \in \mathcal{A}} \pi(a \mid S)\, Q^{\pi}(S, a) \Big] - \mathcal{V}(\pi), \qquad (11)$$
where $Q^{\pi}$ and $\omega^{\pi}$ are nuisance functions.
3 Projected State-action Balancing Estimator
In this section, we introduce the proposed weighted estimator for $\mathcal{V}(\pi)$. Since our estimator is motivated by covariate balancing weights in the literature of ATE estimation, we discuss their connection in Section 3.1. In Section 3.2, we show the difficulty of directly applying the covariate balancing idea to the aforementioned policy evaluation problem due to an "expanded-dimension" issue. We address this difficulty and propose a projected state-action balancing estimator for $\mathcal{V}(\pi)$ in Section 3.3.
3.1 State-action Balancing Weights
Consider a general form of weighted estimators:
$$\widehat{\mathcal{V}}_{w}(\pi) = \frac{1}{(1 - \gamma)\, nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1} w_{i,t}\, R_{i,t},$$
where the $w_{i,t}$'s are weights constructed from the training data $\mathcal{D}_n$. Due to (7), a reasonable strategy to derive such weights is to first estimate the probability ratio function $\omega^{\pi}$, and then evaluate it at the observed state-action pairs in $\mathcal{D}_n$. This is analogous to the inverse probability weights commonly adopted in ATE estimation. However, this strategy often produces an unstable weighted estimator due to small weights [31]. Instead, there is a recent surge of interest in directly obtaining weights that achieve empirical covariate balance [e.g., 25, 7, 74, 73]. These weights usually produce more stable weighted estimators with superior finite-sample performance for ATE estimation.
Inspired by covariate balancing, a natural idea is to choose weights that ensure the (approximate) validity of the empirical counterpart of (10), i.e.,
$$\frac{1}{nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1} w_{i,t} \Big\{ \xi_k(S_{i,t}, A_{i,t}) - \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{i,t+1})\, \xi_k(S_{i,t+1}, a') \Big\} = (1 - \gamma)\, \mathbb{E}_{S \sim \mathbb{G}}\Big[ \sum_{a \in \mathcal{A}} \pi(a \mid S)\, \xi_k(S, a) \Big] \qquad (12)$$
over $k = 1, \dots, K$, where $\xi_1, \dots, \xi_K$ are pre-specified basis functions defined on $\mathcal{S} \times \mathcal{A}$. The equality (12) can be viewed as a form of state-action balance, in contrast to the covariate balance in ATE estimation. The linear span of $\xi_1, \dots, \xi_K$ can be viewed as a finite-dimensional approximation of the function space in which the balance should be enforced. In theory, $K$ is expected to increase with $n$ and $T$. Changing the balancing criterion in [73], one can obtain a form of state-action balancing weights via the following mathematical programming:
$$\min_{w_{1,0}, \dots, w_{n,T-1}} \ \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \rho(w_{i,t}) \qquad (13a)$$
$$\text{subject to} \quad \Big| \frac{1}{nT} \sum_{i,t} w_{i,t} \Big\{ \xi_k(S_{i,t}, A_{i,t}) - \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{i,t+1})\, \xi_k(S_{i,t+1}, a') \Big\} - (1 - \gamma)\, \mathbb{E}_{S \sim \mathbb{G}}\Big[ \sum_{a} \pi(a \mid S)\, \xi_k(S, a) \Big] \Big| \le \delta_k, \quad k = 1, \dots, K, \qquad (13b)$$
where the tuning parameters $\delta_k \ge 0$ control the imbalance with respect to $\xi_k$. When $\delta_k = 0$ for all $k$, the weights achieve the exact balance (12) over the span of $\xi_1, \dots, \xi_K$. In practice, the exact balance can be hard to achieve, especially when $K$ is large. Allowing $\delta_k > 0$ leads to approximate balance [1, 74], which introduces flexibility. In addition, one can also constrain the weights to be non-negative. Since non-negativity constraints are not necessary for consistent estimation (as shown in our theoretical analysis), we do not enforce them throughout this paper. Common choices of $\xi_1, \dots, \xi_K$ are constructed based on tensor products of one-dimensional basis functions. Examples of one-dimensional basis functions include spline bases [12] (for a continuous dimension) and indicator functions of levels (for a categorical dimension). The objective function (13a) is introduced to control the magnitude of the weights. Here $\rho$ is chosen as a non-negative, strictly convex and continuously differentiable function, such as the quadratic function. In the following, we discuss the issue with the weights defined by (13), and explain the challenge of directly applying this covariate balancing idea to the OPE problem.
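As a minimal illustration of how the mathematical programming (13) can be solved in practice, the sketch below uses the generic convex solver CVXPY with $\rho(x) = x^2$. The inputs `B_cur`, `B_next_pi`, `target` and `delta` are hypothetical pre-computed arrays, not objects defined in the paper.

```python
import numpy as np
import cvxpy as cp

def state_action_balancing_weights(B_cur, B_next_pi, target, delta, gamma):
    """Solve the balancing program (13) with rho(x) = x^2.

    B_cur     : (nT, K) array with entries xi_k(S_{i,t}, A_{i,t}).
    B_next_pi : (nT, K) array with entries sum_{a'} pi(a' | S_{i,t+1}) xi_k(S_{i,t+1}, a').
    target    : (K,) array of (1 - gamma) * E_{S~G, a~pi}[xi_k(S, a)].
    delta     : (K,) array of imbalance tolerances delta_k.
    """
    nT = B_cur.shape[0]
    w = cp.Variable(nT)
    # empirical balance terms, one per basis function, as in (13b)
    balance = (B_cur - gamma * B_next_pi).T @ w / nT
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w) / nT),      # objective (13a)
                         [cp.abs(balance - target) <= delta])      # constraints (13b)
    problem.solve()
    return w.value
```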
Write the solution of (13) as $\{\tilde{w}_{i,t}\}$ and the corresponding weighted estimator as $\tilde{\mathcal{V}}(\pi)$, obtained by plugging these weights into the general form above. For any function $f$ defined on $\mathcal{S} \times \mathcal{A}$, let
(14) |
The difference between $\tilde{\mathcal{V}}(\pi)$ and $\mathcal{V}(\pi)$ yields the following decomposition:
(15)
(16)
(17)
(18)
(19)
(20)
where the first equality is given by (6) and the representation of $\mathcal{V}(\pi)$ in (5). Clearly, (17)-(18) can be controlled via the balancing constraint (13b) by carefully controlling the tolerances $\delta_k$ and selecting $\xi_1, \dots, \xi_K$ so that $Q^{\pi}$ can be well approximated. For (19), by assuming that the reward errors are independent noises of the trajectories, it may not be hard to obtain an upper bound of the desired order as long as the magnitude of the weights is properly controlled. However, it remains unclear how to control (20) due to the complex dependence between the estimated weights and the future states. Indeed, we observe an "expanded-dimension" issue due to the balancing constraints (13b), which will be explained in detail in Section 3.2. This also motivates the development of the novel projected balancing constraints in Section 3.3.
3.2 Expanded Dimension
First, we obtain the dual form of (13), which provides an important characterization of the solution of (13).
Theorem 1.
The proof of Theorem 1 is similar to that of Theorem 2 below, and can be found in Section S2.2 of the Supplementary Material. Now the expanded-dimension issue can be easily seen via the following example. Suppose that there are two transition triplets $(S_{i,t}, A_{i,t}, S_{i,t+1})$ and $(S_{j,u}, A_{j,u}, S_{j,u+1})$ such that $S_{i,t} = S_{j,u}$ and $A_{i,t} = A_{j,u}$, but $S_{i,t+1} \ne S_{j,u+1}$. As the true probability ratio function $\omega^{\pi}$ is a function of the current state and action variables only, we must have $\omega^{\pi}(S_{i,t}, A_{i,t}) = \omega^{\pi}(S_{j,u}, A_{j,u})$. However, the solution form (22) in Theorem 1 does not lead to $\tilde{w}_{i,t} = \tilde{w}_{j,u}$ in general, which violates our knowledge of $\omega^{\pi}$.
One may hypothesize that the expanded-dimension issue is a finite-sample property, and the variation of the estimated weights due to the next state may diminish asymptotically under some reasonable conditions. To gain more insight, we show that the solution form (22) indeed induces an implicit restriction on the modeling of the true weight function under finite-state and finite-action settings. Therefore, unless one is willing to make further non-trivial assumptions on the weight function, the hypothesis cannot be true in general.
Notice that
(23) |
To avoid dealing with the approximation error, we focus on an even more general class of functions defined on $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$, where
Recall the function defined in Theorem 1 whose first derivative appears in the solution form (22), and assume that this derivative is invertible. We would like to know whether this class can model $\omega^{\pi}$ well. Suppose an element of the class does not depend on the next state, i.e., it is constant with respect to the third argument; the collection of all such elements characterizes the subclass with reduced input dimensions. A key question is whether this restriction shrinks the class of possible weight functions and, as a result, induces some implicit form of restriction. To see this, we focus on the settings where $\mathcal{S}$ and $\mathcal{A}$ are finite. In Lemma 1 below, we show that the dimension of this subclass is $|\mathcal{S}||\mathcal{A}| - |\mathcal{S}| + 1$, which is strictly less than $|\mathcal{S}||\mathcal{A}| - 1$ as long as $|\mathcal{S}| > 2$. Note that, due to the natural constraint that the weight function averages to one under the data distribution, a general weight function has $|\mathcal{S}||\mathcal{A}| - 1$ free parameters. As the derivative is invertible, Lemma 1 suggests a possible implicit restriction and that the solution obtained from (13) may not be a consistent estimator of $\omega^{\pi}$.
Lemma 1.
Suppose $\mathcal{S}$ and $\mathcal{A}$ are both finite. Then the dimension of the reduced subclass is $|\mathcal{S}||\mathcal{A}| - |\mathcal{S}| + 1$.
Proof sketch.
For any function in the reduced subclass, there exists a constant $c$ such that the corresponding $f$ satisfies $\sum_{a' \in \mathcal{A}} \pi(a' \mid s)\, f(s, a') = c$ for every $s \in \mathcal{S}$. For a fixed $c$, this yields $|\mathcal{S}|$ linearly independent constraints on $f$. Together with the free parameter $c$, we can show that the dimension of the subclass is $|\mathcal{S}||\mathcal{A}| - |\mathcal{S}| + 1$. The detailed proof can be found in Section S2.2 of the Supplementary Material. ∎
3.3 Projected State-action Balancing Weights
To overcome the expanded-dimension issue, we propose an approximate projection step, which is applied to each basis function $\xi_k$, $k = 1, \dots, K$, to rule out the involvement of the next state. To explain the idea, we again focus on the decomposition of the weighted estimator. From (15) and (16), we would like to choose weights that ideally control
for every $k = 1, \dots, K$. However, in practice, the conditional expectation $\mathbb{E}\big\{ \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, \xi_k(S_{t+1}, a') \mid S_t, A_t \big\}$ is unknown to us. As explained in Section 3.2, the idea of replacing it with its empirical counterpart results in a non-trivial expanded-dimension issue. Instead, we propose to estimate the projection term via a more involved optimization problem:
$$\hat{g}_k = \operatorname*{arg\,min}_{g \in \mathcal{G}} \ \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \Big\{ \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{i,t+1})\, \xi_k(S_{i,t+1}, a') - g(S_{i,t}, A_{i,t}) \Big\}^2 + \mu\, J(g), \quad k = 1, \dots, K, \qquad (24)$$
where $\mathcal{G}$ is a pre-specified function space that contains the projection term, $\mu > 0$ is a tuning parameter and $J(\cdot)$ is a regularization functional. In this work, we focus on kernel ridge regression, where $\mathcal{G}$ is a subset of a reproducing kernel Hilbert space (RKHS) and $J$ is taken as the squared RKHS norm; see Assumption 4'(c) in Section 4.1 of the Supplementary Material. In Theorem 3 of Section 4, we establish a finite-sample error bound for the approximate projection (in the scaling of both $n$ and $T$) that holds uniformly over the basis functions (in a reasonably large class specified later). This provides a solid theoretical guarantee for replacing the unknown projection term by $\hat{g}_k$ in the construction of the weights.
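The following sketch illustrates one way to carry out the approximate projection (24) with a Gaussian-kernel RKHS, using scikit-learn's kernel ridge regression. The function signatures and the bandwidth handling are illustrative assumptions rather than the paper's exact implementation (see Section S1.2 for the implementation actually used).

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def project_basis(SA_cur, S_next, xi_k, pi, actions, gamma, mu, bandwidth):
    """Approximate projection (24) for one basis function xi_k:
    regress gamma * sum_{a'} pi(a' | S_{t+1}) xi_k(S_{t+1}, a') on (S_t, A_t).

    SA_cur : (nT, d+1) array of current state-action pairs (action appended as a column).
    S_next : (nT, d) array of next states.
    xi_k   : xi_k(states, action) -> array of basis values, for a scalar action.
    pi     : target policy, pi(action, states) -> array of probabilities.
    """
    y = gamma * sum(pi(a, S_next) * xi_k(S_next, a) for a in actions)
    rbf_gamma = 1.0 / (2.0 * bandwidth ** 2)      # Gaussian-kernel parametrization
    krr = KernelRidge(alpha=mu, kernel="rbf", gamma=rbf_gamma)
    krr.fit(SA_cur, y)
    return krr.predict(SA_cur)                    # hat{g}_k at the observed (S_t, A_t)
```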
With the approximate projection (24), we propose to estimate the weights by solving the following optimization problem:
$$\min_{w_{1,0}, \dots, w_{n,T-1}} \ \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \rho(w_{i,t}) \qquad (25a)$$
$$\text{subject to} \quad \Big| \frac{1}{nT} \sum_{i,t} w_{i,t} \big\{ \xi_k(S_{i,t}, A_{i,t}) - \hat{g}_k(S_{i,t}, A_{i,t}) \big\} - (1 - \gamma)\, \mathbb{E}_{S \sim \mathbb{G}}\Big[ \sum_{a} \pi(a \mid S)\, \xi_k(S, a) \Big] \Big| \le \delta_k, \quad k = 1, \dots, K. \qquad (25b)$$
The resulting solution $\{\hat{w}_{i,t}\}$ gives the proposed weights. As such, the proposed estimator for $\mathcal{V}(\pi)$ is
$$\widehat{\mathcal{V}}(\pi) = \frac{1}{(1 - \gamma)\, nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \hat{w}_{i,t}\, R_{i,t}. \qquad (26)$$
Similarly to Theorem 1, we derive the dual form of (25), which is shown in Theorem 2 below.
Theorem 2.
The proof can be found in Section S2.2 of the Supplementary Material. As seen from the representation (28), the proposed weights do not suffer from the expanded-dimension issue that we see in (21). Besides, (27) can be regarded as an M-estimation of the dual parameter with a weighted norm regularization. Since the estimated weights are parametrized via (28), Theorem 2 reveals a connection between the proposed weights and the shrinkage estimation of the probability ratio function $\omega^{\pi}$. By Lemma 1 in [68], the expectation of the corresponding loss function is minimized when the weight function equals the true probability ratio function at every state-action pair. In Theorem 4 of Section 4, we show the convergence rate of the proposed weights to the true weights in the scaling of both $n$ and $T$.
Next, we discuss the computation of the weights in practice. The projection step (24) is a pre-computation step for the optimization (25). In other words, we only need to estimate the projections $\hat{g}_k$, $k = 1, \dots, K$, once, and there is no need to recompute them within the weight estimation (25). The optimization (25) is a standard convex optimization problem with linear constraints. We outline the proposed algorithm for computing the weights in Algorithm 1 of the Supplementary Material.
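For concreteness, the sketch below strings the two stages together in the spirit of Algorithm 1: the projected balancing program (25) with $\rho(x) = x^2$ solved by CVXPY, followed by the weighted value estimate (26). The matrix inputs are hypothetical pre-computed quantities, and the $(1-\gamma)^{-1}$ normalization reflects our reading of (26) rather than a verbatim formula from the paper.

```python
import numpy as np
import cvxpy as cp

def projected_balancing_value(B_cur, B_proj, target, delta, R, gamma):
    """Projected balancing weights (25) with rho(x) = x^2, then the weighted value (26).

    B_cur  : (nT, K) array of xi_k(S_{i,t}, A_{i,t}).
    B_proj : (nT, K) array of projected terms hat{g}_k(S_{i,t}, A_{i,t}), e.g., from project_basis.
    target : (K,) array of (1 - gamma) * E_{S~G, a~pi}[xi_k(S, a)].
    delta  : (K,) array of tolerances;  R : (nT,) array of rewards.
    """
    nT = B_cur.shape[0]
    w = cp.Variable(nT)
    balance = (B_cur - B_proj).T @ w / nT                          # projected constraint (25b)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w) / nT),      # objective (25a)
                         [cp.abs(balance - target) <= delta])
    problem.solve()
    w_hat = w.value
    return float(np.sum(w_hat * R)) / ((1.0 - gamma) * nT)         # weighted estimator (26)
```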
3.4 Other weighting methods
Apart from the proposed projection method, a sensible alternative for avoiding the expanded dimension is to directly specify a class of functions over $\mathcal{S} \times \mathcal{A}$ as a model of the weight function (i.e., the weights are evaluations of this function), which naturally results in the following form of estimators:
(29) |
where $\mathcal{W}$ is a space that models $\omega^{\pi}$, $J(\cdot)$ is a regularization functional and $\mu > 0$ is a tuning parameter. Similar approaches have been adopted by a few recent works such as [41, 68, 70], although they are not directly motivated by the above expanded-dimension issue. Since these works assume independent transition tuples (without considering the dependence within the trajectory), we restrict our results to this setting for comparisons, with a remark that our estimator is analyzed in a more general setting of $T$ (e.g., either bounded or diverging to infinity). In order to use the above estimator, the choice of the involved function classes seems difficult in terms of the computation of the weights. For nonparametric modeling, a natural idea is to take $\mathcal{W}$ as an RKHS, since this often leads to a finite-dimensional optimization in regression problems via a representer theorem [71]. However, the term involving the next state could make the representer theorem inapplicable when $\mathcal{S}$ is continuous, and so (29) becomes an impractical infinite-dimensional optimization. Another way is to take finite-dimensional spaces to approximate the relevant function classes. The dimensions of the approximation spaces would need to increase with the sample size so as to avoid an asymptotic bias. Moreover, the existing results on the convergence rate of such estimators to the policy value are not optimal (i.e., slower than the parametric rate). See Corollary 11 in [70]. As for our weighted estimator, an optimal convergence rate with statistical efficiency can be achieved without an additional nuisance parameter estimation, even in a more general dependent setting where $T$ may diverge. See Theorem 6 in Section 4.
4 Theoretical Results
In this section, we study the theoretical properties of the approximate projection in (24), the balancing weights in (25) and the final weighted estimator in Section 3.3. Specifically, in Section 4.1, we derive the finite-sample error bound for the approximate projection. In Section 4.2, we study the convergence rate of the proposed balancing weights. Finally, we show that the proposed weighted estimator is statistically efficient under additional conditions specified in Section 4.3. In Section 4.4, we study the difficulty of offline RL in a conservative manner. To start with, we introduce some notation. We define the squared empirical norm of a function $f$ on $\mathcal{S} \times \mathcal{A}$ as $\|f\|_{nT}^{2} = (nT)^{-1} \sum_{i=1}^{n} \sum_{t=0}^{T-1} f^{2}(S_{i,t}, A_{i,t})$. The notation $a_{n,T} \gtrsim b_{n,T}$ (resp. $a_{n,T} \lesssim b_{n,T}$) means that there exists a sufficiently large constant $c_1 > 0$ (resp. a sufficiently small constant $c_2 > 0$) such that $a_{n,T} \ge c_1 b_{n,T}$ (resp. $a_{n,T} \le c_2 b_{n,T}$) for the sequences $a_{n,T}$ and $b_{n,T}$ under consideration. Also, we denote by $N(\epsilon, \mathcal{F}, \cdot)$ the $\epsilon$-covering number of a class $\mathcal{F}$ with respect to a given metric. We take $d$ as the dimension of a state vector.
4.1 Non-parametric Regressions with Exponentially β-mixing Sequences
Recall that the constraint (25b) is merely a surrogate of the desired state-action balancing condition, due to the use of the estimated projections $\hat{g}_k$, $k = 1, \dots, K$. To bound the effect of this surrogate error on the estimation of the weights and the policy value, we study the uniform convergence of the approximate projection. More generally, in Theorem 3 below, we obtain an error bound for the approximate projection that holds uniformly over a class $\mathcal{F}$ of functions of interest defined on $\mathcal{S} \times \mathcal{A}$. The bound scales with both $n$ and $T$. Later, when we adopt Theorem 3, $\mathcal{F}$ will be taken as a subset of the linear span of $\{\xi_1, \dots, \xi_K\}$. See Corollary 1 for more details. Theorem 3 requires the following assumption. Let $\|\cdot\|_{TV}$ denote the total variation norm.
Assumption 4.
The following conditions hold.
-
(a)
The Markov chain $\{(S_t, A_t)\}_{t \ge 0}$ induced by the behavior policy has a unique stationary distribution with density $\bar{p}$ and is geometrically ergodic, i.e., there exist a function $\phi(\cdot, \cdot)$ and a constant $\kappa \in (0, 1)$ such that, for any $(s, a) \in \mathcal{S} \times \mathcal{A}$ and $t \ge 1$,
$$\big\| P^{t}(\cdot \mid s, a) - \bar{p}(\cdot) \big\|_{TV} \le \phi(s, a)\, \kappa^{t},$$
where $P^{t}(\cdot \mid s, a)$ is the behavior-policy-induced conditional distribution of $(S_t, A_t)$ given $S_0 = s$ and $A_0 = a$. Also, there exists a constant $C > 0$ such that $\mathbb{E}\{\phi(S_0, A_0)\} \le C$, where the expectation is taken under the behavior policy and the initial distribution $\mathbb{G}_0$ of the batch data.
-
(b)
The function class $\mathcal{F}$ is uniformly bounded, i.e., $\sup_{f \in \mathcal{F}} \|f\|_{\infty} < \infty$.
-
(c)
The function class $\mathcal{G}$ in (24) is uniformly bounded, i.e., $\sup_{g \in \mathcal{G}} \|g\|_{\infty} < \infty$, and contains the corresponding projection (conditional expectation) for every $f \in \mathcal{F}$.
- (d)
Assumption 4(a) basically assumes that the stationary distribution exists and the mixing rate is exponentially fast. This allows the convergence rate to scale well with $T$. By a truncation argument (see the proof of Theorem 3), we can show that our estimation is almost equivalent to nonparametric regression based on stationary and exponentially $\beta$-mixing sequences. It is known that if a Markov chain is aperiodic and satisfies some drift condition in terms of a well-behaved non-negative measurable function, then it is geometrically ergodic; see [6] and Chapter 15 of [46] for a detailed characterization. The boundedness assumptions on $\mathcal{F}$ and $\mathcal{G}$ in Assumptions 4(b) and 4(c) are used to simplify the proof and can be relaxed by a careful truncation argument. Similar assumptions are also adopted in [39]. Assumption 4(d) specifies the complexity of the function spaces. These entropy assumptions are satisfied for common functional classes, such as RKHSs and Sobolev spaces [22, 19, 20]. Take the Sobolev spaces as an example: the entropy exponent is governed by the ratio of the input dimension to the number of continuous derivatives possessed by the functions in the corresponding space. With Assumption 4, we obtain the following uniform finite-sample error bound.
Theorem 3.
Suppose Assumption 4 holds. For any $\delta \in (0, 1)$ and all sufficiently large $nT$ (i.e., either $n$ or $T$ is large enough), with probability at least $1 - \delta$, the following inequality holds for all $f \in \mathcal{F}$:
(30) |
where the leading constant in the above inequality depends on the constants appearing in Assumption 4.
Further suppose the entropy exponents in Assumption 4(d) are bounded. If the tuning parameter $\mu$ and the related quantities are chosen of appropriate orders, then for sufficiently large $nT$, the following inequality holds for all $f \in \mathcal{F}$ with probability at least $1 - \delta$:
Next, we adapt Theorem 3 to our case. More specifically, we consider the setting where $\mathcal{G}$ in Assumption 4(c) is an RKHS. Due to space limitations, we list the corresponding assumption in the Supplementary Material (see Assumption 4' in Section 4.1 of the Supplementary Material). Then we have the following corollary.
Corollary 1.
The proofs of Theorem 3 and Corollary 1 can be found in Section S2.3 of the Supplementary Material. From Theorem 3, by carefully choosing the tuning parameters, we can achieve the optimal nonparametric convergence rate, up to a logarithmic factor, that holds uniformly over $\mathcal{F}$, matching the corresponding rate in the i.i.d. setting. More importantly, Theorem 3 does not require the initial distribution to be the stationary distribution, an assumption that can be unrealistic in practice but is often required in existing results such as [16]. Thus our result is broadly applicable. Accordingly, Corollary 1 provides a tight bound (in the scaling of both $n$ and $T$) on the proposed approximate projection step (24), which leads to an accurate estimation of the target state-action balancing condition for the construction of the proposed weights.
4.2 Convergence Rates of Balancing Weights
With Theorem 3 and Corollary 1, we now derive the convergence of the proposed weights $\{\hat{w}_{i,t}\}$. Define
For any square symmetric matrix $M$, $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ represent its maximum and minimum eigenvalues, respectively. We use $\|\cdot\|_2$ to denote the Euclidean norm of a vector. We will need the following assumption.
Assumption 5.
The following conditions hold.
-
(a)
There exists a linear combination of the basis functions $\xi_1, \dots, \xi_K$ that approximates the true weight function $\omega^{\pi}$ uniformly, with an approximation error that vanishes as $K$ grows, where the associated constant does not depend on $K$. Also, there exists a constant $C_{\omega}$ such that $\sup_{(s, a)} \omega^{\pi}(s, a) \le C_{\omega}$.
-
(b)
The second derivative of the function defined in Theorem 2 is positive and continuous, and is bounded below by some positive constant.
-
(c)
There exist constants such that the basis functions are uniformly controlled, i.e., $\sup_{(s, a)} \|\xi(s, a)\|_2 \lesssim \sqrt{K}$ with $\xi = (\xi_1, \dots, \xi_K)^{\top}$, and, for every $K$, the largest eigenvalue of the second-moment matrix of $\xi(S, A)$ is bounded above.
-
(d)
There exists a quantity, possibly depending on $K$, that lower bounds the minimal eigenvalue discussed in Section 4.4.
- (e)
-
(f)
.
Assumption 5(a) specifies the requirement on the uniform approximation error of the true weight function $\omega^{\pi}$. The boundedness of $\omega^{\pi}$ can be guaranteed if the average visitation density is bounded away from 0 and the policy-induced discounted visitation probability is bounded above. This overlapping condition is commonly assumed in the literature on ATE and RL [e.g., 74, 68, 70]. For Assumption 5(b), by the relationship detailed in Theorem 2, many convex choices of $\rho$ adopted in (25) will result in a dual function that satisfies this condition, e.g., the quadratic $\rho$. For Assumption 5(c), the uniform bound is a mild technical condition on the basis functions $\xi_1, \dots, \xi_K$. It is satisfied by many classes of basis functions, including the regression spline, the trigonometric polynomial, and the wavelet basis [51, 23, 8, 2]. See the same assumption in [13] (Condition 6 of their Assumption 4.1). As for the largest eigenvalue condition, we verify that it holds for common bases such as the tensor-product B-spline basis and the tensor-product wavelet basis in Lemma S5 of the Supplementary Material. As for Assumption 5(d), the eigenvalue lower bound is allowed to depend on $K$ in our analysis, as opposed to existing studies such as [62]. While we allow it to diminish as $K$ grows, we later show that it can be strictly bounded below if we are willing to make further relatively minor assumptions. See the detailed discussion in Section 4.4. Assumption 5(e) is a mild condition on the best approximation in the basis span and its projection. Now we can show the convergence of the proposed balancing weights.
Theorem 4.
The proof of Theorem 4 can be found in Section S2.4 of the Supplementary Material. Theorem 4 gives the convergence rate of the proposed balancing weights in terms of both the empirical norm and its population counterpart. Note that the rate implies that, as long as either $n$ or $T$ goes to infinity, the proposed weight estimates converge to the true ratio function. To the best of our knowledge, this is the first consistency result for the ratio function in the scaling of both $n$ and $T$ under the discounted infinite-horizon setting.
4.3 Estimation Error and Statistical Efficiency
Assumption 6.
Assumption 6(a) is a regularity condition for the Q-function $Q^{\pi}$. It is satisfied when $Q^{\pi}(\cdot, a)$ possesses enough continuous derivatives with respect to the state among all action levels, and when basis functions such as splines and power series are chosen, provided $\mathcal{S}$ is a compact domain. Assumption 6(b) is a mild condition on the reward errors. In fact, this assumption can be relaxed to allow the errors themselves to be dependent. Note that the estimated weight function only depends on the state-action transitions, which are independent of the reward errors. The proof of the convergence of the proposed estimator stated in the following theorems is based on a conditioning argument that separates the effects of the weights and the errors. Standard techniques that deal with weighted averages of dependent random variables can be adopted to extend the current results (e.g., Theorem 5) to those dependent settings. For instance, if there are some weak autocorrelations among the reward errors, we are still able to obtain the results in Theorems 5 and 6.
Theorem 5.
Theorem 5 can be proved by arguments similar to those in the proof of Theorem 6. To obtain a sharp convergence rate based on this theorem, one needs to tune $K$ accordingly. As such, the eigenvalue bound specified in Assumption 5(d) becomes crucial. Indeed, we can show that it is strictly lower bounded by a positive constant independent of $K$ under some mild conditions. We defer the detailed discussion about characterizing this lower bound to Section 4.4. In the following, we establish the bounds when it is bounded below.
Theorem 6.
Suppose Assumptions 4’ (in the Supplementary Material), 5(a)–(g) and 6 hold. Also, assume that for some constant .
-
(i)
If the number of basis functions $K$ and the tolerances $\delta_k$ satisfy suitable rate conditions relative to $n$ and $T$, we have
(31)
In particular, we can take $K$ to grow at an appropriate polynomial rate in $nT$ such that the above conditions hold.
-
(ii)
Assume, in addition, suitable rate conditions on $K$, the tolerances $\delta_k$ and the approximation errors, and take the tuning parameters of appropriate orders.
Then, as either $n \to \infty$ or $T \to \infty$, we have
(32)
In particular, if the approximation errors decay sufficiently fast, then we can take $K$ to grow at an appropriate polynomial rate in $nT$ so that the above results hold.
The proof of this theorem can be found in Section S2.5 of the Supplementary Material. Note that Theorem 6 requires the minimal eigenvalue to be lower bounded by a positive constant, which can be satisfied under the conditions in Assumption 7. The constraints on the tuning parameters in Theorem 6(i) and (ii) are stronger than those in Assumption 5, which lead to the desired rate of convergence in case (i) and the $\sqrt{nT}$-consistency in case (ii), respectively. Compared with case (i), the constraints in case (ii) are more restrictive so that the bias is asymptotically negligible and thus the estimator is $\sqrt{nT}$-consistent. When the weight and Q-functions are sufficiently smooth, the existence of a valid choice of $K$ is guaranteed. The additional assumption in case (ii) is mild and allows the bias term to diminish asymptotically.
If $T$ is fixed, one can show that the asymptotic variance in (32) is indeed the semi-parametric efficiency bound in the standard i.i.d. setting. When both $n$ and $T$ are allowed to go to infinity, the variance takes the same form with the expectation taken with respect to the stationary measure induced by the behavior policy. As shown in [30], this is also the statistical efficiency bound in the sense of [33]. Note that our results do not require the data to come from the stationary distribution, which is however needed in [30].
Finally, we remark that, to the best of our knowledge, only two prior works establish the convergence rates of policy value estimators under non-parametric models in the scaling of both $n$ and $T$. One is [62], which directly estimates the Q-function. The underlying analysis does not require the stationarity assumption for the data generating process. However, they did not show that their estimator can achieve statistical efficiency. In addition, their conditions require that the initial distribution of the data is bounded away from zero, which we do not require. The other one is [30]. This work shows the convergence of their estimator when both $n$ and $T$ diverge. In their theoretical results, a stationarity assumption is needed. In order to obtain efficiency, this work requires the adoption of cross-fitting (sample splitting) for the Q-function and ratio (weight) function estimations. However, it is unclear how to perform an efficient cross-fitting due to the existence of temporal dependence among the batch data.
4.4 Lower boundedness for the minimal eigenvalue
In this section, we discuss the quantity defined in Assumption 5(d) in detail. For notational simplicity, we use $\bar{\mathbb{P}}$ and $\bar{\mathbb{E}}$ to denote the probability and expectation with respect to the average visitation distribution of the batch data. We also write the discounted visitation probability with the initial state-action pair fixed at $(s, a)$
as the conditional discounted visitation probability. Define the operator $\mathcal{P}^{\pi}$ by
$$(\mathcal{P}^{\pi} f)(s, a) = \mathbb{E}\Big[ \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, f(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big]$$
for any function $f$ defined on $\mathcal{S} \times \mathcal{A}$, and denote by $\mathcal{I}$ the identity operator.
In Section C.1 of [62], a sufficient condition for the lower boundedness of the minimal eigenvalue is provided. They argue that, under some boundedness conditions on the average visitation probability, the minimal eigenvalue is lower bounded by a constant independent of $K$ as long as $\gamma$ is small enough. (In fact, we show that this sufficient condition can be further relaxed; see Corollary 2 for details.) However, in practice, we may not know the distance between the target policy and the behavior policy in advance, or be able to choose a reasonably small $\gamma$ that reflects the desired emphasis on long-term rewards. More importantly, choosing $\gamma$ close to 1 is often preferred in many applications, as we discussed before. Despite its importance, the theoretical property of non-parametric OPE under this setting is largely uncharted territory in the current state of the literature. As a result, it is important to understand the behavior of the minimal eigenvalue for any $\gamma \in [0, 1)$ and target policy $\pi$.
In the following, we focus on the general operator $\mathcal{I} - \gamma \mathcal{P}^{\pi}$ and study its (squared) minimal eigenvalue:
(33) |
To see the relationship between this quantity and the eigenvalue bound in Assumption 5(d), we can take $f$ to be a linear combination of the basis functions $\xi_1, \dots, \xi_K$ in (33). Then we have
In Theorem 7, we provide a necessary and sufficient condition for bounding the minimal eigenvalue of the operator that works for any $\gamma \in [0, 1)$ and $\pi$. When combined with the standard eigenvalue condition on the basis $\xi_1, \dots, \xi_K$, it allows a comprehensive characterization of the lower boundedness of the quantity in Assumption 5(d).
Theorem 7.
The quantity defined in (33) is lower bounded by a positive constant if and only if
In this case, an explicit lower bound can be obtained.
Theorem 7 allows us to avoid directly analyzing the minimal eigenvalue of the operator, and instead to study the upper bound of the quantity appearing in the necessary and sufficient condition, which is much easier to deal with. We can view this quantity as a criterion for examining the difficulty of OPE problems: when it gets larger, the OPE problem becomes more difficult. Its value depends on many components, including the target policy $\pi$, the behavior policy $\pi^{b}$, the discount factor $\gamma$ and the horizon $T$ of the observed data. Next, we provide some sufficient conditions for it to be upper bounded. Take
Assumption 7.
Suppose the following conditions hold.
-
(a)
The average visitation probability (density) of the batch data is lower bounded by a constant $p_{\min} > 0$ and upper bounded by a constant $p_{\max} < \infty$.
-
(b)
The transition probability (density) under the target policy $\pi$ is upper bounded by some constant for all state-action pairs.
Corollary 2.
If there exists a constant $C > 0$ such that
(34)
then we have
In addition, if we assume Assumption 7, then we have
where the involved constant depends on the bounds in Assumption 7. Finally, under an additional condition that
(35)
In Corollary 2, (34) provides a sufficient condition for the criterion to be upper bounded (or, equivalently, for the minimal eigenvalue to be lower bounded) that works for general settings without any restriction on $\gamma$ or the target policy $\pi$. The second part of Corollary 2 provides a specific characterization under Assumption 7, in which we require the average visitation probability to be lower and upper bounded. This coverage assumption is very common in the RL literature [e.g., 54, 30, 62]. Compared to [62], we provide an explicit bound with respect to $\gamma$ and the policies. More importantly, we do not impose any further assumptions on the target policy $\pi$ or the discount factor $\gamma$ in order to show the lower boundedness of the minimal eigenvalue. In other words, we show that such boundedness holds uniformly over any target policy and any $\gamma \in [0, 1)$. We note that there is a parallel work regarding the well-posedness and minimax optimal rates of nonparametric Q-function estimation in OPE [10]. They provide a sufficient condition for the lower boundedness of the minimal eigenvalue. Their result is similar to the one specified in Corollary 2, and requires similar boundedness conditions as in Assumption 7. See Assumption 4(a) and Theorem 1 in [10] for more details. Compared to their bound,
Corollary 2 provides a sharper dependence with respect to $\gamma$ and the bounds in Assumption 7. Apart from Corollary 2, we note that our general result in Theorem 7 does not require any boundedness conditions on the visitation and transition probabilities as in Assumption 7 and [10], and, more importantly, provides a necessary and sufficient condition for the boundedness of the minimal eigenvalue.
5 Simulation Study
We conduct a simulation study to investigate the finite-sample performance of the proposed estimator. We adopt simulation settings similar to those in [42], [39] and [62]. Specifically, the data generative model is given as follows. The state variables are two-dimensional, i.e., $S_t = (S_{t,1}, S_{t,2})^{\top}$, while the action is binary, i.e., $A_t \in \{0, 1\}$. The initial state follows the standard bivariate normal distribution. Each coordinate of the next state is generated as a function of the current state and action plus additive noise, where the noise terms are independent normal random variables with mean 0 and variance 0.25. The behavior policy generates actions independently from a Bernoulli distribution with a fixed mean. The immediate reward is defined as a function of the current action and the resulting transition. We use the initial state distribution as the reference distribution $\mathbb{G}$ and fix the discount factor $\gamma$. We evaluate the following four different target policies.
-
(a)
This is the “always-treat” policy used in the simulation study of [39], where the chosen action is always 1, and does not depend on the state variable.
-
(b)
This policy is a discontinuous function with respect to the state variable. The same type of policy is used in the simulation study of [62].
-
(c)
This policy is smooth with respect to the state variable.
-
(d)
This policy is the same as the behavior policy. Note that the observed data only contain a finite horizon of decision points, i.e., $T < \infty$, while the target here is the policy value under the infinite horizon.
For each target policy, we consider four different combinations of $n$ and $T$. The true policy values were computed approximately by the Monte Carlo method. Specifically, for each target policy, we simulated a large number of independent trajectories of sufficient length, with initial states drawn from the reference distribution $\mathbb{G}$, and approximated the true value by the average of the empirical discounted cumulative rewards.
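A minimal sketch of this Monte Carlo approximation is given below; the environment transition function, the policy interface and the truncation horizon are hypothetical stand-ins rather than the exact generative model and settings used in the simulation study.

```python
import numpy as np

def mc_policy_value(step, policy, init_sampler, gamma, n_traj=10000, horizon=1000, seed=0):
    """Monte Carlo approximation of the policy value: average truncated discounted return.

    step(state, action, rng) -> (next_state, reward)   # hypothetical environment dynamics
    policy(state, rng)       -> action                  # target policy to evaluate
    init_sampler(rng)        -> initial state drawn from the reference distribution G
    """
    rng = np.random.default_rng(seed)
    values = np.empty(n_traj)
    for i in range(n_traj):
        s, ret, disc = init_sampler(rng), 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite-horizon sum
            a = policy(s, rng)
            s, r = step(s, a, rng)
            ret += disc * r
            disc *= gamma
        values[i] = ret
    return values.mean()
```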
Due to space limitations, the implementation details of the proposed method (ProjBalance) are reported in Section S1.2 of the Supplementary Material. For comparison, we include the following estimators: (1) VL: the estimator from [42]; (2) SAVE: the estimator from [62]; (3) FQE: the fitted Q-evaluation estimator developed in [36], where the regression problems are solved using random forest models; (4) IS: the importance sampling estimator from [54]; (5) MINIMAX: the minimax weight learning method from [68]; (6) DR: the double reinforcement learning method considered in [28]. As suggested by [42], we implemented VL with Gaussian basis functions since they offer the highest flexibility. For SAVE, we use the same basis functions as in the proposed weighted estimator to estimate the Q-function. Note that, to compute the confidence interval for the proposed weighted estimator, we use the estimate of the Q-function from SAVE. The construction of our confidence interval can be found in Section S1.2 of the Supplementary Material. The implementation details of FQE, IS, MINIMAX and DR can be found in [60].
Table 1 shows the mean squared error (MSE) and median squared error (MeSE) of the above estimators, as well as the empirical coverage probabilities (ECP) and average lengths (AL) of their confidence intervals, over 500 simulated data sets for one combination of $n$ and $T$; the results for the other three combinations can be found in Section S1.3 of the Supplementary Material. Overall, the proposed estimator (ProjBalance) shows competitive performance. Other weighted estimators such as IS and MINIMAX suffer from instability issues and produce inferior results compared with the other methods. Also, DR is influenced by the unstable weights from IS and does not perform well in these settings.
Specifically, for Scenario (a), VL in general has the smallest MSE and MeSE. The performances of ProjBalance and VL are close, while those of SAVE and FQE are worse. For Scenario (b), SAVE is the best when the sample size is relatively large, while the performance of ProjBalance is close to that of SAVE. However, when $n$ and $T$ are relatively small, SAVE produces some extreme estimates, as seen from the notable difference between its MSE and MeSE. See Tables 1 and S2 in the Supplementary Material. In contrast, the results of ProjBalance remain stable for small $n$ and $T$, and have comparable performance to the best estimator FQE in these settings. As for Scenario (c), ProjBalance and VL perform similarly, and are better than SAVE and FQE. Under Scenario (d), ProjBalance always has the smallest MSE and MeSE. As for ECP and AL, it seems that the confidence intervals of ProjBalance tend to have lower coverage than the target level of 95%, especially for some target policies. We hypothesize that this under-coverage phenomenon is due to the regularization of the weights in (25a), which affects the variance estimation in (32). Since this is beyond the scope of the paper, we leave it for future study. In practice, we recommend multiplying the length of our confidence interval (CI) by an inflation factor in order to relieve this under-coverage issue. In this simulation study, we choose a constant factor to obtain adjusted intervals. The ECPs and ALs of the adjusted confidence intervals are provided in all tables (reported after the slash). In general, our method performs robustly and satisfactorily in terms of the coverage and average length of the confidence interval. In the Supplementary Material, we also evaluate the performance of the above methods on the Cartpole environment, which can be obtained from OpenAI Gym [5] and has been frequently considered in the computer science literature. Overall, the performance of ProjBalance is appealing; see Section S1.5 of the Supplementary Material for more details.
| Target | Method | MSE (SE) | MeSE | ECP | AL (SE) |
|---|---|---|---|---|---|
| (a) | ProjBalance | 8.76 (0.527) | 3.75 | 0.96 / 0.99 | 38.93 (0.258) / 46.71 (0.310) |
|  | VL | 7.71 (0.461) | 3.57 | 0.96 | 36.25 (0.159) |
|  | SAVE | 10.2 (0.614) | 4.32 | 0.96 | 42.28 (0.932) |
|  | FQE | 10.99 (0.744) | 4.40 | _ | _ |
|  | minimax | 34.53 (3.184) | 14.04 | _ | _ |
|  | DR | 151.87 (16.670) | 22.84 | _ | _ |
|  | IS | 1040.89 (270.752) | 53.94 | _ | _ |
| (b) | ProjBalance | 4.29 (0.356) | 1.75 | 0.91 / 0.94 | 23.57 (2.130) / 28.28 (2.556) |
|  | VL | 6.61 (0.369) | 3.50 | 0.84 | 24.18 (0.142) |
|  | SAVE | 36.90 (33.300) | 1.52 | 0.94 | 439.90 (415.277) |
|  | FQE | 3.93 (0.242) | 1.87 | _ | _ |
|  | minimax | 120.92 (1.276) | 118.98 | _ | _ |
|  | DR | 92.33 (6.212) | 16.283 | _ | _ |
|  | IS | 1111.542 (100.934) | 327.248 | _ | _ |
| (c) | ProjBalance | 2.41 (0.153) | 1.06 | 0.93 / 0.97 | 18.47 (0.125) / 22.17 (0.150) |
|  | VL | 2.45 (0.162) | 1.14 | 0.98 | 23.10 (0.158) |
|  | SAVE | 2.72 (0.173) | 1.18 | 0.94 | 21.43 (1.717) |
|  | FQE | 4.66 (0.307) | 2.30 | _ | _ |
|  | minimax | 37.11 (1.415) | 31.21 | _ | _ |
|  | DR | 49.20 (34.761) | 2.36 | _ | _ |
|  | IS | 38.12 (4.839) | 13.97 | _ | _ |
| (d) | ProjBalance | 0.67 (0.044) | 0.32 | 0.93 / 0.97 | 9.70 (0.021) / 11.64 (0.025) |
|  | VL | 0.77 (0.053) | 0.34 | 1.00 | 17.00 (0.031) |
|  | SAVE | 0.72 (0.047) | 0.35 | 0.93 | 10.03 (0.013) |
|  | FQE | 1.70 (0.103) | 0.81 | _ | _ |
|  | minimax | 1.62 (0.104) | 0.73 | _ | _ |
|  | DR | 1.35 (0.087) | 0.61 | _ | _ |
|  | IS | 4.78 (0.290) | 2.27 | _ | _ |
6 Real Data Application
In this section we apply the methods mentioned in Section 5 that provide confidence intervals (ProjBalance, VL and SAVE) to the OhioT1DM dataset [44] obtained from Ohio University. This dataset contains approximately eight weeks' records of CGM blood glucose levels, insulin doses and self-reported life-event data for each of six subjects with type 1 diabetes. Following [62], we divide the trajectory of every subject into segments of three-hour spans and construct the state variable as follows. The first state component is the average CGM blood glucose level over the three-hour interval. The second component is constructed from the subject's self-reported meal times and the corresponding carbohydrate estimates; more specifically, it is a decay-weighted sum of the carbohydrate estimates for the reported meals, with a pre-specified 5-minute decay rate. The third component is the average basal insulin rate during the three-hour interval. The action variable is binarized according to the amount of insulin injected: we set the action to 1 when the total amount of insulin delivered to the subject during the three-hour interval is larger than one unit, and to 0 otherwise. In the data, the time series of glucose levels and the life-event data do not perfectly overlap within the same time frame, so we remove several boundary points of the time series to ensure that all state variables have records in the same time frame. According to the Index of Glycemic Control (IGC) [58], the immediate reward is constructed as
Similar to the simulation study, we fix the discount factor $\gamma$. In addition, we study six reference distributions, each taken as the point mass at the initial state of one subject. We evaluate the aforementioned three estimators under two policies, the always-treat policy and the never-treat policy, since the estimated optimal policy based on [62] is very close to the always-treat policy. A similar discovery has also been observed in [42]. Therefore, we expect the estimated value of the always-treat policy to be larger than that of the never-treat policy.
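To illustrate the construction of the second state component, the sketch below computes a decay-weighted carbohydrate feature for a single three-hour interval; the per-5-minute decay value and the exact weighting convention are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def carb_feature(meal_times_min, carb_estimates, interval_end_min, decay=0.9):
    """Decay-weighted sum of carbohydrate estimates for meals reported up to the end of an interval.

    meal_times_min   : self-reported meal times, in minutes.
    carb_estimates   : carbohydrate estimates for the corresponding meals.
    interval_end_min : end time of the three-hour interval, in minutes.
    decay            : hypothetical per-5-minute decay rate.
    """
    t = np.asarray(meal_times_min, dtype=float)
    c = np.asarray(carb_estimates, dtype=float)
    elapsed_5min = (interval_end_min - t) / 5.0                        # elapsed time in 5-minute units
    weights = np.where(elapsed_5min >= 0, decay ** elapsed_5min, 0.0)  # ignore meals after the interval
    return float(np.sum(weights * c))
```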
Figure 1: Point estimates and associated confidence intervals of the values of the always-treat and never-treat policies for the six subjects, based on ProjBalance, VL and SAVE.
Figure 1 shows the point estimates and associated confidence intervals of the values under the always-treat and never-treat policies for ProjBalance, VL and SAVE. To adjust for the under-coverage issue we observe in the simulation study, we also calculate the adjusted confidence interval for ProjBalance. In most scenarios, it can be observed that the estimated values under the always-treat policy are larger than those under the never-treat policy, which aligns with our expectation. Note that the results of VL do not show a distinctive difference between the two policies compared with the other methods. For SAVE, due to an invertibility issue (the matrix involved in [62] is numerically low-rank under the never-treat policy), the confidence interval for the never-treat policy is surprisingly large. Even though the confidence intervals of our proposed estimator rely on the Q-function estimated by SAVE (where the point estimate of the Q-function does not involve inverting the problematic matrix), the confidence interval does not inherit the instability issue.
7 Acknowledgements
Wong’s research was partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.
Supplement to “Projected State-action Balancing Weights for Offline Reinforcement Learning”
S1 Additional Implementation Details and Numerical Results
S1.1 Algorithm
We present the algorithm outline for the proposed estimator in Algorithm 1.
S1.2 Implementation of the Proposed Estimator
In this subsection, we discuss the implementation details in the simulation study. For the proposed estimator (ProjBalance), the basis functions were constructed as follows. For the settings described in Section 5, we adopted tensor-product basis functions for the two-dimensional state variables. The one-dimensional components were cubic B-spline sets with internal knots placed at equally spaced sample quantiles of the state variables. To avoid extrapolation of the basis functions, we placed three repeated boundary knots. The number of one-dimensional basis functions was determined by the fact that there should be at least 4 one-dimensional basis functions for constructing a cubic B-spline. Note that the number of tensor products of spline bases grows exponentially as the dimension of the state space increases. We recommend reducing the interaction order of the tensor products or using kernel bases in higher-dimensional settings, while the number of basis functions can still be chosen as stated before. In the simulation study for the Cartpole datasets in Section S1.5, we did not include any interactions of tensor products among the four state variables, for the sake of simplicity. For the approximate projection step (24), we used kernel ridge regression with a Gaussian kernel, where the bandwidth parameter in the Gaussian kernel was chosen by the median heuristic (see Section 2.2 in [18]). We adopted 5-fold cross-validation to tune the regularization parameter $\mu$. Note that we only needed one $\mu$ for all projections, so we took an average of the "standardized" validation errors from the projection steps for all basis functions, where the validation errors were standardized by the empirical norm of the corresponding basis function, as our validation criterion. For the weight estimation (25), we chose a quadratic $\rho$. As $\mathbb{G}$ is known to us, we used the Monte Carlo method to obtain the right-hand side of the balancing constraints. To reduce the amount of tuning, we took the tolerances $\delta_k$ proportional to a single parameter and selected the minimal value that had a valid solution to (25).
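The sketch below illustrates two ingredients described above, under our own conventions rather than the paper's exact code: a clamped cubic B-spline basis with internal knots at sample quantiles (combined into a tensor product over the two state coordinates), and the median-heuristic bandwidth for the Gaussian kernel used in the projection step. In practice, each state basis function would additionally be paired with action indicators to form $\xi_k(s, a)$.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis_1d(x, x_train, n_basis, degree=3):
    """Clamped cubic B-spline basis with internal knots at equally spaced sample quantiles of x_train."""
    n_internal = n_basis - degree - 1                       # requires n_basis >= degree + 1
    probs = np.linspace(0.0, 1.0, n_internal + 2)[1:-1]
    internal = np.quantile(x_train, probs)
    lo, hi = float(np.min(x_train)), float(np.max(x_train))
    knots = np.r_[[lo] * (degree + 1), internal, [hi] * (degree + 1)]
    x_clipped = np.clip(x, lo, hi)                          # avoid extrapolation outside the training range
    return BSpline.design_matrix(x_clipped, knots, degree).toarray()

def tensor_spline_basis(S, S_train, n1, n2):
    """Tensor product of the two one-dimensional spline bases for a two-dimensional state."""
    B1 = bspline_basis_1d(S[:, 0], S_train[:, 0], n1)
    B2 = bspline_basis_1d(S[:, 1], S_train[:, 1], n2)
    return np.einsum("ij,ik->ijk", B1, B2).reshape(len(S), -1)

def median_heuristic_bandwidth(X):
    """Median of pairwise Euclidean distances, a common default bandwidth for Gaussian kernels."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return float(np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)])))
```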
In the following, we provide the details for constructing the confidence interval for ProjBalance. Take as the estimator for obtained by SAVE in [62] and let
Calculate
Then the confidence interval is given by
where and is the -quantile of the standard normal distribution.
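For readers who wish to reproduce this interval numerically, a minimal sketch of a plug-in normal confidence interval of this type is given below; it assumes the variance estimate is scaled by the square root of the total number of transitions, which is our reading of the asymptotic normality result rather than a verbatim transcription of the formula above.

```python
import numpy as np
from scipy.stats import norm

def normal_ci(v_hat, sigma_hat, n_traj, n_dec, alpha=0.05):
    """Plug-in (1 - alpha) confidence interval of the form
       v_hat +/- z_{1 - alpha/2} * sigma_hat / sqrt(N * T)."""
    z = norm.ppf(1 - alpha / 2)
    half_width = z * sigma_hat / np.sqrt(n_traj * n_dec)
    return v_hat - half_width, v_hat + half_width

# Example call with hypothetical inputs:
# lower, upper = normal_ci(v_hat=1.2, sigma_hat=3.4, n_traj=25, n_dec=300)
```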
S1.3 Additional Simulation Results
Tables S2, S3 and S4 show the simulation results under the target policies and data generation process described in Section 5, where , and respectively.
Target | methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|---|
ProjBalance | 17.32 (1.064) | 8.17 | 0.96 / 0.99 | 66.30 (6.145) / 79.56 (7.374) | |
VL | 15.43 (0.942) | 7.08 | 0.97 | 54.15 (0.472) | |
SAVE | 272.36 (178.185) | 11.06 | 0.96 | 637.96 (337.450) | |
FQE | 23.58 (1.431) | 10.74 | _ | _ | |
minimax | 46.27 (3.544) | 16.75 | _ | _ | |
DR | 855.00 (178.345) | 82.56 | _ | _ | |
IS | 945.92 (234.336) | 53.75 | _ | _ | |
ProjBalance | 10.83 (0.874) | 3.17 | 0.87 / 0.91 | 35.26 (4.682) / 42.31 (5.619) | |
VL | 12.33 (0.721) | 5.65 | 0.88 | 36.63 (0.364) | |
SAVE | 310.41 (301.139) | 3.25 | 0.94 | 918.00 (874.044) | |
FQE | 23.58 (1.431) | 10.74 | _ | _ | |
minimax | 41.52 (1.973) | 28.31 | _ | _ | |
DR | 60.20 (5.623) | 16.30 | _ | _ | |
IS | 2973.88 (1803.384) | 70.509 | _ | _ | |
ProjBalance | 4.62 (0.299) | 1.899 | 0.92 / 0.95 | 27.67 (0.984) / 33.20 (1.181) | |
VL | 4.50 (0.305) | 1.89 | 0.98 | 33.04 (0.239) | |
SAVE | 15.87 (8.981) | 2.35 | 0.94 | 669.62 (636.178) | |
FQE | 8.687 (0.604) | 3.69 | _ | _ | |
minimax | 45.51 (1.710) | 39.61 | _ | _ | |
DR | 68.18 (21.377) | 3.98 | _ | _ | |
IS | 34.02 (4.214) | 12.45 | _ | _ | |
ProjBalance | 1.25 (0.087) | 0.49 | 0.94 / 0.98 | 13.61 (0.074) / 16.33 (0.089) | |
VL | 1.63 (0.111) | 0.66 | 0.99 | 24.47 (0.070) | |
SAVE | 1.40 (0.096) | 0.53 | 0.95 | 14.68 (0.119) | |
FQE | 3.46 (0.224) | 1.65 | _ | _ | |
minimax | 3.26 (0.202) | 1.50 | _ | _ | |
DR | 2.48 (0.157) | 1.15 | _ | _ | |
IS | 4.67 (0.281) | 2.28 | _ | _ |
Target | methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|---|
ProjBalance | 10.13 (0.661) | 4.61 | 0.95 / 0.97 | 38.95 (0.175) / 46.74 (0.211) | |
VL | 8.84 (0.562) | 3.66 | 0.95 | 38.38 (0.178) | |
SAVE | 11.55 (0.766) | 4.90 | 0.94 | 40.58 (0.183) | |
FQE | 3.46 (0.224) | 1.65 | _ | _ | |
minimax | 38.08 (3.330) | 17.80 | _ | _ | |
DR | 181.51 (24.850) | 28.07 | _ | _ | |
IS | 4.67 (0.281) | 2.28 | _ | _ | |
ProjBalance | 3.88 (0.279) | 1.76 | 0.92 / 0.97 | 21.53 (0.120) / 25.83 (0.144) | |
VL | 5.06 (0.293) | 2.46 | 0.94 | 25.18 (0.145) | |
SAVE | 3.42 (0.199) | 1.77 | 0.96 | 23.55 (0.512) | |
FQE | 3.88 (0.214) | 2.04 | _ | _ | |
minimax | 36.48 (1.556) | 26.28 | _ | _ | |
DR | 28.51 (2.915) | 7.08 | _ | _ | |
IS | 2951.59 (2613.528) | 62.73 | _ | _ | |
ProjBalance | 2.33 (0.134) | 1.09 | 0.94 / 0.98 | 18.36 (0.1) / 22.03 (0.120) | |
VL | 2.33 (0.137) | 1.18 | 0.98 | 23.75 (0.122) | |
SAVE | 2.74 (0.161) | 1.26 | 0.94 | 19.56 (0.177) | |
FQE | 4.96 (0.322) | 2.10 | _ | _ | |
minimax | 32.12 (1.150) | 25.85 | _ | _ | |
DR | 15.64 (5.164) | 2.35 | _ | _ | |
IS | 22.39 (4.263) | 7.05 | _ | _ | |
ProjBalance | 0.75 (0.042) | 0.41 | 0.92 / 0.98 | 9.81 (0.018) / 11.77 (0.022) | |
VL | 0.88 (0.056) | 0.37 | 1 | 17.32 (0.033) | |
SAVE | 0.79 (0.045) | 0.43 | 0.93 | 10.19 (0.107) | |
FQE | 1.66 (0.713) | 0.66 | _ | _ | |
minimax | 1.46 (0.084) | 0.73 | _ | _ | |
DR | 1.25 (0.076) | 0.62 | _ | _ | |
IS | 2.27 (0.144) | 1.10 | _ | _ |
Target | methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|---|
ProjBalance | 5.43 (0.353) | 2.31 | 0.92 / 0.97 | 27.2 (0.0679) / 32.58 (0.081) | |
VL | 5.09 (0.318) | 2.01 | 0.94 | 26.80 (0.075) | |
SAVE | 5.93 (0.382) | 2.65 | 0.92 | 28.09 (0.117) | |
FQE | 6.21 (0.402) | 2.57 | _ | _ | |
minimax | 27.27 (2.030) | 12.17 | _ | _ | |
DR | 67.29 (11.554) | 11.03 | _ | _ | |
IS | 192585.20 (189045.400) | 44.53 | _ | _ | |
ProjBalance | 2.49 (0.204) | 1.09 | 0.87 / 0.93 | 15.05 (0.045) / 18.06 (0.054) | |
VL | 2.92 (0.162) | 1.58 | 0.89 | 16.93 (0.056) | |
SAVE | 1.73 (0.107) | 0.72 | 0.94 | 15.78 (0.053) | |
FQE | 2.10 (0.140) | 0.91 | _ | _ | |
minimax | 33.13 (1.275) | 26.19 | _ | _ | |
DR | 11.98 (1.063) | 4.10 | _ | _ | |
IS | 2951.59 (2613.528) | 62.73 | _ | _ | |
ProjBalance | 1.19 (0.071) | 0.61 | 0.94 / 0.98 | 13.01 (0.048) / 15.61 (0.057) | |
VL | 1.25 (0.071) | 0.73 | 0.98 | 16.42 (0.053) | |
SAVE | 1.34 (0.079) | 0.64 | 0.94 | 13.97 (0.359) | |
FQE | 2.23 (0.174) | 0.89 | _ | _ | |
minimax | 32.78 (0.747) | 30.92 | _ | _ | |
DR | 3.45 (0.966) | 0.86 | _ | _ | |
IS | 29.67 (6.884) | 6.79 | _ | _ | |
ProjBalance | 0.39 (0.023) | 0.21 | 0.92 / 0.97 | 6.900 (0.008) / 8.28 (0.009) | |
VL | 0.44 (0.026) | 0.24 | 1 | 12.12 (0.027) | |
SAVE | 0.41 (0.024) | 0.22 | 0.92 | 7.06 (0.005) | |
FQE | 0.77 (0.046) | 0.38 | _ | _ |
minimax | 0.77 (0.049) | 0.39 | _ | _ | |
DR | 0.62 (0.039) | 0.29 | _ | _ | |
IS | 2.26 (0.143) | 1.02 | _ | _ |
S1.4 Weighted Estimator without Projection and Augmented Projected Balancing Estimator
In this section, we compare the performance of the weighted estimator (balance), whose weights are obtained from (13), with our proposed estimator ProjBalance under the simulation settings described in Section 5. In particular, the same basis functions are adopted for the two weighted estimators in the balancing step. We also include the augmented estimator (aug) based on our procedure in the comparison. The augmented estimator is constructed by
where is estimated by SAVE and , , are obtained from (25). Table S5 shows the performance of the above three estimators when and . As we can see, balance only performs better when the target policy is . Under the other settings, ProjBalance performs much better than balance, especially when the target policy is . As for aug, its performance is very similar to that of SAVE in all four settings.
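For reference, a standard augmented (doubly robust) construction consistent with this description, written in our own notation and offered only as an illustrative form, is
\[
\hat{v}^{\pi}_{\mathrm{aug}} \;=\; (1-\gamma)\,\mathbb{E}_{S\sim\mathbb{G},\,A\sim\pi(\cdot\mid S)}\big[\widehat{Q}(S,A)\big] \;+\; \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\hat{w}(S_{i,t},A_{i,t})\Big\{R_{i,t}+\gamma\sum_{a'}\pi(a'\mid S_{i,t+1})\widehat{Q}(S_{i,t+1},a')-\widehat{Q}(S_{i,t},A_{i,t})\Big\},
\]
where \(\mathbb{G}\) denotes the reference distribution, \(\widehat{Q}\) is the Q-function estimated by SAVE, and \(\hat{w}\) are the weights obtained from (25).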
Target | methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|---|
ProjBalance | 8.76 (0.527) | 3.75 | 0.96 / 0.99 | 38.93 (0.258) / 46.71 (0.310) | |
balance | 4.73 (0.298) | 2.36 | 0.99 / 1.00 | 38.63 (0.127) / 46.36 (0.152) | |
aug | 9.86 (0.585) | 4.31 | 0.95 / 0.99 | 38.81 (0.259) / 46.57 (0.310) | |
ProjBalance | 4.29 (0.356) | 1.75 | 0.91 / 0.94 | 23.57 (2.130) / 28.28 (2.556) | |
balance | 117.23 (1.38) | 111.44 | 0.002 / 0.002 | 19.43 (0.583) / 23.33 (0.699) | |
aug | 36.66 (33.118) | 1.52 | 0.92 / 0.95 | 23.60 (2.166) / 28.32 (2.599) | |
ProjBalance | 2.41 (0.153) | 1.06 | 0.93 / 0.97 | 18.47 (0.125) / 22.17 (0.150) | |
balance | 8.32 (0.32) | 6.79 | 0.45 / 0.57 | 14.95 (0.048) / 17.94 (0.058) | |
aug | 2.62 (0.164) | 1.16 | 0.92 / 0.97 | 18.50 (0.124) / 22.20 (0.148) | |
ProjBalance | 0.67 (0.044) | 0.32 | 0.93 / 0.97 | 9.70 (0.021) / 11.64 (0.025) | |
balance | 0.84 (0.053) | 0.37 | 0.91 / 0.95 | 9.64 (0.010) / 11.56 (0.012) | |
aug | 0.73 (0.048) | 0.34 | 0.92 / 0.96 | 9.69 (0.023) / 11.63 (0.027) |
S1.5 Simulation results for the Cartpole environment
In this section, we compare the performance of various estimators using the CartPole datasets from the OpenAI Gym environment [5]. The state variables in the CartPole environment are four-dimensional, consisting of the cart position, cart velocity, pole angle and pole angular velocity. The action is binary. We made a slight modification to the CartPole environment following [68] and [60]. In summary, we added a small Gaussian noise to the original deterministic transition dynamics and defined a new state-action-dependent reward function. See the details in Section B2.1 of [60]. Following [68] and [60], we first ran a deep Q-network to obtain a near-optimal policy as the target policy (). We then applied an "epsilon-greedy" adjustment with a factor to define our behavior policy, i.e.,
for every . For comparison, we implemented various methods such as fitted-Q evaluation (FQE), minimax and DR following their implementations in [60] and the public code from https://github.com/RunzheStat/D2OPE. We set the discount factor in our evaluations.
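As a small illustration of this adjustment, the sketch below samples actions from a behavior policy that follows the near-optimal target policy with probability 1 - epsilon and otherwise picks one of the two CartPole actions uniformly at random; the function names and the sampling interface are our own assumptions, and the value of epsilon is not specified here.

```python
import numpy as np

def epsilon_greedy_behavior(target_policy, epsilon, seed=0):
    """Behavior policy: follow target_policy(state) with probability 1 - epsilon,
    otherwise choose one of the two CartPole actions uniformly at random."""
    rng = np.random.default_rng(seed)
    def behavior(state):
        if rng.random() < epsilon:
            return int(rng.integers(2))   # uniform draw over the binary action space
        return target_policy(state)       # action suggested by the near-optimal target policy
    return behavior
```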
In Tables S6, S7 and S8, we show the results for the simulated datasets when setting , and , respectively. We fixed and for all three settings. As we can see, ProjBalance has the smallest or the second smallest MSE and MeSE among all the methods. The empirical coverages of ProjBalance are also better than those of the other two methods (SAVE and VL). We did not report confidence intervals for the remaining methods because there are no related theoretical results.
methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|
ProjBalance | 0.17 (0.011) | 0.13 | 0.72 / 0.86 | 0.93 (0.016) / 1.12 (0.02) |
SAVE | 0.49 (0.022) | 0.43 | 0.12 | 0.92 (0.018) |
VL | 32.66 (1.262) | 29.57 | 0 | 1.84 (0.008) |
FQE | 5.15 (0.033) | 5.12 | _ | _ |
minimax | 43.04 (1.122) | 43.24 | _ | _ |
DR | 1.26 (0.042) | 1.11 | _ | _ |
IS | 5581.23 (491.491) | 2757.71 | _ | _ |
methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|
ProjBalance | 0.10 (0.009) | 0.04 | 0.84 / 0.91 | 0.85 (0.014) / 1.02 (0.016) |
SAVE | 0.33 (0.021) | 0.24 | 0.38 | 0.84 (0.014) |
VL | 23.06 (0.547) | 22.5 | 0 | 1.21 (0.003) |
FQE | 4.44 (0.043) | 4.38 | _ | _ |
minimax | 35.28 (1.163) | 33.69 | _ | _ |
DR | 0.76 (0.049) | 0.64 | _ | _ |
IS | 20541.94 (2356.361) | 9295.65 | _ | _ |
methods | MSE () | MeSE () | ECP | AL () |
---|---|---|---|---|
ProjBalance | 0.14 (0.011) | 0.08 | 0.65 / 0.76 | 0.84 (0.031) / 1.01 (0.038) |
SAVE | 0.13 (0.014) | 0.05 | 0.74 | 0.93 (0.124) |
VL | 16.08 (0.326) | 15.9 | 0 | 1.02 (0.001) |
FQE | 3.99 (0.029) | 3.98 | _ | _ |
minimax | 26.94 (0.962) | 25.54 | _ | _ |
DR | 0.87 (0.053) | 0.7 | _ | _ |
IS | 59936.33 (7265.061) | 30726.43 | _ | _ |
S2 Proof
S2.1 Proof in Section 2
Starting from (9), we multiply on both sides and get
(S36) |
Then we integrate both sides of (S36) with respect to and . Next, by a change of measure, we obtain
(S37) |
Then, based on the definition of , we have
(S38) |
S2.2 Technical Proof in Section 3
Here we present the proofs of Lemma 1 and Theorem 2. The proof of Lemma 1 is straightforward. For Theorem 2, we first remove the absolute value notation in (25b) by adding additional constraints, thereby transforming the optimization problem into a standard convex optimization problem with linear constraints. We then make use of the duality results in [67] to show the corresponding statement in Lemma 1.
Proof of Lemma 1.
First, for any function , there exists (independent of ) such that
Re-parametrizing , we have
Let and be the possible values of and respectively. A real-valued function defined on can be identified by a matrix . Similarly, we define . Denote by a vector of ones for any positive integer . To derive , we study the constraint , which is equivalent to
(S39) |
where and are the -th rows of and , respectively. Since the elements of are non-negative and sum to 1 for all , and each of these constraints is restricted to non-overlapping elements of , (S39) imposes linearly independent constraints on . It is easy to see that . Together with the parameter , we can show that .
∎
Proof of Theorem 2.
min
subject to (S44)
From [67], the dual of (S44) is
max
subject to
where , and . In addition, and is the convex conjugate of defined as
where satisfies the first order condition that . Then we obtain that and
Take , and it can be verified that . Then the dual form can be written as
min (S45)
subject to
Suppose , , is an optimal solution of (S45). Take , and as the -th element of and respectively. Next, we show that for .
Suppose that there exists such that and . Then we take , , where is a vector whose -th entry is 1 and all remaining entries are zero. Take . Then
due to the fact that . This contradicts the assumption that is the solution.
Therefore, we can take . It can be verified that . Rewriting (S45) yields the result in Theorem 2. ∎
S2.3 Technical Proof in Section 4.1
To prove Theorem 3, a novel truncation argument for the Markov chains is presented in Lemma S2 to obtain a tight concentration bound in the scaling of and . Specifically, we truncate each trajectory into two parts in the scaling of . Informally, for the first part, the truncation threshold should be small enough that we can borrow the proof techniques developed for standard i.i.d. settings in [38] without losing too much in the upper bound. For the remaining part, the idea is that, as the chain grows, its distribution becomes "exponentially" close to the stationary distribution according to the mixing condition in Assumption 4(a) (or 4'(a)). We first develop a delicate peeling argument based on [14] to bound the empirical process under the stationary distribution, which allows us to achieve the desired order in the scaling of both and . It then remains to bound the difference between the stationary distribution and the distribution after truncation. By carefully choosing the truncation threshold, we are able to balance the upper bounds from the two parts and obtain the desired rate of convergence. See the details in Lemma S2.
Proof of Theorem 3.
Take . In the following, we write and as and in short. We define
as the empirical expectation of some function with an input , which represents random sampled trajectories from some Markov chains.
To proceed with the proof, we decompose the error as
Since for all due to the optimizing property of , the last term above can be simplified as
As a result, we have
For , we define the following two functions:
With these notations, we know that
In the following, we decompose
where
For the first term, the optimizing property of implies that
Thus, holds for all .
Next we derive the uniform bound of over all . For simplicity, take
(S46) |
for any and .
And take , where
(S47) |
Then
By Lemma S2, we are able to show that with probability at least ,
where , and the leading constant depends on . Taking and , the result of this theorem then follows.
∎
Corollary 1 is a direct application of Theorem 3. It utilizes the uniformity of Theorem 3 and the construction of kernel ridge regression estimators. Before presenting our proof, we state our assumption for Corollary 1 below.
Assumption 4’.
The following conditions hold.
-
(a)
The Markov chain has a unique stationary distribution with density over and and is geometrically ergodic, i.e., there exists a function and constant such that, for any and ,
where denotes the total variation norm, and is the distribution of conditioned on and , under the behavior policy. Also, there exists a constant such that , where recall that is the initial distribution of .
-
(b)
The function class satisfies that for all .
-
(c)
The function class in (24) is a subset of an RKHS whose corresponding RKHS norm is . In addition, is a star-shaped set with center 0 and it satisfies that for all and for all .
-
(d)
The regularization functional in (24) is taken as . Let and . There exist constants and , such that for any ,
Proof of Corollary 1.
Under the conditions stated in Corollary 1, by directly applying Theorem 3 with , we can show that with probability at least ,
Then it suffices to show that
(S48) |
and
(S49) |
Note that (S48) is due to the definition of . Next, we verify (S49).
Denote the reproducing kernel of by . Write , , and as the Gram matrix. Take
Then
If we keep the tuning parameter the same when approximating all , then for any ,
where , . Then is the solution for approximating using the same tuning parameter . Therefore we have with probability at least ,
∎
S2.4 Technical Proof in Section 4.2
In this section, we present the proof of Theorem 4, whose proof generalizes arguments in [13] and [73]. First, we show the convergence of to . This requires a more delicate decomposition than those in [13] and [73] due to the additional projection error from Theorem 3. In addition, by a truncation argument similar to the one discussed in Section S2.3, we develop a tight matrix concentration inequality for independent Markov chains. After obtaining the convergence of to , the convergence of the weights can be derived from the results in Lemma 1 combined with Assumptions 5(a) and (c).
Proof.
Recall that is the solution to (27) and is the coefficient that leads to the best approximation to the true ratio function with basis . By Assumption 5(a), we can show that . In the following, we study the convergence of . We aim to prove
(S50) |
Define the objective function . In order to show the above bound in (S50), by the continuity and convexity of , it suffices to show that, with high probability,
where . Here is some appropriate constant.
First, we take the following decomposition:
(S51)
(S52)
where is a value between and .
Define . Now we focus on the following term:
(S53)
(S54)
(S55)
(S56)
where is a value between and . In the following, we show how to control (S53)-(S56) one by one.
-
•
For (S53),
(S57) The first inequality is based on Cauchy–Schwarz inequality and Assumption 5(a). For the last inequality, we apply Lemma S3 which yields
Next, we show how to bound in (S57). Note that
Then we apply the matrix concentration inequality developed in Lemma S4 to bound it. Specifically, we take in Lemma S4. By Assumption 5(a) and (c), we can show that . Therefore, Lemma S4 implies that
Thus we can show that
(S58) -
•
For (S54),
where the first equality is given by Cauchy-Schwarz inequality and the second one is due to Assumption 5(a) and the definition of . As , one can show that . Due to Theorem 3 and Assumption 5(e), we have
Therefore,
(S59) where is some positive constant.
-
•
For (S55), note that
due to Assumption 5(e). Similarly, we can show that
As by assumption, . Since is a strictly positive and continuous function, there exists a constant such that . We then have
(S60) where is some positive constant. The first inequality of the above arguments is given by Cauchy-Schwarz inequality. The second one uses results in Theorem 3 based on Assumption 5(e) so that the upper bound for is obtained. We use Lemma S3 in the third inequality, while the last inequality is obtained due to Assumption 5(c).
- •
Substituting (S58), (S59), (S60) and (S61) into (S53), (S54), (S55) and (S56) respectively, we have
(S62) |
Next, we show how to bound
First, note that for any vector such that , we have
(S63) |
where the last inequality can be derived by the same arguments in proving (S54).
Then we have
The first inequality is due to (S63). The second inequality is by Lemma S3. For the third inequality, the condition that implies and .
Now, returning to (S52), we can show that
where the last inequality is due to the condition of specified in Assumption 5(g). As long as for some large enough constant , with high probability,
Therefore we have proved (S50).
Finally, we are ready to show the convergence of given below.
The second inequality is based on Assumption 5(a) and the mean value theorem. For the third inequality, we apply the mean value theorem and Assumption 5(b) again, using arguments similar to those for proving (S55) to show the boundedness of . Assumption 5(e) is used to obtain the desired order. The fifth inequality results from Assumption 5(c), and the last inequality is due to (S50).
Noting that
by Lemma S3, we can then use a similar argument to bound , which completes the proof. ∎
S2.5 Technical Proof in Section 4.3
In this section, we provide the proof of Theorem 6, which establishes the efficiency of our weighted estimator. In the proof, we first decompose the estimation error of relative to the true . Based on this decomposition, we show that the convergence results for the projection step and the final weights, together with the approximation error to , yield the desired convergence rate of . Finally, the efficiency of our estimator follows.
Proof of Theorem 6.
We mainly prove Result (ii). Result (i) can be derived through similar arguments but under a different condition for . Thus, we omit that for brevity. To start with, we derive the following decomposition
In the following, we analyze (I)-(VI) components separately.
- •
- •
- •
-
•
For (V), we will control it by the convergence of weights and also the magnitude of .
due to conditions for in Theorem 6(ii).
- •
-
•
For (II), recall that the true weight function by the assumption in Theorem 6(ii). This is the key to bound (II). Take a function in optimization problem (24). In addition, it can be seen that by the structure of the optimization problem. Let .
Due to the optimization condition for , we have
Recall that . Here, we abuse the notation slightly and denote by , i.e., the inner product with respect to the RKHS specified in Assumption 4’(b). Then we have
Now, we can see that the term multiplied by is exactly term (II) that we aim to bound. Therefore, it is sufficient to bound and . Since we require that , . As for term , recall that the error . Then we have
Next, we use Freedman's inequality [17] to bound the second term in (i). For any integer , let and be the quotient and remainder of divided by , satisfying and . Let . Then we iteratively define as follows:
Take . From the definition of , we can show that
Then , forms a martingale with respect to the filtration , where stands for the -algebra generated by .
Note that errors , , are bounded by . Then we can verify that by Assumptions 5(a) and 6(a). Now, we are able to apply Theorem 1.6 in [17]. For all , we have
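For reference, the standard form of Freedman's inequality (our restatement of Theorem 1.6 in [17]) is as follows: if \(S_n=\sum_{k\le n}d_k\) is a martingale whose increments satisfy \(|d_k|\le R\) almost surely, and \(V_n=\sum_{k\le n}\mathbb{E}[d_k^2\mid\mathcal{F}_{k-1}]\) denotes its predictable quadratic variation, then for all \(a,b>0\),
\[
\Pr\big(S_n\ge a \text{ and } V_n\le b \text{ for some } n\big)\;\le\;\exp\!\Big(-\frac{a^2}{2(b+Ra)}\Big).
\]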
As such, we can derive
Then we can show that
(S64)
Combining all the bounds from to , we have
Taking
we see that
forms a mean zero martingale with respect to the filtration .
Then by the martingale central limit theorem [45], we can show the asymptotic normality; i.e., as long as either or , we have
(S65) |
∎
S2.6 Technical Lemmas
In this section, we provide two technical lemmas (Lemmas S2 and S4) that involve truncation arguments for Markov chains. They are developed to obtain sharp convergence rates that depend on both the sample size and the length of the trajectories . In particular, Lemma S2 uses empirical process arguments and is used for Theorem 3. Lemma S4 generalizes the standard matrix Bernstein inequality to independent Markov chains and is used for Theorem 4.
Lemma S2.
Suppose Assumptions 4 hold. For
where and are defined in (S47) and (S46), respectively. Then with probability at least , for any , we have
where , and the leading constant depends on .
Proof of Lemma S2.
To deal with the scenario that the initial distribution is possibly not the stationary distribution , we decompose into three components.
Take , where is a constant to be specified later, , . Take as the empirical evaluation of . We denote by and the expectation and probability under the stationary distribution, respectively.
where
and
For some fixed ,
Then we bound these three probabilities one by one. Note that when , we do not need to take into account components III(B) and IV(B). So in the following, we analyze III(B) and IV(B) under the condition that .
-
•
For II(B):
Take
In the following, we verify the conditions (A1-A4) in Theorem 19.3 in [20] with , and .
It’s easy to verify for (A1). For any ,
(S66) |
and therefore
For (A2), recall and thus
For the first term of RHS above:
and the second term:
where we use the fact that . Putting together, we can show that
(S67) |
where . Therefore,
To verify that Condition (A3) holds for every , it suffices to ensure the inequality holds when , i.e.,
(S68)
Next we verify (A4). We first consider the function class . It is not hard to verify that with ,
As a result of the entropy condition in Assumption 4
(S69) |
Now the Condition (A4) is satisfied if
for all and any , where is a constant depending on . And the above inequality holds as long as
for some constant depending on .
Now, we are able to show that when , , ,
where are some constants depending on . Then, fixing some , we can verify that there exists a constant , such that by taking ,
Combining all the conditions on , there exists a constant , such that
-
•
For III(B):
Next, we will focus on the term , which is easier to deal with, given the stationary condition.
Take . Since the Markov chains are stationary, we can write
Similarly to how we handled II(B), we bound
Here
We use the independent block techniques [75] and the peeling device with the exponential inequality for the relative deviation of the empirical process developed in [16].
Next, we bound each term of the above probabilities by using the independent block technique. We define a partition by dividing the index into blocks, where each block has an equal length . The residual block is denoted by , i.e., and . Then it can be seen that and the cardinality .
For each , we will use a different independent block sequence denoted by with the residual and then optimize the probability bound by properly choosing and . More specifically, we choose
where and with some positive constants and determined later. We require . We also need so that . Suppose is sufficiently large such that
(S71) |
In the following, we consider two cases. The first case considers any such that . In this case, since , we can show that . Combining with the sample size requirement, we can obtain that . Then we can show that in this case,
Therefore, when and ,
The second case that we consider is when . Under the geometric ergodicity assumption, it follows from Theorem 3.7 of [4] that the stationary Markov chain is exponentially -mixing. The -mixing coefficient at time lag satisfies for and . We apply the relative deviation concentration inequality for exponentially -mixing stationary processes given in Theorem 4 of [16], which combines results in [75] with Theorem 19.3 in [20]. To use their results, it then suffices to verify Conditions (C1)-(C5) in Theorem 4 of [16] with , and , to get an exponential inequality for each term in the summation. First of all, (C1) and (C2) have been verified in (S66) and (S67).
To verify Condition (C3), without loss of generality, we assume . Otherwise, let . Then we know that since . We need to verify , or it suffices to have since by definition. Recall that and . To show this, it is enough to verify that
since . Recall that , then it is sufficient to let so that the above inequality holds for every .
Next we verify (C4) that . Recall that . So if for some positive constant depending on , we can have
In addition, .
Lastly, we verify Condition (C5). Using arguments similar to those in verifying (S69), we can show that
Then Condition (C5) is satisfied if the following inequality holds for all ,
where is a constant depending on .
It is enough to guarantee that
After some algebra, we can check that the above inequality holds if for some constant ,
or equivalently, for some constant ,
by the definition that and . To summarize, if for any ,
then the entropy inequality in Condition (C5) above holds. Since , the right-hand side is a non-increasing function of . Then as long as,
Condition (C5) holds.
To summarize, Conditions (C1)–(C5) in Theorem 4 in [16] with , and hold for every when for some constant and . Thus when , there exists a constant , such that
where the last inequality follows from the fact that the Markov chain is stationary and exponentially -mixing. When , we have by using and . This further implies that .
Then we will have
where , , , are some positive constants.
For some fixed , we can verify that there exists a constant , such that for .
Substituting this probability into (S70) and combining all the conditions over , we can derive that there exists a constant , such that
-
•
For IV(B):
Recall that given the condition that . Then, as long as we take , for sufficiently large , we have
and
due to the conditions of .
Combining all the results we have derived for II(B), III(B) and IV(B), we have that with probability at least , for any ,
Recall that we require ; consider any and pick any value between and . Then we can simplify the bound into
∎
Proof.
Recall that .
By Assumption 5(c), we can show that
.
Next, let us consider two cases.
-
•
Case 1: is fixed, i.e., .
Take . From above, we can see that for any .
-
•
Case 2: .
Take as the stationary distribution and as the expectation taken over the stationary distribution. Take . From Assumption 5(c), we know that . It can be also verified that for any , and
For any fixed , by Theorem 4.2 of [9], there exists some constant , such that for any and integer ,
where . Suppose . Notice that . It follows that
Since , set , we obtain . Set with some appropriate constant . From the condition that , the following event occurs with probability at least ,
Note that
where the first inequality is due to the Cauchy–Schwarz inequality and the second inequality is due to Assumption 5(c).
Then we have with probability at least ,
Then with probability at least ,
(S72) Take
Then based on the condition of (), the matrix Bernstein inequality [65] yields that
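For reference, the matrix Bernstein inequality of [65], stated in our notation, is as follows: for independent mean-zero random matrices \(X_k\in\mathbb{R}^{d_1\times d_2}\) with \(\|X_k\|\le L\) almost surely and variance proxy \(\sigma^2=\max\{\|\sum_k\mathbb{E}[X_kX_k^{\top}]\|,\ \|\sum_k\mathbb{E}[X_k^{\top}X_k]\|\}\), it holds for all \(t\ge 0\) that
\[
\Pr\Big(\Big\|\sum_k X_k\Big\|\ge t\Big)\;\le\;(d_1+d_2)\exp\!\Big(\frac{-t^2/2}{\sigma^2+Lt/3}\Big).
\]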
Based on the probability for (S72), we have
Also, we can verify that for any such that ,
Overall, we have
And therefore,
∎
Lemma S4.
Proof.
Take , , . Take and as the expectation and probability under the stationary distribution of .
-
•
For (i),
We apply the Matrix Bernstein inequality [66] to bound it. Take . We can verify that
The same bound can be derived for . Then for all ,
Given the condition that and , we have
If , then we do not need to analyze the remaining two components. In the following, we assume that .
-
•
For (ii),
Then
Next, we apply Theorem 4.2 in [9] to bound
To begin with, take . We can verify that
The same bound can be derived for . Then we adopt arguments similar to those in the proof of Case 2 in Lemma S3. Take in Theorem 4.2 of [9]. Then by Corollary 4.2 in [9], we can verify that , and , and we have
-
•
For (iii),
where is a constant depending on and .
Combining all the bounds from (i), (ii) and (iii), given the conditions on and , we can derive that
∎
S3 Additional Proof and Lemmas
Lemma S5.
Suppose that we take to be either a tensor-product B-spline basis or a tensor-product wavelet basis for every , and assume that the average visitation probability (density) is upper bounded by a constant and lower bounded by a constant . Then there exist constants and such that
(S73) |
In addition, when , we have
Proof of Lemma S5.
Under the conditions in Lemma S5, by Lemma 2 in [62], for any , there exist positive constants and such that
(S74) |
Based on these results, we have
Using similar arguments, we can show that
Taking and , the conclusion follows.
When , due to Assumption 4'(a),
Since is lower bounded by and upper bounded by , the density of the stationary distribution is lower bounded by and upper bounded by for sufficiently large . Then, following the same argument, we can show the bounds for and . ∎
Proof of Theorem 7.
We start by showing the "if" direction of the statement. Suppose for some .
First, we show that the following equality holds for any function :
(S75) |
To see this, note the definition of , and we have
Based on (S75), for any , we have
where the inequality comes from the definition of . Then we prove
Next, we derive the “only if” direction of the statement. Suppose for some positive constant .
Then, the null space of is trivial. For any , there exists a such that
(S76) |
Now we can derive
where the first equality is from (S76), the second equality is due to (S75) and the last inequality is from the assumption for the minimal eigenvalue and (S76). Then
∎
Proof of Corollary 2.
By Jensen’s inequality and the definition of , we have
Then under the condition (34), we obtain
Under Assumption 7 (a) and (b), we can show that for every . Then we have
And
The bounds for can be obtained by applying Theorem 7.
For , by taking for any , we have
The conclusion follows.
∎
References
- Athey et al. [2018] Athey, S., G. W. Imbens, and S. Wager (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B 80(4), 597–623.
- Belloni et al. [2015] Belloni, A., V. Chernozhukov, D. Chetverikov, and K. Kato (2015). Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186(2), 345–366.
- Bertsekas [1995] Bertsekas, D. P. (1995). Dynamic programming and optimal control, Volume 1. Athena scientific Belmont, MA.
- Bradley [2005] Bradley, R. C. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys 2, 107–144.
- Brockman et al. [2016] Brockman, G., V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
- Chan [1989] Chan, K. (1989). A note on the geometric ergodicity of a Markov chain. Advances in Applied Probability 21(3), 702–704.
- Chan et al. [2015] Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2015). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B.
- Chen [2007] Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics 6, 5549–5632.
- Chen and Christensen [2015] Chen, X. and T. M. Christensen (2015). Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. Journal of Econometrics 188(2), 447–465.
- Chen and Qi [2022] Chen, X. and Z. Qi (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. arXiv preprint arXiv:2201.06169.
- Chua et al. [2018] Chua, K., R. Calandra, R. McAllister, and S. Levine (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems 31.
- De Boor [1976] De Boor, C. (1976). Splines as linear combinations of B-splines. A survey. Technical report, Wisconsin Univ Madison Mathematics Research Center.
- Fan et al. [2016] Fan, J., K. Imai, H. Liu, Y. Ning, and X. Yang (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Technical report, Technical report, Princeton University.
- Farahmand et al. [2009] Farahmand, A. M., M. Ghavamzadeh, S. Mannor, and C. Szepesvári (2009). Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, pp. 441–448. Curran Associates, Inc.
- Farahmand et al. [2016] Farahmand, A.-m., M. Ghavamzadeh, C. Szepesvári, and S. Mannor (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research 17(1), 4809–4874.
- Farahmand and Szepesvári [2012] Farahmand, A.-m. and C. Szepesvári (2012). Regularized least-squares regression: Learning from a -mixing sequence. Journal of Statistical Planning and Inference 142(2), 493–505.
- Freedman [1975] Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability, 100–118.
- Garreau et al. [2017] Garreau, D., W. Jitkrittum, and M. Kanagawa (2017). Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269.
- van de Geer [2000] van de Geer, S. A. (2000). Empirical Processes in M-estimation, Volume 6. Cambridge University Press.
- Györfi et al. [2006] Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A distribution-free theory of nonparametric regression. Springer Science & Business Media.
- Hansen [1982] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029–1054.
- Hearst et al. [1998] Hearst, M. A., S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf (1998). Support vector machines. IEEE Intelligent Systems and their applications 13(4), 18–28.
- Horowitz et al. [2004] Horowitz, J. L., E. Mammen, et al. (2004). Nonparametric estimation of an additive model with a link function. Annals of Statistics 32(6), 2412–2443.
- Hu et al. [2021] Hu, X., M. Qian, B. Cheng, and Y. K. Cheung (2021). Personalized policy learning using longitudinal mobile health data. Journal of the American Statistical Association 116(533), 410–420.
- Imai and Ratkovic [2014] Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B 76(1), 243–263.
- Janner et al. [2019] Janner, M., J. Fu, M. Zhang, and S. Levine (2019). When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems 32.
- Kallus [2020] Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research 21(62), 1–54.
- Kallus and Uehara [2019] Kallus, N. and M. Uehara (2019). Double reinforcement learning for efficient off-policy evaluation in markov decision processes. arXiv preprint arXiv:1908.08526.
- Kallus and Uehara [2020] Kallus, N. and M. Uehara (2020). Statistically efficient off-policy policy gradients. In International Conference on Machine Learning, pp. 5089–5100. PMLR.
- Kallus and Uehara [2022] Kallus, N. and M. Uehara (2022). Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research.
- Kang and Schafer [2007] Kang, J. D. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 523–539.
- Komorowski et al. [2018] Komorowski, M., L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal (2018). The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine 24(11), 1716–1720.
- Komunjer and Vuong [2010] Komunjer, I. and Q. Vuong (2010). Semiparametric efficiency bound in time-series models for conditional quantiles. Econometric Theory, 383–405.
- Kosorok and Laber [2019] Kosorok, M. R. and E. B. Laber (2019). Precision medicine. Annual review of statistics and its application 6, 263–286.
- Laber et al. [2014] Laber, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics 8(1), 1225.
- Le et al. [2019] Le, H., C. Voloshin, and Y. Yue (2019). Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712.
- Levine et al. [2020] Levine, S., A. Kumar, G. Tucker, and J. Fu (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Liao et al. [2018] Liao, P., W. Dempsey, H. Sarker, S. M. Hossain, M. al’Absi, P. Klasnja, and S. Murphy (2018). Just-in-time but not too much: Determining treatment timing in mobile health. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 2(4), 179.
- Liao et al. [2020] Liao, P., P. Klasnja, and S. Murphy (2020). Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, 1–10.
- Liao et al. [2020] Liao, P., Z. Qi, and S. Murphy (2020). Batch policy learning in average reward markov decision processes. arXiv preprint arXiv:2007.11771.
- Liu et al. [2018] Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366.
- Luckett et al. [2019] Luckett, D. J., E. B. Laber, A. R. Kahkoska, D. M. Maahs, E. Mayer-Davis, and M. R. Kosorok (2019). Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association (just-accepted), 1–39.
- Marcolino et al. [2018] Marcolino, M. S., J. A. Q. Oliveira, M. D’Agostino, A. L. Ribeiro, M. B. M. Alkmim, and D. Novillo-Ortiz (2018). The impact of mhealth interventions: systematic review of systematic reviews. JMIR mHealth and uHealth 6(1), e23.
- Marling and Bunescu [2020] Marling, C. and R. Bunescu (2020). The OhioT1DM dataset for blood glucose level prediction: Update 2020. In CEUR Workshop Proceedings, Volume 2675, pp. 71.
- McLeish [1974] McLeish, D. L. (1974). Dependent central limit theorems and invariance principles. the Annals of Probability 2(4), 620–628.
- Meyn et al. [1995] Meyn, S., R. Tweedie, and J. Hibey (1995). Markov chains and stochastic stability. IEEE Transactions on Automatic Control 40(5), 979.
- Murphy [2003] Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
- Murphy et al. [2001] Murphy, S. A., M. J. van der Laan, J. M. Robins, and C. P. P. R. Group (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96(456), 1410–1423.
- Nachum et al. [2019] Nachum, O., Y. Chow, B. Dai, and L. Li (2019). Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315–2325.
- Nahum-Shani et al. [2016] Nahum-Shani, I., S. N. Smith, B. J. Spring, L. M. Collins, K. Witkiewitz, A. Tewari, and S. A. Murphy (2016). Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 1–17.
- Newey [1997] Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of econometrics 79(1), 147–168.
- Newey and Powell [2003] Newey, W. K. and J. L. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71(5), 1565–1578.
- Paduraru [2007] Paduraru, C. (2007). Planning with approximate and learned models of Markov decision processes. Master's thesis, University of Alberta.
- Precup [2000] Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.
- Puterman [1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
- Robins et al. [2000] Robins, J. M., M. A. Hernan, and B. Brumback (2000). Marginal structural models and causal inference in epidemiology.
- Robins et al. [1994] Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association 89(427), 846–866.
- Rodbard [2009] Rodbard, D. (2009). Interpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control. Diabetes technology & therapeutics 11(S1), S–55.
- Shi et al. [2018] Shi, C., A. Fan, R. Song, and W. Lu (2018). High-dimensional a-learning for optimal dynamic treatment regimes. Annals of statistics 46(3), 925.
- Shi et al. [2021] Shi, C., R. Wan, V. Chernozhukov, and R. Song (2021). Deeply-debiased off-policy interval estimation. arXiv preprint arXiv:2105.04646.
- Shi et al. [2020] Shi, C., R. Wan, R. Song, W. Lu, and L. Leng (2020). Does the Markov decision process fit the data: Testing for the Markov property in sequential decision making. arXiv preprint arXiv:2002.01751.
- Shi et al. [2020] Shi, C., S. Zhang, W. Lu, and R. Song (2020). Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515.
- Sutton and Barto [2018] Sutton, R. S. and A. G. Barto (2018). Reinforcement learning: An introduction. MIT press.
- Tang et al. [2019] Tang, Z., Y. Feng, L. Li, D. Zhou, and Q. Liu (2019). Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186.
- Tropp [2012] Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12(4), 389–434.
- Tropp [2015] Tropp, J. A. (2015). An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571.
- Tseng and Bertsekas [1991] Tseng, P. and D. P. Bertsekas (1991). Relaxation methods for problems with strictly convex costs and linear constraints. Mathematics of operations research 16(3), 462–481.
- Uehara et al. [2020] Uehara, M., J. Huang, and N. Jiang (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR.
- Uehara et al. [2021a] Uehara, M., M. Imaizumi, N. Jiang, N. Kallus, W. Sun, and T. Xie (2021a). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
- Uehara et al. [2021b] Uehara, M., M. Imaizumi, N. Jiang, N. Kallus, W. Sun, and T. Xie (2021b). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
- Wahba [1990] Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
- Wang et al. [2018] Wang, L., Y. Zhou, R. Song, and B. Sherwood (2018). Quantile-optimal treatment regimes. Journal of the American Statistical Association 113(523), 1243–1254.
- Wang and Zubizarreta [2020] Wang, Y. and J. R. Zubizarreta (2020). Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations. Biometrika 107(1), 93–105.
- Wong and Chan [2018] Wong, R. K. W. and K. C. G. Chan (2018). Kernel-based covariate functional balancing for observational studies. Biometrika 105(1), 199–213.
- Yu [1994] Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.
- Zhang et al. [2020] Zhang, R., B. Dai, L. Li, and D. Schuurmans (2020). GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations.
- Zhao et al. [2015] Zhao, Y.-Q., D. Zeng, E. B. Laber, and M. R. Kosorok (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association 110(510), 583–598.