Projected State-action Balancing Weights for Offline Reinforcement Learning

Jiayi Wang¹ Zhengling Qi² Raymond K.W. Wong¹

(¹Texas A&M University
²George Washington University
)

Abstract

Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights, and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.

Keywords: Infinite horizons; Markov decision process; Policy evaluation; Reinforcement learning

1 Introduction

In reinforcement learning (RL), off-policy evaluation (OPE) refers to the problem of estimating some notion of rewards (e.g., the value defined in (2)) of a target policy based on historical data collected from a potentially different policy. There is a recent surge of interest in OPE among the statistics and machine learning communities. On the one hand, OPE serves as the foundation for many RL methods. On the other hand, OPE is of great importance in some high-stakes domains where deploying a new policy can be very costly or risky, such as in medical applications.

This work is partly motivated by mobile health (mHealth) studies. Due to the rapid development of mobile devices and sensing technology, mHealth studies have recently emerged as a promising way to promote the healthy behaviors of patients [38]. Mobile devices are used to monitor their health conditions in real-time and deliver just-in-time interventions to individuals [50]. With access to a rich amount of longitudinal data pre-collected by wearable devices, researchers are often interested in evaluating a policy (intervention), potentially different from the one behind the data collection process. For example, the OhioT1DM dataset [43] was collected to improve the health and wellbeing of people with type 1 diabetes. The data consist of 8 weeks’ health information of six subjects, based on some unknown policy. For each subject, real-time information such as insulin doses, glucose levels, and self-reported times of meals and exercises was collected. It is often of great interest to evaluate the efficacy of insulin pump therapy. One notable challenge of mHealth data is a limited number of subjects, combined with a usually large number of decision points for each subject. For instance, the OhioT1DM dataset has six subjects with a few thousand decision points per subject. In statistics, there is a rich literature on estimating the treatment effect or optimal policy from complex longitudinal data [57, 56, 48, 47, 35, 77, 72, 59, 34]. However, these methods are mainly designed for studies with very few decision points and often require the number of subjects to grow in order to achieve accurate estimation.

To address the above challenge in mHealth studies, we adopt the framework of the Markov decision process (MDP) [55] and consider an infinite-horizon time-homogeneous setting. This framework is particularly suitable for mHealth studies, and its efficacy has been well demonstrated in the recent literature [42, 39, 62, 40, 24].

In this paper, we focus on developing a new model-free approach for OPE. The existing model-free OPE methods under the infinite-horizon discounted setting can be mainly divided into three categories. The methods in the first category [e.g., 3, 63, 36, 42, 62] directly estimate the state(-action) value function (see Section 2.1), based on the Bellman equation (see (6)). The second category is motivated by importance weights or the so-called marginal importance sampling [e.g., 41, 49, 68]. These approaches utilize the probability ratio function (see (8)) to adjust for the distributional mismatch due to the difference between the target policy and the behavior policy (i.e., the one in the data collection process). The last category of OPE methods combines methods in the first category and the second category to obtain the so-called “double robust” estimators [e.g., 30, 64, 60]. Apart from the model-free methods, we note in passing that there exists a rich class of literature on model-based methods which mainly rely on directly modeling the dynamics (i.e., the transition mechanism and the reward function) [e.g. 53, 11, 26].

This paper focuses on the second category of model-free OPE methods, which is based on the probability ratio function. These methods do not depend on the actual form of the reward function and thus can be used flexibly to evaluate the target policy on different reward functions, which is appealing in practice. Some core theoretical questions associated with the use of the (estimated) probability ratio function directly relate to the fundamental challenges of offline RL [37], and therefore have recently attracted much interest from the machine learning community [41, 49, 76, 68, 69]. Last but not least, the probability ratio function and related estimators can improve the accuracy and stability of offline RL, which has been demonstrated in [49, 29]. Despite some recent progress towards using the (estimated) probability ratio function to perform OPE, the corresponding development and (theoretical) understanding are still incomplete.

Motivated by the covariate balancing methods [e.g., 1, 74, 73, 27] that have been recently studied in the average treatment effect (ATE) estimation, we propose a novel OPE estimator via projected state-action balancing weights, under the framework of the time-homogeneous MDP and provide a comprehensive theoretical analysis. Specifically, we characterize the special challenge in developing a weighted estimator for the OPE problem in terms of an “expanded dimension” phenomenon, which significantly complicates the adoption of the balancing idea in OPE, as compared with the ATE estimation. See the detailed discussion in Section 3.2. Roughly speaking, a direct modification of an ATE estimator [e.g., 73] for the purpose of OPE leads to estimated weights that depend on not only the current state-action pairs, but also the state variable in the next decision points, which is inconsistent with the definition of the true probability ratio function. To tackle this issue, we propose an approximate projection step to rule out this “expanded dimension” issue. With the help of this additional projection, we are able to show the convergence rate of the proposed weights. (or the estimated ratio function).

As for theoretical contributions, we analyze the convergence rates of the approximate projection step, the projected state-action balancing weights, as well as the estimated value of a given policy. All these convergence rates are characterized in the scaling of both the sample size ( $n$ , the number of observed trajectories) and the number of observed decision points ( $T$ ) per trajectory. The scaling with respect to $T$ is particularly important when $n$ (the number of subjects) is limited, which is common in mHealth datasets such as the aforementioned OhioT1DM dataset (see also Section 6). For instance, under our setup, the estimated value of the policy is still consistent in the asymptotic setting where $n$ is bounded, but $T\rightarrow\infty$ . In the course of analyzing the proposed method, we also obtain a uniform convergence rate (with respect to $n$ and $T$ ) for a non-parametric regression based on exponentially $\beta$ -mixing sequences, which may be of independent interest for other applications. Furthermore, under some appropriate technical assumptions (including notably a non-diminishing minimum eigenvalue condition), we show that the proposed weighted estimator is asymptotically normal with a $\sqrt{nT}$ rate of convergence, and achieves the efficiency bound, which aligns with other types of estimators [30, 39]. Besides, our theoretical results do not require that the underlying data are independent or generated from stationary sequences, although these assumptions are widely used in the existing literature [e.g., 15, 30, 49, 64] for general OPE methods. Without imposing these restrictive assumptions can significantly increase the applicability of the proposed method.

As another theoretical contribution, we make the first attempt to analyze the difficulty of non-parametric OPE under infinite-horizon discounted settings. Some strong assumptions are often imposed in the existing literature to establish desirable theoretical properties for corresponding OPE estimators. For instance, [62] assumed that the minimal eigenvalue of some second moment is strictly bounded away from $0$ as the number of basis functions grows. [70] adopts the “completeness” condition to study the convergence of their OPE estimators. In their work, they show that these assumptions can be satisfied when the discount factor $0\leq\gamma<1$ in the policy value (2) is small enough (with additional boundedness conditions on the average visitation probability). However, in practice, the discount factor is often preferred to be set close to 1 [32]. As such, these sufficient conditions are of limited use in practice. In this paper, we provide a necessary and sufficient condition for lower bounding the minimal eigenvalue of the operator $I-\gamma\mathcal{P}^{\pi}$ (see the definition in Section 4.4), with respect to the data generating process, which characterizes the well-posedness of their $Q$ -function estimations in [62] and [70]. With the help of this characterization, we can further show that the minimal eigenvalue is strictly bounded away from zero under some mild sufficient conditions without any restrictions on $\pi$ or $\gamma$ , which may be of independent interest.

The rest of the paper is organized as follows. Section 2 presents the basic framework for OPE and some existing OPE methods. In Section 3 we introduce the state-action balancing weights and the proposed estimator. Theoretical results of our estimated weight function and the asymptotic properties of the proposed estimator are developed in Section 4. A detailed discussion regarding the lower bound of the minimum eigenvalue is presented in Section 4.4. Lastly, a simulation study and a real data application are presented in Sections 5 and 6 respectively.

2 Offline Policy Evaluation in Infinite-horizon Markov Decision Processes

In this section, we review the framework of discrete-time homogeneous MDPs and the related OPE methods. Specifically, the framework of MDP and some necessary notations are presented in Section 2.1, while three major classes of model-free OPE methods under the infinite-horizon discounted setting are reviewed in Section 2.2.

2.1 Preliminary and Notations

Consider a trajectory $\left\{S_{t},A_{t},R_{t}\right\}_{t\geq 0}$ , where $(S_{t},A_{t},R_{t})$ denotes the triplet of the state, action and immediate reward, observed at the decision point $t$ . Let ${\cal S}$ and ${\cal A}$ be the state and action spaces, respectively. We assume ${\cal A}$ is finite and consider the following two assumptions, which are commonly imposed in infinite-horizon OPE problems.

Assumption 1 (Markovian assumption with stationary transitions).

There exists a transition kernel $P$ such that for every $t\geq 0$ , $a\in{\cal A}$ , $s\in{\cal S}$ and any set $F\in{\cal B}({\cal S})$ ,

\Pr(S_{t+1}\in F\,|\,A_{t}=a,S_{t}=s,\left\{S_{j},A_{j},R_{j}\right\}_{0\leq j<t})=P(S_{t+1}\in F\,|\,A_{t}=a,S_{t}=s),

where ${\cal B}({\cal S})$ is the family of Borel subsets of ${\cal S}$ .

For the notational convenience, we assume the transition kernel has a density $p$ . Assumption 1 requires that given the current state-action pair, future states are independent of past observations. Note that this assumption can be tested using the observed data [e.g., 61]. If one believes the trajectory satisfies some higher-order Markovian properties, one can instead construct a state variable by aggregating all the original state variables over the corresponding number of decision points. We refer the interested readers to [61] for an example.

Assumption 2.

There exists a reward function $r$ defined on ${\cal S}\times{\cal A}$ such that
$\mathbb{E}\left[R_{t}\,|\,A_{t}=a,S_{t}=s,\left\{S_{j},A_{j},R_{j}\right\}_{0\leq j<t}\right]=r(s,a),$ for every $t\geq 0$ , $s\in{\cal S}$ and $a\in{\cal A}$ . In addition, $R_{t}$ is uniformly bounded, i.e., there exists a constant $R_{\max}$ such that $|R_{t}|\leq R_{\max}$ for every $t\geq 0$ .

Assumption 2 states that the current reward is conditionally mean independent of the history given the current state-action pair. One can also regard $R_{t}$ as a part of $S_{t+1}$ if needed. The uniform boundedness assumption on rewards is introduced for simplifying the technical derivation and can be relaxed. In practice, the time-stationarity of the transition density $p$ and the reward function $r$ can be warranted by incorporating time-associated covariates. Under Assumptions 1 and 2, the tuple ${\cal M}=\langle{\cal S},{\cal A},r,p\rangle$ forms a discrete-time homogeneous MDP.

A policy is defined as a way of choosing actions at all decision points. In this work, we use the value criterion (defined in (2) below) to evaluate policies and thus focus on the time-invariant Markovian policy (also called a stationary policy) $\pi$ , which is a function mapping from the state space ${\cal S}$ to a probability mass function over the action space ${\cal A}$ . More specifically, $\pi(a\,|\,s)$ is the probability of choosing an action $a$ given a state $s$ . The sufficiency of considering only the stationary policy is explained in Section 6.2 of [55].

In the offline RL setting, one of the fundamental tasks is to estimate a target stationary policy’s (expected) value function based on the pre-collected (batch) data. Given a stationary policy $\pi$ , the value function is defined as

\displaystyle V^{\pi}(s)=\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,|\,S_{0}=s\right],

(1)

where $\mathbb{E}^{\pi}$ denotes the expectation with respect to the distribution whose actions are generated by $\pi$ , and $0\leq\gamma<1$ refers to the discounted factor controlling the trade-off between the future rewards and the immediate rewards. The value function $V^{\pi}(s)$ , which is always finite due to Assumption 2, represents the discounted sum of rewards under the target policy given the initial state $s$ . Our goal is to estimate the policy value, i.e., the expectation of the value function, defined as

\displaystyle{\cal V}(\pi)=(1-\gamma)\int_{s\in\mathcal{S}}V^{\pi}(s)\mathbb{G}(ds),

(2)

using the pre-collected training data, where $\mathbb{G}$ denotes a reference distribution over ${\cal S}$ . In RL literature, $\mathbb{G}$ is typically assumed known.

Now suppose we are given pre-collected training data ${\cal D}_{n}$ consisting of $n$ independent and identically distributed finite-horizon trajectories, denoted by

{\cal D}_{n}=\left\{\left\{\left(S_{i,t},A_{i,t},R_{i,t},S_{i,t+1}\right)\right\}_{0\leq t<{T_{i}}}\right\}_{1\leq i\leq n},

where $T_{i}$ denotes the termination time for $i$ -th trajectory. For simplicity, we assume the same number of time points are observed for all trajectories, i.e. $T_{i}=T$ for $i=1,\dots,n$ . This assumption can be relaxed as long as $T_{i}$ , $i=1,\dots,n$ are of the same order. We also make the following assumption on the data generating mechanism.

Assumption 3.

The training data ${\cal D}_{n}$ is generated by a fixed stationary policy $b$ with an initial distribution $\mathbb{G}_{0}$ .

Under Assumptions 1 and 3, $\left\{S_{t},A_{t}\right\}_{t\geq 0}$ forms a discrete time time-homogeneous Markov chain. In the literature, $b$ is usually called the behavior policy, which may not be known. For convenience, we denote $\mathbb{E}$ by $\mathbb{E}^{b}$ . Next we define notations for several important probability distributions. For any $t\geq 0$ , we define $p_{t}^{\pi}(s,a)$ as the marginal density of $(S_{t},A_{t})$ at $(s,a)\in{\cal S}\times{\cal A}$ under the target policy $\pi$ and reference distribution $\mathbb{G}$ . In particular, when $t=0$ , $p_{t}^{\pi}(s,a)=\mathbb{G}(s)\pi(a\mid s)$ . Similarly, we can define $p_{t}^{b}$ over ${\cal S}\times{\cal A}$ under the behavior policy, where $p^{b}_{0}(s,a)=\mathbb{G}_{0}(s)b(a\mid s)$ . With $p_{t}^{\pi}$ , the discounted visitation probability density is defined as

\displaystyle d^{\pi}(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p^{\pi}_{t}(s,a),

(3)

which is assumed to be well-defined.

2.2 Existing Off-policy Evaluation Methods

Most existing model-based OPE methods for the above setting can be grouped into three categories. The first category is direct methods, which directly estimate the state-action value function [e.g., 42, 62] defined as

\displaystyle Q^{\pi}(s,a)=\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,|\,S_{0}=s,A_{0}=a\right],

(4)

also known as the Q-function. As we can see from (1) and (2),

{\cal V}(\pi)=(1-\gamma)\int_{s\in\mathcal{S}}\sum_{a\in{\cal A}}\pi(a\mid s)Q^{\pi}(s,a)\mathbb{G}(ds).

(5)

It is well-known that $Q^{\pi}$ satisfy the following Bellman equation

\displaystyle Q^{\pi}(s,a)=\mathbb{E}\left[R_{t}+\gamma\sum_{a\in{\cal A}}\pi(a^{\prime}\,|\,S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})\,|\,S_{t}=s,A_{t}=a\right],

(6)

for $t\geq 0$ , $s\in{\cal S}$ and $a\in{\cal A}$ . Clearly (6) forms a conditional moment restriction, based on which $Q^{\pi}$ can be estimated by many methods such as generalized method of moments [21] and the nonparametric instrumental variable regression [52, 8]. The second category of OPE methods is motivated by the idea of marginal importance sampling [e.g., 41, 49]. Notice that, with (3), one can rewrite ${\cal V}(\pi)$ as

\displaystyle{\cal V}(\pi)={\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\frac{d^{\pi}(S_{t},A_{t})}{\frac{1}{T}\sum_{v=0}^{T-1}p_{v}^{b}(S_{v},A_{v})}R_{t}\right]}=\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})R_{t}\right],

(7)

as long as $d^{\pi}$ is absolutely continuous with respect to $\bar{p}_{T}^{b}=\frac{1}{T}\sum_{v=0}^{T-1}p_{v}^{b}$ . We call

\displaystyle\omega^{\pi}(s,a):=\frac{d^{\pi}(s,a)}{\bar{p}_{T}^{b}(s,a)},

(8)

the probability ratio function, which is used to adjust for the mismatch between the behavior policy $b$ and the target policy $\pi$ . Based on this relationship, one can obtain an estimator of ${\cal V}(\pi)$ via estimating $\omega^{\pi}$ . By the so-called backward Bellman equation of $d^{\pi}$ , for any $(s,a)\in{\cal S}\times{\cal A}$ , one can show that

\displaystyle d^{\pi}(s,a)=(1-\gamma)\mathbb{G}(s)\pi(a\mid s)+\gamma\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in{\cal A}}d^{\pi}(s^{\prime},a^{\prime})p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)ds^{\prime}.

(9)

This implies that

\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})\left(f(S_{t},A_{t})-\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})f(S_{t+1},a^{\prime})\right)\right]=\\ (1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a\in{\cal A}}\pi(a\mid S_{0})f(a,S_{0})\right].

(10)

See the detailed derivation for obtaining (10) in Section S2.1 of the Supplementary Material Recall that $\mathbb{G}$ is known. Based on (10), several methods [e.g., 49, 68] can be leveraged to estimate $\omega^{\pi}$ . In Section 3.4, we provide more discussion on other weighted estimators as compared with the proposed estimator.

The last category of methods combines direct and marginal importance sampling methods to construct a doubly robust estimator, which is also motivated by the following efficient influence function [e.g., 30]

\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})\left(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})-Q^{\pi}(S_{t},A_{t})\right)\\ +(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a_{0}\in{\cal A}}\pi(a_{0}\mid S_{0})Q^{\pi}(S_{0},a_{0})\right]-{\cal V}(\pi),

(11)

where $Q^{\pi}$ and $\omega^{\pi}$ are nuisance functions.

3 Projected State-action Balancing Estimator

In this section, we introduce the proposed weighted estimator for ${\cal V}(\pi)$ . Since our estimator is motivated by covariate balancing weights in the literature of ATE estimation, we discuss their connection in Section 3.1. In Section 3.2, we show the difficulty of directly applying the covariate balancing idea in the aforementioned policy evaluation problem due to an “expanded-dimension” issue. We address this difficulty and propose a projected state-action balancing estimator for ${\cal V}(\pi)$ in Section 3.3.

3.1 State-action Balancing Weights

Consider a general form of weighted estimators:

\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega_{i,t}R_{i,t},

where $\omega_{i,t}$ ’s are some weights constructed from the training data $\mathcal{D}_{n}$ . Due to (7), a reasonable strategy to derive such weights is to first estimate the probability ratio function $\omega^{\pi}$ , and then evaluate it at the observed state-action pairs $\{(S_{i,t},A_{i,t})\}$ in ${\cal D}_{n}$ . This is analogous to the inverse probability weights commonly adopted in the ATE estimation. However, this strategy often produces an unstable weighted estimator due to small weights [31]. Instead, there is a recent surge of interest to directly obtain weights that achieve empirical covariate balance [e.g., 25, 7, 74, 73]. These weights usually produce more stable weighted estimators with superior finite-sample performances for the ATE estimation.

Inspired by the covariate balancing, a natural idea is to choose the weights $\{\omega_{i,t}\}$ that ensure the (approximate) validity of the empirical counterpart of (10), i.e.,

\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\left(f(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})f(S_{i,t+1},a^{\prime})\right)=\\ (1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a\in{\cal A}}\pi(a\mid S_{0})f(a,S_{0})\right],

(12)

over $f\in\mathrm{span}\{B_{1},\dots,B_{K}\}$ where $B_{1},\dots,B_{K}$ are $K$ pre-specified functions defined on $\mathcal{S}\times\mathcal{A}$ . The equality (12) can be viewed as a form of the state-action balance, in contrast to the covariate balance in the ATE estimation. The space $\mbox{span}\{B_{1},\dots,B_{K}\}$ can be viewed as a finite-dimensional approximation of the function space in which the balance should be enforced. In theory, $K$ is expected to increase with $n$ and $T$ . Changing the balancing criterion in [73], one can obtain a form of state-action balancing weights via the following mathematical programming:


	$\displaystyle\underset{\{{\omega}_{i,t}\}_{1\leq i\leq n,0\leq t\leq T-1}}{\mbox{minimize}}\quad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h({\omega}_{i,t})$		(13a)
	$\displaystyle\quad\mbox{subject to}\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\omega}_{i,t}\left\{B_{k}(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})\right\}\right.$
	$\displaystyle\qquad\quad\left.-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})B_{k}(S_{0},a^{\prime})\right]\right\|\leq\delta_{k},\quad\mbox{for}\ k=1,2,\dots,K,$		(13b)

where the tuning parameters $\delta_{k}\geq 0$ controls the imbalance with respect to $B_{k}$ . When $\delta_{k}=0$ , the weights achieve the exact balance (12) over $B_{k}$ . In practice, the exact balance can be hard to achieve especially when $K$ is large. Allowing $\delta_{k}>0$ leads to approximate balance [1, 74], which introduces flexibility. In addition, one can also constrain the weights to be non-negative. Since non-negativity constraints are not necessary for consistent estimation (as shown in our theoretical analysis), we will not enforce them throughout this paper. Common choices of $\{B_{k}\}$ are constructed based on tensor products of one-dimensional basis functions. Examples of one-dimensional basis functions include spline basis [12] (for a continuous dimension) and indicator functions of levels (for a categorical dimension). The objective function (13a) is introduced to control the magnitude of the weights. Here $h$ is chosen as a non-negative, strictly convex and continuously differentiable function. Examples include $h(x)=(x-1)^{2}$ and $h(x)=(x-1)\log(x)$ . In the following, we discuss the issue with the weights defined by (13), and explain the challenge of directly applying this covariate balancing idea in the OPE problem.

Define $\epsilon_{i,t}=R_{i,t}-r(S_{i,t},A_{i,t})$ and write the solution of (13) as $\tilde{\omega}^{\pi}_{i,t}$ . Naturally, we can obtain a weighted estimator as $\tilde{\cal V}(\pi)=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}_{i,t}^{\pi}R_{i,t}$ . For any function $B$ defined on $\mathcal{S}\times\mathcal{A}$ , let

\displaystyle g_{*}^{\pi}(S_{t},A_{t};B):=\mathbb{E}\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{t+1})B(S_{t+1},a^{\prime})\mid S_{t},A_{t}\right\}.

(14)

The difference between $\tilde{{\cal V}}(\pi)$ and ${\cal V}(\pi)$ yields the following decomposition:

	$\displaystyle\tilde{{\cal V}}(\pi)-{\cal V}(\pi)$
$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\left[Q^{\pi}(S_{i,t},A_{i,t})-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};Q^{\pi})\right]$	(15)
	$\displaystyle\qquad-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})Q^{\pi}(S_{0},a^{\prime})\right\}+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\epsilon_{i,t}$	(16)
$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\left[Q^{\pi}(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})Q^{\pi}(S_{i,t+1},a^{\prime})\right]$	(17)
	$\displaystyle\qquad-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})Q^{\pi}(S_{0},a^{\prime})\right\}$	(18)
	$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\epsilon_{i,t}$	(19)
	$\displaystyle+\frac{\gamma}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}_{i,t}^{\pi}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})Q^{\pi}(S_{i,t+1},a^{\prime})-g_{*}^{\pi}(S_{i,t},A_{i,t};Q^{\pi})\right],$	(20)

where the first equality is given by (6) and the representation of ${\cal V}(\pi)$ in (5). Clearly, (17)-(18) can be controlled via the balancing constraint (13b) by carefully controlling $\{\delta_{k}\}$ and selecting $\{B_{k}\}$ so that $Q^{\pi}$ can be well approximated. For (19), by assuming that $\{\epsilon_{i,t}\}$ are independent noises of the trajectories $\{S_{i,t},A_{i,t}\}_{i=1,\dots n;t=0,\dots,T-1}$ , it may not be hard to obtain an upper bound of order $\sqrt{nT}$ as long as the magnitude of $\tilde{\omega}_{i,t}^{\pi}$ is properly controlled. However, it remains unclear how to control (20) due to the complex dependence between $\tilde{\omega}_{i,t}^{\pi}$ and ${\cal D}_{n}$ . Indeed, we observe an “expanded-dimension” issue due to the balancing constraints (13b), which will be explained in details in Section 3.2. This also motivates the development of the novel projected balancing constraints in Section 3.3.

3.2 Expanded Dimension

First, we obtain the dual form of (13), which provides an important characterization for the solution of (13). Define $\bm{\Psi}_{K}(s,a,s^{\prime})=[B_{k}(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})B_{k}(s^{\prime},a^{\prime})]_{k=1}^{K}\in\mathbb{R}^{K}$ , $\bm{l}_{K}=[\mathbb{E}_{s\sim\mathbb{G}}\{(1-\gamma)\sum_{a\in\mathcal{A}}\pi(a\mid s)B_{k}(s,a)\}]_{k=1,\dots,K}\in\mathbb{R}^{K}$ , and $\bm{\delta}_{K}=[\delta_{1},\dots,\delta_{K}]^{\intercal}\in\mathbb{R}^{K}.$

Theorem 1.

The dual of (13) is equivalent to the following unconstrained optimization problem:

\displaystyle\min_{\bm{\lambda}\in\mathbb{R}^{K}}\qquad\frac{1}{nT}\sum_{j=1}^{n}\sum_{t=0}^{T-1}\rho({\bm{\Psi}}_{K}(S_{i,t},A_{i,t},S_{i,t+1})^{\intercal}\bm{\lambda})-\bm{\lambda}^{\intercal}\bm{l}_{K}+|\bm{\lambda}|^{\intercal}\bm{\delta}_{K},

(21)

where $\rho(t)=t(h^{\prime})^{-1}(t)-h{(h^{\prime})^{-1}(t)}$ . The primal solution to (13) is given as

\displaystyle\tilde{w}^{\pi}_{i,t}=\rho^{\prime}({\bm{\Psi}}_{K}(S_{i,t},A_{i,t},S_{i,t+1})^{\intercal}\bm{\lambda}^{\star}),,

(22)

for every $1\leq i\leq n$ and $0\leq t\leq T-1$ , where $\bm{\lambda}^{\star}$ is the solution to (21).

The proof of Theorem 1 is similar to that of Theorem 2 below, and can be found in Section S2.2 of the Supplementary Material. Now the expanded-dimension issue can be easily seen via the following example. Suppose that there are two triplets $(S_{i_{1},t_{1}},A_{i_{1},t_{1}},S_{i_{1},t_{1}+1})$ and $(S_{i_{2},t_{2}},A_{i_{2},t_{2}},S_{i_{2},t_{2}+1})$ such that $S_{i_{1},t_{1}}=S_{i_{2},t_{2}}$ , $A_{i_{1},t_{1}}=A_{i_{2},t_{2}}$ and $S_{i_{1},t_{1}+1}\neq S_{i_{2},t_{2}+1}$ . As the true probability ratio function $\omega^{\pi}$ is a function of the current state and action variables, we must have $\omega^{\pi}_{i_{1},t_{1}}=\omega^{\pi}_{i_{2},t_{2}}$ . However, the solution form (22) in Theorem 1 does not lead to $\tilde{\omega}^{\pi}_{i_{1},t_{1}}=\tilde{\omega}^{\pi}_{i_{2},t_{2}}$ in general, which violates our knowledge of $\omega^{\pi}$ .

One may hypothesize that the expanded-dimension issue is a finite-sample property, and the variation of the estimated weights due to the next state may diminish asymptotically under some reasonable conditions. To gain more insight, we show that the solution form (22) indeed induces an implicit restriction on the modeling of the true weight function under finite-state and finite-action settings. Therefore, unless one is willing to make further non-trivial assumptions on the weight function, the hypothesis cannot be true in general.

Notice that

\bm{\Psi}_{K}(s,a,s^{\prime})^{\intercal}\bm{\lambda}=f(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a^{\prime}),

(23)

where $f(s,a)=\sum^{K}_{k=1}\lambda_{k}B_{k}(s,a)$ . To avoid dealing with the approximation error, we focus on an even more general class $\rho^{\prime}(\mathcal{G})$ of functions on $\mathcal{S}\times\mathcal{A}\times\mathcal{S}$ , where

\mathcal{G}:=\left\{f(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a^{\prime}):\mbox{$f$ is any real-valued function defined on $\mathcal{S}\times\mathcal{A}$}\right\}.

Recall that $\rho^{\prime}$ is the first derivative of $\rho$ defined in Theorem 1. Assume that $\tilde{\omega}^{\pi}(s,a,s^{\prime})\equiv\omega^{\pi}(s,a)$ . We would like to know if $\rho^{\prime}(\mathcal{G})$ can model $\tilde{\omega}^{\pi}$ well. Suppose $\tilde{\omega}^{\pi}\in\rho^{\prime}(\mathcal{G}):=\{\rho^{\prime}(g(\cdot)):g\in\mathcal{G}\}$ . As $\tilde{\omega}^{\pi}(s,a,s^{\prime})$ is constant with respect to $s^{\prime}$ , we have $\tilde{\omega}^{\pi}\in\rho^{\prime}(\mathcal{G}^{\prime})$ where

\displaystyle\mathcal{G}^{\prime}

\displaystyle:=\left\{g\in\mathcal{G}:g(s,a,s^{\prime})=g(s,a,s^{\prime\prime})\ \forall s,s^{\prime},s^{\prime\prime}\in\mathcal{S},a\in\mathcal{A}\right\}

characterizes the subclass with reduced input dimensions. A key question is whether $\mathcal{G}^{\prime}$ restricts the class of possible weight functions modeled by $\rho^{\prime}(\mathcal{G})$ and, as a result, induces some implicit form of restriction. To see this, we focus on the settings where $|\mathcal{S}|=p_{S}$ and $|\mathcal{A}|=p_{A}$ are finite. In Lemma 1 below, we show that the dimension of $\mathcal{G}^{\prime}$ is $p_{A}p_{S}-p_{S}+1$ , which is strictly less than $p_{S}p_{A}-1$ as long as $p_{S}>2$ . Note that, due to the natural constraint that $\mathbb{E}[T^{-1}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})]=1$ , a general weight function should have $p_{A}p_{S}-1$ free parameters. As $\rho^{\prime}$ is invertible, Lemma 1 suggests a possible implicit restriction and that the solution obtained from (13) may not be a consistent estimator for $\omega^{\pi}$ .

Lemma 1.

Suppose $|\mathcal{S}|=p_{S}$ and $|\mathcal{A}|=p_{A}$ are both finite. Then $\dim(\mathcal{G}^{\prime})=p_{A}p_{S}-p_{S}+1$ .

Proof sketch.

For any function $g=f(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})f(s^{\prime},a^{\prime})\in\mathcal{G}^{\prime}$ , there exists a constant $c_{g}\in\mathbb{R}$ such that $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})f(s^{\prime},a^{\prime})\equiv c_{g}$ for any $s^{\prime}\in\mathcal{S}$ . Since $|\mathcal{S}|=p_{S}$ , $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})=1$ and $\pi(a^{\prime}\mid s^{\prime})\geq 0$ , this yields $p_{S}$ linearly independent constraints on $f$ . Together with the parameter $c_{g}$ , we can show that the dimension of $\mathcal{G}^{\prime}$ is $p_{S}p_{A}-p_{S}+1$ . The detailed proof can be found in Section S2.2 of the Supplementary Material. ∎

3.3 Projected State-action Balancing Weights

To overcome the expanded-dimension issue, we propose an approximate projection step, which is applied to $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})$ , $k=1,\dots,K$ , to rule out the involvement of $S_{i,t+1}$ . To explain the idea, we again focus on the decomposition of $\tilde{\cal V}(\pi)-{\cal V}(\pi)$ . From (15) and (16), we would like to choose weights that ideally control

\displaystyle\left|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\omega}_{i,t}\left[B_{k}(S_{i,t},A_{i,t})-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};B_{k})\right]-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})B_{k}(S_{0},a^{\prime})\right]\right|,

for every $k=1,\dots,K$ . However, in practice, $g_{*}^{\pi}(S_{i,t},A_{i,t};B_{k})$ (i.e., $\mathbb{E}\{\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})\mid S_{i,t},A_{i,t}\}$ ) is unknown to us. As explained in Section 3.2, the idea of replacing it with the empirical counterpart $\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})$ results in a non-trivial expanded-dimension issue. Instead, we propose to estimate the projection term $g_{*}^{\pi}(S_{t},A_{t};B_{k})$ via a more involved optimization problem:

\displaystyle\hat{g}^{\pi}(\cdot,\cdot;B_{k})=\text{argmin}_{g\in{\mathcal{G}}}\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})-g(S_{i,t},A_{i,t})\right\}^{2}+\mu J^{2}_{\mathcal{G}}(g),

(24)

where ${\mathcal{G}}$ is a pre-specified function space that contains $g_{*}^{\pi}$ , $\mu\geq 0$ is a tuning parameter and $J_{\mathcal{G}}(\cdot)$ is a regularization functional. In this work we focus on the kernel ridge regression, where ${\mathcal{G}}$ is a subset of a reproducing kernel Hilbert space (RKHS) and $J_{\mathcal{G}}$ is taken as the squared RKHS norm; see Assumption 4’(c) in Section 4.1 of the Supplementary Material In Theorem 3 of Section 4, we establish the finite-sample error bound of $\hat{g}^{\pi}(\cdot,\cdot;B)$ (in scaling of both $n$ and $T$ ) that holds uniformly for different $B$ (in a reasonably large class $\mathcal{Q}$ specified later). This provides a solid theoretical guarantee for replacing $g^{\pi}_{*}$ by $\hat{g}^{\pi}$ in the construction of the weights.

With the approximate projection (24), we propose to estimate the weights by solving the following optimization problem:


	$\displaystyle\underset{\{{\omega}_{i,t}\}_{1\leq i\leq n,0\leq t\leq T-1}}{\mbox{minimize}}\quad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h({\omega}_{i,t})$		(25a)
	$\displaystyle\quad\mbox{subject to}\quad\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\omega}_{i,t}\left\{B_{k}(S_{i,t},A_{i,t})-\gamma\hat{g}^{\pi}(S_{i,t},A_{i,t};B_{k})\right\}\right.$
	$\displaystyle\qquad\quad\left.-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})B_{k}(S_{0},a^{\prime})\right]\right\|\leq\delta_{k},\quad\mbox{for}\quad k=1,2,\dots,K.$		(25b)

The resulting solution $\{\hat{w}_{i,t}^{\pi}\}_{1\leq i\leq n,0\leq t<T}$ are the proposed weights. As such, the proposed estimator for ${\cal V}(\pi)$ is

\hat{\cal V}(\pi)=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}R_{i,t}.

(26)

Similar to Theorem 1, we derive the dual form of (25), which is shown in Theorem 2 below. For the notational simplicity, we introduce the following notations: $L_{k}(s,a)=B_{k}(s,a)-\gamma g^{\pi}_{*}(s,a;B_{k})$ , $\hat{L}_{k}(s,a)=B_{k}(s,a)-\gamma\hat{g}^{\pi}(s,a;B_{k})$ , ${\bm{B}}_{K}(s,a)=[{B}_{k}(s,a)]_{k=1}^{K}\in\mathbb{R}^{K}$ , ${\bm{L}}_{K}(s,a)=[{L}_{k}(s,a)]_{k=1}^{K}\in\mathbb{R}^{K}$ and $\hat{\bm{L}}_{K}(s,a)=[\hat{L}_{k}(s,a)]_{k=1}^{K}\in\mathbb{R}^{K}$ .

Theorem 2.

The dual of (25) is equivalent to the following unconstrained optimization problem:

\displaystyle\min_{\bm{\lambda}\in\mathbb{R}^{K}}\qquad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho(\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})-\bm{\lambda}^{\intercal}\bm{l}_{K}+|\bm{\lambda}|^{\intercal}\bm{\delta}_{K}.

(27)

And the primal solution to (25) is given by, for every $1\leq i\leq n$ and $0\leq t\leq T-1$ ,

\displaystyle\hat{\omega}^{\pi}_{i,t}=\rho^{\prime}(\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}^{+}),

(28)

where $\rho(t)=t(h^{\prime})^{-1}(t)-h{(h^{\prime})^{-1}(t)}$ and $\bm{\lambda}^{+}$ is the solution to (27).

We sometimes write $\hat{\omega}^{\pi}_{i,t}=\hat{\omega}^{\pi}_{i,t}(S_{i,t},A_{i,t})$ . The proof can be found in Section S2.2 of the Supplementary Material. As seen from the representation (28), $\hat{w}^{\pi}_{i,t}$ do not suffer from the expanded-dimension issue that we see in (21). Besides, (27) can be regarded as an $M$ -estimation of $\bm{\lambda}$ with a weighted $\ell_{1}$ -norm regularization. Since the estimated weights are parametrized in $\bm{\lambda}$ via (28), Theorem 2 reveals a connection between $\hat{\omega}^{\pi}_{i,t}$ and the shrinkage estimation of the probability ratio function $\omega^{\pi}$ . To see this, we consider the objective function $T^{-1}\sum_{t=0}^{T-1}\rho({\bm{L}}_{K}(S_{t},A_{t})^{\intercal}\bm{\lambda})-\bm{\lambda}^{\intercal}\bm{l}_{K}$ . By Lemma 1 in [68], the expectation of this loss function is minimized when $\bm{\lambda}$ satisfies $\omega^{\pi}(s,a)=\rho^{\prime}(\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda})$ , for every $s\in\mathcal{S}$ , $a\in\mathcal{A}$ . In Theorem 4 of Section 4, we show the convergence rate of the proposed weights to the true weights in the scaling of both $n$ and $T$ .

Next, we discuss the computation of the weights $\hat{\omega}_{i,t}$ in practice. The projection step (24) is a pre-computation step for the optimization (25). In other words, we only need to estimate $\hat{g}^{\pi}(S_{i,t},A_{i,t};B_{k})$ , $k=1,\dots,K$ , once and there is no need to recompute them within the weights estimation (25). The optimization (25) is a standard convex optimization problem with linear constraints. We outline the proposed algorithm of the weights in Algorithm 1 of the Supplementary Material.

3.4 Other weighting methods

Apart from the proposed projection method, a sensible alternative to avoid the expanded dimension is to directly specify a class $\mathcal{F}_{w}$ of functions over $\mathcal{S}\times\mathcal{A}$ as a model of the weight function (i.e., weights are evaluations of this function), which naturally results in the following form of estimators:

	$\displaystyle\hat{f}^{\pi}_{w}$	$\displaystyle=\text{argmin}_{f_{1}\in\mathcal{F}_{w}}\sup_{f_{2}\in\mathcal{F}_{Q}}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}f_{1}(S_{i,t},A_{i,t})\left\{f_{2}(S_{i,t},A_{i,t})-\right.\right.$
		$\displaystyle\left.\left.\gamma\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})f_{2}(S_{i,t+1},a^{\prime})\right\}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left\{\sum_{a\in{\cal A}}\pi(a\mid S_{0})f_{2}(a,S_{0})\right\}-\nu_{1}\mathcal{P}_{1}(f_{2})\right],$		(29)

where $\mathcal{F}_{Q}$ is a space that models $Q^{\pi}$ , $\mathcal{P}_{1}$ is a regularization functional and $\nu_{1}\geq 0$ is a tuning parameter. Similar approaches have been adopted by a few recent works such as [41, 68, 70], although they are not directly motivated by the above expanded-dimension issue. Since these works assume $T=1$ (without considering the dependence in the trajectory), we restrict our results to this setting for comparisons, with a remark that our estimator is analyzed in a more general setting of $T$ (e.g., either bounded, or diverging to infinity). In order to use the above estimator, the choice of the function classes $\mathcal{F}_{w}$ and $\mathcal{F}_{Q}$ seems difficult in terms of computation of the weights. For nonparametric modeling, a natural idea is to take $\mathcal{F}_{Q}$ as a RKHS, since this often leads to a finite-dimensional optimization in regression problems via a representer theorem [71]. However, the term $\mathbb{E}_{S_{0}\sim\mathbb{G}}\{\sum_{a\in{\cal A}}\pi(a\mid S_{0})f(a,S_{0})\}$ could make the representer theorem inapplicable when $S$ is continuous, and so (3.4) becomes an impractical infinite-dimensional optimization. Another way is to take a finite dimensional space to approximate $\mathcal{F}_{Q}$ and $\mathcal{F}_{w}$ . The dimensions of the approximation spaces would need to increase with $n$ so as to avoid an asymptotic bias. Moreover, the existing results on the convergence rate of such estimator to the policy value is not optimal (i.e., slower than $\sqrt{n}$ or $\sqrt{nT}$ ). See Corollary 11 in [70]. As for our weighted estimator, an optimal convergence rate with statistical efficiency can be achieved without an additional nuisance parameter estimation, even in a more general dependent setting when $T>1$ . See Theorem 6 in Section 4.

4 Theoretical Results

In this section, we study the theoretical properties of the approximate projection $\hat{g}^{\pi}$ in (24), balancing weights $\hat{\omega}^{\pi}$ in (25) and the final weighted estimator $\hat{\cal V}(\pi)$ in Section 3.3. Specifically, in Section 4.1, we derive the finite-sample error bound for the approximate projection. In Section 4.2, we study the convergence rate of the proposed balancing weights. Finally, we show that the proposed weighted estimator is statistically efficient under additional conditions specified in Section 4.3. In Section 4.4, we study the difficulty of the offline RL in a conservative manner. To start with, we introduce some notations. We define the squared empirical norm as $\|f(\cdot,\cdot)\|^{2}_{{\color[rgb]{0,0,0}n,T}}=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}f^{2}(S_{i,t},A_{i,t})$ . The notation $\varpi({\color[rgb]{0,0,0}n,T})\lesssim\theta({\color[rgb]{0,0,0}n,T})$ (resp. $\varpi({\color[rgb]{0,0,0}n,T})\gtrsim\theta({\color[rgb]{0,0,0}n,T})$ ) means that there exists a sufficiently large constant (resp. small) constant $c_{1}>0$ (resp. $c_{2}>0$ ) such that $\varpi({\color[rgb]{0,0,0}n,T})\geq c_{1}\theta({\color[rgb]{0,0,0}n,T})$ (resp. $\varpi({\color[rgb]{0,0,0}n,T})\leq c_{2}\theta({\color[rgb]{0,0,0}n,T})$ ) for some sequences $\theta({\color[rgb]{0,0,0}n,T})$ and $\varpi({\color[rgb]{0,0,0}n,T})$ . Also, we denote by $\mathcal{N}(\epsilon,{\mathcal{F}},\|\cdot\|)$ the $\epsilon$ -covering number of ${\mathcal{F}}$ with respect to some metric $\|\cdot\|$ . We take $d$ as the dimension of a state vector $S$ .

4.1 Non-parametric Regressions with Exponentially $\beta$ -mixing Sequences

Recall that the constraint (25b) is merely a surrogate of the desired state-action balancing condition, due to the use of $\hat{g}^{\pi}(\cdot,\cdot,B_{k})$ , $k=1,\dots,K$ . To bound the surrogate error on the estimations of the weights and the policy value, we study the uniform convergence of the approximate projection. More generally, in Theorem 3 below, we obtain the error bound of $\hat{g}^{\pi}(\cdot,\cdot;B)$ uniformly over $B\in{\cal Q}$ , where $\mathcal{Q}$ is a class of functions defined on ${\cal S}\times{\cal A}$ of interest. The bound scales with both $n$ and $T$ . Later when we adopt Theorem 3, $\mathcal{Q}$ will be taken as a subset of the linear span of $\{B_{1},\dots,B_{K}\}$ . See Corollary 1 for more details. Theorem 3 requires the following assumption. Let $\|\cdot\|_{\mathrm{TV}}$ denote the total variation norm,

Assumption 4.

The following conditions hold.

(a)

The Markov chain $\{S_{t},A_{t}\}_{t\geq 0}$ has a unique stationary distribution $\mathbb{G}^{*}$ with density $p^{*}$ , and is geometrically ergodic, i.e., there exists a function $\phi(s,a)$ and constant $\kappa\in(0,1)$ such that, for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$ ,

\left\|\mathbb{G}^{b}_{t}(\cdot\mid(s,a))-\mathbb{G}^{*}(\cdot)\right\|_{\mathrm{TV}}\leq\phi(s,a)\kappa^{t},\qquad\forall t\geq 0,

where $\mathbb{G}_{t}^{b}(\cdot\mid(s,a))$ is the behavior policy induced conditioinal distribution of $(S_{t},A_{t})$ given $S_{0}=s$ and $A_{0}=a$ . Also, there exists a constant $C_{1}\ltx@label{C_{s}}>0$ such that $\int\phi(s,a)d\mathbb{G}_{0}(s,a)\leq C_{\ref*{C_s}}$ , where we recall that $\mathbb{G}_{0}$ is the initial distribution in the batch data.

(b)

The function class $\mathcal{Q}$ satisfies that $\|B\|_{\infty}\leq Q_{\max}$ for all $B\in\mathcal{Q}$ .
(c)

The function class $\mathcal{G}$ in (24) satisfies that $\|g\|_{\infty}\leq G_{\max}$ for all $g\in\mathcal{G}$ , and $g^{\pi}_{*}(\cdot,\cdot;B)\in\mathcal{G}$ for all $B\in\mathcal{Q}$ .

(d)

The regularization functional $J_{\mathcal{G}}$ in (24) is a pseudo norm. Also, $J_{\mathcal{Q}}$ (chosen in Theorem 3) is a pseudo norm. Let $\mathcal{Q}_{M}=\{B:B\in\mathcal{Q},J_{\mathcal{Q}}(B)\leq M\}$ and $\mathcal{G}_{M}=\{g:g\in\mathcal{G},J_{\mathcal{G}}(g)\leq M\}$ . There exist constants $C_{2}\ltx@label{entropy}>0$ and $\alpha\in(0,1)$ , such that for any $\epsilon,M>0$ ,

\max\left\{\log\mathcal{N}(\epsilon,\mathcal{G}_{M},\|\cdot\|_{\infty}),\log\mathcal{N}(\epsilon,\mathcal{Q}_{M},\|\cdot\|_{\infty})\right\}\leq C_{\ref*{entropy}}(M/\epsilon)^{2\alpha}.

Assumption 4(a) basically assumes the stationary distribution exists and the mixing rate is exponentially fast. This allows the convergence rate to scale well with $T$ . By a truncation argument (see the proof of Theorem 3), we can show that our estimation is almost equivalent to the nonparametric regression based on stationary and exponentially $\beta$ -mixing sequences. It is known that if a Markov chain is aperiodic and satisfies some drift condition in terms of a well-behaved non-negative measurable function, then it is geometrically ergodic, see [6] and Chapter 15 of [46] for a detailed characterization. The boundedness assumptions of $\mathcal{Q}$ and $\mathcal{G}$ in Assumptions 4(b) and 4(c) are used to simplify the proof and can be relaxed by a careful truncation argument. Similar assumptions are also adopted in [39]. Assumption 4(d) specifies the complexity of the spaces. These entropy assumptions are satisfied for common functional classes, such as RKHS and Sobolev spaces [22, 19, 20]. Take the Sobolev spaces as an example, $\alpha=d/(2q)$ , where $q$ is the number of continuous derivatives possessed by the functions in the corresponding space. With Assumption 4, we obtain the following uniform finite-sample error bound.

Theorem 3.

Suppose Assumption 4 holds. For any $0<\tau\leq 1/3$ and all sufficiently large ${\color[rgb]{0,0,0}nT}$ (i.e., either $n$ or $T$ is large enough), with probability at least $1-2\delta-1/({\color[rgb]{0,0,0}nT})$ , the following inequality holds for all $B\in\mathcal{Q}$ :

{\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right]^{2}\right\}}+\|\hat{g}^{\pi}(\cdot,\cdot;B)-g_{*}^{\pi}(\cdot,\cdot;B)\|_{{\color[rgb]{0,0,0}n,T}}^{2}+\mu J_{\mathcal{G}}^{2}\{\hat{g}^{\pi}(\cdot,\cdot;B)\}\\ \lesssim\mu\{1+J_{\mathcal{G}}^{2}(g^{\pi}_{*}(\cdot,\cdot;B))+J_{\mathcal{Q}}^{2}(B)\}+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,{\color[rgb]{0,0,0}nT}\})]^{1/\tau}}{{\color[rgb]{0,0,0}nT}},

(30)

where the leading constant in the above inequality depends on $Q_{\max}$ , $G_{\max}$ , $\kappa$ , and $C_{\ref*{C_s}}$ .

Further suppose $J_{\mathcal{G}}^{2}(g^{\pi}_{*}(\cdot,\cdot;B))$ and $J_{\mathcal{Q}}^{2}(B)$ are bounded. If we take $\delta=({\color[rgb]{0,0,0}nT})^{-1}$ , $\tau=(1+\alpha)\log(\log({\color[rgb]{0,0,0}nT}))/(\alpha\log({\color[rgb]{0,0,0}nT}))$ and $\mu\asymp({\color[rgb]{0,0,0}nT})^{-1/(1+\alpha)}(\log({\color[rgb]{0,0,0}nT}))^{(2+\alpha)/(1+\alpha)}$ , then for a sufficiently large ${\color[rgb]{0,0,0}nT}$ , the following inequality holds for all $B\in\mathcal{Q}$ with the probability at least $1-1/({\color[rgb]{0,0,0}nT})$ :

\mathbb{E}\{\hat{g}^{\pi}(\cdot,\cdot;B)-g_{*}^{\pi}(\cdot,\cdot;B)\}^{2}+\|\hat{g}^{\pi}(\cdot,\cdot;B)-g_{*}^{\pi}(\cdot,\cdot;B)\|_{{\color[rgb]{0,0,0}n,T}}^{2}\lesssim({\color[rgb]{0,0,0}nT})^{-\frac{1}{1+\alpha}}(\log({\color[rgb]{0,0,0}nT}))^{\frac{2+\alpha}{1+\alpha}}.

Next, we adapt Theorem 3 to our case. More specifically, we consider the setting when ${\mathcal{G}}$ in Assumption 4(c) is an RKHS. Due to space limitations, we list the corresponding assumption in the Supplementary Material (See Assumption 4’ in Section 4.1 of the Supplementary MaterialThen we have the following corollary.

Corollary 1.

Suppose Assumption 4’ (in the Supplementary Material) hold, and $\hat{g}^{\pi}(\cdot,\cdot;B_{k})$ , $k=1,\dots,K$ , are defined by (24) with the same tuning parameter $\mu$ . If $\mu\asymp({\color[rgb]{0,0,0}nT})^{-1/(1+\alpha)}(\log({\color[rgb]{0,0,0}nT}))^{(2+\alpha)/(1+\alpha)}$ , then for any ${\bm{\upsilon}=(\upsilon_{k})_{k=1}^{K}}\in\mathbb{R}^{K}$ such that $\sum_{k=1}^{K}\upsilon_{k}B_{k}\in\mathcal{Q}$ , $J_{\mathcal{Q}}(\sum_{k=1}^{K}\upsilon_{k}B_{k})<\infty$ and $g^{\pi}(\cdot,\cdot;\sum_{k=1}^{K}\upsilon_{k}B_{k})\in\mathcal{G}$ (where $K$ is allowed to grow with ${\color[rgb]{0,0,0}nT}$ ) and a sufficiently large ${\color[rgb]{0,0,0}nT}$ , the following inequality holds with the probability at least $1-1/({\color[rgb]{0,0,0}nT})$ :

\mathbb{E}\left\{\sum_{k=1}^{K}{\upsilon_{k}}\hat{g}^{\pi}(\cdot,\cdot;B_{k})-\sum_{k=1}^{K}{\upsilon_{k}}g_{*}^{\pi}(\cdot,\cdot;B_{k})\right\}^{2}+\left\|\sum_{k=1}^{K}{\upsilon_{k}}\hat{g}^{\pi}(\cdot,\cdot;B_{k})-\sum_{k=1}^{K}{\upsilon_{k}}g_{*}^{\pi}(\cdot,\cdot;B_{k})\right\|_{{\color[rgb]{0,0,0}n,T}}^{2}\\ \lesssim({\color[rgb]{0,0,0}nT})^{-\frac{1}{1+\alpha}}(\log({\color[rgb]{0,0,0}nT}))^{\frac{2+\alpha}{1+\alpha}}.

The proofs of Theorem 3 and Corollary 1 can be found in Section S2.3 of the Supplementary Material. From Theorem 3, by carefully choosing $\mu$ and $\tau$ , we can achieve the optimal nonparametric convergence rate that holds uniformly for all $B\in\mathcal{Q}$ up to a logarithmic factor, compared to the i.i.d. setting. More importantly, Theorem 3 does not require the initial distribution to be the stationary distribution, which can be unrealistic in practice but is often required in most existing results such as [16]. Thus our result is broadly applicable. Accordingly, Corollary 1 provides a tight bound (in the scaling of both $n$ and $T$ ) on the proposed approximate projection step (24), which leads to an accurate estimation of the target state-action balancing condition for the construction of the proposed weights.

4.2 Convergence Rates of Balancing Weights

With Theorem 3 and Corollary 1, we now derive the convergence of the proposed weights $\hat{\omega}^{\pi}$ . Define

\zeta_{{\color[rgb]{0,0,0}n,T}}=\left(({\color[rgb]{0,0,0}nT})^{-\frac{1}{1+\alpha}}(\log({\color[rgb]{0,0,0}nT}))^{\frac{2+\alpha}{1+\alpha}}\right)^{1/2}.

For any square and symmetric matrix $\bm{A}$ , $\lambda_{\max}(\bm{A})$ and $\lambda_{\min}(\bm{A})$ represents the maximum and minimum eigenvalues respectively. And we use $\|\cdot\|_{2}$ to denote the Euclidean norm of a vector. We will need the following assumption.

Assumption 5.

The following conditions hold.

(a)

There exist $r_{1}>1/2$ and $\bm{\lambda}^{*}\in\mathbb{R}^{K}$ such that the true weight function $\omega^{\pi}(s,a)$ satisfies $\sup_{s,a}|\omega^{\pi}(s,a)-\rho^{\prime}\{\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda}^{*}\}|\leq C_{3}\ltx@label{weightsbound}K^{-r_{1}}$ , where $C_{\ref*{weightsbound}}>0$ is a constant. Also, there exists a constant $C_{4}\ltx@label{weightsC1}>0$ such that $\sup_{s,a}|\omega^{\pi}(s,a)|\leq C_{\ref*{weightsC1}}$ .
(b)

The second derivative of $\rho$ defined in Theorem 2, i.e., $\rho^{\prime\prime}$ , is a positive and continuous function and $\rho^{\prime\prime}\geq C_{5}\ltx@label{weightssecond}$ for some constant $C_{\ref*{weightssecond}}>0$ .

(c)

There exist constants $C_{6}\ltx@label{weightsC2}>0$ and $C_{7}\ltx@label{weightsC4}>0$ such that $\sup_{s,a}\|\bm{B}_{K}(s,a)\|_{2}\leq C_{\ref*{weightsC2}}K^{1/2},$ and for every $T\geq 0$ ,

\left\{\begin{array}[]{l}\max\left\{\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\},\mathbb{E}_{(S,A)\sim\mathbb{G}^{*}}\left[\frac{\bar{p}^{b}_{T}(S,A)}{p^{*}(S,A)}\right]^{2}\right\}\leq C_{\ref*{weightsC4}}\\ \lambda_{\max}\left\{\mathbb{E}_{(S,A)\sim\mathbb{G^{*}}}\left[\bm{L}_{K}(S,A)\bm{L}_{K}(S,A)^{\intercal}\right]\right\}\leq C_{\ref*{weightsC4}}.\end{array}\right.

(d)

There exists a quantity $\psi(K)>0$ depending on $K$ such that

\lambda_{\min}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\geq\psi(K).

(e)

$\sum_{k=1}^{K}\bm{\lambda}^{\ast}_{k}B_{k}\in\mathcal{Q}$ , $g^{\pi}_{*}(\cdot,\cdot;\sum_{k=1}^{K}\bm{\lambda}^{\ast}_{k}B_{k})\in\mathcal{G}$ , where $\mathcal{Q}$ and $\mathcal{G}$ are function classes considered in Corollary 1 (also see Section S2.3 of the Supplementary Material). In addition, the same tuning parameter $\mu\asymp({\color[rgb]{0,0,0}nT})^{-1/(1+\alpha)}(\log({\color[rgb]{0,0,0}nT}))^{(2+\alpha)/(1+\alpha)}$ is adopted when estimating $g^{\pi}_{*}(\cdot,\cdot;B_{k})$ , $k=1,\dots,K$ . For any vector $\bm{a}$ that satisfies $\|\sum_{k=1}^{K}\alpha_{k}B_{k}\|_{\infty}\leq Q_{\max}$ , we have $\sum_{k=1}^{K}\alpha_{k}B_{k}\in\mathcal{Q}$ and $g(\cdot,\cdot;\sum_{k=1}^{K}\alpha_{k}B_{k})\in\mathcal{G}$ .
$K=\scalebox{0.7}{$\mathcal{O}$}(\zeta_{{\color[rgb]{0,0,0}n,T}}^{-2})$ .
(f)

$\|\bm{\delta}_{K}\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}\left[\{\psi(K)\}^{-1}\left({\sqrt{K}\log({\color[rgb]{0,0,0}nT})}/{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\right]$ .

Assumption 5(a) specifies the requirement on the uniform approximation error of the true weight function $\omega^{\pi}$ . The boundedness of $\omega^{\pi}$ can be guaranteed if the average visitation density $\bar{p}_{T}^{b}$ is bounded away from 0 and the policy-induced discounted visitation probability $d^{\pi}$ is bounded above. This overlapping condition is commonly assumed in the literature of the ATE and RL [e.g. 74, 68, 70]. For Assumption 5(b) on $\rho^{\prime\prime}$ , by the relationship between $\rho$ and $h$ detailed in Theorem 2, many convex choices of $h$ adopted in (25) will result in a $\rho$ that satisfies this condition, e.g., $h(x)=(x-1)^{2}$ . For Assumption 5(c), the uniform bound for $\|\bm{B}_{K}\|_{2}$ is a mild technical condition on the basis functions $B_{k}$ , $k=1,\dots,K$ . It is satisfied by many classes of basis functions, including the regression spline, the trigonometric polynomial, and the wavelet basis [51, 23, 8, 2]. See the same assumption in [13] (Condition 6 of their Assumption 4.1). As for the largest eigenvalue condition, we verify that such condition holds for common bases such as tensor-product B-spline basis and tensor-product wavelet basis in Lemma S5 of the Supplementary Material. As for Assumption 5(d), $\psi(K)$ is allowed to depend on $K$ in our analysis as opposed to the existing study such as [62]. While we allow $\psi(K)$ to diminish as $K$ grows, we later show that $\psi(K)$ can be strictly bounded below if we are willing to make further relatively minor assumptions. See detailed discussion in Section 4.4. Assumption 5(e) is a mild condition requiring the best approximation $\sum_{k=1}^{K}\bm{\lambda}^{\ast}_{k}B_{k}$ in $\mathcal{Q}$ , and its projection in $\mathcal{G}$ . Now we can show the convergence of the proposed balancing weights.

Theorem 4.

Suppose Assumptions 4’ (in the Supplementary Material) and 5 hold. If we further assume that $\{\psi(K)\}^{-1}\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}=\scalebox{0.7}{$\mathcal{O}$}(1)$ as ${\color[rgb]{0,0,0}nT}\rightarrow+\infty$ , then

\max\left\{\left\|\hat{\omega}^{\pi}-\omega^{\pi}\right\|_{{\color[rgb]{0,0,0}n,T}},{\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\hat{\omega}^{\pi}(S_{t},A_{t})-\omega^{\pi}(S_{t},A_{t})\right]^{2}\right\}^{1/2}}\right\}=\\ \mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left[\{\psi(K)\}^{-1}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\right]=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1).

The proof of Theorem 4 can be found in Section S2.4 of the Supplementary Material. Theorem 4 gives the convergence rate of the proposed balancing weights in terms of both the empirical $L_{2}$ and the $L_{2}$ metrics. Note that the rate implies that as long as either $n$ or $T$ goes to infinity, the proposed weight estimates converge to the true ratio functions. To the best of our knowledge, this is the first consistent result of the ratio function in the scaling of both $n$ and $T$ under the discounted infinite-horizon setting.

4.3 Estimation Error and Statistical Efficiency

Based on Theorems 3 and 4, we can derive an error bound for $\hat{\cal V}(\pi)$ .

Assumption 6.

Suppose the following conditions hold.

(a)

There exists a $\bm{\beta}\in\mathbb{R}^{K}$ such that $\sup_{s,a}|\Delta_{Q}(s,a)|=\sup_{s,a}|Q^{\pi}(s,a)-\bm{B}_{K}(s,a)^{\intercal}\bm{\beta}|\leq C_{8}\ltx@label{Qconst1}K^{-r_{2}}$ for some constants $r_{2}>1/2$ and $C_{\ref*{Qconst1}}>0$ . Also, $\sum_{k=1}^{K}\bm{\beta}_{k}B_{k}\in\mathcal{Q}$ , $g^{*}(\cdot,\cdot;\sum_{k=1}^{K}\bm{\beta}_{k}B_{k})\in\mathcal{G}$ , where $\mathcal{Q}$ and $\mathcal{G}$ are function classes considered in Corollary 1. (Also see Section S2.3 of the Supplementary Material).
(b)

The errors $\epsilon_{i,t}:=R_{i,t}-\mathbb{E}(R_{i,t}\mid S_{i,t},A_{i,t})$ , $i=1,\dots,n,t=0,\dots,T-1$ , are independent mean-zero subgaussian random variables.

Assumption 6(a) is a regularity condition for $Q^{\pi}$ . It is satisfied by letting $r_{2}=q_{Q}/d$ , where $q_{Q}$ is the maximum number of continuous derivatives for $Q^{\pi}$ with respect to $S$ among all action levels, and choosing $B_{K}$ as basis functions such as splines and power series if $Q^{\pi}$ is defined over a compact domain. Assumption 6(b) is a mild condition for the error of the reward. In fact, this assumption can be relaxed to allow errors themselves to be dependent. Note that the estimated weight function $\hat{\omega}^{\pi}$ only depends on $\{S_{i,t},A_{i,t}:i=1,\dots,n,t=0,\dots,T-1\}$ , which is independent of $\{\epsilon_{i,t}:i=1,\dots,n,t=0,\dots,T-1\}$ . The proof of the convergence of $\hat{\mathcal{V}}(\pi)$ stated in the following theorems is based on a conditioning argument that separates the effects of the weights and the errors. Standard techniques that deal with weighted averages of dependent random variables can be adopted to extend the current results (e.g., Theorem 5) to those dependent settings. For instance, If there are some weak autocorrelations among $\epsilon_{i,t}$ , we are still able to obtain results in Theorems 5 and 6.

Theorem 5.

Suppose Assumptions 4’ (in the Supplementary Material), 5 and 6 and hold. Furthermore, assume $\{\psi(K)\}^{-1}\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}=\scalebox{0.7}{$\mathcal{O}$}(1)$ . Then we have

\left|\hat{\cal V}(\pi)-{\cal V}(\pi)\right|=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left\{{\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}\right.\\ \left.+\{\psi(K)\}^{-1}\left(K^{-r_{1}}\zeta_{{\color[rgb]{0,0,0}n,T}}+K^{1/2-r_{2}}\zeta_{{\color[rgb]{0,0,0}n,T}}+K^{-r_{1}-r_{2}}+\sqrt{K}\zeta^{2}_{{\color[rgb]{0,0,0}n,T}}\right)+\zeta_{{\color[rgb]{0,0,0}n,T}}\right\},

where $\circ$ refers to the element-wise product between two vectors.

Theorem 5 can be proved by the similar arguments in the proof of Theorem 6. To obtain the sharp convergence rate, based on this theorem, one need to tune $K$ accordingly. As such, the eigenvalue bound $\psi(K)$ specified in Assumption 5(d) becomes crucial. Indeed, we can show that $\psi(K)$ is strictly lower bounded by a positive constant independent of $K$ under some mild conditions. We defer the detailed discussion about characterizing $\psi(K)$ in Section 4.4. In the following, we establish the bounds when $\psi(K)$ is bounded below.

Theorem 6.

Suppose Assumptions 4’ (in the Supplementary Material), 5(a)–(g) and 6 hold. Also, assume that $\psi(K)\geq C_{9}\ltx@label{weightsC3}$ for some constant $C_{\ref*{weightsC3}}>0$ .

(i)

If ${\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})$ and $K^{-1}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}}^{1/(r_{1}+r_{2})})$ , we have

\displaystyle\left|\hat{\cal V}(\pi)-{\cal V}(\pi)\right|=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}}).

(31)

In particular, we can take $K\asymp\xi_{{\color[rgb]{0,0,0}n,T}}^{-a}$ for any $a$ such that $1/(r_{1}+r_{2})<a<2$ .

(ii)

Assume ${\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2})$ , $r_{1}\geq 1$ , $r_{2}>1$ , $K=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1}\xi_{{\color[rgb]{0,0,0}n,T}}^{-4})$ , $K^{-1}=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1/(2r_{1})}\zeta_{{\color[rgb]{0,0,0}n,T}}^{-1/r_{1}})$ , $K^{-1}=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1/(2r_{2}-1)}\zeta_{{\color[rgb]{0,0,0}n,T}}^{-1/(r_{2}-1/2)})$ , $K^{-r_{1}-r_{2}}=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2})$ , and $\omega^{\pi}\in\mathcal{G}$ . Take

\sigma^{2}=\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\{\omega^{\pi}(S_{t},A_{t})\left(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})-Q^{\pi}(S_{t},A_{t})\right)\right\}^{2}.

Then, as either $n\rightarrow\infty$ or $T\rightarrow\infty$ , we have

\displaystyle\frac{\sqrt{nT}}{\sigma}\left\{\hat{\cal V}(\pi)-{\cal V}(\pi)\right\}\xrightarrow[]{d}N(0,1).

(32)

In particular, if $r_{1}=1$ , then we can take $K\asymp({\color[rgb]{0,0,0}nT})^{a}$ for any $a$ such that $1/(6r_{2}-3)<a<1/3$ so that the above results hold.

The proof of this theorem can be found in Section S2.5 of the Supplementary Material. Note that Theorem 6 requires $\psi(K)$ to be lower bounded by a positive constant, which can be satisfied under the conditions in Assumption 7. The constraints for $\bm{\delta}_{K}$ in Theorem 6(i) and (ii) are stronger than that in Assumption 5, which lead to the final desired $\zeta_{{\color[rgb]{0,0,0}n,T}}$ -consistency and $\sqrt{{\color[rgb]{0,0,0}nT}}$ -consistency of our weighted estimator in cases (i) and (ii) respectively. Compared with case (i), the constraints for $K$ in case (ii) are more restrictive so that the bias is asymptotically negligible and thus $\hat{\cal V}(\pi)$ is $\sqrt{{\color[rgb]{0,0,0}nT}}$ -consistent. When $r_{1}\geq 1$ and $r_{2}>1$ , the existence of $K$ is guaranteed. The additional assumption of $\omega^{\pi}$ , i.e., $\omega^{\pi}\in{\cal G}$ , is a mild assumption, which allows the bias term to diminish asymptotically.

If $T$ is fixed, one can show that $\sigma^{2}/T$ is indeed the semi-parametric efficiency bound in the standard i.i.d. setting. When both $n$ and $T$ are allowed to go to infinity, $\sigma^{2}$ becomes $\mathbb{E}^{*}\{\omega^{\pi}(S,A)(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S^{\prime})Q^{\pi}(S^{\prime},a^{\prime})-Q^{\pi}(S,A))\}^{2},$ where $\mathbb{E}^{*}$ denotes the expectation with respect to the stationary measure induced by the behavior policy, i.e., $\mathbb{G}^{*}$ . As shown in [30], $\sigma^{2}$ is also the statistical efficiency bound in the notion of [33]. Note that our results do not require that the data come from stationary distribution, which is however needed in [30].

Finally, we remark that to the best of our knowledge, only two prior works establish the convergence rates of the policy value estimators under some non-parametric models in the scaling of both $n$ and $T$ . One is [62], which directly estimates $Q$ -function. The underlying analysis does not require the stationary assumption for the data generating process. However, they did not show that their estimator can achieve the statistical efficiency. In addition, their conditions require that the initial distribution of the data is bounded away from zero, which we do not require. The other one is [30]. This work shows the convergence of their estimator when $T\rightarrow\infty$ . In their theoretical results, stationary assumption is needed. In order to obtain the efficiency, this work requires the adoption of cross-fitting (sample splitting) for the Q-function and ratio (weight) function estimations. However, it is unclear how to perform an efficient cross-fitting due to the existence of temporal dependence among the batch data.

4.4 Lower boundedness for the minimal eigenvalue

In this section, we discuss $\psi(K)$ defined in Assumption 5(d) in details. For the notational simplicity, we take $\overline{}\mbox{Pr}$ and $\bar{\mathbb{E}}$ as the probability and expectation with respect to the average visitation distribution $\bar{p}_{T}^{b}$ . We also write

d^{\pi}(s^{\prime},a^{\prime}\mid s,a)=\sum_{t=0}^{\infty}\gamma^{t}p_{t}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)

as the conditional discounted visitation probability. Define the operator $\mathcal{P}^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}\times\mathcal{A}$ by

(\mathcal{P}^{\pi}f)(s,a)=\bar{\mathbb{E}}\left[\sum_{a^{\prime}\in\mathcal{A}}f(S^{\prime},a^{\prime})\pi(a^{\prime}\mid S^{\prime})\mid S=s,A=a\right]

for any function $f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ , and denote $I:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\rightarrow\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as the identity operator.

In Section C.1 of [62], a sufficient condition for the lower boundedness of $\psi(K)$ is provided. They argue that, under some boundedness conditions on the average visitation probability, the minimal eigenvalue is lower bounded by a constant independent of $K$ as long as $\gamma$ is small enough. (In fact, we show that this sufficient condition can be further relaxed; see Corollary 2 for details.) However, in practice, we may not know the distance between the target policy and the behavior policy in advance, or be able to choose a reasonably small $\gamma$ that reflects the desired emphasis of the long-term rewards. More importantly, choosing $\gamma$ close to $1$ is often preferred in many applications as we discussed before. Despite its importance, the theoretical property of non-parametric OPE under this setting is largely uncharted territory in the current state of the literature. As a result, it is important to understand the behavior of $\psi(K)$ for any $0\leq\gamma<1$ and target policy $\pi$ .

In the following, we focus on the general operator $I-\gamma\mathcal{P}^{\pi}$ and study the (squared) minimal eigenvalue of it:

\displaystyle\Upsilon:=\inf_{\{f:\bar{\mathbb{E}}f^{2}(S,A)\geq 1\}}\left[{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})f(S,A)\right\}^{2}}\right].

(33)

To see the relationship between $\Upsilon$ and $\psi(K)$ , we can take $f(s,a)=\bm{B}_{K}^{\intercal}(s,a)\bm{\alpha}$ for some $\bm{\alpha}\in\mathbb{R}^{K}$ in (33). Then we have

	$\displaystyle\psi(K)$	$\displaystyle=\inf_{\\|\bm{\alpha}\\|_{2}=1}\left[{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}}\right]$
		$\displaystyle=\inf_{\\|\bm{\alpha}\\|_{2}=1}\frac{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}}{\bar{\mathbb{E}}\left\{\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}}{\bar{\mathbb{E}}\left\{\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}}.$
		$\displaystyle\geq\Upsilon\lambda_{\min}\left\{\bar{\mathbb{E}}\bm{B}_{K}(S,A)\bm{B}^{\intercal}_{K}(S,A)\right\}.$

In Theorem 7, we propose a necessary and sufficient condition for bounding the minimal eigenvalue of the operator $I-\gamma\mathcal{P}^{\pi}$ that works for any $\pi$ and $\gamma$ . When combined with the standard eigenvalue condition on the basis $\bm{B}_{K}$ , it allows a comprehensive characterization of the lower boundedness of $\psi(K)$ .

Theorem 7.

$\Upsilon$ is lower bounded by a positive constant if and only if

\chi:=\sup_{\{f:\bar{\mathbb{E}}f^{2}(S,A)\leq 1\}}{\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f(S^{\prime},A^{\prime})\right\}\right]^{2}}<+\infty.

In this case, $\Upsilon=(1-\gamma)^{2}/\chi$ .

Theorem 7 allows us to avoid directly analyzing the minimal eigenvalue of $I-\mathcal{P}^{\pi}$ , but to study the upper bound of $\chi$ instead, which is much easier to deal with. We can view $\chi$ as a criterion to examine the difficulty of OPE problems. When $\chi$ gets larger, the OPE problem becomes more difficult. The value of $\chi$ depends on many components, including the target policy $\pi$ , the behavior policy $\pi_{b}$ , the discount factor $\gamma$ and the horizon of the observed data $T$ . Next, we focus on the term $\chi$ and provide some sufficient conditions for it to be upper bounded. Take

{\bar{d}}^{\pi}(s^{\prime},a^{\prime})=\bar{\mathbb{E}}\left[d^{\pi}(s^{\prime},a^{\prime}\mid S,A)\right].

Assumption 7.

Suppose the following conditions hold.

(a)

The average visitation probability (density) $\bar{p}_{T}^{b}(s,a)$ is lower bounded by a constant $p_{\min}>0$ and upper bounded by a constant $p_{\max,1}$ .
(b)

The transition probability under the target policy $q(s^{\prime},a^{\prime}\mid s,a):=\pi(a^{\prime}\mid s^{\prime})p(s^{\prime}\mid s,a)$ is upper bounded by some constant $p_{\max,2}$ for all $(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}$ .

Corollary 2.

If there exists a constant $C_{\ref*{cnt:denbound}}$ such that

\displaystyle\varrho:=\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\bar{d}^{\pi}(s,a)}{\bar{p}_{T}^{b}(s,a)}\leq C_{10}\ltx@label{cnt:denbound},

(34)

then we have

\chi\leq C_{\ref*{cnt:denbound}}\quad\mbox{and}\quad\Upsilon\geq\frac{(1-\gamma)^{2}}{{C_{\ref*{cnt:denbound}}}}.

In addition, if we assume Assumption 7, then we have

\chi\leq\frac{p_{\max}}{p_{\min}},\qquad\Upsilon\geq{(1-\gamma)^{2}}\frac{p_{\min}}{p_{\max}},

where $p_{\max}=\max\{p_{\max,1},p_{\max,2}\}$ . Finally, $\psi(K)\gtrsim{(1-\gamma)^{2}}\frac{p_{\min}}{p_{\max}}$ under an additional condition that

\displaystyle\lambda_{\min}\left\{\bar{\mathbb{E}}\left[\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal})\right]\right\}\gtrsim 1.

(35)

In Corollary 2, (34) provides a sufficient condition for $\chi$ to be upper bounded (or equivalently, for $\Upsilon$ to be lower bounded) that works for general settings without any restriction on $\gamma$ or the target policy $\pi$ . The second part of Corollary 2 provides a specific characterization for $\varrho$ under Assumption 7, in which we require the average visitation probability $\bar{p}_{T}^{b}$ to be lower and upper bounded. This coverage assumption is very common in RL literature [e.g. 54, 30, 62]. Compared to [62], we provide an explicit bound with respect to $p_{\min}$ and $p_{\max}$ . More importantly, we do not impose any further assumptions on the target policy $\pi$ or the discount factor $\gamma$ , in order to show the lower boundedness of $\psi(K)$ . In other words, we show that such boundedness holds uniformly for any target policy $\pi$ and $\gamma$ . We note that there is a parallel work regarding the well-poseness minimax optimal rates of nonparametric Q-function estimation in OPE [10]. They provide a sufficient condition for the lower boundedness of $\Upsilon$ . Their result is similar to the one specified in Corollary 2, and require similar boundedness conditions as in Assumption 7. See Assumption 4(a) and Theorem 1 in [10] for more details. Compared to their bound:

\Upsilon\gtrsim\frac{(1-\gamma)^{2}p_{\min}}{p_{\max}\left(1+\gamma^{2}\frac{p_{\max}}{p_{\min}}\right)},

Corollary 2 provides a sharper dependence with respect to $\gamma$ , $p_{\min}$ and $p_{\max}$ . Apart from Corollary 2, we note that our general result in Theorem 7 does not require any boundedness condition on visitation and transition probability as in Assumption 7 and [10], and, more importantly, provides a necessary and sufficient condition for the boundedness of $\Upsilon$ .

Finally, we remark that the additional assumption (35) is satisfied for common bases such as tensor-product B-spline basis and tensor-product wavelet basis. See Lemma S5 of the Supplementary Material.

5 Simulation Study

We conduct a simulation study to investigate the finite-sample performance of the proposed estimator. We adopt the similar simulation settings as in [42], [39] and [62]. Specifically, the data generative model is given as follows. The state variables are two-dimensional, i.e., $S_{i,t}=(S^{(1)}_{i,t},S^{(2)}_{i,t})$ for $0\leq t\leq T$ and $1\leq i\leq n$ , while the action is binary, i.e., ${\cal A}=\{0,1\}$ . The initial state follows the standard bivariate normal distribution. The transition dynamics are given by $S^{(1)}_{i,t+1}=(3/4)(2A_{i,t}-1)S^{(1)}_{i,t}+\epsilon^{(1)}_{i,t}$ , and $S^{(2)}_{i,t+1}=(3/4)(1-2A_{i,t})S^{(2)}_{i,t}+\epsilon^{(2)}_{i,t}$ , where $\epsilon^{(1)}_{i,t}$ and $\epsilon^{(2)}_{i,t}$ are independent normal random variables with mean 0 and variance 0.25. The behavior policy independently follows a Bernoulli distribution with mean $1/2$ . The immediate reward $R_{i,t}$ is defined as $R_{i,t}=2S^{(1)}_{i,t+1}+S^{(2)}_{i,t+1}-(1/4)(2A_{i,t}-1)$ . We use the initial state distribution as the reference distribution $\mathbb{G}$ and set $\gamma$ to $0.9$ . We evaluate the following four different target policies.

(a)

$\pi_{1}(a\mid s)=1,\,a\in\mathcal{A},s\in\mathcal{S}.$ This is the “always-treat” policy used in the simulation study of [39], where the chosen action is always 1, and does not depend on the state variable.

(b)

\pi_{2}(a\mid s)=\begin{cases}1&\text{if $s^{(1)}\leq 0$ and $s^{(2)}\leq 0$,}\\ 0&\text{otherwise,}\end{cases},\quad a\in\mathcal{A},s=(s^{(1)},s^{(2)})\in\mathcal{S}.

This policy is a discontinuous function with respect to the state variable. The same type of policy is used in the simulation study of [62].

(c)

$\pi_{3}(a\mid s)=\exp\{-(s^{(1)}+s^{(2)})\},\,a\in\mathcal{A},s=(s^{(1)},s^{(2)})\in\mathcal{S}.$ This policy is smooth with respect to the state variable.
(d)

$\pi_{4}(a\mid s)=0.5,,a\in\mathcal{A},s\in\mathcal{S}$ . This policy is the same as the behavior one. Note that the observed data only contain finite horizon of decision points, i.e., $T<\infty$ , while the target here is the policy value under infinite horizon.

For each target policy, we consider four different combinations of $n$ and $T$ , i.e., $(n,T)=(40,25),(80,25)$ , $(40,50)$ and $(80,50)$ . The true policy values ${\cal V}(\pi_{k})$ , $k=1,\dots,4$ , were computed approximately by the Monte Carlo method. Specifically, for every $\pi_{k}$ , we simulate $\tilde{n}=10^{5}$ independent trajectories of length $\tilde{T}=5000$ , with initial states drawn from $\mathbb{G}$ . Then we approximate ${\cal V}(\pi_{k})$ by $(1-\gamma){\tilde{n}}^{-1}\sum_{t=0}^{\tilde{T}-1}\gamma^{t}R_{i,t}$ .

Due to space limitations, the implementation details of the proposed method (ProjBalance) are reported in Section S1.2 of the Supplementary Material. For comparison, we include the following estimators: (1) VL: the estimator from [42]; (2) SAVE: the estimator from [62]; (3) FQE: the fitted Q-evaluation estimator developed in [36] where regression problems are solved using random forest models; (4) IS: importance sampling estimator from [54]; (5) MINIMAX: minimax weight learning method from [68]; (6) DR: double reinforcement learning method considered in [28]. As suggested by [42], we implemented VL with Gaussian basis functions since they offer the highest flexibility. For SAVE, we use the same basis function $\bm{B}_{K}$ as in the proposed weighted estimator to estimate $Q^{\pi}$ . Note that, to compute the confidence interval for the proposed weighted estimator, we use the estimate of $Q^{\pi}$ from SAVE. Construction of our confidence interval can be found in Section Section S1.2 of the Supplementary Material. The implementation details of FQE, IS, MINIMAX and DR can be found in [60].

Table 1 shows the mean squared error (MSE) and median squared error (MeSE) of the above estimators, as well as the empirical coverage probabilities (ECP) and average length (AL) of their $95\%$ confidence intervals, over 500 simulated data sets when $(n,T)=(40,50)$ . The results for $(n,T)=(40,25)$ , $(n,T)=(80,25)$ and $(n,T)=(80,50)$ can be found in Section S1.3 of the Supplementary Material. Overall, the proposed estimator (ProjBalance) shows competitive performances. Other weighted estimators such as IS and MINIMAX suffer from instability issues and produce inferior results compared with other methods. Also, DR is influenced by the unstable weights from IS and does not perform well in these settings.

Specifically, for Scenario (a), VL in general has the smallest MSE and MeSE. The performances of ProjBalance and VL are close, while that of SAVE and FQE are worse. For Scenario (b), SAVE is the best when the sample size is relatively large, while the performance of ProjBalance is close to that of SAVE. However, when $n$ and $T$ are relatively small, SAVE produces some extreme estimates, as seen from the notable difference between MSE and MeSE. See Tables 1 and S2 in the Supplementary Material. In contrast, the results of ProjBalance remain stable for small $n$ and $T$ , and have comparable performance to the best estimator FQE in these settings. As for Scenario (c), ProjBalance and VL perform similarly, and are better than SAVE and FQE. ProjBalance always has the smallest MSE and MeSE when the target policy is $\pi_{4}$ . As for ECP and AL, it seems that the confidence intervals of ProjBalance tend to have lower coverage than the target level 95%, especially when the target policy is $\pi_{2}$ . We hypothesize that this under-coverage phenomenon is due to the regularization of weights in (25a), which affects the variance estimation in (32). Since this is beyond the scope of the paper, we leave it for future study. In practice, we recommend multiplying a factor to the length of our confidence interval (CI) in order to relieve this under-coverage issue. In this simulation study, we choose a constant factor $1.2$ to obtain adjusted intervals. The ECPs and ALs for adjusted confidence intervals are provided in all tables. In general, our method performs robustly and satisfactorily in terms of the coverage and average length of the confidence interval. In the Supplementary Material, we also evaluate the performance of the above methods on the Cartpole environment, which can be obtained from OpenAI Gym [5] and has been frequently considered in the computer science literature. Overall, the performance of ProjBalance is appealing, See Section S1.5 of the Supplementary Material for more details.

Table 1: Simulation results for four different target policies when

n=40

and

T=50

. The MSE values (with standard errors in parentheses), MeSE values, ECP, and AL (with standard errors in parentheses) are provided for three estimators. For ProjBalance, ECPs and ALs for the adjusted confidence interval are provided after the slashes in the ECP and AL columns.

Target	methods	MSE ( $\times 1000$ )	MeSE ( $\times 1000$ )	ECP	AL ( $\times 100$ )
$\pi_{1}$	ProjBalance	8.76 (0.527)	3.75	0.96 / 0.99	38.93 (0.258) / 46.71 (0.310)
	VL	7.71 (0.461)	3.57	0.96	36.25 (0.159)
	SAVE	10.2 (0.614)	4.32	0.96	42.28 ( 0.932)
	FQE	10.99 (0.744)	4.40	_	_
	minimax	34.53 (3.184)	14.04	_	_
	DR	151.87 (16.670)	22.84	_	_
	IS	1040.89 (270.752)	53.94	_	_
$\pi_{2}$	ProjBalance	4.29 (0.356)	1.75	0.91 / 0.94	23.57 (2.130) / 28.28 (2.556)
	VL	6.61 (0.369)	3.50	0.84	24.18 (0.142)
	SAVE	36.90 (33.300)	1.52	0.94	439.90 (415.277)
	FQE	3.93 (0.242)	1.87	_	_
	minimax	120.92 (1.276)	118.98	_	_
	DR	92.33 (6.212)	16.283	_	_
	IS	1111.542 (100.934)	327.248	_	_
$\pi_{3}$	ProjBalance	2.41 (0.153)	1.06	0.93 / 0.97	18.47 (0.125) / 22.17 (0.150)
	VL	2.45 (0.162)	1.14	0.98	23.10 (0.158)
	SAVE	2.72 (0.173)	1.18	0.94	21.43 (1.717)
	FQE	4.66 (0.307)	2.30	_	_
	minimax	37.11 (1.415)	31.21	_	_
	DR	49.20 (34.761)	2.36	_	_
	IS	38.12 (4.839)	13.97	_	_
$\pi_{4}$	ProjBalance	0.67 (0.044)	0.32	0.93 / 0.97	9.70 (0.021) / 11.64 (0.025)
	VL	0.77 (0.053)	0.34	1.00	17.00 (0.031)
	SAVE	0.72 (0.047)	0.35	0.93	10.03 (0.013)
	FQE	1.70 (0.103)	0.81	_	_
	minimax	1.62 (0.104)	0.73	_	_
	DR	1.35 (0.087)	0.61	_	_
	IS	4.78 (0.290)	2.27	_	_

6 Real Data Application

In this section we apply the methods mentioned in Section 5 that provide confidence intervals (ProjBalance, VL and SAVE) to the OhioT1DM dataset [44] obtained from the Ohio University. This dataset contains approximately eight weeks’ records of CGM blood glucose levels, insulin doses and self-reported life-event data for each of six subjects with type 1 diabetes. Following [62], we divide the trajectory of every subject into segments of three-hour spans and constructed the state variable $S_{i,t}=(S^{(1)}_{i,t},S^{(2)}_{i,t},S^{(3)}_{i,t})$ as follows. First $S^{(1)}_{i,t}$ is the average CGM glucose level over the three-hour interval $[t-1,t)$ . Next, $S^{(2)}_{i,t}$ is constructed based on the $i$ -th-subject’s self-reported time and the corresponding carbohydrate estimate for the meal. More specifically, $S_{i,t}^{(2)}=\sum_{j=1}^{J}\mathrm{CE}_{j}\gamma_{c}^{36(t_{j}-t+1)},$ where $\mathrm{CE}_{1},\mathrm{CE}_{2},\dots,\mathrm{CE}_{J}$ are carbohydrate estimates for the $i$ -th-subject’s meals at times $t_{1},t_{2},\dots,t_{J}\in[t-1,t)$ , and $\gamma_{c}$ is a 5-minute decay rate. Here we set $\gamma_{c}=0.5$ . Last, $S_{i,t}^{(3)}$ is the average of the basal rate during the three-hour interval. The action variable $A_{i,t}$ is binarized according to the amount of insulin injected. In particular, we set $A_{i,t}=1$ when the total amount of insulin delivered to the $i$ -th subject is larger than one unit during the three-hour interval. Otherwise, we set $A_{i,t}=0$ . In the data, the time series of glucose levels and the life-event data are not perfectly overlapped within the same time frame. So we remove several boundary points of time series to ensure that all state variables have records in the same time frame. According to the Index of Glycemic Control (IGC) [58], the immediate reward $R_{i,t}$ is constructed as

R_{i,t}=\begin{cases}-\frac{1}{30}(80-S^{(1)}_{i,t+1})^{2},&\text{if }S^{(1)}_{i,t+1}<80;\\ 0,&\text{if }80\leq S^{(1)}_{i,t+1}<140;\\ -\frac{1}{30}(S^{(1)}_{i,t+1}-140)^{1.35},&\text{otherwise }.\end{cases}

Similar to the simulation study, we set the discount factor $\gamma=0.9$ . In addition, we study six reference distributions $\mathbb{G}$ , where each of them is taken as the point mass at the initial state of a subject. We evaluate the aforementioned three estimators via two policies — the always-treat policy $\pi_{1}(a\mid s)\equiv 1$ and the never-treat policy $\pi_{0}(a\mid s)\equiv 0$ since the estimated optimal policy based on [62] is very close to the always-treat policy. A similar discovery has also been observed in [42]. Therefore, we expect to see that ${\cal V}(\pi_{1})>{\cal V}(\pi_{0})$ .

Refer to caption — Figure 1: Offline policy evaluation (OPE) estimates as well as the associated 95% confidence intervals on the OhioT1DM dataset under always-treat and never-treat policies. The six sub-figures displays these quantities when the reference distributions are the initial state variable of six patients respectively. The adjusted confidence intervals for ProjBalance are also included.

Figure 1 shows the point estimates and associated confidence intervals of ${\cal V}(\pi_{1})$ and ${\cal V}(\pi_{0})$ for ProjBalance, VL and SAVE. To adjust for the under-coverage issue we observe in the simulation study, we also calculate the adjusted confidence interval for ProjBalance. In most scenarios, it can be observed that estimated values under the always-treat policy are larger than that of the never-treat policy, which aligns with our expectation. Note that the results of VL do not show a distinctive difference between the two policies compared with other methods. For SAVE, due to the invertibility issue (the matrix $\hat{\Sigma}_{\pi}$ in [62] is numerically low-rank under the never-treat policy), the confidence interval for the never-treat policy is surprisingly large. Even though the confidence intervals of our proposed estimators rely on the Q-function estimated by SAVE (where the point estimate of the Q-function does not involve inverting the problematic matrix), the confidence interval does not inherit the instability issue.

7 Acknowledgements

Wong’s research was partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.

Supplement to “Projected State-action Balancing Weights for Offline Reinforcement Learning”

S1 Additional Implementation Details and Numerical Results

S1.1 Algorithm

We present the algorithm outline for the proposed estimator in Algorithm 1.

Input: Observed trajectories

{\cal D}_{n}=\{\{S_{i,t},A_{i,t},R_{i,t}\}_{t=0}^{T-1}\cup S_{i,T}\}_{i=1}^{n}

; number of basis functions

K

; a grid of tuning parameters

\mu_{l}

l=1,\dots,L

; function

h

; basis function

B_{k}

k=1,\dots,K

; reference distribution:

\mathbb{G}

1 Calculate

\mathbb{E}_{S\sim\mathbb{G}}\{\sum_{a\in\mathcal{A}}\pi(a\mid S)B_{k}(S,a)\}

k=1,\dots,K

2Divide

{\cal D}_{n}

into the training data

{\cal D}_{1}

and validation data

{\cal D}_{2}

. for $l=1,\dots,L$ do

3 for $k=0,1,\dots,K$ do

4 Compute the solution of (24) (i.e.,

\hat{g}^{\pi}_{l}(\cdot,\cdot,B_{k})

) by taking

\mu=\mu_{l}

{\cal D}_{1}

5 end for

6 Calculate the validation error

\sum_{k=1}^{K}\sum_{(i,t)\in{\cal D}_{2}}\{\hat{g}^{\pi}_{l}(S_{i,t},A_{i,t},B_{k})-\sum_{a\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})\}^{2}/\{(nT)^{-1}\sum_{i=1}^{n}\sum_{t=1}^{T}B^{2}_{k}(S_{i,t},A_{i,t})\}

7 end for

9Select

l^{*}

that achieves the smallest validation error among all tuning parameters.

10Take

\hat{g}^{\pi}(S_{i,t},A_{i,t},B_{k})

\hat{g}^{\pi}_{l^{*}}(S_{i,t},A_{i,t},B_{k})

and optimize (13) in the main text with the smallest feasible

\bm{\delta}_{K}=\delta\bm{J}

, where

\bm{J}=[1,\dots,1]^{\intercal}

. Obtain the solution

\hat{\omega}^{\pi}_{i,t}

i=1,\dots,n

and

t=0,\cdots,(T-1)

Output:

\hat{\cal V}^{\pi}(\pi)=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}R_{i,t}

Algorithm 1 Outlines for computing

\hat{\cal V}(\pi)

S1.2 Implementation of the Proposed Estimator

In this subsection, we discuss the implementation details in the simulation study. For the proposed estimator (ProjBalance), the basis functions were constructed as follows. For settings described in Section 5, we adopted the tensor product basis functions for two dimensional state variables. First, we set $B_{1,l}(S,A)=b_{1,l_{1}}(S^{(1)})b_{2,l_{2}}(S^{(2)})\mathbbm{1}_{A=1}$ , $B_{2,l}(S,A)=b_{1,l_{1}}(S^{(1)})b_{2,l_{2}}(S^{(2)})$ . Here $b_{1,l_{1}}$ and $b_{2,l_{2}}$ were one-dimensional cubic B-spline sets where internal knots were placed at equally spaced sample quantiles for state variables. To avoid extrapolation of the basis function, we put three repeated boundary knots. Then we chose $\bm{B}_{K}(S,A)=[B_{k}(S,A)]_{k=1}^{K}=[B_{1,l}(S,A),B_{2,l}(S,A)]_{l=1}^{K/2}$ . In our numerical experiments, we set $K=2\max\{(nT)^{1/3},4^{2}\}$ , where $4^{2}$ was determined by the fact that there should be at least 4 one-dimensional basis for constructing cubic B-spline. Note that the number of tensor products of spline bases grows exponentially as the dimension of the state space increases. We recommend reducing the interaction order of the tensor products or using kernel bases in the higher dimensional settings, while the number of basis functions $K$ can be still chosen as stated before. In the simulation study for Cartpole datasets in Section S1.3, we did not include any interaction of tensor products among four state variables, for the sake of simplicity. For the approximate projection step (24), we used a kernel ridge regression with Gaussian kernel, where the bandwidth parameter in Gaussian kernel was chosen by the median heuristic (See Section 2.2 in [18]).We adopted 5-fold cross-validation to tune the parameter $\mu$ . Note that we only needed one $\mu$ for $K$ projections, so we took an average of the “standardized” validation errors from the projection steps for all basis functions, where the validation errors were standardized by empirical norm of the corresponding basis function, as our validation criterion. For the weight estimation (25), we chose $h(x)=(x-1)^{2}$ . As $\mathbb{G}$ is known to us, we used Monte Carlo method to obtain $\mathbb{E}_{S\sim\mathbb{G}}\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S)B_{k}(S,a^{\prime})\}$ . To reduce the amount of tuning, we took $\bm{\delta}_{K}=\delta\bm{J}$ , where $\delta>0$ and $J=[1,\dots,1]\in\mathbb{R}^{K}$ , and selected the minimal $\delta$ that had a valid solution to (25).

In the following, we provide the details for constructing the confidence interval for ProjBalance. Take $\hat{Q}^{\pi}$ as the estimator for $Q^{\pi}$ obtained by SAVE in [62] and let

\hat{\sigma}_{i,t}=\hat{\omega}_{i,t}^{\pi}\left(R_{i,t}+\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})\hat{Q}^{\pi}(S_{i,t+1},a^{\prime})-\hat{Q}^{\pi}(S_{i,t},A_{i,t})\right).

Calculate

\hat{\sigma}=\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\sigma}_{i,t}^{2}\right\}^{1/2}.

Then the confidence interval is given by

\left[\hat{\cal V}(\pi)-z_{\frac{\alpha}{2}}\frac{\hat{\sigma}}{\sqrt{nT}},\ \hat{\cal V}(\pi)+z_{\frac{\alpha}{2}}\frac{\hat{\sigma}}{\sqrt{nT}}\right],

where $\hat{\cal V}(\pi)=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}R_{i,t}$ and $z_{\frac{\alpha}{2}}$ is $(1-\alpha/2)$ -quantile of the standard normal distribution.

S1.3 Additional Simulation Results

Tables S2, S3 and S4 show the simulation results under the target policies and data generation process described in Section 5, where $(n,T)=(40,25)$ , $(n,T)=(80,25)$ and $(n,T)=(80,50)$ respectively.

Table S2: Simulation results for four different target policies when

n=40

and

T=25

. See details in Table 1.

Target	methods	MSE ( $\times 1000$ )	MeSE ( $\times 1000$ )	ECP	AL ( $\times 100$ )
$\pi_{1}$	ProjBalance	17.32 (1.064)	8.17	0.96 / 0.99	66.30 (6.145) / 79.56 (7.374)
	VL	15.43 (0.942)	7.08	0.97	54.15 (0.472)
	SAVE	272.36 (178.185)	11.06	0.96	637.96 (337.450)
	FQE	23.58 (1.431)	10.74	_	_
	minimax	46.27 (3.544)	16.75	_	_
	DR	855.00 (178.345)	82.56	_	_
	IS	945.92 (234.336)	53.75	_	_
$\pi_{2}$	ProjBalance	10.83 (0.874)	3.17	0.87 / 0.91	35.26 (4.682) / 42.31 (5.619)
	VL	12.33 (0.721)	5.65	0.88	36.63 (0.364)
	SAVE	310.41 (301.139)	3.25	0.94	918.00 (874.044)
	FQE	23.58 (1.431)	10.74	_	_
	minimax	41.52 (1.973)	28.31	_	_
	DR	60.20 (5.623)	16.30	_	_
	IS	2973.88 (1803.384)	70.509	_	_
$\pi_{3}$	ProjBalance	4.62 (0.299)	1.899	0.92 / 0.95	27.67 (0.984) / 33.20 (1.181)
	VL	4.50 (0.305)	1.89	0.98	33.04 (0.239)
	SAVE	15.87 (8.981)	2.35	0.94	669.62 (636.178)
	FQE	8.687 (0.604)	3.69	_	_
	minimax	45.51 (1.710)	39.61	_	_
	DR	68.18 (21.377)	3.98	_	_
	IS	34.02 (4.214)	12.45	_	_
$\pi_{4}$	ProjBalance	1.25 (0.087)	0.49	0.94 / 0.98	13.61 (0.074) / 16.33 (0.089)
	VL	1.63 (0.111)	0.66	0.99	24.47 (0.070)
	SAVE	1.40 (0.096)	0.53	0.95	14.68 (0.119)
	FQE	3.46 (0.224)	1.65	_	_
	minimax	3.26 (0.202)	1.50	_	_
	DR	2.48 (0.157)	1.15	_	_
	IS	4.67 (0.281)	2.28	_	_

Table S3: Simulation results for four different target policies when

n=80

and

T=25

. See details in Table 1.

Target	methods	MSE ( $\times 1000$ )	MeSE ( $\times 1000$ )	ECP	AL ( $\times 100$ )
$\pi_{1}$	ProjBalance	10.13 (0.661)	4.61	0.95 / 0.97	38.95 (0.175) / 46.74 (0.211)
	VL	8.84 (0.562)	3.66	0.95	38.38 (0.178)
	SAVE	11.55 (0.766)	4.90	0.94	40.58 (0.183)
	FQE	3.46 (0.224)	1.65	_	_
	minimax	38.08 (3.330)	17.80	_	_
	DR	181.51 (24.850)	28.07	_	_
	IS	4.67 (0.281)	2.28	_	_
$\pi_{2}$	ProjBalance	3.88 (0.279)	1.76	0.92 / 0.97	21.53 (0.120) / 25.83 (0.144)
	VL	5.06 (0.293)	2.46	0.94	25.18 (0.145)
	SAVE	3.42 (0.199)	1.77	0.96	23.55 (0.512)
	FQE	3.88 (0.214)	2.04	_	_
	minimax	36.48 (1.556)	26.28	_	_
	DR	28.51 (2.915)	7.08	_	_
	IS	2951.59 (2613.528)	62.73	_	_
$\pi_{3}$	ProjBalance	2.33 (0.134)	1.09	0.94 / 0.98	18.36 (0.1) / 22.03 (0.120))
	VL	2.33 (0.137)	1.18	0.98	23.75 (0.122)
	SAVE	2.74 (0.161)	1.26	0.94	19.56 (0.177)
	FQE	4.96 (0.322)	2.10	_	_
	minimax	32.12 (1.150)	25.85	_	_
	DR	15.64 (5.164)	2.35	_	_
	IS	22.39 (4.263)	7.05	_	_
$\pi_{4}$	ProjBalance	0.75 (0.042)	0.41	0.92 / 0.98	9.81 (0.018) / 11.77 (0.022)
	VL	0.88 (0.056)	0.37	1	17.32 (0.033)
	SAVE	0.79 (0.045)	0.43	0.93	10.19 (0.107)
	FQE	1.66 (0.713)	0.66	_	_
	minimax	1.46 (0.084)	0.73	_	_
	DR	1.25 (0.076)	0.62	_	_
	IS	2.27 (0.144)	1.10	_	_

Table S4: Simulation results for four different target policies when

n=80

and

T=50

. See details in Table 1.

Target	methods	MSE ( $\times 1000$ )	MeSE ( $\times 1000$ )	ECP	AL ( $\times 100$ )
$\pi_{1}$	ProjBalance	5.43 (0.353)	2.31	0.92 / 0.97	27.2(0.0679) / 32.58 (0.081)
	VL	5.09 (0.318)	2.01	0.94	26.80 (0.075)
	SAVE	5.93 (0.382)	2.65	0.92	28.09 (0.117)
	FQE	6.21 (0.402)	2.57	_	_
	minimax	27.27 (2.030)	12.17	_	_
	DR	67.29 (11.554)	11.03	_	_
	IS	192585.20 (189045.400)	44.53	_	_
$\pi_{2}$	ProjBalance	2.49 (0.204)	1.09	0.87 / 0.93	15.05 (0.045) / 18.06 (0.054)
	VL	2.92 (0.162)	1.58	0.89	16.93 (0.056)
	SAVE	1.73 (0.107)	0.72	0.94	15.78 (0.053)
	FQE	2.10 (0.140)	0.91	_	_
	minimax	33.13 (1.275)	26.19	_	_
	DR	11.98 (1.063)	4.10	_	_
	IS	2951.59 (2613.528)	62.73	_	_
$\pi_{3}$	ProjBalance	1.19 (0.071)	0.61	0.94 / 0.98	13.01 (0.048) / 15.61 (0.057)
	VL	1.25 (0.071)	0.73	0.98	16.42 (0.053)
	SAVE	1.34 (0.079)	0.64	0.94	13.97 (0.359)
	FQE	2.23 (0.174)	0.89	_	_
	minimax	32.78 (0.747)	30.92	_	_
	DR	3.45 (0.966)	0.86	_	_
	IS	29.67 (6.884)	6.79	_	_
$\pi_{4}$	ProjBalance	0.39 (0.023)	0.21	0.92 / 0.97	6.900 (0.008) / 8.28 (0.009)
	VL	0.44 (0.026)	0.24	1	12.12 (0.027)
	SAVE	0.41 (0.024)	0.22	0.92	7.06 (0.005)
	FQE	0.77 (0.046	0.38	_	_
	minimax	0.77 (0.049)	0.39	_	_
	DR	0.62 (0.039)	0.29	_	_
	IS	2.26 (0.143)	1.02	_	_

S1.4 Weighted Estimator without Projection and Augmented Projected Balancing Estimator

In this section, we compare the performance of the weighted estimator (balance) where weights are obtained from (13) with our proposed estimator ProjBalance under the simulation settings described in Section 5. In particular, the same basis functions are adopted for two weighted estimators in the balancing step. Also, we include the augmented estimator (aug) based on our procedure into the comparison. The augmented estimator is constructed by

(1-\gamma)\mathbb{E}_{S\sim\mathbb{G}}\left\{\sum_{a\in\mathcal{A}}\pi(a\mid S)\hat{Q}^{\pi}(S,a)\right\}\\ +\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[R_{i,t}-\hat{Q}^{\pi}(A_{i,t},S_{i,t})+\gamma\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})\hat{Q}^{\pi}(S_{i,t+1},a^{\prime})\right\}\right],

where $\hat{Q}^{\pi}$ is estimated by SAVE and $\hat{\omega}^{\pi}_{i,t}$ , $i=1,\dots,n$ , $t=0,\dots,T-1$ are obtained from (25). Table S5 shows the performance of above three estimators when $n=40$ and $T=50$ . As we can see, balance only performs better when target policy is $\pi_{1}$ . Under other settings, ProjBalance performs much better than balance, especially when target policy is $\pi_{2}$ . As for aug, it has a very similar performance to SAVE in all four settings.

Table S5: Simulation results for ProjBalance, balance and aug under four different target policies when

n=40

and

T=50

. See details in Table 1.

Target	methods	MSE ( $\times 1000$ )	MeSE ( $\times 1000$ )	ECP	AL ( $\times 100$ )
$\pi_{1}$	ProjBalance	8.76 (0.527)	3.75	0.96 / 0.99	38.93 (0.258) / 46.71 (0.310)
	balance	4.73 (0.298)	2.36	0.99 / 1.00	38.63 (0.127) / 46.36 (0.152)
	aug	9.86 (0.585)	4.31	0.95 / 0.99	38.81 (0.259) / 46.57 (0.310)
$\pi_{2}$	ProjBalance	4.29 (0.356)	1.75	0.91 / 0.94	23.57 (2.130) / 28.28 (2.556)
	balance	117.23 (1.38)	111.44	0.002 / 0.002	19.43 (0.583) / 23.33 (0.699)
	aug	36.66 (33.118)	1.52	0.92 / 0.95	23.60 (2.166) / 28.32 (2.599)
$\pi_{3}$	ProjBalance	2.41 (0.153)	1.06	0.93 / 0.97	18.47 (0.125) / 22.17 (0.150)
	balance	8.32 (0.32)	6.79	0.45 / 0.57	14.95 (0.048) / 17.94 (0.058)
	aug	2.62 (0.164)	1.16	0.92 / 0.97	18.50 (0.124) / 22.20 (0.148)
$\pi_{4}$	ProjBalance	0.67 (0.044)	0.32	0.93 / 0.97	9.70 (0.021) / 11.64 (0.025)
	balance	0.84 (0.053)	0.37	0.91 / 0.95	9.64 (0.010) / 11.56 (0.012)
	aug	0.73 (0.048)	0.34	0.92 / 0.96	9.69 (0.023) / 11.63 (0.027)

S1.5 Simulation results for the Cartpole environment

In this section, we compare the performance of various estimators using the CartPole datasets from the OpenAI Gym environment [5]. The state variables in the CartPole environment are four dimensional, including cart position, cart velocity, pole angle and pole angular velocity. The action is binary. We made a slight modification to the Cartpole environment following [68] and [60]. To summarize, we added a small Gaussian noise on the original deterministic transition dynamic and defined a new state-action-dependent reward function. See details in Section B2.1 in [60]. Following [68] and [60], we first ran a deep-Q network to get a near-optimal policy as the target policy ( $\pi_{opt}$ ). And we then applied “epsilon-greedy” adjustment using a factor $\epsilon$ to define our behavior policy, i.e.,

\displaystyle\pi_{b}(a\mid s)=(1-\epsilon)*\pi_{opt}(a\mid s)+\epsilon*0.5,

for every $(s,a)\in{\cal S}\times{\cal A}$ . For comparison, we implemented various methods such as fitted-Q evluation (FQE), minimax and DR using the ways they are implemented in [60] and the public code from https://github.com/RunzheStat/D2OPE. We set the discount factor $\gamma=0.9$ in our evaluations.

In Tables S6, S7 and S8, we show the results for $200$ simulated dataset when setting $\epsilon=0.1$ , $\epsilon=0.2$ and $\epsilon=0.3$ respectively. We fixed $n=50$ and $T=100$ for all three settings. As we can see, ProjBalance has the smallest or the second smallest MSE and MeSE among all the methods. The emprical coverages of ProjBalance are also better than other two methods (SAVE and VL). We did not report the confidence interval for the remaining methods because there are no related theoretical results.

Table S6: Simulation results for the Cartpole environment when

n=50

T=100

and

\epsilon=0.1

. See detailed description in Table 1

methods	MSE ( $\times 10^{6}$ )	MeSE ( $\times 10^{6}$ )	ECP	AL ( $\times 10^{3}$ )
ProjBalance	0.17 (0.011)	0.13	0.72 / 0.86	0.93 (0.016) / 1.12 (0.02)
SAVE	0.49 (0.022)	0.43	0.12	0.92 (0.018)
VL	32.66 (1.262)	29.57	0	1.84 (0.008)
FQE	5.15 (0.033)	5.12	_	_
minimax	43.04 (1.122)	43.24	_	_
DR	1.26 (0.042)	1.11	_	_
IS	5581.23 (491.491)	2757.71	_	_

Table S7: Simulation results for the Cartpole environment when

n=50

T=100

and

\epsilon=0.2

. See detailed description in Table 1

methods	MSE ( $\times 10^{6}$ )	MeSE ( $\times 10^{6}$ )	ECP	AL ( $\times 10^{3}$ )
ProjBalance	0.10 (0.009)	0.04	0.84 / 0.91	0.85 (0.014) / 1.02 (0.016)
SAVE	0.33 (0.021)	0.24	0.38	0.84 (0.014)
VL	23.06 (0.547)	22.5	0	1.21 (0.003)
FQE	4.44 (0.043)	4.38	_	_
minimax	35.28 (1.163)	33.69	_	_
DR	0.76 (0.049)	0.64	_	_
IS	20541.94 (2356.361)	9295.65	_	_

Table S8: Simulation results for the Cartpole environment when

n=50

T=100

and

\epsilon=0.3

. See detailed description in Table 1

methods	MSE ( $\times 10^{6}$ )	MeSE ( $\times 10^{6}$ )	ECP	AL ( $\times 10^{3}$ )
ProjBalance	0.14 (0.011)	0.08	0.65 / 0.76	0.84 (0.031) / 1.01 (0.038)
SAVE	0.13 (0.014)	0.05	0.74	0.93 (0.124)
VL	16.08 (0.326)	15.9	0	1.02 (0.001)
FQE	3.99 (0.029)	3.98	_	_
minimax	26.94 (0.962)	25.54	_	_
DR	0.87 (0.053)	0.7	_	_
IS	59936.33 (7265.061)	30726.43	_	_

S2 Proof

S2.1 Proof in Section 2

Here we show how to obtain (10) from (9).

Starting from (9), we multiply $f(s,a)$ on both sides and get

\displaystyle d^{\pi}(s,a)f(s,a)=(1-\gamma)G(s)\pi(a\mid s)f(s,a)+\gamma f(s,a)\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d^{\pi}(s^{\prime},a^{\prime})p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)ds^{\prime}

(S36)

Then we integrate with respect to $s$ and $a$ for both sides of (S36). Next by the change of measure, we obtain

$\displaystyle\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d^{\pi}(s,a)f(s,a)ds$	$\displaystyle=(1-\gamma)\int_{s\in\mathcal{S},a\in\mathcal{A}}\mathbb{G}(s)\pi(a\mid s)f(s,a)$
	$\displaystyle\qquad+\gamma\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}f(s,a)\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d^{\pi}(s^{\prime},a^{\prime})p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)ds^{\prime}ds$
	$\displaystyle=(1-\gamma)\int_{s\in\mathcal{S},a\in\mathcal{A}}\mathbb{G}(s)\pi(a\mid s)f(s,a)$
	$\displaystyle\qquad+\gamma\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d^{\pi}(s^{\prime},a^{\prime})\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}f(s,a)p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)dsds^{\prime}$
	$\displaystyle=(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a\in\mathcal{A}}f(a,S0)\right]$
	$\displaystyle\qquad+\gamma\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d^{\pi}(s,a)\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S^{\prime})f(S^{\prime},a)\mid S=s,A=a\right]ds$	(S37)

Then, based on the definition of $\bar{p}_{T}^{b}(S_{t},A_{t})$ , we have

\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\frac{d^{\pi}(S_{t},A_{t})}{\bar{p}_{T}^{b}(S_{t},A_{t})}\left\{f(S_{t},A_{t})-\gamma\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{t+1})f(S_{t+1},a)\mid S_{t},A_{t}\right]\right\}\right]\\ =\frac{1}{T}\sum_{t=0}^{T-1}\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}p_{t}(s,a)\frac{d^{\pi}(s,a)}{\bar{p}^{b}_{T}(s,a)}\left\{f(s,a)-\gamma\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a)\mid s,a\right]\right\}ds\\ =\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}{d^{\pi}(s,a)}\left\{f(s,a)-\gamma\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a)\mid s,a\right]\right\}ds.

(S38)

Combining (S37) and (S38), equation (10) follows.

S2.2 Technical Proof in Section 3

Here we show the proof of Lemma 1 and Theorem 2. For Lemma 1, the proof is straightforward. For Theorem 2, we first remove the absolute value notation in (25b) by adding additional $K$ constrains and therefore transform the optimization problem into a standard convex optimization problem with linear constraints. Then we make use of the duality results in [67] to show the corresponding statement in Lemma 1.

Proof of Lemma 1.

First, for any function $g\in\mathcal{G}^{\prime}$ , there exists $c\in\mathbb{R}$ (independent of $s,a,s^{\prime}$ ) such that

g(s,a,s^{\prime})=f(s,a)-\gamma c\quad\mbox{where}\quad\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})f(s^{\prime},a^{\prime})\equiv c.

Re-parametrizing $f(s,a)=m(s,a)+c$ , we have

\mathcal{G}^{\prime}=\left\{g(s,a,s^{\prime})=m(s,a)+(1-\gamma)c:c\in\mathbb{R},\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})m(s^{\prime},a^{\prime})\equiv 0\right\}.

Let $\{s_{1},\dots,s_{p_{S}}\}$ and $\{a_{1},\dots,a_{p_{A}}\}$ be the possible values of $\mathcal{S}$ and $\mathcal{A}$ respectively. A real-valued function $m$ defined on $\mathcal{S}\times\mathcal{A}$ can be identified by a matrix $\bm{M}=[m(s_{i},a_{j})]_{i,j=1}^{p_{S},p_{A}}\in\mathbb{R}^{p_{S}\times p_{A}}$ . Similarly, we define $\bm{\Pi}=[\pi(a_{j}\mid s_{i})]_{i,j=1}^{p_{S},p_{A}}$ . Denote by $\bm{1}_{a}\in\mathbb{R}^{a}$ a vector of ones for any positive integer $a$ . To derive $\dim(\mathcal{G}^{\prime})$ , we study the constraint $\sum_{a^{\prime}\in\mathcal{A}^{\prime}}\pi(a^{\prime}\mid s^{\prime})m(s^{\prime},a^{\prime})\equiv 0$ , which is equivalent to

\langle\bm{\pi}_{i},\bm{m}_{i}\rangle=0,\quad i=1,\dots,p_{S},

(S39)

where $\bm{\pi}_{i}$ and $\bm{m}_{i}$ are the $i$ -th row of $\bm{\Pi}$ and $\bm{M}$ respectively. Since the elements of $\bm{\pi}_{i}$ are non-negative and sum to 1 for all $i$ , and each of these constraints are restricted on non-overlapping elements on $\bm{M}$ , so (S39) has $p_{S}$ linearly independent constraints on $\bm{M}$ . It is easy to see that $\dim(\{g(s,a,s^{\prime})=m(s,a):\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})m(s^{\prime},a^{\prime})\equiv 0\})=p_{S}p_{A}-p_{S}$ . Together with the parameter $c$ , we can show that $\dim(\mathcal{G}^{\prime})=p_{S}p_{A}-p_{S}+1$ .

∎

Proof of Theorem 2.

Let

\hat{\bm{L}}=\frac{1}{nT}\left[\begin{array}[]{cccc c c}\hat{A}_{1}(S_{0,0},A_{0,0})&\dots&\hat{A}_{1}(S_{0,T-1},A_{0,T-1})&\hat{A}_{1}(S_{1,0},A_{1,0})&\dots&\hat{A}_{1}(S_{n,T-1},A_{n,T-1})\\ \vdots&\ddots&\vdots&\vdots&\ddots&\vdots\\ \hat{A}_{K}(S_{0,0},A_{0,0})&\dots&\hat{A}_{K}(S_{0,T-1},A_{0,T-1})&\hat{A}_{K}(S_{1,0},A_{1,0})&\dots&\hat{A}_{K}(S_{n,T-1},A_{n,T-1})\\ \end{array}\right],

\bm{w}=[w_{i,t}]_{i=1,\dots,n,t=0,\dots,T}\in\mathbb{R}^{nT}

Then (25b) can be represented by

	min	$\displaystyle\qquad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h(w_{i,t})$
	subject to	$\displaystyle\qquad\left[\begin{array}[]{c}\hat{\bm{L}}\\ -\hat{\bm{L}}\end{array}\right]\bm{w}\leq\left[\begin{array}[]{c}\bm{\delta}_{K}+\bm{l}_{K}\\ \bm{\delta}_{K}-\bm{l}_{K}\end{array}\right]$		(S44)

From [67], the dual of (S44) is

	max	$\displaystyle\qquad g(\bm{\lambda})$
	subject to	$\displaystyle\qquad\bm{\lambda}\geq 0,$

where $g(\bm{\lambda})=-\frac{1}{nT}\sum_{j=1}^{n}\sum_{t=0}^{T-1}h^{*}(-\bm{Q}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})-\langle\bm{\lambda},\bm{d}\rangle$ , $\bm{\lambda}\in\mathbb{R}^{2K}$ and $\bm{d}=[(\bm{\delta}_{K}+\bm{l}_{K})^{\intercal},(\bm{\delta}_{K}-\bm{l}_{K})^{\intercal}]^{\intercal}\in\mathbb{R}^{2K}$ . In addition, $\bm{Q}(S_{i,t},A_{i,t})=[\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal},-\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}]^{\intercal}\in\mathbb{R}^{2K}$ and $h^{*}$ is the convex conjugate of $h$ defined as

	$\displaystyle h^{*}(t)$	$\displaystyle=\sup_{w}\{tw-h(w)\}$
		$\displaystyle=t\tilde{w}-h(\tilde{w}),$

where $\tilde{w}$ satisfies the first order condition that $t-h^{\prime}(\tilde{w})=0$ . Then we obtain that $\tilde{w}=(h^{\prime})^{-1}(t)$ and

h^{*}(t)=t(h^{\prime})^{-1}(t)-h\{(h^{\prime})^{-1}(t)\}.

Take $\rho(t)=h^{*}(t)$ , and it can be verified that $\rho^{\prime}(t)=(h^{\prime})^{-1}(t)=\tilde{w}$ . Then the dual form can be written as

	min	$\displaystyle\qquad\ell(\bm{\lambda})=\frac{1}{nT}\sum_{j=1}^{n}\sum_{t=0}^{T-1}\rho(-\bm{Q}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})+\bm{\lambda}^{\intercal}\bm{d}$		(S45)
	subject to	$\displaystyle\qquad\bm{\lambda}\geq 0$

Suppose $\tilde{\bm{\lambda}}=[\tilde{\bm{\lambda}}_{1}^{\intercal},\tilde{\bm{\lambda}}_{2}^{\intercal}]^{\intercal}$ , $\tilde{\bm{\lambda}}_{1},\tilde{\bm{\lambda}}_{2}\in\mathbb{R}^{K}$ , is an optimal solution of (S45). Take $\tilde{\lambda}_{1,k}$ , $\tilde{\lambda}_{2,k}$ and $\delta_{k}$ as the $k$ -th element of $\tilde{\bm{\lambda}}_{1},\tilde{\bm{\lambda}}_{2}$ and $\bm{\delta}_{K}$ respectively. Next, we show that $\tilde{\lambda}_{1,k}\tilde{\lambda}_{2,k}=0$ for $k=1,\dots,K$ .

Suppose that there exists $k$ such that $\tilde{\lambda}_{1,k}>0$ and $\tilde{\lambda}_{2,k}>0$ , then we take $\tilde{\bm{\lambda}^{\prime}}_{1}=\tilde{\bm{\lambda}}_{1}-\min\{\tilde{\lambda}_{1,k},\tilde{\lambda}_{2,k}\}\bm{e}_{k}$ , $\tilde{\bm{\lambda}^{\prime}}_{2}=\tilde{\bm{\lambda}}_{2}-\min\{\tilde{\lambda}_{1,k},\tilde{\lambda}_{2,k}\}\bm{e}_{k}$ , where $\bm{e}_{k}$ is a vector where its $k$ -th entry is 1 and all remaining entries are zero. Take $\tilde{\bm{\lambda}^{\prime}}=[\tilde{\bm{\lambda}^{\prime}}_{1}^{\intercal},\tilde{\bm{\lambda}^{\prime}}_{2}^{\intercal}]^{\intercal}$ . Then

\ell(\tilde{\bm{\lambda}^{\prime}})-\ell(\tilde{\bm{\lambda}})=-2\min\{\tilde{\bm{\lambda}}_{1,k},\tilde{\bm{\lambda}}_{2,k}\}\delta_{k}<0,

due to the fact that $\delta_{k}>0$ . Then it contradicts with the assumption that $\tilde{\bm{\lambda}}$ is the solution.

Therefore, we can take the $\bm{\lambda}^{+}=\tilde{\bm{\lambda}}_{2}-\tilde{\bm{\lambda}}_{1}$ . And it can be verified that $|\bm{\lambda}^{+}|=\tilde{\bm{\lambda}}_{2}+\tilde{\bm{\lambda}}_{1}$ . Rewriting (S45) yields the result in Theorem 2. ∎

S2.3 Technical Proof in Section 4.1

In this section, we provide proofs of Theorem 3 and Corollary 1.

To prove Theorem 3, a novel truncation argument of the Markov chain is presented in Lemma S2 to obtain the tight concentration bound in the scaling of $n$ and $T$ . Specifically, we truncate each trajectory into two parts in the scaling of $T$ . Informally, for the first part, the truncation threshold should be small enough so that we could borrow the proof techniques under standard i.i.d. settings developed in [38] without losing too much for the upper bound. For the remaining part, the idea is that as the chain grows, the distribution becomes “exponentially” close to the stationary distribution according to the mixing condition in Assumption 4(a) (or 4’(a)). We first develop a delicate peeling argument based on [14] to bound the empirical process under the stationary distribution, from which we could achieve the desired order in the scaling of both $n$ and $T$ . Then it remains to bound the difference between the stationary distribution and the distribution after truncation. By carefully choosing the truncation threshold, we are able to balance the upper bounds from two parts and obtain the desired rate of convergence. See the details in Lemma S2.

Proof of Theorem 3.

Take $\delta_{t}^{\pi}(B)=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{t+1})B(S_{t+1},a^{\prime})$ . In the following, we write $g^{\pi}(\cdot,\cdot;B)$ and $g_{*}^{\pi}(\cdot,\cdot;B)$ as $g^{\pi}(B)$ and $g_{*}^{\pi}(B)$ in short. We define

\mathbb{P}_{{\color[rgb]{0,0,0}n,T}}F(X)=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}F(X_{i,t})

as the empirical expectation of some function $F$ with an input $X$ , which represents random sampled trajectories from some Markov chains.

To proceed our proof, we can decompose the error as

		$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]$
	$\displaystyle=$	$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)+\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]$
	$\displaystyle=$	$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-\hat{g}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]+\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]$
		$\displaystyle\qquad\quad+\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)\right\}\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}\right].$

Since $\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}g(S_{t},A_{t})\right]=0$ for all $g\in\mathcal{G}$ due to the optimizing property of $g_{*}^{\pi}$ , the last term above can be simplified as

	$\displaystyle\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big{[}\big{\{}\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)$
	$\displaystyle\qquad\qquad+g_{}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)\big{\}}\big{\{}\delta_{t}^{\pi}(B)-g_{}^{\pi}(S_{t},A_{t};B)\big{\}}\Big{]}$
	$\displaystyle=\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\big{\{}g_{}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)\big{\}}\big{\{}\delta_{t}^{\pi}(B)-g_{}^{\pi}(S_{t},A_{t};B)\big{\}}\right]$
	$\displaystyle=-\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right].$

As a result, we have

		$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\frac{1}{T}\sum_{t=0}^{T-1}\big{\{}\delta_{t}^{\pi}(B)-\hat{g}^{\pi}(S_{t},A_{t};B)\big{\}}^{2}-\big{\{}\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\big{\}}^{2}\Big{]}.$

For $g_{1},g_{2}\in\mathcal{G},B\in\mathcal{Q}$ , we define the following two functions:

	$\displaystyle f_{1}^{\pi}(g_{1},g_{2},B):(S_{t},A_{t},S_{t+1})\mapsto\left\{\delta_{t}^{\pi}(B)-g_{1}(S_{t},A_{t})\right\}^{2}-\left\{\delta_{t}^{\pi}(B)-g_{2}(S_{t},A_{t})\right\}^{2}$
	$\displaystyle f_{2}^{\pi}(g_{1},g_{2},B):(S_{t},A_{t},S_{t+1})\mapsto\left\{\delta_{t}^{\pi}(B)-g_{2}(S_{t},A_{t})\right\}\left\{g_{1}(S_{t},A_{t})-g_{2}(S_{t},A_{t})\right\},$

With these notations, we know that

	$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]=\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})\right],$
	$\displaystyle\\|\hat{g}^{\pi}(B)-g_{}^{\pi}(B)\\|_{{\color[rgb]{0,0,0}n,T}}^{2}=\mathbb{P}_{N}\left[f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}(S,A,S^{\prime})+2f_{2}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})\right].$

In the following, we decompose

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]+\|\hat{g}^{\pi}(B)-g_{*}^{\pi}(B)\|_{{\color[rgb]{0,0,0}n,T}}^{2}+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}=I_{1}(B)+I_{2}(B),

where

	$\displaystyle I_{1}(B)=3\mathbb{P}_{N}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}+\mu[3J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}+2J_{\mathcal{G}}^{2}\left\{g_{}^{\pi}(B)\right\}+2J_{\mathcal{Q}}^{2}(B)],$
	$\displaystyle I_{2}(B)=\mathbb{P}_{N}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}+\frac{1}{T}\mathbb{E}\sum_{t=0}^{T-1}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}$
	$\displaystyle\qquad\qquad\qquad+2\mathbb{P}_{N}f_{2}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-I_{1}(B).$

For the first term, the optimizing property of $\hat{g}^{\pi}(B)$ implies that

	$\displaystyle\frac{1}{3}I_{1}(B)$	$\displaystyle=\mathbb{P}_{N}\Big{[}\left\{\delta_{t}^{\pi}(B)-\hat{g}^{\pi}(S,A;B)\right\}^{2}-\left\{\delta^{\pi}(B)-g_{*}^{\pi}(S,A;B)\right\}^{2}\Big{]}$
		$\displaystyle\qquad+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{Q}}^{2}(B)$
		$\displaystyle=\mathbb{P}_{N}\Big{[}\left\{\delta_{t}^{\pi}(B)-\hat{g}^{\pi}(S,A;B)\right\}^{2}\Big{]}+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}$
		$\displaystyle\qquad-\mathbb{P}_{N}\big{[}\left\{\delta^{\pi}(B)-g_{}^{\pi}(S,A;B)\right\}^{2}\big{]}+\frac{2}{3}\mu J_{\mathcal{G}}^{2}\left\{g_{}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{Q}}^{2}(B)$
		$\displaystyle\leq\frac{5}{3}\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{Q}}^{2}(B).$

Thus, $I_{1}(B)\leq 5\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+2\mu J_{\mathcal{Q}}^{2}(B)$ holds for all $B$ .

Next we derive the uniform bound of $I_{2}(B)$ over all $B$ . For simplicity, take

\displaystyle\mathbf{J}^{2}(g_{1},g_{2},B)=J_{\mathcal{G}}^{2}(g_{1})+J_{\mathcal{G}}^{2}(g_{2})+J_{\mathcal{Q}}^{2}(B),

(S46)

for any $g_{1},g_{2}\in\mathcal{G}$ and $B\in\mathcal{Q}$ .

And take $f^{\pi}=f_{1}^{\pi}-f_{2}^{\pi}$ , where

\displaystyle f^{\pi}(g_{1},g_{2},B):(S_{t},A_{t},S_{t+1})\mapsto(g_{2}-g_{1})(S_{t},A_{t})\cdot\left(3\delta^{\pi}(B)-2g_{2}(S_{t},A_{t})-g_{1}(S_{t},A_{t})\right).

(S47)

Then

\displaystyle I_{2}(B)

\displaystyle=2(\mathbb{E}-\mathbb{P}_{N})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B).

By Lemma S2, we are able to show that with probability at least $1-2\delta-1/({\color[rgb]{0,0,0}nT})$ ,

\displaystyle I_{2}(B)\lesssim\mu+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,{\color[rgb]{0,0,0}nT}\})]^{1/\tau}}{{\color[rgb]{0,0,0}nT}},

where $0<\tau\leq\frac{1}{3}$ , and the leading constant depends on $Q_{\max},G_{\max},\alpha,\kappa,C_{\ref*{C_s}}$ . Taking $\tau=(1+\alpha)\log(\log({\color[rgb]{0,0,0}nT}))/(\alpha\log({\color[rgb]{0,0,0}nT}))$ and $\mu\asymp({\color[rgb]{0,0,0}nT})^{-1/(1+\alpha)}(\log({\color[rgb]{0,0,0}nT}))$ , the result of this theorem then follow.

∎

Corollary 1 is a direct application of Theorem 3. It utilizes the uniformity of Theorem 3 and the construction of kernel ridge regression estimators. Before presenting our proof, we state our assumption for Corollary 1 below.

Assumption 4’.

The following conditions hold.

(a)

The Markov chain $\{S_{t},A_{t}\}_{t\geq 0}$ has a unique stationary distribution $\mathbb{G}^{*}$ with density $p^{*}$ over $\mathcal{S}$ and $\mathcal{A}$ and is geometrically ergodic, i.e., there exists a function $\phi(s,a)$ and constant $\kappa\in(0,1)$ such that, for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$ ,

\left\|\mathbb{G}^{b}_{t}(\cdot\mid(s,a))-\mathbb{G}^{*}(\cdot)\right\|_{\mathrm{TV}}\leq\phi(s,a)\kappa^{t},\qquad\forall t\geq 0,

where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation norm, and $\mathbb{G}_{t}^{b}(\cdot\mid(s,a))$ is the distribution of $(S_{t},A_{t})$ conditioned on $S_{0}=s$ and $A_{0}=a$ , under the behavior policy. Also, there exists a constant $C_{11}\ltx@label{C_{s}}>0$ such that $\int_{(s,a)}\phi(s,a)d\mathbb{G}_{0}(s,a)\leq C_{\ref*{C_s}}$ , where recall that $\mathbb{G}_{0}$ is the initial distribution of $(S,A)$ .

(b)

The function class $\mathcal{Q}$ satisfies that $\max\{\|B\|_{\infty},J_{\mathcal{Q}}(B)\}\leq Q_{\max}$ for all $B\in\mathcal{Q}$ .
(c)

The function class $\mathcal{G}$ in (24) is a subset of an RKHS whose corresponding RKHS norm is $\|\cdot\|_{{\mathcal{H}}}$ . In addition, $\mathcal{G}$ is a star-shaped set with center 0 and it satisfies that $\|g\|_{\infty}\leq G_{\max}$ for all $g\in\mathcal{G}$ and $\|g^{\pi}_{*}(\cdot,\cdot;B)\|_{{\mathcal{H}}}\leq G_{\max}$ for all $B\in\mathcal{Q}$ .

(d)

The regularization functional $J_{\mathcal{G}}$ in (24) is taken as $\|\cdot\|_{{\mathcal{H}}}^{2}$ . Let $\mathcal{Q}_{M}=\{B:B\in\mathcal{Q},J_{\mathcal{Q}}(B)\leq M\}$ and $\mathcal{G}_{M}=\{g:g\in\mathcal{G},\|g\|_{{\mathcal{H}}}^{2}\leq M\}$ . There exist constants $C_{\ref*{entropy}}$ and $\alpha\in(0,1)$ , such that for any $\epsilon,M>0$ ,

\max\left\{\log\mathcal{N}(\epsilon,\mathcal{G}_{M},\|\cdot\|_{\infty}),\log\mathcal{N}(\epsilon,\mathcal{Q}_{M},\|\cdot\|_{\infty})\leq C_{\ref*{entropy}}\left(\frac{M}{\epsilon}\right)^{2\alpha}\right\}

Proof of Corollary 1.

With conditions stated in the Corollary 1, by directly applying Theorem 3 with $\delta=\frac{1}{{\color[rgb]{0,0,0}nT}}$ , we can show that with probability at least $1-3/({\color[rgb]{0,0,0}nT})$ ,

	$\displaystyle{\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\hat{g}^{\pi}\left(S_{t},A_{t};\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)-g_{*}^{\pi}\left(S_{t},A_{t};\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)\right]^{2}\right\}}$
	$\displaystyle\quad+\left\\|\hat{g}^{\pi}\left(\cdot,\cdot;\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)-g_{*}^{\pi}\left(\cdot,\cdot;\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)\right\\|_{{\color[rgb]{0,0,0}n,T}}^{2}$
	$\displaystyle\lesssim({\color[rgb]{0,0,0}nT})^{-\frac{1}{1+\alpha}}(\log({\color[rgb]{0,0,0}nT}))^{\frac{2+\alpha}{1+\alpha}}.$

Then it suffices to show that

\displaystyle g_{*}^{\pi}(\cdot,\cdot;{\bm{\upsilon}}^{\intercal}[B_{k}]_{k=1}^{K})={\bm{\upsilon}}^{\intercal}[g^{\pi}_{*}(\cdot,\cdot;B_{k})]_{k=1}^{K},

(S48)

and

\displaystyle\hat{g}^{\pi}(\cdot,\cdot;{\bm{\upsilon}}^{\intercal}[B_{k}]_{k=1}^{K})={\bm{\upsilon}}^{\intercal}[\hat{g}^{\pi}(\cdot,\cdot;B_{k})]_{k=1}^{K}.

(S49)

Note that (S48) is due to the definition of $g_{*}^{\pi}$ . Next, we verify (S49).

Denote the reproducing kernel of $\mathcal{G}$ by ${\kappa}(\cdot,\cdot)$ . Write $Y^{k}_{i,t}=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})$ , $\bm{Y}_{k}=[Y^{k}_{i,t}]_{i=1,\dots,n,t=0,\dots,T-1}$ , and $\bm{G}$ as the Gram matrix. Take

\bm{G}_{\kappa}(s,a)=[{\kappa}\left\{(s,a),(S_{i,t},A_{i,t})\right\})]_{i=1,\dots,n,t=0,\dots,T-1}\in\mathbb{R}^{nT}.

Then

\hat{g}^{\pi}(s,a;B_{k})=\bm{Y}_{k}^{\intercal}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a)

If we keep the tuning parameter the same when approximating all $g^{\pi}_{*}(\cdot,\cdot;B_{k})$ , then for any ${\bm{\upsilon}}$ ,

	$\displaystyle{\bm{\upsilon}}^{\intercal}[\hat{g}^{\pi}(s,a;B_{k})]_{k=1}^{K}$	$\displaystyle={\bm{\upsilon}}^{\intercal}\left[\bm{Y}_{k}^{\intercal}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a)\right]_{k=1}^{K}$
		$\displaystyle=[{\bm{\upsilon}}^{\intercal}\tilde{\bm{Y}}_{i,t}]^{\intercal}_{i=1\dots,n,t=0,\dots,T-1}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a)$

where $\tilde{\bm{Y}}_{i,t}=[Y^{k}_{i,t}]_{k=1}^{K}\in\mathbb{R}^{K}$ , ${\bm{\upsilon}}^{\intercal}\tilde{\bm{Y}}_{i,t}=\sum_{k=1}^{K}{\upsilon_{k}}\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})]\}=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\{\sum_{k=1}^{K}\alpha_{k}B_{k}(S_{i,t+1},a^{\prime})\}$ . Then ${\bm{\upsilon}}^{\intercal}[\hat{g}^{\pi}(s,a;B_{k})]_{k=1}^{K}$ is the solution for approximating $g^{\pi}_{*}(s,a;{\bm{\upsilon}}^{\intercal}[B_{k}]_{k=1}^{K})$ using the same tuning parameter $\mu$ . Therefore we have with probability at least $1-3/({\color[rgb]{0,0,0}nT})$ ,

	$\displaystyle\mathbb{E}\left\{\sum_{k=1}^{K}\alpha_{k}\hat{g}^{\pi}\left(\cdot,\cdot;B_{k}\right)-\sum_{k=1}^{K}\alpha_{k}g_{}^{\pi}\left(\cdot,\cdot;B_{k}\right)\right\}^{2}+\left\\|\sum_{k=1}^{K}\alpha_{k}\hat{g}^{\pi}\left(\cdot,\cdot;B_{k}\right)-\sum_{k=1}^{K}\alpha_{k}g_{}^{\pi}\left(\cdot,\cdot;B_{k}\right)\right\\|_{{\color[rgb]{0,0,0}n,T}}^{2}$
	$\displaystyle\lesssim({\color[rgb]{0,0,0}nT})^{-\frac{1}{1+\alpha}}(\log({\color[rgb]{0,0,0}nT}))^{\frac{2+\alpha}{1+\alpha}}.$

∎

S2.4 Technical Proof4 in Section 4.2

In this section, we present the proof of Theorem 4, which is generalized from proofs in [13] and [73]. First, we show the $\ell_{2}$ convergence of $\bm{\lambda}^{+}$ to $\bm{\lambda}_{*}$ This requires a more delicate decomposition than those in [13] and [73] due to the additional projection error from Theorem 3. In addition, by the similar truncation argument as we discussed in Section S2.3, we develop a tight matrix concentration inequality for independent Markov chains. After obtaining the convergence of $\bm{\lambda}^{+}$ to $\bm{\lambda}_{*}$ , the convergence of weights can be derived based on the results in Lemma 1 combined with Assumption 5(a) and (c).

Proof.

Recall that $\bm{\lambda}^{+}$ is the solution to (27) and $\bm{\lambda}_{*}$ is the coefficient that leads to the best approximation to the true ratio function $\omega^{\pi}$ with basis $\bm{L}_{K}$ . By Assumption 5(a), we can show that $\sup_{s,a}|\omega^{\pi}-\rho^{\prime}(\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda}_{*})|\leq C_{\ref*{weightsbound}}(K^{-r_{1}})$ . In the following, we study the convergence of $\|\bm{\lambda}^{+}-\bm{\lambda}_{*}\|_{2}$ . We aim to prove

\displaystyle\|\bm{\lambda}^{+}-\bm{\lambda}_{*}\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left[\{\psi(K)\}^{-1}\left(\frac{\sqrt{K}\log K}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{1/2-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\right].

(S50)

Define the objective function $G(\bm{\lambda})=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho(\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})+|\bm{\lambda}|^{\intercal}\bm{\delta}_{K}-\bm{\lambda}^{\intercal}\bm{l}_{K}$ . In order to show the above bound in (S50), by the continuity and convexity of $G(\cdot)$ , it suffices to show that, with high probability,

\inf_{\Delta\in\mathcal{C}}\{G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*})\}>0,

where $\mathcal{C}=\{\Delta\in\mathbb{R}^{K}:\|\Delta\|_{2}=C_{12}\ltx@label{c_{d}elta}\{\psi(K)\}^{-1}({\sqrt{K}\log({\color[rgb]{0,0,0}nT})}/{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}})$ }. Here $C_{\ref*{c_delta}}$ is some appropriate constant.

First, we take the following decomposition:

	$\displaystyle G(\bm{\lambda}_{}+\Delta)-G(\bm{\lambda}_{})$		(S51)
	$\displaystyle=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}(\bm{\lambda}_{}+\Delta)\}-\rho\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{}\}\right]+\bm{\delta}_{K}^{\intercal}(\|\bm{\lambda}_{}+\Delta\|-\|\bm{\lambda}_{}\|)-\Delta^{\intercal}\bm{l}_{K}$
	$\displaystyle\geq\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho^{\prime}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}+\frac{1}{2}\rho^{{}^{\prime\prime}}\{\hat{v}(S_{i,t},A_{i,t})\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}^{2}\right]$
	$\displaystyle\quad-\Delta^{\intercal}\bm{l}_{K}-\|\Delta\|^{\intercal}\bm{\delta}_{K},$		(S52)

where $\hat{v}(S_{i,t},A_{i,t})$ is a value between $\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}(\bm{\lambda}_{*}+\Delta)$ and $\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}$ .

Define $\hat{\bm{\xi}}(s,a)=\hat{\bm{L}}_{K}(s,a)-\bm{L}_{K}(s,a)$ . Now we focus on the following term:

	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}-\Delta^{\intercal}\bm{l}_{K}$
$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\left[\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{}+\bm{\lambda}_{}^{\intercal}\hat{\bm{\xi}}(S_{it},A_{it})\right]\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta+\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\}-\Delta^{\intercal}\bm{l}_{K}$
$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho^{\prime}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}-\Delta^{\intercal}\bm{l}_{K}\right]$	(S53)
$\displaystyle+$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}$	(S54)
$\displaystyle+$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}$	(S55)
$\displaystyle+$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\},$	(S56)

where $\tilde{v}(S_{i,t},A_{i,t})$ is a value between $\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}$ and $\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}$ . In the following, we show how to control (S53)-(S56) one by one.

•

For (S53),

$\displaystyle\eqref{eqn:t1}$	$\displaystyle=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\omega^{\pi}(S_{i,t},A_{i,t})\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}-\Delta^{\intercal}\bm{l}_{K}\right]$
	$\displaystyle\quad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho^{\prime}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}-\omega^{\pi}(S_{i,t},A_{i,t})\right]\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}$
	$\displaystyle\geq-\left\\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\\|_{2}\left\\|\Delta\right\\|_{2}$
	$\displaystyle\quad-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})\sqrt{\Delta\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\Delta}$
	$\displaystyle\geq-\left\\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\\|_{2}\left\\|\Delta\right\\|_{2}$
	$\displaystyle\quad-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})\left[\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\right]^{1/2}\\|\Delta\\|_{2}$
	$\displaystyle\geq-\left\\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\\|_{2}\left\\|\Delta\right\\|_{2}-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(K^{-r_{1}})\\|\Delta\\|_{2}.$	(S57)

The first inequality is based on Cauchy–Schwarz inequality and Assumption 5(a). For the last inequality, we apply Lemma S3 which yields

\displaystyle\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\leq C_{\ref*{weightsC4}}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1),

due to Assumptions 4’(a) and 5(c).

Next, we show how to bound $\|(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\|_{2}$ in (S57). Note that

\bm{l}_{K}=\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})\right\}.

Then we apply the matrix concentration inequality developed in Lemma S4 to bound it. Specifically, we take $\bm{F}(s,a)=\omega^{\pi}(s,a)\bm{L}_{K}(s,a)$ in Lemma S4. By Assumption 5(a) and (c), we can show that $\sup_{(s,a)}\|\bm{F}(s,a)\|_{2}\leq\|\omega^{\pi}\|_{\infty}\left\{\sup_{(s,a)}\|\bm{L}_{K}(s,a)\|_{2}\right\}\leq C_{\ref*{weightsC1}}C_{\ref*{weightsC2}}(1+\gamma)\sqrt{K}$ . Therefore, Lemma S4 implies that

\displaystyle\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right).

Thus we can show that

\displaystyle\eqref{eqn:t1}\geq-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right)\|\Delta\|_{2}-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})\|\Delta\|_{2}

(S58)

•

For (S54),

	$\displaystyle\eqref{eqn:t2}$	$\displaystyle\geq-C\sup_{s,a}\left\|\rho^{\prime}\left\{\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda}_{*}\right\}\right\|\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\Delta}\right\}^{2}_{2}}$
		$\displaystyle\geq-C(C_{\ref{weightsC1}}+K^{-r_{1}})\\|\Delta\\|_{2}\gamma\left\\|\hat{g}^{\pi}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\\|\Delta\\|_{2}}\right)-g^{\pi}_{}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\\|\Delta\\|_{2}}\right)\right\\|_{{\color[rgb]{0,0,0}n,T}},$

where the first equality is given by Cauchy-Schwarz inequality and the second one is due to Assumption 5(a) and the definition of $\hat{\bm{\xi}}$ . As $\sup_{s,a}\sup_{\|\bm{\delta}\|\leq 1}\|\bm{B}_{K}^{\intercal}\bm{\delta}\|_{\infty}\lesssim\sqrt{K}$ , one can show that ${K}^{-1/2}\bm{B}_{K}^{\intercal}\Delta/\|\Delta\|_{2}\lesssim 1$ . Due to Theorem 3 and Assumption 5(e), we have

\displaystyle K^{-1/2}\left\|\hat{g}^{\pi}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\|\Delta\|_{2}}\right)-g^{\pi}_{*}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\|\Delta\|_{2}}\right)\right\|_{{\color[rgb]{0,0,0}n,T}}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\zeta_{{\color[rgb]{0,0,0}n,T}}\right).

Therefore,

\displaystyle\eqref{eqn:t2}

\displaystyle\geq-CC_{\ref*{weightsC1}}\sqrt{K}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\|\Delta\|_{2},

(S59)

where $C$ is some positive constant.

•

For (S55), note that

	$\displaystyle\left\|\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\|$	$\displaystyle=\left\|\bm{\lambda}_{}\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}-\gamma g^{\pi}_{}(S_{i,t},A_{i,t};\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\right\|$
		$\displaystyle\leq\left\|\bm{\lambda}_{}\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\|+\left\|\gamma g^{\pi}_{}(S_{i,t},A_{i,t};\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\right\|$
		$\displaystyle\leq Q_{\max}+\gamma G_{\max},$

due to Assumption 5(e). Similarly, we can show that

\left|\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right|\leq Q_{\max}+\gamma G_{\max}.

As $\hat{g}^{\pi}(\cdot,\cdot;\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\in\mathcal{G}$ by assumption, $|\tilde{v}(S_{i,t},A_{i,t})|\leq Q_{\max}+\gamma G_{\max}$ . Since $\rho{{}^{\prime\prime}}$ is a strictly positive and continuous function, there exists a constant $c_{1}\ltx@label{wc4}>0$ such that $\rho{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}\leq c_{\ref*{wc4}}$ . We then have

$\displaystyle\eqref{eqn:t3}$	$\displaystyle\geq-\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}^{2}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}^{2}}\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}^{2}}$
	$\displaystyle\geq-c_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\left[\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\right]^{1/2}\\|\Delta\\|$
	$\displaystyle\geq-Cc_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\left(\left[\lambda_{\max}\left\{\mathbb{E}\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\right]^{1/2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(1)\right)\\|\Delta\\|_{2}$
	$\displaystyle\geq-Cc_{\ref{wc4}}\sqrt{C_{\ref{weightsC4}}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\\|\Delta\\|_{2},$	(S60)

where $C$ is some positive constant. The first inequality of the above arguments is given by Cauchy-Schwarz inequality. The second one uses results in Theorem 3 based on Assumption 5(e) so that the upper bound for $(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}^{2}$ is obtained. We use Lemma S3 in the third inequality, while the last inequality is obtained due to Assumption 5(c).

•

For (S56), by using the same argument showing (S60) and (S59), we have

$\displaystyle\eqref{eqn:t4}$	$\displaystyle\geq-\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}^{2}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}^{2}}\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}^{2}}$
	$\displaystyle\geq-c_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\sqrt{K}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\\|\Delta\\|_{2}$
	$\displaystyle\geq-c_{\ref*{wc4}}\sqrt{K}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta^{2}_{{\color[rgb]{0,0,0}n,T}})\\|\Delta\\|_{2}$	(S61)

Substituting (S58), (S59), (S60) and (S61) into (S53), (S54), (S55) and (S56) respectively, we have

		$\displaystyle\frac{1}{nT}\sum_{j=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}-\Delta^{\intercal}\bm{l}_{K}$
	$\displaystyle\geq$	$\displaystyle-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}^{2}\right)\\|\Delta\\|_{2}.$		(S62)

In the following, we discuss the remaining term in (S52). Due to Assumption 5(b), $\rho^{{}^{\prime\prime}}\{\hat{v}(S_{i,t},A_{i,t})\}\geq C_{\ref*{weightssecond}}$ .

Next, we show how to bound

\lambda_{\min}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right].

First, note that for any vector $\bm{a}\in\mathbb{R}^{K}$ such that $\|\bm{a}\|_{2}=1$ , we have

	$\displaystyle\bm{a}^{\intercal}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]\bm{a}=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\\|\bm{a}^{\intercal}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\right\\|^{2}_{2}$
$\displaystyle\geq$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\\|\bm{a}^{\intercal}{\bm{L}}_{K}(S_{i,t},A_{i,t})\right\\|^{2}_{2}-\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\\|\bm{a}^{\intercal}\left\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})-{\bm{L}}_{K}(S_{i,t},A_{i,t})\right\}\right\\|^{2}_{2}$
$\displaystyle\geq$	$\displaystyle\bm{a}^{\intercal}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\bm{L}}_{K}(S_{i,t},A_{i,t}){\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]\bm{a}-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(K\zeta^{2}_{{\color[rgb]{0,0,0}n,T}}),$	(S63)

where the last inequality can be derived by the same arguments in proving (S54).

Then we have

		$\displaystyle\lambda_{\min}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]$
	$\displaystyle\geq$	$\displaystyle\lambda_{\min}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\bm{L}}_{K}(S_{i,t},A_{i,t}){\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(K\zeta^{2}_{{\color[rgb]{0,0,0}n,T}})$
	$\displaystyle\geq$	$\displaystyle\lambda_{\min}\left[\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}{\bm{L}}_{K}(S_{t},A_{t}){\bm{L}}_{K}(S_{t},A_{t})^{\intercal}\right]-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{\frac{K}{{\color[rgb]{0,0,0}nT}}}\log({\color[rgb]{0,0,0}nT})\right)-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(K\zeta^{2}_{{\color[rgb]{0,0,0}n,T}})$
	$\displaystyle\geq$	$\displaystyle\psi(K)-\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(\psi(K)).$

The first inequality is due to (S63). The second inequality is by Lemma S3. For the third inequality, due to the condition that $\{\psi(K)\}^{-1}\sqrt{K}\xi_{{\color[rgb]{0,0,0}n,T}}=\scalebox{0.7}{$\mathcal{O}$}(1)$ , then $\sqrt{\frac{K}{{\color[rgb]{0,0,0}nT}}}\log({\color[rgb]{0,0,0}nT})=\scalebox{0.7}{$\mathcal{O}$}(\psi(K))$ and $K\xi_{{\color[rgb]{0,0,0}n,T}}^{2}=\scalebox{0.7}{$\mathcal{O}$}({\psi(K)}^{2})=\scalebox{0.7}{$\mathcal{O}$}(\psi(K))$ .

Now, returning to (S52), we can show that

	$\displaystyle G(\bm{\lambda}_{}+\Delta)-G(\bm{\lambda}_{})$	$\displaystyle\geq-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\\|\Delta\\|_{2}$
	$\displaystyle+$	$\displaystyle\frac{1}{2nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{v^{\prime}(S_{i,t},A_{i,t})\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}^{2}-\\|\Delta\\|_{2}\\|\bm{\delta}_{K}\\|_{2}$
	$\displaystyle\geq$	$\displaystyle-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\\|\Delta\\|_{2}+\left\{\psi(K)-\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(\psi(K))\right\}\\|\Delta\\|_{2}^{2},$

where the last inequality is due to the condition of $\bm{\delta}_{K}$ specified in Assumption 5(g). As long as $\|\Delta\|_{2}\geq C_{\ref*{c_delta}}\{\psi(K)\}^{-1}(\frac{\sqrt{K}\log(K{\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}})$ for some large enough constant $C_{\ref*{c_delta}}$ , with high probability,

G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*})>0.

Therefore we have proved (S50).

Finally, we are ready to show the convergence of $\hat{\omega}$ given below.

		$\displaystyle[\mathbb{E}\left\{\hat{\omega}^{\pi}(\cdot,\cdot)-\omega^{\pi}(\cdot,\cdot)\right\}^{2}]^{1/2}$
	$\displaystyle\leq$	$\displaystyle\left[\mathbb{E}\left\{\rho^{\prime}(\hat{\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}^{+})-\rho^{\prime}({\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}_{})\right\}^{2}\right]^{1/2}+\sup_{s,a}\left\|\omega^{\pi}(s,a)-\rho^{\prime}({\bm{L}}_{K}(s,a)^{\intercal}\bm{\lambda}_{})\right\|$
	$\displaystyle\leq$	$\displaystyle\left[\mathbb{E}\left\{\rho^{\prime}({\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}^{+})-\rho^{\prime}({\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}_{*})\right\}^{2}\right]^{1/2}+\left[\mathbb{E}\left\{\rho^{{}^{\prime\prime}}\{\tilde{v}(\cdot,\cdot)\}\left\{\hat{\bm{L}}_{K}(\cdot,\cdot)-{\bm{L}}_{K}(\cdot,\cdot)\right\}^{\intercal}\bm{\lambda}^{+}\right\}^{2}\right]^{1/2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})$
	$\displaystyle\leq$	$\displaystyle c_{\ref{wc4}}\left[\mathbb{E}\left\{{\bm{L}}_{K}(\cdot,\cdot)^{\intercal}(\bm{\lambda}^{+}-\bm{\lambda}_{})\right\}^{2}\right]^{1/2}+c_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})$
	$\displaystyle\leq$	$\displaystyle c_{\ref{wc4}}\\|(\bm{\lambda}^{+}-\bm{\lambda}_{})\\|_{2}\left[\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\right]^{1/2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})$
	$\displaystyle\leq$	$\displaystyle c_{\ref{wc4}}\sqrt{C_{\ref{weightsC4}}}\\|(\bm{\lambda}^{+}-\bm{\lambda}_{*})\\|_{2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})$
	$\displaystyle\leq$	$\displaystyle\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left[\{\psi(K)\}^{-1}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\right].$

The second inequality is based on Assumption 5(a) and the mean value theorem. For the third inequality, we adopt the mean value theorem and Assumption 5(b) again, using the similar arguments for proving (S55) to show the boundedness of $\rho^{{}^{\prime\prime}}$ . Assumption 5(e) is used to obtain the desired $\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})$ order. The fifth inequality results from Assumption 5(c) and the last inequality is due to (S50).

Noting that

\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}=\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1)

by Lemma S3, we then could use the similar argument to bound $\|\hat{\omega}^{\pi}(\cdot,\cdot)-\omega^{\pi}(\cdot,\cdot)\|_{{\color[rgb]{0,0,0}n,T}}$ , which completes the proof. ∎

S2.5 Technical Proof in Section 4.3

In this section, we provide the proof of Theorem 6, which shows the efficiency of our weighted estimator. In the proof, we first decompose the estimation error incurred by $\hat{\cal V}(\pi)$ to the true ${\cal V}(\pi)$ . Based on the decomposition, we show that with the help of convergence results regarding to the projection step and the final weights, together with the approximation error to $Q^{\pi}$ , the desired $\sqrt{nT}$ convergence rate of $\hat{\cal V}(\pi)$ can be established. Finally, the efficiency of our estimator can be achieved.

Proof of Theorem 6.

We mainly prove Result (ii). Result (i) can be derived through similar arguments but under a different condition for $K$ . Thus, we omit that for brevity. To start with, we derive the following decomposition

		$\displaystyle\hat{\cal V}(\pi)-{\cal V}(\pi)$
	$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[Q^{\pi}(S_{i,t},A_{i,t}){-}\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};Q^{\pi})\right]{-}(1{-}\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})Q^{\pi}(S_{0},a^{\prime})\right]$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\epsilon_{i,t}$
	$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})^{\intercal}\right]\bm{\beta}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\bm{B}_{K}(S_{0},a^{\prime})^{\intercal}\bm{\beta}\right]$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[\Delta_{Q}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t},\Delta_{Q})\right]$
		$\displaystyle-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\Delta(S_{0},a^{\prime})\right]+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\epsilon_{i,t}$
	$\displaystyle=$	$\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i}\left[\bm{B}_{K}(S_{i},A_{i})^{\intercal}-\gamma\hat{g}^{\pi}(S_{i},A_{i};\bm{B}_{K})^{\intercal}\right]\bm{\beta}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\bm{B}_{K}(S_{0},a^{\prime})^{\intercal}\bm{\beta}\right]\cdots\cdots(\mathrm{I})$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\gamma\left[\hat{g}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})-g_{*}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})\right]^{\intercal}\bm{\beta}\cdots\cdots(\mathrm{II})$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\gamma\left[\hat{g}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})-g_{*}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})\right]^{\intercal}\bm{\beta}\cdots\cdots(\mathrm{III})$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\left[\Delta_{Q}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t},\Delta_{Q})\right]-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\Delta(S_{0},a^{\prime})\right]\cdots\cdots(\mathrm{IV})$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\left[\Delta_{Q}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t},\Delta_{Q})\right]\cdots\cdots(\mathrm{V})$
		$\displaystyle+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\epsilon_{i,t}\cdots\cdots(\mathrm{VI})$

In the following, we analyze (I)-(VI) components separately.

•

For (I), from the optimization constraint (25b),

$(\mathrm{I})\leq{\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}.$

By the condition that in Assumption 6(a), we can show $(\mathrm{I})=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1/2})$ .

•

For (III), we control it by utilizing the convergence of weights and function $\hat{g}^{\pi}$ . Note that

	$\displaystyle(\mathrm{III})$	$\displaystyle\leq\gamma\\|\hat{\omega}^{\pi}-\omega^{\pi}\\|_{{\color[rgb]{0,0,0}n,T}}\\|\{g_{*}^{\pi}(\cdot,\cdot;\bm{B}_{K})-\hat{g}(\cdot,\cdot;\bm{B}_{K})\}^{\intercal}\bm{\beta}\\|_{{\color[rgb]{0,0,0}n,T}}$
		$\displaystyle=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\zeta_{{\color[rgb]{0,0,0}n,T}}\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}),$

due to the conditions in Corollary 1 and conditions for $K$ in Theorem 6 Result (ii).

•

For (IV), it can be seen that the mean is zero. In addition, $\sup_{s,a}|\Delta_{Q}(s,a)|=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{2}})$ by Assumption 6(a), and $\omega^{\pi}(s,a)$ is bounded above by Assumption 5(a). Applying similar arguments in the proof of Theorem 4, we can show that

(\mathrm{IV})=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}K^{-r_{2}})=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}).

•

For (V), we will control it by the convergence of weights and also the magnitude of $\Delta_{Q}$ .

	$\displaystyle(\mathrm{V})$	$\displaystyle\leq\\|\hat{\omega}^{\pi}-\omega^{\pi}\\|_{{\color[rgb]{0,0,0}n,T}}\\|\Delta_{Q}-\gamma g^{\pi}_{*}(\cdot,\cdot;\Delta_{Q})\\|_{{\color[rgb]{0,0,0}n,T}}$
		$\displaystyle\leq\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\mathop{}\mathopen{}\mathcal{O}\mathopen{}\left(K^{-r_{2}}\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}),$

due to conditions for $K$ in Theorem 6(ii).

•

For (VI), by using the convergence of $\hat{\omega}^{\pi}$ to $\omega^{\pi}$ and the independence of $\epsilon_{i,t}$ for every $1\leq i\leq n$ and $0\leq t\leq T-1$ , we are able to prove that

		$\displaystyle\mathbb{E}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}\right\}^{2}$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\mathbb{E}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}\right\}^{2}\mid\{S_{i,t},A_{i,t},i=1,\dots,n,t=0,\dots,T-1\}\right]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[\frac{\sup_{i,t}\mathbb{E}\epsilon_{i,t}^{2}}{({\color[rgb]{0,0,0}nT})^{2}}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})^{2}\right]\lesssim\frac{1}{{\color[rgb]{0,0,0}nT}}\mathbb{E}\left[\\|\hat{\omega}^{\pi}-\omega^{\pi}\\|_{{\color[rgb]{0,0,0}n,T}}^{2}\right]=\scalebox{0.7}{$\mathcal{O}$}\left(\frac{1}{{\color[rgb]{0,0,0}nT}}\right),$

The first inequality is derived based on Assumption 6(b) and the second inequality comes from Theorem 4. Then we can show that

\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}).

Therefore,

\displaystyle(\mathrm{VI})=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\epsilon_{i,t}+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\epsilon_{i,t}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}).

•

For (II), recall that the true weight function $\omega^{\pi}(s,a)\in\mathcal{G}$ by the assumption in Theorem 6(ii). This is the key to bound (II). Take a function $B=\sum_{k=1}^{K}\beta_{k}B_{k}$ in optimization problem (24). In addition, it can be seen that $\hat{g}^{\pi}_{\beta}(\cdot,\cdot):=\sum_{k=1}^{K}\beta_{k}\hat{g}^{\pi}(\cdot,\cdot;B_{k})$ by the structure of the optimization problem. Let ${Y}_{i,t}=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B(S_{i,t+1},a^{\prime})$ .

Due to the optimization condition for $\hat{g}^{\pi}_{\beta}$ , we have

\frac{d\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{{Y}_{i,t}-\hat{g}^{\pi}_{\beta}(S_{i,t},A_{i,t})-u\omega^{\pi}(S_{i,t},A_{i,t})\right\}^{2}+\mu J_{\mathcal{G}}(\hat{g}^{\pi}_{\beta}+u\omega^{\pi})\right]}{du}\Bigg{|}_{u=0}=0

Recall that $J_{\mathcal{G}}(g)=\|g\|_{{\mathcal{H}}_{2}}^{2}$ . Here, we abuse the notation slightly and denote by $\langle\cdot,\cdot\rangle_{{\mathcal{H}}_{2}}$ , i.e., the inner product with respect to the RKHS specified in Assumption 4’(b). Then we have

	$\displaystyle 0$	$\displaystyle=\frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\{{Y}_{i,t}-g^{\pi}_{*}(S_{i,t},A_{i,t};B)\}$
		$\displaystyle+\frac{2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i},A_{i})\{\hat{g}^{\pi}_{\beta}(S_{i,t},A_{i,t})-g^{\pi}_{*}(S_{i,t},A_{i,t};B)\}+2\mu\langle\hat{g}^{\pi}_{\beta},\omega^{\pi}\rangle_{{\mathcal{H}}_{2}}$
		$\displaystyle=(\mathrm{i})+(\mathrm{ii})+(\mathrm{iii}),$

Now, we can see that the term $(\mathrm{ii})$ multiplied by $\gamma$ is exactly term (II) that we aim to bound. Therefore, it is sufficient to bound $(\mathrm{i})$ and $(\mathrm{iii})$ . Since we require that $\mu=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1/2})$ , $|(\mathrm{iii})|=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2})$ . As for term $(i)$ , recall that the error $e_{i,t}:=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})Q^{\pi}(S_{i,t+1},a^{\prime})-g^{\pi}_{*}(S_{i,t},A_{i,t};Q^{\pi})$ . Then we have

	$\displaystyle\mathrm{(i)}$	$\displaystyle=\frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})e_{i,t}$
		$\displaystyle+\frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\left\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\Delta_{Q}(S_{i,t+1},a)-g_{*}^{\pi}(S_{i,t},A_{i,t};\Delta_{Q})\right\}$

Next, we use Freedman’s inequality [17] to bound the second term in (i). For any integer $1\leq g\leq nT$ , let $i(g)$ and $t(g)$ be the quotient and the remainder of $g+T-1$ divided by $T$ satisfy $g=\{i(g)-1\}T+t(g)+1$ and $0\leq t(g)<T$ . Let $\mathcal{F}^{(0)}=\{S_{1,0},A_{1,0}\}$ . Then we iteratively define $\{\mathcal{F}^{g}\}_{1\leq g\leq nT}$ as follows:

	$\displaystyle\mathcal{F}^{(g)}=\mathcal{F}^{(g-1)}\cup\{S_{i(g),t(g)+1},A_{i(g),t(g)+1}\},\mbox{ if }t(g)<T-1,$
	$\displaystyle\mathcal{F}^{(g)}=\mathcal{F}^{(g-1)}\cup\{S_{i(g),T},S_{i(g)+1,0},A_{i(g)+1,0}\},\mbox{ otherwise }.$

Take $e^{(g)}:=\omega^{\pi}(S_{i(g),t(g)},A_{i(g),t(g)})e^{\Delta}_{i(g),t(g)}$ . From the definition of $e_{i,t}^{\Delta}$ , we can show that

\displaystyle\mathbb{E}\{e^{(g)}\mid\mathcal{F}^{(g-1)}\}=\mathbb{E}\{e^{(g)}\mid S_{i(g),t(g)},A_{i(g),t(g)}\}=0.

Then $\{\sum_{g=1}^{G}e^{(g)}\}$ , $G=1,\dots,nT$ forms a martingale with respect to the filtration $\{\sigma(\mathcal{F}^{(g)})\}_{g\geq 0}$ , where $\sigma(\mathcal{F}^{(g)})$ stands for the $\sigma$ -algebra generated by $\mathcal{F}^{(g)}$ .

Note that errors $e_{i,t}^{\Delta}:=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\Delta_{Q}(S_{i,t+1},a)-g_{*}^{\pi}(S_{i,t},A_{i,t};\Delta_{Q})$ , $i=1,\dots,n$ , $t=1,\dots,T$ are bounded by $\|\Delta_{Q}\|_{\infty}\leq C_{\ref*{Qconst1}}K^{-r_{2}}$ . Then we can verify that $\mathbb{E}(e^{(g)})^{2}\leq C_{\ref*{weightsC1}}^{2}C_{\ref*{Qconst1}}^{2}K^{-2r_{2}}$ by Assumptions 5(a) and 6(a). Now, we are able to apply Theorem 1.6 in [17]. For all $x>0$ , we have

\displaystyle\mbox{Pr}\left(\sum_{g=1}^{{\color[rgb]{0,0,0}nT}}e^{(g)}>x\right)\leq\exp\left(-\frac{x^{2}/2}{{\color[rgb]{0,0,0}nT}C_{\ref*{weightsC1}}^{2}C_{\ref*{Qconst1}}^{2}K^{-2r_{2}}+C_{\ref*{weightsC1}}C_{\ref*{Qconst1}}K^{-r_{2}}x/3}\right).

As such, we can derive

	$\displaystyle\frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\left\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\Delta_{Q}(S_{i,t+1},a)-g_{*}^{\pi}(S_{i,t},A_{i,t};\Delta_{Q})\right\}$
	$\displaystyle=-\frac{2}{{\color[rgb]{0,0,0}nT}}\sum_{g=1}^{{\color[rgb]{0,0,0}nT}}e^{(g)}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}K^{-r_{2}}\log({\color[rgb]{0,0,0}nT}))=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}).$

Then we can show that

\displaystyle(\mathrm{II})=\gamma(\mathrm{ii})=\gamma\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}e_{i,t}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2})\right\}

(S64)

Combining all the bounds from $(I)$ to $(VI)$ , we have

	$\displaystyle\hat{\cal V}(\pi)-{\cal V}(\pi)$	$\displaystyle=(I)+(II)+(III)+(IV)+(V)+(VI)$
		$\displaystyle=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\left\{\epsilon_{i,t}+\gamma e_{i,t}\right\}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}).$

Taking

\sigma^{2}=\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\{\omega^{\pi}(S_{t},A_{t})\left(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}|S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})-Q^{\pi}(S_{t},A_{t})\right)\right\}^{2},

we see that

\sum_{i=1}^{n}\sum_{t=0}^{T-1}\frac{\omega^{\pi}_{i,t}\left\{\epsilon_{i,t}+\gamma e_{i,t}\right\}}{\sqrt{nT}\sigma}

forms a mean zero martingale with respect to the filtration $\{\sigma(\mathcal{F}^{(g)})\}_{g\geq 0}$ .

Then we can verify that

\displaystyle\max_{i,t}\frac{|\omega^{\pi}_{i,t}(\epsilon_{i,t}+\gamma e_{i,t})|}{\sigma\sqrt{nT}}\leq C_{\ref*{weightsC1}}\left\{\frac{R_{\max}}{\sigma\sqrt{nT}}+\gamma\frac{Q_{\max}+G_{\max}+C_{\ref*{Qconst1}}K^{-r_{2}}}{\sigma\sqrt{nT}}\right\}\xrightarrow[]{P}0

due to Assumption 2, 5(a) and 6(a) and

\displaystyle\sum_{i=1}^{n}\sum_{t=0}^{T-1}\frac{|\omega^{\pi}_{i,t}(\epsilon_{i,t}+\gamma e_{i,t})|^{2}}{\sigma^{2}nT}\xrightarrow[]{P}1

due to the definition of $\sigma$ .

Then by martingale central limit theorem [45], we can show the asymptotic normality, i.e., as long as either $n\rightarrow\infty$ or $T\rightarrow\infty$ , we have

\displaystyle\frac{\sqrt{nT}}{\sigma}\left\{\hat{\cal V}(\pi)-{\cal V}(\pi)\right\}\xrightarrow[]{d}N(0,1).

(S65)

∎

S2.6 Technical Lemmas

In this section, we provide two technical lemmas (Lemma S2 and S4) that involve the truncation arguments for Markov chains. They are developed to obtain sharp convergence rates that depend on both sample size $n$ and the length of trajectories $T$ . In particular, Lemma S2 uses empirical process arguments and is used for Theorem 3. Lemma S4 generalizes the standard matrix Bernstein’s inequality for independent Markov chains and is used for Theorem 4.

Lemma S2.

Suppose Assumptions 4 hold. For

\displaystyle I_{2}(B)

\displaystyle=2(\mathbb{E}-\mathbb{P}_{N})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B).

where $f^{\pi}$ and $\mathbf{J}^{2}$ are defined in(S47) and (S46) respectively. Then with probability at least $1-2\delta-1/({\color[rgb]{0,0,0}nT})$ , for any $B\in\mathcal{Q}$ , we have

\displaystyle I_{2}(B)\lesssim\mu+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,{\color[rgb]{0,0,0}nT}\})]^{1/\tau}}{{\color[rgb]{0,0,0}nT}},

where $0<\tau\leq\frac{1}{3}$ , and the leading constant depends on $Q_{max},G_{\max},\alpha,\kappa,C_{\ref*{C_s}}$ .

Proof of Lemma S2.

To deal with the scenario that the initial distribution $\mathbb{G}$ is possibly not the stationary distribution $\mathbb{G}^{*}$ , we decompose $I_{2}(B)$ into three components.

Take $T^{\prime}=\min\{\lceil K_{3}\log(nT)\rceil,T\}$ , where $K_{3}$ is a constant to be specified later , $Z^{T}=[(S_{t},A_{t})]_{t=0}^{T-1}$ , $f_{T}^{\pi}(g_{1},g_{2},B)(Z^{T})=\frac{1}{T-1}\sum_{t=0}^{T}f^{\pi}(g_{1},g_{2},B)(S_{t},A_{t},S_{t+1})$ . Take $\mathbb{P}_{n}f_{T}^{\pi}(g_{1},g_{2},B)(Z^{T})=n^{-1}\sum_{i=1}^{n}f_{T}^{\pi}(g_{1},g_{2},B)(Z^{T}_{i})$ as the empirical evaluation for $f_{T}^{\pi}$ . And we denote by $\mathbb{E}^{*}$ , ${\mbox{Pr}}^{*}$ the expectation and probability under stationary distribution respectively.

	$\displaystyle I_{2}(B)$	$\displaystyle=2(\mathbb{E}-\mathbb{P}_{N})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B).$
		$\displaystyle=\frac{T^{\prime}}{T}\mathrm{II}(B)+\frac{T-T^{\prime}}{T}(\mathrm{III}(B)+\mathrm{IV}(B)),$

where

\displaystyle\mathrm{II}(B)=2(\mathbb{E}-\mathbb{P}_{n})f_{T^{\prime}}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}_{T^{\prime}}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B),

	$\displaystyle\mathrm{III}(B)=$	$\displaystyle\left(\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}-2f^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}(S_{i,t},A_{i,t},S_{i,t+1})\right)+\mathbb{E}^{}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}$
		$\displaystyle-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B),$

and

\displaystyle\mathrm{IV}(B)=\mathbb{E}\left[\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})\right]-\mathbb{E}^{*}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}.

For some fixed $t>0$ ,

		$\displaystyle\mbox{Pr}\left(\exists B\in\mathcal{Q},I_{2}(B)>3t\right)$
	$\displaystyle\leq$	$\displaystyle\mbox{Pr}\left(\exists B\in\mathcal{Q},\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)+\mbox{Pr}\left(\exists B\in\mathcal{Q},\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right)+\mbox{Pr}\left(\exists B\in\mathcal{Q},\mathrm{IV}(B)>\frac{T}{T-T^{\prime}}t\right)$

Then we bound these three probabilities one by one. Note that when $T^{\prime}=T$ , we do not need to take into account components III(B) and IV(B). So in the following, we analyze III(B) and IV(B) under the condition that $T\gg\log(nT)$ .

•

For II(B):

Take $\mathcal{F}^{T^{\prime}}_{l}=\{f_{T^{\prime}}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}:J_{\mathcal{G}}^{2}(g)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,J_{\mathcal{G}}^{2}(g_{*}^{\pi}(B))\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,J_{\mathcal{Q}}^{2}(B)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,B\in\mathcal{Q}\}.$

		$\displaystyle\mbox{Pr}\left(\exists B\in\mathcal{Q},\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)$
	$\displaystyle\leq$	$\displaystyle\sum_{l=0}^{\infty}\mbox{Pr}\left(\sup_{h\in\mathcal{F}^{T^{\prime}}_{l}}\frac{(\mathbb{E}-\mathbb{P}_{n})h}{\mathbb{E}(h)+2^{l}\frac{T}{T^{\prime}}t}>\frac{1}{2}\right),$

In the following, we verify the conditions (A1-A4) in Theorem 19.3 in [20] with $\mathcal{F}=\mathcal{F}^{T}_{l}$ , $\epsilon=1/2$ and $\eta=2^{l}t$ .

It’s easy to verify for (A1). For any $h\in\mathcal{F}$ ,

\displaystyle\|f^{\pi}\{g,g^{\pi}_{*}(B),B\}\|_{\infty}\leq 6G_{\max}(3Q_{\max}+3G_{\max})\triangleq K_{1}

(S66)

and therefore

\displaystyle\|f_{T^{\prime}}^{\pi}\{g,g^{\pi}_{*}(B),B\}\|_{\infty}\leq\frac{1}{T^{\prime}}\sum_{t=0}^{T^{\prime}-1}\|f^{\pi}\{g,g^{\pi}_{*}(B),B\}\|_{\infty}\leq K_{1}

For (A2), recall $f^{\pi}=f_{1}^{\pi}-f_{2}^{\pi}$ and thus

\displaystyle\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}^{2}]\leq 2\mathbb{E}[f_{1}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]+2\mathbb{E}[f_{2}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}].

For the first term of RHS above:

	$\displaystyle\mathbb{E}[f_{1}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]$
	$\displaystyle=\mathbb{E}\left[\left\{\left\{\delta_{t}^{\pi}(B)-g(S,A)\right\}^{2}-\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S,A;Q)\right\}^{2}\right\}^{2}\right]$
	$\displaystyle=\mathbb{E}\left[\left\{2\delta^{\pi}(B)-g(S,A)-g_{}^{\pi}(S,A;B)\right\}^{2}\left\{g_{}^{\pi}(S,A;B)-g(S,A)\right\}^{2}\right]$
	$\displaystyle\leq\left\{2(Q_{\max})+2G_{\max}\right\}^{2}\mathbb{E}[\left(g_{*}^{\pi}(S,A;B)-g(S,A)\right)^{2}]$
	$\displaystyle=4\left(Q_{\max}+G_{\max}\right)^{2}\mathbb{E}\left[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})\right],$

and the second term:

	$\displaystyle\mathbb{E}[f_{2}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]$
	$\displaystyle=\mathbb{E}\left[\Big{\{}\{\delta_{t}^{\pi}(B)-g_{}^{\pi}(S,A;B)\}\{g(S,A)-g_{}^{\pi}(S,A;B)\}\Big{\}}^{2}\right]$
	$\displaystyle\leq\mathbb{E}\left[\{\delta^{\pi}(B)-g_{}^{\pi}(S,A;B)\}^{2}\{g(S,A)-g_{}^{\pi}(S,A;B)\}^{2}\right]$
	$\displaystyle\leq\left(Q_{\max}+G_{\max}\right)^{2}\mathbb{E}[\left(g_{*}^{\pi}(S,A;B)-g(S,A)\right)^{2}]$
	$\displaystyle=\left(Q_{\max}+G_{\max}\right)^{2}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})],$

where we use the fact that $\mathbb{E}[f_{2}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})]=0$ . Putting together, we can show that

\displaystyle\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]

\displaystyle\leq K_{2}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(Q),Q\right\}(S,A,S^{\prime})],

(S67)

where $K_{2}=\left(Q_{\max}+G_{\max}\right)^{2}$ . Therefore,

	$\displaystyle\mathbb{E}f_{T}^{2}$	$\displaystyle=\mathbb{E}\left(\frac{1}{T}\sum_{t=0}^{T-1}f^{\pi}\left\{g,g_{*}^{\pi}(Q),Q\right\}(S_{t},A_{t},S_{t+1})\right)^{2}$
		$\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})^{2}]$
		$\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}K_{2}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})]=K_{2}\mathbb{E}f^{\pi}_{T}$

To verify that Condition (A3) holds for every $l$ . It suffices to ensure the inequality holds when $l=0$ , i.e.,

	$\displaystyle\sqrt{n}(1/2)^{3/2}\sqrt{\frac{T}{T^{\prime}}t}\geq 288\max(K_{1},\sqrt{2K_{2}})$		(S68)
	$\displaystyle\frac{T}{T^{\prime}}t\geq 8\{288\max(K_{1},\sqrt{2K_{2}})\}^{2}n^{-1}$

Next we verify (A4). We first consider the function class $\mathcal{F}_{l}=\{f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}:J_{\mathcal{G}}^{2}(g)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,J_{\mathcal{G}}^{2}(g_{*}^{\pi}(B))\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,J_{\mathcal{Q}}^{2}(B)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,B\in\mathcal{Q}\}.$ It is not hard to verify that with $M=\sqrt{2^{l}\frac{T}{T^{\prime}}t/\mu}$ ,

		$\displaystyle\log\left(\mathcal{N}(\epsilon,\mathcal{F}^{T}_{l},\\|\cdot\\|_{n})\right)$
	$\displaystyle\lesssim$	$\displaystyle\log\left(\mathcal{N}(\epsilon,\mathcal{F}_{l},\\|\cdot\\|_{{\color[rgb]{0,0,0}n,T}})\right)$
	$\displaystyle\lesssim$	$\displaystyle\log\left(\cal\mathcal{N}(\epsilon,\mathcal{Q}_{M},\\|\cdot\\|_{\infty})\mathcal{N}(\epsilon,\mathcal{G}_{M},\\|\cdot\\|_{\infty})\right).$

As a result of the entropy condition in Assumption 4

	$\displaystyle\log\left(\mathcal{N}(\epsilon,\mathcal{F}^{T}_{l},\\|\cdot\\|_{n})\right)$
$\displaystyle\lesssim$	$\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}_{l},\\|\cdot\\|_{{\color[rgb]{0,0,0}n,T}})$
$\displaystyle\lesssim$	$\displaystyle\log\mathcal{N}(\epsilon,\mathcal{Q}_{M},\\|\cdot\\|_{\infty})+2\log\mathcal{N}(\epsilon,\mathcal{G}_{M},\\|\cdot\\|_{\infty})$
$\displaystyle\lesssim$	$\displaystyle\left(\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t\right)^{\alpha}\epsilon^{-2\alpha}.$	(S69)

Now the Condition (A4) is satisfied if

\displaystyle x^{\frac{1+\alpha}{2}}\geq 4*96\sqrt{2}\max(K_{1},2K_{2})\sqrt{c_{2}\ltx@label{case1c1}}\left(\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t\right)^{\alpha/2}n^{-1/2},

for all $x\geq 2^{l}Tt/(8T^{\prime})$ and any $l$ , where $c_{\ref*{case1c1}}>0$ is a constant depending on $\alpha,G_{\max},Q_{\max}$ . And the above inequality holds as long as

\frac{T}{T^{\prime}}t\geq c_{3}\ltx@label{case1c2}(\mu^{\alpha}n)^{-1}

for some constant $c_{\ref*{case1c2}}>0$ depending on $\alpha,G_{\max},Q_{\max}$ .

Now, we are able to show that when $\frac{T}{T^{\prime}}t>\mu$ , $\frac{T}{T^{\prime}}t\geq c_{\ref*{case1c2}}(\mu^{\alpha}n)^{-1}$ , $\frac{T}{T^{\prime}}t\geq 8\{288\max(K_{1},\sqrt{2K_{2}})\}^{2}n^{-1}$ ,

		$\displaystyle\Pr\left(\exists B\in\mathcal{Q},\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)$
	$\displaystyle\leq$	$\displaystyle\sum_{l=0}^{\infty}\Pr\left(\sup_{h\in\mathcal{F}^{T}_{l}}\frac{(\mathbb{E}-\mathbb{P}_{n})h}{\mathbb{E}h+2^{l}\frac{T}{T^{\prime}}t}>\frac{1}{2}\right)$
	$\displaystyle\leq$	$\displaystyle\sum_{l=0}^{\infty}\exp\left\{-\frac{n2^{l}\frac{T}{T^{\prime}}t(1/2)^{3}}{128*2304\max(K_{1}^{2},K_{2})}\right\}$
	$\displaystyle\leq$	$\displaystyle c_{4}\ltx@label{case1c3}\exp\left(-c_{5}\ltx@label{case1c4}n\frac{T}{T^{\prime}}t\right),$

where $c_{\ref*{case1c3}},c_{\ref*{case1c4}}>0$ are some constants depending on $K_{1},K_{2}$ . Then, fixing some $\delta>0$ , we can verify that there exists a constant $c_{\ref*{case1c5}}>0$ , such that by taking $t\geq c_{6}\ltx@label{case1c5}\log(1/\delta)T^{\prime}/(nT)$ , $\Pr\left(\exists B\in\mathcal{Q},\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)\leq c_{\ref*{case1c3}}\exp(-c_{\ref*{case1c4}}nTt/T^{\prime})\leq\delta.$

Combining all the conditions on $t$ , there exists a constant $c_{7}\ltx@label{case1c6}$ , such that

\displaystyle\Pr\left(\exists B\in\mathcal{Q},\frac{T^{\prime}}{T}\mathrm{II}(B)>c_{\ref*{case1c6}}\left\{\frac{T^{\prime}}{T}\mu+\frac{T^{\prime}}{T}(\mu^{\alpha}n)^{-1}+(\log(1/\delta)+1)\frac{T^{\prime}}{nT}\right\}\right)\leq\delta.

•

For III(B):

By the geometrically ergodic property (Assumption 4(a)),

		$\displaystyle\mathbb{E}\left[\Pr\left(\exists B,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\mid\{(S_{i,0},A_{i,0})\}\right)-{\mbox{Pr}}^{*}\left(\exists B,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\right)\right]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}\left\{\sum_{i=1}^{n}\left\\|\mathbb{G}^{b}_{T^{\prime}}(\cdot\mid(S_{i,0},A_{i,0}))-\mathbb{G}^{*}(\cdot)\right\\|_{\mathrm{TV}}\right\}$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}\left\{\sum_{i=1}^{n}\phi(S_{i,0},A_{i,0})\kappa^{T^{\prime}}\right\}$
	$\displaystyle\leq$	$\displaystyle n\kappa^{T^{\prime}}\int_{(s,a)}\phi(s,a)d\mathbb{G}(s,a)\leq n\kappa^{T^{\prime}}C_{\ref*{C_s}}.$

Thus we have

\displaystyle\Pr\left(\exists B,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\right)\leq{\Pr}^{*}\left(\exists B,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\right)+n\kappa^{T^{\prime}}C_{\ref*{C_s}}

(S70)

Next, we will focus on the term ${\Pr}^{*}(\exists B,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t)$ , which is easier to deal with, given the stationary condition.

Take $\tilde{N}=n(T-T^{\prime})$ . Due to that Markov chains are stationary, we can write

	$\displaystyle\mathrm{III}(B)=$	$\displaystyle 2(\mathbb{E}^{}-\mathbb{P}_{\tilde{N}})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}({S},{A},{S}^{\prime})-\mathbb{E}^{}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{}^{\pi}(B),B\right\}({S},{A},{S}^{\prime})$
		$\displaystyle-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B).$

Similarly as how we handle II(B), we will bound

\displaystyle{\Pr}^{*}\left(\exists B\in\mathcal{Q},\mathrm{III}>\frac{T}{T-T^{\prime}}t\right)\leq\sum_{l=0}^{\infty}{\Pr}^{*}\left(\sup_{h\in\mathcal{F}_{l}}\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})h}{\mathbb{E}^{*}h+2^{l}\frac{T}{T-T^{\prime}}t}>\frac{1}{2}\right),

Here $\mathcal{F}_{l}=\{f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}:J_{\mathcal{G}}^{2}(g)\leq\frac{2^{l}}{\mu}\frac{T}{T-T^{\prime}}t,J_{\mathcal{G}}^{2}(g_{*}^{\pi}(B))\leq\frac{2^{l}}{\mu}\frac{T}{T-T^{\prime}}t,J_{\mathcal{Q}}^{2}(B)\leq\frac{2^{l}}{\mu}\frac{T}{T-T^{\prime}}t,B\in\mathcal{Q}\}.$

We use the independent block techniques [75] and the peeling device with the exponential inequality for the relative deviation of the empirical process developed in [16].

Next, we bound each term of the above probabilities by using the independent block technique. We define a partition by dividing the index $\{1,\cdots,\tilde{N}\}$ into $2v_{\tilde{N}}$ blocks, where each block has an equal length $x_{\tilde{N}}$ . The residual block is denoted by $R_{\tilde{N}}$ , i.e., $\{(j-1)x_{\tilde{N}}+1,\cdots,(j-1)x_{\tilde{N}}+x_{\tilde{N}}\}_{j=1}^{2v_{\tilde{N}}}$ and $R_{\tilde{N}}=\left\{2v_{\tilde{N}}x_{\tilde{N}}+1,\cdots,\tilde{N}\right\}$ . Then it can be seen that ${\tilde{N}}-2x_{\tilde{N}}<2v_{\tilde{N}}x_{\tilde{N}}\leq{\tilde{N}}$ and the cardinality $|R_{\tilde{N}}|<2x_{\tilde{N}}$ .

For each $l\geq 0$ , we will use a different independent block sequence denoted by $(x_{\tilde{N},l},v_{\tilde{N},l})$ with the residual $R_{l}$ and then optimize the probability bound by properly choosing $(x_{\tilde{N},l},v_{\tilde{N},l})$ and $R_{l}$ . More specifically, we choose

x_{\tilde{N},l}=\lfloor x^{\prime}_{\tilde{N},l}\rfloor\hskip 12.0pt\mbox{and }\hskip 12.0ptv_{N,l}=\lfloor\frac{\tilde{N}}{2x_{\tilde{N},l}}\rfloor,

where $x^{\prime}_{\tilde{N},l}=(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}$ and $v^{\prime}_{\tilde{N},l}=\frac{\tilde{N}}{2x^{\prime}_{\tilde{N},l}}$ with some positive constants $\tau$ and $p$ determined later. We require $\tau\leq p\leq\frac{1}{2+\alpha}\leq\frac{1}{2}$ . We also need $\frac{T}{T-T^{\prime}}t\geq\frac{1}{\tilde{N}}$ so that $x^{\prime}_{\tilde{N},l}\geq 1$ . Suppose $\tilde{N}$ is sufficiently large such that

\displaystyle\tilde{N}\geq c_{8}\ltx@label{case2c0}\triangleq 4\times 8^{2}\times K_{1}.

(S71)

In the following, we consider two cases. The first case considers any $l$ such that $x^{\prime}_{\tilde{N},l}\geq\frac{\tilde{N}}{8}$ . In this case, since $\tau\leq p$ , we can show that $x^{\prime}_{\tilde{N},l}\leq(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})^{p}$ . Combining with the sample size requirement, we can obtain that $(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})\geq(\frac{\tilde{N}}{8})^{\frac{1}{p}}\geq 4\tilde{N}K_{1}$ . Then we can show that in this case,

\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})\left\{h(Z)\right\}}{\mathbb{E}^{*}\left\{h(Z)\right\}+2^{l}\frac{T}{T-T^{\prime}}t}\leq\frac{2K_{1}}{2^{l}\frac{T}{T-T^{\prime}}t}\leq\frac{1}{2}.

Therefore, when $\frac{T}{T-T^{\prime}}t\geq\frac{1}{\tilde{N}}$ and $x^{\prime}_{\tilde{N},l}\geq\frac{\tilde{N}}{8}$ ,

{\Pr}^{*}\left(\sup_{h\in\mathcal{F}_{l}}\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})\left\{h(Z)\right\}}{\mathbb{E}^{*}\left\{h(Z)\right\}+2^{l}\frac{T}{T-T^{\prime}}t}>\frac{1}{2}\right)=0.

The second case that we consider is when $x^{\prime}_{\tilde{N},l}<\frac{\tilde{N}}{8}$ . Under the geometric ergodicity assumption, it follows from Theorem 3.7 [4] that the stationary Markov chain is exponentially $\bm{\beta}$ -mixing. The $\bm{\beta}$ -mixing coefficient at time lag $k$ satisfies that $\beta_{k}\leq\beta_{0}\exp(-\beta_{1}k)$ for $\beta_{0}\geq 0$ and $\beta_{1}>0$ . We apply the relative deviation concentration inequality for the exponential $\bm{\beta}$ -mixing stationary process given in Theorem 4 of [16], which combined results in [75] and Theorem 19.3 in [20]. To use their results, it then suffices to verify Conditions (C1)-(C5) in Theorem 4 of [16] with $\mathcal{F}=\mathcal{F}_{l}$ , $\epsilon=1/2$ and $\eta=2^{l}t$ to get an exponential inequality for each term in the summation. First of all, (C1) and (C2) have been verified in (S66) and (S67).

To verify Condition (C3), without loss of generality, we assume $K_{1}\geq 1$ . Otherwise, let $K_{1}=\max(1,K_{1})$ . Then we know that $2K_{1}x_{\tilde{N},l}\geq\sqrt{2K_{1}x_{\tilde{N},l}}$ since $x_{\tilde{N},l}\geq 1$ . We need to verify $\sqrt{\tilde{N}}\epsilon\sqrt{1-\epsilon}\sqrt{\eta}\geq 1152K_{1}x_{\tilde{N},l}$ , or it suffices to have $\sqrt{\tilde{N}}\epsilon\sqrt{1-\epsilon}\sqrt{\eta}\geq 1152K_{1}x^{\prime}_{\tilde{N},l}$ since $x^{\prime}_{\tilde{N},l}\geq x_{\tilde{N},l}$ by definition. Recall that $\epsilon=1/2$ and $\eta=2^{l}\frac{T}{T-T^{\prime}}t$ . To show this, it is enough to verify that

\sqrt{\tilde{N}}\frac{\sqrt{2}}{4}\sqrt{2^{l}\frac{T}{T-T^{\prime}}t}\geq 1152K_{1}(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})^{p},

since $(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})^{p}\geq x^{\prime}_{\tilde{N},l}$ . Recall that $p\leq\frac{1}{2+\alpha}$ , then it is sufficient to let $\frac{T}{T-T^{\prime}}t\geq\frac{2304\sqrt{2}K_{1}}{\tilde{N}}\triangleq\frac{c_{1}^{\prime}}{\tilde{N}}$ so that the above inequality holds for every $l\geq 0$ .

Next we verify (C4) that $\frac{|R_{l}|}{\tilde{N}}\leq\frac{\epsilon\eta}{6K_{1}}$ . Recall that $|R_{l}|<2x_{\tilde{N},l}\leq 2x^{\prime}_{\tilde{N},l}=2(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}$ . So if $\frac{T}{T-T^{\prime}}t\geq\frac{c_{9}\ltx@label{case2c1}}{\tilde{N}}$ for some positive constant $c_{\ref*{case2c1}}$ depending on $K_{1}$ , we can have

\frac{\epsilon\eta}{6K_{1}}=\frac{2^{l}}{12K_{1}}\frac{T}{T-T^{\prime}}t\geq\frac{2(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}}{\tilde{N}}=\frac{2x^{\prime}_{\tilde{N},l}}{\tilde{N}}>\frac{|R_{l}|}{\tilde{N}}.

In addition, $|R_{l}|\leq 2x^{\prime}_{\tilde{N},l}<\frac{\tilde{N}}{2}$ .

Lastly we verify condition (C5). Using the similar arguments in verifying (S69), we can show that

\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}_{l},\|\cdot\|_{{\color[rgb]{0,0,0}n,T}})\lesssim\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha}\epsilon^{-2\alpha}.

Then Condition (C5) is satisfied if the following inequality holds for all $x\geq(2^{l}\frac{T}{T-T^{\prime}}tx_{\tilde{N},l})/8$ ,

	$\displaystyle\frac{\sqrt{v_{\tilde{N},l}}(1/2)^{2}x}{96x_{\tilde{N},l}\sqrt{2}\max(K_{1},2K_{2})}$	$\displaystyle\geq\int_{0}^{\sqrt{x}}\sqrt{c_{10}\ltx@label{case2c2}}\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha/2}\left(\frac{u}{2x_{\tilde{N},l}}\right)^{-\alpha}du$
		$\displaystyle=x_{\tilde{N},l}^{\alpha}x^{\frac{1-\alpha}{2}}\sqrt{2^{\alpha}c_{\ref*{case2c2}}}\left(\frac{2^{l}\frac{T}{T-T^{\prime}}}{\mu}\right)^{\alpha/2},$

where $c_{\ref*{case2c2}}>0$ is a constant depending on $\alpha,G_{\max},Q_{\max}$ .

It is enough to guarantee that

\frac{\sqrt{v_{\tilde{N},l}}(1/2)^{2}x}{96x_{\tilde{N},l}\sqrt{2}\max(K_{1},2K_{2})}\geq x_{\tilde{N},l}^{\alpha}x^{\frac{1-\alpha}{2}}\sqrt{2^{\alpha}c_{\ref*{case2c2}}}\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha/2}.

After some algebra, we can check that the above inequality holds if for some constant $c_{11}\ltx@label{case2c3}>0$ ,

\displaystyle\frac{T}{T-T^{\prime}}t\geq c_{\ref*{case2c3}}\frac{(x_{\tilde{N},l})^{1+\alpha}}{v^{\prime}_{\tilde{N},l}2^{l}\mu^{\alpha}},

or equivalently, for some constant $c_{\ref*{case2c4}}>0$ ,

\displaystyle\frac{T}{T-T^{\prime}}t\geq c_{12}\ltx@label{case2c4}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}\left(2^{l}\right)^{\frac{1-p(2+\alpha)}{1-\tau(2+\alpha)}}},

by the definition that $x_{\tilde{N},l}\leq x^{\prime}_{\tilde{N},l}$ and $v^{\prime}_{\tilde{N},l}\leq v_{\tilde{N},l}$ . To summarize, if for any $l\geq 0$ ,

\frac{T}{T-T^{\prime}}t\geq\mu+c_{\ref*{case2c4}}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}\left(2^{l}\right)^{\frac{1-p(2+\alpha)}{1-\tau(2+\alpha)}}},

then the entropy inequality in Condition (C5) above holds. Since $0<\tau\leq p\leq\frac{1}{1+2\alpha}$ , the right hand side is a non-increasing function of $l$ . Then as long as,

\frac{T}{T-T^{\prime}}t\geq\mu+c_{\ref*{case2c4}}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}},

Condition (C5) holds .

To summarize, Conditions (C1)–(C5) in Theorem 4 in [16] with $\mathcal{F}=\mathcal{F}_{l}$ , $\epsilon=1/2$ and $\eta=2^{l}t$ hold for every $l\geq 0$ when $\frac{T}{T-T^{\prime}}t\geq c_{\ref*{case2c1}}\tilde{N}^{-1}$ for some constant $c_{\ref*{case2c1}}\geq 1$ and $\frac{T}{T-T^{\prime}}t\geq\mu+c_{\ref*{case2c4}}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}$ . Thus when $\tilde{N}\geq c_{\ref*{case2c0}}$ , there exists a constant $c_{13}\ltx@label{case2c5}>0$ , such that

	$\displaystyle{\Pr}^{*}\left\{\exists B\in\mathcal{Q},\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right\}$
	$\displaystyle\leq\sum_{l=0}^{\infty}{\Pr}^{}\left[\sup_{h\in\mathcal{F}_{l}}\frac{(\mathbb{E}^{}-\mathbb{P}_{\tilde{N}})\left\{h(Z)\right\}}{\mathbb{E}^{*}\left\{h(Z)\right\}+2^{l}\frac{T}{T-T^{\prime}}t}>\frac{1}{2}\right]$
	$\displaystyle\leq\sum_{l=0}^{\infty}120\exp\left\{-c_{\ref*{case2c5}}\frac{v_{\tilde{N},l}^{{}^{\prime}2}\frac{T}{T-T^{\prime}}t2^{l}}{\tilde{N}}\right\}+2\beta_{x_{\tilde{N},l}}v_{\tilde{N},l}$
	$\displaystyle\leq\sum_{l=0}^{\infty}120\exp\left\{-c_{\ref*{case2c5}}\frac{v_{\tilde{N},l}^{{}^{\prime}2}\frac{T}{T-T^{\prime}}t2^{l}}{\tilde{N}}\right\}+2\beta_{0}\exp\left(-\beta_{1}x_{\tilde{N},l}+\log v_{\tilde{N},l}\right),$

where the last inequality is due to that the Markov chain is stationary and exponentially $\beta$ -mixing. When $t\geq\frac{\left(4/\beta_{1}\log(\tilde{N})\right)^{1/\tau}}{\tilde{N}}$ , we have $\log v_{\tilde{N},l}\leq\frac{1}{2}\beta_{1}x_{\tilde{N},l}$ by using $x_{\tilde{N},l}^{\prime}\leq 2x_{\tilde{N},l}$ and $v_{\tilde{N},l}\leq\tilde{N}$ . This will further imply that $2\beta_{x_{\tilde{N},l}}v_{\tilde{N},l}\leq 2\beta_{0}\exp\left(-\beta_{1}x_{\tilde{N},l}/2\right)$ .

Then we will have

	$\displaystyle{\Pr}^{*}\left\{\exists B\in\mathcal{Q},\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right\}$
	$\displaystyle\lesssim\sum_{l=0}^{\infty}120\exp\left(-c_{\ref*{case2c5}}(\tilde{N}\frac{T}{T-T^{\prime}}t)^{1-2\tau}(2l)^{1-2p}\right)+2\beta_{0}\exp\left(-\beta_{1}(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}\right)$
	$\displaystyle\leq c_{14}\ltx@label{case2c6}\exp\left(-c_{15}\ltx@label{case2c7}(\tilde{N}\frac{T}{T-T^{\prime}}t)^{1-2\tau}\right)+c_{16}\ltx@label{case2c8}\exp\left(-c_{17}\ltx@label{case2c9}(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}\right),$

where $c_{\ref*{case2c6}}$ , $c_{\ref*{case2c7}}$ , $c_{\ref*{case2c7}}$ , $c_{\ref*{case2c8}}$ are some positive constants.

For some fixed $\delta>0$ , we can verify that there exists a constant $c_{18}\ltx@label{case2c10}>0$ , such that $c_{\ref*{case2c6}}\exp(-c_{\ref*{case2c7}}(\tilde{N}t)^{1-2\tau})+c_{\ref*{case2c8}}\exp(-c_{\ref*{case2c9}}(\tilde{N}t)^{\tau})\leq\delta$ for $t\geq c_{\ref*{case2c10}}\max\{[\log(1/\delta)\tilde{N}^{-1}]^{1/(1-2\tau)},[\log(1/\delta)\tilde{N}^{-1}]^{1/\tau}\}$ .

Substituting this probability into (S70) and combining all the conditions over $t$ , we can derive that there exists a constant $c_{19}\ltx@label{case2c11}>0$ , such that

		$\displaystyle\Pr\left(\exists B,\frac{T-T^{\prime}}{T}\mathrm{III}(B)\geq c_{\ref*{case2c11}}\left\{\mu+\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{(\log\tilde{N})^{1/\tau}+1+(\log(1/\delta))^{1/(1-2\tau)}+(\log(1/\delta))^{1/\tau}}{\tilde{N}}\right\}\right)$
	$\displaystyle\leq$	$\displaystyle\delta+n\kappa^{T^{\prime}}C_{\ref*{C_s}}$

•

For IV(B):

	$\displaystyle\|\mathrm{IV}(B)\|$	$\displaystyle\leq\mathbb{E}\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}K_{1}\left\\|\mathbb{G}^{b}_{t}(\cdot\mid(s_{0},a_{0}))-G_{s}(\cdot)\right\\|_{\mathrm{TV}}$
		$\displaystyle\leq K_{1}C_{\ref*{C_s}}\frac{1}{T-T^{\prime}}\kappa^{T^{\prime}}/(1-\kappa).$

Recall that $T^{\prime}=\lceil K_{3}\log(nT)\rceil$ given the condition that $T\gg\log(nT)$ . Then as long as we take $K_{3}=-3(\log kappa)^{-1}$ , then for sufficiently largest ${\color[rgb]{0,0,0}nT}$ , we have

n\kappa^{T^{\prime}}C_{\ref*{C_s}}=C_{1}n^{-2}T^{-3}\leq n^{-1}T^{-1},

and

\frac{\kappa^{T^{\prime}}}{(T-T^{\prime})(1-\kappa)}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}\left(\frac{1}{n^{3}T^{4}(1-\kappa)}\right)=\scalebox{0.7}{$\mathcal{O}$}(t),

due to the conditions of $t$ .

Combining all the results we have derived for II(B), III(B) and IV(B), we have that with probability at least $1-2\delta-1/({\color[rgb]{0,0,0}nT})$ , for any $B\in\mathcal{Q}$ ,

	$\displaystyle I_{2}(B)\lesssim\frac{\log({\color[rgb]{0,0,0}nT})}{T}\left\{\mu+(\mu^{\alpha}n)^{-1}\}+\frac{\log({\color[rgb]{0,0,0}nT})(1+\log(1/\delta))}{{\color[rgb]{0,0,0}nT}}\right\}$
	$\displaystyle+\mu+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{(\log({\color[rgb]{0,0,0}nT}))^{1/\tau}+1+(\log(1/\delta))^{1/(1-2\tau)}+(\log(1/\delta))^{1/\tau}}{{\color[rgb]{0,0,0}nT}}.$

Recall that we require $0<\tau\leq p\leq 1/(2+\alpha)$ , consider any $\tau\leq 1/3$ and pick $p$ any value between $\tau$ and $1/(1+2\alpha)$ . Then we can simplify the bound into

\displaystyle I_{2}(B)\lesssim\mu+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,{\color[rgb]{0,0,0}nT}\})]^{1/\tau}}{{\color[rgb]{0,0,0}nT}}.

∎

Lemma S3.

Suppose Assumption 4(a) (4’(a)) and Assumption 5(c) hold. Then we have

\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right\}\\ =\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{\frac{K}{{\color[rgb]{0,0,0}nT}}}\log({\color[rgb]{0,0,0}nT})\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1).

Proof.

Recall that $\overline{p}_{T}^{b}=1/T\sum_{t=0}^{T}p_{t}^{b}$ . By Assumption 5(c), we can show that
$\sup_{(s,a)}\lambda_{\max}\{\bm{L}_{K}(s,a)\bm{L}_{K}(s,a)^{\intercal}\}\leq\sup_{(s,a)}\|\bm{L}_{K}(s,a)\|_{2}^{2}\leq(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K$ . Next, let us consider two cases.

•

Case 1: $T$ is fixed, i.e., $T\asymp 1$ .

Take $\bm{Z}_{i}=\frac{1}{T}\sum_{t=0}^{T}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}$ . From above, we can see that $\lambda_{\max}\left\{\bm{Z}_{i}\right\}\leq(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K$ for any $i$ .

	$\displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\mathbb{E}\bm{Z}_{i}\bm{Z}_{i}^{\intercal}\right\}$	$\displaystyle\leq n\lambda_{\max}\left\{\mathbb{E}\bm{Z}_{i}\bm{Z}_{i}^{\intercal}\right\}$
		$\displaystyle=n\sup_{i}\lambda_{\max}\{\bm{Z}_{i}\}\lambda_{\max}\left\{\mathbb{E}\bm{Z}_{i}\right\}$
		$\displaystyle\leq n(C_{\ref{weightsC2}}+\gamma C_{\ref{weightsC2}})^{2}KC_{\ref*{weightsC4}}.$

Then, applying Matrix Bernstein inequality [66], we have

	$\displaystyle\mbox{Pr}\left(\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\left(\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right)\right\}\geq x\right)$
	$\displaystyle\leq\mbox{Pr}\left(\lambda_{\max}\left\{\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}-\mathbb{E}\bm{Z}_{i}\right\}\geq x\right)$
	$\displaystyle\leq(K+K)\exp\left(-\frac{-nx^{2}/2}{(C_{\ref{weightsC2}}+\gamma C_{\ref{weightsC2}})^{2}KC_{\ref{weightsC4}}+(C_{\ref{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}Kx/3}\right)$

Due to the condition that $K=\scalebox{0.7}{$\mathcal{O}$}(\sqrt{{\color[rgb]{0,0,0}nT}/\log({\color[rgb]{0,0,0}nT})})=\scalebox{0.7}{$\mathcal{O}$}(\sqrt{n/\log n})$ , we have

		$\displaystyle\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right\}$
	$\displaystyle=$	$\displaystyle\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{\frac{K\log K}{n}}\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1).$

•

Case 2: $T\rightarrow\infty$ .

Take $\mbox{Pr}^{*}$ as the stationary distribution and $\mathbb{E}^{*}$ as the expectation taken over the stationary distribution. Take $\bm{Z}_{i,t}=\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}$ . From Assumption 5(c), we know that $\lambda_{\max}\left\{{\mathbb{E}}^{*}\bm{Z}_{i,t}\right\}\leq C_{7}$ . It can be also verified that $\lambda_{\max}\{\bm{Z}_{i,t}\}\leq(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K$ for any $i,t$ , and

\displaystyle\lambda_{\max}\left\{{\mathbb{E}}^{*}\bm{Z}_{i,t}\bm{Z}_{i,t}\right\}\leq\sup_{i,t}\lambda_{\max}\left\{\bm{Z}_{i,t}\right\}\lambda_{\max}\left\{{\mathbb{E}}^{*}\bm{Z}_{i,t}\right\}\lesssim K.

For any fixed $i$ , by Theorem 4.2 of [9], there exists some constant $c_{20}\ltx@label{bernc1}>0$ , such that for any $\tau\geq 0$ and integer $1<q<T$ ,

		$\displaystyle{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)$
	$\displaystyle\leq$	$\displaystyle\frac{T}{q}\beta(q)+{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t\in\mathcal{I}_{r}}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)$
	$\displaystyle+$	$\displaystyle c_{\ref{bernc1}}K\exp\left(-\frac{\tau^{2}/2}{TqK+q(C_{\ref{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K\tau/3}\right),$

where $\mathcal{I}_{r}=\{q\lfloor(T+1)/q\rfloor{},q\lfloor(T+1)/q\rfloor{}+1,\dots,T-1\}$ . Suppose $\tau\geq q(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K$ . Notice that $|\mathcal{I}_{r}|\leq q$ . It follows that

\displaystyle{\mbox{Pr}}^{*}\left(\lambda_{\max}\left\{\sum_{t\in\mathcal{I}_{r}}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)=0.

Since $\beta(q)=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\kappa^{q})$ , set $q=-6\log(nT)/\log\kappa$ , we obtain $T\beta(q)/q=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-6}T^{-5})$ . Set $\tau=c_{21}\ltx@label{bernc2}\max\{\sqrt{TqK\log(nT)},q(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K\}$ with some appropriate constant $c_{\ref*{bernc2}}>0$ . From the condition that $K=\scalebox{0.7}{$\mathcal{O}$}(nT)$ , the following event occurs with probability at least $1-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-6}T^{-5})$ ,

\displaystyle\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\},\quad\mathrm{where}\quad\bm{Z}_{i,t}\sim\mathbb{G}^{*}.

Note that

		$\displaystyle\mbox{Pr}\left(\lambda_{\max}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right\}\geq x\right)=\mathbb{E}\mathbbm{1}_{\lambda_{\max}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right\}\geq x}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{(S_{i,t},A_{i,t})\sim G}\frac{\bar{p}_{T}^{b}(S,A)}{p^{}(S,A)}\mathbbm{1}_{\lambda_{\max}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right\}\geq x}$
	$\displaystyle\leq$	$\displaystyle\sqrt{\mathbb{E}^{}\left(\frac{\bar{p}_{T}^{b}(S,A)}{p^{}(S,A)}\right)^{2}}\sqrt{{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq x\right)}$
	$\displaystyle\leq$	$\displaystyle\sqrt{C_{7}}\sqrt{{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq x\right)},$

where the first inequality is due to Cauchy Schwarz inequality and the second inequality is due to Assumption 5(c).

Then we have with probability at least $1-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-3}T^{-5/2})$ ,

\displaystyle\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\},\quad\mathrm{where}\quad\bm{Z}_{i,t}\sim\overline{p}_{T}^{b}.

Then with probability at least $1-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-2}T^{-5/2})$ ,

\displaystyle\max_{i=1,\dots,n}\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\}

(S72)

Take

\displaystyle\mathcal{A}_{i}=\left\{\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\}\right\}.

Then based on the condition of $K$ ( $K=\scalebox{0.7}{$\mathcal{O}$}(nT/\log^{2}(nT))$ ), the matrix Bernstein inequality [65] yields that

\displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\mathbbm{1}(\mathcal{A}_{i})-n\mathbb{E}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{*}\bm{Z}_{0,t}\right)\mathbbm{1}(\mathcal{A}_{0})\right\}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{nTK}\log(nT)\right).

Based on the probability for (S72), we have

\displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)-n\mathbb{E}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{*}\bm{Z}_{0,t}\right)\mathbbm{1}(\mathcal{A}_{0})\right\}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{nTK}\log(nT)\right).

Also, we can verify that for any $\bm{a}\in\mathbb{R}^{K}$ such that $\|\bm{a}\|_{2}=1$ ,

	$\displaystyle\bm{a}^{\intercal}\left\{\mathbb{E}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{}\bm{Z}_{0,t}\right)\mathbbm{1}(\mathcal{A}^{c}_{0})\right\}\bm{a}\leq\sqrt{\mathbb{E}\left\{\bm{a}^{\intercal}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{}\bm{Z}_{0,t}\right)\bm{a}\right\}^{2}}\sqrt{\mbox{Pr}(\mathcal{A}_{0}^{c})}$
	$\displaystyle{\leq\sqrt{TK}\log(nT)n^{-3/2}T^{-5/4}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-1}).}$

Overall, we have

\displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-n\mathbb{E}\left(\sum_{t=1}^{T-1}\bm{Z}_{0,t}\right)\right\}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\sqrt{nTK}\log(nT))

And therefore,

	$\displaystyle\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right\}$
	$\displaystyle=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}}{\sqrt{nT}}\log(nT)\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1)$

∎

Lemma S4.

Suppose the Markov chain $\{S_{t},A_{t}\}_{t\geq 0}$ satisfy Assumption 4(a) (4’(a)). Take $\bm{F}(S,A)=[f_{k}(S,A)]_{k1,\dots,K}\in\mathbb{R}^{K}$ . Suppose that $\sup_{(s,a)}\|\bm{F}(s,a)\|_{2}\leq R$ for some quantity $R>0$ , Then we have

\displaystyle\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{F}(S_{t},A_{t})\right\}\right\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(R\sqrt{\frac{\log(K)\log(nT)}{nT}}\right)

Proof.

Take $T^{\prime}=\min\{T,2\log(1/\kappa)\log({\color[rgb]{0,0,0}nT})\}$ , $\bm{F}^{\mathrm{I}}(S_{i},A_{i})=\sum_{t=0}^{T^{\prime}-1}\bm{F}(S_{i,t},A_{i,t})/T^{\prime}$ , $\bm{F}^{\mathrm{II}}(S_{i},A_{i})=\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})/(T-T^{\prime})$ . Take $\mathbb{E}^{*}$ and ${\mbox{Pr}}^{*}$ as the expectation and probability under the stationary distribution of $\{S_{t},A_{t}\}$ .

		$\displaystyle\left\\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{F}(S_{t},A_{t})\right\}\right\\|_{2}$
	$\displaystyle=$	$\displaystyle\frac{T^{\prime}}{T}\left\\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{I}}(S_{i},A_{i})-\mathbb{E}\bm{F}^{\mathrm{I}}(S,A)\right\\|_{2}+\frac{T-T^{\prime}}{T}\left\\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S_{i},A_{i})\right\\|_{2}$
		$\displaystyle+\frac{T-T^{\prime}}{T}\left\\|\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S,A)-\mathbb{E}\bm{F}^{\mathrm{II}}(S,A)\right\\|_{2}$
		$\displaystyle=\mathrm{(i)}+\mathrm{(ii)}+\mathrm{(iii)}$

•

For (i),

We apply the Matrix Bernstein inequality [66] to bound it. Take $\bm{Z}(S_{i},A_{i})=\bm{F}^{\mathrm{I}}(S_{i},A_{i})-\mathbb{E}\bm{F}^{\mathrm{I}}(S_{i},A_{i})$ . We can verify that

		$\displaystyle\left\\|\sum_{i=1}^{n}\mathbb{E}\bm{Z}(S_{i},A_{i})\{\bm{Z}(S_{i},A_{i})\}^{\intercal}\right\\|_{2}\leq\mathbb{E}\sum_{i=1}^{n}\left\\|{}_{2}\bm{Z}(S_{i},A_{i})\right\\|^{2}_{2}$
	$\displaystyle\leq$	$\displaystyle 2n\sup_{s,a}\\|\bm{F}^{\mathrm{I}}(s,a)\\|^{2}_{2}\leq\frac{2n}{T^{\prime}}\sum_{t=0}^{T^{\prime}-1}\sup_{s,a}\\|\bm{F}(s,a)\\|^{2}_{2}\leq 2nR^{2}.$

The same bound can be derived for $\left\|\sum_{i=1}^{n}\mathbb{E}\{\bm{Z}(S_{i},A_{i})\}^{\intercal}\bm{Z}(S_{i},A_{i})\right\|_{2}$ . Then for all $x>0$ ,

\displaystyle\mbox{Pr}\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{I}}(S_{i},A_{i})-\mathbb{E}\bm{F}^{\mathrm{I}}(S,A)\right\|_{2}\geq x\right)\leq K\exp\left(\frac{-nx^{2}/2}{2R^{2}+2xR/3}\right).

Given the condition that $K=\scalebox{0.7}{$\mathcal{O}$}({\color[rgb]{0,0,0}nT})$ and $T^{\prime}=\min\{T,2\log(1/\kappa)\log({\color[rgb]{0,0,0}nT})\}$ , we have

\displaystyle(i)=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{T^{\prime}}{T}\frac{R\sqrt{\log K}}{\sqrt{n}}\right)=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{R\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right).

If $T=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\log({\color[rgb]{0,0,0}nT}))$ , then we do not need to analyze the remaining two components. In the following, we assume that $T^{-1}=\scalebox{0.7}{$\mathcal{O}$}(1/\log({\color[rgb]{0,0,0}nT}))$ .

•

For (ii),

		$\displaystyle\mbox{Pr}\left(\left\\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S,A)\right\\|_{2}>x\mid(S_{i,0},A_{i,0}),i=1,\dots,n\right)$
		$\displaystyle\qquad-{\mbox{Pr}}^{}\left(\left\\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{}\bm{F}^{\mathrm{II}}(S,A)\right\\|_{2}>x\mid(S_{i,0},A_{i,0}),i=1,\dots,n\right)$
	$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{n}\\|\mathbb{G}_{T^{\prime}}^{b}(\cdot\mid(S_{i,0},A_{i,0}))-\mathbb{G}^{*}\\|_{\mathrm{TV}}\leq\sum_{i=1}^{n}\phi(S_{i,0},A_{i,0})\kappa^{T^{\prime}}$

Then

	$\displaystyle\mbox{Pr}\left(\left\\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S_{i},A_{i})\right\\|_{2}>x\right)$
	$\displaystyle\leq{\mbox{Pr}}^{}\left(\left\\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S,A)-\mathbb{E}^{}\bm{F}^{\mathrm{II}}(S_{i},A_{i})\right\\|_{2}>x\right)+n\kappa^{T^{\prime}}\int_{(s,a)}\phi(s,a)d\mathbb{G}(s,a)$
	$\displaystyle\leq{\mbox{Pr}}^{}\left(\left\\|\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{}\bm{F}(S,A)\right\\|_{2}>x\right)+C_{\ref*{C_s}}n\kappa^{T^{\prime}},$

Next, we apply Theorem 4.2 in [9] to bound

{\mbox{Pr}}^{*}\left(\left\|\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)\right\|_{2}>x\right).

To begin with, take $\bm{Z}^{*}(S_{i,t},A_{i,t})=\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)$ . We can verify that

\displaystyle\left\|\sum_{i=1}^{n}\mathbb{E}\bm{Z}^{s}(S_{i},A_{i})\{\bm{Z}^{s}(S_{i},A_{i})\}^{\intercal}\right\|_{2}\leq\mathbb{E}\left\|\sum_{i=1}^{n}\bm{Z}^{s}(S_{i},A_{i})\right\|^{2}_{2}\leq 2n\sup_{s,a}\|\bm{F}(s,a)\|^{2}_{2}\leq 2nR^{2}.

The same bound can be derived for $\left\|\sum_{i=1}^{n}\mathbb{E}\{\bm{Z}^{s}(S_{i},A_{i})\}^{\intercal}\bm{Z}^{s}(S_{i},A_{i})\right\|_{2}$ . Then we adopt the similar arguments in the proof of Case 2 in Lemma S3. Take $q=-3\log(nT)/\log\kappa$ in Theorem 4.2 of [9]. Then by Corollary 4.2 in [9], we can verify that $\beta(q){\color[rgb]{0,0,0}nT}/q=\scalebox{0.7}{$\mathcal{O}$}(1)$ , and $R\sqrt{q\log K}=\scalebox{0.7}{$\mathcal{O}$}(R\sqrt{{\color[rgb]{0,0,0}nT}})$ , and we have

\displaystyle\left\|\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)\right\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{R\sqrt{\log({\color[rgb]{0,0,0}nT})\log K}}{\sqrt{n(T-T^{\prime})}}\right)

•

For (iii),

	$\displaystyle\left\\|\mathbb{E}^{*}\bm{F}^{\mathrm{II}}_{k}(S,A)-\mathbb{E}\bm{F}^{\mathrm{II}}_{k}(S,A)\right\\|^{2}_{2}$	$\displaystyle=\sum_{k}\left[\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\left\{\mathbb{E}^{*}f_{k}(S,A)-\mathbb{E}f_{k}(S_{t},A_{t})\right\}\right]^{2}$
		$\displaystyle\leq\sum_{k}\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\left\{\mathbb{E}^{*}f_{k}(S,A)-\mathbb{E}f_{k}(S_{t},A_{t})\right\}^{2}$
		$\displaystyle\leq\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\kappa^{2t}\int_{(s,a)}R^{2}d\mathbb{G}_{0}(s,a)$
		$\displaystyle\leq\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\kappa^{2t}R^{2}\leq\frac{c_{22}\ltx@label{concen_{c}2}}{T-T^{\prime}}\frac{R^{2}}{n^{4}T^{4}},$

where $c_{\ref*{concen_c2}}>0$ is a constant depending on $C_{\ref*{C_s}}$ and $\kappa$ .

Combining all the bounds from (i), (ii) and (iii), given the condition that of $a,b$ and $R$ , we can derive that

		$\displaystyle\left\\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{F}(S_{t},A_{t})\right\}\right\\|_{2}$
	$\displaystyle=$	$\displaystyle\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{R\sqrt{\log K\log({\color[rgb]{0,0,0}nT})}}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right)$

∎

S3 Additional Proof and Lemmas

Lemma S5.

If we take $\bm{B}_{K}(\cdot,a)$ as either tensor-product B-spline basis or tensor-product wavelet basis for every $a\in\mathcal{A}$ . And we assume that the average visitation probability (density) $\bar{p}_{T}^{b}(s,a)$ is upper bounded by a constant $p_{\max}$ and lower bounded by a constant $p_{\min}$ . Then there exists constants $c_{*}$ and $C_{*}$ such that

\displaystyle c_{*}\leq\lambda_{\min}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\leq\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\leq C_{*}.

(S73)

In addition, when $T\rightarrow\infty$ , we have

\displaystyle\frac{c_{*}}{2}\leq\lambda_{\min}\left\{\mathbb{E}_{(S,A)\sim G^{*}}\left[\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right]\right\}\leq\lambda_{\max}\left\{\mathbb{E}_{(S,A)\sim G^{*}}\left[\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right]\right\}\leq 2C_{*}.

Proof of Lemma S5.

Under the conditions in Lemma S5, by Lemma 2 in [62], for any $a\in\mathcal{A}$ , there exists positive constants $\tilde{C}$ and $\tilde{c}$ such that

\displaystyle\tilde{c}\leq\lambda_{\min}\left\{\int_{s\in\mathcal{S}}\bm{B}_{K}(s,a)\bm{B}_{K}(s,a)^{\intercal}ds\right\}\leq\lambda_{\max}\left\{\int_{s\in\mathcal{S}}\bm{B}_{K}(s,a)\bm{B}_{K}(s,a)^{\intercal}ds\right\}\leq\tilde{C}.

(S74)

Based on these results, we have

		$\displaystyle\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}$
	$\displaystyle=$	$\displaystyle\lambda_{\max}\left\{\overline{\mathbb{E}}\left\{\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\right\}$
	$\displaystyle\leq$	$\displaystyle p_{\max}\max_{a\in\mathcal{A}}\left[\lambda_{\max}\left\{\int_{s\in\mathcal{S}}\bm{B}_{K}(s,a)\bm{B}_{K}(s,a)^{\intercal}ds\right\}\right]$
	$\displaystyle\leq$	$\displaystyle p_{\max}\tilde{C}.$

Using the similar arguments, we are able to show that

		$\displaystyle\lambda_{\min}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}$
	$\displaystyle\geq$	$\displaystyle p_{\min}\tilde{c}.$

Take $c_{*}=p_{\min}\tilde{c}$ and $C_{*}=p_{\max}\tilde{C}$ , the conclusion follows.

When $T\rightarrow\infty$ , Due to Assumption 4’(a),

	$\displaystyle\left\\|\overline{p}_{T}^{b}-\mbox{Pr}^{*}\right\\|_{\mathrm{TV}}=$	$\displaystyle\int_{(s,a)}\left\\|\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{G}_{t}^{b}(\cdot\mid(s,a))-\mathbb{G}^{*}(\cdot)\right\\|_{\mathrm{TV}}dG_{0}(s,a)$
	$\displaystyle\leq$	$\displaystyle\int_{(s,a)}\frac{1}{T}\sum_{t=0}^{T-1}\phi(s,a)\kappa^{t}dG_{0}(s,a)\leq C_{\ref*{C_s}}\frac{1}{T}\frac{1-\kappa^{T}}{1-\kappa}=\scalebox{0.7}{$\mathcal{O}$}(1).$

As $\overline{p}_{T}^{b}$ is lower bounded by $p_{\min}$ and upper bounded by $p_{\max}$ . The density of the stationary distribution is lower bounded by $p_{\min}/2$ and upper bounded by $2p_{\max}$ for sufficiently large $T$ . Then follow the same argument, we are able to show the bounds for $\lambda_{\max}\{{\mathbb{E}}^{*}\left\{\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\}$ and $\lambda_{\min}\{{\mathbb{E}}^{*}\left\{\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\}$ . ∎

Proof of Theorem 7.

We start with showing the “if” direction of the statement. Suppose $\chi\leq c_{23}\ltx@label{cnt:chi1}$ for some $c_{\ref*{cnt:chi1}}>0$ .

First, we show that following equality holds for any function $f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ :

\displaystyle f(s,a)=\frac{1}{1-\gamma}\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid s,a)}[f(S^{\prime},A^{\prime})-\gamma(\mathcal{P}^{\pi}f)(S^{\prime},A^{\prime})\mid S=s,A=a],\quad\forall s\in\mathcal{S},a\in\mathcal{A}.

(S75)

To see this, note the definition of $d^{\pi}(\cdot,\cdot\mid S,A)$ , and we have

		$\displaystyle(1-\gamma)^{-1}\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid s,a)}[f(S^{\prime},A^{\prime})-\gamma(\mathcal{P}^{\pi}f)(S^{\prime},A^{\prime})\mid S=s,A=a]$
	$\displaystyle=$	$\displaystyle\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\left(\sum_{t=0}^{\infty}\left\{f(s^{\prime},a^{\prime})-\gamma\mathbb{E}_{\tilde{S}\sim P(\cdot\mid s^{\prime},a^{\prime}),\tilde{a}\sim\pi(\cdot\mid\tilde{S})}\left[f(\tilde{S},\tilde{a})\mid s^{\prime},a^{\prime}\right]\right\}\gamma^{t}p_{t}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)\right)ds^{\prime}$
	$\displaystyle=$	$\displaystyle\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\left(\sum_{t=0}^{\infty}f(s^{\prime},a^{\prime})\gamma^{t}p_{t}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)\right)ds^{\prime}$
		$\displaystyle\qquad\qquad\qquad-\int_{\tilde{s}\in\mathcal{S}}\sum_{\tilde{a}\in\mathcal{A}}\pi(\tilde{a}\mid\tilde{s})\left(\sum_{t=1}^{\infty}f(\tilde{s},\tilde{a})\gamma^{t}p_{t}^{\pi}(\tilde{s},\tilde{a}\mid S_{0}=s,A_{0}=a)\right)d\tilde{s}$
	$\displaystyle=$	$\displaystyle\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\left(f(s^{\prime},a^{\prime})p_{0}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)\right)ds^{\prime}=f(s,a).$

Based on (S75), for any $f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ , we have

	$\displaystyle\bar{\mathbb{E}}f^{2}(S,A)$	$\displaystyle=\frac{1}{(1-\gamma)^{2}}\bar{\mathbb{E}}\left\{\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid s,a)}[f(S^{\prime},A^{\prime})-\gamma(\mathcal{P}^{\pi}f)(S^{\prime},A^{\prime})\mid S,A]\right\}^{2}$
		$\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\chi\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})f(S,A)\right\}^{2},$

where the inequality comes from the definition of $\chi$ . Then we prove

\displaystyle\Upsilon=\inf_{\bar{\mathbb{E}}f^{2}(S,A)\geq 1}\left[{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})f(S,A)\right\}^{2}}\right]\geq\frac{(1-\gamma)^{2}}{{\chi}}>\frac{(1-\gamma)^{2}}{{c_{\ref*{cnt:chi1}}}}>0.

Next, we derive the “only if” direction of the statement. Suppose $\Upsilon\geq c_{24}\ltx@label{cnt:chi2}$ for some positive constant $c_{\ref*{cnt:chi2}}$ .

Then, the null space of $I-\gamma\mathcal{P}^{\pi}$ is empty. For any $f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ , there exists a $\tilde{f}\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ , such that

\displaystyle f=(I-\gamma\mathcal{P}^{\pi})\tilde{f}.

(S76)

Now we can derive

		$\displaystyle\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f(S^{\prime},A^{\prime})\mid S,A\right\}\right]^{2}$
	$\displaystyle=$	$\displaystyle\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{(I-\gamma\mathcal{P}^{\pi})\tilde{f}(S^{\prime},A^{\prime})\mid S,A\right\}\right]^{2}$
	$\displaystyle=$	$\displaystyle(1-\gamma)^{2}\bar{\mathbb{E}}\left[\tilde{f}(S,A)\right]^{2}$
	$\displaystyle\leq$	$\displaystyle(1-\gamma)^{2}\left[\inf_{\bar{\mathbb{E}}g^{2}(S,A)=1}{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})g(S,A)\right\}^{2}}\right]^{-1}\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})\tilde{f}(S,A)\right\}^{2}$
	$\displaystyle\leq$	$\displaystyle(1-\gamma)^{2}c_{\ref*{cnt:chi2}}^{-1}\bar{\mathbb{E}}\left[f(S,A)\right]^{2},$

where the first equality is from (S76), the second equality is due to (S75) and the last inequality is from the assumption for the minimal eigenvalue and (S76). Then

\displaystyle\chi\leq(1-\gamma)^{2}c_{\ref*{cnt:chi2}}^{-1}.

∎

Proof of Corollary 2.

By Jensen’s inequality and the definition of $\bar{d}^{\pi}$ , we have

		$\displaystyle\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f(S^{\prime},A^{\prime})\mid S,A\right\}\right]^{2}$
	$\displaystyle\leq$	$\displaystyle\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f^{2}(S^{\prime},A^{\prime})\mid S,A\right\}\right]$
	$\displaystyle=$	$\displaystyle{\mathbb{E}}_{(S,A)\sim\bar{d}^{\pi}}f^{2}(S,A).$

Then under the condition (34), we obtain

\displaystyle\chi\leq\sup_{f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}}\frac{{\mathbb{E}}_{(S,A)\sim\bar{d}^{\pi}}f^{2}(S,A)}{\bar{\mathbb{E}}f^{2}(S,A)}\leq\frac{\bar{\mathbb{E}}\frac{\bar{d}^{\pi}(S,A)}{\bar{p}_{T}^{b}(S,A)}f^{2}(S,A)}{\bar{\mathbb{E}}f^{2}(S,A)}\leq C_{\ref*{cnt:denbound}}\frac{\bar{\mathbb{E}}f^{2}(S,A)}{\bar{\mathbb{E}}f^{2}(S,A)}=C_{\ref*{cnt:denbound}}.

Under Assumption 7 (a) and (b), we can show that $\bar{d}^{\pi}(s,a)\leq p_{\max}$ for every $(s,a)\in\mathcal{S}\times\mathcal{A}$ . Then we have

\displaystyle\mathbb{E}_{(S,A)\sim\bar{d}^{\pi}}\left[f(S,A)\right]^{2}=\bar{\mathbb{E}}\left[\frac{\bar{d}^{\pi}(S,A)}{\bar{p}_{T}^{b}(S,A)}f^{2}(S,A)\right]\leq\frac{p_{\max}}{p_{\min}}\bar{\mathbb{E}}f^{2}(S,A).

And

\displaystyle\chi\leq\frac{p_{\max}}{p_{\min}}.

The bounds for $\Upsilon$ can be obtained by applying Theorem 7.

For $\psi(K)$ , by taking $f(S,A)=\bm{B}_{K}(S,A)^{\intercal}\bm{\alpha}$ for any $\|\bm{\alpha}\|_{2}=1$ , we have

\displaystyle\bar{\mathbb{E}}\left[\bm{L}^{\intercal}_{K}(S,A)\bm{\alpha}\right]^{2}\geq\frac{(1-\gamma)^{2}}{C_{\ref*{cnt:denbound}}}\bar{\mathbb{E}}\left[\bm{B}^{\intercal}_{K}(S,A)\bm{\alpha}\right]^{2}\geq\frac{(1-\gamma)^{2}}{C_{\ref*{cnt:denbound}}}\lambda_{\min}\bar{\mathbb{E}}\left\{\bm{B}^{\intercal}_{K}(S,A)\bm{B}_{K}(S,A)\right\}\|\bm{\alpha}\|^{2}_{2}.

The conclusion follows.

∎

References

Athey et al. [2018] Athey, S., G. W. Imbens, and S. Wager (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B 80(4), 597–623.
Belloni et al. [2015] Belloni, A., V. Chernozhukov, D. Chetverikov, and K. Kato (2015). Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186(2), 345–366.
Bertsekas [1995] Bertsekas, D. P. (1995). Dynamic programming and optimal control, Volume 1. Athena scientific Belmont, MA.
Bradley [2005] Bradley, R. C. (2005). Basic properties of strong mixing conditions. a survey and some open questions. Probability Surveys 2, 107–144.
Brockman et al. [2016] Brockman, G., V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). Openai gym. arXiv preprint arXiv:1606.01540.
Chan [1989] Chan, K. (1989). A note on the geometric ergodicity of a markov chain. Advances in applied probability 21(3), 702–704.
Chan et al. [2015] Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2015). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B.
Chen [2007] Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics 6, 5549–5632.
Chen and Christensen [2015] Chen, X. and T. M. Christensen (2015). Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. Journal of Econometrics 188(2), 447–465.
Chen and Qi [2022] Chen, X. and Z. Qi (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. arXiv preprint arXiv:2201.06169.
Chua et al. [2018] Chua, K., R. Calandra, R. McAllister, and S. Levine (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems 31.
De Boor [1976] De Boor, C. (1976). Splines as linear combinations of b-splines. a survey. Technical report, Wisconsin Univ Madison Mathematics Research Center.
Fan et al. [2016] Fan, J., K. Imai, H. Liu, Y. Ning, and X. Yang (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Technical report, Technical report, Princeton University.
Farahmand et al. [2009] Farahmand, A. M., M. Ghavamzadeh, S. Mannor, and C. Szepesvári (2009). Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, pp. 441–448. Curran Associates, Inc.
Farahmand et al. [2016] Farahmand, A.-m., M. Ghavamzadeh, C. Szepesvári, and S. Mannor (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research 17(1), 4809–4874.
Farahmand and Szepesvári [2012] Farahmand, A.-m. and C. Szepesvári (2012). Regularized least-squares regression: Learning from a $\beta$ -mixing sequence. Journal of Statistical Planning and Inference 142(2), 493–505.
Freedman [1975] Freedman, D. A. (1975). On tail probabilities for martingales. the Annals of Probability, 100–118.
Garreau et al. [2017] Garreau, D., W. Jitkrittum, and M. Kanagawa (2017). Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269.
Geer and van de Geer [2000] Geer, S. A. and S. van de Geer (2000). Empirical Processes in M-estimation, Volume 6. Cambridge university press.
Györfi et al. [2006] Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A distribution-free theory of nonparametric regression. Springer Science & Business Media.
Hansen [1982] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029–1054.
Hearst et al. [1998] Hearst, M. A., S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf (1998). Support vector machines. IEEE Intelligent Systems and their applications 13(4), 18–28.
Horowitz et al. [2004] Horowitz, J. L., E. Mammen, et al. (2004). Nonparametric estimation of an additive model with a link function. Annals of Statistics 32(6), 2412–2443.
Hu et al. [2021] Hu, X., M. Qian, B. Cheng, and Y. K. Cheung (2021). Personalized policy learning using longitudinal mobile health data. Journal of the American Statistical Association 116(533), 410–420.
Imai and Ratkovic [2014] Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B 76(1), 243–263.
Janner et al. [2019] Janner, M., J. Fu, M. Zhang, and S. Levine (2019). When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems 32.
Kallus [2020] Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research 21(62), 1–54.
Kallus and Uehara [2019] Kallus, N. and M. Uehara (2019). Double reinforcement learning for efficient off-policy evaluation in markov decision processes. arXiv preprint arXiv:1908.08526.
Kallus and Uehara [2020] Kallus, N. and M. Uehara (2020). Statistically efficient off-policy policy gradients. In International Conference on Machine Learning, pp. 5089–5100. PMLR.
Kallus and Uehara [2022] Kallus, N. and M. Uehara (2022). Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research.
Kang and Schafer [2007] Kang, J. D. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 523–539.
Komorowski et al. [2018] Komorowski, M., L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal (2018). The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine 24(11), 1716–1720.
Komunjer and Vuong [2010] Komunjer, I. and Q. Vuong (2010). Semiparametric efficiency bound in time-series models for conditional quantiles. Econometric Theory, 383–405.
Kosorok and Laber [2019] Kosorok, M. R. and E. B. Laber (2019). Precision medicine. Annual review of statistics and its application 6, 263–286.
Laber et al. [2014] Laber, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics 8(1), 1225.
Le et al. [2019] Le, H., C. Voloshin, and Y. Yue (2019). Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712.
Levine et al. [2020] Levine, S., A. Kumar, G. Tucker, and J. Fu (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
Liao et al. [2018] Liao, P., W. Dempsey, H. Sarker, S. M. Hossain, M. al’Absi, P. Klasnja, and S. Murphy (2018). Just-in-time but not too much: Determining treatment timing in mobile health. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 2(4), 179.
Liao et al. [2020] Liao, P., P. Klasnja, and S. Murphy (2020). Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, 1–10.
Liao et al. [2020] Liao, P., Z. Qi, and S. Murphy (2020). Batch policy learning in average reward markov decision processes. arXiv preprint arXiv:2007.11771.
Liu et al. [2018] Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366.
Luckett et al. [2019] Luckett, D. J., E. B. Laber, A. R. Kahkoska, D. M. Maahs, E. Mayer-Davis, and M. R. Kosorok (2019). Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association (just-accepted), 1–39.
Marcolino et al. [2018] Marcolino, M. S., J. A. Q. Oliveira, M. D’Agostino, A. L. Ribeiro, M. B. M. Alkmim, and D. Novillo-Ortiz (2018). The impact of mhealth interventions: systematic review of systematic reviews. JMIR mHealth and uHealth 6(1), e23.
Marling and Bunescu [2020] Marling, C. and R. Bunescu (2020). The ohiot1dm dataset for blood glucose level prediction: Update 2020. In CEUR workshop proceedings, Volume 2675, pp. 71.
McLeish [1974] McLeish, D. L. (1974). Dependent central limit theorems and invariance principles. the Annals of Probability 2(4), 620–628.
Meyn et al. [1995] Meyn, S., R. Tweedie, and J. Hibey (1995). Markov chains and stochastic stability. IEEE Transactions on Automatic Control 40(5), 979.
Murphy [2003] Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
Murphy et al. [2001] Murphy, S. A., M. J. van der Laan, J. M. Robins, and C. P. P. R. Group (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96(456), 1410–1423.
Nachum et al. [2019] Nachum, O., Y. Chow, B. Dai, and L. Li (2019). Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315–2325.
Nahum-Shani et al. [2016] Nahum-Shani, I., S. N. Smith, B. J. Spring, L. M. Collins, K. Witkiewitz, A. Tewari, and S. A. Murphy (2016). Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 1–17.
Newey [1997] Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of econometrics 79(1), 147–168.
Newey and Powell [2003] Newey, W. K. and J. L. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71(5), 1565–1578.
Paduraru [2007] Paduraru, C. (2007). Planning with approximate and learned models of markov decision processes. These de maıtre, University of Alberta.
Precup [2000] Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.
Puterman [1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
Robins et al. [2000] Robins, J. M., M. A. Hernan, and B. Brumback (2000). Marginal structural models and causal inference in epidemiology.
Robins et al. [1994] Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association 89(427), 846–866.
Rodbard [2009] Rodbard, D. (2009). Interpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control. Diabetes technology & therapeutics 11(S1), S–55.
Shi et al. [2018] Shi, C., A. Fan, R. Song, and W. Lu (2018). High-dimensional a-learning for optimal dynamic treatment regimes. Annals of statistics 46(3), 925.
Shi et al. [2021] Shi, C., R. Wan, V. Chernozhukov, and R. Song (2021). Deeply-debiased off-policy interval estimation. arXiv preprint arXiv:2105.04646.
Shi et al. [2020] Shi, C., R. Wan, R. Song, W. Lu, and L. Leng (2020). Does the markov decision process fit the data: Testing for the markov property in sequential decision making. arXiv preprint arXiv:2002.01751.
Shi et al. [2020] Shi, C., S. Zhang, W. Lu, and R. Song (2020). Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515.
Sutton and Barto [2018] Sutton, R. S. and A. G. Barto (2018). Reinforcement learning: An introduction. MIT press.
Tang et al. [2019] Tang, Z., Y. Feng, L. Li, D. Zhou, and Q. Liu (2019). Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186.
Tropp [2012] Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12(4), 389–434.
Tropp [2015] Tropp, J. A. (2015). An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571.
Tseng and Bertsekas [1991] Tseng, P. and D. P. Bertsekas (1991). Relaxation methods for problems with strictly convex costs and linear constraints. Mathematics of operations research 16(3), 462–481.
Uehara et al. [2020] Uehara, M., J. Huang, and N. Jiang (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR.
Uehara et al. [2021a] Uehara, M., M. Imaizumi, N. Jiang, N. Kallus, W. Sun, and T. Xie (2021a). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
Uehara et al. [2021b] Uehara, M., M. Imaizumi, N. Jiang, N. Kallus, W. Sun, and T. Xie (2021b). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
Wahba [1990] Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
Wang et al. [2018] Wang, L., Y. Zhou, R. Song, and B. Sherwood (2018). Quantile-optimal treatment regimes. Journal of the American Statistical Association 113(523), 1243–1254.
Wang and Zubizarreta [2020] Wang, Y. and J. R. Zubizarreta (2020). Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations. Biometrika 107(1), 93–105.
Wong and Chan [2018] Wong, R. K. W. and K. C. G. Chan (2018). Kernel-based covariate functional balancing for observational studies. Biometrika 105(1), 199–213.
Yu [1994] Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.
Zhang et al. [2020] Zhang, R., B. Dai, L. Li, and D. Schuurmans (2020). Gen{dice}: Generalized offline estimation of stationary values. In International Conference on Learning Representations.
Zhao et al. [2015] Zhao, Y.-Q., D. Zeng, E. B. Laber, and M. R. Kosorok (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association 110(510), 583–598.

	$\displaystyle\left\|\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\|$	$\displaystyle=\left\|\bm{\lambda}_{}\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}-\gamma g^{\pi}_{}(S_{i,t},A_{i,t};\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\right\|$
		$\displaystyle\leq\left\|\bm{\lambda}_{}\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\|+\left\|\gamma g^{\pi}_{}(S_{i,t},A_{i,t};\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\right\|$
		$\displaystyle\leq Q_{\max}+\gamma G_{\max},$

	$\displaystyle G(\bm{\lambda}_{}+\Delta)-G(\bm{\lambda}_{})$	$\displaystyle\geq-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\\|\Delta\\|_{2}$
	$\displaystyle+$	$\displaystyle\frac{1}{2nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{v^{\prime}(S_{i,t},A_{i,t})\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}^{2}-\\|\Delta\\|_{2}\\|\bm{\delta}_{K}\\|_{2}$
	$\displaystyle\geq$	$\displaystyle-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\\|\Delta\\|_{2}+\left\{\psi(K)-\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(\psi(K))\right\}\\|\Delta\\|_{2}^{2},$

		$\displaystyle\log\left(\mathcal{N}(\epsilon,\mathcal{F}^{T}_{l},\\|\cdot\\|_{n})\right)$
	$\displaystyle\lesssim$	$\displaystyle\log\left(\mathcal{N}(\epsilon,\mathcal{F}_{l},\\|\cdot\\|_{{\color[rgb]{0,0,0}n,T}})\right)$
	$\displaystyle\lesssim$	$\displaystyle\log\left(\cal\mathcal{N}(\epsilon,\mathcal{Q}_{M},\\|\cdot\\|_{\infty})\mathcal{N}(\epsilon,\mathcal{G}_{M},\\|\cdot\\|_{\infty})\right).$

		$\displaystyle{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)$
	$\displaystyle\leq$	$\displaystyle\frac{T}{q}\beta(q)+{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t\in\mathcal{I}_{r}}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)$
	$\displaystyle+$	$\displaystyle c_{\ref{bernc1}}K\exp\left(-\frac{\tau^{2}/2}{TqK+q(C_{\ref{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K\tau/3}\right),$

		$\displaystyle\mbox{Pr}\left(\lambda_{\max}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right\}\geq x\right)=\mathbb{E}\mathbbm{1}_{\lambda_{\max}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right\}\geq x}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{(S_{i,t},A_{i,t})\sim G}\frac{\bar{p}_{T}^{b}(S,A)}{p^{}(S,A)}\mathbbm{1}_{\lambda_{\max}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right\}\geq x}$
	$\displaystyle\leq$	$\displaystyle\sqrt{\mathbb{E}^{}\left(\frac{\bar{p}_{T}^{b}(S,A)}{p^{}(S,A)}\right)^{2}}\sqrt{{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq x\right)}$
	$\displaystyle\leq$	$\displaystyle\sqrt{C_{7}}\sqrt{{\mbox{Pr}}^{}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{}\bm{Z}_{i,t}\right)\right\}\geq x\right)},$

Projected State-action Balancing Weights for Offline Reinforcement Learning

Abstract

1 Introduction

2 Offline Policy Evaluation in Infinite-horizon Markov Decision Processes

2.1 Preliminary and Notations

Assumption 1 (Markovian assumption with stationary transitions).

Assumption 2.

Assumption 3.

2.2 Existing Off-policy Evaluation Methods

3 Projected State-action Balancing Estimator

3.1 State-action Balancing Weights

3.2 Expanded Dimension

Theorem 1.

Lemma 1.

Proof sketch.

3.3 Projected State-action Balancing Weights

Theorem 2.

3.4 Other weighting methods

4 Theoretical Results

4.1 Non-parametric Regressions with Exponentially β\beta-mixing Sequences

Assumption 4.

Theorem 3.

Corollary 1.

4.2 Convergence Rates of Balancing Weights

Assumption 5.

Theorem 4.

4.3 Estimation Error and Statistical Efficiency

Assumption 6.

Theorem 5.

Theorem 6.

4.4 Lower boundedness for the minimal eigenvalue

Theorem 7.

Assumption 7.

Corollary 2.

5 Simulation Study

6 Real Data Application

7 Acknowledgements

S1 Additional Implementation Details and Numerical Results

S1.1 Algorithm

S1.2 Implementation of the Proposed Estimator

S1.3 Additional Simulation Results

S1.4 Weighted Estimator without Projection and Augmented Projected Balancing Estimator

S1.5 Simulation results for the Cartpole environment

S2 Proof

S2.1 Proof in Section 2

S2.2 Technical Proof in Section 3

Proof of Lemma 1.

Proof of Theorem 2.

S2.3 Technical Proof in Section 4.1

Proof of Theorem 3.

Assumption 4’.

Proof of Corollary 1.

S2.4 Technical Proof4 in Section 4.2

Proof.

S2.5 Technical Proof in Section 4.3

Proof of Theorem 6.

S2.6 Technical Lemmas

Lemma S2.

Proof of Lemma S2.

Lemma S3.

Proof.

Lemma S4.

Proof.

S3 Additional Proof and Lemmas

Lemma S5.

Proof of Lemma S5.

Proof of Theorem 7.

Proof of Corollary 2.

References

4.1 Non-parametric Regressions with Exponentially $\beta$ -mixing Sequences